ABSTRACT

Title of Dissertation: TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

Xijun Wang, Doctor of Philosophy, 2025

Dissertation Directed by: Professor Dinesh Manocha and Professor Ming Lin, Department of Computer Science

“If a picture is worth a thousand words, what is a video worth?” Video information, owing to its inherent richness and efficiency compared to language, plays a pivotal role in conveying complex information. However, video understanding faces numerous challenges, including selecting informative frames, addressing domain shifts, semantic grounding, deficits in reasoning and attention, and significant computational burdens. Recent advancements in computer vision underscore the need to address these challenges through effective and efficient approaches, which are crucial for applications ranging from autonomous systems to human-computer interaction that demand high accuracy and low latency. In this dissertation, we address five critical areas to overcome these challenges: dataset development, preprocessing, visual reasoning, multimodal alignment, and computational acceleration.

High-quality datasets serve as foundational building blocks, providing diverse, comprehensive, and representative data to train models capable of handling real-world complexity. In this dissertation, we proposed the METEOR dataset, tailored for autonomous driving applications in dense, heterogeneous, and unstructured traffic scenarios with rare and challenging conditions. Additionally, we developed DAVE, a comprehensive benchmark dataset specifically designed to advance video understanding research for the safety of vulnerable road users in complex and unpredictable environments. Our analysis revealed substantial shortcomings of current object detection and behavior prediction models when evaluated on METEOR and DAVE.
Complementing these datasets, for preprocessing we proposed AZTR, which incorporates an automatic zooming algorithm for dynamic target scaling and a temporal reasoning mechanism to accurately capture action sequences. Furthermore, we introduced MITFAS, an alignment and sampling method based on mutual information, specifically designed to address challenges inherent to UAV video action recognition, including varying human resolutions, large positional changes between frames, and occluded action features.

For visual reasoning, we introduced SCP, which guides the model to explicitly learn input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge, effectively capturing discriminative patterns and significantly improving accuracy on challenging datasets. We also developed ICAR, a compatibility learning framework with a novel category-aware Flexible Bidirectional Transformer (FBT), which can effectively generate features across different domains based on visual similarity and complementarity for reasoning tasks.

For multimodal alignment, we proposed ViLA to address both efficient frame sampling and effective cross-modal alignment in a unified way. Finally, we proposed Bi-VLM, which explores ultra-low-precision post-training quantization to bridge the gap between computational demands and practical limitations. Our method employs a saliency-aware hybrid quantization algorithm combined with a non-uniform model weight partition strategy, substantially reducing computational costs without significantly compromising overall model performance.

TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

by

Xijun Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Dr. Dinesh Manocha, Chair/Advisor
Dr. Ming Lin, Co-Chair/Co-Advisor
Dr. Maria K. Cameron, Dean’s Representative
Dr. Abhinav Shrivastava
Dr. Shan Yang

© Copyright by Xijun Wang 2025

Meet the world, meet the multitudes, and in so doing, meet yourself.

Acknowledgments

On this quiet night, I find myself reflecting on my academic journey. Counting from the preschool transition class I attended at the age of five, I have spent 24 years in school. Throughout these years, various reasons kept me in school: curiosity about artificial intelligence, concerns about job prospects, and reluctance toward becoming a heavy coding engineer, among others. Now, standing at the threshold of the highest academic degree, I find that there are no more excuses and no higher degrees left to pursue. I have finally arrived at a crossroads—one that leads toward society and real life. From a student’s perspective, even if mistakes are made, parents and teachers have always offered forgiveness, making student status feel like an ever-present shield of leniency, granting infinite chances to improve. Yet I sense that I am about to lose this “privilege.”

There is an old Chinese saying, “One should establish oneself by the age of thirty,” meaning that by the age of thirty, one is expected to stand on one’s own—financially, intellectually, and spiritually. As I approach thirty, it seems an ideal moment to bid farewell to school and the protective shield of being a student. I am now ready to step into society—this broader “school”—to forge my own shield and establish my own identity.

Before arriving at this crossroads today, every individual I’ve encountered has left an indelible mark on me, forming the foundation from which I now set sail. Firstly, I want to thank my mother, a strong-minded and resilient woman who taught me that “A gentleman acts with principle—he knows when to act and when to refrain,” which embodies integrity and boundaries.
My heartfelt gratitude goes to my father, who has consistently supported my education, always encouraging me to make my own choices without interference, yet steadfastly supporting each decision. He taught me that life isn’t just about work; it also involves enjoyment. I also thank my sister, my childhood companion who showered me with care and provided a peer with whom I could share secrets that were hidden from our parents. Big thanks to my grandmother, my role model for diligent learning, who demonstrated remarkable strength and, of course, always indulged me generously. Thanks to my brother-in-law and my niece and nephew—you collectively made our family extraordinarily joyful. Special thanks to my girlfriend, who flew countless times from Switzerland to the United States to reunite with me, encouraged me through low points, reminded me during prideful moments, and always remembered every important detail of my life. Walking beside you fills me with immense happiness.

Academically, I owe profound gratitude to Dr. Dinesh Manocha and Dr. Ming Lin for their patient and meticulous guidance. Their visionary insights, passion for research, and unwavering dedication constantly inspired and moved me deeply. They generously provided valuable advice and carefully analyzed pros and cons whenever I faced difficult choices. Though sometimes lost, I always found myself back on track thanks to their detailed guidance and support. I am also deeply grateful to my mentor, Dr. Shan Yang, whose sharp vision and commitment to excellence profoundly influenced my research approach. My heartfelt thanks extend to all my collaborators, whose efforts made each project successful and fulfilling.

I would like to express my sincere appreciation to the members of my dissertation committee, Dr. Maria K. Cameron and Dr. Abhinav Shrivastava, whose valuable suggestions and constructive feedback greatly enriched my work. My appreciation also goes to the members of the GAMMA lab.
Your creativity and innovation made working together joyful. We are not only colleagues but also friends who play hard together. I am grateful to my roommates, who enriched my life beyond academics and provided me with small families abroad. Finally, I want to motivate myself and everyone else with a quote that deeply resonates with me: “The magic you’re looking for is in the work you’re avoiding!”

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Video Understanding
  1.2 Dissertation Overview

I Video Dataset

Chapter 2: METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors
  2.1 Introduction
    2.1.1 Main Contributions
    2.1.2 Applications and Benefits
  2.2 Comparison with Existing Datasets
    2.2.1 Tracking and Trajectory Prediction Datasets
    2.2.2 Semantic Segmentation Datasets
    2.2.3 Behavior Prediction
  2.3 METEOR dataset
    2.3.1 Dataset Collection and Organization
    2.3.2 Annotations
    2.3.3 Rare and Interesting Behaviors
    2.3.4 Dataset statistics
  2.4 Using METEOR to Extract New Insights in Unstructured Traffic
    2.4.1 2D Object Detection
    2.4.2 Multi-Agent Behavior Recognition
  2.5 Conclusion, Limitations and Future Work

Chapter 3: DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments
  3.1 Introduction
  3.2 DAVE Dataset
    3.2.1 Data Collection
  3.3 DAVE-DETR for Vulnerable Road Users Detection
    3.3.1 Hierarchical Query Generator
    3.3.2 Reduce Redundancy Module
  3.4 Datasets for Different Tasks and Experiments
    3.4.1 Detection
    3.4.2 Tracking
    3.4.3 Video Moment Retrieval
    3.4.4 Spatiotemporal Action Localization
    3.4.5 Multi-label Video Action Recognition
  3.5 Conclusion, Limitation, Future Work

II Preprocessing

Chapter 4: AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning
  4.1 Introduction
    4.1.1 Main Contributions
  4.2 Related works
    4.2.1 Learning-based Methods for Aerial Video Recognition
    4.2.2 UAV and Drone Datasets
    4.2.3 Activity Recognition on Edge Architectures
  4.3 Our Approach: AZTR
    4.3.1 Overall Learning Method
    4.3.2 Auto Zoom
    4.3.3 Temporal Reasoning
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Implementation Details and Training
    4.4.3 Results on RoCoG-v2
    4.4.4 Results on UAV Human
    4.4.5 Results on Drone Action
  4.5 Conclusion, Limitations and Future

Chapter 5: MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition
  5.1 Introduction
  5.2 Related Work
    5.2.1 Temporal Feature Alignment
    5.2.2 Similarity Measurement
    5.2.3 Video Recognition for Aerial Videos
  5.3 Video Recognition using Mutual Information
    5.3.1 Temporal Feature Alignment
    5.3.2 Mutual Information Sampling
    5.3.3 MITFAS: Aerial Video Recognition
  5.4 Results
    5.4.1 Results on UAV Human
    5.4.2 Results on NEC Drone
    5.4.3 Results on Drone Action
    5.4.4 Ablation Experiments
  5.5 Conclusion, Limitations and Future Work

III Visual Reasoning

Chapter 6: ICAR: Image-based Complementary Auto Reasoning
  6.1 Introduction
  6.2 Related Work
  6.3 Method
    6.3.1 Conditional Compatibility Auto Reasoning
    6.3.2 Compatibility Learning Framework
  6.4 Experiments
    6.4.1 Setup
    6.4.2 Evaluation Metrics
    6.4.3 Compatibility Learning Results
    6.4.4 Similarity Learning Results
  6.5 Conclusion, Limitation, and Future Work

Chapter 7: SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition
  7.1 Introduction
  7.2 Related Works
    7.2.1 Action Recognition
    7.2.2 Prompt Learning
  7.3 Our Approach
    7.3.1 Prompt Learning-based Input Encoder
    7.3.2 Auto-regressive Temporal Reasoning
    7.3.3 Single-agent and Multi-agent Objective
  7.4 Datasets and Results
    7.4.1 Datasets and Experiment Settings
    7.4.2 Results on Okutama
    7.4.3 Results on NECDrone
    7.4.4 Results on Something-something V2
    7.4.5 Ablation Study
  7.5 Conclusion

IV Multimodal Alignment

Chapter 8: ViLA: Efficient Video-Language Alignment for Video Question Answering
  8.1 Introduction
  8.2 Related Work
    8.2.1 Visual-Language Alignment
    8.2.2 Knowledge Distillation
    8.2.3 Frame Selection for Video QA
  8.3 Method
    8.3.1 Model Architecture
    8.3.2 Text-guided Frame-Prompter Learning
    8.3.3 Cross-Modal Distillation
  8.4 Experiments
    8.4.1 Implementation Settings
    8.4.2 Results
    8.4.3 Ablation Study
    8.4.4 More Discussion
  8.5 Conclusion, Limitation and Future Work

V Quantization

Chapter 9: Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
  9.1 Introduction
  9.2 Related Works
    9.2.1 Post-Quantization on VLM
    9.2.2 Network Binarization
  9.3 Our Approach
    9.3.1 Binarization Formulation
    9.3.2 Bi-VLM
    9.3.3 Pruning on Quantized Models
  9.4 Results and Comparisons
    9.4.1 Baseline Model and Datasets
    9.4.2 Quantization on Different Components
    9.4.3 Comparison with SOTA Methods
    9.4.4 Pruning on Quantized Model
  9.5 Conclusions, Limitations, and Future Work

Chapter 10: Conclusion, Limitations and Future Work
  10.1 Conclusion
  10.2 Limitations
  10.3 Future Work

  A.1 Appendix for Chapter 3: DAVE
    A.1.1 Annotations
    A.1.2 More Related Datasets
    A.1.3 Full Dataset Detection
  A.2 Appendix for Chapter 5: MITFAS
    A.2.1 Implementation and Training Details
    A.2.2 Mutual Information
    A.2.3 Ablation Experiments
  A.3 Appendix for Chapter 6: ICAR
    A.3.1 Implementation Details
    A.3.2 More Details of Our Approach
    A.3.3 More Qualitative Evaluation Results
  A.4 Appendix for Chapter 7: SCP
    A.4.1 Experts Number for Learnable Prompt
    A.4.2 Different Inputs for Large Vision Model
    A.4.3 Effect of Each Component of Our Method
    A.4.4 Visualization
    A.4.5 Experiment Settings
  A.5 Appendix for Chapter 9: Bi-VLM
    A.5.1 Memory-Efficient Quantization Pipeline
    A.5.2 Statistical Analysis of Weights: Histograms and Gaussian Fit
    A.5.3 Saliency Threshold Determination
    A.5.4 More results on Pruning in ScienceQA-IMG

Bibliography

List of Tables

2.1 Characteristics of Traffic Datasets. We compare METEOR with state-of-the-art autonomous driving datasets that have been used for trajectory tracking, motion forecasting, semantic segmentation, prediction, and behavior classification. METEOR is the largest (in terms of number of annotated frames) and most diverse in terms of heterogeneity, scenarios, varying behaviors, densities, and rare instances. Darker shades represent a richer collection in that category. Best viewed in color.

2.2 Training Details for Object Detection (BS: batch size, Mom: momentum, WD: weight decay, MGN: max gradient norm).

2.3 Effect of meta features on object detection. We analyze how meta features such as traffic density, type of agents, location, time of day, and weather play a role in 2D object detection using the DETR, Deformable DETR, YOLOv3, and CenterNet object detectors. Bold indicates the type of meta feature that is the most effective for object detection.

2.4 Object detection on Waymo and KITTI. We report the standard mAP for many widely used methods on autonomous driving datasets.

2.5 Swin-T on Waymo and METEOR. We present a more detailed analysis of Swin-T, one of the state-of-the-art object detection approaches, on Waymo and METEOR.

2.6 ACAR-Net on METEOR. PT: pre-train, BS: batch size, Opt.: optimization, LR: learning rate, WD: weight decay, FR (RX-101): Faster R-CNN (ResNeXt-101), Kin.-700: Kinetics-700, CR (Swin-T): Cascade R-CNN (Swin-T).

3.1 Existing traffic datasets with Vulnerable Road Users and bounding boxes.
Total # includes all labeled instances.

3.2 Comparison of datasets with respect to pedestrian, vehicle, and other Vulnerable Road Users (O-VRUs) action and tube annotations.

3.3 DAVE Characteristics. We annotate 16 types of actions performed by 16 types of actors. We highlight the maximum and average number of actions and actors per frame. LaneChanging(m) denotes lane changing on roads with clear lane markings.

3.4 Comparison with SOTAs. Our DAVE-DETR consistently surpasses four strong SOTA methods across every reported metric.

3.5 Comparison of VRU Datasets. Under the same settings, the DAVE training set outperforms the Waymo training set by 20.8% in terms of mAP50. When combining the Waymo and DAVE training sets, the model achieves a 24.0% improvement over Waymo alone and a 3.2% improvement over DAVE alone. These results show that our DAVE dataset is more effective for VRU detection.

3.6 Comparison of Various Tracking Datasets. DAVE is comparable to GOT-10k in AO but more challenging for both success-rate metrics. For SR0.75, ARTrack performs 23.7% worse on DAVE than on GOT-10k, despite our preprocessing to keep the same object present in each frame sequence.

3.7 Statistics of datasets for the Video Moment Retrieval task. The CG-DETR method achieves only 5.1 R1@0.5 on DAVE (versus 58.4 on Charades-STA); this significant performance degradation illustrates that Video Moment Retrieval remains a challenging problem in unstructured environments.

3.8 Spatiotemporal Action Localization. ACAR-Net achieves 6.3% mAP on DAVE, which shows that DAVE is a very challenging dataset with tremendous room for improvement.

3.9 Multi-label Video Action Recognition.
SlowFast achieves 4.2% higher performance on Charades than on DAVE, which indicates that DAVE is harder.

4.1 3D operators are not well supported on most edge devices or processors, as highlighted here. Therefore, we use 2D+1 convolutions and an efficient attention mechanism on the RB5 platform. TL: TensorFlow Lite, C3D: Conv3D, MP3D: MaxPooling3D, AP3D: AveragePooling3D, DC3D: Depthwise Conv3D.

4.2 Inference Time on RB5 CPU. Our method takes 56.5 ms to process one frame (on average), which is 2× faster than MoViNet A3 on the RB5, and also improves top-1 accuracy; see Table 4.3.

4.3 Results on RoCoG-v2. We demonstrate that our approach improves top-1 accuracy by 6.1%–7.4%, outperforming all SOTA methods that can be deployed on the RB5 platform.

4.4 Benchmarking UAV Human and comparisons with prior art. Compared with state-of-the-art methods, our approach demonstrates an improvement of 8.3%–10.4%. Trained on high-end desktop GPUs.

4.5 Results on Drone Action. We demonstrate that AZTR improves the state-of-the-art accuracy by 3.2%, reaching 95.9% on Drone Action. Trained on high-end desktop GPUs.

5.1 Notation and symbols used in Chapter 5.

5.2 Benchmarking UAV Human and comparisons with prior art. For 224 × 224 resolution and 16-frame input, when training from scratch, our approach achieves a 13.2% improvement over the baseline X3D-M and 12.6% over the current state-of-the-art FAR. For 520 × 520 resolution and 8-frame input, MITFAS outperforms the current state-of-the-art FAR by 9.6% when training from scratch.
For 224 × 224 resolution and 16-frame input, when initializing with Kinetics-pretrained weights, MITFAS improves top-1 accuracy by 20.2% over the baseline and by 18.9% over the SOTA method. For resolutions over 620 × 620 and 8-frame input, when initializing with Kinetics-pretrained weights, MITFAS outperforms the current state-of-the-art FAR by 7.5%. Our method obtains better performance in all settings, which illustrates the effectiveness of our proposed MITFAS.

5.3 Results on Drone Action. Our method achieves 100% top-1 accuracy, 16.6% over the baseline method X3D-M [1], outperforming the current state-of-the-art method FAR [2] by 7.3% under the same configuration. (HLPF [3], PCNN [4])

5.4 Results on NEC Drone. Our method shows an improvement of 12.5% in top-1 accuracy over the baseline X3D-M [1] and 7.2% over the current state-of-the-art FAR [2].

5.5 Temporal Feature Alignment (TFA) and Mutual Information Sampling (MIS) ablation studies on UAV-Human-Subset. The baseline is vanilla X3D with random [5] and uniform sampling [6], and we add our methods TFA and MIS step by step. In our experiments, TFA boosts accuracy by 16%–17.5%, and MIS outperforms random sampling, uniform sampling, and MG Sampler [7].

5.6 Comparison with other methods [8, 9].

5.7 Mutual Information Sampling (MIS) ablation studies on UAV-Human-Subset and Drone Action. The baseline is vanilla X3D with TFA; we test MITFAS sampling in terms of the two hyperparameters for mutual information and joint mutual information, α and β respectively. In our experiments, MITFAS obtains the best accuracy when α = 1.0 and β = 1.0.

5.8 Comparison with other similarity measures on UAV-Human-Subset. Compared to other similarity measures, mutual information achieves the best accuracy.
6.1 FITB Results on DeepRooms [10]. Our approach improves FITB accuracy by 9.5% over Visual Similarity Learning, CSA-Net [11], and OutfitTransformer [12].

6.2 FITB Results on STL and Street2Shop (S2S). Our approach improves FITB accuracy by 4.2%–9.6% over IBR [13], Siamese Nets [14], BPR-DAE [15], Complete the Look [16], and OutfitTransformer [12].

6.3 SFID Results on DeepRooms, STL (F: fashion, H: home), and Street2Shop (S2S). Our approach improves SFID accuracy by 11.2 (23.3%, DeepRooms) and 2.9 (31.8%, STL-Home) on furniture images and by 3.4 (22.3%, STL-F) and 1.6 (18.4%, S2S) on fashion images, respectively, over OutfitTransformer [12].

6.4 Masking Method Comparison. Our random-length masking outperforms fixed-length masking.

6.5 Similarity Learning. Visual similarity learning is the most suitable for scene-based CIR. VQGAN [17], Swin [18], BEiT [19].

7.1 Comparison with state-of-the-art results on the Okutama dataset. With bbox information, we achieved a 10.20% improvement over the SOTA method; without bbox information, we outperformed the SOTA by 3.17%. crops: from detection.

7.2 Comparison with existing methods on NEC Drone. Our SCP improves 4.0%–7.4% over X3D and 23.1% over K-centered.

7.3 Comparison with state-of-the-art results on Something-Something V2. Our SCP improves 3.6% over MViTv1 and 1.0% over the strong SOTA MViTv2.

7.4 Ablation study of the effect of different components of our method on the Okutama dataset. We evaluated ROI, the Large Vision Model (SAM), and SCP. The experiments show the effectiveness of our proposed methods.
7.5 Ablation study of different prompts on the Okutama dataset. We evaluated various prompts, including optical flow, a large vision model (SAM [20]), and SCP. In our experiments, the large vision model and SCP achieved better accuracy.

8.1 Comparison results on the NExT-QA dataset. Here we measure the accuracy of choosing the right answer. On Temporal and Causal question types in particular, our ViLA (using only 4 frames) improves by 3.3% and 1.7% respectively, compared with SeViLA. We use boldface to indicate the best results and underline the second best using the same number of frames (brown box for 4 frames and blue box for 8 frames). ViLA using only 2 frames outperforms BLIP-2 using 4 frames by 1.3%. ViLA also achieves up to 3.04× speedup. Notably, our ViLA achieves 75.1% average accuracy with only 4 frames when we finetune the LLM with LoRA [21].

8.2 Comparison results on the STAR Video QA benchmark. For Interaction-type questions, our ViLA improves by 4.6%. On average, our ViLA outperforms the SOTA method by 2.2% when using 4 frames, with a 3.04× speedup. Note that our ViLA using 2 frames outperforms BLIP-2 using 4 frames.

8.3 Comparison results on the How2QA, VLEP, and TVQA Video QA benchmarks. In the 4-frame setting, ViLA improves performance over SeViLA by 1.8% with a 3.04× speedup on TVQA, 0.7% with a 1.45× speedup on VLEP, and 0.3% with a 3.04× speedup on How2QA. Our 2-frame setting outperforms SeViLA's 4-frame setting on VLEP by 0.3% with a 4.2× speedup.

8.4 Frame-Prompter and QFormer-Distiller Ablation Results. Across all four VideoQA datasets, we observe that both the text-aware Frame-Prompter and the cross-modal QFormer-Distiller contribute significantly to our final performance.
We highlight that on STAR, adding our QFormer-Distiller improves the accuracy by 2.9%, and our Frame-Prompter further boosts the accuracy by 1.6%. . . . . . . . . . . . . . . . . . . . 146 8.5 QFormer-Distiller Decoder Ablation on NExT-QA. We find that a simple Fully Connected layer (FC) with Layer Normalization (LN) works best across Temporal, Causal, and Description questions. It is efficient and effective. GELU is the activation function. . . 147 9.1 Quantization of different components of Llama 3.2-Vision Instruct 11B with 1- to 1.1-bit weights. The vision model exhibits high sensitivity to quantization; the adaptor/projector exhibits little sensitivity, barely affecting performance; the language model exhibits considerable sensitivity to quantization. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 9.2 Quantization of different components of Llava-One-Vision 7B with 1- to 1.1-bit weights. Same conclusion as in Table 9.1. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . . . . . . 166 9.3 Quantization of different components of Qwen2.5-VL-7B-Instruct with 1- to 1.1-bit weights. Same conclusion as in Table 9.1. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . 166 xiv 9.4 SOTA comparison on Llama 3.2-Vision Instruct 11B with 1- to 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 4%-47%. For the overall VLM, our Bi-VLM outperforms the SOTA by 8%-45%. FP: Full precision. L: Language model. all: the whole VLM model. . . . . . . . . . . . . 167 9.5 SOTA comparison on Llava-One-Vision 7B with 1- to 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 3%-20%. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-19%. FP: Full precision. L: Language model. all: the whole VLM model. . 
. . . . . . . . . . . . . . 167 9.6 SOTA comparison on Qwen2.5-VL-7B-Instruct with 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 5%-10%. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-12%. FP: Full precision. L: Language model. all: the whole VLM model. . . . . . . . . . . . . . . . . . . . 167 1 Comparison of Various Detection Datasets. With the same settings, Swin-T performs 18% better on the COCO dataset than on ours. The results show that our DAVE dataset is more challenging than the existing datasets. . . . . 188 2 Ablation studies on the UAV-Human subset in terms of the number of bins used to calculate mutual information, the reference image size (multiples of the standard size), the stride of the sliding windows, and the search area size. The best performance is achieved with 128 histogram bins, a reference image size of 1.25×, and a sliding stride of 10. The size of the search area does not affect the overall performance of our method: the top-1 accuracy varies by only 0.6% across different search area sizes. This demonstrates the robustness of our MITFAS, as a larger search area contains more noise and outliers. . . . 189 3 Composing Method Comparison. Here bbox means using the bbox from the original scene image to place items in the composed image, and non-bbox means using a fixed size. GT is the ground-truth set and Neg is the randomly chosen set. From the results, we find that using a white background, a fixed size for all items, and random placement in the composed set of images produces the lowest score. . . 199 4 Ablation study in terms of the number of experts on the Okutama dataset. We evaluated 4, 8, 16, and 32 experts. In our experiments, 8 experts achieved the best accuracy. . . . . . . . . . . . . 206 5 Ablation study in terms of different inputs to the large vision model on the Okutama dataset.
We evaluated various inputs, including a single point, two points, four points, and a bbox. In our experiments, the large vision model with a bbox achieved the best accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6 Llama: Only prune the vision model; keep the other components unpruned [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 86.22%. . . 217 xv 7 Llama: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The number in parentheses denotes the cross-attention layers. The baseline accuracy for the full precision model is 86.22%. Results indicate that pruning up to 86.32% of the image tokens across language layers maintains performance above 84%, suggesting significant redundancy in image tokens. However, extreme pruning levels (e.g., 95.02%, 99%) lead to substantial accuracy drops, highlighting the importance of retaining a minimal number of tokens to ensure effective model performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 8 Llama: Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 86.22%. Results show that pruning both image tokens in the vision encoder and text tokens in the language model significantly degrades performance compared to Table 7, where only image tokens in the language model were pruned. Specifically, accuracy drops sharply as pruning increases. This indicates that pruning text tokens has a much more severe impact on model performance than pruning image tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
218 9 Llama (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 21.42%. This table shows that the quantized model has 90%-99% image-token redundancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 10 Llama (BiLLM): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 21.42%. . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 11 Llama (Bi-VLM, Ours): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 58.35%. This table shows that the quantized model has 86%-95% image-token redundancy. . . . . . . . . . . . . . . . . . . . . . . . . 219 12 Llama (Bi-VLM, Ours): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 58.35%. . . . . . . . . . . . . . . . . . . 220 13 Llava: Only prune the vision model; keep the other components unpruned [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 14 Llava: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . . . . . . . . . 
220 15 Llava: Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . 221 xvi 16 Llava (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 63.81%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 17 Llava (BiLLM): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 63.81%. . . . . . . . . . . . . . . . . . . . . . . . . . . 221 18 Qwen: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 77.29%. . . . . . . . . . . . . . 222 19 Qwen (Bi-VLM, Ours): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 68.32%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 20 Qwen (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 59.49%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 xvii List of Figures 2.1 METEOR. 
We summarize various characteristics of our dataset in terms of scene: traffic density, road type, lighting conditions, agents (we indicate the total count of each agent across 1250 videos), and behaviors, along with their size distribu- tion (in GB). The total size of the current version of the dataset is around 100GB, and it will continue to expand. Our dataset can be used to evaluate the perfor- mance of current and new methods for perception, prediction, behavior analysis, and navigation based on some or all of these characteristics. Details of the or- ganization of our dataset are given at https://gamma.umd.edu/meteor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Annotations for rare instances. One of the unique aspects of METEOR is the availability of explicit labels for rare and interesting instances including atypical interactions, traffic violations, and diverse scenarios. These annotations can be used to benchmark new methods for object detection and multi-agent behavior prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 We highlight the high traffic density, heterogeneity, and the richness of behavior information in METEOR. Abbreviations correspond to various behavior cate- gories and are explained in Section 2.3.3 . . . . . . . . . . . . . . . . . . . . . . 26 3.1 Tasks Overview. We use DAVE for various video recognition tasks, including Tracking, Detection, Video Moment Retrieval, Spatiotemporal Action Localiza- tion, and Multi-label Video Action Recognition. Our large-scale dataset is made up of complex environments that are densely annotated. Each bounding box (bbox) corresponds to an actor, and the text above each bbox serves as either the tracking ID or indicates the associated action. . . . . . . . . . . . . . . . . . 33 3.2 Challenging Characteristics of DAVE. 
These videos correspond to different times of the day with different brightness levels, different geographical landforms spanning city and rural areas, high-density and unpredictable road conditions, and diverse actors including humans, animals, vehicles, etc. . . . . . . . . . . . . . . . . . . . . . 35 3.3 DAVE-DETR consists of a hierarchical query generator to generate a dense query set and a redundancy-reduction module with class-agnostic Non-Maximum Suppression (NMS) to refine these proposals. . . . . . . . . . . . . . . . . . . . . . . 41 xviii 4.1 Our learning pipeline consists of the auto zoom learning algorithm and the temporal reasoning algorithm. For auto zoom learning, we offer different bounding box (bbox) and feature operations. Refer to Section 4.3 for details. For the temporal reasoning algorithm, we perform (2D+1) conv on edge devices, 3D conv on desktop GPUs, and the self-attention (Atten) mechanism on both edge devices and desktop GPUs. Attention layers on desktop GPUs are deeper and wider. . . . . 51 4.2 We designed two different auto zoom methods, with crops or features, for high-end desktops and mobile or edge devices, respectively. (a) For auto zoom with crops, we use a detector to get the target bounding box and crop it from the original frame, then scale the crop size. For auto zoom with features, we use the features to generate the bounding boxes and classification. (b) We use the detector to generate bboxes on key frames to reduce the computational cost. We predict the bbox at the next key frame, and compare the locations of the predicted and generated bboxes to avoid incorrect detection results. Finally, we apply linear interpolation to generate the bboxes between key frames. Details are shown in Section 4.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
54 4.3 We use different combinations of (2D+1) convolution, 3D convolution, and efficient transformers for temporal reasoning on desktop GPUs and edge devices. The efficient-transformer-based algorithm has two components: the cross-attention maps the input sequences to a new sequence of a specific size according to the computational cost requirement, and the self-attention is the standard transformer component. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Action recognition on RoCoG-v2 aerial video. More details are given in the video. 64 5.1 F_t and F_{t+1} are two frames at times t and t+1, respectively, from the same UAV video. The human actor in the two frames occupies less than 10% of the pixels due to the high camera altitude (top images). (a) MITFAS focuses on the regions corresponding to salient motions and uses mutual information to find the more informative frame. (b) Because of the UAV's motion, the position of the human actor in F_{t+1} appears shifted backward relative to F_t. Our algorithm (MITFAS) computes and aligns these regions so that the recognition model can infer more from the human motions. As shown in the right image, the main body of the human actor in the two frames overlaps after feature alignment. . . 69 5.2 Given a starting frame F_t in a UAV video, we use a localization network to localize the human action and crop the region containing the human motion as the reference image F_r. At time t+1, we use our feature alignment algorithm to estimate the optimal operation parameter ω*_{t+1} and find a region L_{ω*_{t+1}}(F_{t+1}) ⊂ F_{t+1} such that the mutual information between L_{ω*_{t+1}}(F_{t+1}) and the reference image F_r is maximized. Next, we use L_{ω*_{t+1}}(F_{t+1}) as the new reference image to find the optimal parameter ω*_{t+2} at time t+2, and repeat for subsequent frames. Then, we use the criterion illustrated in Section 5.3.2, Eq. 5.11, to find a sequence of the most distinctive and informative frames. 
We use a temporal inference backbone network (e.g., X3D [1]) to generate the predicted action label from the spatial-temporal features associated with the sampled frame sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 xix 5.3 We sample the (i+1)-th frame F_{i+1} from the candidate pool by choosing the frame that is least similar not only to the previous frame but also to all previously sampled frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.1 Cover Image. We present a new self-supervised model for scene-aware, visually compatible object retrieval tasks. In this example, given an inspirational home scene image (sampled from STL-home [16]; columns 1, 3, 5) with a pool of objects (3D-FRONT [22]) from an unseen domain, our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 4, 6). . . . . . . . . . . 91 6.2 Scene-aware Complementary Item Retrieval Task Illustration. Given a query scene image, (optional) scene objects, and item categories, the task goal is to generate a cross-domain set of stylistically compatible items. . . . . . . . . . . . 92 6.3 ICAR Model Overview. In similarity learning, we apply a CNN-based model [23] to learn the visual similarity features across two domains. The learned features are required both for complementary reasoning in the complementarity learning and for the cross-domain retrieval. With the learned features, in the complementarity learning, we propose a Flexible Bidirectional Transformer (FBT) model to learn the multi-object visual compatibility. . . . . . . . . . . . . . . . . . . . . . 96 6.4 VSIM: Visual Similarity Model. . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.5 FBT: Flexible Bidirectional Transformer. We randomly sample M ∈ [0, N] items from the N items in a scene as the input set, and take the (M+1)-th item, which is not in the input set, as the output target. 
We put the scene embedding at the beginning of the input set, and take the scene embedding as the start token E_I. We set a zero vector as the end token E_e. . . . . . . . . . . . . . . . . . . 98 6.6 Scene-aware Cross-Domain CIR Qualitative Results. We show qualitatively that our model is capable of retrieving stylistically compatible items from both seen (Rows 2 and 4) and unseen domains (Rows 1 and 3), given a home (Rows 1-3) or fashion (Row 4) scene image. Columns 1, 5, 9 are the input scene images. See supplementary materials for more examples. . . . . . . . . . . . . . . . . . . . . 101 6.7 Learned Scene Image Embedding Clustering Results. To validate the style implicitly learned by our network, the first column shows the t-SNE of 2k randomly sampled STL-home and STL-fashion test-split scene images (Columns 2-5). 104 6.8 Human Ratings on Different Datasets. Our SFID score correlates better than the SOTA with human judgment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.1 Overall Architecture. Our action recognition method is designed to run on edge devices (on mobile robots) and cloud servers. It includes lightweight prompts (embedded), which can be easily embedded in any action recognition model without much extra computational cost. For large vision models, we perform these computations on a cloud server and use low-latency communication with the robots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 xx 7.2 Task Overview. We use prompt learning for action recognition. Our method leverages the strengths of prompt learning to guide the learning process by helping models better focus on the descriptions or instructions associated with actions in the input videos. We explore various prompts, including optical flow, large vision models, and the proposed SCP, to improve recognition performance. The recognition models can be CNNs or Transformers. . . . . . . . . . . . . . . . . . 
111 7.3 Overview of the action recognition framework. We use transformer-based action recognition methods as an example. We designed a prompt-learning-based encoder to help better extract features, and we use our auto-regressive temporal reasoning algorithm to give recognition models enhanced inference ability. . . . 113 7.4 Soft Conditional Prompt Learning (SCP). Learning input-invariant (prompt experts) and input-specific (data-dependent) prompts. The input-invariant prompts are updated from all the inputs, which contain task information, and we use a dynamic mechanism to generate input-specific prompts for different inputs. Add/Mul denotes element-wise operations. B×S×C is the shape of the input features, and l is the number of experts in the prompt pool. . . . . . . . . . . . . . . . . . . . . 117 7.5 Visualization. We first detect the target of interest and generate the prompts, then predict the action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 8.1 Our efficient Vision-Language Alignment (ViLA) model, via a Frame-Prompter and distillation, contains two new modules: a text-guided Frame-Prompter and a cross-modal QFormer-Distiller. It learns to extract the most question-related frames while keeping the inference latency low. . . . . . . . . . . . . . . . . . . 129 8.2 Model Overview. Our ViLA model includes 4 sub-modules: the visual encoder, the text-supervised Frame-Prompter (FP), the QFormer-Distiller (QFD), and an LLM. We encode the video frames through a frozen visual encoder. Then we train the Teacher-QFormer using all the frame features. After that, we train the Student-QFormer and Frame-Prompter end-to-end. Unlike the Teacher-QFormer, our Student-QFormer is trained with masked frame features from the text-supervised Frame-Prompter. Finally, the input question text and the QFormer-transformed visual features go through a frozen large language model to generate the answer. 
Our network supports both leveraging the LLM through proper visual prompting without affecting the original LLM's (frozen) ability on language tasks, and simultaneously finetuning the LLM (LoRA) to get optimal performance on specific tasks. . . . 130 8.3 Text-guided Frame-Prompter. Here we show the details of our learnable text-guided Frame-Prompter. We design a learnable Frame-Prompter to sample the most text-query-related frames, with two design choices (a and b). We choose design (a) for diversified temporal sampling. We first encode the mean-pooled segment features. We then apply the Gumbel Softmax to compute the segment mask to guarantee differentiability. The selected frame embeddings then go through the QFormer-Distiller. Here B denotes batch size, T denotes the number of frames, and N × C denotes the frame feature sequences. The Frame-Prompter is learned with the text-supervised gradient. When the VQA loss is applied, the input-question-related gradient further flows to the Frame-Prompter and guides it to select the most critical frames. . . . . . . 135 xxi 8.4 Key-frame Selection Comparison Results (selecting 4 frames from 32 frames). We compare frames selected by our ViLA with those from the SOTA SeViLA [24] method. Across different question types, especially Causal and Temporal questions, the keyframes selected by our network are more relevant and better related to the question. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.5 QFormer-Distiller Results Visualization. Here we visualize the keyframes selected after cross-modal distillation. After distillation, we can select the most question-relevant frames even from 16 frames. . . . . . . . . . . . . . . . . . . 146 9.1 Saliency-aware quantile-based partitioning of Gaussian-distributed weights. 
Unsalient weights are divided into equal quantiles and binarized, while salient weights, corresponding to the distribution tails, are quantized using a multi-bit approach. . 156 9.2 Pruning on the Bi-VLM-quantized model and the BiLLM-quantized model. After layer 10, we observe around 95% image-token redundancy in the quantized models. Our Bi-VLM exhibits better performance. . . . . . . . . . . . . . . . . . 170 1 Annotation Statistics. The actor and action distribution for DAVE includes a wide-ranging and rich taxonomy of 16 agents and 16 action categories. This dual focus on both the breadth of agent and action types and the depth of instances allows for more robust and effective training of video recognition models. . . . . . . . . 178 2 SFID Composed Images Comparison. Here we show how we compose the set images. In (a), we randomly place the item images with a fixed size (same height, aspect ratio preserved). In (b), we place the item images using the bbox of the item in the original scene images. After the study, we apply (a) in our SFID computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 3 FBT: Flexible Bidirectional Transformer. We randomly sample M ∈ [0, N] items from the N items in a scene as the input set, and take the (M+1)-th item, which is not in the input set, as the output target. We put the scene embedding at the beginning of the input set, and take the scene embedding as the start token E_{Xs}. We set a zero vector as the end token E_{Xe}. The output will be the end token when the input set contains all the items in the scene. . . . . . . . . . . . . . . . . . . . . 197 4 Binary FITB Results. We show the FITB results on the STL-home dataset when there are two candidates (second row). Our model chooses the item in the green box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 5 More Visualization for living room. 
In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from 3D-FRONT), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 202 6 More Visualization for bedroom. In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from 3D-FRONT), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 203 xxii 7 More Visualization for Fashion. In this example, given an inspirational fashion scene image (sampled from STL-fashion, shown in columns 1 and 5) and a pool of products (from STL-fashion), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . 204 8 More Visualization for STL-home. In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from STL-home), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 205 9 Large Vision Model. Prompts from the large vision model; no supervision needed. We visualize the outputs in terms of different prompts, including bbox, line, and different points. Bbox and line have more stable outputs, which means better prompts result in better outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 10 Histogram Analysis and Fitted Gaussian Curve of Layers [0,10,20,30] of the Vision Model. The curve represents the fitted Gaussian distribution over the histogram bar plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 11 Histogram Analysis and Fitted Gaussian Curve of Layers [0,10,20,30] of the Language Model. 
The curve represents the fitted Gaussian distribution over the histogram bar plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 12 KL divergence between weight histograms and fitted Gaussian distributions in a Vision Model. Early self-attention layers exhibit significant deviation from the Gaussian approximation compared to later layers. . . . . . . . . . . . . . . . . . 213 13 KL divergence between weight histograms and fitted Gaussian distributions in a Language Model. Early self-attention layers exhibit significant deviation from the Gaussian approximation compared to later layers. . . . . . . . . . . . . . . . 214 14 Bi-VLM Quantization Error Across Vision Model Layers for Varying Saliency Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 15 Bi-VLM Quantization Error Across Language Model Layers for Varying Saliency Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 xxiii Chapter 1: Introduction 1.1 Video Understanding Video data has become a dominant form of information in the digital age. On platforms like YouTube, over 20 million videos are uploaded daily, and the platform hosts approximately 20 billion videos in total [25], contributing to an ever-growing reservoir of visual data that far exceeds what manual analysis can handle. These statistics highlight the urgent need for automatic and scalable video understanding. Video understanding refers to the process by which computational models analyze, interpret, and extract meaningful information from video data. This involves recognizing objects, actions, scenes, temporal relationships, and complex events over time, often integrating spatial, temporal, and sometimes multimodal information. The goal is to enable machines to “understand” video content in a way that supports tasks such as classification, detection, segmentation, captioning, retrieval, summarization, reasoning, and question answering [26–30]. 
Real-world applications of video understanding span virtually every domain. In autonomous driving [31–33], advanced driver assistance systems use multiple cameras to detect pedestrians, traffic signals, lanes, and other vehicles in real time, enabling self-driving cars to make life-critical decisions based on video feeds. In public security [34, 35], smart surveillance systems leverage computer vision to automatically recognize activities or anomalies in CCTV footage, aiming to enhance safety in smart cities. Unmanned aerial vehicles (UAVs) and drones [36, 37] deploy onboard video analytics for tasks like agricultural monitoring and search-and-rescue, where real-time interpretation of aerial video can guide timely actions. In the entertainment industry [30, 38], video understanding drives content recommendation, moderation, and immersive experiences: streaming services analyze video frames to tag content and personalize what viewers see, while video games and augmented reality rely on understanding live camera input to blend virtual and real worlds. This introduction presents an overview of the core foundational components in video understanding. We discuss key components from the datasets that fuel progress and the preprocessing techniques that prepare raw footage, to the visual reasoning methods for spatial and temporal perception, multimodal alignment for integrating vision with language, and finally model compression approaches for efficient deployment. Dataset Development for Video Understanding: Large-scale annotated datasets have been instrumental in advancing video understanding. Early efforts like UCF101 [39] in 2012 provided one of the first sizable action recognition datasets, with 13,000 clips across 101 human action categories, and similar benchmarks (e.g., HMDB-51 [40], Sports-1M [41], and Kinetics [42]) enabled researchers to train and evaluate video models on diverse “in the wild” footage. 
These datasets established evaluation metrics and uncovered challenges (camera motion, back- ground clutter, etc.) that shaped model development. The progression in the dataset development illustrates how the community has moved towards larger, more diverse, and more task-driven video datasets. These datasets are foundational – they supply the training data needed for modern deep learning models and ensure that research progress translates to real-world generalization. Preprocessing Techniques: Raw video is high-dimensional and often redundant, so ef- fective preprocessing is critical for both accuracy and efficiency. A fundamental step is frame 2 sampling – selecting a subset of frames from the video – to balance information vs. computa- tion. Classical approaches often used uniform or stride-based sampling (e.g. one frame every few frames) and decoding all frames at a fixed frame rate [43]. Standard image preprocessing like re- sizing and cropping is extended to video by applying the same spatial transform to each frame in a clip, maintaining temporal consistency. Data augmentation (random crops, flips, color jitter, etc.) is similarly applied on video frames to improve robustness, with care to apply identically across time to avoid disrupting temporal coherence. Beyond these basics, recent research has developed more advanced preprocessing techniques. One notable trend is differentiable frame sampling or frame selection policies that learn to pick the most informative frames on the fly [44], rather than a fixed sampling rate. Such adaptive sampling saves computation by skipping redundant or uninformative frames, focusing processing on salient moments of the video. In summary, preprocessing has evolved from heuristic frame selection and manual feature computation (e.g. optical flow) to learned and content-aware sampling strategies, as well as leveraging video com- pression to process footage more efficiently. 
These techniques form the first stage of an efficient video understanding pipeline, ensuring that subsequent reasoning modules operate on compact yet information-rich visual inputs.

Spatial Frame Understanding and Visual Reasoning: For video understanding, a fundamental building block is robust spatial understanding of each frame. This involves detecting and recognizing the objects, scenes, and other visual elements present in individual images (frames). Over the years, the computer vision community has achieved remarkable progress in image understanding, which directly carries over to the frames of a video. The introduction of deep Convolutional Neural Networks (CNNs) was a watershed moment: models like AlexNet [45] demonstrated that learnable convolutional filters could far outperform hand-crafted features at image classification. This culminated in the development of ResNet [46], a very deep CNN using residual skip-connections to ease training. ResNet-50 and its deeper variants (with 101 or 152 layers) surpassed human-level performance on ImageNet image recognition, and these architectures became the de facto backbones for many vision tasks. In the context of video, a ResNet pretrained on ImageNet is often used to encode each frame into a rich feature vector, enabling the model to understand what is where in the scene. In the meantime, traditional CNNs have been extended with attention mechanisms and structured representations (e.g. scene graphs) to perform reasoning. The current state of the art for spatial vision modeling has shifted towards Vision Transformers (ViT) [47] and their variants. Transformers dispense with convolution entirely, and instead use global self-attention to model relationships between patches of an image. For video frames, vision transformers provide stronger capabilities for spatial reasoning: they can attend to disparate image regions (e.g.
a person and an object they are reaching for across the frame) and model their relationship explicitly. Empirically, these transformer models now serve as powerful frame encoders. In summary, spatial frame understanding has evolved from early CNN-based recognition of objects to more holistic reasoning with attention-based models.

Temporal Modeling and Motion Analysis: A video is more than a stack of images; the temporal dimension introduces motion and evolving dynamics that must be captured to understand actions and events. Temporal modeling techniques aim to learn representations that integrate information over time, from short motion patterns (e.g. a hand waving) to long-range dependencies (e.g. events unfolding over minutes). One classical approach to temporal modeling is the use of recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks [48], which process frame features sequentially and maintain a hidden state that evolves. Early video recognition models like LRCN [49] combined CNN frame encoders with an LSTM to recognize actions from frame sequences. Around the same time, another influential approach was the development of 3D Convolutional Networks. In a 3D ConvNet, filters operate across the spatial dimensions (height, width) and time, thereby directly learning motion-sensitive features. A milestone was C3D [50], which showed that a generic 3D CNN trained on videos could automatically learn motion detectors (like rudimentary optical flow) in its early layers and achieve strong action recognition performance. These classical methods established two important paradigms, recurrent sequence modeling and spatiotemporal convolution, both of which substantially outperformed earlier hand-crafted temporal features. In recent years, state-of-the-art temporal modeling has been revolutionized by attention mechanisms and Transformer architectures applied in time.
Just as transformers improved spatial modeling, they have proven extremely effective for temporal sequences. A representative example is TimeSformer [51], the first pure-transformer architecture for video understanding. TimeSformer factorizes attention over space and time: it applies self-attention within each frame (spatial) and across frames (temporal), allowing the model to learn long-range temporal dependencies with global context. Empirically, transformer-based models and their hybrids now achieve state-of-the-art results on video classification benchmarks, often surpassing traditional 3D CNNs in accuracy while offering more flexibility in modeling. In summary, temporal modeling has progressed from sequential or fixed-length processing of frames to more flexible and long-range attentive modeling. These advances allow video understanding systems not only to identify what is happening in a clip but also to understand when and how events unfold, which is crucial for tasks like action recognition, prediction, and temporal segmentation.

Multimodal Alignment for Video Question Answering: Video understanding increasingly involves multimodal analysis, integrating visual information with other modalities such as language and audio, to enable deeper semantic tasks. One prominent multimodal task is Video Question Answering (Video QA), where a model answers natural language questions about a video’s content. This requires aligning visual content (frames, objects, actions) with textual content (questions, narration, or subtitles) and sometimes audio cues. A classical benchmark in this area was MovieQA [52], which provided clips from movies, plot summaries, and question-answer pairs to evaluate story understanding. Early methods for Video QA often used separate pipelines: a CNN/LSTM to encode the video and a text encoder for the question, with a fusion mechanism to produce an answer.
For example, the MovieQA baseline [52] combined visual features (from frame-level CNN descriptors) with simple text representations, and used multiple-choice questions to assess comprehension of characters and events. Another line of classical work was video captioning [53], where models learned to generate descriptive sentences for video clips, effectively learning a mapping from video to language. These early works laid the groundwork for understanding correspondences between visual dynamics and natural language descriptions or questions. Modern state-of-the-art approaches to multimodal video understanding leverage powerful pretrained models and large-scale data. A notable example is VideoCLIP [54], one of the first methods to perform large-scale contrastive pre-training directly on video–subtitle pairs, enabling zero-shot transfer to retrieval, captioning, and action recognition. Other cutting-edge techniques include dual-stream transformers (one stream for video frames, one for text, with cross-attention between them) and the use of large language models augmented with visual encoders. For instance, researchers have begun to integrate vision encoders with models like GPT-4 to enable open-ended question answering about video content, though such systems are still emerging. The progress in this area is exemplified by huge gains on benchmarks: early models often struggled with complex queries, whereas current models can handle questions about temporal order, causal relationships, and even hypothetical events in videos. In summary, the synergy of vision and language in video understanding has grown from simple Q&A on short clips to foundation models that jointly learn from video and text at scale. This multimodal alignment capability is crucial for high-level understanding tasks like video retrieval with text queries, video captioning, and video dialogue systems.
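At inference time, contrastive video–text models of the kind described above score a video against candidate texts by cosine similarity of their embeddings. The sketch below shows only this scoring step with toy pre-computed embeddings; the embedding dimension and variable names are placeholders, not the actual VideoCLIP interface:

```python
import numpy as np

def contrastive_scores(video_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> np.ndarray:
    """Cosine-similarity logits between video and text embeddings,
    as used for zero-shot retrieval in contrastive video-text models."""
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (v @ t.T) / temperature  # shape: (num_videos, num_texts)

# Toy example: 2 videos scored against 3 candidate captions.
videos = np.random.randn(2, 512)
texts = np.random.randn(3, 512)
scores = contrastive_scores(videos, texts)
best_caption = scores.argmax(axis=1)  # retrieved caption index per video
```

During pre-training, the same logits feed a cross-entropy loss that pulls matching video–text pairs together and pushes mismatched pairs apart.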
Video Model Compression and Acceleration: The drive for efficient video understanding has led to extensive research in model compression and acceleration techniques. Video models (especially 3D CNNs or transformers operating on many frames) tend to be computationally heavy, which poses challenges for real-time processing and deployment on edge devices like drones or mobile phones. Classical approaches to compressing models were often inherited from the image domain. For example, network pruning techniques systematically remove less important weights or filters from a CNN after training, while quantization reduces the precision of model parameters (e.g. from 32-bit floating point to 8-bit integers). Han et al. [55] pioneered Deep Compression, showing that one can prune, quantize, and even Huffman-code neural networks to dramatically reduce memory and computation with minimal loss in accuracy. Such methods, when applied to CNNs for video, translate directly into faster inference. Another classical strategy is knowledge distillation [56], where a smaller “student” model is trained to replicate the outputs of a large “teacher” model, thus transferring knowledge. This has been used to compress large video models by training a compact model on the soft outputs or feature maps of a high-capacity reference model. In recent years, researchers have developed specialized techniques for video model acceleration, motivated by the observation that there is substantial redundancy both in video inputs (neighboring frames are often similar) and within the model’s feature maps. One effective approach is adaptive computation in the temporal dimension [57]. Overall, the state of the art in video model compression is characterized by holistic optimizations: from weight pruning and quantization of networks, to dynamic execution that adjusts to input content, and architecture innovations that bake efficiency into the model design.
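The 32-bit-to-8-bit quantization mentioned above can be illustrated with a minimal symmetric, per-tensor post-training scheme. This is a simplification for illustration only; production quantizers are typically per-channel and calibration-based:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map float32 weights onto
    the int8 grid [-127, 127] using a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # a toy weight matrix
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()
# Storage drops 4x (int8 vs float32); the worst-case rounding error
# is half a quantization step, i.e. at most scale / 2.
assert error <= scale / 2 + 1e-6
```

Pruning and distillation compose naturally with this step, which is what makes pipelines like Deep Compression effective.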
These advancements are crucial for bringing sophisticated video understanding algorithms from research labs to real-world applications, where computation and energy are often limited. An efficient model that can run on-device in real time opens up possibilities for privacy-preserving video analytics on the edge (processing video locally) and scalable deployment (hundreds of cameras or drones each running analytics). The continued research in this area is driving the field towards the thesis goal of effective and efficient video understanding: achieving high accuracy and rich functionality without exorbitant computational cost.

1.2 Dissertation Overview

The landscape of video understanding has evolved significantly over the past decade. We have seen the emergence of extensive datasets that enable the training of deep learning models, the transition from hand-engineered features to end-to-end learned representations, the development of sophisticated mechanisms to model spatial and temporal information, the rise of video-language alignment for high-level video reasoning, and the acceleration and compression of video models. Across all these developments, a common thread is the balance between effectiveness and efficiency. It is now possible to build models that “understand” videos in the sense of recognizing complex actions and even answering semantic questions about them, but often at a tremendous computational cost. The central aim of this thesis is to explore methods that bridge this gap. By drawing on the foundations reviewed above and introducing new ideas to make video models more effective, faster, and more adaptive, we seek to push the field toward techniques that retain state-of-the-art performance while significantly improving efficiency. The following chapters will delve into these contributions, building upon the rich context outlined in this introduction.
Part I: Video Datasets

Our focus here is on the design and creation of challenging datasets tailored to improve the effectiveness and robustness of visual understanding, ensuring diverse, high-quality, and contextually rich data that bridges existing gaps in the field. Specifically, we present METEOR [58] in Chapter 2 and DAVE [59] in Chapter 3. METEOR is a dataset of rare and interesting multi-agent driving behaviors that are grouped into traffic violations, atypical interactions, and diverse scenarios. DAVE is designed for evaluating perception methods on a benchmark with a high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. Furthermore, DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Our analysis revealed substantial shortcomings of current perception models when tested against METEOR and DAVE.

Part II: Preprocessing

Frame sampling reduces computational overhead by selecting only key frames, retaining essential temporal cues while eliminating redundant information, and cropping ensures consistent spatial dimensions, focusing the model’s attention on the most relevant regions. Preprocessing improves both the efficiency of model training/inference and the overall accuracy and robustness of video understanding. In Chapter 4, we introduce AZTR [60], a learning-based approach that uses customized auto zoom to automatically identify the target and scale it appropriately; an efficient transformer-based algorithm then maps the input sequence to a new sequence of a specific size, chosen according to the computational cost requirement, for efficient temporal reasoning.
In practice, we achieve a 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset [61], an 8.3-10.4% improvement on the UAV-Human dataset [62], and a 3.2% improvement on the Drone Action dataset [63]. In Chapter 5, we present MITFAS [64], which uses the concept of mutual information to compute and align the regions corresponding to the target action or motion in the temporal domain for better recognition reasoning. In practice, we achieve an 18.9% improvement in Top-1 accuracy over current state-of-the-art methods on UAV-Human [62], a 7.3% improvement on Drone-Action [63], and a 7.16% improvement on NEC Drones [65].

Part III: Visual Reasoning

In this segment, we focus on the development of new methodologies for enhancing visual reasoning capabilities, including architecture designs and advanced attention mechanisms that allow models to focus on contextually significant features and improve effectiveness in real-world scenarios. In Chapter 6, we propose ICAR [66], a compatibility learning framework in which a category-aware Flexible Bidirectional Transformer (FBT) is introduced for visual scene-based set compatibility reasoning, together with a cross-domain visual similarity module. Compared with SOTA methods, ICAR achieves up to 5.3% and 9.6% improvement in FITB score, and 22.3% and 31.8% improvement in SFID, on fashion and furniture, respectively. In Chapter 7, we present Conditional Prompt Learning (SCP) [67], which leverages the strengths of prompt learning to further enhance reasoning ability. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve recognition performance. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama [68], NECDrone [65]), which consist of scenes with single-agent and multi-agent actions.
We further evaluate our approach on ground camera videos to verify its effectiveness and generalization, achieving a 1.0-3.6% improvement on SSV2 [69].

Part IV: Multimodal Alignment

In Chapter 8, we introduce a novel method for the integration of visual and linguistic modalities, fostering a seamless alignment that enhances cross-modal understanding and paves the way for more intuitive and accurate multimodal applications. With ViLA [70], we propose an efficient Video-Language Alignment network, which addresses both efficient frame sampling and effective cross-modal alignment in a unified way. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical content, thus improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0× speed-up). Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0× speed-up; our 2-frame model outperforms SeViLA with 4 frames on the VLEP dataset with a 4.2× speed-up.

Part V: Acceleration and Compression

In Chapter 9, we propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models.
For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe 90%-99% redundancy among image tokens in the quantized models. This allows us to further prune the visual tokens to improve efficiency.

By addressing these dimensions, this dissertation not only contributes methodologies and insights for video understanding but also provides practical advancements that facilitate the deployment of more effective, efficient, and interpretable models for real-world applications. It further lays the groundwork for future research in robust, context-aware, and multi-modal AI systems.

Part I: Video Dataset

Chapter 2: METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors

2.1 Introduction

Recent research in learning-based techniques for robotics, computer vision, and autonomous driving has been driven by the availability of datasets and benchmarks. Several traffic datasets have been collected from different parts of the world to stimulate research in autonomous driving, driver assistance, and intelligent traffic systems. These datasets correspond to highway or urban traffic, and are widely used in the development and evaluation of new methods for perception [71], prediction [72], behavior analysis [73], and navigation [74]. Many initial autonomous driving datasets were motivated by computer vision or perception tasks such as object recognition, semantic segmentation, or 3D scene understanding. Recently, many other datasets have been released that consist of point-cloud representations of objects captured using LiDAR, pose information, 3D track information, stereo imagery, or detailed map information for applications related to 3D object recognition and motion forecasting.
Many large-scale motion forecasting datasets such as Argoverse [75] and the Waymo Open Motion Dataset [76], among others, have been used extensively by researchers and engineers to develop robust prediction models that can forecast vehicle trajectories.

Table 2.1: Characteristics of Traffic Datasets. We compare METEOR with state-of-the-art autonomous driving datasets that have been used for trajectory tracking, motion forecasting, semantic segmentation, prediction, and behavior classification. METEOR is the largest (in terms of number of annotated frames) and most diverse in terms of heterogeneity, scenarios, varying behaviors, densities, and rare instances. Darker shades represent a richer collection in that category. Best viewed in color.

| Dataset | Location | Bad weather | Night | Road type | Het.⋆ | Size | Density | Lidar | HD Maps | Traffic Violations‡ | Atypical Interactions‡ | Diverse Scenarios‡ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argoverse [75] | USA | ✓ | ✓ | urban | 10 | 22K | Medium | ✓ | ✓ | ✗ | ✓ | ✗ |
| Lyft Level 5 [77] | USA | ✗ | ✗ | urban | 9 | 46K | Low | ✓ | ✓ | ✗ | ✗ | ✗ |
| Waymo [76] | USA | ✓ | – | urban | 4 | 200K | Medium | ✓ | ✓ | ✗ | ✓ | ✓ |
| ApolloScape [78] | China | ✗ | ✓ | urban, rural | 5 | 144K | High | ✓ | ✓ | ✗ | ✗ | ✗ |
| nuScenes [79] | USA/Sg. | ✓ | ✓ | urban | 13 | 40K | Low | ✓ | ✓ | ✗ | ✓ | ✓ |
| INTERACTION [80] | International | ✗ | ✗ | urban | 1 | − | Medium | ✓ | ✓ | ✗ | ✗ | ✗ |
| CityScapes [81] | Europe | ✗ | ✗ | urban | 10 | 25K | Low | ✗ | ✗ | ✗ | ✗ | ✗ |
| IDD [82] | India | ✗ | ✗ | urban, rural | 12 | 10K | High | ✗ | ✗ | ✗ | ✗ | ✗ |
| HDD [83] | USA | ✗ | ✗ | urban | − | 275K | Medium | ✓ | ✗ | ✗ | ✓ | ✓ |
| Brain4cars [84] | USA | ✗ | ✗ | urban | − | 2000K | Low | ✗ | ✓ | ✗ | ✗ | ✗ |
| D2-City [85] | China | ✓ | ✗ | urban | 12 | 700K | Medium | ✗ | ✗ | ✗ | ✗ | ✓ |
| TRAF [86] | India | ✗ | ✓ | urban, rural | 8 | 72K | High | ✗ | ✗ | ✗ | ✗ | ✗ |
| BDD [87] | USA | ✓ | ✓ | urban | 8 | 3000K | High | ✗ | ✗ | ✗ | ✗ | ✓ |
| ROAD [88] | UK | ✓ | ✓ | urban | 7 | 122K | Low | ✓ | ✗ | ✗ | ✗ | ✓ |
| METEOR | India | ✓ | ✓ | urban, rural† | 16†† | 2027K | High§ | ✗ | ✗ | ✓ | ✓ | ✓ |

‡ Rare instances can be broadly grouped into (i) traffic violations, (ii) atypical interactions, and (iii) difficult scenarios.
† Includes roads without lane markings. Roads in other datasets with rural roads may contain lane markings.
⋆ Heterogeneity. We indicate the classes corresponding to moving traffic agents only, excluding static objects such as poles, traffic lights, etc.
§ Up to 40 agents per frame.
†† Up to 9 unique agents per frame.

However, existing datasets do not capture the rare behaviors or heterogeneous patterns. Therefore, prediction models trained on these existing datasets are not very robust in terms of handling challenging traffic scenarios that arise in the real world. A major challenge currently faced by research in autonomous driving is the heavy tail problem [75, 76], which refers to the challenge of dealing with rare and interesting instances. There are several ways in which existing datasets currently address the heavy tail problem:

1. Mining: The Argoverse and Waymo datasets use a mining procedure that includes scoring each trajectory based on its “interestingness” to explicitly search for difficult and unusual scenarios [75, 76].

2. Diversifying the taxonomy: Train the prediction and forecasting models to identify unknown agents at test time. This approach necessitates annotating a diverse taxonomy of class labels. Argoverse and nuScenes [79] contain 15 and 23 classes, respectively.

3. Increasing dataset size: This approach simply collects more data, with the premise that collecting more traffic data will likely also increase the number of such scenarios in the dataset.

In spite of many efforts along these lines, existing datasets manage to collect only a handful of such instances, due to the infrequent nature of their occurrence. For example, the Waymo Open Motion dataset [76] contains only atypical interactions and diverse scenarios, while the Argoverse dataset [75] contains only atypical interactions. There is clearly a need for a different approach to addressing the heavy tail problem. Our solution is to build a traffic dataset from videos collected in India, where the inherent nature of the traffic is dense, heterogeneous, and unstructured.
The traffic patterns and surrounding environment in parts of India are more challenging than those in other parts of the world. This includes high congestion and traffic density. Some of these roads are unmarked or unpaved. Moreover, the traffic agents moving on these roads correspond to vehicles, buses, trucks, bicycles, pedestrians, auto-rickshaws, two-wheelers such as scooters and motorcycles, etc.

2.1.1 Main Contributions

1. We present a novel dataset, METEOR, corresponding to the dense, heterogeneous, and unstructured traffic in India. METEOR is the first large-scale dataset containing annotated scenes for rare and interesting instances and multi-agent driving behaviors, broadly grouped into:

(a) Traffic violations: running traffic signals, driving in the wrong lanes, taking wrong turns.

(b) Atypical interactions: cut-ins, yielding, overtaking, overspeeding, zigzagging, lane changing.

(c) Diverse scenarios: intersections, roundabouts, and traffic signals.

2. METEOR has more than 2 million labeled frames and 13 million annotated bounding boxes for 16 unique traffic agents, and GPS trajectories for the ego-agent.

3. Every video in METEOR is tagged using a diverse range of factors including weather, time of the day, road conditions, and traffic density.

4. We use METEOR to extract new insights in perception tasks such as 2D object detection and multi-agent behavior recognition in unstructured traffic. Additionally, we present a novel, fine-grained analysis of the relationship between traffic environments (traffic density, mixture of agents, area, time of the day, and weather conditions) and 2D object detection.
2.1.2 Applications and Benefits

We list some promising directions in which METEOR can contribute towards autonomous driving research:

• Towards Robust Perception: We observe that perception tasks like 2D object detection and multi-agent behavior recognition fail in challenging Indian traffic scenarios, compared to their performance on existing datasets captured in the US, Europe, and other developed nations. METEOR can be a useful benchmark for research in perception in unstructured traffic environments and developing nations.

Figure 2.1: METEOR. We summarize various characteristics of our dataset in terms of scene: traffic density, road type, lighting conditions, agents (we indicate the total count of each agent across 1250 videos), and behaviors, along with their size distribution (in GB). The total size of the current version of the dataset is around 100GB, and it will continue to expand. Our dataset can be used to evaluate the performance of current and new methods for perception, prediction, behavior analysis, and navigation based on some or all of these characteristics. Details of the organization of our dataset are given at https://gamma.umd.edu/meteor.

• Towards Risk-Aware Planning and Control: METEOR can aid the development of risk-aware motion planners by predicting the behaviors of surrounding agents. Motion planners can compute controls that guarantee safety around aggressive drivers who are prone to overtaking and overspeeding.

• Towards Fine-grained Traffic Analysis: With METEOR, researchers can study the causality relationship between traffic patterns, static scene elements, and dynamic agent behaviors, resulting in novel ADAS for unstructured traffic environments.
2.2 Comparison with Existing Datasets

2.2.1 Tracking and Trajectory Prediction Datasets

Datasets such as Argoverse [75], Lyft Level 5 [77], the Waymo Open Dataset [76], ApolloScape [78], and the nuScenes dataset [79] are used for trajectory forecasting [86, 89–92] and tracking [71]. Several of these datasets use a mining procedure [75, 76] that heuristically searches the dataset for rare and interesting scenarios. The resulting collection of such scenarios and behaviors, however, is only a fraction of the entire dataset. METEOR, by comparison, exclusively contains such scenarios due to the inherent nature of the unstructured traffic in India.

METEOR has many additional characteristics with respect to these datasets. For instance, METEOR's 2.02 million annotated frames are more than 10× the current highest number of annotated frames among other datasets with high-density traffic (ApolloScape). Furthermore, METEOR consists of 16 different traffic agents that include only on-road moving entities (and not static obstacles). This is, by far, the most diverse in terms of class labels. In comparison, Argoverse and nuScenes contain 10 and 13 traffic agents, respectively. METEOR is the first motion forecasting and behavior prediction dataset with traffic patterns from rural and urban areas that consist of unmarked roads and high-density traffic. In contrast, traffic scenarios in Argoverse, Waymo, Lyft, and nuScenes have been captured in sparse- to medium-density traffic on well-marked, structured roads in urban areas.

2.2.2 Semantic Segmentation Datasets

CityScapes [81] is widely used for several tasks, primarily semantic segmentation. It is based on urban traffic data collected from European cities with structured roads and low traffic density. In contrast, the Indian Driving Dataset (IDD) [82] is collected in India in both urban and rural areas with high-density traffic.
A common aspect of both these datasets (CityScapes and IDD), however, is the relatively low annotated frame count (25K and 10K, respectively). This is probably due to the effort involved in annotating every pixel in each image. IDD also contains high-density traffic scenarios in rural areas, similar to METEOR. However, our dataset has 200× the number of annotated frames and 1.6× the number of traffic-agent classes. Similar to TRAF, the IDD does not contain behavior data.

2.2.3 Behavior Prediction

Behavior prediction corresponds to the task of predicting turns (right, U-turn, or left), acceleration, merging, and braking, in addition to driver-intrinsic behaviors such as over-speeding, overtaking, cut-ins, yielding, and rule-breaking. The two most prominent datasets for action prediction include the Honda Driving Dataset (HDD) [83] and the BDD dataset [87]. Some of the major distinctions between METEOR and the HDD are in terms of size (approximately 10×), the availability of scenes with night driving and rainy weather, and the inclusion of unstructured environments in low-density traffic. The BDD dataset [87] contains more annotated samples than METEOR; however, the BDD dataset contains 100K videos while METEOR contains 1K videos, so the number of annotated samples per video is 66× higher for METEOR. The annotations in prior datasets are limited to actions and do not contain the rare and interesting behaviors contained in METEOR.

Figure 2.2: Annotations for rare instances: (a) Cut-ins/Jaywalking. (b) Yielding/Cut-ins. (c) Overtaking/Overspeeding. (d) Driving in wrong lane. (e) Running red traffic lights. (f) Ignoring lane signs/wrong lane driving. (g) High density. (h) Rainy weather. (i) Night time. (j) Rural areas. One of the unique aspects of METEOR is the availability of explicit labels for rare and interesting instances including atypical interactions, traffic violations, and diverse scenarios. These annotations can be used to benchmark new methods for object detection and multi-agent behavior prediction.

2.3 METEOR Dataset

Our dataset is summarized in Figure 2.1 and visually shown in Figure 2.2. Below, we present some details of the data collection process and discuss some of the salient characteristics of METEOR.

2.3.1 Dataset Collection and Organization

The data was collected in and around the city of Hyderabad, India, within a radius of 42 to 62 miles. Several outskirts were chosen to cover rural and unstructured roads. Our hardware capture setup consists of two wide-angle Thinkware F800 dashcams mounted on an MG Hector and a Maruti Ciaz. The camera sensor has a 2.3-megapixel resolution with a 140◦ field of view. The video is captured in full high definition with a resolution of 1920 × 1080 pixels at a frame rate of 30 frames per second. The dashcam is embedded with an accurate positioning system that stores the GPS coordinates, which were processed into world frame coordinates. The sensor synchronizes between the camera and the GPS.

Recordings from the dashcam are streamed continuously and are clipped into one-minute video segments. The dataset is organized as 1250 one-minute video clips. Each clip contains static and dynamic XML files. Each static file summarizes the meta-data of the entire video clip, including the behaviors, road type, scene structure, etc. Each dynamic file describes frame-level information such as bounding boxes, GPS coordinates, and agent behaviors. Our dataset can be searched using helpful filters that sort the data according to road type, traffic density, area, weather, and behaviors. We also provide many scripts to easily load the data after downloading.
2.3.2 Annotations

We manually annotated the videos using the Computer Vision Annotation Tool (CVAT) and provide the following labels: (i) bounding boxes for every agent, (ii) agent class IDs, (iii) GPS trajectories for the ego-vehicle, (iv) environment conditions including weather, time of the day, traffic density, and heterogeneity, (v) road conditions (urban, rural, lane markings), (vi) road network features including intersections, roundabouts, and traffic signals, (vii) actions corresponding to left/right turns, U-turns, acceleration, and braking, (viii) rare and interesting behaviors, and (ix) the camera intrinsic matrix for depth estimation to generate trajectories of the surrounding vehicles. This set of annotations is the most diverse and extensive compared to prior datasets.

A diverse and rich taxonomy of agent categories is necessary to ensure that autonomous driving systems can detect different types of agents in any given scenario. Towards that goal, datasets for autonomous driving are designed or captured to achieve two goals: (a) capture as many different types of agent categories as possible; (b) capture as many instances of each category as possible. In both these aspects, METEOR outperforms all prior datasets. We annotate 16 types of moving traffic entities with rare and interesting behaviors. Note specifically that the percentage