ABSTRACT

Title of Dissertation: DEVELOPING MULTIMODAL LEARNING METHODS FOR VIDEO UNDERSTANDING

Mingwei Sun, Doctor of Philosophy, 2024

Dissertation Directed by: Professor Kunpeng Zhang, Department of Decision, Operations and Information Technologies

In recent years, the field of deep learning, with a particular emphasis on multimodal representation learning, has experienced significant advancements. These advancements are largely attributable to groundbreaking progress in areas such as computer vision, voice recognition, natural language processing, and graph network learning. This progress has paved the way for a multitude of new applications. The domain of video, in particular, holds immense potential. Video is often considered the most potent form of digital content for communication and the dissemination of information. The ability to effectively and efficiently comprehend video content could prove instrumental in a variety of downstream applications. However, the task of understanding video content presents numerous challenges. These challenges stem from the inherently unstructured and complex nature of video, as well as its interactions with other forms of unstructured data, such as text and network data. These factors contribute to the difficulty of video analysis. The objective of this dissertation is to develop deep learning methodologies capable of understanding video across multiple dimensions. Furthermore, these methodologies aim to offer a degree of interpretability, which could yield valuable insights for researchers and content creators. These insights could have significant managerial implications.

In the first study, I introduce an innovative network based on Long Short-Term Memory (LSTM), enhanced with a Transformer co-attention mechanism, designed for the prediction of apparent emotion in videos.
Each video is segmented into clips of one-second duration, and pre-trained ResNet networks are employed to extract audio and visual features at the second level. I construct a co-attention Transformer to effectively capture the interactions between the audio and visual features that have been extracted. An LSTM network is then utilized to learn the spatiotemporal information inherent in the video. The proposed model, termed the Sec2Sec Co-attention Transformer, outperforms several state-of-the-art methods in predicting apparent emotion on a widely recognized dataset: LIRIS-ACCEDE. In addition, I conduct an extensive data analysis to explore the relationships between various dimensions of visual and audio components and their influence on video predictions. A notable feature of the proposed model is its interpretability, which enables us to study the contributions of different time points to the overall prediction. This interpretability provides valuable insights into the functioning of the model and its predictions.

In the second study, I introduce a novel neural network, the Multimodal Co-attention Transformer, designed for the prediction of personality based on video data. The proposed methodology concurrently models audio, visual, and text representations, along with their inter-relationships, to achieve precise and efficient predictions. The effectiveness of the proposed approach is demonstrated through comprehensive experiments conducted on a real-world dataset, namely, First Impressions. The results indicate that the proposed model surpasses state-of-the-art methods in performance while preserving high computational efficiency. In addition to evaluating the performance of the proposed model, I also undertake a thorough interpretability analysis to examine the contribution across different levels. The insights gained from the findings offer a valuable understanding of personality predictions.
Furthermore, I illustrate the practicality of video-based personality detection in predicting outcomes of MBA admissions, serving as a decision support system. This highlights the potential importance of the proposed approach for both researchers and practitioners in the field.

In the third study, I present a novel generalized multimodal learning model, termed VAN, which excels in learning a unified representation of visual, acoustic, and network cues. Initially, I utilize state-of-the-art encoders to model each modality. To augment the efficiency of the training process, I adopt a pre-training strategy specifically designed to extract information from the music network. Subsequently, I propose a generalized Co-attention Transformer network. This network is engineered to amalgamate the three distinct types of information and to learn the inter-relationships that exist among the three modalities, a critical facet of multimodal learning. To assess the effectiveness of the proposed model, I collect a real-world dataset from TikTok, comprising over 88,000 videos. Extensive experiments demonstrate that the proposed model surpasses existing state-of-the-art models in predicting video popularity. Moreover, I have conducted a series of ablation studies to attain a deeper comprehension of the behavior of the proposed model. I also perform an interpretability analysis to study the contributions of each modality to the model performance, leveraging the unique property of the proposed co-attention structure. This research contributes to the field by offering a more comprehensive approach to predicting video popularity on short-form video platforms.
DEVELOPING MULTIMODAL LEARNING METHODS FOR VIDEO UNDERSTANDING

by

Mingwei Sun

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024

Advisory Committee:
Professor Kunpeng Zhang, Chair
Professor P.K. Kannan
Professor Lauren Rhue
Professor Jessica Clark
Professor Vanessa Frias-Martinez, Dean's Representative

© Copyright by Mingwei Sun 2024

Acknowledgments

During the course of my six-year doctoral study at the Smith Business School, I have had the privilege of collaborating with numerous esteemed researchers. Their guidance, teachings, and inspiration have been instrumental in my journey. I am deeply appreciative of their support and express my gratitude wholeheartedly.

First and foremost, I would like to extend my deepest gratitude to my advisor, Professor Kunpeng Zhang, for his unwavering support and valuable advice throughout my six years of study. His consistent provision of constructive feedback and advice has not only enriched my knowledge but also fostered my growth as an independent researcher. My academic journey under the guidance of Professor Zhang has been both enriching and transformative. I am deeply grateful for everything he has provided me with. His mentorship has truly been a cornerstone of my academic development.

I am sincerely thankful to Professor Jessica Clark, with whom I initiated my first project. The discussions about research, career, and life with her have been enlightening. I am also grateful to Professor Lauren Rhue for her consistent support and patience. Working with her has been a rewarding experience, and I have learned much from her. I also extend my gratitude to Professor P.K. Kannan for his insightful advice on both my research and career. I am thankful to Professor Vanessa Frias-Martinez for her unique perspective on research, which has significantly influenced my thinking process.
My heartfelt thanks also go to Professor Balaji Padmanabhan, whose guidance has been extremely beneficial. His insights on positioning my work have been enlightening. In general, I have learned a lot from these top-notch researchers.

Furthermore, I would like to acknowledge Justina and Miloyka for their immense help, encouragement, and patient support. They have always been available to assist, regardless of the challenges I faced. My doctoral experience has also been productive thanks to the interactions with my peers, particularly Bingze Xu, Wei Feng, Gujie Li, Sung Hyun Kwon, Maya Mudambi, Feiyu E, Weihong Zhao, Yunfei Wang, among others.

Finally, I would like to express my deepest appreciation to my wife, Fan Yu, and my parents for their unconditional support, both mentally and financially. Their backing has been a pillar of strength in my journey.

Table of Contents

Acknowledgements . . . . ii
Table of Contents . . . . iv
List of Tables . . . . vi
List of Figures . . . . vii
Chapter 1: Introduction . . . . 1
Chapter 2: Sec2Sec Co-attention Transformer for Video Emotion Prediction . . . . 7
  2.1 Introduction . . . . 7
  2.2 Related Work . . . . 11
    2.2.1 Emotion and Its Impact . . . . 12
    2.2.2 Audio-Video Representation Learning . . . . 13
    2.2.3 Transformer and Its Application . . . . 14
  2.3 Preliminaries . . . . 15
  2.4 Method: Sec2Sec Co-attention Transformer . . . . 17
  2.5 Experiments . . . . 21
    2.5.1 Dataset . . . . 21
    2.5.2 Implementation Details . . . . 23
    2.5.3 Baselines . . . . 23
    2.5.4 Results . . . . 26
  2.6 Ablation Study . . . . 27
  2.7 Conclusion . . . . 28
Chapter 3: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding . . . . 30
  3.1 Introduction . . . . 30
  3.2 Related Work . . . . 36
    3.2.1 Personality . . . . 36
    3.2.2 Video-Based Deep Personality Prediction . . . . 38
    3.2.3 Transformers in Computer Vision . . . . 40
    3.2.4 Multimodal Learning . . . . 41
  3.3 Preliminaries . . . . 42
  3.4 Our Model . . . . 43
  3.5 Experiments . . . . 49
    3.5.1 Dataset . . . . 49
    3.5.2 Implementation Details . . . . 50
    3.5.3 Baselines . . . . 51
    3.5.4 Evaluation Metrics . . . . 52
    3.5.5 Efficiency Metric . . . . 53
  3.6 Results . . . . 53
    3.6.1 Performance . . . . 53
    3.6.2 Efficiency . . . . 55
  3.7 Ablation Study . . . . 58
  3.8 Interpretability Analysis . . . . 58
  3.9 Decision Support Showcasing: MBA Admission Prediction . . . . 62
  3.10 Discussion and Conclusion . . . . 68
Chapter 4: Network-enhanced Multimodal Co-attention Learning for Short-Form Video Popularity Prediction . . . . 72
  4.1 Introduction . . . . 72
  4.2 Related Work . . . . 76
    4.2.1 Online Video Popularity Prediction . . . . 76
    4.2.2 Multimodal Representation Learning . . . . 77
  4.3 Proposed Model . . . . 79
    4.3.1 Input Embeddings . . . . 80
    4.3.2 VAN: A Generalized Multimodal Co-attention Network . . . . 83
  4.4 Experiments . . . . 86
    4.4.1 Dataset . . . . 86
    4.4.2 Popularity Score . . . . 86
    4.4.3 Evaluation Metrics . . . . 87
    4.4.4 Implementation Details . . . . 88
    4.4.5 Graph Attention Network Pre-training . . . . 89
    4.4.6 Baselines . . . . 90
  4.5 Results . . . . 91
  4.6 Ablation Study . . . . 93
  4.7 Interpretability Analysis . . . . 96
  4.8 Conclusion . . . . 99
Appendix A: Video Essay Recording Page . . . . 101
Appendix B: Additional Interpretability Analysis . . . . 102
Bibliography . . . . 104

List of Tables

2.1 The t-test comparison of audio features on LIRIS-ACCEDE between high emotional and low emotional videos. . . . . 9
2.2 The t-test comparison of visual features on LIRIS-ACCEDE between high emotional and low emotional videos. . . . . 10
2.3 Performance comparison of our model with baselines for arousal prediction. . . . . 23
2.4 Performance comparison of our model with baselines for valence prediction. . . . . 24
2.5 Comparison of batch vs. layer normalization. . . . . 28
3.1 Summary Statistics for First Impressions . . . . 50
3.2 Performance comparison of our model with baselines for Big-Five personality predictions. . . . . 55
3.3 Cost Analysis . . . . 56
3.4 Comparison of Positional Encoding vs No Positional Encoding . . . . 58
3.5 Summary Statistics for the Case Study of MBA Admission . . . . 63
3.6 Personality Trait Correlations . . . . 63
3.7 Estimation Results for MBA Admission . . . . 64
3.8 Factor Analysis . . . . 66
3.9 Estimation Results for MBA Admission Using Extracted Factor . . . . 67
4.1 Hyperparameters of the proposed model. . . . . 88
4.2 Performance comparison of our model with baselines. . . . . 91
4.3 Ablation study of the proposed model. . . . . 93

List of Figures

2.1 An illustration example of images eliciting different emotions. . . . . 8
2.2 An illustration example of various emotions elicited by different audio waveforms. . . . . 9
2.3 Correlation heatmaps between audio and visual features for two emotional states and two emotional intensities. . . . . 11
2.4 An overview of our proposed model: Sec2Sec Co-attention Transformer. . . . . 17
2.5 The co-attention block. . . . . 19
2.6 The Sec2Sec Structure. . . . . 22
2.7 The LSTM Attention. . . . . 27
2.8 Effect of key hyperparameters of Sec2Sec SA-CA on accuracy and F1 score. . . . . 27
3.1 Images sampled from videos in the First Impressions dataset that exhibit varying degrees of personality traits. . . . . 31
3.2 Audio waveforms in the First Impressions dataset that exhibit varying degrees of personality traits. . . . . 32
3.3 An overview of our proposed model: Multimodal Co-attention Transformer. . . . . 44
3.4 Average modality importance on personality prediction. . . . . 59
3.5 Average contributions of images of different positions on personality prediction. . . . . 60
3.6 Region contributions of images on personality predictions. . . . . 60
3.7 Parallel Analysis Scree Plots. . . . . 66
4.1 An illustration of a TikTok post. . . . . 73
4.2 An overview of our proposed model: A Generalized Multimodal Co-attention Transformer. . . . . 79
4.3 Spearman's Rank Correlations regarding the number of sampled frames. . . . . 96
4.4 The average contributions of each modality. . . . . 98
A.1 A screenshot of the recording page for the video essay. . . . . 101
B.1 The contributions of each modality for Head 1. . . . . 102
B.2 The contributions of each modality for Head 2. . . . . 103
B.3 The contributions of each modality for Head 3. . . . . 103
B.4 The contributions of each modality for Head 4. . . . . 103

Chapter 1: Introduction

In the past decade, video content has emerged as an integral component of people's daily routines. As of 2023, individuals are allocating an average of 17 hours per week to the consumption of online video content [78]. Specifically, in the United States, it is anticipated that by 2024, there will be 164.6 million internet users engaging with video content, as per data from Statista [18]. This surge in video content consumption has led to a transformative impact on social media platforms. A majority of these platforms, including TikTok, Instagram, and Facebook, now offer video-sharing services. To illustrate, TikTok reported 1.04 billion monthly active users in May 2024 [92], with an annual expenditure of $3.84 billion from consumers [19]. Given this context, video marketing presents immense business potential. In fact, 91% of businesses are leveraging video as a marketing tool [106].

The ubiquity and exponential growth of video marketing have sparked a surge of interest among scholars and industry professionals alike. They are keen to comprehend the multifaceted dimensions of video content, including auditory intensity (loudness) [40], aesthetic considerations (color choice) [56], and verbal (topic) and non-verbal (facial expression) communication cues [55]. While these elements provide valuable insights, they are relatively straightforward to extract and interpret.
They represent only the surface-level understanding of video content, leaving a vast array of deeper, more complex aspects unexplored. These uncharted territories hold the potential to unlock a more profound understanding of video content, thereby paving the way for a myriad of downstream analyses and studies.

In the sphere of artificial intelligence, the past few years have witnessed remarkable strides in the field of deep learning, with a particular emphasis on multimodal learning. This progress has been fueled by groundbreaking advancements in several sub-domains, including computer vision, natural language processing, voice recognition, and network learning. As a result, multimodal learning has demonstrated exceptional performance across a wide array of applications. One such application is video analytics, where videos typically comprise visual and auditory components. This makes video analytics a natural and fitting application for multimodal learning.

Despite the immense potential of multimodal learning in this domain, it is not without its challenges. The application of existing methods may not always yield optimal performance in a given context due to several reasons. Firstly, capturing the interaction and alignment between the auditory and visual components of a video is a complex task. I posit that this aspect is integral to the performance of the model. Secondly, temporal information embedded within the video is of critical importance. The ability to accurately utilize this information can significantly enhance the model's performance. Thirdly, the fusion of auditory and visual data with other types of information, such as text and graph networks, is a crucial consideration. The integration of these diverse data types can provide a more holistic understanding of the video content. Lastly, the current practice of employing deep learning models in video understanding is predominantly a black-box approach.
This lack of interpretability impedes the broader adoption of these powerful models and fails to provide actionable insights and managerial implications.

In light of this, the primary objective of this dissertation is to delve deeper into the realm of video understanding. To achieve this, I propose the development and application of advanced deep learning methodologies. These methods are designed to penetrate beyond the superficial layers of video content, enabling a more comprehensive and nuanced understanding. By harnessing the power of multimodal learning techniques, I can uncover hidden patterns and alignments within the video content. This, in turn, can facilitate a wide range of applications, from enhancing the effectiveness of video marketing strategies to improving the accuracy of video-based predictive models. Ultimately, this research aims to contribute significantly to the field of video content analysis, setting new benchmarks for future studies in this domain.

Study 1: Sec2Sec Co-attention Transformer for Video Emotion Prediction

Video-based apparent emotion detection plays a crucial role in video understanding, as videos encompass various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, I propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent emotion in videos. Specifically, I divide each video into one-second clips and utilize pre-trained ResNet networks to extract audio and visual features at the clip level. I develop a co-attention Transformer to effectively capture the interactions between the extracted audio and visual features and leverage an LSTM network to learn the spatiotemporal information present in the video.
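At the heart of this design is cross-modal attention: each modality's per-second features attend over the other modality's sequence. The sketch below is a minimal NumPy illustration of that idea only, not the dissertation's implementation; it omits the learned projections, multiple heads, and the LSTM that the full Sec2Sec model uses, and all feature values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(audio, visual):
    """One co-attention exchange between two modalities.

    audio:  (T, d) per-second audio features
    visual: (T, d) per-second visual features
    Each modality serves as queries against the other modality's
    keys/values, so every one-second audio feature is re-expressed as a
    weighted mix of visual features, and vice versa.
    """
    d = audio.shape[1]
    a2v = softmax(audio @ visual.T / np.sqrt(d)) @ visual
    v2a = softmax(visual @ audio.T / np.sqrt(d)) @ audio
    return a2v, v2a

T, d = 10, 16  # e.g., a 10-second video with toy 16-dim clip features
audio = rng.normal(size=(T, d))
visual = rng.normal(size=(T, d))
audio_ctx, visual_ctx = co_attention(audio, visual)
print(audio_ctx.shape, visual_ctx.shape)  # (10, 16) (10, 16)
```

In the full model, the fused per-second outputs would then be passed through an LSTM to capture the temporal ordering of the clips.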
I demonstrate that the proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent emotion on a widely used dataset: LIRIS-ACCEDE. Additionally, I perform comprehensive data analysis to investigate the relationships between different dimensions of visual and audio components and their impact on video predictions. Notably, my model offers interpretability, allowing me to examine the contributions of different time points to the overall prediction.

Study 2: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding

Video has emerged as a pervasive medium for communication, entertainment, and information sharing. With the consumption of video content continuing to increase rapidly, understanding the impact of visual narratives on personality has become a crucial area of research. While text-based personality understanding has been extensively studied in the literature, video-based personality prediction remains relatively under-explored. Existing approaches to video-based personality prediction can be broadly categorized into two directions: learning a joint representation of audio and visual information using fully-connected feed-forward networks, and separating a video into its individual modalities (text, image, and audio), training each modality independently, and then ensembling the results for subsequent personality prediction. However, both approaches have notable limitations: ignoring complex interactions between visual and audio components, or considering all three modalities but not in a joint manner. Furthermore, all these methods incur high computational costs, as they require high-resolution images for training. In this chapter, I propose a novel Multimodal Co-attention Transformer neural network for video-based personality prediction.
My approach simultaneously models audio, visual, and text representations, as well as their inter-relations, to achieve accurate and efficient predictions. I demonstrate the effectiveness of my method via extensive experiments on a real-world dataset: First Impressions. My results show that the proposed model outperforms state-of-the-art approaches while maintaining high computational efficiency. In addition to my performance evaluation, I also perform a set of comprehensive interpretability analyses to investigate the contribution across different levels. My findings reveal valuable insights into personality predictions. In addition, I showcase the utility of video-based personality detection in predicting MBA admission outcomes as a decision support system, highlighting its potential significance for both researchers and practitioners.

Study 3: Network-enhanced Multimodal Co-attention Learning for Short-Form Video Popularity Prediction

The recent surge in the popularity of short-form videos has unveiled considerable opportunities for business applications, encompassing personalized recommendations and targeted advertising. Predominantly, traditional research employs acoustic-visual information for making predictions about video popularity. However, a unique feature of these platforms is the music network, which provides an abundance of information on the distribution and sharing of various trending songs. This could potentially influence video popularity, a factor often overlooked in existing literature. In this chapter, I introduce a novel generalized multimodal learning model, termed VAN, which is adept at learning a unified representation of visual, acoustic, and network cues. Initially, I employ cutting-edge encoders to model each modality. To enhance the efficiency of the training process, I design a pre-training strategy specifically tailored to extract information from the music network.
As a final step, I put forward a generalized Co-attention Transformer network. This network is designed to fuse the three distinct types of information and to learn the inter-relationships that exist among the three modalities, a crucial aspect of multimodal learning. To evaluate the effectiveness of the proposed model, I have collected a real-world dataset from TikTok, consisting of over 88,000 videos. My comprehensive experiments demonstrate that my model outperforms existing state-of-the-art models in predicting video popularity. Furthermore, I have conducted a series of ablation studies to gain a deeper understanding of the behavior of my model. I additionally conduct interpretability analysis to examine the contributions of each modality to the model performance by leveraging the distinctive property of the proposed co-attention structure. This research contributes to the field by offering a more comprehensive approach to predicting video popularity on short-form video platforms.

Collectively, the proposed methods and the findings of the three studies in this dissertation provide us with a more profound comprehension of videos from multiple perspectives. This enhanced understanding paves the way for a plethora of downstream research opportunities, thereby expanding the horizons of knowledge in video analytics. The insights gained from these studies provide valuable guidance for both researchers and practitioners. For researchers, these insights can inform the design of future studies, helping to refine research questions, hypotheses, and methodologies. For practitioners, particularly those in the realm of video marketing and analytics, these insights can inform strategic decision-making, helping to optimize the effectiveness of video content and drive business outcomes.

Chapter 2: Sec2Sec Co-attention Transformer for Video Emotion Prediction

2.1 Introduction

Emotions are generally described as mental states brought about by neurophysiological changes.
They can be associated with thoughts, feelings, behavioral responses, and a degree of pleasure or displeasure [104], which can accordingly affect our decision-making and eventually shape how we perceive the world ubiquitously. As cognitive processes can be profoundly influenced by emotions [77], people primarily rely on emotional levels when making their judgments [11]. Several dimensions that could be linked to emotion-related responses have been identified [73]. Among these, two major ones have been widely explored in the literature: the pleasure-displeasure and arousal-sleep dimensions. Specifically, the former indicates the degree of positivity or negativity of the experience, also known as valence, while the latter assesses the level of energy or fatigue that an experience produces, also known as arousal.

There has been a growing interest in understanding human emotions among both researchers and practitioners. Emotion detection has been the main focus of existing literature, which is also the scope of our study. Various methods have been proposed to detect emotions, especially for text documents. For example, Kratzwald et al. developed a transfer learning-based model (called sent2affect) for emotion recognition [46]. Su et al. designed a long short-term memory (LSTM) network to predict emotions based on the combination of semantic and emotional words [81].

Figure 2.1: An illustration example of images eliciting different emotions. (a) High Arousal (b) Low Arousal (c) High Valence (d) Low Valence
Note: A high-arousal emotion is evoked by a violet background and an exciting concert in Figure 2.1a. A combination of light colors and a calm and peaceful natural environment in Figure 2.1b conveys a low-arousal emotion. A warm background along with happy individuals in Figure 2.1c induces a high-valence emotion. A dark background and a sad person in Figure 2.1d provoke a low-valence emotion.
However, video-based emotion detection remains under-explored, even though videos have been largely generated and posted on various platforms. Prior studies found that videos are more efficient and effective to elicit emotions compared to text [46]. A video often consists of two components: vision and audio. Both can stimulate emotions in their own ways. Figure 2.1 shows different images can elicit different emotions, as suggested by [43]. On the other hand, [103] has shown that the design of electronic music and audio- visual media can elicit audience emotions. In particular, non-diegetic music is a key feature that elicits states of emotions. Audio is usually represented by a continuous waveform. Audio waveform measures how sound pressure varies over time. Figure 2.2 shows different patterns of audio waveforms that can convey different states of emotions. Furthermore, we examine a set of audio and visual features that exhibit significant differences between videos with high and low emotional intensities by performing t-test analyses on the LIRIS-ACCEDE dataset [10]. Table 2.1 and Table 2.2 are the t-test comparisons of audio and visual features across videos with high and low emotional intensities. From the tables, we can see that many visual and audio features do play a role in distinguishing video emotions. 8 Figure 2.2: An illustration example of various emotions elicited by different audio waveforms. (a) High Arousal (b) Low Arousal (c) High Valence (d) Low Valence Note: A high-arousal emotion is provoked by a spike with high sound pressure in Figure 2.3a. A low-arousal emotion is conveyed by low sound pressure over time in Figure 2.3b. Few spikes with small sound pressure in Figure 2.3c elicit a high-valence emotion. A low-valence emotion is evoked by multiple spikes with high sound pressure in Figure 2.3d. Table 2.1: The t-test comparison of audio features on LIRIS-ACCEDE between high emotional and low emotional videos. 
Feature | Emotion | p-value (significant?)
Pitch | valence | 1.84e-15 (✓)
Pitch | arousal | 4.31e-05 (✓)
AL | valence | 0.61 (✗)
AL | arousal | 5.74e-53 (✓)
Spectral rolloff | valence | 3.24e-14 (✓)
Spectral rolloff | arousal | 0.01 (✓)
ZCR | valence | 7.36e-09 (✓)
ZCR | arousal | 0.024 (✓)

Note: Pitch is the average pitch of an audio sample. AL is the average loudness. Spectral roll-off is the frequency below which a certain percentage of the spectral energy (e.g., 85%) is contained. Zero-crossing rate (ZCR) is the rate at which a waveform crosses zero, representing audio smoothness.

Vision and audio can also jointly affect emotions. Specifically, if a visual component is well-aligned with its counterpart audio in the same time frame (e.g., within a second), it can create a synergy that stimulates emotions more intensely. A good alignment could be a high degree of consistency or correspondence between the two components. Most importantly, eliciting different emotional states requires different combinations of audio and visual features. For example, a cold picture with a low audio tempo is likely to stimulate a negative valence, while low arousal is conveyed by a piece of light music and a peaceful environment.

Table 2.2: The t-test comparison of visual features on LIRIS-ACCEDE between high emotional and low emotional videos.

Feature | Emotion | p-value (significant?)
saturation | valence | 1.40e-19 (✓)
saturation | arousal | 1.04e-05 (✓)
brightness | valence | 1.38e-39 (✓)
brightness | arousal | 4.36e-10 (✓)
contrast | valence | 4.44e-13 (✓)
contrast | arousal | 0.76 (✗)
clarity | valence | 9.38e-17 (✓)
clarity | arousal | 0.005 (✓)
warm hue | valence | 2.11e-55 (✓)
warm hue | arousal | 4.17e-16 (✓)

Note: Saturation is the average saturation across pixels. Brightness is the average intensity across pixels. Contrast is the standard deviation of intensity across pixels. Clarity is the proportion of pixels with intensity above a certain threshold (e.g., 0.7). Warm hue is the proportion of warm colors in a frame.
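The feature-level t-tests reported above can be reproduced in outline as follows. This is a minimal SciPy sketch: the synthetic feature values are illustrative stand-ins, not the actual LIRIS-ACCEDE features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative stand-ins for one per-video feature (e.g., average brightness),
# grouped by high vs. low emotional intensity; real values would come from
# feature extraction on LIRIS-ACCEDE, not a random generator.
high_group = rng.normal(loc=0.60, scale=0.10, size=500)
low_group = rng.normal(loc=0.45, scale=0.10, size=500)

# Two-sample t-test: does the feature differ significantly between groups?
t_stat, p_value = stats.ttest_ind(high_group, low_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant: {p_value < 0.05}")
```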
For the sake of illustration, the correlation heatmaps between audio and visual features for various emotional states and intensities are plotted in Figure 2.3 using LIRIS-ACCEDE [10], which confirms that the integration of audio and visual signals induces various emotional states and intensities. The correlations are computed by (1) splitting each video into n one-second clips; (2) extracting audio and visual features from each one-second clip; (3) calculating the correlations between those features for each video; and (4) averaging each correlation pair across videos.

Figure 2.3: Correlation heatmaps between audio and visual features for two emotional states and two emotional intensities: (a) High Arousal, (b) Low Arousal, (c) High Valence, (d) Low Valence.

Another important factor that might affect how people perceive a video emotionally is the sequential composition of the video. The temporal pattern exhibited in a video should be captured for emotion prediction. For example, people are more likely to recall the most recently presented information [61], which indicates that later audio clips may be weighted higher when estimating emotions. Despite the popularity of videos and the practical importance of detecting their emotions, very limited research has been conducted to quantitatively estimate how videos induce emotions. In this paper, we are among the pioneers to fill this research gap by developing a Transformer-based second-to-second (Sec2Sec) co-attention model to predict the perceived emotional states of videos. Specifically, we first implement a Transformer-based co-attention network, extended from the work proposed by [22], to understand the interaction between audio and visual components. We further combine an LSTM module with this co-attention network to capture the temporal information of videos at the second level. To do so, we first split each video into one-second video clips.
We then feed each one-second clip into our designed co-attention network. The output of each video clip from the co-attention network is fed into an LSTM network sequentially. Lastly, we add a fully-connected network to predict emotions. To evaluate our work, we conduct experiments on a real-world dataset from LIRIS-ACCEDE with 9,800 videos [10]. The experimental results show that our model outperforms several cutting-edge baselines in terms of F1-score for both arousal and valence.

2.2 Related Work

Our work is closely related to three streams of literature: emotion, audio-video representation learning, and applications of Transformers.

2.2.1 Emotion and Its Impact

Emotion, a multifaceted psychological phenomenon, lacks a universally acknowledged definition. However, it is frequently characterized as a mental condition that incorporates cognitive processes, affective states, physiological alterations, and behavioral reactions, all marked by varying intensities of pleasure or displeasure [65]. The genesis of emotions can be attributed to multiple mechanisms, including bottom-up processes initiated by external stimuli, top-down processes that involve the cognitive evaluation and interpretation of events based on accumulated experience and knowledge, or a combination of both [59]. Numerous models have been put forth to delineate the dimensions of emotion. Ekman's theory of basic emotions [32] advocates the existence of six universal emotions. Another model, proposed by Cordaro et al. [25], extends Ekman's basic emotion theory to identify 22 distinct emotions. These prevalent models primarily concentrate on discerning the emotions experienced by the individual. However, in the realm of entertainment, such as social media, the focus shifts towards the emotions experienced by the audience.
In this paper, we employ the widely recognized circumplex model of affect [74] to scrutinize the emotions elicited in the audience in response to stimuli. The circumplex model characterizes emotions along two dimensions: valence and arousal. Valence signifies the sentiment associated with an experience, spanning from pleasant to unpleasant. Arousal denotes the degree of intensity or activation associated with the experience, ranging from low to high. Together, valence and arousal encapsulate the pleasantness and intensity of video experiences from the audience's viewpoint. A considerable volume of research examines consumer emotion and its impact on consumer behavior. It is well-documented that consumers are susceptible to the influence of others' emotional expressions [38]. For instance, research has demonstrated that positive facial expressions in fundraising advertisements can sway funding decisions in a beneficial direction [69]. Furthermore, emotions encapsulated in online product reviews can markedly affect the perceived utility of the information [110]. Elements of the broader context, such as culture, can also mold how consumers react to others' emotions: consumers of European cultural descent respond more robustly to excited expressions, while consumers of Chinese descent exhibit a stronger response to calm emotional expressions [66]. A majority of these studies explore emotions in text-based or visual mediums such as online reviews, social media, or facial expressions [13, 69, 80]. However, despite the richness of video content, there is limited research investigating the impact of perceived emotion, primarily due to the absence of a predictive model capable of forecasting perceived video emotion. Consequently, our study endeavors to bridge this gap.

2.2.2 Audio-Video Representation Learning

Audio Representation Learning.
Traditionally, research often adopts hand-crafted audio feature extraction techniques, such as Mel-Frequency Cepstral Coefficients (MFCCs) [27]. MFCC extraction is an audio processing technique that models how human ears sense and resolve sound frequencies [71]. Recently, with the development of deep learning, researchers have explored several audio autoencoder techniques. Tagliasacchi et al. applied a convolutional deep belief network on music and speech data to solve different classification tasks [89]. Cartwright et al. designed a network, including an audio sub-network and a temporal network, to predict long-term and cyclic temporal structure using self-supervision [17]. Chung et al. explored a sequence-to-sequence autoencoder by incorporating RNN and LSTM together [23].

Audio-Visual Cross-Modal Learning. Since audio and visual events in videos tend to co-occur, videos provide a natural bridge between the two modalities. Therefore, the mainstream of audio-visual representation learning research is to predict the synchronization or correspondence of audio and visual streams in videos. Arandjelovic and Zisserman trained an audio-visual cross-modal network from scratch to predict video correspondence [8]. Alwassel et al. used one clustered modality as a supervisory signal for another modality and predicted correspondence between the two modalities [7]. Cheng et al. further developed three self-supervised co-attention-based networks to discriminate visual events related to audio events [22]. In addition, Kuhnke et al. proposed a two-stream aural-visual model (AVM) to predict facial expressions in videos [47].

2.2.3 Transformer and Its Applications

Transformers in Natural Language Processing. The Transformer was first introduced for machine translation [97] and has been a state-of-the-art natural language processing (NLP) architecture ever since. A variety of Transformer-based models have been developed to address NLP tasks, mainly along two streams.
One follows the trend of pre-training Transformer-based models on large corpora and fine-tuning parameters on downstream NLP tasks. BERT is a pioneer that employs a multi-layer bi-directional Transformer architecture [29]. However, BERT-based models can only handle 512 tokens, which is not enough for long text documents. Hence, Longformer extends BERT by utilizing sliding-window, dilated sliding-window, and global attention to handle long text documents [12]. Unlike BERT-based models, another stream of research focuses on language modeling as a pre-training task, such as GPT [70]. GPT was developed for text generation tasks such as question answering and text summarization, and has achieved great performance on downstream tasks in zero-shot or few-shot settings.

Transformers in Computer Vision. Recently, there has been increasing attention on applying Transformers to computer vision (CV) tasks as an alternative to convolutional neural networks (CNNs), and many studies have achieved great success. ViT applies a Transformer to linearly projected sequences of image patches to classify full images [31]. Swin Transformer improves on ViT by introducing a hierarchical Transformer architecture and a shifted-window scheme [53]. These are two representative Transformer-based models for image classification. For video classification, ViViT extends ViT by proposing two methods for embedding video samples, uniform frame sampling and tubelet embedding, along with four Transformer-based model variants: spatiotemporal attention, factorized encoder-decoder, factorized self-attention, and factorized dot-product attention [9]. Video Swin Transformer further extends Swin Transformer by introducing a 3D shifted-window-based multi-head self-attention module and a locality inductive bias in the self-attention module [54].
All these video-based analyses do not separate vision and audio to explicitly learn their joint effect on subsequent tasks, which is our focus in this study.

2.3 Preliminaries

In this section, we first briefly explain how multi-head self-attention works. It is the key component in the Transformer that maps a query matrix Q, a key matrix K and a value matrix V to an output (embedding), as shown in Equation 2.1.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{2.1}$$

Specifically, self-attention computes the dot product of Q and K divided by the square root of the dimension of Q, denoted by d, which gives the similarity scores between Q and K; this is also known as the scaled dot product. The scores are then translated into probabilities by applying a softmax function. Lastly, the probabilities are multiplied by V to get the final output for the next layer. The basic idea of the self-attention mechanism is to focus more, in the following layers, on the vectors in V with high probabilities. However, a single self-attention layer limits the model's ability to attend to multiple positions without compromising others. To mitigate this limitation, the Transformer introduces a multi-head attention mechanism, which can increase overall model performance. Specifically, the multi-head attention layer consists of h parallel self-attention sub-layers, called "heads". Each head learns different query, key and value matrices, projecting the input features into different sub-spaces. The output features from all heads are concatenated into one matrix for the following layers, as shown in Equation 2.2.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_h)W^{O}, \quad \text{where } h_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2.2}$$

where $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_{model}}$ are learnable weights for each head $i$, and $W^{O} \in \mathbb{R}^{h d_{model} \times d_{model}}$ is the projection weight. A fully-connected feed-forward network is applied after the self-attention layer.
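The two equations above can be sketched in code as follows. This is a minimal NumPy illustration of scaled dot-product and multi-head attention, not the authors' implementation; all dimensions and weights are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 2.1)."""
    d = Q.shape[-1]
    probs = softmax(Q @ K.T / np.sqrt(d))  # similarity scores -> probabilities
    return probs @ V

def multi_head(Q, K, V, per_head, W_o):
    """Multi-head attention (Equation 2.2): per-head projections,
    concatenation, then the output projection W_o."""
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for (Wq, Wk, Wv) in per_head]
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d, h, d_model = 4, 8, 2, 8                 # illustrative sizes
Q = K = V = rng.normal(size=(n, d))           # self-attention: Q = K = V
per_head = [tuple(rng.normal(size=(d, d_model)) for _ in range(3))
            for _ in range(h)]
W_o = rng.normal(size=(h * d_model, d_model))
out = multi_head(Q, K, V, per_head, W_o)
print(out.shape)  # (4, 8)
```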
In addition, a residual connection and a normalization function are applied to each sub-layer.

2.4 Method: Sec2Sec Co-attention Transformer

This section presents our proposed model, which considers visual and audio representations, their interactions, and the temporal information of videos. As depicted in Figure 2.4, the proposed model consists of five components:

Video segmentation: We first split each video into n video segments; each segment consists of a one-second visual component and a one-second audio component.

Encoder network: The encoder network comprises a visual encoder and an audio encoder that extract visual and audio features using pre-trained ResNet networks [39].

Co-attention block: The co-attention block leverages the Transformer [97] to model the interactions between visual and audio features, shown in Figure 2.5.

Sec2Sec structure: It captures the temporal information via an LSTM network, illustrated in Figure 2.6.

Predictor: The output from the LSTM network is fed into a fully-connected feed-forward network to make emotion predictions.

Figure 2.4: An overview of our proposed model: Sec2Sec Co-attention Transformer.

Visual encoder. To extract visual features, we first sample m frames per segment. Each frame is represented as a color image with Red-Green-Blue (RGB) channels. Like prior studies, we pre-process the images by resizing to 80x80, center cropping to 64x64, and normalizing with a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). Thus, each visual part is represented in a 4-dimensional space (i.e., 3 RGB channels plus m frames), which is fed into a pre-trained R(2+1)D ResNet model. R(2+1)D ResNet [93] is an extension of ResNet that utilizes 3D convolution and 3D pooling to learn the temporal features of videos.

Audio Encoder.
For the audio segment, we first compute Mel-Frequency Cepstral Coefficients (MFCCs) [27], their first-order frame-to-frame time derivatives (delta coefficients), and their second-order derivatives (delta-delta coefficients) from each audio clip. MFCCs are coefficients that model how human ears sense and resolve sound frequencies [71]. The delta coefficients capture speech-rate information, and the delta-delta coefficients measure the acceleration of speech; jointly, they capture the temporal information of an audio signal [71]. Each of the three coefficient types is 2-dimensional, so the audio feature can be represented as a three-channel MFCC feature map in which each channel is one type of coefficient. The extracted three-channel MFCC features are fed into a pre-trained ResNet [39]. ResNet introduces identity shortcut connections to solve the vanishing gradient problem and outperforms other CNN models on popular image classification tasks. In our work, the 3-channel MFCC audio features are treated as a special type of "image"; hence, we use a pre-trained 18-layer ResNet to obtain the audio features.

Co-attention block. As illustrated in Figure 2.5, the extracted visual and audio features for each segment enter two symmetrical co-attention sub-blocks, a visual sub-block and an audio sub-block, to learn guided audio and visual representations. Each sub-block combines a standard multi-head self-attention module with a multi-head co-attention module. A normalization layer (Norm) and a residual connection are applied after each attention module, followed by a fully-connected feed-forward network (FC).

Figure 2.5: The co-attention block.

In the visual sub-block, the extracted visual embedding from the visual encoder is first fed into a multi-head self-attention module to get the intermediate visual representation, $I_v$, embedding important visual information.
Similarly, we can get the intermediate audio representation, $I_a$, in the audio sub-block. Specifically, $I_v^i$ and $I_a^i$ for segment $i$ are computed as follows:

$$I_v^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(z_v^i, z_v^i, z_v^i)) + z_v^i), \quad I_a^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(z_a^i, z_a^i, z_a^i)) + z_a^i) \tag{2.3}$$

where $z_v^i$ and $z_a^i$ denote the output features from the visual encoder and the audio encoder for segment $i$, respectively. Next, in the visual sub-block, $I_a^i$ (as key and value) and $I_v^i$ (as query) are passed into the multi-head co-attention module. In this way, we enforce the visual sub-block to focus on the information related to audio. Similarly, in the audio sub-block, we feed $I_v^i$ as key and value and $I_a^i$ as query into the second multi-head attention layer. Hence, the final output features of vision and audio, $F_v^i$ and $F_a^i$, are computed as:

$$F_v^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(I_v^i, I_a^i, I_a^i)) + I_v^i), \quad F_a^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(I_a^i, I_v^i, I_v^i)) + I_a^i) \tag{2.4}$$

Thus, the audio sub-block tends to focus on the information corresponding to vision. Consequently, the two sub-blocks learn important information about their own modality as well as their relationship; using such a mechanism, we capture the interaction between the visual and audio components. Finally, we combine the guided visual representation and the guided audio representation by applying an FC layer:

$$F_i = \mathrm{FC}(\mathrm{concat}(F_v^i, F_a^i)) \tag{2.5}$$

Therefore, the final output of the co-attention block is the joint representation of vision and audio for each segment $i$.

Sec2Sec Structure. To capture the temporal information in the video clip sequence, we feed the joint representation of each segment, $F_i$, from the co-attention block into an LSTM network, illustrated in Figure 2.6. The LSTM network is defined as follows:

$$\begin{aligned}
u_i &= \sigma(W_{Fu}F_i + W_{hu}h_{i-1} + b_u) \\
f_i &= \sigma(W_{Ff}F_i + W_{hf}h_{i-1} + b_f) \\
o_i &= \sigma(W_{Fo}F_i + W_{ho}h_{i-1} + b_o) \\
\tilde{c}_i &= \tanh(W_{Fc}F_i + W_{hc}h_{i-1} + b_c) \\
c_i &= f_i \odot c_{i-1} + u_i \odot \tilde{c}_i \\
h_i &= o_i \odot \tanh(c_i)
\end{aligned} \tag{2.6}$$

where $\sigma(\cdot)$ is an activation function and $\odot$ denotes the Hadamard product.
$W$ and $b$ are weights and biases learned during training. $h_i$ denotes the hidden state at step $i$; $u_i$, $f_i$, $o_i$ and $c_i$ denote the update gate, forget gate, output gate and cell state, respectively.

Predictor. We apply an FC layer along with a sigmoid function to the output of the LSTM network at the last step to make emotion predictions.

2.5 Experiments

2.5.1 Dataset

We use a large-scale publicly available dataset, LIRIS-ACCEDE [10], to evaluate the effectiveness of our proposed model. The dataset contains 9,800 videos extracted from 160 films, most of which come from the popular video-sharing platform VODO. The films are mainly in English, with a small set in 9 other languages subtitled in English; there are also 14 silent movies. The films cover 9 main categories, including action, comedy, drama, etc. The video clips last between 8 and 15 seconds. To annotate each video, the researchers recruited 1,517 annotators from 89 countries to minimize cultural impact on video emotions. After watching each video, each annotator rates valence and arousal on a scale from 1 to 9. The annotation of each video is the average of all individual annotations. In this paper, we adopt a binary classification approach based on the existing literature [1], where the threshold separating high from low arousal (or valence) is 5. The dataset is split 70-10-20% for training, validation and testing, respectively. For evaluation, we adopt two standard metrics for emotion classification tasks (valence and arousal): accuracy and F1 score.

Figure 2.6: The Sec2Sec Structure.

2.5.2 Implementation Details

We train all models on four NVIDIA GeForce 3090 24GB GPUs for 250 epochs. Our model is trained to minimize the binary cross-entropy loss with the Adam optimizer [45]. We set up an early stopping mechanism, where training stops if the validation loss increases for 5 consecutive epochs.
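The early-stopping rule just described can be sketched as follows; this is an illustrative helper (names are my own), not the authors' training code.

```python
class EarlyStopping:
    """Stop training when validation loss increases for `patience`
    consecutive epochs (patience=5, as in the text)."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.prev_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss > self.prev_loss:
            self.bad_epochs += 1   # loss went up again
        else:
            self.bad_epochs = 0    # loss improved (or held), reset the count
        self.prev_loss = val_loss
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=5)
losses = [0.9, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]
stop_epoch = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stop_epoch)  # 7: the loss rose at epochs 3-7, five epochs in a row
```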
We use a grid search strategy to find a relatively optimal set of hyperparameters. Specifically, the learning rate, batch size, and number of heads are searched within [1e-8, 1e-5], {8, 16, 32, 64, 128}, and {8, 16, 32, 64, 128}, respectively. In each experiment, we use the model with the best validation accuracy to report results on the holdout testing set. For the purpose of reproducibility, our implementation is publicly available at https://github.com/nestor-sun/sec2sec.

Table 2.3: Performance comparison of our model with baselines for arousal prediction.

Method | Modality | Accuracy | F1 Score | Avg Training Time Per Epoch (min)
Baselines
ViT [31] | Audio | 0.7823 | 0.8768 | 1:55
ViViT [9] | Vision | 0.7853 | 0.8795 | 1:32
CMA [22] | Audio and Vision | 0.5680 | 0.6768 | 4:50
AVM [47] | Audio and Vision | 0.7756 | 0.8722 | 4:49
ViT-ViViT | Audio and Vision | 0.7517 | 0.8541 | 3:31
Co-attention | Audio and Vision | 0.5599 | 0.6603 | 4:50
Variants
Sec2Sec Audio | Audio | 0.7832 | 0.8780 | 1:51
Sec2Sec Vision | Vision | 0.7766 | 0.8733 | 0:20
Sec2Sec SA-SA | Audio and Vision | 0.7990 | 0.8876 | 2:17
Sec2Sec SA-CA | Audio and Vision | 0.7949 | 0.8840 | 2:14

2.5.3 Baselines

We compare the performance of our proposed model (called Sec2Sec SA-CA) with several state-of-the-art methods.

Table 2.4: Performance comparison of our model with baselines for valence prediction.
Method | Modality | Accuracy | F1 Score | Avg Training Time Per Epoch (min)
Baselines
ViT [31] | Audio | 0.7022 | 0.8154 | 1:52
ViViT [9] | Vision | 0.7002 | 0.8234 | 1:29
CMA [22] | Audio and Vision | 0.6078 | 0.7033 | 6:05
AVM [47] | Audio and Vision | 0.7205 | 0.8287 | 4:49
ViT-ViViT | Audio and Vision | 0.69 | 0.8009 | 3:32
Co-attention | Audio and Vision | 0.5864 | 0.6688 | 4:50
Variants
Sec2Sec Audio | Audio | 0.7021 | 0.8191 | 1:50
Sec2Sec Vision | Vision | 0.6970 | 0.8179 | 0:20
Sec2Sec SA-SA | Audio and Vision | 0.7047 | 0.8179 | 2:14
Sec2Sec SA-CA | Audio and Vision | 0.7322 | 0.8372 | 2:15

CMA [22]: A cross-modal attention (CMA) Transformer network developed for audio-visual correspondence prediction. We train a CMA using both audio and vision.

AVM [47]: A bi-modal (audio and vision) deep network, consisting of R(2+1)D ResNet and ResNet networks, developed for emotion prediction. Specifically, a pre-trained R(2+1)D ResNet extracts visual features, and a ResNet extracts audio features. Lastly, a fully-connected feed-forward network fuses the two types of features for prediction.

ViT [31]: Vision Transformer (ViT) is a Transformer-based model for image classification that has demonstrated outstanding performance over convolutional neural networks. We fine-tune a pre-trained ViT using the audio component (treated as images), since ViT can only process individual images rather than a sequence of images.

ViViT [9]: Video Vision Transformer (ViViT) is a Transformer-based model designed for video classification that can capture spatio-temporal information. We train a ViViT using vision, since ViViT can process a sequence of images.

ViT-ViViT: We implement a bi-modal (audio and vision) network by combining ViT and ViViT to extract audio and visual features, respectively. The two types of features are concatenated and fed into a fully-connected feed-forward network for emotion prediction.
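The concatenate-then-FC fusion used by the ViT-ViViT baseline can be sketched as follows. This is a minimal NumPy stand-in: the real baseline uses ViT/ViViT embeddings and a trained network, whereas all sizes and weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def late_fusion_predict(audio_emb, visual_emb, W1, b1, W2, b2):
    """Concatenate the two modality embeddings and pass them through a
    small fully-connected network with a sigmoid output (binary emotion)."""
    x = np.concatenate([audio_emb, visual_emb])   # fused representation
    hidden = np.maximum(0.0, W1 @ x + b1)         # one ReLU layer
    logit = W2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))           # probability of "high"

d_a, d_v, d_h = 16, 16, 8                         # illustrative sizes
W1, b1 = rng.normal(size=(d_h, d_a + d_v)), np.zeros(d_h)
W2, b2 = rng.normal(size=d_h), 0.0
p = late_fusion_predict(rng.normal(size=d_a), rng.normal(size=d_v),
                        W1, b1, W2, b2)
print(0.0 < p < 1.0)  # True
```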
We also add a co-attention network as another baseline, together with 3 variants of our model, to understand the role of each design choice (e.g., uni-modal vs. bi-modal, co-attention).

Co-attention: It trains only a multi-head co-attention model without segmenting each video into one-second audio and visual clips.

Sec2Sec Audio: A multi-head attention model with 2 layers of self-attention that relies only on the audio component. Specifically, we first divide each audio track into n segments. Each segment goes through a pre-trained ResNet as an audio encoder. The output from the audio encoder for each segment is sent to two multi-head self-attention layers, and the output for each segment is then fed into an LSTM network. Finally, a fully-connected layer is added to predict emotions.

Sec2Sec Vision: A multi-head attention model with 2 layers of attention using only the visual component. Similar to Sec2Sec Audio, we first split each video into n visual segments. Each visual segment is fed into the visual encoder, and the corresponding output is sent to two multi-head self-attention layers. Lastly, an LSTM and a fully-connected layer are applied to predict emotions.

Sec2Sec SA-SA: A multi-head attention model with 2 attention layers using both audio and vision. Unlike the Sec2Sec SA-CA model, which uses one self-attention layer and one co-attention layer, this variant uses two self-attention layers to capture the intra-modal dependencies within segments.

2.5.4 Results

Overall performance. Tables 2.3 and 2.4 present the experimental results for arousal and valence, respectively. Our proposed Sec2Sec models achieve the best performance on both evaluation metrics for arousal prediction. They surpass three bi-modal (audio and vision) methods and the co-attention approach in terms of accuracy and efficiency, demonstrating the benefit of incorporating LSTM (Sec2Sec) into the video understanding framework.
They also outperform Sec2Sec Audio and Sec2Sec Vision, indicating that using both audio and visual components is more effective than using either modality alone. Moreover, Sec2Sec SA-SA and Sec2Sec SA-CA obtain comparable results, suggesting that the interaction between audio and visual features is not essential for predicting arousal. It is noteworthy that ViT-ViViT performs worse than ViT and ViViT, indicating that a single fully-connected layer fails to adequately capture the interaction between the audio embeddings and visual embeddings derived from ViT and ViViT. We have similar observations for valence prediction in terms of performance comparison with baselines.

Model Interpretability. We now turn to assessing the contribution of each video segment (i.e., every one-second clip) to emotion prediction. To do so, we modify the Sec2Sec structure by substituting the LSTM with an attention-based LSTM proposed by [100], adopting the same hyperparameters as the Sec2Sec Co-attention model. After training, we obtain the learned LSTM attention values and normalize them by applying a softmax function. The attention values of each video segment for valence and arousal are plotted in Figures 2.7a and 2.7b, respectively. Similar patterns are observed in both figures: the contribution to emotion prediction is highest for the last 3 seconds of a video, suggesting that perceived emotions are mostly influenced by the last 3 seconds. Moreover, the impact of video segments increases as they approach the end of a video.

Figure 2.7: The LSTM Attention: (a) Valence, (b) Arousal.

Figure 2.8: Effect of key hyperparameters of Sec2Sec SA-CA on accuracy and F1 score: (a) Accuracy regarding # of LSTM layers, (b) F1 regarding # of LSTM layers, (c) Accuracy regarding # of heads and batch size, (d) F1 regarding # of heads and batch size.
We hypothesize that when annotators rate each video, their decisions are dominated by its last 3 seconds, which aligns well with the theory of recency bias in psychology [61].

2.6 Ablation Study

To assess the impact of several key hyperparameters on model performance, we conduct additional experiments. Results are shown in Figure 2.8. Due to space limitations, we only report the results for arousal; valence shows similar patterns.

Impact of the number of heads and batch size. We examine model performance by varying the number of heads and the batch size simultaneously from 8 to 128. The model achieves the best accuracy when both the number of heads and the batch size are set to 8, while the best F1 score is achieved when both are set to 64 or 128.

Impact of the number of LSTM layers. We vary the number of LSTM layers from 1 to 5. Although the model with 4 LSTM layers achieves the best accuracy, the F1 score is not significantly different with 1, 4 or 5 LSTM layers. It is worth noting that more LSTM layers increase memory consumption if the batch size remains the same.

Layer vs. batch normalization. To examine the effect of normalization methods on perceived emotion recognition, we contrast layer normalization with batch normalization, which are commonly used in Transformers and computer vision models, respectively [42]. As Table 2.5 shows, layer normalization outperforms batch normalization for both arousal and valence predictions on accuracy and F1.

Table 2.5: Comparison of batch vs. layer normalization.

Method | Emotion | Accuracy | F1 score
layer | arousal | 0.7944 | 0.8840
batch | arousal | 0.7924 | 0.8832
layer | valence | 0.7271 | 0.8361
batch | valence | 0.7220 | 0.8322

2.7 Conclusion

In this study, we propose a novel Sec2Sec Co-attention Transformer model for perceived emotion classification, which leverages self-attention and co-attention mechanisms to encode and fuse multimodal features.
We have evaluated our model on the LIRIS-ACCEDE dataset and achieved better results compared with state-of-the-art baseline approaches. The results show the effectiveness of our Sec2Sec structure and the importance of inter-modal interaction for emotion prediction. We also introduced an attention-based LSTM mechanism to explore the contribution of each one-second clip of a video to the overall emotion prediction. Our work has several implications for multimodal emotion recognition research and applications. First, it demonstrates that Sec2Sec models can improve both performance and efficiency over traditional encoder-decoder models. Second, it reveals that co-attention can capture rich inter-modal relations that are essential for emotion prediction. Third, it provides a novel way to interpret model predictions by visualizing the attention weights over video segments. Future work can extend our model to other multimodal tasks such as video sentiment analysis and audio-visual alignment analysis.

Chapter 3: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding

3.1 Introduction

In the dynamic landscape of communication and media consumption, video content has emerged as a dominant and influential medium. For example, in 2023, video content made up 82% of Internet traffic [79]. Video content is essential not only at a macro level but also at a micro level. Many social media platforms, such as TikTok and Instagram, have started providing video-sharing services, which play a critical role in people's daily lives. For instance, more than 78% of viewers consume video content every week, and 55% of them engage every day [79]. In addition, 93% of companies acquire new customers via social media videos [105]. In light of the burgeoning prevalence of video content, it becomes imperative to comprehend the personalities embodied by the presenter or influencer featured in each video.
The elucidation of such personalities holds substantial potential for enhancing the efficacy of subsequent predictive analytics. The personality traits of presenters or influencers can serve as robust predictors, thereby contributing to more accurate forecasts in downstream predictive applications. Thus, a thorough understanding of these personalities is not just beneficial, but essential for leveraging the full potential of predictive analytics in the realm of video content.

Figure 3.1: Images sampled from videos in the First Impressions dataset that exhibit varying degrees of personality traits. Panels (a)-(e) show high levels of O, C, E, A, and N, respectively; panels (f)-(j) show the corresponding low levels. Note: These traits are represented by the acronym OCEAN, where O stands for Openness, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. High trait levels are often recognized by a friendly face against a bright background, while low trait levels are recognized by an unhappy expression against a dark background.

Personality plays a pivotal role in shaping human interactions, decision-making processes, and overall behavioral patterns [6]. For instance, product recommendations and the effectiveness of word-of-mouth are largely affected by personality in digital marketing [2]. In human resources, personality can help predict a candidate’s suitability for a specific job [49]. In addition, [62] found that a CEO’s personality plays an important role in driving a company’s strategic flexibility. In the context of Information Systems, [28] found a significant relationship between personality and technology acceptance and adoption. These studies emphasize the relationship between personality and downstream outcomes. Given its impact, the study of video-based personality detection holds tremendous potential across various disciplines, such as psychology, marketing, human-computer interaction, and social sciences.
By discerning the personalities projected through videos, researchers and practitioners can gain valuable insights into how individuals are perceived by others, the effectiveness of persuasive communication strategies, and the influence of personality on audience engagement [2, 49, 72].

Figure 3.2: Audio waveforms in the First Impressions dataset that exhibit varying degrees of personality traits. Panels (a)-(e) show high levels of O, C, E, A, and N, respectively; panels (f)-(j) show the corresponding low levels. Note: These traits are represented by the acronym OCEAN, where O stands for Openness, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. Individuals with high trait levels often have a high voice when speaking, whereas individuals with low trait levels have a low voice.

Moreover, as social media platforms proliferate and user-generated video content becomes increasingly prevalent, video-based personality detection becomes a valuable tool for understanding broader societal trends, cultural influences, and collective attitudes. The recent surge in interest in video-based apparent personality trait prediction has underscored several non-trivial challenges, primarily due to the unique characteristics inherent to the video-based personality setting. Firstly, a video typically comprises three types of information: visual, auditory, and textual. Each of these modalities may contain crucial information that could significantly enhance the accuracy of predictions. Research has demonstrated that personality traits are evident in appearance, expression, and voice [51, 63, 76]. For the purpose of illustration, Figures 3.1 and 3.2 depict distinct visual and acoustic patterns corresponding to different levels of the Big-Five personality traits. Secondly, [44] posits that capturing modality interactions is essential for making accurate predictions.
This is further corroborated by the McGurk effect [58], which underscores the importance of the interaction between auditory and visual modalities. We contend that capturing the interactions among all three modalities is crucial, as a good alignment can synergistically aid audiences in better inferring personalities, while a poor alignment can adversely affect the perception of personalities. Thirdly, existing approaches often necessitate high-resolution images, typically of dimension 224 × 224, which can be computationally expensive. Finally, deep learning models are often criticized for being ‘black boxes’ due to their lack of interpretability. However, in the context of video-based perceived personality settings, it is essential to provide interpretability that offers practical implications both for presenters and influencers and for researchers and platforms. This interpretability not only demystifies the underlying mechanisms but also facilitates more informed decision-making processes. Several models have been proposed. For example, [102] proposes a bi-modal network to process visual and audio information and predict personality in videos. A tri-modal network has also been proposed to predict video personality by taking in visual, audio, and text information [83]. However, most of these works train a different model for each modality independently and combine predictions using ensemble methods, such as taking an average of the predictions generated by different modalities. More importantly, all the models mentioned above require high-resolution images (e.g., 224 × 224) in order to perform well. Processing high-resolution images is computationally expensive. To address these limitations and improve prediction accuracy, in this paper, we propose a Multimodal Co-attention network based on the multi-head self-attention mechanism proposed in the Transformer [97].
Specifically, we develop a visual encoder extended from [31], along with a newly proposed hierarchical positional encoding mechanism, to efficiently extract visual features, and two linear regressors to extract audio and text features. We further develop a Multimodal Co-attention Transformer to efficiently understand the complex interactions among the visual, audio, and text components. To evaluate our work, we conduct experiments on a real-world dataset, First Impressions, with 10,000 videos to demonstrate the usefulness and value of the proposed model. The experimental results show that the proposed model not only outperforms seven state-of-the-art baselines but also reduces computational costs. Furthermore, we conduct a series of interpretability analyses to demonstrate the model’s decision process. Our analysis uncovers useful factors that can be used to predict personality traits at the modality, vision, and image levels, which can serve as a guideline for presenters, influencers, and platforms that seek to improve perceived personality. To conduct our interpretability analysis, we calculate the contributions of inputs by computing the Integrated Gradients proposed by [88] for each input. For the modality-level interpretability analysis, the contributions of the inputs are aggregated into three modalities: audio, vision, and text. Our results show that text information is less important than the other two modalities. The relative importance of audio and vision depends on the specific personality trait. For agreeableness, neuroticism, and openness, audio is more important than vision. For extraversion, vision is more important than audio. For conscientiousness, audio and vision are equally important. For the vision-level interpretability analysis, the contributions of visual inputs are aggregated by the time point of each input image, which enables us to investigate at which time point an image is more important for predicting personality traits.
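The Integrated Gradients attribution [88] underlying these analyses can be sketched with a minimal NumPy example. The toy model and its analytic gradient below are illustrative stand-ins, not the trained encoders; the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    # Riemann-sum approximation of
    # IG_i = (x_i - x'_i) * integral_0^1 df(x' + a(x - x'))/dx_i da
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable model: f(x) = sum(x^2), so grad f(x) = 2x.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
x, baseline = np.array([1.0, 2.0, 3.0]), np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline) = 14.
print(attr.sum())  # 14.0
```

Summing the resulting per-input attributions over a modality's inputs gives the modality-level contributions described above; grouping them by frame time point or image region gives the vision- and image-level analyses.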
For agreeableness, extraversion, neuroticism, and openness, the importance of images decreases over time, indicating that the first impression does matter when perceiving personality. Interestingly, for conscientiousness, the importance of images increases over time, suggesting that impressions formed later in the video matter more for this trait. For the image-level analysis, the contributions of visual inputs are aggregated into different regions of an image, which allows us to study the importance of different regions. The results show that hand movements and the background are more important than faces for predicting personalities. We believe the findings from the interpretability analyses serve as guidelines for influencers and presenters to better design and create video content, and for audiences to infer personalities. In addition to the interpretability analysis, we use a real-world case study to showcase the usefulness of the proposed model. Specifically, we collected MBA admission data from a major university in the United States. In the application process, each applicant is required to record an up-to-one-minute video to answer a specific question. The staff uses the video as evidence to evaluate the communication skills and English proficiency of each candidate. We utilize our model to generate the five personality predictions for each candidate and examine the impact of the predicted personality traits, as a persuasion tool, on the admission outcome. Our results show that candidates perceived as conscientious, extroverted, and agreeable are associated with a higher chance of being admitted. Among these three traits, being perceived as agreeable is associated with an even higher probability of admission. In summary, this study makes the following contributions. First, we introduce a Multimodal Co-attention network, coupled with a novel hierarchical positional encoding mechanism. This architecture adeptly processes information from the visual, acoustic, and textual modalities.
Our approach outperforms state-of-the-art baselines, demonstrating strong performance. Second, we validate our proposed model’s ability to extract valuable insights from low-resolution (64 × 64) images. Notably, even with a compact latent representation space of just 512 dimensions, our model excels. Moreover, it achieves this while demanding minimal training time. Third, we conduct rigorous interpretability analyses, shedding light on the decision-making process of our method. This offers a deeper understanding of its working mechanism, enhancing its transparency. Finally, to underscore the practical utility of video-based personality detection, we present a case study demonstrating the model’s efficacy in predicting MBA admission outcomes. The remainder of the paper is organized as follows. In Section 2, we discuss prior work on personality prediction and Transformers in computer vision, as well as recent work on multimodal learning. In Section 3, we give a brief overview of the Transformer, followed by the proposed model in Section 4. In Section 5, we present the experimental details. Section 6 presents the results. We conclude our study in Section 7.

3.2 Related Work

3.2.1 Personality

Personality is often defined as a distinct combination of cognitive, affective, and behavioral traits [6]. It is considered to be relatively stable compared to emotion [24]. In the business-related literature, the Big-Five personality traits (OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) are widely used to describe a person’s personality [30]. They are defined as follows [35]:

• Openness emphasizes imagination and insight.
• Conscientiousness denotes organization and responsibility.
• Extraversion represents sociability and energy.
• Agreeableness reflects compassion and trust.
• Neuroticism involves anxiety and depression tendencies.
The influence of personality traits on various aspects of life and decision-making processes has been extensively studied in the literature. For instance, the personalities of chief executive officers (CEOs) have been found to significantly correlate with their companies’ financial outcomes, such as cash holdings, investment, and interest coverage [109]. In particular, conscientiousness has been negatively associated with a company’s strategic flexibility, while agreeableness, extraversion, and openness have shown positive associations [62]. The relationship between personality traits and online shopping behaviors has also been explored. Consumers with higher degrees of neuroticism, agreeableness, or openness tend to be utility-motivated to shop online [96]. Furthermore, hedonic purchase motivation is positively influenced by neuroticism, extraversion, and openness [96]. In the realm of information-seeking tasks, individuals high in conscientiousness performed fastest, followed by those high in agreeableness and extraversion [4]. Moreover, personality traits have been linked to social media usage and engagement. Specifically, openness and extraversion are the two most significant positive predictors of social media use. Conscientiousness, agreeableness, and neuroticism were also considered important, but to a lesser degree [48]. These studies underscore the crucial relationship between personality traits and decision-making choices across various domains. One of the biggest limitations of these studies is how personality traits are derived. The majority of the literature requires the completion of long questionnaires to determine personality traits [26], which is time-consuming and burdensome. With the recent data explosion in user-generated content, such as on social media, it has become almost impossible to conduct questionnaire-based research. A lot of effort has been devoted to automatic personality detection from text.
Early works used word counts to classify personality. For instance, [2] use Linguistic Inquiry and Word Count (LIWC) to classify personality from text. Recently, research has started adopting deep learning techniques to predict personality traits. One advantage of employing deep learning methods is their ability to learn word embeddings that capture rich contextual information in text, facilitating the learning of document-level representations by the models. For instance, a deep convolutional neural network (CNN) has been developed to predict personality from text and has been demonstrated to outperform traditional machine learning techniques [86]. [111] found that CNNs outperform recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent units (GRU), in predicting personality. Attention techniques have also been incorporated into CNNs to enhance their performance. For example, word-level attention has been proposed to learn document-level semantic features [108], while message-level attention has been employed to leverage the relative weight of users’ social media posts, yielding impressive results [57]. In addition, [37] utilize three pre-trained language models, BERT, RoBERTa, and XLNet, to predict personality from text by averaging their predictions. [109] develop a hierarchical attention network to classify personality from text. However, even though video content has been booming, only a few studies focus on video-based personality detection. In the next section, we discuss current video-based personality efforts as well as their limitations.

3.2.2 Video-Based Deep Personality Prediction

Personality prediction has recently emerged as a popular research area, with a focus on utilizing deep learning techniques to predict personality traits from unstructured data sources such as text and video.
Little research has focused on predicting personality traits from user-posted social media videos. These videos typically consist of at least two modalities, vision and audio, with some also including text. Various methods have been proposed to process and combine visual and audio data. Most existing video personality prediction models extract information and make predictions from each data source (vision, audio, or text) separately and then employ ensemble methods, such as averaging, to combine predictions. For instance, [102] developed a Descriptor Aggregation network to predict personality traits from video-sampled images and a linear regressor to predict personality traits from audio, averaging the predictions from these two models to make final predictions. [83] utilized pre-trained VGG-16 and ResNet models to predict personality from audio and images respectively, a linear regressor to predict personality from text, and averaged the predictions from all three models to make final predictions. Another approach involves extracting features from each source and using a fully connected feed-forward network to fuse embeddings from two or three modalities. [82] proposed two techniques for predicting video personality traits: one using a 3D convolution network to extract visual features and a linear regressor to extract audio features, with a fully connected network fusing the two modalities to make predictions; the other splitting a video into several equal-length parts and using a linear regressor and a CNN to extract audio and visual features respectively for each part, with a fully connected feed-forward network combining the embeddings as the latent representation for that part before entering an LSTM network sequentially to make final predictions. [36] developed two CNNs to extract audio and visual features and employed a fully connected network to combine the embeddings and make predictions.
However, the majority of these models fail to capture the interactions between audio and vision, which is crucial for multimodal learning [44]. More importantly, these models require high-resolution images (e.g., 224 × 224), which is computationally expensive to process.

3.2.3 Transformers in Computer Vision

The Transformer model, initially proposed for machine translation tasks in the realm of natural language processing (NLP) [97], has seen a surge of interest for its application in computer vision (CV) tasks, positioning it as an alternative to convolutional neural networks (CNNs). Several studies have made significant strides in this area. For instance, the Vision Transformer (ViT) [31] applies a Transformer model to linearly projected sequences of image patches for full-image classification. The Swin Transformer enhances the ViT by introducing a hierarchical Transformer architecture coupled with a shifted window scheme [53]. These models serve as two representative Transformer-based models for image classification tasks. In the context of video classification tasks, the Video Vision Transformer (ViViT) extends the ViT by proposing two methods for embedding video samples: uniform frame sampling and tubelet embedding. It also introduces four model variants based on the Transformer: spatiotemporal attention, factorized encoder-decoder, factorized self-attention, and factorized dot-product attention [9]. The Video Swin Transformer further extends the Swin Transformer by introducing a 3D-shifted-window-based multi-head self-attention module and a locality inductive bias to the self-attention module [54]. Since the advent of pure Transformer-based models for computer vision tasks, they have been adopted for a diverse range of applications, including semantic segmentation [113], action recognition [14], and object detection [16]. This underscores the versatility and efficacy of Transformer models across various domains.
3.2.4 Multimodal Learning

Multimodal learning, a deep learning technique, involves the assimilation of information from diverse modalities such as images, text, audio, and video. Given the inherent multimodal nature of videos, a substantial body of literature has focused on learning a joint representation of audio and vision to predict audio-visual synchronization. For instance, [8] trained an audio-visual cross-modal network from scratch to predict video correspondence. [7] utilized one clustered modality as a supervisory signal for another modality and predicted correspondence between the two modalities. [22] further developed three self-supervised co-attention-based networks to discriminate visual events related to audio events. However, only a handful of research studies have concentrated on handling vision, audio, and text. For example, [5] applies contrastive learning to vision, audio, and text to learn video-level representations for self-supervised learning tasks. Most multimodal learning models employ a standard Transformer as the backbone network to learn the interactions among different modalities. For instance, VATT [3] proposes a Transformer-based self-supervised learning model that can process audio, visual, and text information and uses a standard Transformer model as the backbone network. [34] develops an omnivore network that uses a standard Transformer network to learn representations from images, videos, and 3D view data. While it is relatively straightforward to train a standard Transformer in terms of implementation, the computational cost can escalate when the latent representation space enlarges. Our method aims to enrich the multimodal learning literature by proposing a Multimodal Co-attention Transformer that can efficiently process information from three different modalities.

3.3 Preliminaries

In this section, we provide a brief explanation of the functionality of multi-head self-attention.
The key component of the Transformer [97] maps a query vector Q, a key vector K, and a value vector V to an output (embedding), as demonstrated in Equation 3.1.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \quad (3.1)

Self-attention is achieved by computing the similarity scores between the query matrix Q and the key matrix K using the scaled dot product. This is obtained by dividing the dot product of Q and K by the square root of the dimension of Q, denoted by d. These scores are then converted into probabilities by applying a softmax function. The resulting probabilities are used to weight the values in the value matrix V, producing the final output for the next layer. The underlying principle of this mechanism is to assign greater importance to vectors with higher probabilities in V in subsequent layers. Despite its effectiveness, a single self-attention layer may constrain a model’s capacity to attend to multiple positions simultaneously without sacrificing attention to other positions. A multi-head attention mechanism is introduced to address this limitation, which has been shown to enhance overall model performance. This mechanism comprises h parallel self-attention sub-layers, referred to as ‘heads’, each of which learns distinct query, key, and value matrices. These heads project input features into different subspaces, and their output features are concatenated into a single matrix for subsequent processing by downstream layers, as shown in Equation 3.2.

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_h)W^O, \quad \text{where } h_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (3.2)

where W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_{model}} are learnable weights for each head i, and W^O \in \mathbb{R}^{h \cdot d_{model} \times d_{model}} is the projection weight. A fully connected feed-forward network is applied after the self-attention layer. In addition, a residual connection and a normalization function are applied to each sub-layer.
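Equations 3.1 and 3.2 can be sketched as a minimal NumPy example. The randomly initialized matrices below stand in for the learned weights W_i^Q, W_i^K, W_i^V, and W^O; the dimensions are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Equation 3.1: softmax(QK^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Q, K, V, heads, rng):
    # Equation 3.2: h independent heads, concatenated and projected by W^O.
    d = Q.shape[-1]
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
        outs.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(heads * d, d)) * 0.1
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))              # 5 tokens, d = 16
out = multi_head(x, x, x, heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Passing the same matrix as query, key, and value, as above, yields the self-attention case used throughout the model.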
3.4 Our Model

In this section, we introduce our proposed model, which is designed to extract and analyze visual, acoustic, and textual representations, as well as the interactions among these three modalities. As illustrated in Figure 3.3, our model comprises three primary components:

1. Encoder Network: The encoder network is composed of visual, audio, and text encoders that are responsible for extracting the respective features from each modality.

2. Multimodal Co-attention Transformer Network: This network is designed to capture the interactions among the three modalities through the use of a multimodal co-attention mechanism.

3. Predictor: The output from the Multimodal Co-attention Transformer Network is fed into a fully connected feed-forward network, which generates predictions regarding personality traits.

Figure 3.3: An overview of our proposed model: Multimodal Co-attention Transformer.

Visual encoder and hierarchical positional encoding. For the visual encoder, we build upon the work of the Vision Transformer (ViT) [31]. Our visual encoder takes as input a 3-channel Red-Green-Blue (RGB) representation of n sampled image frames, with a size of [3, H, W]. Each image is partitioned into patches of size [h, w], resulting in a total of [H/h] × [W/w] patches. Additionally, we propose a hierarchical positional encoding mechanism to incorporate positional information into the model. In this study, we utilized a sample size of 100 images per video. In contrast to the majority of studies, which resize images to a higher resolution of [224, 224], we opted for a lower resolution of [64, 64] for each image. The patch size is a hyperparameter that requires tuning. In the visual encoder, which lacks both recurrence and convolution, a hierarchical positional encoding method is introduced to encode the position of each patch within each sampled image and the position of each image within each video.
This enables the model to comprehend the position of each patch or image. Specifically, the hierarchical positional encoding method encompasses two components: patch positional encoding and image positional encoding. Both encodings possess the same dimension as each patch, allowing the encodings to be added directly to the patch embeddings. We build upon the positional encoding, PE_{pos,i}, proposed in [97], which uses sine and cosine encoding functions written as:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}) \quad (3.3)
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d}) \quad (3.4)

where pos is the position and i is the dimension. Based on the equations above, we obtain the positional encodings for patch p in each image and for image m in each video, PE_{p,i} and PE_{m,i}. Together, the hierarchical positional encoding for patch p in image m of a video is

PE_{p,m} = PE_{pos_p, i} + PE_{pos_m, i} \quad (3.5)

After injecting the positional encoding, each patch is linearly projected to a latent representation with dimension l.
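The patch partitioning and the hierarchical positional encoding of Equations 3.3-3.5 can be sketched as follows in NumPy. The patch size of 8 and the frame position are illustrative assumptions, not the tuned values.

```python
import numpy as np

def sinusoidal_pe(pos, d):
    # Equations 3.3-3.4: sine on even dimensions, cosine on odd dimensions.
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def to_patches(img, h, w):
    # Split a [C, H, W] image into [H/h * W/w] flattened patches of size C*h*w.
    C, H, W = img.shape
    return (img.reshape(C, H // h, h, W // w, w)
               .transpose(1, 3, 0, 2, 4)
               .reshape(-1, C * h * w))

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 64, 64))      # one sampled 64x64 RGB frame
patches = to_patches(img, 8, 8)         # 64 patches of dimension 3*8*8 = 192
d = patches.shape[1]

# Equation 3.5: each patch's hierarchical encoding is the sum of its
# patch-level and image-level sinusoidal encodings, added to the patch.
image_pos = 42                          # position of this frame in the video
encoded = np.stack([
    patch + sinusoidal_pe(p, d) + sinusoidal_pe(image_pos, d)
    for p, patch in enumerate(patches)
])
print(encoded.shape)  # (64, 192)
```

Because both encodings share the patch dimension, the two additions in the loop implement Equation 3.5 directly; the linear projection to the latent dimension l would follow this step.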
The latent representation for each patch is then concatenated to form the visual embeddings with a dimension of l × [H/h] × [W/w].

Audio encoder. For the audio modality, our approach involves several stages. First, we extract the raw audio signal from each video. Subsequently, all audio signals are re-sampled from their original rate of 44.1 kHz to a standard rate of 16 kHz. From these re-sampled signals, we extract 2-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) [27], which are designed to model how the human ear perceives and distinguishes between different sound frequencies [71]. These 2-dimensional MFCCs are then flattened into 1-dimensional representations for input into the audio encoder. We conducted experiments with various audio features, including log filter banks and raw audio waveforms, among others. Our results indicated that MFCCs provided the best performance, and thus we selected them as our audio representations. The extracted MFCC features are fed into a fully connected feed-forward network to obtain audio embeddings.

Text encoder. In our approach to processing text data, we begin by applying standard text processing procedures. These procedures include tokenizing the text data, converting all words to lowercase, removing English stopwords, and performing stemming and lemmatization on the words. We conducted experiments with various text extraction techniques, including one-hot encoding, bi-gram encoding, and the use of pre-trained text encoders such as BERT [29]. Our results indicate that one-hot encoding provides the best performance compared to other encoding techniques. Hence, we use one-hot encodings as text representations. Similar to the audio encoder, the extracted one-hot encoding is fed into a fully connected feed-forward network to obtain text embeddings.

Multimodal Co-attention Transformer Network. The visual, audio, and text representations are input into three symmetric multimodal co-attention sub-blocks.
Each sub-block comprises a standard multi-head self-attention module and the proposed multimodal co-attention module. Layer normalization and a residual connection are applied following each attention module, and a fully connected feed-forward network is also incorporated. The multi-head self-attention module independently identifies salient features from each modality. In contrast, the multimodal co-attention module learns the significant features of the interactions between the other two modalities, guided by the guiding modality. The residual connection serves to stabilize the network and, more importantly, combines the guiding modality’s representation with the joined representations, preserving information from all three modalities. In the visual sub-block, the extracted visual embeddings from the visual encoder are first fed into a multi-head self-attention module to obtain an intermediate representation, I_v, containing salient visual information. Similarly, the intermediate representations for audio and text, I_a and I_t, are obtained by feeding the extracted audio and text embeddings into the multi-head self-attention modules in their corresponding sub-blocks. Specifically, I_v, I_a, and I_t are calculated as follows:

I_v = FC(Norm(MultiHead(z_v, z_v, z_v) + z_v))
I_a = FC(Norm(MultiHead(z_a, z_a, z_a) + z_a))
I_t = FC(Norm(MultiHead(z_t, z_t, z_t) + z_t))   (3.6)

where z_v, z_a, and z_t denote the output features from the visual encoder, the audio encoder, and the text encoder, respectively.
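Under simplifying assumptions (single-head attention, the feed-forward layers omitted, illustrative dimensions), one sub-block of the network can be sketched in NumPy as follows: a self-attention step on the guiding modality, then a co-attention step in which the guiding modality queries the two other modalities stacked as key and value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sub_block(z_guide, z_other1, z_other2):
    # Self-attention step (Equation 3.6), with residual + layer norm.
    I_g = layer_norm(attention(z_guide, z_guide, z_guide) + z_guide)
    I_1 = layer_norm(attention(z_other1, z_other1, z_other1) + z_other1)
    I_2 = layer_norm(attention(z_other2, z_other2, z_other2) + z_other2)
    # Co-attention step: the guiding modality ([1, d] query) attends over the
    # other two modalities stacked as a [2, d] key/value; the residual keeps
    # the guiding modality's own information.
    others = np.vstack([I_1, I_2])
    return layer_norm(attention(I_g, others, others) + I_g)

rng = np.random.default_rng(0)
d = 16
z_v, z_a, z_t = (rng.normal(size=(1, d)) for _ in range(3))
F_v = sub_block(z_v, z_a, z_t)   # visual sub-block, guided by vision
print(F_v.shape)  # (1, 16)
```

The audio and text sub-blocks are symmetric: each calls `sub_block` with its own modality as the guide and the remaining two as the stacked key/value.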
Next, in the visual sub-block, I_a and I_t are stacked to form a joined representation, denoted as [I_a, I_t], with a dimension of [2, d]. Concurrently, the dimension of I_v is expanded from 1-dimensional to 2-dimensional (i.e., from d to [1, d]). The multimodal co-attention module is then fed with I_v as the query and [I_a, I_t] as both key and value. The resulting dot product between the query and key represents the similarities between the audio, text, and visual embeddings, with a dimension of [1, 2], indicating the relative salience of the audio and text embeddings with respect to the visual embeddings. These similarity scores are then multiplied by [I_a, I_t] to obtain a joined representation guided by the guiding modality, I_v. Similarly, joined representations of [I_a, I_v] guided by I_t and [I_v, I_t] guided by I_a can also be obtained. Together with a residual connection, layer normalization, and a fully connected feed-forward network, the joined representations for each modality (F_v, F_a, F_t) are calculated as follows:

F_v = FC(Norm(MultiHead(I_v, [I_a, I_t], [I_a, I_t]) + I_v))
F_a = FC(Norm(MultiHead(I_a, [I_v, I_t], [I_v, I_t]) + I_a))
F_t = FC(Norm(MultiHead(I_t, [I_v, I_a], [I_v, I_a]) + I_t))   (3.7)

Predictor. At the last step, we apply a fully connected feed-forward network along with a sigmoid function to the output of the Multimodal Co-attention network to make personality predictions, computed as follows:

Personality = Sigmoid(FC(Concat(F_v, F_a, F_t)))   (3.8)

3.5 Experiments

3.5.1 Dataset

To assess the efficacy of our approach, we conducted experiments on a large-scale dataset: First Impressions [67]. The First Impressions dataset is a widely used benchmark in the field of apparent personality analysis and was employed in the ECCV 2016 personality trait recognition competition. It consists of 10,000 labeled video clips extracted from over 3,000 YouTube videos, with 6,000 designated for training and 2,000 each for validation and testing.
The dataset provides tri-modal information in the form of audio, visual, and text modalities. The average length of each video is 15 seconds, and the majority have a resolution of [1280, 720]. Each video features a single individual speaking English in front of a camera. Ground-truth annotations for the Big-Five personality traits (extraversion, agreeableness, conscientiousness, neuroticism, and openness) are provided as fractional scores ranging from 0 to 1. The ECCV competition organizers obtained the annotations via Amazon Mechanical Turk. The summary statistics are shown in Table 3.1.

Table 3.1: Summary Statistics for