ABSTRACT

Title of Dissertation: DEVELOPING MULTIMODAL LEARNING METHODS FOR VIDEO UNDERSTANDING

Mingwei Sun, Doctor of Philosophy, 2024

Dissertation Directed by: Professor Kunpeng Zhang, Department of Decision, Operations and Information Technologies

In recent years, the field of deep learning, with a particular emphasis on multimodal representation learning, has experienced significant advancements. These advancements are largely attributable to groundbreaking progress in areas such as computer vision, voice recognition, natural language processing, and graph network learning. This progress has paved the way for a multitude of new applications. The domain of video, in particular, holds immense potential. Video is often considered the most potent form of digital content for communication and the dissemination of information. The ability to effectively and efficiently comprehend video content could prove instrumental in a variety of downstream applications. However, the task of understanding video content presents numerous challenges. These challenges stem from the inherently unstructured and complex nature of video, as well as its interactions with other forms of unstructured data, such as text and network data. These factors contribute to the difficulty of video analysis. The objective of this dissertation is to develop deep learning methodologies capable of understanding video across multiple dimensions. Furthermore, these methodologies aim to offer a degree of interpretability, which could yield valuable insights for researchers and content creators. These insights could have significant managerial implications.

In the first study, I introduce an innovative network based on Long Short-Term Memory (LSTM), enhanced with a Transformer co-attention mechanism, designed for the prediction of apparent emotion in videos.
Each video is segmented into clips of one-second duration, and pre-trained ResNet networks are employed to extract audio and visual features at the second level. I construct a co-attention Transformer to effectively capture the interactions between the audio and visual features that have been extracted. An LSTM network is then utilized to learn the spatiotemporal information inherent in the video. The proposed model, termed the Sec2Sec Co-attention Transformer, outperforms several state-of-the-art methods in predicting apparent emotion on a widely recognized dataset: LIRIS-ACCEDE. In addition, I conduct an extensive data analysis to explore the relationships between various dimensions of visual and audio components and their influence on video predictions. A notable feature of the proposed model is its interpretability, which enables us to study the contributions of different time points to the overall prediction. This interpretability provides valuable insights into the functioning of the model and its predictions.

In the second study, I introduce a novel neural network, the Multimodal Co-attention Transformer, designed for the prediction of personality based on video data. The proposed methodology concurrently models audio, visual, and text representations, along with their inter-relationships, to achieve precise and efficient predictions. The effectiveness of the proposed approach is demonstrated through comprehensive experiments conducted on a real-world dataset, namely, First Impressions. The results indicate that the proposed model surpasses state-of-the-art methods in performance while preserving high computational efficiency. In addition to evaluating the performance of the proposed model, I also undertake a thorough interpretability analysis to examine the contribution across different levels. The insights gained from the findings offer a valuable understanding of personality predictions.
Furthermore, I illustrate the practicality of video-based personality detection in predicting outcomes of MBA admissions, serving as a decision support system. This highlights the potential importance of the proposed approach for both researchers and practitioners in the field.

In the third study, I present a novel generalized multimodal learning model, termed VAN, which excels in learning a unified representation of visual, acoustic, and network cues. Initially, I utilize state-of-the-art encoders to model each modality. To augment the efficiency of the training process, I adopt a pre-training strategy specifically designed to extract information from the music network. Subsequently, I propose a generalized Co-attention Transformer network. This network is engineered to amalgamate the three distinct types of information and to learn the inter-relationships that exist among the three modalities, a critical facet of multimodal learning. To assess the effectiveness of the proposed model, I collect a real-world dataset from TikTok, comprising over 88,000 videos. Extensive experiments demonstrate that the proposed model surpasses existing state-of-the-art models in predicting video popularity. Moreover, I have conducted a series of ablation studies to attain a deeper comprehension of the behavior of the proposed model. I also perform an interpretability analysis to study the contributions of each modality to the model performance, leveraging the unique property of the proposed co-attention structure. This research contributes to the field by offering a more comprehensive approach to predicting video popularity on short-form video platforms.
DEVELOPING MULTIMODAL LEARNING METHODS FOR VIDEO UNDERSTANDING

by

Mingwei Sun

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024

Advisory Committee:
Professor Kunpeng Zhang, Chair
Professor P.K. Kannan
Professor Lauren Rhue
Professor Jessica Clark
Professor Vanessa Frias-Martinez, Dean's Representative

© Copyright by Mingwei Sun 2024

Acknowledgments

During the course of my six-year doctoral study at the Smith Business School, I have had the privilege of collaborating with numerous esteemed researchers. Their guidance, teachings, and inspiration have been instrumental in my journey. I am deeply appreciative of their support and express my gratitude wholeheartedly.

First and foremost, I would like to extend my deepest gratitude to my advisor, Professor Kunpeng Zhang, for his unwavering support and valuable advice throughout my six years of study. His consistent provision of constructive feedback and advice has not only enriched my knowledge but also fostered my growth as an independent researcher. My academic journey under the guidance of Professor Zhang has been both enriching and transformative. I am deeply grateful for everything he has provided me with. His mentorship has truly been a cornerstone of my academic development.

I am sincerely thankful to Professor Jessica Clark, with whom I initiated my first project. The discussions about research, career, and life with her have been enlightening. I am also grateful to Professor Lauren Rhue for her consistent support and patience. Working with her has been a rewarding experience, and I have learned much from her. I also extend my gratitude to Professor P.K. Kannan for his insightful advice on both my research and career. I am thankful to Professor Vanessa Frias-Martinez for her unique perspective on research, which has significantly influenced my thinking process.
My heartfelt thanks also go to Professor Balaji Padmanabhan, whose guidance has been extremely beneficial. His insights on positioning my work have been enlightening. In general, I have learned a lot from these top-notch researchers.

Furthermore, I would like to acknowledge Justina and Miloyka for their immense help, encouragement, and patient support. They have always been available to assist, regardless of the challenges I faced. My doctoral experience has also been productive thanks to the interactions with my peers, particularly Bingze Xu, Wei Feng, Gujie Li, Sung Hyun Kwon, Maya Mudambi, Feiyu E, Weihong Zhao, Yunfei Wang, among others.

Finally, I would like to express my deepest appreciation to my wife, Fan Yu, and my parents for their unconditional support, both mentally and financially. Their backing has been a pillar of strength in my journey.

Table of Contents

Acknowledgements . . . . ii
Table of Contents . . . . iv
List of Tables . . . . vi
List of Figures . . . . vii
Chapter 1: Introduction . . . . 1
Chapter 2: Sec2Sec Co-attention Transformer for Video Emotion Prediction . . . . 7
  2.1 Introduction . . . . 7
  2.2 Related Work . . . . 11
    2.2.1 Emotion and Its Impact . . . . 12
    2.2.2 Audio-Video Representation Learning . . . . 13
    2.2.3 Transformer and Its Application . . . . 14
  2.3 Preliminaries . . . . 15
  2.4 Method: Sec2Sec Co-attention Transformer . . . . 17
  2.5 Experiments . . . . 21
    2.5.1 Dataset . . . . 21
    2.5.2 Implementation Details . . . . 23
    2.5.3 Baselines . . . . 23
    2.5.4 Results . . . . 26
  2.6 Ablation Study . . . . 27
  2.7 Conclusion . . . . 28
Chapter 3: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding . . . . 30
  3.1 Introduction . . . . 30
  3.2 Related Work . . . . 36
    3.2.1 Personality . . . . 36
    3.2.2 Video-Based Deep Personality Prediction . . . . 38
    3.2.3 Transformers in Computer Vision . . . . 40
    3.2.4 Multimodal Learning . . . . 41
  3.3 Preliminaries . . . . 42
  3.4 Our Model . . . . 43
  3.5 Experiments . . . . 49
    3.5.1 Dataset . . . . 49
    3.5.2 Implementation Details . . . . 50
    3.5.3 Baselines . . . . 51
    3.5.4 Evaluation Metrics . . . . 52
    3.5.5 Efficiency Metric . . . . 53
  3.6 Results . . . . 53
    3.6.1 Performance . . . . 53
    3.6.2 Efficiency . . . . 55
  3.7 Ablation Study . . . . 58
  3.8 Interpretability Analysis . . . . 58
  3.9 Decision Support Showcasing: MBA Admission Prediction . . . . 62
  3.10 Discussion and Conclusion . . . . 68
Chapter 4: Network-enhanced Multimodal Co-attention Learning for Short-Form Video Popularity Prediction . . . . 72
  4.1 Introduction . . . . 72
  4.2 Related Work . . . . 76
    4.2.1 Online Video Popularity Prediction . . . . 76
    4.2.2 Multimodal Representation Learning . . . . 77
  4.3 Proposed Model . . . . 79
    4.3.1 Input Embeddings . . . . 80
    4.3.2 VAN: A Generalized Multimodal Co-attention Network . . . . 83
  4.4 Experiments . . . . 86
    4.4.1 Dataset . . . . 86
    4.4.2 Popularity Score . . . . 86
    4.4.3 Evaluation Metrics . . . . 87
    4.4.4 Implementation Details . . . . 88
    4.4.5 Graph Attention Network Pre-training . . . . 89
    4.4.6 Baselines . . . . 90
  4.5 Results . . . . 91
  4.6 Ablation Study . . . . 93
  4.7 Interpretability Analysis . . . . 96
  4.8 Conclusion . . . . 99
Appendix A: Video Essay Recording Page . . . . 101
Appendix B: Additional Interpretability Analysis . . . . 102
Bibliography . . . . 104

List of Tables

2.1 The t-test comparison of audio features on LIRIS-ACCEDE between high emotional and low emotional videos. . . . . 9
2.2 The t-test comparison of visual features on LIRIS-ACCEDE between high emotional and low emotional videos. . . . . 10
2.3 Performance comparison of our model with baselines for arousal prediction. . . . . 23
2.4 Performance comparison of our model with baselines for valence prediction. . . . . 24
2.5 Comparison of batch vs. layer normalization. . . . . 28
3.1 Summary Statistics for First Impressions . . . . 50
3.2 Performance comparison of our model with baselines for Big-Five personality predictions. . . . . 55
3.3 Cost Analysis . . . . 56
3.4 Comparison of Positional Encoding vs No Positional Encoding . . . . 58
3.5 Summary Statistics for the Case Study of MBA Admission . . . . 63
3.6 Personality Trait Correlations . . . . 63
3.7 Estimation Results for MBA Admission . . . . 64
3.8 Factor Analysis . . . . 66
3.9 Estimation Results for MBA Admission Using Extracted Factor . . . . 67
4.1 Hyperparameters of the proposed model. . . . . 88
4.2 Performance comparison of our model with baselines. . . . . 91
4.3 Ablation study of the proposed model. . . . . 93

List of Figures

2.1 An illustration example of images eliciting different emotions. . . . . 8
2.2 An illustration example of various emotions elicited by different audio waveforms. . . . . 9
2.3 Correlation heatmaps between audio and visual features for two emotional states and two emotional intensities. . . . . 11
2.4 An overview of our proposed model: Sec2Sec Co-attention Transformer. . . . . 17
2.5 The co-attention block. . . . . 19
2.6 The Sec2Sec Structure. . . . . 22
2.7 The LSTM Attention. . . . . 27
2.8 Effect of key hyperparameters of Sec2Sec SA-CA on accuracy and F1 score. . . . . 27
3.1 Images sampled from videos in the First Impressions dataset that exhibit varying degrees of personality traits. . . . . 31
3.2 Audio waveforms in the First Impressions dataset that exhibit varying degrees of personality traits. . . . . 32
3.3 An overview of our proposed model: Multimodal Co-attention Transformer. . . . . 44
3.4 Average modality importance on personality prediction. . . . . 59
3.5 Average contributions of images of different positions on personality prediction. . . . . 60
3.6 Region contributions of images on personality predictions. . . . . 60
3.7 Parallel Analysis Scree Plots. . . . . 66
4.1 An illustration of a TikTok post. . . . . 73
4.2 An overview of our proposed model: A Generalized Multimodal Co-attention Transformer. . . . . 79
4.3 Spearman's Rank Correlations regarding the number of sampled frames. . . . . 96
4.4 The average contributions of each modality. . . . . 98
A.1 A screenshot of the recording page for the video essay. . . . . 101
B.1 The contributions of each modality for Head 1. . . . . 102
B.2 The contributions of each modality for Head 2. . . . . 103
B.3 The contributions of each modality for Head 3. . . . . 103
B.4 The contributions of each modality for Head 4. . . . . 103

Chapter 1: Introduction

In the past decade, video content has emerged as an integral component of people's daily routines. As of 2023, individuals are allocating an average of 17 hours per week to the consumption of online video content [78]. Specifically, in the United States, it is anticipated that by 2024, there will be 164.6 million internet users engaging with video content, as per data from Statista [18]. This surge in video content consumption has led to a transformative impact on social media platforms. A majority of these platforms, including TikTok, Instagram, and Facebook, now offer video-sharing services. To illustrate, TikTok reported 1.04 billion monthly active users in May 2024 [92], with an annual expenditure of $3.84 billion from consumers [19]. Given this context, video marketing presents immense business potential. In fact, 91% of businesses are leveraging video as a marketing tool [106].

The ubiquity and exponential growth of video marketing have sparked a surge of interest among scholars and industry professionals alike. They are keen to comprehend the multifaceted dimensions of video content, including auditory intensity (loudness) [40], aesthetic considerations (color choice) [56], and verbal (topic) and non-verbal (facial expression) communication cues [55]. While these elements provide valuable insights, they are relatively straightforward to extract and interpret.
They represent only the surface-level understanding of video content, leaving a vast array of deeper, more complex aspects unexplored. These uncharted territories hold the potential to unlock a more profound understanding of video content, thereby paving the way for a myriad of downstream analyses and studies.

In the sphere of artificial intelligence, the past few years have witnessed remarkable strides in the field of deep learning, with a particular emphasis on multimodal learning. This progress has been fueled by groundbreaking advancements in several sub-domains, including computer vision, natural language processing, voice recognition, and network learning. As a result, multimodal learning has demonstrated exceptional performance across a wide array of applications. One such application is video analytics, where videos typically comprise visual and auditory components. This makes video analytics a natural and fitting application for multimodal learning.

Despite the immense potential of multimodal learning in this domain, it is not without its challenges. The application of existing methods may not always yield optimal performance in a given context due to several reasons. Firstly, capturing the interaction and alignment between the auditory and visual components of a video is a complex task. I posit that this aspect is integral to the performance of the model. Secondly, temporal information embedded within the video is of critical importance. The ability to accurately utilize this information can significantly enhance the model's performance. Thirdly, the fusion of auditory and visual data with other types of information, such as text and graph networks, is a crucial consideration. The integration of these diverse data types can provide a more holistic understanding of the video content. Lastly, the current practice of employing deep learning models in video understanding is predominantly a black-box approach.
This lack of interpretability impedes the broader adoption of these powerful models and fails to provide actionable insights and managerial implications.

In light of this, the primary objective of this dissertation is to delve deeper into the realm of video understanding. To achieve this, I propose the development and application of advanced deep learning methodologies. These methods are designed to penetrate beyond the superficial layers of video content, enabling a more comprehensive and nuanced understanding. By harnessing the power of multimodal learning techniques, I can uncover hidden patterns and alignments within the video content. This, in turn, can facilitate a wide range of applications, from enhancing the effectiveness of video marketing strategies to improving the accuracy of video-based predictive models. Ultimately, this research aims to contribute significantly to the field of video content analysis, setting new benchmarks for future studies in this domain.

Study 1: Sec2Sec Co-attention Transformer for Video Emotion Prediction

Video-based apparent emotion detection plays a crucial role in video understanding, as videos encompass various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, I propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent emotion in videos. Specifically, I divide each video into one-second clips and utilize pre-trained ResNet networks to extract audio and visual features at the clip level. I develop a co-attention Transformer to effectively capture the interactions between the extracted audio and visual features and leverage an LSTM network to learn the spatiotemporal information present in the video.
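At the heart of this design is cross-modal attention: each modality's per-second features attend over the other modality's sequence. The sketch below is a minimal NumPy illustration of that idea only, not the dissertation's implementation; it omits the learned projections, multiple heads, and the LSTM that the full Sec2Sec model uses, and all feature values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(audio, visual):
    """One co-attention exchange between two modalities.

    audio:  (T, d) per-second audio features
    visual: (T, d) per-second visual features
    Each modality serves as queries against the other modality's
    keys/values, so every one-second audio feature is re-expressed as a
    weighted mix of visual features, and vice versa.
    """
    d = audio.shape[1]
    a2v = softmax(audio @ visual.T / np.sqrt(d)) @ visual
    v2a = softmax(visual @ audio.T / np.sqrt(d)) @ audio
    return a2v, v2a

T, d = 10, 16  # e.g., a 10-second video with toy 16-dim clip features
audio = rng.normal(size=(T, d))
visual = rng.normal(size=(T, d))
audio_ctx, visual_ctx = co_attention(audio, visual)
print(audio_ctx.shape, visual_ctx.shape)  # (10, 16) (10, 16)
```

In the full model, the fused per-second outputs would then be passed through an LSTM to capture the temporal ordering of the clips.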
I demonstrate that the proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent emotion on a widely used dataset: LIRIS-ACCEDE. Additionally, I perform comprehensive data analysis to investigate the relationships between different dimensions of visual and audio components and their impact on video predictions. Notably, my model offers interpretability, allowing me to examine the contributions of different time points to the overall prediction.

Study 2: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding

Video has emerged as a pervasive medium for communication, entertainment, and information sharing. With the consumption of video content continuing to increase rapidly, understanding the impact of visual narratives on personality has become a crucial area of research. While text-based personality understanding has been extensively studied in the literature, video-based personality prediction remains relatively under-explored. Existing approaches to video-based personality prediction can be broadly categorized into two directions: learning a joint representation of audio and visual information using fully-connected feed-forward networks, and separating a video into its individual modalities (text, image, and audio), training each modality independently, and then ensembling the results for subsequent personality prediction. However, both approaches have notable limitations: ignoring complex interactions between visual and audio components, or considering all three modalities but not in a joint manner. Furthermore, all these methods incur high computational costs, as they require high-resolution images for training. In this chapter, I propose a novel Multimodal Co-attention Transformer neural network for video-based personality prediction.
My approach simultaneously models audio, visual, and text representations, as well as their inter-relations, to achieve accurate and efficient predictions. I demonstrate the effectiveness of my method via extensive experiments on a real-world dataset: First Impressions. My results show that the proposed model outperforms state-of-the-art approaches while maintaining high computational efficiency. In addition to my performance evaluation, I also perform a set of comprehensive interpretability analyses to investigate the contribution across different levels. My findings reveal valuable insights into personality predictions. In addition, I showcase the utility of video-based personality detection in predicting MBA admission outcomes as a decision support system, highlighting its potential significance for both researchers and practitioners.

Study 3: Network-enhanced Multimodal Co-attention Learning for Short-Form Video Popularity Prediction

The recent surge in the popularity of short-form videos has unveiled considerable opportunities for business applications, encompassing personalized recommendations and targeted advertising. Predominantly, traditional research employs acoustic-visual information for making predictions about video popularity. However, a unique feature of these platforms is the music network, which provides an abundance of information on the distribution and sharing of various trending songs. This could potentially influence video popularity, a factor often overlooked in existing literature. In this chapter, I introduce a novel generalized multimodal learning model, termed VAN, which is adept at learning a unified representation of visual, acoustic, and network cues. Initially, I employ cutting-edge encoders to model each modality. To enhance the efficiency of the training process, I design a pre-training strategy specifically tailored to extract information from the music network.
As a final step, I put forward a generalized Co-attention Transformer network. This network is designed to fuse the three distinct types of information and to learn the inter-relationships that exist among the three modalities, a crucial aspect of multimodal learning. To evaluate the effectiveness of the proposed model, I have collected a real-world dataset from TikTok, consisting of over 88,000 videos. My comprehensive experiments demonstrate that my model outperforms existing state-of-the-art models in predicting video popularity. Furthermore, I have conducted a series of ablation studies to gain a deeper understanding of the behavior of my model. I additionally conduct interpretability analysis to examine the contributions of each modality to the model performance by leveraging the distinctive property of the proposed co-attention structure. This research contributes to the field by offering a more comprehensive approach to predicting video popularity on short-form video platforms.

Collectively, the proposed methods and the findings of the three studies in this dissertation provide us with a more profound comprehension of videos from multiple perspectives. This enhanced understanding paves the way for a plethora of downstream research opportunities, thereby expanding the horizons of knowledge in video analytics. The insights gained from these studies provide valuable guidance for both researchers and practitioners. For researchers, these insights can inform the design of future studies, helping to refine research questions, hypotheses, and methodologies. For practitioners, particularly those in the realm of video marketing and analytics, these insights can inform strategic decision-making, helping to optimize the effectiveness of video content and drive business outcomes.

Chapter 2: Sec2Sec Co-attention Transformer for Video Emotion Prediction

2.1 Introduction

Emotions are generally described as mental states brought about by neurophysiological changes.
They can be associated with thoughts, feelings, behavioral responses, and a degree of pleasure or displeasure [104], which can accordingly affect our decision-making and eventually shape how we perceive the world ubiquitously. As cognitive processes can be profoundly influenced by emotions [77], people primarily rely on emotional levels when making their judgments [11]. Several dimensions that could be linked to emotion-related responses have been identified [73]. Among these, two major ones have been widely explored in the literature: the pleasure-displeasure and arousal-sleep dimensions. Specifically, the former indicates the degree of positivity or negativity of the experience, also known as valence, while the latter assesses the level of energy or fatigue that an experience produces, also known as arousal.

There has been a growing interest in understanding human emotions among both researchers and practitioners. Emotion detection has been the main focus of existing literature, which is also the scope of our study. Various methods have been proposed to detect emotions, especially for text documents. For example, Kratzwald et al. developed a transfer learning-based model (called sent2affect) for emotion recognition [46]. Su et al. designed a long short-term memory (LSTM) network to predict emotions based on the combination of semantic and emotional words [81].

Figure 2.1: An illustration example of images eliciting different emotions. (a) High Arousal (b) Low Arousal (c) High Valence (d) Low Valence
Note: A high-arousal emotion is evoked by a violet background and an exciting concert in Figure 2.1a. A combination of light colors and a calm and peaceful natural environment in Figure 2.1b conveys a low-arousal emotion. A warm background along with happy individuals in Figure 2.1c induces a high-valence emotion. A dark background and a sad person in Figure 2.1d provoke a low-valence emotion.
However, video-based emotion detection remains under-explored, even though videos have been largely generated and posted on various platforms. Prior studies found that videos are more efficient and effective to elicit emotions compared to text [46]. A video often consists of two components: vision and audio. Both can stimulate emotions in their own ways. Figure 2.1 shows different images can elicit different emotions, as suggested by [43]. On the other hand, [103] has shown that the design of electronic music and audio- visual media can elicit audience emotions. In particular, non-diegetic music is a key feature that elicits states of emotions. Audio is usually represented by a continuous waveform. Audio waveform measures how sound pressure varies over time. Figure 2.2 shows different patterns of audio waveforms that can convey different states of emotions. Furthermore, we examine a set of audio and visual features that exhibit significant differences between videos with high and low emotional intensities by performing t-test analyses on the LIRIS-ACCEDE dataset [10]. Table 2.1 and Table 2.2 are the t-test comparisons of audio and visual features across videos with high and low emotional intensities. From the tables, we can see that many visual and audio features do play a role in distinguishing video emotions. 8 Figure 2.2: An illustration example of various emotions elicited by different audio waveforms. (a) High Arousal (b) Low Arousal (c) High Valence (d) Low Valence Note: A high-arousal emotion is provoked by a spike with high sound pressure in Figure 2.3a. A low-arousal emotion is conveyed by low sound pressure over time in Figure 2.3b. Few spikes with small sound pressure in Figure 2.3c elicit a high-valence emotion. A low-valence emotion is evoked by multiple spikes with high sound pressure in Figure 2.3d. Table 2.1: The t-test comparison of audio features on LIRIS-ACCEDE between high emotional and low emotional videos. 
Feature | Emotion | p-value (significant?)
Pitch | valence | 1.84e-15 (✓)
Pitch | arousal | 4.31e-05 (✓)
AL | valence | 0.61 (✗)
AL | arousal | 5.74e-53 (✓)
Spectral rolloff | valence | 3.24e-14 (✓)
Spectral rolloff | arousal | 0.01 (✓)
ZCR | valence | 7.36e-09 (✓)
ZCR | arousal | 0.024 (✓)

Note: Pitch is the average pitch of an audio sample. AL is the average loudness. Spectral roll-off is the frequency below which a certain percentage of the spectral energy (e.g., 85%) is contained. Zero-crossing rate (ZCR) is the rate at which a waveform crosses zero, representing audio smoothness.

Vision and audio can also jointly affect emotions. Specifically, if a visual component is well-aligned with its counterpart audio in the same time frame (e.g., within a second), it can create a synergy that stimulates emotions more intensely. A good alignment could be a high degree of consistency or correspondence between the two components. Most importantly, eliciting different emotional states requires different combinations of audio and visual features. For example, a cold picture with a low audio tempo is likely to stimulate a negative valence, while low arousal is conveyed by a piece of light music and a peaceful environment.

Table 2.2: The t-test comparison of visual features on LIRIS-ACCEDE between high emotional and low emotional videos.

Feature | Emotion | p-value (significant?)
saturation | valence | 1.40e-19 (✓)
saturation | arousal | 1.04e-05 (✓)
brightness | valence | 1.38e-39 (✓)
brightness | arousal | 4.36e-10 (✓)
contrast | valence | 4.44e-13 (✓)
contrast | arousal | 0.76 (✗)
clarity | valence | 9.38e-17 (✓)
clarity | arousal | 0.005 (✓)
warm hue | valence | 2.11e-55 (✓)
warm hue | arousal | 4.17e-16 (✓)

Note: Saturation is the average saturation across pixels. Brightness is the average intensity across pixels. Contrast is the standard deviation of intensity across pixels. Clarity is the proportion of pixels with intensity above a certain threshold (e.g., 0.7). Warm hue is the proportion of warm colors in a frame.
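The feature-level t-tests reported above can be reproduced in outline as follows. This is a minimal SciPy sketch: the synthetic feature values are illustrative stand-ins, not the actual LIRIS-ACCEDE features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative stand-ins for one per-video feature (e.g., average brightness),
# grouped by high vs. low emotional intensity; real values would come from
# feature extraction on LIRIS-ACCEDE, not a random generator.
high_group = rng.normal(loc=0.60, scale=0.10, size=500)
low_group = rng.normal(loc=0.45, scale=0.10, size=500)

# Two-sample t-test: does the feature differ significantly between groups?
t_stat, p_value = stats.ttest_ind(high_group, low_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant: {p_value < 0.05}")
```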
For the sake of illustration, the correlation heatmaps between audio and visual features for various emotional states and intensities are plotted in Figure 2.3 using LIRIS-ACCEDE [10], which confirms that the integration of audio and visual signals induces various emotional states and intensities. The correlations are computed by (1) splitting each video into n one-second clips; (2) extracting audio and visual features from each one-second clip; (3) calculating the correlations between those features for each video; and (4) averaging each correlation pair across videos.

Figure 2.3: Correlation heatmaps between audio and visual features for two emotional states and two emotional intensities: (a) High Arousal, (b) Low Arousal, (c) High Valence, (d) Low Valence.

Another important factor that might affect how people perceive a video emotionally is the sequential composition of the video. The temporal pattern exhibited in a video should be captured for emotion prediction. For example, people are more likely to recall the most recently presented information [61], which indicates that later audio clips may be weighted higher when estimating emotions. Despite the popularity of videos and the practical importance of detecting their emotions, very limited research has been conducted to quantitatively estimate how videos induce emotions. In this paper, we are among the pioneers to fill this research gap by developing a Transformer-based second-to-second (Sec2Sec) co-attention model to predict the perceived emotional states of videos. Specifically, we first implement a Transformer-based co-attention network, extended from the work proposed by [22], to understand the interaction between audio and visual components. We further combine an LSTM module with this co-attention network to capture the temporal information of videos at the second level. To do so, we first split each video into one-second video clips.
We then feed each one-second clip into our designed co-attention network. The output of each video clip from the co-attention network is fed into an LSTM network sequentially. Lastly, we add a fully-connected network to predict emotions. To evaluate our work, we conduct experiments on a real-world dataset from LIRIS-ACCEDE with 9,800 videos [10]. The experimental results show that our model outperforms several cutting-edge baselines in terms of F1-score for both arousal and valence.

2.2 Related Work

Our work is closely related to three streams of literature: emotion, audio-video representation learning, and applications of Transformers.

2.2.1 Emotion and Its Impact

Emotion, a multifaceted psychological phenomenon, lacks a universally acknowledged definition. However, it is frequently characterized as a mental condition that incorporates cognitive processes, affective states, physiological alterations, and behavioral reactions, all marked by varying intensities of pleasure or displeasure [65]. The genesis of emotions can be attributed to multiple mechanisms, including bottom-up processes initiated by external stimuli, top-down processes that involve the cognitive evaluation and interpretation of events based on accumulated experience and knowledge, or a combination of both [59]. Numerous models have been put forth to delineate the dimensions of emotion. Ekman's theory of basic emotions [32] advocates the existence of six universal emotions. Another model, proposed by Cordaro et al. [25], extends Ekman's basic emotion theory to identify 22 distinct emotions. These prevalent models primarily concentrate on discerning the emotions experienced by the individual. However, in the realm of entertainment, such as social media, the focus shifts towards the emotions experienced by the audience.
In this paper, we employ the widely recognized circumplex model of affect [74] to scrutinize the emotions elicited in the audience in response to stimuli. The circumplex model characterizes emotions along two dimensions: valence and arousal. Valence signifies the sentiment associated with an experience, spanning from pleasant to unpleasant. Arousal denotes the degree of intensity or activation associated with the experience, ranging from low to high. Together, valence and arousal encapsulate the pleasantness and intensity of video experiences from the audience's viewpoint. A considerable volume of research examines consumer emotion and its impact on consumer behavior. It is well-documented that consumers are susceptible to the influence of others' emotional expressions [38]. For instance, research has demonstrated that positive facial expressions in fundraising advertisements can sway funding decisions in a beneficial direction [69]. Furthermore, emotions encapsulated in online product reviews can markedly affect the perceived utility of the information [110]. Elements of the broader context, such as culture, can also mold how consumers react to others' emotions: consumers of European cultural descent respond more robustly to excited expressions, while consumers of Chinese descent exhibit a stronger response to calm emotional expressions [66]. A majority of these studies explore emotions in text-based or visual mediums such as online reviews, social media, or facial expressions [13, 69, 80]. However, despite the richness of video content, there is limited research investigating the impact of perceived emotion, primarily due to the absence of a predictive model capable of forecasting perceived video emotion. Consequently, our study endeavors to bridge this gap.

2.2.2 Audio-Video Representation Learning

Audio Representation Learning.
Traditionally, research often adopts hand-crafted audio feature extraction techniques, such as Mel-Frequency Cepstral Coefficients (MFCCs) [27]. MFCC extraction is an audio processing technique that models how human ears sense and resolve sound frequencies [71]. Recently, with the development of deep learning, researchers have explored several audio autoencoder techniques. Tagliasacchi et al. applied a convolutional deep belief network on music and speech data to solve different classification tasks [89]. Cartwright et al. designed a network, including an audio sub-network and a temporal network, to predict long-term and cyclic temporal structure using self-supervision [17]. Chung et al. explored a sequence-to-sequence autoencoder by incorporating RNN and LSTM together [23].

Audio-Visual Cross-Modal Learning. Since audio and visual events in videos tend to co-occur, videos provide a natural bridge between the two modalities. Therefore, the mainstream of audio-visual representation learning research is to predict the synchronization or correspondence of audio and visual streams in videos. Arandjelovic and Zisserman trained an audio-visual cross-modal network from scratch to predict video correspondence [8]. Alwassel et al. used one clustered modality as a supervisory signal for another modality and predicted correspondence between the two modalities [7]. Cheng et al. further developed three self-supervised co-attention-based networks to discriminate visual events related to audio events [22]. In addition, Kuhnke et al. proposed a two-stream aural-visual model (AVM) to predict facial expressions in videos [47].

2.2.3 Transformer and Its Applications

Transformers in Natural Language Processing. The Transformer was first introduced for machine translation [97] and has been a state-of-the-art natural language processing (NLP) architecture ever since. A variety of Transformer-based models have been developed to address NLP tasks, mainly along two streams.
One follows the trend of pre-training Transformer-based models on large corpora and fine-tuning parameters on downstream NLP tasks. BERT is a pioneer that employs a multi-layer bi-directional Transformer architecture [29]. However, BERT-based models can only handle 512 tokens, which is not enough for long text documents. Hence, Longformer extends BERT by utilizing sliding-window, dilated sliding-window, and global attention to handle long text documents [12]. Unlike BERT-based models, another stream of research focuses on language modeling as a pre-training task, such as GPT [70]. GPT was developed for text generation tasks such as question answering and text summarization, and has achieved great performance on downstream tasks in zero-shot or few-shot settings.

Transformers in Computer Vision. Recently, there has been increasing attention on applying Transformers to computer vision (CV) tasks as an alternative to convolutional neural networks (CNNs), and many studies have achieved great success. ViT applies a Transformer to linearly projected sequences of image patches to classify full images [31]. Swin Transformer improves on ViT by introducing a hierarchical Transformer architecture and a shifted-window scheme [53]. These are two representative Transformer-based models for image classification. For video classification, ViViT extends ViT by proposing two methods for embedding video samples, uniform frame sampling and tubelet embedding, along with four Transformer-based model variants: spatiotemporal attention, factorized encoder-decoder, factorized self-attention, and factorized dot-product attention [9]. Video Swin Transformer further extends Swin Transformer by introducing a 3D shifted-window-based multi-head self-attention module and a locality inductive bias in the self-attention module [54].
All these video-based analyses do not separate vision and audio to explicitly learn their joint effect on subsequent tasks, which is our focus in this study.

2.3 Preliminaries

In this section, we first briefly explain how multi-head self-attention works. It is the key component in the Transformer that maps a query matrix Q, a key matrix K and a value matrix V to an output (embedding), as shown in Equation 2.1.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{2.1}$$

Specifically, self-attention computes the dot product of Q and K divided by the square root of the dimension of Q, denoted by d, which gives the similarity scores between Q and K; this is also known as the scaled dot product. The scores are then translated into probabilities by applying a softmax function. Lastly, the probabilities are multiplied by V to get the final output for the next layer. The basic idea of the self-attention mechanism is to focus more, in the following layers, on the vectors in V with high probabilities. However, a single self-attention layer limits the model's ability to attend to multiple positions without compromising others. To mitigate this limitation, the Transformer introduces a multi-head attention mechanism, which can increase overall model performance. Specifically, the multi-head attention layer consists of h parallel self-attention sub-layers, called "heads". Each head learns different query, key and value matrices, projecting the input features into different sub-spaces. The output features from all heads are concatenated into one matrix for the following layers, as shown in Equation 2.2.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_h)W^{O}, \quad \text{where } h_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2.2}$$

where $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_{model}}$ are learnable weights for each head $i$, and $W^{O} \in \mathbb{R}^{h d_{model} \times d_{model}}$ is the projection weight. A fully-connected feed-forward network is applied after the self-attention layer.
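The two equations above can be sketched in code as follows. This is a minimal NumPy illustration of scaled dot-product and multi-head attention, not the authors' implementation; all dimensions and weights are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 2.1)."""
    d = Q.shape[-1]
    probs = softmax(Q @ K.T / np.sqrt(d))  # similarity scores -> probabilities
    return probs @ V

def multi_head(Q, K, V, per_head, W_o):
    """Multi-head attention (Equation 2.2): per-head projections,
    concatenation, then the output projection W_o."""
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for (Wq, Wk, Wv) in per_head]
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d, h, d_model = 4, 8, 2, 8                 # illustrative sizes
Q = K = V = rng.normal(size=(n, d))           # self-attention: Q = K = V
per_head = [tuple(rng.normal(size=(d, d_model)) for _ in range(3))
            for _ in range(h)]
W_o = rng.normal(size=(h * d_model, d_model))
out = multi_head(Q, K, V, per_head, W_o)
print(out.shape)  # (4, 8)
```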
In addition, a residual connection and a normalization function are applied to each sub-layer.

2.4 Method: Sec2Sec Co-attention Transformer

This section presents our proposed model, which considers visual and audio representations, their interactions, and the temporal information of videos. As depicted in Figure 2.4, the proposed model consists of five components:

Video segmentation: We first split each video into n video segments; each segment consists of a one-second visual component and a one-second audio component.

Encoder network: The encoder network comprises a visual encoder and an audio encoder that extract visual and audio features using pre-trained ResNet networks [39].

Co-attention block: The co-attention block leverages the Transformer [97] to model the interactions between visual and audio features, shown in Figure 2.5.

Sec2Sec structure: It captures the temporal information via an LSTM network, illustrated in Figure 2.6.

Predictor: The output from the LSTM network is fed into a fully-connected feed-forward network to make emotion predictions.

Figure 2.4: An overview of our proposed model: Sec2Sec Co-attention Transformer.

Visual encoder. To extract visual features, we first sample m frames per segment. Each frame is represented as a color image with Red-Green-Blue (RGB) channels. Like prior studies, we pre-process the images by resizing to 80x80, center cropping to 64x64, and normalizing with a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). Thus, each visual part is represented in a 4-dimensional space (i.e., 3 RGB channels plus m frames), which is fed into a pre-trained R(2+1)D ResNet model. R(2+1)D ResNet [93] is an extension of ResNet that utilizes 3D convolution and 3D pooling to learn the temporal features of videos.

Audio Encoder.
For the audio segment, we first compute Mel-Frequency Cepstral Coefficients (MFCCs) [27], their first-order frame-to-frame time derivatives (delta coefficients), and their second-order derivatives (delta-delta coefficients) from each audio clip. MFCCs are coefficients that model how human ears sense and resolve sound frequencies [71]. The delta coefficients capture speech-rate information, and the delta-delta coefficients measure the acceleration of speech; jointly, they capture the temporal information of an audio signal [71]. Each of the three coefficient types is 2-dimensional, so the audio feature can be represented as a three-channel MFCC feature map in which each channel is one type of coefficient. The extracted three-channel MFCC features are fed into a pre-trained ResNet [39]. ResNet introduces identity shortcut connections to solve the vanishing gradient problem and outperforms other CNN models on popular image classification tasks. In our work, the 3-channel MFCC audio features are treated as a special type of "image"; hence, we use a pre-trained 18-layer ResNet to obtain the audio features.

Co-attention block. As illustrated in Figure 2.5, the extracted visual and audio features for each segment enter two symmetrical co-attention sub-blocks, a visual sub-block and an audio sub-block, to learn guided audio and visual representations. Each sub-block combines a standard multi-head self-attention module with a multi-head co-attention module. A normalization layer (Norm) and a residual connection are applied after each attention module, followed by a fully-connected feed-forward network (FC).

Figure 2.5: The co-attention block.

In the visual sub-block, the extracted visual embedding from the visual encoder is first fed into a multi-head self-attention module to get the intermediate visual representation, $I_v$, embedding important visual information.
Similarly, we can get the intermediate audio representation, $I_a$, in the audio sub-block. Specifically, $I_v^i$ and $I_a^i$ for segment $i$ are computed as follows:

$$I_v^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(z_v^i, z_v^i, z_v^i)) + z_v^i), \quad I_a^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(z_a^i, z_a^i, z_a^i)) + z_a^i) \tag{2.3}$$

where $z_v^i$ and $z_a^i$ denote the output features from the visual encoder and the audio encoder for segment $i$, respectively. Next, in the visual sub-block, $I_a^i$ (as key and value) and $I_v^i$ (as query) are passed into the multi-head co-attention module. In this way, we enforce the visual sub-block to focus on the information related to audio. Similarly, in the audio sub-block, we feed $I_v^i$ as key and value and $I_a^i$ as query into the second multi-head attention layer. Hence, the final output features of vision and audio, $F_v^i$ and $F_a^i$, are computed as:

$$F_v^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(I_v^i, I_a^i, I_a^i)) + I_v^i), \quad F_a^i = \mathrm{FC}(\mathrm{Norm}(\mathrm{MultiHead}(I_a^i, I_v^i, I_v^i)) + I_a^i) \tag{2.4}$$

Thus, the audio sub-block tends to focus on the information corresponding to vision. Consequently, the two sub-blocks learn important information about their own modality as well as their relationship; using such a mechanism, we capture the interaction between the visual and audio components. Finally, we combine the guided visual representation and the guided audio representation by applying an FC layer:

$$F_i = \mathrm{FC}(\mathrm{concat}(F_v^i, F_a^i)) \tag{2.5}$$

Therefore, the final output of the co-attention block is the joint representation of vision and audio for each segment $i$.

Sec2Sec Structure. To capture the temporal information in the video clip sequence, we feed the joint representation of each segment, $F_i$, from the co-attention block into an LSTM network, illustrated in Figure 2.6. The LSTM network is defined as follows:

$$\begin{aligned}
u_i &= \sigma(W_{Fu}F_i + W_{hu}h_{i-1} + b_u) \\
f_i &= \sigma(W_{Ff}F_i + W_{hf}h_{i-1} + b_f) \\
o_i &= \sigma(W_{Fo}F_i + W_{ho}h_{i-1} + b_o) \\
\tilde{c}_i &= \tanh(W_{Fc}F_i + W_{hc}h_{i-1} + b_c) \\
c_i &= f_i \odot c_{i-1} + u_i \odot \tilde{c}_i \\
h_i &= o_i \odot \tanh(c_i)
\end{aligned} \tag{2.6}$$

where $\sigma(\cdot)$ is an activation function and $\odot$ denotes the Hadamard product.
$W$ and $b$ are weights and biases learned during training. $h_i$ denotes the hidden state at step $i$; $u_i$, $f_i$, $o_i$ and $c_i$ denote the update gate, forget gate, output gate and cell state, respectively.

Predictor. We apply an FC layer along with a sigmoid function to the output of the LSTM network at the last step to make emotion predictions.

2.5 Experiments

2.5.1 Dataset

We use a large-scale publicly available dataset, LIRIS-ACCEDE [10], to evaluate the effectiveness of our proposed model. The dataset contains 9,800 videos extracted from 160 films, most of which come from the popular video-sharing platform VODO. The films are mainly in English, with a small set in 9 other languages subtitled in English; there are also 14 silent movies. The films cover 9 main categories, including action, comedy, drama, etc. The video clips last between 8 and 15 seconds. To annotate each video, the researchers recruited 1,517 annotators from 89 countries to minimize cultural impact on video emotions. After watching each video, each annotator rates valence and arousal on a scale from 1 to 9. The annotation of each video is the average of all individual annotations. In this paper, we adopt a binary classification approach based on the existing literature [1], where the threshold separating high from low arousal (or valence) is 5. The dataset is split 70-10-20% for training, validation and testing, respectively. For evaluation, we adopt two standard metrics for emotion classification tasks (valence and arousal): accuracy and F1 score.

Figure 2.6: The Sec2Sec Structure.

2.5.2 Implementation Details

We train all models on four NVIDIA GeForce 3090 24GB GPUs for 250 epochs. Our model is trained to minimize the binary cross-entropy loss with the Adam optimizer [45]. We set up an early stopping mechanism, where training stops if the validation loss increases for 5 consecutive epochs.
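The early-stopping rule just described can be sketched as follows; this is an illustrative helper (names are my own), not the authors' training code.

```python
class EarlyStopping:
    """Stop training when validation loss increases for `patience`
    consecutive epochs (patience=5, as in the text)."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.prev_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss > self.prev_loss:
            self.bad_epochs += 1   # loss went up again
        else:
            self.bad_epochs = 0    # loss improved (or held), reset the count
        self.prev_loss = val_loss
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=5)
losses = [0.9, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]
stop_epoch = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stop_epoch)  # 7: the loss rose at epochs 3-7, five epochs in a row
```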
We use a grid search strategy to find a relatively optimal set of hyperparameters. Specifically, the learning rate, batch size, and number of heads are searched within [1e-8, 1e-5], {8, 16, 32, 64, 128}, and {8, 16, 32, 64, 128}, respectively. In each experiment, we use the model with the best validation accuracy to report results on the holdout testing set. For the purpose of reproducibility, our implementation is publicly available at https://github.com/nestor-sun/sec2sec.

Table 2.3: Performance comparison of our model with baselines for arousal prediction.

Method | Modality | Accuracy | F1 Score | Avg Training Time Per Epoch (min)
Baselines
ViT [31] | Audio | 0.7823 | 0.8768 | 1:55
ViViT [9] | Vision | 0.7853 | 0.8795 | 1:32
CMA [22] | Audio and Vision | 0.5680 | 0.6768 | 4:50
AVM [47] | Audio and Vision | 0.7756 | 0.8722 | 4:49
ViT-ViViT | Audio and Vision | 0.7517 | 0.8541 | 3:31
Co-attention | Audio and Vision | 0.5599 | 0.6603 | 4:50
Variants
Sec2Sec Audio | Audio | 0.7832 | 0.8780 | 1:51
Sec2Sec Vision | Vision | 0.7766 | 0.8733 | 0:20
Sec2Sec SA-SA | Audio and Vision | 0.7990 | 0.8876 | 2:17
Sec2Sec SA-CA | Audio and Vision | 0.7949 | 0.8840 | 2:14

2.5.3 Baselines

We compare the performance of our proposed model (called Sec2Sec SA-CA) with several state-of-the-art methods.

Table 2.4: Performance comparison of our model with baselines for valence prediction.
Method | Modality | Accuracy | F1 Score | Avg Training Time Per Epoch (min)
Baselines
ViT [31] | Audio | 0.7022 | 0.8154 | 1:52
ViViT [9] | Vision | 0.7002 | 0.8234 | 1:29
CMA [22] | Audio and Vision | 0.6078 | 0.7033 | 6:05
AVM [47] | Audio and Vision | 0.7205 | 0.8287 | 4:49
ViT-ViViT | Audio and Vision | 0.69 | 0.8009 | 3:32
Co-attention | Audio and Vision | 0.5864 | 0.6688 | 4:50
Variants
Sec2Sec Audio | Audio | 0.7021 | 0.8191 | 1:50
Sec2Sec Vision | Vision | 0.6970 | 0.8179 | 0:20
Sec2Sec SA-SA | Audio and Vision | 0.7047 | 0.8179 | 2:14
Sec2Sec SA-CA | Audio and Vision | 0.7322 | 0.8372 | 2:15

CMA [22]: A cross-modal attention (CMA) Transformer network developed for audio-visual correspondence prediction. We train a CMA using both audio and vision.

AVM [47]: A bi-modal (audio and vision) deep network, consisting of R(2+1)D ResNet and ResNet networks, developed for emotion prediction. Specifically, a pre-trained R(2+1)D ResNet extracts visual features, and a ResNet extracts audio features. Lastly, a fully-connected feed-forward network fuses the two types of features for prediction.

ViT [31]: Vision Transformer (ViT) is a Transformer-based model for image classification that has demonstrated outstanding performance over convolutional neural networks. We fine-tune a pre-trained ViT using the audio component (treated as images), since ViT can only process individual images rather than a sequence of images.

ViViT [9]: Video Vision Transformer (ViViT) is a Transformer-based model designed for video classification that can capture spatio-temporal information. We train a ViViT using vision, since ViViT can process a sequence of images.

ViT-ViViT: We implement a bi-modal (audio and vision) network by combining ViT and ViViT to extract audio and visual features, respectively. The two types of features are concatenated and fed into a fully-connected feed-forward network for emotion prediction.
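The concatenate-then-FC fusion used by the ViT-ViViT baseline can be sketched as follows. This is a minimal NumPy stand-in: the real baseline uses ViT/ViViT embeddings and a trained network, whereas all sizes and weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def late_fusion_predict(audio_emb, visual_emb, W1, b1, W2, b2):
    """Concatenate the two modality embeddings and pass them through a
    small fully-connected network with a sigmoid output (binary emotion)."""
    x = np.concatenate([audio_emb, visual_emb])   # fused representation
    hidden = np.maximum(0.0, W1 @ x + b1)         # one ReLU layer
    logit = W2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))           # probability of "high"

d_a, d_v, d_h = 16, 16, 8                         # illustrative sizes
W1, b1 = rng.normal(size=(d_h, d_a + d_v)), np.zeros(d_h)
W2, b2 = rng.normal(size=d_h), 0.0
p = late_fusion_predict(rng.normal(size=d_a), rng.normal(size=d_v),
                        W1, b1, W2, b2)
print(0.0 < p < 1.0)  # True
```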
We also add a co-attention network as another baseline, together with 3 variants of our model, to understand the role of each design choice (e.g., uni-modal vs. bi-modal, co-attention).

Co-attention: It trains only a multi-head co-attention model without segmenting each video into one-second audio and visual clips.

Sec2Sec Audio: A multi-head attention model with 2 layers of self-attention that relies only on the audio component. Specifically, we first divide each audio track into n segments. Each segment goes through a pre-trained ResNet as an audio encoder. The output from the audio encoder for each segment is sent to two multi-head self-attention layers, and the output for each segment is then fed into an LSTM network. Finally, a fully-connected layer is added to predict emotions.

Sec2Sec Vision: A multi-head attention model with 2 layers of attention using only the visual component. Similar to Sec2Sec Audio, we first split each video into n visual segments. Each visual segment is fed into the visual encoder, and the corresponding output is sent to two multi-head self-attention layers. Lastly, an LSTM and a fully-connected layer are applied to predict emotions.

Sec2Sec SA-SA: A multi-head attention model with 2 attention layers using both audio and vision. Unlike the Sec2Sec SA-CA model, which uses one self-attention layer and one co-attention layer, this variant uses two self-attention layers to capture the intra-modal dependencies within segments.

2.5.4 Results

Overall performance. Tables 2.3 and 2.4 present the experimental results for arousal and valence, respectively. Our proposed Sec2Sec models achieve the best performance on both evaluation metrics for arousal prediction. They surpass three bi-modal (audio and vision) methods and the co-attention approach in terms of accuracy and efficiency, demonstrating the benefit of incorporating LSTM (Sec2Sec) into the video understanding framework.
They also outperform Sec2Sec Audio and Sec2Sec Vision, indicating that using both audio and visual components is more effective than using either modality alone. Moreover, Sec2Sec SA-SA and Sec2Sec SA-CA obtain comparable results, suggesting that the interaction between audio and visual features is not essential for predicting arousal. It is noteworthy that ViT-ViViT performs worse than ViT and ViViT, indicating that a single fully-connected layer fails to adequately capture the interaction between the audio embeddings and visual embeddings derived from ViT and ViViT. We have similar observations for valence prediction in terms of performance comparison with baselines.

Model Interpretability. We now turn to assessing the contribution of each video segment (i.e., every one-second clip) to emotion prediction. To do so, we modify the Sec2Sec structure by substituting the LSTM with an attention-based LSTM proposed by [100], adopting the same hyperparameters as the Sec2Sec Co-attention model. After training, we obtain the learned LSTM attention values and normalize them by applying a softmax function. The attention values of each video segment for valence and arousal are plotted in Figures 2.7a and 2.7b, respectively. Similar patterns are observed in both figures: the contribution to emotion prediction is highest for the last 3 seconds of a video, suggesting that perceived emotions are mostly influenced by the last 3 seconds. Moreover, the impact of video segments increases as they approach the end of a video.

Figure 2.7: The LSTM Attention: (a) Valence, (b) Arousal.

Figure 2.8: Effect of key hyperparameters of Sec2Sec SA-CA on accuracy and F1 score: (a) Accuracy regarding # of LSTM layers, (b) F1 regarding # of LSTM layers, (c) Accuracy regarding # of heads and batch size, (d) F1 regarding # of heads and batch size.
We hypothesize that when annotators rate each video, their decisions are dominated by its last 3 seconds, which aligns well with the theory of recency bias in psychology [61].

2.6 Ablation Study

To assess the impact of several key hyperparameters on model performance, we conduct additional experiments. Results are shown in Figure 2.8. Due to space limitations, we only report the results for arousal; valence shows similar patterns.

Impact of the number of heads and batch size. We examine model performance by varying the number of heads and the batch size simultaneously from 8 to 128. The model achieves the best accuracy when both the number of heads and the batch size are set to 8, while the best F1 score is achieved when both are set to 64 or 128.

Impact of the number of LSTM layers. We vary the number of LSTM layers from 1 to 5. Although the model with 4 LSTM layers achieves the best accuracy, the F1 score is not significantly different with 1, 4 or 5 LSTM layers. It is worth noting that more LSTM layers increase memory consumption if the batch size remains the same.

Layer vs. batch normalization. To examine the effect of normalization methods on perceived emotion recognition, we contrast layer normalization with batch normalization, which are commonly used in Transformers and computer vision models, respectively [42]. As Table 2.5 shows, layer normalization outperforms batch normalization for both arousal and valence predictions on accuracy and F1.

Table 2.5: Comparison of batch vs. layer normalization.

Method | Emotion | Accuracy | F1 score
layer | arousal | 0.7944 | 0.8840
batch | arousal | 0.7924 | 0.8832
layer | valence | 0.7271 | 0.8361
batch | valence | 0.7220 | 0.8322

2.7 Conclusion

In this study, we propose a novel Sec2Sec Co-attention Transformer model for perceived emotion classification, which leverages self-attention and co-attention mechanisms to encode and fuse multimodal features.
We have evaluated our model on the LIRIS-ACCEDE dataset and achieved better results compared with state-of-the-art baseline approaches. The results show the effectiveness of our Sec2Sec structure and the importance of inter-modal interaction for emotion prediction. We also introduced an attention-based LSTM mechanism to explore the contribution of each one-second clip of a video to the overall emotion prediction. Our work has several implications for multimodal emotion recognition research and applications. First, it demonstrates that Sec2Sec models can improve both performance and efficiency over traditional encoder-decoder models. Second, it reveals that co-attention can capture rich inter-modal relations that are essential for emotion prediction. Third, it provides a novel way to interpret model predictions by visualizing the attention weights over video segments. Future work can extend our model to other multimodal tasks such as video sentiment analysis and audio-visual alignment analysis.

Chapter 3: Multimodal Co-attention Transformers for Video-Based Apparent Personality Understanding

3.1 Introduction

In the dynamic landscape of communication and media consumption, video content has emerged as a dominant and influential medium. For example, in 2023, video content made up 82% of Internet traffic [79]. Video content is essential not only at a macro level but also at a micro level. Many social media platforms, such as TikTok and Instagram, have started providing video-sharing services, which play a critical role in people's daily lives. For instance, more than 78% of viewers consume video content every week, and 55% of them engage every day [79]. In addition, 93% of companies acquire new customers via social media videos [105]. In light of the burgeoning prevalence of video content, it becomes imperative to comprehend the personalities embodied by the presenter or influencer featured in each video.
The elucidation of such personalities holds substantial potential for enhancing the efficacy of subsequent predictive analytics. The personality traits of presenters or influencers can serve as robust predictors, thereby contributing to more accurate forecasts in downstream predictive applications. Thus, a thorough understanding of these personalities is not just beneficial, but essential for leveraging the full potential of predictive analytics in the realm of video content.

Figure 3.1: Images sampled from videos in the First Impressions dataset that exhibit varying degrees of personality traits. Panels (a)-(e) show high levels of O, C, E, A, and N, respectively; panels (f)-(j) show the corresponding low levels. Note: These traits are represented by the acronym OCEAN, where O stands for Openness, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. High trait levels are often recognized by a friendly face against a bright background, while low trait levels are recognized by an unhappy expression against a dark background.

Personality plays a pivotal role in shaping human interactions, decision-making processes, and overall behavioral patterns [6]. For instance, product recommendations and the effectiveness of word-of-mouth are largely affected by personality in digital marketing [2]. In human resources, personality can help predict a candidate’s suitability for a specific job [49]. In addition, [62] found that a CEO’s personality plays an important role in driving a company’s strategic flexibility. In the context of Information Systems, [28] found a significant relationship between personality and technology acceptance and adoption. These studies emphasize the relationship between personality and downstream outcomes. Given its impact, the study of video-based personality detection holds tremendous potential across various disciplines, such as psychology, marketing, human-computer interaction, and social sciences.
By discerning the personalities projected through videos, researchers and practitioners can gain valuable insights into how individuals are perceived by others, the effectiveness of persuasive communication strategies, and the influence of personality on audience engagement [2, 49, 72].

Figure 3.2: Audio waveforms in the First Impressions dataset that exhibit varying degrees of personality traits. Panels (a)-(e) show high levels of O, C, E, A, and N, respectively; panels (f)-(j) show the corresponding low levels. Note: These traits are represented by the acronym OCEAN, where O stands for Openness, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. Individuals with high trait levels often have a high voice when speaking, whereas individuals with low trait levels have a low voice.

Moreover, as social media platforms proliferate and user-generated video content becomes increasingly prevalent, video-based personality detection becomes a valuable tool for understanding broader societal trends, cultural influences, and collective attitudes. The recent surge in interest in video-based apparent personality trait prediction has underscored several non-trivial challenges, primarily due to the unique characteristics inherent to the video-based personality setting. Firstly, a video typically comprises three types of information: visual, auditory, and textual. Each of these modalities may contain crucial information that could significantly enhance the accuracy of predictions. Research has demonstrated that personality traits are evident in appearance, expression, and voice [51, 63, 76]. For the purpose of illustration, Figures 3.1 and 3.2 depict distinct visual and acoustic patterns corresponding to different levels of the Big-Five personality traits. Secondly, [44] posits that capturing modality interactions is essential for making accurate predictions.
This is further corroborated by the McGurk effect [58], which underscores the importance of the interaction between auditory and visual modalities. We contend that capturing the interactions among all three modalities is crucial, as a good alignment can synergistically aid audiences in better inferring personalities, while a poor alignment can adversely affect the perception of personalities. Thirdly, existing approaches often necessitate high-resolution images, typically of dimension 224 × 224, which can be computationally expensive. Finally, deep learning models are often criticized for being ‘black boxes’ due to their lack of interpretability. However, in the context of video-based perceived personality settings, it is essential to provide interpretability that offers practical implications both for presenters and influencers and for researchers and platforms. This interpretability not only demystifies the underlying mechanisms but also facilitates more informed decision-making processes. Several models have been proposed. For example, [102] proposes a bi-modal network to process visual and audio information and predict personality in videos. A tri-modal network has also been proposed to predict video personality by taking in visual, audio, and text information [83]. However, most of these works train a different model for each modality independently and combine predictions using ensemble methods, such as taking an average of the predictions generated by different modalities. More importantly, all the models mentioned above require high-resolution images (e.g., 224 × 224) in order to perform well. Processing high-resolution images is computationally expensive. To address these limitations and improve prediction accuracy, in this paper, we propose a Multimodal Co-attention network based on the multi-head self-attention mechanism proposed in the Transformer [97].
Specifically, we develop a visual encoder extended from [31], along with a newly proposed hierarchical positional encoding mechanism, to efficiently extract visual features, and two linear regressors to extract audio and text features. We further develop a Multimodal Co-attention Transformer to efficiently understand the complex interactions among the visual, audio, and text components. To evaluate our work, we conduct experiments on a real-world dataset, First Impressions, with 10,000 videos to demonstrate the usefulness and value of the proposed model. The experimental results show that the proposed model not only outperforms seven state-of-the-art baselines but also reduces computational costs. Furthermore, we conduct a series of interpretability analyses to demonstrate the model’s decision process. Our analysis uncovers useful factors that can be used to predict personality traits at the modality, vision, and image levels, which can serve as a guideline for presenters, influencers, and platforms that seek to improve perceived personality. To conduct our interpretability analysis, we calculate the contributions of inputs by computing the Integrated Gradients proposed by [88] for each input. For the modality-level interpretability analysis, the contributions of the inputs are aggregated into three modalities: audio, vision, and text. Our results show that text information is less important than the other two modalities. The relative importance of audio and vision depends on the specific personality trait. For agreeableness, neuroticism, and openness, audio is more important than vision. For extraversion, vision is more important than audio. For conscientiousness, audio and vision are equally important. For the vision-level interpretability analysis, the contributions of visual inputs are aggregated by the time point of each input image, which enables us to investigate at which time point an image is more important for predicting personality traits.
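The Integrated Gradients attribution [88] underlying these analyses can be sketched with a minimal NumPy example. The toy model and its analytic gradient below are illustrative stand-ins, not the trained encoders; the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    # Riemann-sum approximation of
    # IG_i = (x_i - x'_i) * integral_0^1 df(x' + a(x - x'))/dx_i da
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable model: f(x) = sum(x^2), so grad f(x) = 2x.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
x, baseline = np.array([1.0, 2.0, 3.0]), np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline) = 14.
print(attr.sum())  # 14.0
```

Summing the resulting per-input attributions over a modality's inputs gives the modality-level contributions described above; grouping them by frame time point or image region gives the vision- and image-level analyses.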
For agreeableness, extraversion, neuroticism, and openness, the importance of images decreases over time, indicating that the first impression does matter when perceiving personality. Interestingly, for conscientiousness, the importance of images increases over time, suggesting that impressions formed later in the video matter more for this trait. For the image-level analysis, the contributions of visual inputs are aggregated into different regions of an image, which allows us to study the importance of different regions. The results show that hand movements and the background are more important than faces for predicting personalities. We believe the findings from the interpretability analyses serve as guidelines for influencers and presenters to better design and create video content, and for audiences to infer personalities. In addition to the interpretability analysis, we use a real-world case study to showcase the usefulness of the proposed model. Specifically, we collected MBA admission data from a major university in the United States. In the application process, each applicant is required to record an up-to-one-minute video to answer a specific question. The staff uses the video as evidence to evaluate the communication skills and English proficiency of each candidate. We utilize our model to generate the five personality predictions for each candidate and examine the impact of the predicted personality traits, as a persuasion tool, on the admission outcome. Our results show that candidates perceived as conscientious, extroverted, and agreeable are associated with a higher chance of being admitted. Among these three traits, being perceived as agreeable is associated with an even higher probability of admission. In summary, this study makes the following contributions. First, we introduce a Multimodal Co-attention network, coupled with a novel hierarchical positional encoding mechanism. This architecture adeptly processes information from the visual, acoustic, and textual modalities.
Our approach outperforms state-of-the-art baselines, demonstrating strong performance. Second, we validate our proposed model’s ability to extract valuable insights from low-resolution (64 × 64) images. Notably, even with a compact latent representation space of just 512 dimensions, our model excels. Moreover, it achieves this while demanding minimal training time. Third, we conduct rigorous interpretability analyses, shedding light on the decision-making process of our method. This offers a deeper understanding of its working mechanism, enhancing its transparency. Finally, to underscore the practical utility of video-based personality detection, we present a case study demonstrating the model’s efficacy in predicting MBA admission outcomes. The remainder of the paper is organized as follows. In Section 2, we discuss prior work on personality prediction and Transformers in computer vision, as well as recent work on multimodal learning. In Section 3, we give a brief overview of the Transformer, followed by the proposed model in Section 4. In Section 5, we present the experimental details. Section 6 presents the results. We conclude our study in Section 7.

3.2 Related Work

3.2.1 Personality

Personality is often defined as a distinct combination of cognitive, affective, and behavioral traits [6]. It is considered to be relatively stable compared to emotion [24]. In the business-related literature, the Big-Five personality traits (OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) are widely used to describe a person’s personality [30]. They are defined as follows [35]:

• Openness emphasizes imagination and insight.
• Conscientiousness denotes organization and responsibility.
• Extraversion represents sociability and energy.
• Agreeableness reflects compassion and trust.
• Neuroticism involves anxiety and depression tendencies.
The influence of personality traits on various aspects of life and decision-making processes has been extensively studied in the literature. For instance, the personalities of chief executive officers (CEOs) have been found to significantly correlate with their companies’ financial outcomes, such as cash holdings, investment, and interest coverage [109]. In particular, conscientiousness has been negatively associated with a company’s strategic flexibility, while agreeableness, extraversion, and openness have shown positive associations [62]. The relationship between personality traits and online shopping behaviors has also been explored. Consumers with higher degrees of neuroticism, agreeableness, or openness tend to be utility-motivated to shop online [96]. Furthermore, hedonic purchase motivation is positively influenced by neuroticism, extraversion, and openness [96]. In the realm of information-seeking tasks, individuals high in conscientiousness performed fastest, followed by those high in agreeableness and extraversion [4]. Moreover, personality traits have been linked to social media usage and engagement. Specifically, openness and extraversion are the two most significant positive predictors of social media use. Conscientiousness, agreeableness, and neuroticism were also considered important, but to a lesser degree [48]. These studies underscore the crucial relationship between personality traits and decision-making choices across various domains. One of the biggest limitations of these studies is how personality traits are derived. The majority of the literature requires the completion of long questionnaires to determine personality traits [26], which is time-consuming and burdensome. With the recent data explosion in user-generated content, such as on social media, it has become almost impossible to conduct questionnaire-based research. A lot of effort has been devoted to automatic personality detection from text.
Early works used word counts to classify personality. For instance, [2] use Linguistic Inquiry and Word Count (LIWC) to classify personality from text. Recently, research has started adopting deep learning techniques to predict personality traits. One advantage of employing deep learning methods is their ability to learn word embeddings that capture rich contextual information in text, facilitating the learning of document-level representations by the models. For instance, a deep convolutional neural network (CNN) has been developed to predict personality from text and has been demonstrated to outperform traditional machine learning techniques [86]. [111] found that CNNs outperform recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent units (GRU), in predicting personality. Attention techniques have also been incorporated into CNNs to enhance their performance. For example, word-level attention has been proposed to learn document-level semantic features [108], while message-level attention has been employed to leverage the relative weight of users’ social media posts, yielding impressive results [57]. In addition, [37] utilize three pre-trained language models, BERT, RoBERTa, and XLNet, to predict personality from text by averaging their predictions. [109] develop a hierarchical attention network to classify personality from text. However, even though video content has been booming, only a few studies focus on video-based personality detection. In the next section, we discuss current video-based personality efforts as well as their limitations.

3.2.2 Video-Based Deep Personality Prediction

Personality prediction has recently emerged as a popular research area, with a focus on utilizing deep learning techniques to predict personality traits from unstructured data sources such as text and video.
Little research has focused on predicting personality traits from user-posted social media videos. These videos typically consist of at least two modalities, vision and audio, with some also including text. Various methods have been proposed to process and combine visual and audio data. Most existing video personality prediction models extract information and make predictions from each data source (vision, audio, or text) separately and then employ ensemble methods, such as averaging, to combine predictions. For instance, [102] developed a Descriptor Aggregation network to predict personality traits from video-sampled images and a linear regressor to predict personality traits from audio, averaging the predictions from these two models to make final predictions. [83] utilized pre-trained VGG-16 and ResNet models to predict personality from audio and images respectively, a linear regressor to predict personality from text, and averaged the predictions from all three models to make final predictions. Another approach involves extracting features from each source and using a fully connected feed-forward network to fuse embeddings from two or three modalities. [82] proposed two techniques for predicting video personality traits: one using a 3D convolution network to extract visual features and a linear regressor to extract audio features, with a fully connected network fusing the two modalities to make predictions; the other splitting a video into several equal-length parts and using a linear regressor and a CNN to extract audio and visual features respectively for each part, with a fully connected feed-forward network combining the embeddings as the latent representation for that part before entering an LSTM network sequentially to make final predictions. [36] developed two CNNs to extract audio and visual features and employed a fully connected network to combine the embeddings and make predictions.
However, the majority of these models fail to capture the interactions between audio and vision, which is crucial for multimodal learning [44]. More importantly, these models require high-resolution images (e.g., 224 × 224), which is computationally expensive to process.

3.2.3 Transformers in Computer Vision

The Transformer model, initially proposed for machine translation tasks in the realm of natural language processing (NLP) [97], has seen a surge of interest for its application in computer vision (CV) tasks, positioning it as an alternative to convolutional neural networks (CNNs). Several studies have made significant strides in this area. For instance, the Vision Transformer (ViT) [31] applies a Transformer model to linearly projected sequences of image patches for full-image classification. The Swin Transformer enhances the ViT by introducing a hierarchical Transformer architecture coupled with a shifted window scheme [53]. These models serve as two representative Transformer-based models for image classification tasks. In the context of video classification tasks, the Video Vision Transformer (ViViT) extends the ViT by proposing two methods for embedding video samples: uniform frame sampling and tubelet embedding. It also introduces four model variants based on the Transformer: spatiotemporal attention, factorized encoder-decoder, factorized self-attention, and factorized dot-product attention [9]. The Video Swin Transformer further extends the Swin Transformer by introducing a 3D-shifted-window-based multi-head self-attention module and a locality inductive bias to the self-attention module [54]. Since the advent of pure Transformer-based models for computer vision tasks, they have been adopted for a diverse range of applications, including semantic segmentation [113], action recognition [14], and object detection [16]. This underscores the versatility and efficacy of Transformer models across various domains.
3.2.4 Multimodal Learning

Multimodal learning, a deep learning technique, involves the assimilation of information from diverse modalities such as images, text, audio, and video. Given the inherent multimodal nature of videos, a substantial body of literature has focused on learning a joint representation of audio and vision to predict audio-visual synchronization. For instance, [8] trained an audio-visual cross-modal network from scratch to predict video correspondence. [7] utilized one clustered modality as a supervisory signal for another modality and predicted correspondence between the two modalities. [22] further developed three self-supervised co-attention-based networks to discriminate visual events related to audio events. However, only a handful of research studies have concentrated on handling vision, audio, and text. For example, [5] applies contrastive learning to vision, audio, and text to learn video-level representations for self-supervised learning tasks. Most multimodal learning models employ a standard Transformer as the backbone network to learn the interactions among different modalities. For instance, VATT [3] proposes a Transformer-based self-supervised learning model that can process audio, visual, and text information and uses a standard Transformer model as the backbone network. [34] develops an omnivore network that uses a standard Transformer network to learn representations from images, videos, and 3D view data. While it is relatively straightforward to train a standard Transformer in terms of implementation, the computational cost can escalate when the latent representation space enlarges. Our method aims to enrich the multimodal learning literature by proposing a Multimodal Co-attention Transformer that can efficiently process information from three different modalities.

3.3 Preliminaries

In this section, we provide a brief explanation of the functionality of multi-head self-attention.
The key component of the Transformer [97] maps a query vector Q, a key vector K, and a value vector V to an output (embedding), as demonstrated in Equation 3.1.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \quad (3.1)

Self-attention is achieved by computing the similarity scores between the query matrix Q and the key matrix K using the scaled dot product. This is obtained by dividing the dot product of Q and K by the square root of the dimension of Q, denoted by d. These scores are then converted into probabilities by applying a softmax function. The resulting probabilities are used to weight the values in the value matrix V, producing the final output for the next layer. The underlying principle of this mechanism is to assign greater importance to vectors with higher probabilities in V in subsequent layers. Despite its effectiveness, a single self-attention layer may constrain a model’s capacity to attend to multiple positions simultaneously without sacrificing attention to other positions. A multi-head attention mechanism is introduced to address this limitation, which has been shown to enhance overall model performance. This mechanism comprises h parallel self-attention sub-layers, referred to as ‘heads’, each of which learns distinct query, key, and value matrices. These heads project input features into different subspaces, and their output features are concatenated into a single matrix for subsequent processing by downstream layers, as shown in Equation 3.2.

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_h)W^O, \quad \text{where } h_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (3.2)

where W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_{model}} are learnable weights for each head i, and W^O \in \mathbb{R}^{h \cdot d_{model} \times d_{model}} is the projection weight. A fully connected feed-forward network is applied after the self-attention layer. In addition, a residual connection and a normalization function are applied to each sub-layer.
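Equations 3.1 and 3.2 can be sketched as a minimal NumPy example. The randomly initialized matrices below stand in for the learned weights W_i^Q, W_i^K, W_i^V, and W^O; the dimensions are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Equation 3.1: softmax(QK^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Q, K, V, heads, rng):
    # Equation 3.2: h independent heads, concatenated and projected by W^O.
    d = Q.shape[-1]
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
        outs.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(heads * d, d)) * 0.1
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))              # 5 tokens, d = 16
out = multi_head(x, x, x, heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Passing the same matrix as query, key, and value, as above, yields the self-attention case used throughout the model.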
3.4 Our Model

In this section, we introduce our proposed model, which is designed to extract and analyze visual, acoustic, and textual representations, as well as the interactions among these three modalities. As illustrated in Figure 3.3, our model comprises three primary components:

1. Encoder Network: The encoder network is composed of visual, audio, and text encoders that are responsible for extracting the respective features from each modality.

2. Multimodal Co-attention Transformer Network: This network is designed to capture the interactions among the three modalities through the use of a multimodal co-attention mechanism.

3. Predictor: The output from the Multimodal Co-attention Transformer Network is fed into a fully connected feed-forward network, which generates predictions regarding personality traits.

Figure 3.3: An overview of our proposed model: Multimodal Co-attention Transformer.

Visual encoder and hierarchical positional encoding. For the visual encoder, we build upon the work of the Vision Transformer (ViT) [31]. Our visual encoder takes as input a 3-channel Red-Green-Blue (RGB) representation of n sampled image frames, with a size of [3, H, W]. Each image is partitioned into patches of size [h, w], resulting in a total of [H/h] × [W/w] patches. Additionally, we propose a hierarchical positional encoding mechanism to incorporate positional information into the model. In this study, we utilized a sample size of 100 images per video. In contrast to the majority of studies, which resize images to a higher resolution of [224, 224], we opted for a lower resolution of [64, 64] for each image. The patch size is a hyperparameter that requires tuning. In the visual encoder, which lacks both recurrence and convolution, a hierarchical positional encoding method is introduced to encode the position of each patch within each sampled image and the position of each image within each video.
This enables the model to comprehend the position of each patch or image. Specifically, the hierarchical positional encoding method encompasses two components: patch positional encoding and image positional encoding. Both encodings possess the same dimension as each patch, allowing the encodings to be added directly to the patch embeddings. We build upon the positional encoding, PE_{pos,i}, proposed in [97], which uses sine and cosine encoding functions written as:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}) \quad (3.3)
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d}) \quad (3.4)

where pos is the position and i is the dimension. Based on the equations above, we obtain the positional encodings for patch p in each image and for image m in each video, PE_{p,i} and PE_{m,i}. Together, the hierarchical positional encoding for patch p in image m of a video is

PE_{p,m} = PE_{pos_p, i} + PE_{pos_m, i} \quad (3.5)

After injecting the positional encoding, each patch is linearly projected to a latent representation with dimension l.
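The patch partitioning and the hierarchical positional encoding of Equations 3.3-3.5 can be sketched as follows in NumPy. The patch size of 8 and the frame position are illustrative assumptions, not the tuned values.

```python
import numpy as np

def sinusoidal_pe(pos, d):
    # Equations 3.3-3.4: sine on even dimensions, cosine on odd dimensions.
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def to_patches(img, h, w):
    # Split a [C, H, W] image into [H/h * W/w] flattened patches of size C*h*w.
    C, H, W = img.shape
    return (img.reshape(C, H // h, h, W // w, w)
               .transpose(1, 3, 0, 2, 4)
               .reshape(-1, C * h * w))

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 64, 64))      # one sampled 64x64 RGB frame
patches = to_patches(img, 8, 8)         # 64 patches of dimension 3*8*8 = 192
d = patches.shape[1]

# Equation 3.5: each patch's hierarchical encoding is the sum of its
# patch-level and image-level sinusoidal encodings, added to the patch.
image_pos = 42                          # position of this frame in the video
encoded = np.stack([
    patch + sinusoidal_pe(p, d) + sinusoidal_pe(image_pos, d)
    for p, patch in enumerate(patches)
])
print(encoded.shape)  # (64, 192)
```

Because both encodings share the patch dimension, the two additions in the loop implement Equation 3.5 directly; the linear projection to the latent dimension l would follow this step.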
The latent representation for each patch is then concatenated to form the visual embeddings with a dimension of l × [H/h] × [W/w].

Audio encoder. For the audio modality, our approach involves several stages. First, we extract the raw audio signal from each video. Subsequently, all audio signals are re-sampled from their original rate of 44.1 kHz to a standard rate of 16 kHz. From these re-sampled signals, we extract 2-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) [27], which are designed to model how the human ear perceives and distinguishes between different sound frequencies [71]. These 2-dimensional MFCCs are then flattened into 1-dimensional representations for input into the audio encoder. We conducted experiments with various audio features, including log filter banks and raw audio waveforms, among others. Our results indicated that MFCCs provided the best performance, and thus we selected them as our audio representations. The extracted MFCC features are fed into a fully connected feed-forward network to obtain audio embeddings.

Text encoder. In our approach to processing text data, we begin by applying standard text processing procedures. These procedures include tokenizing the text data, converting all words to lowercase, removing English stopwords, and performing stemming and lemmatization on the words. We conducted experiments with various text extraction techniques, including one-hot encoding, bi-gram encoding, and the use of pre-trained text encoders such as BERT [29]. Our results indicate that one-hot encoding provides the best performance compared to other encoding techniques. Hence, we use one-hot encodings as text representations. Similar to the audio encoder, the extracted one-hot encoding is fed into a fully connected feed-forward network to obtain text embeddings.

Multimodal Co-attention Transformer Network. The visual, audio, and text representations are input into three symmetric multimodal co-attention sub-blocks.
Each sub-block comprises a standard multi-head self-attention module and the proposed multimodal co-attention module. Layer normalization and a residual connection are applied following each attention module, and a fully connected feed-forward network is also incorporated. The multi-head self-attention module independently identifies salient features from each modality. In contrast, the multimodal co-attention module learns the significant features of the interactions between the other two modalities, guided by the guiding modality. The residual connection serves to stabilize the network and, more importantly, combines the guiding modality’s representation with the joined representations, preserving information from all three modalities. In the visual sub-block, the extracted visual embeddings from the visual encoder are first fed into a multi-head self-attention module to obtain an intermediate representation, I_v, containing salient visual information. Similarly, the intermediate representations for audio and text, I_a and I_t, are obtained by feeding the extracted audio and text embeddings into the multi-head self-attention modules in their corresponding sub-blocks. Specifically, I_v, I_a, and I_t are calculated as follows:

I_v = FC(Norm(MultiHead(z_v, z_v, z_v) + z_v))
I_a = FC(Norm(MultiHead(z_a, z_a, z_a) + z_a))
I_t = FC(Norm(MultiHead(z_t, z_t, z_t) + z_t))   (3.6)

where z_v, z_a, and z_t denote the output features from the visual encoder, the audio encoder, and the text encoder, respectively.
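Under simplifying assumptions (single-head attention, the feed-forward layers omitted, illustrative dimensions), one sub-block of the network can be sketched in NumPy as follows: a self-attention step on the guiding modality, then a co-attention step in which the guiding modality queries the two other modalities stacked as key and value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sub_block(z_guide, z_other1, z_other2):
    # Self-attention step (Equation 3.6), with residual + layer norm.
    I_g = layer_norm(attention(z_guide, z_guide, z_guide) + z_guide)
    I_1 = layer_norm(attention(z_other1, z_other1, z_other1) + z_other1)
    I_2 = layer_norm(attention(z_other2, z_other2, z_other2) + z_other2)
    # Co-attention step: the guiding modality ([1, d] query) attends over the
    # other two modalities stacked as a [2, d] key/value; the residual keeps
    # the guiding modality's own information.
    others = np.vstack([I_1, I_2])
    return layer_norm(attention(I_g, others, others) + I_g)

rng = np.random.default_rng(0)
d = 16
z_v, z_a, z_t = (rng.normal(size=(1, d)) for _ in range(3))
F_v = sub_block(z_v, z_a, z_t)   # visual sub-block, guided by vision
print(F_v.shape)  # (1, 16)
```

The audio and text sub-blocks are symmetric: each calls `sub_block` with its own modality as the guide and the remaining two as the stacked key/value.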
Next, in the visual sub-block, I_a and I_t are stacked to form a joined representation, denoted as [I_a, I_t], with a dimension of [2, d]. Concurrently, the dimension of I_v is expanded from 1-dimensional to 2-dimensional (i.e., from d to [1, d]). The multimodal co-attention module is then fed with I_v as the query and [I_a, I_t] as both key and value. The resulting dot product between the query and key represents the similarities between the audio, text, and visual embeddings, with a dimension of [1, 2], indicating the relative salience of the audio and text embeddings with respect to the visual embeddings. These similarity scores are then multiplied by [I_a, I_t] to obtain a joined representation guided by the guiding modality, I_v. Similarly, joined representations of [I_a, I_v] guided by I_t and [I_v, I_t] guided by I_a can also be obtained. Together with a residual connection, layer normalization, and a fully connected feed-forward network, the joined representations for each modality (F_v, F_a, F_t) are calculated as follows:

F_v = FC(Norm(MultiHead(I_v, [I_a, I_t], [I_a, I_t]) + I_v))
F_a = FC(Norm(MultiHead(I_a, [I_v, I_t], [I_v, I_t]) + I_a))
F_t = FC(Norm(MultiHead(I_t, [I_v, I_a], [I_v, I_a]) + I_t))   (3.7)

Predictor. At the last step, we apply a fully connected feed-forward network along with a sigmoid function to the output of the Multimodal Co-attention network to make personality predictions, computed as follows:

Personality = Sigmoid(FC(Concat(F_v, F_a, F_t)))   (3.8)

3.5 Experiments

3.5.1 Dataset

To assess the efficacy of our approach, we conducted experiments on a large-scale dataset: First Impressions [67]. The First Impressions dataset is a widely used benchmark in the field of apparent personality analysis and was employed in the ECCV 2016 personality trait recognition competition. It consists of 10,000 labeled video clips extracted from over 3,000 YouTube videos, with 6,000 designated for training and 2,000 each for validation and testing.
The dataset provides tri-modal information in the form of audio, visual, and text modalities. The average length of each video is 15 seconds, and the majority have a resolution of [1280, 720]. Each video features a single individual speaking English in front of a camera. Ground-truth annotations for the Big-Five personality traits (extraversion, agreeableness, conscientiousness, neuroticism, and openness) are provided as fractional scores ranging from 0 to 1. The ECCV competition organizers obtained the annotations via Amazon Mechanical Turk. The summary statistics are shown in Table 3.1.

Table 3.1: Summary Statistics for