ABSTRACT

Title of Dissertation: RECOGNIZING OBJECT-CENTRIC ATTRIBUTES AND RELATIONS

Khoi Viet Pham

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science

Recognizing an object's visual appearance through its attributes, such as color and shape, and its relations to other objects in an environment is an innate human ability that allows us to effortlessly interact with the world. This ability remains effective even when humans encounter unfamiliar objects or objects whose appearances evolve over time, as humans can still identify them by discerning their attributes and relations. This dissertation aims to equip computer vision systems with this capability, empowering them to recognize objects' attributes and relations and thereby become more robust in handling real-world scene complexities.

The thesis is structured into two main parts. The first part focuses on recognizing attributes of objects, an area where existing research is limited to domain-specific attributes or constrained by small-scale and noisy data. We overcome these limitations by introducing a comprehensive dataset for attributes in the wild, marked by challenges in attribute diversity, label sparsity, and data imbalance. To navigate these challenges, we propose techniques that address class imbalance, employ attention mechanisms, and utilize contrastive learning to align objects with shared attributes. However, as such a dataset is expensive to collect, we also develop a framework that leverages large-scale, readily available image-text data for learning attribute prediction. The proposed framework can effectively scale up to predict a larger space of attribute concepts in real-world settings, including novel attributes represented by arbitrary text phrases that are not encountered during training. We showcase various applications of the proposed attribute prediction frameworks, including semantic image search and object image tagging with attributes.

The second part delves into the understanding of visual relations between objects. First, we investigate how the interplay of attributes and relations can improve image-text matching. Moving beyond the computationally expensive cross-attention networks of previous studies, we introduce a dual-encoder framework using scene graphs that is more efficient yet equally powerful on current image-text retrieval benchmarks. Our approach produces scene graph embeddings rich in attribute and relation semantics, which we show to be useful for image retrieval and image tagging. Lastly, we present our work on training large vision-language models on image-text data for recognizing visual relations. We formulate a new subject-centric approach that predicts multiple relations simultaneously conditioned on a single subject. Our approach is among the first to learn from both weakly- and strongly-grounded image-text data to predict an extensive range of relationship classes.

RECOGNIZING OBJECT-CENTRIC ATTRIBUTES AND RELATIONS

by

Khoi Viet Pham

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Hernisa Kacorri, Dean's Representative
Professor Larry S. Davis
Professor Ramani Duraiswami
Dr. Zhe Lin, Adobe Research
© Copyright by Khoi Viet Pham 2023

Acknowledgments

First and foremost, I would like to express my gratitude to my advisor, Professor Abhinav Shrivastava, for his support and guidance throughout my PhD journey. Abhinav granted me tremendous freedom to pursue research topics that resonated with my interests, and provided the security that allowed me to focus solely on doing research. This was particularly invaluable during my difficult times. Abhinav has cultivated an outstanding research environment, equipping his students with a multitude of knowledge and skills. Even now, I am still mastering the things he taught and continuously seeking improvement, which fuels my inspiration and anticipation for my journey post-graduation. Abhinav has taught me to value and cherish the PhD experience as a once-in-a-lifetime opportunity, one that fosters my professional development as a researcher and my personal growth to become a better person. This was a great source of inspiration to continue this journey, and for that, I am profoundly grateful.

I would also like to thank Professor Larry Davis for giving me the opportunity to join the computer science graduate program at UMD, and for providing guidance at the beginning, when I was still a junior student with limited knowledge and experience. Professor Larry is knowledgeable in many research areas, always provides interesting insights, and asks surprising and thought-provoking questions that make every discussion with him a chance to learn something new. I appreciate the opportunity to work with him during the early stage of my PhD journey.

I am also grateful to Professor Ramani Duraiswami, Professor Hernisa Kacorri, and Dr. Zhe Lin for graciously dedicating their time to serve on my dissertation committee. Their review of my manuscript and their insightful feedback have contributed to the quality of the dissertation.

My summer internships have been pivotal in my growth, particularly the time spent working alongside exceptional researchers at Google and Adobe Research. I extend my appreciation to Junfeng He at Google, and Kushal Kafle, Zhihong Ding, Scott Cohen, Zhe Lin, Quan Tran, and Walter Chang at Adobe Research, for their mentorship and the learning experience of doing research in industry. I am particularly thankful to Junfeng and Kushal for their dedicated mentorship on academic writing and their hands-on assistance with writing papers for conference submissions. These internships not only shaped me into a better researcher and improved my abilities in communication and presentation through continuous meetings, but also provided substantial financial support for which I am immensely grateful.

I would like to express my deepest gratitude to my wife, Uyên, for her unconditional support and for being patient and understanding with all my decisions. I would like to thank her for staying with me through such a long and challenging journey: from living together on a single PhD stipend, cooking and preparing meals for me, and providing emotional support, to finding a full-time job herself and continuing to support me on the journey. She is the unseen force behind the scenes that makes the completion of this journey possible, for which I will be eternally grateful. I also want to express my appreciation to my family.
My father and my mother laid the foundations of the person I am today: they ensured that I obtained a good education, ignited the spark of scientific curiosity when I was young, and taught me the work ethic of what is right and what is wrong. I thank my younger brother for being the one who takes care of the family while I am still spending time in school pursuing dreams of my own. I also carry the memory of departed loved ones - my grandmother, with whom I cherished many childhood memories, and my father-in-law, who exemplified the depth of familial love - who have forever shaped how I want to live my life in the future. I am truly blessed to have my family, and it is to them that I lovingly dedicate this dissertation.

I would like to extend my thanks to friends in the lab, including Hao, Nirat, Pulkit, Bo, Chương, Lilian, Kamal, Soumik, Moustafa, Alex, Yixuan, Shuaiyi, Max, Hanyu, Vinoj, Matthew G., Shishira, Ahmed, Matthew W., Sharath, Saksham, Mara, Saketh, Namitha, Archana, Anubhav, Vatsal, Gaurav, Pravin, Luyu, and Varuni, each person with distinct admirable qualities that I want to learn from. Everyone has been a part of my memories on this journey. In particular, I want to express my appreciation to Nirat and Chương for the collaboration on several research projects. There are also friends outside the lab who shared the doctoral journey and made my early years much more enjoyable; I would like to thank Pattara, Yuheng, Jingling, Jun, Amin, and Rangfu. I would also like to thank my Vietnamese friends at UMD, who showed me there exists life outside of graduate school, including Phong, Khánh, Huyền, Cường, Phụng, Nhật, Chương, Quỳnh, Trí, Khoa, Thủy, Tín, Huy, Đăng, and Phương. My special gratitude goes to Cường and Phụng, and their parents, for generously hosting me on numerous occasions and for making my last semester so joyful. I also wish to express my gratitude to my dear high-school friend Long for the financial support during my tough times.

Last but not least, I extend my sincere thanks to Tom Hurst, Jodie Gray, Migo Gui, and the ISSS officials for all the timely support with the logistics and student visa-related issues.

I acknowledge it is impossible to remember all those who have helped me. If I have inadvertently omitted anyone, please accept my apologies. Your support, whether mentioned here or not, has been invaluable and deeply appreciated.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Overview
1.2 Visual Attributes of Objects
1.3 Visual Relations between Objects

Chapter 2: Closed-Set Attribute Prediction
2.1 Related Work
2.2 Visual Attributes in the Wild Dataset
2.2.1 Data Collection
2.2.2 Statistics
2.3 Approach
2.3.1 Model Architecture
2.3.2 Training Objectives
2.3.3 Negative Label Expansion
2.3.4 Supervised Contrastive Learning
2.4 Experiments
2.4.1 Implementation Details
2.4.2 Evaluation Metrics
2.4.3 Baselines
2.4.4 Results
2.4.5 Ablation Studies
2.4.6 Qualitative Results
2.5 Discussion

Chapter 3: Open-Vocabulary Attribute Prediction
3.1 Related Work
3.2 Large-Scale Attribute Dataset
3.3 Approach
3.3.1 Model Architecture
3.3.2 Training Objectives
3.4 Experiments
3.4.1 Implementation Details
3.4.2 Closed-Set Attribute Prediction
3.4.3 Open-Vocabulary Attribute Prediction
3.4.4 Ablation Studies
3.4.5 Closed-Set Human-Object Interaction Classification
3.5 Discussion

Chapter 4: Compose Object Relations and Attributes for Image-Text Matching
4.1 Related Work
4.2 Approach
4.2.1 Overall Framework
4.2.2 Feature Extraction
4.2.3 Scene Graph Embedding
4.2.4 Training Objectives
4.3 Experiments
4.3.1 Implementation Details
4.3.2 Dataset and Evaluation Metrics
4.3.3 Results
4.3.4 Ablation Studies
4.3.5 Text-to-Entity Retrieval
4.3.6 Inference Time Analysis
4.3.7 Qualitative Results
4.4 Discussion

Chapter 5: Subject-Centric Relationship Prediction from Image-Text Data
5.1 Related Work
5.2 Approach
5.2.1 Problem Formulation
5.2.2 Model Architecture
5.2.3 Training Objectives
5.2.4 Learning from Image-Text Data
5.3 Experiments
5.3.1 Implementation Details
5.3.2 Closed-Set Relationship Classification
5.3.3 Open-Vocabulary Relationship Classification
5.3.4 Ablation Studies
5.3.5 Qualitative Results
5.4 Discussion

Chapter 6: Summary and Discussion
6.1 Summary
6.2 Discussion

Bibliography

List of Tables

2.1 Statistics of VAW compared with other in-the-wild and domain-specific attribute datasets. The *person (resp. *clothes) category may represent multiple categories, including {boy, girl, man, woman, etc.} (resp. {shirt, pants, top, etc.}). While Visual Genome is the largest among these in terms of number of attribute annotations, it is sparsely labeled. Other datasets are either fully annotated for domain-specific attributes or more densely labeled but covering few object categories.

2.2 Experimental results compared with baselines and SOTA multi-label learning methods. The top box displays results of multi-label learning methods; the middle box shows results of models from attribute prediction works and our strong baseline; the last row shows the performance of our SCoNE algorithm applied to the strong baseline.

2.3 Ablation study. We show how each of our proposed techniques helps improve overall performance.

2.4 Ablation study on the three components of the Strong Baseline model by removing each one. The last row also corresponds to the ResNet-Baseline model.

3.1 Statistics of attributes in LSA. Note that Localized Narratives contains 32k and 122k images from Flickr30K and COCO respectively. Among all instances, 7.1M are grounded (bounding box), 1.4M weakly grounded (mouse trace), and 975k ungrounded.

3.2 Results on VAW. The top box reports results of methods trained only on VAW, while the bottom box shows our newly introduced baseline RN50-Context and TAP on VAW after pre-training on LSA. LSA-pretrained and VAW-supervised denote whether a model is trained with attribute labels from LSA and VAW respectively.

3.3 Evaluation of LSA common and LSA common→rare.

3.4 Ablation on class embeddings.

3.5 Ablation on training portion.

3.6 Results on HICO image classification.
4.1 Our framework achieves the best and second-best results on the Flickr30K dataset with two different encoders. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM for each section.

4.2 Our method yields promising results on the MS-COCO dataset. Our performance is comparable in all test schemas with previous works, especially with the simple Bi-GRU architecture. † denotes methods that use ensembling of multiple models. Bold and underline highlight the best and second-best performance.

4.3 Ablation studies for the number of layers in GAT, the graph structure (whether encoding the scene graph jointly or in two separate steps is beneficial), and the impact of losses. Bold and underline highlight the best and second-best performance.

4.4 Our framework achieves the best results on the Flickr30K dataset when initializing the word embeddings from scratch for the Bi-GRU semantic encoder. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM.

4.5 Our framework achieves the best results on the MS-COCO dataset when initializing the word embeddings from scratch for the Bi-GRU semantic encoder. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM.

4.6 Ablation studies comparing fine-tuning the whole BERT model versus using P-Tuning v2 to encode the short phrases of semantic concepts. The models are evaluated on the MS-COCO 1K Test set. Gray denotes our best model.

4.7 Reranking results on MS-COCO 5K after ensembling with the image-entity score.

5.1 Experimental results on the VG-50 dataset.

5.2 Open-Vocabulary Relationship Classification results on VG-50.

5.3 Ablation studies for the box incorporation design, the object grounding loss, and the object disjoint loss.

List of Figures

1.1 Examples of applications of visual attributes and visual relations recognition. (Top left) Semantic image search: search for images that correctly depict the semantic information from the input text, e.g., red bowl with food should return images of a bowl having color red and containing food. (Top right) Image generation: generate an image that is faithfully correct w.r.t. the input text, e.g., the image truly displays a clock with color yellow and a bench with color red as required by the text. (Bottom left) Object description: describe object appearance (e.g., for the visually impaired), e.g., the text to the right fully describes all appearance characteristics of the object to the left. (Bottom right) Language-based object selection: select the object in the image that satisfies the input text condition, e.g., the image displays a bounding box enclosing the upside down chair.
1.2 Image search results for the input query red bowl with food on a) Adobe Stock [1], where there are many red bowls without food, and on b) Pinterest [2], where there are many white bowls with red food (results obtained on Dec 25, 2023).

1.3 State-of-the-art large language model Google BARD [3] makes mistakes in visual relation recognition (result obtained on Dec 25, 2023): it says that there is a bird eating the fruit in the image, while in fact the image only displays an assortment of food that resembles the shape of a bird.

1.4 The first part of the thesis (Chapter 2 and Chapter 3) focuses on recognizing attributes of objects. On the left in (a), we illustrate that these attributes can be adjectives that describe physical properties (image source: [4]). Here we denote common primitive attributes in blue, and rare, not commonly used attributes in green. On the right in (b), we show attributes as verbs and verb-object pairs that describe actions of the object and its interactions with others in the scene.

1.5 The second part of the thesis focuses on visual relations between objects. a) In Chapter 4, we study how to compose objects with their attributes and relations in a scene graph to improve image-text alignment. Here, the figure displays a joint embedding space for the image and the scene graph. b) In Chapter 5, we formulate a new subject-centric approach for predicting all visual relationships with respect to a particular subject instance, e.g., the figure illustrates all relationships w.r.t. the woman in the image.

2.1 Example annotations in the VAW dataset. Each possible attribute-object category pair is annotated with at least 50 examples consisting of explicit positive and negative labels. Here, we illustrate positive and negative attribute annotations for the objects table, plate, flower, and cookie in the image.

2.2 Examples of images and their annotations from the VAW dataset. Object names, positive attributes, explicitly labeled negative attributes, and negative labels from our negative label expansion are shown in corresponding colors for each example.

2.3 Distribution of positive and negative annotations for attributes in different categories. We show the top-15 attributes with the most positive annotations in each category, sorted in descending order.

2.4 Strong baseline attribute prediction model. The ResNet feature map extracted from the input image is modulated with the object embedding, which allows the model to learn useful attribute-object relationships (e.g., ball is round) and also to suppress infeasible attribute-object pairs (e.g., talking table). The image-object combined feature map X is used to infer the object region G and multiple attention maps {A(m)}, which are subsequently used to aggregate features for classification. Here, Zlow and Zrel respectively denote low-level and image-object features aggregated inside the estimated object region, and Zatt corresponds to image-object features pooled from the multiple attention maps. The classifier is trained with BCE loss on the explicit positive and negative labels. For the missing (unknown) labels, we find that treating them as "soft negatives" by assigning them very small weights in the BCE loss also helps improve results.

2.5 Examples of predictions from SB+SCoNE.
We show the object name and its ground-truth positive attribute labels above the image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

2.6 More examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above the image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

2.7 Image search results. We show the top retrieved images of SB+SCoNE when searching for some color attributes.

2.8 Image search results. We show the top retrieved images of SB+SCoNE when searching for images that exhibit multiple color attributes.

2.9 Image search results. We show the top retrieved images of SB+SCoNE when searching for some material attributes.

2.10 Image search results. We show the top retrieved images of SB+SCoNE when searching for some size attributes.

3.1 Attributes in LSA cover a wide range of words/phrases that describe an object, including (a) adjectives to describe color, shape, state, etc., (b) verbs to describe actions, (c) verb-object pairs to describe interactions, and (d) preposition-object pairs to describe location.

3.2 Extracted attribute examples. (Left) Examples of objects and their attributes parsed from grounded image-caption pairs from an image in the Flickr30K dataset. (Right) Examples of objects and their attributes parsed from ungrounded image-caption pairs from an image in the COCO dataset.

3.3 Model architecture of TAP. A sequence of ResNet encodings forms the input visual tokens. This is processed jointly with the query token, which consists of object query tokens (red), their object index embedding (blue), a sequence index embedding (orange), and a bounding box embedding (green). The contextualized representation zi of the [CLS] token of all objects is decoded into attributes. In addition, an object grounding loss is used to train object localization (shown here for dog).

3.4 Architecture of RN50-Context, which takes the whole image as input and uses RoIAlign to extract features that correspond to the object region. The features from RoIAlign are then multiplied element-wise with the object word embedding, then forwarded through the classification layer at the end to obtain logits for attribute probabilities.

3.5 Examples of TAP predictions on the VAW dataset. For each image, a list of attributes sorted in descending order of probability is returned for the specified objects.

3.6 Attribute predictions of OpenTAP. Every attribute list is sorted in descending order of the model's confidence. Both seen attributes from the closed branch and unseen attributes from the open-vocabulary branch are shown. We display the attention mask of TAP for objects without a bounding box. Strikethrough represents wrong predictions as judged by us.
3.7 Image retrieval results of OpenTAP for the unseen classes excited, fishing, and salmon-colored.

3.8 Top-3 attribute prediction comparison between OpenTAP and CLIP. Red denotes incorrect predictions as judged subjectively by us.

4.1 Illustration of CORA. CORA has a dual-encoder architecture, consisting of one encoder that embeds the input image (the upper branch) and one encoder that embeds the text caption scene graph (the lower branch) into a joint embedding space.

4.2 Overview of CORA. a) CORA consists of (1) an image encoder that detects and extracts the salient regions' features from the input image, then aggregates them into a single image embedding through the GPO pooling operator, and (2) a text encoder that first parses the input text into a scene graph where all semantic information is readily organized; two graph attention networks, the Object-Attribute GAT and the Object-Object GAT, are then used to encode this graph into the same joint space with the image. The red arrow denotes the edge of the active role, while the yellow arrow is for the passive role in the relation (refer to Section 4.2.3.2). b) The semantic concept encoder that uses GRU or BERT to encode each semantic concept in the graph corresponding to the object and attribute nodes and the relation edges.

4.3 Inference time comparison. We compare the text-to-image retrieval inference time of our method CORA against two SOTA cross-attention methods, SGRAF and NAAF (lower is better). The inference time is calculated with different numbers of images in the database, ranging from 10 to 10^5 images. CORA (blue line), with its dual-encoder architecture, is much faster and more scalable than cross-attention approaches.

4.4 Successful image-to-text and image-to-entity retrieval on MS-COCO. In image-to-text retrieval, green denotes matching text according to the ground truth of MS-COCO, while red denotes incorrect matching. In image-to-entity retrieval, green and red denote correct and incorrect matching, respectively, as judged subjectively by us.

4.5 Failure cases of image-to-text and image-to-entity retrieval on Flickr30K. In image-to-text retrieval, green denotes matching text according to the ground truth of Flickr30K, while red denotes incorrect matching. In image-to-entity retrieval, green and red denote correct and incorrect matching, respectively, as judged subjectively by us.

4.6 Text-to-image retrieval on MS-COCO. For every text, we show the top-5 retrieved images on MS-COCO. The image with the green tick mark is the correct match according to the ground truth in the dataset.

5.1 Illustration of a failed example by SGG-NLS. The man (denoted in the red box) is predicted to sit on two benches at the same time.

5.2 Overview of the SCRP model. The input is an image, its text caption, and optionally the bounding boxes of the objects in the caption.
A language parser is used to extract the subject and all objects that share a relationship with the subject. An external object detector is also used to retrieve object instances that are not mentioned in the caption, which are treated as having negative relationships with the subject. The model is based on the architecture of ALBEF. It contains a vision transformer for encoding the image patches, a text encoder to encode the text tokens, and a multimodal Transformer to contextualize the text tokens with information from the visual modality. We also employ a box injection module that adds information about the object position into the corresponding object tokens. Finally, the [MASK] tokens are decoded into the relationship predicates.

5.3 Qualitative results of SCRP (in column (a)) versus the ground-truth labels (in column (b)) on the VG-50 dataset. Our model tends to make more (and mostly correct) predictions than the number of available ground-truth relationship labels.

5.4 Predictions of SCRP for rare relationships. Our model is also able to predict rare relationship classes such as blow drying, lifting up, milking, touching, and reaching for, thanks to being trained on a large amount of image-text data.

5.5 Failure cases of SCRP. The model still struggles with hard cases, such as when there are overlapping boxes (top right and bottom left images).

Chapter 1: Introduction

1.1 Overview

In recent years, computer vision has made significant progress in various visual recognition tasks thanks to the advances of deep learning and the availability of large annotated datasets. However, most existing computer vision models have primarily focused on detecting and naming objects, while neglecting other visual information that can be crucial in accurately portraying an object's appearance. For example, these models can visually identify an object by assigning it a semantic label (e.g., car, dog) while being unable to express variations in the object's appearance with visual attributes such as red car or visual relations as in dog playing with stick. Given that objects can be dynamic, occur in various states, possess diverse attributes, and interact closely with other objects in the scene, the intra-category visual appearance of objects can vary significantly. Therefore, this thesis aims to explore and develop computer vision models that go beyond traditional object classification systems and instead focus on understanding and perceiving the visual attributes and relations of objects.

The ability to recognize attributes and relations of objects facilitates humans' seamless interaction with the world. This is evidenced in our daily experiences, such as describing an object to someone using its appearance characteristics, deducing that someone is reading a book because it is open, or exercising caution around a coffee mug placed beside a laptop. The development of automated systems equipped with this nuanced understanding of visual attributes and relations holds potential for improving a wide range of tasks, including semantic image search, language-based object selection, object description, text-to-image generation, and vision-language-based robot assistance. A few of these applications are demonstrated in Figure 1.1.
Figure 1.1: Examples of applications of visual attributes and visual relations recognition. (Top left) Semantic image search: search for images that correctly depict the semantic information from the input text, e.g., red bowl with food should return images of a bowl having color red and containing food. (Top right) Image generation: generate an image that is faithfully correct w.r.t. the input text, e.g., the image truly displays a clock with color yellow and a bench with color red as required by the text. (Bottom left) Object description: describe object appearance (e.g., for the visually impaired), e.g., the text to the right fully describes all appearance characteristics of the object to the left. (Bottom right) Language-based object selection: select the object in the image that satisfies the input text condition, e.g., the image displays a bounding box enclosing the upside down chair.

Despite the importance of this problem, contemporary computer vision systems still struggle to recognize fine-grained concepts of attributes and relations. For example, as depicted in Figure 1.2, image retrieval results on the Adobe Stock [1] and Pinterest [2] platforms for red bowl with food often yield imprecise results, such as images of a red bowl without food, or a non-red bowl with red food. Similarly, Figure 1.3 illustrates another example in which the state-of-the-art large language model Google BARD [3] incorrectly recognizes that there is a bird eating food in the image. Overall, learning a fine-grained semantic understanding of objects can be categorized into two parts: 1) visual attributes, and 2) visual relations.

Figure 1.2: Image search results for the input query red bowl with food on a) Adobe Stock [1], where there are many red bowls without food, and on b) Pinterest [2], where there are many white bowls with red food (results obtained on Dec 25, 2023).

Figure 1.3: State-of-the-art large language model Google BARD [3] makes mistakes in visual relation recognition (result obtained on Dec 25, 2023): it says that there is a bird eating the fruit in the image, while in fact the image only displays an assortment of food that resembles the shape of a bird.

1.2 Visual Attributes of Objects

When people recognize objects in visual scenes, they identify each object with a categorical label while also simultaneously perceiving and describing it with visual attributes. For instance, a table could be described as large in size, made of wood, having a round shape, being of beige color, and empty. These visual attributes constitute a large portion of the information about the object's appearance and allow people to easily visualize objects in their mind, describe them to others, and even recognize them. Understanding attributes offers various advantages for downstream computer vision problems, such as facilitating more accurate object selection in images and image retrieval by adding attributes as part of the input text query. Visual attributes allow us to ground visual objects from images to attribute concepts in language, resulting in a more comprehensive semantic understanding of objects in visual scenes.

Despite the great importance of attributes, existing work is mostly limited to attributes in specific domains (e.g., clothing [5, 6], human [7], emotion [8]), consists of a very small number of attribute-object pairs [9], or is rife with label noise [10]. In addition, designing an approach that can effectively predict visual attributes for objects in the wild is a difficult task due to the following challenges:
• Label sparsity: attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. However, exhaustively annotating all attribute labels for every object image is a time-consuming and expensive process. Therefore, objects in visual attribute datasets are often sparsely labeled, with a large number of missing labels.

• Class imbalance and long-tailed attributes: visual attributes follow a long-tailed distribution. For example, salient attributes such as color are used much more commonly in daily human conversation than others.

• Diverse localization of visual cues: attributes are diverse in nature; some describe physical properties of objects (such as color, material, shape) whose visual cues span the object surface, while other attributes require attention to particular image parts or reasoning about the surrounding context (e.g., bald-headed, wearing hat).

Figure 1.4: The first part of the thesis (Chapter 2 and Chapter 3) focuses on recognizing attributes of objects. On the left in (a), we illustrate that these attributes can be adjectives that describe physical properties (image source: [4]). Here we denote common primitive attributes in blue, and rare, not commonly used attributes in green. On the right in (b), we show attributes as verbs and verb-object pairs that describe actions of the object and its interactions with others in the scene.

This thesis studies attribute prediction in Chapter 2 and Chapter 3 (illustrated in Figure 1.4). In Chapter 2, we introduce a large-scale visual attribute dataset of 620 attributes for objects in the wild to address the limitations of current attribute prediction benchmarks, and propose a computational model and learning algorithm to overcome the above challenges.

As existing work relies heavily on the availability of high-quality human-annotated attribute labels, it is challenging to scale up the predictive capability of learning models to tackle the vast space of possible attribute concepts in real-world settings. Instead of collecting dense attribute annotations, which is a costly process, we observe that attribute information is extremely abundant in existing image-caption datasets but has not been utilized for attribute learning by prior work. To this end, in Chapter 3, we propose a novel neural network model with a training scheme that allows for learning attribute prediction from multiple image-text datasets with mixed supervision. The proposed model can learn from strongly supervised data, where attribute labels are annotated and grounded to the correct object bounding box, as well as weakly supervised data, where only the image-caption pairs are available without any bounding box correspondence. Learning from image-text pairs allows us to scale the attribute prediction model up to thousands of attribute classes.
In addition, by taking advantage of text embeddings from a pre-trained large vision-language model, our model is also equipped with open-vocabulary prediction ability, allowing it to recognize even novel attribute classes represented by arbitrary text phrases.

1.3 Visual Relations between Objects

While perceiving visual attributes allows for a more complete semantic understanding of objects, it is the visual relationships between objects that connect them together and enable the construction of a high-level semantic meaning for the entire scene. One primary goal in computer vision is to achieve such a comprehensive understanding of visual scenes by developing models that can recognize objects, describe their attributes, and explain their interactions. This advancement allows for important applications such as image retrieval given a text description. This research aims to enhance computer vision models with the ability to reason about objects and their interconnectivity, leading to improved downstream computer vision tasks.

Figure 1.5: The second part of the thesis focuses on visual relations between objects. a) In Chapter 4, we study how to compose objects with their attributes and relations in a scene graph to improve image-text alignment. Here, the figure displays a joint embedding space for the image and the scene graph. b) In Chapter 5, we formulate a new subject-centric approach for predicting all visual relationships with respect to a particular subject instance, e.g., the figure illustrates all relationships w.r.t. the woman in the image.

The scene graph was introduced by Johnson et al. [11] as a detailed representation of a visual scene. A scene graph encapsulates all semantic information about the scene, including object identities, their visual attributes, and their relationships. Since its introduction, numerous studies have utilized the scene graph as a scene representation, in addition to the usual text description, to tackle visual recognition problems. However, with the emergence of text sequence models pre-trained on large datasets and powerful cross-attention networks in vision-language learning, the use of scene graphs has declined. Nevertheless, recent findings such as those by Chefer et al. [12] reveal that current models, including advanced ones like Stable Diffusion (which uses the Transformer-based text encoder CLIP [13]), still exhibit incorrect object-attribute binding (i.e., pairing an attribute with the wrong object in the text description). This suggests that scene graphs, with their inherent design of accurately pairing objects with their attributes and relations, may experience a resurgence as a complementary method to describe visual scenes. As for the powerful cross-attention networks for joint vision-language modeling, their intensive computation cost renders them infeasible for large-scale image retrieval. Therefore, it is more desirable for a real-world image retrieval system to utilize a dual-encoder framework that can perform retrieval efficiently via similarity computation in the embedding space.
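To make the efficiency contrast concrete, the following is a minimal PyTorch sketch of the two retrieval regimes. The modules `image_encoder`, `text_encoder`, and `joint_scorer` are hypothetical stand-ins, not the actual CORA or cross-attention models discussed in Chapter 4.

```python
import torch
import torch.nn.functional as F

def dual_encoder_retrieval(image_encoder, text_encoder, images, query, k=10):
    """Dual encoder: image embeddings can be computed once, offline; each
    query needs a single text forward pass plus one matrix multiplication."""
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(images), dim=-1)   # (N, d), precomputable
        txt_emb = F.normalize(text_encoder(query), dim=-1)     # (1, d)
    scores = txt_emb @ img_emb.T                               # (1, N) cosine similarities
    return scores.topk(k, dim=-1).indices

def cross_attention_retrieval(joint_scorer, images, query, k=10):
    """Cross attention: every (image, query) pair requires a full joint
    forward pass, so the cost per query grows with the database size."""
    with torch.no_grad():
        scores = torch.stack([joint_scorer(img.unsqueeze(0), query).squeeze()
                              for img in images])              # N forward passes per query
    return scores.topk(k).indices
```

The dual-encoder path is what makes large-scale retrieval practical: the database embeddings are indexed once, and each new query reduces to a nearest-neighbor search in the joint space.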
In Chapter 4, we tackle the image-text retrieval problem by showing that scene graphs can enable a dual-encoder framework that is more efficient than, yet as powerful as, cross-attention approaches.

Previous studies on visual relationship prediction have predominantly followed a pair-centric approach, where relationships between every pair of objects are independently classified. These works also study relation prediction in a closed setting, where models are trained on small datasets with a fixed object and relation vocabulary. Furthermore, the models proposed in these works have to be trained on densely labeled scene graph datasets such as Visual Genome [10], which are expensive to collect. In Chapter 5, we propose a novel subject-centric method where multiple relationships are predicted simultaneously conditioned on one subject. This methodology offers the distinct advantage that the prediction of one relation can influence the prediction of another, which helps prevent undesirable scenarios such as a person being predicted to sit on two different benches at the same time. In addition, we extend beyond the limited scope of previous work that only studies relationship recognition with a closed vocabulary, and propose an approach that can learn from large public image-text datasets with an open vocabulary to recognize arbitrary relationship classes defined at test time. Our approach can learn from image-text datasets with different levels of grounding supervision, i.e., from scene graph datasets with object bounding boxes to image-text datasets with no box localization information.

Chapter 2: Closed-Set Attribute Prediction

This chapter is based on the publication Learning to Predict Visual Attributes in the Wild, Pham et al., In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13018-13028, 2021.

While several existing works address attribute prediction, they are limited in many ways. Objects in a visual scene can be described using a vast number of attributes, many of which can exist independently of each other. Due to the variety of possible object and attribute combinations, it is a daunting task to curate a large-scale visual attribute prediction dataset. Existing works have largely ignored large-scale visual attribute prediction in the wild and have instead focused only on domain-specific attributes [5, 7], datasets consisting of a very small number of attribute-object pairs [9], or datasets that are rife with label noise, ambiguity, and label sparsity despite having a large number of images (Visual Genome [10]). Similarly, while attributes can form an important part of related tasks such as VQA, captioning, and referring expressions, these works do not address the unique challenges of attribute prediction. Existing work also fails to address the issue of partial labels, where only a small subset of all possible attributes is annotated. Partial labels and the lack of explicit negative labels make it challenging to train or evaluate models for large-scale attribute prediction. To address these problems, we propose a new large-scale visual attribute prediction dataset for images in the wild that includes both positive and negative annotations.
Figure 2.1: Example annotations in the VAW dataset. Each possible attribute-object category pair is annotated with at least 50 examples consisting of explicit positive and negative labels. Here, we illustrate positive and negative attribute annotations for the objects table, plate, flower, and cookie in the image.

Our dataset, called Visual Attributes in the Wild (VAW), consists of over 927K explicitly labeled positive and negative attribute annotations applied to over 260K object instances (with 620 unique attributes and 2,260 unique object phrases). Due to the number of possible combinations, it is prohibitively expensive to collect exhaustive attribute annotations for each instance. However, we ensure that every attribute-object phrase pair in the dataset has a minimum of 50 positive and negative annotations. With a density of 3.56 annotations per instance, our dataset is 4.9 times denser than Visual Genome while also providing negative labels. Additionally, annotations in VAW are visually grounded, with segmentation masks available for 92% of the instances. Formally, our VAW dataset poses attribute prediction as a long-tailed, partially-labeled, multi-label classification problem. Examples of attributes in VAW are illustrated in Figure 2.1.

We explore various state-of-the-art methods in attribute prediction and multi-label learning and show that the VAW dataset poses significant challenges to existing work. To this end, we first propose a strong baseline model that considers both low- and high-level features to address the heterogeneity in features required for different classes of attributes (e.g., color vs. action), and is modeled with multi-attention and an ability to localize the region of the object of interest by using partially available segmentation masks. We also propose a series of techniques that are uniquely suited to our problem. Firstly, we explore existing works that address label imbalance between positive and negative labels. Next, we describe a simple yet powerful scheme that exploits linguistic knowledge to expand the number of negative labels. Finally, we propose a supervised contrastive learning approach that allows our model to learn more attribute-discriminative features. Through extensive ablations, we show that most of our proposed techniques are model-agnostic, producing improvements not only on our baseline but also on other methods. Our final model is called Supervised Contrastive learning with Negative-label Expansion (SCoNE), which surpasses state-of-the-art models by 3.5 mAP and 5.7 overall F1 points.

Our work makes the following contributions: 1) We create a new large-scale dataset for visual attributes in the wild (VAW) that addresses many shortcomings in the existing literature, and we demonstrate that VAW poses considerable difficulty to existing algorithms. 2) We design a strong baseline model for attribute prediction using existing visual attention techniques. We further extend this baseline to our novel attribute learning paradigm called Supervised Contrastive learning with Negative-label Expansion (SCoNE), which considerably advances the state of the art. 3) Through extensive experimentation, we show the efficacy of both our proposed model and our proposed techniques.
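Before turning to related work, we make the partially-labeled, multi-label formulation concrete with a minimal PyTorch sketch of a binary cross-entropy objective over labels encoded as positive (1), explicit negative (0), and missing (-1). The soft-negative weighting mirrors the treatment mentioned later in the Figure 2.4 caption, but the exact weight value and SCoNE's class-imbalance handling (Section 2.3) differ; this is an illustration, not the implementation.

```python
import torch
import torch.nn.functional as F

def partial_bce_loss(logits, labels, soft_negative_weight=0.01):
    """logits, labels: (batch, C). Labels use 1 = positive, 0 = explicit
    negative, -1 = missing. Missing entries are treated as weak ("soft")
    negatives by down-weighting their contribution to the BCE loss."""
    targets = labels.clamp(min=0).float()            # map missing (-1) to a 0 target
    weights = torch.ones_like(targets)
    weights[labels == -1] = soft_negative_weight     # tiny weight on unknown labels
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

# Example label vector for one instance with 3 attributes:
# [positive, explicit negative, missing]
# labels = torch.tensor([[1, 0, -1]]); logits = model(image, object_name)
```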
2.1 Related Work

Attribute learning. Some of the earliest work related to attribute learning stems from a desire to learn to describe objects rather than predict their identities [14-17]. Since then, extensive work has sought to explore several aspects of object attributes, including attribute-based zero-shot object classification [18-20], relative attribute comparison [21-23], and image search [24, 25]. While research in compositional zero-shot learning [26-29] also tackles object attributes, it targets transformations of object states, treats each instance as having only one state, and focuses on predicting unseen compositions rather than predicting a complete set of attributes for each object instance. Several works have focused on attribute learning in specific domains such as animals, scenes, clothing, pedestrians, and human facial and emotion attributes [5-8, 30, 31]. In contrast, we seek to explore attribute prediction for an unconstrained set of objects.

Attribute prediction in the wild. Only a limited number of works have sought to explore general attribute prediction. COCO Attributes [9] is an attempt to develop an in-the-wild attribute prediction dataset; however, it is very limited in scope, covering only 29 object categories. Similarly, a portion of the Visual Genome (VG) [10] dataset consists of attribute annotations. However, attributes in VG are not a central focus of that work, and therefore they are very sparsely labeled, noisy, and lack negative labels, making VG unsuitable as a standalone attribute prediction benchmark. Despite this, attribute annotations from VG are often used to train attribute-aware object detectors for downstream vision-language tasks [32-34]. By introducing the VAW dataset, the research community can use its dense attribute annotations in conjunction with VG and our attribute learning techniques to train better attribute prediction models. Several recent works have also sought to take advantage of the massive amount of data in VG to curate datasets for specific challenges [35, 36]. In a similar vein, we also start by leveraging existing sources of clean annotations to develop our VAW dataset.

Multi-label learning. VAW can be cast as a multi-label classification problem, which has been extensively studied in the research community [37-40]. Multi-label learning involving missing labels poses a greater challenge, but is also extensively studied [41-44]. In many cases, missing labels are assumed to be negative examples [45-48], which is unsuitable for attribute prediction, since most attributes are not mutually exclusive. Others attempt to predict missing labels by training expert models [49], which is also infeasible for a large-scale problem like ours.

Learning from imbalanced data. Data imbalance naturally arises in datasets with large label sets. As expected, label imbalance exists in our VAW dataset; therefore, techniques designed to learn from imbalanced data are also related to our explorations. These works can be divided into two main approaches: cost-sensitive learning [50-52] and resampling [53-57]. We utilize both of these techniques in our final model.

Visual attention. Attention is a highly effective technique in image classification, captioning, VQA, and domain-specific attribute prediction [40, 41, 58-62]. In our VAW dataset, most of the objects are annotated with their segmentation mask, which allows us to guide the attention map to ignore irrelevant image regions.
We also use additional attention maps to allow our model to properly explore the surrounding context of the object.

Contrastive learning. Contrastive learning has recently gained a lot of traction as an effective self-supervised learning technique [63-66]. While originally intended for the self-supervised setting, recent works have extended contrastive learning to the supervised setting [67]. Motivated by these works, we propose an extension of the supervised contrastive loss to the multi-label setting required for VAW. To the best of our knowledge, ours is the first attempt to apply a contrastive loss to multi-label learning.

2.2 Visual Attributes in the Wild Dataset

In this section, we describe how we collect attribute annotations and present statistics of the final VAW dataset. In general, we aim to overcome the limitations of VG for the attribute prediction task, which include noisy labels, label sparsity, and a lack of negative labels, to create a dataset applicable for training and testing attribute classification models.

2.2.1 Data Collection

VAW is created based on the VGPhraseCut [36] and GQA [35] datasets, both of which leverage and refine annotations from Visual Genome [10]. VGPhraseCut is a referring expression dataset that provides high-quality attribute labels and per-instance segmentation masks, while GQA is a VQA dataset that presents cleaner scene graph annotations.

Step 1: Extraction from VGPhraseCut and GQA

Our goal is to build a dataset that allows us to predict the maximal number of attributes commonly used to describe objects in the wild. From VGPhraseCut, we select attributes that appear within more than 15 referring phrases. After manually cleaning ambiguous and hard-to-recognize attributes, we obtain a set of 620 unique attributes, which are used throughout the rest of the process. Next, we extract more instances from GQA that are labeled with these attributes. We further take advantage of the referring expressions from VGPhraseCut to collect a reliable negative label set: given an image, for instances that are not selected by an attribute referring phrase, we assign that attribute as a negative label for the instance. This step allows us to collect 220,049 positive and 21,799 negative labels.

Step 2: Expand attribute-object coverage

In this step, we seek to collect additional annotations for every feasible attribute-object pair that may be lacking annotations. We define feasible pairs as those with at least one positive example in our dataset. We ensure that every feasible pair has at least 50 (positive or negative) annotations. To keep the annotation cost in check, we do not annotate pairs that already have 50 or more annotations. This expansion enriches our dataset with more positives and negatives for every attribute across different objects, allowing for better training and evaluation of classification models. This step adds 156,690 positive and 455,151 negative annotations.

Step 3: Expand long-tailed attribute set

In this step, we aim to collect additional annotations for the long-tailed attributes. Long-tailed attributes are associated with very few object categories, either because the attribute is not frequently used by humans or because it applies only to a small set of objects. Hence, given a long-tailed attribute and a known object that it applies to, we first expand its set of applicable object categories using the WordNet [68] ontology.
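As a hypothetical illustration of this kind of ontology-based expansion, the sketch below uses NLTK's WordNet interface to collect co-hyponyms of a seed category; the actual candidate selection and manual filtering used for VAW may differ.

```python
# Requires: nltk, plus a one-time nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_object_categories(seed_object="child"):
    """Collect co-hyponyms of the seed category, i.e., categories that share
    a direct hypernym (parent concept) with it in WordNet."""
    candidates = set()
    for synset in wn.synsets(seed_object, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():          # categories under the same parent
                candidates.update(lemma.name().replace("_", " ")
                                  for lemma in sibling.lemmas())
    candidates.discard(seed_object)
    return sorted(candidates)

# expand_object_categories("child") surfaces related person-like categories,
# which then become candidates for human annotation of the long-tailed attribute.
```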
After we find candidate object categories for a given long-tailed attribute, we ask humans to annotate randomly sampled images from these candidates with either a positive or negative label for the given attribute. This step adds an additional 16,239 positive and 57,751 negative annotations pertaining to the long-tailed attributes.

Table 2.1: Statistics of VAW compared with other in-the-wild and domain-specific attribute datasets. *person (resp. *clothes) category may represent multiple categories including {boy, girl, man, woman, etc.} (resp. {shirt, pants, top, etc.}). While Visual Genome is the largest among these in terms of number of attribute annotations, it is sparsely labeled. Other datasets are either fully annotated for domain-specific attributes or more densely labeled but cover few object categories.

Dataset | # attributes | # instances | # object categories | # attribute anno. per instance | Negative labels | Segmentation masks | Domain
VAW | 620 | 260,895 | 2,260 | 3.56 | Yes | Yes | In-the-wild
Visual Genome [10] | 68,111 | 3,843,636 | 33,877 | 0.73 | No | No | In-the-wild
COCO Attributes [9] | 196 | 180,000 | 29 | ≥ 20 | Yes | No | In-the-wild
EMOTIC [8] | 26 | 23,788 | 1 (person*) | 26 | Yes | No | Emotions
WIDER [7] | 14 | 57,524 | 1 (person*) | 14 | Yes | No | Pedestrian
iMaterialist [5] | 228 | 1,012,947 | 1 (clothes*) | 16.17 | Yes | Yes | Fashion

2.2.2 Statistics

Our final dataset consists of 620 attributes describing 260,895 instances from 72,274 images. Our attribute set is diverse across different categories, including color, material, shape, size, texture, and action. On the annotated instances, our dataset contains 392,978 positive and 534,701 negative attribute labels. The instances from VGPhraseCut (92% of the dataset) are provided with segmentation masks, which can be useful in attribute prediction. We split the dataset into 216,790 instances (58,565 images) for training, 12,286 instances (3,317 images) for validation, and 31,819 instances (10,392 images) for testing. We split the dataset such that the test set has a higher annotation density per object, which allows for more thorough testing. In particular, our test set has an average of 7.03 annotations per instance compared to 3.02 in the training set.

In Table 2.1, we compare the statistics of the VAW dataset with other in-the-wild and domain-specific visual attribute datasets. Compared to existing work, VAW fills an important gap in the literature by providing a domain-agnostic, in-the-wild visual attribute prediction dataset with denser annotations, explicit negative labels, segmentation masks, and a large number of attribute and object categories.

Figure 2.2: Examples of images and their annotations from the VAW dataset. Object names, positive attributes, explicitly labeled negative attributes, and negative labels from our negative label expansion are shown in corresponding colors for each example.

In Figure 2.2, we show examples of images and their attribute annotations from the VAW dataset. The images show both positive and negative annotations from our dataset as well as a subset of the result of our negative label expansion scheme (explained in Section 2.3.3), which is a rule-based system derived on the premise of mutual exclusivity of certain attributes. For example, if an object is annotated with the positive attribute empty, the attribute filled can be auto-annotated as a negative attribute for the same object; a toy sketch of this rule follows.
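Below is a toy sketch of the rule-based expansion detailed in Section 2.3.3: within one attribute type, a labeled positive yields negatives for attributes that are exclusive with it or do not overlap with it. The tiny relation tables are illustrative examples only, not the full ontology built from [73], WordNet, and ConceptNet.

```python
# Illustrative (hypothetical) relation tables; the real ones are mined from an
# attribute ontology, WordNet, ConceptNet edges, and annotator co-occurrence.
EXCLUSIVE = {"empty": {"filled", "full"}, "clean": {"dirty"}, "wet": {"dry"}}
OVERLAP = {"wooden": {"wicker"}, "white": {"beige"}}

def expand_negatives(positive_attrs: set, attr_type_members: set) -> set:
    """Derive extra negative labels for an object within one attribute type."""
    negatives = set()
    for a in positive_attrs & attr_type_members:
        for a2 in attr_type_members - {a}:
            overlaps = a2 in OVERLAP.get(a, set()) or a in OVERLAP.get(a2, set())
            exclusive = a2 in EXCLUSIVE.get(a, set()) or a in EXCLUSIVE.get(a2, set())
            if exclusive or not overlaps:   # {a' : not overlap(a, a') or exclusive(a, a')}
                negatives.add(a2)
    return negatives - positive_attrs

# e.g., an object labeled "empty" within the state attributes
# {"empty", "filled", "full"} gains "filled" and "full" as negatives.
```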
In Figure 2.3, we show the distribution of the top-15 attributes in various attribute categories, arranged in descending order according to the number of available positive annotations. The diagram clearly shows the long-tailed nature of our VAW dataset, with some categories showing highly skewed distributions (color, material) and others having a more evenly balanced distribution (texture, others). For example, in the material category, the annotations for the top-2 attributes (metal and wooden) constitute over 30.91% of the total number of annotations (41.4% of positives and 23.5% of negatives). Reassuringly, our strong baseline as well as the SCoNE model work almost equally well for more balanced categories (e.g., texture) as well as for a skewed category (e.g., material).

Figure 2.3: Distribution of positive and negative annotations for attributes in different categories (color, material, shape, size, texture, action, and others). We show the top-15 attributes with the most positive annotations in each category, sorted in descending order.

2.3 Approach

In this section, we describe the components of our strong baseline model along with the Supervised Contrastive learning with Negative-label Expansion (SCoNE) algorithm that helps our model learn more attribute-discriminative features. A depiction of our strong baseline model is shown in Figure 2.4.

Figure 2.4: Strong baseline attribute prediction model. The ResNet feature map extracted from the input image is modulated with the object embedding, which allows the model to learn useful attribute-object relationships (e.g., ball is round) and to suppress infeasible attribute-object pairs (e.g., talking table). The image-object combined feature map X is used to infer the object region G and multiple attention maps {A^{(m)}}, which are subsequently used to aggregate features for classification. Here, Z_{low} and Z_{rel} respectively denote low-level and image-object features aggregated inside the estimated object region, and Z_{att} corresponds to image-object features pooled from the multiple attention maps. The classifier is trained with a BCE loss on the explicit positive and negative labels. For the missing (unknown) labels, we find that treating them as "soft negatives" by assigning them very small weights in the BCE loss also helps improve results.

Problem formulation. Let D = \{I_i, g_i, o_i; Y_i\}_{i=1}^{N} be a dataset of N training samples, where I_i is an object instance image (cropped using its bounding box), g_i is its segmentation mask, o_i is the category phrase of the object for which we want to predict attributes, and Y_i = [y_{i,1}, \dots, y_{i,C}] is its C-class label vector with y_{i,c} ∈ {1, 0, −1} denoting whether attribute c is positive, negative, or missing, respectively. Our goal is to train a multi-label classifier that, given an input image and the object name, can output a confidence score for all C attribute labels. A minimal sketch of this sample format is given below.
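The following short sketch illustrates one possible way to represent a training sample and its C-dimensional label vector; the field names and indices are hypothetical and only show the {1, 0, −1} encoding, not the dataset's actual interface.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class VAWSample:
    image_crop: torch.Tensor          # object image cropped by its bounding box, 3 x H x W
    mask: Optional[torch.Tensor]      # binary segmentation mask, H x W (None when unavailable)
    object_name: str                  # category phrase o_i, e.g., "chair"
    labels: torch.Tensor              # C-dim vector Y_i with entries in {1, 0, -1}

C = 620
labels = torch.full((C,), -1)         # every attribute starts as missing/unknown
labels[[3, 57]] = 1                   # explicitly annotated positives (indices are made up)
labels[10] = 0                        # an explicitly annotated negative
```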
2.3.1 Model Architecture

Image feature representation. Given an image I of an object o, let f_{img}(I) ∈ R^{H×W×D} be the D-dimensional image feature map with spatial size H × W extracted using any CNN backbone architecture. In our model, we use the output of the penultimate layer of ResNet-50 [69].

Image-object feature composition. Prior models for attribute prediction mostly tackle domain-specific settings or a limited number of object categories [9, 40, 49]. Hence, these works are able to employ object-agnostic attribute classification. However, because our VAW dataset contains attribute annotations across a diverse set of object categories, incorporating the object embedding as input can help the model learn to avoid infeasible attribute-object combinations (e.g., parked dog). There are multiple ways to compose the image feature map with the object embedding [32, 70, 71]. Here, we opt for a simple object-conditioned gating mechanism, which we find to be consistently better than the concatenation used in [32, 33]. Let φ_o ∈ R^d be the object embedding vector and f_{comp}(f_{img}(I), φ_o) ∈ R^D the composition module that takes in the image feature map and object embedding. We implement f_{comp} with a gating mechanism as follows:

f_{comp}(f_{img}(I), \phi_o) = f_{img}(I) \odot f_{gate}(\phi_o),   (2.1)
f_{gate}(\phi_o) = \sigma\big(W_{g_2} \cdot \mathrm{ReLU}(W_{g_1} \phi_o + b_{g_1}) + b_{g_2}\big),   (2.2)

where ⊙ is the channel-wise product, σ(·) is the sigmoid function, and f_{gate}(φ_o) ∈ R^D is a 2-layer MLP whose output is broadcast to match the spatial dimensions of the feature map. Intuitively, f_{gate} acts as a filter that selects only attribute features relevant to the object of interest and suppresses incompatible attribute-object pairs.

Relevant object localization. An object bounding box can contain both the relevant object and other objects or background. Hence, it is desirable to learn a smarter feature aggregation that can suppress all irrelevant image regions. We propose to leverage the availability of the object segmentation mask in the VAW dataset to achieve this. Let X ∈ R^{H×W×D} be the image-object composed feature map. The relevant object region G is localized using two stacked convolutional layers f_{rel} with kernel size 1, followed by a spatial softmax:

g = f_{rel}(X), \quad g \in \mathbb{R}^{H \times W},   (2.3)
G_{h,w} = \frac{\exp(g_{h,w})}{\sum_{h',w'} \exp(g_{h',w'})}, \quad G \in \mathbb{R}^{H \times W}.   (2.4)

We can then pool the image feature vector as

Z_{rel} = \sum_{h,w} G_{h,w} X_{h,w}.   (2.5)

G is learned with direct supervision from the object mask whenever it is available, using the following loss:

\mathcal{L}_{rel} = \sum_{h,w} \big( G_{h,w} \times (1 - M_{h,w}) - \lambda_{rel} \, G_{h,w} \times M_{h,w} \big),   (2.6)

where M is the ground-truth binary object mask. Rather than requiring G to exactly match the object mask, we find it is better to penalize the network whenever its prediction falls outside of the mask. This frees the network to learn heterogeneous attention within the object region if necessary (e.g., black mirror refers to its frame being black rather than its interior) instead of distributing its attention uniformly over the object. Hence, by setting λ_{rel} to a small positive constant less than 1, we prioritize the need for G to not attend to non-object pixels over the need to uniformly attend to all pixels on the object surface. A simplified sketch of the gating and localization components follows.
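The PyTorch sketch below illustrates the object-conditioned gating (Eqs. 2.1–2.2) and the mask-guided localizer with spatial softmax (Eqs. 2.3–2.6). The module name and hidden sizes are illustrative assumptions, not the exact implementation; the loss is averaged over the batch for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCompositionLocalizer(nn.Module):
    def __init__(self, feat_dim=2048, obj_dim=100, hidden=512, lambda_rel=0.25):
        super().__init__()
        self.gate = nn.Sequential(                      # f_gate: 2-layer MLP + sigmoid (Eq. 2.2)
            nn.Linear(obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.Sigmoid())
        self.loc = nn.Sequential(                       # f_rel: two conv layers with kernel size 1
            nn.Conv2d(feat_dim, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1))
        self.lambda_rel = lambda_rel

    def forward(self, feat_map, obj_emb, mask=None):
        # feat_map: B x D x H x W, obj_emb: B x d, mask: B x H x W binary or None
        gate = self.gate(obj_emb)[:, :, None, None]     # broadcast over spatial dims
        x = feat_map * gate                             # Eq. 2.1: channel-wise gating
        g = self.loc(x).flatten(1)                      # B x (H*W)
        G = F.softmax(g, dim=1)                         # Eq. 2.4: spatial softmax
        z_rel = torch.einsum('bn,bdn->bd', G, x.flatten(2))   # Eq. 2.5: pooled object feature
        loss_rel = None
        if mask is not None:                            # Eq. 2.6: punish attention outside mask,
            m = mask.flatten(1).float()                 # mildly encourage attention inside it
            loss_rel = (G * (1 - m) - self.lambda_rel * G * m).sum(dim=1).mean()
        return x, G, z_rel, loss_rel
```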
Multi-attention. Object localization is beneficial for recognizing several attributes such as color, material, texture, and shape, but might be too restrictive for attributes that require attention to different object parts or to the background. For example, bald-headed or bare-footed requires looking at a person's head or feet, and distinguishing different activities (e.g., jumping vs. crouching) might require information from the surrounding context. Therefore, we utilize a free-form multi-attention mechanism that allows our model to attend to features at different spatial locations.

There are two extreme cases when applying spatial attention [41]: (1) one attention map for all attributes and (2) one attention map per attribute [40]. The first approach is similar to using the object foreground, which is unlike what we are aiming for. The latter allows more control but does not scale well with a large number of attributes. Hence, we opt for a hybrid multi-attention idea as in [41]. We extract M attention maps \{A^{(m)}\}_{m=1}^{M} from X using f^{(m)}_{att}, which has the same architecture as f_{rel}:

E^{(m)} = f^{(m)}_{att}(X), \quad E^{(m)} \in \mathbb{R}^{H \times W}, \quad m = 1, \dots, M,   (2.7)
A^{(m)}_{h,w} = \frac{\exp(E^{(m)}_{h,w})}{\sum_{h',w'} \exp(E^{(m)}_{h',w'})}, \quad A^{(m)} \in \mathbb{R}^{H \times W}.   (2.8)

This is partly similar to [72], where object parts are localized using learned embeddings of these parts. Because the VAW dataset does not have part annotations for every attribute, that approach is not usable in our case. Similar to [41], we employ the following divergence loss to encourage the attention maps to focus on different regions:

\mathcal{L}_{div} = \sum_{m \neq n} \frac{\langle E^{(m)}, E^{(n)} \rangle}{\|E^{(m)}\|_2 \, \|E^{(n)}\|_2}.   (2.9)

Using the computed M attention maps, we aggregate M feature vectors \{r^{(m)}\}_{m=1}^{M} from X and pass them through a projection layer to obtain their final representations:

r^{(m)} = \sum_{h,w} A^{(m)}_{h,w} X_{h,w}, \quad r^{(m)} \in \mathbb{R}^{D},   (2.10)
z^{(m)}_{att} = f^{(m)}_{proj}(r^{(m)}), \quad z^{(m)}_{att} \in \mathbb{R}^{D_{proj}}.   (2.11)

Our final multi-attention feature is the concatenation of all individual attention features:

Z_{att} = \mathrm{concat}\big([z^{(1)}_{att}, \dots, z^{(M)}_{att}]\big).   (2.12)

A compact sketch of this module is shown below.
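The following compact PyTorch sketch approximates the multi-attention module (Eqs. 2.7–2.12) and the divergence loss (Eq. 2.9). Shapes follow the text (M maps, projection size D_{proj}); the divergence loss is averaged over the batch here, and other implementation details are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_maps=3, proj_dim=128):
        super().__init__()
        self.att_heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_dim, hidden, 1), nn.ReLU(),
                          nn.Conv2d(hidden, 1, 1))
            for _ in range(num_maps)])                        # f_att^(m), same form as f_rel
        self.proj = nn.ModuleList([nn.Linear(feat_dim, proj_dim)
                                   for _ in range(num_maps)])  # f_proj^(m)

    def forward(self, x):
        # x: B x D x H x W image-object composed feature map
        flat = x.flatten(2)                                    # B x D x (H*W)
        pre_softmax, z_att = [], []
        for head, proj in zip(self.att_heads, self.proj):
            e = head(x).flatten(1)                             # Eq. 2.7: B x (H*W)
            a = F.softmax(e, dim=1)                            # Eq. 2.8: attention map A^(m)
            r = torch.einsum('bn,bdn->bd', a, flat)            # Eq. 2.10: attended feature
            z_att.append(proj(r))                              # Eq. 2.11: projected feature
            pre_softmax.append(F.normalize(e, dim=1))
        E = torch.stack(pre_softmax, dim=1)                    # B x M x (H*W), unit-normalized
        sim = torch.einsum('bmn,bkn->bmk', E, E)               # pairwise cosine similarities
        # Eq. 2.9: sum of similarities over pairs m != n (batch-averaged)
        loss_div = (sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=1)).mean()
        return torch.cat(z_att, dim=1), loss_div               # Eq. 2.12: Z_att
```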
2.3.2 Training Objectives

Our final feature vector is the concatenation of the localized object feature and the multi-attention feature. In addition, we also find that using low-level features from early blocks improves accuracy for low-level attributes (color, material). Therefore, we also pool low-level features from the estimated object region G to construct Z_{low}. The input to the classification layer is [Z_{low}, Z_{rel}, Z_{att}], and we use a linear classifier with C output logits followed by a sigmoid.

Let \hat{Y} = [\hat{y}_1, \dots, \hat{y}_C] be the output of the classification layer. We apply the following reweighted binary cross-entropy loss that takes data imbalance into account:

\mathcal{L}_{bce}(Y, \hat{Y}) = -\sum_{c=1}^{C} w_c \big( \mathbb{1}_{[y_c = 1]} \, p_c \log(\hat{y}_c) + \mathbb{1}_{[y_c = 0]} \, n_c \log(1 - \hat{y}_c) \big),

where w_c, p_c, and n_c are respectively the reweighting factors for attribute c, its positive examples, and its negative examples. Let n^{pos}_c and n^{neg}_c be the number of positives and negatives of attribute c. First, we want w_c to reflect the importance of the rare attributes, so we set w_c ∝ 1/(n^{pos}_c)^α and normalize so that Σ_c w_c = C [52] (α is a smoothing factor). Second, we want to balance the effect of positive and negative examples. We apply the same idea by setting p_c ∝ 1/(n^{pos}_c)^α and n_c ∝ 1/(n^{neg}_c)^α and normalizing so that p_c + n_c = 2. As a result, the ratio between the positive and negative terms becomes p_c/n_c = (n^{neg}_c/n^{pos}_c)^α, which helps balance out their effect based on their frequency.

Our reweighted BCE (termed RW-BCE) is different from [37], where the authors propose to reweigh each sample based on its proportion of available labels (i.e., an object instance with fewer available labels is assigned a larger weight). We posit this is not ideal because the number of labels on an instance should not affect the loss computation (e.g., the loss for red should be the same between a red car instance and a large shiny red car instance, even though the latter is annotated with more labels).

Our overall loss is a combination of all the loss functions presented above:

\mathcal{L} = \mathcal{L}_{bce} + \mathcal{L}_{rel} + \lambda_{div} \mathcal{L}_{div}.   (2.13)

Empirically, we find that applying repeat factor sampling (RFS) [56, 57] together with RW-BCE works well. RFS defines a repeat factor for every image based on the rarity of the labels it contains. Therefore, we employ both RW-BCE and RFS (referred to as RR) in training our model. A sketch of the RW-BCE reweighting terms is given below.
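The sketch below shows one way the RW-BCE reweighting terms could be computed and applied: w_c down-weights frequent attributes, while p_c and n_c balance positives against negatives per attribute. Function names are illustrative; the optional soft-negative weight corresponds to the "soft negatives" treatment of missing labels mentioned in the Figure 2.4 caption.

```python
import torch

def rwbce_weights(n_pos: torch.Tensor, n_neg: torch.Tensor, alpha: float = 0.1):
    """n_pos, n_neg: per-attribute counts of positive/negative labels (length C)."""
    C = n_pos.numel()
    w = 1.0 / n_pos.clamp(min=1).float() ** alpha
    w = w * C / w.sum()                        # normalize so that sum_c w_c = C
    p = 1.0 / n_pos.clamp(min=1).float() ** alpha
    n = 1.0 / n_neg.clamp(min=1).float() ** alpha
    scale = 2.0 / (p + n)                      # normalize so that p_c + n_c = 2
    return w, p * scale, n * scale             # note: p_c / n_c = (n_neg / n_pos) ** alpha

def rwbce_loss(y_hat, y, w, p, n, soft_neg_weight=0.0):
    """y_hat: B x C sigmoid outputs; y: B x C labels in {1, 0, -1}."""
    pos = (y == 1).float()
    neg = (y == 0).float()
    # optionally treat missing labels (-1) as very-low-weight "soft negatives"
    neg = neg + soft_neg_weight * (y == -1).float()
    eps = 1e-7
    loss = -(w * (pos * p * torch.log(y_hat + eps) +
                  neg * n * torch.log(1 - y_hat + eps)))
    return loss.sum(dim=1).mean()
```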
2.3.3 Negative Label Expansion

While our dataset provides an unprecedented amount of explicitly labeled negative annotations, the number of possible negatives still far outnumbers the number of possible positive attributes. Because many attributes are mutually exclusive (i.e., the presence of the attribute clean implies the absence of the attribute dirty), we seek to use existing linguistic and external knowledge tools to expand the set of negative annotations.

Consider an attribute type A (e.g., material). The following observations can be made about its attributes: (1) there exists an overlapping relation between some attributes due to their visual similarity or their being hierarchically related (e.g., wooden overlaps with wicker); (2) there exists an exclusive relation where two attributes cannot appear on the same object (e.g., wet vs. dry). Therefore, for an object labeled with attribute a ∈ A, we can generate negative labels for it from the set {a′ ∈ A | ¬overlap(a, a′) ∨ exclusive(a, a′)}.

We classify the attributes into types and construct their overlapping and exclusive relations using the existing ontology from related work [73], WordNet [68], and the relation edges from ConceptNetAPI [74]. We further expand the overlapping relations based on the co-occurrence (via conditional probability) of attribute pairs (e.g., white and beige are similar and often mistaken by human annotators). Our negative label expansion scheme allows us to add 5.9M negative annotations to our training set. Aside from the extra negatives, one benefit of this approach is that when we want to label a novel attribute class, we can use the same procedure to discover its relationship with existing attributes in the dataset and attain free negatives for the new class.

2.3.4 Supervised Contrastive Learning

[75] shows that imbalanced learning can benefit from self-supervised pretraining on both labeled and unlabeled data, where a network can be better initialized by avoiding the strong label bias caused by data imbalance. Also motivated by [67], we propose to use supervised contrastive (SupCon) pretraining for our attribute learning with partial labels problem, extending the SupCon loss from a single-label to a multi-label setting.

We perform mean-pooling inside the feature map X to obtain x ∈ R^D. We follow the design of SimCLR [66] and add a projection layer to map z = Proj(x) ∈ R^{128}. The projection layer is an MLP with hidden size 2048 and is only used during pretraining. In a multi-label setting, it is not trivial how to pull two samples together, since they can share some labels but differ in others. Motivated by [28, 76], we propose to represent each attribute c as a matrix A_c ∈ R^{128×128} that linearly projects z into an attribute-aware embedding space, z_c = A_c z, which is then ℓ2-normalized onto the unit hypersphere. With this, samples that share the same attribute can have their respective attribute-aware embeddings pulled together.

In the pretraining stage, we construct a batch of 2N sample-label vector pairs \{I_i, Y_i\}_{i=1}^{2N}, where I_{2k} and I_{2k-1} (k = 1, \dots, N) are two views (from random augmentation) of the same object image and Y_{2k} = Y_{2k-1}. Let z_{i,c} be the c-attribute-aware embedding of I_i, and B(i) = \{c \in C : Y_{i,c} = 1\} the set of positive attributes of I_i. We reuse notation from [67]: K ≡ \{1, \dots, 2N\}, A(i) ≡ K \setminus \{i\}, P(i, c) ≡ \{p \in A(i) : Y_{p,c} = Y_{i,c}\}, and use the following SupCon loss:

\mathcal{L}_{sup} = \sum_{i=1}^{2N} \sum_{c \in B(i)} \frac{-1}{|P(i,c)|} \sum_{p \in P(i,c)} \log \frac{\exp(z_{i,c} \cdot z_{p,c} / \tau)}{\sum_{j \in A(i)} \exp(z_{i,c} \cdot z_{j,c} / \tau)}.   (2.14)

The linear transformation with A_c, followed by the dot product in the SupCon loss, implements an inner product in the embedding space of z, which can be interpreted as finding the part of z that encodes the attribute c [76]. Therefore, our approach fits nicely into the multi-label setting, where an image embedding vector z can simultaneously encode multiple attribute labels, each probed by its linear transformation for contrasting in the SupCon loss. After the pretraining stage, we keep the backbone encoder and the image-object composition module and finetune them along with the classification layer.

While SupCon is designed to be used for pretraining, we empirically find that it hampers the multi-attention module's ability to focus on specific regions. To reconcile this difference, we find it is empirically better to minimize \mathcal{L}_{sup} jointly with the other losses. For models that do not use attention (vanilla ResNet), we find SupCon pretraining is still effective. A sketch of the multi-label SupCon loss is shown below.
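The following sketch illustrates Eq. 2.14 with attribute-aware projection matrices A_c. It is a simplified reference implementation under the batch layout described above (two augmented views per object); vectorization, masking tricks, and initialization details of the actual training code are omitted.

```python
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(z, labels, A, tau=0.25):
    """z: 2N x 128 projected embeddings (two augmented views per object);
    labels: 2N x C with entries in {1, 0, -1}; A: C x 128 x 128 attribute matrices."""
    two_n, num_attrs = labels.shape
    loss = z.new_zeros(())
    for c in range(num_attrs):
        pos_idx = (labels[:, c] == 1).nonzero(as_tuple=True)[0]
        if pos_idx.numel() < 2:
            continue                                   # need at least one positive pair for c
        z_c = F.normalize(z @ A[c].T, dim=1)           # attribute-aware embeddings z_{i,c}
        sim = z_c @ z_c.T / tau                        # pairwise similarities / temperature
        for i in pos_idx.tolist():                     # anchors i with c in B(i)
            others = torch.arange(two_n, device=z.device) != i   # A(i) = K \ {i}
            positives = pos_idx[pos_idx != i]                    # P(i, c)
            log_prob = sim[i][positives] - torch.logsumexp(sim[i][others], dim=0)
            loss = loss - log_prob.mean()              # -1/|P(i,c)| * sum_p log(...)
    return loss
```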
2.4 Experiments

In this section, we discuss the implementation details of our method and the evaluation metrics, and report the results of our method and other related baselines on the VAW dataset.

2.4.1 Implementation Details

We use the ImageNet-pretrained [77] ResNet-50 [69] as the feature extractor and use the output feature maps from ResNet blocks 2 and 3 as low-level features. For the object name embedding, we use the pretrained GloVe [78] 100-d word embeddings. We do not finetune these word embeddings during training, as we want our model to generalize to unseen objects at test time.

We implement our model in PyTorch [79] and train using the Adam optimizer with the default settings, batch size 64, weight decay of 1e-5, and an initial learning rate of 1e-5 for the pretrained ResNet and 0.0007 for the rest of the model. We train for 12 epochs and apply a learning rate decay of 0.1 whenever the mAP on the validation set stops improving for 2 epochs. We use an input image size of 224x224 and basic image augmentations, which include random cropping around the object bounding box, random grayscale when an instance is not labeled with any color attributes, minor color jittering, and horizontal flipping. For each object bounding box in the dataset, we expand its width and height by min(w, h) × 0.3 to capture more context.

For the hyperparameters, we set λ_{rel} = 0.25 and λ_{div} = 0.004. In the multi-attention module, we select D_{proj} = 128 and use M = 3 attention maps. Regarding reweighting and resampling, we use t = 0.0006 for RFS and α = 0.1 for smoothing in the RW-BCE reweighting terms. For SupCon pretraining, we pretrain on top of the ImageNet-pretrained ResNet for 10 epochs with batch size 384 (768 views per batch), and initialize all matrices A_c with the identity matrix. In the contrastive loss, we set the temperature τ = 0.25. We believe using a larger batch size would greatly benefit supervised contrastive pretraining, as suggested by the authors [67]. For SupCon joint training with the other losses of the Strong Baseline model, we keep the batch size at 64, add λ_{sup} \mathcal{L}_{sup} to the loss with λ_{sup} = 0.5, and keep all other hyperparameters the same as above.

2.4.2 Evaluation Metrics

In this section, we present details about the different evaluation metrics that we use. We use mAP as our primary metric, since it describes how well the model ranks correct images higher than incorrect ones for each attribute label. mR@15 is also important, as it shows how well the model manages to output the ground-truth positive attributes in its top 15 predictions for each image. In addition, mA and F1@15 can also be used to evaluate model performance in a different light.

mAP: similar to [80], the mAP score is computed by taking the mean of the average precision over all C classes,

\mathrm{mAP} = \frac{1}{C} \sum_c AP_c,   (2.15)

in which the average precision of each class is computed as

AP_c = \frac{1}{P_c} \sum_{k=1}^{P_c} \mathrm{Precision}(k, c) \cdot \mathrm{rel}(k, c),   (2.16)

where P_c is the number of positive examples of class c, Precision(k, c) is the precision of class c when retrieving the best k images, and rel(k, c) is the indicator function that returns 1 if class c is a ground-truth positive annotation of the image at rank k. Note that because VAW is partially labeled, we compute this metric only on the annotated data, as in [80]. This evaluation scheme is also similar to the one used in [56], where the authors introduce the notion of a federated dataset. In the federated-dataset setup, we only need a positive and a negative set for each label; the average precision for each label can then be computed on these two sets.

mA: as in [81, 82], we compute the mean balanced accuracy (mA) to evaluate all models in a classification setting, using 0.5 as the threshold between positive and negative predictions. Because our dataset is highly unbalanced between the number of positive and negative examples for some attributes, balanced accuracy is a suitable metric, as it calculates the accuracy of positive and negative examples separately and then averages them. Concretely, the mA score is computed as

\mathrm{mA} = \frac{1}{C} \sum_c \frac{1}{2} \left( \frac{TP_c}{P_c} + \frac{TN_c}{N_c} \right),   (2.17)

where C is the number of attribute classes, P_c and TP_c are the number of positive examples and true positive predictions of class c, and N_c and TN_c are defined similarly for the negative examples and predictions. Because mA uses a threshold of 0.5, models that are not well balanced between positive and negative predictions tend to receive a low score. This metric is also used in pedestrian and human facial attribute works [40, 81].

mR@15: mean recall over all classes at the top 15 predictions in each image. Recall@K is often used for datasets that are not exhaustively labeled, as in scene graph generation [83, 84]. It is also used in multi-label learning [38, 39, 41] under the name 'per-class recall'. A sketch of the mAP and mA computations under partial labels is given below.
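The sketch below illustrates how mAP and mA can be evaluated on partially labeled data by restricting each attribute to its explicitly annotated instances. It uses a standard ranking-based average precision over the annotated instances as a close stand-in for Eqs. 2.15–2.17, and is an illustration rather than the official evaluation script.

```python
import numpy as np

def map_and_ma(scores: np.ndarray, labels: np.ndarray):
    """scores: N x C confidences in [0, 1]; labels: N x C in {1, 0, -1}."""
    aps, bal_accs = [], []
    for c in range(scores.shape[1]):
        annotated = labels[:, c] != -1               # evaluate only annotated instances
        y, s = labels[annotated, c], scores[annotated, c]
        if (y == 1).sum() == 0 or (y == 0).sum() == 0:
            continue                                 # need both positives and negatives
        order = np.argsort(-s)                       # rank annotated instances by score
        hits = (y[order] == 1).astype(float)
        precision_at_k = np.cumsum(hits) / (np.arange(hits.size) + 1)
        aps.append((precision_at_k * hits).sum() / hits.sum())     # average precision
        pred_pos = s >= 0.5                          # 0.5 threshold for balanced accuracy
        tpr = (pred_pos & (y == 1)).sum() / (y == 1).sum()
        tnr = (~pred_pos & (y == 0)).sum() / (y == 0).sum()
        bal_accs.append((tpr + tnr) / 2)
    if not aps:
        return 0.0, 0.0
    return float(np.mean(aps)), float(np.mean(bal_accs))
```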
F1@15: as mR@15 may be biased towards infrequent classes, we also report the overall F1 (harmonic mean of precision and recall) at the top 15 predictions in each image. Because VAW is partially labeled, we only evaluate predictions for labels that have been annotated. The overall precision and recall are computed as follows:

\mathrm{OV\text{-}Precision} = \frac{\sum_c TP_c}{\sum_c N^p_c}, \quad \mathrm{OV\text{-}Recall} = \frac{\sum_c TP_c}{\sum_c P_c},   (2.18)

where TP_c is the number of true positives for attribute class c, N^p_c is the number of positive predictions for class c, and P_c is the number of ground-truth positive examples of class c. The F1 score is the harmonic mean of precision and recall:

\mathrm{OV\text{-}F1} = \frac{2 \times \mathrm{OV\text{-}Precision} \times \mathrm{OV\text{-}Recall}}{\mathrm{OV\text{-}Precision} + \mathrm{OV\text{-}Recall}}.   (2.19)

2.4.3 Baselines

We consider the following baselines and state-of-the-art multi-label learning approaches and compare them to our SCoNE algorithm. We made our best attempt to modify the authors' implementations (when available) to include our image-object composition module. All models use ResNet-50 as their backbone and are trained with the BCE loss (except LSEP and ResNet-Baseline-CE). Empirically, we find that treating missing labels as negatives and assigning them very small weights in the BCE loss also improves results; hence, we apply this to all methods.

• ResNet-Baseline: ResNet-50 followed by the image-object composition and classification layer.
• ResNet-Baseline-CE: Same as above, but uses the softmax cross-entropy loss. This is used by [32, 33] to train the attribute prediction head of object detectors on Visual Genome.
• Strong Baseline (SB): The combination of our image-object composition, multi-attention, and object localizer.
• LSEP [39]: Uses a ranking loss and threshold estimation to predict which attributes to output.
• ML-GCN [38]: Uses a graph convolutional network to predict classifier weights based on the GloVe embeddings of the attribute names. The label correlation graph is constructed following the authors' implementation.
• Durand et al. (Partial-BCE + GNN) [37]: BCE loss reweighted by the authors' reweighting scheme. A graph neural network is applied on the output logits.
• Sarafianos et al. [40]: State of the art in pedestrian attribute prediction that also uses multi-attention.

Table 2.2: Experimental results compared with baselines and SOTA multi-label learning methods. The top box displays results of multi-label learning methods; the middle box shows results of models from attribute prediction works and our strong baseline; the last row shows the performance of our SCoNE algorithm applied on top of the strong baseline.

Methods | mAP | mR@15 | mA | F1@15 | Head | Medium | Tail | Color | Material | Shape | Size | Texture | Action | Others
LSEP [39] | 61.0 | 50.7 | 67.1 | 62.3 | 69.1 | 57.3 | 40.9 | 56.1 | 67.1 | 63.1 | 61.4 | 58.7 | 50.7 | 64.9
ML-GCN [38] | 63.0 | 52.8 | 69.5 | 64.1 | 70.8 | 59.8 | 42.7 | 59.1 | 64.7 | 65.2 | 64.2 | 62.8 | 54.7 | 66.5
Partial-BCE + GNN [37] | 62.3 | 52.3 | 68.9 | 63.9 | 70.1 | 58.7 | 40.1 | 57.7 | 66.5 | 64.1 | 65.1 | 59.3 | 54.4 | 65.9
ResNet-Baseline [9] | 63.0 | 52.1 | 68.6 | 63.9 | 71.1 | 59.4 | 43.0 | 58.5 | 66.3 | 65.0 | 64.5 | 63.1 | 53.1 | 66.7
ResNet-Baseline-CE [32, 33] | 56.4 | 55.8 | 50.3 | 61.5 | 64.6 | 52.7 | 35.9 | 54.0 | 64.6 | 55.9 | 56.9 | 54.6 | 47.5 | 59.2
Sarafianos et al. [40] | 64.6 | 51.1 | 68.3 | 64.6 | 72.5 | 61.5 | 42.9 | 62.9 | 68.8 | 64.9 | 65.7 | 62.3 | 56.6 | 67.4
Strong Baseline (SB) | 65.9 | 52.9 | 69.5 | 65.3 | 73.6 | 62.5 | 46.0 | 64.5 | 68.9 | 67.1 | 65.7 | 66.1 | 57.2 | 68.7
SB + SCoNE (Ours) | 68.3 | 58.3 | 71.5 | 70.3 | 76.5 | 64.8 | 48.0 | 70.4 | 75.6 | 68.3 | 69.4 | 68.4 | 60.7 | 69.5
(Head/Medium/Tail report mAP under class imbalance; the Color through Others columns report mAP per attribute type.)
Table 2.3: Ablation study. We show how each of our proposed techniques helps improve overall performance.

Methods | mAP | mR@15 | mA | F1@15
Strong Baseline (SB) | 65.9 | 52.9 | 69.5 | 65.3
+ Negative | 67.7 | 54.3 | 70.0 | 69.6
+ Neg + SupCon | 68.2 | 55.2 | 70.3 | 70.0
+ Neg + SupCon + RR (SCoNE) | 68.3 | 58.3 | 71.5 | 70.3
ResNet-Baseline | 63.0 | 52.1 | 68.6 | 63.9
+ SCoNE | 66.4 | 56.8 | 70.7 | 68.8

2.4.4 Results

Overall results are shown in Table 2.2, where SB and SB+SCoNE are compared with other baselines and state-of-the-art algorithms. Overall, SB is better than the other baselines in almost all metrics except for mR@15, where it is lower than ResNet-Baseline-CE. This shows that the object localizer and multi-attention are effective for attribute prediction. ResNet-Baseline-CE, which is adopted by [32, 33], has good recall but very low precision (mAP and F1). This is in contrast to ResNet-Baseline, which is trained with BCE.

SB+SCoNE substantially improves over SB in all metrics and clearly surpasses the available algorithms by a large margin. It is particularly effective for long-tail attributes, where it outperforms its closest competitor (other than SB) by 5 mAP points, and is also highly effective in detecting color and material attributes, where it is nearly 7-8 mAP points higher than the next-best method. This shows that our attribute learning paradigm, including the negative label expansion, supervised contrastive loss, and reweighting and resampling scheme, is clearly effective for attribute learning.

2.4.5 Ablation Studies

Components of SCoNE. Table 2.3 shows the effect of the different components of SCoNE. Starting from our SB, we can see that each of our model choices substantially improves its performance, with the biggest mAP improvement provided by our negative label expansion scheme. The components of SCoNE also stack additively, with our final model performing 2.4 mAP, 5.4 mR@15, and 5 F1@15 points above SB. Moreover, the components of SCoNE are model agnostic. We verify this by enhancing our ResNet-Baseline with SCoNE, which also improves its mAP and mR@15 by 3.4 and 4.7 points.

Components of Strong Baseline. The Strong Baseline comprises several sub-components that extend the ResNet-Baseline: the object localizer, the multi-attention module, and the usage of low-level features. We ablate our Strong Baseline model with respect to each component and train on our training data after negative label expansion. We report results in Table 2.4.

Table 2.4: Ablation study on the three components of the Strong Baseline model by removing each one. The last row also corresponds to the ResNet-Baseline model.

Methods (+Neg) | mAP | mR@15 | mA | F1@15 | Head | Medium | Tail | Color | Material | Shape | Texture | Action | Others
Strong Baseline | 67.7 | 54.3 | 70.0 | 69.6 | 75.9 | 64.3 | 46.9 | 68.8 | 73.9 | 67.0 | 69.4 | 60.2 | 69.1
w/o Multi-attention (MA) | 67.4 | 53.5 | 69.7 | 69.7 | 75.9 | 63.8 | 46.4 | 67.8 | 74.7 | 66.9 | 68.5 | 58.0 | 69.0
w/o Low-level feature (LL) | 67.3 | 53.7 | 69.9 | 69.4 | 75.4 | 63.8 | 48.4 | 68.5 | 73.6 | 66.1 | 67.5 | 59.3 | 68.9
w/o Object localizer (OL) | 66.9 | 53.1 | 69.6 | 69.1 | 75.3 | 63.4 | 45.5 | 67.5 | 73.8 | 66.5 | 68.4 | 58.9 | 68.3
w/o OL, MA and LL | 65.6 | 53.8 | 69.4 | 68.6 | 74.8 | 62.3 | 43.2 | 67.3 | 73.3 | 66.3 | 67.7 | 56.0 | 67.4
(Head/Medium/Tail and the attribute-type columns report mAP.)

Removing each sub-component has a negative effect on the performance of the Strong Baseline model. For example, removing the low-level features not only lowers mAP for color and material attributes but also for higher-level attributes (e.g., action). This is likely due to the absence of clearly separated low- and high-level features, which forces a single feature to represent both. This adversely affects the network's ability to learn high-level attributes (e.g., action) as well as low-level ones (color, texture), thus lowering performance for both.

Interestingly, removing the object localizer does not result in a drastically diminished performance. Visualizing the multi-attention output of our full model (Figure 2.5) reveals that even without object mask supervision, the model is still able to differentiate between the object and background/distractors with the multi-attention maps, which are trained with weak supervision from the attribute labels.
However, removing all components, which leaves the model devoid of any form of attention, severely hampers performance across all categories. In general, all sub-components are necessary for our model to perform well across different attribute types.

2.4.6 Qualitative Results

Figure 2.5 shows qualitative results of our SB+SCoNE model, which showcases its various strengths. First, the model shows a robust ability to predict a variety of attribute types for different objects with good accuracy. Next, our object localizer shows a remarkable ability to find the correct object of interest and to ignore the background and other distracting objects (in Figure 2.5a, the ground is not attended to by the object localizer). Our multi-attention module often complements the object localizer by attending to relevant image regions that may lie outside the object region. For example, in Figure 2.5b, the activity skateboarding is easier to predict if our model can look at the skateboard, but the skateboard is outside the person region. Here, our multi-attention correctly learns to look at appropriate image regions that help our model determine that the person has the attribute skateboarding.

Figure 2.6 shows more attribute prediction examples from our model. Our object localizer can correctly infer the object region and help alleviate the object occlusion problem. For example, in Figure 2.6e, the object table is partially occluded by a lot of clutter, which can distract a model that relies on global average pooling. Here, our object localizer clearly isolates the visible parts of the table, making it easier to predict its attributes. Because many attributes depend on context, e.g., parked vs. running car, we designed our multi-attention module to complement the object localizer by allowing it to attend to image regions outside the object. This can be clearly seen in Figure 2.6a, where attributes such as sunny or bright can be hard to infer by looking only at the given object tree. Our multi-attention module looks at the sunny spots on the pavement, which helps our model infer the presence of the sunny and bright attributes. However, the multi-attention module is also free to attend to regions within the object, further supplementing the object localizer's attention to specific object parts. For example, in Figure 2.6d, our multi-attention module attends to the hind legs of the dog, which supplements the object localizer's attention map and provides additional information to help the model infer that the dog is jumping.

From Figure 2.7 to Figure 2.10, we show our image search (ranking) results when searching for specific attributes. Our model is able to search for images that exhibit one or multiple attributes, as demonstrated in Figure 2.8 where we search for multiple colors at a time. One simple way to implement such attribute-based search is sketched below.
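The following sketch shows one simple way such an attribute-based search could be implemented: score every object instance with the trained classifier and rank by the combined confidence of the queried attributes. This is an illustrative assumption about the ranking procedure, not necessarily the exact one used to produce the figures.

```python
from typing import List
import torch

@torch.no_grad()
def search_by_attributes(scores: torch.Tensor, query_ids: List[int], top_k: int = 10):
    """scores: N x C sigmoid outputs for N object instances; query_ids: queried attribute indices."""
    # require every queried attribute to be confident: combine with the minimum score
    combined = scores[:, query_ids].min(dim=1).values
    return torch.topk(combined, k=top_k).indices      # indices of the top-ranked instances

# e.g., to search for objects that are both "red" and "shiny" (indices are hypothetical):
# top = search_by_attributes(all_scores, [red_idx, shiny_idx])
```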
In addition, the results in Figure 2.10 show that our model is able to differentiate between objects of different sizes (e.g., small vs. large bird, small vs. large phone).

2.5 Discussion

VAW is a first-of-its-kind large-scale object attribute prediction dataset in the wild. We explored various challenges posed by the VAW dataset and discussed the efficacy of current models on this task. Our SCoNE model proposed several novel algorithmic improvements that have helped us improve performance on the VAW dataset over our strong baseline by 2.4 mAP and 5.4 mR@15 points.

Figure 2.5: Examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above each image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

Despite our results, there are several outstanding challenges remaining to be solved in VAW.

Data imbalance: Reweighting and resampling techniques have considerably improved the performance on tail categories in the VAW dataset. However, even for our best model, mAP for tail categories still lags more than 25 points behind the head category. Similar to many vision and language problems [85], this is one considerable challenge for future work in this space.

Object-bias effect: Using the object label as input is crucial to obtain good results on VAW, but it may also introduce object bias into predictions. Ideally, an algorithm should be able to make robust predictions for compositionally novel instances. While not in the scope of this work, this can be explored in detail by redistributing the train-test split in compositionally novel patterns [27, 86, 87].

Figure 2.6: More examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above each image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

Figure 2.7: Image search results. We show the top retrieved images of SB+SCoNE when searching for some color attributes.

Figure 2.8: Image search results. We show the top retrieved images of SB+SCoNE when searching for images that exhibit multiple color attributes.

Figure 2.9: Image search results. We show the top retrieved images of SB+SCoNE when searching for some material attributes.

Figure 2.10: Image search results. We show the top retrieved images of SB+SCoNE when searching for some size attributes.

In conclusion, we believe that VAW can serve as an important benchmark not only for attribute prediction in the wild, but also as a generic testbed for long-tailed multi-label prediction with limited labels, data imbalance, out-of-distribution testing, and bias-related issues.
Chapter 3: Open-Vocabulary Attribute Prediction

This chapter is based on the publication Improving closed and open-vocabulary attribute prediction using transformers, Pham et al., In European Conference on Computer Vision, pages 201–219. Springer, 2022.

In recent years, several datasets have provided explicit annotations of object attributes, such as [10, 88]. However, they are still limited in their coverage of objects and unique attributes, with even the largest datasets consisting of only a few hundred attributes. Additionally, existing work considers attributes to include only adjective properties, excluding an object's interactions with other objects in the scene. The latter is often classified as visual relationship detection and is treated as an entirely different research topic [89–91], one that requires localization of both subject and object in a subject-predicate-object triplet. We believe this distinction is unnecessarily limiting; e.g., person wearing hat conveys information about a property of person that is useful even if the exact grounding of hat is unknown. Hence, we expand the definition of attributes to include adjective- as well as action- and interaction-based properties from the point of view of an object.

To this end, we first describe a pipeline to extract object-centric attributes and interactions from large quantities of grounded, weakly grounded, and ungrounded image-text pairs. Then, we propose a novel attribute prediction model called Transformer for Attribute Prediction (TAP). TAP can predict an order of magnitude more unique attributes than previous methods, matching the performance of supervised baselines when directly transferred to the VAW benchmark [88]. After finetuning, we outperform prior art by 5.1% mAP and 5.0% mean recall. Further