ABSTRACT

Title of Dissertation: RECOGNIZING OBJECT-CENTRIC ATTRIBUTES AND RELATIONS

Khoi Viet Pham

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science

Recognizing an object's visual appearance through its attributes, such as color and shape, and its relations to other objects in an environment is an innate human ability that allows us to effortlessly interact with the world. This ability remains effective even when humans encounter unfamiliar objects or objects whose appearances evolve over time, as humans can still identify them by discerning their attributes and relations. This dissertation aims to equip computer vision systems with this capability, empowering them to recognize objects' attributes and relations and thereby become more robust in handling real-world scene complexities.

The thesis is structured into two main parts. The first part focuses on recognizing attributes of objects, an area where existing research is limited to domain-specific attributes or constrained by small-scale and noisy data. We overcome these limitations by introducing a comprehensive dataset for attributes in the wild, marked by challenges in attribute diversity, label sparsity, and data imbalance. To navigate these challenges, we propose techniques that address class imbalance, employ attention mechanisms, and utilize contrastive learning to align objects with shared attributes. However, as such a dataset is expensive to collect, we also develop a framework that leverages large-scale, readily available image-text data for learning attribute prediction. The proposed framework can effectively scale up to predict a larger space of attribute concepts in real-world settings, including novel attributes represented by arbitrary text phrases that are not encountered during training. We showcase various applications of the proposed attribute prediction frameworks, including semantic image search and object image tagging with attributes.

The second part delves into the understanding of visual relations between objects. First, we investigate how the interplay of attributes and relations can improve image-text matching. Moving beyond the computationally expensive cross-attention networks of previous studies, we introduce a dual-encoder framework using scene graphs that is more efficient yet equally powerful on current image-text retrieval benchmarks. Our approach produces scene graph embeddings rich in attribute and relation semantics, which we show to be useful for image retrieval and image tagging. Lastly, we present our work on training large vision-language models on image-text data for recognizing visual relations. We formulate a new subject-centric approach that predicts multiple relations simultaneously conditioned on a single subject. Our approach is among the first to learn from both weakly- and strongly-grounded image-text data to predict an extensive range of relationship classes.

RECOGNIZING OBJECT-CENTRIC ATTRIBUTES AND RELATIONS

by

Khoi Viet Pham

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Hernisa Kacorri, Dean's Representative
Professor Larry S. Davis
Professor Ramani Duraiswami
Dr. Zhe Lin, Adobe Research
© Copyright by Khoi Viet Pham 2023

Acknowledgments

First and foremost, I would like to express my gratitude to my advisor, Professor Abhinav Shrivastava, for his support and guidance throughout my PhD journey. Abhinav granted me tremendous freedom to pursue research topics that resonated with my interests, and provided the security that allowed me to focus solely on doing research. This was particularly invaluable during my difficult times. Abhinav has cultivated an outstanding research environment, equipping his students with a multitude of knowledge and skills. Even now, I am still mastering the things he taught and continuously seeking improvement, which fuels my inspiration and anticipation for my journey post-graduation. Abhinav has taught me to value and cherish the PhD experience as a once-in-a-lifetime opportunity, one that fosters my professional development as a researcher and my personal growth to become a better person. This was a great source of inspiration to continue this journey, and for that, I am profoundly grateful.

I would also like to thank Professor Larry Davis for giving me the opportunity to join the computer science graduate program at UMD, and for providing guidance at the beginning, when I was still a junior student with limited knowledge and experience. Professor Larry is knowledgeable in many research areas, always provides interesting insights, and asks surprising and thought-provoking questions that make every discussion with him a chance to learn something new. I appreciate the opportunity to work with him during the early stage of my PhD journey.

I am also grateful to Professor Ramani Duraiswami, Professor Hernisa Kacorri, and Dr. Zhe Lin for graciously dedicating their time to serve on my dissertation committee. Their review of my manuscript and their insightful feedback have contributed to the quality of the dissertation.

My summer internships have been pivotal in my growth, particularly the time spent working alongside exceptional researchers at Google and Adobe Research. I extend my appreciation to Junfeng He at Google, and Kushal Kafle, Zhihong Ding, Scott Cohen, Zhe Lin, Quan Tran, and Walter Chang at Adobe Research, for their mentorship and the learning experience of doing research in industry. I am particularly thankful to Junfeng and Kushal for their dedicated mentorship on academic writing and their hands-on assistance with writing papers for conference submissions. These internships not only shaped me into a better researcher and improved my abilities in communication and presentation through continuous meetings, but also provided substantial financial support for which I am immensely grateful.

I would like to express my deepest gratitude to my wife, Uyên, for her unconditional support and for being patient and understanding with all my decisions. I would like to thank her for staying with me through such a long and challenging journey: from living together on a single PhD stipend, cooking and preparing meals for me, and providing emotional support, to finding a full-time job herself and continuing to support me on the journey. She is the unseen force behind the scenes that makes the completion of this journey possible, for which I will be eternally grateful. I also want to express my appreciation to my family.
My father and my mother laid the foundations of the person I am today: they ensured that I obtained a good education, ignited the spark of scientific curiosity when I was young, and taught me the work ethic of what is right and what is wrong. I thank my younger brother for being the one who takes care of the family while I am still spending time in school pursuing dreams of my own. I also carry the memory of departed loved ones - my grandmother, with whom I cherished many childhood memories, and my father-in-law, who exemplified the depth of familial love - who have forever shaped how I want to live my life in the future. I am truly blessed to have my family, and it is to them that I lovingly dedicate this dissertation.

I would like to extend my thanks to friends in the lab, including Hao, Nirat, Pulkit, Bo, Chương, Lilian, Kamal, Soumik, Moustafa, Alex, Yixuan, Shuaiyi, Max, Hanyu, Vinoj, Matthew G., Shishira, Ahmed, Matthew W., Sharath, Saksham, Mara, Saketh, Namitha, Archana, Anubhav, Vatsal, Gaurav, Pravin, Luyu, and Varuni, each person with distinct admirable qualities that I want to learn from. Everyone has been a part of my memories on this journey. In particular, I want to express my appreciation to Nirat and Chương for the collaboration on several research projects. There are also friends outside the lab who shared the doctoral journey and made my early years much more enjoyable; I would like to thank Pattara, Yuheng, Jingling, Jun, Amin, and Rangfu. I would also like to thank my Vietnamese friends at UMD, who showed me there exists life outside of graduate school, including Phong, Khánh, Huyền, Cường, Phụng, Nhật, Chương, Quỳnh, Trí, Khoa, Thủy, Tín, Huy, Đăng, and Phương. My special gratitude goes to Cường and Phụng, and their parents, for generously hosting me on numerous occasions and for making my last semester so joyful. I also wish to express my gratitude to my dear high-school friend Long for the financial support during my tough times.

Last but not least, I extend my sincere thanks to Tom Hurst, Jodie Gray, Migo Gui, and the ISSS officials for all the timely support with the logistics and student visa-related issues.

I acknowledge it is impossible to remember all those who have helped me. If I have inadvertently omitted anyone, please accept my apologies. Your support, whether mentioned here or not, has been invaluable and deeply appreciated.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Overview
1.2 Visual Attributes of Objects
1.3 Visual Relations between Objects

Chapter 2: Closed-Set Attribute Prediction
2.1 Related Work
2.2 Visual Attributes in the Wild Dataset
2.2.1 Data Collection
2.2.2 Statistics
2.3 Approach
2.3.1 Model Architecture
2.3.2 Training Objectives
2.3.3 Negative Label Expansion
2.3.4 Supervised Contrastive Learning
2.4 Experiments
2.4.1 Implementation Details
2.4.2 Evaluation Metrics
2.4.3 Baselines
2.4.4 Results
2.4.5 Ablation Studies
2.4.6 Qualitative Results
2.5 Discussion

Chapter 3: Open-Vocabulary Attribute Prediction
3.1 Related Work
3.2 Large-Scale Attribute Dataset
3.3 Approach
3.3.1 Model Architecture
3.3.2 Training Objectives
3.4 Experiments
3.4.1 Implementation Details
3.4.2 Closed-Set Attribute Prediction
3.4.3 Open-Vocabulary Attribute Prediction
3.4.4 Ablation Studies
3.4.5 Closed-Set Human-Object Interaction Classification
3.5 Discussion

Chapter 4: Compose Object Relations and Attributes for Image-Text Matching
4.1 Related Work
4.2 Approach
4.2.1 Overall Framework
4.2.2 Feature Extraction
4.2.3 Scene Graph Embedding
4.2.4 Training Objectives
4.3 Experiments
4.3.1 Implementation Details
4.3.2 Dataset and Evaluation Metrics
4.3.3 Results
4.3.4 Ablation Studies
4.3.5 Text-to-Entity Retrieval
4.3.6 Inference Time Analysis
4.3.7 Qualitative Results
4.4 Discussion

Chapter 5: Subject-Centric Relationship Prediction from Image-Text Data
5.1 Related Work
5.2 Approach
5.2.1 Problem Formulation
5.2.2 Model Architecture
5.2.3 Training Objectives
5.2.4 Learning from Image-Text Data
5.3 Experiments
5.3.1 Implementation Details
5.3.2 Closed-Set Relationship Classification
5.3.3 Open-Vocabulary Relationship Classification
5.3.4 Ablation Studies
5.3.5 Qualitative Results
5.4 Discussion

Chapter 6: Summary and Discussion
6.1 Summary
6.2 Discussion

Bibliography

List of Tables

2.1 Statistics of VAW compared with other in-the-wild and domain-specific attribute datasets. The *person (resp. *clothes) category may represent multiple categories, including {boy, girl, man, woman, etc.} (resp. {shirt, pants, top, etc.}). While Visual Genome is the largest among these in terms of number of attribute annotations, it is sparsely labeled. Other datasets are either fully annotated for domain-specific attributes or more densely labeled but covering few object categories.

2.2 Experimental results compared with baselines and SOTA multi-label learning methods. The top box displays results of multi-label learning methods; the middle box shows results of models from attribute prediction works and our strong baseline; the last row shows the performance of our SCoNE algorithm applied to the strong baseline.

2.3 Ablation study. We show how each of our proposed techniques helps improve overall performance.

2.4 Ablation study on the three components of the Strong Baseline model by removing each one. The last row also corresponds to the ResNet-Baseline model.

3.1 Statistics of attributes in LSA. Note that Localized Narratives contains 32k and 122k images from Flickr30K and COCO respectively. Among all instances, 7.1M are grounded (bounding box), 1.4M weakly grounded (mouse trace), and 975k ungrounded.

3.2 Results on VAW. The top box reports results of methods trained only on VAW, while the bottom box shows our newly introduced baseline RN50-Context and TAP on VAW after pre-training on LSA. LSA-pretrained and VAW-supervised denote whether a model is trained with attribute labels from LSA and VAW respectively.

3.3 Evaluation of LSA common and LSA common→rare.

3.4 Ablation on class embeddings.

3.5 Ablation on training portion.

3.6 Results on HICO image classification.
4.1 Our framework achieves the best and second-best results on the Flickr30K dataset with two different encoders. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM for each section.

4.2 Our method yields promising results on the MS-COCO dataset. Our performance is comparable in all test schemas with previous works, especially with the simple Bi-GRU architecture. † denotes methods that use ensembling of multiple models. Bold and underline highlight the best and second-best performance.

4.3 Ablation studies for the number of layers in GAT, the graph structure (whether encoding the scene graph jointly or in two separate steps is beneficial), and the impact of losses. Bold and underline highlight the best and second-best performance.

4.4 Our framework achieves the best results on the Flickr30K dataset when initializing the word embeddings from scratch for the Bi-GRU semantic encoder. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM.

4.5 Our framework achieves the best results on the MS-COCO dataset when initializing the word embeddings from scratch for the Bi-GRU semantic encoder. Without the CA ("cross-attention"), our method still has competitive results compared to other baselines. † denotes methods that use ensembling of multiple models, and we highlight the highest and second-highest RSUM.

4.6 Ablation studies comparing fine-tuning the whole BERT model versus using P-Tuning v2 to encode the short phrases of semantic concepts. The models are evaluated on the MS-COCO 1K Test set. Gray denotes our best model.

4.7 Reranking results on MS-COCO 5K after ensembling with the image-entity score.

5.1 Experimental results on the VG-50 dataset.

5.2 Open-Vocabulary Relationship Classification results on VG-50.

5.3 Ablation studies for the box incorporation design, the object grounding loss, and the object disjoint loss.

List of Figures

1.1 Examples of applications of visual attributes and visual relations recognition. (Top left) Semantic image search: search for images that correctly depict the semantic information from the input text, e.g., red bowl with food should return images of a bowl having color red and containing food. (Top right) Image generation: generate an image that is faithfully correct w.r.t. the input text, e.g., the image truly displays a clock with color yellow and a bench with color red as required by the text. (Bottom left) Object description: describe object appearance (e.g., for the visually impaired), e.g., the text to the right fully describes all appearance characteristics of the object to the left. (Bottom right) Language-based object selection: select the object in the image that satisfies the input text condition, e.g., the image displays a bounding box enclosing the upside down chair.
1.2 Image search results for the input query red bowl with food on a) Adobe Stock [1], where there are many red bowls without food, and on b) Pinterest [2], where there are many white bowls with red food (results obtained on Dec 25, 2023).

1.3 State-of-the-art large language model Google BARD [3] makes mistakes in visual relation recognition (result obtained on Dec 25, 2023): it says that there is a bird eating the fruit in the image, while in fact the image only displays an assortment of food that resembles the shape of a bird.

1.4 The first part of the thesis (Chapter 2 and Chapter 3) focuses on recognizing attributes of objects. On the left in (a), we illustrate that these attributes can be adjectives that describe physical properties (image source: [4]). Here we denote common primitive attributes in blue, and rare, not commonly used attributes in green. On the right in (b), we show attributes as verbs and verb-object pairs that describe actions of the object and its interactions with others in the scene.

1.5 The second part of the thesis focuses on visual relations between objects. a) In Chapter 4, we study how to compose objects with their attributes and relations in a scene graph to improve image-text alignment. Here, the figure displays a joint embedding space for the image and the scene graph. b) In Chapter 5, we formulate a new subject-centric approach for predicting all visual relationships with respect to a particular subject instance, e.g., the figure illustrates all relationships w.r.t. the woman in the image.

2.1 Example annotations in the VAW dataset. Each possible attribute-object category pair is annotated with at least 50 examples consisting of explicit positive and negative labels. Here, we illustrate positive and negative attribute annotations for the objects table, plate, flower, and cookie in the image.

2.2 Examples of images and their annotations from the VAW dataset. Object names, positive attributes, explicitly labeled negative attributes, and negative labels from our negative label expansion are shown in corresponding colors for each example.

2.3 Distribution of positive and negative annotations for attributes in different categories. We show the top-15 attributes with the most positive annotations in each category, sorted in descending order.

2.4 Strong baseline attribute prediction model. The ResNet feature map extracted from the input image is modulated with the object embedding, which allows the model to learn useful attribute-object relationships (e.g., ball is round) and also to suppress infeasible attribute-object pairs (e.g., talking table). The image-object combined feature map X is used to infer the object region G and multiple attention maps {A(m)}, which are subsequently used to aggregate features for classification. Here, Zlow and Zrel respectively denote low-level and image-object features aggregated inside the estimated object region, and Zatt corresponds to image-object features pooled from the multiple attention maps. The classifier is trained with BCE loss on the explicit positive and negative labels. For the missing (unknown) labels, we find that treating them as "soft negatives" by assigning them very small weights in the BCE loss also helps improve results.

2.5 Examples of predictions from SB+SCoNE.
We show the object name and its ground-truth positive attribute labels above the image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

2.6 More examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above the image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

2.7 Image search results. We show the top retrieved images of SB+SCoNE when searching for some color attributes.

2.8 Image search results. We show the top retrieved images of SB+SCoNE when searching for images that exhibit multiple color attributes.

2.9 Image search results. We show the top retrieved images of SB+SCoNE when searching for some material attributes.

2.10 Image search results. We show the top retrieved images of SB+SCoNE when searching for some size attributes.

3.1 Attributes in LSA cover a wide range of words/phrases that describe an object, including (a) adjectives to describe color, shape, state, etc., (b) verbs to describe actions, (c) verb-object pairs to describe interactions, and (d) preposition-object pairs to describe location.

3.2 Extracted attribute examples. (Left) Examples of objects and their attributes parsed from grounded image-caption pairs from an image in the Flickr30K dataset. (Right) Examples of objects and their attributes parsed from ungrounded image-caption pairs from an image in the COCO dataset.

3.3 Model architecture of TAP. A sequence of ResNet encodings forms the input visual tokens. This is processed jointly with the query token, which consists of object query tokens (red), their object index embedding (blue), a sequence index embedding (orange), and a bounding box embedding (green). The contextualized representation zi of the [CLS] token of all objects is decoded into attributes. In addition, an object grounding loss is used to train object localization (shown here for dog).

3.4 Architecture of RN50-Context, which takes the whole image as input and uses RoIAlign to extract features that correspond to the object region. The features from RoIAlign are then multiplied element-wise with the object word embedding, then forwarded through the classification layer at the end to obtain logits for attribute probabilities.

3.5 Examples of TAP predictions on the VAW dataset. For each image, a list of attributes sorted in descending order of probability is returned for the specified objects.

3.6 Attribute predictions of OpenTAP. Every attribute list is sorted in descending order of the model's confidence. Both seen attributes from the closed branch and unseen attributes from the open-vocabulary branch are shown. We display the attention mask of TAP for objects without a bounding box. Strikethrough represents wrong predictions as judged by us.
3.7 Image retrieval results of OpenTAP for the unseen classes excited, fishing, and salmon-colored.

3.8 Top-3 attribute prediction comparison between OpenTAP and CLIP. Red denotes incorrect predictions as judged subjectively by us.

4.1 Illustration of CORA. CORA has a dual-encoder architecture, consisting of one encoder that embeds the input image (the upper branch) and one encoder that embeds the text caption scene graph (the lower branch) into a joint embedding space.

4.2 Overview of CORA. a) CORA consists of (1) an image encoder that detects and extracts the salient regions' features from the input image, then aggregates them into a single image embedding through the GPO pooling operator, and (2) a text encoder that first parses the input text into a scene graph where all semantic information is readily organized; two graph attention networks, the Object-Attribute GAT and the Object-Object GAT, are then used to encode this graph into the same joint space with the image. The red arrow denotes the edge of the active role, while the yellow arrow is for the passive role in the relation (refer to Section 4.2.3.2). b) The semantic concept encoder that uses GRU or BERT to encode each semantic concept in the graph corresponding to the object and attribute nodes and the relation edges.

4.3 Inference time comparison. We compare the text-to-image retrieval inference time of our method CORA against two SOTA cross-attention methods, SGRAF and NAAF (lower is better). The inference time is calculated with different numbers of images in the database, ranging from 10 to 10^5 images. CORA (blue line), with its dual-encoder architecture, is much faster and more scalable than cross-attention approaches.

4.4 Successful image-to-text and image-to-entity retrieval on MS-COCO. In image-to-text retrieval, green denotes matching text according to the ground truth of MS-COCO, while red denotes incorrect matching. In image-to-entity retrieval, green and red denote correct and incorrect matching, respectively, as judged subjectively by us.

4.5 Failure cases of image-to-text and image-to-entity retrieval on Flickr30K. In image-to-text retrieval, green denotes matching text according to the ground truth of Flickr30K, while red denotes incorrect matching. In image-to-entity retrieval, green and red denote correct and incorrect matching, respectively, as judged subjectively by us.

4.6 Text-to-image retrieval on MS-COCO. For every text, we show the top-5 retrieved images on MS-COCO. The image with the green tick mark is the correct match according to the ground truth in the dataset.

5.1 Illustration of a failed example by SGG-NLS. The man (denoted in the red box) is predicted to sit on two benches at the same time.

5.2 Overview of the SCRP model. The input is an image, its text caption, and optionally the bounding boxes of the objects in the caption.
A language parser is used to extract the subject and all objects that share a relationship with the subject. An external object detector is also used to retrieve object instances that are not mentioned in the caption, which are treated as having negative relationships with the subject. The model is based on the architecture of ALBEF. It contains a vision transformer for encoding the image patches, a text encoder to encode the text tokens, and a multimodal Transformer to contextualize the text tokens with information from the visual modality. We also employ a box injection module that adds information about the object position into the corresponding object tokens. Finally, the [MASK] tokens are decoded into the relationship predicates.

5.3 Qualitative results of SCRP (in column (a)) versus the ground-truth labels (in column (b)) on the VG-50 dataset. Our model tends to make more (and mostly correct) predictions than the number of available ground-truth relationship labels.

5.4 Predictions of SCRP for rare relationships. Our model is also able to predict rare relationship classes such as blow drying, lifting up, milking, touching, and reaching for, thanks to being trained on a large amount of image-text data.

5.5 Failure cases of SCRP. The model still struggles with hard cases, such as when there are overlapping boxes (top right and bottom left images).

Chapter 1: Introduction

1.1 Overview

In recent years, computer vision has made significant progress in various visual recognition tasks thanks to the advances of deep learning and the availability of large annotated datasets. However, most existing computer vision models have primarily focused on detecting and naming objects, while neglecting other visual information that can be crucial in accurately portraying an object's appearance. For example, these models can visually identify an object by assigning it a semantic label (e.g., car, dog) while being unable to express variations in the object's appearance with visual attributes such as red car or visual relations as in dog playing with stick. Given that objects can be dynamic, occur in various states, possess diverse attributes, and interact closely with other objects in the scene, the intra-category visual appearance of objects can vary significantly. Therefore, this thesis aims to explore and develop computer vision models that go beyond traditional object classification systems and instead focus on understanding and perceiving the visual attributes and relations of objects.

The ability to recognize attributes and relations of objects facilitates humans' seamless interaction with the world. This is evidenced in our daily experiences, such as describing an object to someone using its appearance characteristics, deducing that someone is reading a book because it is open, or exercising caution around a coffee mug placed beside a laptop. The development of automated systems equipped with this nuanced understanding of visual attributes and relations holds potential for improving a wide range of tasks, including semantic image search, language-based object selection, object description, text-to-image generation, and vision-language-based robot assistance. A few of these applications are demonstrated in Figure 1.1.
Figure 1.1: Examples of applications of visual attributes and visual relations recognition. (Top left) Semantic image search: search for images that correctly depict the semantic information from the input text, e.g., red bowl with food should return images of a bowl having color red and containing food. (Top right) Image generation: generate an image that is faithfully correct w.r.t. the input text, e.g., the image truly displays a clock with color yellow and a bench with color red as required by the text. (Bottom left) Object description: describe object appearance (e.g., for the visually impaired), e.g., the text to the right fully describes all appearance characteristics of the object to the left. (Bottom right) Language-based object selection: select the object in the image that satisfies the input text condition, e.g., the image displays a bounding box enclosing the upside down chair.

Despite the importance of this problem, contemporary computer vision systems still struggle to recognize fine-grained concepts of attributes and relations. For example, as depicted in Figure 1.2, image retrieval results on the Adobe Stock [1] and Pinterest [2] platforms for red bowl with food often yield imprecise results, such as images of a red bowl without food, or a non-red bowl with red food. Similarly, Figure 1.3 illustrates another example in which the state-of-the-art large language model Google BARD [3] incorrectly recognizes that there is a bird eating food in the image. Overall, learning a fine-grained semantic understanding of objects can be categorized into two parts: 1) visual attributes, and 2) visual relations.

Figure 1.2: Image search results for the input query red bowl with food on a) Adobe Stock [1], where there are many red bowls without food, and on b) Pinterest [2], where there are many white bowls with red food (results obtained on Dec 25, 2023).

Figure 1.3: State-of-the-art large language model Google BARD [3] makes mistakes in visual relation recognition (result obtained on Dec 25, 2023): it says that there is a bird eating the fruit in the image, while in fact the image only displays an assortment of food that resembles the shape of a bird.

1.2 Visual Attributes of Objects

When people recognize objects in visual scenes, they identify each object with a categorical label while also simultaneously perceiving and describing it with visual attributes. For instance, a table could be described as large in size, made of wood, having a round shape, being of beige color, and empty. These visual attributes constitute a large portion of the information about the object's appearance and allow people to easily visualize objects in their mind, describe them to others, and even recognize them. Understanding attributes offers various advantages for downstream computer vision problems, such as facilitating more accurate object selection in images and image retrieval by adding attributes as part of the input text query. Visual attributes allow us to ground visual objects from images to attribute concepts in language, resulting in a more comprehensive semantic understanding of objects in visual scenes.

Despite the great importance of attributes, existing work is mostly limited to attributes in specific domains (e.g., clothing [5, 6], human [7], emotion [8]), consists of a very small number of attribute-object pairs [9], or is rife with label noise [10]. In addition, designing an approach that can effectively predict visual attributes for objects in the wild is a difficult task due to the following challenges:
• Label sparsity: attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. However, exhaustively annotating all attribute labels for every object image is a time-consuming and expensive process. Therefore, objects in visual attribute datasets are often sparsely labeled, with a large number of missing labels.

• Class imbalance and long-tailed attributes: visual attributes follow a long-tailed distribution. For example, salient attributes such as color are used much more commonly in daily human conversation than others.

• Diverse localization of visual cues: attributes are diverse in nature; some describe physical properties of objects (such as color, material, shape) whose visual cues span the object surface, while other attributes require attention to particular image parts or reasoning about the surrounding context (e.g., bald-headed, wearing hat).

Figure 1.4: The first part of the thesis (Chapter 2 and Chapter 3) focuses on recognizing attributes of objects. On the left in (a), we illustrate that these attributes can be adjectives that describe physical properties (image source: [4]). Here we denote common primitive attributes in blue, and rare, not commonly used attributes in green. On the right in (b), we show attributes as verbs and verb-object pairs that describe actions of the object and its interactions with others in the scene.

This thesis studies attribute prediction in Chapter 2 and Chapter 3 (illustrated in Figure 1.4). In Chapter 2, we introduce a large-scale visual attribute dataset of 620 attributes for objects in the wild to address the limitations of current attribute prediction benchmarks, and propose a computational model and learning algorithm to overcome the above challenges.

As existing work relies heavily on the availability of high-quality human-annotated attribute labels, it is challenging to scale up the predictive capability of learning models to tackle the vast space of possible attribute concepts in real-world settings. Instead of collecting dense attribute annotations, which is a costly process, we observe that attribute information is extremely abundant in existing image-caption datasets but has not been utilized for attribute learning by prior work. To this end, in Chapter 3, we propose a novel neural network model with a training scheme that allows for learning attribute prediction from multiple image-text datasets with mixed supervision. The proposed model can learn from strongly supervised data, where attribute labels are annotated and grounded to the correct object bounding box, as well as weakly supervised data, where only the image-caption pairs are available without any bounding box correspondence. Learning from image-text pairs allows us to scale the attribute prediction model up to thousands of attribute classes.
In addition, by taking advantage of text embeddings from a pre-trained large vision-language model, our model is also equipped with open-vocabulary prediction ability, allowing it to recognize even novel attribute classes represented by arbitrary text phrases.

1.3 Visual Relations between Objects

While perceiving visual attributes allows for a more complete semantic understanding of objects, it is the visual relationships between objects that connect them together and enable the construction of a high-level semantic meaning for the entire scene. One primary goal in computer vision is to achieve such a comprehensive understanding of visual scenes by developing models that can recognize objects, describe their attributes, and explain their interactions. This advancement allows for important applications such as image retrieval given a text description. This research aims to enhance computer vision models with the ability to reason about objects and their interconnectivity, leading to improved downstream computer vision tasks.

Figure 1.5: The second part of the thesis focuses on visual relations between objects. a) In Chapter 4, we study how to compose objects with their attributes and relations in a scene graph to improve image-text alignment. Here, the figure displays a joint embedding space for the image and the scene graph. b) In Chapter 5, we formulate a new subject-centric approach for predicting all visual relationships with respect to a particular subject instance, e.g., the figure illustrates all relationships w.r.t. the woman in the image.

The scene graph was introduced by Johnson et al. [11] as a detailed representation of a visual scene. A scene graph encapsulates all semantic information about the scene, including object identities, their visual attributes, and their relationships. Since its introduction, numerous studies have utilized the scene graph as a scene representation, in addition to the usual text description, to tackle visual recognition problems. However, with the emergence of text sequence models pre-trained on large datasets and powerful cross-attention networks in vision-language learning, the use of scene graphs has declined. Nevertheless, recent findings such as those by Chefer et al. [12] reveal that current models, including advanced ones like Stable Diffusion (which uses the Transformer-based text encoder CLIP [13]), still exhibit incorrect object-attribute binding (i.e., pairing an attribute with the wrong object in the text description). This suggests that scene graphs, with their inherent design of accurately pairing objects with their attributes and relations, may experience a resurgence as a complementary method to describe visual scenes. As for the powerful cross-attention networks for joint vision-language modeling, their intensive computation cost renders them infeasible for large-scale image retrieval. Therefore, it is more desirable for a real-world image retrieval system to utilize a dual-encoder framework that can perform retrieval efficiently via similarity computation in the embedding space.
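To make the efficiency contrast concrete, the following is a minimal PyTorch sketch of the two retrieval regimes. The modules `image_encoder`, `text_encoder`, and `joint_scorer` are hypothetical stand-ins, not the actual CORA or cross-attention models discussed in Chapter 4.

```python
import torch
import torch.nn.functional as F

def dual_encoder_retrieval(image_encoder, text_encoder, images, query, k=10):
    """Dual encoder: image embeddings can be computed once, offline; each
    query needs a single text forward pass plus one matrix multiplication."""
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(images), dim=-1)   # (N, d), precomputable
        txt_emb = F.normalize(text_encoder(query), dim=-1)     # (1, d)
    scores = txt_emb @ img_emb.T                               # (1, N) cosine similarities
    return scores.topk(k, dim=-1).indices

def cross_attention_retrieval(joint_scorer, images, query, k=10):
    """Cross attention: every (image, query) pair requires a full joint
    forward pass, so the cost per query grows with the database size."""
    with torch.no_grad():
        scores = torch.stack([joint_scorer(img.unsqueeze(0), query).squeeze()
                              for img in images])              # N forward passes per query
    return scores.topk(k).indices
```

The dual-encoder path is what makes large-scale retrieval practical: the database embeddings are indexed once, and each new query reduces to a nearest-neighbor search in the joint space.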
In Chapter 4, we tackle the image-text retrieval problem by showing that scene graphs can enable a dual-encoder framework that is more efficient than, yet as powerful as, cross-attention approaches.

Previous studies on visual relationship prediction have predominantly followed a pair-centric approach, where relationships between every pair of objects are independently classified. These works also study relation prediction in a closed setting, where models are trained on small datasets with a fixed object and relation vocabulary. Furthermore, the models proposed in these works have to be trained on densely labeled scene graph datasets such as Visual Genome [10], which are expensive to collect. In Chapter 5, we propose a novel subject-centric method where multiple relationships are predicted simultaneously conditioned on one subject. This methodology offers the distinct advantage that the prediction of one relation can influence the prediction of another, which helps prevent undesirable scenarios such as a person being predicted to sit on two different benches at the same time. In addition, we extend beyond the limited scope of previous work that only studies relationship recognition with a closed vocabulary, and propose an approach that can learn from large public image-text datasets with an open vocabulary to recognize arbitrary relationship classes defined at test time. Our approach can learn from image-text datasets with different levels of grounding supervision, i.e., from scene graph datasets with object bounding boxes to image-text datasets with no box localization information.

Chapter 2: Closed-Set Attribute Prediction

This chapter is based on the publication Learning to Predict Visual Attributes in the Wild, Pham et al., In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13018-13028, 2021.

While several existing works address attribute prediction, they are limited in many ways. Objects in a visual scene can be described using a vast number of attributes, many of which can exist independently of each other. Due to the variety of possible object and attribute combinations, it is a daunting task to curate a large-scale visual attribute prediction dataset. Existing works have largely ignored large-scale visual attribute prediction in the wild and have instead focused only on domain-specific attributes [5, 7], datasets consisting of a very small number of attribute-object pairs [9], or datasets that are rife with label noise, ambiguity, and label sparsity despite having a large number of images (Visual Genome [10]). Similarly, while attributes can form an important part of related tasks such as VQA, captioning, and referring expressions, these works do not address the unique challenges of attribute prediction. Existing work also fails to address the issue of partial labels, where only a small subset of all possible attributes is annotated. Partial labels and the lack of explicit negative labels make it challenging to train or evaluate models for large-scale attribute prediction. To address these problems, we propose a new large-scale visual attribute prediction dataset for images in the wild that includes both positive and negative annotations.
Figure 2.1: Example annotations in the VAW dataset. Each possible attribute-object category pair is annotated with at least 50 examples consisting of explicit positive and negative labels. Here, we illustrate positive and negative attribute annotations for the objects table, plate, flower, and cookie in the image.

Our dataset, called Visual Attributes in the Wild (VAW), consists of over 927K explicitly labeled positive and negative attribute annotations applied to over 260K object instances (with 620 unique attributes and 2,260 unique object phrases). Due to the number of possible combinations, it is prohibitively expensive to collect exhaustive attribute annotations for each instance. However, we ensure that every attribute-object phrase pair in the dataset has a minimum of 50 positive and negative annotations. With a density of 3.56 annotations per instance, our dataset is 4.9 times denser than Visual Genome while also providing negative labels. Additionally, annotations in VAW are visually grounded, with segmentation masks available for 92% of the instances. Formally, our VAW dataset poses attribute prediction as a long-tailed, partially-labeled, multi-label classification problem. Examples of attributes in VAW are illustrated in Figure 2.1.

We explore various state-of-the-art methods in attribute prediction and multi-label learning and show that the VAW dataset poses significant challenges to existing work. To this end, we first propose a strong baseline model that considers both low- and high-level features to address the heterogeneity in features required for different classes of attributes (e.g., color vs. action), and is modeled with multi-attention and an ability to localize the region of the object of interest by using partially available segmentation masks. We also propose a series of techniques that are uniquely suited to our problem. Firstly, we explore existing works that address label imbalance between positive and negative labels. Next, we describe a simple yet powerful scheme that exploits linguistic knowledge to expand the number of negative labels. Finally, we propose a supervised contrastive learning approach that allows our model to learn more attribute-discriminative features. Through extensive ablations, we show that most of our proposed techniques are model-agnostic, producing improvements not only on our baseline but also on other methods. Our final model is called Supervised Contrastive learning with Negative-label Expansion (SCoNE), which surpasses state-of-the-art models by 3.5 mAP and 5.7 overall F1 points.

Our work makes the following contributions: 1) We create a new large-scale dataset for visual attributes in the wild (VAW) that addresses many shortcomings in the existing literature, and we demonstrate that VAW poses considerable difficulty to existing algorithms. 2) We design a strong baseline model for attribute prediction using existing visual attention techniques. We further extend this baseline to our novel attribute learning paradigm called Supervised Contrastive learning with Negative-label Expansion (SCoNE), which considerably advances the state of the art. 3) Through extensive experimentation, we show the efficacy of both our proposed model and our proposed techniques.
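Before turning to related work, we make the partially-labeled, multi-label formulation concrete with a minimal PyTorch sketch of a binary cross-entropy objective over labels encoded as positive (1), explicit negative (0), and missing (-1). The soft-negative weighting mirrors the treatment mentioned later in the Figure 2.4 caption, but the exact weight value and SCoNE's class-imbalance handling (Section 2.3) differ; this is an illustration, not the implementation.

```python
import torch
import torch.nn.functional as F

def partial_bce_loss(logits, labels, soft_negative_weight=0.01):
    """logits, labels: (batch, C). Labels use 1 = positive, 0 = explicit
    negative, -1 = missing. Missing entries are treated as weak ("soft")
    negatives by down-weighting their contribution to the BCE loss."""
    targets = labels.clamp(min=0).float()            # map missing (-1) to a 0 target
    weights = torch.ones_like(targets)
    weights[labels == -1] = soft_negative_weight     # tiny weight on unknown labels
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

# Example label vector for one instance with 3 attributes:
# [positive, explicit negative, missing]
# labels = torch.tensor([[1, 0, -1]]); logits = model(image, object_name)
```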
2.1 Related Work

Attribute learning. Some of the earliest work related to attribute learning stems from a desire to learn to describe objects rather than predict their identities [14-17]. Since then, extensive work has sought to explore several aspects of object attributes, including attribute-based zero-shot object classification [18-20], relative attribute comparison [21-23], and image search [24, 25]. While research in compositional zero-shot learning [26-29] also tackles object attributes, it targets transformations of object states, treats each instance as having only one state, and focuses on predicting unseen compositions rather than predicting a complete set of attributes for each object instance. Several works have focused on attribute learning in specific domains such as animals, scenes, clothing, pedestrians, and human facial and emotion attributes [5-8, 30, 31]. In contrast, we seek to explore attribute prediction for an unconstrained set of objects.

Attribute prediction in the wild. Only a limited number of works have sought to explore general attribute prediction. COCO Attributes [9] is an attempt to develop an in-the-wild attribute prediction dataset; however, it is very limited in scope, covering only 29 object categories. Similarly, a portion of the Visual Genome (VG) [10] dataset consists of attribute annotations. However, attributes in VG are not a central focus of that work, and therefore they are very sparsely labeled, noisy, and lack negative labels, making VG unsuitable as a standalone attribute prediction benchmark. Despite this, attribute annotations from VG are often used to train attribute-aware object detectors for downstream vision-language tasks [32-34]. By introducing the VAW dataset, the research community can use its dense attribute annotations in conjunction with VG and our attribute learning techniques to train better attribute prediction models. Several recent works have also sought to take advantage of the massive amount of data in VG to curate datasets for specific challenges [35, 36]. In a similar vein, we also start by leveraging existing sources of clean annotations to develop our VAW dataset.

Multi-label learning. VAW can be cast as a multi-label classification problem, which has been extensively studied in the research community [37-40]. Multi-label learning involving missing labels poses a greater challenge, but is also extensively studied [41-44]. In many cases, missing labels are assumed to be negative examples [45-48], which is unsuitable for attribute prediction, since most attributes are not mutually exclusive. Others attempt to predict missing labels by training expert models [49], which is also infeasible for a large-scale problem like ours.

Learning from imbalanced data. Data imbalance naturally arises in datasets with large label sets. As expected, label imbalance exists in our VAW dataset; therefore, techniques designed to learn from imbalanced data are also related to our explorations. These works can be divided into two main approaches: cost-sensitive learning [50-52] and resampling [53-57]. We utilize both of these techniques in our final model.

Visual attention. Attention is a highly effective technique in image classification, captioning, VQA, and domain-specific attribute prediction [40, 41, 58-62]. In our VAW dataset, most of the objects are annotated with their segmentation mask, which allows us to guide the attention map to ignore irrelevant image regions.
We also use additional attention maps to allow our model to properly explore the surrounding context of the object.

Contrastive learning. Contrastive learning has recently gained a lot of traction as an effective self-supervised learning technique [63-66]. While originally intended for the self-supervised setting, recent works have extended contrastive learning to the supervised setting [67]. Motivated by these works, we propose an extension of the supervised contrastive loss to the multi-label setting required for VAW. To the best of our knowledge, ours is the first attempt to apply a contrastive loss to multi-label learning.

2.2 Visual Attributes in the Wild Dataset

In this section, we describe how we collect attribute annotations and present statistics of the final VAW dataset. In general, we aim to overcome the limitations of VG for the attribute prediction task, which include noisy labels, label sparsity, and a lack of negative labels, to create a dataset applicable for training and testing attribute classification models.

2.2.1 Data Collection

VAW is created based on the VGPhraseCut [36] and GQA [35] datasets, both of which leverage and refine annotations from Visual Genome [10]. VGPhraseCut is a referring expression dataset that provides high-quality attribute labels and per-instance segmentation masks, while GQA is a VQA dataset that presents cleaner scene graph annotations.

Step 1: Extraction from VGPhraseCut and GQA

Our goal is to build a dataset that allows us to predict the maximal number of attributes commonly used to describe objects in the wild. From VGPhraseCut, we select attributes that appear within more than 15 referring phrases. After manually cleaning ambiguous and hard-to-recognize attributes, we obtain a set of 620 unique attributes, which are used throughout the rest of the process. Next, we extract more instances from GQA that are labeled with these attributes. We further take advantage of the referring expressions from VGPhraseCut to collect a reliable negative label set: given an image, for instances that are not selected by an attribute referring phrase, we assign that attribute as a negative label for the instance. This step allows us to collect 220,049 positive and 21,799 negative labels.

Step 2: Expand attribute-object coverage

In this step, we seek to collect additional annotations for every feasible attribute-object pair that may be lacking annotations. We define feasible pairs as those with at least one positive example in our dataset. We ensure that every feasible pair has at least 50 (positive or negative) annotations. To keep the annotation cost in check, we do not annotate pairs that already have 50 or more annotations. This expansion enriches our dataset with more positives and negatives for every attribute across different objects, allowing for better training and evaluation of classification models. This step adds 156,690 positive and 455,151 negative annotations.

Step 3: Expand long-tailed attribute set

In this step, we aim to collect additional annotations for the long-tailed attributes. Long-tailed attributes are associated with very few object categories, either because the attribute is not frequently used by humans or because it applies only to a small set of objects. Hence, given a long-tailed attribute and a known object that it applies to, we first expand its set of applicable object categories using the WordNet [68] ontology.
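As a hypothetical illustration of this kind of ontology-based expansion, the sketch below uses NLTK's WordNet interface to collect co-hyponyms of a seed category; the actual candidate selection and manual filtering used for VAW may differ.

```python
# Requires: nltk, plus a one-time nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_object_categories(seed_object="child"):
    """Collect co-hyponyms of the seed category, i.e., categories that share
    a direct hypernym (parent concept) with it in WordNet."""
    candidates = set()
    for synset in wn.synsets(seed_object, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():          # categories under the same parent
                candidates.update(lemma.name().replace("_", " ")
                                  for lemma in sibling.lemmas())
    candidates.discard(seed_object)
    return sorted(candidates)

# expand_object_categories("child") surfaces related person-like categories,
# which then become candidates for human annotation of the long-tailed attribute.
```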
After we find candidate object categories for a given long-tailed attribute, we ask humans to annotate randomly sampled images from these candidates with either a positive or negative label for the given attribute. This step adds an additional 16,239 positive and 57,751 negative annotations pertaining to the long-tailed attributes.

Table 2.1: Statistics of VAW compared with other in-the-wild and domain-specific attribute datasets. *person (resp. *clothes) category may represent multiple categories including {boy, girl, man, woman, etc.} (resp. {shirt, pants, top, etc.}). While Visual Genome is the largest among these in terms of number of attribute annotations, it is sparsely labeled. Other datasets are either fully annotated for domain-specific attributes or more densely labeled but cover few object categories.

Dataset | # attributes | # instances | # object categories | # attribute anno. per instance | Negative labels | Segmentation masks | Domain
VAW | 620 | 260,895 | 2,260 | 3.56 | Yes | Yes | In-the-wild
Visual Genome [10] | 68,111 | 3,843,636 | 33,877 | 0.73 | No | No | In-the-wild
COCO Attributes [9] | 196 | 180,000 | 29 | ≥ 20 | Yes | No | In-the-wild
EMOTIC [8] | 26 | 23,788 | 1 (person*) | 26 | Yes | No | Emotions
WIDER [7] | 14 | 57,524 | 1 (person*) | 14 | Yes | No | Pedestrian
iMaterialist [5] | 228 | 1,012,947 | 1 (clothes*) | 16.17 | Yes | Yes | Fashion

2.2.2 Statistics

Our final dataset consists of 620 attributes describing 260,895 instances from 72,274 images. Our attribute set is diverse across different categories, including color, material, shape, size, texture, and action. On the annotated instances, our dataset contains 392,978 positive and 534,701 negative attribute labels. The instances from VGPhraseCut (92% of the dataset) are provided with segmentation masks, which can be useful in attribute prediction. We split the dataset into 216,790 instances (58,565 images) for training, 12,286 instances (3,317 images) for validation, and 31,819 instances (10,392 images) for testing. We split the dataset such that the test set has a higher annotation density per object, which allows for more thorough testing. In particular, our test set has an average of 7.03 annotations per instance compared to 3.02 in the training set.

In Table 2.1, we compare the statistics of the VAW dataset with other in-the-wild and domain-specific visual attribute datasets. Compared to existing work, VAW fills an important gap in the literature by providing a domain-agnostic, in-the-wild visual attribute prediction dataset with denser annotations, explicit negative labels, segmentation masks, and a large number of attribute and object categories.

Figure 2.2: Examples of images and their annotations from the VAW dataset. Object names, positive attributes, explicitly labeled negative attributes, and negative labels from our negative label expansion are shown in corresponding colors for each example.

In Figure 2.2, we show examples of images and their attribute annotations from the VAW dataset. The images show both positive and negative annotations from our dataset as well as a subset of the result of our negative label expansion scheme (explained in Section 2.3.3), which is a rule-based system derived on the premise of mutual exclusivity of certain attributes. For example, if an object is annotated with the positive attribute empty, the attribute filled can be auto-annotated as a negative attribute for the same object; a toy sketch of this rule follows.
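Below is a toy sketch of the rule-based expansion detailed in Section 2.3.3: within one attribute type, a labeled positive yields negatives for attributes that are exclusive with it or do not overlap with it. The tiny relation tables are illustrative examples only, not the full ontology built from [73], WordNet, and ConceptNet.

```python
# Illustrative (hypothetical) relation tables; the real ones are mined from an
# attribute ontology, WordNet, ConceptNet edges, and annotator co-occurrence.
EXCLUSIVE = {"empty": {"filled", "full"}, "clean": {"dirty"}, "wet": {"dry"}}
OVERLAP = {"wooden": {"wicker"}, "white": {"beige"}}

def expand_negatives(positive_attrs: set, attr_type_members: set) -> set:
    """Derive extra negative labels for an object within one attribute type."""
    negatives = set()
    for a in positive_attrs & attr_type_members:
        for a2 in attr_type_members - {a}:
            overlaps = a2 in OVERLAP.get(a, set()) or a in OVERLAP.get(a2, set())
            exclusive = a2 in EXCLUSIVE.get(a, set()) or a in EXCLUSIVE.get(a2, set())
            if exclusive or not overlaps:   # {a' : not overlap(a, a') or exclusive(a, a')}
                negatives.add(a2)
    return negatives - positive_attrs

# e.g., an object labeled "empty" within the state attributes
# {"empty", "filled", "full"} gains "filled" and "full" as negatives.
```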
In Figure 2.3, we show the distribution of the top-15 attributes in various attribute categories, arranged in descending order according to the number of available positive annotations. The diagram clearly shows the long-tailed nature of our VAW dataset, with some categories showing highly skewed distributions (color, material) and others having a more evenly balanced distribution (texture, others). For example, in the material category, the annotations for the top-2 attributes (metal and wooden) constitute over 30.91% of the total number of annotations (41.4% of positives and 23.5% of negatives). Reassuringly, our strong baseline as well as the SCoNE model work almost equally well for more balanced categories (e.g., texture) as well as for a skewed category (e.g., material).

Figure 2.3: Distribution of positive and negative annotations for attributes in different categories (color, material, shape, size, texture, action, and others). We show the top-15 attributes with the most positive annotations in each category, sorted in descending order.

2.3 Approach

In this section, we describe the components of our strong baseline model along with the Supervised Contrastive learning with Negative-label Expansion (SCoNE) algorithm that helps our model learn more attribute-discriminative features. A depiction of our strong baseline model is shown in Figure 2.4.

Figure 2.4: Strong baseline attribute prediction model. The ResNet feature map extracted from the input image is modulated with the object embedding, which allows the model to learn useful attribute-object relationships (e.g., ball is round) and to suppress infeasible attribute-object pairs (e.g., talking table). The image-object combined feature map X is used to infer the object region G and multiple attention maps {A^{(m)}}, which are subsequently used to aggregate features for classification. Here, Z_{low} and Z_{rel} respectively denote low-level and image-object features aggregated inside the estimated object region, and Z_{att} corresponds to image-object features pooled from the multiple attention maps. The classifier is trained with a BCE loss on the explicit positive and negative labels. For the missing (unknown) labels, we find that treating them as "soft negatives" by assigning them very small weights in the BCE loss also helps improve results.

Problem formulation. Let D = \{I_i, g_i, o_i; Y_i\}_{i=1}^{N} be a dataset of N training samples, where I_i is an object instance image (cropped using its bounding box), g_i is its segmentation mask, o_i is the category phrase of the object for which we want to predict attributes, and Y_i = [y_{i,1}, \dots, y_{i,C}] is its C-class label vector with y_{i,c} ∈ {1, 0, −1} denoting whether attribute c is positive, negative, or missing, respectively. Our goal is to train a multi-label classifier that, given an input image and the object name, can output a confidence score for all C attribute labels. A minimal sketch of this sample format is given below.
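The following short sketch illustrates one possible way to represent a training sample and its C-dimensional label vector; the field names and indices are hypothetical and only show the {1, 0, −1} encoding, not the dataset's actual interface.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class VAWSample:
    image_crop: torch.Tensor          # object image cropped by its bounding box, 3 x H x W
    mask: Optional[torch.Tensor]      # binary segmentation mask, H x W (None when unavailable)
    object_name: str                  # category phrase o_i, e.g., "chair"
    labels: torch.Tensor              # C-dim vector Y_i with entries in {1, 0, -1}

C = 620
labels = torch.full((C,), -1)         # every attribute starts as missing/unknown
labels[[3, 57]] = 1                   # explicitly annotated positives (indices are made up)
labels[10] = 0                        # an explicitly annotated negative
```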
2.3.1 Model Architecture

Image feature representation. Given an image I of an object o, let f_{img}(I) ∈ R^{H×W×D} be the D-dimensional image feature map with spatial size H × W extracted using any CNN backbone architecture. In our model, we use the output of the penultimate layer of ResNet-50 [69].

Image-object feature composition. Prior models for attribute prediction mostly tackle domain-specific settings or a limited number of object categories [9, 40, 49]. Hence, these works are able to employ object-agnostic attribute classification. However, because our VAW dataset contains attribute annotations across a diverse set of object categories, incorporating the object embedding as input can help the model learn to avoid infeasible attribute-object combinations (e.g., parked dog). There are multiple ways to compose the image feature map with the object embedding [32, 70, 71]. Here, we opt for a simple object-conditioned gating mechanism, which we find to be consistently better than the concatenation used in [32, 33]. Let φ_o ∈ R^d be the object embedding vector and f_{comp}(f_{img}(I), φ_o) ∈ R^D the composition module that takes in the image feature map and object embedding. We implement f_{comp} with a gating mechanism as follows:

f_{comp}(f_{img}(I), \phi_o) = f_{img}(I) \odot f_{gate}(\phi_o),   (2.1)
f_{gate}(\phi_o) = \sigma\big(W_{g_2} \cdot \mathrm{ReLU}(W_{g_1} \phi_o + b_{g_1}) + b_{g_2}\big),   (2.2)

where ⊙ is the channel-wise product, σ(·) is the sigmoid function, and f_{gate}(φ_o) ∈ R^D is a 2-layer MLP whose output is broadcast to match the spatial dimensions of the feature map. Intuitively, f_{gate} acts as a filter that selects only attribute features relevant to the object of interest and suppresses incompatible attribute-object pairs.

Relevant object localization. An object bounding box can contain both the relevant object and other objects or background. Hence, it is desirable to learn a smarter feature aggregation that can suppress all irrelevant image regions. We propose to leverage the availability of the object segmentation mask in the VAW dataset to achieve this. Let X ∈ R^{H×W×D} be the image-object composed feature map. The relevant object region G is localized using two stacked convolutional layers f_{rel} with kernel size 1, followed by a spatial softmax:

g = f_{rel}(X), \quad g \in \mathbb{R}^{H \times W},   (2.3)
G_{h,w} = \frac{\exp(g_{h,w})}{\sum_{h',w'} \exp(g_{h',w'})}, \quad G \in \mathbb{R}^{H \times W}.   (2.4)

We can then pool the image feature vector as

Z_{rel} = \sum_{h,w} G_{h,w} X_{h,w}.   (2.5)

G is learned with direct supervision from the object mask whenever it is available, using the following loss:

\mathcal{L}_{rel} = \sum_{h,w} \big( G_{h,w} \times (1 - M_{h,w}) - \lambda_{rel} \, G_{h,w} \times M_{h,w} \big),   (2.6)

where M is the ground-truth binary object mask. Rather than requiring G to exactly match the object mask, we find it is better to penalize the network whenever its prediction falls outside of the mask. This frees the network to learn heterogeneous attention within the object region if necessary (e.g., black mirror refers to its frame being black rather than its interior) instead of distributing its attention uniformly over the object. Hence, by setting λ_{rel} to a small positive constant less than 1, we prioritize the need for G to not attend to non-object pixels over the need to uniformly attend to all pixels on the object surface. A simplified sketch of the gating and localization components follows.
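The PyTorch sketch below illustrates the object-conditioned gating (Eqs. 2.1–2.2) and the mask-guided localizer with spatial softmax (Eqs. 2.3–2.6). The module name and hidden sizes are illustrative assumptions, not the exact implementation; the loss is averaged over the batch for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCompositionLocalizer(nn.Module):
    def __init__(self, feat_dim=2048, obj_dim=100, hidden=512, lambda_rel=0.25):
        super().__init__()
        self.gate = nn.Sequential(                      # f_gate: 2-layer MLP + sigmoid (Eq. 2.2)
            nn.Linear(obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.Sigmoid())
        self.loc = nn.Sequential(                       # f_rel: two conv layers with kernel size 1
            nn.Conv2d(feat_dim, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1))
        self.lambda_rel = lambda_rel

    def forward(self, feat_map, obj_emb, mask=None):
        # feat_map: B x D x H x W, obj_emb: B x d, mask: B x H x W binary or None
        gate = self.gate(obj_emb)[:, :, None, None]     # broadcast over spatial dims
        x = feat_map * gate                             # Eq. 2.1: channel-wise gating
        g = self.loc(x).flatten(1)                      # B x (H*W)
        G = F.softmax(g, dim=1)                         # Eq. 2.4: spatial softmax
        z_rel = torch.einsum('bn,bdn->bd', G, x.flatten(2))   # Eq. 2.5: pooled object feature
        loss_rel = None
        if mask is not None:                            # Eq. 2.6: punish attention outside mask,
            m = mask.flatten(1).float()                 # mildly encourage attention inside it
            loss_rel = (G * (1 - m) - self.lambda_rel * G * m).sum(dim=1).mean()
        return x, G, z_rel, loss_rel
```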
Multi-attention. Object localization is beneficial for recognizing several attributes such as color, material, texture, and shape, but might be too restrictive for attributes that require attention to different object parts or to the background. For example, bald-headed or bare-footed requires looking at a person's head or feet, and distinguishing different activities (e.g., jumping vs. crouching) might require information from the surrounding context. Therefore, we utilize a free-form multi-attention mechanism that allows our model to attend to features at different spatial locations.

There are two extreme cases when applying spatial attention [41]: (1) one attention map for all attributes and (2) one attention map per attribute [40]. The first approach is similar to using the object foreground, which is unlike what we are aiming for. The latter allows more control but does not scale well with a large number of attributes. Hence, we opt for a hybrid multi-attention idea as in [41]. We extract M attention maps \{A^{(m)}\}_{m=1}^{M} from X using f^{(m)}_{att}, which has the same architecture as f_{rel}:

E^{(m)} = f^{(m)}_{att}(X), \quad E^{(m)} \in \mathbb{R}^{H \times W}, \quad m = 1, \dots, M,   (2.7)
A^{(m)}_{h,w} = \frac{\exp(E^{(m)}_{h,w})}{\sum_{h',w'} \exp(E^{(m)}_{h',w'})}, \quad A^{(m)} \in \mathbb{R}^{H \times W}.   (2.8)

This is partly similar to [72], where object parts are localized using learned embeddings of these parts. Because the VAW dataset does not have part annotations for every attribute, that approach is not usable in our case. Similar to [41], we employ the following divergence loss to encourage the attention maps to focus on different regions:

\mathcal{L}_{div} = \sum_{m \neq n} \frac{\langle E^{(m)}, E^{(n)} \rangle}{\|E^{(m)}\|_2 \, \|E^{(n)}\|_2}.   (2.9)

Using the computed M attention maps, we aggregate M feature vectors \{r^{(m)}\}_{m=1}^{M} from X and pass them through a projection layer to obtain their final representations:

r^{(m)} = \sum_{h,w} A^{(m)}_{h,w} X_{h,w}, \quad r^{(m)} \in \mathbb{R}^{D},   (2.10)
z^{(m)}_{att} = f^{(m)}_{proj}(r^{(m)}), \quad z^{(m)}_{att} \in \mathbb{R}^{D_{proj}}.   (2.11)

Our final multi-attention feature is the concatenation of all individual attention features:

Z_{att} = \mathrm{concat}\big([z^{(1)}_{att}, \dots, z^{(M)}_{att}]\big).   (2.12)

A compact sketch of this module is shown below.
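The following compact PyTorch sketch approximates the multi-attention module (Eqs. 2.7–2.12) and the divergence loss (Eq. 2.9). Shapes follow the text (M maps, projection size D_{proj}); the divergence loss is averaged over the batch here, and other implementation details are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_maps=3, proj_dim=128):
        super().__init__()
        self.att_heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_dim, hidden, 1), nn.ReLU(),
                          nn.Conv2d(hidden, 1, 1))
            for _ in range(num_maps)])                        # f_att^(m), same form as f_rel
        self.proj = nn.ModuleList([nn.Linear(feat_dim, proj_dim)
                                   for _ in range(num_maps)])  # f_proj^(m)

    def forward(self, x):
        # x: B x D x H x W image-object composed feature map
        flat = x.flatten(2)                                    # B x D x (H*W)
        pre_softmax, z_att = [], []
        for head, proj in zip(self.att_heads, self.proj):
            e = head(x).flatten(1)                             # Eq. 2.7: B x (H*W)
            a = F.softmax(e, dim=1)                            # Eq. 2.8: attention map A^(m)
            r = torch.einsum('bn,bdn->bd', a, flat)            # Eq. 2.10: attended feature
            z_att.append(proj(r))                              # Eq. 2.11: projected feature
            pre_softmax.append(F.normalize(e, dim=1))
        E = torch.stack(pre_softmax, dim=1)                    # B x M x (H*W), unit-normalized
        sim = torch.einsum('bmn,bkn->bmk', E, E)               # pairwise cosine similarities
        # Eq. 2.9: sum of similarities over pairs m != n (batch-averaged)
        loss_div = (sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=1)).mean()
        return torch.cat(z_att, dim=1), loss_div               # Eq. 2.12: Z_att
```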
2.3.2 Training Objectives

Our final feature vector is the concatenation of the localized object feature and the multi-attention feature. In addition, we also find that using low-level features from early blocks improves accuracy for low-level attributes (color, material). Therefore, we also pool low-level features from the estimated object region G to construct Z_{low}. The input to the classification layer is [Z_{low}, Z_{rel}, Z_{att}], and we use a linear classifier with C output logits followed by a sigmoid.

Let \hat{Y} = [\hat{y}_1, \dots, \hat{y}_C] be the output of the classification layer. We apply the following reweighted binary cross-entropy loss that takes data imbalance into account:

\mathcal{L}_{bce}(Y, \hat{Y}) = -\sum_{c=1}^{C} w_c \big( \mathbb{1}_{[y_c = 1]} \, p_c \log(\hat{y}_c) + \mathbb{1}_{[y_c = 0]} \, n_c \log(1 - \hat{y}_c) \big),

where w_c, p_c, and n_c are respectively the reweighting factors for attribute c, its positive examples, and its negative examples. Let n^{pos}_c and n^{neg}_c be the number of positives and negatives of attribute c. First, we want w_c to reflect the importance of the rare attributes, so we set w_c ∝ 1/(n^{pos}_c)^α and normalize so that Σ_c w_c = C [52] (α is a smoothing factor). Second, we want to balance the effect of positive and negative examples. We apply the same idea by setting p_c ∝ 1/(n^{pos}_c)^α and n_c ∝ 1/(n^{neg}_c)^α and normalizing so that p_c + n_c = 2. As a result, the ratio between the positive and negative terms becomes p_c/n_c = (n^{neg}_c/n^{pos}_c)^α, which helps balance out their effect based on their frequency.

Our reweighted BCE (termed RW-BCE) is different from [37], where the authors propose to reweigh each sample based on its proportion of available labels (i.e., an object instance with fewer available labels is assigned a larger weight). We posit this is not ideal because the number of labels on an instance should not affect the loss computation (e.g., the loss for red should be the same between a red car instance and a large shiny red car instance, even though the latter is annotated with more labels).

Our overall loss is a combination of all the loss functions presented above:

\mathcal{L} = \mathcal{L}_{bce} + \mathcal{L}_{rel} + \lambda_{div} \mathcal{L}_{div}.   (2.13)

Empirically, we find that applying repeat factor sampling (RFS) [56, 57] together with RW-BCE works well. RFS defines a repeat factor for every image based on the rarity of the labels it contains. Therefore, we employ both RW-BCE and RFS (referred to as RR) in training our model. A sketch of the RW-BCE reweighting terms is given below.
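The sketch below shows one way the RW-BCE reweighting terms could be computed and applied: w_c down-weights frequent attributes, while p_c and n_c balance positives against negatives per attribute. Function names are illustrative; the optional soft-negative weight corresponds to the "soft negatives" treatment of missing labels mentioned in the Figure 2.4 caption.

```python
import torch

def rwbce_weights(n_pos: torch.Tensor, n_neg: torch.Tensor, alpha: float = 0.1):
    """n_pos, n_neg: per-attribute counts of positive/negative labels (length C)."""
    C = n_pos.numel()
    w = 1.0 / n_pos.clamp(min=1).float() ** alpha
    w = w * C / w.sum()                        # normalize so that sum_c w_c = C
    p = 1.0 / n_pos.clamp(min=1).float() ** alpha
    n = 1.0 / n_neg.clamp(min=1).float() ** alpha
    scale = 2.0 / (p + n)                      # normalize so that p_c + n_c = 2
    return w, p * scale, n * scale             # note: p_c / n_c = (n_neg / n_pos) ** alpha

def rwbce_loss(y_hat, y, w, p, n, soft_neg_weight=0.0):
    """y_hat: B x C sigmoid outputs; y: B x C labels in {1, 0, -1}."""
    pos = (y == 1).float()
    neg = (y == 0).float()
    # optionally treat missing labels (-1) as very-low-weight "soft negatives"
    neg = neg + soft_neg_weight * (y == -1).float()
    eps = 1e-7
    loss = -(w * (pos * p * torch.log(y_hat + eps) +
                  neg * n * torch.log(1 - y_hat + eps)))
    return loss.sum(dim=1).mean()
```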
2.3.3 Negative Label Expansion

While our dataset provides an unprecedented amount of explicitly labeled negative annotations, the number of possible negatives still far outnumbers the number of possible positive attributes. Because many attributes are mutually exclusive (i.e., the presence of the attribute clean implies the absence of the attribute dirty), we seek to use existing linguistic and external knowledge tools to expand the set of negative annotations.

Consider an attribute type A (e.g., material). The following observations can be made about its attributes: (1) there exists an overlapping relation between some attributes due to their visual similarity or their being hierarchically related (e.g., wooden overlaps with wicker); (2) there exists an exclusive relation where two attributes cannot appear on the same object (e.g., wet vs. dry). Therefore, for an object labeled with attribute a ∈ A, we can generate negative labels for it from the set {a′ ∈ A | ¬overlap(a, a′) ∨ exclusive(a, a′)}.

We classify the attributes into types and construct their overlapping and exclusive relations using the existing ontology from related work [73], WordNet [68], and the relation edges from ConceptNetAPI [74]. We further expand the overlapping relations based on the co-occurrence (via conditional probability) of attribute pairs (e.g., white and beige are similar and often mistaken by human annotators). Our negative label expansion scheme allows us to add 5.9M negative annotations to our training set. Aside from the extra negatives, one benefit of this approach is that when we want to label a novel attribute class, we can use the same procedure to discover its relationship with existing attributes in the dataset and attain free negatives for the new class.

2.3.4 Supervised Contrastive Learning

[75] shows that imbalanced learning can benefit from self-supervised pretraining on both labeled and unlabeled data, where a network can be better initialized by avoiding the strong label bias caused by data imbalance. Also motivated by [67], we propose to use supervised contrastive (SupCon) pretraining for our attribute learning with partial labels problem, extending the SupCon loss from a single-label to a multi-label setting.

We perform mean-pooling inside the feature map X to obtain x ∈ R^D. We follow the design of SimCLR [66] and add a projection layer to map z = Proj(x) ∈ R^{128}. The projection layer is an MLP with hidden size 2048 and is only used during pretraining. In a multi-label setting, it is not trivial how to pull two samples together, since they can share some labels but differ in others. Motivated by [28, 76], we propose to represent each attribute c as a matrix A_c ∈ R^{128×128} that linearly projects z into an attribute-aware embedding space, z_c = A_c z, which is then ℓ2-normalized onto the unit hypersphere. With this, samples that share the same attribute can have their respective attribute-aware embeddings pulled together.

In the pretraining stage, we construct a batch of 2N sample-label vector pairs \{I_i, Y_i\}_{i=1}^{2N}, where I_{2k} and I_{2k-1} (k = 1, \dots, N) are two views (from random augmentation) of the same object image and Y_{2k} = Y_{2k-1}. Let z_{i,c} be the c-attribute-aware embedding of I_i, and B(i) = \{c \in C : Y_{i,c} = 1\} the set of positive attributes of I_i. We reuse notation from [67]: K ≡ \{1, \dots, 2N\}, A(i) ≡ K \setminus \{i\}, P(i, c) ≡ \{p \in A(i) : Y_{p,c} = Y_{i,c}\}, and use the following SupCon loss:

\mathcal{L}_{sup} = \sum_{i=1}^{2N} \sum_{c \in B(i)} \frac{-1}{|P(i,c)|} \sum_{p \in P(i,c)} \log \frac{\exp(z_{i,c} \cdot z_{p,c} / \tau)}{\sum_{j \in A(i)} \exp(z_{i,c} \cdot z_{j,c} / \tau)}.   (2.14)

The linear transformation with A_c, followed by the dot product in the SupCon loss, implements an inner product in the embedding space of z, which can be interpreted as finding the part of z that encodes the attribute c [76]. Therefore, our approach fits nicely into the multi-label setting, where an image embedding vector z can simultaneously encode multiple attribute labels, each probed by its linear transformation for contrasting in the SupCon loss. After the pretraining stage, we keep the backbone encoder and the image-object composition module and finetune them along with the classification layer.

While SupCon is designed to be used for pretraining, we empirically find that it hampers the multi-attention module's ability to focus on specific regions. To reconcile this difference, we find it is empirically better to minimize \mathcal{L}_{sup} jointly with the other losses. For models that do not use attention (vanilla ResNet), we find SupCon pretraining is still effective. A sketch of the multi-label SupCon loss is shown below.
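The following sketch illustrates Eq. 2.14 with attribute-aware projection matrices A_c. It is a simplified reference implementation under the batch layout described above (two augmented views per object); vectorization, masking tricks, and initialization details of the actual training code are omitted.

```python
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(z, labels, A, tau=0.25):
    """z: 2N x 128 projected embeddings (two augmented views per object);
    labels: 2N x C with entries in {1, 0, -1}; A: C x 128 x 128 attribute matrices."""
    two_n, num_attrs = labels.shape
    loss = z.new_zeros(())
    for c in range(num_attrs):
        pos_idx = (labels[:, c] == 1).nonzero(as_tuple=True)[0]
        if pos_idx.numel() < 2:
            continue                                   # need at least one positive pair for c
        z_c = F.normalize(z @ A[c].T, dim=1)           # attribute-aware embeddings z_{i,c}
        sim = z_c @ z_c.T / tau                        # pairwise similarities / temperature
        for i in pos_idx.tolist():                     # anchors i with c in B(i)
            others = torch.arange(two_n, device=z.device) != i   # A(i) = K \ {i}
            positives = pos_idx[pos_idx != i]                    # P(i, c)
            log_prob = sim[i][positives] - torch.logsumexp(sim[i][others], dim=0)
            loss = loss - log_prob.mean()              # -1/|P(i,c)| * sum_p log(...)
    return loss
```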
2.4 Experiments

In this section, we discuss the implementation details of our method and the evaluation metrics, and report the results of our method and other related baselines on the VAW dataset.

2.4.1 Implementation Details

We use the ImageNet-pretrained [77] ResNet-50 [69] as the feature extractor and use the output feature maps from ResNet blocks 2 and 3 as low-level features. For the object name embedding, we use the pretrained GloVe [78] 100-d word embeddings. We do not finetune these word embeddings during training, as we want our model to generalize to unseen objects at test time.

We implement our model in PyTorch [79] and train using the Adam optimizer with the default settings, batch size 64, weight decay of 1e-5, and an initial learning rate of 1e-5 for the pretrained ResNet and 0.0007 for the rest of the model. We train for 12 epochs and apply a learning rate decay of 0.1 whenever the mAP on the validation set stops improving for 2 epochs. We use an input image size of 224x224 and basic image augmentations, which include random cropping around the object bounding box, random grayscale when an instance is not labeled with any color attributes, minor color jittering, and horizontal flipping. For each object bounding box in the dataset, we expand its width and height by min(w, h) × 0.3 to capture more context.

For the hyperparameters, we set λ_{rel} = 0.25 and λ_{div} = 0.004. In the multi-attention module, we select D_{proj} = 128 and use M = 3 attention maps. Regarding reweighting and resampling, we use t = 0.0006 for RFS and α = 0.1 for smoothing in the RW-BCE reweighting terms. For SupCon pretraining, we pretrain on top of the ImageNet-pretrained ResNet for 10 epochs with batch size 384 (768 views per batch), and initialize all matrices A_c with the identity matrix. In the contrastive loss, we set the temperature τ = 0.25. We believe using a larger batch size would greatly benefit supervised contrastive pretraining, as suggested by the authors [67]. For SupCon joint training with the other losses of the Strong Baseline model, we keep the batch size at 64, add λ_{sup} \mathcal{L}_{sup} to the loss with λ_{sup} = 0.5, and keep all other hyperparameters the same as above.

2.4.2 Evaluation Metrics

In this section, we present details about the different evaluation metrics that we use. We use mAP as our primary metric, since it describes how well the model ranks correct images higher than incorrect ones for each attribute label. mR@15 is also important, as it shows how well the model manages to output the ground-truth positive attributes in its top 15 predictions for each image. In addition, mA and F1@15 can also be used to evaluate model performance in a different light.

mAP: similar to [80], the mAP score is computed by taking the mean of the average precision over all C classes,

\mathrm{mAP} = \frac{1}{C} \sum_c AP_c,   (2.15)

in which the average precision of each class is computed as

AP_c = \frac{1}{P_c} \sum_{k=1}^{P_c} \mathrm{Precision}(k, c) \cdot \mathrm{rel}(k, c),   (2.16)

where P_c is the number of positive examples of class c, Precision(k, c) is the precision of class c when retrieving the best k images, and rel(k, c) is the indicator function that returns 1 if class c is a ground-truth positive annotation of the image at rank k. Note that because VAW is partially labeled, we compute this metric only on the annotated data, as in [80]. This evaluation scheme is also similar to the one used in [56], where the authors introduce the notion of a federated dataset. In the federated-dataset setup, we only need a positive and a negative set for each label; the average precision for each label can then be computed on these two sets.

mA: as in [81, 82], we compute the mean balanced accuracy (mA) to evaluate all models in a classification setting, using 0.5 as the threshold between positive and negative predictions. Because our dataset is highly unbalanced between the number of positive and negative examples for some attributes, balanced accuracy is a suitable metric, as it calculates the accuracy of positive and negative examples separately and then averages them. Concretely, the mA score is computed as

\mathrm{mA} = \frac{1}{C} \sum_c \frac{1}{2} \left( \frac{TP_c}{P_c} + \frac{TN_c}{N_c} \right),   (2.17)

where C is the number of attribute classes, P_c and TP_c are the number of positive examples and true positive predictions of class c, and N_c and TN_c are defined similarly for the negative examples and predictions. Because mA uses a threshold of 0.5, models that are not well balanced between positive and negative predictions tend to receive a low score. This metric is also used in pedestrian and human facial attribute works [40, 81].

mR@15: mean recall over all classes at the top 15 predictions in each image. Recall@K is often used for datasets that are not exhaustively labeled, as in scene graph generation [83, 84]. It is also used in multi-label learning [38, 39, 41] under the name 'per-class recall'. A sketch of the mAP and mA computations under partial labels is given below.
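The sketch below illustrates how mAP and mA can be evaluated on partially labeled data by restricting each attribute to its explicitly annotated instances. It uses a standard ranking-based average precision over the annotated instances as a close stand-in for Eqs. 2.15–2.17, and is an illustration rather than the official evaluation script.

```python
import numpy as np

def map_and_ma(scores: np.ndarray, labels: np.ndarray):
    """scores: N x C confidences in [0, 1]; labels: N x C in {1, 0, -1}."""
    aps, bal_accs = [], []
    for c in range(scores.shape[1]):
        annotated = labels[:, c] != -1               # evaluate only annotated instances
        y, s = labels[annotated, c], scores[annotated, c]
        if (y == 1).sum() == 0 or (y == 0).sum() == 0:
            continue                                 # need both positives and negatives
        order = np.argsort(-s)                       # rank annotated instances by score
        hits = (y[order] == 1).astype(float)
        precision_at_k = np.cumsum(hits) / (np.arange(hits.size) + 1)
        aps.append((precision_at_k * hits).sum() / hits.sum())     # average precision
        pred_pos = s >= 0.5                          # 0.5 threshold for balanced accuracy
        tpr = (pred_pos & (y == 1)).sum() / (y == 1).sum()
        tnr = (~pred_pos & (y == 0)).sum() / (y == 0).sum()
        bal_accs.append((tpr + tnr) / 2)
    if not aps:
        return 0.0, 0.0
    return float(np.mean(aps)), float(np.mean(bal_accs))
```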
F1@15: as mR@15 may be biased towards infrequent classes, we also report the overall F1 (harmonic mean of precision and recall) at the top 15 predictions in each image. Because VAW is partially labeled, we only evaluate predictions for labels that have been annotated. The overall precision and recall are computed as follows:

\mathrm{OV\text{-}Precision} = \frac{\sum_c TP_c}{\sum_c N^p_c}, \quad \mathrm{OV\text{-}Recall} = \frac{\sum_c TP_c}{\sum_c P_c},   (2.18)

where TP_c is the number of true positives for attribute class c, N^p_c is the number of positive predictions for class c, and P_c is the number of ground-truth positive examples of class c. The F1 score is the harmonic mean of precision and recall:

\mathrm{OV\text{-}F1} = \frac{2 \times \mathrm{OV\text{-}Precision} \times \mathrm{OV\text{-}Recall}}{\mathrm{OV\text{-}Precision} + \mathrm{OV\text{-}Recall}}.   (2.19)

2.4.3 Baselines

We consider the following baselines and state-of-the-art multi-label learning approaches and compare them to our SCoNE algorithm. We made our best attempt to modify the authors' implementations (when available) to include our image-object composition module. All models use ResNet-50 as their backbone and are trained with the BCE loss (except LSEP and ResNet-Baseline-CE). Empirically, we find that treating missing labels as negatives and assigning them very small weights in the BCE loss also improves results; hence, we apply this to all methods.

• ResNet-Baseline: ResNet-50 followed by the image-object composition and classification layer.
• ResNet-Baseline-CE: Same as above, but uses the softmax cross-entropy loss. This is used by [32, 33] to train the attribute prediction head of object detectors on Visual Genome.
• Strong Baseline (SB): The combination of our image-object composition, multi-attention, and object localizer.
• LSEP [39]: Uses a ranking loss and threshold estimation to predict which attributes to output.
• ML-GCN [38]: Uses a graph convolutional network to predict classifier weights based on the GloVe embeddings of the attribute names. The label correlation graph is constructed following the authors' implementation.
• Durand et al. (Partial-BCE + GNN) [37]: BCE loss reweighted by the authors' reweighting scheme. A graph neural network is applied on the output logits.
• Sarafianos et al. [40]: State of the art in pedestrian attribute prediction that also uses multi-attention.

Table 2.2: Experimental results compared with baselines and SOTA multi-label learning methods. The top box displays results of multi-label learning methods; the middle box shows results of models from attribute prediction works and our strong baseline; the last row shows the performance of our SCoNE algorithm applied on top of the strong baseline.

Methods | mAP | mR@15 | mA | F1@15 | Head | Medium | Tail | Color | Material | Shape | Size | Texture | Action | Others
LSEP [39] | 61.0 | 50.7 | 67.1 | 62.3 | 69.1 | 57.3 | 40.9 | 56.1 | 67.1 | 63.1 | 61.4 | 58.7 | 50.7 | 64.9
ML-GCN [38] | 63.0 | 52.8 | 69.5 | 64.1 | 70.8 | 59.8 | 42.7 | 59.1 | 64.7 | 65.2 | 64.2 | 62.8 | 54.7 | 66.5
Partial-BCE + GNN [37] | 62.3 | 52.3 | 68.9 | 63.9 | 70.1 | 58.7 | 40.1 | 57.7 | 66.5 | 64.1 | 65.1 | 59.3 | 54.4 | 65.9
ResNet-Baseline [9] | 63.0 | 52.1 | 68.6 | 63.9 | 71.1 | 59.4 | 43.0 | 58.5 | 66.3 | 65.0 | 64.5 | 63.1 | 53.1 | 66.7
ResNet-Baseline-CE [32, 33] | 56.4 | 55.8 | 50.3 | 61.5 | 64.6 | 52.7 | 35.9 | 54.0 | 64.6 | 55.9 | 56.9 | 54.6 | 47.5 | 59.2
Sarafianos et al. [40] | 64.6 | 51.1 | 68.3 | 64.6 | 72.5 | 61.5 | 42.9 | 62.9 | 68.8 | 64.9 | 65.7 | 62.3 | 56.6 | 67.4
Strong Baseline (SB) | 65.9 | 52.9 | 69.5 | 65.3 | 73.6 | 62.5 | 46.0 | 64.5 | 68.9 | 67.1 | 65.7 | 66.1 | 57.2 | 68.7
SB + SCoNE (Ours) | 68.3 | 58.3 | 71.5 | 70.3 | 76.5 | 64.8 | 48.0 | 70.4 | 75.6 | 68.3 | 69.4 | 68.4 | 60.7 | 69.5
(Head/Medium/Tail report mAP under class imbalance; the Color through Others columns report mAP per attribute type.)
Table 2.3: Ablation study. We show how each of our proposed techniques helps improve overall performance.

Methods | mAP | mR@15 | mA | F1@15
Strong Baseline (SB) | 65.9 | 52.9 | 69.5 | 65.3
+ Negative | 67.7 | 54.3 | 70.0 | 69.6
+ Neg + SupCon | 68.2 | 55.2 | 70.3 | 70.0
+ Neg + SupCon + RR (SCoNE) | 68.3 | 58.3 | 71.5 | 70.3
ResNet-Baseline | 63.0 | 52.1 | 68.6 | 63.9
+ SCoNE | 66.4 | 56.8 | 70.7 | 68.8

2.4.4 Results

Overall results are shown in Table 2.2, where SB and SB+SCoNE are compared with other baselines and state-of-the-art algorithms. Overall, SB is better than the other baselines in almost all metrics except for mR@15, where it is lower than ResNet-Baseline-CE. This shows that the object localizer and multi-attention are effective for attribute prediction. ResNet-Baseline-CE, which is adopted by [32, 33], has good recall but very low precision (mAP and F1). This is in contrast to ResNet-Baseline, which is trained with BCE.

SB+SCoNE substantially improves over SB in all metrics and clearly surpasses the available algorithms by a large margin. It is particularly effective for long-tail attributes, where it outperforms its closest competitor (other than SB) by 5 mAP points, and is also highly effective in detecting color and material attributes, where it is nearly 7-8 mAP points higher than the next-best method. This shows that our attribute learning paradigm, including the negative label expansion, supervised contrastive loss, and reweighting and resampling scheme, is clearly effective for attribute learning.

2.4.5 Ablation Studies

Components of SCoNE. Table 2.3 shows the effect of the different components of SCoNE. Starting from our SB, we can see that each of our model choices substantially improves its performance, with the biggest mAP improvement provided by our negative label expansion scheme. The components of SCoNE also stack additively, with our final model performing 2.4 mAP, 5.4 mR@15, and 5 F1@15 points above SB. Moreover, the components of SCoNE are model agnostic. We verify this by enhancing our ResNet-Baseline with SCoNE, which also improves its mAP and mR@15 by 3.4 and 4.7 points.

Components of Strong Baseline. The Strong Baseline comprises several sub-components that extend the ResNet-Baseline: the object localizer, the multi-attention module, and the usage of low-level features. We ablate our Strong Baseline model with respect to each component and train on our training data after negative label expansion. We report results in Table 2.4.

Table 2.4: Ablation study on the three components of the Strong Baseline model by removing each one. The last row also corresponds to the ResNet-Baseline model.

Methods (+Neg) | mAP | mR@15 | mA | F1@15 | Head | Medium | Tail | Color | Material | Shape | Texture | Action | Others
Strong Baseline | 67.7 | 54.3 | 70.0 | 69.6 | 75.9 | 64.3 | 46.9 | 68.8 | 73.9 | 67.0 | 69.4 | 60.2 | 69.1
w/o Multi-attention (MA) | 67.4 | 53.5 | 69.7 | 69.7 | 75.9 | 63.8 | 46.4 | 67.8 | 74.7 | 66.9 | 68.5 | 58.0 | 69.0
w/o Low-level feature (LL) | 67.3 | 53.7 | 69.9 | 69.4 | 75.4 | 63.8 | 48.4 | 68.5 | 73.6 | 66.1 | 67.5 | 59.3 | 68.9
w/o Object localizer (OL) | 66.9 | 53.1 | 69.6 | 69.1 | 75.3 | 63.4 | 45.5 | 67.5 | 73.8 | 66.5 | 68.4 | 58.9 | 68.3
w/o OL, MA and LL | 65.6 | 53.8 | 69.4 | 68.6 | 74.8 | 62.3 | 43.2 | 67.3 | 73.3 | 66.3 | 67.7 | 56.0 | 67.4
(Head/Medium/Tail and the attribute-type columns report mAP.)

Removing each sub-component has a negative effect on the performance of the Strong Baseline model. For example, removing the low-level features not only lowers mAP for color and material attributes but also for higher-level attributes (e.g., action). This is likely due to the absence of clearly separated low- and high-level features, which forces a single feature to represent both. This adversely affects the network's ability to learn high-level attributes (e.g., action) as well as low-level ones (color, texture), thus lowering performance for both.

Interestingly, removing the object localizer does not result in a drastically diminished performance. Visualizing the multi-attention output of our full model (Figure 2.5) reveals that even without object mask supervision, the model is still able to differentiate between the object and background/distractors with the multi-attention maps, which are trained with weak supervision from the attribute labels.
However, removing all components, which leaves the model devoid of any form of attention, severely hampers performance across all categories. In general, all sub-components are necessary for our model to perform well across different attribute types.

2.4.6 Qualitative Results

Figure 2.5 shows qualitative results of our SB+SCoNE model, which showcases its various strengths. First, the model shows a robust ability to predict a variety of attribute types for different objects with good accuracy. Next, our object localizer shows a remarkable ability to find the correct object of interest and to ignore the background and other distracting objects (in Figure 2.5a, the ground is not attended to by the object localizer). Our multi-attention module often complements the object localizer by attending to relevant image regions that may lie outside the object region. For example, in Figure 2.5b, the activity skateboarding is easier to predict if our model can look at the skateboard, but the skateboard is outside the person region. Here, our multi-attention correctly learns to look at appropriate image regions that help our model determine that the person has the attribute skateboarding.

Figure 2.6 shows more attribute prediction examples from our model. Our object localizer can correctly infer the object region and help alleviate the object occlusion problem. For example, in Figure 2.6e, the object table is partially occluded by a lot of clutter, which can distract a model that relies on global average pooling. Here, our object localizer clearly isolates the visible parts of the table, making it easier to predict its attributes. Because many attributes depend on context, e.g., parked vs. running car, we designed our multi-attention module to complement the object localizer by allowing it to attend to image regions outside the object. This can be clearly seen in Figure 2.6a, where attributes such as sunny or bright can be hard to infer by looking only at the given object tree. Our multi-attention module looks at the sunny spots on the pavement, which helps our model infer the presence of the sunny and bright attributes. However, the multi-attention module is also free to attend to regions within the object, further supplementing the object localizer's attention to specific object parts. For example, in Figure 2.6d, our multi-attention module attends to the hind legs of the dog, which supplements the object localizer's attention map and provides additional information to help the model infer that the dog is jumping.

From Figure 2.7 to Figure 2.10, we show our image search (ranking) results when searching for specific attributes. Our model is able to search for images that exhibit one or multiple attributes, as demonstrated in Figure 2.8 where we search for multiple colors at a time. One simple way to implement such attribute-based search is sketched below.
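The following sketch shows one simple way such an attribute-based search could be implemented: score every object instance with the trained classifier and rank by the combined confidence of the queried attributes. This is an illustrative assumption about the ranking procedure, not necessarily the exact one used to produce the figures.

```python
from typing import List
import torch

@torch.no_grad()
def search_by_attributes(scores: torch.Tensor, query_ids: List[int], top_k: int = 10):
    """scores: N x C sigmoid outputs for N object instances; query_ids: queried attribute indices."""
    # require every queried attribute to be confident: combine with the minimum score
    combined = scores[:, query_ids].min(dim=1).values
    return torch.topk(combined, k=top_k).indices      # indices of the top-ranked instances

# e.g., to search for objects that are both "red" and "shiny" (indices are hypothetical):
# top = search_by_attributes(all_scores, [red_idx, shiny_idx])
```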
In addition, the results in Figure 2.10 show that our model is able to differentiate between objects of different sizes (e.g., small vs. large bird, small vs. large phone).

2.5 Discussion

VAW is a first-of-its-kind large-scale object attribute prediction dataset in the wild. We explored various challenges posed by the VAW dataset and discussed the efficacy of current models on this task. Our SCoNE model proposed several novel algorithmic improvements that have helped us improve performance on the VAW dataset over our strong baseline by 2.4 mAP and 5.4 mR@15 points.

Figure 2.5: Examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above each image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

Despite our results, there are several outstanding challenges remaining to be solved in VAW.

Data imbalance: Reweighting and resampling techniques have considerably improved the performance on tail categories in the VAW dataset. However, even for our best model, mAP for tail categories still lags more than 25 points behind the head category. Similar to many vision and language problems [85], this is one considerable challenge for future work in this space.

Object-bias effect: Using the object label as input is crucial to obtain good results on VAW, but it may also introduce object bias into predictions. Ideally, an algorithm should be able to make robust predictions for compositionally novel instances. While not in the scope of this work, this can be explored in detail by redistributing the train-test split in compositionally novel patterns [27, 86, 87].

Figure 2.6: More examples of predictions from SB+SCoNE. We show the object name and its ground-truth positive attribute labels above each image. The object localized region, attention map #1, and the model's top-10 predictions are shown below. Red text represents missed or incorrect predictions.

Figure 2.7: Image search results. We show the top retrieved images of SB+SCoNE when searching for some color attributes.

Figure 2.8: Image search results. We show the top retrieved images of SB+SCoNE when searching for images that exhibit multiple color attributes.

Figure 2.9: Image search results. We show the top retrieved images of SB+SCoNE when searching for some material attributes.

Figure 2.10: Image search results. We show the top retrieved images of SB+SCoNE when searching for some size attributes.

In conclusion, we believe that VAW can serve as an important benchmark not only for attribute prediction in the wild, but also as a generic testbed for long-tailed multi-label prediction with limited labels, data imbalance, out-of-distribution testing, and bias-related issues.
Chapter 3: Open-Vocabulary Attribute Prediction

This chapter is based on the publication Improving closed and open-vocabulary attribute prediction using transformers, Pham et al., In European Conference on Computer Vision, pages 201–219. Springer, 2022.

In recent years, several datasets have provided explicit annotations of object attributes, such as [10, 88]. However, they are still limited in their coverage of objects and unique attributes, with even the largest datasets consisting of only a few hundred attributes. Additionally, existing work considers attributes to include only adjective properties, excluding an object's interactions with other objects in the scene. The latter is often classified as visual relationship detection and is treated as an entirely different research topic [89–91], one that requires localization of both subject and object in a subject-predicate-object triplet. We believe this distinction is unnecessarily limiting; e.g., person wearing hat conveys information about a property of person that is useful even if the exact grounding of hat is unknown. Hence, we expand the definition of attributes to include adjective- as well as action- and interaction-based properties from the point of view of an object.

To this end, we first describe a pipeline to extract object-centric attributes and interactions from large quantities of grounded, weakly grounded, and ungrounded image-text pairs. Then, we propose a novel attribute prediction model called Transformer for Attribute Prediction (TAP). TAP can predict an order of magnitude more unique attributes than previous methods, matching the performance of supervised baselines when directly transferred to the VAW benchmark [88]. After finetuning, we outperform prior art by 5.1% mAP and 5.0% mean recall. Further