ABSTRACT

Title of Dissertation: LEARNING AND COMPOSING PRIMITIVES FOR THE VISUAL WORLD

Kamal Gupta
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science; Professor Larry S. Davis, Department of Computer Science

Compositionality is at the core of how humans understand and recreate the visual world. It is what allows us to express infinitely many concepts using finite primitives. For example, we understand images as combinations of objects and videos as compositions of actions, and we generate 3D animations by rendering 3D surfaces with textures, materials, and lighting. It is unsurprising to see composition also appear in almost all human-created art forms such as language, music, design, or even mathematics. Although compositionality seems an obvious and prevalent way humans consume and create data, it often eludes computational approaches such as deep learning. Current systems often assume the availability of exhaustive labeled concepts or primitives during training and fail to generalize to new compositions during inference. In this dissertation, we propose to discover compositional primitives from data with little to no supervision, and we show how these primitives can improve generalization in real-world applications such as classification, correspondence, or 2D/3D synthesis.

In the first half of this dissertation, I propose two complementary approaches to discover compositional discrete primitives from visual data. Given a large collection of images without labels, I propose a generative and a contrastive way of recognizing discriminative parts in the image which are useful for visual recognition. In the generative approach, I take inspiration from Bayesian approaches such as variational autoencoders to develop a system that can express images in the form of a discrete, language-like representation. In the contrastive approach, I play a referential game between two neural network agents to learn meaningful discrete concepts from images. I further show applications of these approaches in image and video editing by learning a dense correspondence of primitives across images.

In the second half, I focus on learning how to compose primitives for both 2D and 3D visual data. By expressing scenes as an assembly of smaller parts, we can easily perform generation from scratch or from partial scenes given as input. I present two works, one on composing multiple viewpoints to synthesize 3D objects, and another on composing bounding boxes or cuboids to generate scene layouts. I also present a work on discovering a data-driven ordering for traversing an image or a scene during composition. I show applications of these works in image/video compression, as well as 2D and 3D content creation.

LEARNING AND COMPOSING PRIMITIVES FOR THE VISUAL WORLD

by
Kamal Gupta

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Larry S. Davis, Co-chair/Advisor
Professor Carol Y. Espy-Wilson, Dean's Representative
Professor Matthias Zwicker
Professor Noah Snavely, Cornell University

© Copyright by Kamal Gupta 2023

To my family and friends, for your unwavering and unconditional support. I am eternally grateful.
Acknowledgments

This thesis is the culmination of a joint effort of many individuals and institutions over the course of the last 5 years, and I want to express my deep appreciation and gratitude to all those who have contributed to its successful completion.

First and foremost, I would like to express my sincere gratitude to my wife, Gowthami Somepalli, for pushing me to apply for a Ph.D. after I spent years procrastinating. Thanks for sticking with me through thick and thin, endless brainstorming, bearing my rants, and helping me prepare for the interviews and presentations (in no specific order). I want to thank my parents for always pushing me to be a better person, my sisters Ruchi, Shilpa, and Sravani, my niece Sanvi and nephew Ibhan, and my brothers-in-law Himanshu and Ankur, for their unconditional love and support. Their patience, understanding, and love provided me with the motivation to overcome the many challenges I faced while completing this thesis.

I want to thank my advisor Abhinav Shrivastava for his unwavering support, guidance, and constructive criticism in making this thesis happen. I especially appreciate his flexibility and availability every time I needed it, the freedom to explore various research topics that I was passionate about, and above all, his friendship throughout the last 5 years. I am privileged to have his expertise, knowledge, and enthusiasm, which were essential to the successful completion of my dissertation. I also want to thank my co-advisor, Larry Davis, for being a great mentor, teaching me how to critically evaluate my own ideas, and creating a massive Computer Vision community at the University of Maryland that I immensely benefited from.

I am also grateful to other professors and mentors I collaborated with and learned from, especially Matthias Zwicker for his valuable insights and for introducing me to the world of graphics, David Jacobs for a number of discussions, and Noah Snavely, Ameesh Makadia, and Vijay Mahadevan for their expertise and constructive criticism, which were instrumental in shaping the direction of my research. I thank David Luebke and Stephano Soatto for building the best teams that I had the pleasure to be part of, and Sanjiv Singh, Stephen Nuske, Subhajit Sanyal, and Vidit Jain for their mentorship before I even started my Ph.D. I sincerely appreciate the time and effort they dedicated to teaching me and reviewing my work.

I want to thank the DARPA SAIL-ON (W911NF2020009), DARPA SemaFor (HR001119S0085), and DARPA GARD (HR00112020007) programs, the Amazon Research Award to Abhinav, the Kulkarni Fellowship, the Dean's Fellowship, the Graduate Assistant Award, and Google, NVIDIA, and Amazon Research Labs, which helped me conduct my research.

I want to express my gratitude to my friends/collaborators (Saurabh, Justin, Alessandro, Susmija, Ketul, Hanyu, Abhishek, Varun, Carlos, Nicholas, Towaki, Sameh, Anubhav, Sharath, Hao, Shishira, Matt W., Saksham, Nirat, Vinoj, Khoi, Archana), friends/labmates (Pulkit, Luyu, Pallabi, Lillian, Saketh, Gaurav, Soumik, Neha, Sahil, Moustafa, Max, Bo, Matt G., Shirley, Alex, Mara, Yixuan, Shlok, Anshul, Aman, Pranav, Sweta, Namitha, Vatsal, Ahmed, Ameya, Samyadeep, Sneha, Rohan, Trisha, Uttaran, Philip, Amanpreet, Koutilya, Sharmila, Kevin, Shramay), and friends/housemates (Bhavesh and Shivani) for their unwavering support and encouragement throughout this journey.
I would like to express my gratitude to Tom Hurst and the other staff and faculty of the Computer Science Department at UMD and UMIACS, who were super supportive and responsive when it came to any bureaucratic or compute-related questions.

Last but not least, I want to thank the essential workers and healthcare professionals across the world, who put their lives at risk every day, to help us make it through the Covid-19 pandemic. Like many others in my time, I owe them a great deal, and while I can't name all of them, my Ph.D. wouldn't have been possible without their tireless efforts.

My sincere apologies to those I've inadvertently left out. Without all of your support, guidance, and encouragement, this research would not have been possible.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Overview
1.2 Discovering Primitives from Data
1.3 Learning to Compose Primitives

Part I: Discovering Primitives from Data

Chapter 2: Learning Local Latent Codes for Recognition
2.1 Related Work
2.2 Our Approach
2.2.1 VAE Review
2.2.2 PatchVAE
2.2.3 PatchVAE with multiple parts
2.2.4 Improved Reconstruction Loss
2.3 Experiments
2.3.1 Downstream classification performance
2.3.2 Qualitative Results
2.4 Ablation Analysis

Chapter 3: Learning to Signal Mid-level Patches in Referential Games
3.1 Related Work
3.2 Our Approach
3.2.1 Referential Game Setup
3.2.2 Speaker agent architecture
3.2.3 Listener agent architecture
3.2.4 Training
3.3 Experiments
3.3.1 Positive Signaling - Visualizing Patch Ranks from θrank
3.3.2 Positive Signaling - Image Classification with subset of patches provided by θrank
3.3.3 Positive Signaling - Visualizing θsymb symbols
3.3.4 Positive Listening
3.4 Ablation study

Chapter 4: Aligning Sparse in-the-wild Image Collections
4.1 Related Work
4.2 ASIC Framework
4.2.1 Obtaining Pseudo-correspondences
4.2.2 Architecture
4.2.3 Training Objectives
4.3 Experiments
4.3.1 Canonical Space Alignment
4.3.2 Visualizing Dense Correspondences
4.3.3 Pairwise Correspondence
4.3.4 Image Set Correspondence
4.4 Ablations
4.5 Limitations

Part II: Learning to Compose Primitives

Chapter 5: Improved Modeling of 3D Shapes with Multi-view Depth Maps
5.1 Related Work
5.2 Our Approach
5.2.1 Framework
5.2.2 Loss function and other training details
5.2.3 Generative modeling with Implicit MLE
5.3 Experiments
5.3.1 Dataset
5.3.2 Baselines
5.3.3 Metrics
5.3.4 Results

Chapter 6: Layout Generation and Completion with Self-attention
6.1 Related Work
6.2 Our Approach
6.2.1 Layout Representation
6.2.2 Model architecture and training
6.3 Experiments
6.3.1 3D Shape synthesis (on PartNet dataset)
6.3.2 Layouts for natural scenes
6.3.3 Layouts for Apps and Documents
6.3.4 Failure Cases

Chapter 7: Neural Space-filling Curves
7.1 Related Work
7.2 Approach
7.2.1 Overview of Dafner et al. (Single Image-based SFC)
7.2.2 Our Approach
7.2.3 Weight Generator
7.2.4 Objective Functions
7.2.5 Weight Evaluator
7.2.6 Training
7.3 Experiments
7.3.1 Datasets
7.3.2 Training details
7.3.3 Qualitative Evaluation
7.3.4 Optimizing Autocorrelation
7.3.5 Optimizing Code Length
7.3.6 Scaling up SFC

Chapter 8: Conclusion and Discussion
8.1 Summary
8.2 Discussion

Bibliography

List of Tables

2.1 Classification results on CIFAR100, Indoor67, and Places205. We initialize the classification model with the representations ϕ(x) learned from the unsupervised learning task. The model ϕ(x) comprises a conv layer followed by two residual blocks (each having 2 conv layers). The first column (called 'Conv1') corresponds to Top-1 classification accuracy with the pre-trained model with the first conv layer frozen; the second and third columns correspond to results with the first three and first five conv layers frozen, respectively. Details in Section 2.3.1.

2.2 ImageNet classification results using ResNet18. We initialize weights using the unsupervised task and fine-tune the last two residual blocks. Details in Section 2.3.1.

2.3 Effect of N: Increasing the maximum number of patches increases the discriminative power for CIFAR100 but has little or negative effect for Indoor67.

2.4 Effect of d_p: Increasing the number of hidden units for a patch has very little impact on classification performance.

2.5 Effect of z_occ^prior: Increasing the prior probability of patch occurrence has an adverse effect on classification performance.

2.6 Effect of β_occ: Too high or too low β_occ can deteriorate the performance of learned representations.

2.7 Reconstruction metrics on ImageNet. PatchVAE sacrifices reconstruction quality to learn discriminative parts, resulting in higher recognition performance (Table 2.2).

3.1 Performance of Listener's Vision module ϕvision - Downstream classification accuracy for the ImageNet dataset using k-NN (k=20).

3.2 Performance of Listener's Vision module ϕvision - Downstream mean Average Precision for Pascal VOC.

4.1 Evaluation on SPair-71k. Per-class and average PCK@0.10 on the test split. Highest PCK among weakly supervised methods in bold, second highest underlined. Scores marked with (⋆) mean that the paper uses a fixed image from the test set as the canonical image. Our method is competitive against other weakly supervised approaches and often outperforms them.

4.2 CUB-200 and PF-Willow. PCK@0.10 for three CUB categories and four PF-Willow categories.

4.3 Ablation Study.
Average PCK@0.10 (for 3 and 4 categories, respectively) on the CUB and PF-Willow datasets.

5.1 Overview of baselines.

5.2 Reconstruction on test data. We measure the reconstruction performance of different techniques on the test dataset. [1] and [2] outperform the other methods in the majority of cases.

5.3 Shape reconstruction from a single viewpoint.

6.1 Evaluation of generated shapes in the Chair category. The best numbers are in bold, second-best are underlined.

6.2 Analogies. We demonstrate linguistic nuances being captured by our category embeddings by attempting word2vec [3] style analogies.

6.3 Quantitative Evaluations on COCO. Negative log-likelihood (NLL) of all the layouts in the validation set (lower is better). We use the importance sampling approach described in [4] to compute it. We also generate images from layouts using [5] and compute IS and FID. Following [6], we randomly split test set samples into 5 groups and report the standard deviation across the splits. The mean is reported using the combined test set.

6.4 Spatial distribution analysis for the samples generated using models trained on the RICO and PubLayNet datasets. The closer the Overlap and Coverage values are to real data, the better the performance. All values in the table are percentages (std in parentheses).

7.1 Comparison of lag-k autocorrelation and LZW encoding length for different orders. For autocorrelation, we consistently outperform both the universal and context-based SFC computation approaches at high values of k. For LZW encoding length, we measure the average size per frame in bytes as well as the relative improvement compared to the raster scan order for each of the datasets. We consistently outperform other scan orders in compression performance.

List of Figures

1.1 We propose to learn mid-level patches in images which occur repetitively across the data. We further show that these mid-level patches can cluster meaningfully and can also enable dense correspondence applications.

1.2 LayoutTransformer [7] can synthesize layouts in diverse natural as well as human-designed data domains such as documents, mobile app wireframes, natural scenes, or 3D objects in a sequential manner.

2.1 PatchVAE learns to encode repetitive parts across a dataset, by modeling their appearance and occurrence. (top) Given an image, the occurrence map of a particular part learned by PatchVAE is shown in the middle, capturing the head/beak of the birds. Samples of the same part from other images are shown on the right, indicating consistent appearance. (bottom) More examples of parts discovered by our PatchVAE framework.

2.2 (a) VAE Architecture: In a standard VAE architecture, the output of the encoder network is used to parameterize the variational posterior for z. Samples from this posterior are input to the decoder network.
(b) Proposed PatchVAE Architecture: Our encoder network computes a set of feature maps f using ϕ(x). This is followed by two independent single-layer networks. The bottom network generates part occurrence parameters QO. We combine QO with the output of the top network to generate part appearance parameters QA. We sample zocc and zapp to construct ẑ as described in Section 2.2.2, which is input to the decoder network. We also visualize the corresponding priors for the latents zapp and zocc in the dashed gray boxes.

2.3 Encoded part occurrence maps discovered on CIFAR100 and ImageNet. Each row represents a different part.

2.4 A few representative examples for several parts to qualitatively demonstrate the visual concepts captured by PatchVAE. For each part, we crop image patches centered on the part location where it is predicted to be present. Selected patches are sorted by part occurrence probability as score. We manually select a diverse set from the top-50 occurrences from the training images. As can be seen, a single part may capture a diverse set of concepts that are similar in shape or texture or occur in similar context, but belong to different categories. We show which categories the patches come from (note that category information was not used while training the model).

2.5 Swapping source and target part appearance. Columns 1, 2 show a source image with the occurrence map of one of the parts. We can swap the appearance vector of this part with appearance vectors of a different part in target images. Columns 3, 4 show three target images with occurrence maps of one of their parts. Observe the change in reconstructions (columns 5, 6) as we bring in the new appearance vector. The new reconstruction inherits properties of the source at specific locations in the target.

3.1 Overview of the Referential Game. We generate two random views of every image in the given batch of images. The speaker takes one of the views as the input and generates a sequence of symbols (or message). The listener takes the message sent by the speaker and the second view of the image, and projects both into an embedding space. Both the speaker and listener agents learn by minimizing the contrastive loss (see Equation (3.3)) between the views in this embedding space.

3.2 Speaker agent architecture. The upper branch represents the PatchSymbol (θsymb) module and the lower branch represents the PatchRank (θrank) module. Each of the modules takes the raw image as input. PatchSymbol computes a message for each patch of pre-defined size, using a fixed-size vocabulary and message length. PatchRank uses a ConvNet to compute weights or the importance of each patch. Note that the symbols for a patch are context independent; however, the importance of a patch depends on the context. The speaker agent combines the output of each of the modules and sends a variable-length message to the listener.

3.3 Heat maps based on Patch Importance. Red represents the most important patches and Blue the least important. Labels are listed for better understanding and are not used during training. We have used 224 × 224 images and 32 × 32 patches to get 49 non-overlapping patches per image.
The model successfully separates the primary object in an image - the "device" in the "Abacus" image, the "rabbit" in the "hare" image, etc. (as shown by the higher concentration of yellow-to-red patches on these areas). Some failure cases are shown on the right as well.

3.4 Impact of the number of patches used during evaluation in a pre-trained ViT [8]. The performance of a ViT drops if we provide fewer patches during inference. However, a PatchRank model (which is a small ResNet-9) can provide important patches to the ViT during inference with minimal loss in Top-1 accuracy.

3.5 Visualizing patches corresponding to various symbols. The number in the top row corresponds to 1 of 128 vocabulary ids. Six representative patches corresponding to each symbol are shown. The bottom row corresponds to our interpretation of what concept that symbol might be capturing.

3.6 Impact of various hyperparameters on the success of communication. Top-1 accuracy denotes the percentage of messages in a batch that the model was able to match to the corresponding image. Having a higher message length and vocabulary helps the model learn faster.

4.1 Globally consistent and dense alignments with ASIC. Given a small set (∼10-30) of images of an object or object category captured in-the-wild, our framework computes a dense and consistent mapping between all the images in a self-supervised manner. First row: Unaligned sets of images from the SAMURAI (Keywest) and SPair-71k (Cow) datasets. Second row: Dense correspondence maps produced by our method. Third row: Image in the first column warped to the images in columns 2-5.

4.2 ASIC Architecture. The alignment network Φalign predicts canonical space coordinates for all pixels of all input images. Images can be reconstructed using a differentiable warp from the canonical space. In order to align semantically similar pixels from different images to the same location in the canonical space, we propose two primary loss functions, LKP and LRecon. Please refer to Sec. 4.2.2 for more details.

4.3 Visualizing canonical space alignment. For each dataset, the top row shows sample images from the dataset (composed of 10-30 images each). The middle row shows part co-segmentations computed by DVD [9]. DVD computes a coarse, discrete set of parts across the dataset. The bottom row shows the continuous canonical space mapping computed by our method. Our canonical space mapping is smooth and consistent across the images for each dataset/collection.

4.4 Dense warping from a source image (top row) to a target image (second row). We warp all foreground pixels (highlighted by a red overlay in the source image). Our method produces dense and semantically more meaningful warps from the source to the target.

4.5 Image Set Correspondence. k-CyPCK at varying αbbox (higher is better). Our method outperforms the DVD baseline at both large and small values of αbbox.

4.6 Limitations. The top row shows that our model can map the left part of the object in the source to the right part of the object in the target when the object is symmetric. The bottom row shows that our model fails for very large out-of-plane rotations.
5.1 Architecture: Our Identity Encoder Network takes one or more depth maps of a 3D object as input and encodes each of them into latent vectors. We use the expected value of these latent vectors as the identity vector of the object. The decoder or viewpoint generator network uses this identity vector, category, and viewpoint as input to generate a depth map of the object. It consists of an MLP that maps the identity vector to a style vector. This style vector is added to each block in the Viewpoint Generator Network using adaptive instance normalization.

5.2 Reconstruction on test objects. We evaluate our model's ability to encode the geometry of 3D shapes with its latent representation. Each row represents 3D shapes from different categories. Odd columns are the input objects and even columns are the reconstructions. We render depth maps of input objects from different viewpoints. Our encoder encodes each of the depth maps independently and outputs a latent code. We average the latent code over all the viewpoints of the object. Finally, our generator uses this averaged latent code to generate multiple viewpoints. The final output depth maps are projected back to 3D. The last two columns show some of the failure cases, likely caused by a lack of similar samples in the training data.

5.3 Single View Reconstruction. Given a single input depth map, we show reconstructions of the complete 3D object using our approach. Odd columns are the input depth maps, even columns are the full 3D reconstructions viewed from a canonical viewpoint.

5.4 Shape Interpolation. The latent representations learned by our model are smooth. In this figure, we interpolate between two test objects (left-most and right-most columns) by generating the intermediate 3D objects from the spherical interpolation of the latent codes.

5.5 Word embeddings learned by our model for various object classes. We see that visually similar categories cluster together, even if they are not semantically similar. E.g., airplane, rocket, and vessel are together; mailbox and microwave are also together.

5.6 Two nearest neighbors (NN) to generated 3D models: our model produces 3D models that differ from the two closest training samples.

6.1 Our framework can synthesize layouts in diverse natural as well as human-designed data domains such as documents, mobile app wireframes, natural scenes, or 3D objects in a sequential manner.

6.2 The architecture depicted for a toy example. LayoutTransformer takes layout elements as input and predicts the next layout elements as output. During training, we use teacher forcing, i.e., use the ground-truth layout tokens as input to a multi-head decoder block. The first layer of this block is a masked self-attention layer, which allows the model to see only the previous elements in order to predict the current element. We pad each layout with a special ⟨bos⟩ token at the beginning and an ⟨eos⟩ token at the end.

6.3 Generated 3D objects. The top row shows input primitives to the model. The bottom row shows the layout obtained by our approach.
6.4 Generated layouts. The top row shows seed layouts input to the model. The bottom row shows the layouts obtained with nucleus sampling. We skip the 'stuff' bounding boxes for clarity.

6.5 Downstream task. Image generation with layouts [5]. FID and IS scores for the generated images are provided in Table 6.3.

6.6 TSNE plot of learned category embeddings. Words are colored by their supercategories provided in COCO. Observe that semantically similar categories cluster together. Cats and dogs are closer as compared to sheep, zebra, or cow.

6.7 Distribution of xy-coordinates of bounding box centers. The distributions for generated and real layouts are similar. The y-coordinate tends to be more informative (e.g., sky at the top, road and sea at the bottom).

6.8 RICO layouts. Generated layouts for the RICO dataset. We skip the categories of bounding boxes for the sake of clarity.

6.9 Multiple completions from the same initial element.

6.10 Document Layouts. Generated samples from LayoutVAE (top) and our method (bottom). Our method produces aligned bounding boxes for various elements.

7.1 Given a set of images, a gif, or a video, Neural Space-filling Curves (SFC) can provide a more spatially coherent scan order for images as compared to universal scan orders such as the S-curve or Peano-Hilbert curves. As shown in the examples of a trouser and a face, the scan line tends to cover the background before moving to the foreground. (SFCs generated here using half-resolution images and resized for clarity. Best viewed in color.)

7.2 Cover and Merge Algorithm. (a): An 8 × 8 image fully covered by 2 × 2 circuits. (b): The dual graph G′ (black) built on the covering circuits G (blue). (c): The Minimum Spanning Tree T (black solid lines) of G′ (all black lines) and the Hamiltonian Circuit (blue) induced by T. (d): A single Hamiltonian circuit merged from the covering circuits. See more details in Sec. 7.2.1.

7.3 Neural SFC pipeline. The Neural SFC model consists of a weight generator FG and a weight evaluator EG. Taking one or more images as input, FG extracts the deep features of the image and generates the SFC weights. EG takes both the image(s) and the corresponding SFC weights as input. EG then estimates the negative autocorrelation of the image pixel sequence inferred by the input weights to evaluate the goodness of the input weights with respect to the input image. See more details in Sec. 7.2.2.

7.4 Qualitative comparison between Hilbert curves and Neural SFCs. Left: SFC (in red) overlaid on the image. Right: Image flattened according to the SFC and visualized in 1-dimension. Images in the top two rows are from MNIST, the ones in the middle two rows are from Fashion-MNIST, and the ones in the bottom two rows are from FFHQ Faces. Neural SFCs on images from MNIST and Fashion-MNIST are class-conditional, i.e., computed for each class. Therefore, for MNIST and Fashion-MNIST, the Neural SFCs in the right two columns are the same since the two images have the same class label. In all datasets, Neural SFCs are more spatially coherent and produce fewer clusters when visualized in 1-dimension.
Best viewed in color.

7.5 We visualize Neural SFCs trained with the two objectives considered in this paper – autocorrelation and LZW encoding.

7.6 Lag-k autocorrelation for the MNIST, Fashion-MNIST, and FFHQ datasets. While Dafner et al. provide higher autocorrelation for small lags, i.e., from k = 2 to k = 4, Neural SFCs outperform Dafner et al. for k > 4 in all the datasets. Note that we trained our model for k = 6, and hence this behaviour is expected.

7.7 Scaling up SFC. The top row (blue circles) shows a 2 × 2 crop of an image grid of resolution n × n. The bottom row (red circles) shows how we can scale the SFC path from an n × n grid to a 2n × 2n grid. For every incoming SFC to a pixel location, there are 3 possible 'next' pixels: straight ahead, left, or right. Columns 2 and 3 consider the cases where the SFC goes 'straight' or 'right'. The image on the right shows a toy example of scaling up an SFC for a 5 × 5 image.

Chapter 1: Introduction

1.1 Overview

Humans (and even animals) [10] can innately imagine and recognize a large number of objects in diverse scenes by composing a few known objects and their attributes. However, this ability to synthesize and recognize infinitely many new combinations from finite concepts, called compositional generalization, remains an Achilles' heel of current data-driven approaches (e.g., deep networks) that assume robust labeled training data is available for exhaustive objects and their properties [11].

In order to harness computational approaches for building real-world applications, it is crucial for them to understand and perform composition. George Boole, also known as the father of modern Boolean algebra, formally defined the principle of compositionality as follows:

"The meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them."

The majority of this thesis involves abstracting and solving these two challenges in compositional learning:

1. Discovering discrete primitives from data (images and videos) with few or no labels, which can be composed for various downstream applications such as segmentation, correspondence, or image retrieval.

2. Learning the rules to combine visual primitives meaningfully to generate rich and diverse compositions of the input data (images and 3D).

Being able to understand and compose 2D and 3D content (both static and dynamic) will unlock the next generation of tools empowered by Machine Learning. It will allow artificial agents to understand the visual world around them and perform repetitive or dangerous tasks, reducing human time and effort by orders of magnitude. Tools to create content will enable the creation of large, labeled, highly realistic synthetic datasets. They will also be essential for entertainment (movies, sports, gaming) companies, as well as have applications in robotics, healthcare, education, and e-commerce.

1.2 Discovering Primitives from Data

In order to better understand the guiding principles behind the generative process of data, it is imperative to build representations that can discover the simple parts or primitives that compose the data. In images, however, representations learned by current popular unsupervised and self-supervised machine learning systems follow a distributed paradigm.
This means that an image is usually represented using a high-dimensional continuous vector in which a single dimension can contribute to multiple concepts, and multiple dimensions can contribute to a single concept. In this work, we propose a new paradigm to represent images in the form of discrete primitives that appear in images and can be composed to either reconstruct an image or to differentiate an image from other images. We discuss two foundational works, PatchVAE in Chapter 2 and PatchGame in Chapter 3.

Figure 1.1: We propose to learn mid-level patches in images which occur repetitively across the data. We further show that these mid-level patches can cluster meaningfully and can also enable dense correspondence applications.

PatchVAE [12] attempts to model images by representing them as a combination of part appearance maps and part visibility maps, which, combined together, allow us to reconstruct the original images. These parts are meaningful across a collection of images and enable image editing applications by simply replacing the discrete symbols of one image with those of another. PatchGame [13] follows a contrastive approach instead. In this work, we play a cooperative referential game between two neural network agents, a speaker and a listener, who can communicate with each other only via a discrete bottleneck. The goal for the listener is to match a discrete message sent by the speaker to the corresponding image. I show the applications of this discrete bottleneck in downstream tasks such as detection, retrieval, and part segmentation. Further, in Chapter 4, ASIC [14] takes these discrete primitives a step further by learning a dense correspondence for a set of images, as shown in Fig. 1.1.

Going beyond images, the principle of compositionality can also be applied to represent and train fast, efficient neural networks. In LilNetX [15, 16], we learn a set of discrete weights sampled from a codebook to provide a replacement for the weights of standard CNNs such as ResNets or MobileNets. These discrete weights, combined with entropy regularization, allow for highly quantized and sparse neural networks that can be used for tasks such as classification or detection. An extension of this work (under review) also shows the application of our entropy regularization technique to compress Neural Fields. While in this dissertation we restrict ourselves to composition for visual data, we encourage readers to check the above-mentioned works on model compression.

1.3 Learning to Compose Primitives

The second part of this dissertation focuses on the process of synthesizing or learning a meaningful arrangement of some known concepts to give rise to novel scenes with desired characteristics.

In Chapter 5, our primitives are depth maps from different camera viewpoints. By synthesizing and composing these multi-view depth maps, we propose a new way of generating 3D surfaces. In Chapter 6, we consider the primitives to be bounding boxes or cuboids for 2D and 3D scenes. This arrangement of concepts or primitives in a scene can be referred to as the layout of the scene, and it can be characterized by contextual relationships (e.g., support, occlusion, and relative likelihood, position, and size) between objects comprising the scene. In this work,
dubbed LayoutTransformer [7], we propose an approach to autoregressively generate 2D and 3D layouts from scratch or from an initial seed set of primitives, which can easily scale to support an arbitrary number of primitives per layout. A shortcoming of this approach is that it can only generate layout elements in a raster-scan order. We address this shortcoming in Neural SFC [17], discussed in Chapter 7, where we learn a data-driven scan order for images.

Part I: Discovering Primitives from Data

Chapter 2: Learning Local Latent Codes for Recognition

Due to the availability of large labeled visual datasets, supervised learning has become the dominant paradigm for visual recognition. That is, to learn about any new concept, the modus operandi is to collect thousands of labeled examples for that concept and train a powerful classifier, such as a deep neural network. This is necessary because the current generation of models based on deep neural networks requires large amounts of labeled data [18]. This is in stark contrast to the insights that we have from developmental psychology on how infants develop perception and cognition without any explicit supervision [19]. Moreover, the supervised learning paradigm is ill-suited for applications, such as health care and robotics, where annotated data is hard to obtain either due to privacy concerns or the high cost of expert human annotators. In such cases, learning from very few labeled images or discovering underlying natural patterns in large amounts of unlabeled data can have a large number of potential applications. Discovering such patterns from unlabeled data is the standard setup of unsupervised learning.

Over the past few years, the field of unsupervised learning in computer vision has followed two seemingly different tracks with different goals: generative modeling and self-supervised learning. The goal of generative modeling is to learn the probability distribution from which data was generated, given some training data. Such models, learned using reconstruction-based losses, can draw samples from the same distribution or evaluate the likelihoods of new data,
Due to the mismatch in goals of un- supervised learning for visual recognition and the representations learned from generative mod- eling, self-supervised learning is a more popular way of learning representations from unlabeled data. However, fundamental limitation of this self-supervised paradigm is that we need to define a proxy-task that can mimic the desired recognition task. It is not possible to always establish such a task, nor are these tasks generalizable across recognition tasks. In this paper, our goal is to enable the unsupervised generative modeling approach of VAEs 8 to learn representations useful for recognition. Our key hypothesis is that for a representation to be useful, it should capture just the interesting parts of the images, as opposed to everything in the images. What constitutes an interesting image part has been defined and studied in earlier works that pre-date the end-to-end trained deep network methods [20, 21, 22]. Taking inspiration from these works, we propose a novel representation that only encodes few such parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images. By avoiding reconstruction of the entire image our method can focus on regions that are repeating and con- sistent across many images. In an encoder-decoder based generative model, we constrain the encoder architecture to learn such repetitive parts – both in terms of representations for appear- ance of these parts (or patches in an image) and where these parts occur. We formulate this using variational auto-encoder (β-VAEs) [23, 24], where we impose novel structure on the latent rep- resentations. We use discrete latents to model part presence or absence and continuous latents to model their appearance. Figure 2.1 shows an example of the discrete latents or occurrence map, and example parts discovered by our approach, PatchVAE. We present PatchVAE in Section 2.2 and demonstrate that it learns representations that are much better for recognition as compared to those learned by the standard β-VAEs [23, 24]. In addition, we present losses that favor foreground, which is more likely to contain repet- itive patterns, in Section 2.2.4, and demonstrate that they result in representations that are much better at recognition. Finally, in Section 2.3, we present results on CIFAR100 [25], MIT Indoor Scene Recognition [26], Places [27], and ImageNet [28] datasets. To summarize, our contribu- tions are as follows: 9 • We propose a novel patch-based bottleneck in the VAE framework that learns representations that can encode repetitive parts across images. • We demonstrate that our method, PatchVAE, learns unsupervised representations that are better suited for recognition in comparison to traditional VAEs. • We show that losses that favor foreground are better for unsupervised representation learning for recognition. • We perform extensive ablation analysis of the proposed PatchVAE architecture. 2.1 Related Work Due to its potential impact, unsupervised learning (particularly for deep networks) is one of the most researched topics in visual recognition over the past few years. Generative models such as VAEs [23, 24, 29, 30], PixelRNN [31], PixelCNN [32, 33], and their variants have proven effective when it comes to learning compressed representation of images while being able to faithfully reconstruct them as well as draw samples from the data distribution. 
GANs [34, 35, 36, 37], on the other hand, while they don't model the probability density explicitly, can still produce high-quality image samples from noise. There has been work combining VAEs and GANs to simultaneously learn the image data distribution while being able to generate high-quality samples from it [38, 39, 40]. Convolutional sparse coding [41] is an alternative approach for reconstruction or image in-painting problems. Our work complements existing generative frameworks in that we provide a structured approach for VAEs that can learn beyond low-level representations. We show the effectiveness of the representations learned by our model by using them for visual recognition tasks.

There has been a lot of work in interpreting or disentangling representations learned using generative models such as VAEs [24, 42, 43]. However, there is little evidence of the effectiveness of disentangled representations in visual recognition. Semi-supervised learning using generative models [44, 45], where partial or noisy labels are available to the model during training, has shown a lot of promise for generating conditioned samples from the model. In our work, however, we focus on incorporating inductive biases in these generative models (e.g., VAEs) so they can learn representations better suited for visual recognition.

A related, but orthogonal, line of work is self-supervised learning, where a proxy task is designed to learn representations useful for recognition. These proxy tasks vary from simple tasks like arranging patches in an image in the correct spatial order [46, 47] and arranging frames from a video in the correct temporal order [48, 49], to more involved tasks like in-painting [50] and context prediction [51, 52]. We follow the best practices from this line of work for evaluating the learned representations.

2.2 Our Approach

Our work builds upon the VAE framework proposed by Kingma and Welling [23]. We briefly review relevant aspects of the VAE framework and then present our approach.

Figure 2.2: (a) VAE Architecture: In a standard VAE architecture, the output of the encoder network is used to parameterize the variational posterior for z. Samples from this posterior are input to the decoder network. (b) Proposed PatchVAE Architecture: Our encoder network computes a set of feature maps f using ϕ(x). This is followed by two independent single-layer networks. The bottom network generates part occurrence parameters QO. We combine QO with the output of the top network to generate part appearance parameters QA. We sample zocc and zapp to construct ẑ as described in Section 2.2.2, which is input to the decoder network. We also visualize the corresponding priors for the latents zapp and zocc in the dashed gray boxes.

2.2.1 VAE Review

The standard VAE framework assumes a generative model for data where first a latent z is sampled from a prior p(z) and then the data is generated from a conditional distribution G(x|z). A variational approximation Q(z|x) to the true intractable posterior is introduced, and the model
is learned by minimizing the following negative variational lower bound (ELBO):

$$\mathcal{L}_{\text{VAE}}(x) = -\mathbb{E}_{z \sim Q(z|x)}\left[\log G(x|z)\right] + D_{\text{KL}}\left[Q(z|x) \,\|\, p(z)\right] \quad (2.1)$$

where Q(z|x) is often referred to as an encoder, as it can be viewed as mapping data to the latent space, while G(x|z) is referred to as a decoder (or generator) that can be viewed as mapping latents to the data space. Both Q and G are commonly parameterized as neural networks. Figure 2.2a shows the commonly used VAE architecture. If the conditional G(x|z) takes a Gaussian form, the negative log-likelihood in the first term on the RHS of Equation (2.1) becomes the mean squared error between the generator output x̂ = G(x|z) and the input data x. In the second term, the prior p(z) is assumed to be a multivariate normal distribution with zero mean and identity covariance, N(0, I), and the loss simplifies to

$$\mathcal{L}_{\text{VAE}}(x) = \|x - \hat{x}\|^2 + D_{\text{KL}}\left[Q(z|x) \,\|\, \mathcal{N}(0, I)\right] \quad (2.2)$$

When G and Q are differentiable, the entire model can be trained with SGD using the reparameterization trick [23]. Matthey et al. [24] propose an extension for learning disentangled representations by incorporating a weight factor β for the KL divergence term, yielding

$$\mathcal{L}_{\beta\text{-VAE}}(x) = \|x - \hat{x}\|^2 + \beta D_{\text{KL}}\left[Q(z|x) \,\|\, \mathcal{N}(0, I)\right] \quad (2.3)$$

The VAE framework aims to learn a generative model for images where the latents z represent the corresponding low-dimensional generating factors. The latents z can therefore be treated as image representations that capture the necessary details about images. However, we postulate that representations produced by the standard VAE framework are not ideal for recognition, as they are learned to capture all details rather than capturing 'interesting' aspects of the data and dropping the rest. This is not surprising since their formulation does not encourage learning semantic information. For learning semantic representations, in the absence of any relevant supervision (as is available in self-supervised approaches), inductive biases have to be introduced. Therefore, taking inspiration from works on unsupervised mid-level pattern discovery [20, 21, 22], we propose a formulation that encourages the encoder to only encode the few parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images.

Since the VAE framework provides a principled way of learning a mapping from image to latent space, we consider it ideal for our proposed extension. We choose β-VAEs for their simplicity and widespread use. In Section 2.2.2, we describe our approach in detail, and in Section 2.2.4 we propose a modification of the reconstruction error computation that biases the error term towards foreground high-energy regions (similar to the biased initial sampling of patterns in Singh et al. [20]).

2.2.2 PatchVAE

Given an image x, let f = ϕ(x) be a deterministic mapping that produces a 3D representation f of size h × w × d_e, with a total of L = h × w locations (grid cells). We aim to encourage the encoder network to only encode parts of an image that correspond to highly repetitive patches. For example, a random patch of noise is unlikely to occur frequently, whereas patterns like faces, wheels, windows, etc. repeat across multiple images. In order to capture this intuition, we force the representation f to be useful for predicting frequently occurring parts in an image, and we use just these predicted parts to reconstruct the image. We achieve this by transforming f to ẑ, which encodes a set of parts at a small subset of the L locations on the grid cells. We refer to ẑ as "patch latent codes" for an image.
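Before describing how this bottleneck is learned, the following is a minimal PyTorch-style sketch of the β-VAE objective in Equation (2.3) that PatchVAE builds on; the Gaussian posterior parameterization and the reduction choices are illustrative assumptions rather than the exact implementation used in this chapter.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """beta-VAE objective of Equation (2.3).

    x, x_hat  : (B, C, H, W) input images and reconstructions
    mu, logvar: (B, D) parameters of the Gaussian posterior Q(z|x)
    """
    # L2 reconstruction term ||x - x_hat||^2, summed over pixels, averaged over the batch
    recon = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
    return recon + beta * kl
```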
Next we describe how we re-tool the β-VAE framework to learn these local latent codes. We first describe our setup for a single part and follow it up with a generalization to multiple parts (Section 2.2.3).

Image Encoding. Given the image representation f = ϕ(x), we want to learn part representations at each grid location l (where l ∈ {1, . . . , L}). A part is parameterized by its appearance z_app and its occurrence z_occ^l (i.e., the presence or absence of the part at grid location l). We use two networks, Q_f^A and Q_f^O, to parameterize the posterior distributions Q_f^A(z_app | f) and Q_f^O(z_occ^l | f) of the part parameters z_app and z_occ^l respectively. Since the mapping f = ϕ(x) is deterministic, we can re-write these distributions as Q_f^A(z_app | ϕ(x)) and Q_f^O(z_occ^l | ϕ(x)), or simply Q^A(z_app | x) and Q^O(z_occ^l | x). Therefore, given an image x, the encoder networks estimate the posteriors Q^A(z_app | x) and Q^O(z_occ^l | x). Note that f is a deterministic feature map, whereas z_app and z_occ^l are stochastic.

Image Decoding. We utilize a generator or decoder network G that, given z_occ and z_app, reconstructs the image. First, we sample a part appearance ẑ_app (d_p-dimensional, continuous) and then sample part occurrences ẑ_occ^l (L-dimensional, binary), one for each location l, from the posteriors

$$\hat{z}_{\text{app}} \sim Q^A(z_{\text{app}} \mid x), \qquad \hat{z}_{\text{occ}}^l \sim Q^O(z_{\text{occ}}^l \mid x), \quad \text{where } l \in \{1, \dots, L\} \quad (2.4)$$

Next, we construct a 3D representation ẑ by placing ẑ_app at every location l where the part is present (i.e., ẑ_occ^l = 1). This can be implemented by a broadcasted product of ẑ_app and ẑ_occ. We refer to ẑ as the patch latent code. Again note that f is deterministic and ẑ is stochastic. Finally, a deconvolutional network takes ẑ as input and generates an image x̂. This image generation process can be written as

$$\hat{x} \sim G\left(x \mid z_{\text{occ}}^1, z_{\text{occ}}^2, \dots, z_{\text{occ}}^L, z_{\text{app}}\right) \quad (2.5)$$

Since all latent variables (z_occ^l for all l, and z_app) are independent of each other, they can be stacked as

$$z_p = \left[z_{\text{occ}}^1;\, z_{\text{occ}}^2;\, \dots;\, z_{\text{occ}}^L;\, z_{\text{app}}\right]. \quad (2.6)$$

This enables us to use a simplified notation (refer to Equation (2.4) and Equation (2.5)):

$$\hat{z}_p \sim Q^{\{A,O\}}(z_p \mid x), \qquad \hat{x} \sim G(x \mid z_p) \quad (2.7)$$

Note that despite the additional structure, our model still resembles the setup of variational auto-encoders. The primary difference arises from: (1) the use of discrete latents for part occurrence, (2) a patch-based bottleneck imposing additional structure on the latents, and (3) feature assembly for the generator.

Training. We use the training setup of β-VAE and maximize the variational lower bound to train the encoder and decoder jointly (described in Section 2.2.1). The posterior Q^A, which captures the appearance of a part, is assumed to be a Normal distribution with zero mean and identity covariance, N(0, I). The posterior Q^O, which captures the presence or absence of a part, is assumed to be a Bernoulli distribution Bern(z_occ^prior) with prior z_occ^prior. Therefore, the ELBO for our approach can be written as (refer to Equation (2.3)):

$$\mathcal{L}_{\text{PatchVAE}}(x) = -\mathbb{E}_{z_p \sim Q^{\{A,O\}}(z_p \mid x)}\left[\log G(x \mid z_p)\right] + \beta D_{\text{KL}}\left[Q^{\{A,O\}}(z_p \mid x) \,\|\, p(z_p)\right] \quad (2.8)$$

where the D_KL term can be expanded as:

$$D_{\text{KL}}\left[Q^{\{A,O\}}(z_p \mid x) \,\|\, p(z_p)\right] = \beta_{\text{occ}} \sum_{l=1}^{L} D_{\text{KL}}\left(Q^O(z_{\text{occ}}^l \mid x) \,\|\, \text{Bern}(z_{\text{occ}}^{\text{prior}})\right) + \beta_{\text{app}}\, D_{\text{KL}}\left(Q^A(z_{\text{app}} \mid x) \,\|\, \mathcal{N}(0, I)\right) \quad (2.9)$$

Implementation details. As discussed in Section 2.2.1, the first and second terms on the RHS of Equation (2.8) can be trained using the L2 reconstruction loss and the reparameterization trick [23]. In addition, we also need to compute a KL divergence loss for part occurrence.
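A minimal sketch of how the objective in Equations (2.8)-(2.9) can be assembled from its three terms is shown below; it assumes a Gaussian appearance posterior and a Bernoulli occurrence posterior as stated above, and the function signature and weighting defaults are illustrative assumptions, not the exact implementation.

```python
import torch
from torch.distributions import Bernoulli, Normal, kl_divergence

def patchvae_loss(x, x_hat, app_mu, app_logvar, occ_probs,
                  z_occ_prior=0.1, beta_occ=1.0, beta_app=1.0):
    """Single-part PatchVAE objective (Eqs. 2.8-2.9).

    x, x_hat          : (B, C, H, W) images and reconstructions
    app_mu, app_logvar: (B, d_p) Gaussian posterior Q_A(z_app | x)
    occ_probs         : (B, 1, h, w) Bernoulli posterior Q_O(z_occ^l | x)
    """
    # Reconstruction term (L2), summed over pixels, averaged over the batch
    recon = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()

    # KL of the appearance posterior against N(0, I)
    q_app = Normal(app_mu, torch.exp(0.5 * app_logvar))
    p_app = Normal(torch.zeros_like(app_mu), torch.ones_like(app_mu))
    kl_app = kl_divergence(q_app, p_app).sum(dim=1).mean()

    # KL of the per-location occurrence posterior against Bern(z_occ_prior)
    q_occ = Bernoulli(probs=occ_probs)
    p_occ = Bernoulli(probs=torch.full_like(occ_probs, z_occ_prior))
    kl_occ = kl_divergence(q_occ, p_occ).sum(dim=(1, 2, 3)).mean()

    return recon + beta_occ * kl_occ + beta_app * kl_app
```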
Learning a discrete probability distribution is challenging, since no gradient is defined for backpropagating the reconstruction loss through the stochastic layer at the decoder, even when using the reparameterization trick. Therefore, we use the relaxed-Bernoulli approximation [53, 54] for training the part occurrence distributions z^l_occ.

For an H × W image, the network φ(x) first generates feature maps of size h × w × d_e, where (h, w) are the spatial dimensions and d_e is the number of channels; therefore, the number of locations is L = h × w. The encoders Q^A_f(z_app | f) and Q^O_f(z^l_occ | f) are single-layer neural networks that compute z_app and z^l_occ. z^l_occ is an (h × w × 1)-dimensional multivariate Bernoulli parameter and z_app is a (1 × 1 × d_p)-dimensional multivariate Gaussian, where d_p is the length of the latent vector for a single part. The input to the decoder, ẑ, is (h × w × d_p)-dimensional. In all experiments, we fix h = H/8 and w = W/8.

Constructing z_app. Notice that f is an (h × w × d_e)-dimensional feature map and z^l_occ is an (h × w × 1)-dimensional binary output, but z_app is a (1 × 1 × d_p)-dimensional feature vector. If Σ_l z^l_occ > 1, the part occurs at multiple locations in an image. Since all these locations correspond to the same part, their appearance should be the same. To incorporate this, we take the weighted average of the part appearance features at each location, weighted by the probability that the part is present there. Since we use probability values for averaging, the result is deterministic. This operation is encapsulated by the Q^A encoder (refer to Figure 2.2b). During image generation, we sample ẑ_app once and replicate it at each location where ẑ^l_occ = 1. During training, this forces the model to: (1) only predict ẑ^l_occ = 1 where similar-looking parts occur, and (2) learn a common representation for the part that occurs at these locations. Note that z_app can be modeled as a mixture of distributions (e.g., a mixture of Gaussians) to capture complicated appearances. However, in this work we assume that convolutional neural network based encoders are powerful enough to map the variable appearance of semantic concepts to similar feature representations, so we restrict ourselves to a single Gaussian distribution.

2.2.3 PatchVAE with multiple parts

Next, we extend the framework described above to use multiple parts. To use N parts, we use N × 2 encoder networks Q^{A(i)}(z^(i)_app | x) and Q^{O(i)}(z^l(i)_occ | x), where z^(i)_app and z^l(i)_occ parameterize the i-th part. Again, this can be implemented efficiently as two networks by concatenating the outputs together. The image generator samples ẑ^(i)_app and ẑ^l(i)_occ from the outputs of these encoder networks and constructs ẑ^(i). We obtain the final patch latent code ẑ by concatenating all ẑ^(i) along the channel dimension. Therefore, ẑ^(i) is an (h × w × d_p)-dimensional and ẑ an (h × w × (N × d_p))-dimensional stochastic feature map. For this multiple-part case, Equation (2.6) can be written as:

z_P = [z^(1)_p; z^(2)_p; …; z^(N)_p]  where  z^(i)_p = [z^1(i)_occ; z^2(i)_occ; …; z^L(i)_occ; z^(i)_app].    (2.10)

Similarly, Equation (2.8) and Equation (2.9) can be written as:

L_MultiPatchVAE(x) = −E_{z_P}[log G(x | z_P)] + β_occ Σ_{i=1}^{N} Σ_{l=1}^{L} D_KL(Q^{O(i)}(z^l(i)_occ | x) ∥ Bern(z^prior_occ)) + β_app Σ_{i=1}^{N} D_KL(Q^{A(i)}(z^(i)_app | x) ∥ N(0, I))    (2.11)

The training details and assumptions on the posteriors follow the previous section.
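The two ingredients above, probability-weighted averaging of appearance features and relaxed-Bernoulli sampling of occurrences, could look roughly as follows in PyTorch; the tensor shapes and helper names are assumptions made for illustration only.

```python
import torch

def pooled_appearance(app_map, occ_prob):
    """Probability-weighted average of per-location appearance features.

    app_map : (B, d_p, h, w) per-location appearance features for one part
    occ_prob: (B, 1, h, w) occurrence probabilities for that part
    Returns a single (B, d_p) appearance vector per image, as described
    in 'Constructing z_app'.
    """
    w = occ_prob / (occ_prob.sum(dim=(2, 3), keepdim=True) + 1e-8)
    return (app_map * w).sum(dim=(2, 3))

def relaxed_bernoulli_sample(logits, temperature=1.0):
    """Relaxed-Bernoulli (binary Gumbel-softmax) sample for z_occ [53, 54]."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic_noise) / temperature)
```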
2.2.4 Improved Reconstruction Loss

The L2 reconstruction loss used for training β-VAEs (and other reconstruction-based approaches) gives equal importance to every region of an image. This might be reasonable for tasks like image compression and image denoising. However, for the purpose of learning semantic representations, not all regions are equally important. For example, "sky" and "walls" occupy large portions of an image, whereas concepts like "windows," "wheels," and "faces" are comparatively smaller, but arguably more important. To incorporate this intuition, we use a simple strategy that weighs the regions of an image in proportion to their gradient energy. More concretely, we compute the Laplacian of an image to get the per-pixel gradient intensity and average the gradient magnitudes over 8 × 8 local patches. The weight multiplier for the reconstruction loss of each 8 × 8 patch is proportional to the average gradient magnitude of the patch, and all weights are normalized to sum to one. We refer to this as the weighted loss (L_w). Note that this is similar to the gradient-energy-biased sampling of mid-level patches used in [20, 21]. Examples of weight masks are provided in the supplemental material.

In addition, we also consider an adversarial training strategy from GANs to train VAEs [40], where the discriminator network from the GAN implicitly learns to compare images and provides a more abstract reconstruction error for the VAE. We refer to this variant with the 'GAN' suffix in experiments. In Section 2.3.2, we demonstrate that the proposed weighted loss (L_w) is complementary to the discriminator loss from adversarial training, and that together these losses result in better recognition capabilities for both β-VAE and PatchVAE.
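A minimal sketch of how such a gradient-energy weight mask could be computed is shown below; the Laplacian kernel and 8 × 8 patch size follow the text, but the exact normalization and grayscale conversion are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_energy_weights(x, patch=8):
    """Per-patch weights for the weighted reconstruction loss L_w (Sec. 2.2.4).

    x: (B, C, H, W) image batch.
    Returns a (B, 1, H, W) weight map that is constant within each 8x8 patch
    and normalized per image.
    """
    gray = x.mean(dim=1, keepdim=True)                      # (B, 1, H, W)
    lap_kernel = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]], device=x.device).view(1, 1, 3, 3)
    grad = F.conv2d(gray, lap_kernel, padding=1).abs()      # per-pixel gradient energy
    per_patch = F.avg_pool2d(grad, patch)                   # average over 8x8 patches
    weights = per_patch / per_patch.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8)
    return F.interpolate(weights, scale_factor=patch, mode="nearest")

def weighted_l2(x, x_hat):
    """Reconstruction error re-weighted by gradient energy."""
    w = gradient_energy_weights(x)
    return ((x - x_hat) ** 2 * w).sum(dim=(1, 2, 3)).mean()
```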
2.3 Experiments

Datasets. We evaluate PatchVAE on the CIFAR100 [25], MIT Indoor Scene Recognition [26], Places [27], and ImageNet [28] datasets. CIFAR100 consists of 60k 32 × 32 color images from 100 classes, with 600 images per class; there are 50000 training images and 10000 test images. The Indoor dataset contains 67 categories and a total of 15620 images; the train and test subsets consist of 80 and 20 images per class respectively. The Places dataset has 2.5 million images from 205 categories. The ImageNet dataset has over a million images from 1000 categories.

Learning paradigm. In order to evaluate the utility of PatchVAE features for recognition, we set up the learning paradigm as follows: we first train the model in an unsupervised manner on all training images. After that, we discard the generator network and use only part of the encoder network φ(x) to train a supervised model on the classification task of the respective dataset. We study different training strategies for the classification stage, as discussed later.

Training details. In all experiments, we use the following architectures. For CIFAR100, Indoor67, and Places205, φ(x) has a conv layer followed by two residual blocks [55]. For ImageNet, φ(x) is a ResNet18 model (a conv layer followed by four residual blocks). For all datasets, Q^A and Q^O have a single conv layer each. For classification, we start from φ(x) and add a fully-connected layer with 512 hidden units followed by a final fully-connected layer as the classifier. More details can be found in the supplemental material. During the unsupervised learning phase, all methods are trained for 90 epochs for CIFAR100 and Indoor67, 2 epochs for Places205, and 30 epochs for ImageNet. All methods use the ADAM optimizer with an initial learning rate of 1 × 10⁻⁴ and a minibatch size of 128. For the relaxed Bernoulli in Q^O, we start with a temperature of 1.0 and an annealing rate of 3 × 10⁻⁵ (following the details in [54]). For training the classifier, all methods use stochastic gradient descent (SGD) with momentum and a minibatch size of 128; the initial learning rate is 1 × 10⁻² and we reduce it by a factor of 10 every 30 epochs. The classifiers are trained for 90 epochs for CIFAR100 and Indoor67, 5 epochs for Places205, and 30 epochs for ImageNet.

Table 2.1: Classification results on CIFAR100, Indoor67, and Places205. We initialize the classification model with the representations φ(x) learned from the unsupervised learning task. The model φ(x) comprises a conv layer followed by two residual blocks (each having 2 conv layers). The 'Conv1' column reports Top-1 classification accuracy with the first conv layer of the pre-trained model frozen; the 'Conv[1-3]' and 'Conv[1-5]' columns correspond to results with the first three and first five conv layers frozen respectively. Details in Section 2.3.1.

                       CIFAR100                        Indoor67                        Places205
Model                  Conv1   Conv[1-3]  Conv[1-5]   Conv1   Conv[1-3]  Conv[1-5]   Conv1   Conv[1-3]  Conv[1-5]
β-VAE                  44.12   39.65      28.57       20.08   17.76      13.06       28.29   24.34      8.89
β-VAE + Lw             44.96   40.30      28.33       21.34   19.48      13.96       29.43   24.93      9.41
β-VAE-GAN              44.69   40.13      29.89       19.10   17.84      13.06       28.48   24.51      9.72
β-VAE-GAN + Lw         45.61   41.35      31.53       20.45   18.36      14.33       29.63   25.26      10.66
PatchVAE               43.07   38.58      28.72       20.97   19.18      13.43       28.63   24.95      11.09
PatchVAE + Lw          43.75   40.37      30.55       23.21   21.87      15.45       29.39   26.29      12.07
PatchVAE-GAN           44.45   40.57      31.74       21.12   19.63      14.55       28.87   25.25      12.21
PatchVAE-GAN + Lw      45.39   41.74      32.65       22.46   21.87      16.42       29.36   26.30      13.39
BiGAN                  47.72   41.89      31.58       21.64   17.09      9.70        30.06   25.11      10.82
ImageNet Pretrained    55.99   54.99      54.36       45.90   45.82      40.90       37.08   36.46      31.26

2.3.1 Downstream classification performance

In Table 2.1, we report the top-1 classification results on the CIFAR100, Indoor67, and Places205 datasets for all methods with different training strategies for classification. First, we keep all the pre-trained weights in φ(x) from the unsupervised task frozen and only train the two newly added conv layers in the classification network (reported under the column 'Conv[1-5]'). We notice that our method (with different losses) generally outperforms the β-VAE counterpart by a healthy margin. This shows that the representations learned by the PatchVAE framework are better for recognition than those learned by β-VAEs. Moreover, better reconstruction losses ('GAN' and L_w) generally improve both β-VAE and PatchVAE, and are complementary to each other.

Next, we fine-tune the last residual block along with the two conv layers ('Conv[1-3]' column). We observe that PatchVAE performs better than the VAE under all settings except for CIFAR100 with just the L2 loss. However, when using better reconstruction losses, the performance of PatchVAE improves over β-VAE. Similarly, we fine-tune all but the first conv layer and report the results in the 'Conv1' column.
Again, we notice similar trends: our method generally performs better than β-VAE on the Indoor67 and Places205 datasets, while β-VAE performs better on CIFAR100 by a small margin. When compared to BiGAN, PatchVAE representations are better on all datasets ('Conv[1-5]') by a large margin. However, when fine-tuning the pre-trained weights, BiGAN performs better on two out of four datasets. For completeness, we also report results using weights in φ(x) pre-trained on the supervised ImageNet classification task (last row, Table 2.1). Overall, the results indicate that PatchVAE learns better semantic representations than β-VAE.

ImageNet results. Finally, we report results on the large-scale ImageNet benchmark in Table 2.2. For these experiments, we use the ResNet18 [55] architecture for all methods. All weights are first learned using the unsupervised task. Then, we fine-tune the last two residual blocks and train the two newly added conv layers in the classification network (therefore, the first conv layer and the following two residual blocks are frozen). We notice that the PatchVAE framework outperforms β-VAE under all settings, and the proposed weighted loss helps both approaches. The last row in Table 2.2 reports classification results of the same architecture randomly initialized and trained end-to-end on ImageNet using supervised training, for comparison.

Table 2.2: ImageNet classification results using ResNet18. We initialize weights using the unsupervised task and fine-tune the last two residual blocks. Details in Section 2.3.1.

Model                 Top-1 Acc.   Top-5 Acc.
β-VAE                 44.45        69.67
PatchVAE              47.01        71.71
β-VAE + Lw            47.28        71.78
PatchVAE + Lw         47.87        72.49
ImageNet Supervised   61.37        83.79

Baselines. We use the β-VAE model (Section 2.2.1) as our primary baseline. In addition, we use the weighted loss and the discriminator loss, resulting in the β-VAE-* family of baselines. We also compare against the BiGAN model from [39]. We use similar backbone architectures for the encoder/decoder (and discriminator, if present) across all methods, and tried to keep the number of parameters in the different approaches comparable to the best of our ability. Exact architecture details can be found in the supplemental material.

2.3.2 Qualitative Results

We present qualitative results to validate our hypothesis. First, we visualize whether the structure we impose on the VAE bottleneck is able to capture the occurrence and appearance of important parts of images. We visualize the PatchVAE trained on images from the CIFAR100 and ImageNet datasets in the following ways.

[Figure 2.3 panels: CIFAR100 and ImageNet images alongside their encoded occurrence maps.]
Figure 2.3: Encoded part occurrence maps discovered on CIFAR100 and ImageNet. Each row represents a different part.

[Figure 2.4 panels: Part 2 (spaghetti squash, butternut squash, orange, lemon); Part 55 (killer whale, coral reef, scuba diver, grocery store); Parts 20, 31, 0 (radiator grille, ambulance); Parts 36, 17 (jack-o'-lantern, Christmas stocking, orange).]
Figure 2.4: A few representative examples for several parts, qualitatively demonstrating the visual concepts captured by PatchVAE. For each part, we crop image patches centered on the location where the part is predicted to be present. Selected patches are sorted by part occurrence probability as the score, and we manually select a diverse set from the top-50 occurrences in the training images. As can be seen, a single part may capture a diverse set of concepts that are similar in shape or texture or occur in a similar context, but belong to different categories. We show which categories the patches come from (note that category information was not used while training the model).

[Figure 2.5 columns: source image and part occurrence map, target image and part occurrence map, reconstruction, swapped reconstruction.]
Figure 2.5: Swapping source and target part appearance. Columns 1 and 2 show a source image with the occurrence map of one of its parts.
We can swap the appearance vector of this part with the appearance vectors of a different part in target images. Columns 3 and 4 show three target images with occurrence maps of one of their parts. Observe the change in the reconstructions (columns 5 and 6) as we bring in the new appearance vector. The new reconstruction inherits properties of the source at specific locations in the target.

Concepts captured. First, we visualize the part occurrences in Figure 2.3. We can see that the parts capture round (fruit-like) shapes in the top row and faces in the second row, regardless of the class of the image. Similarly, for ImageNet, the occurrence map of a specific part in images of chickens focuses on the head and neck. Note that these parts are semantically more informative than the texture or color that a β-VAE can capture. In Figure 2.4, we show parts captured by the ImageNet model by cropping a region of the image centered around the occurring part. We can see that parts are able to capture multiple concepts that are similar in shape, texture, or the context in which they occur.

Swapping appearances. Using PatchVAE, we can swap the appearance of a part with the appearance vector of another part from a different image. In Figure 2.5, keeping the occurrence map the same for a target image, we modify the appearance of a randomly chosen part and observe the change in the reconstructed image. We notice that, given the same source part, the decoder tries similar things across different target images. However, the reconstructions are worse, since the decoder has never encountered this particular combination of part appearances before.

Discriminative vs. generative strength. By design, PatchVAE compromises generative capability to learn more discriminative features. To quantify this, we use the images reconstructed by the β-VAE and PatchVAE models (trained on ImageNet) and compute three different metrics to measure the quality of the reconstructions of test images. Table 2.7 shows that β-VAE is better at reconstruction.

Table 2.7: Reconstruction metrics on ImageNet. PatchVAE sacrifices reconstruction quality to learn discriminative parts, resulting in higher recognition performance (Table 2.2).

Model       PSNR ↑   FID ↓     SSIM ↑
β-VAE       4.857    108.741   0.289
PatchVAE    4.342    113.692   0.235

2.4 Ablation Analysis

We study the impact of the various hyper-parameters used in our experiments. For this evaluation, we follow the same setup as the 'Conv[1-5]' column of Table 2.1 and keep all hyperparameters from the previous section. We use the CIFAR100 and Indoor67 datasets for the ablation analysis.

Maximum number of patches. N is the maximum number of parts used in our framework. Depending on the dataset, a higher value of N can provide a wider pool of patches to pick from. However, it can also make the unsupervised learning task harder, since in a minibatch of images we might not get many repeated patches. Table 2.3 shows the effect of N on the CIFAR100 and Indoor67 datasets. We observe that while increasing the number of patches improves the discriminative power in the case of CIFAR100, it has little or negative effect in the case of Indoor67. A possible reason for this decline in performance is the smaller size of the dataset (i.e., fewer images to learn from).

Table 2.3: Effect of N: Increasing the maximum number of patches increases the discriminative power for CIFAR100 but has little or negative effect for Indoor67.

N     CIFAR100   Indoor67
4     27.59      14.40
8     28.74      12.69
16    28.94      14.33
32    27.78      13.28
64    29.00      12.76
Number of hidden units for patch appearance ẑ_app. Next, we study the impact of the number of channels in the appearance feature ẑ_app of each patch (d_p). This parameter reflects the capacity of an individual patch's latent representation. While it affects the reconstruction quality of images, we observed that it has little or no effect on the classification performance of the base features. Results are summarized in Table 2.4 for both the CIFAR100 and Indoor67 datasets.

Table 2.4: Effect of d_p: Increasing the number of hidden units for a patch has very little impact on classification performance.

d_p   CIFAR100   Indoor67
3     28.63      14.25
6     28.97      14.55
9     28.21      14.55

Prior probability of patch occurrence z^prior_occ. In all our experiments, the prior probability for a patch is fixed to 1/N, i.e., the inverse of the maximum number of patches. The intuition is to encourage each location on the occurrence maps to fire for at most one patch. Increasing this patch occurrence prior allows all patches to fire at the same location. While this would make the reconstruction task easier, it becomes harder for individual patches to capture anything meaningful. Table 2.5 shows the deterioration of classification performance as z^prior_occ increases.

Table 2.5: Effect of z^prior_occ: Increasing the prior probability of patch occurrence has an adverse effect on classification performance.

z^prior_occ   CIFAR100   Indoor67
0.01          28.86      14.33
0.05          28.67      14.25
0.1           28.31      14.03

Patch occurrence loss weight β_occ. The weight of the patch occurrence KL divergence has to be chosen carefully. If β_occ is too low, more patches can fire at the same location, which harms the learning capability of the patches; if β_occ is too high, the decoder will not receive any patches to reconstruct from, and both reconstruction and classification will suffer. Table 2.6 summarizes the impact of varying β_occ.

Table 2.6: Effect of β_occ: Too high or too low a β_occ can deteriorate the performance of the learned representations.

β_occ   CIFAR100   Indoor67
0.06    30.11      14.10
0.3     30.37      15.67
0.6     28.90      13.51

Chapter 3: Learning to Signal Mid-level Patches in Referential Games

The ability to communicate using language is a signature characteristic of intelligence [56]. Language provides a structured platform for agents not only to collaborate with each other and accomplish certain goals, but also to represent and store information in a compressible manner. Most importantly, language allows us to build infinitely many new concepts by composing known concepts. These qualities are shared by both the natural languages used in human-human communication and the programming languages used in human-machine communication. The study of the evolution of language can hence give us insights into intelligent machines that can communicate [57].

Our goal in this chapter is to develop and understand an emergent language, i.e., a language that emerges when two neural network agents try to communicate with each other. Clark [58] argued that supervised approaches consisting of a single agent learning statistical relationships among symbols do not capture the functional relationships between the symbols, i.e., the use of symbols leading to an action or an outcome. Krishna et al. [59] argued the same viewpoint in the context of images.
We therefore turn to recent works in emergent language [60, 61, 62, 63, 64, 65], which show that a communication protocol can be developed or learned by two or more cooperative agents trying to solve a task. The choice of task is quintessential, since a language derives meaning from its use [66]. We choose a task where two agents, a speaker and a listener, play a referential game, a type of signaling game first proposed by Lewis [67]. The speaker agent receives a target image and sends a message to the listener. The message consists of discrete symbols or words capturing different parts of the image. The listener receives another view of the target image, together with one or more distractor images. The goal of the speaker and the listener agents is to maximize the agreement between the message and the target image. Figure 3.1 illustrates the overview of the proposed referential game.

In computer vision, a number of attempts have been made to represent images as visual words [68], with a focus on low-level feature descriptors such as SIFT [69] and SURF [70]. Recent works in deep learning have attempted to describe an entire image with a fixed number of discrete symbols [71, 72, 73]; however, we postulate that large images contain a lot of redundant information, and a good visual representation should focus on only the "interesting" parts of the image. To discover what constitutes the interesting part of an image, we take inspiration from works on mid-level patches [22, 74, 75], the patches in an image that are both representative and discriminative [12, 74]. This means they can be discovered in a large number of images (and hence are representative), but at the same time they should be discriminative enough to set an image apart from the other images in the dataset. Hence, the speaker agent in this work computes a symbolic representation in terms of these mid-level patches, as opposed to the entire image.

To summarize, we propose PatchGame, a referential game formulation where, given an image, the speaker sends a discrete signal in terms of mid-level patches, and the listener embeds these symbols to match them with another view of the same image in the presence of distractors. Compared to previous works [63, 64, 76], we make the following key changes:

• Agents in some of the prior works [63, 64, 76] have access to a pre-trained network, such as AlexNet [77] or VGG [78], for extracting features from images. In this work, the agents rely on training on a large-scale image dataset, and on the invariance introduced by various image augmentations, to learn the language in a self-supervised way [79].

• We propose a novel patch-based architecture for the speaker agent, which comprises two modules: (1) PatchSymbol, a multi-layer perceptron (MLP) that operates at the patch level and converts a given image patch into a sequence of discrete symbols, and (2) PatchRank, a ConvNet that looks at the complete image and ranks the importance of patches in a differentiable manner.

• We introduce a novel transformer-based architecture for the listener agent, consisting of two modules: (1) a language module that projects the message received from the speaker into a latent space, and (2) a vision module that projects the image into the same latent space. We use a contrastive loss in this latent space to train both the speaker and the listener agents simultaneously.

• We propose new protocols to evaluate each of the speaker's and listener's modules.
We assess the success of PatchGame via qualitative and quantitative evaluations of each of the proposed components, and by demonstrating some practical applications. First, we show that the speaker's PatchRank model does indicate important patches in the image. We use the top patches indicated by this model to classify ImageNet [28] images using a pre-trained Vision Transformer [8] and show that we can retain over 60% top-1 accuracy with just half of the image patches. Second, the listener's vision model (ResNet-18) achieves up to 30.3% top-1 accuracy using only k-NN (k = 20) classification, outperforming other state-of-the-art unsupervised approaches [12, 73] that learn discrete representations of images by 9%. Finally, we analyze the symbols learned by our model and the impact of several hyperparameters used in our experiments.

Figure 3.1: Overview of the referential game. We generate two random views of every image in the given batch of images. The speaker takes one of the views as input and generates a sequence of symbols (a message). The listener takes the message sent by the speaker and the second view of the image, and projects both into an embedding space. Both the speaker and listener agents learn by minimizing the contrastive loss (see Equation (3.3)) between the views in this embedding space.

3.1 Related Work

Referential games. Prior to the advent of deep learning, significant research in the field of emergent communication showed that a communication protocol can be developed or learned by agents playing a language game [60, 61, 62, 80, 81, 82, 83, 84]. However, the agents employed in these works were typically situated in a synthetic world and made several assumptions about that world, such as the availability of disentangled representations of objects with discrete properties. More recent works [63, 65, 85, 86, 87, 88, 89, 90] have employed deep learning methods to develop a discrete language for communication between the agents. Lazaridou et al. [63] used neural network agents represented by an MLP to communicate concepts about real-world pictures; they used a fixed-size message composed from a large vocabulary. Havrylov and Titov [64], Evtimova et al. [76], and Bouchacourt and Baroni [91] relax this assumption and allow communication via variable-length sequences. Havrylov and Titov [64] allow the speaker agent to use an LSTM [92] to construct a variable-length message. Havrylov and Titov [64] and Lazaridou et al. [65] show that even when agents are allowed to use variable-length sequences to represent a message, they tend to utilize the maximum possible length to achieve the best performance (in terms of communication success). The idea of using a Gumbel-softmax distribution [53, 93] with the straight-through trick [94] for learning a language in a multi-agent environment was concurrently proposed by Mordatch and Abbeel [95] and Havrylov and Titov [64]. They show that this technique yields more stable and faster training than the reinforcement learning used in several other approaches.

Evaluating emergent communication. Evaluating the emergent language turns out to be an equally challenging research problem. Existing approaches use successful completion of the task or the correlation between the learned language and semantic labels as evaluation metrics.
Lowe et al. [96] and Keresztury and Bruni [97] show that simple task success might not be a good or sufficient metric for evaluating the success of a game. They discuss heuristics and advocate measuring both positive signaling and positive listening independently to evaluate agents' communication. Andreas [98] provides a way of evaluating the compositional structure of learned representations.

Discrete autoencoders. In parallel to the works on emergent communication, there is a large body of research on learning discrete representations of images without labels, using some form of autoencoding or reconstruction [12, 71, 73, 99, 100, 101, 102, 103]. The focus of VQ-VAE [71] and VQ-VAE-2 [73] is to learn a discrete bottleneck using vector quantization. Once any image can be represented with these discrete symbols, a powerful generative model such as PixelCNN [71, 104] or a transformer [105, 106] is learned on top of the symbols to sample new images. PatchVAE [12] achieves the same using Gumbel-softmax and imposes additional structure on the bottleneck of VAEs. We argue that, because of the mismatch between the objectives of reconstruction and visual recognition, models trained with reconstruction-based losses do not capture meaningful representations in their symbols.

Self-supervised learning in vision. Self-supervised learning (SSL) methods, such as [107, 108, 109, 110, 111, 112, 113, 114], have shown impressive results in recent years on the downstream tasks of classification and object detection. Even though the bottleneck in these methods is continuous (not discrete symbols), they have been shown to capture semantic and spatial information about the contents of the image. Unlike SSL methods, the neural networks representing the two agents in our case do not share any weights. Also, note that the continuous nature of the representations learned by SSL techniques is fundamentally different from the symbolic representations used in language. Indeed, we show that a k-nearest-neighbor classifier built on the continuous representations learned by SSL methods can perform better than one built on a bag of words (or symbols). However, to the best of our knowledge, our work is one of the first attempts to make representations learned in a self-supervised way more communication- or language-oriented.

Comparison with Mihai and Hare [79, 115]. These works extend Havrylov and Titov [64] by training the speaker and listener agents end-to-end without using pre-trained networks. However, the prior works [64, 115] use a top-down approach to generate a discrete representation (or sentence) for an image, i.e., they compute an overall continuous embedding of the image and then generate one symbol of the sentence at a time using an LSTM. The computational cost of LSTMs is prohibitive when the length of a sentence is large, which is needed to describe complex images. Transformers, on the other hand, require constant time for variable-length sentences at the cost of increased memory (in the listener agent). However, generating variable-length sentences with a transformer-based speaker agent is non-trivial. To solve this, we propose a bottom-up approach: we first generate symbols for image patches and then combine them to form a sentence. This approach allows for computationally efficient end-to-end training.
Further, it allows the speaker to compose symbols corresponding to different parts of the image, instead of deducing them from a pooled 1D representation of the image.

3.2 Our Approach

We first introduce the notation and the referential game played by the agents in this work. We then provide details of the architectures of the different neural network agents and the loss function, and highlight the important differences between this work and prior literature. Code and pre-trained models to reproduce the results are provided.

3.2.1 Referential Game Setup

Figure 3.1 shows the overview of our referential game setup. Given a dataset of N images {x_i}_{i=1}^{N}, we formulate a referential game [67] played between two agents, a speaker S_θ and a listener L_φ, as follows. As in the setting of Grill et al. [114], we generate two "random views" for every image. A random view is generated by taking a 224 × 224 crop from a randomly resized image and adding one or more of the following augmentations: color jitter, horizontal flip, Gaussian blur, and/or solarization. This prevents the neural networks from learning a trivial solution and encourages the emergent language to capture the invariances induced by the augmentations. Given a batch of B images, we refer to the two views as {x_i}_{i=1}^{B} and {x′_i}_{i=1}^{B}. In each training iteration, one set of views is presented to the speaker agent and the other set is shown to the listener agent.

The speaker S_θ encodes each image x_i independently into a variable-length message m_i. Each message m is represented by a sequence of one-hot encoded symbols with a maximum possible length L and a fixed-size vocabulary V. The space of all possible messages sent by the speaker is of the order |V|^L. The input to the listener L_φ is the batch of messages {m_i}_{i=1}^{B} from the speaker, together with the second set of random views of the batch of images. The listener consists of a language module φ_text to encode messages and a vision module φ_vision to encode images. The goal of the listener is to match each message to its corresponding image. Specifically, for a batch of B (message, image) pairs, S_θ and L_φ are jointly trained to maximize the cosine similarities of the B actual pairs while minimizing the similarities of the (B² − B) incorrect pairs. For a target message m_j, the image x′_j (the augmented view of x_j) acts as the target image while all the other (B − 1) images act as distractors. And vice versa: for the image x′_k, the message m_k (encoded by the speaker as S_θ(x_k)) acts as the target message while all the other (B − 1) messages act as distractors. We use the following symmetric contrastive loss function, sometimes referred to as the InfoNCE loss in previous metric-learning works [72, 116]:

L_text = −Σ_{j=1}^{B} log [ exp(φ_text(m_j) · φ_vision(x′_j)/τ) / Σ_{k=1}^{B} exp(φ_text(m_j) · φ_vision(x′_k)/τ) ]    (3.1)

L_vision = −Σ_{k=1}^{B} log [ exp(φ_text(m_k) · φ_vision(x′_k)/τ) / Σ_{j=1}^{B} exp(φ_text(m_j) · φ_vision(x′_k)/τ) ]    (3.2)

L = (L_text + L_vision)/2    (3.3)

where τ is a constant temperature hyperparameter.

Figure 3.2: Speaker agent architecture. The upper branch represents the PatchSymbol module θ_symb and the lower branch represents the PatchRank module θ_rank; the panels illustrate per-patch symbol sampling, importance scores, normalization and soft-sorting, and the dot product that assembles the message from the image patches. Each module takes the raw image as input. PatchSymbol computes a message for each patch of a pre-defined size, using a fixed-size vocabulary and message length. PatchRank uses a ConvNet to compute weights, i.e., the importance of each patch. Note that the symbols for a patch are context-independent, whereas the importance of a patch depends on the context. The speaker agent combines the outputs of the two modules and sends a variable-length message to the listener.
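As a concrete reference, a minimal PyTorch-style sketch of the symmetric contrastive objective in Equations (3.1)-(3.3) is given below; it assumes the listener's two modules already produce L2-normalized embeddings and averages over the batch, both of which are implementation assumptions rather than details stated in the text.

```python
import torch
import torch.nn.functional as F

def patchgame_loss(text_emb, image_emb, tau=0.07):
    """Symmetric InfoNCE loss of Eqs. (3.1)-(3.3), batch-averaged.

    text_emb : (B, d) listener embeddings of the B messages, phi_text(m_j)
    image_emb: (B, d) listener embeddings of the B image views, phi_vision(x'_j)
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / tau              # (B, B) cosine similarities / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_text = F.cross_entropy(logits, targets)         # match each message to its image
    loss_vision = F.cross_entropy(logits.t(), targets)   # match each image to its message
    return 0.5 * (loss_text + loss_vision)
```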
The game setting used in our work is inspired by Lazaridou et al. [63] and Havrylov and Titov [64], but there are important differences. Both our speaker S_θ and listener L_φ agents are trained from scratch. This makes the game setting more challenging, since the agents cannot rely on pre-trained models, which have been shown to encode semantic and/or syntactic information present in natural language or images. Our training paradigm, where we show different views of the same image to the speaker and the listener, is inspired by the recent success of self-supervised learning in computer vision. Empirically, we observe that this leads to more stable training and prevents the neural networks from learning degenerate solutions. However, in contrast with such self-supervised approaches, our goal is to learn a discrete emergent language rather than continuous semantic representations. We discuss the architectures of the two agents in the following sections.

3.2.2 Speaker agent architecture

A desirable characteristic of the speaker agent is that it should encode the "important" components of an image with a variable-length sequence of discrete symbols. Previous works [64, 76, 91] have achieved this by first converting the image into a continuous deterministic latent vector and then using an LSTM network [92] to generate a sequence of hidden states, sampling from this sequence until a special end-of-sequence token (or the maximum length) is reached. As observed by [64, 65], in order to achieve the minimum loss, the model ends up always using the maximum allowable length. In our experiments as well, we observed that having an LSTM makes training slow and does not achieve the objective of encoding images in variable-length sequences. We propose to leverage two separate modules in the speaker agent S_θ to circumvent this problem: the first module, called PatchSymbol (θ_symb), is a 2-layer MLP that computes patch-level embeddings for the image; the second module, called PatchRank (θ_rank), is a small ConvNet that computes the rank or importance of each patch in the image.

PatchSymbol, θ_symb. The idea of encoding an image at the patch level is inspired by the works on discovering mid-level patches [22, 74, 75] in images. We use a simple 2-hidden-layer MLP to encode each C × S²-dimensional image patch x_patch into l vectors of log-probabilities [log p^1_1, …, log p^1_V], …, [log p^l_1, …, log p^l_V]. Here C is the number of (color) channels in the input image or patch, S is the spatial dimension of a square patch, V is the size of the vocabulary used to represent a single symbol, and l is the number of symbols used to encode each patch. Hence, an image of size C × H × W can be encoded using K = HW/S² patches, each consisting of l symbols. The vectors of log-probabilities allow us to sample from a categorical distribution over V categories, with a continuous relaxation obtained via the Gumbel-softmax trick [53, 93].
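Before writing this sampling out formally (Equations (3.4)-(3.6) below), a minimal sketch of straight-through Gumbel-softmax sampling of patch symbols could look as follows in PyTorch; the `patch_mlp` module and tensor shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def sample_patch_symbols(patches, patch_mlp, V=128, l=1, tau_s=1.0):
    """Straight-through Gumbel-softmax sampling of patch symbols.

    patches  : (K, C*S*S) flattened image patches
    patch_mlp: MLP mapping a patch to l*V logits (log-probabilities)
    Returns (K, l, V) one-hot symbols with gradients flowing through the
    soft Gumbel-softmax relaxation.
    """
    logits = patch_mlp(patches).view(-1, l, V)    # (K, l, V) log-probabilities
    # hard=True returns one-hot samples in the forward pass while the
    # backward pass uses the continuous relaxation.
    return F.gumbel_softmax(logits, tau=tau_s, hard=True, dim=-1)
```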
For a given vector [log p^j_1, …, log p^j_V], we draw i.i.d. samples g^j_i from the Gumbel(0, 1) distribution [117] and obtain a differentiable approximation of the argmax as follows:

[[log p^1_1, …, log p^1_V], …, [log p^l_1, …, log p^l_V]] = MLP(x_patch)    (3.4)

y^j = one_hot(argmax_i [log p^j_i]) ∼ { exp((log p^j_i + g^j_i)/τ_s) / Σ_{k=1}^{V} exp((log p^j_k + g^j_k)/τ_s) }_{i=1}^{V},  ∀j    (3.5)

θ_symb(x_patch) = y = {y^j}_{j=1}^{l}    (3.6)

where τ_s controls how close the approximation is to the argmax. The final output of the θ_symb network for the entire image x is L one-hot encoded V-dimensional symbols, where L = l × HW/S². In all our experiments, we fix V = 128 and l = 1.

PatchRank, θ_rank. An image might have many redundant patches encoded with the same symbols. The goal of the θ_rank network is to assign an importance score to each patch. Since the importance of a patch depends on the context and not on the patch alone, we use a small ResNet-9 [55] to compute an importance weight for each of the K = HW/S² patches. One possible way to use these importance weights is to simply normalize them to (0, 1) and repeat the Gumbel-softmax trick to sample important patches; the listener network L_φ would then see only the message consisting of "important" patches. However, we empirically observed that with a simple min-max or L2 normalization the network assigns high weights to every patch and effectively sends the entire sequence of length L to the listener. Instead, we propose to use the differentiable ranking algorithm of Blondel et al. [118] to convert the importance weights into soft ranks in {1, …, K} in O(K log K) time. This method works by constructing diff