ABSTRACT

Title of Dissertation: LEARNING AND COMPOSING PRIMITIVES FOR THE VISUAL WORLD

Kamal Gupta
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science; Professor Larry S. Davis, Department of Computer Science

Compositionality is at the core of how humans understand and recreate the visual world. It is what allows us to express infinitely many concepts using finite primitives. For example, we understand images as combinations of objects and videos as compositions of actions, and we generate 3D animations by rendering 3D surfaces with textures, materials, and lighting. It is unsurprising to see composition also appear in almost all human-created art forms such as language, music, design, or even mathematics. Although compositionality seems an obvious and prevalent way humans consume and create data, it often eludes computational approaches such as deep learning. Current systems often assume the availability of exhaustive labeled concepts or primitives during training and fail to generalize to new compositions during inference. In this dissertation, we propose to discover compositional primitives from data with little to no supervision, and we show how these primitives can improve generalization in real-world applications such as classification, correspondence, or 2D/3D synthesis.

In the first half of this dissertation, I propose two complementary approaches to discover compositional discrete primitives from visual data. Given a large collection of images without labels, I propose a generative and a contrastive way of recognizing discriminative parts in the image which are useful for visual recognition. In the generative approach, I take inspiration from Bayesian approaches such as variational autoencoders to develop a system that can express images in the form of a discrete, language-like representation. In the contrastive approach, I play a referential game between two neural network agents to learn meaningful discrete concepts from images. I further show applications of these approaches in image and video editing by learning a dense correspondence of primitives across images.

In the second half, I focus on learning how to compose primitives for both 2D and 3D visual data. By expressing scenes as an assembly of smaller parts, we can easily perform generation from scratch or from partial scenes given as input. I present two works, one on composing multiple viewpoints to synthesize 3D objects, and another on composing bounding boxes or cuboids to generate scene layouts. I also present a work on discovering a data-driven ordering for traversing an image or a scene during composition. I show applications of these works in image/video compression, as well as 2D and 3D content creation.

LEARNING AND COMPOSING PRIMITIVES FOR THE VISUAL WORLD

by
Kamal Gupta

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Larry S. Davis, Co-chair/Advisor
Professor Carol Y. Espy-Wilson, Dean's Representative
Professor Matthias Zwicker
Professor Noah Snavely, Cornell University

© Copyright by Kamal Gupta 2023

To my family and friends, for your unwavering and unconditional support. I am eternally grateful.
Acknowledgments

This thesis is the culmination of a joint effort of many individuals and institutions over the course of the last 5 years, and I want to express my deep appreciation and gratitude to all those who have contributed to its successful completion.

First and foremost, I would like to express my sincere gratitude to my wife, Gowthami Somepalli, for pushing me to apply for a Ph.D. after I spent years procrastinating. Thanks for sticking with me through thick and thin, endless brainstorming, bearing my rants, and helping me prepare for the interviews and presentations (in no specific order). I want to thank my parents for always pushing me to be a better person, my sisters Ruchi, Shilpa, and Sravani, my niece Sanvi and nephew Ibhan, and my brothers-in-law Himanshu and Ankur, for their unconditional love and support. Their patience, understanding, and love provided me with the motivation to overcome the many challenges I faced while completing this thesis.

I want to thank my advisor Abhinav Shrivastava for his unwavering support, guidance, and constructive criticism in making this thesis happen. I especially appreciate his flexibility and availability every time I needed it, the freedom to explore various research topics that I was passionate about, and above all, his friendship throughout the last 5 years. I am privileged to have his expertise, knowledge, and enthusiasm, which were essential to the successful completion of my dissertation. I also want to thank my co-advisor, Larry Davis, for being a great mentor, teaching me how to critically evaluate my own ideas, and creating a massive Computer Vision community at the University of Maryland that I immensely benefited from.

I am also grateful to other professors and mentors I collaborated with and learned from, especially Matthias Zwicker for his valuable insights and for introducing me to the world of graphics, David Jacobs for a number of discussions, and Noah Snavely, Ameesh Makadia, and Vijay Mahadevan for their expertise and constructive criticism, which were instrumental in shaping the direction of my research. I thank David Luebke and Stephano Soatto for building the best teams that I had the pleasure to be part of, and Sanjiv Singh, Stephen Nuske, Subhajit Sanyal, and Vidit Jain for their mentorship before I even started my Ph.D. I sincerely appreciate the time and effort they dedicated to teaching me and reviewing my work.

I want to thank the DARPA SAIL-ON (W911NF2020009), DARPA SemaFor (HR001119S0085), and DARPA GARD (HR00112020007) programs, the Amazon Research Award to Abhinav, the Kulkarni Fellowship, the Dean's Fellowship, the Graduate Assistant Award, and Google, NVIDIA, and Amazon Research Labs, which helped me conduct my research.

I want to express my gratitude to my friends/collaborators (Saurabh, Justin, Alessandro, Susmija, Ketul, Hanyu, Abhishek, Varun, Carlos, Nicholas, Towaki, Sameh, Anubhav, Sharath, Hao, Shishira, Matt W., Saksham, Nirat, Vinoj, Khoi, Archana), friends/labmates (Pulkit, Luyu, Pallabi, Lillian, Saketh, Gaurav, Soumik, Neha, Sahil, Moustafa, Max, Bo, Matt G., Shirley, Alex, Mara, Yixuan, Shlok, Anshul, Aman, Pranav, Sweta, Namitha, Vatsal, Ahmed, Ameya, Samyadeep, Sneha, Rohan, Trisha, Uttaran, Philip, Amanpreet, Koutilya, Sharmila, Kevin, Shramay), and friends/housemates (Bhavesh and Shivani) for their unwavering support and encouragement throughout this journey.
I would like to express my gratitude to Tom Hurst and the other staff and faculty of the Computer Science Department at UMD and UMIACS, who were super supportive and responsive when it came to any bureaucratic or compute-related questions.

Last but not least, I want to thank the essential workers and healthcare professionals across the world, who put their lives at risk every day, to help us make it through the Covid-19 pandemic. Like many others in my time, I owe them a great deal, and while I can't name all of them, my Ph.D. wouldn't have been possible without their tireless efforts.

My sincere apologies to those I've inadvertently left out. Without all of your support, guidance, and encouragement, this research would not have been possible.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Overview
1.2 Discovering Primitives from Data
1.3 Learning to Compose Primitives

Part I: Discovering Primitives from Data

Chapter 2: Learning Local Latent Codes for Recognition
2.1 Related Work
2.2 Our Approach
2.2.1 VAE Review
2.2.2 PatchVAE
2.2.3 PatchVAE with multiple parts
2.2.4 Improved Reconstruction Loss
2.3 Experiments
2.3.1 Downstream classification performance
2.3.2 Qualitative Results
2.4 Ablation Analysis

Chapter 3: Learning to Signal Mid-level Patches in Referential Games
3.1 Related Work
3.2 Our Approach
3.2.1 Referential Game Setup
3.2.2 Speaker agent architecture
3.2.3 Listener agent architecture
3.2.4 Training
3.3 Experiments
3.3.1 Positive Signaling - Visualizing Patch Ranks from θrank
3.3.2 Positive Signaling - Image Classification with subset of patches provided by θrank
3.3.3 Positive Signaling - Visualizing θsymb symbols
3.3.4 Positive Listening
3.4 Ablation study

Chapter 4: Aligning Sparse in-the-wild Image Collections
4.1 Related Work
4.2 ASIC Framework
4.2.1 Obtaining Pseudo-correspondences
4.2.2 Architecture
4.2.3 Training Objectives
4.3 Experiments
4.3.1 Canonical Space Alignment
4.3.2 Visualizing Dense Correspondences
4.3.3 Pairwise Correspondence
4.3.4 Image Set Correspondence
4.4 Ablations
4.5 Limitations

Part II: Learning to Compose Primitives

Chapter 5: Improved Modeling of 3D Shapes with Multi-view Depth Maps
5.1 Related Work
5.2 Our Approach
5.2.1 Framework
5.2.2 Loss function and other training details
5.2.3 Generative modeling with Implicit MLE
5.3 Experiments
5.3.1 Dataset
5.3.2 Baselines
5.3.3 Metrics
5.3.4 Results

Chapter 6: Layout Generation and Completion with Self-attention
6.1 Related Work
6.2 Our Approach
6.2.1 Layout Representation
6.2.2 Model architecture and training
6.3 Experiments
6.3.1 3D Shape synthesis (on PartNet dataset)
6.3.2 Layouts for natural scenes
6.3.3 Layouts for Apps and Documents
6.3.4 Failure Cases

Chapter 7: Neural Space-filling Curves
7.1 Related Work
7.2 Approach
7.2.1 Overview of Dafner et al. (Single Image-based SFC)
7.2.2 Our Approach
7.2.3 Weight Generator
7.2.4 Objective Functions
7.2.5 Weight Evaluator
7.2.6 Training
7.3 Experiments
7.3.1 Datasets
7.3.2 Training details
7.3.3 Qualitative Evaluation
7.3.4 Optimizing Autocorrelation
7.3.5 Optimizing Code Length
7.3.6 Scaling up SFC

Chapter 8: Conclusion and Discussion
8.1 Summary
8.2 Discussion

Bibliography

List of Tables

2.1 Classification results on CIFAR100, Indoor67, and Places205. We initialize the classification model with the representations ϕ(x) learned from the unsupervised learning task. The model ϕ(x) comprises a conv layer followed by two residual blocks (each having 2 conv layers). The first column (called 'Conv1') corresponds to Top-1 classification accuracy with the pre-trained model with the first conv layer frozen; the second and third columns correspond to results with the first three and first five conv layers frozen, respectively. Details in Section 2.3.1.

2.2 ImageNet classification results using ResNet18. We initialize weights using the unsupervised task and fine-tune the last two residual blocks. Details in Section 2.3.1.

2.3 Effect of N: Increasing the maximum number of patches increases the discriminative power for CIFAR100 but has little or negative effect for Indoor67.

2.4 Effect of d_p: Increasing the number of hidden units for a patch has very little impact on classification performance.

2.5 Effect of z_occ^prior: Increasing the prior probability of patch occurrence has an adverse effect on classification performance.

2.6 Effect of β_occ: Too high or too low β_occ can deteriorate the performance of learned representations.

2.7 Reconstruction metrics on ImageNet. PatchVAE sacrifices reconstruction quality to learn discriminative parts, resulting in higher recognition performance (Table 2.2).

3.1 Performance of Listener's Vision module ϕvision - Downstream classification accuracy for the ImageNet dataset using k-NN (k=20).

3.2 Performance of Listener's Vision module ϕvision - Downstream mean Average Precision for Pascal VOC.

4.1 Evaluation on SPair-71k. Per-class and average PCK@0.10 on the test split. Highest PCK among weakly supervised methods in bold, second highest underlined. Scores marked with (⋆) mean that the paper uses a fixed image from the test set as the canonical image. Our method is competitive against other weakly supervised approaches and often outperforms them.

4.2 CUB-200 and PF-Willow. PCK@0.10 for three CUB categories and four PF-Willow categories.

4.3 Ablation Study.
Average PCK@0.10 (for 3 and 4 categories, respectively) on the CUB and PF-Willow datasets.

5.1 Overview of baselines.

5.2 Reconstruction on test data. We measure the reconstruction performance of different techniques on the test dataset. [1] and [2] outperform the other methods in the majority of cases.

5.3 Shape reconstruction from a single viewpoint.

6.1 Evaluation of generated shapes in the Chair category. The best numbers are in bold, second-best are underlined.

6.2 Analogies. We demonstrate linguistic nuances being captured by our category embeddings by attempting word2vec [3] style analogies.

6.3 Quantitative Evaluations on COCO. Negative log-likelihood (NLL) of all the layouts in the validation set (lower is better). We use the importance sampling approach described in [4] to compute it. We also generate images from layouts using [5] and compute IS and FID. Following [6], we randomly split test set samples into 5 groups and report the standard deviation across the splits. The mean is reported using the combined test set.

6.4 Spatial distribution analysis for the samples generated using models trained on the RICO and PubLayNet datasets. The closer the Overlap and Coverage values are to real data, the better the performance. All values in the table are percentages (std in parentheses).

7.1 Comparison of lag-k autocorrelation and LZW encoding length for different orders. For autocorrelation, we consistently outperform both the universal and context-based SFC computation approaches at high values of k. For LZW encoding length, we measure the average size per frame in bytes as well as the relative improvement compared to the raster scan order for each of the datasets. We consistently outperform other scan orders in compression performance.

List of Figures

1.1 We propose to learn mid-level patches in images which occur repetitively across the data. We further show that these mid-level patches can cluster meaningfully and can also enable dense correspondence applications.

1.2 LayoutTransformer [7] can synthesize layouts in diverse natural as well as human-designed data domains such as documents, mobile app wireframes, natural scenes, or 3D objects in a sequential manner.

2.1 PatchVAE learns to encode repetitive parts across a dataset, by modeling their appearance and occurrence. (top) Given an image, the occurrence map of a particular part learned by PatchVAE is shown in the middle, capturing the head/beak of the birds. Samples of the same part from other images are shown on the right, indicating consistent appearance. (bottom) More examples of parts discovered by our PatchVAE framework.

2.2 (a) VAE Architecture: In a standard VAE architecture, the output of the encoder network is used to parameterize the variational posterior for z. Samples from this posterior are input to the decoder network.
(b) Proposed PatchVAE Architecture: Our encoder network computes a set of feature maps f using ϕ(x). This is followed by two independent single-layer networks. The bottom network generates part occurrence parameters QO. We combine QO with the output of the top network to generate part appearance parameters QA. We sample zocc and zapp to construct ẑ as described in Section 2.2.2, which is input to the decoder network. We also visualize the corresponding priors for the latents zapp and zocc in the dashed gray boxes.

2.3 Encoded part occurrence maps discovered on CIFAR100 and ImageNet. Each row represents a different part.

2.4 A few representative examples for several parts to qualitatively demonstrate the visual concepts captured by PatchVAE. For each part, we crop image patches centered on the part location where it is predicted to be present. Selected patches are sorted by part occurrence probability as score. We manually select a diverse set from the top-50 occurrences from the training images. As can be seen, a single part may capture a diverse set of concepts that are similar in shape or texture or occur in similar context, but belong to different categories. We show which categories the patches come from (note that category information was not used while training the model).

2.5 Swapping source and target part appearance. Columns 1, 2 show a source image with the occurrence map of one of the parts. We can swap the appearance vector of this part with appearance vectors of a different part in target images. Columns 3, 4 show three target images with occurrence maps of one of their parts. Observe the change in reconstructions (columns 5, 6) as we bring in the new appearance vector. The new reconstruction inherits properties of the source at specific locations in the target.

3.1 Overview of the Referential Game. We generate two random views of every image in the given batch of images. The speaker takes one of the views as the input and generates a sequence of symbols (or message). The listener takes the message sent by the speaker and the second view of the image, and projects both into an embedding space. Both the speaker and listener agents learn by minimizing the contrastive loss (see Equation (3.3)) between the views in this embedding space.

3.2 Speaker agent architecture. The upper branch represents the PatchSymbol (θsymb) module and the lower branch represents the PatchRank (θrank) module. Each of the modules takes the raw image as input. PatchSymbol computes a message for each patch of pre-defined size, using a fixed-size vocabulary and message length. PatchRank uses a ConvNet to compute weights or the importance of each patch. Note that the symbols for a patch are context independent; however, the importance of a patch depends on the context. The speaker agent combines the output of each of the modules and sends a variable-length message to the listener.

3.3 Heat maps based on Patch Importance. Red represents the most important patches and Blue the least important. Labels are listed for better understanding and are not used during training. We have used 224 × 224 images and 32 × 32 patches to get 49 non-overlapping patches per image.
The model successfully separates the primary object in an image - the "device" in the "Abacus" image, the "rabbit" in the "hare" image, etc. (as shown by the higher concentration of yellow-to-red patches on these areas). Some failure cases are shown on the right as well.

3.4 Impact of the number of patches used during evaluation in a pre-trained ViT [8]. The performance of a ViT drops if we provide fewer patches during inference. However, a PatchRank model (which is a small ResNet-9) can provide important patches to the ViT during inference with minimal loss in Top-1 accuracy.

3.5 Visualizing patches corresponding to various symbols. The number in the top row corresponds to 1 of 128 vocabulary ids. Six representative patches corresponding to each symbol are shown. The bottom row corresponds to our interpretation of what concept that symbol might be capturing.

3.6 Impact of various hyperparameters on the success of communication. Top-1 accuracy denotes the percentage of messages in a batch that the model was able to match to the corresponding image. Having a higher message length and vocabulary helps the model learn faster.

4.1 Globally consistent and dense alignments with ASIC. Given a small set (∼10-30) of images of an object or object category captured in-the-wild, our framework computes a dense and consistent mapping between all the images in a self-supervised manner. First row: Unaligned sets of images from the SAMURAI (Keywest) and SPair-71k (Cow) datasets. Second row: Dense correspondence maps produced by our method. Third row: Image in the first column warped to the images in columns 2-5.

4.2 ASIC Architecture. The alignment network Φalign predicts canonical space coordinates for all pixels of all input images. Images can be reconstructed using a differentiable warp from the canonical space. In order to align semantically similar pixels from different images to the same location in the canonical space, we propose two primary loss functions, LKP and LRecon. Please refer to Sec. 4.2.2 for more details.

4.3 Visualizing canonical space alignment. For each dataset, the top row shows sample images from the dataset (composed of 10-30 images each). The middle row shows part co-segmentations computed by DVD [9]. DVD computes a coarse, discrete set of parts across the dataset. The bottom row shows the continuous canonical space mapping computed by our method. Our canonical space mapping is smooth and consistent across the images for each dataset/collection.

4.4 Dense warping from a source image (top row) to a target image (second row). We warp all foreground pixels (highlighted by a red overlay in the source image). Our method produces dense and semantically more meaningful warps from the source to the target.

4.5 Image Set Correspondence. k-CyPCK at varying αbbox (higher is better). Our method outperforms the DVD baseline at both large and small values of αbbox.

4.6 Limitations. The top row shows that our model can map the left part of the object in the source to the right part of the object in the target when the object is symmetric. The bottom row shows that our model fails for very large out-of-plane rotations.
5.1 Architecture: Our Identity Encoder Network takes one or more depth maps of a 3D object as input and encodes each of them into latent vectors. We use the expected value of these latent vectors as the identity vector of the object. The decoder or viewpoint generator network uses this identity vector, category, and viewpoint as input to generate a depth map of the object. It consists of an MLP that maps the identity vector to a style vector. This style vector is added to each block in the Viewpoint Generator Network using adaptive instance normalization.

5.2 Reconstruction on test objects. We evaluate our model's ability to encode the geometry of 3D shapes with its latent representation. Each row represents 3D shapes from different categories. Odd columns are the input objects and even columns are the reconstructions. We render depth maps of input objects from different viewpoints. Our encoder encodes each of the depth maps independently and outputs a latent code. We average the latent code over all the viewpoints of the object. Finally, our generator uses this averaged latent code to generate multiple viewpoints. The final output depth maps are projected back to 3D. The last two columns show some of the failure cases, likely caused by a lack of similar samples in the training data.

5.3 Single View Reconstruction. Given a single input depth map, we show reconstructions of the complete 3D object using our approach. Odd columns are the input depth maps, even columns are the full 3D reconstructions viewed from a canonical viewpoint.

5.4 Shape Interpolation. The latent representations learned by our model are smooth. In this figure, we interpolate between two test objects (left-most and right-most columns) by generating the intermediate 3D objects from the spherical interpolation of the latent codes.

5.5 Word embeddings learned by our model for various object classes. We see that visually similar categories cluster together, even if they are not semantically similar. E.g., airplane, rocket, and vessel are together; mailbox and microwave are also together.

5.6 Two nearest neighbors (NN) to generated 3D models: our model produces 3D models that differ from the two closest training samples.

6.1 Our framework can synthesize layouts in diverse natural as well as human-designed data domains such as documents, mobile app wireframes, natural scenes, or 3D objects in a sequential manner.

6.2 The architecture depicted for a toy example. LayoutTransformer takes layout elements as input and predicts the next layout elements as output. During training, we use teacher forcing, i.e., use the ground-truth layout tokens as input to a multi-head decoder block. The first layer of this block is a masked self-attention layer, which allows the model to see only the previous elements in order to predict the current element. We pad each layout with a special ⟨bos⟩ token at the beginning and an ⟨eos⟩ token at the end.

6.3 Generated 3D objects. The top row shows input primitives to the model. The bottom row shows the layout obtained by our approach.
6.4 Generated layouts. The top row shows seed layouts input to the model. The bottom row shows the layouts obtained with nucleus sampling. We skip the 'stuff' bounding boxes for clarity.

6.5 Downstream task. Image generation with layouts [5]. FID and IS scores for the generated images are provided in Table 6.3.

6.6 TSNE plot of learned category embeddings. Words are colored by their supercategories provided in COCO. Observe that semantically similar categories cluster together. Cats and dogs are closer as compared to sheep, zebra, or cow.

6.7 Distribution of xy-coordinates of bounding box centers. The distributions for generated and real layouts are similar. The y-coordinate tends to be more informative (e.g., sky at the top, road and sea at the bottom).

6.8 RICO layouts. Generated layouts for the RICO dataset. We skip the categories of bounding boxes for the sake of clarity.

6.9 Multiple completions from the same initial element.

6.10 Document Layouts. Generated samples from LayoutVAE (top) and our method (bottom). Our method produces aligned bounding boxes for various elements.

7.1 Given a set of images, a gif, or a video, Neural Space-filling Curves (SFC) can provide a more spatially coherent scan order for images as compared to universal scan orders such as the S-curve or Peano-Hilbert curves. As shown in the examples of a trouser and a face, the scan line tends to cover the background before moving to the foreground. (SFCs generated here using half-resolution images and resized for clarity. Best viewed in color.)

7.2 Cover and Merge Algorithm. (a): An 8 × 8 image fully covered by 2 × 2 circuits. (b): The dual graph G′ (black) built on the covering circuits G (blue). (c): The Minimum Spanning Tree T (black solid lines) of G′ (all black lines) and the Hamiltonian Circuit (blue) induced by T. (d): A single Hamiltonian circuit merged from the covering circuits. See more details in Sec. 7.2.1.

7.3 Neural SFC pipeline. The Neural SFC model consists of a weight generator FG and a weight evaluator EG. Taking one or more images as input, FG extracts the deep features of the image and generates the SFC weights. EG takes both the image(s) and the corresponding SFC weights as input. EG then estimates the negative autocorrelation of the image pixel sequence inferred by the input weights to evaluate the goodness of the input weights with respect to the input image. See more details in Sec. 7.2.2.

7.4 Qualitative comparison between Hilbert curves and Neural SFCs. Left: SFC (in red) overlaid on the image. Right: Image flattened according to the SFC and visualized in 1-dimension. Images in the top two rows are from MNIST, the ones in the middle two rows are from Fashion-MNIST, and the ones in the bottom two rows are from FFHQ Faces. Neural SFCs on images from MNIST and Fashion-MNIST are class-conditional, i.e., computed for each class. Therefore, for MNIST and Fashion-MNIST, the Neural SFCs in the right two columns are the same since the two images have the same class label. In all datasets, Neural SFCs are more spatially coherent and produce fewer clusters when visualized in 1-dimension.
Best viewed in color.

7.5 We visualize Neural SFCs trained with the two objectives considered in this paper – autocorrelation and LZW encoding.

7.6 Lag-k autocorrelation for the MNIST, Fashion-MNIST, and FFHQ datasets. While Dafner et al. provide higher autocorrelation for small lags, i.e., from k = 2 to k = 4, Neural SFCs outperform Dafner et al. for k > 4 in all the datasets. Note that we trained our model for k = 6, and hence this behaviour is expected.

7.7 Scaling up SFC. The top row (blue circles) shows a 2 × 2 crop of an image grid of resolution n × n. The bottom row (red circles) shows how we can scale the SFC path from an n × n grid to a 2n × 2n grid. For every incoming SFC to a pixel location, there are 3 possible 'next' pixels: straight ahead, left, or right. Columns 2 and 3 consider the cases where the SFC goes 'straight' or 'right'. The image on the right shows a toy example of scaling up an SFC for a 5 × 5 image.

Chapter 1: Introduction

1.1 Overview

Humans (and even animals) [10] can innately imagine and recognize a large number of objects in diverse scenes by composing a few known objects and their attributes. However, this ability to synthesize and recognize infinitely many new combinations from finite concepts, called compositional generalization, remains an Achilles' heel of current data-driven approaches (e.g., deep networks) that assume robust labeled training data is available for exhaustive objects and their properties [11].

In order to harness computational approaches for building real-world applications, it is crucial for them to understand and perform composition. George Boole, also known as the father of modern Boolean algebra, formally defined the principle of compositionality as follows:

"The meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them."

The majority of this thesis involves abstracting and solving these two challenges in compositional learning:

1. Discovering discrete primitives from data (images and videos) with few or no labels, which can be composed for various downstream applications such as segmentation, correspondence, or image retrieval.

2. Learning the rules to combine visual primitives meaningfully to generate rich and diverse compositions of the input data (images and 3D).

Being able to understand and compose 2D and 3D content (both static and dynamic) will unlock the next generation of tools empowered by Machine Learning. It will allow artificial agents to understand the visual world around them and perform repetitive or dangerous tasks, reducing human time and effort by orders of magnitude. Tools to create content will enable the creation of large, labeled, highly realistic synthetic datasets. They will also be essential for entertainment (movies, sports, gaming) companies, as well as have applications in robotics, healthcare, education, and e-commerce.

1.2 Discovering Primitives from Data

In order to better understand the guiding principles behind the generative process of data, it is imperative to build representations that can discover the simple parts or primitives that compose the data. In images, however, representations learned by current popular unsupervised and self-supervised machine learning systems follow a distributed paradigm.
This means that an image is usually represented using a high-dimensional continuous vector in which a single dimension can contribute to multiple concepts, and multiple dimensions can contribute to a single concept. In this work, we propose a new paradigm to represent images in the form of discrete primitives that appear in images and can be composed to either reconstruct an image or to differentiate an image from other images. We discuss two foundational works, PatchVAE in Chapter 2 and PatchGame in Chapter 3.

Figure 1.1: We propose to learn mid-level patches in images which occur repetitively across the data. We further show that these mid-level patches can cluster meaningfully and can also enable dense correspondence applications.

PatchVAE [12] attempts to model images by representing them as a combination of part appearance maps and part visibility maps, which, combined together, allow us to reconstruct the original images. These parts are meaningful across a collection of images and enable image editing applications by simply replacing the discrete symbols of one image with those of another. PatchGame [13] follows a contrastive approach instead. In this work, we play a cooperative referential game between two neural network agents, a speaker and a listener, who can communicate with each other only via a discrete bottleneck. The goal for the listener is to match a discrete message sent by the speaker to the corresponding image. I show the applications of this discrete bottleneck in downstream tasks such as detection, retrieval, and part segmentation. Further, in Chapter 4, ASIC [14] takes these discrete primitives a step further by learning a dense correspondence for a set of images, as shown in Fig. 1.1.

Going beyond images, the principle of compositionality can also be applied to represent and train fast, efficient neural networks. In LilNetX [15, 16], we learn a set of discrete weights sampled from a codebook to provide a replacement for the weights of standard CNNs such as ResNets or MobileNets. These discrete weights, combined with entropy regularization, allow for highly quantized and sparse neural networks that can be used for tasks such as classification or detection. An extension of this work (under review) also shows the application of our entropy regularization technique to compress Neural Fields. While in this dissertation we restrict ourselves to composition for visual data, we encourage readers to check the above-mentioned works on model compression.

1.3 Learning to Compose Primitives

The second part of this dissertation focuses on the process of synthesizing or learning a meaningful arrangement of some known concepts to give rise to novel scenes with desired characteristics.

In Chapter 5, our primitives are depth maps from different camera viewpoints. By synthesizing and composing these multi-view depth maps, we propose a new way of generating 3D surfaces. In Chapter 6, we consider the primitives to be bounding boxes or cuboids for 2D and 3D scenes. This arrangement of concepts or primitives in a scene can be referred to as the layout of the scene, and it can be characterized by contextual relationships (e.g., support, occlusion, and relative likelihood, position, and size) between objects comprising the scene. In this work,
dubbed LayoutTransformer [7], we propose an approach to autoregressively generate 2D and 3D layouts from scratch or from an initial seed set of primitives, which can easily scale to support an arbitrary number of primitives per layout. A shortcoming of this approach is that it can only generate layout elements in a raster-scan order. We address this shortcoming in Neural SFC [17], discussed in Chapter 7, where we learn a data-driven scan order for images.

Part I: Discovering Primitives from Data

Chapter 2: Learning Local Latent Codes for Recognition

Due to the availability of large labeled visual datasets, supervised learning has become the dominant paradigm for visual recognition. That is, to learn about any new concept, the modus operandi is to collect thousands of labeled examples for that concept and train a powerful classifier, such as a deep neural network. This is necessary because the current generation of models based on deep neural networks requires large amounts of labeled data [18]. This is in stark contrast to the insights that we have from developmental psychology on how infants develop perception and cognition without any explicit supervision [19]. Moreover, the supervised learning paradigm is ill-suited for applications, such as health care and robotics, where annotated data is hard to obtain either due to privacy concerns or the high cost of expert human annotators. In such cases, learning from very few labeled images or discovering underlying natural patterns in large amounts of unlabeled data can have a large number of potential applications. Discovering such patterns from unlabeled data is the standard setup of unsupervised learning.

Over the past few years, the field of unsupervised learning in computer vision has followed two seemingly different tracks with different goals: generative modeling and self-supervised learning. The goal of generative modeling is to learn the probability distribution from which data was generated, given some training data. Such models, learned using reconstruction-based losses, can draw samples from the same distribution or evaluate the likelihoods of new data,
Due to the mismatch in goals of un- supervised learning for visual recognition and the representations learned from generative mod- eling, self-supervised learning is a more popular way of learning representations from unlabeled data. However, fundamental limitation of this self-supervised paradigm is that we need to define a proxy-task that can mimic the desired recognition task. It is not possible to always establish such a task, nor are these tasks generalizable across recognition tasks. In this paper, our goal is to enable the unsupervised generative modeling approach of VAEs 8 to learn representations useful for recognition. Our key hypothesis is that for a representation to be useful, it should capture just the interesting parts of the images, as opposed to everything in the images. What constitutes an interesting image part has been defined and studied in earlier works that pre-date the end-to-end trained deep network methods [20, 21, 22]. Taking inspiration from these works, we propose a novel representation that only encodes few such parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images. By avoiding reconstruction of the entire image our method can focus on regions that are repeating and con- sistent across many images. In an encoder-decoder based generative model, we constrain the encoder architecture to learn such repetitive parts – both in terms of representations for appear- ance of these parts (or patches in an image) and where these parts occur. We formulate this using variational auto-encoder (β-VAEs) [23, 24], where we impose novel structure on the latent rep- resentations. We use discrete latents to model part presence or absence and continuous latents to model their appearance. Figure 2.1 shows an example of the discrete latents or occurrence map, and example parts discovered by our approach, PatchVAE. We present PatchVAE in Section 2.2 and demonstrate that it learns representations that are much better for recognition as compared to those learned by the standard β-VAEs [23, 24]. In addition, we present losses that favor foreground, which is more likely to contain repet- itive patterns, in Section 2.2.4, and demonstrate that they result in representations that are much better at recognition. Finally, in Section 2.3, we present results on CIFAR100 [25], MIT Indoor Scene Recognition [26], Places [27], and ImageNet [28] datasets. To summarize, our contribu- tions are as follows: 9 • We propose a novel patch-based bottleneck in the VAE framework that learns representations that can encode repetitive parts across images. • We demonstrate that our method, PatchVAE, learns unsupervised representations that are better suited for recognition in comparison to traditional VAEs. • We show that losses that favor foreground are better for unsupervised representation learning for recognition. • We perform extensive ablation analysis of the proposed PatchVAE architecture. 2.1 Related Work Due to its potential impact, unsupervised learning (particularly for deep networks) is one of the most researched topics in visual recognition over the past few years. Generative models such as VAEs [23, 24, 29, 30], PixelRNN [31], PixelCNN [32, 33], and their variants have proven effective when it comes to learning compressed representation of images while being able to faithfully reconstruct them as well as draw samples from the data distribution. 
GANs [34, 35, 36, 37], on the other hand, while they don't model the probability density explicitly, can still produce high-quality image samples from noise. There has been work combining VAEs and GANs to simultaneously learn the image data distribution while being able to generate high-quality samples from it [38, 39, 40]. Convolutional sparse coding [41] is an alternative approach for reconstruction or image in-painting problems. Our work complements existing generative frameworks in that we provide a structured approach for VAEs that can learn beyond low-level representations. We show the effectiveness of the representations learned by our model by using them for visual recognition tasks.

There has been a lot of work in interpreting or disentangling representations learned using generative models such as VAEs [24, 42, 43]. However, there is little evidence of the effectiveness of disentangled representations in visual recognition. Semi-supervised learning using generative models [44, 45], where partial or noisy labels are available to the model during training, has shown a lot of promise for generating conditioned samples from the model. In our work, however, we focus on incorporating inductive biases in these generative models (e.g., VAEs) so they can learn representations better suited for visual recognition.

A related, but orthogonal, line of work is self-supervised learning, where a proxy task is designed to learn representations useful for recognition. These proxy tasks vary from simple tasks like arranging patches in an image in the correct spatial order [46, 47] and arranging frames from a video in the correct temporal order [48, 49], to more involved tasks like in-painting [50] and context prediction [51, 52]. We follow the best practices from this line of work for evaluating the learned representations.

2.2 Our Approach

Our work builds upon the VAE framework proposed by Kingma and Welling [23]. We briefly review relevant aspects of the VAE framework and then present our approach.

Figure 2.2: (a) VAE Architecture: In a standard VAE architecture, the output of the encoder network is used to parameterize the variational posterior for z. Samples from this posterior are input to the decoder network. (b) Proposed PatchVAE Architecture: Our encoder network computes a set of feature maps f using ϕ(x). This is followed by two independent single-layer networks. The bottom network generates part occurrence parameters QO. We combine QO with the output of the top network to generate part appearance parameters QA. We sample zocc and zapp to construct ẑ as described in Section 2.2.2, which is input to the decoder network. We also visualize the corresponding priors for the latents zapp and zocc in the dashed gray boxes.

2.2.1 VAE Review

The standard VAE framework assumes a generative model for data where first a latent z is sampled from a prior p(z) and then the data is generated from a conditional distribution G(x|z). A variational approximation Q(z|x) to the true intractable posterior is introduced, and the model
is learned by minimizing the following negative variational lower bound (ELBO):

$$\mathcal{L}_{\text{VAE}}(x) = -\mathbb{E}_{z \sim Q(z|x)}\left[\log G(x|z)\right] + D_{\text{KL}}\left[Q(z|x) \,\|\, p(z)\right] \quad (2.1)$$

where Q(z|x) is often referred to as an encoder, as it can be viewed as mapping data to the latent space, while G(x|z) is referred to as a decoder (or generator) that can be viewed as mapping latents to the data space. Both Q and G are commonly parameterized as neural networks. Figure 2.2a shows the commonly used VAE architecture. If the conditional G(x|z) takes a Gaussian form, the negative log-likelihood in the first term on the RHS of Equation (2.1) becomes the mean squared error between the generator output x̂ = G(x|z) and the input data x. In the second term, the prior p(z) is assumed to be a multivariate normal distribution with zero mean and identity covariance, N(0, I), and the loss simplifies to

$$\mathcal{L}_{\text{VAE}}(x) = \|x - \hat{x}\|^2 + D_{\text{KL}}\left[Q(z|x) \,\|\, \mathcal{N}(0, I)\right] \quad (2.2)$$

When G and Q are differentiable, the entire model can be trained with SGD using the reparameterization trick [23]. Matthey et al. [24] propose an extension for learning disentangled representations by incorporating a weight factor β for the KL divergence term, yielding

$$\mathcal{L}_{\beta\text{-VAE}}(x) = \|x - \hat{x}\|^2 + \beta D_{\text{KL}}\left[Q(z|x) \,\|\, \mathcal{N}(0, I)\right] \quad (2.3)$$

The VAE framework aims to learn a generative model for images where the latents z represent the corresponding low-dimensional generating factors. The latents z can therefore be treated as image representations that capture the necessary details about images. However, we postulate that representations produced by the standard VAE framework are not ideal for recognition, as they are learned to capture all details rather than capturing 'interesting' aspects of the data and dropping the rest. This is not surprising since their formulation does not encourage learning semantic information. For learning semantic representations, in the absence of any relevant supervision (as is available in self-supervised approaches), inductive biases have to be introduced. Therefore, taking inspiration from works on unsupervised mid-level pattern discovery [20, 21, 22], we propose a formulation that encourages the encoder to only encode the few parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images.

Since the VAE framework provides a principled way of learning a mapping from image to latent space, we consider it ideal for our proposed extension. We choose β-VAEs for their simplicity and widespread use. In Section 2.2.2, we describe our approach in detail, and in Section 2.2.4 we propose a modification of the reconstruction error computation that biases the error term towards foreground high-energy regions (similar to the biased initial sampling of patterns in Singh et al. [20]).

2.2.2 PatchVAE

Given an image x, let f = ϕ(x) be a deterministic mapping that produces a 3D representation f of size h × w × d_e, with a total of L = h × w locations (grid cells). We aim to encourage the encoder network to only encode parts of an image that correspond to highly repetitive patches. For example, a random patch of noise is unlikely to occur frequently, whereas patterns like faces, wheels, windows, etc. repeat across multiple images. In order to capture this intuition, we force the representation f to be useful for predicting frequently occurring parts in an image, and we use just these predicted parts to reconstruct the image. We achieve this by transforming f to ẑ, which encodes a set of parts at a small subset of the L locations on the grid cells. We refer to ẑ as "patch latent codes" for an image.
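Before describing how this bottleneck is learned, the following is a minimal PyTorch-style sketch of the β-VAE objective in Equation (2.3) that PatchVAE builds on; the Gaussian posterior parameterization and the reduction choices are illustrative assumptions rather than the exact implementation used in this chapter.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """beta-VAE objective of Equation (2.3).

    x, x_hat  : (B, C, H, W) input images and reconstructions
    mu, logvar: (B, D) parameters of the Gaussian posterior Q(z|x)
    """
    # L2 reconstruction term ||x - x_hat||^2, summed over pixels, averaged over the batch
    recon = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
    return recon + beta * kl
```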
Next we describe how we re-tool the β-VAE framework to learn these local latent codes. We first describe our setup for a single part and follow it up with a generalization to multiple parts (Section 2.2.3).

Image Encoding. Given the image representation f = ϕ(x), we want to learn part representations at each grid location l (where l ∈ {1, . . . , L}). A part is parameterized by its appearance z_app and its occurrence z_occ^l (i.e., the presence or absence of the part at grid location l). We use two networks, Q_f^A and Q_f^O, to parameterize the posterior distributions Q_f^A(z_app | f) and Q_f^O(z_occ^l | f) of the part parameters z_app and z_occ^l respectively. Since the mapping f = ϕ(x) is deterministic, we can re-write these distributions as Q_f^A(z_app | ϕ(x)) and Q_f^O(z_occ^l | ϕ(x)), or simply Q^A(z_app | x) and Q^O(z_occ^l | x). Therefore, given an image x, the encoder networks estimate the posteriors Q^A(z_app | x) and Q^O(z_occ^l | x). Note that f is a deterministic feature map, whereas z_app and z_occ^l are stochastic.

Image Decoding. We utilize a generator or decoder network G that, given z_occ and z_app, reconstructs the image. First, we sample a part appearance ẑ_app (d_p-dimensional, continuous) and then sample part occurrences ẑ_occ^l (L-dimensional, binary), one for each location l, from the posteriors

$$\hat{z}_{\text{app}} \sim Q^A(z_{\text{app}} \mid x), \qquad \hat{z}_{\text{occ}}^l \sim Q^O(z_{\text{occ}}^l \mid x), \quad \text{where } l \in \{1, \dots, L\} \quad (2.4)$$

Next, we construct a 3D representation ẑ by placing ẑ_app at every location l where the part is present (i.e., ẑ_occ^l = 1). This can be implemented by a broadcasted product of ẑ_app and ẑ_occ. We refer to ẑ as the patch latent code. Again note that f is deterministic and ẑ is stochastic. Finally, a deconvolutional network takes ẑ as input and generates an image x̂. This image generation process can be written as

$$\hat{x} \sim G\left(x \mid z_{\text{occ}}^1, z_{\text{occ}}^2, \dots, z_{\text{occ}}^L, z_{\text{app}}\right) \quad (2.5)$$

Since all latent variables (z_occ^l for all l, and z_app) are independent of each other, they can be stacked as

$$z_p = \left[z_{\text{occ}}^1;\, z_{\text{occ}}^2;\, \dots;\, z_{\text{occ}}^L;\, z_{\text{app}}\right]. \quad (2.6)$$

This enables us to use a simplified notation (refer to Equation (2.4) and Equation (2.5)):

$$\hat{z}_p \sim Q^{\{A,O\}}(z_p \mid x), \qquad \hat{x} \sim G(x \mid z_p) \quad (2.7)$$

Note that despite the additional structure, our model still resembles the setup of variational auto-encoders. The primary difference arises from: (1) the use of discrete latents for part occurrence, (2) a patch-based bottleneck imposing additional structure on the latents, and (3) feature assembly for the generator.

Training. We use the training setup of β-VAE and maximize the variational lower bound to train the encoder and decoder jointly (described in Section 2.2.1). The posterior Q^A, which captures the appearance of a part, is assumed to be a Normal distribution with zero mean and identity covariance, N(0, I). The posterior Q^O, which captures the presence or absence of a part, is assumed to be a Bernoulli distribution Bern(z_occ^prior) with prior z_occ^prior. Therefore, the ELBO for our approach can be written as (refer to Equation (2.3)):

$$\mathcal{L}_{\text{PatchVAE}}(x) = -\mathbb{E}_{z_p \sim Q^{\{A,O\}}(z_p \mid x)}\left[\log G(x \mid z_p)\right] + \beta D_{\text{KL}}\left[Q^{\{A,O\}}(z_p \mid x) \,\|\, p(z_p)\right] \quad (2.8)$$

where the D_KL term can be expanded as:

$$D_{\text{KL}}\left[Q^{\{A,O\}}(z_p \mid x) \,\|\, p(z_p)\right] = \beta_{\text{occ}} \sum_{l=1}^{L} D_{\text{KL}}\left(Q^O(z_{\text{occ}}^l \mid x) \,\|\, \text{Bern}(z_{\text{occ}}^{\text{prior}})\right) + \beta_{\text{app}}\, D_{\text{KL}}\left(Q^A(z_{\text{app}} \mid x) \,\|\, \mathcal{N}(0, I)\right) \quad (2.9)$$

Implementation details. As discussed in Section 2.2.1, the first and second terms on the RHS of Equation (2.8) can be trained using the L2 reconstruction loss and the reparameterization trick [23]. In addition, we also need to compute a KL divergence loss for part occurrence.
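A minimal sketch of how the objective in Equations (2.8)-(2.9) can be assembled from its three terms is shown below; it assumes a Gaussian appearance posterior and a Bernoulli occurrence posterior as stated above, and the function signature and weighting defaults are illustrative assumptions, not the exact implementation.

```python
import torch
from torch.distributions import Bernoulli, Normal, kl_divergence

def patchvae_loss(x, x_hat, app_mu, app_logvar, occ_probs,
                  z_occ_prior=0.1, beta_occ=1.0, beta_app=1.0):
    """Single-part PatchVAE objective (Eqs. 2.8-2.9).

    x, x_hat          : (B, C, H, W) images and reconstructions
    app_mu, app_logvar: (B, d_p) Gaussian posterior Q_A(z_app | x)
    occ_probs         : (B, 1, h, w) Bernoulli posterior Q_O(z_occ^l | x)
    """
    # Reconstruction term (L2), summed over pixels, averaged over the batch
    recon = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()

    # KL of the appearance posterior against N(0, I)
    q_app = Normal(app_mu, torch.exp(0.5 * app_logvar))
    p_app = Normal(torch.zeros_like(app_mu), torch.ones_like(app_mu))
    kl_app = kl_divergence(q_app, p_app).sum(dim=1).mean()

    # KL of the per-location occurrence posterior against Bern(z_occ_prior)
    q_occ = Bernoulli(probs=occ_probs)
    p_occ = Bernoulli(probs=torch.full_like(occ_probs, z_occ_prior))
    kl_occ = kl_divergence(q_occ, p_occ).sum(dim=(1, 2, 3)).mean()

    return recon + beta_occ * kl_occ + beta_app * kl_app
```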
Learning a discrete probability distribution is challenging, since no gradient is defined for backpropagating the reconstruction loss through the stochastic layer at the decoder, even when using the reparameterization trick. Therefore, we use the relaxed-Bernoulli approximation [53, 54] for training the part occurrence distributions z^l_occ.

For an H × W image, the network φ(x) first generates feature maps of size h × w × d_e, where (h, w) are the spatial dimensions and d_e is the number of channels; therefore, the number of locations is L = h × w. The encoders Q^A_f(z_app | f) and Q^O_f(z^l_occ | f) are single-layer neural networks that compute z_app and z^l_occ. z^l_occ is an (h × w × 1)-dimensional multivariate Bernoulli parameter and z_app is a (1 × 1 × d_p)-dimensional multivariate Gaussian, where d_p is the length of the latent vector for a single part. The input to the decoder, ẑ, is (h × w × d_p)-dimensional. In all experiments, we fix h = H/8 and w = W/8.

Constructing z_app. Notice that f is an (h × w × d_e)-dimensional feature map and z^l_occ is an (h × w × 1)-dimensional binary output, but z_app is a (1 × 1 × d_p)-dimensional feature vector. If Σ_l z^l_occ > 1, the part occurs at multiple locations in an image. Since all these locations correspond to the same part, their appearance should be the same. To incorporate this, we take the weighted average of the part appearance features at each location, weighted by the probability that the part is present there. Since we use probability values for averaging, the result is deterministic. This operation is encapsulated by the Q^A encoder (refer to Figure 2.2b). During image generation, we sample ẑ_app once and replicate it at each location where ẑ^l_occ = 1. During training, this forces the model to: (1) only predict ẑ^l_occ = 1 where similar-looking parts occur, and (2) learn a common representation for the part that occurs at these locations. Note that z_app can be modeled as a mixture of distributions (e.g., a mixture of Gaussians) to capture complicated appearances. However, in this work we assume that convolutional neural network based encoders are powerful enough to map the variable appearance of semantic concepts to similar feature representations, so we restrict ourselves to a single Gaussian distribution.

2.2.3 PatchVAE with multiple parts

Next, we extend the framework described above to use multiple parts. To use N parts, we use N × 2 encoder networks Q^{A(i)}(z^(i)_app | x) and Q^{O(i)}(z^l(i)_occ | x), where z^(i)_app and z^l(i)_occ parameterize the i-th part. Again, this can be implemented efficiently as two networks by concatenating the outputs together. The image generator samples ẑ^(i)_app and ẑ^l(i)_occ from the outputs of these encoder networks and constructs ẑ^(i). We obtain the final patch latent code ẑ by concatenating all ẑ^(i) along the channel dimension. Therefore, ẑ^(i) is an (h × w × d_p)-dimensional and ẑ an (h × w × (N × d_p))-dimensional stochastic feature map. For this multiple-part case, Equation (2.6) can be written as:

z_P = [z^(1)_p; z^(2)_p; …; z^(N)_p]  where  z^(i)_p = [z^1(i)_occ; z^2(i)_occ; …; z^L(i)_occ; z^(i)_app].    (2.10)

Similarly, Equation (2.8) and Equation (2.9) can be written as:

L_MultiPatchVAE(x) = −E_{z_P}[log G(x | z_P)] + β_occ Σ_{i=1}^{N} Σ_{l=1}^{L} D_KL(Q^{O(i)}(z^l(i)_occ | x) ∥ Bern(z^prior_occ)) + β_app Σ_{i=1}^{N} D_KL(Q^{A(i)}(z^(i)_app | x) ∥ N(0, I))    (2.11)

The training details and assumptions on the posteriors follow the previous section.
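The two ingredients above, probability-weighted averaging of appearance features and relaxed-Bernoulli sampling of occurrences, could look roughly as follows in PyTorch; the tensor shapes and helper names are assumptions made for illustration only.

```python
import torch

def pooled_appearance(app_map, occ_prob):
    """Probability-weighted average of per-location appearance features.

    app_map : (B, d_p, h, w) per-location appearance features for one part
    occ_prob: (B, 1, h, w) occurrence probabilities for that part
    Returns a single (B, d_p) appearance vector per image, as described
    in 'Constructing z_app'.
    """
    w = occ_prob / (occ_prob.sum(dim=(2, 3), keepdim=True) + 1e-8)
    return (app_map * w).sum(dim=(2, 3))

def relaxed_bernoulli_sample(logits, temperature=1.0):
    """Relaxed-Bernoulli (binary Gumbel-softmax) sample for z_occ [53, 54]."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic_noise) / temperature)
```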
2.2.4 Improved Reconstruction Loss

The L2 reconstruction loss used for training β-VAEs (and other reconstruction-based approaches) gives equal importance to every region of an image. This might be reasonable for tasks like image compression and image denoising. However, for the purpose of learning semantic representations, not all regions are equally important. For example, "sky" and "walls" occupy large portions of an image, whereas concepts like "windows," "wheels," and "faces" are comparatively smaller, but arguably more important. To incorporate this intuition, we use a simple strategy that weighs the regions of an image in proportion to their gradient energy. More concretely, we compute the Laplacian of an image to get the per-pixel gradient intensity and average the gradient magnitudes over 8 × 8 local patches. The weight multiplier for the reconstruction loss of each 8 × 8 patch is proportional to the average gradient magnitude of the patch, and all weights are normalized to sum to one. We refer to this as the weighted loss (L_w). Note that this is similar to the gradient-energy-biased sampling of mid-level patches used in [20, 21]. Examples of weight masks are provided in the supplemental material.

In addition, we also consider an adversarial training strategy from GANs to train VAEs [40], where the discriminator network from the GAN implicitly learns to compare images and provides a more abstract reconstruction error for the VAE. We refer to this variant with the 'GAN' suffix in experiments. In Section 2.3.2, we demonstrate that the proposed weighted loss (L_w) is complementary to the discriminator loss from adversarial training, and that together these losses result in better recognition capabilities for both β-VAE and PatchVAE.
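A minimal sketch of how such a gradient-energy weight mask could be computed is shown below; the Laplacian kernel and 8 × 8 patch size follow the text, but the exact normalization and grayscale conversion are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_energy_weights(x, patch=8):
    """Per-patch weights for the weighted reconstruction loss L_w (Sec. 2.2.4).

    x: (B, C, H, W) image batch.
    Returns a (B, 1, H, W) weight map that is constant within each 8x8 patch
    and normalized per image.
    """
    gray = x.mean(dim=1, keepdim=True)                      # (B, 1, H, W)
    lap_kernel = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]], device=x.device).view(1, 1, 3, 3)
    grad = F.conv2d(gray, lap_kernel, padding=1).abs()      # per-pixel gradient energy
    per_patch = F.avg_pool2d(grad, patch)                   # average over 8x8 patches
    weights = per_patch / per_patch.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8)
    return F.interpolate(weights, scale_factor=patch, mode="nearest")

def weighted_l2(x, x_hat):
    """Reconstruction error re-weighted by gradient energy."""
    w = gradient_energy_weights(x)
    return ((x - x_hat) ** 2 * w).sum(dim=(1, 2, 3)).mean()
```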
2.3 Experiments

Datasets. We evaluate PatchVAE on the CIFAR100 [25], MIT Indoor Scene Recognition [26], Places [27], and ImageNet [28] datasets. CIFAR100 consists of 60k 32 × 32 color images from 100 classes, with 600 images per class; there are 50000 training images and 10000 test images. The Indoor dataset contains 67 categories and a total of 15620 images; the train and test subsets consist of 80 and 20 images per class respectively. The Places dataset has 2.5 million images from 205 categories. The ImageNet dataset has over a million images from 1000 categories.

Learning paradigm. In order to evaluate the utility of PatchVAE features for recognition, we set up the learning paradigm as follows: we first train the model in an unsupervised manner on all training images. After that, we discard the generator network and use only part of the encoder network φ(x) to train a supervised model on the classification task of the respective dataset. We study different training strategies for the classification stage, as discussed later.

Training details. In all experiments, we use the following architectures. For CIFAR100, Indoor67, and Places205, φ(x) has a conv layer followed by two residual blocks [55]. For ImageNet, φ(x) is a ResNet18 model (a conv layer followed by four residual blocks). For all datasets, Q^A and Q^O have a single conv layer each. For classification, we start from φ(x) and add a fully-connected layer with 512 hidden units followed by a final fully-connected layer as the classifier. More details can be found in the supplemental material. During the unsupervised learning phase, all methods are trained for 90 epochs for CIFAR100 and Indoor67, 2 epochs for Places205, and 30 epochs for ImageNet. All methods use the ADAM optimizer with an initial learning rate of 1 × 10⁻⁴ and a minibatch size of 128. For the relaxed Bernoulli in Q^O, we start with a temperature of 1.0 and an annealing rate of 3 × 10⁻⁵ (following the details in [54]). For training the classifier, all methods use stochastic gradient descent (SGD) with momentum and a minibatch size of 128; the initial learning rate is 1 × 10⁻² and we reduce it by a factor of 10 every 30 epochs. The classifiers are trained for 90 epochs for CIFAR100 and Indoor67, 5 epochs for Places205, and 30 epochs for ImageNet.

Table 2.1: Classification results on CIFAR100, Indoor67, and Places205. We initialize the classification model with the representations φ(x) learned from the unsupervised learning task. The model φ(x) comprises a conv layer followed by two residual blocks (each having 2 conv layers). The 'Conv1' column reports Top-1 classification accuracy with the first conv layer of the pre-trained model frozen; the 'Conv[1-3]' and 'Conv[1-5]' columns correspond to results with the first three and first five conv layers frozen respectively. Details in Section 2.3.1.

                       CIFAR100                        Indoor67                        Places205
Model                  Conv1   Conv[1-3]  Conv[1-5]   Conv1   Conv[1-3]  Conv[1-5]   Conv1   Conv[1-3]  Conv[1-5]
β-VAE                  44.12   39.65      28.57       20.08   17.76      13.06       28.29   24.34      8.89
β-VAE + Lw             44.96   40.30      28.33       21.34   19.48      13.96       29.43   24.93      9.41
β-VAE-GAN              44.69   40.13      29.89       19.10   17.84      13.06       28.48   24.51      9.72
β-VAE-GAN + Lw         45.61   41.35      31.53       20.45   18.36      14.33       29.63   25.26      10.66
PatchVAE               43.07   38.58      28.72       20.97   19.18      13.43       28.63   24.95      11.09
PatchVAE + Lw          43.75   40.37      30.55       23.21   21.87      15.45       29.39   26.29      12.07
PatchVAE-GAN           44.45   40.57      31.74       21.12   19.63      14.55       28.87   25.25      12.21
PatchVAE-GAN + Lw      45.39   41.74      32.65       22.46   21.87      16.42       29.36   26.30      13.39
BiGAN                  47.72   41.89      31.58       21.64   17.09      9.70        30.06   25.11      10.82
ImageNet Pretrained    55.99   54.99      54.36       45.90   45.82      40.90       37.08   36.46      31.26

2.3.1 Downstream classification performance

In Table 2.1, we report the top-1 classification results on the CIFAR100, Indoor67, and Places205 datasets for all methods with different training strategies for classification. First, we keep all the pre-trained weights in φ(x) from the unsupervised task frozen and only train the two newly added conv layers in the classification network (reported under the column 'Conv[1-5]'). We notice that our method (with different losses) generally outperforms the β-VAE counterpart by a healthy margin. This shows that the representations learned by the PatchVAE framework are better for recognition than those learned by β-VAEs. Moreover, better reconstruction losses ('GAN' and L_w) generally improve both β-VAE and PatchVAE, and are complementary to each other.

Next, we fine-tune the last residual block along with the two conv layers ('Conv[1-3]' column). We observe that PatchVAE performs better than the VAE under all settings except for CIFAR100 with just the L2 loss. However, when using better reconstruction losses, the performance of PatchVAE improves over β-VAE. Similarly, we fine-tune all but the first conv layer and report the results in the 'Conv1' column.
Again, we notice similar trends: our method generally performs better than β-VAE on the Indoor67 and Places205 datasets, while β-VAE performs better on CIFAR100 by a small margin. When compared to BiGAN, PatchVAE representations are better on all datasets ('Conv[1-5]') by a large margin. However, when fine-tuning the pre-trained weights, BiGAN performs better on two out of four datasets. For completeness, we also report results using weights in φ(x) pre-trained on the supervised ImageNet classification task (last row, Table 2.1). Overall, the results indicate that PatchVAE learns better semantic representations than β-VAE.

ImageNet results. Finally, we report results on the large-scale ImageNet benchmark in Table 2.2. For these experiments, we use the ResNet18 [55] architecture for all methods. All weights are first learned using the unsupervised task. Then, we fine-tune the last two residual blocks and train the two newly added conv layers in the classification network (therefore, the first conv layer and the following two residual blocks are frozen). We notice that the PatchVAE framework outperforms β-VAE under all settings, and the proposed weighted loss helps both approaches. The last row in Table 2.2 reports classification results of the same architecture randomly initialized and trained end-to-end on ImageNet using supervised training, for comparison.

Table 2.2: ImageNet classification results using ResNet18. We initialize weights using the unsupervised task and fine-tune the last two residual blocks. Details in Section 2.3.1.

Model                 Top-1 Acc.   Top-5 Acc.
β-VAE                 44.45        69.67
PatchVAE              47.01        71.71
β-VAE + Lw            47.28        71.78
PatchVAE + Lw         47.87        72.49
ImageNet Supervised   61.37        83.79

Baselines. We use the β-VAE model (Section 2.2.1) as our primary baseline. In addition, we use the weighted loss and the discriminator loss, resulting in the β-VAE-* family of baselines. We also compare against the BiGAN model from [39]. We use similar backbone architectures for the encoder/decoder (and discriminator, if present) across all methods, and tried to keep the number of parameters in the different approaches comparable to the best of our ability. Exact architecture details can be found in the supplemental material.

2.3.2 Qualitative Results

We present qualitative results to validate our hypothesis. First, we visualize whether the structure we impose on the VAE bottleneck is able to capture the occurrence and appearance of important parts of images. We visualize the PatchVAE trained on images from the CIFAR100 and ImageNet datasets in the following ways.

[Figure 2.3 panels: CIFAR100 and ImageNet images alongside their encoded occurrence maps.]
Figure 2.3: Encoded part occurrence maps discovered on CIFAR100 and ImageNet. Each row represents a different part.

[Figure 2.4 panels: Part 2 (spaghetti squash, butternut squash, orange, lemon); Part 55 (killer whale, coral reef, scuba diver, grocery store); Parts 20, 31, 0 (radiator grille, ambulance); Parts 36, 17 (jack-o'-lantern, Christmas stocking, orange).]
Figure 2.4: A few representative examples for several parts, qualitatively demonstrating the visual concepts captured by PatchVAE. For each part, we crop image patches centered on the location where the part is predicted to be present. Selected patches are sorted by part occurrence probability as the score, and we manually select a diverse set from the top-50 occurrences in the training images. As can be seen, a single part may capture a diverse set of concepts that are similar in shape or texture or occur in a similar context, but belong to different categories. We show which categories the patches come from (note that category information was not used while training the model).

[Figure 2.5 columns: source image and part occurrence map, target image and part occurrence map, reconstruction, swapped reconstruction.]
Figure 2.5: Swapping source and target part appearance. Columns 1 and 2 show a source image with the occurrence map of one of its parts.
We can swap the appearance vector of this part with the appearance vectors of a different part in target images. Columns 3 and 4 show three target images with occurrence maps of one of their parts. Observe the change in the reconstructions (columns 5 and 6) as we bring in the new appearance vector. The new reconstruction inherits properties of the source at specific locations in the target.

Concepts captured. First, we visualize the part occurrences in Figure 2.3. We can see that the parts capture round (fruit-like) shapes in the top row and faces in the second row, regardless of the class of the image. Similarly, for ImageNet, the occurrence map of a specific part in images of chickens focuses on the head and neck. Note that these parts are semantically more informative than the texture or color that a β-VAE can capture. In Figure 2.4, we show parts captured by the ImageNet model by cropping a region of the image centered around the occurring part. We can see that parts are able to capture multiple concepts that are similar in shape, texture, or the context in which they occur.

Swapping appearances. Using PatchVAE, we can swap the appearance of a part with the appearance vector of another part from a different image. In Figure 2.5, keeping the occurrence map the same for a target image, we modify the appearance of a randomly chosen part and observe the change in the reconstructed image. We notice that, given the same source part, the decoder tries similar things across different target images. However, the reconstructions are worse, since the decoder has never encountered this particular combination of part appearances before.

Discriminative vs. generative strength. By design, PatchVAE compromises generative capability to learn more discriminative features. To quantify this, we use the images reconstructed by the β-VAE and PatchVAE models (trained on ImageNet) and compute three different metrics to measure the quality of the reconstructions of test images. Table 2.7 shows that β-VAE is better at reconstruction.

Table 2.7: Reconstruction metrics on ImageNet. PatchVAE sacrifices reconstruction quality to learn discriminative parts, resulting in higher recognition performance (Table 2.2).

Model       PSNR ↑   FID ↓     SSIM ↑
β-VAE       4.857    108.741   0.289
PatchVAE    4.342    113.692   0.235

2.4 Ablation Analysis

We study the impact of the various hyper-parameters used in our experiments. For this evaluation, we follow the same setup as the 'Conv[1-5]' column of Table 2.1 and keep all hyperparameters from the previous section. We use the CIFAR100 and Indoor67 datasets for the ablation analysis.

Maximum number of patches. N is the maximum number of parts used in our framework. Depending on the dataset, a higher value of N can provide a wider pool of patches to pick from. However, it can also make the unsupervised learning task harder, since in a minibatch of images we might not get many repeated patches. Table 2.3 shows the effect of N on the CIFAR100 and Indoor67 datasets. We observe that while increasing the number of patches improves the discriminative power in the case of CIFAR100, it has little or negative effect in the case of Indoor67. A possible reason for this decline in performance is the smaller size of the dataset (i.e., fewer images to learn from).

Table 2.3: Effect of N: Increasing the maximum number of patches increases the discriminative power for CIFAR100 but has little or negative effect for Indoor67.

N     CIFAR100   Indoor67
4     27.59      14.40
8     28.74      12.69
16    28.94      14.33
32    27.78      13.28
64    29.00      12.76
Number of hidden units for patch appearance ẑ_app. Next, we study the impact of the number of channels in the appearance feature ẑ_app of each patch (d_p). This parameter reflects the capacity of an individual patch's latent representation. While it affects the reconstruction quality of images, we observed that it has little or no effect on the classification performance of the base features. Results are summarized in Table 2.4 for both the CIFAR100 and Indoor67 datasets.

Table 2.4: Effect of d_p: Increasing the number of hidden units for a patch has very little impact on classification performance.

d_p   CIFAR100   Indoor67
3     28.63      14.25
6     28.97      14.55
9     28.21      14.55

Prior probability of patch occurrence z^prior_occ. In all our experiments, the prior probability for a patch is fixed to 1/N, i.e., the inverse of the maximum number of patches. The intuition is to encourage each location on the occurrence maps to fire for at most one patch. Increasing this patch occurrence prior allows all patches to fire at the same location. While this would make the reconstruction task easier, it becomes harder for individual patches to capture anything meaningful. Table 2.5 shows the deterioration of classification performance as z^prior_occ increases.

Table 2.5: Effect of z^prior_occ: Increasing the prior probability of patch occurrence has an adverse effect on classification performance.

z^prior_occ   CIFAR100   Indoor67
0.01          28.86      14.33
0.05          28.67      14.25
0.1           28.31      14.03

Patch occurrence loss weight β_occ. The weight of the patch occurrence KL divergence has to be chosen carefully. If β_occ is too low, more patches can fire at the same location, which harms the learning capability of the patches; if β_occ is too high, the decoder will not receive any patches to reconstruct from, and both reconstruction and classification will suffer. Table 2.6 summarizes the impact of varying β_occ.

Table 2.6: Effect of β_occ: Too high or too low a β_occ can deteriorate the performance of the learned representations.

β_occ   CIFAR100   Indoor67
0.06    30.11      14.10
0.3     30.37      15.67
0.6     28.90      13.51

Chapter 3: Learning to Signal Mid-level Patches in Referential Games

The ability to communicate using language is a signature characteristic of intelligence [56]. Language provides a structured platform for agents not only to collaborate with each other and accomplish certain goals, but also to represent and store information in a compressible manner. Most importantly, language allows us to build infinitely many new concepts by composing known concepts. These qualities are shared by both the natural languages used in human-human communication and the programming languages used in human-machine communication. The study of the evolution of language can hence give us insights into intelligent machines that can communicate [57].

Our goal in this chapter is to develop and understand an emergent language, i.e., a language that emerges when two neural network agents try to communicate with each other. Clark [58] argued that supervised approaches consisting of a single agent learning statistical relationships among symbols do not capture the functional relationships between the symbols, i.e., the use of symbols leading to an action or an outcome. Krishna et al. [59] argued the same viewpoint in the context of images.
We therefore turn to recent works in emergent language [60, 61, 62, 63, 64, 65], which show that a communication protocol can be developed or learned by two or more cooperative agents trying to solve a task. The choice of task is quintessential, since a language derives meaning from its use [66]. We choose a task where two agents, a speaker and a listener, play a referential game, a type of signaling game first proposed by Lewis [67]. The speaker agent receives a target image and sends a message to the listener. The message consists of discrete symbols or words capturing different parts of the image. The listener receives another view of the target image, together with one or more distractor images. The goal of the speaker and the listener agents is to maximize the agreement between the message and the target image. Figure 3.1 illustrates the overview of the proposed referential game.

In computer vision, a number of attempts have been made to represent images as visual words [68], with a focus on low-level feature descriptors such as SIFT [69] and SURF [70]. Recent works in deep learning have attempted to describe an entire image with a fixed number of discrete symbols [71, 72, 73]; however, we postulate that large images contain a lot of redundant information, and a good visual representation should focus on only the "interesting" parts of the image. To discover what constitutes the interesting part of an image, we take inspiration from works on mid-level patches [22, 74, 75], the patches in an image that are both representative and discriminative [12, 74]. This means they can be discovered in a large number of images (and hence are representative), but at the same time they should be discriminative enough to set an image apart from the other images in the dataset. Hence, the speaker agent in this work computes a symbolic representation in terms of these mid-level patches, as opposed to the entire image.

To summarize, we propose PatchGame, a referential game formulation where, given an image, the speaker sends a discrete signal in terms of mid-level patches, and the listener embeds these symbols to match them with another view of the same image in the presence of distractors. Compared to previous works [63, 64, 76], we make the following key changes:

• Agents in some of the prior works [63, 64, 76] have access to a pre-trained network, such as AlexNet [77] or VGG [78], for extracting features from images. In this work, the agents rely on training on a large-scale image dataset, and on the invariance introduced by various image augmentations, to learn the language in a self-supervised way [79].

• We propose a novel patch-based architecture for the speaker agent, which comprises two modules: (1) PatchSymbol, a multi-layer perceptron (MLP) that operates at the patch level and converts a given image patch into a sequence of discrete symbols, and (2) PatchRank, a ConvNet that looks at the complete image and ranks the importance of patches in a differentiable manner.

• We introduce a novel transformer-based architecture for the listener agent, consisting of two modules: (1) a language module that projects the message received from the speaker into a latent space, and (2) a vision module that projects the image into the same latent space. We use a contrastive loss in this latent space to train both the speaker and the listener agents simultaneously.

• We propose new protocols to evaluate each of the speaker's and listener's modules.
We assess the success of PatchGame via qualitative and quantitative evaluations of each of the proposed components, and by demonstrating some practical applications. First, we show that the speaker's PatchRank model does indicate important patches in the image. We use the top patches indicated by this model to classify ImageNet [28] images using a pre-trained Vision Transformer [8] and show that we can retain over 60% top-1 accuracy with just half of the image patches. Second, the listener's vision model (ResNet-18) achieves up to 30.3% top-1 accuracy using only k-NN (k = 20) classification, outperforming other state-of-the-art unsupervised approaches [12, 73] that learn discrete representations of images by 9%. Finally, we analyze the symbols learned by our model and the impact of several hyperparameters used in our experiments.

Figure 3.1: Overview of the referential game. We generate two random views of every image in the given batch of images. The speaker takes one of the views as input and generates a sequence of symbols (a message). The listener takes the message sent by the speaker and the second view of the image, and projects both into an embedding space. Both the speaker and listener agents learn by minimizing the contrastive loss (see Equation (3.3)) between the views in this embedding space.

3.1 Related Work

Referential games. Prior to the advent of deep learning, significant research in the field of emergent communication showed that a communication protocol can be developed or learned by agents playing a language game [60, 61, 62, 80, 81, 82, 83, 84]. However, the agents employed in these works were typically situated in a synthetic world and made several assumptions about that world, such as the availability of disentangled representations of objects with discrete properties. More recent works [63, 65, 85, 86, 87, 88, 89, 90] have employed deep learning methods to develop a discrete language for communication between the agents. Lazaridou et al. [63] used neural network agents represented by an MLP to communicate concepts about real-world pictures; they used a fixed-size message composed from a large vocabulary. Havrylov and Titov [64], Evtimova et al. [76], and Bouchacourt and Baroni [91] relax this assumption and allow communication via variable-length sequences. Havrylov and Titov [64] allow the speaker agent to use an LSTM [92] to construct a variable-length message. Havrylov and Titov [64] and Lazaridou et al. [65] show that even when agents are allowed to use variable-length sequences to represent a message, they tend to utilize the maximum possible length to achieve the best performance (in terms of communication success). The idea of using a Gumbel-softmax distribution [53, 93] with the straight-through trick [94] for learning a language in a multi-agent environment was concurrently proposed by Mordatch and Abbeel [95] and Havrylov and Titov [64]. They show that this technique yields more stable and faster training than the reinforcement learning used in several other approaches.

Evaluating emergent communication. Evaluating the emergent language turns out to be an equally challenging research problem. Existing approaches use successful completion of the task or the correlation between the learned language and semantic labels as evaluation metrics.
Lowe et al. [96] and Keresztury and Bruni [97] show that simple task success might not be a good or sufficient metric for evaluating the success of a game. They discuss heuristics and advocate measuring both positive signaling and positive listening independently to evaluate agents' communication. Andreas [98] provides a way of evaluating the compositional structure of learned representations.

Discrete autoencoders. In parallel to the works on emergent communication, there is a large body of research on learning discrete representations of images without labels, using some form of autoencoding or reconstruction [12, 71, 73, 99, 100, 101, 102, 103]. The focus of VQ-VAE [71] and VQ-VAE-2 [73] is to learn a discrete bottleneck using vector quantization. Once any image can be represented with these discrete symbols, a powerful generative model such as PixelCNN [71, 104] or a transformer [105, 106] is learned on top of the symbols to sample new images. PatchVAE [12] achieves the same using Gumbel-softmax and imposes additional structure on the bottleneck of VAEs. We argue that, because of the mismatch between the objectives of reconstruction and visual recognition, models trained with reconstruction-based losses do not capture meaningful representations in their symbols.

Self-supervised learning in vision. Self-supervised learning (SSL) methods, such as [107, 108, 109, 110, 111, 112, 113, 114], have shown impressive results in recent years on the downstream tasks of classification and object detection. Even though the bottleneck in these methods is continuous (not discrete symbols), they have been shown to capture semantic and spatial information about the contents of the image. Unlike SSL methods, the neural networks representing the two agents in our case do not share any weights. Also, note that the continuous nature of the representations learned by SSL techniques is fundamentally different from the symbolic representations used in language. Indeed, we show that a k-nearest-neighbor classifier built on the continuous representations learned by SSL methods can perform better than one built on a bag of words (or symbols). However, to the best of our knowledge, our work is one of the first attempts to make representations learned in a self-supervised way more communication- or language-oriented.

Comparison with Mihai and Hare [79, 115]. These works extend Havrylov and Titov [64] by training the speaker and listener agents end-to-end without using pre-trained networks. However, the prior works [64, 115] use a top-down approach to generate a discrete representation (or sentence) for an image, i.e., they compute an overall continuous embedding of the image and then generate one symbol of the sentence at a time using an LSTM. The computational cost of LSTMs is prohibitive when the length of a sentence is large, which is needed to describe complex images. Transformers, on the other hand, require constant time for variable-length sentences at the cost of increased memory (in the listener agent). However, generating variable-length sentences with a transformer-based speaker agent is non-trivial. To solve this, we propose a bottom-up approach: we first generate symbols for image patches and then combine them to form a sentence. This approach allows for computationally efficient end-to-end training.
Further, it allows the speaker to compose symbols corresponding to different parts of the image, instead of deducing them from a pooled 1D representation of the image.

3.2 Our Approach

We first introduce the notation and the referential game played by the agents in this work. We then provide details of the architectures of the different neural network agents and the loss function, and highlight the important differences between this work and prior literature. Code and pre-trained models to reproduce the results are provided.

3.2.1 Referential Game Setup

Figure 3.1 shows the overview of our referential game setup. Given a dataset of N images {x_i}_{i=1}^{N}, we formulate a referential game [67] played between two agents, a speaker S_θ and a listener L_φ, as follows. As in the setting of Grill et al. [114], we generate two "random views" for every image. A random view is generated by taking a 224 × 224 crop from a randomly resized image and adding one or more of the following augmentations: color jitter, horizontal flip, Gaussian blur, and/or solarization. This prevents the neural networks from learning a trivial solution and encourages the emergent language to capture the invariances induced by the augmentations. Given a batch of B images, we refer to the two views as {x_i}_{i=1}^{B} and {x′_i}_{i=1}^{B}. In each training iteration, one set of views is presented to the speaker agent and the other set is shown to the listener agent.

The speaker S_θ encodes each image x_i independently into a variable-length message m_i. Each message m is represented by a sequence of one-hot encoded symbols with a maximum possible length L and a fixed-size vocabulary V. The space of all possible messages sent by the speaker is of the order |V|^L. The input to the listener L_φ is the batch of messages {m_i}_{i=1}^{B} from the speaker, together with the second set of random views of the batch of images. The listener consists of a language module φ_text to encode messages and a vision module φ_vision to encode images. The goal of the listener is to match each message to its corresponding image. Specifically, for a batch of B (message, image) pairs, S_θ and L_φ are jointly trained to maximize the cosine similarities of the B actual pairs while minimizing the similarities of the (B² − B) incorrect pairs. For a target message m_j, the image x′_j (the augmented view of x_j) acts as the target image while all the other (B − 1) images act as distractors. And vice versa: for the image x′_k, the message m_k (encoded by the speaker as S_θ(x_k)) acts as the target message while all the other (B − 1) messages act as distractors. We use the following symmetric contrastive loss function, sometimes referred to as the InfoNCE loss in previous metric-learning works [72, 116]:

L_text = −Σ_{j=1}^{B} log [ exp(φ_text(m_j) · φ_vision(x′_j)/τ) / Σ_{k=1}^{B} exp(φ_text(m_j) · φ_vision(x′_k)/τ) ]    (3.1)

L_vision = −Σ_{k=1}^{B} log [ exp(φ_text(m_k) · φ_vision(x′_k)/τ) / Σ_{j=1}^{B} exp(φ_text(m_j) · φ_vision(x′_k)/τ) ]    (3.2)

L = (L_text + L_vision)/2    (3.3)

where τ is a constant temperature hyperparameter.

Figure 3.2: Speaker agent architecture. The upper branch represents the PatchSymbol module θ_symb and the lower branch represents the PatchRank module θ_rank; the panels illustrate per-patch symbol sampling, importance scores, normalization and soft-sorting, and the dot product that assembles the message from the image patches. Each module takes the raw image as input. PatchSymbol computes a message for each patch of a pre-defined size, using a fixed-size vocabulary and message length. PatchRank uses a ConvNet to compute weights, i.e., the importance of each patch. Note that the symbols for a patch are context-independent, whereas the importance of a patch depends on the context. The speaker agent combines the outputs of the two modules and sends a variable-length message to the listener.
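As a concrete reference, a minimal PyTorch-style sketch of the symmetric contrastive objective in Equations (3.1)-(3.3) is given below; it assumes the listener's two modules already produce L2-normalized embeddings and averages over the batch, both of which are implementation assumptions rather than details stated in the text.

```python
import torch
import torch.nn.functional as F

def patchgame_loss(text_emb, image_emb, tau=0.07):
    """Symmetric InfoNCE loss of Eqs. (3.1)-(3.3), batch-averaged.

    text_emb : (B, d) listener embeddings of the B messages, phi_text(m_j)
    image_emb: (B, d) listener embeddings of the B image views, phi_vision(x'_j)
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / tau              # (B, B) cosine similarities / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_text = F.cross_entropy(logits, targets)         # match each message to its image
    loss_vision = F.cross_entropy(logits.t(), targets)   # match each image to its message
    return 0.5 * (loss_text + loss_vision)
```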
The game setting used in our work is inspired by Lazaridou et al. [63] and Havrylov and Titov [64], but there are important differences. Both our speaker S_θ and listener L_φ agents are trained from scratch. This makes the game setting more challenging, since the agents cannot rely on pre-trained models, which have been shown to encode semantic and/or syntactic information present in natural language or images. Our training paradigm, where we show different views of the same image to the speaker and the listener, is inspired by the recent success of self-supervised learning in computer vision. Empirically, we observe that this leads to more stable training and prevents the neural networks from learning degenerate solutions. However, in contrast with such self-supervised approaches, our goal is to learn a discrete emergent language rather than continuous semantic representations. We discuss the architectures of the two agents in the following sections.

3.2.2 Speaker agent architecture

A desirable characteristic of the speaker agent is that it should encode the "important" components of an image with a variable-length sequence of discrete symbols. Previous works [64, 76, 91] have achieved this by first converting the image into a continuous deterministic latent vector and then using an LSTM network [92] to generate a sequence of hidden states, sampling from this sequence until a special end-of-sequence token (or the maximum length) is reached. As observed by [64, 65], in order to achieve the minimum loss, the model ends up always using the maximum allowable length. In our experiments as well, we observed that having an LSTM makes training slow and does not achieve the objective of encoding images in variable-length sequences. We propose to leverage two separate modules in the speaker agent S_θ to circumvent this problem: the first module, called PatchSymbol (θ_symb), is a 2-layer MLP that computes patch-level embeddings for the image; the second module, called PatchRank (θ_rank), is a small ConvNet that computes the rank or importance of each patch in the image.

PatchSymbol, θ_symb. The idea of encoding an image at the patch level is inspired by the works on discovering mid-level patches [22, 74, 75] in images. We use a simple 2-hidden-layer MLP to encode each C × S²-dimensional image patch x_patch into l vectors of log-probabilities [log p^1_1, …, log p^1_V], …, [log p^l_1, …, log p^l_V]. Here C is the number of (color) channels in the input image or patch, S is the spatial dimension of a square patch, V is the size of the vocabulary used to represent a single symbol, and l is the number of symbols used to encode each patch. Hence, an image of size C × H × W can be encoded using K = HW/S² patches, each consisting of l symbols. The vectors of log-probabilities allow us to sample from a categorical distribution over V categories, with a continuous relaxation obtained via the Gumbel-softmax trick [53, 93].
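Before writing this sampling out formally (Equations (3.4)-(3.6) below), a minimal sketch of straight-through Gumbel-softmax sampling of patch symbols could look as follows in PyTorch; the `patch_mlp` module and tensor shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def sample_patch_symbols(patches, patch_mlp, V=128, l=1, tau_s=1.0):
    """Straight-through Gumbel-softmax sampling of patch symbols.

    patches  : (K, C*S*S) flattened image patches
    patch_mlp: MLP mapping a patch to l*V logits (log-probabilities)
    Returns (K, l, V) one-hot symbols with gradients flowing through the
    soft Gumbel-softmax relaxation.
    """
    logits = patch_mlp(patches).view(-1, l, V)    # (K, l, V) log-probabilities
    # hard=True returns one-hot samples in the forward pass while the
    # backward pass uses the continuous relaxation.
    return F.gumbel_softmax(logits, tau=tau_s, hard=True, dim=-1)
```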
For a given vector [log p^j_1, …, log p^j_V], we draw i.i.d. samples g^j_i from the Gumbel(0, 1) distribution [117] and obtain a differentiable approximation of the argmax as follows:

[[log p^1_1, …, log p^1_V], …, [log p^l_1, …, log p^l_V]] = MLP(x_patch)    (3.4)

y^j = one_hot(argmax_i [log p^j_i]) ∼ { exp((log p^j_i + g^j_i)/τ_s) / Σ_{k=1}^{V} exp((log p^j_k + g^j_k)/τ_s) }_{i=1}^{V},  ∀j    (3.5)

θ_symb(x_patch) = y = {y^j}_{j=1}^{l}    (3.6)

where τ_s controls how close the approximation is to the argmax. The final output of the θ_symb network for the entire image x is L one-hot encoded V-dimensional symbols, where L = l × HW/S². In all our experiments, we fix V = 128 and l = 1.

PatchRank, θ_rank. An image might have many redundant patches encoded with the same symbols. The goal of the θ_rank network is to assign an importance score to each patch. Since the importance of a patch depends on the context and not on the patch alone, we use a small ResNet-9 [55] to compute an importance weight for each of the K = HW/S² patches. One possible way to use these importance weights is to simply normalize them to (0, 1) and repeat the Gumbel-softmax trick to sample important patches; the listener network L_φ would then see only the message consisting of "important" patches. However, we empirically observed that with a simple min-max or L2 normalization the network assigns high weights to every patch and effectively sends the entire sequence of length L to the listener. Instead, we propose to use the differentiable ranking algorithm of Blondel et al. [118] to convert the importance weights into soft ranks in {1, …, K} in O(K log K) time. This method works by constructing diff