ABSTRACT Title of dissertation: LEVERAGING DEEP GENERATIVE MODELS FOR ESTIMATION AND RECOGNITION Koutilya PNVR Doctor of Philosophy, 2023 Dissertation directed by: Professor David W. Jacobs Department of Electrical and Computer Engineering Generative models are a class of statistical models that estimate the joint probability distribution on a given observed variable and a target variable. In computer vision, generative models are typically used to model the joint proba- bility distribution of a set of real image samples assumed to be on a complex high- dimensional image manifold. The recently proposed deep generative architec- tures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (DMs) were shown to generate photo-realistic im- ages of human faces and other objects. These generative models also became popular for other generative tasks such as image editing, text-to-image, etc. As appealing as the perceptual quality of the generated images has become, the use of generative models for discriminative tasks such as visual recognition or ge- ometry estimation has not been well studied. Moreover, with different kinds of powerful generative models getting popular lately, it’s important to study their significance in other areas of computer vision. In this dissertation, we demon- strate the advantages of using generative models for applications that go beyond just photo-realistic image generation: Unsupervised Domain Adaptation (UDA) between synthetic and real datasets for geometry estimation; Text-based image segmentation for recognition. In the first half of the dissertation, we propose a novel generative-based UDA method for combining synthetic and real images when training networks to determine geometric information from a single image. Specifically, we use a GAN model to map both synthetic and real domains into a shared image space by translating just the domain-specific task-related information from respective domains. This is connected to a primary network for end-to-end training. Ide- ally, this results in images from two domains that present shared information to the primary network. Compared to previous approaches, we demonstrate an im- proved domain gap reduction and much better generalization between synthetic and real data for geometry estimation tasks such as monocular depth estimation and face normal estimation. In the second half of the dissertation, we showcase the power of a recent class of generative models for improving an important recognition task: text- based image segmentation. Specifically, large-scale pre-training tasks like im- age classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foun- dation models built using text-based latent diffusion techniques may learn se- mantic boundaries. This is because they must synthesize intricate details about all objects in an image based on a text description. Therefore, we present a tech- nique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature rep- resentations like RGB images or CLIP encodings for text-based image segmenta- tion. 
By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image seg- mentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques. LEVERAGING DEEP GENERATIVE MODELS FOR ESTIMATION AND RECOGNITION by Koutilya PNVR Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023 Advisory Committee: Professor David W. Jacobs, Chair/Advisor Professor Joseph Jaja Professor Behtash Babadi Professor Jia-Bin Huang Professor Maria K. Cameron (Dean’s representative) © Copyright by Koutilya PNVR 2023 Dedication To my family — Subrahmanyam Ponukupati Sudha Ponukupati Syamala Pisapati Sandilya Ponukupati Sruthi Ponukupati Vaishnavi Ponukupati Sindhura Purnima Vempati For their constant support, love, sacrifice and selflessness. ii Acknowledgments I wish to express my deepest gratitude to the remarkable individuals who have been instrumental in my Ph.D. journey, contributing immeasurably to my growth and success. First and foremost, I extend my sincere appreciation to my advisor, Prof. David Jacobs. Despite my non-computer-vision background, he offered me the invaluable opportunity to work closely with him. I am profoundly thankful for the numerous research meetings and brainstorming sessions, which not only broadened my research horizons but also nurtured my ability to approach com- plex problems. His unwavering consideration for my circumstances has left an indelible mark, and I couldn’t have asked for a more exceptional Ph.D. advisor. I am deeply honored to have Prof. Joseph JaJa, Prof. Behtash Babadi, Prof. Jia-Bin Huang, and Prof. Maria K. Cameron as members of my dissertation com- mittee. Their commitment to serving on my committee and providing invaluable feedback to enhance the quality of this dissertation is greatly appreciated. My gratitude extends to my remarkable mentors, Bharat Singh and Hao Zhou, who have been constant sources of support during the challenging phases of my Ph.D. Their participation in research meetings and continuous motivation to explore fresh perspectives on research problems have been transformative. In particular, Bharat’s close collaboration and the research skills he imparted are beyond measure. Their guidance has been pivotal, and I owe a significant portion of my progress to their precious mentorship. iii I would like to acknowledge Dr. Varaprasad Bandaru for providing me with opportunities from the early stages of my academic journey, beginning with my master’s program. His enduring belief in my capabilities and involvement in his remarkable research projects have been essential in broadening the breadth of my knowledge during my Ph.D. journey. My fellow research peers at the University of Maryland, including students from the research groups of Prof. David Jacobs, Prof. Abhinav Shrivastava, and Prof. Tom Goldstein, have been a constant source of enlightening discussions and camaraderie. 
I am grateful to my colleagues from internships, including Pallabi Ghosh, Behjat Siddiquie from Amazon, and Abhijit Bendale, Pranav Mistry from STAR Labs, for the wonderful opportunities they provided, exposing me to real-world experiences. My thanks go to the International Student and Scholar Services (ISSS), the graduate school, the staff at the ECE and CS departments, and UMIACS for their friendly, liberal, and supportive approach. I will cherish the memories of my student life and the warmth of the university. To my friends - Shankar Reddy, Dwith CYN, Pallavi Chirumamilla, Sai Deepika Regani, Anirudh Mothukuri, Likhith Anvhesh, Sriram Vasudevan, Sai Sreedhar Varada, Mounika Chintakayala, Raghuvaran Yaramasu, Avinash Bheem- ineni, Spandana Gorantla, Harika Vakkanthula, Sreeharsha Vardhan Annu, and Manvitha Sree who have been my pillars of strength, offering relentless sup- port and creating wonderful memories, I extend my heartfelt appreciation. Your iv friendships have not only aided my personal growth but have also made my Ph.D. journey exceptionally smooth, making you a cherished part of my family. I would also like to express my sincere thanks to Sindhura Purnima, who entered my life at a crucial stage, offering constant support and understanding. I wholeheartedly believe she is my lucky charm, bringing much-needed fortune at a precious time. Last, but certainly not least, I owe a profound debt of gratitude to my par- ents and family members. Their constant motivation and belief in me, through both the good and challenging times, have been the cornerstone of my journey. I am forever indebted to them for the unwavering support and sacrifices they made to help me reach the point where I stand today. v Table of Contents Acknowledgements iii List of Tables ix List of Figures x 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Outline and Contributions . . . . . . . . . . . . . . . . 4 1.2.1 Leveraging GANs for Unsupervised Geometry Estimation (Chapter 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Leveraging LDMs for Text-Based Segmentation (Chapter 3) 5 1.2.3 Bidirectional Convolutional LSTM for the Detection of Vi- olence in Videos (Appendix A) . . . . . . . . . . . . . . . . 6 2 GANs for Unsupervised Geometry Estimation 7 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2.1 Losses for Generative Network . . . . . . . . . . . 15 2.2.2.2 Losses for the Task Network . . . . . . . . . . . . . 17 2.2.2.3 Monocular Depth Estimation . . . . . . . . . . . . 17 2.2.2.4 Face Normal Estimation . . . . . . . . . . . . . . . 18 2.2.2.5 Overall loss . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 Monocular Depth Estimation . . . . . . . . . . . . . . . . . 19 2.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1.2 Implementation details . . . . . . . . . . . . . . . 20 2.3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.1.4 Generalization to Make3D . . . . . . . . . . . . . 23 2.3.2 Face Normal Estimation . . . . . . . . . . . . . . . . . . . . 25 vi 2.3.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . 
25 2.3.2.2 Implementation details . . . . . . . . . . . . . . . 25 2.3.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.3 Ablation studies . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3 LDMs for Text-Based Image Segmentation 33 3.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1.1 Text-based image segmentation . . . . . . . . . . . . . . . . 36 3.1.2 Text-to-Image synthesis . . . . . . . . . . . . . . . . . . . . 37 3.1.3 Semantics in generative models . . . . . . . . . . . . . . . . 38 3.2 LDMs for Text-Based Segmentation . . . . . . . . . . . . . . . . . . 39 3.2.1 ZNet: Leveraging Latent Space Features . . . . . . . . . . . 40 3.2.2 LD-ZNet: Leveraging Diffusion Features . . . . . . . . . . . 42 3.2.2.1 Visual-Linguistic Information in LDM Features . 43 3.2.2.2 LD-ZNet Architecture . . . . . . . . . . . . . . . . 44 3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Image Segmentation Using Text Prompts . . . . . . . . . . . 48 3.4.2 Generalization to AI Generated Images . . . . . . . . . . . . 51 3.4.3 Generalization to Referring Expressions . . . . . . . . . . . 56 3.4.4 Inference Time . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.5 Cross-attention vs Concat for LDM features . . . . . . . . . 58 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4 Conclusions and Future Work 63 4.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 A Bidirectional Convolutional LSTM for the Detection of Violence in Videos 66 A.1 Contributions and Proposed Approach . . . . . . . . . . . . . . . . 67 A.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.3.1 Spatiotemporal Encoder Architecture . . . . . . . . . . . . 71 A.3.1.1 Spatial Encoding . . . . . . . . . . . . . . . . . . . 72 A.3.1.2 Temporal Encoding . . . . . . . . . . . . . . . . . 73 A.3.1.3 Classifier . . . . . . . . . . . . . . . . . . . . . . . 75 A.3.2 Spatial Encoder Architecture . . . . . . . . . . . . . . . . . 76 A.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 A.5 Training Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.6.1 Hockey Fights and Movies . . . . . . . . . . . . . . . . . . . 78 A.6.2 Violent Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.6.3 Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . 80 vii A.6.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.6.4.1 Spatial vs Spatiotemporal Encoders . . . . . . . . 83 A.6.4.2 Elementwise Max Pooling vs. Last Encoding . . . 84 A.6.4.3 ConvLSTM vs. BiConvLSTM . . . . . . . . . . . . 85 A.6.4.4 AlexNet vs. VGG13 . . . . . . . . . . . . . . . . . 85 A.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Bibliography 88 viii List of Tables 2.1 Quantitative results for Monocular Depth Estimation (MDE) . . . 21 2.2 Generalization capability of SharinGAN for MDE . . . . . . . . . . 
24 2.3 Quantitative results for Face Normal estimation . . . . . . . . . . . 26 2.4 Quantitative results for Lighting Estimation . . . . . . . . . . . . . 29 2.5 Ablation study - Significance of SharinGAN module and recon- struction loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Ablation study - Significance of SharinGAN module and recon- struction loss on unseen make3D dataset . . . . . . . . . . . . . . . 31 3.1 Text-based image segmentation performance on PhraseCut . . . . 49 3.2 Generalization to our AIGI dataset . . . . . . . . . . . . . . . . . . 52 3.3 Generalization to Referring Image Segmentation datasets - Ref- COCO, RefCOCO+ and G-Ref . . . . . . . . . . . . . . . . . . . . . 57 3.4 Ablation studies - Cross-attn vs Concat . . . . . . . . . . . . . . . . 59 A.1 Quantitative results on Hockey, Movies and Violent Flows datasets 81 ix List of Figures 1.1 Generative models for domain adaptation . . . . . . . . . . . . . . 2 1.2 Illustration of the text-based image segmentation task . . . . . . . 3 2.1 Proposed way to reduce domain gap between synthetic and real data 8 2.2 SharinGAN architecture . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Qualitative results for Monocular Depth Estimation (MDE) . . . . 22 2.4 Visualization of regions corresponding to domain gap reduction - MDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5 Generalization capability of SharinGAN for MDE . . . . . . . . . . 25 2.6 Qualitative results for Face Normal Estimation . . . . . . . . . . . 27 2.7 Visualization of regions corresponding to domain gap reduction - FNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1 Latent diffusion model (LDM) containing visual linguistic infor- mation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Reconstructions from the first stage of the LDM . . . . . . . . . . . 40 3.3 Overview of the proposed ZNet and LD-ZNet architectures . . . . 41 3.4 Visual-linguistic semantic information in the internal features of a pretrained LDM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.5 LDM internal features into ZNet via Attention Pool . . . . . . . . . 45 3.6 Samples from AIGI dataset . . . . . . . . . . . . . . . . . . . . . . . 47 3.7 Qualitative comparison on the PhraseCut dataset . . . . . . . . . . 51 3.8 Qualitative comparison on the AIGI samples for text-based seg- mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.9 More qualitative comparison on the AIGI samples for text-based segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.10 More qualitative results of LD-ZNet from AIGI dataset . . . . . . . 56 3.11 LD-ZNet does well in multi-object segmentation - Good overall scene understanding . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.12 LD-ZNet’s ability to segment objects in animations, celebrity im- ages and illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 62 A.1 Overview of the Spatiotemporal architecture . . . . . . . . . . . . . 72 x A.2 Overview of a BiConvLSTM Cell . . . . . . . . . . . . . . . . . . . . 75 A.3 Overview of the Spatial encoder architecture . . . . . . . . . . . . 76 A.4 Performance on the Hockey dataset evaluated using the Spatial En- coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.5 Performance on the Violent Flows evaluated using the Spatiotem- poral Encoder . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 82 A.6 Ablation studies - Spatial vs Spatiotemporal Encoders on the Hockey dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.7 Ablation studies - Spatial vs Spatiotemporal Encoders on the Vio- lent Flows dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 A.8 Ablation studies - Elementwise Max-pooling vs Last Encoding . . 84 A.9 Ablation studies - ConvLSTM vs BiConvLSTM . . . . . . . . . . . 85 A.10 Ablation studies - AlexNet vs VGG13 . . . . . . . . . . . . . . . . . 86 xi Chapter 1: Introduction 1.1 Motivation The recently proposed deep generative architectures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs) and Diffusion models (DMs) were shown to exhibit photo-realistic image generation quality. Many generative applications such as image editing, text-to-image etc also be- came popular with these models. However, the use of these generative models for tasks such as representation learning, visual recognition or geometry estima- tion has been little explored. Typically, such discriminative tasks are solved with CNN or transformer based classifiers that excel at obtaining decision boundaries between classes in the training data. Deep generative models on the other hand, estimate the joint probability distribution of the entire training data. Such mod- els hold more information about the training data and are capable of generating realistic looking samples from the distribution. Moreover, with the size of the datasets getting bigger and the architectures becoming more powerful, exploring the importance of deep generative models for tasks that go beyond just image generation becomes critical. Generative models have been studied for tasks such as representation learn- 1 Source Target Generative model Source Target Primary task network Source prediction Target prediction Source GT Figure 1.1: Generative models can be used to reduce domain gap between labeled source and unlabeled target domains. ing [1–7], synthetic data generation [8–10], domain adaptation [11–15] etc. How- ever the rapid progress in the generative models research and the underlying techniques did not scale similarly in these areas. In this dissertation, we attempt to explore and leverage specific deep generative models to improve performance in estimation and recognition tasks namely 1) Unsupervised domain adaptation for geometry estimation and 2) Text-based image segmentation, respectively. Unsupervised domain adaptation refers to the problem of reducing the do- main gap between a labeled source domain and an unlabeled target domain. For geometry estimation such as monocular depth estimation (MDE) and face nor- mal estimation (FNE), some lines of work depend on the vast amount of labeled synthetic data as the source domain and attempt to make it generalize to the real data. Previous works that used generative models for unsupervised geometry estimation, proposed to translate the synthetic data into real-like or vice-versa. However, such an inter-domain mapping is an unnecessarily challenging prob- lem for the generative model and would serve as a bottleneck for the downstream primary task network. We propose a better way to reduce the domain gap by us- ing a GAN based framework that translates just the right amount of information 2 A picture of a large fountain Text-based image segmentation network Figure 1.2: Text-based image segmentation aims to segment regions in the image that refer to an input text prompt. 
from both synthetic and real domains into a shared image space. A high level overview of the proposed approach is illustrated in Figure 1.1. This shared im- age space is shown to have better properties in terms of domain generalization for geometry estimation. Specifically, we observe it is only necessary to translate the domain-specific task related information of respective domains into a shared image space. This mapping need not modify the information of original domains that is not related to the primary task as the primary network will learn to ignore them regardless. This simple and intuitive formulation combined with the im- age translation ability of the generative model, helps the primary task network to look at shared information from both domains with much less domain gap leading to better generalization. Also, with recent advances in diffusion models (DMs) [16, 17] in uncon- ditional and class conditional settings, they have started gaining more traction compared to GANs. This class of generative models became even more popular for their generated visual quality in text-to-image tasks. Recently, latent diffusion 3 models (LDMs) [18] were proposed that operate on a perceptually compressed la- tent space obtained from an internal first stage. LDMs became a popular choice for text-to-image applications for their ability to learn and operate with lower computational cost and on large scale datasets. Such large scale LDMs were shown to exhibit photo-realistic text-to-image visual quality and lead to several visual-linguistic applications such as text guided image inpainting, personalized text-to-image etc. This indicates that pretrained LDMs contain semantic infor- mation about various objects from the internet. However, the usefulness of these powerful LDMs have not been explored for text-based recognition problems such as text-based segmentation task illustrated in Figure 1.2. In this dissertation, we propose a text-based segmentation network named LD-ZNet that utilizes an LDM pretrained on large datasets. We show that the segmentation network, with the help of LDM, learns knowledge of novel concepts from the internet without requiring annotations. Overall, our LD-ZNet can segment objects from the inter- net in various imagery such as real, AI-Generated, animations, illustrations and celebrity images. 1.2 Dissertation Outline and Contributions 1.2.1 Leveraging GANs for Unsupervised Geometry Estimation (Chapter 2) In this Chapter, we propose a novel generative-based UDA method for com- bining labeled-synthetic and unlabeled-real images when training networks to 4 determine geometric information from a single image. Our proposal outlines a strategy to project both image categories into a single, shared domain. This shared domain acts as input to the primary network during end-to-end training. Consequently, the primary network learns from the shared information of both domains and generalizes much better to real-images during test-time. Our ex- periments demonstrate significant improvements over the state-of-the-art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting. 1.2.2 Leveraging LDMs for Text-Based Segmentation (Chapter 3) In this Chapter, we propose LD-ZNet a text-based segmentation network that uses an LDM pretrained on large-scale data. 
Specifically, we suggest a way to use the z-space and the internal representations inside the LDM to improve segmentation performance for novel concepts on various imagery such as real, AI-generated, animations, illustrations and celebrity images. We additionally create a new dataset named AIGI consisting of AI-Generated images along with object labels and categorical captions for evaluating the generalization ability of text-based segmentation methods to AI-Generated content. We show a huge improvement of around 20% for LD-ZNet over existing text-based segmentation methods on the AIGI dataset. 5 1.2.3 Bidirectional Convolutional LSTM for the Detection of Vio- lence in Videos (Appendix A) 1The field of action recognition has gained tremendous traction in recent years. A subset of this, detection of violent activity in videos, is of great impor- tance, particularly in unmanned surveillance or crowd footage videos. In this appendix, we explore this problem on three standard benchmarks widely used for violence detection: the Hockey Fights, Movies, and Violent Flows datasets. To this end, we introduce a Spatiotemporal Encoder, built on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture. The addition of a bidirectional temporal encoding and the elementwise max pooling of these encodings in the Spatiotemporal Encoder is novel in the field of violence detection. This addi- tion is motivated by a desire to derive better video representations via leveraging long-range information in both temporal directions of the video. We find that the Spatiotemporal network is comparable in performance with existing meth- ods for all of the above datasets. A simplified version of this network, the Spatial Encoder is sufficient to match state-of-the-art performance on the Hockey Fights and Movies datasets. However, on the Violent Flows dataset, the Spatiotemporal Encoder outperforms the Spatial Encoder. 1This is placed in the appendix because it is an early thesis work that does not directly connect to the main content of this dissertation. 6 Chapter 2: GANs for Unsupervised Geometry Estimation 1Understanding geometry from images is a fundamental problem in com- puter vision. It has many important applications. For instance, Monocular Depth Estimation (MDE) is important for synthetic object insertion in computer graph- ics [20], grasping in robotics [21] and safety in self-driving cars. Face Normal Es- timation can help in face image editing applications such as relighting [22–24]. However, it is extremely hard to annotate real data for these regression tasks. Synthetic data and their ground truth labels, on the other hand, are easy to gen- erate and are often used to compensate for the lack of labels in real data. Deep models trained on synthetic data, unfortunately, usually perform poorly on real data due to the domain gap between synthetic and real distributions. To deal with this problem, several research studies [25–28] have proposed unsupervised do- main adaptation methods to take advantage of synthetic data by mapping it into the real domain or vice versa, either at the feature level or image level. However, mapping examples from one domain to another domain itself is a challenging problem that can limit performance. We observe that finding such a mapping solves an unnecessarily difficult 1Work done with Hao Zhou and David Jacobs. Accepted [19] in CVPR 2020. 
7 G real synthetic Figure 2.1: We propose to reduce the domain gap between synthetic and real by mapping the corresponding domain specific information related to the primary task (δs,δr ) into shared information δsh, preserving everything else. problem. To train a regressor that applies to both real and synthetic domains, it is only necessary that we map both to a new representation that contains the task- relevant information present in both domains, in a common form. The mapping need not alter properties of the original domain that are irrelevant to the task since the regressor will learn to ignore them regardless. To see this, we consider a simplified model of our problem. We suppose that real and synthetic images are formed by two components: domain agnostic (which has semantic information shared across synthetic and real, and is denoted as I) and domain specific. We further assume that domain specific information has two sub-components: domain specific information unrelated to the primary task (denoted as δ′s and δ′r for synthetic and real images respectively) and domain specific information related to the primary task (δs, δr). So real and synthetic images can be represented as: xr = f (I,δr ,δ′r) and xs = f (I,δs,δ′s) respectively. We believe the domain gap between {δs and δr} can affect the training of the primary network, which learns to expect information that is not always present. The domain gap between {δ′s and δ′r}, on the other hand, can be bypassed by the 8 primary network since it does not hold information needed for the primary task. For example, in real face images, information such as the color and texture of the hair is unrelated to the task of estimating face normals but is discriminative enough to distinguish real from synthetic faces. This can be regarded as domain specific information unrelated to the primary task i.e., δ′r . On the other hand, shadows in the real and synthetic images, due to the limitations of the rendering engine, may have different appearances but may contain depth cues that are re- lated to the primary task of MDE in both domains. The simplest strategy, then, for combining real and synthetic data is to map δs and δr to a shared representa- tion, δsh, while not modifying δ′s and δ′r as shown in Figure 2.1. Recent research studies show that a shared network for synthetic and real data can help reduce the discrepancy between images in different domains. For instance, [22] achieved state-of-the-art results in face normal estimation by train- ing a unified network for real and synthetic data. [13] learned the joint distri- bution of multiple domain images by enforcing a weight-sharing constraint for different generative networks. Inspired by these research studies, we define a unified mapping function G, which is called SharinGAN, to reduce the domain gap between real and synthetic images. Different from existing research studies, our G is trained so that minimum domain specific information is removed. This is achieved by pre-training G as an auto-encoder on real and synthetic data, i.e., initializing G as an identity func- tion. Then G is trained end-to-end with reconstruction loss in an adversarial framework, along with a network that solves the primary task, further pushing 9 G to map information relevant to the task to a shared domain. As a result, a successfully trained G will learn to reduce the domain gap existing in δs and δr , mapping them into a shared domain δsh. G will leave I unchanged. 
δ′s and δ′r can be left relatively unchanged when it is difficult to map them to a common representation. Mathematically, G(xs) = f (I,δsh,δ′s) and G(xr) = f (I,δsh,δ′r). If successful, G will map synthetic and real images to images that may look quite different to the eye, but the primary task network will extract the same information from both. We apply our method to unsupervised monocular depth estimation using virtual KITTI (vKITTI) [29] and KITTI [30] as synthetic and real datasets respec- tively. Our method reduces the absolute error in the KITTI eigen test split and the test set of Make3D [31] by 23.77% and 6.45% respectively compared with the state-of-the-art method [27]. Additionally, our proposed method improves over SfSNet [22] on face normal estimation. It yields an accuracy boost of nearly 4.3% for normal prediction within 20◦ (Acc < 20◦) of ground truth on the Photoface dataset [32]. 2.1 Related Work Monocular Depth Estimation has long been an active area in computer vision. Because this problem is ill-posed, learning-based methods have predomi- nated in recent years. Many early learning works applied Markov Random Fields (MRF) to infer the depth from a single image by modeling the relation between 10 nearby regions [31, 33, 34]. These methods, however, are time-consuming dur- ing inference and rely on manually defined features, which have limitations in performance. More recent studies apply deep Convolutional Neural Networks (CNNs) [35–42] to monocular depth estimation. Eigen [35] first proposed a multi-scale deep CNN for depth estimation. Following this work, [36] proposed to apply CNNs to estimate depth, surface normal and semantic labels together. [37] com- bined deep CNNs with a continuous CRF for monocular depth estimation. One major drawback of these supervised learning-based methods is the requirement for a huge amount of annotated data, which is hard to obtain in reality. With the emergence of large scale, high-quality synthetic data [29], using synthetic data to train a depth estimator network for real data became popu- lar [26, 27]. The biggest challenge for this task is the large domain gap between synthetic data and real data. [28] proposed to first train a depth prediction net- work using synthetic data. A style transfer network is then trained to map real images to synthetic images in a cycle consistent manner [43]. [25] proposed to adapt the features of real images to the features of synthetic images by applying adversarial loss on latent features. A content congruent regularization is further proposed to avoid mode collapse. T2Net [26] trained a network that translates synthetic data into real at the image level and further trained a task network in this translated domain. GASDA [27] proposed to train the network by incorporat- ing epipolar geometry constraints for real data along with the ground truth labels for synthetic data. All these methods try to align two domains by transferring one 11 domain to another. Unlike these works, we propose a mapping function G, also called SharinGAN, to just align the domain specific information that affects the primary task, resulting in a minimum change in the images in both domains. We show that this makes learning the primary task network much easier and can help it focus on the useful information. Self-supervised learning is another way to avoid collecting ground truth labels for monocular depth estimation. Such methods need monocular videos [44–47], stereo pairs [48–51], or both [47] for training. 
Our proposed method is complementary to these self-supervised methods, it does not require this addi- tional data, but can use it when available. FaceGeometry Estimation is a sub-problem of inverse face rendering which is the key for many applications such as face image editing. Conventional face ge- ometry estimation methods are usually based on 3D Morphable Models (3DMM) [52]. Recent studies demonstrate the effectiveness of deep CNNs for solving this problem [22, 53–58]. Thanks to the 3DMM, generating synthetic face images with ground truth geometry is easy. [22,53,54] make use of synthetic face images with ground truth shape to help train a network for predicting face shape using real images. Most of these works initially pre-train the network with synthetic data and then fine-tune it with a mix of real and synthetic data, either using no supervision or weak supervision, overlooking the domain gap between real and synthetic face images. In this work, we show that by reducing the domain gap between real and synthetic data using our proposed method, face geometry can be better estimated. 12 Domain Adaptation using GANs There are many works [11–15] that use a GAN framework to perform domain adaptation by mapping one domain into an- other via a supervised translation. However, most of these show performance on just toy datasets in a classification setting. We attempt to map both synthetic and real domains into a new shared domain that is learned during training and use this to solve complex problems of unsupervised geometry estimation. Moreover, we apply adversarial loss at the image level for our regression task, in contrast to some of the above previous works where domain invariant feature engineering sufficed for classification tasks. 2.2 Approach To compensate for the lack of annotations for real data and to train a pri- mary task network on easily available synthetic data, we propose SharinGAN to reduce the domain gap between synthetic and real. We aim to train a pri- mary task network on a shared domain created by SharinGAN, which learns the mapping function G : xr 7→ xshr and G : xs 7→ xshs , where xk = f (I,δk ,δ′k); xshk = f (I,δsh,δ′k); k ∈ {r, s} as shown in Figure 2.1. G allows the primary task network to train on a shared space that holds the information needed to do the primary task, making the network more applicable to real data during testing. To achieve this, an adversarial loss is used to find the shared information, δsh. This is done by minimizing the discrepancy in the distributions of xshr and xshs . But at the same time, to preserve the domain agnostic information (shared 13 Synthetic Image  Real Image Real translated image Synthetic translated image G: Generator Primary Network Synthetic Prediction Real Prediction Synthetic GroundTruth Virtual Supervision Shared Semantic Image SharinGAN module Reconstruction loss T D: Image  Discriminator Figure 2.2: Overview of the model architecture. Red dashed arrows indicate the loss computations. semantic information I), we use reconstruction loss. Now, without a loss from the primary task network, G might change the images so that they don’t match the labels. To prevent that, we additionally use a primary task loss for both real and synthetic examples to guide the generator. It is important to note that both the translations from synthetic to real and vice versa are equally crucial for this symmetric setup to find a shared space. To facilitate that, we use a form of weak supervision we call virtual supervision. 
Some possible virtual supervisions include a prior on the input data or a constraint that can narrow the solution space for the primary task network (details discussed in 2.2.2.2). For synthetic examples, we use the known labels.

Adversarial, reconstruction and primary task losses together train the generator and primary task network to align the domain-specific information {δ_s, δ_r} of both domains into a shared space δ_sh, preserving everything else.

2.2.1 Framework

In this work, we propose to train a generative network, called SharinGAN, to reduce the domain gap between real and synthetic data so as to help train the primary network. Figure 2.2 shows the framework of our proposed method. It contains a generative network G, an image-level discriminator D that embodies the SharinGAN module, and a task network T that performs the primary task. The generative network G takes either a synthetic image x_s or a real image x_r as input and transforms it into x_s^sh or x_r^sh in an attempt to fool D. Different from existing works that transfer images from one domain to another [26–28], our generative network G tries to map the domain-specific parts δ_s and δ_r of synthetic and real images to a shared space δ_sh, leaving δ'_s and δ'_r unchanged. As a result, our transformed synthetic and real images (x_s^sh and x_r^sh) differ less from x_s and x_r. Our task network T then takes the transformed images x_s^sh and x_r^sh as input and predicts the geometry. The generative network G and the task network T are trained together in an end-to-end manner.

2.2.2 Losses

2.2.2.1 Losses for Generative Network

We design a single generative network G for synthetic and real data, since sharing weights can help align the distributions of different domains [13]. Moreover, existing research studies such as [22, 54] also demonstrate that a unified framework works reasonably well on synthetic and real images. In order to map δ_s and δ_r to a shared space δ_sh, we apply an adversarial loss [59] at the image level. More specifically, we use the Wasserstein discriminator [60], which uses the Earth-Mover's distance to minimize the discrepancy between the distributions of the translated synthetic and real examples {G(x_s), G(x_r)}, i.e.:

$$\mathcal{L}_W(D, G) = \mathbb{E}_{x_s}\big[D(G(x_s))\big] - \mathbb{E}_{x_r}\big[D(G(x_r))\big], \tag{2.1}$$

where D is the discriminator and G_e is the encoder part of the generator. Following [61], to overcome the problem of vanishing or exploding gradients caused by the weight clipping proposed in [60], a gradient penalty term is added when training the discriminator:

$$\mathcal{L}_{gp}(D) = \big(\|\nabla_{\hat{h}} D(\hat{h})\|_2 - 1\big)^2 \tag{2.2}$$

Our overall adversarial loss is then defined as:

$$\mathcal{L}_{adv} = \mathcal{L}_W(D, G) - \lambda \mathcal{L}_{gp}(D) \tag{2.3}$$

where λ is set to 10 while training the discriminator and to 0 while training the generator.

Without any constraints, the adversarial loss may learn to remove all domain-specific parts δ and δ′, or even some of the domain-agnostic part I, in order to fool the discriminator. This may lead to a loss of geometric information, which can degrade the performance of the primary task network T. To avoid this, we use a self-regularization loss similar to [62] that forces the transformed images to keep as much information as possible:

$$\mathcal{L}_r = \|G(x_s) - x_s\|_2^2 + \|G(x_r) - x_r\|_2^2. \tag{2.4}$$
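Taken together, Eqs. (2.1)–(2.4) form the generator-side objective: an adversarial term that pulls the translated synthetic and real distributions together, and a self-regularization term that keeps the translations close to their inputs. The PyTorch-style sketch below illustrates one way these terms can be computed. Here `G` and `D` are placeholders for the generator and the image-level critic, and the batch averaging and sign conventions are assumptions following the usual WGAN-GP min–max setup, not a transcription of the released implementation.

```python
import torch

def gradient_penalty(D, x_sh_s, x_sh_r):
    """Eq. (2.2): penalize the critic's gradient norm for deviating from 1,
    evaluated at random interpolations of translated synthetic/real images."""
    alpha = torch.rand(x_sh_s.size(0), 1, 1, 1, device=x_sh_s.device)
    h_hat = (alpha * x_sh_s + (1 - alpha) * x_sh_r).requires_grad_(True)
    d_hat = D(h_hat)
    grads = torch.autograd.grad(d_hat, h_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def generator_side_losses(G, D, x_s, x_r):
    """Eqs. (2.1) and (2.4) as seen by the generator (lambda = 0 in Eq. (2.3)).
    The generator minimizes l_w and l_rec, weighted by alpha_1, alpha_2 in Eq. (2.10)."""
    x_sh_s, x_sh_r = G(x_s), G(x_r)
    l_w = D(x_sh_s).mean() - D(x_sh_r).mean()                            # Eq. (2.1)
    l_rec = ((x_sh_s - x_s) ** 2).mean() + ((x_sh_r - x_r) ** 2).mean()  # Eq. (2.4)
    return l_w, l_rec

def critic_loss(G, D, x_s, x_r, lam=10.0):
    """Eq. (2.3) for the critic: ascend L_W - lam * L_gp, written as a loss to minimize."""
    with torch.no_grad():
        x_sh_s, x_sh_r = G(x_s), G(x_r)
    l_w = D(x_sh_s).mean() - D(x_sh_r).mean()
    return -l_w + lam * gradient_penalty(D, x_sh_s, x_sh_r)
```

In training, critic and generator updates alternate, and the generator additionally receives the task losses introduced next.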
2.2.2.2 Losses for the Task Network

The task network takes transformed synthetic or real images as input and predicts geometric information. Since ground truth labels are available for synthetic data, we apply a supervised loss using these labels. For real images, domain-specific losses or regularizations are applied as a form of virtual supervision, chosen according to the task. We apply our proposed SharinGAN to two tasks: monocular depth estimation (MDE) and face normal estimation (FNE). For MDE, we use the combination of depth smoothness and geometric consistency losses from GASDA [27] as the virtual supervision. For FNE, we use the pseudo-supervision from SfSNet [22]. We use the term "virtual supervision" to summarize these two losses as a kind of weak supervision on the real examples.

2.2.2.3 Monocular Depth Estimation

To make use of the ground truth labels for synthetic data, we apply an L1 loss on the predicted synthetic depth maps:

$$\mathcal{L}_1 = \|\hat{y}_s - y_s^*\|_1 \tag{2.5}$$

where ŷ_s is the predicted synthetic depth map and y*_s is its corresponding ground truth. Following [27], we apply a smoothness loss L_DS on the depth to encourage it to be consistent within locally homogeneous regions. A geometric consistency loss L_GC is applied so that the task network can learn the physical geometric structure through epipolar constraints. L_DS and L_GC are defined as:

$$\mathcal{L}_{DS} = e^{-\nabla x_r} \, \|\nabla \hat{y}_r\| \tag{2.6}$$

$$\mathcal{L}_{GC} = \eta \, \frac{1 - \mathrm{SSIM}(x_r, x'_{rr})}{2} + \mu \, \|x_r - x'_{rr}\|, \tag{2.7}$$

where ŷ_r represents the predicted depth for the real image and ∇ represents the first derivative. x_r is the left image in the KITTI dataset [30], and x'_rr is the image inverse-warped from the right counterpart of x_r based on the predicted depth ŷ_r. The KITTI dataset [30] provides the camera focal length and the baseline distance between the cameras. Similar to [27], we set η to 0.85 and µ to 0.15 in our experiments. The overall loss for the task network is defined as:

$$\mathcal{L}_T = \beta_1 \mathcal{L}_{DS} + \beta_2 \mathcal{L}_1 + \beta_3 \mathcal{L}_{GC}, \tag{2.8}$$

where β_1 = 0.01 and β_2 = β_3 = 100.

2.2.2.4 Face Normal Estimation

SfSNet [22] currently achieves the best performance on face normal estimation. We therefore follow its setup and apply "SfS-supervision" for both synthetic and real images during training:

$$\mathcal{L}_T = \lambda_{recon} \mathcal{L}_{recon} + \lambda_{N} \mathcal{L}_{N} + \lambda_{A} \mathcal{L}_{A} + \lambda_{light} \mathcal{L}_{light}, \tag{2.9}$$

where L_recon, L_N and L_A are L1 losses on the reconstructed image, normal and albedo, whereas L_light is the L2 loss over the 27-dimensional spherical harmonics coefficients. The supervision for real images comes from "pseudo labels", obtained by applying a pre-trained task network to the real images. Please refer to [22] for more details.

2.2.2.5 Overall loss

The overall loss used to train our geometry estimation pipeline is then defined as:

$$\mathcal{L} = \alpha_1 \mathcal{L}_{adv} + \alpha_2 \mathcal{L}_r + \alpha_3 \mathcal{L}_T, \tag{2.10}$$

where (α_1, α_2, α_3) = (1, 10, 1) for the monocular depth estimation task and (α_1, α_2, α_3) = (1, 10, 0.1) for the face normal estimation task.

2.3 Experiments

We apply our proposed SharinGAN to monocular depth estimation and face normal estimation. We discuss the details of the experiments in this section.

2.3.1 Monocular Depth Estimation

2.3.1.1 Datasets

Following [27], we use vKITTI [29] and KITTI [30] as the synthetic and real datasets to train our network. vKITTI contains 21,260 image-depth pairs, all of which are used for training. KITTI [30] provides 42,382 stereo pairs, among which 22,600 images are used for training and 888 for validation, as suggested by [27].

2.3.1.2 Implementation details

We use a generator G and a primary task network T whose architectures are identical to [27]. We pre-train the generative network G on both synthetic and real data using the reconstruction loss L_r. This results in an identity mapping that helps G keep as much of the input image's geometric information as possible; a minimal sketch of this warm-up step is shown below.
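This warm-up amounts to fitting G as an auto-encoder on images from both domains. In the sketch, the data loaders, optimizer choice and learning rate are illustrative assumptions, and both loaders are assumed to yield plain image batches.

```python
import itertools
import torch

def pretrain_generator(G, synthetic_loader, real_loader, num_steps, lr=1e-4, device="cuda"):
    """Warm-start G toward an identity mapping by minimizing the
    reconstruction loss L_r (Eq. 2.4) on both synthetic and real images."""
    G.to(device).train()
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    batches = zip(itertools.cycle(synthetic_loader), itertools.cycle(real_loader))
    for step, (x_s, x_r) in enumerate(batches):
        if step == num_steps:
            break
        x_s, x_r = x_s.to(device), x_r.to(device)
        # L_r = ||G(x_s) - x_s||^2 + ||G(x_r) - x_r||^2, averaged over the batch
        loss = ((G(x_s) - x_s) ** 2).mean() + ((G(x_r) - x_r) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G
```

Starting G near the identity means the subsequent adversarial training only has to move the task-relevant, domain-specific content into the shared space.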
Our task network is pre-trained using synthetic data with supervision. G and T are then trained end to end using Equation 2.10 for 150,000 iterations with a batch size of 2, by using an Adam optimizer with a learning rate of 1e − 5. The best model is selected based on the validation set of KITTI. 2.3.1.3 Results Table 2.1 shows the quantitative results on the eigen test split of the KITTI dataset for different methods on the MDE task. The proposed method outper- forms the previous unsupervised domain adaptation methods for MDE [26, 27] on almost all the metrics. Especially, compared with [27], we reduce the abso- lute error by 19.7% and 21.0% on 80m cap and 50m cap settings respectively. Moreover, the performance of our method is much closer to the methods in a supervised setting [35, 37, 63], which was trained on the real KITTI dataset with ground truth depth labels. Figure 2.3 visually compares the predicted depth map from the proposed method with [27]. We show three typical examples: near dis- tance, medium distance, and far distance. It shows that our proposed method performs much better for predicting depth at details. For instance, our predicted 20 Method Supervised Dataset Cap Error Metrics, lower is better Accuracy Metrics, higher is better Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.252 δ < 1.253 Eigen [35] Yes K 80m 0.203 1.548 6.307 0.282 0.702 0.890 0.958 Liu [37] Yes K 80m 0.202 1.614 6.523 0.275 0.678 0.895 0.965 All synthetic (baseline) No S 80m 0.253 2.303 6.953 0.328 0.635 0.856 0.937 All real (baseline) No K 80m 0.158 1.151 5.285 0.238 0.811 0.934 0.970 GASDA [27] No K+S 80m 0.149 1.003 4.995 0.227 0.824 0.941 0.973 SharinGAN (proposed) No K+S 80m 0.116 0.939 5.068 0.203 0.850 0.948 0.978 Kuznietsov [63] Yes K 50m 0.117 0.597 3.531 0.183 0.861 0.964 0.989 Garg [64] No K 50m 0.169 1.080 5.104 0.273 0.740 0.904 0.962 Godard [48] No K 50m 0.140 0.976 4.471 0.232 0.818 0.931 0.969 All synthetic (baseline) No S 50m 0.244 1.771 5.354 0.313 0.647 0.866 0.943 All real (baseline) No K 50m 0.151 0.856 4.043 0.227 0.824 0.940 0.973 Kundu [25] No K+S 50m 0.203 1.734 6.251 0.284 0.687 0.899 0.958 T2Net [26] No K+S 50m 0.168 1.199 4.674 0.243 0.772 0.912 0.966 GASDA [27] No K+S 50m 0.143 0.756 3.846 0.217 0.836 0.946 0.976 SharinGAN (proposed) No K+S 50m 0.109 0.673 3.77 0.190 0.864 0.954 0.981 Table 2.1: MDE Results on eigen test split of KITTI dataset [35] . For the training data, K: KITTI dataset and S: vKITTI dataset. Methods highlighted in light gray, use domain adaptation techniques and the non-highlighted rows correspond to supervised methods. depth map can better preserve the shape of the car (Figure 2.3 (a) and (c)) and the structure of the tree and the building behind it (Figure 2.3 (b)). This shows the advantage of our proposed SharinGAN compared with [27]. [27] learns to transfer real images to the synthetic domain and vice versa, which solves a much harder problem compared with SharinGAN, which removes a minimum of do- main specific information. As a result, the quality of the transformation for [27] may not be as good as the proposed method. Moreover, the unsupervised trans- formation cannot guarantee to keep the geometry information unchanged. 21 Input GT Depth GASDA [27] SharinGAN (Ours) (a) The second row shows the corresponding region in the red box of the first row. The depth of the faraway car is better estimated by SharinGAN than GASDA. (b) The second and third row shows the corresponding region in the green and red box of the first row. 
The depth of the tree to the left (green) and shrubs behind the tree in the right are better estimated by SharinGAN. (c) The second and third row shows the corresponding regions in the green and red boxes of the first row. The boundaries and the depth of the cars are better estimated by SharinGAN. Figure 2.3: Qualitative comparisons of SharinGAN with GASDA [27]. Ground truth (GT) has been interpolated (and the unavailable top regions are masked out) for visualization purposes. Note that in addition to various other aspects mentioned above, we are also able to remove the boundary artifacts present in the depth maps of GASDA. 22 (a) xr (b) xshr = G(xr ) (c) |xr − xshr | (d) xs (e) xshs = G(xs) (f) |xs − xshs | Figure 2.4: (a), (b) and (c) show real image xr , translated real image xshr and their differ- ence |xr − xshr | respectively. (d), (e) and (f) show synthetic image xs, translated synthetic image xshs and their difference |xs − xshs | respectively. To understand how our generative network G works, we show some exam- ples of synthetic and real images, their transformed versions, and the difference images in Figure 2.4. This shows that G mainly operates on edges. Since depth maps are mostly discontinuous at edges, they provide important cues for the geometry of the scene. On the other hand, due to the difference between the ge- ometry and material of objects around the edges, the rendering algorithm may find it hard to render realistic edges compared with other parts of the scene. As a result, most of the domain specific information related to geometry lies in the edges, on which SharinGAN correctly focuses. 2.3.1.4 Generalization to Make3D To demonstrate the generalization ability of the proposed method, we test our trained model on Make3D [31]. Note that we do not fine-tune our model using the data from Make3D. Table 2.2 shows the quantitative results of our method, which outperforms existing state-of-the-art methods by a large margin. Moreover, the performance of SharinGAN is more comparable to the super- 23 Method Trained Error Metrics, lower is better Abs Rel Sq Rel RMSE Karsh et al. [65] Yes 0.398 4.723 7.801 Laina et al. [66] Yes 0.198 1.665 5.461 Kundu et al. [25] Yes 0.452 5.71 9.559 Goddard et al. [67] No 0.505 10.172 10.936 Kundu et al. [25] No 0.647 12.341 11.567 Atapour et al. [28] No 0.423 9.343 9.002 T2Net [26] No 0.508 6.589 8.935 GASDA [27] No 0.403 6.709 10.424 SharinGAN (proposed) No 0.377 4.900 8.388 Table 2.2: MDE results on Make3D dataset [31]. Trained indicates whether the model is trained on Make3D or not. Errors are computed for depths less than 70m in a central image crop [67]. It can be concluded that our proposed method generalized better to an unseen dataset. vised methods. We further visually compare the proposed method with GASDA [27] in Figure 2.5. It is clear that the proposed depth map captures more details in the input images, reflecting more accurate depth prediction. 24 (a) Input Image (b) Ground Truth (c) GASDA [27] (d) SharinGAN Figure 2.5: Qualitative results on the test set of the Make3D dataset [31]. In the top row, some far tree structures that are missing in the depth map predicted by GASDA were better captured on using the SharinGAN module. For the bottom row, GASDA wrongly predicts the depth map of the houses behind the trees to be far, which is correctly cap- tured by the SharinGAN. 
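Before moving on to face normal estimation, we note that the error and accuracy metrics reported in Tables 2.1 and 2.2 follow the standard monocular-depth evaluation protocol of [35]. The snippet below is a small reference sketch of how these metrics are typically computed; it assumes the ground-truth and predicted depths have already been masked to valid pixels and capped at the chosen maximum depth, and it is not the exact evaluation code used for the tables.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard MDE metrics: Abs Rel, Sq Rel, RMSE, RMSE log and the
    delta < 1.25^k accuracies. `gt` and `pred` are 1-D arrays of positive,
    depth-capped values (e.g., capped at 50m, 70m or 80m)."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, d1=d1, d2=d2, d3=d3)
```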
2.3.2 Face Normal Estimation 2.3.2.1 Datasets We use the synthetic data provided by [22] and CelebA [68] as real data to train the SharinGAN for face normal estimation similar to [22]. Our trained model is then evaluated on the Photoface dataset [32]. 2.3.2.2 Implementation details We use the RBDN network [69] as our generator and SfSNet [22] as the primary task network. Similar to before, we pre-train the Generator on both 25 Algorithm MAE < 20◦ < 25◦ < 30◦ 3DMM [52] 26.3◦ 4.3% 56.1% 89.4% Pix2Vertex [70] 33.9◦ 24.8% 36.1% 47.6% SfSNet [22] 25.5◦ 43.6% 57.7% 68.7% SharinGAN (proposed) 24.0◦ 47.88% 61.53% 72.1% Table 2.3: Quantitative results for Face Normal estimation on the test split of Photoface dataset [32]. All the listed methods are not fine-tuned on Photoface. The metrics MAE: Mean Angular Error and < 20◦,25◦,30◦ refer to the normals prediction accuracy for dif- ferent thresholds. synthetic and real data using reconstruction loss and pre-train the primary task network on just synthetic data in a supervised manner. Then, we train G and T end-to-end using the overall loss (2.10) for 120,000 iterations. We use a batch size of 16 and a learning rate of 1e − 4. The best model is selected based on the validation set of Photoface [32]. 2.3.2.3 Results Table 2.3 shows the quantitative performance of the estimated surface nor- mals by our method on the test split of the Photoface dataset. With the proposed SharinGAN module, we were able to significantly improve over SfSNet on all the metrics. In particular, we were able to significantly reduce the mean angular error metric by roughly 1.5◦. Additionally, Figure 2.6 depicts the qualitative comparison of our method 26 (a) Input Image (b) GT (c) SfSNet [22] (d) SharinGAN Figure 2.6: Qualitative comparisons of our method with SfSNet on the examples from the test set of Photoface dataset [32]. Our method generalizes much better to unseen data during training. with SfSNet on the test split of Photoface. Both SfSNet and our pipeline are not finetuned on this dataset, and yet we were able to generalize better com- pared to SfSNet. This demonstrates the generalization capacity of the proposed SharinGAN to unseen data in training. Finally, Figure 2.7 depicts the qualitative results of our method on the CelebA [68] and Synthetic [22] datasets. The translated images corresponding to syn- thetic and real images look similar in contrast to the MDE task (Figure 2.4). We suppose that for the task of MDE, regions such as edges are domain specific, and yet hold primary task related information such as depth cues, which is why 27 Input, xs xshs = G(xs) Normal Albedo Shading Reconstruction (a) Qualitative results of our method on CelebA testset [68]. Input, xr xshr = G(xr ) Normal Albedo Shading Reconstruction (b) Qualitative results of our method on the synthetic data used in SfSNet [22]. Figure 2.7: Qualitative results of our method on face normal estimation task. The trans- lated images xshr ,xshs look reasonably similar for our task which additionally predicts albedo, lighting, shading and Reconstructed image along with the face normal. 28 SharinGAN modifies such regions. However, for the task of FNE, we additionally predict albedo, lighting, shading and a reconstructed image along with estimat- ing normals. This means that the primary network needs a lot of shared infor- mation across domains for good generalization to real data. 
Thus the SharinGAN module seems to bring everything into a shared space, making the translated images {xshr ,xshs } look visually similar. Lighting Estimation The primary network estimates not only face normals but also lighting. We also evaluate this. Following a similar evaluation protocol as that of [22], Table 2.4 summarizes the light classification accuracy on the Mul- tiPIE dataset [71]. Since we do not have the exact cropped dataset that [22] used, we used our own cropping and resizing on the original MultiPIE data: centercrop 300x300 and resize to 128x128. For a fair comparison, we used the same dataset to re-evaluate the lighting performance for [22] and reported the results in Table 2.4. Our method not only outperforms [22] on the face normal estimation, but also on lighting estimation. Algorithm top-1% top-2% top-3% SfSNet [22] 80.25 92.99 96.55 SharinGAN 81.83 93.88 96.69 Table 2.4: Light classification accuracy on MultiPIE dataset [71]. Training with the pro- posed SharinGAN also improves lighting estimation along with face normals. 29 2.3.3 Ablation studies We carried out our ablation study using the KITTI and Make3D datasets on monocular depth estimation. We study the role of the SharinGAN module by removing it and training a primary network on the original synthetic and real data using (2.8). We observe that the performance drops significantly as shown in Table 2.5 and Table 2.6. This shows the importance of the SharinGAN module that helps train the primary task network efficiently. To demonstrate the role of reconstruction loss, we remove it and train our whole pipeline α1Ladv + α3LT . We show the results on the testset of KITTI in the second row of Table 2.5 and on the testset of Make3D in the second row of Table 2.6. For both the testsets, we can see the performance drop compared to our full model. Although the drop is smaller in the case of KITTI, it can be seen that the drop is significant for Make3D dataset that is unseen during training. This signifies the importance of reconstruction loss to generalize well to a domain not seen during training. Components Cap Error Metrics, lower is better Accuracy Metrics, higher is better SharinGAN Reconstruction loss Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.252 δ < 1.253 x x 50m 0.137 0.804 4.12 0.210 0.816 0.940 0.978 ✓ x 50m 0.1113 0.6705 3.80 0.192 0.861 0.954 0.980 ✓ ✓ 50m 0.109 0.673 3.77 0.190 0.864 0.954 0.981 Table 2.5: Ablation study for monocular depth estimation to understand the role of the SharinGAN module and Reconstruction loss. We need both to get the best performance for this task. 30 Components Cap Error Metrics, lower is better SharinGAN Reconstruction loss Abs Rel Sq Rel RMSE x x 70m 0.476 8.058 9.449 ✓ x 70m 0.401 5.318 8.377 ✓ ✓ 70m 0.377 4.900 8.388 Table 2.6: Ablation study for monocular depth estimation to understand the role of the SharinGAN module and Reconstruction loss on the Make3D test dataset. We need both to get the best performance for this task. 2.4 Summary Our primary motivation is to simplify the process of combining synthetic and real images in training. Prior approaches often pick one domain and try to map images into it from the other domain. Instead, we train a generator to map all images into a new, shared domain. In doing this, we note that in the new domain, the images need not be indistinguishable to the human eye, only to the network that performs the primary task. 
2.4 Summary

Our primary motivation is to simplify the process of combining synthetic and real images in training. Prior approaches often pick one domain and try to map images into it from the other domain. Instead, we train a generator to map all images into a new, shared domain. In doing this, we note that in the new domain, the images need not be indistinguishable to the human eye, only to the network that performs the primary task. The primary network will learn to ignore extraneous, domain-specific information that is retained in the shared domain. To achieve this, we propose a simple network architecture that rests on our new SharinGAN, which maps both real and synthetic images to a shared domain. The resulting images retain domain-specific details that do not prevent the primary network from effectively combining training data from both domains. We demonstrate this by achieving significant improvements over state-of-the-art approaches in two important applications: surface normal estimation for faces and monocular depth estimation for outdoor scenes. Finally, our ablation studies demonstrate the significance of the proposed SharinGAN in effectively combining synthetic and real data.

Chapter 3: LDMs for Text-Based Image Segmentation¹

¹ Work done with Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, and David Jacobs. Accepted [72] as an oral presentation at ICCV 2023.

Teaching neural networks to accurately find the boundaries of objects is hard, and annotating boundaries at internet scale is impractical. Moreover, most self-supervised or weakly supervised objectives do not incentivize learning boundaries. For example, training on classification or captioning allows models to learn the most discriminative parts of the image without focusing on boundaries [73, 74]. Our insight is that Latent Diffusion Models (LDMs) [18], which can be trained without object-level supervision at internet scale, must attend to object boundaries, and so we hypothesize that they can learn features useful for open-world image segmentation. We support this hypothesis by showing that LDMs can improve performance on this task by up to 6% compared to standard baselines, and these gains are further amplified when LDM-based segmentation models are applied to AI-generated images.

Figure 3.1: Coarse segmentation results from an LDM (at t = 400) for two distinct images, with the prompts "A picture of an astronaut" and "A picture of the Stonehenge" versus the NULL (unconditional) prompt, demonstrating the encoding of fine-grained object-level semantic information within the model's internal features.

To test this hypothesis about the presence of object-level semantic information inside a pretrained LDM, we conduct a simple experiment. We compute the pixel-wise norm between the unconditional and text-conditional noise estimates from a pretrained LDM as part of the reverse diffusion process. This computation identifies the spatial locations that need to be modified for the noised input to align better with the corresponding text condition. Hence, the magnitude of the pixel-wise norm highlights regions associated with the text prompt. As shown in Figure 3.1, the pixel-wise norm represents a coarse segmentation of the subject, even though the LDM is not trained on this task. This clearly demonstrates that large-scale LDMs can not only generate visually pleasing images, but that their internal representations encode fine-grained semantic information that can be useful for tasks like segmentation.
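For concreteness, the probe just described can be written in a few lines of PyTorch. The sketch below uses the HuggingFace diffusers API rather than the CompVis stable-diffusion codebase used in this work; the 512x512 resize and the 0.18215 latent scaling are standard for Stable Diffusion v1 but are assumptions of this sketch, and only the pixel-wise norm of the difference between the conditional and unconditional noise estimates (at t = 400, as in Figure 3.1) corresponds to the experiment described above.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

@torch.no_grad()
def coarse_mask(image: Image.Image, prompt: str, t: int = 400) -> np.ndarray:
    # Encode the image into the LDM latent space z (0.18215 is the SD v1 scaling factor).
    x = torch.from_numpy(np.array(image.convert("RGB").resize((512, 512)))).float()
    x = (x / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0).to(device)
    z = pipe.vae.encode(x).latent_dist.mean * 0.18215

    # Forward-diffuse z to timestep t.
    t_tensor = torch.tensor([t], device=device)
    z_t = pipe.scheduler.add_noise(z, torch.randn_like(z), t_tensor)

    # CLIP text embeddings for the prompt and for the empty (unconditional) prompt.
    def embed(p: str) -> torch.Tensor:
        tokens = pipe.tokenizer(p, padding="max_length",
                                max_length=pipe.tokenizer.model_max_length,
                                truncation=True, return_tensors="pt").input_ids.to(device)
        return pipe.text_encoder(tokens)[0]

    eps_cond = pipe.unet(z_t, t_tensor, encoder_hidden_states=embed(prompt)).sample
    eps_uncond = pipe.unet(z_t, t_tensor, encoder_hidden_states=embed("")).sample

    # The per-location norm of the difference highlights regions tied to the prompt.
    return (eps_cond - eps_uncond).norm(dim=1).squeeze(0).cpu().numpy()
```

The returned map lives at the latent resolution (64x64 for a 512x512 input); upsampling it to the image size yields coarse segmentations of the kind visualized in Figure 3.1.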
Recently, text-based image segmentation has gained traction for creating and editing AI-generated content (such as AI art, illustrations, and cartoons) in image inpainting workflows², as it provides a conversational interface. Since the latent space z [75] is extracted by a VQGAN trained on several domains such as art, cartoons, illustrations, and real photographs, we posit that it is a more robust input representation for text-based segmentation on AI-generated images. Furthermore, the internal layers of the LDM are responsible for generating the structure of the image and hence contain rich semantic information about objects. Soft masks from these layers have also been used as a latent input in recent work on image editing [76, 77]. Since this information is already present while generating the image, we propose an architecture in the form of LD-ZNet (shown in Figure 3.3) to decode it and obtain the semantic boundaries of objects generated in the scene. Not only does our architecture benefit segmentation of objects in AI-generated images, it also improves performance on natural images. Overall, our contributions are as follows:

• We propose a text-based segmentation architecture, ZNet, that operates on the compressed latent space z of the LDM.

• Next, we study the internal representations at different stages of pretrained LDMs and show that they are useful for text-based image segmentation.

• Finally, we propose a novel approach named LD-ZNet to incorporate the visual-linguistic latent diffusion features from a pretrained LDM and show improvements across several metrics and domains for text-based image segmentation.

² imaginAIry (https://github.com/brycedrennan/imaginAIry) and stable-diffusion-webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui).

3.1 Related work

3.1.1 Text-based image segmentation

Text-based image segmentation is the general task of segmenting specific regions in an image based on a text prompt. It differs from the referring expression segmentation (RES) task, which aims to extract instance-level segmentations of objects identified by distinctive referring expressions. While RES helps robotics applications that require localizing a single object in an image, text-based segmentation benefits image editing applications by also being able to segment 1) "stuff" categories (clouds/ocean/beach) and 2) multiple instances of an object category matching the text prompt. Nevertheless, the two tasks share some literature in terms of approaches. Preliminary works [78–82] focused on multi-modal feature fusion between language and visual representations obtained from recurrent networks (such as LSTMs) and CNNs, respectively. Subsequent works [83–86] included variations of multi-modal training, attention, and cross-attention networks. Recently, [85, 87] used CLIP [88] to extract visual-linguistic features of the image and the reference text separately; these features were then combined using a transformer-based decoder to predict a binary mask. Alternatively, [89, 90] proposed vision-language pretraining on other text-based visual recognition tasks (object detection and phrase grounding), followed by fine-tuning for the segmentation task. The concurrent works segment-anything (SAM) [91] and segment-everything-everywhere-all-at-once (SEEM) [92] allow interactive segmentation via point clicks, bounding boxes, and text inputs, demonstrating good zero-shot performance. Different from all these works, we show the significance of using the latent space and the internal features from a pretrained latent diffusion model [18] for improving the more generic text-based image segmentation task.
3.1.2 Text-to-Image synthesis

Text-to-image synthesis was initially explored using GANs [39, 93–97] on publicly available image captioning datasets. Another line of work uses autoregressive models [98–100] via a two-stage approach: the first stage is a vector-quantized autoencoder, such as a VQVAE [101, 102] or a VQGAN [75], trained with an image reconstruction objective to convert an image into a shorter sequence of discrete tokens. This low-dimensional latent space enables the training of compute-intensive autoregressive models even for high-resolution text-to-image synthesis. With recent advancements in diffusion models (DMs) [16, 17], both in unconditional and class-conditional settings, they have started gaining more traction than GANs, and their success on text-to-image tasks [103, 104] made them even more popular. However, earlier diffusion models operated in the high-dimensional image space, which made training and inference computationally intensive. Subsequently, latent-space representations [18, 105–107] were proposed for high-resolution text-to-image synthesis to reduce the heavy compute demands. More specifically, the latent diffusion model (LDM) [18] mitigates this problem by relying on a perceptually compressed latent space produced by a powerful first-stage autoencoder. Moreover, it employs a convolutional UNet [108] as the denoising architecture, allowing latent spaces of different sizes as input. Recently, this architecture was trained on large-scale text-image data [109] from the internet and released as Stable-diffusion³, which exhibits photo-realistic image generation. Subsequently, several language-guided image editing applications, such as inpainting [110–112] and text-guided image editing [77, 113], became popular, and the use of text-based image segmentation has surged, especially for AI-generated images. We propose a solution for text-based image segmentation that leverages the features already present as part of the synthesis process.

³ https://github.com/CompVis/stable-diffusion

3.1.3 Semantics in generative models

Semantics in generative models such as GANs have been studied for binary segmentation [114, 115] as well as multi-class segmentation [3, 4, 116], where the intermediate features have been shown to contain semantic information for these tasks. Moreover, [117] highlighted practical advantages of these representations, such as out-of-distribution robustness. However, earlier generative models (GANs) received less attention as representation learners than alternative unsupervised methods [118], because of the difficulty of training them on complex, diverse, and large-scale datasets. Diffusion models [16], on the other hand, are another class of powerful generative models that recently outperformed GANs on image synthesis [17] and can be trained on large datasets such as ImageNet [119] or LAION [109]. In [5], the authors demonstrated that the internal features of a pretrained diffusion model are effective for semantic segmentation. However, this type of analysis [4, 5] has mostly been done in limited settings like few-shot learning [120] or limited domains like faces [121], horses [122], or cars [122]. Different from these works, we analyze the visual-linguistic semantic information present in the internal features of a text-to-image LDM [18] for text-based image segmentation, which is an open-world visual recognition task.
Furthermore, we leverage these LDM features and show performance improvements when training with full datasets instead of few-shot settings.

3.2 LDMs for Text-Based Segmentation

The text-to-image latent diffusion architecture introduced in [18] consists of two stages: 1) an autoencoder-based VQGAN [75] that extracts a compressed latent representation z for a given image, and 2) a diffusion UNet that is trained to denoise the noisy z created in the forward diffusion process, conditioned on text features. These text features are obtained from a pretrained, frozen CLIP text encoder [88] and are injected at multiple layers of the UNet via cross-attention.

In this chapter, we show performance improvements on the text-based segmentation task in two steps. First, we analyze the compressed latent space z from the first stage and propose an approach named ZNet that uses z as the visual input to estimate the segmentation mask when conditioned on a text prompt. Second, we study the internal representations from the second stage of the stable-diffusion LDM for visual-linguistic semantic information and propose a way to utilize them inside ZNet for further improvements on the segmentation task. We name this approach LD-ZNet.

Figure 3.2: Reconstructions from the first stage of the LDM (encoder E → latent z → decoder D). Given an input image, the latent representation z generated by the encoder can be used to reconstruct images that are perceptually indistinguishable from the inputs. The high quality of these reconstructions suggests that the latent representation z preserves most of the semantic information present in the input images.

3.2.1 ZNet: Leveraging Latent Space Features

We observe that the latent space z from the first stage of the LDM is a compressed representation of the image that preserves semantic information, as depicted in Figure 3.2. The VQGAN in the first stage achieves such semantics-preserving compression with the help of large-scale training data as well as a combination of losses: a perceptual loss [123], a patch-based [124] adversarial objective [75, 125, 126], and a KL-regularization loss.

Figure 3.3: Overview of the proposed ZNet and LD-ZNet architectures. We propose to use the compressed latent representation z as input for our segmentation network ZNet. Next, we propose LD-ZNet, which incorporates the latent diffusion features from various intermediate blocks of the LDM's denoising UNet into ZNet.

In our experiments, we observe that this compressed latent representation z is more robust than the original image in terms of its association with text prompts. We believe this is because z is an H/8 × W/8 × 4 dimensional feature with 48× fewer elements than the original image, while preserving the semantic information. Several prior works [127–129] show that compression techniques like PCA, which create information-preserving lower-dimensional representations, generalize better. Therefore, we propose using the z representation along with the frozen CLIP text features [88] as the input to our segmentation network. Furthermore, because the VQGAN is trained across several domains like art, cartoons, illustrations, portraits, etc., it learns a robust and compact representation that generalizes better across domains, as can be seen in our experiments on AI-generated images. We call this approach ZNet. The architecture of ZNet is shown in the bottom box of Figure 3.3; it is the same as the denoising UNet module of the LDM, and we therefore initialize it with the pretrained weights of the second stage of the LDM.
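As an illustration, the snippet below sketches one plausible way to instantiate a ZNet-style network with the diffusers API: reuse the LDM's denoising UNet, initialized from the pretrained second stage, and attach a lightweight mask head. The 1x1-convolution head, the fixed timestep, and the bilinear upsampling to the output resolution are assumptions of this sketch, not the exact design used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

class ZNetSketch(nn.Module):
    """Segmentation network operating on the LDM latent z and CLIP text features."""

    def __init__(self, out_size: int = 384):
        super().__init__()
        # Same architecture as the LDM's denoising UNet, initialized from its weights.
        self.unet = UNet2DConditionModel.from_pretrained(
            "CompVis/stable-diffusion-v1-4", subfolder="unet")
        # Hypothetical 1-channel mask head on top of the UNet's 4-channel output.
        self.mask_head = nn.Conv2d(4, 1, kernel_size=1)
        self.out_size = out_size

    def forward(self, z: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # z: (B, 4, H/8, W/8) latent from the frozen first-stage VQGAN encoder.
        # text_features: (B, 77, 768) features from the frozen CLIP text encoder.
        t = torch.zeros(z.shape[0], dtype=torch.long, device=z.device)  # assumed fixed timestep
        feats = self.unet(z, t, encoder_hidden_states=text_features).sample
        logits = self.mask_head(feats)
        return F.interpolate(logits, size=self.out_size, mode="bilinear", align_corners=False)
```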
3.2.2 LD-ZNet: Leveraging Diffusion Features

Given a text prompt and a timestep t, the second stage of the LDM is trained to denoise z_t, a noisy version of the latent representation z obtained via the forward diffusion process for t timesteps. A UNet architecture is used, whose encoder/decoder elements are shown in Figure 3.3 (top right). A typical encoder/decoder block contains a residual layer followed by a spatial-attention module that internally applies self-attention and then cross-attention with the text features. We analyze the semantic information in the internal visual-linguistic representations developed at different blocks of the encoder and decoder, right after these spatial-attention modules. We also propose a way to utilize these latent diffusion features via cross-attention into the ZNet segmentation network, and we call the final model LD-ZNet.

Figure 3.4: Semantic information present in the LDM features at various blocks and timesteps for the referring image segmentation task (validation AP versus timestep, with one curve per block in {2, 4, 6, 7, 8, 10, 12, 14, 16}). AP is measured on a small validation subset of the PhraseCut dataset.

3.2.2.1 Visual-Linguistic Information in LDM Features

We evaluate the semantic information present in the pretrained LDM at various blocks and timesteps for the text-based image segmentation task. In this experiment, we consider the latent diffusion features right after spatial-attention layers 1–16, spanning all the encoder and decoder blocks present in the UNet. At each block, we analyze the features for every 100th timestep in the range [100, 1000]. We use a small subset of the training and validation sets from the PhraseCut dataset and train a simple decoder on top of these features to predict the associated binary mask. Specifically, given an image I and timestep t, we first extract its latent representation z from the first stage of the LDM and add noise from the forward diffusion process to obtain z_t for timestep t. Next, we extract the frozen CLIP text features for the text prompt and feed both into the denoising UNet of the LDM to extract the internal visual-linguistic features at all the blocks for that timestep. We use these representations to train the corresponding decoders until convergence. Finally, we evaluate the AP metric on a small subset of the validation set.

The performance of features from different blocks and timesteps is shown in Figure 3.4. Similar to [5], we observe that the middle blocks {6, 7, 8, 9, 10} of the UNet contain more semantic information than either the early blocks of the encoder or the later blocks of the decoder. We also observe that, for these middle blocks, timesteps 300–500 contain the most visual-linguistic semantic information. This is in contrast to the findings of [5], which report that timesteps {50, 150, 250} contain the most useful information when evaluated on an unconditional DDPM for few-shot semantic segmentation of horses [122] and faces [121]. We believe the reason for this difference is that, in our case, the image synthesis is guided by text, leading to the emergence of semantic information earlier in the reverse diffusion process (t = 1000 → 0), in contrast to unconditional image synthesis.
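The probing setup described above can be sketched as follows: run a single forward pass of the LDM's UNet at a chosen timestep and capture the outputs of its spatial-attention (transformer) modules with forward hooks. This is a hedged illustration using the diffusers UNet; indexing the blocks by the enumeration order of the transformer modules and the use of forward hooks are conveniences of this sketch rather than a description of our training code, and the lightweight per-block decoders trained on these features are omitted.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet").eval()

@torch.no_grad()
def spatial_attention_features(z_t, t, text_features):
    """Collect per-block features right after each spatial-attention module."""
    feats, hooks = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            # The module may return a tuple or an output object with a .sample field.
            out = output[0] if isinstance(output, tuple) else getattr(output, "sample", output)
            feats[idx] = out.detach()
        return hook

    # The spatial-attention modules are the Transformer2DModel blocks of the UNet.
    blocks = [m for m in unet.modules() if m.__class__.__name__ == "Transformer2DModel"]
    for idx, block in enumerate(blocks, start=1):
        hooks.append(block.register_forward_hook(make_hook(idx)))

    unet(z_t, t, encoder_hidden_states=text_features)
    for h in hooks:
        h.remove()
    return feats  # e.g. feats[6] ... feats[10] correspond to the most semantically rich blocks
```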
3.2.2.2 LD-ZNet Architecture

We propose injecting the aforementioned visual-linguistic representations from multiple spatial-attention modules of the pretrained LDM into ZNet, as shown in Figure 3.3. These latent diffusion features are injected into ZNet via a cross-attention mechanism at the corresponding spatial-attention modules, as shown in Figure 3.5. This allows an interaction between the visual-linguistic representations of ZNet and those of the LDM. Specifically, we pass the latent diffusion features through an attention pool layer that not only acts as a learnable layer to match the range of the features participating in the cross-attention, but also adds a positional encoding to the pixels of the LDM representations. The outputs of the attention pool are thus positionally encoded visual-linguistic representations that enable the proposed cross-attention mechanism to attend to the corresponding pixels of the ZNet features. ZNet augmented with these latent diffusion features from the LDM (through cross-attention) is referred to as LD-ZNet.

Figure 3.5: We incorporate the visual-linguistic representations obtained at the spatial-attention modules of the LDM into the corresponding spatial-attention modules of ZNet via a cross-attention mechanism, after passing them through an attention pool layer.

Following the semantic analysis of latent diffusion features (Sec. 3.2.2.1), we incorporate the internal features from blocks {6, 7, 8, 9, 10} of the LDM into the corresponding blocks of ZNet, in order to make use of the most semantic and diverse visual-linguistic information from the LDM. For AI-generated images, these blocks are in any case responsible for generating the final image; using LD-ZNet, we tap into this information, which can then be used for segmenting objects in the scene.
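A minimal PyTorch sketch of this injection mechanism is shown below. The single learnable projection standing in for the attention pool, the learned positional embedding, the head count, and the residual form of the fusion are illustrative assumptions; the sketch is only meant to convey how attention-pooled, positionally encoded LDM features serve as keys and values for a cross-attention that is queried by the ZNet features.

```python
import torch
import torch.nn as nn

class LDMFeatureInjection(nn.Module):
    """Fuse LDM spatial-attention features into ZNet features via cross-attention."""

    def __init__(self, znet_dim: int, ldm_dim: int, num_tokens: int, heads: int = 8):
        super().__init__()
        # Attention-pool stand-in: a learnable projection plus a positional encoding
        # for the (flattened) LDM feature map.
        self.pool_proj = nn.Linear(ldm_dim, znet_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, znet_dim))
        self.cross_attn = nn.MultiheadAttention(znet_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(znet_dim)

    def forward(self, znet_feats: torch.Tensor, ldm_feats: torch.Tensor) -> torch.Tensor:
        # znet_feats: (B, N, C) flattened ZNet spatial-attention features (queries).
        # ldm_feats:  (B, M, C_ldm) flattened LDM features from the matching block.
        kv = self.pool_proj(ldm_feats) + self.pos_embed       # positionally encoded keys/values
        attended, _ = self.cross_attn(query=self.norm(znet_feats), key=kv, value=kv)
        return znet_feats + attended                          # residual fusion into ZNet
```

In LD-ZNet, one such module sits at each of blocks {6, 7, 8, 9, 10} of ZNet, receiving the LDM features from the matching block.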
3.3 Experiments

Implementation details: We use the stable-diffusion v1.4 checkpoint as our LDM, which internally uses the frozen ViT-L/14 CLIP text encoder [88]. We implement the ZNet and LD-ZNet described above in PyTorch inside the stable-diffusion library. We initialize our networks with the weights from the LDM wherever possible, while initializing the remaining parameters from a normal distribution. We train ZNet and LD-ZNet on 8 NVIDIA A100 GPUs with a batch size of 4, using the Adam optimizer and a base learning rate of 5e-7 per mini-batch sample per GPU. For all our experiments, we keep the text encoder frozen and use an image resolution of 384 for a fair comparison with previous works.

Datasets: We use PhraseCut [130], currently the largest dataset for the text-based image segmentation task, with nearly 340K phrases and corresponding segmentation masks that not only include annotations for stuff classes but also accommodate multiple instances. Following [88], we randomly augment the phrases from a fixed set of prefixes. For the images, we randomly crop a square around the object of interest with maximum area, ensuring that the object remains at least partially visible. We avoid negative samples to remove ambiguity in the LDM features for non-existent objects.

We create a dataset of AI-generated images, which we name the AIGI dataset, to showcase the usefulness of our approach for text-based segmentation on a different domain. We use 100 AI-generated images from lexica.art (https://lexica.art) and manually annotate multiple regions for 214 text prompts relevant to these images. Figure 3.6 depicts some of the images from the AIGI dataset along with their annotated labels and categorical captions.

Figure 3.6: Samples from the AIGI dataset along with annotated labels and categorical captions (examples include Bicycle; Teddy Bear; Bald Eagle; Fish; Young Adult Female; Zoologist; Tsunami; Giant Wave; Water; Vehicles; Cars).

We also use the popular referring expression segmentation datasets, namely RefCOCO [131], RefCOCO+ [131], and G-Ref [132], to demonstrate the generalization abilities of ZNet and LD-ZNet. In RefCOCO, each image contains two or more objects, and each expression has an average length of 3.6 words. RefCOCO+ is derived from RefCOCO by excluding certain absolute-location words and focuses on purely appearance-based descriptions. For example, it uses "the man in the yellow polka-dotted shirt" rather than "the second man from the left", which makes it more challenging. Unlike RefCOCO and RefCOCO+, the average sentence length in G-Ref is 8.4 words, with more words about locations and appearances. We adopt the UNC partition for RefCOCO and RefCOCO+ and the UMD partition for G-Ref.

Metrics: We follow the evaluation methodology of [87] and report the best foreground IoU (IoUFG) over the foreground pixels, the best mean IoU over all pixels (mIoU), and the Average Precision (AP).
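The sketch below illustrates how these metrics can be computed from predicted mask probabilities under our reading of [87]'s protocol, where the "best" IoU values are taken over a sweep of binarization thresholds and AP is computed over all pixels. The threshold grid and the use of scikit-learn for AP are choices of this sketch, not necessarily those of the official evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(probs: np.ndarray, gts: np.ndarray, thresholds=np.linspace(0.05, 0.95, 19)):
    """probs, gts: (N, H, W) arrays of predicted probabilities and binary ground truth."""
    best_miou, best_fg_iou = 0.0, 0.0
    for thr in thresholds:
        preds = probs >= thr
        inter = np.logical_and(preds, gts).sum(axis=(1, 2))
        union = np.logical_or(preds, gts).sum(axis=(1, 2))
        fg_iou = inter.sum() / max(union.sum(), 1)      # IoU aggregated over foreground pixels
        miou = np.mean(inter / np.maximum(union, 1))    # mean IoU over samples
        best_fg_iou, best_miou = max(best_fg_iou, fg_iou), max(best_miou, miou)
    ap = average_precision_score(gts.reshape(-1), probs.reshape(-1))
    return {"mIoU": best_miou, "IoUFG": best_fg_iou, "AP": ap}
```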
3.4 Results

3.4.1 Image Segmentation Using Text Prompts

On the PhraseCut dataset, we compare the performance of previous approaches with our ZNet and LD-ZNet for the text-based image segmentation task (Table 3.1). To showcase the improvement from our proposed networks, we create a baseline named RGBNet with the same architecture as ZNet, except that we use the original images as input instead of the latent space z. For RGBNet, we use additional learnable convolutional layers to map the original image to the input resolution of ZNet.

Method                      mIoU   IoUFG   AP
MDETR [89]                  53.7   -       -
GLIPv2-T [90]               59.4   -       -
RMI [130]                   21.1   42.5    -
Mask-RCNN Top [130]         39.4   47.4    -
HulaNet [130]               41.3   50.8    -
CLIPSeg (PC+) [87]          43.4   54.7    76.7
CLIPSeg (PC, D=128) [87]    48.2   56.5    78.2
RGBNet                      46.7   56.2    77.2
ZNet (Ours)                 51.3   59.0    78.7
LD-ZNet (Ours)              52.7   60.0    78.9

Table 3.1: Text-based image segmentation performance on the PhraseCut test set. The performance of ZNet and LD-ZNet is highlighted in gray. Both models outperform the RGBNet baseline on all metrics.

From Table 3.1, we observe that ZNet and LD-ZNet significantly outperform RGBNet. Specifically, the performance improvement from using the latent representation z over the original images is clear (ZNet vs. the RGBNet baseline). Performance further improves upon incorporating the LDM visual-linguistic representations (LD-ZNet), by 6% overall on the mIoU metric compared to RGBNet. We also highlight this qualitatively in Figure 3.7, which shows the original image and the GT mask along with outputs from the RGBNet baseline, ZNet, and LD-ZNet; both ZNet and LD-ZNet improve results consistently. For example, in the top row, RGBNet detects light fixtures for the "hanging clock" prompt; ZNet has weaker activations for these incorrect detections, and LD-ZNet correctly segments the clock. Similarly, in the bottom row, RGBNet gets the "castle" completely wrong, whereas ZNet has activations on the right buildings, though with lower confidence, and LD-ZNet improves it further.

Figure 3.7: Qualitative comparison on the PhraseCut test set (columns: input, GT mask, RGBNet, ZNet, LD-ZNet). Each row contains an input image and a text prompt, with the goal of segmenting the image regions corresponding to the reference text. The text prompts are "hanging clock" and "castle" for the top and bottom rows. ZNet and LD-ZNet show improvements compared to RGBNet.

We outperform all previous works on all metrics, except MDETR [89] and GLIPv2 [90]. Notably, these works are pre-trained on detection and phrase grounding to predict bounding boxes on huge corpora of text-image pairs across various publicly available datasets with bounding-box annotations, and are later fine-tuned on PhraseCut for the segmentation task. Our work, in contrast, is orthogonally focused on exploring and utilizing LDMs and their internal features to improve text-based segmentation performance. Note that object detection datasets overlap well with the visual content in PhraseCut, but they are not representative of the diversity of images available on the internet. For example, while such methods could learn common concepts like sky, ocean, chair, table, and their synonyms, methods like MDETR would not understand concepts like Mickey Mouse or Pikachu, as we will show in Section 3.5.

3.4.2 Generalization to AI Generated Images

With the growing popularity of AI-generated images, text-based image segmentation is being used extensively by content creators in their daily workflows. Many public libraries, such as imaginAIry and stable-diffusion-webui, widely employ methods such as CLIPSeg [87] for performing segmentation on AI-generated images, so we study the generalization ability of our proposed segmentation approach on AI-generated images. To this end, we first prepare a dataset of 100 AI-generated images from lexica.art and manually annotate them using 214 text prompts. We name this dataset AIGI and release it on our project website (https://koutilya-pnvr.github.io/LD-ZNet/) for future research. Next, we evaluate our approaches ZNet and LD-ZNet along with our RGBNet baseline and other text-based segmentation methods: CLIPSeg (PC+) [87], MDETR [89], and SEEM [92].
GLIPv2 and the SAM model [91] with textual input were not publicly available for us to evaluate at the time of this work. All these methods except SEEM are trained on the PhraseCut dataset, and we report the metrics in Table 3.2.

Method               mIoU   AP
MDETR [89]           53.4   63.8
CLIPSeg (PC+) [87]   56.4   79.0
SEEM [92]            57.4   70.0
RGBNet               63.4   84.1
ZNet (Ours)          68.4   85.0
LD-ZNet (Ours)       74.1   89.6

Table 3.2: Generalization of the proposed LD-ZNet to our AIGI dataset, compared with other state-of-the-art text-based segmentation methods.

RGBNet outperforms CLIPSeg, MDETR, and SEEM because it is built on the UNet architecture initialized from the LDM weights, which contain semantic information that supports good generalization. Our methods ZNet and LD-ZNet further improve the generalization to these AI-generated images, by more than 20% compared to MDETR. This is largely due to the robust z-space of the LDM, which results from VQGAN pre-training on a variety of domains like art, cartoons, and illustrations. Furthermore, the latent diffusion features, which contain useful semantic information for the synthesis task, also help in segmenting AI-generated images. We show a qualitative comparison of these methods in Figure 3.8 for four AI-generated images from our dataset. While CLIPSeg can estimate the most distinctive regions, such as the face of Mickey Mouse or the rough locations of the Goblin, Ramen, and animals, MDETR and SEEM segment them incorrectly, because these concepts are unknown to MDETR and because of the domain gap between SEEM's training data and AIGI images. In both cases, our proposed LD-ZNet estimates accurate segmentations. More qualitative results for LD-ZNet on images from the AIGI dataset are shown in Figures 3.9 and 3.10.

Figure 3.8: Qualitative comparison on AI-generated images for text-based segmentation (columns: input, MDETR [89], CLIPSeg [87], SEEM [92], LD-ZNet). The text prompts are "Mickey mouse", "Goblin", "Ramen", and "animals", respectively.

Figure 3.9: More qualitative comparisons on AI-generated images from the AIGI dataset for text-based segmentation (columns: input, MDETR [89], CLIPSeg [87], SEEM [92], LD-ZNet). The text prompts are "Spiderman", "tortoise", "vespa", and "robot", respectively.

Figure 3.10: More qualitative results of LD-ZNet on the AIGI dataset, for the prompts "Hoodie", "Spiderman", "Owl", "Trump", "Pikachu", "Joker", "Godzilla", and "Eiffel".

3.4.3 Generalization to Referring Expressions

The referring expression segmentation task is aimed at robot-localization types of applications, where instance-level segmentation is performed through distinctive referring expressions. Many works such as [85, 86] also train the text encoder to learn complex positional references in the text. In contrast, we focus on generic text-based segmentation that supports stuff categories as well as multiple instances. We study the generalization ability of the proposed approach of using LDM features to this more complex task. Specifically, we use the models trained on the PhraseCut dataset and evaluate them on the RefCOCO [131], RefCOCO+ [131], and G-Ref [132] datasets, whose complex referring expressions target single-instance localization and segmentation.
We also evaluate the generalization of the CLIPSeg (PC+) [87] model, which was trained on an extended version of the PhraseCut dataset (PC+), to further demonstrate the generalization capability of our methods. Table 3.3 summarizes the performance of our models along with the RGBNet baseline. We observe a similar trend of performance improvements, RGBNet < ZNet < LD-ZNet. These experiments demonstrate that the LDM features enhance the generalization power of LD-ZNet even on complex referring expressions.

                     RefCOCO        RefCOCO+       G-Ref
Method               IoU    AP      IoU    AP      IoU    AP
CLIPSeg (PC+) [87]   30.1   14.1    30.3   15.5    33.8   23.7
RGBNet               36.3   15.7    37.1   16.7    41.9   27.8
ZNet (Ours)          40.1   16.8    40.9   17.8    47.1   29.2
LD-ZNet (Ours)       41.0   17.2    42.5   18.6    47.8   30.8

Table 3.3: Generalization of our proposed approaches to different types of expressions from other datasets. ZNet and LD-ZNet outperform both the RGBNet baseline and CLIPSeg across all datasets.

3.4.4 Inference Time

During inference, our proposed LD-ZNet relies on the LDM to extract the internal features for just a single timestep (as opposed to around 50 reverse diffusion timesteps for the text-to-image synthesis task). We then use these LDM features for cross-attention into LD-ZNet via the attention pool layer to extract the final mask. Therefore, using the diffusion model increases the overall run time by only a small amount. For the stable-diffusion model, inference takes 2.57s for 50 timesteps to synthesize an image (roughly 51ms per timestep), whereas the average inference times for RGBNet, ZNet, and LD-ZNet are only 62ms, 55ms, and 101ms per image, respectively, on the AIGI dataset with an RTX A6000 GPU. SEEM [92] takes 293ms for the same task. Since we use an architecture similar to the UNet from the second stage of the LDM as our segmentation network, the proposed LD-ZNet has 925M trainable parameters.

3.4.5 Cross-attention vs. Concatenation for LDM features

In LD-ZNet, we inject LDM features into the ZNet model using cross-attention (Figure 3.5). To understand the importance of the cross-attention layer, we also train and evaluate another model where the LDM features are concatenated with the ZNet features right before the spatial-attention layer. The results are summarized in Table 3.4 and show that concatenating the LDM features yields inferior results compared to the proposed method. This is because of the attention pool layer, which serves as a learnable layer and also encodes positional information into the LDM features for setting up the cross-attention. Moreover, the cross-attention layer learns how feature pixels from ZNet attend to feature pixels from the LDM, thereby leveraging context and correlations from the entire image. With concatenation, however, we only fuse the spatially corresponding features of the LDM and ZNet, which is sub-optimal.

Diffusion features via          mIoU   IoUFG   AP
LD-ZNet with concatenation      50.2   59.0    78.1
LD-ZNet with cross-attention    52.7   60.0    78.9

Table 3.4: Incorporating LDM features into ZNet via cross-attention (LD-ZNet) leverages the visual-linguistic information present in them better than concatenation, leading to improved performance on the text-based image segmentation task.

3.5 Discussion

In this section, we present more qualitative results to demonstrate several interesting aspects of our proposed technique when applied to downstream segmentation tasks. In Figures 3.8 to 3.12, we visualize results of text-based image segmentation on a diverse set of images, including AI-generated images, illustrations, and generic photographs. In Figure 3.11, we show that when LD-ZNet is applied to the same image with various text prompts, it correctly segments the object and stuff classes being referred to in both examples. This capability is crucial for open-world segmentation and overall understanding of the scene.
The results also highlight that the algorithm works remarkably well on other domains like cartoons and illustrations. It is noteworthy that LD-ZNet can perform accurate segmentation for text prompts that include cartoons (Pikachu, Godzilla), celebrities (Donald Trump, Spiderman), and famous landmarks (Eiffel Tower), as seen in Figure 3.10. Finally, Figure 3.12 shows the advantages of leveraging the semantic information present in the latent diffusion features. Compared to our baseline RGBNet, the proposed LD-ZNet generates better segmentation maps across animations, celebrity images, and illustrations.

Figure 3.11: LD-ZNet text-based image segmentation results for a real image and for illustrations, on a diverse set of things and stuff classes (prompts include "Books", "Flowers", "Sofa", "Table", "Trees", "Chair", "Clouds", "Grass", "Mountains", "River", "Buildings", "Crosswalk", "Bicycle", and "Bridge"). High-quality segmentation across multiple classes suggests that LD-ZNet has a good understanding of the overall scene.

Figure 3.12: More qualitative examples (columns: RGBNet, LD-ZNet) where RGBNet fails to localize "Guitar" and "Panda" in animation images (top row), the famous celebrities "Scarlett Johansson" and "Kate Middleton" (second row), and objects such as "Lamp" and "Trees" in illustrations (bottom row). LD-ZNet benefits from using z combined with the internal LDM features to correctly segment these text prompts.

3.6 Summary

In this chapter, we presented a novel approach for text-based image segmentation using large-scale latent diffusion models. By training the segmentation models on the latent z-space, we were able to improve their generalization to new domains, such as AI-generated images. We also showed that this z-space is a better input representation for text-based segmentation on natural images. By utilizing the internal features of the LDM at appropriate timesteps, we were able to tap into the semantic information hidden inside the image synthesis pipeline using a cross-attention mechanism, which further improved segmentation performance on both natural and AI-generated images. This was experimentally validated on several publicly available datasets and on a new dataset of AI-generated images, which we will make publicly available.

Chapter 4: Conclusions and Future Work

4.1 Concluding Remarks

In this dissertation, we presented novel ways to utilize two popular deep generative models, namely GANs and diffusion models, to improve crucial tasks in computer vision: 1) geometry estimation and 2) text-based image segmentation, respectively.

1. GANs for Unsupervised Geometry Estimation. In Chapter 2, we proposed a generative SharinGAN module for unsupervised domain adaptation (UDA) to combine labeled synthetic and unlabeled real images during training. The SharinGAN translates just the domain-specific task-related information from both domains into a shared space that is input to the primary task network. The information unrelated to the task is left untouched by SharinGAN during this translation for both domains. With this formulation, we show much improved generalization of the primary task network on various estimation tasks (monocular depth estimation of outdoor scenes, face normal estimation, and lighting estimation), all in an unsupervised setting.

2. LDMs for Text-Based Image Segmentation.
In Chapter 3, we proposed to use large-scale latent diffusion models (LDMs) pretrained on internet data to improve text-based segmentation performance for several novel classes and on a variety of imagery: real, AI-generated, illustrations, animations, etc. The understanding of internet-scale concepts, along with the ability to synthesize various photorealistic objects from text, makes the LDM an intuitive candidate for improving text-based recognition performance. Our proposed segmentation pipeline, LD-ZNet, benefits from the z-space as well as the internal representations within the LDM, which are shown to contain semantic information. We showed improved segmentation performance for LD-ZNet not just on real images but also on AI-generated images, animations, illustrations, celebrity images, etc.

4.2 Future Work

As we move towards an era of large-scale datasets and greater compute, the generative models trained with them will only get more powerful. It thus becomes crucial to understand how to utilize these generative models to improve general computer vision systems. In Chapte