ABSTRACT Title of dissertation: LEVERAGING DEEP GENERATIVE MODELS FOR ESTIMATION AND RECOGNITION Koutilya PNVR Doctor of Philosophy, 2023 Dissertation directed by: Professor David W. Jacobs Department of Electrical and Computer Engineering Generative models are a class of statistical models that estimate the joint probability distribution on a given observed variable and a target variable. In computer vision, generative models are typically used to model the joint proba- bility distribution of a set of real image samples assumed to be on a complex high- dimensional image manifold. The recently proposed deep generative architec- tures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (DMs) were shown to generate photo-realistic im- ages of human faces and other objects. These generative models also became popular for other generative tasks such as image editing, text-to-image, etc. As appealing as the perceptual quality of the generated images has become, the use of generative models for discriminative tasks such as visual recognition or ge- ometry estimation has not been well studied. Moreover, with different kinds of powerful generative models getting popular lately, it’s important to study their significance in other areas of computer vision. In this dissertation, we demon- strate the advantages of using generative models for applications that go beyond just photo-realistic image generation: Unsupervised Domain Adaptation (UDA) between synthetic and real datasets for geometry estimation; Text-based image segmentation for recognition. In the first half of the dissertation, we propose a novel generative-based UDA method for combining synthetic and real images when training networks to determine geometric information from a single image. Specifically, we use a GAN model to map both synthetic and real domains into a shared image space by translating just the domain-specific task-related information from respective domains. This is connected to a primary network for end-to-end training. Ide- ally, this results in images from two domains that present shared information to the primary network. Compared to previous approaches, we demonstrate an im- proved domain gap reduction and much better generalization between synthetic and real data for geometry estimation tasks such as monocular depth estimation and face normal estimation. In the second half of the dissertation, we showcase the power of a recent class of generative models for improving an important recognition task: text- based image segmentation. Specifically, large-scale pre-training tasks like im- age classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foun- dation models built using text-based latent diffusion techniques may learn se- mantic boundaries. This is because they must synthesize intricate details about all objects in an image based on a text description. Therefore, we present a tech- nique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature rep- resentations like RGB images or CLIP encodings for text-based image segmenta- tion. 
By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image seg- mentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques. LEVERAGING DEEP GENERATIVE MODELS FOR ESTIMATION AND RECOGNITION by Koutilya PNVR Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023 Advisory Committee: Professor David W. Jacobs, Chair/Advisor Professor Joseph Jaja Professor Behtash Babadi Professor Jia-Bin Huang Professor Maria K. Cameron (Dean’s representative) © Copyright by Koutilya PNVR 2023 Dedication To my family — Subrahmanyam Ponukupati Sudha Ponukupati Syamala Pisapati Sandilya Ponukupati Sruthi Ponukupati Vaishnavi Ponukupati Sindhura Purnima Vempati For their constant support, love, sacrifice and selflessness. ii Acknowledgments I wish to express my deepest gratitude to the remarkable individuals who have been instrumental in my Ph.D. journey, contributing immeasurably to my growth and success. First and foremost, I extend my sincere appreciation to my advisor, Prof. David Jacobs. Despite my non-computer-vision background, he offered me the invaluable opportunity to work closely with him. I am profoundly thankful for the numerous research meetings and brainstorming sessions, which not only broadened my research horizons but also nurtured my ability to approach com- plex problems. His unwavering consideration for my circumstances has left an indelible mark, and I couldn’t have asked for a more exceptional Ph.D. advisor. I am deeply honored to have Prof. Joseph JaJa, Prof. Behtash Babadi, Prof. Jia-Bin Huang, and Prof. Maria K. Cameron as members of my dissertation com- mittee. Their commitment to serving on my committee and providing invaluable feedback to enhance the quality of this dissertation is greatly appreciated. My gratitude extends to my remarkable mentors, Bharat Singh and Hao Zhou, who have been constant sources of support during the challenging phases of my Ph.D. Their participation in research meetings and continuous motivation to explore fresh perspectives on research problems have been transformative. In particular, Bharat’s close collaboration and the research skills he imparted are beyond measure. Their guidance has been pivotal, and I owe a significant portion of my progress to their precious mentorship. iii I would like to acknowledge Dr. Varaprasad Bandaru for providing me with opportunities from the early stages of my academic journey, beginning with my master’s program. His enduring belief in my capabilities and involvement in his remarkable research projects have been essential in broadening the breadth of my knowledge during my Ph.D. journey. My fellow research peers at the University of Maryland, including students from the research groups of Prof. David Jacobs, Prof. Abhinav Shrivastava, and Prof. Tom Goldstein, have been a constant source of enlightening discussions and camaraderie. 
I am grateful to my colleagues from internships, including Pallabi Ghosh, Behjat Siddiquie from Amazon, and Abhijit Bendale, Pranav Mistry from STAR Labs, for the wonderful opportunities they provided, exposing me to real-world experiences. My thanks go to the International Student and Scholar Services (ISSS), the graduate school, the staff at the ECE and CS departments, and UMIACS for their friendly, liberal, and supportive approach. I will cherish the memories of my student life and the warmth of the university. To my friends - Shankar Reddy, Dwith CYN, Pallavi Chirumamilla, Sai Deepika Regani, Anirudh Mothukuri, Likhith Anvhesh, Sriram Vasudevan, Sai Sreedhar Varada, Mounika Chintakayala, Raghuvaran Yaramasu, Avinash Bheem- ineni, Spandana Gorantla, Harika Vakkanthula, Sreeharsha Vardhan Annu, and Manvitha Sree who have been my pillars of strength, offering relentless sup- port and creating wonderful memories, I extend my heartfelt appreciation. Your iv friendships have not only aided my personal growth but have also made my Ph.D. journey exceptionally smooth, making you a cherished part of my family. I would also like to express my sincere thanks to Sindhura Purnima, who entered my life at a crucial stage, offering constant support and understanding. I wholeheartedly believe she is my lucky charm, bringing much-needed fortune at a precious time. Last, but certainly not least, I owe a profound debt of gratitude to my par- ents and family members. Their constant motivation and belief in me, through both the good and challenging times, have been the cornerstone of my journey. I am forever indebted to them for the unwavering support and sacrifices they made to help me reach the point where I stand today. v Table of Contents Acknowledgements iii List of Tables ix List of Figures x 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Outline and Contributions . . . . . . . . . . . . . . . . 4 1.2.1 Leveraging GANs for Unsupervised Geometry Estimation (Chapter 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Leveraging LDMs for Text-Based Segmentation (Chapter 3) 5 1.2.3 Bidirectional Convolutional LSTM for the Detection of Vi- olence in Videos (Appendix A) . . . . . . . . . . . . . . . . 6 2 GANs for Unsupervised Geometry Estimation 7 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2.1 Losses for Generative Network . . . . . . . . . . . 15 2.2.2.2 Losses for the Task Network . . . . . . . . . . . . . 17 2.2.2.3 Monocular Depth Estimation . . . . . . . . . . . . 17 2.2.2.4 Face Normal Estimation . . . . . . . . . . . . . . . 18 2.2.2.5 Overall loss . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 Monocular Depth Estimation . . . . . . . . . . . . . . . . . 19 2.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1.2 Implementation details . . . . . . . . . . . . . . . 20 2.3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.1.4 Generalization to Make3D . . . . . . . . . . . . . 23 2.3.2 Face Normal Estimation . . . . . . . . . . . . . . . . . . . . 25 vi 2.3.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . 
25 2.3.2.2 Implementation details . . . . . . . . . . . . . . . 25 2.3.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.3 Ablation studies . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3 LDMs for Text-Based Image Segmentation 33 3.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1.1 Text-based image segmentation . . . . . . . . . . . . . . . . 36 3.1.2 Text-to-Image synthesis . . . . . . . . . . . . . . . . . . . . 37 3.1.3 Semantics in generative models . . . . . . . . . . . . . . . . 38 3.2 LDMs for Text-Based Segmentation . . . . . . . . . . . . . . . . . . 39 3.2.1 ZNet: Leveraging Latent Space Features . . . . . . . . . . . 40 3.2.2 LD-ZNet: Leveraging Diffusion Features . . . . . . . . . . . 42 3.2.2.1 Visual-Linguistic Information in LDM Features . 43 3.2.2.2 LD-ZNet Architecture . . . . . . . . . . . . . . . . 44 3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Image Segmentation Using Text Prompts . . . . . . . . . . . 48 3.4.2 Generalization to AI Generated Images . . . . . . . . . . . . 51 3.4.3 Generalization to Referring Expressions . . . . . . . . . . . 56 3.4.4 Inference Time . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.5 Cross-attention vs Concat for LDM features . . . . . . . . . 58 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4 Conclusions and Future Work 63 4.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 A Bidirectional Convolutional LSTM for the Detection of Violence in Videos 66 A.1 Contributions and Proposed Approach . . . . . . . . . . . . . . . . 67 A.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.3.1 Spatiotemporal Encoder Architecture . . . . . . . . . . . . 71 A.3.1.1 Spatial Encoding . . . . . . . . . . . . . . . . . . . 72 A.3.1.2 Temporal Encoding . . . . . . . . . . . . . . . . . 73 A.3.1.3 Classifier . . . . . . . . . . . . . . . . . . . . . . . 75 A.3.2 Spatial Encoder Architecture . . . . . . . . . . . . . . . . . 76 A.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 A.5 Training Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.6.1 Hockey Fights and Movies . . . . . . . . . . . . . . . . . . . 78 A.6.2 Violent Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.6.3 Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . 80 vii A.6.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.6.4.1 Spatial vs Spatiotemporal Encoders . . . . . . . . 83 A.6.4.2 Elementwise Max Pooling vs. Last Encoding . . . 84 A.6.4.3 ConvLSTM vs. BiConvLSTM . . . . . . . . . . . . 85 A.6.4.4 AlexNet vs. VGG13 . . . . . . . . . . . . . . . . . 85 A.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Bibliography 88 viii List of Tables 2.1 Quantitative results for Monocular Depth Estimation (MDE) . . . 21 2.2 Generalization capability of SharinGAN for MDE . . . . . . . . . . 
24 2.3 Quantitative results for Face Normal estimation . . . . . . . . . . . 26 2.4 Quantitative results for Lighting Estimation . . . . . . . . . . . . . 29 2.5 Ablation study - Significance of SharinGAN module and recon- struction loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Ablation study - Significance of SharinGAN module and recon- struction loss on unseen make3D dataset . . . . . . . . . . . . . . . 31 3.1 Text-based image segmentation performance on PhraseCut . . . . 49 3.2 Generalization to our AIGI dataset . . . . . . . . . . . . . . . . . . 52 3.3 Generalization to Referring Image Segmentation datasets - Ref- COCO, RefCOCO+ and G-Ref . . . . . . . . . . . . . . . . . . . . . 57 3.4 Ablation studies - Cross-attn vs Concat . . . . . . . . . . . . . . . . 59 A.1 Quantitative results on Hockey, Movies and Violent Flows datasets 81 ix List of Figures 1.1 Generative models for domain adaptation . . . . . . . . . . . . . . 2 1.2 Illustration of the text-based image segmentation task . . . . . . . 3 2.1 Proposed way to reduce domain gap between synthetic and real data 8 2.2 SharinGAN architecture . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Qualitative results for Monocular Depth Estimation (MDE) . . . . 22 2.4 Visualization of regions corresponding to domain gap reduction - MDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5 Generalization capability of SharinGAN for MDE . . . . . . . . . . 25 2.6 Qualitative results for Face Normal Estimation . . . . . . . . . . . 27 2.7 Visualization of regions corresponding to domain gap reduction - FNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1 Latent diffusion model (LDM) containing visual linguistic infor- mation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Reconstructions from the first stage of the LDM . . . . . . . . . . . 40 3.3 Overview of the proposed ZNet and LD-ZNet architectures . . . . 41 3.4 Visual-linguistic semantic information in the internal features of a pretrained LDM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.5 LDM internal features into ZNet via Attention Pool . . . . . . . . . 45 3.6 Samples from AIGI dataset . . . . . . . . . . . . . . . . . . . . . . . 47 3.7 Qualitative comparison on the PhraseCut dataset . . . . . . . . . . 51 3.8 Qualitative comparison on the AIGI samples for text-based seg- mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.9 More qualitative comparison on the AIGI samples for text-based segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.10 More qualitative results of LD-ZNet from AIGI dataset . . . . . . . 56 3.11 LD-ZNet does well in multi-object segmentation - Good overall scene understanding . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.12 LD-ZNet’s ability to segment objects in animations, celebrity im- ages and illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 62 A.1 Overview of the Spatiotemporal architecture . . . . . . . . . . . . . 72 x A.2 Overview of a BiConvLSTM Cell . . . . . . . . . . . . . . . . . . . . 75 A.3 Overview of the Spatial encoder architecture . . . . . . . . . . . . 76 A.4 Performance on the Hockey dataset evaluated using the Spatial En- coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.5 Performance on the Violent Flows evaluated using the Spatiotem- poral Encoder . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 82 A.6 Ablation studies - Spatial vs Spatiotemporal Encoders on the Hockey dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.7 Ablation studies - Spatial vs Spatiotemporal Encoders on the Vio- lent Flows dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 A.8 Ablation studies - Elementwise Max-pooling vs Last Encoding . . 84 A.9 Ablation studies - ConvLSTM vs BiConvLSTM . . . . . . . . . . . 85 A.10 Ablation studies - AlexNet vs VGG13 . . . . . . . . . . . . . . . . . 86 xi Chapter 1: Introduction 1.1 Motivation The recently proposed deep generative architectures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs) and Diffusion models (DMs) were shown to exhibit photo-realistic image generation quality. Many generative applications such as image editing, text-to-image etc also be- came popular with these models. However, the use of these generative models for tasks such as representation learning, visual recognition or geometry estima- tion has been little explored. Typically, such discriminative tasks are solved with CNN or transformer based classifiers that excel at obtaining decision boundaries between classes in the training data. Deep generative models on the other hand, estimate the joint probability distribution of the entire training data. Such mod- els hold more information about the training data and are capable of generating realistic looking samples from the distribution. Moreover, with the size of the datasets getting bigger and the architectures becoming more powerful, exploring the importance of deep generative models for tasks that go beyond just image generation becomes critical. Generative models have been studied for tasks such as representation learn- 1 Source Target Generative model Source Target Primary task network Source prediction Target prediction Source GT Figure 1.1: Generative models can be used to reduce domain gap between labeled source and unlabeled target domains. ing [1–7], synthetic data generation [8–10], domain adaptation [11–15] etc. How- ever the rapid progress in the generative models research and the underlying techniques did not scale similarly in these areas. In this dissertation, we attempt to explore and leverage specific deep generative models to improve performance in estimation and recognition tasks namely 1) Unsupervised domain adaptation for geometry estimation and 2) Text-based image segmentation, respectively. Unsupervised domain adaptation refers to the problem of reducing the do- main gap between a labeled source domain and an unlabeled target domain. For geometry estimation such as monocular depth estimation (MDE) and face nor- mal estimation (FNE), some lines of work depend on the vast amount of labeled synthetic data as the source domain and attempt to make it generalize to the real data. Previous works that used generative models for unsupervised geometry estimation, proposed to translate the synthetic data into real-like or vice-versa. However, such an inter-domain mapping is an unnecessarily challenging prob- lem for the generative model and would serve as a bottleneck for the downstream primary task network. We propose a better way to reduce the domain gap by us- ing a GAN based framework that translates just the right amount of information 2 A picture of a large fountain Text-based image segmentation network Figure 1.2: Text-based image segmentation aims to segment regions in the image that refer to an input text prompt. 
from both synthetic and real domains into a shared image space. A high level overview of the proposed approach is illustrated in Figure 1.1. This shared im- age space is shown to have better properties in terms of domain generalization for geometry estimation. Specifically, we observe it is only necessary to translate the domain-specific task related information of respective domains into a shared image space. This mapping need not modify the information of original domains that is not related to the primary task as the primary network will learn to ignore them regardless. This simple and intuitive formulation combined with the im- age translation ability of the generative model, helps the primary task network to look at shared information from both domains with much less domain gap leading to better generalization. Also, with recent advances in diffusion models (DMs) [16, 17] in uncon- ditional and class conditional settings, they have started gaining more traction compared to GANs. This class of generative models became even more popular for their generated visual quality in text-to-image tasks. Recently, latent diffusion 3 models (LDMs) [18] were proposed that operate on a perceptually compressed la- tent space obtained from an internal first stage. LDMs became a popular choice for text-to-image applications for their ability to learn and operate with lower computational cost and on large scale datasets. Such large scale LDMs were shown to exhibit photo-realistic text-to-image visual quality and lead to several visual-linguistic applications such as text guided image inpainting, personalized text-to-image etc. This indicates that pretrained LDMs contain semantic infor- mation about various objects from the internet. However, the usefulness of these powerful LDMs have not been explored for text-based recognition problems such as text-based segmentation task illustrated in Figure 1.2. In this dissertation, we propose a text-based segmentation network named LD-ZNet that utilizes an LDM pretrained on large datasets. We show that the segmentation network, with the help of LDM, learns knowledge of novel concepts from the internet without requiring annotations. Overall, our LD-ZNet can segment objects from the inter- net in various imagery such as real, AI-Generated, animations, illustrations and celebrity images. 1.2 Dissertation Outline and Contributions 1.2.1 Leveraging GANs for Unsupervised Geometry Estimation (Chapter 2) In this Chapter, we propose a novel generative-based UDA method for com- bining labeled-synthetic and unlabeled-real images when training networks to 4 determine geometric information from a single image. Our proposal outlines a strategy to project both image categories into a single, shared domain. This shared domain acts as input to the primary network during end-to-end training. Consequently, the primary network learns from the shared information of both domains and generalizes much better to real-images during test-time. Our ex- periments demonstrate significant improvements over the state-of-the-art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting. 1.2.2 Leveraging LDMs for Text-Based Segmentation (Chapter 3) In this Chapter, we propose LD-ZNet a text-based segmentation network that uses an LDM pretrained on large-scale data. 
Specifically, we suggest a way to use the z-space and the internal representations inside the LDM to improve segmentation performance for novel concepts on various imagery such as real, AI-generated, animations, illustrations and celebrity images. We additionally create a new dataset named AIGI consisting of AI-Generated images along with object labels and categorical captions for evaluating the generalization ability of text-based segmentation methods to AI-Generated content. We show a huge improvement of around 20% for LD-ZNet over existing text-based segmentation methods on the AIGI dataset. 5 1.2.3 Bidirectional Convolutional LSTM for the Detection of Vio- lence in Videos (Appendix A) 1The field of action recognition has gained tremendous traction in recent years. A subset of this, detection of violent activity in videos, is of great impor- tance, particularly in unmanned surveillance or crowd footage videos. In this appendix, we explore this problem on three standard benchmarks widely used for violence detection: the Hockey Fights, Movies, and Violent Flows datasets. To this end, we introduce a Spatiotemporal Encoder, built on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture. The addition of a bidirectional temporal encoding and the elementwise max pooling of these encodings in the Spatiotemporal Encoder is novel in the field of violence detection. This addi- tion is motivated by a desire to derive better video representations via leveraging long-range information in both temporal directions of the video. We find that the Spatiotemporal network is comparable in performance with existing meth- ods for all of the above datasets. A simplified version of this network, the Spatial Encoder is sufficient to match state-of-the-art performance on the Hockey Fights and Movies datasets. However, on the Violent Flows dataset, the Spatiotemporal Encoder outperforms the Spatial Encoder. 1This is placed in the appendix because it is an early thesis work that does not directly connect to the main content of this dissertation. 6 Chapter 2: GANs for Unsupervised Geometry Estimation 1Understanding geometry from images is a fundamental problem in com- puter vision. It has many important applications. For instance, Monocular Depth Estimation (MDE) is important for synthetic object insertion in computer graph- ics [20], grasping in robotics [21] and safety in self-driving cars. Face Normal Es- timation can help in face image editing applications such as relighting [22–24]. However, it is extremely hard to annotate real data for these regression tasks. Synthetic data and their ground truth labels, on the other hand, are easy to gen- erate and are often used to compensate for the lack of labels in real data. Deep models trained on synthetic data, unfortunately, usually perform poorly on real data due to the domain gap between synthetic and real distributions. To deal with this problem, several research studies [25–28] have proposed unsupervised do- main adaptation methods to take advantage of synthetic data by mapping it into the real domain or vice versa, either at the feature level or image level. However, mapping examples from one domain to another domain itself is a challenging problem that can limit performance. We observe that finding such a mapping solves an unnecessarily difficult 1Work done with Hao Zhou and David Jacobs. Accepted [19] in CVPR 2020. 
7 G real synthetic Figure 2.1: We propose to reduce the domain gap between synthetic and real by mapping the corresponding domain specific information related to the primary task (δs,δr ) into shared information δsh, preserving everything else. problem. To train a regressor that applies to both real and synthetic domains, it is only necessary that we map both to a new representation that contains the task- relevant information present in both domains, in a common form. The mapping need not alter properties of the original domain that are irrelevant to the task since the regressor will learn to ignore them regardless. To see this, we consider a simplified model of our problem. We suppose that real and synthetic images are formed by two components: domain agnostic (which has semantic information shared across synthetic and real, and is denoted as I) and domain specific. We further assume that domain specific information has two sub-components: domain specific information unrelated to the primary task (denoted as δ′s and δ′r for synthetic and real images respectively) and domain specific information related to the primary task (δs, δr). So real and synthetic images can be represented as: xr = f (I,δr ,δ′r) and xs = f (I,δs,δ′s) respectively. We believe the domain gap between {δs and δr} can affect the training of the primary network, which learns to expect information that is not always present. The domain gap between {δ′s and δ′r}, on the other hand, can be bypassed by the 8 primary network since it does not hold information needed for the primary task. For example, in real face images, information such as the color and texture of the hair is unrelated to the task of estimating face normals but is discriminative enough to distinguish real from synthetic faces. This can be regarded as domain specific information unrelated to the primary task i.e., δ′r . On the other hand, shadows in the real and synthetic images, due to the limitations of the rendering engine, may have different appearances but may contain depth cues that are re- lated to the primary task of MDE in both domains. The simplest strategy, then, for combining real and synthetic data is to map δs and δr to a shared representa- tion, δsh, while not modifying δ′s and δ′r as shown in Figure 2.1. Recent research studies show that a shared network for synthetic and real data can help reduce the discrepancy between images in different domains. For instance, [22] achieved state-of-the-art results in face normal estimation by train- ing a unified network for real and synthetic data. [13] learned the joint distri- bution of multiple domain images by enforcing a weight-sharing constraint for different generative networks. Inspired by these research studies, we define a unified mapping function G, which is called SharinGAN, to reduce the domain gap between real and synthetic images. Different from existing research studies, our G is trained so that minimum domain specific information is removed. This is achieved by pre-training G as an auto-encoder on real and synthetic data, i.e., initializing G as an identity func- tion. Then G is trained end-to-end with reconstruction loss in an adversarial framework, along with a network that solves the primary task, further pushing 9 G to map information relevant to the task to a shared domain. As a result, a successfully trained G will learn to reduce the domain gap existing in δs and δr , mapping them into a shared domain δsh. G will leave I unchanged. 
δ′s and δ′r can be left relatively unchanged when it is difficult to map them to a common representation. Mathematically, G(xs) = f (I,δsh,δ′s) and G(xr) = f (I,δsh,δ′r). If successful, G will map synthetic and real images to images that may look quite different to the eye, but the primary task network will extract the same information from both. We apply our method to unsupervised monocular depth estimation using virtual KITTI (vKITTI) [29] and KITTI [30] as synthetic and real datasets respec- tively. Our method reduces the absolute error in the KITTI eigen test split and the test set of Make3D [31] by 23.77% and 6.45% respectively compared with the state-of-the-art method [27]. Additionally, our proposed method improves over SfSNet [22] on face normal estimation. It yields an accuracy boost of nearly 4.3% for normal prediction within 20◦ (Acc < 20◦) of ground truth on the Photoface dataset [32]. 2.1 Related Work Monocular Depth Estimation has long been an active area in computer vision. Because this problem is ill-posed, learning-based methods have predomi- nated in recent years. Many early learning works applied Markov Random Fields (MRF) to infer the depth from a single image by modeling the relation between 10 nearby regions [31, 33, 34]. These methods, however, are time-consuming dur- ing inference and rely on manually defined features, which have limitations in performance. More recent studies apply deep Convolutional Neural Networks (CNNs) [35–42] to monocular depth estimation. Eigen [35] first proposed a multi-scale deep CNN for depth estimation. Following this work, [36] proposed to apply CNNs to estimate depth, surface normal and semantic labels together. [37] com- bined deep CNNs with a continuous CRF for monocular depth estimation. One major drawback of these supervised learning-based methods is the requirement for a huge amount of annotated data, which is hard to obtain in reality. With the emergence of large scale, high-quality synthetic data [29], using synthetic data to train a depth estimator network for real data became popu- lar [26, 27]. The biggest challenge for this task is the large domain gap between synthetic data and real data. [28] proposed to first train a depth prediction net- work using synthetic data. A style transfer network is then trained to map real images to synthetic images in a cycle consistent manner [43]. [25] proposed to adapt the features of real images to the features of synthetic images by applying adversarial loss on latent features. A content congruent regularization is further proposed to avoid mode collapse. T2Net [26] trained a network that translates synthetic data into real at the image level and further trained a task network in this translated domain. GASDA [27] proposed to train the network by incorporat- ing epipolar geometry constraints for real data along with the ground truth labels for synthetic data. All these methods try to align two domains by transferring one 11 domain to another. Unlike these works, we propose a mapping function G, also called SharinGAN, to just align the domain specific information that affects the primary task, resulting in a minimum change in the images in both domains. We show that this makes learning the primary task network much easier and can help it focus on the useful information. Self-supervised learning is another way to avoid collecting ground truth labels for monocular depth estimation. Such methods need monocular videos [44–47], stereo pairs [48–51], or both [47] for training. 
Our proposed method is complementary to these self-supervised methods, it does not require this addi- tional data, but can use it when available. FaceGeometry Estimation is a sub-problem of inverse face rendering which is the key for many applications such as face image editing. Conventional face ge- ometry estimation methods are usually based on 3D Morphable Models (3DMM) [52]. Recent studies demonstrate the effectiveness of deep CNNs for solving this problem [22, 53–58]. Thanks to the 3DMM, generating synthetic face images with ground truth geometry is easy. [22,53,54] make use of synthetic face images with ground truth shape to help train a network for predicting face shape using real images. Most of these works initially pre-train the network with synthetic data and then fine-tune it with a mix of real and synthetic data, either using no supervision or weak supervision, overlooking the domain gap between real and synthetic face images. In this work, we show that by reducing the domain gap between real and synthetic data using our proposed method, face geometry can be better estimated. 12 Domain Adaptation using GANs There are many works [11–15] that use a GAN framework to perform domain adaptation by mapping one domain into an- other via a supervised translation. However, most of these show performance on just toy datasets in a classification setting. We attempt to map both synthetic and real domains into a new shared domain that is learned during training and use this to solve complex problems of unsupervised geometry estimation. Moreover, we apply adversarial loss at the image level for our regression task, in contrast to some of the above previous works where domain invariant feature engineering sufficed for classification tasks. 2.2 Approach To compensate for the lack of annotations for real data and to train a pri- mary task network on easily available synthetic data, we propose SharinGAN to reduce the domain gap between synthetic and real. We aim to train a pri- mary task network on a shared domain created by SharinGAN, which learns the mapping function G : xr 7→ xshr and G : xs 7→ xshs , where xk = f (I,δk ,δ′k); xshk = f (I,δsh,δ′k); k ∈ {r, s} as shown in Figure 2.1. G allows the primary task network to train on a shared space that holds the information needed to do the primary task, making the network more applicable to real data during testing. To achieve this, an adversarial loss is used to find the shared information, δsh. This is done by minimizing the discrepancy in the distributions of xshr and xshs . But at the same time, to preserve the domain agnostic information (shared 13 Synthetic Image  Real Image Real translated image Synthetic translated image G: Generator Primary Network Synthetic Prediction Real Prediction Synthetic GroundTruth Virtual Supervision Shared Semantic Image SharinGAN module Reconstruction loss T D: Image  Discriminator Figure 2.2: Overview of the model architecture. Red dashed arrows indicate the loss computations. semantic information I), we use reconstruction loss. Now, without a loss from the primary task network, G might change the images so that they don’t match the labels. To prevent that, we additionally use a primary task loss for both real and synthetic examples to guide the generator. It is important to note that both the translations from synthetic to real and vice versa are equally crucial for this symmetric setup to find a shared space. To facilitate that, we use a form of weak supervision we call virtual supervision. 
Some possible virtual supervisions include a prior on the input data or a constraint that can narrow the solution space for the primary task network (details discussed in 2.2.2.2). For synthetic examples, we use the known labels.

Adversarial, reconstruction and primary task losses together train the generator and primary task network to align the domain-specific information {δ_s, δ_r} of both domains into a shared space δ_sh, preserving everything else.

2.2.1 Framework

In this work, we propose to train a generative network, called SharinGAN, to reduce the domain gap between real and synthetic data so as to help train the primary network. Figure 2.2 shows the framework of our proposed method. It contains a generative network G, an image-level discriminator D that embodies the SharinGAN module, and a task network T that performs the primary task. The generative network G takes either a synthetic image x_s or a real image x_r as input and transforms it into x_s^sh or x_r^sh in an attempt to fool D. Different from existing works that transfer images from one domain to another [26–28], our generative network G tries to map the domain-specific parts δ_s and δ_r of synthetic and real images to a shared space δ_sh, leaving δ'_s and δ'_r unchanged. As a result, our transformed synthetic and real images (x_s^sh and x_r^sh) differ less from x_s and x_r. Our task network T then takes the transformed images x_s^sh and x_r^sh as input and predicts the geometry. The generative network G and the task network T are trained together in an end-to-end manner.

2.2.2 Losses

2.2.2.1 Losses for Generative Network

We design a single generative network G for synthetic and real data, since sharing weights can help align the distributions of different domains [13]. Moreover, existing research studies such as [22, 54] also demonstrate that a unified framework works reasonably well on synthetic and real images. In order to map δ_s and δ_r to a shared space δ_sh, we apply an adversarial loss [59] at the image level. More specifically, we use the Wasserstein discriminator [60], which uses the Earth-Mover's distance to minimize the discrepancy between the distributions of the translated synthetic and real examples {G(x_s), G(x_r)}, i.e.:

$$\mathcal{L}_W(D, G) = \mathbb{E}_{x_s}\big[D(G(x_s))\big] - \mathbb{E}_{x_r}\big[D(G(x_r))\big], \tag{2.1}$$

where D is the discriminator and G_e is the encoder part of the generator. Following [61], to overcome the problem of vanishing or exploding gradients caused by the weight clipping proposed in [60], a gradient penalty term is added when training the discriminator:

$$\mathcal{L}_{gp}(D) = \big(\|\nabla_{\hat{h}} D(\hat{h})\|_2 - 1\big)^2 \tag{2.2}$$

Our overall adversarial loss is then defined as:

$$\mathcal{L}_{adv} = \mathcal{L}_W(D, G) - \lambda \mathcal{L}_{gp}(D) \tag{2.3}$$

where λ is set to 10 while training the discriminator and to 0 while training the generator.

Without any constraints, the adversarial loss may learn to remove all domain-specific parts δ and δ′, or even some of the domain-agnostic part I, in order to fool the discriminator. This may lead to a loss of geometric information, which can degrade the performance of the primary task network T. To avoid this, we use a self-regularization loss similar to [62] that forces the transformed images to keep as much information as possible:

$$\mathcal{L}_r = \|G(x_s) - x_s\|_2^2 + \|G(x_r) - x_r\|_2^2. \tag{2.4}$$
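Taken together, Eqs. (2.1)–(2.4) form the generator-side objective: an adversarial term that pulls the translated synthetic and real distributions together, and a self-regularization term that keeps the translations close to their inputs. The PyTorch-style sketch below illustrates one way these terms can be computed. Here `G` and `D` are placeholders for the generator and the image-level critic, and the batch averaging and sign conventions are assumptions following the usual WGAN-GP min–max setup, not a transcription of the released implementation.

```python
import torch

def gradient_penalty(D, x_sh_s, x_sh_r):
    """Eq. (2.2): penalize the critic's gradient norm for deviating from 1,
    evaluated at random interpolations of translated synthetic/real images."""
    alpha = torch.rand(x_sh_s.size(0), 1, 1, 1, device=x_sh_s.device)
    h_hat = (alpha * x_sh_s + (1 - alpha) * x_sh_r).requires_grad_(True)
    d_hat = D(h_hat)
    grads = torch.autograd.grad(d_hat, h_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def generator_side_losses(G, D, x_s, x_r):
    """Eqs. (2.1) and (2.4) as seen by the generator (lambda = 0 in Eq. (2.3)).
    The generator minimizes l_w and l_rec, weighted by alpha_1, alpha_2 in Eq. (2.10)."""
    x_sh_s, x_sh_r = G(x_s), G(x_r)
    l_w = D(x_sh_s).mean() - D(x_sh_r).mean()                            # Eq. (2.1)
    l_rec = ((x_sh_s - x_s) ** 2).mean() + ((x_sh_r - x_r) ** 2).mean()  # Eq. (2.4)
    return l_w, l_rec

def critic_loss(G, D, x_s, x_r, lam=10.0):
    """Eq. (2.3) for the critic: ascend L_W - lam * L_gp, written as a loss to minimize."""
    with torch.no_grad():
        x_sh_s, x_sh_r = G(x_s), G(x_r)
    l_w = D(x_sh_s).mean() - D(x_sh_r).mean()
    return -l_w + lam * gradient_penalty(D, x_sh_s, x_sh_r)
```

In training, critic and generator updates alternate, and the generator additionally receives the task losses introduced next.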
2.2.2.2 Losses for the Task Network

The task network takes transformed synthetic or real images as input and predicts geometric information. Since ground truth labels are available for synthetic data, we apply a supervised loss using these labels. For real images, domain-specific losses or regularizations are applied as a form of virtual supervision, chosen according to the task. We apply our proposed SharinGAN to two tasks: monocular depth estimation (MDE) and face normal estimation (FNE). For MDE, we use the combination of depth smoothness and geometric consistency losses from GASDA [27] as the virtual supervision. For FNE, we use the pseudo-supervision from SfSNet [22]. We use the term "virtual supervision" to summarize these two losses as a kind of weak supervision on the real examples.

2.2.2.3 Monocular Depth Estimation

To make use of the ground truth labels for synthetic data, we apply an L1 loss on the predicted synthetic depth maps:

$$\mathcal{L}_1 = \|\hat{y}_s - y_s^*\|_1 \tag{2.5}$$

where ŷ_s is the predicted synthetic depth map and y*_s is its corresponding ground truth. Following [27], we apply a smoothness loss L_DS on the depth to encourage it to be consistent within locally homogeneous regions. A geometric consistency loss L_GC is applied so that the task network can learn the physical geometric structure through epipolar constraints. L_DS and L_GC are defined as:

$$\mathcal{L}_{DS} = e^{-\nabla x_r} \, \|\nabla \hat{y}_r\| \tag{2.6}$$

$$\mathcal{L}_{GC} = \eta \, \frac{1 - \mathrm{SSIM}(x_r, x'_{rr})}{2} + \mu \, \|x_r - x'_{rr}\|, \tag{2.7}$$

where ŷ_r represents the predicted depth for the real image and ∇ represents the first derivative. x_r is the left image in the KITTI dataset [30], and x'_rr is the image inverse-warped from the right counterpart of x_r based on the predicted depth ŷ_r. The KITTI dataset [30] provides the camera focal length and the baseline distance between the cameras. Similar to [27], we set η to 0.85 and µ to 0.15 in our experiments. The overall loss for the task network is defined as:

$$\mathcal{L}_T = \beta_1 \mathcal{L}_{DS} + \beta_2 \mathcal{L}_1 + \beta_3 \mathcal{L}_{GC}, \tag{2.8}$$

where β_1 = 0.01 and β_2 = β_3 = 100.

2.2.2.4 Face Normal Estimation

SfSNet [22] currently achieves the best performance on face normal estimation. We therefore follow its setup and apply "SfS-supervision" for both synthetic and real images during training:

$$\mathcal{L}_T = \lambda_{recon} \mathcal{L}_{recon} + \lambda_{N} \mathcal{L}_{N} + \lambda_{A} \mathcal{L}_{A} + \lambda_{light} \mathcal{L}_{light}, \tag{2.9}$$

where L_recon, L_N and L_A are L1 losses on the reconstructed image, normal and albedo, whereas L_light is the L2 loss over the 27-dimensional spherical harmonics coefficients. The supervision for real images comes from "pseudo labels", obtained by applying a pre-trained task network to the real images. Please refer to [22] for more details.

2.2.2.5 Overall loss

The overall loss used to train our geometry estimation pipeline is then defined as:

$$\mathcal{L} = \alpha_1 \mathcal{L}_{adv} + \alpha_2 \mathcal{L}_r + \alpha_3 \mathcal{L}_T, \tag{2.10}$$

where (α_1, α_2, α_3) = (1, 10, 1) for the monocular depth estimation task and (α_1, α_2, α_3) = (1, 10, 0.1) for the face normal estimation task.

2.3 Experiments

We apply our proposed SharinGAN to monocular depth estimation and face normal estimation. We discuss the details of the experiments in this section.

2.3.1 Monocular Depth Estimation

2.3.1.1 Datasets

Following [27], we use vKITTI [29] and KITTI [30] as the synthetic and real datasets to train our network. vKITTI contains 21,260 image-depth pairs, all of which are used for training. KITTI [30] provides 42,382 stereo pairs, among which 22,600 images are used for training and 888 for validation, as suggested by [27].

2.3.1.2 Implementation details

We use a generator G and a primary task network T whose architectures are identical to [27]. We pre-train the generative network G on both synthetic and real data using the reconstruction loss L_r. This results in an identity mapping that helps G keep as much of the input image's geometric information as possible; a minimal sketch of this warm-up step is shown below.
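This warm-up amounts to fitting G as an auto-encoder on images from both domains. In the sketch, the data loaders, optimizer choice and learning rate are illustrative assumptions, and both loaders are assumed to yield plain image batches.

```python
import itertools
import torch

def pretrain_generator(G, synthetic_loader, real_loader, num_steps, lr=1e-4, device="cuda"):
    """Warm-start G toward an identity mapping by minimizing the
    reconstruction loss L_r (Eq. 2.4) on both synthetic and real images."""
    G.to(device).train()
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    batches = zip(itertools.cycle(synthetic_loader), itertools.cycle(real_loader))
    for step, (x_s, x_r) in enumerate(batches):
        if step == num_steps:
            break
        x_s, x_r = x_s.to(device), x_r.to(device)
        # L_r = ||G(x_s) - x_s||^2 + ||G(x_r) - x_r||^2, averaged over the batch
        loss = ((G(x_s) - x_s) ** 2).mean() + ((G(x_r) - x_r) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G
```

Starting G near the identity means the subsequent adversarial training only has to move the task-relevant, domain-specific content into the shared space.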
Our task network is pre-trained using synthetic data with supervision. G and T are then trained end to end using Equation 2.10 for 150,000 iterations with a batch size of 2, by using an Adam optimizer with a learning rate of 1e − 5. The best model is selected based on the validation set of KITTI. 2.3.1.3 Results Table 2.1 shows the quantitative results on the eigen test split of the KITTI dataset for different methods on the MDE task. The proposed method outper- forms the previous unsupervised domain adaptation methods for MDE [26, 27] on almost all the metrics. Especially, compared with [27], we reduce the abso- lute error by 19.7% and 21.0% on 80m cap and 50m cap settings respectively. Moreover, the performance of our method is much closer to the methods in a supervised setting [35, 37, 63], which was trained on the real KITTI dataset with ground truth depth labels. Figure 2.3 visually compares the predicted depth map from the proposed method with [27]. We show three typical examples: near dis- tance, medium distance, and far distance. It shows that our proposed method performs much better for predicting depth at details. For instance, our predicted 20 Method Supervised Dataset Cap Error Metrics, lower is better Accuracy Metrics, higher is better Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.252 δ < 1.253 Eigen [35] Yes K 80m 0.203 1.548 6.307 0.282 0.702 0.890 0.958 Liu [37] Yes K 80m 0.202 1.614 6.523 0.275 0.678 0.895 0.965 All synthetic (baseline) No S 80m 0.253 2.303 6.953 0.328 0.635 0.856 0.937 All real (baseline) No K 80m 0.158 1.151 5.285 0.238 0.811 0.934 0.970 GASDA [27] No K+S 80m 0.149 1.003 4.995 0.227 0.824 0.941 0.973 SharinGAN (proposed) No K+S 80m 0.116 0.939 5.068 0.203 0.850 0.948 0.978 Kuznietsov [63] Yes K 50m 0.117 0.597 3.531 0.183 0.861 0.964 0.989 Garg [64] No K 50m 0.169 1.080 5.104 0.273 0.740 0.904 0.962 Godard [48] No K 50m 0.140 0.976 4.471 0.232 0.818 0.931 0.969 All synthetic (baseline) No S 50m 0.244 1.771 5.354 0.313 0.647 0.866 0.943 All real (baseline) No K 50m 0.151 0.856 4.043 0.227 0.824 0.940 0.973 Kundu [25] No K+S 50m 0.203 1.734 6.251 0.284 0.687 0.899 0.958 T2Net [26] No K+S 50m 0.168 1.199 4.674 0.243 0.772 0.912 0.966 GASDA [27] No K+S 50m 0.143 0.756 3.846 0.217 0.836 0.946 0.976 SharinGAN (proposed) No K+S 50m 0.109 0.673 3.77 0.190 0.864 0.954 0.981 Table 2.1: MDE Results on eigen test split of KITTI dataset [35] . For the training data, K: KITTI dataset and S: vKITTI dataset. Methods highlighted in light gray, use domain adaptation techniques and the non-highlighted rows correspond to supervised methods. depth map can better preserve the shape of the car (Figure 2.3 (a) and (c)) and the structure of the tree and the building behind it (Figure 2.3 (b)). This shows the advantage of our proposed SharinGAN compared with [27]. [27] learns to transfer real images to the synthetic domain and vice versa, which solves a much harder problem compared with SharinGAN, which removes a minimum of do- main specific information. As a result, the quality of the transformation for [27] may not be as good as the proposed method. Moreover, the unsupervised trans- formation cannot guarantee to keep the geometry information unchanged. 21 Input GT Depth GASDA [27] SharinGAN (Ours) (a) The second row shows the corresponding region in the red box of the first row. The depth of the faraway car is better estimated by SharinGAN than GASDA. (b) The second and third row shows the corresponding region in the green and red box of the first row. 
The depth of the tree to the left (green) and shrubs behind the tree in the right are better estimated by SharinGAN. (c) The second and third row shows the corresponding regions in the green and red boxes of the first row. The boundaries and the depth of the cars are better estimated by SharinGAN. Figure 2.3: Qualitative comparisons of SharinGAN with GASDA [27]. Ground truth (GT) has been interpolated (and the unavailable top regions are masked out) for visualization purposes. Note that in addition to various other aspects mentioned above, we are also able to remove the boundary artifacts present in the depth maps of GASDA. 22 (a) xr (b) xshr = G(xr ) (c) |xr − xshr | (d) xs (e) xshs = G(xs) (f) |xs − xshs | Figure 2.4: (a), (b) and (c) show real image xr , translated real image xshr and their differ- ence |xr − xshr | respectively. (d), (e) and (f) show synthetic image xs, translated synthetic image xshs and their difference |xs − xshs | respectively. To understand how our generative network G works, we show some exam- ples of synthetic and real images, their transformed versions, and the difference images in Figure 2.4. This shows that G mainly operates on edges. Since depth maps are mostly discontinuous at edges, they provide important cues for the geometry of the scene. On the other hand, due to the difference between the ge- ometry and material of objects around the edges, the rendering algorithm may find it hard to render realistic edges compared with other parts of the scene. As a result, most of the domain specific information related to geometry lies in the edges, on which SharinGAN correctly focuses. 2.3.1.4 Generalization to Make3D To demonstrate the generalization ability of the proposed method, we test our trained model on Make3D [31]. Note that we do not fine-tune our model using the data from Make3D. Table 2.2 shows the quantitative results of our method, which outperforms existing state-of-the-art methods by a large margin. Moreover, the performance of SharinGAN is more comparable to the super- 23 Method Trained Error Metrics, lower is better Abs Rel Sq Rel RMSE Karsh et al. [65] Yes 0.398 4.723 7.801 Laina et al. [66] Yes 0.198 1.665 5.461 Kundu et al. [25] Yes 0.452 5.71 9.559 Goddard et al. [67] No 0.505 10.172 10.936 Kundu et al. [25] No 0.647 12.341 11.567 Atapour et al. [28] No 0.423 9.343 9.002 T2Net [26] No 0.508 6.589 8.935 GASDA [27] No 0.403 6.709 10.424 SharinGAN (proposed) No 0.377 4.900 8.388 Table 2.2: MDE results on Make3D dataset [31]. Trained indicates whether the model is trained on Make3D or not. Errors are computed for depths less than 70m in a central image crop [67]. It can be concluded that our proposed method generalized better to an unseen dataset. vised methods. We further visually compare the proposed method with GASDA [27] in Figure 2.5. It is clear that the proposed depth map captures more details in the input images, reflecting more accurate depth prediction. 24 (a) Input Image (b) Ground Truth (c) GASDA [27] (d) SharinGAN Figure 2.5: Qualitative results on the test set of the Make3D dataset [31]. In the top row, some far tree structures that are missing in the depth map predicted by GASDA were better captured on using the SharinGAN module. For the bottom row, GASDA wrongly predicts the depth map of the houses behind the trees to be far, which is correctly cap- tured by the SharinGAN. 
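Before moving on to face normal estimation, we note that the error and accuracy metrics reported in Tables 2.1 and 2.2 follow the standard monocular-depth evaluation protocol of [35]. The snippet below is a small reference sketch of how these metrics are typically computed; it assumes the ground-truth and predicted depths have already been masked to valid pixels and capped at the chosen maximum depth, and it is not the exact evaluation code used for the tables.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard MDE metrics: Abs Rel, Sq Rel, RMSE, RMSE log and the
    delta < 1.25^k accuracies. `gt` and `pred` are 1-D arrays of positive,
    depth-capped values (e.g., capped at 50m, 70m or 80m)."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, d1=d1, d2=d2, d3=d3)
```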
2.3.2 Face Normal Estimation 2.3.2.1 Datasets We use the synthetic data provided by [22] and CelebA [68] as real data to train the SharinGAN for face normal estimation similar to [22]. Our trained model is then evaluated on the Photoface dataset [32]. 2.3.2.2 Implementation details We use the RBDN network [69] as our generator and SfSNet [22] as the primary task network. Similar to before, we pre-train the Generator on both 25 Algorithm MAE < 20◦ < 25◦ < 30◦ 3DMM [52] 26.3◦ 4.3% 56.1% 89.4% Pix2Vertex [70] 33.9◦ 24.8% 36.1% 47.6% SfSNet [22] 25.5◦ 43.6% 57.7% 68.7% SharinGAN (proposed) 24.0◦ 47.88% 61.53% 72.1% Table 2.3: Quantitative results for Face Normal estimation on the test split of Photoface dataset [32]. All the listed methods are not fine-tuned on Photoface. The metrics MAE: Mean Angular Error and < 20◦,25◦,30◦ refer to the normals prediction accuracy for dif- ferent thresholds. synthetic and real data using reconstruction loss and pre-train the primary task network on just synthetic data in a supervised manner. Then, we train G and T end-to-end using the overall loss (2.10) for 120,000 iterations. We use a batch size of 16 and a learning rate of 1e − 4. The best model is selected based on the validation set of Photoface [32]. 2.3.2.3 Results Table 2.3 shows the quantitative performance of the estimated surface nor- mals by our method on the test split of the Photoface dataset. With the proposed SharinGAN module, we were able to significantly improve over SfSNet on all the metrics. In particular, we were able to significantly reduce the mean angular error metric by roughly 1.5◦. Additionally, Figure 2.6 depicts the qualitative comparison of our method 26 (a) Input Image (b) GT (c) SfSNet [22] (d) SharinGAN Figure 2.6: Qualitative comparisons of our method with SfSNet on the examples from the test set of Photoface dataset [32]. Our method generalizes much better to unseen data during training. with SfSNet on the test split of Photoface. Both SfSNet and our pipeline are not finetuned on this dataset, and yet we were able to generalize better com- pared to SfSNet. This demonstrates the generalization capacity of the proposed SharinGAN to unseen data in training. Finally, Figure 2.7 depicts the qualitative results of our method on the CelebA [68] and Synthetic [22] datasets. The translated images corresponding to syn- thetic and real images look similar in contrast to the MDE task (Figure 2.4). We suppose that for the task of MDE, regions such as edges are domain specific, and yet hold primary task related information such as depth cues, which is why 27 Input, xs xshs = G(xs) Normal Albedo Shading Reconstruction (a) Qualitative results of our method on CelebA testset [68]. Input, xr xshr = G(xr ) Normal Albedo Shading Reconstruction (b) Qualitative results of our method on the synthetic data used in SfSNet [22]. Figure 2.7: Qualitative results of our method on face normal estimation task. The trans- lated images xshr ,xshs look reasonably similar for our task which additionally predicts albedo, lighting, shading and Reconstructed image along with the face normal. 28 SharinGAN modifies such regions. However, for the task of FNE, we additionally predict albedo, lighting, shading and a reconstructed image along with estimat- ing normals. This means that the primary network needs a lot of shared infor- mation across domains for good generalization to real data. 
Thus the SharinGAN module seems to bring everything into a shared space, making the translated images {xshr ,xshs } look visually similar. Lighting Estimation The primary network estimates not only face normals but also lighting. We also evaluate this. Following a similar evaluation protocol as that of [22], Table 2.4 summarizes the light classification accuracy on the Mul- tiPIE dataset [71]. Since we do not have the exact cropped dataset that [22] used, we used our own cropping and resizing on the original MultiPIE data: centercrop 300x300 and resize to 128x128. For a fair comparison, we used the same dataset to re-evaluate the lighting performance for [22] and reported the results in Table 2.4. Our method not only outperforms [22] on the face normal estimation, but also on lighting estimation. Algorithm top-1% top-2% top-3% SfSNet [22] 80.25 92.99 96.55 SharinGAN 81.83 93.88 96.69 Table 2.4: Light classification accuracy on MultiPIE dataset [71]. Training with the pro- posed SharinGAN also improves lighting estimation along with face normals. 29 2.3.3 Ablation studies We carried out our ablation study using the KITTI and Make3D datasets on monocular depth estimation. We study the role of the SharinGAN module by removing it and training a primary network on the original synthetic and real data using (2.8). We observe that the performance drops significantly as shown in Table 2.5 and Table 2.6. This shows the importance of the SharinGAN module that helps train the primary task network efficiently. To demonstrate the role of reconstruction loss, we remove it and train our whole pipeline α1Ladv + α3LT . We show the results on the testset of KITTI in the second row of Table 2.5 and on the testset of Make3D in the second row of Table 2.6. For both the testsets, we can see the performance drop compared to our full model. Although the drop is smaller in the case of KITTI, it can be seen that the drop is significant for Make3D dataset that is unseen during training. This signifies the importance of reconstruction loss to generalize well to a domain not seen during training. Components Cap Error Metrics, lower is better Accuracy Metrics, higher is better SharinGAN Reconstruction loss Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.252 δ < 1.253 x x 50m 0.137 0.804 4.12 0.210 0.816 0.940 0.978 ✓ x 50m 0.1113 0.6705 3.80 0.192 0.861 0.954 0.980 ✓ ✓ 50m 0.109 0.673 3.77 0.190 0.864 0.954 0.981 Table 2.5: Ablation study for monocular depth estimation to understand the role of the SharinGAN module and Reconstruction loss. We need both to get the best performance for this task. 30 Components Cap Error Metrics, lower is better SharinGAN Reconstruction loss Abs Rel Sq Rel RMSE x x 70m 0.476 8.058 9.449 ✓ x 70m 0.401 5.318 8.377 ✓ ✓ 70m 0.377 4.900 8.388 Table 2.6: Ablation study for monocular depth estimation to understand the role of the SharinGAN module and Reconstruction loss on the Make3D test dataset. We need both to get the best performance for this task. 2.4 Summary Our primary motivation is to simplify the process of combining synthetic and real images in training. Prior approaches often pick one domain and try to map images into it from the other domain. Instead, we train a generator to map all images into a new, shared domain. In doing this, we note that in the new domain, the images need not be indistinguishable to the human eye, only to the network that performs the primary task. 
2.4 Summary

Our primary motivation is to simplify the process of combining synthetic and real images in training. Prior approaches often pick one domain and try to map images into it from the other domain. Instead, we train a generator to map all images into a new, shared domain. In doing this, we note that in the new domain, the images need not be indistinguishable to the human eye, only to the network that performs the primary task. The primary network will learn to ignore extraneous, domain-specific information that is retained in the shared domain. To achieve this, we propose a simple network architecture that rests on our new SharinGAN, which maps both real and synthetic images to a shared domain. The resulting images retain domain-specific details that do not prevent the primary network from effectively combining training data from both domains. We demonstrate this by achieving significant improvements over state-of-the-art approaches in two important applications: surface normal estimation for faces and monocular depth estimation for outdoor scenes. Finally, our ablation studies demonstrate the significance of the proposed SharinGAN in effectively combining synthetic and real data.

Chapter 3: LDMs for Text-Based Image Segmentation¹

¹ Work done with Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, and David Jacobs. Accepted [72] as an oral presentation at ICCV 2023.

Teaching neural networks to accurately find the boundaries of objects is hard, and annotating boundaries at internet scale is impractical. Moreover, most self-supervised or weakly supervised objectives do not incentivize learning boundaries. For example, training on classification or captioning allows models to learn the most discriminative parts of the image without focusing on boundaries [73, 74]. Our insight is that Latent Diffusion Models (LDMs) [18], which can be trained without object-level supervision at internet scale, must attend to object boundaries, and so we hypothesize that they can learn features useful for open-world image segmentation. We support this hypothesis by showing that LDMs can improve performance on this task by up to 6% compared to standard baselines, and these gains are further amplified when LDM-based segmentation models are applied to AI-generated images.

Figure 3.1: Coarse segmentation results from an LDM (at t = 400) for two distinct images, with the prompts "A picture of an astronaut" and "A picture of the Stonehenge" versus the NULL (unconditional) prompt, demonstrating the encoding of fine-grained object-level semantic information within the model's internal features.

To test this hypothesis about the presence of object-level semantic information inside a pretrained LDM, we conduct a simple experiment. We compute the pixel-wise norm between the unconditional and text-conditional noise estimates from a pretrained LDM as part of the reverse diffusion process. This computation identifies the spatial locations that need to be modified for the noised input to align better with the corresponding text condition. Hence, the magnitude of the pixel-wise norm highlights regions associated with the text prompt. As shown in Figure 3.1, the pixel-wise norm represents a coarse segmentation of the subject, even though the LDM is not trained on this task. This clearly demonstrates that large-scale LDMs can not only generate visually pleasing images, but that their internal representations encode fine-grained semantic information that can be useful for tasks like segmentation.
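For concreteness, the probe just described can be written in a few lines of PyTorch. The sketch below uses the HuggingFace diffusers API rather than the CompVis stable-diffusion codebase used in this work; the 512x512 resize and the 0.18215 latent scaling are standard for Stable Diffusion v1 but are assumptions of this sketch, and only the pixel-wise norm of the difference between the conditional and unconditional noise estimates (at t = 400, as in Figure 3.1) corresponds to the experiment described above.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

@torch.no_grad()
def coarse_mask(image: Image.Image, prompt: str, t: int = 400) -> np.ndarray:
    # Encode the image into the LDM latent space z (0.18215 is the SD v1 scaling factor).
    x = torch.from_numpy(np.array(image.convert("RGB").resize((512, 512)))).float()
    x = (x / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0).to(device)
    z = pipe.vae.encode(x).latent_dist.mean * 0.18215

    # Forward-diffuse z to timestep t.
    t_tensor = torch.tensor([t], device=device)
    z_t = pipe.scheduler.add_noise(z, torch.randn_like(z), t_tensor)

    # CLIP text embeddings for the prompt and for the empty (unconditional) prompt.
    def embed(p: str) -> torch.Tensor:
        tokens = pipe.tokenizer(p, padding="max_length",
                                max_length=pipe.tokenizer.model_max_length,
                                truncation=True, return_tensors="pt").input_ids.to(device)
        return pipe.text_encoder(tokens)[0]

    eps_cond = pipe.unet(z_t, t_tensor, encoder_hidden_states=embed(prompt)).sample
    eps_uncond = pipe.unet(z_t, t_tensor, encoder_hidden_states=embed("")).sample

    # The per-location norm of the difference highlights regions tied to the prompt.
    return (eps_cond - eps_uncond).norm(dim=1).squeeze(0).cpu().numpy()
```

The returned map lives at the latent resolution (64x64 for a 512x512 input); upsampling it to the image size yields coarse segmentations of the kind visualized in Figure 3.1.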
Recently, text-based image segmentation has gained traction for creating and editing AI-generated content (such as AI art, illustrations, and cartoons) in image inpainting workflows², as it provides a conversational interface. Since the latent space z [75] is extracted by a VQGAN trained on several domains such as art, cartoons, illustrations, and real photographs, we posit that it is a more robust input representation for text-based segmentation on AI-generated images. Furthermore, the internal layers of the LDM are responsible for generating the structure of the image and hence contain rich semantic information about objects. Soft masks from these layers have also been used as a latent input in recent work on image editing [76, 77]. Since this information is already present while generating the image, we propose an architecture in the form of LD-ZNet (shown in Figure 3.3) to decode it and obtain the semantic boundaries of objects generated in the scene. Not only does our architecture benefit segmentation of objects in AI-generated images, it also improves performance on natural images. Overall, our contributions are as follows:

• We propose a text-based segmentation architecture, ZNet, that operates on the compressed latent space z of the LDM.

• Next, we study the internal representations at different stages of pretrained LDMs and show that they are useful for text-based image segmentation.

• Finally, we propose a novel approach named LD-ZNet to incorporate the visual-linguistic latent diffusion features from a pretrained LDM and show improvements across several metrics and domains for text-based image segmentation.

² imaginAIry (https://github.com/brycedrennan/imaginAIry) and stable-diffusion-webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui).

3.1 Related work

3.1.1 Text-based image segmentation

Text-based image segmentation is the general task of segmenting specific regions in an image based on a text prompt. It differs from the referring expression segmentation (RES) task, which aims to extract instance-level segmentations of objects identified by distinctive referring expressions. While RES helps robotics applications that require localizing a single object in an image, text-based segmentation benefits image editing applications by also being able to segment 1) "stuff" categories (clouds/ocean/beach) and 2) multiple instances of an object category matching the text prompt. Nevertheless, the two tasks share some literature in terms of approaches. Preliminary works [78–82] focused on multi-modal feature fusion between language and visual representations obtained from recurrent networks (such as LSTMs) and CNNs, respectively. Subsequent works [83–86] included variations of multi-modal training, attention, and cross-attention networks. Recently, [85, 87] used CLIP [88] to extract visual-linguistic features of the image and the reference text separately; these features were then combined using a transformer-based decoder to predict a binary mask. Alternatively, [89, 90] proposed vision-language pretraining on other text-based visual recognition tasks (object detection and phrase grounding), followed by fine-tuning for the segmentation task. The concurrent works segment-anything (SAM) [91] and segment-everything-everywhere-all-at-once (SEEM) [92] allow interactive segmentation via point clicks, bounding boxes, and text inputs, demonstrating good zero-shot performance. Different from all these works, we show the significance of using the latent space and the internal features from a pretrained latent diffusion model [18] for improving the more generic text-based image segmentation task.
3.1.2 Text-to-Image synthesis

Text-to-image synthesis was initially explored using GANs [39, 93–97] on publicly available image captioning datasets. Another line of work uses autoregressive models [98–100] via a two-stage approach: the first stage is a vector-quantized autoencoder, such as a VQVAE [101, 102] or a VQGAN [75], trained with an image reconstruction objective to convert an image into a shorter sequence of discrete tokens. This low-dimensional latent space enables the training of compute-intensive autoregressive models even for high-resolution text-to-image synthesis. With recent advancements in diffusion models (DMs) [16, 17], both in unconditional and class-conditional settings, they have started gaining more traction than GANs, and their success on text-to-image tasks [103, 104] made them even more popular. However, earlier diffusion models operated in the high-dimensional image space, which made training and inference computationally intensive. Subsequently, latent-space representations [18, 105–107] were proposed for high-resolution text-to-image synthesis to reduce the heavy compute demands. More specifically, the latent diffusion model (LDM) [18] mitigates this problem by relying on a perceptually compressed latent space produced by a powerful first-stage autoencoder. Moreover, it employs a convolutional UNet [108] as the denoising architecture, allowing latent spaces of different sizes as input. Recently, this architecture was trained on large-scale text-image data [109] from the internet and released as Stable-diffusion³, which exhibits photo-realistic image generation. Subsequently, several language-guided image editing applications, such as inpainting [110–112] and text-guided image editing [77, 113], became popular, and the use of text-based image segmentation has surged, especially for AI-generated images. We propose a solution for text-based image segmentation that leverages the features already present as part of the synthesis process.

³ https://github.com/CompVis/stable-diffusion

3.1.3 Semantics in generative models

Semantics in generative models such as GANs have been studied for binary segmentation [114, 115] as well as multi-class segmentation [3, 4, 116], where the intermediate features have been shown to contain semantic information for these tasks. Moreover, [117] highlighted practical advantages of these representations, such as out-of-distribution robustness. However, earlier generative models (GANs) received less attention as representation learners than alternative unsupervised methods [118], because of the difficulty of training them on complex, diverse, and large-scale datasets. Diffusion models [16], on the other hand, are another class of powerful generative models that recently outperformed GANs on image synthesis [17] and can be trained on large datasets such as ImageNet [119] or LAION [109]. In [5], the authors demonstrated that the internal features of a pretrained diffusion model are effective for semantic segmentation. However, this type of analysis [4, 5] has mostly been done in limited settings like few-shot learning [120] or limited domains like faces [121], horses [122], or cars [122]. Different from these works, we analyze the visual-linguistic semantic information present in the internal features of a text-to-image LDM [18] for text-based image segmentation, which is an open-world visual recognition task.
Furthermore, we leverage these LDM features and show performance improvements when training with full datasets instead of few-shot settings.

3.2 LDMs for Text-Based Segmentation

The text-to-image latent diffusion architecture introduced in [18] consists of two stages: 1) an autoencoder-based VQGAN [75] that extracts a compressed latent representation z for a given image, and 2) a diffusion UNet that is trained to denoise the noisy z created in the forward diffusion process, conditioned on text features. These text features are obtained from a pretrained, frozen CLIP text encoder [88] and are injected at multiple layers of the UNet via cross-attention.

In this chapter, we show performance improvements on the text-based segmentation task in two steps. First, we analyze the compressed latent space z from the first stage and propose an approach named ZNet that uses z as the visual input to estimate the segmentation mask when conditioned on a text prompt. Second, we study the internal representations from the second stage of the stable-diffusion LDM for visual-linguistic semantic information and propose a way to utilize them inside ZNet for further improvements on the segmentation task. We name this approach LD-ZNet.

Figure 3.2: Reconstructions from the first stage of the LDM (encoder E → latent z → decoder D). Given an input image, the latent representation z generated by the encoder can be used to reconstruct images that are perceptually indistinguishable from the inputs. The high quality of these reconstructions suggests that the latent representation z preserves most of the semantic information present in the input images.

3.2.1 ZNet: Leveraging Latent Space Features

We observe that the latent space z from the first stage of the LDM is a compressed representation of the image that preserves semantic information, as depicted in Figure 3.2. The VQGAN in the first stage achieves such semantics-preserving compression with the help of large-scale training data as well as a combination of losses: a perceptual loss [123], a patch-based [124] adversarial objective [75, 125, 126], and a KL-regularization loss.

Figure 3.3: Overview of the proposed ZNet and LD-ZNet architectures. We propose to use the compressed latent representation z as input for our segmentation network ZNet. Next, we propose LD-ZNet, which incorporates the latent diffusion features from various intermediate blocks of the LDM's denoising UNet into ZNet.

In our experiments, we observe that this compressed latent representation z is more robust than the original image in terms of its association with text prompts. We believe this is because z is an H/8 × W/8 × 4 dimensional feature with 48× fewer elements than the original image, while preserving the semantic information. Several prior works [127–129] show that compression techniques like PCA, which create information-preserving lower-dimensional representations, generalize better. Therefore, we propose using the z representation along with the frozen CLIP text features [88] as the input to our segmentation network. Furthermore, because the VQGAN is trained across several domains like art, cartoons, illustrations, portraits, etc., it learns a robust and compact representation that generalizes better across domains, as can be seen in our experiments on AI-generated images. We call this approach ZNet. The architecture of ZNet is shown in the bottom box of Figure 3.3; it is the same as the denoising UNet module of the LDM, and we therefore initialize it with the pretrained weights of the second stage of the LDM.
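As an illustration, the snippet below sketches one plausible way to instantiate a ZNet-style network with the diffusers API: reuse the LDM's denoising UNet, initialized from the pretrained second stage, and attach a lightweight mask head. The 1x1-convolution head, the fixed timestep, and the bilinear upsampling to the output resolution are assumptions of this sketch, not the exact design used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

class ZNetSketch(nn.Module):
    """Segmentation network operating on the LDM latent z and CLIP text features."""

    def __init__(self, out_size: int = 384):
        super().__init__()
        # Same architecture as the LDM's denoising UNet, initialized from its weights.
        self.unet = UNet2DConditionModel.from_pretrained(
            "CompVis/stable-diffusion-v1-4", subfolder="unet")
        # Hypothetical 1-channel mask head on top of the UNet's 4-channel output.
        self.mask_head = nn.Conv2d(4, 1, kernel_size=1)
        self.out_size = out_size

    def forward(self, z: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # z: (B, 4, H/8, W/8) latent from the frozen first-stage VQGAN encoder.
        # text_features: (B, 77, 768) features from the frozen CLIP text encoder.
        t = torch.zeros(z.shape[0], dtype=torch.long, device=z.device)  # assumed fixed timestep
        feats = self.unet(z, t, encoder_hidden_states=text_features).sample
        logits = self.mask_head(feats)
        return F.interpolate(logits, size=self.out_size, mode="bilinear", align_corners=False)
```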
3.2.2 LD-ZNet: Leveraging Diffusion Features

Given a text prompt and a timestep t, the second stage of the LDM is trained to denoise z_t, a noisy version of the latent representation z obtained via the forward diffusion process for t timesteps. A UNet architecture is used, whose encoder/decoder elements are shown in Figure 3.3 (top right). A typical encoder/decoder block contains a residual layer followed by a spatial-attention module that internally applies self-attention and then cross-attention with the text features. We analyze the semantic information in the internal visual-linguistic representations developed at different blocks of the encoder and decoder, right after these spatial-attention modules. We also propose a way to utilize these latent diffusion features via cross-attention into the ZNet segmentation network, and we call the final model LD-ZNet.

Figure 3.4: Semantic information present in the LDM features at various blocks and timesteps for the referring image segmentation task (validation AP versus timestep, with one curve per block in {2, 4, 6, 7, 8, 10, 12, 14, 16}). AP is measured on a small validation subset of the PhraseCut dataset.

3.2.2.1 Visual-Linguistic Information in LDM Features

We evaluate the semantic information present in the pretrained LDM at various blocks and timesteps for the text-based image segmentation task. In this experiment, we consider the latent diffusion features right after spatial-attention layers 1–16, spanning all the encoder and decoder blocks present in the UNet. At each block, we analyze the features for every 100th timestep in the range [100, 1000]. We use a small subset of the training and validation sets from the PhraseCut dataset and train a simple decoder on top of these features to predict the associated binary mask. Specifically, given an image I and timestep t, we first extract its latent representation z from the first stage of the LDM and add noise from the forward diffusion process to obtain z_t for timestep t. Next, we extract the frozen CLIP text features for the text prompt and feed both into the denoising UNet of the LDM to extract the internal visual-linguistic features at all the blocks for that timestep. We use these representations to train the corresponding decoders until convergence. Finally, we evaluate the AP metric on a small subset of the validation set.

The performance of features from different blocks and timesteps is shown in Figure 3.4. Similar to [5], we observe that the middle blocks {6, 7, 8, 9, 10} of the UNet contain more semantic information than either the early blocks of the encoder or the later blocks of the decoder. We also observe that, for these middle blocks, timesteps 300–500 contain the most visual-linguistic semantic information. This is in contrast to the findings of [5], which report that timesteps {50, 150, 250} contain the most useful information when evaluated on an unconditional DDPM for few-shot semantic segmentation of horses [122] and faces [121]. We believe the reason for this difference is that, in our case, the image synthesis is guided by text, leading to the emergence of semantic information earlier in the reverse diffusion process (t = 1000 → 0), in contrast to unconditional image synthesis.
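The probing setup described above can be sketched as follows: run a single forward pass of the LDM's UNet at a chosen timestep and capture the outputs of its spatial-attention (transformer) modules with forward hooks. This is a hedged illustration using the diffusers UNet; indexing the blocks by the enumeration order of the transformer modules and the use of forward hooks are conveniences of this sketch rather than a description of our training code, and the lightweight per-block decoders trained on these features are omitted.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet").eval()

@torch.no_grad()
def spatial_attention_features(z_t, t, text_features):
    """Collect per-block features right after each spatial-attention module."""
    feats, hooks = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            # The module may return a tuple or an output object with a .sample field.
            out = output[0] if isinstance(output, tuple) else getattr(output, "sample", output)
            feats[idx] = out.detach()
        return hook

    # The spatial-attention modules are the Transformer2DModel blocks of the UNet.
    blocks = [m for m in unet.modules() if m.__class__.__name__ == "Transformer2DModel"]
    for idx, block in enumerate(blocks, start=1):
        hooks.append(block.register_forward_hook(make_hook(idx)))

    unet(z_t, t, encoder_hidden_states=text_features)
    for h in hooks:
        h.remove()
    return feats  # e.g. feats[6] ... feats[10] correspond to the most semantically rich blocks
```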
3.2.2.2 LD-ZNet Architecture

We propose injecting the aforementioned visual-linguistic representations from multiple spatial-attention modules of the pretrained LDM into ZNet, as shown in Figure 3.3. These latent diffusion features are injected into ZNet via a cross-attention mechanism at the corresponding spatial-attention modules, as shown in Figure 3.5. This allows an interaction between the visual-linguistic representations of ZNet and those of the LDM. Specifically, we pass the latent diffusion features through an attention pool layer that not only acts as a learnable layer to match the range of the features participating in the cross-attention, but also adds a positional encoding to the pixels of the LDM representations. The outputs of the attention pool are thus positionally encoded visual-linguistic representations that enable the proposed cross-attention mechanism to attend to the corresponding pixels of the ZNet features. ZNet augmented with these latent diffusion features from the LDM (through cross-attention) is referred to as LD-ZNet.

Figure 3.5: We incorporate the visual-linguistic representations obtained at the spatial-attention modules of the LDM into the corresponding spatial-attention modules of ZNet via a cross-attention mechanism, after passing them through an attention pool layer.

Following the semantic analysis of latent diffusion features (Sec. 3.2.2.1), we incorporate the internal features from blocks {6, 7, 8, 9, 10} of the LDM into the corresponding blocks of ZNet, in order to make use of the most semantic and diverse visual-linguistic information from the LDM. For AI-generated images, these blocks are in any case responsible for generating the final image; using LD-ZNet, we tap into this information, which can then be used for segmenting objects in the scene.
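A minimal PyTorch sketch of this injection mechanism is shown below. The single learnable projection standing in for the attention pool, the learned positional embedding, the head count, and the residual form of the fusion are illustrative assumptions; the sketch is only meant to convey how attention-pooled, positionally encoded LDM features serve as keys and values for a cross-attention that is queried by the ZNet features.

```python
import torch
import torch.nn as nn

class LDMFeatureInjection(nn.Module):
    """Fuse LDM spatial-attention features into ZNet features via cross-attention."""

    def __init__(self, znet_dim: int, ldm_dim: int, num_tokens: int, heads: int = 8):
        super().__init__()
        # Attention-pool stand-in: a learnable projection plus a positional encoding
        # for the (flattened) LDM feature map.
        self.pool_proj = nn.Linear(ldm_dim, znet_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, znet_dim))
        self.cross_attn = nn.MultiheadAttention(znet_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(znet_dim)

    def forward(self, znet_feats: torch.Tensor, ldm_feats: torch.Tensor) -> torch.Tensor:
        # znet_feats: (B, N, C) flattened ZNet spatial-attention features (queries).
        # ldm_feats:  (B, M, C_ldm) flattened LDM features from the matching block.
        kv = self.pool_proj(ldm_feats) + self.pos_embed       # positionally encoded keys/values
        attended, _ = self.cross_attn(query=self.norm(znet_feats), key=kv, value=kv)
        return znet_feats + attended                          # residual fusion into ZNet
```

In LD-ZNet, one such module sits at each of blocks {6, 7, 8, 9, 10} of ZNet, receiving the LDM features from the matching block.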
3.3 Experiments

Implementation details: We use the stable-diffusion v1.4 checkpoint as our LDM, which internally uses the frozen ViT-L/14 CLIP text encoder [88]. We implement the ZNet and LD-ZNet described above in PyTorch inside the stable-diffusion library. We initialize our networks with the weights from the LDM wherever possible, while initializing the remaining parameters from a normal distribution. We train ZNet and LD-ZNet on 8 NVIDIA A100 GPUs with a batch size of 4, using the Adam optimizer and a base learning rate of 5e-7 per mini-batch sample per GPU. For all our experiments, we keep the text encoder frozen and use an image resolution of 384 for a fair comparison with previous works.

Datasets: We use PhraseCut [130], currently the largest dataset for the text-based image segmentation task, with nearly 340K phrases and corresponding segmentation masks that not only include annotations for stuff classes but also accommodate multiple instances. Following [88], we randomly augment the phrases from a fixed set of prefixes. For the images, we randomly crop a square around the object of interest with maximum area, ensuring that the object remains at least partially visible. We avoid negative samples to remove ambiguity in the LDM features for non-existent objects.

We create a dataset of AI-generated images, which we name the AIGI dataset, to showcase the usefulness of our approach for text-based segmentation on a different domain. We use 100 AI-generated images from lexica.art (https://lexica.art) and manually annotate multiple regions for 214 text prompts relevant to these images. Figure 3.6 depicts some of the images from the AIGI dataset along with their annotated labels and categorical captions.

Figure 3.6: Samples from the AIGI dataset along with annotated labels and categorical captions (examples include Bicycle; Teddy Bear; Bald Eagle; Fish; Young Adult Female; Zoologist; Tsunami; Giant Wave; Water; Vehicles; Cars).

We also use the popular referring expression segmentation datasets, namely RefCOCO [131], RefCOCO+ [131], and G-Ref [132], to demonstrate the generalization abilities of ZNet and LD-ZNet. In RefCOCO, each image contains two or more objects, and each expression has an average length of 3.6 words. RefCOCO+ is derived from RefCOCO by excluding certain absolute-location words and focuses on purely appearance-based descriptions. For example, it uses "the man in the yellow polka-dotted shirt" rather than "the second man from the left", which makes it more challenging. Unlike RefCOCO and RefCOCO+, the average sentence length in G-Ref is 8.4 words, with more words about locations and appearances. We adopt the UNC partition for RefCOCO and RefCOCO+ and the UMD partition for G-Ref.

Metrics: We follow the evaluation methodology of [87] and report the best foreground IoU (IoUFG) over the foreground pixels, the best mean IoU over all pixels (mIoU), and the Average Precision (AP).
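The sketch below illustrates how these metrics can be computed from predicted mask probabilities under our reading of [87]'s protocol, where the "best" IoU values are taken over a sweep of binarization thresholds and AP is computed over all pixels. The threshold grid and the use of scikit-learn for AP are choices of this sketch, not necessarily those of the official evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(probs: np.ndarray, gts: np.ndarray, thresholds=np.linspace(0.05, 0.95, 19)):
    """probs, gts: (N, H, W) arrays of predicted probabilities and binary ground truth."""
    best_miou, best_fg_iou = 0.0, 0.0
    for thr in thresholds:
        preds = probs >= thr
        inter = np.logical_and(preds, gts).sum(axis=(1, 2))
        union = np.logical_or(preds, gts).sum(axis=(1, 2))
        fg_iou = inter.sum() / max(union.sum(), 1)      # IoU aggregated over foreground pixels
        miou = np.mean(inter / np.maximum(union, 1))    # mean IoU over samples
        best_fg_iou, best_miou = max(best_fg_iou, fg_iou), max(best_miou, miou)
    ap = average_precision_score(gts.reshape(-1), probs.reshape(-1))
    return {"mIoU": best_miou, "IoUFG": best_fg_iou, "AP": ap}
```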
3.4 Results

3.4.1 Image Segmentation Using Text Prompts

On the PhraseCut dataset, we compare the performance of previous approaches with our ZNet and LD-ZNet for the text-based image segmentation task (Table 3.1). To showcase the improvement from our proposed networks, we create a baseline named RGBNet with the same architecture as ZNet, except that we use the original images as input instead of the latent space z. For RGBNet, we use additional learnable convolutional layers to map the original image to the input resolution of ZNet.

Method                      mIoU   IoUFG   AP
MDETR [89]                  53.7   -       -
GLIPv2-T [90]               59.4   -       -
RMI [130]                   21.1   42.5    -
Mask-RCNN Top [130]         39.4   47.4    -
HulaNet [130]               41.3   50.8    -
CLIPSeg (PC+) [87]          43.4   54.7    76.7
CLIPSeg (PC, D=128) [87]    48.2   56.5    78.2
RGBNet                      46.7   56.2    77.2
ZNet (Ours)                 51.3   59.0    78.7
LD-ZNet (Ours)              52.7   60.0    78.9

Table 3.1: Text-based image segmentation performance on the PhraseCut test set. The performance of ZNet and LD-ZNet is highlighted in gray. Both models outperform the RGBNet baseline on all metrics.

From Table 3.1, we observe that ZNet and LD-ZNet significantly outperform RGBNet. Specifically, the performance improvement from using the latent representation z over the original images is clear (ZNet vs. the RGBNet baseline). Performance further improves upon incorporating the LDM visual-linguistic representations (LD-ZNet), by 6% overall on the mIoU metric compared to RGBNet. We also highlight this qualitatively in Figure 3.7, which shows the original image and the GT mask along with outputs from the RGBNet baseline, ZNet, and LD-ZNet; both ZNet and LD-ZNet improve results consistently. For example, in the top row, RGBNet detects light fixtures for the "hanging clock" prompt; ZNet has weaker activations for these incorrect detections, and LD-ZNet correctly segments the clock. Similarly, in the bottom row, RGBNet gets the "castle" completely wrong, whereas ZNet has activations on the right buildings, though with lower confidence, and LD-ZNet improves it further.

Figure 3.7: Qualitative comparison on the PhraseCut test set (columns: input, GT mask, RGBNet, ZNet, LD-ZNet). Each row contains an input image and a text prompt, with the goal of segmenting the image regions corresponding to the reference text. The text prompts are "hanging clock" and "castle" for the top and bottom rows. ZNet and LD-ZNet show improvements compared to RGBNet.

We outperform all previous works on all metrics, except MDETR [89] and GLIPv2 [90]. Notably, these works are pre-trained on detection and phrase grounding to predict bounding boxes on huge corpora of text-image pairs across various publicly available datasets with bounding-box annotations, and are later fine-tuned on PhraseCut for the segmentation task. Our work, in contrast, is orthogonally focused on exploring and utilizing LDMs and their internal features to improve text-based segmentation performance. Note that object detection datasets overlap well with the visual content in PhraseCut, but they are not representative of the diversity of images available on the internet. For example, while such methods could learn common concepts like sky, ocean, chair, table, and their synonyms, methods like MDETR would not understand concepts like Mickey Mouse or Pikachu, as we will show in Section 3.5.

3.4.2 Generalization to AI Generated Images

With the growing popularity of AI-generated images, text-based image segmentation is being used extensively by content creators in their daily workflows. Many public libraries, such as imaginAIry and stable-diffusion-webui, widely employ methods such as CLIPSeg [87] for performing segmentation on AI-generated images, so we study the generalization ability of our proposed segmentation approach on AI-generated images. To this end, we first prepare a dataset of 100 AI-generated images from lexica.art and manually annotate them using 214 text prompts. We name this dataset AIGI and release it on our project website (https://koutilya-pnvr.github.io/LD-ZNet/) for future research. Next, we evaluate our approaches ZNet and LD-ZNet along with our RGBNet baseline and other text-based segmentation methods: CLIPSeg (PC+) [87], MDETR [89], and SEEM [92].
GLIPv2 and the SAM model [91] with textual input were not publicly available for us to evaluate at the time of this work. All these methods except SEEM are trained on the PhraseCut dataset, and we report the metrics in Table 3.2.

Method               mIoU   AP
MDETR [89]           53.4   63.8
CLIPSeg (PC+) [87]   56.4   79.0
SEEM [92]            57.4   70.0
RGBNet               63.4   84.1
ZNet (Ours)          68.4   85.0
LD-ZNet (Ours)       74.1   89.6

Table 3.2: Generalization of the proposed LD-ZNet to our AIGI dataset, compared with other state-of-the-art text-based segmentation methods.

RGBNet outperforms CLIPSeg, MDETR, and SEEM because it is built on the UNet architecture initialized from the LDM weights, which contain semantic information that supports good generalization. Our methods ZNet and LD-ZNet further improve the generalization to these AI-generated images, by more than 20% compared to MDETR. This is largely due to the robust z-space of the LDM, which results from VQGAN pre-training on a variety of domains like art, cartoons, and illustrations. Furthermore, the latent diffusion features, which contain useful semantic information for the synthesis task, also help in segmenting AI-generated images. We show a qualitative comparison of these methods in Figure 3.8 for four AI-generated images from our dataset. While CLIPSeg can estimate the most distinctive regions, such as the face of Mickey Mouse or the rough locations of the Goblin, Ramen, and animals, MDETR and SEEM segment them incorrectly, because these concepts are unknown to MDETR and because of the domain gap between SEEM's training data and AIGI images. In both cases, our proposed LD-ZNet estimates accurate segmentations. More qualitative results for LD-ZNet on images from the AIGI dataset are shown in Figures 3.9 and 3.10.

Figure 3.8: Qualitative comparison on AI-generated images for text-based segmentation (columns: input, MDETR [89], CLIPSeg [87], SEEM [92], LD-ZNet). The text prompts are "Mickey mouse", "Goblin", "Ramen", and "animals", respectively.

Figure 3.9: More qualitative comparisons on AI-generated images from the AIGI dataset for text-based segmentation (columns: input, MDETR [89], CLIPSeg [87], SEEM [92], LD-ZNet). The text prompts are "Spiderman", "tortoise", "vespa", and "robot", respectively.

Figure 3.10: More qualitative results of LD-ZNet on the AIGI dataset, for the prompts "Hoodie", "Spiderman", "Owl", "Trump", "Pikachu", "Joker", "Godzilla", and "Eiffel".

3.4.3 Generalization to Referring Expressions

The referring expression segmentation task is aimed at robot-localization types of applications, where instance-level segmentation is performed through distinctive referring expressions. Many works such as [85, 86] also train the text encoder to learn complex positional references in the text. In contrast, we focus on generic text-based segmentation that supports stuff categories as well as multiple instances. We study the generalization ability of the proposed approach of using LDM features to this more complex task. Specifically, we use the models trained on the PhraseCut dataset and evaluate them on the RefCOCO [131], RefCOCO+ [131], and G-Ref [132] datasets, whose complex referring expressions target single-instance localization and segmentation.
We also evaluate the generalization of the CLIPSeg (PC+) [87] model, which was trained on an extended version of the PhraseCut dataset (PC+), to further demonstrate the generalization capability of our methods. Table 3.3 summarizes the performance of our models along with the RGBNet baseline. We observe a similar trend of performance improvements, RGBNet < ZNet < LD-ZNet. These experiments demonstrate that the LDM features enhance the generalization power of LD-ZNet even on complex referring expressions.

                     RefCOCO        RefCOCO+       G-Ref
Method               IoU    AP      IoU    AP      IoU    AP
CLIPSeg (PC+) [87]   30.1   14.1    30.3   15.5    33.8   23.7
RGBNet               36.3   15.7    37.1   16.7    41.9   27.8
ZNet (Ours)          40.1   16.8    40.9   17.8    47.1   29.2
LD-ZNet (Ours)       41.0   17.2    42.5   18.6    47.8   30.8

Table 3.3: Generalization of our proposed approaches to different types of expressions from other datasets. ZNet and LD-ZNet outperform both the RGBNet baseline and CLIPSeg across all datasets.

3.4.4 Inference Time

During inference, our proposed LD-ZNet relies on the LDM to extract the internal features for just a single timestep (as opposed to around 50 reverse diffusion timesteps for the text-to-image synthesis task). We then use these LDM features for cross-attention into LD-ZNet via the attention pool layer to extract the final mask. Therefore, using the diffusion model increases the overall run time by only a small amount. For the stable-diffusion model, inference takes 2.57s for 50 timesteps to synthesize an image (roughly 51ms per timestep), whereas the average inference times for RGBNet, ZNet, and LD-ZNet are only 62ms, 55ms, and 101ms per image, respectively, on the AIGI dataset with an RTX A6000 GPU. SEEM [92] takes 293ms for the same task. Since we use an architecture similar to the UNet from the second stage of the LDM as our segmentation network, the proposed LD-ZNet has 925M trainable parameters.

3.4.5 Cross-attention vs. Concatenation for LDM features

In LD-ZNet, we inject LDM features into the ZNet model using cross-attention (Figure 3.5). To understand the importance of the cross-attention layer, we also train and evaluate another model where the LDM features are concatenated with the ZNet features right before the spatial-attention layer. The results are summarized in Table 3.4 and show that concatenating the LDM features yields inferior results compared to the proposed method. This is because of the attention pool layer, which serves as a learnable layer and also encodes positional information into the LDM features for setting up the cross-attention. Moreover, the cross-attention layer learns how feature pixels from ZNet attend to feature pixels from the LDM, thereby leveraging context and correlations from the entire image. With concatenation, however, we only fuse the spatially corresponding features of the LDM and ZNet, which is sub-optimal.

Diffusion features via          mIoU   IoUFG   AP
LD-ZNet with concatenation      50.2   59.0    78.1
LD-ZNet with cross-attention    52.7   60.0    78.9

Table 3.4: Incorporating LDM features into ZNet via cross-attention (LD-ZNet) leverages the visual-linguistic information present in them better than concatenation, leading to improved performance on the text-based image segmentation task.

3.5 Discussion

In this section, we present more qualitative results to demonstrate several interesting aspects of our proposed technique when applied to downstream segmentation tasks. In Figures 3.8 to 3.12, we visualize results of text-based image segmentation on a diverse set of images, including AI-generated images, illustrations, and generic photographs. In Figure 3.11, we show that when LD-ZNet is applied to the same image with various text prompts, it correctly segments the object and stuff classes being referred to in both examples. This capability is crucial for open-world segmentation and overall understanding of the scene.
The results also highlight that the algorithm works remarkably well on other domains like cartoons and illustrations. It is noteworthy that LD-ZNet can perform accurate segmentation for text prompts that include cartoons (Pikachu, Godzilla), celebrities (Donald Trump, Spiderman), and famous landmarks (Eiffel Tower), as seen in Figure 3.10. Finally, Figure 3.12 shows the advantages of leveraging the semantic information present in the latent diffusion features. Compared to our baseline RGBNet, the proposed LD-ZNet generates better segmentation maps across animations, celebrity images, and illustrations.

Figure 3.11: LD-ZNet text-based image segmentation results for a real image and for illustrations, on a diverse set of things and stuff classes (prompts include "Books", "Flowers", "Sofa", "Table", "Trees", "Chair", "Clouds", "Grass", "Mountains", "River", "Buildings", "Crosswalk", "Bicycle", and "Bridge"). High-quality segmentation across multiple classes suggests that LD-ZNet has a good understanding of the overall scene.

Figure 3.12: More qualitative examples (columns: RGBNet, LD-ZNet) where RGBNet fails to localize "Guitar" and "Panda" in animation images (top row), the famous celebrities "Scarlett Johansson" and "Kate Middleton" (second row), and objects such as "Lamp" and "Trees" in illustrations (bottom row). LD-ZNet benefits from using z combined with the internal LDM features to correctly segment these text prompts.

3.6 Summary

In this chapter, we presented a novel approach for text-based image segmentation using large-scale latent diffusion models. By training the segmentation models on the latent z-space, we were able to improve their generalization to new domains, such as AI-generated images. We also showed that this z-space is a better input representation for text-based segmentation on natural images. By utilizing the internal features of the LDM at appropriate timesteps, we were able to tap into the semantic information hidden inside the image synthesis pipeline using a cross-attention mechanism, which further improved segmentation performance on both natural and AI-generated images. This was experimentally validated on several publicly available datasets and on a new dataset of AI-generated images, which we will make publicly available.

Chapter 4: Conclusions and Future Work

4.1 Concluding Remarks

In this dissertation, we presented novel ways to utilize two popular deep generative models, namely GANs and diffusion models, to improve crucial tasks in computer vision: 1) geometry estimation and 2) text-based image segmentation, respectively.

1. GANs for Unsupervised Geometry Estimation. In Chapter 2, we proposed a generative SharinGAN module for unsupervised domain adaptation (UDA) to combine labeled synthetic and unlabeled real images during training. The SharinGAN translates just the domain-specific task-related information from both domains into a shared space that is input to the primary task network. The information unrelated to the task is left untouched by SharinGAN during this translation for both domains. With this formulation, we show much improved generalization of the primary task network on various estimation tasks (monocular depth estimation of outdoor scenes, face normal estimation, and lighting estimation), all in an unsupervised setting.

2. LDMs for Text-Based Image Segmentation.
In Chapter 3, we proposed to use large-scale latent diffusion models (LDMs) pretrained on internet data to improve text-based segmentation performance for several novel classes and on a variety of imagery: real, AI-generated, illustrations, animations, etc. The understanding of internet-scale concepts, along with the ability to synthesize various photorealistic objects from text, makes the LDM an intuitive candidate for improving text-based recognition performance. Our proposed segmentation pipeline, LD-ZNet, benefits from the z-space as well as the internal representations within the LDM, which are shown to contain semantic information. We showed improved segmentation performance for LD-ZNet not just on real images but also on AI-generated images, animations, illustrations, celebrity images, etc.

4.2 Future Work

As we move towards an era of large-scale datasets and greater compute, the generative models trained with them will only get more powerful. It thus becomes crucial to understand how to utilize these generative models to improve general computer vision systems. In Chapte