ABSTRACT

Title of dissertation: ADVANCING VISUAL ASSETS: TAMING DEEP PRIORS FOR EDITING AND ENHANCEMENT

Yiran Xu, Doctor of Philosophy, 2025

Dissertation directed by: Professor Jia-Bin Huang, Department of Computer Science, University of Maryland

Visual data, including images and videos, are valuable assets with broad commercial, artistic, and personal significance. As digital content consumption continues to grow, there is an increasing demand for methods that can enhance visual quality, improve adaptability across different formats, and enable efficient content editing. However, achieving these enhancements manually is both labor-intensive and technically challenging. Recent advances in deep learning have introduced powerful generative models (e.g., GANs, diffusion models) and human-aligned visual representations (e.g., VGG, DINO-v2) that offer promising capabilities for improving visual assets. Yet, directly applying these models to real-world editing and enhancement tasks often introduces artifacts and inconsistencies, such as temporal flickering in videos, limited generalization to out-of-distribution (OOD) data, and misalignment between high-level priors and low-level structures. This thesis explores strategies to "tame" these deep priors, converting their potential into more controllable and reliable tools for visual asset enhancement.

This dissertation presents four key contributions in editing and enhancement, each demonstrating how to adapt deep priors to improve the usability, quality, and consistency of visual content. First, we develop VideoGigaGAN, a large-scale video super-resolution model that extends an image super-resolution model to video, enhancing both spatial resolution and temporal coherence. Second, we introduce a video editing framework that enforces temporal consistency by optimizing latent codes and the generator itself, reducing flickering artifacts in edited videos.
Third, we propose a method to improve generative priors for OOD data using a volumetric decomposition approach, enabling high-fidelity image reconstructions while maintaining editability. Finally, we explore image retargeting by leveraging perceptual priors to intelligently adapt content to different aspect ratios without compromising visual coherence.

By addressing these challenges, this thesis contributes to the broader goal of harnessing deep priors for real-world visual asset enhancement. The proposed approaches demonstrate that by adapting and refining generative priors, we can develop more reliable, high-quality, and scalable solutions for visual editing tasks. These contributions have potential applications in media production, content creation, digital art, and real-time video processing, paving the way for future research in deep learning-driven visual content adaptation.

ADVANCING VISUAL ASSETS: TAMING DEEP PRIORS FOR EDITING AND ENHANCEMENT

by Yiran Xu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Jia-Bin Huang, Chair/Advisor
Professor Christopher Metzler
Professor Abhinav Shrivastava
Professor Ruohan Gao
Professor Maria K. Cameron

© Copyright by Yiran Xu 2025

Acknowledgments

I would like to express my deepest gratitude to my advisor, Prof. Jia-Bin Huang, for his unwavering support, guidance, and encouragement throughout my Ph.D. journey. His insightful mentorship has profoundly shaped my research direction and has been instrumental in my growth as a researcher. I am also grateful to my advisory committee members, Prof. Christopher Metzler, Prof. Abhinav Shrivastava, Prof. Ruohan Gao, and Prof. Maria K. Cameron, for their valuable feedback and guidance, which have significantly enriched my work.
I am fortunate to have had the opportunity to work with incredible mentors during my industry internships. At Google DeepMind, I would like to thank Feng Yang, Siqi Xie, Jijun Jiang, Zhuofang Li, Yinxiao Li, Luciano Sbaiz, Junjie Ke, Miaosen Wang, Hang Qi, Han Zhang, Jose Lezama, Ming-Hsuan Yang, Irfan Essa, and Jesse Berent for their mentorship and collaboration, which greatly influenced my understanding of real-world research applications. At Adobe Research, I am deeply appreciative of Difan Liu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Seoung Wug Oh, Zhixin Shu, and Cameron Smith for their support and inspiring discussions. Additionally, I would like to thank my collaborators on campus: Mingyang Xie, Haoming Cai, Sachin Shah, Brandon Y. Feng, Yi-Ling Qiao, Alexander Gao, Prof. Ming C. Lin, and Badour AlBahar, whose insights and teamwork have been invaluable to my research.

I am also grateful to my amazing labmates, who have made this journey both intellectually stimulating and enjoyable. Yue Feng, Yao-Chih Lee, Yi-Ting Chen, Ting-Hsuan Liao, Hadi Alzayer, Songwei Ge, Kevin Zhang, Quynh Phung, Haowen Liu, Badour AlBahar, Yuliang Zou, Chen Gao, and Jinwoo Choi: thank you for your camaraderie, support, and the many insightful discussions that have pushed my research forward.

I am incredibly thankful for the friendships that have kept me grounded and supported throughout this journey. Yixuan Ren, Hanyu Wang, Bo He, Haozhe An, Yu Hou, and Shuaiyi Huang: your encouragement, laughter, and companionship have been a constant source of motivation.

Last but not least, I am deeply grateful to my family. To my wife, Yimin Peng, for her unwavering love, patience, and belief in me; I could not have done this without you. To my parents and grandparents, whose sacrifices and support have paved the way for my academic pursuits.
And to my baby cats, Bai and Michelle, who do not speak but have been my constant companions, bringing joy and comfort through the long nights of research. This dissertation is dedicated to all of you.

Table of Contents

Acknowledgements
1 Introduction
2 Generative Models for Large-scale Video Super-Resolution
  2.1 Introduction
  2.2 Related Work
  2.3 Method
    2.3.1 Preliminaries: Image GigaGAN upsampler
    2.3.2 Inflation with temporal modules
    2.3.3 Flow-guided feature propagation
    2.3.4 Anti-aliasing blocks
    2.3.5 High-frequency shuttle
    2.3.6 Loss functions
  2.4 Experimental Results
    2.4.1 Setup
    2.4.2 Ablation study
    2.4.3 Comparison with previous models
    2.4.4 Analysis of the trade-off between temporal consistency and frame fidelity
    2.4.5 Additional results
    2.4.6 8× video upsampling
  2.5 Limitations
  2.6 Conclusions
3 Temporally Consistent Video Editing
  3.1 Introduction
  3.2 Related Work
  3.3 Method
    3.3.1 Overview
    3.3.2 Flow-based temporal consistency
    3.3.3 Two-phase optimization strategy
    3.3.4 Unalignment
  3.4 Experimental Results
    3.4.1 Experimental setup
    3.4.2 Out-of-domain results
    3.4.3 In-domain editing results
    3.4.4 Two-phase optimization strategy ablation study
    3.4.5 Comparison with Latent Transformer
    3.4.6 Comparison with Deep Video Prior (DVP)
  3.5 Limitations
  3.6 Conclusions
4 Increasing Generalizability of the Generative Models
  4.1 Introduction
  4.2 Related Work
  4.3 Method
    4.3.1 Preliminaries: EG3D, 3D-aware GAN
    4.3.2 In-distribution GAN inversion
    4.3.3 Modeling out-of-distribution contents
    4.3.4 Composite volume rendering
    4.3.5 Low-resolution reconstruction
    4.3.6 Super-Resolution
    4.3.7 Editing
  4.4 Experimental Results
    4.4.1 Experimental Setup
    4.4.2 Quantitative results
    4.4.3 Qualitative results
    4.4.4 Other Applications
    4.4.5 Ablation Study
    4.4.6 Speed
  4.5 Limitations
  4.6 Conclusions
5 Human-aligned Visual Features for Image Retargeting
  5.1 Introduction
  5.2 Related work
  5.3 Method
    5.3.1 Overview of HALO
    5.3.2 Multi-Flow Network
    5.3.3 Perceptual Structure Similarity Loss
    5.3.4 Training loss
  5.4 Experimental results
    5.4.1 Setup
    5.4.2 Implementation details
    5.4.3 Comparison with previous methods
    5.4.4 Ablation study
    5.4.5 Analysis of the off-the-shelf models
    5.4.6 Results on In-the-wild data
  5.5 Limitations
  5.6 Conclusion
6 Future work
  6.1 Text-guided Motion Graphics
  6.2 Global to Local: Long-term Consistent Video Generation
7 Conclusion
Bibliography

Chapter 1: Introduction

Visual data, including images and videos, are essential digital assets that shape entertainment, communication, and artistic expression in today's world. With the rapid expansion of the internet, visual content has become ubiquitous across diverse platforms, spanning photography, filmmaking, social media, and digital archiving. These assets hold commercial, artistic, and personal value, influencing industries such as advertising, content creation, and digital marketing. However, raw visual data is often imperfect and requires extensive processing to enhance its quality and usability. The value of a visual asset can be increased by correcting errors, enhancing visual quality (e.g., upscaling from low resolution to 4K), or adapting content for different aspect ratios and display devices. Despite the growing demand for high-quality visuals, manual enhancement remains labor-intensive, requiring expert skills and significant effort. Thus, there is a strong need for automated and intelligent solutions that can efficiently process and enhance visual assets while preserving perceptual quality.

The emergence of deep learning has introduced powerful tools for visual editing and enhancement, such as Generative Adversarial Networks (GANs) [57], diffusion models [69], and human-aligned visual representations like VGGs [182] and DINOs [21, 144]. These models have shown remarkable potential in image synthesis, restoration, and manipulation, offering new ways to enhance visual assets.
However, directly applying deep priors to real-world editing tasks often introduces artifacts and inconsistencies. One major challenge is temporal inconsistency in videos, where image-based generative models struggle to maintain coherence across frames, leading to flickering and instability. Additionally, generative models are typically trained on specific distributions and often fail to generalize to out-of-distribution (OOD) data, limiting their applicability to diverse real-world content. Another challenge is the mismatch between deep priors and specific tasks: while these models capture high-level semantics well, they often struggle to preserve low- and mid-level structures, leading to unnatural artifacts in fine-grained editing tasks. Thus, the key research question becomes: How can we effectively "tame" deep priors to improve their adaptability, reliability, and quality for real-world visual asset editing?

Figure 1.1: Overview of the thesis. We explore how to "tame" powerful deep priors for downstream tasks to enhance the visual data. We present VideoGigaGAN to introduce generative priors in large-scale training for Video Super-Resolution (VSR). We also introduce a flow-based method to leverage an image generator to achieve temporally consistent video editing. We further improve the generalization of generative models to out-of-distribution data. Finally, besides generative priors, we use a discriminative prior for image retargeting.
This thesis presents four projects exemplifying advancements in visual asset editing and enhancement, each endeavoring to "tame" these potent yet unpredictable deep priors:

• Improving Temporal Consistency in Video Editing Using Pre-trained Image Generators: This work addresses the challenge of achieving temporal coherence in video editing by optimizing both the latent code and the pre-trained generator to minimize photometric inconsistencies across frames.

• Enhancing Generative Priors for Out-of-Distribution Data: This study introduces a method to decompose in-distribution and out-of-distribution components using volumetric representations, thereby improving the editability of generative models when handling unseen data.

• Developing a Large-Scale Video Super-Resolution Model from a Pre-trained Image Super-Resolution Model: This project extends the capabilities of an image super-resolution model to videos, focusing on generating high-frequency details while maintaining temporal consistency.

• Image Retargeting Leveraging Visual Perceptual Priors: This work explores the application of visual perceptual priors to adapt images to different aspect ratios, enhancing their compatibility across various display formats.

In Chapter 2, we introduce VideoGigaGAN, a large-scale video super-resolution model that builds upon advancements in image super-resolution. While existing image super-resolution models have achieved remarkable success in enhancing spatial resolution, extending these capabilities to videos presents unique challenges, particularly in maintaining temporal consistency across frames. VideoGigaGAN addresses these challenges by incorporating mechanisms to produce high-frequency details and ensure temporal coherence. Our model identifies and mitigates key issues that typically lead to temporal inconsistencies, such as flickering and motion artifacts.
Through extensive experiments, VideoGigaGAN has demonstrated superior performance in generating temporally consistent videos with fine-grained appearance details, outperforming previous state-of-the-art methods in both objective metrics and subjective visual quality.

In Chapter 3, we delve into our first project, which addresses the challenge of enhancing temporal consistency in video editing by leveraging pre-trained image generators. Traditional approaches often apply image-based generative models to video frames individually, leading to temporal inconsistencies such as flickering artifacts. To overcome this, we propose a method that minimizes temporal photometric inconsistencies by jointly optimizing both the latent code and the parameters of the pre-trained generator. This joint optimization ensures that consecutive frames maintain visual coherence, resulting in smoother transitions and a more stable viewing experience. Our approach has been rigorously evaluated across various domains and with different GAN inversion techniques, consistently demonstrating its effectiveness in reducing temporal artifacts and preserving the desired edits throughout the video sequence.

However, due to limited exposure to the training data, many generative models struggle when adapted to out-of-distribution (OOD) data. Chapter 4 presents our approach to enhancing generative priors when dealing with OOD data. Generative models, particularly 3D-aware GANs, typically excel when operating within the distribution of their training data but often struggle with OOD inputs, leading to suboptimal reconstructions and limited editability. To address this limitation, we introduce a volumetric decomposition method that explicitly models OOD objects within the 3D-aware GAN framework.
This technique enables the faithful reconstruction of input images, even when they contain elements not present in the training data, while preserving the model's ability to perform semantic edits. By effectively separating in-distribution and out-of-distribution components, our method strikes a balance between reconstruction fidelity and editability, significantly expanding the applicability of generative models to more diverse and complex real-world scenarios.

Besides the generative priors discussed in the previous chapters, discriminative priors learned by deep neural networks also show a powerful ability to recognize visual similarities. Chapter 5 explores the domain of image retargeting by leveraging visual perceptual priors to adapt images to various aspect ratios across different devices. In today's multi-device environment, images often need to be displayed on screens with differing aspect ratios, necessitating intelligent retargeting techniques that preserve visual quality and content integrity. Our approach utilizes human-aligned visual representations to guide the retargeting process, ensuring that essential visual features and overall aesthetics are maintained. This method addresses the challenges of maintaining visual coherence and quality during retargeting, enabling seamless adaptation of images to diverse display formats without compromising the viewer's experience.

The thesis concludes with Chapter 6, discussing several future directions, and Chapter 7, presenting the overall conclusions.

Chapter 2: Generative Models for Large-scale Video Super-Resolution

In this chapter, we aim to develop an approach to "tame" generative priors for visual asset enhancement. We delve into Video Super-Resolution (VSR) as an example. We develop VideoGigaGAN, a large-scale VSR model initialized from its image counterpart.
It is challenging to balance the high per-frame quality brought by the pretrained image super-resolution model against the temporal consistency required in VSR. We observe that the main artifact comes from the aliased input, and we propose well-designed anti-aliasing features that not only mitigate aliasing but also preserve the high-frequency details in the output.

2.1 Introduction

Video super-resolution (VSR) aims to reconstruct high-resolution videos from low-resolution inputs, a task challenged by the need for temporal consistency and high-frequency detail generation. VSR has wide applications in generated videos [11, 64], face videos [44], satellite videos [223], and anime [217]. Existing methods [25–27, 76] focus on consistency but often produce blurry outputs that lack high-frequency appearance details or realistic textures (see Fig. 2.2). Effective VSR requires generating plausible new content not present in the low-resolution inputs, a capability that these models struggle with. Recent diffusion-based methods [64, 161, 236, 251] enjoy higher per-frame quality but suffer from temporal flickering and slow inference.

Figure 2.1: We present VideoGigaGAN, a generative video super-resolution model that can upsample videos with high-frequency details while maintaining temporal consistency. Top: comparison of our approach with TTVSR [118] and BasicVSR++ [26]. Our method produces temporally consistent videos with more fine-grained details than previous methods. Bottom: our model can produce high-quality videos with 8× super-resolution.

Generative models (e.g., diffusion models [159, 205], VAEs [30], and GANs [56, 208, 209]) have advanced Image Super-Resolution (ISR) by modeling high-resolution image distributions, producing highly detailed textures.
GigaGAN [82] further increases the generative capability of image super-resolution models by training a large-scale GAN on billions of images. However, applying a generative model such as GigaGAN independently to video frames results in severe temporal artifacts (see Fig. 2.2). This raises the question: can GigaGAN's capabilities be harnessed for temporally consistent VSR?

We first experiment with an adapted GigaGAN baseline using temporal convolution and attention layers, which helps but fails to fully address the flickering of high-frequency details brought about by the strong hallucinations. Previous VSR approaches use regression-based networks to trade high-frequency details for better temporal consistency. As blurrier upsampled videos inherently exhibit better temporal consistency, the capability of GANs to hallucinate high-frequency details contradicts the goal of VSR in producing temporally consistent frames. We refer to this as the consistency-quality dilemma in VSR.

Figure 2.2: Limitations of previous methods. Previous VSR approaches such as VRT [112] suffer from a lack of details, as seen in the building example. Generative models, image GigaGAN [82] and StableVSR [161], produce sharper results with richer details, but they generate videos with temporal flickering or artifacts like aliasing (see red arrows). Our VideoGigaGAN can produce video results with both high-frequency details and temporal consistency, while artifacts like aliasing are significantly mitigated. Please refer to our supplementary material for a visual comparison.

In this work, we identify several key issues in applying GigaGAN to VSR and propose techniques to achieve detailed and temporally consistent video super-resolution.
Naively inflating GigaGAN with temporal modules [68] is not sufficient to produce temporally consistent results with high-quality frames. To address this issue, we employ a recurrent flow-guided feature propagation module to encourage information aggregation across different frames. We also apply anti-aliasing blocks in GigaGAN to address the temporal flickering caused by the aliased downsampling operations. Furthermore, we introduce an effective method for injecting high-frequency features into the GigaGAN decoder, called the high-frequency (HF) shuttle. The proposed high-frequency shuttle can effectively add fine-grained details to the upsampled videos while maintaining temporal consistency.

Contributions. We present VideoGigaGAN, the first large-scale GAN-based model for video super-resolution. We recognize the consistency-quality trade-off that has not been well discussed in previous VSR literature. We introduce the feature propagation module, anti-aliasing blocks, and HF shuttle, which significantly improve temporal consistency when applying GigaGAN to VSR. We show that VideoGigaGAN can upsample videos with much more fine-grained details than state-of-the-art methods evaluated on multiple datasets. We also show that our model can produce detailed and temporally consistent videos even for challenging 8× upsampling tasks.

2.2 Related Work

Video Super-Resolution. Significant work has been invested in video super-resolution, using sliding-window approaches [20, 111, 193, 197, 204, 229] and recurrent networks [72, 73, 76, 110, 113, 114, 168, 177]. BasicVSR [25] summarizes the common VSR approaches into a unified pipeline. It proposes an effective baseline using optical flow for temporal alignment and bidirectional recurrent networks for feature propagation. BasicVSR++ [26] redesigns BasicVSR by introducing second-order grid propagation and flow-guided deformable alignment.
To improve generalizability on real-world low-resolution videos, methods like RealBasicVSR [27] and FastRealVSR [226] use diverse degradations as data augmentation during training. While these approaches can produce temporally consistent upsampled videos, they are often trained with simple regression objectives and lack generative capability, which leads to unrealistic textures and overly blurry results. Unlike previous VSR approaches, we propose a GAN-based VSR model to generate high-frequency details while maintaining temporal consistency in the upsampled videos.

GAN-based Image Super-Resolution. SRGAN [105] is a seminal image super-resolution work that uses a GAN framework to model the manifold of high-resolution images. ESRGAN [209] further enhances the visual quality of upsampled images by improving the architecture and loss of SRGAN. Real-ESRGAN [208] extends ESRGAN to restore general real-world low-resolution images. While these methods can produce impressive results, they are still limited in model capacity and unsuitable for large upsampling factors. To scale up the model capacity of GANs, GigaGAN [82] introduces a filter bank and attention layers to StyleGAN2 [90] and trains the model on billions of images. Even for 8× image super-resolution tasks, GigaGAN can effectively generate new content not present in the low-resolution image and produce realistic textures and fine-grained details.

Generative Video Models. Many video generation works are based on VAEs [10, 106, 230], GANs [53, 184, 244], and autoregressive models [212]. LongVideoGAN [19] introduces a sliding-window approach for video super-resolution, but it is restricted to datasets with limited diversity. Recently, diffusion models have shown diverse and high-quality results in video generation tasks [15, 16, 54, 55, 70]. Imagen Video [68] proposes pixel diffusion models for video super-resolution.
Concurrent work Upscale-A-Video [251] adds temporal modules to a latent diffusion image upsampler [160] and finetunes it as a video super-resolution model. Unlike diffusion-based video super-resolution models that require iterative denoising processes, our VideoGigaGAN can generate outputs in a single feedforward pass with faster inference speed.

2.3 Method

Our VSR model G upsamples a low-resolution (LR) video v ∈ R^{T×h×w×3} to a high-resolution (HR) video V = G(v), where V ∈ R^{T×H×W×3}, with an upsampling scale factor α such that H = αh, W = αw. We aim to generate HR videos with both high-frequency appearance details and temporal consistency.

We present the overview of our VSR model, VideoGigaGAN, in Figure 2.3. We start with the large-scale GAN-based image upsampler, GigaGAN [82] (Section 2.3.1). We first inflate the 2D image GigaGAN upsampler to a 3D video GigaGAN upsampler by adding temporal convolutional and attention layers (Section 2.3.2).

Figure 2.3: Overview of our method for 4× upsampling. Our Video Super-Resolution (VSR) model is built upon the asymmetric U-Net architecture of the image GigaGAN upsampler [82]. To enforce temporal consistency, we first inflate the image upsampler into a video upsampler by adding temporal attention layers into the decoder blocks. We also enhance consistency by incorporating the features from the flow-guided propagation module. To suppress aliasing artifacts, we use anti-aliasing blocks in the downsampling layers of the encoder. Lastly, we directly shuttle the high-frequency features via skip connections to the decoder layers to compensate for the loss of details in the BlurPool process.

However, as shown in our experiments, the inflated GigaGAN still produces results with severe temporal flickering and artifacts, likely due to the limited spatial window size of the temporal attention. To this end, we introduce flow-guided feature propagation (Section 2.3.3) into the inflated GigaGAN to better align the features of different frames based on flow information. We also pay special attention to anti-aliasing (Section 2.3.4) to further mitigate the temporal flickering caused by the downsampling blocks in the GigaGAN encoder, while maintaining the high-frequency details by directly shuttling the HF features to the decoder blocks (Section 2.3.5). Our experimental results validate the importance of these model design choices.

2.3.1 Preliminaries: Image GigaGAN upsampler

Our VideoGigaGAN builds upon the GigaGAN image upsampler [82]. GigaGAN scales up the StyleGAN2 [90] architecture using several key components, including adaptive kernel selection for convolutions and self-attention layers. The GigaGAN image upsampler has an asymmetric U-Net architecture consisting of 3 downsampling encoder blocks {E_i} and 3 + k upsampling decoder blocks {D_i}:

X = G(x, z) = D(E(x, z), z) = \underbrace{D_{k+2} \circ \cdots \circ D_3}_{\uparrow \times 2^k} \circ \underbrace{D_2 \circ D_1 \circ D_0}_{\uparrow \times 8} \circ \underbrace{E_2 \circ E_1 \circ E_0(x, z)}_{\downarrow \times 8}.   (2.1)

This GigaGAN upsampler is able to upsample an input image by 2^k. Both encoder E and decoder D blocks utilize random spatial noise z as a source of stochasticity. The decoder D contains spatial self-attention layers. The encoder and decoder blocks at the same resolution are connected by skip connections.

2.3.2 Inflation with temporal modules

To adapt a pretrained 2D image model for video tasks, a common approach is to inflate 2D spatial modules into 3D temporal ones [16, 54, 68, 215, 233, 251].
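To make this inflation recipe concrete, the following is a minimal PyTorch sketch of a zero-initialized temporal module: a 1D temporal convolution followed by per-pixel temporal self-attention, each added with a residual connection. The module and variable names are illustrative rather than the thesis implementation, and an extra output projection is introduced here purely so the attention branch can be zero-initialized (making the block an identity at the start of training, as described below).

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    """Illustrative temporal inflation block: 1D temporal conv (kernel 3) +
    temporal self-attention with no spatial receptive field. Both branches are
    zero-initialized so the inflated model initially behaves like the image model."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)  # hypothetical zero-init gate
        for p in (self.conv.weight, self.conv.bias, self.proj.weight, self.proj.bias):
            nn.init.zeros_(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- features of a short frame chunk
        b, t, c, h, w = x.shape
        # Temporal conv: fold spatial positions into the batch, convolve over T only.
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.conv(y).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        x = x + y  # residual connection
        # Temporal self-attention per spatial location (no spatial receptive field).
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y, _ = self.attn(y, y, y, need_weights=False)
        y = self.proj(y).reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + y  # residual connection
```

With the zero initialization, the block is an exact identity at initialization, so the inflated generator starts out performing exactly like the image upsampler.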
To reduce the memory cost, instead of directly using 3D convolutional layers in each block, our temporal module uses a 1D temporal convolution layer of kernel size 3 that operates only on the temporal dimension, followed by a temporal self-attention layer with no spatial receptive field. Both the 1D temporal convolution and the temporal self-attention are inserted after the spatial self-attention with a residual connection [68]. In summary, at each block D_i, we first process the features of individual video frames using the spatial self-attention layer; the features are then jointly processed by our temporal module. Through our experiments, we find that adding temporal modules to the decoder D of the generator G is sufficient to improve video consistency. We also inflate the discriminator with comparable temporal modules. We follow [243] in initializing both the temporal convolutions and the temporal self-attention layers with zero weights, so that the generator and discriminator still behave as an image upsampler at the beginning of training, leading to a smoother transition to a video upsampler.

2.3.3 Flow-guided feature propagation

The temporal modules alone are insufficient to ensure temporal consistency, mainly due to the high memory cost of the 3D layers. For input videos with long sequences of frames, one could partition the video into small, non-overlapping chunks and apply temporal attention within each chunk. However, this leads to temporal flickering between different chunks. Even within each chunk, the spatial window size of the temporal attention is limited, meaning a large motion (i.e., one exceeding the receptive field) cannot be modeled by the attention module (see Figure 2.5). To address these issues, we augment the input image with features aligned by optical flow. Specifically, we introduce a recurrent flow-guided feature propagation module (see Figure 2.3) prior to the inflated GigaGAN, inspired by BasicVSR++ [26].
Instead of directly using the LR video as input to the inflated GigaGAN, we use the temporal-aware features produced by the flow-guided propagation module. The module comprises a bi-directional recurrent neural network (RNN) [25, 26] and an image backward-warping layer. We first employ the optical flow estimator to predict bi-directional optical flow maps from the input LR video. These maps and the original frame pixels are then fed into the RNN to learn temporal-aware features. Finally, these features are explicitly warped by the backward-warping layer, guided by the pre-computed optical flows, before being fed into the subsequent inflated GigaGAN blocks. The flow-guided propagation module can effectively handle large motion and produce better temporal consistency in output videos, as demonstrated in Fig. 2.5.

During training, we jointly train the flow-guided feature propagation module and the inflated GigaGAN model. At inference time, given an input LR video with an arbitrary number of frames, we first generate frame features using the flow-guided propagation module. We then partition the frame features into non-overlapping chunks and independently apply the inflated GigaGAN on each chunk. Since the features inside each chunk are aware of the other chunks, thanks to the flow-guided propagation module, the temporal consistency between consecutive chunks is preserved well.

2.3.4 Anti-aliasing blocks

With both the temporal and feature propagation modules enabled, our VSR model can process longer videos and produce results with better temporal consistency. However, the high-resolution frames still flicker in areas with high-frequency details (for example, the windows of the building in Figure 2.2). We identify the downsampling operations in the GigaGAN encoder as a contributor to the flickering in those regions.
The high-frequency components in the input can easily alias into lower frequencies because the downsampling rate does not meet the classical sampling criterion [142]. In video super-resolution, this aliasing of pixels manifests as temporal flickering. Previous VSR approaches often use regression-based objectives, which tend to remove high-frequency details; consequently, these methods produce output videos free of aliasing. In our GAN-based VSR framework, however, the GAN training objectives favor the hallucination of high-frequency details, making aliasing a more severe problem.

In the GigaGAN upsampler, the downsampling operation in the encoder is achieved by strided convolutions with a stride of 2. To address the aliasing issue in our output video, inspired by [245], we replace all the strided convolution layers in the upsampler encoder with BlurPool layers. More specifically, during downsampling, instead of simply using a strided convolution, we use a convolution with a stride of 1, followed by a low-pass filter and a subsampling operation. We show the anti-aliasing blocks in Figure 2.3. Our experiments show that the anti-aliasing downsampling blocks preserve temporal consistency for high-frequency details significantly better than naive strided convolutions. We also experimented with StyleGAN3 blocks for anti-aliased upsampling [87]. The temporal flickering is mitigated, but we observed a notable drop in frame quality.

2.3.5 High-frequency shuttle

With the newly introduced components, the temporal flicker in our results is significantly suppressed. However, as shown in Figure 2.5, adding the flow-guided propagation module (Section 2.3.3) leads to a blurrier output, and the anti-aliasing blocks (Section 2.3.4) make the results blurrier still. We still need the high-frequency information in the GigaGAN features to compensate for the loss of high-frequency details.
However, as discussed in Section 2.3.4, the traditional flow of high-frequency information in GigaGAN leads to aliased output.

We present a simple yet effective approach to resolve this conflict between high-frequency details and temporal consistency, called the high-frequency shuttle (HF shuttle). To guide where the high-frequency details should be inserted, the HF shuttle leverages the skip connections in the U-Net and uses a pyramid-like representation of the feature maps in the encoder. More specifically, at feature resolution level i, we decompose the feature map f_i into a low-frequency (LF) component and a high-frequency (HF) component. The LF feature map f_i^LF is obtained via the low-pass filter mentioned in Section 2.3.4, while the HF feature map is computed from the residual as f_i^HF = f_i − f_i^LF. The HF feature map f_i^HF, containing high-frequency details, is injected through the skip connection into the decoder (Figure 2.3). Our experiments show that the high-frequency shuttle can effectively add fine-grained details to the upsampled videos while mitigating issues such as aliasing and temporal flickering.

2.3.6 Loss functions

We use the standard non-saturating GAN loss [61], R1 regularization [131], LPIPS [246], and the Charbonnier loss [28] during training:

L(X_t, x_t) = μ_GAN L_GAN(G(x_t), D(G(x_t))) + μ_R1 L_R1(D(X_t)) + μ_LPIPS L_LPIPS(X_t, x_t) + μ_Char L_Char(X_t, x_t),    (2.2)

where the Charbonnier loss is a smoothed version of the pixelwise ℓ1 loss, and μ_GAN, μ_R1, μ_LPIPS, μ_Char are the weights of the different loss terms. Here, x_t is one of the LR input frames and X_t is the corresponding ground-truth HR frame. We average the loss over all the frames in a video clip during training.

2.4 Experimental Results

2.4.1 Setup

Datasets. We strictly follow two widely used training sets from previous VSR works [25, 26, 118]: REDS [138] and Vimeo-90K [229]. The REDS dataset contains 300 video sequences. Each sequence consists of 100 frames with a resolution of 1280 × 720.
We use REDS4 as our test set and REDSval4 as our validation set; the rest of the sequences are used for training. Vimeo-90K contains 64,612 sequences for training and 7,824 for testing (known as Vimeo-90K-T). Each sequence contains seven frames with a resolution of 448 × 256.

Figure 2.4: Qualitative comparison with other baselines on public datasets (REDS4 [138], Vimeo-90K-T [229]). We show PSNR/SSIM/LPIPS below each output frame. PSNR does not align well with human perception and favors blurry results; LPIPS is a preferred metric that aligns better with human perception. Compared to previous VSR approaches, our model produces more realistic textures and more fine-grained details.

Figure 2.5: Ablation study. Starting from the inflated GigaGAN (+Temporal attention in the figure), we progressively add components to demonstrate their effectiveness. With temporal attention, the local temporal consistency is improved compared to using the image GigaGAN to upsample each frame independently. The global temporal consistency improves with feature propagation, but aliasing still exists in the areas with high-frequency details, and the video results become more blurry. With the anti-aliasing blocks (BlurPool), the aliasing issue is much improved, but the video results become even blurrier. Finally, with the HF shuttle, we bring the per-frame quality and high-frequency details back while preserving good temporal consistency.

Following previous works [25, 26], we compute the metrics only on the
center frame of each sequence. In addition to the official test set Vimeo-90K-T, we also evaluate the model on Vid4 [117] and UDM10 [238], with different degradation algorithms (Bicubic Downsampling, BI, and Blur Downsampling, BD). We follow MMagic [135] to perform the degradation algorithms. All data are 4× downsampled to generate LR frames, following standard evaluation protocols [25, 26].

Evaluation metrics. We are interested in two aspects in our evaluation: per-frame quality and temporal consistency. For per-frame quality, we use PSNR, SSIM, and LPIPS [246]. For temporal consistency, the warping error E_warp [102] is commonly used:

E_warp(X̂_t, X̂_{t+1}) = (1 / Σ_i M_t^i) Σ_i M_t^i ‖X̂_t^i − W(X̂_{t+1}, F_{t→t+1})^i‖_2^2,    (2.3)

where (X̂_t, X̂_{t+1}) are generated frames at times t and t+1, i indexes the i-th pixel, W(·) is the backward-warping function, F_{t→t+1} is the forward flow estimated from the generated frames (X̂_t, X̂_{t+1}) using RAFT [194], and M_t^i ∈ {0, 1} is a non-occlusion mask indicating non-occluded pixels [165]. However, as reported in Fig. 2.2, previous baselines and even simple bicubic upsampling achieve lower E_warp than the ground-truth high-resolution video, since E_warp favors over-smoothed results. Consider an extreme algorithm in which all the generated frames are entirely black. E_warp computes the warping errors by warping the generated frames, so the warping error for this algorithm is 0, as the generated frames are maximally over-smoothed (in this extreme case, all black). Therefore, instead of warping the generated frames, we propose to warp the ground-truth frames using the flow computed on the generated frames. We refer to this new warping error as the referenced warping error (RWE), E^ref_warp. The referenced warping error between two frames is

E^ref_warp(X_t, X_{t+1}) = (1 / Σ_i M_t^i) Σ_i M_t^i ‖X_t^i − W(X_{t+1}, F_{t→t+1})^i‖_2^2,    (2.4)

where (X_t, X_{t+1}) are ground-truth frames at times t and t+1, and F_{t→t+1} is the forward flow estimated from the output frames (X̂_t, X̂_{t+1}) using RAFT [194].

Hyperparameters.
We use a pretrained 4× GigaGAN image upsampler as our base model. It contains three downsampling blocks in the encoder and five upsampling blocks in the decoder. The spatial self-attention layers are used only in the first block of the decoder for memory efficiency. For the flow network, we use a lightweight SpyNet [154]. For the low-pass filters, we use the kernel (1/16)[1, 4, 6, 4, 1] before subsampling. We set μ_GAN = 0.05, μ_R1 = 0.2048, μ_LPIPS = 5, μ_Char = 10 in Eqn. 2.2. During training, we randomly crop a 64 × 64 patch from each LR input frame at the same location. We use 10 frames of each video and a batch size of 32 for training. The batch is distributed across 32 NVIDIA A100 GPUs. We use a fixed learning rate of 5 × 10−5 for both the generator and the discriminator. The total number of training iterations is 100,000.

2.4.2 Ablation study

To demonstrate the effect of each proposed component, we add the components progressively, one by one, and evaluate on the REDS4 dataset [138]. We report the quantitative results in Table 2.1 and present a qualitative comparison in Figure 2.5. The flow-guided feature propagation brings a large improvement in LPIPS and E^ref_warp compared to temporal attention alone, demonstrating the contribution of feature propagation to temporal consistency. Further introducing BlurPool as the anti-aliasing block lowers the warping error but increases LPIPS (also shown in Figure 2.5). Finally, the HF shuttle recovers the LPIPS with only a slight loss of temporal consistency. Though it is not clearly reflected in the numbers, we observed that the sharpness of the frames improves significantly with the HF shuttle (see the y-t slice plot in Figure 2.5).

Table 2.1: Ablation study. We use LPIPS↓ to evaluate per-frame quality and E^ref_warp↓ (×10−3) for temporal consistency.
Starting from the image GigaGAN (upsampling each frame independently with the image upsampler), we progressively add components to demonstrate their effectiveness. The best number is in bold; the second best is underlined.

Model                       LPIPS↓   E^ref_warp↓ (×10−3)
GigaGAN (base upsampler)    0.2031   2.497
+ Temporal attention        0.2029   2.462
+ Flow-guided propagation   0.1551   2.187
+ BlurPool                  0.1621   2.152
+ High-freq shuttle         0.1582   2.177

2.4.3 Comparison with previous models

We conduct extensive experiments and report the quantitative comparison of per-frame quality in Table 2.3. We show the comparison of temporal consistency for 6 of the methods in Table 2.2.

Table 2.2: Comparison of VideoGigaGAN and previous VSR approaches in terms of temporal consistency and per-frame quality. The commonly used E_warp for temporal consistency favors blurrier results: the naive bicubic upsampling method achieves the lowest E_warp. To address this issue, we propose to use the referenced warping error E^ref_warp for temporal consistency.

Method                 LPIPS↓   E_warp↓ (×10−3)   E^ref_warp↓ (×10−3)
Bicubic                0.3396   1.161             2.4232
RealViformer [248]     0.2298   3.128             2.3183
TTVSR [118]            0.1836   1.390             2.1178
BasicVSR++ [26]        0.1786   1.401             2.1206
RVRT [115]             0.1727   1.438             2.1217
MIA-VSR [252]          0.1659   1.439             2.1172
VRT [112]              0.1818   1.398             2.1184
EvTexture [80]         0.1684   1.488             2.1320
StableVSR [161]        0.1934   3.957             2.2123
UAV [251]              0.4157   12.881            7.5241
DiffIR2VR-Zero [236]   0.3265   6.665             3.0942
VEnhancer [64]         0.4744   14.270            2.7383
Ours                   0.1582   2.313             2.1773
Ground truth           -        2.127             2.1272

Additionally, we provide qualitative comparisons in Figure 2.4.

Per-frame quality. As shown in Table 2.3, our model outperforms all the other models on LPIPS by a large margin, while showing poorer PSNR and SSIM. We observe that PSNR and SSIM do not align well with human perception and favor blurry results, as also reported in the literature [82, 160, 167].
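The claim that PSNR rewards blurriness can be made concrete with a toy example: against a high-frequency ground-truth texture, a prediction that averages all detail away scores a higher PSNR than a sharp prediction that is misaligned by a single pixel. This is a hypothetical 1-D illustration with made-up signals, not an experiment from this chapter:

```python
import math

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio over 1-D signals in [0, 1]."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return 10.0 * math.log10(peak ** 2 / mse)

# Ground truth: a high-frequency texture alternating between 0 and 1.
gt = [0.5 + 0.5 * (-1.0) ** n for n in range(64)]
blurry = [0.5] * 64                                      # detail averaged away
sharp_shifted = [0.5 + 0.5 * (-1.0) ** (n + 1) for n in range(64)]  # 1-px shift

# The detail-free prediction scores higher PSNR than the sharp one.
assert psnr(blurry, gt) > psnr(sharp_shifted, gt)
```

A perceptual metric such as LPIPS, computed on deep features rather than per-pixel differences, is far less sensitive to this kind of sub-pixel misalignment.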
Thus, we consider LPIPS [246] our core metric for evaluating per-frame quality, as it aligns more closely with human perception. In Figure 2.4, it is noticeable that our model produces results with the most fine-grained details, whereas previous approaches tend to predict blurry results with a critical loss of detail.

Temporal consistency. As observed in previous work [102], the widely used warping error metric favors blurrier videos. This is also illustrated in Table 2.2. The simple

Table 2.3: Quantitative comparison in terms of per-frame quality (LPIPS↓/PSNR↑/SSIM↑) evaluated on multiple datasets. We separate models into regression-based models and generative models (StableVSR [161], MGLD-VSR [234], and ours). We exclude LPIPS evaluation on Vimeo-90K-T for EvTexture [80] due to the lack of released preprocessed data. For StableVSR [161] and MGLD-VSR [234], we omit the Vimeo-90K-T evaluation due to their significantly long runtimes. We highlight LPIPS because PSNR/SSIM often misalign with human perception and favor blurrier results, as noted in many studies [35, 82, 160, 161, 167]. Our VideoGigaGAN aligns best with human perception.
BI degradation (LPIPS↓/PSNR↑/SSIM↑):

Method                 REDS4 [138]           Vimeo-90K-T [229]     Vid4 [117]
EDVR [207]             0.2097/31.05/0.8793   -/37.61/0.9489        -/27.35/0.8264
MuCAN [111]            0.2162/30.88/0.8750   0.1523/37.32/0.9465   -
BasicVSR [25]          0.2023/31.42/0.8909   0.1616/37.18/0.9450   0.2812/27.24/0.8251
IconVSR [25]           0.1939/31.67/0.8948   0.1587/37.47/0.9476   0.2739/27.39/0.8279
TTVSR [118]            0.1836/32.12/0.9021   -                     -
BasicVSR++ [26]        0.1786/32.39/0.9069   0.1506/37.79/0.9500   0.2627/27.79/0.8400
RVRT [115]             0.1727/32.74/0.9113   0.1502/38.15/0.9527   0.2500/27.99/0.8464
PSRT-recurrent [178]   0.1676/32.72/0.9106   0.1509/38.27/0.9536   0.2448/28.07/0.8485
MIA-VSR [252]          0.1659/32.79/0.9115   0.1428/38.22/0.9532   0.2474/28.20/0.8507
IA-RT [227]            0.1629/32.89/0.9138   0.1498/38.14/0.9528   0.2501/28.26/0.8517
VRT [112]              0.1818/32.19/0.9005   0.1461/38.20/0.9530   0.2478/27.93/0.8425
EvTexture [80]         0.1684/32.79/0.9173   -/38.23/0.9544        0.2188/29.51/0.8909
StableVSR [161]        0.1934/27.98/0.7952   -                     0.2803/24.48/0.6989
MGLD-VSR [234]         0.2285/26.24/0.7400   -                     -
Ours                   0.1582/30.46/0.8718   0.1120/35.97/0.9238   0.1925/26.78/0.8029

BD degradation (LPIPS↓/PSNR↑/SSIM↑):

Method                 UDM10 [238]           Vimeo-90K-T [229]     Vid4 [117]
EDVR [207]             -/39.89/0.9686        -/37.81/0.9523        -/27.85/0.8503
BasicVSR [25]          0.1148/39.96/0.9694   0.1551/37.53/0.9498   0.2555/27.96/0.8553
IconVSR [25]           0.1152/40.03/0.9694   0.1531/37.84/0.9524   0.2462/28.04/0.8570
TTVSR [118]            0.1112/40.41/0.9712   0.1507/37.92/0.9526   0.2381/28.40/0.8643
BasicVSR++ [26]        0.1131/40.72/0.9722   0.1440/38.21/0.9550   0.2390/29.04/0.8753
RVRT [115]             0.1100/40.90/0.9729   0.1465/38.59/0.9576   0.2219/29.54/0.8811
IA-RT [227]            0.1129/41.15/0.9750   0.1435/38.62/0.9579   0.2201/29.68/0.8884
VRT [112]              0.1097/41.05/0.9737   0.1421/38.72/0.9584   0.2214/29.42/0.8795
Ours                   0.1060/36.57/0.9521   0.1129/35.30/0.9317   0.1832/27.04/0.8365

(Methods with no reported BD results — MuCAN, PSRT-recurrent, MIA-VSR, EvTexture, StableVSR, and MGLD-VSR — are omitted from the BD panel.)

bicubic upsampling method achieves the best (lowest) value of the commonly used warping error, even lower than that of the ground truth. We proposed the referenced warping error (RWE) in Section 2.4.1 to address this issue of the warping error favoring blurry results.
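The referenced warping error of Eqn. 2.4 can be sketched in 1-D: the flow comes from the generated frames (RAFT in our experiments; here it is simply taken as given), but the photometric error is measured on the ground-truth frames, so an all-black or over-smoothed output no longer scores a perfect error of zero. Helper names are illustrative:

```python
def warp(src, flow):
    """Backward warp with linear interpolation (1-D sketch)."""
    n, out = len(src), []
    for i, f in enumerate(flow):
        pos = min(max(i + f, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        a = pos - lo
        out.append((1.0 - a) * src[lo] + a * src[hi])
    return out

def referenced_warping_error(gt_t, gt_t1, flow, mask):
    """RWE (Eqn. 2.4): warp the ground-truth next frame with the flow computed
    on the *generated* frames, then take a masked mean squared error."""
    warped = warp(gt_t1, flow)
    return sum(m * (a - b) ** 2 for m, a, b in zip(mask, gt_t, warped)) / sum(mask)

gt_t = [0.0, 1.0, 2.0, 3.0]    # ground-truth frame t (a moving ramp)
gt_t1 = [1.0, 2.0, 3.0, 3.0]   # frame t+1: content shifted left by one pixel
mask = [0, 1, 1, 1]            # exclude the pixel that left the field of view
# Flow matching the true motion gives zero RWE; a wrong flow (e.g., one
# estimated from a flickering generated video) is penalized.
assert referenced_warping_error(gt_t, gt_t1, [-1.0] * 4, mask) == 0.0
assert referenced_warping_error(gt_t, gt_t1, [0.0] * 4, mask) > 0.0
```

Note that for the degenerate all-black output, the flow estimated from the generated frames would be arbitrary, and warping the ground truth with it would generally produce a large, not zero, error.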
In terms of the referenced warping error, our method is slightly worse than previous methods (by 0.05 × 10−3 compared to BasicVSR++ [26]). The newly proposed RWE is more suitable for evaluating the temporal consistency of upsampled videos. However, it is still biased towards blurrier results, as seen in Table 2.2 (several methods, including BasicVSR, BasicVSR++, and TTVSR, still score better than the ground-truth high-resolution videos). We leave a better metric of VSR temporal consistency for future work.

2.4.4 Analysis of the trade-off between temporal consistency and frame fidelity

To better understand the trade-off between temporal consistency and per-frame quality, we include a visualization in Figure 2.6. We also compare with diffusion-based VSR models (UAV [251], DiffIR2VR-Zero [236], and VEnhancer [64]) that are trained on larger datasets. Despite large-scale training, they exhibit severe temporal inconsistency and low fidelity to the ground truth (significantly higher LPIPS) due to model hallucination. Unlike previous VSR approaches, our final model, VideoGigaGAN, achieves a balanced trade-off, significantly enhancing both temporal consistency and per-frame quality over the base GigaGAN model with our proposed improvements.

Table 2.4: More comparisons. Per-frame runtimes for 320 × 180 → 1280 × 720 are evaluated on REDS4 [138]. VideoGigaGAN demonstrates competitive runtimes compared to (a) regression-based models [112, 115, 227] and is substantially faster than (b) diffusion models [64, 161, 236, 251]. (c) Scaling up BasicVSR++ does not yield performance improvements. (d) Adding LPIPS to the training loss improves the LPIPS metric, but lowers PSNR/SSIM and makes training unstable.

Model                           #Params (M)   Runtime (ms)   LPIPS↓     PSNR↑
Regression-based:
  IA-RT [227]                   13.4          1895           0.1629     32.89
  VRT [112]                     35.6          219            0.1818     32.19
  RVRT [115]                    10.8          169            0.1727     32.74
Diffusion:
  UAV [251]                     746           7153           0.4157     23.03
  VEnhancer [64]                2496          5168           0.4744     19.92
  DiffIR2VR-Zero [236]          166           12212          0.3265     24.95
  StableVSR [161]               712           9242           0.1934     27.98
Scaling:
  BasicVSR++ (small) [26]       7.3           77             0.1786     32.39
  BasicVSR++ (medium)           166           85             0.1834     32.09
  BasicVSR++ (large)            368           92             0.1941     31.74
+LPIPS:
  BasicVSR++ (small) + LPIPS    7.3           77             0.1646     31.42
  RVRT + LPIPS                  10.8          169            diverged   diverged
VideoGigaGAN (ours)             369           295            0.1582     30.46

Figure 2.6: Trade-off between per-frame quality (LPIPS↓) and temporal consistency (RWE↓). Our final model achieves a good balance between temporal consistency and per-frame quality.

2.4.5 Additional results

Model sizes and runtimes. Table 2.4 compares model sizes and runtimes across VSR methods. Despite its larger size, owing to its generative capacity, our model maintains competitive speed. Unlike slower diffusion-based models [64, 161, 236, 251] that require iterative denoising, VideoGigaGAN produces results in a single feedforward pass.

Scaling-up experiments. For a fair comparison, we scale up BasicVSR++ [26] with additional layers and channels and evaluate on REDS4 [138]. Consistent with the findings in [82], scaling up BasicVSR++ alone does not enhance performance. Despite a similar model size to ours, training BasicVSR++ (large) becomes unstable past 40K iterations; we report its performance at 35K iterations in Table 2.4, which is worse than our model.

Adding LPIPS to the training loss. We also add LPIPS to the training losses of BasicVSR++ and RVRT [115] and report the results in Table 2.4. The performance of BasicVSR++ (small) on the LPIPS metric improves, but with a drop in PSNR and SSIM. Training RVRT with LPIPS is unstable and eventually diverges.
Moreover, training BasicVSR++ with LPIPS produces severe checkerboard artifacts in all results, similar to previous works. We show qualitative results in our supplementary material.

More perceptual metrics. We mainly use LPIPS as the metric for per-frame quality but acknowledge its limitations in capturing higher-level structures [46]. To address this, we also evaluate FID and DISTS in Table 2.5.

Table 2.5: Additional results on the REDS4 dataset.

Model              LPIPS↓   FID↓    DISTS↓
EvTexture [80]     0.1684   101.9   0.065
StableVSR [161]    0.1934   96.2    0.045
RealESRGAN [208]   0.4509   98.2    1.750
OVSR [237]         0.1746   123.8   0.063
Ours               0.1582   95.0    0.041

2.4.6 8× video upsampling

Our model is capable of 8× video upsampling with good temporal consistency and per-frame quality with rich details. We present some results in Figure 2.1. We encourage readers to visit our project website (https://videogigagan.github.io/) for more results.

2.5 Limitations

Figure 2.7: Limitations. Our approach has some limitations. (a) When the video is extremely long, the feature propagation becomes inaccurate, which may introduce undesired artifacts such as incorrectly propagated patterns. (b) Our model cannot handle small objects well, e.g., small characters.

Our model encounters challenges when processing extremely long videos (e.g., 200 frames or more). This difficulty arises from misguided feature propagation caused by inaccurate optical flow in such extended video sequences. Additionally, our model does not perform well on small objects, such as text and characters, as the information pertaining to these objects is largely lost in the LR video input. Examples of these failure cases are illustrated in Fig. 2.7.

2.6 Conclusions

We present a novel generative VSR model, VideoGigaGAN, that can upsample low-resolution input videos to high-resolution videos with both high-frequency details and temporal consistency.
Previous VSR approaches often use regression-based networks and tend to generate blurry results. To this end, our VSR model builds upon the powerful generative image upsampler GigaGAN. We identify several issues when applying GigaGAN to video super-resolution, including temporal flickering and aliasing artifacts. To address these issues, we introduce new components to the GigaGAN architecture that effectively improve both the temporal consistency and the per-frame quality. Our results demonstrate that VideoGigaGAN strikes a balance in addressing the consistency-quality dilemma of VSR compared to previous methods.

Chapter 3: Temporally Consistent Video Editing

Image generators have shown impressive results in producing photorealistic images. They can also edit an image with high-level, semantic text prompts. However, directly applying image editing techniques to videos introduces temporal inconsistency. In this chapter, we explore how to "tame" powerful image generative priors for videos. We present a flow-based video editing method that achieves both temporally consistent and visually plausible results.

Figure 3.1: Temporally consistent video semantic editing. We present a method for editing the semantic attributes of a video using a pre-trained StyleGAN model. Here we showcase free-form text-based editing with StyleCLIP [146] to make the person appear "angry" (2nd row) or wear "eyeglasses" (3rd row).

3.1 Introduction

Generative adversarial networks (GANs) [56] have shown a remarkable ability to generate photorealistic images in various domains such as faces and common objects [18, 89, 91]. GANs take a latent code (usually sampled from a Gaussian distribution) as input and produce an image as output. GAN inversion techniques allow us to project a real image onto the latent space of a pretrained GAN and retrieve its corresponding latent code.
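Optimization-based GAN inversion can be illustrated with a toy differentiable "generator": gradient descent on the reconstruction loss ‖G(w) − I‖² recovers a latent code whose reconstruction matches the target image. This is a deliberately simplified linear stand-in for StyleGAN, with made-up numbers, not the inversion procedure used in this chapter:

```python
def generate(w, weights):
    """Toy linear 'generator': maps a 2-D latent code to a 3-pixel image."""
    return [sum(wi * xi for wi, xi in zip(row, w)) for row in weights]

def invert(image, weights, lr=0.05, steps=2000):
    """Optimization-based inversion: gradient descent on ||G(w) - image||^2."""
    w = [0.0, 0.0]                       # initial latent code
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(w, weights), image)]
        # Gradient of the squared reconstruction loss w.r.t. the latent code.
        grad = [2 * sum(r * row[j] for r, row in zip(residual, weights))
                for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

G = [[1.0, 0.5], [0.0, 1.0], [0.5, -0.5]]    # fixed, "pretrained" generator
target = generate([0.7, -0.2], G)            # image produced by a known code
w_hat = invert(target, G)
# The recovered latent code reconstructs the target image.
assert all(abs(a - b) < 1e-4 for a, b in zip(generate(w_hat, G), target))
```

Real inversion methods replace the linear map with a deep generator and the analytic gradient with backpropagation, and often optimize in an extended latent space (e.g., W+ of StyleGAN) or predict the code with an encoder instead.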
The pretrained GAN generator can then reconstruct that image from the estimated latent code. Modifying this estimated latent code opens up exciting new opportunities to perform a wide range of high-level editing tasks that are traditionally challenging, e.g., changing semantic object classes, modifying high-level attributes of the object or scene, or even applying 3D geometric transformations. We refer to modifying the latent code to achieve a semantic change in the image as semantic editing.

Semantic editing in images. A recent line of research [1, 2, 6, 74, 158, 253, 255] has shown promising results in reconstructing an input image by either optimizing the latent code (or latent variables) or directly predicting the latent code via an image encoder. These GAN inversion techniques enable interesting semantic photo editing applications. For image-level editing applications, several approaches [75, 174, 175] find specific semantic directions in the latent space, e.g., changing poses, colors, or age, while others [48] aim to change the global style, e.g., photo → sketch. We denote these as in-domain and out-of-domain editing, respectively. Given these image GAN inversion-based semantic editing approaches, how can we extend them to videos?

Per-frame editing. One straightforward approach is to apply existing GAN inversion techniques [6, 74, 158, 253] to each frame of a video independently. Figure 3.2 shows an example of applying a StyleCLIP mapper [146] to two frames. The input and the independently reconstructed frames look plausible when viewed individually, but the two edited frames exhibit inconsistency (e.g., the frame with the eyeglasses). Recently, Yao et al. [235] learn to predict per-frame semantic editing directions for editing face videos. However, the edited videos suffer from apparent temporal flickering and fail to preserve facial identity.

Figure 3.2: Issues with per-frame editing.
While current methods achieve faithful inversion and photorealistic editing, the results are inconsistent across frames (eyeglasses) and may fail to preserve details of the input video (lips).

Our work. In this chapter, we present a method for temporally consistent video semantic editing. We start from existing GAN inversion approaches [6, 158] to obtain the latent code for each frame. We first modify the latent codes to achieve initial per-frame editing results. However, such direct editing results in temporal inconsistencies in the modified video's appearance or style. To deal with this challenge, we propose to compute bi-directional optical flow from frame pairs sampled from the video. We then adjust the latent code and the generator to minimize the photometric loss (along valid flow vectors). We present a two-phase optimization strategy. In the first phase, we update only the latent codes via an MLP (with generator parameters frozen) to adjust the consistency of the detailed appearance. In the second phase, we finetune the generator with a local regularization to maintain the editability of the latent space. Our two-phase optimization approach significantly improves temporal consistency while preserving the edited content.

Concurrent work. Two concurrent works [7, 200] also apply StyleGAN to video editing. These methods either use per-frame pivot tuning [158] to maintain the similarity between the edited and input frames [200] or apply latent-vector smoothing [7] with StyleGAN3 [88]. Our method differs in 1) the use of explicit temporal consistency optimization and 2) the applicability to both in-domain and out-of-domain editing.

In this chapter:

• We tackle the task of GAN-based semantic editing in videos. We propose a simple yet effective flow-based approach to mitigate the temporal inconsistency of a directly (frame-by-frame) edited video.
• We present a two-phase optimization approach for updating the latent code and generator to preserve the video details.

• Our method is agnostic to the underlying inversion technique and can be applied to different GAN inversion and editing approaches.

3.2 Related Work

Generative adversarial networks. The quality and resolution of generated images have improved rapidly in recent years [18, 85, 88, 89, 91]. These GAN models can map a random latent code (a noise vector) to a photorealistic image. Many recent efforts have been devoted to improving generator architectures [84, 88, 89, 91], training strategies [18], loss function designs [61, 127], and regularization [134]. Our work builds upon existing pretrained StyleGAN models, as they provide a disentangled latent space for editing. Instead of generating synthetic images, our goal is to edit real videos.

GAN inversion. GAN inversion [221, 255] allows us to reconstruct real images by projecting them onto a pretrained GAN's latent space. These techniques facilitate interesting photo editing applications. They can be split into encoder-based [6, 22, 126, 141, 157, 196, 198, 203], optimization-based [1, 2, 36, 37, 60, 74, 152, 195], and hybrid methods [14, 158, 253]. Our method is agnostic to the GAN inversion approach used to initialize the latent code; e.g., our experiments use PTI [158] for in-domain editing and the ReStyle encoder [6] for out-of-domain editing.

Semantic image editing in latent space. Semantic image manipulation and editing allow us to change the content and style of an image. They can be grouped into in-domain and out-of-domain editing. In-domain editing [3–5, 75, 109, 146, 166, 174, 175, 218, 219, 241] finds semantic directions in the latent space of a pretrained generator to manipulate the attributes of the object while keeping the same style. Out-of-domain editing [48, 78, 101], however, aims to change the style of the image.
These techniques usually perform well on a single image but fail to maintain temporal consistency when applied to a video.

Semantic video editing. Recent and concurrent works [7, 200, 235] explore video editing with a pre-trained StyleGAN. The methods in [200, 235] apply per-frame editing and show coherent editing without using any temporal information. However, these methods support only in-domain editing. For localized editing (e.g., adding eyeglasses), we find that the method in [235] produces inconsistency and fails to preserve identity. The work in [7] applies temporal smoothing on the inverted latent vectors in StyleGAN3 [88]. Our approach, in contrast, directly minimizes the temporal photometric inconsistency of the synthesized frames.

Video editing and temporal consistency. Temporal consistency is a critical criterion in video editing. Existing methods often achieve temporal consistency by enforcing the output videos to satisfy constraints imposed by 2D optical flow [31, 71]. Alternatively, several methods first estimate an unwrapped 2D texture map (either explicitly [155] or implicitly [92]) and then perform editing. The editing can then be propagated to the original video via the estimated UV mapping. Several blind methods enhance temporal consistency as a post-processing step [17, 102, 108]. However, they typically have difficulty handling videos with significant appearance changes. Our work shares similar ideas with these methods to enforce temporal consistency, using the optical flow fields estimated from the initial edited video. Instead of directly optimizing the pixel values, our core idea is to leverage the pretrained generator and update the latent code and the generator to achieve temporally consistent and photorealistic results.

Figure 3.3: Video editing with flow-based temporal consistency. Given an input video of T frames V_input, we first spatially align the video frames using an off-the-shelf face landmark detector. We then use existing GAN inversion techniques [6, 158] to obtain the inverted frames {I_1^inv, I_2^inv, ..., I_T^inv} and their corresponding latent codes in the W+ space of StyleGAN, {W_1^inv, W_2^inv, ..., W_T^inv}. We independently perform semantic editing on these inverted frames to obtain {I_1^edit, I_2^edit, ..., I_T^edit} and their corresponding latent codes {W_1^edit, W_2^edit, ..., W_T^edit}. To achieve temporal consistency, we choose an anchor frame I_anc^edit as the reference frame and each time sample another frame I_i^edit from the edited video. To generate a temporally consistent edited video, we first refine the latent codes of the directly edited video, W_anc^edit and {W_i^edit}_{i≠anc}, to Ŵ_anc^edit and {Ŵ_i^edit}_{i≠anc} by optimizing an MLP f_θ (phase 1). These refined latent codes result in the temporally consistent frames Î'_anc and Î'_i. To further improve the temporal consistency, we keep the refined latent codes Ŵ_anc^edit and Ŵ_i^edit and only update the generator parameters (phase 2). This generates Î''_anc and Î''_i with improved temporal consistency. After our two-phase optimization, we finally unalign the frames to generate our final edited video V_out (phase 3).

3.3 Method

3.3.1 Overview

GAN Inversion.
Given an input video V_input = {I_1, ..., I_T} of T frames, our goal is to semantically edit all the video frames while preserving the temporal coherence of the edited video. To edit the input video V_input, we first align its frames using a facial alignment method [62]. Then we use existing GAN inversion techniques (e.g., [6, 158]) to invert the frames back to latent codes such that each inverted frame I_t^inv = G(W_t^inv; θ^inv) is similar to the input frame: I_t^inv ≈ I_t. With the inverted frames, we can edit the inverted video V_inv = {I_1^inv, I_2^inv, ..., I_T^inv} by independently editing its frames I_t^inv. We denote this frame-by-frame editing approach as "direct editing".

Figure 3.4: Photometric loss for temporal consistency. Given a frame pair Î_i and Î_anc (either from phase 1 or phase 2), we compute the forward and backward flows F_{i→anc} and F_{anc→i} using RAFT [194]. We then use these two flow fields to compute the visibility masks by performing a forward-backward and backward-forward flow consistency error check. For in-domain editing, we also use LPIPS to obtain a semantic mask that highlights the difference between the aligned input frames I_i^in and I_anc^in and our edited frames Î_i and Î_anc. We then fuse the LPIPS semantic masks and the visibility masks to get our final masks M_{anc→i} and M_{i→anc}. To compute the photometric loss (Eqn. 3.1), we use the flows to warp the directly edited frames and use the fused masks as shown in (a).
In-domain and out-of-domain GAN-based editing. Commonly used image-based editing techniques via a GAN include (1) in-domain and (2) out-of-domain editing. We refer to in-domain editing [75, 174, 175, 219] as editing that only manipulates the latent code, given a fixed pretrained generator. That is, the generator parameters θ^inv remain frozen (θ^inv = θ^edit), and only the latent code W_t^edit is updated. In-domain editing usually changes semantic attributes such as color, age, or facial expressions. On the other hand, out-of-domain editing may involve updating the pretrained generator to produce an entirely new style (as shown in [48]). Here, the latent code remains the same, W_t^edit = W_t^inv, and only the generator parameters θ^edit change.

Direct editing on a video. When applying either type of editing technique to a video independently for each frame, we obtain an edited video V_edit = {I_1^edit, I_2^edit, ..., I_T^edit}. For each directly edited frame I_t^edit, there is a corresponding latent code W_t^edit such that I_t^edit = G(W_t^edit; θ^edit). Due to this per-frame, independent process, the edited video V_edit often suffers from temporal inconsistency. Moreover, due to the poor disentanglement of this per-frame editing, not only do the edited attributes differ among frames, but other existing facial attributes also change (see the change in the mouth in Fig. 3.5). Our goal is to ensure that the edited attributes remain temporally consistent while preserving the other details of the input video.

Overview of our approach. To achieve this goal, we propose a two-phase optimization approach: phase 1 updates the latent code via an MLP, and phase 2 updates the generator. In both phases, we optimize the temporal photometric loss across frames. With the fine-tuned latent code and generator, we unalign the edited frames to produce the edited video. Figure 3.3 outlines our workflow. Below, we describe the details and the losses of our approach.
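As a toy numpy sketch of the "direct editing" baseline above (the 512-dimensional latent size and the semantic direction are made-up stand-ins, not actual StyleGAN W+ codes): each frame's latent code is shifted independently along a semantic direction while the generator weights stay frozen.

```python
import numpy as np

def direct_edit(latents, direction, strength):
    """Per-frame 'direct editing': shift every frame's latent code along a
    semantic direction independently (generator weights stay frozen)."""
    return [w + strength * direction for w in latents]

T, dim = 5, 512                      # sizes are assumptions for illustration
rng = np.random.default_rng(0)
w_inv = [rng.normal(size=dim) for _ in range(T)]   # stand-ins for inverted codes

direction = np.zeros(dim)
direction[0] = 1.0                   # hypothetical unit "edit" direction
w_edit = direct_edit(w_inv, direction, strength=0.12)
```

Because each frame is shifted without any temporal constraint, nothing in this baseline couples the frames, which is exactly why the result flickers and motivates the flow-based losses below.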
3.3.2 Flow-based temporal consistency

We present a flow-based approach to explicitly encourage temporal consistency in the edited video V_edit.

Frame sampling. As we cannot fit an entire video into GPU memory, we perform our optimization on one pair of frames at a time. We use an anchor frame I_anc^edit as one member of each pair, which we set as the middle frame of the video. This is inspired by recent video representation work [156], where a video is represented by a key frame and a flow network. At each iteration, we sample a latent code W_i^edit, corresponding to the frame I_i^edit, and optimize the pair of frames {I_anc^edit, I_i^edit}. We perform our optimization in two phases (Section 3.3.3). In phase 1, we generate temporally consistent pairs {Î'_anc, Î'_i}_{i≠anc}. In phase 2, we further improve the temporal consistency, recover other attributes affected by the poorly disentangled per-frame editing, and generate the pairs {Î''_anc, Î''_i}_{i≠anc}.

Flow estimation and warping. We use RAFT [194] to compute the forward and backward flows F_{i→anc} and F_{anc→i} of the pair {Î_anc, Î_i}. This pair is either the output of phase 1, {Î'_anc, Î'_i}, or phase 2, {Î''_anc, Î''_i}. We then use these two flows to warp the pair of frames {Î_anc, Î_i}.

Visibility masks. To highlight the non-occluded regions, we compute the visibility masks M^vis_{anc→i}, M^vis_{i→anc} ∈ [0, 1]. These masks assign lower weights to occluded pixels and higher weights to non-occluded pixels (Figure 3.4). To compute the visibility masks, we first compute the forward-backward and backward-forward flow consistency error maps ε_{anc→i} and ε_{i→anc}, e.g., ε_{i→anc}(p) = ||F_{i→anc}(p) + F_{anc→i}(p + F_{i→anc}(p))||_2, where p is a pixel in the flow field. These error maps are mapped to [0, 1] using an exponential function such that M^vis_{anc→i} = exp(−10 ε_{anc→i}) and M^vis_{i→anc} = exp(−10 ε_{i→anc}).

Perceptual difference mask.
For in-domain editing, because the introduced edits are temporally inconsistent, we observe that the visibility masks do not emphasize the edited parts (e.g., eyeglasses). To highlight the edited parts, we compute soft semantic perceptual difference masks M^PD_anc and M^PD_i between the pair of frames and their corresponding aligned input frames using LPIPS [246] (Figure 3.4). Due to the significant appearance differences, we cannot use these semantic perceptual difference masks for out-of-domain editing.

Fused masks. For in-domain editing, we fuse the visibility masks and the semantic perceptual difference masks such that M_{anc→i} = (M^vis_{anc→i} ⊕ M^PD_i) and M_{i→anc} = (M^vis_{i→anc} ⊕ M^PD_anc). The masks are then clamped to [0, 1]. This fusion is shown in Figure 3.4. For out-of-domain editing, on the other hand, M_{anc→i} = M^vis_{anc→i} and M_{i→anc} = M^vis_{i→anc}.

Bi-directional photometric loss. We use the warped frames and the final fused masks to compute a bi-directional photometric loss that encourages a temporally consistent video. This loss measures the difference between the two frames in the non-occluded parts:

L_photo = Σ_{(Î_i, Î_anc) ∈ P} [ M_{i→anc} L_LPIPS(Î_anc, warp(Î_i, F_{anc→i})) + M_{anc→i} L_LPIPS(Î_i, warp(Î_anc, F_{i→anc})) ],   (3.1)

where Î_t is either the output of phase 1, Î'_t, or phase 2, Î''_t. Intuitively, this bi-directional photometric loss encourages colors along the valid (forward-backward or backward-forward consistent) flow vectors across frames to be as similar as possible.

Figure 3.5: Motivation for two-phase optimization. Updating the latent code W brings in the eyeglasses, and tuning G with the perceptual difference mask recovers the expression of the input.

3.3.3 Two-phase optimization strategy

We split our optimization into two phases. In the first phase, we refine the latent codes {W_t^edit} by optimizing only an MLP f_θ.
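The visibility masks and the bi-directional photometric loss just described can be sketched as follows. This is a minimal numpy version under stated simplifications: nearest-neighbour warping replaces bilinear sampling, a plain per-pixel L1 distance stands in for LPIPS, and the function names are ours, not from the chapter's code.

```python
import numpy as np

def warp_nn(img, flow):
    """Backward-warp img with a (H, W, 2) flow field, sampling with
    nearest neighbours (bilinear sampling would be used in practice)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def visibility_mask(f_fwd, f_bwd, scale=10.0):
    """Soft visibility mask exp(-10 * eps), where eps is the
    forward-backward flow consistency error at each pixel."""
    round_trip = f_fwd + warp_nn(f_bwd, f_fwd)   # ~0 where flows agree
    eps = np.linalg.norm(round_trip, axis=-1)
    return np.exp(-scale * eps)                  # in [0, 1]

def photometric_loss(i_anc, i_i, f_anc2i, f_i2anc):
    """Masked bi-directional photometric loss in the spirit of Eqn 3.1,
    with L1 standing in for LPIPS and no perceptual-difference fusion."""
    m_anc2i = visibility_mask(f_anc2i, f_i2anc)
    m_i2anc = visibility_mask(f_i2anc, f_anc2i)
    d1 = np.abs(i_anc - warp_nn(i_i, f_anc2i)).mean(-1)
    d2 = np.abs(i_i - warp_nn(i_anc, f_i2anc)).mean(-1)
    return (m_i2anc * d1 + m_anc2i * d2).mean()

# A static frame pair with zero flow is perfectly consistent:
frame = np.random.default_rng(1).random((8, 8, 3))
zero = np.zeros((8, 8, 2))
loss = photometric_loss(frame, frame, zero, zero)   # -> 0.0
```

The occlusion handling is the key design choice: where the round-trip flow error is large, the exponential mask drives the loss contribution toward zero instead of penalizing pixels that have no valid correspondence.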
In the second phase, we update only the generator weights θ^edit.

Motivation. We use a two-phase optimization approach for in-domain editing because we observe that refining only the latent codes (phase 1) sometimes introduces undesired changes to other facial attributes. We show an example in Fig. 3.5. When we only update the latent codes, we achieve temporal consistency of the introduced glasses; however, the mouth expression of the person changes. To address this in the case of in-domain editing, we update the generator weights (phase 2) using the perceptual difference mask to enforce that the pixels outside the mask stay the same as the input. This maintains the facial expression of the aligned input frame. The primary source of inconsistency for out-of-domain editing is global inconsistency (e.g., the background). Hence, updating the generator (phase 2) introduces this desired global change.

Phase 1: Latent code update. In this phase, we update the latent code W_t^edit implicitly using a multi-layer perceptron (MLP) f_θ(w; θ_f). We use the same architecture as the StyleCLIP mapper [146]. We use this MLP to predict a residual for the latent codes and update the parameters of the MLP instead of directly optimizing the latent codes, such that

Ŵ_t^edit = W_t^edit + α f_θ(W_t^edit; θ_f).   (3.2)

Then, for a pair of directly edited frames {I_anc^edit, I_i^edit}, we obtain the updated frames Î'_i = G(Ŵ_i^edit) and Î'_anc = G(Ŵ_anc^edit). Our goal is to minimize

argmin_{θ_f} L_I = argmin_{θ_f} Σ_{t≠anc} L_photo + λ_rf L_rf + λ_ε L_ε,   (3.3)

where L_photo is the photometric loss, and

L_rf = ||f_θ(W_t^edit; θ_f)||_1 + ||f_θ(W_anc^edit; θ_f)||_1   (3.4)

Figure 3.6: x-t slices when updating latent codes explicitly vs. implicitly with an MLP. We visualize the optimized frames and an x-t slice at y = 500.
Explicitly updating the latent code W yields an unstable x-t scanline, while updating W implicitly with an MLP yields a smooth scanline.

Here, L_rf is a regularization term that ensures we do not deviate too much from W_t^edit. We set λ_rf = 0.1 for the experiments. L_ε = ||ε_{anc→i}||_1 + ||ε_{i→anc}||_1 is the norm of the flow consistency error maps, and we set λ_ε = 10. The reason we use an MLP to update the latent code implicitly is that explicitly optimizing the latent codes results in an unstable optimization when using a large learning rate, while the running time becomes too long with a small learning rate. To address this, we introduce an MLP to predict the residual and update the latent codes implicitly, which leads to a more stable optimization. We show an example x-t scanline in Fig. 3.6 to demonstrate the effectiveness of introducing the MLP.

Phase 2: Generator update. For in-domain editing, in this phase we use the updated latent codes {Ŵ_t^edit}_{t=1}^T from phase 1, and our goal is to finetune only the generator to minimize

θ̂^edit = argmin_{θ̂^edit} L_II = argmin_{θ̂^edit} Σ_{t≠anc} L_photo + λ_ε L_ε + λ_r L_r + λ_M L_M,   (3.5)

L_M = (1 − M^PD_i) L_LPIPS(Î''_i, I_i^in) + (1 − M^PD_anc) L_LPIPS(Î''_anc, I_anc^in).   (3.6)

M^PD_i is the perceptual difference mask computed between Î''_i = G(Ŵ_i^edit; θ̂^edit) and the aligned input I_i^in, and L_LPIPS(·, ·) is the LPIPS distance [246]. We initialize θ̂^edit as θ^edit. The LPIPS term also helps maintain the sharpness of the edited frames, because consistency could otherwise be achieved by pushing all the frames to become blurry.

Here, L_r is the regularization loss for the generator, and λ_r is the strength of the regularization. We introduce this loss to help prevent the generator G from losing its latent-space editability, as we do not wish to ruin its pretrained latent space. Therefore, similar to [158], we use this local regularization to preserve the editing ability of our generator.
More specifically, we first obtain a latent code W_r by linearly interpolating between the current latent code Ŵ_t^edit and a randomly sampled code W_z with an interpolation parameter α_interp: W_r = Ŵ_t^edit + α_interp (W_z − Ŵ_t^edit) / ||W_z − Ŵ_t^edit||_2. This gives us a new latent code in a local region around Ŵ_t^edit. To ensure that we do not lose the editing capability of the original generator, we penalize the distance between the images generated by the new generator and the old one:

L_r = L_LPIPS(x_r, x̂_r) + λ_r^ℓ2 L_ℓ2(x_r, x̂_r),   (3.7)

where x_r = G(W_r; θ^edit), x̂_r = G(W_r; θ̂^edit), and λ_r^ℓ2 is the weight for the ℓ2 loss. This regularization alleviates the side effects of updating G within a local area. This is desirable since, for a video, the latent codes for the same identity tend to cluster locally.

For out-of-domain editing, unlike in-domain editing, we cannot rely on the perceptual difference mask, so the optimization objective reduces to

θ̂^edit = argmin_{θ̂^edit} L_II = argmin_{θ̂^edit} Σ_{t≠anc} L_photo + λ_r L_r + λ_ε L_ε.   (3.8)

To compensate for the regularization effect of the perceptual difference mask, we freeze the last eight layers of the synthesis network in G to avoid blurry results. As all the computations, including the GAN generator, flow estimation network, spatial warping, and photometric losses, are differentiable, we can backpropagate the errors all the way back.

Figure 3.7: Visual results on the RAVDESS dataset [123]. We show both in-domain ("beard") and out-of-domain ("Disney princess") editing results. Our results maintain consistent changes over time, preserving temporal coherence.

After phases 1 and 2, we obtain {Ŵ_t^edit}_{t=1}^T and G(·; θ̂^edit).

3.3.4 Unalignment

After our two-phase optimization, we apply the stitch tuning approach [200] as post-processing to put the aligned frames back into the original video and generate our final edited video.
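The latent sampling used by the locality regularizer L_r above can be sketched as follows (a numpy stand-in; the 512-dimensional code size and variable names are assumptions, and the two generators are omitted since only the sampling step is illustrated):

```python
import numpy as np

def local_latent_sample(w_hat, w_z, alpha_interp):
    """W_r = W_hat + alpha_interp * (W_z - W_hat) / ||W_z - W_hat||_2.
    W_r lies at exactly distance alpha_interp from W_hat, toward the
    random code W_z; L_r then compares G(W_r; theta_edit) with
    G(W_r; theta_edit_hat) to keep the generator fine-tuning local."""
    delta = w_z - w_hat
    return w_hat + alpha_interp * delta / np.linalg.norm(delta)

rng = np.random.default_rng(0)
w_hat = rng.normal(size=512)   # refined latent code (stand-in)
w_z = rng.normal(size=512)     # randomly sampled code
w_r = local_latent_sample(w_hat, w_z, alpha_interp=0.5)
# ||w_r - w_hat|| == alpha_interp by construction
```

Normalizing the offset keeps the regularizer focused on a fixed-radius neighbourhood of the current code, which matches the observation that latent codes for the same identity cluster locally.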
Note that this is only feasible for in-domain editing, because out-of-domain editing changes the global appearance relative to the input video.

3.4 Experimental Results

3.4.1 Experimental setup

Implementation details. We use StyleGAN-ADA [86] as our pre-trained generator. We experiment with in-domain and out-of-domain editing techniques to validate our approach for different GAN inversion methods. Specifically, for in-domain editing, we use PTI inversion [158] (based on e4e [198]) and the StyleCLIP mapper [146]. For out-of-domain editing, we use the Restyle encoder [6] and StyleGAN-NADA [48]. We will release the source code and pretrained models. In the following, we show sample results from the video frames. We encourage readers to view the videos in the supplementary material.

Figure 3.8: Visual comparison with DVP [108]. DVP achieves temporal consistency by severely smoothing the image and hence losing sharpness. Our method, in contrast, achieves a balance between consistency and sharpness. In the "eyeglasses" example (left), DVP shows a different pair of eyeglasses over time (zoom in for better visualization), while ours maintains good consistency for the eyeglasses; in "Disney princess" (right), DVP shows a blurry result with an unstable x-t scanline, while ours is sharper and shows stable consistency in the scanline.

Datasets. We conduct our metric evaluation using 20 videos randomly sampled from the RAVDESS dataset [123]. For each video, we conduct 5 types of in-domain editing and 5 types of out-of-domain editing. To further demonstrate the ability of our method to handle real videos, we also apply our approach to Internet videos and show the visual results.

Metrics. We evaluate the method on two main aspects: 1) temporal consistency and 2) perceptual similarity to the semantically edited frames.

Figure 3.9: Results on Internet videos.
We change the first person to a "surprised" expression and the second person to "angry".

To evaluate temporal consistency, we measure the warping error E_warp:

E_warp(I_t, I_{t+1}) = (1 / Σ_{i=1}^N M_t(p_i)) Σ_{i=1}^N M_t(p_i) ||I_t(p_i) − Î_{t+1}(p_i)||_2^2,   (3.9)

where Î_{t+1} = warp(I_{t+1}, F_{t→t+1}), N is the number of pixels, p_i is the i-th pixel, and M_t is a binary non-occlusion mask indicating non-occluded pixels, which we compute using the forward-backward consistency error with the threshold from [103, 121].

We also measure the LPIPS perceptual similarity score [246] (with AlexNet [98]) between the directly edited video V_edit = {I_1^edit, I_2^edit, ..., I_T^edit} and the output of our phase 2, {Î''_1, Î''_2, ..., Î''_T}, by averaging the perceptual similarity between the corresponding frames. The purpose of these two metrics is to evaluate whether the method achieves a balance between temporal consistency and fidelity degradation. This is an inherent trade-off: preserving all the details of per-frame editing inevitably leads to temporal flickering artifacts, while focusing only on temporal consistency easily leads to blurry videos. Our goal is for the final output video to be visually similar to the directly (per-frame) edited video.

Table 3.1: Out-of-domain editing comparison. Direct editing: E_warp = 0.0098, LPIPS = 0.0000.

                       E_warp ↓             LPIPS ↓
Editing category       DVP [108]   Ours     DVP [108]   Ours
Sketch                 0.0036      0.0085   0.2404      0.1314
Pixar                  0.0031      0.0025   0.1074      0.1178
Disney Princess        0.0040      0.0078   0.2062      0.1204
Elf                    0.0042      0.0108   0.2289      0.1310
Zombie                 0.0040      0.0085   0.2033      0.1370
Average performance    0.0038      0.0076   0.1972      0.1275

Table 3.2: In-domain editing comparison. Direct editing: E_warp = 0.0076, LPIPS = 0.0000.

                       E_warp ↓                          LPIPS ↓
Editing category       LT [235]   DVP [108]   Ours       DVP [108]   Ours
angry                  -          0.0033      0.0032     0.2452      0.1100
beard                  0.0064     0.0038      0.0030     0.2444      0.1033
eyeglasses             0.0066     0.0039      0.0034     0.1226      0.1097
Depp                   -          0.0037      0.0031     0.2452      0.2024
surprised              -          0.0035      0.0028     0.1415      0.1012
Average performance    0.0065     0.0036      0.0031     0.1760      0.1253

3.4.2 Out-of-domain results

Setup. We first invert the videos frame by frame using the Restyle encoder [6] (pSp-based [157]). We then directly apply five different out-of-domain editing effects produced by StyleGAN-NADA [48]. We perform our two-phase optimization on the directly edited video using the Adam optimizer [95]. For phase 1, we set the learning rate to α_I = 0.005 and update the latent codes for 5 epochs. In Eqn. 3.2, we set α = 0.04 for all the editing directions. For phase 2, we set the learning rate to α_II = 8 × 10^−4 and finetune G for 5 epochs. We set the regularization weight λ_r to 200.

Evaluation. Table 3.1 shows that our method decreases the temporal error of the directly edited video. The primary sources of inconsistency in out-of-domain editing are the flickering background and the details of the hair. We show our visual results in Figure 3.7. Our method preserves temporal consistency and maintains the sharpness of the input video.

3.4.3 In-domain editing results

Setup. We first invert the videos frame by frame using PTI [158]. We then directly apply five different semantic editing directions discovered by the StyleCLIP mapper [146]. Next, we perform our two-phase optimization on the directly edited video using the Adam optimizer [95]. For phase 1, we set the learning rate to α_I = 0.05 and update f_θ for 10 epochs. In Eqn. 3.2, we set α = 0.12 for "eyeglasses" and α = 0.04 for the remaining semantic directions. For phase 2, we set the learning rate of G to α_II = 0.0001 and finetune G for 5 epochs. We set the regularization weight λ_r to 200.

Evaluation. Table 3.2 shows that our approach improves the temporal consistency over the directly edited video baseline by a large margin.
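The warping-error metric E_warp (Eqn. 3.9) used in these tables can be sketched as follows, a minimal numpy version under stated simplifications: nearest-neighbour warping instead of bilinear sampling, and function names of our choosing rather than from the chapter's code.

```python
import numpy as np

def warp_nn(img, flow):
    """Backward-warp img with a (H, W, 2) flow field (nearest neighbour)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def warping_error(i_t, i_t1, flow_t_to_t1, mask):
    """E_warp (Eqn 3.9): squared error between frame t and the next frame
    warped back to t, averaged over non-occluded pixels (binary mask)."""
    i_hat = warp_nn(i_t1, flow_t_to_t1)
    sq = ((i_t - i_hat) ** 2).sum(-1)       # per-pixel squared L2
    return (mask * sq).sum() / mask.sum()   # masked mean

# An identical, static frame pair has zero warping error:
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
err = warping_error(frame, frame, np.zeros((8, 8, 2)), np.ones((8, 8)))
```

Normalizing by the mask sum, rather than by the pixel count, keeps the metric comparable across frame pairs with different amounts of occlusion.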
When dealing with in-domain editing, the primary sources of inconsistency are the details of the newly added attributes (e.g., glasses or beard) and some background flickering. We show sample visual results in Figure 3.7, where the introduced changes are consistent across frames.

3.4.4 Two-phase optimization strategy ablation study

We demonstrate the effect of our two-phase optimization strategy of updating the latent codes first and then finetuning the generator G. We compare our two-phase approach to: (1) no optimization (i.e., direct editing), (2) updating the latent codes only (phase 1), and (3) finetuning the generator G only. We show the results in Table 3.3. When we only update the generator G, we achieve a low warping error E_warp. However, this is not desirable, since finetuning G pushes the video to be globally consistent without modifying the local details; the output video therefore differs from the directly edited video (i.e., high LPIPS distance). Thus, we follow our two-phase optimization of a) updating the latent codes via an MLP f_θ (to improve local consistency) and b) finetuning the generator G (to modify the global effect).

Table 3.3: Two-phase optimization strategy ablation study.

Optimization stage                 In-domain editing        Out-of-domain editing
Update W_t^edit     Update G       E_warp ↓    LPIPS ↓      E_warp ↓    LPIPS ↓
-                   -              0.0076      0.0000       0.0098      0.0000
✓                   -              0.0064      0.2108       0.0094      0.1428
-                   ✓              0.0027      0.2463       0.0057      0.1375
✓                   ✓              0.0031      0.1253       0.0076      0.1275

Figure 3.10: Visual comparison with the Latent Transformer (LT) [235]. LT does not preserve the person's identity well. Our method preserves the identity and achieves a temporally consistent video.

3.4.5 Comparison with Latent Transformer

We compare our method with the Latent Transformer (LT) [235]. We show a quantitative comparison on the two overlapping editing types, "beard" and "eyeglasses", in Table 3.2, and a qualitative comparison in Fig. 3.10.
LT edits videos by updating the projected latent code independently for each frame, without temporal constraints. Our method, in contrast, uses a flow-based loss to improve temporal consistency, and our second phase uses a perceptual difference mask as a regularizer to preserve the facial details other than the edited parts. As a result, our method improves temporal consistency and preserves personal identity.

3.4.6 Comparison with Deep Video Prior (DVP)

We compare our method with DVP [108], a state-of-the-art blind video consistency approach, using its default settings. We show the in-domain editing comparison in Table 3.2 and the out-of-domain editing comparison in Table 3.1. In terms of warping error E_warp, our method achieves improved results for in-domain editing and comparable results for out-of-domain editing. In terms of LPIPS distance, however, our results are more similar to the directly edited video for both in-domain and out-of-domain editing. We show a visual comparison in Figure 3.8. DVP can achieve temporally consistent results (i.e., low E_warp), but at the cost of losing local details, as in the "eyeglasses" example, or excessively smoothing the results into a blurry video, as in the "Disney Princess" example.

Figure 3.11: Limitations. (a) "Zombie": earrings are added by the GAN editing prior to our flow-based temporal consistency approach; since our approach builds on existing GAN inversion and editing techniques, it is affected by their quality. (b) "Rare pose": our method fails when there is a rare pose and large motion.

3.5 Limitations

We show several limitations of our approach in Figure 3.11. First, our approach relies on plausible results from existing GAN inversion and editing techniques. We show an example of added earrings in Figure 3.11(a) and an example of a rare pose in Figure 3.11(b).
Second, the GANs used in our experiments require the objects to be spatially aligned and thus may not yet be suitable for inverting and editing unconstrained videos. Third, our method relies on a high-quality GAN model, which may be computationally expensive to train and often requires diverse training images. Our full method (phases 1, 2, and 3) takes 40 minutes for a 150-frame video on a single NVIDIA P6000 GPU.

3.6 Conclusions

We have presented a novel method for video semantic editing by leveraging image-based GAN inversion and editing. Our approach starts from direct per-frame editing, and we refine the editing results with a flow-based method that minimizes a bi-directional photometric loss. Our core approach is two-phase: adjusting the latent codes via an MLP and tuning G to achieve temporal consistency. We show that our method achieves temporal consistency while preserving similarity to the direct editing results. Finally, our model-agnostic method is applicable to different GAN inversion and manipulation techniques.

Potential negative impacts. Malicious use of our technique may lead to video manipulation of public figures for spreading misinformation.

Chapter 4: Increasing Generalizability of the Generative Models

Generative models have difficulties when dealing with out-of-domain (OOD) data due to limited training data. In this chapter, we propose a novel approach to adapt a pre-trained generative model to OOD data. With this approach, we can leverage powerful generative priors even for OOD data.

4.1 Introduction

GAN inversion [3, 157, 198, 222, 254] is a set of techniques that project an input image onto the latent space of a pre-trained GAN to obtain a latent code such that the image generator can reconstruct the input. This is particularly useful because one can then perform various creative semantic editing tasks [47, 63, 147, 173] on images.
Similar techniques have also been applied in the video domain, where recent methods have achieved temporally consistent editing [200, 228]. However, the majority of these methods are effective primarily with 2D GANs, and they fall short in offering explicit 3D controllability, such as view synthesis. With the rapid recent advances in 3D reconstruction, especially neural radiance fields (NeRFs) [13, 29, 133, 137], high-quality 3D-aware GANs [23, 59, 145, 185] have emerged as a powerful tool for learning 3D generation from 2D images. Equipped with a 3D representation such as a NeRF [23, 59] or an SDF [145], 3D-aware GANs offer explicit control over camera views and ensure 3D geometric consistency in generation. Additionally, they retain the generative capacity and editability of 2D GANs [85, 87, 89, 91]. This enables applications such as novel view synthesis, semantic image editing [104, 183, 187, 225, 239, 240], and video editing [45, 199].

Figure 4.1: Semantic editing for out-of-distribution data. We present a method for reconstructing and editing an out-of-distribution (OOD) image or video using a pre-trained 3D-aware generative model (EG3D [23]). Our method explicitly models and reconstructs the occluders in 3D, allowing faithful reconstruction of the input while preserving the semantic editing capability. Here we showcase the reconstruction and editing results "Less smile", "Younger", "Blond" [173], "Elsa", and "Surprised" [147]. Our method can also remove the OOD part. Data are from the Internet.

Core challenges. While state-of-the-art 3D GAN inversion methods achieve remarkable advances in both image and video editing for human faces, they face challenges when dealing with images containing out-of-distribution (OOD) objects (e.g., heavy make-up or occlusions).
This limitation arises primarily because these models are pre-trained only on natural faces without complex textures or substantial occlusions. As a result, editability deteriorates when a pre-trained GAN is forced to model OOD objects in the GAN inversion process. This is commonly known as the reconstruction-editability trade-off [198]. Existing GAN inversion methods assume that a single latent code corresponding to the input image can be found in the latent space [186, 222] through optimization once the model is trained. They therefore aim to reconstruct the in-distribution (InD) content (e.g., the natural face) and the OOD objects together. However, because OOD components often cannot be well modeled by a pre-trained GAN, and consequently cannot be well represented by a single latent code, existing methods either fail to reconstruct them faithfully [187] or can reconstruct them (e.g., by fine-tuning the generator) at the cost of altering the latent space properties and deteriorating editability [158] (Figure 4.2).

Figure 4.2: Limitations of previous methods. Existing GAN inversion techniques cannot deal with frames containing OOD elements, resulting in a poor reconstruction-editing balance. GOAE [240] can produce faithful editing but fails to preserve the identity of the input face. PTI [158] provides higher reconstruction fidelity, but editability suffers.

Our work. We propose a new approach to address this issue by drawing inspiration from recent composite volume rendering works that compose multiple radiance fields during rendering [50, 128, 216, 231]. Our core idea is to decompose the 3D representation of an image with OOD components into an in-distribution (InD) part and an out-of-distribution part, and to compose them together to reconstruct the image via composite volumetric rendering.
We use EG3D [23] as our 3D-aware GAN backbone and leverage its tri-plane representation to model this composed rendering pipeline. For the InD component (i.e., the natural face), we project pixel values onto EG3D's W+ space for an InD reconstruction. We further introduce an additional tri-plane to represent the OOD content. We then combine these two radiance fields via composite volumetric rendering to reconstruct the input frames. During the editing stage, we perform latent-code-based editing solely on the InD part and leave the OOD component unaltered. This framework allows the application of any StyleGAN-based editing approach [147, 173] to the InD component, such as changing facial expressions, which is often desirable for user experiences. The advantages of our work are three-fold: a) we achieve a higher-fidelity reconstructio