ABSTRACT

Title of dissertation: ADVANCING VISUAL ASSETS: TAMING DEEP PRIORS FOR EDITING AND ENHANCEMENT

Yiran Xu, Doctor of Philosophy, 2025

Dissertation directed by: Professor Jia-Bin Huang, Department of Computer Science, University of Maryland

Visual data, including images and videos, are valuable assets with broad commercial, artistic, and personal significance. As digital content consumption continues to grow, there is an increasing demand for methods that can enhance visual quality, improve adaptability across different formats, and enable efficient content editing. However, achieving these enhancements manually is both labor-intensive and technically challenging. Recent advances in deep learning have introduced powerful generative models (e.g., GANs, diffusion models) and human-aligned visual representations (e.g., VGG, DINO-v2) that offer promising capabilities for improving visual assets. Yet, directly applying these models to real-world editing and enhancement tasks often introduces artifacts and inconsistencies, such as temporal flickering in videos, limited generalization to out-of-distribution (OOD) data, and misalignment between high-level priors and low-level structures. This thesis explores strategies to "tame" these deep priors, converting their potential into more controllable and reliable tools for visual asset enhancement.

This dissertation presents four key contributions in editing and enhancement, each demonstrating how to adapt deep priors to improve the usability, quality, and consistency of visual content. First, we develop VideoGigaGAN, a large-scale video super-resolution model that extends an image super-resolution model to video, enhancing both spatial resolution and temporal coherence. Second, we introduce a video editing framework that enforces temporal consistency by optimizing latent codes and the generator itself, reducing flickering artifacts in edited videos.
Third, we propose a method to improve generative priors for OOD data using a volumetric decomposition approach, enabling high-fidelity image reconstructions while maintaining editability. Finally, we explore image retargeting by leveraging perceptual priors to intelligently adapt content to different aspect ratios without compromising visual coherence.

By addressing these challenges, this thesis contributes to the broader goal of harnessing deep priors for real-world visual asset enhancement. The proposed approaches demonstrate that by adapting and refining generative priors, we can develop more reliable, high-quality, and scalable solutions for visual editing tasks. These contributions have potential applications in media production, content creation, digital art, and real-time video processing, paving the way for future research in deep learning-driven visual content adaptation.

ADVANCING VISUAL ASSETS: TAMING DEEP PRIORS FOR EDITING AND ENHANCEMENT

by Yiran Xu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Jia-Bin Huang, Chair/Advisor
Professor Christopher Metzler
Professor Abhinav Shrivastava
Professor Ruohan Gao
Professor Maria K. Cameron

© Copyright by Yiran Xu 2025

Acknowledgments

I would like to express my deepest gratitude to my advisor, Prof. Jia-Bin Huang, for his unwavering support, guidance, and encouragement throughout my Ph.D. journey. His insightful mentorship has profoundly shaped my research direction and has been instrumental in my growth as a researcher. I am also grateful to my advisory committee members, Prof. Christopher Metzler, Prof. Abhinav Shrivastava, Prof. Ruohan Gao, and Prof. Maria K. Cameron, for their valuable feedback and guidance, which have significantly enriched my work.
I am fortunate to have had the opportunity to work with incredible mentors during my industry internships. At Google DeepMind, I would like to thank Feng Yang, Siqi Xie, Jijun Jiang, Zhuofang Li, Yinxiao Li, Luciano Sbaiz, Junjie Ke, Miaosen Wang, Hang Qi, Han Zhang, Jose Lezama, Ming-Hsuan Yang, Irfan Essa, and Jesse Berent for their mentorship and collaboration, which greatly influenced my understanding of real-world research applications. At Adobe Research, I am deeply appreciative of Difan Liu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Seoung Wug Oh, Zhixin Shu, and Cameron Smith for their support and inspiring discussions. Additionally, I would like to thank my collaborators on campus: Mingyang Xie, Haoming Cai, Sachin Shah, Brandon Y. Feng, Yi-Ling Qiao, Alexander Gao, Prof. Ming C. Lin, and Badour AlBahar, whose insights and teamwork have been invaluable to my research.

I am also grateful to my amazing labmates, who have made this journey both intellectually stimulating and enjoyable. Yue Feng, Yao-Chih Lee, Yi-Ting Chen, Ting-Hsuan Liao, Hadi Alzayer, Songwei Ge, Kevin Zhang, Quynh Phung, Haowen Liu, Badour AlBahar, Yuliang Zou, Chen Gao, and Jinwoo Choi: thank you for your camaraderie, support, and the many insightful discussions that have pushed my research forward.

I am incredibly thankful for the friendships that have kept me grounded and supported throughout this journey. Yixuan Ren, Hanyu Wang, Bo He, Haozhe An, Yu Hou, and Shuaiyi Huang: your encouragement, laughter, and companionship have been a constant source of motivation.

Last but not least, I am deeply grateful to my family. To my wife, Yimin Peng, for her unwavering love, patience, and belief in me; I could not have done this without you. To my parents and grandparents, whose sacrifices and support have paved the way for my academic pursuits.
And to my baby cats, Bai and Michelle, who do not speak but have been my constant companions, bringing joy and comfort through the long nights of research. This dissertation is dedicated to all of you.

Table of Contents

Acknowledgements
1 Introduction
2 Generative Models for Large-scale Video Super-Resolution
  2.1 Introduction
  2.2 Related Work
  2.3 Method
    2.3.1 Preliminaries: Image GigaGAN upsampler
    2.3.2 Inflation with temporal modules
    2.3.3 Flow-guided feature propagation
    2.3.4 Anti-aliasing blocks
    2.3.5 High-frequency shuttle
    2.3.6 Loss functions
  2.4 Experimental Results
    2.4.1 Setup
    2.4.2 Ablation study
    2.4.3 Comparison with previous models
    2.4.4 Analysis of the trade-off between temporal consistency and frame fidelity
    2.4.5 Additional results
    2.4.6 8× video upsampling
  2.5 Limitations
  2.6 Conclusions
3 Temporally Consistent Video Editing
  3.1 Introduction
  3.2 Related Work
  3.3 Method
    3.3.1 Overview
    3.3.2 Flow-based temporal consistency
    3.3.3 Two-phase optimization strategy
    3.3.4 Unalignment
  3.4 Experimental Results
    3.4.1 Experimental setup
    3.4.2 Out-of-domain results
    3.4.3 In-domain editing results
    3.4.4 Two-phase optimization strategy ablation study
    3.4.5 Comparison with Latent Transformer
    3.4.6 Comparison with Deep Video Prior (DVP)
  3.5 Limitations
  3.6 Conclusions
4 Increasing Generalizability of the Generative Models
  4.1 Introduction
  4.2 Related Work
  4.3 Method
    4.3.1 Preliminaries: EG3D, 3D-aware GAN
    4.3.2 In-distribution GAN inversion
    4.3.3 Modeling out-of-distribution contents
    4.3.4 Composite volume rendering
    4.3.5 Low-resolution reconstruction
    4.3.6 Super-Resolution
    4.3.7 Editing
  4.4 Experimental Results
    4.4.1 Experimental Setup
    4.4.2 Quantitative results
    4.4.3 Qualitative results
    4.4.4 Other Applications
    4.4.5 Ablation Study
    4.4.6 Speed
  4.5 Limitations
  4.6 Conclusions
5 Human-aligned Visual Features for Image Retargeting
  5.1 Introduction
  5.2 Related work
  5.3 Method
    5.3.1 Overview of HALO
    5.3.2 Multi-Flow Network
    5.3.3 Perceptual Structure Similarity Loss
    5.3.4 Training loss
  5.4 Experimental results
    5.4.1 Setup
    5.4.2 Implementation details
    5.4.3 Comparison with previous methods
    5.4.4 Ablation study
    5.4.5 Analysis of the off-the-shelf models
    5.4.6 Results on In-the-wild data
  5.5 Limitations
  5.6 Conclusion
6 Future work
  6.1 Text-guided Motion Graphics
  6.2 Global to Local: Long-term Consistent Video Generation
7 Conclusion
Bibliography

Chapter 1: Introduction

Visual data, including images and videos, are essential digital assets that shape entertainment, communication, and artistic expression in today's world. With the rapid expansion of the internet, visual content has become ubiquitous across diverse platforms, spanning photography, filmmaking, social media, and digital archiving. These assets hold commercial, artistic, and personal value, influencing industries such as advertising, content creation, and digital marketing. However, raw visual data is often imperfect and requires extensive processing to enhance its quality and usability. The value of a visual asset can be increased by correcting errors, enhancing visual quality (e.g., upscaling from low resolution to 4K), or adapting content for different aspect ratios and display devices. Despite the growing demand for high-quality visuals, manual enhancement remains labor-intensive, requiring expert skills and significant effort. Thus, there is a strong need for automated and intelligent solutions that can efficiently process and enhance visual assets while preserving perceptual quality.

The emergence of deep learning has introduced powerful tools for visual editing and enhancement, such as Generative Adversarial Networks (GANs) [57], diffusion models [69], and human-aligned visual representations like VGGs [182] and DINOs [21, 144]. These models have shown remarkable potential in image synthesis, restoration, and manipulation, offering new ways to enhance visual assets.
However, directly applying deep priors to real-world editing tasks often introduces artifacts and inconsistencies. One major challenge is temporal inconsistency in videos, where image-based generative models struggle to maintain coherence across frames, leading to flickering and instability. Additionally, generative models are typically trained on specific distributions and often fail to generalize to out-of-distribution (OOD) data, limiting their applicability to diverse real-world content. Another challenge is the mismatch between deep priors and specific tasks: while these models capture high-level semantics well, they often struggle to preserve low- and mid-level structures, leading to unnatural artifacts in fine-grained editing tasks. Thus, the key research question becomes: How can we effectively "tame" deep priors to improve their adaptability, reliability, and quality for real-world visual asset editing?

Figure 1.1: Overview of the thesis. We explore how to "tame" powerful deep priors for downstream tasks to enhance the visual data. We present VideoGigaGAN to introduce generative priors in large-scale training for Video Super-Resolution (VSR). We also introduce a flow-based method to leverage an image generator to achieve temporally consistent video editing. We further improve the generalization of generative models to out-of-distribution data. Finally, besides generative priors, we use a discriminative prior for image retargeting.
This thesis presents four projects exemplifying advancements in visual asset editing and enhancement, each endeavoring to "tame" these potent yet unpredictable deep priors:

• Improving Temporal Consistency in Video Editing Using Pre-trained Image Generators: This work addresses the challenge of achieving temporal coherence in video editing by optimizing both the latent code and the pre-trained generator to minimize photometric inconsistencies across frames.

• Enhancing Generative Priors for Out-of-Distribution Data: This study introduces a method to decompose in-distribution and out-of-distribution components using volumetric representations, thereby improving the editability of generative models when handling unseen data.

• Developing a Large-Scale Video Super-Resolution Model from a Pre-trained Image Super-Resolution Model: This project extends the capabilities of an image super-resolution model to videos, focusing on generating high-frequency details while maintaining temporal consistency.

• Image Retargeting Leveraging Visual Perceptual Priors: This work explores the application of visual perceptual priors to adapt images to different aspect ratios, enhancing their compatibility across various display formats.

In Chapter 2, we introduce VideoGigaGAN, a large-scale video super-resolution model that builds upon advancements in image super-resolution. While existing image super-resolution models have achieved remarkable success in enhancing spatial resolution, extending these capabilities to videos presents unique challenges, particularly in maintaining temporal consistency across frames. VideoGigaGAN addresses these challenges by incorporating mechanisms to produce high-frequency details and ensure temporal coherence. Our model identifies and mitigates key issues that typically lead to temporal inconsistencies, such as flickering and motion artifacts.
Through extensive experiments, VideoGigaGAN has demonstrated superior performance in generating temporally consistent videos with fine-grained appearance details, outperforming previous state-of-the-art methods in both objective metrics and subjective visual quality.

In Chapter 3, we delve into our first project, which addresses the challenge of enhancing temporal consistency in video editing by leveraging pre-trained image generators. Traditional approaches often apply image-based generative models to video frames individually, leading to temporal inconsistencies such as flickering artifacts. To overcome this, we propose a method that minimizes temporal photometric inconsistencies by jointly optimizing both the latent code and the parameters of the pre-trained generator. This joint optimization ensures that consecutive frames maintain visual coherence, resulting in smoother transitions and a more stable viewing experience. Our approach has been rigorously evaluated across various domains and with different GAN inversion techniques, consistently demonstrating its effectiveness in reducing temporal artifacts and preserving the desired edits throughout the video sequence.

However, due to limited exposure to the training data, many generative models struggle when adapted to out-of-distribution (OOD) data. Chapter 4 presents our approach to enhancing generative priors when dealing with OOD data. Generative models, particularly 3D-aware GANs, typically excel when operating within the distribution of their training data but often struggle with OOD inputs, leading to suboptimal reconstructions and limited editability. To address this limitation, we introduce a volumetric decomposition method that explicitly models OOD objects within the 3D-aware GAN framework.
This technique enables the faithful reconstruction of input images, even when they contain elements not present in the training data, while preserving the model's ability to perform semantic edits. By effectively separating in-distribution and out-of-distribution components, our method strikes a balance between reconstruction fidelity and editability, significantly expanding the applicability of generative models to more diverse and complex real-world scenarios.

Besides the generative priors discussed in the previous chapters, discriminative priors learned by deep neural networks also show a powerful ability to recognize visual similarities. Chapter 5 explores the domain of image retargeting by leveraging visual perceptual priors to adapt images to various aspect ratios across different devices. In today's multi-device environment, images often need to be displayed on screens with differing aspect ratios, necessitating intelligent retargeting techniques that preserve visual quality and content integrity. Our approach utilizes human-aligned visual representations to guide the retargeting process, ensuring that essential visual features and overall aesthetics are maintained. This method addresses the challenges of maintaining visual coherence and quality during retargeting, enabling seamless adaptation of images to diverse display formats without compromising the viewer's experience.

The thesis concludes with Chapter 6, discussing several future directions, and Chapter 7, presenting the overall conclusions.

Chapter 2: Generative Models for Large-scale Video Super-Resolution

In this chapter, we aim to develop an approach to "tame" generative priors for visual asset enhancement. We delve into Video Super-Resolution (VSR) as an example. We develop VideoGigaGAN, a large-scale VSR model initialized from its image counterpart.
It is challenging to balance the high per-frame quality brought by the pretrained image super-resolution model against the temporal consistency required in VSR. We observe that the main artifact comes from the aliased input, and we propose well-designed anti-aliasing features that not only mitigate aliasing but also preserve the high-frequency details in the output.

2.1 Introduction

Video super-resolution (VSR) aims to reconstruct high-resolution videos from low-resolution inputs, a task challenged by the need for temporal consistency and high-frequency detail generation. VSR has wide applications in generated videos [11, 64], face videos [44], satellite videos [223], and anime [217]. Existing methods [25–27, 76] focus on consistency but often produce blurry outputs that lack high-frequency appearance details or realistic textures (see Fig. 2.2). Effective VSR requires generating plausible new content not present in the low-resolution inputs, a capability that these models struggle with. Recent diffusion-based methods [64, 161, 236, 251] enjoy higher per-frame quality but suffer from temporal flickering and slow inference.

Figure 2.1: We present VideoGigaGAN, a generative video super-resolution model that can upsample videos with high-frequency details while maintaining temporal consistency. Top: comparison of our approach with TTVSR [118] and BasicVSR++ [26]. Our method produces temporally consistent videos with more fine-grained details than previous methods. Bottom: our model can produce high-quality videos with 8× super-resolution.

Generative models (e.g., diffusion models [159, 205], VAEs [30], and GANs [56, 208, 209]) have advanced Image Super-Resolution (ISR) by modeling high-resolution image distributions, producing highly detailed textures.
GigaGAN [82] further increases the generative capability of image super-resolution models by training a large-scale GAN on billions of images. However, applying a generative model such as GigaGAN independently to video frames results in severe temporal artifacts (see Fig. 2.2). This raises the question: can GigaGAN's capabilities be harnessed for temporally consistent VSR?

We first experiment with an adapted GigaGAN baseline using temporal convolution and attention layers, which helps but fails to fully address the flickering of high-frequency details brought about by the strong hallucinations. Previous VSR approaches use regression-based networks to trade high-frequency details for better temporal consistency. As blurrier upsampled videos inherently exhibit better temporal consistency, the capability of GANs to hallucinate high-frequency details contradicts the goal of VSR in producing temporally consistent frames. We refer to this as the consistency-quality dilemma in VSR.

Figure 2.2: Limitations of previous methods. Previous VSR approaches such as VRT [112] suffer from a lack of details, as seen in the building example. Generative models, image GigaGAN [82] and StableVSR [161], produce sharper results with richer details, but they generate videos with temporal flickering or artifacts like aliasing (see red arrows). Our VideoGigaGAN can produce video results with both high-frequency details and temporal consistency, while artifacts like aliasing are significantly mitigated. Please refer to our supplementary material for a visual comparison.

In this work, we identify several key issues in applying GigaGAN to VSR and propose techniques to achieve detailed and temporally consistent video super-resolution.
Naively inflating GigaGAN with temporal modules [68] is not sufficient to produce temporally consistent results with high-quality frames. To address this issue, we employ a recurrent flow-guided feature propagation module to encourage information aggregation across different frames. We also apply anti-aliasing blocks in GigaGAN to address the temporal flickering caused by the aliased downsampling operations. Furthermore, we introduce an effective method for injecting high-frequency features into the GigaGAN decoder, called the high-frequency (HF) shuttle. The proposed high-frequency shuttle can effectively add fine-grained details to the upsampled videos while maintaining temporal consistency.

Contributions. We present VideoGigaGAN, the first large-scale GAN-based model for video super-resolution. We recognize the consistency-quality trade-off that has not been well discussed in previous VSR literature. We introduce the feature propagation module, anti-aliasing blocks, and HF shuttle, which significantly improve temporal consistency when applying GigaGAN to VSR. We show that VideoGigaGAN can upsample videos with much more fine-grained details than state-of-the-art methods evaluated on multiple datasets. We also show that our model can produce detailed and temporally consistent videos even for challenging 8× upsampling tasks.

2.2 Related Work

Video Super-Resolution. Significant work has been invested in video super-resolution, using sliding-window approaches [20, 111, 193, 197, 204, 229] and recurrent networks [72, 73, 76, 110, 113, 114, 168, 177]. BasicVSR [25] summarizes the common VSR approaches into a unified pipeline. It proposes an effective baseline using optical flow for temporal alignment and bidirectional recurrent networks for feature propagation. BasicVSR++ [26] redesigns BasicVSR by introducing second-order grid propagation and flow-guided deformable alignment.
To improve generalizability on real-world low-resolution videos, methods like RealBasicVSR [27] and FastRealVSR [226] use diverse degradations as data augmentation during training. While these approaches can produce temporally consistent upsampled videos, they are often trained with simple regression objectives and lack generative capability, which leads to unrealistic textures and overly blurry results. Unlike previous VSR approaches, we propose a GAN-based VSR model to generate high-frequency details while maintaining temporal consistency in the upsampled videos.

GAN-based Image Super-Resolution. SRGAN [105] is a seminal image super-resolution work that uses a GAN framework to model the manifold of high-resolution images. ESRGAN [209] further enhances the visual quality of upsampled images by improving the architecture and loss of SRGAN. Real-ESRGAN [208] extends ESRGAN to restore general real-world low-resolution images. While these methods can produce impressive results, they are still limited in model capacity and unsuitable for large upsampling factors. To scale up the model capacity of GANs, GigaGAN [82] introduces a filter bank and attention layers to StyleGAN2 [90] and trains the model on billions of images. Even for 8× image super-resolution tasks, GigaGAN can effectively generate new content not present in the low-resolution image and produce realistic textures and fine-grained details.

Generative Video Models. Many video generation works are based on VAEs [10, 106, 230], GANs [53, 184, 244], and autoregressive models [212]. LongVideoGAN [19] introduces a sliding-window approach for video super-resolution, but it is restricted to datasets with limited diversity. Recently, diffusion models have shown diverse and high-quality results in video generation tasks [15, 16, 54, 55, 70]. Imagen Video [68] proposes pixel diffusion models for video super-resolution.
Concurrent work Upscale-A-Video [251] adds temporal modules to a latent diffusion image upsampler [160] and finetunes it as a video super-resolution model. Unlike diffusion-based video super-resolution models that require iterative denoising processes, our VideoGigaGAN can generate outputs in a single feedforward pass with faster inference speed.

2.3 Method

Our VSR model G upsamples a low-resolution (LR) video v ∈ R^{T×h×w×3} to a high-resolution (HR) video V = G(v), where V ∈ R^{T×H×W×3}, with an upsampling scale factor α such that H = αh, W = αw. We aim to generate HR videos with both high-frequency appearance details and temporal consistency.

We present the overview of our VSR model, VideoGigaGAN, in Figure 2.3. We start with the large-scale GAN-based image upsampler, GigaGAN [82] (Section 2.3.1). We first inflate the 2D image GigaGAN upsampler to a 3D video GigaGAN upsampler by adding temporal convolutional and attention layers (Section 2.3.2).

Figure 2.3: Overview of our method for 4× upsampling. Our Video Super-Resolution (VSR) model is built upon the asymmetric U-Net architecture of the image GigaGAN upsampler [82]. To enforce temporal consistency, we first inflate the image upsampler into a video upsampler by adding temporal attention layers into the decoder blocks. We also enhance consistency by incorporating the features from the flow-guided propagation module. To suppress aliasing artifacts, we use anti-aliasing blocks in the downsampling layers of the encoder. Lastly, we directly shuttle the high-frequency features via skip connections to the decoder layers to compensate for the loss of details in the BlurPool process.

However, as shown in our experiments, the inflated GigaGAN still produces results with severe temporal flickering and artifacts, likely due to the limited spatial window size of the temporal attention. To this end, we introduce flow-guided feature propagation (Section 2.3.3) into the inflated GigaGAN to better align the features of different frames based on flow information. We also pay special attention to anti-aliasing (Section 2.3.4) to further mitigate the temporal flickering caused by the downsampling blocks in the GigaGAN encoder, while maintaining the high-frequency details by directly shuttling the HF features to the decoder blocks (Section 2.3.5). Our experimental results validate the importance of these model design choices.

2.3.1 Preliminaries: Image GigaGAN upsampler

Our VideoGigaGAN builds upon the GigaGAN image upsampler [82]. GigaGAN scales up the StyleGAN2 [90] architecture using several key components, including adaptive kernel selection for convolutions and self-attention layers. The GigaGAN image upsampler has an asymmetric U-Net architecture consisting of 3 downsampling encoder blocks {E_i} and 3 + k upsampling decoder blocks {D_i}:

X = G(x, z) = D(E(x, z), z) = \underbrace{D_{k+2} \circ \cdots \circ D_3}_{\uparrow \times 2^k} \circ \underbrace{D_2 \circ D_1 \circ D_0}_{\uparrow \times 8} \circ \underbrace{E_2 \circ E_1 \circ E_0(x, z)}_{\downarrow \times 8}.   (2.1)

This GigaGAN upsampler is able to upsample an input image by 2^k. Both encoder E and decoder D blocks utilize random spatial noise z as a source of stochasticity. The decoder D contains spatial self-attention layers. The encoder and decoder blocks at the same resolution are connected by skip connections.

2.3.2 Inflation with temporal modules

To adapt a pretrained 2D image model for video tasks, a common approach is to inflate 2D spatial modules into 3D temporal ones [16, 54, 68, 215, 233, 251].
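To make this inflation recipe concrete, the following is a minimal PyTorch sketch of a zero-initialized temporal module: a 1D temporal convolution followed by per-pixel temporal self-attention, each added with a residual connection. The module and variable names are illustrative rather than the thesis implementation, and an extra output projection is introduced here purely so the attention branch can be zero-initialized (making the block an identity at the start of training, as described below).

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    """Illustrative temporal inflation block: 1D temporal conv (kernel 3) +
    temporal self-attention with no spatial receptive field. Both branches are
    zero-initialized so the inflated model initially behaves like the image model."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)  # hypothetical zero-init gate
        for p in (self.conv.weight, self.conv.bias, self.proj.weight, self.proj.bias):
            nn.init.zeros_(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- features of a short frame chunk
        b, t, c, h, w = x.shape
        # Temporal conv: fold spatial positions into the batch, convolve over T only.
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.conv(y).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        x = x + y  # residual connection
        # Temporal self-attention per spatial location (no spatial receptive field).
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y, _ = self.attn(y, y, y, need_weights=False)
        y = self.proj(y).reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + y  # residual connection
```

With the zero initialization, the block is an exact identity at initialization, so the inflated generator starts out performing exactly like the image upsampler.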
To reduce the memory cost, instead of directly using 3D convolutional layers in each block, our temporal module uses a 1D temporal convolution layer of kernel size 3 that operates only on the temporal dimension, followed by a temporal self-attention layer with no spatial receptive field. Both the 1D temporal convolution and the temporal self-attention are inserted after the spatial self-attention with a residual connection [68]. In summary, at each block D_i, we first process the features of individual video frames using the spatial self-attention layer; the features are then jointly processed by our temporal module. Through our experiments, we find that adding temporal modules to the decoder D of the generator G is sufficient to improve video consistency. We also inflate the discriminator with comparable temporal modules. We follow [243] in initializing both the temporal convolutions and the temporal self-attention layers with zero weights, so that the generator and discriminator still behave as an image upsampler at the beginning of training, leading to a smoother transition to a video upsampler.

2.3.3 Flow-guided feature propagation

The temporal modules alone are insufficient to ensure temporal consistency, mainly due to the high memory cost of the 3D layers. For input videos with long sequences of frames, one could partition the video into small, non-overlapping chunks and apply temporal attention within each chunk. However, this leads to temporal flickering between different chunks. Even within each chunk, the spatial window size of the temporal attention is limited, meaning a large motion (i.e., one exceeding the receptive field) cannot be modeled by the attention module (see Figure 2.5). To address these issues, we augment the input image with features aligned by optical flow. Specifically, we introduce a recurrent flow-guided feature propagation module (see Figure 2.3) prior to the inflated GigaGAN, inspired by BasicVSR++ [26].
Instead of directly using the LR video as input to the inflated GigaGAN, we use the temporal-aware features produced by the flow-guided propagation module. The module comprises a bi-directional recurrent neural network (RNN) [25, 26] and an image backward-warping layer. We first employ the optical flow estimator to predict bi-directional optical flow maps from the input LR video. These maps and the original frame pixels are then fed into the RNN to learn temporal-aware features. Finally, these features are explicitly warped by the backward-warping layer, guided by the pre-computed optical flows, before being fed into the subsequent inflated GigaGAN blocks. The flow-guided propagation module can effectively handle large motion and produce better temporal consistency in output videos, as demonstrated in Fig. 2.5.

During training, we jointly train the flow-guided feature propagation module and the inflated GigaGAN model. At inference time, given an input LR video with an arbitrary number of frames, we first generate frame features using the flow-guided propagation module. We then partition the frame features into non-overlapping chunks and independently apply the inflated GigaGAN on each chunk. Since the features inside each chunk are aware of the other chunks, thanks to the flow-guided propagation module, the temporal consistency between consecutive chunks is preserved well.

2.3.4 Anti-aliasing blocks

With both the temporal and feature propagation modules enabled, our VSR model can process longer videos and produce results with better temporal consistency. However, the high-resolution frames still flicker in areas with high-frequency details (for example, the windows of the building in Figure 2.2). We identify the downsampling operations in the GigaGAN encoder as a contributor to the flickering in those regions.
The high-frequency components in the input can easily alias into lower frequencies because the downsampling rate does not meet the classical sampling criterion [142]. In video super-resolution, this aliasing of pixels manifests as temporal flickering. Previous VSR approaches often use regression-based objectives, which tend to remove high-frequency details; consequently, these methods produce output videos free of aliasing. In our GAN-based VSR framework, however, the GAN training objectives favor the hallucination of high-frequency details, making aliasing a more severe problem.

In the GigaGAN upsampler, the downsampling operation in the encoder is achieved by strided convolutions with a stride of 2. To address the aliasing issue in our output video, inspired by [245], we replace all the strided convolution layers in the upsampler encoder with BlurPool layers. More specifically, during downsampling, instead of simply using a strided convolution, we use a convolution with a stride of 1, followed by a low-pass filter and a subsampling operation. We show the anti-aliasing blocks in Figure 2.3. Our experiments show that the anti-aliasing downsampling blocks preserve temporal consistency for high-frequency details significantly better than naive strided convolutions. We also experimented with StyleGAN3 blocks for anti-aliased upsampling [87]. The temporal flickering is mitigated, but we observed a notable drop in frame quality.

2.3.5 High-frequency shuttle

With the newly introduced components, the temporal flicker in our results is significantly suppressed. However, as shown in Figure 2.5, adding the flow-guided propagation module (Section 2.3.3) leads to a blurrier output, and the anti-aliasing blocks (Section 2.3.4) make the results blurrier still. We still need the high-frequency information in the GigaGAN features to compensate for the loss of high-frequency details.
However, as discussed in Section 2.3.4, the traditional flow of high-frequency information in GigaGAN leads to aliased output.

We present a simple yet effective approach to resolve this conflict between high-frequency details and temporal consistency, called the high-frequency shuttle (HF shuttle). To guide where the high-frequency details should be inserted, the HF shuttle leverages the skip connections in the U-Net and uses a pyramid-like representation of the feature maps in the encoder. More specifically, at feature resolution level i, we decompose the feature map f_i into a low-frequency (LF) component and a high-frequency (HF) component. The LF feature map f_i^LF is obtained via the low-pass filter mentioned in Section 2.3.4, while the HF feature map is computed from the residual as f_i^HF = f_i − f_i^LF. The HF feature map f_i^HF, containing high-frequency details, is injected through the skip connection into the decoder (Figure 2.3). Our experiments show that the high-frequency shuttle can effectively add fine-grained details to the upsampled videos while mitigating issues such as aliasing and temporal flickering.

2.3.6 Loss functions

We use the standard non-saturating GAN loss [61], R1 regularization [131], LPIPS [246], and the Charbonnier loss [28] during training:

L(X_t, x_t) = μ_GAN L_GAN(G(x_t), D(G(x_t))) + μ_R1 L_R1(D(X_t)) + μ_LPIPS L_LPIPS(X_t, x_t) + μ_Char L_Char(X_t, x_t),    (2.2)

where the Charbonnier loss is a smoothed version of the pixelwise ℓ1 loss, and μ_GAN, μ_R1, μ_LPIPS, μ_Char are the weights of the different loss terms. Here, x_t is one of the LR input frames and X_t is the corresponding ground-truth HR frame. We average the loss over all the frames in a video clip during training.

2.4 Experimental Results

2.4.1 Setup

Datasets. We strictly follow two widely used training sets from previous VSR works [25, 26, 118]: REDS [138] and Vimeo-90K [229]. The REDS dataset contains 300 video sequences. Each sequence consists of 100 frames with a resolution of 1280 × 720.
We use REDS4 as our test set and REDSval4 as our validation set; the rest of the sequences are used for training. Vimeo-90K contains 64,612 sequences for training and 7,824 for testing (known as Vimeo-90K-T). Each sequence contains seven frames with a resolution of 448 × 256.

Figure 2.4: Qualitative comparison with other baselines on public datasets (REDS4 [138], Vimeo-90K-T [229]). We show PSNR/SSIM/LPIPS below each output frame. PSNR does not align well with human perception and favors blurry results; LPIPS is a preferred metric that aligns better with human perception. Compared to previous VSR approaches, our model produces more realistic textures and more fine-grained details.

Figure 2.5: Ablation study. Starting from the inflated GigaGAN (+Temporal attention in the figure), we progressively add components to demonstrate their effectiveness. With temporal attention, the local temporal consistency is improved compared to using the image GigaGAN to upsample each frame independently. The global temporal consistency improves with feature propagation, but aliasing still exists in the areas with high-frequency details, and the video results become more blurry. With the anti-aliasing blocks (BlurPool), the aliasing issue is much improved, but the video results become even blurrier. Finally, with the HF shuttle, we bring the per-frame quality and high-frequency details back while preserving good temporal consistency.

Following previous works [25, 26], we compute the metrics only on the
center frame of each sequence. In addition to the official test set Vimeo-90K-T, we also evaluate the model on Vid4 [117] and UDM10 [238], with different degradation algorithms (Bicubic Downsampling, BI, and Blur Downsampling, BD). We follow MMagic [135] to perform the degradation algorithms. All data are 4× downsampled to generate LR frames, following standard evaluation protocols [25, 26].

Evaluation metrics. We are interested in two aspects in our evaluation: per-frame quality and temporal consistency. For per-frame quality, we use PSNR, SSIM, and LPIPS [246]. For temporal consistency, the warping error E_warp [102] is commonly used:

E_warp(X̂_t, X̂_{t+1}) = (1 / Σ_i M_t^i) Σ_i M_t^i ‖X̂_t^i − W(X̂_{t+1}, F_{t→t+1})^i‖_2^2,    (2.3)

where (X̂_t, X̂_{t+1}) are generated frames at times t and t+1, i indexes the i-th pixel, W(·) is the backward-warping function, F_{t→t+1} is the forward flow estimated from the generated frames (X̂_t, X̂_{t+1}) using RAFT [194], and M_t^i ∈ {0, 1} is a non-occlusion mask indicating non-occluded pixels [165]. However, as reported in Fig. 2.2, previous baselines and even simple bicubic upsampling achieve lower E_warp than the ground-truth high-resolution video, since E_warp favors over-smoothed results. Consider an extreme algorithm in which all the generated frames are entirely black. E_warp computes the warping errors by warping the generated frames, so the warping error for this algorithm is 0, as the generated frames are maximally over-smoothed (in this extreme case, all black). Therefore, instead of warping the generated frames, we propose to warp the ground-truth frames using the flow computed on the generated frames. We refer to this new warping error as the referenced warping error (RWE), E^ref_warp. The referenced warping error between two frames is

E^ref_warp(X_t, X_{t+1}) = (1 / Σ_i M_t^i) Σ_i M_t^i ‖X_t^i − W(X_{t+1}, F_{t→t+1})^i‖_2^2,    (2.4)

where (X_t, X_{t+1}) are ground-truth frames at times t and t+1, and F_{t→t+1} is the forward flow estimated from the output frames (X̂_t, X̂_{t+1}) using RAFT [194].

Hyperparameters.
We use a pretrained 4× GigaGAN image upsampler as our base model. It contains three downsampling blocks in the encoder and five upsampling blocks in the decoder. The spatial self-attention layers are used only in the first block of the decoder for memory efficiency. For the flow network, we use a lightweight SpyNet [154]. For the low-pass filters, we use the kernel (1/16)[1, 4, 6, 4, 1] before subsampling. We set μ_GAN = 0.05, μ_R1 = 0.2048, μ_LPIPS = 5, μ_Char = 10 in Eqn. 2.2. During training, we randomly crop a 64 × 64 patch from each LR input frame at the same location. We use 10 frames of each video and a batch size of 32 for training. The batch is distributed across 32 NVIDIA A100 GPUs. We use a fixed learning rate of 5 × 10−5 for both the generator and the discriminator. The total number of training iterations is 100,000.

2.4.2 Ablation study

To demonstrate the effect of each proposed component, we add the components progressively, one by one, and evaluate on the REDS4 dataset [138]. We report the quantitative results in Table 2.1 and present a qualitative comparison in Figure 2.5. The flow-guided feature propagation brings a large improvement in LPIPS and E^ref_warp compared to temporal attention alone, demonstrating the contribution of feature propagation to temporal consistency. Further introducing BlurPool as the anti-aliasing block lowers the warping error but increases LPIPS (also shown in Figure 2.5). Finally, the HF shuttle recovers the LPIPS with only a slight loss of temporal consistency. Though it is not clearly reflected in the numbers, we observed that the sharpness of the frames improves significantly with the HF shuttle (see the y-t slice plot in Figure 2.5).

Table 2.1: Ablation study. We use LPIPS↓ to evaluate per-frame quality and E^ref_warp↓ (×10−3) for temporal consistency.
Starting from the image GigaGAN (upsampling each frame independently with the image upsampler), we progressively add components to demonstrate their effectiveness. The best number is in bold; the second best is underlined.

Model                       LPIPS↓   E^ref_warp↓ (×10−3)
GigaGAN (base upsampler)    0.2031   2.497
+ Temporal attention        0.2029   2.462
+ Flow-guided propagation   0.1551   2.187
+ BlurPool                  0.1621   2.152
+ High-freq shuttle         0.1582   2.177

2.4.3 Comparison with previous models

We conduct extensive experiments and report the quantitative comparison of per-frame quality in Table 2.3. We show the comparison of temporal consistency for 6 of the methods in Table 2.2.

Table 2.2: Comparison of VideoGigaGAN and previous VSR approaches in terms of temporal consistency and per-frame quality. The commonly used E_warp for temporal consistency favors blurrier results: the naive bicubic upsampling method achieves the lowest E_warp. To address this issue, we propose to use the referenced warping error E^ref_warp for temporal consistency.

Method                 LPIPS↓   E_warp↓ (×10−3)   E^ref_warp↓ (×10−3)
Bicubic                0.3396   1.161             2.4232
RealViformer [248]     0.2298   3.128             2.3183
TTVSR [118]            0.1836   1.390             2.1178
BasicVSR++ [26]        0.1786   1.401             2.1206
RVRT [115]             0.1727   1.438             2.1217
MIA-VSR [252]          0.1659   1.439             2.1172
VRT [112]              0.1818   1.398             2.1184
EvTexture [80]         0.1684   1.488             2.1320
StableVSR [161]        0.1934   3.957             2.2123
UAV [251]              0.4157   12.881            7.5241
DiffIR2VR-Zero [236]   0.3265   6.665             3.0942
VEnhancer [64]         0.4744   14.270            2.7383
Ours                   0.1582   2.313             2.1773
Ground truth           -        2.127             2.1272

Additionally, we provide qualitative comparisons in Figure 2.4.

Per-frame quality. As shown in Table 2.3, our model outperforms all the other models on LPIPS by a large margin, while showing poorer PSNR and SSIM. We observe that PSNR and SSIM do not align well with human perception and favor blurry results, as also reported in the literature [82, 160, 167].
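The claim that PSNR rewards blurriness can be made concrete with a toy example: against a high-frequency ground-truth texture, a prediction that averages all detail away scores a higher PSNR than a sharp prediction that is misaligned by a single pixel. This is a hypothetical 1-D illustration with made-up signals, not an experiment from this chapter:

```python
import math

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio over 1-D signals in [0, 1]."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return 10.0 * math.log10(peak ** 2 / mse)

# Ground truth: a high-frequency texture alternating between 0 and 1.
gt = [0.5 + 0.5 * (-1.0) ** n for n in range(64)]
blurry = [0.5] * 64                                      # detail averaged away
sharp_shifted = [0.5 + 0.5 * (-1.0) ** (n + 1) for n in range(64)]  # 1-px shift

# The detail-free prediction scores higher PSNR than the sharp one.
assert psnr(blurry, gt) > psnr(sharp_shifted, gt)
```

A perceptual metric such as LPIPS, computed on deep features rather than per-pixel differences, is far less sensitive to this kind of sub-pixel misalignment.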
Thus, we consider LPIPS [246] our core metric for evaluating per-frame quality, as it aligns more closely with human perception. In Figure 2.4, it is noticeable that our model produces results with the most fine-grained details, whereas previous approaches tend to predict blurry results with a critical loss of detail.

Temporal consistency. As observed in previous work [102], the widely used warping error metric favors blurrier videos. This is also illustrated in Table 2.2. The simple

Table 2.3: Quantitative comparison in terms of per-frame quality (LPIPS↓/PSNR↑/SSIM↑) evaluated on multiple datasets. We separate models into regression-based models and generative models (StableVSR [161], MGLD-VSR [234], and ours). We exclude LPIPS evaluation on Vimeo-90K-T for EvTexture [80] due to the lack of released preprocessed data. For StableVSR [161] and MGLD-VSR [234], we omit the Vimeo-90K-T evaluation due to their significantly long runtimes. We highlight LPIPS because PSNR/SSIM often misalign with human perception and favor blurrier results, as noted in many studies [35, 82, 160, 161, 167]. Our VideoGigaGAN aligns best with human perception.
BI degradation (LPIPS↓/PSNR↑/SSIM↑):

Method                 REDS4 [138]           Vimeo-90K-T [229]     Vid4 [117]
EDVR [207]             0.2097/31.05/0.8793   -/37.61/0.9489        -/27.35/0.8264
MuCAN [111]            0.2162/30.88/0.8750   0.1523/37.32/0.9465   -
BasicVSR [25]          0.2023/31.42/0.8909   0.1616/37.18/0.9450   0.2812/27.24/0.8251
IconVSR [25]           0.1939/31.67/0.8948   0.1587/37.47/0.9476   0.2739/27.39/0.8279
TTVSR [118]            0.1836/32.12/0.9021   -                     -
BasicVSR++ [26]        0.1786/32.39/0.9069   0.1506/37.79/0.9500   0.2627/27.79/0.8400
RVRT [115]             0.1727/32.74/0.9113   0.1502/38.15/0.9527   0.2500/27.99/0.8464
PSRT-recurrent [178]   0.1676/32.72/0.9106   0.1509/38.27/0.9536   0.2448/28.07/0.8485
MIA-VSR [252]          0.1659/32.79/0.9115   0.1428/38.22/0.9532   0.2474/28.20/0.8507
IA-RT [227]            0.1629/32.89/0.9138   0.1498/38.14/0.9528   0.2501/28.26/0.8517
VRT [112]              0.1818/32.19/0.9005   0.1461/38.20/0.9530   0.2478/27.93/0.8425
EvTexture [80]         0.1684/32.79/0.9173   -/38.23/0.9544        0.2188/29.51/0.8909
StableVSR [161]        0.1934/27.98/0.7952   -                     0.2803/24.48/0.6989
MGLD-VSR [234]         0.2285/26.24/0.7400   -                     -
Ours                   0.1582/30.46/0.8718   0.1120/35.97/0.9238   0.1925/26.78/0.8029

BD degradation (LPIPS↓/PSNR↑/SSIM↑):

Method                 UDM10 [238]           Vimeo-90K-T [229]     Vid4 [117]
EDVR [207]             -/39.89/0.9686        -/37.81/0.9523        -/27.85/0.8503
BasicVSR [25]          0.1148/39.96/0.9694   0.1551/37.53/0.9498   0.2555/27.96/0.8553
IconVSR [25]           0.1152/40.03/0.9694   0.1531/37.84/0.9524   0.2462/28.04/0.8570
TTVSR [118]            0.1112/40.41/0.9712   0.1507/37.92/0.9526   0.2381/28.40/0.8643
BasicVSR++ [26]        0.1131/40.72/0.9722   0.1440/38.21/0.9550   0.2390/29.04/0.8753
RVRT [115]             0.1100/40.90/0.9729   0.1465/38.59/0.9576   0.2219/29.54/0.8811
IA-RT [227]            0.1129/41.15/0.9750   0.1435/38.62/0.9579   0.2201/29.68/0.8884
VRT [112]              0.1097/41.05/0.9737   0.1421/38.72/0.9584   0.2214/29.42/0.8795
Ours                   0.1060/36.57/0.9521   0.1129/35.30/0.9317   0.1832/27.04/0.8365

(Methods with no reported BD results — MuCAN, PSRT-recurrent, MIA-VSR, EvTexture, StableVSR, and MGLD-VSR — are omitted from the BD panel.)

bicubic upsampling method achieves the best (lowest) value of the commonly used warping error, even lower than that of the ground truth. We proposed the referenced warping error (RWE) in Section 2.4.1 to address this issue of the warping error favoring blurry results.
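The referenced warping error of Eqn. 2.4 can be sketched in 1-D: the flow comes from the generated frames (RAFT in our experiments; here it is simply taken as given), but the photometric error is measured on the ground-truth frames, so an all-black or over-smoothed output no longer scores a perfect error of zero. Helper names are illustrative:

```python
def warp(src, flow):
    """Backward warp with linear interpolation (1-D sketch)."""
    n, out = len(src), []
    for i, f in enumerate(flow):
        pos = min(max(i + f, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        a = pos - lo
        out.append((1.0 - a) * src[lo] + a * src[hi])
    return out

def referenced_warping_error(gt_t, gt_t1, flow, mask):
    """RWE (Eqn. 2.4): warp the ground-truth next frame with the flow computed
    on the *generated* frames, then take a masked mean squared error."""
    warped = warp(gt_t1, flow)
    return sum(m * (a - b) ** 2 for m, a, b in zip(mask, gt_t, warped)) / sum(mask)

gt_t = [0.0, 1.0, 2.0, 3.0]    # ground-truth frame t (a moving ramp)
gt_t1 = [1.0, 2.0, 3.0, 3.0]   # frame t+1: content shifted left by one pixel
mask = [0, 1, 1, 1]            # exclude the pixel that left the field of view
# Flow matching the true motion gives zero RWE; a wrong flow (e.g., one
# estimated from a flickering generated video) is penalized.
assert referenced_warping_error(gt_t, gt_t1, [-1.0] * 4, mask) == 0.0
assert referenced_warping_error(gt_t, gt_t1, [0.0] * 4, mask) > 0.0
```

Note that for the degenerate all-black output, the flow estimated from the generated frames would be arbitrary, and warping the ground truth with it would generally produce a large, not zero, error.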
In terms of the referenced warping error, our method is slightly worse than previous methods (by 0.05 × 10−3 compared to BasicVSR++ [26]). The newly proposed RWE is more suitable for evaluating the temporal consistency of upsampled videos. However, it is still biased towards blurrier results, as seen in Table 2.2 (several methods, including BasicVSR, BasicVSR++, and TTVSR, still score better than the ground-truth high-resolution videos). We leave a better metric of VSR temporal consistency for future work.

2.4.4 Analysis of the trade-off between temporal consistency and frame fidelity

To better understand the trade-off between temporal consistency and per-frame quality, we include a visualization in Figure 2.6. We also compare with diffusion-based VSR models (UAV [251], DiffIR2VR-Zero [236], and VEnhancer [64]) that are trained on larger datasets. Despite large-scale training, they exhibit severe temporal inconsistency and low fidelity to the ground truth (significantly higher LPIPS) due to model hallucination. Unlike previous VSR approaches, our final model, VideoGigaGAN, achieves a balanced trade-off, significantly enhancing both temporal consistency and per-frame quality over the base GigaGAN model with our proposed improvements.

Table 2.4: More comparisons. Per-frame runtimes for 320 × 180 → 1280 × 720 are evaluated on REDS4 [138]. VideoGigaGAN demonstrates competitive runtimes compared to (a) regression-based models [112, 115, 227] and is substantially faster than (b) diffusion models [64, 161, 236, 251]. (c) Scaling up BasicVSR++ does not yield performance improvements. (d) Adding LPIPS to the training loss improves the LPIPS metric, but lowers PSNR/SSIM and makes training unstable.

Model                           #Params (M)   Runtime (ms)   LPIPS↓     PSNR↑
Regression-based:
  IA-RT [227]                   13.4          1895           0.1629     32.89
  VRT [112]                     35.6          219            0.1818     32.19
  RVRT [115]                    10.8          169            0.1727     32.74
Diffusion:
  UAV [251]                     746           7153           0.4157     23.03
  VEnhancer [64]                2496          5168           0.4744     19.92
  DiffIR2VR-Zero [236]          166           12212          0.3265     24.95
  StableVSR [161]               712           9242           0.1934     27.98
Scaling:
  BasicVSR++ (small) [26]       7.3           77             0.1786     32.39
  BasicVSR++ (medium)           166           85             0.1834     32.09
  BasicVSR++ (large)            368           92             0.1941     31.74
+LPIPS:
  BasicVSR++ (small) + LPIPS    7.3           77             0.1646     31.42
  RVRT + LPIPS                  10.8          169            diverged   diverged
VideoGigaGAN (ours)             369           295            0.1582     30.46

Figure 2.6: Trade-off between per-frame quality (LPIPS↓) and temporal consistency (RWE↓). Our final model achieves a good balance between temporal consistency and per-frame quality.

2.4.5 Additional results

Model sizes and runtimes. Table 2.4 compares model sizes and runtimes across VSR methods. Despite its larger size, owing to its generative capacity, our model maintains competitive speed. Unlike slower diffusion-based models [64, 161, 236, 251] that require iterative denoising, VideoGigaGAN produces results in a single feedforward pass.

Scaling-up experiments. For a fair comparison, we scale up BasicVSR++ [26] with additional layers and channels and evaluate on REDS4 [138]. Consistent with the findings in [82], scaling up BasicVSR++ alone does not enhance performance. Despite a similar model size to ours, training BasicVSR++ (large) becomes unstable past 40K iterations; we report its performance at 35K iterations in Table 2.4, which is worse than our model.

Adding LPIPS to the training loss. We also add LPIPS to the training losses of BasicVSR++ and RVRT [115] and report the results in Table 2.4. The performance of BasicVSR++ (small) on the LPIPS metric improves, but with a drop in PSNR and SSIM. Training RVRT with LPIPS is unstable and eventually diverges.
Moreover, training BasicVSR++ with LPIPS produces severe checkerboard artifacts in all results, similar to previous works. We show qualitative results in our supplementary material.

More perceptual metrics. We mainly use LPIPS as the metric for per-frame quality but acknowledge its limitations in capturing higher-level structures [46]. To address this, we also evaluate FID and DISTS in Table 2.5.

Table 2.5: Additional results on the REDS4 dataset.

Model              LPIPS↓   FID↓    DISTS↓
EvTexture [80]     0.1684   101.9   0.065
StableVSR [161]    0.1934   96.2    0.045
RealESRGAN [208]   0.4509   98.2    1.750
OVSR [237]         0.1746   123.8   0.063
Ours               0.1582   95.0    0.041

2.4.6 8× video upsampling

Our model is capable of 8× video upsampling with good temporal consistency and per-frame quality with rich details. We present some results in Figure 2.1. We encourage readers to visit our project website (https://videogigagan.github.io/) for more results.

2.5 Limitations

Figure 2.7: Limitations. Our approach has some limitations. (a) When the video is extremely long, the feature propagation becomes inaccurate, which may introduce undesired artifacts such as incorrectly propagated patterns. (b) Our model cannot handle small objects well, e.g., small characters.

Our model encounters challenges when processing extremely long videos (e.g., 200 frames or more). This difficulty arises from misguided feature propagation caused by inaccurate optical flow in such extended video sequences. Additionally, our model does not perform well on small objects, such as text and characters, as the information pertaining to these objects is largely lost in the LR video input. Examples of these failure cases are illustrated in Fig. 2.7.

2.6 Conclusions

We present a novel generative VSR model, VideoGigaGAN, that can upsample low-resolution input videos to high-resolution videos with both high-frequency details and temporal consistency.
Previous VSR approaches often use regression-based networks and tend to generate blurry results. To this end, our VSR model builds upon the powerful generative image upsampler GigaGAN. We identify several issues when applying GigaGAN to video super-resolution, including temporal flickering and aliasing artifacts. To address these issues, we introduce new components to the GigaGAN architecture that effectively improve both the temporal consistency and the per-frame quality. Our results demonstrate that VideoGigaGAN strikes a balance in addressing the consistency-quality dilemma of VSR compared to previous methods.

Chapter 3: Temporally Consistent Video Editing

Image generators have shown impressive results in producing photorealistic images. They can also edit an image with high-level, semantic text prompts. However, directly applying image editing techniques to videos introduces temporal inconsistency. In this chapter, we explore how to "tame" powerful image generative priors for videos. We present a flow-based video editing method that achieves both temporally consistent and visually plausible results.

Figure 3.1: Temporally consistent video semantic editing. We present a method for editing the semantic attributes of a video using a pre-trained StyleGAN model. Here we showcase free-form text-based editing with StyleCLIP [146] to make the person appear "angry" (2nd row) or wear "eyeglasses" (3rd row).

3.1 Introduction

Generative adversarial networks (GANs) [56] have shown a remarkable ability to generate photorealistic images in various domains such as faces and common objects [18, 89, 91]. GANs take a latent code (usually sampled from a Gaussian distribution) as input and produce an image as output. GAN inversion techniques allow us to project a real image onto the latent space of a pretrained GAN and retrieve its corresponding latent code.
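Optimization-based GAN inversion can be illustrated with a toy differentiable "generator": gradient descent on the reconstruction loss ‖G(w) − I‖² recovers a latent code whose reconstruction matches the target image. This is a deliberately simplified linear stand-in for StyleGAN, with made-up numbers, not the inversion procedure used in this chapter:

```python
def generate(w, weights):
    """Toy linear 'generator': maps a 2-D latent code to a 3-pixel image."""
    return [sum(wi * xi for wi, xi in zip(row, w)) for row in weights]

def invert(image, weights, lr=0.05, steps=2000):
    """Optimization-based inversion: gradient descent on ||G(w) - image||^2."""
    w = [0.0, 0.0]                       # initial latent code
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(w, weights), image)]
        # Gradient of the squared reconstruction loss w.r.t. the latent code.
        grad = [2 * sum(r * row[j] for r, row in zip(residual, weights))
                for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

G = [[1.0, 0.5], [0.0, 1.0], [0.5, -0.5]]    # fixed, "pretrained" generator
target = generate([0.7, -0.2], G)            # image produced by a known code
w_hat = invert(target, G)
# The recovered latent code reconstructs the target image.
assert all(abs(a - b) < 1e-4 for a, b in zip(generate(w_hat, G), target))
```

Real inversion methods replace the linear map with a deep generator and the analytic gradient with backpropagation, and often optimize in an extended latent space (e.g., W+ of StyleGAN) or predict the code with an encoder instead.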
The pretrained GAN generator can then reconstruct that image from the estimated latent code. Modifying this estimated latent code opens up exciting new opportunities to perform a wide range of high-level editing tasks that are traditionally challenging, e.g., changing semantic object classes, modifying high-level attributes of the object or scene, or even applying 3D geometric transformations. We refer to modifying the latent code to achieve a semantic change in the image as semantic editing.

Semantic editing in images. A recent line of research [1, 2, 6, 74, 158, 253, 255] has shown promising results in reconstructing an input image by either optimizing the latent code (or latent variables) or directly predicting the latent code via an image encoder. These GAN inversion techniques enable interesting semantic photo editing applications. For image-level editing applications, several approaches [75, 174, 175] find specific semantic directions in the latent space, e.g., changing poses, colors, or age, while others [48] aim to change the global style, e.g., photo → sketch. We denote these as in-domain and out-of-domain editing, respectively. Given these image GAN inversion-based semantic editing approaches, how can we extend them to videos?

Per-frame editing. One straightforward approach is to apply existing GAN inversion techniques [6, 74, 158, 253] to each frame of a video independently. Figure 3.2 shows an example of applying a StyleCLIP mapper [146] to two frames. The input and the independently reconstructed frames look plausible when viewed individually, but the two edited frames exhibit inconsistency (e.g., the frame with the eyeglasses). Recently, Yao et al. [235] learn to predict per-frame semantic editing directions for editing face videos. However, the edited videos suffer from apparent temporal flickering and fail to preserve facial identity.

Figure 3.2: Issues with per-frame editing.
While current methods achieve faithful inversion and photorealistic editing, the results are inconsistent across frames (eyeglasses) and may fail to preserve details of the input video (lips).

Our work. In this chapter, we present a method for temporally consistent video semantic editing. We start from existing GAN inversion approaches [6, 158] to obtain the latent code for each frame. We first modify the latent codes to achieve initial per-frame editing results. However, such direct editing results in temporal inconsistencies in the modified video's appearance or style. To deal with this challenge, we propose to compute bi-directional optical flow from frame pairs sampled from the video. We then adjust the latent code and the generator to minimize the photometric loss (along valid flow vectors). We present a two-phase optimization strategy. In the first phase, we update only the latent codes via an MLP (with generator parameters frozen) to adjust the consistency of the detailed appearance. In the second phase, we finetune the generator with a local regularization to maintain the editability of the latent space. Our two-phase optimization approach significantly improves temporal consistency while preserving the edited content.

Concurrent work. Two concurrent works [7, 200] also apply StyleGAN to video editing. These methods either use per-frame pivot tuning [158] to maintain the similarity between the edited and input frames [200] or apply latent-vector smoothing [7] with StyleGAN3 [88]. Our method differs in 1) the use of explicit temporal consistency optimization and 2) the applicability to both in-domain and out-of-domain editing.

In this chapter:

• We tackle the task of GAN-based semantic editing in videos. We propose a simple yet effective flow-based approach to mitigate the temporal inconsistency of a directly (frame-by-frame) edited video.
• We present a two-phase optimization approach for updating the latent code and generator to preserve the video details.

• Our method is agnostic to the underlying inversion technique and can be applied to different GAN inversion and editing approaches.

3.2 Related Work

Generative adversarial networks. The quality and resolution of generated images have improved rapidly in recent years [18, 85, 88, 89, 91]. These GAN models can map a random latent code (a noise vector) to a photorealistic image. Many recent efforts have been devoted to improving generator architectures [84, 88, 89, 91], training strategies [18], loss function designs [61, 127], and regularization [134]. Our work builds upon existing pretrained StyleGAN models, as they provide a disentangled latent space for editing. Instead of generating synthetic images, our goal is to edit real videos.

GAN inversion. GAN inversion [221, 255] allows us to reconstruct real images by projecting them onto a pretrained GAN's latent space. These techniques facilitate interesting photo editing applications. They can be split into encoder-based [6, 22, 126, 141, 157, 196, 198, 203], optimization-based [1, 2, 36, 37, 60, 74, 152, 195], and hybrid methods [14, 158, 253]. Our method is agnostic to the GAN inversion approach used to initialize the latent code; e.g., our experiments use PTI [158] for in-domain editing and the ReStyle encoder [6] for out-of-domain editing.

Semantic image editing in latent space. Semantic image manipulation and editing allow us to change the content and style of an image. They can be grouped into in-domain and out-of-domain editing. In-domain editing [3–5, 75, 109, 146, 166, 174, 175, 218, 219, 241] finds semantic directions in the latent space of a pretrained generator to manipulate the attributes of the object while keeping the same style. Out-of-domain editing [48, 78, 101], however, aims to change the style of the image.
These techniques usually perform well on a single image but fail to maintain temporal consistency when applied to a video.

Semantic video editing. Recent and concurrent works [7, 200, 235] explore video editing with a pre-trained StyleGAN. The methods in [200, 235] apply per-frame editing and show coherent editing without using any temporal information. However, these methods support only in-domain editing. For localized editing (e.g., adding eyeglasses), we find that the method in [235] produces inconsistency and fails to preserve identity. The work in [7] applies temporal smoothing on the inverted latent vectors in StyleGAN3 [88]. Our approach, in contrast, directly minimizes the temporal photometric inconsistency of the synthesized frames.

Video editing and temporal consistency. Temporal consistency is a critical criterion in video editing. Existing methods often achieve temporal consistency by enforcing the output videos to satisfy constraints imposed by 2D optical flow [31, 71]. Alternatively, several methods first estimate an unwrapped 2D texture map (either explicitly [155] or implicitly [92]) and then perform editing. The editing can then be propagated to the original video via the estimated UV mapping. Several blind methods enhance temporal consistency as a post-processing step [17, 102, 108]. However, they typically have difficulty handling videos with significant appearance changes. Our work shares similar ideas with these methods to enforce temporal consistency, using the optical flow fields estimated from the initial edited video. Instead of directly optimizing the pixel values, our core idea is to leverage the pretrained generator and update the latent code and the generator to achieve temporally consistent and photorealistic results.

Figure 3.3: Video editing with flow-based temporal consistency. Given an input video of T frames V_input, we first spatially align the video frames using an off-the-shelf face landmark detector. We then use existing GAN inversion techniques [6, 158] to obtain the inverted frames {I_1^inv, I_2^inv, ..., I_T^inv} and their corresponding latent codes in the W+ space of StyleGAN, {W_1^inv, W_2^inv, ..., W_T^inv}. We independently perform semantic editing on these inverted frames to obtain {I_1^edit, I_2^edit, ..., I_T^edit} and their corresponding latent codes {W_1^edit, W_2^edit, ..., W_T^edit}. To achieve temporal consistency, we choose an anchor frame I_anc^edit as the reference frame and each time sample another frame I_i^edit from the edited video. To generate a temporally consistent edited video, we first refine the latent codes of the directly edited video, W_anc^edit and {W_i^edit}_{i≠anc}, to Ŵ_anc^edit and {Ŵ_i^edit}_{i≠anc} by optimizing an MLP f_θ (phase 1). These refined latent codes result in the temporally consistent frames Î'_anc and Î'_i. To further improve the temporal consistency, we keep the refined latent codes Ŵ_anc^edit and Ŵ_i^edit and only update the generator parameters (phase 2). This generates Î''_anc and Î''_i with improved temporal consistency. After our two-phase optimization, we finally unalign the frames to generate our final edited video V_out (phase 3).

3.3 Method

3.3.1 Overview

GAN Inversion.
Given an input video V_input = {I_1, ..., I_T} of T frames, our goal is to semantically edit all the video frames while preserving the temporal coherence of the edited video. To edit the input video V_input, we first align its frames using a facial alignment method [62]. Then we use existing GAN inversion techniques (e.g., [6, 158]) to invert the frames back to latent codes such that each inverted frame I_t^inv = G(W_t^inv; θ^inv) is similar to the input frame: I_t^inv ≈ I_t. With the inverted frames, we can edit the inverted video V_inv = {I_1^inv, I_2^inv, ..., I_T^inv} by independently editing its frames I_t^inv. We denote this frame-by-frame editing approach as "direct editing".

Figure 3.4: Photometric loss for temporal consistency. Given a frame pair Î_i and Î_anc (either from phase 1 or phase 2), we compute the forward and backward flows F_{i→anc} and F_{anc→i} using RAFT [194]. We then use these two flow fields to compute the visibility masks by performing a forward-backward and backward-forward flow consistency error check. For in-domain editing, we also use LPIPS to obtain a semantic mask that highlights the difference between the aligned input frames I_i^in and I_anc^in and our edited frames Î_i and Î_anc. We then fuse the LPIPS semantic masks and the visibility masks to get our final masks M_{anc→i} and M_{i→anc}. To compute the photometric loss (Eqn. 3.1), we use the flows to warp the directly edited frames and use the fused masks as shown in (a).
In-domain and out-of-domain GAN-based editing. Commonly used image-based editing techniques via a GAN include (1) in-domain and (2) out-of-domain editing. We refer to in-domain editing [75, 174, 175, 219] as editing that only manipulates the latent code, given a fixed pretrained generator. That is, the generator parameters θ^inv remain frozen (θ^inv = θ^edit), and only the latent code W_t^edit is updated. In-domain editing usually changes semantic attributes such as color, age, or facial expressions. On the other hand, out-of-domain editing may involve updating the pretrained generator to produce an entirely new style (as shown in [48]). Here, the latent code remains the same, W_t^edit = W_t^inv, and only the generator parameters θ^edit change.

Direct editing on a video. When applying either type of editing technique to a video independently for each frame, we obtain an edited video V_edit = {I_1^edit, I_2^edit, ..., I_T^edit}. For each directly edited frame I_t^edit, there is a corresponding latent code W_t^edit such that I_t^edit = G(W_t^edit; θ^edit). Due to this per-frame, independent process, the edited video V_edit often suffers from temporal inconsistency. Moreover, due to the poor disentanglement of this per-frame editing, not only do the edited attributes differ among frames, but other existing facial attributes also change (see the change in the mouth in Fig. 3.5). Our goal is to ensure that the edited attributes remain temporally consistent while preserving the other details of the input video.

Overview of our approach. To achieve this goal, we propose a two-phase optimization approach: phase 1 updates the latent code via an MLP, and phase 2 updates the generator. In both phases, we optimize the temporal photometric loss across frames. With the fine-tuned latent code and generator, we unalign the edited frames to produce the edited video. Figure 3.3 outlines our workflow. Below, we describe the details and the losses of our approach.
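As a toy numpy sketch of the "direct editing" baseline above (the 512-dimensional latent size and the semantic direction are made-up stand-ins, not actual StyleGAN W+ codes): each frame's latent code is shifted independently along a semantic direction while the generator weights stay frozen.

```python
import numpy as np

def direct_edit(latents, direction, strength):
    """Per-frame 'direct editing': shift every frame's latent code along a
    semantic direction independently (generator weights stay frozen)."""
    return [w + strength * direction for w in latents]

T, dim = 5, 512                      # sizes are assumptions for illustration
rng = np.random.default_rng(0)
w_inv = [rng.normal(size=dim) for _ in range(T)]   # stand-ins for inverted codes

direction = np.zeros(dim)
direction[0] = 1.0                   # hypothetical unit "edit" direction
w_edit = direct_edit(w_inv, direction, strength=0.12)
```

Because each frame is shifted without any temporal constraint, nothing in this baseline couples the frames, which is exactly why the result flickers and motivates the flow-based losses below.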
3.3.2 Flow-based temporal consistency

We present a flow-based approach to explicitly encourage temporal consistency in the edited video V_edit.

Frame sampling. As we cannot fit an entire video into GPU memory, we perform our optimization on one pair of frames at a time. We use an anchor frame I_anc^edit as one member of each pair, which we set as the middle frame of the video. This is inspired by recent video representation work [156], where a video is represented by a key frame and a flow network. At each iteration, we sample a latent code W_i^edit, corresponding to the frame I_i^edit, and optimize the pair of frames {I_anc^edit, I_i^edit}. We perform our optimization in two phases (Section 3.3.3). In phase 1, we generate temporally consistent pairs {Î'_anc, Î'_i}_{i≠anc}. In phase 2, we further improve the temporal consistency, recover other attributes affected by the poorly disentangled per-frame editing, and generate the pairs {Î''_anc, Î''_i}_{i≠anc}.

Flow estimation and warping. We use RAFT [194] to compute the forward and backward flows F_{i→anc} and F_{anc→i} of the pair {Î_anc, Î_i}. This pair is either the output of phase 1, {Î'_anc, Î'_i}, or phase 2, {Î''_anc, Î''_i}. We then use these two flows to warp the pair of frames {Î_anc, Î_i}.

Visibility masks. To highlight the non-occluded regions, we compute the visibility masks M^vis_{anc→i}, M^vis_{i→anc} ∈ [0, 1]. These masks assign lower weights to occluded pixels and higher weights to non-occluded pixels (Figure 3.4). To compute the visibility masks, we first compute the forward-backward and backward-forward flow consistency error maps ε_{anc→i} and ε_{i→anc}, e.g., ε_{i→anc}(p) = ||F_{i→anc}(p) + F_{anc→i}(p + F_{i→anc}(p))||_2, where p is a pixel in the flow field. These error maps are mapped to [0, 1] using an exponential function such that M^vis_{anc→i} = exp(−10 ε_{anc→i}) and M^vis_{i→anc} = exp(−10 ε_{i→anc}).

Perceptual difference mask.
For in-domain editing, because the introduced edits are temporally inconsistent, we observe that the visibility masks do not emphasize the edited parts (e.g., eyeglasses). To highlight the edited parts, we compute soft semantic perceptual difference masks M^PD_anc and M^PD_i between the pair of frames and their corresponding aligned input frames using LPIPS [246] (Figure 3.4). Due to the significant appearance differences, we cannot use these semantic perceptual difference masks for out-of-domain editing.

Fused masks. For in-domain editing, we fuse the visibility masks and the semantic perceptual difference masks such that M_{anc→i} = (M^vis_{anc→i} ⊕ M^PD_i) and M_{i→anc} = (M^vis_{i→anc} ⊕ M^PD_anc). The masks are then clamped to [0, 1]. This fusion is shown in Figure 3.4. For out-of-domain editing, on the other hand, M_{anc→i} = M^vis_{anc→i} and M_{i→anc} = M^vis_{i→anc}.

Bi-directional photometric loss. We use the warped frames and the final fused masks to compute a bi-directional photometric loss that encourages a temporally consistent video. This loss measures the difference between the two frames in the non-occluded parts:

L_photo = Σ_{(Î_i, Î_anc) ∈ P} [ M_{i→anc} L_LPIPS(Î_anc, warp(Î_i, F_{anc→i})) + M_{anc→i} L_LPIPS(Î_i, warp(Î_anc, F_{i→anc})) ],   (3.1)

where Î_t is either the output of phase 1, Î'_t, or phase 2, Î''_t. Intuitively, this bi-directional photometric loss encourages colors along the valid (forward-backward or backward-forward consistent) flow vectors across frames to be as similar as possible.

Figure 3.5: Motivation for two-phase optimization. Updating the latent code W brings in the eyeglasses, and tuning G with the perceptual difference mask recovers the expression of the input.

3.3.3 Two-phase optimization strategy

We split our optimization into two phases. In the first phase, we refine the latent codes {W_t^edit} by optimizing only an MLP f_θ.
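The visibility masks and the bi-directional photometric loss just described can be sketched as follows. This is a minimal numpy version under stated simplifications: nearest-neighbour warping replaces bilinear sampling, a plain per-pixel L1 distance stands in for LPIPS, and the function names are ours, not from the chapter's code.

```python
import numpy as np

def warp_nn(img, flow):
    """Backward-warp img with a (H, W, 2) flow field, sampling with
    nearest neighbours (bilinear sampling would be used in practice)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def visibility_mask(f_fwd, f_bwd, scale=10.0):
    """Soft visibility mask exp(-10 * eps), where eps is the
    forward-backward flow consistency error at each pixel."""
    round_trip = f_fwd + warp_nn(f_bwd, f_fwd)   # ~0 where flows agree
    eps = np.linalg.norm(round_trip, axis=-1)
    return np.exp(-scale * eps)                  # in [0, 1]

def photometric_loss(i_anc, i_i, f_anc2i, f_i2anc):
    """Masked bi-directional photometric loss in the spirit of Eqn 3.1,
    with L1 standing in for LPIPS and no perceptual-difference fusion."""
    m_anc2i = visibility_mask(f_anc2i, f_i2anc)
    m_i2anc = visibility_mask(f_i2anc, f_anc2i)
    d1 = np.abs(i_anc - warp_nn(i_i, f_anc2i)).mean(-1)
    d2 = np.abs(i_i - warp_nn(i_anc, f_i2anc)).mean(-1)
    return (m_i2anc * d1 + m_anc2i * d2).mean()

# A static frame pair with zero flow is perfectly consistent:
frame = np.random.default_rng(1).random((8, 8, 3))
zero = np.zeros((8, 8, 2))
loss = photometric_loss(frame, frame, zero, zero)   # -> 0.0
```

The occlusion handling is the key design choice: where the round-trip flow error is large, the exponential mask drives the loss contribution toward zero instead of penalizing pixels that have no valid correspondence.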
In the second phase, we update only the generator weights θ^edit.

Motivation. We use a two-phase optimization approach for in-domain editing because we observe that refining only the latent codes (phase 1) sometimes introduces undesired changes to other facial attributes. We show an example in Fig. 3.5. When we only update the latent codes, we achieve temporal consistency of the introduced glasses; however, the mouth expression of the person changes. To address this in the case of in-domain editing, we update the generator weights (phase 2) using the perceptual difference mask to enforce that the pixels outside the mask stay the same as the input. This maintains the facial expression of the aligned input frame. The primary source of inconsistency for out-of-domain editing is global inconsistency (e.g., the background). Hence, updating the generator (phase 2) introduces this desired global change.

Phase 1: Latent code update. In this phase, we update the latent code W_t^edit implicitly using a multi-layer perceptron (MLP) f_θ(w; θ_f). We use the same architecture as the StyleCLIP mapper [146]. We use this MLP to predict a residual for the latent codes and update the parameters of the MLP instead of directly optimizing the latent codes, such that

Ŵ_t^edit = W_t^edit + α f_θ(W_t^edit; θ_f).   (3.2)

Then, for a pair of directly edited frames {I_anc^edit, I_i^edit}, we obtain the updated frames Î'_i = G(Ŵ_i^edit) and Î'_anc = G(Ŵ_anc^edit). Our goal is to minimize

argmin_{θ_f} L_I = argmin_{θ_f} Σ_{t≠anc} L_photo + λ_rf L_rf + λ_ε L_ε,   (3.3)

where L_photo is the photometric loss, and

L_rf = ||f_θ(W_t^edit; θ_f)||_1 + ||f_θ(W_anc^edit; θ_f)||_1   (3.4)

Figure 3.6: x-t slices when updating latent codes explicitly vs. implicitly with an MLP. We visualize the optimized frames and an x-t slice at y = 500.
Explicitly updating the latent code W yields an unstable x-t scanline, while updating W implicitly with an MLP yields a smooth scanline.

Here, L_rf is a regularization term that ensures we do not deviate too much from W_t^edit. We set λ_rf = 0.1 for the experiments. L_ε = ||ε_{anc→i}||_1 + ||ε_{i→anc}||_1 is the norm of the flow consistency error maps, and we set λ_ε = 10. The reason we use an MLP to update the latent code implicitly is that explicitly optimizing the latent codes results in an unstable optimization when using a large learning rate, while the running time becomes too long with a small learning rate. To address this, we introduce an MLP to predict the residual and update the latent codes implicitly, which leads to a more stable optimization. We show an example x-t scanline in Fig. 3.6 to demonstrate the effectiveness of introducing the MLP.

Phase 2: Generator update. For in-domain editing, in this phase we use the updated latent codes {Ŵ_t^edit}_{t=1}^T from phase 1, and our goal is to finetune only the generator to minimize

θ̂^edit = argmin_{θ̂^edit} L_II = argmin_{θ̂^edit} Σ_{t≠anc} L_photo + λ_ε L_ε + λ_r L_r + λ_M L_M,   (3.5)

L_M = (1 − M^PD_i) L_LPIPS(Î''_i, I_i^in) + (1 − M^PD_anc) L_LPIPS(Î''_anc, I_anc^in).   (3.6)

M^PD_i is the perceptual difference mask computed between Î''_i = G(Ŵ_i^edit; θ̂^edit) and the aligned input I_i^in, and L_LPIPS(·, ·) is the LPIPS distance [246]. We initialize θ̂^edit as θ^edit. The LPIPS term also helps maintain the sharpness of the edited frames, because consistency could otherwise be achieved by pushing all the frames to become blurry.

Here, L_r is the regularization loss for the generator, and λ_r is the strength of the regularization. We introduce this loss to help prevent the generator G from losing its latent-space editability, as we do not wish to ruin its pretrained latent space. Therefore, similar to [158], we use this local regularization to preserve the editing ability of our generator.
More specifically, we first obtain a latent code W_r by linearly interpolating between the current latent code Ŵ_t^edit and a randomly sampled code W_z with an interpolation parameter α_interp: W_r = Ŵ_t^edit + α_interp (W_z − Ŵ_t^edit) / ||W_z − Ŵ_t^edit||_2. This gives us a new latent code in a local region around Ŵ_t^edit. To ensure that we do not lose the editing capability of the original generator, we penalize the distance between the images generated by the new generator and the old one:

L_r = L_LPIPS(x_r, x̂_r) + λ_r^ℓ2 L_ℓ2(x_r, x̂_r),   (3.7)

where x_r = G(W_r; θ^edit), x̂_r = G(W_r; θ̂^edit), and λ_r^ℓ2 is the weight for the ℓ2 loss. This regularization alleviates the side effects of updating G within a local area. This is desirable since, for a video, the latent codes for the same identity tend to cluster locally.

For out-of-domain editing, unlike in-domain editing, we cannot rely on the perceptual difference mask, so the optimization objective reduces to

θ̂^edit = argmin_{θ̂^edit} L_II = argmin_{θ̂^edit} Σ_{t≠anc} L_photo + λ_r L_r + λ_ε L_ε.   (3.8)

To compensate for the regularization effect of the perceptual difference mask, we freeze the last eight layers of the synthesis network in G to avoid blurry results. As all the computations, including the GAN generator, flow estimation network, spatial warping, and photometric losses, are differentiable, we can backpropagate the errors all the way back.

Figure 3.7: Visual results on the RAVDESS dataset [123]. We show both in-domain ("beard") and out-of-domain ("Disney princess") editing results. Our results maintain consistent changes over time, preserving temporal coherence.

After phases 1 and 2, we obtain {Ŵ_t^edit}_{t=1}^T and G(·; θ̂^edit).

3.3.4 Unalignment

After our two-phase optimization, we apply the stitch tuning approach [200] as post-processing to put the aligned frames back into the original video and generate our final edited video.
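The latent sampling used by the locality regularizer L_r above can be sketched as follows (a numpy stand-in; the 512-dimensional code size and variable names are assumptions, and the two generators are omitted since only the sampling step is illustrated):

```python
import numpy as np

def local_latent_sample(w_hat, w_z, alpha_interp):
    """W_r = W_hat + alpha_interp * (W_z - W_hat) / ||W_z - W_hat||_2.
    W_r lies at exactly distance alpha_interp from W_hat, toward the
    random code W_z; L_r then compares G(W_r; theta_edit) with
    G(W_r; theta_edit_hat) to keep the generator fine-tuning local."""
    delta = w_z - w_hat
    return w_hat + alpha_interp * delta / np.linalg.norm(delta)

rng = np.random.default_rng(0)
w_hat = rng.normal(size=512)   # refined latent code (stand-in)
w_z = rng.normal(size=512)     # randomly sampled code
w_r = local_latent_sample(w_hat, w_z, alpha_interp=0.5)
# ||w_r - w_hat|| == alpha_interp by construction
```

Normalizing the offset keeps the regularizer focused on a fixed-radius neighbourhood of the current code, which matches the observation that latent codes for the same identity cluster locally.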
Note that this is only feasible for in-domain editing, because out-of-domain editing changes the global appearance relative to the input video.

3.4 Experimental Results

3.4.1 Experimental setup

Implementation details. We use StyleGAN-ADA [86] as our pre-trained generator. We experiment with in-domain and out-of-domain editing techniques to validate our approach for different GAN inversion methods. Specifically, for in-domain editing, we use PTI inversion [158] (based on e4e [198]) and the StyleCLIP mapper [146]. For out-of-domain editing, we use the Restyle encoder [6] and StyleGAN-NADA [48]. We will release the source code and pretrained models. In the following, we show sample results from the video frames. We encourage readers to view the videos in the supplementary material.

Figure 3.8: Visual comparison with DVP [108]. DVP achieves temporal consistency by severely smoothing the image and hence losing sharpness. Our method, in contrast, achieves a balance between consistency and sharpness. In the "eyeglasses" example (left), DVP shows a different pair of eyeglasses over time (zoom in for better visualization), while ours maintains good consistency for the eyeglasses; in "Disney princess" (right), DVP shows a blurry result with an unstable x-t scanline, while ours is sharper and shows stable consistency in the scanline.

Datasets. We conduct our metric evaluation using 20 videos randomly sampled from the RAVDESS dataset [123]. For each video, we conduct 5 types of in-domain editing and 5 types of out-of-domain editing. To further demonstrate the ability of our method to handle real videos, we also apply our approach to Internet videos and show the visual results.

Metrics. We evaluate the method on two main aspects: 1) temporal consistency and 2) perceptual similarity to the semantically edited frames.

Figure 3.9: Results on Internet videos.
We change the first person to a "surprised" expression and the second person to "angry".

To evaluate temporal consistency, we measure the warping error E_warp:

E_warp(I_t, I_{t+1}) = (1 / Σ_{i=1}^N M_t(p_i)) Σ_{i=1}^N M_t(p_i) ||I_t(p_i) − Î_{t+1}(p_i)||_2^2,   (3.9)

where Î_{t+1} = warp(I_{t+1}, F_{t→t+1}), N is the number of pixels, p_i is the i-th pixel, and M_t is a binary non-occlusion mask indicating non-occluded pixels, which we compute using the forward-backward consistency error with the threshold from [103, 121].

We also measure the LPIPS perceptual similarity score [246] (with AlexNet [98]) between the directly edited video V_edit = {I_1^edit, I_2^edit, ..., I_T^edit} and the output of our phase 2, {Î''_1, Î''_2, ..., Î''_T}, by averaging the perceptual similarity between the corresponding frames. The purpose of these two metrics is to evaluate whether the method achieves a balance between temporal consistency and fidelity degradation. This is an inherent trade-off: preserving all the details of per-frame editing inevitably leads to temporal flickering artifacts, while focusing only on temporal consistency easily leads to blurry videos. Our goal is for the final output video to be visually similar to the directly (per-frame) edited video.

Table 3.1: Out-of-domain editing comparison. Direct editing: E_warp = 0.0098, LPIPS = 0.0000.

                       E_warp ↓             LPIPS ↓
Editing category       DVP [108]   Ours     DVP [108]   Ours
Sketch                 0.0036      0.0085   0.2404      0.1314
Pixar                  0.0031      0.0025   0.1074      0.1178
Disney Princess        0.0040      0.0078   0.2062      0.1204
Elf                    0.0042      0.0108   0.2289      0.1310
Zombie                 0.0040      0.0085   0.2033      0.1370
Average performance    0.0038      0.0076   0.1972      0.1275

Table 3.2: In-domain editing comparison. Direct editing: E_warp = 0.0076, LPIPS = 0.0000.

                       E_warp ↓                          LPIPS ↓
Editing category       LT [235]   DVP [108]   Ours       DVP [108]   Ours
angry                  -          0.0033      0.0032     0.2452      0.1100
beard                  0.0064     0.0038      0.0030     0.2444      0.1033
eyeglasses             0.0066     0.0039      0.0034     0.1226      0.1097
Depp                   -          0.0037      0.0031     0.2452      0.2024
surprised              -          0.0035      0.0028     0.1415      0.1012
Average performance    0.0065     0.0036      0.0031     0.1760      0.1253

3.4.2 Out-of-domain results

Setup. We first invert the videos frame by frame using the Restyle encoder [6] (pSp-based [157]). We then directly apply five different out-of-domain editing effects produced by StyleGAN-NADA [48]. We perform our two-phase optimization on the directly edited video using the Adam optimizer [95]. For phase 1, we set the learning rate to α_I = 0.005 and update the latent codes for 5 epochs. In Eqn. 3.2, we set α = 0.04 for all the editing directions. For phase 2, we set the learning rate to α_II = 8 × 10^−4 and finetune G for 5 epochs. We set the regularization weight λ_r to 200.

Evaluation. Table 3.1 shows that our method decreases the temporal error of the directly edited video. The primary sources of inconsistency in out-of-domain editing are the flickering background and the details of the hair. We show our visual results in Figure 3.7. Our method preserves temporal consistency and maintains the sharpness of the input video.

3.4.3 In-domain editing results

Setup. We first invert the videos frame by frame using PTI [158]. We then directly apply five different semantic editing directions discovered by the StyleCLIP mapper [146]. Next, we perform our two-phase optimization on the directly edited video using the Adam optimizer [95]. For phase 1, we set the learning rate to α_I = 0.05 and update f_θ for 10 epochs. In Eqn. 3.2, we set α = 0.12 for "eyeglasses" and α = 0.04 for the remaining semantic directions. For phase 2, we set the learning rate of G to α_II = 0.0001 and finetune G for 5 epochs. We set the regularization weight λ_r to 200.

Evaluation. Table 3.2 shows that our approach improves the temporal consistency over the directly edited video baseline by a large margin.
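The warping-error metric E_warp (Eqn. 3.9) used in these tables can be sketched as follows, a minimal numpy version under stated simplifications: nearest-neighbour warping instead of bilinear sampling, and function names of our choosing rather than from the chapter's code.

```python
import numpy as np

def warp_nn(img, flow):
    """Backward-warp img with a (H, W, 2) flow field (nearest neighbour)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def warping_error(i_t, i_t1, flow_t_to_t1, mask):
    """E_warp (Eqn 3.9): squared error between frame t and the next frame
    warped back to t, averaged over non-occluded pixels (binary mask)."""
    i_hat = warp_nn(i_t1, flow_t_to_t1)
    sq = ((i_t - i_hat) ** 2).sum(-1)       # per-pixel squared L2
    return (mask * sq).sum() / mask.sum()   # masked mean

# An identical, static frame pair has zero warping error:
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
err = warping_error(frame, frame, np.zeros((8, 8, 2)), np.ones((8, 8)))
```

Normalizing by the mask sum, rather than by the pixel count, keeps the metric comparable across frame pairs with different amounts of occlusion.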
When dealing with in-domain editing, the primary sources of inconsistency are the details of the newly added attributes (e.g., glasses or beard) and some background flickering. We show sample visual results in Figure 3.7, where the introduced changes are consistent across frames.

3.4.4 Two-phase optimization strategy ablation study

We demonstrate the effect of our two-phase optimization strategy of updating the latent codes first and then finetuning the generator G. We compare our two-phase approach to: (1) no optimization (i.e., direct editing), (2) updating the latent codes only (phase 1), and (3) finetuning the generator G only. We show the results in Table 3.3. When we only update the generator G, we achieve a low warping error E_warp. However, this is not desirable, since finetuning G pushes the video to be globally consistent without modifying the local details; the output video therefore differs from the directly edited video (i.e., high LPIPS distance). Thus, we follow our two-phase optimization of a) updating the latent codes via an MLP f_θ (to improve local consistency) and b) finetuning the generator G (to modify the global effect).

Table 3.3: Two-phase optimization strategy ablation study.

Optimization stage                 In-domain editing        Out-of-domain editing
Update W_t^edit     Update G       E_warp ↓    LPIPS ↓      E_warp ↓    LPIPS ↓
-                   -              0.0076      0.0000       0.0098      0.0000
✓                   -              0.0064      0.2108       0.0094      0.1428
-                   ✓              0.0027      0.2463       0.0057      0.1375
✓                   ✓              0.0031      0.1253       0.0076      0.1275

Figure 3.10: Visual comparison with the Latent Transformer (LT) [235]. LT does not preserve the person's identity well. Our method preserves the identity and achieves a temporally consistent video.

3.4.5 Comparison with Latent Transformer

We compare our method with the Latent Transformer (LT) [235]. We show a quantitative comparison on the two overlapping editing types, "beard" and "eyeglasses", in Table 3.2, and a qualitative comparison in Fig. 3.10.
LT edits videos by updating the projected latent code independently for each frame, without temporal constraints. Our method, in contrast, uses a flow-based loss to improve temporal consistency, and our second phase uses a perceptual difference mask as a regularizer to preserve the facial details other than the edited parts. As a result, our method improves temporal consistency and preserves personal identity.

3.4.6 Comparison with Deep Video Prior (DVP)

We compare our method with DVP [108], a state-of-the-art blind video consistency approach, using its default settings. We show the in-domain editing comparison in Table 3.2 and the out-of-domain editing comparison in Table 3.1. In terms of warping error E_warp, our method achieves improved results for in-domain editing and comparable results for out-of-domain editing. In terms of LPIPS distance, however, our results are more similar to the directly edited video for both in-domain and out-of-domain editing. We show a visual comparison in Figure 3.8. DVP can achieve temporally consistent results (i.e., low E_warp), but at the cost of losing local details, as in the "eyeglasses" example, or excessively smoothing the results into a blurry video, as in the "Disney Princess" example.

Figure 3.11: Limitations. (a) "Zombie": earrings are added by the GAN editing prior to our flow-based temporal consistency approach; since our approach builds on existing GAN inversion and editing techniques, it is affected by their quality. (b) "Rare pose": our method fails when there is a rare pose and large motion.

3.5 Limitations

We show several limitations of our approach in Figure 3.11. First, our approach relies on plausible results from existing GAN inversion and editing techniques. We show an example of added earrings in Figure 3.11(a) and an example of a rare pose in Figure 3.11(b).
Second, the GANs used in our experiments require the objects to be spatially aligned and thus may not yet be suitable for inverting and editing unconstrained videos. Third, our method relies on a high-quality GAN model, which may be computationally expensive to train and often requires diverse training images. Our full method (phases 1, 2, and 3) takes 40 minutes for a 150-frame video on a single NVIDIA P6000 GPU.

3.6 Conclusions

We have presented a novel method for video semantic editing by leveraging image-based GAN inversion and editing. Our approach starts from direct per-frame editing, and we refine the editing results with a flow-based method that minimizes a bi-directional photometric loss. Our core approach is two-phase: adjusting the latent codes via an MLP and tuning G to achieve temporal consistency. We show that our method achieves temporal consistency while preserving similarity to the direct editing results. Finally, our model-agnostic method is applicable to different GAN inversion and manipulation techniques.

Potential negative impacts. Malicious use of our technique may lead to video manipulation of public figures for spreading misinformation.

Chapter 4: Increasing Generalizability of the Generative Models

Generative models have difficulties when dealing with out-of-domain (OOD) data due to limited training data. In this chapter, we propose a novel approach to adapt a pre-trained generative model to OOD data. With this approach, we can leverage powerful generative priors even for OOD data.

4.1 Introduction

GAN inversion [3, 157, 198, 222, 254] is a set of techniques that project an input image onto the latent space of a pre-trained GAN to obtain a latent code such that the image generator can reconstruct the input. This is particularly useful because one can then perform various creative semantic editing tasks [47, 63, 147, 173] on images.
Similar techniques have also been applied in the video domain, where recent methods have achieved temporally consistent editing [200, 228]. However, the majority of these methods are effective primarily with 2D GANs, and they fall short in offering explicit 3D controllability, such as view synthesis. With the rapid recent advances in 3D reconstruction, especially neural radiance fields (NeRFs) [13, 29, 133, 137], high-quality 3D-aware GANs [23, 59, 145, 185] have emerged as a powerful tool for learning 3D generation from 2D images. Equipped with a 3D representation such as a NeRF [23, 59] or an SDF [145], 3D-aware GANs offer explicit control over camera views and ensure 3D geometric consistency in generation. Additionally, they retain the generative capacity and editability of 2D GANs [85, 87, 89, 91]. This enables applications such as novel view synthesis, semantic image editing [104, 183, 187, 225, 239, 240], and video editing [45, 199].

Figure 4.1: Semantic editing for out-of-distribution data. We present a method for reconstructing and editing an out-of-distribution (OOD) image or video using a pre-trained 3D-aware generative model (EG3D [23]). Our method explicitly models and reconstructs the occluders in 3D, allowing faithful reconstruction of the input while preserving the semantic editing capability. Here we showcase the reconstruction and editing results "Less smile", "Younger", "Blond" [173], "Elsa", and "Surprised" [147]. Our method can also remove the OOD part. Data are from the Internet.

Core challenges. While state-of-the-art 3D GAN inversion methods achieve remarkable advances in both image and video editing for human faces, they face challenges when dealing with images containing out-of-distribution (OOD) objects (e.g., heavy make-up or occlusions).
This limitation arises primarily because these models are pre-trained only on natural faces without complex textures or substantial occlusions. As a result, editability deteriorates when a pre-trained GAN is forced to model OOD objects in the GAN inversion process. This is commonly known as the reconstruction-editability trade-off [198]. Existing GAN inversion methods assume that a single latent code corresponding to the input image can be found in the latent space [186, 222] through optimization once the model is trained. They therefore aim to reconstruct the in-distribution (InD) content (e.g., the natural face) and the OOD objects together. However, because OOD components often cannot be well modeled by a pre-trained GAN, and consequently cannot be well represented by a single latent code, existing methods either fail to reconstruct them faithfully [187] or can reconstruct them (e.g., by fine-tuning the generator) at the cost of altering the latent space properties and deteriorating editability [158] (Figure 4.2).

Figure 4.2: Limitations of previous methods. Existing GAN inversion techniques cannot deal with frames containing OOD elements, resulting in a poor reconstruction-editing balance. GOAE [240] can produce faithful editing but fails to preserve the identity of the input face. PTI [158] provides higher reconstruction fidelity, but editability suffers.

Our work. We propose a new approach to address this issue by drawing inspiration from recent composite volume rendering works that compose multiple radiance fields during rendering [50, 128, 216, 231]. Our core idea is to decompose the 3D representation of an image with OOD components into an in-distribution (InD) part and an out-of-distribution part, and to compose them together to reconstruct the image via composite volumetric rendering.
We use EG3D [23] as our 3D-aware GAN backbone and leverage its tri-plane representation to model this composed rendering pipeline. For the InD component (i.e., the natural face), we project pixel values onto EG3D's W+ space for an InD reconstruction. We further introduce an additional tri-plane to represent the OOD content. We then combine these two radiance fields via composite volumetric rendering to reconstruct the input frames. During the editing stage, we perform latent-code-based editing solely on the InD part and leave the OOD component unaltered. This framework allows the application of any StyleGAN-based editing approach [147, 173] to the InD component, such as changing facial expressions, which is often desirable for user experiences. The advantages of our work are three-fold: a) we achieve a higher-fidelity reconstructio