ABSTRACT

Title of Dissertation: An Efficient Neural Representation for Videos
Hao Chen, Doctor of Philosophy, 2023
Dissertation Directed by: Abhinav Shrivastava, Department of Computer Science

With the increasing popularity of videos, it has become crucial to find efficient and compact ways to represent them for easier storage, transmission, and downstream video tasks. Our dissertation proposes an innovative neural representation for videos called NeRV, which stores each video implicitly as a neural network. Building on NeRV, we introduce a hybrid representation for videos called HNeRV, which improves internal generalization and representation capacity. HNeRV allows for highly efficient video representation and compression, with a model size that can be up to 1000 times smaller than the original raw video. Apart from efficiency, HNeRV's simple decoding process, which involves a single feedforward operation, enables fast video loading and easy deployment. To further improve efficiency, we develop an efficient neural video dataloader called NVLoader, which is 3-6 times faster than conventional video dataloaders. We also introduce the HyperNeRV framework to address encoding speed, which utilizes a hypernetwork to directly map input videos to NeRV model weights, resulting in an encoding process roughly 10^4 times faster. Aside from developing compact and implicit video neural representations, we explore several compelling applications, including frame interpolation, video restoration, and video editing. Furthermore, the compactness of these representations makes them an ideal output video format for video generation models, reducing the search space significantly. Additionally, they can serve as an efficient input for video understanding models.

An Efficient Neural Representation For Videos
by Hao Chen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Prof. Abhinav Shrivastava, University of Maryland (Chair/Advisor)
Prof. Furong Huang, University of Maryland
Prof. Behtash Babadi, University of Maryland
Prof. Ramani Duraiswami, University of Maryland
Prof. Saining Xie, New York University

Acknowledgments

I am deeply grateful to my advisor, Abhinav Shrivastava, for his exceptional guidance throughout my Ph.D. journey. His extensive knowledge and problem-solving skills have not only inspired me but also shaped my research in countless ways. I have been fortunate to work under his patient and kind mentorship, which has been invaluable in helping me navigate the challenges of research and life. Abhinav once said, "Hackers are people with the cleverest minds who always know how to solve problems," and he is a true embodiment of this definition. His wisdom and ability to solve even the most complex issues have been a constant source of motivation for me. I am thankful for everything Abhinav has taught me, and I aspire to become half as good as him through years of learning and practice. Abhinav's guidance and support have been critical to my success, and I will always be grateful for his exceptional mentorship.

I am incredibly grateful to the esteemed members of my thesis committee, Furong Huang, Behtash Babadi, Ramani Duraiswami, and Saining Xie, for the time and effort that they have invested in me. Ramani Duraiswami, who served on both my proposal and defense committee, demonstrated exceptional kindness and support, for which I am deeply appreciative.
Saining Xie, a distinguished young researcher in computer vision, has been a source of inspiration for my research. I feel incredibly fortunate to have had the opportunity to collaborate with him on the efficient neural video dataloader project. Finally, I would like to express my gratitude to Christopher Metzler for attending my preliminary exam and offering valuable feedback and constructive comments on my research.

I am grateful for my labmates, who have not only been great collaborators but also amazing friends. Yixuan Ren, Bo He, Hanyu Wang, Kamal Gupta, Max Enrich, Matthew Gwilliam, Alex Hanson, Shuaiyi Huang, Lillian Huang, Vatsal Agarwal, Sharath Girish, Chuong Huynh, Vinoj Jayasundara, Pulkit Kumar, Mara Levy, Shishira R Maiya, Soumik Mukhopadhyay, Khoi Pham, Nirat Saini, Saksham Suri, Archana Swaminathan, Gaurav Shrivastava, Matthew Walmer, and Luyu Yang have all contributed to my growth as a researcher and a person. I have learned a great deal from them, not only in research but also in the way they approach life. In particular, I want to express my appreciation to Bo He, Hanyu Wang, Matthew Gwilliam, and Yixuan Ren. Their support and contributions have been invaluable to the success of the NeRV series.

During my time at UMD, I had the privilege of collaborating with exceptional researchers, and I am grateful for these opportunities. Zuxuan Wu has been an outstanding mentor, providing invaluable guidance not only in research but also in career planning. Additionally, working with Xitong Yang and receiving guidance from Hengduo Li and Chen Zhu has been a great experience. I would also like to express my gratitude to the mentors I had during my internships. Ser-Nam Lim from Meta, Zhe Lin from Adobe, and Heng Wang, Binchen Liu, and Yizhe Zhu from TikTok have all provided me with indispensable insights that have shaped my research. Their mentorship and support have been instrumental in my growth as a researcher.

I would like to express my gratitude to the exceptional individuals who ignited my passion for deep learning and computer vision. Firstly, I am indebted to Prof. Guoyou Wang, my first advisor during my Master's degree, and Prof. Xiang Bai, who provided invaluable guidance and career advice during my study abroad. Additionally, I am thankful for the wonderful collaborators and mentors I had the privilege of working with during my internship at the Shenzhen Institutes of Advanced Technology (SIAT), including Prof. Yu Qiao, Prof. Yali Wang, Lei Zhou, Yulun Zhang, Yapeng Tian, Zhi Tian, Tong He, Kaiyang Zhou, and Xiaoxing Zeng.

Finally, I want to express my deep appreciation to my parents and all my family members, especially my adorable nephew Jiujiu, for their unwavering support throughout my journey. Without their support, none of this would have been possible.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction

Chapter 2: Background
  2.1 Video compression
    2.1.1 Video redundancy overview
    2.1.2 Image compression
    2.1.3 Traditional video coding
    2.1.4 Traditional video codec standards
    2.1.5 Learning-based video compression
    2.1.6 Application of video compression
  2.2 Implicit neural representation
  2.3 Deep neural networks

Chapter 3: NeRV: Implicit Neural Representations for Videos
  3.1 Introduction
  3.2 Related Work
  3.3 Neural Representations for Videos
    3.3.1 NeRV Architecture
    3.3.2 Model Compression
  3.4 Experiments
    3.4.1 Datasets and Implementation Details
    3.4.2 Main Results
    3.4.3 Video Compression
    3.4.4 Video Denoising
    3.4.5 Ablation Studies
  3.5 Discussion
  3.6 Experiment supplement
    3.6.1 NeRV Architecture
    3.6.2 Results on MCL-JCV dataset
    3.6.3 Implementation Details of Baselines
    3.6.4 Video Temporal Interpolation
    3.6.5 More Visualizations

Chapter 4: HNeRV: A Hybrid Neural Representation for Videos
  4.1 Introduction
  4.2 Related Work
  4.3 Method
    4.3.1 HNeRV overview
    4.3.2 Downstream tasks
  4.4 Experiments
    4.4.1 Dataset and Implementation Details
    4.4.2 Main Results
    4.4.3 Parameter Distribution Analysis
    4.4.4 Downstream Tasks
    4.4.5 Ablation study
  4.5 Conclusion
  4.6 Experiment supplement
    4.6.1 Video decoding
    4.6.2 Video compression
    4.6.3 Weight Pruning for Model Compression
    4.6.4 HNeRV architecture details
    4.6.5 Per-video compression results
Chapter 5: NVLoader: A Neural Video Dataloader for Efficient Data Loading
  5.1 Introduction
  5.2 Related Work
  5.3 Method
    5.3.1 Prepare NVDataset
    5.3.2 Data loading
  5.4 Experiment
    5.4.1 Datasets and implementation details
    5.4.2 Main results
    5.4.3 Action recognition
    5.4.4 Comprehensive comparison
    5.4.5 Other results
    5.4.6 Video model architectures
  5.5 Conclusion

Chapter 6: HyperNeRV: Towards Fast Learning of Video Neural Representation
  6.1 Introduction
  6.2 Related Work
  6.3 Method
    6.3.1 Overview
    6.3.2 HyperNeRV
    6.3.3 Efficient video neural representation
  6.4 Experiment
    6.4.1 Datasets and implementation details
    6.4.2 Main results
    6.4.3 Efficient neural representation
    6.4.4 Component analysis
    6.4.5 Discussion and limitations
  6.5 Conclusion
  6.6 Experiment supplement
    6.6.1 Ablation study
    6.6.2 More implementation details

Chapter 7: Conclusion
  7.1 Efficient video representation
  7.2 Downstream tasks based on neural representations
    7.2.1 Video compression
    7.2.2 Efficient video loading
    7.2.3 Video restoration
    7.2.4 Video editing
    7.2.5 Video understanding
    7.2.6 Video generation
  7.3 Future work and limitations
    7.3.1 Internal learning
    7.3.2 Scalable learning
    7.3.3 Limitations

List of Tables

2.1 Historical development of video codecs
3.1 Comparison of different video representations. Although explicit representations currently outperform implicit ones in encoding speed and compression ratio, NeRV shows a great advantage in decoding speed, and NeRV outperforms pixel-wise implicit representations in all metrics.
3.2 Comparison with pixel-wise implicit representations. Training speed means time/epoch, while encoding time is the total training time.
3.3 PSNR vs. epochs. Since video encoding of NeRV is an overfitting process, the reconstructed video quality keeps increasing with more training epochs. NeRV-S/M/L denote models of different sizes.
3.4 Decoding speed with BPP 0.2 for 1080p videos
3.5 PSNR results for video denoising. "Baseline" refers to the noisy frames before any denoising
3.6 Input embedding ablation. PE means positional encoding
3.7 Upscale layer ablation
3.8 Norm layer ablation
3.9 Activation function ablation
3.10 Loss objective ablation
3.11 NeRV architecture for 1920×1080 videos. Change the values of C1 and C2 to get models of different sizes.
4.1 HNeRV block vs. NeRV block. k is the kernel size for each stage, Cout and Cin are the output/input channels for each block. We decrease parameters via a small k = 1 for the first block, and increase parameters for later layers with a larger k and wider channels.
4.2 Video regression with different sizes
4.3 Video regression with different epochs
4.4 Video regression at resolution 960×1920, PSNR↑ reported
4.5 Video regression at resolution 480×960, PSNR↑ reported
4.6 Internal generalization results. NeRV, E-NeRV, and HNeRV use interpolated embedding as input, HNeRV† uses held-out frames as input. With content-adaptive embedding as input, HNeRV shows much better reconstruction on held-out frames
4.7 Analysis of parameter rebalancing
4.8 Video inpainting results. With 5 fixed box masks on input videos, we evaluate the output with PSNR↑. 'Input' is the baseline of masked video and ground truth
4.9 Kernel size (Kmin, Kmax) ablation (with r=1.2)
4.10 Channel reduction r ablation (with K=1,5)
4.11 Embedding spatial size ablation
4.12 Embedding dimension ablation
4.13 Decoding FPS ↑
4.14 Decoding time (s) ↓
4.15 HNeRV Decoding FPS
4.16 Compression results. "Size ratio" compares to the model with quantization only, and "Sparsity" indicates the amount of weights pruned.
4.17 HNeRV architecture details
5.1 Video loading speed (VPS) for video dataloaders based on H.264 videos, with different worker numbers.
5.2 Video loading speed (VPS) for NVLoader, with different GPU devices.
5.3 Top-1 error (%) with different frames.
5.4 Top-1 error (%) with different temporal strides.
5.5 Top-1 error (%) with different patch ratios.
5.6 Top-1 error (%) with different video models.
5.7 Comprehensive results on datasets of different resolutions: top-1 accuracy, average video size, and loading speed.
5.8 Video quality for NVLoaders, before (PSNR_orig) and after (PSNR_quant) quantization.
5.9 Total forward time (model forward + data loading) at testing time. Pure computation time acts as the 1× baseline. Data loading becomes a bottleneck, especially for efficient video models and large batches.
5.10 Average video size in NVLoader. 'Parameters' is the total parameter count of the video checkpoint (video decoder W_decoder and frame embedding D); 'video size' measures the video checkpoint in megabytes. Q means quantization and H means Huffman coding.
5.11 Generalization to other dataloaders, top-1 error (%) shown. We evaluate models with different sampling frames, which are trained and evaluated on the same or different video dataloaders.
5.12 Video model architectures. Strds_enc and Strds_dec are the stride lists used in the encoder and decoder. Size_enc and Size_dec are the parameter numbers for the encoder and decoder. d is the embedding dimension.
6.1 Variables and their definitions.
6.2 Encoding comparison for methods: NeRV [1] (trained from scratch), Trans-INR [2], and HyperNeRV (ours).
6.3 PSNR results for compact video representations.
6.4 Stronger training setups to obtain efficient video representations. PSNR is reported for training and test videos (UCF101, K400, SthV2, and Avg.).
6.5 Component analysis for HyperNeRV. 'Total size' is the number of all learnable parameters, 'Video model' is the parameters for the video model, 'Img-wise' is based on image-wise neural representation while Trans-INR [2] is based on pixel-wise neural representation. 'Act.' is the activation layer in the video model, Nmax is the maximum number of weight tokens, 'Train' and 'Avg.' are the average PSNR on the training and test set. 'VPS' is videos per second.
6.6 Data augmentations. 'rand ratios' crops the video with random aspect ratios between [0.67, 1.5]. 'rand size' randomly scales the video, between [0.8×, 1.25×], before cropping. 'rand aug' is random augmentation [3].
6.7 Ablation study for Nmax. Increasing Nmax from 128 to 256 does not improve the performance further.
6.8 Ablation for learning rate schedules.
6.9 Implementation details for HyperNeRV.

List of Figures

1.1 Framework of transform encoding for video compression.
1.2 Key evaluation metrics for efficient video representations.
1.3 The dissertation framework. a) implicit neural representation NeRV. b) hybrid neural representation HNeRV. c) fast learning of NeRV weights. d) downstream tasks based on NeRV.
2.1 Framework of video compression.
2.2 Redundancy for video compression.
2.3 Compression for different video frames.
2.4 Transform coding framework for video compression.
2.5 Application of video compression.
2.6 DenseNet architecture. Image obtained from [4].
2.7 An overview of Vision Transformer (on the left) and the details of the Transformer encoder (on the right). The architecture resembles Transformers used in the NLP domain and the image patches are simply fed to the model after flattening. After training, the feature obtained from the first token position is used for classification. Image obtained from [5].
3.1 (a) Conventional video representation as frame sequences. (b) NeRV, representing video as neural networks, which consist of multiple convolutional layers, taking the normalized frame index as input and outputting the corresponding RGB frame.
3.2 (a) Pixel-wise implicit representation taking pixel coordinates as input and using a simple MLP to output the pixel RGB value. (b) NeRV: image-wise implicit representation taking the frame index as input and using an MLP + ConvNets to output the whole image. (c) NeRV block architecture, which upscales the feature map by S.
3.3 NeRV-based video compression pipeline.
3.4 Model pruning. Sparsity is the ratio of parameters pruned.
3.5 Model quantization. Bit is the bit length used to represent a parameter value.
3.6 Compression pipeline showing how much each step contributes to the compression ratio.
3.7 PSNR vs. BPP on UVG dataset.
3.8 MS-SSIM vs. BPP on UVG dataset.
3.9 Video compression visualization. At similar BPP, NeRV reconstructs videos with better details.
3.10 Denoising visualization. (c) and (e) are denoising outputs for DIP [6]. Data generalization of NeRV leads to robust and better denoising performance since all frames share the same representation, while a DIP model overfits one model to one image only.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.11 Rate distortion plots on the MCL-JCV dataset. . . . . . . . . . . . . . . . . . . 49 3.12 Temporal interpolation results for video with small motion. . . . . . . . . . . . . 51 3.13 Denoising visualization. Left: Ground truth; Middle: Noisy input Right; NeRV output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.14 Video compression visualization. The difference is calculated by the L1 loss (absolute value, scaled by the same level for the same frame, and the darker the more different). “Bosphorus” video in UVG dataset, the residual visualization is much smaller for NeRV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Top) Hybrid neural representation with learnable and content-adaptive embedding (ours). Bottom) Video regression for hybrid and implicit neural representations. 55 4.2 a) HNeRV uses ConvNeXt blocks to encode frames as tiny embeddings, which are decoded by HNeRV blocks. b) HNeRV blocks consist of three layers: convolution, PixelShuffle, and activation (with input/output size illustrated). c) We demonstrate how to compute parameters for a given HNeRV block. d) Output size of each stage with strides 5,4,2,2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3 Video decoding. Left: HNeRV outperforms traditional video codecs H.264 and H.265, and learning-based compression method DCVC. Middle: HNeRV shows much better flexibility when decoding only a portion of video frames, where the decoding time decreases linearly for HNeRV while other methods still need to decode most frames. Right: HNeRV performs well for compactness (ppp), reconstruction quality (PSNR), and decoding speed (FPS). . . . . . . . . . . . . 65 4.4 Visualization of Embedding interpolation. . . . . . . . . . . . . . . . . . . . . 65 4.5 Visualization of video neural representations at 0.003 ppp, which means the total size is only about 0.3% of the original video size. On the left, we compare HNeRV to ground truth. On the right, we compare NeRV, E-NeRV, and HNeRV for 5 patches with discernible differences, indicated in the original frame by numbers and bounding boxes. For each patch, HNeRV preserves detail at a level of fidelity closer to the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . 66 4.6 Parameter distributions for decoder blocks. See table 4.7 for PSNR and MS-SSIM results with these 4 settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.7 Compression results on UVG dataset. . . . . . . . . . . . . . . . . . . . . . . . 70 4.8 Compression results of best/worst cases from UVG dataset. HNeRV achieves good performance especially for videos caputured by still cameras, like ‘honeybee’ video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.9 Inpainting results of fixed masks and object masks. Left) input frame; Right) HNeRV output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.10 Compression results averaged across all UVG videos, and for each specific videos. 76 xii 5.1 Comparison of video dataloaders based on H.264 videos, HEVC videos, JPEG frames, and NVLoader (ours). With similar video size, NVLoader load videos much more efficiently, measured by videos per second (VPS), without hurting accuracy for video recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . 
5.2 a) NVLoader framework; b) Video model for NVLoader: the NeRV block upscales the feature map to h × w, and the MLP expands channels from c to 3 × p × p. d is the frame embedding dimension, h × w is the patch number, p × p is the patch size. c) NeRV blocks used in the video model. S is the upscale factor.
5.3 Video loading speed. a) common video backends. Naive decoding loads videos one by one; the video dataloader uses 8 workers for parallel speedup; b) naive decoding on different devices; c) different frame counts for dataloader loading; d) different strides for dataloader loading; e) video resolutions for dataloader loading; f) patch ratios for dataloader loading (for 64 frames).
5.4 Comparison of video dataloaders based on H.264 videos, HEVC videos, and NVLoader (ours). Left: Something V2 dataset. Middle: UCF101 dataset. Right: Kinetics-400 dataset. The left y-axis is loading speed in videos per second; the right y-axis is top-1 accuracy.
5.5 Output video frames for NVLoader across datasets. NVLoader can reconstruct videos well and capture details faithfully, for either ones with dynamic scenes or with rich textures.
6.1 Encoding time comparison between HyperNeRV and training NeRV [1] from scratch. Encoding refers to generating a well-trained neural network for a given video. HyperNeRV eliminates the need for tedious fitting, enabling much faster video encoding.
6.2 Left: HyperNeRV takes videos as input and outputs NeRV model weights through the hypernetwork. Right: architecture details for the transformer hypernetwork (top) and the video model (bottom).
6.3 a) NeRV block: only the convolution layer has learnable parameters θ ∈ R^(Cout×Cin×K×K). b) Transformer hypernetwork: takes video patches x and initial weights θ0 as input tokens, and outputs video-specific weights θ̂′. c) Obtain the final model weights θ′ by an element-wise multiplication of shared weights θ1 and video-specific weights θ̂′.
6.4 Left: video ground truth. Right: HyperNeRV output. HyperNeRV can reconstruct various videos across different datasets, and capture video details with high fidelity, for dynamic scenes, complex textures, or moving objects.
6.5 Left: Trans-INR [2] output. Right: HyperNeRV output (ours). HyperNeRV shows much better reconstruction quality than Trans-INR [2], with more faithful details, sharper textures, and better visual preference.
7.1 Framework of efficient video representation.
7.2 Dissertation framework overview.
7.3 Downstream video tasks based on implicit neural representations.
7.4 Implicit neural representation is a compact input for video understanding and perfect for video generation due to its smaller size compared to original video data.
7.5 Some ongoing or potential projects based on NeRV.

Chapter 1: Introduction

In today's world, video has emerged as the dominant form of multimedia, and its popularity continues to increase.
However, the high-dimensional and intricate visual information in videos makes it challenging to efficiently represent them for storage, transmission, and downstream video-related tasks. To tackle this issue, various attempts have been made, with the transform encoding approach being the most popular. This approach transforms the input video into a compact embedding space that is much smaller than the original video, while maintaining high fidelity after reconstruction. We illustrate the framework in Figure 1.1. These methods can be broadly divided into two categories based on the chosen transform functions: traditional codecs with hand-crafted transforms and learning-based methods that employ deep neural networks. Traditional video compression methods like MPEG [7], H.264 [8], and H.265 [9] achieve good reconstruction results with decent decompression speeds. In contrast, learning-based methods [10, 11, 12, 13, 14, 15, 16] focus on replacing the entire compression pipeline or several components with deep learning tools, at varying levels of complexity.

Figure 1.1: Framework of transform encoding for video compression.

Despite efforts to improve video compression, traditional codecs and learning-based methods both have limitations. Traditional codecs often have suboptimal compression performance, while learning-based methods can be computationally expensive. As a result, a new approach is needed that can combine the strengths of both methods to enhance video compression. Recent approaches have tried to address this challenge by fine-tuning traditional codecs [17] and optimizing components of the compression pipeline [18].

This thesis aims to develop efficient implicit neural representations for videos (NeRV), where each video is represented as a deep neural network that can output the corresponding video frame given a frame index as input. Such implicit representations are appealing because they can represent a video with significantly fewer parameters and reconstruct it with high fidelity, effectively converting the video compression problem into a model compression problem. Building on NeRV, we propose a hybrid neural representation for videos (HNeRV) that pairs a small frame embedding with a powerful decoder network, resulting in improved internal generalization and representation capacity. With evenly distributed model parameters across layers, HNeRV significantly improves convergence speed compared to NeRV.

To evaluate the efficiency of neural representation methods for videos, we consider four key perspectives, as depicted in Figure 1.2. First and foremost, the compression ratio is the most critical metric to assess the efficiency of video representations. Additionally, the encoding speed to convert the original video to efficient representations and the decoding speed to reconstruct the video from such representations should be considered. Finally, an important but often overlooked aspect is the utilization of efficient video representations in downstream video tasks. While most current approaches still rely on the original frame sequences as input, these high-dimensional sequences significantly increase the computation burden for video-related tasks such as video understanding and generation.

Figure 1.2: Key evaluation metrics for efficient video representations.
Firstly, the use of implicit neural representations enables us to transform the video compression problem into a model compression problem. This approach allows us to achieve compression ratios comparable to other compression methods, with our methods showing superior compression ratios for videos with still backgrounds. In addition to compression, our implicit representations also provide a decoding advantage, as only a small neural network is required to fit one video. Moreover, the simple forward-pass decoding operation of HNeRV allows for easy deployment on any platform. To further enhance the efficiency of our methods, we developed an efficient neural video dataloader (NVLoader) that is approximately three times faster than conventional video dataloaders. This faster processing speed enables more efficient training and evaluation of video models.

In addition to compression and decoding speed, encoding speed remains a significant challenge for implicit neural representations due to the long and tedious training process. To address this issue, we introduce the HyperNeRV framework, which utilizes a hypernetwork to directly map input videos to NeRV model weights. This approach speeds up the encoding process by approximately 10^4 times, while achieving similar reconstruction quality and generalization to unseen videos compared to training the neural network from scratch.

Besides developing efficient implicit video representations and proposing the HyperNeRV framework, we explore several downstream applications based on these representations. Due to their compactness and efficiency, we have found that they perform well for tasks such as frame interpolation, video restoration, and video editing. Furthermore, we believe that these compact and implicit video representations have even more potential to be utilized in various other applications. For instance, they can be an ideal output video format that significantly reduces the search space, or serve as an efficient input for video understanding models. These representations can also be employed in diverse other applications, such as video summarization, action recognition, and content-based video retrieval, which we believe require further investigation in future research.

We present a summary of our dissertation framework in Figure 1.3. Firstly, we introduce NeRV, an implicit neural representation for video, in Figure 1.3a. We then introduce HNeRV, which uses a content-adaptive embedding to represent videos as hybrid representations, in Figure 1.3b. Next, we introduce HyperNeRV in Figure 1.3c, which enables fast learning of video neural representations. Finally, we list different downstream tasks in Figure 1.3d, which use the efficient video representations directly, i.e., the model weights. The ultimate goal of these video neural representations is to introduce a new perspective on video processing, similar to the Fourier transform for signal processing. By converting video into neural space, we can greatly advance the research and utilization of video data.

The rest of the dissertation consists of six chapters, each of which is explained separately. In Chapter 2, we provide a brief overview of the background knowledge relevant to this dissertation, covering three main topics: efficient video representations, implicit neural representations, and deep neural networks.
Specifically, we delve into the evolution of video compression methods and their applications in everyday life to provide a comprehensive understanding of the current state-of-the-art techniques in the field.

Figure 1.3: The dissertation framework. a) implicit neural representation NeRV. b) hybrid neural representation HNeRV. c) fast learning of NeRV weights. d) downstream tasks based on NeRV.

In Chapter 3, we introduce an implicit neural representation for videos called NeRV. We describe a novel image-wise approach where the neural network outputs one frame given a frame index as input. Compared to previous pixel-wise representations that output one pixel at a time, NeRV significantly improves the encoding/decoding speed and the quality of reconstructed videos. In addition to the basic video reconstruction task, we also present results for video compression and video denoising.

In Chapter 4, we present a hybrid neural representation for videos (HNeRV). We replace the content-agnostic frame index input with a content-adaptive embedding generated by an encoder. This change results in video data being represented by two parts: a large video decoder network and a small frame embedding. This hybrid representation improves internal generalization, such as video interpolation in the embedding space, and reconstruction capacity. Additionally, we propose an evenly-distributed model where the model parameters are distributed more evenly than in NeRV, leading to significant improvements in reconstruction capacity. We also explore video interpolation and video inpainting in this chapter.

In Chapter 5, we present an efficient neural video dataloader (NVLoader) that accelerates the typical data loading process for video research. Essentially, for each video in a dataset, we first fit a compact HNeRV model to it and save the model checkpoint. To load the video during training or testing, we simply load the model checkpoint and generate the video frames via a straightforward feed-forward operation. Because of the simplicity of our NVLoader, it can be easily deployed on any device, and it improves the video loading speed by 3-6 times compared to traditional data loading approaches.

In Chapter 6, we introduce the HyperNeRV framework, which uses a transformer hypernetwork to generate model weights directly. We train this hypernetwork on a large-scale video dataset to learn the mapping function between input video and model weights. With this approach, given a new video, the well-trained hypernetwork can output the model weights directly, eliminating the need for a tedious fitting process. As a result, HyperNeRV can speed up the encoding process by around 10^4 times compared to training the video model from scratch.

In Chapter 7, we provide a summary of potential downstream tasks based on neural video representations, including video compression, video restoration, efficient video loading, video understanding, and video generation. Furthermore, we summarize and conclude the dissertation by highlighting the contributions of each chapter and discussing potential future directions for research in this area.
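Before moving to the background material, the following minimal PyTorch sketch illustrates the image-wise idea behind NeRV and HNeRV: a small network maps a normalized frame index to a full RGB frame. The layer widths, positional-encoding settings, and output resolution here are illustrative assumptions, not the architectures used in the later chapters.

# Minimal NeRV-style sketch (illustrative only): map a normalized frame index
# t in [0, 1] to an RGB frame. Sizes and depths are assumptions.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Encode a scalar frame index with sin/cos at several frequencies."""
    def __init__(self, num_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs)) * math.pi)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, 1)
        angles = t * self.freqs                            # (B, num_freqs)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class NeRVBlock(nn.Module):
    """Conv -> PixelShuffle -> activation, upscaling the feature map by `scale`."""
    def __init__(self, c_in: int, c_out: int, scale: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * scale * scale, kernel_size=3, padding=1)
        self.up = nn.PixelShuffle(scale)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.up(self.conv(x)))

class TinyNeRV(nn.Module):
    """Frame index -> MLP -> small feature map -> conv upscale blocks -> RGB frame."""
    def __init__(self, num_freqs: int = 8, c: int = 64, h: int = 9, w: int = 16):
        super().__init__()
        self.c, self.h, self.w = c, h, w
        self.pe = PositionalEncoding(num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, 256), nn.GELU(),
            nn.Linear(256, c * h * w), nn.GELU(),
        )
        self.blocks = nn.Sequential(
            NeRVBlock(c, 48, scale=4),    # 9x16   -> 36x64
            NeRVBlock(48, 24, scale=2),   # 36x64  -> 72x128
            NeRVBlock(24, 12, scale=2),   # 72x128 -> 144x256
        )
        self.head = nn.Conv2d(12, 3, kernel_size=3, padding=1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, 1) in [0, 1]
        x = self.mlp(self.pe(t)).view(-1, self.c, self.h, self.w)
        return torch.sigmoid(self.head(self.blocks(x)))   # (B, 3, 144, 256)

# Encoding a video is plain regression: minimize a reconstruction loss between
# model(t) and frame t over all frames of one video (the "overfitting" step).
model = TinyNeRV()
frames = model(torch.rand(4, 1))   # four random normalized frame indices
print(frames.shape)                # torch.Size([4, 3, 144, 256])

In this view, the trained weights of such a network are the video representation, so storing or transmitting the video reduces to storing the (compressed) model checkpoint.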
Chapter 2: Background

We provide background knowledge in Chapter 2 for efficient video representations and their applications. Specifically, in Section 2.1, we discuss the evolution of video compression methods and their potential applications in various domains. In addition, we provide background knowledge for implicit neural representations in Section 2.2 and for deep neural networks in Section 2.3.

2.1 Video compression

We present the video compression pipeline in Figure 2.1, which is designed to decrease the video storage demand or speed up transmission. This is essential since the original video size can often be too large for storage and transmission purposes.

Figure 2.1: Framework of video compression.

2.1.1 Video redundancy overview

For video compression, there are three types of redundancy to remove: spatial redundancy, temporal redundancy, and perceptual redundancy. We illustrate them in Figure 2.2.

Figure 2.2: Redundancy for video compression.

The term spatial redundancy describes the occurrence of similar or identical information in adjacent pixels or regions within a single frame of a video. This leads to unnecessary data that can be compressed without significant loss of quality. In other words, when neighboring pixels or areas in a frame contain the same or similar information, it creates redundant data that can be removed.

In contrast, temporal redundancy refers to the redundancy between consecutive frames in a video sequence. Because adjacent frames in a video sequence are often very similar, much of the information in one frame can be predicted from the previous frame. Therefore, in video compression, temporal redundancy can be exploited by transmitting only the differences between frames rather than transmitting each frame in its entirety.

The human visual system's sensitivity to different aspects of a video signal is not uniform, which leads to perceptual redundancy. Thus, video compression algorithms can selectively reduce the information in less perceptually important areas. For instance, compressing a low-frequency color channel may not have a significant impact on overall image quality compared to compressing a high-frequency detail channel. By exploiting perceptual redundancy, video compression algorithms can significantly decrease the amount of data required to represent a video signal without significantly affecting perceptual quality.

While our dissertation primarily focuses on leveraging spatial and temporal redundancy to develop an efficient neural representation and a simple compression method for video data, we acknowledge that addressing perceptual redundancy based on our approach can lead to additional improvements and enhancements.

2.1.2 Image compression

Video compression is primarily based on image compression because of the spatial redundancy in video data. To address this redundancy, this section explores spatial redundancy and the background of image compression. The widely used image compression standard, JPEG [19], divides the input image into non-overlapping 8×8 blocks that are transformed into the frequency domain using block-DCT [20]. The transformed blocks' DCT coefficients are then compressed into a binary stream using quantization and entropy coding. The JPEG standard provides the essential transform and prediction modules for traditional visual compression.
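To make the block-DCT step concrete, the short sketch below applies a 2-D DCT and uniform quantization to a single 8×8 block with SciPy; the uniform quantization step is a simplification standing in for the standard JPEG quantization tables.

# JPEG-style transform coding on one 8x8 block (assumption: a uniform
# quantization step of 16 instead of the standard JPEG luminance table).
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 block (a gradient) so the DCT concentrates energy in a few coefficients.
block = np.add.outer(np.arange(8), np.arange(8)).astype(np.float64) * 8.0

# Forward: level-shift, 2-D DCT, uniform quantization of the coefficients.
coeffs = dctn(block - 128.0, norm="ortho")
q_step = 16.0
quantized = np.round(coeffs / q_step)      # most entries become 0 and are cheap to entropy-code

# Inverse: dequantize, inverse 2-D DCT, undo the level shift.
reconstructed = idctn(quantized * q_step, norm="ortho") + 128.0

print("nonzero coefficients:", int(np.count_nonzero(quantized)), "of 64")
print("reconstruction MSE:", float(np.mean((block - reconstructed) ** 2)))

The quantized coefficient matrix is mostly zeros, which is exactly what makes the subsequent entropy coding stage effective.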
Block-based image and video coding standards suffer from block-dependent compression, which limits parallelism on platforms like GPUs. Furthermore, the independent optimization strategy for each coding tool restricts performance improvement compared to end-to-end optimized compression. An alternative technological development trajectory, based on neural network techniques for image and video compression, is emerging.

The resurgence of neural networks has significantly advanced traditional image and video compression by leveraging Convolutional Neural Networks (CNNs). The first approach was proposed by Cui et al. [21], using an intra-prediction convolutional neural network (IPCNN) to refine the prediction of the current block by leveraging neighboring reconstructed blocks as additional context. Li et al. [22] proposed a fully connected network (IPFCN) as a new intra prediction mode, which achieved obvious bitrate savings but at the cost of extremely high complexity. Li et al. also explored using CNN-based down/up-sampling techniques as a new intra prediction mode for HEVC, which achieved coding gains, particularly at low bitrates. Additionally, several attempts have been made at CNN-based chroma intra prediction, such as [23, 24], utilizing both the reconstructed luma block and neighboring chroma blocks to improve intra chroma prediction efficiency.

The image and video compression community has taken a step forward by introducing end-to-end optimization frameworks based on deep neural networks. Deep neural networks have been successful due to back-propagation and gradient descent, which require differentiability of the loss function with respect to the trainable parameters. However, directly incorporating a CNN model into end-to-end image compression is challenging due to the quantization operation. The quantization module produces zero gradients almost everywhere, preventing the parameters from updating in the CNN. Additionally, the learning objective must be a differentiable loss function. In 2016, Ballé et al. introduced the first end-to-end optimized CNN framework for image compression under the scalar quantization assumption [25]. To handle the zero derivatives resulting from quantization, additive i.i.d. uniform noise was used to simulate quantization in the CNN training procedure, enabling gradient descent for neural network optimization. This method outperformed JPEG2000 in terms of both PSNR and MS-SSIM metrics. Later, [26] extended this end-to-end framework with a scale hyperprior, resulting in better compression performance. Many other attempts [27, 28, 29, 30, 31] have been made to advance image compression using neural networks.
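The uniform-noise trick mentioned above can be written in a few lines. The sketch below shows two common training-time surrogates for rounding, additive uniform noise (as in Ballé et al.) and a straight-through estimator, applied to a toy latent tensor rather than the latent of a real compression network.

# Training-time quantization surrogates for learned compression
# (the latent tensor is a stand-in for the output of an analysis network).
import torch

def noisy_quantize(y: torch.Tensor) -> torch.Tensor:
    """Additive i.i.d. uniform noise in [-0.5, 0.5): differentiable proxy for rounding."""
    return y + torch.empty_like(y).uniform_(-0.5, 0.5)

def ste_quantize(y: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: round in the forward pass, identity gradient backward."""
    return y + (torch.round(y) - y).detach()

y = torch.randn(1, 8, 4, 4, requires_grad=True)   # toy latent
y_train = noisy_quantize(y)                        # surrogate used during training
y_eval = torch.round(y.detach())                   # hard rounding at test time

# Gradients flow through the surrogate, unlike through torch.round alone.
y_train.sum().backward()
print(bool(y.grad.abs().sum() > 0))                # True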
Generative Adversarial Networks (GANs) are a popular deep learning technique that involves training a generator and a discriminator network simultaneously. In image compression, GANs have been used to improve the subjective quality of decoded images. For instance, Rippel and Bourdev [32] proposed an integrated GAN-based image compression method that achieved significant improvements in compression ratio and enhanced the subjective quality of reconstructed images. However, GAN-based compression has been successful only for narrow-domain images, such as faces, and more research is needed to establish models for general natural images.

Video coding typically involves two types of frames, illustrated in Figure 2.3: keyframes (also known as intra frames) and inter frames. A keyframe is a fully encoded frame that contains all the necessary information to reconstruct an image. It is coded independently of any other frames and serves as a reference for inter frames. Keyframes are often used as a starting point for video playback and can be decoded without relying on any other frames. Popular video coding standards such as MPEG [33], MPEG-2, H.264/AVC [8], and HEVC [34] can directly apply image compression methods to keyframes. Inter frames, on the other hand, only contain the changes from a previously encoded frame (either a keyframe or another inter frame) and are coded based on motion estimation and compensation. Because inter frames rely on previously encoded frames, they are usually much smaller than keyframes and can achieve higher compression ratios. However, inter frames cannot be decoded independently and require reference frames for reconstruction. The combination of keyframes and inter frames is commonly used in modern video coding standards such as H.264/AVC and HEVC to achieve efficient compression and high video quality.

Figure 2.3: Compression for different video frames.

2.1.3 Traditional video coding

Video coding is a fundamental process that compresses videos to enable efficient transmission and storage. It typically involves two techniques: entropy coding for lossless compression towards the Shannon limit, and lossy coding for removing redundant and less significant data in video. Although entropy coding can only achieve moderate compression ratios due to the Shannon limit, lossy compression is generally more effective because the human visual system can tolerate some loss of detail. The video coding process involves an encoder that converts video into a compressed format, and a decoder that restores the compressed video back to an uncompressed format. Together, these components form a codec (encoder/decoder), as illustrated in Figure 2.1. Video coding plays a critical role in transmitting and storing video content efficiently while minimizing the impact on image quality. A standard video encoder typically consists of three primary components: (i) a predictive coding unit, (ii) a transform coding unit, and (iii) an entropy coding unit.

Figure 2.4: Transform coding framework for video compression.

(i) Predictive coding. The predictive coding unit is a crucial component of video coding that exploits both temporal (inter-prediction) and spatial (intra-prediction) redundancies in a video sequence. This is achieved through two methods: motion estimation (ME) and motion compensation (MC). ME involves finding a matching region in the reference frame that corresponds to a block in the current frame, while MC involves determining the difference (residual) between the matching region and the target region. This generates residuals and motion vectors that help to achieve high compression ratios while maintaining a high level of video quality.
To create residuals, the encoder subtracts the prediction from the actual current frame, while the motion vector is generated by computing the offset between the current block and the position of the candidate region. The motion vector indicates the direction of movement of the block. By using predictive coding, the encoder can reduce redundancies and transmit only the necessary information, resulting in more efficient video compression.

(ii) Transform coding. Transform coding is a crucial step in video compression that converts blocks of residual samples into a set of coefficients, each of which represents a weight for a standard basis pattern. These coefficients are then fed into a quantizer, which produces reduced-precision yet bit-saving quantized coefficients. One of the most commonly used transform coding techniques is the discrete cosine transform (DCT), which was developed in 1974. In the H.264 video coding standard, transform coding is used to convert a block of residual samples into DCT coefficients. By reducing the dependency between sample points, transform coding enables more efficient compression of the video data. The encoder can achieve high compression ratios by utilizing transform coding, while maintaining the video's visual quality. This reduction in data leads to improved storage and transmission efficiency, making transform coding a critical component in video compression.

(iii) Entropy coding. After predictive coding and transform coding, the video data is still not fully compressed. Entropy coding is the final stage in video coding that produces a compact and efficient bit stream for storage and transmission. It compresses the residual signals and the quantized transform coefficients generated by the previous stages. Entropy coding techniques, such as variable length coding (VLC), arithmetic coding, and Huffman coding, assign shorter codes to more frequently occurring symbols and longer codes to less frequent symbols. For instance, in Huffman coding, the most common symbols are assigned shorter codes, while the less common symbols are assigned longer codes. The motion vectors are also entropy coded separately using a VLC table. By using entropy coding, the average bit rate of the encoded video stream can be further reduced, leading to higher compression ratios and improved storage and transmission efficiency.

Video decoding. The video decoding process, as shown in the bottom part of Figure 2.4, works in reverse order of the encoder. First, the entropy decoder recovers the prediction parameters and coefficients from the compressed bit stream. Then, the spatial decoder uses these parameters to reconstruct the residual frame. Finally, the prediction decoder uses the reconstructed pixels and the parameters to reconstruct the original frame, which is then displayed to the viewer. The decoding process plays a crucial role in video playback performance since it needs to be performed in real time. To achieve this, modern video codecs use parallel processing and specialized hardware to improve decoding efficiency. Additionally, video decoding may also involve error resilience techniques to mitigate errors introduced during storage or transmission, such as error concealment and error resilience coding.
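To make the ME/MC step concrete, the sketch below performs exhaustive block matching for a single block in NumPy; the 16×16 block size, ±8 pixel search range, and SAD matching cost are common textbook choices rather than the settings of any particular standard.

# Exhaustive block-matching motion estimation for one block
# (assumptions: 16x16 block, +/-8 pixel search range, SAD cost).
import numpy as np

def match_block(cur, ref, top, left, block=16, search=8):
    """Find the motion vector (dy, dx) minimizing SAD for one block of `cur` in `ref`."""
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            candidate = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - candidate).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)     # previous (reference) frame
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))                # current frame: shifted content

(dy, dx), sad = match_block(cur, ref, top=24, left=24)
prediction = ref[24 + dy:24 + dy + 16, 24 + dx:24 + dx + 16]  # motion-compensated prediction
residual = cur[24:40, 24:40].astype(np.int32) - prediction    # near-zero residual is cheap to code
print((dy, dx), sad, int(np.abs(residual).sum()))             # (-2, 3) 0 0: only mv + residual are sent

Real encoders speed this search up with hierarchical or diamond search patterns and sub-pixel refinement, but the principle of sending a motion vector plus a small residual is the same.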
2.1.4 Traditional video codec standards

As the most common video codecs, we briefly review the techniques used in H.264 and HEVC.

AVC (H.264) encoding. The H.264 encoder operates on macroblocks, which are 16×16 pixel units. Inter-prediction is performed by utilizing a range of block sizes (from 16×16 to 4×4) to predict pixels in the current frame from similar regions in previously encoded frames. Intra-prediction uses the same range of block sizes to predict the macroblock from the previously encoded pixels within the same frame. The encoder then obtains a residual by subtracting the prediction from the current macroblock. The residual samples are transformed using a 4×4 or 8×8 integer transform, resulting in a set of DCT coefficients. The coefficients and other information are quantized and coded into bit streams using entropy coding.

HEVC (H.265) encoding. The HEVC encoder follows a similar structure to H.264, utilizing inter/intra prediction and transform coding. Each frame of the input video sequence is divided into block-shaped regions called coding tree units (CTUs). A CTU can be of size 64×64, 32×32, or 16×16 and is organized in a quad-tree form to further partition into smaller coding units (CUs). In HEVC, the first picture of the video sequence is coded using only intra-picture prediction, and all remaining pictures are coded using inter-picture predictive coding. Each CU can be predicted via intra-prediction or inter-prediction, and the prediction residual is coded using block transforms. The entropy coding module uses context-adaptive binary arithmetic coding (CABAC). The decoding process is the inverse of the encoding process.

Decoding. The decoding process starts by extracting the quantized, transformed coefficients and the prediction information from the bit stream. The decoder then rescales the coefficients to restore each block of the residual data. These blocks are combined together to form a residual macroblock for frame reconstruction. The decoder then adds the prediction to the decoded residual to reconstruct a decoded macroblock. Finally, the decoded macroblocks are combined to reconstruct the original video frame.

Table 2.1: Historical development of video codecs

  Standard                 | Year      | Features
  MPEG family
  MPEG-1 part-2            | 1993      | Video and audio storage on CD-ROMs
  MPEG-2 part-2            | 1995      | HDTV and video on DVDs
  MPEG-4 part-2 (visual)   | 1999      | Low bit-rate multimedia on mobile platforms
  MPEG-4 part-10 (AVC)     | 2003      | Co-published with H.264/AVC
  H.26X family
  H.120                    | 1984      | The first digital video coding standard
  H.261                    | 1988      | Developed for video conferencing over ISDN
  H.262                    | 1995      | See MPEG-2 part-2
  H.263/H.263+             | 1996/1998 | Improved quality over H.261 at lower bit rates
  H.264/AVC                | 2003      | Significant quality improvement with lower bit rates
  H.265/HEVC               | 2013      | 50% bit-rate savings compared with H.264
  H.266/VVC                | 2020      | 50% bit-rate savings compared with H.265

Table 2.1 summarizes the historical development of video codecs. We provide a brief summary of each codec's features below, followed by a short encoding example:

• MPEG-1. Developed for video and audio storage on CD-ROMs; supports YUV 4:2:0 with a resolution of 352×288; lossless motion vectors.

• MPEG-2. Supports HDTV and video on DVDs; introduction of profiles and levels; nonlinear quantization and data partitioning.

• MPEG-4 part-2 (visual). Supports low bit-rate multimedia applications on mobile platforms; shares a subset with H.263; supports object-based or content-based coding.

• H.261. Developed for video conferencing over ISDN; block-based hybrid coding with integer-pixel motion compensation; supports CIF and QCIF resolutions.

• H.263 / H.263+. Improved quality over H.261 at a lower bit rate; shares a subset with MPEG-4 part-2.

• H.264 AVC. Supports video on the Internet, computers, mobile devices, and HDTVs; significantly improves quality at lower bit rates; increased computational complexity; improved motion compensation with variable block sizes, multiple reference frames, and weighted prediction.

• H.265 HEVC. Supports ultra HD video up to 8K with frame rates up to 120 fps; greater flexibility in prediction modes and transform block sizes; parallel processing; 50% bit-rate savings compared with H.264 for the same video quality.

• H.266 VVC. Provides about 50% better compression for the same perceptual quality, with support for lossless and subjectively lossless compression; supports resolutions ranging from very low resolution up to 4K and 16K, as well as 360° videos.
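As a practical illustration of how these standards are typically invoked, the snippet below calls ffmpeg (assumed to be installed with the libx264 and libx265 encoders) from Python; the file names and CRF quality settings are illustrative placeholders.

# Encode a video with H.264 and HEVC via ffmpeg (assumed installed with
# libx264/libx265). File names and CRF values are placeholders.
import subprocess

def encode(src: str, dst: str, codec: str, crf: int) -> None:
    """Re-encode `src` into `dst` with a constant-rate-factor quality target."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-c:v", codec,           # "libx264" (H.264/AVC) or "libx265" (H.265/HEVC)
         "-crf", str(crf),        # lower CRF = higher quality and larger file
         "-preset", "medium",     # speed/compression trade-off
         "-an", dst],             # drop audio to keep the example minimal
        check=True,
    )

encode("input.mp4", "out_h264.mp4", "libx264", crf=23)
encode("input.mp4", "out_h265.mp4", "libx265", crf=28)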
Supports video on the Internet, computers, mobile devices, and HDTVs; significantly improves quality at lower bit rates; increased computational complexity; improved motion compensation with variable block sizes, multiple reference frames, and weighted prediction.
• H.265 HEVC. Supports ultra-HD video up to 8K with frame rates up to 120 fps; greater flexibility in prediction modes and transform block sizes; parallel processing; 50% bit-rate savings compared with H.264 for the same video quality.
• H.266 VVC. Provides about 50% better compression for the same perceptual quality, with support for lossless and subjectively lossless compression; supports resolutions ranging from very low resolution up to 4K and 16K, as well as 360° videos.

2.1.5 Learning-based video compression

Traditional video compression algorithms, such as H.265 and H.266, rely on hand-crafted motion estimation and motion compensation techniques, such as block-based motion estimation, to achieve inter-frame prediction. While these methods reduce temporal redundancy in video data, they cannot be optimized end-to-end together with the neural networks developed for machine vision tasks, such as action recognition, on large-scale training datasets.

Recent advances in neural image compression have led to the development of neural video codecs. The pioneering work of DVC [35] follows a residual coding-based framework similar to traditional codecs. It first generates motion-compensated predictions and then encodes the residual using a hyperprior [26]. With the help of an autoregressive prior [29], DVCPro achieves even higher compression ratios. Recent research in neural video codecs has focused on improving the motion estimation and residual coding-based framework. Some works have proposed advanced network structures to generate optimized residuals or motion. For instance, Yang et al. [36] adaptively scaled the residual using learned parameters, while Agustsson et al. [37] proposed using optical flow estimation in scale space to reduce residual energy in fast-motion areas. Hu et al. [38] applied rate-distortion optimization to improve motion coding, and Hu et al. [39] used deformable compensation to enhance feature-space prediction. Lin et al. [40] proposed using multiple reference frames to reduce residual energy, and in [40, 41], motion prediction was introduced to improve motion coding efficiency.

In addition to residual coding, researchers have explored other coding frameworks for neural video codecs. One such framework is the 3D autoencoder [42, 43, 44], which encodes multiple frames simultaneously and is an extension of neural image codecs. However, this approach can introduce significant encoding delay and may not be suitable for real-time scenarios. Another emerging framework is conditional coding, which has a lower or equal entropy bound compared to residual coding [45]. For example, Ladune et al. [45, 46, 47] used conditional coding to code the foreground contents, while in DCVC [48], the condition is an extensible high-dimensional feature instead of the predicted frame. To further boost the compression ratio, recent work has introduced feature propagation and multi-scale temporal contexts [49].
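To make the residual-coding framework concrete, the sketch below predicts the current frame from the previous reconstruction and passes only the residual through a tiny autoencoder whose latent would be quantized and entropy coded. This is a minimal illustration, not the actual DVC implementation: the identity "motion compensation", the layer sizes, and the hard rounding are all placeholders.

```python
# Minimal sketch of residual coding in a DVC-style neural codec; the networks
# and the identity "motion compensation" below are illustrative placeholders.
import torch
import torch.nn as nn

class ResidualAE(nn.Module):
    """Tiny autoencoder standing in for the residual encoder/decoder."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 4, 2, 1))
        self.dec = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(ch, 3, 4, 2, 1))

    def forward(self, residual):
        latent = torch.round(self.enc(residual))    # hard quantization (training would use a soft proxy)
        return self.dec(latent), latent

def code_frame(curr, prev_rec, ae):
    pred = prev_rec                                 # placeholder for motion-compensated prediction
    rec_residual, latent = ae(curr - pred)          # encode only the residual
    return pred + rec_residual, latent              # reconstruction + symbols to entropy code

frames = torch.rand(4, 3, 64, 64)                   # a tiny random "video": T x C x H x W
ae, rec = ResidualAE(), frames[0:1]                 # assume the key frame is already coded
with torch.no_grad():
    for t in range(1, 4):
        rec, latent = code_frame(frames[t:t + 1], rec, ae)
```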
Most existing neural video codecs prioritize the optimization of the latent embedding and the network design. Previous research has largely concentrated on temporal correlation: works such as [13, 48, 49, 50] employ techniques such as temporal context priors, conditional entropy coding, or recurrent entropy models to explore this direction.

2.1.6 Application of video compression

Video compression is a critical need for many real-time applications, as depicted in Figure 2.5. With the advent of the internet, service providers offer cheap, high-speed bandwidth, leading to an explosion of data, a vast amount of which consists of videos. Storing all this data requires significant space, making it difficult to manage. To address this challenge, efficient video compression techniques are essential. This section explores various applications where such techniques are useful.

Figure 2.5: Applications of video compression: video streaming, video conferencing, social media, and surveillance video.

Video streaming. Video streaming has revolutionized the way users consume video content over the internet in real time, without downloading the entire video file. With the increasing availability of high-speed internet connections and the growing popularity of online video content, video streaming has become an essential part of our daily lives. However, the large size of video files poses a significant challenge in transmitting them quickly and efficiently over the internet. To overcome this challenge, video compression techniques have been developed to reduce the size of video files while preserving their quality. Video compression algorithms can significantly reduce the amount of data that needs to be transmitted over the internet, making it possible to stream high-quality video content in real time. As internet speeds continue to increase and video streaming grows in popularity, video compression techniques will become even more critical in delivering high-quality video content to users worldwide.

Video conferencing. Video compression is an essential application in video conferencing, allowing individuals and businesses to connect remotely without having to worry about slow or interrupted connections. Video compression works by reducing the amount of data needed to transmit a video stream over the internet while maintaining a high-quality image. This makes video conferencing accessible to a wider audience, including those with low-bandwidth internet connections. The use of video compression has become increasingly important as remote work and distance learning have become more prevalent. Without video compression, video conferencing would be prohibitively expensive and only accessible to those with high-speed internet connections. Video compression algorithms allow for real-time transmission of high-quality video, making video conferencing an effective communication tool for businesses, schools, and individuals.

Social media. Social media platforms have become an essential means of communication worldwide, and video content is increasingly becoming the most popular form of media. However, transmitting video content over the internet can be challenging due to the large file sizes involved. As a result, video compression has emerged as a crucial application for social media platforms. It allows users to share and view videos without worrying about slow or interrupted connections, and it makes it easier for platforms to store and transmit video content, as well as for users to upload and view videos without experiencing delays or buffering.
Without video compression, social media platforms would struggle to keep up with the demand for video content and to provide a seamless user experience. Moreover, video compression has made it possible for social media platforms to incorporate live video streaming, which has become increasingly popular in recent years. Live video streaming allows users to broadcast events and experiences in real time, connecting people from all over the world. Video compression algorithms play a critical role in making this technology accessible, allowing for real-time transmission of high-quality video over low-bandwidth internet connections. As video content continues to grow in popularity, video compression will remain essential for social media platforms to meet the increasing demand and to make that content accessible to a wider audience.

Surveillance video. Video surveillance has become ubiquitous in today's world, with cameras being used for security purposes in various settings. However, the storage and transmission of the large amounts of video data generated by surveillance systems pose a significant challenge. Video compression has emerged as a critical application in this context, allowing for more efficient storage and transmission of video data. Video compression algorithms play a vital role in enabling real-time transmission of high-quality video over low-bandwidth internet connections, making it possible for video surveillance to take place in remote areas with limited internet access. Additionally, the use of video compression in surveillance brings many benefits, including lower storage costs, increased efficiency, and improved accessibility. As technology continues to evolve, we can expect video compression algorithms to become even more efficient, enabling higher-quality video to be transmitted and stored at even lower cost. This will enable businesses and individuals to improve their security measures and ensure that video surveillance remains a viable and effective means of keeping people safe.

2.2 Implicit neural representation

Recent developments in deep learning have led to the emergence of implicit representations, which are compact data representations [1, 51, 52, 53] that fit a deep neural network to signals such as images, 3D shapes, and videos. One of the main branches of implicit representations is coordinate-based neural representations, which take pixel coordinates as input and output corresponding values, such as density or RGB values, using an MLP network. These representations have shown promising results in a range of areas, including image reconstruction [54, 55], image compression [52], continuous spatial super-resolution [56, 57, 58, 59], shape regression [60, 61], and 3D view synthesis [62, 63]. To improve coordinate-based representations, several approaches have been proposed, such as using sine activation functions instead of ReLU [64] or converting input coordinates to a Fourier feature space [65].
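As a concrete illustration, the sketch below follows the pattern of such coordinate-based representations: an MLP with a Fourier-feature embedding that maps (x, y) pixel coordinates to RGB values. The layer sizes and the number of frequency bands are illustrative rather than taken from any specific method.

```python
# Minimal sketch of a coordinate-based implicit representation: a Fourier-feature
# embedding followed by an MLP that maps (x, y) in [0, 1]^2 to an RGB value.
import math
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        # Fixed frequency bands for the Fourier-feature embedding (illustrative).
        self.freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * math.pi
        in_dim = 2 * 2 * num_freqs                       # sin/cos for each of x and y
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, xy):                               # xy: (N, 2) pixel coordinates
        proj = xy[..., None] * self.freqs                # (N, 2, num_freqs)
        feat = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)
        return self.mlp(feat)                            # (N, 3) RGB values

# Fitting an image amounts to regressing RGB values at sampled coordinates.
model = FourierMLP()
rgb = model(torch.rand(4096, 2))
print(rgb.shape)                                         # torch.Size([4096, 3])
```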
2.3 Deep neural networks

Through interdisciplinary research between neuroscience and mathematics, the neural network (NN) was invented and has shown strong abilities for non-linear transformation and classification. Intuitively, a network consists of multiple layers of simple processing units called neurons (perceptrons), which interact with each other via weighted connections. Neurons are activated through weighted connections from previously activated neurons. To achieve non-linearity, activation functions are applied to all intermediate layers [66]. The learning procedure for the simple perceptron was proposed and analyzed in the 1960s. During the 1970s and 1980s, the backpropagation procedure [67, 68], inspired by the chain rule for derivatives of the training objective, was proposed to solve the training problem of the multi-layer perceptron (MLP). Since then, multi-layer architectures have mostly been trained by stochastic gradient descent with backpropagation, although it is computationally intensive and can suffer from bad local minima. However, the dense connections between adjacent layers make the number of model parameters grow quadratically, which limited the computational efficiency of neural networks. With the introduction of parameter sharing around 1990 [69], a more lightweight type of neural network, the convolutional neural network (CNN), was proposed and applied to document recognition, making large-scale neural network training possible.

Over the last decade, many CNN architectures have been presented [70, 71]. Model architecture is a critical factor in improving performance across different applications, and various modifications have been made to CNN architectures from 1989 until today, including structural reformulation, regularization, and parameter optimization. It should be noted, however, that the key upgrades in CNN performance occurred largely due to the reorganization of processing units and the development of novel blocks, in particular the use of network depth. In this section, we review the most popular architectures, beginning with the AlexNet model in 2012 and ending with the Vision Transformer. Studying the features of these architectures (such as input size, depth, and robustness) helps researchers choose a suitable architecture for their target task.

AlexNet. The history of deep CNNs began with the appearance of LeNet [72]. At that time, CNNs were restricted to handwritten digit recognition tasks and could not be scaled to all image classes. Among deep CNN architectures, AlexNet [73] is highly regarded, as it achieved groundbreaking results in image recognition and classification. Krizhevsky et al. [73] first proposed AlexNet and improved the learning ability of CNNs by increasing their depth and applying several parameter optimization strategies. The learning ability of deep CNNs was limited at the time by hardware restrictions; to overcome these limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization across image resolutions, overfitting was the main drawback associated with it. Krizhevsky et al. used Hinton's idea to address this problem [74].
To make the learned features more robust, Krizhevsky et al. randomly dropped units during training (dropout). Moreover, ReLU [75] was utilized as a non-saturating activation function that reduces the vanishing gradient problem and enhances the rate of convergence [76]. Local response normalization and overlapping subsampling were also used to improve generalization by decreasing overfitting. Compared with previous networks, further modifications included the use of large filters (5×5 and 11×11) in the earlier layers. AlexNet has had considerable influence on subsequent CNN generations and began an innovative research era in CNN applications.

VGGNet. After CNNs were shown to be effective for image recognition, a simple and efficient design principle for CNNs was proposed by Simonyan and Zisserman. This design was called VGG, after the Visual Geometry Group. A multi-layer model [77], it featured up to nineteen layers, considerably deeper than AlexNet, to study the relationship between network depth and representational capacity. VGG showed experimentally that stacks of small filters can produce the same effect as large filters: the stacked small filters make the receptive field similarly effective to large filters (7×7 and 5×5). Using small filters also provides the additional advantage of lower computational complexity by reducing the number of parameters. These outcomes established a new research trend of working with small filters in CNNs. In addition, VGG regulates network complexity by inserting 1×1 convolutions between the convolutional layers, which learn a linear combination of the resulting feature maps. VGG obtained significant results for localization problems and image classification. Although it did not achieve first place in the 2014 ILSVRC competition, it acquired a reputation due to its depth, homogeneous topology, and simplicity. However, VGG's computational cost was excessive due to its roughly 140 million parameters, which was its main shortcoming.

ResNet. He et al. [78] developed ResNet (Residual Network), the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected previous networks. Several variants of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common variant is ResNet-50, which comprises 49 convolutional layers plus a single fully connected layer, with 25.5M weights and about 3.9 billion MACs. The novel idea of ResNet is its use of the bypass pathway concept, previously employed in Highway Networks (2015) to ease the training of deeper networks: a conventional feedforward network plus a residual connection.
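The residual connection itself is simple to write down; the following is a minimal sketch of a ResNet-style block with illustrative channel counts (not the exact ResNet-50 bottleneck design).

```python
# Minimal sketch of a residual (bypass) block: the block learns F(x) and
# outputs F(x) + x, so gradients can flow through the identity path.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # identity shortcut: the bypass pathway

x = torch.rand(1, 64, 32, 32)
print(ResidualBlock()(x).shape)              # torch.Size([1, 64, 32, 32])
```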
Figure 2.6: DenseNet architecture. Image obtained from [4].

DenseNet. To solve the problem of the vanishing gradient, DenseNet [4] was presented, following the same direction as ResNet and the Highway network [79]. One of the drawbacks of ResNet is that it preserves information through additive identity transformations, so several layers may contribute very little or no new information. In addition, ResNet has a large number of weights, since each layer has its own isolated set of weights. DenseNet addresses these problems with improved cross-layer connectivity: each layer is connected to all following layers in a feed-forward fashion, so the feature maps of all preceding layers are used as inputs to every subsequent layer. DenseNet thus demonstrates the influence of dense cross-layer connectivity. Because DenseNet concatenates the features of the preceding layers rather than adding them, the network gains the ability to discriminate clearly between added and preserved information. However, due to its narrow layers and the growing number of feature maps, DenseNet can become expensive in parameters. The direct access of all layers to the gradients from the loss function improves information flow across the network and has a regularizing effect, which reduces overfitting on tasks with small training sets. Figure 2.6 shows the DenseNet architecture.

Vision Transformer. Transformer architectures are based on a self-attention mechanism that learns the relationships between elements of a sequence. As opposed to recurrent networks, which process sequence elements recursively and can only attend to short-term context, Transformers can attend to complete sequences, thereby learning long-range relationships. The Vision Transformer (ViT) [80] (Figure 2.7) is the first work to show that Transformers can 'altogether' replace standard convolutions in deep neural networks on large-scale image datasets. It applies the original Transformer model [81] (with minimal changes) to a sequence of image patches flattened as vectors. The model was pre-trained on a large proprietary dataset (the JFT dataset [82] with 300 million images) and then fine-tuned on downstream recognition benchmarks, e.g., ImageNet classification. This pre-training step is important: pre-training ViT on a medium-sized dataset does not give competitive results, because CNNs encode prior knowledge about images (inductive biases such as translation equivariance) that reduces the need for data, whereas Transformers must discover such information from very large-scale data. DeiT [83] is the first work to demonstrate that Transformers can be trained on mid-sized datasets (1.2 million ImageNet examples, compared to the 300 million JFT images used by ViT) with relatively shorter training schedules. Besides using the augmentation and regularization procedures common for CNNs, the main contribution of DeiT is a novel native distillation approach for Transformers, which uses a CNN (RegNetY-16GF [84]) as a teacher model to train the Transformer.

Figure 2.7: An overview of Vision Transformer (left) and the details of the Transformer encoder (right). The architecture resembles Transformers used in the NLP domain, and the image patches are simply fed to the model after flattening. After training, the feature obtained from the first token position is used for classification. Image obtained from [5].
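The patch-based tokenization used by ViT is straightforward to sketch; the block below shows the idea with an illustrative patch size and embedding dimension, and it omits position embeddings and the classification head for brevity.

```python
# Minimal sketch of ViT-style patch embedding: split the image into
# non-overlapping patches, project each patch to a token, prepend a class
# token, and run a small Transformer encoder (position embeddings omitted).
import torch
import torch.nn as nn

patch, dim = 16, 192                                              # illustrative settings
img = torch.rand(1, 3, 224, 224)

to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # one linear projection per patch
tokens = to_tokens(img).flatten(2).transpose(1, 2)                # (1, 196, 192)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))                  # learned class token
seq = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)     # (1, 197, 192)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
features = encoder(seq)[:, 0]                                     # first-token feature used for classification
print(features.shape)                                             # torch.Size([1, 192])
```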
Chapter 3: NeRV: Implicit Neural Representations for Videos

3.1 Introduction

What is a video? Typically, a video captures a dynamic visual scene using a sequence of frames. A schematic interpretation of this is a curve in 2D space, where each point can be characterized by an (x, y) pair representing the spatial state. If we have a model for all (x, y) pairs, then, given any x, we can easily find the corresponding y state. Similarly, we can interpret a video as a recording of the visual world, where we can find a corresponding RGB state for every single timestamp. This leads to our main question: can we represent a video as a function of time? More formally, can we represent a video $V$ as $V = \{v_t\}_{t=1}^{T}$, where $v_t = f_\theta(t)$, i.e., a frame at timestamp $t$ is represented as a function $f$ parameterized by $\theta$? Given their remarkable representational capacity [85], we choose deep neural networks as the function in our work.

Given these intuitions, we propose NeRV, a novel representation that represents videos as implicit functions and encodes them into neural networks. Specifically, with a fairly simple deep neural network design, NeRV can reconstruct the corresponding video frames with high quality, given the frame index. Once the video is encoded into a neural network, this network can be used as a proxy for the video, from which we can directly extract all video information. Therefore, unlike traditional video representations, which treat videos as sequences of frames, as shown in Figure 3.1 (a), our proposed NeRV considers a video as a unified neural network with all information embedded within its architecture and parameters, as shown in Figure 3.1 (b).

Table 3.1: Comparison of different video representations. Although explicit representations currently outperform implicit ones in encoding speed and compression ratio, NeRV shows a great advantage in decoding speed, and it outperforms pixel-wise implicit representations in all metrics.

                   | Explicit (frame-based)                                             | Implicit (unified)
                   | Hand-crafted (e.g., HEVC [34]) | Learning-based (e.g., DVC [35])   | Pixel-wise (e.g., NeRF [62]) | Image-wise (Ours)
Encoding speed     | Fast                           | Medium                            | Very slow                    | Slow
Decoding speed     | Medium                         | Slow                              | Very slow                    | Fast
Compression ratio  | Medium                         | High                              | Low                          | Medium

Figure 3.1: (a) Conventional video representation as a frame sequence. (b) NeRV, which represents a video as a neural network consisting of multiple convolutional layers, taking the normalized frame index as input and outputting the corresponding RGB frame.

As an image-wise implicit representation, NeRV shares many similarities with pixel-wise implicit visual representations [54, 55], which take spatio-temporal coordinates as inputs. The main differences between our work and pixel-wise implicit representations are the output space and the architecture design. Pixel-wise representations output the RGB value for each pixel, while NeRV outputs a whole image, as illustrated in Figure 3.2. Given a video of size T×H×W, pixel-wise representations need to sample the video T×H×W times, while NeRV only needs to sample it T times. Considering the huge number of pixels, especially for high-resolution videos, NeRV shows a great advantage in both encoding time and decoding speed. The different output spaces also lead to different architecture designs: NeRV utilizes an MLP + ConvNet architecture to output an image, while pixel-wise representations use a simple MLP to output the RGB value of a single pixel.
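The gap in the number of forward passes is easy to quantify; the clip length and resolution below are illustrative of a 1080p video.

```python
# Forward passes needed to decode one video: a pixel-wise representation is
# queried once per pixel, an image-wise representation (NeRV) once per frame.
T, H, W = 600, 1080, 1920                     # illustrative clip length and resolution
print("pixel-wise queries:", T * H * W)       # 1,244,160,000
print("image-wise queries:", T)               # 600
```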
The sampling efficiency of NeRV also simplifies the optimization problem, which leads to better reconstruction quality compared to pixel-wise representations.

We also demonstrate the flexibility of NeRV by exploring several applications it affords. Most notably, we examine the suitability of NeRV for video compression. Traditional video compression frameworks are quite involved: they specify key frames and inter frames, estimate residual information, divide video frames into blocks, apply the discrete cosine transform to the resulting blocks, and so on. Such a long pipeline makes the decoding process very complex as well. In contrast, given a neural network that encodes a video in NeRV, we can simply cast the video compression task as a model compression problem and trivially leverage any well-established or cutting-edge model compression algorithm to achieve good compression ratios. Specifically, we explore a three-step model compression pipeline of model pruning, model quantization, and weight encoding, and show the contribution of each step to the compression task. We conduct extensive experiments on popular video compression datasets, such as UVG [86], and show the applicability of model compression techniques on NeRV for video compression. We briefly compare different video representations in Table 3.1, where NeRV shows a great advantage in decoding speed.

Besides video compression, we also explore other applications of the NeRV representation, such as video denoising. Since NeRV is a learned implicit function, we can demonstrate its robustness to noise and perturbations. Given a noisy video as input, NeRV generates a high-quality denoised output without any additional operation, and even outperforms conventional denoising methods.

The contributions of this work can be summarized in four parts:

• We propose NeRV, a novel image-wise implicit representation for videos that represents a video as a neural network, converting video encoding into model fitting and video decoding into a simple feedforward operation.
• Compared to pixel-wise implicit representations, NeRV outputs the whole image and shows great efficiency, improving encoding speed by 25× to 70× and decoding speed by 38× to 132×, while achieving better video quality.
• NeRV allows us to convert the video compression problem into a model compression problem, letting us leverage standard model compression tools and reach performance comparable to conventional video compression methods, e.g., H.264 [8] and HEVC [34].
• As a general representation for videos, NeRV also shows promising results on other tasks, e.g., video denoising. Without any special denoising design, NeRV outperforms traditional hand-crafted denoising algorithms (e.g., the median filter) and ConvNet-based denoising methods.

3.2 Related Work

Implicit Neural Representation. Implicit neural representation is a novel way to parameterize a variety of signals. The key idea is to represent an object as a function approximated via a neural network, which maps a coordinate to its corresponding value (e.g., a pixel coordinate for an image to the RGB value of that pixel). It has been widely applied in many 3D vision tasks, such as 3D shapes [87, 88], 3D scenes [89, 90, 91, 92], and the appearance of 3D structures [62, 93, 94]. Compared to explicit 3D representations, such as voxels, point clouds, and meshes, the continuous implicit neural representation can compactly encode high-resolution signals in a memory-efficient way.
Most recently, [52] demonstrated the feasibility of using implicit neural representations for image compression. Although it is not yet competitive with state-of-the-art compression methods, it shows promising and attractive properties. In previous methods, MLPs are often used to approximate the implicit neural representation, taking a spatial or spatio-temporal coordinate as input and outputting the signal at that single point (e.g., an RGB value or volume density). In contrast, our NeRV representation trains a purposely designed neural network composed of MLP and convolutional layers, which takes the frame index as input and directly outputs all the RGB values of that frame.

Video Compression. As a fundamental task of computer vision and image processing, visual data compression has been studied for several decades. Before the resurgence of deep networks, handcrafted image compression techniques, like JPEG [19] and JPEG2000 [95], were widely used. Building upon them, many traditional video compression algorithms, such as MPEG [33], H.264 [8], and HEVC [34], have achieved great success. These methods are generally based on transform coding, like the discrete cosine transform (DCT) [20] or wavelet transform [96], and are well-engineered and tuned to be fast and efficient. More recently, deep learning-based visual compression approaches have been gaining popularity. For video compression, the most common practice is to utilize neural networks for certain components while keeping the traditional video compression pipeline. For example, [97] proposed an effective image compression approach and generalized it to video compression by adding interpolation loop modules. Similarly, [98] converted the video compression problem into an image interpolation problem and proposed an interpolation network, resulting in competitive compression quality. Furthermore, [37] generalized optical flow to scale-space flow to better model uncertainty in compression. Later, [99] employed a temporal hierarchical structure and trained neural networks for most components, including key frame compression, motion estimation, motion compression, and residual compression. However, all of these works still follow the overall pipeline of traditional compression, arguably limiting their capabilities.

Model Compression. The goal of model compression is to simplify an original model by reducing the number of parameters while maintaining its accuracy. Current research on model compression can be divided into four groups: parameter pruning and quantization [100, 101, 102, 103, 104, 105]; low-rank factorization [106, 107, 108]; transferred and compact convolutional filters [109, 110, 111, 112]; and knowledge distillation [113, 114, 115, 116]. Our proposed NeRV enables us to reformulate the video compression problem as model compression and to utilize standard model compression techniques. Specifically, we use model pruning and quantization to reduce the model size without significantly deteriorating the performance.

3.3 Neural Representations for Videos

We first present the NeRV representation in Section 3.3.1, including the input embedding, the network architecture, and the loss objective. Then, we present model compression techniques on NeRV in Section 3.3.2 for video compression.
Figure 3.2: (a) Pixel-wise implicit representation (e.g., SIREN), which takes pixel coordinates as input and uses a simple MLP to output the pixel's RGB value. (b) NeRV: image-wise implicit representation (ours), which takes the frame index as input and uses an MLP + ConvNet to output the whole image. (c) NeRV block architecture, which upscales the feature map by a factor of S.

3.3.1 NeRV Architecture

In NeRV, each video $V = \{v_t\}_{t=1}^{T} \in \mathbb{R}^{T \times H \times W \times 3}$ is represented by a function $f_\theta: \mathbb{R} \rightarrow \mathbb{R}^{H \times W \times 3}$, where the input is a frame index $t$ and the output is the corresponding RGB image $v_t \in \mathbb{R}^{H \times W \times 3}$. The encoding function is parameterized by a deep neural network with weights $\theta$, $v_t = f_\theta(t)$. Therefore, video encoding is done by fitting a neural network $f_\theta$ to a given video, such that it can map each input timestamp to the corresponding RGB frame.

Input Embedding. Although deep neural networks can be used as universal function approximators [85], directly training the network $f_\theta$ on the raw input timestamp $t$ leads to poor results, as also observed by [62, 117]. By mapping the inputs to a high-dimensional embedding space, the neural network can better fit data with high-frequency variations. Specifically, in NeRV, we use Positional Encoding [54, 62, 81] as our embedding function:

$$\Gamma(t) = \left( \sin\!\left(b^{0}\pi t\right), \cos\!\left(b^{0}\pi t\right), \ldots, \sin\!\left(b^{l-1}\pi t\right), \cos\!\left(b^{l-1}\pi t\right) \right) \qquad (3.1)$$

where $b$ and $l$ are hyper-parameters of the network. Given an input timestamp $t$, normalized to $(0, 1]$, the output of the embedding function $\Gamma(\cdot)$ is then fed to the following neural network.

Network Architecture. The NeRV architecture is illustrated in Figure 3.2 (b). NeRV takes the time embedding as input and outputs the corresponding RGB frame. Leveraging an MLP to directly output all pixel values of a frame would require a huge number of parameters, especially when the image resolution is large. Therefore, we stack multiple NeRV blocks after the MLP layers so that pixels at different locations can share convolutional kernels, leading to an efficient and effective network. Inspired by super-resolution networks, we design the NeRV block, illustrated in Figure 3.2 (c), adopting the PixelShuffle technique [118] for upscaling. Convolution and activation layers are also inserted to enhance expressiveness. The detailed architecture can be found in the supplementary material.

Loss Objective. For NeRV, we adopt a combination of L1 and SSIM losses as our objective for network optimization, computed over all pixel locations of the predicted and ground-truth images:

$$L = \frac{1}{T} \sum_{t=1}^{T} \alpha \left\| f_\theta(t) - v_t \right\|_1 + (1 - \alpha)\left(1 - \text{SSIM}\!\left(f_\theta(t), v_t\right)\right) \qquad (3.2)$$

where $T$ is the number of frames, $f_\theta(t) \in \mathbb{R}^{H \times W \times 3}$ the NeRV prediction, $v_t \in \mathbb{R}^{H \times W \times 3}$ the ground-truth frame, and $\alpha$ a hyper-parameter balancing the two loss components.
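To summarize the architecture, the following is a minimal NeRV-style sketch: a positional encoding of the frame index, a small MLP, and stacked NeRV blocks that upscale with PixelShuffle. The channel widths, block count, and up-scale factors here are illustrative and differ from the models used in our experiments.

```python
# Minimal NeRV-style sketch: positional encoding -> MLP -> NeRV blocks
# (Conv + PixelShuffle + activation). Sizes are illustrative only.
import math
import torch
import torch.nn as nn

def positional_encoding(t, b=1.25, l=80):
    """Eq. (3.1): map a normalized frame index t in (0, 1] to 2*l features."""
    freqs = b ** torch.arange(l, dtype=torch.float32) * math.pi
    angles = t[:, None] * freqs                               # (N, l)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (N, 2*l)

class NeRVBlock(nn.Module):
    """Conv -> PixelShuffle(s) -> activation, upscaling the feature map by s."""
    def __init__(self, c_in, c_out, s):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * s * s, 3, padding=1)
        self.up = nn.PixelShuffle(s)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.up(self.conv(x)))

class TinyNeRV(nn.Module):
    def __init__(self, l=80, c=64, h0=9, w0=16, scales=(4, 2, 2)):
        super().__init__()
        self.c, self.h0, self.w0 = c, h0, w0
        self.mlp = nn.Sequential(nn.Linear(2 * l, 512), nn.GELU(),
                                 nn.Linear(512, c * h0 * w0), nn.GELU())
        self.blocks = nn.Sequential(*[NeRVBlock(c, c, s) for s in scales])
        self.head = nn.Conv2d(c, 3, 3, padding=1)

    def forward(self, t):
        x = self.mlp(positional_encoding(t)).view(-1, self.c, self.h0, self.w0)
        return torch.sigmoid(self.head(self.blocks(x)))       # (N, 3, H, W)

model = TinyNeRV()
frames = model(torch.tensor([0.25, 0.5]))                     # decode two frames
print(frames.shape)                                           # torch.Size([2, 3, 144, 256])
```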
3.3.2 Model Compression

In this section, we briefly revisit the model compression techniques used for video compression with NeRV. Our model compression pipeline consists of four standard sequential steps: video overfitting, model pruning, weight quantization, and weight encoding, as shown in Figure 3.3.

Figure 3.3: NeRV-based video compression pipeline: video overfitting, model pruning, model quantization, and weight encoding.

Model Pruning. Given a neural network fit on a video, we first use global unstructured pruning to reduce the model size. Based on the magnitude of the weight values, we set weights below a threshold to zero:

$$\theta_i = \begin{cases} \theta_i, & \text{if } \theta_i \geq \theta_q \\ 0, & \text{otherwise,} \end{cases} \qquad (3.3)$$

where $\theta_q$ is the $q$-percentile value over all parameters in $\theta$. As is common practice, we fine-tune the model after the pruning operation to regain its representation quality.

Model Quantization. After model pruning, we apply quantization to all network parameters. Note that, unlike many recent works [104, 119, 120, 121] that utilize quantization during training, NeRV is only quantized post hoc (after the training process). Given a parameter tensor $\mu$,

$$\mu_i = \text{round}\!\left(\frac{\mu_i - \mu_{\min}}{\text{scale}}\right) \cdot \text{scale} + \mu_{\min}, \qquad \text{scale} = \frac{\mu_{\max} - \mu_{\min}}{2^{\text{bit}}} \qquad (3.4)$$

where 'round' rounds a value to the closest integer, 'bit' is the bit length of the quantized model, $\mu_{\max}$ and $\mu_{\min}$ are the maximum and minimum values of the parameter tensor $\mu$, and 'scale' is the scaling factor. Through Equation 3.4, each parameter can be mapped to a 'bit'-length value. The overhead of storing 'scale' and $\mu_{\min}$ can be ignored given the large number of parameters in $\mu$; e.g., they account for only 0.005% of a small 3×3 Conv layer with 64 input channels and 64 output channels (37k parameters in total).

Entropy Encoding. Finally, we use entropy encoding to further compress the model size. By taking advantage of symbol frequencies, entropy encoding can represent the data with a more efficient code. Specifically, we employ Huffman coding [122] after model quantization. Since Huffman coding is lossless, a decent compression is achieved without any impact on the reconstruction quality. Empirically, this further reduces the model size by around 10%.
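The pruning and quantization steps can be sketched in a few lines; the NumPy code below follows Equations 3.3 and 3.4 on a single weight tensor, with an illustrative pruning ratio and bit width (thresholding on weight magnitude) and without the entropy coding of the resulting integer codes.

```python
# Minimal sketch of the pruning (Eq. 3.3) and post-hoc quantization (Eq. 3.4)
# steps on one weight tensor; pruning ratio and bit width are illustrative.
import numpy as np

def prune(weights, q=40):
    """Zero out weights whose magnitude is below the q-th percentile."""
    threshold = np.percentile(np.abs(weights), q)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_dequantize(weights, bit=8):
    """Eq. (3.4): map each weight to a `bit`-length code, then reconstruct it."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (2 ** bit)
    codes = np.round((weights - w_min) / scale)       # integers to be entropy coded
    return codes * scale + w_min

weights = np.random.randn(64, 64, 3, 3).astype(np.float32)   # a hypothetical conv layer
compressed = quantize_dequantize(prune(weights), bit=8)
print("max weight error:", np.abs(weights - compressed).max())
```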
3.4 Experiments

3.4.1 Datasets and Implementation Details

We perform experiments on the "Big Buck Bunny" sequence from scikit-video, which has 132 frames at 720×1080 resolution, to compare NeRV with pixel-wise implicit representations. To compare with state-of-the-art methods on the video compression task, we conduct experiments on the widely used UVG dataset [86], consisting of 7 videos with 3,900 frames at 1920×1080 in total.

In our experiments, we train the network using the Adam optimizer [123] with a learning rate of 5e-4. For the ablation study on UVG, we use a cosine annealing learning rate schedule [124], a batch size of 1, 150 training epochs, and 30 warmup epochs unless otherwise noted. When comparing with the state of the art, we run the model for 1500 epochs with a batch size of 6. For experiments on "Big Buck Bunny", we train NeRV for 1200 epochs unless otherwise noted. For the fine-tuning process after pruning, we use 50 epochs for both UVG and "Big Buck Bunny". For the NeRV architecture, there are 5 NeRV blocks, with up-scale factors of 5, 3, 2, 2, 2 for 1080p videos and 5, 2, 2, 2, 2 for 720p videos. By changing the hidden dimension of the MLP and the channel dimension of the NeRV blocks, we can build NeRV models of different sizes. For the input embedding in Equation 3.1, we use b = 1.25 and l = 80 as our default setting. For the loss objective in Equation 3.2, α is set to 0.7. We evaluate video quality with two metrics, PSNR and MS-SSIM [125], and adopt bits-per-pixel (BPP) to indicate the compression ratio. We implement our model in PyTorch [126] and train it in full precision (FP32). All experiments are run on an NVIDIA RTX 2080 Ti. Please refer to the supplementary material for more experimental details, results, and visualizations (e.g., MCL-JCV [127] results).

Table 3.2: Comparison with pixel-wise implicit representations. Training speed means time per epoch, while encoding time is the total training time.

Methods        | Parameters | Training Speed ↑ | Encoding Time ↓ | PSNR ↑ | Decoding FPS ↑
SIREN [55]     | 3.2M       | 1×               | 2.5×            | 31.39  | 1.4
NeRF [62]      | 3.2M       | 1×               | 2.5×            | 33.31  | 1.4
NeRV-S (ours)  | 3.2M       | 25×              | 1×              | 34.21  | 54.5
SIREN [55]     | 6.4M       | 1×               | 5×              | 31.37  | 0.8
NeRF [62]      | 6.4M       | 1×               | 5×              | 35.17  | 0.8
NeRV-M (ours)  | 6.3M       | 50×              | 1×              | 38.14  | 53.8
SIREN [55]     | 12.7M      | 1×               | 7×              | 25.06  | 0.4
NeRF [62]      | 12.7M      | 1×               | 7×              | 37.94  | 0.4
NeRV-L (ours)  | 12.5M      | 70×              | 1×              | 41.29  | 52.9

Table 3.3: PSNR vs. epochs. Since video encoding with NeRV is an overfitting process, the reconstructed video quality keeps increasing with more training epochs. NeRV-S/M/L denote models of different sizes.

Epoch | NeRV-S | NeRV-M | NeRV-L
300   | 32.21  | 36.05  | 39.75
600   | 33.56  | 37.47  | 40.84
1.2k  | 34.21  | 38.14  | 41.29
1.8k  | 34.33  | 38.32  | 41.68
2.4k  | 34.86  | 38.7   | 41.99

3.4.2 Main Results

We compare NeRV with pixel-wise implicit representations on the "Big Buck Bunny" video. We take SIREN [55] and NeRF [62] as the baselines, where SIREN [55] takes the original pixel coordinates as input and uses sine activations, while NeRF [62] adds one posi