ABSTRACT

Title of Dissertation: An Efficient Neural Representation for Videos
Hao Chen, Doctor of Philosophy, 2023
Dissertation Directed by: Abhinav Shrivastava, Department of Computer Science

With the increasing popularity of videos, it has become crucial to find efficient and compact ways to represent them for easier storage, transmission, and downstream video tasks. Our dissertation proposes an innovative neural representation for videos called NeRV, which stores each video implicitly as a neural network. Building on NeRV, we introduce a hybrid representation for videos called HNeRV, which improves internal generalization and representation capacity. HNeRV allows for highly efficient video representation and compression, with a model size that can be up to 1000 times smaller than the original raw video. Apart from efficiency, HNeRV's simple decoding process, which involves a single feedforward operation, enables fast video loading and easy deployment. To further improve efficiency, we develop an efficient neural video dataloader called NVLoader, which is 3-6 times faster than conventional video dataloaders. We also introduce the HyperNeRV framework to address encoding speed, which utilizes a hypernetwork to directly map input videos to NeRV model weights, resulting in an encoding process roughly 10^4 times faster. Aside from developing compact and implicit video neural representations, we explore several compelling applications, including frame interpolation, video restoration, and video editing. Furthermore, the compactness of these representations makes them an ideal output video format for video generation models, reducing the search space significantly. Additionally, they can serve as an efficient input for video understanding models.

An Efficient Neural Representation For Videos
by Hao Chen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Prof. Abhinav Shrivastava, University of Maryland (Chair/Advisor)
Prof. Furong Huang, University of Maryland
Prof. Behtash Babadi, University of Maryland
Prof. Ramani Duraiswami, University of Maryland
Prof. Saining Xie, New York University

Acknowledgments

I am deeply grateful to my advisor, Abhinav Shrivastava, for his exceptional guidance throughout my Ph.D. journey. His extensive knowledge and problem-solving skills have not only inspired me but also shaped my research in countless ways. I have been fortunate to work under his patient and kind mentorship, which has been invaluable in helping me navigate the challenges of research and life. Abhinav once said, "Hackers are people with the cleverest minds who always know how to solve problems," and he is a true embodiment of this definition. His wisdom and ability to solve even the most complex issues have been a constant source of motivation for me. I am thankful for everything Abhinav has taught me, and I aspire to become half as good as him through years of learning and practice. Abhinav's guidance and support have been critical to my success, and I will always be grateful for his exceptional mentorship.

I am incredibly grateful to the esteemed members of my thesis committee, Furong Huang, Behtash Babadi, Ramani Duraiswami, and Saining Xie, for the time and effort that they have invested in me. Ramani Duraiswami, who served on both my proposal and defense committee, demonstrated exceptional kindness and support, for which I am deeply appreciative.
Saining Xie, a distinguished young researcher in computer vision, has been a source of inspiration for my research. I feel incredibly fortunate to have had the opportunity to collaborate with him on the efficient neural video dataloader project. Finally, I would like to express my gratitude to Christopher Metzler for attending my preliminary exam and offering valuable feedback and constructive comments on my research.

I am grateful for my labmates, who have not only been great collaborators but also amazing friends. Yixuan Ren, Bo He, Hanyu Wang, Kamal Gupta, Max Enrich, Matthew Gwilliam, Alex Hanson, Shuaiyi Huang, Lillian Huang, Vatsal Agarwal, Sharath Girish, Chuong Huynh, Vinoj Jayasundara, Pulkit Kumar, Mara Levy, Shishira R Maiya, Soumik Mukhopadhyay, Khoi Pham, Nirat Saini, Saksham Suri, Archana Swaminathan, Gaurav Shrivastava, Matthew Walmer, and Luyu Yang have all contributed to my growth as a researcher and a person. I have learned a great deal from them, not only in research but also in the way they approach life. In particular, I want to express my appreciation to Bo He, Hanyu Wang, Matthew Gwilliam, and Yixuan Ren. Their support and contributions have been invaluable to the success of the NeRV series.

During my time at UMD, I had the privilege of collaborating with exceptional researchers, and I am grateful for these opportunities. Zuxuan Wu has been an outstanding mentor, providing invaluable guidance not only in research but also in career planning. Additionally, working with Xitong Yang and receiving guidance from Hengduo Li and Chen Zhu has been a great experience. I would also like to express my gratitude to the mentors I had during my internships. Ser-Nam Lim from Meta, Zhe Lin from Adobe, and Heng Wang, Binchen Liu, and Yizhe Zhu from TikTok have all provided me with indispensable insights that have shaped my research. Their mentorship and support have been instrumental in my growth as a researcher.

I would like to express my gratitude to the exceptional individuals who ignited my passion for deep learning and computer vision. Firstly, I am indebted to Prof. Guoyou Wang, my first advisor during my Master's degree, and Prof. Xiang Bai, who provided invaluable guidance and career advice during my study abroad. Additionally, I am thankful for the wonderful collaborators and mentors I had the privilege of working with during my internship at the Shenzhen Institutes of Advanced Technology (SIAT), including Prof. Yu Qiao, Prof. Yali Wang, Lei Zhou, Yulun Zhang, Yapeng Tian, Zhi Tian, Tong He, Kaiyang Zhou, and Xiaoxing Zeng.

Finally, I want to express my deep appreciation to my parents and all my family members, especially my adorable nephew Jiujiu, for their unwavering support throughout my journey. Without their support, none of this would have been possible.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction

Chapter 2: Background
  2.1 Video compression
    2.1.1 Video redundancy overview
    2.1.2 Image compression
    2.1.3 Traditional video coding
    2.1.4 Traditional video codec standards
    2.1.5 Learning-based video compression
    2.1.6 Application of video compression
  2.2 Implicit neural representation
  2.3 Deep neural networks

Chapter 3: NeRV: Implicit Neural Representations for Videos
  3.1 Introduction
  3.2 Related Work
  3.3 Neural Representations for Videos
    3.3.1 NeRV Architecture
    3.3.2 Model Compression
  3.4 Experiments
    3.4.1 Datasets and Implementation Details
    3.4.2 Main Results
    3.4.3 Video Compression
    3.4.4 Video Denoising
    3.4.5 Ablation Studies
  3.5 Discussion
  3.6 Experiment supplement
    3.6.1 NeRV Architecture
    3.6.2 Results on MCL-JCV dataset
    3.6.3 Implementation Details of Baselines
    3.6.4 Video Temporal Interpolation
    3.6.5 More Visualizations

Chapter 4: HNeRV: A Hybrid Neural Representation for Videos
  4.1 Introduction
  4.2 Related Work
  4.3 Method
    4.3.1 HNeRV overview
    4.3.2 Downstream tasks
  4.4 Experiments
    4.4.1 Dataset and Implementation Details
    4.4.2 Main Results
    4.4.3 Parameter Distribution Analysis
    4.4.4 Downstream Tasks
    4.4.5 Ablation study
  4.5 Conclusion
  4.6 Experiment supplement
    4.6.1 Video decoding
    4.6.2 Video compression
    4.6.3 Weight Pruning for Model Compression
    4.6.4 HNeRV architecture details
    4.6.5 Per-video compression results
Chapter 5: NVLoader: A Neural Video Dataloader for Efficient Data Loading
  5.1 Introduction
  5.2 Related Work
  5.3 Method
    5.3.1 Prepare NVDataset
    5.3.2 Data loading
  5.4 Experiment
    5.4.1 Datasets and implementation details
    5.4.2 Main results
    5.4.3 Action recognition
    5.4.4 Comprehensive comparison
    5.4.5 Other results
    5.4.6 Video model architectures
  5.5 Conclusion

Chapter 6: HyperNeRV: Towards Fast Learning of Video Neural Representation
  6.1 Introduction
  6.2 Related Work
  6.3 Method
    6.3.1 Overview
    6.3.2 HyperNeRV
    6.3.3 Efficient video neural representation
  6.4 Experiment
    6.4.1 Datasets and implementation details
    6.4.2 Main results
    6.4.3 Efficient neural representation
    6.4.4 Component analysis
    6.4.5 Discussion and limitations
  6.5 Conclusion
  6.6 Experiment supplement
    6.6.1 Ablation study
    6.6.2 More implementation details

Chapter 7: Conclusion
  7.1 Efficient video representation
  7.2 Downstream tasks based on neural representations
    7.2.1 Video compression
    7.2.2 Efficient video loading
    7.2.3 Video restoration
    7.2.4 Video editing
    7.2.5 Video understanding
    7.2.6 Video generation
  7.3 Future work and limitations
    7.3.1 Internal learning
    7.3.2 Scalable learning
    7.3.3 Limitations

List of Tables

2.1 Historical development of video codecs
3.1 Comparison of different video representations. Although explicit representations currently outperform implicit ones in encoding speed and compression ratio, NeRV shows a great advantage in decoding speed, and NeRV outperforms pixel-wise implicit representations in all metrics.
3.2 Comparison with pixel-wise implicit representations. Training speed means time/epoch, while encoding time is the total training time.
3.3 PSNR vs. epochs. Since video encoding of NeRV is an overfitting process, the reconstructed video quality keeps increasing with more training epochs. NeRV-S/M/L denote models of different sizes.
3.4 Decoding speed with BPP 0.2 for 1080p videos
3.5 PSNR results for video denoising. "Baseline" refers to the noisy frames before any denoising
3.6 Input embedding ablation. PE means positional encoding
3.7 Upscale layer ablation
3.8 Norm layer ablation
3.9 Activation function ablation
3.10 Loss objective ablation
3.11 NeRV architecture for 1920×1080 videos. Change the values of C1 and C2 to get models of different sizes.
4.1 HNeRV block vs. NeRV block. k is the kernel size for each stage, Cout and Cin are the output/input channels for each block. We decrease parameters via a small k = 1 for the first block, and increase parameters for later layers with a larger k and wider channels.
4.2 Video regression with different sizes
4.3 Video regression with different epochs
4.4 Video regression at resolution 960×1920, PSNR↑ reported
4.5 Video regression at resolution 480×960, PSNR↑ reported
4.6 Internal generalization results. NeRV, E-NeRV, and HNeRV use interpolated embedding as input, HNeRV† uses held-out frames as input. With content-adaptive embedding as input, HNeRV shows much better reconstruction on held-out frames
4.7 Analysis of parameter rebalancing
4.8 Video inpainting results. With 5 fixed box masks on input videos, we evaluate the output with PSNR↑. 'Input' is the baseline of masked video and ground truth
4.9 Kernel size (Kmin, Kmax) ablation (with r=1.2)
4.10 Channel reduction r ablation (with K=1,5)
4.11 Embedding spatial size ablation
4.12 Embedding dimension ablation
4.13 Decoding FPS ↑
4.14 Decoding time (s) ↓
4.15 HNeRV Decoding FPS
4.16 Compression results. "Size ratio" compares to the model with quantization only, and "Sparsity" indicates the amount of weights pruned.
4.17 HNeRV architecture details
5.1 Video loading speed (VPS) for video dataloaders based on H.264 videos, with different worker numbers.
5.2 Video loading speed (VPS) for NVLoader, with different GPU devices.
5.3 Top-1 error (%) with different frames.
5.4 Top-1 error (%) with different temporal strides.
5.5 Top-1 error (%) with different patch ratios.
5.6 Top-1 error (%) with different video models.
5.7 Comprehensive results on datasets of different resolutions: top-1 accuracy, average video size, and loading speed.
5.8 Video quality for NVLoaders, before (PSNR_orig) and after (PSNR_quant) quantization.
5.9 Total forward time (model forward + data loading) at testing time. Pure computation time acts as the 1× baseline. Data loading becomes a bottleneck, especially for efficient video models and large batches.
5.10 Average video size in NVLoader. 'Parameters' is the total parameter count of the video checkpoint (video decoder W_decoder and frame embedding D); 'video size' measures the video checkpoint in megabytes. Q means quantization and H means Huffman coding.
5.11 Generalization to other dataloaders, top-1 error (%) shown. We evaluate models with different sampling frames, which are trained and evaluated on the same or different video dataloaders.
5.12 Video model architectures. Strds_enc and Strds_dec are the stride lists used in the encoder and decoder. Size_enc and Size_dec are the parameter numbers for the encoder and decoder. d is the embedding dimension.
6.1 Variables and their definitions.
6.2 Encoding comparison for methods: NeRV [1] (trained from scratch), Trans-INR [2], and HyperNeRV (ours).
6.3 PSNR results for compact video representations.
6.4 Stronger training setups to obtain efficient video representations. PSNR is reported for training and test videos (UCF101, K400, SthV2, and Avg.).
6.5 Component analysis for HyperNeRV. 'Total size' is the number of all learnable parameters, 'Video model' is the parameters for the video model, 'Img-wise' is based on image-wise neural representation while Trans-INR [2] is based on pixel-wise neural representation. 'Act.' is the activation layer in the video model, Nmax is the maximum number of weight tokens, 'Train' and 'Avg.' are the average PSNR on the training and test set. 'VPS' is videos per second.
6.6 Data augmentations. 'rand ratios' crops the video with random aspect ratios between [0.67, 1.5]. 'rand size' randomly scales the video, between [0.8×, 1.25×], before cropping. 'rand aug' is random augmentation [3].
6.7 Ablation study for Nmax. Increasing Nmax from 128 to 256 does not improve the performance further.
6.8 Ablation for learning rate schedules.
6.9 Implementation details for HyperNeRV.

List of Figures

1.1 Framework of transform encoding for video compression.
1.2 Key evaluation metrics for efficient video representations.
1.3 The dissertation framework. a) implicit neural representation NeRV. b) hybrid neural representation HNeRV. c) fast learning of NeRV weights. d) downstream tasks based on NeRV.
2.1 Framework of video compression.
2.2 Redundancy for video compression.
2.3 Compression for different video frames.
2.4 Transform coding framework for video compression.
2.5 Application of video compression.
2.6 DenseNet architecture. Image obtained from [4].
2.7 An overview of Vision Transformer (on the left) and the details of the Transformer encoder (on the right). The architecture resembles Transformers used in the NLP domain and the image patches are simply fed to the model after flattening. After training, the feature obtained from the first token position is used for classification. Image obtained from [5].
3.1 (a) Conventional video representation as frame sequences. (b) NeRV, representing video as neural networks, which consist of multiple convolutional layers, taking the normalized frame index as input and outputting the corresponding RGB frame.
3.2 (a) Pixel-wise implicit representation taking pixel coordinates as input and using a simple MLP to output the pixel RGB value. (b) NeRV: image-wise implicit representation taking the frame index as input and using an MLP + ConvNets to output the whole image. (c) NeRV block architecture, which upscales the feature map by S.
3.3 NeRV-based video compression pipeline.
3.4 Model pruning. Sparsity is the ratio of parameters pruned.
3.5 Model quantization. Bit is the bit length used to represent a parameter value.
3.6 Compression pipeline showing how much each step contributes to the compression ratio.
3.7 PSNR vs. BPP on UVG dataset.
3.8 MS-SSIM vs. BPP on UVG dataset.
3.9 Video compression visualization. At similar BPP, NeRV reconstructs videos with better details.
3.10 Denoising visualization. (c) and (e) are denoising outputs for DIP [6]. Data generalization of NeRV leads to robust and better denoising performance since all frames share the same representation, while a DIP model overfits one model to one image only.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.11 Rate distortion plots on the MCL-JCV dataset. . . . . . . . . . . . . . . . . . . 49 3.12 Temporal interpolation results for video with small motion. . . . . . . . . . . . . 51 3.13 Denoising visualization. Left: Ground truth; Middle: Noisy input Right; NeRV output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.14 Video compression visualization. The difference is calculated by the L1 loss (absolute value, scaled by the same level for the same frame, and the darker the more different). “Bosphorus” video in UVG dataset, the residual visualization is much smaller for NeRV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Top) Hybrid neural representation with learnable and content-adaptive embedding (ours). Bottom) Video regression for hybrid and implicit neural representations. 55 4.2 a) HNeRV uses ConvNeXt blocks to encode frames as tiny embeddings, which are decoded by HNeRV blocks. b) HNeRV blocks consist of three layers: convolution, PixelShuffle, and activation (with input/output size illustrated). c) We demonstrate how to compute parameters for a given HNeRV block. d) Output size of each stage with strides 5,4,2,2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3 Video decoding. Left: HNeRV outperforms traditional video codecs H.264 and H.265, and learning-based compression method DCVC. Middle: HNeRV shows much better flexibility when decoding only a portion of video frames, where the decoding time decreases linearly for HNeRV while other methods still need to decode most frames. Right: HNeRV performs well for compactness (ppp), reconstruction quality (PSNR), and decoding speed (FPS). . . . . . . . . . . . . 65 4.4 Visualization of Embedding interpolation. . . . . . . . . . . . . . . . . . . . . 65 4.5 Visualization of video neural representations at 0.003 ppp, which means the total size is only about 0.3% of the original video size. On the left, we compare HNeRV to ground truth. On the right, we compare NeRV, E-NeRV, and HNeRV for 5 patches with discernible differences, indicated in the original frame by numbers and bounding boxes. For each patch, HNeRV preserves detail at a level of fidelity closer to the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . 66 4.6 Parameter distributions for decoder blocks. See table 4.7 for PSNR and MS-SSIM results with these 4 settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.7 Compression results on UVG dataset. . . . . . . . . . . . . . . . . . . . . . . . 70 4.8 Compression results of best/worst cases from UVG dataset. HNeRV achieves good performance especially for videos caputured by still cameras, like ‘honeybee’ video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.9 Inpainting results of fixed masks and object masks. Left) input frame; Right) HNeRV output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.10 Compression results averaged across all UVG videos, and for each specific videos. 76 xii 5.1 Comparison of video dataloaders based on H.264 videos, HEVC videos, JPEG frames, and NVLoader (ours). With similar video size, NVLoader load videos much more efficiently, measured by videos per second (VPS), without hurting accuracy for video recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . 
5.2 a) NVLoader framework; b) Video model for NVLoader: the NeRV block upscales the feature map to h × w, and the MLP expands channels from c to 3 × p × p. d is the frame embedding dimension, h × w is the patch number, p × p is the patch size. c) NeRV blocks used in the video model. S is the upscale factor.
5.3 Video loading speed. a) common video backends. Naive decoding loads videos one by one; the video dataloader uses 8 workers for parallel speedup; b) naive decoding on different devices; c) different frame counts for dataloader loading; d) different strides for dataloader loading; e) video resolutions for dataloader loading; f) patch ratios for dataloader loading (for 64 frames).
5.4 Comparison of video dataloaders based on H.264 videos, HEVC videos, and NVLoader (ours). Left: Something V2 dataset. Middle: UCF101 dataset. Right: Kinetics-400 dataset. The left y-axis is loading speed in videos per second; the right y-axis is top-1 accuracy.
5.5 Output video frames for NVLoader across datasets. NVLoader can reconstruct videos well and capture details faithfully, for either ones with dynamic scenes or with rich textures.
6.1 Encoding time comparison between HyperNeRV and training NeRV [1] from scratch. Encoding refers to generating a well-trained neural network for a given video. HyperNeRV eliminates the need for tedious fitting, enabling much faster video encoding.
6.2 Left: HyperNeRV takes videos as input and outputs NeRV model weights through the hypernetwork. Right: architecture details for the transformer hypernetwork (top) and the video model (bottom).
6.3 a) NeRV block: only the convolution layer has learnable parameters θ ∈ R^(Cout×Cin×K×K). b) Transformer hypernetwork: takes video patches x and initial weights θ0 as input tokens, and outputs video-specific weights θ̂′. c) Obtain the final model weights θ′ by an element-wise multiplication of shared weights θ1 and video-specific weights θ̂′.
6.4 Left: video ground truth. Right: HyperNeRV output. HyperNeRV can reconstruct various videos across different datasets, and capture video details with high fidelity, for dynamic scenes, complex textures, or moving objects.
6.5 Left: Trans-INR [2] output. Right: HyperNeRV output (ours). HyperNeRV shows much better reconstruction quality than Trans-INR [2], with more faithful details, sharper textures, and better visual preference.
7.1 Framework of efficient video representation.
7.2 Dissertation framework overview.
7.3 Downstream video tasks based on implicit neural representations.
7.4 Implicit neural representation is a compact input for video understanding and perfect for video generation due to its smaller size compared to original video data.
7.5 Some ongoing or potential projects based on NeRV.

Chapter 1: Introduction

In today's world, video has emerged as the dominant form of multimedia, and its popularity continues to increase.
However, the high-dimensional and intricate visual information in videos makes it challenging to efficiently represent them for storage, transmission, and downstream video-related tasks. To tackle this issue, various attempts have been made, with the transform encoding approach being the most popular. This approach transforms the input video into a compact embedding space that is much smaller than the original video, while maintaining high fidelity after reconstruction. We illustrate the framework in Figure 1.1. These methods can be broadly divided into two categories based on the chosen transform functions: traditional codecs with hand-crafted transforms and learning-based methods that employ deep neural networks. Traditional video compression methods like MPEG [7], H.264 [8], and H.265 [9] achieve good reconstruction results with decent decompression speeds. In contrast, learning-based methods [10, 11, 12, 13, 14, 15, 16] focus on replacing the entire compression pipeline or several components with deep learning tools, at varying levels of complexity.

Figure 1.1: Framework of transform encoding for video compression.

Despite efforts to improve video compression, traditional codecs and learning-based methods both have limitations. Traditional codecs often have suboptimal compression performance, while learning-based methods can be computationally expensive. As a result, a new approach is needed that can combine the strengths of both methods to enhance video compression. Recent approaches have tried to address this challenge by fine-tuning traditional codecs [17] and optimizing components of the compression pipeline [18].

This thesis aims to develop efficient implicit neural representations for videos (NeRV), where each video is represented as a deep neural network that can output the corresponding video frame given a frame index as input. Such implicit representations are appealing because they can represent a video with significantly fewer parameters and reconstruct it with high fidelity, effectively converting the video compression problem into a model compression problem. Building on NeRV, we propose a hybrid neural representation for videos (HNeRV) that pairs a small frame embedding with a powerful decoder network, resulting in improved internal generalization and representation capacity. With evenly distributed model parameters across layers, HNeRV significantly improves convergence speed compared to NeRV.

To evaluate the efficiency of neural representation methods for videos, we consider four key perspectives, as depicted in Figure 1.2. First and foremost, the compression ratio is the most critical metric to assess the efficiency of video representations. Additionally, the encoding speed to convert the original video to efficient representations and the decoding speed to reconstruct the video from such representations should be considered. Finally, an important but often overlooked aspect is the utilization of efficient video representations in downstream video tasks. While most current approaches still rely on the original frame sequences as input, these high-dimensional sequences significantly increase the computation burden for video-related tasks such as video understanding and generation.

Figure 1.2: Key evaluation metrics for efficient video representations.
Firstly, the use of implicit neural representations enables us to transform the video compression problem into a model compression problem. This approach allows us to achieve compression ratios comparable to other compression methods, with our methods showing superior compression ratios for videos with still backgrounds. In addition to compression, our implicit representations also provide a decoding advantage, as only a small neural network is required to fit one video. Moreover, the simple forward-pass decoding operation of HNeRV allows for easy deployment on any platform. To further enhance the efficiency of our methods, we developed an efficient neural video dataloader (NVLoader) that is approximately three times faster than conventional video dataloaders. This faster processing speed enables more efficient training and evaluation of video models.

In addition to compression and decoding speed, encoding speed remains a significant challenge for implicit neural representations due to the long and tedious training process. To address this issue, we introduce the HyperNeRV framework, which utilizes a hypernetwork to directly map input videos to NeRV model weights. This approach speeds up the encoding process by approximately 10^4 times, while achieving similar reconstruction quality and generalization to unseen videos compared to training the neural network from scratch.

Besides developing efficient implicit video representations and proposing the HyperNeRV framework, we explore several downstream applications based on these representations. Due to their compactness and efficiency, we have found that they perform well for tasks such as frame interpolation, video restoration, and video editing. Furthermore, we believe that these compact and implicit video representations have even more potential to be utilized in various other applications. For instance, they can be an ideal output video format that significantly reduces the search space, or serve as an efficient input for video understanding models. These representations can also be employed in diverse other applications, such as video summarization, action recognition, and content-based video retrieval, which we believe require further investigation in future research.

We present a summary of our dissertation framework in Figure 1.3. Firstly, we introduce NeRV, an implicit neural representation for video, in Figure 1.3a. We then introduce HNeRV, which uses a content-adaptive embedding to represent videos as hybrid representations, in Figure 1.3b. Next, we introduce HyperNeRV in Figure 1.3c, which enables fast learning of video neural representations. Finally, we list different downstream tasks in Figure 1.3d, which use the efficient video representations directly, i.e., the model weights. The ultimate goal of these video neural representations is to introduce a new perspective on video processing, similar to the Fourier transform for signal processing. By converting video into neural space, we can greatly advance the research and utilization of video data.

The rest of the dissertation consists of six chapters, each of which is explained separately. In Chapter 2, we provide a brief overview of the background knowledge relevant to this dissertation, covering three main topics: efficient video representations, implicit neural representations, and deep neural networks.
Specifically, we delve into the evolution of video compression methods and their applications in everyday life to provide a comprehensive understanding of the current state-of-the-art techniques in the field.

Figure 1.3: The dissertation framework. a) implicit neural representation NeRV. b) hybrid neural representation HNeRV. c) fast learning of NeRV weights. d) downstream tasks based on NeRV.

In Chapter 3, we introduce an implicit neural representation for videos called NeRV. We describe a novel image-wise approach where the neural network outputs one frame given a frame index as input. Compared to previous pixel-wise representations that output one pixel at a time, NeRV significantly improves the encoding/decoding speed and the quality of reconstructed videos. In addition to the basic video reconstruction task, we also present results for video compression and video denoising.

In Chapter 4, we present a hybrid neural representation for videos (HNeRV). We replace the content-agnostic frame index input with a content-adaptive embedding generated by an encoder. This change results in video data being represented by two parts: a large video decoder network and a small frame embedding. This hybrid representation improves internal generalization, such as video interpolation in the embedding space, and reconstruction capacity. Additionally, we propose an evenly-distributed model where the model parameters are distributed more evenly than in NeRV, leading to significant improvements in reconstruction capacity. We also explore video interpolation and video inpainting in this chapter.

In Chapter 5, we present an efficient neural video dataloader (NVLoader) that accelerates the typical data loading process for video research. Essentially, for each video in a dataset, we first fit a compact HNeRV model to it and save the model checkpoint. To load the video during training or testing, we simply load the model checkpoint and generate the video frames via a straightforward feed-forward operation. Because of the simplicity of our NVLoader, it can be easily deployed on any device, and it improves the video loading speed by 3-6 times compared to traditional data loading approaches.

In Chapter 6, we introduce the HyperNeRV framework, which uses a transformer hypernetwork to generate model weights directly. We train this hypernetwork on a large-scale video dataset to learn the mapping function between input video and model weights. With this approach, given a new video, the well-trained hypernetwork can output the model weights directly, eliminating the need for a tedious fitting process. As a result, HyperNeRV can speed up the encoding process by around 10^4 times compared to training the video model from scratch.

In Chapter 7, we provide a summary of potential downstream tasks based on neural video representations, including video compression, video restoration, efficient video loading, video understanding, and video generation. Furthermore, we summarize and conclude the dissertation by highlighting the contributions of each chapter and discussing potential future directions for research in this area.
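Before moving to the background material, the following minimal PyTorch sketch illustrates the image-wise idea behind NeRV and HNeRV: a small network maps a normalized frame index to a full RGB frame. The layer widths, positional-encoding settings, and output resolution here are illustrative assumptions, not the architectures used in the later chapters.

# Minimal NeRV-style sketch (illustrative only): map a normalized frame index
# t in [0, 1] to an RGB frame. Sizes and depths are assumptions.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Encode a scalar frame index with sin/cos at several frequencies."""
    def __init__(self, num_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs)) * math.pi)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, 1)
        angles = t * self.freqs                            # (B, num_freqs)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class NeRVBlock(nn.Module):
    """Conv -> PixelShuffle -> activation, upscaling the feature map by `scale`."""
    def __init__(self, c_in: int, c_out: int, scale: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * scale * scale, kernel_size=3, padding=1)
        self.up = nn.PixelShuffle(scale)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.up(self.conv(x)))

class TinyNeRV(nn.Module):
    """Frame index -> MLP -> small feature map -> conv upscale blocks -> RGB frame."""
    def __init__(self, num_freqs: int = 8, c: int = 64, h: int = 9, w: int = 16):
        super().__init__()
        self.c, self.h, self.w = c, h, w
        self.pe = PositionalEncoding(num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, 256), nn.GELU(),
            nn.Linear(256, c * h * w), nn.GELU(),
        )
        self.blocks = nn.Sequential(
            NeRVBlock(c, 48, scale=4),    # 9x16   -> 36x64
            NeRVBlock(48, 24, scale=2),   # 36x64  -> 72x128
            NeRVBlock(24, 12, scale=2),   # 72x128 -> 144x256
        )
        self.head = nn.Conv2d(12, 3, kernel_size=3, padding=1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, 1) in [0, 1]
        x = self.mlp(self.pe(t)).view(-1, self.c, self.h, self.w)
        return torch.sigmoid(self.head(self.blocks(x)))   # (B, 3, 144, 256)

# Encoding a video is plain regression: minimize a reconstruction loss between
# model(t) and frame t over all frames of one video (the "overfitting" step).
model = TinyNeRV()
frames = model(torch.rand(4, 1))   # four random normalized frame indices
print(frames.shape)                # torch.Size([4, 3, 144, 256])

In this view, the trained weights of such a network are the video representation, so storing or transmitting the video reduces to storing the (compressed) model checkpoint.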
Chapter 2: Background

We provide background knowledge in Chapter 2 for efficient video representations and their applications. Specifically, in Section 2.1, we discuss the evolution of video compression methods and their potential applications in various domains. In addition, we provide background knowledge for implicit neural representations in Section 2.2 and for deep neural networks in Section 2.3.

2.1 Video compression

We present the video compression pipeline in Figure 2.1, which is designed to decrease the video storage demand or speed up transmission. This is essential since the original video size can often be too large for storage and transmission purposes.

Figure 2.1: Framework of video compression.

2.1.1 Video redundancy overview

For video compression, there are three types of redundancy to remove: spatial redundancy, temporal redundancy, and perceptual redundancy. We illustrate them in Figure 2.2.

Figure 2.2: Redundancy for video compression.

The term spatial redundancy describes the occurrence of similar or identical information in adjacent pixels or regions within a single frame of a video. This leads to unnecessary data that can be compressed without significant loss of quality. In other words, when neighboring pixels or areas in a frame contain the same or similar information, it creates redundant data that can be removed.

In contrast, temporal redundancy refers to the redundancy between consecutive frames in a video sequence. Because adjacent frames in a video sequence are often very similar, much of the information in one frame can be predicted from the previous frame. Therefore, in video compression, temporal redundancy can be exploited by transmitting only the differences between frames rather than transmitting each frame in its entirety.

The human visual system's sensitivity to different aspects of a video signal is not uniform, which leads to perceptual redundancy. Thus, video compression algorithms can selectively reduce the information in less perceptually important areas. For instance, compressing a low-frequency color channel may not have a significant impact on overall image quality compared to compressing a high-frequency detail channel. By exploiting perceptual redundancy, video compression algorithms can significantly decrease the amount of data required to represent a video signal without significantly affecting perceptual quality.

While our dissertation primarily focuses on leveraging spatial and temporal redundancy to develop an efficient neural representation and a simple compression method for video data, we acknowledge that addressing perceptual redundancy based on our approach can lead to additional improvements and enhancements.

2.1.2 Image compression

Video compression is primarily based on image compression because of the spatial redundancy in video data. To address this redundancy, this section explores spatial redundancy and the background of image compression. The widely used image compression standard, JPEG [19], divides the input image into non-overlapping 8×8 blocks that are transformed into the frequency domain using block-DCT [20]. The transformed blocks' DCT coefficients are then compressed into a binary stream using quantization and entropy coding. The JPEG standard provides the essential transform and prediction modules for traditional visual compression.
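To make the block-DCT step concrete, the short sketch below applies a 2-D DCT and uniform quantization to a single 8×8 block with SciPy; the uniform quantization step is a simplification standing in for the standard JPEG quantization tables.

# JPEG-style transform coding on one 8x8 block (assumption: a uniform
# quantization step of 16 instead of the standard JPEG luminance table).
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 block (a gradient) so the DCT concentrates energy in a few coefficients.
block = np.add.outer(np.arange(8), np.arange(8)).astype(np.float64) * 8.0

# Forward: level-shift, 2-D DCT, uniform quantization of the coefficients.
coeffs = dctn(block - 128.0, norm="ortho")
q_step = 16.0
quantized = np.round(coeffs / q_step)      # most entries become 0 and are cheap to entropy-code

# Inverse: dequantize, inverse 2-D DCT, undo the level shift.
reconstructed = idctn(quantized * q_step, norm="ortho") + 128.0

print("nonzero coefficients:", int(np.count_nonzero(quantized)), "of 64")
print("reconstruction MSE:", float(np.mean((block - reconstructed) ** 2)))

The quantized coefficient matrix is mostly zeros, which is exactly what makes the subsequent entropy coding stage effective.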
Block-based image and video coding standards suffer from block-dependent compression, which limits parallelism on platforms like GPUs. Furthermore, the independent optimization strategy for each coding tool restricts performance improvement compared to end-to-end optimized compression. An alternative technological development trajectory, based on neural network techniques for image and video compression, is emerging.

The resurgence of neural networks has significantly advanced traditional image and video compression by leveraging Convolutional Neural Networks (CNNs). The first approach was proposed by Cui et al. [21], using an intra-prediction convolutional neural network (IPCNN) to refine the prediction of the current block by leveraging neighboring reconstructed blocks as additional context. Li et al. [22] proposed a fully connected network (IPFCN) as a new intra prediction mode, which achieved obvious bitrate savings but at the cost of extremely high complexity. Li et al. also explored using CNN-based down/up-sampling techniques as a new intra prediction mode for HEVC, which achieved coding gains, particularly at low bitrates. Additionally, several attempts have been made at CNN-based chroma intra prediction, such as [23, 24], utilizing both the reconstructed luma block and neighboring chroma blocks to improve intra chroma prediction efficiency.

The image and video compression community has taken a step forward by introducing end-to-end optimization frameworks based on deep neural networks. Deep neural networks have been successful due to back-propagation and gradient descent, which require differentiability of the loss function with respect to the trainable parameters. However, directly incorporating a CNN model into end-to-end image compression is challenging due to the quantization operation. The quantization module produces zero gradients almost everywhere, preventing the parameters from updating in the CNN. Additionally, the learning objective must be a differentiable loss function. In 2016, Ballé et al. introduced the first end-to-end optimized CNN framework for image compression under the scalar quantization assumption [25]. To handle the zero derivatives resulting from quantization, additive i.i.d. uniform noise was used to simulate quantization in the CNN training procedure, enabling gradient descent for neural network optimization. This method outperformed JPEG2000 in terms of both PSNR and MS-SSIM metrics. Later, [26] extended this end-to-end framework with a scale hyperprior, resulting in better compression performance. Many other attempts [27, 28, 29, 30, 31] have been made to advance image compression using neural networks.
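The uniform-noise trick mentioned above can be written in a few lines. The sketch below shows two common training-time surrogates for rounding, additive uniform noise (as in Ballé et al.) and a straight-through estimator, applied to a toy latent tensor rather than the latent of a real compression network.

# Training-time quantization surrogates for learned compression
# (the latent tensor is a stand-in for the output of an analysis network).
import torch

def noisy_quantize(y: torch.Tensor) -> torch.Tensor:
    """Additive i.i.d. uniform noise in [-0.5, 0.5): differentiable proxy for rounding."""
    return y + torch.empty_like(y).uniform_(-0.5, 0.5)

def ste_quantize(y: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: round in the forward pass, identity gradient backward."""
    return y + (torch.round(y) - y).detach()

y = torch.randn(1, 8, 4, 4, requires_grad=True)   # toy latent
y_train = noisy_quantize(y)                        # surrogate used during training
y_eval = torch.round(y.detach())                   # hard rounding at test time

# Gradients flow through the surrogate, unlike through torch.round alone.
y_train.sum().backward()
print(bool(y.grad.abs().sum() > 0))                # True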
Generative Adversarial Networks (GANs) are a popular deep learning technique that involves training a generator and a discriminator network simultaneously. In image compression, GANs have been used to improve the subjective quality of decoded images. For instance, Rippel and Bourdev [32] proposed an integrated GAN-based image compression method that achieved significant improvements in compression ratio and enhanced the subjective quality of reconstructed images. However, GAN-based compression has been successful only for narrow-domain images, such as faces, and more research is needed to establish models for general natural images.

Video coding typically involves two types of frames, illustrated in Figure 2.3: keyframes (also known as intra frames) and inter frames. A keyframe is a fully encoded frame that contains all the necessary information to reconstruct an image. It is coded independently of any other frames and serves as a reference for inter frames. Keyframes are often used as a starting point for video playback and can be decoded without relying on any other frames. Popular video coding standards such as MPEG [33], MPEG-2, H.264/AVC [8], and HEVC [34] can directly apply image compression methods to keyframes. Inter frames, on the other hand, only contain the changes from a previously encoded frame (either a keyframe or another inter frame) and are coded based on motion estimation and compensation. Because inter frames rely on previously encoded frames, they are usually much smaller than keyframes and can achieve higher compression ratios. However, inter frames cannot be decoded independently and require reference frames for reconstruction. The combination of keyframes and inter frames is commonly used in modern video coding standards such as H.264/AVC and HEVC to achieve efficient compression and high video quality.

Figure 2.3: Compression for different video frames.

2.1.3 Traditional video coding

Video coding is a fundamental process that compresses videos to enable efficient transmission and storage. It typically involves two techniques: entropy coding for lossless compression towards the Shannon limit, and lossy coding for removing redundant and less significant data in video. Although entropy coding can only achieve moderate compression ratios due to the Shannon limit, lossy compression is generally more effective because the human visual system can tolerate some loss of detail. The video coding process involves an encoder that converts video into a compressed format, and a decoder that restores the compressed video back to an uncompressed format. Together, these components form a codec (encoder/decoder), as illustrated in Figure 2.1. Video coding plays a critical role in transmitting and storing video content efficiently while minimizing the impact on image quality. A standard video encoder typically consists of three primary components: (i) a predictive coding unit, (ii) a transform coding unit, and (iii) an entropy coding unit.

Figure 2.4: Transform coding framework for video compression.

(i) Predictive coding. The predictive coding unit is a crucial component of video coding that exploits both temporal (inter-prediction) and spatial (intra-prediction) redundancies in a video sequence. This is achieved through two methods: motion estimation (ME) and motion compensation (MC). ME involves finding a matching region in the reference frame that corresponds to a block in the current frame, while MC involves determining the difference (residual) between the matching region and the target region. This generates residuals and motion vectors that help to achieve high compression ratios while maintaining a high level of video quality.
To create residuals, the encoder subtracts the prediction from the actual current frame, while the motion vector is generated by computing the offset between the current block and the position of the candidate region. The motion vector indicates the direction of movement of the block. By using predictive coding, the encoder can reduce redundancies and transmit only the necessary information, resulting in more efficient video compression.

(ii) Transform coding. Transform coding is a crucial step in video compression that converts blocks of residual samples into a set of coefficients, each of which represents a weight for a standard basis pattern. These coefficients are then fed into a quantizer, which produces reduced-precision yet bit-saving quantized coefficients. One of the most commonly used transform coding techniques is the discrete cosine transform (DCT), which was developed in 1974. In the H.264 video coding standard, transform coding is used to convert a block of residual samples into DCT coefficients. By reducing the dependency between sample points, transform coding enables more efficient compression of the video data. The encoder can achieve high compression ratios by utilizing transform coding, while maintaining the video's visual quality. This reduction in data leads to improved storage and transmission efficiency, making transform coding a critical component in video compression.

(iii) Entropy coding. After predictive coding and transform coding, the video data is still not fully compressed. Entropy coding is the final stage in video coding that produces a compact and efficient bit stream for storage and transmission. It compresses the residual signals and the quantized transform coefficients generated by the previous stages. Entropy coding techniques, such as variable length coding (VLC), arithmetic coding, and Huffman coding, assign shorter codes to more frequently occurring symbols and longer codes to less frequent symbols. For instance, in Huffman coding, the most common symbols are assigned shorter codes, while the less common symbols are assigned longer codes. The motion vectors are also entropy coded separately using a VLC table. By using entropy coding, the average bit rate of the encoded video stream can be further reduced, leading to higher compression ratios and improved storage and transmission efficiency.

Video decoding. The video decoding process, as shown in the bottom part of Figure 2.4, works in reverse order of the encoder. First, the entropy decoder recovers the prediction parameters and coefficients from the compressed bit stream. Then, the spatial decoder uses these parameters to reconstruct the residual frame. Finally, the prediction decoder uses the reconstructed pixels and the parameters to reconstruct the original frame, which is then displayed to the viewer. The decoding process plays a crucial role in video playback performance since it needs to be performed in real time. To achieve this, modern video codecs use parallel processing and specialized hardware to improve decoding efficiency. Additionally, video decoding may also involve error resilience techniques to mitigate errors introduced during storage or transmission, such as error concealment and error resilience coding.
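To make the ME/MC step concrete, the sketch below performs exhaustive block matching for a single block in NumPy; the 16×16 block size, ±8 pixel search range, and SAD matching cost are common textbook choices rather than the settings of any particular standard.

# Exhaustive block-matching motion estimation for one block
# (assumptions: 16x16 block, +/-8 pixel search range, SAD cost).
import numpy as np

def match_block(cur, ref, top, left, block=16, search=8):
    """Find the motion vector (dy, dx) minimizing SAD for one block of `cur` in `ref`."""
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            candidate = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - candidate).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)     # previous (reference) frame
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))                # current frame: shifted content

(dy, dx), sad = match_block(cur, ref, top=24, left=24)
prediction = ref[24 + dy:24 + dy + 16, 24 + dx:24 + dx + 16]  # motion-compensated prediction
residual = cur[24:40, 24:40].astype(np.int32) - prediction    # near-zero residual is cheap to code
print((dy, dx), sad, int(np.abs(residual).sum()))             # (-2, 3) 0 0: only mv + residual are sent

Real encoders speed this search up with hierarchical or diamond search patterns and sub-pixel refinement, but the principle of sending a motion vector plus a small residual is the same.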
2.1.4 Traditional video codec standards

As the most common video codecs, we briefly review the techniques used in H.264 and HEVC.

AVC (H.264) encoding. The H.264 encoder operates on macroblocks, which are 16×16 pixel units. Inter-prediction is performed by utilizing a range of block sizes (from 16×16 to 4×4) to predict pixels in the current frame from similar regions in previously encoded frames. Intra-prediction uses the same range of block sizes to predict the macroblock from the previously encoded pixels within the same frame. The encoder then obtains a residual by subtracting the prediction from the current macroblock. The residual samples are transformed using a 4×4 or 8×8 integer transform, resulting in a set of DCT coefficients. The coefficients and other information are quantized and coded into bit streams using entropy coding.

HEVC (H.265) encoding. The HEVC encoder follows a similar structure to H.264, utilizing inter/intra prediction and transform coding. Each frame of the input video sequence is divided into block-shaped regions called coding tree units (CTUs). A CTU can be of size 64×64, 32×32, or 16×16 and is organized in a quad-tree form to further partition into smaller coding units (CUs). In HEVC, the first picture of the video sequence is coded using only intra-picture prediction, and all remaining pictures are coded using inter-picture predictive coding. Each CU can be predicted via intra-prediction or inter-prediction, and the prediction residual is coded using block transforms. The entropy coding module uses context-adaptive binary arithmetic coding (CABAC). The decoding process is the inverse of the encoding process.

Decoding. The decoding process starts by extracting the quantized, transformed coefficients and the prediction information from the bit stream. The decoder then rescales the coefficients to restore each block of the residual data. These blocks are combined together to form a residual macroblock for frame reconstruction. The decoder then adds the prediction to the decoded residual to reconstruct a decoded macroblock. Finally, the decoded macroblocks are combined to reconstruct the original video frame.

Table 2.1: Historical development of video codecs

  Standard                 | Year      | Features
  MPEG family
  MPEG-1 part-2            | 1993      | Video and audio storage on CD-ROMs
  MPEG-2 part-2            | 1995      | HDTV and video on DVDs
  MPEG-4 part-2 (visual)   | 1999      | Low bit-rate multimedia on mobile platforms
  MPEG-4 part-10 (AVC)     | 2003      | Co-published with H.264/AVC
  H.26X family
  H.120                    | 1984      | The first digital video coding standard
  H.261                    | 1988      | Developed for video conferencing over ISDN
  H.262                    | 1995      | See MPEG-2 part-2
  H.263/H.263+             | 1996/1998 | Improved quality over H.261 at lower bit rates
  H.264/AVC                | 2003      | Significant quality improvement with lower bit rates
  H.265/HEVC               | 2013      | 50% bit-rate savings compared with H.264
  H.266/VVC                | 2020      | 50% bit-rate savings compared with H.265

Table 2.1 summarizes the historical development of video codecs. We provide a brief summary of each codec's features below, followed by a short encoding example:

• MPEG-1. Developed for video and audio storage on CD-ROMs; supports YUV 4:2:0 with a resolution of 352×288; lossless motion vectors.

• MPEG-2. Supports HDTV and video on DVDs; introduction of profiles and levels; nonlinear quantization and data partitioning.

• MPEG-4 part-2 (visual). Supports low bit-rate multimedia applications on mobile platforms; shares a subset with H.263; supports object-based or content-based coding.

• H.261. Developed for video conferencing over ISDN; block-based hybrid coding with integer-pixel motion compensation; supports CIF and QCIF resolutions.

• H.263 / H.263+. Improved quality over H.261 at a lower bit rate; shares a subset with MPEG-4 part-2.

• H.264 AVC. Supports video on the Internet, computers, mobile devices, and HDTVs; significantly improves quality at lower bit rates; increased computational complexity; improved motion compensation with variable block sizes, multiple reference frames, and weighted prediction.

• H.265 HEVC. Supports ultra HD video up to 8K with frame rates up to 120 fps; greater flexibility in prediction modes and transform block sizes; parallel processing; 50% bit-rate savings compared with H.264 for the same video quality.

• H.266 VVC. Provides about 50% better compression for the same perceptual quality, with support for lossless and subjectively lossless compression; supports resolutions ranging from very low resolution up to 4K and 16K, as well as 360° videos.
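As a practical illustration of how these standards are typically invoked, the snippet below calls ffmpeg (assumed to be installed with the libx264 and libx265 encoders) from Python; the file names and CRF quality settings are illustrative placeholders.

# Encode a video with H.264 and HEVC via ffmpeg (assumed installed with
# libx264/libx265). File names and CRF values are placeholders.
import subprocess

def encode(src: str, dst: str, codec: str, crf: int) -> None:
    """Re-encode `src` into `dst` with a constant-rate-factor quality target."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-c:v", codec,           # "libx264" (H.264/AVC) or "libx265" (H.265/HEVC)
         "-crf", str(crf),        # lower CRF = higher quality and larger file
         "-preset", "medium",     # speed/compression trade-off
         "-an", dst],             # drop audio to keep the example minimal
        check=True,
    )

encode("input.mp4", "out_h264.mp4", "libx264", crf=23)
encode("input.mp4", "out_h265.mp4", "libx265", crf=28)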
Supports video on the Internet, computers, mobile devices, and HDTVs; significantly improves quality at lower bit rates; increased computational complexity; improved motion compensation with variable block sizes, multiple reference frames, and weighted prediction.
• H.265 HEVC. Supports ultra-HD video up to 8K with frame rates up to 120 fps; greater flexibility in prediction modes and transform block sizes; parallel processing; 50% bit-rate savings compared with H.264 for the same video quality.
• H.266 VVC. Provides about 50% better compression for the same perceptual quality, with support for lossless and subjectively lossless compression; supports resolutions ranging from very low resolution up to 4K and 16K, as well as 360° videos.

2.1.5 Learning-based video compression

Traditional video compression algorithms, such as H.265 and H.266, rely on hand-crafted motion estimation and motion compensation techniques, such as block-based motion estimation, to achieve inter-frame prediction. While these methods reduce temporal redundancy in video data, they cannot be optimized end-to-end together with the neural networks developed for machine vision tasks, such as action recognition, on large-scale training datasets.

Recent advances in neural image compression have led to the development of neural video codecs. The pioneering work of DVC [35] follows a residual coding-based framework similar to traditional codecs. It first generates motion-compensated predictions and then encodes the residual using a hyperprior [26]. With the help of an autoregressive prior [29], DVCPro achieves even higher compression ratios. Recent research in neural video codecs has focused on improving the motion estimation and residual coding-based framework. Some works have proposed advanced network structures to generate optimized residuals or motion. For instance, Yang et al. [36] adaptively scaled the residual using learned parameters, while Agustsson et al. [37] proposed using optical flow estimation in scale space to reduce residual energy in fast-motion areas. Hu et al. [38] applied rate-distortion optimization to improve motion coding, and Hu et al. [39] used deformable compensation to enhance feature-space prediction. Lin et al. [40] proposed using multiple reference frames to reduce residual energy, and in [40, 41], motion prediction was introduced to improve motion coding efficiency.

In addition to residual coding, researchers have explored other coding frameworks for neural video codecs. One such framework is the 3D autoencoder [42, 43, 44], which encodes multiple frames simultaneously and is an extension of neural image codecs. However, this approach can introduce significant encoding delay and may not be suitable for real-time scenarios. Another emerging framework is conditional coding, which has a lower or equal entropy bound compared to residual coding [45]. For example, Ladune et al. [45, 46, 47] used conditional coding to code the foreground contents, while in DCVC [48], the condition is an extensible high-dimensional feature instead of the predicted frame. To further boost the compression ratio, recent work has introduced feature propagation and multi-scale temporal contexts [49].
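To make the residual-coding framework concrete, the sketch below predicts the current frame from the previous reconstruction and passes only the residual through a tiny autoencoder whose latent would be quantized and entropy coded. This is a minimal illustration, not the actual DVC implementation: the identity "motion compensation", the layer sizes, and the hard rounding are all placeholders.

```python
# Minimal sketch of residual coding in a DVC-style neural codec; the networks
# and the identity "motion compensation" below are illustrative placeholders.
import torch
import torch.nn as nn

class ResidualAE(nn.Module):
    """Tiny autoencoder standing in for the residual encoder/decoder."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 4, 2, 1))
        self.dec = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(ch, 3, 4, 2, 1))

    def forward(self, residual):
        latent = torch.round(self.enc(residual))    # hard quantization (training would use a soft proxy)
        return self.dec(latent), latent

def code_frame(curr, prev_rec, ae):
    pred = prev_rec                                 # placeholder for motion-compensated prediction
    rec_residual, latent = ae(curr - pred)          # encode only the residual
    return pred + rec_residual, latent              # reconstruction + symbols to entropy code

frames = torch.rand(4, 3, 64, 64)                   # a tiny random "video": T x C x H x W
ae, rec = ResidualAE(), frames[0:1]                 # assume the key frame is already coded
with torch.no_grad():
    for t in range(1, 4):
        rec, latent = code_frame(frames[t:t + 1], rec, ae)
```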
Most existing neural video codecs prioritize the optimization of the latent embedding and the network design. Previous research has largely concentrated on temporal correlation: works such as [13, 48, 49, 50] employ techniques such as temporal context priors, conditional entropy coding, or recurrent entropy models to explore this direction.

2.1.6 Application of video compression

Video compression is a critical need for many real-time applications, as depicted in Figure 2.5. With the advent of the internet, service providers offer cheap, high-speed bandwidth, leading to an explosion of data, a vast amount of which consists of videos. Storing all this data requires significant space, making it difficult to manage. To address this challenge, efficient video compression techniques are essential. This section explores various applications where such techniques are useful.

Figure 2.5: Applications of video compression: video streaming, video conferencing, social media, and surveillance video.

Video streaming. Video streaming has revolutionized the way users consume video content over the internet in real time, without downloading the entire video file. With the increasing availability of high-speed internet connections and the growing popularity of online video content, video streaming has become an essential part of our daily lives. However, the large size of video files poses a significant challenge in transmitting them quickly and efficiently over the internet. To overcome this challenge, video compression techniques have been developed to reduce the size of video files while preserving their quality. Video compression algorithms can significantly reduce the amount of data that needs to be transmitted over the internet, making it possible to stream high-quality video content in real time. As internet speeds continue to increase and video streaming grows in popularity, video compression techniques will become even more critical in delivering high-quality video content to users worldwide.

Video conferencing. Video compression is an essential application in video conferencing, allowing individuals and businesses to connect remotely without having to worry about slow or interrupted connections. Video compression works by reducing the amount of data needed to transmit a video stream over the internet while maintaining a high-quality image. This makes video conferencing accessible to a wider audience, including those with low-bandwidth internet connections. The use of video compression has become increasingly important as remote work and distance learning have become more prevalent. Without video compression, video conferencing would be prohibitively expensive and only accessible to those with high-speed internet connections. Video compression algorithms allow for real-time transmission of high-quality video, making video conferencing an effective communication tool for businesses, schools, and individuals.

Social media. Social media platforms have become an essential means of communication worldwide, and video content is increasingly becoming the most popular form of media. However, transmitting video content over the internet can be challenging due to the large file sizes involved. As a result, video compression has emerged as a crucial application for social media platforms. It allows users to share and view videos without worrying about slow or interrupted connections, and it makes it easier for platforms to store and transmit video content, as well as for users to upload and view videos without experiencing delays or buffering.
Without video compression, social media platforms would struggle to keep up with the demand for video content and to provide a seamless user experience. Moreover, video compression has made it possible for social media platforms to incorporate live video streaming, which has become increasingly popular in recent years. Live video streaming allows users to broadcast events and experiences in real time, connecting people from all over the world. Video compression algorithms play a critical role in making this technology accessible, allowing for real-time transmission of high-quality video over low-bandwidth internet connections. As video content continues to grow in popularity, video compression will remain essential for social media platforms to meet the increasing demand and to make that content accessible to a wider audience.

Surveillance video. Video surveillance has become ubiquitous in today's world, with cameras being used for security purposes in various settings. However, the storage and transmission of the large amounts of video data generated by surveillance systems pose a significant challenge. Video compression has emerged as a critical application in this context, allowing for more efficient storage and transmission of video data. Video compression algorithms play a vital role in enabling real-time transmission of high-quality video over low-bandwidth internet connections, making it possible for video surveillance to take place in remote areas with limited internet access. Additionally, the use of video compression in surveillance brings many benefits, including lower storage costs, increased efficiency, and improved accessibility. As technology continues to evolve, we can expect video compression algorithms to become even more efficient, enabling higher-quality video to be transmitted and stored at even lower cost. This will enable businesses and individuals to improve their security measures and ensure that video surveillance remains a viable and effective means of keeping people safe.

2.2 Implicit neural representation

Recent developments in deep learning have led to the emergence of implicit representations, which are compact data representations [1, 51, 52, 53] that fit a deep neural network to signals such as images, 3D shapes, and videos. One of the main branches of implicit representations is coordinate-based neural representations, which take pixel coordinates as input and output corresponding values, such as density or RGB values, using an MLP network. These representations have shown promising results in a range of areas, including image reconstruction [54, 55], image compression [52], continuous spatial super-resolution [56, 57, 58, 59], shape regression [60, 61], and 3D view synthesis [62, 63]. To improve coordinate-based representations, several approaches have been proposed, such as using sine activation functions instead of ReLU [64] or converting input coordinates to a Fourier feature space [65].
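As a concrete illustration, the sketch below follows the pattern of such coordinate-based representations: an MLP with a Fourier-feature embedding that maps (x, y) pixel coordinates to RGB values. The layer sizes and the number of frequency bands are illustrative rather than taken from any specific method.

```python
# Minimal sketch of a coordinate-based implicit representation: a Fourier-feature
# embedding followed by an MLP that maps (x, y) in [0, 1]^2 to an RGB value.
import math
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        # Fixed frequency bands for the Fourier-feature embedding (illustrative).
        self.freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * math.pi
        in_dim = 2 * 2 * num_freqs                       # sin/cos for each of x and y
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, xy):                               # xy: (N, 2) pixel coordinates
        proj = xy[..., None] * self.freqs                # (N, 2, num_freqs)
        feat = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)
        return self.mlp(feat)                            # (N, 3) RGB values

# Fitting an image amounts to regressing RGB values at sampled coordinates.
model = FourierMLP()
rgb = model(torch.rand(4096, 2))
print(rgb.shape)                                         # torch.Size([4096, 3])
```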
2.3 Deep neural networks

Through interdisciplinary research between neuroscience and mathematics, the neural network (NN) was invented and has shown strong abilities for non-linear transformation and classification. Intuitively, a network consists of multiple layers of simple processing units called neurons (perceptrons), which interact with each other via weighted connections. Neurons are activated through weighted connections from previously activated neurons. To achieve non-linearity, activation functions are applied to all intermediate layers [66]. The learning procedure for the simple perceptron was proposed and analyzed in the 1960s. During the 1970s and 1980s, the backpropagation procedure [67, 68], inspired by the chain rule for derivatives of the training objective, was proposed to solve the training problem of the multi-layer perceptron (MLP). Since then, multi-layer architectures have mostly been trained by stochastic gradient descent with backpropagation, although it is computationally intensive and can suffer from bad local minima. However, the dense connections between adjacent layers make the number of model parameters grow quadratically, which limited the computational efficiency of neural networks. With the introduction of parameter sharing around 1990 [69], a more lightweight type of neural network, the convolutional neural network (CNN), was proposed and applied to document recognition, making large-scale neural network training possible.

Over the last decade, many CNN architectures have been presented [70, 71]. Model architecture is a critical factor in improving performance across different applications, and various modifications have been made to CNN architectures from 1989 until today, including structural reformulation, regularization, and parameter optimization. It should be noted, however, that the key upgrades in CNN performance occurred largely due to the reorganization of processing units and the development of novel blocks, in particular the use of network depth. In this section, we review the most popular architectures, beginning with the AlexNet model in 2012 and ending with the Vision Transformer. Studying the features of these architectures (such as input size, depth, and robustness) helps researchers choose a suitable architecture for their target task.

AlexNet. The history of deep CNNs began with the appearance of LeNet [72]. At that time, CNNs were restricted to handwritten digit recognition tasks and could not be scaled to all image classes. Among deep CNN architectures, AlexNet [73] is highly regarded, as it achieved groundbreaking results in image recognition and classification. Krizhevsky et al. [73] first proposed AlexNet and improved the learning ability of CNNs by increasing their depth and applying several parameter optimization strategies. The learning ability of deep CNNs was limited at the time by hardware restrictions; to overcome these limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization across image resolutions, overfitting was the main drawback associated with it. Krizhevsky et al. used Hinton's idea to address this problem [74].
To make the learned features more robust, Krizhevsky et al. randomly dropped units during training (dropout). Moreover, ReLU [75] was utilized as a non-saturating activation function that reduces the vanishing gradient problem and enhances the rate of convergence [76]. Local response normalization and overlapping subsampling were also used to improve generalization by decreasing overfitting. Compared with previous networks, further modifications included the use of large filters (5×5 and 11×11) in the earlier layers. AlexNet has had considerable influence on subsequent CNN generations and began an innovative research era in CNN applications.

VGGNet. After CNNs were shown to be effective for image recognition, a simple and efficient design principle for CNNs was proposed by Simonyan and Zisserman. This design was called VGG, after the Visual Geometry Group. A multi-layer model [77], it featured up to nineteen layers, considerably deeper than AlexNet, to study the relationship between network depth and representational capacity. VGG showed experimentally that stacks of small filters can produce the same effect as large filters: the stacked small filters make the receptive field similarly effective to large filters (7×7 and 5×5). Using small filters also provides the additional advantage of lower computational complexity by reducing the number of parameters. These outcomes established a new research trend of working with small filters in CNNs. In addition, VGG regulates network complexity by inserting 1×1 convolutions between the convolutional layers, which learn a linear combination of the resulting feature maps. VGG obtained significant results for localization problems and image classification. Although it did not achieve first place in the 2014 ILSVRC competition, it acquired a reputation due to its depth, homogeneous topology, and simplicity. However, VGG's computational cost was excessive due to its roughly 140 million parameters, which was its main shortcoming.

ResNet. He et al. [78] developed ResNet (Residual Network), the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected previous networks. Several variants of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common variant is ResNet-50, which comprises 49 convolutional layers plus a single fully connected layer, with 25.5M weights and about 3.9 billion MACs. The novel idea of ResNet is its use of the bypass pathway concept, previously employed in Highway Networks (2015) to ease the training of deeper networks: a conventional feedforward network plus a residual connection.
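The residual connection itself is simple to write down; the following is a minimal sketch of a ResNet-style block with illustrative channel counts (not the exact ResNet-50 bottleneck design).

```python
# Minimal sketch of a residual (bypass) block: the block learns F(x) and
# outputs F(x) + x, so gradients can flow through the identity path.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # identity shortcut: the bypass pathway

x = torch.rand(1, 64, 32, 32)
print(ResidualBlock()(x).shape)              # torch.Size([1, 64, 32, 32])
```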
Figure 2.6: DenseNet architecture. Image obtained from [4].

DenseNet. To solve the problem of the vanishing gradient, DenseNet [4] was presented, following the same direction as ResNet and the Highway network [79]. One of the drawbacks of ResNet is that it preserves information through additive identity transformations, so several layers may contribute very little or no new information. In addition, ResNet has a large number of weights, since each layer has its own isolated set of weights. DenseNet addresses these problems with improved cross-layer connectivity: each layer is connected to all following layers in a feed-forward fashion, so the feature maps of all preceding layers are used as inputs to every subsequent layer. DenseNet thus demonstrates the influence of dense cross-layer connectivity. Because DenseNet concatenates the features of the preceding layers rather than adding them, the network gains the ability to discriminate clearly between added and preserved information. However, due to its narrow layers and the growing number of feature maps, DenseNet can become expensive in parameters. The direct access of all layers to the gradients from the loss function improves information flow across the network and has a regularizing effect, which reduces overfitting on tasks with small training sets. Figure 2.6 shows the DenseNet architecture.

Vision Transformer. Transformer architectures are based on a self-attention mechanism that learns the relationships between elements of a sequence. As opposed to recurrent networks, which process sequence elements recursively and can only attend to short-term context, Transformers can attend to complete sequences, thereby learning long-range relationships. The Vision Transformer (ViT) [80] (Figure 2.7) is the first work to show that Transformers can 'altogether' replace standard convolutions in deep neural networks on large-scale image datasets. It applies the original Transformer model [81] (with minimal changes) to a sequence of image patches flattened as vectors. The model was pre-trained on a large proprietary dataset (the JFT dataset [82] with 300 million images) and then fine-tuned on downstream recognition benchmarks, e.g., ImageNet classification. This pre-training step is important: pre-training ViT on a medium-sized dataset does not give competitive results, because CNNs encode prior knowledge about images (inductive biases such as translation equivariance) that reduces the need for data, whereas Transformers must discover such information from very large-scale data. DeiT [83] is the first work to demonstrate that Transformers can be trained on mid-sized datasets (1.2 million ImageNet examples, compared to the 300 million JFT images used by ViT) with relatively shorter training schedules. Besides using the augmentation and regularization procedures common for CNNs, the main contribution of DeiT is a novel native distillation approach for Transformers, which uses a CNN (RegNetY-16GF [84]) as a teacher model to train the Transformer.

Figure 2.7: An overview of Vision Transformer (left) and the details of the Transformer encoder (right). The architecture resembles Transformers used in the NLP domain, and the image patches are simply fed to the model after flattening. After training, the feature obtained from the first token position is used for classification. Image obtained from [5].
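The patch-based tokenization used by ViT is straightforward to sketch; the block below shows the idea with an illustrative patch size and embedding dimension, and it omits position embeddings and the classification head for brevity.

```python
# Minimal sketch of ViT-style patch embedding: split the image into
# non-overlapping patches, project each patch to a token, prepend a class
# token, and run a small Transformer encoder (position embeddings omitted).
import torch
import torch.nn as nn

patch, dim = 16, 192                                              # illustrative settings
img = torch.rand(1, 3, 224, 224)

to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # one linear projection per patch
tokens = to_tokens(img).flatten(2).transpose(1, 2)                # (1, 196, 192)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))                  # learned class token
seq = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)     # (1, 197, 192)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
features = encoder(seq)[:, 0]                                     # first-token feature used for classification
print(features.shape)                                             # torch.Size([1, 192])
```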
Chapter 3: NeRV: Implicit Neural Representations for Videos

3.1 Introduction

What is a video? Typically, a video captures a dynamic visual scene using a sequence of frames. A schematic interpretation of this is a curve in 2D space, where each point can be characterized by an (x, y) pair representing the spatial state. If we have a model for all (x, y) pairs, then, given any x, we can easily find the corresponding y state. Similarly, we can interpret a video as a recording of the visual world, where we can find a corresponding RGB state for every single timestamp. This leads to our main question: can we represent a video as a function of time? More formally, can we represent a video $V$ as $V = \{v_t\}_{t=1}^{T}$, where $v_t = f_\theta(t)$, i.e., a frame at timestamp $t$ is represented as a function $f$ parameterized by $\theta$? Given their remarkable representational capacity [85], we choose deep neural networks as the function in our work.

Given these intuitions, we propose NeRV, a novel representation that represents videos as implicit functions and encodes them into neural networks. Specifically, with a fairly simple deep neural network design, NeRV can reconstruct the corresponding video frames with high quality, given the frame index. Once the video is encoded into a neural network, this network can be used as a proxy for the video, from which we can directly extract all video information. Therefore, unlike traditional video representations, which treat videos as sequences of frames, as shown in Figure 3.1 (a), our proposed NeRV considers a video as a unified neural network with all information embedded within its architecture and parameters, as shown in Figure 3.1 (b).

Table 3.1: Comparison of different video representations. Although explicit representations currently outperform implicit ones in encoding speed and compression ratio, NeRV shows a great advantage in decoding speed, and it outperforms pixel-wise implicit representations in all metrics.

                   | Explicit (frame-based)                                             | Implicit (unified)
                   | Hand-crafted (e.g., HEVC [34]) | Learning-based (e.g., DVC [35])   | Pixel-wise (e.g., NeRF [62]) | Image-wise (Ours)
Encoding speed     | Fast                           | Medium                            | Very slow                    | Slow
Decoding speed     | Medium                         | Slow                              | Very slow                    | Fast
Compression ratio  | Medium                         | High                              | Low                          | Medium

Figure 3.1: (a) Conventional video representation as a frame sequence. (b) NeRV, which represents a video as a neural network consisting of multiple convolutional layers, taking the normalized frame index as input and outputting the corresponding RGB frame.

As an image-wise implicit representation, NeRV shares many similarities with pixel-wise implicit visual representations [54, 55], which take spatio-temporal coordinates as inputs. The main differences between our work and pixel-wise implicit representations are the output space and the architecture design. Pixel-wise representations output the RGB value for each pixel, while NeRV outputs a whole image, as illustrated in Figure 3.2. Given a video of size T×H×W, pixel-wise representations need to sample the video T×H×W times, while NeRV only needs to sample it T times. Considering the huge number of pixels, especially for high-resolution videos, NeRV shows a great advantage in both encoding time and decoding speed. The different output spaces also lead to different architecture designs: NeRV utilizes an MLP + ConvNet architecture to output an image, while pixel-wise representations use a simple MLP to output the RGB value of a single pixel.
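The gap in the number of forward passes is easy to quantify; the clip length and resolution below are illustrative of a 1080p video.

```python
# Forward passes needed to decode one video: a pixel-wise representation is
# queried once per pixel, an image-wise representation (NeRV) once per frame.
T, H, W = 600, 1080, 1920                     # illustrative clip length and resolution
print("pixel-wise queries:", T * H * W)       # 1,244,160,000
print("image-wise queries:", T)               # 600
```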
The sampling efficiency of NeRV also simplifies the optimization problem, which leads to better reconstruction quality compared to pixel-wise representations.

We also demonstrate the flexibility of NeRV by exploring several applications it affords. Most notably, we examine the suitability of NeRV for video compression. Traditional video compression frameworks are quite involved: they specify key frames and inter frames, estimate residual information, divide video frames into blocks, apply the discrete cosine transform to the resulting blocks, and so on. Such a long pipeline makes the decoding process very complex as well. In contrast, given a neural network that encodes a video in NeRV, we can simply cast the video compression task as a model compression problem and trivially leverage any well-established or cutting-edge model compression algorithm to achieve good compression ratios. Specifically, we explore a three-step model compression pipeline of model pruning, model quantization, and weight encoding, and show the contribution of each step to the compression task. We conduct extensive experiments on popular video compression datasets, such as UVG [86], and show the applicability of model compression techniques on NeRV for video compression. We briefly compare different video representations in Table 3.1, where NeRV shows a great advantage in decoding speed.

Besides video compression, we also explore other applications of the NeRV representation, such as video denoising. Since NeRV is a learned implicit function, we can demonstrate its robustness to noise and perturbations. Given a noisy video as input, NeRV generates a high-quality denoised output without any additional operation, and even outperforms conventional denoising methods.

The contributions of this work can be summarized in four parts:

• We propose NeRV, a novel image-wise implicit representation for videos that represents a video as a neural network, converting video encoding into model fitting and video decoding into a simple feedforward operation.
• Compared to pixel-wise implicit representations, NeRV outputs the whole image and shows great efficiency, improving encoding speed by 25× to 70× and decoding speed by 38× to 132×, while achieving better video quality.
• NeRV allows us to convert the video compression problem into a model compression problem, letting us leverage standard model compression tools and reach performance comparable to conventional video compression methods, e.g., H.264 [8] and HEVC [34].
• As a general representation for videos, NeRV also shows promising results on other tasks, e.g., video denoising. Without any special denoising design, NeRV outperforms traditional hand-crafted denoising algorithms (e.g., the median filter) and ConvNet-based denoising methods.

3.2 Related Work

Implicit Neural Representation. Implicit neural representation is a novel way to parameterize a variety of signals. The key idea is to represent an object as a function approximated via a neural network, which maps a coordinate to its corresponding value (e.g., a pixel coordinate for an image to the RGB value of that pixel). It has been widely applied in many 3D vision tasks, such as 3D shapes [87, 88], 3D scenes [89, 90, 91, 92], and the appearance of 3D structures [62, 93, 94]. Compared to explicit 3D representations, such as voxels, point clouds, and meshes, the continuous implicit neural representation can compactly encode high-resolution signals in a memory-efficient way.
Most recently, [52] demonstrated the feasibility of using implicit neural representations for image compression. Although it is not yet competitive with state-of-the-art compression methods, it shows promising and attractive properties. In previous methods, MLPs are often used to approximate the implicit neural representation, taking a spatial or spatio-temporal coordinate as input and outputting the signal at that single point (e.g., an RGB value or volume density). In contrast, our NeRV representation trains a purposely designed neural network composed of MLP and convolutional layers, which takes the frame index as input and directly outputs all the RGB values of that frame.

Video Compression. As a fundamental task of computer vision and image processing, visual data compression has been studied for several decades. Before the resurgence of deep networks, handcrafted image compression techniques, like JPEG [19] and JPEG2000 [95], were widely used. Building upon them, many traditional video compression algorithms, such as MPEG [33], H.264 [8], and HEVC [34], have achieved great success. These methods are generally based on transform coding, like the discrete cosine transform (DCT) [20] or wavelet transform [96], and are well-engineered and tuned to be fast and efficient. More recently, deep learning-based visual compression approaches have been gaining popularity. For video compression, the most common practice is to utilize neural networks for certain components while keeping the traditional video compression pipeline. For example, [97] proposed an effective image compression approach and generalized it to video compression by adding interpolation loop modules. Similarly, [98] converted the video compression problem into an image interpolation problem and proposed an interpolation network, resulting in competitive compression quality. Furthermore, [37] generalized optical flow to scale-space flow to better model uncertainty in compression. Later, [99] employed a temporal hierarchical structure and trained neural networks for most components, including key frame compression, motion estimation, motion compression, and residual compression. However, all of these works still follow the overall pipeline of traditional compression, arguably limiting their capabilities.

Model Compression. The goal of model compression is to simplify an original model by reducing the number of parameters while maintaining its accuracy. Current research on model compression can be divided into four groups: parameter pruning and quantization [100, 101, 102, 103, 104, 105]; low-rank factorization [106, 107, 108]; transferred and compact convolutional filters [109, 110, 111, 112]; and knowledge distillation [113, 114, 115, 116]. Our proposed NeRV enables us to reformulate the video compression problem as model compression and to utilize standard model compression techniques. Specifically, we use model pruning and quantization to reduce the model size without significantly deteriorating the performance.

3.3 Neural Representations for Videos

We first present the NeRV representation in Section 3.3.1, including the input embedding, the network architecture, and the loss objective. Then, we present model compression techniques on NeRV in Section 3.3.2 for video compression.
Figure 3.2: (a) Pixel-wise implicit representation (e.g., SIREN), which takes pixel coordinates as input and uses a simple MLP to output the pixel's RGB value. (b) NeRV: image-wise implicit representation (ours), which takes the frame index as input and uses an MLP + ConvNet to output the whole image. (c) NeRV block architecture, which upscales the feature map by a factor of S.

3.3.1 NeRV Architecture

In NeRV, each video $V = \{v_t\}_{t=1}^{T} \in \mathbb{R}^{T \times H \times W \times 3}$ is represented by a function $f_\theta: \mathbb{R} \rightarrow \mathbb{R}^{H \times W \times 3}$, where the input is a frame index $t$ and the output is the corresponding RGB image $v_t \in \mathbb{R}^{H \times W \times 3}$. The encoding function is parameterized by a deep neural network with weights $\theta$, $v_t = f_\theta(t)$. Therefore, video encoding is done by fitting a neural network $f_\theta$ to a given video, such that it can map each input timestamp to the corresponding RGB frame.

Input Embedding. Although deep neural networks can be used as universal function approximators [85], directly training the network $f_\theta$ on the raw input timestamp $t$ leads to poor results, as also observed by [62, 117]. By mapping the inputs to a high-dimensional embedding space, the neural network can better fit data with high-frequency variations. Specifically, in NeRV, we use Positional Encoding [54, 62, 81] as our embedding function:

$$\Gamma(t) = \left( \sin\!\left(b^{0}\pi t\right), \cos\!\left(b^{0}\pi t\right), \ldots, \sin\!\left(b^{l-1}\pi t\right), \cos\!\left(b^{l-1}\pi t\right) \right) \qquad (3.1)$$

where $b$ and $l$ are hyper-parameters of the network. Given an input timestamp $t$, normalized to $(0, 1]$, the output of the embedding function $\Gamma(\cdot)$ is then fed to the following neural network.

Network Architecture. The NeRV architecture is illustrated in Figure 3.2 (b). NeRV takes the time embedding as input and outputs the corresponding RGB frame. Leveraging an MLP to directly output all pixel values of a frame would require a huge number of parameters, especially when the image resolution is large. Therefore, we stack multiple NeRV blocks after the MLP layers so that pixels at different locations can share convolutional kernels, leading to an efficient and effective network. Inspired by super-resolution networks, we design the NeRV block, illustrated in Figure 3.2 (c), adopting the PixelShuffle technique [118] for upscaling. Convolution and activation layers are also inserted to enhance expressiveness. The detailed architecture can be found in the supplementary material.

Loss Objective. For NeRV, we adopt a combination of L1 and SSIM losses as our objective for network optimization, computed over all pixel locations of the predicted and ground-truth images:

$$L = \frac{1}{T} \sum_{t=1}^{T} \alpha \left\| f_\theta(t) - v_t \right\|_1 + (1 - \alpha)\left(1 - \text{SSIM}\!\left(f_\theta(t), v_t\right)\right) \qquad (3.2)$$

where $T$ is the number of frames, $f_\theta(t) \in \mathbb{R}^{H \times W \times 3}$ the NeRV prediction, $v_t \in \mathbb{R}^{H \times W \times 3}$ the ground-truth frame, and $\alpha$ a hyper-parameter balancing the two loss components.
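To summarize the architecture, the following is a minimal NeRV-style sketch: a positional encoding of the frame index, a small MLP, and stacked NeRV blocks that upscale with PixelShuffle. The channel widths, block count, and up-scale factors here are illustrative and differ from the models used in our experiments.

```python
# Minimal NeRV-style sketch: positional encoding -> MLP -> NeRV blocks
# (Conv + PixelShuffle + activation). Sizes are illustrative only.
import math
import torch
import torch.nn as nn

def positional_encoding(t, b=1.25, l=80):
    """Eq. (3.1): map a normalized frame index t in (0, 1] to 2*l features."""
    freqs = b ** torch.arange(l, dtype=torch.float32) * math.pi
    angles = t[:, None] * freqs                               # (N, l)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (N, 2*l)

class NeRVBlock(nn.Module):
    """Conv -> PixelShuffle(s) -> activation, upscaling the feature map by s."""
    def __init__(self, c_in, c_out, s):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * s * s, 3, padding=1)
        self.up = nn.PixelShuffle(s)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.up(self.conv(x)))

class TinyNeRV(nn.Module):
    def __init__(self, l=80, c=64, h0=9, w0=16, scales=(4, 2, 2)):
        super().__init__()
        self.c, self.h0, self.w0 = c, h0, w0
        self.mlp = nn.Sequential(nn.Linear(2 * l, 512), nn.GELU(),
                                 nn.Linear(512, c * h0 * w0), nn.GELU())
        self.blocks = nn.Sequential(*[NeRVBlock(c, c, s) for s in scales])
        self.head = nn.Conv2d(c, 3, 3, padding=1)

    def forward(self, t):
        x = self.mlp(positional_encoding(t)).view(-1, self.c, self.h0, self.w0)
        return torch.sigmoid(self.head(self.blocks(x)))       # (N, 3, H, W)

model = TinyNeRV()
frames = model(torch.tensor([0.25, 0.5]))                     # decode two frames
print(frames.shape)                                           # torch.Size([2, 3, 144, 256])
```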
3.3.2 Model Compression

In this section, we briefly revisit the model compression techniques used for video compression with NeRV. Our model compression pipeline consists of four standard sequential steps: video overfitting, model pruning, weight quantization, and weight encoding, as shown in Figure 3.3.

Figure 3.3: NeRV-based video compression pipeline: video overfitting, model pruning, model quantization, and weight encoding.

Model Pruning. Given a neural network fit on a video, we first use global unstructured pruning to reduce the model size. Based on the magnitude of the weight values, we set weights below a threshold to zero:

$$\theta_i = \begin{cases} \theta_i, & \text{if } \theta_i \geq \theta_q \\ 0, & \text{otherwise,} \end{cases} \qquad (3.3)$$

where $\theta_q$ is the $q$-percentile value over all parameters in $\theta$. As is common practice, we fine-tune the model after the pruning operation to regain its representation quality.

Model Quantization. After model pruning, we apply quantization to all network parameters. Note that, unlike many recent works [104, 119, 120, 121] that utilize quantization during training, NeRV is only quantized post hoc (after the training process). Given a parameter tensor $\mu$,

$$\mu_i = \text{round}\!\left(\frac{\mu_i - \mu_{\min}}{\text{scale}}\right) \cdot \text{scale} + \mu_{\min}, \qquad \text{scale} = \frac{\mu_{\max} - \mu_{\min}}{2^{\text{bit}}} \qquad (3.4)$$

where 'round' rounds a value to the closest integer, 'bit' is the bit length of the quantized model, $\mu_{\max}$ and $\mu_{\min}$ are the maximum and minimum values of the parameter tensor $\mu$, and 'scale' is the scaling factor. Through Equation 3.4, each parameter can be mapped to a 'bit'-length value. The overhead of storing 'scale' and $\mu_{\min}$ can be ignored given the large number of parameters in $\mu$; e.g., they account for only 0.005% of a small 3×3 Conv layer with 64 input channels and 64 output channels (37k parameters in total).

Entropy Encoding. Finally, we use entropy encoding to further compress the model size. By taking advantage of symbol frequencies, entropy encoding can represent the data with a more efficient code. Specifically, we employ Huffman coding [122] after model quantization. Since Huffman coding is lossless, a decent compression is achieved without any impact on the reconstruction quality. Empirically, this further reduces the model size by around 10%.
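The pruning and quantization steps can be sketched in a few lines; the NumPy code below follows Equations 3.3 and 3.4 on a single weight tensor, with an illustrative pruning ratio and bit width (thresholding on weight magnitude) and without the entropy coding of the resulting integer codes.

```python
# Minimal sketch of the pruning (Eq. 3.3) and post-hoc quantization (Eq. 3.4)
# steps on one weight tensor; pruning ratio and bit width are illustrative.
import numpy as np

def prune(weights, q=40):
    """Zero out weights whose magnitude is below the q-th percentile."""
    threshold = np.percentile(np.abs(weights), q)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_dequantize(weights, bit=8):
    """Eq. (3.4): map each weight to a `bit`-length code, then reconstruct it."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (2 ** bit)
    codes = np.round((weights - w_min) / scale)       # integers to be entropy coded
    return codes * scale + w_min

weights = np.random.randn(64, 64, 3, 3).astype(np.float32)   # a hypothetical conv layer
compressed = quantize_dequantize(prune(weights), bit=8)
print("max weight error:", np.abs(weights - compressed).max())
```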
3.4 Experiments

3.4.1 Datasets and Implementation Details

We perform experiments on the "Big Buck Bunny" sequence from scikit-video, which has 132 frames at 720×1080 resolution, to compare NeRV with pixel-wise implicit representations. To compare with state-of-the-art methods on the video compression task, we conduct experiments on the widely used UVG dataset [86], consisting of 7 videos with 3,900 frames at 1920×1080 in total.

In our experiments, we train the network using the Adam optimizer [123] with a learning rate of 5e-4. For the ablation study on UVG, we use a cosine annealing learning rate schedule [124], a batch size of 1, 150 training epochs, and 30 warmup epochs unless otherwise noted. When comparing with the state of the art, we run the model for 1500 epochs with a batch size of 6. For experiments on "Big Buck Bunny", we train NeRV for 1200 epochs unless otherwise noted. For the fine-tuning process after pruning, we use 50 epochs for both UVG and "Big Buck Bunny". For the NeRV architecture, there are 5 NeRV blocks, with up-scale factors of 5, 3, 2, 2, 2 for 1080p videos and 5, 2, 2, 2, 2 for 720p videos. By changing the hidden dimension of the MLP and the channel dimension of the NeRV blocks, we can build NeRV models of different sizes. For the input embedding in Equation 3.1, we use b = 1.25 and l = 80 as our default setting. For the loss objective in Equation 3.2, α is set to 0.7. We evaluate video quality with two metrics, PSNR and MS-SSIM [125], and adopt bits-per-pixel (BPP) to indicate the compression ratio. We implement our model in PyTorch [126] and train it in full precision (FP32). All experiments are run on an NVIDIA RTX 2080 Ti. Please refer to the supplementary material for more experimental details, results, and visualizations (e.g., MCL-JCV [127] results).

Table 3.2: Comparison with pixel-wise implicit representations. Training speed means time per epoch, while encoding time is the total training time.

Methods        | Parameters | Training Speed ↑ | Encoding Time ↓ | PSNR ↑ | Decoding FPS ↑
SIREN [55]     | 3.2M       | 1×               | 2.5×            | 31.39  | 1.4
NeRF [62]      | 3.2M       | 1×               | 2.5×            | 33.31  | 1.4
NeRV-S (ours)  | 3.2M       | 25×              | 1×              | 34.21  | 54.5
SIREN [55]     | 6.4M       | 1×               | 5×              | 31.37  | 0.8
NeRF [62]      | 6.4M       | 1×               | 5×              | 35.17  | 0.8
NeRV-M (ours)  | 6.3M       | 50×              | 1×              | 38.14  | 53.8
SIREN [55]     | 12.7M      | 1×               | 7×              | 25.06  | 0.4
NeRF [62]      | 12.7M      | 1×               | 7×              | 37.94  | 0.4
NeRV-L (ours)  | 12.5M      | 70×              | 1×              | 41.29  | 52.9

Table 3.3: PSNR vs. epochs. Since video encoding with NeRV is an overfitting process, the reconstructed video quality keeps increasing with more training epochs. NeRV-S/M/L denote models of different sizes.

Epoch | NeRV-S | NeRV-M | NeRV-L
300   | 32.21  | 36.05  | 39.75
600   | 33.56  | 37.47  | 40.84
1.2k  | 34.21  | 38.14  | 41.29
1.8k  | 34.33  | 38.32  | 41.68
2.4k  | 34.86  | 38.7   | 41.99

3.4.2 Main Results

We compare NeRV with pixel-wise implicit representations on the "Big Buck Bunny" video. We take SIREN [55] and NeRF [62] as the baselines, where SIREN [55] takes the original pixel coordinates as input and uses sine activations, while NeRF [62] adds one posi