ABSTRACT

Title of Dissertation: THE FIRST PRINCIPLES OF DEEP LEARNING AND COMPRESSION

Max Ehrlich, Doctor of Philosophy, 2022

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science; Professor Larry S. Davis, Department of Computer Science

The deep learning revolution incited by the 2012 AlexNet paper has been transformative for the field of computer vision. Many problems which were severely limited using classical solutions are now seeing unprecedented success. The rapid proliferation of deep learning methods has led to a sharp increase in their use in consumer and embedded applications. One consequence of consumer and embedded applications is lossy multimedia compression, which is required to engineer the efficient storage and transmission of data in these real-world scenarios. As such, there has been increased interest in a deep learning solution for multimedia compression which would allow for higher compression ratios and increased visual quality.

The deep learning approach to multimedia compression, so-called Learned Multimedia Compression, involves computing a compressed representation of an image or video using a deep network for the encoder and the decoder. While these techniques have enjoyed impressive academic success, their industry adoption has been essentially non-existent. Classical compression techniques like JPEG and MPEG are too entrenched in modern computing to be easily replaced. This dissertation takes an orthogonal approach and leverages deep learning to improve the compression fidelity of these classical algorithms. This allows the incredible advances in deep learning to be used for multimedia compression without threatening the ubiquity of the classical methods.

The key insight of this work is that methods which are motivated by first principles, i.e., the underlying engineering decisions that were made when the compression algorithms were developed, are more effective than general methods. By encoding prior knowledge into the design of the algorithm, the flexibility, performance, and/or accuracy are improved at the cost of generality. While this dissertation focuses on compression, the high-level idea can be applied to many different problems with success. Four completed works in this area are reviewed. The first work, which is foundational, unifies the disjoint mathematical theories of compression and deep learning, allowing deep networks to operate on compressed data directly. The second work shows how deep learning can be used to correct information loss in JPEG compression over a wide range of compression qualities, a problem that is not readily solvable without a first principles approach. This allows images to be encoded at high compression ratios while still maintaining visual fidelity. The third work examines how deep learning based inferencing tasks, like classification, detection, and segmentation, behave in the presence of classical compression and how to mitigate performance loss. As in the previous work, this allows images to be compressed further, but this time without accuracy loss on downstream learning tasks. Finally, these ideas are extended to video compression by developing an algorithm to correct video compression artifacts. By incorporating bitstream metadata and mimicking the decoding process with deep learning, the method produces more accurate results with higher throughput than general methods.
This allows deep learning to improve the rate-distortion of classical MPEG codecs and competes with fully deep learning based codecs but with a much lower barrier-to-entry.

THE FIRST PRINCIPLES OF DEEP LEARNING AND COMPRESSION

by Max Ehrlich

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2022

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Larry S. Davis, Co-chair/Advisor
Professor Wojciech Czaja, Dean's Representative
Professor Ramani Duraiswami
Professor Dinesh Manocha
Dr. Michael A. Isnardi
Professor David A. Forsyth

© Copyright by Max Ehrlich 2022

Preface

Multimedia compression is a critical feature of the modern internet [1]. Websites like Facebook, Instagram, and YouTube have increasingly coalesced around the sharing of images and video. Viewing and sharing such media are now prerequisite to modern internet interactions. When comparing media, we can create an approximate hierarchy with each successive level containing an order of magnitude more information: text, which comprises a simple linguistic description; images, which contain a full visualization of some scene; and videos, which contain a temporal evolution of the visualization of a scene. Naturally, as the amount of information contained in a particular medium increases, so too does the size of its digital representation.

Because of modern engineering constraints, it is not feasible to transmit image and video media in their native format (e.g., a 2D or 3D array of samples). As an example: a single frame of a 1080p video in a raw format, assuming single-byte samples in three colors, would require around 1 byte × 3 channels × 1920 × 1080 = 6,220,800 bytes ≈ 6 MB to represent it natively. Extending this to 30 seconds of video at 30 frames per second would require 6,220,800 bytes × 30 s × 30 fps = 5,598,720,000 bytes ≈ 5 GB. We can observe that most videos are longer than 30 seconds and 4K videos are becoming more common. Transmitting these media over modern cellular or even wired connections would be quite difficult. A typical home internet connection bandwidth ranges from 10-50 Mbps. For the video example, this would take 5,598,720,000 bytes × 8 / [10,000,000, 50,000,000] bps = [4478, 896] s, i.e., anywhere from 15 minutes to 1.2 hours for this short video. The situation is even worse on cellular connections, where LTE upload speeds range from 2-5 Mbps [2] (almost 3 hours for our example in the best case) and most users pay for a fixed amount of data.

To make the modern internet feasible, by reducing transmission and storage cost, we compress these media. For modern compression codecs, JPEG [3] can reduce the 6 MB image to around 25 kB in size, and H.264 [4] compression can reduce the 5 GB video to only a few megabytes depending on the spatial and temporal entropy. These impressive size reductions are a result of more than just entropy reducing operations: they also incur a loss of information. The removed information is designed to be as imperceptible as possible and is based on analyses of human visual perception. For images, we remove "high spatial frequencies" [5], or small spatial changes that would not ordinarily be noticed. For videos, we can take a further step to estimate motion between frames and encode only a description of the motion [6].
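These back-of-the-envelope numbers are easy to reproduce; the short Python sketch below recomputes them using the same illustrative resolution, duration, and bandwidth figures quoted above, and is not tied to any particular codec or dataset.

# Recomputing the raw-size and transfer-time estimates above.
bytes_per_sample = 1           # single-byte samples
channels = 3                   # three color channels
width, height = 1920, 1080     # one 1080p frame

frame_bytes = bytes_per_sample * channels * width * height
print(frame_bytes)             # 6220800 bytes, roughly 6 MB

seconds, fps = 30, 30
video_bytes = frame_bytes * seconds * fps
print(video_bytes)             # 5598720000 bytes, roughly 5 GB

# Transfer time over a typical 10-50 Mbps home connection.
for mbps in (10, 50):
    transfer_seconds = video_bytes * 8 / (mbps * 1_000_000)
    print(mbps, transfer_seconds)   # roughly 1.2 hours and 15 minutes

Nothing here depends on any codec; it is only the arithmetic that motivates compression in the first place.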
For modern codecs, the lossy effects are generally not noticeable to laymen, and codec development continues to improve visual fidelity and reduce file sizes year after year.

Despite these amazing advances in compression, there are still problems. In many parts of the world, including in rural America, many people use metered internet connections [7]. Under these connections they have a finite amount of data or pay for the data they use, similar to most modern cell phone plans. For these people, participation in the modern internet is increasingly difficult. Not only are they expected to upload their own media, but they must view others' media in order to take part in the discourse on many websites. For this class of consumer, it is paramount that as little data as possible be used during any internet transmission, precluding most videos and some images from being accessible. The internet is historically unprecedented as both an entertainment medium and a repository of human knowledge, and benefits from maximal participation. Therefore, in order to reach these people more effectively, it is critical that further advances in multimedia compression be developed.

Meanwhile, deep learning has revolutionized modern machine learning [8]-[10]. In deep learning, a model is trained to take an input in its native representation and learn a nonlinear mapping directly from it. This is in contrast to classical machine learning, which depended on engineered features which were extracted prior to mappings being computed. By taking the native input representation, the deep model can be organized into many layers which function as their own feature extractors. Instead of engineered features, the best features to solve a given problem are learned jointly with the mapping function. The development of these techniques has enjoyed unprecedented success in all areas of machine learning, and these models are being rapidly deployed in consumer settings to solve problems which were once thought to be impossible for computers to solve.

Unsurprisingly, given the previous discussion, one area of interest for deep learning applications is that of multimedia compression. And also unsurprisingly, deep learning has made amazing contributions here [11]-[16]. Deep models are able to compress both images and video significantly better than classical algorithms and with little loss of quality. Despite this, the classical algorithms stubbornly persist. JPEG files are still ubiquitous, and MPEG standards continue to be developed and deployed in consumer applications despite the amazing advances of machine learning. These algorithms, and their associated files, are simply too familiar and too ingrained in the code powering the modern internet to be easily replaced. Nevertheless, there is a plain socioeconomic need, as described previously, for deep learning in compression, as there is for anything that reduces the size of images and videos.

In This Dissertation

I take an orthogonal approach to multimedia compression in deep learning. In my approach, I develop deep learning methods which work with the existing compression algorithms rather than replace them. In this way, our algorithms are easy to integrate into the modern internet as simple pre- or post-processing steps on images or videos. These classical compression algorithms are developed with a series of engineering decisions that determine how much information is lost and the nature of that loss. I call these decisions "first principles"
and I develop machine learning algorithms that are explicitly aware of these decisions. I will show that this leads to a significant improvement in the fidelity and/or flexibility of the solution. These advances have greatly improved the practicality of deploying machine learning solutions to solve compression problems, although their potential applications are widespread.

Organization Of This Document

This document is organized into three parts. In the first part, I will discuss, briefly, background knowledge that a reader should be equipped with in order to have a full understanding of this dissertation. The next two parts discuss related works and my own contributions to image and video compression respectively. This dissertation is written in a conversational style, and beyond this preface I will often refer to the reader as "we". This indicates that "we", i.e., the reader and I, are discovering the knowledge together as the concepts in the dissertation are developed from prior work into completed topics. I strongly believe in the use of color for guidance. When I believe it will be helpful, I will use color in mathematics and figures to group related ideas. For complex mathematics specifically, I find this to be much clearer than braces alone, especially for hinting from early in a derivation which parts of long equations are related and will eventually be grouped together or cancelled. When useful for clarifying an algorithm I have included code listings. These listings are written in something approximating Python with PyTorch [17] APIs where deep learning is required. These code samples are not guaranteed to run exactly as they are written.

What This Document Is

This document is, first and foremost, a dissertation. This means that its primary purpose is to relay the unique contributions of the author over the course of about five years of research. The astute reader will notice, in the table of contents, section titles which are colored in Plum. These sections represent the unique contributions of my research program, i.e., papers which were published in the course of completing my Ph.D. These section titles are colored in the body of the document as well, so it is always easy for the reader to know if they are reading about background work or one of my contributions. Readers will, naturally, find these sections are the most detailed and well developed. In each of these chapters, I have included a dedicated section titled "Limitations and Future Directions". No scientific work is perfect and mine are no exception. I believe it is important to be up front about these limitations with a candid discussion along with guidance for future researchers in the field.

What This Document Is Not

This document is not a textbook or survey of multimedia compression algorithms and their relationship with deep learning, and readers should manage their expectations as such. For the purposes of imparting a full understanding of this dissertation's contributions to scientific discourse, there is a review of elementary concepts of mathematics, machine learning, and compression as well as an overview of related works and recent advances in machine learning. If, in the course of reading this dissertation, a reader gains any useful knowledge, this is welcome but entirely accidental.

Dedication

To my wife, Dr. Sujeong Kim. You supported me unwaveringly and unconditionally throughout this process and I am eternally grateful.
To my daughter Yena and my son Jaeo. Knowing you will be the greatest privilege of my life.

Acknowledgements

Special thanks to
• Christian Steinruecken for providing the math coloring function from his dissertation.
• My editors: Shishira Maiya, Lillian Huang, Vatsal Agarwal, and Namitha Padmanabhan.
• My social media consultant Gowthami Somepalli and her assistant Kamal Gupta.

The research presented in this dissertation was partially supported by Facebook AI, Defense Advanced Research Projects Agency (DARPA) MediFor (FA87501620191), DARPA SemaFor (HR001119S0085) and DARPA SAIL-ON (W911NF2020009) programs. There is no collaboration between Facebook and DARPA.

Table of Contents

Preface
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations

Part I: Preliminaries
Chapter 1: Linear Algebra
1.1 Scalars, Vectors, and Matrices
1.2 Bases and Finite Dimensional Vector Spaces
1.3 Infinite Dimensional Vector Spaces
1.4 Abstractions
Chapter 2: Multilinear Algebra
2.1 Tensors
2.2 Tensor Products and Einstein Notation
2.3 Tensor Spaces
2.4 Linear Pixel Manipulations
Chapter 3: Harmonic Analysis
3.1 The Fourier Transform
3.2 The Gabor Transform
3.3 Wavelet Transforms
3.3.1 Continuous and Discrete Wavelet Transforms
3.3.2 Haar Wavelets
Chapter 4: Entropy and Information
4.1 Shannon Entropy
4.2 Huffman Coding
4.3 Arithmetic Coding
Chapter 5: Machine Learning and Deep Learning
5.1 Bayesian Decision Theory
5.2 Perceptrons and Multilayer Perceptrons
5.3 Image Features
5.3.1 Histogram of Oriented Gradients
5.3.2 Scale-Invariant Feature Transform
5.4 Convolutional Networks and Deep Learning
5.5 Residual Networks
5.6 U-Nets
5.7 Generative Adversarial Networks
5.8 Recap

Part II: Image Compression
Chapter 6: JPEG Compression
6.1 The JPEG Algorithm
6.1.1 Compression
6.1.2 Decompression
6.2 The Multilinear JPEG Representation
6.3 Other Image Compression Algorithms
Chapter 7: JPEG Domain Residual Learning
7.1 New Architectures
7.1.1 Frequency-Component Rearrangement
7.1.2 Strided Convolutions
7.2 Exact Operations
7.2.1 JPEG Domain Convolutions
7.2.2 Batch Normalization
7.2.3 Global Average Pooling
7.3 ReLU
7.4 Recap
7.5 Empirical Analysis
7.6 Limitations and Future Directions
Chapter 8: Improving JPEG Compression
8.1 Pixel Domain Techniques
8.2 Dual-Domain Techniques
8.3 Sparse-Coding Methods
8.4 Summary and Open Problems
Chapter 9: Quantization Guided JPEG Artifact Correction
9.1 Overview
9.2 Convolutional Filter Manifolds
9.3 Primitive Layers
9.4 Full Networks
9.5 Loss Functions
9.6 Empirical Evaluation
9.6.1 Comparison with Other Methods
9.6.2 Generalization
9.6.3 Equivalent Quality
9.6.4 Exploring Convolutional Filter Manifolds
9.6.5 Frequency Domain Results
9.6.6 Qualitative Results
9.7 Limitations and Future Directions
Chapter 10: Task-Targeted Artifact Correction
10.1 Standard JPEG Compression Mitigation Techniques
10.2 Artifact Correction for Computer Vision Tasks
10.3 Effect of JPEG Compression on Computer Vision Tasks
10.4 Transferability and Multiple Task Heads
10.5 Understanding Model Errors
10.6 Limitations and Future Directions

Part III: Video Compression
Chapter 11: Modeling Time Redundancy: MPEG
11.1 Motion JPEG
11.2 Motion Vectors and Error Residuals
11.3 Slices and Quantization
11.4 Recap
Chapter 12: Improving Video Compression
12.1 Notable Methods for General Video Restoration
12.2 Single Frame Methods
12.3 Multi-Frame Methods
12.4 Summary and Open Problems
Chapter 13: Metabit: Leveraging Bitstream Metadata
13.1 Capturing GOP Structure
13.2 Motion Vector Alignment
13.3 Network Architecture
13.4 Loss Functions
13.5 Towards a Better Benchmark
13.6 Empirical Evaluation
13.6.1 Restoration Evaluation
13.6.2 Compression Evaluation
13.7 Limitations and Future Work

Part IV: Concluding Remarks

Part V: Appendix
Appendix A: Study on JPEG Compression and Machine Learning
A.1 Plots of Results
A.2 Tables of Results
Appendix B: Additional Results
B.1 Quantization Guided JPEG Artifact Correction
B.2 Task Targeted Artifact Correction
B.3 Metabit
Appendix C: Survey of Fully Deep-Learning Based Compression
C.1 Image Compression
C.2 Video Compression
C.3 Lossless Techniques
Glossary
Figure Credits
Bibliography
Index

List of Tables

7.1 Model Conversion Accuracies
8.1 Summary of JPEG Artifact Correction Methods
9.1 QGAC Quantitative Results
12.1 Summary of Video Compression Reduction Techniques
13.1 Metabit HEVC Results
13.2 Metabit AVC Results
13.3 Metabit GAN Numerical Results
A.1 Results for classification models
A.2 Results for detection models
A.3 Results for segmentation models
A.4 Reference results (results with no compression)

List of Figures

2.1 Grayscale Example Image
2.2 Grayscale Gaussian Smoothing
2.3 Color Example Image
2.4 Color Smoothing
2.5 Color Downsampling
2.6 Block Linear Map Example
3.1 Discrete Wavelet Transform
3.2 Morlet Wavelet
3.3 Wavelet Uncertainty
3.4 Dual Tree Complex Wavelet Transform
3.5 Haar Wavelet
3.6 DWT Using Haar Wavelets
4.1 The General Communication System
4.2 Huffman Tree Example
4.3 Arithmetic Coding Example
5.1 Salmon vs Sea Bass
5.2 Multilayer Perceptron
5.3 HoG Features
5.4 Difference of Gaussians
5.5 Convolutional Neural Network
5.6 U-Net
5.7 Residual Block
5.8 GAN Procedure
6.1 JPEG Information Loss
6.2 Zig-Zag Order
7.1 Frequency Component Rearrangement
7.2 Transform Domain Global Average Pooling
7.3 ReLU Approximation Example
7.4 Toy Network Architecture
7.5 ReLU Approximation Accuracy
7.6 Throughput Comparison
9.1 Overview
9.2 FCR With Grouped Convolutions
9.3 RRDB
9.4 8×8 stride-8 CFM
9.5 Restoration Networks
9.6 Subnetworks
9.7 GAN Discriminator
9.8 Quality Generalization
9.9 Increase in PSNR
9.10 Equivalent Quality Plots
9.11 Equivalent Quality Examples
9.12 Embeddings for Different CFM Layers
9.13 CFM Weight Visualization
9.14 Images Which Maximally Activate CFM Weights
9.15 Frequency Domain Results
9.16 Qualitative Results
10.1 Task-Targeted Artifact Correction
10.2 Performance Loss Due to JPEG Compression
10.3 Performance Loss with Mitigations
10.4 Transfer Results
10.5 Multiple Task Heads
10.6 MaskRCNN TIDE Plots
10.7 Mask R-CNN Qualitative Result
10.8 Model Throughput
11.1 Motion JPEG Comparison
11.2 Motion Vector Grid
11.3 Motion Vector Arrows
11.4 Motion Compensation and Error Residuals
11.5 Rate Control Comparison
11.6 Slicing Example
13.1 Capturing GOP Structure
13.2 Motion Vector Alignment
13.3 Motion Vectors vs Optical Flow
13.4 Metabit System Overview
13.5 LR Block
13.6 Metabit Critic Architecture
13.7 FPS vs Params
13.8 Rate-Distortion Comparison
13.9 Learned Compression Throughput Comparison
13.10 Metabit Restoration Example
13.11 Metabit Comparison
A.1 Overall Classification Results
A.2 Classification Results: MobileNetV2
A.3 Classification Results: VGG-19
A.4 Classification Results: InceptionV3
A.5 Classification Results: ResNeXt 50
A.6 Classification Results: ResNeXt 101
A.7 Classification Results: ResNet 18
A.8 Classification Results: ResNet 50
A.9 Classification Results: ResNet 101
A.10 Classification Results: EfficientNet B3
A.11 Overall Detection and Instance Segmentation Results
A.12 Detection Results: FastRCNN
A.13 Detection Results: FasterRCNN
A.14 Detection Results: RetinaNet
A.15 Instance Segmentation Results: MaskRCNN
A.16 Overall Semantic Segmentation Results
A.17 Semantic Segmentation Results: HRNetV2 + C1
A.18 Semantic Segmentation Results: MobileNetV2 + C1
A.19 Semantic Segmentation Results: ResNet 18 + PPM
A.20 Semantic Segmentation Results: ResNet 50 + UPerNet
A.21 Semantic Segmentation Results: ResNet 50 + PPM
A.22 Semantic Segmentation Results: ResNet 101 + UPerNet
A.23 Semantic Segmentation Results: ResNet 101 + PPM
B.1 Equivalent quality visualizations. For each image we show the input JPEG, the JPEG with equivalent SSIM to our model output, and our model output.
B.2 Frequency domain results 1/4
B.3 Frequency domain results 2/4
B.4 Frequency domain results 3/4
B.5 Frequency domain results 4/4
B.6 Model interpolation results 1/4
B.7 Model interpolation results 2/4
B.8 Model interpolation results 3/4
B.9 Model interpolation results 4/4
B.10 Qualitative results 1/4. Live-1 images.
B.11 Qualitative results 2/4. Live-1 images.
B.12 Qualitative results 3/4. Live-1 images.
B.13 Qualitative results 4/4. ICB images.
B.14 Fine Tuned Model Comparison
B.15 Off-the-Shelf Artifact Correction Comparison
B.16 Task-Targeted Artifact Correction Comparison
B.17 FasterRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
B.18 MaskRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
B.19 MobileNetV2, Ground Truth: "Pembroke, Pembroke Welsh corgi"
B.20 FasterRCNN
B.21 MaskRCNN
B.22 HRNetV2 + C1
B.23 Dark Region
B.24 Crowd
B.25 Texture Restoration
B.26 Compression Artifacts Mistaken for Texture
B.27 Motion Blur
B.28 Artificial

List of Abbreviations

CNN Convolutional Neural Network
CWT Continuous Wavelet Transform
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DST Discrete Sine Transform
DTCWT Dual Tree Complex Wavelet Transform
DWT Discrete Wavelet Transform
EXIF Exchangeable Image File Format
FCR Frequency-Component Rearrangement
GAN Generative Adversarial Network
JFIF JPEG File Interchange Format
MCU Minimum Coded Unit
MLP Multilayer Perceptron
RRDB Residual-in-Residual Dense Block
STFT Short-Time Fourier Transform

Part I: Preliminaries

Chapter 1: Linear Algebra

To begin the dissertation, we briefly review the fundamental ideas of linear algebra. These concepts are extremely important for modeling in the high dimensional spaces used by deep learning, and indeed defining what a high dimensional space actually is and how it behaves.
Generalizations of linear algebra, which we will cover in the next chapter, have a special relationship with the dissertation outside of this general importance: we will use these ideas to represent JPEG compression. Linear algebra also forms the basis of harmonic analysis, which is central to lossy image compression.

Warning: If you are familiar with the algebraic definitions of linear algebra, this chapter may seem somewhat hand-wavy. It is intended as a general introduction and we will generalize it later.

1.1 Scalars, Vectors, and Matrices

All concepts in mathematics relate back to the foundational idea of the number. For our purposes, we will call a single number a scalar. Scalars will be denoted as a lower case letter in regular font: a.

We can "stack" several scalars in rows or columns to create vectors. When we wish to call attention to a vector, we will use lowercase bold font. For example

b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_n \end{bmatrix}    (1.1)

is a vector made by stacking the n scalars in a column.

c = \begin{bmatrix} c_0 & c_1 & \cdots & c_n \end{bmatrix}    (1.2)

is also a vector made by stacking n scalars in a row. We will call n the dimension of the vector. Note that in general b ≠ c. We call b a vector and c a co-vector. The distinction will become important later. For now, we define the transpose operation on a vector which transforms a vector into a co-vector or a co-vector into a vector

c^T = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}    (1.3)

Given a scalar and a vector, we can multiply them to produce a new vector

d = ac    (1.4)

= \begin{bmatrix} ac_0 & ac_1 & \cdots & ac_n \end{bmatrix}    (1.5)

where each component of c was multiplied by a, thus scaling the vector by a, hence the name scalar. We can also add vectors by summing their components to produce another vector. Given a set of m vectors V of dimension n,

e = \sum_{v \in V} v    (1.6)

= \begin{bmatrix} \sum_{v \in V} v_0 & \sum_{v \in V} v_1 & \cdots & \sum_{v \in V} v_n \end{bmatrix}    (1.7)

We can now combine these operations to create one of the most fundamental ideas of linear algebra: the linear combination. A linear combination is the sum of the product of some number of vectors and scalars and therefore produces a new vector. Let S be a set of m scalars

g = \sum_{i=0}^{m} s_i v_i    (1.8)

for s_i ∈ S and v_i ∈ V. Given two vectors, we can multiply them by computing their inner product, which produces a scalar

f = \langle b, c \rangle = \sum_{i=0}^{n} b_i c_i    (1.9)

We define the l_2 norm of a vector as

\|b\|_2 = \sqrt{\langle b, b \rangle}    (1.10)

We call any vector u such that ||u||_2 = 1 a unit vector, noting that we normalize a vector by computing the vector b / ||b||_2. Any two vectors v and w such that

\langle v, w \rangle = 0    (1.11)

are said to be perpendicular or orthogonal to each other. The formula for the l_2 norm

\|b\|_2 = \sqrt{\sum_{i=0}^{n} b_i^2}    (1.12)

implies a general formulation for an l_n norm

\|b\|_n = \left( \sum_{i=0}^{n} b_i^n \right)^{\frac{1}{n}}    (1.13)

Another useful norm is the l_1 norm

\|b\|_1 = \sum_{i=0}^{n} \left| b_i \right|    (1.14)

Taking the limit as n → ∞ gives the l_∞ norm

\|b\|_\infty = \max(b_i)    (1.15)

We can create two dimensional arrays of scalars which we call matrices and which we denote with upper case normal font: A.

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}    (1.16)

A is said to be m × n dimensional. We multiply a matrix and a vector by taking the linear combination of each element of the vector with the corresponding column of the matrix

h = Ab = \sum_{i=1}^{n} b_i A_i    (1.17)

note that b_i is the ith element of b, which is a scalar, and A_i is the ith column of A, which is a vector. Note also that the result h is a vector of dimension m. This
This 6 equation implies that the number of columns in A must match the number of rows in b. We can extend this to matrix-matrix products, given an n?m matrix B ? C = AB? (1.18) ????(A1)T , B1? ?(A1)T , B2? ? ? ? ?(A1)T , B ?? m ?? ??? ? ?(A2)T , B ? ?(A2)T , B ? ? ? ? ?(A2)T1 2 , B ?m?? = ??? . ? ? (1.19) ? .. .. . . ... . . ???? ?(An)T , B1? ?(An)T , B2? ? ? ? ?(An)T , Bm? where Aj denotes the jth row1 from A. In other words, each entry in C is the inner product of the corresponding row of A with the corresponding column of B. This construct implies a particular identity matrix ?? ???1 0 ? ? ? 0???? ??? 0 1 ? ? ? 0?? I = ?? ? (1.20)? .. .. ?. .? . . . . ..???? 0 0 ? ? ? 1 Matrices are important for representing linear maps on vectors. A linear map is any map which preserves vector addition and scalar multiplication. We can es- sentially ?store? the coefficients of linear maps in matrices and use the action of matrix-vector multiplication to apply the map to a vector. 1We will use upper indices much more frequently than powers in this dissertation so it is advised to get familiar with the notation now. It will be left entirely to context to determine which we mean. 7 1.2 Bases and Finite Dimensional Vector Spaces Given a set of vectors B, we define the Span of B as the set of all linear combinations of the vectors in B. Formally, given a set of scalars S ??? ? ?|V | ?? ? span(B) = ? ? b ?i i?bi ? B, ?i ? S? (1.21) i=1 ? Given some arbitrary set of vectors V , we may wish to find B, i.e., a subset of vectors that spans V . If all elements of B are linearly independent, we say that B is a basis of V . The basis allows us to express any element of V in terms of scalars of elements of B and, in effect, defines V . There may be many bases for the same set of vectors, so we may wish to change the basis and we may wish to define a particular basis as canonical. For example, consider the vector space R3. We often choose the following basis ?? ?? ?? ? ? ???1 0? ? ?? ???? ???? ???? ???? 0??? e0 = ??0?? e1 = ??1?? e2 = ??0???? (1.22) 0 0 1 This canonical basis is desirable because the vectors are all orthonormal, i.e., they are all orthogonal to each other and have magnitude of 1 which means there is no ?rotation? or ?scaling? of the coordinates. Moreover, this basis makes it extremely 8 easy to express vectors in a familiar component notation. The vector ? ? ???1 v = ??? ? ? ??2???? (1.23) 3 is only expressed as such because we chose this canonical basis (implicitly) and defined v as v = 1e0 + 2e1 + 3e2 (1.24) If we wish to change the basis, we first write the coordinates of the new basis (B1) vectors in the old basis (B0) and then stack these vectors into a matrix A. We can then multiply any vector v0 written in terms of basis B0 by A to obtain the coordinates in terms of basis B1. v1 = Av0 (1.25) We make the following notes about bases for V 1. V has a basis 2. All bases of V have the same cardinality which is the dimension of V . If we write v ? V in coordinates and count the number of coordinates, that count will be the same as the number of basis vectors Note the implication of the last property, we can count the number of elements in 9 the dimension of V, therefore, V is a finite dimensional vector space. 1.3 Infinite Dimensional Vector Spaces Infinite dimensional vector spaces will play an important role in the later analysis of compression, although the results of this analysis will eventually be discretized for use on a computer. 
In principle, infinite dimensional vector spaces behave in much the same way as finite dimensional ones. While a full treatment of this topic is beyond the scope of the dissertation, we will make some definitions in this chapter which will be expanded upon later. Assume that f and g are members of an infinite dimensional vector space V. We can think about components of f and g as being indexed by any real number instead of a finite number of natural numbers. For example, we might have f(2) = 4 for the second component and f(−12.5) = −156.25 for the negative-twelve-point-five-th component. In other words, f and g are functions, and these functions are vectors in a vector space.

With that established, our next goal should be to produce a basis for these functions. After all, being able to express a function as the coefficients of some basis should have myriad uses, especially if we do not know exactly the form of the function we wish to express. We will develop this basis later in the dissertation, but for now we can define two important concepts: orthogonality of functions and normality of functions.

Recall that two vectors were said to be orthogonal if they point at right angles to each other, i.e., their inner product is zero. To determine orthogonality we need an inner product for functions. We make the following definition

\langle f(x), g(x) \rangle = \int_{-\infty}^{\infty} f(x) g(x)\, dx    (1.26)

which is exactly the same as the inner product formula in finite dimensions with the sum expanded to an integral. As usual, if ⟨f(x), g(x)⟩ = 0 then the functions are orthogonal. Next we need to define normality of a function. Recall that a vector was said to be normal if its length is 1. So we need a way of defining the "length" of a function. We make the following definition

\|f(x)\|_2 = \sqrt{\int_{-\infty}^{\infty} f^2(x)\, dx}    (1.27)

Once again, this is the same as the discrete formula using only an integral, and if ‖f(x)‖ = 1 then the vector is a normal vector. Given the tools to determine if a set of functions is orthonormal, we can now develop what is essentially a canonical basis for functions. This discussion will be continued in Chapter 3 (Harmonic Analysis).

1.4 Abstractions

While the geometric interpretations provide useful intuitions, there is a limit to how far we can take them mathematically. We conclude by briefly introducing the abstract forms of the ideas in this chapter.

A field F is a set on which addition and multiplication are defined. Specifically we define

+ : F × F → F    (1.28)
· : F × F → F    (1.29)

and stipulate that if they meet the following criteria for a, b, c ∈ F:

Associativity: addition and multiplication are associative: a + (b + c) = (a + b) + c and a · (b · c) = (a · b) · c
Commutativity: addition and multiplication are commutative: a + b = b + a and a · b = b · a
Identity: two different elements 0 and 1 exist that satisfy the additive and multiplicative identity respectively: a + 0 = a and a · 1 = a
Inverse: there exists an additive inverse −a and a multiplicative inverse a⁻¹ such that a + (−a) = 0 and a · a⁻¹ = 1
Distributivity: multiplication and addition distribute according to a · (b + c) = (a · b) + (a · c)

then F is a field. We define a vector space V over the field F in a similar way. We have two operations

+ : V × V → V    (1.30)
· : F × V → V    (1.31)

and we call V a vector space, elements of V vectors, and elements of F scalars if, for u, v, w ∈ V and a, b ∈
F,

Associativity: addition is associative: u + (v + w) = (u + v) + w
Commutativity: addition is commutative: u + v = v + u
Identity and Inverse: two elements 0 and −v exist such that v + 0 = v and v + (−v) = 0
Compatibility of Multiplication: scalar and field multiplication are compatible: a · (b · v) = (a · b) · v
Scalar Multiplication Identity: multiplication with the scalar identity: 1 · v = v
Distributivity: scalar multiplication is distributive with respect to both vector and field addition: a · (u + v) = a · u + a · v and (a + b) · v = a · v + b · v

Note that we made no mention of coordinates or numbers; we only defined sets and operations along with their behavior. With these definitions we can form linear combinations for w, v_0 ... v_N ∈ V and a_0 ... a_N ∈ F

w = \sum_{i=0}^{N} a_i \cdot v_i    (1.32)

Chapter 2: Multilinear Algebra

The previous chapter developed vectors and matrices, where vectors are a primary "mathematical object" in a high-dimensional space and a matrix represents a map which can transform that object. In a sense, this discussion feels unfinished. We had scalars which were zero dimensional, vectors which were one dimensional, and matrices which were two dimensional. Why stop there?

In this chapter we develop the extremely high level ideas of multilinear algebra, which generalizes linear algebra to higher dimensional objects. This is a large and complex topic of which we only need a small piece for understanding this dissertation; in fact, this entire chapter may be closer to the first lecture in a semester-long graduate course. This chapter will immediately obsolete the matrix and vector notation we introduced in the previous chapter for reasons which will be explained in the first section.

The primary goal of multilinear algebra is to study multilinear maps. Recall that linear maps are maps which preserve vector addition and scalar multiplication. More formally, we call A : V → V a linear map on the vector space V over the field F if, for v, u ∈ V and a ∈ F,

• A(v + u) = A(v) + A(u)
• A(a · v) = a · A(v)

A bilinear map is an extension of this concept to two arguments where the map is linear in each argument. We call B : V × V → V a bilinear map with vector space V and field F if, for v_0, v_1, u_0, u_1 ∈ V and a ∈ F,

• B(v_0 + v_1, u_0) = B(v_0, u_0) + B(v_1, u_0) and B(v_0, u_0 + u_1) = B(v_0, u_0) + B(v_0, u_1)
• B(a · v_0, u_0) = B(v_0, a · u_0) = a · B(v_0, u_0)

Continuing this until its natural end, we call a multilinear map a function of multiple arguments which is linear in each one¹. M : V × V × ⋯ × V → V is a multilinear map with v_{0⋯N}, u ∈ V and a ∈ F if

• M(v_0 + u, v_1, ⋯, v_N) = M(v_0, v_1, ⋯, v_N) + M(u, v_1, ⋯, v_N) and M(v_0, v_1 + u, ⋯, v_N) = M(v_0, v_1, ⋯, v_N) + M(v_0, u, ⋯, v_N), etc.
• M(a · v_0, v_1, ⋯, v_N) = M(v_0, a · v_1, ⋯, v_N) = ⋯ = M(v_0, v_1, ⋯, a · v_N) = a · M(v_0, v_1, ⋯, v_N)

We will represent multilinear maps using higher order objects called tensors. Perhaps surprisingly, the practical use of these concepts in the dissertation will still only be linear or bilinear maps; however, we leverage multilinear algebra by working on tensor inputs and outputs, which serve as a natural representation for images versus the vectors that are traditionally used in these maps.

¹One important thing to note at this point is that we can only tweak one argument at a time; we cannot, for example, compute B(v_0 + v_1, u_0 + u_1) and expect a linear result.

2.1 Tensors

Traditionally in computer science we think of tensors as multidimensional arrays of numbers.
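To make this concrete, the snippet below builds arrays of increasing order with PyTorch; the shapes are arbitrary examples. Note that the shape of an array records how many indices it has but nothing about the vector versus co-vector role attached to each index, which is the extra bookkeeping this chapter introduces.

import torch

scalar = torch.tensor(3.0)            # a single number
vector = torch.randn(8)               # a 1D array (a vector)
matrix = torch.randn(4, 8)            # a 2D array (a matrix)
volume = torch.randn(3, 16, 16)       # a 3D array, e.g. a color image
batch  = torch.randn(32, 3, 16, 16)   # a 4D array, e.g. a batch of images

for t in (scalar, vector, matrix, volume, batch):
    print(t.ndim, tuple(t.shape))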
Despite the protests of many physicists and mathematicians, this is a perfectly reasonable definition of a tensor. For example, we might have a 3- or 4- or 5D array of numbers and call this a tensor. In the mathematical sense, a tensor is a representation of a multilinear map. We will denote tensors and tensor spaces with uppercase math font: T.

Recall the concepts of vectors and co-vectors. We will refer to vector spaces as V and co-vector spaces as V*. It is important to keep in mind that although these spaces are related they are not the same. These vectors and co-vectors will, in a sense, be the primitives that we use to construct tensors. We will index vectors using upper indices and co-vectors using lower indices.

All tensors have a type, which is the primary way we will refer to them. Some texts refer to a tensor rank; we do not use this convention because it is ambiguous. Rank has other meanings in linear algebra, and tensor rank does not explain the composition of the tensor in terms of vectors and co-vectors. If we absolutely have to refer to the sum of the number of vector and co-vector spaces we will call this the order of the tensor, though this situation will be extremely rare. We will say that vectors are type-(1, 0) tensors and co-vectors are type-(0, 1) tensors. Matrices can then be type-(2, 0) tensors, type-(1, 1) tensors, or type-(0, 2) tensors. The distinction between types is important. Since matrices and vectors now have concrete definitions as tensors, this obsoletes our earlier notation which drew a distinction between them. From this point on, all non-scalars will be written in tensor notation.

2.2 Tensor Products and Einstein Notation

We construct arbitrary tensors using products of vectors and co-vectors. To do this we define the tensor product of two tensors. We will build up to this by revisiting some concepts from linear algebra. Given two vectors v, u in some vector space V on a field F, we defined the inner product as

\sum_{i=0}^{N} v^i u_i = a    (2.1)

where a ∈ F. Given a matrix (a type-(1, 1) tensor) M we can compute the matrix-vector product as

\sum_{i=0}^{N} M_i x^i = w    (2.2)

for w ∈ V. Similarly, we compute the matrix-matrix product given another matrix N as

\sum_{i=0}^{N} M_i N^i = O    (2.3)

These expressions can be simplified using Einstein notation [18]. In Einstein notation, repeated indices that appear as upper and lower indices are assumed to be summed out, allowing us to remove the summations from the previous equations. For example, the matrix-matrix product is now simply

M^j_i N^i_k = O^j_k    (2.4)

where the non-summed indices are added in for clarity. This is extremely important when working with general tensors because the expressions are quite verbose with summation notation. We will make heavy use of Einstein notation in this dissertation so it is important to understand it now. Given two arbitrary tensors we can now define the generic tensor product

T \otimes U = T^{u_0, u_1, \cdots, u_N}_{l_0, l_1, \cdots, l_N} U^{u'_0, u'_1, \cdots, u'_N}_{l'_0, l'_1, \cdots, l'_N} = V^{u_0, \cdots, u_N, u'_0, \cdots, u'_N}_{l_0, \cdots, l_N, l'_0, \cdots, l'_N}    (2.5)

Of course we are free to form other useful products for tensors. For example, given a type-(2, 3) tensor P and a type-(4, 2) tensor Q we could compute the type-(4, 3) tensor R as

P^{kml}_{ij} Q^{ij}_{abcd} = R^{kml}_{abcd}    (2.6)

where we have summed out the i, j indices. To construct a tensor from vectors and co-vectors we can use this tensor product. Consider the vectors u, v ∈ V and the co-vectors p, q, r ∈ V*.
We can construct a type-(2, 3) tensor from these by computing

u^i v^j p_k q_l r_m = T^{ij}_{klm}    (2.7)

In many situations it will be useful for us to raise or lower indices (sometimes called index juggling). In other words, given a tensor T_{ij} we may want to construct T^i_j or T^{ij}. These tensors are related to T_{ij} but they are not the same. We can accomplish this by multiplying T by the covariant or contravariant metric tensor, which relates the vector and co-vector spaces. These tensors are defined such that

g^{ik} g_{kj} = \delta^i_j    (2.8)

where δ is the Kronecker delta

\delta^i_j = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases}    (2.9)

a generalization of the identity matrix from linear algebra, g^{ij} is the contravariant metric (for converting co-vectors to vectors) and g_{ij} is the covariant metric (for converting vectors to co-vectors). For various reasons we will consider a general derivation of the metric tensors to be beyond the scope of this dissertation, and in fact we will always be using tensors defined with respect to the canonical basis, which has a metric of δ. This means we can freely raise and lower indices without considering the metric.

2.3 Tensor Spaces

If we needed to start with vectors every time we wanted to build a tensor it would quickly become unsustainable. Instead, we need a way to refer to tensor spaces, or sets of tensors. This is sometimes referred to as the intrinsic definition of a tensor. We again use the tensor product, but this time we use vector and co-vector spaces. Recalling the type-(2, 3) tensor T which we constructed from vectors and co-vectors, we can define T directly as

T \in V \otimes V \otimes V^* \otimes V^* \otimes V^*    (2.10)

in other words, V ⊗ V ⊗ V* ⊗ V* ⊗ V* defines a space of tensors. This space contains all tensors which can be constructed by the tensor product of V twice and V* three times. In other words, all tensors which can be built from an equation like Equation 2.7 but with any u, v ∈ V and p, q, r ∈ V*. For a generic tensor T, we say that it is of type-(p, q) for

T \in \underbrace{V \otimes \cdots \otimes V}_{p \text{ times}} \otimes \underbrace{V^* \otimes \cdots \otimes V^*}_{q \text{ times}}    (2.11)

This will be the primary convention that we use to define tensors in the rest of this dissertation. Note that this mimics some of the definitions from Section 1.4 (Abstractions) in that we no longer have need of coordinates; we only deal with arbitrary vector spaces, co-vector spaces, and the tensor products of their members, which is why this is called the intrinsic definition. We close by noting that although we have only used V and V*, in general, the vector spaces defining a tensor can be different provided that the spaces are defined over the same field.

Figure 2.1: Grayscale Example Image.

2.4 Linear Pixel Manipulations

With the boring theory out of the way we can look at an interesting practical application of tensors: linear pixel manipulations. By representing an image as a tensor we can compute many complex transformations of the image using other tensors. Some of these are not traditionally thought of as being "linear" when we restrict our thinking to two-dimensional matrices as linear maps that transform images through matrix multiplication. Instead of thinking of images as "collections of vectors", we treat the image as one object, a higher order tensor, and then we consequently define the linear map on this object in even higher dimensions.

More formally, we will deal with planar images. The image may have any number of channels but it always has two spatial dimensions. So a grayscale image would be a type-(0, 2) tensor.
A traditional color image would be a type-(0, 3) tensor. In most cases, even for color images, it will suffice to define linear maps as type-(2, 2) tensors which transform the spatial dimensions while preserving the channel dimension.

We begin with a simple example. Consider the example image in Figure 2.1. We can represent this grayscale image as a type-(0, 2) tensor I ∈ H* ⊗ W*. One simple linear manipulation we can perform on this image is Gaussian smoothing in a 3 × 3 window. We can represent this linear map as a type-(2, 2) tensor

G : H^* \otimes W^* \to H^* \otimes W^*    (2.12)

G \in H \otimes W \otimes H^* \otimes W^*    (2.13)

G^{ij}_{uv} = \begin{cases} 0.5 & i = u \wedge j = v \\ 0.125 & i = u \wedge (j = v - 1 \vee j = v + 1) \\ 0.125 & (i = u - 1 \vee i = u + 1) \wedge j = v \\ 0 & \text{otherwise} \end{cases}    (2.14)

From the first equation, we can see that G is a linear map on type-(0, 2) tensors that transforms them into type-(0, 2) tensors. From the second equation we see that G is a type-(2, 2) tensor (this is a consequence of the first equation). The third equation defines the form of G for arbitrary indices i, j, u, v. In this case, i, j index the input pixel and u, v index the output pixel; the value stored at the index is the coefficient of the pixel. That is 0.5 when the indices are equal and 0.125 for any neighboring pixels; all other pixels have a zero coefficient. We apply this linear map by computing

I'_{uv} = G^{ij}_{uv} I_{ij}    (2.15)

The result of this computation is shown in Figure 2.2.

Figure 2.2: Grayscale Gaussian Smoothing.

Next we can consider a color image. The color version of the example image is shown in Figure 2.3. Converting this color image to grayscale is a linear manipulation. We represent the color image as I ∈ P* ⊗ H* ⊗ W*. We then define the following linear map

Y : P^* \otimes H^* \otimes W^* \to H^* \otimes W^*    (2.16)

Y \in P \otimes H \otimes W \otimes H^* \otimes W^*    (2.17)

Y^{pij}_{uv} = \begin{cases} 0.299 & p = 0 \\ 0.587 & p = 1 \\ 0.114 & p = 2 \end{cases}    (2.18)

which comes directly from the grayscale conversion equation

Y = 0.299R + 0.587G + 0.114B    (2.19)

We apply this map as

I'_{uv} = Y^{pij}_{uv} I_{pij}    (2.20)

If we apply this map to the example image we get the same image as Figure 2.1.

Figure 2.3: Color Example Image.

Interestingly, we can apply G to this color image as well and it will perform correct smoothing on the color image (Figure 2.4). In this case we would be computing

I'_{puv} = G^{ij}_{uv} I_{pij}    (2.21)

and since G ∈ H ⊗ W ⊗ H* ⊗ W*, the channel dimension of I, P*, is preserved.

Figure 2.4: Color Smoothing.

Let's try something more interesting: resampling. We can define nearest neighbor up- and downsampling as linear maps. This works for both color and grayscale images; the computation is the same and the tensor will be type-(2, 2). For downsampling by a factor of 2 we define the following linear map

H : H^* \otimes W^* \to H'^* \otimes W'^*    (2.22)

H \in H \otimes W \otimes H'^* \otimes W'^*    (2.23)

H^{ij}_{uv} = \begin{cases} 1 & i = 2u \wedge j = 2v \\ 0 & \text{otherwise} \end{cases}    (2.24)

where H' and W' are vector spaces with half the dimension of H and W. We can define upsampling in a similar way. For upsampling by a factor of 2 we define the following linear map

D : H'^* \otimes W'^* \to H^* \otimes W^*    (2.25)

D \in H' \otimes W' \otimes H^* \otimes W^*    (2.26)

D^{ij}_{uv} = \begin{cases} 1 & i = \lfloor u/2 \rfloor \wedge j = \lfloor v/2 \rfloor \\ 0 & \text{otherwise} \end{cases}    (2.27)

We apply these maps by computing

I'_{uv} = H^{ij}_{uv} I_{ij}    (2.28)

I_{uv} = D^{ij}_{uv} I'_{ij}    (2.29)

for grayscale images and

I'_{puv} = H^{ij}_{uv} I_{pij}    (2.30)

I_{puv} = D^{ij}_{uv} I'_{pij}    (2.31)

for color images. The result for the color image is shown in Figure 2.5.

Figure 2.5: Color Downsampling.

Taking this further, we can define any convolution or cross-correlation using tensors.
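Before writing down the general convolution map, it may help to see the smoothing and grayscale maps above in code. The sketch below materializes G densely, following Eq. (2.14), and applies it with torch.einsum; the 16 × 16 size and the random image stand in for the example image, and the grayscale map is applied by contracting the channel index directly rather than building the full tensor of Eq. (2.18).

import torch

h = w = 16
image = torch.rand(3, h, w)        # placeholder color image I_{pij}

# Dense type-(2, 2) smoothing tensor G^{ij}_{uv} from Eq. (2.14).
G = torch.zeros(h, w, h, w)
for u in range(h):
    for v in range(w):
        G[u, v, u, v] = 0.5
        for du, dv in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            i, j = u + du, v + dv
            if 0 <= i < h and 0 <= j < w:
                G[i, j, u, v] = 0.125

# I'_{puv} = G^{ij}_{uv} I_{pij}, Eq. (2.21), written as an einsum contraction.
smoothed = torch.einsum("ijuv,pij->puv", G, image)

# Grayscale conversion, Eq. (2.19): contract the channel index with the weights.
weights = torch.tensor([0.299, 0.587, 0.114])
gray = torch.einsum("p,pij->ij", weights, image)

print(smoothed.shape, gray.shape)

For realistic image sizes one would never build G densely, since it has (hw)² entries; the point is only that the einsum strings line up index-for-index with the expressions above.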
This is reasonable since we know that convolutions are linear operations although we do not always see them written out as linear maps. We consider a general convolution kernel K with any shape. We will denote the shape of the 27 kernel as the tuple S = (s0, s1). Then we define the following linear map C : H? ?W ? ? H? ?W ? (2.32) ? C ? H ?W ?H? ?W ? (2.33)???? Cijuv = ?Ku?i+s0,v?j+s1 u? s0 ? i ? u+ s0 ? v ? s1 ? j ? v + s1??? (2.34)0 otherwise note that this does not consider a mapping between channels like we would use in a convolutional network (this is simple enough to add in though). We apply this to grayscale or color images as I ? ijuv = CuvIij (2.35) I ? ijpuv = CuvIpij (2.36) As a taste of what?s to come, let?s try something more surprising. It may sound surprising but breaking an image into evenly sized blocks is a linear operation, and we can derive a tensor which represents this map. We will first define two new co-vector spaces, the block dimensions M? and N?2. We will also define the spaces X? and Y ? with dimension equal to the dimension of H? and W ? divided by the block size (i.e., the number of blocks that can fit in the image). Then we can define 2For example if we wanted even 8? 8 blocks we might write these as R8? although I do not like this notation 28 a type-(2, 4) tensor, the linear map B : H? ?W ? ? X? ? Y ? ?M? ?N? (2.37) ? B ? H ?W ?X? ? Y ? ?M? ?N? (2.38)???? Bijxymn = ?1 pixel h,w belongs in block x, y at offset m,n??? (2.39)0 otherwise which may seem like kind of a let down but this is the canonical form we will use later in the dissertation. A more satisfying and programmer-oriented definition might be ??????1 x ? dim(M) +m = i ? y ? dim(N) + n = jBijxymn = ??? (2.40)0 otherwise We apply the map as I ? ijpxymn = BxymnIpij (2.41) Since this one might be a little confusing, consider a concrete example with the example image in Figure 2.1. This is a 16 ? 16 image and we want to break it into 8 ? 8 blocks, so there will be four total blocks in a 2 ? 2 grid (Figure 2.6). In this case, dim(M) = dim(N) = 8 and dim(X) = dim(Y ) = 2. So after applying B to the 16? 16 image we would get a tensor of shape 2? 2? 8? 8 giving the spatial arrangement of the 8? 8 blocks. 29 Figure 2.6: Block Linear Map Example. The blocks are arranged spatially but note that in tensor form there are separate indices for the block position and the 2D offset into each block. While this was a fun exercise the actual practical application of this idea is fairly limited since the tensors must be on the order of the image size. A critical component of the dissertation is that we can actually represent all of JPEG as a linear map. This is extremely powerful because linear maps are well studied phenomena, so expressing something as complex as JPEG as a single linear map gives us myriad tools for further analysis and manipulation. 30 Chapter 3: Harmonic Analysis Harmonic analysis is an invaluable tool for mathematics and engineering that enables some of the most important technologies in existence today. In Section 1.3 (Infinite Dimensional Vector Spaces) we touched briefly on the concept of infinite dimensional vector spaces and we noted that the vector space of functions of real variables is one such space. In this chapter we expand upon this idea and introduce the Fourier transform and harmonic analysis. The ideas we present in this chapter will be fundamental guiding principles behind image and video compression. 
Fourier was interested in solving the heat equation which describes the tem- perature of an ideal length of wire over space and time. The equation defines a function u(t, x) as a partial differential equation with conditions: ? ?2 u(t, x) = u(t, x) (3.1) ?t ?x2 u(0, x) = f(x) (3.2) u(t, 0) = 0 (3.3) u(t, 1) = 0 (3.4) for t ? 0 and x ? [0, 1]. 31 Of critical importance to us is the second equation which relates the form of u(t, x) at time t = 0 to some arbitrary function of space. Fourier showed that, since the other conditions of the heat equation yield a harmonic function, one composed of simple waves, f(x) must be able to be decomposed as such 1. While we will not go into the full derivation that Fourier used, or even touch on the modern understanding of the transform, we will show how this implies an orthonormal basis for functions which allows us to express them as a sum of coefficients of simple waves. Note There are many different ways to think about the fourier transform. Fourier was thinking in terms of the heat equations, many people like to envision a ?ma- chine? that isolates frequencies. I prefer the model which is motivated by linear algebra and that is what I discuss in this chapter although all views on the subject are equally correct and interesting. 3.1 The Fourier Transform Recall our definitions for the l2 norm and inner product of functions ?? ? ?f(x)? 22 =? f (x) dx (3.5)??? ?f(x), g(x)? = f(x)g(x) dx (3.6) ?? 1It is interesting to note that although this result is one of the most influential results in all of engineering, it was given negative reviews at the time Fourier published it. 32 given these tools we can try to find something resembling a canonical basis for func- tions. We would like a canonical basis to be a set of functions that is orthonormal, i.e., a set of functions which are all of unit length and which are all orthogonal to each other. Consider the functions sin(x) and cos(x). We can show easily that these func- tions are orthogonal to each other by solving ? ? ?sin(x), cos(x)? = sin(x) cos(x) dx (3.7) ?? We start by restricting the domain to [??, ?] since the sine and cosine functions are periodic. ? ? = sin(x) cos(x) dx (3.8) ?? Then we use substitution to solve the integral. Let u = cos(x) and du = ?sin(x) dx ? ? ? sin(x) cos(x) dx = u? ? du (3.9)?? = ? u du (3.10) 2 = ?u + C (3.11) 2 33 substituting and evaluating the result gives u2? ?cos 2(x) + C = +?? C (3.12)2 2 ? ?cos 2(x) + C?? (3.13)2 ?? cos2? (??) cos 2(?) = + = 0 (3.14) 2 2 so sine and cosine are indeed orthogonal. To check if they are normal we compute ? ? ? cos 2(x) dx (3.15) ?? ? sin2(x) dx (3.16) ?? We can solve the first integral with the trigonometric identity 2 1 + cos(2x)cos (x) = (3.17) 2 substituting gives ? ? 1 + cos(2x) ? dx (3.18)?? 2 1 ? (? = ? 1 + cos(2x) d)x (3.19)2 ?? 1 ? ? = dx+ cos(2x) dx (3.20) 2 ?? ?? 34 ( ) 1 sin(2x) ?????= x+ (3.21)2 2 x sin 2x ???????= + (3.22)2 4 ?? ? sin 2? ? ??= + ? sin?2? (3.23) 2 4 2 4 sin 2? ? sin?2?= ? + (3.24) 4 4 = ? (3.25) We get the same result for sine, so the functions are not normal but they can be easily made normal by dividing by ?. Therefore, sine and cosine seem like ideal candidates provided we can produce an infinite set from these two. In order to have a basis for the infinite dimensional space of functions we need an infinitely large set of basis vectors. Without further elaboration, the Fourier transform defines this set as {sin(?2?x?), cos(?2?x?)|? ? R} (3.26) or simply {e?2?ix? |? ? 
R} (3.27) Note that this is an uncountable infinite set of vectors, which is what we needed, 35 and we call ? the frequency. The actual integral transform is then ? ? F (?) = f(x)e?2?ix? dx (3.28) ?? Note that, as we described for the norm and inner product of functions, this is simply generalizing the expression for a linear combination of a finite dimensional vector with its basis vectors. As useful as this result is, it is not readily applicable to computation as is the case with many concepts dealing with infinity. We can, however, define the Discrete Fourier Transform (DFT) as the following type-(1, 1) tensor F ? CN ? CN? (3.29) ?1 ?2?imnFmn = e N (3.30) N F is a linear map F : CN ? CN acting on complex vectors of dimension N . Note that F is symmetric, i.e., F i = F jj i . For practical applications, this matrix multiply would be prohibitively expensive so we use the fast Fourier transform to recursively memoize the transform result reducing the number of computations to O(N log(N)). We do not describe this algorithm in detail here. There are some other transforms which are related to the DFT and are useful. Specifically a major downside to the DFT is the dependence on complex numbers. For many discrete applications, real numbers would work fine. This motivates the Discrete Sine Transform (DST) [19], [20] and the Discrete Cosine Transform (DCT) 36 [5]. These transforms can be thought of as taking only the imaginary (sine) or real (cosine) part of the DFT. We can get away with this on discrete samples by assuming that the signal, outside of the region we sampled, is an odd or even function. We are free to do this since we do not care at all about what the function actually looks like outside where we sampled so it does not need to be accurate. For our purposes, the DCT will play an outsize role since it is central to our later discussion of JPEG. The DST will come up briefly in video coding, however. The DCT can be defined differently depending on how boundary conditions are handled. We will not detail all of these, but the two important ones for us are the DCT-II, which we will call ?the DCT?, and is defined in two dimensions as ?N ?N [ ] [ ] i ?1 (2x+ 1)i? (2y + 1)j?Dj = C(i)C(j) cos cos (3.31) 2N 2N 2Nx=1 y=1 ??????1 u = 0 2 C(u) = ???? (3.32)1 u =? 0 and the DCT-III, which we will call ?the inverse DCT?, and is defined in two dimensions as ?N ?N [ ] [ ] (D?1)x ?1 (2x+ 1)i? (2y + 1)j?y = C(i)C(j) cos cos (3.33) 2N 2N 2N i=1 j=1 In both cases, C(u) is a scale factor which makes the transform orthonormal. As in the DFT, these are both linear maps, this time with D : RN ? RN and D?1 : 37 RN ? RN and are type-(1, 1) tensors. We note here an important theorem which will be useful for us later in the dissertation Theorem 1 (The DCT Least Squares Approximation Theorem). Given a set of N samples of a signal X, let Y be the DCT coefficients of X. Then for 1 ? m ? N the approximation of X given by ? ?m ( )1 2 k(2t+ 1)? pm(t) = ? y0 + yk cos (3.34) N N 2N k=1 minimizes the least-squared error ?N e = (p (i)? x )2m m i (3.35) i=1 Proof. First consider that since Equation 3.34 represents the Discrete Cosine Trans- form, which is a Linear map, we can write rewrite it as DTmy = x (3.36) where Dm is formed from the first m rows of the DCT matrix, y is a row vector of the DCT coefficients, and x is a row vector of the original samples. 
To solve for the least squares solution, we use the the normal equations, that is we solve D DTm my = Dmx (3.37) 38 and since the DCT is an orthonormal transformation, the rows of Dm are orthonor- mal, so DmD T m = I. Therefore y = Dmx (3.38) Since there is no contradiction, the least squares solution must use the first m DCT coefficients. A related transform to the ?trigonometric? transforms is the Hadamard trans- form or Walsh-Hadamard transform. The Hadamard transform defines the trans- formation matrix recursively as ? H0 =?1 (3.39)???Hm?1 Hm?1 ?H ?m = ? (3.40) Hm?1 ?Hm?1 The obvious advantage of this transform is that it contains only ?1 and 1 entries, so it can be computed quite efficiently without even multiplication operations (only sign changes are needed). 3.2 The Gabor Transform While the Fourier transform is useful for telling us what frequencies make up a given signal, it cannot tell us when those frequencies occur. It considers all the samples we have and tells us which frequencies explain all the samples. In some 39 cases, it would be useful to know both which frequencies occur and where they occur. For example, if we are examining seismic data, it may be important to know when high frequency vibrations occurred to predict the time of a future earthquake. With a Fourier transform, we would only know that there were high frequency vibrations. We can accomplish this in a naive way with a Short-Time Fourier Transform (STFT). The high level idea is extremely simple. The input signal is broken up into smaller blocks of time and the Fourier transform is computed on each block separately. Then, for each block of time we can see which frequencies are available, and we can adjust the block size to increase the time resolution. The Gabor transform is an interesting twist on this idea. Instead of a hard window, we use a soft window by convolving the Fourier transform with a Gaussian kernel. In a continuous representation this is ? ? 2 G(?, ?) = f(t)e??(t??) e?2?it? dt (3.41) ?? yielding amplitude results with time offsets ? as well as frequencies ? 2. While this yields a smooth windowed response in time, it still suffers from what we call the uncertainty principle which all STFTs are subject to. That is, the larger the time window, the worse the localization is, and the smaller the time window, the more constrained we are in the frequencies we can represent. Put another way, time-resolution and frequency-resolution are inverses: can only have one and not both. 2in contrast to the Fourier result which is amplitude vs frequency with no time component 40 Figure 3.1: Discrete Wavelet Transform. The DWT repeats the sampling pro- cess recursively on the low frequency band. To see this result, consider the DFT matrix given in Equation 3.30. This matrix has a finite number of frequencies that it can represent because of the discrete representation. The high frequency represents each sample in a single period. If we restrict the size of the DFT to windows, as in the Gabor transform, we reduce the size of this matrix and therefore we reduce the number of frequencies we can represent. Conversely, if we allow the size of the window to increase without bound, so as to get the best frequency resolution, we will eventually end up with a window size that is the length of the original signal and therefore is equivalent to the standard Fourier transform that has no temporal component at all. 
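The window-size trade-off is easy to see numerically. The sketch below, assuming NumPy (the signal and window lengths are arbitrary choices for illustration), computes a plain short-time Fourier transform by splitting a signal into fixed blocks and applying the DFT to each block; shrinking the block improves time localization while coarsening the frequency grid, and growing it does the reverse.

    import numpy as np

    fs = 1024                                  # samples per second
    t = np.arange(fs) / fs
    # A 50 Hz tone in the first half, a 200 Hz tone in the second half.
    signal = np.where(t < 0.5, np.sin(2 * np.pi * 50 * t),
                               np.sin(2 * np.pi * 200 * t))

    def stft(x, block):
        # One DFT per non-overlapping block of length `block`.
        frames = x[: len(x) // block * block].reshape(-1, block)
        return np.abs(np.fft.rfft(frames, axis=1))

    wide = stft(signal, 512)     # 2 time bins, 257 frequency bins
    narrow = stft(signal, 64)    # 16 time bins, 33 frequency bins
    print(wide.shape, narrow.shape)  # (2, 257) (16, 33)

With the wide window we can pinpoint the two frequencies precisely but can only say that one occurred in each half of the signal; with the narrow window we know roughly when the switch happened, but each tone is smeared across fewer, coarser frequency bins.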
As we will see in the next section, this uncertainty principle extends to more sophisticated methods and is a fundamental limitation of harmonic analysis. 3.3 Wavelet Transforms Wavelet transforms extend the concept of the STFT to what, at the time of writing, can be considered its natural end. Instead of using sine and cosine bases, 41 Figure 3.2: Morlet Wavelet. The Morlet wavelet illustrates the high amplitude in the center of the wave with decreasing amplitude moving to the sides. Image credit: Wikipedia. the wavelet transform defines other functions which have ?finite support?. In other words, they have a high amplitude at time t = 0 with the amplitude gradually decreasing as t moves away from 0 (this is shown in Figure 3.2 with the Morlet wavelet). As in the STFT, this measures a local response to the wavelet. Then, as in the Gabor transform, we can slide the wavelet around by shifting it along the input signal to compute local responses at different times. The key improvement of wavelet transforms is that they include a term which controls the frequency of the wave. This allows for a full bank of frequencies to be computed at each time representing the response of the signal to wavelets of increasing frequency. Note that because of the uncertainty principal, this generates a tree-like structure. For a given time t, there may be multiple high frequency wavelet responses for a single low frequency wavelet (Figure 3.3). As in the last section, the more precisely we wish to describe the constituent frequencies in a signal the less precisely we can localize them in time. Unlike the last section, however, we can still localize the high frequencies well even if we cannot localize the low frequencies, with a STFT, our localization capability is defined entirely by the block size (or Gaussian standard deviation for the Gabor transform). Since we examine the same signal at 42 Figure 3.3: Wavelet Uncertainty. The low frequency wavelet has poor time resolution, in other words, we cannot tell as exactly the time where that frequency occured as we can with the high frequency wavelets. Image credit: wikipedia. multiple scales, or resolutions, we call this multiresolution analysis . Formally, we define a mother wavelet ?(t) which we can then shift and scale as desired. This yields a basis for the space of functions, just as with the fourier transform, given by the following set { ???? ( )}1 t? sW = ??,s(t) ? ? R, s ? R, ??,s(t) = ? ? (3.42)? ? where ? determines the frequency (or scale) of the wavelet and s determines the shift. We then compute the integral transform ? ? T (?, s) = f(t)??,s(t) dt (3.43) ?? for a function of time (a signal) f(t). Just as with the Fourier transform this is simply a linear combination of the signal with each of the basis entries, but we have generalized from the Fourier basis (e?2?it?) to the more general ?(f, s). In the rest of this section, we will discuss how to apply the wavelet transform to discrete signals and how certain important wavelets are defined. 43 3.3.1 Continuous and Discrete Wavelet Transforms As with the Fourier transform, in order to use these tools on real signals, we must discretize them for execution on a computer. There are several ways we can do this, the first one we will discuss is the Continuous Wavelet Transform (CWT) which, despite the name, is not exactly continuous. To define this we simply assume that the signal f(t) is finite and discretely sampled, and we rewrite the integral of Equation 3.43 as a sum ? 
T (?, s) = fk??,s(k) (3.44) k then we stipulate that the wavelet function have finite support, in other words, we assume that it is zero outside of a certain range so we can represent it with a finite number of samples. We can then define the wavelet transform using convolution. We define the kernel ( ) 1 ?T ? t ???,t = ? (3.45) ? ? for a wavelet with support T and we compute Tmst = ft ? ???,m+?T (3.46) The Discrete Wavelet Transform (DWT) takes this idea further. The idea is that instead of dealing with the wavelet basis change equations directly, we can 44 simply express the transform as a series of high pass/low pass filters which coarsely discretize the scale. We first construct convolution kernels for a high pass and low pass filter g and h and compute the convolutions ylow = f ? g (3.47) yhigh = f ? h (3.48) By definition, these filters pass half the frequencies they are given as input. There- fore, by the Nyquist Sampling Theorem, we can also discard half the samples of each result without losing information. We represent this with a downsampling by two operation (?) ylow = (f ? g) ? 2 (3.49) yhigh = (f ? h) ? 2 (3.50) This process is repeated recursively on ylow while yhigh is retained as an output. This yields a tree structure (Figure 3.1). We briefly mention a newer technique here, the Dual Tree Complex Wavelet Transform (DTCWT) [21]. This is a complex wavelet transform which is inspired by real cosine and imaginary sine components of the Fourier transform. The main advantage of this transform is shift invariance, i.e., a shift in the input signal yields the same transform coefficients. While the theory of the DTCWT is quite involved, the algorithm is simple assuming suitable wavelets exist. As in the DWT, high and 45 low pass filters are applied with the results decimated, only this time there are two wavelets producing two trees (Figure 3.4). The results of one tree are treated as the real part of a complex output and the results of the other tree are used as the imaginary part. All of these methods require suitable definitions of the ?(t) function. While the natural instinct is to choose orthogonal wavelets, biorthogonal wavelets, which relax the orthogonal constraint as long as the transform is still invertible, have also been shown to work well and have more flexibility in their design. Note that the definition of a basis does not require orthogonality. Common choices for ?(t) include the Haar wavelets (discussed next), the Morlet wavelet which is related to the Gabor transform, and the Daubechies wavelets among others. While most tasks will work fine with the simplistic Haar wavelets, knowing the properties of each wavelet to pick the ideal one for a given task can make a difference. 3.3.2 Haar Wavelets The Haar wavelet is one of the most simple and popular choices for ?(t). It is defined as ??????????1 0 ? t ? 12 ?(t) = ?????1 1?? ? t ? 1 (3.51) 2 ???0 otherwise 46 Figure 3.4: Dual Tree Complex Wavelet Transform. The DTCWT is computed in the same way as the DWT but with two trees. Figure 3.5: Haar Wavelet. Frequency increases vertically, time increase to the right. The Haar wavelet transform is simple to implement and computationally efficient leading to its widespread use. The wavelets have compact support and are orthogo- nal making the Haar transform effective for conducting localized frequency analysis, in fact they were the first attempt at a basis for multiresolution analysis. 
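As a concrete illustration of the filter-bank view, here is a minimal single-level Haar DWT in NumPy (the normalization by the square root of two and the function name are my choices; library implementations such as PyWavelets handle boundary conditions and multiple levels).

    import numpy as np

    def haar_dwt_1d(f):
        # Haar low-pass (averaging) and high-pass (differencing) filters,
        # followed by the downsample-by-two step of Equations 3.49 and 3.50.
        f = np.asarray(f, dtype=float)
        low = (f[0::2] + f[1::2]) / np.sqrt(2)
        high = (f[0::2] - f[1::2]) / np.sqrt(2)
        return low, high

    x = np.array([4.0, 4.0, 8.0, 8.0, 1.0, 5.0, 3.0, 3.0])
    ylow, yhigh = haar_dwt_1d(x)
    # ylow holds the local averages (to be decomposed recursively),
    # yhigh holds the local differences that are kept as detail output.
    print(ylow, yhigh)

Repeating the same step on ylow yields the recursive tree of Figure 3.1.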
The 1D Haar wavelet is plotted in Figure 3.5 for three frequencies and several shifts per fre- quency. Note that the time axis (horizontal) spans from 0 to 1. The Haar wavelet has very compact support, outside the support region, which naturally shrinks with 47 Figure 3.6: DWT Using Haar Wavelets. The left image is the single level DWT of the right image. Note that each filtered image is stored at half the resolution in the width and height so each of the four filtered images can be arranged in the same shape as the original image. increasing frequency, the value of the wavelet is zero, so any samples outside the considered region contribute no information to the frequency response. In the 1D transform the wavelet was measuring differences along the time axis to measure the frequency response. In 2D, we must consider differences on two axes including the diagonal (both axes simultaneously). Figure 3.6 shows an example of this for a single level DWT. Note that each of the four frequency bands, called LL, HL, LH, HH, are stored at half the width and height leading to the 4? 4 arrangement on the left hand side. In this case, the top-left is the LL band, the top right is the LH band, the bottom left is the HL band, and the bottom right is the HH band. Note the different features that each band responds to: the HL and LH bands respond to horizontal and vertical structures respectively and the HH band respond to diagonal structures. 48 While the Haar transform?s simplicity and effectiveness allow for widespread use there may be more suitable wavelets for a given task. The Daubechies wavelets [22] in particular have come into common use as they were designed based on the analysis of Ingrid Daubachies who made numerous contributions to multiresolution analysis. For example Daubachies showed that if the number of vanishing moments is N , then the support of the wavelet is at least 2N ? 1. Vanishing moments, which relate the wavelet to a polynomial, can be of critical importance in choosing a wavelet if there is some understanding of the function to be analysed. Generally, a wavelet with N vanishing moments is orthogonal to a polynomial of degree N ?1. In this section we covered only the most basic ideas of multiresolution analysis as it does not factor into the work of this dissertation. However, the wavelet trans- form, which was a critical part of the last decade of signal processing, is now making its way rapidly into deep learning applications [23]?[25] 3 so knowledge of these tech- niques will rapidly become important for the computer vision researchers. 3among many others including currently unpublished work. 49 Chapter 4: Entropy and Information Information theory marked a major advancement in the understanding of com- munication. Claude Shannon?s 1943 paper ?A Mathematical Theory of Communi- cation? was rare in that it both introduced the field of information theory and then systematically solved all major problems within it, essentially an entire field in one paper. Importantly for us, Shannon?s formulations for measuring the information contained in a message gave rise to lossless compression algorithms which are still used to this day. In this chapter, we review the high level ideas of information theory, specifically entropy , and how these ideas were used to develop compression algorithms. The overall goal of information theory [26] is to measure the amount of infor- mation contained in a signal. The signal can be discrete (e.g., words) or continuous (e.g., television, sound, etc.). 
Shannon was responding to a recent development in communication: modulation. These techniques were rudimentary lossy compression methods which introduced noise into the messages in exchange for reducing their size (similar to JPEG and MPEG as we will see later). Exactly how much noise was introduced and the limits of the system with respect to how much noise would make the message unintelligible was a mystery. As expected this was preventing the 50 full and effective use of these technologies, since operators would either introduce too much distortion and be left with an unintelligible message or introduce too little noise and be faced with transmission delay. 4.1 Shannon Entropy Mathematically, we are free to make any choice to define a measure of infor- mation. In other words, any monotonic function of the number of possible messages since all are equally likely. However, Shannon chooses to define information on a log scale since it has some useful properties ? Many practical properties vary with the logarithm. For example, two wires have double the bandwidth of one wire. ? It makes the math considerably easier since logarithms have nice properties around addition, multiplication, differentiation, etc. therefore, we define1 the ?amount of information? I as I ? log(M) (4.1) for some message M . For logarithm base 2, we will call the unit of information ?bits?. Since this is our measure of information, we can also measure the information 1Note that I am choosing these words carefully. We are deciding to measure information in this way and developing a field around that decision rather than measuring some natural property of the world like a physicist might. 51 INFORMATION SOURCE TRANSMITTER RECEIVER DESTINATION SIGNAL RECEIVED SIGNAL MESSAGE MESSAGE NOISE SOURCE Figure 4.1: The General Communication System. One of Shannon?s most important contributions was the idea that any communication system can be divided into parts and developed separately. Image credit: Claude Shannon [26]. capacity of a channel as log(N(t)) C = lim (4.2) t?? t where N(t) messages can be transmitted in time t. Before we continue, however, we touch on one of Shannon?s most influential contributions. That is the general definition of a communication system, given in Figure 4.1. Shannon showed that any communication system consists of the same fundamental parts. Even systems such as telegraphy and color television which seem very different from each other are fundamentally the same. This model drives much of Shannon?s analysis of information content. Since the communication system must be designed to support any possible message, we must take a probabilistic approach to describing the generation of messages by the information source. In other words, for a discrete communication 52 system, the information source will generate messages by producing discrete symbols one at a time. The generation of a given symbol is determined based on the past symbols and we can therefore compute a probability for each symbol. As an example of this consider the English language. Given a set of letters: ?FIRE BA? we can say that the letter ?D? is highly likely to be the next letter. This is a Markov process and while incredibly complicated to produce for real scenarios, Markov modeling would allow us to produce probabilities for each symbol. The important point here is that since we are fairly certain about ?D?, a ?D? being gen- erated has low information and therefore requires less space to transmit. Something unexpected like an ?X? 
would have high information content. So we can represent expected or frequent results with fewer bits. As another example, assume I wish to communicate the weather in Seattle, and I know that there is a 100% chance of rain in Seattle. This information can be transmitted with zero bits, since there is no need to communicate anything. Suppose that I instead wish to communicate the weather in College Park, where it rains roughly 50% of the time; then I would require the same number of bits to transmit raining or sunny.

So now we have established an intuitive idea of the information content of a message: we are measuring how "expected" or "surprising" or "random" a message appears. Given a set of symbols with probabilities $p_i$ for the $i$th symbol, we define the entropy $H$ as

\[ H = -\sum_{i=1}^{N} p_i \log p_i \tag{4.3} \]

This measure has some important properties:

- $H = 0$ if and only if all of the $p_i$ are zero except for one; in other words, there is only one symbol and it always occurs (like the Seattle example). This means there is no entropy.
- $H$ is maximized when all $p_i$ are the same ($\frac{1}{N}$), since this is the most uncertain situation (like the College Park example).

At this point we have developed information theory to the barest minimum extent needed to define the entropy of a discrete channel. We are not taking into account noise or continuous signals, all of which are discussed at length in Shannon's paper along with much more thorough derivations. We have already touched on the idea that low entropy symbols can be represented with fewer bits. In the next two sections we develop algorithms for computing these representations. These methods are examples of lossless compression, where all information in the original message is preserved.

4.2 Huffman Coding

Huffman coding [27] is a method for producing optimal length codes for symbols based on their probability of occurrence. It was the first method for finding optimal codes (Shannon presented a method which was not guaranteed to be optimal) and it is still in heavy use by image and video codecs at the time of writing, 70 years after its invention.

Figure 4.2: Huffman Tree Example. The tree structure assigns the shortest code to the most probable symbol and the longest code to the least probable symbol.

Huffman coding requires a set of symbols and their probabilities of occurrence as input. Then, given a message as a sequence of symbols, the algorithm produces the minimum length code that uniquely conveys the message. This requires assigning the shortest codes to the most probable symbols and the longest codes to the least probable symbols. We do this using a binary tree. Start with a leaf node for each symbol that stores the probability of that symbol and insert them into a priority queue. Then, at each step, remove the two nodes with the lowest probability and merge them into an internal node with probability equal to the sum of the probabilities of these nodes. Insert this new node into the priority queue and repeat until the queue has only one node on it. This node is the root of the tree. The process is a simple greedy algorithm. Approximate code is given in Listing 4.1.

Listing 4.1: Building a Huffman Tree.

    import heapq
    from collections import namedtuple
    from itertools import count
    from typing import List, Tuple

    Node = namedtuple("Node", ["probability", "symbol", "left", "right"])

    def build_tree(symbols: List[Tuple[float, str]]) -> Node:
        # Heap entries are (probability, tiebreaker, node); the counter breaks
        # ties so that Node objects are never compared directly.
        tie = count()
        heap = [(p, next(tie), Node(p, s, None, None)) for p, s in symbols]
        heapq.heapify(heap)
        while len(heap) > 1:
            lp, _, left = heapq.heappop(heap)
            rp, _, right = heapq.heappop(heap)
            merged = Node(lp + rp, None, left, right)
            heapq.heappush(heap, (merged.probability, next(tie), merged))
        return heap[0][2]

To encode, for each symbol traverse the tree from the root, tracking the series of left and right children used in the traversal. Add a 0 for a left child and a 1 for a right child. When the correct leaf node is reached, the resulting string of 0s and 1s encodes the symbol. To decode, read one bit at a time and traverse the tree (left or right) based on the bit value. When a leaf node is encountered, emit that symbol and return to the root of the tree.

Let's consider a simple example. Suppose we are given a four letter alphabet with symbols $M = \{A, B, C, D\}$. These four symbols are known to occur with probabilities $P = \{p_A = 0.4, p_B = 0.35, p_C = 0.2, p_D = 0.05\}$. Since we have four symbols, the default encoding would be 2 bits per symbol, $\{A = 00, B = 01, C = 10, D = 11\}$. However, computing the entropy of the set $P$ gives

\[
\begin{aligned}
H(P) &= -\sum_{p \in P} p \log p && (4.4) \\
     &= -0.4 \log(0.4) - 0.35 \log(0.35) - 0.2 \log(0.2) - 0.05 \log(0.05) && (4.5) \\
     &= 0.529 + 0.530 + 0.464 + 0.216 && (4.6) \\
     &= 1.74 && (4.7)
\end{aligned}
\]

so approximately 1.74 bits, meaning that the default encoding of 2 bits wastes 0.26 bits per symbol on average. We construct a Huffman tree for this set in Figure 4.2. This gives the variable length codes $\{A = 0, B = 10, C = 110, D = 111\}$, obtained by traversing the tree for each symbol. Note that although some symbols now require 3 bits to encode, these are the least probable symbols, and the most probable symbol, A, requires only 1 bit. If we compute the average size of a symbol with these codes we get 1.85 bits/symbol, so we are still above the entropy limit. This is because a symbol cannot occupy a fraction of a bit.

4.3 Arithmetic Coding

Although Huffman codes are optimal in terms of the number of bits needed to encode single symbols, we saw that Huffman coding is not able to reach the theoretical minimum number of bits defined by the entropy of the set. By computing an encoding for an entire message rather than one symbol at a time, we can overcome this limitation. This is the motivation behind arithmetic coding, which stores an entire message in an arbitrary number $q$ such that $0 \leq q < 1$.

Once again the algorithm is given a set of symbols and their probabilities. The encoder starts with the interval $[0, 1)$ and divides it into sub-intervals, one for each symbol. The algorithm picks the sub-interval which corresponds to the current symbol and proceeds to the next symbol. When all symbols are consumed, the resulting interval uniquely identifies the message, and since the intervals are unique we only need to transmit a single element of the final interval to identify the message². To decode, we follow the same process, but this time we are given the number $q$. At each step we construct the same intervals and simply check which one the given number falls into, emitting that symbol at each step. This does require either a special terminating symbol or a known message length to stop. The algorithm is shockingly simple and highly effective. An example encoding is shown in Figure 4.3, and a small code sketch of the same interval-narrowing procedure is given below.

Figure 4.3: Arithmetic Coding Example. Using the same alphabet and probabilities as the last section, we encode ABD into the range [0.29, 0.3).
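The following sketch, assuming plain Python (the cumulative-probability loop and names are mine), narrows the interval for a message over the alphabet used above; it is the bare encoder idea only, without the bit-level output or termination handling a real coder needs.

    def arithmetic_encode(message, probs):
        # probs maps symbol -> probability, in a fixed order; returns the
        # final interval [low, high) that identifies the message.
        symbols = list(probs)
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s in symbols:
                if s == sym:
                    low, high = low + cum * width, low + (cum + probs[s]) * width
                    break
                cum += probs[s]
        return low, high

    probs = {"A": 0.4, "B": 0.35, "C": 0.2, "D": 0.05}
    print(arithmetic_encode("ABD", probs))

Any single number inside the returned interval (0.295 in the text's example) is enough to recover the message, provided the decoder also knows when to stop.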
In that example, we encode 2Specifically, enough bits such that any fraction beginning with the transmitted number falls into the desired interval. 58 the message ?ABD? following the same alphabet and probabilities we used for the Huffman coding example. We start by dividing [0, 1) into proportional parts for each symbol, we find that the first symbol is A so we choose the interval from [0, 0.4). Next we divide that interval into proportional parts and since the next symbol is B, we choose [0.16, 0.3) since 0.16 = 0.4 ? 0.4 and 0.3 = 0.16 + (0.4 ? 0.35). The final symbol is D so we choose the interval from [0.29, 0.3) and transmit (arbitrarily) 0.295. Again, decoding follows a similar process. We are given the number 0.295 as input and we divide up the interval [0, 1), finding that this falls into [0, 0.4), we emit A. Then we find that 0.295 falls into [0.16, 0.3) and we emit B. Finally, we find that 0.295 falls into [0.29, 0.3) and we emit D, having decoded the message ?ABD?. While it may seem remarkable that a message can be transmitted in a single number, the algorithm does have faults. Again, the message must fit into a discrete number of bits, which can reduce the efficiency compared to the theoretical maxi- mum. Furthermore, we are assuming that we have an accurate probability model of the symbol frequencies. This may not be possible to obtain exactly, and in fact, we may not even want global symbol probabilities. Since we are encoding a message, the most efficient encoding of that message would model the probabilities of symbols in that message only (e.g., 1 for A,B,D and 0 for C in our example). However, 3 this requires transmitting the probability model which may remove any gains in efficiency from the coding. In general, these are still open problems and while we can obtain ?optimal? codes with respect to some specific definition of optimal, the theoretical entropy limit that Shannon?s work gives us remains elusive. 59 Chapter 5: Machine Learning and Deep Learning Machine learning is rapidly revolutionizing the way that people interact with computers. This is largely driven by the explosive proliferation of Convolutional Neural Networks (CNNs) [28] since they were shown to be computationally viable for large problems in 2012 [8]. Although machine learning seems commonplace today, this was not the case ten years ago (at the time of writing) and there were many who believed that machine learning would never achieve widespread success. While this dissertation is centered on compression as an application, it is first and foremost a contribution to machine learning for computer vision. In this chapter, we develop a high-level understanding of machine learning concepts which relate to the rest of the dissertation. This discussion is grounded in Bayesian decision theory which is often overlooked in machine learning discourse. Otherwise, the focus is on computer vision methods rather than general methods. Note Some of the material in this chapter is based on the book Pattern Classi- fication [29] which I strongly recommend to interested readers for more in-depth information. 60 5.1 Bayesian Decision Theory Bayesian decision theory tells us the best possible decision we can make about data even if we know exactly the underlying generating distributions. In a sense this can be thought of as a best case scenario because in real life we do not know the underlying distributions so we must either approximate them or approximate deci- sion criteria directly. 
The classic example of this proceeds as follows. Dockworker Dave is observing fish as they are unloaded from boats. His task is to sort the fish into bins, one for sea bass which we will denote as cb and one for salmon which we will denote as cs. The fish come out of the boat randomly. In the absence of any other information (such as identifying markers), how can he develop a strategy to sort them with minimal errors? Let?s give Dave some knowledge to help. Since the fish are coming off the boat in a random order, we must describe the occurrence of each fish probabilistically. Assume that Dave knows how many fish were caught of each type, then he knows the prior probability P (cb) and P (cs). For example if P (cb) = 0.7 and P (cs) = 0.3 then Dave should classify all of the fish as bass and he will have 70% accuracy. Of course this will entail him dumping all fish into the bass bin which is a bit odd considering that he knows there are two types of fish. Nevertheless this strategy will attain the lowest error given what Dave knows. We can give Dave some more information to help him. Dave?s daughter Wendy studies fish and she informs him that the color can be used to differentiate bass from salmon although it is not a perfect indicator (see Figure 5.1). In this case 61 Figure 5.1: Salmon vs Sea bass. Top: Salmon, Bottom: Sea bass. The two fish have different colors. we would say that there is a continuous random variable x which yields conditional probabilities P (x|cb) which is the probability of each color value for sea bass and P (x|cs) for salmon. We call this the likelihood of the color given the type of fish and we will call the color of the fish a feature. How does Dave use this information? In order to make a decision given color, we want to compute P (cs|x) and P (cb|x), which we call posterior probability , and take the larger probability, but we only have P (cb), P (cs), P (x|cb), P (x|cs). We also know that there is a joint distribution for each class P (cs,b, x) which is the probability of a fish being class s or b and having color x that relates these quantities. From probability theory, we can write this in terms of the conditional P (cs,b, x) = P (cs,b|x)p(x) = P (x|cs,b)P (cs,b) (5.1) this is the definition of conditional probability. Rearranging to group the quantities that we know gives | P (x|cs,b)P (cs,b)P (cs,b x) = (5.2) P (x) 62 which is known as Bayes rule. This allows us to compute the class probability given some measurement as long as we have the known likelihood and prior probabilities. We have another unknown term, the evidence term, in Equation 5.2, P (x), which is the probability of any fish having the measured color: in general we do not need this term. The Bayes decision rule is ??????cs P (cs|x) > P (cb|x)c = ??? (5.3)cb P (cb|x) > P (cs|x) expanding one of these inequalities gives P (x|cs)P (cs) P (x|cb)P (cb) > (5.4) P (x) P (x) P (x|cs)P (cs) P (x|cb)P (cb) =  >  (5.5) P(x) P(x) = P (x|cs)P (cs) > P (x|cb)P (cb) (5.6) in terms of only known quantities. This is good because the evidence term is often hard to measure. So now Dave can use his knowledge of the prior probabilities and Wendy?s color probabilities and multiply them to produce the probability of sea bass or salmon, binning the fish based on whichever is more probable. This seems like a perfectly reasonable idea, but what kinds of errors will Dave make? Let?s compute 63 the probability of Dave?s error ?????P (cb|x) c = cs P (error|x) = ???? 
(5.7)P (cs|x) c = cb In other words, the error rate will be the probability of the other class. To compute the average error rate, we marginalize x from the joint distribution ? ? P (error?) = P (error, x) dx (5.8)??? = P (error|x)P (x) dx (5.9) ?? We cannot control, or really even measure, P (x) but we can control P (error|x) by making it as small as possible. And the only way to accomplish that is by picking the higher probability for P (cs,b|x) as our classification choice, thus proving the optimality of the Bayesian decision. So now we have a way of making the best possible classification decisions. Given prior probabilities of the different classes and likelihoods of each feature given each class, we can then compute the posterior probabilities and pick the higher one. This guarantees the minimum error: we cannot achieve lower error than this. However we now have a new problem: how do we produce these probabilities for real problems? In general, we can not, and we will have to approximate the distributions leading to an even high error rate. In this sense, the Bayesian decision can be thought of as a theoretical lower limit for the error rate. Even if we know everything, because 64 of the probabilistic nature of decision problems, we will not make the right choice for every input. This sets up a theoretical dichotomy. Do we approximate the underlying prior and likelihood distributions which generated the data and then make Bayesian de- cisions based on our observations? Or instead can we simply compute the boundary between the posterior distributions as a function of the observation that makes a decision directly? Either way, these two questions are the entire purpose of machine learning. Given some data, sampled from unknown distributions, how do we com- pute approximations which match the true distributions or decision boundaries as closely as possible. 5.2 Perceptrons and Multilayer Perceptrons One simple way of learning decision boundaries is the perceptron [30]. The perceptron defines a simple linear model for making a binary decision between two classes (although it can be extended to more complex scenarios). Given a vector of weights w, and an input feature vector x, the perceptron makes the following decision ?????1 ?w,x? > 0 f(x) = ???? (5.10)0 otherwise 65 or simply f(x) = H(?w,x?) (5.11) where H() is the Heaviside function, for classes 1 and 0. The decision boundary in this case is a linear function of x. The task then is to compute a suitable w given some data. Starting from a randomly initialized w0 and some set of training data xi with labels yi the learning algorithm first computes the decision on xi. y?i = f(xi) = ?w0,x? (5.12) which may be incorrect. The algorithm then updates the weights as w1 = w0 + (yi ? y?i)xi (5.13) This process is repeated for all pairs (xi, yi) until some predefined stopping criterion is met. In the case that all xi are linearly separable with respect to yi, this stopping criterion may be convergence, but this is almost never the case in real life. To model real scenarios, a more complex model is needed: one that can model non-linear relationships. We can extend the perceptron to model these more complex scenarios by building aMultilayer Perceptron (MLP) (MLP). The MLP stacks layers of perceptrons separated by non-linearity (Figure 5.2). More formally, for layer 66 Figure 5.2: Multilayer Perceptron. The multilayer perceptron organizes groups of perceptrons into layers separated by non-linearities. 
In this case each circle rep- resents a perceptron. The first and last layers are termed the input and output layers respectively; any layers in between are termed hidden layers. Image credit: wikipedia. weights Wl (a matrix), input x, and nonlinearity ?(), a MLP can be defined as f(x) = WN?(. . . ?(W1?(W0x))) (5.14) for an MLP with N layers. We call the first layer (weights W0) the input layer, the last layer (weights WN) the output layer, and the intermediate layers (weights W1, . . . ,WN?1) the hidden layers. In practice we will also define a loss function l() which takes the network output and the true classification and tell use how wrong it was. Importantly this function needs to be scalar valued e(W ) = l(y, f((x);W )) (5.15) describing the error for some set of weights W . Training this model requires some tricks. We use an algorithm called back- 67 propagation [28]. If we observe the form of l(), we can see that it is a scalar valued function of a vector. This means that we can compute the gradient of the output with respect to the input ?? ??? ? l(y, f((x);W ))? ?w0 ?00? ?? ?? ?0 l(y, f((x);W )) ?? ?w ?10? ?? . ?? .. ??? ??W l(y, f((x);W )) = ? ? (5.16)?? ?l l(y, f((x);W ))? ?w ? ? ij ? ?? . ???? . ??. ? L l(y, f((x);W ))?wMN for L layers and weights of size MN , which tells us ?in what direction and by how much? we would need to change the network in order to classify x correctly. We can compute these quantities using the chain rule. For each layer we compute the L Jacobian ?WL?1 (since these are vector valued functions) with respect to the previousW layer and continue until we have differentiated every layer. ?WN ?WN?1 ?W 1?W 0l = ?WN l ? ?WN? ?1 ?WN? ? ? ? (5.17)2 ?W 0 which gives updates for the weights in each layer. 5.3 Image Features In order to apply any of these models to images, we need some way of repre- senting images as the input vectors x to the functions in the previous section. While 68 we could simply flatten the images into vectors, this may cause issues with the learn- ing process. Small perturbations of the input pixels can cause large changes in their actual values. Also pixels themselves can vary considerably in appearance yet still represent the same class. These issues impact the separability of the problem, and create extremely complex decision boundaries that are difficult if not impossible to model without arbitrarily deep networks. A more successful strategy would be to compute some higher order repre- sentation of the images which we can show is more meaningful. Although we may explore ideas like extracting numerical shape descriptions or color conversions, there are some abstract representations which have been shown to be effective. We will explore two of these in this section: Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT). Both of these techniques transform an image into a series of vectors, which we call features , that can then be input to a machine learning model. 5.3.1 Histogram of Oriented Gradients The Histogram of Oriented Gradients [31] captures shape and orientation of objects using a local descriptor. Often the image will be contrast normalized in blocks before the histogram of gradients is computed on each pixel in small cells. The descriptor for each cell is the concatenation of the histograms for all of the pixels in the cell. To compute the gradient of an image, it can simply be convolved with a 69 Figure 5.3: HoG Features. 
The left shows an example image and the right shows HoG features which classify as ?person?. The HoG features are shown as the weighted orientation based on the histogram of the cell and classification confi- dence. Image credit: [31]. gradient kernel. There are many such kernels but [ ] h = ?1 ?0 1? (5.18) ????1? ?v = ?? ? ? 0 ????? (5.19) 1 are the popular choices for computing horizontal and vertical gradients respectively. After the gradient is computed it can be binned per cell (usually 8?8 pixel cells) to compute the histogram. These histogram cells are then normalized with respect to larger blocks (usually 16? 16 blocks) to further increase invariance to image trans- formations. This gives a descriptor for each cell which can be input to any classifier. For example, Dalal and Triggs used the HoG feature for pedestrian detection an SVM classifier1. The result at the time was quite impressive and HoG features came into 1We do not cover SVMs here 70 widespread use. HoG features are ?dense? in the sense that every block in the image is covered in some sense which means that the model is given a strong prior on the local shapes present in the image. This can be seen visually in Figure 5.3. In the figure we see a man in the example image. The HoG features visualized on the right show outlines of the important shapes in each region. This visualiza- tion is produced by drawing tangent lines for each orientation in the histogram and weighting the lines by the histogram values on each cell. The lines are then further weighted by the SVM confidence to show which lines are contributing to the human classification. We see strong responses on the feet, shoulders, and head meaning that the model considers these unique identifiers of people that are not present in other objects. 5.3.2 Scale-Invariant Feature Transform One of the most popular and powerful image features is the Scale-Invariant Feature Transform [32]. Like HoG features, SIFT features capture a local description of shape using orientation. Unlike HoG, the primary purpose of SIFT was to find scale-invariant keypoints which are unique locations that appear the same under scale changes. These points can be used for object matching. Since the points should be rotation and scale invariant, a query object should be able to be located even if it is subject to complex deformations. To compute the scale space, SIFT uses a difference of Gaussian?s (DoG). The image is computed at different scales by Gaussian blurring the image successively, 71 . . . Scale (next octave) Scale (first octave) Difference of Gaussian Gaussian (DOG) Figure 5.4: Difference of Gaussians. The difference of Gaussian?s scale space computes gaussian blurs of increasing strength to the input image. The blurred images are then subtracted from each other. Points which survive this process are scale invariant. Image credit: [32]. then the difference between neighboring blurred images is taken (Figure 5.4). When this is done over many scales, any points which survive for the entire stack of DoG images are considered scale-invariant since they are clearly localized across scales. These points are pixel localized by applying non-maximum suppression and then sub-pixel localized by computing a second order Taylor expansion on the pixel which can produce a zero point in between pixel boundaries. For each keypoint, the rotation invariant descriptor is computed. The gradient magnitude m(x, y) and orientation o(x, y) are computed as ? m(x, y) = (P ? P ? )2x+1,y x 1,y + (Px,y+1 ? 
P 2x,y?1) (5.20) Px,y+1 ? Px,y?1 o(x, y) = arctan (5.21) Px+1,y ? Px?1,y for image P . This is computed in a 3 ? 3 neighborhood around the keypoint and then a histogram is computed. The orientation with the highest bin is assigned to the keypoint. To further improve the invariance, these descriptors are compiled in a 4?4 grid into an 8 bin histogram. The resulting 128 dimensional descriptor resulting 72 Figure 5.5: Convolutional Neural Network. Diagram shows the LeNet-5 ar- chitecture. The model takes pixels as input and computes feature maps using suc- cessive convolutions, non-linearities, and subsampling layers. For classification, the network terminates in a MLP. This allows the classifier and the feature extractors to be trained jointly using backpropagation. Image credit: [33] from the concatenation of these histograms is assigned as the keypoint descriptor. This descriptor can be normalized to improve invariance to lighting changes. SIFT features were the de facto standard in image features for many years. During the end of the classical/feature based learning era, there was a particular shift towards dense SIFT features. This step simply forgoes the keypoint detection steps and computes a descriptor for each pixel. This is useful for tasks like semantic segmentation that require per-pixel labels but it can also be used as a rotation invariant base for more general tasks. 5.4 Convolutional Networks and Deep Learning Feature engineering is a complex process. The two algorithms we described in the last section are non-trivial to understand, much less to develop on one?s own. Furthermore, it is not clear if a given feature is suitable to any particular task, we only have vague motivation and intuition to guide us. The fundamental 73 contribution of deep learning was that the best features for a given problem can be learned along with the classifier using only pixels as input. This replaced the tedious feature engineering process with something much more powerful and much simpler to develop. Deep learning is powered by the CNN [28]. These ideas had been around for some time but it was not until Alexnet [8] that deep enough and complex enough networks were shown to be computationally viable with a GPU implementation. This quickly revolutionized machine learning with entire scientific careers dedicated to feature engineering becoming obsolete in a short time frame. The CNN itself is not particularly complex. It is a MLP with the matrix multiplications replaced with convolutions, formally f(x) = WN ? ?(. . . ?(W1 ? ?(W0 ? x))) (5.22) The advantage of this is that the weights can be small kernels, usually 3? 3 instead of the large matrices required to process an image with a MLP (these matrices would need to be the same width and height as the image). This already made CNNs much more efficient than MLPs even without the GPU implementation. Fur- thermore, many seemingly complex image transformations can be computed with convolutions which is why we say that the convolutional network computes learned feature representations. These non-linear feature extractors replace the hand de- signed feature extractors of classical machine learning. One of the more influential and yet simple architectures is shown in Figure 74 5.5, LeNet-5 [33]. Many CNN variants can be described by the components in that figure. The convolutional layers are paired with non-linearity and subsampling layers. 
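The overall pattern is compact enough to write down. Below is a rough LeNet-style sketch, assuming PyTorch is available; the layer sizes are indicative rather than a faithful reproduction of LeNet-5, and the class name is mine.

    import torch
    from torch import nn

    class SmallConvNet(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # Convolution -> non-linearity -> subsampling, repeated twice.
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            )
            # A small MLP makes the final classification decision.
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
                nn.Linear(120, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    logits = SmallConvNet()(torch.randn(1, 1, 28, 28))  # an MNIST-sized input
    print(logits.shape)  # torch.Size([1, 10])

The whole stack, feature extractor and classifier alike, is trained jointly with backpropagation, which is exactly the point made about Figure 5.5.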
The subsampling layers are usually some kind of pooling (max pooling or average pooling), which helps aggregate feature information spatially. The actual classification decision is made using a MLP once the feature maps have been reduced to a sufficiently small and abstract representation. For the non-linearity, ReLU is currently the most popular choice, which I like to define in terms of the Heaviside function

\[ R(x) = H(x)x \tag{5.23} \]

but which most people prefer to write as

\[ R(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \tag{5.24} \]

Why CNNs work as well as they do remains somewhat of a mystery, but like much of machine learning, we can get an idea using intuition. As we have already discussed, the hand designed features of classical machine learning may not have been the best for a given task. The learned features of a convolutional network are likely better suited since they are customized to the task. Images are a discrete sampling of a 2D signal, and nearby pixels are often highly correlated or anti-correlated (in terms of edges). CNNs can pick up on these correlations because they use a translation-invariant learned convolution which is moved across the image spatially in a sliding window. Finally, since convolutional networks are highly efficient, they can be made deeper and wider to learn more complex mappings.

In this dissertation we will be exclusively exploring convolutional neural network architectures. While there have been some major advancements to CNNs, which we touch on in the rest of this chapter and throughout the dissertation, it is worth noting that the CNNs of today are largely the same as those used by the pioneers of deep learning.

5.5 Residual Networks

Residual networks [9] were a major advancement in the design of convolutional networks. Instead of learning a mapping $y = f(x)$ like a traditional network, the residual network defines a mapping $y = f(x) + x$. This, along with some other notable architectural changes, makes the residual network highly effective. The precise reason why this helps so much is still debated; however, it likely makes "gradient flow" easier (gradient flow was also explored by the VGG [34] and Inception [35] architectures). Examining Equation 5.17, we can spot a potential problem. As the depth of the network increases, carrying the gradient from the loss through all Jacobians to the earliest layer may be difficult. The gradient tends to shrink as we move backwards through the layers; we call this problem the vanishing gradient. Residual learning likely solves this problem by allowing a shortcut connection around some of the convolution layers which carries a stronger gradient signal to the early layers.

Figure 5.6: U-Net. U-Nets arrange convolutional layers in a U-shape of decreasing and then increasing size. Skip connections allow for better gradient flow to early layers. Image credit: [36].

The actual design of the network is based on a so-called "residual block" pictured in Figure 5.7. Each block has two weight layers with batch normalization [37] separated by a ReLU non-linearity. The addition of batch normalization is thought to simplify the learning process even further for the weight layers by removing the lower order statistics of mean and variance. For each batch, the layer tracks the running mean $\mu$ and variance $\sigma^2$ and computes

\[ \mathrm{BN}(x) = \gamma \frac{x - \mu}{\sigma} + \beta \tag{5.25} \]

for learned $\gamma$ and $\beta$. The block includes the hallmark residual connection short-circuiting the two weight layers.
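In code, such a block is only a few lines. A minimal sketch, again assuming PyTorch, for the case where the input and output shapes match (real ResNets add a strided projection on the shortcut when they do not):

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Two weight layers with batch normalization, separated by a ReLU.
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)  # the shortcut around the weight layers

    y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])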
Each residual block consists of two convolutions with a ReLU non-linearity and a batch normalization layer. Image credit: [9]. The entire network architecture stacks the residual blocks using strided con- volutions to perform learned downsampling instead of using pooling. The network terminates with a ?global average pooling? layer which performs spatial averaging over each channel of the output to produce a small vector suitable for input to a MLP like prior network designs. 5.6 U-Nets It is worth noting that we are, of course, not limited to only classification prob- lems. The U-Net [36] architecture is suitable for problems which require a spatial output like image-to-image translation and semantic segmentation. In this disserta- tion, we will almost exclusively be dealing with image-to-image problems although the architectures we discuss later will differ greatly from the U-Nets. Similar to residual networks, U-Nets were a major advancement in these spatial tasks. And also like residual networks, the major contribution was likely in gradient flow. U-Nets define the network in two distinct parts: the encoder and the decoder, the schematic is shown in Figure 5.6. The encoder is much like a traditional convo- lutional network. There are alternating convolutions and non-linearities with down- 78 sampling. The decoder is the reverse process, taking the compact representation from the encoder and using upsampling operations to compute a result which has the same dimensions as the input image. The major design feature of this is the skip connections. These connections take feature maps from the encoder and concatenate them with the feature representations of the same size in the decoder which allows a strong gradient signal to flow to the early layers avoiding the vanishing gradient problem. The U-Net was revolutionary at the time for its results on the extremely difficult semantic segmentation problem. However, the U-Net would quickly become widely used for any spatial task, and is still used quite frequently. Pix2Pix [38] for example was based entirely on the U-Net. While U-Nets were highly influential on all image-to-image problems, we will employ very different architectures later in the dissertation, and indeed very few works in compression actually use U-Nets. This is because there are other ways to deal with vanishing gradients (like residual blocks and their derivatives) and the downsampling operations in U-Nets tend to remove fine details which we want to preserve in restoration tasks. 5.7 Generative Adversarial Networks Generative Adversarial Networks (GANs) [39] will be relied upon heavily in the methods we detail in the dissertation. GANs were a truly revolutionary mo- ment in the generation of images using CNNs. Prior error based methods, called 79 Fake Image Noise Generator Real Discriminator Real Image Fake Figure 5.8: GAN Procedure. The generator creates an image from random noise and provides it to the discriminator along with real images. The discriminator must identify which images are real and which are fake. autoencoders2, produced very poor results even for simple datasets like MNIST [40]. The many variants of the GANs would change this dramatically using an ingenious and fairly simple idea. The GAN methods sets up an adversarial game with two networks. One network, the generator, generates images, and another, the discriminator, tries to identify which images are real and which are fake. 
The generator is rewarded for fooling the discriminator into classifying its images as real and penalized for getting caught. Conversely, the discriminator is rewarded for correctly identifying fake images and penalized for incorrectly classifying them. Training (theoretically) ends when the two networks achieve a Nash equilibrium [41]?[43]. This procedure is shown in Figure 5.8. We train this pair of networks using standard cross entropy classification loss. The only difference is that we reverse the labels when training the generator since we want it to fool the discriminator. This is sometimes call the minimax loss. Given real samples x, noise vectors z, discriminator D(), and generator G(), we define the 2Although this is an abuse of the term, technically an autoencoder should generate the exact image it is given as input and nothing else. 80 loss l(x, z) = log(D(x)) + log(1?D(G(z))) (5.26) and we train the discriminator to maximize l() while training the generator to min- imize l(). In other words minmaxEx?real[log(D(x))] + Ez?noise[log(1? log(D(G(Z))))] (5.27) G D As these two networks play their game over the course of training, the dis- criminator will start to identify more and more fake images. The increasing loss on the generator will cause it to generate more realistic images. Since identifying fake images is relatively easy for a CNN, by the end of training, the generator will be producing extremely realistic images in order to continue to fool the discriminator. In practice the Nash equilibrium is hard to achieve and we simply stop training GANs after a certain number of steps. GANs also chronically diverge since it is hard for the GAN to recover from a situation where the discriminator has a large advantage over the generator. 5.8 Recap To recap, we have reviewed machine learning from the ground up. We built the ideas of machine learning on a foundation of how to make decisions in the pres- ence of perfect information. We then developed the perceptron and its extension, 81 the multilayer perceptron which is the progenitor of modern deep learning. We dis- cussed hand engineered features and why they were necessary and finally developed deep learning as a replacement for these features. We then reviewed some of the most important ideas of deep learning including convolutional networks, residual learning, U-Nets, and GANs. This concludes the foundational knowledge which is required to fully understand the original research developed in the remainder of this dissertation. 82 Part II Image Compression 83 Chapter 6: JPEG Compression JPEG has been a driving force for internet media since its standardization in 1992 [3]. The principal idea in JPEG compression is to identify which details of an image are the least likely to be noticed if they are missing. These details can then be replaced with lower entropy versions. By removing information, there is a significant size reduction over methods which perform entropy coding alone. This is called lossy compression, since information is lost in the encoding process. The lost information is, in general, not recoverable. Usually this is not a major issue, as the JPEG algorithm was designed to remove unnoticed details. However, there are situations where the information loss is noticeable in the form of unpleasant artifacts (Figure 6.1). This is particularly true when a JPEG image is saved multiple times, which causes repeated application of the lossy process. 
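This generational loss is easy to reproduce. The sketch below, which assumes the Pillow library and an arbitrary input file name, simply re-encodes an image many times at a fixed quality and lets any additional loss accumulate; it is illustrative only and is not part of the methods developed later.

from io import BytesIO
from PIL import Image

def recompress(img: Image.Image, generations: int = 20, quality: int = 75) -> Image.Image:
    """Repeatedly re-encode an image as JPEG to expose generational loss."""
    for _ in range(generations):
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img

# degraded = recompress(Image.open("photo.png").convert("RGB"))

Comparing the first and last generation, particularly around sharp edges and in smooth gradients, makes the artifacts discussed above easy to see.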
A significant portion of the dissertation is devoted to using machine learning to approximate the lost information.

A common source of consumer confusion with JPEG is in the name itself. JPEG refers to three things simultaneously:

The JPEG Algorithm The algorithm for compressing images.

JPEG Files The disk file format for storing JPEG compressed data and its associated metadata. This is actually either a JPEG File Interchange Format (JFIF) file or an Exchangeable Image File Format (EXIF) file.

The Joint Photographic Experts Group The working group that maintains the JPEG standard.

Figure 6.1: JPEG Information Loss. This image suffers from extreme degradations caused by JPEG compression. Zoom in on this image, it probably has fewer details than you think it does.

This chapter is devoted to giving the reader an understanding of JPEG compression which is sufficient to motivate the first principles that we use in developing the algorithms later in the dissertation. We will review the function of JPEG compression and decompression step-by-step, and we will discuss the extremely important view of JPEG as a linear map. We will also briefly discuss other image compression algorithms.

6.1 The JPEG Algorithm

We now present the JPEG algorithm step-by-step. Where the standard is ambiguous we defer to the Independent JPEG Group's libjpeg software [44]. This software is widely considered standard in the industry, although there are other implementations of JPEG. We start by describing the compression process and then conclude with the decompression process, which is largely the inverse. Throughout the description we will place emphasis on which parts of the standard are motivated by human perception and which steps involve loss of information.

6.1.1 Compression

JPEG compression starts with an RGB image, usually in interleaved (RGB24) format. This image is then converted to the YCbCr planar format; however, this is not the more common ITU-R BT.601 [45] format, which produces values in [16, 235] for Y and [16, 240] for Cb, Cr. Instead, this format uses the full range of byte values ([0, 255]). The color conversion uses the following three equations:

Y = 0.299R + 0.587G + 0.114B    (6.1)
Cb = 128 - 0.168736R - 0.331264G + 0.5B    (6.2)
Cr = 128 + 0.5R - 0.418688G - 0.081312B    (6.3)

This color conversion is designed to better match human perception, which treats changes in luminance (the Y channel) with more weight than chrominance (the Cb and Cr channels). Therefore, the Cb and Cr channels can have more information removed with less of an effect on the overall image.

One operation in particular which removes additional information from the color channels is chroma subsampling. Chroma subsampling describes a 4 × 2 block of pixels and is represented as a triple, e.g., 4:2:0. The 4 is the number of luma samples per row. The 2 is the number of chroma samples in the first row. The 0 is the number of chroma samples which change in the second row. So in this example, there are 4 luma samples in each row, 2 chroma samples in the first row, and none of them change in the second row, meaning that the chroma channels should be stored at half the width and height of the luma channel. Another example is 4:2:2, which indicates that the 2 chroma samples in the first row both change in the second row, so the chroma channels are stored with half the width but the same height as the luma channel (this is an archaic and confusing notation, but unfortunately it is still used).
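To make the conversion concrete, the following is a minimal NumPy sketch of the full-range color transform and 4:2:0 averaging. It is illustrative only: real encoders such as libjpeg may use different downsampling filters, and the helper assumes the (padded) chroma dimensions are even.

import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Full-range (JFIF) RGB -> YCbCr for a float array of shape (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def subsample_420(channel: np.ndarray) -> np.ndarray:
    """4:2:0 subsampling by averaging each 2x2 block of a chroma channel."""
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))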
Before we remove information we need to pad the image. JPEG is based on 8 × 8 blocks, so at the least the image needs to be padded to a multiple of 8 in the width and height. If chroma subsampling is used, this needs to be taken into account during padding, and the image may need to be padded to a multiple of 16 or more in the width, height, or both. This defines the Minimum Coded Unit (MCU), i.e., the minimum size block which can be encoded using the given settings. The padding in this case is always done on the bottom and right edges of the image and repeats the final sample as the padding value. With the image padded, the chroma channels can be subsampled.

Next comes the main feature of the JPEG algorithm, the DCT on non-overlapping 8 × 8 blocks. Before computing this, the pixels are centered by subtracting 128. The DCT is applied using

D_{ij} = \frac{1}{4} C(i) C(j) \sum_{x=0}^{7} \sum_{y=0}^{7} P_{xy} \cos\left[\frac{(2x+1)i\pi}{16}\right] \cos\left[\frac{(2y+1)j\pi}{16}\right]    (6.4)

C(u) = \begin{cases} \frac{1}{\sqrt{2}} & u = 0 \\ 1 & u \neq 0 \end{cases}    (6.5)

for an 8 × 8 block of pixels P. This accomplishes two goals. First, it concentrates the energy of each block into the top left corner. Second, it serves as a frequency transform which allows us to remove frequencies which we believe viewers will be less likely to notice.

The DCT coefficients are then quantized by dividing by a quantization matrix. This is an 8 × 8 matrix of coefficients which reduce the magnitude of the DCT coefficients. Since humans tend not to notice missing high spatial frequencies, the quantization matrices generally target these. However, most encoders compute the quantization matrix from a scalar quality factor which is easier for users to comprehend, and as this quality decreases, the quantization matrix removes lower and lower frequencies. After quantization, the result is truncated to an integer. This removes information in the fractional part and permits the result to be stored in an integer which takes up less space. In a sense this is the first "compression" operation. The entire operation is given by

Y'_{ij} = \left\lfloor \frac{Y_{ij}}{(Q_y)_{ij}} \right\rfloor    (6.6)
(C'_b)_{ij} = \left\lfloor \frac{(C_b)_{ij}}{(Q_c)_{ij}} \right\rfloor    (6.7)
(C'_r)_{ij} = \left\lfloor \frac{(C_r)_{ij}}{(Q_c)_{ij}} \right\rfloor    (6.8)

for luminance quantization matrix Q_y and chrominance quantization matrix Q_c. The color channels are often quantized more coarsely as human vision is less sensitive to color data. Note that since we truncate, any fractional part after division is irrevocably lost; the resulting coefficient can only be approximated from the integer part. Any coefficient whose magnitude is less than one after division is truncated to zero and cannot be recovered even approximately. Other than chroma subsampling, this is the only source of loss in JPEG compression. In order to decode the image, Q_y and Q_c are both stored in the JPEG file.

Figure 6.2: Zig-Zag Order. This ordering is intended to put low frequencies in the beginning and high frequencies at the end.

These quantized coefficients are then vectorized in a zig-zag order (Figure 6.2)
The final run-length coded vectors are then entropy coded. This can use either Huffman coding [27] or arithmetic coding [46]. With a significant amount of redundant or unnoticeable information removed, these entropy coding operations are extremely efficient and yield a significant space reduction over the uncompressed image. 6.1.2 Decompression The decompression algorithm is largely the reverse operation. After undoing entropy coding we have the quantized coefficients. These are element-wise multiplied by the quantization matrices to compute the approximated coefficients Y? = Y ?ij i,j(Qy)ij (6.9) (C?b)ij = (C ? b)i,j(Qc)ij (6.10) (C?r)ij = (C ) ? r ij(Qc)ij (6.11) 90 We can then compute the inverse DCT of the approximated coefficients 7 7 [ ] [ ] 1 ?? (2x+ 1)i? (2y + 1)j? Pxy = C(i)C(j)D?ij cos cos (6.12) 4 16 16 i=0 j=0 and uncenter the spatial domain result by adding 128. The color channels are interpolated to remove chroma subsampling. We then remove any padding that was added, and convert the image back to the RGB24 color space R = Y + 1.402(Cr ? 128) (6.13) G = Y ? 0.344136(Cb ? 128)? 0.714136(Cr ? 128) (6.14) B = Y + 1.772(Cb ? 128) (6.15) and the image is ready for display. There are three important things to take away from this discussion. First, other than chroma subsampling, which is optional, the only lossy operation is the truncation during the quantization step. This is a fairly simple operation considering the DCT coefficients but it creates complex patterns in the spatial domain. Next, the blocks are non-overlapping, so for each block there is no dependence on pixels outside of the block. Finally, each pixel in the block depends on all of the coefficients in the block. Conversely, each coefficient in the block also depends on all of the pixels in the block. We will exploit this property later in the dissertation. 91 6.2 The Multilinear JPEG Representation In what is perhaps a surprising result, the steps of the JPEG transform are easily linearizable [47], a property that was explored significantly in the 1990s [48]? [51]. Indeed, outside of entropy coding, the only non-linear step in compression is the truncation that occurs during quantization, and all the steps of decompression are linear. Furthermore, when we process JPEG images, we are either dealing with the decompression process, or we are in full control over the compression process and it is therefore our choice if and when we truncate. We would only need to do this if we were saving the result as a JPEG. We now develop the steps of the JPEG algorithm into linear maps and compose them into a single linear map that models compression and a single linear map which models decompression. Without loss of generality, consider a single channel (grayscale) image. We model this image as the type-(0, 2) tensor I ? H? ? W ?. Note that although we are essentially dealing with real numbers, we have intentionally left H? and W ? as arbitrary co-vector spaces because there is no reason to define them concretely for our purposes. We will, however, make the stipulation that they are defined with respect to a standard orthonormal basis so that we can freely convert between the co- vector and vector spaces without the use of a metric tensor. Note that the following equations are written in Einstein notation; see Chapter 2 (Multilinear Algebra) if this is unfamiliar. Our first task is to break this image into 8 ? 8 blocks. We define the linear 92 map B : H? ?W ? ? X? ? Y ? ?M? ?N? (6.16) ? B ? H ?W ?X? ? Y ? ?M? ?N? (6.17)?? ?? 
?1 pixel h,w belongs in block x, y at offset m,n Bhwxymn = ??? (6.18)0 otherwise where B is a type-(2, 4) tensor defining a linear map on type-(0, 2) tensors. The result of this map will be a tensor with 8? 8 blocks indexed by x, y and 2D offsets for each block indexed by m,n. Although this definition is fairly abstract, it can be computed fairly easily using modular arithmetic, although it does need to be recomputed for each image. Next we compute the DCT of each block,. We define the following linear map D : M? ?N? ? A? ?B? (6.19) ( D ?)M ?(N ? A? ?B)? (6.20) mn 1 (2m+ 1)?? (2n+ 1)??D?? = C(?)C(?) cos cos ? (6.21)4 16 ? 16????1? u = 02C(u) = ??? (6.22)1 u ?= 0 The equation for D should look familiar by now. D is a type-(2, 2) tensor defining a linear map on type-(0, 2) tensors. The m,n block offset indices in the input tensor 93 will index spatial frequency after applying this map. Next we linearize the coefficients 2. We define the following linear map Z : A? ?B? ? ?? (6.23) ? Z ? A?B ? ?? (6.24)?????1 ?, ? is at ? under zigzag orderingZ??? = ??? (6.25)0 otherwise This is a type-(2, 1) tensor defining a linear map on type-(0, 2) tensors. It flattens the 8?8 blocks into 64 dimensional vectors. In other words, the ?, ? indices indicate which indexed spatial frequency will be indexed with a single k after applying this transformation. This tensor depends on the zigzag ordering and can simply be hard coded. Finally, we divide by the quantization matrix. We still need to scale the coefficients even though we are not rounding them. We define the linear map S : ?? ? K? (6.26) S ? ??K? (6.27) S? 1 k = (6.28)qk where qk is the kth entry in the quantization matrix for the JPEG image. This is a type-(1, 1) tensor defining a linear map on co-vectors. 2Note that we?re doing things slightly out of order but ultimately the order does not matter here and doing it this way simplifies the form of the next tensor 94 In order to define both compression and decompression, we need only one more linear map, scaling by the quantization matrix S? : K? ? ?? (6.29) S? ? K ? ?? (6.30) S?k? = qk (6.31) for the same definition of qk as above. We now have the fairly simple task of assembling these steps into single tensors. We say this is simple because all of the operations are linear maps and therefore are readily composable. We define J : H? ?W ? ? X? ? Y ? ?K? (6.32) J ? H ?W ?X? ? Y ? ?K? (6.33) Jhw = Bhw mn ?? ?xyk xymnD?? Z? Sk (6.34) for compression and J? : X? ? Y ? ?K? ? H? ?W ? (6.35) J? ? X ? Y ?K ?H? ?W ? (6.36) J?xyk = BxymnD?? Z? S?khw hw mn ?? ? (6.37) for decompression. 95 It is difficult to express how powerful this result is and how easily it is achieved using rudimentary concepts from multilinear algebra. What seems like a fairly complex algorithm, and indeed is, when thought of as an operation on a matrix, reduces to a simple linear map when we model the inputs and intermediate steps as tensors. Equipped with this linear map, we can and will model complex phenomena on compressed JPEG data directly without needing to decompress it. 6.3 Other Image Compression Algorithms The astute reader will have noticed early on that we, quite confidently, are in a part labeled ?image compression? and yet we are only discussing JPEG. There are other image compression algorithms, so a natural question is ?why are we not discussing those?? For myriad reasons, there are really no interesting problems to study for other compression algorithms. PNG [52], for example, is widely used. 
But this is lossless compression, so there are no artifacts to overcome. GIF [53] is also lossless, although many GIF services quantize colors into a palette to save more space, which is a potential problem that could be interesting to work on. More modern formats are based on video compression and while they are lossy, they are simply unused. BPG [54] is the most promising of these, and therefore the least used, but there is also HEIC/HEIF [55] which is currently being unsuccessfully pushed by Apple on the iPhone. Probably the most interesting of these algorithms is JPEG 2000 [56], which 96 is lossy and widely used in digital cinema [57], although it was completely ignored by consumers. This codec is interesting because it would require us to update our theory to take into account the discrete wavelet transform that JPEG 2000 uses in place of the DCT. However, the use of this transform imposes a major practical problem as well: JPEG 2000 images look good even at low bitrates because the wavelet transform is so effective, so they may not require correction. Instead, we will focus our energy where it can have the most impact: by exclu- sively studying JPEG. Even at the time of writing, 30 years after standardization, JPEG is the most commonly used image file format. It is easy to use, familiar to consumers, and has become the backbone of the internet, making it incredibly resilient to any challenger, no matter how much better the compression or quality of the images. At the same time, JPEG does suffer from some extreme artifacts in many conditions. It is this combination of visible quality loss and widespread use that makes JPEG ideal for further study. 97 Chapter 7: JPEG Domain Residual Learning Now we develop a general method for performing residual network ([9], Section 5.5 (Residual Networks)) learning and inference on JPEG data directly, i.e., without the need for decompressing the images. This method was published separately in the proceedings of the International Conference on Computer Vision [58]. Warning This chapter is extremely math-heavy and dry. It is strongly recom- mended to review the background math outlined in the first chapters of the disser- tation for a complete understanding of the material. This chapter may serve as a powerful sleep aid: do not operate heavy machinery while reading this chapter. Compared to processing data in the pixel domain, directly working on JPEG data has several advantages. First, we observe that most images on the inter- net are compressed, including deep learning datasets such as ImageNet [59]. Next we observe that JPEG images, being compressed, are naturally smaller and more sparse than their uncompressed counterparts. These are all desirable properties for memory- and compute- hungry deep learning algorithms. The primary goal of the method presented in this section is to be as close as possible to the pixel domain result. In other words, given a learned pixel domain 98 mapping H : I ? Rc mapping images to class probabilities for c classes1, we want to define a mapping ?(H) such that |H(In)??(H)(DCT(In))| is minimized for any In ? I. We can accomplish this goal analytically and we will develop the theory in the coming sections, including a discussion of why it is (likely) not possible to generate a mathematically exact ?(H) function2 and what guarantees are available on the deviation. Recall that a residual network requires several components to operate Convolution The primary learned linear mapping between feature maps at each layer. 
Each ?residual block? contains two of these operations. Batch Normalization Produces normalized features for the convolutions; this is thought to ease the learning process by removing unnecessary statistics from the input features which can be represented exactly [37]. Global Average Pooling An innovation of the ResNet. When the convolution layers are exhausted, the features are averaged channel-wise to produce a vector suitable for input to a fully connected layer ReLU The non-linearity of the ?residual block?, this allows the network to learn complex mapping. Our task will be to derive transform domain versions of each of these operations. First Principles 1We frame this discussion in terms of classification but it applies equally well to any problem type 2i.e., a ? function such that for all i, |H(i)??(H)(DCT(i))| = 0 99 ? JPEG is easily linearized, convolutions are linear, composing them expressed a learned convolution exactly in the JPEG domain. ? Other components of the residual network can be expressed analytically in the JPEG domain. ? ReLU can be approximated with a bilinear map. 7.1 New Architectures Before discussing the proposed technique, we will first make a detour to review two popular methods of JPEG and DCT deep learning. These methods are all new architectures which enable effective processing in the transform domain but which do not attempt to replicate any pixel domain result. These methods have some advantages over both the method presented in the rest of this chapter and when compared to pixel-domain networks. For example, both methods show good task accuracy with faster processing. There are some notable disadvantages, however. In particular, these methods are not suitable for situations when a pixel domain network already exists and its results need to be replicated on JPEGs. These ideas were inspired by the ?do nothing? approach published in both NeurIPS and an ICLR workshop [60]. In this approach, the transform coefficients are passed into a mostly unmodified ResNet for classification. The authors postulate that with the higher-level representation of the DCT, fewer layers are required to achieve similar accuracy and therefore the network will be faster. Indeed the authors show this is true empirically. However, despite the intuition, this paper?s evaluation 100 leaves much to be desired, and it is unclear what the contribution of the DCT is to the result. Meanwhile, it is well known in JPEG artifact correction literature (dis- cussed later in the dissertation), where ?dual-domain? methods are commonplace, that providing DCT coefficients to a network is not successful without considerable effort, a result seeming at odds with Gueguen et al.. Instead, the following methods are inspired by unique attributes of DCT co- efficients. Namely: that each coefficient is a function of all pixels in a block, that a block of pixels is only correctly represented by all DCT coefficients, and that the DCT coefficients are orthogonal and arranged in a grid simply for convenience. This last point is critical. One of the reasons that small convolutional kernels work well on pixels is that nearby pixels are usually correlated in some way, so translation invariant features are readily learned. If that convolution is instead applied to co- efficients, this is then a mapping on arbitrary orthogonal measurements which are intentionally decorrelated, leaving little hope for success. 7.1.1 Frequency-Component Rearrangement This is treated by Lo et al. 
[61] by simply rearranging the frequencies into the channel dimension before processing (Figure 7.1), yielding a feature map which is 1th the width and height and with 64 channels (per input channel). Note how 8 this allows a convolution to capture the information contained in the DCT. Since the convolution operation used in deep learning maps all channels in the input to each channel in the output, every coefficient plays a role in the resulting map and 101 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 16 17 18 19 20 21 22 23 24 25 26 27 28 19 30 31 24 25 26 27 28 19 30 31 32 33 34 35 36 37 38 39 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 56 57 58 59 60 61 62 63 0 0 1 1 2 2 ... 63 63 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 0 1 1 2 2 63 63 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 16 17 18 19 20 21 22 23 24 25 26 27 28 19 30 31 24 25 26 27 28 19 30 31 Output: 64 channel 2 ? 2 32 33 34 35 36 37 38 39 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 56 57 58 59 60 61 62 63 Input: 1 channel 16 ? 16 Figure 7.1: Frequency component rearrangement. therefore complete block information is captured. They call this method Frequency- Component Rearrangement (FCR) Lo et al. use their network for semantic segmentation of road features quite successfully. At the time of publication, their method was both fast and accurate. 7.1.2 Strided Convolutions A similar solution proposed by Deguere et al. [62] uses strided convolutions instead of FCR. Specifically, this method uses an 8 ? 8 stride-8 convolution such that each DCT block is processed in isolation. Note again how this makes good use of the coefficients: the 8 ? 8 convolution ensures that every coefficient plays a role in the resulting mapping, and the stride-8 ensures that there is no leakage of information across blocks. Once these ?block representations? are computed, the resulting feature map is again 1th the width and height (now with a variable number 8 of features) . Deguere et al. use this method for object detection in the DCT domain and again performed admirably at the time of publication. 102 7.2 Exact Operations In the previous section we discussed novel architectures that equip CNNs with the ability to process data in the transform domain. While this is useful and impor- tant, it requires training a new CNN from scratch and has no particular relationship to the underlying pixels that the CNN is processing. Since CNNs were designed to process pixel domain data, and the DCT is a transform of pixel data, a natural question is whether a method can be formulated that is capable of processing trans- form domain data and which has some mathematical guarantee or relationship to the underlying pixel domain model. We now develop just such a method. 7.2.1 JPEG Domain Convolutions Recall from Section 6.2 (The Multilinear JPEG Representation) that the JPEG transform can be linearized and written as linear maps on tensor inputs and that this analysis yields the following linear maps: J : H? ?W ? ? X? ? Y ? ?K? (7.1) for compression of an image represented by I ? H? ?W ? to transform coefficients F ? X? ? Y ? ?K?, and J? : X? ? Y ? ?K? ? H? ?W ? (7.2) (7.3) 103 to decompress. We proceed by considering only single channel images. 
We will add in channels and batch dimensions later since they have no bearing on the derivation.

We know that convolutions are linear maps; therefore, deriving a JPEG domain convolution is fairly simple. Assume that C : H^* \otimes W^* \to H^* \otimes W^* is a linear map representing an arbitrary convolution. This convolution would be applied to an image I in the pixel domain by computing

I'_{h'w'} = C^{hw}_{h'w'} I_{hw}    (7.4)

Given transform coefficients F \in X^* \otimes Y^* \otimes K^* for I, we can derive I as

I_{hw} = \tilde{J}^{xyk}_{hw} F_{xyk}    (7.5)

Similarly, we can derive transform coefficients F' for I' by applying J

F'_{x'y'k'} = J^{h'w'}_{x'y'k'} I'_{h'w'}    (7.6)

Substituting these two expressions yields

I'_{h'w'} = C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} F_{xyk}    (7.7)
F'_{x'y'k'} = J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} F_{xyk}    (7.8)

And we make the following definition

F'_{x'y'k'} = \left[ J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} \right] F_{xyk}    (7.9)
\Xi^{xyk}_{x'y'k'} = J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw}    (7.10)

giving a simple expression for computing \Xi : X^* \otimes Y^* \otimes K^* \to X^* \otimes Y^* \otimes K^*, a convolution in the compressed domain, given a convolution in the pixel domain. It is important to note that this is not a simple notational trick. Because J, C, and J̃ are linear maps, the resulting Ξ performs all three operations in a single step and is significantly faster than performing them separately (much in the same way that the linear functions f(x) = 5x and g(x) = 2x can be combined into (f ∘ g)(x) = 10x, which has only a single multiply versus the two multiplies of applying g and then f separately).

With the mathematics satisfied, we now turn to the development of an efficient algorithm for computing Ξ. After all, the convolution C is usually represented as a simple 3 × 3 matrix of numbers (although other sizes and shapes are possible). However, our derivation is expressed in terms of a dim(H) × dim(W) × dim(H) × dim(W) (2, 2)-tensor. One way to understand C is as a look-up table of coefficients. For example, if we index C as C[5, 7], we are given a tensor of coefficients for every pixel in the input representing its contribution to the (5, 7) pixel in the output. Naturally, many of these coefficients are 0; in fact, for a 3 × 3 kernel, the only non-zero entries are those for the pixels from (4, 6) to (6, 8). Similarly, if we index C as C[:, :, 5, 7] we can see the contribution of pixel (5, 7) in the input to every output pixel (which again is mostly zero). This implies a naive algorithm, Exploding Convolutions (Listing 7.1), where the entire (2, 2)-tensor is iterated and the correct coefficients are copied from the convolution kernel. The resulting map is then composed with J and J̃ to produce the transform domain map.

Listing 7.1: Exploding Convolutions (Naive)

# Imports shared by the listings in this chapter.
from typing import Tuple
import torch
from torch import Tensor

def explode_convolution(shape: Tuple[int, int],
                        conv: Tensor,
                        J: Tensor,
                        J_tilde: Tensor) -> Tensor:
    # Half-width of the kernel in each spatial dimension.
    size = (conv.shape[0] // 2, conv.shape[1] // 2)
    # Pad the spatial extent so border pixels have full neighborhoods.
    shape = (shape[0] + size[0] * 2, shape[1] + size[1] * 2)
    # The (2, 2)-tensor form of the convolution: c[i, j, u, v] is the
    # contribution of input pixel (u, v) to output pixel (i, j).
    c = torch.zeros((shape[0], shape[1], shape[0], shape[1]))
    for i in range(shape[0]):
        for j in range(shape[1]):
            for u in range(shape[0]):
                for v in range(shape[1]):
                    hrange = (u - size[0], u + size[0])
                    vrange = (v - size[1], v + size[1])
                    if hrange[0] <= i <= hrange[1] and vrange[0] <= j <= vrange[1]:
                        x = u - i + size[0]
                        y = v - j + size[1]
                        c[i, j, u, v] = conv[x, y]
    # Compose J (compression), c, and J_tilde (decompression) into Xi.
    # Index key: (a, b) = h'w', (c, d, e) = x'y'k', (f, g) = hw, (p, q, r) = xyk.
    xi = torch.einsum("abcde,fgab,pqrfg->pqrcde", J, c, J_tilde)
    return xi
Although this algorithm is simple, it comes with some notable disadvantages. First, it is slow. Iterating over the entire (2, 2)-tensor is time consuming even for a small image. Second, it is difficult to parallelize without domain knowledge of low-level programming. In other words, a CUDA kernel (or similar construct) would need to be produced to efficiently implement this algorithm. A better algorithm would be readily and efficiently programmed in a high level deep learning library like PyTorch [17].

Examine the tensor J̃ and note that

\tilde{J} \in X \otimes Y \otimes K \otimes H^* \otimes W^*    (7.11)

Recall that our model of single channel images uses I \in H^* \otimes W^*; therefore, the last two dimensions of J̃ are a single channel image and we can model J̃ as a batch of single channel images by reshaping it to fold X, Y, K into a single dimension N, giving

\tilde{J} \in N \otimes H^* \otimes W^*    (7.12)

We are then free to convolve J̃ with the kernel C (note that the definition of C has changed slightly: it is now the kernel itself):

\hat{C} = C * \tilde{J}    (7.13)

and then reshape Ĉ giving

\hat{C} \in X \otimes Y \otimes K \otimes H^* \otimes W^*    (7.14)

Note that the shapes of Ĉ and J̃ are the same; all we have done here is compose the convolution kernel C into the decompression operation J̃. Next, we compose Ĉ and J

\Xi^{xyk}_{x'y'k'} = \hat{C}^{xyk}_{hw} J^{hw}_{x'y'k'}    (7.15)

to compute Ξ.

Listing 7.2: Exploding Convolutions (Fast)

def explode_convolution(J_tilde: Tensor, J: Tensor, C: Tensor) -> Tensor:
    # Fold the block indices (X, Y, K) into a single batch dimension so that
    # J_tilde becomes a batch of single channel images of shape (N, 1, H, W).
    J_hat = J_tilde.flatten(0, 2).unsqueeze(1)
    # Convolve every slice of the decompression map with the kernel C.
    C_hat = torch.nn.functional.conv2d(J_hat, C.view(1, 1, *C.shape), padding="same")
    # Restore the (X, Y, K, H, W) layout: C is now composed into J_tilde.
    C_tilde = C_hat.view_as(J_tilde)
    # Compose with the compression map J, stored as (H, W, X', Y', K').
    # Index key: (p, q, r) = xyk, (f, g) = hw, (c, d, e) = x'y'k'.
    xi = torch.einsum("pqrfg,fgcde->pqrcde", C_tilde, J)
    return xi

This algorithm (Listing 7.2) is simple to code in machine learning libraries. Here, it takes up only a handful of lines and involves no loops. Furthermore, since this algorithm depends only on reshaping, convolution, and einsum, it can take advantage of the built-in optimizations that these libraries include, resulting from years of research into these algorithms [63], [64]. It is also worth noting that the autograd algorithms used by these libraries will work as expected for this algorithm, i.e., it is straightforward to optimize C with respect to some objective when Ξ is used to transform the input feature maps.

Extending this to batches of multi-channel images is straightforward. First, we define the convolution C as C : P^* \otimes H^* \otimes W^* \to P'^* \otimes H^* \otimes W^*, adding the input and output plane dimensions P, P' and noting that C lacks any batch dimension since the same operation is applied to each image in the batch. Next, we simply define Ξ as

\Xi^{pxyk}_{p'x'y'k'} = J^{h'w'}_{x'y'k'} C^{phw}_{p'h'w'} \tilde{J}^{xyk}_{hw}    (7.16)

where the J, J̃ tensors have not changed. This simply adds the plane dimensions P, P' to Ξ. This map is applied to transform coefficients F \in N^* \otimes P^* \otimes X^* \otimes Y^* \otimes K^* as

F'_{np'x'y'k'} = \Xi^{pxyk}_{p'x'y'k'} F_{npxyk}    (7.17)

where the batch dimension N is preserved. With the exception of some extra indices, this does not change the algorithm in Listing 7.2.
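As a concrete illustration of Equation 7.17, a precomputed Ξ can be applied to a batch of coefficient feature maps with a single einsum. The shapes below are hypothetical (a 16 × 16 image, i.e., a 2 × 2 grid of blocks with 64 frequencies each, 16 input planes, and 32 output planes).

import torch

# F has shape (N, P, X, Y, K); Xi has shape (P, X, Y, K, P', X', Y', K').
# The contraction is Equation 7.17 with the batch dimension preserved.
F = torch.randn(4, 16, 2, 2, 64)
Xi = torch.randn(16, 2, 2, 64, 32, 2, 2, 64)
out = torch.einsum("npxyk,pxykqabc->nqabc", F, Xi)  # (N, P', X', Y', K')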
7.2.2 Batch Normalization

Batch normalization [37] is a commonly used technique which ensures each layer receives normalized feature maps. For a single channel feature map I \in H^* \otimes W^*, batch normalization uses the sample mean E[I] and variance Var[I] along with learnable affine parameters γ and β. These parameters are then applied as

\mathrm{BN}(I) = \gamma \frac{I - \mathrm{E}[I]}{\sqrt{\mathrm{Var}[I]}} + \beta    (7.18)

The batch statistics are used to update running statistics which are applied at inference time instead of the sample statistics. This equation has a simple closed-form expression in the transform domain.

We start with the mean and variance. Recall from Section 3.1 (The Fourier Transform) the definition of the 2D Discrete Cosine Transform over N × N blocks

D(i, j) = \frac{1}{\sqrt{2N}} C(i) C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos\left(\frac{(2x+1)i\pi}{2N}\right) \cos\left(\frac{(2y+1)j\pi}{2N}\right)    (7.19)

C(k) = \begin{cases} \frac{1}{\sqrt{2}} & k = 0 \\ 1 & k \neq 0 \end{cases}    (7.20)

Let us compute an expression for the (0, 0) coefficient

D(0, 0) = \frac{1}{\sqrt{2N}} C(0) C(0) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos\left(\frac{(2x+1) \cdot 0 \cdot \pi}{2N}\right) \cos\left(\frac{(2y+1) \cdot 0 \cdot \pi}{2N}\right)    (7.21)
= \frac{1}{2\sqrt{2N}} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos(0) \cos(0)    (7.22)
= \frac{1}{2\sqrt{2N}} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y)    (7.23)

We further assume 8 × 8 blocks as used by JPEG

= \frac{1}{2\sqrt{2}\sqrt{8}} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.24)
= \frac{1}{8} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.25)

Since

\mathrm{E}[I] = \frac{1}{64} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.27)

we have

\mathrm{E}[I] = \frac{1}{8} D(0, 0)    (7.28)

yielding a simple expression for the sample mean of a block given DCT coefficients. Note that this is extremely efficient compared to computing the mean on the feature maps directly: it requires one read operation and one multiply operation per block versus 64 reads, 63 sums, and one multiply (if there are multiple coefficient blocks, as is common, their means will need to be combined). To compute the variance we use the following theorem.

Theorem 2 (The DCT Mean-Variance Theorem). Given a set of samples of a signal X such that E[X] = 0, let Y be the DCT coefficients of X. Then

\mathrm{Var}[X] = \mathrm{E}[Y^2]    (7.29)

Proof. Start by considering Var[X]; we write this as

\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2    (7.30)

We are given E[X] = 0, so we simplify this to

\mathrm{Var}[X] = \mathrm{E}[X^2]    (7.31)

Next, we use the DCT linear map D : M^* \otimes N^* \to A^* \otimes B^*, where the vector spaces M and N index the block dimensions and A, B index spatial frequencies. Then:

X_{mn} = D^{\alpha\beta}_{mn} Y_{\alpha\beta}    (7.32)

and

\mathrm{E}[X_{mn}^2] = \mathrm{E}[(D^{\alpha\beta}_{mn} Y_{\alpha\beta})^2]    (7.33)

Expanding the squared term gives

\mathrm{E}[X_{mn} X_{mn}] = \mathrm{E}[D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta}]    (7.34)

And expanding the expectation gives

\frac{1}{|M||N|} X_{mn} X_{mn} = \frac{1}{|A||B|} D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta}    (7.35)

Note that \frac{1}{|M||N|} = \frac{1}{|A||B|}, so we cancel, giving

X_{mn} X_{mn} = Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn}    (7.36)

Rearranging the right-hand side gives

X_{mn} X_{mn} = D^{\alpha\beta}_{mn} D^{\alpha\beta}_{mn} Y_{\alpha\beta} Y_{\alpha\beta}    (7.37)

Since the tensors D are defined with respect to a standard orthonormal basis, we can freely raise and lower their indices (their metric tensor is the identity). Lowering α, β and raising m, n on one of the D tensors gives:

X_{mn} X_{mn} = D^{\alpha\beta}_{mn} D_{\alpha\beta}^{mn} Y_{\alpha\beta} Y_{\alpha\beta}    (7.38)

Since D^{\alpha\beta}_{mn} D_{\alpha\beta}^{mn} = 1 we have

X_{mn} X_{mn} = Y_{\alpha\beta} Y_{\alpha\beta}    (7.39)
X_{mn}^2 = Y_{\alpha\beta}^2    (7.40)

Substituting gives

\mathrm{Var}[X] = \mathrm{E}[X^2] = \mathrm{E}[Y^2]    (7.41)

Therefore, it is sufficient to compute the mean of the squared DCT coefficients to get the variance of the underlying pixels. This is no faster or slower than the pixel domain algorithm.
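These identities are easy to check numerically. The sketch below uses SciPy's orthonormal 2D DCT-II, which matches the normalization assumed above for N = 8, on a random block; it is purely illustrative.

import numpy as np
from scipy.fft import dctn

# Verify E[I] = D(0, 0) / 8 and, after centering, Var[I] = E[D^2].
block = np.random.randn(8, 8)
coeffs = dctn(block, norm="ortho")
assert np.isclose(block.mean(), coeffs[0, 0] / 8)

centered = block - block.mean()
coeffs_c = dctn(centered, norm="ortho")
assert np.isclose(centered.var(), np.mean(coeffs_c ** 2))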
Next, we move on to the affine parameters γ and β. Applying γ is easy: since the transform we are using is linear, multiplying by a scalar can happen before or after the transform, i.e.,

J(\gamma I) = \gamma J(I)    (7.42)

so we can simply multiply the transform coefficients by γ. Applying β is also straightforward: since adding the scalar β would raise the mean by β, we can add β to only the (0, 0) coefficient. This yields a simple closed-form algorithm for computing batch normalization.

Listing 7.3: Transform Domain Batch Norm

def batch_norm(F: Tensor, gamma: float, beta: float) -> Tensor:
    # The (0, 0) coefficient is proportional to the block mean, so zeroing it
    # centers the underlying pixels.
    F[0, 0] = 0
    # With the mean removed, the pixel variance is the mean of the squared
    # coefficients (the DCT Mean-Variance Theorem).
    var = torch.mean(F ** 2)
    # Normalize and apply the learned scale gamma.
    F *= gamma / torch.sqrt(var)
    # Adding beta to the pixels raises the mean by beta, so it is applied to
    # the (0, 0) coefficient alone.
    F[0, 0] = beta
    return F

Note that the algorithm in Listing 7.3 assumes each sample is a single 8 × 8 block. If this is not the case, then the algorithm can be easily adjusted to compute a combined mean and variance over several blocks and multiple channels (depending on the batch norm implementation, it may also be necessary to apply Bessel's correction to the variance computation).

7.2.3 Global Average Pooling

Global average pooling reduces feature maps to a single scalar per channel. In other words, spatial information is averaged "globally". Given the discussion in the previous section, this is extremely simple to compute in the transform domain. As the (0, 0) coefficient is proportional to the mean of each block, we can simply read off these coefficients and scale them to produce the global average pooling vector (Figure 7.2). This is significantly faster than the pixel domain algorithm. Note that this is exactly the result that the pixel domain algorithm would have generated, so from this point forward we no longer need to worry about operations in the transform domain (i.e., the fully-connected layers do not need modification).

Figure 7.2: Illustration of transform domain global average pooling.

7.3 ReLU

Having defined the exact operations, we now turn to a missing and critical component of residual networks: ReLU [65], [66]. Note that we have dedicated an entire section to what is a relatively simple operation in the pixel domain. ReLU is defined as

R(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}    (7.43)

The previous section made use of mathematical properties of the JPEG transform in order to derive closed form solutions for transform domain operations. Since ReLU is necessarily non-linear, we will have no such luck with that approach. In fact, not only is ReLU non-linear, it is piecewise linear depending on the pixel domain value, information which we do not have access to in the transform domain. Instead, we will develop an approximation technique for ReLU that works in the transform domain and is tunable, giving an accuracy-speed trade-off.

We compute this approximation by partially decoding each block of coefficients. This is still fast since only a subset of coefficients are required, and since the result of the approximation is in the pixel domain we can freely compute ReLU on it. Recall the DCT Least Squares Approximation Theorem proven in Section 3.1 (The Fourier Transform).

Theorem 3 (The DCT Least Squares Approximation Theorem). Given a set of N samples of a signal X, let Y be the DCT coefficients of X. Then for 1 ≤ m ≤ N the approximation of X given by

p_m(t) = \sqrt{\frac{1}{N}}\, y_0 + \sqrt{\frac{2}{N}} \sum_{k=1}^{m} y_k \cos\left(\frac{k(2t+1)\pi}{2N}\right)    (7.44)

minimizes the least-squared error

e_m = \sum_{i=1}^{N} (p_m(i) - x_i)^2    (7.45)

Theorem 3 guides us in choosing the lowest m frequencies when we decode (rather than some arbitrary set) in order to constrain the error of the approximation. For a 2D DCT, we use all frequencies (i, j) such that i + j ≤ m, yielding 15 frequencies. The threshold m is freely tunable to the problem and we will examine its effect later. Although we now have a reasonable algorithm for computing ReLU from transform coefficients, we are left with two major problems.
The first is that although our approximation was motivated by a least-squares minimization, it is not guaranteed to reproduce any of the original samples. Since ReLU preserves positive samples (only zeroing negative samples) it would be nice if at least those were preserved. The second is that our network expects transform coefficients as input but the ReLU we have computed is in the spatial domain. It would be expensive to have to convert the result back to transform coefficients before continuing our computation. Consider for a moment the nature of our first problem. Suppose we have a sample with value 0.7. After taking the DCT and computing the least-squares approximation with a subset of coefficients, the value of this sample is changed to 0.5. We can observe that although the least-squares approximation is incorrect, it is still positive. In other words, the reconstruction has not changed the sign of the sample so it will not be zeroed by ReLU. The more coefficients we use the more likely it is that these reconstructions are sign-preserving8 since the high frequencies contribute less to the accuracy of the result (otherwise they would not be a least- squares minimization). In this sense we can observe that it is easier to preserve the sign than the exact pixel value. 8This is true for other piecewise function intervals as well. The technique described here is general. 118 Original True ReLU Naive ASM Figure 7.3: ReLU Approximation Example. Green pixels are negative, red pix- els are positive, blue pixels are exactly zero. The top-left shows the original image. The top-right is the true ReLU. The bottom-left shows a naive approximation using only the least squares approximation. Note that while negative pixels are zeroed, very few positive pixels have the correct value and there are mask errors resulting from the approximation. The bottom-right image shows the ASM technique. Note that while there are still mask errors, positive pixel values are preserved. Therefore, rather than compute ReLU on this approximation, we can instead compute a mask and apply that mask. We reformulate ReLU as follows R(x) = H(x)x (7.46) ??????1 x ? 0H(x) = ??? (7.47)0 x < 0 where H(x) is the Heaviside step function which we treat as a mask. If we compute H(pm) on the approximation pm, and multiply the result by the original samples x, we will have masked the negative samples while preserving the positive ones. We call this technique Approximated Spatial Masking (ASM). See Figure 7.3 for a visual example of this algorithm. The only problem left to solve is that our original samples are in the transform 119 domain and the mask is in the pixel domain. To simplify the following discussion, we consider only DCT blocks here (extending to the full transform is trivial). We can solve this using our multilinear model of the JPEG transform. Given transform coefficients F ? A? ? B?, a spatial domain mask G ? M? ? N?, and the masked result F ? ? A? ?B?, consider the steps such an algorithm would perform 1. Take the inverse DCT of F to give I ? M? ?N? 2. Pixelwise multiply the mask G and I to give I ? 3. Take the DCT of I ? to give the masked result F ? All of these steps are linear or bilinear I = D??mn mnF?? (7.48) I ?mn = GmnImn (7.49) F ? = Dmn ????? ????Imn (7.50) Substituting, we have F ????? = D mn ????GmnImn (7.51) = Dmn????GmnD ?? mnF?? (7.52) = G Dmn D??mn ???? mnF?? (7.53) And we make the following definition (after raising some indices to preserve dimen- 120 sions) ? [ ]F = G Dmn D??mn???? mn ???? F?? 
(7.54) ???mn mn ??mn???? = D????D (7.55) giving the bilinear map ? : M? ? N? ? A? ? B? ? A? ? B?. This map can be computed once and reused. We can use this map along with our approximate mask and original DCT coefficients to produce a highly accurate ReLU approximation with few coefficients. 7.4 Recap Before continuing to empirical concerns, we briefly recap the theoretical dis- cussion in the previous sections. Residual networks consist of four basic operations: Convolution, Batch Normalization, Global Average Pooling, and ReLU. In Section 7.2.1 (JPEG Domain Convolutions) we found that JPEG domain convolutions can be expressed as ?pxyk ? ? p?x?y?k? = J? h w Cphw xykx?y?k? p?h?w?Jhw (7.56) and in Listing 7.2 we developed a fast algorithm for computing this. In Section 7.2.2 (Batch Normalization) we developed a closed form solution 121 for JPEG domain batch normalization. We found that 1 E[I] = D(0, 0) (7.57) 8 Var[I] = E[D2] iff D(0, 0) = 0 (7.58) and that we can apply ? by adding it to D(0, 0) and we can apply ? as we would to a spatial domain input (by multiplying it by each coefficient). In Section 7.2.3 (Global Average Pooling) we found that global average pooling in the JPEG domain is as simple as computing 1D(0, 0) from each channel. We also 8 noted that since this is equivalent to the spatial domain mean, there is no need to derive the fully-connected layers. Finally, in Section 7.3 (ReLU) we developed an approximation technique for ReLU where we use a subset of coefficients to decode each block and compute and approximate H(x) on each block where H() is the Heaviside step function producing a mask Gmn. Then we apply this mask to the original coefficients F?? using F ????? = G ??mn mn????? F?? (7.59) This concluded our theoretical derivations. Model Conversion One important thing to note is that at no time did we stipu- late that the convolution weights or batch norm affine parameters need to be learned from scratch. Indeed, this method can take any such values, random or learned, and 122 Res Block 1: 16 Filters, No Res Block 2: 32 Filters, Downsampling Downsampling Input: T X 1 X 32 X 32 Output: (T X 16 X 32 X 32) Output: T X 32 X 16 X 16 Fully Connected: 64 to Res Block 3: 64 Filters, 10/100 Global Average Pooling Downsampling Output: T X 10/100 Output: T X 64 Output: T X 64 X 8 X 8 (single JPEG block) Figure 7.4: Toy Network Architecture. Note that by the final ResBlock, the image is reduced to 8? 8 which is a single block of coefficients. This simplifies the global average pooling layer. produce JPEG domain operations. Therefore, we can use the method to convert pre-trained models to operate in the JPEG domain. This idea has some powerful implications and we will examine it?s trade-offs in the empirical analysis. 7.5 Empirical Analysis We now turn out attention to an empirical evaluation of the algorithm. After all, the discussion in the previous sections was highly theoretical and altogether divorced from practical concerns. A natural question at this point is: ?How well does this actually work?? We will start by creating a toy network. This small network will be used in the experiments in this section to evaluate and benchmark the technique. This toy architecture consists of three residual blocks followed by global average pooling and a single fully connected layer. Although this is a simple architecture, it will more than suffice for our benchmarks of MNIST [40] and CIFAR 10/100 [67]. The 123 Table 7.1: Model Conversion Accuracies. 
Note that the deviation is small between the spatial domain and JPEG domain network.

Dataset     Spatial  JPEG   Deviation
MNIST       0.988    0.988  2.999e-06
CIFAR-10    0.725    0.725  9e-06
CIFAR-100   0.385    0.385  1e-06

The inputs will always be 32 × 32 images to ensure an even number of JPEG blocks (MNIST inputs are zero padded with two pixels on each side). We consider two versions of this network: one which processes images in the spatial domain (i.e., a traditional ResNet) and one to which we have applied the algorithm to allow it to process JPEG transform coefficients.

For those unconvinced by mathematics (or maybe suspicious of the ability to implement the math in PyTorch), we first examine whether our derivations were correct at all. This is straightforward: we simply use an exact ReLU, taking all 15 frequencies for the JPEGified version of the toy network. For more meaningful accuracies, the network is trained until convergence in the pixel domain and the weights are then converted. Since our other operations are supposed to be "exact", this should yield the same accuracy as a pixel domain network to within some small floating point error, which is confirmed by the result in Table 7.1.

Next we examine the accuracy of the ReLU approximation. Since this is not a true ReLU, we expect there to be some effect on overall network accuracy when fewer frequencies are used. However, it is still a non-linearity which should enable the network to learn effective mappings. We consider ReLU accuracy from three perspectives:

Absolute Error How accurate is our ASM approximation compared with a naive approximation?

Conversion Error If we convert pre-trained weights, how much does the number of frequencies affect the final accuracy result?

Training Error If we train a network from scratch using the ReLU approximation, how much does the number of frequencies affect the final accuracy result? (This assumes the same number of frequencies are used for training and inference.)

Figure 7.5: ReLU Approximation Accuracy. Left: RMSE error. Middle: Model accuracy after model conversion. Right: Model accuracy when re-training from scratch. Note that APX denotes the naive ReLU approximation. Dotted lines represent spatial domain accuracy.

We show results to this effect in Figure 7.5. The left graph shows the absolute error of the ReLU approximation. For this experiment, 10 million 8 × 8 blocks are generated by upsampling random 4 × 4 pixel blocks. We then measure RMSE between the true block and the approximated block. Note that compared to the naive approximation, the ASM method we developed has lower error throughout and the error drops faster. In the middle graph, we show model conversion error. We train 100 models from random weights in the pixel domain and then apply our algorithm to convert the weights, and measure the resulting classification accuracy. Again we see that the ASM method has better performance.

Figure 7.6: Throughput Comparison. We compare JPEG domain and spatial domain training and inference.

In the final graph, we train
networks from random weights using our JPEG domain algorithm. Interestingly, this performs significantly better than model conversion indicating that the weights have learned to adapt to the ReLU approximation. The final result we show is throughput. In general, the method developed here should be fast if for no other reason than the JPEG images do not need to be decompressed before being processed. In Figure 7.6 we compare throughput for training and testing in the JPEG domain vs in the spatial domain. As expected, inference is significantly faster in the JPEG domain. Curiously, however, training is only slightly faster. This is caused by the more complex update rule for autograd to compute through the ReLU approximation and the JPEG domain conversion for the convolutions. 7.6 Limitations and Future Directions The astute reader will have noticed by now a major limitation with this work: memory usage. Recall that compressed domain convolutions are formed by con- volving the kernel C, a dim(P ) ? dim(P ?) ? 3 ? 3 matrix, with the JPEG decom- pression tensor J? ? X ? Y ? K ? H? ? W ? and then applying the JPEG com- 126 Throughput (Images/Sec) pression tensor J ? X? ? Y ? ?K? ? H ?W . This yields a the type (3, 3) tensor ? ? P ? ?X ? Y ?K ? P ?X? ? Y ? ?K?. Observe the size of this tensor. For an image of size dim(H)? dim(W ) it is in O((dim(H)?dim(W ))2). In other words we have taken a small constant size weight and expanded it to be on the order of the image size squared. This is perhaps the primary direction for future work. The massive size of this tensor entirely prevents the method from being useful for anything beyond the toy network and small image datasets presented in the previous section. While a constant size kernel could be created using tiling (each convolution depends on at most the blocks one outside of the ?currently processed? block), this would still be significantly larger than the small kernel used by spatial domain networks. By restricting the convolution to a single block, an dim(P ?) ? dim(P ) ? 8 ? 8 ? 8 ? 8 kernel could be created with an approximate result which would significantly improve the situation. It is left to future work to determine the practicality of these ideas and what their effect on network accuracy is. Our ReLU formulation is currently an approximation. As we studied in the previous section, this approximation does impact the overall network accuracy even when retraining. It would be nice if an exact ReLU could be formulated to avoid this issue. It is currently unknown if this is possible. While on the topic of ReLU, software support for our method is currently quite lacking. In essence, many of our memory and speed savings come from the sparse nature of JPEG compressed data. Zero elements could take up no memory and contribute no operations to the compute graph, but this depends on adequate 127 software support for sparse operations which is currently missing from libraries like PyTorch. Specifically, support for sparse einsum would need to be added. This is perhaps the low-hanging fruit that would immediately reduce the memory footprint while further increasing the speed the algorithm. 128 Chapter 8: Improving JPEG Compression With a good understanding of JPEG compression and how it relates to deep learning, we turn to a survey of methods which improve JPEG compression. These methods are essentially specializations of image enhancement. So sister problems in this domain are super-resolution, denoising, deraining, etc. 
Notably, we will not be considering new deep learning based codecs, which are beyond the scope of this dissertation. These methods are, however, reviewed briefly in Appendix C. We focus on historical methods which made significant advancements in the understanding of JPEG artifact correction and present them roughly in publication order, although they are grouped into sections by their high level ideas.

Before discussing the deep learning techniques we first mention two classical methods for correction of JPEG artifacts. The first method uses a "pointwise shape-adaptive DCT" (SA-DCT) [68]. The SA-DCT can be thought of as a generalization of the block DCT used by JPEG to account for blocks of varying shape. Foi et al. model JPEG compression artifacts as Gaussian noise with zero mean and compute σ² using an empirically developed formula on the quantization matrix. For each point in the image, the technique computes a DCT kernel that best fits the underlying data (hence shape-adaptive). This filter is then used to estimate the Gaussian noise term for enhancement. The next method [69] uses a generalized lapped biorthogonal transform (GLBT) [70]. In this technique, the JPEG DCT coefficients are modeled as an intermediate output of the GLBT, and the remaining filters in the method are designed to remove blocking artifacts. Prior to deep learning, these techniques were the most successful at removing JPEG artifacts.

Warning This chapter is mostly a history lesson. Skip to the last section if you want a TL;DR.

8.1 Pixel Domain Techniques

We begin our discussion with the straightforward "pixel domain" techniques. These networks function as traditional convolutional networks: they take pixels as input and output either the corrected image or its residual.

The first such technique was ARCNN [71], later followed up by Fast ARCNN [72]. These networks follow a traditional encoder-decoder architecture and are based on the contemporary SRCNN [73]. ARCNN is tiny by modern standards, with four convolutional layers. The first is a 9 × 9 layer with 64 channels, next a 7 × 7 with 32 channels, then a 1 × 1 with 16 channels, and finally a 5 × 5 decoder with 1 channel (for grayscale only). The authors of ARCNN claim that each layer is designed for a specific purpose, but there is no deep supervision on the layers and they are trained end-to-end, so it is unlikely that they learn a particular task. Fast ARCNN changes this architecture to an "hourglass" shape, essentially a U-Net [36] without skip connections, which was common at the time. The architecture uses strided convolutions for the downsampling operations. Since the size of the feature maps is reduced, the architecture processes images faster, hence the name. This does reduce the overall reconstruction accuracy, however.

The L4/L8 networks [74] introduce two major new ideas to artifact correction. The first is the idea of residual learning, where the network is encouraged to learn only the difference between the input image and the true reconstruction. In other words, the reconstructed image Xr is expressed as

Xr = Xc + f(Xc)    (8.1)

for compressed image Xc and learned network f(·). The second contribution is that of an edge preserving loss. The authors rightly observe that prior networks, due to their regression-only losses, produce blurry edges. They solve this by using Sobel filters to compute the partial first derivatives of the reconstructed image and computing the loss on these filtered images, which focuses the network on edge reconstruction.
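The following is a minimal sketch of these two ideas, residual prediction (Equation 8.1) and a Sobel-based edge-preserving loss, not the authors' implementation; the tiny stand-in network and the loss weighting are placeholders.

```python
# Residual learning plus an edge-preserving loss on Sobel-filtered images.
import torch
import torch.nn.functional as F

def edge_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L2 loss between Sobel-filtered reconstructions (grayscale, NCHW)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=pred.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    def grads(x):
        return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)
    px, py = grads(pred)
    tx, ty = grads(target)
    return F.mse_loss(px, tx) + F.mse_loss(py, ty)

# f() predicts only the correction; a tiny stand-in network is used here.
f = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 1, 3, padding=1))

x_c = torch.rand(4, 1, 64, 64)       # compressed input
x_gt = torch.rand(4, 1, 64, 64)      # uncompressed target
x_r = x_c + f(x_c)                   # Equation 8.1: residual learning
loss = F.mse_loss(x_r, x_gt) + 0.1 * edge_loss(x_r, x_gt)
loss.backward()
```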
As expected, the L4/L8 architectures have four and eight layers respectively and otherwise do not differ significantly from ARCNN.

CAS-CNN [75] builds on the previous idea by employing a significantly more complex architecture. This architecture contains skip connections, not unlike a U-Net, and upgrades the traditional regression loss to use multiple scales. These scales are computed using deep supervision of the downsampled feature maps and make a fairly significant improvement to the overall accuracy. This is likely helped by the skip connections in the U-Net architecture.

We now jump to MWCNN [23], which represents a major change in architecture. MWCNN is a fascinating method for general image restoration which was applied directly to JPEG artifacts at the time of publication (along with other problems). The key idea is to replace the pooling layers in a traditional CNN with a discrete wavelet transform. Recall that a discrete wavelet transform computes band-pass filters which restrict each output to half the frequency range of the input. By the Nyquist sampling theorem, we can then discard half the samples without losing any information. MWCNN exploits this by using the DWT in place of a pooling operation, stacking the resulting four frequency sub-bands in the channel dimension without any significant loss of information. The original image can then be reconstructed by using the inverse wavelet transform on the feature maps after traditional convolutional layers. Otherwise the architecture resembles a U-Net. The use of this clever signal processing trick allows MWCNN to achieve remarkable results on a number of restoration tasks including JPEG artifact correction.

Honorable mention at this point goes to DPW-SDNet [76]. This could be considered a dual-domain method, although we take a somewhat stricter definition of domain, so instead we list it here with MWCNN. The main contribution of DPW-SDNet was to include two networks, one which processes the image in the pixel domain and another which processes it after a single level DWT.

Another method from 2018, S-Net [77], introduces a scalable network. This is based on the apt observation that more quantization requires a deeper network and "more work" to restore. Their architecture is, therefore, scalable based either on the amount of degradation applied to the image or on constraints on the compute budget of the hardware. This was an important contribution toward the practical use of artifact correction and remains an under-explored idea.

Two works by Galteri et al. [78], [79] introduce GANs to the problem of artifact correction. As we observed in the discussion of L4/L8 and CAS-CNN, regression losses produce a blurry result. This is both because of the CNN's inherent bias towards error minimization, something which is easiest to accomplish with a low-frequency reconstruction, and because of JPEG's tendency to destroy high frequency details in the first place. Although L4/L8 and CAS-CNN make progress on this problem with specialized losses, they have obvious limitations which Galteri et al. overcome with a GAN loss. This generates significantly more realistic reconstructions, although there is no attempt at an "accurate" reconstruction with good numerical results (which, in my opinion, is completely acceptable). The 2019 version of this work even includes a rudimentary attempt at a "universal" architecture which can operate independently of the quality setting, although it accomplishes this with an ensemble.

The final technique we discuss in this section is RDN [80].
This represents a departure from the more traditional U-Net style networks we have been discussing. Instead, RDN is based on ESRGAN [81] and its RRDB layers. These layers are an enhanced version of the traditional residual layer [9] with more residual connections. Just as these layers were a huge improvement for super-resolution, they are a huge improvement for artifact correction.

8.2 Dual-Domain Techniques

Dual-domain techniques are the result of an attempt to inject some low level JPEG data into the learning process. The high level idea is to process the input in both the spatial (pixel) domain and the frequency (DCT) domain. This is done with two separate networks whose results are fused. This way, if there is some information that either domain does not capture, it can potentially be exploited by the other domain. The technique was introduced with a sparse-coding method [82] that we will examine in the next section.

On the deep learning side, the idea is first addressed with DDCN [83]. The idea is very straightforward: there are two separate encoders, one for the pixel domain and one for the DCT domain. The output of both networks is processed by a third aggregation network which decodes to a residual that is added to the input image.

DMCNN [84] extends this idea in two ways. The first is with a multiscale loss on the pixel branch, as we saw in L4/L8. The next is with a DCT rectifier which constrains the magnitude of the DCT residual based on the possible values that the true coefficients could take. Recall the formula for quantization

Y′_ij = round( Y_ij / (Q_y)_ij )    (8.2)

shown here for the Y channel only. The approximated coefficient is then

Ŷ_ij = (Q_y)_ij · round( Y_ij / (Q_y)_ij )    (8.3)

Dividing by (Q_y)_ij gives

Ŷ_ij / (Q_y)_ij = round( Y_ij / (Q_y)_ij )    (8.4)

We can now expand this as an inequality, since the true value must lie within [−1/2, 1/2] of the rounding result:

Ŷ_ij / (Q_y)_ij − 1/2 ≤ Y_ij / (Q_y)_ij ≤ Ŷ_ij / (Q_y)_ij + 1/2    (8.5)

Multiplying by (Q_y)_ij yields our desired constraint on Y_ij:

Ŷ_ij − (Q_y)_ij / 2 ≤ Y_ij ≤ Ŷ_ij + (Q_y)_ij / 2    (8.6)

Since the artifact correction network is trying to compute Y_ij from Ŷ_ij, this constraint helps reduce the space of possible solutions.
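A minimal sketch of a rectifier in the spirit of Equation 8.6 is shown below: the network's coefficient prediction is clamped to the feasible interval implied by the dequantized values and the quantization matrix. This is an illustration only, not DMCNN's implementation, and the quantization matrix value is hypothetical.

```python
# Clamp restored DCT coefficients to the range allowed by Equation 8.6.
import torch

def dct_rectify(restored: torch.Tensor,
                dequantized: torch.Tensor,
                q: torch.Tensor) -> torch.Tensor:
    """restored, dequantized: (N, 1, H, W) DCT coefficients on an 8x8 block grid.
    q: (8, 8) quantization matrix, tiled to cover the coefficient grid."""
    q_full = q.repeat(restored.shape[-2] // 8, restored.shape[-1] // 8)
    lower = dequantized - q_full / 2
    upper = dequantized + q_full / 2
    return torch.clamp(restored, min=lower, max=upper)

q = torch.full((8, 8), 16.0)                 # hypothetical quantization matrix
y_hat = torch.randn(1, 1, 64, 64) * 16       # dequantized JPEG coefficients
y_pred = y_hat + torch.randn(1, 1, 64, 64)   # network output before rectification
y_rect = dct_rectify(y_pred, y_hat, q)
```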
The next major innovation in dual-domain methods is IDCN [85]. The major advantage of IDCN is that it is designed for color images and uses "variance maps" to account for the differences in statistics between the channels. Their dual-domain formulation is also of interest. They introduce a dual-domain layer which is "implicit": similar to our result in Section 6.2 (The Multilinear JPEG Representation), the DCT transform can be composed such that the DCT result and the pixel result are computed simultaneously.

Finally, Jin et al. [86] extend the dual-domain concept to process frequency bands in different paths. This is based on two observations: firstly, some artifacts are restricted to particular frequency bands, and secondly, as we have said many times, accurate high-frequency reconstructions are difficult. By separating out the frequency bands for separate processing, the network is able to focus on restoring those particular frequencies, as well as freeing up model capacity for artifacts which occur only in the frequency bands considered by each branch.

8.3 Sparse-Coding Methods

Sparse coding is a dictionary learning method. A series of representative examples are learned which (we hope) form an "over-complete" basis for our solution space (I do not believe that an "over-complete basis" is an actual concept in linear algebra; I assume the developers of this method are referring to a frame). Because the input is no longer uniquely determined by the basis, we also try to enforce sparsity such that the members of the basis are as sparse as possible. We do not cover sparse coding in more detail in this dissertation.

Sparse coding was introduced to artifact correction by Li et al. [82], where they also introduced dual-domain learning. The idea is straightforward: learn sparse codes in pixel space and DCT space and fuse the results. D3 [87] makes an interesting extension to Li et al.: they formulate the problem in a "feed forward" manner. In other words, sparse coding is used first on the DCT coefficients and then the result of that is fed into another sparse coding module in the pixel domain. Both stages are supervised with loss functions similar to neural networks.

The final sparse coding method we consider, DCSC [88], is pixel domain only. However, they incorporate a simple convolutional network into their architecture such that the sparse codes are computed on CNN features. This gives a sort of "best case" scenario where the powerful convolutional features can be exploited by the sparse coding method. As a bonus, their method uses a single model for all quality settings, although they do not train in the general case and only target qualities 10 and 20.

Table 8.1: Summary of JPEG Artifact Correction Methods. The methods are all listed with their technique (CNN or Sparse Coding) and whether they incorporate dual-domain information or not. This table is not exhaustive. Methods are sorted by year.

    Year  Method                      Citation  Technique      Dual Domain  Note
    2015  ARCNN                       [71]      CNN            No
          Data driven sparsity ...    [82]      Sparse Coding  Yes
    2016  L4/L8                       [74]      CNN            No
          DDCN                        [83]      CNN            Yes
          D3                          [87]      Sparse Coding  Yes
    2017  CAS-CNN                     [75]      CNN            No
          Deep Generative ...         [78]      CNN            No           GAN, Color
    2018  MWCNN                       [23]      CNN            No           Uses DWT instead of pooling
          DPW-SDNet                   [76]      CNN            No           Dual wavelet and pixel domain
          S-Net                       [77]      CNN            No           Scalable
          DMCNN                       [84]      CNN            Yes          DCT Rectifier
    2019  Deep Generative ...         [79]      CNN            No           GAN, Universal with ensemble, Color
          IDCN                        [85]      CNN            Yes          Implicit DCT Layer, Color
          DCSC                        [88]      Sparse Coding  No           Uses CNN Features
    2020  RDN                         [80]      CNN            No           Uses RRDB
          Dual stream multi path ...  [86]      CNN            Yes

8.4 Summary and Open Problems

We summarize all the methods discussed in this chapter in Table 8.1. There are some interesting things we can take away from this discussion. For example, it seems that dual-domain methods work well and they are continually revisited. Deeper networks have also naturally been successful, but the switch to RRDB layers by RDN was particularly interesting. More complex techniques, like wavelet based or sparse coding based methods, are underutilized and may be more complex than is needed given the advances of vanilla neural networks.

One noteworthy takeaway is that while there are pixel domain techniques and dual-domain techniques, there is not a single DCT domain only technique. A careful examination of ablation studies in the dual-domain papers explains this: their DCT branches do not perform well on their own. Somehow, the DCT branch captures new information that the pixel branch does not, but not enough to carry out restoration on its own. This is likely caused by the DCT being a set of coefficients for orthogonal basis functions rather than a single correlated signal like pixels. We will consider this an open problem as we move into the next section.
Also somewhat surprising is that although many methods recognized that JPEG artifact correction struggles to restore high frequencies with regression losses, only one author thought to use a GAN for correction. This is at least partially because of the community's incessant focus on benchmark results as the criterion for publication: GAN restoration does not perform well on the benchmarks. We consider this an open problem as well.

Another oddity: very few of the methods explicitly treat color images. This is odd on its own, but even more so when we consider that JPEG explicitly handles chrominance differently than luminance, compressing it more aggressively and downsampling it. Also, there is spatial correlation between luminance and chrominance which could and should be exploited in the reconstruction. Only the works of Galteri et al. and IDCN explicitly handle color data. This is another open problem.

Finally, and crucially, there are very few "universal" or "quality blind" techniques. In fact, the only ones discussed in this section were Galteri et al. [79] and DCSC [88]. All other networks in this section train a different network for each quality setting they consider, a practice which is not sustainable in real deployments. Although solutions to this problem have been cropping up at the time of writing [89]–[92], we again consider this to be an open problem. In the next chapter, we will develop a method that addresses these problems.

Chapter 9: Quantization Guided JPEG Artifact Correction

In the previous chapter we discussed several methods for using deep networks to improve JPEG compression. These techniques augment JPEG with significantly better rate-distortion while allowing users to still produce and share their familiar JPEG files (with the added benefit that users without special software can still view the files, albeit at lower quality). These networks, however, come with three major disadvantages that have so far made them purely academic successes.

First and foremost, these methods are so-called "quality aware" methods in which all training and testing data is compressed at a single quality level. This yields a single model per JPEG quality, which is undesirable for several reasons. Recall that quality is an integer in [0, 100], thus potentially requiring 101 different models to be trained. Although it is likely that the models may generalize to nearby qualities (we will examine this somewhat in Section 9.6.2 (Generalization)), at the very least this still requires the training and deployment of more than one model, something which is still considered expensive for most institutions. Furthermore, when these models are deployed, they will be given arbitrary JPEG files to correct, and the JFIF file format does not store quality, leaving a real system with no reliable method to choose a model. This problem could be solved with an auxiliary model that regresses from image to quality [79], but this still requires training and deploying an ensemble, and now an additional model to pick the quality.

The next, and perhaps more peculiar, problem with these methods is that they are grayscale only. In other words, these models only work on the luminance channel of the compressed images. While this does align well with human perception, humans can certainly perceive color degradations (see Figure 9.1 to perceive this yourself).
There is an implicit assumption that luminance models could be applied channel-wise to YCbCr or RGB images; however, we find that this does not hold well in practice, as we show in Section 9.6.1 (Comparison with Other Methods).

Lastly, these methods are hyper-focused on error metrics. While this has proven to be a reliable way to improve rate-distortion, it generally does not translate to improved perceptual quality, producing blurry edges and an overall lack of texture. To improve perceptual quality, more complex techniques are required.

In this chapter we develop a technique which addresses all three of these major problems. Our method leverages low-level JPEG information to condition a single network on quantization data stored in the JFIF file, allowing one network to achieve good results on a wide range of qualities. Our network treats color channels as first class citizens and takes concrete steps to correct them effectively, keeping in mind that JPEG compression treats color and luminance differently, applying more compression to the color channels. Finally, we develop texture restoring and GAN losses that are designed to produce a visually pleasing result, especially at low qualities. This method was published separately in the proceedings of the European Conference on Computer Vision [93].

Figure 9.1: Overview. The network first restores the Y channel, then the color channels, then applies GAN correction.

First Principles
• Conditioning the network on the quantization matrix allows it to correct at many different qualities using information available to a real system
• Explicitly modeling color degradation improves performance on color images
• Formulating DCT domain regression allows the network to leverage quantization data more effectively
• GAN loss functions for high frequency restoration

9.1 Overview

The method we develop in this chapter consists of several parts, all of which operate together to produce the final result. We will develop this method from the bottom up, starting with the individual building blocks and then describing how they are connected. At a high level, our network operates in several stages, illustrated in Figure 9.1.

Our network first corrects the luminance (Y) channel of the image. The luminance channel has less aggressive compression applied to it and serves as a base for further correction. Our network then moves on to correcting the color channels. As these channels are further compressed, they lack fine detail and structure that may have been present in the luminance channel, especially after correction. Therefore, we provide the corrected luminance channel along with the degraded color channels to the color correction network to give it additional information.

Throughout the network, we condition carefully selected layers on the JPEG quantization matrix. Recall that this 8 × 8 matrix describes how much rounding was applied to each DCT coefficient. Because this directly describes a phenomenon in the frequency domain, our entire network processes the DCT coefficients of the input only: no pixels are used, and the network produces DCT coefficients as output. This is in stark contrast to other methods, which use only pixels or both pixels and coefficients, and it depends on new developments in DCT domain networks. We use the methods described in Section 7.1 (New Architectures) to correctly process these data.
Before these methods were developed, DCT domain networks had objectively inferior performance to pixel and dual-domain networks.

Our training likewise proceeds in stages. After training the network to produce only luminance coefficients using regression, we then add the color network and train it again using regression. This way, the color network always gets a high quality luminance result to condition its own correction on. After the luminance and chrominance networks are trained, we then fine-tune the entire network using GAN and texture losses. This adds significant detail to the result while preventing it from quickly deviating and diverging.

9.2 Convolutional Filter Manifolds

One potential limitation of traditional convolutional networks is that they learn only a single mapping from input features (Fi) to output features, in other words

h(Fi) = φ(W ∗ Fi)    (9.1)

for non-linearity φ and learned weight W. While this is sufficient for many use cases, it can be limiting in others.

Specifically, in our case we would like to specialize the learned filters for different quantization matrices; in other words, the learned weight W should be a function of the quantization matrix Q. One simple way to do this is to tile Q to match the shape of Fi and concatenate the two:

h(Fi, Q) = φ(W ∗ [Fi  Q])    (9.2)

however, this yields only a linear mapping between Fi and Q, which limits the learned relationship between them.

Instead, we can use a filter manifold [94], sometimes called a kernel predictor [95]. The goal of the filter manifold is to predict a convolutional kernel given a scalar side input, i.e.,

h(Fi, s) = φ(W(s) ∗ Fi)    (9.3)

for s ∈ R, so now the weight W is a non-linear function of s. Kang et al. choose a small MLP for W:

h(Fi, s) = φ(W(s) ∗ Fi)    (9.4)
W(s) = φ(W2(φ(W1 s)))    (9.5)

allowing the network to learn a non-linear relationship between the side data and Fi along with the learned mapping between Fi and the network output.

However, our side input Q is not a scalar: it is an 8 × 8 matrix. Using an MLP for this input would be computationally expensive, so we propose a simple extension, termed convolutional filter manifolds (CFM), which replaces W(·) with a small convolutional network. We additionally learn a bias term along with the weight:

h(Fi, Q) = φ(W(Q) ∗ Fi + b(Q))    (9.6)
b(Q) = Wb ∗ Fq(Q)    (9.7)
W(Q) = Ww ∗ Fq(Q)    (9.8)
Fq(Q) = φ(W2 ∗ (φ(W1 ∗ Q)))    (9.9)

This formulation allows us to learn parameterized weights representing the complex relationship between the JPEG DCT features and the quantization matrix, and it can be thought of as generating a "quantization invariant" representation for the network to operate on. This is the primary contribution which allows the network to model degradations from many different quality levels. In Section 9.3 (Primitive Layers) we will describe primitive layers which make use of this formulation, and in Section 9.4 (Full Networks) we will describe where these layers are placed in the overall network structure in order to maximize their effectiveness. In Section 9.6.4 (Exploring Convolutional Filter Manifolds), we will explore some interesting properties of these layers.
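The following is a minimal sketch of a CFM-parameterized convolution in the spirit of Equations 9.6–9.9: a small convolutional network maps the 8 × 8 quantization matrix to the weight and bias of an 8 × 8 stride-8 convolution, which is then applied to the input DCT coefficients. The channel sizes, head layouts, and the assumption of a single quantization matrix shared by the whole batch are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFM(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.f_q = nn.Sequential(                      # F_q(Q), Equation 9.9
            nn.Conv2d(1, 64, 8), nn.PReLU(),           # 8x8 Q -> (1, 64, 1, 1)
            nn.Conv2d(64, 64, 1), nn.PReLU())
        # W(Q): transposed 8x8 convolution producing an (out*in, 8, 8) map.
        self.w_head = nn.ConvTranspose2d(64, out_ch * in_ch, 8)
        # b(Q): one bias value per output channel.
        self.b_head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (N, in_ch, H, W) DCT coefficients; q: (1, 1, 8, 8) quantization matrix.
        feat = self.f_q(q)                                          # (1, 64, 1, 1)
        w = self.w_head(feat).view(self.out_ch, self.in_ch, 8, 8)   # W(Q)
        b = self.b_head(feat).view(self.out_ch)                     # b(Q)
        return F.conv2d(x, w, b, stride=8)                          # Equation 9.6

cfm = CFM(in_ch=1, out_ch=32)
coeffs = torch.randn(4, 1, 256, 256)          # Y-channel DCT coefficients
q = torch.rand(1, 1, 8, 8)                    # normalized quantization matrix
block_repr = cfm(coeffs, q)                   # (4, 32, 32, 32) block representation
```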
9.3 Primitive Layers

The network we develop in this chapter depends on several "primitive layers", or basic operations, which we will use to build the network. In this section, we describe them in detail. The first is the Residual-in-Residual Dense Block (RRDB) layer [81], first developed for super-resolution. This layer consists of three "Dense Blocks" in a residual sequence. Each of these "Dense Blocks" consists of five convolution-ReLU layers with skip connections between each layer, forming an enhanced version of the standard residual block. See Figure 9.3 for a schematic depiction of this layer. We make only one change to the RRDB used in ESRGAN, replacing the Leaky ReLU [96] with Parametric ReLU [97].

In Section 7.1 (New Architectures) we discussed recent advances in convolutional networks that can take advantage of the unique characteristics of DCT coefficients. We employ both of these layers in our network. The first is frequency-component rearrangement (FCR), where the DCT coefficients for each block are arranged in the channel dimension, yielding 64 channels and 1/8th the width and height of the input. We take the additional step of using grouped convolutions with 64 groups to ensure that each frequency is processed in isolation. See Figure 7.1 for the frequency rearrangement and Figure 9.2 for an illustration of the grouped convolution. We insert these layers into the RRDB described above. This paradigm allows our network to focus on enhancing individual frequency bands more effectively.

Figure 9.2: FCR With Grouped Convolutions. Each frequency component is processed in isolation with its own convolution weights. We implement this using a grouped convolution with 64 groups.

However, many frequency bands are entirely zeroed out by the compression process. Completely relying on the grouped convolution would be destined for failure because if a frequency band is set to zero, no amount of convolutional layers can change its value (it will either remain zero or be set to the layer biases). Therefore, we need a layer which is also capable of looking at multiple frequency bands, and for this we choose the 8 × 8 stride-8 layer. This layer produces a representation of each DCT block by considering all the frequency bands in the block at once. Since the stride is set to 8, the representation does not include information from nearby blocks. Information from nearby blocks is incorporated by processing the block representations with RRDB layers.

Figure 9.3: RRDB Layer shown with input feature map Fi and output feature map Fo. Note that we change the original RRDB layer by adding PReLU layers.

Since these layers consider the DCT coefficients of the entire block, we take the additional step of using CFMs instead of regular convolutions to equip the layers with quantization information, thus generating the "quantization invariant" block representation. This is shown in Figure 9.4. For our applications, the input to the CFM is the 8 × 8 quantization matrix. This is processed with a convolutional network to produce the weight and bias. Note that the weight layer has in channels × out channels channels and is a transposed 8 × 8 convolution. The result is reshaped to an out channels × in channels × 8 × 8 convolution kernel.

Figure 9.4: 8 × 8 stride-8 CFM. Note that the numbers in parentheses denote the number of channels. The CFM layer computes a weight and bias from the quantization matrix using a small convolutional network. The result of this network is reshaped as either a weight or a bias.
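To make the frequency-component rearrangement concrete, the sketch below uses pixel_unshuffle to move each 8 × 8 block's coefficients into the channel dimension and then applies a grouped convolution so that every frequency band is processed in its own isolated path. The channel counts are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

coeffs = torch.randn(4, 1, 256, 256)           # Y-channel DCT coefficients

# FCR: (N, 1, H, W) -> (N, 64, H/8, W/8); channel k holds frequency (k // 8, k % 8).
fcr = F.pixel_unshuffle(coeffs, 8)             # (4, 64, 32, 32)

# Grouped convolutions: each of the 64 frequency bands gets its own filters.
net = nn.Sequential(
    nn.Conv2d(64, 256, 3, padding=1, groups=64), nn.PReLU(),
    nn.Conv2d(256, 64, 3, padding=1, groups=64))

# Inverse FCR restores the coefficient layout after per-frequency processing.
out = F.pixel_shuffle(net(fcr), 8)             # (4, 1, 256, 256)
```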
9.4 Full Networks

With the primitive layers defined, we now show how to build them into the networks and subnetworks our method uses for correction of JPEG artifacts. Recall that our method first corrects the grayscale channel and then uses that result to aid correction of the chroma channels. Therefore we start by describing the grayscale correction network, and after that we describe the color correction network.

The grayscale correction network, shown in Figure 9.5 left, consists of four subnetworks which work in series to produce the final correction: two blocknets, a frequencynet, and a fusion network, which we describe next.

The blocknet (Figure 9.6 left) uses the 8 × 8 stride-8 CFM layers described in the previous section. It computes block representations and then processes the representations with stacked RRDB layers before decoding the block representations with a transposed CFM layer. Between the two blocknets we place a frequencynet (Figure 9.6 middle). This uses the FCR grouped convolutions to enhance frequency bands in isolation. The frequencies are first rearranged before being processed with RRDB layers. The result is then rearranged to restore the frequencies to the spatial dimensions.

The intermediate results from all of the subnetworks are then passed to a fusion layer (Figure 9.6 right). The primary purpose of this is to strengthen the gradient received by the early layers, which would be prone to gradient vanishing otherwise [98].

The color correction network (Figure 9.5 right) borrows the main ideas from the blocknet in the grayscale correction network. We assume that inputs are 4:2:0 chroma subsampled, which means they must be upsampled by a factor of two in each dimension to match the grayscale resolution. We use the block representation of the color channels and a 4 × 4 stride-2 layer to do the upsampling. The result is concatenated channelwise with the block representation of the restored Y channel before being processed further and finally decoded. In both the grayscale and color networks, we treat network outputs as residuals which are added to the degraded input coefficients.

Figure 9.5: Restoration Networks. Left: Y-Channel Network. Right: Color Channel Network. Note the skip connections around each of the subnetworks in the Y-Channel Network, which promote gradient flow to these early layers.

Figure 9.6: Subnetworks. Left: BlockNet, Center: FrequencyNet, Right: Fusion.
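As a high-level sketch of how these pieces are wired together in the Y-channel path (blocknet, frequencynet, blocknet, fusion, with a residual output), the following uses placeholder subnetworks; it only illustrates the composition described above and is not the published architecture.

```python
import torch
import torch.nn as nn

class YRestorer(nn.Module):
    def __init__(self, block_a, freq_net, block_b, fusion):
        super().__init__()
        self.block_a, self.freq_net = block_a, freq_net
        self.block_b, self.fusion = block_b, fusion

    def forward(self, y_coeffs, q):
        a = self.block_a(y_coeffs, q)            # quantization-aware block stage
        f = self.freq_net(a)                     # per-frequency enhancement
        b = self.block_b(f, q)                   # second block stage
        residual = self.fusion(torch.cat([a, f, b], dim=1))
        return y_coeffs + residual               # residual added to input coefficients

class _Stub(nn.Module):
    """Placeholder subnetwork that preserves the coefficient shape."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 3, padding=1)
    def forward(self, x, q=None):
        return self.conv(x)

net = YRestorer(_Stub(), _Stub(), _Stub(), nn.Conv2d(3, 1, 3, padding=1))
restored = net(torch.randn(2, 1, 256, 256), torch.rand(1, 1, 8, 8))
```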
9.5 Loss Functions

A well documented problem with image-to-image translation is that of a blurry result. Intuitively, since the network is told to optimize the l1 or l2 distance between the input and output, the easiest way to accomplish its goal is to produce a sort of "averaging". The human perception of this averaging is a low-frequency image which lacks fine details. This is exacerbated by compression, which intentionally removes high frequency details. In this sense, a simple error based loss function is, in essence, asking the network to solve the wrong problem. What we really want the network to do is restore high frequencies.

Nevertheless, an error based loss is useful for correcting the hard block boundaries that JPEG creates, as well as for preventing divergence with more complex losses. Therefore, we pre-train the grayscale and color networks using l1 and Structural Similarity (SSIM) [99] losses to ensure that they start from a reasonable location when we fine-tune with the more interesting loss functions (this is a visual improvement on its own; however, it is nothing compared to the result from the GAN and texture losses). We denote this loss function as

LR(Xu, Xr) = ||Xu − Xr||1 − λ · SSIM(Xu, Xr)    (9.10)

for restored image Xr and uncompressed image Xu (i.e., a version of Xr which was never compressed), where λ is a balancing hyperparameter.

With the color and grayscale networks trained for regression, we now move on to GAN [39] and texture losses. GANs were originally introduced purely for generating realistic images. The algorithm pits a generator network against a discriminator network, where the generator's goal is to produce an image which is realistic enough to fool the discriminator, and the discriminator's goal is to discover which images were generated by the generator. In this way, the two networks are adversaries in a game, and by rewarding them for doing well, the generator learns to create more realistic images. For our purposes, we use a GAN to hallucinate plausible high frequency details, edges, and textures onto the compressed images.

For this we employ the relativistic average GAN loss [100]. This loss function tweaks the original GAN definition to encourage the generator to produce images which appear "more realistic than the average fake data" and is generally more stable than a vanilla GAN. For our purposes, we redefine "fake" as the restored image Xr and "real" as the uncompressed image Xu. We then define the loss as

LRA(Xu, Xr) = −log(L(Xu)) − log(1 − L(Xr))    (9.11)

L(x) = σ(D(x) − E_{xr ∼ Restored}[D(xr)])        if x is uncompressed
L(x) = σ(D(x) − E_{xu ∼ Uncompressed}[D(xu)])    if x is restored    (9.12)

for discriminator D(·) and sigmoid σ(·). We base the discriminator D(·) on DCGAN [101]; its architecture is shown in Figure 9.7. All convolutional layers use spectral normalization [102]. Note that we provide both the compressed as well as the uncompressed/restored version of the image, and discriminator decisions are made on a per-JPEG block basis.

Figure 9.7: GAN Discriminator. Note that the discriminator makes decisions for each JPEG block.
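A minimal sketch transcribing Equations 9.11–9.12 directly is given below. It omits the per-block decisions and the conditioning on the compressed input, and in practice (as in relativistic average GAN training generally) the discriminator and generator updates apply this term with the roles of real and fake handled appropriately.

```python
import torch

def ra_gan_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Equations 9.11-9.12 with d_real = D(x_u) on uncompressed images and
    d_fake = D(x_r) on restored images (raw discriminator logits)."""
    l_real = torch.sigmoid(d_real - d_fake.mean())      # L(x_u)
    l_fake = torch.sigmoid(d_fake - d_real.mean())      # L(x_r)
    return -(torch.log(l_real + 1e-8) + torch.log(1 - l_fake + 1e-8)).mean()

# Per-block discriminator outputs for a batch of 4 images with a 32x32 block grid.
d_real = torch.randn(4, 1, 32, 32)
d_fake = torch.randn(4, 1, 32, 32)
loss = ra_gan_loss(d_real, d_fake)
```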
While the GAN is a useful tool for generating realistic corrections, the general notion of real or fake only provides so much information. In practice, GAN losses for image-to-image translation are often coupled with "perceptual losses" [103]. More specifically, these losses use an ImageNet [59] trained VGG network [34]. The intuition is that this auxiliary network measures a semantic similarity between the input image and the desired target, since the network was trained for classification. By encouraging semantic similarity, a more realistic result can be achieved since the images appear to fall into the same class.

While this is useful for general image-to-image translation, we find an alternative approach is more useful for compression. Since compression destroys high frequency details, like textures, the more these details can be recovered or sufficiently hallucinated, the more realistic the reconstruction. Therefore, we use a VGG network trained on the MINC [104] dataset for material classification. The main idea here is that if a restored and an uncompressed image have similar logits for a material classification task, they would likely be classified as the same material and therefore have realistic textures. We denote this loss function as

Lt(Xu, Xr) = ||MINC5,3(Xu) − MINC5,3(Xr)||1    (9.13)

where MINC5,3 indicates layer 5, convolution 3 from the MINC trained VGG. This yields the complete GAN loss

LGAN(Xu, Xr) = Lt(Xu, Xr) + γ · LRA(Xu, Xr) + ν · ||Xu − Xr||1    (9.14)

for balancing hyperparameters γ, ν. Note that the l1 loss makes another appearance here to prevent the GAN from diverging.
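Putting the pieces of this section together, the following is a minimal sketch of Equations 9.10, 9.13, and 9.14. The SSIM implementation and the MINC-trained VGG feature extractor are treated as given; here an untrained torchvision VGG slice stands in for the MINC network purely so the sketch runs, and the hyperparameter values are those reported in the next section.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Stand-in feature extractor; the real method uses a MINC-trained VGG.
minc_features = vgg16(weights=None).features[:24].eval()

def l_regression(x_r, x_u, lam=0.05, ssim=None):
    """Equation 9.10: l1 minus weighted SSIM (ssim is an assumed callable)."""
    s = ssim(x_r, x_u) if ssim is not None else torch.tensor(0.0)
    return F.l1_loss(x_r, x_u) - lam * s

def l_texture(x_r, x_u):
    """Equation 9.13: l1 distance between texture-network features."""
    return F.l1_loss(minc_features(x_r), minc_features(x_u))

def l_gan_total(x_r, x_u, l_ra, gamma=5e-3, nu=1e-2):
    """Equation 9.14: texture loss + weighted relativistic GAN loss + l1 anchor."""
    return l_texture(x_r, x_u) + gamma * l_ra + nu * F.l1_loss(x_r, x_u)
```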
9.6 Empirical Evaluation

No artifact correction work is complete without an empirical evaluation, and with the algorithm now developed, we are in a position to perform one. For this evaluation we train the network using the Adam [105] optimizer with batches of 32 256 × 256 patches; the network is implemented in PyTorch [17]. All DCT coefficients are normalized using per-channel and per-frequency means and standard deviations. Quantization matrices are normalized to [0, 1] and use the "baseline" setting in libjpeg [44].

The training proceeds in stages, as described previously. First the Y channel network is trained using LR (Equation 9.10) for 400,000 batches with the learning rate starting at 10⁻³ and decaying by a factor of 2 every 100,000 batches. We set λ = 0.05. Then we freeze the Y channel weights and train the color network using LR (Equation 9.10) for 100,000 batches with the learning rate decaying from 10⁻³ to 10⁻⁶ using cosine annealing [106]. With the network fully trained for regression, we then fine-tune end-to-end using LGAN (Equation 9.14). The network is again trained for 100,000 iterations using cosine annealing, this time with the learning rate starting at 10⁻⁴ and ending at 10⁻⁶. We set γ = 5 × 10⁻³ and ν = 10⁻².

For training data we use DIV2k and Flickr2k [107], which contain 900 and 2650 images respectively. We pre-extract 30 256 × 256 patches from each image and compress them using quality in [10, 100] in steps of 10 for a total training set size of 1,065,000 patches. We evaluate the method using the Live1 [108], Classic-5 [68], and ICB [109] datasets. To be consistent with prior works, we report PSNR, PSNR-B, and SSIM as metrics for the regression network. For the GAN we report FID score [43].

Table 9.1: QGAC Quantitative Results. Format is PSNR (dB) ↑ / PSNR-B (dB) ↑ / SSIM ↑ with the best result shown in bold. This table is provided for those dedicated enough to read the small font. These numerical results are unimportant; what is important is the qualitative results that follow.

    Dataset  Quality  JPEG                   ARCNN [71]             MWCNN [23]             IDCN [85]              DMCNN [84]             QGAC (Ours)
    Live-1   10       25.60 / 23.53 / 0.755  26.66 / 26.54 / 0.792  27.21 / 27.02 / 0.805  27.62 / 27.32 / 0.816  27.18 / 27.03 / 0.810  27.65 / 27.40 / 0.819
             20       27.96 / 25.77 / 0.837  28.97 / 28.65 / 0.860  29.54 / 29.23 / 0.873  30.01 / 29.49 / 0.881  29.45 / 29.08 / 0.874  29.92 / 29.51 / 0.882
             30       29.25 / 27.10 / 0.872  30.29 / 29.97 / 0.891  30.82 / 30.45 / 0.901  -                      -                      31.21 / 30.71 / 0.908
    BSDS500  10       25.72 / 23.44 / 0.748  26.83 / 26.65 / 0.783  27.18 / 26.93 / 0.794  27.61 / 27.22 / 0.805  27.16 / 26.95 / 0.799  27.69 / 27.36 / 0.810
             20       28.01 / 25.57 / 0.833  29.00 / 28.53 / 0.853  29.45 / 28.96 / 0.866  29.90 / 29.20 / 0.873  29.35 / 28.84 / 0.866  29.89 / 29.29 / 0.876
             30       29.31 / 26.85 / 0.869  30.31 / 29.85 / 0.887  30.71 / 30.09 / 0.895  -                      -                      31.15 / 30.37 / 0.903
    ICB      10       29.31 / 28.07 / 0.749  30.06 / 30.38 / 0.744  30.76 / 31.21 / 0.779  31.71 / 32.02 / 0.809  30.85 / 31.31 / 0.796  32.11 / 32.47 / 0.815
             20       31.84 / 30.63 / 0.804  32.24 / 32.53 / 0.778  32.79 / 33.32 / 0.812  33.99 / 34.37 / 0.838  32.77 / 33.26 / 0.830  34.23 / 34.67 / 0.845
             30       33.02 / 31.87 / 0.830  33.31 / 33.72 / 0.807  34.11 / 34.69 / 0.845  -                      -                      35.20 / 35.67 / 0.860

9.6.1 Comparison with Other Methods

We start by comparing our method to others in the form of a large boring table in Table 9.1. Note that this uses the regression weights only. Of the compared methods, only IDCN has native handling of color information, and all of the compared methods are quality dependent, with a different model for each quality. Ours, in contrast, is only a single model.

9.6.2 Generalization

In the development of this method we place emphasis on a single network generalizing to multiple JPEG qualities. This raises an interesting question: can other models generalize to different qualities? In general the answer is "no", and we demonstrate this using IDCN [85] with an example image compressed at quality 50.

Figure 9.8: Quality Generalization. Note that both the IDCN quality 10 and 20 models appear to oversmooth the quality 50 JPEG.

Figure 9.9: Increase in PSNR. Shown for color datasets on all JPEG quality settings. Note the steep dropoff at high qualities.

Since IDCN provides only quality 10 and 20 models, we test both of those models on this image. The result is shown in Figure 9.8. The quality 10 model oversmoothes the image and appears worse than the JPEG it was supposed to correct. The quality 20 model looks better, but QGAC's single model looks the best, as it was able to adapt its weights to the quality 50 JPEG by processing the quantization data. As this experiment shows, it is important for prior works to select the correct model for the JPEG.

In fact, since our method is not restricted in quality, we can show how it generalizes in an even more compelling way: by testing on all JPEG quality settings. We show this in the graph in Figure 9.9. Note that for most quality settings the increase is fairly stable; only at quality 90 and above does a steep dropoff occur. For these qualities, however, the degradation is hardly noticeable and artifact correction is likely not necessary.
Figure 9.10: Equivalent Quality Plots. Top: space savings on average. Bottom: equivalent quality on average.

9.6.3 Equivalent Quality

One important application of artifact correction is to improve compression fidelity. In other words, rather than replacing the entire JPEG codec with a compression algorithm based on deep learning, we can simply use more aggressive JPEG settings and use artifact correction to make the result presentable. This is much more likely to succeed in the short term due to the technical debt surrounding JPEG.

We explore this phenomenon using "equivalent quality", i.e., given a JPEG which is compressed at some low quality and then corrected, what higher quality would we have had to compress the image at in order to match the restored error? And how much space did we save by using the smaller JPEG image?

We start with an example in Figure 9.11. Note that our model is equivalent to almost doubling the quality of the JPEG, allowing us to save a significant amount of space. Next we compute the equivalent quality over the entire Live1 dataset and plot it along with the space savings in kB. We do this for qualities 10-50. These plots are shown in Figure 9.10.

Figure 9.11: Equivalent Quality Examples. Taking three images compressed at random qualities, we correct them and then find the compression quality that matches the corrected result in terms of error (for example, a quality 30 input matches a quality 58 JPEG, saving 46.8kB, or 47.9%).

Figure 9.12: Embeddings for Different CFM Layers. Three channels are taken from each embedding; color shows the JPEG quality setting that produced the input quantization matrix. Circled points indicate quantization matrices that were seen during training.

9.6.4 Exploring Convolutional Filter Manifolds

Of the various proposals in this work, one of the most intriguing is the Convolutional Filter Manifold (CFM). In this section we explore its properties empirically. First we can visualize the CFM weights. We do this by pulling out three channels of one of the CFM layers. Since we can adapt these weights, we generate quantization matrices for qualities 10, 50, and 100 and produce CFM weights. These are shown as heatmaps in Figure 9.13.

Figure 9.13: CFM Weight Visualization. Horizontal axis shows different channels of the weight, vertical axis shows quality. Quality levels shown are Top: 10, Middle: 50, Bottom: 100. These are simply the heatmapped 8 × 8 kernels of the CFM layer.

We see some interesting behavior in this figure. The kernels in each row appear different; since these are different channels, we hope that they differ, as they should capture different information. In contrast, the kernels in each column appear to be similar but scaled versions of each other, with the magnitude decreasing as quality increases. This makes sense because high quality images should need less correction.

We can take this visualization further by finding the images which maximally activate each of these filters. This will tell us which patterns each filter responds to. We do this by taking a noise image and optimizing it to maximize the output of the filter. In other words, we treat the visualization as a parameter and use stochastic gradient ascent with the objective being the magnitude of the response of the filter we wish to visualize. This result is shown in Figure 9.14.
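A minimal sketch of this activation-maximization procedure is shown below. The layer here is an ordinary convolution standing in for a CFM-parameterized layer, and the step count and learning rate are arbitrary.

```python
# Gradient ascent on an input image to maximize one filter's response.
import torch

layer = torch.nn.Conv2d(1, 32, 8, stride=8)       # stand-in filter bank
channel = 3                                        # filter to visualize
img = torch.rand(1, 1, 64, 64, requires_grad=True)
opt = torch.optim.SGD([img], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    response = layer(img)[0, channel]              # this filter's activation map
    loss = -response.mean()                        # minimize negative = ascend
    loss.backward()
    opt.step()
```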
We again see some interesting behavior. Clear patterns of JPEG artifacts are visible in these images. As in Figure 9.13, the columns seem to capture different types of artifacts, with the first column capturing local block artifacts, the second capturing larger block artifacts, and the third capturing some ringing artifacts. As we descend each column, we see similar artifacts reducing in strength until the quality 100 filter, which leaves the images mostly unchanged.

Figure 9.14: Images Which Maximally Activate CFM Weights. Horizontal axis shows different channels from the weight, vertical axis shows quality. Quality levels shown are Top: 10, Middle: 50, Bottom: 100.

Finally, we can examine the manifold property of the CFM. We show this in Figure 9.12, where we have taken three kernels from three different CFM layers for all possible quantization matrices (qualities 0-100). We then compute a t-SNE embedding [110] to two dimensions and plot the kernels. What we see is a smooth manifold through the space of quantization matrices. By coloring each point by the quality level used to generate it, we can see that the kernels induce an order on the space. We can also see that each channel corresponds to a different manifold.

9.6.5 Frequency Domain Results

Next we analyze the constituent frequencies of compressed and restored images. One of the claims we made when developing the method is that GAN training produces more realistic high frequency reconstructions. Indeed, by examining Figure 9.15 we can see that, compared with the JPEG and regression reconstructions, the GAN result has significantly more activity in the high frequencies. We show this both with heatmaps of the Y channel coefficients and by plotting the "probability" with which a given frequency is non-zero on a bar chart, for four examples (for simplicity, we use the definition of frequency m such that i + j = m for a 2D frequency (i, j)). Examining the frequency chart, we can see that in real images even the highest frequency components have some probability of being non-zero. This probability is significantly reduced by JPEG compression, and the regression result does little to correct it. The GAN result, on the other hand, has high frequency responses that are significantly higher, at least as likely to be non-zero as in the original images. So in this sense at least, the GAN loss was successful.

Figure 9.15: Frequency Domain Results. Note how the GAN reconstruction generates significantly more high frequency content than the regression reconstruction. Also note how few high frequencies are in the compressed image. We show only one example here; please see more examples in Appendix B.

9.6.6 Qualitative Results

In this section we simply show some qualitative outputs of the model; these results are shown in Figure 9.16. Observe that the degraded images suffer from extreme banding caused by the quantization process. Our reconstructions are able to effectively mitigate this banding, along with other complex ringing and blocking artifacts. More qualitative results are given in Appendix B.

Figure 9.16: Qualitative Results. The compressed images at quality 10 are compared to our reconstructions and the originals.

9.7 Limitations and Future Directions

Although this work represents a major step forward in the usability of JPEG artifact correction methods, there are still some major problems to be solved. First and foremost is the double compression problem.
Because QGAC parameterizes itself only on the quantization matrix of the file it is correcting, it has no way of knowing if the image was recompressed. For example, a real-life company, which will not be named directly, has a complex image processing pipeline that decompresses and recompresses each JPEG it receives multiple times. Realizing that this would lead to significant degradation, this company recompresses its images at quality 100, mitigating most quality loss. However, QGAC will treat this as a quality 100 JPEG and perform essentially no restoration on it: it knows no better. In effect, the image processing pipeline has lied to QGAC about the nature of the compression. This was partially addressed by AGARNET [91], which allows a spatially varying "Q-map", essentially per-pixel quality, to be used as an auxiliary input; however, generating the Q-map is not straightforward.

Then there is the related problem: the JPEG degraded image may not be stored as a JPEG at all. It is fairly common to transcode JPEG files to PNGs, where they can be stored without further degradation. QGAC, of course, cannot operate on PNGs because they do not contain quantization information. This was addressed by FBCNN [92], which trains a network to predict the quality level from pixels (along with the restored output), thus implicitly parameterizing the network on quality and allowing it to take any kind of image as input.

There is also the longstanding problem of high frequency reconstructions. It seems that there are currently two paradigms in restoration: low-frequency but accurate reconstructions, and high-frequency but inaccurate reconstructions (e.g., GAN reconstructions which look nice but have little relationship to the ground truth image). A "holy grail" of reconstruction work would be to allow accurate reconstructions in the high frequencies. This is partially addressed later in the dissertation with the scale-space loss of Metabit.

Another important direction to consider for practical usage of artifact correction is runtime, memory, and hardware concerns. The end goal is to put these methods into the hands of users who may be on smartphones or laptops, but little attention has been paid to this so far. Current techniques often require datacenter machines with powerful, often multiple, GPUs in order to run in a timely manner (or at all). More attention to practical, efficient formulations and to quantized integer models or specialized hardware is important to the widespread dissemination of this technology.

Chapter 10: Task-Targeted Artifact Correction

Thus far we have considered artifact correction as a tool for presenting attractive images to a user. In other words, where a compressed image contains certain artifacts, we want to suppress those artifacts so that the user can view something closer to the uncompressed image. We noted that this was a difficult task to accomplish for some time because artifact correction methods were trained on a "per-quality" basis with a different model for each quality, and we proceeded to develop a method for correction of JPEG artifacts that is "quality-blind", i.e., only a single model is trained for all JPEG qualities. We now consider a slightly different question: what if the images are intended for machine consumption and not human consumption?
How does this change the problem, if at all, and how do machine learning algorithms respond to JPEG compression? In this contribution of the dissertation, we develop a flexible method of overcoming the accuracy loss caused by JPEG compression on common computer vision models. This includes both a study of how JPEG compression affects these models and an examination of different methods for mitigating the accuracy loss. This method was published separately in the MELEX workshop in the proceedings of the International Conference on Computer Vision [111].

The method presented in this chapter trains an artifact correction network to target a specific computer vision task. This has significant advantages over off-the-shelf techniques, which we examine in Section 10.4 (Transferability and Multiple Task Heads). Namely, the method is transferable between models. In other words, once trained to assist a particular model, it is general enough to assist other models. Similarly, it can be trained to assist multiple tasks simultaneously without a significant penalty on its effectiveness. We call this method Task-Targeted Artifact Correction (TTAC).

First Principles
• JPEG degrades task performance. Leveraging explicit JPEG correction can mitigate the problem
• Supervise the JPEG correction method using differences between the uncompressed and corrected images
• Task-trained correction networks are generalizable to many downstream tasks

10.1 Standard JPEG Compression Mitigation Techniques

Before moving on we briefly review other techniques which are commonly thought to mitigate JPEG artifacts.

Supervised Fine-Tuning/Data Augmentation The simplest possible scheme: JPEG compressed inputs are mixed in during training as a form of data augmentation. The goal here is to train the network to expect JPEG compressed inputs and map them correctly. While this idea works, often well, it has several disadvantages. The first is that it sacrifices accuracy on clean images, so the result of the network is no longer "at a theoretical maximum" because it has, in some sense, expended capacity modeling JPEG compressed inputs. Additionally, this method requires ground-truth labels, which can be expensive to obtain.

Off-the-Shelf Artifact Correction Another exceedingly simple method: simply apply an artifact correction network to JPEG compressed inputs. Since the artifact correction method reduces error with respect to the clean image, intuition states that this should help the performance of a downstream task (and indeed it does). Moreover, this technique could be employed practically with the development of QGAC, which does not require knowledge of the JPEG quality. This technique also has the advantage of not requiring any training at all, and indeed keeping the clean accuracy intact is a selling point of the method. However, this is an "all-or-nothing" approach in that there is no way to tune it when it does not work.

Stability Training This technique [112] is more interesting than the last two ideas and involves logit matching between the network output on clean and perturbed (in this case JPEG compressed) images. In this case, the stability loss is defined as

Lstability(x, x′) = ||f(x) − f(x′)||2    (10.1)

where f(x) is some neural network and x′ is the perturbed version of x. This objective is then minimized along with the primary task objective during training. While this technique does encourage robustness and is self-supervised, it inherits several drawbacks from the supervised method: the task network now needs to expend capacity to model the compressed mapping, and performance on clean images is sacrificed.
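A minimal sketch of the stability term in Equation 10.1 is given below; the toy network, the stand-in for JPEG compression, and the weighting against the task loss are all placeholders.

```python
import torch

def stability_loss(task_net, x_clean: torch.Tensor, x_perturbed: torch.Tensor) -> torch.Tensor:
    """Equation 10.1: l2 distance between logits on clean and perturbed inputs."""
    diff = task_net(x_clean) - task_net(x_perturbed)
    return diff.flatten(1).norm(p=2, dim=1).mean()

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
x_jpeg = (x * 0.9).clamp(0, 1)     # stand-in for a JPEG-compressed version of x
loss = stability_loss(net, x, x_jpeg)   # added to the primary task loss during training
```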
10.2 Artifact Correction for Computer Vision Tasks

The algorithm we propose in this chapter targets an artifact correction network to a particular task. In all cases we use QGAC from Chapter 9 (Quantization Guided JPEG Artifact Correction) as the artifact correction network. Starting from pre-trained weights, we fine-tune the artifact correction network using the logit error from the task network between clean inputs and compressed inputs. Formally, given a task t(·) and artifact correction network q(·), we minimize

Lθ(B) = ||t(B) − t(q(JPEGq(B); θ))||1    (10.2)

where JPEGq denotes JPEG compression at quality q. Note that the parameters θ that we optimize belong to q; the task network is unchanged during this process. See Figure 10.1 for a visual depiction of this process.

Figure 10.1: Task-Targeted Artifact Correction. The logit difference from the task network between clean and artifact-corrected versions of the same image is used to train the artifact correction network.

While the intuition behind this process is simple, there are several details that need to be accounted for. First, consider that we are not training the artifact correction network based on any decision by the task network, e.g., a classification or detection. Instead, we are matching the actual logit values. These are vectors of real numbers and are much finer grained than the actual decision, which may be binary. In effect, we are rewarding the artifact correction network for inducing the same perception of an input image in the task network. Note that since no hard decision is required for training, the method is entirely self-supervised: only the logit values, which are independent of any ground truth, are considered during the training process.

This differs from stability training in several key ways. First, it does not modify the task network, so performance on clean images is unchanged and the task network is free to expend its entire capacity learning the relationship between clean data and the output. Next, since the correction task is given to an auxiliary network, this network can be reused for other tasks. As we examine in Section 10.4 (Transferability and Multiple Task Heads), this works surprisingly well, allowing the artifact correction network to be trained using a lightweight task and reused for more complex tasks. To summarize, task-targeted artifact correction takes the advantages of all prior techniques with none of the disadvantages, and adds transferability as a bonus.
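A minimal sketch of one task-targeted fine-tuning step implementing Equation 10.2 is shown below: only the artifact correction network is updated, using the l1 distance between the frozen task network's logits on clean inputs and on corrected JPEG inputs. The callables jpeg_compress, task_net, and correction_net are placeholders for a real codec round-trip, the frozen downstream model, and QGAC.

```python
import torch
import torch.nn.functional as F

def ttac_step(correction_net, task_net, batch, quality, optimizer, jpeg_compress):
    task_net.eval()
    for p in task_net.parameters():
        p.requires_grad_(False)                    # task weights stay fixed

    degraded = jpeg_compress(batch, quality)       # JPEG_q(B)
    corrected = correction_net(degraded)           # q(JPEG_q(B); theta)

    with torch.no_grad():
        target_logits = task_net(batch)            # t(B) on clean inputs
    loss = F.l1_loss(task_net(corrected), target_logits)   # Equation 10.2

    optimizer.zero_grad()
    loss.backward()                                # gradients flow into theta only
    optimizer.step()
    return loss.item()
```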
Figure 10.2: Performance Loss Due to JPEG Compression, separated by task. Left: Classification, Middle: Detection, Right: Segmentation. The plots show all models from a single task with no mitigation applied. For segmentation tasks, the format of the model name is Encoder Model + Decoder Model, and "ds" indicates that the model was trained with deep supervision. Note that methods which use a Pyramid Pooling Module (PPM) decoder always use deep supervision.

Figure 10.3: Performance Loss with Mitigations. Circle: No Mitigation, Cross: Off-the-Shelf Artifact Correction, Diamond: Task-Targeted Artifact Correction, Square: Supervised Fine-Tuning. The models in this figure correspond to those shown in Figure 10.2.

10.3 Effect of JPEG Compression on Computer Vision Tasks

A legitimate question at this point is "how much does JPEG actually affect computer vision tasks?" We can answer this with a study, the conclusions of which are summarized in this section; the full results are relegated to Appendix A. For this study, we compressed images using qualities in [10, 90] in steps of 10 (we show only [10, 50] in this section as these are the most interesting results) using the test sets of the respective models we are evaluating. The input images are compressed, then restored, then transformed according to the requirements of the target model (e.g., cropping to 224 × 224).

Figure 10.4: Transfer Results. Left: ResNet-101 (classification), Middle: Faster R-CNN (detection), Right: HRNetV2 encoder with C1 decoder (semantic segmentation). In all plots, we add an evaluation using artifact correction weights that were trained on ResNet-18 and MobileNetV2, our lightest weight models. Note that the "Fine-Tuned" and "Task-Targeted Artifact Correction" methods are both trained using their respective task network directly, e.g., in the left plot they use a ResNet-101. Dashed lines indicate results shown in Section 10.3.

Figure 10.5: Multiple Task Heads. Left: ResNet-50 (classification), Middle: Faster R-CNN (detection), Right: HRNetV2 encoder with C1 decoder (semantic segmentation). In all plots, we add an evaluation using artifact correction weights that were trained using multiple task networks. For the two-task setup, we used ResNet-50 and Faster R-CNN. For the three-task setup, we used ResNet-50, Faster R-CNN, and HRNetV2 + C1. Note that HRNetV2 + C1 has no two-task multihead model. Dashed lines indicate results shown in Section 10.3.
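The degradation protocol used in the study can be sketched as follows. This is an illustrative reconstruction rather than the exact evaluation harness; the ImageNet-style transform chain shown here is an assumption for concreteness.

```python
import io
from PIL import Image
from torchvision import transforms

def jpeg_round_trip(img: Image.Image, quality: int) -> Image.Image:
    """Compress a PIL image to JPEG in memory at the given quality and decode it back."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Illustrative per-quality evaluation loop (dataset, correction_net, and model are assumed):
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# for quality in range(10, 100, 10):
#     for img, label in dataset:
#         degraded = jpeg_round_trip(img, quality)          # compress
#         x = eval_transform(degraded).unsqueeze(0)          # transform for the target model
#         x = correction_net(x)                              # optional mitigation step
#         prediction = model(x)
```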
We evaluate supervised fine-tuning, off-the-shelf artifact correction, and task-targeted artifact correction; we do not evaluate stability training. For methods requiring fine-tuning, we train for 200 epochs, varying the learning rate from 10^{-3} to 10^{-6} using cosine annealing [106]. We compare all mitigation methods to a baseline of "doing nothing", i.e., accepting JPEG inputs with no modification. We evaluate the following tasks, datasets, and models:

- Classification using ImageNet [59] with MobileNetV2 [113], ResNet 18, 50, and 101 [9], ResNeXt 50 and 101 [114], VGG-19 [34], InceptionV3 [115], and EfficientNet B3 [10]
- Detection using MS-COCO [116] with Fast R-CNN [117], Faster R-CNN [118], and RetinaNet [119]
- Instance Segmentation, again using MS-COCO, with Mask R-CNN [120]
- Semantic Segmentation using ADE-20k [121], [122] with encoder models MobileNetV2 [113], ResNet 18, 50, and 101 [9], and HRNet [123], and decoders C1 [124], PSPNet [121], and UPerNet [125]

In Figure 10.2 we see the results of these models for varying JPEG quality. All of the models face a steep penalty at the lowest quality settings, which gradually abates as quality increases. This finding is intuitive and confirms the need for JPEG mitigation techniques. We follow this with summary plots of the mitigation study in Figure 10.3. We can see some interesting behavior here, mainly that the different mitigations do not behave the same on different tasks. In particular, tasks that are very localization-heavy, like segmentation, do not benefit as much from supervised fine-tuning. These tasks are, however, greatly aided by task-targeted artifact correction.

10.4 Transferability and Multiple Task Heads

One of the most intriguing properties of task-targeted artifact correction is the potential for transferability. Since the original task network is not changed in any way, and only the artifact correction network is fine-tuned, there is no reason that we cannot use the outputs of the artifact correction network for other tasks entirely. This opens up a range of new potential deployment scenarios. For example, a TTAC model could be targeted to MobileNetV2 [113], which is fast and lightweight to train, and then used for a much heavier semantic segmentation network which would have been impossible to train without significant compute power.

Of course, this only works if the TTAC models can generalize. We examine this in Figure 10.4. For each plot, we take the supervised fine-tuning and task-targeted artifact correction results from Figure 10.3; these are shown as dashed lines. We compare these with task-targeted artifact correction networks trained with MobileNetV2 (green) and ResNet-18. As the plots show, the transfer works quite well, even to different tasks. In the right hand plot, for example, the new results are almost indistinguishable from the task-targeted network which was fine-tuned for segmentation, and they perform better than fine-tuning the segmentation network itself.

Figure 10.6: Mask R-CNN TIDE Plots. From left to right, the model was evaluated at quality 10, 50, and 100 with no mitigations. Note that the bulk of the errors are missed detections at low quality. As quality increases, more objects are detected but they are not localized correctly.

Figure 10.7: Mask R-CNN Qualitative Result. Input was compressed at quality 10. This compares the JPEG result to TTAC and the ground truth.
It is also worth noting that there is no reason a TTAC model needs to be trained with only one downstream task target; we can use as many downstream tasks as we have compute power for. We examine this in Figure 10.5. In these plots, we have added results from a TTAC model with two heads (classification and detection) and a TTAC model with three heads (classification, detection, and segmentation). Not only does this work perfectly well, but in many cases the additional model heads improved the generalizability of the TTAC model, leading to improved results.

10.5 Understanding Model Errors

So far we have looked at model error in a very aggregate view; that is, we have looked at overall accuracy and how it changes with increasing compression. In this section, using Mask R-CNN [120] as a representative example, we examine the errors made by the model in more detail.

We start by using TIDE [126] to compute a breakdown of the exact errors that the network makes with increasing compression. These plots are shown in Figure 10.6 with no mitigations applied. The trend here is interesting. At low quality, the bulk of the errors are caused by missed detections. However, as the quality increases and the missed detections decrease, the localization error actually increases because the newly detected objects are not properly localized. This is sensible because it suggests that once enough information is present in the image for objects to be identifiable, the "spread" induced by the missing high frequency basis functions causes the exact boundaries of the objects to be obscured.

We can view this qualitatively as well. Figure 10.7 shows the result for a JPEG compressed at quality 10 both with and without TTAC (as well as the ground truth). In the uncorrected model, we observe a significant number of missed detections as well as minor localization errors on the orange, although overall the orange is localized quite well given the significant blocking artifacts present on its boundaries. The TTAC output is also informative. Not only does the image appear significantly higher quality (keep in mind it was the same as the left image before artifact correction), but there are also far fewer missed detections. What remains are some localization errors, particularly on the bowl.

10.6 Limitations and Future Directions

There are two major limiting factors to TTAC in its current incarnation: speed and fidelity. Since TTAC requires placing an artifact correction network, specifically a QGAC network, before any task processing happens, it can severely limit performance. As long as this happens at the datacenter level the impact is likely minimal, but it is still a legitimate concern as GPU resources are highly valuable and are currently required for artifact correction networks. This could be addressed by more efficient formulations for artifact correction or by bespoke TTAC architectures that are intended to be lightweight.

Figure 10.8: Model Throughput. Throughput comparing TTAC FPS for training and inference against supervised fine-tuning. TTAC incurs a non-negligible throughput impact.

Next, although TTAC has some marked advantages over data augmentation techniques, it currently struggles to outperform them in all cases.
We expect that this can be addressed with deeper supervision on the task networks (i.e., matching more than just the final logits), although this is currently an open problem. Deep changes to how suitable training data is generated and scheduled may also be required before a clear numerical advantage is seen. Finally, the scope of TTAC is still somewhat restricted. Although JPEG artifacts are arguably the most important and prevalent type of degradation applied to images, we believe that TTAC could be an invaluable tool for general degradations. This could mean corruptions like noise, masking, rotations, etc., or something more mundane like resampling.

Part III: Video Compression

Chapter 11: Modeling Time Redundancy: MPEG

Having discussed image compression at length, we now move on to video compression. When considering uncompressed images, we modeled them as samples of a continuous 2D signal. We now allow those samples to vary over time to create a sort of "flip-book". Light intensity is captured in discrete steps in space to create "frames", and then multiple frames are captured in discrete steps in time to create the video. By sampling frames at a sufficiently high frame rate, there is an illusion of smooth motion.

Naturally this significantly increases the size of the representation. Since each frame in the video is the size of a single image, videos grow quickly with increasing framerate and duration. Since we classified images as large enough to warrant compression, videos certainly need to be compressed as well; in fact, timely transmission of even short videos would be impossible without compression.

In this chapter we cover, at a high level, the first principles of video compression. There are many different video "codecs", or algorithms for compressing videos. Although most of the concepts we discuss here apply to all modern codecs in some form, when we need specific details we will defer to MPEG and, specifically, the AVC standard [127]. Readers may be familiar with AVC by other names like H.264 or MPEG-4 Part 10. We standardize on the AVC terminology to easily differentiate it from HEVC/H.265 [128] and to align better with the naming of AOM codecs (like VP9, AV1, etc.). We focus on AVC because it is widely used [129] and many of its key ideas form the foundation for continuing codec development.

As we will see, the important insight which makes video compression possible is that we can exploit time redundancy in the signal and remove information across time. This is in addition to the spatial manipulations we used in JPEG, and the effect is synergistic: by exploiting temporal redundancy, we can remove additional spatial information which we would have needed to store if we only had a single image. The dependence on the temporal dimension creates the need for three frame types:

- Intra Frames or I-frames are frames that can be decoded without information from any other frame, i.e., there is no temporal dependency.
- Predicted Frames or P-frames are frames that require at least one previous frame to decode. We are said to predict the current frame based on the previous frame and any hints stored with the current frame.
- Bipredicted Frames or B-frames are frames that require at least one previous and one future frame to decode. These frames are beyond the scope of this dissertation.

These frames together are referred to as a Group of Pictures, i.e., an I-frame and its associated P- or B-frames form a group of pictures.

Figure 11.1: Motion JPEG Comparison. Left: Motion JPEG frame, Center: AVC frame, Right: Original frame. The Motion JPEG result is larger and of poorer quality than the AVC frame, motivating compression in the time dimension as well as the spatial dimension.

11.1 Motion JPEG

Before we begin the discussion of "true" video codecs, it is worth discussing an obvious solution: Motion JPEG. Motion JPEG can be thought of as a successor to MPEG (although historically MPEG-1 was technically standardized first, core MPEG-1 technology was based on work from the JPEG committee). The idea is incredibly simple: each frame is compressed separately as a JPEG and stored in a file along with some kind of frame rate specification. This information is all that is needed to decode and play the video. Note that although Motion JPEG enjoys widespread use because of its simplicity, there is actually no standard which defines it, and different software libraries will have different methods of specifying metadata.

As a quick example of this in action, we can take a raw 240-frame, 24 fps, 1080p video and try compressing it with Motion JPEG and with ffmpeg defaults for AVC. The original video in this case is 746,496,000 bytes (1920 × 1080 = 2,073,600 bytes for the luminance plane; 4:2:0 subsampling gives 2,073,600/4 + 2,073,600/4 = 1,036,800 bytes for the chrominance planes, so 3,110,400 bytes per frame times 240 frames = 746,496,000), or about 747MB. Pretty large for 10 seconds of 1080p video. The Motion JPEG file generated from this video is 12.7MB, an impressive 62x compression ratio. This can be thought of as a naive "limit" on how much compression is attainable without considering the temporal dimension. The AVC file, on the other hand, is only 7.2MB, a 103x compression ratio. Nor is this the end of the story: the AVC file is almost indistinguishable from the original frame, yet the Motion JPEG frames have significant blocking artifacts from compression (Figure 11.1). This example motivates our desire to study compression in the temporal dimension: the AVC frames are both higher quality and smaller (in file size) than the Motion JPEG frames.
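The arithmetic above can be checked with a few lines of Python. The file sizes are the measured values quoted in the text, not something this snippet computes.

```python
# Raw 4:2:0, 8-bit size of the 240-frame 1080p example above.
width, height, frames = 1920, 1080, 240
luma = width * height                        # 2,073,600 bytes per frame
chroma = 2 * (luma // 4)                     # two quarter-resolution chroma planes
raw_bytes = (luma + chroma) * frames         # 3,110,400 * 240 = 746,496,000 bytes
mjpeg_bytes, avc_bytes = 12.7e6, 7.2e6       # measured file sizes quoted in the text
print(raw_bytes, raw_bytes / mjpeg_bytes, raw_bytes / avc_bytes)
# Roughly the ~62x and ~103x ratios cited above (small differences come from MB rounding).
```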
11.2 Motion Vectors and Error Residuals

We will "measure" temporal information by modeling motion between neighboring frames. After modeling the motion, we can warp and subtract the frames, giving a "residual". The motion modeling is designed to be compact and simple while still capturing some complex motions. Since inter-frame motion is small at a high enough framerate (barring large cuts or scene changes), the frames should share a significant amount of information after accounting for motion. Any additional information is stored in the residual, which is generally low entropy.

Figure 11.2: Motion Vector Grid. Left: reference frame, Right: current frame. The grid is defined based on the frame that is currently being decoded. Each motion vector indicates where in the previous frame a block of pixels moved from. The numbers on the grid cells indicate the motion strength.

We call frames which are constructed from motion and residual information Predicted Frames (P-frames) since they must be predicted from a previous frame. We will call the frame which is being decoded the current frame and the previously decoded frame the reference frame. The MPEG standards also define a Bipredicted Frame (B-frame) which is predicted from a previous and a future frame. This takes advantage of the presentation timestamp and decoding timestamp features of video containers to store frames out-of-order and with a complex dependency graph. We consider further discussion of B-frame and multi-reference decoding to be beyond the scope of this dissertation.

Motion information is stored in the form of motion vectors. These vectors are computed by breaking the current frame into a regular grid and, for each grid cell, measuring where in the reference frame that grid cell moved from. For some cells there may be no motion and for others there may be large motions. For AVC, our example codec, the cells themselves can be 8 × 8, 16 × 8, 8 × 16, or 16 × 16 pixels. The blocks are stored as nine 16-bit integers (reference frame, width, height, reference x, reference y, current x, current y, motion strength, and flags), so this operation alone turns, at worst, 192-byte pixel blocks into 144-bit motions. In Figure 11.2, we show an example of the grid structure from a real video. Note that the grid is defined on the current frame, which is broken into a clean regular grid; on the reference frame these blocks may overlap. Some grid cells are missing from the current frame: for these blocks no motion was detected, so they are skipped. We can also visualize the vectors themselves as in Figure 11.3. For each block in the image, we draw a vector starting from the center of that block and terminating at the position in the reference frame that the block came from.

Figure 11.3: Motion Vector Arrows. Each vector is positioned in the center of the block; the arrow points to the position that block came from in the reference frame. Image credit: Big Buck Bunny [130].

The motion vectors are used in a process called motion compensation. This process simply copies blocks of pixels from their position in the reference frame to their position in the current frame, pasting over any content that was there previously. The resulting motion compensated frame represents a coarse warping of the reference frame to match the current frame. Of course there are still errors since the motion is only computed on blocks, so it is not usually a perfect representation of the current frame. See the left side of Figure 11.4 for an example of this. The vectors are computed using a process called motion estimation. We do not cover this process here as it is not standardized (the MPEG standards only define the decoder; it is up to the encoder to produce a standards-compliant bitstream however it wants).
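A sketch of the block-copy operation is shown below. The (destination, source, size) tuple format is a simplification of the nine-field record described above, and a real decoder additionally handles sub-pixel motion, block skipping, and frame boundaries.

```python
import numpy as np

def motion_compensate(reference: np.ndarray, motion_vectors) -> np.ndarray:
    """Coarse block-based motion compensation: paste each block of the reference
    frame into its new position in the current frame. Blocks with no motion vector
    simply keep the co-located reference content."""
    predicted = reference.copy()
    for dst_y, dst_x, src_y, src_x, h, w in motion_vectors:   # simplified vector format
        predicted[dst_y:dst_y + h, dst_x:dst_x + w] = \
            reference[src_y:src_y + h, src_x:src_x + w]
    return predicted

# The error residual is then whatever the motion model missed:
# residual = current.astype(np.int16) - motion_compensate(reference, mvs).astype(np.int16)
```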
In order to correct errors in the motion compensated frame, the encoder stores an error residual. One thing that is immediately noticeable in Figure 11.2 and Figure 11.3 is that not all blocks move; indeed, in both there are many blocks which are stationary. This means that the pixels in those blocks are exactly the same in the reference and current frames, and therefore if we subtract the two frames those blocks will be filled with zeros. Motion compensation takes this a step further, trying to match moving objects as closely as possible as well as stationary regions; this increases the likelihood of generating zero blocks. These zero blocks are extremely low entropy and aid in the compression process. To compute the residual, we first compute the motion compensated current frame from the reference frame and then subtract the true frame, yielding everything that was not accurately modeled by the motion estimation process. An example is shown on the right side of Figure 11.4. In effect, we have told the decoder to reuse information it already had about those blocks without needing to store them.

Figure 11.4: Motion Compensation and Error Residuals. Left: motion compensated frame, Right: error residual. Note the block artifacts in the motion compensated frame; corrections to these are stored in the error residual along with small edges. Note that the error residual is mostly zeros and therefore much easier to compress.

To produce the residual, we subtracted the motion compensated frame (Figure 11.4 left) from the true current frame (Figure 11.2 right). Small errors then accumulate on edges and in rapidly moving objects, which make up the bulk of the size of the compressed residual. Note also that the above discussion is appearance based. An object moving in the physical world may very well generate blocks in the frames that do not appear to move, like the center regions of the parachute in Figure 11.2. This information can still be freely used to fill in the current frame even though it is not an accurate reflection of the real-life parachute.

To summarize: the encoder stores per-block motion. This motion is then combined with a low entropy residual and a previous frame to produce the current frame. This yields direct savings, in that storing motion information is more compact than storing pixels, and indirect savings, because the error residual is much more compressible than the original pixels.

Figure 11.5: Rate Control Comparison (CBR at 5.4 Mbps, CQP 25, CRF 23). The three rate control methods are tested targeting the same file size. Note the different artifacts produced by each method despite similar file sizes. Also note that for this video, CQP 26 undershoots the target to produce a 6.6MB file while CQP 25 overshoots at 7.5MB. CBR and CRF are very close in file size.

11.3 Slices and Quantization

In addition to motion modeling, the AVC standard makes some notable changes to the way frames are structured. Similar to JPEG, AVC also allows the use of quantization for rate control, although this is exposed to the user in several different ways. Unlike the last section, these ideas are applicable to both P-frames and I-frames.

The biggest departure from JPEG is the concept of slices. A slice is a region of the frame made up of a whole number of macroblocks (usually 16 × 16 pixel blocks). Prediction (motion compensation) is only possible within a single slice. In the simplest case, the entire frame is one slice, but the general idea allows the encoder to break the image up into more meaningful groups. For example, the encoder might use two slices: a high detail slice with minimal quantization and a low detail slice with higher quantization. It may have high and low motion slices.

Figure 11.6: Slicing Example. The image has been broken up into three slices by the encoder: a background region, a high motion region, and a low motion region.
Even more intriguing is the concept of I- and P-slices. This means that a single frame can contain both intra information and predicted information, rather than the encoder making a blanket decision for the entire frame, which may be sub-optimal. For example, a highly detailed region with low motion may be better stored as an I-slice, while a high motion region may be more efficiently stored as a P-slice. An example of this is shown in Figure 11.6.

In JPEG, rate control was implemented by choosing a scalar quality in [0, 100] and mapping that scalar to a quantization matrix. For videos we have several options; these are compared visually in Figure 11.5. The most similar is Constant Quantization Parameter (CQP). This is essentially the same idea but on a scale of [0, 51], with 51 being the worst quality; this number is used to derive a quantization matrix for the coefficient blocks. This type of rate control is not generally used because applying the same amount of quantization to every frame is extremely simplistic and generally produces sub-optimal results in both size and perceptual quality.

Instead, the more common Constant Rate Factor (CRF) is used. This is also a number in [0, 51], with 0 being truly lossless encoding (no quantization) and 51 being the worst quality. This method is tuned to hold perceptual quality constant and takes into account inter-frame motion and frame rate. Still objects are quantized less aggressively and moving objects more so, following a similar argument to removing high spatial frequencies: fast moving objects are harder to perceive in detail.

The final rate control method, which is also quite common, is Constant Bitrate (CBR). In CBR encoding, the only thing the encoder tries to optimize is the bitrate of the video: it should be as close as possible to a specified target without going over it. This is useful for maxing out a connection with a known bandwidth, where it is desirable to get the maximum quality the connection can support without dropping frames. However, there is no way for the encoder to know a priori how to perform CBR encoding, so it will regularly over- or undershoot the target unless two-pass encoding is used.
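In practice, the three rate-control modes can be selected through ffmpeg and libx264. The snippet below shows typical flags as a sketch; the specific values are illustrative, and flag behavior should be checked against the ffmpeg documentation for a given version.

```python
import subprocess

def encode(src: str, dst: str, mode: str) -> None:
    """Encode `src` to `dst` with libx264 under one of the three rate-control modes."""
    flags = {
        "cqp": ["-qp", "25"],                 # constant quantization parameter
        "crf": ["-crf", "23"],                # constant rate factor (ffmpeg's default mode)
        "cbr": ["-b:v", "5.4M", "-maxrate", "5.4M", "-bufsize", "10.8M"],  # constant bitrate
    }[mode]
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", "libx264", *flags, dst], check=True)
```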
11.4 Recap

We have now covered the high level ideas by which video codecs compress temporal data. The encoder computes and stores coarse motion between frames and uses that motion to compute an error residual containing anything in the frames which cannot be modeled with motion from a previous frame. Internal to each frame, the encoder is free to slice the frame and make per-slice decisions. Rate control is accomplished with the help of a target bitrate, a user-defined CRF, or a user-defined quantization parameter.

The spatial domain compression is similar to JPEG. Pixels are transformed into the frequency domain using a 4 × 4 DCT, DST, or Hadamard transform (the encoder is free to choose this) and, depending on the rate control mechanism, a QP is computed or is given to the encoder. These QPs map directly to a standard set of quantization matrices for the blocks. Note that the QP is allowed to vary spatially, so different blocks can receive different amounts of compression, as opposed to JPEG where one quantization matrix is used for every block in the image.

This yields a surprisingly straightforward compression algorithm. Given a set of frames:

1. Partition the frames into GOPs. This is usually a fixed number of frames per GOP, but it can be based on the content.
2. Encode the first frame of each GOP as an intra frame by quantizing the transform coefficients of its pixels.
3. For each subsequent frame in the GOP, compute motion vectors from the previous frame, compute the motion compensated frame, and take the difference between the current frame and the motion compensated frame. Encode the predicted frame by storing the motion vectors and the quantized coefficients of the error residual.
4. Entropy code the frames.

The decoder simply needs to loop through each frame and either decode it directly if it is an I-frame, or warp the currently displayed frame using the motion vectors and add the error residual if it is a P-frame.

Unsurprisingly, an actual video codec is much more complex than this and is full of small details which make a large difference to the overall coding efficiency. As of this writing, the current revision of the AVC standard [131] was released in August 2021 and is 844 pages long. Aside from the core decompression algorithm, it includes instructions for storing the resulting data in a stream (not the same as an MP4 file, for example, though an MP4 file will contain AVC video streams), definitions of constants and other hard-coded mappings, algorithms for scalable streams, algorithms for compressing multi-view/depth/3D videos, and more. Although covering such details is beyond the scope of this dissertation, the high level intuition we developed in this chapter will be enough to guide us in the following chapters as we explore ways to improve video coding efficiency using deep learning.

Chapter 12: Improving Video Compression

Following the same model as Chapter 8 (Improving JPEG Compression), we now survey techniques for improving video compression using deep learning. Compared with JPEG compression, this chapter will seem quite short: video methods are still somewhat of a novelty in the deep learning literature as of the time of writing, and the subfield of video compression reduction is particularly nascent. One major difference between JPEG and video compression is the inclusion of the in-loop deblocking filter. Although some JPEG software does use rudimentary heuristic deblocking, all modern video codecs include reasonably effective deblocking filters. These filters are not perfect, and often there are still visible artifacts, but they do complicate the task for the network quite a bit.

The field can be divided into two disjoint sets. The first set of methods, single frame methods, consider only a single video frame at a time. They may act differently on different types of frames (e.g., I- or P-frames), but they never consider information from any previous or future frame. Multi-frame methods, in contrast, do consider several frames. While general restoration techniques have largely moved from sliding-window to recurrent methods thanks to the significant efficiency improvements of recurrent models, video compression reduction methods still use sliding windows.

Before reviewing these methods, we first discuss four important works in general video restoration. These methods are so influential that they have had an outsize effect on the entire restoration field and should be considered critical background knowledge. We will then proceed to study only video compression reduction methods.
Warning: Get ready for another, albeit shorter, history lesson.

12.1 Notable Methods for General Video Restoration

More general video restoration methods have a long history, particularly in super-resolution. For our purposes we will only examine works of outsize importance, such that they have highly influenced follow-up works in video compression reduction. These are all methods which combine information from multiple frames, a distinguishing feature of video restoration versus image restoration.

The first work we discuss is FRVSR [132]. FRVSR is notable for its highly efficient recurrent formulation. This came at a time when video restoration was either single frame or used sliding windows. Sliding windows should offer better performance but are significantly more resource intensive to compute. The lightweight recurrent formulation was a significant step towards the practical application of these techniques.

The next year, ToFlow [133] was published. A common idea in video restoration methods is that consecutive frames, or their features, must be aligned in order to make proper use of the extra information, and this is commonly done with optical flow. The key insight of ToFlow is that the optical flow can be learned using a network which is trained as part of the restoration process, customizing the optical flow to the task. Another major contribution of this paper is the Vimeo90k dataset, which is widely used in the video restoration literature.

In the same year came the next major innovation in video enhancement: EDVR [134]. The main contribution of EDVR was the replacement of explicit motion compensation via optical flow with implicit motion compensation using deformable convolutions. The deformable convolutions allow the feature extraction network to automatically capture information which is spatially offset across frames to account for motion. This should be a faster and more flexible method than optical flow; however, the 20M parameter model and seven frame sliding window were highly impractical.

Finally, we discuss the COMISR [135] method. COMISR is a recurrent super-resolution method with a specific focus on compressed video. Li et al. rightly observe that real videos are compressed and yet prior restoration work does not take this into account. Their method is trained and tested on compressed videos and includes a novel Laplacian loss which is designed to restore high frequency details.

12.2 Single Frame Methods

As early as 2017, single frame methods were presented for video compression reduction. These methods are quite simple and only account for information in the frame that is currently being restored. Unlike the general restoration methods presented in the previous section, there is no dependence on additional information from prior or future frames.

Table 12.1: Summary of Video Compression Reduction Techniques. Methods are listed in publication order with their use of multiple frames and their method for motion compensation.

| Year | Name     | Citation | Multi-frame | Motion Compensation | Note                             |
|------|----------|----------|-------------|---------------------|----------------------------------|
| 2017 | DCAD     | [136]    | No          | -                   |                                  |
| 2018 | QE-CNN   | [137]    | No          | -                   | Separate I- and P-frame networks |
| 2018 | MFQE     | [138]    | Yes         | Explicit            | PQFs (SVM)                       |
| 2020 | STDF     | [139]    | Yes         | Implicit            |                                  |
| 2021 | MFQE 2.0 | [140]    | Yes         | Explicit            | PQFs (BiLSTM)                    |
| 2021 | PTSQE    | [141]    | Yes         | Implicit            | 3D Convolution                   |
| 2021 | RFDA     | [142]    | Yes         | Implicit            | Cross-window recurrent           |

DCAD [136] proposed a simple method for restoring single frames of HEVC compressed video. The method bears a striking resemblance to ARCNN [73], using a stack of convolutional layers.
The main point of comparison is the built-in deblocking filter, over which they show an improvement. The QE-CNN [137] method, presented the next year, was designed to take compression into account explicitly. This is done using two networks: QE-CNN-I for I-frames and QE-CNN-P for P-frames. Interestingly, for P-frames the method applies both the -I network and the -P network, since HEVC encoding may contain intra- and inter-slices in one P-frame. Note that these networks still only consider a single frame at a time even though separate networks are used for the I- and P-frames; there is no shared hidden state or window.

12.3 Multi-Frame Methods

In 2018, MFQE [138], the first multi-frame video compression reduction network, was developed. In addition to a seven frame sliding window with optical flow alignment, this method introduced the concept of peak quality frames (PQFs). PQFs are frames which naturally have a higher quality than their surrounding frames and therefore have more information to extract. MFQE leverages these frames by extracting features from them separately and using those features to guide the restoration of the nearby non-PQFs. The PQFs are identified by manually labeling frames by PSNR and then training an SVM. At test time, the SVM identifies PQFs, then the PQF features are extracted, and finally the entire sequence is restored in a sliding window. The follow-up work, MFQE 2.0 [140], replaces the SVM with a Bi-LSTM [143], which is more accurate. In addition to these ideas, the MFQE paper also introduced the dataset which is used by all follow-up works.

STDF [139] is the next major advancement. This method takes the key idea from EDVR, deformable convolutions, and applies it to compression artifact reduction. The network consists of offset prediction, deformable feature extraction, and quality enhancement modules.

The following year, PTSQE [141], a patch-based method, was introduced. The key idea is to use separate networks for capturing the spatial and temporal information of a single patch and to use attention methods to fuse the two. PTSQE also incorporates the residual dense blocks from ESRGAN [81]. The implicit motion compensation is done using 3D convolutions instead of the traditional deformable convolutions.

Finally, RFDA [142], another recent model, uses a recursive fusion method to artificially increase the temporal window size. The method is built on STDF and uses STDF wholesale as a subnetwork. From the STDF output, the method produces a hidden state which is used in a downstream network to hold additional information from prior sliding windows. In this way, the STDF model essentially accesses additional temporal information. This is almost a recurrent method in its function, although the STDF component is still sliding window.

12.4 Summary and Open Problems

The methods discussed in this section are summarized in Table 12.1. At a high level, the summary is that multi-frame methods are preferred to single frame methods (because of their increased performance), and implicit motion compensation is preferred to explicit because it is faster to compute.

Given these high level ideas, there are some outstanding problems we can identify. First, it is interesting to note that implicit motion compensation performs as well as, if not better than, explicit (optical flow based) motion compensation. This implies that fine grained alignment may not be completely necessary for enhancement, or at least for video compression reduction.
This finding was at least partially confirmed by ToFlow, which showed that a task-guided flow is better than a "perfect" flow. On a related note, although QE-CNN was aware of the underlying frame-type metadata, there is otherwise a lack of use of bitstream metadata. This is surprising since the bitstream information often contains useful cues for how information was removed from the original frames. Additionally, coarse motion information is present in the metadata which, as we already discussed, may be good enough for feature alignment. In the next chapter, we develop a video compression reduction method which addresses these issues.

Chapter 13: Metabit: Leveraging Bitstream Metadata

Until this point we have reviewed the basic principles of video compression, including how to achieve compression over time by removing redundant motion information. We have also reviewed several methods for using deep learning to restore quality to compressed video frames. In order for video compression to function, i.e., in order to successfully decompress a compressed bitstream, we need additional information beyond simply the transform coefficients. This information, such as QP values, GOP structure, and motion vectors among others, gives a very strong prior for how the encoder has compressed the video stream and what information has been removed that should be restored. We now turn our attention to developing a deep learning method which exploits this data to improve its reconstruction. This contribution of the dissertation is currently under submission for separate publication and is available as a pre-print [144].

If we closely examine the direction of prior works, there are some whispers of this idea. For example, MFQE [138] contributed the idea of "peak quality frames", which were high quality frames that could be used to restore nearby (in time) low quality frames. STDF [139] does away with expensive motion compensation, relying instead on deformable convolutions.

However, both of these methods leave something to be desired, specifically by relying on outside computation for what is already stored by the encoder. While the concept of peak quality frames seems somewhat abstract (after all, how can we predict the existence of such frames?), their existence is grounded in first principles: these are I-frames. The encoder inserts them intentionally to create frames with high information content, which improves the decoding fidelity. Recall that MFQE 1.0 scans the entire sequence to determine peak quality frames using an SVM and MFQE 2.0 [140] does the same using a Bi-LSTM [143]. These are computationally expensive algorithms which are essentially computing the I- and P-frame structure of the GOP, something which we can readily extract from a bitstream with no additional computation. The MFQE family of networks also relies on optical flow to align nearby frames. While there are many methods for computing optical flow, they all vary in their speed and accuracy, and perfectly accurate optical flow may not be necessary in the first place [133]. The major contribution of STDF was to move away from explicit motion estimation by using deformable convolutions to learn an implicit motion estimation. This is desirable because it reduces the computational burden of the algorithm: the deformable convolutions model both motion and mapping simultaneously. However, we can do better than both explicit and implicit motion estimation; we can do no motion estimation at all.
Of course, we still wish to align nearby frames, and for this we can extract motion vectors from the bitstream. This gives a coarse motion compensation which we show is not only good enough for accurate reconstruction but, taken with our other contributions, outperforms both MFQE and STDF as well as their later follow-up works.

The common theme among the contributions of our method is that we remove things which were computed explicitly by prior algorithms and replace them with things that are computed by the encoder and stored in the video. We view these computations as redundant. By reducing these redundant computations, we are left with extra compute time per frame that we can re-invest in additional model parameters, leading to an improved result. We take the additional step of moving away from the sliding window paradigm, where a block of seven frames produces a single frame output, and instead use a block based approach where all seven frames are produced in a single forward pass. The result of these efficiency improvements is a network which has almost twice the parameters of STDF and yet runs as fast or faster depending on input resolution. It also outperforms STDF by a wide margin for many compression settings.

First Principles

- Architecture captures GOP structure
- Explicit I- and P-frame representations based on expected information content
- Alignment using motion vectors
- High frequency restoration using targeted loss functions

Figure 13.1: Capturing GOP Structure. The GOP representation is computed from wide I-frame feature extractors and narrow P-frame feature extractors, concatenated channelwise.

13.1 Capturing GOP Structure

One of the primary contributions of this work is the way in which our network takes GOP structure into account. Recall from Chapter 11 (Modeling Time Redundancy: MPEG) that (in the MPEG standards) frames can be either I-, P-, or B-frames, where I-frames are "intra frames" which can be reconstructed using only information in the frame itself, P-frames are "predicted frames" which require some previous frame to reconstruct, and B-frames are "bipredicted frames" which require a previous and a future frame to reconstruct. The goal of using these different frame types is to rely more on information which is stored in other frames and would be redundant to store again. These frames are organized into a group-of-pictures (GOP), which is an I-frame and its associated P-/B-frames. Without loss of generality, we only consider P-frames in the following discussion.

Since the predicted frames intentionally do not store information which is stored in previous frames, we can observe that they contain less information and, due to prediction errors, generally have lower perceptual quality than their associated I-frame. When other models process video frames in a sliding window, they do not take this into account in any meaningful way, and so the same network which processes I-frames is used to process P-frames.

We can view this as wasting compute resources. Since the bulk of the information is stored in the I-frame, we can process it with a wide representation. We can then use a narrower, and therefore faster to compute, representation for the P-frames to extract the additional information each P-frame contains. This is shown in Figure 13.1. Note that it is important to match the depth of the extraction networks so that the receptive fields are aligned. We view the resulting GOP representation as capturing the available information in the entire sequence and use it to reconstruct each frame in the GOP after warping. This is a major gain in efficiency since the faster network is used for most frames in each sequence. Further, we expend significant resources reconstructing the I-frame, which was already higher quality as it contains the most information. We then use this restored I-frame as a base to compute the restored P-frames, again using a lighter network. In other words, the GOP structure is encoded into our reconstruction algorithm at all stages.
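The split representation can be sketched in PyTorch as follows. The single-convolution extractors stand in for the deeper LR-block stacks of the real model, and the 64/16 channel widths follow the description given later in Section 13.3 (Network Architecture); the P-frame inputs are assumed to already be warped to the I-frame.

```python
import torch
import torch.nn as nn

class GOPFeatures(nn.Module):
    """Sketch of the GOP representation: a wide extractor for the I-frame and a
    shared narrow extractor for the P-frames, concatenated channel-wise
    (64 + 6 * 16 = 160 channels for a 7-frame GOP)."""
    def __init__(self, i_channels=64, p_channels=16):
        super().__init__()
        self.i_extractor = nn.Sequential(nn.Conv2d(3, i_channels, 3, padding=1), nn.LeakyReLU(0.2))
        self.p_extractor = nn.Sequential(nn.Conv2d(3, p_channels, 3, padding=1), nn.LeakyReLU(0.2))

    def forward(self, i_frame, p_frames):
        feats = [self.i_extractor(i_frame)]
        feats += [self.p_extractor(p) for p in p_frames]   # P-frames assumed pre-warped
        return torch.cat(feats, dim=1)                     # the GOP representation
```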
Figure 13.2: Motion Vector Alignment. P-frames are warped backwards to the I-frame during feature extraction. The I-frame is warped forwards to align to the P-frames during frame generation.

13.2 Motion Vector Alignment

In multiframe restoration problems, it is extremely common to align nearby frames, or features extracted from nearby frames, to ensure that the various scene details overlap (see ToFlow [133] and EDVR [134] among others). Conceptually, this should make the restoration task easier for the network since the additional information of nearby frames is in the correct location, ready to be exploited for additional reconstruction accuracy. The removal of video compression defects is no different, and as discussed in the opening of the chapter, this is generally accomplished explicitly with optical flow, as in MFQE and related networks [138], [140], with STDF using deformable convolutions for an implicit alignment [139].

Figure 13.3: Motion Vectors vs Optical Flow. The motion vectors resemble a coarse or downsampled version of the optical flow. Optical flow was computed with RAFT [145].

While it is useful to compute high quality alignments between frames, it may not be necessary (this is discussed at length in ToFlow). Assuming that the constraint of high quality alignment can be relaxed, we have a convenient tool at our disposal: motion vectors, which are compared with optical flow in Figure 13.3. The motion vectors relate nearby frames at the block level; for most resolutions the blocks are fine enough and the motion accurate enough that warping frames using motion vectors instead of optical flow works well (where "well" is measured in terms of reconstruction accuracy). The major advantage of using motion vectors is that they require no computation to produce, since they are stored in the video bitstream. Compared with optical flow, they require no more computation to apply.

We use the motion vectors to align each P-frame to the I-frame during feature extraction, and then to align the restored I-frame to each P-frame during frame generation. This is illustrated in Figure 13.2. Since motion vectors measure motion from the previous frame, we must reverse and warp each frame in sequence, e.g., frame 3 is warped by frame 2's motion vectors and the result is warped by frame 1's motion vectors, and so on. During frame generation we carry out the inverse process by warping the restored I-frame by each of the P-frames' motion vectors in sequence.
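Warping by motion vectors can use the same machinery as optical-flow warping. The minimal sketch below assumes the block motion vectors have already been broadcast to a dense per-pixel offset field mv_flow (in pixels); the reversal and chaining described above would be applied around this primitive.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, mv_flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) by a dense offset field (B, 2, H, W),
    where each offset points from a current-frame pixel to its source location in
    the reference frame (the convention of the bitstream motion vectors)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + mv_flow      # absolute sample positions
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0                            # normalize for grid_sample
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)
```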
Figure 13.4: MetaBit System Overview. I-frames are shown in blue and P-frames in pink. Our network takes an input (orange) in the form of a low-quality Group-of-Pictures and first performs multi-frame correction on the I-frame. The resulting high-quality I-frame is used to guide correction of the low-quality P-frames. The final output of our network (yellow) is the entire high-quality Group-of-Pictures.

Figure 13.5: LR Block. The Lightweight Restoration block (two 3 × 3 × N convolutions with LeakyReLU, followed by channel attention, with a residual connection from input to output) modifies the residual block to follow recent best practices in deep learning and image-to-image translation.

13.3 Network Architecture

We are now in a position to develop a complete network architecture using the ideas in the previous two sections. Although we develop a concrete architecture in this section, the high level idea of leveraging specific bitstream metadata can be applied to many different architectures.

First we need a basic block to build the rest of our network with. We would like to base this on residual blocks (Section 5.5 (Residual Networks)), which are known to be effective at many tasks; however, residual blocks by themselves do not follow best practices for image-to-image problems. Conversely, the RRDB layer [81] works well but is computationally inefficient. For videos, we require something which is lightweight and effective. We therefore make the following modifications (Figure 13.5) to the residual block, which we call a lightweight restoration (LR) block. First, we remove batch normalization [37], which is known to perform poorly in image-to-image translation scenarios [81]. Next, we replace the ReLU layers with LeakyReLU. Finally, we add channel attention [146], [147], following recent best practices in deep learning methodologies.

To these residual blocks, we add our accounting for GOP structure and our motion vector alignment blocks. An overview of the Metabit system is shown in Figure 13.4. The network takes a 7-frame GOP with no B-frames as input and is divided into several stages. In the first stage, the I- and P-frame representations are computed using separate feature extractors. As discussed in Section 13.1 (Capturing GOP Structure), the I-frame representation is 64 dimensional while each P-frame representation is 16 dimensional. Each of the P-frames is warped using motion vectors as in Section 13.2 (Motion Vector Alignment) to align them to the I-frame. Given a 7-frame GOP, this gives a final representation of 160 dimensions. This representation is then used as input to the I-frame generation network, which produces the high quality I-frame. This high quality I-frame is then warped 6 times to generate 6 copies, each aligned to an individual P-frame. Then, each aligned I-frame is concatenated with its low quality P-frame, and the P-frame generation network generates the 6 high quality P-frames. This gives the final output: the high quality GOP consisting of 7 frames.
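A sketch of the LR block in PyTorch follows. The squeeze-and-excitation style attention shown is one common formulation of channel attention and is an assumption here; the passage above does not pin down the exact variant.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (one common formulation)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.net(x)          # per-channel reweighting

class LRBlock(nn.Module):
    """Lightweight Restoration block (Figure 13.5): a residual block with no batch
    norm, LeakyReLU activations, and channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            ChannelAttention(channels),
        )

    def forward(self, x):
        return x + self.body(x)         # residual connection from input to output
```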
Note that this is quite different from the sliding window or even recurrent approaches used in other video restoration work. In a sliding window, a new representation is computed for every frame from the three preceding and three succeeding frames. In a recurrent formulation, an accumulated hidden state is used to condition the current frame on past frames. In contrast, the method developed in this chapter compacts the information for a block of frames into a single representation and then projects that information forward in time such that each frame has some past and some future information. This mimics the process the video decoder performs when it decodes a bitstream: the information is discarded when a new I-frame is encountered.

13.4 Loss Functions

As we discussed in Chapter 9 (Quantization Guided JPEG Artifact Correction), restoration problems which are based purely on regression suffer from blurring and lack high-frequency content. Video compression reduction is no exception, and in fact, just like JPEG compression, video compression specifically removes high frequency content. Unlike in other restoration problems, however, all prior work in video compression reduction uses either the l2 loss or the Charbonnier loss [148], both of which are simple error penalties. We can introduce more complex loss functions to overcome this. For "standard" regression, we depend on the l1 loss as usual (for uncompressed frames X_u and reconstructed frames X_r):

\mathcal{L}_1(X_u, X_r) = \| X_u - X_r \|_1 \qquad (13.1)

In Chapter 9 (Quantization Guided JPEG Artifact Correction), we were lacking a method for accurate high frequency reconstruction. We can address this here, with partial success, by using a loss based on the Difference of Gaussians (DoG) scale space. The difference of Gaussians constructs a scale space by convolving an image with Gaussian blur kernels of differing standard deviations. Differences of these blurred images act as bandpass filters which capture image content in different frequency bands. The process is repeated on downsampled versions of the image to capture information at different scales. We can employ this as a loss function by separating out the different frequency bands at different scales and computing their l1 errors as separate loss terms. This effectively weights each frequency band equally, rather than with the decreasing magnitude we see using an overall l1 loss; as such, the network is rewarded for accurately reconstructing high frequencies.

For the uncompressed and reconstructed frames we compute four different scales:

S_u = \{ X_u, X_{u_2}, X_{u_4}, X_{u_8} \} \qquad (13.2)
S_r = \{ X_r, X_{r_2}, X_{r_4}, X_{r_8} \} \qquad (13.3)

where each entry X_{u_s} or X_{r_s} is obtained by downsampling X_u or X_r by a factor of s. We then compute the difference of Gaussians by convolving with a 5 × 5 2D Gaussian kernel

G(\sigma)_{ij} = \frac{1}{2\pi\sigma^2} e^{-\frac{i^2 + j^2}{2\sigma^2}} \qquad (13.4)

for kernel offsets i, j ranging over [-2, 2]. Then, for each scale s we compute the four filtered images

X_{u_{s,\sigma}} = G(\sigma) * X_{u_s} \qquad (13.5)
X_{r_{s,\sigma}} = G(\sigma) * X_{r_s} \qquad (13.6)

for \sigma \in \{1.1, 2.2, 3.3, 4.4\}. We then compute the differences between consecutive pairs

X_{u_{s,1}} = X_{u_{s,2.2}} - X_{u_{s,1.1}} \qquad (13.7)
X_{u_{s,2}} = X_{u_{s,3.3}} - X_{u_{s,2.2}} \qquad (13.8)
X_{u_{s,3}} = X_{u_{s,4.4}} - X_{u_{s,3.3}} \qquad (13.9)
X_{r_{s,1}} = X_{r_{s,2.2}} - X_{r_{s,1.1}} \qquad (13.10)
X_{r_{s,2}} = X_{r_{s,3.3}} - X_{r_{s,2.2}} \qquad (13.11)
X_{r_{s,3}} = X_{r_{s,4.4}} - X_{r_{s,3.3}} \qquad (13.12)

to yield the per-scale frequency bands. Finally, we compute the loss

\mathcal{L}_{\mathrm{DoG}}(X_u, X_r) = \sum_{s \in \{1, 2, 4, 8\}} \sum_{b=1}^{3} \| X_{u_{s,b}} - X_{r_{s,b}} \|_1 \qquad (13.13)
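A sketch of this loss in PyTorch follows. Average pooling stands in for the downsampling operator and the discrete kernels are normalized to sum to one rather than using the analytic constant of Equation 13.4; both choices are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma, size=5):
    """Normalized 5x5 discrete Gaussian kernel with standard deviation `sigma`."""
    ax = torch.arange(size) - size // 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return (k / k.sum()).view(1, 1, size, size)

def dog_loss(x_u, x_r, sigmas=(1.1, 2.2, 3.3, 4.4), scales=(1, 2, 4, 8)):
    """Difference-of-Gaussians loss (Eq. 13.13): l1 over per-scale frequency bands."""
    c = x_u.shape[1]
    loss = 0.0
    for s in scales:
        u = F.avg_pool2d(x_u, s) if s > 1 else x_u       # downsample by the scale factor
        r = F.avg_pool2d(x_r, s) if s > 1 else x_r
        blur_u = [F.conv2d(u, gaussian_kernel(sg).to(u).expand(c, 1, -1, -1),
                           padding=2, groups=c) for sg in sigmas]
        blur_r = [F.conv2d(r, gaussian_kernel(sg).to(r).expand(c, 1, -1, -1),
                           padding=2, groups=c) for sg in sigmas]
        for b in range(3):                               # three bands per scale (Eqs. 13.7-13.12)
            loss = loss + F.l1_loss(blur_u[b + 1] - blur_u[b], blur_r[b + 1] - blur_r[b])
    return loss
```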
As in Chapter 9 (Quantization Guided JPEG Artifact Correction), however, we find that even this enhanced regression loss is not sufficient to generate realistic reconstructions, although it does help. Instead, we again turn to the GAN and texture losses. The texture loss, repeated here, is discussed at length in Section 9.5 (Loss Functions):

\mathcal{L}_t(X_u, X_r) = \| \mathrm{MINC}_{5,3}(X_u) - \mathrm{MINC}_{5,3}(X_r) \|_1 \qquad (13.14)

where \mathrm{MINC}_{5,3} denotes the output of layer 5, convolution 3 of a VGG [34] network trained on MINC [104]. For the GAN loss we use a Wasserstein GAN formulation [149]. In a Wasserstein GAN, the discriminator network is replaced with a critic which rates examples on a [-1, 1] scale, with -1 being fake and 1 being real. The critic therefore makes a soft decision about the realness or fakeness of a sequence rather than a hard decision, which, along with some gradient clipping, makes it more stable and less sensitive to hyperparameter choices.

As in Chapter 9 (Quantization Guided JPEG Artifact Correction), our critic architecture is based on DCGAN [101] with spectrally normalized convolutions [102], except that we modify the critic procedure to introduce temporal consistency following the procedure from TeCoGAN [150]. The modification is relatively simple: we use the compressed and restored/uncompressed sequences as input, with the frames stacked in the channel dimension. This means the critic considers the entire sequence instead of individual frames and is therefore incentivized to produce similar reconstructions over the sequence. The architecture is shown in Figure 13.6.

Figure 13.6: Metabit Critic Architecture. For the entire seven frame sequence, the compressed and restored frames are concatenated following [150]. The resulting 48 channel input is reduced to a single scalar using a series of spectral-normed convolutions, batch norm, and ReLU layers.

This yields the GAN loss

\mathcal{L}_W(X_u, X_r) = -d(X_u, X_r) \qquad (13.15)

for critic d(\cdot) (note that the critic itself has a different loss function which we do not show here; see [149] for details). These are combined into two composite loss functions: one for regression-only results,

\mathcal{L}_R(X_u, X_r) = \lambda \, [\, \mathcal{L}_1(X_u, X_r) \;\; \mathcal{L}_{\mathrm{DoG}}(X_u, X_r) \,]^T \qquad (13.16)

for balancing hyperparameters \lambda \in \mathbb{R}^2, and a loss for qualitative results,

\mathcal{L}_G(X_u, X_r) = \lambda \, [\, \mathcal{L}_1(X_u, X_r) \;\; \mathcal{L}_{\mathrm{DoG}}(X_u, X_r) \;\; \mathcal{L}_W(X_u, X_r) \;\; \mathcal{L}_t(X_u, X_r) \,]^T \qquad (13.17)

for balancing hyperparameters \lambda \in \mathbb{R}^4.
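Putting the pieces together, the qualitative-phase objective of Equation 13.17 is a weighted sum of the four terms. The sketch below assumes dog_loss, texture_loss, and the sequence critic are defined elsewhere, and uses the balancing weights reported later in Section 13.6; it is an illustration, not the exact training code.

```python
import torch.nn.functional as F

def generator_loss(x_u, x_r, critic, weights=(0.01, 0.01, 0.005, 1.0)):
    """Composite GAN-phase loss (Eq. 13.17). `dog_loss`, `texture_loss`, and
    `critic` are assumed helpers; the weights are the lambdas from Section 13.6."""
    terms = [
        F.l1_loss(x_u, x_r),          # L1 term (Eq. 13.1)
        dog_loss(x_u, x_r),           # DoG term (Eq. 13.13)
        -critic(x_u, x_r).mean(),     # Wasserstein generator term (Eq. 13.15)
        texture_loss(x_u, x_r),       # MINC-VGG texture term (Eq. 13.14)
    ]
    return sum(w * t for w, t in zip(weights, terms))
```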
13.5 Towards a Better Benchmark

All video compression reduction work is tested on one primary dataset: the MFQE [138] dataset. This dataset consists of a large training set of diverse sequences and an eighteen-video test set. The test set contains diverse real-world scenes and a variety of resolutions. The videos are stored in raw (YUV) format. Overall, this dataset is satisfactory for the purposes of evaluating compression reduction. The problem, however, comes in how the dataset is used. Prior works used only HEVC (H.265) compression with constant QP values in {27, 32, 37, 42}, or some subset of these depending on the paper. While numerical results should not generally be our primary concern when evaluating any proposed method, these compression settings do leave much to be desired. Firstly, HEVC compression incurs less degradation than other commonly encountered codecs. Although HEVC would no longer be considered "new", it is also much less frequently used, with almost no browser support. Additionally, the constant QP compression method is simply not used in real videos and is mostly included as a debugging tool. The degradations that it causes are much simpler to model than those of CRF or CBR, which are used frequently in real videos. See Section 11.3 (Slices and Quantization) for a deeper discussion of these terms.

Instead, we propose to use AVC (H.264) compression for evaluation, and we use CRF instead of CQP. This is a much better representation of real-world video than the previous benchmark, as AVC compression accounts for nearly 91% of internet videos as of 2019 [129] (although this share has likely decreased since 2019, it would not be by much). We choose CRF values in {25, 35, 40, 50}, ranging from relatively little compression at 25 (near the ffmpeg default [151]) to 50, which is only one less than the maximum. To reiterate, our goal with this benchmark is to ensure that compression reduction algorithms face tests which accurately represent videos in the real world. In the next section we will see that MFQE fails to converge for any CRF setting, as CRF produces significantly more complex degradations than CQP, thus justifying our concern.

13.6 Empirical Evaluation

With the method sufficiently developed we can empirically evaluate its performance. In all cases we train our network using the MFQE training split [138], which consists of 108 variable-length sequences. To this we add a randomly selected one third of the Vimeo90k dataset [133], which is approximately 30,000 7-frame sequences. We randomly crop 256 x 256 patches (and 7 frames from the MFQE examples) and apply random vertical and horizontal flipping during training. For H.264 benchmarks we encode using one I-frame and six P-frames with CRF encoding as discussed in the previous section. For H.265 benchmarks we follow prior works and use CQP encoding. We train a separate model for each QP or CRF setting. All evaluations are conducted on the MFQE test split of 18 variable-length sequences.

The network is implemented in PyTorch [17] and optimized using the Adam [105] optimizer for 200 epochs with the learning rate set to 10^{-4}. For quantitative experiments we train using the regression loss (Equation 13.16) with \lambda = [1.0, 1.0]. For qualitative results we fine-tune using the GAN loss (Equation 13.17) for an additional 200 epochs with a learning rate of 10^{-5} and \lambda = [0.01, 0.01, 0.005, 1]. As recommended, we use the RMSProp optimizer for Wasserstein GAN training [149]. For numerical results we report the change in PSNR and the change in LPIPS [152]. For compared works we only compute LPIPS if there is a published model or training code. We find no usable trend in SSIM [99], so we do not report it, although some other works do. For consistency with prior work we report metrics on the Y channel only, although we would like to see this practice end soon. To evaluate the GAN, besides providing qualitative results, we report FID [43] and LPIPS.
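For concreteness, the following is a sketch of how one benchmark clip might be encoded with these settings using ffmpeg from Python. The helper name, example file names, and the exact flag set are illustrative assumptions; only the fixed CRF, the 7-frame GOP (one I-frame plus six P-frames, no B-frames), and the use of an H.264 encoder come from the setup described above.

```python
import subprocess

def encode_crf(src_yuv, dst_mp4, crf, width, height):
    # Encode a raw YUV420 sequence with H.264 at a fixed CRF using a
    # 7-frame GOP (one I-frame, six P-frames) and no B-frames.
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "yuv420p",
        "-s", f"{width}x{height}", "-i", src_yuv,
        "-c:v", "libx264", "-crf", str(crf),
        "-g", "7", "-bf", "0",
        dst_mp4,
    ], check=True)

# e.g., encode_crf("sequence.yuv", "sequence_crf40.mp4", crf=40, width=1920, height=1080)
```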
13.6.1 Restoration Evaluation

We start with the numerical results, shown in Table 13.1 for HEVC and Table 13.2 for AVC.

Table 13.1: Metabit HEVC Results. Reported as ΔPSNR (dB) ↑ / ΔLPIPS ↓.

Method            HEVC CQP 27     CQP 32          CQP 37          CQP 42
MFQE 2.0 [140]    0.49 / -        0.52 / -        0.56 / -        0.59 / -
PSTQE [141]       0.63 / -        0.67 / -        0.69 / -        0.69 / -
STDF-R3 [139]     0.72 / 0.025    0.86 / 0.027    0.83 / 0.033    - / -
RFDA [142]        0.82 / -        0.87 / -        0.91 / -        0.82 / -
MetaBit [144]     1.17 / 0.025    0.99 / 0.023    0.91 / 0.029    0.82 / -

Table 13.2: Metabit AVC Results. Reported as ΔPSNR (dB) ↑ / ΔLPIPS ↓.

Method            AVC CRF 25      CRF 35          CRF 40          CRF 50
STDF-R1 [139]     0.741 / 0.034   0.862 / 0.032   0.814 / 0.030   0.632 / 0.013
STDF-R3 [139]     0.784 / 0.035   0.846 / 0.032   0.882 / 0.029   0.817 / 0.011
MetaBit [144]     1.085 / 0.024   1.137 / 0.014   1.113 / 0.005   0.887 / -0.016

We can see that the Metabit architecture we developed achieves a significantly better reconstruction result in most cases. The only exceptions are at the high QPs for HEVC, where our method ties the next most recent work. For AVC, the results are significantly better, indicating that the more complex CRF degradation is handled well by our method; in other words, the extra parameterization is able to better model complex degradations. Note that MFQE failed to converge entirely for CRF training. Note also that PSTQE [141] does not provide public code and RFDA [142] does not provide usable training code, so we were unable to fully evaluate these methods.

To evaluate the GAN we report FID and LPIPS in Table 13.3. Note that this compares the degraded AVC input against our model trained for regression and for GAN restoration at extreme CRF settings (40 and 50); we do not compare to other works here. As expected, the GAN generates significantly more realistic results.

Table 13.3: Metabit GAN Numerical Results. Reported as FID ↓ / LPIPS ↓.

Method        H.264 CRF 40     CRF 50
AVC           67.07 / 0.259    152.19 / 0.498
Regression    80.67 / 0.265    154.42 / 0.482
GAN           37.78 / 0.191    95.26 / 0.368

A more important numerical result for our purposes is throughput, specifically when compared with the number of model parameters. Our formulation is designed to be highly efficient while permitting a large number of parameters, traits which are important for timely and accurate video restoration. This result is shown graphically in Figure 13.7. We see that our method is as fast as STDF with about double the parameters and a higher PSNR. In other words, our method is able to better utilize the extra parameters without slowing down, thanks to the efficiency improvements we made by leveraging the bitstream metadata.

Figure 13.7: FPS vs Params. FPS plotted against parameter count (M) for MFQE 1.0/2.0, STDF-R1, STDF-R3, and Metabit; the size of each point indicates the increase in PSNR.

With the numerical results out of the way, we can look at how the restoration functions on real images, with an example shown in Figure 13.10 from the short film Big Buck Bunny [130]. In this example we can see that despite the heavy compression (AVC CRF 40), the multiframe GAN restoration is able to produce a striking resemblance to the original image. Textures are accurately reconstructed on the grass and tree, and yet the smooth sky region is preserved, i.e., the network has not hallucinated textures where they should not be. This is even more remarkable considering that there was no artificial training data.

Figure 13.8: Rate-Distortion Comparison. Distortion (PSNR, dB) vs. rate (bpp) for AVC, Wu et al. (2018), DVC (2019), STAT-SSF-SP (2020), HLVC (2020), Scale-Space (2020), Liu et al. (2020), NeRV (2021), and AVC + MetaBit (Ours). Using Metabit with AVC compression performs better for low bitrates than fully deep learning codecs.

To compare the effect of GAN restoration we show crops in Figure 13.11. The difference here is quite pronounced. Although STDF and Metabit using regression are able to improve the visual quality of the images, the GAN restoration is significantly more realistic in terms of overall sharpness and texture. This is particularly noticeable on the trees in the top row.
13.6.2 Compression Evaluation

As in Chapter 9 (Quantization Guided JPEG Artifact Correction), one of the principal applications of restoration is as a method to better compress media. For our case, we can make a direct comparison to other fully deep learning based compression codecs, where we compare quite favorably both in terms of latency and rate-distortion. The rate-distortion result is shown in Figure 13.8 along with a number of recent deep learning based compression algorithms. We use the UVG dataset [153] for this task, which is a widely used dataset for compression evaluation. We compare to [154], [155], [156], [157], [158], [159], and [160]. Note that we use a model trained for regression only in this case.

Figure 13.9: Learned Compression Throughput Comparison. Encoding and decoding FPS for Wu et al., DVC, Liu et al., NeRV, AVC, and AVC + Metabit; orange shows encoding time, blue shows decoding time. Current fully deep learning based codecs are quite slow. One major advantage of using video restoration is that encoding is quite fast. Decoding is on par with other methods except NeRV, although NeRV has a much slower encoding time.

One potential problem we encounter here is that the deep learning based compression literature compares with "low-delay P" mode compression. This is a compression setting which saves additional space by using only a single I-frame for the entire sequence, with the rest of the frames encoded as P-frames. This setting saves additional space over placing I-frames periodically at the cost of quality and is often used in streaming scenarios where low latency is critical. In order to compare fairly, we modify our restoration procedure slightly to accommodate this (a procedural sketch follows at the end of this section). Since the first group of seven frames always includes an I-frame, this group is restored following our standard procedure. We then cache the restored seventh P-frame, and instead of reading seven more frames we read six. These six frames are all P-frames, and we use the cached restored P-frame in place of an I-frame. There is no retraining of the network involved in this process, so the results are likely lower than they could be if we trained for this scenario; however, this likely improves cross-block temporal consistency by reusing information from the previous block.

For low bitrates (i.e., the bitrates which matter), simply using AVC compression with Metabit restoration outperforms deep learning codecs. This is to be expected since Metabit is an objective improvement on AVC, which many deep learning methods still struggle to beat in general cases. The advantage of this comes in its ease of use (almost all modern hardware can decode an AVC compressed video) and its speed. Encoding times are in the hundreds of frames per second on commodity CPU hardware. Decoding time, even including Metabit, is also fast. In Figure 13.9 we compare the throughput of a subset of these methods to Metabit with AVC compression. Metabit is on par with other methods here, and actually outperforms all but NeRV. The major disadvantage of NeRV, however, is the extremely long encoding time. Since NeRV is an implicit representation, a new network is learned for each video, taking on the order of hours to encode one video. We simply skip encoding time for NeRV in the plot.

The major takeaway from this discussion is that video compression reduction, specifically the Metabit method we developed in this chapter, is efficient and effective at generating accurately restored video frames from low bitrate video. Since it only depends on commodity codecs, videos can be encoded quickly and easily and decoded by anyone without special hardware. Those with special hardware can use our method to achieve more visually pleasing results. This method provides a promising avenue for deploying deep learning in compression in the near term.
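The low-delay-P evaluation procedure described above can be summarized in the following sketch. The `restore_gop` callable is a stand-in for the Metabit network (mapping a seven-frame block whose first frame is an I-frame or pseudo-I-frame to seven restored frames); handling of a short final block is omitted, and nothing here implies retraining.

```python
def restore_low_delay_p(frames, restore_gop):
    # First block: contains the real I-frame, restore it normally.
    restored = list(restore_gop(frames[:7]))
    i = 7
    while i < len(frames):
        # The cached restored frame stands in for an I-frame for the next
        # block of six P-frames; keep only the six newly restored frames.
        block = [restored[-1]] + list(frames[i:i + 6])
        restored.extend(restore_gop(block)[1:])
        i += 6
    return restored
```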
Figure 13.10: Metabit Restoration Example. Crop from the 1920x1080 film "Big Buck Bunny". Left: compressed, Middle: restored, Right: original. This artificial scene is restored accurately despite a lack of artificial training data. Note the grass and tree textures, sharp edges, removal of blocking on the flower, and preservation of the smooth sky region. The video was compressed using AVC CRF 40.

13.7 Limitations and Future Work

The work as presented in this chapter has a few limitations, all of which we believe are solvable with relatively minor changes. The network architecture currently depends on a fixed GOP size of seven frames. This would likely not work with real videos, which may use a variable-length GOP for several reasons. This is readily solvable by projecting a variable-size GOP representation to a fixed-size one using an adaptive pooling layer followed by a projection. It remains to be investigated whether this mapping is as effective as using a fixed GOP.

Similarly, since each GOP is treated as a separate block, and restored separately, there are temporal consistency issues across GOP blocks. Note that within a single GOP the frames are quite consistent thanks to the TeCoGAN [150] formulation which we use. It is only across GOP boundaries that consistency issues arise. This is noticeable in the restored output as a sort of flickering or noise pattern. This is likely solvable by keeping a compact hidden state from the previous GOP to compose with the current GOP processing.

Continuing in this line of thought, there are issues which arise in a streaming scenario which make an architecture like this wholly unsuitable. In streaming scenarios, the video is often encoded using "low-delay P" mode. In this mode, there is a single I-frame at the beginning of the GOP followed by only P-frames; in other words, there is only a single GOP in the entire video. For this we actually have a partial solution, which is to simply use the previously restored final frame (restored frame 7) in place of the I-frame for the next GOP, skipping the I-frame restoration step entirely. This does lead to a loss of visual quality, especially if the network was not trained to perform this kind of restoration.

Another issue in streaming situations is latency and buffering. For real-time applications this currently precludes restoration technology, but in our case in particular, the 7-frame GOP needs to be buffered before restoration can occur. This may be a limiting factor in some scenarios. The only way around this is to reduce the number of buffered frames or move to a fully recurrent solution.

Finally, this method suffers from the same "quality aware" problem that was prevalent in JPEG artifact correction. For each CRF setting we have to train a different model. Unlike JPEG, however, CRF, and more importantly the derived QP values, are stored in the video file. In this way it should be easy to create a parameterized network which is aware of the QPs that each frame was compressed with.
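One way such a QP-aware parameterization might look is sketched below. The embedding-based feature modulation, the layer sizes, and the assumed 0-51 QP range are illustrative only; this is not the network described in this chapter.

```python
import torch
import torch.nn as nn

class QPConditionedBlock(nn.Module):
    # Modulate restoration features with an embedding of the per-frame QP
    # read from the bitstream, so a single model can cover all settings.
    def __init__(self, channels, max_qp=51):
        super().__init__()
        self.embed = nn.Embedding(max_qp + 1, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, qp):
        # x: (N, C, H, W) features; qp: (N,) integer QP for each frame.
        scale = self.embed(qp).unsqueeze(-1).unsqueeze(-1)
        return torch.relu(self.conv(x * scale))
```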
Figure 13.11: Metabit Comparison. Crops from "Big Buck Bunny" and "Traffic" from the MFQE dataset. We compare our GAN restoration to regression restoration and the STDF method.

Part IV

Concluding Remarks

You have made it to the end of the dissertation, a journey developing the first principles of deep learning and classical compression from the preliminaries through to the published research of the author. With the body of the dissertation behind us, we can recap where we've been and where we're going.

To reiterate, the overall goal of my dissertation was to present, explicitly, an approach that follows the first principles of the compression problems. This is motivated by engineering as much as by science, and I have shown that incorporating engineering principles, both into the methodologies, e.g., considering how compression algorithms were developed, and into the philosophy of the research, e.g., approaching a scientific problem as an engineer and a scientist simultaneously, can be successful. While this is not an approach unique to myself, I find that it is rarely stated out loud. And yet, the engineers that developed the compression algorithms we all use on a daily basis were extremely smart, so is it not logical to follow their example when studying deep learning? I hope that the implications of this thought extend far beyond this document.

In Chapter 7 (JPEG Domain Residual Learning) we developed a method for performing deep learning in the JPEG domain. This method operates on coefficients directly and requires no decompression. The goal of the method was to produce a result which is as close as possible to the pixel domain result. In this work we leveraged the first principles of JPEG compression by linearizing the JPEG transform and composing it with our pixel domain convolutions. We also leveraged this to produce closed-form derivations of batch normalization and average pooling. However, we could not do this for ReLU, and so we used an approximation technique. The primary issue with this work is that it uses substantially more memory to store feature maps and convolutions.

In Chapter 9 (Quantization Guided JPEG Artifact Correction) we improved JPEG compression using deep learning. We noted that there were several issues preventing the widespread success of prior work in this field. Prior works were quality-dependent models which trained a unique model for each JPEG quality factor. They did not handle color images, and they were focused only on regression. We solved each of these problems, again leveraging first principles. By conditioning our network on the JPEG quantization matrix and processing DCT coefficients instead of pixels, we were able to encode quality information using a single network. By explicitly handling chroma subsampling and the additional quantization that color channels are subjected to, we improved results on color images. Finally, we incorporated GAN and texture losses to improve the visual result over a regression-only solution. While this method was highly successful, it had a distinct disadvantage when quantization data is either incorrect, as in multiple compression, or not available at all, as in transcoding.

This was followed up in Chapter 10 (Task-Targeted Artifact Correction), which extended the previous method to optimize the correction for machine consumption. This is somewhat of a novelty in artifact correction, which is traditionally for humans only.
By incorporating a loss based on a downstream task, we are able to greatly improve the performance of that task on JPEG compressed inputs, often outperforming data augmentation techniques that retrain the task with JPEG images. Our method had the added benefit that a network trained for one downstream task would also work well for other downstream tasks with no re-training required. The main drawbacks of this method are that it has increased running time (for multiple networks) and that it does not always outperform data augmentation.

Finally, in Chapter 13 (Metabit: Leveraging Bitstream Metadata), we forayed into video compression by developing a correction based method for improving AVC and HEVC compression. We noted that prior works expended significant resources computing results which were already stored explicitly in the video bitstreams, such as high quality frame locations (I-frames) and motion data. We leveraged this data and used our increased compute budget to more than double the number of parameters in our network with almost no impact on throughput. We also incorporated a scale-space loss along with the GAN and texture losses from the JPEG work in order to improve high-frequency reconstructions. While this method outperforms prior works and fully deep-learning based codecs at low bitrates, it does struggle with high bitrate reconstructions, and it currently requires a unique model for each video compression setting. The method would also struggle to run in real time on any consumer hardware.

So where do we go from here? Aside from the multitude of related problems to work on in compression, from things as simple as improving the results to as tangential as data privacy, the number one focus for the next decade of compression and deep learning is going to be on making these techniques practical. One of the primary goals in writing this document is to instruct practitioners, e.g., professional engineers, developers, students, etc., that while these techniques show extreme promise, actually getting them into the hands of consumers requires considerable effort.

Focusing solely on the techniques which improve compression performance, which would be of direct use to consumers who lack broadband internet, these methods currently require significant compute resources on the end user's side. This could be shifted to the data center side, i.e., a hybrid technique which extracts some deep representation during transmission. This could also be addressed simply by developing faster and lower-memory algorithms or by leveraging customized hardware. While the latter sounds like an expensive solution, consider that video decoders are almost exclusively implemented in hardware on modern processors, both for desktop/laptop machines and mobile phones. This is also starting to happen for deep learning, e.g., the Google Tensor chips and edge TPUs, and the Apple A14 chips. In any case, I believe consumer applications for this technology are no more than two years off at the time of writing, and within the decade for fully deep learning based compression. The field of compression as a whole is progressing as fast as ever and it is an exciting time to be involved.

This is all in the midst of the global pandemic. In a world which was just beginning to address the inequality in internet access, suddenly in late 2019, we were forced to confront this issue as work and school became primarily remote. Remote work and school mean communication with video and images, which means compression.
Those who did not have a strong internet connection were simply left behind, as there were no suitable alternatives, and it remains to be seen what the long-term ramifications of this will be. Now, in early 2022, the world is quickly moving on from pandemic life. Yet it is important not to forget this lesson. Better compression has the ability to help people right now.

Author's Note

I invite the readers to now visit the appendices, where they will find material which is just as interesting as, and yet not directly related to, the dissertation proper. In particular, we will review some additional qualitative results and briefly cover fully deep-learning based compression algorithms. Thank you for reading my dissertation!

Max Ehrlich

Part V

Appendix

Appendix A: Study on JPEG Compression and Machine Learning

This appendix reproduces the full plots and tables of the results of the study on JPEG compression and deep learning [111]. See Chapter 10 (Task-Targeted Artifact Correction) for more details. These plots are for informational purposes only.

A.1 Plots of Results

Figure A.1: Overall Classification Results (accuracy loss (%) vs. JPEG quality for MobileNetV2, ResNet-18/50/101, ResNeXt-50/101, VGG-19, InceptionV3, and EfficientNet B3)
Figure A.2: Classification Results: MobileNetV2 (accuracy loss vs. quality with no mitigation, off-the-shelf artifact correction, fine-tuning, and task-targeted artifact correction)
Figure A.3: Classification Results: VGG-19
Figure A.4: Classification Results: InceptionV3
Figure A.5: Classification Results: ResNeXt 50
Figure A.6: Classification Results: ResNeXt 101
Figure A.7: Classification Results: ResNet 18
Figure A.8: Classification Results: ResNet 50
Figure A.9: Classification Results: ResNet 101
Figure A.10: Classification Results: EfficientNet B3
Figure A.11: Overall Detection and Instance Segmentation Results (mAP loss vs. JPEG quality for FasterRCNN, FastRCNN, RetinaNet, and MaskRCNN)
Figure A.12: Detection Results: FastRCNN
Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.13: Detection Results: FasterRCNN 239 mAP Loss mAP Loss mAP Loss 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.14: Detection Results: RetinaNet 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.15: Detection Results: MaskRCNN 240 mAP Loss mAP Loss 16 HRNetV2 + C1 14 MobileNetV2 (dilated) + C1 (ds)ResNet18 (dilated) + PPM 12 ResNet50 + UPerNet ResNet50 (dilated) + PPM 10 ResNet101 + UPerNet ResNet101 (dilated) + PPM 8 6 4 2 0 10 20 30 40 50 60 70 80 90 Quality Figure A.16: Overall Semantic Segmentation Results 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.17: Semantic Segmentation Results: HRNetV2 + C1 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.18: Semantic Segmentation Results: MobileNetV2 + C1 241 mIoU Loss mIoU Loss mIoU Loss 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.19: Semantic Segmentation Results: ResNet 18 + PPM 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.20: Semantic Segmentation Results: Resnet50 + UPerNet 242 mIoU Loss mIoU Loss 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.21: Semantic Segmentation Results: ResNet 50 + PPM 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.22: Semantic Segmentation Results: ResNet 101 + UPerNet 243 mIoU Loss mIoU Loss 14 None 12 Off-the-Shelf Artifact CorrectionFine-Tuned 10 Task-Targeted Artifact Correction 8 6 4 2 0 10 20 30 40 50 60 70 80 90 Quality Figure A.23: Semantic Segmentation Results: ResNet 101 + PPM 244 mIoU Loss A.2 Tables of Results Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 79.78 81.84 82.47 82.68 82.78 82.75 82.83 82.85 82.83 None 77.24 81.11 81.95 82.52 82.67 82.91 83.10 83.37 83.75 EfficientNet B3 Top-1 Accuracy 83.98 Off-the-Shelf Artifact Correction 75.92 80.02 81.47 82.12 82.44 82.71 82.94 83.23 83.70 Task-Targeted Artifact Correction 81.03 82.71 83.21 83.53 83.64 83.71 83.73 83.80 83.76 Supervised Fine-Tuning 75.11 77.25 77.77 77.89 78.13 78.13 78.24 78.26 78.32 None 69.38 74.15 75.44 75.98 76.38 76.69 76.95 77.14 77.30 InceptionV3 Top-1 Accuracy 77.33 Off-the-Shelf Artifact Correction 71.21 75.04 76.09 76.42 76.68 76.79 76.97 77.06 77.13 Task-Targeted Artifact Correction 73.65 75.89 76.53 76.82 76.93 76.99 77.09 77.15 77.10 Supervised Fine-Tuning 65.65 69.21 69.92 70.20 70.37 70.53 70.50 70.55 70.54 None 57.23 65.55 67.87 68.95 69.47 69.98 70.24 70.60 70.86 MobileNetV2 Top-1 Accuracy 70.72 Off-the-Shelf Artifact Correction 57.33 65.25 67.76 68.93 69.60 70.07 70.40 70.71 70.58 Task-Targeted Artifact Correction 64.64 68.63 69.71 70.18 70.32 70.44 70.50 70.52 70.34 Supervised Fine-Tuning 74.63 76.50 77.07 77.20 
77.27 77.29 77.43 77.44 77.53 None 66.12 73.00 74.65 75.39 75.83 76.29 76.51 76.79 76.96 ResNet-101 Top-1 Accuracy 76.91 Off-the-Shelf Artifact Correction 67.91 73.64 75.09 75.84 76.23 76.52 76.56 76.80 76.74 Task-Targeted Artifact Correction 72.99 75.53 76.30 76.60 76.59 76.72 76.70 76.72 76.59 Supervised Fine-Tuning 65.49 68.46 69.07 69.16 69.36 69.33 69.38 69.53 69.49 None 57.62 65.26 67.07 67.68 68.08 68.30 68.61 68.84 68.92 ResNet-18 Top-1 Accuracy 68.84 Off-the-Shelf Artifact Correction 61.19 66.39 67.87 68.39 68.61 68.77 68.97 68.99 68.90 Task-Targeted Artifact Correction 63.83 67.06 68.04 68.24 68.35 68.48 68.52 68.60 68.50 Supervised Fine-Tuning 73.18 75.46 76.02 76.24 76.36 76.42 76.52 76.52 76.55 None 63.43 71.20 73.23 74.10 74.43 74.63 75.01 75.09 75.34 ResNet-50 Top-1 Accuracy 75.31 Off-the-Shelf Artifact Correction 66.90 72.45 73.95 74.60 74.93 75.18 75.26 75.42 75.30 Task-Targeted Artifact Correction 70.48 73.56 74.39 74.81 74.94 75.00 74.98 74.98 74.89 Supervised Fine-Tuning 75.60 78.00 78.50 78.71 78.86 78.97 79.01 78.98 79.06 None 68.83 74.84 76.39 77.05 77.60 78.00 78.16 78.56 78.75 ResNeXt-101 Top-1 Accuracy 78.81 Off-the-Shelf Artifact Correction 71.19 75.88 77.14 77.80 78.15 78.30 78.57 78.66 78.61 Task-Targeted Artifact Correction 74.73 77.33 78.08 78.29 78.55 78.62 78.68 78.73 78.68 Supervised Fine-Tuning 74.21 76.23 76.79 77.01 77.08 77.18 77.16 77.30 77.17 None 66.96 73.21 74.85 75.62 76.07 76.37 76.63 76.88 77.06 ResNeXt-50 Top-1 Accuracy 76.99 Off-the-Shelf Artifact Correction 68.05 73.56 75.11 75.95 76.38 76.59 76.71 76.99 76.90 Task-Targeted Artifact Correction 72.22 75.45 76.09 76.62 76.86 76.83 76.85 76.99 76.81 Supervised Fine-Tuning 69.50 72.66 73.29 73.74 73.83 73.85 73.95 74.14 74.11 None 59.27 68.08 70.49 71.53 71.99 72.42 72.80 73.24 73.46 VGG-19 Top-1 Accuracy 73.44 Off-the-Shelf Artifact Correction 61.93 68.79 70.82 71.83 72.50 72.94 73.13 73.40 73.44 Task-Targeted Artifact Correction 67.50 71.32 72.33 72.76 73.03 73.16 73.50 73.48 73.44 245 Table A.1: Results for classification models. Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 29.09 33.34 34.72 35.08 35.49 35.82 35.96 36.06 36.17 None 20.35 30.03 32.59 33.43 34.04 34.31 34.73 34.93 35.25 FasterRCNN mAP 35.37 Off-the-Shelf Artifact Correction 28.45 31.86 33.10 33.85 34.05 34.47 34.70 34.77 34.71 Task-Targeted Artifact Correction 31.43 33.85 34.29 34.81 34.81 34.97 35.01 34.88 34.81 Supervised Fine-Tuning 28.01 31.94 33.08 33.56 33.88 34.17 34.42 34.44 34.66 None 19.99 29.04 31.22 32.19 32.65 33.00 33.34 33.40 33.80 FastRCNN mAP 34.02 Off-the-Shelf Artifact Correction 27.62 30.91 32.04 32.56 32.78 33.18 33.28 33.48 33.44 Task-Targeted Artifact Correction 30.11 32.31 33.07 33.31 33.39 33.53 33.69 33.68 33.59 Supervised Fine-Tuning 26.32 30.48 31.79 32.21 32.55 32.83 33.11 33.20 33.32 None 18.35 27.58 29.83 30.80 31.32 31.62 32.02 32.29 32.62 MaskRCNN mAP 32.84 Off-the-Shelf Artifact Correction 25.82 29.35 30.67 31.32 31.59 31.85 32.03 32.24 32.16 Task-Targeted Artifact Correction 28.48 30.85 31.71 32.00 32.19 32.24 32.35 32.43 32.26 Supervised Fine-Tuning 27.64 31.97 33.03 33.50 33.80 34.12 34.30 34.33 34.40 None 18.76 28.23 30.63 31.59 32.27 32.57 32.88 33.02 33.42 RetinaNet mAP 33.57 Off-the-Shelf Artifact Correction 26.74 29.90 31.24 31.87 32.19 32.60 32.86 33.02 32.93 Task-Targeted Artifact Correction 29.66 31.86 32.73 32.97 32.98 33.13 33.24 33.23 33.09 Table A.2: Results for detection models. 
246 Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 34.76 37.35 38.74 38.78 39.27 39.75 39.98 39.86 39.96 None 24.95 35.16 38.03 38.52 39.02 40.09 40.50 40.41 40.54 HRNetV2 + C1 mIoU 40.59 Off-the-Shelf Artifact Correction 32.30 36.54 38.40 38.52 40.08 40.44 40.46 40.22 40.60 Task-Targeted Artifact Correction 34.14 37.61 39.23 39.24 39.92 40.53 40.62 40.39 40.55 Supervised Fine-Tuning 19.07 22.37 23.43 23.62 23.60 24.15 24.44 24.37 24.46 None 13.92 24.03 27.13 27.75 27.73 28.86 29.37 29.35 29.43 MobileNetV2 (dilated) + C1 (ds) mIoU 29.52 Off-the-Shelf Artifact Correction 21.17 25.27 27.31 27.16 29.14 29.32 29.26 29.06 29.54 Task-Targeted Artifact Correction 24.74 27.37 28.44 28.33 29.19 29.56 29.54 29.38 29.52 Supervised Fine-Tuning 35.32 37.41 38.27 38.28 38.55 38.59 38.72 38.58 38.70 None 26.14 36.70 39.45 39.81 39.55 40.47 40.98 40.97 41.07 ResNet101 + UPerNet mIoU 41.08 Off-the-Shelf Artifact Correction 33.90 37.39 39.12 39.38 40.32 40.58 40.78 40.79 41.04 Task-Targeted Artifact Correction 35.82 38.67 39.96 39.98 40.22 40.79 40.97 40.91 41.00 Supervised Fine-Tuning 31.86 35.45 36.73 36.94 36.91 37.33 37.67 37.55 37.65 None 25.68 35.19 37.76 38.43 38.24 39.27 40.03 40.17 40.21 ResNet101 (dilated) + PPM mIoU 40.26 Off-the-Shelf Artifact Correction 31.44 35.86 38.01 38.26 39.54 39.73 39.94 40.06 40.22 Task-Targeted Artifact Correction 33.99 37.63 39.04 39.11 39.38 39.73 40.07 40.11 40.10 Supervised Fine-Tuning 29.84 32.33 33.08 33.01 33.38 33.61 33.50 33.29 33.33 None 21.16 31.99 34.72 35.36 35.41 36.16 36.56 36.60 36.59 ResNet18 (dilated) + PPM mIoU 36.65 Off-the-Shelf Artifact Correction 28.64 32.59 34.56 34.53 35.96 36.21 36.29 36.25 36.64 Task-Targeted Artifact Correction 31.69 34.55 35.80 35.80 36.12 36.50 36.66 36.54 36.60 Supervised Fine-Tuning 32.88 35.11 35.94 35.90 36.41 36.58 36.63 36.49 36.55 None 24.29 34.78 37.34 37.71 37.70 38.57 39.12 39.13 39.16 ResNet50 + UPerNet mIoU 39.21 Off-the-Shelf Artifact Correction 31.83 35.52 37.20 37.26 38.44 38.67 38.87 38.86 39.12 Task-Targeted Artifact Correction 34.36 36.94 38.17 38.07 38.55 38.93 39.14 39.06 39.09 Supervised Fine-Tuning 32.26 35.33 36.04 36.04 36.53 36.75 36.93 36.71 36.92 None 23.05 33.95 36.66 37.07 37.40 38.58 38.93 38.70 38.86 ResNet50 (dilated) + PPM mIoU 38.91 Off-the-Shelf Artifact Correction 28.36 32.69 35.24 35.31 37.74 38.04 38.18 38.13 38.73 Task-Targeted Artifact Correction 31.92 35.43 37.04 36.92 38.05 38.69 38.79 38.52 38.74 Table A.3: Results for segmentation models. 247 Model Value ImageNet Classification, Metric: Top-1 Accuracy ResNet 18 68.84 ResNet 50 75.31 ResNet 101 76.91 ResNeXt 50 76.99 ResNeXt 101 78.81 VGG 19 73.44 MobileNetV2 70.72 InceptionV3 77.33 EfficientNet B3 83.98 COCO Object Detection and Instance Segmentation, Metric: mAP FastRCNN 34.02 FasterRCNN 35.38 RetinaNet 33.57 MaskRCNN 32.84 ADE20k Semantic Segmentation, Metric: mIoU HRNetV2 + C1 40.59 MobileNetV2 (dilated) + C1 29.52 ResNet 18 (dilated) + PPM 36.65 ResNet 50 (dilated) + PPM 38.91 ResNet 101 41.08 248 ResNet 101 (dilated) + PPM 40.26 Table A.4: Reference results (results with no compression). 249 Appendix B: Additional Results In this appendix we examine more interesting outputs from various methods dis- cussed in the body of the dissertation. These are mostly qualitative results. While these images are not critical to understanding the methods, everyone likes looking at pictures! 
Warning: The results presented here are intended to be reproductions from the published papers, so there may be some repeats from the body of the dissertation.

B.1 Quantization Guided JPEG Artifact Correction

These results are from the method presented in Chapter 9 (Quantization Guided JPEG Artifact Correction). We first show more equivalent quality examples; recall that equivalent quality performs restoration on an image and then uses SSIM to find the JPEG quality matching the restored image, which gives an indication of how much space is saved by using QGAC.

Figure B.1: Equivalent quality visualizations. For each image we show the input JPEG, the JPEG with equivalent SSIM to our model output, and our model output. (Quality 50 → equivalent quality 85, 29.5 kB saved (43.6%); quality 30 → 58, 46.8 kB saved (47.9%); quality 40 → 78, 25.0 kB saved (42.7%).)

Next we show the full frequency domain results. Recall that these results show the frequency domain content of the images, comparing JPEG compression, regression restoration, and GAN restoration.

Figure B.2: Frequency domain results 1/4 (per-frequency probability for the original, JPEG Q=10, regression, and GAN images)
Figure B.3: Frequency domain results 2/4
Figure B.4: Frequency domain results 3/4
Figure B.5: Frequency domain results 4/4

One way to reduce any artifacts caused by divergent GAN training is to use model interpolation [81]. Model interpolation simply takes the regression weights W_R and the GAN weights W_G along with a scalar \alpha and computes new model parameters

W_I = (1 - \alpha) W_R + \alpha W_G    (B.1)

We show close-up views of different textured regions for different choices of \alpha.

Figure B.6: Model interpolation results 1/4
Figure B.7: Model interpolation results 2/4
Figure B.8: Model interpolation results 3/4
Figure B.9: Model interpolation results 4/4
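Equation B.1 amounts to a per-parameter linear blend of the two checkpoints. A minimal sketch, assuming the regression and GAN models share an architecture (and therefore the same state_dict keys):

```python
def interpolate_weights(w_r, w_g, alpha):
    # W_I = (1 - alpha) * W_R + alpha * W_G, applied to every parameter tensor.
    return {k: (1.0 - alpha) * w_r[k] + alpha * w_g[k] for k in w_r}

# e.g., model.load_state_dict(interpolate_weights(regression_state, gan_state, alpha=0.5))
```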
We close with purely qualitative results. These are for quality 10, as in Chapter 9 (Quantization Guided JPEG Artifact Correction), and quality 20, which was not shown there to save space.

Figure B.10: Qualitative results 1/4 (JPEG Q=10 and Q=20, reconstruction, and original). Live-1 images.
Figure B.11: Qualitative results 2/4. Live-1 images.
Figure B.12: Qualitative results 3/4. Live-1 images.
Figure B.13: Qualitative results 4/4. ICB images.

B.2 Task Targeted Artifact Correction

These results are from the method of Chapter 10 (Task-Targeted Artifact Correction). We start with visualizations of model errors, first using Grad-CAM [161]. This shows how the model focus is impacted by JPEG compression and how it can be corrected using the various mitigation techniques we studied. The figures show some interesting behavior. In terms of localization, the JPEG compressed input actually does well, and the localization is in fact more accurate than the original model with an uncompressed input. The problem with the JPEG compressed input seems to be with the gradient, which is extremely noisy. Mitigation seems to help with this, with the supervised method providing the cleanest gradient, although there is a loss of localization accuracy.

Figure B.14: Fine Tuned Model Comparison (gradients and class activation maps for the original model on the original input, the original model on the compressed input, and the fine-tuned model)
Figure B.15: Off-the-Shelf Artifact Correction Comparison
Figure B.16: Task-Targeted Artifact Correction Comparison

For visualizing detection results we provide plots generated using TIDE [126]. We show these for FasterRCNN [118] and MaskRCNN [120]. The results show a significant number of missed detections for low quality inputs. This is overtaken by localization errors as quality increases.

Figure B.17: FasterRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
Figure B.18: MaskRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.

We close the section with qualitative results, including visualizations of the results where appropriate.

Figure B.19: MobileNetV2, Ground Truth: "Pembroke, Pembroke Welsh corgi". Predictions: JPEG Q=10: "Norwich terrier"; off-the-shelf artifact correction: "basenji"; fine-tuned: "Pembroke, Pembroke Welsh corgi"; task-targeted artifact correction: "Pembroke, Pembroke Welsh corgi"; original: "Pembroke, Pembroke Welsh corgi".
Figure B.20: FasterRCNN (JPEG Q=10, off-the-shelf artifact correction, task-targeted artifact correction, supervised fine-tuning, original, and ground truth)
Figure B.21: MaskRCNN
Figure B.22: HRNetV2 + C1 (JPEG Q=10 predictions, fine-tuning prediction, and ground truth)

B.3 Metabit

The results in this section are from the method of Chapter 13 (Metabit: Leveraging Bitstream Metadata). These are purely qualitative results, but they do highlight specific successes and failures of the method.

Figure B.23: Dark Region. Crop from 2560x1600 "People on Street". The dark region is poorly preserved by compression. Our GAN restoration struggles to cope with the massive information loss in this region.

Figure B.24: Crowd. Crop from 2560x1600 "People on Street". The image shows an extremely dense crowd. Despite the chaotic nature, our GAN is able to produce a good restoration, although there is detail missing.

Figure B.25: Texture Restoration. Crop from 1920x1080 "Cactus". The texture on the background is destroyed by compression.
Our GAN reconstructs a reasonable approximation to the true texture.

Figure B.26: Compression Artifacts Mistaken for Texture. Crop from 1920x1080 "Cactus". The compressed image exhibits strong chroma subsampling artifacts (lower right corner). These are mistaken by the GAN as a texture and restored as such.

Figure B.27: Motion Blur. Crop from 1920x1080 "Cactus". The tiger exhibits high motion, which presents itself in the target frame as motion blur. This blur is destroyed by compression and is not able to be restored by the GAN loss. The GAN loss is also "rewarded" for sharp edges, which would make reconstructing blurry objects difficult. As an aside, note the additional detail on the background objects in the GAN image when compared to the compressed image.

Figure B.28: Artificial. Crop from 1920x1080 "Big Buck Bunny". This artificial scene is restored accurately despite a lack of artificial training data. Note the grass and tree textures, sharp edges, removal of blocking on the flower, and preservation of the smooth sky region.

Appendix C: Survey of Fully Deep-Learning Based Compression

Although fully deep-learning based compression methods are generally considered out of the scope of this dissertation, there is general interest in these technologies and they are certainly related to the work presented in the body of the document. Therefore, in this appendix, we conduct a brief survey of the major points of image and video compression that depend entirely on deep learning to produce the encodings.

While deep learning based compression shows extreme promise, it is still a very academic problem. Models currently require expensive hardware to train and to compress new media in a timely manner. This also leads to high memory usage. In general, important compression concepts like rate control are still largely missing. In terms of objective performance, the most recent methods at the time of writing are on par with classical compression on some benchmarks. This is not always easy to evaluate, however, as methods depending on generation, like GANs [39], often do not produce meaningful rate-distortion curves in the traditional sense.

In a rare personal opinion, based on my observation of the state of the art, I believe that machine learning, wherever it may end up, is the future of compression. Within the decade (i.e., before 2030) we will begin to see machine learning techniques used in consumer applications. In contrast, the techniques presented in the body of the dissertation will likely be seen in consumer applications in the next one or two years. There are currently a number of companies competing for deep-learning based compression market share, e.g., Google and WaveOne. While these companies are delivering important research contributions, it is unlikely that their proprietary solutions will win out in the long term given the compression community's reliance on standardization. Although Google was able to gain traction in classical compression with its VP codecs, even these were eventually standardized into the Alliance for Open Media (AOM) and development continued with the AV codecs. Notable standardization efforts include JPEG-AI and MPEG-AI, which are much more likely to see success, meaning that any new players in this field would do well to work with the standards bodies.

C.1 Image Compression

We start with image compression.
The goal of these models is to train a CNN to encode pixels into a feature vector, with another CNN trained simultaneously to decode the feature vectors back to an image, essentially a fancy autoencoder. The feature vectors are quantized and losslessly compressed before "transmission", or in this case, before evaluating their size in bytes. The networks are trained to minimize both the size of the feature vectors when stored on disk and the error of the reconstruction. There are three obvious problems here which drive the works we will consider in this section:

• The size on disk is not differentiable and therefore not suitable for use in a loss function.
• Classical compression algorithms incorporate rate control to make their use more flexible. It is not trivial to incorporate such a side parameter into a CNN.
• Minimizing the error term does not necessarily produce a visually pleasing result.

Likely the first modern work in image compression with deep learning was the work by [11] for thumbnails, with a follow-up for full resolution images [13]. Toderici's work is based on recurrent networks, specifically Long Short-Term Memory networks (LSTMs). The output of the LSTM at time t is subtracted from the input, and this residual is used as the input to the LSTM at time t + 1; the process starts with the input patch and generates a fixed-length code for a given bitrate setting. The network is only trained to minimize the l2 error. Considering how early these architectures were developed, they have some nice properties, including reasonable results compared to JPEG and a rudimentary attempt at variable rate encoding.

Next, [14] proposed compressive autoencoders to generate a compressed representation. The idea is to produce a deep encoding of the input image which is then quantized for transmission and decoded by another deep network. The objective can be written as

-\log(Q(f(x))) + \beta\, d(x, g(f(x)))    (C.1)

where Q() is the quantization function, f() is the encoder, g() is the decoder, d() is a measure of distortion (i.e., error), and \beta balances the two terms. The left term here is measuring the size of the representation (number of bits) and the right term is measuring the error. Of course, this objective cannot be minimized directly since Q() is not differentiable. To get around this they define a differentiable approximation to the rounding step. By adjusting this approximation, they are able to produce a much more accurate variable rate encoder, although the empirical results show that training for a single rate naturally works better.

Also in 2017, [162] developed "soft-to-hard vector quantization". They start with the same problem as [14], that the quantization step is not differentiable. They solve the problem by using a soft assignment of the features to symbols, i.e., instead of a hard rounding they compute

\phi(z) = \mathrm{softmax}\left(-\sigma \left( \| z - c_1 \|^2, \ldots, \| z - c_n \|^2 \right)\right)    (C.2)

for n symbols c. This equation is fully differentiable. However, this alone would be a poor approximation, so during training the "hardness" \sigma is "annealed" from some initial condition to infinity, which produces a more and more accurate approximation of the hard assignment used at test time. This allows the network to quickly converge on the easier soft solution in early training while increasing the problem difficulty to match the real scenario in late training. [12] propose yet another solution to this problem.
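Before turning to that alternative, here is a small PyTorch sketch of the soft-to-hard assignment in Equation C.2, with the annealing of the hardness σ left to the training loop. The use of squared Euclidean distances and a softmax follows the equation; the names, shapes, and the expectation-based surrogate quantizer are illustrative assumptions.

```python
import torch

def soft_assignment(z, centers, sigma):
    # z: (N, D) feature vectors, centers: (n, D) quantization symbols c_1..c_n.
    # Returns soft assignment probabilities (N, n); as sigma grows this
    # approaches the hard nearest-center assignment used at test time.
    d2 = torch.cdist(z, centers) ** 2          # squared distances ||z - c_i||^2
    return torch.softmax(-sigma * d2, dim=-1)

def soft_quantize(z, centers, sigma):
    # Differentiable surrogate for quantization: expectation over the centers.
    return soft_assignment(z, centers, sigma) @ centers
```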
Their solution is motivated by Shannon's information theory and, although somewhat questionable in the motivation, has become a staple technique for approximating quantization. Ballé et al. observe that the discrete quantization process essentially introduces noise into the signal which is output by the deep encoder. Of course, the entropy of a noisy channel is something that Shannon studied quite extensively [26]. Therefore, the solution is to simply add Gaussian noise to the signal, which is a simple and differentiable process. Of course, the issue with this is that Gaussian noise is very different in appearance from quantization noise, and CNNs are very sensitive to the actual appearance even if the entropy analysis is the same (entropy is essentially giving an aggregate view of the information loss). Nevertheless, the method does work well.

[163] specifically focus on designing a method for variable rate encoding. Although this was a feature of prior works, their primary focus was on overcoming the non-differentiable quantization. Mentzer et al. use both the soft-to-hard technique [162] and the compressive autoencoders technique [14] to deal with quantization. To model the rate term in the loss, Mentzer et al. treat the feature vectors as a conditional distribution, i.e.,

P(z) = \prod_{i=1}^{N} P(z_i \mid z_{i-1}, \ldots, z_1)    (C.3)

in raster order. So each feature vector is considered to have its own probability which is conditioned on all previous features. They then model P(z) and the conditional distributions using another deep network (which is differentiable). Specifically, they use a 3D convolution since this is efficient and respects the "causality constraint." In other words, the previous feature vectors cause the current feature vector since they are conditional distributions. This formulation for P(z) allows them to compute an approximation of the entropy, and therefore the rate, which they use as a loss term.
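The following is a sketch of how such a conditional entropy model can be turned into a differentiable rate estimate. For simplicity it uses a masked 1D convolution over the flattened, discretized features rather than the 3D convolution of [163]; the alphabet size, context length, and single layer are assumptions.

```python
import math
import torch
import torch.nn as nn

class CausalEntropyModel(nn.Module):
    # Predicts P(z_i | z_{i-1}, ..., z_1) with a convolution that only sees
    # strictly preceding symbols (Equation C.3).
    def __init__(self, num_symbols=256, context=5):
        super().__init__()
        self.conv = nn.Conv1d(1, num_symbols, kernel_size=context, padding=context)

    def forward(self, z):
        # z: (N, L) integer symbols; output logits: (N, num_symbols, L).
        logits = self.conv(z.float().unsqueeze(1))
        return logits[:, :, : z.shape[1]]      # keep only the causal positions

def rate_in_bits(model, z):
    # Differentiable rate estimate: sum over i of -log2 P(z_i | z_{<i}).
    logp = torch.log_softmax(model(z), dim=1)
    picked = logp.gather(1, z.unsqueeze(1)).squeeze(1)
    return -picked.sum(dim=1) / math.log(2.0)
```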
After conducting an extensive user study, they found that no metric was adequate for matching the human?s responses. This is not at all surprising. 276 Although we end in 2020 image compression continues to be an active area of research, although it remains to be seen which works of 2021 will emerge as the most influential. In the interest of space, we conclude the discussion of image compression here. Although the advance of [165] was extremely promising, there is still no deep learning algorithm that is suitable for deployment in a consumer application. This is partly an efficiency concern but it is also a flexibility concern. JPEG was extremely well thought out to work for the widest range of situations which is part of the reason it has persisted for 30 years. Deep learning methods are only just scratching the surface of this kind of long term thinking. C.2 Video Compression We now turn to video compression. Similar to the previous section, the goal will be to train encoder and decoder CNNs with some kind of quantization of the encoded feature vectors. Unlike the last section, however, we now have a temporal component in everything we do. In addition to the challenges of image compression, dealing with the temporal component is a problem by itself. Some methods will treat the time component as independent, essentially image compression with different features over time. Some will attempt to incorporate the temporal component into the prediction itself either in a recurrent or motion based solution. Still others will use an implicit representation, essentially over-fitting a network for each video. We start with the method of [154]. The key insight is that ?keyframes? can be defined which are then encoded using off-the-shelf image encoders. Then, the 277 intermediate frames are produced using image interpolation networks that take the two keyframes as input and produce the intermediate frames. Naturally, the longer the interval between the keyframes the higher the error of the predictions. While the method certainly works, it does not clearly outperform H.264 in the same way that image compression algorithms were clearly outperforming JPEG. This was still a major advancement, however, since prior to this no one had tried to produce a video codec using deep learning. By focusing on keyframe compression and interpolation, the method is efficient which was a major concern with video compression. Next, DVC [155] proposes an end-to-end technique which encodes motion and residual information for predicted frames. This is intentionally designed to mimic the classical compression loop which stores intra-frames and then predicts intermediate frames using motion warping and low-entropy error residuals. In this case, each component is modeled separately with a CNN. The method has several moving parts Motion Estimation which uses a task-specific optical flow network to produce per- pixel motion. This motion is then encoded using another CNN for compression and quantized. The decoder performs the inverse process to produce the flows Motion Compensation Also uses a deep network. First the decoded optical flow is used to warp the reference image, then the reference image, warped images, and optical flow are all used as input to another deep network to predict the true frame. Transform The residual between the predicted and true frame is taken and encoded 278 using yet another CNN to produce the quantized encoding. This is similar to image compression techniques. 
Overall the method is fairly complex and heavy consisting of several convolutional networks. While all of this does pay off in terms of the overall result compared with [154], the actual codec itself struggles to match H.265. Furthermore, per-pixel flow is likely wasteful, at least we know that classical video codecs do not make use of dense motion information. That being said the end-to-end nature and the idea of replacing each part of a traditional video encoder with a CNN are major advances to the state-of-the-art. In 2020, we finally had a technique capable of outperforming H.265. The method of [156] proposed a simple but effective technique. Use a standard deep learning based image compression algorithm to generate initial codes for each frame. Then perform internal learning to generate an ?optimal? code for that frame and use a conditional entropy model to produce a final code for the current frame that is conditioned on the previous frame. Note that the internal learning method is actually learning a small CNN just for that particular frame, so the encoding time is increased but the decoding is still fast since the decoder only needs to perform inference on the resulting network to obtain the code. The conditional entropy model also helps the encoder reuse information from prior frames to reduce the final code length. The contribution is very straightforward. Aside from these two ideas there are no special formulations (this is a good thing). The result is impressive, with results that consistently outperform the classical codecs for higher bitrates. 279 The method does struggle at low bitrates, however. Continuing with internal learning is NeRV [157]. This method is entirely an in- ternal learning technique, which means that for each video, the compression process is to train a neural network which predicts only that video (over-fitting it) and then the neural network weights are compressed using a model compression technique like pruning. To decode, the transmitted model weights are used in an inferencing pass to retrieve the frames. In particular, NeRV proposes a frame-based implicit representation vs the pixel based approach in something like SIREN [166]. What this means practically is the NeRV takes a time t as input and produces the frame of the video at time t instead of taking the triple x, y, t and producing the pixel at position x, y at time t. Not only is the NeRV formulation simpler for the network to learn (leading to better results) it is also significantly faster, requiring T forward passes to produce a video of length T instead of H ?W ? T forward passes. While the overall idea here is interesting the results do leave something to be desired, as NeRV struggles to outperform even H.264. Furthermore, although the network is small making decoding time fast, encoding (i.e., training the network) is on the order of hours. We close the section with ELF-VC [167], a method which is groundbreaking both in its results and in its methodical design. The approach is fast, provides a well motivated method for I- and P-frame encoding with deep networks, supports variable rate encoding, and compares well to other classical and deep learning codecs. For the I-frame model, standard image deep learning compression is used. For P-frames, the method is more interesting. Motion is predicted using a flow model as in [155] and 280 the residual and flows are both stored. 
We close the section with ELF-VC [167], a method which is groundbreaking both in its results and in its methodical design. The approach is fast, provides a well-motivated method for I- and P-frame encoding with deep networks, supports variable rate encoding, and compares well to other classical and deep learning codecs. For the I-frame model, standard deep learning image compression is used. For P-frames, the method is more interesting. Motion is predicted using a flow model as in [155], and both the residual and the flows are stored. The decoder uses a prior frame as an initial estimate of the warped frame before incorporating the flow vectors and residual. Variable rate encoding is achieved using a level map where the rate-distortion curve is discretized into levels. The level is tiled spatially and used as input to the encoder and decoder, with the loss encouraging the network to hit the specified bitrate target. This provides a simple way to tune the bitrate. In terms of results, ELF-VC largely outperforms other work on all benchmarks, with the exception of AV1, the latest in classical compression.

Although ELF-VC hits on a number of ideas that would be required for a commercial video codec, that goal is still very far off. With optimizations, ELF-VC can decode a 1080p video at 18 fps, which is not fast enough, and it requires a GPU with a large amount of memory. As new methods are developed which are more efficient (this needs to be a continuing focus, however) and hardware speed increases, the likelihood of deep learning compression finding its way to consumer applications increases. These reasons contribute to the 10-year estimate.

C.3 Lossless Techniques

The previous two sections dealt exclusively with lossy compression. We expect that the networks will remove information from images during encoding, and even if they do not, the quantization process will. But we can also use machine learning for lossless compression. This takes a number of forms which we discuss in this section. Importantly, this is a fairly interesting use case and could potentially see practical application sooner than the lossy methods, although it would be in niche scenarios. For example, these techniques have uses in lossless transcoding of classically compressed images. This particular application is important because it would allow large datacenters, which have the resources to run deep learning models at scale, to save on storage costs by transcoding images and videos to a smaller deep-learning-based file. The media are then transcoded back to their consumer format before being transmitted, so that the consumer does not need special hardware or software to view the media. When we discussed entropy coding in Chapter 4 (Entropy and Information), we noted that entropy coders work by assigning shorter codes to probable symbols and longer codes to improbable symbols. In order to work well, the entropy coder needs an accurate probability distribution, which is difficult to come up with, particularly for image data. The techniques in this section are primarily focused on learning such distributions.

PixelRNNs [168], [169] are generative models that predict each pixel in an image as a discrete conditional distribution. This has an advantage over other generative methods like GANs [39] because the model predicts the distribution explicitly instead of simply producing samples from the distribution. In standard fashion, each pixel is treated as a distribution conditioned on all previous pixels

    p(z) = \prod_{i=1}^{N} p(z_i \mid z_{i-1}, \dots, z_1)    (C.4)

Each pixel can then be generated by sampling from the learned distribution pixel by pixel. So how is this relevant to compression? With the likelihood of each pixel, we can use these distributions to produce probabilities for entropy coders [27], [46].
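To make that connection concrete, the sketch below pairs a toy stand-in for an autoregressive pixel model with the ideal (Shannon) code length that an arithmetic coder driven by its probabilities would approach. The model, its smoothing heuristic, and the example pixel values are all hypothetical; a real system would substitute a trained PixelRNN/PixelCNN.

```python
# Sketch of how an autoregressive model's per-pixel probabilities drive an entropy coder.
# The "model" is a hand-written placeholder returning a categorical distribution over
# 256 intensities given the previous pixels; a real PixelRNN/PixelCNN would replace it.
import numpy as np

def toy_autoregressive_model(previous_pixels):
    """Return a placeholder p(next pixel | previous pixels) over 256 values."""
    probs = np.full(256, 1.0 / 256)
    if len(previous_pixels) > 0:
        # Bias mass toward the last value to mimic spatial smoothness.
        probs *= 0.5
        probs[previous_pixels[-1]] += 0.5
    return probs / probs.sum()

def ideal_code_length_bits(pixels):
    """Shannon-optimal length (in bits) that an arithmetic coder would approach."""
    total = 0.0
    for i, value in enumerate(pixels):
        p = toy_autoregressive_model(pixels[:i])[value]
        total += -np.log2(p)
    return total

flat_image = [128, 128, 129, 127, 128, 200, 128, 128]
print(f"{ideal_code_length_bits(flat_image):.1f} bits "
      f"vs {8 * len(flat_image)} bits uncompressed")
```

Because the repeated value 128 is assigned high probability, its per-pixel cost drops to roughly one bit, which is exactly the mechanism a learned distribution exploits.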
Integer discrete flows (IDFs) [170], [171] are similar in spirit. The idea again is to learn an explicit distribution of the image data and produce a latent code from the distribution, with the advantage of much faster sampling. IDFs in particular are designed to overcome an explicit problem with flows in general: they assume a continuous random variable. Images are discrete random variables, so quantizing the resulting model to fit the discrete distribution may introduce loss. By formulating an integer discrete flow, the authors can provably reproduce the given input exactly from a code. The flow itself is based on the change of variables formula

    P_X(x) = P_Z(f(x)) \left| \frac{\partial z}{\partial x} \right|    (C.5)

where z = f(x). The flow is then reformulated in integer form where the Jacobian determinant is one. The method was extended more recently in iVPF [172], which uses volume-preserving flows instead of integer discrete flows (they are quite similar in operation, however). In either case, the learned flow can then be used directly as a probability distribution for entropy coding.

While bits-back encoding [173]–[176] has been around for some time, the Bits-Back ANS [177] method was the first algorithm to use neural networks for the learning component and to be shown efficient on large datasets. Without going into too much detail, the idea of bits-back encoding is to assume that the given symbol s has some latent variable y associated with it and that we have a way of measuring p(y), p(s|y), and p(y|s). Bits-back encoding allows us to leverage this knowledge of the latent distribution to store s with fewer bits. Bits-Back ANS uses a variational autoencoder (VAE) for the latent model. Bit-Swap [178] and HiLLoC [179] extend this with hierarchical latent variables, and LBB [180] merges flows with bits-back encoding.

We close with a very different approach, [181]. This method actually leverages lossy compression in order to improve the lossless compression rate. The idea is to start with BPG [54] and use a network to predict an optimal quantization parameter controlling how aggressively BPG should behave. BPG of course loses information, so the residual between the true image and the compressed image is taken, and another network predicts the probability of the residual given the input image. The residual is then encoded using an entropy coder with the learned distribution. Since the encoded residual is stored with the BPG-compressed image, there is no information loss.

Overall, this field is full of interesting and practical ideas. Although somewhat niche in their application, these are highly developed techniques that could already be useful in engineering applications. However, these ideas are by definition not suitable for consumers, as they are really one part of a more complex whole and their performance cannot match lossy algorithms. Their use is more suited to specialized applications in medical imaging, datacenters, or remote sensing where loss of data may not be acceptable.

Glossary

B

basis: A set of vectors which is linearly independent and spans a vector space. 8–11, 33
Bayesian decision theory: A method for making optimal decisions given perfect probability distributions describing possible events. 61

C

chroma subsampling: The process for storing chrominance channels at a smaller resolution, since human vision is less sensitive to changes in color information. 86, 87, 91
chrominance: The color or hue of light captured by a sensor. 86
communication system: A system for conveying some message from one party to another. Consists of an information source, a transmitter, a signal, a source of noise corrupting the signal, a receiver, and a destination. 52
compression: Any operation which reduces the size in bits of a computational object. iv, 289
convolution: The correlation of a signal and a kernel as the kernel is shifted across the signal. 27, 44, 45, 286
convolutional filter manifold: A modification of the filter manifold which uses a spatial input. 145
cross-correlation: See convolution (although they are technically different). 27

D

decision boundary: The manifold in space separating classification decisions. 65, 69
deep learning: A machine learning technique that learns many layers of features jointly with a task objective. iv, v, 74, 98, 129
dissertation: A particular type of write-only document. vi, 20, 60

E

entropy: The amount of information in a message, the amount of randomness in a system, the minimum number of bits required to encode a message. 50, 84
error residual: The difference between a motion compensated frame and the true frame. 187
evidence: The probability of an observation. 63

F

feature: An abstract or higher order representation of an image or a part of an image that is more suitable for input to a machine learning algorithm. 69
filter manifold: A method for learning adaptable convolution kernels from a scalar input. 144
first principles: The underlying engineering decisions which motivate an algorithm. v, 227
Fourier transform: An integral transform defining an orthogonal basis for functions. 35, 39–41, 43–45

G

Gabor transform: A special case of the STFT which uses a Gaussian filter to window the transform. 40–42
gradient: For a scalar-valued function of a vector, the vector of partial derivatives of the scalar output with respect to each component of the input. 68

H

Hadamard transform: An approximation of the discrete cosine transform consisting of only 1s and -1s. 39, 192
Huffman coding: A method for producing optimal length codes for single symbols given the probabilities that each symbol will occur. 54, 55

I

image: A discrete 2D signal giving a sample value at integer positions (x, y); the sample may be a scalar (grayscale) or a vector (color). 22, 28, 31
image-to-image: The machine learning problem which takes an image as input and produces an image as output. 78
interlaced: A method for storing color images which stores color information in sequence, e.g., each pixel could consist of 24 bits with 8 bits for red, green, and blue. 86

J

Jacobian: For a tensor-valued function of a tensor, the tensor of partial derivatives of each component of the output with respect to each component of the input. 68, 76
JPEG: The Joint Photographic Experts Group, often referring to an image file or compression algorithm. iii, v, 37, 50, 84, 96, 98, 100, 101, 117, 129, 140, 141, 146, 182

L

likelihood: The probability of an observation given that some event occurs. 62
linear combination: A series of scalar multiplications and vector additions. 4, 6, 13, 36
linear map: A mapping which preserves scalar multiplication and vector addition. 15, 27, 29, 37
linearly separable: A decision which is able to be made using only the relative position with respect to a hyperplane. 66
lossless compression: A compression operation which preserves all information in the original signal. 54
lossy compression: A compression operation which removes information from the signal to save space. 50, 84
luminance: The brightness ("quantity" of light) captured by a sensor. 86, 89, 141

M

macroblocks: Pixel blocks in a frame which are larger than the transform block size. 189
metric tensor: A tensor which relates a vector space and a co-vector space. 20
motion compensation: The process by which a video codec warps frames using estimated motion. 186
motion estimation: The process by which a video encoder measures block motion. 187
Motion JPEG: A video codec which stores each frame as a JPEG. 183
motion vectors: Vectors specifying the motion of video blocks. 185
MPEG: The Motion Picture Experts Group, often referring to a compression algorithm. v, 50
multilinear map: A mapping which is linear in each of its arguments separately. 15–17
multiresolution analysis: See wavelet transform. 43

N

Nash equilibrium: The state of a game where no player can obtain an advantage over any other player. 80
Nyquist Sampling Theorem: A signal with a maximum frequency ω_m can be represented exactly by discrete samples with a sampling rate of at least 2ω_m. 45

P

planar: A method for storing color images which stores color information separately, e.g., the image may consist of all the red pixels followed by all the blue pixels, etc. 86
posterior probability: The probability of an event occurring given an observation. 62
prior probability: The probability of an event occurring in the absence of any other information. 61

R

rate control: Any method for tuning the bitrate of an image or video. 190–192

S

semantic segmentation: The machine learning problem which takes an image as input and produces a classification label for each pixel. 78
slices: Regions of a video frame consisting of a whole number of macroblocks. 189
SVM: Support Vector Machine, a linear model which separates examples using the maximum margin hyperplane. 70

T

transform domain: A catch-all term for DCT coefficients, quantized JPEG data, or any other transformation of pixel data. 100, 110, 117

V

video: A discrete 3D signal giving a sample value at integer positions (x, y, t); the sample may be a scalar (grayscale) or a vector (color). 31

W

wavelet: A wave-like function with finite support. 41, 42, 46
wavelet transform: An integral transform using a set of wavelets as the basis; allows for multi-resolution analysis and localization of frequencies in time. 42, 45, 49, 289

Figure Credits

Unless listed here, figures are either generated by the author or in the public domain. The original authors of these works do not endorse any changes made for this document.

3.2 on page 42: Wikipedia. User JonMcloone. https://commons.wikimedia.org/wiki/File:MorletWaveletMathematica.svg. CC-BY-SA 3.0. Removed axes.
3.3 on page 43: Wikipedia. User JonMcloone. https://commons.wikimedia.org/wiki/File:MorletWaveletMathematica.svg. CC-BY-SA 3.0. Removed axes, added scaled version to show hierarchy.
3.5 on page 47: Wikipedia. User Omegatron. https://commons.wikimedia.org/wiki/File:Haar_wavelet.svg. CC-BY-SA 3.0. Removed axes, added scaled version to show hierarchy.
4.1 on page 52: [26]
5.2 on page 67: Wikipedia. User Glosser.ca. https://commons.wikimedia.org/wiki/File:Colored_neural_network.svg. CC-BY-SA 3.0.
5.3 on page 70: [31]
5.4 on page 72: [32]
5.5 on page 73: [33]
5.6 on page 77: [36]
5.7 on page 78: [9]
11.3 on page 186: Big Buck Bunny [130]. https://peach.blender.org/. CC-BY-SA 3.0. Motion vector arrows added to frame.

Bibliography

[1] M. Duggan, "Photo and video sharing grow online," Pew research internet project, 2013.
[2] Verizon Inc., 4g lte speeds vs. your home network. [Online]. Available: https://www.verizon.com/articles/4g-lte-speeds-vs-your-home-network/.
[3] G. K. Wallace, "The jpeg still picture compression standard,"
IEEE trans- actions on consumer electronics, vol. 38, no. 1, pp. xviii?xxxiv, 1992. [4] I. E. Richardson, The H. 264 advanced video compression standard. John Wiley & Sons, 2011. [5] N. Ahmed, T. Natarajan, and K. R. Rao, ?Discrete cosine transform,? IEEE transactions on Computers, vol. 100, no. 1, pp. 90?93, 1974. [6] D. Le Gall, ?Mpeg: A video compression standard for multimedia applica- tions,? Communications of the ACM, vol. 34, no. 4, pp. 46?58, 1991. [7] T. M. Schmit and R. M. Severson, ?Exploring the feasibility of rural broad- band cooperatives in the united states: The new new deal?? Telecommunica- tions Policy, vol. 45, no. 4, p. 102 114, 2021. 294 [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ?Imagenet classification with deep convolutional neural networks,? Advances in neural information pro- cessing systems, vol. 25, pp. 1097?1105, 2012. [9] K. He, X. Zhang, S. Ren, and J. Sun, ?Deep residual learning for image recognition,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770?778. [10] M. Tan and Q. V. Le, ?Efficientnet: Rethinking model scaling for convolu- tional neural networks,? arXiv preprint arXiv:1905.11946, 2019. [11] G. Toderici, S. M. O?Malley, S. J. Hwang, et al., ?Variable rate image com- pression with recurrent neural networks,? arXiv preprint arXiv:1511.06085, 2015. [12] J. Balle?, V. Laparra, and E. P. Simoncelli, ?End-to-end optimized image compression,? arXiv preprint arXiv:1611.01704, 2016. [13] G. Toderici, D. Vincent, N. Johnston, et al., ?Full resolution image compres- sion with recurrent neural networks,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306?5314. [14] L. Theis, W. Shi, A. Cunningham, and F. Husza?r, ?Lossy image compression with compressive autoencoders,? arXiv preprint arXiv:1703.00395, 2017. [15] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, ?Semantic per- ceptual image compression using deep convolution networks,? in 2017 Data Compression Conference (DCC), IEEE, 2017, pp. 250?259. 295 [16] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Je?gou, ?And the bit goes down: Revisiting the quantization of neural networks,? in ICLR 2020- Eighth International Conference on Learning Representations, 2020, pp. 1? 11. [17] A. Paszke, S. Gross, F. Massa, et al., ?Pytorch: An imperative style, high- performance deep learning library,? in Advances in Neural Information Pro- cessing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d?Alche?- Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., 2019, pp. 8024? 8035. [Online]. Available: http : / / papers . neurips . cc / paper / 9015 - pytorch- an- imperative- style- high- performance- deep- learning- library.pdf. [18] A. Einstein, ?Die grundlage der allgemeinen relativita?tstheorie,? in Das Rel- ativita?tsprinzip, Springer, 1923, pp. 81?124. [19] A. Jain and A. fast Karhunen, ?Loeve transform for a class of random pro- cesses,? IEEE Trans. Comm, vol. 24, pp. 1023?1029, 1976. [20] H. Kekre and J. Solanki, ?Comparative performance of various trigonometric unitary transforms for transform image coding,? International Journal of Electronics Theoretical and Experimental, vol. 44, no. 3, pp. 305?315, 1978. [21] I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury, ?The dual-tree com- plex wavelet transform,? IEEE signal processing magazine, vol. 22, no. 6, pp. 123?151, 2005. [22] I. Daubechies, Ten lectures on wavelets. SIAM, 1992. 296 [23] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. 
Zuo, ?Multi-level wavelet-cnn for image restoration,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773?782. [24] J. Bruna and S. Mallat, ?Invariant scattering convolution networks,? IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1872?1886, 2013. [25] X. Zhao, P. Huang, and X. Shu, ?Wavelet-attention cnn for image classifica- tion,? Multimedia Systems, pp. 1?10, 2022. [26] C. E. Shannon, ?A mathematical theory of communication,? The Bell system technical journal, vol. 27, no. 3, pp. 379?423, 1948. [27] D. A. Huffman, ?A method for the construction of minimum-redundancy codes,? Proceedings of the IRE, vol. 40, no. 9, pp. 1098?1101, 1952. [28] Y. LeCun, B. E. Boser, J. S. Denker, et al., ?Handwritten digit recognition with a back-propagation network,? in Advances in neural information pro- cessing systems, 1990, pp. 396?404. [29] P. E. Hart, D. G. Stork, and R. O. Duda, Pattern classification. Wiley Hobo- ken, 2000. [30] F. Rosenblatt, The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957. 297 [31] N. Dalal and B. Triggs, ?Histograms of oriented gradients for human de- tection,? in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR?05), Ieee, vol. 1, 2005, pp. 886?893. [32] D. G. Lowe, ?Object recognition from local scale-invariant features,? in Pro- ceedings of the seventh IEEE international conference on computer vision, Ieee, vol. 2, 1999, pp. 1150?1157. [33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ?Gradient-based learning applied to document recognition,? Proceedings of the IEEE, vol. 86, no. 11, pp. 2278?2324, 1998. [34] K. Simonyan and A. Zisserman, ?Very deep convolutional networks for large- scale image recognition,? arXiv preprint arXiv:1409.1556, 2014. [35] C. Szegedy, W. Liu, Y. Jia, et al., ?Going deeper with convolutions,? in Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1?9. [36] O. Ronneberger, P. Fischer, and T. Brox, ?U-net: Convolutional networks for biomedical image segmentation,? in International Conference on Medical im- age computing and computer-assisted intervention, Springer, 2015, pp. 234? 241. [37] S. Ioffe and C. Szegedy, ?Batch normalization: Accelerating deep network training by reducing internal covariate shift,? in International conference on machine learning, PMLR, 2015, pp. 448?456. 298 [38] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ?Image-to-image translation with conditional adversarial networks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125?1134. [39] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., ?Generative adversarial nets,? Advances in neural information processing systems, vol. 27, 2014. [40] Y. LeCun, ?The mnist database of handwritten digits,? http://yann. lecun. com/exdb/mnist/, 1998. [41] J. F. Nash, ?Equilibrium points in n-person games,? Proceedings of the na- tional academy of sciences, vol. 36, no. 1, pp. 48?49, 1950. [42] J. F. Nash, ?Non-cooperative games,? Annals of mathematics, pp. 286?295, 1951. [43] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ?Gans trained by a two time-scale update rule converge to a local nash equilibrium,? Advances in neural information processing systems, vol. 30, 2017. [44] Independant JPEG Group. ?Libjpeg.? (), [Online]. Available: http : / / libjpeg.sourceforge.net. 
[45] International Telecommunication Union, ?Studio encoding parameters of dig- ital television for standard 4:3 and wide-screen 16:9 aspect ratios,? Geneva, CH, Standard, Mar. 2011. [46] J. Rissanen and G. G. Langdon, ?Arithmetic coding,? IBM Journal of re- search and development, vol. 23, no. 2, pp. 149?162, 1979. 299 [47] B. C. Smith, ?Fast software processing of motion jpeg video,? in Proceed- ings of the second ACM international conference on Multimedia, ACM, 1994, pp. 77?88. [48] S.-F. Chang, ?Video compositing in the dct domain,? in IEEE Workshop on Visual Signal Processing and Communications, Raleigh, NC, Sep. 1992, 1992. [49] B. Shen and I. K. Sethi, ?Inner-block operations on compressed images,? in Proceedings of the third ACM international conference on Multimedia, ACM, 1995, pp. 489?498. [50] B. K. Natarajan and B. Vasudev, ?A fast approximate algorithm for scaling down digital images in the dct domain,? in Image Processing, 1995. Proceed- ings., International Conference on, IEEE, vol. 2, 1995, pp. 241?243. [51] B. C. Smith and L. A. Rowe, ?Algorithms for manipulating compressed im- ages,? IEEE Computer Graphics and Applications, vol. 13, no. 5, pp. 34?42, 1993. [52] T. Boutell, PNG (Portable Network Graphics) Specification Version 1.0, RFC 2083, Mar. 1997. doi: 10.17487/RFC2083. [Online]. Available: https://www. rfc-editor.org/info/rfc2083. [53] CompuServe Inc, ?Graphics Interchange Format,? Standard, Mar. 1987. [54] F. Bellard. ?Better portable graphics.? (2018), [Online]. Available: https: //bellard.org/bpg/. 300 [55] MPEG, ?Requirements for still image coding using HEVC,? Vienna, AT, Standard, 2013. [56] A. Skodras, C. Christopoulos, and T. Ebrahimi, ?The jpeg 2000 still im- age compression standard,? IEEE Signal processing magazine, vol. 18, no. 5, pp. 36?58, 2001. [57] C. S. Swartz, Understanding digital cinema: a professional handbook. Rout- ledge, 2004. [58] M. Ehrlich and L. Davis, ?Deep residual learning in the jpeg transform do- main,? in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2019, pp. 3484?3493. [59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ?Imagenet: A large-scale hierarchical image database,? in 2009 IEEE conference on com- puter vision and pattern recognition, Ieee, 2009, pp. 248?255. [60] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski, ?Faster neural networks straight from jpeg,? Advances in Neural Information Processing Systems, vol. 31, 2018. [61] S.-Y. Lo and H.-M. Hang, ?Exploring semantic segmentation on the dct representation,? in Proceedings of the ACM Multimedia Asia, 2019, pp. 1?6. [62] B. Deguerre, C. Chatelain, and G. Gasso, ?Fast object detection in com- pressed jpeg images,? in 2019 ieee intelligent transportation systems confer- ence (itsc), IEEE, 2019, pp. 333?338. 301 [63] G. Daniel, J. Gray, et al., ?Opt einsum-a python package for optimizing con- traction order for einsum-like expressions,? Journal of Open Source Software, vol. 3, no. 26, p. 753, 2018. [64] S. Chetlur, C. Woolley, P. Vandermersch, et al., ?Cudnn: Efficient primitives for deep learning,? arXiv preprint arXiv:1410.0759, 2014. [65] V. Nair and G. E. Hinton, ?Rectified linear units improve restricted boltz- mann machines,? in Icml, 2010. [66] K. Fukushima and S. Miyake, ?Neocognitron: A self-organizing neural net- work model for a mechanism of visual pattern recognition,? in Competition and cooperation in neural nets, Springer, 1982, pp. 267?285. [67] A. Krizhevsky, G. 
Hinton, et al., ?Learning multiple layers of features from tiny images,? 2009. [68] A. Foi, V. Katkovnik, and K. Egiazarian, ?Pointwise shape-adaptive dct for high-quality deblocking of compressed color images ?,? in Proc. 14th Eur. Signal Process. Conf., EUSIPCO 2006. [69] S. Yang, S. Kittitornkun, Y.-H. Hu, T. Q. Nguyen, and D. L. Tull, ?Blocking artifact free inverse discrete cosine transform,? in Proceedings 2000 Interna- tional Conference on Image Processing (Cat. No. 00CH37101), IEEE, vol. 3, 2000, pp. 869?872. [70] T. D. Tran, R. De Queiroz, and T. Q. Nguyen, ?The generalized lapped biorthogonal transform,? in Proceedings of the 1998 IEEE International Con- 302 ference on Acoustics, Speech and Signal Processing, ICASSP?98 (Cat. No. 98CH36181), IEEE, vol. 3, 1998, pp. 1441?1444. [71] C. Dong, Y. Deng, C. Change Loy, and X. Tang, ?Compression artifacts reduction by a deep convolutional network,? in Proceedings of the IEEE In- ternational Conference on Computer Vision, 2015, pp. 576?584. [72] K. Yu, C. Dong, C. C. Loy, and X. Tang, ?Deep convolution networks for compression artifacts reduction,? arXiv preprint arXiv:1608.02778, 2016. [73] C. Dong, C. C. Loy, K. He, and X. Tang, ?Learning a deep convolutional network for image super-resolution,? in European conference on computer vision, Springer, 2014, pp. 184?199. [74] P. Svoboda, M. Hradis, D. Barina, and P. Zemcik, ?Compression ar- tifacts removal using convolutional neural networks,? arXiv preprint arXiv:1605.00366, 2016. [75] L. Cavigelli, P. Hager, and L. Benini, ?Cas-cnn: A deep convolutional neural network for image compression artifact suppression,? in 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 752?759. [76] H. Chen, X. He, L. Qing, S. Xiong, and T. Q. Nguyen, ?Dpw-sdnet: Dual pixel-wavelet domain deep cnns for soft decoding of jpeg-compressed images,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 711?720. 303 [77] B. Zheng, R. Sun, X. Tian, and Y. Chen, ?S-net: A scalable convolutional neural network for jpeg compression artifact reduction,? Journal of Electronic Imaging, vol. 27, no. 4, p. 043 037, 2018. [78] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, ?Deep generative adversarial compression artifact removal,? in Proceedings of the IEEE Inter- national Conference on Computer Vision, 2017, pp. 4826?4835. [79] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, ?Deep universal generative adversarial compression artifact removal,? IEEE Transactions on Multimedia, 2019. [80] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, ?Residual dense network for image restoration,? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [81] X. Wang, K. Yu, S. Wu, et al., ?Esrgan: Enhanced super-resolution gener- ative adversarial networks,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0?0. [82] X. Liu, X. Wu, J. Zhou, and D. Zhao, ?Data-driven sparsity-based restoration of jpeg-compressed images in dual transform-pixel domain,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5171?5178. [83] J. Guo and H. Chao, ?Building dual-domain representations for compression artifacts reduction,? in European Conference on Computer Vision, Springer, 2016, pp. 628?644. 304 [84] X. Zhang, W. Yang, Y. Hu, and J. Liu, ?Dmcnn: Dual-domain multi-scale convolutional neural network for compression artifacts removal,? 
in 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 390?394. [85] B. Zheng, Y. Chen, X. Tian, F. Zhou, and X. Liu, ?Implicit dual-domain convolutional network for robust color image compression artifact reduction,? IEEE Transactions on Circuits and Systems for Video Technology, 2019. [86] Z. Jin, M. Z. Iqbal, W. Zou, X. Li, and E. Steinbach, ?Dual-stream multi-path recursive residual network for jpeg image compression artifacts reduction,? IEEE Transactions on Circuits and Systems for Video Technology, 2020. [87] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, ?D3: Deep dual-domain based fast restoration of jpeg-compressed images,? in Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764?2772. [88] X. Fu, Z.-J. Zha, F. Wu, X. Ding, and J. Paisley, ?Jpeg artifacts reduction via deep convolutional sparse coding,? in Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision, 2019, pp. 2501?2510. [89] Y. Kim, J. W. Soh, J. Park, et al., ?A pseudo-blind convolutional neural network for the reduction of compression artifacts,? IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1121?1135, 2019. 305 [90] S. Zini, S. Bianco, and R. Schettini, ?Deep residual autoencoder for quality independent jpeg restoration,? arXiv preprint arXiv:1903.06117, 2019. [91] Y. Kim, J. W. Soh, and N. I. Cho, ?Agarnet: Adaptively gated jpeg compres- sion artifacts removal network for a wide range quality factor,? IEEE Access, vol. 8, pp. 20 160?20 170, 2020. [92] J. Jiang, K. Zhang, and R. Timofte, ?Towards flexible blind jpeg artifacts re- moval,? in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2021, pp. 4997?5006. [93] M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, ?Quantization guided jpeg artifact correction,? in European Conference on Computer Vision, Springer, 2020, pp. 293?309. [94] D. Kang, D. Dhar, and A. B. Chan, ?Crowd counting by adapt- ing convolutional neural networks with side information,? arXiv preprint arXiv:1611.06748, 2016. [95] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Car- roll, ?Burst denoising with kernel prediction networks,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2502?2510. [96] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., ?Rectifier nonlinearities improve neural network acoustic models,? in Proc. icml, Citeseer, vol. 30, 2013, p. 3. 306 [97] K. He, X. Zhang, S. Ren, and J. Sun, ?Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,? in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026?1034. [98] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., Gradient flow in recurrent nets: The difficulty of learning long-term dependencies, 2001. [99] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ?Image quality assessment: From error visibility to structural similarity,? IEEE transactions on image processing, vol. 13, no. 4, pp. 600?612, 2004. [100] A. Jolicoeur-Martineau, ?The relativistic discriminator: A key element miss- ing from standard gan,? in International Conference on Learning Represen- tations, 2018. [101] A. Radford, L. Metz, and S. Chintala, ?Unsupervised representation learn- ing with deep convolutional generative adversarial networks,? arXiv preprint arXiv:1511.06434, 2015. [102] T. Miyato, T. Kataoka, M. Koyama, and Y. 
Yoshida, ?Spectral normalization for generative adversarial networks,? arXiv preprint arXiv:1802.05957, 2018. [103] J. Johnson, A. Alahi, and L. Fei-Fei, ?Perceptual losses for real-time style transfer and super-resolution,? in European conference on computer vision, Springer, 2016, pp. 694?711. [104] S. Bell, P. Upchurch, N. Snavely, and K. Bala, ?Material recognition in the wild with the materials in context database,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3479?3487. 307 [105] D. P. Kingma and J. Ba, ?Adam: A method for stochastic optimization,? arXiv preprint arXiv:1412.6980, 2014. [106] I. Loshchilov and F. Hutter, ?Sgdr: Stochastic gradient descent with warm restarts,? arXiv preprint arXiv:1608.03983, 2016. [107] E. Agustsson and R. Timofte, ?Ntire 2017 challenge on single image super- resolution: Dataset and study,? in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 126?135. [108] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, ?A statistical evaluation of recent full reference image quality assessment algorithms,? IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440?3451, 2006. [109] Rawzor. ?Image compression benchmark.? (), [Online]. Available: http:// imagecompression.info/. [110] G. E. Hinton and S. Roweis, ?Stochastic neighbor embedding,? Advances in neural information processing systems, vol. 15, 2002. [111] M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, ?Analyzing and mit- igating jpeg compression defects in deep learning,? in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2357? 2367. [112] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, ?Improving the robustness of deep neural networks via stability training,? in Proceedings of the ieee conference on computer vision and pattern recognition, 2016, pp. 4480?4488. 308 [113] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ?Mo- bilenetv2: Inverted residuals and linear bottlenecks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510? 4520. [114] S. Xie, R. Girshick, P. Dolla?r, Z. Tu, and K. He, ?Aggregated residual trans- formations for deep neural networks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492?1500. [115] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ?Rethinking the inception architecture for computer vision,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818?2826. [116] T.-Y. Lin, M. Maire, S. Belongie, et al., ?Microsoft coco: Common objects in context,? in European conference on computer vision, Springer, 2014, pp. 740?755. [117] R. Girshick, ?Fast r-cnn,? in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440?1448. [118] S. Ren, K. He, R. Girshick, and J. Sun, ?Faster r-cnn: Towards real-time ob- ject detection with region proposal networks,? IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137?1149, 2016. [119] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dolla?r, ?Focal loss for dense object detection,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980?2988. 309 [120] K. He, G. Gkioxari, P. Dolla?r, and R. Girshick, ?Mask r-cnn,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961? 2969. [121] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. 
Torralba, ?Scene parsing through ade20k dataset,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. [122] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, ?Se- mantic understanding of scenes through the ade20k dataset,? arXiv preprint arXiv:1608.05442, 2016. [123] K. Sun, Y. Zhao, B. Jiang, et al., ?High-resolution representations for labeling pixels and regions,? arXiv preprint arXiv:1904.04514, 2019. [124] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ?Pyramid scene parsing net- work,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881?2890. [125] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, ?Unified perceptual pars- ing for scene understanding,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418?434. [126] D. Bolya, S. Foley, J. Hays, and J. Hoffman, ?Tide: A general toolbox for identifying object detection errors,? in ECCV, 2020. [127] D. Marpe, T. Wiegand, and G. J. Sullivan, ?The h. 264/mpeg4 advanced video coding standard and its applications,? IEEE communications maga- zine, vol. 44, no. 8, pp. 134?143, 2006. 310 [128] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, ?Overview of the high efficiency video coding (hevc) standard,? IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649?1668, 2012. [129] Bitmovin, Video developer report 2019, 2019. [Online]. Available: https: //go.bitmovin.com/video-developer-report-2019. [130] S. Goedegebure, A. Goralczyk, E. Valenza, et al. ?Big buck bunny.? (2008), [Online]. Available: https://peach.blender.org/. [131] International Telecommunication Union, ?Advanced video coding for generic audiovisual services,? Geneva, CH, Standard, Aug. 2021. [132] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown, ?Frame-Recurrent Video Super-Resolution,? in The IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), Jun. 2018. [133] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, ?Video enhancement with task-oriented flow,? International Journal of Computer Vision, vol. 127, no. 8, pp. 1106?1125, Aug. 2019, arXiv: 1711.09078, issn: 0920-5691, 1573- 1405. doi: 10.1007/s11263-018-01144-2. [134] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, ?Edvr: Video restoration with enhanced deformable convolutional networks,? in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition Workshops, 2019, pp. 0?0. [135] Y. Li, P. Jin, F. Yang, C. Liu, M.-H. Yang, and P. Milanfar, ?Comisr: Compression-informed video super-resolution,? in Proceedings of 311 the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, pp. 2543?2552. [136] T. Wang, M. Chen, and H. Chao, ?A novel deep learning-based method of improving coding efficiency from the decoder-end for hevc,? in 2017 Data Compression Conference (DCC), IEEE, 2017, pp. 410?419. [137] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, ?Enhancing quality for hevc compressed videos,? IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2039?2054, 2018. [138] R. Yang, M. Xu, Z. Wang, and T. Li, ?Multi-frame quality enhancement for compressed video,? in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018, pp. 6664?6673, isbn: 978-1-5386- 6420-9. doi: 10.1109/CVPR.2018.00697. [Online]. Available: https:// ieeexplore.ieee.org/document/8578795/. [139] J. Deng, L. Wang, S. Pu, and C. 
Zhuo, ?Spatio-temporal deformable convo- lution for compressed video quality enhancement,? Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 0707, pp. 10 696?10 703, Apr. 2020, issn: 2374-3468. doi: 10.1609/aaai.v34i07.6697. [140] Q. Xing, Z. Guan, M. Xu, R. Yang, T. Liu, and Z. Wang, ?Mfqe 2.0: A new approach for multi-frame quality enhancement on compressed video,? IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 949?963, Mar. 2021, arXiv: 1902.09707, issn: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2019.2944806. 312 [141] Q. Ding, L. Shen, L. Yu, H. Yang, and M. Xu, ?Patch-wise spatial-temporal quality enhancement for hevc compressed video,? IEEE Transactions on Im- age Processing, vol. 30, pp. 6459?6472, 2021. [142] M. Zhao, Y. Xu, and S. Zhou, ?Recursive fusion and deformable spatiotem- poral attention for video compression artifact reduction,? in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5646?5654. [143] M. Schuster and K. K. Paliwal, ?Bidirectional recurrent neural networks,? IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673?2681, 1997. [144] M. Ehrlich, J. Barker, N. Padmanabhan, et al., ?Leveraging bitstream meta- data for fast and accurate video compression correction,? arXiv preprint arXiv:2202.00011, 2022. [145] Z. Teed and J. Deng, ?Raft: Recurrent all-pairs field transforms for optical flow,? in European conference on computer vision, Springer, 2020, pp. 402? 419. [146] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, ?Eca-net: Efficient channel attention for deep convolutional neural networks, 2020 ieee,? in CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020. [147] A. Vaswani, N. Shazeer, N. Parmar, et al., ?Attention is all you need,? in Advances in neural information processing systems, 2017, pp. 5998?6008. [148] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud, ?Two de- terministic half-quadratic regularization algorithms for computed imaging,? 313 in Proceedings of 1st International Conference on Image Processing, IEEE, vol. 2, 1994, pp. 168?172. [149] M. Arjovsky, S. Chintala, and L. Bottou, ?Wasserstein generative adversarial networks,? in International conference on machine learning, PMLR, 2017, pp. 214?223. [150] M. Chu, Y. Xie, J. Mayer, L. Leal-Taixe?, and N. Thuerey, ?Learning tem- poral coherence via self-supervision for gan-based video generation,? ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 75?1, 2020. [151] S. Tomar, ?Converting video formats with ffmpeg,? Linux Journal, vol. 2006, no. 146, p. 10, 2006. [152] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, ?The unrea- sonable effectiveness of deep features as a perceptual metric,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586?595. [153] A. Mercat, M. Viitanen, and J. Vanne, ?Uvg dataset: 50/120fps 4k sequences for video codec analysis and development,? in Proceedings of the 11th ACM Multimedia Systems Conference, 2020, pp. 297?302. [154] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, ?Video compression through im- age interpolation,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 416?431. [155] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, ?Dvc: An end-to- end deep video compression framework,? in Proceedings of the IEEE/CVF 314 Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 006? 11 015. [156] J. Liu, S. Wang, W.-C. 
Ma, et al., ?Conditional entropy coding for effi- cient video compression,? in Computer Vision?ECCV 2020: 16th European Conference, Glasgow, UK, August 23?28, 2020, Proceedings, Part XVII 16, Springer, 2020, pp. 453?468. [157] H. Chen, B. He, H. Wang, Y. Ren, S.-N. Lim, and A. Shrivastava, ?Nerv: Neural representations for videos,? arXiv preprint arXiv:2110.13903, 2021. [158] R. Yang, Y. Yang, J. Marino, and S. Mandt, ?Hierarchical autoregressive modeling for neural video compression,? arXiv preprint arXiv:2010.10258, 2020. [159] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, ?Learning for video com- pression with hierarchical quality and recurrent enhancement,? in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6628?6637. [160] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici, ?Scale-space flow for end-to-end optimized video compression,? in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2020, pp. 8503?8512. [161] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ?Grad-cam: Visual explanations from deep networks via gradient-based lo- 315 calization,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618?626. [162] E. Agustsson, F. Mentzer, M. Tschannen, et al., ?Soft-to-hard vector quantization for end-to-end learning compressible representations,? in Ad- vances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, et al., Eds., vol. 30, Curran Associates, Inc., 2017. [Online]. Available: https : / / proceedings . neurips . cc / paper / 2017 / file / 86b122d4358357d834a87ce618a55de0-Paper.pdf. [163] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, ?Conditional probability models for deep image compression,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394?4402. [164] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, ?Gen- erative adversarial networks for extreme learned image compression,? in Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 221?231. [165] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, ?High-fidelity generative image compression,? Advances in Neural Information Processing Systems, vol. 33, pp. 11 913?11 924, 2020. [166] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, ?Im- plicit neural representations with periodic activation functions,? Advances in Neural Information Processing Systems, vol. 33, pp. 7462?7473, 2020. 316 [167] O. Rippel, A. G. Anderson, K. Tatwawadi, S. Nair, C. Lytle, and L. Bourdev, ?Elf-vc: Efficient learned flexible-rate video coding,? in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 479? 14 488. [168] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, ?Pixel recurrent neural networks,? in International conference on machine learning, PMLR, 2016, pp. 1747?1756. [169] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., ?Conditional image generation with pixelcnn decoders,? Advances in neural information processing systems, vol. 29, 2016. [170] E. Hoogeboom, J. Peters, R. Van Den Berg, and M. Welling, ?Integer discrete flows and lossless compression,? Advances in Neural Information Processing Systems, vol. 32, 2019. [171] R. v. d. Berg, A. A. Gritsenko, M. Dehghani, C. K. S?nderby, and T. 
Sal- imans, ?Idf++: Analyzing and improving integer discrete flows for lossless compression,? arXiv preprint arXiv:2006.12459, 2020. [172] S. Zhang, C. Zhang, N. Kang, and Z. Li, ?Ivpf: Numerical invertible vol- ume preserving flow for efficient lossless compression,? in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 620?629. 317 [173] C. S. Wallace, ?Classification by minimum-message-length inference,? in International Conference on Computing and Information, Springer, 1990, pp. 72?81. [174] G. Hinton and D. van Camp, ?Keeping neural networks simple by minimising the description length of weights. 1993,? in Proceedings of COLT-93, pp. 5? 13. [175] B. J. Frey and G. E. Hinton, ?Free energy coding,? in Proceedings of Data Compression Conference-DCC?96, IEEE, 1996, pp. 73?81. [176] B. J. Frey, Bayesian networks for pattern classification, data compression, and channel coding. Citeseer, 1998. [177] J. Townsend, T. Bird, and D. Barber, ?Practical lossless compression with latent variables using bits back coding,? arXiv preprint arXiv:1901.04866, 2019. [178] F. Kingma, P. Abbeel, and J. Ho, ?Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables,? in International Conference on Machine Learning, PMLR, 2019, pp. 3408?3417. [179] J. Townsend, T. Bird, J. Kunze, and D. Barber, ?Hilloc: Lossless im- age compression with hierarchical latent variable models,? arXiv preprint arXiv:1912.09953, 2019. [180] J. Ho, E. Lohn, and P. Abbeel, ?Compression with flows via local bits-back coding,? Advances in Neural Information Processing Systems, vol. 32, 2019. 318 [181] F. Mentzer, L. V. Gool, and M. Tschannen, ?Learning better lossless com- pression using lossy compression,? in Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2020, pp. 6638?6647. 