ABSTRACT

Title of Dissertation: THE FIRST PRINCIPLES OF DEEP LEARNING AND COMPRESSION

Max Ehrlich, Doctor of Philosophy, 2022

Dissertation Directed by: Professor Abhinav Shrivastava, Department of Computer Science; Professor Larry S. Davis, Department of Computer Science

The deep learning revolution incited by the 2012 AlexNet paper has been transformative for the field of computer vision. Many problems which were severely limited using classical solutions are now seeing unprecedented success. The rapid proliferation of deep learning methods has led to a sharp increase in their use in consumer and embedded applications. One consequence of consumer and embedded applications is lossy multimedia compression, which is required to engineer the efficient storage and transmission of data in these real-world scenarios. As such, there has been increased interest in a deep learning solution for multimedia compression which would allow for higher compression ratios and increased visual quality.

The deep learning approach to multimedia compression, so-called Learned Multimedia Compression, involves computing a compressed representation of an image or video using a deep network for the encoder and the decoder. While these techniques have enjoyed impressive academic success, their industry adoption has been essentially non-existent. Classical compression techniques like JPEG and MPEG are too entrenched in modern computing to be easily replaced. This dissertation takes an orthogonal approach and leverages deep learning to improve the compression fidelity of these classical algorithms. This allows the incredible advances in deep learning to be used for multimedia compression without threatening the ubiquity of the classical methods.

The key insight of this work is that methods which are motivated by first principles, i.e., the underlying engineering decisions that were made when the compression algorithms were developed, are more effective than general methods. By encoding prior knowledge into the design of the algorithm, the flexibility, performance, and/or accuracy are improved at the cost of generality. While this dissertation focuses on compression, the high-level idea can be applied to many different problems with success. Four completed works in this area are reviewed. The first work, which is foundational, unifies the disjoint mathematical theories of compression and deep learning, allowing deep networks to operate on compressed data directly. The second work shows how deep learning can be used to correct information loss in JPEG compression over a wide range of compression qualities, a problem that is not readily solvable without a first principles approach. This allows images to be encoded at high compression ratios while still maintaining visual fidelity. The third work examines how deep learning based inferencing tasks, like classification, detection, and segmentation, behave in the presence of classical compression and how to mitigate performance loss. As in the previous work, this allows images to be compressed further, but this time without accuracy loss on downstream learning tasks. Finally, these ideas are extended to video compression by developing an algorithm to correct video compression artifacts. By incorporating bitstream metadata and mimicking the decoding process with deep learning, the method produces more accurate results with higher throughput than general methods.
This allows deep learning to improve the rate-distortion of classical MPEG codecs and competes with fully deep learning based codecs but with a much lower barrier-to-entry.

THE FIRST PRINCIPLES OF DEEP LEARNING AND COMPRESSION

by Max Ehrlich

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2022

Advisory Committee:
Professor Abhinav Shrivastava, Chair/Advisor
Professor Larry S. Davis, Co-chair/Advisor
Professor Wojciech Czaja, Dean's Representative
Professor Ramani Duraiswami
Professor Dinesh Manocha
Dr. Michael A. Isnardi
Professor David A. Forsyth

© Copyright by Max Ehrlich 2022

Preface

Multimedia compression is a critical feature of the modern internet [1]. Websites like Facebook, Instagram, and YouTube have increasingly coalesced around the sharing of images and video. Viewing and sharing such media are now prerequisite to modern internet interactions. When comparing media, we can create an approximate hierarchy with each successive level containing an order of magnitude more information: text, which comprises a simple linguistic description; images, which contain a full visualization of some scene; and videos, which contain a temporal evolution of the visualization of a scene. Naturally, as the amount of information contained in a particular medium increases, so too does the size of its digital representation.

Because of modern engineering constraints, it is not feasible to transmit image and video media in their native format (e.g., a 2D or 3D array of samples). As an example: a single frame of a 1080p video in a raw format, assuming single-byte samples in three colors, would require around 1 byte × 3 channels × 1920 × 1080 = 6,220,800 bytes ≈ 6 MB to represent it natively. Extending this to 30 seconds of video at 30 frames per second would require 6,220,800 bytes × 30 s × 30 fps = 5,598,720,000 bytes ≈ 5 GB. We can observe that most videos are longer than 30 seconds and 4K videos are becoming more common. Transmitting these media over modern cellular or even wired connections would be quite difficult. A typical home internet connection bandwidth ranges from 10-50 Mbps. For the video example, this would take 5,598,720,000 bytes × 8 / [10,000,000, 50,000,000] bps = [4478, 896] s, i.e., anywhere from 15 minutes to 1.2 hours for this short video. The situation is even worse on cellular connections, where LTE upload speeds range from 2-5 Mbps [2] (almost 3 hours for our example in the best case) and most users pay for a fixed amount of data.

To make the modern internet feasible, by reducing transmission and storage cost, we compress these media. For modern compression codecs, JPEG [3] can reduce the 6 MB image to around 25 kB in size, and H.264 [4] compression can reduce the 5 GB video to only a few megabytes depending on the spatial and temporal entropy. These impressive size reductions are a result of more than just entropy reducing operations: they also incur a loss of information. The removed information is designed to be as imperceptible as possible and is based on analyses of human visual perception. For images, we remove "high spatial frequencies" [5], or small spatial changes that would not ordinarily be noticed. For videos, we can take a further step to estimate motion between frames and encode only a description of the motion [6].
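These back-of-the-envelope numbers are easy to reproduce; the short Python sketch below recomputes them using the same illustrative resolution, duration, and bandwidth figures quoted above, and is not tied to any particular codec or dataset.

# Recomputing the raw-size and transfer-time estimates above.
bytes_per_sample = 1           # single-byte samples
channels = 3                   # three color channels
width, height = 1920, 1080     # one 1080p frame

frame_bytes = bytes_per_sample * channels * width * height
print(frame_bytes)             # 6220800 bytes, roughly 6 MB

seconds, fps = 30, 30
video_bytes = frame_bytes * seconds * fps
print(video_bytes)             # 5598720000 bytes, roughly 5 GB

# Transfer time over a typical 10-50 Mbps home connection.
for mbps in (10, 50):
    transfer_seconds = video_bytes * 8 / (mbps * 1_000_000)
    print(mbps, transfer_seconds)   # roughly 1.2 hours and 15 minutes

Nothing here depends on any codec; it is only the arithmetic that motivates compression in the first place.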
For modern codecs, the lossy effects are generally not noticeable to laymen, and codec development continues to improve visual fidelity and reduce file sizes year after year.

Despite these amazing advances in compression, there are still problems. In many parts of the world, including in rural America, many people use metered internet connections [7]. Under these connections they have a finite amount of data or pay for the data they use, similar to most modern cell phone plans. For these people, participation in the modern internet is increasingly difficult. Not only are they expected to upload their own media, but they must view others' media in order to take part in the discourse on many websites. For this class of consumer, it is paramount that as little data as possible be used during any internet transmission, precluding most videos and some images from being accessible. The internet is historically unprecedented as both an entertainment medium and a repository of human knowledge, and benefits from maximal participation. Therefore, in order to reach these people more effectively, it is critical that further advances in multimedia compression be developed.

Meanwhile, deep learning has revolutionized modern machine learning [8]-[10]. In deep learning, a model is trained to take an input in its native representation and learn a nonlinear mapping directly from it. This is in contrast to classical machine learning, which depended on engineered features which were extracted prior to mappings being computed. By taking the native input representation, the deep model can be organized into many layers which function as their own feature extractors. Instead of engineered features, the best features to solve a given problem are learned jointly with the mapping function. The development of these techniques has enjoyed unprecedented success in all areas of machine learning, and these models are being rapidly deployed in consumer settings to solve problems which were once thought to be impossible for computers to solve.

Unsurprisingly, given the previous discussion, one area of interest for deep learning applications is that of multimedia compression. And also unsurprisingly, deep learning has made amazing contributions here [11]-[16]. Deep models are able to compress both images and video significantly better than classical algorithms and with little loss of quality. Despite this, the classical algorithms stubbornly persist. JPEG files are still ubiquitous, and MPEG standards continue to be developed and deployed in consumer applications despite the amazing advances of machine learning. These algorithms, and their associated files, are simply too familiar and too ingrained in the code powering the modern internet to be easily replaced. Nevertheless, there is a plain socioeconomic need, as described previously, for deep learning in compression, as there is for anything that reduces the size of images and videos.

In This Dissertation

I take an orthogonal approach to multimedia compression in deep learning. In my approach, I develop deep learning methods which work with the existing compression algorithms rather than replace them. In this way, our algorithms are easy to integrate into the modern internet as simple pre- or post-processing steps on images or videos. These classical compression algorithms are developed with a series of engineering decisions that determine how much information is lost and the nature of that loss. I call these decisions "first principles"
and I develop machine learning algorithms that are explicitly aware of these decisions. I will show that this leads to a significant improvement in the fidelity and/or flexibility of the solution. These advances have greatly improved the practicality of deploying machine learning solutions to solve compression problems, although their potential applications are widespread.

Organization Of This Document

This document is organized into three parts. In the first part, I will discuss, briefly, background knowledge that a reader should be equipped with in order to have a full understanding of this dissertation. The next two parts discuss related works and my own contributions to image and video compression respectively. This dissertation is written in a conversational style, and beyond this preface I will often refer to the reader as "we". This indicates that "we", i.e., the reader and I, are discovering the knowledge together as the concepts in the dissertation are developed from prior work into completed topics. I strongly believe in the use of color for guidance. When I believe it will be helpful, I will use color in mathematics and figures to group related ideas. For complex mathematics specifically, I find this to be much clearer than braces alone, especially for hinting from early in a derivation which parts of long equations are related and will eventually be grouped together or cancelled. When useful for clarifying an algorithm I have included code listings. These listings are written in something approximating Python with PyTorch [17] APIs where deep learning is required. These code samples are not guaranteed to run exactly as they are written.

What This Document Is

This document is, first and foremost, a dissertation. This means that its primary purpose is to relay the unique contributions of the author over the course of about five years of research. The astute reader will notice, in the table of contents, section titles which are colored in Plum. These sections represent the unique contributions of my research program, i.e., papers which were published in the course of completing my Ph.D. These section titles are colored in the body of the document as well, so it is always easy for the reader to know if they are reading about background work or one of my contributions. Readers will, naturally, find these sections are the most detailed and well developed. In each of these chapters, I have included a dedicated section titled "Limitations and Future Directions". No scientific work is perfect and mine are no exception. I believe it is important to be up front about these limitations with a candid discussion along with guidance for future researchers in the field.

What This Document Is Not

This document is not a textbook or survey of multimedia compression algorithms and their relationship with deep learning, and readers should manage their expectations as such. For the purposes of imparting a full understanding of this dissertation's contributions to scientific discourse, there is a review of elementary concepts of mathematics, machine learning, and compression as well as an overview of related works and recent advances in machine learning. If, in the course of reading this dissertation, a reader gains any useful knowledge, this is welcome but entirely accidental.

Dedication

To my wife, Dr. Sujeong Kim. You supported me unwaveringly and unconditionally throughout this process and I am eternally grateful.
To my daughter Yena and my son Jaeo. Knowing you will be the greatest privilege of my life.

Acknowledgements

Special thanks to
• Christian Steinruecken for providing the math coloring function from his dissertation.
• My editors: Shishira Maiya, Lillian Huang, Vatsal Agarwal, and Namitha Padmanabhan.
• My social media consultant Gowthami Somepalli and her assistant Kamal Gupta.

The research presented in this dissertation was partially supported by Facebook AI, Defense Advanced Research Projects Agency (DARPA) MediFor (FA87501620191), DARPA SemaFor (HR001119S0085) and DARPA SAIL-ON (W911NF2020009) programs. There is no collaboration between Facebook and DARPA.

Table of Contents

Preface
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations

Part I: Preliminaries
Chapter 1: Linear Algebra
1.1 Scalars, Vectors, and Matrices
1.2 Bases and Finite Dimensional Vector Spaces
1.3 Infinite Dimensional Vector Spaces
1.4 Abstractions
Chapter 2: Multilinear Algebra
2.1 Tensors
2.2 Tensor Products and Einstein Notation
2.3 Tensor Spaces
2.4 Linear Pixel Manipulations
Chapter 3: Harmonic Analysis
3.1 The Fourier Transform
3.2 The Gabor Transform
3.3 Wavelet Transforms
3.3.1 Continuous and Discrete Wavelet Transforms
3.3.2 Haar Wavelets
Chapter 4: Entropy and Information
4.1 Shannon Entropy
4.2 Huffman Coding
4.3 Arithmetic Coding
Chapter 5: Machine Learning and Deep Learning
5.1 Bayesian Decision Theory
5.2 Perceptrons and Multilayer Perceptrons
5.3 Image Features
5.3.1 Histogram of Oriented Gradients
5.3.2 Scale-Invariant Feature Transform
5.4 Convolutional Networks and Deep Learning
5.5 Residual Networks
5.6 U-Nets
5.7 Generative Adversarial Networks
5.8 Recap

Part II: Image Compression
Chapter 6: JPEG Compression
6.1 The JPEG Algorithm
6.1.1 Compression
6.1.2 Decompression
6.2 The Multilinear JPEG Representation
6.3 Other Image Compression Algorithms
Chapter 7: JPEG Domain Residual Learning
7.1 New Architectures
7.1.1 Frequency-Component Rearrangement
7.1.2 Strided Convolutions
7.2 Exact Operations
7.2.1 JPEG Domain Convolutions
7.2.2 Batch Normalization
7.2.3 Global Average Pooling
7.3 ReLU
7.4 Recap
7.5 Empirical Analysis
7.6 Limitations and Future Directions
Chapter 8: Improving JPEG Compression
8.1 Pixel Domain Techniques
8.2 Dual-Domain Techniques
8.3 Sparse-Coding Methods
8.4 Summary and Open Problems
Chapter 9: Quantization Guided JPEG Artifact Correction
9.1 Overview
9.2 Convolutional Filter Manifolds
9.3 Primitive Layers
9.4 Full Networks
9.5 Loss Functions
9.6 Empirical Evaluation
9.6.1 Comparison with Other Methods
9.6.2 Generalization
9.6.3 Equivalent Quality
9.6.4 Exploring Convolutional Filter Manifolds
9.6.5 Frequency Domain Results
9.6.6 Qualitative Results
9.7 Limitations and Future Directions
Chapter 10: Task-Targeted Artifact Correction
10.1 Standard JPEG Compression Mitigation Techniques
10.2 Artifact Correction for Computer Vision Tasks
10.3 Effect of JPEG Compression on Computer Vision Tasks
10.4 Transferability and Multiple Task Heads
10.5 Understanding Model Errors
10.6 Limitations and Future Directions

Part III: Video Compression
Chapter 11: Modeling Time Redundancy: MPEG
11.1 Motion JPEG
11.2 Motion Vectors and Error Residuals
11.3 Slices and Quantization
11.4 Recap
Chapter 12: Improving Video Compression
12.1 Notable Methods for General Video Restoration
12.2 Single Frame Methods
12.3 Multi-Frame Methods
12.4 Summary and Open Problems
Chapter 13: Metabit: Leveraging Bitstream Metadata
13.1 Capturing GOP Structure
13.2 Motion Vector Alignment
13.3 Network Architecture
13.4 Loss Functions
13.5 Towards a Better Benchmark
13.6 Empirical Evaluation
13.6.1 Restoration Evaluation
13.6.2 Compression Evaluation
13.7 Limitations and Future Work

Part IV: Concluding Remarks

Part V: Appendix
Appendix A: Study on JPEG Compression and Machine Learning
A.1 Plots of Results
A.2 Tables of Results
Appendix B: Additional Results
B.1 Quantization Guided JPEG Artifact Correction
B.2 Task Targeted Artifact Correction
B.3 Metabit
Appendix C: Survey of Fully Deep-Learning Based Compression
C.1 Image Compression
C.2 Video Compression
C.3 Lossless Techniques
Glossary
Figure Credits
Bibliography
Index

List of Tables

7.1 Model Conversion Accuracies
8.1 Summary of JPEG Artifact Correction Methods
9.1 QGAC Quantitative Results
12.1 Summary of Video Compression Reduction Techniques
13.1 Metabit HEVC Results
13.2 Metabit AVC Results
13.3 Metabit GAN Numerical Results
A.1 Results for classification models
A.2 Results for detection models
A.3 Results for segmentation models
A.4 Reference results (results with no compression)

List of Figures

2.1 Grayscale Example Image
2.2 Grayscale Gaussian Smoothing
2.3 Color Example Image
2.4 Color Smoothing
2.5 Color Downsampling
2.6 Block Linear Map Example
3.1 Discrete Wavelet Transform
3.2 Morlet Wavelet
3.3 Wavelet Uncertainty
3.4 Dual Tree Complex Wavelet Transform
3.5 Haar Wavelet
3.6 DWT Using Haar Wavelets
4.1 The General Communication System
4.2 Huffman Tree Example
4.3 Arithmetic Coding Example
5.1 Salmon vs Sea Bass
5.2 Multilayer Perceptron
5.3 HoG Features
5.4 Difference of Gaussians
5.5 Convolutional Neural Network
5.6 U-Net
5.7 Residual Block
5.8 GAN Procedure
6.1 JPEG Information Loss
6.2 Zig-Zag Order
7.1 Frequency Component Rearrangement
7.2 Transform Domain Global Average Pooling
7.3 ReLU Approximation Example
7.4 Toy Network Architecture
7.5 ReLU Approximation Accuracy
7.6 Throughput Comparison
9.1 Overview
9.2 FCR With Grouped Convolutions
9.3 RRDB
9.4 8×8 stride-8 CFM
9.5 Restoration Networks
9.6 Subnetworks
9.7 GAN Discriminator
9.8 Quality Generalization
9.9 Increase in PSNR
9.10 Equivalent Quality Plots
9.11 Equivalent Quality Examples
9.12 Embeddings for Different CFM Layers
9.13 CFM Weight Visualization
9.14 Images Which Maximally Activate CFM Weights
9.15 Frequency Domain Results
9.16 Qualitative Results
10.1 Task-Targeted Artifact Correction
10.2 Performance Loss Due to JPEG Compression
10.3 Performance Loss with Mitigations
10.4 Transfer Results
10.5 Multiple Task Heads
10.6 MaskRCNN TIDE Plots
10.7 Mask R-CNN Qualitative Result
10.8 Model Throughput
11.1 Motion JPEG Comparison
11.2 Motion Vector Grid
11.3 Motion Vector Arrows
11.4 Motion Compensation and Error Residuals
11.5 Rate Control Comparison
11.6 Slicing Example
13.1 Capturing GOP Structure
13.2 Motion Vector Alignment
13.3 Motion Vectors vs Optical Flow
13.4 Metabit System Overview
13.5 LR Block
13.6 Metabit Critic Architecture
13.7 FPS vs Params
13.8 Rate-Distortion Comparison
13.9 Learned Compression Throughput Comparison
13.10 Metabit Restoration Example
13.11 Metabit Comparison
A.1 Overall Classification Results
A.2 Classification Results: MobileNetV2
A.3 Classification Results: VGG-19
A.4 Classification Results: InceptionV3
A.5 Classification Results: ResNeXt 50
A.6 Classification Results: ResNeXt 101
A.7 Classification Results: ResNet 18
A.8 Classification Results: ResNet 50
A.9 Classification Results: ResNet 101
A.10 Classification Results: EfficientNet B3
A.11 Overall Detection and Instance Segmentation Results
A.12 Detection Results: FastRCNN
A.13 Detection Results: FasterRCNN
A.14 Detection Results: RetinaNet
A.15 Instance Segmentation Results: MaskRCNN
A.16 Overall Semantic Segmentation Results
A.17 Semantic Segmentation Results: HRNetV2 + C1
A.18 Semantic Segmentation Results: MobileNetV2 + C1
A.19 Semantic Segmentation Results: ResNet 18 + PPM
A.20 Semantic Segmentation Results: ResNet 50 + UPerNet
A.21 Semantic Segmentation Results: ResNet 50 + PPM
A.22 Semantic Segmentation Results: ResNet 101 + UPerNet
A.23 Semantic Segmentation Results: ResNet 101 + PPM
B.1 Equivalent quality visualizations. For each image we show the input JPEG, the JPEG with equivalent SSIM to our model output, and our model output.
B.2 Frequency domain results 1/4
B.3 Frequency domain results 2/4
B.4 Frequency domain results 3/4
B.5 Frequency domain results 4/4
B.6 Model interpolation results 1/4
B.7 Model interpolation results 2/4
B.8 Model interpolation results 3/4
B.9 Model interpolation results 4/4
B.10 Qualitative results 1/4. Live-1 images.
B.11 Qualitative results 2/4. Live-1 images.
B.12 Qualitative results 3/4. Live-1 images.
B.13 Qualitative results 4/4. ICB images.
B.14 Fine Tuned Model Comparison
B.15 Off-the-Shelf Artifact Correction Comparison
B.16 Task-Targeted Artifact Correction Comparison
B.17 FasterRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
B.18 MaskRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
B.19 MobileNetV2, Ground Truth: "Pembroke, Pembroke Welsh corgi"
B.20 FasterRCNN
B.21 MaskRCNN
B.22 HRNetV2 + C1
B.23 Dark Region
B.24 Crowd
B.25 Texture Restoration
B.26 Compression Artifacts Mistaken for Texture
B.27 Motion Blur
B.28 Artificial

List of Abbreviations

CNN Convolutional Neural Network
CWT Continuous Wavelet Transform
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DST Discrete Sine Transform
DTCWT Dual Tree Complex Wavelet Transform
DWT Discrete Wavelet Transform
EXIF Exchangeable Image File Format
FCR Frequency-Component Rearrangement
GAN Generative Adversarial Network
JFIF JPEG File Interchange Format
MCU Minimum Coded Unit
MLP Multilayer Perceptron
RRDB Residual-in-Residual Dense Block
STFT Short-Time Fourier Transform

Part I: Preliminaries

Chapter 1: Linear Algebra

To begin the dissertation, we briefly review the fundamental ideas of linear algebra. These concepts are extremely important for modeling in the high dimensional spaces used by deep learning, and indeed defining what a high dimensional space actually is and how it behaves.
Generalizations of linear algebra, which we will cover in the next chapter, have a special relationship with the dissertation outside of this general importance: we will use these ideas to represent JPEG compression. Linear algebra also forms the basis of harmonic analysis, which is central to lossy image compression.

Warning: If you are familiar with the algebraic definitions of linear algebra, this chapter may seem somewhat hand-wavy. It is intended as a general introduction and we will generalize it later.

1.1 Scalars, Vectors, and Matrices

All concepts in mathematics relate back to the foundational idea of the number. For our purposes, we will call a single number a scalar. Scalars will be denoted as a lower case letter in regular font: a.

We can "stack" several scalars in rows or columns to create vectors. When we wish to call attention to a vector, we will use lowercase bold font. For example

b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_n \end{bmatrix}    (1.1)

is a vector made by stacking the n scalars in a column.

c = \begin{bmatrix} c_0 & c_1 & \cdots & c_n \end{bmatrix}    (1.2)

is also a vector made by stacking n scalars in a row. We will call n the dimension of the vector. Note that in general b ≠ c. We call b a vector and c a co-vector. The distinction will become important later. For now, we define the transpose operation on a vector which transforms a vector into a co-vector or a co-vector into a vector

c^T = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}    (1.3)

Given a scalar and a vector, we can multiply them to produce a new vector

d = ac    (1.4)

= \begin{bmatrix} ac_0 & ac_1 & \cdots & ac_n \end{bmatrix}    (1.5)

where each component of c was multiplied by a, thus scaling the vector by a, hence the name scalar. We can also add vectors by summing their components to produce another vector. Given a set of m vectors V of dimension n,

e = \sum_{v \in V} v    (1.6)

= \begin{bmatrix} \sum_{v \in V} v_0 & \sum_{v \in V} v_1 & \cdots & \sum_{v \in V} v_n \end{bmatrix}    (1.7)

We can now combine these operations to create one of the most fundamental ideas of linear algebra: the linear combination. A linear combination is the sum of the product of some number of vectors and scalars and therefore produces a new vector. Let S be a set of m scalars

g = \sum_{i=0}^{m} s_i v_i    (1.8)

for s_i ∈ S and v_i ∈ V. Given two vectors, we can multiply them by computing their inner product, which produces a scalar

f = \langle b, c \rangle = \sum_{i=0}^{n} b_i c_i    (1.9)

We define the l_2 norm of a vector as

\|b\|_2 = \sqrt{\langle b, b \rangle}    (1.10)

We call any vector u such that ||u||_2 = 1 a unit vector, noting that we normalize a vector by computing the vector b / ||b||_2. Any two vectors v and w such that

\langle v, w \rangle = 0    (1.11)

are said to be perpendicular or orthogonal to each other. The formula for the l_2 norm

\|b\|_2 = \sqrt{\sum_{i=0}^{n} b_i^2}    (1.12)

implies a general formulation for an l_n norm

\|b\|_n = \left( \sum_{i=0}^{n} b_i^n \right)^{\frac{1}{n}}    (1.13)

Another useful norm is the l_1 norm

\|b\|_1 = \sum_{i=0}^{n} \left| b_i \right|    (1.14)

Taking the limit as n → ∞ gives the l_∞ norm

\|b\|_\infty = \max(b_i)    (1.15)

We can create two dimensional arrays of scalars which we call matrices and which we denote with upper case normal font: A.

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}    (1.16)

A is said to be m × n dimensional. We multiply a matrix and a vector by taking the linear combination of each element of the vector with the corresponding column of the matrix

h = Ab = \sum_{i=1}^{n} b_i A_i    (1.17)

note that b_i is the ith element of b, which is a scalar, and A_i is the ith column of A, which is a vector. Note also that the result h is a vector of dimension m. This
This 6 equation implies that the number of columns in A must match the number of rows in b. We can extend this to matrix-matrix products, given an n?m matrix B ? C = AB? (1.18) ????(A1)T , B1? ?(A1)T , B2? ? ? ? ?(A1)T , B ?? m ?? ??? ? ?(A2)T , B ? ?(A2)T , B ? ? ? ? ?(A2)T1 2 , B ?m?? = ??? . ? ? (1.19) ? .. .. . . ... . . ???? ?(An)T , B1? ?(An)T , B2? ? ? ? ?(An)T , Bm? where Aj denotes the jth row1 from A. In other words, each entry in C is the inner product of the corresponding row of A with the corresponding column of B. This construct implies a particular identity matrix ?? ???1 0 ? ? ? 0???? ??? 0 1 ? ? ? 0?? I = ?? ? (1.20)? .. .. ?. .? . . . . ..???? 0 0 ? ? ? 1 Matrices are important for representing linear maps on vectors. A linear map is any map which preserves vector addition and scalar multiplication. We can es- sentially ?store? the coefficients of linear maps in matrices and use the action of matrix-vector multiplication to apply the map to a vector. 1We will use upper indices much more frequently than powers in this dissertation so it is advised to get familiar with the notation now. It will be left entirely to context to determine which we mean. 7 1.2 Bases and Finite Dimensional Vector Spaces Given a set of vectors B, we define the Span of B as the set of all linear combinations of the vectors in B. Formally, given a set of scalars S ??? ? ?|V | ?? ? span(B) = ? ? b ?i i?bi ? B, ?i ? S? (1.21) i=1 ? Given some arbitrary set of vectors V , we may wish to find B, i.e., a subset of vectors that spans V . If all elements of B are linearly independent, we say that B is a basis of V . The basis allows us to express any element of V in terms of scalars of elements of B and, in effect, defines V . There may be many bases for the same set of vectors, so we may wish to change the basis and we may wish to define a particular basis as canonical. For example, consider the vector space R3. We often choose the following basis ?? ?? ?? ? ? ???1 0? ? ?? ???? ???? ???? ???? 0??? e0 = ??0?? e1 = ??1?? e2 = ??0???? (1.22) 0 0 1 This canonical basis is desirable because the vectors are all orthonormal, i.e., they are all orthogonal to each other and have magnitude of 1 which means there is no ?rotation? or ?scaling? of the coordinates. Moreover, this basis makes it extremely 8 easy to express vectors in a familiar component notation. The vector ? ? ???1 v = ??? ? ? ??2???? (1.23) 3 is only expressed as such because we chose this canonical basis (implicitly) and defined v as v = 1e0 + 2e1 + 3e2 (1.24) If we wish to change the basis, we first write the coordinates of the new basis (B1) vectors in the old basis (B0) and then stack these vectors into a matrix A. We can then multiply any vector v0 written in terms of basis B0 by A to obtain the coordinates in terms of basis B1. v1 = Av0 (1.25) We make the following notes about bases for V 1. V has a basis 2. All bases of V have the same cardinality which is the dimension of V . If we write v ? V in coordinates and count the number of coordinates, that count will be the same as the number of basis vectors Note the implication of the last property, we can count the number of elements in 9 the dimension of V, therefore, V is a finite dimensional vector space. 1.3 Infinite Dimensional Vector Spaces Infinite dimensional vector spaces will play an important role in the later analysis of compression, although the results of this analysis will eventually be discretized for use on a computer. 
In principle, infinite dimensional vector spaces behave in much the same way as finite dimensional ones. While a full treatment of this topic is beyond the scope of the dissertation, we will make some definitions in this chapter which will be expanded upon later. Assume that f and g are members of an infinite dimensional vector space V. We can think about components of f and g as being indexed by any real number instead of a finite number of natural numbers. For example, we might have f(2) = 4 for the second component and f(−12.5) = −156.25 for the negative-twelve-point-five-th component. In other words, f and g are functions, and these functions are vectors in a vector space.

With that established, our next goal should be to produce a basis for these functions. After all, being able to express a function as the coefficients of some basis should have myriad uses, especially if we do not know exactly the form of the function we wish to express. We will develop this basis later in the dissertation, but for now we can define two important concepts: orthogonality of functions and normality of functions.

Recall that two vectors were said to be orthogonal if they point at right angles to each other, i.e., their inner product is zero. To determine orthogonality we need an inner product for functions. We make the following definition

\langle f(x), g(x) \rangle = \int_{-\infty}^{\infty} f(x) g(x)\, dx    (1.26)

which is exactly the same as the inner product formula in finite dimensions with the sum expanded to an integral. As usual, if ⟨f(x), g(x)⟩ = 0 then the functions are orthogonal. Next we need to define normality of a function. Recall that a vector was said to be normal if its length is 1. So we need a way of defining the "length" of a function. We make the following definition

\|f(x)\|_2 = \sqrt{\int_{-\infty}^{\infty} f^2(x)\, dx}    (1.27)

Once again, this is the same as the discrete formula using only an integral, and if ‖f(x)‖ = 1 then the vector is a normal vector. Given the tools to determine if a set of functions is orthonormal, we can now develop what is essentially a canonical basis for functions. This discussion will be continued in Chapter 3 (Harmonic Analysis).

1.4 Abstractions

While the geometric interpretations provide useful intuitions, there is a limit to how far we can take them mathematically. We conclude by briefly introducing the abstract forms of the ideas in this chapter.

A field F is a set on which addition and multiplication are defined. Specifically we define

+ : F × F → F    (1.28)
· : F × F → F    (1.29)

and stipulate that if they meet the following criteria for a, b, c ∈ F:

Associativity: addition and multiplication are associative: a + (b + c) = (a + b) + c and a · (b · c) = (a · b) · c
Commutativity: addition and multiplication are commutative: a + b = b + a and a · b = b · a
Identity: two different elements 0 and 1 exist that satisfy the additive and multiplicative identity respectively: a + 0 = a and a · 1 = a
Inverse: there exists an additive inverse −a and a multiplicative inverse a⁻¹ such that a + (−a) = 0 and a · a⁻¹ = 1
Distributivity: multiplication and addition distribute according to a · (b + c) = (a · b) + (a · c)

then F is a field. We define a vector space V over the field F in a similar way. We have two operations

+ : V × V → V    (1.30)
· : F × V → V    (1.31)

and we call V a vector space, elements of V vectors, and elements of F scalars if, for u, v, w ∈ V and a, b ∈
F,

Associativity: addition is associative: u + (v + w) = (u + v) + w
Commutativity: addition is commutative: u + v = v + u
Identity and Inverse: two elements 0 and −v exist such that v + 0 = v and v + (−v) = 0
Compatibility of Multiplication: scalar and field multiplication are compatible: a · (b · v) = (a · b) · v
Scalar Multiplication Identity: multiplication with the scalar identity: 1 · v = v
Distributivity: scalar multiplication is distributive with respect to both vector and field addition: a · (u + v) = a · u + a · v and (a + b) · v = a · v + b · v

Note that we made no mention of coordinates or numbers; we only defined sets and operations along with their behavior. With these definitions we can form linear combinations for w, v_0 ... v_N ∈ V and a_0 ... a_N ∈ F

w = \sum_{i=0}^{N} a_i \cdot v_i    (1.32)

Chapter 2: Multilinear Algebra

The previous chapter developed vectors and matrices, where vectors are a primary "mathematical object" in a high-dimensional space and a matrix represents a map which can transform that object. In a sense, this discussion feels unfinished. We had scalars which were zero dimensional, vectors which were one dimensional, and matrices which were two dimensional. Why stop there?

In this chapter we develop the extremely high level ideas of multilinear algebra, which generalizes linear algebra to higher dimensional objects. This is a large and complex topic of which we only need a small piece for understanding this dissertation; in fact, this entire chapter may be closer to the first lecture in a semester-long graduate course. This chapter will immediately obsolete the matrix and vector notation we introduced in the previous chapter for reasons which will be explained in the first section.

The primary goal of multilinear algebra is to study multilinear maps. Recall that linear maps are maps which preserve vector addition and scalar multiplication. More formally, we call A : V → V a linear map on the vector space V over the field F if, for v, u ∈ V and a ∈ F,

• A(v + u) = A(v) + A(u)
• A(a · v) = a · A(v)

A bilinear map is an extension of this concept to two arguments where the map is linear in each argument. We call B : V × V → V a bilinear map with vector space V and field F if, for v_0, v_1, u_0, u_1 ∈ V and a ∈ F,

• B(v_0 + v_1, u_0) = B(v_0, u_0) + B(v_1, u_0) and B(v_0, u_0 + u_1) = B(v_0, u_0) + B(v_0, u_1)
• B(a · v_0, u_0) = B(v_0, a · u_0) = a · B(v_0, u_0)

Continuing this until its natural end, we call a multilinear map a function of multiple arguments which is linear in each one¹. M : V × V × ⋯ × V → V is a multilinear map with v_{0⋯N}, u ∈ V and a ∈ F if

• M(v_0 + u, v_1, ⋯, v_N) = M(v_0, v_1, ⋯, v_N) + M(u, v_1, ⋯, v_N) and M(v_0, v_1 + u, ⋯, v_N) = M(v_0, v_1, ⋯, v_N) + M(v_0, u, ⋯, v_N), etc.
• M(a · v_0, v_1, ⋯, v_N) = M(v_0, a · v_1, ⋯, v_N) = ⋯ = M(v_0, v_1, ⋯, a · v_N) = a · M(v_0, v_1, ⋯, v_N)

We will represent multilinear maps using higher order objects called tensors. Perhaps surprisingly, the practical use of these concepts in the dissertation will still only be linear or bilinear maps; however, we leverage multilinear algebra by working on tensor inputs and outputs, which serve as a natural representation for images versus the vectors that are traditionally used in these maps.

¹One important thing to note at this point is that we can only tweak one argument at a time; we cannot, for example, compute B(v_0 + v_1, u_0 + u_1) and expect a linear result.

2.1 Tensors

Traditionally in computer science we think of tensors as multidimensional arrays of numbers.
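To make this concrete, the snippet below builds arrays of increasing order with PyTorch; the shapes are arbitrary examples. Note that the shape of an array records how many indices it has but nothing about the vector versus co-vector role attached to each index, which is the extra bookkeeping this chapter introduces.

import torch

scalar = torch.tensor(3.0)            # a single number
vector = torch.randn(8)               # a 1D array (a vector)
matrix = torch.randn(4, 8)            # a 2D array (a matrix)
volume = torch.randn(3, 16, 16)       # a 3D array, e.g. a color image
batch  = torch.randn(32, 3, 16, 16)   # a 4D array, e.g. a batch of images

for t in (scalar, vector, matrix, volume, batch):
    print(t.ndim, tuple(t.shape))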
Despite the protests of many physicists and mathematicians, this is a perfectly reasonable definition of a tensor. For example, we might have a 3- or 4- or 5D array of numbers and call this a tensor. In the mathematical sense, a tensor is a representation of a multilinear map. We will denote tensors and tensor spaces with uppercase math font: T.

Recall the concepts of vectors and co-vectors. We will refer to vector spaces as V and co-vector spaces as V*. It is important to keep in mind that although these spaces are related they are not the same. These vectors and co-vectors will, in a sense, be the primitives that we use to construct tensors. We will index vectors using upper indices and co-vectors using lower indices.

All tensors have a type, which is the primary way we will refer to them. Some texts refer to a tensor rank; we do not use this convention because it is ambiguous. Rank has other meanings in linear algebra, and tensor rank does not explain the composition of the tensor in terms of vectors and co-vectors. If we absolutely have to refer to the sum of the number of vector and co-vector spaces we will call this the order of the tensor, though this situation will be extremely rare. We will say that vectors are type-(1, 0) tensors and co-vectors are type-(0, 1) tensors. Matrices can then be type-(2, 0) tensors, type-(1, 1) tensors, or type-(0, 2) tensors. The distinction between types is important. Since matrices and vectors now have concrete definitions as tensors, this obsoletes our earlier notation which drew a distinction between them. From this point on, all non-scalars will be written in tensor notation.

2.2 Tensor Products and Einstein Notation

We construct arbitrary tensors using products of vectors and co-vectors. To do this we define the tensor product of two tensors. We will build up to this by revisiting some concepts from linear algebra. Given two vectors v, u in some vector space V on a field F, we defined the inner product as

\sum_{i=0}^{N} v^i u_i = a    (2.1)

where a ∈ F. Given a matrix (a type-(1, 1) tensor) M we can compute the matrix-vector product as

\sum_{i=0}^{N} M_i x^i = w    (2.2)

for w ∈ V. Similarly, we compute the matrix-matrix product given another matrix N as

\sum_{i=0}^{N} M_i N^i = O    (2.3)

These expressions can be simplified using Einstein notation [18]. In Einstein notation, repeated indices that appear as upper and lower indices are assumed to be summed out, allowing us to remove the summations from the previous equations. For example, the matrix-matrix product is now simply

M^j_i N^i_k = O^j_k    (2.4)

where the non-summed indices are added in for clarity. This is extremely important when working with general tensors because the expressions are quite verbose with summation notation. We will make heavy use of Einstein notation in this dissertation so it is important to understand it now. Given two arbitrary tensors we can now define the generic tensor product

T \otimes U = T^{u_0, u_1, \cdots, u_N}_{l_0, l_1, \cdots, l_N} U^{u'_0, u'_1, \cdots, u'_N}_{l'_0, l'_1, \cdots, l'_N} = V^{u_0, \cdots, u_N, u'_0, \cdots, u'_N}_{l_0, \cdots, l_N, l'_0, \cdots, l'_N}    (2.5)

Of course we are free to form other useful products for tensors. For example, given a type-(2, 3) tensor P and a type-(4, 2) tensor Q we could compute the type-(4, 3) tensor R as

P^{kml}_{ij} Q^{ij}_{abcd} = R^{kml}_{abcd}    (2.6)

where we have summed out the i, j indices. To construct a tensor from vectors and co-vectors we can use this tensor product. Consider the vectors u, v ∈ V and the co-vectors p, q, r ∈ V*.
We can construct a type-(2, 3) tensor from these by computing

u^i v^j p_k q_l r_m = T^{ij}_{klm}    (2.7)

In many situations it will be useful for us to raise or lower indices (sometimes called index juggling). In other words, given a tensor T_{ij} we may want to construct T^i_j or T^{ij}. These tensors are related to T_{ij} but they are not the same. We can accomplish this by multiplying T by the covariant or contravariant metric tensor, which relates the vector and co-vector spaces. These tensors are defined such that

g^{ik} g_{kj} = \delta^i_j    (2.8)

where δ is the Kronecker delta

\delta^i_j = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases}    (2.9)

a generalization of the identity matrix from linear algebra, g^{ij} is the contravariant metric (for converting co-vectors to vectors) and g_{ij} is the covariant metric (for converting vectors to co-vectors). For various reasons we will consider a general derivation of the metric tensors to be beyond the scope of this dissertation, and in fact we will always be using tensors defined with respect to the canonical basis, which has a metric of δ. This means we can freely raise and lower indices without considering the metric.

2.3 Tensor Spaces

If we needed to start with vectors every time we wanted to build a tensor it would quickly become unsustainable. Instead, we need a way to refer to tensor spaces, or sets of tensors. This is sometimes referred to as the intrinsic definition of a tensor. We again use the tensor product, but this time we use vector and co-vector spaces. Recalling the type-(2, 3) tensor T which we constructed from vectors and co-vectors, we can define T directly as

T \in V \otimes V \otimes V^* \otimes V^* \otimes V^*    (2.10)

in other words, V ⊗ V ⊗ V* ⊗ V* ⊗ V* defines a space of tensors. This space contains all tensors which can be constructed by the tensor product of V twice and V* three times. In other words, all tensors which can be built from an equation like Equation 2.7 but with any u, v ∈ V and p, q, r ∈ V*. For a generic tensor T, we say that it is of type-(p, q) for

T \in \underbrace{V \otimes \cdots \otimes V}_{p \text{ times}} \otimes \underbrace{V^* \otimes \cdots \otimes V^*}_{q \text{ times}}    (2.11)

This will be the primary convention that we use to define tensors in the rest of this dissertation. Note that this mimics some of the definitions from Section 1.4 (Abstractions) in that we no longer have need of coordinates; we only deal with arbitrary vector spaces, co-vector spaces, and the tensor products of their members, which is why this is called the intrinsic definition. We close by noting that although we have only used V and V*, in general, the vector spaces defining a tensor can be different provided that the spaces are defined over the same field.

Figure 2.1: Grayscale Example Image.

2.4 Linear Pixel Manipulations

With the boring theory out of the way we can look at an interesting practical application of tensors: linear pixel manipulations. By representing an image as a tensor we can compute many complex transformations of the image using other tensors. Some of these are not traditionally thought of as being "linear" when we restrict our thinking to two-dimensional matrices as linear maps that transform images through matrix multiplication. Instead of thinking of images as "collections of vectors", we treat the image as one object, a higher order tensor, and then we consequently define the linear map on this object in even higher dimensions.

More formally, we will deal with planar images. The image may have any number of channels but it always has two spatial dimensions. So a grayscale image would be a type-(0, 2) tensor.
A traditional color image would be a type-(0, 3) tensor. In most cases, even for color images, it will suffice to define linear maps as type-(2, 2) tensors which transform the spatial dimensions while preserving the channel dimension.

We begin with a simple example. Consider the example image in Figure 2.1. We can represent this grayscale image as a type-(0, 2) tensor I ∈ H* ⊗ W*. One simple linear manipulation we can perform on this image is Gaussian smoothing in a 3 × 3 window. We can represent this linear map as a type-(2, 2) tensor

G : H^* \otimes W^* \to H^* \otimes W^*    (2.12)

G \in H \otimes W \otimes H^* \otimes W^*    (2.13)

G^{ij}_{uv} = \begin{cases} 0.5 & i = u \wedge j = v \\ 0.125 & i = u \wedge (j = v - 1 \vee j = v + 1) \\ 0.125 & (i = u - 1 \vee i = u + 1) \wedge j = v \\ 0 & \text{otherwise} \end{cases}    (2.14)

From the first equation, we can see that G is a linear map on type-(0, 2) tensors that transforms them into type-(0, 2) tensors. From the second equation we see that G is a type-(2, 2) tensor (this is a consequence of the first equation). The third equation defines the form of G for arbitrary indices i, j, u, v. In this case, i, j index the input pixel and u, v index the output pixel; the value stored at the index is the coefficient of the pixel. That is 0.5 when the indices are equal and 0.125 for any neighboring pixels; all other pixels have a zero coefficient. We apply this linear map by computing

I'_{uv} = G^{ij}_{uv} I_{ij}    (2.15)

The result of this computation is shown in Figure 2.2.

Figure 2.2: Grayscale Gaussian Smoothing.

Next we can consider a color image. The color version of the example image is shown in Figure 2.3. Converting this color image to grayscale is a linear manipulation. We represent the color image as I ∈ P* ⊗ H* ⊗ W*. We then define the following linear map

Y : P^* \otimes H^* \otimes W^* \to H^* \otimes W^*    (2.16)

Y \in P \otimes H \otimes W \otimes H^* \otimes W^*    (2.17)

Y^{pij}_{uv} = \begin{cases} 0.299 & p = 0 \\ 0.587 & p = 1 \\ 0.114 & p = 2 \end{cases}    (2.18)

which comes directly from the grayscale conversion equation

Y = 0.299R + 0.587G + 0.114B    (2.19)

We apply this map as

I'_{uv} = Y^{pij}_{uv} I_{pij}    (2.20)

If we apply this map to the example image we get the same image as Figure 2.1.

Figure 2.3: Color Example Image.

Interestingly, we can apply G to this color image as well and it will perform correct smoothing on the color image (Figure 2.4). In this case we would be computing

I'_{puv} = G^{ij}_{uv} I_{pij}    (2.21)

and since G ∈ H ⊗ W ⊗ H* ⊗ W*, the channel dimension of I, P*, is preserved.

Figure 2.4: Color Smoothing.

Let's try something more interesting: resampling. We can define nearest neighbor up- and downsampling as linear maps. This works for both color and grayscale images; the computation is the same and the tensor will be type-(2, 2). For downsampling by a factor of 2 we define the following linear map

H : H^* \otimes W^* \to H'^* \otimes W'^*    (2.22)

H \in H \otimes W \otimes H'^* \otimes W'^*    (2.23)

H^{ij}_{uv} = \begin{cases} 1 & i = 2u \wedge j = 2v \\ 0 & \text{otherwise} \end{cases}    (2.24)

where H' and W' are vector spaces with half the dimension of H and W. We can define upsampling in a similar way. For upsampling by a factor of 2 we define the following linear map

D : H'^* \otimes W'^* \to H^* \otimes W^*    (2.25)

D \in H' \otimes W' \otimes H^* \otimes W^*    (2.26)

D^{ij}_{uv} = \begin{cases} 1 & i = \lfloor u/2 \rfloor \wedge j = \lfloor v/2 \rfloor \\ 0 & \text{otherwise} \end{cases}    (2.27)

We apply these maps by computing

I'_{uv} = H^{ij}_{uv} I_{ij}    (2.28)

I_{uv} = D^{ij}_{uv} I'_{ij}    (2.29)

for grayscale images and

I'_{puv} = H^{ij}_{uv} I_{pij}    (2.30)

I_{puv} = D^{ij}_{uv} I'_{pij}    (2.31)

for color images. The result for the color image is shown in Figure 2.5.

Figure 2.5: Color Downsampling.

Taking this further, we can define any convolution or cross-correlation using tensors.
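Before writing down the general convolution map, it may help to see the smoothing and grayscale maps above in code. The sketch below materializes G densely, following Eq. (2.14), and applies it with torch.einsum; the 16 × 16 size and the random image stand in for the example image, and the grayscale map is applied by contracting the channel index directly rather than building the full tensor of Eq. (2.18).

import torch

h = w = 16
image = torch.rand(3, h, w)        # placeholder color image I_{pij}

# Dense type-(2, 2) smoothing tensor G^{ij}_{uv} from Eq. (2.14).
G = torch.zeros(h, w, h, w)
for u in range(h):
    for v in range(w):
        G[u, v, u, v] = 0.5
        for du, dv in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            i, j = u + du, v + dv
            if 0 <= i < h and 0 <= j < w:
                G[i, j, u, v] = 0.125

# I'_{puv} = G^{ij}_{uv} I_{pij}, Eq. (2.21), written as an einsum contraction.
smoothed = torch.einsum("ijuv,pij->puv", G, image)

# Grayscale conversion, Eq. (2.19): contract the channel index with the weights.
weights = torch.tensor([0.299, 0.587, 0.114])
gray = torch.einsum("p,pij->ij", weights, image)

print(smoothed.shape, gray.shape)

For realistic image sizes one would never build G densely, since it has (hw)² entries; the point is only that the einsum strings line up index-for-index with the expressions above.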
This is reasonable since we know that convolutions are linear operations although we do not always see them written out as linear maps. We consider a general convolution kernel K with any shape. We will denote the shape of the 27 kernel as the tuple S = (s0, s1). Then we define the following linear map C : H? ?W ? ? H? ?W ? (2.32) ? C ? H ?W ?H? ?W ? (2.33)???? Cijuv = ?Ku?i+s0,v?j+s1 u? s0 ? i ? u+ s0 ? v ? s1 ? j ? v + s1??? (2.34)0 otherwise note that this does not consider a mapping between channels like we would use in a convolutional network (this is simple enough to add in though). We apply this to grayscale or color images as I ? ijuv = CuvIij (2.35) I ? ijpuv = CuvIpij (2.36) As a taste of what?s to come, let?s try something more surprising. It may sound surprising but breaking an image into evenly sized blocks is a linear operation, and we can derive a tensor which represents this map. We will first define two new co-vector spaces, the block dimensions M? and N?2. We will also define the spaces X? and Y ? with dimension equal to the dimension of H? and W ? divided by the block size (i.e., the number of blocks that can fit in the image). Then we can define 2For example if we wanted even 8? 8 blocks we might write these as R8? although I do not like this notation 28 a type-(2, 4) tensor, the linear map B : H? ?W ? ? X? ? Y ? ?M? ?N? (2.37) ? B ? H ?W ?X? ? Y ? ?M? ?N? (2.38)???? Bijxymn = ?1 pixel h,w belongs in block x, y at offset m,n??? (2.39)0 otherwise which may seem like kind of a let down but this is the canonical form we will use later in the dissertation. A more satisfying and programmer-oriented definition might be ??????1 x ? dim(M) +m = i ? y ? dim(N) + n = jBijxymn = ??? (2.40)0 otherwise We apply the map as I ? ijpxymn = BxymnIpij (2.41) Since this one might be a little confusing, consider a concrete example with the example image in Figure 2.1. This is a 16 ? 16 image and we want to break it into 8 ? 8 blocks, so there will be four total blocks in a 2 ? 2 grid (Figure 2.6). In this case, dim(M) = dim(N) = 8 and dim(X) = dim(Y ) = 2. So after applying B to the 16? 16 image we would get a tensor of shape 2? 2? 8? 8 giving the spatial arrangement of the 8? 8 blocks. 29 Figure 2.6: Block Linear Map Example. The blocks are arranged spatially but note that in tensor form there are separate indices for the block position and the 2D offset into each block. While this was a fun exercise the actual practical application of this idea is fairly limited since the tensors must be on the order of the image size. A critical component of the dissertation is that we can actually represent all of JPEG as a linear map. This is extremely powerful because linear maps are well studied phenomena, so expressing something as complex as JPEG as a single linear map gives us myriad tools for further analysis and manipulation. 30 Chapter 3: Harmonic Analysis Harmonic analysis is an invaluable tool for mathematics and engineering that enables some of the most important technologies in existence today. In Section 1.3 (Infinite Dimensional Vector Spaces) we touched briefly on the concept of infinite dimensional vector spaces and we noted that the vector space of functions of real variables is one such space. In this chapter we expand upon this idea and introduce the Fourier transform and harmonic analysis. The ideas we present in this chapter will be fundamental guiding principles behind image and video compression. 
Fourier was interested in solving the heat equation which describes the tem- perature of an ideal length of wire over space and time. The equation defines a function u(t, x) as a partial differential equation with conditions: ? ?2 u(t, x) = u(t, x) (3.1) ?t ?x2 u(0, x) = f(x) (3.2) u(t, 0) = 0 (3.3) u(t, 1) = 0 (3.4) for t ? 0 and x ? [0, 1]. 31 Of critical importance to us is the second equation which relates the form of u(t, x) at time t = 0 to some arbitrary function of space. Fourier showed that, since the other conditions of the heat equation yield a harmonic function, one composed of simple waves, f(x) must be able to be decomposed as such 1. While we will not go into the full derivation that Fourier used, or even touch on the modern understanding of the transform, we will show how this implies an orthonormal basis for functions which allows us to express them as a sum of coefficients of simple waves. Note There are many different ways to think about the fourier transform. Fourier was thinking in terms of the heat equations, many people like to envision a ?ma- chine? that isolates frequencies. I prefer the model which is motivated by linear algebra and that is what I discuss in this chapter although all views on the subject are equally correct and interesting. 3.1 The Fourier Transform Recall our definitions for the l2 norm and inner product of functions ?? ? ?f(x)? 22 =? f (x) dx (3.5)??? ?f(x), g(x)? = f(x)g(x) dx (3.6) ?? 1It is interesting to note that although this result is one of the most influential results in all of engineering, it was given negative reviews at the time Fourier published it. 32 given these tools we can try to find something resembling a canonical basis for func- tions. We would like a canonical basis to be a set of functions that is orthonormal, i.e., a set of functions which are all of unit length and which are all orthogonal to each other. Consider the functions sin(x) and cos(x). We can show easily that these func- tions are orthogonal to each other by solving ? ? ?sin(x), cos(x)? = sin(x) cos(x) dx (3.7) ?? We start by restricting the domain to [??, ?] since the sine and cosine functions are periodic. ? ? = sin(x) cos(x) dx (3.8) ?? Then we use substitution to solve the integral. Let u = cos(x) and du = ?sin(x) dx ? ? ? sin(x) cos(x) dx = u? ? du (3.9)?? = ? u du (3.10) 2 = ?u + C (3.11) 2 33 substituting and evaluating the result gives u2? ?cos 2(x) + C = +?? C (3.12)2 2 ? ?cos 2(x) + C?? (3.13)2 ?? cos2? (??) cos 2(?) = + = 0 (3.14) 2 2 so sine and cosine are indeed orthogonal. To check if they are normal we compute ? ? ? cos 2(x) dx (3.15) ?? ? sin2(x) dx (3.16) ?? We can solve the first integral with the trigonometric identity 2 1 + cos(2x)cos (x) = (3.17) 2 substituting gives ? ? 1 + cos(2x) ? dx (3.18)?? 2 1 ? (? = ? 1 + cos(2x) d)x (3.19)2 ?? 1 ? ? = dx+ cos(2x) dx (3.20) 2 ?? ?? 34 ( ) 1 sin(2x) ?????= x+ (3.21)2 2 x sin 2x ???????= + (3.22)2 4 ?? ? sin 2? ? ??= + ? sin?2? (3.23) 2 4 2 4 sin 2? ? sin?2?= ? + (3.24) 4 4 = ? (3.25) We get the same result for sine, so the functions are not normal but they can be easily made normal by dividing by ?. Therefore, sine and cosine seem like ideal candidates provided we can produce an infinite set from these two. In order to have a basis for the infinite dimensional space of functions we need an infinitely large set of basis vectors. Without further elaboration, the Fourier transform defines this set as {sin(?2?x?), cos(?2?x?)|? ? R} (3.26) or simply {e?2?ix? |? ? 
R} (3.27) Note that this is an uncountable infinite set of vectors, which is what we needed, 35 and we call ? the frequency. The actual integral transform is then ? ? F (?) = f(x)e?2?ix? dx (3.28) ?? Note that, as we described for the norm and inner product of functions, this is simply generalizing the expression for a linear combination of a finite dimensional vector with its basis vectors. As useful as this result is, it is not readily applicable to computation as is the case with many concepts dealing with infinity. We can, however, define the Discrete Fourier Transform (DFT) as the following type-(1, 1) tensor F ? CN ? CN? (3.29) ?1 ?2?imnFmn = e N (3.30) N F is a linear map F : CN ? CN acting on complex vectors of dimension N . Note that F is symmetric, i.e., F i = F jj i . For practical applications, this matrix multiply would be prohibitively expensive so we use the fast Fourier transform to recursively memoize the transform result reducing the number of computations to O(N log(N)). We do not describe this algorithm in detail here. There are some other transforms which are related to the DFT and are useful. Specifically a major downside to the DFT is the dependence on complex numbers. For many discrete applications, real numbers would work fine. This motivates the Discrete Sine Transform (DST) [19], [20] and the Discrete Cosine Transform (DCT) 36 [5]. These transforms can be thought of as taking only the imaginary (sine) or real (cosine) part of the DFT. We can get away with this on discrete samples by assuming that the signal, outside of the region we sampled, is an odd or even function. We are free to do this since we do not care at all about what the function actually looks like outside where we sampled so it does not need to be accurate. For our purposes, the DCT will play an outsize role since it is central to our later discussion of JPEG. The DST will come up briefly in video coding, however. The DCT can be defined differently depending on how boundary conditions are handled. We will not detail all of these, but the two important ones for us are the DCT-II, which we will call ?the DCT?, and is defined in two dimensions as ?N ?N [ ] [ ] i ?1 (2x+ 1)i? (2y + 1)j?Dj = C(i)C(j) cos cos (3.31) 2N 2N 2Nx=1 y=1 ??????1 u = 0 2 C(u) = ???? (3.32)1 u =? 0 and the DCT-III, which we will call ?the inverse DCT?, and is defined in two dimensions as ?N ?N [ ] [ ] (D?1)x ?1 (2x+ 1)i? (2y + 1)j?y = C(i)C(j) cos cos (3.33) 2N 2N 2N i=1 j=1 In both cases, C(u) is a scale factor which makes the transform orthonormal. As in the DFT, these are both linear maps, this time with D : RN ? RN and D?1 : 37 RN ? RN and are type-(1, 1) tensors. We note here an important theorem which will be useful for us later in the dissertation Theorem 1 (The DCT Least Squares Approximation Theorem). Given a set of N samples of a signal X, let Y be the DCT coefficients of X. Then for 1 ? m ? N the approximation of X given by ? ?m ( )1 2 k(2t+ 1)? pm(t) = ? y0 + yk cos (3.34) N N 2N k=1 minimizes the least-squared error ?N e = (p (i)? x )2m m i (3.35) i=1 Proof. First consider that since Equation 3.34 represents the Discrete Cosine Trans- form, which is a Linear map, we can write rewrite it as DTmy = x (3.36) where Dm is formed from the first m rows of the DCT matrix, y is a row vector of the DCT coefficients, and x is a row vector of the original samples. 
To solve for the least squares solution, we use the the normal equations, that is we solve D DTm my = Dmx (3.37) 38 and since the DCT is an orthonormal transformation, the rows of Dm are orthonor- mal, so DmD T m = I. Therefore y = Dmx (3.38) Since there is no contradiction, the least squares solution must use the first m DCT coefficients. A related transform to the ?trigonometric? transforms is the Hadamard trans- form or Walsh-Hadamard transform. The Hadamard transform defines the trans- formation matrix recursively as ? H0 =?1 (3.39)???Hm?1 Hm?1 ?H ?m = ? (3.40) Hm?1 ?Hm?1 The obvious advantage of this transform is that it contains only ?1 and 1 entries, so it can be computed quite efficiently without even multiplication operations (only sign changes are needed). 3.2 The Gabor Transform While the Fourier transform is useful for telling us what frequencies make up a given signal, it cannot tell us when those frequencies occur. It considers all the samples we have and tells us which frequencies explain all the samples. In some 39 cases, it would be useful to know both which frequencies occur and where they occur. For example, if we are examining seismic data, it may be important to know when high frequency vibrations occurred to predict the time of a future earthquake. With a Fourier transform, we would only know that there were high frequency vibrations. We can accomplish this in a naive way with a Short-Time Fourier Transform (STFT). The high level idea is extremely simple. The input signal is broken up into smaller blocks of time and the Fourier transform is computed on each block separately. Then, for each block of time we can see which frequencies are available, and we can adjust the block size to increase the time resolution. The Gabor transform is an interesting twist on this idea. Instead of a hard window, we use a soft window by convolving the Fourier transform with a Gaussian kernel. In a continuous representation this is ? ? 2 G(?, ?) = f(t)e??(t??) e?2?it? dt (3.41) ?? yielding amplitude results with time offsets ? as well as frequencies ? 2. While this yields a smooth windowed response in time, it still suffers from what we call the uncertainty principle which all STFTs are subject to. That is, the larger the time window, the worse the localization is, and the smaller the time window, the more constrained we are in the frequencies we can represent. Put another way, time-resolution and frequency-resolution are inverses: can only have one and not both. 2in contrast to the Fourier result which is amplitude vs frequency with no time component 40 Figure 3.1: Discrete Wavelet Transform. The DWT repeats the sampling pro- cess recursively on the low frequency band. To see this result, consider the DFT matrix given in Equation 3.30. This matrix has a finite number of frequencies that it can represent because of the discrete representation. The high frequency represents each sample in a single period. If we restrict the size of the DFT to windows, as in the Gabor transform, we reduce the size of this matrix and therefore we reduce the number of frequencies we can represent. Conversely, if we allow the size of the window to increase without bound, so as to get the best frequency resolution, we will eventually end up with a window size that is the length of the original signal and therefore is equivalent to the standard Fourier transform that has no temporal component at all. 
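The window-size trade-off is easy to see numerically. The sketch below, assuming NumPy (the signal and window lengths are arbitrary choices for illustration), computes a plain short-time Fourier transform by splitting a signal into fixed blocks and applying the DFT to each block; shrinking the block improves time localization while coarsening the frequency grid, and growing it does the reverse.

    import numpy as np

    fs = 1024                                  # samples per second
    t = np.arange(fs) / fs
    # A 50 Hz tone in the first half, a 200 Hz tone in the second half.
    signal = np.where(t < 0.5, np.sin(2 * np.pi * 50 * t),
                               np.sin(2 * np.pi * 200 * t))

    def stft(x, block):
        # One DFT per non-overlapping block of length `block`.
        frames = x[: len(x) // block * block].reshape(-1, block)
        return np.abs(np.fft.rfft(frames, axis=1))

    wide = stft(signal, 512)     # 2 time bins, 257 frequency bins
    narrow = stft(signal, 64)    # 16 time bins, 33 frequency bins
    print(wide.shape, narrow.shape)  # (2, 257) (16, 33)

With the wide window we can pinpoint the two frequencies precisely but can only say that one occurred in each half of the signal; with the narrow window we know roughly when the switch happened, but each tone is smeared across fewer, coarser frequency bins.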
As we will see in the next section, this uncertainty principle extends to more sophisticated methods and is a fundamental limitation of harmonic analysis. 3.3 Wavelet Transforms Wavelet transforms extend the concept of the STFT to what, at the time of writing, can be considered its natural end. Instead of using sine and cosine bases, 41 Figure 3.2: Morlet Wavelet. The Morlet wavelet illustrates the high amplitude in the center of the wave with decreasing amplitude moving to the sides. Image credit: Wikipedia. the wavelet transform defines other functions which have ?finite support?. In other words, they have a high amplitude at time t = 0 with the amplitude gradually decreasing as t moves away from 0 (this is shown in Figure 3.2 with the Morlet wavelet). As in the STFT, this measures a local response to the wavelet. Then, as in the Gabor transform, we can slide the wavelet around by shifting it along the input signal to compute local responses at different times. The key improvement of wavelet transforms is that they include a term which controls the frequency of the wave. This allows for a full bank of frequencies to be computed at each time representing the response of the signal to wavelets of increasing frequency. Note that because of the uncertainty principal, this generates a tree-like structure. For a given time t, there may be multiple high frequency wavelet responses for a single low frequency wavelet (Figure 3.3). As in the last section, the more precisely we wish to describe the constituent frequencies in a signal the less precisely we can localize them in time. Unlike the last section, however, we can still localize the high frequencies well even if we cannot localize the low frequencies, with a STFT, our localization capability is defined entirely by the block size (or Gaussian standard deviation for the Gabor transform). Since we examine the same signal at 42 Figure 3.3: Wavelet Uncertainty. The low frequency wavelet has poor time resolution, in other words, we cannot tell as exactly the time where that frequency occured as we can with the high frequency wavelets. Image credit: wikipedia. multiple scales, or resolutions, we call this multiresolution analysis . Formally, we define a mother wavelet ?(t) which we can then shift and scale as desired. This yields a basis for the space of functions, just as with the fourier transform, given by the following set { ???? ( )}1 t? sW = ??,s(t) ? ? R, s ? R, ??,s(t) = ? ? (3.42)? ? where ? determines the frequency (or scale) of the wavelet and s determines the shift. We then compute the integral transform ? ? T (?, s) = f(t)??,s(t) dt (3.43) ?? for a function of time (a signal) f(t). Just as with the Fourier transform this is simply a linear combination of the signal with each of the basis entries, but we have generalized from the Fourier basis (e?2?it?) to the more general ?(f, s). In the rest of this section, we will discuss how to apply the wavelet transform to discrete signals and how certain important wavelets are defined. 43 3.3.1 Continuous and Discrete Wavelet Transforms As with the Fourier transform, in order to use these tools on real signals, we must discretize them for execution on a computer. There are several ways we can do this, the first one we will discuss is the Continuous Wavelet Transform (CWT) which, despite the name, is not exactly continuous. To define this we simply assume that the signal f(t) is finite and discretely sampled, and we rewrite the integral of Equation 3.43 as a sum ? 
T (?, s) = fk??,s(k) (3.44) k then we stipulate that the wavelet function have finite support, in other words, we assume that it is zero outside of a certain range so we can represent it with a finite number of samples. We can then define the wavelet transform using convolution. We define the kernel ( ) 1 ?T ? t ???,t = ? (3.45) ? ? for a wavelet with support T and we compute Tmst = ft ? ???,m+?T (3.46) The Discrete Wavelet Transform (DWT) takes this idea further. The idea is that instead of dealing with the wavelet basis change equations directly, we can 44 simply express the transform as a series of high pass/low pass filters which coarsely discretize the scale. We first construct convolution kernels for a high pass and low pass filter g and h and compute the convolutions ylow = f ? g (3.47) yhigh = f ? h (3.48) By definition, these filters pass half the frequencies they are given as input. There- fore, by the Nyquist Sampling Theorem, we can also discard half the samples of each result without losing information. We represent this with a downsampling by two operation (?) ylow = (f ? g) ? 2 (3.49) yhigh = (f ? h) ? 2 (3.50) This process is repeated recursively on ylow while yhigh is retained as an output. This yields a tree structure (Figure 3.1). We briefly mention a newer technique here, the Dual Tree Complex Wavelet Transform (DTCWT) [21]. This is a complex wavelet transform which is inspired by real cosine and imaginary sine components of the Fourier transform. The main advantage of this transform is shift invariance, i.e., a shift in the input signal yields the same transform coefficients. While the theory of the DTCWT is quite involved, the algorithm is simple assuming suitable wavelets exist. As in the DWT, high and 45 low pass filters are applied with the results decimated, only this time there are two wavelets producing two trees (Figure 3.4). The results of one tree are treated as the real part of a complex output and the results of the other tree are used as the imaginary part. All of these methods require suitable definitions of the ?(t) function. While the natural instinct is to choose orthogonal wavelets, biorthogonal wavelets, which relax the orthogonal constraint as long as the transform is still invertible, have also been shown to work well and have more flexibility in their design. Note that the definition of a basis does not require orthogonality. Common choices for ?(t) include the Haar wavelets (discussed next), the Morlet wavelet which is related to the Gabor transform, and the Daubechies wavelets among others. While most tasks will work fine with the simplistic Haar wavelets, knowing the properties of each wavelet to pick the ideal one for a given task can make a difference. 3.3.2 Haar Wavelets The Haar wavelet is one of the most simple and popular choices for ?(t). It is defined as ??????????1 0 ? t ? 12 ?(t) = ?????1 1?? ? t ? 1 (3.51) 2 ???0 otherwise 46 Figure 3.4: Dual Tree Complex Wavelet Transform. The DTCWT is computed in the same way as the DWT but with two trees. Figure 3.5: Haar Wavelet. Frequency increases vertically, time increase to the right. The Haar wavelet transform is simple to implement and computationally efficient leading to its widespread use. The wavelets have compact support and are orthogo- nal making the Haar transform effective for conducting localized frequency analysis, in fact they were the first attempt at a basis for multiresolution analysis. 
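As a concrete illustration of the filter-bank view, here is a minimal single-level Haar DWT in NumPy (the normalization by the square root of two and the function name are my choices; library implementations such as PyWavelets handle boundary conditions and multiple levels).

    import numpy as np

    def haar_dwt_1d(f):
        # Haar low-pass (averaging) and high-pass (differencing) filters,
        # followed by the downsample-by-two step of Equations 3.49 and 3.50.
        f = np.asarray(f, dtype=float)
        low = (f[0::2] + f[1::2]) / np.sqrt(2)
        high = (f[0::2] - f[1::2]) / np.sqrt(2)
        return low, high

    x = np.array([4.0, 4.0, 8.0, 8.0, 1.0, 5.0, 3.0, 3.0])
    ylow, yhigh = haar_dwt_1d(x)
    # ylow holds the local averages (to be decomposed recursively),
    # yhigh holds the local differences that are kept as detail output.
    print(ylow, yhigh)

Repeating the same step on ylow yields the recursive tree of Figure 3.1.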
The 1D Haar wavelet is plotted in Figure 3.5 for three frequencies and several shifts per fre- quency. Note that the time axis (horizontal) spans from 0 to 1. The Haar wavelet has very compact support, outside the support region, which naturally shrinks with 47 Figure 3.6: DWT Using Haar Wavelets. The left image is the single level DWT of the right image. Note that each filtered image is stored at half the resolution in the width and height so each of the four filtered images can be arranged in the same shape as the original image. increasing frequency, the value of the wavelet is zero, so any samples outside the considered region contribute no information to the frequency response. In the 1D transform the wavelet was measuring differences along the time axis to measure the frequency response. In 2D, we must consider differences on two axes including the diagonal (both axes simultaneously). Figure 3.6 shows an example of this for a single level DWT. Note that each of the four frequency bands, called LL, HL, LH, HH, are stored at half the width and height leading to the 4? 4 arrangement on the left hand side. In this case, the top-left is the LL band, the top right is the LH band, the bottom left is the HL band, and the bottom right is the HH band. Note the different features that each band responds to: the HL and LH bands respond to horizontal and vertical structures respectively and the HH band respond to diagonal structures. 48 While the Haar transform?s simplicity and effectiveness allow for widespread use there may be more suitable wavelets for a given task. The Daubechies wavelets [22] in particular have come into common use as they were designed based on the analysis of Ingrid Daubachies who made numerous contributions to multiresolution analysis. For example Daubachies showed that if the number of vanishing moments is N , then the support of the wavelet is at least 2N ? 1. Vanishing moments, which relate the wavelet to a polynomial, can be of critical importance in choosing a wavelet if there is some understanding of the function to be analysed. Generally, a wavelet with N vanishing moments is orthogonal to a polynomial of degree N ?1. In this section we covered only the most basic ideas of multiresolution analysis as it does not factor into the work of this dissertation. However, the wavelet trans- form, which was a critical part of the last decade of signal processing, is now making its way rapidly into deep learning applications [23]?[25] 3 so knowledge of these tech- niques will rapidly become important for the computer vision researchers. 3among many others including currently unpublished work. 49 Chapter 4: Entropy and Information Information theory marked a major advancement in the understanding of com- munication. Claude Shannon?s 1943 paper ?A Mathematical Theory of Communi- cation? was rare in that it both introduced the field of information theory and then systematically solved all major problems within it, essentially an entire field in one paper. Importantly for us, Shannon?s formulations for measuring the information contained in a message gave rise to lossless compression algorithms which are still used to this day. In this chapter, we review the high level ideas of information theory, specifically entropy , and how these ideas were used to develop compression algorithms. The overall goal of information theory [26] is to measure the amount of infor- mation contained in a signal. The signal can be discrete (e.g., words) or continuous (e.g., television, sound, etc.). 
Shannon was responding to a recent development in communication: modulation. These techniques were rudimentary lossy compression methods which introduced noise into the messages in exchange for reducing their size (similar to JPEG and MPEG as we will see later). Exactly how much noise was introduced and the limits of the system with respect to how much noise would make the message unintelligible was a mystery. As expected this was preventing the 50 full and effective use of these technologies, since operators would either introduce too much distortion and be left with an unintelligible message or introduce too little noise and be faced with transmission delay. 4.1 Shannon Entropy Mathematically, we are free to make any choice to define a measure of infor- mation. In other words, any monotonic function of the number of possible messages since all are equally likely. However, Shannon chooses to define information on a log scale since it has some useful properties ? Many practical properties vary with the logarithm. For example, two wires have double the bandwidth of one wire. ? It makes the math considerably easier since logarithms have nice properties around addition, multiplication, differentiation, etc. therefore, we define1 the ?amount of information? I as I ? log(M) (4.1) for some message M . For logarithm base 2, we will call the unit of information ?bits?. Since this is our measure of information, we can also measure the information 1Note that I am choosing these words carefully. We are deciding to measure information in this way and developing a field around that decision rather than measuring some natural property of the world like a physicist might. 51 INFORMATION SOURCE TRANSMITTER RECEIVER DESTINATION SIGNAL RECEIVED SIGNAL MESSAGE MESSAGE NOISE SOURCE Figure 4.1: The General Communication System. One of Shannon?s most important contributions was the idea that any communication system can be divided into parts and developed separately. Image credit: Claude Shannon [26]. capacity of a channel as log(N(t)) C = lim (4.2) t?? t where N(t) messages can be transmitted in time t. Before we continue, however, we touch on one of Shannon?s most influential contributions. That is the general definition of a communication system, given in Figure 4.1. Shannon showed that any communication system consists of the same fundamental parts. Even systems such as telegraphy and color television which seem very different from each other are fundamentally the same. This model drives much of Shannon?s analysis of information content. Since the communication system must be designed to support any possible message, we must take a probabilistic approach to describing the generation of messages by the information source. In other words, for a discrete communication 52 system, the information source will generate messages by producing discrete symbols one at a time. The generation of a given symbol is determined based on the past symbols and we can therefore compute a probability for each symbol. As an example of this consider the English language. Given a set of letters: ?FIRE BA? we can say that the letter ?D? is highly likely to be the next letter. This is a Markov process and while incredibly complicated to produce for real scenarios, Markov modeling would allow us to produce probabilities for each symbol. The important point here is that since we are fairly certain about ?D?, a ?D? being gen- erated has low information and therefore requires less space to transmit. Something unexpected like an ?X? 
would have high information content. So we can represent expected or frequent results with fewer bits. As another example, assume I wish to communicate the weather in Seattle, and I know that there is a 100% chance of rain in Seattle. This information can be transmitted with zero bits, since there is no need to communicate anything. Suppose that I instead wish to communicate the weather in College Park, where it rains roughly 50% of the time; then I would require the same number of bits to transmit raining or sunny.

So now we have established an intuitive idea of the information content of a message: we are measuring how "expected" or "surprising" or "random" a message appears. Given a set of symbols with probabilities $p_i$ for the $i$th symbol, we define the entropy $H$ as

\[ H = -\sum_{i=1}^{N} p_i \log p_i \tag{4.3} \]

This measure has some important properties:

- $H = 0$ if and only if all of the $p_i$ are zero except for one; in other words, there is only one symbol and it always occurs (like the Seattle example). This means there is no entropy.
- $H$ is maximized when all $p_i$ are the same ($\frac{1}{N}$), since this is the most uncertain situation (like the College Park example).

At this point we have developed information theory to the barest minimum extent needed to define the entropy of a discrete channel. We are not taking into account noise or continuous signals, all of which are discussed at length in Shannon's paper along with much more thorough derivations. We have already touched on the idea that low entropy symbols can be represented with fewer bits. In the next two sections we develop algorithms for computing these representations. These methods are examples of lossless compression, where all information in the original message is preserved.

4.2 Huffman Coding

Huffman coding [27] is a method for producing optimal length codes for symbols based on their probability of occurrence. It was the first method for finding optimal codes (Shannon presented a method which was not guaranteed to be optimal) and it is still in heavy use by image and video codecs at the time of writing, 70 years after its invention.

Figure 4.2: Huffman Tree Example. The tree structure assigns the shortest code to the most probable symbol and the longest code to the least probable symbol.

Huffman coding requires a set of symbols and their probabilities of occurrence as input. Then, given a message as a sequence of symbols, the algorithm produces the minimum length code that uniquely conveys the message. This requires assigning the shortest codes to the most probable symbols and the longest codes to the least probable symbols. We do this using a binary tree. Start with a leaf node for each symbol that stores the probability of that symbol and insert them into a priority queue. Then, at each step, remove the two nodes with the lowest probability and merge them into an internal node with probability equal to the sum of the probabilities of these nodes. Insert this new node into the priority queue and repeat until the queue has only one node on it. This node is the root of the tree. The process is a simple greedy algorithm. Approximate code is given in Listing 4.1.

Listing 4.1: Building a Huffman Tree.

    import heapq
    from collections import namedtuple
    from itertools import count
    from typing import List, Tuple

    Node = namedtuple("Node", ["probability", "symbol", "left", "right"])

    def build_tree(symbols: List[Tuple[float, str]]) -> Node:
        # Heap entries are (probability, tiebreaker, node); the counter breaks
        # ties so that Node objects are never compared directly.
        tie = count()
        heap = [(p, next(tie), Node(p, s, None, None)) for p, s in symbols]
        heapq.heapify(heap)
        while len(heap) > 1:
            lp, _, left = heapq.heappop(heap)
            rp, _, right = heapq.heappop(heap)
            merged = Node(lp + rp, None, left, right)
            heapq.heappush(heap, (merged.probability, next(tie), merged))
        return heap[0][2]

To encode, for each symbol traverse the tree from the root, tracking the series of left and right children used in the traversal. Add a 0 for a left child and a 1 for a right child. When the correct leaf node is reached, the resulting string of 0s and 1s encodes the symbol. To decode, read one bit at a time and traverse the tree (left or right) based on the bit value. When a leaf node is encountered, emit that symbol and return to the root of the tree.

Let's consider a simple example. Suppose we are given a four letter alphabet with symbols $M = \{A, B, C, D\}$. These four symbols are known to occur with probabilities $P = \{p_A = 0.4, p_B = 0.35, p_C = 0.2, p_D = 0.05\}$. Since we have four symbols, the default encoding would be 2 bits per symbol, $\{A = 00, B = 01, C = 10, D = 11\}$. However, computing the entropy of the set $P$ gives

\[
\begin{aligned}
H(P) &= -\sum_{p \in P} p \log p && (4.4) \\
     &= -0.4 \log(0.4) - 0.35 \log(0.35) - 0.2 \log(0.2) - 0.05 \log(0.05) && (4.5) \\
     &= 0.529 + 0.530 + 0.464 + 0.216 && (4.6) \\
     &= 1.74 && (4.7)
\end{aligned}
\]

so approximately 1.74 bits, meaning that the default encoding of 2 bits wastes 0.26 bits per symbol on average. We construct a Huffman tree for this set in Figure 4.2. This gives the variable length codes $\{A = 0, B = 10, C = 110, D = 111\}$, obtained by traversing the tree for each symbol. Note that although some symbols now require 3 bits to encode, these are the least probable symbols, and the most probable symbol, A, requires only 1 bit. If we compute the average size of a symbol with these codes we get 1.85 bits/symbol, so we are still above the entropy limit. This is because a symbol cannot occupy a fraction of a bit.

4.3 Arithmetic Coding

Although Huffman codes are optimal in terms of the number of bits needed to encode single symbols, we saw that Huffman coding is not able to reach the theoretical minimum number of bits defined by the entropy of the set. By computing an encoding for an entire message rather than one symbol at a time, we can overcome this limitation. This is the motivation behind arithmetic coding, which stores an entire message in an arbitrary number $q$ such that $0 \leq q < 1$.

Once again the algorithm is given a set of symbols and their probabilities. The encoder starts with the interval $[0, 1)$ and divides it into sub-intervals, one for each symbol. The algorithm picks the sub-interval which corresponds to the current symbol and proceeds to the next symbol. When all symbols are consumed, the resulting interval uniquely identifies the message, and since the intervals are unique we only need to transmit a single element of the final interval to identify the message². To decode, we follow the same process, but this time we are given the number $q$. At each step we construct the same intervals and simply check which one the given number falls into, emitting that symbol at each step. This does require either a special terminating symbol or a known message length to stop. The algorithm is shockingly simple and highly effective. An example encoding is shown in Figure 4.3, and a small code sketch of the same interval-narrowing procedure is given below.

Figure 4.3: Arithmetic Coding Example. Using the same alphabet and probabilities as the last section, we encode ABD into the range [0.29, 0.3).
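The following sketch, assuming plain Python (the cumulative-probability loop and names are mine), narrows the interval for a message over the alphabet used above; it is the bare encoder idea only, without the bit-level output or termination handling a real coder needs.

    def arithmetic_encode(message, probs):
        # probs maps symbol -> probability, in a fixed order; returns the
        # final interval [low, high) that identifies the message.
        symbols = list(probs)
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s in symbols:
                if s == sym:
                    low, high = low + cum * width, low + (cum + probs[s]) * width
                    break
                cum += probs[s]
        return low, high

    probs = {"A": 0.4, "B": 0.35, "C": 0.2, "D": 0.05}
    print(arithmetic_encode("ABD", probs))

Any single number inside the returned interval (0.295 in the text's example) is enough to recover the message, provided the decoder also knows when to stop.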
In that example, we encode 2Specifically, enough bits such that any fraction beginning with the transmitted number falls into the desired interval. 58 the message ?ABD? following the same alphabet and probabilities we used for the Huffman coding example. We start by dividing [0, 1) into proportional parts for each symbol, we find that the first symbol is A so we choose the interval from [0, 0.4). Next we divide that interval into proportional parts and since the next symbol is B, we choose [0.16, 0.3) since 0.16 = 0.4 ? 0.4 and 0.3 = 0.16 + (0.4 ? 0.35). The final symbol is D so we choose the interval from [0.29, 0.3) and transmit (arbitrarily) 0.295. Again, decoding follows a similar process. We are given the number 0.295 as input and we divide up the interval [0, 1), finding that this falls into [0, 0.4), we emit A. Then we find that 0.295 falls into [0.16, 0.3) and we emit B. Finally, we find that 0.295 falls into [0.29, 0.3) and we emit D, having decoded the message ?ABD?. While it may seem remarkable that a message can be transmitted in a single number, the algorithm does have faults. Again, the message must fit into a discrete number of bits, which can reduce the efficiency compared to the theoretical maxi- mum. Furthermore, we are assuming that we have an accurate probability model of the symbol frequencies. This may not be possible to obtain exactly, and in fact, we may not even want global symbol probabilities. Since we are encoding a message, the most efficient encoding of that message would model the probabilities of symbols in that message only (e.g., 1 for A,B,D and 0 for C in our example). However, 3 this requires transmitting the probability model which may remove any gains in efficiency from the coding. In general, these are still open problems and while we can obtain ?optimal? codes with respect to some specific definition of optimal, the theoretical entropy limit that Shannon?s work gives us remains elusive. 59 Chapter 5: Machine Learning and Deep Learning Machine learning is rapidly revolutionizing the way that people interact with computers. This is largely driven by the explosive proliferation of Convolutional Neural Networks (CNNs) [28] since they were shown to be computationally viable for large problems in 2012 [8]. Although machine learning seems commonplace today, this was not the case ten years ago (at the time of writing) and there were many who believed that machine learning would never achieve widespread success. While this dissertation is centered on compression as an application, it is first and foremost a contribution to machine learning for computer vision. In this chapter, we develop a high-level understanding of machine learning concepts which relate to the rest of the dissertation. This discussion is grounded in Bayesian decision theory which is often overlooked in machine learning discourse. Otherwise, the focus is on computer vision methods rather than general methods. Note Some of the material in this chapter is based on the book Pattern Classi- fication [29] which I strongly recommend to interested readers for more in-depth information. 60 5.1 Bayesian Decision Theory Bayesian decision theory tells us the best possible decision we can make about data even if we know exactly the underlying generating distributions. In a sense this can be thought of as a best case scenario because in real life we do not know the underlying distributions so we must either approximate them or approximate deci- sion criteria directly. 
The classic example of this proceeds as follows. Dockworker Dave is observing fish as they are unloaded from boats. His task is to sort the fish into bins, one for sea bass which we will denote as cb and one for salmon which we will denote as cs. The fish come out of the boat randomly. In the absence of any other information (such as identifying markers), how can he develop a strategy to sort them with minimal errors? Let?s give Dave some knowledge to help. Since the fish are coming off the boat in a random order, we must describe the occurrence of each fish probabilistically. Assume that Dave knows how many fish were caught of each type, then he knows the prior probability P (cb) and P (cs). For example if P (cb) = 0.7 and P (cs) = 0.3 then Dave should classify all of the fish as bass and he will have 70% accuracy. Of course this will entail him dumping all fish into the bass bin which is a bit odd considering that he knows there are two types of fish. Nevertheless this strategy will attain the lowest error given what Dave knows. We can give Dave some more information to help him. Dave?s daughter Wendy studies fish and she informs him that the color can be used to differentiate bass from salmon although it is not a perfect indicator (see Figure 5.1). In this case 61 Figure 5.1: Salmon vs Sea bass. Top: Salmon, Bottom: Sea bass. The two fish have different colors. we would say that there is a continuous random variable x which yields conditional probabilities P (x|cb) which is the probability of each color value for sea bass and P (x|cs) for salmon. We call this the likelihood of the color given the type of fish and we will call the color of the fish a feature. How does Dave use this information? In order to make a decision given color, we want to compute P (cs|x) and P (cb|x), which we call posterior probability , and take the larger probability, but we only have P (cb), P (cs), P (x|cb), P (x|cs). We also know that there is a joint distribution for each class P (cs,b, x) which is the probability of a fish being class s or b and having color x that relates these quantities. From probability theory, we can write this in terms of the conditional P (cs,b, x) = P (cs,b|x)p(x) = P (x|cs,b)P (cs,b) (5.1) this is the definition of conditional probability. Rearranging to group the quantities that we know gives | P (x|cs,b)P (cs,b)P (cs,b x) = (5.2) P (x) 62 which is known as Bayes rule. This allows us to compute the class probability given some measurement as long as we have the known likelihood and prior probabilities. We have another unknown term, the evidence term, in Equation 5.2, P (x), which is the probability of any fish having the measured color: in general we do not need this term. The Bayes decision rule is ??????cs P (cs|x) > P (cb|x)c = ??? (5.3)cb P (cb|x) > P (cs|x) expanding one of these inequalities gives P (x|cs)P (cs) P (x|cb)P (cb) > (5.4) P (x) P (x) P (x|cs)P (cs) P (x|cb)P (cb) =  >  (5.5) P(x) P(x) = P (x|cs)P (cs) > P (x|cb)P (cb) (5.6) in terms of only known quantities. This is good because the evidence term is often hard to measure. So now Dave can use his knowledge of the prior probabilities and Wendy?s color probabilities and multiply them to produce the probability of sea bass or salmon, binning the fish based on whichever is more probable. This seems like a perfectly reasonable idea, but what kinds of errors will Dave make? Let?s compute 63 the probability of Dave?s error ?????P (cb|x) c = cs P (error|x) = ???? 
(5.7)P (cs|x) c = cb In other words, the error rate will be the probability of the other class. To compute the average error rate, we marginalize x from the joint distribution ? ? P (error?) = P (error, x) dx (5.8)??? = P (error|x)P (x) dx (5.9) ?? We cannot control, or really even measure, P (x) but we can control P (error|x) by making it as small as possible. And the only way to accomplish that is by picking the higher probability for P (cs,b|x) as our classification choice, thus proving the optimality of the Bayesian decision. So now we have a way of making the best possible classification decisions. Given prior probabilities of the different classes and likelihoods of each feature given each class, we can then compute the posterior probabilities and pick the higher one. This guarantees the minimum error: we cannot achieve lower error than this. However we now have a new problem: how do we produce these probabilities for real problems? In general, we can not, and we will have to approximate the distributions leading to an even high error rate. In this sense, the Bayesian decision can be thought of as a theoretical lower limit for the error rate. Even if we know everything, because 64 of the probabilistic nature of decision problems, we will not make the right choice for every input. This sets up a theoretical dichotomy. Do we approximate the underlying prior and likelihood distributions which generated the data and then make Bayesian de- cisions based on our observations? Or instead can we simply compute the boundary between the posterior distributions as a function of the observation that makes a decision directly? Either way, these two questions are the entire purpose of machine learning. Given some data, sampled from unknown distributions, how do we com- pute approximations which match the true distributions or decision boundaries as closely as possible. 5.2 Perceptrons and Multilayer Perceptrons One simple way of learning decision boundaries is the perceptron [30]. The perceptron defines a simple linear model for making a binary decision between two classes (although it can be extended to more complex scenarios). Given a vector of weights w, and an input feature vector x, the perceptron makes the following decision ?????1 ?w,x? > 0 f(x) = ???? (5.10)0 otherwise 65 or simply f(x) = H(?w,x?) (5.11) where H() is the Heaviside function, for classes 1 and 0. The decision boundary in this case is a linear function of x. The task then is to compute a suitable w given some data. Starting from a randomly initialized w0 and some set of training data xi with labels yi the learning algorithm first computes the decision on xi. y?i = f(xi) = ?w0,x? (5.12) which may be incorrect. The algorithm then updates the weights as w1 = w0 + (yi ? y?i)xi (5.13) This process is repeated for all pairs (xi, yi) until some predefined stopping criterion is met. In the case that all xi are linearly separable with respect to yi, this stopping criterion may be convergence, but this is almost never the case in real life. To model real scenarios, a more complex model is needed: one that can model non-linear relationships. We can extend the perceptron to model these more complex scenarios by building aMultilayer Perceptron (MLP) (MLP). The MLP stacks layers of perceptrons separated by non-linearity (Figure 5.2). More formally, for layer 66 Figure 5.2: Multilayer Perceptron. The multilayer perceptron organizes groups of perceptrons into layers separated by non-linearities. 
In this case each circle rep- resents a perceptron. The first and last layers are termed the input and output layers respectively; any layers in between are termed hidden layers. Image credit: wikipedia. weights Wl (a matrix), input x, and nonlinearity ?(), a MLP can be defined as f(x) = WN?(. . . ?(W1?(W0x))) (5.14) for an MLP with N layers. We call the first layer (weights W0) the input layer, the last layer (weights WN) the output layer, and the intermediate layers (weights W1, . . . ,WN?1) the hidden layers. In practice we will also define a loss function l() which takes the network output and the true classification and tell use how wrong it was. Importantly this function needs to be scalar valued e(W ) = l(y, f((x);W )) (5.15) describing the error for some set of weights W . Training this model requires some tricks. We use an algorithm called back- 67 propagation [28]. If we observe the form of l(), we can see that it is a scalar valued function of a vector. This means that we can compute the gradient of the output with respect to the input ?? ??? ? l(y, f((x);W ))? ?w0 ?00? ?? ?? ?0 l(y, f((x);W )) ?? ?w ?10? ?? . ?? .. ??? ??W l(y, f((x);W )) = ? ? (5.16)?? ?l l(y, f((x);W ))? ?w ? ? ij ? ?? . ???? . ??. ? L l(y, f((x);W ))?wMN for L layers and weights of size MN , which tells us ?in what direction and by how much? we would need to change the network in order to classify x correctly. We can compute these quantities using the chain rule. For each layer we compute the L Jacobian ?WL?1 (since these are vector valued functions) with respect to the previousW layer and continue until we have differentiated every layer. ?WN ?WN?1 ?W 1?W 0l = ?WN l ? ?WN? ?1 ?WN? ? ? ? (5.17)2 ?W 0 which gives updates for the weights in each layer. 5.3 Image Features In order to apply any of these models to images, we need some way of repre- senting images as the input vectors x to the functions in the previous section. While 68 we could simply flatten the images into vectors, this may cause issues with the learn- ing process. Small perturbations of the input pixels can cause large changes in their actual values. Also pixels themselves can vary considerably in appearance yet still represent the same class. These issues impact the separability of the problem, and create extremely complex decision boundaries that are difficult if not impossible to model without arbitrarily deep networks. A more successful strategy would be to compute some higher order repre- sentation of the images which we can show is more meaningful. Although we may explore ideas like extracting numerical shape descriptions or color conversions, there are some abstract representations which have been shown to be effective. We will explore two of these in this section: Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT). Both of these techniques transform an image into a series of vectors, which we call features , that can then be input to a machine learning model. 5.3.1 Histogram of Oriented Gradients The Histogram of Oriented Gradients [31] captures shape and orientation of objects using a local descriptor. Often the image will be contrast normalized in blocks before the histogram of gradients is computed on each pixel in small cells. The descriptor for each cell is the concatenation of the histograms for all of the pixels in the cell. To compute the gradient of an image, it can simply be convolved with a 69 Figure 5.3: HoG Features. 
The left shows an example image and the right shows HoG features which classify as ?person?. The HoG features are shown as the weighted orientation based on the histogram of the cell and classification confi- dence. Image credit: [31]. gradient kernel. There are many such kernels but [ ] h = ?1 ?0 1? (5.18) ????1? ?v = ?? ? ? 0 ????? (5.19) 1 are the popular choices for computing horizontal and vertical gradients respectively. After the gradient is computed it can be binned per cell (usually 8?8 pixel cells) to compute the histogram. These histogram cells are then normalized with respect to larger blocks (usually 16? 16 blocks) to further increase invariance to image trans- formations. This gives a descriptor for each cell which can be input to any classifier. For example, Dalal and Triggs used the HoG feature for pedestrian detection an SVM classifier1. The result at the time was quite impressive and HoG features came into 1We do not cover SVMs here 70 widespread use. HoG features are ?dense? in the sense that every block in the image is covered in some sense which means that the model is given a strong prior on the local shapes present in the image. This can be seen visually in Figure 5.3. In the figure we see a man in the example image. The HoG features visualized on the right show outlines of the important shapes in each region. This visualiza- tion is produced by drawing tangent lines for each orientation in the histogram and weighting the lines by the histogram values on each cell. The lines are then further weighted by the SVM confidence to show which lines are contributing to the human classification. We see strong responses on the feet, shoulders, and head meaning that the model considers these unique identifiers of people that are not present in other objects. 5.3.2 Scale-Invariant Feature Transform One of the most popular and powerful image features is the Scale-Invariant Feature Transform [32]. Like HoG features, SIFT features capture a local description of shape using orientation. Unlike HoG, the primary purpose of SIFT was to find scale-invariant keypoints which are unique locations that appear the same under scale changes. These points can be used for object matching. Since the points should be rotation and scale invariant, a query object should be able to be located even if it is subject to complex deformations. To compute the scale space, SIFT uses a difference of Gaussian?s (DoG). The image is computed at different scales by Gaussian blurring the image successively, 71 . . . Scale (next octave) Scale (first octave) Difference of Gaussian Gaussian (DOG) Figure 5.4: Difference of Gaussians. The difference of Gaussian?s scale space computes gaussian blurs of increasing strength to the input image. The blurred images are then subtracted from each other. Points which survive this process are scale invariant. Image credit: [32]. then the difference between neighboring blurred images is taken (Figure 5.4). When this is done over many scales, any points which survive for the entire stack of DoG images are considered scale-invariant since they are clearly localized across scales. These points are pixel localized by applying non-maximum suppression and then sub-pixel localized by computing a second order Taylor expansion on the pixel which can produce a zero point in between pixel boundaries. For each keypoint, the rotation invariant descriptor is computed. The gradient magnitude m(x, y) and orientation o(x, y) are computed as ? m(x, y) = (P ? P ? )2x+1,y x 1,y + (Px,y+1 ? 
P 2x,y?1) (5.20) Px,y+1 ? Px,y?1 o(x, y) = arctan (5.21) Px+1,y ? Px?1,y for image P . This is computed in a 3 ? 3 neighborhood around the keypoint and then a histogram is computed. The orientation with the highest bin is assigned to the keypoint. To further improve the invariance, these descriptors are compiled in a 4?4 grid into an 8 bin histogram. The resulting 128 dimensional descriptor resulting 72 Figure 5.5: Convolutional Neural Network. Diagram shows the LeNet-5 ar- chitecture. The model takes pixels as input and computes feature maps using suc- cessive convolutions, non-linearities, and subsampling layers. For classification, the network terminates in a MLP. This allows the classifier and the feature extractors to be trained jointly using backpropagation. Image credit: [33] from the concatenation of these histograms is assigned as the keypoint descriptor. This descriptor can be normalized to improve invariance to lighting changes. SIFT features were the de facto standard in image features for many years. During the end of the classical/feature based learning era, there was a particular shift towards dense SIFT features. This step simply forgoes the keypoint detection steps and computes a descriptor for each pixel. This is useful for tasks like semantic segmentation that require per-pixel labels but it can also be used as a rotation invariant base for more general tasks. 5.4 Convolutional Networks and Deep Learning Feature engineering is a complex process. The two algorithms we described in the last section are non-trivial to understand, much less to develop on one?s own. Furthermore, it is not clear if a given feature is suitable to any particular task, we only have vague motivation and intuition to guide us. The fundamental 73 contribution of deep learning was that the best features for a given problem can be learned along with the classifier using only pixels as input. This replaced the tedious feature engineering process with something much more powerful and much simpler to develop. Deep learning is powered by the CNN [28]. These ideas had been around for some time but it was not until Alexnet [8] that deep enough and complex enough networks were shown to be computationally viable with a GPU implementation. This quickly revolutionized machine learning with entire scientific careers dedicated to feature engineering becoming obsolete in a short time frame. The CNN itself is not particularly complex. It is a MLP with the matrix multiplications replaced with convolutions, formally f(x) = WN ? ?(. . . ?(W1 ? ?(W0 ? x))) (5.22) The advantage of this is that the weights can be small kernels, usually 3? 3 instead of the large matrices required to process an image with a MLP (these matrices would need to be the same width and height as the image). This already made CNNs much more efficient than MLPs even without the GPU implementation. Fur- thermore, many seemingly complex image transformations can be computed with convolutions which is why we say that the convolutional network computes learned feature representations. These non-linear feature extractors replace the hand de- signed feature extractors of classical machine learning. One of the more influential and yet simple architectures is shown in Figure 74 5.5, LeNet-5 [33]. Many CNN variants can be described by the components in that figure. The convolutional layers are paired with non-linearity and subsampling layers. 
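The overall pattern is compact enough to write down. Below is a rough LeNet-style sketch, assuming PyTorch is available; the layer sizes are indicative rather than a faithful reproduction of LeNet-5, and the class name is mine.

    import torch
    from torch import nn

    class SmallConvNet(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # Convolution -> non-linearity -> subsampling, repeated twice.
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            )
            # A small MLP makes the final classification decision.
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
                nn.Linear(120, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    logits = SmallConvNet()(torch.randn(1, 1, 28, 28))  # an MNIST-sized input
    print(logits.shape)  # torch.Size([1, 10])

The whole stack, feature extractor and classifier alike, is trained jointly with backpropagation, which is exactly the point made about Figure 5.5.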
The subsampling layers are usually some kind of pooling (max pooling or average pooling), which helps aggregate feature information spatially. The actual classification decision is made using a MLP once the feature maps have been reduced to a sufficiently small and abstract representation. For the non-linearity, ReLU is currently the most popular choice, which I like to define in terms of the Heaviside function

\[ R(x) = H(x)x \tag{5.23} \]

but which most people prefer to write as

\[ R(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \tag{5.24} \]

Why CNNs work as well as they do remains somewhat of a mystery, but like much of machine learning, we can get an idea using intuition. As we have already discussed, the hand designed features of classical machine learning may not have been the best for a given task. The learned features of a convolutional network are likely better suited since they are customized to the task. Images are a discrete sampling of a 2D signal, and nearby pixels are often highly correlated or anti-correlated (in terms of edges). CNNs can pick up on these correlations because they use a translation-invariant learned convolution which is moved across the image spatially in a sliding window. Finally, since convolutional networks are highly efficient, they can be made deeper and wider to learn more complex mappings.

In this dissertation we will be exclusively exploring convolutional neural network architectures. While there have been some major advancements to CNNs, which we touch on in the rest of this chapter and throughout the dissertation, it is worth noting that the CNNs of today are largely the same as those used by the pioneers of deep learning.

5.5 Residual Networks

Residual networks [9] were a major advancement in the design of convolutional networks. Instead of learning a mapping $y = f(x)$ like a traditional network, the residual network defines a mapping $y = f(x) + x$. This, along with some other notable architectural changes, makes the residual network highly effective. The precise reason why this helps so much is still debated; however, it likely makes "gradient flow" easier (gradient flow was also explored by the VGG [34] and Inception [35] architectures). Examining Equation 5.17, we can spot a potential problem. As the depth of the network increases, carrying the gradient from the loss through all Jacobians to the earliest layer may be difficult. The gradient tends to shrink as we move backwards through the layers; we call this problem the vanishing gradient. Residual learning likely solves this problem by allowing a shortcut connection around some of the convolution layers which carries a stronger gradient signal to the early layers.

Figure 5.6: U-Net. U-Nets arrange convolutional layers in a U-shape of decreasing and then increasing size. Skip connections allow for better gradient flow to early layers. Image credit: [36].

The actual design of the network is based on a so-called "residual block" pictured in Figure 5.7. Each block has two weight layers with batch normalization [37] separated by a ReLU non-linearity. The addition of batch normalization is thought to simplify the learning process even further for the weight layers by removing the lower order statistics of mean and variance. For each batch, the layer tracks the running mean $\mu$ and variance $\sigma^2$ and computes

\[ \mathrm{BN}(x) = \gamma \frac{x - \mu}{\sigma} + \beta \tag{5.25} \]

for learned $\gamma$ and $\beta$. The block includes the hallmark residual connection short-circuiting the two weight layers.
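In code, such a block is only a few lines. A minimal sketch, again assuming PyTorch, for the case where the input and output shapes match (real ResNets add a strided projection on the shortcut when they do not):

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Two weight layers with batch normalization, separated by a ReLU.
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)  # the shortcut around the weight layers

    y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])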
Each residual block consists of two convolutions with a ReLU non-linearity and a batch normalization layer. Image credit: [9]. The entire network architecture stacks the residual blocks using strided con- volutions to perform learned downsampling instead of using pooling. The network terminates with a ?global average pooling? layer which performs spatial averaging over each channel of the output to produce a small vector suitable for input to a MLP like prior network designs. 5.6 U-Nets It is worth noting that we are, of course, not limited to only classification prob- lems. The U-Net [36] architecture is suitable for problems which require a spatial output like image-to-image translation and semantic segmentation. In this disserta- tion, we will almost exclusively be dealing with image-to-image problems although the architectures we discuss later will differ greatly from the U-Nets. Similar to residual networks, U-Nets were a major advancement in these spatial tasks. And also like residual networks, the major contribution was likely in gradient flow. U-Nets define the network in two distinct parts: the encoder and the decoder, the schematic is shown in Figure 5.6. The encoder is much like a traditional convo- lutional network. There are alternating convolutions and non-linearities with down- 78 sampling. The decoder is the reverse process, taking the compact representation from the encoder and using upsampling operations to compute a result which has the same dimensions as the input image. The major design feature of this is the skip connections. These connections take feature maps from the encoder and concatenate them with the feature representations of the same size in the decoder which allows a strong gradient signal to flow to the early layers avoiding the vanishing gradient problem. The U-Net was revolutionary at the time for its results on the extremely difficult semantic segmentation problem. However, the U-Net would quickly become widely used for any spatial task, and is still used quite frequently. Pix2Pix [38] for example was based entirely on the U-Net. While U-Nets were highly influential on all image-to-image problems, we will employ very different architectures later in the dissertation, and indeed very few works in compression actually use U-Nets. This is because there are other ways to deal with vanishing gradients (like residual blocks and their derivatives) and the downsampling operations in U-Nets tend to remove fine details which we want to preserve in restoration tasks. 5.7 Generative Adversarial Networks Generative Adversarial Networks (GANs) [39] will be relied upon heavily in the methods we detail in the dissertation. GANs were a truly revolutionary mo- ment in the generation of images using CNNs. Prior error based methods, called 79 Fake Image Noise Generator Real Discriminator Real Image Fake Figure 5.8: GAN Procedure. The generator creates an image from random noise and provides it to the discriminator along with real images. The discriminator must identify which images are real and which are fake. autoencoders2, produced very poor results even for simple datasets like MNIST [40]. The many variants of the GANs would change this dramatically using an ingenious and fairly simple idea. The GAN methods sets up an adversarial game with two networks. One network, the generator, generates images, and another, the discriminator, tries to identify which images are real and which are fake. 
The generator is rewarded for fooling the discriminator into classifying its images as real and penalized for getting caught. Conversely, the discriminator is rewarded for correctly identifying fake images and penalized for incorrectly classifying them. Training (theoretically) ends when the two networks achieve a Nash equilibrium [41]?[43]. This procedure is shown in Figure 5.8. We train this pair of networks using standard cross entropy classification loss. The only difference is that we reverse the labels when training the generator since we want it to fool the discriminator. This is sometimes call the minimax loss. Given real samples x, noise vectors z, discriminator D(), and generator G(), we define the 2Although this is an abuse of the term, technically an autoencoder should generate the exact image it is given as input and nothing else. 80 loss l(x, z) = log(D(x)) + log(1?D(G(z))) (5.26) and we train the discriminator to maximize l() while training the generator to min- imize l(). In other words minmaxEx?real[log(D(x))] + Ez?noise[log(1? log(D(G(Z))))] (5.27) G D As these two networks play their game over the course of training, the dis- criminator will start to identify more and more fake images. The increasing loss on the generator will cause it to generate more realistic images. Since identifying fake images is relatively easy for a CNN, by the end of training, the generator will be producing extremely realistic images in order to continue to fool the discriminator. In practice the Nash equilibrium is hard to achieve and we simply stop training GANs after a certain number of steps. GANs also chronically diverge since it is hard for the GAN to recover from a situation where the discriminator has a large advantage over the generator. 5.8 Recap To recap, we have reviewed machine learning from the ground up. We built the ideas of machine learning on a foundation of how to make decisions in the pres- ence of perfect information. We then developed the perceptron and its extension, 81 the multilayer perceptron which is the progenitor of modern deep learning. We dis- cussed hand engineered features and why they were necessary and finally developed deep learning as a replacement for these features. We then reviewed some of the most important ideas of deep learning including convolutional networks, residual learning, U-Nets, and GANs. This concludes the foundational knowledge which is required to fully understand the original research developed in the remainder of this dissertation. 82 Part II Image Compression 83 Chapter 6: JPEG Compression JPEG has been a driving force for internet media since its standardization in 1992 [3]. The principal idea in JPEG compression is to identify which details of an image are the least likely to be noticed if they are missing. These details can then be replaced with lower entropy versions. By removing information, there is a significant size reduction over methods which perform entropy coding alone. This is called lossy compression, since information is lost in the encoding process. The lost information is, in general, not recoverable. Usually this is not a major issue, as the JPEG algorithm was designed to remove unnoticed details. However, there are situations where the information loss is noticeable in the form of unpleasant artifacts (Figure 6.1). This is particularly true when a JPEG image is saved multiple times, which causes repeated application of the lossy process. 
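This generational loss is easy to reproduce. The sketch below, which assumes the Pillow library and an arbitrary input file name, simply re-encodes an image many times at a fixed quality and lets any additional loss accumulate; it is illustrative only and is not part of the methods developed later.

from io import BytesIO
from PIL import Image

def recompress(img: Image.Image, generations: int = 20, quality: int = 75) -> Image.Image:
    """Repeatedly re-encode an image as JPEG to expose generational loss."""
    for _ in range(generations):
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img

# degraded = recompress(Image.open("photo.png").convert("RGB"))

Comparing the first and last generation, particularly around sharp edges and in smooth gradients, makes the artifacts discussed above easy to see.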
A significant portion of the dissertation is devoted to using machine learning to approximate the lost information.

A common source of consumer confusion with JPEG is in the name itself. JPEG refers to three things simultaneously:

The JPEG Algorithm The algorithm for compressing images.

JPEG Files The disk file format for storing JPEG compressed data and its associated metadata. This is actually either a JPEG File Interchange Format (JFIF) file or an Exchangeable Image File Format (EXIF) file.

The Joint Photographic Experts Group The working group that maintains the JPEG standard.

Figure 6.1: JPEG Information Loss. This image suffers from extreme degradations caused by JPEG compression. Zoom in on this image, it probably has fewer details than you think it does.

This chapter is devoted to giving the reader an understanding of JPEG compression which is sufficient to motivate the first principles that we use in developing the algorithms later in the dissertation. We will review the function of JPEG compression and decompression step-by-step, and we will discuss the extremely important view of JPEG as a linear map. We will also briefly discuss other image compression algorithms.

6.1 The JPEG Algorithm

We now present the JPEG algorithm step-by-step. Where the standard is ambiguous we defer to the Independent JPEG Group's libjpeg software [44]. This software is widely considered standard in the industry, although there are other implementations of JPEG. We start by describing the compression process and then conclude with the decompression process, which is largely the inverse. Throughout the description we will place emphasis on which parts of the standard are motivated by human perception and which steps involve loss of information.

6.1.1 Compression

JPEG compression starts with an RGB image, usually in interleaved (RGB24) format. This image is then converted to the YCbCr planar format; however, this is not the more common ITU-R BT.601 [45] format, which produces values in [16, 235] for Y and [16, 240] for Cb, Cr. Instead, this format uses the full range of byte values ([0, 255]). The color conversion uses the following three equations:

Y = 0.299R + 0.587G + 0.114B    (6.1)
Cb = 128 - 0.168736R - 0.331264G + 0.5B    (6.2)
Cr = 128 + 0.5R - 0.418688G - 0.081312B    (6.3)

This color conversion is designed to better match human perception, which treats changes in luminance (the Y channel) with more weight than chrominance (the Cb and Cr channels). Therefore, the Cb and Cr channels can have more information removed with less of an effect on the overall image.

One operation in particular which removes additional information from the color channels is chroma subsampling. Chroma subsampling describes a 4 × 2 block of pixels and is represented as a triple, e.g., 4:2:0. The 4 is the number of luma samples per row. The 2 is the number of chroma samples in the first row. The 0 is the number of chroma samples which change in the second row. So in this example, there are 4 luma samples in each row, 2 chroma samples in the first row, and none of them change in the second row, meaning that the chroma channels should be stored at half the width and height of the luma channel. Another example is 4:2:2, which indicates that the 2 chroma samples in the first row both change in the second row, so the chroma channels are stored with half the width but the same height as the luma channel (this is an archaic and confusing notation, but unfortunately it is still used).
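To make the conversion concrete, the following is a minimal NumPy sketch of the full-range color transform and 4:2:0 averaging. It is illustrative only: real encoders such as libjpeg may use different downsampling filters, and the helper assumes the (padded) chroma dimensions are even.

import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Full-range (JFIF) RGB -> YCbCr for a float array of shape (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def subsample_420(channel: np.ndarray) -> np.ndarray:
    """4:2:0 subsampling by averaging each 2x2 block of a chroma channel."""
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))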
Before we remove information we need to pad the image. JPEG is based on 8 × 8 blocks, so at the least the image needs to be padded to a multiple of 8 in the width and height. If chroma subsampling is used, this needs to be taken into account during padding, and the image may need to be padded to a multiple of 16 or more in the width, height, or both. This defines the Minimum Coded Unit (MCU), i.e., the minimum size block which can be encoded using the given settings. The padding in this case is always done on the bottom and right edges of the image and repeats the final sample as the padding value. With the image padded, the chroma channels can be subsampled.

Next comes the main feature of the JPEG algorithm, the DCT on non-overlapping 8 × 8 blocks. Before computing this, the pixels are centered by subtracting 128. The DCT is applied using

D_{ij} = \frac{1}{4} C(i) C(j) \sum_{x=0}^{7} \sum_{y=0}^{7} P_{xy} \cos\left[\frac{(2x+1)i\pi}{16}\right] \cos\left[\frac{(2y+1)j\pi}{16}\right]    (6.4)

C(u) = \begin{cases} \frac{1}{\sqrt{2}} & u = 0 \\ 1 & u \neq 0 \end{cases}    (6.5)

for an 8 × 8 block of pixels P. This accomplishes two goals. First, it concentrates the energy of each block into the top left corner. Second, it serves as a frequency transform which allows us to remove frequencies which we believe viewers will be less likely to notice.

The DCT coefficients are then quantized by dividing by a quantization matrix. This is an 8 × 8 matrix of coefficients which reduce the magnitude of the DCT coefficients. Since humans tend not to notice missing high spatial frequencies, the quantization matrices generally target these. However, most encoders compute the quantization matrix from a scalar quality factor which is easier for users to comprehend, and as this quality decreases, the quantization matrix removes lower and lower frequencies. After quantization, the result is truncated to an integer. This removes information in the fractional part and permits the result to be stored in an integer which takes up less space. In a sense this is the first "compression" operation. The entire operation is given by

Y'_{ij} = \left\lfloor \frac{Y_{ij}}{(Q_y)_{ij}} \right\rfloor    (6.6)
(C'_b)_{ij} = \left\lfloor \frac{(C_b)_{ij}}{(Q_c)_{ij}} \right\rfloor    (6.7)
(C'_r)_{ij} = \left\lfloor \frac{(C_r)_{ij}}{(Q_c)_{ij}} \right\rfloor    (6.8)

for luminance quantization matrix Q_y and chrominance quantization matrix Q_c. The color channels are often quantized more coarsely as human vision is less sensitive to color data. Note that since we truncate, any fractional part after division is irrevocably lost; the resulting coefficient can only be approximated from the integer part. Any coefficient whose magnitude is less than one after division is truncated to zero and cannot be recovered even approximately. Other than chroma subsampling, this is the only source of loss in JPEG compression. In order to decode the image, Q_y and Q_c are both stored in the JPEG file.

Figure 6.2: Zig-Zag Order. This ordering is intended to put low frequencies in the beginning and high frequencies at the end.

These quantized coefficients are then vectorized in a zig-zag order (Figure 6.2)
The final run-length coded vectors are then entropy coded. This can use either Huffman coding [27] or arithmetic coding [46]. With a significant amount of redundant or unnoticeable information removed, these entropy coding operations are extremely efficient and yield a significant space reduction over the uncompressed image. 6.1.2 Decompression The decompression algorithm is largely the reverse operation. After undoing entropy coding we have the quantized coefficients. These are element-wise multiplied by the quantization matrices to compute the approximated coefficients Y? = Y ?ij i,j(Qy)ij (6.9) (C?b)ij = (C ? b)i,j(Qc)ij (6.10) (C?r)ij = (C ) ? r ij(Qc)ij (6.11) 90 We can then compute the inverse DCT of the approximated coefficients 7 7 [ ] [ ] 1 ?? (2x+ 1)i? (2y + 1)j? Pxy = C(i)C(j)D?ij cos cos (6.12) 4 16 16 i=0 j=0 and uncenter the spatial domain result by adding 128. The color channels are interpolated to remove chroma subsampling. We then remove any padding that was added, and convert the image back to the RGB24 color space R = Y + 1.402(Cr ? 128) (6.13) G = Y ? 0.344136(Cb ? 128)? 0.714136(Cr ? 128) (6.14) B = Y + 1.772(Cb ? 128) (6.15) and the image is ready for display. There are three important things to take away from this discussion. First, other than chroma subsampling, which is optional, the only lossy operation is the truncation during the quantization step. This is a fairly simple operation considering the DCT coefficients but it creates complex patterns in the spatial domain. Next, the blocks are non-overlapping, so for each block there is no dependence on pixels outside of the block. Finally, each pixel in the block depends on all of the coefficients in the block. Conversely, each coefficient in the block also depends on all of the pixels in the block. We will exploit this property later in the dissertation. 91 6.2 The Multilinear JPEG Representation In what is perhaps a surprising result, the steps of the JPEG transform are easily linearizable [47], a property that was explored significantly in the 1990s [48]? [51]. Indeed, outside of entropy coding, the only non-linear step in compression is the truncation that occurs during quantization, and all the steps of decompression are linear. Furthermore, when we process JPEG images, we are either dealing with the decompression process, or we are in full control over the compression process and it is therefore our choice if and when we truncate. We would only need to do this if we were saving the result as a JPEG. We now develop the steps of the JPEG algorithm into linear maps and compose them into a single linear map that models compression and a single linear map which models decompression. Without loss of generality, consider a single channel (grayscale) image. We model this image as the type-(0, 2) tensor I ? H? ? W ?. Note that although we are essentially dealing with real numbers, we have intentionally left H? and W ? as arbitrary co-vector spaces because there is no reason to define them concretely for our purposes. We will, however, make the stipulation that they are defined with respect to a standard orthonormal basis so that we can freely convert between the co- vector and vector spaces without the use of a metric tensor. Note that the following equations are written in Einstein notation; see Chapter 2 (Multilinear Algebra) if this is unfamiliar. Our first task is to break this image into 8 ? 8 blocks. We define the linear 92 map B : H? ?W ? ? X? ? Y ? ?M? ?N? (6.16) ? B ? H ?W ?X? ? Y ? ?M? ?N? (6.17)?? ?? 
?1 pixel h,w belongs in block x, y at offset m,n Bhwxymn = ??? (6.18)0 otherwise where B is a type-(2, 4) tensor defining a linear map on type-(0, 2) tensors. The result of this map will be a tensor with 8? 8 blocks indexed by x, y and 2D offsets for each block indexed by m,n. Although this definition is fairly abstract, it can be computed fairly easily using modular arithmetic, although it does need to be recomputed for each image. Next we compute the DCT of each block,. We define the following linear map D : M? ?N? ? A? ?B? (6.19) ( D ?)M ?(N ? A? ?B)? (6.20) mn 1 (2m+ 1)?? (2n+ 1)??D?? = C(?)C(?) cos cos ? (6.21)4 16 ? 16????1? u = 02C(u) = ??? (6.22)1 u ?= 0 The equation for D should look familiar by now. D is a type-(2, 2) tensor defining a linear map on type-(0, 2) tensors. The m,n block offset indices in the input tensor 93 will index spatial frequency after applying this map. Next we linearize the coefficients 2. We define the following linear map Z : A? ?B? ? ?? (6.23) ? Z ? A?B ? ?? (6.24)?????1 ?, ? is at ? under zigzag orderingZ??? = ??? (6.25)0 otherwise This is a type-(2, 1) tensor defining a linear map on type-(0, 2) tensors. It flattens the 8?8 blocks into 64 dimensional vectors. In other words, the ?, ? indices indicate which indexed spatial frequency will be indexed with a single k after applying this transformation. This tensor depends on the zigzag ordering and can simply be hard coded. Finally, we divide by the quantization matrix. We still need to scale the coefficients even though we are not rounding them. We define the linear map S : ?? ? K? (6.26) S ? ??K? (6.27) S? 1 k = (6.28)qk where qk is the kth entry in the quantization matrix for the JPEG image. This is a type-(1, 1) tensor defining a linear map on co-vectors. 2Note that we?re doing things slightly out of order but ultimately the order does not matter here and doing it this way simplifies the form of the next tensor 94 In order to define both compression and decompression, we need only one more linear map, scaling by the quantization matrix S? : K? ? ?? (6.29) S? ? K ? ?? (6.30) S?k? = qk (6.31) for the same definition of qk as above. We now have the fairly simple task of assembling these steps into single tensors. We say this is simple because all of the operations are linear maps and therefore are readily composable. We define J : H? ?W ? ? X? ? Y ? ?K? (6.32) J ? H ?W ?X? ? Y ? ?K? (6.33) Jhw = Bhw mn ?? ?xyk xymnD?? Z? Sk (6.34) for compression and J? : X? ? Y ? ?K? ? H? ?W ? (6.35) J? ? X ? Y ?K ?H? ?W ? (6.36) J?xyk = BxymnD?? Z? S?khw hw mn ?? ? (6.37) for decompression. 95 It is difficult to express how powerful this result is and how easily it is achieved using rudimentary concepts from multilinear algebra. What seems like a fairly complex algorithm, and indeed is, when thought of as an operation on a matrix, reduces to a simple linear map when we model the inputs and intermediate steps as tensors. Equipped with this linear map, we can and will model complex phenomena on compressed JPEG data directly without needing to decompress it. 6.3 Other Image Compression Algorithms The astute reader will have noticed early on that we, quite confidently, are in a part labeled ?image compression? and yet we are only discussing JPEG. There are other image compression algorithms, so a natural question is ?why are we not discussing those?? For myriad reasons, there are really no interesting problems to study for other compression algorithms. PNG [52], for example, is widely used. 
But this is lossless compression, so there are no artifacts to overcome. GIF [53] is also lossless, although many GIF services quantize colors into a palette to save more space, which is a potential problem that could be interesting to work on. More modern formats are based on video compression and while they are lossy, they are simply unused. BPG [54] is the most promising of these, and therefore the least used, but there is also HEIC/HEIF [55] which is currently being unsuccessfully pushed by Apple on the iPhone. Probably the most interesting of these algorithms is JPEG 2000 [56], which 96 is lossy and widely used in digital cinema [57], although it was completely ignored by consumers. This codec is interesting because it would require us to update our theory to take into account the discrete wavelet transform that JPEG 2000 uses in place of the DCT. However, the use of this transform imposes a major practical problem as well: JPEG 2000 images look good even at low bitrates because the wavelet transform is so effective, so they may not require correction. Instead, we will focus our energy where it can have the most impact: by exclu- sively studying JPEG. Even at the time of writing, 30 years after standardization, JPEG is the most commonly used image file format. It is easy to use, familiar to consumers, and has become the backbone of the internet, making it incredibly resilient to any challenger, no matter how much better the compression or quality of the images. At the same time, JPEG does suffer from some extreme artifacts in many conditions. It is this combination of visible quality loss and widespread use that makes JPEG ideal for further study. 97 Chapter 7: JPEG Domain Residual Learning Now we develop a general method for performing residual network ([9], Section 5.5 (Residual Networks)) learning and inference on JPEG data directly, i.e., without the need for decompressing the images. This method was published separately in the proceedings of the International Conference on Computer Vision [58]. Warning This chapter is extremely math-heavy and dry. It is strongly recom- mended to review the background math outlined in the first chapters of the disser- tation for a complete understanding of the material. This chapter may serve as a powerful sleep aid: do not operate heavy machinery while reading this chapter. Compared to processing data in the pixel domain, directly working on JPEG data has several advantages. First, we observe that most images on the inter- net are compressed, including deep learning datasets such as ImageNet [59]. Next we observe that JPEG images, being compressed, are naturally smaller and more sparse than their uncompressed counterparts. These are all desirable properties for memory- and compute- hungry deep learning algorithms. The primary goal of the method presented in this section is to be as close as possible to the pixel domain result. In other words, given a learned pixel domain 98 mapping H : I ? Rc mapping images to class probabilities for c classes1, we want to define a mapping ?(H) such that |H(In)??(H)(DCT(In))| is minimized for any In ? I. We can accomplish this goal analytically and we will develop the theory in the coming sections, including a discussion of why it is (likely) not possible to generate a mathematically exact ?(H) function2 and what guarantees are available on the deviation. Recall that a residual network requires several components to operate Convolution The primary learned linear mapping between feature maps at each layer. 
Each ?residual block? contains two of these operations. Batch Normalization Produces normalized features for the convolutions; this is thought to ease the learning process by removing unnecessary statistics from the input features which can be represented exactly [37]. Global Average Pooling An innovation of the ResNet. When the convolution layers are exhausted, the features are averaged channel-wise to produce a vector suitable for input to a fully connected layer ReLU The non-linearity of the ?residual block?, this allows the network to learn complex mapping. Our task will be to derive transform domain versions of each of these operations. First Principles 1We frame this discussion in terms of classification but it applies equally well to any problem type 2i.e., a ? function such that for all i, |H(i)??(H)(DCT(i))| = 0 99 ? JPEG is easily linearized, convolutions are linear, composing them expressed a learned convolution exactly in the JPEG domain. ? Other components of the residual network can be expressed analytically in the JPEG domain. ? ReLU can be approximated with a bilinear map. 7.1 New Architectures Before discussing the proposed technique, we will first make a detour to review two popular methods of JPEG and DCT deep learning. These methods are all new architectures which enable effective processing in the transform domain but which do not attempt to replicate any pixel domain result. These methods have some advantages over both the method presented in the rest of this chapter and when compared to pixel-domain networks. For example, both methods show good task accuracy with faster processing. There are some notable disadvantages, however. In particular, these methods are not suitable for situations when a pixel domain network already exists and its results need to be replicated on JPEGs. These ideas were inspired by the ?do nothing? approach published in both NeurIPS and an ICLR workshop [60]. In this approach, the transform coefficients are passed into a mostly unmodified ResNet for classification. The authors postulate that with the higher-level representation of the DCT, fewer layers are required to achieve similar accuracy and therefore the network will be faster. Indeed the authors show this is true empirically. However, despite the intuition, this paper?s evaluation 100 leaves much to be desired, and it is unclear what the contribution of the DCT is to the result. Meanwhile, it is well known in JPEG artifact correction literature (dis- cussed later in the dissertation), where ?dual-domain? methods are commonplace, that providing DCT coefficients to a network is not successful without considerable effort, a result seeming at odds with Gueguen et al.. Instead, the following methods are inspired by unique attributes of DCT co- efficients. Namely: that each coefficient is a function of all pixels in a block, that a block of pixels is only correctly represented by all DCT coefficients, and that the DCT coefficients are orthogonal and arranged in a grid simply for convenience. This last point is critical. One of the reasons that small convolutional kernels work well on pixels is that nearby pixels are usually correlated in some way, so translation invariant features are readily learned. If that convolution is instead applied to co- efficients, this is then a mapping on arbitrary orthogonal measurements which are intentionally decorrelated, leaving little hope for success. 7.1.1 Frequency-Component Rearrangement This is treated by Lo et al. 
[61] by simply rearranging the frequencies into the channel dimension before processing (Figure 7.1), yielding a feature map which is 1th the width and height and with 64 channels (per input channel). Note how 8 this allows a convolution to capture the information contained in the DCT. Since the convolution operation used in deep learning maps all channels in the input to each channel in the output, every coefficient plays a role in the resulting map and 101 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 16 17 18 19 20 21 22 23 24 25 26 27 28 19 30 31 24 25 26 27 28 19 30 31 32 33 34 35 36 37 38 39 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 56 57 58 59 60 61 62 63 0 0 1 1 2 2 ... 63 63 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 0 1 1 2 2 63 63 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 16 17 18 19 20 21 22 23 24 25 26 27 28 19 30 31 24 25 26 27 28 19 30 31 Output: 64 channel 2 ? 2 32 33 34 35 36 37 38 39 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 56 57 58 59 60 61 62 63 Input: 1 channel 16 ? 16 Figure 7.1: Frequency component rearrangement. therefore complete block information is captured. They call this method Frequency- Component Rearrangement (FCR) Lo et al. use their network for semantic segmentation of road features quite successfully. At the time of publication, their method was both fast and accurate. 7.1.2 Strided Convolutions A similar solution proposed by Deguere et al. [62] uses strided convolutions instead of FCR. Specifically, this method uses an 8 ? 8 stride-8 convolution such that each DCT block is processed in isolation. Note again how this makes good use of the coefficients: the 8 ? 8 convolution ensures that every coefficient plays a role in the resulting mapping, and the stride-8 ensures that there is no leakage of information across blocks. Once these ?block representations? are computed, the resulting feature map is again 1th the width and height (now with a variable number 8 of features) . Deguere et al. use this method for object detection in the DCT domain and again performed admirably at the time of publication. 102 7.2 Exact Operations In the previous section we discussed novel architectures that equip CNNs with the ability to process data in the transform domain. While this is useful and impor- tant, it requires training a new CNN from scratch and has no particular relationship to the underlying pixels that the CNN is processing. Since CNNs were designed to process pixel domain data, and the DCT is a transform of pixel data, a natural question is whether a method can be formulated that is capable of processing trans- form domain data and which has some mathematical guarantee or relationship to the underlying pixel domain model. We now develop just such a method. 7.2.1 JPEG Domain Convolutions Recall from Section 6.2 (The Multilinear JPEG Representation) that the JPEG transform can be linearized and written as linear maps on tensor inputs and that this analysis yields the following linear maps: J : H? ?W ? ? X? ? Y ? ?K? (7.1) for compression of an image represented by I ? H? ?W ? to transform coefficients F ? X? ? Y ? ?K?, and J? : X? ? Y ? ?K? ? H? ?W ? (7.2) (7.3) 103 to decompress. We proceed by considering only single channel images. 
We will add in channels and batch dimensions later since they have no bearing on the derivation.

We know that convolutions are linear maps; therefore, deriving a JPEG domain convolution is fairly simple. Assume that C : H^* \otimes W^* \to H^* \otimes W^* is a linear map representing an arbitrary convolution. This convolution would be applied to an image I in the pixel domain by computing

I'_{h'w'} = C^{hw}_{h'w'} I_{hw}    (7.4)

Given transform coefficients F \in X^* \otimes Y^* \otimes K^* for I, we can derive I as

I_{hw} = \tilde{J}^{xyk}_{hw} F_{xyk}    (7.5)

Similarly, we can derive transform coefficients F' for I' by applying J

F'_{x'y'k'} = J^{h'w'}_{x'y'k'} I'_{h'w'}    (7.6)

Substituting these two expressions yields

I'_{h'w'} = C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} F_{xyk}    (7.7)
F'_{x'y'k'} = J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} F_{xyk}    (7.8)

And we make the following definition

F'_{x'y'k'} = \left[ J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw} \right] F_{xyk}    (7.9)
\Xi^{xyk}_{x'y'k'} = J^{h'w'}_{x'y'k'} C^{hw}_{h'w'} \tilde{J}^{xyk}_{hw}    (7.10)

giving a simple expression for computing \Xi : X^* \otimes Y^* \otimes K^* \to X^* \otimes Y^* \otimes K^*, a convolution in the compressed domain, given a convolution in the pixel domain. It is important to note that this is not a simple notational trick. Because J, C, and J̃ are linear maps, the resulting Ξ performs all three operations in a single step and is significantly faster than performing them separately (much in the same way that the linear functions f(x) = 5x and g(x) = 2x can be combined into (f ∘ g)(x) = 10x, which has only a single multiply versus the two multiplies of applying g and then f separately).

With the mathematics satisfied, we now turn to the development of an efficient algorithm for computing Ξ. After all, the convolution C is usually represented as a simple 3 × 3 matrix of numbers (although other sizes and shapes are possible). However, our derivation is expressed in terms of a dim(H) × dim(W) × dim(H) × dim(W) (2, 2)-tensor. One way to understand C is as a look-up table of coefficients. For example, if we index C as C[5, 7], we are given a tensor of coefficients for every pixel in the input representing its contribution to the (5, 7) pixel in the output. Naturally, many of these coefficients are 0; in fact, for a 3 × 3 kernel, the only non-zero entries are those for the pixels from (4, 6) to (6, 8). Similarly, if we index C as C[:, :, 5, 7] we can see the contribution of pixel (5, 7) in the input to every output pixel (which again is mostly zero). This implies a naive algorithm, Exploding Convolutions (Listing 7.1), where the entire (2, 2)-tensor is iterated and the correct coefficients are copied from the convolution kernel. The resulting map is then composed with J and J̃ to produce the transform domain map.

Listing 7.1: Exploding Convolutions (Naive)

# Imports shared by the listings in this chapter.
from typing import Tuple
import torch
from torch import Tensor

def explode_convolution(shape: Tuple[int, int],
                        conv: Tensor,
                        J: Tensor,
                        J_tilde: Tensor) -> Tensor:
    # Half-width of the kernel in each spatial dimension.
    size = (conv.shape[0] // 2, conv.shape[1] // 2)
    # Pad the spatial extent so border pixels have full neighborhoods.
    shape = (shape[0] + size[0] * 2, shape[1] + size[1] * 2)
    # The (2, 2)-tensor form of the convolution: c[i, j, u, v] is the
    # contribution of input pixel (u, v) to output pixel (i, j).
    c = torch.zeros((shape[0], shape[1], shape[0], shape[1]))
    for i in range(shape[0]):
        for j in range(shape[1]):
            for u in range(shape[0]):
                for v in range(shape[1]):
                    hrange = (u - size[0], u + size[0])
                    vrange = (v - size[1], v + size[1])
                    if hrange[0] <= i <= hrange[1] and vrange[0] <= j <= vrange[1]:
                        x = u - i + size[0]
                        y = v - j + size[1]
                        c[i, j, u, v] = conv[x, y]
    # Compose J (compression), c, and J_tilde (decompression) into Xi.
    # Index key: (a, b) = h'w', (c, d, e) = x'y'k', (f, g) = hw, (p, q, r) = xyk.
    xi = torch.einsum("abcde,fgab,pqrfg->pqrcde", J, c, J_tilde)
    return xi
Although this algorithm is simple, it comes with some notable disadvantages. First, it is slow. Iterating over the entire (2, 2)-tensor is time consuming even for a small image. Second, it is difficult to parallelize without domain knowledge of low-level programming. In other words, a CUDA kernel (or similar construct) would need to be produced to efficiently implement this algorithm. A better algorithm would be readily and efficiently programmed in a high level deep learning library like PyTorch [17].

Examine the tensor J̃ and note that

\tilde{J} \in X \otimes Y \otimes K \otimes H^* \otimes W^*    (7.11)

Recall that our model of single channel images uses I \in H^* \otimes W^*; therefore, the last two dimensions of J̃ are a single channel image and we can model J̃ as a batch of single channel images by reshaping it to fold X, Y, K into a single dimension N, giving

\tilde{J} \in N \otimes H^* \otimes W^*    (7.12)

We are then free to convolve J̃ with the kernel C (note that the definition of C has changed slightly: it is now the kernel itself):

\hat{C} = C * \tilde{J}    (7.13)

and then reshape Ĉ giving

\hat{C} \in X \otimes Y \otimes K \otimes H^* \otimes W^*    (7.14)

Note that the shapes of Ĉ and J̃ are the same; all we have done here is compose the convolution kernel C into the decompression operation J̃. Next, we compose Ĉ and J

\Xi^{xyk}_{x'y'k'} = \hat{C}^{xyk}_{hw} J^{hw}_{x'y'k'}    (7.15)

to compute Ξ.

Listing 7.2: Exploding Convolutions (Fast)

def explode_convolution(J_tilde: Tensor, J: Tensor, C: Tensor) -> Tensor:
    # Fold the block indices (X, Y, K) into a single batch dimension so that
    # J_tilde becomes a batch of single channel images of shape (N, 1, H, W).
    J_hat = J_tilde.flatten(0, 2).unsqueeze(1)
    # Convolve every slice of the decompression map with the kernel C.
    C_hat = torch.nn.functional.conv2d(J_hat, C.view(1, 1, *C.shape), padding="same")
    # Restore the (X, Y, K, H, W) layout: C is now composed into J_tilde.
    C_tilde = C_hat.view_as(J_tilde)
    # Compose with the compression map J, stored as (H, W, X', Y', K').
    # Index key: (p, q, r) = xyk, (f, g) = hw, (c, d, e) = x'y'k'.
    xi = torch.einsum("pqrfg,fgcde->pqrcde", C_tilde, J)
    return xi

This algorithm (Listing 7.2) is simple to code in machine learning libraries. Here, it takes up only a handful of lines and involves no loops. Furthermore, since this algorithm depends only on reshaping, convolution, and einsum, it can take advantage of the built-in optimizations that these libraries include, resulting from years of research into these algorithms [63], [64]. It is also worth noting that the autograd algorithms used by these libraries will work as expected for this algorithm, i.e., it is straightforward to optimize C with respect to some objective when Ξ is used to transform the input feature maps.

Extending this to batches of multi-channel images is straightforward. First, we define the convolution C as C : P^* \otimes H^* \otimes W^* \to P'^* \otimes H^* \otimes W^*, adding the input and output plane dimensions P, P' and noting that C lacks any batch dimension since the same operation is applied to each image in the batch. Next, we simply define Ξ as

\Xi^{pxyk}_{p'x'y'k'} = J^{h'w'}_{x'y'k'} C^{phw}_{p'h'w'} \tilde{J}^{xyk}_{hw}    (7.16)

where the J, J̃ tensors have not changed. This simply adds the plane dimensions P, P' to Ξ. This map is applied to transform coefficients F \in N^* \otimes P^* \otimes X^* \otimes Y^* \otimes K^* as

F'_{np'x'y'k'} = \Xi^{pxyk}_{p'x'y'k'} F_{npxyk}    (7.17)

where the batch dimension N is preserved. With the exception of some extra indices, this does not change the algorithm in Listing 7.2.
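As a concrete illustration of Equation 7.17, a precomputed Ξ can be applied to a batch of coefficient feature maps with a single einsum. The shapes below are hypothetical (a 16 × 16 image, i.e., a 2 × 2 grid of blocks with 64 frequencies each, 16 input planes, and 32 output planes).

import torch

# F has shape (N, P, X, Y, K); Xi has shape (P, X, Y, K, P', X', Y', K').
# The contraction is Equation 7.17 with the batch dimension preserved.
F = torch.randn(4, 16, 2, 2, 64)
Xi = torch.randn(16, 2, 2, 64, 32, 2, 2, 64)
out = torch.einsum("npxyk,pxykqabc->nqabc", F, Xi)  # (N, P', X', Y', K')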
7.2.2 Batch Normalization

Batch normalization [37] is a commonly used technique which ensures each layer receives normalized feature maps. For a single channel feature map I \in H^* \otimes W^*, batch normalization uses the sample mean E[I] and variance Var[I] along with learnable affine parameters γ and β. These parameters are then applied as

\mathrm{BN}(I) = \gamma \frac{I - \mathrm{E}[I]}{\sqrt{\mathrm{Var}[I]}} + \beta    (7.18)

The batch statistics are used to update running statistics which are applied at inference time instead of the sample statistics. This equation has a simple closed-form expression in the transform domain.

We start with the mean and variance. Recall from Section 3.1 (The Fourier Transform) the definition of the 2D Discrete Cosine Transform over N × N blocks

D(i, j) = \frac{1}{\sqrt{2N}} C(i) C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos\left(\frac{(2x+1)i\pi}{2N}\right) \cos\left(\frac{(2y+1)j\pi}{2N}\right)    (7.19)

C(k) = \begin{cases} \frac{1}{\sqrt{2}} & k = 0 \\ 1 & k \neq 0 \end{cases}    (7.20)

Let us compute an expression for the (0, 0) coefficient

D(0, 0) = \frac{1}{\sqrt{2N}} C(0) C(0) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos\left(\frac{(2x+1) \cdot 0 \cdot \pi}{2N}\right) \cos\left(\frac{(2y+1) \cdot 0 \cdot \pi}{2N}\right)    (7.21)
= \frac{1}{2\sqrt{2N}} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y) \cos(0) \cos(0)    (7.22)
= \frac{1}{2\sqrt{2N}} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x, y)    (7.23)

We further assume 8 × 8 blocks as used by JPEG

= \frac{1}{2\sqrt{2}\sqrt{8}} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.24)
= \frac{1}{8} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.25)

Since

\mathrm{E}[I] = \frac{1}{64} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x, y)    (7.27)

we have

\mathrm{E}[I] = \frac{1}{8} D(0, 0)    (7.28)

yielding a simple expression for the sample mean of a block given DCT coefficients. Note that this is extremely efficient compared to computing the mean on the feature maps directly: it requires one read operation and one multiply operation per block versus 64 reads, 63 sums, and one multiply (if there are multiple coefficient blocks, as is common, their means will need to be combined). To compute the variance we use the following theorem.

Theorem 2 (The DCT Mean-Variance Theorem). Given a set of samples of a signal X such that E[X] = 0, let Y be the DCT coefficients of X. Then

\mathrm{Var}[X] = \mathrm{E}[Y^2]    (7.29)

Proof. Start by considering Var[X]; we write this as

\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2    (7.30)

We are given E[X] = 0, so we simplify this to

\mathrm{Var}[X] = \mathrm{E}[X^2]    (7.31)

Next, we use the DCT linear map D : M^* \otimes N^* \to A^* \otimes B^*, where the vector spaces M and N index the block dimensions and A, B index spatial frequencies. Then:

X_{mn} = D^{\alpha\beta}_{mn} Y_{\alpha\beta}    (7.32)

and

\mathrm{E}[X_{mn}^2] = \mathrm{E}[(D^{\alpha\beta}_{mn} Y_{\alpha\beta})^2]    (7.33)

Expanding the squared term gives

\mathrm{E}[X_{mn} X_{mn}] = \mathrm{E}[D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta}]    (7.34)

And expanding the expectation gives

\frac{1}{|M||N|} X_{mn} X_{mn} = \frac{1}{|A||B|} D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta}    (7.35)

Note that \frac{1}{|M||N|} = \frac{1}{|A||B|}, so we cancel, giving

X_{mn} X_{mn} = Y_{\alpha\beta} D^{\alpha\beta}_{mn} Y_{\alpha\beta} D^{\alpha\beta}_{mn}    (7.36)

Rearranging the right-hand side gives

X_{mn} X_{mn} = D^{\alpha\beta}_{mn} D^{\alpha\beta}_{mn} Y_{\alpha\beta} Y_{\alpha\beta}    (7.37)

Since the tensors D are defined with respect to a standard orthonormal basis, we can freely raise and lower their indices (their metric tensor is the identity). Lowering α, β and raising m, n on one of the D tensors gives:

X_{mn} X_{mn} = D^{\alpha\beta}_{mn} D_{\alpha\beta}^{mn} Y_{\alpha\beta} Y_{\alpha\beta}    (7.38)

Since D^{\alpha\beta}_{mn} D_{\alpha\beta}^{mn} = 1 we have

X_{mn} X_{mn} = Y_{\alpha\beta} Y_{\alpha\beta}    (7.39)
X_{mn}^2 = Y_{\alpha\beta}^2    (7.40)

Substituting gives

\mathrm{Var}[X] = \mathrm{E}[X^2] = \mathrm{E}[Y^2]    (7.41)

Therefore, it is sufficient to compute the mean of the squared DCT coefficients to get the variance of the underlying pixels. This is no faster or slower than the pixel domain algorithm.
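These identities are easy to check numerically. The sketch below uses SciPy's orthonormal 2D DCT-II, which matches the normalization assumed above for N = 8, on a random block; it is purely illustrative.

import numpy as np
from scipy.fft import dctn

# Verify E[I] = D(0, 0) / 8 and, after centering, Var[I] = E[D^2].
block = np.random.randn(8, 8)
coeffs = dctn(block, norm="ortho")
assert np.isclose(block.mean(), coeffs[0, 0] / 8)

centered = block - block.mean()
coeffs_c = dctn(centered, norm="ortho")
assert np.isclose(centered.var(), np.mean(coeffs_c ** 2))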
Next, we move on to the affine parameters γ and β. Applying γ is easy: since the transform we are using is linear, multiplying by a scalar can happen before or after the transform, i.e.,

J(\gamma I) = \gamma J(I)    (7.42)

so we can simply multiply the transform coefficients by γ. Applying β is also straightforward: since adding the scalar β would raise the mean by β, we can add β to only the (0, 0) coefficient. This yields a simple closed-form algorithm for computing batch normalization.

Listing 7.3: Transform Domain Batch Norm

def batch_norm(F: Tensor, gamma: float, beta: float) -> Tensor:
    # The (0, 0) coefficient is proportional to the block mean, so zeroing it
    # centers the underlying pixels.
    F[0, 0] = 0
    # With the mean removed, the pixel variance is the mean of the squared
    # coefficients (the DCT Mean-Variance Theorem).
    var = torch.mean(F ** 2)
    # Normalize and apply the learned scale gamma.
    F *= gamma / torch.sqrt(var)
    # Adding beta to the pixels raises the mean by beta, so it is applied to
    # the (0, 0) coefficient alone.
    F[0, 0] = beta
    return F

Note that the algorithm in Listing 7.3 assumes each sample is a single 8 × 8 block. If this is not the case, then the algorithm can be easily adjusted to compute a combined mean and variance over several blocks and multiple channels (depending on the batch norm implementation, it may also be necessary to apply Bessel's correction to the variance computation).

7.2.3 Global Average Pooling

Global average pooling reduces feature maps to a single scalar per channel. In other words, spatial information is averaged "globally". Given the discussion in the previous section, this is extremely simple to compute in the transform domain. As the (0, 0) coefficient is proportional to the mean of each block, we can simply read off these coefficients and scale them to produce the global average pooling vector (Figure 7.2). This is significantly faster than the pixel domain algorithm. Note that this is exactly the result that the pixel domain algorithm would have generated, so from this point forward we no longer need to worry about operations in the transform domain (i.e., the fully-connected layers do not need modification).

Figure 7.2: Illustration of transform domain global average pooling.

7.3 ReLU

Having defined the exact operations, we now turn to a missing and critical component of residual networks: ReLU [65], [66]. Note that we have dedicated an entire section to what is a relatively simple operation in the pixel domain. ReLU is defined as

R(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}    (7.43)

The previous section made use of mathematical properties of the JPEG transform in order to derive closed form solutions for transform domain operations. Since ReLU is necessarily non-linear, we will have no such luck with that approach. In fact, not only is ReLU non-linear, it is piecewise linear depending on the pixel domain value, information which we do not have access to in the transform domain. Instead, we will develop an approximation technique for ReLU that works in the transform domain and is tunable, giving an accuracy-speed trade-off.

We compute this approximation by partially decoding each block of coefficients. This is still fast since only a subset of coefficients are required, and since the result of the approximation is in the pixel domain we can freely compute ReLU on it. Recall the DCT Least Squares Approximation Theorem proven in Section 3.1 (The Fourier Transform).

Theorem 3 (The DCT Least Squares Approximation Theorem). Given a set of N samples of a signal X, let Y be the DCT coefficients of X. Then for 1 ≤ m ≤ N the approximation of X given by

p_m(t) = \sqrt{\frac{1}{N}}\, y_0 + \sqrt{\frac{2}{N}} \sum_{k=1}^{m} y_k \cos\left(\frac{k(2t+1)\pi}{2N}\right)    (7.44)

minimizes the least-squared error

e_m = \sum_{i=1}^{N} (p_m(i) - x_i)^2    (7.45)

Theorem 3 guides us in choosing the lowest m frequencies when we decode (rather than some arbitrary set) in order to constrain the error of the approximation. For a 2D DCT, we use all frequencies (i, j) such that i + j ≤ m, yielding 15 frequencies. The threshold m is freely tunable to the problem and we will examine its effect later. Although we now have a reasonable algorithm for computing ReLU from transform coefficients, we are left with two major problems.
The first is that although our approximation was motivated by a least-squares minimization, it is not guaranteed to reproduce any of the original samples. Since ReLU preserves positive samples (only zeroing negative samples) it would be nice if at least those were preserved. The second is that our network expects transform coefficients as input but the ReLU we have computed is in the spatial domain. It would be expensive to have to convert the result back to transform coefficients before continuing our computation. Consider for a moment the nature of our first problem. Suppose we have a sample with value 0.7. After taking the DCT and computing the least-squares approximation with a subset of coefficients, the value of this sample is changed to 0.5. We can observe that although the least-squares approximation is incorrect, it is still positive. In other words, the reconstruction has not changed the sign of the sample so it will not be zeroed by ReLU. The more coefficients we use the more likely it is that these reconstructions are sign-preserving8 since the high frequencies contribute less to the accuracy of the result (otherwise they would not be a least- squares minimization). In this sense we can observe that it is easier to preserve the sign than the exact pixel value. 8This is true for other piecewise function intervals as well. The technique described here is general. 118 Original True ReLU Naive ASM Figure 7.3: ReLU Approximation Example. Green pixels are negative, red pix- els are positive, blue pixels are exactly zero. The top-left shows the original image. The top-right is the true ReLU. The bottom-left shows a naive approximation using only the least squares approximation. Note that while negative pixels are zeroed, very few positive pixels have the correct value and there are mask errors resulting from the approximation. The bottom-right image shows the ASM technique. Note that while there are still mask errors, positive pixel values are preserved. Therefore, rather than compute ReLU on this approximation, we can instead compute a mask and apply that mask. We reformulate ReLU as follows R(x) = H(x)x (7.46) ??????1 x ? 0H(x) = ??? (7.47)0 x < 0 where H(x) is the Heaviside step function which we treat as a mask. If we compute H(pm) on the approximation pm, and multiply the result by the original samples x, we will have masked the negative samples while preserving the positive ones. We call this technique Approximated Spatial Masking (ASM). See Figure 7.3 for a visual example of this algorithm. The only problem left to solve is that our original samples are in the transform 119 domain and the mask is in the pixel domain. To simplify the following discussion, we consider only DCT blocks here (extending to the full transform is trivial). We can solve this using our multilinear model of the JPEG transform. Given transform coefficients F ? A? ? B?, a spatial domain mask G ? M? ? N?, and the masked result F ? ? A? ?B?, consider the steps such an algorithm would perform 1. Take the inverse DCT of F to give I ? M? ?N? 2. Pixelwise multiply the mask G and I to give I ? 3. Take the DCT of I ? to give the masked result F ? All of these steps are linear or bilinear I = D??mn mnF?? (7.48) I ?mn = GmnImn (7.49) F ? = Dmn ????? ????Imn (7.50) Substituting, we have F ????? = D mn ????GmnImn (7.51) = Dmn????GmnD ?? mnF?? (7.52) = G Dmn D??mn ???? mnF?? (7.53) And we make the following definition (after raising some indices to preserve dimen- 120 sions) ? [ ]F = G Dmn D??mn???? mn ???? F?? 
(7.54) ???mn mn ??mn???? = D????D (7.55) giving the bilinear map ? : M? ? N? ? A? ? B? ? A? ? B?. This map can be computed once and reused. We can use this map along with our approximate mask and original DCT coefficients to produce a highly accurate ReLU approximation with few coefficients. 7.4 Recap Before continuing to empirical concerns, we briefly recap the theoretical dis- cussion in the previous sections. Residual networks consist of four basic operations: Convolution, Batch Normalization, Global Average Pooling, and ReLU. In Section 7.2.1 (JPEG Domain Convolutions) we found that JPEG domain convolutions can be expressed as ?pxyk ? ? p?x?y?k? = J? h w Cphw xykx?y?k? p?h?w?Jhw (7.56) and in Listing 7.2 we developed a fast algorithm for computing this. In Section 7.2.2 (Batch Normalization) we developed a closed form solution 121 for JPEG domain batch normalization. We found that 1 E[I] = D(0, 0) (7.57) 8 Var[I] = E[D2] iff D(0, 0) = 0 (7.58) and that we can apply ? by adding it to D(0, 0) and we can apply ? as we would to a spatial domain input (by multiplying it by each coefficient). In Section 7.2.3 (Global Average Pooling) we found that global average pooling in the JPEG domain is as simple as computing 1D(0, 0) from each channel. We also 8 noted that since this is equivalent to the spatial domain mean, there is no need to derive the fully-connected layers. Finally, in Section 7.3 (ReLU) we developed an approximation technique for ReLU where we use a subset of coefficients to decode each block and compute and approximate H(x) on each block where H() is the Heaviside step function producing a mask Gmn. Then we apply this mask to the original coefficients F?? using F ????? = G ??mn mn????? F?? (7.59) This concluded our theoretical derivations. Model Conversion One important thing to note is that at no time did we stipu- late that the convolution weights or batch norm affine parameters need to be learned from scratch. Indeed, this method can take any such values, random or learned, and 122 Res Block 1: 16 Filters, No Res Block 2: 32 Filters, Downsampling Downsampling Input: T X 1 X 32 X 32 Output: (T X 16 X 32 X 32) Output: T X 32 X 16 X 16 Fully Connected: 64 to Res Block 3: 64 Filters, 10/100 Global Average Pooling Downsampling Output: T X 10/100 Output: T X 64 Output: T X 64 X 8 X 8 (single JPEG block) Figure 7.4: Toy Network Architecture. Note that by the final ResBlock, the image is reduced to 8? 8 which is a single block of coefficients. This simplifies the global average pooling layer. produce JPEG domain operations. Therefore, we can use the method to convert pre-trained models to operate in the JPEG domain. This idea has some powerful implications and we will examine it?s trade-offs in the empirical analysis. 7.5 Empirical Analysis We now turn out attention to an empirical evaluation of the algorithm. After all, the discussion in the previous sections was highly theoretical and altogether divorced from practical concerns. A natural question at this point is: ?How well does this actually work?? We will start by creating a toy network. This small network will be used in the experiments in this section to evaluate and benchmark the technique. This toy architecture consists of three residual blocks followed by global average pooling and a single fully connected layer. Although this is a simple architecture, it will more than suffice for our benchmarks of MNIST [40] and CIFAR 10/100 [67]. The 123 Table 7.1: Model Conversion Accuracies. 
Note that the deviation is small between the spatial domain and JPEG domain network.

Dataset     Spatial  JPEG   Deviation
MNIST       0.988    0.988  2.999e-06
CIFAR-10    0.725    0.725  9e-06
CIFAR-100   0.385    0.385  1e-06

The inputs will always be 32 × 32 images to ensure an even number of JPEG blocks (MNIST inputs are zero padded with two pixels on each side). We consider two versions of this network: one which processes images in the spatial domain (i.e., a traditional ResNet) and one to which we have applied the algorithm to allow it to process JPEG transform coefficients.

For those unconvinced by mathematics (or maybe suspicious of the ability to implement the math in PyTorch), we first examine whether our derivations were correct at all. This is straightforward: we simply use an exact ReLU, taking all 15 frequencies for the JPEGified version of the toy network. For more meaningful accuracies, the network is trained until convergence in the pixel domain and the weights are then converted. Since our other operations are supposed to be "exact", this should yield the same accuracy as a pixel domain network to within some small floating point error, which is confirmed by the result in Table 7.1.

Next we examine the accuracy of the ReLU approximation. Since this is not a true ReLU, we expect there to be some effect on overall network accuracy when fewer frequencies are used. However, it is still a non-linearity which should enable the network to learn effective mappings. We consider ReLU accuracy from three perspectives:

Absolute Error How accurate is our ASM approximation compared with a naive approximation?

Conversion Error If we convert pre-trained weights, how much does the number of frequencies affect the final accuracy result?

Training Error If we train a network from scratch using the ReLU approximation, how much does the number of frequencies affect the final accuracy result? (This assumes the same number of frequencies are used for training and inference.)

Figure 7.5: ReLU Approximation Accuracy. Left: RMSE error. Middle: Model accuracy after model conversion. Right: Model accuracy when re-training from scratch. Note that APX denotes the naive ReLU approximation. Dotted lines represent spatial domain accuracy.

We show results to this effect in Figure 7.5. The left graph shows the absolute error of the ReLU approximation. For this experiment, 10 million 8 × 8 blocks are generated by upsampling random 4 × 4 pixel blocks. We then measure RMSE between the true block and the approximated block. Note that compared to the naive approximation, the ASM method we developed has lower error throughout and the error drops faster. In the middle graph, we show model conversion error. We train 100 models from random weights in the pixel domain and then apply our algorithm to convert the weights, and measure the resulting classification accuracy. Again we see that the ASM method has better performance.

Figure 7.6: Throughput Comparison. We compare JPEG domain and spatial domain training and inference.

In the final graph, we train
networks from random weights using our JPEG domain algorithm. Interestingly, this performs significantly better than model conversion indicating that the weights have learned to adapt to the ReLU approximation. The final result we show is throughput. In general, the method developed here should be fast if for no other reason than the JPEG images do not need to be decompressed before being processed. In Figure 7.6 we compare throughput for training and testing in the JPEG domain vs in the spatial domain. As expected, inference is significantly faster in the JPEG domain. Curiously, however, training is only slightly faster. This is caused by the more complex update rule for autograd to compute through the ReLU approximation and the JPEG domain conversion for the convolutions. 7.6 Limitations and Future Directions The astute reader will have noticed by now a major limitation with this work: memory usage. Recall that compressed domain convolutions are formed by con- volving the kernel C, a dim(P ) ? dim(P ?) ? 3 ? 3 matrix, with the JPEG decom- pression tensor J? ? X ? Y ? K ? H? ? W ? and then applying the JPEG com- 126 Throughput (Images/Sec) pression tensor J ? X? ? Y ? ?K? ? H ?W . This yields a the type (3, 3) tensor ? ? P ? ?X ? Y ?K ? P ?X? ? Y ? ?K?. Observe the size of this tensor. For an image of size dim(H)? dim(W ) it is in O((dim(H)?dim(W ))2). In other words we have taken a small constant size weight and expanded it to be on the order of the image size squared. This is perhaps the primary direction for future work. The massive size of this tensor entirely prevents the method from being useful for anything beyond the toy network and small image datasets presented in the previous section. While a constant size kernel could be created using tiling (each convolution depends on at most the blocks one outside of the ?currently processed? block), this would still be significantly larger than the small kernel used by spatial domain networks. By restricting the convolution to a single block, an dim(P ?) ? dim(P ) ? 8 ? 8 ? 8 ? 8 kernel could be created with an approximate result which would significantly improve the situation. It is left to future work to determine the practicality of these ideas and what their effect on network accuracy is. Our ReLU formulation is currently an approximation. As we studied in the previous section, this approximation does impact the overall network accuracy even when retraining. It would be nice if an exact ReLU could be formulated to avoid this issue. It is currently unknown if this is possible. While on the topic of ReLU, software support for our method is currently quite lacking. In essence, many of our memory and speed savings come from the sparse nature of JPEG compressed data. Zero elements could take up no memory and contribute no operations to the compute graph, but this depends on adequate 127 software support for sparse operations which is currently missing from libraries like PyTorch. Specifically, support for sparse einsum would need to be added. This is perhaps the low-hanging fruit that would immediately reduce the memory footprint while further increasing the speed the algorithm. 128 Chapter 8: Improving JPEG Compression With a good understanding of JPEG compression and how it relates to deep learning, we turn to a survey of methods which improve JPEG compression. These methods are essentially specializations of image enhancement. So sister problems in this domain are super-resolution, denoising, deraining, etc. 
Notably, we will not be considering new deep learning based codecs, which are beyond the scope of this dissertation. These methods are, however, reviewed briefly in Appendix C. We focus on historical methods which made significant advancements in the understanding of JPEG artifact correction and present them roughly in publication order, although they are grouped into sections by their high level ideas.

Before discussing the deep learning techniques we first mention two classical methods for correction of JPEG artifacts. The first method uses a "pointwise shape-adaptive DCT" (SA-DCT) [68]. The SA-DCT can be thought of as a generalization of the block DCT used by JPEG to account for blocks of varying shape. Foi et al. model JPEG compression artifacts as Gaussian noise with zero mean and compute σ² using an empirically developed formula on the quantization matrix. For each point in the image, the technique computes a DCT kernel that best fits the underlying data (hence shape-adaptive). This filter is then used to estimate the Gaussian noise term for enhancement. The next method [69] uses a generalized lapped biorthogonal transform (GLBT) [70]. In this technique, the JPEG DCT coefficients are modeled as an intermediate output of the GLBT, and the remaining filters in the method are designed to remove blocking artifacts. Prior to deep learning, these techniques were the most successful at removing JPEG artifacts.

Warning This chapter is mostly a history lesson. Skip to the last section if you want a TL;DR.

8.1 Pixel Domain Techniques

We begin our discussion with the straightforward "pixel domain" techniques. These networks function as traditional convolutional networks: they take pixels as input and output either the corrected image or its residual.

The first such technique was ARCNN [71], later followed up by Fast ARCNN [72]. These networks follow a traditional encoder-decoder architecture and are based on the contemporary SRCNN [73]. ARCNN is tiny by modern standards, with four convolutional layers. The first is a 9 × 9 layer with 64 channels, next a 7 × 7 with 32 channels, then a 1 × 1 with 16 channels, and finally a 5 × 5 decoder with 1 channel (for grayscale only). The authors of ARCNN claim that each layer is designed for a specific purpose, but there is no deep supervision on the layers and they are trained end-to-end, so it is unlikely that they learn a particular task. Fast ARCNN changes this architecture to an "hourglass" shape, essentially a U-Net [36] without skip connections, which was common at the time. The architecture uses strided convolutions for the downsampling operations. Since the size of the feature maps is reduced, the architecture processes images faster, hence the name. This does reduce the overall reconstruction accuracy, however.

The L4/L8 networks [74] introduce two major new ideas to artifact correction. The first is the idea of residual learning, where the network is encouraged to learn only the difference between the input image and the true reconstruction. In other words, the reconstructed image Xr is expressed as

Xr = Xc + f(Xc)    (8.1)

for compressed image Xc and learned network f(·). The second contribution is that of an edge preserving loss. The authors rightly observe that prior networks, due to their regression-only losses, produce blurry edges. They solve this by using Sobel filters to compute the partial first derivatives of the reconstructed image and computing the loss on these filtered images, which focuses the network on edge reconstruction.
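The following is a minimal sketch of these two ideas, residual prediction (Equation 8.1) and a Sobel-based edge-preserving loss, not the authors' implementation; the tiny stand-in network and the loss weighting are placeholders.

```python
# Residual learning plus an edge-preserving loss on Sobel-filtered images.
import torch
import torch.nn.functional as F

def edge_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L2 loss between Sobel-filtered reconstructions (grayscale, NCHW)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=pred.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    def grads(x):
        return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)
    px, py = grads(pred)
    tx, ty = grads(target)
    return F.mse_loss(px, tx) + F.mse_loss(py, ty)

# f() predicts only the correction; a tiny stand-in network is used here.
f = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 1, 3, padding=1))

x_c = torch.rand(4, 1, 64, 64)       # compressed input
x_gt = torch.rand(4, 1, 64, 64)      # uncompressed target
x_r = x_c + f(x_c)                   # Equation 8.1: residual learning
loss = F.mse_loss(x_r, x_gt) + 0.1 * edge_loss(x_r, x_gt)
loss.backward()
```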
As expected, the L4/L8 architectures have four and eight layers respectively and otherwise do not differ significantly from ARCNN.

CAS-CNN [75] builds on the previous idea by employing a significantly more complex architecture. This architecture contains skip connections, not unlike a U-Net, and upgrades the traditional regression loss to use multiple scales. These scales are computed using deep supervision of the downsampled feature maps and make a fairly significant improvement to the overall accuracy. This is likely helped by the skip connections in the U-Net architecture.

We now jump to MWCNN [23], which represents a major change in architecture. MWCNN is a fascinating method for general image restoration which was applied directly to JPEG artifacts at the time of publication (along with other problems). The key idea is to replace the pooling layers in a traditional CNN with a discrete wavelet transform. Recall that a discrete wavelet transform computes band-pass filters which restrict each output to half the frequency range of the input. By the Nyquist sampling theorem, we can then discard half the samples without losing any information. MWCNN exploits this by using the DWT in place of a pooling operation, stacking the resulting four frequency sub-bands in the channel dimension without any significant loss of information. The original image can then be reconstructed by using the inverse wavelet transform on the feature maps after traditional convolutional layers. Otherwise the architecture resembles a U-Net. The use of this clever signal processing trick allows MWCNN to achieve remarkable results on a number of restoration tasks including JPEG artifact correction.

Honorable mention at this point goes to DPW-SDNet [76]. This could be considered a dual-domain method, although we take a somewhat stricter definition of domain, so instead we list it here with MWCNN. The main contribution of DPW-SDNet was to include two networks, one which processes the image in the pixel domain and another which processes it after a single level DWT.

Another method from 2018, S-Net [77], introduces a scalable network. This is based on the apt observation that more quantization requires a deeper network and "more work" to restore. Their architecture is, therefore, scalable based either on the amount of degradation applied to the image or on constraints on the compute budget of the hardware. This was an important contribution toward the practical use of artifact correction and remains an under-explored idea.

Two works by Galteri et al. [78], [79] introduce GANs to the problem of artifact correction. As we observed in the discussion of L4/L8 and CAS-CNN, regression losses produce a blurry result. This is both because of the CNN's inherent bias towards error minimization, something which is easiest to accomplish with a low-frequency reconstruction, and because of JPEG's tendency to destroy high frequency details in the first place. Although L4/L8 and CAS-CNN make progress on this problem with specialized losses, they have obvious limitations which Galteri et al. overcome with a GAN loss. This generates significantly more realistic reconstructions, although there is no attempt at an "accurate" reconstruction with good numerical results (which, in my opinion, is completely acceptable). The 2019 version of this work even includes a rudimentary attempt at a "universal" architecture which can operate independently of the quality setting, although it accomplishes this with an ensemble.

The final technique we discuss in this section is RDN [80].
This represents a departure from the more traditional U-Net style networks we have been discussing. Instead, RDN is based on ESRGAN [81] and its RRDB layers. These layers are an enhanced version of the traditional residual layer [9] with more residual connections. Just as these layers were a huge improvement for super-resolution, they are a huge improvement for artifact correction.

8.2 Dual-Domain Techniques

Dual-domain techniques are the result of an attempt to inject some low level JPEG data into the learning process. The high level idea is to process the input in both the spatial (pixel) domain and the frequency (DCT) domain. This is done with two separate networks whose results are fused. This way, if there is some information that either domain does not capture, it can potentially be exploited by the other domain. The technique was introduced with a sparse-coding method [82] that we will examine in the next section.

On the deep learning side, the idea is first addressed with DDCN [83]. The idea is very straightforward: there are two separate encoders, one for the pixel domain and one for the DCT domain. The output of both networks is processed by a third aggregation network which decodes to a residual that is added to the input image.

DMCNN [84] extends this idea in two ways. The first is with a multiscale loss on the pixel branch, as we saw in L4/L8. The next is with a DCT rectifier which constrains the magnitude of the DCT residual based on the possible values that the true coefficients could take. Recall the formula for quantization

Y′_ij = round( Y_ij / (Q_y)_ij )    (8.2)

shown here for the Y channel only. The approximated coefficient is then

Ŷ_ij = (Q_y)_ij · round( Y_ij / (Q_y)_ij )    (8.3)

Dividing by (Q_y)_ij gives

Ŷ_ij / (Q_y)_ij = round( Y_ij / (Q_y)_ij )    (8.4)

We can now expand this as an inequality, since the true value must lie within [−1/2, 1/2] of the rounding result:

Ŷ_ij / (Q_y)_ij − 1/2 ≤ Y_ij / (Q_y)_ij ≤ Ŷ_ij / (Q_y)_ij + 1/2    (8.5)

Multiplying by (Q_y)_ij yields our desired constraint on Y_ij:

Ŷ_ij − (Q_y)_ij / 2 ≤ Y_ij ≤ Ŷ_ij + (Q_y)_ij / 2    (8.6)

Since the artifact correction network is trying to compute Y_ij from Ŷ_ij, this constraint helps reduce the space of possible solutions.
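A minimal sketch of a rectifier in the spirit of Equation 8.6 is shown below: the network's coefficient prediction is clamped to the feasible interval implied by the dequantized values and the quantization matrix. This is an illustration only, not DMCNN's implementation, and the quantization matrix value is hypothetical.

```python
# Clamp restored DCT coefficients to the range allowed by Equation 8.6.
import torch

def dct_rectify(restored: torch.Tensor,
                dequantized: torch.Tensor,
                q: torch.Tensor) -> torch.Tensor:
    """restored, dequantized: (N, 1, H, W) DCT coefficients on an 8x8 block grid.
    q: (8, 8) quantization matrix, tiled to cover the coefficient grid."""
    q_full = q.repeat(restored.shape[-2] // 8, restored.shape[-1] // 8)
    lower = dequantized - q_full / 2
    upper = dequantized + q_full / 2
    return torch.clamp(restored, min=lower, max=upper)

q = torch.full((8, 8), 16.0)                 # hypothetical quantization matrix
y_hat = torch.randn(1, 1, 64, 64) * 16       # dequantized JPEG coefficients
y_pred = y_hat + torch.randn(1, 1, 64, 64)   # network output before rectification
y_rect = dct_rectify(y_pred, y_hat, q)
```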
The next major innovation in dual-domain methods is IDCN [85]. The major advantage of IDCN is that it is designed for color images and uses "variance maps" to account for the differences in statistics between the channels. Their dual-domain formulation is also of interest. They introduce a dual-domain layer which is "implicit": similar to our result in Section 6.2 (The Multilinear JPEG Representation), the DCT transform can be composed such that the DCT result and the pixel result are computed simultaneously.

Finally, Jin et al. [86] extend the dual-domain concept to process frequency bands in different paths. This is based on two observations: firstly, some artifacts are restricted to particular frequency bands, and secondly, as we have said many times, accurate high-frequency reconstructions are difficult. By separating out the frequency bands for separate processing, the network is able to focus on restoring those particular frequencies, as well as freeing up model capacity for artifacts which occur only in the frequency bands considered by each branch.

8.3 Sparse-Coding Methods

Sparse coding is a dictionary learning method. A series of representative examples are learned which (we hope) form an "over-complete" basis for our solution space (I do not believe that an "over-complete basis" is an actual concept in linear algebra; I assume the developers of this method are referring to a frame). Because the input is no longer uniquely determined by the basis, we also try to enforce sparsity such that the members of the basis are as sparse as possible. We do not cover sparse coding in more detail in this dissertation.

Sparse coding was introduced to artifact correction by Li et al. [82], where they also introduced dual-domain learning. The idea is straightforward: learn sparse codes in pixel space and DCT space and fuse the results. D3 [87] makes an interesting extension to Li et al.: they formulate the problem in a "feed forward" manner. In other words, sparse coding is used first on the DCT coefficients and then the result of that is fed into another sparse coding module in the pixel domain. Both stages are supervised with loss functions similar to neural networks.

The final sparse coding method we consider, DCSC [88], is pixel domain only. However, they incorporate a simple convolutional network into their architecture such that the sparse codes are computed on CNN features. This gives a sort of "best case" scenario where the powerful convolutional features can be exploited by the sparse coding method. As a bonus, their method uses a single model for all quality settings, although they do not train in the general case and only target qualities 10 and 20.

Table 8.1: Summary of JPEG Artifact Correction Methods. The methods are all listed with their technique (CNN or Sparse Coding) and whether they incorporate dual-domain information or not. This table is not exhaustive. Methods are sorted by year.

    Year  Method                      Citation  Technique      Dual Domain  Note
    2015  ARCNN                       [71]      CNN            No
          Data driven sparsity ...    [82]      Sparse Coding  Yes
    2016  L4/L8                       [74]      CNN            No
          DDCN                        [83]      CNN            Yes
          D3                          [87]      Sparse Coding  Yes
    2017  CAS-CNN                     [75]      CNN            No
          Deep Generative ...         [78]      CNN            No           GAN, Color
    2018  MWCNN                       [23]      CNN            No           Uses DWT instead of pooling
          DPW-SDNet                   [76]      CNN            No           Dual wavelet and pixel domain
          S-Net                       [77]      CNN            No           Scalable
          DMCNN                       [84]      CNN            Yes          DCT Rectifier
    2019  Deep Generative ...         [79]      CNN            No           GAN, Universal with ensemble, Color
          IDCN                        [85]      CNN            Yes          Implicit DCT Layer, Color
          DCSC                        [88]      Sparse Coding  No           Uses CNN Features
    2020  RDN                         [80]      CNN            No           Uses RRDB
          Dual stream multi path ...  [86]      CNN            Yes

8.4 Summary and Open Problems

We summarize all the methods discussed in this chapter in Table 8.1. There are some interesting things we can take away from this discussion. For example, it seems that dual-domain methods work well and they are continually revisited. Deeper networks have also naturally been successful, but the switch to RRDB layers by RDN was particularly interesting. More complex techniques, like wavelet based or sparse coding based methods, are underutilized and may be more complex than is needed given the advances of vanilla neural networks.

One noteworthy takeaway is that while there are pixel domain techniques and dual-domain techniques, there is not a single DCT domain only technique. A careful examination of ablation studies in the dual-domain papers explains this: their DCT branches do not perform well on their own. Somehow, the DCT branch captures new information that the pixel branch does not, but not enough to carry out restoration on its own. This is likely caused by the DCT being a set of coefficients for orthogonal basis functions rather than a single correlated signal like pixels. We will consider this an open problem as we move into the next section.
Also somewhat surprising is that although many methods recognized that JPEG artifact correction struggles to restore high frequencies with regression losses, only one author thought to use a GAN for correction. This is at least partially because of the community's incessant focus on benchmark results as the criterion for publication: GAN restoration does not perform well on the benchmarks. We consider this an open problem as well.

Another oddity: very few of the methods explicitly treat color images. This is odd on its own, but even more so when we consider that JPEG explicitly handles chrominance differently than luminance, compressing it more aggressively and downsampling it. Also, there is spatial correlation between luminance and chrominance which could and should be exploited in the reconstruction. Only the works of Galteri et al. and IDCN explicitly handle color data. This is another open problem.

Finally, and crucially, there are very few "universal" or "quality blind" techniques. In fact, the only ones discussed in this section were Galteri et al. [79] and DCSC [88]. All other networks in this section train a different network for each quality setting they consider, a practice which is not sustainable in real deployments. Although solutions to this problem have been cropping up at the time of writing [89]–[92], we again consider this to be an open problem. In the next chapter, we will develop a method that addresses these problems.

Chapter 9: Quantization Guided JPEG Artifact Correction

In the previous chapter we discussed several methods for using deep networks to improve JPEG compression. These techniques augment JPEG with significantly better rate-distortion while allowing users to still produce and share their familiar JPEG files (with the added benefit that users without special software can still view the files, albeit at lower quality). These networks, however, come with three major disadvantages that have so far made them purely academic successes.

First and foremost, these methods are so-called "quality aware" methods in which all training and testing data is compressed at a single quality level. This yields a single model per JPEG quality, which is undesirable for several reasons. Recall that quality is an integer in [0, 100], thus potentially requiring 101 different models to be trained. Although it is likely that the models may generalize to nearby qualities (we will examine this somewhat in Section 9.6.2 (Generalization)), at the very least this still requires the training and deployment of more than one model, something which is still considered expensive for most institutions. Furthermore, when these models are deployed, they will be given arbitrary JPEG files to correct, and the JFIF file format does not store quality, leaving a real system with no reliable method to choose a model. This problem could be solved with an auxiliary model that regresses from image to quality [79], but this still requires training and deploying an ensemble, and now an additional model to pick the quality.

The next, and perhaps more peculiar, problem with these methods is that they are grayscale only. In other words, these models only work on the luminance channel of the compressed images. While this does align well with human perception, humans can certainly perceive color degradations (see Figure 9.1 to perceive this yourself).
There is an implicit assumption that luminance models could be applied channel-wise to YCbCr or RGB images; however, we find that this does not hold well in practice, as we show in Section 9.6.1 (Comparison with Other Methods).

Lastly, these methods are hyper-focused on error metrics. While this has proven to be a reliable way to improve rate-distortion, it generally does not translate to improved perceptual quality, producing blurry edges and an overall lack of texture. To improve perceptual quality, more complex techniques are required.

In this chapter we develop a technique which addresses all three of these major problems. Our method leverages low-level JPEG information to condition a single network on quantization data stored in the JFIF file, allowing one network to achieve good results on a wide range of qualities. Our network treats color channels as first class citizens and takes concrete steps to correct them effectively, keeping in mind that JPEG compression treats color and luminance differently, applying more compression to the color channels. Finally, we develop texture restoring and GAN losses that are designed to produce a visually pleasing result, especially at low qualities. This method was published separately in the proceedings of the European Conference on Computer Vision [93].

Figure 9.1: Overview. The network first restores the Y channel, then the color channels, then applies GAN correction.

First Principles
• Conditioning the network on the quantization matrix allows it to correct at many different qualities using information available to a real system
• Explicitly modeling color degradation improves performance on color images
• Formulating DCT domain regression allows the network to leverage quantization data more effectively
• GAN loss functions for high frequency restoration

9.1 Overview

The method we develop in this chapter consists of several parts, all of which operate together to produce the final result. We will develop this method from the bottom up, starting with the individual building blocks and then describing how they are connected. At a high level, our network operates in several stages, illustrated in Figure 9.1.

Our network first corrects the luminance (Y) channel of the image. The luminance channel has less aggressive compression applied to it and serves as a base for further correction. Our network then moves on to correcting the color channels. As these channels are further compressed, they lack fine detail and structure that may have been present in the luminance channel, especially after correction. Therefore, we provide the corrected luminance channel along with the degraded color channels to the color correction network to give it additional information.

Throughout the network, we condition carefully selected layers on the JPEG quantization matrix. Recall that this 8 × 8 matrix describes how much rounding was applied to each DCT coefficient. Because this directly describes a phenomenon in the frequency domain, our entire network processes the DCT coefficients of the input only: no pixels are used, and the network produces DCT coefficients as output. This is in stark contrast to other methods, which use only pixels or both pixels and coefficients, and it depends on new developments in DCT domain networks. We use the methods described in Section 7.1 (New Architectures) to correctly process these data.
Before these methods were developed, DCT domain networks had objectively inferior performance to pixel and dual-domain networks.

Our training likewise proceeds in stages. After training the network to produce only luminance coefficients using regression, we then add the color network and train it again using regression. This way, the color network always gets a high quality luminance result to condition its own correction on. After the luminance and chrominance networks are trained, we then fine-tune the entire network using GAN and texture losses. This adds significant detail to the result while preventing it from quickly deviating and diverging.

9.2 Convolutional Filter Manifolds

One potential limitation of traditional convolutional networks is that they learn only a single mapping from input features (Fi) to output features, in other words

h(Fi) = φ(W ∗ Fi)    (9.1)

for non-linearity φ and learned weight W. While this is sufficient for many use cases, it can be limiting in others.

Specifically, in our case we would like to specialize the learned filters for different quantization matrices; in other words, the learned weight W should be a function of the quantization matrix Q. One simple way to do this is to tile Q to match the shape of Fi and concatenate the two:

h(Fi, Q) = φ(W ∗ [Fi  Q])    (9.2)

however, this yields only a linear mapping between Fi and Q, which limits the learned relationship between them.

Instead, we can use a filter manifold [94], sometimes called a kernel predictor [95]. The goal of the filter manifold is to predict a convolutional kernel given a scalar side input, i.e.,

h(Fi, s) = φ(W(s) ∗ Fi)    (9.3)

for s ∈ R, so now the weight W is a non-linear function of s. Kang et al. choose a small MLP for W:

h(Fi, s) = φ(W(s) ∗ Fi)    (9.4)
W(s) = φ(W2(φ(W1 s)))    (9.5)

allowing the network to learn a non-linear relationship between the side data and Fi along with the learned mapping between Fi and the network output.

However, our side input Q is not a scalar: it is an 8 × 8 matrix. Using an MLP for this input would be computationally expensive, so we propose a simple extension, termed convolutional filter manifolds (CFM), which replaces W(·) with a small convolutional network. We additionally learn a bias term along with the weight:

h(Fi, Q) = φ(W(Q) ∗ Fi + b(Q))    (9.6)
b(Q) = Wb ∗ Fq(Q)    (9.7)
W(Q) = Ww ∗ Fq(Q)    (9.8)
Fq(Q) = φ(W2 ∗ (φ(W1 ∗ Q)))    (9.9)

This formulation allows us to learn parameterized weights representing the complex relationship between the JPEG DCT features and the quantization matrix, and it can be thought of as generating a "quantization invariant" representation for the network to operate on. This is the primary contribution which allows the network to model degradations from many different quality levels. In Section 9.3 (Primitive Layers) we will describe primitive layers which make use of this formulation, and in Section 9.4 (Full Networks) we will describe where these layers are placed in the overall network structure in order to maximize their effectiveness. In Section 9.6.4 (Exploring Convolutional Filter Manifolds), we will explore some interesting properties of these layers.
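The following is a minimal sketch of a CFM-parameterized convolution in the spirit of Equations 9.6–9.9: a small convolutional network maps the 8 × 8 quantization matrix to the weight and bias of an 8 × 8 stride-8 convolution, which is then applied to the input DCT coefficients. The channel sizes, head layouts, and the assumption of a single quantization matrix shared by the whole batch are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFM(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.f_q = nn.Sequential(                      # F_q(Q), Equation 9.9
            nn.Conv2d(1, 64, 8), nn.PReLU(),           # 8x8 Q -> (1, 64, 1, 1)
            nn.Conv2d(64, 64, 1), nn.PReLU())
        # W(Q): transposed 8x8 convolution producing an (out*in, 8, 8) map.
        self.w_head = nn.ConvTranspose2d(64, out_ch * in_ch, 8)
        # b(Q): one bias value per output channel.
        self.b_head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (N, in_ch, H, W) DCT coefficients; q: (1, 1, 8, 8) quantization matrix.
        feat = self.f_q(q)                                          # (1, 64, 1, 1)
        w = self.w_head(feat).view(self.out_ch, self.in_ch, 8, 8)   # W(Q)
        b = self.b_head(feat).view(self.out_ch)                     # b(Q)
        return F.conv2d(x, w, b, stride=8)                          # Equation 9.6

cfm = CFM(in_ch=1, out_ch=32)
coeffs = torch.randn(4, 1, 256, 256)          # Y-channel DCT coefficients
q = torch.rand(1, 1, 8, 8)                    # normalized quantization matrix
block_repr = cfm(coeffs, q)                   # (4, 32, 32, 32) block representation
```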
9.3 Primitive Layers

The network we develop in this chapter depends on several "primitive layers", or basic operations, which we will use to build the network. In this section, we describe them in detail. The first is the Residual-in-Residual Dense Block (RRDB) layer [81], first developed for super-resolution. This layer consists of three "Dense Blocks" in a residual sequence. Each of these "Dense Blocks" consists of five convolution-ReLU layers with skip connections between each layer, forming an enhanced version of the standard residual block. See Figure 9.3 for a schematic depiction of this layer. We make only one change to the RRDB used in ESRGAN, replacing the Leaky ReLU [96] with Parametric ReLU [97].

In Section 7.1 (New Architectures) we discussed recent advances in convolutional networks that can take advantage of the unique characteristics of DCT coefficients. We employ both of these layers in our network. The first is frequency-component rearrangement (FCR), where the DCT coefficients for each block are arranged in the channel dimension, yielding 64 channels and 1/8th the width and height of the input. We take the additional step of using grouped convolutions with 64 groups to ensure that each frequency is processed in isolation. See Figure 7.1 for the frequency rearrangement and Figure 9.2 for an illustration of the grouped convolution. We insert these layers into the RRDB described above. This paradigm allows our network to focus on enhancing individual frequency bands more effectively.

Figure 9.2: FCR With Grouped Convolutions. Each frequency component is processed in isolation with its own convolution weights. We implement this using a grouped convolution with 64 groups.

However, many frequency bands are entirely zeroed out by the compression process. Completely relying on the grouped convolution would be destined for failure because if a frequency band is set to zero, no amount of convolutional layers can change its value (it will either remain zero or be set to the layer biases). Therefore, we need a layer which is also capable of looking at multiple frequency bands, and for this we choose the 8 × 8 stride-8 layer. This layer produces a representation of each DCT block by considering all the frequency bands in the block at once. Since the stride is set to 8, the representation does not include information from nearby blocks. Information from nearby blocks is incorporated by processing the block representations with RRDB layers.

Figure 9.3: RRDB Layer shown with input feature map Fi and output feature map Fo. Note that we change the original RRDB layer by adding PReLU layers.

Since these layers consider the DCT coefficients of the entire block, we take the additional step of using CFMs instead of regular convolutions to equip the layers with quantization information, thus generating the "quantization invariant" block representation. This is shown in Figure 9.4. For our applications, the input to the CFM is the 8 × 8 quantization matrix. This is processed with a convolutional network to produce the weight and bias. Note that the weight layer has in channels × out channels channels and is a transposed 8 × 8 convolution. The result is reshaped to an out channels × in channels × 8 × 8 convolution kernel.

Figure 9.4: 8 × 8 stride-8 CFM. Note that the numbers in parentheses denote the number of channels. The CFM layer computes a weight and bias from the quantization matrix using a small convolutional network. The result of this network is reshaped as either a weight or a bias.
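To make the frequency-component rearrangement concrete, the sketch below uses pixel_unshuffle to move each 8 × 8 block's coefficients into the channel dimension and then applies a grouped convolution so that every frequency band is processed in its own isolated path. The channel counts are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

coeffs = torch.randn(4, 1, 256, 256)           # Y-channel DCT coefficients

# FCR: (N, 1, H, W) -> (N, 64, H/8, W/8); channel k holds frequency (k // 8, k % 8).
fcr = F.pixel_unshuffle(coeffs, 8)             # (4, 64, 32, 32)

# Grouped convolutions: each of the 64 frequency bands gets its own filters.
net = nn.Sequential(
    nn.Conv2d(64, 256, 3, padding=1, groups=64), nn.PReLU(),
    nn.Conv2d(256, 64, 3, padding=1, groups=64))

# Inverse FCR restores the coefficient layout after per-frequency processing.
out = F.pixel_shuffle(net(fcr), 8)             # (4, 1, 256, 256)
```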
9.4 Full Networks

With the primitive layers defined, we now show how to build them into the networks and subnetworks our method uses for correction of JPEG artifacts. Recall that our method first corrects the grayscale channel and then uses that result to aid correction of the chroma channels. Therefore we start by describing the grayscale correction network, and after that we describe the color correction network.

The grayscale correction network, shown in Figure 9.5 left, consists of four subnetworks which work in series to produce the final correction: two blocknets, a frequencynet, and a fusion network, which we describe next.

The blocknet (Figure 9.6 left) uses the 8 × 8 stride-8 CFM layers described in the previous section. It computes block representations and then processes the representations with stacked RRDB layers before decoding the block representations with a transposed CFM layer. Between the two blocknets we place a frequencynet (Figure 9.6 middle). This uses the FCR grouped convolutions to enhance frequency bands in isolation. The frequencies are first rearranged before being processed with RRDB layers. The result is then rearranged to restore the frequencies to the spatial dimensions.

The intermediate results from all of the subnetworks are then passed to a fusion layer (Figure 9.6 right). The primary purpose of this is to strengthen the gradient received by the early layers, which would be prone to gradient vanishing otherwise [98].

The color correction network (Figure 9.5 right) borrows the main ideas from the blocknet in the grayscale correction network. We assume that inputs are 4:2:0 chroma subsampled, which means they must be upsampled by a factor of two in each dimension to match the grayscale resolution. We use the block representation of the color channels and a 4 × 4 stride-2 layer to do the upsampling. The result is concatenated channelwise with the block representation of the restored Y channel before being processed further and finally decoded. In both the grayscale and color networks, we treat network outputs as residuals which are added to the degraded input coefficients.

Figure 9.5: Restoration Networks. Left: Y-Channel Network. Right: Color Channel Network. Note the skip connections around each of the subnetworks in the Y-Channel Network, which promote gradient flow to these early layers.

Figure 9.6: Subnetworks. Left: BlockNet, Center: FrequencyNet, Right: Fusion.
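As a high-level sketch of how these pieces are wired together in the Y-channel path (blocknet, frequencynet, blocknet, fusion, with a residual output), the following uses placeholder subnetworks; it only illustrates the composition described above and is not the published architecture.

```python
import torch
import torch.nn as nn

class YRestorer(nn.Module):
    def __init__(self, block_a, freq_net, block_b, fusion):
        super().__init__()
        self.block_a, self.freq_net = block_a, freq_net
        self.block_b, self.fusion = block_b, fusion

    def forward(self, y_coeffs, q):
        a = self.block_a(y_coeffs, q)            # quantization-aware block stage
        f = self.freq_net(a)                     # per-frequency enhancement
        b = self.block_b(f, q)                   # second block stage
        residual = self.fusion(torch.cat([a, f, b], dim=1))
        return y_coeffs + residual               # residual added to input coefficients

class _Stub(nn.Module):
    """Placeholder subnetwork that preserves the coefficient shape."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 3, padding=1)
    def forward(self, x, q=None):
        return self.conv(x)

net = YRestorer(_Stub(), _Stub(), _Stub(), nn.Conv2d(3, 1, 3, padding=1))
restored = net(torch.randn(2, 1, 256, 256), torch.rand(1, 1, 8, 8))
```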
9.5 Loss Functions

A well documented problem with image-to-image translation is that of a blurry result. Intuitively, since the network is told to optimize the l1 or l2 distance between the input and output, the easiest way to accomplish its goal is to produce a sort of "averaging". The human perception of this averaging is a low-frequency image which lacks fine details. This is exacerbated by compression, which intentionally removes high frequency details. In this sense, a simple error based loss function is, in essence, asking the network to solve the wrong problem. What we really want the network to do is restore high frequencies.

Nevertheless, an error based loss is useful for correcting the hard block boundaries that JPEG creates, as well as for preventing divergence with more complex losses. Therefore, we pre-train the grayscale and color networks using l1 and Structural Similarity (SSIM) [99] losses to ensure that they start from a reasonable location when we fine-tune with the more interesting loss functions (this is a visual improvement on its own; however, it is nothing compared to the result from the GAN and texture losses). We denote this loss function as

LR(Xu, Xr) = ||Xu − Xr||1 − λ · SSIM(Xu, Xr)    (9.10)

for restored image Xr and uncompressed image Xu (i.e., a version of Xr which was never compressed), where λ is a balancing hyperparameter.

With the color and grayscale networks trained for regression, we now move on to GAN [39] and texture losses. GANs were originally introduced purely for generating realistic images. The algorithm pits a generator network against a discriminator network, where the generator's goal is to produce an image which is realistic enough to fool the discriminator, and the discriminator's goal is to discover which images were generated by the generator. In this way, the two networks are adversaries in a game, and by rewarding them for doing well, the generator learns to create more realistic images. For our purposes, we use a GAN to hallucinate plausible high frequency details, edges, and textures onto the compressed images.

For this we employ the relativistic average GAN loss [100]. This loss function tweaks the original GAN definition to encourage the generator to produce images which appear "more realistic than the average fake data" and is generally more stable than a vanilla GAN. For our purposes, we redefine "fake" as the restored image Xr and "real" as the uncompressed image Xu. We then define the loss as

LRA(Xu, Xr) = −log(L(Xu)) − log(1 − L(Xr))    (9.11)

L(x) = σ(D(x) − E_{xr ∼ Restored}[D(xr)])        if x is uncompressed
L(x) = σ(D(x) − E_{xu ∼ Uncompressed}[D(xu)])    if x is restored    (9.12)

for discriminator D(·) and sigmoid σ(·). We base the discriminator D(·) on DCGAN [101]; its architecture is shown in Figure 9.7. All convolutional layers use spectral normalization [102]. Note that we provide both the compressed as well as the uncompressed/restored version of the image, and discriminator decisions are made on a per-JPEG block basis.

Figure 9.7: GAN Discriminator. Note that the discriminator makes decisions for each JPEG block.
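A minimal sketch transcribing Equations 9.11–9.12 directly is given below. It omits the per-block decisions and the conditioning on the compressed input, and in practice (as in relativistic average GAN training generally) the discriminator and generator updates apply this term with the roles of real and fake handled appropriately.

```python
import torch

def ra_gan_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Equations 9.11-9.12 with d_real = D(x_u) on uncompressed images and
    d_fake = D(x_r) on restored images (raw discriminator logits)."""
    l_real = torch.sigmoid(d_real - d_fake.mean())      # L(x_u)
    l_fake = torch.sigmoid(d_fake - d_real.mean())      # L(x_r)
    return -(torch.log(l_real + 1e-8) + torch.log(1 - l_fake + 1e-8)).mean()

# Per-block discriminator outputs for a batch of 4 images with a 32x32 block grid.
d_real = torch.randn(4, 1, 32, 32)
d_fake = torch.randn(4, 1, 32, 32)
loss = ra_gan_loss(d_real, d_fake)
```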
While the GAN is a useful tool for generating realistic corrections, the general notion of real or fake only provides so much information. In practice, GAN losses for image-to-image translation are often coupled with "perceptual losses" [103]. More specifically, these losses use an ImageNet [59] trained VGG network [34]. The intuition is that this auxiliary network measures a semantic similarity between the input image and the desired target, since the network was trained for classification. By encouraging semantic similarity, a more realistic result can be achieved since the images appear to fall into the same class.

While this is useful for general image-to-image translation, we find an alternative approach is more useful for compression. Since compression destroys high frequency details, like textures, the more these details can be recovered or sufficiently hallucinated, the more realistic the reconstruction. Therefore, we use a VGG network trained on the MINC [104] dataset for material classification. The main idea here is that if a restored and an uncompressed image have similar logits for a material classification task, they would likely be classified as the same material and therefore have realistic textures. We denote this loss function as

Lt(Xu, Xr) = ||MINC5,3(Xu) − MINC5,3(Xr)||1    (9.13)

where MINC5,3 indicates layer 5, convolution 3 from the MINC trained VGG. This yields the complete GAN loss

LGAN(Xu, Xr) = Lt(Xu, Xr) + γ · LRA(Xu, Xr) + ν · ||Xu − Xr||1    (9.14)

for balancing hyperparameters γ, ν. Note that the l1 loss makes another appearance here to prevent the GAN from diverging.
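Putting the pieces of this section together, the following is a minimal sketch of Equations 9.10, 9.13, and 9.14. The SSIM implementation and the MINC-trained VGG feature extractor are treated as given; here an untrained torchvision VGG slice stands in for the MINC network purely so the sketch runs, and the hyperparameter values are those reported in the next section.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Stand-in feature extractor; the real method uses a MINC-trained VGG.
minc_features = vgg16(weights=None).features[:24].eval()

def l_regression(x_r, x_u, lam=0.05, ssim=None):
    """Equation 9.10: l1 minus weighted SSIM (ssim is an assumed callable)."""
    s = ssim(x_r, x_u) if ssim is not None else torch.tensor(0.0)
    return F.l1_loss(x_r, x_u) - lam * s

def l_texture(x_r, x_u):
    """Equation 9.13: l1 distance between texture-network features."""
    return F.l1_loss(minc_features(x_r), minc_features(x_u))

def l_gan_total(x_r, x_u, l_ra, gamma=5e-3, nu=1e-2):
    """Equation 9.14: texture loss + weighted relativistic GAN loss + l1 anchor."""
    return l_texture(x_r, x_u) + gamma * l_ra + nu * F.l1_loss(x_r, x_u)
```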
9.6 Empirical Evaluation

No artifact correction work is complete without an empirical evaluation, and with the algorithm now developed, we are in a position to perform one. For this evaluation we train the network using the Adam [105] optimizer with batches of 32 256 × 256 patches; the network is implemented in PyTorch [17]. All DCT coefficients are normalized using per-channel and per-frequency means and standard deviations. Quantization matrices are normalized to [0, 1] and use the "baseline" setting in libjpeg [44].

The training proceeds in stages, as described previously. First the Y channel network is trained using LR (Equation 9.10) for 400,000 batches with the learning rate starting at 10⁻³ and decaying by a factor of 2 every 100,000 batches. We set λ = 0.05. Then we freeze the Y channel weights and train the color network using LR (Equation 9.10) for 100,000 batches with the learning rate decaying from 10⁻³ to 10⁻⁶ using cosine annealing [106]. With the network fully trained for regression, we then fine-tune end-to-end using LGAN (Equation 9.14). The network is again trained for 100,000 iterations using cosine annealing, this time with the learning rate starting at 10⁻⁴ and ending at 10⁻⁶. We set γ = 5 × 10⁻³ and ν = 10⁻².

For training data we use DIV2k and Flickr2k [107], which contain 900 and 2650 images respectively. We pre-extract 30 256 × 256 patches from each image and compress them using quality in [10, 100] in steps of 10 for a total training set size of 1,065,000 patches. We evaluate the method using the Live1 [108], Classic-5 [68], and ICB [109] datasets. To be consistent with prior works, we report PSNR, PSNR-B, and SSIM as metrics for the regression network. For the GAN we report FID score [43].

Table 9.1: QGAC Quantitative Results. Format is PSNR (dB) ↑ / PSNR-B (dB) ↑ / SSIM ↑ with the best result shown in bold. This table is provided for those dedicated enough to read the small font. These numerical results are unimportant; what is important is the qualitative results that follow.

    Dataset  Quality  JPEG                   ARCNN [71]             MWCNN [23]             IDCN [85]              DMCNN [84]             QGAC (Ours)
    Live-1   10       25.60 / 23.53 / 0.755  26.66 / 26.54 / 0.792  27.21 / 27.02 / 0.805  27.62 / 27.32 / 0.816  27.18 / 27.03 / 0.810  27.65 / 27.40 / 0.819
             20       27.96 / 25.77 / 0.837  28.97 / 28.65 / 0.860  29.54 / 29.23 / 0.873  30.01 / 29.49 / 0.881  29.45 / 29.08 / 0.874  29.92 / 29.51 / 0.882
             30       29.25 / 27.10 / 0.872  30.29 / 29.97 / 0.891  30.82 / 30.45 / 0.901  -                      -                      31.21 / 30.71 / 0.908
    BSDS500  10       25.72 / 23.44 / 0.748  26.83 / 26.65 / 0.783  27.18 / 26.93 / 0.794  27.61 / 27.22 / 0.805  27.16 / 26.95 / 0.799  27.69 / 27.36 / 0.810
             20       28.01 / 25.57 / 0.833  29.00 / 28.53 / 0.853  29.45 / 28.96 / 0.866  29.90 / 29.20 / 0.873  29.35 / 28.84 / 0.866  29.89 / 29.29 / 0.876
             30       29.31 / 26.85 / 0.869  30.31 / 29.85 / 0.887  30.71 / 30.09 / 0.895  -                      -                      31.15 / 30.37 / 0.903
    ICB      10       29.31 / 28.07 / 0.749  30.06 / 30.38 / 0.744  30.76 / 31.21 / 0.779  31.71 / 32.02 / 0.809  30.85 / 31.31 / 0.796  32.11 / 32.47 / 0.815
             20       31.84 / 30.63 / 0.804  32.24 / 32.53 / 0.778  32.79 / 33.32 / 0.812  33.99 / 34.37 / 0.838  32.77 / 33.26 / 0.830  34.23 / 34.67 / 0.845
             30       33.02 / 31.87 / 0.830  33.31 / 33.72 / 0.807  34.11 / 34.69 / 0.845  -                      -                      35.20 / 35.67 / 0.860

9.6.1 Comparison with Other Methods

We start by comparing our method to others in the form of a large boring table in Table 9.1. Note that this uses the regression weights only. Of the compared methods, only IDCN has native handling of color information, and all of the compared methods are quality dependent, with a different model for each quality. Ours, in contrast, is only a single model.

9.6.2 Generalization

In the development of this method we place emphasis on a single network generalizing to multiple JPEG qualities. This raises an interesting question: can other models generalize to different qualities? In general the answer is "no", and we demonstrate this using IDCN [85] with an example image compressed at quality 50.

Figure 9.8: Quality Generalization. Note that both the IDCN quality 10 and 20 models appear to oversmooth the quality 50 JPEG.

Figure 9.9: Increase in PSNR. Shown for color datasets on all JPEG quality settings. Note the steep dropoff at high qualities.

Since IDCN provides only quality 10 and 20 models, we test both of those models on this image. The result is shown in Figure 9.8. The quality 10 model oversmoothes the image and appears worse than the JPEG it was supposed to correct. The quality 20 model looks better, but QGAC's single model looks the best, as it was able to adapt its weights to the quality 50 JPEG by processing the quantization data. As this experiment shows, it is important for prior works to select the correct model for the JPEG.

In fact, since our method is not restricted in quality, we can show how it generalizes in an even more compelling way: by testing on all JPEG quality settings. We show this in the graph in Figure 9.9. Note that for most quality settings the increase is fairly stable; only at quality 90 and above does a steep dropoff occur. For these qualities, however, the degradation is hardly noticeable and artifact correction is likely not necessary.
Figure 9.10: Equivalent Quality Plots. Top: space savings on average. Bottom: equivalent quality on average.

9.6.3 Equivalent Quality

One important application of artifact correction is to improve compression fidelity. In other words, rather than replacing the entire JPEG codec with a compression algorithm based on deep learning, we can simply use more aggressive JPEG settings and use artifact correction to make the result presentable. This is much more likely to succeed in the short term due to the technical debt surrounding JPEG.

We explore this phenomenon using "equivalent quality", i.e., given a JPEG which is compressed at some low quality and then corrected, what higher quality would we have had to compress the image at in order to match the restored error? And how much space did we save by using the smaller JPEG image?

We start with an example in Figure 9.11. Note that our model is equivalent to almost doubling the quality of the JPEG, allowing us to save a significant amount of space. Next we compute the equivalent quality over the entire Live1 dataset and plot it along with the space savings in kB. We do this for qualities 10-50. These plots are shown in Figure 9.10.

Figure 9.11: Equivalent Quality Examples. Taking three images compressed at random qualities, we correct them and then find the compression quality that matches the corrected result in terms of error (for example, a quality 30 input matches a quality 58 JPEG, saving 46.8kB, or 47.9%).

Figure 9.12: Embeddings for Different CFM Layers. Three channels are taken from each embedding; color shows the JPEG quality setting that produced the input quantization matrix. Circled points indicate quantization matrices that were seen during training.

9.6.4 Exploring Convolutional Filter Manifolds

Of the various proposals in this work, one of the most intriguing is the Convolutional Filter Manifold (CFM). In this section we explore its properties empirically. First we can visualize the CFM weights. We do this by pulling out three channels of one of the CFM layers. Since we can adapt these weights, we generate quantization matrices for qualities 10, 50, and 100 and produce CFM weights. These are shown as heatmaps in Figure 9.13.

Figure 9.13: CFM Weight Visualization. Horizontal axis shows different channels of the weight, vertical axis shows quality. Quality levels shown are Top: 10, Middle: 50, Bottom: 100. These are simply the heatmapped 8 × 8 kernels of the CFM layer.

We see some interesting behavior in this figure. The kernels in each row appear different; since these are different channels, we hope that they differ, as they should capture different information. In contrast, the kernels in each column appear to be similar but scaled versions of each other, with the magnitude decreasing as quality increases. This makes sense because high quality images should need less correction.

We can take this visualization further by finding the images which maximally activate each of these filters. This will tell us which patterns each filter responds to. We do this by taking a noise image and optimizing it to maximize the output of the filter. In other words, we treat the visualization as a parameter and use stochastic gradient ascent with the objective being the magnitude of the response of the filter we wish to visualize. This result is shown in Figure 9.14.
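A minimal sketch of this activation-maximization procedure is shown below. The layer here is an ordinary convolution standing in for a CFM-parameterized layer, and the step count and learning rate are arbitrary.

```python
# Gradient ascent on an input image to maximize one filter's response.
import torch

layer = torch.nn.Conv2d(1, 32, 8, stride=8)       # stand-in filter bank
channel = 3                                        # filter to visualize
img = torch.rand(1, 1, 64, 64, requires_grad=True)
opt = torch.optim.SGD([img], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    response = layer(img)[0, channel]              # this filter's activation map
    loss = -response.mean()                        # minimize negative = ascend
    loss.backward()
    opt.step()
```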
We again see some interesting behavior. Clear patterns of JPEG artifacts are visible in these images. As in Figure 9.13, the columns seem to capture different types of artifacts, with the first column capturing local block artifacts, the second capturing larger block artifacts, and the third capturing some ringing artifacts. As we descend each column, we see similar artifacts reducing in strength until the quality 100 filter, which leaves the images mostly unchanged.

Figure 9.14: Images Which Maximally Activate CFM Weights. Horizontal axis shows different channels from the weight, vertical axis shows quality. Quality levels shown are Top: 10, Middle: 50, Bottom: 100.

Finally, we can examine the manifold property of the CFM. We show this in Figure 9.12, where we have taken three kernels from three different CFM layers for all possible quantization matrices (qualities 0-100). We then compute a t-SNE embedding [110] to two dimensions and plot the kernels. What we see is a smooth manifold through the space of quantization matrices. By coloring each point by the quality level used to generate it, we can see that the kernels induce an order on the space. We can also see that each channel corresponds to a different manifold.

9.6.5 Frequency Domain Results

Next we analyze the constituent frequencies of compressed and restored images. One of the claims we made when developing the method is that GAN training produces more realistic high frequency reconstructions. Indeed, by examining Figure 9.15 we can see that, compared with the JPEG and regression reconstructions, the GAN result has significantly more activity in the high frequencies. We show this both with heatmaps of the Y channel coefficients and by plotting the "probability" with which a given frequency is non-zero on a bar chart, for four examples (for simplicity, we use the definition of frequency m such that i + j = m for a 2D frequency (i, j)). Examining the frequency chart, we can see that in real images even the highest frequency components have some probability of being non-zero. This probability is significantly reduced by JPEG compression, and the regression result does little to correct it. The GAN result, on the other hand, has high frequency responses that are significantly higher, at least as likely to be non-zero as in the original images. So in this sense at least, the GAN loss was successful.

Figure 9.15: Frequency Domain Results. Note how the GAN reconstruction generates significantly more high frequency content than the regression reconstruction. Also note how few high frequencies are in the compressed image. We show only one example here; please see more examples in Appendix B.

9.6.6 Qualitative Results

In this section we simply show some qualitative outputs of the model; these results are shown in Figure 9.16. Observe that the degraded images suffer from extreme banding caused by the quantization process. Our reconstructions are able to effectively mitigate this banding, along with other complex ringing and blocking artifacts. More qualitative results are given in Appendix B.

Figure 9.16: Qualitative Results. The compressed images at quality 10 are compared to our reconstructions and the originals.

9.7 Limitations and Future Directions

Although this work represents a major step forward in the usability of JPEG artifact correction methods, there are still some major problems to be solved. First and foremost is the double compression problem.
Because QGAC parameterizes itself only on the quantization matrix of the file it is correcting, it has no way of knowing if the image was recompressed. For example, a real-life company, which will not be named directly, has a complex image processing pipeline that decompresses and recompresses each JPEG it receives multiple times. Realizing that this would lead to significant degradation, this company recompresses its images at quality 100, mitigating most quality loss. However, QGAC will treat this as a quality 100 JPEG and perform essentially no restoration on it: it knows no better. In effect, the image processing pipeline has lied to QGAC about the nature of the compression. This was partially addressed by AGARNET [91], which allows a spatially varying "Q-map", essentially per-pixel quality, to be used as an auxiliary input; however, generating the Q-map is not straightforward.

Then there is the related problem: the JPEG degraded image may not be stored as a JPEG at all. It is fairly common to transcode JPEG files to PNGs, where they can be stored without further degradation. QGAC, of course, cannot operate on PNGs because they do not contain quantization information. This was addressed by FBCNN [92], which trains a network to predict the quality level from pixels (along with the restored output), thus implicitly parameterizing the network on quality and allowing it to take any kind of image as input.

There is also the longstanding problem of high frequency reconstructions. It seems that there are currently two paradigms in restoration: low-frequency but accurate reconstructions, and high-frequency but inaccurate reconstructions (e.g., GAN reconstructions which look nice but have little relationship to the ground truth image). A "holy grail" of reconstruction work would be to allow accurate reconstructions in the high frequencies. This is partially addressed later in the dissertation with the scale-space loss of Metabit.

Another important direction to consider for practical usage of artifact correction is runtime, memory, and hardware concerns. The end goal is to put these methods into the hands of users who may be on smartphones or laptops, but little attention has been paid to this so far. Current techniques often require datacenter machines with powerful, often multiple, GPUs in order to run in a timely manner (or at all). More attention to practical, efficient formulations and to quantized integer models or specialized hardware is important to the widespread dissemination of this technology.

Chapter 10: Task-Targeted Artifact Correction

Thus far we have considered artifact correction as a tool for presenting attractive images to a user. In other words, where a compressed image contains certain artifacts, we want to suppress those artifacts so that the user can view something closer to the uncompressed image. We noted that this was a difficult task to accomplish for some time because artifact correction methods were trained on a "per-quality" basis with a different model for each quality, and we proceeded to develop a method for correction of JPEG artifacts that is "quality-blind", i.e., only a single model is trained for all JPEG qualities. We now consider a slightly different question: what if the images are intended for machine consumption and not human consumption?
How does this change the problem, if at all, and how do machine learning algorithms respond to JPEG compression? In this contribution of the dissertation, we develop a flexible method of overcoming the accuracy loss caused by JPEG compression on common computer vision models. This includes both a study of how JPEG compression affects these models and an examination of different methods for mitigating the accuracy loss. This method was published separately in the MELEX workshop in the proceedings of the International Conference on Computer Vision [111].

The method presented in this chapter trains an artifact correction network to target a specific computer vision task. This has significant advantages over off-the-shelf techniques, which we examine in Section 10.4 (Transferability and Multiple Task Heads). Namely, the method is transferable between models. In other words, once trained to assist a particular model, it is general enough to assist other models. Similarly, it can be trained to assist multiple tasks simultaneously without a significant penalty on its effectiveness. We call this method Task-Targeted Artifact Correction (TTAC).

First Principles
• JPEG degrades task performance. Leveraging explicit JPEG correction can mitigate the problem
• Supervise the JPEG correction method using differences between the uncompressed and corrected images
• Task-trained correction networks are generalizable to many downstream tasks

10.1 Standard JPEG Compression Mitigation Techniques

Before moving on we briefly review other techniques which are commonly thought to mitigate JPEG artifacts.

Supervised Fine-Tuning/Data Augmentation The simplest possible scheme: JPEG compressed inputs are mixed in during training as a form of data augmentation. The goal here is to train the network to expect JPEG compressed inputs and map them correctly. While this idea works, often well, it has several disadvantages. The first is that it sacrifices accuracy on clean images, so the result of the network is no longer "at a theoretical maximum" because it has, in some sense, expended capacity modeling JPEG compressed inputs. Additionally, this method requires ground-truth labels, which can be expensive to obtain.

Off-the-Shelf Artifact Correction Another exceedingly simple method: simply apply an artifact correction network to JPEG compressed inputs. Since the artifact correction method reduces error with respect to the clean image, intuition states that this should help the performance of a downstream task (and indeed it does). Moreover, this technique could be employed practically with the development of QGAC, which does not require knowledge of the JPEG quality. This technique also has the advantage of not requiring any training at all, and indeed keeping the clean accuracy intact is a selling point of the method. However, this is an "all-or-nothing" approach in that there is no way to tune it when it does not work.

Stability Training This technique [112] is more interesting than the last two ideas and involves logit matching between the network output on clean and perturbed (in this case JPEG compressed) images. In this case, the stability loss is defined as

Lstability(x, x′) = ||f(x) − f(x′)||2    (10.1)

where f(x) is some neural network and x′ is the perturbed version of x. This objective is then minimized along with the primary task objective during training. While this technique does encourage robustness and is self-supervised, it inherits several drawbacks from the supervised method: the task network now needs to expend capacity to model the compressed mapping, and performance on clean images is sacrificed.
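A minimal sketch of the stability term in Equation 10.1 is given below; the toy network, the stand-in for JPEG compression, and the weighting against the task loss are all placeholders.

```python
import torch

def stability_loss(task_net, x_clean: torch.Tensor, x_perturbed: torch.Tensor) -> torch.Tensor:
    """Equation 10.1: l2 distance between logits on clean and perturbed inputs."""
    diff = task_net(x_clean) - task_net(x_perturbed)
    return diff.flatten(1).norm(p=2, dim=1).mean()

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
x_jpeg = (x * 0.9).clamp(0, 1)     # stand-in for a JPEG-compressed version of x
loss = stability_loss(net, x, x_jpeg)   # added to the primary task loss during training
```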
10.2 Artifact Correction for Computer Vision Tasks

The algorithm we propose in this chapter targets an artifact correction network to a particular task. In all cases we use QGAC from Chapter 9 (Quantization Guided JPEG Artifact Correction) as the artifact correction network. Starting from pre-trained weights, we fine-tune the artifact correction network using the logit error from the task network between clean inputs and compressed inputs. Formally, given a task t(·) and artifact correction network q(·), we minimize

Lθ(B) = ||t(B) − t(q(JPEGq(B); θ))||1    (10.2)

where JPEGq denotes JPEG compression at quality q. Note that the parameters θ that we optimize belong to q; the task network is unchanged during this process. See Figure 10.1 for a visual depiction of this process.

Figure 10.1: Task-Targeted Artifact Correction. The logit difference from the task network between clean and artifact-corrected versions of the same image is used to train the artifact correction network.

While the intuition behind this process is simple, there are several details that need to be accounted for. First, consider that we are not training the artifact correction network based on any decision by the task network, e.g., a classification or detection. Instead, we are matching the actual logit values. These are vectors of real numbers and are much finer grained than the actual decision, which may be binary. In effect, we are rewarding the artifact correction network for inducing the same perception of an input image in the task network. Note that since no hard decision is required for training, the method is entirely self-supervised: only the logit values, which are independent of any ground truth, are considered during the training process.

This differs from stability training in several key ways. First, it does not modify the task network, so performance on clean images is unchanged and the task network is free to expend its entire capacity learning the relationship between clean data and the output. Next, since the correction task is given to an auxiliary network, this network can be reused for other tasks. As we examine in Section 10.4 (Transferability and Multiple Task Heads), this works surprisingly well, allowing the artifact correction network to be trained using a lightweight task and reused for more complex tasks. To summarize, task-targeted artifact correction takes the advantages of all prior techniques with none of the disadvantages, and adds transferability as a bonus.
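A minimal sketch of one task-targeted fine-tuning step implementing Equation 10.2 is shown below: only the artifact correction network is updated, using the l1 distance between the frozen task network's logits on clean inputs and on corrected JPEG inputs. The callables jpeg_compress, task_net, and correction_net are placeholders for a real codec round-trip, the frozen downstream model, and QGAC.

```python
import torch
import torch.nn.functional as F

def ttac_step(correction_net, task_net, batch, quality, optimizer, jpeg_compress):
    task_net.eval()
    for p in task_net.parameters():
        p.requires_grad_(False)                    # task weights stay fixed

    degraded = jpeg_compress(batch, quality)       # JPEG_q(B)
    corrected = correction_net(degraded)           # q(JPEG_q(B); theta)

    with torch.no_grad():
        target_logits = task_net(batch)            # t(B) on clean inputs
    loss = F.l1_loss(task_net(corrected), target_logits)   # Equation 10.2

    optimizer.zero_grad()
    loss.backward()                                # gradients flow into theta only
    optimizer.step()
    return loss.item()
```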
Figure 10.2: Performance Loss Due to JPEG Compression, separated by task. Left: Classification, Middle: Detection, Right: Segmentation. The plots show all models from a single task with no mitigation applied. For segmentation tasks, the format of the model name is Encoder Model + Decoder Model, and "ds" indicates that the model was trained with deep supervision. Note that methods which use a Pyramid Pooling Module (PPM) decoder always use deep supervision.

Figure 10.3: Performance Loss with Mitigations. Circle: No Mitigation, Cross: Off-the-Shelf Artifact Correction, Diamond: Task-Targeted Artifact Correction, Square: Supervised Fine-Tuning. The models in this figure correspond to those shown in Figure 10.2.

10.3 Effect of JPEG Compression on Computer Vision Tasks

A legitimate question at this point is "how much does JPEG actually affect computer vision tasks?" We can answer this with a study, the conclusions of which are summarized in this section; the full results are relegated to Appendix A. For this study, we compressed images using qualities in [10, 90] in steps of 10 (we show only [10, 50] in this section as these are the most interesting results) using the test sets of the respective models we are evaluating. The input images are compressed, then restored, then transformed according to the requirements of the target model (e.g., cropping to 224 × 224).

Figure 10.4: Transfer Results. Left: ResNet-101 (classification), Middle: Faster R-CNN (detection), Right: HRNetV2 encoder with C1 decoder (semantic segmentation). In all plots, we add an evaluation using artifact correction weights that were trained on ResNet-18 and MobileNetV2, our lightest weight models. Note that the "Fine-Tuned" and "Task-Targeted Artifact Correction" methods are both trained using their respective task network directly, e.g., in the left plot they use a ResNet-101. Dashed lines indicate results shown in Section 10.3.

Figure 10.5: Multiple Task Heads. Left: ResNet-50 (classification), Middle: Faster R-CNN (detection), Right: HRNetV2 encoder with C1 decoder (semantic segmentation). In all plots, we add an evaluation using artifact correction weights that were trained using multiple task networks. For the two-task setup, we used ResNet-50 and Faster R-CNN. For the three-task setup, we used ResNet-50, Faster R-CNN, and HRNetV2 + C1. Note that HRNetV2 + C1 has no two-task multihead model. Dashed lines indicate results shown in Section 10.3.
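The degradation protocol used in the study can be sketched as follows. This is an illustrative reconstruction rather than the exact evaluation harness; the ImageNet-style transform chain shown here is an assumption for concreteness.

```python
import io
from PIL import Image
from torchvision import transforms

def jpeg_round_trip(img: Image.Image, quality: int) -> Image.Image:
    """Compress a PIL image to JPEG in memory at the given quality and decode it back."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Illustrative per-quality evaluation loop (dataset, correction_net, and model are assumed):
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# for quality in range(10, 100, 10):
#     for img, label in dataset:
#         degraded = jpeg_round_trip(img, quality)          # compress
#         x = eval_transform(degraded).unsqueeze(0)          # transform for the target model
#         x = correction_net(x)                              # optional mitigation step
#         prediction = model(x)
```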
We evaluate supervised fine-tuning, off-the-shelf artifact correction, and task-targeted artifact correction; we do not evaluate stability training. For methods requiring fine-tuning, we train for 200 epochs, varying the learning rate from 10^{-3} to 10^{-6} using cosine annealing [106]. We compare all mitigation methods to a baseline of "doing nothing", i.e., accepting JPEG inputs with no modification. We evaluate the following tasks, datasets, and models:

- Classification using ImageNet [59] with MobileNetV2 [113], ResNet 18, 50, and 101 [9], ResNeXt 50 and 101 [114], VGG-19 [34], InceptionV3 [115], and EfficientNet B3 [10]
- Detection using MS-COCO [116] with Fast R-CNN [117], Faster R-CNN [118], and RetinaNet [119]
- Instance Segmentation, again using MS-COCO, with Mask R-CNN [120]
- Semantic Segmentation using ADE-20k [121], [122] with encoder models MobileNetV2 [113], ResNet 18, 50, and 101 [9], and HRNet [123], and decoders C1 [124], PSPNet [121], and UPerNet [125]

In Figure 10.2 we see the results of these models for varying JPEG quality. All of the models face a steep penalty at the lowest quality settings, which gradually abates as quality increases. This finding is intuitive and confirms the need for JPEG mitigation techniques. We follow this with summary plots of the mitigation study in Figure 10.3. We can see some interesting behavior here, mainly that the different mitigations do not behave the same on different tasks. In particular, tasks that are very localization-heavy, like segmentation, do not benefit as much from supervised fine-tuning. These tasks are, however, greatly aided by task-targeted artifact correction.

10.4 Transferability and Multiple Task Heads

One of the most intriguing properties of task-targeted artifact correction is the potential for transferability. Since the original task network is not changed in any way, and only the artifact correction network is fine-tuned, there is no reason that we cannot use the outputs of the artifact correction network for other tasks entirely. This opens up a range of new potential deployment scenarios. For example, a TTAC model could be targeted to MobileNetV2 [113], which is fast and lightweight to train, and then used for a much heavier semantic segmentation network which would have been impossible to train without significant compute power.

Of course, this only works if the TTAC models can generalize. We examine this in Figure 10.4. For each plot, we take the supervised fine-tuning and task-targeted artifact correction results from Figure 10.3; these are shown as dashed lines. We compare these with task-targeted artifact correction networks trained with MobileNetV2 (green) and ResNet-18. As the plots show, the transfer works quite well, even to different tasks. In the right hand plot, for example, the new results are almost indistinguishable from the task-targeted network which was fine-tuned for segmentation, and they perform better than fine-tuning the segmentation network itself.

Figure 10.6: Mask R-CNN TIDE Plots. From left to right, the model was evaluated at quality 10, 50, and 100 with no mitigations. Note that the bulk of the errors are missed detections at low quality. As quality increases, more objects are detected but they are not localized correctly.

Figure 10.7: Mask R-CNN Qualitative Result. Input was compressed at quality 10. This compares the JPEG result to TTAC and the ground truth.
It is also worth noting that there is no reason a TTAC model needs to be trained with only one downstream task target; we can use as many downstream tasks as we have compute power for. We examine this in Figure 10.5. In these plots, we have added results from a TTAC model with two heads (classification and detection) and a TTAC model with three heads (classification, detection, and segmentation). Not only does this work perfectly well, but in many cases the additional model heads improved the generalizability of the TTAC model, leading to improved results.

10.5 Understanding Model Errors

So far we have looked at model error in a very aggregate view; that is, we have looked at overall accuracy and how it changes with increasing compression. In this section, using Mask R-CNN [120] as a representative example, we examine the errors made by the model in more detail.

We start by using TIDE [126] to compute a breakdown of the exact errors that the network makes with increasing compression. These plots are shown in Figure 10.6 with no mitigations applied. The trend here is interesting. At low quality, the bulk of the errors are caused by missed detections. However, as the quality increases and the missed detections decrease, the localization error actually increases because the newly detected objects are not properly localized. This is sensible because it suggests that once enough information is present in the image for objects to be identifiable, the "spread" induced by the missing high frequency basis functions causes the exact boundaries of the objects to be obscured.

We can view this qualitatively as well. Figure 10.7 shows the result for a JPEG compressed at quality 10 both with and without TTAC (as well as the ground truth). In the uncorrected model, we observe a significant number of missed detections as well as minor localization errors on the orange, although overall the orange is localized quite well given the significant blocking artifacts present on its boundaries. The TTAC output is also informative. Not only does the image appear significantly higher quality (keep in mind it was the same as the left image before artifact correction), but there are also far fewer missed detections. What remains are some localization errors, particularly on the bowl.

10.6 Limitations and Future Directions

There are two major limiting factors to TTAC in its current incarnation: speed and fidelity. Since TTAC requires placing an artifact correction network, specifically a QGAC network, before any task processing happens, it can severely limit performance. As long as this happens at the datacenter level the impact is likely minimal, but it is still a legitimate concern as GPU resources are highly valuable and are currently required for artifact correction networks. This could be addressed by more efficient formulations for artifact correction or by bespoke TTAC architectures that are intended to be lightweight.

Figure 10.8: Model Throughput. Throughput comparing TTAC FPS for training and inference against supervised fine-tuning. TTAC incurs a non-negligible throughput impact.

Next, although TTAC has some marked advantages over data augmentation techniques, it currently struggles to outperform them in all cases.
We expect that this can be addressed with deeper supervision on the task networks (i.e., matching more than just the final logits), although this is currently an open problem. Deep changes to how suitable training data is generated and scheduled may also be required before a clear numerical advantage is seen. Finally, the scope of TTAC is still somewhat restricted. Although JPEG artifacts are arguably the most important and prevalent type of degradation applied to images, we believe that TTAC could be an invaluable tool for general degradations. This could mean corruptions like noise, masking, rotations, etc., or something more mundane like resampling.

Part III: Video Compression

Chapter 11: Modeling Time Redundancy: MPEG

Having discussed image compression at length, we now move on to video compression. When considering uncompressed images, we modeled them as samples of a continuous 2D signal. We now allow those samples to vary over time to create a sort of "flip-book". Light intensity is captured in discrete steps in space to create "frames", and then multiple frames are captured in discrete steps in time to create the video. By sampling frames at a sufficiently high frame rate, there is an illusion of smooth motion.

Naturally this significantly increases the size of the representation. Since each frame in the video is the size of a single image, videos grow quickly with increasing framerate and duration. Since we classified images as large enough to warrant compression, videos certainly need to be compressed as well; in fact, timely transmission of even short videos would be impossible without compression.

In this chapter we cover, at a high level, the first principles of video compression. There are many different video "codecs", or algorithms for compressing videos. Although most of the concepts we discuss here apply to all modern codecs in some form, when we need specific details we will defer to MPEG and, specifically, the AVC standard [127]. Readers may be familiar with AVC by other names like H.264 or MPEG-4 Part 10. We standardize on the AVC terminology to easily differentiate it from HEVC/H.265 [128] and to align better with the naming of AOM codecs (like VP9, AV1, etc.). We focus on AVC because it is widely used [129] and many of its key ideas form the foundation for continuing codec development.

As we will see, the important insight which makes video compression possible is that we can exploit time redundancy in the signal and remove information across time. This is in addition to the spatial manipulations we used in JPEG, and the effect is synergistic: by exploiting temporal redundancy, we can remove additional spatial information which we would have needed to store if we only had a single image. The dependence on the temporal dimension creates the need for three frame types:

- Intra Frames or I-frames are frames that can be decoded without information from any other frame, i.e., there is no temporal dependency.
- Predicted Frames or P-frames are frames that require at least one previous frame to decode. We are said to predict the current frame based on the previous frame and any hints stored with the current frame.
- Bipredicted Frames or B-frames are frames that require at least one previous and one future frame to decode. These frames are beyond the scope of this dissertation.

These frames together are referred to as a Group of Pictures, i.e., an I-frame and its associated P- or B-frames form a group of pictures.

Figure 11.1: Motion JPEG Comparison. Left: Motion JPEG frame, Center: AVC frame, Right: Original frame. The Motion JPEG result is larger and of poorer quality than the AVC frame, motivating compression in the time dimension as well as the spatial dimension.

11.1 Motion JPEG

Before we begin the discussion of "true" video codecs, it is worth discussing an obvious solution: Motion JPEG. Motion JPEG can be thought of as a successor to MPEG (although historically MPEG-1 was technically standardized first, core MPEG-1 technology was based on work from the JPEG committee). The idea is incredibly simple: each frame is compressed separately as a JPEG and stored in a file along with some kind of frame rate specification. This information is all that is needed to decode and play the video. Note that although Motion JPEG enjoys widespread use because of its simplicity, there is actually no standard which defines it, and different software libraries will have different methods of specifying metadata.

As a quick example of this in action, we can take a raw 240-frame, 24 fps, 1080p video and try compressing it with Motion JPEG and with ffmpeg defaults for AVC. The original video in this case is 746,496,000 bytes (1920 × 1080 = 2,073,600 bytes for the luminance plane; 4:2:0 subsampling gives 2,073,600/4 + 2,073,600/4 = 1,036,800 bytes for the chrominance planes, so 3,110,400 bytes per frame times 240 frames = 746,496,000), or about 747MB. Pretty large for 10 seconds of 1080p video. The Motion JPEG file generated from this video is 12.7MB, an impressive 62x compression ratio. This can be thought of as a naive "limit" on how much compression is attainable without considering the temporal dimension. The AVC file, on the other hand, is only 7.2MB, a 103x compression ratio. Nor is this the end of the story: the AVC file is almost indistinguishable from the original frame, yet the Motion JPEG frames have significant blocking artifacts from compression (Figure 11.1). This example motivates our desire to study compression in the temporal dimension: the AVC frames are both higher quality and smaller (in file size) than the Motion JPEG frames.
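The arithmetic above can be checked with a few lines of Python. The file sizes are the measured values quoted in the text, not something this snippet computes.

```python
# Raw 4:2:0, 8-bit size of the 240-frame 1080p example above.
width, height, frames = 1920, 1080, 240
luma = width * height                        # 2,073,600 bytes per frame
chroma = 2 * (luma // 4)                     # two quarter-resolution chroma planes
raw_bytes = (luma + chroma) * frames         # 3,110,400 * 240 = 746,496,000 bytes
mjpeg_bytes, avc_bytes = 12.7e6, 7.2e6       # measured file sizes quoted in the text
print(raw_bytes, raw_bytes / mjpeg_bytes, raw_bytes / avc_bytes)
# Roughly the ~62x and ~103x ratios cited above (small differences come from MB rounding).
```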
11.2 Motion Vectors and Error Residuals

We will "measure" temporal information by modeling motion between neighboring frames. After modeling the motion, we can warp and subtract the frames, giving a "residual". The motion modeling is designed to be compact and simple while still capturing some complex motions. Since inter-frame motion is small at a high enough framerate (barring large cuts or scene changes), the frames should share a significant amount of information after accounting for motion. Any additional information is stored in the residual, which is generally low entropy.

Figure 11.2: Motion Vector Grid. Left: reference frame, Right: current frame. The grid is defined based on the frame that is currently being decoded. Each motion vector indicates where in the previous frame a block of pixels moved from. The numbers on the grid cells indicate the motion strength.

We call frames which are constructed from motion and residual information Predicted Frames (P-frames) since they must be predicted from a previous frame. We will call the frame which is being decoded the current frame and the previously decoded frame the reference frame. The MPEG standards also define a Bipredicted Frame (B-frame) which is predicted from a previous and a future frame. This takes advantage of the presentation timestamp and decoding timestamp features of video containers to store frames out-of-order and with a complex dependency graph. We consider further discussion of B-frame and multi-reference decoding to be beyond the scope of this dissertation.

Motion information is stored in the form of motion vectors. These vectors are computed by breaking the current frame into a regular grid and, for each grid cell, measuring where in the reference frame that grid cell moved from. For some cells there may be no motion and for others there may be large motions. For AVC, our example codec, the cells themselves can be 8 × 8, 16 × 8, 8 × 16, or 16 × 16 pixels. The blocks are stored as nine 16-bit integers (reference frame, width, height, reference x, reference y, current x, current y, motion strength, and flags), so this operation alone turns, at worst, 192-byte pixel blocks into 144-bit motions. In Figure 11.2, we show an example of the grid structure from a real video. Note that the grid is defined on the current frame, which is broken into a clean regular grid; on the reference frame these blocks may overlap. Some grid cells are missing from the current frame: for these blocks no motion was detected, so they are skipped. We can also visualize the vectors themselves as in Figure 11.3. For each block in the image, we draw a vector starting from the center of that block and terminating at the position in the reference frame that the block came from.

Figure 11.3: Motion Vector Arrows. Each vector is positioned in the center of the block; the arrow points to the position that block came from in the reference frame. Image credit: Big Buck Bunny [130].

The motion vectors are used in a process called motion compensation. This process simply copies blocks of pixels from their position in the reference frame to their position in the current frame, pasting over any content that was there previously. The resulting motion compensated frame represents a coarse warping of the reference frame to match the current frame. Of course there are still errors since the motion is only computed on blocks, so it is not usually a perfect representation of the current frame. See the left side of Figure 11.4 for an example of this. The vectors are computed using a process called motion estimation. We do not cover this process here as it is not standardized (the MPEG standards only define the decoder; it is up to the encoder to produce a standards-compliant bitstream however it wants).
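A sketch of the block-copy operation is shown below. The (destination, source, size) tuple format is a simplification of the nine-field record described above, and a real decoder additionally handles sub-pixel motion, block skipping, and frame boundaries.

```python
import numpy as np

def motion_compensate(reference: np.ndarray, motion_vectors) -> np.ndarray:
    """Coarse block-based motion compensation: paste each block of the reference
    frame into its new position in the current frame. Blocks with no motion vector
    simply keep the co-located reference content."""
    predicted = reference.copy()
    for dst_y, dst_x, src_y, src_x, h, w in motion_vectors:   # simplified vector format
        predicted[dst_y:dst_y + h, dst_x:dst_x + w] = \
            reference[src_y:src_y + h, src_x:src_x + w]
    return predicted

# The error residual is then whatever the motion model missed:
# residual = current.astype(np.int16) - motion_compensate(reference, mvs).astype(np.int16)
```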
In order to correct errors in the motion compensated frame, the encoder stores an error residual. One thing that is immediately noticeable in Figure 11.2 and Figure 11.3 is that not all blocks move; indeed, in both there are many blocks which are stationary. This means that the pixels in those blocks are exactly the same in the reference and current frames, and therefore if we subtract the two frames those blocks will be filled with zeros. Motion compensation takes this a step further, trying to match moving objects as closely as possible as well as stationary regions; this increases the likelihood of generating zero blocks. These zero blocks are extremely low entropy and aid in the compression process. To compute the residual, we first compute the motion compensated current frame from the reference frame and then subtract the true frame, yielding everything that was not accurately modeled by the motion estimation process. An example is shown on the right side of Figure 11.4. In effect, we have told the decoder to reuse information it already had about those blocks without needing to store them.

Figure 11.4: Motion Compensation and Error Residuals. Left: motion compensated frame, Right: error residual. Note the block artifacts in the motion compensated frame; corrections to these are stored in the error residual along with small edges. Note that the error residual is mostly zeros and therefore much easier to compress.

To produce the residual, we subtracted the motion compensated frame (Figure 11.4 left) from the true current frame (Figure 11.2 right). Small errors then accumulate on edges and in rapidly moving objects, which make up the bulk of the size of the compressed residual. Note also that the above discussion is appearance based. An object moving in the physical world may very well generate blocks in the frames that do not appear to move, like the center regions of the parachute in Figure 11.2. This information can still be freely used to fill in the current frame even though it is not an accurate reflection of the real-life parachute.

To summarize: the encoder stores per-block motion. This motion is then combined with a low entropy residual and a previous frame to produce the current frame. This yields direct savings, in that storing motion information is more compact than storing pixels, and indirect savings, because the error residual is much more compressible than the original pixels.

Figure 11.5: Rate Control Comparison (CBR at 5.4 Mbps, CQP 25, CRF 23). The three rate control methods are tested targeting the same file size. Note the different artifacts produced by each method despite similar file sizes. Also note that for this video, CQP 26 undershoots the target to produce a 6.6MB file while CQP 25 overshoots at 7.5MB. CBR and CRF are very close in file size.

11.3 Slices and Quantization

In addition to motion modeling, the AVC standard makes some notable changes to the way frames are structured. Similar to JPEG, AVC also allows the use of quantization for rate control, although this is exposed to the user in several different ways. Unlike the last section, these ideas are applicable to both P-frames and I-frames.

The biggest departure from JPEG is the concept of slices. A slice is a region of the frame made up of a whole number of macroblocks (usually 16 × 16 pixel blocks). Prediction (motion compensation) is only possible within a single slice. In the simplest case, the entire frame is one slice, but the general idea allows the encoder to break the image up into more meaningful groups. For example, the encoder might use two slices: a high detail slice with minimal quantization and a low detail slice with higher quantization. It may have high and low motion slices.

Figure 11.6: Slicing Example. The image has been broken up into three slices by the encoder: a background region, a high motion region, and a low motion region.
Even more intriguing is the concept of I- and P-slices. This means that a single frame can contain both intra information and predicted information, rather than the encoder making a blanket decision for the entire frame, which may be sub-optimal. For example, a highly detailed region with low motion may be better stored as an I-slice, while a high motion region may be more efficiently stored as a P-slice. An example of this is shown in Figure 11.6.

In JPEG, rate control was implemented by choosing a scalar quality in [0, 100] and mapping that scalar to a quantization matrix. For videos we have several options; these are compared visually in Figure 11.5. The most similar is Constant Quantization Parameter (CQP). This is essentially the same idea but on a scale of [0, 51], with 51 being the worst quality; this number is used to derive a quantization matrix for the coefficient blocks. This type of rate control is not generally used because applying the same amount of quantization to every frame is extremely simplistic and generally produces sub-optimal results in both size and perceptual quality.

Instead, the more common Constant Rate Factor (CRF) is used. This is also a number in [0, 51], with 0 being truly lossless encoding (no quantization) and 51 being the worst quality. This method is tuned to hold perceptual quality constant and takes into account inter-frame motion and frame rate. Still objects are quantized less aggressively and moving objects more so, following a similar argument to removing high spatial frequencies: fast moving objects are harder to perceive in detail.

The final rate control method, which is also quite common, is Constant Bitrate (CBR). In CBR encoding, the only thing the encoder tries to optimize is the bitrate of the video: it should be as close as possible to a specified target without going over it. This is useful for maxing out a connection with a known bandwidth, where it is desirable to get the maximum quality the connection can support without dropping frames. However, there is no way for the encoder to know a priori how to perform CBR encoding, so it will regularly over- or undershoot the target unless two-pass encoding is used.
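In practice, the three rate-control modes can be selected through ffmpeg and libx264. The snippet below shows typical flags as a sketch; the specific values are illustrative, and flag behavior should be checked against the ffmpeg documentation for a given version.

```python
import subprocess

def encode(src: str, dst: str, mode: str) -> None:
    """Encode `src` to `dst` with libx264 under one of the three rate-control modes."""
    flags = {
        "cqp": ["-qp", "25"],                 # constant quantization parameter
        "crf": ["-crf", "23"],                # constant rate factor (ffmpeg's default mode)
        "cbr": ["-b:v", "5.4M", "-maxrate", "5.4M", "-bufsize", "10.8M"],  # constant bitrate
    }[mode]
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", "libx264", *flags, dst], check=True)
```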
11.4 Recap

We have now covered the high level ideas by which video codecs compress temporal data. The encoder computes and stores coarse motion between frames and uses that motion to compute an error residual containing anything in the frames which cannot be modeled with motion from a previous frame. Internal to each frame, the encoder is free to slice the frame and make per-slice decisions. Rate control is accomplished with the help of a target bitrate, a user-defined CRF, or a user-defined quantization parameter.

The spatial domain compression is similar to JPEG. Pixels are transformed into the frequency domain using a 4 × 4 DCT, DST, or Hadamard transform (the encoder is free to choose this) and, depending on the rate control mechanism, a QP is computed or is given to the encoder. These QPs map directly to a standard set of quantization matrices for the blocks. Note that the QP is allowed to vary spatially, so different blocks can receive different amounts of compression, as opposed to JPEG where one quantization matrix is used for every block in the image.

This yields a surprisingly straightforward compression algorithm. Given a set of frames:

1. Partition the frames into GOPs. This is usually a fixed number of frames per GOP, but it can be based on the content.
2. Encode the first frame of each GOP as an intra frame by quantizing the transform coefficients of its pixels.
3. For each subsequent frame in the GOP, compute motion vectors from the previous frame, compute the motion compensated frame, and take the difference between the current frame and the motion compensated frame. Encode the predicted frame by storing the motion vectors and the quantized coefficients of the error residual.
4. Entropy code the frames.

The decoder simply needs to loop through each frame and either decode it directly if it is an I-frame, or warp the currently displayed frame using the motion vectors and add the error residual if it is a P-frame.

Unsurprisingly, an actual video codec is much more complex than this and is full of small details which make a large difference to the overall coding efficiency. As of this writing, the current revision of the AVC standard [131] was released in August 2021 and is 844 pages long. Aside from the core decompression algorithm, it includes instructions for storing the resulting data in a stream (not the same as an MP4 file, for example, though an MP4 file will contain AVC video streams), definitions of constants and other hard-coded mappings, algorithms for scalable streams, algorithms for compressing multi-view/depth/3D videos, and more. Although covering such details is beyond the scope of this dissertation, the high level intuition we developed in this chapter will be enough to guide us in the following chapters as we explore ways to improve video coding efficiency using deep learning.

Chapter 12: Improving Video Compression

Following the same model as Chapter 8 (Improving JPEG Compression), we now survey techniques for improving video compression using deep learning. Compared with JPEG compression, this chapter will seem quite short: video methods are still somewhat of a novelty in the deep learning literature as of the time of writing, and the subfield of video compression reduction is particularly nascent. One major difference between JPEG and video compression is the inclusion of the in-loop deblocking filter. Although some JPEG software does use rudimentary heuristic deblocking, all modern video codecs include reasonably effective deblocking filters. These filters are not perfect, and often there are still visible artifacts, but they do complicate the task for the network quite a bit.

The field can be divided into two disjoint sets. The first set of methods, single frame methods, consider only a single video frame at a time. They may act differently on different types of frames (e.g., I- or P-frames), but they never consider information from any previous or future frame. Multi-frame methods, in contrast, do consider several frames. While general restoration techniques have largely moved from sliding-window to recurrent methods thanks to the significant efficiency improvements of recurrent models, video compression reduction methods still use sliding windows.

Before reviewing these methods, we first discuss four important works in general video restoration. These methods are so influential that they have had an outsize effect on the entire restoration field and should be considered critical background knowledge. We will then proceed to study only video compression reduction methods.
Warning: Get ready for another, albeit shorter, history lesson.

12.1 Notable Methods for General Video Restoration

More general video restoration methods have a long history, particularly in super-resolution. For our purposes we will only examine works of outsize importance, such that they have highly influenced follow-up works in video compression reduction. These are all methods which combine information from multiple frames, a distinguishing feature of video restoration versus image restoration.

The first work we discuss is FRVSR [132]. FRVSR is notable for its highly efficient recurrent formulation. This came at a time when video restoration was either single frame or used sliding windows. Sliding windows should offer better performance but are significantly more resource intensive to compute. The lightweight recurrent formulation was a significant step towards the practical application of these techniques.

The next year, ToFlow [133] was published. A common idea in video restoration methods is that consecutive frames, or their features, must be aligned in order to make proper use of the extra information, and this is commonly done with optical flow. The key insight of ToFlow is that the optical flow can be learned using a network which is trained as part of the restoration process, customizing the optical flow to the task. Another major contribution of this paper is the Vimeo90k dataset, which is widely used in the video restoration literature.

In the same year came the next major innovation in video enhancement: EDVR [134]. The main contribution of EDVR was the replacement of explicit motion compensation via optical flow with implicit motion compensation using deformable convolutions. The deformable convolutions allow the feature extraction network to automatically capture information which is spatially offset across frames to account for motion. This should be a faster and more flexible method than optical flow; however, the 20M parameter model and seven frame sliding window were highly impractical.

Finally, we discuss the COMISR [135] method. COMISR is a recurrent super-resolution method with a specific focus on compressed video. Li et al. rightly observe that real videos are compressed and yet prior restoration work does not take this into account. Their method is trained and tested on compressed videos and includes a novel Laplacian loss which is designed to restore high frequency details.

12.2 Single Frame Methods

As early as 2017, single frame methods were presented for video compression reduction. These methods are quite simple and only account for information in the frame that is currently being restored. Unlike the general restoration methods presented in the previous section, there is no dependence on additional information from prior or future frames.

Table 12.1: Summary of Video Compression Reduction Techniques. Methods are listed in publication order with their use of multiple frames and their method for motion compensation.

| Year | Name     | Citation | Multi-frame | Motion Compensation | Note                             |
|------|----------|----------|-------------|---------------------|----------------------------------|
| 2017 | DCAD     | [136]    | No          | -                   |                                  |
| 2018 | QE-CNN   | [137]    | No          | -                   | Separate I- and P-frame networks |
| 2018 | MFQE     | [138]    | Yes         | Explicit            | PQFs (SVM)                       |
| 2020 | STDF     | [139]    | Yes         | Implicit            |                                  |
| 2021 | MFQE 2.0 | [140]    | Yes         | Explicit            | PQFs (BiLSTM)                    |
| 2021 | PTSQE    | [141]    | Yes         | Implicit            | 3D Convolution                   |
| 2021 | RFDA     | [142]    | Yes         | Implicit            | Cross-window recurrent           |

DCAD [136] proposed a simple method for restoring single frames of HEVC compressed video. The method bears a striking resemblance to ARCNN [73], using a stack of convolutional layers.
The main point of comparison is the built-in deblocking filter, over which they show an improvement. The QE-CNN [137] method, presented the next year, was designed to take compression into account explicitly. This is done using two networks: QE-CNN-I for I-frames and QE-CNN-P for P-frames. Interestingly, for P-frames the method applies both the -I network and the -P network, since HEVC encoding may contain intra- and inter-slices in one P-frame. Note that these networks still only consider a single frame at a time even though separate networks are used for the I- and P-frames; there is no shared hidden state or window.

12.3 Multi-Frame Methods

In 2018, MFQE [138], the first multi-frame video compression reduction network, was developed. In addition to a seven frame sliding window with optical flow alignment, this method introduced the concept of peak quality frames (PQFs). PQFs are frames which naturally have a higher quality than their surrounding frames and therefore have more information to extract. MFQE leverages these frames by extracting features from them separately and using those features to guide the restoration of the nearby non-PQFs. The PQFs are identified by manually labeling frames by PSNR and then training an SVM. At test time, the SVM identifies PQFs, then the PQF features are extracted, and finally the entire sequence is restored in a sliding window. The follow-up work, MFQE 2.0 [140], replaces the SVM with a Bi-LSTM [143], which is more accurate. In addition to these ideas, the MFQE paper also introduced the dataset which is used by all follow-up works.

STDF [139] is the next major advancement. This method takes the key idea from EDVR, deformable convolutions, and applies it to compression artifact reduction. The network consists of offset prediction, deformable feature extraction, and quality enhancement modules.

The following year, PTSQE [141], a patch-based method, was introduced. The key idea is to use separate networks for capturing the spatial and temporal information of a single patch and to use attention methods to fuse the two. PTSQE also incorporates the residual dense blocks from ESRGAN [81]. The implicit motion compensation is done using 3D convolutions instead of the traditional deformable convolutions.

Finally, RFDA [142], another recent model, uses a recursive fusion method to artificially increase the temporal window size. The method is built on STDF and uses STDF wholesale as a subnetwork. From the STDF output, the method produces a hidden state which is used in a downstream network to hold additional information from prior sliding windows. In this way, the STDF model essentially accesses additional temporal information. This is almost a recurrent method in its function, although the STDF component is still sliding window.

12.4 Summary and Open Problems

The methods discussed in this section are summarized in Table 12.1. At a high level, the summary is that multi-frame methods are preferred to single frame methods (because of their increased performance), and implicit motion compensation is preferred to explicit because it is faster to compute.

Given these high level ideas, there are some outstanding problems we can identify. First, it is interesting to note that implicit motion compensation performs as well as, if not better than, explicit (optical flow based) motion compensation. This implies that fine grained alignment may not be completely necessary for enhancement, or at least for video compression reduction.
This finding was at least partially confirmed by ToFlow, which showed that a task-guided flow is better than a "perfect" flow. On a related note, although QE-CNN was aware of the underlying frame-type metadata, there is otherwise a lack of use of bitstream metadata. This is surprising since the bitstream information often contains useful cues for how information was removed from the original frames. Additionally, coarse motion information is present in the metadata which, as we already discussed, may be good enough for feature alignment. In the next chapter, we develop a video compression reduction method which addresses these issues.

Chapter 13: Metabit: Leveraging Bitstream Metadata

Until this point we have reviewed the basic principles of video compression, including how to achieve compression over time by removing redundant motion information. We have also reviewed several methods for using deep learning to restore quality to compressed video frames. In order for video compression to function, i.e., in order to successfully decompress a compressed bitstream, we need additional information beyond simply the transform coefficients. This information, such as QP values, GOP structure, and motion vectors among others, gives a very strong prior for how the encoder has compressed the video stream and what information has been removed that should be restored. We now turn our attention to developing a deep learning method which exploits this data to improve its reconstruction. This contribution of the dissertation is currently under submission for separate publication and is available as a pre-print [144].

If we closely examine the direction of prior works, there are some whispers of this idea. For example, MFQE [138] contributed the idea of "peak quality frames", which were high quality frames that could be used to restore nearby (in time) low quality frames. STDF [139] does away with expensive motion compensation, relying instead on deformable convolutions.

However, both of these methods leave something to be desired, specifically by relying on outside computation for what is already stored by the encoder. While the concept of peak quality frames seems somewhat abstract (after all, how can we predict the existence of such frames?), their existence is grounded in first principles: these are I-frames. The encoder inserts them intentionally to create frames with high information content, which improves the decoding fidelity. Recall that MFQE 1.0 scans the entire sequence to determine peak quality frames using an SVM and MFQE 2.0 [140] does the same using a Bi-LSTM [143]. These are computationally expensive algorithms which are essentially computing the I- and P-frame structure of the GOP, something which we can readily extract from a bitstream with no additional computation. The MFQE family of networks also relies on optical flow to align nearby frames. While there are many methods for computing optical flow, they all vary in their speed and accuracy, and perfectly accurate optical flow may not be necessary in the first place [133]. The major contribution of STDF was to move away from explicit motion estimation by using deformable convolutions to learn an implicit motion estimation. This is desirable because it reduces the computational burden of the algorithm: the deformable convolutions model both motion and mapping simultaneously. However, we can do better than both explicit and implicit motion estimation; we can do no motion estimation at all.
Of course, we still wish to align nearby frames, and for this we can extract motion vectors from the bitstream. This gives a coarse motion compensation which we show is not only good enough for accurate reconstruction but, taken with our other contributions, outperforms both MFQE and STDF as well as their later follow-up works.

The common theme among the contributions of our method is that we remove things which were computed explicitly by prior algorithms and replace them with things that are computed by the encoder and stored in the video. We view these computations as redundant. By reducing these redundant computations, we are left with extra compute time per frame that we can re-invest in additional model parameters, leading to an improved result. We take the additional step of moving away from the sliding window paradigm, where a block of seven frames produces a single frame output, and instead use a block based approach where all seven frames are produced in a single forward pass. The result of these efficiency improvements is a network which has almost twice the parameters of STDF and yet runs as fast or faster depending on input resolution. It also outperforms STDF by a wide margin for many compression settings.

First Principles

- Architecture captures GOP structure
- Explicit I- and P-frame representations based on expected information content
- Alignment using motion vectors
- High frequency restoration using targeted loss functions

Figure 13.1: Capturing GOP Structure. The GOP representation is computed from wide I-frame feature extractors and narrow P-frame feature extractors, concatenated channelwise.

13.1 Capturing GOP Structure

One of the primary contributions of this work is the way in which our network takes GOP structure into account. Recall from Chapter 11 (Modeling Time Redundancy: MPEG) that (in the MPEG standards) frames can be either I-, P-, or B-frames, where I-frames are "intra frames" which can be reconstructed using only information in the frame itself, P-frames are "predicted frames" which require some previous frame to reconstruct, and B-frames are "bipredicted frames" which require a previous and a future frame to reconstruct. The goal of using these different frame types is to rely more on information which is stored in other frames and would be redundant to store again. These frames are organized into a group-of-pictures (GOP), which is an I-frame and its associated P-/B-frames. Without loss of generality, we only consider P-frames in the following discussion.

Since the predicted frames intentionally do not store information which is stored in previous frames, we can observe that they contain less information and, due to prediction errors, generally have lower perceptual quality than their associated I-frame. When other models process video frames in a sliding window, they do not take this into account in any meaningful way, and so the same network which processes I-frames is used to process P-frames.

We can view this as wasting compute resources. Since the bulk of the information is stored in the I-frame, we can process it with a wide representation. We can then use a narrower, and therefore faster to compute, representation for the P-frames to extract the additional information each P-frame contains. This is shown in Figure 13.1. Note that it is important to match the depth of the extraction networks so that the receptive fields are aligned. We view the resulting GOP representation as capturing the available information in the entire sequence and use it to reconstruct each frame in the GOP after warping. This is a major gain in efficiency since the faster network is used for most frames in each sequence. Further, we expend significant resources reconstructing the I-frame, which was already higher quality as it contains the most information. We then use this restored I-frame as a base to compute the restored P-frames, again using a lighter network. In other words, the GOP structure is encoded into our reconstruction algorithm at all stages.
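The split representation can be sketched in PyTorch as follows. The single-convolution extractors stand in for the deeper LR-block stacks of the real model, and the 64/16 channel widths follow the description given later in Section 13.3 (Network Architecture); the P-frame inputs are assumed to already be warped to the I-frame.

```python
import torch
import torch.nn as nn

class GOPFeatures(nn.Module):
    """Sketch of the GOP representation: a wide extractor for the I-frame and a
    shared narrow extractor for the P-frames, concatenated channel-wise
    (64 + 6 * 16 = 160 channels for a 7-frame GOP)."""
    def __init__(self, i_channels=64, p_channels=16):
        super().__init__()
        self.i_extractor = nn.Sequential(nn.Conv2d(3, i_channels, 3, padding=1), nn.LeakyReLU(0.2))
        self.p_extractor = nn.Sequential(nn.Conv2d(3, p_channels, 3, padding=1), nn.LeakyReLU(0.2))

    def forward(self, i_frame, p_frames):
        feats = [self.i_extractor(i_frame)]
        feats += [self.p_extractor(p) for p in p_frames]   # P-frames assumed pre-warped
        return torch.cat(feats, dim=1)                     # the GOP representation
```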
Figure 13.2: Motion Vector Alignment. P-frames are warped backwards to the I-frame during feature extraction. The I-frame is warped forwards to align to the P-frames during frame generation.

13.2 Motion Vector Alignment

In multiframe restoration problems, it is extremely common to align nearby frames, or features extracted from nearby frames, to ensure that the various scene details overlap (see ToFlow [133] and EDVR [134] among others). Conceptually, this should make the restoration task easier for the network since the additional information of nearby frames is in the correct location, ready to be exploited for additional reconstruction accuracy. The removal of video compression defects is no different, and as discussed in the opening of the chapter, this is generally accomplished explicitly with optical flow, as in MFQE and related networks [138], [140], with STDF using deformable convolutions for an implicit alignment [139].

Figure 13.3: Motion Vectors vs Optical Flow. The motion vectors resemble a coarse or downsampled version of the optical flow. Optical flow was computed with RAFT [145].

While it is useful to compute high quality alignments between frames, it may not be necessary (this is discussed at length in ToFlow). Assuming that the constraint of high quality alignment can be relaxed, we have a convenient tool at our disposal: motion vectors, which are compared with optical flow in Figure 13.3. The motion vectors relate nearby frames at the block level; for most resolutions the blocks are fine enough and the motion accurate enough that warping frames using motion vectors instead of optical flow works well (where "well" is measured in terms of reconstruction accuracy). The major advantage of using motion vectors is that they require no computation to produce, since they are stored in the video bitstream. Compared with optical flow, they require no more computation to apply.

We use the motion vectors to align each P-frame to the I-frame during feature extraction, and then to align the restored I-frame to each P-frame during frame generation. This is illustrated in Figure 13.2. Since motion vectors measure motion from the previous frame, we must reverse and warp each frame in sequence, e.g., frame 3 is warped by frame 2's motion vectors and the result is warped by frame 1's motion vectors, and so on. During frame generation we carry out the inverse process by warping the restored I-frame by each of the P-frames' motion vectors in sequence.
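Warping by motion vectors can use the same machinery as optical-flow warping. The minimal sketch below assumes the block motion vectors have already been broadcast to a dense per-pixel offset field mv_flow (in pixels); the reversal and chaining described above would be applied around this primitive.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, mv_flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) by a dense offset field (B, 2, H, W),
    where each offset points from a current-frame pixel to its source location in
    the reference frame (the convention of the bitstream motion vectors)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + mv_flow      # absolute sample positions
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0                            # normalize for grid_sample
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)
```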
Figure 13.4: MetaBit System Overview. I-frames are shown in blue and P-frames in pink. Our network takes an input (orange) in the form of a low-quality Group-of-Pictures and first performs multi-frame correction on the I-frame. The resulting high-quality I-frame is used to guide correction of the low-quality P-frames. The final output of our network (yellow) is the entire high-quality Group-of-Pictures.

Figure 13.5: LR Block. The Lightweight Restoration block (two 3 × 3 × N convolutions with LeakyReLU, followed by channel attention, with a residual connection from input to output) modifies the residual block to follow recent best practices in deep learning and image-to-image translation.

13.3 Network Architecture

We are now in a position to develop a complete network architecture using the ideas in the previous two sections. Although we develop a concrete architecture in this section, the high level idea of leveraging specific bitstream metadata can be applied to many different architectures.

First we need a basic block to build the rest of our network with. We would like to base this on residual blocks (Section 5.5 (Residual Networks)), which are known to be effective at many tasks; however, residual blocks by themselves do not follow best practices for image-to-image problems. Conversely, the RRDB layer [81] works well but is computationally inefficient. For videos, we require something which is lightweight and effective. We therefore make the following modifications (Figure 13.5) to the residual block, which we call a lightweight restoration (LR) block. First, we remove batch normalization [37], which is known to perform poorly in image-to-image translation scenarios [81]. Next, we replace the ReLU layers with LeakyReLU. Finally, we add channel attention [146], [147], following recent best practices in deep learning methodologies.

To these residual blocks, we add our accounting for GOP structure and our motion vector alignment blocks. An overview of the Metabit system is shown in Figure 13.4. The network takes a 7-frame GOP with no B-frames as input and is divided into several stages. In the first stage, the I- and P-frame representations are computed using separate feature extractors. As discussed in Section 13.1 (Capturing GOP Structure), the I-frame representation is 64 dimensional while each P-frame representation is 16 dimensional. Each of the P-frames is warped using motion vectors as in Section 13.2 (Motion Vector Alignment) to align them to the I-frame. Given a 7-frame GOP, this gives a final representation of 160 dimensions. This representation is then used as input to the I-frame generation network, which produces the high quality I-frame. This high quality I-frame is then warped 6 times to generate 6 copies, each aligned to an individual P-frame. Then, each aligned I-frame is concatenated with its low quality P-frame, and the P-frame generation network generates the 6 high quality P-frames. This gives the final output: the high quality GOP consisting of 7 frames.
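A sketch of the LR block in PyTorch follows. The squeeze-and-excitation style attention shown is one common formulation of channel attention and is an assumption here; the passage above does not pin down the exact variant.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (one common formulation)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.net(x)          # per-channel reweighting

class LRBlock(nn.Module):
    """Lightweight Restoration block (Figure 13.5): a residual block with no batch
    norm, LeakyReLU activations, and channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            ChannelAttention(channels),
        )

    def forward(self, x):
        return x + self.body(x)         # residual connection from input to output
```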
Note that this is quite different from the sliding window or even recurrent approaches used in other video restoration work. In a sliding window, a new representation is computed for every frame from the three preceding and three succeeding frames. In a recurrent formulation, an accumulated hidden state is used to condition the current frame on past frames. In contrast, the method developed in this chapter compacts the information for a block of frames into a single representation and then projects that information forward in time such that each frame has some past and some future information. This mimics the process the video decoder performs when it decodes a bitstream: the information is discarded when a new I-frame is encountered.

13.4 Loss Functions

As we discussed in Chapter 9 (Quantization Guided JPEG Artifact Correction), restoration problems which are based purely on regression suffer from blurring and lack high-frequency content. Video compression reduction is no exception, and in fact, just like JPEG compression, video compression specifically removes high frequency content. Unlike in other restoration problems, however, all prior work in video compression reduction uses either the l2 loss or the Charbonnier loss [148], both of which are simple error penalties. We can introduce more complex loss functions to overcome this. For "standard" regression, we depend on the l1 loss as usual (for uncompressed frames X_u and reconstructed frames X_r):

\mathcal{L}_1(X_u, X_r) = \| X_u - X_r \|_1 \qquad (13.1)

In Chapter 9 (Quantization Guided JPEG Artifact Correction), we were lacking a method for accurate high frequency reconstruction. We can address this here, with partial success, by using a loss based on the Difference of Gaussians (DoG) scale space. The difference of Gaussians constructs a scale space by convolving an image with Gaussian blur kernels of differing standard deviations. Differences of these blurred images act as bandpass filters which capture image content in different frequency bands. The process is repeated on downsampled versions of the image to capture information at different scales. We can employ this as a loss function by separating out the different frequency bands at different scales and computing their l1 errors as separate loss terms. This effectively weights each frequency band equally, rather than with the decreasing magnitude we see using an overall l1 loss; as such, the network is rewarded for accurately reconstructing high frequencies.

For the uncompressed and reconstructed frames we compute four different scales:

S_u = \{ X_u, X_{u_2}, X_{u_4}, X_{u_8} \} \qquad (13.2)
S_r = \{ X_r, X_{r_2}, X_{r_4}, X_{r_8} \} \qquad (13.3)

where each entry X_{u_s} or X_{r_s} is obtained by downsampling X_u or X_r by a factor of s. We then compute the difference of Gaussians by convolving with a 5 × 5 2D Gaussian kernel

G(\sigma)_{ij} = \frac{1}{2\pi\sigma^2} e^{-\frac{i^2 + j^2}{2\sigma^2}} \qquad (13.4)

for kernel offsets i, j ranging over [-2, 2]. Then, for each scale s we compute the four filtered images

X_{u_{s,\sigma}} = G(\sigma) * X_{u_s} \qquad (13.5)
X_{r_{s,\sigma}} = G(\sigma) * X_{r_s} \qquad (13.6)

for \sigma \in \{1.1, 2.2, 3.3, 4.4\}. We then compute the differences between consecutive pairs

X_{u_{s,1}} = X_{u_{s,2.2}} - X_{u_{s,1.1}} \qquad (13.7)
X_{u_{s,2}} = X_{u_{s,3.3}} - X_{u_{s,2.2}} \qquad (13.8)
X_{u_{s,3}} = X_{u_{s,4.4}} - X_{u_{s,3.3}} \qquad (13.9)
X_{r_{s,1}} = X_{r_{s,2.2}} - X_{r_{s,1.1}} \qquad (13.10)
X_{r_{s,2}} = X_{r_{s,3.3}} - X_{r_{s,2.2}} \qquad (13.11)
X_{r_{s,3}} = X_{r_{s,4.4}} - X_{r_{s,3.3}} \qquad (13.12)

to yield the per-scale frequency bands. Finally, we compute the loss

\mathcal{L}_{\mathrm{DoG}}(X_u, X_r) = \sum_{s \in \{1, 2, 4, 8\}} \sum_{b=1}^{3} \| X_{u_{s,b}} - X_{r_{s,b}} \|_1 \qquad (13.13)
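A sketch of this loss in PyTorch follows. Average pooling stands in for the downsampling operator and the discrete kernels are normalized to sum to one rather than using the analytic constant of Equation 13.4; both choices are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma, size=5):
    """Normalized 5x5 discrete Gaussian kernel with standard deviation `sigma`."""
    ax = torch.arange(size) - size // 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return (k / k.sum()).view(1, 1, size, size)

def dog_loss(x_u, x_r, sigmas=(1.1, 2.2, 3.3, 4.4), scales=(1, 2, 4, 8)):
    """Difference-of-Gaussians loss (Eq. 13.13): l1 over per-scale frequency bands."""
    c = x_u.shape[1]
    loss = 0.0
    for s in scales:
        u = F.avg_pool2d(x_u, s) if s > 1 else x_u       # downsample by the scale factor
        r = F.avg_pool2d(x_r, s) if s > 1 else x_r
        blur_u = [F.conv2d(u, gaussian_kernel(sg).to(u).expand(c, 1, -1, -1),
                           padding=2, groups=c) for sg in sigmas]
        blur_r = [F.conv2d(r, gaussian_kernel(sg).to(r).expand(c, 1, -1, -1),
                           padding=2, groups=c) for sg in sigmas]
        for b in range(3):                               # three bands per scale (Eqs. 13.7-13.12)
            loss = loss + F.l1_loss(blur_u[b + 1] - blur_u[b], blur_r[b + 1] - blur_r[b])
    return loss
```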
As in Chapter 9 (Quantization Guided JPEG Artifact Correction), however, we find that even this enhanced regression loss is not sufficient to generate realistic reconstructions, although it does help. Instead, we again turn to the GAN and texture losses. The texture loss, repeated here, is discussed at length in Section 9.5 (Loss Functions):

\mathcal{L}_t(X_u, X_r) = \| \mathrm{MINC}_{5,3}(X_u) - \mathrm{MINC}_{5,3}(X_r) \|_1 \qquad (13.14)

where \mathrm{MINC}_{5,3} denotes the output of layer 5, convolution 3 of a VGG [34] network trained on MINC [104]. For the GAN loss we use a Wasserstein GAN formulation [149]. In a Wasserstein GAN, the discriminator network is replaced with a critic which rates examples on a [-1, 1] scale, with -1 being fake and 1 being real. The critic therefore makes a soft decision about the realness or fakeness of a sequence rather than a hard decision, which, along with some gradient clipping, makes it more stable and less sensitive to hyperparameter choices.

As in Chapter 9 (Quantization Guided JPEG Artifact Correction), our critic architecture is based on DCGAN [101] with spectrally normalized convolutions [102], except that we modify the critic procedure to introduce temporal consistency following the procedure from TeCoGAN [150]. The modification is relatively simple: we use the compressed and restored/uncompressed sequences as input, with the frames stacked in the channel dimension. This means the critic considers the entire sequence instead of individual frames and is therefore incentivized to produce similar reconstructions over the sequence. The architecture is shown in Figure 13.6.

Figure 13.6: Metabit Critic Architecture. For the entire seven frame sequence, the compressed and restored frames are concatenated following [150]. The resulting 48 channel input is reduced to a single scalar using a series of spectral-normed convolutions, batch norm, and ReLU layers.

This yields the GAN loss

\mathcal{L}_W(X_u, X_r) = -d(X_u, X_r) \qquad (13.15)

for critic d(\cdot) (note that the critic itself has a different loss function which we do not show here; see [149] for details). These are combined into two composite loss functions: one for regression-only results,

\mathcal{L}_R(X_u, X_r) = \lambda \, [\, \mathcal{L}_1(X_u, X_r) \;\; \mathcal{L}_{\mathrm{DoG}}(X_u, X_r) \,]^T \qquad (13.16)

for balancing hyperparameters \lambda \in \mathbb{R}^2, and a loss for qualitative results,

\mathcal{L}_G(X_u, X_r) = \lambda \, [\, \mathcal{L}_1(X_u, X_r) \;\; \mathcal{L}_{\mathrm{DoG}}(X_u, X_r) \;\; \mathcal{L}_W(X_u, X_r) \;\; \mathcal{L}_t(X_u, X_r) \,]^T \qquad (13.17)

for balancing hyperparameters \lambda \in \mathbb{R}^4.
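Putting the pieces together, the qualitative-phase objective of Equation 13.17 is a weighted sum of the four terms. The sketch below assumes dog_loss, texture_loss, and the sequence critic are defined elsewhere, and uses the balancing weights reported later in Section 13.6; it is an illustration, not the exact training code.

```python
import torch.nn.functional as F

def generator_loss(x_u, x_r, critic, weights=(0.01, 0.01, 0.005, 1.0)):
    """Composite GAN-phase loss (Eq. 13.17). `dog_loss`, `texture_loss`, and
    `critic` are assumed helpers; the weights are the lambdas from Section 13.6."""
    terms = [
        F.l1_loss(x_u, x_r),          # L1 term (Eq. 13.1)
        dog_loss(x_u, x_r),           # DoG term (Eq. 13.13)
        -critic(x_u, x_r).mean(),     # Wasserstein generator term (Eq. 13.15)
        texture_loss(x_u, x_r),       # MINC-VGG texture term (Eq. 13.14)
    ]
    return sum(w * t for w, t in zip(weights, terms))
```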
13.5 Towards a Better Benchmark

All video compression reduction work is tested on one primary dataset: the MFQE [138] dataset. This dataset consists of a large training set of diverse sequences and an eighteen-video test set. The test set contains diverse real-world scenes and a variety of resolutions. The videos are stored in raw (YUV) format. Overall, this dataset is satisfactory for the purposes of evaluating compression reduction. The problem, however, comes in how the dataset is used. Prior works used only HEVC (H.265) compression with constant QP values in {27, 32, 37, 42}, or some subset of these depending on the paper. While numerical results should not generally be our primary concern when evaluating any proposed method, these compression settings do leave much to be desired. Firstly, HEVC compression incurs less degradation than other commonly encountered codecs. Although HEVC would no longer be considered "new", it is also much less frequently used, with almost no browser support. Additionally, the constant QP compression method is simply not used in real videos and is mostly included as a debugging tool. The degradations that it causes are much simpler to model than those of CRF or CBR, which are used frequently in real videos. See Section 11.3 (Slices and Quantization) for a deeper discussion of these terms.

Instead, we propose to use AVC (H.264) compression for evaluation, and we use CRF instead of CQP. This is a much better representation of real-world video than the previous benchmark, as AVC compression accounts for nearly 91% of internet videos as of 2019 [129] (although this share has likely decreased since 2019, it would not be by much). We choose CRF values in {25, 35, 40, 50}, ranging from relatively little compression at 25 (near the ffmpeg default [151]) to 50, which is only one less than the maximum. To reiterate, our goal with this benchmark is to ensure that compression reduction algorithms face tests which accurately represent videos in the real world. In the next section we will see that MFQE fails to converge for any CRF setting, as CRF produces significantly more complex degradations than CQP, thus justifying our concern.

13.6 Empirical Evaluation

With the method sufficiently developed we can empirically evaluate its performance. In all cases we train our network using the MFQE training split [138], which consists of 108 variable-length sequences. To this we add a randomly selected one third of the Vimeo90k dataset [133], which is approximately 30,000 7-frame sequences. We randomly crop 256 x 256 patches (and 7 frames from the MFQE examples) and apply random vertical and horizontal flipping during training. For H.264 benchmarks we encode using one I-frame and six P-frames with CRF encoding as discussed in the previous section. For H.265 benchmarks we follow prior works and use CQP encoding. We train a separate model for each QP or CRF setting. All evaluations are conducted on the MFQE test split of 18 variable-length sequences.

The network is implemented in PyTorch [17] and optimized using the Adam [105] optimizer for 200 epochs with the learning rate set to 10^{-4}. For quantitative experiments we train using the regression loss (Equation 13.16) with \lambda = [1.0, 1.0]. For qualitative results we fine-tune using the GAN loss (Equation 13.17) for an additional 200 epochs with a learning rate of 10^{-5} and \lambda = [0.01, 0.01, 0.005, 1]. As recommended, we use the RMSProp optimizer for Wasserstein GAN training [149]. For numerical results we report the change in PSNR and the change in LPIPS [152]. For compared works we only compute LPIPS if there is a published model or training code. We find no usable trend in SSIM [99], so we do not report it, although some other works do. For consistency with prior work we report metrics on the Y channel only, although we would like to see this practice end soon. To evaluate the GAN, besides providing qualitative results, we report FID [43] and LPIPS.
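For concreteness, the following is a sketch of how one benchmark clip might be encoded with these settings using ffmpeg from Python. The helper name, example file names, and the exact flag set are illustrative assumptions; only the fixed CRF, the 7-frame GOP (one I-frame plus six P-frames, no B-frames), and the use of an H.264 encoder come from the setup described above.

```python
import subprocess

def encode_crf(src_yuv, dst_mp4, crf, width, height):
    # Encode a raw YUV420 sequence with H.264 at a fixed CRF using a
    # 7-frame GOP (one I-frame, six P-frames) and no B-frames.
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "yuv420p",
        "-s", f"{width}x{height}", "-i", src_yuv,
        "-c:v", "libx264", "-crf", str(crf),
        "-g", "7", "-bf", "0",
        dst_mp4,
    ], check=True)

# e.g., encode_crf("sequence.yuv", "sequence_crf40.mp4", crf=40, width=1920, height=1080)
```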
13.6.1 Restoration Evaluation

We start with the numerical results, shown in Table 13.1 for HEVC and Table 13.2 for AVC.

Table 13.1: Metabit HEVC Results. Reported as ΔPSNR (dB) ↑ / ΔLPIPS ↓.

Method            HEVC CQP 27     CQP 32          CQP 37          CQP 42
MFQE 2.0 [140]    0.49 / -        0.52 / -        0.56 / -        0.59 / -
PSTQE [141]       0.63 / -        0.67 / -        0.69 / -        0.69 / -
STDF-R3 [139]     0.72 / 0.025    0.86 / 0.027    0.83 / 0.033    - / -
RFDA [142]        0.82 / -        0.87 / -        0.91 / -        0.82 / -
MetaBit [144]     1.17 / 0.025    0.99 / 0.023    0.91 / 0.029    0.82 / -

Table 13.2: Metabit AVC Results. Reported as ΔPSNR (dB) ↑ / ΔLPIPS ↓.

Method            AVC CRF 25      CRF 35          CRF 40          CRF 50
STDF-R1 [139]     0.741 / 0.034   0.862 / 0.032   0.814 / 0.030   0.632 / 0.013
STDF-R3 [139]     0.784 / 0.035   0.846 / 0.032   0.882 / 0.029   0.817 / 0.011
MetaBit [144]     1.085 / 0.024   1.137 / 0.014   1.113 / 0.005   0.887 / -0.016

We can see that the Metabit architecture we developed achieves a significantly better reconstruction result in most cases. The only exceptions are at the high QPs for HEVC, where our method ties the next most recent work. For AVC, the results are significantly better, indicating that the more complex CRF degradation is handled well by our method; in other words, the extra parameterization is able to better model complex degradations. Note that MFQE failed to converge entirely for CRF training. Note also that PSTQE [141] does not provide public code and RFDA [142] does not provide usable training code, so we were unable to fully evaluate these methods.

To evaluate the GAN we report FID and LPIPS in Table 13.3. Note that this compares the degraded AVC input against our model trained for regression and for GAN restoration at extreme CRF settings (40 and 50); we do not compare to other works here. As expected, the GAN generates significantly more realistic results.

Table 13.3: Metabit GAN Numerical Results. Reported as FID ↓ / LPIPS ↓.

Method        H.264 CRF 40     CRF 50
AVC           67.07 / 0.259    152.19 / 0.498
Regression    80.67 / 0.265    154.42 / 0.482
GAN           37.78 / 0.191    95.26 / 0.368

A more important numerical result for our purposes is throughput, specifically when compared with the number of model parameters. Our formulation is designed to be highly efficient while permitting a large number of parameters, traits which are important for timely and accurate video restoration. This result is shown graphically in Figure 13.7. We see that our method is as fast as STDF with about double the parameters and a higher PSNR. In other words, our method is able to better utilize the extra parameters without slowing down, thanks to the efficiency improvements we made by leveraging the bitstream metadata.

Figure 13.7: FPS vs Params. FPS plotted against parameter count (M) for MFQE 1.0/2.0, STDF-R1, STDF-R3, and Metabit; the size of each point indicates the increase in PSNR.

With the numerical results out of the way, we can look at how the restoration functions on real images, with an example shown in Figure 13.10 from the short film Big Buck Bunny [130]. In this example we can see that despite the heavy compression (AVC CRF 40), the multiframe GAN restoration is able to produce a striking resemblance to the original image. Textures are accurately reconstructed on the grass and tree, and yet the smooth sky region is preserved, i.e., the network has not hallucinated textures where they should not be. This is even more remarkable considering that there was no artificial training data.

Figure 13.8: Rate-Distortion Comparison. Distortion (PSNR, dB) vs. rate (bpp) for AVC, Wu et al. (2018), DVC (2019), STAT-SSF-SP (2020), HLVC (2020), Scale-Space (2020), Liu et al. (2020), NeRV (2021), and AVC + MetaBit (Ours). Using Metabit with AVC compression performs better for low bitrates than fully deep learning codecs.

To compare the effect of GAN restoration we show crops in Figure 13.11. The difference here is quite pronounced. Although STDF and Metabit using regression are able to improve the visual quality of the images, the GAN restoration is significantly more realistic in terms of overall sharpness and texture. This is particularly noticeable on the trees in the top row.
13.6.2 Compression Evaluation

As in Chapter 9 (Quantization Guided JPEG Artifact Correction), one of the principal applications of restoration is as a method to better compress media. For our case, we can make a direct comparison to other fully deep learning based compression codecs, where we compare quite favorably both in terms of latency and rate-distortion. The rate-distortion result is shown in Figure 13.8 along with a number of recent deep learning based compression algorithms. We use the UVG dataset [153] for this task, which is a widely used dataset for compression evaluation. We compare to [154], [155], [156], [157], [158], [159], and [160]. Note that we use a model trained for regression only in this case.

Figure 13.9: Learned Compression Throughput Comparison. Encoding and decoding FPS for Wu et al., DVC, Liu et al., NeRV, AVC, and AVC + Metabit; orange shows encoding time, blue shows decoding time. Current fully deep learning based codecs are quite slow. One major advantage of using video restoration is that encoding is quite fast. Decoding is on par with other methods except NeRV, although NeRV has a much slower encoding time.

One potential problem we encounter here is that the deep learning based compression literature compares with "low-delay P" mode compression. This is a compression setting which saves additional space by using only a single I-frame for the entire sequence, with the rest of the frames encoded as P-frames. This setting saves additional space over placing I-frames periodically at the cost of quality and is often used in streaming scenarios where low latency is critical. In order to compare fairly, we modify our restoration procedure slightly to accommodate this (a procedural sketch follows at the end of this section). Since the first group of seven frames always includes an I-frame, this group is restored following our standard procedure. We then cache the restored seventh P-frame, and instead of reading seven more frames we read six. These six frames are all P-frames, and we use the cached restored P-frame in place of an I-frame. There is no retraining of the network involved in this process, so the results are likely lower than they could be if we trained for this scenario; however, this likely improves cross-block temporal consistency by reusing information from the previous block.

For low bitrates (i.e., the bitrates which matter), simply using AVC compression with Metabit restoration outperforms deep learning codecs. This is to be expected since Metabit is an objective improvement on AVC, which many deep learning methods still struggle to beat in general cases. The advantage of this comes in its ease of use (almost all modern hardware can decode an AVC compressed video) and its speed. Encoding times are in the hundreds of frames per second on commodity CPU hardware. Decoding time, even including Metabit, is also fast. In Figure 13.9 we compare the throughput of a subset of these methods to Metabit with AVC compression. Metabit is on par with other methods here, and actually outperforms all but NeRV. The major disadvantage of NeRV, however, is the extremely long encoding time. Since NeRV is an implicit representation, a new network is learned for each video, taking on the order of hours to encode one video. We simply skip encoding time for NeRV in the plot.

The major takeaway from this discussion is that video compression reduction, specifically the Metabit method we developed in this chapter, is efficient and effective at generating accurately restored video frames from low bitrate video. Since it only depends on commodity codecs, videos can be encoded quickly and easily and decoded by anyone without special hardware. Those with special hardware can use our method to achieve more visually pleasing results. This method provides a promising avenue for deploying deep learning in compression in the near term.
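The low-delay-P evaluation procedure described above can be summarized in the following sketch. The `restore_gop` callable is a stand-in for the Metabit network (mapping a seven-frame block whose first frame is an I-frame or pseudo-I-frame to seven restored frames); handling of a short final block is omitted, and nothing here implies retraining.

```python
def restore_low_delay_p(frames, restore_gop):
    # First block: contains the real I-frame, restore it normally.
    restored = list(restore_gop(frames[:7]))
    i = 7
    while i < len(frames):
        # The cached restored frame stands in for an I-frame for the next
        # block of six P-frames; keep only the six newly restored frames.
        block = [restored[-1]] + list(frames[i:i + 6])
        restored.extend(restore_gop(block)[1:])
        i += 6
    return restored
```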
Figure 13.10: Metabit Restoration Example. Crop from the 1920x1080 film "Big Buck Bunny". Left: compressed, Middle: restored, Right: original. This artificial scene is restored accurately despite a lack of artificial training data. Note the grass and tree textures, sharp edges, removal of blocking on the flower, and preservation of the smooth sky region. The video was compressed using AVC CRF 40.

13.7 Limitations and Future Work

The work as presented in this chapter has a few limitations, all of which we believe are solvable with relatively minor changes. The network architecture currently depends on a fixed GOP size of seven frames. This would likely not work with real videos, which may use a variable-length GOP for several reasons. This is readily solvable by projecting a variable-size GOP representation to a fixed-size one using an adaptive pooling layer followed by a projection. It remains to be investigated whether this mapping is as effective as using a fixed GOP.

Similarly, since each GOP is treated as a separate block, and restored separately, there are temporal consistency issues across GOP blocks. Note that within a single GOP the frames are quite consistent thanks to the TeCoGAN [150] formulation which we use. It is only across GOP boundaries that consistency issues arise. This is noticeable in the restored output as a sort of flickering or noise pattern. This is likely solvable by keeping a compact hidden state from the previous GOP to compose with the current GOP processing.

Continuing in this line of thought, there are issues which arise in a streaming scenario which make an architecture like this wholly unsuitable. In streaming scenarios, the video is often encoded using "low-delay P" mode. In this mode, there is a single I-frame at the beginning of the GOP followed by only P-frames; in other words, there is only a single GOP in the entire video. For this we actually have a partial solution, which is to simply use the previously restored final frame (restored frame 7) in place of the I-frame for the next GOP, skipping the I-frame restoration step entirely. This does lead to a loss of visual quality, especially if the network was not trained to perform this kind of restoration.

Another issue in streaming situations is latency and buffering. For real-time applications this currently precludes restoration technology, but in our case in particular, the 7-frame GOP needs to be buffered before restoration can occur. This may be a limiting factor in some scenarios. The only way around this is to reduce the number of buffered frames or move to a fully recurrent solution.

Finally, this method suffers from the same "quality aware" problem that was prevalent in JPEG artifact correction. For each CRF setting we have to train a different model. Unlike JPEG, however, CRF, and more importantly the derived QP values, are stored in the video file. In this way it should be easy to create a parameterized network which is aware of the QPs that each frame was compressed with.
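One way such a QP-aware parameterization might look is sketched below. The embedding-based feature modulation, the layer sizes, and the assumed 0-51 QP range are illustrative only; this is not the network described in this chapter.

```python
import torch
import torch.nn as nn

class QPConditionedBlock(nn.Module):
    # Modulate restoration features with an embedding of the per-frame QP
    # read from the bitstream, so a single model can cover all settings.
    def __init__(self, channels, max_qp=51):
        super().__init__()
        self.embed = nn.Embedding(max_qp + 1, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, qp):
        # x: (N, C, H, W) features; qp: (N,) integer QP for each frame.
        scale = self.embed(qp).unsqueeze(-1).unsqueeze(-1)
        return torch.relu(self.conv(x * scale))
```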
Figure 13.11: Metabit Comparison. Crops from "Big Buck Bunny" and "Traffic" from the MFQE dataset. We compare our GAN restoration to regression restoration and the STDF method.

Part IV

Concluding Remarks

You have made it to the end of the dissertation, a journey developing the first principles of deep learning and classical compression from the preliminaries through to the published research of the author. With the body of the dissertation behind us, we can recap where we've been and where we're going.

To reiterate, the overall goal of my dissertation was to present, explicitly, an approach that follows the first principles of the compression problems. This is motivated by engineering as much as by science, and I have shown that incorporating engineering principles, both into the methodologies, e.g., considering how compression algorithms were developed, and into the philosophy of the research, e.g., approaching a scientific problem as an engineer and a scientist simultaneously, can be successful. While this is not an approach unique to myself, I find that it is rarely stated out loud. And yet, the engineers that developed the compression algorithms we all use on a daily basis were extremely smart, so is it not logical to follow their example when studying deep learning? I hope that the implications of this thought extend far beyond this document.

In Chapter 7 (JPEG Domain Residual Learning) we developed a method for performing deep learning in the JPEG domain. This method operates on coefficients directly and requires no decompression. The goal of the method was to produce a result which is as close as possible to the pixel domain result. In this work we leveraged the first principles of JPEG compression by linearizing the JPEG transform and composing it with our pixel domain convolutions. We also leveraged this to produce closed-form derivations of batch normalization and average pooling. However, we could not do this for ReLU, and so we used an approximation technique. The primary issue with this work is that it uses substantially more memory to store feature maps and convolutions.

In Chapter 9 (Quantization Guided JPEG Artifact Correction) we improved JPEG compression using deep learning. We noted that there were several issues preventing the widespread success of prior work in this field. Prior works were quality-dependent models which trained a unique model for each JPEG quality factor. They did not handle color images, and they were focused only on regression. We solved each of these problems, again leveraging first principles. By conditioning our network on the JPEG quantization matrix and processing DCT coefficients instead of pixels, we were able to encode quality information using a single network. By explicitly handling chroma subsampling and the additional quantization that color channels are subjected to, we improved results on color images. Finally, we incorporated GAN and texture losses to improve the visual result over a regression-only solution. While this method was highly successful, it had a distinct disadvantage when quantization data is either incorrect, as in multiple compression, or not available at all, as in transcoding.

This was followed up in Chapter 10 (Task-Targeted Artifact Correction), which extended the previous method to optimize the correction for machine consumption. This is somewhat of a novelty in artifact correction, which is traditionally for humans only.
By incorporating a loss based on a downstream task, we are able to greatly improve the performance of that task on JPEG compressed inputs, often outperforming data augmentation techniques that retrain the task with JPEG images. Our method had the added benefit that a network trained for one downstream task would also work well for other downstream tasks with no re-training required. The main drawbacks of this method are that it has increased running time (for multiple networks) and that it does not always outperform data augmentation.

Finally, in Chapter 13 (Metabit: Leveraging Bitstream Metadata), we forayed into video compression by developing a correction based method for improving AVC and HEVC compression. We noted that prior works expended significant resources computing results which were already stored explicitly in the video bitstreams, such as high quality frame locations (I-frames) and motion data. We leveraged this data and used our increased compute budget to more than double the number of parameters in our network with almost no impact on throughput. We also incorporated a scale-space loss along with the GAN and texture losses from the JPEG work in order to improve high-frequency reconstructions. While this method outperforms prior works and fully deep-learning based codecs at low bitrates, it does struggle with high bitrate reconstructions, and it currently requires a unique model for each video compression setting. The method would also struggle to run in real time on any consumer hardware.

So where do we go from here? Aside from the multitude of related problems to work on in compression, from things as simple as improving the results to as tangential as data privacy, the number one focus for the next decade of compression and deep learning is going to be on making these techniques practical. One of the primary goals in writing this document is to instruct practitioners, e.g., professional engineers, developers, students, etc., that while these techniques show extreme promise, actually getting them into the hands of consumers requires considerable effort.

Focusing solely on the techniques which improve compression performance, which would be of direct use to consumers who lack broadband internet, these methods currently require significant compute resources on the end user's side. This could be shifted to the data center side, i.e., a hybrid technique which extracts some deep representation during transmission. This could also be addressed simply by developing faster and lower-memory algorithms or by leveraging customized hardware. While the latter sounds like an expensive solution, consider that video decoders are almost exclusively implemented in hardware on modern processors, both for desktop/laptop machines and mobile phones. This is also starting to happen for deep learning, e.g., the Google Tensor chips and edge TPUs, and the Apple A14 chips. In any case, I believe consumer applications for this technology are no more than two years off at the time of writing, and within the decade for fully deep learning based compression. The field of compression as a whole is progressing as fast as ever and it is an exciting time to be involved.

This is all in the midst of the global pandemic. In a world which was just beginning to address the inequality in internet access, suddenly in late 2019, we were forced to confront this issue as work and school became primarily remote. Remote work and school mean communication with video and images, which means compression.
Those who did not have a strong internet connection were simply left behind, as there were no suitable alternatives, and it remains to be seen what the long-term ramifications of this will be. Now, in early 2022, the world is quickly moving on from pandemic life. Yet it is important not to forget this lesson. Better compression has the ability to help people right now.

Author's Note

I invite the readers to now visit the appendices, where they will find material which is just as interesting as, and yet not directly related to, the dissertation proper. In particular, we will review some additional qualitative results and briefly cover fully deep-learning based compression algorithms. Thank you for reading my dissertation!

Max Ehrlich

Part V

Appendix

Appendix A: Study on JPEG Compression and Machine Learning

This appendix reproduces the full plots and tables of the results of the study on JPEG compression and deep learning [111]. See Chapter 10 (Task-Targeted Artifact Correction) for more details. These plots are for informational purposes only.

A.1 Plots of Results

Figure A.1: Overall Classification Results (accuracy loss (%) vs. JPEG quality for MobileNetV2, ResNet-18/50/101, ResNeXt-50/101, VGG-19, InceptionV3, and EfficientNet B3)
Figure A.2: Classification Results: MobileNetV2 (accuracy loss vs. quality with no mitigation, off-the-shelf artifact correction, fine-tuning, and task-targeted artifact correction)
Figure A.3: Classification Results: VGG-19
Figure A.4: Classification Results: InceptionV3
Figure A.5: Classification Results: ResNeXt 50
Figure A.6: Classification Results: ResNeXt 101
Figure A.7: Classification Results: ResNet 18
Figure A.8: Classification Results: ResNet 50
Figure A.9: Classification Results: ResNet 101
Figure A.10: Classification Results: EfficientNet B3
Figure A.11: Overall Detection and Instance Segmentation Results (mAP loss vs. JPEG quality for FasterRCNN, FastRCNN, RetinaNet, and MaskRCNN)
Figure A.12: Detection Results: FastRCNN
Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.13: Detection Results: FasterRCNN 239 mAP Loss mAP Loss mAP Loss 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.14: Detection Results: RetinaNet 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.15: Detection Results: MaskRCNN 240 mAP Loss mAP Loss 16 HRNetV2 + C1 14 MobileNetV2 (dilated) + C1 (ds)ResNet18 (dilated) + PPM 12 ResNet50 + UPerNet ResNet50 (dilated) + PPM 10 ResNet101 + UPerNet ResNet101 (dilated) + PPM 8 6 4 2 0 10 20 30 40 50 60 70 80 90 Quality Figure A.16: Overall Semantic Segmentation Results 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.17: Semantic Segmentation Results: HRNetV2 + C1 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.18: Semantic Segmentation Results: MobileNetV2 + C1 241 mIoU Loss mIoU Loss mIoU Loss 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.19: Semantic Segmentation Results: ResNet 18 + PPM 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.20: Semantic Segmentation Results: Resnet50 + UPerNet 242 mIoU Loss mIoU Loss 15.0 None Off-the-Shelf Artifact Correction 12.5 Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.21: Semantic Segmentation Results: ResNet 50 + PPM 15.0 None 12.5 Off-the-Shelf Artifact Correction Fine-Tuned 10.0 Task-Targeted Artifact Correction 7.5 5.0 2.5 0.0 10 20 30 40 50 60 70 80 90 Quality Figure A.22: Semantic Segmentation Results: ResNet 101 + UPerNet 243 mIoU Loss mIoU Loss 14 None 12 Off-the-Shelf Artifact CorrectionFine-Tuned 10 Task-Targeted Artifact Correction 8 6 4 2 0 10 20 30 40 50 60 70 80 90 Quality Figure A.23: Semantic Segmentation Results: ResNet 101 + PPM 244 mIoU Loss A.2 Tables of Results Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 79.78 81.84 82.47 82.68 82.78 82.75 82.83 82.85 82.83 None 77.24 81.11 81.95 82.52 82.67 82.91 83.10 83.37 83.75 EfficientNet B3 Top-1 Accuracy 83.98 Off-the-Shelf Artifact Correction 75.92 80.02 81.47 82.12 82.44 82.71 82.94 83.23 83.70 Task-Targeted Artifact Correction 81.03 82.71 83.21 83.53 83.64 83.71 83.73 83.80 83.76 Supervised Fine-Tuning 75.11 77.25 77.77 77.89 78.13 78.13 78.24 78.26 78.32 None 69.38 74.15 75.44 75.98 76.38 76.69 76.95 77.14 77.30 InceptionV3 Top-1 Accuracy 77.33 Off-the-Shelf Artifact Correction 71.21 75.04 76.09 76.42 76.68 76.79 76.97 77.06 77.13 Task-Targeted Artifact Correction 73.65 75.89 76.53 76.82 76.93 76.99 77.09 77.15 77.10 Supervised Fine-Tuning 65.65 69.21 69.92 70.20 70.37 70.53 70.50 70.55 70.54 None 57.23 65.55 67.87 68.95 69.47 69.98 70.24 70.60 70.86 MobileNetV2 Top-1 Accuracy 70.72 Off-the-Shelf Artifact Correction 57.33 65.25 67.76 68.93 69.60 70.07 70.40 70.71 70.58 Task-Targeted Artifact Correction 64.64 68.63 69.71 70.18 70.32 70.44 70.50 70.52 70.34 Supervised Fine-Tuning 74.63 76.50 77.07 77.20 
77.27 77.29 77.43 77.44 77.53 None 66.12 73.00 74.65 75.39 75.83 76.29 76.51 76.79 76.96 ResNet-101 Top-1 Accuracy 76.91 Off-the-Shelf Artifact Correction 67.91 73.64 75.09 75.84 76.23 76.52 76.56 76.80 76.74 Task-Targeted Artifact Correction 72.99 75.53 76.30 76.60 76.59 76.72 76.70 76.72 76.59 Supervised Fine-Tuning 65.49 68.46 69.07 69.16 69.36 69.33 69.38 69.53 69.49 None 57.62 65.26 67.07 67.68 68.08 68.30 68.61 68.84 68.92 ResNet-18 Top-1 Accuracy 68.84 Off-the-Shelf Artifact Correction 61.19 66.39 67.87 68.39 68.61 68.77 68.97 68.99 68.90 Task-Targeted Artifact Correction 63.83 67.06 68.04 68.24 68.35 68.48 68.52 68.60 68.50 Supervised Fine-Tuning 73.18 75.46 76.02 76.24 76.36 76.42 76.52 76.52 76.55 None 63.43 71.20 73.23 74.10 74.43 74.63 75.01 75.09 75.34 ResNet-50 Top-1 Accuracy 75.31 Off-the-Shelf Artifact Correction 66.90 72.45 73.95 74.60 74.93 75.18 75.26 75.42 75.30 Task-Targeted Artifact Correction 70.48 73.56 74.39 74.81 74.94 75.00 74.98 74.98 74.89 Supervised Fine-Tuning 75.60 78.00 78.50 78.71 78.86 78.97 79.01 78.98 79.06 None 68.83 74.84 76.39 77.05 77.60 78.00 78.16 78.56 78.75 ResNeXt-101 Top-1 Accuracy 78.81 Off-the-Shelf Artifact Correction 71.19 75.88 77.14 77.80 78.15 78.30 78.57 78.66 78.61 Task-Targeted Artifact Correction 74.73 77.33 78.08 78.29 78.55 78.62 78.68 78.73 78.68 Supervised Fine-Tuning 74.21 76.23 76.79 77.01 77.08 77.18 77.16 77.30 77.17 None 66.96 73.21 74.85 75.62 76.07 76.37 76.63 76.88 77.06 ResNeXt-50 Top-1 Accuracy 76.99 Off-the-Shelf Artifact Correction 68.05 73.56 75.11 75.95 76.38 76.59 76.71 76.99 76.90 Task-Targeted Artifact Correction 72.22 75.45 76.09 76.62 76.86 76.83 76.85 76.99 76.81 Supervised Fine-Tuning 69.50 72.66 73.29 73.74 73.83 73.85 73.95 74.14 74.11 None 59.27 68.08 70.49 71.53 71.99 72.42 72.80 73.24 73.46 VGG-19 Top-1 Accuracy 73.44 Off-the-Shelf Artifact Correction 61.93 68.79 70.82 71.83 72.50 72.94 73.13 73.40 73.44 Task-Targeted Artifact Correction 67.50 71.32 72.33 72.76 73.03 73.16 73.50 73.48 73.44 245 Table A.1: Results for classification models. Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 29.09 33.34 34.72 35.08 35.49 35.82 35.96 36.06 36.17 None 20.35 30.03 32.59 33.43 34.04 34.31 34.73 34.93 35.25 FasterRCNN mAP 35.37 Off-the-Shelf Artifact Correction 28.45 31.86 33.10 33.85 34.05 34.47 34.70 34.77 34.71 Task-Targeted Artifact Correction 31.43 33.85 34.29 34.81 34.81 34.97 35.01 34.88 34.81 Supervised Fine-Tuning 28.01 31.94 33.08 33.56 33.88 34.17 34.42 34.44 34.66 None 19.99 29.04 31.22 32.19 32.65 33.00 33.34 33.40 33.80 FastRCNN mAP 34.02 Off-the-Shelf Artifact Correction 27.62 30.91 32.04 32.56 32.78 33.18 33.28 33.48 33.44 Task-Targeted Artifact Correction 30.11 32.31 33.07 33.31 33.39 33.53 33.69 33.68 33.59 Supervised Fine-Tuning 26.32 30.48 31.79 32.21 32.55 32.83 33.11 33.20 33.32 None 18.35 27.58 29.83 30.80 31.32 31.62 32.02 32.29 32.62 MaskRCNN mAP 32.84 Off-the-Shelf Artifact Correction 25.82 29.35 30.67 31.32 31.59 31.85 32.03 32.24 32.16 Task-Targeted Artifact Correction 28.48 30.85 31.71 32.00 32.19 32.24 32.35 32.43 32.26 Supervised Fine-Tuning 27.64 31.97 33.03 33.50 33.80 34.12 34.30 34.33 34.40 None 18.76 28.23 30.63 31.59 32.27 32.57 32.88 33.02 33.42 RetinaNet mAP 33.57 Off-the-Shelf Artifact Correction 26.74 29.90 31.24 31.87 32.19 32.60 32.86 33.02 32.93 Task-Targeted Artifact Correction 29.66 31.86 32.73 32.97 32.98 33.13 33.24 33.23 33.09 Table A.2: Results for detection models. 
246 Model Metric Reference Mitigation Q=10 Q=20 Q=30 Q=40 Q=50 Q=60 Q=70 Q=80 Q=90 Supervised Fine-Tuning 34.76 37.35 38.74 38.78 39.27 39.75 39.98 39.86 39.96 None 24.95 35.16 38.03 38.52 39.02 40.09 40.50 40.41 40.54 HRNetV2 + C1 mIoU 40.59 Off-the-Shelf Artifact Correction 32.30 36.54 38.40 38.52 40.08 40.44 40.46 40.22 40.60 Task-Targeted Artifact Correction 34.14 37.61 39.23 39.24 39.92 40.53 40.62 40.39 40.55 Supervised Fine-Tuning 19.07 22.37 23.43 23.62 23.60 24.15 24.44 24.37 24.46 None 13.92 24.03 27.13 27.75 27.73 28.86 29.37 29.35 29.43 MobileNetV2 (dilated) + C1 (ds) mIoU 29.52 Off-the-Shelf Artifact Correction 21.17 25.27 27.31 27.16 29.14 29.32 29.26 29.06 29.54 Task-Targeted Artifact Correction 24.74 27.37 28.44 28.33 29.19 29.56 29.54 29.38 29.52 Supervised Fine-Tuning 35.32 37.41 38.27 38.28 38.55 38.59 38.72 38.58 38.70 None 26.14 36.70 39.45 39.81 39.55 40.47 40.98 40.97 41.07 ResNet101 + UPerNet mIoU 41.08 Off-the-Shelf Artifact Correction 33.90 37.39 39.12 39.38 40.32 40.58 40.78 40.79 41.04 Task-Targeted Artifact Correction 35.82 38.67 39.96 39.98 40.22 40.79 40.97 40.91 41.00 Supervised Fine-Tuning 31.86 35.45 36.73 36.94 36.91 37.33 37.67 37.55 37.65 None 25.68 35.19 37.76 38.43 38.24 39.27 40.03 40.17 40.21 ResNet101 (dilated) + PPM mIoU 40.26 Off-the-Shelf Artifact Correction 31.44 35.86 38.01 38.26 39.54 39.73 39.94 40.06 40.22 Task-Targeted Artifact Correction 33.99 37.63 39.04 39.11 39.38 39.73 40.07 40.11 40.10 Supervised Fine-Tuning 29.84 32.33 33.08 33.01 33.38 33.61 33.50 33.29 33.33 None 21.16 31.99 34.72 35.36 35.41 36.16 36.56 36.60 36.59 ResNet18 (dilated) + PPM mIoU 36.65 Off-the-Shelf Artifact Correction 28.64 32.59 34.56 34.53 35.96 36.21 36.29 36.25 36.64 Task-Targeted Artifact Correction 31.69 34.55 35.80 35.80 36.12 36.50 36.66 36.54 36.60 Supervised Fine-Tuning 32.88 35.11 35.94 35.90 36.41 36.58 36.63 36.49 36.55 None 24.29 34.78 37.34 37.71 37.70 38.57 39.12 39.13 39.16 ResNet50 + UPerNet mIoU 39.21 Off-the-Shelf Artifact Correction 31.83 35.52 37.20 37.26 38.44 38.67 38.87 38.86 39.12 Task-Targeted Artifact Correction 34.36 36.94 38.17 38.07 38.55 38.93 39.14 39.06 39.09 Supervised Fine-Tuning 32.26 35.33 36.04 36.04 36.53 36.75 36.93 36.71 36.92 None 23.05 33.95 36.66 37.07 37.40 38.58 38.93 38.70 38.86 ResNet50 (dilated) + PPM mIoU 38.91 Off-the-Shelf Artifact Correction 28.36 32.69 35.24 35.31 37.74 38.04 38.18 38.13 38.73 Task-Targeted Artifact Correction 31.92 35.43 37.04 36.92 38.05 38.69 38.79 38.52 38.74 Table A.3: Results for segmentation models. 247 Model Value ImageNet Classification, Metric: Top-1 Accuracy ResNet 18 68.84 ResNet 50 75.31 ResNet 101 76.91 ResNeXt 50 76.99 ResNeXt 101 78.81 VGG 19 73.44 MobileNetV2 70.72 InceptionV3 77.33 EfficientNet B3 83.98 COCO Object Detection and Instance Segmentation, Metric: mAP FastRCNN 34.02 FasterRCNN 35.38 RetinaNet 33.57 MaskRCNN 32.84 ADE20k Semantic Segmentation, Metric: mIoU HRNetV2 + C1 40.59 MobileNetV2 (dilated) + C1 29.52 ResNet 18 (dilated) + PPM 36.65 ResNet 50 (dilated) + PPM 38.91 ResNet 101 41.08 248 ResNet 101 (dilated) + PPM 40.26 Table A.4: Reference results (results with no compression). 249 Appendix B: Additional Results In this appendix we examine more interesting outputs from various methods dis- cussed in the body of the dissertation. These are mostly qualitative results. While these images are not critical to understanding the methods, everyone likes looking at pictures! 
Warning: The results presented here are intended to be reproductions from the published papers, so there may be some repeats from the body of the dissertation.

B.1 Quantization Guided JPEG Artifact Correction

These results are from the method presented in Chapter 9 (Quantization Guided JPEG Artifact Correction). We first show more equivalent quality examples; recall that equivalent quality performs restoration on an image and then uses SSIM to find the JPEG quality matching the restored image, which gives an indication of how much space is saved by using QGAC.

Figure B.1: Equivalent quality visualizations. For each image we show the input JPEG, the JPEG with equivalent SSIM to our model output, and our model output. (Quality 50 → equivalent quality 85, 29.5 kB saved (43.6%); quality 30 → 58, 46.8 kB saved (47.9%); quality 40 → 78, 25.0 kB saved (42.7%).)

Next we show the full frequency domain results. Recall that these results show the frequency domain content of the images, comparing JPEG compression, regression restoration, and GAN restoration.

Figure B.2: Frequency domain results 1/4 (per-frequency probability for the original, JPEG Q=10, regression, and GAN images)
Figure B.3: Frequency domain results 2/4
Figure B.4: Frequency domain results 3/4
Figure B.5: Frequency domain results 4/4

One way to reduce any artifacts caused by divergent GAN training is to use model interpolation [81]. Model interpolation simply takes the regression weights W_R and the GAN weights W_G along with a scalar \alpha and computes new model parameters

W_I = (1 - \alpha) W_R + \alpha W_G    (B.1)

We show close-up views of different textured regions for different choices of \alpha.

Figure B.6: Model interpolation results 1/4
Figure B.7: Model interpolation results 2/4
Figure B.8: Model interpolation results 3/4
Figure B.9: Model interpolation results 4/4
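Equation B.1 amounts to a per-parameter linear blend of the two checkpoints. A minimal sketch, assuming the regression and GAN models share an architecture (and therefore the same state_dict keys):

```python
def interpolate_weights(w_r, w_g, alpha):
    # W_I = (1 - alpha) * W_R + alpha * W_G, applied to every parameter tensor.
    return {k: (1.0 - alpha) * w_r[k] + alpha * w_g[k] for k in w_r}

# e.g., model.load_state_dict(interpolate_weights(regression_state, gan_state, alpha=0.5))
```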
We close with purely qualitative results. These are for quality 10, as in Chapter 9 (Quantization Guided JPEG Artifact Correction), and quality 20, which was not shown there to save space.

Figure B.10: Qualitative results 1/4 (JPEG Q=10 and Q=20, reconstruction, and original). Live-1 images.
Figure B.11: Qualitative results 2/4. Live-1 images.
Figure B.12: Qualitative results 3/4. Live-1 images.
Figure B.13: Qualitative results 4/4. ICB images.

B.2 Task Targeted Artifact Correction

These results are from the method of Chapter 10 (Task-Targeted Artifact Correction). We start with visualizations of model errors, first using Grad-CAM [161]. This shows how the model focus is impacted by JPEG compression and how it can be corrected using the various mitigation techniques we studied. The figures show some interesting behavior. In terms of localization, the JPEG compressed input actually does well, and the localization is in fact more accurate than the original model with an uncompressed input. The problem with the JPEG compressed input seems to be with the gradient, which is extremely noisy. Mitigation seems to help with this, with the supervised method providing the cleanest gradient, although there is a loss of localization accuracy.

Figure B.14: Fine Tuned Model Comparison (gradients and class activation maps for the original model on the original input, the original model on the compressed input, and the fine-tuned model)
Figure B.15: Off-the-Shelf Artifact Correction Comparison
Figure B.16: Task-Targeted Artifact Correction Comparison

For visualizing detection results we provide plots generated using TIDE [126]. We show these for FasterRCNN [118] and MaskRCNN [120]. The results show a significant number of missed detections for low quality inputs. This is overtaken by localization errors as quality increases.

Figure B.17: FasterRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.
Figure B.18: MaskRCNN TIDE Plots. Left: quality 10, Middle: quality 50, Right: quality 100.

We close the section with qualitative results, including visualizations of the results where appropriate.

Figure B.19: MobileNetV2, Ground Truth: "Pembroke, Pembroke Welsh corgi". Predictions: JPEG Q=10: "Norwich terrier"; off-the-shelf artifact correction: "basenji"; fine-tuned: "Pembroke, Pembroke Welsh corgi"; task-targeted artifact correction: "Pembroke, Pembroke Welsh corgi"; original: "Pembroke, Pembroke Welsh corgi".
Figure B.20: FasterRCNN (JPEG Q=10, off-the-shelf artifact correction, task-targeted artifact correction, supervised fine-tuning, original, and ground truth)
Figure B.21: MaskRCNN
Figure B.22: HRNetV2 + C1 (JPEG Q=10 predictions, fine-tuning prediction, and ground truth)

B.3 Metabit

The results in this section are from the method of Chapter 13 (Metabit: Leveraging Bitstream Metadata). These are purely qualitative results, but they do highlight specific successes and failures of the method.

Figure B.23: Dark Region. Crop from 2560x1600 "People on Street". The dark region is poorly preserved by compression. Our GAN restoration struggles to cope with the massive information loss in this region.

Figure B.24: Crowd. Crop from 2560x1600 "People on Street". The image shows an extremely dense crowd. Despite the chaotic nature, our GAN is able to produce a good restoration, although there is detail missing.

Figure B.25: Texture Restoration. Crop from 1920x1080 "Cactus". The texture on the background is destroyed by compression.
Our GAN reconstructs a reasonable approximation to the true texture.

Figure B.26: Compression Artifacts Mistaken for Texture. Crop from 1920x1080 "Cactus". The compressed image exhibits strong chroma subsampling artifacts (lower right corner). These are mistaken by the GAN as a texture and restored as such.

Figure B.27: Motion Blur. Crop from 1920x1080 "Cactus". The tiger exhibits high motion, which presents itself in the target frame as motion blur. This blur is destroyed by compression and is not able to be restored by the GAN loss. The GAN loss is also "rewarded" for sharp edges, which would make reconstructing blurry objects difficult. As an aside, note the additional detail on the background objects in the GAN image when compared to the compressed image.

Figure B.28: Artificial. Crop from 1920x1080 "Big Buck Bunny". This artificial scene is restored accurately despite a lack of artificial training data. Note the grass and tree textures, sharp edges, removal of blocking on the flower, and preservation of the smooth sky region.

Appendix C: Survey of Fully Deep-Learning Based Compression

Although fully deep-learning based compression methods are generally considered out of the scope of this dissertation, there is general interest in these technologies and they are certainly related to the work presented in the body of the document. Therefore, in this appendix, we conduct a brief survey of the major points of image and video compression that depend entirely on deep learning to produce the encodings.

While deep learning based compression shows extreme promise, it is still a very academic problem. Models currently require expensive hardware to train and to compress new media in a timely manner. This also leads to high memory usage. In general, important compression concepts like rate control are still largely missing. In terms of objective performance, the most recent methods at the time of writing are on par with classical compression on some benchmarks. This is not always easy to evaluate, however, as methods depending on generation, like GANs [39], often do not produce meaningful rate-distortion curves in the traditional sense.

In a rare personal opinion, based on my observation of the state of the art, I believe that machine learning, wherever it may end up, is the future of compression. Within the decade (i.e., before 2030) we will begin to see machine learning techniques used in consumer applications. In contrast, the techniques presented in the body of the dissertation will likely be seen in consumer applications in the next one or two years. There are currently a number of companies competing for deep-learning based compression market share, e.g., Google and WaveOne. While these companies are delivering important research contributions, it is unlikely that their proprietary solutions will win out in the long term given the compression community's reliance on standardization. Although Google was able to gain traction in classical compression with its VP codecs, even these were eventually standardized into the Alliance for Open Media (AOM) and development continued with the AV codecs. Notable standardization efforts include JPEG-AI and MPEG-AI, which are much more likely to see success, meaning that any new players in this field would do well to work with the standards bodies.

C.1 Image Compression

We start with image compression.
The goal of these models is to train a CNN to encode pixels into a feature vector, with another CNN trained simultaneously to decode the feature vectors back to an image, essentially a fancy autoencoder. The feature vectors are quantized and losslessly compressed before "transmission", or in this case, before evaluating their size in bytes. The networks are trained to minimize both the size of the feature vectors when stored on disk and the error of the reconstruction. There are three obvious problems here which drive the works we will consider in this section:

• The size on disk is not differentiable and therefore not suitable for use in a loss function.
• Classical compression algorithms incorporate rate control to make their use more flexible. It is not trivial to incorporate such a side parameter into a CNN.
• Minimizing the error term does not necessarily produce a visually pleasing result.

Likely the first modern work in image compression with deep learning was the work by [11] for thumbnails, with a follow-up for full resolution images [13]. Toderici's work is based on recurrent networks, specifically Long Short-Term Memory networks (LSTMs). The output of the LSTM at time t is subtracted from the input, and this residual is used as the input to the LSTM at time t + 1; the process starts with the input patch and generates a fixed-length code for a given bitrate setting. The network is only trained to minimize the l2 error. Considering how early these architectures were developed, they have some nice properties, including reasonable results compared to JPEG and a rudimentary attempt at variable rate encoding.

Next, [14] proposed compressive autoencoders to generate a compressed representation. The idea is to produce a deep encoding of the input image which is then quantized for transmission and decoded by another deep network. The objective can be written as

-\log(Q(f(x))) + \beta\, d(x, g(f(x)))    (C.1)

where Q() is the quantization function, f() is the encoder, g() is the decoder, d() is a measure of distortion (i.e., error), and \beta balances the two terms. The left term here is measuring the size of the representation (number of bits) and the right term is measuring the error. Of course, this objective cannot be minimized directly since Q() is not differentiable. To get around this they define a differentiable approximation to the rounding step. By adjusting this approximation, they are able to produce a much more accurate variable rate encoder, although the empirical results show that training for a single rate naturally works better.

Also in 2017, [162] developed "soft-to-hard vector quantization". They start with the same problem as [14], that the quantization step is not differentiable. They solve the problem by using a soft assignment of the features to symbols, i.e., instead of a hard rounding they compute

\phi(z) = \mathrm{softmax}\left(-\sigma \left( \| z - c_1 \|^2, \ldots, \| z - c_n \|^2 \right)\right)    (C.2)

for n symbols c. This equation is fully differentiable. However, this alone would be a poor approximation, so during training the "hardness" \sigma is "annealed" from some initial condition to infinity, which produces a more and more accurate approximation of the hard assignment used at test time. This allows the network to quickly converge on the easier soft solution in early training while increasing the problem difficulty to match the real scenario in late training. [12] propose yet another solution to this problem.
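Before turning to that alternative, here is a small PyTorch sketch of the soft-to-hard assignment in Equation C.2, with the annealing of the hardness σ left to the training loop. The use of squared Euclidean distances and a softmax follows the equation; the names, shapes, and the expectation-based surrogate quantizer are illustrative assumptions.

```python
import torch

def soft_assignment(z, centers, sigma):
    # z: (N, D) feature vectors, centers: (n, D) quantization symbols c_1..c_n.
    # Returns soft assignment probabilities (N, n); as sigma grows this
    # approaches the hard nearest-center assignment used at test time.
    d2 = torch.cdist(z, centers) ** 2          # squared distances ||z - c_i||^2
    return torch.softmax(-sigma * d2, dim=-1)

def soft_quantize(z, centers, sigma):
    # Differentiable surrogate for quantization: expectation over the centers.
    return soft_assignment(z, centers, sigma) @ centers
```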
Their solution is motivated by Shannon's information theory and, although somewhat questionable in the motivation, has become a staple technique for approximating quantization. Ballé et al. observe that the discrete quantization process essentially introduces noise into the signal which is output by the deep encoder. Of course, the entropy of a noisy channel is something that Shannon studied quite extensively [26]. Therefore, the solution is to simply add Gaussian noise to the signal, which is a simple and differentiable process. Of course, the issue with this is that Gaussian noise is very different in appearance from quantization noise, and CNNs are very sensitive to the actual appearance even if the entropy analysis is the same (entropy is essentially giving an aggregate view of the information loss). Nevertheless, the method does work well.

[163] specifically focus on designing a method for variable rate encoding. Although this was a feature of prior works, their primary focus was on overcoming the non-differentiable quantization. Mentzer et al. use both the soft-to-hard technique [162] and the compressive autoencoders technique [14] to deal with quantization. To model the rate term in the loss, Mentzer et al. treat the feature vectors as a conditional distribution, i.e.,

P(z) = \prod_{i=1}^{N} P(z_i \mid z_{i-1}, \ldots, z_1)    (C.3)

in raster order. So each feature vector is considered to have its own probability which is conditioned on all previous features. They then model P(z) and the conditional distributions using another deep network (which is differentiable). Specifically, they use a 3D convolution since this is efficient and respects the "causality constraint." In other words, the previous feature vectors cause the current feature vector since they are conditional distributions. This formulation for P(z) allows them to compute an approximation of the entropy, and therefore the rate, which they use as a loss term.
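The following is a sketch of how such a conditional entropy model can be turned into a differentiable rate estimate. For simplicity it uses a masked 1D convolution over the flattened, discretized features rather than the 3D convolution of [163]; the alphabet size, context length, and single layer are assumptions.

```python
import math
import torch
import torch.nn as nn

class CausalEntropyModel(nn.Module):
    # Predicts P(z_i | z_{i-1}, ..., z_1) with a convolution that only sees
    # strictly preceding symbols (Equation C.3).
    def __init__(self, num_symbols=256, context=5):
        super().__init__()
        self.conv = nn.Conv1d(1, num_symbols, kernel_size=context, padding=context)

    def forward(self, z):
        # z: (N, L) integer symbols; output logits: (N, num_symbols, L).
        logits = self.conv(z.float().unsqueeze(1))
        return logits[:, :, : z.shape[1]]      # keep only the causal positions

def rate_in_bits(model, z):
    # Differentiable rate estimate: sum over i of -log2 P(z_i | z_{<i}).
    logp = torch.log_softmax(model(z), dim=1)
    picked = logp.gather(1, z.unsqueeze(1)).squeeze(1)
    return -picked.sum(dim=1) / math.log(2.0)
```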
After conducting an extensive user study, they found that no metric was adequate for matching the human?s responses. This is not at all surprising. 276 Although we end in 2020 image compression continues to be an active area of research, although it remains to be seen which works of 2021 will emerge as the most influential. In the interest of space, we conclude the discussion of image compression here. Although the advance of [165] was extremely promising, there is still no deep learning algorithm that is suitable for deployment in a consumer application. This is partly an efficiency concern but it is also a flexibility concern. JPEG was extremely well thought out to work for the widest range of situations which is part of the reason it has persisted for 30 years. Deep learning methods are only just scratching the surface of this kind of long term thinking. C.2 Video Compression We now turn to video compression. Similar to the previous section, the goal will be to train encoder and decoder CNNs with some kind of quantization of the encoded feature vectors. Unlike the last section, however, we now have a temporal component in everything we do. In addition to the challenges of image compression, dealing with the temporal component is a problem by itself. Some methods will treat the time component as independent, essentially image compression with different features over time. Some will attempt to incorporate the temporal component into the prediction itself either in a recurrent or motion based solution. Still others will use an implicit representation, essentially over-fitting a network for each video. We start with the method of [154]. The key insight is that ?keyframes? can be defined which are then encoded using off-the-shelf image encoders. Then, the 277 intermediate frames are produced using image interpolation networks that take the two keyframes as input and produce the intermediate frames. Naturally, the longer the interval between the keyframes the higher the error of the predictions. While the method certainly works, it does not clearly outperform H.264 in the same way that image compression algorithms were clearly outperforming JPEG. This was still a major advancement, however, since prior to this no one had tried to produce a video codec using deep learning. By focusing on keyframe compression and interpolation, the method is efficient which was a major concern with video compression. Next, DVC [155] proposes an end-to-end technique which encodes motion and residual information for predicted frames. This is intentionally designed to mimic the classical compression loop which stores intra-frames and then predicts intermediate frames using motion warping and low-entropy error residuals. In this case, each component is modeled separately with a CNN. The method has several moving parts Motion Estimation which uses a task-specific optical flow network to produce per- pixel motion. This motion is then encoded using another CNN for compression and quantized. The decoder performs the inverse process to produce the flows Motion Compensation Also uses a deep network. First the decoded optical flow is used to warp the reference image, then the reference image, warped images, and optical flow are all used as input to another deep network to predict the true frame. Transform The residual between the predicted and true frame is taken and encoded 278 using yet another CNN to produce the quantized encoding. This is similar to image compression techniques. 
Overall the method is fairly complex and heavy consisting of several convolutional networks. While all of this does pay off in terms of the overall result compared with [154], the actual codec itself struggles to match H.265. Furthermore, per-pixel flow is likely wasteful, at least we know that classical video codecs do not make use of dense motion information. That being said the end-to-end nature and the idea of replacing each part of a traditional video encoder with a CNN are major advances to the state-of-the-art. In 2020, we finally had a technique capable of outperforming H.265. The method of [156] proposed a simple but effective technique. Use a standard deep learning based image compression algorithm to generate initial codes for each frame. Then perform internal learning to generate an ?optimal? code for that frame and use a conditional entropy model to produce a final code for the current frame that is conditioned on the previous frame. Note that the internal learning method is actually learning a small CNN just for that particular frame, so the encoding time is increased but the decoding is still fast since the decoder only needs to perform inference on the resulting network to obtain the code. The conditional entropy model also helps the encoder reuse information from prior frames to reduce the final code length. The contribution is very straightforward. Aside from these two ideas there are no special formulations (this is a good thing). The result is impressive, with results that consistently outperform the classical codecs for higher bitrates. 279 The method does struggle at low bitrates, however. Continuing with internal learning is NeRV [157]. This method is entirely an in- ternal learning technique, which means that for each video, the compression process is to train a neural network which predicts only that video (over-fitting it) and then the neural network weights are compressed using a model compression technique like pruning. To decode, the transmitted model weights are used in an inferencing pass to retrieve the frames. In particular, NeRV proposes a frame-based implicit representation vs the pixel based approach in something like SIREN [166]. What this means practically is the NeRV takes a time t as input and produces the frame of the video at time t instead of taking the triple x, y, t and producing the pixel at position x, y at time t. Not only is the NeRV formulation simpler for the network to learn (leading to better results) it is also significantly faster, requiring T forward passes to produce a video of length T instead of H ?W ? T forward passes. While the overall idea here is interesting the results do leave something to be desired, as NeRV struggles to outperform even H.264. Furthermore, although the network is small making decoding time fast, encoding (i.e., training the network) is on the order of hours. We close the section with ELF-VC [167], a method which is groundbreaking both in its results and in its methodical design. The approach is fast, provides a well motivated method for I- and P-frame encoding with deep networks, supports variable rate encoding, and compares well to other classical and deep learning codecs. For the I-frame model, standard image deep learning compression is used. For P-frames, the method is more interesting. Motion is predicted using a flow model as in [155] and 280 the residual and flows are both stored. 
We close the section with ELF-VC [167], a method which is groundbreaking both in its results and in its methodical design. The approach is fast, provides a well-motivated method for I- and P-frame encoding with deep networks, supports variable rate encoding, and compares well to other classical and deep learning codecs. For the I-frame model, standard deep learning image compression is used. For P-frames, the method is more interesting. Motion is predicted using a flow model as in [155], and both the residual and the flows are stored. The decoder uses a prior frame as an initial estimate of the warped frame before incorporating the flow vectors and residual. Variable rate encoding is achieved using a level map where the rate-distortion curve is discretized into levels. The level is tiled spatially and used as input to the encoder and decoder, with the loss encouraging the network to hit the specified bitrate target. This provides a simple way to tune the bitrate. In terms of results, ELF-VC largely outperforms other work on all benchmarks, with the exception of AV1, the latest in classical compression.

Although ELF-VC hits on a number of ideas that would be required for a commercial video codec, that goal is still very far off. With optimizations, ELF-VC can decode a 1080p video at 18 fps, which is not fast enough, and it requires a GPU with a large amount of memory. As new methods are developed which are more efficient (this needs to be a continuing focus, however) and hardware speed increases, the likelihood of deep learning compression finding its way to consumer applications increases. These reasons contribute to the 10-year estimate.

C.3 Lossless Techniques

The previous two sections dealt exclusively with lossy compression. We expect that the networks will remove information from images during encoding, and even if they do not, the quantization process will. But we can also use machine learning for lossless compression. This takes a number of forms which we discuss in this section. Importantly, this is a fairly interesting use case and could potentially see practical application sooner than the lossy methods, although it would be in niche scenarios. For example, these techniques have uses in lossless transcoding of classically compressed images. This particular application is important because it would allow large datacenters, which have the resources to run deep learning models at scale, to save on storage costs by transcoding images and videos to a smaller deep-learning-based file. The media are then transcoded back to their consumer format before being transmitted, so that the consumer does not need special hardware or software to view the media. When we discussed entropy coding in Chapter 4 (Entropy and Information), we noted that entropy coders work by assigning shorter codes to probable symbols and longer codes to improbable symbols. In order to work well, the entropy coder needs an accurate probability distribution, which is difficult to come up with, particularly for image data. The techniques in this section are primarily focused on learning such distributions.

PixelRNNs [168], [169] are generative models that predict each pixel in an image as a discrete conditional distribution. This has an advantage over other generative methods like GANs [39] because the model predicts the distribution explicitly instead of simply producing samples from the distribution. In standard fashion, each pixel is treated as a distribution conditioned on all previous pixels

    p(z) = \prod_{i=1}^{N} p(z_i \mid z_{i-1}, \dots, z_1)    (C.4)

Each pixel can then be generated by sampling from the learned distribution pixel by pixel. So how is this relevant to compression? With the likelihood of each pixel, we can use these distributions to produce probabilities for entropy coders [27], [46].
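To make that connection concrete, the sketch below pairs a toy stand-in for an autoregressive pixel model with the ideal (Shannon) code length that an arithmetic coder driven by its probabilities would approach. The model, its smoothing heuristic, and the example pixel values are all hypothetical; a real system would substitute a trained PixelRNN/PixelCNN.

```python
# Sketch of how an autoregressive model's per-pixel probabilities drive an entropy coder.
# The "model" is a hand-written placeholder returning a categorical distribution over
# 256 intensities given the previous pixels; a real PixelRNN/PixelCNN would replace it.
import numpy as np

def toy_autoregressive_model(previous_pixels):
    """Return a placeholder p(next pixel | previous pixels) over 256 values."""
    probs = np.full(256, 1.0 / 256)
    if len(previous_pixels) > 0:
        # Bias mass toward the last value to mimic spatial smoothness.
        probs *= 0.5
        probs[previous_pixels[-1]] += 0.5
    return probs / probs.sum()

def ideal_code_length_bits(pixels):
    """Shannon-optimal length (in bits) that an arithmetic coder would approach."""
    total = 0.0
    for i, value in enumerate(pixels):
        p = toy_autoregressive_model(pixels[:i])[value]
        total += -np.log2(p)
    return total

flat_image = [128, 128, 129, 127, 128, 200, 128, 128]
print(f"{ideal_code_length_bits(flat_image):.1f} bits "
      f"vs {8 * len(flat_image)} bits uncompressed")
```

Because the repeated value 128 is assigned high probability, its per-pixel cost drops to roughly one bit, which is exactly the mechanism a learned distribution exploits.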
Integer discrete flows (IDFs) [170], [171] are similar in spirit. The idea again is to learn an explicit distribution of the image data and produce a latent code from the distribution, with the advantage of much faster sampling. IDFs in particular are designed to overcome an explicit problem with flows in general: they assume a continuous random variable. Images are discrete random variables, so quantizing the resulting model to fit the discrete distribution may introduce loss. By formulating an integer discrete flow, the authors can provably reproduce the given input exactly from a code. The flow itself is based on the change of variables formula

    P_X(x) = P_Z(f(x)) \left| \frac{\partial z}{\partial x} \right|    (C.5)

where z = f(x). The flow is then reformulated in integer form where the Jacobian determinant is one. The method was extended more recently in iVPF [172], which uses volume-preserving flows instead of integer discrete flows (they are quite similar in operation, however). In either case, the learned flow can then be used directly as a probability distribution for entropy coding.

While bits-back encoding [173]–[176] has been around for some time, the Bits-Back ANS [177] method was the first algorithm to use neural networks for the learning component and to be shown efficient on large datasets. Without going into too much detail, the idea of bits-back encoding is to assume that the given symbol s has some latent variable y associated with it and that we have a way of measuring p(y), p(s|y), and p(y|s). Bits-back encoding allows us to leverage this knowledge of the latent distribution to store s with fewer bits. Bits-Back ANS uses a variational autoencoder (VAE) for the latent model. Bit-Swap [178] and HiLLoC [179] extend this with hierarchical latent variables, and LBB [180] merges flows with bits-back encoding.

We close with a very different approach, [181]. This method actually leverages lossy compression in order to improve the lossless compression rate. The idea is to start with BPG [54] and use a network to predict an optimal quantization parameter controlling how aggressively BPG should behave. BPG of course loses information, so the residual between the true image and the compressed image is taken, and another network predicts the probability of the residual given the input image. The residual is then encoded using an entropy coder with the learned distribution. Since the encoded residual is stored with the BPG-compressed image, there is no information loss.

Overall, this field is full of interesting and practical ideas. Although somewhat niche in their application, these are highly developed techniques that could already be useful in engineering applications. However, these ideas are by definition not suitable for consumers, as they are really one part of a more complex whole and their performance cannot match lossy algorithms. Their use is more suited to specialized applications in medical imaging, datacenters, or remote sensing where loss of data may not be acceptable.

Glossary

B

basis: A set of vectors which is linearly independent and spans a vector space. 8–11, 33
Bayesian decision theory: A method for making optimal decisions given perfect probability distributions describing possible events. 61

C

chroma subsampling: The process for storing chrominance channels at a smaller resolution, since human vision is less sensitive to changes in color information. 86, 87, 91
chrominance: The color or hue of light captured by a sensor. 86
communication system: A system for conveying some message from one party to another. Consists of an information source, a transmitter, a signal, a source of noise corrupting the signal, a receiver, and a destination. 52
compression: Any operation which reduces the size in bits of a computational object. iv, 289
convolution: The correlation of a signal and a kernel as the kernel is shifted across the signal. 27, 44, 45, 286
convolutional filter manifold: A modification of the filter manifold which uses a spatial input. 145
cross-correlation: See convolution (although they are technically different). 27

D

decision boundary: The manifold in space separating classification decisions. 65, 69
deep learning: A machine learning technique that learns many layers of features jointly with a task objective. iv, v, 74, 98, 129
dissertation: A particular type of write-only document. vi, 20, 60

E

entropy: The amount of information in a message, the amount of randomness in a system, the minimum number of bits required to encode a message. 50, 84
error residual: The difference between a motion compensated frame and the true frame. 187
evidence: The probability of an observation. 63

F

feature: An abstract or higher order representation of an image or a part of an image that is more suitable for input to a machine learning algorithm. 69
filter manifold: A method for learning adaptable convolution kernels from a scalar input. 144
first principles: The underlying engineering decisions which motivate an algorithm. v, 227
Fourier transform: An integral transform defining an orthogonal basis for functions. 35, 39–41, 43–45

G

Gabor transform: A special case of the STFT which uses a Gaussian filter to window the transform. 40–42
gradient: For a scalar-valued function of a vector, the vector of partial derivatives of the scalar output with respect to each component of the input. 68

H

Hadamard transform: An approximation of the discrete cosine transform consisting of only 1s and -1s. 39, 192
Huffman coding: A method for producing optimal length codes for single symbols given the probabilities that each symbol will occur. 54, 55

I

image: A discrete 2D signal giving a sample value at integer positions (x, y); the sample may be a scalar (grayscale) or a vector (color). 22, 28, 31
image-to-image: The machine learning problem which takes an image as input and produces an image as output. 78
interlaced: A method for storing color images which stores color information in sequence, e.g., each pixel could consist of 24 bits with 8 bits for red, green, and blue. 86

J

Jacobian: For a tensor-valued function of a tensor, the tensor of partial derivatives of each component of the output with respect to each component of the input. 68, 76
JPEG: The Joint Photographic Experts Group, often referring to an image file or compression algorithm. iii, v, 37, 50, 84, 96, 98, 100, 101, 117, 129, 140, 141, 146, 182

L

likelihood: The probability of an observation given that some event occurs. 62
linear combination: A series of scalar multiplications and vector additions. 4, 6, 13, 36
linear map: A mapping which preserves scalar multiplication and vector addition. 15, 27, 29, 37
linearly separable: A decision which is able to be made using only the relative position with respect to a hyperplane. 66
lossless compression: A compression operation which preserves all information in the original signal. 54
lossy compression: A compression operation which removes information from the signal to save space. 50, 84
luminance: The brightness ("quantity" of light) captured by a sensor. 86, 89, 141

M

macroblocks: Pixel blocks in a frame which are larger than the transform block size. 189
metric tensor: A tensor which relates a vector space and a co-vector space. 20
motion compensation: The process by which a video codec warps frames using estimated motion. 186
motion estimation: The process by which a video encoder measures block motion. 187
Motion JPEG: A video codec which stores each frame as a JPEG. 183
motion vectors: Vectors specifying the motion of video blocks. 185
MPEG: The Motion Picture Experts Group, often referring to a compression algorithm. v, 50
multilinear map: A mapping which is linear in each of its arguments separately. 15–17
multiresolution analysis: See wavelet transform. 43

N

Nash equilibrium: The state of a game where no player can obtain an advantage over any other player. 80
Nyquist Sampling Theorem: A signal with a maximum frequency ω_m can be represented exactly by discrete samples with a sampling rate of at least 2ω_m. 45

P

planar: A method for storing color images which stores color information separately, e.g., the image may consist of all the red pixels followed by all the blue pixels, etc. 86
posterior probability: The probability of an event occurring given an observation. 62
prior probability: The probability of an event occurring in the absence of any other information. 61

R

rate control: Any method for tuning the bitrate of an image or video. 190–192

S

semantic segmentation: The machine learning problem which takes an image as input and produces a classification label for each pixel. 78
slices: Regions of a video frame consisting of a whole number of macroblocks. 189
SVM: Support Vector Machine, a linear model which separates examples using the maximum margin hyperplane. 70

T

transform domain: A catch-all term for DCT coefficients, quantized JPEG data, or any other transformation of pixel data. 100, 110, 117

V

video: A discrete 3D signal giving a sample value at integer positions (x, y, t); the sample may be a scalar (grayscale) or a vector (color). 31

W

wavelet: A wave-like function with finite support. 41, 42, 46
wavelet transform: An integral transform using a set of wavelets as the basis; allows for multi-resolution analysis and localization of frequencies in time. 42, 45, 49, 289

Figure Credits

Unless listed here, figures are either generated by the author or in the public domain. The original authors of these works do not endorse any changes made for this document.

3.2 on page 42: Wikipedia. User JonMcloone. https://commons.wikimedia.org/wiki/File:MorletWaveletMathematica.svg. CC-BY-SA 3.0. Removed axes.
3.3 on page 43: Wikipedia. User JonMcloone. https://commons.wikimedia.org/wiki/File:MorletWaveletMathematica.svg. CC-BY-SA 3.0. Removed axes, added scaled version to show hierarchy.
3.5 on page 47: Wikipedia. User Omegatron. https://commons.wikimedia.org/wiki/File:Haar_wavelet.svg. CC-BY-SA 3.0. Removed axes, added scaled version to show hierarchy.
4.1 on page 52: [26]
5.2 on page 67: Wikipedia. User Glosser.ca. https://commons.wikimedia.org/wiki/File:Colored_neural_network.svg. CC-BY-SA 3.0.
5.3 on page 70: [31]
5.4 on page 72: [32]
5.5 on page 73: [33]
5.6 on page 77: [36]
5.7 on page 78: [9]
11.3 on page 186: Big Buck Bunny [130]. https://peach.blender.org/. CC-BY-SA 3.0. Motion vector arrows added to frame.

Bibliography

[1] M. Duggan, "Photo and video sharing grow online," Pew research internet project, 2013.
[2] Verizon Inc., 4g lte speeds vs. your home network. [Online]. Available: https://www.verizon.com/articles/4g-lte-speeds-vs-your-home-network/.
[3] G. K. Wallace, "The jpeg still picture compression standard,"
IEEE trans- actions on consumer electronics, vol. 38, no. 1, pp. xviii?xxxiv, 1992. [4] I. E. Richardson, The H. 264 advanced video compression standard. John Wiley & Sons, 2011. [5] N. Ahmed, T. Natarajan, and K. R. Rao, ?Discrete cosine transform,? IEEE transactions on Computers, vol. 100, no. 1, pp. 90?93, 1974. [6] D. Le Gall, ?Mpeg: A video compression standard for multimedia applica- tions,? Communications of the ACM, vol. 34, no. 4, pp. 46?58, 1991. [7] T. M. Schmit and R. M. Severson, ?Exploring the feasibility of rural broad- band cooperatives in the united states: The new new deal?? Telecommunica- tions Policy, vol. 45, no. 4, p. 102 114, 2021. 294 [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ?Imagenet classification with deep convolutional neural networks,? Advances in neural information pro- cessing systems, vol. 25, pp. 1097?1105, 2012. [9] K. He, X. Zhang, S. Ren, and J. Sun, ?Deep residual learning for image recognition,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770?778. [10] M. Tan and Q. V. Le, ?Efficientnet: Rethinking model scaling for convolu- tional neural networks,? arXiv preprint arXiv:1905.11946, 2019. [11] G. Toderici, S. M. O?Malley, S. J. Hwang, et al., ?Variable rate image com- pression with recurrent neural networks,? arXiv preprint arXiv:1511.06085, 2015. [12] J. Balle?, V. Laparra, and E. P. Simoncelli, ?End-to-end optimized image compression,? arXiv preprint arXiv:1611.01704, 2016. [13] G. Toderici, D. Vincent, N. Johnston, et al., ?Full resolution image compres- sion with recurrent neural networks,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306?5314. [14] L. Theis, W. Shi, A. Cunningham, and F. Husza?r, ?Lossy image compression with compressive autoencoders,? arXiv preprint arXiv:1703.00395, 2017. [15] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, ?Semantic per- ceptual image compression using deep convolution networks,? in 2017 Data Compression Conference (DCC), IEEE, 2017, pp. 250?259. 295 [16] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Je?gou, ?And the bit goes down: Revisiting the quantization of neural networks,? in ICLR 2020- Eighth International Conference on Learning Representations, 2020, pp. 1? 11. [17] A. Paszke, S. Gross, F. Massa, et al., ?Pytorch: An imperative style, high- performance deep learning library,? in Advances in Neural Information Pro- cessing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d?Alche?- Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., 2019, pp. 8024? 8035. [Online]. Available: http : / / papers . neurips . cc / paper / 9015 - pytorch- an- imperative- style- high- performance- deep- learning- library.pdf. [18] A. Einstein, ?Die grundlage der allgemeinen relativita?tstheorie,? in Das Rel- ativita?tsprinzip, Springer, 1923, pp. 81?124. [19] A. Jain and A. fast Karhunen, ?Loeve transform for a class of random pro- cesses,? IEEE Trans. Comm, vol. 24, pp. 1023?1029, 1976. [20] H. Kekre and J. Solanki, ?Comparative performance of various trigonometric unitary transforms for transform image coding,? International Journal of Electronics Theoretical and Experimental, vol. 44, no. 3, pp. 305?315, 1978. [21] I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury, ?The dual-tree com- plex wavelet transform,? IEEE signal processing magazine, vol. 22, no. 6, pp. 123?151, 2005. [22] I. Daubechies, Ten lectures on wavelets. SIAM, 1992. 296 [23] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. 
Zuo, ?Multi-level wavelet-cnn for image restoration,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773?782. [24] J. Bruna and S. Mallat, ?Invariant scattering convolution networks,? IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1872?1886, 2013. [25] X. Zhao, P. Huang, and X. Shu, ?Wavelet-attention cnn for image classifica- tion,? Multimedia Systems, pp. 1?10, 2022. [26] C. E. Shannon, ?A mathematical theory of communication,? The Bell system technical journal, vol. 27, no. 3, pp. 379?423, 1948. [27] D. A. Huffman, ?A method for the construction of minimum-redundancy codes,? Proceedings of the IRE, vol. 40, no. 9, pp. 1098?1101, 1952. [28] Y. LeCun, B. E. Boser, J. S. Denker, et al., ?Handwritten digit recognition with a back-propagation network,? in Advances in neural information pro- cessing systems, 1990, pp. 396?404. [29] P. E. Hart, D. G. Stork, and R. O. Duda, Pattern classification. Wiley Hobo- ken, 2000. [30] F. Rosenblatt, The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957. 297 [31] N. Dalal and B. Triggs, ?Histograms of oriented gradients for human de- tection,? in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR?05), Ieee, vol. 1, 2005, pp. 886?893. [32] D. G. Lowe, ?Object recognition from local scale-invariant features,? in Pro- ceedings of the seventh IEEE international conference on computer vision, Ieee, vol. 2, 1999, pp. 1150?1157. [33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ?Gradient-based learning applied to document recognition,? Proceedings of the IEEE, vol. 86, no. 11, pp. 2278?2324, 1998. [34] K. Simonyan and A. Zisserman, ?Very deep convolutional networks for large- scale image recognition,? arXiv preprint arXiv:1409.1556, 2014. [35] C. Szegedy, W. Liu, Y. Jia, et al., ?Going deeper with convolutions,? in Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1?9. [36] O. Ronneberger, P. Fischer, and T. Brox, ?U-net: Convolutional networks for biomedical image segmentation,? in International Conference on Medical im- age computing and computer-assisted intervention, Springer, 2015, pp. 234? 241. [37] S. Ioffe and C. Szegedy, ?Batch normalization: Accelerating deep network training by reducing internal covariate shift,? in International conference on machine learning, PMLR, 2015, pp. 448?456. 298 [38] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ?Image-to-image translation with conditional adversarial networks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125?1134. [39] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., ?Generative adversarial nets,? Advances in neural information processing systems, vol. 27, 2014. [40] Y. LeCun, ?The mnist database of handwritten digits,? http://yann. lecun. com/exdb/mnist/, 1998. [41] J. F. Nash, ?Equilibrium points in n-person games,? Proceedings of the na- tional academy of sciences, vol. 36, no. 1, pp. 48?49, 1950. [42] J. F. Nash, ?Non-cooperative games,? Annals of mathematics, pp. 286?295, 1951. [43] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ?Gans trained by a two time-scale update rule converge to a local nash equilibrium,? Advances in neural information processing systems, vol. 30, 2017. [44] Independant JPEG Group. ?Libjpeg.? (), [Online]. Available: http : / / libjpeg.sourceforge.net. 
[45] International Telecommunication Union, ?Studio encoding parameters of dig- ital television for standard 4:3 and wide-screen 16:9 aspect ratios,? Geneva, CH, Standard, Mar. 2011. [46] J. Rissanen and G. G. Langdon, ?Arithmetic coding,? IBM Journal of re- search and development, vol. 23, no. 2, pp. 149?162, 1979. 299 [47] B. C. Smith, ?Fast software processing of motion jpeg video,? in Proceed- ings of the second ACM international conference on Multimedia, ACM, 1994, pp. 77?88. [48] S.-F. Chang, ?Video compositing in the dct domain,? in IEEE Workshop on Visual Signal Processing and Communications, Raleigh, NC, Sep. 1992, 1992. [49] B. Shen and I. K. Sethi, ?Inner-block operations on compressed images,? in Proceedings of the third ACM international conference on Multimedia, ACM, 1995, pp. 489?498. [50] B. K. Natarajan and B. Vasudev, ?A fast approximate algorithm for scaling down digital images in the dct domain,? in Image Processing, 1995. Proceed- ings., International Conference on, IEEE, vol. 2, 1995, pp. 241?243. [51] B. C. Smith and L. A. Rowe, ?Algorithms for manipulating compressed im- ages,? IEEE Computer Graphics and Applications, vol. 13, no. 5, pp. 34?42, 1993. [52] T. Boutell, PNG (Portable Network Graphics) Specification Version 1.0, RFC 2083, Mar. 1997. doi: 10.17487/RFC2083. [Online]. Available: https://www. rfc-editor.org/info/rfc2083. [53] CompuServe Inc, ?Graphics Interchange Format,? Standard, Mar. 1987. [54] F. Bellard. ?Better portable graphics.? (2018), [Online]. Available: https: //bellard.org/bpg/. 300 [55] MPEG, ?Requirements for still image coding using HEVC,? Vienna, AT, Standard, 2013. [56] A. Skodras, C. Christopoulos, and T. Ebrahimi, ?The jpeg 2000 still im- age compression standard,? IEEE Signal processing magazine, vol. 18, no. 5, pp. 36?58, 2001. [57] C. S. Swartz, Understanding digital cinema: a professional handbook. Rout- ledge, 2004. [58] M. Ehrlich and L. Davis, ?Deep residual learning in the jpeg transform do- main,? in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2019, pp. 3484?3493. [59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ?Imagenet: A large-scale hierarchical image database,? in 2009 IEEE conference on com- puter vision and pattern recognition, Ieee, 2009, pp. 248?255. [60] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski, ?Faster neural networks straight from jpeg,? Advances in Neural Information Processing Systems, vol. 31, 2018. [61] S.-Y. Lo and H.-M. Hang, ?Exploring semantic segmentation on the dct representation,? in Proceedings of the ACM Multimedia Asia, 2019, pp. 1?6. [62] B. Deguerre, C. Chatelain, and G. Gasso, ?Fast object detection in com- pressed jpeg images,? in 2019 ieee intelligent transportation systems confer- ence (itsc), IEEE, 2019, pp. 333?338. 301 [63] G. Daniel, J. Gray, et al., ?Opt einsum-a python package for optimizing con- traction order for einsum-like expressions,? Journal of Open Source Software, vol. 3, no. 26, p. 753, 2018. [64] S. Chetlur, C. Woolley, P. Vandermersch, et al., ?Cudnn: Efficient primitives for deep learning,? arXiv preprint arXiv:1410.0759, 2014. [65] V. Nair and G. E. Hinton, ?Rectified linear units improve restricted boltz- mann machines,? in Icml, 2010. [66] K. Fukushima and S. Miyake, ?Neocognitron: A self-organizing neural net- work model for a mechanism of visual pattern recognition,? in Competition and cooperation in neural nets, Springer, 1982, pp. 267?285. [67] A. Krizhevsky, G. 
Hinton, et al., ?Learning multiple layers of features from tiny images,? 2009. [68] A. Foi, V. Katkovnik, and K. Egiazarian, ?Pointwise shape-adaptive dct for high-quality deblocking of compressed color images ?,? in Proc. 14th Eur. Signal Process. Conf., EUSIPCO 2006. [69] S. Yang, S. Kittitornkun, Y.-H. Hu, T. Q. Nguyen, and D. L. Tull, ?Blocking artifact free inverse discrete cosine transform,? in Proceedings 2000 Interna- tional Conference on Image Processing (Cat. No. 00CH37101), IEEE, vol. 3, 2000, pp. 869?872. [70] T. D. Tran, R. De Queiroz, and T. Q. Nguyen, ?The generalized lapped biorthogonal transform,? in Proceedings of the 1998 IEEE International Con- 302 ference on Acoustics, Speech and Signal Processing, ICASSP?98 (Cat. No. 98CH36181), IEEE, vol. 3, 1998, pp. 1441?1444. [71] C. Dong, Y. Deng, C. Change Loy, and X. Tang, ?Compression artifacts reduction by a deep convolutional network,? in Proceedings of the IEEE In- ternational Conference on Computer Vision, 2015, pp. 576?584. [72] K. Yu, C. Dong, C. C. Loy, and X. Tang, ?Deep convolution networks for compression artifacts reduction,? arXiv preprint arXiv:1608.02778, 2016. [73] C. Dong, C. C. Loy, K. He, and X. Tang, ?Learning a deep convolutional network for image super-resolution,? in European conference on computer vision, Springer, 2014, pp. 184?199. [74] P. Svoboda, M. Hradis, D. Barina, and P. Zemcik, ?Compression ar- tifacts removal using convolutional neural networks,? arXiv preprint arXiv:1605.00366, 2016. [75] L. Cavigelli, P. Hager, and L. Benini, ?Cas-cnn: A deep convolutional neural network for image compression artifact suppression,? in 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 752?759. [76] H. Chen, X. He, L. Qing, S. Xiong, and T. Q. Nguyen, ?Dpw-sdnet: Dual pixel-wavelet domain deep cnns for soft decoding of jpeg-compressed images,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 711?720. 303 [77] B. Zheng, R. Sun, X. Tian, and Y. Chen, ?S-net: A scalable convolutional neural network for jpeg compression artifact reduction,? Journal of Electronic Imaging, vol. 27, no. 4, p. 043 037, 2018. [78] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, ?Deep generative adversarial compression artifact removal,? in Proceedings of the IEEE Inter- national Conference on Computer Vision, 2017, pp. 4826?4835. [79] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, ?Deep universal generative adversarial compression artifact removal,? IEEE Transactions on Multimedia, 2019. [80] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, ?Residual dense network for image restoration,? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [81] X. Wang, K. Yu, S. Wu, et al., ?Esrgan: Enhanced super-resolution gener- ative adversarial networks,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0?0. [82] X. Liu, X. Wu, J. Zhou, and D. Zhao, ?Data-driven sparsity-based restoration of jpeg-compressed images in dual transform-pixel domain,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5171?5178. [83] J. Guo and H. Chao, ?Building dual-domain representations for compression artifacts reduction,? in European Conference on Computer Vision, Springer, 2016, pp. 628?644. 304 [84] X. Zhang, W. Yang, Y. Hu, and J. Liu, ?Dmcnn: Dual-domain multi-scale convolutional neural network for compression artifacts removal,? 
in 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 390?394. [85] B. Zheng, Y. Chen, X. Tian, F. Zhou, and X. Liu, ?Implicit dual-domain convolutional network for robust color image compression artifact reduction,? IEEE Transactions on Circuits and Systems for Video Technology, 2019. [86] Z. Jin, M. Z. Iqbal, W. Zou, X. Li, and E. Steinbach, ?Dual-stream multi-path recursive residual network for jpeg image compression artifacts reduction,? IEEE Transactions on Circuits and Systems for Video Technology, 2020. [87] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, ?D3: Deep dual-domain based fast restoration of jpeg-compressed images,? in Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764?2772. [88] X. Fu, Z.-J. Zha, F. Wu, X. Ding, and J. Paisley, ?Jpeg artifacts reduction via deep convolutional sparse coding,? in Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision, 2019, pp. 2501?2510. [89] Y. Kim, J. W. Soh, J. Park, et al., ?A pseudo-blind convolutional neural network for the reduction of compression artifacts,? IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1121?1135, 2019. 305 [90] S. Zini, S. Bianco, and R. Schettini, ?Deep residual autoencoder for quality independent jpeg restoration,? arXiv preprint arXiv:1903.06117, 2019. [91] Y. Kim, J. W. Soh, and N. I. Cho, ?Agarnet: Adaptively gated jpeg compres- sion artifacts removal network for a wide range quality factor,? IEEE Access, vol. 8, pp. 20 160?20 170, 2020. [92] J. Jiang, K. Zhang, and R. Timofte, ?Towards flexible blind jpeg artifacts re- moval,? in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2021, pp. 4997?5006. [93] M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, ?Quantization guided jpeg artifact correction,? in European Conference on Computer Vision, Springer, 2020, pp. 293?309. [94] D. Kang, D. Dhar, and A. B. Chan, ?Crowd counting by adapt- ing convolutional neural networks with side information,? arXiv preprint arXiv:1611.06748, 2016. [95] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Car- roll, ?Burst denoising with kernel prediction networks,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2502?2510. [96] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., ?Rectifier nonlinearities improve neural network acoustic models,? in Proc. icml, Citeseer, vol. 30, 2013, p. 3. 306 [97] K. He, X. Zhang, S. Ren, and J. Sun, ?Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,? in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026?1034. [98] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., Gradient flow in recurrent nets: The difficulty of learning long-term dependencies, 2001. [99] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ?Image quality assessment: From error visibility to structural similarity,? IEEE transactions on image processing, vol. 13, no. 4, pp. 600?612, 2004. [100] A. Jolicoeur-Martineau, ?The relativistic discriminator: A key element miss- ing from standard gan,? in International Conference on Learning Represen- tations, 2018. [101] A. Radford, L. Metz, and S. Chintala, ?Unsupervised representation learn- ing with deep convolutional generative adversarial networks,? arXiv preprint arXiv:1511.06434, 2015. [102] T. Miyato, T. Kataoka, M. Koyama, and Y. 
Yoshida, ?Spectral normalization for generative adversarial networks,? arXiv preprint arXiv:1802.05957, 2018. [103] J. Johnson, A. Alahi, and L. Fei-Fei, ?Perceptual losses for real-time style transfer and super-resolution,? in European conference on computer vision, Springer, 2016, pp. 694?711. [104] S. Bell, P. Upchurch, N. Snavely, and K. Bala, ?Material recognition in the wild with the materials in context database,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3479?3487. 307 [105] D. P. Kingma and J. Ba, ?Adam: A method for stochastic optimization,? arXiv preprint arXiv:1412.6980, 2014. [106] I. Loshchilov and F. Hutter, ?Sgdr: Stochastic gradient descent with warm restarts,? arXiv preprint arXiv:1608.03983, 2016. [107] E. Agustsson and R. Timofte, ?Ntire 2017 challenge on single image super- resolution: Dataset and study,? in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 126?135. [108] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, ?A statistical evaluation of recent full reference image quality assessment algorithms,? IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440?3451, 2006. [109] Rawzor. ?Image compression benchmark.? (), [Online]. Available: http:// imagecompression.info/. [110] G. E. Hinton and S. Roweis, ?Stochastic neighbor embedding,? Advances in neural information processing systems, vol. 15, 2002. [111] M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, ?Analyzing and mit- igating jpeg compression defects in deep learning,? in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2357? 2367. [112] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, ?Improving the robustness of deep neural networks via stability training,? in Proceedings of the ieee conference on computer vision and pattern recognition, 2016, pp. 4480?4488. 308 [113] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ?Mo- bilenetv2: Inverted residuals and linear bottlenecks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510? 4520. [114] S. Xie, R. Girshick, P. Dolla?r, Z. Tu, and K. He, ?Aggregated residual trans- formations for deep neural networks,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492?1500. [115] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ?Rethinking the inception architecture for computer vision,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818?2826. [116] T.-Y. Lin, M. Maire, S. Belongie, et al., ?Microsoft coco: Common objects in context,? in European conference on computer vision, Springer, 2014, pp. 740?755. [117] R. Girshick, ?Fast r-cnn,? in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440?1448. [118] S. Ren, K. He, R. Girshick, and J. Sun, ?Faster r-cnn: Towards real-time ob- ject detection with region proposal networks,? IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137?1149, 2016. [119] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dolla?r, ?Focal loss for dense object detection,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980?2988. 309 [120] K. He, G. Gkioxari, P. Dolla?r, and R. Girshick, ?Mask r-cnn,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961? 2969. [121] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. 
Torralba, ?Scene parsing through ade20k dataset,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. [122] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, ?Se- mantic understanding of scenes through the ade20k dataset,? arXiv preprint arXiv:1608.05442, 2016. [123] K. Sun, Y. Zhao, B. Jiang, et al., ?High-resolution representations for labeling pixels and regions,? arXiv preprint arXiv:1904.04514, 2019. [124] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ?Pyramid scene parsing net- work,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881?2890. [125] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, ?Unified perceptual pars- ing for scene understanding,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418?434. [126] D. Bolya, S. Foley, J. Hays, and J. Hoffman, ?Tide: A general toolbox for identifying object detection errors,? in ECCV, 2020. [127] D. Marpe, T. Wiegand, and G. J. Sullivan, ?The h. 264/mpeg4 advanced video coding standard and its applications,? IEEE communications maga- zine, vol. 44, no. 8, pp. 134?143, 2006. 310 [128] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, ?Overview of the high efficiency video coding (hevc) standard,? IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649?1668, 2012. [129] Bitmovin, Video developer report 2019, 2019. [Online]. Available: https: //go.bitmovin.com/video-developer-report-2019. [130] S. Goedegebure, A. Goralczyk, E. Valenza, et al. ?Big buck bunny.? (2008), [Online]. Available: https://peach.blender.org/. [131] International Telecommunication Union, ?Advanced video coding for generic audiovisual services,? Geneva, CH, Standard, Aug. 2021. [132] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown, ?Frame-Recurrent Video Super-Resolution,? in The IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), Jun. 2018. [133] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, ?Video enhancement with task-oriented flow,? International Journal of Computer Vision, vol. 127, no. 8, pp. 1106?1125, Aug. 2019, arXiv: 1711.09078, issn: 0920-5691, 1573- 1405. doi: 10.1007/s11263-018-01144-2. [134] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, ?Edvr: Video restoration with enhanced deformable convolutional networks,? in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition Workshops, 2019, pp. 0?0. [135] Y. Li, P. Jin, F. Yang, C. Liu, M.-H. Yang, and P. Milanfar, ?Comisr: Compression-informed video super-resolution,? in Proceedings of 311 the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, pp. 2543?2552. [136] T. Wang, M. Chen, and H. Chao, ?A novel deep learning-based method of improving coding efficiency from the decoder-end for hevc,? in 2017 Data Compression Conference (DCC), IEEE, 2017, pp. 410?419. [137] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, ?Enhancing quality for hevc compressed videos,? IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2039?2054, 2018. [138] R. Yang, M. Xu, Z. Wang, and T. Li, ?Multi-frame quality enhancement for compressed video,? in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018, pp. 6664?6673, isbn: 978-1-5386- 6420-9. doi: 10.1109/CVPR.2018.00697. [Online]. Available: https:// ieeexplore.ieee.org/document/8578795/. [139] J. Deng, L. Wang, S. Pu, and C. 
Zhuo, ?Spatio-temporal deformable convo- lution for compressed video quality enhancement,? Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 0707, pp. 10 696?10 703, Apr. 2020, issn: 2374-3468. doi: 10.1609/aaai.v34i07.6697. [140] Q. Xing, Z. Guan, M. Xu, R. Yang, T. Liu, and Z. Wang, ?Mfqe 2.0: A new approach for multi-frame quality enhancement on compressed video,? IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 949?963, Mar. 2021, arXiv: 1902.09707, issn: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2019.2944806. 312 [141] Q. Ding, L. Shen, L. Yu, H. Yang, and M. Xu, ?Patch-wise spatial-temporal quality enhancement for hevc compressed video,? IEEE Transactions on Im- age Processing, vol. 30, pp. 6459?6472, 2021. [142] M. Zhao, Y. Xu, and S. Zhou, ?Recursive fusion and deformable spatiotem- poral attention for video compression artifact reduction,? in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5646?5654. [143] M. Schuster and K. K. Paliwal, ?Bidirectional recurrent neural networks,? IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673?2681, 1997. [144] M. Ehrlich, J. Barker, N. Padmanabhan, et al., ?Leveraging bitstream meta- data for fast and accurate video compression correction,? arXiv preprint arXiv:2202.00011, 2022. [145] Z. Teed and J. Deng, ?Raft: Recurrent all-pairs field transforms for optical flow,? in European conference on computer vision, Springer, 2020, pp. 402? 419. [146] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, ?Eca-net: Efficient channel attention for deep convolutional neural networks, 2020 ieee,? in CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020. [147] A. Vaswani, N. Shazeer, N. Parmar, et al., ?Attention is all you need,? in Advances in neural information processing systems, 2017, pp. 5998?6008. [148] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud, ?Two de- terministic half-quadratic regularization algorithms for computed imaging,? 313 in Proceedings of 1st International Conference on Image Processing, IEEE, vol. 2, 1994, pp. 168?172. [149] M. Arjovsky, S. Chintala, and L. Bottou, ?Wasserstein generative adversarial networks,? in International conference on machine learning, PMLR, 2017, pp. 214?223. [150] M. Chu, Y. Xie, J. Mayer, L. Leal-Taixe?, and N. Thuerey, ?Learning tem- poral coherence via self-supervision for gan-based video generation,? ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 75?1, 2020. [151] S. Tomar, ?Converting video formats with ffmpeg,? Linux Journal, vol. 2006, no. 146, p. 10, 2006. [152] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, ?The unrea- sonable effectiveness of deep features as a perceptual metric,? in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586?595. [153] A. Mercat, M. Viitanen, and J. Vanne, ?Uvg dataset: 50/120fps 4k sequences for video codec analysis and development,? in Proceedings of the 11th ACM Multimedia Systems Conference, 2020, pp. 297?302. [154] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, ?Video compression through im- age interpolation,? in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 416?431. [155] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, ?Dvc: An end-to- end deep video compression framework,? in Proceedings of the IEEE/CVF 314 Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 006? 11 015. [156] J. Liu, S. Wang, W.-C. 
Ma, et al., ?Conditional entropy coding for effi- cient video compression,? in Computer Vision?ECCV 2020: 16th European Conference, Glasgow, UK, August 23?28, 2020, Proceedings, Part XVII 16, Springer, 2020, pp. 453?468. [157] H. Chen, B. He, H. Wang, Y. Ren, S.-N. Lim, and A. Shrivastava, ?Nerv: Neural representations for videos,? arXiv preprint arXiv:2110.13903, 2021. [158] R. Yang, Y. Yang, J. Marino, and S. Mandt, ?Hierarchical autoregressive modeling for neural video compression,? arXiv preprint arXiv:2010.10258, 2020. [159] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, ?Learning for video com- pression with hierarchical quality and recurrent enhancement,? in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6628?6637. [160] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici, ?Scale-space flow for end-to-end optimized video compression,? in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2020, pp. 8503?8512. [161] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ?Grad-cam: Visual explanations from deep networks via gradient-based lo- 315 calization,? in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618?626. [162] E. Agustsson, F. Mentzer, M. Tschannen, et al., ?Soft-to-hard vector quantization for end-to-end learning compressible representations,? in Ad- vances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, et al., Eds., vol. 30, Curran Associates, Inc., 2017. [Online]. Available: https : / / proceedings . neurips . cc / paper / 2017 / file / 86b122d4358357d834a87ce618a55de0-Paper.pdf. [163] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, ?Conditional probability models for deep image compression,? in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394?4402. [164] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, ?Gen- erative adversarial networks for extreme learned image compression,? in Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 221?231. [165] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, ?High-fidelity generative image compression,? Advances in Neural Information Processing Systems, vol. 33, pp. 11 913?11 924, 2020. [166] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, ?Im- plicit neural representations with periodic activation functions,? Advances in Neural Information Processing Systems, vol. 33, pp. 7462?7473, 2020. 316 [167] O. Rippel, A. G. Anderson, K. Tatwawadi, S. Nair, C. Lytle, and L. Bourdev, ?Elf-vc: Efficient learned flexible-rate video coding,? in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 479? 14 488. [168] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, ?Pixel recurrent neural networks,? in International conference on machine learning, PMLR, 2016, pp. 1747?1756. [169] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., ?Conditional image generation with pixelcnn decoders,? Advances in neural information processing systems, vol. 29, 2016. [170] E. Hoogeboom, J. Peters, R. Van Den Berg, and M. Welling, ?Integer discrete flows and lossless compression,? Advances in Neural Information Processing Systems, vol. 32, 2019. [171] R. v. d. Berg, A. A. Gritsenko, M. Dehghani, C. K. S?nderby, and T. 
Sal- imans, ?Idf++: Analyzing and improving integer discrete flows for lossless compression,? arXiv preprint arXiv:2006.12459, 2020. [172] S. Zhang, C. Zhang, N. Kang, and Z. Li, ?Ivpf: Numerical invertible vol- ume preserving flow for efficient lossless compression,? in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 620?629. 317 [173] C. S. Wallace, ?Classification by minimum-message-length inference,? in International Conference on Computing and Information, Springer, 1990, pp. 72?81. [174] G. Hinton and D. van Camp, ?Keeping neural networks simple by minimising the description length of weights. 1993,? in Proceedings of COLT-93, pp. 5? 13. [175] B. J. Frey and G. E. Hinton, ?Free energy coding,? in Proceedings of Data Compression Conference-DCC?96, IEEE, 1996, pp. 73?81. [176] B. J. Frey, Bayesian networks for pattern classification, data compression, and channel coding. Citeseer, 1998. [177] J. Townsend, T. Bird, and D. Barber, ?Practical lossless compression with latent variables using bits back coding,? arXiv preprint arXiv:1901.04866, 2019. [178] F. Kingma, P. Abbeel, and J. Ho, ?Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables,? in International Conference on Machine Learning, PMLR, 2019, pp. 3408?3417. [179] J. Townsend, T. Bird, J. Kunze, and D. Barber, ?Hilloc: Lossless im- age compression with hierarchical latent variable models,? arXiv preprint arXiv:1912.09953, 2019. [180] J. Ho, E. Lohn, and P. Abbeel, ?Compression with flows via local bits-back coding,? Advances in Neural Information Processing Systems, vol. 32, 2019. 318 [181] F. Mentzer, L. V. Gool, and M. Tschannen, ?Learning better lossless com- pression using lossy compression,? in Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2020, pp. 6638?6647. 