ABSTRACT

Title of Dissertation: TOWARDS IMMERSIVE VISUAL CONTENT WITH MACHINE LEARNING

Brandon Yushan Feng, Doctor of Philosophy, 2023

Dissertation Directed by: Professor Amitabh Varshney, Department of Computer Science

Extended reality technology stands poised to revolutionize how we perceive, learn, and engage with our environment. However, transforming data captured in the physical world into digital content for immersive experiences continues to pose challenges. In this dissertation, I present my research on employing machine learning algorithms to enhance the generation and representation of immersive visual data.

Firstly, I address the issue of recovering depth information from videos captured using 360-degree cameras. I propose a novel technique that unifies the representation of object depth and surface normal utilizing double quaternions. Experimental results demonstrate that training with a double-quaternion-based loss function improves the prediction accuracy of a neural network using 360-degree video frames as input.

Secondly, I examine the problem of efficiently representing 4D light fields using the emerging concept of neural fields. Light fields hold significant potential for immersive visual applications; however, their widespread adoption is hindered by the substantial cost associated with storing and transmitting such high-dimensional data. I propose a novel approach for representing light fields. Deviating from previous approaches, I treat the light field data as a mapping function from pixel coordinates to color and train a neural network to accurately learn this mapping function. This functional representation enables high-quality interpolation and super-resolution for light fields while achieving state-of-the-art results in light field compression.

Thirdly, I present neural subspaces for light fields.
I adapt the ideas of subspace learning and tracking and identify the conceptual relationship between neural representations of light fields and the framework of subspace learning. My method considers a light field as an aggregate of local segments, or multiple local neural subspaces. A set of local neural networks is trained to encode each subset of viewpoints. Since each local network specializes in a specific region, this specialization allows for smaller networks without compromising accuracy.

Fourthly, I introduce a primary ray-based implicit function to represent geometric shapes. Traditional implicit shape representations, such as the signed distance function, describe a shape by its relationship to each spatial point. Such a point-based representation of shapes often necessitates costly iterative sphere tracing to render a surface hit point. I propose a ray-based approach to implicit neural shape modeling, wherein the shape is implicitly described by its relationship with each ray in 3D space. To render the hit point, my method only requires a single inference pass, considerably reducing the computational cost of rendering.

Lastly, I describe a technique to generate novel view renderings without relying on any 3D structure or camera pose information. I harness the power of neural fields to encode individual images without estimating their camera poses. My method learns a latent code for each image in the multi-view collection, and then produces plausible and photorealistic novel view renderings by interpolating their latent codes. This entirely 3D-agnostic approach avoids the computational cost incurred by 3D representations, offering a promising outlook on employing image-based neural fields for image manipulation tasks beyond fitting and super-resolving known images.
TOWARDS IMMERSIVE VISUAL CONTENT WITH MACHINE LEARNING

by

Brandon Yushan Feng

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Amitabh Varshney, Chair/Advisor
Assistant Professor Furong Huang
Assistant Professor Christopher A. Metzler
Associate Professor Jia-Bin Huang
Professor Joseph F. JaJa

© Copyright by Brandon Yushan Feng 2023

Acknowledgments

My first thanks go to my dissertation advisor, Amitabh Varshney. My doctoral studies have been incredibly intellectually rewarding and enlightening due to his thoughtful guidance and constant encouragement. Having the opportunity to conduct my doctoral training under his tutelage has been both a privilege and a pleasure.

I appreciate the time spent by all my committee members: Amitabh Varshney, Furong Huang, Christopher A. Metzler, Jia-Bin Huang, and Joseph F. JaJa. They all served on my committee, offered valuable suggestions for improvement, and asked insightful questions during my dissertation defense.

Beyond the work on this dissertation, I am privileged to have enjoyed fruitful collaborations with various research groups at UMD and beyond. I learned much about matrix sketching from Furong Huang. I was fortunate to work with Brian Pierce on studying antibody-antigen docking algorithms. I owe much of my knowledge of and joy in studying computational imaging and optics to Chris Metzler. I gained invaluable lessons on formulating research problems and managing projects from my interactions with Jia-Bin Huang. I also gathered countless insights and valuable experience from my many project collaborators, including (in roughly chronological order) Susmija Jabbireddy, David Li, Rui Yin, Tahseen Rabbani, Mingyang Xie, Haiyun Guo, Vivek Boominathan, Manoj K.
Sharma, Ashok Veeraraghavan, Ruofei Du, Zhenyi He, Keru Wang, Yinda Zhang, Danhang Tang, Zhiwen Fan, Chenxin Li, Sazan Mahbub, Hadi AlZayer, Kevin Zhang, Michael Rubinstein, and William T. Freeman.

I am thankful for the essential support provided by many staff members at UMIACS and UMD CS, including Tom Hurst, Jonathan Heagerty, Barbara Brawn, Sida Li, Eric Lee, Vivian Lu, Janice Perrone, and Jodie Gray.

I thank all my family, friends, and mentors for their support and impact over the years. My parents have always believed in me and would do anything for me to succeed. I feel fortunate that my grandmother is still around to celebrate my achievements, and I am thankful for the blessings given by my other grandparents who have passed. I am grateful to Erzhen for the unwavering support and encouragement during this time. I thank Xue Feng, Lu Feng, and Marc Santugini for their kindness and indispensable guidance when I applied for doctoral programs. I thank Kaiming for the friendship that has powered us through all the ups and downs of our academic endeavors. I am especially thankful to Gregory C. Robbins for instilling grit in me and changing the way I read, write, and think. Greg is my best teacher, and what he taught me during my teenage years has shaped the person I am today.

Most of the work on this dissertation occurred during an aberrant and unsettling time, marked by the pandemic and its repercussions. The unprecedented and unexpected challenges of this era have made me more determined and resilient. I must acknowledge the unique challenges faced by international students, and I offer my sympathies to those unable to begin or complete their studies during this period due to obstacles and distress caused by various contingent social and geopolitical factors. I hope that future students will not have to endure similar circumstances.
Table of Contents

Acknowledgements ii
Table of Contents iv
1 Introduction 1
  1.1 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss 1
  1.2 Efficient Neural Representation for Light Fields 4
  1.3 Neural Subspaces for Light Fields 8
  1.4 Primary Ray-based Implicit Function 12
  1.5 View Interpolation with Implicit Neural Representations of Images 16
2 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss 19
  2.1 Introduction 19
  2.2 Related Work 23
    2.2.1 Depth Estimation on 360-degree Images 23
    2.2.2 Joint Estimation of Depth and Normal 25
    2.2.3 Use of Quaternions 26
  2.3 Method 27
    2.3.1 CNN Architecture 28
    2.3.2 Depth Refinement based on Normal 28
    2.3.3 Aggregation with Confidence Scores 30
    2.3.4 Double Quaternion Approximation of Depth and Normal in the Loss Function 32
      2.3.4.1 Constructing Double Quaternion 32
      2.3.4.2 Loss Function Based on Double Quaternions 34
    2.3.5 Stereo Consistency 34
    2.3.6 Overall Loss Function 35
  2.4 Experiments 35
    2.4.1 Training Details 36
    2.4.2 Comparison with Other Methods 38
    2.4.3 Ablation Studies 38
  2.5 Limitations and Conclusion 39
3 Efficient Neural Representation for Light Fields 43
  3.1 Introduction 43
  3.2 Related Work 46
  3.3 Overview 49
    3.3.1 Light Fields as Functions 49
    3.3.2 Function Approximations 51
    3.3.3 MLP for Approximation 52
    3.3.4 Towards Multi-dimensional Input 53
    3.3.5 Gegenbauer Basis 55
  3.4 Methods 55
    3.4.1 Proposed Framework 55
    3.4.2 Comparative Evaluation 56
    3.4.3 Data and Training Setup 58
  3.5 Results 59
    3.5.1 Static Light Field Reconstruction 59
    3.5.2 Extension to Light Field Videos 62
    3.5.3 Light Field Super-Resolution 62
    3.5.4 Ablation Studies 64
  3.6 Discussion and Limitations 67
  3.7 Conclusion 68
4 Neural Subspaces for Light Fields 69
  4.1 Introduction 69
    4.1.1 Scope 74
  4.2 Related Work 74
    4.2.1 Light Field Compression for Streaming 74
    4.2.2 Neural Light Field 76
    4.2.3 Subspace Learning and Tracking 78
  4.3 Method 79
    4.3.1 Neural Light Fields with MLP 79
    4.3.2 Constructing Light Field Segments 80
    4.3.3 Adaptive Weight Sharing in MLP 81
    4.3.4 Soft-Classification for RGB Prediction 83
  4.4 Experiments 84
    4.4.1 Training Setup 85
    4.4.2 Quantitative Metrics 86
    4.4.3 Sub-aperture Light Fields 87
      4.4.3.1 Comparison with the Residual Approach 88
      4.4.3.2 Improved Accuracy with Soft Classification 88
    4.4.4 Volumetric Light Fields 89
      4.4.4.1 Extension to Dynamic Content 95
      4.4.4.2 Possible Flickering Across Subspaces 96
    4.4.5 Hyper-parameter Analysis 96
  4.5 Discussions 97
    4.5.1 End-to-end versus Two-stage Learning 97
    4.5.2 Limitations 98
    4.5.3 Future Directions 99
  4.6 Conclusion 99
5 Primary Ray-based Implicit Function 103
  5.1 Introduction 103
  5.2 Related Work 106
    5.2.1 3D Shape Representations 107
      5.2.1.1 Functional Representations 107
      5.2.1.2 Global versus Local Representations 108
    5.2.2 Ray-based Neural Networks 109
  5.3 Method 110
    5.3.1 Background 111
    5.3.2 Describing Geometry with Perpendicular Foot 112
    5.3.3 Background Mask 113
    5.3.4 Outlier Points Removal 114
  5.4 Experiments 115
    5.4.1 Single Shape Representation 116
    5.4.2 Shape Generation 117
    5.4.3 Shape Denoising and Completion 119
    5.4.4 Analysis and Ablations 120
      5.4.4.1 Complexity Analysis 120
      5.4.4.2 Stress Testing 123
      5.4.4.3 Ablations 123
    5.4.5 Further Applications 123
      5.4.5.1 Learning Camera Poses 123
      5.4.5.2 Neural Rendering with Color 125
  5.5 Additional Details and Discussion 126
    5.5.1 Training Setup 126
      5.5.1.1 Optimization 126
      5.5.1.2 Network Architecture 126
    5.5.2 Runtime Discussions 127
      5.5.2.1 Rendering Speed 127
      5.5.2.2 Meshing Speed 128
    5.5.3 More Ablations 128
      5.5.3.1 Comparison to Plücker Coordinates 128
      5.5.3.2 Varying Model Complexity 129
      5.5.3.3 Different Noise Levels 129
      5.5.3.4 Limitation 130
  5.6 Conclusion 130
6 View Interpolation with Implicit Neural Representations of Images 131
  6.1 Introduction 131
  6.2 Related Work 134
    6.2.1 Implicit Neural Representations 134
      6.2.1.1 3D Reconstruction 135
      6.2.1.2 Image Fitting 135
    6.2.2 Image-based Rendering 136
  6.3 Method 137
    6.3.1 INR for Image Fitting 137
    6.3.2 Extension to Multiple Images 138
    6.3.3 Direct Regularization 139
    6.3.4 Indirect Regularization 141
  6.4 Experiments 145
  6.5 Discussion 147
  6.6 Supplementary Information 150
    6.6.1 Training Details 150
      6.6.1.1 Hyperparameters 150
      6.6.1.2 Training with CLIP-based Features 151
      6.6.1.3 Dataset Details 152
      6.6.1.4 Beyond Interpolating Between Two Views 153
      6.6.1.5 Extending to Frame Interpolation 154
  6.7 Conclusion 163
7 Conclusion and Future Work 164
Bibliography 166

Chapter 1: Introduction

In this dissertation, I present a series of innovative machine learning algorithms designed to enhance the creation and representation of immersive visual data for extended reality applications. Firstly, I propose a method that simultaneously addresses depth and normal information to improve depth predictions from 360-degree videos. Secondly, I construct a streamlined technique for representing 4D light fields using neural fields, also known as implicit neural representations (INR). Thirdly, I put forward the concept of neural subspaces to optimize the quality and efficiency of neural light field representations. Fourthly, I introduce the idea of primary ray-based implicit functions for effective geometric shape modeling. Lastly, I explore a 3D-agnostic approach using image-based neural fields to generate novel view renderings without 3D structure or camera pose data. In this chapter, I provide a brief overview of these advances before detailing them in later chapters.

1.1 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss

Depth information for 360° content enables 3D rendering based on the viewer's position and allows scene editing effects like relighting and object insertion. Obtaining depth information for 360° videos has several solutions, such as adding an active depth sensor or using stereo correspondence. However, both methods are cumbersome, costly, and challenging for average users.

Figure 1.1: Method Overview. The convolutional neural network (CNN) takes in a 360° image and for each pixel estimates its depth, normal, and the uncertainty of that depth estimate. These three estimates are used by a refinement module to produce the final depth estimate for each pixel. I train the CNN using a novel loss based on the double quaternion representation of the depth and normal.
Recent deep learning advances have enabled a data-driven approach to the 360° depth problem, demonstrating that a neural network can predict 360° depth with monocular input only. In Chapter 2, I present a novel method to improve existing deep learning methods on monocular 360° depth estimation, as illustrated by Figure 2.1. The proposed method unifies the representations of depth and normal based on the concept of double quaternions. My method enables the conversion of predicted and ground-truth depths and normals into two double quaternions, from which I derive a new loss for joint depth and surface normal estimation. I also use the double quaternion representation to measure the discrepancy between two CNN estimates from a stereo image pair.

Figure 1.2: Surface Normal Prediction. Training with the double quaternion loss enables the network to produce better normal estimates. In the third column, I show predictions from the baseline model, which separately calculates depth and normal loss without combining them into a double quaternion form.

Figure 1.3: Depth Refinement Results. I compare initial depth estimates produced by the network and the refined output based on surface normal. In the first column, I show the input image for reference.
Experimental results show that training with a double-quaternion-based loss function improves prediction accuracy for neural networks with 360-degree video frames as input. In Figure 1.2, I show normal estimates produced by my trained network. In Figure 1.3, I demonstrate the benefit of refining the initial depth predictions with predicted normals. My method outperforms state-of-the-art methods in most metrics, as shown in Table 1.1.

Table 1.1: Performance Comparison on the ODS dataset [4]. Evaluation statistics for rows 1-7 are taken directly from Lai et al. [4] and Xie et al. [5]. My method produces superior results in most metrics.

Method         RMSE   Log10  AbsRel  δ1     δ2     δ3
UResNet [1]    2.037  0.326  16.906  0.213  0.399  0.560
RectNet [1]    1.738  0.291  16.132  0.240  0.453  0.634
FCRN [2]       0.672  0.101  7.448   0.806  0.932  0.966
PSMNet [3]     0.393  0.059  5.641   0.953  0.975  0.980
SepUNet [4]    0.495  0.042  1.779   0.944  0.987  0.993
SepUNetS [4]   0.614  0.072  1.841   0.835  0.966  0.985
SepUNetDD [5]  0.392  0.036  2.120   0.960  0.987  0.992
Ours           0.389  0.031  0.413   0.954  0.984  0.990

1.2 Efficient Neural Representation for Light Fields

Light field content holds great potential for immersive visual applications. However, a primary obstacle to widespread adoption is the extreme cost of storing and transmitting such high-dimensional data. Past research has proposed numerous methods to compress light fields, but these efforts have had limited success in making light field content compact enough for casual streaming. In Chapter 3, I introduce SIGNET, a novel approach for representing light fields with neural networks, as illustrated in Figure 1.4.

Figure 1.4: Overview of SIGNET. I train an MLP to approximate the mapping function from each pixel's coordinates to its color values. My input transformation strategy based on the Gegenbauer polynomials enables the MLP to more accurately learn the high-dimensional mapping function.
Instead of treating light fields as pixel collections, I consider them as functions mapping pixel coordinates to colors. I also introduce a new transformation based on Gegenbauer polynomials for the input coordinates, enabling the network to successfully represent dense light fields. The ability to accurately represent a huge light field with a compact neural network automatically achieves compression with high fidelity. As shown in Table 1.2, SIGNET outperforms other compression methods on multiple light field scenes.

Table 1.2: Compression Performance Compared to Other Methods. Size denotes the storage in megabytes (MB) for each method without further quantization.

                Static Light Fields                        Light Field Videos
Scene           Lego         Bracelet     Tarot            Painter      Trains
Method          Size  PSNR   Size  PSNR   Size  PSNR       Size  PSNR   Size  PSNR
SIGNET          9.0   41.26  12.0  38.70  9.0   37.47      144   39.56  144   39.73
AMDE [6]        29.3  40.90  18.1  39.90  44.2  38.54      941   38.25  809   37.00
KSVD [7]        29.3  38.39  18.1  36.73  44.3  38.81      942   38.12  807   35.06
HOSVD [6]       29.3  37.24  18.0  33.98  44.3  34.53      942   36.91  807   35.29
5D DCT [6]      29.4  37.29  18.1  32.31  44.2  33.03      941   36.79  807   35.20
CDF 9/7 [8]     29.0  33.71  18.2  31.98  44.3  29.17      941   31.69  1116  29.80

Moreover, such a functional representation easily achieves interpolation and super-resolution on light fields. In Figure 1.5, I demonstrate the result of spatial upsampling with SIGNET, achieved by evaluating the network on a denser set of coordinates. SIGNET produces acceptable output without being trained for this task. In Figure 1.6, I present SIGNET's performance on angular upsampling, or novel view synthesis. In Figure 1.7, I present the result of temporal upsampling with SIGNET. Although SIGNET has only been trained on content from the first frame at t0 and the third frame at t0 + 1, when evaluated at the unseen intermediate time step t0 + 1/2, SIGNET is able to preserve the motion trajectory.
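The functional view underlying these capabilities, treating a light field as a mapping from pixel coordinates to color, can be sketched with a small coordinate-based MLP. The network sizes, polynomial degree, and the parameter alpha below are illustrative assumptions rather than SIGNET's exact configuration (detailed in Chapter 3); `scipy.special.eval_gegenbauer` evaluates the Gegenbauer polynomials used for the input transformation.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.special import eval_gegenbauer

def gegenbauer_features(coords, degree=8, alpha=0.5):
    """Expand each coordinate (normalized to [-1, 1]) into Gegenbauer
    polynomial values C_1^alpha(x) .. C_degree^alpha(x), analogous in
    spirit to a Fourier positional encoding. Degree and alpha here are
    illustrative choices, not the dissertation's exact settings."""
    feats = [eval_gegenbauer(n, alpha, coords) for n in range(1, degree + 1)]
    return np.concatenate(feats, axis=-1).astype(np.float32)

class LightFieldMLP(nn.Module):
    """Small MLP approximating the 4D mapping (u, v, x, y) -> (r, g, b)."""
    def __init__(self, in_dim, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers += [nn.Linear(d, 3), nn.Sigmoid()]  # colors in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, feats):
        return self.net(feats)

# Sample 4D light field coordinates (u, v, x, y), transform, and predict colors.
coords = np.random.uniform(-1.0, 1.0, size=(1024, 4)).astype(np.float32)
feats = gegenbauer_features(coords)           # shape (1024, 4 * 8) = (1024, 32)
model = LightFieldMLP(in_dim=feats.shape[-1])
rgb = model(torch.from_numpy(feats))          # shape (1024, 3)
```

Once such a network is trained against the ground-truth pixel colors, evaluating it on a denser coordinate grid yields the spatial, angular, or temporal upsampling described above, with no task-specific training.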
SIGNET’s functional design allows random access to encoded pixels, while 6 Figure 1.5: Spatial Upsampling with SIGNET. I evaluate the trained SIGNET on dense sampling grid points in the spatial dimensions. I show zoomed-in details in the cropped region bounded by the yellow rectangle. Figure 1.6: Angular Upsampling with SIGNET. At the bottom left corner of the reconstructed view, I show the relative positions of the reconstructed view (red square) and its four nearest views (blue squares) in the original light field. I compare with results from the deep-learning-based method, LFASR [9], which is trained specifically for light field angular upsampling. traditional codecs like JPEG or MPEG require decoding an entire image patch to access a single pixel. I envision that this property would be highly beneficial for foveated rendering, enabling substantial rendering and streaming speedups by 7 Figure 1.7: Temporal Upsampling with SIGNET. t0 and t0+1 are consecutive frames in the original video. The blue boxes contain output from frames evaluated at t0 + 1 2 , which is not present in the original video. The vertical lines are drawn for easier observation of the motion trajectory. adaptively selecting pixel portions based on the viewer’s gaze location. 1.3 Neural Subspaces for Light Fields SIGNET is a neural field, also known as implicit neural representation (INR), that compactly encodes multiple viewpoints from a light field scene (hundreds of megabytes) into a single neural network’s weights (a few megabytes). These weights are the only necessary information for storage and streaming. This unified design of training a single neural network to cover the entire light field has simplistic appeal as a concept. However, it is not ideal in practical scenarios that emphasize efficient transmission and rendering. 
Such scenarios generally involve rendering only a subset of light field viewpoints. Therefore, unnecessary costs are incurred to transmit and evaluate a single neural network that contains other, unrelated views.

Figure 1.8: Concept of Neural Subspaces for Light Fields. Given a light field scene, I divide it into multiple local segments and construct a neural subspace for each segment. The neural subspace construction is equivalent to training the network parameters to learn accurate coordinate-to-color mappings within each segment. The adaptive weight sharing strategy utilizes the similarity among nearby subspaces and reduces the total number of parameters needed to represent the entire light field.

In Chapter 4, I discuss the conceptual connection between the neural representation of light fields and subspace learning, a signal processing concept for dimensionality reduction of high-dimensional data. Inspired by subspace learning, I show that it is not necessary to treat light fields only as one unified entity: light fields can be regarded as a composite collection of local segments, or neural subspaces. Such a perspective is meaningful in practice since only a subset of the light field might be relevant at a particular moment for streaming or rendering.

Figure 1.9: Illustration of the Weight Sharing Strategy. The light field is partitioned into segments, each containing 2 × 2 viewpoints. I construct a neural subspace for each segment that summarizes its pixel-to-color mapping relationship. Each subspace shares a set of network layers while possessing its own local layers that enable network specialization on its corresponding data segment.

The approach introduced in Chapter 4 trains a set of local neural networks that each encode only a subset of viewpoints, unlike SIGNET, which is one global network representing the entire light field. An overview of this approach is illustrated in Figure 1.8. As each local network specializes in a particular region, this specialization permits smaller networks without sacrificing accuracy. Furthermore, recognizing the similarity among nearby subspaces, I propose a weight-sharing strategy for those local networks to enhance overall parameter efficiency while maintaining network capacity within each subspace. Effectively, this proposed strategy, illustrated in Figure 1.9, achieves the tracking of implicit neural subspaces. As shown in Table 4.1, experimental results indicate that the proposed framework leads to better efficiency and accuracy than the original SIGNET and a range of previous methods.

Table 1.3: Results on sub-aperture light fields, compared with previous methods AMDE [6], KSVD [7], and SIGNET [10]. EncMB is the memory (in MB) required to encode the entire light field in storage, and DecMB is the memory (in MB) required in streaming to decode any frame.

Scene               Method   PSNR(↑)  SSIM(↑)  EncMB(↓)  DecMB(↓)
Lego (909 MB)       AMDE     40.90    0.973    29.3      29.3
                    KSVD     38.39    0.960    29.3      29.3
                    SIGNET   41.26    0.976    9.0       9.0
                    Ours     41.95    0.982    22.9      2.2
Tarot (909 MB)      AMDE     38.54    0.973    44.2      44.2
                    KSVD     38.81    0.980    44.3      44.3
                    SIGNET   37.47    0.975    9.0       9.0
                    Ours     38.21    0.975    22.9      2.2
Bracelet (568 MB)   AMDE     39.90    0.980    18.1      18.1
                    KSVD     36.73    0.973    18.1      18.1
                    SIGNET   38.70    0.973    12.0      12.0
                    Ours     39.64    0.985    22.9      2.2

With this work invoking the classic idea of subspace learning, I take the neural light field representation introduced by SIGNET to the next step by making this representation more compact and streaming-friendly.
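As a rough structural sketch of the shared-plus-local design in Figure 1.9 (the layer widths, ReLU activations, and the untrained forward pass below are illustrative assumptions, not the dissertation’s exact architecture):

```python
import numpy as np

# Hypothetical sketch of the weight-sharing neural-subspace idea: viewpoints
# on a 16 x 16 grid are partitioned into 2 x 2 segments, each segment gets its
# own small "local" head, while an initial stack of layers is shared by all
# segments. All sizes are illustrative.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class NeuralSubspaces:
    def __init__(self, n_segments, in_dim=4, shared_dim=64, local_dim=32):
        # Shared layer: one set of weights reused by every subspace.
        self.W_shared = rng.normal(0, 0.1, (in_dim, shared_dim))
        # Local layers: one small head per segment (per neural subspace).
        self.W_local = [rng.normal(0, 0.1, (shared_dim, local_dim))
                        for _ in range(n_segments)]
        self.W_out = [rng.normal(0, 0.1, (local_dim, 3))
                      for _ in range(n_segments)]

    def segment_of(self, u, v, grid=16, seg=2):
        # Map a viewpoint (u, v) on the 16 x 16 grid to its 2 x 2 segment.
        return (u // seg) * (grid // seg) + (v // seg)

    def __call__(self, coord, u, v):
        s = self.segment_of(u, v)
        h = relu(coord @ self.W_shared)   # shared features
        h = relu(h @ self.W_local[s])     # subspace-specific features
        return h @ self.W_out[s]          # RGB prediction

model = NeuralSubspaces(n_segments=64)    # 8 x 8 segments
rgb = model(np.array([0.1, 0.2, 0.3, 0.4]), u=5, v=9)
print(rgb.shape)  # (3,)
```

Decoding a given segment then requires only the shared weights plus that segment’s local head, which is consistent with the much smaller decoding memory (DecMB) compared to the total encoding memory (EncMB) reported in Table 1.3.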
1.4 Primary Ray-based Implicit Function

Learning accurate and efficient 3D object representations is vital for applications in graphics, vision, and robotics. Recent advances in machine learning involve training neural fields of signed distance functions (SDF) as implicit shape representations. However, rendering and extracting shapes from trained SDF networks can be computationally expensive and is typically limited to watertight shapes. Moreover, the shape quality is ultimately constrained by the convergence criteria of sphere tracing or the grid resolution of marching cubes extraction.

Figure 1.10: Comparing various implicit shape representations. Rendering from common implicit neural shape representations, such as signed distance functions (SDF) and occupancy functions (OF), requires either sphere tracing or rasterizing a separately extracted mesh. My new representation, PRIF, directly maps each primary ray to its hit point. A network encoding PRIF is more efficient and convenient for rendering, since it requires only one evaluation for each ray, avoids the watertight constraint in conventional methods, and easily enables differentiable rendering.

Figure 1.11: Formulation of PRIF. (a) The signed distance at a sampling position (white) reveals the sphere (blue dots) where its nearest surface point (blue) exists, when we really want to know the hit point (red) along a specific direction. Thus, multiple samples are required. (b) To obtain the surface hit point, PRIF uses only one sample (yellow) along the ray: the perpendicular foot between the given ray and the coordinate system’s origin O. (c) PRIF takes in the ray’s direction and its sampling point, and returns the distance from that point to the actual surface hit point.

Chapter 5 presents a novel implicit geometric representation that is efficient, accurate, and innately compatible with downstream tasks involving reconstruction and rendering. I break away from conventional point-wise implicit functions like SDF, and propose to encode 3D geometry into a ray-based implicit function called PRIF. Specifically, PRIF operates in the realm of oriented rays r = (p_r, d_r), where p_r ∈ R³ is the ray origin and d_r ∈ S² is the normalized ray direction. Unlike SDF, which only outputs the distance to the nearest but undetermined surface point, I formulate this representation such that its output directly reveals the surface hit point of the input ray. Figure 1.10 presents the overview of PRIF, and Figure 1.11 shows its formulation. I train an MLP to learn Φ(f_r, d_r) = s_r, where f_r is the perpendicular foot of the ray with respect to the origin and s_r is the distance from f_r to the surface hit point. In effect, the objective is equivalent to finding a simple affine transformation f(x) = Ax + b, with the input x = d_r, A = s_r I₃, and b = f_r.

Table 1.4: Quantitative results on single shape representation. The left and right numbers represent the mean and median Chamfer Distance (multiplied by 10⁻⁴).

Method  Armadillo     Bunny         Buddha         Dragon         Lucy
SDF     1.905|1.260   1.717|1.147   6.119|2.258    5.184|1.946    3.387|1.417
OF      4.805|1.624   1.704|1.133   17.279|3.113   19.577|3.014   3.396|1.427
PRIF    0.978|0.706   1.169|0.835   1.443|0.821    1.586|0.913    0.846|0.519

Figure 1.12: Comparing PRIF with SDF and OF. I test on a tetrahedron grid that is self-intersecting and non-watertight. I obtain the SDF and OF values based on the ground-truth geometry, and extract the mesh by Marching Cubes. While the level-set representations fail, PRIF reliably preserves the shape.
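The formulation above can be checked with a short numeric sketch: the perpendicular foot f_r is computed in closed form, and an analytic unit sphere stands in for a trained Φ (the sphere oracle and all values below are illustrative assumptions, not part of the dissertation’s method):

```python
import numpy as np

# Sketch of the PRIF parameterization. For a ray r = (p, d) with unit
# direction d, the perpendicular foot f is the point on the ray closest to
# the origin O. PRIF trains an MLP Phi(f, d) -> s; here an analytic oracle
# for a unit sphere substitutes for the network so the geometry can be
# verified without training.

def perpendicular_foot(p, d):
    d = d / np.linalg.norm(d)
    # Remove the component of p along d: f = p - (p . d) d lies on the ray
    # and is the closest ray point to the origin.
    return p - np.dot(p, d) * d, d

def prif_oracle_sphere(f, d, radius=1.0):
    # Signed distance s from the foot f to the first sphere hit along d.
    m2 = np.dot(f, f)                  # squared distance of the ray to O
    if m2 > radius**2:
        return None                    # ray misses the sphere
    return -np.sqrt(radius**2 - m2)    # near intersection lies "behind" f

p = np.array([0.3, -2.0, 0.1])         # ray origin (outside the sphere)
d = np.array([0.0, 1.0, 0.0])          # ray direction (toward the sphere)
f, d = perpendicular_foot(p, d)
s = prif_oracle_sphere(f, d)
hit = f + s * d                        # single evaluation -> hit point
print(np.linalg.norm(hit))            # ~1.0: the hit lies on the sphere
```

A trained PRIF network plays the role of `prif_oracle_sphere` for arbitrary shapes, so rendering a hit point costs one inference pass per ray rather than the many evaluations of sphere tracing.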
I also avoid a major limitation of previous sphere-tracing-based methods: having to sample multiple points and perform multiple network evaluations to obtain a hit point. Chapter 5 also presents various experiments that verify the efficacy of PRIF for shape representation and demonstrate the applications enabled by using PRIF as the underlying neural shape representation. Table 1.4 and Figure 1.12 offer a preview showing that PRIF significantly outperforms SDF and OF in accurately preserving the fine details of the 3D shapes. Figure 1.13 shows the successful recovery of camera pose enabled by using PRIF as a hit-point renderer.

Figure 1.13: Learning Camera Poses. The initial camera pose is optimized based on the difference between the PRIF output rendered at the current pose and the PRIF output rendered at an unknown target pose. The PRIF outputs are depth images. The camera pose gradually converges to the correct target pose.

In summary, Chapter 5 introduces PRIF, a new 3D shape representation based on the relationship between a ray and its perpendicular foot with respect to the origin. I demonstrate that neural networks can successfully encode PRIF to achieve accurate shape representations. This new representation avoids multi-sample sphere tracing and obtains the hit point with a single network evaluation. Neural networks trained to encode PRIF inherit these advantages and can represent shapes more accurately than common neural shape representations using the same network architecture.

1.5 View Interpolation with Implicit Neural Representations of Images

Neural fields, also known as implicit neural representations (INR), have been successful in representing visual signals such as images, videos, signed distance fields, and radiance fields. In scenarios where only 2D images are available, two prominent applications are 2D image fitting and 3D view synthesis. INRs achieve impressive visual results along these two orthogonal directions. On the one hand, the quality of fitting images is improved by incorporating traditional signal processing techniques. On the other hand, the quality of view synthesis is improved by augmenting INRs with well-established 3D graphics techniques.

In Chapter 6, I explore a different direction and ask a new question: given multiple 2D image views of a 3D scene, can we use the INR of those 2D images alone to do view synthesis without any 3D reconstruction, pose, or correspondence? With randomly initialized INR weights and code vectors for individual images, I modify the standard INR training process such that the trained INR can both faithfully reproduce the given images and synthesize plausible novel views when interpolating between those learned image codes. This method is called VIINTER, and its overview is shown in Figure 1.14. Chapter 6 includes various analyses and experiments. For example, I show that it is important to regularize the magnitude of the latent codes, as shown in Figure 1.15.

Figure 1.14: Overview of VIINTER. After each image is randomly assigned a code vector z, the codes are then jointly trained with the neural network to produce the RGB color given coordinate (x, y). With standard training, the INR fails to decode coherent images at new interpolated codes, but VIINTER enables smooth transitions between two known viewpoints. Contrary to common methods for view interpolation, VIINTER does not use 3D structure, camera poses, or pixel correspondence.

Figure 1.15: Effect of Controlling Latent Codes ∥z∥p = 1 with Different p-norm. For each condition, we show the INR output given z_i (left), 0.5z_i + 0.5z_j (center), and z_j (right). “No Control” does not control the latent codes, leading to proper reconstruction at known views (left and right) but complete failure in interpolation (center). “∞-norm” scales each z with its maximum norm, but still does not interpolate well. “2-norm” significantly improves interpolation and reconstructs known views better, but “1-norm” is much better at interpolation (see red boxes).

VIINTER: View Interpolation with Implicit Neural Representations of Images, SA ’22 Conference Papers, December 6–9, 2022, Daegu, Republic of Korea

2.2 Image-based Rendering. The early approaches of image-based rendering (IBR) achieve novel view synthesis through explicitly blending relevant pixels from known images [Debevec et al. 1996; Gortler et al. 1996; Levoy and Hanrahan 1996]. The visual quality of IBR is heavily dependent on the strategy for deciding the blending weights of images, and researchers have developed a line of techniques improving blending weight selection, such as ray-space proximity [Chai et al. 2000; Levoy and Hanrahan 1996], proxy geometry [Buehler et al. 2001; Debevec et al. 1996; Heigl et al. 1999], optical flow [Chen and Williams 1993; Du et al. 2018], soft blending [Penner and Zhang 2017; Riegler and Koltun 2020], and neural-network-assisted blending [Mildenhall et al. 2019; Rombach et al. 2021; Thies et al. 2019; Wang et al. 2021b]. These techniques often require an approximate 3D structure (proxy geometry or depth) of the scene so that pixels can be re-projected to the novel view. For methods that do not involve 3D re-projection [Levoy and Hanrahan 1996; Ng et al. 2005], many still assume knowledge of the 3D camera locations and orientations of each image and leverage the spatial relationship among the cameras to decide the blending weights. In contrast, we explore a different and more challenging problem setting which involves neither 3D reconstruction nor knowledge of 3D locations and camera orientations. Our problem setup is similar to prior work on image morphing [Chen and Williams 1993; Liao et al. 2014; Seitz and Dyer 1996; Wolberg 1998], but we achieve the morphing effect without finding pixel-wise correspondences between images.

3 METHOD
We provide details on the INR parametrization adopted in our study, and we introduce the proposed modifications to INR training.

3.1 INR for Image Fitting. Let F denote the INR of images. In the case of a single image, for all pixels p of the image, the INR F defines

F(p_x, p_y) = p_c,   (1)

where (p_x, p_y) denotes the coordinate of the pixel p, with p_x ∈ R and p_y ∈ R, and p_c ∈ R³ denotes the value (often the RGB vector) associated with the pixel p. In itself, the INR formulation is invariant to different numeric ranges of (p_x, p_y) or p_c, and for simplicity we rescale the pixel coordinates and values to be within [0, 1]. We adopt the conventional MLP architecture to parameterize F as a chain of fully connected layers, with the activation function usually set as a ReLU or sinusoidal function. Various embedding functions of the input coordinate (x, y) have been proposed, but in this work we apply no embedding and use sinusoidal activations [Sitzmann et al. 2020], which are sufficient for fitting single 2D images. The primary training objective of the INR F for single 2D images is to minimize the reconstruction error between the predicted p_c and the ground truth p_c^GT across all known pixels in a single image, namely

L_SingleRecon = Σ_p ∥p_c − p_c^GT∥².   (2)

3.2 Extension to Multiple Images. Our goal is to use a single network F as the INR for multiple images from the same scene. Prior methods assume the camera layout (for planar light fields [Feng and Varshney 2021]) or known camera poses in the pipeline (for general light fields [Attal et al. 2022; Sitzmann et al. 2021]), but we are interested in pushing the limit to where the camera pose of each image is unknown. In our 3D-agnostic setup which does not consider camera poses, we assign a randomly initialized vector z ∈ R^M for each image, [...] or demonstration of the interpolation results between different viewpoints, after training is finished.

4 EXPERIMENTS
In this section, we provide more results on view interpolation and ablation studies on the techniques introduced in Section 3.
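The training setup of Sections 3.1 and 3.2 can be sketched numerically as follows; the network is untrained, and the code dimension M, the layer widths, and the SIREN-style activation are illustrative assumptions rather than the paper’s exact configuration:

```python
import numpy as np

# Toy sketch of the VIINTER setup: a single MLP F maps (p_x, p_y, z) to an
# RGB value, where z is a per-image latent code kept at unit p-norm; a novel
# view would be decoded from the interpolated code (1 - t) * z_i + t * z_j.
# The weights are randomly initialized (untrained); sizes are illustrative.

rng = np.random.default_rng(0)
M = 32                                     # latent code dimension (assumed)
W1 = rng.normal(0, 1.0, (2 + M, 64))
W2 = rng.normal(0, 0.1, (64, 3))

def F(coords, z):
    # coords: (N, 2) pixel coordinates in [0, 1]; z: (M,) image code.
    x = np.concatenate([coords, np.tile(z, (len(coords), 1))], axis=1)
    h = np.sin(30.0 * x @ W1)              # sinusoidal activation
    return h @ W2                          # predicted RGB per pixel

def unit_norm(z, p=1):
    return z / np.linalg.norm(z, ord=p)    # control ||z||_p = 1

# Two images' codes, randomly initialized then normalized as in training.
z_i, z_j = unit_norm(rng.normal(size=M)), unit_norm(rng.normal(size=M))

coords = rng.uniform(0, 1, (5, 2))
gt = rng.uniform(0, 1, (5, 3))
loss = np.sum((F(coords, z_i) - gt) ** 2)  # Eq. (2), reconstruction error

z_mid = 0.5 * z_i + 0.5 * z_j              # code for an in-between view
print(F(coords, z_mid).shape, loss > 0)
```

After training, rendering an interpolated view amounts to one forward pass of `F` over all pixel coordinates with the interpolated code.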
We train VIINTER to encode real-world scenes captured under two different regimes: 4D light fields (viewpoints are on a 2D plane with the same orientation) and unstructured light fields (viewpoints are not aligned on a 2D grid and orientations might be rotated).

4D Planar Light Fields. We use scenes from the Stanford Light Field Archive [Wilburn et al. 2005], with 17 × 17 camera viewpoints on a 2D grid. We use a 5 × 5 subset by taking every 4th image horizontally and vertically. We render new views by selecting two trained codes and linearly interpolating them. The interpolation results are shown in Fig. 5, with more in the supplements.

Unstructured Light Fields. To test VIINTER on scenes with irregular camera layouts, we test on the LLFF dataset [Mildenhall et al. 2019] and our own volumetric dataset. The LLFF scenes are captured in natural indoor environments, while our own scenes come from a volumetric studio for human body captures. We present the interpolation results in Fig. 5, with more in the supplements.

Quantitative Evaluation. The unique challenge in evaluating our method is that we cannot explicitly specify a camera pose to render at. Nonetheless, to provide a quantitative evaluation, we approximately render at testing viewpoints by interpolating the codes from nearby known viewpoints. For example, for the Stanford Light Field scenes, we select two viewpoints in the 5 × 5 training set, viewpoints (4, 4) and (4, 8), and interpolate their learned codes with t = 0.5. Then we render the full image with the interpolated code and compare it against the actual test image (withheld from training) captured at viewpoint (4, 6). Thanks to the well-aligned structure of these 4D scenes, we can compute metrics like peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against a reasonable ground truth image.

Figure 1.16: Unstructured Light Field Results. We interpolate the learned codes of views i and j as (1 − t) · z_i + t · z_j. The camera movement between the two known views includes rotation and translation. The interpolation through the INR smoothly transforms the perspective, despite having no knowledge of 3D scene structure or camera pose. Images are zoomed in for easier evaluation.

Chapter 6 further includes evaluation results of VIINTER on different types of multi-view scenes. Example results are presented in Figure 1.16. VIINTER takes an important step toward revealing the new potential of neural fields, or INRs: with careful modifications, they can perform view interpolation without 3D structures. This work offers a promising outlook on employing them for image manipulation tasks beyond simply representing known images.

Chapter 2: Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss

2.1 Introduction

Traditional depth estimation uses binocular or multi-view stereo image inputs [11–14]. Based on explicit geometric constraints, most of these stereo methods infer relative depth by computing stereo disparity, i.e., the distance between a pixel’s location in one image and its corresponding location in the other image. The rise of deep learning enables direct training of convolutional neural networks (CNNs) for depth estimation by implicitly computing the matching cost between pixels in stereo images. However, since stereo images are not easily accessible, depth estimation on monocular images serves as a valuable alternative. Deep CNNs for this problem have shown promising results. A unique advantage of this approach is that monocular CNNs can be trained on both monocular image datasets and stereo image datasets.

As virtual and augmented reality (VR and AR) become more commoditized and panoramic cameras become ubiquitous, 360◦ visual content is becoming more relevant [15–17]. The interactive nature of VR and AR fosters an urgent need for methods that estimate depth information from 2D views to instill more creative freedom in content rendering and interaction, including reconstructing the original 3D scenes and synthesizing views from novel angles [18, 19].

Figure 2.1: Method Overview. The CNN takes in a 360◦ image and for each pixel estimates its depth, normal, and the uncertainty of that depth estimate. These three estimates are used by a refinement module to produce the final depth estimate for each pixel. We train the CNN using a novel loss based on the double quaternion representation of the depth and normal.

However, most of the previous research on depth estimation targets traditional perspective images. Unlike typical photographs captured on a planar sensor, 360◦ images have a spherical layout. For 360◦ stereo images, traditional depth-estimation methods based on binocular disparity are not directly applicable due to the spherical singularity at the stereo epipoles. Moreover, CNNs trained on narrow-field-of-view images for monocular depth estimation perform poorly on 360◦ monocular images because of the significant domain shift from traditional perspective to wide-field-of-view equirectangular images.

Zioulis et al. [20] and Lai et al. [4] have recently released separate datasets for depth estimation on 360◦ images. While both datasets provide multi-view stereo images, their choices of baseline distance between cameras are vastly different. This difference signifies a severe drawback of training a CNN that simply takes in a stereo image pair: networks directly trained on stereo images with a particular baseline cannot adapt to different baseline configurations at test time. Moreover, such networks require a fixed baseline in training, making it difficult to aggregate training data from multiple datasets. Therefore, training a depth estimation CNN with monocular input seems more favorable.
Among methods that train CNNs for depth estimation, joint estimation of depth and normal is commonly adopted as an augmentation technique. However, to the best of our knowledge, all previous networks that jointly estimate normal and depth consider the errors from depth and normal separately. While Qi et al. [21] and Yang et al. [22] have proposed depth refinement methods that explicitly link surface normal estimates with depth estimates, their methods are based on the planar-sensor camera model for traditional narrow-field-of-view images and do not map well to 360◦ images. Moreover, their refinement procedures modify all pixel points uniformly in the estimated depth map and do not consider the varying quality across different regions.

In this chapter, we present a new framework for 360◦ depth estimation. We start from a generic CNN that jointly estimates depth and surface normal based on monocular RGB images. We develop a new loss for this joint estimation task, which combines depth and surface normals into a 4D hyperspherical space with a double quaternion approximation. We implement depth refinement using the normal estimates produced by this network. In contrast with previous normal-based refinement methods on perspective images, our new method adaptively adjusts the refinement to the initial depth estimates via an uncertainty score map that is also estimated by the CNN. This uncertainty construct allows us to identify image regions where further refinement could be helpful and avoid unnecessary changes to estimates that the network expects to be accurate. Furthermore, to make full use of available image data, we introduce a stereo loss when training the CNN on stereo-image pairs. After producing two separate monocular estimates of depth and normal for a stereo image pair, the CNN learns to minimize their hyperspherical angular difference.
By this design, the monocular network can take advantage of stereo training data without being restricted by a particular stereo baseline distance. Experiments show the improved performance of our proposed framework compared to previous methods on 360◦ depth estimation. In summary, our contributions include:

• An adaptive depth refinement framework for 360◦ images using normal estimates and uncertainty scores.
• A new way to incorporate the depth and surface normal estimates for a 3D point into a hyperspherical 4D space using a double quaternion approximation.
• A stereo loss that enables the CNN to learn stereo consistency and remain flexible across datasets with different stereo baseline distances.

2.2 Related Work

We first present learning-based methods for monocular and stereo depth estimation on 360◦ images, followed by previous work on using the surface normal to refine depth from perspective images. We then present previous approaches that incorporate quaternion representations in estimating surface normals and approximating 3D motions.

2.2.1 Depth Estimation on 360-degree Images

Several methods have been used to perform depth estimation [2, 23–29] and surface normal estimation [28, 30–32] on perspective images. Unfortunately, 360◦ images are distorted by equirectangular projection and contain irregular disparity patterns due to the spherical singularity at the stereo epipoles. Therefore, depth estimation on 360◦ images requires special adaptations.

One approach for learning on 360◦ images is to project pixels onto rectified cubemaps and then perform inference using pre-trained CNNs. Huang et al. [33] apply the traditional structure-from-motion (SfM) algorithm [34] to 3D scene reconstruction by projecting each 360◦ video frame onto a cubemap. Monroy et al. [35] obtain 360◦ saliency maps following this approach, but the distortion and discontinuity among cubemap patches are not handled by their method.
Cube padding [36, 37] was introduced to help resolve the cubemap distortion problem by padding each patch with features from adjacent cubemap patches.

Another approach for 360◦ depth estimation is to transfer models for perspective images to 360◦ images. To account for the distortion from equirectangular projection, Su and Graumann [38] modified a CNN trained on perspective images by varying the kernel shape based on its location on the sphere. Su and Graumann [39] improved the previous method by learning a transformation function for kernels pre-trained on perspective images without separately training new kernels for each location. Zioulis et al. [1] directly train CNNs on 360◦ images using rectangular kernels of varying resolutions along with traditional square kernels to cover different distortion levels. They also adopt dilated convolutions [40] to increase the receptive field and enable the networks to gather more global information. Lee et al. [41] use a spherical polyhedron to represent 360◦ images and devise special convolution and pooling kernels for image pixels after they are projected onto the polyhedron. Tateno et al. [42] deform the kernel sampling grid to compensate for distortions in spherical images. For the similar task of saliency detection on 360◦ videos, Zhang et al. [43] also define kernels on the 360◦ sphere and resample the kernels on the grid points for every location in the equirectangular projection.

Unsupervised learning through view synthesis has also been exploited for depth estimation [22, 44]. De La Garanderie et al. [45] use the stereo consistency of perspective images to achieve unsupervised depth estimation on panoramic images. Wang et al. [37] explore self-supervised depth estimation from 360◦ images through cubemap projection. Zioulis et al. [20] introduced the view-synthesis approach into the realm of omnidirectional 360◦ images.
Aware of the distortion problem of 360◦ images, they also adaptively weight the loss contribution of each pixel based on its coordinates on the image grid.

While most previous work on 360◦ depth estimation focuses on monocular input, Lai et al. [4] present a framework for stereo depth estimation on 360◦ images with a CNN that produces a depth map for a horizontally displaced pair of images. Xie et al. [5] further extend this stereo depth estimation framework to include deformable convolution and correlation convolution. Wang et al. [46] propose a learnable cost volume approach for spherical stereo depth estimation, which also shows promising results.

2.2.2 Joint Estimation of Depth and Normal

Motivated by the inherent geometric relationship between the depth and normal estimates of points on the same surface, several methods include surface normal information in depth estimation. Wang et al. [47] deploy a dense conditional random field on initial estimates of normal and depth, which produces more regularized depth and normal outputs with better geometric consistency. Eigen and Fergus [48] also simultaneously estimate depth, surface normal, and semantic segmentation for perspective images.

Furthermore, the depth-normal relationship can be explicitly constructed. Two spatially close points with similar surface normal estimates are approximately co-planar, and thus they form a vector that is orthogonal to the surface normal. Building upon this assumption, Qi et al. [21] introduce a module that refines the depth estimates produced by a CNN using its normal estimates. Likewise, Yang et al. [22] formulate this depth-normal relationship as a quadratic minimization problem for a set of linear equations constructed from the local depth and normal estimates in a small region. However, these methods do not consider the varying quality of CNN estimates across different regions. Lai et al. [4] also use surface normal information to improve depth estimation.
To the best of our knowledge, theirs is the first work that implements a joint estimation of depth and normal on 360◦ images. However, their method only includes the normal as an auxiliary task of the CNN, without further exploiting the explicit geometric relationship between depth and surface normal.

2.2.3 Use of Quaternions

Quaternions are widely used in computer graphics to represent rotation transformations of 3D points. By representing the surface normal as a pure quaternion, Karakottas et al. [49] calculate the angular loss of normal predictions based on the quaternion product of the estimated and ground-truth normal vectors.

As a natural extension of quaternions, double quaternions integrate the rotation and translation components for motion interpolation [50]. Unlike traditional 3D point representations, where spatial displacements are separately characterized into translation and rotation, double quaternions provide a unified framework to approximate 3D displacements as rotations in 4D space. In other words, the difference between two 3D spatial displacements can be described by their angular distance in 4D.

In this chapter, we introduce a method that directly unifies depth and surface normal information into a single measurement based on a double quaternion approximation. With this novel construct, the predicted and the ground-truth depth and normals can be converted into two double quaternions. We thus derive a new loss specifically for the joint estimation of depth and surface normals. We also take advantage of this double quaternion representation to measure the discrepancy between two CNN estimates from a stereo image pair. After transforming the two separate estimates into a homogeneous coordinate system, we derive a stereo loss based on the double quaternion angular distance between these two sets of estimates.

2.3 Method

Our goal is to train a CNN for 360◦ depth estimation.
To exploit the information from surface normals, the CNN produces a normal map and an uncertainty map for the initial depth estimates, which we feed into a refinement procedure to produce a final depth map. We derive a loss function based on double quaternions to facilitate better depth-normal joint learning. To further use datasets containing stereo pairs, we introduce a stereo loss also based on double quaternions.

2.3.1 CNN Architecture

We adopt the commonly used U-Net architecture with skip connections, as shown in Figure 2.1. For an input RGB image of size h × w × 3, the CNN produces three separate outputs: 1) an h × w × 1 depth map, 2) an h × w × 1 uncertainty map for the depth, and 3) an h × w × 3 normal map. These three output maps are fed into a refinement step detailed in Section 2.3.2.

Figure 2.2: Network Architecture. We adopt the commonly used U-Net architecture for end-to-end per-pixel estimation. The first six blocks in the encoding part are based on the VGG-16 model [51]. The decoding part is symmetric to the encoding part, and it outputs a depth map, an uncertainty map, and a normal map. These three maps are combined to produce a final depth map using the method described in Section 2.3.

2.3.2 Depth Refinement based on Normal

In general, image-based depth estimation aims to recover the depth value of a 3D point (x, y, z) given its projected pixel location (u, v) in an image. The depth value for a pixel in a 360◦ image is defined as the distance of its corresponding 3D point from the camera:

r = √(x² + y² + z²)   (2.1)

Moreover, the pixel coordinates (u, v) of a 360◦ image with width w and height h directly correspond to the spherical coordinates (θ, ϕ) of its corresponding 3D point.
ϕ = 2πu,    θ = π(2v − 1)/2,    u, v ∈ [0, 1]    (2.2)

The direct conversion between spherical and Cartesian coordinates in 3D is given as follows:

x = r sin ϕ sin θ,    y = r cos θ,    z = r cos ϕ sin θ    (2.3)

Using equations (2.2) and (2.3), we can obtain the relationship that maps 2D grid coordinates to 3D Cartesian coordinates for 360◦ depth maps. Using the normal estimates (n_ix, n_iy, n_iz) also produced by the CNN, we can further formulate the following equations based on the orthogonality between the surface normal vector and the in-plane vector between points (x_i, y_i, z_i) and (x_j, y_j, z_j):

n_ix(x − x_i) + n_iy(y − y_i) + n_iz(z − z_i) = 0    (2.4)

(n_ix x_j + n_iy y_j + n_iz z_j) / (n_ix x_i + n_iy y_i + n_iz z_i) = 1    (2.5)

Then, using an assumption similar to Qi et al. [21], for pixels within a small region, we treat their corresponding 3D points as co-planar if their surface normal estimates are also similar. Thus, we obtain an approximately co-planar neighborhood N_i for each image pixel P_i using spatio-angular measures defined as follows:

N_i = {(x_j, y_j, z_j) | n_j⊺ n_i > α, |u_i − u_j| < β, |v_i − v_j| < β}    (2.6)

where (u_i, v_i) and (u_j, v_j) are the 2D grid coordinates of pixels P_i and P_j, β is the parameter that controls the size of the spatial neighborhood, and α controls the size of the angular neighborhood. A larger value of n_j⊺ n_i implies a greater likelihood that the corresponding 3D points for P_i and P_j are co-planar. For each neighbor P_j ∈ N_i, we may obtain an estimate r_ij of the depth r_i of P_i by substituting the spherical coordinates of equation (2.3) into (2.5):

r_ij = (n_ix x_j + n_iy y_j + n_iz z_j) / (n_ix sin ϕ_i sin θ_i + n_iy cos θ_i + n_iz cos ϕ_i sin θ_i)    (2.7)

where θ_i and ϕ_i are determined by Eq. (2.2). Note that the calculation in Eq. (2.7) suffers from instability when the denominator is close to zero, producing abnormal values.
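To make Eqs. (2.2), (2.3), and (2.7) concrete, the following minimal sketch (plain Python; the function names are ours, not from the dissertation) maps a normalized pixel coordinate to a unit ray and estimates a pixel's depth from a co-planar neighbor:

```python
import math

def pixel_to_direction(u, v):
    """Map normalized equirectangular pixel coordinates (u, v) in [0, 1]
    to a unit 3D direction, following Eqs. (2.2) and (2.3) with r = 1."""
    phi = 2.0 * math.pi * u
    theta = math.pi * (2.0 * v - 1.0) / 2.0
    return (math.sin(phi) * math.sin(theta),
            math.cos(theta),
            math.cos(phi) * math.sin(theta))

def depth_from_neighbor(n_i, p_j, u_i, v_i, eps=1e-8):
    """Estimate the depth r_ij of pixel P_i from a co-planar neighbor's 3D
    point p_j and the normal n_i predicted at P_i (Eq. 2.7). Returns None
    when the denominator is near zero, the unstable case noted above."""
    d = pixel_to_direction(u_i, v_i)            # unit ray through P_i
    denom = sum(a * b for a, b in zip(n_i, d))  # n_i dotted with the ray
    if abs(denom) < eps:
        return None
    return sum(a * b for a, b in zip(n_i, p_j)) / denom
```

For a neighbor lying on the plane through P_i's 3D point with normal n_i, this recovers P_i's depth exactly; the `None` branch corresponds to the unstable estimates filtered out by the constraints below.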
Thus, we leave out any depth estimate that violates the following constraints:

0 < r_ij < 255,    max(r_ij / r_i, r_i / r_ij) < 10    (2.8)

For any r_ij that violates the constraints in Eq. (2.8), we set it to r_i, the original depth estimate of P_i.

2.3.3 Aggregation with Confidence Scores

For each pixel P_i, we aggregate the estimates of its depth r_i from its neighbors P_j ∈ N_i using normalized weights. These weights have two components. First, we use the uncertainty score q_j of pixel P_j from the CNN output to compute its confidence value C(P_j) = 1 − q_j². An example of the uncertainty score output maps can be seen in Figure 2.6. Second, the neighbor P_j's contribution is also weighted by W(P_i, P_j), the dot product between the respective normals n_i and n_j. Specifically, we aggregate the depth estimates for each pixel P_i with its neighbors as:

r_{N_i} = ( Σ_{P_j ∈ N_i} C(P_j) · W(P_i, P_j) · r_ij ) / ( Σ_{P_j ∈ N_i} C(P_j) · W(P_i, P_j) )    (2.9)

with C(P_j) = 1 − q_j² and W(P_i, P_j) = n_j⊺ n_i. Finally, the refined depth r̂_i for P_i is calculated as:

r̂_i = C(P_i) · r_i + (1 − C(P_i)) · r_{N_i}    (2.10)

In other words, for a pixel with higher uncertainty and lower confidence, we place greater reliance on its neighbors to refine its initial depth estimate. On the other hand, if a pixel has a low uncertainty score, the CNN believes its depth estimate is likely accurate, so the neighbor estimates are less informative. This formulation allows us to adaptively refine the initial CNN estimates and avoid unnecessary modifications of the already robust estimates.

Figure 2.3: Hyperspherical Rotation Approximation. This figure illustrates that a displacement in 2D (shown at bottom right) can be regarded as a rotation around the center of a 3D sphere. Similarly, a double quaternion approximates a 3D displacement as a rotation on a suitably large 4D sphere.
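The refinement of Eqs. (2.8) through (2.10) can be sketched as follows (a minimal plain-Python version; the function name and the tuple layout of the neighbor list are ours):

```python
def refine_depth(r_i, q_i, n_i, neighbors):
    """Confidence-weighted depth refinement (Eqs. 2.8-2.10).
    r_i, q_i, n_i: initial depth, uncertainty score, and unit normal of P_i.
    neighbors: list of (r_ij, q_j, n_j) tuples, one per P_j in N_i."""
    num = den = 0.0
    for r_ij, q_j, n_j in neighbors:
        # Eq. 2.8: replace unstable estimates by the original depth r_i
        if not (0.0 < r_ij < 255.0) or max(r_ij / r_i, r_i / r_ij) >= 10.0:
            r_ij = r_i
        c_j = 1.0 - q_j ** 2                         # confidence C(P_j)
        w_ij = sum(a * b for a, b in zip(n_j, n_i))  # W(P_i, P_j) = n_j . n_i
        num += c_j * w_ij * r_ij
        den += c_j * w_ij
    r_neighborhood = num / den                       # Eq. 2.9
    c_i = 1.0 - q_i ** 2
    return c_i * r_i + (1.0 - c_i) * r_neighborhood  # Eq. 2.10
```

Note how a maximally uncertain pixel (q_i = 1) takes its refined depth entirely from the neighborhood, while a fully confident one (q_i = 0) keeps its initial estimate.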
2.3.4 Double Quaternion Approximation of Depth and Normal in the Loss Function

2.3.4.1 Constructing Double Quaternions

Since a point's spatial coordinate (x, y, z) represents a translation from the coordinate origin (0, 0, 0), a pixel's corresponding depth and surface normal orientation can be viewed as a 3D translation and rotation, respectively. The translation component of a 2D spatial displacement can be viewed as a rotation with respect to the origin of the 3D coordinate system. In fact, a similar approximation can be made from 3D to 4D. McCarthy [52, 53] has shown that the homogeneous transform of 3D spatial displacements with rotation and translation is the limiting case of a 4D rotation as the radius R of the 4D sphere approaches infinity. Thus, we combine the 3D depth (translation) and normal (rotation) into one 4D measurement, which is represented by a double quaternion. Specifically, a 3D translation d can be approximated by a rotation on a 4D sphere of radius R, with R → ∞, through an angle ψ satisfying sin(ψ) ≈ ψ = |d| / R as ψ → 0. The quaternion representing this 4D rotation is:

D = cos(ψ/2) + sin(ψ/2) · d/|d|    (2.11)

Therefore, the 3D translation is represented by the double quaternion (D, D*), where D* denotes the conjugate of D, and the 3D rotation is represented as (Q, Q), where

Q = (0, n_x, n_y, n_z)    (2.12)

Two double quaternions (G_1, H_1) and (G_2, H_2) can be composed into a new double quaternion (G_3, H_3), where

G_3 = G_1 G_2,    H_3 = H_1 H_2    (2.13)

Moreover, following Ge et al.
[50], we can compute the spatial distance between two double quaternions (G_1, H_1) and (G_2, H_2) as the angles between the respective double quaternion components:

α = cos⁻¹(G_1 · G_2),    β = cos⁻¹(H_1 · H_2)    (2.14)

2.3.4.2 Loss Function Based on Double Quaternions

Based on Eq. (2.13), we combine the double quaternions representing translation and rotation (Eqs. (2.11) and (2.12)) into a double quaternion representation for a depth and normal estimation pair:

G = DQ,    H = D*Q    (2.15)

We thus derive a loss function based on the angular distance between the two double quaternions, predicted (G_Pred, H_Pred) and ground-truth (G_GT, H_GT):

L_DQ = √(α_DQ² + β_DQ²)    (2.16)

where α_DQ and β_DQ are calculated as in Eq. (2.14).

2.3.5 Stereo Consistency

While training on datasets with stereo pairs, we further impose a stereo loss to minimize the discrepancy between the estimates from the two horizontally displaced images. Given a known baseline distance b between a horizontal stereo pair, a pixel in one image L can be mapped onto the other image R with the following equations:

ϕ_R = ϕ_L + b · cos(ϕ_L) / (r_L · sin(θ_L)),    θ_R = θ_L + b · sin(ϕ_L) · cos(θ_L) / r_L    (2.17)

Following the procedure presented in Section 2.3.2, we combine the depth and normal estimates from the stereo pair images into two double quaternions (G_L, H_L) and (G_R, H_R), from which we calculate the stereo loss:

L_Stereo = √(α_Stereo² + β_Stereo²)    (2.18)

where α_Stereo and β_Stereo are also calculated as in Eq. (2.14).

2.3.6 Overall Loss Function

With the double-quaternion-based losses derived above, we present the overall loss function for network training:

L_total = L_berHu + L_DQ + L_Stereo    (2.19)

Here, L_berHu is the reverse Huber loss function for both the depth and normal estimates compared to their respective ground truth [2]. In effect, this loss is equivalent to the mean absolute error for errors below a threshold c, and to a weighted mean squared loss for errors larger than c. We follow Laina et al.
[2] and set c as 20% of the maximal error among all images of the current batch. We follow Lai et al. [4] and place extra weight on errors at boundary pixels when calculating L_berHu.

2.4 Experiments

We have trained and evaluated the performance of our method on the ODS dataset [4]. It contains 40,000 frames of indoor scenes from the Stanford 2D-3D-Semantics Dataset [54] with ground truth depth and surface normals. We adopt the same training-validation data split and evaluation metrics as Lai et al. [4]. We have also evaluated our method on the 360D dataset provided by Zioulis et al. [20].

2.4.1 Training Details

We initialize the encoding blocks of the CNN shown in Figure 2.1 with the commonly used VGG-16 [51] pre-trained weights. We use the Adam optimizer with its default parameters. We follow the data augmentation procedures detailed in Lai et al. [4] to introduce more variability in the data. To be consistent with previous work, we train our networks for 40 epochs on this dataset to enable direct comparison of method performance.

Algorithm 1: Steps to Compute Training Loss
  Input: horizontally displaced image pair I, I′
  Parameters: weights θ of CNN
  Labels: depth map D_I and surface normal map S_I
  Output: initial depth D̃_init, refined depth D̃_I, estimated normal S̃_I, total loss L_total
  For each training iteration:
    1. D̃_init, S̃_I = CNN(I)                 (Section 2.3.1)
    2. D̃′_init, S̃_I′ = CNN(I′)              (Section 2.3.1)
    3. D̃_I = Refine(D̃_init, S̃_I)            (Sections 2.3.2-2.3.3)
    4. D̃_I′ = Refine(D̃′_init, S̃_I′)         (Sections 2.3.2-2.3.3)
    5. L_DQ = DQLoss(D̃_I, S̃_I, D_I, S_I)     (Section 2.3.4)
    6. L_Stereo = StereoLoss(D̃_I, D̃_I′)      (Section 2.3.5)
    7. L_berHu = berHuLoss(D̃_I, D_I)          (Section 2.3.6)
    8. L_total = L_berHu + L_DQ + L_Stereo
    9. θ = Update(L_total, θ)

We adopt the conventional depth estimation metrics [1, 2, 4, 55]. We denote the absolute prediction error at a pixel i as E_i = |y_i − ŷ_i|, where y_i is the ground truth depth and ŷ_i is the predicted depth.
δ_j refers to the percentage of pixels with max(y_i/ŷ_i, ŷ_i/y_i) < 1.25^j. The other metrics used and their definitions are listed below:

RMSE: √( (1/N) Σ_{i=1}^{N} E_i² )
Abs. Rel.: (1/N) Σ_{i=1}^{N} E_i / ŷ_i
RMLSE: √( (1/N) Σ_{i=1}^{N} (ln y_i − ln ŷ_i)² )
Sq. Rel.: (1/N) Σ_{i=1}^{N} E_i² / ŷ_i
Log10: (1/N) Σ_{i=1}^{N} |log₁₀ y_i − log₁₀ ŷ_i|
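To make the double-quaternion loss used in Algorithm 1 concrete, the following minimal sketch (plain Python; the sphere radius R = 1e4 is an illustrative choice, not a value prescribed by the dissertation) builds the pair (G, H) = (DQ, D*Q) of Eq. (2.15) for a single pixel and evaluates the angular distance of Eq. (2.16):

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def conj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])

def depth_normal_dq(point, normal, R=1e4):
    """Build the double quaternion (G, H) = (DQ, D*Q) of Eqs. 2.11-2.15,
    where `point` is the pixel's 3D point (the translation d) and `normal`
    its unit surface normal."""
    d_norm = math.sqrt(sum(c * c for c in point))
    psi = d_norm / R                        # sin(psi) ~ psi = |d| / R
    s = math.sin(psi / 2.0) / d_norm
    D = (math.cos(psi / 2.0), s * point[0], s * point[1], s * point[2])
    Q = (0.0,) + tuple(normal)              # pure quaternion, Eq. 2.12
    return qmul(D, Q), qmul(conj(D), Q)

def dq_loss(G_pred, H_pred, G_gt, H_gt):
    """Angular distance between two double quaternions (Eqs. 2.14, 2.16)."""
    dot_g = max(-1.0, min(1.0, sum(a * b for a, b in zip(G_pred, G_gt))))
    dot_h = max(-1.0, min(1.0, sum(a * b for a, b in zip(H_pred, H_gt))))
    return math.hypot(math.acos(dot_g), math.acos(dot_h))
```

Because D and Q are unit quaternions, G and H stay on the unit 4D sphere, so the dot products in `dq_loss` are the cosines of the angles α and β.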
Figure 2.4: Surface Normal Prediction. Training with the double quaternion loss enables the network to produce better surface normal estimates. In the first column, we show the input image for reference. In the third column, we show predictions from the traditional baseline model, in which we separately calculate depth and normal losses without combining them into a double quaternion form. In the fourth column, we show results from our full model.
Figure 2.5: Depth Refinement Results. We compare the initial depth estimates produced by the network and the refined output based on surface normals. In the first column, we show the input image for reference.

2.4.2 Comparison with Other Methods

The performance of a CNN trained with our method is shown in Tables 2.1 and 2.2. Compared to the other methods in Tables 2.1 and 2.2, our network shows improved performance in almost all metrics. In Figure 2.7 we show an example where our method better preserves the geometric detail of the scene. We believe this is because our model is aware of the surface normals and can use them to improve depth estimation.
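For reference, the evaluation metrics reported in Tables 2.1 and 2.2 can be computed as in the following sketch (plain Python; note that, as written in Section 2.4.1, the relative errors are normalized by the prediction ŷ):

```python
import math

def depth_metrics(y, y_hat):
    """Standard depth metrics from ground-truth depths y and predictions
    y_hat (equal-length sequences of positive values)."""
    n = len(y)
    e = [abs(a - b) for a, b in zip(y, y_hat)]
    ratio = [max(a / b, b / a) for a, b in zip(y, y_hat)]
    return {
        "RMSE": math.sqrt(sum(v * v for v in e) / n),
        "RMLSE": math.sqrt(sum((math.log(a) - math.log(b)) ** 2
                               for a, b in zip(y, y_hat)) / n),
        "AbsRel": sum(v / b for v, b in zip(e, y_hat)) / n,
        "SqRel": sum(v * v / b for v, b in zip(e, y_hat)) / n,
        "Log10": sum(abs(math.log10(a) - math.log10(b))
                     for a, b in zip(y, y_hat)) / n,
        "delta1": sum(r < 1.25 for r in ratio) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratio) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratio) / n,
    }
```

A perfect prediction yields zero for all error metrics and 1.0 for all three δ accuracies.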
Table 2.1: Performance Comparison on the ODS dataset [4]. Evaluation statistics for rows 1-7 are taken directly from Lai et al. [4] and Xie et al. [5]. Our method produces superior results in most metrics.

| Method        | RMSE  | Log10 | AbsRel | δ1    | δ2    | δ3    |
|---------------|-------|-------|--------|-------|-------|-------|
| UResNet [1]   | 2.037 | 0.326 | 16.906 | 0.213 | 0.399 | 0.560 |
| RectNet [1]   | 1.738 | 0.291 | 16.132 | 0.240 | 0.453 | 0.634 |
| FCRN [2]      | 0.672 | 0.101 | 7.448  | 0.806 | 0.932 | 0.966 |
| PSMNet [3]    | 0.393 | 0.059 | 5.641  | 0.953 | 0.975 | 0.980 |
| SepUNet [4]   | 0.495 | 0.042 | 1.779  | 0.944 | 0.987 | 0.993 |
| SepUNetS [4]  | 0.614 | 0.072 | 1.841  | 0.835 | 0.966 | 0.985 |
| SepUNetDD [5] | 0.392 | 0.036 | 2.120  | 0.960 | 0.987 | 0.992 |
| Ours          | 0.389 | 0.031 | 0.413  | 0.954 | 0.984 | 0.990 |

2.4.3 Ablation Studies

We present the performance comparisons in Table 2.3. We observe decreased estimation accuracy with the removal of each component of the loss function. Figures 2.4, 2.5, and 2.6 further illustrate the impact of our method. It is worth noting that the network trained with double quaternions produces smoother normal estimates, which could explain the increase in estimation accuracy, since the normal-based refinement method relies on accurate normal estimates.

2.5 Limitations and Conclusion

We have shown how the double-quaternion loss is useful for reducing geometric inconsistency and improving estimation accuracy. Our results indicate that the double quaternion construct holds meaningful potential for other tasks that involve processing 360◦ images.
We hope our work will bring a new hyperspherical perspective to analyzing omnidirectional visual data, as a complement to the traditional Cartesian (or equirectangular) perspective.

Table 2.2: Performance Comparison on the 360D dataset [20]. Evaluation statistics for rows 1-5 are taken directly from Zioulis et al. [1]. Our method surpasses the other methods in all metrics except AbsRel.

| Method         | RMSE   | RMLSE  | AbsRel | SqRel  | δ1     | δ2     | δ3     |
|----------------|--------|--------|--------|--------|--------|--------|--------|
| UResNet [1]    | 0.3374 | 0.1204 | 0.0835 | 0.0416 | 0.9319 | 0.9889 | 0.9968 |
| RectNet [1]    | 0.2911 | 0.1017 | 0.0702 | 0.0297 | 0.9574 | 0.9933 | 0.9979 |
| monoDepth [44] | 7.2097 | 0.8200 | 0.4747 | 2.3783 | 0.2970 | 0.7900 | 0.7510 |
| FCRN [2]       | 0.9410 | 0.3760 | 0.3181 | 0.4469 | 0.4922 | 0.7792 | 0.9150 |
| DCRF [25]      | 1.1596 | 0.4400 | 0.4202 | 0.7597 | 0.3889 | 0.7044 | 0.8774 |
| Ours           | 0.2373 | 0.0907 | 0.0859 | 0.0213 | 0.9690 | 0.9954 | 0.9988 |

Figure 2.6: Uncertainty Estimates. The network learns to produce meaningful uncertainty maps by effectively grasping the object's geometric outline. It places higher uncertainty near object edges, where depth predictions tend to be overly smooth and prone to error.
Figure 2.7: More Qualitative Comparisons. Here we show an example from a test image from the 360D dataset [1]. Note that our result largely preserves the geometry of the hallway railings.

Table 2.3: Ablation Results. Evaluation statistics are based on prediction results on the ODS dataset [4]. Rows 2-4 show the network performance when trained without the double quaternion loss, the depth refinement step, and the stereo consistency loss, respectively. The results show that each component of our proposed method contributes to better estimation accuracy.

| Method         | RMSE   | RMLSE  | AbsRel | SqRel  | δ1     | δ2     | δ3     |
|----------------|--------|--------|--------|--------|--------|--------|--------|
| Full model     | 0.3894 | 0.2572 | 0.4130 | 0.6872 | 0.9543 | 0.9836 | 0.9904 |
| w/o L_DQ       | 0.4731 | 0.3452 | 0.5830 | 0.9012 | 0.9257 | 0.9718 | 0.9880 |
| w/o Refinement | 0.4114 | 0.3190 | 0.5535 | 0.9220 | 0.9313 | 0.9780 | 0.9903 |
| w/o L_Stereo   | 0.3953 | 0.2622 | 0.4562 | 0.7068 | 0.9530 | 0.9801 | 0.9904 |

Our method achieves good performance on the testing scenes in the given datasets. One of the assumptions our method makes is that the normals can be
estimated well and provide meaningful guidance for depth refinement. In addition, the quality of our depth estimation on real-world 360◦ images depends on their domain similarity to the training dataset on which the model is trained. Our method does not perform well if either of these assumptions fails to hold. Furthermore, as previously discussed, direct learning on 360◦ images suffers from image distortion, which is not explicitly addressed by our method. In particular, we directly deploy a 2D CNN with regular, square kernels without any modification. Thus, it would be worthwhile to incorporate methods that alleviate the distortion problem, such as modifying the convolutional kernels to account for distortion, or performing convolution directly on spheres instead of on images with equirectangular projection.

In summary, we present a new framework for 360◦ depth estimation using a CNN. We use the double quaternion formulation to integrate depth and surface normals in the loss calculations. Experiments show superior results for the joint depth and normal estimation task. We also extend the double quaternion formulation to establish stereo consistency from the training data without restricting the network to a fixed baseline. We demonstrate quantitative and qualitative results that confirm the benefits of our new approach.

Chapter 3: Efficient Neural Representation for Light Fields

3.1 Introduction

Light fields offer an information-rich medium for capturing static and dynamic scenes. However, a significant barrier to their widespread adoption is the lack of sufficiently compact representations of such high-dimensional data, making them impractical for efficient storage, editing, and streaming. For example, a 1080p 60-fps light field video captured on a 10 × 10 camera grid easily requires several gigabytes of storage space for every second of content. A straightforward solution to compressing light fields is to apply existing, widely used compression methods such as JPEG and MPEG.
However, due to the sheer number of images captured in a light field, the compression rates of these single-view-based methods are far from satisfactory [56, 57]. It is therefore imperative to represent light fields compactly by taking advantage of their overlapping and repetitive visual patterns.

Extensive research has been devoted to designing compact light field representations based on the patch-based compression strategy manifest in the JPEG standard. These methods represent each image patch as a weighted sum over a small dictionary of basis functions, and the goal is to find new ways to construct dictionaries of basis functions that achieve better compression results. Yet, previous efforts have had limited success in enabling easy transmission and manipulation of light field content.

Figure 3.1: Overview of SIGNET. We train an MLP to approximate the mapping function from each pixel's coordinates to its color values. Our input transformation strategy based on the Gegenbauer polynomials enables the MLP to learn the high-dimensional mapping function more accurately.

Recent advances in deep learning have led to impressive results in representing data such as images and volumes [58-60] with neural networks. A common thread among these methods is the incorporation of Fourier-inspired modifications to the classical neural network design called the multilayer perceptron (MLP). Specifically, the SIREN network [59] uses a sinusoidal activation function between the MLP layers, while neural radiance field (NeRF) networks [58], designed for volumetric radiance data, show the effectiveness of applying cosine and sine transformations to the input coordinates. The improvement brought by the Fourier basis used in NeRF is further analyzed and formalized by Tancik et al. [60], who also successfully extend the neural representation to data such as 2D images and 3D shapes.
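As a minimal illustration of the coordinate-to-color idea (a sketch in the style of SIREN's sine activations, not the exact SIGNET architecture, which additionally transforms its inputs with Gegenbauer polynomials), a tiny coordinate-input MLP mapping a light-field coordinate (u, v, s, t) to an RGB triple might look like:

```python
import math
import random

def make_mlp(dims, seed=0):
    """Randomly initialized weights for a small MLP; layer sizes and the
    seed are illustrative choices."""
    rnd = random.Random(seed)
    layers = []
    for m, n in zip(dims[:-1], dims[1:]):
        w = [[rnd.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(n)]
             for _ in range(m)]
        layers.append((w, [0.0] * n))
    return layers

def mlp_forward(layers, coords, omega=30.0):
    """Map one coordinate tuple to a color: sine activations between the
    hidden layers, and a linear output head for the color values."""
    h = list(coords)
    for idx, (w, b) in enumerate(layers):
        z = [sum(h[i] * w[i][j] for i in range(len(h))) + b[j]
             for j in range(len(b))]
        if idx == len(layers) - 1:
            h = z                                        # linear output
        else:
            h = [math.sin(omega * zj) for zj in z]       # sine activation
    return h

layers = make_mlp([4, 32, 32, 3])
rgb = mlp_forward(layers, (0.1, 0.2, 0.3, 0.4))
```

Training such a network on all pixels of a light field, so that the weights themselves become the compressed representation, is the idea the rest of the chapter builds on.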
The proven capability of MLPs to express visual content with high fidelity implies that we could potentially compress a gigapixel light field within a few megabytes. However, as shown in Section 3.4, previous techniques fall short of representing light fields without visible artifacts.

In this chapter, we present a new framework that efficiently and accurately represents light field content using neural networks. Crucially, we introduce a novel input transformation strategy for the multi-dimensional light field coordinates based on the orthogonal Gegenbauer polynomials, which in our experiments works very well with the sinusoidal activation functions between the MLP layers. We call this network SIGNET (SInusoidal Gegenbauer NETwork), and we show its superiority for neural light field representation over a variety of Fourier-inspired input transformation strategies. SIGNET also achieves outstanding reconstruction quality with a higher compression rate than state-of-the-art dictionary-based light field compression methods. We further demonstrate how our MLP-based approach easily allows for view synthesis and super-resolution on the encoded light field scenes. In summary, our contributions are as follows:

• We present a neural representation of light fields that achieves high reconstruction quality and compression rate and offers pixel-level random access to the encoded light field.

• We introduce an input transformation strategy for coordinate-input MLPs using Gegenbauer polynomials, which outperforms other recently proposed techniques on light field data.

• We show that such a neural representation enables high-quality decoding at novel coordinates without additional training, achieving super-resolution along the spatial, angular, and temporal dimensions of light fields.

3.2 Related Work

Light Field Compression. Traditional compression relies on classical coding strategies that typically involve analytical basis functions such as the Fourier basis and wavelets.
Prior research has augmented this analytical approach with disparity [61-64] and geometry information [65]. Some sophisticated applications of light field video [66-68] also integrate motion prediction and build on existing video codec algorithms such as HEVC (H.265) [69] and VP9 [70]. More recently, Le Pendu et al. [71] present a Fourier Disparity Layer representation for light fields, which allows upsampling [72] and compression [73, 74] in the Fourier domain.

A different approach to light field compression involves learning a dictionary of basis functions, inspired by progress in sparse coding from machine learning, where dictionaries learned with data-driven algorithms have been shown to outperform analytical basis functions [75-78]. However, the dictionaries learned with conventional algorithms such as K-SVD [7] still contain too much redundancy and have a high storage cost. The current state-of-the-art methods [6, 79] for light field compression improve this approach by learning an ensemble of orthogonal dictionaries with a novel pre-clustering strategy.

We present a novel approach to this task by learning a neural representation of light fields. While our approach is rooted in the idea of basis functions, we fundamentally differ from the previous methods in that we use the expressive power of neural networks with non-linear activation functions to combine the basis functions into the desired output.

Light Field Interpolation. Most approaches rely on proxy information such as depth or optical flow [12, 19, 80-83]. Recently, deep learning methods have been used to infer depth and optical flow from light fields and to render novel viewpoints [84-88]. These methods warp the original frames to a novel viewpoint. While the results are impressive, they require access to the original light field data at run-time, incurring additional, sometimes prohibitive, costs in the light field processing pipeline.
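Since our input transformation builds on Gegenbauer polynomials, it may help to recall that they can be evaluated cheaply with the standard three-term recurrence; a minimal sketch:

```python
def gegenbauer(n, lam, x):
    """Evaluate the Gegenbauer (ultraspherical) polynomial C_n^lam(x)
    via the standard three-term recurrence:
      C_0 = 1,  C_1 = 2*lam*x,
      n*C_n = 2*x*(n + lam - 1)*C_{n-1} - (n + 2*lam - 2)*C_{n-2}."""
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr
```

For λ = 1/2 these reduce to the Legendre polynomials, and for λ = 1 to the Chebyshev polynomials of the second kind, two familiar special cases of the same family.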
In this chapter, we show how our neural light field representation naturally enables interpolation from the compressed data without explicit learning or proxy information. Although our presented network is not specifically designed for light field super-resolution or view synthesis, our results show its promising potential to be adapted for such tasks.

Coordinate-input MLP. Recent research [58-60] has shown the potential of using coordinate-input MLP networks to represent various kinds of data. A Fourier-inspired input transformation achieves state-of-the-art free-viewpoint synthesis on static scenes [58]. The sine activation, introduced in SIREN [59], allows a simple MLP with raw coordinate inputs to accurately model the coordinate-to-color mapping of data including images and videos. However, our experimental results show that these Fourier-inspired methods are unable to accurately model the coordinate-to-color mapping in light fields. We present a new transformation that allows MLPs to successfully represent dense light fields, and we show its applicability for compactly representing high-resolution light fields.

Figure 3.2: Illustration of Gegenbauer (Ultraspherical) Polynomials. We evaluate the 2D Gegenbauer basis functions on a 2D Cartesian grid (left) and a 3D polar grid (right). Only the first six orders of the basis are selected for illustration purposes.

Figure 3.3: We show examples of reconstructed images (left) and absolute errors (right). SIGNET achieves good accuracy while other methods find encoding this scene challenging.

Gegenbauer Polynomials. Previous research in applied mathematics has shown the effectiveness of Gegenbauer polynomials, also known as ultraspherical polynomials, in addressing the Gibbs phenomenon [89], a commonly observed artifact in MRI reconstruction using Fourier-based approximations [90, 91]. Gottlieb et al. [89] have shown that the finite Gegenbauer expansion of such functions provides better convergence and usually resolves the Gibbs artifact using fewer basis functions than the Fourier approach. Specifically, they show that given