ABSTRACT

Title of Dissertation: TOWARDS IMMERSIVE VISUAL CONTENT WITH MACHINE LEARNING

Brandon Yushan Feng, Doctor of Philosophy, 2023

Dissertation Directed by: Professor Amitabh Varshney, Department of Computer Science

Extended reality technology stands poised to revolutionize how we perceive, learn, and engage with our environment. However, transforming data captured in the physical world into digital content for immersive experiences continues to pose challenges. In this dissertation, I present my research on employing machine learning algorithms to enhance the generation and representation of immersive visual data.

Firstly, I address the issue of recovering depth information from videos captured using 360-degree cameras. I propose a novel technique that unifies the representation of object depth and surface normal utilizing double quaternions. Experimental results demonstrate that training with a double-quaternion-based loss function improves the prediction accuracy of a neural network using 360-degree video frames as input.

Secondly, I examine the problem of efficiently representing 4D light fields using the emerging concept of neural fields. Light fields hold significant potential for immersive visual applications; however, their widespread adoption is hindered by the substantial cost associated with storing and transmitting such high-dimensional data. I propose a novel approach for representing light fields. Deviating from previous approaches, I treat the light field data as a mapping function from pixel coordinates to color and train a neural network to accurately learn this mapping function. This functional representation enables high-quality interpolation and super-resolution for light fields while achieving state-of-the-art results in light field compression.

Thirdly, I present neural subspaces for light fields.
I adapt the ideas of subspace learning and tracking and identify the conceptual relationship between neural representations of light fields and the framework of subspace learning. My method considers a light field as an aggregate of local segments, or multiple local neural subspaces. A set of local neural networks is trained to encode each subset of viewpoints. Since each local network specializes in a specific region, this specialization allows for smaller networks without compromising accuracy.

Fourthly, I introduce a primary ray-based implicit function to represent geometric shapes. Traditional implicit shape representations, such as the signed distance function, describe a shape by its relationship to each spatial point. Such a point-based representation of shapes often necessitates costly iterative sphere tracing to render a surface hit point. I propose a ray-based approach to implicit neural shape modeling, wherein the shape is implicitly described by its relationship with each ray in 3D space. To render the hit point, my method only requires a single inference pass, considerably reducing the computational cost of rendering.

Lastly, I describe a technique to generate novel view renderings without relying on any 3D structure or camera pose information. I harness the power of neural fields to encode individual images without estimating their camera poses. My method learns a latent code for each image in the multi-view collection, and then produces plausible and photorealistic novel view renderings by interpolating their latent codes. This entirely 3D-agnostic approach avoids the computational cost incurred by 3D representations, offering a promising outlook on employing image-based neural fields for image manipulation tasks beyond fitting and super-resolving known images.
TOWARDS IMMERSIVE VISUAL CONTENT WITH MACHINE LEARNING

by

Brandon Yushan Feng

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Amitabh Varshney, Chair/Advisor
Assistant Professor Furong Huang
Assistant Professor Christopher A. Metzler
Associate Professor Jia-Bin Huang
Professor Joseph F. JaJa

© Copyright by Brandon Yushan Feng 2023

Acknowledgments

My first thanks go to my dissertation advisor, Amitabh Varshney. My doctoral studies have been incredibly intellectually rewarding and enlightening due to his thoughtful guidance and constant encouragement. Having the opportunity to conduct my doctoral training under his tutelage has been both a privilege and a pleasure.

I appreciate the time spent by all my committee members: Amitabh Varshney, Furong Huang, Christopher A. Metzler, Jia-Bin Huang, and Joseph F. JaJa. They all served on my committee, offered valuable suggestions for improvement, and asked insightful questions during my dissertation defense.

Beyond the work on this dissertation, I am privileged to have enjoyed fruitful collaborations with various research groups at UMD and beyond. I learned much about matrix sketching from Furong Huang. I was fortunate to work with Brian Pierce on studying antibody-antigen docking algorithms. I owe much of my knowledge of and joy in studying computational imaging and optics to Chris Metzler. I gained invaluable lessons on formulating research problems and managing projects from my interactions with Jia-Bin Huang. I also gathered countless insights and valuable experience from my many project collaborators, including (in roughly chronological order) Susmija Jabbireddy, David Li, Rui Yin, Tahseen Rabbani, Mingyang Xie, Haiyun Guo, Vivek Boominathan, Manoj K.
Sharma, Ashok Veeraraghavan, Ruofei Du, Zhenyi He, Keru Wang, Yinda Zhang, Danhang Tang, Zhiwen Fan, Chenxin Li, Sazan Mahbub, Hadi AlZayer, Kevin Zhang, Michael Rubinstein, and William T. Freeman.

I am thankful for the essential support provided by many staff members at UMIACS and UMD CS, including Tom Hurst, Jonathan Heagerty, Barbara Brawn, Sida Li, Eric Lee, Vivian Lu, Janice Perrone, and Jodie Gray.

I thank all my family, friends, and mentors for their support and impact over the years. My parents have always believed in me and would do anything for me to succeed. I feel fortunate that my grandmother is still around to celebrate my achievements, and I am thankful for the blessings given by my other grandparents who have passed. I am grateful to Erzhen for the unwavering support and encouragement during this time. I thank Xue Feng, Lu Feng, and Marc Santugini for their kindness and indispensable guidance when I applied for doctoral programs. I thank Kaiming for the friendship that has powered us through all the ups and downs of our academic endeavors. I am especially thankful to Gregory C. Robbins for instilling grit in me and changing the way I read, write, and think. Greg is my best teacher, and what he taught me during my teenage years has shaped the person I am today.

Most of the work on this dissertation occurred during an aberrant and unsettling time, marked by the pandemic and its repercussions. The unprecedented and unexpected challenges of this era have made me more determined and resilient. I must acknowledge the unique challenges faced by international students, and I offer my sympathies to those unable to begin or complete their studies during this period due to obstacles and distress caused by various contingent social and geopolitical factors. I hope that future students will not have to endure similar circumstances.
Table of Contents

Acknowledgements ii
Table of Contents iv
1 Introduction 1
  1.1 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss 1
  1.2 Efficient Neural Representation for Light Fields 4
  1.3 Neural Subspaces for Light Fields 8
  1.4 Primary Ray-based Implicit Function 12
  1.5 View Interpolation with Implicit Neural Representations of Images 16
2 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss 19
  2.1 Introduction 19
  2.2 Related Work 23
    2.2.1 Depth Estimation on 360-degree Images 23
    2.2.2 Joint Estimation of Depth and Normal 25
    2.2.3 Use of Quaternions 26
  2.3 Method 27
    2.3.1 CNN Architecture 28
    2.3.2 Depth Refinement based on Normal 28
    2.3.3 Aggregation with Confidence Scores 30
    2.3.4 Double Quaternion Approximation of Depth and Normal in the Loss Function 32
      2.3.4.1 Constructing Double Quaternion 32
      2.3.4.2 Loss Function Based on Double Quaternions 34
    2.3.5 Stereo Consistency 34
    2.3.6 Overall Loss Function 35
  2.4 Experiments 35
    2.4.1 Training Details 36
    2.4.2 Comparison with Other Methods 38
    2.4.3 Ablation Studies 38
  2.5 Limitations and Conclusion 39
3 Efficient Neural Representation for Light Fields 43
  3.1 Introduction 43
  3.2 Related Work 46
  3.3 Overview 49
    3.3.1 Light Fields as Functions 49
    3.3.2 Function Approximations 51
    3.3.3 MLP for Approximation 52
    3.3.4 Towards Multi-dimensional Input 53
    3.3.5 Gegenbauer Basis 55
  3.4 Methods 55
    3.4.1 Proposed Framework 55
    3.4.2 Comparative Evaluation 56
    3.4.3 Data and Training Setup 58
  3.5 Results 59
    3.5.1 Static Light Field Reconstruction 59
    3.5.2 Extension to Light Field Videos 62
    3.5.3 Light Field Super-Resolution 62
    3.5.4 Ablation Studies 64
  3.6 Discussion and Limitations 67
  3.7 Conclusion 68
4 Neural Subspaces for Light Fields 69
  4.1 Introduction 69
    4.1.1 Scope 74
  4.2 Related Work 74
    4.2.1 Light Field Compression for Streaming 74
    4.2.2 Neural Light Field 76
    4.2.3 Subspace Learning and Tracking 78
  4.3 Method 79
    4.3.1 Neural Light Fields with MLP 79
    4.3.2 Constructing Light Field Segments 80
    4.3.3 Adaptive Weight Sharing in MLP 81
    4.3.4 Soft-Classification for RGB Prediction 83
  4.4 Experiments 84
    4.4.1 Training Setup 85
    4.4.2 Quantitative Metrics 86
    4.4.3 Sub-aperture Light Fields 87
      4.4.3.1 Comparison with the Residual Approach 88
      4.4.3.2 Improved Accuracy with Soft Classification 88
    4.4.4 Volumetric Light Fields 89
      4.4.4.1 Extension to Dynamic Content 95
      4.4.4.2 Possible Flickering Across Subspaces 96
    4.4.5 Hyper-parameter Analysis 96
  4.5 Discussions 97
    4.5.1 End-to-end versus Two-stage Learning 97
    4.5.2 Limitations 98
    4.5.3 Future Directions 99
  4.6 Conclusion 99
5 Primary Ray-based Implicit Function 103
  5.1 Introduction 103
  5.2 Related Work 106
    5.2.1 3D Shape Representations 107
      5.2.1.1 Functional Representations 107
      5.2.1.2 Global versus Local Representations 108
    5.2.2 Ray-based Neural Networks 109
  5.3 Method 110
    5.3.1 Background 111
    5.3.2 Describing Geometry with Perpendicular Foot 112
    5.3.3 Background Mask 113
    5.3.4 Outlier Points Removal 114
  5.4 Experiments 115
    5.4.1 Single Shape Representation 116
    5.4.2 Shape Generation 117
    5.4.3 Shape Denoising and Completion 119
    5.4.4 Analysis and Ablations 120
      5.4.4.1 Complexity Analysis 120
      5.4.4.2 Stress Testing 123
      5.4.4.3 Ablations 123
    5.4.5 Further Applications 123
      5.4.5.1 Learning Camera Poses 123
      5.4.5.2 Neural Rendering with Color 125
  5.5 Additional Details and Discussion 126
    5.5.1 Training Setup 126
      5.5.1.1 Optimization 126
      5.5.1.2 Network Architecture 126
    5.5.2 Runtime Discussions 127
      5.5.2.1 Rendering Speed 127
      5.5.2.2 Meshing Speed 128
    5.5.3 More Ablations 128
      5.5.3.1 Comparison to Plücker Coordinates 128
      5.5.3.2 Varying Model Complexity 129
      5.5.3.3 Different Noise Levels 129
      5.5.3.4 Limitation 130
  5.6 Conclusion 130
6 View Interpolation with Implicit Neural Representations of Images 131
  6.1 Introduction 131
  6.2 Related Work 134
    6.2.1 Implicit Neural Representations 134
      6.2.1.1 3D Reconstruction 135
      6.2.1.2 Image Fitting 135
    6.2.2 Image-based Rendering 136
  6.3 Method 137
    6.3.1 INR for Image Fitting 137
    6.3.2 Extension to Multiple Images 138
    6.3.3 Direct Regularization 139
    6.3.4 Indirect Regularization 141
  6.4 Experiments 145
  6.5 Discussion 147
  6.6 Supplementary Information 150
    6.6.1 Training Details 150
      6.6.1.1 Hyperparameters 150
      6.6.1.2 Training with CLIP-based Features 151
      6.6.1.3 Dataset Details 152
      6.6.1.4 Beyond Interpolating Between Two Views 153
      6.6.1.5 Extending to Frame Interpolation 154
  6.7 Conclusion 163
7 Conclusion and Future Work 164
Bibliography 166

Chapter 1: Introduction

In this dissertation, I present a series of innovative machine learning algorithms designed to enhance the creation and representation of immersive visual data for extended reality applications. Firstly, I propose a method that simultaneously addresses depth and normal information to improve depth predictions from 360-degree videos. Secondly, I construct a streamlined technique for representing 4D light fields using neural fields, also known as implicit neural representations (INR). Thirdly, I put forward the concept of neural subspaces to optimize the quality and efficiency of neural light field representations. Fourthly, I introduce the idea of primary ray-based implicit functions for effective geometric shape modeling. Lastly, I explore a 3D-agnostic approach using image-based neural fields to generate novel view renderings without 3D structure or camera pose data. In this chapter, I provide a brief overview of these advances before detailing them in later chapters.

1.1 Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss

Depth information for 360° content enables 3D rendering based on the viewer's position and allows scene editing effects like relighting and object insertion. Obtaining depth information for 360° videos has several solutions, such as adding an active depth sensor or using stereo correspondence. However, both methods are cumbersome, costly, and challenging for average users.

Figure 1.1: Method Overview. The convolutional neural network (CNN) takes in a 360° image and for each pixel estimates its depth, normal, and the uncertainty of that depth estimate. These three estimates are used by a refinement module to produce the final depth estimate for each pixel. I train the CNN using a novel loss based on the double quaternion representation of the depth and normal.
Recent deep learning advances have enabled a data-driven approach to the 360° depth problem, demonstrating that a neural network can predict 360° depth with monocular input only. In Chapter 2, I present a novel method to improve existing deep learning methods on monocular 360° depth estimation, as illustrated by Figure 2.1. The proposed method unifies the representations of depth and normal based on the concept of double quaternions. My method enables the conversion of predicted and ground-truth depths and normals into two double quaternions, from which I derive a new loss for joint depth and surface normal estimation. I also use the double quaternion representation to measure the discrepancy between two CNN estimates from a stereo image pair.

Figure 1.2: Surface Normal Prediction. Training with the double quaternion loss enables the network to produce better normal estimates. In the third column, I show predictions from the baseline model, which separately calculates depth and normal loss without combining them into a double quaternion form.

Figure 1.3: Depth Refinement Results. I compare initial depth estimates produced by the network and the refined output based on surface normal. In the first column, I show the input image for reference.
Experimental results show that training with a double-quaternion-based loss function improves prediction accuracy for neural networks with 360-degree video frames as input. In Figure 1.2, I show normal estimates produced by my trained network. In Figure 1.3, I demonstrate the benefit of refining the initial depth predictions with predicted normals. My method outperforms state-of-the-art methods in most metrics, as shown in Table 1.1.

Table 1.1: Performance Comparison on the ODS dataset [4]. Evaluation statistics for rows 1-7 are taken directly from Lai et al. [4] and Xie et al. [5]. My method produces superior results in most metrics.

Method         RMSE   Log10  AbsRel  δ1     δ2     δ3
UResNet [1]    2.037  0.326  16.906  0.213  0.399  0.560
RectNet [1]    1.738  0.291  16.132  0.240  0.453  0.634
FCRN [2]       0.672  0.101  7.448   0.806  0.932  0.966
PSMNet [3]     0.393  0.059  5.641   0.953  0.975  0.980
SepUNet [4]    0.495  0.042  1.779   0.944  0.987  0.993
SepUNetS [4]   0.614  0.072  1.841   0.835  0.966  0.985
SepUNetDD [5]  0.392  0.036  2.120   0.960  0.987  0.992
Ours           0.389  0.031  0.413   0.954  0.984  0.990

1.2 Efficient Neural Representation for Light Fields

Light field content holds great potential for immersive visual applications. However, a primary obstacle to widespread adoption is the extreme cost of storing and transmitting such high-dimensional data. Past research has proposed numerous methods to compress light fields, but these efforts have had limited success in making light field content compact enough for casual streaming. In Chapter 3, I introduce SIGNET, a novel approach for representing light fields with neural networks, as illustrated in Figure 1.4.

Figure 1.4: Overview of SIGNET. I train an MLP to approximate the mapping function from each pixel's coordinates to its color values. My input transformation strategy based on the Gegenbauer polynomials enables the MLP to more accurately learn the high-dimensional mapping function.
Instead of treating light fields as pixel collections, I consider them as functions mapping pixel coordinates to colors. I also introduce a new transformation based on Gegenbauer polynomials for the input coordinates, enabling the network to successfully represent dense light fields. The ability to accurately represent a huge light field with a compact neural network automatically achieves compression with high fidelity. As shown in Table 1.2, SIGNET outperforms other compression methods on multiple light field scenes.

Table 1.2: Compression Performance Compared to Other Methods. Size denotes the storage in megabytes (MB) for each method without further quantization.

                Static Light Fields                        Light Field Videos
Scene           Lego         Bracelet     Tarot            Painter      Trains
Method          Size  PSNR   Size  PSNR   Size  PSNR       Size  PSNR   Size  PSNR
SIGNET          9.0   41.26  12.0  38.70  9.0   37.47      144   39.56  144   39.73
AMDE [6]        29.3  40.90  18.1  39.90  44.2  38.54      941   38.25  809   37.00
KSVD [7]        29.3  38.39  18.1  36.73  44.3  38.81      942   38.12  807   35.06
HOSVD [6]       29.3  37.24  18.0  33.98  44.3  34.53      942   36.91  807   35.29
5D DCT [6]      29.4  37.29  18.1  32.31  44.2  33.03      941   36.79  807   35.20
CDF 9/7 [8]     29.0  33.71  18.2  31.98  44.3  29.17      941   31.69  1116  29.80

Moreover, such a functional representation easily achieves interpolation and super-resolution on light fields. In Figure 1.5, I demonstrate the result of spatial upsampling with SIGNET, achieved by evaluating the network on a denser set of coordinates. SIGNET produces acceptable output without being trained for this task. In Figure 1.6, I present SIGNET's performance on angular upsampling, or novel view synthesis. In Figure 1.7, I present the result of temporal upsampling with SIGNET. Although SIGNET has only been trained on content from the first frame at t0 and the third frame at t0 + 1, when evaluated at the unseen intermediate time step t0 + 1/2, SIGNET is able to preserve the motion trajectory.
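The functional view underlying these capabilities, treating a light field as a mapping from pixel coordinates to color, can be sketched with a small coordinate-based MLP. The network sizes, polynomial degree, and the parameter alpha below are illustrative assumptions rather than SIGNET's exact configuration (detailed in Chapter 3); `scipy.special.eval_gegenbauer` evaluates the Gegenbauer polynomials used for the input transformation.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.special import eval_gegenbauer

def gegenbauer_features(coords, degree=8, alpha=0.5):
    """Expand each coordinate (normalized to [-1, 1]) into Gegenbauer
    polynomial values C_1^alpha(x) .. C_degree^alpha(x), analogous in
    spirit to a Fourier positional encoding. Degree and alpha here are
    illustrative choices, not the dissertation's exact settings."""
    feats = [eval_gegenbauer(n, alpha, coords) for n in range(1, degree + 1)]
    return np.concatenate(feats, axis=-1).astype(np.float32)

class LightFieldMLP(nn.Module):
    """Small MLP approximating the 4D mapping (u, v, x, y) -> (r, g, b)."""
    def __init__(self, in_dim, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers += [nn.Linear(d, 3), nn.Sigmoid()]  # colors in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, feats):
        return self.net(feats)

# Sample 4D light field coordinates (u, v, x, y), transform, and predict colors.
coords = np.random.uniform(-1.0, 1.0, size=(1024, 4)).astype(np.float32)
feats = gegenbauer_features(coords)           # shape (1024, 4 * 8) = (1024, 32)
model = LightFieldMLP(in_dim=feats.shape[-1])
rgb = model(torch.from_numpy(feats))          # shape (1024, 3)
```

Once such a network is trained against the ground-truth pixel colors, evaluating it on a denser coordinate grid yields the spatial, angular, or temporal upsampling described above, with no task-specific training.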
SIGNET’s functional design allows random access to encoded pixels, while 6 Figure 1.5: Spatial Upsampling with SIGNET. I evaluate the trained SIGNET on dense sampling grid points in the spatial dimensions. I show zoomed-in details in the cropped region bounded by the yellow rectangle. Figure 1.6: Angular Upsampling with SIGNET. At the bottom left corner of the reconstructed view, I show the relative positions of the reconstructed view (red square) and its four nearest views (blue squares) in the original light field. I compare with results from the deep-learning-based method, LFASR [9], which is trained specifically for light field angular upsampling. traditional codecs like JPEG or MPEG require decoding an entire image patch to access a single pixel. I envision that this property would be highly beneficial for foveated rendering, enabling substantial rendering and streaming speedups by 7 Figure 1.7: Temporal Upsampling with SIGNET. t0 and t0+1 are consecutive frames in the original video. The blue boxes contain output from frames evaluated at t0 + 1 2 , which is not present in the original video. The vertical lines are drawn for easier observation of the motion trajectory. adaptively selecting pixel portions based on the viewer’s gaze location. 1.3 Neural Subspaces for Light Fields SIGNET is a neural field, also known as implicit neural representation (INR), that compactly encodes multiple viewpoints from a light field scene (hundreds of megabytes) into a single neural network’s weights (a few megabytes). These weights are the only necessary information for storage and streaming. This unified design of training a single neural network to cover the entire light field has simplistic appeal as a concept. However, it is not ideal in practical scenarios that emphasize efficient transmission and rendering. 
Such scenarios generally involve rendering only a subset of light field viewpoints. Therefore, unnecessary costs are incurred to transmit and evaluate a single neural network that contains other, unrelated views.

Figure 1.8: Concept of Neural Subspaces for Light Fields. Given a light field scene, I divide it into multiple local segments and construct a neural subspace for each segment. The neural subspace construction is equivalent to training the network parameters to learn accurate coordinate-to-color mappings within each segment. The adaptive weight sharing strategy utilizes the similarity among nearby subspaces and reduces the total number of parameters needed to represent the entire light field.

In Chapter 4, I discuss the conceptual connection between the neural representation of light fields and subspace learning, a signal processing concept for dimensionality reduction of high-dimensional data. Inspired by subspace learning, I show that it is not necessary to treat light fields only as one unified entity: light fields can be regarded as a composite collection of local segments, or neural subspaces. Such a perspective is meaningful in practice since only a subset of the light field might be relevant at a particular moment for streaming or rendering.

Figure 1.9: Illustration of the Weight Sharing Strategy. The light field is partitioned into segments, each containing 2 × 2 viewpoints. I construct a neural subspace for each segment that summarizes its pixel-to-color mapping relationship. Each subspace shares a set of network layers while possessing its own local layers that enable network specialization on its corresponding data segment.

The approach introduced in Chapter 4 trains a set of local neural networks that each encode only a subset of viewpoints, unlike SIGNET, which is one global network representing the entire light field. An overview of this approach is illustrated in Figure 1.8. As each local network specializes in a particular region, this specialization permits smaller networks without sacrificing accuracy. Furthermore, recognizing the similarity among nearby subspaces, I propose a weight-sharing strategy for those local networks to enhance overall parameter efficiency while maintaining network capacity within each subspace. Effectively, this proposed strategy, illustrated in Figure 1.9, achieves the tracking of implicit neural subspaces. As shown in Table 4.1, experimental results indicate that the proposed framework leads to better efficiency and accuracy than the original SIGNET and a range of previous methods.

Table 1.3: Results on sub-aperture light fields, compared with previous methods AMDE [6], KSVD [7], and SIGNET [10]. EncMB is the memory (in MB) required to encode the entire light field in storage, and DecMB is the memory (in MB) required in streaming to decode any frame.

Scene               Method   PSNR(↑)  SSIM(↑)  EncMB(↓)  DecMB(↓)
Lego (909 MB)       AMDE     40.90    0.973    29.3      29.3
                    KSVD     38.39    0.960    29.3      29.3
                    SIGNET   41.26    0.976    9.0       9.0
                    Ours     41.95    0.982    22.9      2.2
Tarot (909 MB)      AMDE     38.54    0.973    44.2      44.2
                    KSVD     38.81    0.980    44.3      44.3
                    SIGNET   37.47    0.975    9.0       9.0
                    Ours     38.21    0.975    22.9      2.2
Bracelet (568 MB)   AMDE     39.90    0.980    18.1      18.1
                    KSVD     36.73    0.973    18.1      18.1
                    SIGNET   38.70    0.973    12.0      12.0
                    Ours     39.64    0.985    22.9      2.2

With this work invoking the classic idea of subspace learning, I take the neural light field representation introduced by SIGNET to the next step by making this representation more compact and streaming-friendly.
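As a rough structural sketch of the shared-plus-local design in Figure 1.9 (the layer widths, ReLU activations, and the untrained forward pass below are illustrative assumptions, not the dissertation’s exact architecture):

```python
import numpy as np

# Hypothetical sketch of the weight-sharing neural-subspace idea: viewpoints
# on a 16 x 16 grid are partitioned into 2 x 2 segments, each segment gets its
# own small "local" head, while an initial stack of layers is shared by all
# segments. All sizes are illustrative.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class NeuralSubspaces:
    def __init__(self, n_segments, in_dim=4, shared_dim=64, local_dim=32):
        # Shared layer: one set of weights reused by every subspace.
        self.W_shared = rng.normal(0, 0.1, (in_dim, shared_dim))
        # Local layers: one small head per segment (per neural subspace).
        self.W_local = [rng.normal(0, 0.1, (shared_dim, local_dim))
                        for _ in range(n_segments)]
        self.W_out = [rng.normal(0, 0.1, (local_dim, 3))
                      for _ in range(n_segments)]

    def segment_of(self, u, v, grid=16, seg=2):
        # Map a viewpoint (u, v) on the 16 x 16 grid to its 2 x 2 segment.
        return (u // seg) * (grid // seg) + (v // seg)

    def __call__(self, coord, u, v):
        s = self.segment_of(u, v)
        h = relu(coord @ self.W_shared)   # shared features
        h = relu(h @ self.W_local[s])     # subspace-specific features
        return h @ self.W_out[s]          # RGB prediction

model = NeuralSubspaces(n_segments=64)    # 8 x 8 segments
rgb = model(np.array([0.1, 0.2, 0.3, 0.4]), u=5, v=9)
print(rgb.shape)  # (3,)
```

Decoding a given segment then requires only the shared weights plus that segment’s local head, which is consistent with the much smaller decoding memory (DecMB) compared to the total encoding memory (EncMB) reported in Table 1.3.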
1.4 Primary Ray-based Implicit Function

Learning accurate and efficient 3D object representations is vital for applications in graphics, vision, and robotics. Recent advances in machine learning involve training neural fields of signed distance functions (SDF) as implicit shape representations. However, rendering and extracting shapes from trained SDF networks can be computationally expensive and is typically limited to watertight shapes. Moreover, the shape quality is ultimately constrained by the convergence criteria of sphere tracing or the grid resolution of marching cubes extraction.

Figure 1.10: Comparing various implicit shape representations. Rendering from common implicit neural shape representations, such as signed distance functions (SDF) and occupancy functions (OF), requires either sphere tracing or rasterizing a separately extracted mesh. My new representation, PRIF, directly maps each primary ray to its hit point. A network encoding PRIF is more efficient and convenient for rendering, since it requires only one evaluation for each ray, avoids the watertight constraint in conventional methods, and easily enables differentiable rendering.

Figure 1.11: Formulation of PRIF. (a) The signed distance at a sampling position (white) reveals the sphere (blue dots) where its nearest surface point (blue) exists, when we really want to know the hit point (red) along a specific direction. Thus, multiple samples are required. (b) To obtain the surface hit point, PRIF uses only one sample (yellow) along the ray: the perpendicular foot between the given ray and the coordinate system’s origin O. (c) PRIF takes in the ray’s direction and its sampling point, and returns the distance from that point to the actual surface hit point.

Chapter 5 presents a novel implicit geometric representation that is efficient, accurate, and innately compatible with downstream tasks involving reconstruction and rendering. I break away from conventional point-wise implicit functions like SDF, and propose to encode 3D geometry into a ray-based implicit function called PRIF. Specifically, PRIF operates in the realm of oriented rays r = (p_r, d_r), where p_r ∈ R³ is the ray origin and d_r ∈ S² is the normalized ray direction. Unlike SDF, which only outputs the distance to the nearest but undetermined surface point, I formulate this representation such that its output directly reveals the surface hit point of the input ray. Figure 1.10 presents the overview of PRIF, and Figure 1.11 shows its formulation. I train an MLP to learn Φ(f_r, d_r) = s_r, where f_r is the perpendicular foot of the ray with respect to the origin and s_r is the distance from f_r to the surface hit point. In effect, the objective is equivalent to finding a simple affine transformation f(x) = Ax + b, with the input x = d_r, A = s_r I₃, and b = f_r.

Table 1.4: Quantitative results on single shape representation. The left and right numbers represent the mean and median Chamfer Distance (multiplied by 10⁻⁴).

Method  Armadillo     Bunny         Buddha         Dragon         Lucy
SDF     1.905|1.260   1.717|1.147   6.119|2.258    5.184|1.946    3.387|1.417
OF      4.805|1.624   1.704|1.133   17.279|3.113   19.577|3.014   3.396|1.427
PRIF    0.978|0.706   1.169|0.835   1.443|0.821    1.586|0.913    0.846|0.519

Figure 1.12: Comparing PRIF with SDF and OF. I test on a tetrahedron grid that is self-intersecting and non-watertight. I obtain the SDF and OF values based on the ground-truth geometry, and extract the mesh by Marching Cubes. While the level-set representations fail, PRIF reliably preserves the shape.
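The formulation above can be checked with a short numeric sketch: the perpendicular foot f_r is computed in closed form, and an analytic unit sphere stands in for a trained Φ (the sphere oracle and all values below are illustrative assumptions, not part of the dissertation’s method):

```python
import numpy as np

# Sketch of the PRIF parameterization. For a ray r = (p, d) with unit
# direction d, the perpendicular foot f is the point on the ray closest to
# the origin O. PRIF trains an MLP Phi(f, d) -> s; here an analytic oracle
# for a unit sphere substitutes for the network so the geometry can be
# verified without training.

def perpendicular_foot(p, d):
    d = d / np.linalg.norm(d)
    # Remove the component of p along d: f = p - (p . d) d lies on the ray
    # and is the closest ray point to the origin.
    return p - np.dot(p, d) * d, d

def prif_oracle_sphere(f, d, radius=1.0):
    # Signed distance s from the foot f to the first sphere hit along d.
    m2 = np.dot(f, f)                  # squared distance of the ray to O
    if m2 > radius**2:
        return None                    # ray misses the sphere
    return -np.sqrt(radius**2 - m2)    # near intersection lies "behind" f

p = np.array([0.3, -2.0, 0.1])         # ray origin (outside the sphere)
d = np.array([0.0, 1.0, 0.0])          # ray direction (toward the sphere)
f, d = perpendicular_foot(p, d)
s = prif_oracle_sphere(f, d)
hit = f + s * d                        # single evaluation -> hit point
print(np.linalg.norm(hit))            # ~1.0: the hit lies on the sphere
```

A trained PRIF network plays the role of `prif_oracle_sphere` for arbitrary shapes, so rendering a hit point costs one inference pass per ray rather than the many evaluations of sphere tracing.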
I also avoid a major limitation of previous sphere-tracing-based methods: having to sample multiple points and perform multiple network evaluations to obtain a hit point. Chapter 5 also presents various experiments that verify the efficacy of PRIF for shape representation and demonstrate the applications enabled by using PRIF as the underlying neural shape representation. Table 1.4 and Figure 1.12 offer a preview showing that PRIF significantly outperforms SDF and OF in accurately preserving the fine details of the 3D shapes. Figure 1.13 shows the successful recovery of camera pose enabled by using PRIF as a hit-point renderer.

Figure 1.13: Learning Camera Poses. The initial camera pose is optimized based on the difference between the PRIF output rendered at the current pose and the PRIF output rendered at an unknown target pose. The PRIF outputs are depth images. The camera pose gradually converges to the correct target pose.

In summary, Chapter 5 introduces PRIF, a new 3D shape representation based on the relationship between a ray and its perpendicular foot with respect to the origin. I demonstrate that neural networks can successfully encode PRIF to achieve accurate shape representations. This new representation avoids multi-sample sphere tracing and obtains the hit point with a single network evaluation. Neural networks trained to encode PRIF inherit these advantages and can represent shapes more accurately than common neural shape representations using the same network architecture.

1.5 View Interpolation with Implicit Neural Representations of Images

Neural fields, also known as implicit neural representations (INR), have been successful in representing visual signals such as images, videos, signed distance fields, and radiance fields. In scenarios where only 2D images are available, two prominent applications are 2D image fitting and 3D view synthesis. INRs achieve impressive visual results along these two orthogonal directions. On the one hand, the quality of fitting images is improved by incorporating traditional signal processing techniques. On the other hand, the quality of view synthesis is improved by augmenting INRs with well-established 3D graphics techniques.

In Chapter 6, I explore a different direction and ask a new question: given multiple 2D image views of a 3D scene, can we use the INR of those 2D images alone to do view synthesis without any 3D reconstruction, pose, or correspondence? With randomly initialized INR weights and code vectors for individual images, I modify the standard INR training process such that the trained INR can both faithfully reproduce the given images and synthesize plausible novel views when interpolating between those learned image codes. This method is called VIINTER, and its overview is shown in Figure 1.14. Chapter 6 includes various analyses and experiments. For example, I show that it is important to regularize the magnitude of the latent codes, as shown in Figure 1.15.

Figure 1.14: Overview of VIINTER. After each image is randomly assigned a code vector z, the codes are then jointly trained with the neural network to produce the RGB color given coordinate (x, y). With standard training, the INR fails to decode coherent images at new interpolated codes, but VIINTER enables smooth transitions between two known viewpoints. Contrary to common methods for view interpolation, VIINTER does not use 3D structure, camera poses, or pixel correspondence.

Figure 1.15: Effect of Controlling Latent Codes ∥z∥p = 1 with Different p-norm. For each condition, we show the INR output given z_i (left), 0.5z_i + 0.5z_j (center), and z_j (right). “No Control” does not control the latent codes, leading to proper reconstruction at known views (left and right) but complete failure in interpolation (center). “∞-norm” scales each z with its maximum norm, but still does not interpolate well. “2-norm” significantly improves interpolation and reconstructs known views better, but “1-norm” is much better at interpolation (see red boxes).

VIINTER: View Interpolation with Implicit Neural Representations of Images, SA ’22 Conference Papers, December 6–9, 2022, Daegu, Republic of Korea

2.2 Image-based Rendering. The early approaches of image-based rendering (IBR) achieve novel view synthesis through explicitly blending relevant pixels from known images [Debevec et al. 1996; Gortler et al. 1996; Levoy and Hanrahan 1996]. The visual quality of IBR is heavily dependent on the strategy for deciding the blending weights of images, and researchers have developed a line of techniques improving blending weight selection, such as ray-space proximity [Chai et al. 2000; Levoy and Hanrahan 1996], proxy geometry [Buehler et al. 2001; Debevec et al. 1996; Heigl et al. 1999], optical flow [Chen and Williams 1993; Du et al. 2018], soft blending [Penner and Zhang 2017; Riegler and Koltun 2020], and neural-network-assisted blending [Mildenhall et al. 2019; Rombach et al. 2021; Thies et al. 2019; Wang et al. 2021b]. These techniques often require an approximate 3D structure (proxy geometry or depth) of the scene so that pixels can be re-projected to the novel view. For methods that do not involve 3D re-projection [Levoy and Hanrahan 1996; Ng et al. 2005], many still assume knowledge of the 3D camera locations and orientations of each image and leverage the spatial relationship among the cameras to decide the blending weights. In contrast, we explore a different and more challenging problem setting which involves neither 3D reconstruction nor knowledge of 3D locations and camera orientations. Our problem setup is similar to prior work on image morphing [Chen and Williams 1993; Liao et al. 2014; Seitz and Dyer 1996; Wolberg 1998], but we achieve the morphing effect without finding pixel-wise correspondences between images.

3 METHOD
We provide details on the INR parametrization adopted in our study, and we introduce the proposed modifications to INR training.

3.1 INR for Image Fitting. Let F denote the INR of images. In the case of a single image, for all pixels p of the image, the INR F defines

F(p_x, p_y) = p_c,   (1)

where (p_x, p_y) denotes the coordinate of the pixel p, with p_x ∈ R and p_y ∈ R, and p_c ∈ R³ denotes the value (often the RGB vector) associated with the pixel p. In itself, the INR formulation is invariant to different numeric ranges of (p_x, p_y) or p_c, and for simplicity we rescale the pixel coordinates and values to be within [0, 1]. We adopt the conventional MLP architecture to parameterize F as a chain of fully connected layers, with the activation function usually set as a ReLU or sinusoidal function. Various embedding functions of the input coordinate (x, y) have been proposed, but in this work we apply no embedding and use sinusoidal activations [Sitzmann et al. 2020], which are sufficient for fitting single 2D images. The primary training objective of the INR F for single 2D images is to minimize the reconstruction error between the predicted p_c and the ground truth p_c^GT across all known pixels in a single image, namely

L_SingleRecon = Σ_p ∥p_c − p_c^GT∥².   (2)

3.2 Extension to Multiple Images. Our goal is to use a single network F as the INR for multiple images from the same scene. Prior methods assume the camera layout (for planar light fields [Feng and Varshney 2021]) or known camera poses in the pipeline (for general light fields [Attal et al. 2022; Sitzmann et al. 2021]), but we are interested in pushing the limit to where the camera pose of each image is unknown. In our 3D-agnostic setup which does not consider camera poses, we assign a randomly initialized vector z ∈ R^M for each image, [...] or demonstration of the interpolation results between different viewpoints, after training is finished.

4 EXPERIMENTS
In this section, we provide more results on view interpolation and ablation studies on the techniques introduced in Section 3.
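The training setup of Sections 3.1 and 3.2 can be sketched numerically as follows; the network is untrained, and the code dimension M, the layer widths, and the SIREN-style activation are illustrative assumptions rather than the paper’s exact configuration:

```python
import numpy as np

# Toy sketch of the VIINTER setup: a single MLP F maps (p_x, p_y, z) to an
# RGB value, where z is a per-image latent code kept at unit p-norm; a novel
# view would be decoded from the interpolated code (1 - t) * z_i + t * z_j.
# The weights are randomly initialized (untrained); sizes are illustrative.

rng = np.random.default_rng(0)
M = 32                                     # latent code dimension (assumed)
W1 = rng.normal(0, 1.0, (2 + M, 64))
W2 = rng.normal(0, 0.1, (64, 3))

def F(coords, z):
    # coords: (N, 2) pixel coordinates in [0, 1]; z: (M,) image code.
    x = np.concatenate([coords, np.tile(z, (len(coords), 1))], axis=1)
    h = np.sin(30.0 * x @ W1)              # sinusoidal activation
    return h @ W2                          # predicted RGB per pixel

def unit_norm(z, p=1):
    return z / np.linalg.norm(z, ord=p)    # control ||z||_p = 1

# Two images' codes, randomly initialized then normalized as in training.
z_i, z_j = unit_norm(rng.normal(size=M)), unit_norm(rng.normal(size=M))

coords = rng.uniform(0, 1, (5, 2))
gt = rng.uniform(0, 1, (5, 3))
loss = np.sum((F(coords, z_i) - gt) ** 2)  # Eq. (2), reconstruction error

z_mid = 0.5 * z_i + 0.5 * z_j              # code for an in-between view
print(F(coords, z_mid).shape, loss > 0)
```

After training, rendering an interpolated view amounts to one forward pass of `F` over all pixel coordinates with the interpolated code.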
We train VIINTER to encode real-world scenes captured under two different regimes: 4D light fields (viewpoints are on a 2D plane with the same orientation) and unstructured light fields (viewpoints are not aligned on a 2D grid and orientations might be rotated).

4D Planar Light Fields. We use scenes from the Stanford Light Field Archive [Wilburn et al. 2005], with 17 × 17 camera viewpoints on a 2D grid. We use a 5 × 5 subset by taking every 4th image horizontally and vertically. We render new views by selecting two trained codes and linearly interpolating them. The interpolation results are shown in Fig. 5, with more in the supplements.

Unstructured Light Fields. To test VIINTER on scenes with irregular camera layouts, we test on the LLFF dataset [Mildenhall et al. 2019] and our own volumetric dataset. The LLFF scenes are captured in natural indoor environments, while our own scenes come from a volumetric studio for human body captures. We present the interpolation results in Fig. 5, with more in the supplements.

Quantitative Evaluation. The unique challenge in evaluating our method is that we cannot explicitly specify a camera pose to render at. Nonetheless, to provide a quantitative evaluation, we approximately render at testing viewpoints by interpolating the codes from nearby known viewpoints. For example, for the Stanford Light Field scenes, we select two viewpoints in the 5 × 5 training set, viewpoints (4, 4) and (4, 8), and interpolate their learned codes with t = 0.5. Then we render the full image with the interpolated code and compare it against the actual test image (withheld from training) captured at viewpoint (4, 6). Thanks to the well-aligned structure of these 4D scenes, we can compute metrics like peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against a reasonable ground truth image.

Figure 1.16: Unstructured Light Field Results. We interpolate the learned codes of views i and j as (1 − t) · z_i + t · z_j. The camera movement between the two known views includes rotation and translation. The interpolation through the INR smoothly transforms the perspective, despite having no knowledge of 3D scene structure or camera pose. Images are zoomed in for easier evaluation.

Chapter 6 further includes evaluation results of VIINTER on different types of multi-view scenes. Example results are presented in Figure 1.16. VIINTER takes an important step toward revealing the new potential of neural fields, or INRs: with careful modifications, they can perform view interpolation without 3D structures. This work offers a promising outlook on employing them for image manipulation tasks beyond simply representing known images.

Chapter 2: Deep Depth Estimation on 360-degree Images with a Double Quaternion Loss

2.1 Introduction

Traditional depth estimation uses binocular or multi-view stereo image inputs [11–14]. Based on explicit geometric constraints, most of these stereo methods infer relative depth by computing stereo disparity, i.e., the distance between a pixel’s location in one image and its corresponding location in the other image. The rise of deep learning enables direct training of convolutional neural networks (CNNs) for depth estimation by implicitly computing the matching cost between pixels in stereo images. However, since stereo images are not easily accessible, depth estimation on monocular images serves as a valuable alternative. Deep CNNs for this problem have shown promising results. A unique advantage of this approach is that monocular CNNs can be trained on both monocular image datasets and stereo image datasets.

As virtual and augmented reality (VR and AR) become more commoditized and panoramic cameras become ubiquitous, 360◦ visual content is becoming more relevant [15–17]. The interactive nature of VR and AR fosters an urgent need for methods that estimate depth information from 2D views to instill more creative freedom in content rendering and interaction, including reconstructing the original 3D scenes and synthesizing views from novel angles [18, 19].

Figure 2.1: Method Overview. The CNN takes in a 360◦ image and for each pixel estimates its depth, normal, and the uncertainty of that depth estimate. These three estimates are used by a refinement module to produce the final depth estimate for each pixel. We train the CNN using a novel loss based on the double quaternion representation of the depth and normal.

However, most of the previous research on depth estimation targets traditional perspective images. Unlike typical photographs captured on a planar sensor, 360◦ images have a spherical layout. For 360◦ stereo images, traditional depth-estimation methods based on binocular disparity are not directly applicable due to the spherical singularity at the stereo epipoles. Moreover, CNNs trained on narrow-field-of-view images for monocular depth estimation perform poorly on 360◦ monocular images because of the significant domain shift from traditional perspective to wide-field-of-view equirectangular images.

Zioulis et al. [20] and Lai et al. [4] have recently released separate datasets for depth estimation on 360◦ images. While both datasets provide multi-view stereo images, their choices of baseline distance between cameras are vastly different. This difference signifies a severe drawback of training a CNN that simply takes in a stereo image pair: networks directly trained on stereo images with a particular baseline cannot adapt to different baseline configurations at test time. Moreover, such networks require a fixed baseline in training, making it difficult to aggregate training data from multiple datasets. Therefore, training a depth estimation CNN with monocular input seems more favorable.
Among methods that train CNNs for depth estimation, joint estimation of depth and normal is commonly adopted as an augmentation technique. However, to the best of our knowledge, all previous networks that jointly estimate normal and depth consider the errors from depth and normal separately. While Qi et al. [21] and Yang et al. [22] have proposed depth refinement methods that explicitly link surface normal estimates with depth estimates, their methods are based on the planar-sensor camera model for traditional narrow-field-of-view images and do not map well to 360◦ images. Moreover, their refinement procedures modify all pixel points uniformly in the estimated depth map and do not consider the varying quality across different regions.

In this chapter, we present a new framework for 360◦ depth estimation. We start from a generic CNN that jointly estimates depth and surface normal based on monocular RGB images. We develop a new loss for this joint estimation task, which combines depth and surface normals into a 4D hyperspherical space with a double quaternion approximation. We implement depth refinement using the normal estimates produced by this network. In contrast with previous normal-based refinement methods on perspective images, our new method adaptively adjusts the refinement to the initial depth estimates via an uncertainty score map that is also estimated by the CNN. This uncertainty construct allows us to identify image regions where further refinement could be helpful and avoid unnecessary changes to estimates that the network expects to be accurate. Furthermore, to make full use of available image data, we introduce a stereo loss when training the CNN on stereo-image pairs. After producing two separate monocular estimates of depth and normal for a stereo image pair, the CNN learns to minimize their hyperspherical angular difference.
By this design, the monocular network can take advantage of stereo training data without being restricted by a particular stereo baseline distance. Experiments show the improved performance of our proposed framework compared to previous methods on 360◦ depth estimation. In summary, our contributions include:

• An adaptive depth refinement framework for 360◦ images using normal estimates and uncertainty scores.
• A new way to incorporate the depth and surface normal estimates for a 3D point into a hyperspherical 4D space using a double quaternion approximation.
• A stereo loss that enables the CNN to learn stereo consistency and remain flexible across datasets with different stereo baseline distances.

2.2 Related Work

We first present learning-based methods for monocular and stereo depth estimation on 360◦ images, followed by previous work on using the surface normal to refine depth from perspective images. We then present previous approaches that incorporate quaternion representations in estimating surface normals and approximating 3D motions.

2.2.1 Depth Estimation on 360-degree Images

Several methods have been used to perform depth estimation [2, 23–29] and surface normal estimation [28, 30–32] on perspective images. Unfortunately, 360◦ images are distorted by equirectangular projection and contain irregular disparity patterns due to the spherical singularity at the stereo epipoles. Therefore, depth estimation on 360◦ images requires special adaptations.

One approach for learning on 360◦ images is to project pixels onto rectified cubemaps and then perform inference using pre-trained CNNs. Huang et al. [33] apply the traditional structure-from-motion (SfM) algorithm [34] to 3D scene reconstruction by projecting each 360◦ video frame onto a cubemap. Monroy et al. [35] obtain 360◦ saliency maps following this approach, but the distortion and discontinuity among cubemap patches are not handled by their method.
Cube padding [36, 37] was introduced to help resolve the cubemap distortion problem by padding each patch with features from adjacent cubemap patches.

Another approach for 360◦ depth estimation is to transfer models for perspective images to 360◦ images. To account for the distortion from equirectangular projection, Su and Graumann [38] modified a CNN trained on perspective images by varying the kernel shape based on its location on the sphere. Su and Graumann [39] improved the previous method by learning a transformation function for kernels pre-trained on perspective images without separately training new kernels for each location. Zioulis et al. [1] directly train CNNs on 360◦ images using rectangular kernels of varying resolutions along with traditional square kernels to cover different distortion levels. They also adopt dilated convolutions [40] to increase the receptive field and enable the networks to gather more global information. Lee et al. [41] use a spherical polyhedron to represent 360◦ images and devise special convolution and pooling kernels for image pixels after they are projected onto the polyhedron. Tateno et al. [42] deform the kernel sampling grid to compensate for distortions in spherical images. For the similar task of saliency detection on 360◦ videos, Zhang et al. [43] also define kernels on the 360◦ sphere and resample the kernels on the grid points for every location in the equirectangular projection.

Unsupervised learning through view synthesis has also been exploited for depth estimation [22, 44]. De La Garanderie et al. [45] use the stereo consistency of perspective images to achieve unsupervised depth estimation on panoramic images. Wang et al. [37] explore self-supervised depth estimation from 360◦ images through cubemap projection. Zioulis et al. [20] introduced the view-synthesis approach into the realm of omnidirectional 360◦ images.
Aware of the distortion problem of 360◦ images, they also adaptively weight the loss contribution of each pixel based on its coordinates on the image grid.

While most previous work on 360◦ depth estimation focuses on monocular input, Lai et al. [4] present a framework for stereo depth estimation on 360◦ images with a CNN that produces a depth map for a horizontally displaced pair of images. Xie et al. [5] further extend this stereo depth estimation framework to include deformable convolution and correlation convolution. Wang et al. [46] propose a learnable cost volume approach for spherical stereo depth estimation, which also shows promising results.

2.2.2 Joint Estimation of Depth and Normal

Motivated by the inherent geometric relationship between the depth and normal estimates of points on the same surface, several methods include surface normal information in depth estimation. Wang et al. [47] deploy a dense conditional random field on initial estimates of normal and depth, which produces more regularized depth and normal outputs with better geometric consistency. Eigen and Fergus [48] also simultaneously estimate depth, surface normal, and semantic segmentation for perspective images.

Furthermore, the depth-normal relationship can be explicitly constructed. Two spatially close points with similar surface normal estimates are approximately co-planar, and thus they form a vector that is orthogonal to the surface normal. Building upon this assumption, Qi et al. [21] introduce a module that refines the depth estimates produced by a CNN using its normal estimates. Likewise, Yang et al. [22] formulate this depth-normal relationship as a quadratic minimization problem for a set of linear equations constructed from the local depth and normal estimates in a small region. However, these methods do not consider the varying quality of CNN estimates across different regions. Lai et al. [4] also use surface normal information to improve depth estimation.
To the best of our knowledge, theirs is the first work that implements a joint estimation of depth and normal on 360◦ images. However, their method only includes the normal as an auxiliary task of the CNN, without further exploiting the explicit geometric relationship between depth and surface normal.

2.2.3 Use of Quaternions

Quaternions are widely used in computer graphics to represent rotation transformations of 3D points. By representing the surface normal as a pure quaternion, Karakottas et al. [49] calculate the angular loss of normal predictions based on the quaternion product of the estimated and ground-truth normal vectors.

As a natural extension of quaternions, double quaternions integrate the rotation and translation components for motion interpolation [50]. Unlike traditional 3D point representations, where spatial displacements are separately characterized into translation and rotation, double quaternions provide a unified framework to approximate 3D displacements as rotations in 4D space. In other words, the difference between two 3D spatial displacements can be described by their angular distance in 4D.

In this chapter, we introduce a method that directly unifies depth and surface normal information into a single measurement based on a double quaternion approximation. With this novel construct, the predicted and the ground-truth depth and normals can be converted into two double quaternions. We thus derive a new loss specifically for the joint estimation of depth and surface normals. We also take advantage of this double quaternion representation to measure the discrepancy between two CNN estimates from a stereo image pair. After transforming the two separate estimates into a homogeneous coordinate system, we derive a stereo loss based on the double quaternion angular distance between these two sets of estimates.

2.3 Method

Our goal is to train a CNN for 360◦ depth estimation.
To exploit the information from surface normals, the CNN produces a normal map and an uncertainty map for the initial depth estimates, which we feed into a refinement procedure to produce a final depth map. We derive a loss function based on double quaternions to facilitate better depth-normal joint learning. To further use datasets containing stereo pairs, we introduce a stereo loss also based on double quaternions.

2.3.1 CNN Architecture

We adopt the commonly used U-Net architecture with skip connections, as shown in Figure 2.1. For an input RGB image of size h × w × 3, the CNN produces three separate outputs: 1) an h × w × 1 depth map, 2) an h × w × 1 uncertainty map for the depth, and 3) an h × w × 3 normal map. These three output maps are fed into a refinement step detailed in Section 2.3.2.

Figure 2.2: Network Architecture. We adopt the commonly used U-Net architecture for end-to-end per-pixel estimation. The first six blocks in the encoding part are based on the VGG-16 model [51]. The decoding part is symmetric to the encoding part, and it outputs a depth map, an uncertainty map, and a normal map. These three maps are combined to produce a final depth map using the method described in Section 2.3.

2.3.2 Depth Refinement based on Normal

In general, image-based depth estimation aims to recover the depth value of a 3D point (x, y, z) given its projected pixel location (u, v) in an image. The depth value for a pixel in a 360◦ image is defined as the distance of its corresponding 3D point from the camera:

r = √(x² + y² + z²)   (2.1)

Moreover, the pixel coordinates (u, v) of a 360◦ image with width w and height h directly correspond to the spherical coordinates (θ, ϕ) of its corresponding 3D point.
ϕ = 2πu,    θ = π(2v − 1)/2,    u, v ∈ [0, 1]    (2.2)

The direct conversion between spherical and Cartesian coordinates in 3D is given as follows:

x = r sin ϕ sin θ,    y = r cos θ,    z = r cos ϕ sin θ    (2.3)

Using equations (2.2) and (2.3), we can obtain the relationship that maps 2D grid coordinates to 3D Cartesian coordinates for 360◦ depth maps. Using the normal estimates (n_ix, n_iy, n_iz) also produced by the CNN, we can further formulate the following equations based on the orthogonality between the surface normal vector and the in-plane vector between points (x_i, y_i, z_i) and (x_j, y_j, z_j):

n_ix(x − x_i) + n_iy(y − y_i) + n_iz(z − z_i) = 0    (2.4)

(n_ix x_j + n_iy y_j + n_iz z_j) / (n_ix x_i + n_iy y_i + n_iz z_i) = 1    (2.5)

Then, using an assumption similar to Qi et al. [21], for pixels within a small region, we treat their corresponding 3D points as co-planar if their surface normal estimates are also similar. Thus, we obtain an approximately co-planar neighborhood N_i for each image pixel P_i using spatio-angular measures defined as follows:

N_i = {(x_j, y_j, z_j) | n_j⊺ n_i > α, |u_i − u_j| < β, |v_i − v_j| < β}    (2.6)

where (u_i, v_i) and (u_j, v_j) are the 2D grid coordinates of pixels P_i and P_j, β is the parameter that controls the size of the spatial neighborhood, and α controls the size of the angular neighborhood. A larger value of n_j⊺ n_i implies a greater likelihood that the corresponding 3D points for P_i and P_j are co-planar. For each neighbor P_j ∈ N_i, we may obtain an estimate r_ij of the depth r_i of P_i by substituting the spherical coordinates of equation (2.3) into (2.5):

r_ij = (n_ix x_j + n_iy y_j + n_iz z_j) / (n_ix sin ϕ_i sin θ_i + n_iy cos θ_i + n_iz cos ϕ_i sin θ_i)    (2.7)

where θ_i and ϕ_i are determined by Eq. (2.2). Note that the calculation in Eq. (2.7) suffers from instability when the denominator is close to zero, producing abnormal values.
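To make Eqs. (2.2), (2.3), and (2.7) concrete, the following minimal sketch (plain Python; the function names are ours, not from the dissertation) maps a normalized pixel coordinate to a unit ray and estimates a pixel's depth from a co-planar neighbor:

```python
import math

def pixel_to_direction(u, v):
    """Map normalized equirectangular pixel coordinates (u, v) in [0, 1]
    to a unit 3D direction, following Eqs. (2.2) and (2.3) with r = 1."""
    phi = 2.0 * math.pi * u
    theta = math.pi * (2.0 * v - 1.0) / 2.0
    return (math.sin(phi) * math.sin(theta),
            math.cos(theta),
            math.cos(phi) * math.sin(theta))

def depth_from_neighbor(n_i, p_j, u_i, v_i, eps=1e-8):
    """Estimate the depth r_ij of pixel P_i from a co-planar neighbor's 3D
    point p_j and the normal n_i predicted at P_i (Eq. 2.7). Returns None
    when the denominator is near zero, the unstable case noted above."""
    d = pixel_to_direction(u_i, v_i)            # unit ray through P_i
    denom = sum(a * b for a, b in zip(n_i, d))  # n_i dotted with the ray
    if abs(denom) < eps:
        return None
    return sum(a * b for a, b in zip(n_i, p_j)) / denom
```

For a neighbor lying on the plane through P_i's 3D point with normal n_i, this recovers P_i's depth exactly; the `None` branch corresponds to the unstable estimates filtered out by the constraints below.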
Thus, we leave out any depth estimate that violates the following constraints:

0 < r_ij < 255,    max(r_ij / r_i, r_i / r_ij) < 10    (2.8)

For any r_ij that violates the constraints in Eq. (2.8), we set it to r_i, the original depth estimate of P_i.

2.3.3 Aggregation with Confidence Scores

For each pixel P_i, we aggregate the estimates of its depth r_i from its neighbors P_j ∈ N_i using normalized weights. These weights have two components. First, we use the uncertainty score q_j of pixel P_j from the CNN output to compute its confidence value C(P_j) = 1 − q_j². An example of the uncertainty score output maps can be seen in Figure 2.6. Second, the neighbor P_j's contribution is also weighted by W(P_i, P_j), the dot product between the respective normals n_i and n_j. Specifically, we aggregate the depth estimates for each pixel P_i with its neighbors as:

r_{N_i} = ( Σ_{P_j ∈ N_i} C(P_j) · W(P_i, P_j) · r_ij ) / ( Σ_{P_j ∈ N_i} C(P_j) · W(P_i, P_j) )    (2.9)

with C(P_j) = 1 − q_j² and W(P_i, P_j) = n_j⊺ n_i. Finally, the refined depth r̂_i for P_i is calculated as:

r̂_i = C(P_i) · r_i + (1 − C(P_i)) · r_{N_i}    (2.10)

In other words, for a pixel with higher uncertainty and lower confidence, we place greater reliance on its neighbors to refine its initial depth estimate. On the other hand, if a pixel has a low uncertainty score, the CNN believes its depth estimate is likely accurate, so the neighbor estimates are less informative. This formulation allows us to adaptively refine the initial CNN estimates and avoid unnecessary modifications of the already robust estimates.

Figure 2.3: Hyperspherical Rotation Approximation. This figure illustrates that a displacement in 2D (shown at bottom right) can be regarded as a rotation around the center of a 3D sphere. Similarly, a double quaternion approximates a 3D displacement as a rotation on a suitably large 4D sphere.
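The refinement of Eqs. (2.8) through (2.10) can be sketched as follows (a minimal plain-Python version; the function name and the tuple layout of the neighbor list are ours):

```python
def refine_depth(r_i, q_i, n_i, neighbors):
    """Confidence-weighted depth refinement (Eqs. 2.8-2.10).
    r_i, q_i, n_i: initial depth, uncertainty score, and unit normal of P_i.
    neighbors: list of (r_ij, q_j, n_j) tuples, one per P_j in N_i."""
    num = den = 0.0
    for r_ij, q_j, n_j in neighbors:
        # Eq. 2.8: replace unstable estimates by the original depth r_i
        if not (0.0 < r_ij < 255.0) or max(r_ij / r_i, r_i / r_ij) >= 10.0:
            r_ij = r_i
        c_j = 1.0 - q_j ** 2                         # confidence C(P_j)
        w_ij = sum(a * b for a, b in zip(n_j, n_i))  # W(P_i, P_j) = n_j . n_i
        num += c_j * w_ij * r_ij
        den += c_j * w_ij
    r_neighborhood = num / den                       # Eq. 2.9
    c_i = 1.0 - q_i ** 2
    return c_i * r_i + (1.0 - c_i) * r_neighborhood  # Eq. 2.10
```

Note how a maximally uncertain pixel (q_i = 1) takes its refined depth entirely from the neighborhood, while a fully confident one (q_i = 0) keeps its initial estimate.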
2.3.4 Double Quaternion Approximation of Depth and Normal in the Loss Function

2.3.4.1 Constructing Double Quaternions

Since a point's spatial coordinate (x, y, z) represents a translation from the coordinate origin (0, 0, 0), a pixel's corresponding depth and surface normal orientation can be viewed as a 3D translation and rotation, respectively. The translation component of a 2D spatial displacement can be viewed as a rotation with respect to the origin of the 3D coordinate system. In fact, a similar approximation can be made from 3D to 4D. McCarthy [52, 53] has shown that the homogeneous transform of 3D spatial displacements with rotation and translation is the limiting case of a 4D rotation as the radius R of the 4D sphere approaches infinity. Thus, we combine the 3D depth (translation) and normal (rotation) into one 4D measurement, which is represented by a double quaternion. Specifically, a 3D translation d can be approximated by a rotation on a 4D sphere of radius R, with R → ∞, through an angle ψ satisfying sin(ψ) ≈ ψ = |d| / R as ψ → 0. The quaternion representing this 4D rotation is:

D = cos(ψ/2) + sin(ψ/2) · d/|d|    (2.11)

Therefore, the 3D translation is represented by the double quaternion (D, D*), where D* denotes the conjugate of D, and the 3D rotation is represented as (Q, Q), where

Q = (0, n_x, n_y, n_z)    (2.12)

Two double quaternions (G_1, H_1) and (G_2, H_2) can be composed into a new double quaternion (G_3, H_3), where

G_3 = G_1 G_2,    H_3 = H_1 H_2    (2.13)

Moreover, following Ge et al.
[50], we can compute the spatial distance between two double quaternions (G_1, H_1) and (G_2, H_2) as the angles between the respective double quaternion components:

α = cos⁻¹(G_1 · G_2),    β = cos⁻¹(H_1 · H_2)    (2.14)

2.3.4.2 Loss Function Based on Double Quaternions

Based on Eq. (2.13), we combine the double quaternions representing translation and rotation (Eqs. (2.11) and (2.12)) into a double quaternion representation for a depth and normal estimation pair:

G = DQ,    H = D*Q    (2.15)

We thus derive a loss function based on the angular distance between the two double quaternions, predicted (G_Pred, H_Pred) and ground-truth (G_GT, H_GT):

L_DQ = √(α_DQ² + β_DQ²)    (2.16)

where α_DQ and β_DQ are calculated as in Eq. (2.14).

2.3.5 Stereo Consistency

While training on datasets with stereo pairs, we further impose a stereo loss to minimize the discrepancy between the estimates from the two horizontally displaced images. Given a known baseline distance b between a horizontal stereo pair, a pixel in one image L can be mapped onto the other image R with the following equations:

ϕ_R = ϕ_L + b · cos(ϕ_L) / (r_L · sin(θ_L)),    θ_R = θ_L + b · sin(ϕ_L) · cos(θ_L) / r_L    (2.17)

Following the procedure presented in Section 2.3.2, we combine the depth and normal estimates from the stereo pair images into two double quaternions (G_L, H_L) and (G_R, H_R), from which we calculate the stereo loss:

L_Stereo = √(α_Stereo² + β_Stereo²)    (2.18)

where α_Stereo and β_Stereo are also calculated as in Eq. (2.14).

2.3.6 Overall Loss Function

With the double-quaternion-based losses derived above, we present the overall loss function for network training:

L_total = L_berHu + L_DQ + L_Stereo    (2.19)

Here, L_berHu is the reverse Huber loss function for both the depth and normal estimates compared to their respective ground truth [2]. In effect, this loss is equivalent to the mean absolute error for errors below a threshold c, and to a weighted mean squared loss for errors larger than c. We follow Laina et al.
[2] and set c as 20% of the maximal error among all images of the current batch. We follow Lai et al. [4] and place extra weight on errors at boundary pixels when calculating L_berHu.

2.4 Experiments

We have trained and evaluated the performance of our method on the ODS dataset [4]. It contains 40,000 frames of indoor scenes from the Stanford 2D-3D-Semantics Dataset [54] with ground truth depth and surface normals. We adopt the same training-validation data split and evaluation metrics as Lai et al. [4]. We have also evaluated our method on the 360D dataset provided by Zioulis et al. [20].

2.4.1 Training Details

We initialize the encoding blocks of the CNN shown in Figure 2.1 with the commonly used VGG-16 [51] pre-trained weights. We use the Adam optimizer with its default parameters. We follow the data augmentation procedures detailed in Lai et al. [4] to introduce more variability in the data. To be consistent with previous work, we train our networks for 40 epochs on this dataset to enable direct comparison of method performance.

Algorithm 1: Steps to Compute Training Loss
  Input: horizontally displaced image pair I, I′
  Parameters: weights θ of CNN
  Labels: depth map D_I and surface normal map S_I
  Output: initial depth D̃_init, refined depth D̃_I, estimated normal S̃_I, total loss L_total
  For each training iteration:
    1. D̃_init, S̃_I = CNN(I)                 (Section 2.3.1)
    2. D̃′_init, S̃_I′ = CNN(I′)              (Section 2.3.1)
    3. D̃_I = Refine(D̃_init, S̃_I)            (Sections 2.3.2-2.3.3)
    4. D̃_I′ = Refine(D̃′_init, S̃_I′)         (Sections 2.3.2-2.3.3)
    5. L_DQ = DQLoss(D̃_I, S̃_I, D_I, S_I)     (Section 2.3.4)
    6. L_Stereo = StereoLoss(D̃_I, D̃_I′)      (Section 2.3.5)
    7. L_berHu = berHuLoss(D̃_I, D_I)          (Section 2.3.6)
    8. L_total = L_berHu + L_DQ + L_Stereo
    9. θ = Update(L_total, θ)

We adopt the conventional depth estimation metrics [1, 2, 4, 55]. We denote the absolute prediction error at a pixel i as E_i = |y_i − ŷ_i|, where y_i is the ground truth depth and ŷ_i is the predicted depth.
δ_j refers to the percentage of pixels with max(y_i/ŷ_i, ŷ_i/y_i) < 1.25^j. The other metrics used and their definitions are listed below:

RMSE: √( (1/N) Σ_{i=1}^{N} E_i² )
Abs. Rel.: (1/N) Σ_{i=1}^{N} E_i / ŷ_i
RMLSE: √( (1/N) Σ_{i=1}^{N} (ln y_i − ln ŷ_i)² )
Sq. Rel.: (1/N) Σ_{i=1}^{N} E_i² / ŷ_i
Log10: (1/N) Σ_{i=1}^{N} |log₁₀ y_i − log₁₀ ŷ_i|
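To make the double-quaternion loss used in Algorithm 1 concrete, the following minimal sketch (plain Python; the sphere radius R = 1e4 is an illustrative choice, not a value prescribed by the dissertation) builds the pair (G, H) = (DQ, D*Q) of Eq. (2.15) for a single pixel and evaluates the angular distance of Eq. (2.16):

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def conj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])

def depth_normal_dq(point, normal, R=1e4):
    """Build the double quaternion (G, H) = (DQ, D*Q) of Eqs. 2.11-2.15,
    where `point` is the pixel's 3D point (the translation d) and `normal`
    its unit surface normal."""
    d_norm = math.sqrt(sum(c * c for c in point))
    psi = d_norm / R                        # sin(psi) ~ psi = |d| / R
    s = math.sin(psi / 2.0) / d_norm
    D = (math.cos(psi / 2.0), s * point[0], s * point[1], s * point[2])
    Q = (0.0,) + tuple(normal)              # pure quaternion, Eq. 2.12
    return qmul(D, Q), qmul(conj(D), Q)

def dq_loss(G_pred, H_pred, G_gt, H_gt):
    """Angular distance between two double quaternions (Eqs. 2.14, 2.16)."""
    dot_g = max(-1.0, min(1.0, sum(a * b for a, b in zip(G_pred, G_gt))))
    dot_h = max(-1.0, min(1.0, sum(a * b for a, b in zip(H_pred, H_gt))))
    return math.hypot(math.acos(dot_g), math.acos(dot_h))
```

Because D and Q are unit quaternions, G and H stay on the unit 4D sphere, so the dot products in `dq_loss` are the cosines of the angles α and β.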
Figure 2.4: Surface Normal Prediction. Training with the double quaternion loss enables the network to produce better surface normal estimates. In the first column, we show the input image for reference. In the third column, we show predictions from the traditional baseline model, in which we separately calculate depth and normal losses without combining them into a double quaternion form. In the fourth column, we show results from our full model.
Figure 2.5: Depth Refinement Results. We compare the initial depth estimates produced by the network and the refined output based on surface normals. In the first column, we show the input image for reference.

2.4.2 Comparison with Other Methods

The performance of a CNN trained with our method is shown in Tables 2.1 and 2.2. Compared to the other methods in Tables 2.1 and 2.2, our network shows improved performance in almost all metrics. In Figure 2.7 we show an example where our method better preserves the geometric detail of the scene. We believe this is because our model is aware of the surface normals and can use them to improve depth estimation.
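For reference, the evaluation metrics reported in Tables 2.1 and 2.2 can be computed as in the following sketch (plain Python; note that, as written in Section 2.4.1, the relative errors are normalized by the prediction ŷ):

```python
import math

def depth_metrics(y, y_hat):
    """Standard depth metrics from ground-truth depths y and predictions
    y_hat (equal-length sequences of positive values)."""
    n = len(y)
    e = [abs(a - b) for a, b in zip(y, y_hat)]
    ratio = [max(a / b, b / a) for a, b in zip(y, y_hat)]
    return {
        "RMSE": math.sqrt(sum(v * v for v in e) / n),
        "RMLSE": math.sqrt(sum((math.log(a) - math.log(b)) ** 2
                               for a, b in zip(y, y_hat)) / n),
        "AbsRel": sum(v / b for v, b in zip(e, y_hat)) / n,
        "SqRel": sum(v * v / b for v, b in zip(e, y_hat)) / n,
        "Log10": sum(abs(math.log10(a) - math.log10(b))
                     for a, b in zip(y, y_hat)) / n,
        "delta1": sum(r < 1.25 for r in ratio) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratio) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratio) / n,
    }
```

A perfect prediction yields zero for all error metrics and 1.0 for all three δ accuracies.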
Table 2.1: Performance Comparison on the ODS dataset [4]. Evaluation statistics for rows 1-7 are taken directly from Lai et al. [4] and Xie et al. [5]. Our method produces superior results in most metrics.

| Method        | RMSE  | Log10 | AbsRel | δ1    | δ2    | δ3    |
|---------------|-------|-------|--------|-------|-------|-------|
| UResNet [1]   | 2.037 | 0.326 | 16.906 | 0.213 | 0.399 | 0.560 |
| RectNet [1]   | 1.738 | 0.291 | 16.132 | 0.240 | 0.453 | 0.634 |
| FCRN [2]      | 0.672 | 0.101 | 7.448  | 0.806 | 0.932 | 0.966 |
| PSMNet [3]    | 0.393 | 0.059 | 5.641  | 0.953 | 0.975 | 0.980 |
| SepUNet [4]   | 0.495 | 0.042 | 1.779  | 0.944 | 0.987 | 0.993 |
| SepUNetS [4]  | 0.614 | 0.072 | 1.841  | 0.835 | 0.966 | 0.985 |
| SepUNetDD [5] | 0.392 | 0.036 | 2.120  | 0.960 | 0.987 | 0.992 |
| Ours          | 0.389 | 0.031 | 0.413  | 0.954 | 0.984 | 0.990 |

2.4.3 Ablation Studies

We present the performance comparisons in Table 2.3. We observe decreased estimation accuracy with the removal of each component of the loss function. Figures 2.4, 2.5, and 2.6 further illustrate the impact of our method. It is worth noting that the network trained with double quaternions produces smoother normal estimates, which could explain the increase in estimation accuracy, since the normal-based refinement method relies on accurate normal estimates.

2.5 Limitations and Conclusion

We have shown how the double-quaternion loss is useful for reducing geometric inconsistency and improving estimation accuracy. Our results indicate that the double quaternion construct holds meaningful potential for other tasks that involve processing 360◦ images.
We hope our work will bring a new hyperspherical perspective to analyzing omnidirectional visual data, as a complement to the traditional Cartesian (or equirectangular) perspective.

Table 2.2: Performance Comparison on the 360D dataset [20]. Evaluation statistics for rows 1-5 are taken directly from Zioulis et al. [1]. Our method surpasses the other methods in all metrics except AbsRel.

| Method         | RMSE   | RMLSE  | AbsRel | SqRel  | δ1     | δ2     | δ3     |
|----------------|--------|--------|--------|--------|--------|--------|--------|
| UResNet [1]    | 0.3374 | 0.1204 | 0.0835 | 0.0416 | 0.9319 | 0.9889 | 0.9968 |
| RectNet [1]    | 0.2911 | 0.1017 | 0.0702 | 0.0297 | 0.9574 | 0.9933 | 0.9979 |
| monoDepth [44] | 7.2097 | 0.8200 | 0.4747 | 2.3783 | 0.2970 | 0.7900 | 0.7510 |
| FCRN [2]       | 0.9410 | 0.3760 | 0.3181 | 0.4469 | 0.4922 | 0.7792 | 0.9150 |
| DCRF [25]      | 1.1596 | 0.4400 | 0.4202 | 0.7597 | 0.3889 | 0.7044 | 0.8774 |
| Ours           | 0.2373 | 0.0907 | 0.0859 | 0.0213 | 0.9690 | 0.9954 | 0.9988 |

Figure 2.6: Uncertainty Estimates. The network learns to produce meaningful uncertainty maps by effectively grasping the object's geometric outline. It places higher uncertainty near object edges, where depth predictions tend to be overly smooth and prone to error.
Figure 2.7: More Qualitative Comparisons. Here we show an example from a test image from the 360D dataset [1]. Note that our result largely preserves the geometry of the hallway railings.

Table 2.3: Ablation Results. Evaluation statistics are based on prediction results on the ODS dataset [4]. Rows 2-4 show the network performance when trained without the double quaternion loss, the depth refinement step, and the stereo consistency loss, respectively. The results show that each component of our proposed method contributes to better estimation accuracy.

| Method         | RMSE   | RMLSE  | AbsRel | SqRel  | δ1     | δ2     | δ3     |
|----------------|--------|--------|--------|--------|--------|--------|--------|
| Full model     | 0.3894 | 0.2572 | 0.4130 | 0.6872 | 0.9543 | 0.9836 | 0.9904 |
| w/o L_DQ       | 0.4731 | 0.3452 | 0.5830 | 0.9012 | 0.9257 | 0.9718 | 0.9880 |
| w/o Refinement | 0.4114 | 0.3190 | 0.5535 | 0.9220 | 0.9313 | 0.9780 | 0.9903 |
| w/o L_Stereo   | 0.3953 | 0.2622 | 0.4562 | 0.7068 | 0.9530 | 0.9801 | 0.9904 |

Our method achieves good performance on the testing scenes in the given datasets. One of the assumptions our method makes is that the normals can be
estimated well and provide meaningful guidance for depth refinement. In addition, the quality of our depth estimation on real-world 360◦ images depends on their domain similarity to the training dataset on which the model is trained. Our method does not perform well if either of these assumptions fails to hold. Furthermore, as previously discussed, direct learning on 360◦ images suffers from image distortion, which is not explicitly addressed by our method. In particular, we directly deploy a 2D CNN with regular, square kernels without any modification. Thus, it would be worthwhile to incorporate methods that alleviate the distortion problem, such as modifying the convolutional kernels to account for distortion, or performing convolution directly on spheres instead of on images with equirectangular projection.

In summary, we present a new framework for 360◦ depth estimation using a CNN. We use the double quaternion formulation to integrate depth and surface normals in the loss calculations. Experiments show superior results for the joint depth and normal estimation task. We also extend the double quaternion formulation to establish stereo consistency from the training data without restricting the network to a fixed baseline. We demonstrate quantitative and qualitative results that confirm the benefits of our new approach.

Chapter 3: Efficient Neural Representation for Light Fields

3.1 Introduction

Light fields offer an information-rich medium for capturing static and dynamic scenes. However, a significant barrier to their widespread adoption is the lack of sufficiently compact representations of such high-dimensional data, making them impractical for efficient storage, editing, and streaming. For example, a 1080p 60-fps light field video captured on a 10 × 10 camera grid easily requires several gigabytes of storage space for every second of content. A straightforward solution to compressing light fields is to apply existing, widely used compression methods such as JPEG and MPEG.
However, due to the sheer number of images captured in a light field, the compression rates of these single-view-based methods are far from satisfactory [56, 57]. It is therefore imperative to represent light fields compactly by taking advantage of their overlapping and repetitive visual patterns.

Extensive research has been devoted to designing compact light field representations based on the patch-based compression strategy manifest in the JPEG standard. These methods represent each image patch as a weighted sum over a small dictionary of basis functions, and the goal is to find new ways to construct dictionaries of basis functions that achieve better compression results. Yet, previous efforts have had limited success in enabling easy transmission and manipulation of light field content.

Figure 3.1: Overview of SIGNET. We train an MLP to approximate the mapping function from each pixel's coordinates to its color values. Our input transformation strategy based on the Gegenbauer polynomials enables the MLP to learn the high-dimensional mapping function more accurately.

Recent advances in deep learning have led to impressive results in representing data such as images and volumes [58-60] with neural networks. A common thread among these methods is the incorporation of Fourier-inspired modifications to the classical neural network design called the multilayer perceptron (MLP). Specifically, the SIREN network [59] uses a sinusoidal activation function between the MLP layers, while neural radiance field (NeRF) networks [58], designed for volumetric radiance data, show the effectiveness of applying cosine and sine transformations to the input coordinates. The improvement brought by the Fourier basis used in NeRF is further analyzed and formalized by Tancik et al. [60], who also successfully extend the neural representation to data such as 2D images and 3D shapes.
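As a minimal illustration of the coordinate-to-color idea (a sketch in the style of SIREN's sine activations, not the exact SIGNET architecture, which additionally transforms its inputs with Gegenbauer polynomials), a tiny coordinate-input MLP mapping a light-field coordinate (u, v, s, t) to an RGB triple might look like:

```python
import math
import random

def make_mlp(dims, seed=0):
    """Randomly initialized weights for a small MLP; layer sizes and the
    seed are illustrative choices."""
    rnd = random.Random(seed)
    layers = []
    for m, n in zip(dims[:-1], dims[1:]):
        w = [[rnd.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(n)]
             for _ in range(m)]
        layers.append((w, [0.0] * n))
    return layers

def mlp_forward(layers, coords, omega=30.0):
    """Map one coordinate tuple to a color: sine activations between the
    hidden layers, and a linear output head for the color values."""
    h = list(coords)
    for idx, (w, b) in enumerate(layers):
        z = [sum(h[i] * w[i][j] for i in range(len(h))) + b[j]
             for j in range(len(b))]
        if idx == len(layers) - 1:
            h = z                                        # linear output
        else:
            h = [math.sin(omega * zj) for zj in z]       # sine activation
    return h

layers = make_mlp([4, 32, 32, 3])
rgb = mlp_forward(layers, (0.1, 0.2, 0.3, 0.4))
```

Training such a network on all pixels of a light field, so that the weights themselves become the compressed representation, is the idea the rest of the chapter builds on.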
The proven capability of MLPs to express visual content with high fidelity implies that we could potentially compress a gigapixel light field within a few megabytes. However, as shown in Section 3.4, previous techniques fall short of representing light fields without visible artifacts.

In this chapter, we present a new framework that efficiently and accurately represents light field content using neural networks. Crucially, we introduce a novel input transformation strategy for the multi-dimensional light field coordinates based on the orthogonal Gegenbauer polynomials, which in our experiments works very well with the sinusoidal activation functions between the MLP layers. We call this network SIGNET (SInusoidal Gegenbauer NETwork), and we show its superiority for neural light field representation over a variety of Fourier-inspired input transformation strategies. SIGNET also achieves outstanding reconstruction quality with a higher compression rate than state-of-the-art dictionary-based light field compression methods. We further demonstrate how our MLP-based approach easily allows for view synthesis and super-resolution on the encoded light field scenes. In summary, our contributions are as follows:

• We present a neural representation of light fields that achieves high reconstruction quality and compression rate and offers pixel-level random access to the encoded light field.

• We introduce an input transformation strategy for coordinate-input MLPs using Gegenbauer polynomials, which outperforms other recently proposed techniques on light field data.

• We show that such a neural representation enables high-quality decoding at novel coordinates without additional training, achieving super-resolution along the spatial, angular, and temporal dimensions of light fields.

3.2 Related Work

Light Field Compression. Traditional compression relies on classical coding strategies that typically involve analytical basis functions such as the Fourier basis and wavelets.
Prior research has augmented this analytical approach with disparity [61-64] and geometry information [65]. Some sophisticated applications of light field video [66-68] also integrate motion prediction and build on existing video codec algorithms such as HEVC (H.265) [69] and VP9 [70]. More recently, Le Pendu et al. [71] present a Fourier Disparity Layer representation for light fields, which allows upsampling [72] and compression [73, 74] in the Fourier domain.

A different approach to light field compression involves learning a dictionary of basis functions, inspired by progress in sparse coding from machine learning, where dictionaries learned with data-driven algorithms have been shown to outperform analytical basis functions [75-78]. However, the dictionaries learned with conventional algorithms such as K-SVD [7] still contain too much redundancy and have a high storage cost. The current state-of-the-art methods [6, 79] for light field compression improve this approach by learning an ensemble of orthogonal dictionaries with a novel pre-clustering strategy.

We present a novel approach to this task by learning a neural representation of light fields. While our approach is rooted in the idea of basis functions, we fundamentally differ from the previous methods in that we use the expressive power of neural networks with non-linear activation functions to combine the basis functions into the desired output.

Light Field Interpolation. Most approaches rely on proxy information such as depth or optical flow [12, 19, 80-83]. Recently, deep learning methods have been used to infer depth and optical flow from light fields and to render novel viewpoints [84-88]. These methods warp the original frames to a novel viewpoint. While the results are impressive, they require access to the original light field data at run-time, incurring additional, sometimes prohibitive, costs in the light field processing pipeline.
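Since our input transformation builds on Gegenbauer polynomials, it may help to recall that they can be evaluated cheaply with the standard three-term recurrence; a minimal sketch:

```python
def gegenbauer(n, lam, x):
    """Evaluate the Gegenbauer (ultraspherical) polynomial C_n^lam(x)
    via the standard three-term recurrence:
      C_0 = 1,  C_1 = 2*lam*x,
      n*C_n = 2*x*(n + lam - 1)*C_{n-1} - (n + 2*lam - 2)*C_{n-2}."""
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr
```

For λ = 1/2 these reduce to the Legendre polynomials, and for λ = 1 to the Chebyshev polynomials of the second kind, two familiar special cases of the same family.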
In this chapter, we show how our neural light field representation naturally enables interpolation from the compressed data without explicit learning or proxy information. Although our presented network is not specifically designed for light field super-resolution or view synthesis, our results show its promising potential to be adapted for such tasks.

Coordinate-input MLP. Recent research [58-60] has shown the potential of using coordinate-input MLP networks to represent various kinds of data. A Fourier-inspired input transformation achieves state-of-the-art free-viewpoint synthesis on static scenes [58]. The sine activation, introduced in SIREN [59], allows a simple MLP with raw coordinate inputs to accurately model the coordinate-to-color mapping of data including images and videos. However, our experimental results show that these Fourier-inspired methods are unable to accurately model the coordinate-to-color mapping in light fields. We present a new transformation that allows MLPs to successfully represent dense light fields, and we show its applicability for compactly representing high-resolution light fields.

Figure 3.2: Illustration of Gegenbauer (Ultraspherical) Polynomials. We evaluate the 2D Gegenbauer basis functions on a 2D Cartesian grid (left) and a 3D polar grid (right). Only the first six orders of the basis are selected for illustration purposes.

Figure 3.3: We show examples of reconstructed images (left) and absolute errors (right). SIGNET achieves good accuracy while other methods find encoding this scene challenging.

Gegenbauer Polynomials. Previous research in applied mathematics has shown the effectiveness of Gegenbauer polynomials, also known as ultraspherical polynomials, in addressing the Gibbs phenomenon [89], a commonly observed artifact in MRI reconstruction using Fourier-based approximations [90, 91]. Gottlieb et al. [89] have shown that the finite Gegenbauer expansion of such functions provides better convergence and usually resolves the Gibbs artifact using fewer basis functions than the Fourier approach. Specifically, they show that given