ABSTRACT
Title of Dissertation: ENHANCING VISUAL AND GESTURAL FIDELITY
FOR EFFECTIVE VIRTUAL ENVIRONMENTS
Xiaoxu Meng
Doctor of Philosophy, 2020
Directed by: Professor Amitabh Varshney
Department of Computer Science
A challenge for the virtual reality (VR) industry is facing is that VR is not
immersive enough to make people feel a genuine sense of presence: the low frame
rate leads to dizziness and the lack of human body visualization limits the human-
computer interaction. In this dissertation, I present our research on enhancing visual
and gestural fidelity in the virtual environment.
First, I present a new foveated rendering technique: Kernel Foveated Rendering
(KFR), which parameterizes foveated rendering by embedding polynomial kernel
functions in log-polar space. This GPU-driven technique uses parameterized foveation
that mimics the distribution of photoreceptors in the human retina. I present a
two-pass kernel foveated rendering pipeline that maps well onto modern GPUs. I
have carried out user studies to empirically identify the KFR parameters and have
observed a 2.8??3.2? speedup in rendering on 4K displays.
Second, I explore the rendering acceleration through foveation for 4D light
fields, which captures both the spatial and angular rays, thus enabling free-viewpoint
rendering and custom selection of the focal plane. I optimize the KFR algorithm
by adjusting the weight of each slice in the light field, so that it automatically
selects the optimal foveation parameters for different images according to the gaze
position. I have validated our approach on the rendering of light fields by carrying
out both quantitative experiments and user studies. Our method achieves speedups
of 3.47??7.28? for different levels of foveation and different rendering resolutions.
Thirdly, I present a simple yet effective technique for further reducing the cost
of foveated rendering by leveraging ocular dominance - the tendency of the human
visual system to prefer scene perception from one eye over the other. Our new
approach, eye-dominance-guided foveated rendering (EFR), renders the scene at a
lower foveation level (with higher detail) for the dominant eye than the non-dominant
eye. Compared with traditional foveated rendering, EFR can be expected to provide
superior rendering performance while preserving the same level of perceived visual
quality.
Finally, I present an approach to use an end-to-end convolutional neural network,
which consists of a concatenation of an encoder and a decoder, to reconstruct a 3D
model of a human hand from a single RGB image. Previous research work on hand
mesh reconstruction suffers from the lack of training data. To train networks with
full supervision, we fit a parametric hand model to 3D annotations, and we train the
networks with the RGB image with the fitted parametric model as the supervision.
Our approach leads to significantly improved quality compared to state-of-the-art
hand mesh reconstruction techniques.
ENHANCING VISUAL AND GESTURAL FIDELITY FOR
EFFECTIVE VIRTUAL ENVIRONMENTS
by
Xiaoxu Meng
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2020
Advisory Committee:
Professor Amitabh Varshney, Chair/Advisor
Professor Joseph F. JaJa
Professor Matthias Zwicker
Professor Roger Eastman
Professor Furong Huang
? Copyright by
Xiaoxu Meng
2020
Acknowledgments
I owe my gratitude to all the people who have made this dissertation possible
and because of whom my graduate experience has been one that I will cherish forever.
First and foremost I?d like to thank my advisor, Professor Amitabh Varshney
for giving me an invaluable opportunity to work on the challenging and interesting
projects over the past five years. He has always made himself available for help and
advice and there has never been an occasion when I?ve knocked on his door and he
hasn?t given me time. It has been a pleasure to work with and learn from such an
extraordinary individual.
Thanks are due to Professor Matthias Zwicker, Professor Joseph F. JaJa,
Professor Furong Huang, and Professor Roger Eastman for agreeing to serve on my
dissertation committee and for sparing his/her invaluable time.
I would also like to thank Dr. Michael Hemmer and Dr. Kun He for their
supervision during my internship at Google and Facebook, respectively.
My colleagues at the Graphics and Visual Informatics Laboratory (GVIL)
have enriched my graduate life in many ways and deserve a special mention. Dr.
Ruofei Du inspired my research and helped me in solving technical problems. My
interactions with Dr. Hsueh-Chien Cheng, Dr. Eric Krokos, Xuetong Sun, Tara
Larrue, Shuo Li, Somay Jain, Mukul Agarwal, Alexander Rowden, Susmija Reddy
Jabbireddy and David Li have been very fruitful.
I would like to acknowledge my classmates and friends, including Yuexi Chen,
Sheng Cheng, Hui Ding, Fenfei Guo, Yue Jiang, Gregory Kramida, Shang Li, Yuntao
ii
Liu, Zhenyu Lin, Zijie Lin, Tiantao Lu, Nitin J. Sanket, Sayyed Sina Miran, Jiahao
Su, Guowei Sun, Manasij Venkatesh, Harry Wandersman, Qian Wu, Pengcheng Xu,
Xi Yi and Daiwei Zhu.
I would also like to acknowledge help and support from the staff members
from UMIACS and the CS department. Tom Ventsias, Barbara Brawn-Cinani, Sida
Li, Eric Lee, Jonathan Heagerty, Vivian Lu, and Tom Hurst?s technical help and
encouragement are highly appreciated.
I owe my deepest thanks to my family ? my mother and father who have always
stood by me and guided me through my career, and have pulled me through against
impossible odds at times. Words cannot express the gratitude I owe them. I would
also like to thank Yunchuan Li, Hua Luo, and Hong Wei who are like family members
to me.
I would like to acknowledge financial support from UMIACS for all the projects
discussed herein.
It is impossible to remember all, and I apologize to those I?ve inadvertently
left out.
iii
Table of Contents
Acknowledgements ii
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Kernel Foveated Rendering for 3D Graphics . . . . . . . . . . . . . . 5
1.3 3D-Kernel Foveated Rendering for Light Fields . . . . . . . . . . . . . 8
1.4 Eye-dominance-guided Foveated Rendering . . . . . . . . . . . . . . . 11
1.5 Hand Mesh Reconstruction from a RGB Image . . . . . . . . . . . . . 12
2 Kernel Foveated Rendering for 3D Graphics 14
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Foveated Images and Videos . . . . . . . . . . . . . . . . . . . 16
2.2.2 Foveated 3D Graphics . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Pass I: Forward Kernel Log-polar Transformation . . . . . . . 35
2.3.2 Pass II: Inverse Kernel Log-Polar Transformation . . . . . . . 39
2.4 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Apparatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.2 Pilot Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.3 Final User Study . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5 Results and Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.1 Rendering Acceleration of 3D Textured Meshes . . . . . . . . 48
2.5.2 Rendering Acceleration of Ray-casting Rendering . . . . . . . 48
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
iv
3 3D-Kernel Foveated Rendering for Light Fields 54
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Light Field Rendering . . . . . . . . . . . . . . . . . . . . . . 56
3.2.2 Light Field Microscopy . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.1 KFR for 4D Light Field Rendering . . . . . . . . . . . . . . . 60
3.3.2 3D-KFR for 4D Light Field . . . . . . . . . . . . . . . . . . . 63
3.4 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.1 Apparatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.3 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5 Results and Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.1 Results of the User Study . . . . . . . . . . . . . . . . . . . . 74
3.5.2 Rendering Acceleration . . . . . . . . . . . . . . . . . . . . . . 79
3.5.3 Quality Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 82
4 Eye-dominance-guided Foveated Rendering 86
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.1 Foveation Model . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.2 Eye-dominance-guided Foveation Model . . . . . . . . . . . . 93
4.4 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.1 Apparatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.2 Pre-experiment: Dominant Eye Identification . . . . . . . . . 95
4.4.3 Pilot Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.4 Main Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.4.5 Validity Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5 Results and Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5.1 Parameters Estimated with Different Scenes . . . . . . . . . . 111
4.5.2 Results of ?UF and ?NF . . . . . . . . . . . . . . . . . . . . . 111
4.5.3 Quality Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.4 Rendering Acceleration . . . . . . . . . . . . . . . . . . . . . . 113
5 Hand Mesh Reconstruction from RGB Images 115
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.1 Hand Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.2 Hand Skeleton Reconstruction from Multiview . . . . . . . . . 116
5.2.3 Hand Mesh Reconstruction from Singleview . . . . . . . . . . 117
5.3 Estimation of the Hand Mesh . . . . . . . . . . . . . . . . . . . . . . 119
5.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.2 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.3 Training Objective . . . . . . . . . . . . . . . . . . . . . . . . 121
v
5.3.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3.5 Evaluation Metric and Result . . . . . . . . . . . . . . . . . . 122
5.4 Hand Reconstruction from RGB Images . . . . . . . . . . . . . . . . 124
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.2 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.3 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.4.4 Training Objective . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5 Results and Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.2 Quantitative Evaluation of 3D Hand Mesh Estimation . . . . 127
5.5.3 Qualitative Evaluation of 3D Hand Mesh Estimation . . . . . 131
5.5.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Conclusion and Future Work 139
Bibliography 146
vi
List of Tables
2.1 Cochran?s Q values at different ?2. . . . . . . . . . . . . . . . . . . . 47
2.2 Timing comparison between the ground truth and KFR for one frame.
The resolution is 1920? 1080. . . . . . . . . . . . . . . . . . . . . . . 51
2.3 Frame rate and speedup comparison for kernel foveated rendering at
different resolutions with ? = 1.8, ? = 4.0. . . . . . . . . . . . . . . . 52
3.1 The average timings and the corresponding speedups of 3D-KFR at
different light field dimensions and foveation parameters ?. . . . . . . 81
4.1 The score frequency for different comparisons in the slider test and
the random test. We notice that P (score ? 4) ? 85% for both com-
parisons in the slider test and that P (score ? 4) ? 95% for both
comparisons in the random test. The result indicates the generaliz-
ability of eye-dominance-guided foveated rendering. . . . . . . . . . . 110
vii
List of Figures
1.1 Spatial distribution of various retinal components, using data from [1]
and [2]. Ganglion cell (orange) density tends to match photoreceptor
density in the fovea (left), but away from the fovea many photorecep-
tors map to the same Ganglion cell. ?Nasal?, ?Temp?, ?Sup? and ?Inf?
indicate the four directions (nasal, temporal, superior, inferior) away
from the fovea. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The comparison of the full-resolution rendering (left) and kernel
foveated rendering (right). Kernel foveated rendering system mimics
the distribution of photoreceptors in the human retina and generates
foveated rendering with smoothly changing resolution. . . . . . . . . 5
1.3 An overview of the kernel foveated rendering pipeline. I transform
the necessary parameters and textures in the G-buffer from Cartesian
coordinates to log-polar coordinates, compute lighting in the log-
polar (LP) buffer and perform internal anti-aliasing. Next, I apply
the inverse transformation to recover the frame buffer in Cartesian
coordinates and employ post anti-aliasing to reduce the foveation
artifacts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The pipeline of the foveated light field. Part (a) represents the light
field image array, the region with dark-green mask represents the
foveal region that the local center camera position of the frames are
around fovea. The peripheral region masked by light-green are the
regions that the local center camera position of the frames is far from
the fovea. I apply kernel log-polar transformation for each image with
different ? (? is determined by the gaze position) to get the image
sub-arrays as shown in the left part of (b). Then I average the image
sub-arrays to get textures as shown in the right part of (b). Finally, I
apply the inverse log-polar transformation for each sub-array, calculate
the weighted sum of pixel values and perform anti-aliasing to get the
final image displayed on-screen as shown in part (c). . . . . . . . . . . 9
viii
1.5 The comparison of the original full-resolution light field rendering
(right) and foveated light field rendering (left). We optimize the KFR
algorithm into 3D-KFR by adjusting the weight of each slice in the light
field, so that it is able to automatically select the optimal foveation
parameters for different images according to the gaze position, thereby
achieving greater speedup minimal perceptual loss. . . . . . . . . . . 10
1.6 Our pipeline renders the frames displayed in the dominant eye at a
lower foveation level (with higher detail), and renders the frames for
the non-dominant eye at a higher foveation level. This improves ren-
dering performance over traditional foveated rendering with minimal
perceptual difference. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Example of the estimation of the hand pose and shape. The Fscore@10mm
indicates the accuracy of estimation (higher is better). . . . . . . . . 13
2.1 Kortum [3] have developed one of the earliest eye-tracking-based
foveated imaging systems with space-variant degradation. . . . . . . . 17
2.2 Lungaro et al. [4, 5] propose to use a tile-based foveated rendering
algorithm to reduce the overall bandwidth requirements for streaming
360? videos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Foveated reconstruction with DeepFovea. Left to right: (1) sparse
foveated video frame (gaze in the upper right) with 10% of pixels; (2)
a frame reconstructed from it with our reconstruction method; and (3)
full resolution reference. Our method in-hallucinates missing details
based on the spatial and temporal context provided by the stream
of sparse pixels. It achieves 14? compression on RGB video with
no significant degradation in perceived quality. Zoom-ins show the 0
foveal and 30 periphery regions with different pixel densities. Note it
is impossible to assess peripheral quality with your foveal vision. . . . 19
2.4 Guenter et al. [6] render three eccentricity layers (red border = inner
layer, green = middle layer, blue = outer layer) around the tracked
gaze point (pink dot), shown at their correct relative sizes in the top
row. These are interpolated to native display resolution and smoothly
composited to yield the final image at the bottom. Foveated rendering
greatly reduces the number of pixels shaded and overall graphics
computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Vaidyanathan et al. [7] perform foveated rendering by sampling coarse
pixels (2? 2 pixels and 4? 4 pixels) in the peripheral regions. . . . . 21
2.6 Patney et al. [8, 9] perform foveated rendering by sampling coarse
pixels and address temporal artifacts in foveated rendering by using
pre-filters and temporal anti-aliasing. . . . . . . . . . . . . . . . . . . 22
2.7 Clarberg et al. [10] have proposed an approach in which pixel shading
is tied to the coarse input patches and reused between triangles,
effectively decoupling the shading cost from the tessellation level, as
shown in this example. . . . . . . . . . . . . . . . . . . . . . . . . . . 23
ix
2.8 He et al. [11] introduce multi-rate GPU shading to support more
shading samples near regions of specular highlights, shadows, edges,
and motion blur regions, helping achieve a 3X to 5X speedup. . . . . 24
2.9 Swafford et al. [12] implement four foveated renderers as described
in the text of the figure. (a) is the annotated view of a foveated
render with moderate settings pre-composition. The checkerboard
area represents the proportion of pixels saved for the targeted simulated
resolution; (b) is the strips from two foveated renders with the same
fixation point (bottom-right) but different peripheral sampling levels.
Region transition is handled smoothly, but at four samples there
are noticeable artifacts in the peripheral region, such as banding; (c)
Top: Sample frame from our ray-casting method with 120 per-pixel
steps in the foveal region (within circle) and 10 per-pixel steps in the
peripheral region (outwith circle). Bottom: Close-up of right lamp
showing artifacts across different quality levels; (d) is the Wireframe
view of our foveated tessellation method. The inner circle is the foveal
region, between circles is the inter-regional blending, and outside the
circles is the peripheral region. . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Stengel et al. [13] use adaptive sampling from fovea to peripheral
regions in a gaze-contingent rendering pipeline and compensate for
the missing pixels by pull-push interpolation. . . . . . . . . . . . . . . 27
2.11 Tursun [14] propose luminance-contrast-aware foveated rendring, which
demonstrates that the computational savings of foveated rendering
can be significantly improved if local luminance contrast of the image
is analyzed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.12 Display results from our Foveated AR prototype. By tracking the
user?s gaze direction (red cross), the system dynamically provides high-
resolution inset images to the foveal region and low-resolution large-
FOV images to the periphery. The system supports accommodation
cues; the magenta and blue zoom-in panels show optical defocus of
real objects together with foveated display of correctly defocus-blurred
synthetic objects. Red dashed discs highlight the foveal vs peripheral
display regions. A monocular wearable prototype (functional but
manually actuated) illustrates the compact optical path. . . . . . . . 28
2.13 The relationship among ?2, K (x) = x?, and the sampling rate. The
number of samples in each image is proportional to ?2. I use a variant
of the PixelPie algorithm [15] to generate the Poisson samples shown. 29
2.14 Transformation from Cartesian coordinates to log-polar coordinates
with kernel function K (x) = x?. (a) is the image in the Cartesian
coordinates, (b)?(e) are the corresponding images in the log-polar
coordinates with varying kernel parameter ?. Matching colors in the
log-polar and Cartesian coordinates show the same regions. . . . . . . 31
x
2.15 Comparison of foveated rendering with varying ? for 2560 ? 1440
resolution. From left to right: original rendering, kernel log-polar
rendering, and the foveated rendering with zoomed-in view of the
peripheral regions. Here ? = 1.8, (a) classic log-polar transformation,
i.e. ? = 1.0, (b) kernel function with ? = 2.0, (c) kernel function with
? = 3.0, and (d) kernel function with ? = 4.0. The foveated rendering
is at 67 FPS while the original is at 31 FPS. . . . . . . . . . . . . . . 34
2.16 Comparison of foveated frame with different ? (fovea is marked as the
semi-transparent ring in the zoomed-in view): (a) original scene, (b)
foveated with ? = 1.0, (c) foveated with ? = 4.0, (d) foveated with
? = 5.0, and (e) foveated with ? = 6.0. The lower zoomed-in views
show that large ? enhances the peripheral detail; the upper zoomed-in
views show that when ? ? 5.0, foveal quality suffers. . . . . . . . . . 35
2.17 Comparison of foveated rendering with varying ? for 2560 ? 1440
resolution. From left to right: original rendering, kernel log-polar
rendering, the recovered scene in Cartesian coordinates, and a zoomed-
in view of peripheral regions. Here, K (x) = x4, (a) full-resolution
rendered at 31 FPS, (b) ? = 1.2 at 43 FPS, (c) ? = 1.8 at 67 FPS,
and (d) ? = 2.4 at 83 FPS. . . . . . . . . . . . . . . . . . . . . . . . 36
2.18 User study setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.19 The percentage of times that the participants considered the foveated
rendering and the full-resolution rendering to be the same for varying
?2 and ? in pilot user study with 24 participants. . . . . . . . . . . . 42
2.20 The percentage of times that participants considered the foveated
rendering and the full-resolution rendering to be identical for different
?2 and ? in the final user study with 18 participants. . . . . . . . . . 43
2.21 Comparison of (a) full-resolution rendering and (b) foveated rendering
for 3D meshes involving a geometry pass with 1, 020, 895 triangles as
well as multiple G-buffers . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.22 Comparison of (a) full-resolution rendering and (b) foveated ray-
marching scene with 16 samples per pixel . . . . . . . . . . . . . . . . 50
3.1 Levoy and Hanrahan [16] two visualizations of a light field. (a) Each
image in the array represents the rays arriving at one point on the
uv plane from all points on the st plane, as shown at left. (b) Each
image represents the rays leaving one point on the st plane bound for
all points on the uv plane. The images in (a) are off-axis perspective
views of the scene, while the images in (b) look like reflectance maps. 57
3.2 Sun et al. [17] design a real-time foveated 4D light field rendering and
display system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
xi
3.3 The result comparison of the foveated light field with fovea on the
center of the screen. (b) - (d) are the application of 3D-KFR on light
field with (b) ?0 = 1.2, (c) ?0 = 2.0, (d) ?0 = 3.0. The left zoomed-in
views show that the application of 3D-KFR doesn?t make changes in
the fovea; the middle zoomed-in views and the right zoomed-in views
show that larger ?0 causes detail loss in the peripheral region. . . . . 64
3.4 The result comparison of the foveated light field with fovea on the
peripheral region of the screen. (b) - (d) are the application of 3D-
KFR on light field with (b) ?0 = 1.2, (c) ?0 = 2.0, (d) ?0 = 3.0. The
left zoomed-in views show that the application of 3D-KFR doesn?t
make changes in the fovea; the middle zoomed-in views and the right
zoomed-in views show that larger ?0 causes detail loss in the peripheral
region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5 Our user study set up with gaze-tracker integrated into the FOVE
head-mounted display. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.6 The Pair Test responses of S? across sliding foveation parameters ?.
S? decreases with the increase of ?. 5 represents perceptually identical,
4 represents minimal perceptual difference, 3 represents acceptable
perceptual difference, 2 represents noticeable perceptual difference,
and 1 represents significant perceptual difference (2 and 1 are not
shown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7 The Pair Test responses of P? across sliding foveation parameters ?.
P? decreases with the increase of ?. . . . . . . . . . . . . . . . . . . . 75
3.8 The Random Test responses of S? across gradually varied foveation
parameters ?. S? decreases with the increase of ?. 5 represents
perceptually identical, 4 represents minimal perceptual difference, 3
represents acceptable perceptual difference, 2 represents noticeable
perceptual difference, and 1 represents significant perceptual difference
(2 and 1 are not shown) . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.9 The Random Test responses of P? across sliding foveation parame-
ters ?. P? decreases with the increase of ?. . . . . . . . . . . . . . . . 77
3.10 The histogram of the optimal foveation parameter ? selected by each
user in the Slider Test. For instance, the histogram shows that 80%
of the users found that ? = 1.6 or lower is acceptable. . . . . . . . . . 78
3.11 The rendering time for light field with different dimension and different
?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.12 Comparison of the foveated light field Biomine II. (b) - (d) using
3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6. The
measured DSSIM (lower is better) is shown for each zoomed-in view. 83
3.13 Comparison of the foveated light field Cellular Lattice IV. (b) - (d)
using 3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6.
The measured DSSIM (lower is better) is shown for each zoomed-in
view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
xii
3.14 Comparison of the foveated light field Red Cells IV. (b) - (d) using
3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6. The
measured DSSIM (lower is better) is shown for each zoomed-in view. 84
4.1 An overview of the eye-dominance-guided foveated rendering system.
Our system uses two foveated renderers, with different values of the
foveation parameter ?, for the dominant eye and the non-dominant eye,
respectively. For the dominant eye, we choose the foveation parameter
?d which results in an acceptable foveation level for both eyes. For the
non-dominant eye, we choose ?nd ? ?d, which corresponds to a higher
foveation level. Because the non-dominant eye is weaker in sensitivity
and acuity, the user is unable to notice the difference between the two
foveation frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 The scenes used for the user study. Scene 0 and Scene 1 are animated
fireplace room [18] and the other scenes are animated Amazon Lum-
beryard Bistro [19]. These scenes are rendered with the Unity game
engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3 The average value of ?UF and ?NF in the slider and the random tests.
The pilot study yields a gap between the results of the slider and the
random tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 The result of the slider test of the pilot user study. We observe that
?NF often reach our upper bound (3.0). . . . . . . . . . . . . . . . . . 100
4.5 The result of the slider test of the random user study. We observe
that ?NF often reach our upper bound (3.0). . . . . . . . . . . . . . . 101
4.6 Change of parameter ?d and ?nd in the slider test. In Step 1, estimation
of ?UF , we present the participant with the same foveated rendering
for both eyes and the participant progressively decrease the foveation
level until ?d = ?UF (m). In Step 2, estimation of ?NF , we present the
participant the foveated rendering with ?d = ?UF (m) for the dominant
eye, and allow the participant to adjust the level of foveation for the
non-dominant eye. The participant can progressively increase the
foveation level until they reach the highest foveation level. . . . . . . 104
4.7 The average value of ?UF and ?NF in the slider test and the random
test. A paired T-test reveals no significant difference (p = 0.8995 >
0.01) between the result of the slider test and the result of the random
test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.8 The average score in Step 1 ( Estimation of ?UF ) and Step 2 (Estima-
tion of ?NF ) over different scenes and different users in the random
test. To achieve perceptually identical and minimal perceptual differ-
ence between regular rendering and foveated rendering, we therefore
choose ?UF = 2.0 and ?NF = 3.0 as our desired parameters. . . . . . . 108
xiii
4.9 The measured frame-rates (in fps) and the speedups. The speedups of
the eye-dominance-guided foveated rendering (EFR) compared with
the original kernel foveated rendering (KFR) ranges between 1.06?
and 1.47? with an average speedup of 1.35?. The speedups of EFR
compared with regular rendering (RR) ranges between 2.19? and
2.71? with an average speedup of 2.38?. . . . . . . . . . . . . . . . . 113
5.1 The pipeline of the ground truth mesh generation. . . . . . . . . . . 118
5.2 The qualitative results of hand mesh estimation from joints. . . . . . 123
5.3 The pipeline of the hand Reconstruction from RGB Images. . . . . . 125
5.4 Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Ku-
lon et al. [20] (the fourth column), and Boukhayma et al. [21] (the
fifth column). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5 Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Ku-
lon et al. [20] (the fourth column), and Boukhayma et al. [21] (the
fifth column). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.6 Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Ku-
lon et al. [20] (the fourth column), and Boukhayma et al. [21] (the
fifth column). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.7 2D PCK our hand reconstruction approach, Kulon et al. [20], and
Boukhayma et al. [21]. Our method outperforms the other methods
in AUC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.8 3D PCK our hand reconstruction approach, Kulon et al. [20], and
Boukhayma et al. [21]. Our method outperforms the other methods
in AUC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.9 Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column) with
the sum of joint loss and parameter loss as the objective, our hand
reconstruction approach (the fourth column) with the parameter loss
as the objective, and our hand reconstruction approach (the fifth
column) with the joint loss as the objective. . . . . . . . . . . . . . . 134
5.10 3D PCK of our hand reconstruction approach with the sum of joint
loss and parameter loss as the objective, our hand reconstruction
approach with the parameter loss as the objective, and our hand
reconstruction approach with the joint loss as the objective. . . . . . 135
5.11 Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column) with
the RGB image and heatmaps as inputs, our hand reconstruction
approach (the fourth column) with the RGB image as input, and our
hand reconstruction approach (the fifth column) with heatmaps as
input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
xiv
5.12 3D PCK of our hand reconstruction approach with the RGB image
and heatmaps as inputs, our hand reconstruction approach with the
RGB image as input, and our hand reconstruction approach with
heatmaps as input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.1 Temporal flickering issue. The original scene and the foveated scene of
two consecutive frames (FI and FII). In FI , the specular reflection in
the original scene as shown in the red and blue circles in the zoomed-
in view of (a) are amplified in the foveated scene as shown in the
zoomed-in view of (b). In the next frame FII , the specular reflection
in the original scene as shown in the pink circle in the zoomed-in view
of (c) is amplified in the foveated scene as shown in the zoomed-in
view of (d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 Illustration of a path traced frame in Visual-Polar space, the denoised
result transformed into Cartesian screen space, and the distribution of
the path tracing samples in screen space. Path tracing and denoising
in Visual-Polar space makes both 2.5? faster. . . . . . . . . . . . . . 142
xv
Chapter 1: Introduction
1.1 Overview
Rendering speed and transmission bandwidth are two critical constraints
in realizing effective and distributed virtual reality [22]. Human vision spans a
field of view of 135? ? 160?, but the highest-resolution foveal vision covers only
the central 1.5? ? 2? [6]. Patney et al. [8] have estimated that in modern virtual
reality head-mounted displays (HMD) only 4% of the pixels are mapped onto the
fovea. Foveated rendering [6,13,23] aims to improve the rendering efficiency while
maintaining visual quality by leveraging the capabilities and the limitations of the
human visual system. Equipped with an eye-tracker, a foveated rendering system
presents the foveal vision with full-resolution rendering and the peripheral vision
with low-resolution rendering. This allows one to improve the overall rendering
performance while maintaining high visual fidelity. Therefore, foveated rendering
techniques that allocate more computational resources for foveal pixels and fewer
resources elsewhere can dramatically speed up rendering [24] for large displays,
especially for virtual and augmented reality headsets equipped with eye trackers.
There is a large research literature that documents the falloff of accuracy in
visual periphery. Most of the previous research on foveated perception has been
1
surveyed in [25]. Recent research has also addressed the issue of perception time for
depicting information in the far peripheral field [26]. A commonly used model is
the linear acuity model, which measures the minimum angle of resolution (MAR).
A linear model matches both anatomical data and is applicable for many low-level
vision tasks [25]. However, the model only works for the ?central? vision (with
angular radius ? 8?), after which MAR rises more steeply [6]. In the periphery,
receptors become increasingly sparse relative to the eyes? optical system Nyquist
limit. Another model is the log acuity model. It has been found that the excitation of
the cortex can be approximated by a log-polar mapping of the eye?s retinal image [27].
The calculation of this model is cheap and fast, thus being used in many practical
applications such as computer vision, robotics, and other fields. Curcio [1,2] proposed
the mixed acuity model. As shown in Figure 1.1, the Ganglion cell (orange) density
tends to match the photoreceptor density in the fovea (left), but many photoreceptors
map to the same Ganglion cell away from the fovea. ?Nasal?, ?Temp?, ?Sup? and ?Inf?
indicate the four directions (nasal, temporal, superior, inferior) away from the fovea.
Most of the foveated rendering algorithms are inspired by the acuity models
mentioned above. However, it is not easy to quantify the pixel density rate by using
the models mentioned above. Previous research has addressed the issue of adjusting
visual acuity by conducting user studies with eye tracking technologies.
In this dissertation, I first present kernel foveated rendering for rendering
3D meshes and ray-traced scenes. I parameterize the foveation of rendering by
embedding polynomial kernel functions in the classic log-polar mapping [28,29]. This
allows us to alter both the sampling density and distribution, and match them to
2
Figure 1.1: Spatial distribution of various retinal components, using data from [1]
and [2]. Ganglion cell (orange) density tends to match photoreceptor density in the
fovea (left), but away from the fovea many photoreceptors map to the same Ganglion
cell. ?Nasal?, ?Temp?, ?Sup? and ?Inf? indicate the four directions (nasal, temporal,
superior, inferior) away from the fovea.
3
human perception in virtual reality HMDs.
Second, I optimize 3D-KFR by adjusting the weight of each slice in the light
fields, so that it automatically selects the optimal foveation parameters for different
images according to the gaze position and achieves higher speedup. In this way,
3D-KFR further accelerates the rendering process of high-resolution light fields while
preserving the perceptually accurate foveal detail.
Third, I present a simple yet effective technique for further reducing the cost
of foveated rendering by leveraging ocular dominance - the tendency of the human
visual system to prefer scene perception from one eye over the other. Our approach,
eye-dominance-guided foveated rendering (EFR), renders the scene with better detail
for the dominant-eye than the non-dominant-eye. Compared with traditional foveated
rendering, EFR provides superior rendering efficiency while preserving the same level
of perceived visual quality.
Finally, I present an end-to-end convolutional autoencoder to reconstruct a 3D
human hand from a single RGB image. To train networks with full supervision, we fit
a parametric hand model to 3D annotations, and we train the networks with the RGB
image with the fitted parametric model as the supervision. Our approach leads to
significantly improved quality compared to state-of-the-art hand mesh reconstruction
techniques.
4
1.2 Kernel Foveated Rendering for 3D Graphics
In Chapter 2, I present the kernel foveated rendering (KFR) for 3D graphics [23],
a foveated rendering system with smoothly changing resolution from fovea to the
periphery as shown in Figure 1.2.
Figure 1.2: The comparison of the full-resolution rendering (left) and kernel
foveated rendering (right). Kernel foveated rendering system mimics the distribution
of photoreceptors in the human retina and generates foveated rendering with smoothly
changing resolution.
In the KFR rendering system, I parameterize foveated rendering by embedding
polynomial kernel functions in the classic log-polar mapping. The GPU-driven
technique uses closed-form, parameterized foveation that mimics the distribution
of photoreceptors in the human retina. The pipeline of kernel foveated rendering
contains two passes as shown in Figure 1.3.
5
G-buffer
World position Bit tangent
Texture coordinates Albedo map
Normal Roughness, ambient, and 
refraction maps
Kernel log-polar 
transformation
LP-buffer 
(? = 3.0)
Shading &
internal anti-aliasing
Inverse kernel 
log-polar transformation
& post anti-aliasing
Screen
Figure 1.3: An overview of the kernel foveated rendering pipeline. I transform the
necessary parameters and textures in the G-buffer from Cartesian coordinates to
log-polar coordinates, compute lighting in the log-polar (LP) buffer and perform
internal anti-aliasing. Next, I apply the inverse transformation to recover the frame
buffer in Cartesian coordinates and employ post anti-aliasing to reduce the foveation
artifacts.
6
In the first pass, I compute the kernel log-polar transformation of the necessary
parameters and textures in the G-buffer from Cartesian coordinates to log-polar
coordinates and store the transformation in a reduced-resolution log-polar (LP)
buffer. Then I compute shading in the LP buffer; the shading cost is greatly reduced
because the lighting calculations at the reduced-resolution are lower. Due to the
low-resolution of the LP-buffer, there may be artifacts in the peripheral regions after
the inverse transformation. Therefore, I add a denoising stage in the log-polar space.
To reduce artifacts in the peripheral regions, I use a Gaussian filter with a 3 ? 3
kernel for the right part of the rendering(corresponding to the peripheral regions) in
the LP-buffer.
In the second pass, I carry out the inverse-log-polar transformation to recover
the rendered image in log-polar coordinates to the Cartesian coordinates for display.
I also perform spacial and temporal anti-aliasing for the full-resolution image. To
empirically establish the most suitable foveation parameter values, I have carried
out pilot and formal user studies. To achieve visually acceptable results for foveated
rendering, I use a threshold of 80% responses considering foveated rendering to be
visually indistinguishable from full-resolution rendering. I therefore choose ? = 1.8
and ? = 4 as the desired parameters for the interactive rendering evaluation. With
the desired parameters, I observe a 2.8? ? 3.2? speedup in rendering on 4K UHD
(2160p) displays with minimal perceptual loss of detail. The relevance of eye-tracking-
guided kernel foveated rendering can only increase as the anticipated growth of
display resolution makes it ever more difficult to resolve the mutually conflicting
goals of interactive rendering and perceptual realism.
7
1.3 3D-Kernel Foveated Rendering for Light Fields
Light fields capture both the spatial and angular rays, thus enabling free-
viewpoint rendering and custom selection of the focal plane. Scientists can interac-
tively explore microscopic light fields of organs, microbes, and neurons using virtual
reality headsets. However, rendering high-resolution light fields at interactive frame
rates requires a very high rate of texture sampling, which is a challenge as the
resolutions of light fields and displays continue to increase. In Chapter 3, I present
3D-kernel foveated rendering for light fields [30], a foveation system for light field as
shown in Figure 1.4.
I have developed a perceptual model for foveated light fields by extending the
KFR for the rendering of 3D meshes: since the foveation level of a pixel is affected
by the distance to the center camera, the foveation parameter can be different for
different slices in a light-field image array. We optimize the KFR algorithm into
3D-KFR by adjusting the weight of each slice in the light field, so that it is able to
automatically select the optimal foveation parameters for different images according
to the gaze position, thereby achieving greater speedup with minimal perceptual loss
as shown in Figure 1.5.
3D-KFR coupled with eye-tracking can accelerate the rendering of 4D depth-
cued light fields dramatically. On datasets of high-resolution microscopic light fields,
I observe 3.47?-7.28? speedup in light field rendering with minimal perceptual loss
of detail. I envision that 3D-KFR will reconcile the mutually conflicting goals of
visual fidelity and rendering speed for interactive visualization of light fields.
8
3D kernel log-polar sum up the frames inverse log-polar 
transformation with different with the same & into transformation and 
& for different frames one texture post-anti-aliasing
3D LP - buffers
Screen
! = !#
! = !$ fovea
! = !%
recovered visualization of 
the light field microscopy
rray
t field im
age a
h
original 
lig
(a) (b) (c)
Figure 1.4: The pipeline of the foveated light field. Part (a) represents the light
field image array, the region with dark-green mask represents the foveal region that
the local center camera position of the frames are around fovea. The peripheral
region masked by light-green are the regions that the local center camera position
of the frames is far from the fovea. I apply kernel log-polar transformation for
each image with different ? (? is determined by the gaze position) to get the image
sub-arrays as shown in the left part of (b). Then I average the image sub-arrays to
get textures as shown in the right part of (b). Finally, I apply the inverse log-polar
transformation for each sub-array, calculate the weighted sum of pixel values and
perform anti-aliasing to get the final image displayed on-screen as shown in part (c).
9
?
?
?
?
Figure 1.5: The comparison of the original full-resolution light field rendering
(right) and foveated light field rendering (left). We optimize the KFR algorithm into
3D-KFR by adjusting the weight of each slice in the light field, so that it is able to
automatically select the optimal foveation parameters for different images according
to the gaze position, thereby achieving greater speedup minimal perceptual loss.
10
1.4 Eye-dominance-guided Foveated Rendering
In Chapter 4, I introduce eye-dominance-guided foveated rendering [31] as
shown in Figure 1.6.
Rendered Frame Perceived Frame
foveated rendering foveated rendering (more foveation) foveated rendering foveated rendering
for for for for
dominant eye non-dominant eye dominant eye non-dominant eye
Figure 1.6: Our pipeline renders the frames displayed in the dominant eye at a lower
foveation level (with higher detail), and renders the frames for the non-dominant eye
at a higher foveation level. This improves rendering performance over traditional
foveated rendering with minimal perceptual difference.
Here, I present a simple yet effective technique for further reducing the cost
of foveated rendering by leveraging ocular dominance ? the tendency of the human
visual system to prefer scene perception from one eye over the other. I present the
technique of eye-dominance-guided foveated rendering (EFR), which leverages ocular
dominance property of the human visual system. I render the scene for the dominant
eye at the normal foveation level and render the scene for the non-dominant eye at
11
a higher foveation level. This formulation allows us to save more in the rendering
budget for the non-dominant eye.
I have validated our approach by carrying out quantitative experiments and
user studies. I designed two user tests to establish the most suitable foveation
parameter values for the dominant eye and the parameter for the non-dominant eye.
I have implemented the eye-dominance-guided foveated rendering pipeline on a GPU,
and achieve up to 1.47? speedup compared with the original foveated rendering
at a resolution of 1280? 1440 per eye with minimal perceptual loss of detail. The
technique of eye-dominance-guided foveated rendering can be easily integrated into
the current rasterization rendering pipeline for head-mounted displays.
1.5 Hand Mesh Reconstruction from a RGB Image
Accurate reconstruction of 3D human hands from monocular RGB images is a
challenging task. Hand estimation benefits a broad range of applications, such as
human-computer interaction and virtual and augmented reality. The goal of this
research is to use an end-to-end deep neural network to reconstruct the 3D model of a
human hand from a single RGB image as shown in Figure 1.7. In Chapter 5, I present
a network architecture as a concatenation of an encoder and a decoder. Given a
single RGB image of a hand, the encoder predicts the feature vector, from which the
decoder decodes a 3D hand model. To train networks with full supervision, we fit a
parametric hand model to 3D annotations, and we train the networks with the RGB
image with the fitted parametric model as the supervision. Our approach leads to
12
significantly improved quality compared to state-of-the-art hand mesh reconstruction
techniques.
We envision that the proposed approach could be widely used in the human-
object-interaction by facilitating the interaction between users and virtual objects
and bring virtual reality users more immersive experience by the visualization of
their hand model.
Figure 1.7: Example of the estimation of the hand pose and shape. The
Fscore@10mm indicates the accuracy of estimation (higher is better).
13
Chapter 2: Kernel Foveated Rendering for 3D Graphics
2.1 Overview
Araujo and Dias [28] use a log-polar mapping to approximate the excitation
of the cortex in the human visual system. The classic log-polar transformation has
been used for foveating 2D images on the GPU [29]. However, to the best of my
knowledge, direct use of the log-polar mapping for 3D graphics has not yet been
attempted on GPUs.
In this chapter, I present a kernel foveated rendering pipeline for modern GPUs
that parameterizes foveated rendering by embedding polynomial kernel functions
in the classic log-polar mapping. This allows us to easily vary the sampling density
and distribution, and match them to human perception in virtual reality HMDs. In
contrast to adaptive sampling in Cartesian coordinates, which requires a complex
interpolation process [13] and the classic three-pass foveated rendering pipeline [6],
KFR just needs a two-pass algorithm. In the first pass, I carry out the kernel log-
polar transformation and render to a reduced-resolution framebuffer using deferred
shading [32,33]. In the second pass, I apply the inverse kernel log-polar transformation
to the reduced-resolution framebuffer to map the final foveated rendering to the
full-resolution display.
14
I have built several foveated renderings with varying sampling density and
distribution and evaluate them via pilot and final user studies. I have found the
optimal parameters with minimal perceptual errors that correspond to the distribution
of photoreceptors in the retina. This algorithm is designed to achieve a high frame
rate by shading fewer pixels in the peripheral vision. Finally, I show results of
validating my approach on 3D rendering of textured meshes as well as ray-marching
scenes.
The KFR pipeline is broadly applicable for eye-tracking devices, and efficiently
testing, or previewing real-time rendering results with global lighting and physically
based rendering.
In summary, my contributions include:
1. designing the kernel log-polar mapping algorithm to enable a parameterized
trade-off of visual quality and rendering speed for foveated rendering,
2. conducting user studies to identify the kernel foveated rendering parameters
governing the sampling distribution and density to maximize perceptual realism
and minimize computation,
3. mapping kernel foveated rendering onto the GPU to achieve speedups of 2.8X
for textured 3D meshes and 3.2X for ray-casting scenes for 3840?2160 displays
with the minimal perceived loss of detail.
15
2.2 Related Work
In this section, I will review the development of foveated rendering in images,
videos and 3D renderings and give an overview of the eye-tracking technology.
2.2.1 Foveated Images and Videos
The last few decades have seen significant advances in foveated rendering for
2D images and videos.
Burt [34] has generated foveated images with multi-resolution Gaussian pyra-
mids. He takes advantage of a coarse-to-fine scheme to adaptively select the critical
information for constructing the foveated image. Kortum and Kortum et al. [3]
have developed one of the earliest eye-tracking-based foveated imaging systems with
space-variant degradation. The structure of their image foveation system is shown in
Figure 2.1. Using 256? 256 8-bit gray-scale images, they have achieved bandwidth
reduction of up to 94.7% with minimal perceptual artifacts. Other image foveation
techniques include embedded zero-tree wavelets [35], set partitioning in hierarchical
trees [36], wavelet-based image foveation [37], embedded foveation image coding [38],
and gigapixel displays [39].
Video foveation has also been explored [40?42]. The filter bank method is used
for video preprocessing before using standard video compression algorithms (e.g.
MPEG and H.26x) [41,43,44]. Foveation filtering has been implemented with the
quantization processes in standard MPEG and H.26x compression [45,46].
Video foveation coupled with eye tracking could reduce overall system latency,
16
Figure 2.1: Kortum [3] have developed one of the earliest eye-tracking-based foveated
imaging systems with space-variant degradation.
including network latency, processing latency, and display latency. Lungaro et
al. [4,5] propose to use a tile-based foveated rendering algorithm to reduce the overall
bandwidth requirements for streaming 360? videos. An overview of video foveation
is shown in Figure 2.2. They quantified the bandwidth savings achievable by the
proposed approach and characterize the relationships between Quality of Experience
(QoE) and network latency. The results showed that up to 83% less bandwidth
is required to deliver high QoE levels to the users, as compared to conventional
solutions.
As shown in Figure 2.3, Kaplanyan et al. present DeepFovea [47], which uses a
generative adversarial network [48] to reconstruct the given sparsely foveated image
by considering the closest match on a learned manifold of natural videos. Given a
history of frames and the corresponding gaze points that the user sees till a given
timestamp, along with the sparsely constructed image at that time frame, the deep
17
Figure 2.2: Lungaro et al. [4, 5] propose to use a tile-based foveated rendering
algorithm to reduce the overall bandwidth requirements for streaming 360? videos.
neural network reconstructs the original frame by inpainting and in-hallucinating
peripheral details while maintaining high acuity at the gaze point.
While previous work in foveation for images and videos provides strong founda-
tions, most of these methods cannot be easily generalized for interactive 3D graphics
rendering on modern GPUs. A notable exception is the work by Antonelli et al. [29],
which uses log-polar mapping to speed-up 2D image rendering on modern GPUs.
However, their approach does not directly work with 3D graphics primitives and
does not use kernel functions.
2.2.2 Foveated 3D Graphics
Weier et al. [49] have reviewed several approaches for foveated rendering
including mesh simplification in the areas of lower acuity [50?52]. However, these
days shading has often been found to dominate the cost for rendering sophisticated
18
Figure 2.3: Foveated reconstruction with DeepFovea. Left to right: (1) sparse foveated
video frame (gaze in the upper right) with 10% of pixels; (2) a frame reconstructed
from it with our reconstruction method; and (3) full resolution reference. Our method
in-hallucinates missing details based on the spatial and temporal context provided
by the stream of sparse pixels. It achieves 14? compression on RGB video with
no significant degradation in perceived quality. Zoom-ins show the 0 foveal and
30 periphery regions with different pixel densities. Note it is impossible to assess
peripheral quality with your foveal vision.
scenes on modern graphics pipelines [7, 11].
Ragan-Kelley et al. [53] use decoupled sampling for stochastic super-sampling
of motion and defocus blur at a reduced shading cost. Guenter et al. [6] present
a three-pass pipeline for foveated 3D rendering by using three eccentricity layers
around the tracked gaze point. As shown in Figure 2.4, the innermost layer is
rendered at the highest resolution (native display), while the successively outer
peripheral layers are rendered with progressively lower resolution and coarser level
of detail (LOD). They interpolate and blend between the layers and use frame jitter
and temporal re-projection to reduce spatial and temporal artifacts. However, this
approach renders the scene three times, which requires lots of rendering resources.
Vaidyanathan et al. [7] present a novel approach using a generalization of
19
Figure 2.4: Guenter et al. [6] render three eccentricity layers (red border = inner
layer, green = middle layer, blue = outer layer) around the tracked gaze point (pink
dot), shown at their correct relative sizes in the top row. These are interpolated to
native display resolution and smoothly composited to yield the final image at the
bottom. Foveated rendering greatly reduces the number of pixels shaded and overall
graphics computation.
20
multi-sample anti-aliasing (MSAA). They perform foveated rendering by sampling
coarse pixels (2 ? 2 pixels and 4 ? 4 pixels) in the peripheral regions as shown in
Figure 2.5. This approach targets small-form-factor devices with high resolution,
such as phones and tablets rather than HMDs. It therefore presents two challenges
for HMDs: the effective pixel size in current HMDs is too large for MSAA, and
gaze-dependent motions exaggerate the artifacts.
Figure 2.5: Vaidyanathan et al. [7] perform foveated rendering by sampling coarse
pixels (2? 2 pixels and 4? 4 pixels) in the peripheral regions.
Patney et al. [8, 9] address temporal artifacts in foveated rendering by using
pre-filters and temporal anti-aliasing. Because human eyes are sensitive to edges,
they add contrast preservation for the foveated image, which greatly enhances the
21
image quality by reducing the tunneling effect as shown in Figure 2.6. They tested
the foveated rendering effect on both Desktop and VR headset.
Figure 2.6: Patney et al. [8, 9] perform foveated rendering by sampling coarse pixels
and address temporal artifacts in foveated rendering by using pre-filters and temporal
anti-aliasing.
Clarberg et al. [10] propose a modification to the current hardware architecture,
which enables flexible control of shading rates and automatic shading reuse between
triangles in tessellated primitives as shown in Figure 2.7.
He et al. [11] introduce multi-rate GPU shading to support more shading
samples near regions of specular highlights, shadows, edges, and motion blur regions,
helping achieve a 3X to 5X speedup as shown in Figure 2.8. However, this imple-
mentation of multi-rate shading requires an extension of the graphics pipeline, which
22
Figure 2.7: Clarberg et al. [10] have proposed an approach in which pixel shading is
tied to the coarse input patches and reused between triangles, effectively decoupling
the shading cost from the tessellation level, as shown in this example.
is not available on commodity graphics hardware.
Swafford et al. [12] implement four foveated renderers as shown in Figure 2.9.
The first method reduces the effective rendered pixel density of the peripheral region
while maintaining the base density of the foveal window, as shown in Figure 2.9
(a). The second varies per-pixel depth-buffer samples in the fovea and periphery for
screen-space ambient occlusion. Although a very low number of per-pixel samples
can cause banding, they expect these differences to go unnoticed in the periphery
due to the loss of visual acuity and contrast sensitivity, as shown in Figure 2.9 (b).
The third method normally casts rays to geometry and detects intersections with a
given number of depth layers, represented as a series of RGBA textures mapped on
the geometry, then it varies the per-pixel ray-casting steps across the field of view,
as shown in Figure 2.9 (c). The final method implements a terrain renderer using
GPU-level tessellation for the fovea. In order to determine the appropriate level of
23
Figure 2.8: He et al. [11] introduce multi-rate GPU shading to support more shading
samples near regions of specular highlights, shadows, edges, and motion blur regions,
helping achieve a 3X to 5X speedup.
24
tessellation, we project the foveal window from screen coordinates into the scene. If
a tile falls within either the foveal or peripheral field of view, the level of tessellation
is set statically to the appropriate level. If the tile falls between the two regions (on
the blending border) the level of tessellation is linearly interpolated between the two
levels, as shown in Figure 2.9 (d).
Stengel et al. [13] use adaptive sampling from fovea to peripheral regions in
a gaze-contingent rendering pipeline as shown in Figure 2.10. To compensate for
the missing pixels caused by sparsely distributed shading samples on the periphery,
they use pull-push [54] interpolation to create the full foveated image. This strategy
achieves a reduction of render time of 25.4% (with speedup of 1.3X) and reduction
of shading time of 41% (with speedup of 1.7X).
As shown in Figure 2.11, Tursun [14] propose luminance-contrast-aware foveated
rendring, which demonstrates that the computational savings of foveated rendering
can be significantly improved if local luminance contrast of the image is analyzed.
They first study the resolution requirements at different eccentricities as a function
of luminance patterns. They later use this information to derive a low-cost predictor
of the foveated rendering parameters. Its main feature is the ability to predict the
parameters using only a low-resolution version of the current frame, even though the
prediction holds for high-resolution rendering.
Besides virtual reality, foveated rendering is also desirable for augmented reality
(AR) [55]. As shown in Figure 2.12, the AR display combines a traveling micro-display
relayed off a concave half-mirror magnifier for the high-resolution foveal region, with a
wide field-of-view peripheral display using a projector-based Maxwellian-view display
25
(a) (c)
(b) (d)
Figure 2.9: Swafford et al. [12] implement four foveated renderers as described in
the text of the figure. (a) is the annotated view of a foveated render with moderate
settings pre-composition. The checkerboard area represents the proportion of pixels
saved for the targeted simulated resolution; (b) is the strips from two foveated renders
with the same fixation point (bottom-right) but different peripheral sampling levels.
Region transition is handled smoothly, but at four samples there are noticeable
artifacts in the peripheral region, such as banding; (c) Top: Sample frame from our
ray-casting method with 120 per-pixel steps in the foveal region (within circle) and
10 per-pixel steps in the peripheral region (outwith circle). Bottom: Close-up of
right lamp showing artifacts across different quality levels; (d) is the Wireframe view
of our foveated tessellation method. The inner circle is the foveal region, between
circles is the inter-regional blending, and outside the circles is the peripheral region.
26
Figure 2.10: Stengel et al. [13] use adaptive sampling from fovea to peripheral regions
in a gaze-contingent rendering pipeline and compensate for the missing pixels by
pull-push interpolation.
Figure 2.11: Tursun [14] propose luminance-contrast-aware foveated rendring, which
demonstrates that the computational savings of foveated rendering can be significantly
improved if local luminance contrast of the image is analyzed.
27
whose nodal point is translated to follow the viewer?s pupil during eye movements
using a traveling holographic optical element.
Figure 2.12: Display results from our Foveated AR prototype. By tracking the user?s
gaze direction (red cross), the system dynamically provides high-resolution inset
images to the foveal region and low-resolution large-FOV images to the periphery.
The system supports accommodation cues; the magenta and blue zoom-in panels
show optical defocus of real objects together with foveated display of correctly
defocus-blurred synthetic objects. Red dashed discs highlight the foveal vs peripheral
display regions. A monocular wearable prototype (functional but manually actuated)
illustrates the compact optical path.
Recently, deferred shading has been used for antialiasing foveated rendering.
Karis [56] optimizes temporal anti-aliasing for deferred shading, which uses samples
over multiple frames to reduce flickering. Crassin et al. [57] reduce aliasing by pre-
filtering sub-pixel geometric detail in the G-buffer for deferred shading. Chajdas et
al. [58]?s subpixel anti-aliasing operates as a post-process on a rendered image with
super-resolution depth and normal buffers. It targets deferred shading renderers that
cannot use MSAA.
28
?? = 1 ?? = 2 ?? = 3 ?? = 4 Density
??2 = 1
??2 = 2
??2 = 4
Figure 2.13: The relationship among ?2, K (x) = x?, and the sampling rate. The
number of samples in each image is proportional to ?2. I use a variant of the PixelPie
algorithm [15] to generate the Poisson samples shown.
In this dissertation, I present a simple two-pass foveated rendering pipeline that
maps well onto modern GPUs. Kernel foveated rendering (KFR) provides gradually
changing resolution and achieves 2.8X ? 3.2X speedup with little perceptual loss.
2.3 Proposed Algorithm
Overall, my algorithm applies the kernel log-polar transformation for rasteriza-
tion in a reduced-resolution log-polar buffer (LP-buffer), carries out shading within
the LP-buffer, and then uses the inverse kernel log-polar transformation to render
29
on the full resolution display. This is shown in Figure 1.3.
In the classic log-polar transformation [29], given a W ?H pixel display screen,
and an LP-buffer of w?h pixels, the screen-space pixel (x, y) in Cartesian coordinates
is transformed to (u, v) in the log-polar coordinates according to Equation 2.1,
log?x?, y??2
u = ? w
L ( )
? (2.1)
arctan y
x?
v = ? h+ 1 [y? < 0] ? h
2?
where, (x?, y?) represent (x, y) with respect to the center of the screen as origin,
L is the log-distance from the center to the corner of the screen, and 1 [?] is the
indicator function,
(?
? W ? H ? ? )W H ?x = x? , y = y ? , L = log ?? , ?? (2.2)2 2 2 2 2
????1 , y? < 0
1 [y? < 0] = ??? (2.3)0 , y? ? 0
Notice how the central dark green area in Figure 2.14 (a) is mapped to a
relatively large region in the left part of the log-polar coordinates in Figure 2.14 (b),
while the peripheral regions of Figure 2.14 (a) are mapped to a relatively small part
of Figure 2.14 (b).
In the inverse log-polar transformation, a pixel with log-polar coordinates (u, v)
is transformed back to (x??, y??) in Cartesian coordinates. Let
L 2?
A = , B = , (2.4)
w h
30
(b) Log-polar Coordinates, ? = 1 (c) Log-polar Coordinates, ? = 2
(a) Cartesian Coordinates
(d) Log-polar Coordinates, ? = 3 (e) Log-polar Coordinates, ? = 4
Figure 2.14: Transformation from Cartesian coordinates to log-polar coordinates
with kernel function K (x) = x?. (a) is the image in the Cartesian coordinates,
(b)?(e) are the corresponding images in the log-polar coordinates with varying kernel
parameter ?. Matching colors in the log-polar and Cartesian coordinates show the
same regions.
31
then the inverse transformation can be formulated as Equation 2.5,
x?? = exp (Au) cos (Bv)
(2.5)
y?? = exp (Au) sin (Bv)
To understand how the resolution changes in the log-polar space, consider
r = ?x, y?2 = exp (Au).
Now, dr represents the change in r based on u,
dr = A ? exp (Au) du. (2.6)
Inversely, D is defined as the number of pixels in the LP-buffer that map to a single
pixel on the screen,
du 1
D = = ? exp(?Au). (2.7)
dr A
Equation 2.7 shows the foveation effect of pixel density decreasing from the fovea to
the periphery. In this formulation, it is not easy to systematically alter the density
fall-off function and evaluate the perceptual quality of foveated rendering.
I propose a kernel log-polar mapping algorithm that allows us more flexibility
to better mimic the fall-off of photo-(receptor densi(ty )o)f the human visual system,
exp ?wC? ?K(?1) uD = w? ? . (2.8)C? ?K? 1 u(w )2
Here, the constant parameter C = 1 + H represents the ratio between
W
screen diagonal and screen width. ? = W represents the ratio between the full-
w
2
resolution screen width and the reduced-resolution LP-buffer width, ?2 = W2 repre-w
sents the ratio between the number of pixels in the full-resolution screen and the
number of pixels in the reduced-resolution LP-buffer. Larger ?2 corresponds to more
32
condensed LP-buffer, which means less calculation in the rendering process. A more
condensed LP-buffer also means more foveation and greater peripheral blur.
The kernel function K (x) can be any monotonically increasing function with
K (0) = 0 and K (1) = 1, such as the sum of power functions,
?? ??
K (x) = ? xii , where ?i = 1. (2.9)
i=0 i=0
Such kernel functions can?be used to adjust the pixel density distribution in
the LP-buffer. I use K (x) = ? ii=0 ?ix in this project because the calculation of
power functions is fast on modern GPUs. There may be other kernel functions worth
x ?
trying, such as K (x) = sin(x ? ? ) and K (x) = e ?1? . For example, for C = 2 and2 e 1
K (x) = x?, the relationship between D and r under varying ?2 and ? is illustrated in
Figure 2.13 1. Kernel functions can adjust the pixel density such that the percentage
of the peripheral regions in the LP-buffer increases as shown in Figure 2.14 (c),
(d), and (e). This makes it possible to increase the peripheral image quality while
maintaining the same frame rates. A comparison among different kernel functions
?
is shown in Figure 2.15 with ? = 1.8 and C = 2. The use of the kernel function
reduces the artifacts in the zoomed-in peripheral view, improving the peripheral
image quality.
Meanwhile, as shown in Figure 2.16, when ? ? 5.0, the sampling rate of even
the foveal region drops, affecting the visual quality of the fovea. A comparison among
?
different ? is shown in Figure 2.17 with fixed ? = 4.0, C = 2.
1The figure is the visualization of sampling rate rather than the true sampling map.
33
(a) ? = 1
fovea
(b) ? = 2
fovea
(c) ? = 3
fovea
(d) ? = 4
fovea
Figure 2.15: Comparison of foveated rendering with varying ? for 2560 ? 1440
resolution. From left to right: original rendering, kernel log-polar rendering, and the
foveated rendering with zoomed-in view of the peripheral regions. Here ? = 1.8, (a)
classic log-polar transformation, i.e. ? = 1.0, (b) kernel function with ? = 2.0, (c)
kernel function with ? = 3.0, and (d) kernel function with ? = 4.0. The foveated
rendering is at 67 FPS while the original is at 31 FPS.
34
(a) original scene (b) foveated, ? = 1.0
(c) foveated, ? = 4.0 (d) foveated, ? = 5.0 (e) foveated, ? = 6.0
Figure 2.16: Comparison of foveated frame with different ? (fovea is marked as
the semi-transparent ring in the zoomed-in view): (a) original scene, (b) foveated
with ? = 1.0, (c) foveated with ? = 4.0, (d) foveated with ? = 5.0, and (e) foveated
with ? = 6.0. The lower zoomed-in views show that large ? enhances the peripheral
detail; the upper zoomed-in views show that when ? ? 5.0, foveal quality suffers.
2.3.1 Pass I: Forward Kernel Log-polar Transformation
Kernel Log-polar Transformation For each pixel in screen space with coordi-
nates (x, y), foveal point F (?x, y?) in Cartesian coordinates, I change Equation 2.1 to
Equation 2.10, ( )
?1 log?x?, y??2u = (K ( ) ? wL ) (2.10)
y? h
v = arctan ? + 1 [y
? < 0] ? 2? ?
x 2?
Here,
x? = x? x?, y? = y ? y?. (2.11)
35
(a) original
fovea
(b) ? = 1.2 
fovea
(c) ? = 1.8 fovea
(b) ? = 2.4 
fovea
Figure 2.17: Comparison of foveated rendering with varying ? for 2560 ? 1440
resolution. From left to right: original rendering, kernel log-polar rendering, the
recovered scene in Cartesian coordinates, and a zoomed-in view of peripheral regions.
Here, K (x) = x4, (a) full-resolution rendered at 31 FPS, (b) ? = 1.2 at 43 FPS, (c)
? = 1.8 at 67 FPS, and (d) ? = 2.4 at 83 FPS.
36
K?1 (?) is the inverse of the kernel function, and L is the log of the maximum distance
from fovea to one of the four corners of the screen as shown in Equation 2.12,
L = log (max (max (l1, l2) ,max (l3, l4))) . (2.12)
Here,
l1 = ?x?, y??2
l2 = ?W ? x?, H ? y??2
(2.13)
l3 = ?x?, H ? y??2
l4 = ?W ? x?, y??2
Lighting In lighting calculation for traditional deferred shading, mesh positions,
normals, depth and material information such as roughness, index of reflection, and
normal maps are fetched from the G-buffer [32,33]. Instead of obtaining information
from the G-buffer with texture coordinates (x, y), in my approach, I sample from
the transformed kernel log-polar texture coordinates (u, v) with bilinear filtering for
the lighting resources in the G-buffer. The reduced-resolution of the log-polar (LP)
buffer helps in reducing the lighting calculation to only those pixels that matter in
the final foveated rendering.
Internal Anti-aliasing Due to the low-resolution of the LP-buffer, there may
be artifacts in the peripheral regions after the inverse transformation. However, I
can directly perform denoising in the log-polar space. To reduce artifacts in the
peripheral regions, I apply a Gaussian filter with a 3? 3 kernel for the first quarter
from the right of the rendering (corresponding to the peripheral regions) in the
37
Algorithm 1 Kernel Log-polar Transformation
Input:
Fovea coordinates in screen space: (?x, y?),
pixel coordinates in screen space: (x, y).
Output:
Pixel coordinates in the log-polar space: (u, v).
1: acquire fovea coordinates (?x, y?)
2: for x ? [0,W ] do
3: for y ? [0, H] do
4: x? = x? x?
5: y? = y ? y?( )
?
u = (K?1 log(?x ,y
??
6: ) ? wL )
v = arctan y
?
7: ? + 1 [y
? < 0] ? 2? ? h
x 2?
8: end for
9: end for
38
LP-buffer. Since the LP-buffer pixels correspond to the adaptive detail of foveated
rendering, the Gaussian filtering in the LP-buffer gives us higher-level of anti-aliasing
in the peripheral regions.
2.3.2 Pass II: Inverse Kernel Log-Polar Transformation
Pass II performs the inverse kernel log-polar transformation to Cartesian
coordinates, applies anti-aliasing, and renders to screen.
Inverse Kernel Log-polar Mapping Transformation I can recover the Carte-
sian coordinates (x??, y??), from the pixel coordinates (u, v) and the fovea coordinates
(?x, y?) using Algorithm 2 with bilinear filtering.
Post Anti-aliasing One of the crucial considerations in foveated rendering is
mitigating temporal artifacts due to aliasing in the peripheral, high eccentricity
regions. I apply temporal anti-aliasing (TAA) [56] with Halton sampling [59] to the
recovered screen-space pixels after the inverse kernel log-polar transformation. I
also use Gaussian filtering with different kernel sizes ? for different L (as defined in
Equation 2.12) in post anti-aliasing. The kernel size ? is shown in Equation 2.14,
which depends on the normalized distance between the pixel coordinate and the
fovea, ? ?
?x?,y??2
L ? 0.10
? = 3 + 2? e . (2.14)
0.05
39
Algorithm 2 Kernel Log-polar Inverse Transformation
Input:
Fovea coordinates in screen space: (?x, y?),
pixel coordinates in the log-polar coordinates: (u, v).
Output:
Screen-space coordinates (x??, y??) for pixel coordinates (u, v).
1: update L with fovea coordinates (?x, y?) with Equation 2.12
2: let A = L , B = 2?
w h
3: for u ? [0, w] do
4: for v ? [0, h] do
5: x?? = exp (A ?K (u)) ? cos (Bv) + x?
6: y?? = exp (A ?K (u)) ? sin (Bv) + y?
7: end for
8: end for
40
2.4 User Study
I have carried out user studies to empirically establish the most suitable
foveation parameter values for ? and ? that result in visually acceptable foveated
rendering. To systematically investigate this, I conducted a pilot study to examine a
broad range of the two parameters, ?2 and ?. I used the results and my experience
with the pilot study to fine tune the protocol and ranges of ?2 and ? for the final
user study.
2.4.1 Apparatus
My user study apparatus, shown in Figure 2.18, consists of an Alienware laptop
with an NVIDIA GeForce GTX 1080, a FOVE head-mounted display, and an XBOX
controller. The FOVE display has a 100? field of view, a resolution of 2560? 1440,
and a 120 Hz infrared eye-tracking system with a precision of 1? and a latency of 14
Figure 2.18: User study setup.
41
Identical percentage under different ? and ?
100.00%
90.00%
80.00%
70.00%
60.00%
50.00%
40.00%
30.00%
20.00%
10.00%
0.00%
?2=1.00 ?2=1.44 ?2=1.96 ?2=2.56 ?2=3.24 ?2=4.00 ?2=4.84 ?2=5.76 ?2=6.76 ?2=7.84 ?2=9.00 ?2=10.24 ?2=11.56 ?2=12.96
? = 1 90.28% 80.56% 72.22% 58.33% 52.78% 47.22% 37.50% 22.22% 19.44% 5.56% 13.89% 9.72% 11.11% 9.72%
? = 2 87.50% 81.94% 76.39% 73.61% 52.78% 51.39% 45.83% 29.17% 13.89% 22.22% 18.06% 12.50% 4.17% 8.33%
? = 3 91.67% 90.28% 83.33% 68.06% 70.83% 61.11% 54.17% 38.89% 20.83% 26.39% 25.00% 13.89% 11.11% 11.11%
? = 4 94.44% 83.33% 70.83% 68.06% 73.61% 52.78% 50.00% 41.67% 26.39% 26.39% 20.83% 19.44% 8.33% 12.50%
? = 1 ? = 2 ? = 3 ? = 4
Figure 2.19: The percentage of times that the participants considered the foveated
rendering and the full-resolution rendering to be the same for varying ?2 and ? in
pilot user study with 24 participants.
ms, the system latency meets the eye tracking delay requirement of 50 ms - 70 ms.
2.4.2 Pilot Study
Procedure The session for each participant lasted between 35? 50 minutes and
involved four stages: introduction, calibration, training, and testing.
In the introduction stage, I showed participants the FOVE headset, the eye
trackers, and the XBOX controllers and discussed how to use them. I did not provide
any information about the research or the algorithm to avoid biasing the participants
towards any rendering.
After the participant comfortably wore the HMD, I moved forward to the
calibration stage, where I ran a one-minute eye-tracking calibration program provided
by the FOVE software development kit.
In the training stage, I presented the participants with 20 trials with different
42
percentage
Identical percentage under different ? and ?
100.00%
80.00%
60.00%
40.00%
20.00%
0.00%
?2=1.44 ?2=1.96 ?2=2.56 ?2=3.24 ?2=4.00 ?2=4.84 ?2=5.76
? = 1 91.67% 88.33% 78.33% 66.67% 46.67% 31.67% 31.67%
? = 2 91.67% 96.67% 86.67% 75.00% 58.33% 51.67% 46.67%
? = 3 91.67% 90.00% 81.67% 85.00% 66.67% 61.67% 41.67%
? = 4 96.67% 96.67% 95.00% 80.00% 66.67% 56.67% 48.33%
? = 1 ? = 2 ? = 3 ? = 4
Figure 2.20: The percentage of times that participants considered the foveated
rendering and the full-resolution rendering to be identical for different ?2 and ? in
the final user study with 18 participants.
combinations of ?2 and ?, to ensure that they are familiar with the HMD and the
controller.
Trials in the training and testing stages were identical. In each trial of the
two-alternative forced choice test, I presented participants with a pair of rendered
scenes, each for 2 seconds and separated by a black-screen interval of 0.75 seconds.
One scene uses full-resolution rendering, and the other uses KFR with different
parameters ?2 and ?. In each trial, I presented the KFR scene and the full-resolution
scene in a random order. The participant indicated whether the two images look the
43
Percentage
same by pressing a button on the XBOX controller. I instructed the participants to
maintain their gaze at the center of the screen, even though the foveated renderer
can use eye-tracking to update the foveated image.
The testing stage had three sessions, each with 56 trials. The LP-buffer
resolution reduction parameter ? ranges from 1.0 to 3.6 with step size 0.2 (?2 ranges
from 1.00 to 12.96), and the kernel sampling distribution parameter ? ranges from 1
to 4 with step size 1. I rendered scenes from the Sponza and Amazon Lumberyard
Bistro datasets for different sessions. I allowed the participants to have some rest
between different sessions.
Participants In the pilot study, I recruited 24 participants via campus email lists
and flyers. All participants are at least 18 years old with normal or corrected-to-
normal vision (with contact lenses). The participants are collected using campus
flyers and emails.
Results and Analysis I define PI as the percentage of the trials for which
participants reported the two images shown in a trial to be the same. The results of
PI are shown in Figure 2.19. First, I find that PI is inversely related to ?
2. With
increase in ?2, the LP-buffer gets smaller, thus reducing the overall sampling rate
in foveated rendering. Second, I notice that with the increase of ?, PI significantly
increases for ? ranging from 1.2 to 2.8 (?2 ranging from 1.44 to 7.84). This shows
that for the same ?, the perception of the quality of foveated rendering increases by
the use of ? for kernel functions. For ? = 1.0 and ? > 2.8 the improvement is not
44
significant. The reason is that for ? = 1.0, the foveated renderings for ? = 1 and
? > 1 are both clear, there is little space for improvement. Similarly, for ? > 2.8,
even if the quality improves by applying kernel functions, it still looks blurry for
both images. Therefore, the participants choose ?different? for these comparisons.
Third, some participants reported that the length of the study led to visual fatigue
and that they were not sure about some of their responses.
Using the above observations, I modified the final user study to be shorter and
more focused.
First, to reduce the total time that participants are in the HMD, I used the fact
that most participants found foveated renderings different from the full-resolution
rendering for ? > 2.4 (?2 > 5.76). Since the goal is to accelerate rendering while
maintaining perceptually similar quality, I reduced the range of ? to be between 1.2
to 2.4 (?2 between 1.44 to 5.76) in the final user study.
Second, I observed that the participants quickly came up to speed within a
couple of trials in the training session. I therefore reduced the number of trials in
the training session from 20 to 5. This also allowed us to shorten the user study
duration and maintain a high level of visual attentiveness of the participants. Third,
some of the participants reported that the rendering time of 2 seconds was too short.
To address this I increased the time of each rendering to 2.5 seconds in the final
study. Fourth, to continually check for the visual attentiveness of the participants, I
modified the final user study by randomly inserting 30% of the trials to be ?validation
trials? that had identical full-resolution renderings for both choices. If the participant
declared these validation renderings to be different, I would ask the participant to
45
stop, get some rest, and then continue. After making these changes, the total time
participants spent in the HMD was reduced from around 25 minutes in the pilot
study to around 15 minutes in the final study.
2.4.3 Final User Study
Procedure The introduction and calibration stages are the same as the pilot user
study. The training session includes five trials with different parameters. Each testing
session involves 28 trials with multiple parameter combinations (parameter ? ranging
from 1.2 to 2.4 with the step size 0.2 (?2 ranging from 1.44 to 5.76); and kernel
parameter ? ranging from 1 to 4 with the step size 1) as well as additional ?validation
trials?. Order of the parameters is fully counterbalanced. The participants are asked
to rest after each session or if they do not pass a ?validation trial?. I also changed
the rendering-display time to 2.5 seconds.
Participants I recruited 18 participants via campus email lists and flyers. All
participants were at least 18 years old with normal or corrected-to-normal vision
(with contact lenses). The participants are collected using campus flyers and emails.
Results and Analysis I report the percentage PI and the corresponding standard
error in Figure 2.20. I make the null hypothesis (H0) that the foveated rendering
results with the four kernel functions are equally effective. As shown in Table 1, with
a Cochran?s Q test [60,61], I have found that there exists a significant difference across
the multiple ? for ? = 1.6, 1.8, 2.2 (?2 = 2.56, 3.24, 4.84) with ?2(3) = 7.81, p < 0.05.
46
The results with very small ? = 1.2, 1.4 (?2 = 1.44, 1.96) and very large ? = 2.4
(?2 = 5.76) are not significantly different, which are reasonable. For small ?2, the
rendering result without kernel function is clear enough, so there is little room for
improvement. For large ?2, both the rendering results with and without kernel
function are blurry for the users.
Table 2.1: Cochran?s Q values at different ?2.
?2 1.44 1.96 2.56 3.24 4.00 4.84 5.76
Cochran?s Q value 1.72 5.79 8.20 8.25 7.49 14.27 5.48
p value 0.631 0.122 0.042 0.041 0.058 0.002 0.139
To achieve visually acceptable results for foveated rendering, I use a threshold
of 80% responses considering foveated rendering to be visually indistinguishable from
full-resolution rendering. To achieve the highest rendering acceleration, I look for
the highest ? that met this threshold. I therefore choose ? = 1.8 (?2 = 3.24) and
? = 4 as the desired parameters for the interactive rendering evaluation.
2.5 Results and Acceleration
I implemented kernel foveated rendering on NVIDIA GeForce GTX 1080, by
using the deferred shading pipeline of the Falcor engine [62]. I report results of my
rendering acceleration for resolutions of 1920? 1080, 2560? 1440, and 3840? 2160.
Using the results from the final user study, I selected the LP-buffer parameter ? = 1.8,
and kernel parameter ? = 4 for the evaluations below.
47
2.5.1 Rendering Acceleration of 3D Textured Meshes
I use the Amazon Lumberyard Bistro [19] scene with physically-based shading,
reflection, refraction, and shadows to simulate the complex shading effects as shown
in Figure 2.21. I choose Amazon Lumberyard Bistro because this scene has complex
triangular meshes, rendering textures and compact lighting effect. The comparison
of the break-down of rendering time between KFR and the ground truth of deferred
shading is shown in Table 2. I observed that the frame rate increases for all resolutions
as shown in Table 3, with a speedup of 2.0X ? 2.8X.
2.5.2 Rendering Acceleration of Ray-casting Rendering
Ray casting uses ray?surface intersection tests to solve a variety of problems
in computer graphics and computational geometry. It can also be used for creating
scenes. Rendering of high-resolution ray cast scenes can be an extremely time-
consuming process. I used the complex ray-casting scene with 16 different primitives
by I?n?igo Qu??lez to evaluate the acceleration of kernel foveated rendering. Figure 2.22
shows a comparison of the foveated scene and the ground truth. The frame rate
increases for all resolutions as shown in Table 3, with a speedup of 2.9X ? 3.2X.
48
(a) original 3D geometries, 31 FPS
fovea
(b) foveated 3D geometries (? = 1.8, ? = 4), 67 FPS
Figure 2.21: Comparison of (a) full-resolution rendering and (b) foveated rendering
for 3D meshes involving a geometry pass with 1, 020, 895 triangles as well as multiple
G-buffers
at 2560? 1440 resolution.
49
(a) original ray-marching scene, 10 FPS
fovea
(b) foveated ray-marching scene (? = 1.8, ? = 4), 30 FPS
Figure 2.22: Comparison of (a) full-resolution rendering and (b) foveated ray-
marching scene with 16 samples per pixel
rendered at 2560? 1440.
50
Table 2.2: Timing comparison between the ground truth and KFR for one frame.
The resolution is 1920? 1080.
Procedure Timing (ms)
Ground Truth KFR
Depth Pass 0.327 0.309
Shadow Pass 3.744 4.503
Defer Pass 2.985 3.034
SkyBox 0.039 0.039
Shading / Pass1 22.043 6.674
Pass2 N/A 0.090
Total 29.138 14.649
Total GPU Time 31.892 17.052
2.6 Discussion
Here I compare the Kernel Foveated Rendering (KFR) pipeline with selected
prior art, including: Foveated 3D Graphics (F3D) [6], Multi-rate Shading (MRS) [11],
Coarse Pixel Shading (CPS) [7], and Adaptive Image-Space sampling (AIS) [13].
As mentioned by [13], both MRS and CPS pipelines require adaptive shading
features which are not yet commonly available on commodity GPUs and so they
51
Table 2.3: Frame rate and speedup comparison for kernel foveated rendering at
different resolutions with ? = 1.8, ? = 4.0.
Scene 3D Textured Meshes Ray Casting
Resolution Ground Truth Foveated Speedup Ground Truth Foveated Speedup
1920? 1080 55 FPS 110 FPS 2.0X 20 FPS 57 FPS 2.9X
2560? 1440 31 FPS 67 FPS 2.2X 10 FPS 30 FPS 3.0X
3840? 2160 8 FPS 23 FPS 2.8X 5 FPS 16 FPS 3.2X
rely on software simulator implementations. In contrast, F3D, AIS, and the KFR
pipelines can be easily mapped onto the current generation of GPUs.
The F3D pipeline has achieved impressive speedups of 10X?15X in the informal
user study, and a factor of 4.8X ? 5.7X in the formal user study. Nevertheless, the
F3D approach uses three discrete layers, while the KFR parameterizes the distribution
of samples continuously in the log-polar domain. F3D relies on specifically designed
anti-aliasing algorithms including jitter sampling and temporal reprojection, thus
limiting F3D to simpler material models and less complex geometry [13]. In contrast,
KFR could easily be coupled with the state-of-the-art screen-space anti-aliasing
techniques, such as TAA [56] and recent G-buffer anti-aliasing strategies [57].
Both the AIS and KFR pipelines mimic the continuously changing distribution
of photo-receptors in the retina. Nonetheless, there are three significant differences:
complexity and evaluation of the perceptual model, interpolation, and speedup. First,
AIS uses four parameters from [63] to approximately model the linear degradation
52
behavior of acuity with 30? eccentricity. However, these parameters have not yet
been evaluated on how they affect foveation and perception in HMDs or beyond 30?
eccentricity. In contrast, KFR uses only two parameters: the reduced-resolution LP-
buffer parameter ? and the kernel parameter ? in conjunction with a simple coordinate
transformation. KFR has established desirable values of ? and ? through user studies
in head-mounted displays. Second, AIS relies on the pull-push interpolation [54] to
fill the pixels that are missed due to variable sampling of silhouette features and
object saliency.
In comparison, KFR uses the built-in GPU-driven mipmap interpolation which
reduces the additional interpolation cost. However, it is worth investigating how
incorporating object saliency [64,65] could further improve the KFR pipeline.
It is a challenge to compare multiple foveated approaches given the varying
hardware and perceptual quality of the results. One possibility is to compare the
speedups as percentages of rendering time reduction with certain reduction sampling
rate. By rendering with 59% of the total amount of shaded pixels, AIS reports
an overall rendering time reduction of 25.4%, while KFR achieves 29.9% reduction
on average. Like AIS, KFR speedup also depends on the amount of time spent in
shading; the greater the shader computations, the higher will be the KFR speedup.
53
Chapter 3: 3D-Kernel Foveated Rendering for Light Fields
3.1 Overview
Classic light field rendering is limited to low-resolution images because the
rendering process of large-scale, high-resolution light field image arrays requires
a great amount of texture sampling, thus increasing the latency of rendering and
streaming.
With the rapid advances in optical microscopy imaging, several technologies
have been developed for interactively visualizing [66] or reconstructing microscopy
volumes [67?71]. Recently, light field microscopy [72] has emerged, which allows one
to capture light fields of biological specimens in a single shot. Afterwards, one could
interactively explore the microscopic specimens with a light-field-renderer, which
automatically generates novel perspectives and focal stacks from the microscopy
data [73]. Unlike regular images, light field microscopy enables natural-to-senses
stereoscopic visualization. Users may examine the high-resolution microscopy light
fields with the inexpensive commodity virtual reality head-mounted displays (HMDs)
as a natural stereoscopic tool. The method of rendering light field microscopy can
be applicable to high-resolution light fields from other sources.
To the best of our knowledge, the interactive visualization of high-resolution
54
light fields with low latency and high-quality remains a challenging problem.
Human vision spans 135? vertically and 160? horizontally, but the highest-
resolution foveal vision only covers a 5? central region of the vision [6]. As estimated
by Patney et al. [8], only 4% of the pixels in a modern HMD are mapped on the
fovea. Therefore, foveated rendering techniques that allocate more computational
resources for the foveal pixels and fewer resources elsewhere can dramatically speed
up light field visualization.
In this chapter, we present 3D-kernel foveated rendering (3D-KFR), a novel
approach to extend the kernel foveated rendering (KFR) [23] framework to light fields.
In 3D-KFR, we parameterize the foveation of light fields by embedding polynomial
kernel functions in the classic log-polar mapping [28,29] for each slice. This allows
us to alter both the sampling density and distribution, and match them to human
perception in virtual reality HMDs. Next, we optimize 3D-KFR by adjusting the
weight of each slice in the light fields, so that it automatically selects the optimal
foveation parameters for different images according to the gaze position and achieves
higher speedup. In this way, 3D-KFR further accelerates the rendering process of
high-resolution light fields while preserving the perceptually accurate foveal detail.
We have validated our approach on the rendering of light fields by carrying
out both quantitative experiments and user studies. Our method achieves speedups
of 3.47??7.28? on different levels of foveation and different rendering resolutions.
Moreover, our user studies suggest the optimal parameters that anyone can use for
rendering of foveated light fields on modern HMDs.
In summary, our contributions include:
55
1. designing 3D-KFR, a new visualization method to observe the light fields,
which provides similar visual results as the original light field, but at a higher
rendering frame rate;
2. conducting user studies to identify the 3D-KFR parameters governing the
density of sampling to maximize perceptual realism and minimize computation
for foveated light fields in HMDs;
3. implementing the 3D-KFR light field pipeline on a GPU, and achieving speedups
of up to 7.28? for the light field with a resolution of 25? 25? 1024? 1024 px
with minimal perceptual loss of detail.
3.2 Related Work
3.2.1 Light Field Rendering
4D light fields [16, 54] represent an object or a scene from multiple camera
positions as shown in Figure 3.1. We can generating new views from arbitrary camera
positions without depth information or feature matching, simply by interpreting the
input images as 2D slices of a 4D light field function, which completely characterizes
the flow of light through unobstructed space in a static scene with fixed illumination.
Chai et al. [74] determine the minimum sampling rate for light field render-
ing by spectral analysis of light field signals using the sampling theorem. Ng [75]
contributes to a Fourier-domain algorithm for fast digital focusing for light fields.
Lanman and Luebke [76] propose near-eye light field displays supporting continuous
56
Figure 3.1: Levoy and Hanrahan [16] two visualizations of a light field. (a) Each
image in the array represents the rays arriving at one point on the uv plane from all
points on the st plane, as shown at left. (b) Each image represents the rays leaving
one point on the st plane bound for all points on the uv plane. The images in (a) are
off-axis perspective views of the scene, while the images in (b) look like reflectance
maps.
57
accommodation of the eye throughout a finite depth of field, thus providing a means
to address the accommodation-convergence conflict occurring with existing stereo-
scopic displays. Huang et al. [77] analyze the lens-distortion in light field rendering
and correct it, thus improving the resolution and blur quality. Zhang et al. [78]
propose a unified mathematical model for multilayer-multiframe compressive light
field displays that significantly reduces artifacts compared with attenuation-based
multilayer-multiframe displays. Lee et al. [79] propose foveated retinal optimization
(FRO), which has tolerance for pupil movement without gaze tracking while main-
taining image definition and accurate focus cues. The system achieves 38?? 19? FoV,
continuous focus cues, low aberration, small form factor, and clear see-through prop-
erty. However, FRO adopts the idea of foveation to improve the display performance
of the multi-layer displays rather than the rendering speed of 3D content. Sun et
al. [17] design a real-time foveated 4D light field rendering and display system. Their
work analyzes the bandwidth bound for perceiving 4D light fields and proposes a
rendering method with importance sampling and a sparse reconstruction scheme.
Their prototype renders only 16% ? 30% of the rays without compromising the
perceptual quality. The algorithm is designed for the desktop screen. In contrast
to the previous work, our approach focused on foveated visualization of large light
fields in virtual reality HMDs.
Sun et al. [17] design a real-time foveated 4D light field rendering and display
system as shown in Figure 3.2. Based on the theoretical analysis on visual and
display bandwidths, they find the frequency bound for the retina, eye lens and the
display. Afterwards, they finish a series of psychophysical experiments and formulate
58
a content-adaptive importance model in the 4D ray space. Their prototype renders
only 16%? 30% of the rays without compromising the perceptual quality.
Figure 3.2: Sun et al. [17] design a real-time foveated 4D light field rendering and
display system.
Mildenhall et al. [80] propose an algorithm to render novel views from an
irregular grid of sampled views by expanding each sampled view into a local light
field via a multiplane image (MPI) scene representation and blending adjacent local
light fields.
59
3.2.2 Light Field Microscopy
Weinstein and Descour [81] use lens arrays for single-view-point array micro-
scope with ultra-wide FOV instead of light fields with perspective views. Levoy et
al. [72] propose using light fields to produce microscopy with perspective views and
focal stacks. Wilt et al. [82] confirm the importance of observing cellular properties by
using light microscopy for neuroscientists. The advances include enabling new exper-
imental capabilities and permitting functional imaging at faster speeds. Prevedel et
al. [83] implement a light field deconvolution microscopy and demonstrate its ability
to simultaneously capture the neuronal activity of the entire nervous system.
3.3 Proposed Algorithm
In this section, we first introduce KFR for 4D light field rendering. Next, we
generalize KFR to 3D-KFR. Finally, we discuss the resulting rendering acceleration
that 3D-KFR is able to achieve over KFR.
3.3.1 KFR for 4D Light Field Rendering
In the k ? k light field with image resolution of W ?H, the total number of
texture samples for rendering the original light field Noriginal is,
Noriginal = k
2 ?WH (3.1)
KFR accelerates the rendering process of light fields by reducing texture
sampling. In the first pass, we perform kernel log-polar transformation for each
60
slice and render to a reduced resolution buffer with dimensions of k ? k ? w ? h.
In the second pass, we perform the inverse log-polar transformation to map the
pixels back to the screen. The kernel function K (x) is defined in [23], which can
be any monotonically increasing function with K (0) = 0 and K (1) = 1, such as a
polynomial, ?? ??
K (x) = ? xii , where ?i = 1. (3.2)
i=0 i=0
We next present the two passes of the KFR algorithm.
In the first pass, we transform the image from Cartesian coordinates to kernel
log-polar coordinates. For each pixel in screen space with coordinates (x, y), foveal
point F (?x, y?) in Cartesian coordinates, we define x?, y? as
x? = x? x?, y? = y ? y?. (3.3)
Then, we transform point (x?, y?) to (u, v) in kernel log-polar coordinates using
Equation 3.4, ( ? ? )
?1 log?x , y ?2u = (K ( ? wL) ) (3.4)
y? h
v = arctan ? + 1 [y
? < 0] ? 2? ?
x 2?
K?1 (?) is the inverse of the kernel function, and L is the log of the maximum
distance from fovea to one of the four corners of the screen as shown in Equation 3.5,
L = log (max (max (l1, l2) ,max (l3, l4))) . (3.5)
61
Here,
l1 = ?x?, y??2
l2 = ?W ? x?, H ? y??2
(3.6)
l3 = ?x?, H ? y??2
l4 = ?W ? x?, y??2
We define ? = W = H as the ratio between the full-resolution screen width (or
w h
height) and the reduced-resolution buffer width (or height). The number of texture
samples for the first pass NKFR pass 1 can be theoretically inferred as:
W H k2
N 2KFR pass 1 = k ? ? = ?WH (3.7)
? ? ?2
In the second pass, a pixel with kernel log-polar coordinates (u, v) is transformed
back to (x??, y??) in Cartesian coordinates. Let
L 2?
A = , B = , (3.8)
w h
then the inverse transformation can be formulated as Equation 3.9,
x?? = exp (A ?K (u)) ? cos (Bv) + x?
(3.9)
y?? = exp (A ?K (u)) ? sin (Bv) + y?
The number of texture samples for the second pass NKFR pass 2 is
NKFR pass 2 = WH (3.10)
The total number of texture samples for rendering the light field with KFR is
NKFR = NKFR pass 1 +NKFR pass 2
(3.11)
k2
= ( + 1) ?WH
?2
62
The parameter ? controls the total number of pixels of the reduced-resolution
buffer, thus controlling the foveation rate and the amount of sampling. Comparing
Equations 3.1 and 3.11, we notice that the number of texture samples can be greatly
reduced in KFR with ? > 1.0. Kernel function controls the distribution of pixels
in the whole image. By adjusting kernel functions, we can determine the pixel
distribution and choose one that mimics the photo-receptor distribution of human
eyes. The kernel log-polar mapping algorithm allows us to mimic the fall-off of
photo-receptor density of human visual system with different ? and different kernel
functions.
3.3.2 3D-KFR for 4D Light Field
The rendering of 4D light field is different from the rendering of 3D meshes
because the center camera position of each slice is different. Since the foveation level
of a pixel is affected by the distance to the center camera, the foveation parameter
can be different for different slices in a light-field image array. We optimize the KFR
algorithm into 3D-KFR by adjusting the weight of each slice in the light field, so that
it is able to automatically select the optimal foveation parameters for different images
according to the gaze position, thereby achieving greater speedup. Our algorithm
consists of two passes as shown in Figure 1.4.
We define d as the distance between the local center camera of the frame
Xcam ij(xij, yij) and the gaze position Xcam 0(x0, y0),
d = ?Xcam 0 ?Xcam ij?2 (3.12)
63
fovea fovea fovea fovea
(a) original light field (b) 3D-KFR, ? = 1.2 (c) 3D-KFR, ? = 2.0 (d) 3D-KFR, ? = 3.0
Figure 3.3: The result comparison of the foveated light field with fovea on the center
of the screen. (b) - (d) are the application of 3D-KFR on light field with (b) ?0 = 1.2,
(c) ?0 = 2.0, (d) ?0 = 3.0. The left zoomed-in views show that the application of
3D-KFR doesn?t make changes in the fovea; the middle zoomed-in views and the
right zoomed-in views show that larger ?0 causes detail loss in the peripheral region.
64
fovea fovea fovea fovea
(a) original light field (b) 3D-KFR, ? = 1.2 (c) 3D-KFR, ? = 2.0 (d) 3D-KFR, ? = 3.0
Figure 3.4: The result comparison of the foveated light field with fovea on the
peripheral region of the screen. (b) - (d) are the application of 3D-KFR on light field
with (b) ?0 = 1.2, (c) ?0 = 2.0, (d) ?0 = 3.0. The left zoomed-in views show that
the application of 3D-KFR doesn?t make changes in the fovea; the middle zoomed-in
views and the right zoomed-in views show that larger ?0 causes detail loss in the
peripheral region.
65
We partition the original dataset into multiple progressive regions: the inner
foveal region (highlighted in dark green) indicates the fovea, i.e., where the user is
currently looking at; as d increases, the peripheral regions (highlighted in lighter
green and white) are rendered in smaller framebuffers with less texture sampling.
We classify the frame of the i-th row and j-th column Iij to foveal region or
peripheral region with different framebuffers by d as shown in Equation 3.13.
???????????foveal region d < r0???????peripheral region 1 r0 ? d < r1
Iij ? ?????peripheral region 2 r ? d < r (3.13)???
1 2
??????... ...???peripheral region N rN?1 ? d < rN
In the first pass, assume the foveal region covers k0 frames and the i ? th
peripheral region peripheral region i covers ki frames. Our approach reduces the
framebuffer size for the foveal region by ?20, and reduces the framebuffer size of
peripheral region i by ?21,..., ?
2
N , respectively. Then the number of total texture
samples in the first pass N3D-KFR pass 1 can be theoretically inferred as:
1 1 1
N3D-KFR pass 1 = k(0 ?WH ? + k1 ?WH ? + ...+ k2 2 N ?WH ?? 20 ) ?1 ?N (3.14)
k0 k1 kN
= + + ...+ ?WH
?20 ?
2 2
1 ?N
We can also write Noriginal as:
N 2original = k ?WH = (k0 + k1 + ...+ kN) ?WH (3.15)
66
So the total number for texture sampling for the foveal region and peripheral
region are reduced by 1 ?, 1 ?, ..., and 12 2 ?, respectively.?0 ?1 ?2N
We choose smaller ? with small d in order to keep more details. And we
choose larger ? for frames with larger distance in order to reduce rendering cost (i.e.,
?0 ? ?1 ? ... ? ?n).
The algorithm of light field rendering combined with kernel log-polar transfor-
mation is shown in Algorithm 3.
Algorithm 3 3D-KFR: Pass 1
Input:
Aperture size: a,
focal point ratio: f ,
fovea coordinate in screen space: Xfovea (?x, y?),
pixel coordinate in LP-Buffer: Xbuffer (u, v),
k ? k light field {I} with image resolution of n? n.
Output:
Pixel value Cbuffer, ? for the coordinate Xbuffer.
1: acquire the coordinate for the center camera Xcam 0
2: acquire the coordinate for the fovea Xfovea
3: initialization: Cbuffer, ? ? 0, Nbuffer, ? ? 0
4: for row index i ? [0, k] do
5: for column index j ? [0, k] do
6: calculate ? with Xfovea with Equation 3.13
7: update L with Xfovea with Equation 3.5
67
8: let A = L , B = 2?
w h
9: acquire Xcam ij for frame Iij
10: dij ? ?Xcam 0 ?Xcam ij?2
11: if dij < a then
12: x? ? exp (A ?K (u, ?)) ? cos (Bv) + x?
13: y? ? exp (A ?K (v, ?)) ? sin (Bv) + y?
14: X ? ? ?Sample ? (x , y )
15: XSample ? X ?cam ij + (XSample ?Xcam ij) ? f
16: if XSample in the range of the screen then
17: Cbuffer, ? ? Cbuffer, ? + Iij ? Color(XSample)
18: Nbuffer, ? ? Nbuffer, ? + 1
19: end if
20: end if
21: end for
22: end for
C
23: return Cbuffer, ? ? buffer, ?Nbuffer, ?
In the second pass, we carry out the inverse-log-polar transformation with
anti-aliasing to map the reduced-resolution rendering to the full-resolution screen,
the algorithm is shown in Algorithm 4. To reduce artifacts in the peripheral regions,
we use a Gaussian filter with a 5? 5 kernel for the peripheral region of the recovered
image in the screen.
68
The number of texture samples for the second pass N3D-KFR pass 2 is,
N3D-KFR pass 2 = (1 +N) ?WH (3.16)
The total number of texture samples for rendering the light field with 3D-KFR
is
N3D-KFR = N(3D-KFR pass 1 +N3D-KFR pass 2 ) (3.17)
k0 k1 kN
= + + ...+ + 1 +N ?WH
?20 ?
2 ?21 N
In the light field rendering, we commonly have k ? 16. In 3D-KFR, we
commonly have 1.0 < ? ? 3.0, and we choose N = 2 as the number of peripheral
regions and K (x) = x4 as the kernel function. Therefore, we have
2
NKFR pass 2 = WH 
k ?WH = NKFR pass 1, (3.18)
?2
and
N3D-KFR pass 2 = ((N + 1)WH ) (3.19)
 k0 k1 kN+ + ...+ ?WH = N .
?2 2
3D-KFR pass 1
0 ?1 ?
2
N
Equations 3.18 and 3.19 show that the extra time consumed by the pass 2 can be
omitted. We then have
? k
2
NKFR ( ) ?WH (3.20)
?20
( )
? k0 k1 kNN3D-KFR + + ...+ ?WH (3.21)
?20 ?
2
1 ?
2
N
Comparing Equations 3.1, 3.20 and 3.21, we have
N3D-KFR  NKFR  Noriginal, (3.22)
69
which shows that the 3D KFR scheme can accelerate the rendering of the light field
beyond a simple KFR approach. The resulting comparison of the original light field
rendering and the 3D-KFR for light field is shown in Figure 3.3 and Figure 3.4. With
3D-KFR applied, pixel density decreases from the fovea to the periphery. We do not
notice any differences in the fovea with different ?0 between the left zoomed-in views
because 3D-KFR uses a weighted-sum which strengthens the frames with small d.
For the same reason, we can notice the loss of detail from the right zoomed-in views
of the periphery. Next, we determine what parameters ensure that the peripheral
loss and the peripheral blur are not noticeable by conducting user studies.
Algorithm 4 3D-KFR: Pass 2
Input:
Fovea coordinate in screen space: Xfovea (?x, y?),
pixel coordinate in screen space: XDisplay (x, y),
Output:
Pixel value Cdisplay for coordinate XDisplay.
1: initialization: Cdisplay ? 0, Ndisplay ? 0
2: acquire the coordinate for the fovea Xfovea
3: update L with Xfovea with Equation 3.5
4: for each attachment I? in LP-Buffer do
5: x? ? x? x?
6: y? ? y ? y?( )
u? K?1 l(og?x
?,y??
7:
L) ? w
y?
8: v ? arctan h? + 1 [y? < 0] ? hx 2?
70
9: XSample ? (u, v)
10: Cdisplay ? Cdisplay + I? ? Color(XSample)
11: Ndisplay ? Ndisplay + 1
12: end for
C
13: return C ? displaydisplay Ndisplay
3.4 User Study
We have carried out user studies to find the largest foveation parameter values
for ?0 that results in visually acceptable foveated rendering.
3.4.1 Apparatus
Our user study apparatus is shown in Figure 3.5. We used an Alienware laptop
with an NVIDIA GTX 1080, a FOVE HMD, and an XBOX controller. The FOVE
display has a 100? field of view, a resolution of 2560? 1440, and a 120 Hz infrared
eye-tracking system with a precision of 1? and a latency of 14 ms.
Since public datasets on large-scale and high-resolution microscopy light field
datasets are not yet available, we have rendered and open sourced 30 microscopy
light field datasets1. We synthesized the microscopy dataset on Cell and Cellular
Lattice for the user study.
1Simulated HD Light Fields: http://users.umiacs.umd.edu/~xmeng525/3D_KFR_For_Light_
Fields/MicroscopyLightFieldResource/
71
Figure 3.5: Our user study set up with gaze-tracker integrated into the FOVE
head-mounted display.
3.4.2 Participants
We recruited a total of 22 participants via campus email lists and flyers. All
participants are at least 18 years old with normal or corrected-to-normal vision (with
contact lenses).
3.4.3 Procedure
We conducted three different and independent experiments to test the param-
eters for which 3D-KFR produces acceptable quality to non-foveated rendering: a
Pair Test, a Random Test, and a Slider test.
In the Pair Test, we presented each participant with pairs of foveated and
full-resolution light field renderings. We presented the two renderings in each pair in
a random order and separated by a short interval of black screen (0.75 seconds). The
foveation parameter ranged between ?0 = 1.2 to ?0 = 3.0. Pairs at all quality levels
in this range were presented twice (monotone increasing then monotone decreasing)
for each dataset, i.e. ?0 increased from 1.2 to 3.0 then decreased from 3.0 to 1.2. At
72
the end of each comparison, the participant responded upon the similarity between
the two rendering results by the XBOX controller. The answer contains 5 confidence
levels: 5 represents perceptually identical, 4 represents minimal perceptual difference, 3
represents acceptable perceptual difference, 2 represents noticeable perceptual difference
and 1 represents significant perceptual difference.
In the Random Test, we presented each participant with pairs of foveated
and full-resolution light field renderings. We presented the two renderings in each
pair in a random order and separated by a short interval of black (0.75 seconds).
The foveation parameter ranged between ?0 = 1.2 to ?0 = 3.0. Pairs at all quality
levels in the range were presented once for each dataset in random order. At the
end of each comparison, the participant responded upon the similarity between the
two rendering results by the XBOX controller. The answer contains 5 confidence
levels: 5 represents perceptually identical, 4 represents minimal perceptual difference, 3
represents acceptable perceptual difference, 2 represents noticeable perceptual difference
and 1 represents significant perceptual difference.
The Slider Test lets the participants navigate the foveation quality space
themselves. First, the participant observed the full-resolution rendering result as a
reference. Next, we presented the participant with the lowest level of foveation quality
(?0 = 3.0) while the participant could progressively increase the foveation level (with
a step size of 0.1). The participant switched between the foveated rendering result
and the reference image back and forth, until they found out the lowest foveation
level which is visually equivalent to the non-foveated reference. We recorded the
first quality level index at which the participant stopped as the final response for
73
the slider test.
To ensure the visual attentiveness of the participants, we randomly inserted
30% of the trials to be ?validation trials? that had identical full-resolution results for
both choices in the Pair Test and the Random Test. If the participant declared
these validation renderings to be mostly the same with acceptable difference, noticeable
difference or totally different, we would ask the participant to stop, take some rest,
and then continue. Meanwhile, we recorded this choice as an error. If error ? 5 in
the Pair Test and the Random Test, we would stop the user study and discard
the data of the user. We discarded two participants according to this rule.
3.5 Results and Acceleration
3.5.1 Results of the User Study
Let S? be the average score of all the users for a specific ?0, and let P? be the
percentage of responses of rated foveated and non-foveated renderings as perceptually
identical (5) and minimal perceptual difference (4) for a specific ? = ?0.
The result of S? for the Pair Test is shown in Figure 3.6. Generally, S?
decreases with the increase of ?0. A Friedman test revealed a significant effect
of the users? responses on foveation parameter ? (?2(20) = 104.3, p < 8.9? 10?14).
The result of P? for the Pair Test is shown in Figure 3.7. We have identified a
threshold of P? = 90% for ?pair = 2.4 as the foveation parameter that provides
minimal perceptual differences based on the Pair Test.
The result of S? for the Random Test is shown in Figure 3.8. The trend
74
Average Score of Pair Test v.s. ?
5
4.5
4
3.5
3
?=1.2 ?=1.4 ?=1.6 ?=1.8 ?=2.0 ?=2.2 ?=2.4 ?=2.6 ?=2.8 ?=3.0 ?=3.0 ?=2.8 ?=2.6 ?=2.4 ?=2.2 ?=2.0 ?=1.8 ?=1.6 ?=1.4 ?=1.2
 
Figure 3.6: The Pair Test responses of S? across sliding foveation parameters ?.
S? decreases with the increase of ?. 5 represents perceptually identical, 4 repre-
sents minimal perceptual difference, 3 represents acceptable perceptual difference, 2
represents noticeable perceptual difference, and 1 represents significant perceptual
difference (2 and 1 are not shown)
.
100% 98% 100% 98% 98% 98% 100% 98%
95% 95%
95% 93% 93% 93% 93%
90%
90% 88%
85%
85% 83%
80% 78% 78%
75%
70%
70%
65%
60%
?=1.2 ?=1.4 ?=1.6 ?=1.8 ?=2.0 ?=2.2 ?=2.4 ?=2.6 ?=2.8 ?=3.0 ?=3.0 ?=2.8 ?=2.6 ?=2.4 ?=2.2 ?=2.0 ?=1.8 ?=1.6 ?=1.4 ?=1.2
 
Figure 3.7: The Pair Test responses of P? across sliding foveation parameters ?.
P? decreases with the increase of ?.
75
Average Score of Jump Test v.s. ?
5
4.5
4
3.5
3
?=1.2 ?=1.4 ?=1.6 ?=1.8 ?=2.0 ?=2.2 ?=2.4 ?=2.6 ?=2.8 ?=3.0
 
Figure 3.8: The Random Test responses of S? across gradually varied foveation
parameters ?. S? decreases with the increase of ?. 5 represents perceptually identi-
cal, 4 represents minimal perceptual difference, 3 represents acceptable perceptual
difference, 2 represents noticeable perceptual difference, and 1 represents significant
perceptual difference (2 and 1 are not shown)
.
76
100% 98% 100% 100% 98% 100%
95% 95% 95%
95%
90% 88%
85%
80%
80%
75%
70%
65%
60%
?=1.2 ?=1.4 ?=1.6 ?=1.8 ?=2.0 ?=2.2 ?=2.4 ?=2.6 ?=2.8 ?=3.0
 
Figure 3.9: The Random Test responses of P? across sliding foveation parameters
?. P? decreases with the increase of ?.
that S? decreases with an increase of ? matches our expectation. A Friedman
test revealed a significant effect of the users? responses on foveation parameter
? (?2(20) = 29.2, p < 0.0006). The result of P? for the Random Test is shown
in Figure 3.9. We have identified a threshold of P? = 90% for ?random = 2.6 as
the foveation parameter that provides minimal perceptual differences based on the
Random Test.
The histogram of the user-chosen thresholds in the Slider Test is shown in
Figure 3.10. For instance, the histogram shows that 25% of the users found that
? = 3.0 or lower is acceptable; 75% of the users found that ? = 1.8 or lower is
acceptable. With ?0 = 1.6, 80% of the users considered that the foveated rendering
is visually indistinguishable from full-resolution rendering. We chose threshold
77
100%
100%
80% 80% 80%
80% 75%
70%
60%
60%
45%
40%
40%
25% 25%
20%
0%
?=1.0 ?=1.2 ?=1.4 ?=1.6 ?=1.8 ?=2.0 ?=2.2 ?=2.4 ?=2.6 ?=2.8 ?=3.0
 
Figure 3.10: The histogram of the optimal foveation parameter ? selected by each
user in the Slider Test. For instance, the histogram shows that 80% of the users
found that ? = 1.6 or lower is acceptable.
78
?slider = 1.6.
We have noticed that ?slider = 1.6 is smaller than ?pair = 2.4 and ?random = 2.6.
We speculate that the reason for a smaller sigma in the Slider Test is: if the users
are free to choose the threshold, they tend to choose the best quality they can achieve,
instead of the lower bound of the perceptually indistinguishable quality.
3.5.2 Rendering Acceleration
Using the three parameters, one could think of building a foveated rendering sys-
tem where the saccades are implemented with ? = 2.6 and the fixation implemented
with ? = 1.6.
Performance Evaluation and Discussion We have implemented the 3D
kernel foveated rendering pipeline in C++ 11 and OpenGL 4 on NVIDIA GTX
1080. We report the results of our rendering acceleration for the tissue dataset at
the resolution of k? k? 1024? 1024. We tested the rendering time for different light
field dimensions (20 ? k ? 25) and different ?0 (1.2 ? ?0 ? 3.0) with ?1 = 1.6?0,
?2 = 2.0?1. We used the kernel function K (x) = x
4. The evaluations are shown
in Figure 3.11, where ?0 = 1.0 corresponds to the rendering time of the original
field, and ?0 > 1.0 corresponds to the 3D-KFR rendering time. We notice that the
rendering time of 3D-KFR decreases with the increase of ?0.
We further tested the rendering time comparison and the speedup for ?slider =
1.6, ?pair = 2.4 and ?random = 2.6 as shown in Table 3.1. With ?pair = 2.4, the
rendering time is less than 21.96 ms (45.54 fps); with ?random = 2.6, the rendering
79
Rendering Time vs. <  for k x k x 1024 x 1024 Light Field
140
120
100
k = 20
80 k = 21
k = 22
k = 23
60
k = 24
k = 25
40
20
0
1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3
<
Figure 3.11: The rendering time for light field with different dimension and different
?.
80
Rendering Time (ms)
Table 3.1: The average timings and the corresponding speedups of 3D-KFR at
different light field dimensions and foveation parameters ?.
81
? = 1.6 ? = 2.4 ? = 2.6
Resolution Ground Truth
3D KFR3D-KFR Speedup3D KFR3D-KFR Speedup3D KFR3D-KFR Speedup
20? 20? 1024? 1024 66.83 ms 19.27 ms 3.47? 10.22 ms 6.54? 9.39 ms 7.11?
21? 21? 1024? 1024 74.17 ms 22.39 ms 3.31? 11.90 ms 6.24? 10.39 ms 7.14?
22? 22? 1024? 1024 92.33 ms 28.26 ms 3.27? 14.65 ms 6.30? 12.64 ms 7.30?
23? 23? 1024? 1024 100.26 ms 30.64 ms 3.27? 16.30 ms 6.15? 13.95 ms 7.18?
24? 24? 1024? 1024 122.29 ms 35.92 ms 3.40? 19.09 ms 6.41? 16.79 ms 7.28?
25? 25? 1024? 1024 138.93 ms 41.42 ms 3.35? 21.96 ms 6.33? 19.09 ms 7.28?
time is less than 19.09 ms (52.38 fps). ?pair = 2.4 and ?random = 2.6 meets the real-
time requirement of 30 fps. With ?slider = 1.6, the rendering times for k = 20, 21, 22,
or 23 are less than 30.64 ms (32.64 fps), which meets the real-time requirement of 30
fps. While the rendering times for k = 24 and 25 are less than 41.42 ms (24.14 fps),
they are still able to achieve reasonably interactive frame rates.
3.5.3 Quality Evaluation
The comparisons of the visualization of the original light field rendering and
the 3D-KFR rendering of different datasets are shown in Figure 3.12 - Figure 3.14.
We use structural dissimilarity (DSSIM) [84] [85] between the 3D-KFR and
the original light field approaches as the metric to evaluate the quality of 3D-KFR
results. DSSIM can be derived from structural similarity index (SSIM) [86] [87].
The measurement of SSIM and DSSIM between the two images ? and ? with size
N ?N is shown in Equations 3.23 and 3.24.
(2???? + c1)(2??? + c2)
SSIM(?,?) = (3.23)
(?2? + ?
2 + c )(?2 + ?2? 1 ? ? + c2)
1? SSIM(?,?)
DSSIM(?,?) = (3.24)
2
where ?? and ?? are the average pixel values for images ? and ?, respectively; ??
and ?? are the pixel variances for images ? and ?, respectively; ??? is the covariance
between images ? and ?; c1, c2 are two constants used to stabilize the division with
a weak denominator.
SSIM is a perception-based model that considers image degradation as perceived
82
change in structural information. A pair of images with low DSSIM indicates better
structural similarity. We measure the average DSSIM of the RGB channels, and
we show the results in Figure 3.12 - Figure 3.14. The DSSIM measurement of the
zoomed-in views for the fovea regions are small, which indicates high visual similarity.
With an increase in distance between the fovea position and the pixel position, the
DSSIM increases because of the foveation effect.
fovea fovea fovea fovea
(a) original light field (b) 3D-KFR, ? = 1.6 (c) 3D-KFR, ? = 2.4 (d) 3D-KFR, ? = 2.6
DSSIM 2.1e-5 9.03e-5 6.95e-4 4.20e-5 1.64e-4 7.64e-4 5.09e-5 1.85e-4 7.96e-4
Figure 3.12: Comparison of the foveated light field Biomine II. (b) - (d) using
3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6. The measured
DSSIM (lower is better) is shown for each zoomed-in view.
83
fovea fovea fovea fovea
(a) original light field (b) 3D-KFR, ? = 1.6 (c) 3D-KFR, ? = 2.4 (d) 3D-KFR, ? = 2.6
DSSIM 0.0010 0.0210 0.0221 0.0022 0.0311 0.0320 0.0027 0.0365 0.0366
Figure 3.13: Comparison of the foveated light field Cellular Lattice IV. (b) - (d) using
3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6. The measured
DSSIM (lower is better) is shown for each zoomed-in view.
fovea fovea fovea fovea
(a) original light field (b) 3D-KFR, ? = 1.6 (c) 3D-KFR, ? = 2.4 (d) 3D-KFR, ? = 2.6
DSSIM 9.90e-7 2.86e-6 1.07e-5 1.79e-6 5.03e-6 3.02e-5 1.95e-6 5.38e-6 3.25e-5
Figure 3.14: Comparison of the foveated light field Red Cells IV. (b) - (d) using
3D-KFR with (b) ?slider = 1.6, (c) ?pair = 2.4, (d) ?random = 2.6. The measured
DSSIM (lower is better) is shown for each zoomed-in view.
84
85
Chapter 4: Eye-dominance-guided Foveated Rendering
4.1 Overview
As we have seen in Chapter 2, foveation speeds up the rendering of each frame
by 3? to 5? [6, 13,23]. Other rendering acceleration approaches take advantage of
the properties of the human visual system, such as perception-guided reduction of
motion artifacts [88, 89] and temporal resolution multiplexing [22], which renders
even-numbered frames at a lower resolution.
The human visual system has a tendency to prefer visual stimuli of one eye
over the other eye [90]. This phenomenon is referred to as eye (or ocular) dominance.
The dominant eye is found to be superior to the non-dominant eye in visual acuity,
contrast sensitivity [91], color discrimination [92], and motor functions that are
visually managed and require spatial attention [93].
In this chapter, we propose the technique of eye-dominance-guided foveated
rendering (EFR), which leverages ocular dominance property of the human visual
system. We render the scene for the dominant eye at the normal foveation level
and render the scene for the non-dominant eye at a higher foveation level. This
formulation allows us to save more in the rendering budget for the non-dominant
eye. We have validated our approach by carrying out quantitative experiments and
86
user studies. Our contributions include:
1. designing eye-dominance-guided foveated rendering, an optimized technique for
foveated rendering, that provides similar visual results as the original foveated
rendering, but at a higher rendering frame rate;
2. conducting user studies to identify the parameters for the dominant eye and the
non-dominant eye to maximize perceptual realism and minimize computation
for foveated rendering in head-mounted displays; and,
3. implementing eye-dominance-guided foveated rendering pipeline on a GPU,
and achieving up to 1.47? speedup over the original foveated rendering at a
resolution of 1280? 1440 per eye with minimal perceptual loss of detail.
4.2 Related Work
The related work in foveated rendering as been reviewed in Section 2.2. In this
section, we review the relevant state-of-the-art research eye-dominance that inspires
our work. Eye dominance has been described as the inherent tendency of the human
visual system to prefer scene perception from one eye over the other [90].
Einat and Shaul [91] study the role of eye dominance in dichoptic non-rivalry
conditions, testing visual search and comparing performance with target presented
to the dominant or the non-dominant eye. Einat and Shaul [91] designed user study
with red?green glasses. The participants viewed an array of green and red lines
of uniform orientation, with a differently oriented target line present on half the
trials. And they observed the performance of visual perception. They found that the
87
dominant eye significantly performed better than the non-dominant eye when the
dominant eye saw the target, especially when the opposite eye saw the distractors.
This effect was reduced when only nearest-neighbor surrounding distractors were
homogeneous. They conclude that the dominant eye has priority in visual processing,
perhaps even resulting in inhibition of non-dominant eye representations.
Oishi et al. [94] observe that the dominant eye is functionally activated prior
to the non-dominant eye in conjugate eye movements. They recorded conjugate eye
movements to elucidate whether ocular dominancy was present at reading distance
in 21 right-handed normal participants by using a video-oculographic measurement.
This included the velocity of smooth pursuits, and the latency and velocity of saccades.
They defined the dominant eye for each participant by the near?far alignment test
and 20 subjects showed the right dominant eyes. Although the ocular dominancy
was not found in the velocity of smooth pursuit and vertical saccades, the velocity of
horizontal saccades in the dominant eyes was faster than that in the non-dominant
eyes. These results suggest that the dominant eye is functionally activated prior to
non-dominant eye in horizontal saccades at reading distance, which thus indicates
the functional dominance of the dominant eye in conjugate eye movements.
Koctekin et al. [92] find that the dominant eye has priority in red/green
color spectral region, leading to better color-vision discrimination ability. For
this comparative study, 50 students studying at Bas?kent University Faculty of
Medicine, including 31 males (62%) and 19 females (38%), with visual acuity of
20/20 and without congenital color vision deficiency (CCVD) evaluated by Ishihara
pseudoisochromatic plate test (IPPT) were recruited. Dominant eye was determined
88
by the Gundogan Method. The color discrimination ability was examined with the
Farnsworth ?Munsell 100 hue (FM100) test. The statistical differences among
the dominant eye and the non-dominant eye in red/green local region and total error
scores were found to be statistically significant in both genders.
McManus et al. [93] studied the relationship between handedness and eye-
dominance. Handedness and eye-dominance are associated statistically, although a
previous meta-analysis has found that the precise relationship is difficult to explain,
with about 35% of right-handers and 57% of left-handers being left eye dominant. Of
particular difficulty to genetic or other models is that the proportions are distributed
asymmetrically around 50%. This study explored whether this complicated pattern of
association occurred because it divides right-and left-handers into consistent handers
(who write and throw with the same hand) and inconsistent handers (who write and
throw with opposite hands). In an analysis of 10,635 participants from questionnaire
studies, 28.8% of left-handers and 1.6% of right-handers by the writing task were
found to be inconsistent for the throwing task. The results also showed that writing
hand and throwing hand both relate independently to eyedness and that throwing
hand is somewhat more strongly associated with eyedness. The study found that
24.2% of consistent right-handers are left eye dominant compared with 72.3% of
consistent left-handers, and 55.4% of inconsistent right-handers compared with 47.0%
of inconsistent left-handers. They conclude that eyedness is phenotypically secondary
to writing and throwing handedness. In the discussion they note that eyedness runs
in families. They present new data suggesting that writing hand and throwing hand
are co-inherited, and they argue that further data are now required to model properly
89
the associations of writing hand, throwing hand, and eyedness, as well as probably
also footedness and language dominance.
Chaumillon et al. [95] show that sighting eye dominance has an influence on
visually triggered manual action with shorter reaction time. They used the simple
and well-known Poffenberger paradigm [96] in which participants press a button with
the right or left index finger, in reaction to the appearance of a lateralized visual
stimulus. By selecting participants according to their dominant-eye and handedness,
they deciphered the impact of eye dominance on visuomotor transformation speed.
They showed that, in right-handers simple reaction times (RT) in response to a
lateralized visual target are shorter when it appears in the contralateral visual
hemifield with respect to the dominant eye. Meanwhile, in left-handers, only those
with a right dominant eye exhibit a shorter RT with the left hand and they show
no hemifield difference. Additionally, the Poffenberger paradigm has been used to
estimate the interhemispheric transfer time (IHTT) in both directions, from the right
to the left hemisphere or the reverse, by comparing hand RTs following stimulation
of each visual hemifield. Chaumillon et al. [95] demonstrates that this paradigm
leads to biased estimations of these directionally considered IHTT and provides
an explanation to the often reported IHTT negative values that otherwise appear
implausible. These new findings highlight the need to consider eye dominance in
studies investigating the neural processes underlying visually-guided actions. More
generally, they demonstrate a substantial impact of eye dominance on the neural
mechanisms involved in converting visual inputs into motor commands.
In this work, we take advantage of the weaker sensitivity and acuity of the
90
non-dominant eye and render the non-dominant display with greater foveation to
accelerate overall foveated rendering.
4.3 Proposed Algorithm
Here we present an overview of the parameterized foveated rendering and then
we build upon it to accomplish eye-dominance-guided foveated rendering.
KFR Shading Inv-KFRTransformer Transformer Anti-aliasing
KFR Inv-KFR
Transformer Shading Transformer Anti-aliasing
Figure 4.1: An overview of the eye-dominance-guided foveated rendering system. Our
system uses two foveated renderers, with different values of the foveation parameter ?,
for the dominant eye and the non-dominant eye, respectively. For the dominant eye,
we choose the foveation parameter ?d which results in an acceptable foveation level
for both eyes. For the non-dominant eye, we choose ?nd ? ?d, which corresponds to
a higher foveation level. Because the non-dominant eye is weaker in sensitivity and
acuity, the user is unable to notice the difference between the two foveation frames.
91
4.3.1 Foveation Model
We use the kernel foveated rendering (KFR) model proposed in Section 2
because this model parameterizes the level of foveation with two simple parameters:
frame buffer parameter ? controls the width of the frame-buffer to be rendered, thus
controlling the level of foveation; and the kernel function parameter ? controls the
distribution of pixels.
Here I briefly describe the KFR model again. The KFR model contains two
passes. In the first pass, the renderer transforms the shading materials in the G-buffer
(world positions, texture coordinates, normal maps, albedo maps, etc.) from the
Cartesian coordinates to kernel log-polar coordinates. Because of the non-uniform
scaling effect in the transformation, details in the foveal region are preserved and
details in the peripheral region are reduced.
Given a screen of resolution W?H, for each pixel with coordinate (x, y), we first
normalize the coordinate to (x?, y?). Then, KFR transforms the point (x?, y?) to (u, v)
in the kernel log-polar space via Equation 2.10, where L is the log of the maximum
distance from F (?x, y?) to the farthest screen corner as shown in Equation 2.12.
In the second pass, the renderer transforms the rendered scene from kernel
log-polar coordinates to Cartesian coordinates and renders to the full-resolution
screen. A pixel with log-polar coordinates (u, v) is transformed back to (x??, y??) in
Cartesian coordinates as shown in Equation 2.5.
According to Section 2, the kernel function parameter is suggested as ? = 4.
Therefore, we can control the level of foveation by only altering the parameter ?.
92
4.3.2 Eye-dominance-guided Foveation Model
Previous research on ocular dominance indicates that the non-dominant eye is
weaker than the dominant eye in sensitivity and acuity. Here, we propose that the
non-dominant eye is able to accept a higher level of foveation.
The overview of our eye-dominance-guided foveated rendering (EFR) system is
shown in Figure 4.1. In our EFR framework, for the baseline rendering, the system
uses a KFR renderer with foveation parameter ?d for the dominant eye and a KFR
renderer with ?nd for the non-dominant eye.
In the KFR algorithm, the parameter ? controls the width of the frame-buffer
to be rendered, and the rendering time is proportional to the area of the rendered
buffer. In other words, rendering time is inversely proportional to ?2. Suppose the
rendering time of the original frame for each eye is T , then the expected rendering
time of KFR with ?d = ?nd is:
T T 2T
tFR = + = (4.1)
?2 ?2d nd ?
2
d
The expected rendering time of eye-dominance-guided foveated rendering (EFR)
with ?d 6= ?nd is: ( ( ) )2
T T T ?d
tEFR = + = 1 + (4.2)
?2d ?
2 2
nd ?d ?nd
93
Then,
?(d ? ?)nd2
? ?
?(d (? 1nd ) )2 (4.3)
? T ?d 2T1 + ?
?2 2d ?nd ?d
?tEFR ? tFR.
Therefore, with ?d ? ?nd, the rendering time for head-mounted displays can
be reduced with non-perceivable difference between the foveated renderings for the
dominant eye and the non-dominant eye.
The theoretical speedup S achieved by EFR is shown in Equation 4.4:
tFR 2
S = = ( )2 ? 1. (4.4)tEFR 1 + ?d
?nd
Next, we conduct user studies to validate that the non-dominant eye is able
to accept a higher level of foveation than the dominant eye, and also identify the
foveation parameters for the dominant and non-dominant eyes.
4.4 User Study
We have conducted two user studies: a pilot study and a main study to
identify the eye-dominance-guided foveated rendering parameters which can produce
perceptually indistinguishable results compared with non-foveated rendering.
4.4.1 Apparatus
Our user study apparatus consists of an Alienware laptop with an NVIDIA
GTX 1080 and a FOVE head-mounted display. The FOVE headset is integrated
94
with a 120 Hz infrared eye-tracking system and a 2560 ? 1440 resolution screen
(1280? 1440 per eye). We use an XBOX controller for the interaction between the
participant and the system. User studies took place in a quiet room.
As shown in Figure 4.2, the computer-generated environments consist of two
fireplace room scenes [18] and 8 scenes from the Amazon Lumberyard Bistro [19].
These scenes are rendered with the Unity game engine. To ensure that the participants
are familiar with the user study system, we requested the participants to complete
all the tasks for the trial run and familiarize themselves fully with the interaction
before the formal tests.
4.4.2 Pre-experiment: Dominant Eye Identification
In both of the pilot study and the main study, we use the Miles Test [97] to
measure the eye dominance for each participant before the start of the study.
First, the participant extends their arms out in front of themselves and creates
a triangular opening between their thumbs and forefingers by placing their hands
together at a 45-degree angle. Next, with both eyes open, the participant centers
the triangular opening on a goal object that is 20 feet away. Then, the participant
closes their left eye with their right eye open. Finally, the participant closes their
right eye with their left eye open. If the goal object stays centered with the right
eye open and is no longer framed by their hands with the left eye open, the right
eye is their dominant eye. If the goal object stays centered with the left eye open
and is no longer framed by their hands with the right eye open, the left eye is their
95
dominant eye. The Miles Test is performed twice for each participant, and we record
the participant?s dominant eye and configure our renderer accordingly.
Scene 0 Scene 1 Scene 2 Scene 3 Scene 4
Scene 5 Scene 6 Scene 7 Scene 8 Scene 9
Figure 4.2: The scenes used for the user study. Scene 0 and Scene 1 are animated
fireplace room [18] and the other scenes are animated Amazon Lumberyard Bistro [19].
These scenes are rendered with the Unity game engine.
4.4.3 Pilot Study
We conduct a slider test and a random test in the pilot study. Each test
consists of two steps:
1. the participant estimates the Uniform Foveation Parameter ?UF which is
acceptable for both the dominant eye and the non-dominant eye. We express
this condition as ?d = ?nd = ?UF ;
2. the participant estimates the Non-dominant Eye Foveation Parameter ?NF
that results in the same overall visual perception as the uniform foveation, by
96
increasing the foveation level (reducing overall detail) of the rendering for the
non-dominant eye. We express this condition as: ?d = ?UF , ?nd = ?NF .
Participants We recruited 17 participants (5 females) at least 18 years old with
normal or corrected-to-normal vision via campus email lists and flyers. The majority
of participants had some experience with virtual reality. None of the participants
was involved with this project prior to the user study.
Slider Test The slider test allows the participants to navigate the foveation space
by themselves. We conduct the test with five different scenes with one trial for each
scene. We present the two-step study protocol as follows.
1. Estimation of ?UF : In each trial, we first present the participant with the
full-resolution rendering as a reference. Next, we present the participant with the
same foveated rendering for both eyes and allow the participant to adjust the level
of foveation by themselves: starting with the highest level of foveation, ?d = 3.0,
the participants progressively decrease the foveation level (with a step size of 0.2).
The participants can switch between the foveated rendering result and the reference
image back and forth until they arrive at the highest foveation level ?UF (with the
lowest overall level of detail) that is visually equivalent to the non-foveated reference.
2. Estimation of ?NF : In each trial, we present the participant the foveated
rendering with ?d = ?UF for the dominant eye, and allow the participant to adjust
the level of foveation for the non-dominant eye. Starting with ?nd = ?UF , the
participant can progressively increase the foveation level (with a step size of 0.2)
97
until they reach the highest foveation level ?NF that is perceptually equivalent to
the foveated rendering with uniform foveation parameter ?UF .
Random Test The random test allows the participant to score the quality of the
foveated rendering with different parameters in a random sequence. We conduct
the test with five different scenes with one trial for each scene. The two steps are
detailed below.
1. Estimation of ?UF : In each trial, we present the participant with two frames:
(1) the full-resolution rendering, and (2) the foveated rendering with ?d = ?nd = x,
where x is selected from the shuffled ? parameter array with ? ranging between 1.2
and 3.0 with a step size of 0.2. The two frames are presented in a random order. We
ask the participants to score the difference between the two frames they observe with
unlimited time to make their decision. The score SUF contains five confidence levels:
5 represents perceptually identical, 4 represents minimal perceptual difference, 3 rep-
resents acceptable perceptual difference, 2 represents noticeable perceptual difference,
and 1 represents significant perceptual difference.
We use a pairwise comparison approach and the participants finish the trials
with 1.2 ? x ? 3.0 in a random order. We choose the maximum x which results in
an evaluation of perceptually identical or minimal perceptual difference with respect
to the full-resolution (non-foveated) rendering, i.e.,
?UF = argmax SUF (x) ? 4. (4.5)
x
2. Estimation of ?NF : In each trial, we present the participant with two frames:
(1) foveated rendering with ?d = ?nd = ?UF , and (2) foveated rendering with
98
?d = ?UF , ?nd = x, where x is selected from the shuffled parameter array with
parameters ranging between ?UF and 3.0 with a step size of 0.2. The two frames are
presented in random order. We ask the participants to score the difference between
the two frames with unlimited time to make their decisions. The score SNF contains
five confidence levels: 5 represents perceptually identical, 4 represents minimal
perceptual imbalance, 3 represents acceptable perceptual imbalance, 2 represents
noticeable perceptual imbalance, and 1 represents significant perceptual imbalance.
We choose the maximum x that results in perceptually identical or minimal
perceptual imbalance with respect to the uniform foveated rendering, i.e.,
?NF = argmax SNF (x) ? 4. (4.6)
x
Results and Limitations of the Pilot Study From the pilot study, we find
that: for most users, the dominant eye significantly dominates the visual perception
and therefore eye-dominance-guided foveated rendering is likely to achieve significant
speedup. From the one-way ANOVA test, we did not find a significant effect of
the choice of scenes on the feedback (with p = 0.8708 > 0.01) for the slider test.
Therefore, ?UF and ?NF do not correlate with the choice of scenes in a statistically
significant manner. However, the pilot study yields a gap between the results of the
slider test and the random test as shown in Figure 4.3.
We next present the potential reasons for this gap and our strategies for
mitigating them:
1. Performing a single trial for each test per scene is likely to induce some
inaccuracy in parameter estimation. To mitigate this for the main study, we
99
?UF and ?NF for the slider test and the random test
3.6
3.4
3.2
3
2.8
2.6
2.4
2.2
2
1.8
1.6
1.4
1.2
1
User 01 User 02 User 03 User 04 User 05 User 06 User 07 User 08 User 09 User 10 User 11 User 12 User 13 User 14 User 15 User 16
slider ?UF random  ?UF slider ?NF random ?NF
 
Figure 4.3: The average value of ?UF and ?NF in the slider and the random tests.
The pilot study yields a gap between the results of the slider and the random tests.
Comparison between ?UF and ?NF in the slider test
3.4
3.2
3.0
2.8
2.6
2.4
2.2
2.0
1.8
1.6
1.4
1.2
1.0
User 01 User 02 User 03 User 04 User 05 User 06 User 07 User 08 User 09 User 10 User 11 User 12 User 13 User 14 User 15 User 16
User Index
?NF ?UF
Figure 4.4: The result of the slider test of the pilot user study. We observe that ?NF
often reach our upper bound (3.0).
100
Foveation parameter ?
foveation parameter ?
Comparison between ?UF and ?NF in the random test
3.4
3.2
3.0
2.8
2.6
2.4
2.2
2.0
1.8
1.6
1.4
1.2
1.0
User 01 User 02 User 03 User 04 User 05 User 06 User 07 User 08 User 09 User 10 User 11 User 12 User 13 User 14 User 15 User 16
User Index
?NF ?UF
Figure 4.5: The result of the slider test of the random user study. We observe that
?NF often reach our upper bound (3.0).
carry out three trials per scene per parameter;
2. In the pilot study, we use the maximum foveation parameter in Equations 4.5
and 4.6. We did this even if lower values of the foveation parameters led to an
unacceptable score below 4. This was leading us to overestimate the foveation
thresholds. In the main study, we use the greatest foveation parameter below
which the user did not report an average score below 4. While this may reduce
the speedups due to overall foveation, it will produce a higher perceptual
quality;
3. We observe that ?NF often reach our upper bound (3.0) ? 42.5% in the slider test
(as shown in Figure 4.4) and 60% in the random test (as shown in Figure 4.5).
We have therefore increased the upper bound of ?NF from 3.0 to 4.0 in the
protocol of the main study;
101
Foveation parameter ?
4. In pilot study, we did not qualitatively evaluate the similarity in perceptual
difference between EFR with the selected parameters and conventional foveated
rendering (KFR) or regular rendering (RR). We therefore decide to add a quality
evaluation in the main study.
Taking the above limitations and their mitigation strategies into account, we redesign
the main study as described below.
4.4.4 Main Study
We conduct a slider test and a random test in the main study. There are three
steps in both tests:
1. the participant estimates the Uniform Foveation Parameter ?UF ;
2. the participant estimates the Non-dominant Eye Foveation Parameter ?NF ;
3. the participant qualitatively evaluates whether the EFR frames with ?d = ?UF ,
?nd = ?NF are perceptually the same with RR or traditional (non-dominant)
KFR.
We use Scene 3, Scene 5, and Scene 6 in Figure 4.2 for the parameter estimation in
Steps 1 and 2 above. We use all the 10 scenes in Figure 4.2 for the quality evaluation.
Participants We recruited 11 participants (4 females) at least 18 years old with
normal or corrected-to-normal vision via campus email lists and flyers. The majority
of participants had some experience with virtual reality. None of the participants
was involved with this project prior to the user study.
102
Slider Test The slider test allows the participant to navigate the foveation quality
space by themselves.
1. Estimation of ?UF : We conduct the test on three scenes with three trials per
scene. Therefore, there are 9 tests in total. For the n-th trial of scene m, we first
present the participant with the full-resolution rendering, as a reference. Next, we
present the participant with the same foveated rendering for both eyes and allow the
participant to adjust the level of foveation by themselves: starting with the highest
level of foveation, ?d = 3.0, the participants progressively decrease the foveation level
(with a step size of 0.2). The participant can switch between the foveated rendering
result and the reference image back and forth until they can identify the lowest
foveation level ?UF (m,n) that is visually equivalent to the non-foveated reference.
We calculate the mean uniform foveation parameter for scene m:
1 ?
?UF (m) = ?UF (m,n). (4.7)
3
n=1,2,3
We calculate the overall mean uniform foveation parameter:
1 ?
?UF = ?UF (m). (4.8)
3
m=1,2,3
2. Estimation of ?NF : We conduct the test on three scenes with three trials
per scene. Therefore, there are 9 tests in total. For the n-th trial of scene m, we
present the participant the foveated rendering with ?d = ?UF (m) for the dominant
eye, and allow the participant to adjust the level of foveation for the non-dominant
eye: starting with ?nd = ?UF (m), the participants can progressively increase the
foveation level (with a step size of 0.2) until they reach the highest foveation level
103
?NF (m,n) that is perceptually equivalent to the foveated rendering with uniform
foveation parameter ?UF (m).
Change of ?UF and ?NF in the Slider Test
4.5
Step 1: Estimation of ?????? Step 2: Estimation of ??4 ????
3.5
3
2.5
2
1.5
1
Time
Slider ?UF Slider ?NF
 
Figure 4.6: Change of parameter ?d and ?nd in the slider test. In Step 1, estimation
of ?UF , we present the participant with the same foveated rendering for both eyes
and the participant progressively decrease the foveation level until ?d = ?UF (m). In
Step 2, estimation of ?NF , we present the participant the foveated rendering with
?d = ?UF (m) for the dominant eye, and allow the participant to adjust the level of
foveation for the non-dominant eye. The participant can progressively increase the
foveation level until they reach the highest foveation level.
Figure 4.6 shows the change of parameters ?UF and ?NF in Step 1: Estimation
of ?UF and Step 2: Estimation of ?NF in each trial. We calculate the mean uniform
foveation parameter for scene m:
1 ?
?NF (m) = ?NF (m,n). (4.9)
3
n=1,2,3
104
Foveation Parameter ?
We calculate the overall mean uniform foveation parameter:
1 ?
?NF = ?NF (m). (4.10)
3
m=1,2,3
3. Quality evaluation: We conduct the A/B test on 10 scenes with two comparisons
(EFR vs. KFR and EFR vs. RR) per scene, and 1 trial per scene per comparison.
There are 20 trials in total. For scene m, we present the participant with two frames:
(1) EFR with ?d = ?UF , ?nd = ?NF and (2) RR or KFR with ?d = ?nd = ?UF . The
two frames are presented in random order. Then we ask the participants to score the
difference between the two frames they observed with unlimited time to make their
decisions. The score S(m) contains five confidence levels: 5 represents perceptually
identical, 4 represents minimal perceptual difference, 3 represents acceptable perceptual
difference, 2 represents noticeable perceptual difference, and 1 represents significant
perceptual difference.
Random Test The random test allows the participant to score the quality of
foveated rendering with different parameters in a random sequence.
1. Estimation of ?UF : We conduct the test on three scenes with 10 parameters
per scene, each with three trials. Therefore, there are 90 tests in total. For the n-th
trial of scene m, we present the participant with two frames: (1) the full-resolution
rendering, and (2) the foveated rendering with ?d = ?nd = x, where x is selected
from the shuffled parameter array with parameters ranging between 1.2 and 3.0 with
a step size of 0.2. The two frames are presented in a random order. Then, we ask
the participant to score the difference between the two frames they observed with
unlimited time to make their decision. The score SUF (m,n, x) contains five confidence
105
?UF and ?NF for the slider test and the random test
4.50
4.00
3.50
3.00
2.50
2.00
1.50
1.00
User 01 User 02 User 03 User 04 User 05 User 06 User 07 User 08 User 09 User 10 User 11
Slider ?UF Slider ?NF Random ?UF Random ?NF
 
Figure 4.7: The average value of ?UF and ?NF in the slider test and the random
test. A paired T-test reveals no significant difference (p = 0.8995 > 0.01) between
the result of the slider test and the result of the random test.
levels: 5 represents perceptually identical, 4 represents minimal perceptual difference, 3
represents acceptable perceptual difference, 2 represents noticeable perceptual difference
and 1 represents significant perceptual difference.
When the process is finished, we calculate the average score of all the trials for
scene m with foveation parameter x:
1 ?
SUF (m,x) = SUF (m,n, x). (4.11)
3
n=1,2,3
We choose the minimum x which results in an evaluation of perceptually identical
or minimal perceptual difference with respect to the full-resolution (non-foveated)
rendering as ?UF (m), i.e.,
?UF (m) = argmin SUF (m,x) ? 4. (4.12)
x
106
We calculate ?UF using Equation 4.8.
2. Estimation of ?NF : We conduct the test on three scenes with Q parameters for
scene m and three trials per scene per parameter. We compute Q using Equation 4.13
with ?max = 4.0 in the main study.
?max ? ?UF (m)
Q = + 1 (4.13)
0.2
For the n-th trial of scene m, we present the participant with two frames: (1) foveated
rendering with ?d = ?nd = ?UF (m), and (2) foveated rendering with ?d = ?UF (m),
?nd = x, where x is selected from the shuffled parameter array with Q parameters
ranging between ?UF (m) and ?max = 4.0 with a step size of 0.2. The two frames are
presented in a random order. Then we ask the participants to score the difference
between the two frames they observed with unlimited time to make their decisions.
The score SNF (m,n, x) contains five confidence levels: 5 represents perceptually
identical, 4 represents minimal perceptual imbalance, 3 represents acceptable perceptual
imbalance, 2 represents noticeable perceptual imbalance and 1 represents significant
perceptual imbalance.
When the process is finished, we calculate the average score of all the trials for
scene m with foveation parameter x:
1 ?
SNF (m,x) = SNF (m,n, x). (4.14)
3
n=1,2,3
We choose the minimum x which results in an evaluation of perceptually identical
or minimal perceptual imbalance with respect to the full-resolution (non-foveated)
107
rendering as ?NF (m), i.e.,
?NF (m) = argmin SNF (m,x) ? 4. (4.15)
x
We calculate ?NF using Equation 4.10.
3. Quality evaluation: The quality evaluation is the same as that of the slider
test.
Average score of ?UF and ?NF in the random test
5
4.5
4
3.5
3
2.5
2
1.5
1
1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4
Foveation Parameter ?
Random ?UF Random ?NF
 
Figure 4.8: The average score in Step 1 ( Estimation of ?UF ) and Step 2 (Estimation
of ?NF ) over different scenes and different users in the random test. To achieve
perceptually identical and minimal perceptual difference between regular rendering
and foveated rendering, we therefore choose ?UF = 2.0 and ?NF = 3.0 as our desired
parameters.
108
Score
4.4.5 Validity Test
Eye Tracking Data Analysis We collected eye-tracking data from the FOVE
HMD and would like to use it as a high-level validation to ensure that the participants
are focusing at the desired fovea location.
However, we have noticed obvious tracking errors during the process: sometimes
the eye-tracker fails to capture the movement of gaze and sometimes the tracked
gaze position changes when the user blinks while focusing at the center of the screen.
We also need to ensure that the users are paying attention to the user study instead
of randomly choosing the answers. Therefore, it may not be ideal to solely depend on
eye tracking results for judging the participants? focus. We also use the participant?s
performance with respect to the ground truth data to determine the accuracy and
participant focus. We discuss this next.
Controlling for Lack of Attention and Exhaustion We randomly inserted
30% of the trials to be validation trials in the random test to ensure the validity of
the data in the pilot study and the main study. For uniform foveation parameter
estimation, we presented the participants with identical full-resolution rendering
results for both comparison frames as validation trials; for non-dominant eye foveation
parameter estimation, we presented the participants with identical rendering results
with ?d = ?nd = ?UF for both comparison frames as validation trials. If the
participant declared these validation trials to have a low score for similarity (3 or
lower), we would ask the participant to pause and take a break for at least 30 seconds,
109
and then continue the user study. Meanwhile, we would record this choice as an error.
If error ? 5 in the random test, we would terminate the user study and discard the
data of this participant. Based on this protocol, we discard one participant from
the pilot study and mark the remaining 16 participants as valid data. All the 11
participants in the main study passed the validation trials.
Comparison Score = 1 Score = 2 Score = 3 Score = 4 Score = 5
Slider: EFR vs. RR 0.00% 2.73% 8.18% 17.27% 71.82%
Slider: EFR vs. KFR 0.00% 4.55% 10.91% 30.00% 54.55%
Random: EFR vs. RR 0.00% 0.00% 0.91% 14.55% 84.55%
Random: EFR vs. KFR 0.00% 0.91% 3.64% 25.45% 70.00%
Table 4.1: The score frequency for different comparisons in the slider test and the
random test. We notice that P (score ? 4) ? 85% for both comparisons in the slider
test and that P (score ? 4) ? 95% for both comparisons in the random test. The
result indicates the generalizability of eye-dominance-guided foveated rendering.
4.5 Results and Acceleration
In our main user study, the number of errors in the attention and exhaustion
checking is less than 5 over all the participants. We use the results from all the 11
participants for data analysis.
110
4.5.1 Parameters Estimated with Different Scenes
We conducted a one-way ANOVA test [98, 99] of the null hypothesis that the
choice of scenes has no effect on the feedback of the participants. With the slider
test, we did not find a significant effect of the choice of scenes on the feedback (with
p = 0.9782 > 0.05).
4.5.2 Results of ?UF and ?NF
For user i, we consider the averages of ?UF (Equation 4.8) and ?NF (Equa-
tion 4.10) over different scenes as the per-user foveation parameter for the dominant
eye ?UF,i and non-dominant eye ?NF,i . We present these results in Figure 4.7.
We first verified if there is a significant difference of the measured parameters
(?UF and ?NF ) between the slider test and the random test. With a paired T-test,
we did not find a significant effect between the slider test and the random test (with
p = 0.8995 > 0.05).
The paired T-test shows that the EFR parameters are stable with different
experimental setups. We therefore take the average of slider test and the random
test as the final parameters to test the rendering acceleration.
We further conducted statistical analysis on the difference between ?UF and
?NF . With a paired T-test, we found a significant effect that the foveation parameter
?UF required for the non-dominant eye is higher than the foveation parameter ?NF for
the dominant eye (with p = 7.0530?10?10 < 0.05). Hence, we reach a conclusion that
the disparity between the visual acuity in the dominant eye and the non-dominant
111
eye is significantly different for the users.
For the random test, we also present the average score in Step 1 (estimation
of ?UF ) and Step 2 (estimation of ?NF ) over different scenes and different users as
shown in Figure 4.8. We notice that both SUF and SNF decrease with the increase
of the foveation parameter. To achieve perceptually identical and minimal perceptual
difference between regular rendering and foveated rendering for most users, we may
choose ?UF = 2.0 and ?NF = 3.0 as the desired parameters.
4.5.3 Quality Evaluation
We analyzed whether there exists a significant difference of the quality evalua-
tion results between the slider test and the random test. With the paired T-test, we
did not find a significant effect between the slider test and the random test (with
p = 0.8629 > 0.05).
We further verified if there exists a significant difference of the quality evaluation
results between the experiment of EFR vs. KFR and the experiment of EFR vs RR.
With a paired T-test, we found no significant difference between the result of the
two experiments (with p = 0.9410 > 0.05).
The frequency from score = 1 to score = 5 is shown in Table 4.1. We
notice that P (score ? 4) ? 85% for both comparisons in the slider test and that
P (score ? 4) ? 95% for both comparisons in the random test. The result indicates
the generalizability of eye-dominance-guided rendering. We can get acceptable
perceptual quality that on different scenes with the measured parameters from the
112
user study.
4.5.4 Rendering Acceleration
Speedup of eye-dominance-guided foveated rendering
3
2.5
2
1.5
1
0.5
0
User 01 User 02 User 03 User 04 User 05 User 06 User 07 User 08 User 09 User 10 User 11
RR (fps) 21 21 21 21 21 21 21 21 21 21 21
KFR (fps) 37 47 47 47 51 36 36 35 37 36 36
EFR (fps) 52 53 53 50 57 53 53 47 48 46 48
Speedup (KFR vs. RR) 1.76 2.24 2.24 2.24 2.43 1.71 1.71 1.67 1.76 1.71 1.71
Speedup (EFR vs. RR) 2.48 2.52 2.52 2.38 2.71 2.52 2.52 2.24 2.29 2.19 2.29
Speedup (EFR vs. KFR) 1.41 1.13 1.13 1.06 1.12 1.47 1.47 1.34 1.30 1.28 1.33
Speedup (KFR vs. RR) Speedup (EFR vs. RR) Speedup (EFR vs. KFR)
 
Figure 4.9: The measured frame-rates (in fps) and the speedups. The speedups
of the eye-dominance-guided foveated rendering (EFR) compared with the original
kernel foveated rendering (KFR) ranges between 1.06? and 1.47? with an average
speedup of 1.35?. The speedups of EFR compared with regular rendering (RR)
ranges between 2.19? and 2.71? with an average speedup of 2.38?.
We have implemented the eye-dominance-guided foveated rendering pipeline in
C++ 11 and OpenGL 4 on NVIDIA GTX 1080 to measure the rendering acceleration.
We report our speedups based on the sophisticated Amazon Lumberyard Bistro
dataset [19] at the resolution of 1280? 1440 per eye. The frame-rates and speedups
of the original kernel foveated rendering (KFR) and the eye-dominance-guided
foveated rendering (EFR) compared with traditional regular rendering (RR) are
113
Speedup
shown in Figure 4.9. The speedup of the eye-dominance-guided foveated rendering
compared with the original kernel foveated rendering ranges between 1.06? and
1.47? with an average speedup of 1.35?.
114
Chapter 5: Hand Mesh Reconstruction from RGB Images
5.1 Overview
Hands play an important role in our daily life. An approach that could
detect the shape and gesture of the human hand from RGB images would enable
new applications in virtual and augmented reality [100,101] and human-computer
interaction [102?105]. However, the current state-of-the-art approaches do not provide
a good solution because of the depth and scale ambiguities. In recent years, deep
learning is playing an important role in visual interactions [106] by providing success
in hand pose reconstruction from a depth image [107,108], hand pose reconstruction
from a RGB image [109, 110], as well as hand mesh reconstruction from a RGB
image [20, 21,111]. Here, I focus on the hand reconstruction with only RGB images
as input. This is significantly more challenging than hand reconstruction when the
depth data is also available as an auxiliary feature.
Our contributions include:
? building a new dataset of the hand by fitting a hand model to 74, 715 3D joint
annotations in the Panoptic Studio dataset to solve the problem of sparse
training annotation;
115
? designing an end-to-end neural network, which accepts a RGB image and the
auxiliary features as inputs and predicts a 3D hand mesh of the right hand
that could be projected onto the RGB image and match the 2D hand in shape
and pose;
? evaluating our research in terms of 3D pose estimation on various public
datasets.
5.2 Related Work
5.2.1 Hand Model
Taylor et al. [112] presents a method for acquiring dense shape and deformation
from a single monocular depth sensor. Khamis et al. [113] propose the first learning-
based subject-specific hand shape variation from scans with linear blend skinning.
The MANO (hand Model with Articulated and Non-rigid defOrmations) model [114]
is a hand deformation model based on the SMPL [115] model for human bodies.
MANO models both hand shape and pose, thus generating realistic posed meshes.
5.2.2 Hand Skeleton Reconstruction from Multiview
The multi-view image processing techniques could be utilized to refine the
hand reconstruction models. Campos and Murray [116] use the relevance vector
machine-based learning for hand pose recovery. Sridhar et al. [117] propose a real-
time markerless hand tracking approach which employs an implicit hand shape
116
representation based on sum of anisotropic Gaussians and minimizes the sum of the
pose fitting energy. Simon et al. [109] propose a multi-view hand pose prediction
system which uses a pretrained weak predictor to estimate the hand pose and
retrain an improved detector by annotating the failed detection with the re-projected
successful detection.
5.2.3 Hand Mesh Reconstruction from Singleview
Boukhayma et al. [21] predict both 3D hand shape and pose from RGB
images with heatmaps in the wild. Their pipeline consists of the concatenation
of a deep convolutional encoder (ResNet-34), and a MANO decoder. Zhang et
al. [118] concatenate the autoencoder with an iterative regression block and refine the
estimated parameter iteratively. Ge et al. [111] and Baek et al. [119] use a stacked
hourglass network to predict heatmaps as the auxiliary feature and predict the hand
mesh with a convolutional neural network. Kulon et al. [20] learn the prior on 3D
hand shapes by training an autoencoder with intrinsic graph convolutions performed
in the spectral domain. Kulon et al. [120] solve the problem of sparse supervision
by gathering a large-scale dataset of hand action in YouTube videos and use it as a
source of weak supervision. They propose an autoencoder system with ResNet-50 as
the encoder and a spatial mesh convolutional decoder.
117
Eular-Rodrigues Scaling & translation Pose and shape 
rotation parameter parameter parameter 
??? ?, ? weights ?
WeightedComposition()
?$%!"# exp ? ?!
Rodrigues() ?, ? = ?$%!"# exp ?
Pose Shape 
parameter parameter
? ?
Rotation 
matrix ?
MANO 
Decoder
Predicted hand in 
the object Coord
Rotation() ScalingAndTranslation()
? = ?? ? = ?? + ?
Predicted hand in 
the world Coord
Proj Proj Proj
?
Predicted hand in Predicted hand in Predicted hand in 
the Camera-1 Coord the Camera-2 Coord the Camera-31 Coord
Fitting Loss()
Fitting Loss()
3D Joint Loss
backpropagation
Bone Length Loss
Regularization
Figure 5.1: The pipeline of the ground truth mesh generation.
118
5.3 Estimation of the Hand Mesh
5.3.1 Datasets
We use the CMU Panoptic Dataset [121] for hand mesh estimation. For each
scene, there are 31 videos captured with HD synchronized cameras. The dataset also
contains camera information, visibility of the hands from the target cameras, and
the 3D joint annotations in the world space.
5.3.2 Pipeline
The pipeline of the estimation of the hand mesh is shown in Figure 5.1. For
each hand image in the Panoptic Dataset with 3D joint annotation JI , we predict a
3D hand mesh generated with the MANO model by minimizing the error between
the 3D joint annotation JI in the image coordinates and the predicted 3D joints
in the image coordinates J?I by predicting the parameters for rotation rot, shape ?,
pose ?, scaling s, and translation t. By accepting ? and ? as inputs, the MANO
decoder M(rot, ?, ?) decodes a hand mesh H?MANO in the local object coordinates
(including the hand joint information J?MANO and the hand vertex information V?MANO
in the local object coordinates), H?MANO is projected to the world coordinates H?W
(including hand joints J?W and hand vertices V?W in the local object coordinates)
with the guidance of r, s and T . And H?W is projected to the image coordinates H?I
(including hand joints J?I and hand vertices V?I in the local object coordinates) to
match the hand in the RGB image with ground truth camera extrinsic parameters
119
R and T and camera intrinsic parameter K.
{rot?, ??, ??, s?, t?}
= argmin L(JI , J?I)
rot,?,?,s,t
(5.1)
= argmin L(JI ,P(J?MANO, s?, t?, K,R, T ))
rot,?,?,s,t
= argmin L(JI ,P(M(rot, ??, ??), s?, t?, K,R, T )),
rot,?,?,s,t
Instead of modelling the joint angles as free variables, which can lead to physi-
cally implausible hand, we constrain the pose parameters and the shape parameters
to lie in the convex hull of the pre-computed cluster centers [120]. We obtained the
cluster centers by applying k-means on the FreiHAND [122] dataset and cluster the
pose parameters and the shape parameters into 32 clusters. In the fitting process, we
predict the angle weights wClusterID with ClusterID ? [1, 32] and calculate the pose
and shape parameters as shown in Equation 5.2.
?32
Clu?sterID=1 exp(wClusterID)PClusterID[?, ?] = P (w) = 32 (5.2)
ClusterID=1 exp(wClusterID)
where F (XMANO, s, t,K,R, T ) finishes the scaling and translation of XMANO,
which could be the predicted joints J?MANO or the predicted vertices V?MANO:
P(XMANO, s, t,K,R, T ) =K[R|T ] ?XW
(5.3)
=K[R|T ] ? (sXMANO + t)
120
5.3.3 Training Objective
We combine multiple losses to predict the parameters: a 3D joint loss in the
camera coordinates Ljoint, a bone length loss Lbone, and a regularization term Lreg.
L = ?jointLjoint + ?boneLbone + ?regLreg (5.4)
3D joint loss Ljoint: We project the predicted hand joints from the world coor-
dinates J?W to the i-th camera coordinates J?C,i, calculate the joint error between
J?C,i and the ground truth hand joints in the camera coordinates JC,i by taking the
weighted sum of the joint error of the hand root (ROOT), metacarpophalangeal
(MCP), proximal interphalangeal (PIP), distal interphalangeal (DIP) and the finger
tips (TIP).
Finally we calculate the sum of the joint error of all the 31 camera coordinates
as shown in Equation 5.5. ?31
Ljoint = L(JC,k ? J?C,k)
?k=131
= (? 2MCP||JMCP ? J?MCP||
k=1
? 2PIP||JPIP ? J?PIP|| + (5.5)
? 2DIP||JDIP ? J?DIP|| +
? 2TIP||JTIP ? J?TIP|| +
?ROOT||J 2ROOT ? J?ROOT|| )
Bone length loss Lbone: The 21 hand joints could be interpreted as 20 groups
of bone edges (BE). The loss of bone length ensures that the lengths of the bones
121
coincide with the ground truth hand bone lengths.
?31 ?
Lbone = |JC,k,i ? J?C,k,j| (5.6)
k=1 (i,j)?BE
Regularization Lreg: The regularization term ensures that the predicted hand is
physically plausible.
L 2 2reg = ??||?|| + ??||?|| (5.7)
The regularization parameters ? ?1 3? = 10 and ?? = 10 are chosen experimentally.
5.3.4 Optimization
We use the Adam optimizer [123] with different learning rates for the rotation
rot, scaling s, translation t, and the MANO parameter weights W#Cluster?(|?|+|?|) with
#Cluster = 32. The optimizer is used with small learning rate decay (multiplicative
factor of 0.95) when loss ? 150. We fit 2048 groups of frames per batch on GeForce
RTX 2080 which takes on average 40 min.
5.3.5 Evaluation Metric and Result
We use F-score [124] to evaluate the quality of the generated hand mesh and
we accept F@10mm = 1 as acceptable.
We used 11 scenes from the Panoptic Dataset containing 241, 008 valid frames
with physically plausible hands and 3D annotation in 31 cameras for the fitting and
got 48, 955 mesh with F@10mm = 1 in the multiview fitting and 47, 954 mesh with
122
GT Fitting GT Fitting
Figure 5.2: The qualitative results of hand mesh estimation from joints.
123
F@10mm = 1 in the temporal-multiview fitting. The qualitative results are shown
in Figure 5.2.
5.4 Hand Reconstruction from RGB Images
5.4.1 Overview
We propose to reconstruct a 3D hand mesh from a single RGB image as
illustrated in Figure 5.3. The RGB image of the hand I is passed to a pre-trained
multi-stage convolutional neural network [125] to predict the heatmaps H for the 21
hand joints. The RGB image I and the heatmaps H are stacked and encoded to
camera embedding and mesh embedding in the Resnet-34 [126] encoder. The mesh
embedding is decoded into a hand mesh H? in the camera coordinates, and projected
to the image coordinates using a weak perspective camera created with the camera
embedding.
5.4.2 Encoder
The encoder takes an RGB image I and the heatmaps H as the input and uses
the Resnet-34 [126] network pretrained on the ImageNet [127] classification task.
The output of the encoder is a vector rot, s, T, ?, ? ? R61. The mesh embedding
containing the pose parameter ? and the shape parameter ? are passed to the decoder.
And the camera embedding including the Rodrigues rot, scaling s, and translation
T is fed into a weak perspective camera system, which projects the hand mesh from
the camera coordinates to the image coordinates.
124
Multi-Stage
Convolutional
Network
RGB
heatmaps
Encoder
Camera Embedding & Mesh Embedding
Rotation, Scaling Pose and shape 
& translation embedding
?%??, ???, ?! ?!, ?!
MANO 
Decoder
Transformer
Predicted hand in 
Weak the camera Coord
Perspective 
Camera
Loss ()
Weighted Joint Loss
Embedding Loss
Predicted hand in 
the world Coord
Ground Truth
Hand Joints
Fitted Camera Embedding & Mesh Embedding
Rotation, Scaling Pose and shape 
& translation embedding
???, ?, ? ?, ?
Figure 5.3: The pipeline of the hand Reconstruction from RGB Images.
125
5.4.3 Decoder
The differentiable MANO decoder is used to recover 3D hand meshes from
the mesh embedding. The output is a MANO hand deformation in the camera
coordinates.
5.4.4 Training Objective
We train the network with the estimated dense supervision and we combine
multiple losses:
2D joint loss Ljoint: We calculate the 2D joint loss in the image coordinates,
which is the joint error between the predicted 2D joints and the ground truth 2D
joints:
LJ = ||J ?2D ? J2D||2 (5.8)
Embedding loss Lembedding: The embedding loss Lembedding enforces the consis-
tency between the predicted embedding {r?ot, s?, T? , ??, ??} and the ground truth embed-
ding {rot, s, T, ?, ?} estimated from the fitting process as described in Chapter 5.3:
L 2 2 ? 2embedding = ?s||s? s?||2 + ?T ||T ? T? ||2 + ?rot||rot? rot||2
(5.9)
+ ??||? ? ??||22 + ??||? ? ??||22,
where ?s = 10
?2, ?T = 10
?4, ? 2rot = 10 , ?? = ?? = 1.
126
5.5 Results and Comparisons
5.5.1 Experimental Setup
Implementation: Our network is implemented in PyTorch [128]. Before training,
the weights of the encoder are initialized with the weights of an image classification
model pre-trained on the ImageNet [127] dataset. The network is trained end-to-end
using the Adam optimizer [123] with mini-batches of size 32. The learning rate is
set as 10?5 and we keep the default values for other parameters.
Training and Testing: For the Panoptic dataset with fitted ground truth, we hold
out 6284 randomly-selected samples as the test data, 6393 samples as the validation
data, and use the remaining 62038 samples as the training data. We use RGB image
crops of human hands with a resolution of 256? 256 as the input. The ground truth
hand joints are used to find the tightest bounding box B0(w0, h0), and the images are
cropped with a squared patch of size B1(2.2max(w0, h0), 2.2max(w0, h0)) centered
at the same 2D position as B0. The new patches resized to 256? 256 are used as
the input RGB images. We report the mesh estimation result using the autoencoder
trained for 450 epochs.
5.5.2 Quantitative Evaluation of 3D Hand Mesh Estimation
We report the performance of the 3D hand mesh estimation with two metrics:
? 2D/3D PCK: the percentage of correct keypoints (PCK) of which the Euclidean
127
Figure 5.4: Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Kulon et al. [20]
(the fourth column), and Boukhayma et al. [21] (the fifth column).
128
Figure 5.5: Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Kulon et al. [20]
(the fourth column), and Boukhayma et al. [21] (the fifth column).
129
Figure 5.6: Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column), Kulon et al. [20]
(the fourth column), and Boukhayma et al. [21] (the fifth column).
130
error distance is below a specific threshold;
? AUC: the area under the curve (AUC) on the 2D/3D PCK for different error
thresholds.
We compare our approach with the state-of-the-art 3D hand mesh estimation methods
on the Panoptic Dataset. Specifically, we report the results from the approaches of
Boukhayma et al. [21] and Kulon et al. [20]. We use the implementations provided by
the authors and fine-tune the model of Boukhayma et al. [21] on the Panoptic Dataset.
The approach of Kulon et al. [20] has already been trained with the Panoptic Dataset
so we use the model provided by the authors directly for evaluation.
The 2D PCK and 3D PCK curves over different error thresholds are presented
in Figure 5.7. Because we are using weak perspective camera for projection, we
evaluate the 3D PCK after depth alignment in the camera coordinates. Our method
outperforms the two state-of-the-art methods over all the evaluation metrics.
5.5.3 Qualitative Evaluation of 3D Hand Mesh Estimation
The qualitative results of 3D hand mesh reconstruction appear in Figures 5.4
to 5.6.
5.5.4 Ablation Study
Ablation Study of Loss terms: We evaluate the impact of the embedding loss
Lembedding and the joint loss Ljoint in the fully supervised training. As shown in
Figure 5.9, the model trained with only the embedding loss might make a large
131
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
Ours. (AUC=0.782)
0.1 Boukhayma et al. (AUC=0.713)
Kulon et al. (AUC=0.235)
0
0 10 20 30 40 50
Error Thresholds (px)
Figure 5.7: 2D PCK our hand reconstruction approach, Kulon et al. [20], and
Boukhayma et al. [21]. Our method outperforms the other methods in AUC.
132
2D PCK
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
Ours. (AUC=0.693)
0.1 Boukhayma et al. (AUC=0.566)
Kulon et al. (AUC=0.130)
0
0 10 20 30 40 50
Error Thresholds (px)
Figure 5.8: 3D PCK our hand reconstruction approach, Kulon et al. [20], and
Boukhayma et al. [21]. Our method outperforms the other methods in AUC.
133
3D PCK
prediction error by wrongly predicting a parameter such as rotation or translation.
The model trained with only the joint loss might generate a twisted (physically im-
plausible) hand to minimize the joint error. The PCK curve presented in Figure 5.10
also shows that the model trained with the sum of joint loss and parameter loss as
the objective achieves the best performance.
Figure 5.9: Qualitative evaluation results of the fitted hand mesh (the second column),
our hand reconstruction approach (the third column) with the sum of joint loss
and parameter loss as the objective, our hand reconstruction approach (the fourth
column) with the parameter loss as the objective, and our hand reconstruction
approach (the fifth column) with the joint loss as the objective.
134
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
Joint Loss + Param Loss (AUC=0.680)
0.1 Param Loss (AUC=0.484)
Complex Loss (AUC=0.594)
0
0 10 20 30 40 50
Error Thresholds (px)
Figure 5.10: 3D PCK of our hand reconstruction approach with the sum of joint
loss and parameter loss as the objective, our hand reconstruction approach with the
parameter loss as the objective, and our hand reconstruction approach with the joint
loss as the objective.
Ablation Study of Input Type: We evaluate the impact of the RGB image
and the heatmaps in the fully supervised training. We train the autoencoder for
450 epochs with RGB image and heatmaps, with only RGB images, and with only
heatmaps. As shown in Figure 5.11, the model trained with only the RGB image
and the model trained with only the heatmaps predicts wrong pose or hand rotation.
The PCK curve presented in Figure 5.12 shows that the model trained with the
RGB images and the heatmaps achieves the best performance in the hand mesh
135
3D PCK
estimation, which indicates that both inputs contribute to the improvement.
136
Figure 5.11: Qualitative evaluation results of the fitted hand mesh (the second
column), our hand reconstruction approach (the third column) with the RGB image
and heatmaps as inputs, our hand reconstruction approach (the fourth column) with
the RGB image as input, and our hand reconstruction approach (the fifth column)
137
with heatmaps as input.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
RGB + Heatmaps (AUC=0.680)
0.1 RGB (AUC=0.584)
Heatmaps (AUC=0.610)
0
0 10 20 30 40 50
Error Thresholds (px)
Figure 5.12: 3D PCK of our hand reconstruction approach with the RGB image
and heatmaps as inputs, our hand reconstruction approach with the RGB image as
input, and our hand reconstruction approach with heatmaps as input.
138
3D PCK
Chapter 6: Conclusion and Future Work
In this dissertation, I have presented my research to enhance visual and and
gestural fidelity for effective visual environments.
In Kernel Foveated Rendering [30], I have presented the kernel log-polar
mapping model and conducted user studies for finding the best parameters, as well
as a GPU-based implementation and quantitative evaluation of the kernel foveated
rendering pipeline. With high frame rates, the KFR pipeline allows rendering
more complex shaders (e.g., real-time global illumination and physically-based
rendering [129]) in real time, thus bringing higher power efficiency and better user
experience for 3D games and other interactive visual computing applications.
Even though I have devised an efficient and effective foveated rendering pipeline,
my system is not without some limitations.
Foveation Parameters As discussed in Hsu et al. [130], the perceived quality of
foveated rendering systems is highly dependent on the user and the scene. As the
initial step towards kernel foveated rendering for 3D graphics, the user study in this
project is only designed for selected static scenes. The foveation parameters may
vary in dynamic scenes. Exploring the relationship between user demographic (e.g.,
pupil size, contrast sensitivity, vision condition, and diopter), perception time [26]
139
fovea fovea
(a) original scene ???? (b) foveated scene ????
fovea fovea
(c) original scene ?????? (d) foveated scene ??????
Figure 6.1: Temporal flickering issue. The original scene and the foveated scene of
two consecutive frames (FI and FII). In FI , the specular reflection in the original
scene as shown in the red and blue circles in the zoomed-in view of (a) are amplified
in the foveated scene as shown in the zoomed-in view of (b). In the next frame
FII , the specular reflection in the original scene as shown in the pink circle in the
zoomed-in view of (c) is amplified in the foveated scene as shown in the zoomed-in
view of (d).
140
and display-dependent parameters of KFR is a potential future direction.
Temporal Flickering In the post-processing stage, I have applied TAA to tackle
the temporal flickering problem. However, in fly-through of the scene with glossy
objects, I notice that view-dependent specular reflection changes before and after
applying KFR. As shown in Figure 6.1, foveation amplifies the specular reflection
regions, and makes the specular highlights flicker more.
Other Mapping Algorithms and Kernel Functions KFR makes intuitive
sense as the log-polar mapping has an initial resolution proportional to e?r, and the
kernel functions can fine tune this mapping. My choice of kernel functions is not
unique; other mapping algorithms with different kernel functions could provide a
better mapping to the human vision system. As shown in Figure 6.2, Koskela et
al. [131] implement a path traced frame in Visual-Polar space. The visual polar space
performs better than log-polar space in reducing distracting artifacts.
In 3D-kernel Foveated Rendering for Light Fields [23], I have presented a novel
approach to accelerate the interactive visualization of high-resolution light fields.
We conduct user studies to determine the optimal foveation parameters to validate
the 3D-KFR pipeline in practice. According to the quantitative experiments, our
methods accelerate the rendering process of large-scale, high-resolution light fields
by a factor of up to 7.28? at the resolution of 25? 25? 1024? 1024.
Our algorithm is effective in rendering high resolution light fields using foveation
for virtual reality HMD with low latency, low power consumption and minimal
141
Figure 6.2: Illustration of a path traced frame in Visual-Polar space, the denoised
result transformed into Cartesian screen space, and the distribution of the path
tracing samples in screen space. Path tracing and denoising in Visual-Polar space
makes both 2.5? faster.
perceptual differences. With the increase of VR headset resolution and the growth
of the VR market, we envision that 3D-KFR may inspire further research in the
foveated rendering of high-resolution light fields.
There are several possibilities to further improve our algorithm.
Foveation Parameters Our choice of the relationship between ?0, ?1, and ?2
is not unique. Other sigma arrays may provide a higher speedup. However, the
trade-off between rendering quality and foveation parameter ? always exists. It is
desirable to further explore the relationship between rendering quality and ?.
In Eye-dominance-guided Foveated Rendering [31], I have presented the EFR
pipeline, which achieves a significant speed-up by rendering the scene in the dominant
eye with a lower foveation level (higher detail) and rendering the scene in the non-
142
dominant eye with a higher foveation level (lower detail). This technique takes
advantage of the ocular dominance property of the human visual system, and
leverages the difference in acuity and sensitivity between the dominant eye and
the non-dominant eye. Our approach can be easily integrated into the current
rasterization rendering pipeline for head-mounted displays. We envision that EFR
would be also beneficial to data streaming for networked VR/AR applications such
as Montage4D [132], Geollery [133,134], AR surgery [135], and memory palaces [136]
by reducing the bandwidth requirements.
Temporal Artifacts One of the grand challenges in foveated rendering is handling
artifacts due to temporal aliasing of moving objects [137], phase-aligned aliasing [88],
and saliency-map based aliasing [138]. Since the eye-dominance-guided foveated
rendering relies on different levels of foveation for the two eyes, such challenges are
likely to be even greater. We plan to study and address these challenges in future.
Personalized VR Rendering Ocular dominance studies [139] indicate that 70%
of the population is right-eye dominant and 29% is left-eye dominant. Thus, we expect
that most users stand to benefit from eye-dominance-guided foveated rendering. In
terms of personalized VR rendering, prior art has investigated how to personalize
spatial audio for virtual environments using head-related transfer functions based on
the ears? shape [140]. Further research may investigate how to enhance the visual
experience of a user based on the eye prescription.
143
Further Leveraging Human Perception An important argument in the study
of visual direction is that there is a center or origin for judgments of visual direction
called cyclopean eye [141]. Elbaum et al. [142] demonstrates that tracking accuracy
is better with the cyclopean eye than with the dominant and non-dominant eye. Xia
and Peli [143] propose a perceptual space model for virtual reality content based on
the gaze point of the cyclopean eye. How the human visual system integrates the
input from the two eyes into a cyclopean vision and how virtual reality in general,
and foveated rendering in particular, could leverage it to improve visual quality and
efficiency is deeply intriguing. We plan to delve into exploring how the foveated
rendering system could be integrated with the cyclopean eye to further improve the
immersive viewing experience and enhance the interaction accuracy between HMD
and users.
In Hand Reconstruction from RGB Images, I have presented an end-to-end
convolutional neural network that predicts 3D hand shape and pose from a single
RGB image. The proposed research solves the problem of sparse training annotation
by fitting a 3D deformation model to 74, 715 3D possible joint conformations and
improves the quality of estimation. We envision that the proposed approach could
be used in multiple application scenarios in the field of human-computer interaction
and virtual and augmented reality.There are improvements that we can make to
enhance the user?s immersive experience.
Texture Reconstruction The first research direction is to predict the hand
texture for the estimated hand mesh. We plan to use a captured hand texture as the
144
template and use another neural network to predict the surface texture parameters
such as the hand color and roughness. We could use the texture template and
the surface texture parameters to synthesize a texture that will be attached to the
estimated hand for visualization.
Hand Interaction The second challenge is to tackle potential problems like the
interaction between the two hands. Mueller et al. [144] focuses on the hand interaction
captured from depth cameras. Because depth data is not available in most VR
headsets, it will be desirable to focus on handling inter-hand and intra-hand collisions
with only a sequence of RGB images as input.
Finally, it will be interesting to develop a real-time VR application that accepts
the video from the inside-out camera as the input, predicts the hand mesh from the
video, and visualizes the predicted hand mesh.
145
Bibliography
[1] Christine A Curcio, Kenneth R Sloan, Robert E Kalina, and Anita E Hen-
drickson. Human photoreceptor topography. Journal of comparative neurology,
292(4):497?523, 1990.
[2] Christine A Curcio and Kimberly A Allen. Topography of ganglion cells in
human retina. Journal of comparative Neurology, 300(1):5?25, 1990.
[3] Philip Kortum and Wilson S Geisler. Implementation of a foveated image
coding system for image bandwidth reduction. In Electronic Imaging: Science
& Technology, pages 350?360. International Society for Optics and Photonics,
1996.
[4] P. Lungaro, R. Sjo?berg, A. J. F. Valero, A. Mittal, and K. Tollmar. Gaze-aware
streaming solutions for the next generation of mobile vr experiences. IEEE
Transactions on Visualization and Computer Graphics, 24(4):1535?1544, April
2018.
[5] S. Firdose, P. Lunqaro, and K. Tollmar. Demonstration of Gaze-Aware Video
Streaming Solutions for Mobile VR. In 2018 IEEE Conference on Virtual
Reality and 3D User Interfaces (VR), pages 749?750, March 2018.
[6] Brian Guenter, Mark Finch, Steven Drucker, Desney Tan, and John Snyder.
Foveated 3d graphics. ACM Trans. Graph., 31(6):164:1?164:10, November
2012.
[7] K. Vaidyanathan, M. Salvi, R. Toth, T. Foley, T. Akenine-Mo?ller, J. Nils-
son, J. Munkberg, J. Hasselgren, M. Sugihara, P. Clarberg, T. Janczak, and
A. Lefohn. Coarse pixel shading. In Proceedings of High Performance Graphics,
HPG ?14, pages 9?18, Aire-la-Ville, Switzerland, Switzerland, 2014. Eurograph-
ics Association.
[8] Anjul Patney, Marco Salvi, Joohwan Kim, Anton Kaplanyan, Chris Wyman,
Nir Benty, David Luebke, and Aaron Lefohn. Towards foveated rendering for
gaze-tracked virtual reality. ACM Trans. Graph., 35(6):179:1?179:12, November
2016.
146
[9] Anjul Patney, Joohwan Kim, Marco Salvi, Anton Kaplanyan, Chris Wyman,
Nir Benty, Aaron Lefohn, and David Luebke. Perceptually-based foveated
virtual reality. In ACM SIGGRAPH 2016 Emerging Technologies, SIGGRAPH
?16, pages 17:1?17:2, New York, NY, USA, 2016. ACM.
[10] Petrik Clarberg, Robert Toth, Jon Hasselgren, Jim Nilsson, and Tomas Akenine-
Mo?ller. Amfs: adaptive multi-frequency shading for future graphics processors.
ACM Trans. Graph., 33(4):141:1?141:12, July 2014.
[11] Yong He, Yan Gu, and Kayvon Fatahalian. Extending the graphics pipeline
with adaptive, multi-rate shading. ACM Trans. Graph., 33(4):142:1?142:12,
July 2014.
[12] Nicholas T. Swafford, Jose? A. Iglesias-Guitian, Charalampos Koniaris, Bochang
Moon, Darren Cosker, and Kenny Mitchell. User, metric, and computational
evaluation of foveated rendering methods. In Proceedings of the ACM Sympo-
sium on Applied Perception, SAP ?16, pages 7?14, New York, NY, USA, 2016.
ACM.
[13] Michael Stengel, Steve Grogorick, Martin Eisemann, and Marcus Magnor.
Adaptive image-space sampling for gaze-contingent real-time rendering. Com-
puter Graphics Forum, 35(4):129?139, 2016.
[14] Okan Tarhan Tursun, Elena Arabadzhiyska-Koleva, Marek Wernikowski, Ra-
dos law Mantiuk, Hans-Peter Seidel, Karol Myszkowski, and Piotr Didyk.
Luminance-contrast-aware foveated rendering. ACM Trans. Graph., 38(4):98:1?
98:14, July 2019.
[15] Cheuk Yiu Ip, M. Adil Yalc?in, David Luebke, and Amitabh Varshney. Pixelpie:
maximal poisson-disk sampling with rasterization. In Proceedings of the 5th
High-Performance Graphics Conference, HPG ?13, pages 17?26, New York,
NY, USA, 2013. ACM.
[16] Marc Levoy and Pat Hanrahan. Light Field Rendering. In Proceedings of the
23rd Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ?96, pages 31?42, New York, NY, USA, 1996. ACM.
[17] Qi Sun, Fu-Chung Huang, Joohwan Kim, Li-Yi Wei, David Luebke, and Arie
Kaufman. Perceptually-guided foveation for light field displays. ACM Trans.
Graph., 36(6):192:1?192:13, November 2017.
[18] Morgan McGuire. Computer graphics archive, July 2017.
https://casual-effects.com/data.
[19] Amazon Lumberyard. Amazon lumberyard bistro, open research content archive
(orca), July 2017.
147
[20] Dominik Kulon, Haoyang Wang, Riza Alp Gu?ler, Michael M. Bronstein, and Stefanos
Zafeiriou. Single image 3d hand reconstruction with mesh convolutions. In Proceedings
of the British Machine Vision Conference (BMVC), 2019.
[21] Adnane Boukhayma, Rodrigo de Bem, and Philip H.S. Torr. 3d hand shape and
pose from images in the wild. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019.
[22] G. Denes, K. Maruszczyk, G. Ash, and R. K. Mantiuk. Temporal Resolution
Multiplexing: Exploiting the limitations of spatio-temporal vision for more effi-
cient VR rendering. IEEE Transactions on Visualization and Computer Graphics,
25(5):2072?2082, May 2019.
[23] Xiaoxu Meng, Ruofei Du, Matthias Zwicker, and Amitabh Varshney. Kernel Foveated
Rendering. Proc. ACM Comput. Graph. Interact. Tech, 1(1):5:1?5:20, May 2018.
[24] Marc Levoy and Ross Whitaker. Gaze-directed volume rendering. In Proceedings of
the 1990 Symposium on Interactive 3D Graphics, I3D ?90, pages 217?223, New York,
NY, USA, 1990. ACM.
[25] Hans Strasburger, Ingo Rentschler, and Martin Ju?ttner. Peripheral vision and
pattern recognition: a review. Journal of vision, 11(5):13?13, 2011.
[26] Xuetong Sun and Amitabh Varshney. Investigating Perception Time in the Far
Peripheral Vision for Virtual and Augmented Reality. In ACM Symposium on
Applied Perception (SAP), Perception. ACM, Aug 2018.
[27] E. L. Schwartz. Anatomical and physiological correlates of visual computation
from striate to infero-temporal cortex. IEEE Transactions on Systems, Man, and
Cybernetics, SMC-14(2):257?271, March 1984.
[28] H. Araujo and J. M. Dias. An introduction to the log-polar mapping [image sampling].
In Proceedings II Workshop on Cybernetic Vision, pages 139?144, Dec 1996.
[29] Marco Antonelli, Francisco D. Igual, Francisco Ramos, and V. Javier Traver. Speeding
up the log-polar transform with inexpensive parallel hardware: graphics units and
multi-core architectures. J. Real-Time Image Process., 10(3):533?550, September
2015.
[30] Xiaoxu Meng, Ruofei Du, Matthias Zwicker, and Amitabh Varshney. Kernel foveated
rendering. Proc. ACM Comput. Graph. Interact. Tech., 1(1):5:1?5:20, July 2018.
[31] X. Meng, R. Du, and A. Varshney. Eye-dominance-guided foveated rendering. IEEE
Transactions on Visualization and Computer Graphics, 26(5):1972?1980, 2020.
[32] Shawn Hargreaves and Mark Harris. Deferred shading. In Game Developers Confer-
ence, volume 2, page 31, 2004.
[33] Jerome F Duluk Jr, Richard E Hessel, Vaughn T Arnold, Jack Benkual, Joseph P
Bratt, George Cuan, Stephen L Dodgen, Emerson S Fang, Zhaoyu Gong, Thomas Y
Ho, et al. Deferred shading graphics pipeline processor having advanced features,
April 6 2004. US Patent 6,717,576.
148
[34] Peter J Burt. Smart sensing within a pyramid vision machine. Proceedings of the
IEEE, 76(8):1006?1015, 1988.
[35] Jerome M Shapiro. Embedded image coding using zerotrees of wavelet coefficients.
IEEE Transactions on Signal Processing, 41(12):3445?3462, 1993.
[36] A. Said and W. A. Pearlman. A new, fast, and efficient image codec based on set
partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for
Video Technology, 6(3):243?250, Jun 1996.
[37] Ee-Chien Chang, Ste?phane Mallat, and Chee Yap. Wavelet foveation. Applied and
Computational Harmonic Analysis, 9(3):312?335, 2000.
[38] Zhou Wang and Alan C Bovik. Embedded foveation image coding. IEEE Transactions
on Image Processing, 10(10):1397?1410, 2001.
[39] C. Papadopoulos and A. E. Kaufman. Acuity-driven gigapixel visualization. IEEE
Transactions on Visualization and Computer Graphics, 19(12):2886?2895, Dec 2013.
[40] T. H. Reeves and J. A. Robinson. Adaptive foveation of mpeg video. In Proceedings
of the Fourth ACM International Conference on Multimedia, MULTIMEDIA ?96,
pages 231?241, New York, NY, USA, 1996. ACM.
[41] Lee and Sanghoon. Foveated video compression and visual communications over
wireless and wireline networks. PhD thesis, Dept. of ECE, University of Texas at
Austin, 2000.
[42] Zhou Wang and Alan C Bovik. Foveated image and video coding. In Digitial Video
Image Quality and Perceptual Coding, pages 1?28. 2005.
[43] Sanghoon Lee, M. S. Pattichis, and A. C. Bovik. Foveated video compression with
optimal rate control. IEEE Transactions on Image Processing, 10(7):977?992, Jul
2001.
[44] Sanghoon Lee, M. S. Pattichis, and A. C. Bovik. Foveated video quality assessment.
IEEE Transactions on Multimedia, 4(1):129?132, Mar 2002.
[45] H. R. Sheikh, S. Liu, Z. Wang, and A. C. Bovik. Foveated multipoint videocon-
ferencing at low bit rates. In 2002 IEEE International Conference on Acoustics,
Speech, and Signal Processing, volume 2, pages II?2069?II?2072, May 2002.
[46] Hamid R. Sheikh, Brian L. Evans, and Alan C. Bovik. Real-time foveation techniques
for low bit rate video coding. Real-Time Imaging, 9(1):27?40, February 2003.
[47] Anton Kaplanyan, Anton Sochenov, Thomas Leimku?hler, Mikhail Okunev, Todd
Goodall, and Gizem Rufo. Deepfovea: Neural reconstruction for foveated rendering
and video compression using learned natural video statistics. In ACM SIGGRAPH
2019 Talks, SIGGRAPH ?19, pages 58:1?58:2, New York, NY, USA, 2019. ACM.
149
[48] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Proceedings of the 27th International Conference on Neural Information Processing
Systems - Volume 2, NIPS?14, pages 2672?2680, Cambridge, MA, USA, 2014. MIT
Press.
[49] M. Weier, M. Stengel, T. Roth, P. Didyk, E. Eisemann, M. Eisemann, S. Grogorick,
A. Hinkenjann, E. Kruijff, M. Magnor, K. Myszkowski, and P. Slusallek. Perception-
driven accelerated rendering. Comput. Graph. Forum, 36(2):611?643, May 2017.
[50] Toshikazu Ohshima, Hiroyuki Yamamoto, and Hideyuki Tamura. Gaze-directed
adaptive rendering for interacting with virtual space. In Proceedings of the 1996
Virtual Reality Annual International Symposium (VRAIS 96), VRAIS ?96, pages
103?110, 267, Washington, DC, USA, 1996. IEEE Computer Society.
[51] Hugues Hoppe. Smooth view-dependent level-of-detail control and its application to
terrain rendering. In Proceedings of the Conference on Visualization ?98, VIS ?98,
pages 35?42, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.
[52] L. Hu, P. V. Sander, and H. Hoppe. Parallel view-dependent level-of-detail control.
IEEE Transactions on Visualization and Computer Graphics, 16(5):718?728, Sept
2010.
[53] Jonathan Ragan-Kelley, Jaakko Lehtinen, Jiawen Chen, Michael Doggett, and
Fre?do Durand. Decoupled sampling for graphics pipelines. ACM Trans. Graph.,
30(3):17:1?17:17, May 2011.
[54] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The
lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics
and Interactive Techniques, SIGGRAPH ?96, pages 43?54, New York, NY, USA,
1996. ACM.
[55] Jonghyun Kim, Youngmo Jeong, Michael Stengel, Kaan Aks?it, Rachel Albert, Ben
Boudaoud, Trey Greer, Joohwan Kim, Ward Lopes, Zander Majercik, Peter Shirley,
Josef Spjut, Morgan McGuire, and David Luebke. Foveated AR: Dynamically-
foveated Augmented Reality Display. ACM Trans. Graph., 38(4):99:1?99:15, July
2019.
[56] B Karis. High-quality temporal supersampling. Advances in Real-Time Rendering
in Games, SIGGRAPH Courses, 1:1?55, 2014.
[57] Cyril Crassin, Morgan McGuire, Kayvon Fatahalian, and Aaron Lefohn. Aggregate
g-buffer anti-aliasing. In Proceedings of the 19th Symposium on Interactive 3D
Graphics and Games, I3D ?15, pages 109?119, New York, NY, USA, 2015. ACM.
[58] Mattha?us G. Chajdas, Morgan McGuire, and David Luebke. Subpixel reconstruction
antialiasing for deferred shading. In Symposium on Interactive 3D Graphics and
Games, I3D ?11, pages 15?22 PAGE@7, New York, NY, USA, 2011. ACM.
[59] T Pengo, A Mun?oz-Barrutia, and C Ortiz-de solo?rzano. Halton sampling for autofocus.
Journal of Microscopy, 235(1):50?58, 2009.
150
[60] Kashinath D Patil. Cochran?s q test: Exact distribution. Journal of the American
Statistical Association, 70(349):186?189, 1975.
[61] Margarita Vinnikov and Robert S Allison. Gaze-contingent depth of field in realistic
scenes: The user experience. In Proceedings of the Symposium on Eye Tracking
Research and Applications, pages 119?126. ACM, 2014.
[62] Nir Benty, Kai-Hwa Yao, Tim Foley, Anton S. Kaplanyan, Conor Lavelle, Chris
Wyman, and Ashwin Vijay. The Falcor rendering framework, 07 2017.
[63] Frank W Weymouth. Visual sensory units and the minimal angle of resolution.
American Journal of Ophthalmology, 46(1):102?113, 1958.
[64] Chang Ha Lee, Amitabh Varshney, and David Jacobs. Mesh saliency. ACM
Transactions on Graphics (Proceedings of SIGGRAPH 2005), 24(3):659 ? 666, August
2005.
[65] Youngmin Kim, Amitabh Varshney, David Jacobs, and Francois Guimbretere. Mesh
saliency and human eye fixations. ACM Transactions on Applied Perception, 7(2):1 ?
13, 2010.
[66] Hsueh-Chien Cheng, Antonio Cardone, Eric Krokos, Bogdan Stoica, Alan Faden,
and Amitabh Varshney. Deep-Learning-Assisted Visualization for Live-Cell Images.
In Proceedings of 2017 IEEE International Conference on Image Processing, ICIP.
IEEE, September 2017.
[67] Y. Wan, H. Otsuna, C. Chien, and C. Hansen. An Interactive Visualization Tool
for Multi-Channel Confocal Microscopy Data in Neurobiology Research. IEEE
Transactions on Visualization and Computer Graphics, 15(6):1489?1496, Nov 2009.
[68] M. Hadwiger, J. Beyer, W. Jeong, and H. Pfister. Interactive Volume Exploration of
Petascale Microscopy Data Streams Using a Visualization-Driven Virtual Memory
Approach. IEEE Transactions on Visualization and Computer Graphics, 18(12):2285?
2294, Dec 2012.
[69] K. Mosaliganti, L. Cooper, R. Sharp, R. Machiraju, G. Leone, K. Huang, and J. Saltz.
Reconstruction of Cellular Biological Structures From Optical Microscopy Data.
IEEE Transactions on Visualization and Computer Graphics, 14(4):863?876, July
2008.
[70] Hsueh-Chien Cheng, Antonio Cardone, Somay Jain, Eric Krokos, Kedar Narayan,
Sriram Subramaniam, and Amitabh Varshney. Deep-learning-assisted Volume Visu-
alization. IEEE Transactions on Visualization and Computer Graphics, PP(99):1?14,
January 2018.
[71] Hsueh-Chien Cheng, Antonio Cardone, and Amitabh Varshney. Volume Segmentation
Using Convolutional Neural Networks With Limited Training Data. In Proceedings
of 2017 IEEE International Conference on Image Processing, ICIP. IEEE, September
2017.
151
[72] Marc Levoy, Ren Ng, Andrew Adams, Matthew Footer, and Mark Horowitz. Light
Field Microscopy. ACM Trans. Graph, 25(3):924?934, 2006.
[73] Robert Prevedel, Young-Gyu Yoon, Maximilian Hoffmann, Nikita Pak, Gordon
Wetzstein, Saul Kato, Tina Schro?del, Ramesh Raskar, Manuel Zimmer, Edward S
Boyden, et al. Simultaneous Whole-Animal 3D Imaging of Neuronal Activity Using
Light-Field Microscopy. Nature Methods, 11(7):727, 2014.
[74] Jin-Xiang Chai, Xin Tong, Shing-Chow Chan, and Heung-Yeung Shum. Plenoptic
Sampling. In Proceedings of the 27th Annual Conference on Computer Graphics and
Interactive Techniques, SIGGRAPH ?00, pages 307?318, New York, NY, USA, 2000.
ACM Press/Addison-Wesley Publishing Co.
[75] Ren Ng. Fourier Slice Photography. ACM Trans. Graph, 24(3):735?744, 2005.
[76] Douglas Lanman and David Luebke. Near-Eye Light Field Displays. In ACM
SIGGRAPH 2013 Emerging Technologies, SIGGRAPH ?13, pages 11:1?11:1, New
York, NY, USA, 2013. ACM.
[77] Fu-Chung Huang, Kevin Chen, and Gordon Wetzstein. The Light Field Stereoscope:
Immersive Computer Graphics Via Factored Near-Eye Light Field Displays With
Focus Cues. ACM Trans. Graph, 34(4):60:1?60:12, 2015.
[78] J. Zhang, Z. Fan, D. Sun, and H. Liao. Unified Mathematical Model for Multilayer-
Multiframe Compressive Light Field Displays Using LCDs. IEEE Transactions on
Visualization and Computer Graphics, pages 1?1, 2018.
[79] S. Lee, J. Cho, B. Lee, Y. Jo, C. Jang, D. Kim, and B. Lee. Foveated Retinal
Optimization for See-Through Near-Eye Multi-Layer Displays. IEEE Access, 6:2170?
2180, 2018.
[80] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari,
Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical
view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics
(TOG), 2019.
[81] Ronald S. Weinstein, Michael R. Descour, Chen Liang, Gail Barker, Katherine M.
Scott, Lynne Richter, Elizabeth A. Krupinski, Achyut K. Bhattacharyya, John R.
Davis, Anna R. Graham, Margaret Rennels, William C. Russum, James F. Goodall,
Pixuan Zhou, Artur G. Olszak, Bruce H. Williams, James C. Wyant, and Peter H.
Bartels. An Array Microscope for Ultrarapid Virtual Slide Processing and Telepathol-
ogy. Design, Fabrication, and Validation Study. Human Pathology, 35(11):1303 ?
1314, 2004.
[82] Brian Wilt, Laurie Burns, Eric Tatt Wei Ho, Kunal Ghosh, Eran Mukamel, and
Mark Schnitzer. Advances in Light Microscopy for Neuroscience. Annual Review of
Neuroscience, pages 435?506, 9.
[83] Robert Prevedel, Young-Gyu Yoon, Maximilian Hoffmann, Nikita Pak, Gordon
Wetzstein, Saul Kato, Tina Schro?del, Ramesh Raskar, Manuel Zimmer, Edward S
Boyden, and Alipasha Vaziri. Simultaneous Whole-Animal 3D Imaging of Neuronal
152
Activity Using Light-Field Microscopy. Nature Methods, 11:727 ? 730, 05/18/2014
2014.
[84] B. Sheng, P. Li, Y. Jin, P. Tan, and T. Lee. Intrinsic image decomposition with step
and drift shading separation. IEEE Transactions on Visualization and Computer
Graphics, pages 1?1, 2018.
[85] Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with
depth cues. In 2013 IEEE International Conference on Computer Vision, pages
241?248, Dec 2013.
[86] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality
assessment: from error visibility to structural similarity. IEEE Transactions on
Image Processing, 13(4):600?612, April 2004.
[87] W. Feng, Y. Yang, L. Wan, and C. Yu. Tone-mapped mean-shift based environ-
ment map sampling. IEEE Transactions on Visualization and Computer Graphics,
22(9):2187?2199, Sep. 2016.
[88] E. Turner, H. Jiang, D. Saint-Macary, and B. Bastani. Phase-Aligned Foveated
Rendering for Virtual Reality Headsets. In 2018 IEEE Conference on Virtual Reality
and 3D User Interfaces (VR), pages 1?2, Los Alamitos, CA, USA, mar 2018. IEEE
Computer Society.
[89] Piotr Didyk, Elmar Eisemann, Tobias Ritschel, Karol Myszkowski, and Hans-Peter
Seidel. Perceptually-motivated Real-time Temporal Upsampling of 3D Content for
High-refresh-rate Displays. Computer Graphics Forum, 29(2):713?722, 2010.
[90] Clare Kathleen Porac and Stanley Coren. The dominant eye. Psychological Bulletin,
83(5):880?897, 9 1976.
[91] Einat Shneor and Shaul Hochstein. Eye dominance effects in feature search. Vision
Research, 46(25):4258 ? 4269, 2006.
[92] Belk?s Koc?tekin, Nimet U?nay Gu?ndog?an, Ays? Gu?l Koc?ak Alt?ntas?, and Ays?e Canan
Yaz?c?. Relation of eye dominancy with color vision discrimination performance
ability in normal subjects. International journal of ophthalmology, 6(5):733, 2013.
[93] I. C. McManus, Clare Kathleen Porac, M. P. Bryden, and R. Boucher. Eye-dominance,
Writing Hand, and Throwing Hand. Laterality: Asymmetries of Body, Brain and
Cognition, 4(2):173?192, 1 1999.
[94] Ayame Oishi, Shozo Tobimatsu, Kenji Arakawa, Takayuki Taniwaki, and Jun
ichi Kira. Ocular dominancy in conjugate eye movements at reading distance.
Neuroscience Research, 52(3):263 ? 268, 2005.
[95] Romain Chaumillon, Jean Blouin, and Alain Guillaume. Eye dominance influences
triggering action: The Poffenberger paradigm revisited. Cortex, 58:86 ? 98, 2014.
[96] Carlo A Marzi. The poffenberger paradigm: a first, simple, behavioural tool to study
interhemispheric transmission in humans. Brain Research Bulletin, 50(5):421 ? 422,
1999.
153
[97] Heidi L. Roth, Andrea N. Lora, and Kenneth M. Heilman. Effects of monocular
viewing and eye dominance on spatial attention. Brain, 125(9):2023?2035, 09 2002.
[98] Michael R. Stoline. The Status of Multiple Comparisons: Simultaneous Estimation of
all Pairwise Comparisons in One-Way ANOVA Designs. The American Statistician,
35(3):134?141, 1981.
[99] F. Buttussi and L. Chittaro. Locomotion in Place in Virtual Reality: A Comparative
Evaluation of Joystick, Teleport, and Leaning. IEEE Transactions on Visualization
and Computer Graphics, pages 1?1, 2019.
[100] Robert Y Wang and Jovan Popovic?. Real-time hand-tracking with a color glove.
ACM transactions on graphics (TOG), 28(3):1?8, 2009.
[101] Charles R Cameron, Louis W DiValentin, Rohini Manaktala, Adam C McElhaney,
Christopher H Nostrand, Owen J Quinlan, Lauren N Sharpe, Adam C Slagle,
Charles D Wood, Yang Yang Zheng, et al. Hand tracking and visualization in a
virtual reality simulation. In 2011 IEEE systems and information engineering design
symposium, pages 127?132. IEEE, 2011.
[102] Victor Adrian Prisacariu and Ian Reid. 3d hand tracking for human computer
interaction. Image and Vision Computing, 30(3):236 ? 250, 2012. Best of Automatic
Face and Gesture Recognition 2011.
[103] J. M. Rehg and T. Kanade. Digiteyes: vision-based hand tracking for human-
computer interaction. In Proceedings of 1994 IEEE Workshop on Motion of Non-rigid
and Articulated Objects, pages 16?22, 1994.
[104] V. A. Prisacariu and I. Reid. Robust 3d hand tracking for human computer
interaction. In Face and Gesture 2011, pages 368?375, 2011.
[105] Giancarlo Iannizzotto, Massimo Villari, and Lorenzo Vita. Hand tracking for human-
computer interaction with graylevel visualglove: Turning back to the simple way. In
Proceedings of the 2001 Workshop on Perceptive User Interfaces, PUI ?01, page 1?7,
New York, NY, USA, 2001. Association for Computing Machinery.
[106] Eric Krokos, Hsueh-Chien Cheng, Jessica Chang, Bohdan Nebesh, Celeste Lyn Paul,
Kirsten Whitley, and Amitabh Varshney. Enhancing Deep Learning with Visual
Interactions. ACM Transactions on Interactive Intelligent Systems (TIIS), 1(1), Jul
2018.
[107] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. Bighand2.
2m benchmark: Hand pose dataset and state of the art analysis. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 4866?4874,
2017.
[108] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Crossing nets:
Dual generative models with a shared latent space for hand pose estimation. In
Conference on Computer Vision and Pattern Recognition, volume 7, 2017.
154
[109] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single
images using multiview bootstrapping. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 4645?4653, 2017.
[110] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from
single rgb images. In Proceedings of the IEEE international conference on computer
vision, pages 4903?4911, 2017.
[111] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and
Junsong Yuan. 3d hand shape and pose estimation from a single rgb image. In
CVPR, 2019.
[112] Jonathan Taylor, Richard Stebbing, Varun Ramakrishna, Cem Keskin, Jamie Shot-
ton, Shahram Izadi, Aaron Hertzmann, and Andrew Fitzgibbon. User-specific hand
modeling from monocular depth sequences. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 644?651, 2014.
[113] S. Khamis, J. Taylor, J. Shotton, C. Keskin, S. Izadi, and A. Fitzgibbon. Learning an
efficient model of hand shape variation from depth images. In 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 2540?2548, 2015.
[114] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling
and capturing hands and bodies together. ACM Trans. Graph., 36(6), November
2017.
[115] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and
Michael J. Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph.,
34(6), October 2015.
[116] T. E. de Campos and D. W. Murray. Regression-based hand pose estimation from
multiple cameras. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR?06), volume 1, pages 782?789, 2006.
[117] Srinath Sridhar, Helge Rhodin, Hans-Peter Seidel, Antti Oulasvirta, and Christian
Theobalt. Real-time hand tracking using a sum of anisotropic gaussians model. In
Proceedings of the International Conference on 3D Vision (3DV), December 2014.
[118] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand
mesh recovery from a monocular rgb image. In The International Conference on
Computer Vision (ICCV), 2019.
[119] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for
rgb-based dense 3d hand pose estimation via neural rendering. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
June 2019.
[120] Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M. Bronstein, and
Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in
the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2020.
155
[121] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade,
Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system
for social motion capture. In The IEEE International Conference on Computer
Vision (ICCV), 2015.
[122] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russel, Max Argus, and
Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape
from single rgb images. In IEEE International Conference on Computer Vision
(ICCV), 2019.
[123] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[124] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples:
Benchmarking large-scale scene reconstruction. ACM Trans. Graph., 36(4), July
2017.
[125] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d
pose estimation using part affinity fields. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7291?7299, 2017.
[126] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2016.
[127] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet:
A large-scale hierarchical image database. In 2009 IEEE conference on computer
vision and pattern recognition, pages 248?255. Ieee, 2009.
[128] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch:
An imperative style, high-performance deep learning library. In Advances in neural
information processing systems, pages 8026?8037, 2019.
[129] Matt Pharr and Greg Humphreys. Physically Based rendering, second edition: from
theory to implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2nd edition, 2010.
[130] Chih-Fan Hsu, Anthony Chen, Cheng-Hsin Hsu, Chun-Ying Huang, Chin-Laung Lei,
and Kuan-Ta Chen. Is foveated rendering perceivable in virtual reality?: exploring
the efficiency and consistency of quality assessment methods. In Proceedings of the
2017 ACM on Multimedia Conference, MM ?17, pages 55?63, New York, NY, USA,
2017. ACM.
[131] Matias Koskela, Atro Lotvonen, Markku Ma?kitalo, Petrus Kivi, Timo Viitanen, and
Pekka Ja?a?skela?inen. Foveated Real-Time Path Tracing in Visual-Polar Space. In
Tamy Boubekeur and Pradeep Sen, editors, Eurographics Symposium on Rendering -
DL-only and Industry Track. The Eurographics Association, 2019.
[132] Ruofei Du, Ming Chuang, Wayne Chang, Hugues Hoppe, and Amitabh Varshney.
Montage4D: Real-Time Seamless Fusion and Stylization of Multiview Video Textures.
Journal of Computer Graphics Techniques, 1(15):1?34, Jan. 2019.
156
[133] Ruofei Du, David Li, and Amitabh Varshney. Geollery: A Mixed Reality Social
Media Platform. In Proceedings of the 2019 CHI Conference on Human Factors in
Computing Systems, page 13. ACM, May 2019.
[134] Ruofei Du, David Li, and Amitabh Varshney. Project Geollery.com: Reconstructing
a Live Mirrored World With Geotagged Social Media. In Proceedings of the 24th
International Conference on Web3D Technology, Web3D, pages 1?9. ACM, July
2019.
[135] Xuetong Sun, Sarah B. Murthi, Gary Schwartzbauer, and Amitabh Varshney. High-
precision 5dof tracking and visualization of catheter placement in evd of the brain
using ar. ACM Trans. Comput. Healthcare, 1(2), March 2020.
[136] Eric Krokos, Catherine Plaisant, and Amitabh Varshney. Virtual Memory Palaces:
Immersion Aids Recall. Springer VR 2018, 1(15):1?20, May 2018.
[137] Jonathan Korein and Norman Badler. Temporal Anti-aliasing in Computer Generated
Animation. In Proceedings of the 10th Annual Conference on Computer Graphics
and Interactive Techniques, SIGGRAPH ?83, pages 377?388, New York, NY, USA,
1983. ACM.
[138] M. Sung and S. Choi. Selective Anti-Aliasing for Virtual Reality Based on Saliency
Map. In 2017 International Symposium on Ubiquitous Virtual Reality (ISUVR),
pages 16?19, June 2017.
[139] Chaurasia B.D. and Mathur B.B.L. Eyedness. Acta Anatomica, 96(2):301?305, 1976.
[140] D. Y. N. Zotkin, J. Hwang, R. Duraiswaini, and L. S. Davis. HRTF personalization
using anthropometric measurements. In 2003 IEEE Workshop on Applications of
Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684), pages 157?160,
Oct 2003.
[141] Hiroshi Ono and Raphael Barbeito. The cyclopean eye vs. the sighting-dominant eye
as the center of visual direction. Perception & Psychophysics, 32(3):201?210, 1982.
[142] Tomer Elbaum, Michael Wagner, and Assaf Botzer. Cyclopean vs. dominant eye in
gaze-interface-tracking. Journal of Eye Movement Research, 10(1), Jan. 2017.
[143] Zhenping Xia and Eli Peli. 30-1: Cyclopean eye based binocular orientation in
virtual reality. SID Symposium Digest of Technical Papers, 49(1):381?384, 2018.
[144] Franziska Mueller, Micah Davis, Florian Bernard, Oleksandr Sotnychenko, Mickeal
Verschoor, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. Real-time Pose
and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera.
ACM Transactions on Graphics (TOG), 38(4), 2019.
157