ABSTRACT

Title of Dissertation: DENSE 3D RECONSTRUCTIONS FROM SPARSE VISUAL DATA
Tao Hu, Doctor of Philosophy, 2023
Dissertation Directed by: Professor Matthias Zwicker, Department of Computer Science

3D reconstruction, the problem of estimating the complete geometry or appearance of objects from partial observations (e.g., several RGB images, partial shapes, videos), serves as a building block in many vision, graphics, and robotics applications such as 3D scanning, autonomous driving, 3D modeling, augmented reality (AR) and virtual reality (VR). However, it is very challenging for machines to recover 3D geometry from such sparse data due to occlusions and the irregularity and complexity of 3D objects. To address these challenges, in this dissertation we explore learning-based 3D reconstruction methods for different 3D object representations on different tasks: 3D reconstructions of static objects and of the dynamic human body from limited data.

For the 3D reconstruction of static objects, we propose a multi-view representation of 3D shapes, which utilizes a set of multi-view RGB images or depth maps to represent a 3D shape. We first explore the multi-view representation for shape completion tasks and develop deep learning methods to generate dense and high-resolution point clouds from partial observations. Yet one problem with the multi-view representation is the inconsistency among different views. To solve this problem, we propose a multi-view consistency optimization strategy to encourage consistency for shape completion in the inference stage. Third, we present an extension of the multi-view representation for dense 3D geometry and texture reconstruction from single RGB images.

Capturing and rendering realistic human appearance under varying poses and viewpoints is an important goal in computer vision and graphics. In the second part, we introduce techniques to create 3D virtual human avatars from limited data (e.g., videos). We propose implicit representations of motion, texture, and geometry for human modeling, and utilize neural rendering techniques for free-viewpoint synthesis of dynamic articulated human bodies. Our learned human avatars are photorealistic and fully controllable (pose, shape, viewpoints, etc.), and can be used in free-viewpoint video generation, animation, shape editing, telepresence, and AR/VR.

Our proposed methods can learn end-to-end 3D reconstructions from 2D image or video signals. We hope these learning-based methods will assist in perceiving and reconstructing the 3D world for future AI systems.

DENSE 3D RECONSTRUCTIONS FROM SPARSE VISUAL DATA

by Tao Hu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Matthias Zwicker, Chair/Advisor
Professor Joseph F. JaJa
Professor John Yiannis Aloimonos
Professor Marine Carpuat
Professor Abhinav Shrivastava

© Copyright by Tao Hu 2023

Dedication

To my family.

Acknowledgments

First and foremost I would like to thank my advisor, Professor Matthias Zwicker, for his help and advice during my Ph.D. study at UMD. We have worked together on many research projects, and it would not have been possible to achieve these results without your support. Thank you! I would also like to thank my Ph.D. thesis committee members Matthias Zwicker, Joseph F. JaJa, John Yiannis Aloimonos, Marine Carpuat, and Abhinav Shrivastava for the time and effort they devoted to discussing my research.
I have learned so much from the valuable and critical feedback they have given on my research.

I am very fortunate to have worked with many incredible researchers through internships. I would like to acknowledge my mentors Prof. Christian Theobalt from Max Planck Institute for Informatics, Prof. Yebin Liu from Tsinghua University, Hongyi Xu and Linjie Luo from ByteDance, and Kai Chen from Microsoft Research Asia. Thanks for providing the internship opportunities; working with you has been a wonderful experience and a great source of inspiration. I also would like to acknowledge Weipeng Xu from Meta Reality Labs for offering an internship opportunity, though I could not join the team due to visa issues.

Furthermore, I would like to acknowledge my research collaborators: Gen Lin and Zhizhong Han at UMD, Kripasindhu Sarkar and Lingjie Liu at MPII, and Tao Yu, Zerong Zheng, and He Zhang at Tsinghua University. I am very lucky to have worked with you, and I thank you for your excellent collaborative effort.

I am deeply thankful for all the friends who have supported me outside my research life. I thank Guangyao Shi, Jing Xie, Bo He, Hao Chen, Hanyu Wang, Xingting Wang, Yong Yang, Ruofei Du, Hao Zhou, Jun Wang, Zhichao Liu, Yixuan Ren, Renkun Ni, and Yuesheng Ye in Maryland, Jian Wang and Yue Li at MPII, Caonan Ji, Zhen Fan, Yunxi Guo, Xiaodong Yang, and Baowei Jiang at Tsinghua University, and many more whose names I cannot list here. It has been a great time with you, and I wish you all the best in your future endeavors.

Finally, none of this would have been possible without the full, unconditional support of my family. I want to thank my parents and grandparents for their support and love in whatever path I chose to pursue. I also thank my sister and brother-in-law for all the understanding and encouragement throughout my Ph.D. life.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Learning 3D Reconstructions for Static Objects
  1.2 Learning 3D Human Avatars from Videos
  1.3 Contribution & Dissertation Organization

Part I: Learning Dense 3D Reconstructions for Static Objects

Chapter 2: Shape Completion based on Multi-view Representation
  2.1 Introduction
  2.2 Related Work
  2.3 Method
  2.4 Experiments
  2.5 Conclusion
  2.A Additional Experimental Details

Chapter 3: Multi-view Consistency in Shape Completion
  3.1 Introduction
  3.2 Related Work
  3.3 Multi-view Consistent Inference
  3.4 Consistency Loss
  3.5 Experiments
  3.6 Discussion and Conclusion

Chapter 4: Shape and Texture Reconstructions from Single RGB Images
  4.1 Introduction
  4.2 Related Work
  4.3 Method
  4.4 Experiments
  4.5 Discussion and Conclusion
  4.A Additional Experimental Details

Part II: Learning 3D Human Avatars from Videos

Chapter 5: HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars
  5.1 Introduction
  5.2 Related Work
  5.3 Method
  5.4 Experiments
  5.5 Discussion and Conclusion
  5.A Additional Experimental Results
  5.B Implementation Details

Chapter 6: HVTR++: Pose and Image Driven Human Avatars using HVTR
  6.1 Introduction
  6.2 Related Work
  6.3 Method
  6.4 Experiments
  6.5 Discussion and Conclusion
  6.A Additional Experimental Results
  6.B Implementation Details

Chapter 7: Rendering Human Avatars from Mobile Egocentric Fisheye Cameras
  7.1 Introduction
  7.2 Related Work
  7.3 Method
  7.4 Experiments
  7.5 Discussion and Conclusion

Part III: Conclusion and Future Work

Chapter 8: Conclusion & Future Work

Bibliography

List of Tables

2.1 Analysis of the objective function
2.2 Completion results for different positions of view-pooling layer
2.3 Average L1 distance for different numbers of views in view-pooling
2.4 Improvements while increasing training samples
2.5 Quantitative comparisons between VCN and MVCN
2.6 Comparison with the existing methods
3.1 Chamfer Distance over different loss functions
3.2 The effects of depth-buffer sizes
3.3 Mean Chamfer Distance over multiple categories in ShapeNet
4.1 CD on single-category task
4.2 Mean CD of partial shape
4.3 Mean CD of multiple-seen-category experiments on ShapeNet
4.4 Average CD of multiple-unseen-category experiments on ShapeNet
4.5 Average CD on both seen and unseen category on Pix3D dataset
4.6 Relative CD improvements after ICP
4.7 6D pose evaluations
4.8 F-score on ShapeNet and Pix3D
4.9 Comparisons with GenRe
4.10 Quantitative comparisons with AtlasNet
4.11 Average Chamfer Distance (CD) on Pix3D dataset
5.1 A set of human synthesis approaches classified by feature representations and renderers
5.2 Quantitative comparisons
5.3 Quantitative comparisons on M1 sequence
5.4 Accuracy and inference time
5.5 Ablation study of each component
5.6 Glossary table
5.7 Comparisons with AniNeRF
5.8 Quantitative results of each method
5.9 Comparisons of fusing volumetric and textural features
5.10 Quantitative comparisons
5.11 Performance and inference time
6.1 Quantitative comparisons on unseen sequences from R1
6.2 Quantitative comparisons on unseen sequences from R2-7
6.3 Quantitative comparisons on ZJU MoCap datasets
7.1 Quantitative comparisons of single-video dataset
7.2 Quantitative comparisons of multi-video training

List of Figures

1.1 Applications of 3D reconstruction
1.2 Partial visual data acquired in practical applications
1.3 Visual overview of the dissertation
2.1 Overview of our approach
2.2 Architecture of MVCN
2.3 Completion results under different losses
2.4 Completion results for different numbers of views in view-pooling
2.5 An example of the completion of sofa
2.6 Visual comparison between VCN and MVCN
2.7 Comparison between MVCN and PCN-CD
2.8 Completion results on KITTI
2.9 Qualitative completion on ShapeNet
2.10 Completion results on noisy, sparse and occluded inputs
2.11 Improvements of 8 views over 3 and 5 views in view-pooling
2.12 Failed depth completions of lamps
3.1 Overview of the multi-view consistent inference pipeline
3.2 Network structure
3.3 Methods to calculate consistency distance
3.4 Consistency pooling with respect to V7
3.5 All the eight loss maps of a 3D model
3.6 Consistency loss maps under different depth-buffer sizes
3.7 Comparisons with direct optimization
3.8 Consistent inference optimization
3.9 Completions on noisy inputs
3.10 Completion results given three different inputs
3.11 Improvements over MVCN
4.1 Approach overview
4.2 Render multiple views given partial shapes
4.3 Completions of texture and depth maps
4.4 Reconstructions on single-category task
4.5 Reconstruction results on unseen categories from ShapeNet
4.6 Reconstructions of the seen categories on ShapeNet dataset
4.7 Reconstructions on Pix3D
4.8 Two-stage Training
4.9 Ours v.s. AtlasNet [1]
4.10 Qualitative comparisons with GenRe
4.11 Reconstructions of car objects on ShapeNet dataset
5.1 Differences to existing neural rendering methods
5.2 Pipeline overview of HVTR
5.3 Qualitative results of our variants
5.4 Render skirts on novel poses
5.5 Geometry reconstructions of skirt
5.6 Shape editing
5.7 Comparisons with other rendering methods
5.8 Geometry-guided ray marching
5.9 Construct pose-conditioned NeRF
5.10 Qualitative results on unseen sequences
5.11 Ablation study of each component
5.12 Comparisons on shape editing
5.13 Novel view synthesis
5.14 Comparisons of backward skinning method
5.15 Ablation study of face identity loss
5.16 Normal prediction
6.1 Pipeline overview
6.2 Ablation study of driving views
6.3 Ablation study of normal loss and texture loss
6.4 Qualitative results
6.5 Qualitative results on unseen sequences
6.6 Ablation study of driving views
6.7 Qualitative results on novel view synthesis
7.1 Egocentric setup
7.2 Examples of our rendered fisheye training dataset
7.3 DensePose predictions
7.4 Pipeline overview
7.5 Implicit texture learning
7.6 Renderings produced by our methods
7.7 Comparisons with other methods
7.8 Comparisons to Textured Neural Avatars
7.9 Renderings of single- and multi-video training
7.10 Comparison with Only-MV
7.11 Performances in real applications
7.12 Comparisons with Only-MV on occluded parts
7.13 Renderings of different views in applications
7.14 Local and global coordinate

Chapter 1: Introduction

3D reconstruction is widely used in vision and graphics applications, including 3D scanning, autonomous driving, 3D modeling, augmented reality (AR), and virtual reality (VR). However, in many real applications (Figure 1.2), we can only acquire sparse or partial visual data, such as partial point clouds, single RGB images, or sparse-view video sequences. For example, autonomous driving systems need to reconstruct accurate 3D maps of the surrounding environment from sparse partial point clouds scanned by LiDAR sensors. It is also necessary to reconstruct 3D scenes from several RGB images captured by a mobile camera for robot navigation, Simultaneous Localization and Mapping (SLAM), and AR. Similarly, generating 3D virtual human avatars just from several multi-view videos also facilitates the creation of 3D content in AR and VR applications.

Human intelligence is flexible, and from partial observations like a single RGB image, we can effortlessly recognize the 3D geometry and texture of objects within the scene. However, it is very challenging for machines to recover 3D geometry from such sparse data due to occlusions and the irregularity and complexity of 3D objects.

Figure 1.1: 3D reconstruction is widely used in different vision and graphics applications, such as robot navigation, autonomous driving, telepresence, medical imaging, virtual and augmented reality.

For example, given several sparse-view images of an object or a scene, 3D reconstruction becomes an ill-posed problem since there can potentially be multiple solutions to the 3D structure occluded or invisible to the camera viewpoint. In this case, learning-based approaches can come to the rescue, especially when utilizing the power of deep neural networks. A straightforward solution is to employ a 3D prediction network to find the correlations between the partial input and 3D ground truth data (e.g., crafted CAD models), which, however, are very difficult to acquire in practical applications, especially for outdoor scenes or dynamic articulated human bodies. Compared to ground-truth 3D annotations, images and videos come in as more abundant sources of data to learn from, and hence it is of major interest to learn 3D reconstructions by utilizing images and videos as self-supervision signals.
A key step to achieve this is to learn to establish pixel correspondences between view images or video frames by utilizing differentiable or neural rendering techniques [2]. In this dissertation, we focus on learning 3D reconstructions without the use of ex- plicit 3D supervision. We investigate various differentiable and neural rendering methods 2 Figure 1.2: Partial visual data acquired in practical applications: (a) sparse point clouds scanned by LiDAR [3], (b) single RGB image [4], (c) sparse multi-view videos captured in a studio [5]. In this dissertation, we explore digital representations of 3D shapes and learning-based methods for 3D reconstructions from partial visual data. for different 3D object representations in different reconstruction tasks, such as dense point cloud completion from partial shapes, dense geometry and texture reconstruction from single RGB images, dense (i.e. free-viewpoint) view synthesis of dynamic humans from sparse view videos, etc. Our research goal is to develop effective 3D representations enabling 3D reconstructions of the world from visual data to allow better efficiency and efficacy in 3D understanding and modeling. 1.1 Learning 3D Reconstructions for Static Objects Multi-view Representation and Applications in Shape Completion. 3D shape com- pletion, the problem of perceiving the complete geometry of objects from partial ob- servations, is widely used in applications such as 3D scanning in robotics, autonomous driving, or 3D modeling and fabrication. While previous work for shape completion re- lies on volumetric representations, meshes, or point clouds, we propose to use multi-view depth maps from a set of fixed viewing angles as our shape representation, and perform depth completion for each view in image space. At the heart of our approach is a deep 3 convolutional network trained to complete each depth map. We show that (1) 3D shape completion can be formulated as multi-view depth completions in 2D image space, which enables dense and high-solution shape generation, and (2) generalizable 3D shape com- pletion can be trained from a dataset of multi-view depth maps without the use of explicit 3D ground truth. More details will be introduced in Chapter 2. Multi-view Consistency in Shape Completion. However, one problem of the multi-view representation is inconsistency among the multiple completed images. To resolve this issue, we propose a multi-view consistent inference technique for 3D shape completion, which we express as an energy minimization problem, and we present a consistency loss to encourage multi-view geometric consistency. We show that with a pretrained shape prior, the proposed learning-based approach can improve geometric consistency on novel test data for the multi-view representation. We will introduce this technique in Chapter 3. 3D Reconstruction from Single RGB Images based on Multi-view Representation. With recent advances in deep learning, it has become possible to recover plausible 3D shapes even from single RGB images for the first time. The problem of single image 3D reconstruction shares some similarities with shape completion, but is arguably even harder. In addition, obtaining detailed geometry and texture for objects with arbitrary topology remains challenging. Based on the multi-view representation, we propose a novel approach to reconstruct dense point clouds with textures from single RGB images. 
We show that 3D geometry and texture reconstruction can also be formulated as view im- age completions based on the multi-view representations. We will discuss this in Chapter 4. 4 1.2 Learning 3D Human Avatars from Videos. Our world is rarely static, and it is of major interest to reconstruct and understand specifically dynamic humans. Besides static objects, we will introduce some techniques to reconstruct 3D dynamic articulated humans from videos, which are far more challeng- ing than the 3D reconstructions of static objects. Capturing and rendering realistic human appearance under varying poses and viewpoints is an important goal in computer vision and graphics, i.e., in applications of free-viewpoint video generation, animation, telep- resence, and content creation for AR/VR. For free-viewpoint video generation of humans from multi-view videos, most traditional methods required animatable person-specific 3D models which were created with sophisticated reconstruction and rigging techniques, and employed classical graphics renderers to synthesize views. These methods are time- consuming, and often require users to have extensive modeling experience, artistic skills, and training to create these 3D models. In contrast, recent neural rendering methods [2, 5–9] have made great progress in generating realistic images of humans, which are simple yet effective compared with traditional graphics pipelines [10–12]. Neural rendering is a class of deep image and video generation approaches that combines machine learning techniques with physical knowledge from computer graphics to obtain controllable outputs, which enables end-to- end 3D reconstructions with just 2D images or videos as self-supervision signals even for dynamic articulated human bodies. Hybrid Volumetric-Textural Rendering for Human Avatars. Different from tradi- tional representations of texture and geometry (e.g., voxels, point clouds, meshes), we 5 explore neural textures and implicit geometries for human modeling and propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR), which syn- thesizes 3D human avatars from arbitrary poses/viewpoints. At the heart of our method lies a two-stage rendering which first constructs a pose-conditioned downsampled neu- ral radiance field (NeRF [13]), and uses generative adversarial networks (GAN) [14] for high-resolution image synthesis. We show that (1) HVTR is capable of learning dynamic human reconstructions from just sparse viewpoint video sequences, and (2) the classical volume rendering [15] and GAN-based rendering [14] can be combined, and employed to render dynamic articulated human avatars efficiently and at high quality. More details will be presented in Chapter 5, 6. Reconstruct Avatars from Mobile Egocentric Fisheye Camera. Real-time free-viewpoint rendering of self-embodied avatars is also important in VR and AR applications, notably telepresence. Most existing telepresence systems employ multiple external outside-in cameras to capture human motions, and thus, users are often limited to staying in con- fined spaces visible to these external cameras. In contrast, we explore motion capture and neural rendering techniques for mobile setup and build a system for rendering full-body avatars of a person captured by a wearable, egocentric fisheye camera that is mounted on a cap or a VR headset. We will present this system in Chapter 7. 
1.3 Contribution & Dissertation Organization

Figure 1.3 provides a visual overview of the dissertation and the specific topics to be discussed: learning dense 3D reconstructions for static objects (Part I), and learning 3D dynamic human avatars from videos (Part II).

Figure 1.3: Visual overview of the dissertation.

Learning dense 3D reconstructions for static objects (Part I). In Chapter 2, we will introduce the multi-view representation and its applications in 3D shape completion. Second, the inconsistency problem of the multi-view representation will be discussed in Chapter 3. In Chapter 4, we will present the extensions of the multi-view representation for 3D reconstructions from single RGB images. The following is the relevant publication list for each chapter.

1. Chapter 2 — Hu et al., "Render4Completion: Synthesizing Multi-view Depth Maps for 3D Shape Completion", ICCVW 2019, Geometry Meets Deep Learning [16].
2. Chapter 3 — Hu et al., "3D Shape Completion with Multi-view Consistent Inference", AAAI 2020 [17].
3. Chapter 4 — Hu et al., "Learning to Generate Dense Point Clouds with Textures on Multiple Categories", WACV 2021 [18].

Learning 3D dynamic human avatars from videos (Part II). In this part, we will introduce techniques to reconstruct human avatars from videos. In Chapter 5, we will present a novel neural rendering technique (HVTR: Hybrid Volumetric-Textural Rendering) to render full-body human avatars. In Chapter 6, we propose an extension of HVTR, which makes use of both pose and image signals for novel view synthesis. In Chapter 7, we will present how to render human avatars from mobile egocentric fisheye cameras. The following is the relevant publication list for each chapter.

1. Chapter 5 — Hu et al., "HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars", 3DV 2022 [19].
2. Chapter 6 — Hu et al., "HVTR++: Image and Pose Driven Human Avatars using Hybrid Volumetric-Textural Rendering", in submission.
3. Chapter 7 — Hu et al., "EgoRenderer: Rendering Human Avatars From Egocentric Camera Images", ICCV 2021 [20].

Contribution. In summary, this thesis makes the following technical contributions:

1. A technique that formulates 3D shape completion as multi-view depth map completion in 2D image space, which enables dense and high-resolution shape generation (Part I, Chapter 2).
2. A multi-view consistent inference technique for 3D shape completion, which encourages multi-view geometric consistency on novel test data during the inference stage (Part I, Chapter 3).
3. A technique to generate dense point clouds with textures from single RGB images based on multi-view shape representations (Part I, Chapter 4).
4. A novel neural rendering technique, Hybrid Volumetric-Textural Rendering (HVTR), which is trained on human video sequences to learn to synthesize virtual full-body human avatars from novel arbitrary poses and viewpoints efficiently and at high quality (Part II, Chapter 5).
5. A system to generate pose- and image-driven full-body human avatars, which is capable of rendering faithful appearance details for digital humans by fully utilizing driving pose and image signals (Part II, Chapter 6).
6. A system to render full-body human avatars from a mobile egocentric fisheye camera (Part II, Chapter 7).
9 Part I Learning Dense 3D Reconstructions for Static Objects 10 Chapter 2: Shape Completion based on Multi-view Representation 2.1 Introduction Shape completion is an important step in 3D shape analysis, serving as a building block in applications such as 3D scanning in robotics, autonomous driving, or 3D mod- eling and fabrication. While learning-based methods that leverage large shape databases have achieved significant advances recently, choosing a suitable 3D representation for such tasks remains a difficult problem. On the one hand, volumetric approaches such as binary voxel grids or distance functions have the advantage that convolutional neural networks can readily be applied, but including a third dimension increases the memory re- quirements and limits the resolutions that can be handled. On the other hand, point-based techniques provide a more parsimonious shape representation, and recently there has been much progress in generalizing convolutional networks to such irregularly sampled data. However, most generative techniques for 3D point clouds involve fully connected layers that limit the number of points and level of shape detail that can be obtained [21–23]. In this chapter, we propose to use a shape representation that is based on multi-view depth maps for shape completion. The representation consists of a fixed number of depth images taken from a set of pre-determined viewpoints. Each pixel is a 3D point, and the union of points over all depth images yields the 3D point cloud of a shape. This has 11 the advantage that we can use several recent advances in neural networks that operate on images, like U-Net [24] and 2D convolutional networks. In addition, the number of points is not fixed and the point density can easily be increased by using higher resolution depth images, or more viewpoints. Here we leverage this representation for shape completion. Our key idea to perform shape completion is to render multiple depth images of an incomplete shape from a set of pre-defined viewpoints, and then to complete each depth map using image-to-image translation networks. To improve the completion accuracy, we further propose a novel multi-view completion net (MVCN) that leverages information from all depth views of a 3D shape to achieve high accuracy for single depth view completion. In summary, our contributions are as follows: • We propose a strategy to address shape completion by re-rendering multi-view depth maps to represent the incomplete shape, and performing image translation of these rendered views. • We introduce a multi-view completion architecture that leverages information from all rendered views and outperforms separate depth image completion for each view. • We demonstrate the efficacy of our method on shape completion problems, which is able to generate denser shapes than previous point cloud based methods. 12 Figure 2.1: Overview of our approach. (a) We render 8 depth maps of an incomplete shape (shown in red) from 8 viewpoints on the corners of a cube; (b) These rendered 8 depth maps are passed through a multi-view completion net including an adversarial loss, which generates 8 completed depth maps; (c) We back-project the 8 depth maps into a completed 3D model. 2.2 Related Work Deep Learning on 3D Shapes. Pioneering work on deep learning for 3D shapes has relied on volumetric representations [25, 26], which allow the straightforward application of convolutional neural networks. 
To avoid the computation and memory costs of 3D convolutions and 3D voxel grids, multi-view convolutional neural networks have also been proposed early for shape analysis [27, 28] such as recognition and classification. But these techniques cannot address shape completion. In addition to volumetric and multi-view representations, point clouds have also been popular for deep learning on 3D shapes. Groundbreaking work in this area includes PointNet and its extension [29, 30]. 3D Shape Completion. Shape completion can be performed using volumetric grids, as proposed by Dai et al. [31] and Han et al. [32], which are convenient for CNNs, like 3D-Encoder-Predictor CNNs for [31] and encoder-decoder CNN for patch-level geome- try refinement in [32]. However, when represented with volumetric grids, data size grows 13 cubically as the size of the space increases, which severely limits resolution and applica- tion. To address this problem, point based shape completion methods were presented, like [21, 23, 33]. The point completion network (PCN) [33] is the state-of-the-art approach that extends the PointNet architecture [29] to provide an encoder, followed by a multi- stage decoder that uses both fully connected [21] and folding layers [23]. They show that their decoder leads to better results than using a fully connected [21] or folding based [23] decoder separately. However, for these voxels or points based shape completion methods, the numbers of input and output voxels or points are still fixed. For example, the in- put should be voxelized on a 323 grid [32] and the output point cloud size is 2048 [23], however, which can lead to loss of detail in many scenarios. 3D Reconstruction from Images. The problem of 3D shape reconstruction from single RGB images shares similarities with 3D shape completion, but is arguably even harder. While a complete survey of these techniques is beyond the scope of this chapter, our work shares some similarities with the approach by Lin et al. [34]. They use a multi- view depth map representation for shape reconstruction from single RGB images using a differentiable renderer. In contrast to their technique, we address shape completion, and our approach allows us to solve the problem directly using image-to-image translation. Soltani et al. [35] do shape synthesis and reconstruction from multi-view depth images, which are generated by a variational autoencoder [36]. However, they do not consider the relations between the multi-view depth images of the same model in their generative net. Image Translation and Completion. A key advantage of our approach is that it allows us to leverage powerful image-to-image translation architectures to address the shape com- pletion problem, including techniques based on generative adversarial networks (GAN) [14], 14 and U-Net structures [24]. Based on conditional GANs, image-to-image translation net- works can be applied on a variety of tasks [37]. Satoshi et al. [38] and Portenier et al. [39] propose to use conditional GANs for image completion or editing. However, each im- age is completed individually in their networks. We propose a network that can combine information from other related images to help the completion of one single image. 2.3 Method 2.3.1 Multi-view Representation As discussed above, high resolution completion is difficult to achieve by existing methods that operate on voxels or point clouds due to memory limitations or fully con- nected network structures. 
In contrast, multi-view representations of 3D shapes [27, 28] can circumvent these obstacles to achieve high-resolution and dense completion. As shown in Figure 2.1 (a), given an incomplete point cloud, our method starts from rendering 8 depth maps for this point cloud. Specifically, the renderings are generated by placing 8 virtual cameras at the 8 vertices of a cube enclosing the shape, all pointing towards the centroid of the shape. We also render 8 depth maps from the complete target point cloud, and then we use these image pairs to train our network.

With this multi-view representation, the shape completion problem can be formulated as image-to-image translation, i.e., translating an incomplete depth map to a corresponding complete depth map, for which we can take full advantage of several recent advances in net structures that operate successfully on images, like U-Net architectures and GANs. After the completion net shown in Figure 2.1(b), we get 8 completed depth maps in Figure 2.1(c), which can be back-projected into a completed point cloud.

2.3.2 Multi-view Depth Maps Completion

In the completion problem, we learn a mapping from an incomplete depth map x_i to a completed depth map G(x_i), where x_i is rendered from a partial 3D shape S, i ∈ [1, V]. We render V views for each shape and expect to complete each depth map x_i of S as similar as possible to the corresponding depth map y_i of the ground truth 3D shape S.

Although completing each of the V depth maps of a 3D shape separately would be straightforward, there are two disadvantages. First, we cannot encourage consistency among the completed depth maps from the same 3D shape, which affects the accuracy of the resulting 3D shapes obtained by back-projecting the completed depth maps. Second, we cannot leverage information from other depth maps of the same 3D shape while completing one single depth map. This limits the accuracy of completing a single depth image, since views of the same 3D model share some common information that could be exploited, like global shape and local parts as seen from different viewpoints.

To resolve these issues, we propose a multi-view completion net (MVCN) architecture to complete one single depth image by jointly considering the global 3D shape information. In order to complete a depth image x_i as similar as possible to the ground truth y_i in terms of both low-frequency correctness and high-frequency structure, MVCN is designed based on a conditional GAN [14], which is formed by an image-to-image net G and a discriminator D. In addition, we introduce a shape descriptor d for each 3D shape S to contribute to the completion of each depth image x_i from S, where d holds global information of shape S. The shape descriptor d is learned along with the other parameters in MVCN, and it is updated dynamically with the completion of all the depth images x_i of shape S.

2.3.3 MVCN Architecture

We use a U-Net based structure [24] as our image-to-image net G, which has shown its effectiveness over encoder-decoder networks in many tasks including image-to-image translation [37]. Including our shape descriptor, we propose an end-to-end architecture as illustrated in Figure 2.2.

Figure 2.2: Architecture of MVCN. The shape descriptor represents the information of a 3D shape, which contributes to the completion of each single depth map from the 3D shape. We adopt the generator and discriminator architecture of [37].

MVCN consists of 8 U-Net modules with an input resolution of 256 × 256, and each U-Net module has two submodules, DOWN and UP. DOWN (e.g., D3) consists of Convolution-BatchNorm-ReLU layers [40, 41], and UP (e.g., U3) consists of UpReLU-UpConv-UpNorm layers. More details can be found in [37].

In MVCN, DOWN modules are used to extract a view feature f_i of each depth image x_i. For each 3D shape S, we learn a shape descriptor d by aggregating all V view features f_i through a view-pooling layer. Since not all the features are necessary to represent the global shape, we use max pooling to extract the maximum activation in each dimension of all f_i to form the shape descriptor, as illustrated in Figure 2.2.

In addition, the shape descriptor d is applied to contribute to the completion of each depth image x_i. Specifically, for an input x_i we employ the output of DOWN module D3 as the view feature f_i, and insert the view-pooling layer after D3. For each shape S we use a shape memory to store all its V view features f_i as shown in Figure 2.2. When we get f_i, we first use it to update the corresponding feature map in the shape memory. For example, if i = 3, the third feature map in the shape memory will be replaced with f_3. Then we obtain the shape descriptor of S in the current iteration by a view-pooling layer (max pooling over all feature maps in the shape memory of S). This strategy dynamically keeps the best view features across all training iterations, as illustrated in Figure 2.2. Subsequently, we use the shape descriptor d to contribute to the completion of depth map x_i by concatenating d with the view feature f_i as the input of module D2. This concatenated feature is also forwarded to module U3 via a skip connection.
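To make the view-pooling step concrete, the following is a minimal sketch of how a shape memory and a max-pooled shape descriptor could be maintained and injected into the per-view completion pass. It assumes PyTorch; the module handles (down3, down2, decoder), tensor shapes, and helper names are illustrative placeholders rather than the actual MVCN implementation.

```python
import torch

class ShapeMemory:
    """Stores the most recent D3 feature map for each of the V views of one shape.
    (Sketch only; MVCN keeps such a memory per training shape and refreshes it
    every time a view of that shape is processed.)"""

    def __init__(self, num_views, channels, height, width):
        self.feats = torch.zeros(num_views, channels, height, width)

    def update(self, view_idx, feat):
        # Replace this view's slot with its newest feature map.
        self.feats[view_idx] = feat.detach()

    def shape_descriptor(self):
        # View-pooling: element-wise max over all stored view features.
        return self.feats.max(dim=0).values


def complete_one_view(down3, down2, decoder, memory, x_i, view_idx):
    """One per-view pass of the idea: extract the view feature, refresh the shape
    memory, pool the shape descriptor, and feed the concatenation onward
    (in MVCN the fused feature also reaches U3 via a skip connection)."""
    f_i = down3(x_i)                              # view feature, shape (1, C, H, W)
    memory.update(view_idx, f_i[0])               # gradients through d omitted for simplicity
    d = memory.shape_descriptor().unsqueeze(0)    # pooled descriptor, (1, C, H, W)
    fused = torch.cat([f_i, d], dim=1)            # input to D2
    return decoder(down2(fused))
```

Because the memory always holds the most recent feature of every view, the element-wise maximum plays the same role as the view-pooling used in multi-view recognition networks, but is refreshed dynamically over the training iterations.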
2.3.4 Loss Function

The objective of our conditional GAN is similar to image-to-image translation [37],

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))].   (2.1)

In our completion problem, we expect the completion net G not only to deceive the discriminator but also to produce a completion result near the ground truth. Hence we combine the GAN objective with a traditional pixel-wise loss, such as the L1 or L2 distance, which is consistent with previous approaches [37, 42]. Since L1 is less prone to blurring than L2, and since, considering Eq. 2.4, there is a linear mapping from a pixel in a depth image to a 3D point, we want to push the generated image to be near the ground truth in the L1 sense rather than L2. Therefore, the loss of the completion net is

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\| y - G(x) \|_1].   (2.2)

Our final objective in training is then

G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),   (2.3)

where \lambda is a balance parameter that controls the contributions of the two terms.

2.3.5 Optimization and Inference

Unlike some approaches that focus on image generation [43], our method does not generate images from noise, which also makes our training stable, as mentioned in [38]. Similar to [37], we only provide noise in the form of dropout in our network. To optimize our net, we follow the standard approach [14, 37]. The training of D and G is alternated: one gradient descent step on D, then one step on G. Minibatch SGD and the Adam solver [44] are applied, with a learning rate of 6e-4 for G and 6e-6 for D, which slows down the rate at which D learns relative to G. Momentum parameters are β1 = 0.5, β2 = 0.999, and the batch size is 32.
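As a concrete illustration of the objective in Eq. 2.3 and the alternating updates described above, here is a schematic training step. It assumes PyTorch and a binary cross-entropy form of the cGAN terms; the generator G, discriminator D, optimizers, and tensors are placeholders, so this is a sketch of the training logic under those assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

# Optimizers following the settings above (illustrative):
# opt_G = torch.optim.Adam(G.parameters(), lr=6e-4, betas=(0.5, 0.999))
# opt_D = torch.optim.Adam(D.parameters(), lr=6e-6, betas=(0.5, 0.999))

def train_step(G, D, opt_G, opt_D, x, y, lam=1.0):
    """One alternating update: first the discriminator, then the generator.
    x: incomplete depth maps, y: ground-truth depth maps, shape (N, 1, 256, 256)."""
    # Discriminator step: maximize log D(x, y) + log(1 - D(x, G(x))).
    with torch.no_grad():
        fake = G(x)
    d_real, d_fake = D(x, y), D(x, fake)
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D while staying close to the ground truth in L1 (Eq. 2.3).
    fake = G(x)
    d_fake = D(x, fake)
    loss_G = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)) +
              lam * F.l1_loss(fake, y))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```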
Then we run the net again for the second time to complete each view leveraging the learned shape descriptor. Our final target is 3D shape completion. Given a generated depth image G(xi), for each pixel p at location (xp, yp) with depth value dp, we can back-project p to a 3D point P through an inverse perspective transformation, P = R−1(K−1[xp, yp, dp] T − t), (2.4) where K, R, and t are the camera intrinsic matrix, rotation matrix, and translation vector respectively. Note that K, R, and t are always known since these are the parameters of the 8 virtual cameras placed on the corners of a cube. The final shape is the union of the completed, back-projected point clouds from all 8 virtual views. 2.4 Experiments In this section, we first describe the creation of a multi-category dataset to train our model, and then we illustrate the effectiveness of our method and the improvement of MCVN over a single view completion net (VCN) used as a baseline, where each view is completed individually without shape descriptor. Finally, we analyze the performance of our method, and make comparisons with existing methods. By default, we conduct the training of MVCN under the MVCN-Airplane600 (trained with the first 600 shapes of 20 airplane in ShapeNet [45]), and test it under the same 150 models involved in [33]). 2.4.1 Data Generation and Evaluation Metrics We use synthetic CAD models from ShapeNet to create a dataset to train our model. Specifically, we take models from 8 categories: airplane, cabinet, car, chair, lamp, sofa, table, and vessel. Our inputs are partial point clouds. For each model, we extract one partial point cloud by back-projecting a 2.5D depth map (from a random viewpoint) into 3D, and render this partial point cloud into V = 8 depth maps of resolution 256 × 256 as training samples. The reason why we use back-projected depth maps as partial point clouds instead of subsets of the complete point cloud is that our training samples are closer to real-world sensor data in this way. In addition, similar to other works, we choose to use a synthetic dataset to generate training data because it contains detailed 3D shapes, which are not available in real-world datasets. In the same way, we also render V = 8 depth maps from the ground truth point clouds as the ground truth depth maps. Similar to [33], here we also use the symmetric version of Chamfer Distance (CD) [22] to calculate the average closest point distance between the target shape and the generated shape. 2.4.2 Analysis of the Objective Function We conduct ablation studies to justify the effectiveness of our objective function for the completion problem. Table 2.1(a) shows the quantitative effects of these variations, and Figure 2.3 shows the qualitative effects. The cGAN alone (bottom left, setting λ = 0 21 in Eq. 2.3) gives very noisy results. L2+cGAN (bottom middle) leads to reasonable but blurry results. L1 alone (top right) also produces reasonable results, but we can find some visual defects, like some unfilled holes as marked, which makes the final CD distance higher than that of L1+cGAN. These visual defects can be reduced when including both L1 and cGAN in the loss function (bottom right). As shown by the example in Figure 2.5, the combination of L1 and cGAN can complete the depth images with high accuracy. We further explore the importance of the two components of the objective function for point cloud completion by using different weights (λ in Eq. 2.3) of the L1 loss. 
In Table 2.1(b), the best completion result is achieved when λ = 1. We set λ = 1 in our experiments. Figure 2.3: Completion results under different losses. Loss Avg CD cGAN 10.729 L1 5.672 L2 + cGAN 6.467 L1 + cGAN 5.512 (a) λ in Eq. 2.3 Avg CD λ = 50 5.748 λ = 10 5.665 λ = 1 5.512 λ = 0.5 5.541 (b) Table 2.1: Analysis of the objective function: average CD for different losses (a), and different λ (b). Numbers are multiplied by 1000 22 2.4.3 Analysis of the View-pooling Layer Pooling methods. We also study different view-pooling methods to construct the shape descriptor, including element-wise max-pooling and mean-pooling. According to our ex- periments, mean-pooling is not as effective as max-pooling to extract the shape descriptor for image completion, which is similar to the recognition problem [28]. The average CD is 0.005926 for mean-pooling, but that of max-pooling is 0.005512, so max-pooling is used. Position Avg L1 distance Avg CD D2 3.377 5.512 D1 3.433 5.604 D0 3.501 5.919 Code 3.477 5.836 Table 2.2: Completion results for different positions of view-pooling layer. CD values are multiplied by 1000. Position of the view-pooling layer. Here we insert the view-pooling layer into different positions to extract the shape descriptor and further evaluate its effectiveness, including D2, D1, and D0, which are marked in Figure 2.2. Intuitively, the shape descriptor would have the biggest impact on the original network if we place the view-pooling layer before D2, and the experimental results illustrate this in Table 2.2, where both average L1 dis- tance and CD are the lowest. We also try to do view pooling after D0 and concatenate the shape descriptor with the latent code (marked in purple in Figure 2.2) and then pass them through a fully connected layer, but experiments show that the shape descriptor will be ignored since both the average L1 distance and CD do not decrease compared with single view completion net (average L1 distance is 3.473643 and CD is 0.005839 in Table 2.5). 23 Model Name Avg L1 Distance MVCN-V3 3.794 MVCN-V8-3 3.617 MVCN-V5 3.564 MVCN-V8-5 3.398 Table 2.3: Average L1 distance for different numbers of views in view-pooling. Figure 2.4: Completion results for different numbers of views in view-pooling. Number of views in view-pooling. We also analyze the effect of the number of views used in view-pooling. In Table 2.3, MVCN-V3 was trained with 3 depth images (No.1, 3, 5) of the 8 depth images of each 3D model, and MVCN-V5 was trained with 5 depth images (No. 1, 3, 5, 6, 8). MVCN-V8-3 and MVCN-V8-5 were trained with all the 8 depth images, but were tested with 3 views and 5 views respectively. In order to make fair comparisons, we took the 1st, 3rd, and 5th view images to test MVCN-V8-3 and MVCN-V3, and 1st, 3rd, 5th, 6th, 8th to test MVCN-V8-5 and MVCN-V5. The results show that the completion of one single view will be better when we increase the number of views, which means other views are helpful for the completion of one single view, and the more the views, the higher the completion accuracy. Figure 2.4 shows an example of the completion. As we increase the number of views in view-pooling, the completion results are improved. 24 Figure 2.5: An example of the completion of sofa. The 1st row: incomplete point cloud and 8 depth maps of it; The 2nd row: generated point cloud and related 8 depth maps; The 3rd row: ground truth point cloud and its 8 depth maps. Figure 2.6: Visual comparison between VCN and MVCN. 
Starting from the partial point cloud in the first row, VCN and MVCN perform completions of depth maps in the second and third row, respectively, where the completed point clouds are also shown. We use colormaps (from blue to green to red) to highlight the pixels with errors larger than 10 in terms of L1 distance. Ground truth data is in the last row. MVCN achieves a lower L1 distance on all 8 depth maps.

2.4.4 Improvements over Single View Completion

Pervasive improvements on L1 distance and CD. From Table 2.5, we find significant and pervasive improvements over the single view completion net (VCN) on both average L1 distance and CD on different categories. Networks in Table 2.5 were trained with 600 3D models for airplane, 1600 for lamp, and 1000 for the other categories. We use 150 3D models from each category to evaluate our network, the same test dataset as in [33]. We further conduct a visual comparison with VCN in Figure 2.6, where we can see that MVCN achieves higher completion accuracy with the help of the shape descriptor.

Table 2.4: Improvements while increasing training samples. CD values are multiplied by 1000.
Model               Avg L1 Distance   Avg CD
MVCN-Airplane600    3.377             5.512
MVCN-Airplane1200   3.156             5.273
MVCN-Lamp1000       6.661             12.012
MVCN-Lamp1600       6.246             10.576
VCN-Lamp1000        6.763             12.091
VCN-Lamp1600        6.430             12.007

Better generalization capability. Table 2.4 shows that we can improve the performance of VCN and MVCN by increasing the number of training samples. We find that the performance differences between MVCN-Lamp1000 and VCN-Lamp1000 are not obvious. The reason is that there are relatively large individual differences among lamp models in ShapeNet, and the completion results are poor on several unusual lamp models in the test set. For these models, the comparisons between VCN and MVCN are less meaningful, so the improvement is not obvious. But this can be solved when we add another 600 training samples: MVCN-Lamp1600 has a bigger improvement than VCN-Lamp1600 on average L1 distance and CD, which indicates a better generalization capability of MVCN.

2.4.5 Comparisons with the State-of-the-art

Baselines. Some previous completion methods need prior knowledge of the shape [46], or assume more complete inputs [47], so they are not directly comparable to our method.

Table 2.5: Quantitative comparisons between VCN and MVCN.
Average L1 Distance
Model   Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa    Table   Vessel
VCN     5.431   3.474      4.305     3.859   7.645   6.430   5.717   7.573   4.446
MVCN    5.102   3.377      3.991     3.610   7.143   6.245   5.285   7.156   4.013
Mean Chamfer Distance per point (multiplied by 1000)
Model   Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa    Table   Vessel
VCN     8.800   5.839      7.297     6.589   10.398  12.007  9.565   9.371   9.334
MVCN    8.328   5.512      7.154     6.322   10.077  10.576  9.174   9.020   8.790

Here we compare MVCN with several strong baselines. PCN-CD [33], the point completion net trained with CD as the loss function, was the state of the art when this work was developed. PCN-EMD uses the Earth Mover's Distance (EMD) [22] as the loss function, but it is intractable for dense completion due to the computational complexity of EMD. The encoders of FC [21] and Folding [23] are the same as in PCN-CD, but the decoders are different: a 3-layer fully-connected network for FC, and a folding-based layer for Folding. PN2 uses the same decoder, but the encoder is PointNet++ [30]. 3D-EPN [31] is a representative of the class of volumetric completion methods.
For a fair comparison, the distance field outputs of 3D-EPN are converted into point clouds as mentioned in [33]. TopNet [48] is a recent point-based method, but it can only generate sparse point clouds because its decoder mostly consists of multilayer perceptron networks, which limits the number of points it can process.

Comparisons. As done in [33], we use the symmetric version of CD to calculate the average closest point distance, where the ground truth point clouds and generated point clouds are not required to be the same size, which is different from EMD [22]. For point-based methods like PCN [33], the input is sampled and the output size is fixed. Different from these methods, the number of output points of our approach is not fixed, which would require resampling our output to compute the EMD. CD is therefore more suitable for a fair comparison among different techniques.

Table 2.6: Comparison with the existing methods on mean CD (multiplied by 100) over multiple categories.
Mean Chamfer Distance per point
Model      Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa     Table   Vessel
3D-EPN     2.014   1.316      2.180     2.031   1.881   2.575   2.109    2.172   1.854
FC         0.980   0.570      1.102     0.878   1.097   1.113   1.176    0.932   0.972
Folding    1.007   0.597      1.083     0.927   1.125   1.217   1.163    0.945   1.003
PN2        1.400   1.030      1.474     1.219   1.578   1.762   1.618    1.168   1.352
PCN-CD     0.964   0.550      1.063     0.870   1.100   1.134   1.168    0.859   0.967
PCN-EMD    1.002   0.585      1.069     0.908   1.158   1.196   1.2206   0.901   0.9789
MVCN       0.830   0.527      0.715     0.632   1.008   1.058   0.917    0.902   0.879

Figure 2.7: Comparison between MVCN and PCN-CD.

Table 2.6 lists the quantitative results, where the completion results of the other methods are from [33]. Our method achieves the lowest CD across almost all object categories. A more detailed comparison with PCN-CD is shown in Figure 2.7, where the height of the blue bar indicates the amount of improvement of our method over PCN-CD on each object. Our model outperforms PCN on most objects.

Figure 2.9 shows the qualitative results. Our completions are denser, and we recover more details in the results. Another obvious advantage is that our method can complete shapes with complex geometry, like the second to the fourth objects, while other methods fail to recover these shapes. Note that our method is category-specific, which requires a classification step in data preprocessing before shape completion on multiple categories. However, in Chapter 4, we show that the multi-view representation can also be used to train a single network for 3D reconstructions on multiple categories.

2.4.6 Completion Results on KITTI

Our goal is to obtain high-quality and high-resolution shape completion from data similar to individual range scans focused on individual objects. Hence we obtain incomplete data using synthetic depth images, which is similar to data from RGB-D cameras. However, for data like KITTI, which is extremely sparse and does not contain ground truth, the usual objective is to obtain a rough rather than a high-resolution completion. Our method performs reasonably well on KITTI data, as shown in Figure 2.8.

Figure 2.8: Completion results on KITTI.

2.5 Conclusion

We have presented a method for shape completion by rendering multi-view depth maps of incomplete shapes, and then performing image completion of these rendered views. Experiments show that our view-based representation and novel network structure can achieve better results with fewer training samples, perform better on objects with complex geometry, and generate higher resolution results than previous methods.
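As a practical complement to Eq. 2.4 and the Chamfer Distance used in the experiments above (and to the fusion procedure detailed in Appendix 2.A), the sketch below shows how a completed depth map can be back-projected into points and how a symmetric CD can be computed. It is a NumPy illustration that assumes the usual homogeneous-pixel convention (pixel coordinates scaled by depth before applying K^{-1}) and one common CD normalization; the exact conventions of [22, 33] may differ.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Back-project a depth map into a 3D point cloud in the spirit of Eq. 2.4,
    P = R^-1 (K^-1 [x_p, y_p, d_p]^T - t), with pixel coordinates pre-multiplied by
    depth (an assumption about the camera convention). Zero-depth pixels are skipped.
    depth: (H, W); K, R: (3, 3); t: (3,)."""
    ys, xs = np.nonzero(depth)                   # pixels that carry geometry
    d = depth[ys, xs]
    pix = np.stack([xs * d, ys * d, d], axis=0)  # 3 x N homogeneous pixel coordinates
    cam = np.linalg.inv(K) @ pix                 # camera-space points at the given depths
    world = np.linalg.inv(R) @ (cam - t.reshape(3, 1))
    return world.T                               # N x 3 points

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance (average closest-point distance) between point sets
    P (N, 3) and Q (M, 3). Brute force, intended only for evaluation-sized clouds."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # N x M squared distances
    return 0.5 * (np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())
```

The final completed shape is then the union of the per-view back-projections; Appendix 2.A additionally filters the fused points by cross-view voting and radius outlier removal.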
Figure 2.9: Qualitative completion on ShapeNet, where MVCN can complete complex shapes with high resolution.

2.A Additional Experimental Details

In this section, we provide additional experimental results and technical details of the proposed method.

2.A.1 Completion Results on Noisy, Sparse and Occluded Point Clouds

Since there is no ground truth on KITTI, we also conduct experiments to evaluate the performance of our method on noisy, sparse and occluded inputs in Figure 2.10. For the ground truth point cloud (sofa), we render a depth image from a random viewpoint, whose back-projection is labeled 'Original Input', and then perturb the depth map with Gaussian noise whose standard deviation is η times the scale of the depth measurements. We then randomly subsample the input point cloud with a factor µ. Besides self-occlusion, we also consider that the target may be occluded by other objects in the wild; in Figure 2.10, 'Occ' in the 2nd and 5th columns means that we further remove 10% of the input points. Note that our model is not trained with these noisy, sparse, and occluded examples, but it is still robust to them.

Figure 2.10: Completion results on noisy, sparse and occluded inputs.

2.A.2 Analysis of the Number of Views in View-pooling

We further show the improvements in L1 distance for all view images of the test dataset in Figure 2.11. The x-axis represents different view images. Note that the same x represents different view images for 'V8 vs V3' and 'V8 vs V5': since the test dataset has 150 3D models, 450 view images are used to test 'V8 vs V3', and 750 view images are used to test 'V8 vs V5'. The height of the blue bar indicates the amount of improvement of 8 views over 3, and the red bar indicates the improvement of 8 views over 5. Positive values mean the L1 distance is lower when using 8 views. Since the training dataset is relatively small (600 3D models for training and 150 for testing), our network performs poorly on several unusual models in the test dataset, which fall on the boundary in Figure 2.11. Comparisons on these boundary instances are not meaningful. Apart from these, for most view images we decrease the L1 distance by increasing the number of views in view-pooling. More views mean the shape descriptors are more helpful.

Figure 2.11: Improvements of 8 views over 3 and 5 views in view-pooling.

2.A.3 Failure Cases

While in general our methods perform well, we observe that our models fail to complete several challenging input depth maps, which do not provide enough information for inference. For example, Figure 2.12 shows two failed completions of lamps, where we cannot extract useful information from the depth inputs to infer the whole shape. These cases mostly occur for lamp objects due to their complex geometry and the large individual differences among lamp models. The reconstruction of lamps is also the most challenging task, as mentioned in [49].

Figure 2.12: Failed depth completions of lamps.

2.A.4 Rendering and Back-projecting Depth Maps

Render multi-view depth maps. First, for each 3D model, we move its center to the origin. Most models in modern online repositories, such as ShapeNet and the 3D Warehouse, satisfy the requirement that models are upright oriented along a consistent axis, and some previous completion or recognition methods also follow the same assumption [26, 33]. With this assumption, the center consists of the midpoints along the x, y, z axes.
Then, each model is uniformly scaled to fit into a consistent sphere (radius 0.2), where the scale factor is the maximum extent along the x, y, z axes divided by the radius. Finally, we render 8 depth maps for each partial point cloud as input, as mentioned in Section 2.3.1. In this way, all the shapes appear at the center of the depth images. We also render 8 depth maps of the complete target shape and use these image pairs to train our network.

Back-project multi-view depth maps into a point cloud. We fuse the generated depth maps into a completed point cloud and apply a voting algorithm to remove outliers. Specifically, we reproject each point of one view into the other 7 views, and if a point falls on the shape in another view, we add one vote for it. The initial vote count for each point is 1, and we set a vote threshold of 7 to decide whether the point is valid. Furthermore, a radius outlier removal step is used to remove noisy points that have few neighbors (fewer than 6) within a given sphere (radius 0.006) around them.

Chapter 3: Multi-view Consistency in Shape Completion

3.1 Introduction

Convolutional neural networks have proven highly successful at analysis and synthesis of visual data such as images and videos. This has spurred interest in applying convolutional network architectures also to 3D shapes, where a key challenge is to find suitable generalizations of discrete convolutions to the 3D domain. Popular techniques include discrete convolutions on 3D grids [26], graph convolutions on meshes [50], convolution-like operators on 3D point clouds [51, 52], and 2D convolutions on 2D shape parameterizations [53]. A simple approach in the last category is to represent shapes using multiple 2D projections, or multiple depth images, and apply 2D convolutions on these views. This has led to successful techniques for shape classification [28], single-view 3D reconstruction [54], shape completion [16], and shape synthesis [55]. One issue in these approaches, however, is to encourage consistency among the separate views and to avoid each view representing a slightly different object. This is not an issue in supervised training, where the loss encourages all views to match the ground truth shape. But at inference time, or in unsupervised training, ground truth is not available and a different mechanism is required to encourage consistency.

In this chapter, we address the problem of shape completion using a multi-view depth image representation, and we propose a multi-view consistency loss that is minimized during inference. We formulate inference as an energy minimization problem, where the energy is the sum of a data term given by a conditional generative net and a regularization term given by a geometric consistency loss. Our results show the benefits of optimizing geometric consistency in a multi-view shape representation during inference, and we demonstrate that our approach performs better on shape completion tasks. In summary, our contributions are as follows:

i) We propose a multi-view consistency loss for 3D shape completion, which encourages geometric consistency of the multi-view representation on novel data in the inference stage.

ii) We formulate multi-view consistent inference as an energy minimization problem including our consistency loss as a regularizer and a neural network-based data term.
iii) We show that the proposed multi-view consistency optimization can further refine the shape completion results of the multi-view representation introduced in Chapter 2 on different object categories, which demonstrates the benefits of the consistent inference technique in practice.

3.2 Related Work

Multi-view Consistency. One problem of view-based representations is inconsistency among multiple views. Some researchers have introduced a multi-view loss to train their networks to achieve consistency in multi-view representations, for example for discovering 3D keypoints [56] and reconstructing 3D objects from images [34, 57–60]. With differentiable rendering [34, 58], the consistency distances among different views can be leveraged as 2D supervision to learn 3D shapes in their networks. However, these methods can only guarantee consistency for training data during the training stage. Different from these methods, with the help of our novel energy optimization and consistency loss implementation, our proposed method can improve geometric consistency on test data directly during the inference stage.

Figure 3.1: Overview of the multi-view consistent inference for 3D shape completion. Given a partial point cloud as input, we first render multiple incomplete views X, which form our shape representation of the incomplete input. To perform inference, we apply a conditional generative network G to generate completed depth images V based on a shape descriptor z conditioned on X. As a key idea, we design our consistency loss C to evaluate the geometric consistency among V. Intuitively, for all pixels in all views V_t we find the distance to their approximate closest neighbor in the other views V_s, and sum up these distances to form C. Specifically, for each target view (e.g., V_7 in the figure) we reproject all completed depth images V_s according to the pose of V_7, which leads to reprojection maps denoted R_s^7. Then we compute consistency distances, denoted D_s^7, for each reprojection map R_s^7 and the target V_7 via a pixel-wise closest point pooling operation. Finally, a consistency pooling operator aggregates all consistency distances D_s^7 into a loss map M^7. In inference, we minimize all loss maps as a function of the shape descriptor z.

Figure 3.2: Network structure.

Figure 3.3: Methods to calculate consistency distance.

3.3 Multi-view Consistent Inference

Overview. The goal of our method is to guarantee multi-view consistency in inference, as shown in the overview in Figure 3.1. Our method starts by converting partial point clouds to multi-view depth image representations, rendering the points into a set of incomplete depth images X = {X_1, ..., X_8} from a number of fixed viewpoints. In our current implementation, we use eight viewpoints placed on the corners of a cube.
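As an illustration of this camera setup, the sketch below places eight cameras on the corners of a cube and builds world-to-camera look-at transforms pointing at the object center. It is only a hedged sketch under assumed conventions (unit distance to the origin, a fixed up vector, row-vector camera axes), not the rendering code of our implementation.

import numpy as np
from itertools import product

def cube_corner_cameras(distance=1.0):
    # Place one camera on each of the 8 cube corners, looking at the origin.
    cameras = []
    for corner in product([-1.0, 1.0], repeat=3):
        center = distance * np.array(corner) / np.sqrt(3.0)   # camera center at the chosen distance
        forward = -center / np.linalg.norm(center)            # viewing direction toward the origin
        up = np.array([0.0, 1.0, 0.0])                        # assumed world up vector
        right = np.cross(up, forward)
        right /= np.linalg.norm(right)
        true_up = np.cross(forward, right)
        R = np.stack([right, true_up, forward])                # rows are the camera axes in world frame
        t = -R @ center                                        # so that x_cam = R @ x_world + t
        cameras.append((R, t))
    return cameras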
Our approach builds on a conditional generative net G(z; X) which is trained to output completed depth images V by estimating a shape descriptor z conditioned on a set of incomplete inputs X. We obtain the conditional generative net in a separate, supervised training stage. During inference, we keep the network weights fixed and optimize the shape descriptor z to minimize an energy consisting of a consistency loss, which acts as a regularizer, and a data term. On the one hand, the consistency loss C(V) = C(G(z; X)) quantifies the geometric consistency among the completed depth images V. On the other hand, the data term encourages the solution to stay close to an initially estimated shape descriptor z̊. This leads to the following optimization for the desired shape descriptor z*:

z* = argmin_z C(G(z; X)) + µ ||G(z; X) − G(z̊; X)|| = L_con(z) + µ L_gen(z),    (3.1)

where µ is a weighting factor, and we denote Y = G(z̊; X) and V = G(z; X) as the initially estimated and the optimized completed depth images in inference, respectively. In addition, we formulate the regularization term and the data term as the multi-view consistency loss L_con(z) and the generator loss L_gen(z) in Section 3.4.

Conditional generative net. The conditional generative net G(z; X) is built on the structure of the multi-view completion net [16], as shown in Figure 3.2, which is an image-to-image translation architecture applied to perform depth image completion for multiple views of the same shape. We train the conditional generative net following a standard conditional GAN approach [14]. To share information between multiple depth images of the same shape, our architecture learns a shape descriptor z for each 3D object by pooling a so-called shape memory consisting of N feature maps f_n, n ∈ {1, ..., N}, with N = 8, from all views of the shape. The network G consists of 8 U-Net modules, and each U-Net module has two submodules, Down and Up, so there are 8 Down submodules (D7, ..., D0) in the encoder and 8 Up submodules (U0, ..., U7) in the decoder. Down submodules have the form Convolution-BatchNorm-ReLU [40, 41], and Up submodules have the form UpReLU-UpConv-UpNorm. The shape memory is the feature map after the third Down submodule (D3) of the encoder. More details can be found in [16, 37]. In inference, we optimize the shape descriptor z of G(z; X) given the test input X. We first get an initial estimate of the shape descriptor z̊ for each test shape by running the trained model once, and initialize z with z̊. During inference, the other parameters of G are fixed.

3.4 Consistency Loss

Our consistency loss is based on the sum of the distances between each pixel in the multi-view depth maps and its approximate nearest neighbor in any of the other views. In this section we introduce the details of the multi-view consistency loss calculation, following the overview in Figure 3.1. For all views V_t, we first calculate pairwise per-pixel consistency distances D_s^t to each other view V_s, that is, per-pixel distances to approximate nearest neighbors in view V_s. We then perform consistency pooling, which for each view V_t provides the consistency distances over all other views (as opposed to the initial pairwise consistency distances between two views). We call these the loss maps M^t. The final consistency loss is the sum over all loss maps.
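The procedure above can be summarized in a short PyTorch sketch. This is a hedged illustration, not the original implementation: the tensor layout of the distance maps, the names G, consistency_fn, mu, steps, and lr, the choice of optimizer, and the use of a mean L1 norm for the data term of Eq. (3.1) are all assumptions. Consistency pooling is realized as a per-pixel minimum over the pairwise distance maps D_s^t, the loss maps M^t are summed into the scalar consistency loss, and the energy is minimized over the shape descriptor z while the network weights stay fixed.

import torch

def consistency_loss_from_distances(D):
    # D: (T, S, H, W) tensor, where D[t, s] is the per-pixel consistency distance map
    # from source view s to target view t, with D[t, t] set to +inf so that a view is
    # never compared against itself.
    M = D.min(dim=1).values        # consistency pooling: loss maps M^t, shape (T, H, W)
    return M.sum()                 # final consistency loss: sum over all loss maps

def optimize_shape_descriptor(G, X, z_init, consistency_fn, mu=1.0, steps=50, lr=1e-2):
    # G: trained conditional generative net with frozen weights.
    # consistency_fn: maps completed views V to the scalar consistency loss C(V)
    # (view-reprojection, closest point pooling, and consistency pooling).
    with torch.no_grad():
        Y = G(z_init, X)                                  # initial completion G(z̊; X)
    z = z_init.detach().clone().requires_grad_(True)      # only the shape descriptor is optimized
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        V = G(z, X)
        loss = consistency_fn(V) + mu * (V - Y).abs().mean()  # L_con(z) + mu * L_gen(z)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

In this form, mu and the number of optimization steps control how far the refined views are allowed to move away from the initial completion.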
3.4.1 Pairwise Consistency Distances

Given a source view V_s and a target view V_t, we calculate the consistency distance D_s^t between V_s and V_t by view-reprojection and closest point pooling, where V_t, V_s ∈ R^{H×W} and H×W is the image resolution. Specifically, view-reprojection transforms the depth information of the source V_s into a reprojection map R_s^t according to the transformation matrix of the target V_t. Then, closest point pooling produces the consistency distance D_s^t between R_s^t and V_t. Figure 3.3 shows the pipeline, where the target view is V_7 and the source view is V_2. In the following, we denote a pixel on the source view as p_i = [u_i, v_i, d_i], where u_i and v_i are the pixel coordinates, its back-projected 3D point as P_i = [x̂_i, ŷ_i, ẑ_i], and the reprojected pixel on the reprojection map R_s^t as p'_i = [u'_i, v'_i, d'_i], where d_i = V_s[u_i, v_i] and d'_i = R_s^t[u'_i, v'_i] are the depth values at the locations [u_i, v_i] and [u'_i, v'_i], respectively.

View-reprojection. The view-reprojection operator back-projects each point p_i = [u_i, v_i, d_i] on V_s into the canonical 3D coordinates as P_i = [x̂_i, ŷ_i, ẑ_i] via

P_i = ℜ_s^{-1} (K^{-1} p_i − τ_s)    ∀i,    (3.2)

where K is the intrinsic camera matrix, and