ABSTRACT

Title of Dissertation: DENSE 3D RECONSTRUCTIONS FROM SPARSE VISUAL DATA
Tao Hu, Doctor of Philosophy, 2023
Dissertation Directed by: Professor Matthias Zwicker, Department of Computer Science

3D reconstruction, the problem of estimating the complete geometry or appearance of objects from partial observations (e.g., several RGB images, partial shapes, videos), serves as a building block in many vision, graphics, and robotics applications such as 3D scanning, autonomous driving, 3D modeling, augmented reality (AR) and virtual reality (VR). However, it is very challenging for machines to recover 3D geometry from such sparse data due to occlusions and the irregularity and complexity of 3D objects. To address these challenges, in this dissertation we explore learning-based 3D reconstruction methods for different 3D object representations on different tasks: 3D reconstructions of static objects and of the dynamic human body from limited data.

For the 3D reconstruction of static objects, we propose a multi-view representation of 3D shapes, which utilizes a set of multi-view RGB images or depth maps to represent a 3D shape. We first explore the multi-view representation for shape completion tasks and develop deep learning methods to generate dense and high-resolution point clouds from partial observations. Yet one problem with the multi-view representation is the inconsistency among different views. To solve this problem, we propose a multi-view consistency optimization strategy to encourage consistency for shape completion in the inference stage. Third, we present an extension of the multi-view representation for dense 3D geometry and texture reconstruction from single RGB images.

Capturing and rendering realistic human appearance under varying poses and viewpoints is an important goal in computer vision and graphics. In the second part, we introduce techniques to create 3D virtual human avatars from limited data (e.g., videos). We propose implicit representations of motion, texture, and geometry for human modeling, and utilize neural rendering techniques for free-viewpoint synthesis of dynamic articulated human bodies. Our learned human avatars are photorealistic and fully controllable (pose, shape, viewpoints, etc.), and can be used in free-viewpoint video generation, animation, shape editing, telepresence, and AR/VR.

Our proposed methods can learn end-to-end 3D reconstructions from 2D image or video signals. We hope these learning-based methods will assist in perceiving and reconstructing the 3D world for future AI systems.

DENSE 3D RECONSTRUCTIONS FROM SPARSE VISUAL DATA

by Tao Hu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Matthias Zwicker, Chair/Advisor
Professor Joseph F. JaJa
Professor John Yiannis Aloimonos
Professor Marine Carpuat
Professor Abhinav Shrivastava

© Copyright by Tao Hu 2023

Dedication

To my family.

Acknowledgments

First and foremost I would like to thank my advisor, Professor Matthias Zwicker, for his help and advice during my Ph.D. study at UMD. We have worked together on many research projects, and it would not have been possible to achieve these results without your support. Thank you! I would also like to thank my Ph.D. thesis committee members Matthias Zwicker, Joseph F. JaJa, John Yiannis Aloimonos, Marine Carpuat, and Abhinav Shrivastava for the time and effort they devoted to discussing my research.
I have learned so much from the valuable and critical feedback they have given on my research.

I am very fortunate to have worked with many incredible researchers through internships. I would like to acknowledge my mentors Prof. Christian Theobalt from Max Planck Institute for Informatics, Prof. Yebin Liu from Tsinghua University, Hongyi Xu and Linjie Luo from ByteDance, and Kai Chen from Microsoft Research Asia. Thanks for providing the internship opportunities; working with you has been a wonderful experience and a great source of inspiration. I also would like to acknowledge Weipeng Xu from Meta Reality Labs for offering an internship opportunity, though I could not join the team due to visa issues.

Furthermore, I would like to acknowledge my research collaborators: Gen Lin and Zhizhong Han at UMD, Kripasindhu Sarkar and Lingjie Liu at MPII, and Tao Yu, Zerong Zheng, and He Zhang at Tsinghua University. I am very lucky to have worked with you, and I thank you for your excellent collaborative effort.

I am deeply thankful for all the friends who have supported me outside my research life. I thank Guangyao Shi, Jing Xie, Bo He, Hao Chen, Hanyu Wang, Xingting Wang, Yong Yang, Ruofei Du, Hao Zhou, Jun Wang, Zhichao Liu, Yixuan Ren, Renkun Ni, and Yuesheng Ye in Maryland, Jian Wang and Yue Li at MPII, Caonan Ji, Zhen Fan, Yunxi Guo, Xiaodong Yang, and Baowei Jiang at Tsinghua University, and many more whose names I cannot list here. It has been a great time with you, and I wish you all the best in your future endeavors.

Finally, none of this would have been possible without the full, unconditional support of my family. I want to thank my parents and grandparents for their support and love in whatever path I chose to pursue. I also thank my sister and brother-in-law for all the understanding and encouragement throughout my Ph.D. life.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Learning 3D Reconstructions for Static Objects
  1.2 Learning 3D Human Avatars from Videos
  1.3 Contribution & Dissertation Organization

Part I: Learning Dense 3D Reconstructions for Static Objects

Chapter 2: Shape Completion based on Multi-view Representation
  2.1 Introduction
  2.2 Related Work
  2.3 Method
  2.4 Experiments
  2.5 Conclusion
  2.A Additional Experimental Details

Chapter 3: Multi-view Consistency in Shape Completion
  3.1 Introduction
  3.2 Related Work
  3.3 Multi-view Consistent Inference
  3.4 Consistency Loss
  3.5 Experiments
  3.6 Discussion and Conclusion

Chapter 4: Shape and Texture Reconstructions from Single RGB Images
  4.1 Introduction
  4.2 Related Work
  4.3 Method
  4.4 Experiments
  4.5 Discussion and Conclusion
  4.A Additional Experimental Details

Part II: Learning 3D Human Avatars from Videos

Chapter 5: HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars
  5.1 Introduction
  5.2 Related Work
  5.3 Method
  5.4 Experiments
  5.5 Discussion and Conclusion
  5.A Additional Experimental Results
  5.B Implementation Details

Chapter 6: HVTR++: Pose and Image Driven Human Avatars using HVTR
  6.1 Introduction
  6.2 Related Work
  6.3 Method
  6.4 Experiments
  6.5 Discussion and Conclusion
  6.A Additional Experimental Results
  6.B Implementation Details

Chapter 7: Rendering Human Avatars from Mobile Egocentric Fisheye Cameras
  7.1 Introduction
  7.2 Related Work
  7.3 Method
  7.4 Experiments
  7.5 Discussion and Conclusion

Part III: Conclusion and Future Work

Chapter 8: Conclusion & Future Work

Bibliography

List of Tables

2.1 Analysis of the objective function
2.2 Completion results for different positions of view-pooling layer
2.3 Average L1 distance for different numbers of views in view-pooling
2.4 Improvements while increasing training samples
2.5 Quantitative comparisons between VCN and MVCN
2.6 Comparison with the existing methods
3.1 Chamfer Distance over different loss functions
3.2 The effects of depth-buffer sizes
3.3 Mean Chamfer Distance over multiple categories in ShapeNet
4.1 CD on single-category task
4.2 Mean CD of partial shape
4.3 Mean CD of multiple-seen-category experiments on ShapeNet
4.4 Average CD of multiple-unseen-category experiments on ShapeNet
4.5 Average CD on both seen and unseen category on Pix3D dataset
4.6 Relative CD improvements after ICP
4.7 6D pose evaluations
4.8 F-score on ShapeNet and Pix3D
4.9 Comparisons with GenRe
4.10 Quantitative comparisons with AtlasNet
4.11 Average Chamfer Distance (CD) on Pix3D dataset
5.1 A set of human synthesis approaches classified by feature representations and renderers
5.2 Quantitative comparisons
5.3 Quantitative comparisons on M1 sequence
5.4 Accuracy and inference time
5.5 Ablation study of each component
5.6 Glossary table
5.7 Comparisons with AniNeRF
5.8 Quantitative results of each method
5.9 Comparisons of fusing volumetric and textural features
5.10 Quantitative comparisons
5.11 Performance and inference time
6.1 Quantitative comparisons on unseen sequences from R1
6.2 Quantitative comparisons on unseen sequences from R2-7
6.3 Quantitative comparisons on ZJU MoCap datasets
7.1 Quantitative comparisons of single-video dataset
7.2 Quantitative comparisons of multi-video training

List of Figures

1.1 Applications of 3D reconstruction
1.2 Partial visual data acquired in practical applications
1.3 Visual overview of the dissertation
2.1 Overview of our approach
2.2 Architecture of MVCN
2.3 Completion results under different losses
2.4 Completion results for different numbers of views in view-pooling
2.5 An example of the completion of sofa
2.6 Visual comparison between VCN and MVCN
2.7 Comparison between MVCN and PCN-CD
2.8 Completion results on KITTI
2.9 Qualitative completion on ShapeNet
2.10 Completion results on noisy, sparse and occluded inputs
2.11 Improvements of 8 views over 3 and 5 views in view-pooling
2.12 Failed depth completions of lamps
3.1 Overview of the multi-view consistent inference pipeline
3.2 Network structure
3.3 Methods to calculate consistency distance
3.4 Consistency pooling with respect to V7
3.5 All the eight loss maps of a 3D model
3.6 Consistency loss maps under different depth-buffer sizes
3.7 Comparisons with direct optimization
3.8 Consistent inference optimization
3.9 Completions on noisy inputs
3.10 Completion results given three different inputs
3.11 Improvements over MVCN
4.1 Approach overview
4.2 Render multiple views given partial shapes
4.3 Completions of texture and depth maps
4.4 Reconstructions on single-category task
4.5 Reconstruction results on unseen categories from ShapeNet
4.6 Reconstructions of the seen categories on ShapeNet dataset
4.7 Reconstructions on Pix3D
4.8 Two-stage Training
4.9 Ours v.s. AtlasNet [1]
4.10 Qualitative comparisons with GenRe
4.11 Reconstructions of car objects on ShapeNet dataset
5.1 Differences to existing neural rendering methods
5.2 Pipeline overview of HVTR
5.3 Qualitative results of our variants
5.4 Render skirts on novel poses
5.5 Geometry reconstructions of skirt
5.6 Shape editing
5.7 Comparisons with other rendering methods
5.8 Geometry-guided ray marching
5.9 Construct pose-conditioned NeRF
5.10 Qualitative results on unseen sequences
5.11 Ablation study of each component
5.12 Comparisons on shape editing
5.13 Novel view synthesis
5.14 Comparisons of backward skinning method
5.15 Ablation study of face identity loss
5.16 Normal prediction
6.1 Pipeline overview
6.2 Ablation study of driving views
6.3 Ablation study of normal loss and texture loss
6.4 Qualitative results
6.5 Qualitative results on unseen sequences
6.6 Ablation study of driving views
6.7 Qualitative results on novel view synthesis
7.1 Egocentric setup
7.2 Examples of our rendered fisheye training dataset
7.3 DensePose predictions
7.4 Pipeline overview
7.5 Implicit texture learning
7.6 Renderings produced by our methods
7.7 Comparisons with other methods
7.8 Comparisons to Textured Neural Avatars
7.9 Renderings of single- and multi-video training
7.10 Comparison with Only-MV
7.11 Performances in real applications
7.12 Comparisons with Only-MV on occluded parts
7.13 Renderings of different views in applications
7.14 Local and global coordinate

Chapter 1: Introduction

3D reconstruction is widely used in vision and graphics applications, including 3D scanning, autonomous driving, 3D modeling, augmented reality (AR), and virtual reality (VR). However, in many real applications (Figure 1.2), we can only acquire sparse or partial visual data, such as partial point clouds, single RGB images, or sparse-view video sequences. For example, autonomous driving systems need to reconstruct accurate 3D maps of the surrounding environment from sparse partial point clouds scanned by LiDAR sensors. It is also necessary to reconstruct 3D scenes from several RGB images captured by a mobile camera for robot navigation, Simultaneous Localization and Mapping (SLAM), and AR. Similarly, generating 3D virtual human avatars just from several multi-view videos also facilitates the creation of 3D content in AR and VR applications.

Human intelligence is flexible, and from partial observations like a single RGB image, we can effortlessly recognize the 3D geometry and texture of objects within the scene. However, it is very challenging for machines to recover 3D geometry from such sparse data due to occlusions and the irregularity and complexity of 3D objects.

Figure 1.1: 3D reconstruction is widely used in different vision and graphics applications, such as robot navigation, autonomous driving, telepresence, medical imaging, virtual and augmented reality.

For example, given several sparse-view images of an object or a scene, 3D reconstruction becomes an ill-posed problem since there can potentially be multiple solutions to the 3D structure occluded or invisible to the camera viewpoint. In this case, learning-based approaches can come to the rescue, especially when utilizing the power of deep neural networks. A straightforward solution is to employ a 3D prediction network to find the correlations between the partial input and 3D ground truth data (e.g., crafted CAD models), which, however, are very difficult to acquire in practical applications, especially for outdoor scenes or dynamic articulated human bodies. Compared to ground-truth 3D annotations, images and videos come in as more abundant sources of data to learn from, and hence it is of major interest to learn 3D reconstructions by utilizing images and videos as self-supervision signals.
A key step to achieve this is to learn to establish pixel correspondences between view images or video frames by utilizing differentiable or neural rendering techniques [2]. In this dissertation, we focus on learning 3D reconstructions without the use of ex- plicit 3D supervision. We investigate various differentiable and neural rendering methods 2 Figure 1.2: Partial visual data acquired in practical applications: (a) sparse point clouds scanned by LiDAR [3], (b) single RGB image [4], (c) sparse multi-view videos captured in a studio [5]. In this dissertation, we explore digital representations of 3D shapes and learning-based methods for 3D reconstructions from partial visual data. for different 3D object representations in different reconstruction tasks, such as dense point cloud completion from partial shapes, dense geometry and texture reconstruction from single RGB images, dense (i.e. free-viewpoint) view synthesis of dynamic humans from sparse view videos, etc. Our research goal is to develop effective 3D representations enabling 3D reconstructions of the world from visual data to allow better efficiency and efficacy in 3D understanding and modeling. 1.1 Learning 3D Reconstructions for Static Objects Multi-view Representation and Applications in Shape Completion. 3D shape com- pletion, the problem of perceiving the complete geometry of objects from partial ob- servations, is widely used in applications such as 3D scanning in robotics, autonomous driving, or 3D modeling and fabrication. While previous work for shape completion re- lies on volumetric representations, meshes, or point clouds, we propose to use multi-view depth maps from a set of fixed viewing angles as our shape representation, and perform depth completion for each view in image space. At the heart of our approach is a deep 3 convolutional network trained to complete each depth map. We show that (1) 3D shape completion can be formulated as multi-view depth completions in 2D image space, which enables dense and high-solution shape generation, and (2) generalizable 3D shape com- pletion can be trained from a dataset of multi-view depth maps without the use of explicit 3D ground truth. More details will be introduced in Chapter 2. Multi-view Consistency in Shape Completion. However, one problem of the multi-view representation is inconsistency among the multiple completed images. To resolve this issue, we propose a multi-view consistent inference technique for 3D shape completion, which we express as an energy minimization problem, and we present a consistency loss to encourage multi-view geometric consistency. We show that with a pretrained shape prior, the proposed learning-based approach can improve geometric consistency on novel test data for the multi-view representation. We will introduce this technique in Chapter 3. 3D Reconstruction from Single RGB Images based on Multi-view Representation. With recent advances in deep learning, it has become possible to recover plausible 3D shapes even from single RGB images for the first time. The problem of single image 3D reconstruction shares some similarities with shape completion, but is arguably even harder. In addition, obtaining detailed geometry and texture for objects with arbitrary topology remains challenging. Based on the multi-view representation, we propose a novel approach to reconstruct dense point clouds with textures from single RGB images. 
We show that 3D geometry and texture reconstruction can also be formulated as view im- age completions based on the multi-view representations. We will discuss this in Chapter 4. 4 1.2 Learning 3D Human Avatars from Videos. Our world is rarely static, and it is of major interest to reconstruct and understand specifically dynamic humans. Besides static objects, we will introduce some techniques to reconstruct 3D dynamic articulated humans from videos, which are far more challeng- ing than the 3D reconstructions of static objects. Capturing and rendering realistic human appearance under varying poses and viewpoints is an important goal in computer vision and graphics, i.e., in applications of free-viewpoint video generation, animation, telep- resence, and content creation for AR/VR. For free-viewpoint video generation of humans from multi-view videos, most traditional methods required animatable person-specific 3D models which were created with sophisticated reconstruction and rigging techniques, and employed classical graphics renderers to synthesize views. These methods are time- consuming, and often require users to have extensive modeling experience, artistic skills, and training to create these 3D models. In contrast, recent neural rendering methods [2, 5–9] have made great progress in generating realistic images of humans, which are simple yet effective compared with traditional graphics pipelines [10–12]. Neural rendering is a class of deep image and video generation approaches that combines machine learning techniques with physical knowledge from computer graphics to obtain controllable outputs, which enables end-to- end 3D reconstructions with just 2D images or videos as self-supervision signals even for dynamic articulated human bodies. Hybrid Volumetric-Textural Rendering for Human Avatars. Different from tradi- tional representations of texture and geometry (e.g., voxels, point clouds, meshes), we 5 explore neural textures and implicit geometries for human modeling and propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR), which syn- thesizes 3D human avatars from arbitrary poses/viewpoints. At the heart of our method lies a two-stage rendering which first constructs a pose-conditioned downsampled neu- ral radiance field (NeRF [13]), and uses generative adversarial networks (GAN) [14] for high-resolution image synthesis. We show that (1) HVTR is capable of learning dynamic human reconstructions from just sparse viewpoint video sequences, and (2) the classical volume rendering [15] and GAN-based rendering [14] can be combined, and employed to render dynamic articulated human avatars efficiently and at high quality. More details will be presented in Chapter 5, 6. Reconstruct Avatars from Mobile Egocentric Fisheye Camera. Real-time free-viewpoint rendering of self-embodied avatars is also important in VR and AR applications, notably telepresence. Most existing telepresence systems employ multiple external outside-in cameras to capture human motions, and thus, users are often limited to staying in con- fined spaces visible to these external cameras. In contrast, we explore motion capture and neural rendering techniques for mobile setup and build a system for rendering full-body avatars of a person captured by a wearable, egocentric fisheye camera that is mounted on a cap or a VR headset. We will present this system in Chapter 7. 
1.3 Contribution & Dissertation Organization

Figure 1.3 provides a visual overview of the dissertation and the specific topics to be discussed: learning dense 3D reconstructions for static objects (Part I), and learning 3D dynamic human avatars from videos (Part II).

Figure 1.3: Visual overview of the dissertation.

Learning dense 3D reconstructions for static objects (Part I). In Chapter 2, we will introduce the multi-view representation and its applications in 3D shape completion. Second, the inconsistency problem of the multi-view representation will be discussed in Chapter 3. In Chapter 4, we will present the extensions of the multi-view representation for 3D reconstructions from single RGB images. The following is the relevant publication list for each chapter.

1. Chapter 2 — Hu et al., "Render4Completion: Synthesizing Multi-view Depth Maps for 3D Shape Completion", ICCVW 2019, Geometry Meets Deep Learning [16].
2. Chapter 3 — Hu et al., "3D Shape Completion with Multi-view Consistent Inference", AAAI 2020 [17].
3. Chapter 4 — Hu et al., "Learning to Generate Dense Point Clouds with Textures on Multiple Categories", WACV 2021 [18].

Learning 3D dynamic human avatars from videos (Part II). In this part, we will introduce techniques to reconstruct human avatars from videos. In Chapter 5, we will present a novel neural rendering technique (HVTR: Hybrid Volumetric-Textural Rendering) to render full-body human avatars. In Chapter 6, we propose an extension of HVTR, which makes use of both pose and image signals for novel view synthesis. In Chapter 7, we will present how to render human avatars from mobile egocentric fisheye cameras. The following is the relevant publication list for each chapter.

1. Chapter 5 — Hu et al., "HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars", 3DV 2022 [19].
2. Chapter 6 — Hu et al., "HVTR++: Image and Pose Driven Human Avatars using Hybrid Volumetric-Textural Rendering", in submission.
3. Chapter 7 — Hu et al., "EgoRenderer: Rendering Human Avatars From Egocentric Camera Images", ICCV 2021 [20].

Contribution. In summary, this thesis makes the following technical contributions:

1. A technique that formulates 3D shape completion as multi-view depth map completion in 2D image space, which enables dense and high-resolution shape generation (Part I, Chapter 2).
2. A multi-view consistent inference technique for 3D shape completion, which encourages multi-view geometric consistency on novel test data during the inference stage (Part I, Chapter 3).
3. A technique to generate dense point clouds with textures from single RGB images based on multi-view shape representations (Part I, Chapter 4).
4. A novel neural rendering technique, Hybrid Volumetric-Textural Rendering (HVTR), which is trained on human video sequences to learn to synthesize virtual full-body human avatars from novel arbitrary poses and viewpoints efficiently and at high quality (Part II, Chapter 5).
5. A system to generate pose- and image-driven full-body human avatars, which is capable of rendering faithful appearance details for digital humans by fully utilizing driving pose and image signals (Part II, Chapter 6).
6. A system to render full-body human avatars from a mobile egocentric fisheye camera (Part II, Chapter 7).
9 Part I Learning Dense 3D Reconstructions for Static Objects 10 Chapter 2: Shape Completion based on Multi-view Representation 2.1 Introduction Shape completion is an important step in 3D shape analysis, serving as a building block in applications such as 3D scanning in robotics, autonomous driving, or 3D mod- eling and fabrication. While learning-based methods that leverage large shape databases have achieved significant advances recently, choosing a suitable 3D representation for such tasks remains a difficult problem. On the one hand, volumetric approaches such as binary voxel grids or distance functions have the advantage that convolutional neural networks can readily be applied, but including a third dimension increases the memory re- quirements and limits the resolutions that can be handled. On the other hand, point-based techniques provide a more parsimonious shape representation, and recently there has been much progress in generalizing convolutional networks to such irregularly sampled data. However, most generative techniques for 3D point clouds involve fully connected layers that limit the number of points and level of shape detail that can be obtained [21–23]. In this chapter, we propose to use a shape representation that is based on multi-view depth maps for shape completion. The representation consists of a fixed number of depth images taken from a set of pre-determined viewpoints. Each pixel is a 3D point, and the union of points over all depth images yields the 3D point cloud of a shape. This has 11 the advantage that we can use several recent advances in neural networks that operate on images, like U-Net [24] and 2D convolutional networks. In addition, the number of points is not fixed and the point density can easily be increased by using higher resolution depth images, or more viewpoints. Here we leverage this representation for shape completion. Our key idea to perform shape completion is to render multiple depth images of an incomplete shape from a set of pre-defined viewpoints, and then to complete each depth map using image-to-image translation networks. To improve the completion accuracy, we further propose a novel multi-view completion net (MVCN) that leverages information from all depth views of a 3D shape to achieve high accuracy for single depth view completion. In summary, our contributions are as follows: • We propose a strategy to address shape completion by re-rendering multi-view depth maps to represent the incomplete shape, and performing image translation of these rendered views. • We introduce a multi-view completion architecture that leverages information from all rendered views and outperforms separate depth image completion for each view. • We demonstrate the efficacy of our method on shape completion problems, which is able to generate denser shapes than previous point cloud based methods. 12 Figure 2.1: Overview of our approach. (a) We render 8 depth maps of an incomplete shape (shown in red) from 8 viewpoints on the corners of a cube; (b) These rendered 8 depth maps are passed through a multi-view completion net including an adversarial loss, which generates 8 completed depth maps; (c) We back-project the 8 depth maps into a completed 3D model. 2.2 Related Work Deep Learning on 3D Shapes. Pioneering work on deep learning for 3D shapes has relied on volumetric representations [25, 26], which allow the straightforward application of convolutional neural networks. 
To avoid the computation and memory costs of 3D convolutions and 3D voxel grids, multi-view convolutional neural networks have also been proposed early for shape analysis [27, 28] such as recognition and classification. But these techniques cannot address shape completion. In addition to volumetric and multi-view representations, point clouds have also been popular for deep learning on 3D shapes. Groundbreaking work in this area includes PointNet and its extension [29, 30]. 3D Shape Completion. Shape completion can be performed using volumetric grids, as proposed by Dai et al. [31] and Han et al. [32], which are convenient for CNNs, like 3D-Encoder-Predictor CNNs for [31] and encoder-decoder CNN for patch-level geome- try refinement in [32]. However, when represented with volumetric grids, data size grows 13 cubically as the size of the space increases, which severely limits resolution and applica- tion. To address this problem, point based shape completion methods were presented, like [21, 23, 33]. The point completion network (PCN) [33] is the state-of-the-art approach that extends the PointNet architecture [29] to provide an encoder, followed by a multi- stage decoder that uses both fully connected [21] and folding layers [23]. They show that their decoder leads to better results than using a fully connected [21] or folding based [23] decoder separately. However, for these voxels or points based shape completion methods, the numbers of input and output voxels or points are still fixed. For example, the in- put should be voxelized on a 323 grid [32] and the output point cloud size is 2048 [23], however, which can lead to loss of detail in many scenarios. 3D Reconstruction from Images. The problem of 3D shape reconstruction from single RGB images shares similarities with 3D shape completion, but is arguably even harder. While a complete survey of these techniques is beyond the scope of this chapter, our work shares some similarities with the approach by Lin et al. [34]. They use a multi- view depth map representation for shape reconstruction from single RGB images using a differentiable renderer. In contrast to their technique, we address shape completion, and our approach allows us to solve the problem directly using image-to-image translation. Soltani et al. [35] do shape synthesis and reconstruction from multi-view depth images, which are generated by a variational autoencoder [36]. However, they do not consider the relations between the multi-view depth images of the same model in their generative net. Image Translation and Completion. A key advantage of our approach is that it allows us to leverage powerful image-to-image translation architectures to address the shape com- pletion problem, including techniques based on generative adversarial networks (GAN) [14], 14 and U-Net structures [24]. Based on conditional GANs, image-to-image translation net- works can be applied on a variety of tasks [37]. Satoshi et al. [38] and Portenier et al. [39] propose to use conditional GANs for image completion or editing. However, each im- age is completed individually in their networks. We propose a network that can combine information from other related images to help the completion of one single image. 2.3 Method 2.3.1 Multi-view Representation As discussed above, high resolution completion is difficult to achieve by existing methods that operate on voxels or point clouds due to memory limitations or fully con- nected network structures. 
In contrast, multi-view representations of 3D shapes [27, 28] can circumvent these obstacles to achieve high-resolution and dense completion. As shown in Figure 2.1 (a), given an incomplete point cloud, our method starts from rendering 8 depth maps for this point cloud. Specifically, the renderings are generated by placing 8 virtual cameras at the 8 vertices of a cube enclosing the shape, all pointing towards the centroid of the shape. We also render 8 depth maps from the complete target point cloud, and then we use these image pairs to train our network.

With this multi-view representation, the shape completion problem can be formulated as image-to-image translation, i.e., translating an incomplete depth map to a corresponding complete depth map, for which we can take full advantage of several recent advances in net structures that operate successfully on images, like U-Net architectures and GANs. After the completion net shown in Figure 2.1(b), we get 8 completed depth maps in Figure 2.1(c), which can be back-projected into a completed point cloud.

2.3.2 Multi-view Depth Maps Completion

In the completion problem, we learn a mapping from an incomplete depth map x_i to a completed depth map G(x_i), where x_i is rendered from a partial 3D shape S, i ∈ [1, V]. We render V views for each shape and expect to complete each depth map x_i of S as similar as possible to the corresponding depth map y_i of the ground truth 3D shape S.

Although completing each of the V depth maps of a 3D shape separately would be straightforward, there are two disadvantages. First, we cannot encourage consistency among the completed depth maps from the same 3D shape, which affects the accuracy of the resulting 3D shapes obtained by back-projecting the completed depth maps. Second, we cannot leverage information from other depth maps of the same 3D shape while completing one single depth map. This limits the accuracy of completing a single depth image, since views of the same 3D model share some common information that could be exploited, like global shape and local parts as seen from different viewpoints.

To resolve these issues, we propose a multi-view completion net (MVCN) architecture to complete one single depth image by jointly considering the global 3D shape information. In order to complete a depth image x_i as similar as possible to the ground truth y_i in terms of both low-frequency correctness and high-frequency structure, MVCN is designed based on a conditional GAN [14], which is formed by an image-to-image net G and a discriminator D. In addition, we introduce a shape descriptor d for each 3D shape S to contribute to the completion of each depth image x_i from S, where d holds global information of shape S. The shape descriptor d is learned along with the other parameters in MVCN, and it is updated dynamically with the completion of all the depth images x_i of shape S.

2.3.3 MVCN Architecture

We use a U-Net based structure [24] as our image-to-image net G, which has shown its effectiveness over encoder-decoder networks in many tasks including image-to-image translation [37]. Including our shape descriptor, we propose an end-to-end architecture as illustrated in Figure 2.2.

Figure 2.2: Architecture of MVCN. The shape descriptor represents the information of a 3D shape, which contributes to the completion of each single depth map from the 3D shape. We adopt the generator and discriminator architecture of [37].

MVCN consists of 8 U-Net modules with an input resolution of 256 × 256, and each U-Net module has two submodules, DOWN and UP. DOWN (e.g., D3) consists of Convolution-BatchNorm-ReLU layers [40, 41], and UP (e.g., U3) consists of UpReLU-UpConv-UpNorm layers. More details can be found in [37].

In MVCN, DOWN modules are used to extract a view feature f_i of each depth image x_i. For each 3D shape S, we learn a shape descriptor d by aggregating all V view features f_i through a view-pooling layer. Since not all the features are necessary to represent the global shape, we use max pooling to extract the maximum activation in each dimension of all f_i to form the shape descriptor, as illustrated in Figure 2.2.

In addition, the shape descriptor d is applied to contribute to the completion of each depth image x_i. Specifically, for an input x_i we employ the output of DOWN module D3 as the view feature f_i, and insert the view-pooling layer after D3. For each shape S we use a shape memory to store all its V view features f_i as shown in Figure 2.2. When we get f_i, we first use it to update the corresponding feature map in the shape memory. For example, if i = 3, the third feature map in the shape memory will be replaced with f_3. Then we obtain the shape descriptor of S in the current iteration by a view-pooling layer (max pooling over all feature maps in the shape memory of S). This strategy dynamically keeps the best view features across all training iterations, as illustrated in Figure 2.2. Subsequently, we use the shape descriptor d to contribute to the completion of depth map x_i by concatenating d with the view feature f_i as the input of module D2. This concatenated feature is also forwarded to module U3 via a skip connection.
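To make the view-pooling step concrete, the following is a minimal sketch of how a shape memory and a max-pooled shape descriptor could be maintained and injected into the per-view completion pass. It assumes PyTorch; the module handles (down3, down2, decoder), tensor shapes, and helper names are illustrative placeholders rather than the actual MVCN implementation.

```python
import torch

class ShapeMemory:
    """Stores the most recent D3 feature map for each of the V views of one shape.
    (Sketch only; MVCN keeps such a memory per training shape and refreshes it
    every time a view of that shape is processed.)"""

    def __init__(self, num_views, channels, height, width):
        self.feats = torch.zeros(num_views, channels, height, width)

    def update(self, view_idx, feat):
        # Replace this view's slot with its newest feature map.
        self.feats[view_idx] = feat.detach()

    def shape_descriptor(self):
        # View-pooling: element-wise max over all stored view features.
        return self.feats.max(dim=0).values


def complete_one_view(down3, down2, decoder, memory, x_i, view_idx):
    """One per-view pass of the idea: extract the view feature, refresh the shape
    memory, pool the shape descriptor, and feed the concatenation onward
    (in MVCN the fused feature also reaches U3 via a skip connection)."""
    f_i = down3(x_i)                              # view feature, shape (1, C, H, W)
    memory.update(view_idx, f_i[0])               # gradients through d omitted for simplicity
    d = memory.shape_descriptor().unsqueeze(0)    # pooled descriptor, (1, C, H, W)
    fused = torch.cat([f_i, d], dim=1)            # input to D2
    return decoder(down2(fused))
```

Because the memory always holds the most recent feature of every view, the element-wise maximum plays the same role as the view-pooling used in multi-view recognition networks, but is refreshed dynamically over the training iterations.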
2.3.4 Loss Function

The objective of our conditional GAN is similar to image-to-image translation [37],

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))].   (2.1)

In our completion problem, we expect the completion net G not only to deceive the discriminator but also to produce a completion result near the ground truth. Hence we combine the GAN objective with a traditional pixel-wise loss, such as the L1 or L2 distance, which is consistent with previous approaches [37, 42]. Since L1 is less prone to blurring than L2, and since, considering Eq. 2.4, there is a linear mapping from a pixel in a depth image to a 3D point, we want to push the generated image to be near the ground truth in the L1 sense rather than L2. Therefore, the loss of the completion net is

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\| y - G(x) \|_1].   (2.2)

Our final objective in training is then

G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),   (2.3)

where \lambda is a balance parameter that controls the contributions of the two terms.

2.3.5 Optimization and Inference

Unlike some approaches that focus on image generation [43], our method does not generate images from noise, which also makes our training stable, as mentioned in [38]. Similar to [37], we only provide noise in the form of dropout in our network. To optimize our net, we follow the standard approach [14, 37]. The training of D and G is alternated: one gradient descent step on D, then one step on G. Minibatch SGD and the Adam solver [44] are applied, with a learning rate of 6e-4 for G and 6e-6 for D, which slows down the rate at which D learns relative to G. Momentum parameters are β1 = 0.5, β2 = 0.999, and the batch size is 32.
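As a concrete illustration of the objective in Eq. 2.3 and the alternating updates described above, here is a schematic training step. It assumes PyTorch and a binary cross-entropy form of the cGAN terms; the generator G, discriminator D, optimizers, and tensors are placeholders, so this is a sketch of the training logic under those assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

# Optimizers following the settings above (illustrative):
# opt_G = torch.optim.Adam(G.parameters(), lr=6e-4, betas=(0.5, 0.999))
# opt_D = torch.optim.Adam(D.parameters(), lr=6e-6, betas=(0.5, 0.999))

def train_step(G, D, opt_G, opt_D, x, y, lam=1.0):
    """One alternating update: first the discriminator, then the generator.
    x: incomplete depth maps, y: ground-truth depth maps, shape (N, 1, 256, 256)."""
    # Discriminator step: maximize log D(x, y) + log(1 - D(x, G(x))).
    with torch.no_grad():
        fake = G(x)
    d_real, d_fake = D(x, y), D(x, fake)
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D while staying close to the ground truth in L1 (Eq. 2.3).
    fake = G(x)
    d_fake = D(x, fake)
    loss_G = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)) +
              lam * F.l1_loss(fake, y))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```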
Then we run the net again for the second time to complete each view leveraging the learned shape descriptor. Our final target is 3D shape completion. Given a generated depth image G(xi), for each pixel p at location (xp, yp) with depth value dp, we can back-project p to a 3D point P through an inverse perspective transformation, P = R−1(K−1[xp, yp, dp] T − t), (2.4) where K, R, and t are the camera intrinsic matrix, rotation matrix, and translation vector respectively. Note that K, R, and t are always known since these are the parameters of the 8 virtual cameras placed on the corners of a cube. The final shape is the union of the completed, back-projected point clouds from all 8 virtual views. 2.4 Experiments In this section, we first describe the creation of a multi-category dataset to train our model, and then we illustrate the effectiveness of our method and the improvement of MCVN over a single view completion net (VCN) used as a baseline, where each view is completed individually without shape descriptor. Finally, we analyze the performance of our method, and make comparisons with existing methods. By default, we conduct the training of MVCN under the MVCN-Airplane600 (trained with the first 600 shapes of 20 airplane in ShapeNet [45]), and test it under the same 150 models involved in [33]). 2.4.1 Data Generation and Evaluation Metrics We use synthetic CAD models from ShapeNet to create a dataset to train our model. Specifically, we take models from 8 categories: airplane, cabinet, car, chair, lamp, sofa, table, and vessel. Our inputs are partial point clouds. For each model, we extract one partial point cloud by back-projecting a 2.5D depth map (from a random viewpoint) into 3D, and render this partial point cloud into V = 8 depth maps of resolution 256 × 256 as training samples. The reason why we use back-projected depth maps as partial point clouds instead of subsets of the complete point cloud is that our training samples are closer to real-world sensor data in this way. In addition, similar to other works, we choose to use a synthetic dataset to generate training data because it contains detailed 3D shapes, which are not available in real-world datasets. In the same way, we also render V = 8 depth maps from the ground truth point clouds as the ground truth depth maps. Similar to [33], here we also use the symmetric version of Chamfer Distance (CD) [22] to calculate the average closest point distance between the target shape and the generated shape. 2.4.2 Analysis of the Objective Function We conduct ablation studies to justify the effectiveness of our objective function for the completion problem. Table 2.1(a) shows the quantitative effects of these variations, and Figure 2.3 shows the qualitative effects. The cGAN alone (bottom left, setting λ = 0 21 in Eq. 2.3) gives very noisy results. L2+cGAN (bottom middle) leads to reasonable but blurry results. L1 alone (top right) also produces reasonable results, but we can find some visual defects, like some unfilled holes as marked, which makes the final CD distance higher than that of L1+cGAN. These visual defects can be reduced when including both L1 and cGAN in the loss function (bottom right). As shown by the example in Figure 2.5, the combination of L1 and cGAN can complete the depth images with high accuracy. We further explore the importance of the two components of the objective function for point cloud completion by using different weights (λ in Eq. 2.3) of the L1 loss. 
In Table 2.1(b), the best completion result is achieved when λ = 1. We set λ = 1 in our experiments. Figure 2.3: Completion results under different losses. Loss Avg CD cGAN 10.729 L1 5.672 L2 + cGAN 6.467 L1 + cGAN 5.512 (a) λ in Eq. 2.3 Avg CD λ = 50 5.748 λ = 10 5.665 λ = 1 5.512 λ = 0.5 5.541 (b) Table 2.1: Analysis of the objective function: average CD for different losses (a), and different λ (b). Numbers are multiplied by 1000 22 2.4.3 Analysis of the View-pooling Layer Pooling methods. We also study different view-pooling methods to construct the shape descriptor, including element-wise max-pooling and mean-pooling. According to our ex- periments, mean-pooling is not as effective as max-pooling to extract the shape descriptor for image completion, which is similar to the recognition problem [28]. The average CD is 0.005926 for mean-pooling, but that of max-pooling is 0.005512, so max-pooling is used. Position Avg L1 distance Avg CD D2 3.377 5.512 D1 3.433 5.604 D0 3.501 5.919 Code 3.477 5.836 Table 2.2: Completion results for different positions of view-pooling layer. CD values are multiplied by 1000. Position of the view-pooling layer. Here we insert the view-pooling layer into different positions to extract the shape descriptor and further evaluate its effectiveness, including D2, D1, and D0, which are marked in Figure 2.2. Intuitively, the shape descriptor would have the biggest impact on the original network if we place the view-pooling layer before D2, and the experimental results illustrate this in Table 2.2, where both average L1 dis- tance and CD are the lowest. We also try to do view pooling after D0 and concatenate the shape descriptor with the latent code (marked in purple in Figure 2.2) and then pass them through a fully connected layer, but experiments show that the shape descriptor will be ignored since both the average L1 distance and CD do not decrease compared with single view completion net (average L1 distance is 3.473643 and CD is 0.005839 in Table 2.5). 23 Model Name Avg L1 Distance MVCN-V3 3.794 MVCN-V8-3 3.617 MVCN-V5 3.564 MVCN-V8-5 3.398 Table 2.3: Average L1 distance for different numbers of views in view-pooling. Figure 2.4: Completion results for different numbers of views in view-pooling. Number of views in view-pooling. We also analyze the effect of the number of views used in view-pooling. In Table 2.3, MVCN-V3 was trained with 3 depth images (No.1, 3, 5) of the 8 depth images of each 3D model, and MVCN-V5 was trained with 5 depth images (No. 1, 3, 5, 6, 8). MVCN-V8-3 and MVCN-V8-5 were trained with all the 8 depth images, but were tested with 3 views and 5 views respectively. In order to make fair comparisons, we took the 1st, 3rd, and 5th view images to test MVCN-V8-3 and MVCN-V3, and 1st, 3rd, 5th, 6th, 8th to test MVCN-V8-5 and MVCN-V5. The results show that the completion of one single view will be better when we increase the number of views, which means other views are helpful for the completion of one single view, and the more the views, the higher the completion accuracy. Figure 2.4 shows an example of the completion. As we increase the number of views in view-pooling, the completion results are improved. 24 Figure 2.5: An example of the completion of sofa. The 1st row: incomplete point cloud and 8 depth maps of it; The 2nd row: generated point cloud and related 8 depth maps; The 3rd row: ground truth point cloud and its 8 depth maps. Figure 2.6: Visual comparison between VCN and MVCN. 
Starting from the partial point cloud in the first row, VCN and MVCN perform completions of depth maps in the second and third row, respectively, where the completed point clouds are also shown. We use colormaps (from blue to green to red) to highlight the pixels with errors larger than 10 in terms of L1 distance. Ground truth data is in the last row. MVCN achieves a lower L1 distance on all 8 depth maps.

2.4.4 Improvements over Single View Completion

Pervasive improvements on L1 distance and CD. From Table 2.5, we find significant and pervasive improvements over the single view completion net (VCN) on both average L1 distance and CD on different categories. Networks in Table 2.5 were trained with 600 3D models for airplane, 1600 for lamp, and 1000 for the other categories. We use 150 3D models from each category to evaluate our network, the same test dataset as in [33]. We further conduct a visual comparison with VCN in Figure 2.6, where we can see that MVCN achieves higher completion accuracy with the help of the shape descriptor.

Table 2.4: Improvements while increasing training samples. CD values are multiplied by 1000.
Model               Avg L1 Distance   Avg CD
MVCN-Airplane600    3.377             5.512
MVCN-Airplane1200   3.156             5.273
MVCN-Lamp1000       6.661             12.012
MVCN-Lamp1600       6.246             10.576
VCN-Lamp1000        6.763             12.091
VCN-Lamp1600        6.430             12.007

Better generalization capability. Table 2.4 shows that we can improve the performance of VCN and MVCN by increasing the number of training samples. We find that the performance differences between MVCN-Lamp1000 and VCN-Lamp1000 are not obvious. The reason is that there are relatively large individual differences among lamp models in ShapeNet, and the completion results are poor on several unusual lamp models in the test set. For these models, the comparisons between VCN and MVCN are less meaningful, so the improvement is not obvious. But this can be solved when we add another 600 training samples: MVCN-Lamp1600 has a bigger improvement than VCN-Lamp1600 on average L1 distance and CD, which indicates a better generalization capability of MVCN.

2.4.5 Comparisons with the State-of-the-art

Baselines. Some previous completion methods need prior knowledge of the shape [46], or assume more complete inputs [47], so they are not directly comparable to our method.

Table 2.5: Quantitative comparisons between VCN and MVCN.
Average L1 Distance
Model   Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa    Table   Vessel
VCN     5.431   3.474      4.305     3.859   7.645   6.430   5.717   7.573   4.446
MVCN    5.102   3.377      3.991     3.610   7.143   6.245   5.285   7.156   4.013
Mean Chamfer Distance per point (multiplied by 1000)
Model   Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa    Table   Vessel
VCN     8.800   5.839      7.297     6.589   10.398  12.007  9.565   9.371   9.334
MVCN    8.328   5.512      7.154     6.322   10.077  10.576  9.174   9.020   8.790

Here we compare MVCN with several strong baselines. PCN-CD [33], the point completion net trained with CD as the loss function, was the state of the art when this work was developed. PCN-EMD uses the Earth Mover's Distance (EMD) [22] as the loss function, but it is intractable for dense completion due to the computational complexity of EMD. The encoders of FC [21] and Folding [23] are the same as in PCN-CD, but the decoders are different: a 3-layer fully-connected network for FC, and a folding-based layer for Folding. PN2 uses the same decoder, but the encoder is PointNet++ [30]. 3D-EPN [31] is a representative of the class of volumetric completion methods.
For a fair comparison, the distance field outputs of 3D-EPN are converted into point clouds as mentioned in [33]. TopNet [48] is a recent point-based method, but it can only generate sparse point clouds because its decoder mostly consists of multilayer perceptron networks, which limits the number of points it can process.

Comparisons. As done in [33], we use the symmetric version of CD to calculate the average closest point distance, where the ground truth point clouds and generated point clouds are not required to be the same size, which is different from EMD [22]. For point-based methods like PCN [33], the input is sampled and the output size is fixed. Different from these methods, the number of output points of our approach is not fixed, which would require resampling our output to compute the EMD. CD is therefore more suitable for a fair comparison among different techniques.

Table 2.6: Comparison with the existing methods on mean CD (multiplied by 100) over multiple categories.
Mean Chamfer Distance per point
Model      Avg     Airplane   Cabinet   Car     Chair   Lamp    Sofa     Table   Vessel
3D-EPN     2.014   1.316      2.180     2.031   1.881   2.575   2.109    2.172   1.854
FC         0.980   0.570      1.102     0.878   1.097   1.113   1.176    0.932   0.972
Folding    1.007   0.597      1.083     0.927   1.125   1.217   1.163    0.945   1.003
PN2        1.400   1.030      1.474     1.219   1.578   1.762   1.618    1.168   1.352
PCN-CD     0.964   0.550      1.063     0.870   1.100   1.134   1.168    0.859   0.967
PCN-EMD    1.002   0.585      1.069     0.908   1.158   1.196   1.2206   0.901   0.9789
MVCN       0.830   0.527      0.715     0.632   1.008   1.058   0.917    0.902   0.879

Figure 2.7: Comparison between MVCN and PCN-CD.

Table 2.6 lists the quantitative results, where the completion results of the other methods are from [33]. Our method achieves the lowest CD across almost all object categories. A more detailed comparison with PCN-CD is shown in Figure 2.7, where the height of the blue bar indicates the amount of improvement of our method over PCN-CD on each object. Our model outperforms PCN on most objects.

Figure 2.9 shows the qualitative results. Our completions are denser, and we recover more details in the results. Another obvious advantage is that our method can complete shapes with complex geometry, like the second to the fourth objects, while other methods fail to recover these shapes. Note that our method is category-specific, which requires a classification step in data preprocessing before shape completion on multiple categories. However, in Chapter 4, we show that the multi-view representation can also be used to train a single network for 3D reconstructions on multiple categories.

2.4.6 Completion Results on KITTI

Our goal is to obtain high-quality and high-resolution shape completion from data similar to individual range scans focused on individual objects. Hence we obtain incomplete data using synthetic depth images, which is similar to data from RGB-D cameras. However, for data like KITTI, which is extremely sparse and does not contain ground truth, the usual objective is to obtain a rough rather than a high-resolution completion. Our method performs reasonably well on KITTI data, as shown in Figure 2.8.

Figure 2.8: Completion results on KITTI.

2.5 Conclusion

We have presented a method for shape completion by rendering multi-view depth maps of incomplete shapes, and then performing image completion of these rendered views. Experiments show that our view-based representation and novel network structure can achieve better results with fewer training samples, perform better on objects with complex geometry, and generate higher resolution results than previous methods.
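As a practical complement to Eq. 2.4 and the Chamfer Distance used in the experiments above (and to the fusion procedure detailed in Appendix 2.A), the sketch below shows how a completed depth map can be back-projected into points and how a symmetric CD can be computed. It is a NumPy illustration that assumes the usual homogeneous-pixel convention (pixel coordinates scaled by depth before applying K^{-1}) and one common CD normalization; the exact conventions of [22, 33] may differ.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Back-project a depth map into a 3D point cloud in the spirit of Eq. 2.4,
    P = R^-1 (K^-1 [x_p, y_p, d_p]^T - t), with pixel coordinates pre-multiplied by
    depth (an assumption about the camera convention). Zero-depth pixels are skipped.
    depth: (H, W); K, R: (3, 3); t: (3,)."""
    ys, xs = np.nonzero(depth)                   # pixels that carry geometry
    d = depth[ys, xs]
    pix = np.stack([xs * d, ys * d, d], axis=0)  # 3 x N homogeneous pixel coordinates
    cam = np.linalg.inv(K) @ pix                 # camera-space points at the given depths
    world = np.linalg.inv(R) @ (cam - t.reshape(3, 1))
    return world.T                               # N x 3 points

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance (average closest-point distance) between point sets
    P (N, 3) and Q (M, 3). Brute force, intended only for evaluation-sized clouds."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # N x M squared distances
    return 0.5 * (np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())
```

The final completed shape is then the union of the per-view back-projections; Appendix 2.A additionally filters the fused points by cross-view voting and radius outlier removal.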
Figure 2.9: Qualitative completion on ShapeNet, where MVCN can complete complex shapes with high resolution.

2.A Additional Experimental Details

In this section, we provide additional experimental results and technical details of the proposed method.

2.A.1 Completion Results on Noisy, Sparse and Occluded Point Clouds

Since there is no ground truth on KITTI, we also conduct experiments to evaluate the performance of our method on noisy, sparse and occluded inputs in Figure 2.10. For the ground truth point cloud (sofa), we render a depth image from a random viewpoint, whose back-projection is labeled 'Original Input', and then perturb the depth map with Gaussian noise whose standard deviation is η times the scale of the depth measurements. We then randomly subsample the input point cloud with a factor µ. Besides self-occlusion, we also consider that the target may be occluded by other objects in the wild; in Figure 2.10, 'Occ' in the 2nd and 5th columns means that we further remove 10% of the input points. Note that our model is not trained with these noisy, sparse, and occluded examples, but it is still robust to them.

Figure 2.10: Completion results on noisy, sparse and occluded inputs.

2.A.2 Analysis of the Number of Views in View-pooling

We further show the improvements in L1 distance for all view images of the test dataset in Figure 2.11. The x-axis represents different view images. Note that the same x represents different view images for 'V8 vs V3' and 'V8 vs V5': since the test dataset has 150 3D models, 450 view images are used to test 'V8 vs V3', and 750 view images are used to test 'V8 vs V5'. The height of the blue bar indicates the amount of improvement of 8 views over 3, and the red bar indicates the improvement of 8 views over 5. Positive values mean the L1 distance is lower when using 8 views. Since the training dataset is relatively small (600 3D models for training and 150 for testing), our network performs poorly on several unusual models in the test dataset, which fall on the boundary in Figure 2.11. Comparisons on these boundary instances are not meaningful. Apart from these, for most view images we decrease the L1 distance by increasing the number of views in view-pooling. More views mean the shape descriptors are more helpful.

Figure 2.11: Improvements of 8 views over 3 and 5 views in view-pooling.

2.A.3 Failure Cases

While in general our methods perform well, we observe that our models fail to complete several challenging input depth maps, which do not provide enough information for inference. For example, Figure 2.12 shows two failed completions of lamps, where we cannot extract useful information from the depth inputs to infer the whole shape. These cases mostly occur for lamp objects due to their complex geometry and the large individual differences among lamp models. The reconstruction of lamps is also the most challenging task, as mentioned in [49].

Figure 2.12: Failed depth completions of lamps.

2.A.4 Rendering and Back-projecting Depth Maps

Render multi-view depth maps. First, for each 3D model, we move its center to the origin. Most models in modern online repositories, such as ShapeNet and the 3D Warehouse, satisfy the requirement that models are upright oriented along a consistent axis, and some previous completion or recognition methods also follow the same assumption [26, 33]. With this assumption, the center consists of the midpoints along the x, y, z axes.
Then, each model is uniformly scaled to fit into a consistent sphere (radius 0.2), where the scale factor is the maximum extent along the x, y, z axes divided by the radius. Finally, we render 8 depth maps for each partial point cloud as input, as mentioned in Section 2.3.1. In this way, all the shapes appear at the center of the depth images. We also render 8 depth maps of the complete target shape and use these image pairs to train our network.

Back-project multi-view depth maps into a point cloud. We fuse the generated depth maps into a completed point cloud and apply a voting algorithm to remove outliers. Specifically, we reproject each point of one view into the other 7 views, and if a point falls on the shape in another view, we add one vote for it. The initial vote count for each point is 1, and we set a vote threshold of 7 to decide whether the point is valid. Furthermore, a radius outlier removal step is used to remove noisy points that have few neighbors (fewer than 6) within a given sphere (radius 0.006) around them.

Chapter 3: Multi-view Consistency in Shape Completion

3.1 Introduction

Convolutional neural networks have proven highly successful at analysis and synthesis of visual data such as images and videos. This has spurred interest in applying convolutional network architectures also to 3D shapes, where a key challenge is to find suitable generalizations of discrete convolutions to the 3D domain. Popular techniques include discrete convolutions on 3D grids [26], graph convolutions on meshes [50], convolution-like operators on 3D point clouds [51, 52], and 2D convolutions on 2D shape parameterizations [53]. A simple approach in the last category is to represent shapes using multiple 2D projections, or multiple depth images, and apply 2D convolutions on these views. This has led to successful techniques for shape classification [28], single-view 3D reconstruction [54], shape completion [16], and shape synthesis [55]. One issue in these approaches, however, is to encourage consistency among the separate views and to avoid each view representing a slightly different object. This is not an issue in supervised training, where the loss encourages all views to match the ground truth shape. But at inference time, or in unsupervised training, ground truth is not available and a different mechanism is required to encourage consistency.

In this chapter, we address the problem of shape completion using a multi-view depth image representation, and we propose a multi-view consistency loss that is minimized during inference. We formulate inference as an energy minimization problem, where the energy is the sum of a data term given by a conditional generative net and a regularization term given by a geometric consistency loss. Our results show the benefits of optimizing geometric consistency in a multi-view shape representation during inference, and we demonstrate that our approach performs better on shape completion tasks. In summary, our contributions are as follows:

i) We propose a multi-view consistency loss for 3D shape completion, which encourages geometric consistency of the multi-view representation on novel data in the inference stage.

ii) We formulate multi-view consistent inference as an energy minimization problem including our consistency loss as a regularizer and a neural network-based data term.
iii) We show that the proposed multi-view consistency optimization can further refine the shape completion results of the multi-view representation introduced in Chapter 2 on different object categories, which demonstrates the benefits of the consistent inference technique in practice.

3.2 Related Work

Multi-view Consistency. One problem of view-based representations is inconsistency among multiple views. Some researchers have introduced a multi-view loss to train their networks to achieve consistency in multi-view representations, for example for discovering 3D keypoints [56] and reconstructing 3D objects from images [34, 57–60]. With differentiable rendering [34, 58], the consistency distances among different views can be leveraged as 2D supervision to learn 3D shapes in their networks. However, these methods can only guarantee consistency for training data during the training stage. Different from these methods, with the help of our novel energy optimization and consistency loss implementation, our proposed method can improve geometric consistency on test data directly during the inference stage.

Figure 3.1: Overview of the multi-view consistent inference for 3D shape completion. Given a partial point cloud as input, we first render multiple incomplete views X, which form our shape representation of the incomplete input. To perform inference, we apply a conditional generative network G to generate completed depth images V based on a shape descriptor z conditioned on X. As a key idea, we design our consistency loss C to evaluate the geometric consistency among V. Intuitively, for all pixels in all views V_t we find the distance to their approximate closest neighbor in the other views V_s, and sum up these distances to form C. Specifically, for each target view (e.g., V_7 in the figure) we reproject all completed depth images V_s according to the pose of V_7, which leads to reprojection maps denoted R_s^7. Then we compute consistency distances, denoted D_s^7, for each reprojection map R_s^7 and the target V_7 via a pixel-wise closest point pooling operation. Finally, a consistency pooling operator aggregates all consistency distances D_s^7 into a loss map M^7. In inference, we minimize all loss maps as a function of the shape descriptor z.

Figure 3.2: Network structure.

Figure 3.3: Methods to calculate consistency distance.

3.3 Multi-view Consistent Inference

Overview. The goal of our method is to guarantee multi-view consistency in inference, as shown in the overview in Figure 3.1. Our method starts by converting partial point clouds to multi-view depth image representations, rendering the points into a set of incomplete depth images X = {X_1, ..., X_8} from a number of fixed viewpoints. In our current implementation, we use eight viewpoints placed on the corners of a cube.
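As an illustration of this camera setup, the sketch below places eight cameras on the corners of a cube and builds world-to-camera look-at transforms pointing at the object center. It is only a hedged sketch under assumed conventions (unit distance to the origin, a fixed up vector, row-vector camera axes), not the rendering code of our implementation.

import numpy as np
from itertools import product

def cube_corner_cameras(distance=1.0):
    # Place one camera on each of the 8 cube corners, looking at the origin.
    cameras = []
    for corner in product([-1.0, 1.0], repeat=3):
        center = distance * np.array(corner) / np.sqrt(3.0)   # camera center at the chosen distance
        forward = -center / np.linalg.norm(center)            # viewing direction toward the origin
        up = np.array([0.0, 1.0, 0.0])                        # assumed world up vector
        right = np.cross(up, forward)
        right /= np.linalg.norm(right)
        true_up = np.cross(forward, right)
        R = np.stack([right, true_up, forward])                # rows are the camera axes in world frame
        t = -R @ center                                        # so that x_cam = R @ x_world + t
        cameras.append((R, t))
    return cameras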
Our approach builds on a conditional generative net G(z; X) which is trained to output completed depth images V by estimating a shape descriptor z conditioned on a set of incomplete inputs X. We obtain the conditional generative net in a separate, supervised training stage. During inference, we keep the network weights fixed and optimize the shape descriptor z to minimize an energy consisting of a consistency loss, which acts as a regularizer, and a data term. On the one hand, the consistency loss C(V) = C(G(z; X)) quantifies the geometric consistency among the completed depth images V. On the other hand, the data term encourages the solution to stay close to an initially estimated shape descriptor z̊. This leads to the following optimization for the desired shape descriptor z*:

z* = argmin_z C(G(z; X)) + µ ||G(z; X) − G(z̊; X)|| = L_con(z) + µ L_gen(z),    (3.1)

where µ is a weighting factor, and we denote Y = G(z̊; X) and V = G(z; X) as the initially estimated and the optimized completed depth images in inference, respectively. In addition, we formulate the regularization term and the data term as the multi-view consistency loss L_con(z) and the generator loss L_gen(z) in Section 3.4.

Conditional generative net. The conditional generative net G(z; X) is built on the structure of the multi-view completion net [16], as shown in Figure 3.2, which is an image-to-image translation architecture applied to perform depth image completion for multiple views of the same shape. We train the conditional generative net following a standard conditional GAN approach [14]. To share information between multiple depth images of the same shape, our architecture learns a shape descriptor z for each 3D object by pooling a so-called shape memory consisting of N feature maps f_n, n ∈ {1, ..., N}, with N = 8, from all views of the shape. The network G consists of 8 U-Net modules, and each U-Net module has two submodules, Down and Up, so there are 8 Down submodules (D7, ..., D0) in the encoder and 8 Up submodules (U0, ..., U7) in the decoder. Down submodules have the form Convolution-BatchNorm-ReLU [40, 41], and Up submodules have the form UpReLU-UpConv-UpNorm. The shape memory is the feature map after the third Down submodule (D3) of the encoder. More details can be found in [16, 37]. In inference, we optimize the shape descriptor z of G(z; X) given the test input X. We first get an initial estimate of the shape descriptor z̊ for each test shape by running the trained model once, and initialize z with z̊. During inference, the other parameters of G are fixed.

3.4 Consistency Loss

Our consistency loss is based on the sum of the distances between each pixel in the multi-view depth maps and its approximate nearest neighbor in any of the other views. In this section we introduce the details of the multi-view consistency loss calculation, following the overview in Figure 3.1. For all views V_t, we first calculate pairwise per-pixel consistency distances D_s^t to each other view V_s, that is, per-pixel distances to approximate nearest neighbors in view V_s. We then perform consistency pooling, which for each view V_t provides the consistency distances over all other views (as opposed to the initial pairwise consistency distances between two views). We call these the loss maps M^t. The final consistency loss is the sum over all loss maps.
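The procedure above can be summarized in a short PyTorch sketch. This is a hedged illustration, not the original implementation: the tensor layout of the distance maps, the names G, consistency_fn, mu, steps, and lr, the choice of optimizer, and the use of a mean L1 norm for the data term of Eq. (3.1) are all assumptions. Consistency pooling is realized as a per-pixel minimum over the pairwise distance maps D_s^t, the loss maps M^t are summed into the scalar consistency loss, and the energy is minimized over the shape descriptor z while the network weights stay fixed.

import torch

def consistency_loss_from_distances(D):
    # D: (T, S, H, W) tensor, where D[t, s] is the per-pixel consistency distance map
    # from source view s to target view t, with D[t, t] set to +inf so that a view is
    # never compared against itself.
    M = D.min(dim=1).values        # consistency pooling: loss maps M^t, shape (T, H, W)
    return M.sum()                 # final consistency loss: sum over all loss maps

def optimize_shape_descriptor(G, X, z_init, consistency_fn, mu=1.0, steps=50, lr=1e-2):
    # G: trained conditional generative net with frozen weights.
    # consistency_fn: maps completed views V to the scalar consistency loss C(V)
    # (view-reprojection, closest point pooling, and consistency pooling).
    with torch.no_grad():
        Y = G(z_init, X)                                  # initial completion G(z̊; X)
    z = z_init.detach().clone().requires_grad_(True)      # only the shape descriptor is optimized
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        V = G(z, X)
        loss = consistency_fn(V) + mu * (V - Y).abs().mean()  # L_con(z) + mu * L_gen(z)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

In this form, mu and the number of optimization steps control how far the refined views are allowed to move away from the initial completion.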
3.4.1 Pairwise Consistency Distances

Given a source view V_s and a target view V_t, we calculate the consistency distance D_s^t between V_s and V_t by view-reprojection and closest point pooling, where V_t, V_s ∈ R^{H×W} and H×W is the image resolution. Specifically, view-reprojection transforms the depth information of the source V_s into a reprojection map R_s^t according to the transformation matrix of the target V_t. Then, closest point pooling produces the consistency distance D_s^t between R_s^t and V_t. Figure 3.3 shows the pipeline, where the target view is V_7 and the source view is V_2. In the following, we denote a pixel on the source view as p_i = [u_i, v_i, d_i], where u_i and v_i are the pixel coordinates, its back-projected 3D point as P_i = [x̂_i, ŷ_i, ẑ_i], and the reprojected pixel on the reprojection map R_s^t as p'_i = [u'_i, v'_i, d'_i], where d_i = V_s[u_i, v_i] and d'_i = R_s^t[u'_i, v'_i] are the depth values at the locations [u_i, v_i] and [u'_i, v'_i], respectively.

View-reprojection. The view-reprojection operator back-projects each point p_i = [u_i, v_i, d_i] on V_s into the canonical 3D coordinates as P_i = [x̂_i, ŷ_i, ẑ_i] via

P_i = ℜ_s^{-1} (K^{-1} p_i − τ_s)    ∀i,    (3.2)

where K is the intrinsic camera matrix, and