ABSTRACT Title of dissertation: FUSING MULTIMEDIA DATA INTO DYNAMIC VIRTUAL ENVIRONMENTS Ruofei Du Doctor of Philosophy, 2018 Dissertation directed by: Professor Amitabh Varshney Department of Computer Science In spite of the dramatic growth of virtual and augmented reality (VR and AR) technology, content creation for immersive and dynamic virtual environments remains a significant challenge. In this dissertation, we present our research in fusing multimedia data, including text, photos, panoramas, and multi-view videos, to create rich and compelling virtual environments. First, we present Social Street View, which renders geo-tagged social media in its natural geo-spatial context provided by 360° panoramas. Our system takes into account visual saliency and uses maximal Poisson-disc placement with spatiotem- poral filters to render social multimedia in an immersive setting. We also present a novel GPU-driven pipeline for saliency computation in 360° panoramas using spher- ical harmonics (SH). Our spherical residual model can be applied to virtual cine- matography in 360° videos. We further present Geollery, a mixed-reality platform to render an interactive mirrored world in real time with three-dimensional (3D) buildings, user-generated content, and geo-tagged social media. Our user study has identified several use cases for these systems, including immersive social storytelling, experiencing the culture, and crowd-sourced tourism. We next present Video Fields, a web-based interactive system to create, cal- ibrate, and render dynamic videos overlaid on 3D scenes. Our system renders dynamic entities from multiple videos, using early and deferred texture sampling. Video Fields can be used for immersive surveillance in virtual environments. Fur- thermore, we present VRSurus and ARCrypt projects to explore the applications of gestures recognition, haptic feedback, and visual cryptography for virtual and augmented reality. Finally, we present our work on Montage4D, a real-time system for seamlessly fusing multi-view video textures with dynamic meshes. We use geodesics on meshes with view-dependent rendering to mitigate spatial occlusion seams while maintain- ing temporal consistency. Our experiments show significant enhancement in ren- dering quality, especially for salient regions such as faces. We believe that Social Street View, Geollery, Video Fields, and Montage4D will greatly facilitate several applications such as virtual tourism, immersive telepresence, and remote education. FUSING MULTIMEDIA DATA INTO DYNAMIC VIRTUAL ENVIRONMENTS by Ruofei Du Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018 Advisory Committee: Dr. Amitabh Varshney, Chair/Advisor Dr. Matthias Zwicker Dr. Furong Huang Dr. Joseph F. JaJa Dr. Ming Chuang © Copyright by Ruofei Du 2018 Acknowledgments With my deepest sincerity and gratitude, thank you everyone I met in the in- credible five years I have had at University of Maryland, College Park and Microsoft Research. First and foremost, I have to say a special thank you to my advisor, Professor Amitabh Varshney for advising me, supporting me, and having faith in me on the creative and fascinating projects over the past years. I am not the most conventional of researchers, and there is often disruption along the path to achieving successful projects, but Professor Varshney was always in the background supporting me. 
Thank you for helping me to grow into the researcher I am today. With your encouragement, I have learned a great deal of knowledge about interactive graphics and visualization, virtual and augmented reality, parallel computing, and, most importantly, how to conduct research and behave myself. I would like to deeply thank my committee members, Dr. Zwicker, Dr. Huang, Dr. JaJa, and Dr. Chuang, for offering me invaluable advice and direction. I am also grateful to my advisors in Human Computer Interaction, Dr. Froehlich and Dr. Findlater, for guiding me through ProjectSideWalk and ProjectHandSight, teaching me how to organize a team project, how to take human factors into account, and how to think out of the box. I owe my gratitude to all the teachers and classmates I have learned from and because of whom my graduate experience has been one that I will cherish forever. I owe my thanks to all my colleagues, collaborators, and mentors from Mi- ii crosoft Research, Redmond: Hugues Hoppe, Wayne Chang, Sameh Khamis, Shahram Izadi, Mingsong Dou, Yury Degtyarev, Philip Davidson, Sean Fanello, Adarsh Kow- dle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Push- meet Kohli, Vladimir Tankovich, Marek Kolwalski, Qiyu Chen, Spencer Fowers, Jeff Foster, Norm Whittaker, and Ben Cutler. They live and breathe the spirit of re- search that I know and love: to work hard, to embrace science and engineering, to mix theory and practice, to think and work as a team rather than individuals, and to always focus on big things. I am grateful to all my dear friends, colleagues, collaborators, and lab-mates at University of Maryland, College Park: Dr. Sujal Bista, Dr. Hsueh-Chien Cheng, Dr. Changqing Zou, Dr. Eric Krokos, Dr. Kotaro Hara, Dr. Lee Stearns, Dr. Uran Oh, Dr. Hui Miao, Dr. Hao Li, Dr. Fan Du, Dr. Awalin Sopan, Dr. Xintong Han, Dr. Zebao Gao, Dr. Jin Sun, Yu Jin, Xiaoxu Meng, Shangfu Peng, Hong Wei, Liang He, Tiffany Chao, Kent Wills, Max Potasznik, Zheng Xu, Xuetong Sun, Hao Zhou, Xiyang Dai, Weiwei Yang, Shuo Li, Sida Li, Eric Lee, Somay Jain, Mukul Agarwal, Patrick Owen, Tara Larrue, and David Li. I am lucky to collaborate with you and learn from you. Your passion and dedication will always inspire me. Finally, there are three people missing up to now who I should thank the most: my parents and my wife, Sai Yuan. They have supported my continual dedication to study and research during the late nights, the weekends, and the travel. Thank you all! iii Dedication 2D justifies existence 3D validates identification 4D convinces living To my advisors, teachers, friends, and families, who taught me theorems, algorithms, data structures, and the meaning of life, and to those who taught me to relish the moment in the true reality. iv Table of Contents Acknowledgements ii Dedication iv List of Tables x List of Figures xi List of Abbreviations xiv 1 Introduction 1 1.1 Social Street View: Blending Immersive Maps with Geotagged Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 TopicFields: Spatiotemporal Visualization of Geotagged So- cial Media With Hybrid Topic Models and Scalar Fields . . . 3 1.1.2 Geollery: Designing an Interactive Mirrored World with Geo- tagged Social Media . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Spherical Harmonics for Saliency Computation and Virtual Cine- matography in 360° Videos . . . . . . . . . . . . . . . . . . . . . . . . 
7 1.3 Video Fields: Fusing Multiple Surveillance Videos into a Dynamic Virtual Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4 Integrating Gesture Recognition, Tactile Feedback, and Visual Cryp- tography into Virtual Environments . . . . . . . . . . . . . . . . . . . 11 1.4.1 VRSurus: Enhancing Interactivity and Tangibility of Puppets in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.2 ARCrypt: Visual Cryptography with Misalignment Tolerance using Augmented Reality Head-Mounted Displays . . . . . . . 13 1.5 Montage4D: Real-time Seamless Fusion and Stylization of Multiview Video Textures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Social Street View: Blending Immersive Street Views with Geotagged Social Media 18 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 v 2.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 21 2.2.1 Immersive Maps . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.2 Visual Management of Geotagged Information . . . . . . . . . 22 2.2.3 Analysis of Geotagged Social Media . . . . . . . . . . . . . . . 25 2.2.4 Mixed Reality in Immersive Maps . . . . . . . . . . . . . . . . 26 2.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.1 Street View Scraper . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.2 Mining Social Media . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.3 Servers and Relational Databases . . . . . . . . . . . . . . . . 31 2.4 Social Street View Interface . . . . . . . . . . . . . . . . . . . . . . . 32 2.5 Social Media Layout Algorithm . . . . . . . . . . . . . . . . . . . . . 33 2.5.1 Baseline: 2D Visualization . . . . . . . . . . . . . . . . . . . . 34 2.5.2 Uniform Random Sampling . . . . . . . . . . . . . . . . . . . 34 2.5.3 Depth and Normal-map-driven Placement of Social Media . . 36 2.5.4 Maximal Poisson-disk Sampling . . . . . . . . . . . . . . . . . 37 2.5.5 Placement of Social Media in Scenic Landscapes . . . . . . . . 38 2.5.6 Post-processing, Rendering and Interaction . . . . . . . . . . 40 2.5.7 Filtering of Social Media . . . . . . . . . . . . . . . . . . . . . 40 2.6 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 41 2.6.1 Dataset Acquisition and Hardware Setup . . . . . . . . . . . . 41 2.6.2 Evaluation of Initialization and Rendering Time . . . . . . . . 42 2.6.3 Evaluation of Saliency Coverage . . . . . . . . . . . . . . . . . 43 2.7 Use Cases and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 45 2.7.1 Storytelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.7.2 Business Advertising . . . . . . . . . . . . . . . . . . . . . . . 46 2.7.3 Learning Culture and Crowd-sourced Tourism . . . . . . . . . 47 2.8 TopicFields: Spatiotemporal Visualization of Geotagged Social Me- dia with Hybrid Topic Models and Scalar Fields . . . . . . . . . . . . 49 2.8.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.8.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.8.2.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . 53 2.8.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . 55 2.8.2.3 Spectral Clustering . . . . . . . . . . . . . . . . . . 58 2.8.3 Topic Fields Visualization . . . . . . . . . . . . . . . . . . . . 59 2.8.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.8.4.1 Trip Planning . . . . . . . . . . . . . . . . . . . . . 
62 2.8.4.2 Searching with Temporal Filters . . . . . . . . . . . 64 2.8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.9 Geollery: Designing an Interactive Mirrored World with Geotagged Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 2.9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 2.9.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 69 2.9.3 Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 2.9.3.1 Meshes and Textures . . . . . . . . . . . . . . . . . 71 2.9.3.2 Interactive Capabilities . . . . . . . . . . . . . . . . 73 vi 2.9.3.3 Virtual Representations of Social Media . . . . . . . 75 2.9.3.4 Aggregation Approaches . . . . . . . . . . . . . . . . 77 2.9.3.5 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . 78 2.9.3.6 Real-world Phenomena . . . . . . . . . . . . . . . . 79 2.9.3.7 Filtering of Social Media . . . . . . . . . . . . . . . 79 2.9.4 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 2.9.4.1 Background Interview . . . . . . . . . . . . . . . . . 80 2.9.4.2 Exploration of Geollery and Social Street View . . . 81 2.9.4.3 Quantitative Evaluation . . . . . . . . . . . . . . . . 83 2.9.4.4 The Future of 3D Social Media Platforms . . . . . . 85 2.9.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 2.9.5.1 Insights from User Study . . . . . . . . . . . . . . . 88 2.9.5.2 Combining Geollery and Social Street View . . . . . 89 2.9.5.3 Mobile and Virtual Reality Modes . . . . . . . . . . 89 2.10 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 91 3 Spherical Harmonics for Saliency Computation and Virtual Cinematography in 360° Videos 94 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.2.1 Visual Saliency . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.2.2 Spherical Harmonics . . . . . . . . . . . . . . . . . . . . . . . 100 3.3 Computing the Spherical Harmonics Coefficients . . . . . . . . . . . . 101 3.3.1 Evaluating SH Functions . . . . . . . . . . . . . . . . . . . . . 101 3.3.2 Evaluating SH Coefficients . . . . . . . . . . . . . . . . . . . . 102 3.3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . 104 3.4 Spherical Spectral Residual Model . . . . . . . . . . . . . . . . . . . 106 3.4.1 Spherical Spectral Residual Approach . . . . . . . . . . . . . 106 3.4.2 Temporal Saliency . . . . . . . . . . . . . . . . . . . . . . . . 108 3.4.3 Saliency Maps with Nonlinear Normalization . . . . . . . . . 109 3.4.4 Comparison Between the Itti et al.’s and SSR Model . . . . . 110 3.5 Saliency-guided Virtual Cinematography . . . . . . . . . . . . . . . . 112 3.5.1 Optimization of the Virtual Camera’s Control Points . . . . . 113 3.5.2 Saliency Coverage Term . . . . . . . . . . . . . . . . . . . . . 113 3.5.3 Temporal Motion Term . . . . . . . . . . . . . . . . . . . . . 114 3.5.4 The Optimization Process . . . . . . . . . . . . . . . . . . . . 114 3.5.5 Interpolation of Quaternions . . . . . . . . . . . . . . . . . . . 115 3.5.6 Evaluation of the SpatioTemporal Optimization Model . . . . 116 3.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 117 4 Video Fields: Fusing Multiple Surveillance Videos into a Dynamic Virtual Environment 119 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
119 4.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.2.1 Camera-World Calibration Interface . . . . . . . . . . . . . . 122 vii 4.2.2 Background Modeling . . . . . . . . . . . . . . . . . . . . . . 123 4.2.3 Segmentation of Moving Entities . . . . . . . . . . . . . . . . 125 4.3 Video Fields Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 126 4.3.1 Early Pruning for Rendering Moving Entities . . . . . . . . . 129 4.3.2 Deferred Pruning for Rendering Moving Entities . . . . . . . . 129 4.3.3 Visibility Testing and Opacity Modulation . . . . . . . . . . . 130 4.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 132 4.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 133 5 Integrating Haptics and Visual Cryptography into Virtual Environments 135 5.1 VRSurus: Enhancing Interactivity and Tangibility of Puppets in Vir- tual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.1.1 Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . 137 5.1.2 Gesture Recognizer . . . . . . . . . . . . . . . . . . . . . . . . 137 5.1.3 VR Educational Game . . . . . . . . . . . . . . . . . . . . . . 140 5.1.4 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.2 ARCrypt: Visual Cryptography with Misalignment Tolerance using Augmented Reality Head-Mounted Displays . . . . . . . . . . . . . . 143 5.2.1 Background and Related Work . . . . . . . . . . . . . . . . . 145 5.2.1.1 Augmented Reality Head Mounted Displays . . . . . 145 5.2.1.2 Visual Cryptography . . . . . . . . . . . . . . . . . 146 5.2.2 ARCrypt Algorithms . . . . . . . . . . . . . . . . . . . . . . . 149 5.2.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . 149 5.2.2.2 Generation of Two Shares . . . . . . . . . . . . . . . 150 5.2.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 152 5.3 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 153 6 Montage4D: Real-time Seamless Fusion of Multiview Video Textures 155 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.2.1 Image-based 3D Reconstruction . . . . . . . . . . . . . . . . . 158 6.2.2 Texture Stitching . . . . . . . . . . . . . . . . . . . . . . . . . 161 6.2.3 Geodesic Distance Fields . . . . . . . . . . . . . . . . . . . . . 163 6.3 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 6.4.1 Formulation and Goals . . . . . . . . . . . . . . . . . . . . . . 166 6.4.2 Normal Weighted Blending with Dilated Depth Maps and Coarse-to-Fine Majority Voting Strategy . . . . . . . . . . . . 167 6.4.3 Computing Misregistration and Occlusion Seams . . . . . . . 169 6.4.4 Discrete Straightest Geodesics for Diffusing Seams . . . . . . 171 6.4.5 Temporal Texture Fields . . . . . . . . . . . . . . . . . . . . . 173 6.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.5.1 Comparison with the Holoportation approach . . . . . . . . . 175 6.5.2 Comparison with the Floating Textures Approach . . . . . . . 178 6.6 Real-time Stylization . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 viii 6.6.1 Sketchy Stippling . . . . . . . . . . . . . . . . . . . . . . . . . 181 6.6.2 Relighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
183 6.7 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 6.8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 186 7 Conclusion and Future Directions 188 List of Publications 191 References 193 ix List of Tables 1.1 Qualitative evaluation of Holoportation and Montage4D . . . . . . . . 17 2.1 Resolutions, tile counts, and file sizes of the GSV data . . . . . . . . 42 2.2 Comparison between Geollery and Social Street View with different variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.1 Binocular field of view of the consumer-level head-mounted displays . 96 3.2 Timing comparison between the Itti et al.’s model and SSR model . . 111 4.1 Evaluation of early pruning and deferred pruning in Video Fields . . 133 6.1 Statistics of the Montage4D datasets . . . . . . . . . . . . . . . . . . 180 6.2 Comparison betweenHoloportation andMontage4D in cross-validation experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 6.3 Timing comparison between Holoportation and Montage4D . . . . . . 181 x List of Figures 1.1 Interface, deployment, and use cases of Social Street View . . . . . . . 2 1.2 Interface and the visualization results of TopicFields . . . . . . . . . 4 1.3 Interface, rendering results, and use cases of Geollery . . . . . . . . . 6 1.4 The saliency maps by our spherical spectral residual model . . . . . . 8 1.5 Results of the saliency-guided virtual cinematography . . . . . . . . . 9 1.6 Overview of the Video Fields system . . . . . . . . . . . . . . . . . . 10 1.7 Conceptual design, development, and deployment of VRSurus . . . . 12 1.8 Results and overview of our system, ARCrypt . . . . . . . . . . . . . 14 1.9 Results and work flow of the Montage4D system . . . . . . . . . . . . 15 1.10 Comparative results and real-time stylization of the Montage4D system 16 2.1 Results from our Social Street View system . . . . . . . . . . . . . . . 19 2.2 Comparison with prior art in visualizing geotagged information . . . 22 2.3 The work flow of Social Street View . . . . . . . . . . . . . . . . . . . 28 2.4 The input panoramas, depth maps, and normal maps in Social Street View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.5 The graphical user interface of Social Street View . . . . . . . . . . . 33 2.6 Comparison among different social media layout algorithms . . . . . . 36 2.7 Placement of Social Media in Scenic Landscapes . . . . . . . . . . . . 38 2.8 Evaluation of processing time in different resolutions . . . . . . . . . 43 2.9 Example saliency map and the average social media coverage . . . . . 44 2.10 Potential applications of Social Street View . . . . . . . . . . . . . . . 45 2.11 Results of the TopicFields system in the Manhattan district . . . . . 51 2.12 The architecture and workflow of the TopicFields system . . . . . . . 52 2.13 The top three classification results of ImageNet . . . . . . . . . . . . 54 2.14 Spectral ordering of the topic features . . . . . . . . . . . . . . . . . 55 2.15 The matrix diagram after spectral clustering . . . . . . . . . . . . . . 56 2.16 Visualization of the topic fields with different gain factors . . . . . . . 57 2.17 Visualization of the gain function . . . . . . . . . . . . . . . . . . . . 61 2.18 Use case of TopicFields: trip planning . . . . . . . . . . . . . . . . . 62 2.19 Use case of TopicFields: searching with temporal filters . . . . . . . . 
64 2.20 Real-time visualization of the mirrored world by Geollery . . . . . . . 67 xi 2.21 Comparison amongst prior mixed reality systems or designs for visu- alizing geotagged social media . . . . . . . . . . . . . . . . . . . . . . 68 2.22 The workflow of Geollery . . . . . . . . . . . . . . . . . . . . . . . . . 70 2.23 Examples for real-time communication between two users in Geollery 74 2.24 Virtual representations of geotagged social media . . . . . . . . . . . 76 2.25 Different aggregation approaches in Geollery . . . . . . . . . . . . . . 77 2.26 Quantitative evaluation of Geollery and Social Street View . . . . . . 84 2.27 Projection mapping results in Geollery . . . . . . . . . . . . . . . . . 90 3.1 Results of the saliency maps of the Itti et al.’s model and our spherical spectral residual model . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.2 Visualization of the first five bands of spherical harmonics functions . 100 3.3 The reconstructed images with the first 15 bands of SH coefficients . 105 3.4 Visualization of the spectral residual maps between different bands of spherical harmonics . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.5 The visual comparison between the Itti et al.’s model and our SSR model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 3.6 Results of the Itti et al.’s model and our SSR model with horizontal translation and spherical rotation . . . . . . . . . . . . . . . . . . . . 110 3.7 The interpolated camera trajectory with our virtual cinematography approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.8 Quantitative comparison between the MaxCoverage model and the SpatioTemporal Optimization model . . . . . . . . . . . . . . . . . . 117 4.1 Overview of the Video Fields system . . . . . . . . . . . . . . . . . . 120 4.2 Conventional surveillance interface for monitoring surveillance videos 121 4.3 The workflow of the Video Fields system . . . . . . . . . . . . . . . . 122 4.4 Segmentation results with Gaussian mixture models of the background124 4.5 The overview of the Video Fields mapping algorithm . . . . . . . . . 127 4.6 Results before and after the perspective correction . . . . . . . . . . . 128 4.7 This figure shows the segmentation of moving entities, view-dependent rendering, as well as zoom-in comparison between the early pruning algorithm and the deferred pruning algorithm. . . . . . . . . . . . . . 131 4.8 Results before and after the visibility test and opacity adjustment . . 132 5.1 Overview of VRSurus . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.2 Sketches of our conception of VRSurus . . . . . . . . . . . . . . . . . 138 5.3 Mechanical design of VRSurus in details . . . . . . . . . . . . . . . . 139 5.4 Illustration of the gestures recognition in VRSurus . . . . . . . . . . 139 5.5 Virtual characters in VRSurus . . . . . . . . . . . . . . . . . . . . . . 141 5.6 During the preliminary deployment, participants from the UIST com- munity interacted with our prototype of VRSurus at the ACM UIST 2015 Student Innovation Contest . . . . . . . . . . . . . . . . . . . . 142 5.7 Results and overview of our system, ARCrypt . . . . . . . . . . . . . 143 5.8 Examples of visual cryptography . . . . . . . . . . . . . . . . . . . . 147 xii 5.9 Prior work in AR-based visual cryptography system . . . . . . . . . . 148 5.10 A real case of misalignment challenge for visual cryptography using augmented reality headsets . . . . . . . . . . . . . . . . . . . . . . . . 
149 5.11 Results amongst the classical visual cryptography approach and AR- Crypt with different parameters . . . . . . . . . . . . . . . . . . . . . 152 6.1 Overview of the input and results of the Holoportation and Mon- tage4D systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 6.2 The workflow of the Montage4D rendering pipeline . . . . . . . . . . 164 6.3 Iterative results after the occlusion test, dilation, color voting, and the Montage4D pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . 168 6.4 Examples of misregistration and occlusion seams. . . . . . . . . . . . 170 6.5 Illustration of computing the approximate geodesics . . . . . . . . . . 171 6.6 Iterative results for updating the geodesics over the mesh . . . . . . . 172 6.7 Spatiotemporal comparison between Holoportation and Montage4D . 173 6.8 Temporal transition within half a second after an abrupt change in viewpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.9 Comparison with the Holoportation approach . . . . . . . . . . . . . 176 6.10 Comparison of different texturing algorithms . . . . . . . . . . . . . . 179 6.11 Results before and after applying the sketchy stippling effects. . . . . 182 6.12 Results of our real-time sketchy stippling stylization . . . . . . . . . . 183 6.13 Results before and after the interactive relighting pass with different light probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 6.14 Limitations of our approach. Extruded triangles and highly-occluded spots may still cause artifacts. . . . . . . . . . . . . . . . . . . . . . 186 xiii List of Abbreviations 2D Two-Dimensional 3D Three-Dimensional 4D Four-Dimensional AR Augmented Reality CUDA Compute Unified Device Architecture EXIF Exchangeable Image File Format FPS Frames Per Second GMM Gaussian Mixture Models GPU Graphics Processing Unit GPGPU General-Purpose computing on Graphics Processing Units GSV Google Street View GUI Graphical User Interface HCI Human Computer Interaction HMD Head-Mounted Display IMU Inertial Measurement Unit MR Mixed Reality MRF Markov Random Fields OpenCV Open source Computer Vision library OpenGL Open Graphics Library OpenMP Open Multi-Processing RMSE Root Mean Square Error SH Spherical Harmonics SSR Spherical Spectral Residual SSV Social Street View SQL Structured Query Language VR Virtual Reality VC Visual Cryptography WebGL Web Graphics Library xiv Chapter 1: Introduction With recent advances in the commoditization of virtual and augmented reality (VR and AR) displays, there is an increasing demand for three-dimensional (3D) content. But where will the massive amount of 3D data come from? Manually craft- ing dynamic 3D scenes will not be the ultimate solution, since it usually requires tremendous efforts from professional artists and designers. In this dissertation, we present our research in automatically fusing multimedia data, including text, im- ages, 360° panoramas, and multiview videos, into dynamic virtual environments. We devise novel system architectures, visualization strategies, and rendering tech- niques to address the challenges for each type of data with real-time constraints. Our research may catalyze several VR and AR applications: mixed-reality social platforms, immersive surveillance, and real-time telepresence. 
1.1 Social Street View: Blending Immersive Maps with Geotagged Social Media

In Chapter 2, we present Social Street View [1], the first interactive system for rendering geotagged social media in its natural geospatial context provided by immersive maps. An overview is shown in Figure 1.1.

(a) Interface of Social Street View (b) Deployment in curved tiled displays (c) Crowdsourced Tourism (d) Immersive storytelling

Figure 1.1: Interface, deployment, and use cases of Social Street View. (a) shows the graphical user interface of Social Street View. Given a custom location query with spatiotemporal filters, our system automatically fetches geotagged social media from Twitter and Instagram and immersive maps from Google Street View. The social media images are aligned with building geometry and laid out aesthetically with the immersive map. (b) shows the deployment in an immersive curved screen environment with 15 projectors. Users can interactively explore hundreds of social media messages near a New York City street with a game controller at a resolution of 6K × 3K pixels. (c) shows a use case for crowdsourced tourism, where users can look through the museum in Paris for geotagged artworks inside as well as the dishes in nearby restaurants. (d) shows an example for immersive storytelling, where geotagged photographs of a football game are embedded into the 360° stadium where the event occurred. Please refer to the supplementary materials at http://socialstreetview.com for more examples.

Social Street View consists of a street view scraper, a social media scraper, distributed SQL databases, a web server powered by Apache and PHP, and optional modules such as a temporal filter, a geolocation filter, and a machine-learning-based face filter. Given a requested location, our system first creates a 3D world using the tiles of panoramic data, depth maps, normal maps, and road orientations from Google Street View1. Taking the number and the total area of the surrounding building surfaces into account, Social Street View classifies the immersive maps into urbanscapes or landscapes. For urbanscapes, our system takes advantage of the depth and normal maps, visual saliency, and maximal Poisson-disc placement sampling to render social multimedia in an aesthetically pleasing manner with geospatial renderings. For landscapes, our system uses the road orientations to place social media alongside the street. Furthermore, we present a system architecture that is able to stream geotagged social media and render it across a range of display platforms spanning tablets, desktops, head-mounted displays, and large-area room-sized curved tiled displays. We explore several potential use cases including immersive social storytelling, learning about the culture, and crowdsourced tourism.

1 Google Street View: http://www.google.com/maps/streetview

1.1.1 TopicFields: Spatiotemporal Visualization of Geotagged Social Media With Hybrid Topic Models and Scalar Fields

While Social Street View offers an immersive street-level interface that interleaves visual navigation of our surroundings with social media content, how can we acquire an overview of the geotagged social media? What are the dominant topics over time and how are they distributed on the map?
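Before turning to these questions, the placement step sketched above can be made concrete. The following Python snippet is a minimal, hypothetical sketch of saliency-aware, distance-constrained billboard placement: it uses plain dart throwing with rejection rather than the maximal Poisson-disc sampler used in Social Street View, and the function name, image-space domain, and thresholds are illustrative assumptions rather than our actual implementation (detailed in Chapter 2).

```python
import math
import random

def place_billboards(width, height, min_dist, is_salient, max_tries=3000):
    """Dart-throwing sketch: propose random anchors on an equirectangular
    panorama, reject anchors that would cover salient pixels, and keep only
    anchors at least `min_dist` pixels away from previously accepted ones.
    This is basic rejection sampling, not the maximal Poisson-disc variant."""
    anchors = []
    for _ in range(max_tries):
        x, y = random.uniform(0, width), random.uniform(0, height)
        if is_salient(x, y):  # leave visually salient regions uncovered
            continue
        if all(math.hypot(x - ax, y - ay) >= min_dist for ax, ay in anchors):
            anchors.append((x, y))
    return anchors

# Example usage with a dummy saliency test (a real saliency map would be
# thresholded upstream); 4096 x 2048 is a typical equirectangular resolution.
if __name__ == "__main__":
    spots = place_billboards(4096, 2048, min_dist=300,
                             is_salient=lambda x, y: y < 200)
    print(len(spots), "candidate billboard anchors")
```

In the full pipeline, accepted anchors are then aligned with building facades using the depth and normal maps for urbanscapes, or placed alongside the street using the road orientations for landscapes.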
To address these questions, in Section 2.8, we present TopicFields, an interactive system to explore, aggregate, and visualize geotagged social media using hybrid topic models, scalar fields, and stream graphs. An overview of TopicFields is shown in Figure 1.2.

Figure 1.2: Interface and visualization of social media in the Manhattan district from TopicFields. The top-left map view is overlaid with an interactive scalar map, with each color indicating a different topic: fashion, art, or park. We show the ground truth labels such as Central Park and The Museum of Modern Art from Google Maps for reference. The detail view at the bottom shows the corresponding social media text, image, or video, as the user explores and clicks on the map. The control panel on the right allows the user to select, add, and modify the topics to explore. The stream graph shows the volume of social media of different topics over the queried time.

In the data processing stage, we apply two machine learning models, Word2Vec [2] and Inception-v3 [3], to the data and reveal the relationships among the extracted topics by rearranging them via spectral ordering. In the visualization stage, we allow users to interactively select the preferred topics and alter the transfer function for visualizing the social media on a map. Our system, TopicFields, can efficiently estimate the kernel density distribution and visualize the scalar fields of the requested topics on a map on the GPU. In addition, we use temporal filters and stream graphs to enhance the comprehensibility of the data over time. We further demonstrate the effectiveness of TopicFields with potential use cases such as trip planning and event discovery.

1.1.2 Geollery: Designing an Interactive Mirrored World with Geotagged Social Media

In Social Street View, the user interaction is limited to street-level panoramas. What if we could create an interactive 3D mirrored world with the same availability as a 2D map? Motivated by this challenge, we have developed Geollery, an interactive mixed-reality platform for creating, sharing, and exploring geotagged social media (Figure 1.3). Geollery introduces a real-time pipeline to progressively render an interactive mirrored world with 3D buildings, user-generated content, and external geotagged social media. This mirrored world allows users to see, chat, and collaborate with remote participants with the same spatial context in an immersive environment.

(a) Interface and real-time rendering results of Geollery with three participants (b) Virtual meetings (c) Trip Planning (d) Discovery of local restaurants

Figure 1.3: Interface, rendering results, and use cases of Geollery. (a) shows the interface and real-time rendering results of Geollery with three participants. Geollery progressively fuses 3D buildings, virtual avatars, and geotagged social media into a mirrored world. (b), (c), and (d) show three potential applications: virtual meetings, trip planning, and discovery of local businesses and events.

We conduct a user study with semi-structured interviews to qualitatively compare Geollery with Social Street View. From a Welch's paired t-test, we found a significant
effect for interactivity (t(20) = 3.04, p < 0.01, Cohen's d = 0.83) and creativity (t(20) = 2.10, p < 0.05, Cohen's d = 0.66), with Geollery outperforming Social Street View. We point out that data sources, interactivity, immersive textures, and customization play significant roles in designing an interactive mirrored world with geotagged social media. User feedback from our study reveals several use cases for Geollery including travel planning, virtual meetings, and discovery of local businesses. Please refer to https://www.geollery.com for a live demonstration.

1.2 Spherical Harmonics for Saliency Computation and Virtual Cinematography in 360° Videos

When applying the saliency metrics to the immersive maps in Social Street View, we found that classic saliency models may not work directly on 360° images: the results are inconsistent under horizontal translation and spherical rotation of the same 360° image. In Chapter 3, we present a novel GPU-driven pipeline for saliency computation and virtual cinematography in 360° videos using spherical harmonics (SH).

First, we present the preprocessing for computing the SH coefficients that represent the 360° videos. Our pipeline pre-computes a set of the Legendre polynomials and SH functions and stores them in GPU memory. We adopt the highly parallel prefix sum algorithm to integrate feature maps of the downsampled 360° frames as 15 bands of spherical harmonics coefficients on the GPU.

Next, by analyzing the spherical harmonics spectrum of the 360° video, we extract the spectral residual by accumulating the SH coefficients between a low band and a high band. As shown in Figure 1.4, our spherical spectral residual (SSR) model reveals the multi-scale saliency maps in the spherical spectral domain and reduces the computational cost by discarding the low bands of SH coefficients. From the experimental results, our SSR model outperforms the classic Itti et al.'s model by 5× to 13× in timing and runs at over 60 FPS for 4K videos.

(a) The input 360° video frame (b) Saliency map by Itti et al.'s model (c) Saliency map by our SSR model (d) Visualization of the normalized saliency maps between two bands of spherical harmonics

Figure 1.4: (a) shows an input frame, (b) shows the saliency map by Itti et al.'s model, (c) shows the saliency map by our model, and (d) shows the spectral residual maps between a low band and a high band of spherical harmonics. The number along the horizontal axis indicates the high band Q, while the vertical axis indicates the low band P. Note that the saliency maps within or close to the orange bounding box successfully detect the two people in the frame.

Finally, our interactive computation of spherical saliency can be used for saliency-guided virtual cinematography in 360° videos. We formulate a spatiotemporal model to ensure large saliency coverage and smooth camera movement. Since the frame rate of the video may be lower than the refresh rate of the display, we further employ a spline interpolation amongst the rotational quaternions of the virtual camera, as shown in Figure 1.5(a). Compared with the baseline model, which only maximizes the saliency coverage, our model significantly reduces the camera movement jitter, as shown in Figure 1.5(b).
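The camera path in Chapter 3 uses a spline interpolation over the quaternion control points. As a simplified, self-contained sketch of the upsampling idea, the Python snippet below applies piecewise spherical linear interpolation (slerp) between successive control orientations; the keyframe format and the sampling rate are assumptions for illustration, not the exact formulation used in our system.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:            # take the shorter arc on the unit 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def upsample_camera_path(keyframes, samples_per_segment):
    """Turn per-video-frame orientations (e.g., 30 fps) into per-display-frame
    orientations (e.g., a 90 Hz headset needs samples_per_segment = 3)."""
    path = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for i in range(samples_per_segment):
            path.append(slerp(a, b, i / samples_per_segment))
    path.append(np.asarray(keyframes[-1], dtype=float))
    return path
```

A full quaternion spline additionally smooths the angular velocity across control points, whereas piecewise slerp only guarantees continuity of the orientation itself.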
(a) Spline interpolation of the virtual camera quaternions (b) Qualitative evaluation for virtual cinematography

Figure 1.5: (a) shows the smooth spline-interpolated path of the virtual camera quaternions. (b) shows the qualitative comparison between the MaxCoverage model and the SpatioTemporal model for virtual cinematography; the chart plots the temporal motion, in diagonal degrees, of each model over the frame number.

1.3 Video Fields: Fusing Multiple Surveillance Videos into a Dynamic Virtual Environment

In Chapter 4, we present Video Fields, a novel web-based interactive system to create, calibrate, and render dynamic video-based virtual reality scenes in head-mounted displays, as well as high-resolution wide-field-of-view tiled display walls.

The Video Fields system consists of a camera-world calibration interface, a back-end server to process and stream the videos, and a web-based rendering system. As shown in Figure 1.6(a), Video Fields requires customized 3D models and multiple surveillance videos as input. In the calibration interface, we provide a four-stage workflow for the user: importing and synchronizing the surveillance videos, defining the ground projection plane, drawing initial building geometries, and adjusting the camera positions and rotation quaternions.

(a) Input: 3D models and multiview surveillance videos (b) Processing: Calibrate the camera matrices and segment the moving entities (c) Output: View-dependent rendering of surveillance videos in dynamic virtual environments

Figure 1.6: The Video Fields system fuses multiple videos, camera-world matrices from a calibration interface, static 3D models, as well as satellite imagery into a novel dynamic virtual environment. Video Fields integrates automatic segmentation of moving entities during the rendering pass and achieves view-dependent rendering in two ways: early pruning and deferred pruning. Video Fields takes advantage of the WebGL and WebVR technology to achieve cross-platform compatibility across smartphones, tablets, desktops, high-resolution tiled curved displays, as well as virtual reality head-mounted displays. See the supplementary video at http://videofields.com

To estimate robust background images, we take advantage of Gaussian Mixture Models (GMM) for background modeling. Compared with the mean filter or the
The user can also adjust the opacity of every object, thus allowing themselves to “see through the buildings”. We recorded three 10-minute video clips with the resolution of 1280× 720 pixels and tested both the early- and deferred-pruning algorithms in three settings. The experimental results indicate that the early pruning approach is more efficient than the deferred pruning. Nevertheless, the deferred pruning achieves better results through anti-aliasing and bi-linear interpolation. We envision the use of the system and algorithms introduced in Video Fields for immersive surveillance monitoring in virtual environments. 1.4 Integrating Gesture Recognition, Tactile Feedback, and Visual Cryptography into Virtual Environments Dynamic virtual environments can further be enhanced with gestures, haptic feedback, and visual cryptography. In Chapter 5, we present our research in gesture recognition, tactile feedback (VRSurus), and visual cryptography (ARCrypt) for VR and AR technologies. 11 1.4.1 VRSurus: Enhancing Interactivity and Tangibility of Puppets in Virtual Reality VRSurus is a smart device designed to recognize the puppeteer’s gestures and render tactile feedback to enhance the interactivity of physical puppets in virtual reality. As shown in Figure 1.7, VRSurus is wireless, self-contained, and small enough to be mounted upon any puppets. (a) Conceptual design (b) Development of VRSurus (c) Deployment at ACM UIST Figure 1.7: (a) shows the conceptual design of VRSurus, where a puppeteer could control an elephant puppet in both physical and virtual reality and receive haptic feedback. (b) shows the research prototype of VRSurus: a custom 3D-printed cap that encapsulates the Arduino microcontroller, an inertial measurement unit, and other electronic modules. (c) shows the deployment of VRSurus at ACM UIST 2015 Student Innovation Contest, where 63 participants including two K12 special- ists tried out our device. See the supplementary video, open sourced software and hardware at http://videofields.com For gesture recognition, we train the classifier using the decision tree algorithm in Weka [4]. In total, we facilitate the following sixteen features: the sum of mean values on all axes, the differences between each pair of axes, the power of each axis, the range of each axis, the cross product between every two axes and the standard 12 deviation of values along each axis. To recognize the four gestures: idle, swiping, shaking and thrusting, we collected 240 sets of raw accelerometer values for each gesture from 4 lab members (60 sets per gesture per person). For haptic feedback, we install various actuators, (e.g., solenoids, servos, and vibration motors) to provide tactile feedback on the puppeteer’s forearm and animate the physical puppetry. Finally, we deployed VRSurus at ACM UIST 2015 Student Innovation Contest for 2.5 hours. In total, 63 participants from the UIST community including two invited K12 specialists tried out our device. There were more than a hundred people who watched our device as well as the gaming procedure. Our 3D models, circuitry and the source code are publicly available at www.VRSurus.com. 1.4.2 ARCrypt: Visual Cryptography with Misalignment Tolerance using Augmented Reality Head-Mounted Displays ARCrypt uses augmented reality and visual cryptography with misalignment tolerance to safeguard the information depicted on ordinary displays. As shown in Figure 1.8, ARCrypt partitions the confidential data into two shares. 
It sends one to a Microsoft HoloLens and the other one to a regular display. Even if one share of the data is stolen by hackers, the original, complete data cannot be restored from it. Only the recipient with both shares can visually decrypt the data by fusing the two shares of images through HoloLens.

However, head jittering is identified as one of the greatest challenges for users to fuse the two images in head-mounted displays [5]. In ARCrypt, we model the probability of misalignment caused by head jittering as a Gaussian distribution. Each foreground pixel has a probability to be misaligned with one of its surrounding pixels; in this way, when the two shares match perfectly, the background is unchanged, but the foreground is darker. Therefore, the new algorithm enables the visual cryptography to be tolerant of misalignment when using AR head-mounted displays.

Figure 1.8: Results and overview of our system, ARCrypt. ARCrypt is able to split a confidential message into two shares of images, which guarantees that the original information cannot be revealed with either share of the image alone. The ARCrypt system then transmits the two shares to an ordinary display and an augmented reality head-mounted display, respectively. When the user looks through the two aligned images, the secret message is revealed directly to the user's human visual system. Nevertheless, head jittering may cause the two images to misalign with each other. The proposed ARCrypt algorithm outperforms the original visual cryptography algorithm in the presence of one or two rows of misalignment.

1.5 Montage4D: Real-time Seamless Fusion and Stylization of Multiview Video Textures

In Chapter 6, we present Montage4D as the successor of Video Fields. Montage4D is a practical solution towards real-time seamless texture montage for dynamic multiview reconstruction. We illustrate its workflow in Figure 1.9.

(a) Input videos and mesh (b) Texture field and results of Holoportation (c) Texture field and results of Montage4D (d) The workflow of the Montage4D rendering pipeline

Figure 1.9: Results and workflow of the Montage4D system. (a) shows the input multiview videos and mesh, (b) shows the texture field and results of Holoportation, (c) shows the texture field and results of Montage4D, where we dramatically mitigate seams and blurring issues, and (d) shows the workflow of the Montage4D rendering pipeline.

We build on the ideas of dilated depth discontinuities and majority voting from Holoportation to reduce ghosting effects when blending textures. In contrast to their approach, we determine the appropriate blend of textures per vertex using view-dependent rendering techniques, so as to avert fuzziness caused by the ubiquitous normal-weighted blending.

We further identify the potential causes of visible seams: self-occlusion, misprojected colors, and field-of-view boundaries. For the datasets acquired for real-time telepresence applications, we have observed the fraction of seam triangles to be less than 1%. This observation has guided us to process the triangles adjacent to the seams, using a propagation procedure by calculating the geodesics directly on the GPU. We follow a variant of the highly efficient approximation algorithm described
To prevent the texture weights from changing too fast during view transitions, we leverage temporal texture fields to mitigate spatial occlusion seams while preserving temporal consistency. (a) Pairwise comparison between Holoportation (left) and Montage4D (right) Light Probe (b) Real-time stylization and relighting modules in Montage4D Figure 1.10: (a) shows the pairwise comparative results between Holoportation (left) and Montage4D (right). (b) shows examples of the real-time stylization and relighting components in the Montage4D system. As shown in Figure 1.10(a) and Table 1.1, the experimental results demon- strate significant enhancement in rendering quality, especially in detailed regions such as faces. Furthermore, in Figure 1.10(b), we present our research towards real-time stylization and relighting to empower the Holoportation users to interac- tively stylize live 3D content. We envision that Montage4D will greatly facilitate a wide range of applications, including immersive business meetings, family gathering, and remote education. Please refer to www.montage4d.com for the supplementary 16 materials. Table 1.1: Comparison between Holoportation and Montage4D in cross-validation experiments. Note that Montage4D outperforms the Holoportation approach in visual quality while maintaining an average frame rate of over 100 Hz. Dataset Frames #tri Holoporation Montage4DSSIM PSNR FPS SSIM PSNR FPS Timo 837 251K 0.9805 38.60dB 227.2 0.9905 40.23dB 135.0 Yury 803 312K 0.9695 39.20dB 222.8 0.9826 40.52dB 130.5 Sergio 837 404K 0.9704 29.84dB 186.8 0.9813 30.09dB 114.3 Girl 1192 367K 0.9691 36.28dB 212.56 0.9864 36.73dB 119.4 Julien 526 339K 0.9511 33.94dB 215.18 0.9697 35.05dB 120.6 17 Chapter 2: Social Street View: Blending Immersive Street Views with Geotagged Social Media 2.1 Introduction Social media plays a vital role in our lives because of its interactivity, versa- tility, popularity, and social relevance. Every day, billions of users create, share, and exchange information from their lives among their social circles [7]. Social me- dia spans several modalities that include text, photos, audio, videos, and even 3D models. In addition to the content visible on the social networks, social media also consists of metadata that is useful for understanding the relationship amongst users, sentiment mining, and propagation of influence. Specifically, digital photographs of- ten embed metadata such as the time of creation, the location of creation (through GPS coordinates), and camera parameters, that are included in the EXIF data fields. In spite of the availability of such rich spatiotemporal metadata, the current- generation social media content is most often visualized as a linear narrative, rarely in a 2D layout, and almost never in a natural immersive space-time setting. For small screens of mobile devices, perhaps a linear narrative ordered by time or relevance, makes the most sense given the limitations of interaction modalities and display 18 real-estate. However, in immersive virtual environments such as those afforded by virtual and augmented reality headsets, a spatiotemporal view of the social media in a mixed reality setting may be the most natural. (a) (b) (c) Figure 2.1: Results from our system, Social Street View. (a) shows the render- ing results in a regular display. Users can look through the museum in Paris for ° artworks inside as well as the dishes in nearby restaurants. (b) shows the stereo rendering results in a VR headset. 
Geotagged images are automatically aligned with building geometry and laid out aesthetically. (c) shows the deployment in an immersive curved screen environment with 15 projectors. Users can explore hundreds of social media messages near a New York City street at a resolution of 6K × 3K pixels. Please refer to the supplementary video at http://socialstreetview.com.

Immersive interfaces that interleave visual navigation of our surroundings with social media content have not yet been designed. NewsStand [8], Flickr [9], and Panoramio1 have taken important first steps towards this goal by using a user's geolocation information on 2D maps to display content. Still, we have not come across any system that (a) enables user exploration of social media in an immersive 3D spatial context in real time, and (b) allows temporal filtering of social messages in their spatial setting. Such a system will facilitate new genres of social interactions in spatial contexts mediated through virtual and augmented reality. These could be widely adopted in immersive social storytelling, learning about the culture, and crowd-sourced tourism.

1 Panoramio: https://en.wikipedia.org/wiki/Panoramio

As a proof-of-concept, we have developed a prototype system called the Social Street View (SSV) (Figure 2.1), the first immersive social media navigation system for exploring social media in large-scale urbanscapes and landscapes. Given a requested location, SSV builds a 3D world using tiles of panorama data from Google Street View2 and Bing Maps Streetside3, depth and normal maps, and road orientations. It then downloads geotagged data near the requested location from public-domain social media sites such as Instagram and Twitter. After building the 3D world, SSV renders the social media onto buildings or as virtual billboards along the road. The user can see photos of food uploaded by social media users next to the relevant restaurants, revisit visual memories with friends in specific locations, identify accessibility issues on roads, and preview the coming attractions on scenic drives and nature hikes.

The main contributions of our work are:

• conception, architecting, and implementation of Social Street View, a mixed reality system that can depict geotagged social media in an immersive 3D environment,

• blending multiple modalities of panoramic view metadata, including depth maps, normal maps, and road orientation, with social media metadata including GPS coordinates and the time of creation,

• enhancing visual augmentation by using maximal Poisson-disk sampling and image saliency metrics,

• using WebGL and WebVR to achieve cross-platform compatibility across a range of clients including smartphones, tablets, desktops, high-resolution large-area wide-field-of-view tiled display walls, as well as head-mounted displays.

2 Google Street View: http://www.google.com/maps/streetview
3 Bing Maps Streetside: http://www.bing.com/mapspreview

2.2 Background and Related Work

Our work builds upon a rich literature of prior art on the creation of immersive maps as well as related work in visual management of geotagged information, analysis of geotagged social media, and mixed reality in immersive maps.

2.2.1 Immersive Maps

In this chapter, we use the term immersive maps to refer to online services that provide panoramic 360° bubbles at multiple way-points. Since Google Street View (GSV) debuted in 2007, several immersive maps have covered over 43 countries throughout the world.
Most street views are captured using a car equipped with a spherical array of cameras. For places inaccessible to ordinary cars, volunteers and trekkers on foot, tricycle, trolley, camel, snowmobile, and even underwater apparatus capture immersive panoramas [10]. Therefore, the latest immersive maps include not only outdoor urbanscapes, but also indoor areas, rural areas, forests, deserts, and even under-water seascapes. Images from a spherical array of cameras can recover depth and reconstruct 3D point clouds using structure-from-motion algorithms [11]. Recently, laser scanners are being coupled with the cameras to directly acquire depth with omnidirectional panoramas.

2.2.2 Visual Management of Geotagged Information

As a geographic information system, Social Street View is most closely related to Panoramio, NewsStand [8], and PhotoStand [12] (Figure 2.2). These systems accomplish the visual management of geotagged information on 2D maps. Panoramio is one of the first systems that collects user-submitted scenery photographs and overlays them on a 2D map. NewsStand [8] is a pioneering system that allows users to interactively explore photos directly from news articles depending on the query location and the zoom level on a 2D map. Its successor, TwitterStand [13], is able to visualize tweets on a 2D map by using geotagged information as well as inferring geospatial relevance from the content of the tweets. Recently, PhotoStand [12] has shown how to visualize geotagged images from real-time scraped news on a 2D map. A primary distinction between Social Street View and the above systems is the use of immersive maps instead of 2D maps. In virtual environments, immersive maps require us to address challenges such as visual clutter, designing the content layout in 3D, and registration of pictorial information.

Figure 2.2: Comparison with prior art in visualizing geotagged news, social media, or photographs. (a) TwitterStand [13] (b) PhotoStand [12] (c) Panoramio (d) Social Street View. All snapshots were taken from their public websites in March 2015.

In 3D environments, the well-known view-management system by Bell et al. [14] registers user-annotated text and images to a particular point in 3D space. Their algorithm reduces visual discontinuities in dynamically labeled and annotated environments. Our system does not involve manual steps to visualize social media in immersive maps. SSV also reduces the visual clutter by maximal Poisson-disk sampling or road orientations. Recent efforts in novel social media visualization interfaces include Social Snapshot [15], Photo Tourism [16], and 3D Wikipedia [17]. Instead of using immersive maps, Photo Tourism uses image-based modeling and rendering for navigating thousands of photographs at a single location. However, since Photo Tourism performs 3D scene reconstruction from unstructured photos, it is slow, taking a few hours to process a few hundred photos. The system also discards noisy, dark, and cluttered photos due to registration failure. Similarly, 3D Wikipedia automatically transforms text and photos to an immersive 3D visualization but suffers from a relatively slow bundle adjustment and multi-view reconstruction. In contrast, Social Street View takes advantage of the large-scale 2.5D data and can visualize multiple geo-relevant photos in immersive environments at interactive rendering rates.
Creating an immersive visualization of geotagged social media is a challenging task due to the lack of 3D data. For example, reconstructing a 3D mirrored world from images typically requires hours or days of intensive computation. Early seminal work [17, 18, 19, 20, 21] focuses on offline, image-based modeling approaches to generate virtual 3D cities. In these systems, 3D models are generated from a large collection of unstructured photos via different structure-from-motion pipelines. Since the debut of Social Street View [1], recent research has offered more practical solutions to integrate geotagged social media with pre-reconstructed cities in several minutes [22], and with virtual city models [23, 24]. However, these approaches are not quite applicable to real-time applications. On the one hand, the texturing [22] of 3D buildings suffers from artifacts on complex geometries. On the other hand, the pre-crafted digital cities used in [22, 24, 25, 26, 27] are usually unavailable in rural areas and require enormous amounts of collaborative work from crowdsourced workers, artists, researchers, and city planners [28, 29, 30, 31]. Moreover, without a partitioning algorithm, the digital cities (over 100 MB, as mentioned in [24]) may be a bottleneck for practical online deployment.

In contrast to the prior art, we further develop Geollery and circumvent the offline reconstruction or manufacture of a digital city by progressively streaming open 2D maps. With 2D polygons and labels, Geollery extrudes and textures geometries on demand in real time using nearby street view data, enabling visualization of geo-relevant social media in their spatial context and allowing user interaction in a mirrored world. As for human factors in 3D social media platforms, Kukka et al. [23] conduct a pioneering qualitative anticipated-user-experience study with 14 participants to explore the design space of geospatial visualization of social media in mirror worlds. Nevertheless, human factors have not yet been fully discussed for experiencing a real-time mixed-reality social platform such as Geollery or Social Street View [1]. We conduct a comparative study with 20 participants and derive key insights from semi-structured interviews. Our qualitative evaluation further reveals the strengths and weaknesses of Geollery and Social Street View.

2.2.3 Analysis of Geotagged Social Media

An important question for us to consider is how accurately the geotagged media correspond to their real geographic locations. Zielstra and Hochmair [32] conducted an experiment to investigate the positional accuracy of 211 image footprints for 6 cities in Europe by comparing the geotagged position of photos to the most likely place that they were taken, based on image content. In this study, they found that Panoramio images have a median error distance of 24.5 meters, and Flickr images have a median error distance of 58.5 meters. With the growing popularity of global positioning system (GPS) and Wi-Fi positioning systems on mobile devices, geospatial metadata is becoming increasingly reliable and accurate. We have observed that a location query may occasionally return some irrelevant social media, but this is not common. Nevertheless, we have implemented a mechanism for users to report the abuse of geotagged social media in our system. Previous research also explores various ways to analyze and visualize geotagged information on 2D maps.
For example, MacEachren et al. [33] present a seminal system for visualizing heat maps of health reports on a map. Their follow-up work, SensePlace2 [34], presents a geospatial visualization of Twitter messages with user-defined queries, time filters, spatial filters, and heat maps of tweet frequencies. Chae et al. [35] present a social media analysis system with message plots on a map, topic filtering, and abnormality-estimation charts. Recent research also focuses on gridded heat maps [36], multivariate kernel methods [37], movement patterns [38], Reeb graphs [37], sentiment modeling [39, 40, 41], and flow visualizations of spatio-temporal patterns [42]. Using domain-specific knowledge, previous research has analyzed geotagged social media to improve emergency responses [43, 44], assist disease control [45], understand the dynamics of neighborhoods [46] and cities [47, 48], and plan travel routes [49].

The key differentiator of our work is the ability to explore geotagged information in immersive 3D environments. Geollery, the successor of Social Street View, is able to create a digital city progressively in real time and fuse geotagged social media into the mirrored world. For 2D spatiotemporal analysis, we present TopicFields to offer social-media topic extraction, spectral ordering of related topics, and exploratory visualization of the scalar fields of multiple topics.

2.2.4 Mixed Reality in Immersive Maps

Past work on mixed reality in immersive maps has generally required users to manually augment content for immersive maps. Devaux and Paparoditis [50] added laser-scanned depth data to some street views and enabled users to manually add images or videos at a desired 3D position. Their system also had additional interactive features, including human labeling, a crowdsourcing mode to blur faces, and localizing and measuring objects. In contrast, by automatically extracting proximal social media content, SSV dynamically enhances the user experience in immersive maps and allows users to focus on social media interactions. Korah and Tsai [51] convert large collections of LIDAR scans and street-view panoramas into a presentation that extracts semantically meaningful components of the scene via point-cloud segmentation. In addition, they propose an innovative compression algorithm and also show how to augment the scene with physics simulation, environmental lighting, and shadows. Past research on the analysis of immersive maps also addresses the important problems of segmentation, human recognition, and accessibility identification. For instance, Xiao and Quan [52] propose a multi-view semantic segmentation framework using pair-wise Markov Random Fields (MRF) to differentiate ground, buildings, and people in street views. The authors of [53] propose a novel algorithm to replace any unknown pedestrian with another one extracted from a controlled dataset. The authors of [54] designed a visual interface that uses a machine learning algorithm to facilitate crowd-sourcing-driven identification of accessibility issues in immersive maps, such as missing curb ramps, for people in wheelchairs.

2.3 System Architecture

In this section, we present an overview of the system architecture of Social Street View, shown in Figure 2.3. Social Street View consists of a street view scraper, a social media scraper, distributed SQL databases, a web server powered by Apache and PHP, and optional modules such as a temporal filter, a geolocation filter, and a computer-vision-based face filter.
2.3.1 Street View Scraper

Our street view scraper is a custom web-scraper tool written in JavaScript and PHP that downloads GIS-related panoramic images and metadata at any geolocation where GSV data is available.

Figure 2.3: The workflow of Social Street View. Our system streams data from two scrapers based on users' geolocation requests and renders social media in WebGL. Users can access the system via any WebGL-supported browser on a desktop, a tablet, a head-mounted display, or an immersive room-sized tiled display (see Figure 2.1).

Our tool is inspired by GSVPano.js4, but we additionally scrape normal maps, infer building surfaces, and make use of the road orientations. Currently, we request all street view data from the Google Maps API. Each location query is analyzed by regular expressions, so it can be either a mailing address (e.g., 129 St., New York) or a pair of latitude and longitude coordinates (e.g., 40.2384, −70.2394). For each location query, the scraper finds the closest panorama and downloads the following types of data:

1. Tiles of panoramic images at five resolutions, from the highest of 13312×6656 pixels to the lowest of 832×416 pixels. A stitched panorama is shown in Figure 2.4(a).

4 GSVPano.js: https://github.com/heganoo/GSVPano

Figure 2.4: The illustration of (a) stitched panoramic image tiles, (b) the depth map, and (c) the normal map from Google Street View. The depth map is visualized in a yellow-red-black color scheme, where black indicates 255 meters or more, red indicates 128 meters, and yellow indicates 0 meters. The normal map contains a 3D normal vector for each pixel. We visualize the normal data by converting the normal vector to HSV color space with blue-purple hues.

2. Depth map from the GSV metadata with a coarser resolution of 512×256 pixels. We normalize and up-sample the depth and normal maps in our GLSL shaders (Figure 2.4(b)).

3. Normal map of 512×256 pixels (Figure 2.4(c)).

4. Road orientation and heading direction indicating the travel direction of the GSV car or trekker.

5. Geolocation and other information including latitude, longitude, image age, and adjacent panoramas' hash IDs.

2.3.2 Mining Social Media

Our geotagged social media scraper is a back-end program written in PHP. Tuchinda et al. [55] have proposed to model web services as information sources in a mediator-based architecture and have built an exemplary application, Mashup. Using a similar architecture, Social Street View is able to integrate information from several web services. For now, we use Instagram as the major source of social media in our proof-of-concept system. Since Instagram only allows users to upload images or videos from mobile devices, it largely avoids incorrect geotagged data from desktop clients. Our social media scraper collects the following data:

1. Geospatial and textual location including latitude and longitude coordinates, street names, and the user-tagged location name.

2. Media type indicating whether it is an image or a video.

3. Caption and tags containing text information.

4. Published Time containing the exact time-stamp of publication.

5. User comments and likes reflecting the popularity level.

6. URL providing a link to the images or videos on the web.

The Instagram API supports social-media queries based on both geolocation and time. We use two distance thresholds, α = 10 m and β = 5 km, for dense urban areas and rural areas, respectively.
Given a street view panorama, the scraper first requests social media within a radius of α on Instagram. If nothing is found, the scraper increases the threshold to β and queries again. If either threshold returns social media content, we send out R requests to acquire data over the past R months (we typically use R = 12). This allows Social Street View to answer spatial queries with a temporal filter, such as "What do people wear in winter at this location?".

2.3.3 Servers and Relational Databases

At present, we use distributed MySQL databases to store information about visited immersive maps and social media to reduce the response time for duplicate or similar queries. One of the important components of our system is to build spatial data structures to efficiently answer proximity queries relating geotagged social media with spatially-located immersive panoramas. To accomplish this effectively, we build a bipartite graph that establishes edges between social media message nodes on one side and immersive panorama nodes on the other side. This allows us to quickly answer which social media messages are relevant to be shown in any panorama. Since this can result in an all-pairs quadratic relationship, we needed to do this in an efficient manner. Once the two scrapers complete their tasks, the back-end servers build the bipartite graph in a separate thread: G = ⟨V, S, E⟩, where V = {vi} are the visited street views, S = {sj} are the scraped social media, and E = {⟨vi, sj, dij⟩ | vi ∈ V, sj ∈ S} are the edges between V and S. The weights of the edges dij are defined by the distance between vi and sj according to the Haversine formula [56]:

αij = sin²((φi − φj)/2) + cos(φi) · cos(φj) · sin²((λi − λj)/2)    (2.1)

βij = 2 · atan2(√αij, √(1 − αij))    (2.2)

dij = R · βij    (2.3)

where φi, λi and φj, λj are the latitude and longitude of vi and sj, respectively; R is the radius of the earth (R = 6371 km); and the result dij is the great-circle distance between vi and sj.

Since both the social media and the street views are indexed by a B+ tree, insertion and query without building the graph take O(log|V| + L) time, where L = 100 is the maximum number of queried social media. There could be an additional cost of sorting based on the user's query and filters. However, the maintenance of the bipartite graph may take O(k · |V|) for k newly scraped social media Sk. To address this, given a street view vi, SSV calculates Si = {sj | ∀sj ∈ S ∧ ⟨vi, sj, dij⟩ ∈ E} in O(log|V| + L). If Si = ∅, Social Street View returns Sk and builds the bipartite graph at the back end for the next query; otherwise, we return Sk ∪ Si for more results. Thus, for each panorama, the streaming time is O(log|V| + L log L).
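For reference, the great-circle distance of Equations (2.1)–(2.3) can be sketched in JavaScript as follows. This is a minimal illustration; the function and variable names are ours and are not taken from the actual SSV code base.

```javascript
// Great-circle distance between a street view v_i and a social media post s_j,
// following Equations (2.1)-(2.3). Inputs are in degrees, output in kilometers.
const EARTH_RADIUS_KM = 6371;

function haversineDistance(latV, lonV, latS, lonS) {
  const toRad = (deg) => deg * Math.PI / 180;
  const phiI = toRad(latV);
  const phiJ = toRad(latS);
  const dPhi = toRad(latV - latS);
  const dLambda = toRad(lonV - lonS);

  // Equation (2.1): haversine of the central angle.
  const alpha = Math.sin(dPhi / 2) ** 2 +
    Math.cos(phiI) * Math.cos(phiJ) * Math.sin(dLambda / 2) ** 2;
  // Equation (2.2): central angle beta_ij in radians.
  const beta = 2 * Math.atan2(Math.sqrt(alpha), Math.sqrt(1 - alpha));
  // Equation (2.3): great-circle distance d_ij.
  return EARTH_RADIUS_KM * beta;
}
```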
2.4 Social Street View Interface

Using WebGL and WebVR, we have designed and implemented an open-source, cross-platform system that is easy to access via most modern browsers and is shown in Figure 2.5. Users can query any location in the input field on the top panel. The optional left panel has filters to specify query words, temporal (month and hour) ranges, distance ranges, the number of faces, and rendering controls. The bottom panel is an optional 2D visualization. We use Three.js5, a cross-browser, GPU-accelerated JavaScript library, together with Bootstrap6 and jQuery7 for the 2D elements of the user interface.

Figure 2.5: The Social Street View interface powered by WebGL and WebVR. The top navigation bar allows the user to switch between desktop and VR modes, input addresses, and enable different post-processing modes. The left control panel allows the user to search for certain keywords and filter based on the month of the year, the day of the week, and the distance to the current location. The top-left legend shows a mini-map indicating where the user is located. The bottom panel is a baseline 2D approach for displaying nearby social media.

5 Three.js: http://www.threejs.org
6 Bootstrap: http://www.getbootstrap.com
7 jQuery: https://www.jquery.com

2.5 Social Media Layout Algorithm

In this section, we present our approach to blending the visualization of social media with street view panoramas in an immersive mixed reality setting.

2.5.1 Baseline: 2D Visualization

Since there is limited previous work on visualizing multiple social media images in immersive maps, we first devised a basic 2D solution for users to get a quick glimpse of social media near their location. This is shown in the bottom panel of Figure 2.5. Users can click on any image to see a higher-resolution (640×640) image or video, the text caption, geolocation, and timestamp data. In addition, the user can follow the image link to Instagram to comment on, like, or forward the social media. By clicking the geolocation of the social media, the user can navigate to the street view panorama closest to the image in Social Street View. Compared with Stweet, which places a single Tweet message on a top layer above Google Street View with limited interactivity, this basic 2D visualization provides multiple queries near the center of the panorama with richer information. In addition, users can filter the social media based on a desired time range and distance to the center of the panorama.

2.5.2 Uniform Random Sampling

From Social Street View's server, the client acquires a subset of images or videos Ŝ = {si | i = 1 ... N}, Ŝ ⊆ S. Our goal is to place them naturally in low-saliency areas of an immersive map so that the social media rendering minimizes the visual clutter in a user's view. To do this, we first stitch the tiles of panoramic images into a rectangle T. Next, we apply a projection P : T → Ω from T to the sphere Ω, thus building an immersive panoramic map with an inside camera looking outwards. Each point pi = (ui, vi) ∈ T is projected to a corresponding point qi = (xi, yi, zi) ∈ Ω on the sphere. The simplest way to place social media is to randomly sample N points P̃ ⊆ T from the panorama and, for each p̃ = (ũ, ṽ) ∈ P̃, calculate the corresponding 3D position q̃ = (x̃, ỹ, z̃) ∈ Ω as the center of the social media (a minimal sketch of this baseline placement appears after the list of desiderata below). As shown in Figure 2.6(a), while this allows us to blend many interesting social media, such as photography, food, and people, into this interactive mixed-reality world, there are several drawbacks in the way they are laid out. To address this, we propose the following desiderata:

1. Ground-level context is important for way-finding for pedestrians and drivers. Therefore, a system blending social media with immersive maps should minimize the rendering of social media at or near the ground level.

2. Since billboards and other structures in the real world are often aligned with physical landmarks, it is desirable to align social media with proximal geometric structures.

3. Free-floating imagery in mixed-reality worlds is likely to result in cognitive dissonance and psychophysical unease in users.

4. Social media imagery should be reasonably spaced apart to avoid visual clutter and overlaps, if at all possible.

We next discuss how we accomplished the above goals in our work.
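Before turning to those refinements, the baseline placement of Section 2.5.2 can be sketched as follows. The equirectangular mapping and the unit conventions here are assumptions for illustration, not the exact projection code used in SSV.

```javascript
// Baseline placement: sample N random points (u, v) on the panorama rectangle T
// and project each one onto the viewing sphere Omega of the given radius.
function sampleUniformPlacements(n, radius = 100) {
  const placements = [];
  for (let i = 0; i < n; i++) {
    const u = Math.random();          // horizontal texture coordinate in [0, 1]
    const v = Math.random();          // vertical texture coordinate in [0, 1]
    const theta = u * 2 * Math.PI;    // longitude on the sphere
    const phi = v * Math.PI;          // colatitude on the sphere
    placements.push({
      u, v,
      x: radius * Math.sin(phi) * Math.cos(theta),
      y: radius * Math.cos(phi),
      z: radius * Math.sin(phi) * Math.sin(theta),
    });
  }
  return placements;
}
```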
2.5.3 Depth- and Normal-map-driven Placement of Social Media

A depth map D = {di}, in which each depth value di corresponds to a pixel pi in T, can be used to filter out points that are too close (e.g., the ground) or too far away (e.g., the sky). Additionally, we scale the size of images based on depth to give a perspective effect. The result is shown in Figure 2.6(b). This minimizes the images that are projected onto the ground or the sky.

Figure 2.6: Results before and after applying the depth map, the normal map, and maximal Poisson-disk sampling. (a) shows the random layouts generated uniformly on the sphere, (b) shows results after using the depth map, (c) shows results after applying the normal map, (d) shows results after using the maximal Poisson-disk distribution for laying out photographs, (e) shows the original street view image, and (f) and (g) visualize the depth map and the normal map, respectively, for reference.

We use the normal map N = {ni} to project the images onto the surfaces of buildings. Denoting the normal vector of the ground as ng = (0, 1, 0), we define the ground level Ωg as follows:

Ωg = {qi | ∀qi ∈ Ω ∧ ∥ni − ng∥ < δ}    (2.4)

where ∥ni − ng∥ is the Euclidean distance between the two vectors ni and ng, and δ = 0.5 is a user-defined threshold. Next, for each sampled point p̃i, we use the corresponding normal vector ni to rotate the social media to the correct orientation. The results are illustrated in Figure 2.6(c). As one can see, the images are now well aligned with the geometry of the buildings. However, the rendering still suffers from visual clutter and overlaps.

2.5.4 Maximal Poisson-disk Sampling

We use maximal Poisson-disk sampling to solve the problem of visual clutter. Poisson-disk distributions have been widely used in the field of computer graphics for global illumination [57], object placement [58], and stochastic ray tracing [59]. In our work, we follow the approach of the PixelPie algorithm devised by Ip et al. [60], which uses vertex and fragment shaders and GPU-based depth-testing features to efficiently implement the dart-throwing algorithm for maximal Poisson-disk sampling.8

After sampling points from the building surfaces, we sort them according to their depth. We preferentially place more popular social media closer, where popularity is defined in Section 2.5.7. We outline this approach in Algorithm 1.

8 Code of PixelPie: http://sourceforge.net/projects/pixelpie/

ALGORITHM 1: Social Media Layout using Poisson-disk Samples
Input: N sorted social media images Ŝ = {si | i = 1 ... N}, acquired from SSV servers.
Output: A set of image planes to display social media: I = I1 ... IM, M ≤ N.
Generate the set of candidate sample points P̃ by the PixelPie algorithm;
Sort the points in P̃ in descending order according to their corresponding values in the depth map D, so that the closest sample point is laid out first;
Set I ← ∅;
for i ← 1 ... min(N, |P̃|) do
    Place Ii with texture from si ∈ Ŝ at the projected position q̃i ← P(p̃i);
    Rescale Ii according to the corresponding depth value, τi ← τ/di, for perspective visual effects;
    Rotate Ii so that it is perpendicular to the normal vector ni ← N(ui, vi);
    Add Ii to the result set: I ← I ∪ {Ii};
end

This provides us with an aesthetic layout for displaying social media blended into the immersive panorama.
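A compact JavaScript sketch of the layout loop in Algorithm 1 is shown below. The Poisson-disk sample generation (PixelPie) and the actual scene placement are abstracted away; the field names on each sample object are illustrative assumptions.

```javascript
// Algorithm 1 (sketch): lay out popularity-sorted social media on maximal
// Poisson-disk samples. Each sample is assumed to carry its depth d_i,
// surface normal n_i, and projected 3D position q_i = P(p_i).
function layoutSocialMedia(sortedMedia, samples, baseScale) {
  // The closest sample points are laid out first.
  samples.sort((a, b) => a.depth - b.depth);

  const billboards = [];
  const count = Math.min(sortedMedia.length, samples.length);
  for (let i = 0; i < count; i++) {
    const sample = samples[i];
    billboards.push({
      texture: sortedMedia[i],          // social media image or video thumbnail
      position: sample.position,        // q_i on the panoramic sphere
      scale: baseScale / sample.depth,  // perspective scaling tau_i = tau / d_i
      normal: sample.normal,            // orient the image plane against n_i
    });
  }
  return billboards;
}
```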
A screenshot of the resulting placement produced by this algorithm appears in Figure 2.6(d).

2.5.5 Placement of Social Media in Scenic Landscapes

Figure 2.7: Results of laying out social media in scenic landscapes. (a) shows the rendering results on a highway and (b) shows the results on the university campus. Note that both locations lack depth maps, but the system is still able to lay out social media along the road.

In open and scenic areas, there are a limited number of surfaces on which to place social media. However, with the high-level knowledge of where the Google Street View camera is traveling from and traveling to, it is possible to place social media along the way without depth or normal maps. To generalize our system from urbanscapes with buildings to more general scenic landscapes, we propose Algorithm 2; the results are shown in Figure 2.7.

ALGORITHM 2: Social Media Layout using Road Orientations
Input: |O| road orientations with oi ∈ [0, 2π], and K social media to be placed for each orientation. Typically, |O| = 2 for a road with two orientations.
Output: A set of image planes to display social media: I = I1 ... IM, M ≥ K · |O|.
Set I ← ∅;
for i ← 1 ... |O| do
    Set the position qi ← (K R cos oi, h, K R sin oi) at height h and radius R;
    (Optional, based on the user's preference) Add a frontal image plane to I at qi;
    Set the translation t ← (T cos(oi + π/2), 0, T sin(oi + π/2)) with constant T;
    for k ← 1 ... K do
        Set q̃ ← (k R cos oi, h, k R sin oi);
        Add a left-side image plane to I at position q′ ← q̃ + t;
        Add a right-side image plane to I at position q′ ← q̃ − t;
    end
end

This algorithm is enabled whenever the depth map is missing or the number of building surfaces is smaller than 2. With the normal map, we run a flood-fill algorithm to cluster adjacent pixels whose Euclidean differences from their neighbors are less than 0.05. The flood fill is implemented with breadth-first search and union sets. A building surface is defined as a cluster whose length is greater than 1500 pixels.

2.5.6 Post-processing, Rendering, and Interaction

To enhance the visual effects of the social media in an immersive setting, our system allows users to add shadows, glowing shader effects, and alpha blending to the virtual billboards that depict the social media messages. We can also model the difference between daytime and night by using a blooming shader and an additive layer based on depth and normal. To experience the static street view in different seasons, we have implemented particle systems to render snow, falling leaves, or cherry petals in the scene. Additionally, we implemented a simple ray tracer that enables users to click on social media to read the associated text.

2.5.7 Filtering of Social Media

In crowded areas such as New York City, it is almost impossible to visualize every message in Social Street View. One solution is to give preference to the most popular social media. However, quantifying popularity itself is subjective, since popularity spans features such as comments, replies, creation time, likes, and the number of times forwarded. We have adopted the following criterion as a proxy for popularity:

(α · Ci + Li) / ΔTi    (2.5)

where, for a given social media message si, Ci is the number of comments, Li is the number of likes, and ΔTi is the age of the social media message. Since comments generally have a higher impact than likes, we scale comments by a user-defined scaling factor α.
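A minimal sketch of the popularity proxy in Equation (2.5) follows; the record field names and the default value of α are illustrative assumptions.

```javascript
// Popularity proxy from Equation (2.5): (alpha * comments + likes) / age.
// `alpha` weights comments more heavily than likes; the age Delta T_i is
// measured in days here, under the assumption of a millisecond timestamp.
function popularityScore(media, alpha = 2.0) {
  const ageInDays = (Date.now() - media.publishedAt) / (1000 * 60 * 60 * 24);
  return (alpha * media.commentCount + media.likeCount) / Math.max(ageInDays, 1e-6);
}

// Social media can then be ranked before layout, e.g.:
// mediaList.sort((a, b) => popularityScore(b) - popularityScore(a));
```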
To address potential privacy concerns, or to find celebrity figures from the news, we also incorporate a face filter using the Face++ API9. Users can also filter social media based on time and distance.

2.6 Experiments and Evaluation

We have carried out a number of experiments to evaluate the Social Street View system. Here we report some of our results for a variety of Google Street View resolutions and social media.

2.6.1 Dataset Acquisition and Hardware Setup

We scraped 100 Google Street View panoramas from Manhattan in New York City as our main dataset. We found over 84,055 social media images on Instagram within a query distance of 20 meters of these panoramas. For each query, our system returns the 100 closest social media images according to their distance to the panorama by searching in B+ trees. The experiments were conducted in Google Chrome (Version 40.0.2214.115 m) with an Nvidia Quadro K6000 and an Intel Xeon CPU E5-2667 at 2.90 GHz. The rendering resolution we used is 2650×1440 pixels. To reduce the effects of varying network latency, we store all the panoramas and social media on the local disk of the workstation. We present the file sizes in Table 2.1 for a variety of panorama resolutions.

9 Face++: http://www.faceplusplus.com

Table 2.1: Resolutions, tile counts, and file sizes of the GSV panoramic data

Pixels    Resolution    Number of tiles    File size
88.6M     13312×6656    26×13              ∼5 MB
22.2M     6656×3328     13×7               ∼2 MB
5.5M      3328×1664     7×4                ∼800 KB
1.4M      1664×832      4×2                ∼300 KB
0.3M      832×416       2×1                ∼90 KB

2.6.2 Evaluation of Initialization and Rendering Time

Interactivity and latency are of great importance for user navigation in Social Street View. Figure 2.8(a) shows the initialization time for the five resolutions in Table 2.1. The initialization mainly includes the time spent querying for the panorama and social media, as well as loading textures into memory and WebGL initialization. We notice that when the size of the panoramic texture is reduced to 6656×3328, the overall time cost is reduced from approximately two seconds to one. Choosing an appropriate resolution based on a user's display, downloading speed, and GPU power can therefore make a meaningful difference. We noticed that at least 900 ms was being spent on initializing panoramas and social media. To reduce the time when switching between adjacent street views, we pre-fetch the data in memory for faster initialization. Google Street View uses progressive refinement to address latency by initially loading low-resolution images that are refined to higher-resolution ones over time. We can also rely on such an approach when the local disk is unable to support pre-fetching. In Figure 2.8(b), we show how querying time is reduced by pre-fetching. Further, the maximal Poisson-disk samples used to place 100 social media can be generated at interactive rates. However, texture loading can still take hundreds of milliseconds, and we hope that it can be improved with further advances in WebGL technology. In Figure 2.8(c), we report the rendering time with and without social media. From the chart, it can be seen that the system runs at around 60 frames per second (fps) for all resolutions. Thus, the rendering of the social media does not affect the experience of navigating Social Street View.
Figure 2.8: Evaluation of processing time at different resolutions using 100 panoramas in Manhattan. (a) shows that the initialization time decreases as the resolution goes down, (b) shows that, with pre-fetching, the initialization time is reduced by over 3 times, and (c) shows that, after initialization, rendering takes about 16 ms per frame (about 60 FPS) in WebGL, and rendering the social media does not noticeably affect the rendering performance.

2.6.3 Evaluation of Saliency Coverage

Saliency maps can represent regions where a user is likely to allocate visual attention in a fixed-time free-viewing scenario [61, 62, 63, 64]. We compute image saliency using the Matlab tool by Hou et al. [65] to evaluate how much of the saliency map is covered by social media. An example of such a saliency map is shown in Figure 2.9(a). The average social media coverage of the saliency maps over all 100 immersive Google Street View panoramas is illustrated in Figure 2.9(b). Initially, with uniform random sampling, about 35% of the saliency map is covered; after incorporating the depth map, most sky areas are filtered out, but social media are highly likely to cover the vanishing point, where saliency is high; after incorporating both depth and normal maps, most social media are aligned with the building structures, where saliency is low in most cases; with maximal Poisson-disk sampling, the social media are distributed evenly and aesthetically, reducing the likelihood that several images overlap in a high-saliency area. This is why maximal Poisson-disk sampling has a significantly lower standard deviation in saliency coverage in Figure 2.9(b). In contrast, uniform random sampling, as well as approaches that rely only on depth-map-based placement, result in a larger coverage of high-saliency regions.

Figure 2.9: (a) Street view with a saliency map overlay. The visualization uses a red-yellow-green-transparent scheme, where red indicates high saliency and transparent indicates low saliency. (b) Evaluation results of saliency coverage from 100 immersive GSV panoramas.

2.7 Use Cases and Discussion

While exploring Social Street View in a variety of scenarios using publicly available social media, we discovered a number of potential use cases that are promising for enhancing storytelling, business advertising, learning cultures and languages, and visual analytics of social media in a spatiotemporal context.
Figure 2.10: Potential applications of Social Street View: (a) users can link to Social Street View to tell immersive stories, (b) business owners can use Social Street View for impressive advertising, (c) children can learn about a culture from local social media, and (d) tourists can preview a trip from crowd-sourced photographs embedded in the immersive maps.

2.7.1 Storytelling

Social Street View could greatly enhance the storytelling experience. For example, users could see photos from recent trips of their friends while exploring the 360° context in an immersive setting. In Figure 2.10(a), we present how social media stories can be more convincing using Social Street View. This panorama is along a road in Baja California Sur within Mexico10. Since this open road does not have vertical proximal structures, we use the scenic-landscape layout mode here. Further, because it is along a long road, we did not expect anyone to take photos and upload them to social media. Nevertheless, our system found 3 images within a radius of 20 meters. In one of the images, Instagram user Daniela wrote on July 12, 2014:

Stuck in traffic on our way to Cabo with this awesome view #roadtrip #cabo #view #mexico

When we pan and walk around this location, we are also impressed by this awesome view. We like to think of this as our system facilitating a democratized, crowd-sourced version of the Kodak Picture Spot.

10 geolocation: North 25.855319593, West 111.333931591

2.7.2 Business Advertising

Social Street View can also be used for business advertising. For example, restaurant managers could showcase the social media photographs of their dishes shared by their customers in the context of the interior ambiance of their restaurants. Similarly, real-estate customers could view a neighborhood street view augmented by the dynamism of the social media of that community to get a better feel for their prospective neighbors. Figure 2.10(b) uses a panorama at 6 E 24th St, New York, United States. In the rightmost image of dishes, Instagram user frankiextah commented:

... dinner started off with amazing oysters paired with my favorite Ruinart blanc de blancs champagne

With mixed-reality rendering, Social Street View enhances future consumers' visual memories and makes it easier for them to seek out "amazing oysters" around this place.

2.7.3 Learning Culture and Crowd-sourced Tourism

Immersive virtual environments have been used to protect the world's cultural heritage and serve as a useful medium for cultural education [66]. However, it is usually challenging to generate relevant captions and up-to-date photographs for each scene of a virtual environment. By blending crowd-sourced social media content with panoramic imagery, Social Street View can (with age-appropriate filters and curation) serve as an educational tool for children and researchers to learn about cultures and languages in different cities and countries. As shown in Figure 2.10(c), users can experience the holiday atmosphere of the Spring Festival in Taierzhuang Ancient Town, China, where the oldest "living ancient canal" was built during the Ming and Qing Dynasties. Here again, because of a lack of sufficient vertical structures, one can enable the scenic-landscape mode and visualize recent photographs of the architecture taken by tourists in the daytime and at night.
Figure 2.10(d) presents an example of crowd-sourced tourism in urban areas. Using face and popularity filters, users can remove most pictures with human faces and blend high-quality photographs with a New York street. These photographs provide novel views for the user's exploration experience.

2.8 TopicFields: Spatiotemporal Visualization of Geotagged Social Media with Hybrid Topic Models and Scalar Fields

Social Street View presents an immersive visualization of geotagged social media at the street level. How can we achieve an overview of the geotagged social media for an entire city? What topics are dominant at different times of day? With recent advances in deep learning models applied to natural language processing and computer vision, machines are now able to extract hundreds of topics or categories from social media data. Nevertheless, there are several problems with directly presenting the topics to the end user:

1. Topic duplication. Some, but not all, of the extracted topics could be closely related to each other. In addition, neural networks trained for text and images usually use different classes or labels in their results. From a visualization perspective, we would like to aggregate similar topics from hybrid models. For example, artists, painting, and art all refer to art. Worse still, machine learning algorithms may fail to classify some features into a specific topic, which requires additional input from the users.

2. Information overload. Low-level features such as unigrams or image classification labels usually result in hundreds of labels. The variety of results is usually too overwhelming for the user. To deal with this problem, we use spectral clustering and provide the clusters with the highest frequencies for the user, to reduce the scope of their visual search.

3. Diversity. Previous visual analytic approaches have investigated heat map visualization of a single topic on a map [37, 67], or heat maps of positive and negative emotions on a map [39]. The challenge of effectively visualizing a large number of topics that are implicit in diverse spatiotemporal social media has thus far not been adequately addressed.

In this section, we present TopicFields (Figure 2.11), an interactive visualization system for geotagged social media that addresses these three critical problems. Our goal is to understand and correlate the social topics that occur in the real world at various geographical locations over time. The main contributions of this work are:

• a novel web-based framework for analyzing, aggregating, and visualizing multiple topics from large-scale geotagged social media data,

• clustering hybrid machine-learning classification results with a spectral ordering algorithm, presented as an interactive matrix diagram,

• an efficient and interactive GPU-driven visualization algorithm for visualizing multi-variate scalar data with kernel density estimation and non-linear normalization methods.

2.8.1 System Overview

As shown in Figure 2.12, our system consists of a social-media query engine, distributed SQL databases, a machine-learning server that runs hybrid modules, and a client-side visualization system.

Figure 2.11: A screenshot of the TopicFields system for visualizing approximately one million geotagged social media with hybrid topic models and scalar fields (this figure is best visualized on a computer screen and does not reproduce as well on a printout).
The top-left map view is overlaid with an interactive scalar map, with each color indicating a different cluster of topics: fashion, art, and park. We show ground-truth labels such as Central Park and The Museum of Modern Art from Google Maps for reference. The detail view at the bottom shows the corresponding social media text, image, or video as the user explores and clicks on the map. The control panel on the right allows the user to select, add, and modify the topics to explore. The user can use the "Cluster" button to open the topic matrix diagram (Figure 2.14) and adjust the clustering results. The stream graph shows the volume of social media of different topics over the queried time. The corresponding topic matrix diagram is shown in Figure 2.15.

First, the user is provided with an interactive map. The user can pan around and zoom in or out to select a region of interest. The boundary of the map is then sent to the social media query engine. In the current prototype, the query engine is able to scrape a few hundred geotagged social media from external sources such as Twitter and Instagram in around a second, but it mostly queries the offline databases. The social media are stored and organized in distributed SQL databases.

Figure 2.12: The architecture and workflow of the TopicFields system. The servers consist of the social media query engine, distributed SQL databases, and hybrid machine learning modules for topic classification and clustering from text and images. After the aggregation of topics, the server transfers the data to the client visualization system to present the topic matrix diagram, the topic fields overlay, and temporal stream graphs. Users can select the desired topics for visualization, get an overview of the distribution, zoom in and filter by keywords and time, and then visualize the details on demand.

Next, the machine learning module extracts topic features from the text and images, as described in Section 2.8.2.2. It applies the spectral ordering algorithm to the groups of topics and sends the matrix to the user for filtering. On the client side, the majority groups of topics after spectral ordering are visualized as a matrix diagram. The user can select, add, and remove topic features from the groups and interactively visualize the topic fields on the map.

For exploratory visualization, we have followed the Shneiderman design principle [68]: overview first, zoom and filter, then details-on-demand. First, we present the map view with the topic fields visualization to offer the user an overview of the social media according to the topics. The stream graph offers the user an overview of the volume of the topics over time. Finally, the user is able to zoom in and click on the map view to query proximate social media. We have linked our system with Google Street View to provide an immersive experience for exploring the map.

2.8.2 Data Processing

In this section, we present the process of data mining, feature extraction, and spectral clustering.

2.8.2.1 Data Mining

Our geotagged social media scraper is a back-end program written in PHP.
Tuchinda et al. [55] proposed to model web services as information sources in a mediator-based architecture and built an exemplary application, Mashup. Using a similar architecture, our system is able to integrate information from several web services. In this prototype, we use Twitter and Instagram as the major sources of social media in our proof-of-concept system. We collected the following four types of data:

1. Geo-spatial and textual location including latitude and longitude coordinates, street names, and the user-tagged location name.

2. Caption and tags containing the text information of the Tweets or Instagram messages.

3. Publication time-stamp containing the exact date and time of publication.

4. User comments and likes reflecting the popularity level.

We have investigated two major districts on the eastern coast of the United States: the Manhattan District of New York, and the District of Columbia (Washington D.C.). Over three months, from December 2017 to March 2018, we collected 946,856 Twitter and Instagram messages with specific geographical labels and publication time-stamps from the public domain, with 589,902 in the Manhattan District and 356,954 in the District of Columbia. The data was scraped using a flood-fill algorithm inspired by Shen et al. [69].

Figure 2.13: The top three classification results of applying the Inception-v3 deep neural network to our dataset. Our topic feature vector for images consists of the entire hierarchical tree of the top three labels from WordNet. For example, pizza belongs to dish, nutriment, and food.

Figure 2.14: Spectral ordering of the topic features; (a) shows the similarity matrix of the top 300 features according to their frequencies, while (b) shows the results after spectral ordering (best viewed on a computer screen). Note that spectral ordering eliminates the randomness of the data and clusters similar groups together. (c) shows the Fiedler vector of the normalized Laplacian matrix of (a) and (b).

2.8.2.2 Feature Extraction

Our data consists of two types of social media: text and images. Sometimes, we obtain videos via the social media query engine. Instead of using the video data, we use the first frame or the user-defined thumbnail for feature extraction.

As for text messages, we first experimented with the well-known topic model, Latent Dirichlet Allocation (LDA) [70], to extract accurate clusters of topics from our data. However, state-of-the-art topic models cannot guarantee clean results.

Figure 2.15: The resulting matrix diagram after spectral clustering with the following queries: art, food, shopping, park, and fashion.
For instance, the top three topics we extracted using LDA [70] are:

enhance, ishootfilm, bend, contemporary, dance, woo, retrospective, ...

sadly, indore, holidayshopping, foodbaby, prk, ...

minidachshund, ana, bulking, busstopdinernyc, friday's, ...

These topics can hardly be used directly for visualization. This led us to the question: can we use machine learning to cluster the most frequent keywords, provide visualization results, and allow the visual analyst to select the desired topics she wants?

In this direction, we use the Natural Language Toolkit (NLTK) to extract the top 300 words from the entire social media dataset and apply the Word2Vec neural network to compute a feature vector for each text item. Still, 300 features are too many for a visual analyst to explore interactively, so we compute the similarity between each pair of features and use spectral clustering (Section 2.8.2.3) with image labels to further aggregate the topics into topic classes.

For images, we apply the Inception-v3 model to the image dataset to compute the top three classification results. The results above the 80th quantile are used for extracting the feature vectors. We concatenate all labels from the hierarchical tree to find the topics for the image social media. For example, as shown in the first image of Figure 2.13, the feature vector is "pizza, pizza pie / dish / nutriment, nourishment, nutrition, sustenance, aliment, alimentation, victuals / food, nutrient".

Figure 2.16: Visualization of the topic fields with different gain factors. (a) shows the baseline visualization using the Gaussian PDF without nonlinear normalization, (b) shows the nonlinear normalization result with a gain factor of 2, and (c) shows the result with a gain factor of 3. The nonlinear normalization significantly increases the contrast between different clusters of topics.

2.8.2.3 Spectral Clustering

Our topic features consist of unigrams and image classification labels, which can form 300-dimensional vectors. Some, but not all, of the extracted features could be closely related to each other. When features are in an arbitrary order along the x-axis of the design widget (Figure 2.14(a)), assigning a meaningful characteristic feature vector may require numerous control points, explicitly defining the value for each dimension. Although the reordering of features does not add to the possible visualizations that can be generated using the machine-learning-assisted approach, this usability issue must be addressed to benefit from the power of high-dimensional representations. We address the relationships among features by rearranging them using spectral ordering, which sorts the features by the eigenvector associated with the second smallest eigenvalue of a graph Laplacian. First, the normalized Laplacian matrix is generated based on feature-to-feature similarity; then, the eigenvector associated with the second smallest non-negative eigenvalue (the Fiedler vector) is calculated, as shown in Figure 2.14(c); finally, the features are sorted based on their values in the Fiedler vector.
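The sketch below illustrates this spectral ordering step on a feature-to-feature similarity matrix. Building the normalized Laplacian and sorting by the Fiedler vector follow the description above; the eigendecomposition itself is delegated to an assumed helper (symmetricEigenvectors), since any standard symmetric eigensolver can be used.

```javascript
// Spectral ordering sketch: given an n x n similarity matrix A with entries in
// [0, 1], build the normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, take the
// eigenvector of its second-smallest eigenvalue (the Fiedler vector), and sort
// the features by their values in that vector.
function spectralOrder(similarity, symmetricEigenvectors) {
  const invSqrtDeg = similarity.map(
    (row) => 1 / Math.sqrt(Math.max(row.reduce((sum, w) => sum + w, 0), 1e-12))
  );

  // Normalized graph Laplacian.
  const laplacian = similarity.map((row, i) =>
    row.map((w, j) => (i === j ? 1 : 0) - invSqrtDeg[i] * w * invSqrtDeg[j])
  );

  // Assumed helper: returns eigenvectors sorted by ascending eigenvalue.
  const { eigenvectors } = symmetricEigenvectors(laplacian);
  const fiedler = eigenvectors[1]; // second-smallest eigenvalue

  return fiedler
    .map((value, index) => ({ index, value }))
    .sort((a, b) => a.value - b.value)
    .map((entry) => entry.index); // feature order along the matrix diagram axes
}
```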
The result, shown in Figure 2.14(b), is an ordering of features in which neighboring features are similar. The Fiedler vector and the other eigenvectors associated with small eigenvalues also form the basis of spectral clustering. Many pairs of the 300 features are indeed highly correlated, as can be seen from the many dark pixels. Nevertheless, an arbitrary order of features does not take advantage of such correlations, resulting in the disorganized similarity matrix in Figure 2.14(a).

After rearranging the topic features, we cluster the adjacent features using disjoint-set data structures and partition refinement algorithms [71]. Pairs of features whose similarity is greater than σ = 0.7 and whose distance in the spectral ordering is smaller than δ = 0.5 are clustered into one disjoint set. We allow users to change σ and δ in the control panel for the matrix diagram. The spectrally ordered similarity matrix places similar features closer together, resulting in large colorful blocks of various sizes along the diagonal. Thus, an accessible feature order allows user-directed selection of similar topics using fewer operations in the control panel.

2.8.3 Topic Fields Visualization

Our algorithm visualizes the scalar field of user-filtered topics of geotagged social media over a map. Given N geotagged social media over the map, with locations gi, i = 1, 2, ..., N, g ∈ G, suppose each social media item is assigned to a set of M topics T : {t1, t2, ..., tM}. Each topic consists of multiple unigrams. We classify a social media item as belonging to a topic if and only if one of its unigrams appears in the caption, the tags, or the hierarchical tree of the image classification results. We limit M ≤ 6 in our system, since the capacity of short-term memory for processing information is usually seven, plus or minus two [72], as is the number of colors distinguishable in visualization schemes [73].

First, we generate a grid mesh with W × H vertices and assign a scalar vector f to each vertex. For the vertex centered at gv, we apply kernel density estimation within its circle of radius R:

ft(gv) = (1 / (N R)) Σ_{i=1..N} K(D(gv, gi) / R)    (2.6)

where the kernel function K could be any non-negative function that integrates to one. However, we prefer kernel functions that smoothly model the falloff of the spatial distribution, such as the Gaussian, Quartic, Epanechnikov, or Triweight functions. Here we use the Gaussian probability density function (PDF) with a bandwidth of R:

K(r) = (1 / √(2π)) · e^(−r²/2)    (2.7)

Suppose we have a transfer function that colorizes each topic t with the color ct. For each vertex, we can blend the topic fields over the grid by:

c = Σ_{t∈T} ct · N(ft)    (2.8)

where N(·) is a nonlinear normalization method applied to the scalar fields to emphasize centralized topics:

N(ft) = g(ft, k) / Σ_{t′∈T} g(ft′, k)    (2.9)

This nonlinear normalization operator partitions the map into clusters consisting of different topics. In particular, we apply the gain function g(x, k) employed in modern ray tracing frameworks such as Pixar's RenderMan [74]:

g(x, k) = ½ · (2x)^k,        if x < 0.5
g(x, k) = 1 − ½ · (2 − 2x)^k, if x ≥ 0.5    (2.10)

where we call k the gain factor, which adjusts the contrast of the scalar field. We plot the function in Figure 2.17.

Figure 2.17: The gain function remaps the unit interval onto the unit interval. It maps 0.5 to 0.5 while expanding the sides and compressing the center. By default, we take k = 2.0.
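A minimal JavaScript sketch of the gain function in Equation (2.10) and the per-vertex density estimate of Equations (2.6)–(2.7) is shown below. It assumes the great-circle distances have already been computed, and it mirrors the shader logic conceptually rather than reproducing the actual GLSL code.

```javascript
// Gain function of Equation (2.10): remaps [0, 1] onto [0, 1], fixing 0.5,
// expanding the sides and compressing the center; k is the gain factor.
function gain(x, k = 2.0) {
  return x < 0.5
    ? 0.5 * Math.pow(2 * x, k)
    : 1 - 0.5 * Math.pow(2 - 2 * x, k);
}

// Equations (2.6)-(2.7): Gaussian kernel density estimate of one topic at a
// grid vertex, given the great-circle distances D(g_v, g_i) from the vertex to
// the N posts assigned to that topic, with bandwidth R.
function topicDensity(distances, bandwidth) {
  const gaussian = (r) => Math.exp(-r * r / 2) / Math.sqrt(2 * Math.PI);
  const sum = distances.reduce((acc, d) => acc + gaussian(d / bandwidth), 0);
  return sum / (distances.length * bandwidth);
}
```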
However, we allow the user to change the gain factor using a control panel powered by dat.gui for the WebGL rendering. In this way, we assign a scalar vector to every vertex on the planar mesh. Typically, we use 32×32 vertices for the current boundary of the map. In the fragment shader, we interpolate the normalized scalar field using Lagrange bicubic sampling [75] and colorize the scalar field using the user-defined colors. Finally, we efficiently render the topic fields with WebGL in a modern browser in real time. We show the visualization results with different gain factors in Figure 2.16.

2.8.4 Use Cases

With the TopicFields system, we demonstrate two potential use cases: trip planning and searching with temporal filters.

Figure 2.18: These figures show the procedure for planning a trip with TopicFields near Central Park: (a) topic clustering for food, shopping, and park, (b) seeking food near Central Park with TopicFields, (c) linking to the street view to identify the spot, (d) seeking social media in Central Park, and (e) finally, looking for places for shopping nearby.

2.8.4.1 Trip Planning

First, we demonstrate how TopicFields can help a user plan a short trip near the Central Park region. Suppose that the user has decided to explore Central Park but has no idea where to go for food and shopping. The user inputs park, food, and shopping into the topic query box. With the spectral ordering algorithm, the user quickly aggregates three clusters comprising 12 features into the query engine. With the TopicFields visualization, the user can quickly identify geotagged social media related to parks, as shown in Figure 2.18(d):

#centralparksouth #centralpark #nature #flowers #flowersofinstagram #ferns #flowersofcentralpark #spring #nyc

From the topic fields, food is distributed everywhere on the map. The user can simply select food and query close to Central Park. One of the results is shown in Figure 2.18(b):

This is what I am talking about! #food #newyork #follow #Instagram #foodporn #sandwich #love #Angelas #centralpark #Saturday

The user can further drag the street view Pegman at the bottom right of the map to inspect the environment near the spot: it appears to be a real sandwich shop. However, the places for shopping seem to be a little far away from the park, given the small purple distribution on the map. Without the TopicFields visualization, one might click anywhere on the map to seek social media that mention shopping, while the TopicFields visualization quickly identifies 5th Avenue as the concentrated area for shoppers:

essentials? #louisvuitton #gucci #guccigang #dolcegabbana #guccicommunity #essentials #shopping #manhattan #nyc #5thavenue #streetstyle

2.8.4.2 Searching with Temporal Filters

We briefly demonstrate how TopicFields can help a user seek geotagged social media with temporal filters. Using the stream graph, we can see that the volume of social media peaks around 10 am and 5 pm. To see the differences regarding the lake in Central Park, we adjust the temporal filter to before 5 pm and after 5 pm. The results are shown in Figure 2.19.
Figure 2.19: The left figure shows the geospatial query with the "park" topic before 5 pm, while the right figure shows the geospatial query with the "park" topic after 5 pm.

Before 5 pm, the pond looks clear and beautiful:

Cooler days as summer turns to fall (and wishing I lived in a four seasons kind of place). #summertofall #centralpark #newyork #nyc #happyfriday

After 5 pm, the pond has a different atmosphere:

#wagnerscove #wcp #westcentralpark #spring #nature #ponds #nyc beautiful day! #faeries live here ? #nycmydna #timeoutnewyorkcity

2.8.5 Discussion

There are a few limitations of our algorithm and improvements that could be made to it. First, the variety and diversity of Twitter posts were very surprising. The Tweets were written in various styles, with many tweets containing few real words, or no real words at all. In addition, the amount and degree of sarcasm and double meaning in tweets also made determining the actual topic of the tweets very difficult. For example, a few social media messages used the word park to refer to Park Avenue in New York City. Second, the deep-learning model on images is not always reliable; sometimes it provides completely wrong information. For example, in a picture of a dog, the neural network recognized it as a boxer. Third, the spectral ordering algorithm relies heavily on the neural network that learns the similarity between pairs of word vectors. If the similarity score is not high enough, the algorithm may not place similar words into the same cluster.

2.9 Geollery: Designing an Interactive Mirrored World with Geotagged Social Media

2.9.1 Introduction

Social media plays a significant role in many people's daily lives, covering a wide range of topics such as reviews of restaurants, updates from friends, local news, and sporting events. Despite the huge innovation in virtual and augmented reality, existing social media platforms typically use a linear narrative or a grid layout. Recently, several technologies and designs [1, 22, 23] (Figure 2.21) have emerged for visualizing social media in mirrored worlds11 [77]. Nevertheless, designing an interactive social platform with immersive geographical environments remains a challenge due to the real-time constraints of rendering 3D buildings. In addition, the design space of visualizing and interacting with social media in mixed reality settings has not yet been fully explored.

11 A mirrored world is defined as "a representation of the real world in digital form [which] attempts to map real-world structures in a geographically accurate way" [76].

As introduced previously, Social Street View [1] contributes the initial efforts in blending immersive street views with geotagged social media. Nevertheless, user interaction is limited to street-level panoramas, thereby limiting the system to areas covered by street views. Consequently, users cannot virtually walk on the streets but can only teleport among the panoramas. Bulbul and Dahyot [22] further reconstruct three cities with street view data and visualize popularity and sentiments with virtual spotlights. However, their system requires 113–457 minutes to
Social media messaged from Twitter, Yelp, Flickr, and our own system are visualized as balloons, billboards, framed photos, and gift boxes in real-time. reconstruct each city and does not explore the visualization and interactivity of social media at the street level. Kukka et al.[23] have presented the conceptual design of visualizing street-level social media in a 3D virtual city, VirtualOulu [31]. However, such pre-designed 3D city models are not practical for deployment in larger areas. In this section, we present Geollery (Figure 2.20 and 2.21 D), an interactive mixed-reality social media platform in 3D which uses a mirrored world rendered in real-time. We introduce a progressive pipeline that streams and renders a mirrored world with 3D buildings and geotagged social media. We extend the design space in several dimensions: progressive meshes and view-dependent textures, virtual repre- sentations of social media, aggregation approaches, and interactive capabilities. To evaluate the system and envision the future of 3D social media platforms, we conduct a user study with semi-structured interviews with 20 people. The qual- itative evaluation and individual responses reveal the strengths and weaknesses of both systems. We further summarize the challenges and limitations of the systems, 67 as well as the types of decisions these could influence and their potential impact. Fi- nally, we improve Geollery according to the user feedback and deploy it via Amazon Web Services (AWS) at https://geollery.com. (A) Du and Varshney’s Social Street View (B) Bulbul and Dahyot’s 3D Visual Popularity (C) Kukka el al.’s Conceptual Design (D) Geollery Figure 2.21: Comparison amongst prior mixed reality systems or designs for visu- alizing geotagged social media. (A) shows Social Street View, a real-time system which depicts social media as billboards over building walls, (B) shows Bulbul and Dahyot’s offline system [22] which leverages virtual lighting to visualize popularity and sentiments of social media, (C) shows the conceptual design by Kukka et al.[23], which explores presentation manner, visibility, organization, and privacy during co- design activities, and (D) shows our results in Geollery v2, which fuses 3D textured buildings, geotagged social media, and virtual avatars in real-time. Our contribution is summarized as follows: 1. conception, design, and development of Geollery, an online system that can depict geotagged social media, 3D buildings, and panoramas in an immersive 3D environment, 2. extending the design space of 3D social media platforms to include aggregation approaches, virtual representations of social media, co-presence with virtual 68 avatars, and interactivity, 3. conducting a user study to qualitatively compare two 3D social media plat- forms (Geollery and Social Street View) by discussing their benefits, limita- tions, and potential impact on future 3D social media platforms. 2.9.2 System Overview In this section, we present an overview of Geollery’s system architecture. Ge- ollery consists of a data engine which streams 2D polygons and labels from Open- StreetMap12 and social media data from our internal database or external sources such as Twitter13, Yelp14, Instagram15, and Flickr16. Deployment of Geollery re- quires only an SQL database and a web server powered by Apache and PHP. We take advantage of the B+ tree to index the geotagged information for querying in real-time. We build the rendering system upon Three.js17, a cross-browser, GPU-accelerated JavaScript library. 
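As an illustration of the query path through this data engine, the sketch below issues the kind of bounding-box query that the B+ tree index supports. It is written in TypeScript for readability; the deployed back end is PHP behind Apache, and the table and column names here are assumptions rather than the production schema.

```ts
// Illustrative only: the actual back end is PHP; this Node/TypeScript sketch shows an
// equivalent bounding-box query. The table `posts` and its columns are hypothetical.
import { createConnection } from 'mysql2/promise';

async function queryNearbyPosts(lat: number, lon: number, radiusDeg: number) {
  const db = await createConnection({ host: 'localhost', user: 'geollery', database: 'geollery' });
  // A composite B+ tree index on (latitude, longitude) lets the database prune the
  // latitude range before scanning longitudes, which keeps the query interactive.
  const [rows] = await db.execute(
    `SELECT id, latitude, longitude, text, image_url
       FROM posts
      WHERE latitude  BETWEEN ? AND ?
        AND longitude BETWEEN ? AND ?`,
    [lat - radiusDeg, lat + radiusDeg, lon - radiusDeg, lon + radiusDeg]
  );
  await db.end();
  return rows;
}
```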
Geollery allows users to explore social media nearby or at a custom location. Users have a choice of either sharing their device’s current location or entering a query into a search box powered by Google’s Place Autocomplete API18. Unlike the prior art which aims at reconstructing the entire city, Geollery leverages a progressive approach to partially build the mirrored world. We present the workflow of Geollery 12OpenStreetMap: https://openstreetmap.org, an open world map. 13Twitter: https://twitter.com, social networking service. 14Yelp: https://yelp.com, local city guide. 15Instagram: https://instagram.com, a photo-sharing platform. 16Flickr: https://flickr.com, an image hosting service. 17Three.js: http://www.threejs.org. 18Place Autocomplete API: https://goo.gl/4eTK5y. 69 2D polygons and labels shaded 3D buildings with added avatars, clouds, from Open Street Map 2D ground tiles trees, and day/night effects Geollery fuses the mirrored internal or external world with geo-tagged data, geo-tagged social media street views, and avatars virtual forms of social media: balloons, billboards, and gifts Figure 2.22: The workflow of Geollery. Based on users’ geo-location requests, Ge- ollery loads the nearby 2D map tiles, extrudes 3D geometry, and renders social media in real-time. We take advantage of WebGL to enable users to access Geollery via modern browsers on a desktop, a mobile phone, or a head-mounted display. in Figure 2.22. First, given a pair of latitude and longitude coordinates, our system queries 2D map tiles and renders the ground plane within a radius of about 50 meters (depending on the user’s specification). The ground plane visualizes roads, parks, waters, and buildings with a user-selected color scheme. As users virtually walk on the street, Geollery streams additional data into the rendering system. Next, Geollery queries 2D map data from OpenStreetMap to gather information about buildings and terrains. 3D geometry is extruded from 2D polygons and then shaded with the appropriate lighting and shadows to form buildings. Trees are randomly generated in forests. In Section 2.9.4, the user study was conducted in Geollery v1 without building textures. Geollery v2 is able to project the closest panorama onto the 3D geometry, based on the availability of street view data (see Section 2.9.5 for technical details). Finally, the system renders a mirrored world within the user’s field of view in real- time, which contains 3D buildings, virtual avatars, trees, clouds, and different forms 70 of social media, such as balloons, billboards, framed photos, and virtual gifts. For registered users, Geollery connects the client with the back-end server via a web socket, which allows real-time communication and collaboration with other participants nearby. We explain our design and implementation details in the next section. 2.9.3 Design Space As listed in Table 2.2, we explore and compare several variables in the design space of 3D social media platforms between Geollery and Social Street View [1], including the choice of meshes and textures, availability, degrees of freedom (DoF) in motion, virtual avatars, and social media representations. We further discuss other dimensions of interest such as privacy concerns, real-world phenomena, and temporal filters. 2.9.3.1 Meshes and Textures During the design process of selecting meshes and textures, we consider the tradeoff between the processing speed and visual appearance. 
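On the speed side of this tradeoff, the real-time path in the workflow above reduces to extruding 2D footprints into prisms. The Three.js sketch below is illustrative rather than the production code: it assumes the OpenStreetMap footprint has already been projected to local planar coordinates in meters, and falls back to a default height when the metadata lacks one.

```ts
import * as THREE from 'three';

// footprint: building outline in local planar coordinates (meters), already projected from lat/lon.
// heightMeters: from OSM metadata (building height or levels); defaults to 10 m if missing.
function extrudeBuilding(footprint: [number, number][], heightMeters = 10): THREE.Mesh {
  const shape = new THREE.Shape();
  footprint.forEach(([x, y], i) => (i === 0 ? shape.moveTo(x, y) : shape.lineTo(x, y)));

  // Extrude the 2D polygon into a prism; disabling the bevel keeps the silhouette
  // faithful to the footprint.
  const geometry = new THREE.ExtrudeGeometry(shape, { depth: heightMeters, bevelEnabled: false });
  geometry.rotateX(-Math.PI / 2); // the shape lies in the XY plane; rotate so the extrusion points up (+Y)

  return new THREE.Mesh(geometry, new THREE.MeshLambertMaterial({ color: 0xcccccc }));
}
```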
While prior art [20, 21, 22] has presented various approaches to reconstruct textured 3D buildings in minutes or hours, we prefer a progressive approach that reconstructs only the nearby building geometry. This allows us to create buildings in real time as needed. We avoid preconstructed models so that Geollery can be used at any location where 2D building data are available.

Table 2.2: Comparison between Geollery and Social Street View across different variables. Note that the original version of Social Street View only uses billboards as a virtual form of social media, while the latest version also uses balloons and virtual gifts.

  Variable                          Geollery                                          Social Street View
  Mesh                              Ground, 3D buildings, trees, and clouds           Sphere
  Textures                          v1: no texture; v2: 360° street views             Textured by 360° street views
  Availability                      Almost always available                           Only where 360° street view data are available
  Motion                            6 DoF                                             3 DoF + teleport
  Virtual avatar                    Available                                         Not applicable
  Collaboration                     Available                                         Not applicable
  Social media location accuracy    Almost the exact location in the world            Estimated by distance and orientation
  Virtual representation            Billboards / balloons / framed photos /           Billboards (v2: added balloons and gifts)
                                    doodles / gifts
  Aggregation                       Based on spatial relationship                     Based on direction and distance

Social Street View is another approach for real-time rendering of immersive street-level environments with geotagged social media. Nevertheless, it reconstructs textured spheres with depth maps and normal maps rather than 3D building blocks. Since the building geometries are not fully recovered, its degrees of freedom in motion are limited to pitch, roll, and yaw. Users have to teleport to other locations by clicking on the streets or a 2D map.

To achieve six-degree-of-freedom movement, we progressively stream data from OpenStreetMap to build 3D meshes in real time. Geollery extrudes polygons of nearby buildings into 3D blocks according to metadata such as building heights (usually available in metropolitan areas) and building levels. Although this approach cannot reconstruct complex geometries such as the Eiffel Tower or the London Eye, it provides the spatial context necessary for augmented reality scenarios (when the user holds a mobile device).

In the first version of Geollery, we explore different color schemes for visualizing the mirrored world. Considering the users' feedback from the user study, we add images from Google Street View to Geollery, so that the closest street views are rendered with the building geometries in real time. We discuss the technical details in Section 2.9.5. Note that while Google Earth provides textured meshes of buildings for many cities, their data is not yet publicly available and the texturing quality is not as high as street views. We believe that, in the future, streaming large-scale, high-quality textured meshes will be vital to developing an immersive social media platform in the mirrored world.

2.9.3.2 Interactive Capabilities

The real-time mirrored world enables new interactive capabilities in Geollery. Here, users can see nearby friends as virtual avatars, chat with friends, and paint street art collaboratively on the virtual building walls.

Avatars. First-time visitors to Geollery are asked to select a 3D avatar from a collection of 40 rigged models. These models are stored in glTF19 format for efficient transmission and loading in the WebGL context.
After selecting an avatar, users 19glTF: https://github.com/KhronosGroup/glTF, GL Transmission Format. 73 Figure 2.23: Real-time communication in Geollery v1 with geotagged social media and virtual buildings may help the users for better spatial context. can use the keyboard or the panning gesture on a mobile device to virtually walk in the mirrored world. Chat. As shown in Figure 2.23, when two participants virtually meet with each other, Geollery allows them to chat with each other in the form of text bubbles. Users can click on other avatars to send private chat messages or their own avatar to send public chat messages. Collaborative Street Art. Inspired by street art, Geollery enables two or more users to share a single whiteboard, draw on it, and add pictures or text via Web- Sockets. The server updates drawings on nearby users’ canvases after every edit allowing real-time collaboration. 74 2.9.3.3 Virtual Representations of Social Media In classic 2D interfaces, social media are usually laid out linearly (Twitter, Instagram) or in a grid (Pinterest) within the screen space. Nevertheless, in a 3D space, the virtual forms of social media can have more diversity. Taking locations, real-world analog, and gamification into account, we have designed the following four virtual representations of social media: (a) Billboards. Billboards, newsstands, and posters are widely used in the phys- ical world for displaying information. As shown in Figure 2.24(a), billboards show thumbnails of geotagged images or text. We implement four levels of details for thumbnails: 642, 1282, 2562, and 5122 pixels. The resolution of the thumbnail shown on each billboard is inversely proportional to the squared dis- tance of the avatar to the billboard. Higher resolution thumbnails are loaded as the avatar approaches each billboard. When users hover over a billboard, the system reveals associated text captions (four lines at most). When users click on a billboard, it pops up a window with details including the complete text caption, the number of likes, and any user comments. (b) Balloons. To attract users’ attention and sustain their interest, we design floating balloons in Figure 2.24(b) to showcase nearby social media. The border colors of balloons categorize their social media based on the text of each social media post. (c) Framed photos or street art. These two representations are respectively 75 inspired by galleries and street art. Geollery allows the users to put on framed photos or a public whiteboard on building walls. If the creator of the white- board allows, nearby users can collaborate in drawing doodles or writing text on the whiteboard. When creating whiteboards, users have the option of selecting from multiple sizes and frame styles. (d) Virtual gifts. To encourage users to engage with their friends, we design virtual gift boxes. Users can leave a gift box at any location in the world and their friends can open it and get rewards in the form of a message or a picture. Gifts can also be secured via questions and answers. (a) billboards (b) balloons (c) framed photos (d) 3D models Figure 2.24: Four virtual representations of geotagged social media: (a) billboards, (b) balloons, (c) framed photography, and (d) 3D models such as gift boxes. Geollery allows users to create billboards, balloons, or gift boxes at their avatar’s location by uploading photos or text messages. 
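For the billboards in (a), the thumbnail resolution is selected from the avatar's distance. A minimal sketch of that level-of-detail rule follows; the reference distance and the exact falloff are illustrative assumptions, chosen only so that resolution degrades with the squared distance as described above.

```ts
// Pick one of the four thumbnail resolutions (512², 256², 128², 64²) for a billboard,
// degrading with the squared distance from the avatar. The 10 m reference is an assumption.
const LOD_RESOLUTIONS = [512, 256, 128, 64];

function billboardThumbnailResolution(distanceMeters: number, referenceMeters = 10): number {
  // Each LOD step halves the resolution, i.e. covers a 4x larger squared distance.
  const ratio = (distanceMeters * distanceMeters) / (referenceMeters * referenceMeters);
  const level = Math.min(LOD_RESOLUTIONS.length - 1, Math.max(0, Math.floor(Math.log2(ratio) / 2)));
  return LOD_RESOLUTIONS[level];
}
```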
To create a framed photo or whiteboard, users simply click on or touch an empty section of virtual wall with the drawing mode enabled. Geollery hangs the frame outside the building by aligning the normal vectors of the wall and the frame [1]. 76 2.9.3.4 Aggregation Approaches One of the challenges of visualizing a large amount of social media in 3D spaces is visual clutter. When multiple social media are placed at approximately the same location, their virtual representations may occlude with each other. We propose three modes as shown in Figure 2.25 to resolve this issue: (b) poster boards (a) stacks (c) temporal transition Figure 2.25: Geollery spatially aggregates social media into: (a) stacks, (b) poster boards, or (c) a single billboard or balloon with temporal transition. (a) Stacks. This mode stacks older billboards upon the newer ones so that all co-located social media can be viewed at the same time. (b) Poster boards. This mode is similar to the stacks, but lays out the social 77 media in a grid on a large poster board. Compared to stacks, posts are not placed as high when more than three are aggregated together. (c) Temporal transition. This mode clusters nearby social media within a radius of approximately 12 meters into a single standard size billboard or balloon. The content displayed dynamically changes between aggregated social media every 10 seconds. This method greatly reduces the visual clutter while displaying the latest information to the user. The advantage of stacks or poster boards is that multiple information is dis- played at a glance, while the advantage of temporal transition is reducing the visual clutter. We provide all options and discuss the users’ preferences in Section 2.9.4 2.9.3.5 Privacy When designing Geollery, we take privacy concerns into consideration at an initial stage since location data may reveal details of people’s lives [78]. We tackle privacy in three aspects: 1. Social Media. When creating social media, users can select among multiple privacy options including: only available to themselves, only visible to friends, and visible to the public. Although we do not support tagging on photos now, we note that future systems with the tagging feature should mitigate the multiparty privacy conflicts [79]. 2. Avatar. Users can set their avatar to be invisible to prevent exposing them- selves to the public. Users can also customize their display name to remain 78 anonymous in the mirrored world. 3. Street Art. Users can restrict rights to editing the street art they create on the building walls or allow other people to collaborate on their street art. 2.9.3.6 Real-world Phenomena As suggested in [1, 23], real-world phenomena such as day and night transitions and changing seasons make virtual worlds more alive and realistic. In Geollery, we have designed a day/night transition system which adjusts the lighting and sky based on the local time of the user. 2.9.3.7 Filtering of Social Media Applying topic models [2, 80] and temporal filters [43, 45] to social media has been researched intensively in recent years. In Geollery v2, we allow the users to filter the social media within the day, month, or year, or by keywords. 2.9.4 User Study To discover potential use cases and challenges in designing a 3D mixed-reality social media platform, we evaluated our prototype, Geollery v1, against another social media system, Social Street View, in a user study with semi-structured inter- views. 
The key differences between the two systems are discussed in Section 2.9.3 and Table 2.2. We recruited a total of 20 participants (ten females and ten males; average 79 age: 25.75± 3.02) via campus email lists and flyers. Each participant was paid 10 dollars as compensation for their time and effort. None of the participants had been involved with this project before. The individual semi-structured interviews took place in a quiet room using two side-by-side workstations with 27-inch displays and NVIDIA GTX 1080 graphics cards. Participants interacted with the systems using keyboards and mice alongside the interviewer. The session for each participant lasted between 45− 60 minutes and involved four stages: a background interview, an exploration of Geollery and Social Street View, a quantitative evaluation, and a discussion about the future of 3D social media platforms. 2.9.4.1 Background Interview In the first stage (5 minutes), the interviewer introduced Geollery and asked the participant about their prior experiences of social media. All of our participants reported social media usage of at least several times per week. Furthermore, 16 out of 20 responded several times per day. However, only 5 out of 20 posted social media frequently (more than several times a week): “I post news about sports and games every day.” (P7/M); “I majorly use Instagram, I post from my own portfolio.” (P17/F). The rest of our participants primarily use social media for viewing friends’ updates and photos. 80 2.9.4.2 Exploration of Geollery and Social Street View In the second stage (30-40 minutes), the interviewer instructed the partici- pant to virtually visit four places using each of the target technologies, Geollery and Social Street View. The places participants were asked to explore include the university campus where the study took place, the Manhattan District of New York, the National Gallery of Art in Washington D.C, and a custom location of the partic- ipant’s choice. We counterbalanced the order of technology conditions (Geollery or Social Street View), as well as ordered the first three places using the Latin square design [81]. For the duration of the study, the interviewer observed the participants’ behaviors and took notes about their comments and interaction. First, the participant was asked to choose a nickname and an avatar. Mean- while, the interviewer logged in to the same system on the other workstation so the participant could virtually interact with the interviewer. Next, we asked if the participant was aware of their location in each virtual setting. In Social Street View, all participants quickly figured out their virtual locations. In Geollery, participants who noticed the minimap would immediately know where they were, but four out of 20 users became confused. For example, P5/F asked: “Am I in a gallery?”, and P16/M responded: “I believe I am in a museum.” After allowing the participants to freely explore each interface for 3 minutes, we interviewed them about their first impressions. In Geollery, many participants were amazed by walking in the mirrored world and the progressive loading of the 81 geometries: “I think it’s a very good start, it’s very good experience to walk around.” (P6/F); “I like that the buildings are forming while I am walking.” (P16/M); “I really like the fact that it’s scaled, so I don’t have to walk 15 minutes from one place to the other.” (P17/F). 
In Social Street View, many participants appreciated the texturing of the 360° views: "I think the texturing actually helps me." (P17/F); "It's like you don't have to be there." (P11/F). However, several participants found Social Street View frustrating in that they could not freely walk around but only teleport by clicking the mouse: "So how do I walk here?" (P5/F) [The interviewer instructed her how to teleport.] "Oh, I see, it zooms in when I scroll. It's like Google Street View."

We further asked about their preferences among different virtual representations and aggregation methods, and whether they would prefer the system to read out the social media contents on demand. In regard to billboards versus balloons, 14 out of 20 participants preferred balloons: "Balloons are informal and billboards can have notices. Balloons may be better for social media." (P13/F). The other six participants preferred the billboards: "I like billboards. First thing, balloons keep moving, it's a little distracting. Billboards look like you are announcing something. It's neater." (P17/F). In addition, 75% of participants preferred the temporal transition approach to aggregating nearby social media into one billboard or balloon.

In the end, we encouraged the participants to input any desired location and compare Geollery with Social Street View. Most participants chose their homes as a final destination, while a few participants input locations where only Geollery is available. For example, P12/M typed the Statue of Liberty in New York City, where only Geollery was able to present the geotagged social media with their spatial context, Liberty Island.

2.9.4.3 Quantitative Evaluation

After exploring the two interfaces for 30 minutes, we asked the participant to comparatively and quantitatively rate the two systems along 9 attributes on an AttrakDiff-based 9-point (−4, −3, −2, −1, 0, 1, 2, 3, 4) antonym word-pair questionnaire inspired by [23]. The average ratings are visualized as a radar chart in Figure 2.26.

Figure 2.26: The radar chart visualizes the quantitative evaluation between Geollery and Social Street View along 9 dimensions (interactive*, creative*, practical, simple, immersive, appealing, straightforward, entertaining, and pleasant) with 20 participants. With a Welch's paired t-test, there is a statistically significant effect (p < 0.05) that Geollery is rated more interactive and more creative than Social Street View; we have indicated this with a star superscript. The other dimensions were not found to differ in a statistically significant sense.

From a Welch's paired t-test, we found a significant effect for interactivity (t(20) = 3.04, p < 0.01, Cohen's d = 0.83) and creativity (t(20) = 2.10, p < 0.05, Cohen's d = 0.66), with Geollery outperforming Social Street View. In addition, 14 out of 20 found Geollery more or equally immersive compared to Social Street View, and 16 out of 20 found Geollery more or equally entertaining.

We then asked the participant which system appealed more to them. Overall, more participants (13 out of 20) preferred Geollery to Social Street View due to its interactivity: "I prefer Geollery in terms of moving around, and because you have the options to draw on walls and interact with people." (P17/F) "I like Geollery because I have free roaming there, and it's kind of cool that I can walk over the world." (P11/F)

Several participants pointed out that Geollery is more like a massively mul-
tiplayer online (MMO) game: “That one (Geollery) I was in a game. This one (Social Street View) feels depressing, nothing exciting.” (P18/M); “I think it’s more like a game, it’s more fun to interact with the virtual world.” (P19/M); “Having more people makes the place feel more interesting and immersive.” (P10/F). In this study, the participants did not try Geollery v2 with textured buildings. Some participants preferred Social Street View due to the immersive panoramas: “[In Geollery,] the buildings don’t like the buildings in the real world, but Social Street View allows me to explore my environment.” (P13/F); “I like Social Street 84 View better. There, I understand the environments better.” (P16/M) 2.9.4.4 The Future of 3D Social Media Platforms At the end of the user study, we asked the participant to discuss their expec- tations as users of future 3D social media platforms and the features they would add if they were designers or product managers. We interviewed the participants with the following three questions: 1. Suppose that we have a polished 3D social media platform like Geollery or Social Street View, how much time would you like to spend on it? For this question, we categorize our participants into three classes: supporters, followers, protesters. Supporters (75%, 15 out of 20) are generally more optimistic about the future of 3D social media platforms. They envision Geollery or Social Street View being used for daily exploration or trip planning. Here are some re- sponses: “I would like to use it every day when I go to work or travel during weekends. [...] I may spend about 8 hours per week using it.” (P4/M); “If it’s not distracting like Facebook and Instagram, I would use it everyday on a couple of things.” (P17/F); “I love traveling, [so] I would like to use it [Social Street View] to preview my destinations before my trips.” (P3/M). The followers (4 out of 20) typically preferred to switch to 3D social media platforms once their friends joined. For example, here are some followers’ responses: “I am a follower on most social media sites. I would only join a 3D social media 85 platform once my friends are there.” (P4/M); “If my friends are all on this, I can see myself spend a couple of hours every week. We can have a meet-up point at one place. My friends could go to my home and post social media.” (P12/M); “It depends on who is using it. If many friends of mine are using it, I would also use it.” (P20/F). As for protesters (1 out of 20), P2/F responded: “I don’t think I will use this. I prefer to use Yelp to see comments [of nearby restaurants].”. 2. Can you imagine your use cases for Geollery and Social Street View? What would you like to use 3D social media platforms for? Many participants (17 out of 20) mentioned food and travel planning as their majority use cases: “I would like to use it for the food in different restaurants. I am always hesitating of different restaurants. It will be very easy to see all restaurants with street views. In Yelp, I can only see one restaurant.” (P13/F); “[I will use it for] exploring new places. If I am going on vacation somewhere, I could immerse myself into the location. If there are avatars around that area, I could ask ques- tions.” (P17/F); “I would like to use it to learn more about the world. If there is a restaurant, I would like to click the restaurant to pull the menu. It makes it easier to communicate with the local people. 
” (P20/F) Family gathering and virtual parties are also potential use cases according to the participants’ responses: “I think it (Geollery) will be useful for families. I just taught my grandpa how to use Facetime last week and it would great if I could 86 teleport to their house and meet with them, then we could chat and share photos with our avatars.” (P2/F); “...for communicating with my families, maybe, and distant friends, [so] they can see New York. And, getting to know more people, connecting with people based on similar interests.” (P19/M); “We can use it (Geollery) on parties [...] like hide some gifts around the house and ask people to find.” (P4/M). 3. If you were a designer or product manager for Geollery or Social Street View, what features would you like to add to the systems? Many participants mentioned texturing the buildings on Geollery v1: “A map- ping of the texture, high-resolution texture, will be great.” (P12/M); “if there is a way to unify the interaction between them, there will be more realistic buildings [and] you could have more roof structures. Terrains will be interesting to add on. (P18/M). Participants suggested more data be integrated into 3D social media plat- forms: “If I’m shopping around in a mall, if I could see deals and coupons, and live comments...” (P7/M); “I would like to add traffic and parking information.” (P6/F). Many participants also suggested a better avatar system, more 3D objects, and more interactive capabilities in Geollery: “[I would like] the flexibility to build your own avatar. Customizing avatar will be one useful feature.” (P18/M) “I would like to see kitties and puppies running around, and birds flying in the air.” (P13/F) “I could also add a bike, add a vehicle, a motorcycle in Geollery, this will add some 87 fun.” (P17/F). 2.9.5 Discussion In this section, we summarize the key insights we learned from the user study, as well as the further improvement we have achieved since the user study. 2.9.5.1 Insights from User Study From the user study with 20 participants, we summarize our findings and insights as follows: 1. Data sources of social media play key roles in developing a 3D social media platform. Since many users do not post to social media frequently, obtaining high-quality data from external sources or seed users to generate high-quality content is of great significance. 2. Interactivity and panoramic textures have different levels of importance for different groups of users. Users with better geospatial awareness may ap- preciate more on interactivity in Geollery while others may appreciate more on the panoramic texturing. Nevertheless, we believe that an ideal system would incorporate high-quality textures into Geollery, resulting in a faithful and interactive mirrored world in real-time. 3. Customization of avatars, diversity [82], and accessibility [83] are important for developing future 3D social media platforms. All users should be able to represent themselves and share the virtual mirrored world equally. 88 2.9.5.2 Combining Geollery and Social Street View Thanks to the participants’ feedback, we develop Geollery v2 which combines progressive geometries with street views to create textured buildings. We achieve this by projecting street views from Google onto the building geometries in Geollery. In the fragment shader, we compute the directional vector from the closest street view’s location and use the spherical coordinates of this vector to sample the street view image. 
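The sampling step just described amounts to converting the world-space direction from the closest street view's center toward the shaded point into equirectangular texture coordinates. A minimal sketch follows; the dissertation's version runs in a GLSL fragment shader, and the axis conventions here (Y up) are assumptions.

```ts
// Convert a direction (from the closest street view's center toward the shaded surface point)
// into (u, v) texture coordinates of an equirectangular panorama. Y is treated as up.
function directionToEquirectUV(dir: [number, number, number]): [number, number] {
  const [x, y, z] = dir;
  const len = Math.hypot(x, y, z);
  const theta = Math.acos(y / len);  // polar angle in [0, π]
  const phi = Math.atan2(z, x);      // azimuth in (-π, π]
  const u = (phi + Math.PI) / (2 * Math.PI);
  const v = theta / Math.PI;
  return [u, v];
}
```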
As users walk around in Geollery, we continuously update the closest street view and use alpha blending to transition to textures obtained from new street views. This approach works for many urban areas in real-time. Nevertheless, as shown in Figure 2.27, this algorithm may project trees or the sky onto the geometries due to the approximation of the digital city. Accurate real-time creation of textured buildings from street view images remains an open challenge even in the state-of- the-art reconstruction systems [20, 21, 24]. Future development may take advantage of deep neural networks to semantically segment the sky [84, 85, 86] or in-paint the pedestrians [87, 88]. 2.9.5.3 Mobile and Virtual Reality Modes In mobile platforms, Geollery allows users to track their device (both the location and the orientation) and explore the nearby social media while walking in the real world. Furthermore, users can hide secret gifts in Geollery and ask people to seek for it in the real world. We envision a future mobile version of Geollery using augmented reality technology to directly overlay geotagged social media on 89 (a) mobile mode (b) WebVR mode Figure 2.27: We further combine Geollery and Social Street View by texturing the buildings in Geollery v2: (a) shows a screenshot on an Android mobile phone, where the user could track the device’s current location and orientation and explore social media in the mirrored world; (b) shows a screenshot of the WebVR mode, where we provide the user with a first-person experience to walk around via a VR controller. the ground or building walls. Additionally, via WebXR20 APIs, Geollery allows users to control their avatars with VR controllers (e.g., an Oculus Touch controller or HTC Vive controller) in head-mounted displays (HMD). However, the current generation of commercial HMDs suffers from relatively low resolutions. Consequently, text and images appear blurrier than when displayed on a conventional monitor. In the future, we plan to investigate how virtual reality HMDs will change user experiences in Geollery and whether they will be more efficient and immersive than conventional displays. 20WebXR: https://immersive-web.github.io/webxr 90 2.10 Conclusions and Future Work In this chapter, we have presented Social Street View, a system to create im- mersive social maps that blend street view panoramas with geotagged social media. Our contributions include: (a) system architecture to scrape, query, and render geo- tagged street view and social media together on clients ranging from smartphones to tiled display walls to head-mounted displays using WebGL, (b) techniques to care- fully layout and display social media on virtual billboards by a judicious combination of depth maps, normal maps, and maximal Poisson sampling, and (c) validating the efficiency of such mixed-reality visualizations for saliency coverage. We have also presented several potential use cases of exploring social media with temporal and spatial filters and storytelling with spatial context. The supplementary materials and demos are available at http://socialstreetview.com. We have also presented TopicFields, a novel system to explore, summarize, and visualize geotagged social media with hybrid topic models and scalar field. TopicFields can efficiently estimate the kernel density distribution and visualize the scalar fields of the user-selected topics on a map on the GPU. 
We have presented the system and its architecture that ingests geotagged Instagram and Twitter messages, extracts topics, hierarchically clusters, and facilitates their interactive visualization on a map. The advantages of using TopicFields are that it allows a large volume of spatial and temporal data to be visualized and understood, then correlated with a series of topics. Our system includes an efficient and interactive GPU-driven visualization algorithm for visualizing multi-variate scalar data with kernel density 91 estimation and non-linear normalization methods. We have further developed the successor of Social Street View, Geollery, in which we progressively blend immersive maps with 3D buildings, virtual avatars, and different virtual representation of social media. We introduce our system ar- chitecture, design choices, and implementation details. We conduct a user study with semi-structured interviews to examine the challenges and limitations of the interfaces, as well as the types of decisions these could influence and their poten- tial impact. The qualitative results indicate that Geollery is more interactive and creative than Social Street View. The user responses reveal several key use cases including searching for food, travel planning, and social gatherings. Taking the participants’ feedback into account, we further combine Geollery and Social Street View by mapping the closest street view textures onto the building geometries. There are several future directions for improving Social Street View and Ge- ollery. First, we plan to fuse multiple street views onto the building geometries in real-time to achieve better photo-realistic rendering. Second, we aim to integrate additional useful information into the 3D world such as geotagged sales, services, and job listings. Adding mental health [89] and sentiments extracted from social media as well as live surveillance videos [90] could prove useful for and law enforce- ment. Third, we intend to use techniques from previous research [91] to improve the filtering mechanism, encouraging supportive comments and reducing negative emotions in Geollery. As we obtain more users in Geollery, we envision future 3D social media plat- forms playing a significant role in the realm of mixed reality. They may eventually 92 change the way we consume and create data, as well as the way we socialize with other people. 93 Chapter 3: Spherical Harmonics for Saliency Computation and Vir- tual Cinematography in 360° Videos 3.1 Introduction With recent advances in consumer-level virtual reality (VR) head-mounted dis- plays (HMD) and panoramic cameras, omnidirectional videos are becoming ubiq- uitous. These 360° videos are becoming a crucial medium for news reports, live concerts, remote education, and social media. One of the most significant benefits of omnidirectional videos is immersion: users have a sustained illusion of presence in such scenes. Nevertheless, despite the rich omnidirectional visual information, most of the content is out of the field of view (FoV) of the head-mounted displays, as well as human eyes. The binocular vision system of human eyes can only interpret 114° FoV horizontally, and 135° FoV vertically [92]. As a result, over 75% of the 360° videos are not being perceived. Furthermore, as shown in Table 3.1, almost 90% pixels are beyond the FoV of the current generation of the consumer-level VR HMDs1. 
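The "ratio beyond FoV" figures in Table 3.1 follow from treating the field of view as an axis-aligned box in the 360°×180° longitude–latitude domain; the sketch below reproduces them. This is an approximation of the covered area, not an exact solid-angle computation.

```ts
// Fraction of a 360° x 180° equirectangular frame that falls outside a rectangular field of view,
// approximating the FoV as an axis-aligned box in longitude/latitude (the approximation behind Table 3.1).
function ratioBeyondFoV(horizontalDeg: number, verticalDeg: number): number {
  return 1 - (horizontalDeg * verticalDeg) / (360 * 180);
}

console.log(ratioBeyondFoV(114, 135)); // ≈ 0.76 for human binocular vision
console.log(ratioBeyondFoV(85, 95));   // ≈ 0.88 for HTC Vive / Oculus Rift (cf. Table 3.1)
console.log(ratioBeyondFoV(65, 75));   // ≈ 0.92 for Google Cardboard
```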
Therefore, predicting where humans will look, i.e., saliency detection, has great potential over a wide range of applications, such as:

• efficiently compressing and streaming high-resolution panoramic videos under poor network conditions [93],
• salient object detection in panoramic images and videos [94],
• information overlay in panoramic images [1], videos [95], and for augmented reality displays,
• directing the user's viewpoint to salient objects which are out of the user's current field of view, or automatic navigation and synopsis of the 360° videos [96, 97, 98, 99].

¹Data sources: the official websites of Oculus, HTC Vive, Samsung, and the blog posts https://goo.gl/eBqpvm and https://goo.gl/n7Vji3

Figure 3.1: This chapter presents an efficient GPU-driven pipeline for computing saliency maps of 360° videos using spherical harmonics (SH). (A) shows an input frame from a 360° video. (B) shows the saliency map computed by the classic Itti et al. model in 104.46 ms on the CPU. (C) shows the saliency map computed by our spherical spectral residual (SSR) model in 21.34 ms on the CPU and 10.81 ms on the GPU. In contrast to the classic models for images in rectilinear projections, our model is formulated in the SO(2) space. Therefore, it remains consistent in challenging cases such as horizontal clipping, spherical rotations, and equatorial bias in 360° videos.

Saliency of regular images and videos has been studied thoroughly since Itti et al.'s work [61]. Previous research has also investigated mesh saliency [100], volume saliency [101], and light-field saliency [102]. However, unlike classic images, which are stored in rectilinear or gnomonic projections, most panoramic videos are stored in equirectangular projections. Consequently, classic saliency models may not work for 360° videos due to the following challenges, as further shown in Figure 3.6:

• Horizontal clipping may slice a salient object into two parts on the left and right edges, which may cause a false negative result.
• Spherical rotation may distort the non-salient objects near the north and south poles, which may cause a false positive result.
• Equatorial bias is not formulated in the classic saliency detectors.

Table 3.1: Comparison of the approximate binocular field of view of human eyes and the current generation of consumer-level head-mounted displays.

  Visual Medium             Approximate FoV (Horizontal)   Approximate FoV (Vertical)   Ratio Beyond FoV
  Human eyes                114°                           135°                         76.25%
  HTC Vive, Oculus Rift     85°                            95°                          87.53%
  Samsung Gear VR           75°                            85°                          90.16%
  Google Cardboard          65°                            75°                          92.48%

In this chapter, we address three interrelated research questions: (a) how should we formulate saliency in the SO(2) space with spherical harmonics, (b) how should we speed up the computation by discarding the low-frequency information, and (c) how should we automatically and smoothly navigate 360° videos with saliency maps? To investigate these questions, we present a novel GPU-driven pipeline for saliency computation and navigation based on spherical harmonics (SH), as shown in Figure 3.1. In Section 3.3, we present the preprocessing for computing the SH coefficients that represent the 360° videos. Our pipeline pre-computes a set of the Legendre polynomials and SH functions and stores them in GPU memory.
We adopt the highly-parallel prefix sum algorithm to integrate feature maps of the downsampled 360°frames as 15 bands of spherical harmonics coefficients on the GPU. In Section 3.4, we introduce the Spherical Spectral Residual (SSR) model. Inspired by the spectral residual approach, we define SSR as the accumulation of the SH coefficients between a low band and a high band. This model reveals the multi- scale saliency maps in the spherical spectral domain and reduces the computational cost by discarding the low bands of SH coefficients. From the experimental results, it outperforms the Itti et al.’s model by over 5× to 13× in timing, and runs in real time at over 60 frames per second for 4K videos. In Section 3.5, as a proof-of-concept, we propose and implement a saliency- guided virtual cinematography system for navigating 360° videos. We formulate a spatiotemporal model to ensure large saliency coverage while reducing the camera movement jitter. The main contributions of our work are: • formulating saliency natively and directly in the special orthogonal group SO(2) space using the spherical harmonics coefficients, without converting the image to R2, 97 • reducing the computational cost and formulating the spherical saliency using the spectral residual model with spherical harmonics, • devising a saliency-guided virtual cinematography system for automatic nav- igation in 360° videos, • implementing the GPU-driven real-time pipeline of computing saliency maps in 360° videos. 3.2 Related Work Our work builds upon a rich literature of prior art on saliency detection, as well as spherical harmonics. 3.2.1 Visual Saliency Visual saliency has been investigated in ordinary images [61, 103], videos [104], giga-pixel images [64], 3D meshes [100], volumes [63], and light fields [102]. Here, we mainly focus on image and video saliency. A region is considered salient if it has perceptual differences from the sur- rounding areas that are likely to draw visual attention. Prior research has designed bottom-up [61, 105, 106, 107], top-down [108, 109, 110], and hybrid models for constructing a saliency map of images (see the review by Zhao et al.[111]). The bottom-up models combine low-level image features from multi-scale Gaussian pyra- mids or Fourier spectrum. Top-down models usually use machine learning strategies and take advantage of higher-level knowledge such as context or specific tasks for 98 saliency detection. Recently, hybrid models using convolutional neural networks [112, 113, 114, 115, 116, 117] have emerged to improve the accuracy of saliency prediction. One of the most pivotal algorithms for saliency detection is Itti et al.’s model [61]. This model computes the center-surround differences of multi-level Gaussian pyramids of the feature maps, which include intensity, color contrast, and orienta- tions, as conspicuity maps. It further combines the conspicuity maps with non-linear combination methods and a winner-take-all network. Another influential algorithm is the spectral residual approach devised by Hou and Zhang [103]. This model com- putes the visual saliency by the difference of the original and smoothed log-Fourier spectrum of the image. However, both approaches assume the input data as rectilinear images, which would not output consistent results for spherical images with horizontal clipping or spherical rotation. Inspired by these two approaches, we formulate the spherical spectral residual model in the SO(2) space. 
By efficiently evaluating the SH coef- ficients between two bands, our model can be easily implemented on the GPU and achieves spherical consistency. In addition to Itti et al.’s model and the spectral residual model, Bruce et al.[118] learn a set of sparse codes from example images to evaluate the saliency of new inputs. Wang et al.[119] use random graph walks on image pixels to compute image saliency. Goferman et al.[108] consider visual organization and high-level features such as human faces in saliency computation. Nonetheless, all of these prior approaches only work for rectilinear images. Our 99 work, as far as we are aware, is the first to apply spherical harmonics for saliency analysis of 360° videos. The work presented in this chapter is inspired by Itti et al.’s model [61] and the spectral residual approach [103] used for image and video saliency. 3.2.2 Spherical Harmonics Figure 3.2: This figure shows the first five bands of spherical harmonics functions. Blue indicates positive real values, and red indicates negative real values. Our code and visualization can be viewed online interactively at https://shadertoy.com/ view/4dsyW8. This demo is built on Íñigo Quílez’s prior work. Spherical harmonics are a complete set of orthogonal functions on the sphere (as visualized in Figure 3.2), and thus may be used to represent functions defined on the surface of a sphere. In visual computing, spherical harmonics have been widely applied to various domains and applications: indirect lighting [120, 121], volume rendering [122], 3D sound [123, 124], and 3D object retrieval [125, 126]. As for lighting, previous work in computer graphics has applied spherical harmonics to calculate global illumination and ambient occlusion [127], refraction [128], scattering [129], as well as precomputed radiance transfer [130]. 100 To the best of our knowledge, we are the first to apply spherical harmonics for saliency detection in panoramic images and videos. 3.3 Computing the Spherical Harmonics Coefficients Spherical harmonics coefficients are usually computed using Monte Carlo in- tegration over the sphere [121]. 360° videos are mostly stored in equirectangular projections, where each pair of the texture coordinate (u,v) corresponds to a pair of spherical coordinate (θ,ϕ) ←→ (2πu,πv). Therefore, we can directly integrate over the scalar fields of the feature maps by using precomputed spherical harmonics at each texture coordinate. Hence, computation of the spherical harmonics coeffi- cients is reduced to a prefix sum problem on the GPU, which is efficiently solved by the Blelloch Scan algorithm. Finally, we also show that we could downsample the panoramic image to N ×M pixels while maintaining a small error in the re- sulting spherical harmonics coefficients. We further show that, for L bands of SH coefficients, the computational complexity is O(L2 logMN) on the GPU. 3.3.1 Evaluating SH Functions To efficiently extract the spherical harmonics coefficients from the 360° videos, we precompute the SH functions at each spherical coordinate (θ,ϕ) of the input panorama of N ×M pixels. Since the values in the feature maps, which are used to define the intensity and color contrast are positive and real, we compute only the real-valued SH functions, also known as the tesseral spherical harmonics, as shown 101 in Figure 3.2. 
in Figure 3.2. The SH functions, $Y_l^m(\theta,\phi)$, are orthonormal to each other and defined in terms of the Legendre polynomials $P_l^m$ as follows:

$$
Y_l^m(\theta,\phi) =
\begin{cases}
\sqrt{2}\,K_l^m \cos(m\phi)\,P_l^m(\cos\theta), & m > 0\\
K_l^0\,P_l^0(\cos\theta), & m = 0\\
\sqrt{2}\,K_l^m \sin(m\phi)\,P_l^{-m}(\cos\theta), & m < 0
\end{cases}
\tag{3.1}
$$

where $0 \le l \le L$ is the band index, $m$ is the order within the band, and $-l \le m \le l$. $P_l^m$ are the associated Legendre polynomials:

$$
\begin{aligned}
P_l^l(x) &= (-1)^l\,(2l-1)!!\,(1-x^2)^{l/2}\\
P_l^{l-1}(x) &= x\,(2l-1)\,P_{l-1}^{l-1}(x)\\
P_l^m(x) &= \frac{x\,(2l-1)}{l-m}\,P_{l-1}^m(x) - \frac{l+m-1}{l-m}\,P_{l-2}^m(x)
\end{aligned}
\tag{3.2}
$$

$K_l^m$ is a scaling factor that normalizes the functions:

$$
K_l^m = \sqrt{\frac{2l+1}{4\pi}\cdot\frac{(l-|m|)!}{(l+|m|)!}}
\tag{3.3}
$$

3.3.2 Evaluating SH Coefficients

To compute the SH coefficients of the 360° videos, we first extract feature maps such as the intensity and color contrast, inspired by Itti et al.'s model [61] and the SaliencyToolbox [107]. The intensity is calculated from the red, green, and blue channels $(r, g, b)$ of each frame according to [107]:

$$
I = (r, g, b)^T \cdot (0.2126,\, 0.7152,\, 0.0722)
\tag{3.4}
$$

We also define the red-green (RG) and blue-yellow (BY) contrast for each pixel as follows:

$$
RG = \frac{r - g}{\max(r, g, b)}
\tag{3.5}
$$

$$
BY = \frac{b - \min(r, g)}{\max(r, g, b)}
\tag{3.6}
$$

For each feature map, the SH coefficients consist of $L^2$ values for $L$ bands. In the equirectangular representation of the 360° videos, we assume that each feature $f_{i,j}$ at the coordinate $(i, j)$, $0 \le i < N$, $0 \le j < M$, represents the mean value $f(\theta_{i+0.5}, \phi_{j+0.5})$ at the solid angle $(\theta_{i+0.5}, \phi_{j+0.5})$, where $\theta_i$ and $\phi_j$ are defined as:

$$
\theta_i = \frac{\pi i}{N}, \qquad \phi_j = \frac{2\pi j}{M}
\tag{3.7}
$$

Therefore, for the $m$th element of a specific band $l$, we evaluate the SH coefficients of the feature map $f$ as:

$$
c_l^m = \int_{(\theta,\phi)\in S} f(\theta,\phi)\, Y_l^m(\theta,\phi)\,\sin\theta \,d\theta \,d\phi
      = \frac{2\pi}{M}\sum_{i=1}^{N}\sum_{j=1}^{M} f_{i,j}\, Y_l^m(\theta_{i+0.5},\phi_{j+0.5})\,\lvert\cos\theta_{i+1}-\cos\theta_i\rvert
\tag{3.8}
$$

Let

$$
H_{i,j} = \frac{2\pi}{M}\, Y_l^m(\theta_{i+0.5},\phi_{j+0.5})\,\lvert\cos\theta_{i+1}-\cos\theta_i\rvert
\tag{3.9}
$$

we have

$$
c_l^m = \sum_{i=1}^{N}\sum_{j=1}^{M} f_{i,j}\, H_{i,j}
\tag{3.10}
$$

Hence, for a given dimension of the input frames, we can precompute the terms $H_{i,j}$ and store them in a lookup table. The integration of the SH coefficients is then reduced to a conventional prefix sum problem.

3.3.3 Implementation Details

In the CPU-driven pipeline, we use OpenMP to accelerate the evaluation of SH coefficients with 12 threads. In the GPU-driven pipeline, we take advantage of the Blelloch scan algorithm [131] with CUDA 9 to efficiently aggregate the SH coefficients with 2048 kernels on an NVIDIA GTX 1080. The Blelloch scan algorithm computes the cumulative sum in O(log N) parallel steps for N numbers. Therefore, our algorithm runs in O(L² log MN) for the L² coefficients.

Finally, we show the reconstructed image $f'$ from the first 15 bands of SH coefficients with regular RGB color maps in Figure 3.3, using the following equation:

$$
f'(\theta,\phi) = \sum_{l=0}^{L}\sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta,\phi)
\tag{3.11}
$$

Figure 3.3: The reconstructed images with the first 15 bands of spherical harmonics coefficients extracted from the video frame.

Note that the low-band SH coefficients capture the background information, such as the sky and mountains, while the high-band SH coefficients capture the details, such as the parachutists.
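For reference, the projection in Equations (3.8)–(3.10) can be sketched on the CPU as follows. This TypeScript sketch is illustrative only — the dissertation's pipeline precomputes the weights $H_{i,j}$, evaluates them in CUDA, and aggregates them with a Blelloch scan — but it uses the same basis and quadrature weights.

```ts
// Serial CPU sketch: project an N x M equirectangular feature map onto L bands of real SH.

function factorialRatio(l: number, m: number): number {
  // (l - |m|)! / (l + |m|)!
  let r = 1;
  for (let k = l - Math.abs(m) + 1; k <= l + Math.abs(m); k++) r /= k;
  return r;
}

function K(l: number, m: number): number {
  // Normalization factor of Equation (3.3).
  return Math.sqrt(((2 * l + 1) / (4 * Math.PI)) * factorialRatio(l, m));
}

// Associated Legendre polynomial P_l^m(x) via the recurrences of Equation (3.2), m >= 0.
function P(l: number, m: number, x: number): number {
  let pmm = 1;
  if (m > 0) {
    const somx2 = Math.sqrt((1 - x) * (1 + x));
    let fact = 1;
    for (let i = 1; i <= m; i++) { pmm *= -fact * somx2; fact += 2; }
  }
  if (l === m) return pmm;
  let pmmp1 = x * (2 * m + 1) * pmm;
  if (l === m + 1) return pmmp1;
  let pll = 0;
  for (let ll = m + 2; ll <= l; ll++) {
    pll = (x * (2 * ll - 1) * pmmp1 - (ll + m - 1) * pmm) / (ll - m);
    pmm = pmmp1;
    pmmp1 = pll;
  }
  return pll;
}

// Real ("tesseral") spherical harmonics basis, Equation (3.1).
function Y(l: number, m: number, theta: number, phi: number): number {
  if (m === 0) return K(l, 0) * P(l, 0, Math.cos(theta));
  if (m > 0) return Math.SQRT2 * K(l, m) * Math.cos(m * phi) * P(l, m, Math.cos(theta));
  return Math.SQRT2 * K(l, m) * Math.sin(m * phi) * P(l, -m, Math.cos(theta));
}

// Project the feature map f (row-major, N rows x M columns) onto bands 0..L-1;
// coefficient c_l^m is stored at index l*(l+1)+m, as in Equations (3.8)-(3.10).
function projectSH(f: Float32Array, N: number, M: number, L: number): Float32Array {
  const coeffs = new Float32Array(L * L);
  for (let i = 0; i < N; i++) {
    const theta = (Math.PI * (i + 0.5)) / N;
    const weight = ((2 * Math.PI) / M) *
      Math.abs(Math.cos((Math.PI * (i + 1)) / N) - Math.cos((Math.PI * i) / N));
    for (let j = 0; j < M; j++) {
      const phi = (2 * Math.PI * (j + 0.5)) / M;
      for (let l = 0; l < L; l++) {
        for (let m = -l; m <= l; m++) {
          coeffs[l * (l + 1) + m] += f[i * M + j] * Y(l, m, theta, phi) * weight;
        }
      }
    }
  }
  return coeffs;
}
```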
3.4 Spherical Spectral Residual Model

With the spherical harmonics coefficients, we present a novel approach to compute saliency for spherical 360° videos using the idea of spherical spectral residuals (SSR).

3.4.1 Spherical Spectral Residual Approach

As shown in Figure 3.3, spherical harmonics bands can be used to compute the contrast directly across multiple scales in the frequency space. In the space of SO(2), we define the spherical spectral residual (SSR) as the difference between the higher bands (up to $Q$) and the lower bands (up to $P$) of the SH reconstruction:

$$
R(\theta,\phi) = \sum_{l=0}^{Q}\sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta,\phi) - \sum_{l=0}^{P}\sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta,\phi)
             = \sum_{l=P+1}^{Q}\sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta,\phi)
\tag{3.12}
$$

in which the $Y_l^m(\theta,\phi)$ are the SH basis functions pre-computed from the associated Legendre polynomials in the preprocessing stage. The SSR represents the salient part of the scene in the spectral domain and serves as a compressed representation using spherical harmonics.

We square the spectral residual to reduce the estimation errors and, for better visual quality, smooth the spherical saliency maps using a Gaussian:

$$
S(\theta,\phi) = G(\sigma) * \left[R(\theta,\phi)\right]^2
\tag{3.13}
$$

where $G(\sigma)$ is a Gaussian filter with standard deviation $\sigma$ ($\sigma = 5$ for the results presented in this chapter).

Figure 3.4: The spectral residual maps between different bands of spherical harmonics. The number along the horizontal axis indicates the high band Q (1–15), while the vertical axis indicates the low band P (1–15). Note that the saliency maps within or close to the orange bounding box successfully detect the two people in the frame.

We show the SSR results of the intensity channel with all pairs of the lower band $P$ and the higher band $Q$ in Figure 3.4. As $P$ increases, low-frequency information such as the sky and mountains is filtered out. The spectral residual results within and close to the orange bounding box reveal the salient objects, such as the two people.

Figure 3.5: The visual comparison between Itti et al.'s model and our SSR model on six videos: Parachute and Grassland (1920×1080), Office and Winter Outdoor (4096×2048), and Night and Spring Outdoor (7680×3840); the columns show the input frame, Itti et al.'s result, and our SSR result. Note that while the results are visually similar, our SSR model is 5× to 13× faster than Itti et al.'s model.

3.4.2 Temporal Saliency

In addition to intensity and color features, we further extract temporal saliency in the spherical harmonics domain.

For the SH coefficients extracted from the three feature maps, we maintain two sliding temporal windows to estimate the temporal contrast. The smaller window $w_0$ stores the more recent SH coefficients from the feature maps, and the larger window $w_1$ stores the SH coefficients over a longer term. For each frame, we calculate the estimated SH coefficients $\bar{c}_l^m$ and $\bar{\bar{c}}_l^m$ from the two windows, using two probability density functions from the Gaussian distribution ($|w_0| = 5$, $|w_1| = 25$, $\sigma = 7.0$). We use a formulation similar to the spatial saliency to measure the spherical spectral residual between the two temporal windows:

$$
R(F_{\mathrm{temporal}},\theta,\phi) = \left\lvert \sum_{l=P+1}^{Q}\sum_{m=-l}^{l} \left(\bar{\bar{c}}_l^m - \bar{c}_l^m\right) Y_l^m(\theta,\phi) \right\rvert
\tag{3.14}
$$

We further apply Equation 3.13 to compute the smoothed temporal saliency maps.

3.4.3 Saliency Maps with Nonlinear Normalization

Following Itti et al. [61], we apply the non-linear normalization operator $\mathcal{N}(\cdot)$ to all six saliency maps: the intensity, red-green, and blue-yellow contrasts, both statically and temporally. This operator globally promotes maps which contain a few peak responses and suppresses maps with a large number of peaks.

$$
S = \frac{1}{N}\bigoplus_{i=1}^{N} \mathcal{N}\!\left(S(F_i)\right)
\tag{3.15}
$$

After the non-linear normalization, we linearly combine all saliency maps into the final saliency map. Empirically, we choose $Q = 15$ and $P = 7$. The final composed result is shown at the bottom left corner of Figure 3.4, as well as in the accompanying video.
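For reference, the band-limited residual of Equation (3.12) can be sketched on the CPU as follows. This is illustrative only — the pipeline evaluates it in CUDA and shaders — and it omits the final Gaussian smoothing of Equation (3.13); it reuses the coefficient layout of the earlier projection sketch and takes the SH basis as a parameter.

```ts
// Reconstruct only bands P+1..Q of the SH expansion on an N x M equirectangular grid and
// square the result (Equation 3.13 before smoothing). `coeffs` must hold at least (Q+1)^2
// values laid out as c_l^m at index l*(l+1)+m.
type SHBasis = (l: number, m: number, theta: number, phi: number) => number;

function sphericalSpectralResidual(
  coeffs: Float32Array,
  N: number, M: number,
  P: number, Q: number,   // empirically P = 7 and Q = 15 in this chapter
  Y: SHBasis
): Float32Array {
  const residual = new Float32Array(N * M);
  for (let i = 0; i < N; i++) {
    const theta = (Math.PI * (i + 0.5)) / N;
    for (let j = 0; j < M; j++) {
      const phi = (2 * Math.PI * (j + 0.5)) / M;
      let r = 0;
      for (let l = P + 1; l <= Q; l++) {
        for (let m = -l; m <= l; m++) {
          r += coeffs[l * (l + 1) + m] * Y(l, m, theta, phi);
        }
      }
      residual[i * M + j] = r * r;  // squared residual; a Gaussian blur (sigma = 5) would follow
    }
  }
  return residual;
}
```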
3.4.4 Comparison Between Itti et al.'s Model and Our SSR Model

Figure 3.6: This figure shows the comparison between Itti et al.'s model and our SSR model with horizontal translation and spherical rotation in the 360° video frame; the panels cover the source frame, a horizontally clipped version, and a spherically rotated version, each shown as the input frame, Itti et al.'s model, and our SSR model. White circles indicate false negative results from the Saliency Toolbox and orange circles indicate false positive results. Meanwhile, the results from our SSR model remain consistent, regardless of horizontal clipping and spherical rotation. Note that we use custom shaders to transform the spherical images, compute the saliency, and apply the inverse transformation for intuitive visualization.

As shown in Figure 3.6, our SSR model is visually better than Itti et al.'s model. In addition, our experimental results below compare the classic Itti et al.'s model and our model. We use six videos from Insta360² and 360Rize³. The video resolutions vary from 1920×1080 to 7680×3840 pixels.

The experiments are conducted on a workstation with an NVIDIA GTX 1080 and an Intel Xeon E5-2667 2.90 GHz CPU with 32 GB RAM. Both Itti et al.'s model and the SSR model are implemented in C++ and OpenCV. The GPU version of the SSR model is developed using CUDA 8.0. We measure the average timing of saliency computation, as well as the visual results, for both Itti et al.'s model and our SSR model. Note that the timings do not include the uploading time for each frame from system memory to GPU memory. We believe that our algorithms would map well to products such as NVIDIA Drive PX⁴, in which videos are directly loaded onto the GPU memory.

²Insta360: https://www.insta360.com
³360Rize: http://www.360rize.com
⁴https://NVIDIA.com/en-us/self-driving-cars/drive-px

Table 3.2: Timing comparison between Itti et al.'s model and our SSR model (average per frame).

  Resolution   Itti et al. (CPU)   SSR (CPU)   SSR (GPU)
  1920×1080    104.46 ms           21.34 ms    10.81 ms
  4096×2048    314.94 ms           48.18 ms    13.20 ms
  7680×3840    934.26 ms           69.53 ms    26.58 ms

We measure the average computational cost of the initial 600 frames across three resolutions: 1920×1080, 4096×2048, and 7680×3840, as shown in Table 3.2. All frames are preloaded into the CPU memory to eliminate the I/O overhead. Both the CPU and GPU versions of our SSR model outperform the classic Itti et al.'s model, with speedups ranging from 4.8× to 13.4×, depending on the resolution. We show example inputs and the outputs from both models in Figure 3.5.

3.5 Saliency-guided Virtual Cinematography

With advances in 360° video cameras and network bandwidth, more events are being live-streamed as high-resolution 360° videos. Nevertheless, while the user is watching a 360° video in a typical commodity HMD, almost 90 percent of the video is beyond the user's field of view, as shown in Table 3.1. Therefore, methods to automatically control the path of the virtual camera (virtual cinematography) become a vital challenge for streaming and navigating 360° videos in real time.

Inspired by the prior work on camera path selection and interpolation [96, 99, 132, 133, 134, 135, 136, 137, 138], we investigate how saliency maps can guide automatic camera control for 360° videos.
First, we compute the saliency maps based on intensity, color, and motion, linearly combine them, and then perform a non-linear normalization, as introduced in the previous section. However, for 360° videos, the most salient objects may vary from frame to frame, due to varying occlusions, colors, and self-movement. As a result, an approach that relies on just tracking the most salient object may incur rapid motion of the camera and, worse still, may induce motion sickness in virtual reality. In this section, we propose a spatiotemporal optimization model of the virtual camera's discrete control points and further employ a spline interpolation amongst the control points to achieve smooth camera navigation.

3.5.1 Optimization of the Virtual Camera's Control Points

To estimate the virtual camera's control points, we formulate an energy function E(C) in terms of the camera location C = (\theta,\phi). The energy function

E(C) = \lambda_{saliency} \cdot E_{saliency}(C) + \lambda_{temporal} \cdot E_{temporal}(C)    (3.16)

consists of a saliency coverage term E_{saliency} and a temporal motion term E_{temporal}, thus taking both saliency coverage and temporal smoothness into consideration.

3.5.2 Saliency Coverage Term

This spatial term E_{saliency} penalizes the coverage of saliency values beyond the field of view. For a specific virtual camera location C, this term is written as:

E_{saliency}(C) = \frac{\sum_{\theta,\phi} S(\theta,\phi) \cdot O(C,\theta,\phi)}{\sum_{\theta,\phi} S(\theta,\phi)}    (3.17)

where O(C,\theta,\phi) indicates whether an arbitrary spherical point (\theta,\phi) is observed by the camera centered at the location C:

O(C,\theta,\phi) = \begin{cases} 1, & (\theta,\phi) \text{ is observed by the virtual camera at } C \\ 0, & \text{otherwise} \end{cases}    (3.18)

Thus, E_{saliency}(C) measures the coverage of the saliency values beyond the field of view of the virtual camera centered at C. To reduce the computation, we compute the saliency coverage terms over 2048 points (\theta,\phi) that are uniformly distributed over the sphere.

3.5.3 Temporal Motion Term

For the i-th frame in the sequence of discrete control points, E_{temporal}(C) measures the temporal motion of the virtual camera as follows:

E_{temporal}(C_i) = \begin{cases} \| C_{i-1} - C_i \|_2, & i \geq 1 \\ 0, & i = 0 \end{cases}    (3.19)

3.5.4 The Optimization Process

Based on this spatiotemporal model, we evaluate the energy function over 32×64 pairs of discrete (\theta,\phi). This process is highly parallel and can be efficiently implemented on the GPU. For each frame, we compute the optimal camera point as follows:

\mathring{C} = \arg\min_{C} E(C)    (3.20)

In this way, we extract a sub-sequence of discrete spherical coordinates Seq = \{\mathring{C}_i \mid \mathring{C}_i = (\phi_i, \theta_i)\} of the optimal camera locations in the saliency maps every K frames (K = 5 in our examples). Since these locations are discrete and sampled at a lower frame rate, we further perform spline interpolation with C^2 continuity.
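A minimal NumPy sketch of the per-frame control-point selection (Equations 3.16–3.20) is shown below for illustration. Following the description in Section 3.5.2, the saliency term is written here as the fraction of saliency mass that falls outside the field of view, so that minimizing E(C) favors views covering the salient regions; the circular field-of-view test, the λ weights, and the integration over the full saliency map (rather than 2,048 uniform sphere samples) are assumptions of this sketch, not the dissertation's exact implementation.

```python
import numpy as np

def observed_mask(cam, thetas, phis, fov=np.deg2rad(100.0)):
    """O(C, theta, phi): 1 if a spherical point lies inside the camera's field
    of view, 0 otherwise (Eq. 3.18). A circular FoV around the camera
    direction C = (theta_c, phi_c) is assumed for simplicity."""
    tc, pc = cam
    # great-circle distance between (theta, phi) and the camera center
    cos_d = (np.cos(thetas) * np.cos(tc)
             + np.sin(thetas) * np.sin(tc) * np.cos(phis - pc))
    return (np.arccos(np.clip(cos_d, -1.0, 1.0)) < fov / 2.0).astype(float)

def energy(cam, prev_cam, S, thetas, phis, lam_s=1.0, lam_t=0.05):
    """E(C) = lam_s * E_saliency + lam_t * E_temporal (Eq. 3.16).
    The saliency term is the fraction of saliency mass left outside the
    field of view, so a smaller energy means better coverage."""
    outside = 1.0 - observed_mask(cam, thetas, phis)
    e_saliency = np.sum(S * outside) / (np.sum(S) + 1e-8)
    e_temporal = 0.0 if prev_cam is None else np.linalg.norm(np.subtract(cam, prev_cam))
    return lam_s * e_saliency + lam_t * e_temporal

def best_control_point(S, prev_cam=None, n_theta=32, n_phi=64):
    """Exhaustive argmin of E(C) over a 32x64 grid of candidates (Eq. 3.20).
    Slow but clear; the dissertation evaluates the candidates in parallel on
    the GPU."""
    thetas, phis = np.meshgrid(np.linspace(0.0, np.pi, S.shape[0]),
                               np.linspace(0.0, 2.0 * np.pi, S.shape[1]),
                               indexing="ij")
    candidates = [(t, p) for t in np.linspace(0.0, np.pi, n_theta)
                  for p in np.linspace(0.0, 2.0 * np.pi, n_phi)]
    return min(candidates, key=lambda c: energy(c, prev_cam, S, thetas, phis))
```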
3.5.5 Interpolation of Quaternions

To achieve superior interpolation over a sphere, we convert the spherical coordinates to quaternions:

Q(\theta,\phi) = (0, \ \sin\theta \cos\phi, \ \sin\theta \sin\phi, \ \cos\theta)    (3.21)

We use spherical spline curves with C^2 continuity to compute the smooth trajectory of the camera cruise path over the quaternions. For an arbitrary timestamp x, we need to compute the interpolated spherical coordinates Q_i(x). We denote t_i as the most recent timestamp to x, which corresponds to the i-th video frame, and Q_i as the corresponding quaternion. Hence, we compute the interpolated quaternion Q_i(x) as follows:

Q_i(x) = \frac{\nabla^2 Q_i (x - t_{i-1})^3 + \nabla^2 Q_{i-1} (t_i - x)^3}{6 h_i} + \left[ \frac{Q_i}{h_i} - \frac{\nabla^2 Q_i \, h_i}{6} \right] (x - t_{i-1}) + \left[ \frac{Q_{i-1}}{h_i} - \frac{\nabla^2 Q_{i-1} \, h_i}{6} \right] (t_i - x)    (3.22)

where \nabla^2 Q_i is the second derivative of the quaternion at the i-th frame, and h_i = t_i - t_{i-1}. Figure 3.7 shows the locations of the global maxima, as well as the interpolated spline path over the sphere.

Figure 3.7: This figure shows the interpolation amongst the global maxima of the saliency maps in the spherical space. The yellow dots show the discrete optimal locations using the energy function, and the blue dots show the interpolation using the spherical spline curve with C^2 continuity.

3.5.6 Evaluation of the SpatioTemporal Optimization Model

We compare our method with the MaxCoverage model, which determines the camera position for the maximal coverage of the saliency map. We evaluate the temporal motion terms for the same video sequence and plot the data in Figure 3.8. From the quantitative evaluation, as well as the complementary video, we have validated that the SpatioTemporal Optimization model reduces the temporal jittering of the camera motion compared to the MaxCoverage model for virtual cinematography in 360° videos.

Figure 3.8: Quantitative comparison of temporal artifacts between the MaxCoverage model and the SpatioTemporal Optimization model. We visualize the temporal motion of the virtual camera (in diagonal degrees) across 360 frames. Compared with the MaxCoverage model, the SpatioTemporal Optimization model significantly reduces the temporal jitters.

3.6 Conclusions and Future Work

In this chapter, we have presented a novel GPU-driven pipeline which employs spherical harmonics to directly compute the saliency maps for 360° videos in the SO(2) space. In contrast to the traditional method, our method remains consistent for challenging cases like horizontal clipping, spherical rotations, and equatorial bias, and is 5× to 13× faster than the classic Itti et al.'s model.

We demonstrate the application of spherical harmonics saliency to automatically control the path of the virtual camera. We present a novel spatiotemporal optimization model to maximize the spatial saliency coverage and minimize the temporal jitters of the camera motion.

In the future, we plan to further develop our SSR model for stereoscopic saliency detection [139] in 360° videos. We aim to collect a large-scale dataset with stereo 360° videos and human eye-tracking data. Another future direction is to generate hyper-lapse rectilinear videos [140] from 360° videos, using a variant of our virtual cinematography model. We would like to open source our toolbox for computing spherical harmonics from 360° videos, saliency maps from SSR models, and virtual cinematography. We believe the spherical representation of saliency maps will inspire more research to think out of the rectilinear space. We envision our techniques will be widely used for live streaming of events, video surveillance of public areas, as well as templates for directing the camera path for immersive storytelling.
Future research may explore how to naturally place 3D objects with spherical harmonics irradiance in 360° videos, how to employ spherical harmonics for foveated rendering in 360° videos, and the potential of compressing and streaming 360° videos with spherical harmonics. 118 Chapter 4: Video Fields: Fusing Multiple Surveillance Videos into a Dynamic Virtual Environment 4.1 Introduction Surveillance videos play a crucial role in monitoring a variety of activities in shopping centers, airports, train stations, and university campuses. In conventional surveillance interfaces, where multiple cameras are depicted on a display grid, human operators endure a high cognitive burden in fusing and interpreting multiple cam- era views (Figure 4.2). In the field of computer vision, researchers have made great strides in processing surveillance videos for segmenting people, tracking moving en- tities, as well as classifying human activities. Nevertheless, it remains a challenging task to fuse multiple videos from RGB cameras, without prior depth information, into a dynamic 3D virtual environment. Microsoft Kinect, for example, cannot cap- ture the infrared structured lighting patterns in the sunlight. In this project, we present our research on the fusion of multiple video streams taking advantage of the latest WebGL and WebVR technology. Nowadays, there is an increasing demand for real-time photo-realistic dynamic scene generation. As an exploratory work, we investigate the following research 119 Figure 4.1: Video Fields system fuses multiple videos, camera-world matrices from a calibration interface, static 3D models, as well as satellite imagery into a novel dynamic virtual environment. Video Fields integrates automatic segmentation of moving entities during the rendering pass and achieves view-dependent rendering in two ways: early pruning and deferred pruning. Video Fields takes advantage of the WebGL and WebVR technology to achieve cross-platform compatibility across smart phones, tablets, desktops, high-resolution tiled curved displays, as well as virtual reality head-mounted displays. See the supplementary video at http:// videofields.com questions: 1. Can we efficiently generate dynamic scenes from surveillance videos for VR applications? 2. Can we use a web-based interface to allow human operators to calibrate the cameras intuitively? 3. Can we use the-state-of-the-art web technologies to achieve interactive video- based rendering? 4. Can we render moving entities as 3D objects in virtual environments? In the Video Fields project, we present our early solutions to create dynamic 120 Figure 4.2: This photograph shows conventional surveillance interface where mul- tiple monitors are placed in front of the operators. One of the greatest challenges for the users is to mentally fuse and interpret moving entities from multiple camera views. virtual reality from surveillance videos using web technologies. The contributions are summarized as follows: 1. conception, architecture, and implementation of Video Fields, a mixed-reality system that fuses multiple surveillance videos into an immersive virtual envi- ronment, 2. integrating automatic segmentation of moving entities in the Video Fields rendering system, 3. presenting two novel methods to fuse multiple videos into a dynamic virtual environment: early pruning and deferred pruning of geometries, 4. 
achieving cross-platform compatibility across a range of clients including smart- phones, tablets, desktops, high-resolution large-area wide field-of-view tiled display walls, as well as head-mounted displays. 121 4.2 System Design Video Fields system consists of a camera world calibration interface, a back- end server to process and stream the videos, and a web-based rendering system. The flow chart of the system is shown in Figure. 4.3. Figure 4.3: This figure shows the workflow of Video Fields. Our system imports video streams as video textures in WebGL. In the WebGL camera world calibration interface, the user may create the ground, calibrate the cameras, and add initial geometries. The videos are then sent to the Video Fields server for generating a background model. 4.2.1 Camera-World Calibration Interface The camera-world calibration interface for Video Fields has been built based on WebGL and WebVR. We use the open-source library Three.js1 to create the interface and render the mixed-reality scene. We have designed a four-stage workflow for Video Fields. First, users import their videos into the system. During this stage, the users can also alter the video timeline to crop and synchronize the videos 1Three.js: http://threejs.org/ 122 manually, if needed. Then users can define the ground projection plane for the projection. Second, users adjust the position and the rotating quaternion for each camera in the videos as if they were orienting flashlights onto the ground projection plane. We have used the transform controller in the Three.js library to provide such an interface. Third, users can drag geometric primitives for constructing buildings for the 3D world. The user can watch real-time rendering results interactively as they are dragging the geometric primitives. Finally, users can use any number of display devices, including the VR headsets, to experience the world that they have just created. Using the estimated position and size of each geometry, we can attach textures to the user-defined building geometry. We can also use a sky sphere to enhance the visual immersion. The position, scaling, and rotation of the cameras (as camera world matrices C) and the 3D models are exported and saved in the JSON format. 4.2.2 Background Modeling The motivation of background modeling is to provide a background texture for each camera to identify moving entities in the rendering stage and to reduce the network bandwidth requirements when streaming videos from the web-server. With a robust estimation of the background, only the pixels corresponding to the moving entities will need to be streamed for rendering and not the rest of the video for every frame. To estimate robust background images, we take advantage of Gaussian Mixture 123 (a) Source video texture (b) Background model by GMM (c) Segmentation without (d) Segmentation with Gaussian convolution Gaussian convolution Figure 4.4: Segmentation results with Gaussian mixture models of the background. (a) shows a reference frame of the video texture, (b) shows the background model learnt by the Gaussian Mixture Models approach, (c) shows segmentation without Gaussian convolution, and (d) shows the segmentation with Gaussian convolution Models (GMM) for background modeling. Compared with the mean filter or the Kalman filter, GMM is more adaptive with different lighting conditions, repetitive motions of scene elements, as well as moving entities in slow motion [141]. 
Given the video texture T, for each pixel t_{uv} at the texture coordinates (u,v), we model the background value of the pixel as a mixture of N Gaussian distributions P_{uv}. For each frame at time i \in \{1, \ldots, M\}, we denote T(u,v)_i as the set of all previous pixels at t_{uv}, and P(T(u,v)_i) as the probability of the background color of the pixel at (u,v):

T(u,v)_i = \{ T(u,v,j), \ 1 \leq j \leq i \}
P(T(u,v)_i) = \sum_{j=1}^{N} \mathcal{N}(T(u,v)_i \mid \mu_{ij}, \Sigma_{ij}) \cdot \omega_{ij}    (4.1)

where N is the total number of Gaussian distributions (we set N ← 3 in our implementation) and \mathcal{N} is the Gaussian probability density function:

\mathcal{N}(T(u,v)_i \mid \mu_{ij}, \Sigma_{ij}) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_{ij}|^{1/2}} \cdot e^{ -\frac{1}{2} (T(u,v)_i - \mu_{ij})^{T} \Sigma_{ij}^{-1} (T(u,v)_i - \mu_{ij}) }    (4.2)

where \Sigma_{ij} is the covariance matrix. We assume that the red, green, and blue channels are independent and have the same variances. Thus, the GMM can be trained at a lower cost of computing the covariance matrix. To update the Gaussian distributions, we use an online K-means algorithm to learn the weights \omega_{ij}:

\omega_{ij} \leftarrow (1-\alpha)\,\omega_{(i-1)j} + \alpha M_{ij}    (4.3)

where \alpha is the learning rate and M_{ij} indicates whether the j-th model is matched to the current pixel. An example of the computed background model is shown in Figure 4.4(b).

4.2.3 Segmentation of Moving Entities

After learning the background model for each video, we achieve real-time interactive segmentation of moving entities during the fragment-shader rendering pass by taking advantage of the many-core computing on the GPU. To alleviate noise and smooth the boundaries, given the input video texture T and its corresponding background model B, we convolve T and B with a Gaussian kernel G at scale \sigma:

T' \leftarrow G(\sigma) \otimes T, \quad B' \leftarrow G(\sigma) \otimes B    (4.4)

We segment the moving entities by the thresholding function \delta:

F \leftarrow \delta(|T' - B'|)    (4.5)

A comparative example showing the advantage of using the Gaussian convolution is shown in Figure 4.4(c) and (d) using the same threshold of 0.08. This parameter is passed to the fragment shader as a GLSL floating-point number, thus enabling the user to interactively alter the thresholding function in the WebGL browser. After obtaining the foreground F, we also calculate the set of bounding rectangles R of moving entities in F.
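For illustration, the background modeling and segmentation pipeline of Sections 4.2.2–4.2.3 might be sketched in Python/OpenCV as follows. OpenCV's built-in MOG2 subtractor stands in for the custom per-pixel mixture and online K-means update, the minimum contour area is an assumed value, and the actual system performs Equations 4.4–4.5 in a WebGL fragment shader.

```python
import cv2
import numpy as np

def segment_moving_entities(frames, sigma=2.0, threshold=0.08, min_area=100):
    """Foreground masks and bounding boxes in the spirit of Sections 4.2.2-4.2.3.
    OpenCV's MOG2 subtractor stands in for the per-pixel mixture of Eq. 4.1-4.3;
    Eq. 4.4-4.5 (Gaussian convolution and thresholding) are applied explicitly."""
    mog = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
    for frame in frames:
        mog.apply(frame)                               # online update of the mixture model
        background = mog.getBackgroundImage()          # learned background B
        # Eq. 4.4: convolve the frame and the background with a Gaussian kernel
        f = cv2.GaussianBlur(frame, (0, 0), sigma).astype(np.float32) / 255.0
        b = cv2.GaussianBlur(background, (0, 0), sigma).astype(np.float32) / 255.0
        # Eq. 4.5: threshold the absolute difference to obtain the foreground F
        mask = (np.abs(f - b).max(axis=2) > threshold).astype(np.uint8) * 255
        # bounding rectangles R of the moving entities
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > min_area]
        yield mask, boxes
```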
4.3 Video Fields Mapping

To map video fields onto the geometries in the 3D virtual environment, we need to establish a bidirectional mapping between the texture space and the 3D world space. Here, we use the ground model as an example of an arbitrary geometry in the 3D scene. The challenges we address here are as follows:

• Given a vertex on the ground model, we need to calculate the corresponding pixel in the texture space. This is important for visualizing the color of the ground model.

• Given a pixel in the texture space, we need to calculate the corresponding vertex on the ground to project that pixel. This is important for projecting a 2D segmentation of a moving entity to the 3D world.

Figure 4.5: This figure shows the overview of the Video Fields mapping by projecting 2D imagery from the streaming videos onto a 3D scene; t_1, ..., t_4 indicate the key points of the foreground bounding boxes in the input 2D frames and p_1, ..., p_4 indicate the corresponding projected points.

The first challenge can be solved by projection mapping, but we need to correct for the inherent perspective in the acquired video. Given the camera-world matrix C, which is obtained from the WebGL-based camera-world calibration interface, and the model matrix of the ground G, for each vertex p_{xyz} on the ground we first use the camera-world matrix and the ground-model matrix to convert its coordinates to homogeneous coordinates in the camera space:

\hat{p}_{xyzw} \leftarrow C \cdot G \cdot (p_{xyz}, 1.0)    (4.6)

To obtain the texture coordinates t_{uv}, we need to carry out a perspective correction:

t_{uv} \leftarrow \left( \frac{\hat{p}_x + \hat{p}_w}{2\hat{p}_w}, \ \frac{\hat{p}_y + \hat{p}_w}{2\hat{p}_w} \right)    (4.7)

Figure 4.6 shows the results before and after the perspective correction.

Figure 4.6: This figure shows the results (a) before and (b) after the perspective correction. The texture in (a) has the same perspective as the original video, but is not projected correctly onto the 3D scene.

The second challenge, to convert a 2D point to 3D, is non-trivial because the projection from 3D to 2D is irreversible. To do this, we first compute a dense grid contained within the ground projection of the video. For each 3D vertex of this grid, we compute the corresponding 2D coordinates in the video texture using Equations 4.6 and 4.7. We store the results in a hash function H. Since we are using a dense grid, all points in the video texture can be mapped to a 3D vertex on the ground. Finally, we store the mapping as:

H : t_{uv} \mapsto p_{xyz}    (4.8)

Once calculated, this hash map can be stored on the server side and be used for mapping 2D texture points back to the 3D world. For each vertex p on the ground, we also calculate the angle \theta_p between the camera ray and the ground surface.

4.3.1 Early Pruning for Rendering Moving Entities

To render moving entities in the 3D world, we remove the background pixels from the video texture and correct the projection so that the moving entities are vertical on the ground model. In the Early Pruning approach, we discard the pixels that do not belong to the foreground as soon as the foreground is identified after the Gaussian convolution and thresholding of Equations 4.4 and 4.5. Then the foreground pixels are transformed into a 3D point cloud and projected into the 3D world space. We use point clouds to implement the early pruning technique and optimize the rendering performance of Video Fields, since they are extremely efficient to render. The detailed algorithm is described in Algorithm 3. View-dependent results of our visualization technique are shown in Figure 4.7.

ALGORITHM 3: Early Pruning for Rendering Moving Entities
Input: foreground F and the set of bounding rectangles R of moving entities
Output: a 3D point cloud P visualizing the moving entities
Initialize a set of points for the video visualization (run once);
For each pixel t inside the bounding box, calculate the intersection point t⊥ between its perpendicular line and t1t3;
for each pixel t from the video do
    if t ∉ F then discard t and continue; end
    set the color of the pixel: c ← texture2D(F, t);
    look up the corresponding projected points in the 3D scene: p ← H(t), p⊥ ← H(t⊥);
    update the z coordinate of the 3D point: p_z ← |p − p⊥| · tan(θ_p);
    use the x, y coordinates of t⊥ to place the point vertically: p_xy ← t_uv;
    render the point p;
end
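For readers who prefer working code over pseudocode, a simplified NumPy reading of Algorithm 3 is sketched below. It assumes the texture-to-ground lookup H and the per-texel grazing angles θ_p have been precomputed as described above and that the ground plane is z = 0; it is an illustration, not the WebGL/GLSL implementation.

```python
import numpy as np

def early_pruning_point_cloud(frame, fg_mask, box, H, theta_p):
    """A simplified reading of Algorithm 3: project the foreground pixels of one
    bounding box into a vertical 3D point cloud above the ground plane (z = 0).

    frame   : rows x cols x 3 RGB video frame
    fg_mask : rows x cols boolean foreground F
    box     : (u0, v0, w, h) bounding rectangle of a moving entity
    H       : rows x cols x 3 lookup, H[v, u] = ground point p_xyz for texel (u, v)
    theta_p : rows x cols angles between the camera ray and the ground surface
    """
    u0, v0, w, h = box
    v_bottom = v0 + h - 1                    # ground-contact row (the segment t1-t3)
    points, colors = [], []
    for v in range(v0, v0 + h):
        for u in range(u0, u0 + w):
            if not fg_mask[v, u]:
                continue                     # early pruning: discard background texels
            p = H[v, u]                      # projected point of the texel itself
            p_foot = H[v_bottom, u]          # projected foot point t_perp on the ground
            height = np.linalg.norm(p - p_foot) * np.tan(theta_p[v, u])
            points.append([p_foot[0], p_foot[1], height])  # stand the texel up vertically
            colors.append(frame[v, u] / 255.0)
    return np.asarray(points), np.asarray(colors)
```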
4.3.2 Deferred Pruning for Rendering Moving Entities

Though pruning the background pixels at an early stage is useful for our videos, where most pixels belong to the background, we have also developed the deferred pruning approach for better anti-aliasing, bi-linear sampling, and faster visibility testing. In this approach, we dynamically project videos onto moving billboards. The background subtraction is completed in the fragment shader of each billboard. After the world matrix of a billboard is determined, we test the visibility of the billboard. We then render foreground pixels onto the billboard and discard the background pixels. We describe the details in Algorithm 4.

ALGORITHM 4: Deferred Pruning for Rendering Moving Entities
Input: foreground F and the set of bounding rectangles R of moving entities
Output: a set of billboards rendering the moving entities
Initialize a set of billboards to display moving objects (run once);
for each detected bounding box r in R do
    calculate the bottom-left, bottom-middle, bottom-right, and top-middle points t1, t2, t3, t4 in r, as illustrated in Figure 4.5;
    look up the corresponding projected points in the 3D scene: p_i ← H(t_i), i ∈ {1, 2, 3, 4};
    calculate the width and height of the billboard in the 3D space: w ← |p3 − p1|, h ← |p4 − p2| · tan(θ_{p4});
    reposition the billboard to the position (p1 + p3)/2 with width w and height h;
    in the fragment shader of the billboard, sample the color from I as described in Equations 4.6 and 4.7, but replace G with the current billboard's model matrix;
    discard pixels which do not belong to the foreground F;
end

Figure 4.7: This figure shows the segmentation of moving entities, view-dependent rendering, as well as a zoom-in comparison between (a) the early pruning algorithm and (b) the deferred pruning algorithm.

4.3.3 Visibility Testing and Opacity Modulation

One of the biggest advantages of visualizing videos in an immersive 3D virtual environment is that the system can adjust the opacity of every object, thus allowing the users to "see through buildings". The users can therefore observe the video-recorded activities from cameras that would otherwise not be viewable from a user's given vantage point. We achieve this by doing a visibility test using ray casting from the current camera to every moving entity in the scene. If a moving entity is found to be occluded, we modulate the opacity of the occluding object and render the moving entities. Figure 4.8 shows an example before and after the visibility test and opacity modulation.

Figure 4.8: This figure shows the rendering results (a) without and (b) with the visibility test and opacity adjustment. Note that the two people behind the building are correctly rendered through the semi-transparent meshes.
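A small sketch of the visibility test and opacity modulation of Section 4.3.3 follows. Occluders are approximated here by axis-aligned bounding boxes, and the see-through opacity value is an assumption; the actual system casts rays against the scene geometry in Three.js.

```python
import numpy as np

def ray_intersects_aabb(origin, direction, box_min, box_max):
    """Slab test: does a ray hit an axis-aligned bounding box? Returns (hit, t_near)."""
    inv = 1.0 / np.where(direction == 0.0, 1e-12, direction)
    t1, t2 = (box_min - origin) * inv, (box_max - origin) * inv
    t_near, t_far = np.minimum(t1, t2).max(), np.maximum(t1, t2).min()
    return t_far >= max(t_near, 0.0), t_near

def modulate_opacities(camera_pos, entities, buildings, see_through_alpha=0.3):
    """For each moving entity, cast a ray from the camera; any building AABB that
    occludes it gets a reduced opacity so the entity stays visible (Section 4.3.3).
    `buildings` is a list of dicts with 'min' and 'max' corners (np.ndarray)."""
    for b in buildings:
        b["opacity"] = 1.0                              # reset each frame
    for e in entities:
        direction = e - camera_pos
        distance = np.linalg.norm(direction)
        direction = direction / distance
        for b in buildings:
            hit, t_near = ray_intersects_aabb(camera_pos, direction, b["min"], b["max"])
            if hit and 0.0 < t_near < distance:         # building lies between camera and entity
                b["opacity"] = min(b["opacity"], see_through_alpha)
    return buildings
```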
4.4 Experiments and Evaluation

In our experiments, we recorded three 10-minute video clips with a resolution of 1280×720 pixels. We tested both the early- and deferred-pruning algorithms in the following three settings. The first two tests were conducted on a desktop workstation with an NVIDIA Quadro K6000 graphics card running Windows 8.1 on Google Chromium 48.0.2544.0 with WebVR enabled. We tested on a regular desktop display with a resolution of 2560×1440, as well as an Oculus Rift DK2 head-mounted display with a resolution of 960×1080 per eye. The last test was conducted in an immersive curved-screen environment with 15 projectors driven by 4 NVIDIA Quadro K6000 graphics cards. The rendering resolution was 6000×3000 pixels. The results were rendered with the same software setup. The experimental results are shown in Table 4.1.

Table 4.1: The experimental results of early pruning and deferred pruning in Video Fields for different display devices and resolutions

Render Algorithm    Resolution    WebVR   Framerate (FPS)
Early Pruning       2560×1440     No      60.0
Early Pruning       2×960×1080    Yes     55.2
Early Pruning       6000×3000     No      48.6
Deferred Pruning    2560×1440     No      60.0
Deferred Pruning    2×960×1080    Yes     41.5
Deferred Pruning    6000×3000     No      32.4

From the table above, both deferred pruning and early pruning achieve interactive rates in desktop settings. However, for stereo rendering using a WebVR-enabled browser and for high-resolution rendering, deferred pruning suffers from a lower frame rate. This is because deferred pruning carries out the texture sampling after the vertex transformation for each billboard. However, the advantage of the deferred pruning approach is that it achieves better anti-aliasing results than early pruning, as shown in Figure 4.7. On the other hand, the early pruning approach samples the colors at an early stage and discards unnecessary background pixels in the fragment shader, making it a faster approach than deferred pruning.

Please visit the website http://video-fields.com for supplementary materials related to this paper.

4.5 Conclusions and Future Work

In this chapter, we have described a web-based rendering system that fuses multiple surveillance videos to create a dynamic virtual environment. Our approach leverages the recent advances in web-based rendering to design and implement the Video Fields system, which provides a more immersive and easier-to-use dynamic virtual environment for visualizing multiple videos in their appropriate spatial context. We have compared two new ways of fusing moving entities into the 3D world: early pruning and deferred pruning. We found that each has its relative advantages. The choice of which technique to use will depend on the characteristics of the environment being recorded as well as the preferred display device.

Future systems may scale up our approach to handle hundreds of surveillance videos spread over a wider area. To do so effectively, one may use techniques from image and mesh saliency [63, 64, 100, 142] to extract salient moving entities. As we broaden the scale and scope of our work, more efficient distributed systems and parallel computing algorithms will be necessary to achieve interactive rendering rates. We also plan to explore the integration of our efforts with scalable, distributed, and parallel web-services platforms such as Amazon's S3.

The Video Fields system renders a dynamic scene, yet with simple geometries. In Chapter 6, we introduce a real-time solution for fusing multiview video textures with complex dynamic meshes.

Chapter 5: Integrating Haptics and Visual Cryptography into Virtual Environments

In addition to panoramas, images, text, multiview videos, and meshes, VR and AR may also be empowered by gesture recognition, haptic feedback, and visual cryptography. In this chapter, we introduce two projects, VRSurus and ARCrypt, to explore these applications.
5.1 VRSurus: Enhancing Interactivity and Tangibility of Puppets in Virtual Reality Puppets are widely used in storytelling for puppetry on the stage as well as playing games among children. However, as physical inanimate objects, traditional puppets can hardly provide immersive auditory and haptic feedback. Previous re- search such as 3D Puppetry [143] and Video Puppetry [144] used a Microsoft Kinect or an overhead camera to create digital animation with rigid puppets. Nevertheless, they did not allow users to wear a flexible puppet in an immersive virtual reality environment. As exploratory work, we started with the following questions: “Could we endow puppets with more interactivity in both real and virtual world? Could 135 we make puppetry more immersive with virtual reality? Could we allow puppeteers to feel the life of puppets via haptic feedback?” (a) (b) (c) Figure 5.1: Overview of VRSurus. (a) a puppeteer wearing an elephant puppet with VRSurus playing a VR game, (b) shows a closer view of the elephant puppet wearing a custom 3D-printed cap that encapsulates the Arduino microcontroller, an inertial measurement unit and other electronic modules, and (c) shows the educational VR game empowered with gesture recognition and tactile feedback. Inspired by our previous project HandSight [145, 146, 147], which supports the activities of daily living (ADLs) by sensing and feeding back non-tactile information about the physical world as it is touched, we have designed and developed VRSurus (Figure 5.1), a smart device that enhances interactivity and tangibility of puppets using gesture recognition and tactile feedback in a virtual reality environment. The conceptual design of VRSurus is illustrated in Figure 5.2. VRSurus uses simple machine-learning algorithms and an accelerometer to recognize three gestures. It is also built with servo motors, solenoids and vibration motors to render tactile sensations. VRSurus is designed in the “hat” form so that it could be mounted upon any puppet. We have also implemented a serious VR game where the puppeteer manipulates a virtual elephant puppet (corresponding to the physical puppet) to clear the litter in the forests, splash water onto the lumbermen and destroy factories that pollute the air. The game is designed to educate children on environmental 136 protection. Our main contribution is the concept and development of an interactive puppet for virtual reality powered by tactile feedback and gesture recognition, as well as our specific mechanical and software design. The main benefit of our approach is its applicability to VR games and puppetry performance. We have presented this game at the ACM UIST 2015 Student Innovation Contest in Charlotte, North Carolina. Please see our supplementary video for a demonstration: www.vrsurus.com 5.1.1 Hardware Design To empower the puppet’s I/O abilities, we have leveraged Pololu MinIMU-91 (only accelerometer is used in the current prototype; gyroscope and compass are reserved for future use) to capture the puppeteer’s gestures (input) and small push- pull solenoids, servo motors, and vibration motors to simulate tactile feedback and tangible animations on the arm (output). As shown in Figure 5.3, two solenoids, one on either side of the forearm, indicate the directions of the targets. Two servo motors may pull parts of the puppet using a string to simulate physical animations. Additionally, vibration motors can generate vibration feedback on the entire puppet. 
5.1.2 Gesture Recognizer We have trained our classifier using the decision tree algorithm in Weka [4]. In total, we used the following sixteen features: the sum of mean values on all axes, the differences between each pair of axes, the power of each axis, the range 1Pololu MinIMU-9: https://www.pololu.com/product/1265 137 (a) (b) Figure 5.2: Sketches of our conception of VRSurus. (a) shows a puppeteer control- ling a giraffe puppet to play a platform VR game; (b) shows a puppeteer controlling an elephant puppet to play a first-person VR game. Initially, we planned to sew sensors directly onto the puppet. In subsequent iterations, we created a hat-like attachment which allows VRSurus to be mounted onto any puppet. 138 (a) (b) (c) Servo motors 3D-printed exible cable chain Threads Servo motors Elastic bands Solenoids inside Figure 5.3: Mechanical design of VRSurus in details: (a) shows how two servo motors are seated inside the 3D-printed flexible cable chain, (b) shows how servo motors pull the forelegs of the puppet via threads and elastic bands, and (c) shows how solenoids were embedded inside 3D-printed modules around the puppeteer’s arm. of each axis, the cross product between every two axes and the standard deviation of values along each axis. To recognize the four gestures: idle, swiping, shaking and thrusting (Figure 5.4), we collected 240 sets of raw accelerometer values for each gesture from 4 lab members (60 sets per gesture per person). We achieved an average of 97% accuracy using 5-fold cross-validation. The classifier and all signal processing program is written in Java. (a) Swiping (b) Shaking (c) Thrusting Figure 5.4: This figure shows three gestures we designed to control the virtual character: (a) swiping from left to right triggers the character to swing the nose and blow wind, (b) shaking up and down triggers the character to splash water, and (c) thrusting from inward to outward triggers the character to howl, trample on the ground and throw fireballs. A video demonstration is included at www.vrsurus.com. 139 5.1.3 VR Educational Game We designed and implemented a proof-of-concept serious VR game based on WebGL and WebVR written in Javascript and PHP (Figure 5.5). We use Three.js2 for cross-platform rendering and Sea3D Studio3 for rendering the animation in WebGL. In the VR game, the player acts as a little elephant called Surus to prevent evil human beings from invading the forest. The player is able to use defined gestures to commit attacks as stated in the Ideation section. The target can be litter, a lumberman and a polluting factory. The goal of the game is to defeat as many enemies as possible in a limited time. We designed a tutorial session before the game, encouraging players to familiarize themselves with each gesture and all the enemies. When a lumberman appears, the game will play the 3D audio of axe-chopping according to the orientation between the player and the lumberman. Meanwhile, the game sends signals to VRSurus to enable the left or right solenoid to tap on the user’s arm. When the factory is destroyed, the game will output vibration feedback along with the 3D audio of an explosion. After each gesture is successfully performed, the servo motors are signaled to string the elephants’ ears and forelegs backward and forward, simulating physical movements for the puppet. We have also written a web-server in PHP that exchanges signals between the gesture recognizer and the VR game every 10ms. 
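As an aside on Section 5.1.2, the sixteen accelerometer features can be sketched as follows. The sketch is written in Python/NumPy purely for illustration (the actual feature extraction and decision-tree classifier are implemented in Java with Weka), and the exact definitions of the pairwise-difference and cross-product features are interpretations of the text rather than the original code.

```python
import numpy as np
from itertools import combinations

def gesture_features(window):
    """Sixteen features from one accelerometer window (N x 3 array: x, y, z).
    The groupings follow Section 5.1.2; the precise definitions are interpreted
    here for illustration and may differ from the Java/Weka implementation."""
    feats = []
    means = window.mean(axis=0)
    feats.append(means.sum())                                  # 1: sum of per-axis means
    for i, j in combinations(range(3), 2):
        feats.append(means[i] - means[j])                      # 3: pairwise mean differences
    feats.extend((window ** 2).mean(axis=0))                   # 3: per-axis signal power
    feats.extend(window.max(axis=0) - window.min(axis=0))      # 3: per-axis range
    for i, j in combinations(range(3), 2):
        feats.append(np.dot(window[:, i], window[:, j]))       # 3: pairwise axis products
    feats.extend(window.std(axis=0))                           # 3: per-axis standard deviation
    return np.asarray(feats)                                   # 16 features total
```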
2Three.js: http://www.threejs.org 3Sea3D: http://sea3d.poonya.com 140 (a) (b) (c) Figure 5.5: Virtual characters in the VR game. (a) shows the hero Surus controlled by the puppeteer, (b) shows the the enemies, (c) shows the Oculus Rift HMD ren- dering results of the lumberman being defeated by Surus. 5.1.4 Deployment VRSurus was initially deployed for the ACM UIST 2015 Student Innovation Contest in Charlotte, NC. It was demonstrated live throughout the entire 2.5-hour contest (Figure 5.6). Each session lasted about two minutes (one exact minute for the VR game and about one minute for the introduction and hardware prepara- tion). In total, there were 63 participants from the UIST community including two invited K12 specialists, who tried out our device. There were more than a hun- dred people who watched our device as well as the gaming procedure. Overall, the audience’s reactions were positive. After successfully destroying an enemy in the VR environment, many participants cheered and some even laughed loudly. Some people mentioned that the tutorial greatly helped them learn the gestures. People were also greatly encouraged when the game told them that how many trees they “saved” in the end. We also received some criticisms and suggestions for future work. One major concern was that the current prototype is a little fragile and heavy to wear over an extended period, although the device is small enough to be put upon the pup- pet. Some people also mentioned that the gesture recognizer failed to recognize 141 Figure 5.6: During the preliminary deployment, participants from the UIST com- munity interacted with our prototype of VRSurus at the ACM UIST 2015 Student Innovation Contest their gesture when moving slowly and the tutorial did not give them feedback. So during the deployment, we sometimes instructed people to move faster to enable the accelerometer to generate better data. In our next iteration, we plan to train the recognizer with more gestures from more participants and at varying rates of movements. 142 5.2 ARCrypt: Visual Cryptography with Misalignment Tolerance using Augmented Reality Head-Mounted Displays (c) ARCrypt ARCrypt Share A Ordinary Display 1 Row Misalignment Look rough Classical VC Caused by Head Jittering? AR Head-Mounted Display Share B Figure 5.7: Results and overview of our system, ARCrypt. ARCrypt is able to split a confidential message into two shares of images, which guarantees that the original information could not be revealed with either share of the image alone. The ARCrypt system then transmits the two shares to an ordinary display and an augmented reality head-mounted display, respectively. When the user looks through the two aligned images, the secret message is revealed directly to the user’s human visual system. Nevertheless, head jittering may cause the two images to misalign with each other. The proposed ARCrypt algorithm outperforms the original visual cryptography algorithm in the presence of one or two rows of misalignment. A vast amount of private data is displayed on our monitors every day. For ex- ample, the social security numbers from a company’s documents or paychecks, credit card numbers shown in the bill of payment, the plain text of passwords appearing in the registration emails, as well as keycodes from the two-factor authentication messages. Nevertheless, all of these confidential data can easily be eavesdropped by skilled and persistent hackers via cyber-attacks, captured by secret video cameras, or even spied upon by random passersby. 
Consequently, there is a great demand to have a secure mechanism for protecting information displayed on the screen. In the past decades, researchers and scientists have designed sophisticated cryptographic algorithms to encrypt and decrypt messages. However, the device 143 that executes the decryption algorithm may be under threats from attackers. Some other solutions incur new trusted computing base (TCB) such as smart-phones [148] and two-factor authentication [149] to address the problem. Nevertheless, the new TCB can also be compromised as a result of an implementation flaw in the secure protocol during the communication. For example, the heart-bleed bug, found in OpenSSL, a popular secure protocol, was taken advantage of to steal information over the SSL/TLS encryption [150]. In this section, we introduce ARCrypt, a practical and robust cryptographic system that eliminates every device from the TCBs and assumes no connection between the TCBs. First, the encryption algorithm in the ARCrypt system splits the confidential information (as an image) into two shares. One share of data is displayed on the ordinary screen while the other share of data is displayed on the untethered augmented reality headset, such as Microsoft HoloLens. Eventually, the human operator has to conduct the decryption operation by themselves by manually aligning the two shares of information together. Our work was built upon the pioneering research by Andrabi et al.[5], which first demonstrates the usability of using tethered augmented reality headsets like Google Glass to reveal secret messages using the visual cryptography system induced by Naor and Shamir [151]. However, our work differs from their system by taking advantage of a novel visual cryptography algorithm with misalignment tolerance, which makes the system more practical to use. We have modeled the misalignment of head jittering using a 2D Gaussian distribution. We have developed a novel algorithm to enhance the visibility of the 144 classical visual cryptography via diffusion with Gaussian kernels, thus enabling the algorithm to be tolerant with one or two rows of misalignment. 5.2.1 Background and Related Work Our work builds upon the recent advances in augmented reality (AR) headsets, as well as the theory of visual cryptography (VC) and its recent implementations on AR headsets. 5.2.1.1 Augmented Reality Head Mounted Displays In recent years, there has been an increasing number of commercial AR head- mounted displays, such as Google Glass, ODG Smart Glasses, Meta Headsets, Epson Moverio, and Microsoft HoloLens. These displays blend the virtual information rendered by the computer with the real scene observed by the user. There are several factors to be considered in selecting AR displays and de- signing rendering algorithms and content for these displays, in which field of view (FoV) and resolution are the dominant ones. In this project, we chose Microsoft HoloLens because it has a wider FoV and a better resolution (30°FoV, 1268× 720 pixels), compared with Google Glass (14°, 640×360 pixels) and Epson Moverio (23°, 960× 540 pixels). Moreover, Microsoft HoloLens itself is a standalone and unteth- ered computer, which is more likely to be considered as a trusted computing base when the Internet connection is switched off. As most of the AR headsets, Microsoft HoloLens uses additive color mixing 145 strategy to project visual information onto the display [152]. 
Consequently, for each pixel, the lower RGB values it has, the less visible it is through the display. Besides, the black color indicates total transparency in HoloLens; thus rendering a black quad through HoloLens to observe a white quad yields a white quad in the user’s perception; a semi-transparent red quad through the HoloLens overlaid on a green quad yields a yellow quad. 5.2.1.2 Visual Cryptography Visual cryptography is a cryptographic scheme in which decodes a secret im- age without any computational cryptographic operations. The well-known visual cryptography fundamental theory was firstly developed by Naor and Shamir [151]. An example of visual cryptography with two shares is shown in Figure 5.8. First, the algorithm splits one message into N shares with different transparency. Suppose the original image has w×w pixels, each share will have w × w2 2 blocks of 2×2 pixels. Each block will be one of the six patterns as shown in Figure 5.8(a). Meanwhile, it ensures that a person with any K shares of the data can visually restore the image by stacking their transparencies (as in Figure 5.8(b) and (c)), but any K−1 shares of the data cannot restore the image. Carlo et al.[153] advanced the theoretical foundations of visual cryptography for use with grey-scale images. Zhou et al.[154] have combined dithering techniques with visual cryptography to encrypt gray-scale images. Hou et al.[155] have pre- sented novel algorithms of visual cryptography for color images. Bin et al.[156] have 146 (a) Valid horizontal, vertial, and diagnoal blocks for an image share + = + = (b) Revealing a black or (noisy) white block using two shares of images + = (c) A concrete example of visual cryptography Figure 5.8: Examples of visual cryptography proposed by Naor and Shamir [151]. (a) shows six valid patterns for one 2×2 block of pixels to be selected for an image share, (b) shows the addition operator when fusing one share with the other share, and (c) shows our example of revealing the characters “AR CRYPT” from two image shares. 147 proposed an edge-preserving technique for dithering to improve visual quality for visual cryptography. Recently, Andrabi et al.[5] conducted the first formal user study to investigate the feasibility and usability of using Google Glass and Epson Moverio for reading visual cryptography (Figure 5.9). Their system requires users to use a chin rest to minimize head jittering effects. Even with the chin rest, users spent a consid- erable amount of time initially aligning the image shares: ranging from 18.49 to 313.32 seconds. Apart from the effort in initial alignment, the participants spent approximately 8 seconds to decode and recognize a single plaintext character. + = Share 1 Share 2 Observed from Eyes Hardware Setup Figure 5.9: The pioneering AR-based visual cryptography system, developed by Andrabi et al.[5], is able to encrypt a single character with the help of a chin rest. Our work is motivated by the challenge of misalignment, inspired by the study conducted by Andrabi et al.[5]. Specifically, we desire to enable the user to see the fused image even though the two images are not perfectly aligned. Although the latest Microsoft HoloLens has the capability of detecting the depth map and stabilizing the image at a 3D position of the reconstructed world, the depth mapping, and tracking is still far from perfect. It is likely that the image could move a couple of pixels with a little bit of the user’s head jitter. 
An example taken from HoloLens's mixed-reality capture is shown in Figure 5.10.

Figure 5.10: A real case of the misalignment challenge for visual cryptography using augmented reality headsets. The red line shows the border of the overlaid image. (b) has better alignment than (a), so the left character "V" appears from the HoloLens. "R" cannot be observed clearly in (b), because the HoloLens's debugging camera has a slight translation relative to the actual human eyes. In reality, both "V" and "R" are shown under (b)'s condition but not in (a).

5.2.2 ARCrypt Algorithms

In this section, we describe the core algorithm of ARCrypt, which has tolerance for limited misalignment. The main idea behind ARCrypt is: for each pixel p in one share, model the probability of misalignment onto another pixel q as a 2D Gaussian distribution centered at the pixel p. In this way, we sacrifice a little contrast in the fused result for better clarity when one or two rows of misalignment occur.

5.2.2.1 Preprocessing

Following [5] and [151], given a confidential visual image I, we first generate a binary image Î by thresholding every 2×2 block of pixels in I. Here, we denote F(Î) and B(Î) as the sets of foreground (white) and background (black) pixels of Î, respectively.

Next, we model the range of misalignment as an s×s square. We generate an s×s 2D Gaussian kernel G(x,y,\sigma) at scale \sigma:

G(x,y,\sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}    (5.1)

where \sigma indicates the standard deviation of the misalignment. In our experiments, we choose s = 3, \sigma = 1.0 and s = 5, \sigma = 2.0.

5.2.2.2 Generation of Two Shares

The ARCrypt algorithm generates the first share following the classical VC approach, as shown in Figure 5.8. For each 2×2 block of pixels in the first share, we randomly choose one of the six VC patterns. Next, we consider two solutions to deal with the possible misalignment:

1. ARCrypt*: for the second share, we only diffuse the foreground pixels: each foreground pixel has a probability of being misaligned with one of its surrounding pixels; in this way, when the two shares match perfectly, the background is unchanged, but the foreground is darker.

2. ARCrypt: for the second share, we diffuse both the background and foreground pixels to enhance the contrast: every pixel has a probability of being misaligned with one of its surrounding pixels.
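Before the full algorithm, the preprocessing of Section 5.2.2.1 and the classical first-share generation can be sketched as follows (a Python/NumPy illustration, not the C++ program used for the experiments; the block thresholding rule and the pattern ordering are assumptions).

```python
import numpy as np

# The six valid 2x2 sub-pixel patterns of Naor-Shamir visual cryptography
# (two horizontal, two vertical, two diagonal); 1 = white, 0 = black.
PATTERNS = np.array([[[1, 1], [0, 0]], [[0, 0], [1, 1]],
                     [[1, 0], [1, 0]], [[0, 1], [0, 1]],
                     [[1, 0], [0, 1]], [[0, 1], [1, 0]]], dtype=np.uint8)

def binarize_blocks(image, threshold=0.5):
    """Section 5.2.2.1: threshold every 2x2 block of the secret image;
    returns a (rows/2, cols/2) array with 1 = foreground (white)."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    blocks = image[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))
    return (blocks > threshold * image.max()).astype(np.uint8)

def misalignment_kernel(s=3, sigma=1.0):
    """Eq. 5.1: an s x s Gaussian kernel modeling head jitter, normalized to sum to one."""
    ax = np.arange(s) - s // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return kernel / kernel.sum()

def first_share(secret_blocks, rng=None):
    """Share A: one of the six patterns chosen uniformly at random per 2x2 block."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = secret_blocks.shape
    chosen = PATTERNS[rng.integers(0, 6, size=(h, w))]     # shape (h, w, 2, 2)
    return chosen.transpose(0, 2, 1, 3).reshape(2 * h, 2 * w)
```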
The pseudo code of the core algorithm is shown in Algorithm 5:

ALGORITHM 5: ARCrypt: Visual Cryptography with Misalignment Tolerance
Input: binary secret image Î
Output: two shares of information Iα and Iβ
Generate a random share Iα using the six random patterns;
for each 2×2 block b_rc of Î do
    for each 2×2 block b_ij of Î, where |r−i| ≤ s/2 and |c−j| ≤ s/2 do
        if b_ij ∈ F(Î) or AllowBackgroundDiffusion then
            look up the probability of misaligning b_rc with b_ij from the Gaussian kernel G(x,y,σ): P(b_rc, b_ij) ← G(r−i, c−j, σ);
            increase the normalization factor of b_rc: N_rc ← N_rc + P(b_rc, b_ij);
        end
    end
    Normalize the probabilities: P(b_rc, b_ij) ← P(b_rc, b_ij) / N_rc;
    Generate a random number from a uniform distribution: r ∈ [0,1];
    Set the accumulated probability: A_rc ← 0;
    for each 2×2 block b_ij of Î, where |r−i| ≤ s/2 and |c−j| ≤ s/2 do
        if b_ij ∈ F(Î) or AllowBackgroundDiffusion then
            A_rc ← A_rc + P(b_rc, b_ij);
            if r ≤ A_rc then (p,q) ← (i,j); break; end
        end
    end
    if b_pq ∈ B(Î) then
        Iβ_rc ← Iα_pq;
    else
        Iβ_rc ← WHITE − Iα_pq;
    end
end

Figure 5.11: Results of the classical visual cryptography approach and ARCrypt with different parameters (rows: VC, ARCrypt* (s=3, σ=1), ARCrypt (s=3, σ=1), ARCrypt (s=5, σ=2), and ARCrypt (s=7, σ=2); columns: exact match, 1RM, 1CM, 1R1CM, 2RM, and 2R2CM). Here s indicates the size of the Gaussian kernel and σ indicates the standard deviation of the Gaussian kernel. 1RM indicates one row of misalignment, 1CM indicates one column of misalignment, and so forth. From the results, we observe that both ARCrypt* and ARCrypt outperform the visual cryptography algorithm for 1RM or 1CM misalignment. ARCrypt provides better contrast than ARCrypt* when 1RM or 1CM misalignment occurs.

5.2.3 Experimental Results

Considering that the resolution of HoloLens is 1268×720 pixels, we generate visual cryptography images at a resolution of 1024×1024 pixels using a custom C++ program implementing the proposed ARCrypt algorithm, as well as the classical visual cryptography algorithm. We render the misalignment under five conditions: one row (2 pixels) of misalignment (1RM), one column of misalignment (1CM), one row and one column of misalignment (1R1CM), two rows of misalignment (2RM), and two rows and two columns of misalignment (2R2CM). The images are rendered on a high-PPI (pixels per inch) monitor (PPI = 264). Lower contrast may be expected on a lower-PPI monitor. As shown in Figure 5.11, we arrive at the following insights:

1. The classical visual cryptography algorithm does not work with even a single row or column of misalignment, making it hard to interpret the image using augmented reality headsets.

2. ARCrypt* can deal with one row or one column of misalignment (2 pixels) while preserving as good a contrast as the original visual cryptography algorithm. However, the contrast drops with misalignment.

3. ARCrypt provides better contrast than ARCrypt* when misalignment occurs and even works for the 1R1CM case (2 pixels misaligned both horizontally and vertically). After increasing the size and scale of the Gaussian kernel, we can still see the secret message even with 2 rows (4 pixels) of misalignment.

4. ARCrypt cannot deal with the 2R2CM case (4 pixels misaligned both horizontally and vertically).
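To reproduce comparisons in the spirit of Figure 5.11, the two shares can be stacked with a simulated offset. The sketch below models the stacking as printed transparencies (Figure 5.8), where a sub-pixel stays white only if it is white in both shares; the offset simulates head jitter, and the block-contrast score is an assumed legibility proxy rather than a measure used in this chapter.

```python
import numpy as np

def stack_with_misalignment(share_a, share_b, dy=0, dx=0):
    """Overlay two binary shares as stacked transparencies (Figure 5.8):
    a stacked sub-pixel is white (1) only where both shares are white.
    (dy, dx) shifts share B to simulate head jitter, e.g. dy = 2 pixels
    for the one-row-misalignment (1RM) condition."""
    shifted = np.roll(np.roll(share_b, dy, axis=0), dx, axis=1)
    return (share_a & shifted).astype(np.uint8)

def block_contrast(stacked, secret_blocks):
    """A crude legibility score: the absolute difference between the mean
    brightness of foreground blocks and of background blocks in the stacked
    result (larger means the secret is easier to read)."""
    h, w = secret_blocks.shape
    means = stacked[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))
    fg = secret_blocks.astype(bool)
    return abs(means[fg].mean() - means[~fg].mean())
```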
5.3 Conclusions and Future Work

In this section, we have adapted gesture recognition, tactile feedback, and visual cryptography for current-generation virtual and augmented reality headsets.

First, we have described VRSurus, a prototype device that enhances the interactivity of puppets with gesture recognition and haptic feedback for virtual reality gaming. We received initial feedback from participants at the ACM UIST 2015 Student Innovation Contest. We have identified several directions for future work. One is to add more input modules for better animations. For example, by embedding flex sensors in different parts of the puppet (e.g., ears and nose), the virtual character can animate the corresponding parts when the puppeteer manipulates the physical one in the real world. By using a microphone, we can facilitate performances which kinetic input cannot support, such as creating a conversation with other virtual roles. Additionally, it could be a promising storytelling tool to express children's creativity. For instance, a child can experience others' stories with recorded sound, tactile feedback, and physical movements all at the same time.

Next, our system ARCrypt uses a novel visual cryptography algorithm which is tolerant to users' head jitter and misalignment of the two shares of encrypted visual information. We achieve this by modeling the misalignment through a 2D Gaussian distribution over the visual cryptography's random patterns. This allows us to trade off precise alignment against perceived contrast. We believe ARCrypt provides a versatile, commodity, off-the-shelf solution for embedding encrypted augmented reality information in real-world displays, thereby protecting confidential data while facilitating an easy-to-use visual decryption. Future systems may take advantage of robust tracking algorithms to automatically align the two shares of encrypted visual information.

Chapter 6: Montage4D: Real-time Seamless Fusion of Multiview Video Textures

6.1 Introduction

With recent advances in consumer-level virtual and augmented reality, several dynamic scene reconstruction systems have emerged, including KinectFusion [157], DynamicFusion [158], Free-Viewpoint Video [159], and Holoportation [160]. Such 4D reconstruction technology is becoming a vital foundation for a diverse set of applications such as 3D telepresence for business, live concert broadcasting, family gatherings, and remote education.

Among these systems, Holoportation is the first to achieve real-time, high-fidelity 4D reconstruction without any prior knowledge of the imaged subjects. The success of this system builds upon the breakthrough of fast non-rigid alignment algorithms in fusing multiview depth streams into a volumetric representation by the Fusion4D system [161]. Although Holoportation is able to mitigate a variety of artifacts using techniques such as normal-weighted blending and multilevel majority voting, some artifacts persist. In a previous user study on Holoportation [160], around 30% of the participants did not find the reconstructed model real compared with a real person.

Figure 6.1: The Montage4D algorithm stitches multiview video textures onto dynamic meshes seamlessly and at interactive rates.
(A) inputs: dynamic triangle meshes reconstructed by the Fusion4D algorithm, multiview video textures, and camera poses; (B) merged texture result using Holoportation, which employs normal-weighted blending with dilated depth discontinuities and a majority voting algorithm; (C) the corresponding color-coded field of scalar texture weights of (B), which we call a texture field; (D) result using Montage4D, which favors the dominant view, ensures temporal consistency, and also reduces seams between camera views; (E) the corresponding texture field of (D).

We believe that this is a significant challenge that must be addressed before telepresence can be embraced by the masses. We also note that the user feedback about visual quality was much less positive than that about other aspects (speed and usability). This is caused by the blurring and visible seams in the rendering results, especially on human faces, as shown in Figure 6.1.

Blurring arises for two reasons. First, texture projection from the camera to the geometry leads to registration errors around visible seams. Second, normal-weighted blending of the different views with different appearance attributes (specular highlights and inconsistent color calibration) leads to an inappropriate mixing of colors and therefore blurring. We further characterize visible seams into two categories: misregistration seams are caused by imprecisely reconstructed geometry with missing or extruded triangles; occlusion seams arise from discontinuous texture transitions across the fields of view of multiple cameras and from self-occlusions.

In this chapter, we address both blurring and visible seams and achieve seamless fusion of video textures at interactive rates. Our algorithm estimates the misregistration and occlusion seams based on the self-occlusion from dilated depth discontinuities, multi-level majority voting, foreground segmentation, and the field of view of the texture maps. To achieve a smooth transition from one view to another, we compute geodesic distance fields [162] from the seams to spatially diffuse the texture fields to the visible seams. In order to prevent view-dependent texture weights from rapidly changing with the viewpoint, we extend the scalar texture field shown in Figure 6.1(C) to a temporally changing field to smoothly update the texture weights. As shown in Figures 6.1(D) and 6.9, our system achieves significantly higher visual quality at interactive rates compared to the state-of-the-art Holoportation system. Please refer to www.montage4d.com for the supplementary video, slides, and future code and dataset releases.

The main contributions of the Montage4D work are:

• formulation and quantification of the misregistration and occlusion seams for fusing multiview video textures,

• use of equidistant geodesics from the seams, based on discrete differential geometry concepts, to diffuse texture fields,

• temporal texture fields to achieve temporal consistency of the rendered imagery, and

• a fast computational pipeline for high-fidelity, seamless video-based rendering, enabling effective telepresence and customized real-time stylization.

6.2 Related Work

We build upon a rich literature of prior art on image-based 3D reconstruction, texture stitching, and discrete geodesics.

6.2.1 Image-based 3D Reconstruction

Image-based 3D reconstruction has been researched extensively in the past decades.
The pioneering work of Fuchs et al. [163, 164] envisioned that a patient on the operating table could be acquired by a sea of structured-light cameras, and that a remote doctor could conduct medical teleconsultation with a head-mounted display. Kanade et al. [165] invented one of the earliest systems that used a dome of cameras to generate novel views via triangulated depth maps. Its successor, 3D Dome [166], reconstructs explicit surfaces with projected texture. Towles et al. [167] achieve real-time 3D telepresence over networks using 3D point clouds. Goldluecke et al. [168] adopt spatiotemporal level sets for volumetric reconstruction. Furukawa et al. [169] reconstruct deformable meshes by optimizing traces of vertices over time. While compelling, their approach takes two minutes on a dual Xeon 3.2 GHz workstation to process a single frame. De Aguiar et al. [170] present a system that reconstructs space-time coherent geometry with motion and textural surface appearance of actors performing complex and rapid moves. However, it also suffers from slow processing speed (approximately 10 minutes per frame), largely due to challenges in stereo matching and optimization. Since then, a number of advances have been made in dealing with video constraints and rendering quality [90, 159, 171, 172, 173, 174, 175, 176, 177, 178, 179], but rendering dynamic scenes in real time from video streams has remained a challenge. Zitnick et al. [180] present an efficient rendering system that interpolates between the two adjacent views using a boundary layer and video matting. However, they consider a 2.5D layered representation of the scene geometry rather than a general mesh model that can be viewed from all directions. Their work inspired our computation of depth discontinuities and seam diffusion.

With recent advances in consumer-level depth sensors, several reconstruction systems can now generate dynamic point-cloud geometries. KinectFusion [157, 181] is the first system that tracks and fuses point clouds into dense meshes using a single depth sensor. However, the initial version of KinectFusion cannot handle dynamic scenes. The systems developed by Ye et al. [182] and Zhang et al. [183] are able to reconstruct non-rigid motion for articulated objects, such as human bodies and animals. Further advances by Newcombe et al. [158] and Xu et al. [184] have achieved more robust dynamic 3D reconstruction from a single Kinect sensor by using warp fields or subspaces for the surface deformation. Both techniques warp a reference volume non-rigidly to each new input frame. Guo et al. [185, 186] and Yu et al. [187] have realized real-time geometry, albedo, and motion reconstruction using a single RGB-D camera. However, the reconstructed scenes still suffer from occlusion issues since the data comes from a single depth sensor. In addition, many 3D reconstruction systems rely on a volumetric model used for model fitting, which limits their ability to accommodate fast movement and major shape changes. Collet et al. [159] have demonstrated Free-Viewpoint Video, an offline pipeline that reconstructs dynamic textured models in a studio setup with 106 cameras. However, it requires controlled lighting, careful calibration, and approximately 28 minutes per frame for reconstruction, texturing, and compression. Furthermore, Prada et al. [177, 178] present a unified framework for evolving the mesh triangles and the spatiotemporal parametric texture atlas.
Nonetheless, the average processing time for a single frame is around 80 seconds, which is not yet applicable to real-time applications. Orts et al. [160] present Holoportation, a real-time pipeline that captures dynamic 3D scenes using multiple RGBD cameras. This system is highly robust to sudden motion and large changes in the meshes. To achieve real-time performance, their system blends multiview textures according to the dot product between the surface normals and the camera viewing directions.

Our system extends the Holoportation system and solves the problems of fuzziness caused by normal-weighted blending and of visible seams caused by misregistration and occlusion, while ensuring temporal consistency of the rendered images. In the state-of-the-art work by Dou et al. [188], with depth maps generated at up to 500 Hz [189, 190], a detail layer is computed to capture the high-frequency details and atlas mapping is applied to improve the color fidelity. Our rendering system is compatible with this new fusion pipeline, by integrating the computation of seams, geodesic fields, and view-dependent rendering modules.

6.2.2 Texture Stitching

View-dependent texture mapping on the GPU has been widely applied to reconstructed 3D models since the seminal work by Debevec et al. [191, 192]. However, seamlessly texturing an object by stitching RGB images remains a challenging problem due to inexact geometry, varying lighting conditions, and imprecise calibration matrices.

Previous work has considered using global optimization algorithms to improve color-mapping fidelity on static models. For example, Gal et al. [193] present a multi-label graph-cut optimization approach that assigns compatible textures to adjacent triangles to minimize the seams on the surface. In addition to the source images, their algorithm also searches over a set of local image transformations that compensate for geometric misalignment using a discrete labeling algorithm. While highly creative and elegant, their approach takes 7 to 30 minutes to process one frame on a mesh with 10,000 to 18,000 triangles. Markov Random Field (MRF) optimization-based approaches [194, 195, 196] are similarly time intensive. To reduce the seams caused by different lighting conditions, Zhou et al. [197] introduce TextureMontage, which automatically partitions the mesh and the images, driven solely by feature correspondences. TextureMontage integrates a surface texture in-painting technique to fill in the remaining charts of the surface that have no corresponding texture patches. However, their approach takes over 30 minutes per frame to process. Zhou et al. [198] optimize camera poses in tandem with non-rigid correction functions for all images at the cost of over 30 minutes per frame. Narayan et al. [199] jointly optimize a non-linear least squares objective function over camera poses and a mesh color model at the cost of one to five minutes per frame. They incorporate 2D texture cues, vertex color smoothing, and texture-adaptive camera viewpoint selection into the objective function.

A variety of optical-flow-based approaches have been used to eliminate blurring and ghosting artifacts. For example, Eisemann et al. [200] introduce Floating Texture, a view-dependent rendering technique with screen-based optical flow running at 7 to 22 frames per second (we tested Floating Texture on a GTX 1080 at target resolutions of 1024×1024 and 2048×2048). Casas et al. [201] extend their online alignment with spatiotemporal coherence, running at 18 frames per second.
Volino et al. [202] employ a surface-based optical flow alignment between views to eliminate blurring and ghosting artifacts. However, the major limitations of optical-flow-based approaches are twofold. First, surface specularity [200], complex deformations, poor color calibration, and low resolution of the textures [201] present challenges for optical flow estimation. Second, even with GPU computation, the computational overhead of optical flow remains a limitation for real-time rendering, and this overhead increases further with more cameras.

In studio settings, Collet et al. [159] have found that with diffuse lighting conditions and precisely reconstructed surface geometry, direct image projection followed by normal-weighted blending of non-occluded images yields sufficiently accurate results. However, for real-time reconstruction systems with a limited number of cameras, the reconstructed geometries are often imperfect. Our work focuses on improving the texture fusion for such real-time applications. Building upon the pioneering research above, as well as the work of several others, our approach is able to process over 130,000 triangles at over 100 frames per second.

6.2.3 Geodesic Distance Fields

The field of discrete geodesics has witnessed impressive advances over the last decade [203, 204, 205]. Geodesics on smooth surfaces are the straightest and locally shortest curves, and they have been widely used in a variety of graphics applications such as optimal movement of an animated subject. Mitchell et al. [206] devise an exact algorithm for computing "single source, all destinations" geodesic paths. For each edge, their algorithm maintains a set of tuples (windows) encoding the exact distance fields and directions, and updates the windows with a priority queue, much like Dijkstra's algorithm. However, the worst-case running time is O(n² log n), and the average is close to O(n^1.5) [6, 162]. Recently, Qin et al. [207] proposed a 4 to 15 times faster algorithm using window pruning strategies. However, their algorithm targets exact geodesic paths and requires O(n²) space like the previous approaches. Kapoor [208] proposes a sophisticated approach for the "single source, single destination" case that runs in O(n log² n) time. As for approximate geodesics, Lanthier [209] describes an algorithm that adds many extra edges to the mesh. Kanai and Suzuki [210] and Martinez et al. [211] use iterative optimization to converge to the geodesic path locally. However, their methods require a large number of iterations.

In this work, we compute geodesic distance fields for weighting the texture fields, so as to assign low weight near the seams and progressively larger weight up to some maximum distance away from the seams. Our goal is to solve the "multiple sources, all destinations" geodesics problem. Bommes et al. [162] have introduced an accurate algorithm for the computation of geodesic distance fields. In this chapter, we follow a variant of the efficient algorithm developed by Surazhsky et al. [6] to approximate the geodesic distance fields in O(n log n) time for a small number of source vertices (seam vertices are approximately 1% of the total vertices) in a few iterations (typically 15-20).
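To make the "multiple sources, all destinations" idea concrete, the following is a minimal sketch in Python, not the window-based propagation of Surazhsky et al. [6] used in this chapter: it substitutes a multi-source Dijkstra search over mesh edges, which only upper-bounds the true geodesic distance, and caps the propagation at a maximum distance from the seams. The data layout (edge list, edge lengths, seam vertex indices) and the linear ramp to texture weights are illustrative assumptions.

```python
import heapq
import numpy as np

def approx_geodesic_field(num_vertices, edges, edge_lengths, seam_vertices, max_dist):
    """Multi-source Dijkstra over mesh edges: a coarse stand-in for the
    window-based geodesic propagation in the text. Distances are measured
    along edges only, so they upper-bound the true surface geodesics."""
    dist = np.full(num_vertices, np.inf)
    heap = []
    for s in seam_vertices:            # every seam vertex is a source at distance 0
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))

    adj = [[] for _ in range(num_vertices)]   # adjacency list from the edge set
    for (a, b), length in zip(edges, edge_lengths):
        adj[a].append((b, length))
        adj[b].append((a, length))

    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v] or d > max_dist:       # skip stale entries; stop far from seams
            continue
        for u, length in adj[v]:
            nd = d + length
            if nd < dist[u]:
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

# Illustrative use: low texture weight at the seams, ramping up to 1.0 at max_dist,
# mirroring how the distance fields diffuse the texture fields in this chapter.
# weights = np.clip(approx_geodesic_field(...) / max_dist, 0.0, 1.0)
```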
6.3 System Overview

In this section, we present the workflow of the Montage4D system, as shown in Figure 6.2.

Figure 6.2: The workflow of the Montage4D rendering pipeline.

1. Streaming of Meshes and Videos: Our system streams polygonal meshes and video textures from a reconstruction server that runs the Fusion4D algorithm. The calibration parameters for the projective mapping from camera space to model space are transferred only for the initial frame.

2. Rasterized depth maps and segmented texture maps: For each frame, Montage4D estimates rasterized depth maps from each camera's viewpoint and perspective, in parallel on the GPU. The video textures are processed with a background subtraction module, using the efficient real-time algorithm for mean field inference [212].

3. Seam identification with dilated depth discontinuities: The renderer estimates the dilated depth discontinuities from the rasterized depth maps, which are bounded by an estimated reconstruction error e. This is crucial for reducing ghosting artifacts, which arise when missing geometry and self-occlusion cause incorrect color projection onto surfaces. The renderer uses the texture maps to calculate the seams due to each camera's limited field of view.

4. Geodesic fields: After all the seams have been identified, the renderer calculates each vertex's distance to the seams based on the straightest discrete geodesic. The distance fields non-linearly filter the texture fields while ensuring spatial smoothness of the resulting texture fields.

5. Temporal texture fields: Using the parameters of the current rendering camera, the renderer also determines the view-dependent weight of each texture. However, if an abrupt jump in viewpoint occurs, the texture field may change rapidly. To overcome this challenge, Montage4D employs temporal texture weights, so that each texture weight changes smoothly toward its target value over time.

6. Color synthesis and post-processing: We fuse the sampled colors using the temporal texture fields for each pixel in screen space. Our system also provides an optional post-processing module for screen-space ambient occlusion.

6.4 Algorithms

In this section, we describe the Montage4D algorithms.

6.4.1 Formulation and Goals

For each frame, given a triangle mesh and N video texture maps M_1, M_2, ..., M_N streamed from the dedicated Fusion4D servers, our goal is to assign to each mesh vertex v a vector (T_v^1, ..., T_v^N) of scalar texture weights. Let the texture field T denote the piecewise linear interpolation of these vectors over the triangle mesh. For each non-occluded vertex v ∈ R³, we calculate a pair of corresponding (u, v) coordinates for each texture map using back-projection. Finally, the resulting color c_v is fused using the normalized texture field T_v at vertex v:

$$ c_v \;=\; \sum_{i=1}^{N} c_v^i \cdot T_v^i \;=\; \sum_{i=1}^{N} \mathrm{texture2D}(M_i, u, v) \cdot T_v^i \qquad (6.1) $$

In order to achieve high-quality rendering, we need to take the following factors into consideration:

1. Smoothness: The transition between the texture fields of adjacent vertices should be smooth, because human perception is especially sensitive to texture discontinuities.

2. Sharpness: The rendered image should preserve the fine-scale detail of the input textures. However, due to imprecisely reconstructed geometry, fusing all the textures onto the mesh usually results in blurring or ghosting artifacts.

3. Temporal Consistency: The texture fields should vary smoothly over time as the mesh changes and as the user's viewpoint changes.
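The per-vertex fusion of Equation 6.1 and the temporal weight update of step 5 above can be summarized in a short Python sketch. It assumes the per-view sampled colors and texture weights have already been gathered into arrays (the projective texture lookup itself is elided), and the blend rate used for the temporal update is an illustrative choice, not the schedule used by Montage4D.

```python
import numpy as np

def fuse_colors(sampled_colors, texture_weights):
    """Equation 6.1: the fused vertex color is the weighted sum of the colors
    sampled from the N texture maps, using normalized texture weights.
    sampled_colors:  (V, N, 3) colors back-projected from each view
    texture_weights: (V, N)    scalar texture field values T_v^i"""
    w = texture_weights / np.maximum(texture_weights.sum(axis=1, keepdims=True), 1e-8)
    return (sampled_colors * w[..., None]).sum(axis=1)     # (V, 3) fused colors

def update_temporal_weights(current, target, rate=0.1):
    """Step 5 of the pipeline: move the temporal texture field a fraction of the
    way toward the view-dependent target each frame, so an abrupt viewpoint jump
    does not make the weights change abruptly. `rate` is illustrative only."""
    return current + rate * (target - current)
```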
6.4.2 Normal Weighted Blending with Dilated Depth Maps and Coarse-to-Fine Majority Voting Strategy

Our baseline approach is derived from the real-time implementation in the Holoportation project. This approach uses normal-weighted blending of non-occluded textures, together with a coarse-to-fine majority voting strategy. For each vertex v, the texture field T_v^i for the i-th view is defined as

$$ T_v^i \;=\; V_v \cdot \max\!\left(0,\; \hat{n}_v \cdot \hat{v}_i\right)^{\alpha}, \qquad (6.2) $$

where V_v is a visibility test using the dilated depth maps and the multi-level majority voting algorithm introduced below, n̂_v is the smoothed normal vector at vertex v, v̂_i is the view direction of the i-th camera, and α determines the smoothness of the transition and the preference for frontal views. This approach determines the texture fields purely from the geometry, which may have missing or extruded triangles. The resulting texture fields of adjacent vertices may therefore favor completely different views, introducing visible seams.

Figure 6.3: This figure shows how texture weight fields improve the rendering quality compared to previous work (the baseline approach). The Holoportation approach removes several ghosting artifacts by taking advantage of dilated depth maps and a majority voting algorithm (top row); however, the rendering still suffers from fuzziness and visible seams (bottom row). (A) shows the raw projection mapping result from an input video texture, (B) shows the culling result after the occlusion test, (C) shows the culling result after using dilated depth maps and the majority voting algorithm, (D) shows the input mesh, and (E) and (F) respectively show the rendering results from the baseline approach and our algorithm, together with the corresponding texture weight fields for comparison.

In order to remove the ghosting effect, we adopt the method from the Holoportation project, which uses a dilated depth map to detect the occluded regions as shown in Figure 6.3(C), thus removing many artifacts caused by inexact geometry: at each textured point on a surface, and for each input view, we search for any depth discontinuities in its projected 2D neighborhood within a rasterized depth map, with a radius determined by ϵ = 4 pixels. If such a discontinuity is found, we set T_v^i = 0. In addition, we adopt the same multi-level majority voting strategy. For a given vertex v and texture map M_i, we search from coarse to fine levels; the sampled color c_v^i is trusted if at least half of the visible views (we denote the number of visible views by X) agree with it in the Lab color space (δ = 0.15):

$$ \sum_{\substack{j=1 \\ j \neq i}}^{N} \big( |c_v^i - c_v^j| < \delta \big) \;\ge\; \Big\lfloor \frac{X}{2} \Big\rfloor \qquad (6.3) $$

Although the dilated depth maps and the multi-level majority voting strategy can mitigate most of the ghosting effects in real time (Figure 6.3(C)), the rendering results still suffer from blurring and visible seams, as shown in Figure 6.3(E).
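The following is a minimal per-vertex sketch of this baseline, covering the normal-weighted blending of Equation 6.2 and the voting test of Equation 6.3. It is a simplification: the visibility test is passed in as a precomputed flag, the exponent value is illustrative, the Lab color difference is treated as a Euclidean distance on normalized values, and the exact handling of the reference view within the count of X visible views follows one plausible reading of the equation.

```python
import numpy as np

def baseline_texture_weight(normal, view_dir, visible, alpha=2.0):
    """Equation 6.2: T_v^i = V_v * max(0, n_hat . v_hat_i)^alpha.
    `visible` stands in for the dilated-depth-map / voting visibility test V_v;
    `alpha` (illustrative value) sharpens the preference for frontal views."""
    n = normal / np.linalg.norm(normal)
    v = view_dir / np.linalg.norm(view_dir)
    return float(visible) * max(0.0, float(np.dot(n, v))) ** alpha

def passes_majority_vote(colors_lab, i, visible_mask, delta=0.15):
    """Equation 6.3: the color sampled from view i is trusted if at least half of
    the X visible views agree with it within delta in Lab space.
    colors_lab:   (N, 3) per-view sampled colors for one vertex, assumed normalized Lab
    visible_mask: (N,)   boolean visibility of each view at this vertex"""
    X = int(visible_mask.sum())                    # number of visible views
    agree = 0
    for j in range(len(colors_lab)):
        if j == i or not visible_mask[j]:
            continue
        if np.linalg.norm(colors_lab[i] - colors_lab[j]) < delta:
            agree += 1
    return agree >= X // 2                         # floor(X / 2) as in Equation 6.3
```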
6.4.3 Computing Misregistration and Occlusion Seams

Our algorithm identifies a triangle as a misregistration or occlusion seam when any of the following three cases occurs:

1. Self-occlusion: One or two vertices of the triangle are occluded in the dilated depth map while the others are not.

2. Majority voting: The triangle's vertices have different results in the majority voting process, which may be caused by either misregistration or self-occlusion.

3. Field of View: One or two of the triangle's vertices lie outside the camera's field of view, or in the subtracted background region, while the others do not.

Some of these examples are shown in Figure 6.4.

Figure 6.4: Examples of misregistration and occlusion seams. (A) shows the raw projection mapping result of a monkey toy in front of a plaid shirt, (B) shows the seams after the occlusion test with dilated depth maps, and (C) shows the seams after the majority voting test. Note that while (B) fails to remove some ghosting artifacts from the monkey toy, (C) removes most of them. (D) shows another projection onto a crane toy, and (E) shows the seams identified by the field-of-view test.

For the datasets acquired for real-time telepresence applications, we have observed the fraction of seam triangles to be less than 1%. This observation has guided us to process only the triangles adjacent to the seams, using a propagation procedure that calculates the geodesics directly on the GPU.

6.4.4 Discrete Straightest Geodesics for Diffusing Seams

We efficiently diffuse the texture fields using the geodesic distance fields, making a tradeoff between accuracy and efficiency. We follow a variant of the highly efficient approximation algorithm described in [6].

Figure 6.5: Illustration of computing the approximate geodesics. (A) shows the concept of the geodesic window from a single source vertex. (B) shows the components within a window. (C) shows the merging process of two overlapping windows for approximation.

Let S be a piecewise planar surface defined by the triangle mesh. We define the geodesic distance function as D : S → R. In an earlier stage, we extracted the vertices of the seam triangles, V_s ⊂ S, as the source vertices. For any point p ∈ S, the algorithm returns the length D(p) of the geodesic path from p back to the closest seam vertex v ∈ V_s. We iteratively diffuse across the triangles from the seams toward the non-occluded triangles. As illustrated in Figure 6.5, for each edge e we maintain a small number of windows w(e), each consisting of a pair of coordinates (c_l, c_r) (counterclockwise), the corresponding geodesic distances (d_l, d_r) to the closest pseudosource s, the direction of the geodesic path τ, and the geodesic length σ = D(s). The position of s can be calculated by intersecting two circles. As suggested by [6], when propagating a window w_1(e) onto an edge with an existing window w_0(e), we try to merge the two windows, w′ ← w_0(e) ∪ w_1(e), if the directions τ_0 and τ_1 agree with each other and the estimated geodesic lengths are within a bounded error: |D(w_0) − D(w_1)| < ε. In order to achieve interactive rates for rendering, we march at most k = 15 triangles from the seams in K = 20 iterations. In this propagation process, we maintain two windows per edge and discard the rest. We chose the parameter k