ABSTRACT

Title of dissertation: NEW ALGORITHMIC TECHNIQUES FOR LARGE SCALE VOLUMETRIC DATA VISUALIZATION ON PARALLEL ARCHITECTURES
QIN WANG, Doctor of Philosophy, 2008
Dissertation directed by: Professor Joseph JaJa, Department of Electrical and Computer Engineering

Volume visualization is widely used as an effective approach for the visual exploration, computational analysis, and manipulation of volumetric datasets. Due to the dramatic advances in imaging instruments and computing technologies, such datasets are now appearing at a very fast rate with increasingly larger sizes in many engineering, science and medical applications. Isosurface rendering and direct volume rendering (DVR) are two of the most widely used techniques to render such datasets. This dissertation introduces novel techniques for rendering isosurfaces and volumes, and extends these techniques to multiprocessor architectures. We first focus on cluster-based techniques for isosurface extraction and rendering using polygonal approximation. We present a new simple indexing scheme and data layout approach, which enable scalable and efficient isosurface generation. This algorithm is the first known parallel algorithm to achieve provable load balancing on multiprocessor systems. We also develop an algorithm to generate isosurfaces using ray-casting on multi-core processors. Our method is based on a hybrid strategy that begins with an object order traversal of the data followed by ray-casting on ordered sets of an adaptive number of subcubes, one set for each small group of pixels on the image. We develop a multithreaded implementation, which uses new dynamic load balancing techniques that start with an image partitioning for the initial stage and then perform dynamic allocation of groups of ray-casting tasks among the different threads. The strategy ensures almost equal loads among the cores while maintaining spatial data locality. This scheme is extended to perform direct volume rendering and is shown to achieve similar improvements in terms of overall performance, load balancing, and scalability. We conduct a large number of tests for all our algorithms on the University of Maryland Visualization Cluster and on the 8-core Clovertown platform using a wide variety of datasets such as the Richtmyer-Meshkov Instability dataset (7.5GB for each time step) and the Visible Human dataset (about 1GB). We obtain results that consistently validate the efficiency and the scalability of our algorithms. In particular, the overall performance of our hybrid ray-casting scheme achieves an interactive rendering rate on high resolution (1024 x 1024) screens for all the datasets tested.

NEW ALGORITHMIC TECHNIQUES FOR LARGE SCALE VOLUMETRIC DATA VISUALIZATION ON PARALLEL ARCHITECTURES

by Qin Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2008

Advisory Committee: Dr. Joseph JaJa, Chair/Advisor; Dr. Rama Chellappa; Dr. Manoj Franklin; Dr. David Mount; Dr. Amitabh Varshney; Dr. Donald Yeung

© Copyright by Qin Wang 2008

Acknowledgments

I would like to thank my advisor, Professor Joseph JaJa, for giving me an invaluable opportunity to work on challenging and interesting projects throughout my Ph.D. study. His expertise and research attitude have inspired me throughout the course of this work. Without his support and advice, I could not have completed this program and learned so much.
I would also like to thank Professor Amitabh Varshney and his group in Com- pute Science department for the valuable research collaboration. Professor Varshney always made himself available for the discussion and his thoughtful opinion has been very helpful on the projects. My gratitude goes to Fritz McCall and UMIACS help desk staff for their prompt technical assistance in the system setup and configuration. I am also grateful to my committee members for their serving on the board and their expertise to examine my research work. I am indebted to my family for their endless supports and patience of many years. ii Table of Contents List of Tables vi List of Figures viii List of Abbreviations ix 1 Introduction 1 1.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Volume Visualization 8 2.1 Volumetric Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Basic Visualization Techniques . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Isosurface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Polygonal Approximation . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Pixel-based Ray-casting . . . . . . . . . . . . . . . . . . . . . 12 2.2.4 Direct Volume Rendering . . . . . . . . . . . . . . . . . . . . . 14 2.2.5 Transfer Function . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.1 Metacell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2 Hierarchical Structures . . . . . . . . . . . . . . . . . . . . . . 17 2.3.3 Empty Space Leaping . . . . . . . . . . . . . . . . . . . . . . 19 2.3.4 Early Ray Termination . . . . . . . . . . . . . . . . . . . . . . 20 2.3.5 Packet-of-Rays . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 Previous Work 22 3.1 Isosurface Extraction and Rendering . . . . . . . . . . . . . . . . . . 22 3.2 Isosurface Ray-Casting . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Direct Volume Rendering . . . . . . . . . . . . . . . . . . . . . . . . . 28 4 Compact Interval Tree for Isosurface Extraction and Rendering 32 4.1 Isosurface Extraction Computational Model . . . . . . . . . . . . . . 33 4.2 Compact Interval Tree Indexing Scheme . . . . . . . . . . . . . . . . 34 4.2.1 Standard Interval Tree . . . . . . . . . . . . . . . . . . . . . . 34 4.2.2 Our Indexing Scheme . . . . . . . . . . . . . . . . . . . . . . . 35 4.2.3 Space Consumption Analysis . . . . . . . . . . . . . . . . . . . 38 4.3 Efficient Isosurface Extraction Method . . . . . . . . . . . . . . . . . 39 4.4 Parallel Processing Scheme . . . . . . . . . . . . . . . . . . . . . . . . 40 4.4.1 Span Space Partitioning . . . . . . . . . . . . . . . . . . . . . 40 4.4.2 Provable Load Balancing . . . . . . . . . . . . . . . . . . . . . 41 4.4.3 Sort-last Composition . . . . . . . . . . . . . . . . . . . . . . 43 4.5 Extension to Time Varying Data . . . . . . . . . . . . . . . . . . . . 44 iii 4.6 UM Visualization Cluster . . . . . . . . . . . . . . . . . . . . . . . . 44 4.7 Experiments and Performance Analysis . . . . . . . . . . . . . . . . . 46 4.7.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.7.2 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.7.3 Overall Performance and Scalability . . . . . . . . . . . . . . . 
48 4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5 Hybrid Ray-casting for Isosurface Rendering 55 5.1 Features of Existing Ray-casting Algorithms . . . . . . . . . . . . . . 56 5.2 Basic Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.2.1 Compact BONO-Tree Construction . . . . . . . . . . . . . . . 58 5.2.2 Phase I: Bounded Block List Generation . . . . . . . . . . . . 59 5.2.3 Phase II: Spatial-Aware Ray-Casting . . . . . . . . . . . . . . 60 5.3 Improvements on Basic Strategy . . . . . . . . . . . . . . . . . . . . . 61 5.3.1 Extension to Packets of Rays . . . . . . . . . . . . . . . . . . 61 5.3.2 Adaptive Block Size . . . . . . . . . . . . . . . . . . . . . . . 62 5.4 Parallel Processing Scheme . . . . . . . . . . . . . . . . . . . . . . . . 64 5.4.1 Block List Generation . . . . . . . . . . . . . . . . . . . . . . 65 5.4.2 Z-Order Screen Partitioning . . . . . . . . . . . . . . . . . . . 65 5.4.3 Dynamic Task Scheduling of Ray-Casting . . . . . . . . . . . . 66 5.5 The Clovertown Multi-Core Processors . . . . . . . . . . . . . . . . . 67 5.6 Experiments and Performance Analysis . . . . . . . . . . . . . . . . . 68 5.6.1 Datasets Used and Space Consumption . . . . . . . . . . . . . 70 5.6.2 Performance Implication on List Upper Bound . . . . . . . . . 72 5.6.3 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 76 5.6.4 Performance Comparison . . . . . . . . . . . . . . . . . . . . . 79 5.6.5 Cache Performance . . . . . . . . . . . . . . . . . . . . . . . . 80 5.6.6 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6 Extension of Hybrid Scheme to Direct Volume Rendering(DVR) 85 6.1 Extended Hybrid Ray-casting Scheme . . . . . . . . . . . . . . . . . . 86 6.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.1.2 Block Discrimination Method . . . . . . . . . . . . . . . . . . 88 6.1.3 Sequential Algorithm . . . . . . . . . . . . . . . . . . . . . . . 90 6.1.4 Parallel Processing . . . . . . . . . . . . . . . . . . . . . . . . 91 6.2 Experiments and Performance Analysis . . . . . . . . . . . . . . . . . 92 6.2.1 Datasets Used and Space Consumption . . . . . . . . . . . . . 93 6.2.2 Performance Implication on List Upper Bound . . . . . . . . . 94 6.2.3 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 99 6.2.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 iv 7 Conclusion 103 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.2 Further Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 A Performance of Our Isosurface Extraction Algorithm on the Cluster 109 B Rendered Isosurface and DVR Images by Ray-casting for RMI Datasets 112 Bibliography 116 v List of Tables 4.1 Size comparison between standard and compact interval trees . . . . 39 4.2 Measured load imbalance of one and two-byte scalar field datasets . . 48 4.3 Measured load imbalance of four-byte scalar field datasets . . . . . . 49 4.4 Performance summary of our algorithm on a single node . . . . . . . 50 4.5 Overall performance of four processors for time-varying datasets . . . 53 5.1 List generation time on single-core over upper bound k . . . . . . . . 62 5.2 Parameters of various datasets used . . . . . . . . . . . . . . . . . . . 
72 5.3 Size of our indexing structure and preprocessing time . . . . . . . . . 72 5.4 Performance of our algorithm on the Clovertown . . . . . . . . . . . . 77 5.5 Comparison over ray traversal steps . . . . . . . . . . . . . . . . . . . 78 5.6 Performance comparison on the Clovertown . . . . . . . . . . . . . . 78 5.7 Performance benchmark comparison between Opteron and Xeon . . . 79 5.8 Performance comparison between our algorithm and others . . . . . . 80 5.9 Comparison of cache misses between our algorithm and others . . . . 81 5.10 Average frame rate of our algorithm on the Clovertown . . . . . . . . 82 5.11 Measured work load balance on the two Phases . . . . . . . . . . . . 83 6.1 Size of our indexing structure and preprocessing time . . . . . . . . . 95 6.2 Execution time and composition steps of our DVR algorithm . . . . . 98 6.3 Comparison over composition and ray traversal steps . . . . . . . . . 99 6.4 DVR performance comparison between our algorithm and others . . . 100 6.5 Average frame rate of our DVR algorithm on the Clovertown . . . . . 100 6.6 Measured DVR work load balance on the two Phases . . . . . . . . . 102 vi A.1 Performance of our isosurface algorithm with two processors . . . . . 109 A.2 Performance of our isosurface algorithm with four processors . . . . . 110 A.3 Performance of our isosurface algorithm with eight processors . . . . 110 A.4 Performance of our isosurface algorithm with sixteen processors . . . 111 vii List of Figures 2.1 Illustrative 3D grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Span space for active cells . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Different triangulation cases among cubic cell . . . . . . . . . . . . . 12 2.4 Illustrative recursive ray traversal in 2D data space . . . . . . . . . . 13 2.5 DVR pipeline processing . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 Illustrative recursive ray traversal . . . . . . . . . . . . . . . . . . . . 19 4.1 Our span space partitioning scheme . . . . . . . . . . . . . . . . . . . 36 4.2 Compact interval tree structure . . . . . . . . . . . . . . . . . . . . . 37 4.3 Span space partitioning for parallelism . . . . . . . . . . . . . . . . . 41 4.4 The visualization cluster architectural diagram . . . . . . . . . . . . . 45 4.5 Overall execution time on the cluster . . . . . . . . . . . . . . . . . . 51 4.6 Speedups of our algorithm . . . . . . . . . . . . . . . . . . . . . . . . 52 5.1 Execution time of PhaseI and PhaseII on a single-core . . . . . . . . . 63 5.2 Intel Quad-Core Processor Architectural Diagram . . . . . . . . . . . 68 5.3 The six datasets used in our experiments . . . . . . . . . . . . . . . . 71 5.4 Execution times of our algorithm over upper bound k . . . . . . . . . 74 5.5 Percentage of pixels left to shade after the block lists . . . . . . . . . 75 5.6 Analysis of the percentage of rays having no intersection . . . . . . . 76 6.1 Illustrative diagram for 2D-dimensional discriminate function . . . . . 89 6.2 The six datasets used in our DVR tests . . . . . . . . . . . . . . . . . 94 6.3 The 2D histograms of the six datasets used in our DVR tests . . . . . 95 6.4 Execution times of our DVR algorithm over upper bound k . . . . . . 96 viii 6.5 Percentage of composition steps after the list and ESL efficiency . . . 97 B.1 Isosurface image I of RMI dataset . . . . . . . . . . . . . . . . . . . . 112 B.2 Isosurface image II of RMI dataset . . . . . . . . . . . . . . . . . . . 113 B.3 DVR image I of RMI dataset . . . . . . . . . . . . . . 
. . . . . . . . 114 B.4 DVR image II of RMI dataset . . . . . . . . . . . . . . . . . . . . . . 115 ix List of Abbreviations AMC Active MetaCell BBIO Binary-Blocked I/O BONO Branch on Need Octree CPU Central Processing Unit CT Computer Tomography DVR Direct Volume Rendering ESL Empty Space Leap GPU Graphics Processing Unit LLNL Lawrence Livermore National Laboratory LOD Level of Detail MIMD Multiple Instruction Multiple Data MRI Magnetic Resonance Imaging NIH National Institutes of Health NUMA Non-Uniform Memory Access RMI Richtmyer-Meshkov Instability SIMD Single Instruction Multiple Data TB Tera-Byte x Chapter 1 Introduction Visualization is a process that creates images to convey salient information about the underlying data. Because of the extraordinary capability of the human visual system to interpret images quickly and effectively, visualization is a very important method for exploring and analyzing the features, patterns and relations embedded in the data. Nowadays, with the fast evolving computing technology, visualization is playing an ever-expanding role in applications in science, engineering, medicine, multimedia, finance, etc. In particular, scientific visualization is now an integral component of scientific computation and simulation, which aims to create a visual representation of complex scientific data to provide insights underlying the physical phenomena. Such visual abstraction and reduction of underlying data into images is in general a very computationally demanding process. 1.1 Challenges During the past three decades, there has been unprecedented growth in com- putational and acquisition technologies, which resulted in the increasing capability of investigating the physical world in detail with high precision as well as modeling and simulating very complex phenomena. It is not uncommon that hours, or even days of simulation running on high performance computers produce vast amount 1 of data ranging from hundreds of giga-bytes to tens of tera-bytes. As an exam- ple of such a dataset, consider the process of the Richtmyer-Meshkov instability in inertial confinement fusion and supernovae from the ASCI team at the Lawrence Livermore National Labs(LLNL). This dataset represents a simulation in which two gases, initially separated by a membrane, are pushed against a wire mesh. These are then perturbed with a superposition of long wavelength and short wavelength disturbances and a strong shock wave. The simulation took 9 days on 960 CPUs and produced about 2.1 TB of simulation data. The data shows the characteris- tic development of bubbles and spikes and their subsequent merger and break-up over 272 time steps, each consisting of 20482 ?1920 volume of one-byte scalar field. Another example is the popular medical dataset coming from the Visible Human Project established by the National Institute of Health(NIH). The generated dataset contains 1,871 cross-section CT and anatomical images in the size of 15 GB for the male subject while 5,189 images of 39 GB for the female. In the past decade, a great number of efforts were made to visually explore and analyze such large scale datasets interactively with mixed results. Since the visualization of large scale data demands a great deal of computa- tion and irregular communication, previously only supercomputers equipped with proprietary graphic system, such as SGI Reality Monster with 128 processors, were able to support the visual exploration on large datasets. 
However, such supercomputers can only be afforded by a limited number of researchers. Since the early 1990s, driven by the availability of high speed networks and the low prices of PCs, clustering has become a cost-effective platform to provide massive computing power. Visualization systems built upon clusters of off-the-shelf computers equipped with GPU cards have since been exploited to visualize 3D datasets. More recently, the newly emerging multi-core processor technology, which places multiple CPUs on a single chip, is attracting considerable attention from researchers as a platform for visualization. Although a number of visualization systems and techniques have been developed on these parallel platforms, they are usually targeted at datasets that are several orders of magnitude smaller, and do not scale to gigabyte or terabyte-sized datasets. Hence, visual exploration and interaction with large scale datasets requires the development of a high-performance visualization software infrastructure running on affordable parallel platforms that can handle the growing size of datasets, extract the features of interest efficiently, and render them interactively on high resolution screens.

1.2 Contributions

This dissertation considers the problem of visualizing large scale datasets using isosurfaces and Direct Volume Rendering (DVR) technologies on two different parallel architectures, namely multiprocessor clusters and multi-core processors. We have developed three new algorithms that offer significant improvements over prior schemes. Part of the work in this dissertation appears in [24] and [69].

Our first major contribution concerns the extraction of a polygonal approximation of isosurfaces. A new compact indexing structure and a new data partitioning scheme are developed for out-of-core isosurface extraction and rendering of large scale data on a multiprocessor environment. This scheme results in the following contributions:

- The algorithm uses a smaller indexing structure and a more effective bulk data movement than the best known previous algorithms while achieving similar asymptotic bounds. In particular, the size of our indexing structure is shown to be orders of magnitude smaller than that of the interval tree for a number of well-known datasets.

- Our scheme can be implemented on a distributed storage multiprocessor environment such that the data distribution across the local disks of the different processors results in a provably balanced workload irrespective of the isovalue. Moreover, the total amount of work across the different processors is about the same as that required by the serial version of the algorithm.

- The experimental results show that isosurfaces can be generated and rendered at the rate of 3.5 to 4.0 million triangles per second on the Richtmyer-Meshkov dataset using our algorithm on a single processor. On a 16-node cluster, we achieve scalable performance across widely different isovalues with a performance of up to 60 million triangles per second. The experimental results also show that our algorithm achieves excellent load balancing for a variety of datasets over a wide range of isovalues.

Our second major contribution is the development of a novel hybrid strategy for rendering isosurfaces using ray-casting on a multi-core processor.
The proposed method is based on a hybrid strategy that begins with an object order traversal of the data followed by ray-casting on ordered sets of an adaptive number of subcubes. Compared with existing ray-casting approaches, our scheme achieves the following:

- A short list of possible candidate data blocks is generated for each small set of contiguous pixels through the traversal of a BONO tree (Branch On Need Octree), built upon a coarse version of the volumetric data. Such a process can be performed extremely fast because the BONO tree is very compact and the imposed upper bound on the data block list restricts the traversal significantly. This enables us to identify almost all the pixels with rays not intersecting the isosurface through the BONO tree traversal, and the percentage of non-intersecting rays cast is extremely small.

- We cast rays that traverse through a limited number of blocks in a front-to-back order, and skip a substantial fraction of irrelevant portions of the volumetric data up front.

- Nearby pixels will likely have a number of common blocks on their lists and hence spatial locality of pixels can be exploited to achieve high performance caching. That is, processing nearby pixels can make effective use of caching since their corresponding lists are short and are likely to share blocks.

- The multi-threaded implementation on multi-core processors, using new dynamic load balancing techniques, ensures almost equal loads among the cores while maintaining spatial data locality.

- The extensive testing of our algorithm on a variety of datasets, of widely different complexities, indicates that the scheme can easily achieve interactive high resolution rendering of isosurfaces of large scale volumetric scalar data on emerging multi-core processors.

Our third major contribution is to extend the hybrid ray-casting scheme to accelerate direct volume rendering on large scale datasets as well. Our extended scheme achieves the following:

- The BONO tree is augmented with extra dimensional information to accommodate the specification and application of interactive 2D transfer functions at runtime.

- A novel block discrimination method is proposed to classify the blocks into various types and directly support empty space leaping in an effective way in the case of direct volume rendering.

- The multi-threaded implementation demonstrates a high degree of scalability, excellent load balancing, and effective memory management on multi-core processors.

- Extensive experimental tests show significantly superior performance over previously published algorithms for direct volume rendering by ray-casting, and result in interactive rates for very large datasets such as the Lawrence Livermore instability dataset on high-resolution screens.

1.3 Outline

The rest of this dissertation is organized as follows. Chapter 2 introduces existing visualization technologies as well as some related acceleration techniques. Chapter 3 reports on the previous algorithms and methods used to render isosurfaces and DVR, mainly focusing on multiprocessor environments and multi-core processors. Chapter 4 presents a new simple indexing scheme and data layout approach, which enables scalable and efficient isosurface generation, resulting in the first known parallel algorithm to achieve provable load balancing on multiprocessor cluster systems. The novel hybrid scheme that improves isosurface ray-casting is presented in Chapter 5 and its extension to direct volume rendering is described in Chapter 6.
We conclude in Chapter 7.

Chapter 2 Volume Visualization

Volume visualization is used to create images from scalar and vector datasets defined typically on 3D or 4D space. It involves a process of transforming a multidimensional dataset onto a 2D image plane to gain insight into the salient features and the structure contained within the data. The transformation takes place by sampling the 3D data, interpolating the data values, classifying the optical properties of the material, and compositing the visual effects to produce 2D images. There are two widely used methods for volume visualization, isosurfaces and Direct Volume Rendering (DVR). The isosurface method computes and renders surfaces with constant density value while the DVR method can more generally reveal the underlying translucent structure and the features of the data. Before discussing various techniques regarding these two methods, we first introduce the data representation of the underlying datasets.

2.1 Volumetric Data

A volumetric dataset consists of a collection of values obtained by sampling a function at various locations in a 3D space, which can be in structured or unstructured format. A structured dataset is associated with a certain form of regularity in the sampling positions, while positions can be arbitrary in an unstructured dataset. This thesis primarily studies methods that are applicable to structured datasets, which dominate many scientific domains. Typically, instrument measurements and computer simulation are the two main sources of such datasets. For example, in the Visible Human Project, Computer Tomography (CT), Magnetic Resonance Imaging (MRI), and anatomical images are obtained via the experimental measurement of male and female cadavers. These 2D images are stacked together to form a 3D data grid with each pixel representing a grid point. On the other hand, the RMI dataset from LLNL is generated directly by computer simulation, with one scalar density value at each grid point. An illustrative 3D grid is shown in Figure 2.1, where each point on the grid is called a voxel (analogous to a pixel in a 2D image), and is associated with a scalar value. Alternatively, this grid can be viewed as a collection of cubic cells, each of which contains eight neighboring grid points. The Vmin and Vmax of a cell are defined respectively as the minimum and maximum of the eight scalar values belonging to the cell. Based upon such a volumetric data representation, various techniques have been developed to visualize the dataset using isosurfaces and direct volume rendering.

Figure 2.1: Illustrative 3D grid; each grid point contains a scalar value measured at a discrete sampling location.

2.2 Basic Visualization Techniques

2.2.1 Isosurface

Given a scalar density function F(x) over a 3D geometric space G, x ∈ G, F ∈ V (V specifies a scalar value space), an isosurface S consists of the 3D points x such that F(x) = λ, for some constant isovalue λ. As the density function F(x) is only known through a set of sample values fi at discrete sampling locations over the 3D space, i.e., F = {fi}, fi ∈ V, in practice an isosurface can be approximated through interpolated values over the set {fi}. For the volumetric data in the 3D grid representation, the eight scalar values of a cell are referenced to interpolate points on the isosurface located inside each cell. Two different approaches are adopted to carry out the interpolation process, one by polygonal approximation, the other by ray-casting.
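Both approaches reduce to interpolation within a single cubic cell. As a concrete illustration only (the dissertation does not commit to a particular notation here), the standard trilinear interpolant over a cell with corner values f_{ijk}, i, j, k ∈ {0,1}, and local coordinates (u, v, w) ∈ [0,1]^3 is

    F(u,v,w) = \sum_{i,j,k \in \{0,1\}} f_{ijk}\, u^i (1-u)^{1-i}\, v^j (1-v)^{1-j}\, w^k (1-w)^{1-k},

and the portion of the isosurface inside the cell is the level set {(u,v,w) : F(u,v,w) = λ}.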
2.2.2 Polygonal Approximation

A popular method for computing an isosurface is to approximate the surface with a polygonal mesh and then render the mesh using the usual graphics pipeline. The isosurface inside each cubic cell of a volumetric dataset can be approximated by a set of polygonal geometries such as triangles as follows. The process begins by examining the isovalue λ against the Vmin and Vmax of the cell. If the condition Vmin ≤ λ < Vmax is satisfied (such as for the cells located in the rectangular shaded region of the span space in Figure 2.2), the cell intersects the isosurface and is called active. An intersecting position is determined along each edge by linear interpolation based upon the scalar values of the two end voxels and the isovalue λ. After these intersecting positions are obtained from the edges, a triangle mesh can be composed to approximate the isosurface inside the cell. There are in total at most 256 intersection configurations. In practice, 15 cases are sufficient since the other cases can be generated by symmetry. These are represented in Figure 2.3.

Figure 2.2: Span space for active cells.

Figure 2.3: The 15 unique triangulation cases for a cubic cell.

This type of geometric approximation is also called triangulation, and was first proposed by Lorensen in [14]. In this scheme, the isosurface is approximated from each active cell one after another in a brute force manner. As a result, a great number of triangles are generated for a large scale dataset, which can amount to a massive computation. Many acceleration techniques have been proposed to efficiently locate the active cells, some of which are introduced in Section 2.3.

2.2.3 Pixel-based Ray-casting

The triangulation process generates a geometric approximation of isosurfaces, whose rendering may produce visual artifacts. Moreover, this process does not take into consideration the position of the viewpoint and hence may waste a great deal of computation on triangles that may not be visible from the viewpoints of interest. Another technique is to use ray-casting. For each pixel on the screen, a ray is cast along the viewing direction across the data grid. When the ray encounters an active cell, a trilinear interpolation process takes place within the cell to locate the intersecting position where the density value is equal to λ [36]. The ray goes through the cells following the forward viewing direction until such an intersecting position is found. In that case, shading is performed for the corresponding pixel on the screen relative to the lighting and the selected shading model. The ray-casting process is inherently a view-dependent approach for isosurface rendering since it only locates and shades the first intersection voxel and skips all the data behind. Also, each intersecting position along the ray is accurately computed, so the visual quality of a ray-cast isosurface can be very high. Unlike the polygonal approximation, in which the performance is directly related to the number of extracted triangles, the pixel-by-pixel approach of ray-casting makes the rendering speed more sensitive to the screen size rather than to the size of the dataset and the complexity of the isosurface. An illustrative ray-casting diagram in 2D space is shown in Figure 2.4, with shaded squares specifying the active cells and dashed lines indicating the part of the isosurface invisible from the viewpoint.

Figure 2.4: Illustrative recursive ray traversal in 2D data space.
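Both the polygonal approximation and per-cell ray-casting rest on two small per-cell primitives: the active-cell test and the interpolation of an intersection along a cell edge. The following is a minimal sketch of these primitives (the names and the small main program are illustrative only, not code from this dissertation); the full triangulation would additionally look up one of the 15 cases of Figure 2.3 from a precomputed edge table, which is omitted here.

    #include <algorithm>
    #include <array>
    #include <cstdio>

    // Active-cell test of Section 2.2.2: the cell is active when Vmin <= lambda < Vmax.
    bool cellIsActive(const std::array<float, 8>& corner, float lambda) {
        auto [vmin, vmax] = std::minmax_element(corner.begin(), corner.end());
        return *vmin <= lambda && lambda < *vmax;
    }

    // Parametric position (in [0,1]) of the isosurface crossing along one cell edge
    // whose endpoint scalar values are f0 and f1 (assumes f0 != f1 on a crossed edge).
    float edgeCrossing(float f0, float f1, float lambda) {
        return (lambda - f0) / (f1 - f0);
    }

    int main() {
        std::array<float, 8> corner = {0.1f, 0.4f, 0.9f, 0.3f, 0.2f, 0.7f, 0.5f, 0.6f};
        float lambda = 0.45f;
        if (cellIsActive(corner, lambda))
            std::printf("active cell; crossing at t = %.3f along edge (1,2)\n",
                        edgeCrossing(corner[1], corner[2], lambda));
        return 0;
    }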
2.2.4 Direct Volume Rendering

Although isosurfaces play a very important role in volume visualization, they have several drawbacks. First, triangle meshes are only an approximate representation of a surface; second, much of the information gets lost by extracting only surfaces from the original 3D dataset since a surface is just a 2D geometric representation; third, amorphous phenomena, such as clouds, gas, soft tissue, etc., cannot be adequately represented by means of surfaces. Therefore, it is desirable to generate a volumetric representation that extracts and renders the information directly from the dataset without going through an intermediate geometric approximation.

Direct volume rendering by ray-casting involves shooting a ray through each pixel on the screen along the viewing direction and directly computing the opacity and color of the pixel using data sampled at equally spaced intervals along the ray. Given the low-albedo optical model described in [60], the color on the final image can be computed by the following integration:

    C = \int_0^L C_s \alpha_s \, e^{-\int_0^s \alpha_t \, dt} \, ds    (2.1)

Cs and αs are the material color and opacity per unit length, respectively, at location s. In most cases, this integration has no analytic solution but can be composited approximately in front-to-back order using a recursive Riemann sum [49] as in the formulas below:

    \alpha_{i+1} = \alpha_i + (1 - \alpha_i)\,\alpha_s    (2.2)

    C_{i+1} = C_i + (1 - \alpha_i)\,\alpha_s C_s    (2.3)

where \alpha_s = \int_{s_i}^{s_{i+1}} \alpha_t \, dt, and αs and Cs are usually mapped from the properties of the voxels (such as the scalar field value and/or the gradient value) by a user-specified transfer function. The stages in the pipeline process of direct volume rendering are illustrated in Figure 2.5. First, a series of discrete locations are sampled uniformly along each ray, one at a time; next, the scalar value at each location is obtained by applying a filter such as a trilinear interpolator; third, the optical properties of the material, captured by opacity and color, are obtained by classification through a transfer function. There can be an optional shading step to factor in the contribution of the lighting effect. Finally, the composition step accumulates the opacity and colors over the ray in a front-to-back order following the recursive composition equations 2.2 and 2.3. In general, the DVR process is much more computationally intensive than isosurface computation in that the final pixel color is determined by the composition of visual contributions from the data throughout the entire space.

Figure 2.5: DVR pipeline processing.
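To make the composition step concrete, the following is a minimal sketch of the front-to-back compositing loop implied by Equations 2.2 and 2.3, with the sampling, interpolation, and classification stages of Figure 2.5 reduced to a placeholder callable (the names are illustrative, not code from this dissertation). The loop also stops once the accumulated opacity approaches 1, anticipating the early ray termination technique of Section 2.3.4.

    #include <cstdio>
    #include <functional>

    // Opacity and color delivered by the classification (transfer function) stage
    // for one sample location along the ray.
    struct Sample { float alpha, r, g, b; };

    // Front-to-back compositing along one ray, following Equations 2.2 and 2.3.
    // 'sampleAt(i)' stands in for the sampling + interpolation + classification stages.
    void compositeRay(int numSteps, const std::function<Sample(int)>& sampleAt, float out[4]) {
        float alpha = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;
        for (int i = 0; i < numSteps && alpha < 0.99f; ++i) {   // stop near full opacity
            Sample s = sampleAt(i);
            float w = (1.0f - alpha) * s.alpha;                  // (1 - alpha_i) * alpha_s
            r += w * s.r;  g += w * s.g;  b += w * s.b;          // Equation 2.3
            alpha += w;                                          // Equation 2.2
        }
        out[0] = r; out[1] = g; out[2] = b; out[3] = alpha;
    }

    int main() {
        float pixel[4];
        // A toy ray whose samples all return the same semi-transparent orange material.
        compositeRay(64, [](int) { return Sample{0.05f, 1.0f, 0.5f, 0.2f}; }, pixel);
        std::printf("composited pixel: r=%.2f g=%.2f b=%.2f alpha=%.2f\n",
                    pixel[0], pixel[1], pixel[2], pixel[3]);
        return 0;
    }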
2.2.5 Transfer Function

The transfer function, which maps the material properties of a voxel onto opacity and color, is critical in capturing and rendering the features and patterns of interest from the underlying dataset. Transfer functions can be one-dimensional, i.e., mapping from the scalar density value to optical properties, or multi-dimensional, using the additional dimensions for gradient and/or curvature, which can improve the capability of distinguishing features based upon their higher order derivatives. The classification via a transfer function is achieved through a table mapping, which is straightforward and efficient for one or two dimensions but much more complex to manipulate for higher dimensions. A transfer function can introduce extra artifacts when the mapping from density value to opacity/color raises the Nyquist frequency of the underlying data. However, increasing the sampling rate would produce significantly more computational work and compromise the rendering speed. One way to suppress such artifacts without increasing the sampling rate is to adopt pre-integrated transfer functions. The method approximates αs through an integration over the density field instead of over the spatial domain as follows:

    \alpha_s = \int_{s_i}^{s_{i+1}} \alpha(t)\,dt = \int_{f_i}^{f_{i+1}} \alpha(f)\,\frac{dt}{df}\,df \approx \frac{\Delta s_i}{\Delta f_i}\,\left(T(f_{i+1}) - T(f_i)\right)    (2.4)

where T(f) = \int_0^f \alpha(f')\,df' is called the pre-integral transfer function, which avoids the artifacts of the piecewise mapping from field values to optical properties.

2.3 Acceleration Techniques

Volume visualization is a computationally demanding task, especially for large scale datasets and high resolution screens. Various techniques have been developed over the past 15 years or so to accelerate the process of isosurface generation and direct volume rendering. The most widely adopted techniques are described next.

2.3.1 Metacell

A metacell is defined as a group of neighboring cells within a spatial bounding box. In particular, a metacell is trivially a subcube in the case of structured grids. The initial concept of the metacell was driven by clustering the cells into the size of one or several contiguous disk blocks to facilitate I/O access. Using metacells can also reduce the data duplication of boundary voxels; otherwise, almost eight times the data size would be expected in a representation based on a collection of individual cells. Data locality is improved by utilizing metacells as well. In addition, another advantage of the metacell is that a hierarchical spatial indexing structure can be constructed upon metacells; the size of this hierarchical structure will then be compact, and hence will improve the memory access performance for large datasets.

2.3.2 Hierarchical Structures

Hierarchical structures have been widely used to avoid processing the parts of the data which do not contribute visually to the final image. Two widely used structures are the octree [23] and the kd-tree [13]. A node of such a hierarchical structure represents a spatially bounded region. An octree recursively divides the spatial domain into eight sub-regions while a kd-tree splits the space recursively into two halves, alternately along each spatial dimension. In extracting isosurfaces by polygonal mesh approximation, each node of the octree or kd-tree is augmented with the minimum and the maximum scalar values (Vmin and Vmax) of the subtree rooted at this node. If the stabbing condition is not satisfied at the node (i.e., the isovalue λ is not between Vmin and Vmax), the whole subtree is removed from further processing. Another important hierarchical structure is the interval tree [2, 28], which enables the extraction of the active cells in an optimal fashion. Basically, each node of an interval tree contains a set of intervals denoted by the values of Vmin and Vmax which sandwich a median value associated with the node. The intervals with both Vmin and Vmax smaller (or larger) than the median value belong to the left (or right) subtree respectively. Based upon their individual median values, the nodes constitute a binary tree which allows us to determine the intervals containing a given isovalue through a binary search process. More discussion of this structure will be given along with the description of our new algorithm in Section 4.2.
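As a minimal illustration (the node layout and names are assumptions made for this sketch, not the structures used later in this dissertation), the min/max augmentation and the stabbing test that prunes whole subtrees can be expressed as follows for an octree; the same test applies to a kd-tree node.

    #include <cstdio>

    // Octree node augmented with the range of scalar values under its subtree.
    struct OctreeNode {
        float vmin, vmax;        // minimum and maximum scalar value in the subtree
        OctreeNode* child[8];    // all null for a leaf
    };

    // Visit only the leaves whose subtrees can intersect the isosurface for isovalue lambda;
    // a subtree failing the stabbing condition Vmin <= lambda <= Vmax is skipped entirely.
    void collectActive(const OctreeNode* node, float lambda, void (*visitLeaf)(const OctreeNode*)) {
        if (!node || lambda < node->vmin || lambda > node->vmax) return;
        bool isLeaf = true;
        for (int i = 0; i < 8; ++i) {
            if (node->child[i]) { isLeaf = false; collectActive(node->child[i], lambda, visitLeaf); }
        }
        if (isLeaf) visitLeaf(node);
    }

    int main() {
        OctreeNode leaf{0.2f, 0.8f, {nullptr}};
        collectActive(&leaf, 0.5f,
                      [](const OctreeNode* n) { std::printf("active leaf [%g, %g]\n", n->vmin, n->vmax); });
        return 0;
    }

For direct volume rendering, the same traversal applies with the stabbing test replaced by a test of whether the node's value range produces any non-zero opacity under the current transfer function, as discussed below for empty space leaping.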
Regarding the ray-casting technology, each ray advances forward along the viewing direction in a front-to-back order using the hierarchical structures. An illustrative 2D case of recursive ray traversal is shown in Figure 2.6. Starting from the root node, each ray intersecting the bounding box of the current node is tested for the intersection with the middle planes perpendicular to the three axes. These intersecting positions located within the bounding box are then sorted according 18 Figure 2.6: Illustrative recursive ray traversal. to their distance to the ray source position. Based upon their ordering, the front- to-back visiting order to the current node can be easily established as well as the recursive traversal of the ray. A hierarchical indexing structure provides a critical support for other acceleration techniques introduced below. 2.3.3 Empty Space Leaping Important features in volumetric datasets are partially captured by isovalues or transfer functions. Parts of a dataset may not produce any visual effect on the final image. In the method of ray-casting, the ray can skip the empty (or transparent) region when traversing across the dataset to save computation and memory access. Using the octree or kd-tree, the visibility of each block corresponding to a node can 19 be quickly determined by examining the isosurface stabbing condition or testing the difference of the pre-integral transfer function value between Vmin and Vmax of the node. If there is no visual contribution from the region represented by this node, the ray is advanced forward to skip the traversal within the subtree under this node. The empty space leaping is particularly effective when the features of interest come from a limited part of the whole dataset. 2.3.4 Early Ray Termination Another important acceleration technology for ray-casting is early ray termi- nation. As to the isosurface rendering, the early ray termination is an inherent part of the technique as the ray stops after intersecting the isosurface. When rays tra- verse across a volume grid in front-to-back order for direct volume rendering, many of the rays may quickly accumulate an opacity close to 1 by Equation 2.2. This means that features of the dataset located in the back are occluded and need not to be processed to produce the final image. Those rays can be terminated before they pass through the entire grid. 2.3.5 Packet-of-Rays The packet-of-rays exploits the CPU SIMD extensions that exist in many current architectures to apply a single operation on four operands at once. In such way a packet of four rays can be traversed as one single unit throughout the volumetric dataset. An extended version of this technology can traverse a beam of 20 rays (not limited to four rays) across the data grid in the hope of achieving significant speedups. But, both techniques suffer from the overhead caused by merging and splitting of rays or the frustum that encloses the ray beam during the traversal process. The relatively high speedups reported are only observed in coherent scenes. 21 Chapter 3 Previous Work This chapter will present an overview of a number of algorithms and tech- niques directly related to this dissertation. These methods are divided into three categories: isosurface extraction and rendering, isosurface ray-casting, and direct volume rendering. The review begins by describing the sequential algorithms but the focus will be more on parallel schemes developed for multiprocessor systems. 
The contributions made by these schemes are summarized along with their limita- tions especially when dealing with the challenges imposed by large scale datasets. The discussion aims to introduce the great efforts many researchers have devoted to this field and at the same time to provide sufficient context to understand the contributions of our work. 3.1 Isosurface Extraction and Rendering The Marching Cubes(MC) algorithm proposed in [14] scans the entire cell set, one cell at a time, and determines whether a cell is cut by the isosurface, and in the affirmative generates a local triangulation of the isosurface intersection with the cell. The overall triangular mesh is then rendered. Many improvements to this initial algorithm have been reported in the literature. Some algorithms attempt to reduce the number of examined cells by using a spatial data structure such as 22 the octree [22, 23], while others partition the range of the scalar field values and construct an index on the span space [20]. A theoretically optimal algorithm was described in [8], and involves the construction of an interval tree on the scalar field intervals defined by the cells. Such a data structure enables the exploration of only the active cells (cells that intersect the isosurface) and hence it is output sensitive. This algorithm was later generalized into a theoretically optimal out- of-core isosurface extraction strategies in [4, 5]. For multiprocessor environments, several parallel algorithms have been reported for the case when the data can fit in the main memory [9, 10, 16, 18, 19, 20]. As to larger scale datasets, parallel out-of-core algorithms such as those in [3, 7, 21, 25, 26, 27] have been reported. For a multiprocessor cluster environment in which each processor has its own main memory and local disk, the processors communicate and exchange data through an interconnection network using message passing. Critical factors that in- fluence the performance include the amount of work required to generate the index and organize the data on the different disks (preprocessing step); the relative compu- tational loads of the different processors corresponding to an arbitrary isovalue; the performance of the rendering and rasterization into a single display; and the total amount of work relative to the (best) sequential algorithm. The preprocessing step described in [5, 6, 7] involves partitioning the dataset into metacells and building a B-tree like interval tree, called Binary-Blocked I/O interval tree (BBIO tree). The computational cost of this step is asymptotically similar to an external sort, which is likely to be too expensive in practice. The isosurface generation requires that a host traverses the BBIO tree to determine the indices of the active metacells, after 23 which jobs are dispatched on demand to the available processors. In addition to the substantial preprocessing overhead, a significant bottleneck with this scheme is the host overhead in coordinating and dispatching jobs, and the unpredictability of the access pattern to the available disks. The algorithm described in [25] attempts to solve the load balancing problem by distributing the data based on a range space partition. The range of possible field values is partitioned into a number of intervals. Blocks are then assigned to triangular matrix entries depending on which intervals a block spans. An external interval tree (BBIO tree) is then built separately for the data on each processor. 
Again this strategy involves a very expensive preprocessing step but in addition there is no guarantee of load balancing among the processors. In fact, it is easy to construct a dataset that will result in extremely unbalanced loads among the processors. Moreover, while the BBIO is ?asymptotically optimal? for I/O performance (in terms of the number of active metacells brought to memory and not necessarily the active cells), it does incur a significant overhead in terms of size and performance. The extracted local surface is then streamed to parallel rendering servers, followed by compositing the outputs of the different frame buffers to a tiled-display. The preprocessing algorithm described in [27] is based on parti- tioning the range of scalar values into equal-sized subranges, creating afterwards a file of subcubes for each subrange. The blocks in each range file are then distributed across the different processors, based on a work estimate of each block. As in [25], the preprocessing is computationally expensive and there is no guarantee that the loads of the different processors will be about the same in general. Insummary, existingparallelout-of-corealgorithmseitherrequireunpredictable 24 data accesses to the disks, which can cause a considerable overhead, or use complex indexing schemes with no guaranteed load balancing. The parallel out-of-core al- gorithm proposed in this thesis work uses a simple indexing scheme with provable load balancing among the processors (regardless of the datasets or the isovalue) and incurs almost no overhead relative to the efficient sequential version of the algorithm. 3.2 Isosurface Ray-Casting Isosurface ray-casting directly generates the isosurface by shooting rays from the viewpoint through the pixels and computing the intersections of these rays with the isosurface. As a special case of ray tracing technique, ray-casting only considers and traces the primary ray. Nevertheless, ray-casting delivers most effectively the visual information of isosurfaces and is the most time-consuming component of ray tracing. This method, coupled with various techniques for improving visual effects such as shading, reflection, and global illumination, can generate extremely high quality visualization of isosurfaces. However, the method is computationally demanding, especially for high resolution screens, since it is pixel-by-pixel approach and hence its complexity depends on the number of rays and the size of the dataset as well as on the scheme used for ray traversal and for computing and shading the intersection voxels of the rays with the isosurface. Note that in general the memory access is relatively expensive as the voxels are not processed in the same order as the data layout, and cache performance can be poor since the cells intersected with cast rays are not easily predictable. 25 Given the high computational requirements of isosurface ray tracing, Parker et. al. [19] describe an implementation on the SGI Reality Monster, which is a shared- memory multiprocessor with up to 128 processors, and show interactive isosurface rendering of the 1GB Visible Woman dataset. The screen size used is 5122. Their algorithm uses a simple multi-level spatial hierarchy with a 3D tiling of the input data to improve cache performance. More recently, Wald et. al. [42] describe an implementation using a combination of a kd-tree and coherent ray tracing that exploits the SIMD extensions that are available on many of the current multi-core processors. 
A kd-tree is a binary tree that represents a spatial partitioning of the volume data. Each node, except for the leaves, represents a splitting plane that is closest to the center of the largest dimension. Each node contains the minimum and the maximum of the densities contained within the subtree. This structure is very similar to the octree as described in [23, 33], except that the authors claim that the kd-tree more easily enables a simple handling of packets of rays used in coherent ray tracing. Coherent ray tracing traverses packet of rays through the kd-tree in order to make effective use of the SIMD extensions. However this comes at the cost of creating an indexing structure that is at least twice as large as the input data size. For example, their most compact kd-tree representation of the 8GB LLNL dataset is of size 18GB. The authors illustrate the performance of their scheme on a single and on a 5-node cluster of dual-1.8 GHz AMD Opteron, with a default screen size of 5122. A more compact kd-tree is introduced in [30] and used for isosurface ray tracing on two-processor platform, each is a dual-core 2.6 GHz AMD Opteron with 16GB RAM. They show interactive rate isosurface rendering for a variety of 26 datasets of sizes up to 8GB but using half the memory used in [42]. The multi-layer ray tracing algorithm (MLRTA) reported in [39] exploits spatial coherence in image space to speedup the rendering of geometric objects of a few million triangles. The speedups achieved by the MLRTA method basically rely on coherent scenes containing geometric objects of large 3D extension to ensure that a wide range of ray beams treated as a single traversal entity can quickly detect deep entry points. However, in order to handle small objects, a repeated tile splitting process is required to subdivide the ray beams gradually, thereby creating a significant overhead and resulting in a traversal that is similar to the octree traversal. Hence, it is questionable whether the method will perform well on complex (and possibly incoherent) isosurfaces in large scale datasets such as LLNL where the concept of large object extension is not applicable. In fact, it has been noticed that the packet ray traversal technique (including the exploitation of SIMD instructions) may perform poorly on incoherent complex scenarios where frequent ray splitting and merging could lead to a worse performance than just using a single-ray [33, 34, 30]. As the most recent implementation of the single-ray scheme, an octree representation that contains the scalar data as well as the range information is used in [33] to generate competitive performance on multi-core processors. Its multi- resolution level of detail (LOD) version that incorporates coherent ray tracing to work around the problem of incoherent scenes appears in [34], resulting in faster rendering of LOD data in some cases. The ray-casting implemented on GPUs also appeared in the past few years with the advent of improved GPU programmability. The GPU-based ray-casting 27 implementation is reported in [43, 32]. However, the GPU-based approach is con- strained by its on-board memory size (usually 512MB) and its fairly strict SIMD programming model. 3.3 Direct Volume Rendering Software techniques developed for direct volume rendering fall primarily into three main categories: Ray-casting [55, 56], Splatting [72] and hybrid strategies such as the Shear-Warp algorithm [53]. 
Although the GPU-based technology currently delivers the fastest rendering of volumetric data due to its massive pipeline stream processing, much improved GPU programmability, and 3-D texture mapping [47, 52, 65, 48, 11], the high rendering frame rate is observed when the processed data is able to fit in the GPU on-board memory. Currently, the limited size (i.e. 512MB ? 1024MB) of on-board memory prevents GPU from achieving interactive volume rendering on large size datasets. Researchers have considered a number of parallel volume rendering schemes for ray-casting [63], splatting [62] and shear-warp [54, 66] using either PC clusters or tightly coupled MIMD supercomputer systems. Although the splatting method avoids sampling over empty regions by projection of kernel functions, its parallel version seems to be inefficient and complicated in the way it utilizes the early termination technique [62]. Shear-warp generally produces low quality images due to its bilinear interpolation with severe artifacts showing up when changing the viewing direction around 45o [53, 54]. Consequently, ray-casting seems to be the preferred technique in terms of its technical simplicity and ability 28 of producing higher quality images. Nevertheless, the ray-casting approach usually involves significantly more computation due to its pixel-by-pixel processing (and hence heavily depends on the screen size), which makes it hard to achieve interactive rendering for moderate size datasets. Refer to [61, 49] for a good overview and a comparison of the various volume rendering techniques. Since the direct volume rendering method by ray-casting was first published by Levoy in [55], many alternate versions, such as those reported in [56, 68, 64, 70, 71], have been introduced. However, empty space leaping requires either a hierarchical structure (i.e. multilevel-grid [64], octree [56], kd-tree [68], etc) which incurs a significant overhead, or a space-consuming per-vertex register structure such as DFB-jumping described in [70] which however can not accommodate interactively changing the transfer function. To improve the efficiency of ray-casting for direct volume rendering, a two-step method is proposed in [41]. The first step consists of projecting the boundary cells onto the image plane using graphics hardware, and the second step applies the octree ray tracing but now slightly more constrained. Some other schemes use quantized voxels to accelerate volume rendering [37], or cache-efficient layouts of bounding volume hierarchies [44] to improve kd-tree access during ray casting for some medium size of datasets. On the other hand, the idea of projecting non-empty cells onto screen was introduced to determine the first and last cells hit by each ray using either GPU hardware [67] or cell-template [71]. But the schemes are inefficient in skipping empty regions between the bounded range. To skip empty space more effectively and reduce the overhead of repeated hierarchical traversal, an object-oriented ray-casting method is proposed in [38], which performs 29 ray-casting right after each cell projection and processes one cell after another in front-to-back octree traversal order. The limitation of this approach is that there is no straightforward way to parallelize this algorithm which involves in-order traversal of octree. The parallelization of the general ray-casting method for direct volume render- ing has been described by many researchers [63, 46, 57, 58, 64, 45]. 
The approaches used can be classified according to three categories: image-space, object-space and hybrid. The image-space parallel scheme is easier to implement due to the indepen- dent pixel-by-pixel approach of ray-casting but makes it quite hard to achieve good load balancing and cache coherence due to the irregular data access pattern and the unpredictable image complexity on the screen with varying transfer function and viewing direction. Nieh presented a parallel scheme for the serial ray-casting algorithm [56] on shared-memory MIMD system in [63] and achieved the scalabil- ity around 70 ? 80% by utilizing dynamic task stealing from other processors at runtime. Other similar implementations [46, 64] yield relatively the same scalability performance. On the other hand, the object-space partition scheme is relatively pop- ular on distributed systems. The basic idea relies on dividing the dataset evenly in a certain way or randomly among the processors in the hope of achieving good load balancing and regular memory access pattern [57, 58, 59]. But, these algorithms require additional merging and re-sorting of the ray segments to produce correct rendering of images with significant synchronization time for the image composition stage. Moreover, the early ray termination condition can not be easily applied un- der object-space partitioning schemes. Using a completely different strategy, hybrid 30 parallel shear-warp algorithms [54, 66] require additional 2D image warping process and the image quality does not in general match those produced by the other al- gorithms. Another hybrid parallel scheme [45] divides full screen into small image tiles and the dataset hierarchically into bricks. The algorithm obtains data bricks that intersect with viewing frustum by repeated octree traversal, which incurs a significant overhead, in order to achieve load balancing with dynamically assigned both image tiles and data bricks among processors. The transfer function, which maps the material properties of the data into opacity and colors, is critical in capturing and rendering the features and patterns of interest. Initially, the transfer function was applied at the preprocessing step as in [56], which does not allow interactive modification and rendering. More recently, researchers have allowed users to specify and modify the transfer function at run time using a friendly interface, and introduced multi-dimensional transfer functions that can be much more effective in capturing complex patterns and features [50, 51]. Furthermore, pre-integral transfer function was introduced in [47] to suppress the artifacts stemming from discrete sampling along the ray. Our scheme supports the interactive construction of multi-dimensional transfer functions while using the pre- integral technique. 31 Chapter 4 Compact Interval Tree for Isosurface Extraction and Rendering In this chapter, we introduce novel techniques to enable efficient isosurfaces extraction for large scale datasets that do not fit in main memory. Specifically, we develop a simple indexing scheme, called compact interval tree, which is orders of magnitude smaller than existing indexing structures, and meanwhile provides more efficient bulk data movement. 
The parallel implementation of this algorithm on a distributed multiprocessor environment introduces a new scheme to distribute the data, which results in a provably load balanced workload among the processors regardless of the dataset or the isovalue, and incurs almost no overhead relative to the efficient sequential version of the algorithm. We make use of the University of Maryland visualization cluster, in which each node consists of a 2-way symmetric multiprocessor with 8GB of main memory, a 60GB local disk, and an NVIDIA GPU (Graphics Processing Unit). The nodes are interconnected via 10 Gbps InfiniBand, with four nodes reserved for compositing the image frame buffer outputs from the processors and displaying the results on a wall-sized screen (multi-projector display). We start by defining the out-of-core computational model used to evaluate our methods for isosurface extraction.

4.1 Isosurface Extraction Computational Model

Due to their electromechanical components, disks have two to three orders of magnitude longer access time than random-access main memory. In order to amortize the access time over a large amount of data, a single disk access reads or writes a block of contiguous data at once, typically of size 4KB or 8KB. We will use the standard model [1] to measure the I/O performance of our algorithms. We denote the input size by N, the disk block size by B, and the size of the main memory by M. In this work, we assume that N is much larger than M, which in turn is much larger than B. The I/O performance of an external memory algorithm is measured by the number of I/O operations, each such operation involving the reading or writing of a single disk block. As a result, scanning the input data contiguously requires O(N/B) I/O operations. Our parallel computation model consists of a number of processors, each with its own local main memory and disk, interconnected through an interconnection network. The processors communicate and exchange data through message passing over the interconnection network. Since we are interested in large scale data, we assume that the input data resides on the disks of the available processors. A preprocessing step involves rearranging the data among the disks with the goal of optimizing the access patterns to the data and distributing the computational load equally among the processors.

4.2 Compact Interval Tree Indexing Scheme

Our algorithm can handle both structured and unstructured grids and makes use of the metacell notion introduced in Section 2.3.1. In general, a metacell consists of a cluster of neighboring cells. All the metacells are about the same size, which is a small multiple of the disk block size. In particular, for the structured grid of the Richtmyer-Meshkov dataset, our metacell consists of a subcube of 8 × 8 × 8 neighboring cells, represented by a list of the scalar values appearing in a predefined order. Our indexing structure and isosurface query algorithm are designed around the concept of the metacell. With each metacell, we associate an interval (vmin, vmax) corresponding respectively to the minimum and maximum values of the scalar field over the metacell. Our compact interval tree structure makes use of the span space concept to organize the data layout. Before introducing this structure, we begin with a brief review of the standard binary interval tree.
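As a concrete illustration of the interval associated with a metacell, the following minimal C++ sketch computes (vmin, vmax) for a one-byte metacell record; the struct layout and the names Metacell and computeRange are illustrative assumptions, not the actual implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative metacell record: an 8x8x8 subcube of cells stored as the
// 9x9x9 grid of vertex scalar values in a predefined order.
struct Metacell {
    uint32_t id = 0;                 // location of the metacell in the volume
    std::vector<uint8_t> scalars;    // 9*9*9 vertex values for one-byte data
    uint8_t vmin = 255, vmax = 0;    // interval associated with the metacell
};

// Compute the (vmin, vmax) interval used by the indexing structure.
inline void computeRange(Metacell& mc) {
    if (mc.scalars.empty()) return;
    auto [lo, hi] = std::minmax_element(mc.scalars.begin(), mc.scalars.end());
    mc.vmin = *lo;
    mc.vmax = *hi;
}
```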
4.2.1 Standard Interval Tree

Given a set of intervals, to build the binary interval tree we store the median of the endpoints of the intervals at the root and assign all the intervals containing that value to the root. We then recursively build the left and right subtrees corresponding respectively to the intervals completely to the left and to the right of the value stored at the root. More specifically, each node of the tree holds a splitting value vm and two secondary lists of the intervals (vmin, vmax) satisfying the condition vmin ≤ vm ≤ vmax, one list in increasing order of vmin values and the second in decreasing order of vmax values. The remaining intervals with vmax < vm are assigned to the left subtree while the intervals with vm < vmin are assigned to the right subtree. The tree is completed when all the intervals are captured and assigned to a node. Because two copies of a list of intervals are stored at each node, the size of the interval tree is at least twice as large as the size of the original set of intervals.

4.2.2 Our Indexing Scheme

Our compact interval tree is similar to the interval tree except that we do not store the two sorted lists of intervals at each node. Instead, we store the distinct values of the vmax endpoints of these intervals, sorted in decreasing order, and associate with each such value a pointer to a list of intervals sorted in increasing order of the left endpoint value vmin. We now explain the compact interval tree in the context of the isosurface problem and its relationship to the metacells generated from the input data. Consider the span space consisting of all possible combinations of the (vmin, vmax) values of the scalar field. With each such pair we associate a list containing the metacells whose minimum scalar field value is vmin and whose maximum scalar field value is vmax. The essence of the scheme for our compact interval tree is illustrated through Figure 4.1, representing the span space, and Figure 4.2, representing the compact interval tree built upon the n distinct values of the endpoints of the intervals corresponding to the metacells. Let vm0 be the median of all the endpoints ranging from v0 to vn. The root of the interval tree corresponds to all the intervals whose vmin values fall in the range [v0, ..., vm0] and whose vmax values fall in the range [vm0, ..., vn].

Figure 4.1: Span space partitioning scheme for our indexing structure.

Such intervals are represented as points in the square of Figure 4.1 whose bottom right corner is located at (vm0, vm0). We group together all the metacells having the same vmax value in this square, and store them consecutively on disk from left to right in increasing order of their vmin values. We refer to this contiguous arrangement of all the metacells having the same vmax value within a square as a brick. The bricks within the square are in turn stored consecutively on disk in decreasing order of their vmax values. The root will contain the value vm0, the number of non-empty bricks in the corresponding square (of the span space), and an index list of the corresponding bricks. This index list consists of at most n/2 entries corresponding to the non-empty bricks, each entry containing three fields: the vmax value of the brick, the smallest vmin value of the metacells in the brick, and a pointer that indicates the start position of the brick on the disk.

Figure 4.2: Compact interval tree structure and the associated metacell lists.
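A minimal sketch of the per-node index list implied by this description is given below; the names and field types (BrickEntry, CompactNode, the use of float values) are assumptions made for illustration only and do not reflect the actual implementation.

```cpp
#include <cstdint>
#include <vector>

// One entry of a node's index list: it describes one brick, i.e. all metacells
// in the node's span-space square that share the same vmax value, stored
// contiguously on disk in increasing order of vmin.
struct BrickEntry {
    float    vmax;        // common maximum value of the brick's metacells
    float    minVmin;     // smallest vmin occurring in the brick
    uint64_t diskOffset;  // start position of the brick on disk
};

// One node of the compact interval tree.
struct CompactNode {
    float splitValue;                  // the median value v_m stored at the node
    std::vector<BrickEntry> bricks;    // non-empty bricks, in decreasing vmax order
    CompactNode* left  = nullptr;      // intervals entirely below splitValue
    CompactNode* right = nullptr;      // intervals entirely above splitValue
};
```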
Each brick contains contiguous metacells in increasing order of vmin values, and each metacell consists of its vmin value, its location information such as the metacell ID, and a list of the scalar field values of the vertices (in a predefined order) within the metacell. We recursively repeat the process for the left and right children of the root. We then obtain two smaller squares whose bottom right corners are located respectively at (vm10, vm10) and (vm11, vm11) in the span space, where vm10 and vm11 are the median values of the endpoints of the intervals associated respectively with the left and right subtrees of the root. In this case, each child will have at most n/4 non-empty index entries associated with its corresponding bricks on the disk. This recursive process is continued until all the intervals are exhausted. At this point we have captured all possible (vmin, vmax) pairs and their associated metacell lists.

4.2.3 Space Consumption Analysis

Note that the size of the standard interval tree is typically much larger than the size of our indexing structure. We can upper bound the size of our compact interval tree as follows. There are at most n/2 index entries at each level of the compact interval tree and the height of the tree is no more than O(log n). Hence our compact interval tree consists of O(n log n) index entries, each entry having three fields. Therefore the total size of our compact interval tree is O(n log n), while the size of the standard interval tree is Θ(N), where N is the total number of intervals and hence can be as large as Θ(n²). In fact, the standard interval tree is always at least twice as large as our indexing structure regardless of the relative values of N and n, since the interval tree stores each interval (vmin, vmax) twice while our indexing structure stores each such interval at most once (the extra space required for pointers to the corresponding data on disk is the same in both cases). In Table 4.1 below we compare the sizes of the two indexing structures for some well-known datasets from LLNL, the Stanford Volume Data Archive, and the IEEE Visualization contest. While the Richtmyer-Meshkov data uses one-byte scalar fields, the MRBrain and CTHead datasets consist of two-byte scalar fields, and the remaining datasets (generated from a hurricane simulation) use floating point scalar fields. As can be seen from the table, our indexing structure is substantially smaller than the standard interval tree, even in the case when N ≈ n.

Table 4.1: Size comparison between standard and compact interval trees. N denotes the number of distinct intervals and n represents the number of distinct vmax values.

  DataSet Name   Scalar Field Size   N            n            Standard Tree   Compact Tree
  RMI (a)        one byte            18,970       227          222.3 KB        6.0 KB
  MRBrain (b)    two bytes           756,982      2,894        11.6 MB         22.6 KB
  CTHead (b)     two bytes           817,642      3,238        12.5 MB         25.3 KB
  Pressure (c)   four bytes          24,507,104   20,748,433   560.9 MB        237.4 MB
  Velocity (c)   four bytes          24,444,597   17,548,131   559.5 MB        200.8 MB

  (a) Richtmyer-Meshkov Instability dataset from http://www.llnl.gov/CASC/asciturb/
  (b) 3D CT images from http://graphics.stanford.edu/data/voldata/
  (c) Simulation data of a hurricane from http://vis.computer.org/vis2004contest/data.html

4.3 Efficient Isosurface Extraction Method

Given a query isovalue λ, consider the unique path from the leaf node labeled with the largest value ≤ λ to the root. Each internal node on this path contains an index list with pointers to some bricks. For each such node, two cases can happen depending on whether λ belongs to the right or left subtree of the node.

Case 1: λ falls within the range covered by the node's right subtree. In this case, the active metacells associated with this node can be retrieved from the disk sequentially, starting from the first brick and stopping when we reach a brick whose vmax value is smaller than λ.

Case 2: λ falls within the range covered by the node's left subtree. The active metacells are those whose vmin values satisfy vmin ≤ λ, from each of the bricks on the index list of the node. These metacells can be retrieved from the disk starting from the first metacell of each brick until a metacell is encountered with vmin > λ. Note that since each entry of the index list contains the smallest vmin of the corresponding brick, no I/O access is performed if the brick contains no active metacells.

Once an active metacell is in memory, any of the several variations of the Marching Cubes algorithm can be used to precisely determine the active cells within the metacell and generate the appropriate triangles defining the isosurface mesh.
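Assuming the BrickEntry/CompactNode layout sketched in Section 4.2.2 above, the query procedure for the two cases can be outlined as follows; the I/O helpers are hypothetical stubs standing in for the actual disk reads, and this is an illustrative sketch rather than the implementation itself.

```cpp
#include <cstdint>
// BrickEntry and CompactNode as in the earlier sketch.

// Hypothetical I/O layer: in the real system these read metacells from disk.
void readWholeBrick(uint64_t /*diskOffset*/) { /* read all metacells of the brick */ }
void readBrickPrefix(uint64_t /*diskOffset*/, float /*lambda*/) { /* read metacells with vmin <= lambda */ }

// Collect the active metacells for isovalue lambda by walking the root-to-leaf
// path described in Section 4.3.
void queryIsosurface(const CompactNode* node, float lambda) {
    while (node != nullptr) {
        if (lambda >= node->splitValue) {
            // Case 1: lambda lies in the right subtree.  Bricks are stored in
            // decreasing vmax order, so every brick with vmax >= lambda is
            // entirely active and is read sequentially from the disk.
            for (const BrickEntry& b : node->bricks) {
                if (b.vmax < lambda) break;
                readWholeBrick(b.diskOffset);
            }
            node = node->right;
        } else {
            // Case 2: lambda lies in the left subtree.  Within each brick the
            // metacells are sorted by increasing vmin, so only the prefix with
            // vmin <= lambda is read; bricks with minVmin > lambda cost no I/O.
            for (const BrickEntry& b : node->bricks) {
                if (b.minVmin <= lambda)
                    readBrickPrefix(b.diskOffset, lambda);
            }
            node = node->left;
        }
    }
}
```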
4.4 Parallel Processing Scheme

Assume that we have p processors, each with its own local disk, and that the processors are interconnected with some type of high-speed interconnection network. We will show how to partition the input data among the p local disks and apply our compact interval tree indexing structure to extract and render isosurfaces in a scalable and efficient way. The main challenge is to ensure load balancing among the processors for any isovalue while maintaining the same total amount of work as that of our sequential algorithm.

4.4.1 Span Space Partitioning

We start by showing how to distribute the N input metacells among the local disks in such a way that the active metacells corresponding to any isovalue are spread almost evenly among the processors. We first sort the N metacells in decreasing order of their vmax values. The N sorted metacells are then divided evenly into √(N/p) sets, each set containing √(pN) consecutive metacells in the sorted order. Then, we re-sort the metacells within each set in increasing order of vmin values. We now stripe all the N metacells, as they appear after the two sorting steps, across the p disks. That is, the first metacell is stored on the disk of the first processor, the second on the disk of the second processor, and so on, wrapping around as necessary. We next prove the following property of our partitioning scheme.

Figure 4.3: Each pair (vmin, vmax) is illustrated as a point and assumed in this case to correspond to a single metacell. For N = 32 and p = 3, the metacells are partitioned into 3 sets separated by dashed lines. The colors black, white, and grey are used to denote the metacells belonging to the first, second, and third processor respectively.
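For concreteness, the two-level sorting and striping step of Section 4.4.1 can be sketched as follows; this is a minimal, illustrative C++ version in which the MetacellRef record and the partitionSpanSpace name are assumptions, and the actual disk layout and I/O are omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct MetacellRef { float vmin, vmax; /* plus metacell ID, disk location, ... */ };

// Distribute N metacells over p processors as described in Section 4.4.1:
// sort by decreasing vmax, cut into ~sqrt(N/p) consecutive sets, re-sort each
// set by increasing vmin, and stripe the result round-robin across the disks.
std::vector<std::vector<MetacellRef>>
partitionSpanSpace(std::vector<MetacellRef> cells, std::size_t p) {
    const std::size_t N = cells.size();
    std::sort(cells.begin(), cells.end(),
              [](const MetacellRef& a, const MetacellRef& b) { return a.vmax > b.vmax; });

    // Each of the ~sqrt(N/p) sets holds ~sqrt(N*p) consecutive metacells.
    const std::size_t setSize =
        std::max<std::size_t>(1, static_cast<std::size_t>(std::sqrt(double(N) * double(p))));
    for (std::size_t begin = 0; begin < N; begin += setSize) {
        const std::size_t end = std::min(begin + setSize, N);
        std::sort(cells.begin() + begin, cells.begin() + end,
                  [](const MetacellRef& a, const MetacellRef& b) { return a.vmin < b.vmin; });
    }

    // Stripe the rearranged metacells across the p local disks.
    std::vector<std::vector<MetacellRef>> perProcessor(p);
    for (std::size_t i = 0; i < N; ++i)
        perProcessor[i % p].push_back(cells[i]);
    return perProcessor;
}
```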
4.4.2 Provable Load Balancing

Lemma: Our metacell partitioning algorithm distributes the N metacells of the initial dataset onto p processors such that each processor receives N/p + O(1) metacells. For any isovalue λ, each processor will hold at most N_λ/p + 2√(N/p) active metacells, where N_λ is the total number of active metacells corresponding to the isovalue λ.

Proof: The first part of the lemma is obvious since the N metacells, after being rearranged by the two sorting steps, are striped across the p disks. Given an isovalue λ, the corresponding active metacells are those represented in the shaded region of the span space shown in Figure 4.3. We now examine how these metacells are distributed among the p processors. After the first sorting step, our partitioning algorithm distributes the N metacells evenly into √(N/p) sets, each of which contributes √(pN) metacells. The second sorting step is performed within each set separately, after which all the metacells are striped across the p disks. Each set whose vmax values are all larger than or equal to λ (the first two sets in Figure 4.3) contributes an equal number of active metacells to each processor, with a difference of at most one per set, since the metacells within each set were sorted in increasing order of vmin. Since there are at most √(N/p) such sets, the total difference in the number of active metacells contributed by these sets is at most √(N/p). We may have an additional set, some of whose vmax values are larger than or equal to λ while the rest are strictly smaller (the last set in Figure 4.3). In this case, the maximum difference in the number of active metacells among all the p processors cannot possibly exceed √(N/p), which is the number of metacells placed at each processor from this set. Therefore the maximum difference in the number of active metacells among all processors is bounded by 2√(N/p), and the proof of the lemma is complete.

A few observations about our partitioning algorithm are in order. The first is that the above upper bound on the number of active metacells per processor is extremely conservative. As we will illustrate later, the bounds achieved in practice across widely different datasets are substantially better than this upper bound. The second observation is that in general we expect N_λ to be of order N^(2/3) for interesting isovalues, in which case 2√(N/p) is asymptotically much smaller than N_λ/p. Third, our scheme is the only scheme, as far as the authors know, for which asymptotic load balancing bounds can actually be established.

4.4.3 Sort-last Composition

Once the metacells are distributed among the p processors, each processor creates a compact interval tree that corresponds to its local dataset. The isosurface query can now be carried out simultaneously by all the p processors using their own local indexing lists. As a result, roughly the same number of triangles are generated by each processor and then rendered locally. The p frame buffers are then composited in a binary-swap fashion [15] using their depth information to create the final output. Except for the very last step, we have provably split the work asymptotically equally among the processors, without increasing the total work relative to the sequential algorithm and without requiring communication between the processors (such a strategy is referred to as a sort-last strategy in the literature). For large scale datasets such as the Richtmyer-Meshkov dataset, whose isosurfaces consist of hundreds of millions of triangles, the compositing step involves the movement of data that is orders of magnitude smaller than the total size of the triangles, and hence can be done extremely quickly given a high-speed interconnection network, as will be illustrated later.
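The per-pixel depth test underlying this compositing step can be sketched as follows; FrameBuffer and compositeDepth are hypothetical names, and the binary-swap exchange pattern and the actual network transfers over InfiniBand are not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One partial rendering result: packed RGBA color and z-buffer per pixel.
struct FrameBuffer {
    std::vector<uint32_t> color;  // packed RGBA
    std::vector<float>    depth;  // z value; smaller means closer to the viewer
};

// Depth-composite a partner's partial image into ours: for opaque isosurface
// geometry the closer fragment simply wins.  In binary-swap compositing each
// processor applies this to half of its frame buffer in every exchange round.
void compositeDepth(FrameBuffer& mine, const FrameBuffer& theirs,
                    std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        if (theirs.depth[i] < mine.depth[i]) {
            mine.depth[i] = theirs.depth[i];
            mine.color[i] = theirs.color[i];
        }
    }
}
```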
4.5 Extension to Time Varying Data

Given the extremely compact size of our indexing structure, this scheme can be easily extended to deal with large scale time-varying data as follows. We have shown that the size of our indexing structure is O(n log n) for a single time step, during which there are n distinct values of the endpoints of the intervals corresponding to the metacells. To index time-varying data of m time steps, we can use the same indexing scheme for each time step separately, resulting in an indexing structure of size O(mn log n). Note that the size of the indexing structure depends only on the number of time steps, which is typically small, say in the order of hundreds and rarely in the thousands, and is independent of the total number of cells of the given dataset. For example, one-byte scalar data with hundreds of time steps will require an indexing structure of size at most a few megabytes, regardless of the size of each grid, since the number n of possible distinct values is 2^8 = 256. Similarly for two-byte scalar data, the size of the indexing structure increases to hundreds of megabytes, which is still reasonable and can easily fit in the main memory of today's processors. In the case of the Richtmyer-Meshkov dataset, we have 270 time steps with 7.5GB per time step, which amounts to a total of about 2.1TB. However, the size of our indexing structure for the whole dataset is only 3.2MB.

4.6 UM Visualization Cluster

The testing platform consists of a 16-node visualization cluster; each node consists of a 2-way SMP dual-CPU running at 3.0 GHz, 8GB of main memory, a 60GB local disk that can achieve a 70 MB/sec I/O transfer rate, and one NVIDIA 6800 GPU card with a bi-directional 4GB/s data transfer rate to memory via a PCI-Express (×16) bus. The GPU communicates with the CPU and RAM via the MCH (Memory Controller Hub). These 16 nodes are interconnected through a 10Gbps Topspin InfiniBand network. In addition, four nodes are connected to four projectors for a four-way tiled wall-sized display via their GPU cards' DVI ports. The architecture of a single node is illustrated in Figure 4.4.

Figure 4.4: The visualization cluster architectural diagram.

As a visualization cluster, each node of the system can run graphics programs and dispatch OpenGL commands to its GPU for rendering. The system software configuration includes Redhat Linux Enterprise 3.0, MPI, and the Chromium package (http://chromium.sourceforge.net/) to enable parallel rendering among multiple rendering nodes. Basically, Chromium is able to intercept OpenGL command calls from graphics applications and send them to the proper rendering servers according to the tiled-display layout [12]. For the parallel rendering scheme, we use the sort-last method [17]. The essence of this method is to have each node render its triangles locally using the on-board GPU, after which the output is read back from the GPU's frame buffer and sorted according to the display servers' tile layout. Different regions of the frame buffer, including the z-buffer content, are forwarded to the appropriate rendering servers, each of which is responsible for displaying a specific region on the wall-sized display. At each rendering server, the components of the frame buffers from the various processors are composited using their z-buffer contents and rendered to the display device connected to the server's GPU. In our experiments, the time for sorting and shuffling the frame buffers among the various nodes via the 10Gbps InfiniBand does not cause a noticeable overhead compared to the time it takes to extract and render the triangles at each node, as we will see later.

4.7 Experiments and Performance Analysis

We tested the performance of our parallel scheme intensively on the Richtmyer-Meshkov dataset because this is one of the most challenging datasets in terms of both size and data complexity. This dataset consists of 2048 × 2048 ×
1920 one-byte scalar values with very high complexity for each time step and spans 270 time steps. The data amounts to 7.5GB for each time step, for a total of 2.1TB.

4.7.1 Preprocessing

During the data preprocessing stage, we scan the data once and create the metacells, where each metacell consists of a four-byte ID indicating the location of the metacell, the 9 × 9 × 9 one-byte scalar values of the vertices of a subcube of 8 × 8 × 8 cells, and the minimum value of the metacell vertices. At this point, the original data has been converted to 256 × 256 × 240 metacells, each of length 734 bytes. Using our scheme, we stripe the metacells among the available disks and build each local indexing structure separately on each node. In particular, each node of the visualization cluster holds a compact interval tree with pointers to the bricks stored on its local disk. For a single time step, this preprocessing takes about 30 minutes to complete on a single node of our cluster.

4.7.2 Load Balancing

Since the performance of our algorithm depends critically on how the overall computational load is allocated to the different processors, we start by illustrating the load balancing achieved on a 16-processor cluster using seven different datasets, including the five datasets listed in Table 4.1. The average number of active metacells among all the processors, as well as the maximum difference between the loads of any two processors, are shown in Tables 4.2 and 4.3 over a wide range of isovalues. As the results show, the maximum difference in the number of active metacells among all the p processors is substantially smaller than the established upper bound of 2√(N/p) in all cases, which clearly illustrates the excellent load balancing achieved in practice by our scheme. Since the total work of our parallel scheme is about the same as that of the sequential algorithm, we expect the performance of our parallel algorithm to be linearly scalable, as we show next in some detail for the Richtmyer-Meshkov dataset.

Table 4.2: Measured load imbalance in the number of active metacells among sixteen nodes with varying isovalues for one-byte and two-byte datasets. N denotes the number of intervals and p is the number of processors, which is equal to 16. Ave. N_λ/p represents the average number of active metacells per processor.

            LLNL Dataset (one-byte)      MRBrain Dataset (two-byte)    CTHead Dataset (two-byte)
            N = 5,629,653                N = 6,134,838                 N = 5,799,589
            2√(N/p) = 1,186              2√(N/p) = 1,238               2√(N/p) = 1,204
  Isovalue  Ave. N_λ/p  Max Diff  |  Isovalue  Ave. N_λ/p  Max Diff  |  Isovalue  Ave. N_λ/p  Max Diff
  10        90,986      33        |  1,253     24,657      108       |  0         12,196      36
  30        129,505     28        |  1,480     28,797      72        |  272       30,149      24
  50        165,931     26        |  1,707     36,185      61        |  544       18,255      42
  70        177,806     18        |  1,934     41,931      52        |  816       14,696      27
  90        150,728     18        |  2,161     29,164      38        |  1,088     45,802      9
  110       110,231     14        |  2,388     17,132      44        |  1,360     17,683      34
  130       81,769      11        |  2,615     7,284       36        |  1,632     15,536      27
  150       64,839      5         |  2,842     5,425       42        |  1,904     14,161      61
  170       54,411      6         |  3,069     3,903       38        |  2,176     12,258      22
  190       47,232      7         |  3,296     2,052       18        |  2,448     6,055       44
  210       41,658      12        |  3,523     522         32        |  2,720     1,215       36

Table 4.3: Measured load imbalance in the number of active metacells among sixteen nodes with varying isovalues for four-byte floating-point datasets corresponding to a hurricane simulation.
N denotes the number of intervals and p is the number of processors, which is equal to 16. Ave. N_λ/p represents the average number of active metacells per processor.

  Four-byte floating-point datasets: N = 24,651,099; 2√(N/p) = 2,482
            Pressure Dataset                  Velocity Dataset
  IsoValue    Ave. N_λ/p  Max Diff  |  IsoValue  Ave. N_λ/p  Max Diff
  -3272.59    83          9         |  -60.44    65          9
  -2763.30    214         16        |  -49.66    410         20
  -2254.02    410         30        |  -38.88    1,315       38
  -1744.74    633         32        |  -28.10    2,602       53
  -1235.46    1,020       12        |  -17.32    6,135       75
  -726.18     1,678       34        |  -6.55     46,495      55
  -216.90     3,095       39        |  4.23      49,948      70
  292.38      22,879      32        |  15.01     15,791      41
  801.66      19,730      65        |  25.79     5,308       27
  1310.94     15,003      49        |  36.57     1,087       39
  1820.22     9,334       45        |  47.35     198         35

4.7.3 Overall Performance and Scalability

We consider time step 250 of the Richtmyer-Meshkov dataset. After preprocessing the dataset, we obtain 5,592,802 metacells that occupy a space of size 3.828GB, which is nearly 50% smaller than the original 7.5GB size, since we eliminate the metacells for which all the vertices have the same scalar field value. We vary the isovalues from 10 up to 210, in steps of 20. For each of these isovalues, we ran the algorithm on one, two, four, eight, and sixteen nodes. We evaluate the performance of our isosurface extraction algorithm according to the following three metrics:

(i) the I/O time it takes to retrieve the active metacells from the disk, referred to as Active MetaCell (AMC) Retrieval time;

(ii) the amount of CPU time required to go through the active metacells and generate the appropriate triangles, referred to as Triangulation time;

(iii) finally, the rendering time, which reflects the time it takes to render the triangles on the local GPU, after which the different frame buffers are composited to generate the final display. The time it takes to do the compositing of the very last step is included in the total time.

Table 4.4: Performance summary of our algorithm on a single node. AMC refers to the active metacells. The last column shows that the performance is linear in the number of triangles and hence the algorithm is output sensitive.

  Iso-   Number of     Number of   AMC Retrieval  Triangulation  Rendering  Total Time  Overall Rate
  value  Triangles     AMC         (sec)          (sec)          (sec)      (sec)       (10^6 tri/sec)
  10     228,770,844   1,455,782   15.49          36.2           19.00      70.74       3.23
  30     376,578,332   2,072,085   21.52          57.9           21.50      100.98      3.73
  50     569,387,336   2,654,902   25.74          85.41          32.65      143.85      3.96
  70     651,834,482   2,844,889   25.89          96.31          37.39      159.65      4.08
  90     511,136,810   2,411,647   21.96          76.09          29.26      127.35      4.01
  110    329,408,766   1,763,701   17.13          50.45          19.09      86.72       3.8
  130    229,201,420   1,308,299   13.30          35.25          13.28      61.89       3.7
  150    177,035,314   1,037,426   10.71          27.18          10.29      48.23       3.67
  170    146,369,438   870,573     9.12           22.38          8.51       40.05       3.66
  190    125,365,482   755,710     7.99           19.09          7.27       34.41       3.64
  210    108,977,638   666,527     7.07           16.5           6.32       29.94       3.64

We first consider the performance of our algorithm on a single processor. From Table 4.4, we can see that the number of generated triangles varies from about 100 million to 650 million over the range of isovalues from 10 to 210. Our indexing structure is of size 6KB, which is quite small compared to the size of the data. As shown in Table 4.4, we are able to achieve an I/O rate of about 70MB/s in retrieving the active metacells, with a linear relationship between the total time and the number of triangles generated. It is also shown that the triangle generation stage is the bottleneck of the whole isosurface extraction, as we need to go through each of the active unit cells within an active metacell to generate the triangles as necessary. Once the triangles are generated, they are rendered on the GPU very quickly. As a result we were able to extract and render isosurfaces at a rate of almost 4 million triangles per second.
Figure 4.5: Overall execution time on up to sixteen processors over a range of isovalues.

Table A.1 through Table A.4 (in Appendix A) show the execution times of the major steps of our algorithm on 2, 4, 8, and 16 processors over a wide range of isovalues. A careful examination of the experimental results in these tables clearly illustrates the scalability and the efficiency of our scheme. Note also that the compositing step (whose time is equal to the total time minus the sum of the times of the other steps) is extremely fast and takes at most a few hundred milliseconds even on 16 processors.

Figure 4.6: Corresponding speedups on up to sixteen processors over a range of isovalues.

The overall time spent on the extraction and rendering of isosurfaces for various isovalues is illustrated in Figure 4.5. The corresponding speedups are highlighted in Figure 4.6. As expected, our scheme achieves very good scalability relative to our extremely efficient serial algorithm, independent of the particular isovalue. We now consider the more general case of time-varying datasets that are to be explored by extracting and rendering isosurfaces corresponding to a time step and an isovalue. We can index the 270 time steps of the Richtmyer-Meshkov dataset using our indexing scheme. The size of the resulting indexing structure is 3.2MB, which easily fits into the main memory of a node. The layout of the data of each time step will be distributed across the processors as before. Extracting an isosurface of a time step amounts to determining the appropriate indexing structure for that time step, which can easily be performed since the whole indexing structure is in main memory. Table 4.5 shows the results for time steps 180 through 195 for the isovalue of 70 on four processors. Each row of the table lists the number of active metacells, the number of triangles generated, the execution time on the four-node configuration, and the overall rate of triangles rendered (millions per second).

Table 4.5: Overall performance of four processors on isovalue 70 over 16 time steps.

  Time-Step  Num of Active Metacells  Number of Triangles  Overall Time (sec)  Overall Rate (10^6/s)
  180        2,249,247                499,936,480          35.971              13.898
  181        2,259,741                502,356,768          32.934              15.253
  182        2,269,996                504,717,568          32.717              15.427
  183        2,280,249                507,140,192          33.349              15.207
  184        2,290,438                509,504,96           33.145              15.372
  185        2,300,808                511,877,56           33.628              15.222
  186        2,310,642                514,222,240          34.121              15.071
  187        2,320,869                516,675,168          34.524              14.966
  188        2,330,322                519,26,80            33.732              15.387
  189        2,340,363                521,496,928          34.3                15.204
  190        2,295,699                516,444,992          31.405              16.445
  191        2,360,385                526,339,520          35.163              14.969
  192        2,370,458                528,728,448          34.523              15.315
  193        2,380,148                531,149,920          35.256              15.066
  194        2,389,433                533,600,192          35.172              15.171
  195        2,399,116                536,50,304           35.891              14.936

4.8 Summary

In this chapter, a new indexing structure and a new partitioning algorithm are presented for out-of-core isosurface extraction and rendering of large scale data. The indexing scheme is based on a compact version of the interval tree that makes use of the span space concept, whose size is O(n log n) compared to Θ(N) for the standard interval tree, where N is the number of all possible pairs of scalar field values appearing in metacells and n is the number of their distinct endpoints. The data is arranged in a compact layout on the disk, which enables very efficient I/O performance.
We have shown that the new algorithm can easily be adapted to a multiprocessor environment, provably delivering efficient and scalable performance. The algorithm was tested extensively on a wide variety of datasets, and was shown in detail to achieve scalable performance on the Richtmyer-Meshkov dataset over different processor configurations and different isovalues.

Chapter 5
Hybrid Ray-casting for Isosurface Rendering

A major drawback of polygonal-based isosurface generation is the need to extract a potentially very large triangular mesh for each specific isovalue, a large part of which may not be visible from any specific viewpoint. The size of this view-independent triangular mesh tends to increase significantly as the size of the dataset grows or as the structure of the isosurface becomes more complex. In fact, as shown in the previous chapter, it consists of hundreds of millions of triangles for the Richtmyer-Meshkov instability dataset from LLNL. One approach to address this drawback is to extract and render only the triangles that cover the portions of the isosurface visible from the viewpoints of interest. A scheme for view-dependent rendering was introduced in [35], in which the authors showed that view-dependent visualization can significantly reduce the complexity of the rendered surfaces. Unfortunately, view-dependent isosurface generation algorithms tend to be relatively slow in extracting triangles as they have to deal with two different types of constraints. The first involves a range search relative to the given isovalue, while the second constraint amounts to a spatial filtering to identify the visible portion of the isosurface. In order to achieve interactive rendering, researchers have resorted to parallel algorithms such as those that appeared in [31, 26]. A recent method based on an elaborate data structure for a persistent octree was described in [40]. On the other hand, we should note that isosurfaces generated using triangular meshes may not be of high quality, especially in complex regions, as the triangular mesh is just a polygonal approximation of the surface.

5.1 Features of Existing Ray-casting Algorithms

An alternative approach for generating isosurfaces is to use the ray-casting methodology. Such an approach was first proposed in [19], using brute-force ray-casting on the SGI Reality Monster, a shared memory multiprocessor. A distributed memory version was described in [29], and schemes based on kd-trees and octrees are described in [30, 42, 33]. In general, the main features that have been exploited in the literature to speed up the computation of isosurfaces through ray-casting are the following:

• The use of spatial decomposition indexing structures augmented by the range of the densities at each node. Such a structure enables the culling away of large parts of the data that are not part of the visible portion of the isosurface. Also, the input data can be incorporated into the structure to generate a multiresolution representation of the volumetric data.

• The mapping of the inherent parallelism of ray-casting onto pipelined and parallel architectures, since the ray traversal corresponding to any pixel can be performed independently of any other ray traversal.

• An attempt to optimize cache performance by processing chunks of the input data of suitable sizes.
• The exploitation of the SIMD extensions on some of the newer multi-core processors, which led to the idea of shooting a packet of rays (typically 4 rays corresponding to 2 × 2 adjacent pixels) as the unit traversal through the volumetric dataset.

Except for applying the above techniques in different ways, all the known isosurface ray-casting algorithms follow more or less the same basic strategy. In this part of the work, we introduce a new strategy to greatly improve ray-casting for isosurface rendering. This method is a combination of an object-order projection of a coarse version of the data and a very efficient ray-casting restricted to a few data blocks for each packet of rays (corresponding to adjacent pixels). For clarity, we start by presenting the basic version for a single ray, which will be extended afterwards to exploit the idea of packets of rays.

5.2 Basic Strategy

At a high level, our scheme consists of the following two phases.

Phase I: We perform a traversal of a 3D-tiled version of our volumetric dataset, using a very compact data structure, to identify, through projection from object space onto the image space, the visible and isosurface-intersecting 3D-tiles corresponding to each pixel. From now on, we refer to a 3D-tile of the input data as a data block or simply a block. At the end of this phase, we will have, for each pixel, a list of data blocks that are visible from the ray through this pixel and that intersect the isosurface, organized in front-to-back order relative to the viewpoint.

Phase II: We now shoot a ray from each pixel through its ordered list of data blocks constructed during Phase I (assuming the list is non-empty; otherwise there is no work to be done), checking whether an intersection voxel with the isosurface lies within a block from the list. There is no need to proceed further once an intersection is found. This is done through a trilinear interpolation process, followed by the shading of the corresponding pixel under the diffuse shading model (which is widely used in scientific visualization).

5.2.1 Compact BONO-Tree Construction

We organize our volumetric data into a coarse grid of equal-sized blocks, where the scalar field values within each block are stored contiguously in a pre-defined order and the block is identified by the coordinates of a pre-specified corner. We use an octree to index the data within a block such that the leaves correspond to 2 × 2 × 2 cells. That is, each leaf will contain a pointer to such a cell. As usual, each node of the octree will contain the minimum and maximum of the values of the voxels lying within the region represented by the node. In addition, we build a BONO (Branch-On-Need Octree) [23] tree for the coarse grid, augmented as usual by the appropriate value ranges. The BONO structure is very similar to the octree except that, for data resolutions other than powers of two, BONO avoids allocating nodes for empty subtrees, and hence it is more space-efficient than the original octree. Note that the blocks are always chosen so that each dimension is a power of two, hence the use of octrees to index their scalar data.

5.2.2 Phase I: Bounded Block List Generation

Phase I is implemented as follows. For efficiency reasons, we limit the size of the list of blocks associated with each pixel to a fixed constant k. We later show that the value of k is rather small for the most efficient implementation. We note that we will always obtain the correct visible isosurface regardless of the value of k.
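For illustration, the two structures assumed in the rest of this section can be sketched as follows; the names (MinMaxNode, BlockList) and the exact fields are illustrative assumptions rather than the actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Min-max node used both in the per-block octrees and in the coarse BONO tree:
// a subtree can be culled whenever the isovalue lies outside [minVal, maxVal].
struct MinMaxNode {
    float minVal = 0.0f, maxVal = 0.0f;
    std::array<MinMaxNode*, 8> child{};   // null children for BONO's empty subtrees
    const uint8_t* blockData = nullptr;   // set at leaves: pointer to the data block
};

// Bounded list of (at most k) visible, isosurface-intersecting blocks kept per
// pixel (or per packet of adjacent pixels) during Phase I, in front-to-back order.
struct BlockList {
    std::vector<const MinMaxNode*> blocks;   // front-to-back
    const MinMaxNode* resumeNode = nullptr;  // where to resume the BONO traversal if
                                             // the k blocks yield no intersection
    bool full(std::size_t k) const { return blocks.size() >= k; }
};
```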
The BONO tree representing the coarse grid is traversed starting from the root. Assume we reach a node v of the tree. If the range stored in v contains the isovalue, we project the minimum axis-aligned bounding box (AABB) of v onto the screen. Such a 3D AABB is computed using the coordinates of a pre-specified corner, with the x, y, and z extents deduced from the level of v. We consider all the pixels falling within the projected area. If the size of the list of any such pixel is less than k, we traverse the children of v in front-to-back order relative to the viewpoint. Otherwise, we skip the subtree rooted at v. Once a leaf is reached, the list of each pixel falling within the projection of the minimum bounding box of the corresponding block is augmented with a pointer to this block, unless the list already has k blocks. Notice that at the end of this phase, we have a list of at most k blocks associated with each pixel, organized in front-to-back order, since this is how the BONO tree was traversed. The limit imposed by the value k makes this phase quite efficient.
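A minimal sketch of this bounded-list traversal, built on the structures sketched in Section 5.2.1 above, is given below; the projection routine is a placeholder, and the front-to-back ordering of the children as well as the bookkeeping of the resume position are omitted.

```cpp
#include <cstddef>
#include <vector>
// MinMaxNode and BlockList as in the earlier sketch.

struct Rect { int x0, y0, x1, y1; };
// Placeholder: the real code projects the node's minimum AABB onto the screen.
Rect projectAABB(const MinMaxNode*) { return {0, 0, -1, -1}; }

void buildLists(const MinMaxNode* node, float isovalue, std::size_t k,
                std::vector<BlockList>& lists, int screenWidth) {
    if (node == nullptr || isovalue < node->minVal || isovalue > node->maxVal)
        return;                                   // culled: range test fails
    const Rect r = projectAABB(node);

    // Skip the subtree unless some covered pixel still has room in its list.
    bool anyOpen = false;
    for (int y = r.y0; y <= r.y1 && !anyOpen; ++y)
        for (int x = r.x0; x <= r.x1 && !anyOpen; ++x)
            anyOpen = !lists[y * screenWidth + x].full(k);
    if (!anyOpen) return;

    if (node->blockData != nullptr) {             // leaf: record the block
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x) {
                BlockList& L = lists[y * screenWidth + x];
                if (!L.full(k)) L.blocks.push_back(node);
            }
        return;
    }
    // Internal node: recurse on the children in front-to-back order relative
    // to the viewpoint (the order computation is omitted here).
    for (const MinMaxNode* c : node->child)
        buildLists(c, isovalue, k, lists, screenWidth);
}
```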
5.2.3 Phase II: Spatial-Aware Ray-Casting

Phase II is implemented as follows. For each pixel with a non-empty list, we shoot a ray from this pixel through the list of its blocks, one block at a time in the order they appear on the list. If the list is empty, the ray does not intersect the isosurface. Otherwise, the ray goes through the active cells one by one within a block via a front-to-back octree traversal. When the ray encounters an active cell, we perform a trilinear interpolation upon the active cell to compute the isosurface intersection, followed by (diffuse) shading if such an intersection exists. The intersection is computed accurately using the method described in [36], while the normals are computed using forward differences. If the ray reaches the end of the list without finding such an intersection (and hence the list is of size k), we revert to the traditional approach by resuming the traversal of the BONO tree from where we stopped during the first phase, which is recorded at the end of each k-sized list. Lastly, if the list is of size less than k with no intersection found at this stage, it implies that this ray does not intersect the isosurface. Clearly we will always end up with the correct intersection points of all the rays with the isosurface, regardless of the value of k. However, we will later show that the case in which we have to resume the traversal of the BONO tree (as in the traditional approach) occurs rarely if k is chosen appropriately.

5.3 Improvements on the Basic Strategy

We now consider some acceleration techniques for the basic strategy, which exploit the coherency between nearby pixels and their associated block lists.

5.3.1 Extension to Packets of Rays

Our scheme builds for each pixel a small list of data blocks that are visible from the ray through this pixel and that intersect the isosurface. In general, we expect the lists of adjacent pixels to overlap significantly, especially for close views. We exploit this feature by combining the lists of each group of adjacent pixels (say 2 × 2 as used in our experimental results) into a single list. This is somewhat similar in spirit to the use of packets of rays in [42, 34]. However, in this work we do not make use of SIMD instructions to process the packet of rays for ray-casting, since their success depends upon scene coherency and the use of such instructions may lead to poorer rendering performance on complex incoherent scenes, as pointed out in [33, 34]. Instead, our emphasis here is on high level algorithmic techniques that are applicable to all scenes. However, we intend to explore in the future the additional benefits of our algorithm when SIMD instructions are exploited to process each grouped list of data blocks for coherent scenes. During the Phase I creation of the lists, we traverse the BONO tree as before. However, we create lists for each packet of rays (corresponding to an adjacent group of pixels, typically 2 × 2) rather than a separate list for each pixel. Whenever such a group of pixels overlaps with the projection of the current node being traversed, the group's list is processed as before. Since we are now creating fewer lists, the performance of Phase I improves substantially, as illustrated in Table 5.1, which shows the execution times corresponding to different values of k to generate respectively 1024² lists (one list per pixel) and 512² lists (one list for each 2 × 2 group of adjacent pixels). During Phase II, a slight overhead will be incurred as the upper bound k on the size of the list needs to be increased for the grouped list. However, we will show later that we achieve the best performance when k is around 22, compared to 12 in the single pixel case, and therefore the overhead will be minimal.

Table 5.1: List generation time on a single core for Far and Close views with various upper bounds k, using single pixels (resulting in 1024 × 1024 lists) and groups of 2 × 2 adjacent pixels (resulting in 512 × 512 lists) on a 1024² screen. To make the comparison fair, the list upper bound k is adjusted so that the two cases generate relatively the same number of shaded pixels after rays are cast through the data blocks on the lists.

          List upper bound k        Time (msec)
  View    1024² lists  512² lists   1024² lists  512² lists   Ratio
  Far     3            6            273          110          2.48
          7            14           362          142          2.55
          11           22           434          169          2.57
  Close   3            6            228          68           3.35
          7            14           334          92           3.63
          11           22           449          119          3.77

5.3.2 Adaptive Block Size

Another improvement to our basic scheme is to make the size of the data block adaptive. For far views, we can use relatively large blocks, especially when processing large volumetric data. For example, we use 8 × 8 × 8 blocks for the RMI dataset to handle the rendering of far views, which results in 6 × 6 pixels on average being covered by the projection of a data block, and this seems to achieve the best performance when generating the lists for groups of 2 × 2 adjacent pixels. However, when we zoom in for close views, smaller blocks are more effective, especially since the number of BONO nodes visited and the number of projected blocks are much smaller while the projection of a block covers more pixels (e.g., 12 × 12 on average when we use 4 × 4 × 4 blocks for a 16:1 zoom-in close view). Figure 5.1 illustrates the performance of each of Phase I and Phase II on the RMI dataset as a function of the block size and the viewpoint.

Figure 5.1: Execution time of list generation (Phase I) and ray-casting (Phase II) on a single core as we vary the zoom-in from far-view to close-view for block sizes 8³ and 4³.

We make our scheme adaptive as follows.
We visit the BONO nodes as before, except that, for close views, at the end of the traversal of the BONO tree, we proceed with the octree traversal of the BONO tree leaves until we reach the desired block size. After reaching the desired size, we proceed using the current blocks to compute the minimum bounding boxes and construct the lists for the various packets of rays. With the adaptation of the block size from 8 × 8 × 8 to 4 × 4 × 4, our tests show an overall speedup of around 10–20%.

5.4 Parallel Processing Scheme

As multi-core processors begin to dominate the computing market, new programming paradigms are needed to fully exploit the performance opportunities offered by these processors. In general, parallel programming remains a difficult task in spite of the considerable related research efforts undertaken during the past several decades. Unfortunately, this task becomes even more difficult for multi-core processors, given the limited on-chip memory and the typically complex memory hierarchies present in such architectures. On the other hand, multi-core processors present an opportunity for speeding up the computation by partitioning the load among the cores, but a careful management of the memory hierarchy (including whatever caches are available) is critical to the overall performance, in addition to the usual problem of trying to ensure balanced loads among the cores with as little communication as possible. Under this perspective, we focus on programming a single multi-core processor rather than a cluster of such processors, since we believe this is where the main challenge is; moreover, a multi-core processor will soon be the common platform for most people. The multi-core architecture presents opportunities for speeding up demanding computations if the available resources can be effectively used. Typically a multi-threaded program is used to specify the tasks executed on the cores and the coordination involved among these tasks. In general, assume there are p cores on a multi-core processor, with some local (possibly shared) cache or memory available to each core. Using p threads, each phase in the serial version of our algorithm is parallelized as follows.

5.4.1 Block List Generation

To handle Phase I, the screen is divided into almost equal contiguous regions. Each thread traverses the BONO tree and creates the lists of blocks corresponding to the groups of pixels within its region. Hence the traversal of a node is followed by traversing the children nodes in front-to-back order only if the projection of the minimum bounding box of the node intersects the thread's screen region and there is at least one list associated with the region that is not full (i.e., its size is less than k). Note that our BONO tree is small and only a fraction (no more than 10%) of the total BONO tree nodes are actually accessed during Phase I, due to the imposed list upper bound k.

5.4.2 Z-Order Screen Partitioning

We then partition the screen and group the ray-casting tasks through all the lists as follows.

i. Partition the screen into small image tiles (for example 8 × 8 or 16 × 16 pixels) and order these tiles using a Z-order (or a space-filling curve such as the Hilbert space-filling curve). Such an ordering will ensure a high degree of spatial locality among nearby tiles and will result in high cache performance, as we will show later. This step is performed during the preprocessing stage and takes a few milliseconds.
ii. After the lists are generated, assign a weight to each small image tile, equal to the number of non-empty lists within the tile, and compute the total weight W of all the tiles.

iii. Following the Z-order of the image tiles, group the tiles as follows. The first set of tiles whose total weight is W/2 is divided into p equal groups, each group consisting of a contiguous set of tiles following the Z-order. A group is identified by a pair of indices indicating the first and the last image tile in the group. The second set of remaining tiles, whose total weight is W/4, is divided equally as before into p groups. This process is repeated until each image tile is associated with a group, and hence we need at most a logarithmic number of iterations in the screen size, each iteration creating p groups. The result is a list LI of pairs of indices, each pair delineating a group of image tiles.

5.4.3 Dynamic Task Scheduling of Ray-Casting

We perform the ray-casting of Phase II dynamically as follows. Initially, each thread grabs a group of image tiles from the ordered list LI created through the screen partitioning. A thread then processes its group by shooting rays through the pixels in the group using the generated data block lists. Once a thread completes the processing of its group, it grabs the first available group of tiles from the list LI and starts processing the corresponding group. The process continues until all the image tiles are processed. Our dynamic allocation of the ray-casting tasks attempts to achieve an optimal trade-off between two conflicting requirements. The first is the desire to have fine-grain tasks that are assigned dynamically, with the goal of achieving tight load balancing. The second is to make the number of jobs as small as possible, with the goal of minimizing the amount of coordination and synchronization among the threads. In our list LI, we start with jobs (corresponding to groups of image tiles) that are relatively large, and decrease their sizes until we reach fine-grain jobs at the last p positions of LI. Therefore our strategy seems to strike an optimal balance between the two requirements. In the next section, we will illustrate the performance of each step, and show in particular that we are able to achieve very tight load balancing among the different threads as well as very high cache performance.
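A minimal multithreaded sketch of this dynamic allocation is shown below; TileGroup, castAll, and the commented-out rayCastTile call are illustrative names, and the list LI is assumed to have already been built by the weighted Z-order grouping above. The single atomic counter keeps the coordination cost to one fetch-and-add per group, while the geometrically decreasing group sizes preserve load balance.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// A group of consecutive Z-ordered image tiles, delineated by its first and
// last tile index in the Z-order.
struct TileGroup { std::size_t firstTile, lastTile; };

// Dynamic allocation of ray-casting work: each thread repeatedly grabs the
// first available group from the list LI built by the weighted partitioning.
void castAll(const std::vector<TileGroup>& LI, unsigned numThreads) {
    std::atomic<std::size_t> next{0};
    auto worker = [&]() {
        for (;;) {
            const std::size_t g = next.fetch_add(1);   // grab the next group
            if (g >= LI.size()) break;
            for (std::size_t t = LI[g].firstTile; t <= LI[g].lastTile; ++t) {
                // rayCastTile(t);  // shoot rays through the block lists of tile t
            }
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < numThreads; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```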
5.5 The Clovertown Multi-Core Processors

We use the Clovertown platform, consisting of two quad-core Intel 1.86 GHz Xeon 5320 processors. Each pair of cores on a quad-core processor shares an L2 cache of size 4 MB, and hence the total L2 cache available is 8 MB. Its architectural diagram is shown in Figure 5.2. Our Clovertown platform has 8 GB of main memory, which constitutes an upper bound on the size of the datasets used in our experiments.

Figure 5.2: Intel Quad-Core Processor Architectural Diagram.

5.6 Experiments and Performance Analysis

We have conducted extensive testing of our algorithm on six datasets whose sizes range from about 100MB to 8GB, the latter being the largest dataset that can fit into the main memory of our Clovertown platform. Although the isosurface can be generated from an arbitrary viewpoint, we report our test results for two typical views: Far-view, which enables the viewing of the complete isosurface on the screen; and Close-view, which consists of zooming in by a ratio of 16:1 to view details of regions of interest. These two view settings will typically involve significantly different numbers of voxels intersecting the isosurface, which directly influences the performance of any isosurface rendering algorithm. Hence, we measure the corresponding performance separately to shed more light on the robustness of our scheme. In addition, we take six different viewing angles for both Far-view and Close-view, specified by zenith angle θ ∈ {15°, 45°, 75°} and azimuth angle φ ∈ {22.5°, 45°} in spherical coordinates. Due to the high topological complexity of most generated isosurfaces, the screen resolution for our testing is typically set at 1024², which for example enables the highlighting of the fine details of the complex RMI dataset from LLNL. As described before, our scheme consists of an initial phase that generates a list of data blocks for each packet of rays, followed by a dynamic allocation of groups of Z-ordered image tiles among the processor cores, and ending with ray-casting through the lists associated with the groups of adjacent pixels. If the ray intersects the isosurface, the intersection position is calculated by solving a trilinear interpolation equation as in [36]; the pixel is then shaded by computing the forward-difference gradient as the normal at the intersection position and applying the diffuse shading model. We measure the execution time of each phase as well as the overall rendering frame rate of the corresponding isosurfaces. We will show scalability in the number of cores used. In particular, we run our tests on 1, 2, 4, and 8 CPU cores of our Clovertown platform and measure the performance for each case separately. At this point, we note that comparing our experimental results with those of previous algorithms is not straightforward (except when comparing the sizes of the indexing structures), since prior work did not provide sufficient details about the testing scenarios and used different processors (which sometimes were faster in CPU clock speed than our 1.86 GHz quad-core processors and had more main memory). However, we will see later that our performance numbers suggest significantly better performance than any of the published algorithms. To illustrate the relative performance gains achieved by our techniques in a concrete way, we implemented a ray-casting algorithm using the octree indexing structure, while trying to make as effective use of the memory hierarchy and multithreading as possible. All the detailed steps for ray traversal, computing the intersection points, and shading are the same as in our algorithm. In particular, our multithreaded implementation of the octree traversal algorithm is based on a dynamic allocation of static small screen tiles (8 × 8 pixels) to the different processor cores. Therefore the comparison between the two algorithms running on the same machine with identical datasets, viewpoints, and screen sizes will highlight the differences in the strategies used by the two algorithms rather than small implementation details. Moreover, it appears that the performance of this octree traversal algorithm is rather similar to that achieved by the octree algorithm reported in [33].

5.6.1 Datasets Used and Space Consumption

We selected six datasets for our tests, which can generate spatially sparse or dense, topologically smooth or complex isosurfaces, and which represent most types of isosurfaces encountered in various applications (Figure 5.3).
The sizes of these datasets vary from 87MB to 8GB, the largest that can fit into our main memory. These datasets illustrate our scheme's adaptivity to various types of isosurfaces and data sizes.

Figure 5.3: The six datasets used in our experiments. From left to right, top to bottom, the datasets are: Aneurism, Bunny, Skull, Abdominal, VisMale and RMI respectively.

Table 5.2 lists the block size used for each dataset and the corresponding number of blocks in each case. In all cases, the number of blocks is relatively small and does not exceed a few million, and hence the corresponding BONO tree is very compact and can be constructed extremely quickly. In fact, our largest BONO tree is around 46MB, for the RMI 8GB dataset. On the other hand, the accumulated size of the finer indexing structures (that is, the octrees) for all the data blocks is just a fraction of the original dataset, no more than a quarter, as shown in Table 5.3. Note that the total sizes of our indexing structures are substantially smaller than those used by the kd-tree algorithms (such as [42, 30]). Due to the simplicity of our indexing structure, the preprocessing times are also much better than any of the published preprocessing times, even when the previous algorithms were run on faster processors. We note that the compressed octree reported in [33] consolidates regions with uniform values into a single octree node, and hence can result in significant space savings, especially when the spatial variability is relatively small. In fact, a compression of up to 25% is reported for the RMI dataset. Since in our case the indexing structure is separate from the data itself, we can make use of this technique to compress the volumetric data as well.

Table 5.2: Parameters of the various datasets used.

  Data Set    Field Size  Grid Size      Data Size  Block Size  # of Blocks
  Abdominal   2 bytes     512² × 174     87 MB      4³          302 K
  Bunny       2 bytes     512² × 360     180 MB     4³          1,181 K
  Aneurism    2 bytes     512² × 512     256 MB     4³          1,620 K
  Skull       2 bytes     512² × 512     256 MB     4³          1,680 K
  VisMale     2 bytes     512² × 1882    941 MB     8³          663 K
  RMI         1 byte      2048² × 1920   7.5 GB     8³          5,655 K

Table 5.3: Size of our indexing structures for the blocks and the BONO tree, along with their preprocessing times.

  Data        Indexing Size (MB)     Preprocess Time (Sec)    Space
              Blocks      BONO       Blocks       BONO        Overhead
  Abdominal   11.52       2.97       7.1          0.172       13.24 %
  Bunny       45.07       11.43      15.2         0.332       25.04 %
  Aneurism    61.80       15.68      23.4         0.407       24.14 %
  Skull       64.12       16.26      25.6         0.427       25.05 %
  VisMale     187.37      6.43       88.3         0.221       19.91 %
  RMI         809.08      46.79      520.0        1.950       10.53 %

5.6.2 Performance Implications of the List Upper Bound

In addition to our new strategy that combines an object-order traversal followed by ray-casting, we make use of a novel trick by putting a limit k on the number of blocks computed for each group of adjacent pixels (corresponding to a packet of rays). We examine here the critical importance of such an upper bound. The total execution time of our algorithm consists of four main components:

(i) the time it takes to traverse the BONO tree and to generate the lists of blocks;

(ii) the time it takes to gather the small image tiles into groups for dynamic allocation among the processor cores;

(iii) the time to perform ray-casting through the data block lists;

(iv) the time needed for ray-casting of the unfinished pixels (that is, those pixels whose lists were of size k with no intersection found in step (iii)).

The amount of work involved in grouping the image tiles is small (on the order of 2–3 milliseconds).
The bulk of the time is spent on steps (i), (iii), and (iv). In order to illustrate the trade-off involved relative to the upper bound k and the various stages of the algorithm, we ran a number of experiments on the RMI dataset of time step 250 using the isovalue of 70 and screen resolution 10242 for Far-view settings on our Clovertown platform. We measured the execution time on a single core for different values of k, ranging from 0 to 42 (note that octree ray-casting is the same as the case when k = 0). The corresponding results are illustrated in Figures 5.4. From these results, we can make the following observations. First, the octree ray-casting corresponding to the case when k = 0 has the longest execution time by a factor of approximately 40% relative to our algorithm for the best value of k. Second, the time it takes to generate the block lists (indicated in blue) increases with the value k almost linearly because the depth complexity of RMI data is high (? 50) but its contribution to the total time is less than 10% for k ? 30. Third, the ray-casting on the block lists (indicated in red) takes an increasingly larger 73 Figure 5.4: Execution times of the different stages of our algorithm on a single core vs. the value of upper bound k. The results are for the Far-view of the RMI dataset of time step 250 using 10242 screen resolution. fraction of the total execution time as k increases, and is significantly larger than the time it takes to generate the lists. Fourth, and perhaps most importantly, the number of rays that have no intersection with the isosurface after going through exactly k blocks (indicated in yellow) drops very quickly initially as k increases and then somewhat levels off, which can be verified more clearly by examining the curve shown in Figure 5.5. The combined effect of these properties lead to an optimal value for k that in our experiments has been in the range 16 ? 22. For example, the optimal value of k is around 20 ? 22 for the RMI dataset, while for the VisMale dataset, the optimal value of k is around 16 ? 18 because of a smaller complexity depth. Another important benefit of our scheme is the significant decrease in the number of rays cast which do not intersect the isosurface relative to the existing strategy. The traversal of the BONO tree effectively identifies the area on the screen 74 Figure 5.5: Percentage of pixels left to shade after going through k blocks from the lists for the 20482 ?1920 RMI dataset at time step 250. Screen size is 10242. where the isosurface is mapped, passing this information for ray-casting through the block lists. Figure 5.6 illustrates the dependence of the non-intersecting rays cast upon the value of k. The blue curve represents the number of non-intersecting rays determined when going through the lists containing less than k blocks, while the red curve represents the number of non-intersecting rays determined at the very last step of the algorithm after their k-size lists were completed. Obviously, the total number of non-intersecting rays cast is the sum of these two numbers, and does not depend upon the value k > 0. The percentage is out of the total number of rays cast by our algorithm. As shown in Figure 5.6, the value of k directly impacts these two numbers, while the total number of non-intersecting rays cast by our algorithm (for k > 0) is about 5.2% of the total number of rays cast. On the other hand, the octree ray-casting doesn?t filter out any ray initially and simply shoots a ray through each pixel. 
When the isosurface doesn?t occupy most of the screen, which is not 75 Figure 5.6: Analysis of the percentage of rays having no intersection with the iso- surface in our scheme using the 20482 ?1920 RMI dataset at time step 250 under Far-view. Screen size is 10242. uncommon in Far-view, the percentage of non-intersecting rays over total number of rays cast could be large (such as in Aneurism and Abdominal datasets). For the same RMI dataset and the same screen resolution, octree ray-casting ends up with around 45% non-intersecting rays on average over the six tested viewpoints under Far-view. This clearly illustrates the power of our hybrid strategy that manages to almost eliminate the casting of non-intersecting rays. We will assume for the rest of the testing that an optimal value of k has been selected and report the performance corresponding to this value. 5.6.3 Overall Performance In this section, we give an overview on the overall performance of our algorithm on different datasets using a range of viewpoints. The tests conducted are for both the Far-view and the Close-view, each from six viewing angles specified by (? , 76 ?), using a 10242 screen resolution. A variable number of cores, up to 8, are used by running the multi-threaded version of our algorithm. While the scalability of our algorithm and a detailed analysis of the load balance are described later in Section 5.6.6, we report here on the overall performance and compare it with the best published results. The performance, expressed in terms of fps to render the RMI dataset (time step 250 and the isovalue is equal to 70), is listed in Table 5.4 for the Far-view and the Close-view at the six different viewing angles. As can be seen, we achieve interactive rates regardless of the viewpoint or the viewing angle for a very complex isosurface on a high resolution screen. These results illustrate the robustness of our scheme regardless of the complexity of the scene. Note that the number of cores is supposed to steadily increase in the future (perhaps doubling every 18 ? 24 months), and hence our scheme will easily achieve interactive rates on future desktop or laptop processors. Table 5.4: Performance of our algorithm on the Clovertown in fps for the RMI dataset with screen resolution 10242 and isovalue 70 under Far and Close views. 10242 Screen ??? View Core 15-22 15-45 45-22 45-45 75-22 75-45 Far 2-core 0.87 0.87 0.84 0.81 1.36 1.35 8-core 3.41 3.44 3.28 3.20 5.29 5.24 Close 2-core 1.32 1.27 1.08 1.00 1.04 0.98 8-core 5.08 4.85 4.15 3.85 4.12 3.83 As already noted in previous research [19, 42], the ray traversal across the spatial acceleration structure, such as kd-trees or octrees, constitutes the major portion of the total execution time (usually around 65% ? 70%) in octree ray- casting. Yet, Phase I of our algorithm uses efficient object-order projection of blocks 77 to considerably reduce the number of ray traversal steps in Phase II, which in large part leads to our superior performance. In Table 5.5 the comparison of number of ray traversal steps in our algorithm and octree ray-casting for various datasets clearly elucidates this aspect. Table 5.5: Number of ray traversal steps undertaken during ray-casting in octree ray-casting and our algorithm for a screen size of 10242 screen using all the datasets considered in this paper. 
10242 Screen octree ray ours Dataset casting (?103) (?103) ratio Abdominal 39,992 5,865 6.82 Bunny 22,547 2,585 8.72 Aneurism 45,237 3,291 13.8 Skull 37,648 5,463 6.89 VisMale 21,421 3,174 6.75 RMI (far) 42,228 12,944 3.26 RMI (close) 42,868 6,129 6.99 Table 5.6: Measured performance on 8-core Clovertown in fps for our scheme and the octree ray-casting algorithm under Far-view setting for various datasets Screen size 5122 10242 Dataset octree ours ratio octree ours ratio Abdominal 12.99 24.65 1.90 3.80 7.87 2.07 Bunny 22.22 38.56 1.74 6.49 12.66 1.95 Aneurism 13.89 39.33 2.83 3.77 12.35 3.27 Skull 13.70 25.02 1.83 3.76 7.19 1.91 visMale 18.52 29.68 1.60 5.52 9.26 1.68 We now report a summary of our performance results on the other datasets illustrated in Fig. B.4. These results, expressed in terms of fps under the Far-view setting and taking the average over the different viewing angles, are shown in Ta- ble 5.6. Since these datasets have lower depth complexity than the RMI dataset, combined with the fact that their isosurfaces cover the screen unevenly, our algo- 78 rithm delivers a faster interactive rendering rate and achieves further performance improvements over the octree ray-casting algorithm. Note also the significant per- formance achieved for the lower resolution screen of size 5122. 5.6.4 Performance Comparison We compare our algorithm to the algorithm reported in [33], which uses the 16-core NUMA 2.4 GHz Opteron workstation. As far as we know, the performance numbers published in [33] are the best known for the general isosurface ray-casting problem. Since our platform is different than theirs, we need to calibrate the two processors. Comparing the SPEC benchmark1 performance on the AMD Opteron 2.6 GHz and the Intel Xeon 1.86 GHz with the same number of cores (8 in each case) as shown in Table 5.7, we note that the Opteron runs slightly faster and has significantly better throughput than the Intel Xeon. Table 5.7: Performance comparison between an 8-core Opteron 8218 and an 8-core Xeon E5320 using the SPEC benchmark. CPU Model AMD Opteron 8218 Intel Xeon E5320 CPU Clock 2.6 GHz 4 processors 1.86 GHz 2 processors Multi-Core 2-core per die 4-core per die L1 Cache per core 64 KB I + 64 KB D 32 KB I + 32 KB D L2 Cache per die 2 MB I+D 8 MB I+D Main memory 32 GB DDR2-5300 16 GB DDR2-5300 Cint 11.3 11.1 Speed Cfp 11.9 9.57 Through- Cint rate 85.3 58.5 put Cfp rate 83.2 41.3 1http://www.spec.org/benchmarks.html 79 Table 5.8: Performance comparison in fps for RMI datasets on 10242 screen resolu- tion. NUMA is Knoll?s platform consisting of AMD 2.4GHz 16-core Opteron with 64GB memory, Clovertown is our platform consisting of Intel 1.86GHz 8-Core with 8GB memory. The RMI datasets and testing views correspond to their settings with isovalue = 20. Screen 10242 NUMA 16-Core Clovertown 8-Core View Time step Knoll et. al. ours 50 7.4 5.88 Far 150 5.7 4.90 270 4.7 4.15 50 4.3 7.04 Close 150 3.6 5.81 270 3.5 5.58 Listed in Table 5.8 are the performance numbers reported in [33] on their 16-core NUMA and the performance numbers of our algorithm on the Clovertown 8-core using the same dataset, the same viewpoint, and the same screen size. While the number of cores on their platform is twice the number of cores on our platform and they have access to 64GB of memory compared to 8GB on our platform, our performance is much better for close views and only worse by no more than 25% for far views. 
As we show later, our algorithm is highly scalable, and hence we expect our performance to almost double on a 16-core Clovertown, in which case the resulting performance would be significantly better than that of the algorithm in [33].

5.6.5 Cache Performance

A critical factor affecting the performance of any ray-casting algorithm is the irregular data access pattern, which makes it difficult to exploit caches. This issue is even more critical on multi-core processors, as the overhead of memory accesses becomes relatively more significant. During Phase I of our scheme, the data access is relatively regular, since we process the data in object order and generate the block lists. During Phase II, our scheme sorts the small-size image tiles (typically 8×8 pixels) into Z-order and groups the tiles into groups of decreasing size that depend on the weight of each tile, followed by a dynamic allocation of these groups to the different threads. We now illustrate the resulting cache performance.

Table 5.9: Comparison of the cache misses of our algorithm and the octree ray-casting algorithm on the RMI dataset with isovalue 70 and screen resolution 1024². Data load requests is the number of requests issued for min/max and voxel values during ray-casting; cache misses is the number of requests that fail to find the requested data in the cache after the initial load; the miss rate is the number of cache misses divided by the number of data load requests.

                      Data load     Single-core                8-core
  View    Method      requests      Cache misses   Miss rate   Cache misses   Miss rate
  Far     Octree      122,341K      2,153K         1.76%       3,340K         2.73%
          Ours        66,336K       813K           1.25%       836K           1.26%
          Ratio       1.84          2.65           -           4.00           -
  Close   Octree      123,881K      2,384K         1.92%       5,305K         4.28%
          Ours        52,660K       723K           1.37%       764K           1.45%
          Ratio       2.35          3.25           -           6.94           -

Table 5.9 shows the cache miss rates achieved by our scheme and by the octree ray-casting, on both a single core and the 8-core Clovertown, for far and close views of the RMI dataset respectively. Here we have excluded the initial misses caused by the first-time access to the data. As can be seen from Table 5.9, the numbers of data load requests and cache misses in our method are significantly lower than those for the octree method, and a thread in our method will rarely need to access main memory after the data has been loaded for the first time.

5.6.6 Scalability

Our scheme achieves very good scalability in terms of the number of cores used. The first phase divides the image equally among the processor cores, and hence the workload is distributed almost equally among them. Before performing the ray-casting phase, we create an ordered list of groups of small-size image tiles, which are then dynamically allocated to the threads as they become available. While the lists associated with each group of pixels are of different sizes, they are upper bounded by the value of k, which is typically less than or equal to 22. Given the dynamic allocation, we expect the loads on the different threads to be almost equally distributed, resulting in scalable performance. This is indeed the case, as illustrated in Table 5.10, which shows the average frame rate over six views for the two different settings of the viewpoint on the RMI dataset using a varying number of cores. The results are for 512² and 1024² screen resolutions respectively.

Table 5.10: Average frame rate of our algorithm on the Clovertown for the RMI dataset at time step 250 under a varying number of CPU cores, using 512² and 1024² screen resolutions.
Screen Size 5122 10242 Cores Far-view Close-view Far-view Close-view 1 1.77 1.97 0.51 0.56 2 3.53 3.82 1.02 1.12 4 7.03 7.64 2.04 2.23 8 13.08 14.53 3.98 4.31 Scalability 92.4% 92.2% 97.5% 96.2% In fact, an examination of Table 5.10 reveals that the scalability of our algo- rithm is above 90% for both views for up to the maximum number of cores available on our Clovertown platform, where the scalability at 8-core is computed as the per- 82 centage of the performance in fps achieved on 8-core relative to the performance on a single core multiplied by 8 (best possible speedup). Clearly, the advantage of ray-redistribution in our scheme is more useful for the sparse isosurfaces such as those generated by the Abdominal, Aneurism, and Skull datasets since many of the block lists will be empty. Table 5.11: The work from two Phases distributed among eight threads running among 8-core for 10242 screen and isovalue 70 along with their corresponding indi- vidual execution time. The tests are done on RMI dataset for both far and close views. The work load of Phase II is measured by the number of rays cast, the num- ber of ray traversal steps and intersections. Total includes the synchronization time and writing time of the frame buffer. 8-core / 10242 Number of (?103) Time (msec) View Proc No. Rays Traversal Intersect I II Total 0 98 1,380 277 19 223 250 1 102 1,353 274 22 223 250 F 2 99 1,365 276 22 223 250 a 3 97 1,393 274 18 224 251 r 4 96 1,355 278 16 223 250 5 101 1,344 275 22 223 250 6 100 1,365 273 21 223 250 7 94 1,361 275 18 223 250 ? Avg ?100% 2.14 0.81 0.48 10.8 0.21 0.19 0 147 850 382 21 202 231 1 155 844 378 21 203 232 C 2 140 897 360 21 202 231 l 3 127 915 381 20 202 231 o 4 149 857 379 21 203 232 s 5 157 880 371 22 202 231 e 6 149 852 377 20 200 230 7 151 844 382 21 203 232 ? Avg ?100% 4.39 2.61 1.39 2.32 0.31 0.27 Another way to illustrate the scalability of our scheme is through Table 6.6 that shows the loads on the different threads for the RMI dataset for both the far and 83 close views. We provide more details for Phase II since it constitutes approximately 90% of the total computational load. Note that the numbers of ray cast, octree traversal steps, and intersection are almost evenly distributed among the threads regardless of the viewpoint. Therefore the loads are extremely well-balanced among the different threads. 5.7 Summary In this chapter we presented a novel hybrid strategy for rendering isosurfaces by ray-casting. The resulting algorithm starts with an object order traversal that eliminates almost all the pixels with non-intersecting rays and creates short lists of ordered small data blocks for the remaining pixels, then apply ray-casting for relevant pixels on these lists. We have shown that the total size of our indexing structure is very compact and that our performance is significantly superior rela- tive to the published isosurface ray-casting algorithms. We have also shown that our algorithm can effectively exploit the memory hierarchies and its multithreaded implementation can efficiently utilize the multicore platform, which is available on almost all new processors. We presented the results of some of our extensive tests, showing interactive rendering rates for a variety of datasets, of widely different com- plexities, of size up to that of our main memory on a high resolution 10242 screen. All these results indicate that our scheme can easily achieve interactive rendering of isosurfaces of large scale volumetric scalar data on emerging multi-core processors. 
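For completeness, the sketch below shows one standard way of computing the Z-order (Morton) index used to arrange the image tiles before they are grouped and dynamically allocated, as in the load-balancing experiments above. The bit-interleaving routine is a common technique; the Tile structure and its weight field are illustrative assumptions rather than the dissertation's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Spread the lower 16 bits of v so that a zero bit separates consecutive bits.
static uint32_t spreadBits(uint32_t v) {
    v &= 0xFFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton (Z-order) index of the screen tile at (tx, ty): interleaving the x and
// y bits keeps tiles that are close on the screen close in the sorted order, so
// consecutive tasks tend to touch spatially coherent sets of data blocks.
static uint32_t mortonIndex(uint32_t tx, uint32_t ty) {
    return spreadBits(tx) | (spreadBits(ty) << 1);
}

struct Tile { uint32_t tx, ty, weight; };   // weight ~ estimated ray-casting work

// Sort tiles into Z-order before grouping them into tasks of decreasing size
// for dynamic allocation to the threads.
void sortTilesZOrder(std::vector<Tile>& tiles) {
    std::sort(tiles.begin(), tiles.end(), [](const Tile& a, const Tile& b) {
        return mortonIndex(a.tx, a.ty) < mortonIndex(b.tx, b.ty);
    });
}
```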
84 Chapter 6 Extension of Hybrid Scheme to Direct Volume Rendering(DVR) As mentioned before, the surface representation is not able to account for the internal translucent structures of the datasets which appear in most applications. Direct volume rendering creates images from the 3D datasets by sampling the optical properties and compositing their visual effects throughout the data volume. While there are several different approaches to perform direct volume rendering, we focus here on the ray-casting method since this seems to produce the highest visualization quality for volumetric datasets among software-based technology (as outlined in Section 3.3). In general, the work involved in direct volume rendering using ray- casting can be divided into two parts: the first part comes from the traversal of the rays across the data volume, and the second is the opacity/color composition along the ray, which makes use of Riemann sums and a transfer function. Hence, this process demands much more computation than in the isosurface case. In this chapter, we extend the method presented in the previous chapter to develop an extremely effective algorithm for DVR. Before presenting our method, we introduce some terminology for describing our algorithm. We denote a set of spatially adjacent cells (e.g., 4?4?4 or 8?8?8 cells) as a data block or simply block. Given a transfer function and a preset small 85 positive value epsilon1, we can characterize data blocks as follows: transparent block, each vertex of which has opacity < epsilon1; semi-transparent block, some vertices of which have opacity < epsilon1 while other vertices have opacity ? epsilon1; and diffuse block, all the vertices of which have opacity ? epsilon1. 6.1 Extended Hybrid Ray-casting Scheme Similar to the hybrid ray-casting scheme for isosurfaces, our strategy for DVR consists of two main phases. Phase I involves an object order traversal of the data using a spatial hierarchi- cal BONO tree representation whose leaves represent data blocks. The main purpose of this phase is to determine for each group of pixels (containing 2?2 pixels in our implementation) a list of non-transparent data blocks (i.e. semi-transparent and dif- fuse blocks) encountered by the packet of rays in a front-to-back order. As before, we place an upper bound k on the number of such data blocks for all the lists. This will substantially reduce the traversal time without affecting the rendering quality. Phase II consists of shooting rays through the pixels using the lists of data blocks generated during Phase I. The blocks are processed in a front-to-back order and the early termination technique is applied on each ray when possible. In the case when k blocks are processed with a resulting opacity short of the threshold, we extend the ray traversal further across the dataset to continue compositing along the view direction. We next provide more details about our scheme. 86 6.1.1 Pre-processing The preprocessing step starts by constructing the BONO-tree as described in Section 5.2.1 except that a region with constant voxel value can not now be ignored. We organize our volumetric data into a coarse grid of data blocks with fixed size (each block contains 8 ? 8 ? 8 or 4 ? 4 ? 4 cells), where the scalar field values within each block are stored contiguously in a pre-defined order and the block is identified by the coordinates of a pre-specified corner. 
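A minimal sketch of such a blocked layout is given below. It assumes non-overlapping B×B×B voxel blocks stored one after another in memory and ignores the one-sample overlap that cell-based blocks may require at their boundaries; the structure and names are illustrative only.

```cpp
#include <cstddef>

// Blocked ("bricked") layout: the volume is partitioned into B x B x B blocks
// whose samples are stored contiguously, so a ray working inside one block
// touches one small, cache-friendly region of memory.
struct BlockedVolume {
    const float* data;   // all blocks, stored one after another
    int nbx, nby, nbz;   // number of blocks along each axis
    int B;               // block edge length in voxels (e.g. 4 or 8)

    // Flat offset of the voxel at global grid coordinates (x, y, z).
    std::size_t offset(int x, int y, int z) const {
        int bx = x / B, by = y / B, bz = z / B;   // which block
        int lx = x % B, ly = y % B, lz = z % B;   // position inside the block
        std::size_t blockId = (std::size_t(bz) * nby + by) * nbx + bx;
        std::size_t inBlock = (std::size_t(lz) * B + ly) * B + lx;
        return blockId * std::size_t(B) * B * B + inBlock;
    }

    float value(int x, int y, int z) const { return data[offset(x, y, z)]; }
};
```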
To accommodate 2D transfer functions, we compute the gradient (normal) magnitude at each vertex using central differences. We use an octree to index the data within a data block and store the minimum and maximum density value and gradient magnitude pairs, (dmin, hmin) and (dmax, hmax), at each node, with the leaves corresponding to 2×2×2 cells; here dmin and hmin are respectively the minimum density value and the minimum gradient magnitude over all the vertices contained within the subtree, and dmax and hmax are the corresponding maximum values. In addition, we build a BONO (Branch-On-Need Octree) [23] tree on the coarse grid consisting of the data blocks, augmented as usual by the appropriate min/max pairs of density values and gradient magnitudes. Note that the blocks are always chosen so that each dimension is a power of two, hence the use of octrees to index their data. We will later show that our indexing structure is compact for all our datasets and that the preprocessing step can be performed very quickly even for very large datasets.

6.1.2 Block Discrimination Method

We devise a new method that quickly classifies data blocks as transparent, semi-transparent, or diffuse, to allow for efficient empty space leaping based on any 2D transfer function selected by the user. The method creates a discrimination function, implemented as a table that is generated quickly once the 2D transfer function is specified. The construction of such a table D(vi, gj), with discrete values vi for the scalar field and gj for the gradient magnitude, amounts to the following computation. Given a small preset value epsilon1 and a user-specified transfer function opacity(v, g), the value of D(vi, gj) is set to the number of pairs (vk, gl) such that vk ≤ vi, gl ≤ gj, and opacity(vk, gl) > epsilon1.

During the BONO tree traversal step in Phase I, the minimum and maximum density and gradient pairs, (dmin, hmin) and (dmax, hmax), are retrieved from a BONO tree node, which corresponds to a subvolume (see Figure 6.1). The discrimination method finds the largest discrete value vmin not exceeding dmin and the smallest discrete value vmax larger than or equal to dmax, and similarly gmin and gmax with respect to the gradient magnitudes hmin and hmax. Using the table D(vi, gj), we can compute the value β, defined as the number of pairs (vk, gl) such that vmin ≤ vk ≤ vmax, gmin ≤ gl ≤ gmax, and opacity(vk, gl) > epsilon1, by Equation 6.1:

    β = D(vmax, gmax) − D(vmin−1, gmax) − D(vmax, gmin−1) + D(vmin−1, gmin−1)    (6.1)

The value β can then be used to classify the blocks as follows:

- If β = 0, the block is transparent;
- If β is equal to the number of pairs in the box bounded by (vmin, gmin) and (vmax, gmax), the block is classified as diffuse;
- Otherwise, the block is classified as semi-transparent. Note that in this case the block could actually be transparent, in which case we will process it unnecessarily, but this does not affect the correctness of the rendering process in any way.

Since β represents the number of discrete pairs (vi, gj) that are mapped to an opacity larger than epsilon1, it is easy to see that our designation of transparent and diffuse blocks is always accurate. Note that our block discrimination method involves only the transfer function and the min/max values of the scalar field and gradient magnitude of the data blocks, and hence can be carried out very quickly via table look-up and incorporated into Phase I of our extended scheme.
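The discrimination table and the range query of Equation 6.1 can be realized as a 2D cumulative count table; the sketch below follows the description above (the value returned by the range query plays the role of β), but it uses illustrative names and is not taken from the dissertation's code.

```cpp
#include <cstddef>
#include <vector>

enum class BlockKind { Transparent, SemiTransparent, Diffuse };

// 2D cumulative count table over the discrete (value, gradient magnitude) domain:
// D(i, j) = number of pairs (v_k, g_l) with k <= i, l <= j whose opacity under the
// current 2D transfer function exceeds epsilon. Rebuilt whenever the transfer
// function changes; nv and ng are the number of discrete levels per axis.
struct DiscriminationTable {
    int nv = 0, ng = 0;
    std::vector<long long> D;                      // nv x ng entries, row-major

    long long at(int i, int j) const {             // negative indices contribute 0
        if (i < 0 || j < 0) return 0;
        return D[std::size_t(i) * ng + j];
    }

    void build(const std::vector<float>& opacity,  // nv x ng opacities, row-major
               float eps) {
        D.assign(std::size_t(nv) * ng, 0);
        for (int i = 0; i < nv; ++i)
            for (int j = 0; j < ng; ++j) {
                long long c = (opacity[std::size_t(i) * ng + j] > eps) ? 1 : 0;
                D[std::size_t(i) * ng + j] =
                    c + at(i - 1, j) + at(i, j - 1) - at(i - 1, j - 1);
            }
    }

    // Classify a block whose value/gradient ranges map to the discrete box
    // [vmin..vmax] x [gmin..gmax]; 'covered' is the quantity of Equation 6.1.
    BlockKind classify(int vmin, int vmax, int gmin, int gmax) const {
        long long covered = at(vmax, gmax) - at(vmin - 1, gmax)
                          - at(vmax, gmin - 1) + at(vmin - 1, gmin - 1);
        long long boxSize = (long long)(vmax - vmin + 1) * (gmax - gmin + 1);
        if (covered == 0)       return BlockKind::Transparent;
        if (covered == boxSize) return BlockKind::Diffuse;
        return BlockKind::SemiTransparent;
    }
};
```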
Figure 6.1: An illustrative diagram for employing 2D-dimensional discriminate func- tion D(vi,gj), given (dmin, hmin) and (dmax, hmax). The shaded area is bounded by discrete pair values (vmin, gmin) and (vmax, gmax). vmin = floorleftdminfloorright, gmin = floorlefthminfloorright, vmax = ceilingleftdmaxceilingright, gmax = ceilinglefthmaxceilingright when vi and gj takes consecutive integer values. 89 6.1.3 Sequential Algorithm After the preprocessing step, Phase I of our strategy is implemented as follows. The BONO tree is traversed starting from the root. Assume we reach a node v. The block discrimination process is applied to v with min/max range values retrieved from the node. We skip the node if it is classified as transparent; otherwise, we process the node as follows. We project the minimum bounding box of v onto the screen, and consider all the sizes of the lists corresponding to all the groups of pixels intersecting the projection. Should any of the lists be of size smaller than the upper bound k, we proceed and traverse the children of node v if v is semi-transparent in a front-to-back order relative to the viewpoint. Once a block of certain size (such as 8?8?8 for far view or 4?4?4 for close view) is reached, we augment the lists corresponding to the groups of pixels, which overlap the projection, with a pointer to the block. If v is classified as diffuse, we project the minimum bounding box of v onto the screen and augment the corresponding block lists by pointing to node v whenever the size of the list is less than k. Since the nodes are traversed and projected along the viewing order, a falsely classified semi-transparent node will not produce an error and is likely to get cor- rected once we traverse down its subtree. Therefore we end up with lists, each with no more than k blocks for each group of pixels, such that the blocks are arranged in front-to-back order relative to the viewpoint. During Phase II, we shoot rays through the pixels of each group using the cor- responding block lists. Each ray proceeds using grid traversal for diffuse blocks or 90 octree traversal for semi-transparent blocks along the viewing direction while com- positing the colors and updating the opacity values using the Riemann sum (Equa- tions 2.2 and 2.3) with the specified transfer function, while applying early ray termination whenever possible. If the accumulated opacity is lower than the thresh- old (e.g., 0.99 in our setting) after the ray goes through up to k blocks on the list, the ray continues traversal further across the dataset via the BONO tree structure until either reaching the threshold or going beyond the date volume boundary. We will also show that the percentage of such continued ray traversals is minimal given any sufficiently large value of k. 6.1.4 Parallel Processing The objective of a multi-threaded implementation of our algorithm is to al- locate the computation among the different cores in such a way as to (i) achieve an almost equal distribution of the work; (ii) ensure as little communication and coordination among the threads as possible; and (iii) make effective use of the mem- ory hierarchy present in multi-core architectures. This last requirement is especially important when dealing with irregular data movement for large scale datasets as is the case for our volume rendering problem. 
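Before turning to the parallel scheme, the following minimal sketch recaps the per-ray front-to-back compositing loop with early ray termination used in Phase II of the sequential algorithm above. The sample classification callback and the names are illustrative assumptions; the 0.99 opacity threshold follows the setting mentioned in the text.

```cpp
#include <functional>

struct RGBA { float r, g, b, a; };   // 'a' is opacity in [0, 1]

// Front-to-back compositing along one ray. The samples are assumed to be ordered
// front to back; 'classify' maps a sample index to its color and opacity through
// the transfer function (a Riemann-sum approximation of the volume rendering
// integral). Early ray termination stops the loop once the accumulated opacity
// reaches the threshold.
RGBA compositeRay(int sampleCount,
                  const std::function<RGBA(int)>& classify,
                  float opacityThreshold = 0.99f)
{
    RGBA acc{0.f, 0.f, 0.f, 0.f};
    for (int s = 0; s < sampleCount && acc.a < opacityThreshold; ++s) {
        RGBA c = classify(s);
        float w = (1.f - acc.a) * c.a;   // contribution not blocked by material in front
        acc.r += w * c.r;
        acc.g += w * c.g;
        acc.b += w * c.b;
        acc.a += w;
    }
    return acc;
}
```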
The parallel scheme of extended hybrid ray-casting is very similar to the one for isosurface ray-casting presented before, and involves a progressively finer partition of the tasks, each of which corresponding to shooting rays through pixels, coupled with dynamic allocation of these tasks to the cores as they become available. In fact, 91 the multithreaded scheme for hybrid ray-casting described in the previous chapter can be easily adapt to direct volume rendering and achieve the three goals above by modifying the computation of weight W in the step of Z-Order Screen Partitioning (in Section 5.4.2). That is, in our parallel direct volume rendering, the weight W is computed as the summation of the sizes of all the blocks on the lists rather than the number of non-empty lists to more accurately estimate the workload during the following ray-casting task. In the next section, we present experimental results over a wide variety of datasets, which illustrate the scalability of our algorithm and the excellent load balance achieved among the different cores. 6.2 Experiments and Performance Analysis We have conducted extensive tests of our algorithm on six volumetric datasets with sizes varying from 150MB to 7.5GB. We select six viewpoints with zenith angles ? = {15o,45o, 75o} and azimuth angles ? = {12o,102o} in spherical coordinates and report their average frame rates. The tests were conducted on the Clovertown platform, consisting of two Quad-Core Intel 1.80 GHz Xeon Processors 5320. Each dual-core shares an L2 cache of size 4MB, and hence the total L2 cache available is 8MB. Our Clovertown platform has 8GB of main memory, which constitutes an upper bound on the size of the datasets used in our tests. We generate visualizations with screen resolutions 5122 and 10242, and report corresponding performance for Far-view and Close-view settings. 92 Since the work involved in direct volume rendering using ray-casting can be divided into the ray traversal across space, and the color composition along the rays through non-transparent region, we report on the number of ray traversal steps (excluding those steps that contribute to color composition) and the number of color composition steps separately. Also measured are the execution times during Phase I and Phase II of our multithreaded scheme using a varying number of cores on Clovertown platform. Note that excluding Phase I of our scheme (i.e., block-list length k = 0) converts our algorithm into a variant of the well-known image-based DVR approaches, such as those proposed by [56] and [64]. We observe considerable performance improvement when Phase I of our scheme is applied to accelerate the ray-casting process using suitable values of k. The detailed performance numbers will be presented after introducing the datasets used. 6.2.1 Datasets Used and Space Consumption The six datasets used in our tests (shown in Figure 6.2 as rendered by our scheme) belong to the following three major domains: geometric objects (Bonsai, XTree), medical CT scan (Cadaver, Prone, VF), and scientific simulation (RMI). In our scheme, data blocks consisting of background voxels do not need to be indexed, resulting in a very compact representation in general. 
As shown in Table 6.1 for the six datasets, the corresponding BONO trees are quite compact, and the space required by all the octrees representing the data blocks is in general below 40% of the original data size, despite the fact that we augment the data with the minimum and maximum gradient values within each octree node in order to make use of the 2D transfer function. During preprocessing, we also generate the 2D histograms of density and gradient values shown in Figure 6.3, which enable us to construct the 2D transfer function for volume rendering.

Table 6.1: Sizes of the grid and indexing structures for the various datasets used, along with their preprocessing times. The data blocks used in preprocessing consist of 4×4×4 cells for all the tested datasets except RMI, whose blocks consist of 8×8×8 cells.

  Dataset   Dimension           Grid       BONO       Block octrees (MB)   Time (sec)
  Bonsai    512² × 308 × 2B     154 MB     8.5 MB     44.2 (28.7 %)        29.3
  Cadaver   512² × 424 × 2B     212 MB     8.2 MB     45.1 (21.3 %)        31.2
  Prone     512² × 462 × 2B     231 MB     17.1 MB    94.3 (36.8 %)        33.9
  XTree     512² × 512 × 2B     256 MB     17.4 MB    94.0 (36.7 %)        35.5
  VF        512² × 1734 × 2B    867 MB     45.0 MB    159.5 (18.4 %)       125.1
  RMI       2048² × 1920 × 1B   7,680 MB   51.27 MB   1,596 (20.8 %)       736.1

Figure 6.2: The six datasets used in our tests. From left to right, top to bottom, the datasets are: Bonsai, XmasTree (XTree), Cadaver, VisFemale (VF), Prone and Richtmyer-Meshkov Instability (RMI), respectively.

6.2.2 Performance Implication on List Upper Bound

As in the case of isosurface ray-casting, the upper bound k imposed on the block lists plays a critical role in accelerating the ray-casting, with only a very small overhead incurred during Phase I of our algorithm. To quantify the performance gain due to the value of k, we analyze our performance as follows.

Figure 6.3: The 2D histograms (i.e., density value and magnitude of the gradient) of the six datasets used in our tests. From left to right, top to bottom, the datasets are: Bonsai, XmasTree, Cadaver, VisFemale, Prone and RMI respectively.

The total execution time of our algorithm consists of four main components: (i) the time of Phase I for generating the lists of blocks using the BONO tree traversal; (ii) the time for grouping the small-sized image tiles into larger tasks for dynamic allocation among the processor cores, which is typically about 2-3 msec; (iii) the time of ray-casting through the data blocks on the lists to composite the colors and opacities; (iv) the time needed for ray-casting of the unfinished pixels (i.e., those pixels whose lists were of size k but whose accumulated opacity fell short of the threshold).

We divide the work in Phase II into color composition steps over non-transparent regions and ray traversal steps for empty space leaping via the hierarchical structure. We conduct a series of tests on the LLNL RMI dataset at time step 250 to capture the trade-off among the execution times of the various steps for varying values of k, using a single core on our Clovertown platform with a 1024² screen.

Figure 6.4: Execution times of the different stages of our algorithm on a single core vs. the value of the upper bound k. The results are for the Far-view of the LLNL RMI dataset at time step 250 using a 1024² screen resolution.

From the results in Figure 6.4, we can make the following observations. First, pure ray-casting without the generation of block lists (i.e., k = 0) has the longest execution time, by a factor of approximately 40% relative to the best value of k.
Second, the time it takes to generate the block lists in Phase I (indicated in 96 Figure 6.5: Percentage of composition steps left after going through k blocks from the lists and Percentage of composition steps over total steps (including composition and traversal steps), i.e. Empty Space Leap Efficiency, for the Far-view of RMI dataset at time step 250. Screen size is 10242. blue) increases with the value k almost linearly but its contribution to the total time is less than 10% for all the tested values of k. Third, the time that Phase II takes for compositing beyond the k blocks (indicated in yellow) goes down very quickly initially as k increases and then somewhat levels off, which makes an optimal value k at around 40. During the ray traversal in Phase II, traversal steps lead the rays to skip the empty space while compositing steps compute the optical interaction with the material and composite the visual effects along the ray transport. We measure the percentage of the number of compositing steps over the total number of steps (which include compositing and traversal steps), called Empty Space Leap(ESL) Efficiency, as well as the percentage of the number of compositing steps beyond the k blocks over the total number of the steps. As illustrated in Figure 6.5, the increasing ESL 97 Efficiency with the list upper bound k (red curve) indicates that utilizing block lists in Phase II significantly reduces the number of ray traversal steps because the number of compositing steps stays the same over the value k. However, the increment becomes somewhat flat when k reaches a certain value because much fewer composition steps are left after an increasing number of k blocks are processed by ray-casting (indicated by the blue curve in Figure 6.5). Table 6.2: Measured execution time of our DVR algorithm for Phase I and II as well as the number of composition(CP) steps within optimal k blocks on the lists and beyond these blocks on single-core of Clovertown for a screen of resolution 10242 and different view type settings Time (sec) # of CP steps (?103) Datasets k I II k blocks beyond (%) Bonsai 70 0.32 3.87 10,751 164 1.50 Cadaver 75 0.24 4.52 11,926 350 2.85 Prone 70 0.67 8.21 19,966 387 1.90 XTree 70 0.38 5.24 16,243 284 1.72 VF(far) 70 0.34 5.26 13,536 186 1.36 VF(close) 70 0.56 8.20 29,158 334 1.13 RMI(far) 40 0.34 5.98 15,901 289 1.79 RMI(close) 60 0.67 7.44 22,862 346 1.50 The tests conducted on the other datasets yield similar results under their individual optimal values of k to those obtained for the LLNL RMI dataset. As can be seen from these results listed in Table 6.2, the time of Phase I is well below 10% of the Phase II time while the number of compositing steps beyond k blocks are a small percentage of the total number of steps. The numbers of compositing and ray traversal steps undertaken in Phase II for various datasets and view types under corresponding optimal k values in Table 6.3 confirm that the number of traversal steps are dramatically reduced by using Phase I, which in turn results in an increase 98 Table 6.3: Number of composition(CP) steps and number of ray traversal steps undertaken to skip empty space during DVR ray-casting with k = 0 and k = optimal value k? in our algorithm under screen size of 10242 for various datasets and view types. ESL Efficiency(ESL Eff.) is defined as the percentage of number of compositions over number of total steps including composition and traversal steps. # of steps CP Traversal ESL Eff. (%) (?106) both k = 0 k? k = 0 k? 
ratio Bonsai 10.9 50.1 4.5 17.9 70.9 4.0 Cadaver 12.3 74.7 8.3 14.1 59.7 4.2 Prone 20.4 122.1 14.9 14.3 57.8 4.0 XTree 16.5 83.8 7.5 16.5 68.8 4.2 VF(far) 13.7 72.1 12.6 16.0 52.1 3.3 VF(close) 29.5 146.7 12.9 16.7 69.7 4.2 RMI(far) 16.2 51.0 3.4 24.1 82.5 3.4 RMI(close) 23.2 69.7 4.6 25.0 83.4 3.3 on ESL efficiency by a factor of up to 4. 6.2.3 Overall Performance We report on the overall performance of our algorithm in terms of fps on visualizing the six datasets using up to 8 cores on our Clovertown system under 5122 and 10242 screen resolutions. The comparison is made with the well-known DVR ray-casting approach based on the use of the octree, which is same as our implementation when the block list length k is set to zero. However we use our multi-threaded scheme in all cases, which gives a substantial boost to the previous DVR ray-casting approach in terms of scalability on our multi-core processor. In fact, the performance numbers reported for this scheme on the 8-core platform are significantly better than the published ray-casting results for similar size datasets and screens with software-based acceleration techniques. Table 6.4 summarizes the results, and shows that the frame rate of our algorithm can reach 2 ? 6fps for 99 5122 screen and 1 ? 2fps for a screen of resolution 10242. For the 10242 screen, the performance improvement over the well-known octree algorithm (with our new multi-threaded implementation) can range from 40% to over 100%, while it becomes a bit lower on a 5122 screen because Phase I is less effective when the projection of BONO tree node covers fewer pixels under lower resolution. Table 6.4: Measured DVR performance on 8-core Clovertown in fps for our scheme and the octree ray-casting algorithm on 5122 and 10242 screen for various datasets Screen size 5122 10242 Dataset octree ours ratio octree ours ratio Bonsai 4.67 6.58 1.41 1.26 1.84 1.46 Cadaver 3.50 5.46 1.56 0.93 1.60 1.72 Prone 2.00 2.99 1.49 0.53 0.89 1.68 XTree 2.92 4.42 1.52 0.78 1.36 1.74 VF(far) 2.89 3.92 1.36 0.81 1.31 1.61 VF(close) 1.57 2.54 1.62 0.40 0.89 2.19 RMI(far) 3.01 3.76 1.25 0.87 1.26 1.45 RMI(close) 2.62 3.85 1.47 0.69 0.98 1.42 Table 6.5: Average frame rate of our DVR algorithm on Clovertown for the RMI dataset at time step 250 under a varying number of CPU cores using 5122 and 10242 screen resolution. Screen Size 5122 10242 # of Cores Far Close Far Close 1 0.48 0.51 0.16 0.13 2 0.96 1.02 0.31 0.25 4 1.93 2.03 0.64 0.50 8 3.76 3.85 1.26 0.98 Scalability 97.0% 94.2% 98.7% 95.1% 100 6.2.4 Scalability We turn now to examining the scalability of our algorithm in terms of the number of cores and the corresponding loads on the cores induced by our multi- threaded dynamic allocation scheme. Table 6.5 shows the average frame rate over six views for the two different settings of the viewpoint on the RMI dataset using a varying number of cores. The results are for 5122 and 10242 screen resolution respectively, both showing a scalability well above 90% as we increase the number of cores to the maximum of eight available on our platform. Table 6.6 illustrates the loads on the different threads for the RMI dataset for both the far and close views. Note that the numbers of compositing and ray traversal steps are almost evenly distributed among the threads regardless of the viewpoint for Phase II which takes up more than 90% of the total computational cost. Therefore the loads are extremely well-balanced among the different threads. 
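As a quick check on the scalability figures, the metric defined earlier (the frame rate on p cores relative to p times the single-core frame rate) can be applied directly to the rounded entries of Table 6.5. For instance, for the far view at 1024² resolution,

    S_8 = fps_8 / (8 × fps_1) ≈ 1.26 / (8 × 0.16) ≈ 0.98,

which is consistent with the 98.7% reported in the table; the small difference comes from rounding the single-core frame rate.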
6.3 Summary In this chapter, we extended the hybrid ray-casting strategy for direct volume rendering. The resulting algorithm uses an object order traversal coupled with a novel block discrimination method to generate a list of blocks in viewing order for each group of pixels and achieve efficient empty space skipping in the following ray-casting stage. We have shown that the total size of our indexing structure is very compact and that our performance is significantly superior along with high scalability and excellent load balancing relative to the published software-based 101 Table 6.6: The work from the two phases distributed among eight threads running on 8-core for 10242 screen along with their corresponding individual execution time. The tests are done on RMI dataset for both far and close views. The work load of Phase II is measured by the number of composition steps and the number of ray traversal steps. Total time includes the synchronization time and writing time of the frame buffer. Number of steps(?103) Time (msec) View Proc No. Composition Traversal I II total 0 421 2,036 40 721 781 1 387 2,033 50 721 781 F 2 392 2,035 58 720 780 a 3 447 1,994 45 720 780 r 4 392 2,022 46 720 780 5 470 2,028 57 721 781 6 391 2,033 47 721 781 7 427 2,020 40 722 782 ? Avg ?100% 6.93 0.64 13.35 0.09 0.08 0 670 2,917 72 932 1,007 C 1 758 2,897 70 932 1,007 l 2 689 2,937 73 933 1,008 o 3 767 2,915 71 932 1,007 s 4 771 2,900 71 933 1,008 e 5 657 2,969 72 932 1,007 6 793 2,921 71 933 1,008 7 739 2,920 70 935 1,010 ? Avg ?100% 6.56 0.73 1.36 0.10 0.10 schemes. We presented the results of some of our extensive tests, showing interactive rendering rates for a variety of datasets of widely different sizes. All these results indicate that our scheme can deliver interactive rendering of large scale volumetric data on emerging multi-core processors. 102 Chapter 7 Conclusion Volume Rendering is widely used as an effective approach for the visual ex- ploration, computational analysis, and manipulation of volumetric datasets. Due to the dramatic advances in imaging instruments and computing technologies, such datasets are now appearing in many engineering, science and medical applications at a very fast rate with increasingly larger sizes from giga-bytes to tera-bytes. The visualization of such datasets is critical to enable the discovery of features and pat- terns of interest. However, visualizing such large scale of datasets interactively is a computationally demanding task and poses significant challenges to carry it out on current computing platforms. Currently clusters of processors and multi-core pro- cessors are the two most important computing platforms. A significant amount of research has concentrated on exploiting their computational capability and potential use for volume visualization of large scale datasets. 7.1 Summary In this dissertation, we have developed parallel schemes for visualizing volu- metricdatasetsbyrenderingisosurfacesorbydirectvolumerenderingontwoparallel architectures, multiprocessor clusters and multi-core processors. Our first algorithm for isosurface computation contains a new simple indexing structure called compact 103 interval tree for out-of-core isosurface extraction and rendering of large scale data sets and an efficient and scalable implementation on multiprocessor environments in which each processor has access to its own local disk. The second scheme proposes a new hybrid method for the interactive rendering of isosurfaces using ray-casting on multi-core processors. 
This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray-casting strategy tailored for the re- sulting limited-size lists of candidate 3D data blocks. We have extended the hybrid ray-casting scheme to parallel direct volume rendering with a new block discrimina- tion method in support of interactive 2D transfer function and efficient empty space leaping. The key components in our new schemes include compact interval tree indexing structure, span space partitioning scheme, upper bounded block list generation for groups of pixels, Z-order screen partitioning and dynamic task dispatching, as well as block discrimination method under 2D transfer function. The major achievements are summarized below: ? Part I: The compact interval tree enables the identification of the active cells extremely quickly, using more compact indexing structure and more effec- tive bulk data movement than previous schemes. Moreover, the span space partitioning leads to a parallel implementation of the algorithm that provably achieves load balancing across the processors independent of the isovalue, with almost no overhead in the total amount of work relative to the sequential al- 104 gorithm. We have conducted a large number of experimental tests on the University of Maryland Visualization Cluster using the RichtmyerMeshkov in- stability data set, and have obtained results that consistently validate the efficiency and the scalability of our algorithm. ? Part II: The generation of the lists of data blocks enables the identification of the possible regions which could produce the isosurfaces for each pixel on the screen and significantly reduce the computation by efficiently skipping the empty space and irrelevant pixels. The grouping of nearby pixels exploits the data locality among the neighboring block lists and helps achieve higher cache performance. The upper bound imposed on the block list makes the list generation process computationally and algorithmically efficient. Based upon the block lists, the Z-order screen partitioning scheme allows dynamic allocation of groups of ray-casting tasks among the different threads to ensure almost equal loads among the different cores while maintaining spatial locality. We have tested our algorithm on a dual-processor Clovertown platform, each consisting of a Quad-Core 1.86 GHz Intel Xeon Processor, showing that our method is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an interactive isosurface rendering on a screen with 10242 resolution for all the datasets tested up to the maximum size that can fit in the main memory of our platform. ? Part III: The hybrid ray-casting scheme was extended to direct volume ren- dering with the same high degree of scalability, excellent load balancing, and 105 efficient memory management as in the case of isosurface. The block dis- crimination method allows interactively specifying a 2D transfer function and produce the block lists to achieve efficient empty space leaping in direct volume rendering. Our extensive experimental tests on the same Clovertown platform over a wide variety of well-known volumetric datasets show significantly su- perior performance for direct volume rendering, and result in interactive rates for very large datasets such as the RichtmyerMeshkov instability dataset on high-resolution screens. 
7.2 Further Discussion Our hybrid ray-casting scheme can achieve interactive isosurface rendering and direct volume rendering on large scale volumetric datasets upon high resolution screen. This technology can also be applied to other areas as well. Some of the possible extensions include: (i) Geometric object visualization ? physical objects are usually represented by bounding polygonal meshes. The BONO tree built upon the bounding box of nearby polygons can produce the list of 3D blocks which is used to identify potential intersection with rays and skip the empty space up front. Similar to the case of isosurface, the intersecting location between the viewing ray and the object can be computed and the shading can be performed for the corresponding pixel. However, since the size of polygons may differ dramatically from each other, the size of 3D blocks can be adaptive as well to accommodate the difference and obtain the higher 106 acceleration based upon our hybrid scheme. (ii) Multivariate data visualization ? visualizing multivariate datasets is an important and effective method to discover and reveal the relations among the un- derlying data components and has received more research attention in the literature. For the single field datasets, one dimensional transfer function can be the choice. While, 2D or higher dimensional transfer functions are used to work on multivariate datasets with each dimension corresponding to one individual data component. Our extended hybrid ray-casting scheme for DVR can directly adapt to the two-scalar field datasets by associating the secondary dimension with the values on the other scalar field. For the multivariate datasets with more scalar fields, we can build the BONO tree with more augmented min/max ranges and examine the relation be- tween any two scalar fields swiftly by the existing handling method for 2D transfer function. However, to quickly explore the relation among three or more scalar fields demands a new way of handling transfer function, which is one of current research topics in the field of volume visualization and beyond the scope of this study. Our ray-casting techniques focus on speeding up primary ray traversal but does not seem to be applicable to secondary ray traversals for ray-tracing in a straight- forward way. Nevertheless, the primary ray traversal is the most time consuming part in ray-tracing and using ray-casting can sufficiently produce high quality iso- surfaces and direct volume rendering for most applications in scientific visualization. Given the ever increasing sizes of the volumetric data, an interesting issue is whether our ray-casting techniques can be extended to handle out-of-core data. We believe that this is the case and the coherency among the neighboring blocks and between 107 two consecutive views can be exploited to accelerate ray-casting upon out-of-core datasets. It is expected that the details of such extensions will come up in future research. 108 Appendix A Performance of Our Isosurface Extraction Algorithm on the Cluster Table A.1: Execution time of our isosurface extraction algorithm with two processors over varying isovalues. 
Iso- Retrieval Triangula- Render- Total Rate value (sec) tion(sec) ing(sec) Time(sec) (106/s) Speedup 10 7.774 18.391 9.625 36.005 6.32 1.96 30 11.015 29.338 10.948 53.31 7.046 1.89 50 12.871 43.044 16.549 74.827 7.589 1.92 70 12.971 48.655 17.532 80.537 8.064 1.98 90 11.03 38.386 14.842 64.42 7.891 1.98 110 8.615 25.165 9.596 43.568 7.521 1.99 130 6.732 17.672 6.711 31.412 7.297 1.97 150 5.497 13.607 5.181 24.639 7.185 1.96 170 4.631 11.185 4.28 20.33 7.2 1.97 190 4.012 9.548 3.674 17.479 7.172 1.97 210 3.585 8.267 3.195 15.162 7.188 1.97 109 Table A.2: Execution time of our isosurface extraction algorithm with four proces- sors over varying isovalues. Iso- Retrieval Triangula- Render- Total Rate value (sec) tion(sec) ing(sec) Time(sec) (106/s) Speedup 10 4.583 9.299 4.867 19.979 11.45 3.54 30 6.009 14.705 5.455 27.604 13.642 3.66 50 6.601 21.58 8.251 37.81 15.059 3.8 70 6.243 24.186 9.475 40.172 16.226 3.97 90 5.228 19.109 7.436 32.331 15.81 3.94 110 4.38 12.645 4.783 22.23 14.818 3.9 130 3.606 8.827 3.323 16.213 14.137 3.82 150 2.956 6.8 2.571 12.73 13.907 3.79 170 2.482 5.596 2.135 10.549 13.875 3.8 190 2.136 4.779 1.825 9.11 13.761 3.78 210 1.799 4.119 1.587 7.806 13.961 3.84 Table A.3: Execution time of our isosurface extraction algorithm with eight proces- sors over varying isovalues. Iso- Retrieval Triangula- Render- Total Rate value (sec) tion(sec) ing(sec) Time(sec) (106/s) Speedup 10 2.332 4.72 2.511 9.972 23.055 7.09 30 3.004 7.458 2.778 13.989 27.012 7.22 50 2.984 10.875 4.192 18.558 30.761 7.75 70 3.005 12.263 4.792 20.387 32.049 7.83 90 2.733 9.724 3.761 16.648 30.784 7.65 110 2.269 6.382 2.422 11.867 27.86 7.31 130 1.903 4.456 1.691 8.771 26.256 7.06 150 1.561 3.433 1.315 6.947 25.629 6.94 170 1.293 2.827 1.09 5.797 25.416 6.91 190 1.08 2.408 0.934 4.928 25.63 6.98 210 0.923 2.088 0.81 4.22 26.041 7.09 110 Table A.4: Execution time of our isosurface extraction algorithm with sixteen pro- cessors over varying isovalues. Iso- Retrieval Triangula- Render- Total Rate value (sec) tion(sec) ing(sec) Time(sec) (106/s) Speedup 10 1.396 2.372 1.251 5.994 38.167 11.8 30 1.461 3.759 1.372 7.584 49.654 13.31 50 1.776 5.493 2.044 9.911 57.45 14.51 70 1.803 6.16 2.32 10.936 59.604 14.6 90 1.538 4.85 1.813 8.769 58.289 14.52 110 1.222 3.186 1.182 6.019 54.728 14.41 130 0.934 2.229 0.828 4.379 52.341 14.13 150 0.749 1.719 0.642 3.472 50.989 13.89 170 0.645 1.418 0.528 2.868 51.035 13.96 190 0.572 1.206 0.449 2.463 50.9 13.97 210 0.515 1.046 0.388 2.09 52.142 14.33 111 Appendix B Rendered Isosurface and DVR Images by Ray-casting for RMI Datasets Figure B.1: Isosurface image rendered by ray-casting under close-view for RMI dataset at time step 50 with isovalue 70. 112 Figure B.2: Isosurface image rendered by ray-casting under close-view for RMI dataset at time step 250 with isovalue 70. 113 Figure B.3: DVR image rendered by ray-casting under close-view for RMI dataset at time step 50. 114 Figure B.4: DVR image rendered by ray-casting under close-view for RMI dataset at time step 250. 115 Bibliography [1] A.AggarwalandJ.S.Vitter, The input/output complexity of sorting and related problems, Communications of the ACM, 31(9), pp.1116?1127, 1988. [2] L. Arge and J. S. Vitter, Optimal dynamic interval management in external memory (extended abstract), IEEE Symposium on Foundations of Computer Science, pp. 560?569, 1996. [3] C. L. Bajaj, V. Pascucci, D. Thompson and X. Zhang, Parallel accelerated isocontouring for out-of-core visualization, Proceedings of 1999 IEEE Parallel Vis. 
and Graphics Symp., pp. 97?104, 1999 [4] Y?J. Chiang and C. T. Silva, I/O optimal isosurface extraction, Proceedings IEEE Visualization, pp. 293?300, 1997. [5] Y?J. Chiang, C. T. Silva and W. J. Schroeder, Interactive out-of-core isosurface extraction, Proceedings IEEE Visualization, pp. 167?174, 1998. [6] Y. Chiang and C. Silva, External memory techniques for isosurface extraction in scientific visualization, ExternalMemoryAlgorithmsandVisualization, Vol.50, pp. 247?277, DIMACS Book Series, American Mathematical Society, 1999 [7] Y?J. Chiang, R. Farias, C. Silva and B. Wei, A unified infrastructure for parallel out-of-core isosurface and volume rendering of unstructured grids, Proc. IEEE Symp. on parallel and large-data visualization and graphics, pp. 59?66, 2001 [8] P. Cignoni, C. Montani, D. Darti and R. Scopigno, Optimal isosurface extrac- tion from irregular volume data, Proceedings of the 1996 symposium on Volume visualization, pp. 31?38 San Francisco, USA 1996. [9] P. Ellsiepen, Parallel isosurfacing in large unstructured datasets, Visualization in scientific computing ? 95, pp. 9?23, Springer Verlag, 1995. [10] C. Hansen and P. Hinker, Massively parallel isosurface extraction, Proc. IEEE Visualization, pp. 77?83, 1992. [11] W. Hong, F. Qiu and A. Kaufman, GPU-based object-order raycasting for large datasets, EurographicsIEEE VGTC Workshop on Volume Graphics, pp. 177- 186, 2005. 116 [12] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. Kirchner and J. Klosowski, Chromium: A stream processing framework for interactive rendering on clusters, Proceedings of SIGGRAPH 2002, pp. 693?702, 2002. [13] Y. Livnat, H. Shen, and C.R. Johnson, A Near Optimal Isosurface Extraction Algorithm Using the Span Space, IEEE Trans. Visualization and Computer Graphics, vol. 2, no. 1, pp. 73?84, March 1996. [14] W. E. Lorensen and H. E. Cline, Marching Cubes: A high resolution 3D surface construction algorithm, Maureen C. Stone, editor. Computer Graphics (SIG- GRAPH ?87 Proceedings), vol. 21, pp. 161?169, July 1987. [15] K. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh, Parallel volume rendering using binary-swap image composition, Computer Graphics and Application, vol. 14(4), pp. 59-68, 1994. [16] S. Miguet and J. M. Nico, A load-balanced parallel implementation of marching- cubes algorithm, Proceedings of High performance computing Symp. ?95, pp. 229?239, 1995. [17] S. Molnar, M. Cox, D. Ellsworth and H. Fuchs, A sorting classification of parallel rendering, IEEE Computer Graphics and Applications, Vol. 14, No. 4, pp. 23?32, July 1994. [18] T. S. Newman and N. Tang, Approaches that exploit vector-parallelism for three rendering and volume visualization techniques, ComputerandGraphics, Vol.24, no. 5 pp. 755?774, 2000. [19] S. Parker, P. Shirley, Y. Livnat, C. Hansen and P.-P. Sloan, Interactive ray tracing for isosurface rendering, IEEE Visualization ?98, pp. 233?238, Oct 1998. [20] H. W. Shen, C. D. Hansen, Y. Livnat and C. R. Johnson, Isosurfacing in span space with utmost efficiency (ISSUE), IEEE Visualization?96, pp. 281?294, Oct 1996. [21] C. Silva, Y. Chiang, J. El-Sana and P. Lindstrom, Out-of-core algorithms for scientific visualization and computer graphics, Visualization?02, Course Notes for Tutorial #4, 2002. [22] P. M. Sutton and C. D. Hansen, Isosurface extraction in time-varying fields using a temporal branch-on-need tree (T-BON), IEEE Visualization ?99, IEEE Computer Society Press, pp. 147?154, Oct. 25?29 1999. 117 [23] J. Wilhelms and A. 
[23] J. Wilhelms and A. Van Gelder, Octrees for faster isosurface generation, Computer Graphics (San Diego Workshop on Volume Visualization), vol. 24, pp. 57–62, 1990.
[24] Q. Wang, J. JaJa and A. Varshney, An Efficient and Scalable Parallel Algorithm for Out-of-core Isosurface Extraction and Rendering, Journal of Parallel and Distributed Computing (JPDC), vol. 67, no. 5, pp. 592–603, May 2007.
[25] X. Zhang, C. L. Bajaj and W. Blanke, Scalable isosurface visualization of massive datasets on COTS clusters, Proc. IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pp. 51–58, 2001.
[26] X. Zhang, C. L. Bajaj and V. Ramachandran, Parallel and out-of-core view-dependent isocontour visualization using random data distribution, Proc. Joint Eurographics–IEEE TCVG Symp. on Visualization and Graphics, pp. 9–18, 2002.
[27] H. Zhang and T. S. Newman, Efficient parallel out-of-core isosurface extraction, Proc. IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG) '03, pp. 9–16, Oct. 2003.
[28] P. Cignoni, P. Marino, C. Montani, E. Puppo and R. Scopigno, Speeding up isosurface extraction using interval trees, IEEE Transactions on Visualization and Computer Graphics, vol. 3, no. 2, pp. 158–170, April–June 1997.
[29] D. E. DeMarle, S. Parker, M. Hartner, C. Gribble, and C. Hansen, Distributed Interactive Ray Tracing for Large Volume Visualization, Proceedings of the 2003 IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG), pp. 87–94, 2003.
[30] M. Gross, C. Lojewski, M. Bertram and H. Hagen, Fast implicit kd-trees: accelerated isosurface ray tracing and maximum intensity projection for large scalar fields, Proceedings of Computer Graphics and Imaging (CGIM), pp. 67–74, 2007.
[31] J. Gao and H.-W. Shen, Parallel View-Dependent Isosurface Extraction Using Multi-Pass Occlusion Culling, Proceedings of the 2001 IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pp. 67–74, 2001.
[32] M. Hadwiger, C. Sigg, H. Scharsach, K. Bühler and M. Gross, Real-Time Ray-Casting and Advanced Shading of Discrete Isosurfaces, Proceedings of Eurographics 2005, pp. 303–312, 2005.
[33] A. Knoll, S. G. Parker and C. D. Hansen, Interactive Isosurface Ray Tracing of Large Octree Volumes, Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, pp. 115–124, 2006.
[34] A. Knoll, C. D. Hansen and I. Wald, Coherent Multiresolution Isosurface Ray Tracing, Scientific Computing and Imaging Institute, University of Utah, Technical Report No. UUSCI-2007-001, 2007.
[35] Y. Livnat and C. Hansen, View Dependent Isosurface Extraction, Proceedings of IEEE Visualization 1998, pp. 175–180, 1998.
[36] G. Marmitt, H. Friedrich, A. Kleer and S. Parker, Fast and accurate ray-voxel intersection techniques for iso-surface ray tracing, Proceedings of Vision, Modeling and Visualization (VMV), pp. 429–435, 2004.
[37] B. Mora, J. P. Jessel and R. Caubet, Accelerating volume rendering with quantized voxels, Proceedings of the 2000 IEEE Symposium on Volume Visualization, pp. 63–70, October 2000.
[38] B. Mora, J. P. Jessel and R. Caubet, A new object-order ray-casting algorithm, Proceedings of IEEE Visualization 2002, pp. 203–210, October 2002.
[39] A. Reshetov, A. Soupikov and J. Hurley, Multi-level ray tracing algorithm, ACM Transactions on Graphics, Proceedings of ACM SIGGRAPH 2005, pp. 1176–1185, 2005.
[40] Q. M. Shi and J. JaJa, Isosurface extraction and spatial filtering using persistent octree (POT), IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 1283–1290, September 2006.
[41] L. Sobierajski and R. Avila, A Hardware Acceleration Method for Volume Ray Tracing, Proceedings of the 6th IEEE Visualization Conference, pp. 27–34, 1995.
[42] I. Wald, H. Friedrich, G. Marmitt, P. Slusallek, and H. P. Seidel, Faster Isosurface Ray Tracing using Implicit KD-Trees, IEEE Transactions on Visualization and Computer Graphics, vol. 11, no. 5, pp. 562–572, 2005.
[43] R. Westermann and B. Sevenich, Accelerated volume ray-casting using texture mapping, Proceedings of IEEE Visualization 2001, pp. 271–278, 2001.
[44] S. E. Yoon and D. Manocha, Cache-efficient layouts of bounding volume hierarchies, EUROGRAPHICS 2006, vol. 25, no. 3, pp. 507–516, 2006.
[45] C. Bajaj, I. Ihm, G. Koo, and S. Park, Parallel Ray Casting of Visible Human on Distributed Memory Architectures, in Data Visualization, Eurographics, pp. 269–276, May 1999.
[46] J. Challinger, Scalable Parallel Volume Raycasting for Nonrectilinear Computational Grids, Proc. of Parallel Rendering Symposium '93, pp. 81–88, 1993.
[47] K. Engel, M. Kraus, and T. Ertl, High-Quality Pre-Integrated Volume Rendering Using Hardware-Accelerated Pixel Shading, Proc. of SIGGRAPH Graphics Hardware Workshop '01, pp. 9–16, 2001.
[48] T. Foley and J. Sugerman, KD-Tree Acceleration Structures for a GPU Raytracer, in Proceedings of Graphics Hardware 2005.
[49] A. Kaufman and K. Mueller, Overview of Volume Rendering, chapter in The Visualization Handbook, 2005.
[50] G. Kindlmann and J. Durkin, Semi-automatic generation of transfer functions for direct volume rendering, Symp. Volume Visualization '98, pp. 79–86, 1998.
[51] J. Kniss, G. Kindlmann, and C. Hansen, Multidimensional Transfer Functions for Interactive Volume Rendering, IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 3, pp. 270–285, 2002.
[52] J. Kruger and R. Westermann, Acceleration Techniques for GPU-based Volume Rendering, IEEE Visualization 2003.
[53] P. Lacroute and M. Levoy, Fast volume rendering using a shear-warp factorization of the viewing transformation, Proc. SIGGRAPH '94, pp. 451–458, 1994.
[54] P. Lacroute, Real-Time Volume Rendering on Shared Memory Multiprocessors Using the Shear-Warp Factorization, IEEE Parallel Rendering Symposium '95 Proceedings, pp. 15–22, 1995.
[55] M. Levoy, Display of surfaces from volume data, IEEE Comp. Graph. and Appl., vol. 8, no. 5, pp. 29–37, 1988.
[56] M. Levoy, Efficient ray tracing of volume data, ACM Transactions on Graphics, 9(3), pp. 245–261, July 1990.
[57] K. Ma, Parallel Volume Ray-Casting for Unstructured-Grid Data on Distributed-Memory Architectures, Proc. of Parallel Rendering Symposium '95, pp. 23–30, 1995.
[58] K. Ma and T. Crockett, A Scalable Parallel Cell-Projection Volume Rendering Algorithm for Three-Dimensional Unstructured Data, Proc. of Parallel Rendering Symposium '97, 1997.
[59] M. Matsui, F. Ino and K. Hagihara, Parallel Volume Rendering with Early Ray Termination for Visualizing Large-Scale Datasets, ISPA 2004, pp. 245–256, 2004.
[60] N. Max, Optical models for direct volume rendering, IEEE Trans. Vis. and Comp. Graph., vol. 1, no. 2, pp. 99–108, 1995.
[61] M. Meißner, J. Huang, D. Bartz, K. Mueller, and R. Crawfis, A practical comparison of popular volume rendering algorithms, Symposium on Volume Visualization and Graphics 2000, pp. 81–90, 2000.
[62] K. Mueller, N. Shareef, J. Huang, and R. Crawfis, High quality splatting on rectilinear grids with efficient culling of occluded voxels, IEEE Transactions on Visualization and Computer Graphics, vol. 5, no. 2, pp. 116–134, 1999.
[63] J. Nieh and M. Levoy, Volume Rendering on Scalable Shared-Memory MIMD Architectures, Proc. of Volume Visualization Symposium, pp. 17–24, 1992.
[64] S. Parker, M. Parker, Y. Livnat, P. Sloan, C. Hansen, and P. Shirley, Interactive Ray Tracing for Volume Visualization, IEEE Transactions on Visualization and Computer Graphics, vol. 5, no. 3, pp. 238–250, 1999.
[65] S. Roettger, S. Guthe, D. Weiskopf, T. Ertl and W. Strasser, Smart Hardware-Accelerated Volume Rendering, Proceedings of the Symposium on Data Visualisation 2003, pp. 231–238, 2003.
[66] J. P. Schulze and U. Lang, The Parallelization of the Perspective Shear-Warp Volume Rendering Algorithm, Proceedings of the 4th Eurographics Workshop on Parallel Graphics and Visualization, pp. 61–69, 2002.
[67] L. Sobierajski and R. Avila, A Hardware Acceleration Method for Volume Ray Tracing, Proceedings of the 6th IEEE Visualization Conference, pp. 27–34, 1995.
[68] K. Subramaniam and D. Fussell, Applying Space Subdivision Techniques to Volume Rendering, in Proc. IEEE Visualization, pp. 150–158, Oct. 1990.
[69] Q. Wang and J. JaJa, Interactive High Resolution Isosurface Ray Casting on Multi-core Processors, IEEE Transactions on Visualization and Computer Graphics, preprint, Dec. 2007.
[70] M. Wan, Q. Tang, A. Kaufman, Z. Liang, and M. Wax, Volume Rendering Based Interactive Navigation within the Human Colon, in Proc. IEEE Visualization, pp. 397–400, 1999.
[71] M. Wan, A. Sadiq, and A. Kaufman, Fast and Reliable Space Leaping for Interactive Volume Rendering, in Proc. IEEE Visualization, pp. 195–202, Oct. 2002.
[72] L. Westover, Footprint evaluation for volume rendering, SIGGRAPH '90, pp. 367–376, 1990.