ABSTRACT

Title of dissertation: COMPUTER VISION IN THE SPACE OF LIGHT RAYS: PLENOPTIC VIDEO GEOMETRY AND POLYDIOPTRIC CAMERA DESIGN

Jan Neumann, Doctor of Philosophy, 2004

Dissertation directed by: Professor Yiannis Aloimonos, Department of Computer Science

Most of the cameras used in computer vision, computer graphics, and image processing applications are designed to capture images that are similar to the images we see with our eyes. This enables an easy interpretation of the visual information by a human observer. Nowadays though, more and more processing of visual information is done by computers. Thus, it is worth questioning if these human-inspired "eyes" are the optimal choice for processing visual information using a machine.

In this thesis I will describe how one can study problems in computer vision without reference to a specific camera model by studying the geometry and statistics of the space of light rays that surrounds us. The study of the geometry will allow us to determine all the possible constraints that exist in the visual input and could be utilized if we had a perfect sensor. Since no perfect sensor exists, we use signal processing techniques to examine how well the constraints between different sets of light rays can be exploited given a specific camera model. A camera is modeled as a spatio-temporal filter in the space of light rays, which lets us express the image formation process in a function approximation framework. This framework then allows us to relate the geometry of the imaging camera to the performance of the vision system with regard to the given task.

In this thesis I apply this framework to the problem of camera motion estimation. I show how, by choosing the right camera design, we can solve for the camera motion using linear, scene-independent constraints that allow for robust solutions. This is compared to motion estimation using conventional cameras. In addition, we show how we can extract spatio-temporal models from multiple video sequences using multi-resolution subdivision surfaces.

COMPUTER VISION IN THE SPACE OF LIGHT RAYS: PLENOPTIC VIDEO GEOMETRY AND POLYDIOPTRIC CAMERA DESIGN

by Jan Neumann

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Professor Yiannis Aloimonos, Chair and Advisor
Professor Rama Chellappa, Dean's Representative
Professor Larry Davis
Professor Hanan Samet
Professor Amitabh Varshney

© Copyright by Jan Neumann 2004

ACKNOWLEDGMENTS

I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever. First and foremost I'd like to thank my advisor, Professor Yiannis Aloimonos, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past seven years, while still being free to follow my own paths. He has always given me help and advice, and there has never been an occasion when he hasn't given me time to discuss personal and research questions with him.

I also want to thank Professor Rama Chellappa, Professor Larry Davis, Professor Amitabh Varshney, and Professor Hanan Samet for agreeing to serve on my thesis committee and for sparing their invaluable time reviewing the manuscript.
I also want to thank the late Professor Azriel Rosenfeld, whose guidance and enthusiasm for the field of computer vision made the computer vision lab at the University of Maryland the inspiring place that it is. My colleagues and friends at the computer vision lab have enriched my graduate life in many ways and deserve a special mention: Patrick Baker, Tomas Brodsky, Filip Defoort, Cornelia Fermüller, Gutemberg Guerra, Abhijit Ogale, and Robert Pless. I also want to thank my friends Yoni Wexler, Darko Sedej, Gene Chipmann, Nizar Habash, Adam Lopez, Supathorn Phongikaroon and Theodoros Salonidis for sharing the graduate student experience with all its highs and lows with me, and for understanding if I could not always spend as much time with them as they deserve. I also want to thank Kostas Daniilidis for the many illuminating discussions about the structure of the space of light rays.

I dedicate this thesis to my family - my mother Insa, father Joachim and brother Lars, who have always stood by me and guided me through my career, and especially to my wife Danitza, who has supported me during the most stressful times of my PhD and was willing to accept the negative consequences of being with a PhD student. Thank you.

In the end, I would also like to acknowledge the financial support from the Department of Computer Science and the Graduate School for their two fellowships, and the Deutsche Studienstiftung for their summer schools that truly enriched my studies during the early years of my graduate school career.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
  1.1 Why study cameras in the space of light rays?
  1.2 Example Application: Dynamic 3D photography
  1.3 Plenoptic video geometry: How is information encoded in the space of light rays?
  1.4 Polydioptric Camera Design
  1.5 Overview of the thesis
2 Plenoptic Video Geometry: The Structure of the Space of Light Rays
  2.1 Preliminaries
  2.2 Representations for the space of light rays
    2.2.1 Plenoptic Parametrization
    2.2.2 Plücker coordinates
    2.2.3 Light field parameterization
  2.3 Information content of plenoptic subspaces
    2.3.1 One-dimensional Subspaces
    2.3.2 Two-dimensional Subspaces
    2.3.3 Three-dimensional Subspaces
    2.3.4 Four-dimensional Subspaces
    2.3.5 Five-dimensional Subspaces
  2.4 The space of images
  2.5 The Geometry of Epipolar Images
    2.5.1 Fourier analysis of the epipolar images
    2.5.2 Scenes of constant depth
    2.5.3 A slanted surface in space
    2.5.4 Occlusion
  2.6 Extension of natural image statistics to light field statistics
  2.7 Summary
3 Polydioptric Image Formation: From Light Rays to Images
  3.1 The Pixel Equation
  3.2 Optics of the lens system
  3.3 Radiometry
  3.4 Noise characteristics of a CCD sensor
  3.5 Image formation in shift-invariant spaces
  3.6 Summary
4 Polydioptric Sampling Theory
  4.1 Plenoptic Sampling in Computer Graphics
  4.2 Quantitative plenoptic approximation error
  4.3 Evaluation of Approximation Error based on Natural Image Statistics
  4.4 Non-uniform Sampling
  4.5 Summary
5 Application: Structure from Motion
  5.1 Plenoptic Video Geometry: How is 3D motion information encoded in the space of light rays?
  5.2 Ray incidence constraint
  5.3 Ray Identity Constraint
    5.3.1 Discrete plenoptic motion equations
    5.3.2 Differential plenoptic motion equations
    5.3.3 Differential Light Field Motion Equations
    5.3.4 Derivation of light field motion equations
  5.4 Variational Formulation of the Plenoptic Structure from Motion
  5.5 Feature computation in the space of light rays
  5.6 How much do we need to see of the world?
  5.7 Influence of the field of view on the motion estimation
  5.8 Sensitivity of motion and depth estimation using perturbation analysis
  5.9 Stability Analysis using the Cramer-Rao lower bound
  5.10 Experimental validation of plenoptic motion estimation
    5.10.1 Polydioptric sequences generated using computer graphics
    5.10.2 Polydioptric sequences generated by resampling epipolar volumes
    5.10.3 Polydioptric sequences captured by a multi-camera rig
    5.10.4 Comparison of single view and polydioptric cameras
  5.11 Summary
6 Application: Spatio-temporal Reconstruction from Multiple Video Sequences
  6.1 Preliminaries
  6.2 Multi-Resolution Subdivision Surfaces
  6.3 Subdivision
  6.4 Smoothing and Detail Computation
  6.5 Synthesis
  6.6 Encoding of Detail in Local Frames
  6.7 Multi-Camera Shape and Motion Estimation
  6.8 Shape Initialization
  6.9 Motion Initialization
  6.10 Shape and Motion Refinement through Spatio-Temporal Stereo
  6.11 Results
  6.12 Hierarchies of cameras for 3D photography
  6.13 Conclusion and Future Work
7 Physical Implementation of a Polydioptric Camera
  7.1 The Physical Implementation of Argus and Polydioptric Eyes
  7.2 The plenoptic camera by Adelson and Wang
  7.3 The optically differentiating sensor by Farid and Simoncelli
  7.4 Compound Eyes of Insects
8 Conclusion
A Mathematical Tools
  A.1 Preliminaries
    A.1.1 Fourier Transform
    A.1.2 Poisson Summation Formula
    A.1.3 Cauchy-Schwartz Inequality
    A.1.4 Minkowski Inequality
    A.1.5 Sobolev Space
  A.2 Derivation of Quantitative Error Bound
    A.2.1 Definitions for Sampling and Synthesis Functions
    A.2.2 Hypotheses
    A.2.3 l2 Convergence of the Samples
    A.2.4 Expression of $\epsilon_f$ in Fourier Variables
    A.2.5 Average Approximation Error
Bibliography

LIST OF TABLES

2.1 Information content of the axis-aligned plenoptic subspaces, Part I
2.2 Information content of axis-aligned plenoptic subspaces, Part II
5.1 Quantities that can be computed using the basic light ray constraints for different camera types. ? denotes that two quantities cannot be independently estimated without assumptions about scene or motion.
5.2 Rigid motion constraint equations for plenoptic subspaces of different dimensions. In each row the corresponding subspace for each motion constraint equation is in bold letters.
5.3 Motion Flow Constraint Equations for Rigid Motion Estimation
5.4 Brightness constancy constraint equations for rigid motion estimation. Since we have the brightness invariance along the ray, which can be expressed as $\nabla_r L^T r = \nabla_x L^T r = 0$ and $\nabla_r L^T r = \nabla_m L^T r = 0$, we can omit the projection operators in the constraint equations.
5.5 Single term $M^T_{c_i,p_i} M_{c_i,p_i}$ of the Fisher Information Matrix for Rigid Motion Estimation

LIST OF FIGURES

1.1 Michael Land's landscape of eye evolution.
1.2 The information pipeline in computer vision. Usually computer vision starts from images, but this already assumes a choice of camera. To abstract the notion of a specific sensing device, we need to analyze vision problems in the space of light rays.
1.3 (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar image volume formed by the image sequence where each voxel corresponds to a unique light ray. The top half of the volume has been cut away to show how a row of the image changes when the camera translates. (c) A row of an image taken by a pinhole camera at two time instants (red and green) corresponds to two non-overlapping horizontal line segments in the epipolar plane image, while in (d) the collection of corresponding "rows" of a polydioptric camera at two time instants corresponds to two rectangular regions of the epipolar image that do overlap (yellow region). This overlap enables us to estimate the rigid motion of the camera purely based on the visual information recorded.
1.4 (a) Hierarchy of Cameras for 3D Motion Estimation. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.
1.5 (a) Design of a Polydioptric Camera (b) capturing Parallel Rays and (c) simultaneously capturing Pencil of Rays.
2.1 (a) Parameterization of light rays passing through a volume in space by recording the intersection of the light rays with the faces of a surrounding cube. (b) Lightfield Parameterization.
2.2 (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar plane image volume formed by the image sequence where the top half of the volume has been cut away to show how a row of the image changes when the camera translates.
2.3 (a) Light Ray Correspondence (here shown only for the light field slice spanned by axes x and u). (b) Fourier spectrum of the (x,u) light field slice with choice of "optimal" depth z_opt for the reconstruction filter.
2.4 (a) The ratio between the magnitude of perspective and orthographic derivatives is linear in the depth of the scene. (b) The scale at which the perspective and orthographic derivatives match depends on the depth (matching scale in red).
2.5 (a) Epipolar image of a scene consisting of two translucent fronto-parallel planes. (b) The Fourier transform of (a). Notice how the energy of the signal is concentrated along two lines.
2.6 (a) Epipolar image of a scene where one signal occludes the other. (b) The Fourier transform of (a). Notice the ringing effects parallel to the occluding signal.
2.7 (a) Epipolar image of a complex scene with many occlusions. (b) The Fourier transform of (a). There are multiple ringing effects parallel to all the occluding signals.
3.1 Imaging pipeline expressed in a function approximation framework
3.2 (a) Pinhole and (b) Thin-Lens Camera Geometry
3.3 CCD Camera Imaging Pipeline (redrawn from [149])
3.4 Image formation diagram.
4.1 Every imaging device can only capture the plenoptic function with limited precision. By extending signal processing techniques to the space of light rays we can determine how accurately we can measure the radiance of light rays at non-sample locations.
4.2 (a) Light Ray Correspondence (here shown only for the light field slice spanned by axes x and u). (b) Fourier spectrum of the (x,u) light field slice with choice of "optimal" depth z_opt for the reconstruction filter. (c) Optimal reconstruction filter in Fourier space.
5.1 Illustration of ray incidence and ray identity: (a-b) A multi-perspective system of cameras observes an object in space while undergoing a rigid motion. Each individual camera sees a scene point on the object from a different view point, which makes the correspondence of the rays depend on the scene geometry (ray incidence). (c) By computing the inter- and intra-camera correspondences of the set of light rays between the two time instants, we can recover the motion of the camera without having to estimate the scene structure, since we correspond light rays and not views of scene points (ray identity).
5.2 Examples of the three different camera types: (a) Pinhole camera, (b) Spherical Argus eye, and (c) Polydioptric lenslet camera.
5.3 Deviation from the epipolar constraints for motion estimation from individual cameras (from [5])
5.4 Combination of error residuals for a six camera Argus eye (from [5])
5.5 Singular values of matrix A for different camera setups in dependence on the field of view of the individual cameras: (a) single camera, (b) two cameras facing forward and backward, (c) two cameras facing forward and sideways, and (d) three cameras facing forward, sideways, and upward
5.6 (a) Hierarchy of Cameras for 3D Motion Estimation and (b) 3D Shape estimation. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.
5.7 (a) Subset of an Example Scene, (b) the corresponding light field, (c) Accuracy of Plenoptic Motion Estimation. The plot shows the ratio of the true and estimated motion parameters (vertical axis) in dependence of the distance between the sensor surface and the scene (horizontal axis) for f = 60 and spheres of unit radius.
5.8 Evaluation of camera designs by resampling epipolar volumes. (a) By sampling brightness values in the epipolar volume at different spacings along the view dimensions we can simulate a variety of camera arrangements. (b) Example: 29 camera system with noticeable aliasing between the views.
5.9 Relationship between camera spacing, image smoothing and accuracy of motion estimation based on integrating over many image sequences generated from an epipolar volume. The standard deviation of the Gaussian smoothing filter increases from left to right from 1 over 5 to 11 pixels.
5.10 (a) Example indoor scene for motion estimation. The camera can be seen on the table. It was moved in a planar motion on the table. (b) The epipolar image that was recovered after compensating for varying translation and rotation of the camera. The straight lines indicate that the motion has been accurately recovered. (c) Recovered depth model.
5.11 Example outdoor scene. The camera was moved in a straight line while being turned.
5.12 Motion estimation results for example outdoor scene. The comparison of the parking lot geometry with the recovered path indicates that the motion was accurately recovered.
5.13 Comparison of motion estimation using single and multi-perspective cameras. The errors in the correspondences varied between 0 and 0.04 radians (0-2 degrees), and we computed the mean squared error in the translation and rotation direction for 3 different camera fields of view (30, 60, and 90 degrees). The blue line (+) and green line (squares) are Jepson-Heeger's subspace algorithm and Kanatani's normalized minimization of the instantaneous differential constraint as implemented by [145]. The red lines denote the performance of motion estimation using Eq. (5.19) where the errors in the disparities are normally distributed with a standard deviation that varies between 0 and 5 degrees.
6.1 (a) Sketch of multi camera dome in the Keck Lab to capture a large environment and (b) calibrated camera setup for detailed capture of small objects
6.2 Stencils for Loop Refinement and 1-4 Triangle Split
6.3 Neighbouring patches that influence the (a) 3D limit shape and (b) 3D limit motion field of the spatio-temporal surface
6.4 Control Meshes at different Resolutions and their detail differences
6.5 Detail Encoding in Global and Local Coordinate Systems
6.6 Histogram encoding the magnitude of the normal (black) and tangential (gray, white) components of the detail vectors in the multi-resolution encoding of the mesh
6.7 Flow chart describing the multi-resolution analysis and synthesis
6.8 (a) Multi Camera Silhouette Intersection from (b) Image Silhouettes
6.9 (a) 3D Normal Flow Constraint. The planes formed by corresponding image edges and the optical centers intersect in a line in space. (b) By integrating over a surface patch we can estimate the full 3D Flow
6.10 Four Example Input Views
6.11 Results of 3D Structure and Motion Flow Estimation: Structure. (a-c) Three Novel Views from the Spatio-Temporal Model (d) Left View of Control Mesh (e) Right Side of Final Control Mesh (f) Close Up of Face
6.12 Results: Motion Flow. (a-c) Magnitude of Motion Vectors at Different Levels of Resolution (d) Magnitude of Non-Rigid 3D Flow Summed over the Sequence (e) Non-Rigid 3D Motion Flow (f) Non-Rigid Flow Close Up of Mouth
6.13 (a) Triangulation-Correspondence Tradeoff in dependence on the baseline between cameras. For a small baseline system, we can solve the correspondence problem easily, but have a high uncertainty in the depth structure. For a large baseline system this uncertainty is much reduced, but the correspondence problem is harder. (b) Motion and Shape can be reconstructed directly from feature traces in spatio-temporal and epipolar image volumes.
7.1 Design of a Polydioptric Camera (a) capturing Parallel Rays (b) and simultaneously capturing a Pencil of Rays (c).
7.2 A highly integrated tennis-ball-sized Argus eye.
7.3 Block diagram of an Argus eye.
7.4 Forming a kind of polydioptric eye.
7.5 Plenoptic projection geometry for micro-lenses.
7.6 Plenoptic image processing.
7.7 (a) The inter-ommatidial derivatives L_u and L_v are computed within each ommatidial image. Intra-ommatidial derivatives L_x and L_y are computed across ommatidial images. The constant matrix M_i, which in general will be different for each pixel, encapsulates all geometric relationships of how individual ommatidial images are formed. (b) A curved compound eye design for an ego motion sensor. Gradient index (GRIN) lenses are arranged along a curved surface, forming optical images on a second curved surface; these two surfaces are used to parameterize the light field. The elemental images from each GRIN lens are brought to a planar sensor chip using coherent fiber optic bundles. (c) A two dimensional array of GRIN lenses for the compound eye 3D ego motion sensor.
7.8 The gradient update is computed at each pixel and proportional bits of current are summed into a global wire. One wire is used for each of the six motion parameters. Once the solution for the motion parameters is reached, based on minimizing Eq. (7.11), the gradients (i.e., the total current into each of the six wires) are zero, and thus the steady state voltage solution on the capacitors is reached. The voltage from the six wires is continuously read out from the sensor, representing the instantaneous 6DOF motion measurements. (The capacitors are explicitly drawn in this figure, although in an actual implementation the parasitic capacitance of the wires is sufficient.)

Chapter 1

Introduction

For most of us vision is a very natural process that happens unconsciously and allows us to navigate accurately through the world, detect moving objects and judge their speeds, recognize objects and actions, and interact with the world using visual feedback. Unfortunately, up to now nobody has been able to decipher how humans are able to perform their advanced vision tasks.
Therefore, when we think about vision algorithms, we look beyond the human example and utilize mathematical disciplines such as geometry, calculus, and statistics to find solutions to the tasks mentioned, although many of these methods do not directly map onto any existing biological circuits. Despite the advances in algorithm development, when we think about vision hardware, we still usually think of cameras that capture images similar to the images taken by (two) eyes such as our own - that is, images acquired by camera-type eyes based on the pinhole principle.

Basically all commercial photo and video cameras used in computer vision research, with the exception of a few examples we will study later, are primarily designed to capture images that are as similar as possible to the images that a human eye would capture if placed at the camera's position. This is advantageous because it enables an easy interpretation of the visual information by a human observer. This advantage, though, seems to be of less and less importance nowadays, because more and more image interpretation tasks become automated; humans will interpret less and less of the raw data that is coming from the cameras, but rather the processed output. Therefore, I believe that we should look beyond the pinhole camera principle, and design a camera in dependence on the task we want to perform. Human eyes are not the only types of eyes that exist; the biological world reveals a large variety of eye designs, many of which are not based on the pinhole principle.

Figure 1.1: Michael Land's landscape of eye evolution.

The biological world gives a good example of task specific eye design. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom [37], and these eye designs, and therefore the images they capture, are highly adapted to the tasks the animal has to perform. This evolution of eyes is nicely illustrated in Michael Land's landscape of eye evolution (Fig. 1.1), where every hill and mountain denotes a different independent eye design. This suggests that we should not just focus our efforts on designing algorithms that optimally process a given visual input, but also optimize the design of the imaging sensor with regard to the task at hand, so that the subsequent processing of the visual information is facilitated. The notion that we need to build accurate models of the information processing pipeline to optimize our choice of algorithm in dependence on the statistics of the environment has become more prominent lately ([32], [149]).

Figure 1.2: The information pipeline in computer vision. Usually computer vision starts from images, but this already assumes a choice of camera. To abstract the notion of a specific sensing device, we need to analyze vision problems in the space of light rays.

The exploration of new sensor designs has already begun. Technological advances make it possible to construct integrated imaging devices using electronic and mechanical micro-assembly, micro-optics, and advanced on-chip processing.
These devices are not only of the kind that exists in nature, such as log-polar retinas [24], but also of many other kinds, such as catadioptric cameras [100, 50]. Some initial work has begun on custom mirror design, where a mirror is machined such that the imaging geometry of the mirror-camera combination is optimized to approximate a predefined scene-to-image mapping [59, 141]. Nevertheless, a general framework to relate the design of an imaging sensor to its usefulness for a given task is still missing. As seen in Fig. 1.2, if we phrase our analysis of vision algorithms in terms of images, then we already have chosen a specific camera model to form an image. To abstract from a specific camera geometry, we need to base our analysis on the most general representation for visual information, that is, the space of light rays and its functional representation, the plenoptic function. By analyzing vision problems in the space of light rays, we can optimize over both the sensor and the algorithm to find an optimal solution. In this thesis I develop such a framework by studying the relationship between the subset of light rays captured by a camera and its performance with regard to a task.

In this thesis I will use the term polydioptric camera, introduced in [104], to denote a generalized camera that captures a multi-viewpoint subset of the space of light rays (dioptric: assisting vision by refracting and focusing light). The essential question of polydioptric camera design is how to find the camera design that allows us to perform the tasks of interest as well as possible. For this we need to assess the relative importance of spatial resolution and depth resolution with regard to the problem at hand, and how the best compromise between the two can be found.

This framework is able to simultaneously address the three problems crucial to the design of next generation imaging systems:

- The sampling problem. Which rays of light should be sampled to yield optimal information for a particular function, whether the task at hand is high resolution image reconstruction, imaging from different views, recovery of a scene's 3D structure, or detection of various salient features and targets? How can the appropriate sampling strategies be implemented in the prescribed form and fit? Generally, we are searching for a transform $\Phi$ that captures a set of light rays $L$ and generates signal measurements $I$, that is, $\Phi(L) = I$. $\Phi$ could be a compound transform to account for many optical centers and many different (potentially overlapping) fields of view. In general the transform $\Phi$ has a null space; therefore, direct inversion based on the signal measurements will not be possible (see the sketch after this list).

- The sensing problem. How should the optical signals be sensed so as to yield an optimal SNR, dynamic range, possibly salience-ordered pixel/region readout, programmable spatial resolution, and possible adaptation of detector properties? How do the selected sensor functions contribute to the prescribed system function?

- The processing problem. What kind of scene features can different camera designs extract? Some can only extract features based on texture cues (2D), others in addition also features based on shape cues (3D), as we discussed above. What are the mathematical tools that need to be developed? Or conversely, what sampling and sensing strategy is necessary to achieve efficient processing for a prescribed function? What processing hardware will be necessary to achieve these operations within a prescribed time? How can we analyze how much these features will help us?
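To make the null-space remark concrete, here is a minimal sketch (not from the thesis) in which the transform $\Phi$ is modeled as a linear operator whose rows average disjoint bundles of rays into pixel values; the 1D ray space, the bundle size, and the random radiance values are illustrative assumptions. The rank computation simply confirms that such a $\Phi$ discards information, so the full set of light rays cannot be recovered from the measurements alone.

```python
import numpy as np

# Toy model of the measurement transform Phi(L) = I from the sampling problem:
# a 1D "space of light rays" with n_rays radiance values, and a camera whose
# pixels each average a disjoint bundle of rays (an illustrative assumption).
n_rays, rays_per_pixel = 64, 4
n_pixels = n_rays // rays_per_pixel

# Phi is a linear operator: each row integrates one pixel's bundle of rays.
Phi = np.zeros((n_pixels, n_rays))
for p in range(n_pixels):
    Phi[p, p * rays_per_pixel:(p + 1) * rays_per_pixel] = 1.0 / rays_per_pixel

L = np.random.rand(n_rays)   # unknown radiance of each light ray
I = Phi @ L                  # signal measurements captured by the camera

# Phi maps 64 unknowns to 16 measurements, so it has a 48-dimensional null space:
# any ray pattern in that null space is invisible to the camera and cannot be
# recovered by direct inversion of the measurements.
rank = np.linalg.matrix_rank(Phi)
print(f"rank(Phi) = {rank}, null space dimension = {n_rays - rank}")
```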
These three problems are not orthogonal and must be addressed simultaneously. For example, efficient mathematics will demand a particular sampling of the light rays. Sensors may not be able to deliver sufficient SNR for a given sampling strategy. Therefore, the optimal solution will lie somewhere in the "space of compromises" across the three areas. In this thesis I will focus on the processing and sampling questions because they are the two properties that are much easier for the camera designer to modify than the underlying electronics on the sensor chip itself. Thus, we assume that the properties of the sensor element are fixed and can be described by a simple model.

To find the optimal solution and design a task specific camera, we need to answer the following two questions:

1. How is the relevant visual information that we need to extract to solve our task encoded in the visual data that a camera can capture?

2. What is the camera design and image representation that optimally facilitates the extraction of the relevant information?

To answer the first question, we first have to think about what we mean by visual information. When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own - that is, perspective images acquired by camera-type eyes based on the pinhole principle. These images enable an easy interpretation of the visual information by a human observer. Therefore, most work on sensor design has focused on designing cameras that would result in pictures with higher fidelity (e.g., [66]). Image fidelity has a strong impact on the accuracy with which we can make quantitative measurements of the world, but the qualitative nature of the image we capture (e.g., single versus multiple view point images) also has a major impact on the accuracy of measurements, which cannot be measured by a display-based fidelity measure. Since nowadays most processing of visual information is done by machines, there is no need to confine oneself to the usual perspective images. Instead, as motivated at the start of the introduction, we propose to study how the relevant information is encoded in the geometry of the time-varying space of light rays, which allows us to determine how well we can perform a task given any set of light ray measurements. In Chapter 2 we will examine the structure of the space of light rays and analyze what kind of information can be extracted about a scene. To answer the second question we have to determine how well a given eye can capture the necessary information. We can interpret this as an approximation problem where we need to assess how well the relevant subset of the space of light rays can be reconstructed based on the samples captured by the eye, our knowledge of the transfer function of the optical apparatus, and our choice of function space to represent the image. In Chapter 4 we will model eyes as spatio-temporal sampling patterns in the space of light rays, which allows us to use well developed tools from signal processing and approximation theory to evaluate the suitability of a given eye design for the proposed task and determine the optimal design. The answers to these two questions then allow us to define a fitness function for different camera designs with regard to a given task.

1.1 Why study cameras in the space of light rays?

A conventional camera observes the world only from a single effective view point. It is well known that due to the camera projection the range information about the world gets lost.
That means one can only capture an image of the two-dimensional properties of the object surfaces in the scene. Therefore, any information that is not uniquely determined by the scene texture itself cannot be accurately recovered without assumptions about the world. As an example, if we want to segment the captured image into image regions corresponding to the different objects in the scene, we have to make assumptions beforehand as to which texture belongs to which object.

In contrast, if a camera captures light from many view points, the correlations between the different images can be used to infer information about the three-dimensional structure of the world, for example for shape estimation and for the segmentation of the scene into distinct objects using occlusion events. In this case, if we want to segment the multi-perspective image according to the views of the objects we observed, we can utilize the intrinsic structure of the captured multi-perspective image to detect occlusion events and segment the image according to a combination of two-dimensional texture and three-dimensional depth cues.

If the camera is moving, so that instead of images we capture image sequences, we can use the correlations between different images that observe the same scene to infer more information about the world even with conventional cameras. The problem of structure from motion has been studied in depth [61, 55, 88], but it is still not sufficiently solved to allow for fool-proof algorithms. Thus, if we only rely on texture cues to detect independently moving objects or track objects, it is difficult to distinguish between changes in images due to moving objects or parallax-inducing 3D structures. It is essential that we estimate the motion of the camera, which is in general a highly non-linear problem since it involves the estimation of the 3D structure of the scene as well. The non-linear nature of the problem causes ambiguities in the solution, and due to the complexity of the estimation it has not been possible up to now to perform the motion estimation directly on a camera chip.

In contrast, a multi-perspective camera allows us to compute the rigid motion of a camera solely based on spatio-temporal image derivative measurements by solving a system of linear equations with few (six) unknowns, as we will show in Chapter 5. Efficient solvers for these problems have recently been implemented on cheap graphics hardware chips, thus allowing us to solve for the 3D motion of the camera directly inside the box using the camera hardware. In addition, the low-dimensionality and scene-independent formulation of the changes in the images due to the camera motion allow us to apply simple motion segmentation algorithms to the captured polydioptric images to detect and track moving objects.

These superior abilities of polydioptric imaging with regard to 3D scene structure estimation and segmentation can also be utilized to improve the resolution of the sensor through super-resolution techniques. In nearly all super-resolution techniques it is assumed that the scene consists of a single fronto-parallel plane and that there are no occlusions between different views. In reality this is not necessarily satisfied. A polydioptric camera is able to first compute an estimate of the depth structure of the scene and detect occlusion events, before the super-resolution system is solved with appropriate masking of the input and adaptation of the parameters modeling the imaging process.
1.2 Example Application: Dynamic 3D photography

To evaluate and compare different eye designs in a mathematical framework, I choose the recovery of spatio-temporal scene descriptions from image sequences, that is, structure from motion, as the problem of interest. There are many approaches to structure from motion (e.g., see [54, 88] for an overview), but essentially all these approaches disregard the fact that the way images are acquired already determines to a large degree how difficult it is to solve for the structure and motion of the scene. Since systems have to cope with limited resources, their cameras should be designed to optimize subsequent image processing.

Of most practical importance are the following two specific subcases of the structure from motion problem:

1. Static Structure from Dynamic Cameras: We are given a single camera (or a set of cameras) that moves rigidly in space. Based on the recorded image sequences we would like to estimate the camera motion and the properties of the static objects in the scene (such as shape and textures). If the scene properties are not invariant over time, then we need to use robust statistics to differentiate between the static and non-static parts of the scene.

2. Dynamic Structure from Static Cameras: Given a number of calibrated cameras in a known configuration that capture a sequence of images, find the shape of the objects, their surface properties, and the motion field on the object surface. For some example results see Chapter 6.

1.3 Plenoptic video geometry: How is information encoded in the space of light rays?

In this outline we will illustrate the concept of the polydioptric camera design by examining the plenoptic video geometry for the case of 3D ego-motion estimation. This analysis will be further refined in Chapter 5. At each location $x$ in free space the plenoptic function $L(x;r;t)$, defined on $\mathbb{R}^3 \times S^2 \times \mathbb{R}^+$, measures the radiance, that is, the light intensity or color, from a given direction $r$ at time $t$. Its range is the spectral energy, which equals $\mathbb{R}$ for monochromatic light, $\mathbb{R}^n$ for arbitrary discrete spectra, or could be a function space for a continuous spectrum. $S^2$ is the unit sphere of directions in $\mathbb{R}^3$.

Let us assume that the albedo of every scene point is invariant over time and that we observe a static world under constant illumination. In this case, the radiance of a light ray does not change over time, which implies that the total time derivative of the plenoptic function vanishes: $\frac{d}{dt}L(x;r;t) = 0$.

Figure 1.3: (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar image volume formed by the image sequence where each voxel corresponds to a unique light ray. The top half of the volume has been cut away to show how a row of the image changes when the camera translates. (c) A row of an image taken by a pinhole camera at two time instants (red and green) corresponds to two non-overlapping horizontal line segments in the epipolar plane image, while in (d) the collection of corresponding "rows" of a polydioptric camera at two time instants corresponds to two rectangular regions of the epipolar image that do overlap (yellow region). This overlap enables us to estimate the rigid motion of the camera purely based on the visual information recorded.
If the camera undergoes a rigid motion, then we can describe this motion by an opposite rigid coordinate transformation of the ambient space of light rays in the camera coordinate system. This rigid transformation, parameterized by the rotation matrix R(t) and a translation vector q(t) which maps the time-invariant space of light rays upon itself. Thus we have the following exact equality which we call the discrete plenoptic motion constraint L(R(t)x+q(t);R(t)r;t) = L(x;r;0) (1.1) We see that if a sensor is able to capture a continuous non-degenerate subset of the plenoptic function, then the problem of estimating the rigid motion of this sensor has be- comeanimageregistrationproblemthatisindependentofthescene. Thereforetheonlyfree parameters are the six degrees of freedom of the rigid motion. This global parametriza- tion of the plenoptic motion field by only six parameters leads to a highly constrained estimation problem that can be solved with any multi-dimensional image registration criterion. To illustrate this idea with an example, a camera is translated along the horizon- tal image axis and the images of the sequence are stacked to form an image volume (Figs. 1.3a-1.3b). Due to the horizontal translation scene points always project into the same row in each of the images. Such an image volume is known as an epipolar image volume [14] since corresponding rows lie all in the same epipolar plane. Each pixel in this volume corresponds to a unique light ray. A horizontal slice through the image volume (Figs. 1.3c-1.3d) is called an epipo- lar plane image and contains the light rays lying in this epipolar plane parameterized by view position and direction. A row of an image frame taken by a pinhole camera corresponds to a horizontal line segment in the epipolar plane image (Fig. 1.3c) because 12 we observe the light rays passing through a single point in space (single view point). In contrast, a polydioptric camera captures a rectangular area (multiple view points) of the epipolar image (Fig. 1.3d) because it captures light rays passing through a range of view points. Here we assumed that the viewpoint axis of the polydioptric camera is aligned with the direction of translation used to define the epipolar image volume. If this is not the case, the images can be warped as necessary. We see that a camera rotation around an axis perpendicular to the epipolar image plane corresponds to a horizontal shift of the camera image (change of view direction), while a translation of the camera parallel to an image row, causes a vertical shift (change of view point). These shifts can be different for each pixel depending on the rigid motion of the camera. If we want to recover this rigid transformation based on the images captured using a pinhole camera, we see in (Fig. 1.3c) that we have to match two non-overlapping sets of light rays (shown as a bright green and a dark red line) since each time a pinhole camera captures by definition only the view from a single viewpoint. Therefore, it is necessary for an accurate recovery of the rigid motion that we have a depth estimate of the scene, since the correspondence between pixels in image rows taken from different view points depends on the local depth of the scene. In contrast, we see in (Fig. 1.3d) that for a polydioptric camera the matching can be based purely on the captured image information, since the sets of light rays captured at consecutive times (bright yellow region) overlap. Disregarding sampling issues at the moment, we have a ?true? 
Disregarding sampling issues at the moment, we have a "true" brightness constancy in the region of overlap, because we match a light ray with itself (Eq. (1.1)). This also implies that polydioptric matching is invariant to occlusions and view-dependent visual events such as specularities. We conclude that the correspondence of light rays using a polydioptric camera depends only on the motion of the camera, not on any properties of the scene, thus enabling us to estimate the rigid motion of the camera in a completely scene-independent manner. A more detailed analysis of the structure of the space of light rays can be found in Chapter 2.

1.4 Polydioptric Camera Design

Figure 1.4: Hierarchy of Cameras for 3D Motion Estimation; (a) Static World - Moving Cameras, (b) Moving Object - Static Cameras. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.

It is known that the stability of the structure from motion estimation depends on the collective field of view of all the "sub"-sensors making up the polydioptric camera. This relationship has been studied for example in [3, 33, 35, 70, 96, 47]. Some of these results can be found in Chapter 5. Combining this result with the plenoptic motion equations (1.1) in [104], we can define a coordinate system on the space of camera designs (as shown in Fig. 1.4). The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). This in turn determines if structure from motion estimation is a well-posed or an ill-posed problem, and if the estimation is scene dependent or independent (thus implying that the motion parameters are related linearly or non-linearly to the image measurements if we have differential motion, as shown in Section 1.3).

One can see in the figure that the conventional pinhole camera is at the bottom of the hierarchy because the small field of view makes the motion estimation ill-posed and it is necessary to estimate depth and motion simultaneously. Although the estimation of structure and motion for a single-viewpoint spherical camera is stable and robust, it is still scene-dependent, and the algorithms which give the most accurate results are search techniques, and thus rather elaborate. One can conclude that a spherical polydioptric camera is the camera of choice to solve the structure from motion problem since it combines the stability of full field of view motion estimation with the linearity and scene independence of the polydioptric motion estimation. Such a camera would enable us to utilize new scene-independent constraints between the structure of the plenoptic function and the parameters describing the rigid motion of a polydioptric imaging sensor. Using the tools of polydioptric sampling theory described in Chapter 4 enables us to extend this qualitative analysis to a quantitative analysis.
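As a concrete illustration of the scene-independent registration view behind this hierarchy, the following is a minimal numerical sketch of the discrete plenoptic motion constraint (1.1); it is not code from the thesis. The textured plane standing in for the world, the sampled ray bundle, and all numerical values are illustrative assumptions. The point is that the residual is a function of the six rigid-motion parameters only, and it vanishes for the transformation that undoes the camera motion (the "opposite" coordinate transformation of the space of light rays).

```python
import numpy as np

# Static world light field L0(x, r): radiance of the ray through point x with
# unit direction r. The "scene" is a textured plane z = 5 (illustrative only);
# the registration residual below never uses this depth explicitly.
def L0(x, r):
    t = (5.0 - x[2]) / r[2]                      # ray / plane intersection
    p = x + t * r
    return np.sin(3.0 * p[0]) + np.cos(2.0 * p[1])

def rot_y(a):                                     # rotation about the y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Rigid camera motion between time 0 and time t (ground truth to be recovered).
R_cam, q_cam = rot_y(0.03), np.array([0.05, 0.0, 0.0])

# Light field measured by the moved camera, expressed in its own coordinates.
def Lt(x, r):
    return L0(R_cam @ x + q_cam, R_cam @ r)

# Rays sampled by a toy polydioptric camera: a patch of viewpoints, each with a
# bundle of viewing directions (multiple viewpoints, small field of view).
rng = np.random.default_rng(0)
rays = []
for _ in range(500):
    x = np.array([rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), 0.0])
    d = np.array([rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 1.0])
    rays.append((x, d / np.linalg.norm(d)))

def residual(R, q):
    """Sum of squared violations of Eq. (1.1): Lt(R x + q; R r) = L0(x; r).
    Only the six rigid-motion parameters enter; no scene structure is needed."""
    return sum((Lt(R @ x + q, R @ r) - L0(x, r)) ** 2 for x, r in rays)

# Near zero for the transformation undoing the camera motion, clearly nonzero
# for an incorrect motion hypothesis.
print(residual(R_cam.T, -R_cam.T @ q_cam))                   # ~0
print(residual(rot_y(-0.03), np.array([0.0, 0.05, 0.0])))    # > 0
```

In practice the light field is only sampled, so the approximation analysis of Chapter 4 governs how well such a residual can be evaluated from real polydioptric images.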
A polydioptric camera can be implemented in many ways. The simplest design is an array of ordinary cameras placed very close to each other (see Fig. 1.5 or [158]), or one could use specialized optics or lens systems such as described in [2, 42, 99]. In Section 7.1, some examples of these designs are presented.

Figure 1.5: (a) Design of a Polydioptric Camera (b) capturing Parallel Rays and (c) simultaneously capturing a Pencil of Rays.

Whatever design one uses, it is not possible to capture the plenoptic function with arbitrary precision. If we want to use the plenoptic motion constraints, we need to reconstruct a continuous light field from discrete samples. In this thesis I study the implementation of a polydioptric camera using a regular array of densely spaced pinhole cameras. This problem has been studied in the context of light field rendering [27, 162], where the authors examined which rays of a densely captured light field need to be retained to reconstruct the continuous light field. In Chapter 4 we will examine how we can estimate the approximation error between the time-varying light field and a reconstruction based on the images captured. Specifically, we determine how we can compute an estimate of the approximation error given a description of the camera and the statistics of the scene.

1.5 Overview of the thesis

Inspired by how Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclops, the mythical monocular giants, and by the impressive navigational feats of insects with compound eyes, I will show in this thesis how one can analyze vision problems in the space of light rays, and thereby find solutions that are optimal with respect to the joint design of vision sensors and algorithms. Based on the geometrical and statistical properties of these spaces, I introduce a framework to systematically study the relationship between the shape of an imaging sensor and the task performance of the entity using this sensor. I illustrate this concept by analyzing the structure of the time-varying plenoptic function and identifying the important camera design parameters for the case of 3D motion estimation. This analysis is then used to form a hierarchy of camera designs with regard to the task under study. The hierarchy implies that large field of view polydioptric cameras are the optimal cameras for 3D ego motion estimation because they allow for stable and scene-independent motion estimation algorithms. Polydioptric cameras are generalized cameras which capture a multi-perspective subset of the plenoptic function. Using a combination of multi-dimensional sampling analysis and statistical image modeling, I then improve upon the qualitative hierarchy of camera designs by defining a metric on the space of cameras. This metric is based on how well a given polydioptric camera is able to capture the plenoptic function, and on how the approximation errors propagate through the motion estimation to the final parameter estimate. This allows us to determine the best camera arrangement for a task.

Although camera motion estimation is important for tasks in navigation, it is often only part of the information that we want to extract about the world. Thus, at the end of this thesis I also examine the problem of estimating the shape and 3D motion of non-rigidly moving objects in a scene using a distributed camera network.
Chapter 2

Plenoptic Video Geometry: The Structure of the Space of Light Rays

2.1 Preliminaries

All the visual information that can be captured in a volume of space by an imaging sensor is described by the intensity function defined on the space of light rays surrounding us. This function is known as the plenoptic function (from plenus, full, and optic, view) [1]. For each position in space it records the intensity of a light ray for every direction, time, wavelength, and polarization, thus providing a complete description of all uninterpreted visual information. The idea of a function that contains all images of an object was already described by Leonardo Da Vinci in his notebooks. Similar functions were used later by Mehmke in 1898 and by Gershun in 1936 to describe the reflection of light on objects. Gershun was the first one to use the term light field for the vector irradiance field [49]. One can find more about the historical study of the space of light rays in the book by Moon and Spencer about the scalar irradiance field [98].

The time-varying shapes of the objects in a scene, their surface reflectance properties, the illumination of the scene, and the transmittance properties of the ambient space determine the structure and properties of the visual space. The objects cannot transmit their shape and surface properties directly to an observer. Without actively interacting with the object, the observer can only record the intensities of a set of light rays that reflect or emit from the object surface and infer the properties of the objects in the scene based on these "images". This is possible, for example, by utilizing the geometric ray model of light transport, which relates the geometric and reflection properties of the object surface and the illumination to the observed image intensity values. An image can be defined as a collection of rays with a certain intensity, where each captured ray has a position, orientation, time, wavelength and domain of integration (scale). Such a ray element was called a raxel by Grossberg and Nayar [52]. In computer graphics the scene parameters are known and the goal is to generate the view of a scene from a given view point by determining the ray properties for all the rays that make up an image (e.g., using standard computer graphics techniques such as ray tracing and global illumination). The quality that can be achieved is often nearly indistinguishable from actual views, as evidenced by the ubiquitous use of computer generated imagery in today's cinema.

In comparison, computer vision attempts to extract a description of the world based on images; thus one could say that computer graphics and computer vision are "inverse problems" of each other. We are given the images of a scene (in general a scene in the real world) and, based on the information in the images, we would like to find the set of variables that describe the scene. Since it is impossible to find an infinite number of variables, we will use models to approximate the true scene parameters and to recreate the scene. The model parameters can be found based on the images captured, given some a priori assumptions about the scene.
To solve these tasks we have to analyze how the scene and object properties manifest themselves in the global and local structure of the plenoptic function. The structure of the visual space is generated by the objects and their proper- ties in the scene. We have a distance function D(x;t) : R3 ? R+ ? R defined on the four-dimensional spatio-temporal volume (three space dimensions and one time di- mension) that describes the space that is occupied by the objects in the scene. The surface of the objects is defined by the zero level set of the distance function, that is S := {(x;t)|D(x;t) = 0}. The geometric properties of this function describe the shape of the objects and their temporal evolution. We will use the term object surface to denote this zero level set S. The change of the distance function over timedD/dtcaptures only deformations of the shape along its normal which is given by ?xD(x;t) = ? ?? ?? ?? ? ?D(x;t)/?x1 ?D(x;t)/?x2 ?D(x;t)/?x3 ? ?? ?? ?? ? where x = ? ?? ?? ?? ? x1 x2 x3 ? ?? ?? ?? ? . (2.1) We can also define a vector field M(x;t) : S ? R4 on the object surface that de- scribesthetrajectoriesofuniquepointsontheobjectsurface, the3Dmotionfloworscene flow[152]. Usuallywerestrictthetimecomponentofthisvectorfieldtobeofunitmagni- tude (or whichever time step is small enough to allow an accurate first-order description of the deformations of shape and surface properties). This vector field is restricted to transport points only on the surface S, therefore we have the constraint [MT?D] = 0 20 where ?D = [?xD;D/?t]. On the surface S we have the reflection properties of the objects given by the sur- face light field LS : S ?S2 ?R+ ? ?. ? denotes here the spectral energy, and equals R for monochromatic light, Rn for arbitrary discrete spectra, or could be a function space for a continuous spectrum. S2 is the unit sphere of directions in R3. Iftheobjectisnotemittinganylight,thenasurfacelightfieldscanoftenbefactored into a number of physically motivated components such as an illumination component and a component describing the surface reflection. A popular representation in graphics expresses the radiance LS(x;r;t) leaving the surface in direction ?r in terms of the sur- face irradiance IS(x;m;t) = L(x;m;t)(nTm) measured from direction m through the rendering equation [72] LS(x;r;t) = integraldisplay H(n) B(x;r;m;t) IS(x;m;t)dm (2.2) where B(x;r;m;t) is the bidirectional reflection distribution function(BRDF) [108] defined onthesurfaceS. H(n)isthehemisphereofdirectionsH(n) = {m : bardblmbardbl = 1 and mTn> 0}. AnothermorecomplexexampleistheBidirectionalSurfaceScatteringReflectanceDis- tribution Function [108], inshortBSSRDF.TheBRDFonlydescribestheinteractionoflight that enters and exits the surface at the same point, therefore it does not describe the scat- tering of light inside the surface as seen in marble, milky fluids or human skin. The BSSRDF is thus parameterized by two location vectors, one for the entrance and one for the exit location of the light rays. For some nice examples that were rendered using this model see [69]. Even such a general function is not able to adequately describe all the surface reflection phenomena, since we left out the effect of polarization for example. In computer vision we usually only use simple reflection models, because the noise in the 21 image acquisition process makes it difficult to find the parameters of the more complex models under non-laboratory conditions. 
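To make the structure of the rendering integral concrete, the following minimal sketch numerically approximates Eq. (2.2) for a single surface point by Monte Carlo sampling of the hemisphere H(n). The incident radiance field `incident_radiance`, the helper `sample_hemisphere`, and the Lambertian BRDF used here are illustrative stand-ins chosen for this sketch, not part of the thesis; only the structure L_S = ∫ B · I_S dm with I_S = L (n·m) follows the text.

```python
import numpy as np

def sample_hemisphere(n, rng, count):
    """Uniformly sample unit directions m with m . n > 0."""
    v = rng.normal(size=(count, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    v[(v @ n) < 0] *= -1.0
    return v

def lambertian_brdf(albedo):
    # Constant BRDF albedo / pi, independent of in- and outgoing directions.
    return lambda r, m: albedo / np.pi

def incident_radiance(m):
    # Toy illumination: a broad "sky" lobe around the +z direction.
    return max(m[2], 0.0) ** 2

def outgoing_radiance(n, r, brdf, num_samples=20000, seed=0):
    """Monte Carlo estimate of the rendering integral
       L_S(x, r) = int_{H(n)} B(x, r, m) L(x, m) (n . m) dm,
       using uniform hemisphere sampling (pdf = 1 / (2 pi))."""
    rng = np.random.default_rng(seed)
    ms = sample_hemisphere(n, rng, num_samples)
    vals = [brdf(r, m) * incident_radiance(m) * float(n @ m) for m in ms]
    return 2.0 * np.pi * np.mean(vals)

if __name__ == "__main__":
    n = np.array([0.0, 0.0, 1.0])                      # surface normal
    r = np.array([0.0, 0.5, 0.5]) / np.sqrt(0.5)       # outgoing direction
    print("outgoing radiance:", outgoing_radiance(n, r, lambertian_brdf(0.7)))
```

For the Lambertian case the estimate converges to the albedo over pi times the cosine-weighted irradiance; swapping in a view-dependent BRDF changes only the `brdf` callable.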
In computer vision, we are more interested in a representation that can describe the intrinsic complexity of a reflection model. A nice measuretoassessthecomplexityofasurfacelightfieldisdescribedinJinetal.[71]where theydescribetheconceptoftheradiancetensor,thatisthematrixthatisconstructedfrom the observed intensities of a number of surface points in a local neighborhood on the ob- ject surface that are observed from a number of viewing positions. Each column contains the different intensity values of the scene points as seen from a single view point, while the rows of the matrix contain the different views of a single scene point. If the local surface patch is small compared to the distance to the observing cameras and to the light sources in the scene, then this radiance tensor will have a rank of two or less. This is easy to show. Due to the small distance between the points in comparison to the distance to the light sources and cameras, we can assume that the surface irradiance is the same for each point in the patch, that the vector from each scene point to a single camera center is also nearly constant, and that the object surface is locally planar so that the normals and tangent directories are also constant across the points. Then we can factor the surface light field observed at each point as LS(x,r) = Nsummationdisplay i=1 ?i(x,r0)Bi(x0,r) (2.3) whereN is usually smaller then 4. This decomposition was also used before by Chen et al. in the context of surface light field compression [28]. The simplest example of this factorization is the case of Lambertian reflectance of the surface with albedo ?(x;t) which describes a perfectly diffuse reflector. In this case, 22 we can conveniently express LS(x;r;t) for all directions r ?H(?n) by L(x;r;t) = ?(x;t)[n(x;t)Ts(x;t)] (2.4) wheres(x;t) =integraltextH(n)L(m;x;t)(nTm)dmisthenetdirectionalirradianceonthesurface point, given by the surface integral overH(n), (for more details see [62]). If we factor the popular diffuse plus specular model we will get: LS(x;r) = integraldisplay H(n) B(x;r;m) IS(x;m)dm = integraldisplay H(n) (?d(x) +?s(x)B(x0;r;m)) IS(x0;m)dm = ?d(x) integraldisplay H(n) IS(x0;m)dm +?s(x) integraldisplay H(n) B(x0;r;m) IS(x0;m)dm = ?d(x)Bd(x0) +?s(x)Bs(r;x0) (2.5) This factorization indicates that under the correct assumptions the radiance tensor has only rank two. This rank constraint can then be used as a plausibility criterion to decide if the scene parameters are estimated correctly. A realistic model for the reflection properties that can be split in a diffuse and a specular component is given by Ward?s anisotropic (elliptical) Gaussian model [156]. At every point on the surface we can locally define at each location x an orthonormal co- ordinate system t1,t2,n and define a half-way vector h between the incoming m and outgoing rays r: h = (m+r)/bardblm+rbardbl. Then we can write the BRDF for this model as: B(x;r;m) = ?d(x) +?s(x)Bs(x;r;m) (2.6) = ?d(x)pi +?s exp bracketleftBiggparenleftbigg 1? 1(hTn)2 parenrightbiggparenleftBiggparenleftbigg hTt1 ?1 parenrightbigg2 + parenleftbigg hTt2 ?2 parenrightbigg2parenrightBiggbracketrightBigg 4pi?1?2radicalbig(mTn)(rTn) 23 where?d is the diffuse reflection coefficient, ?s is the specular reflectance coefficient and ?1 and ?2 are the standard deviations of the microscopic surface slope (as a measure of surfaceroughness)inthetangentdirectionst1 andt2. These4parameterstogetherdefine a physically valid reflection model. 
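Stepping back to the radiance tensor introduced above, its rank-two property is easy to check numerically. The minimal sketch below builds a synthetic tensor for a small planar patch whose reflectance is a diffuse term plus a single view-dependent specular factor shared by all points, i.e. exactly the factorization of Eq. (2.5) under the distant-light, distant-camera assumption. The coefficients `rho_d`, `rho_s`, `B_d`, and the specular lobe `B_s` are invented test values.

```python
import numpy as np

rng = np.random.default_rng(1)
num_views, num_points = 12, 30

# Per-point material coefficients (columns of the radiance tensor).
rho_d = rng.uniform(0.2, 0.9, size=num_points)   # diffuse albedo
rho_s = rng.uniform(0.0, 0.5, size=num_points)   # specular coefficient

# Under the small-patch assumption the diffuse shading B_d is the same for
# every point, and the specular factor B_s depends only on the view.
B_d = 0.8
view_angles = np.linspace(-0.6, 0.6, num_views)
B_s = np.exp(-view_angles**2 / 0.1)              # toy specular lobe per view

# Radiance tensor: rows = views, columns = surface points (cf. Eq. 2.5).
R = np.outer(np.ones(num_views) * B_d, rho_d) + np.outer(B_s, rho_s)

singular_values = np.linalg.svd(R, compute_uv=False)
print("singular values:", np.round(singular_values, 6))
# Only the first two singular values are numerically non-zero, i.e. the
# tensor has rank two, as predicted by the diffuse-plus-specular factoring.
```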
2.2 Representations for the space of light rays 2.2.1 Plenoptic Parametrization We define the light field as the extension of the surface light field, which is only defined on the object surfaces S, to the free space of R3, that is {x|D(x,t) > 0}. We will assume that the ambient space is a transparent medium, such as air for example, which does not change the color or intensity of light along its path. Thus, the light field along the view direction r is constant and the following equations hold: L(x;r;t) = L(x+?r;r;t) ??s.t. D(x+?r) > 0 ??? [0,?] (2.7) which implies that ?xLTr = ?rLTr = 0 ?x ? R3,f(x;t) > 0 (2.8) Here ?xL and ?rL are the partial derivatives of L with respect to x and r: ?x = (?/?x1,?/?x2,?/?x3)T , and ?r = (?/?r1,?/?r2,?/?r3)T. To initialize this ?differential equation? we define the following equality on the object surface L(x;r;t) = LS(x;r;t) where (x;t) ? S. (2.9) 24 SinceLis constant along r, the set of functionsLS defined onS generate the struc- ture of the visual space, that is the geometry of the plenoptic function, completely. Due to the brightness invariance along the ray direction the plenoptic function in free space reduces locally to five dimensions ? the time-varying space of directed lines for which many representations have been presented (for a comparison of different rep- resentations see Camahort and Fussel [22] or the book by Pottmann and Wallner [118] about line geometry). If we deal with smog, fog, transmitting or partially transparent objects, then this invariance will not hold, but we will not consider these cases in this work. The dependence of the structure of the plenoptic space on the shape of the scene objects can be analyzed using a directional distance function that describes the shortest distance between a point location and a scene surface in a given direction: Z(x;r;t) = min ? D(x+?r;t) = 0 (2.10) Using this function, we can rewrite the equality Eq. (2.9) as L(x,r,t) = LS(x+Z(x,r,t)r,r,t) (2.11) The space of light rays can now be parameterized in a number of different ways which offer different advantages with regard to completeness, redundancy and ease to describe geometric transformations. The question of representation is very important because the space of light rays has a specific structure since it is parameterized by the geometric and signal properties of the objects in the scene which introduce redundancy in the representation. In addition since all physical measurements of light are done with a finite aperture, we always need to account for the specific scale of the measurement in our later computations. This introduces a continuity on the space of measurements 25 which also needs to be reflected in the space and metric that we will use for computa- tions in the space of light rays. We call the parametrization that we used in this section the plenoptic parametrization because it is possible to assign a brightness value for each location, direction and time, thus capturing the full, global structure of the light rays. 2.2.2 Pl?ucker coordinates Sometimes we are interested in a representation for the space of light rays that does not exhibit the redundancy intrinsic to the plenoptic parametrization. As said in the last section, since light rays do not change their color in a transparent medium like air, we can represent light rays in the space of directed lines in three-dimensional space without losing any information. Linear subspaces are best treated in the context of Grassmann coordinates [118]. 
The Grassmann coordinates for lines in three-dimensional space are known as Pl?ucker coordinates and are a very useful to describe lines and their motions in space. The Pl?ucker coordinates of the linel := {x +?r,? ? R} are given by the tuple (r,m) ? R6, containing the direction vector r and the moment vector m defined by m = x?r. We see that the Pl?ucker moment is the normal to the plane that contains the line and the origin, and its magnitude is equal to the distance of the line to the origin. If wenormalizertounitlength,bardblrbardbl = 1,thenwehavearepresentationforanorientedline. This representation was described bye Study [17] to describe the space of oriented lines. Since the norm of all the line elements in the Study representation is of unit length, one also says the that all the coordinates representing oriented lines lie on the Study sphere. This parametrization is equivalent in terms of its sampling properties in line space tothesphere-planerepresentationdescribedin[23]. Herethelinelisdefinedbyitsdirec- tionr andtheintersectionoflwiththeplaneperpendiculartor throughtheorigin. Since 26 r ?m = 0 by definition, m lies in this plane, although not on the line. The intersection point x of the linelwith the plane x?r = 0 is given by x = m?r. Comparing the plenoptic and the Pl?ucker parameterizations, we see that, locally, for each plane through the origin of the coordinate system perpendicular to a fixed di- rection r, the plane xTr = 0, they differ only by a rotation of 90 degrees around r since m = r?x. Thus we can define an intensity function on the space of Pl?ucker lines using the identity: L(m,r,t) = L((m?r,r,t) L can be used to describe the plenoptic function globally whileLcan be used to describe the plenoptic function locally, since it chooses a single plane to describe rays and thus cannot account for occlusions or the changes in the differential structure of the plenoptic functionduetoamotionoftheobserver. Forexamplewhenanobservermovesalongthe main view direction then the image expands/contracts radially. Such a forward motion does not change the brightness of the light ray along the axis of forward motion since we have the brightness invariance along the ray, but it changes the magnitude of the derivatives ?rL and ?rLsince they depend on the distance to the scene. In this case we have ?rL(x,r,t) = Z(x,r,t)?xL(x,r,t) and ?rL(m,r,t) = Z(m,r,t)?mL(m,r,t). The main advantage of the Pl?ucker parameterization is that we can express the motion of lines in space very elegantly in terms of a matrix multiplication. If a line is un- dergoing a rigid motion described by the rotation matrixR and the translation t around the origin of the coordinate system, then this can be expressed by multiplying the line coordinate vector l = (r,m) with a 6?6 motion matrixQ: ? ?? ? rprime mprime ? ?? ?= ? ?? ? R 0 ?[t]xR R ? ?? ? ? ?? ? r m ? ?? ?= Q ? ?? ? r m ? ?? ?. (2.12) 27 This motion equation can be derived easily by moving two points on the line rigidly in spaceandthenrecomputingthePl?uckerlinecoordinateswithrespecttothenewposition of the two points. 2.2.3 Light field parameterization One difficulty with the previous two parameterization is that although we can define a metric between lines, it is difficult to express this distance directly in terms of distances between the line coordinates since they do not lie in a Euclidean space, but on a non- linear manifold. 
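Before moving on, the Plücker motion equation (2.12) above is easy to verify numerically: rigidly move two points on a line, recompute (r, m), and compare with the result of the 6x6 matrix. The sketch below uses the moment convention m = x x r, for which the lower-left block comes out as +[t]_x R; with the opposite convention m = r x x the sign of that block flips, so the block structure should be read as convention-dependent. The specific line and motion are arbitrary test values.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0.0]])

def pluecker(x, r):
    """Pluecker coordinates (r, m) of the line through x with direction r,
    using the moment convention m = x x r."""
    r = r / np.linalg.norm(r)
    return r, np.cross(x, r)

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.0]])

# A line through x0 with direction r0, and a rigid motion (R, t).
x0 = np.array([1.0, -2.0, 0.5])
r0 = np.array([0.3, 0.4, 0.866])
R, t = rotation_z(0.7), np.array([0.5, 1.5, -1.0])

r, m = pluecker(x0, r0)

# 6x6 line-motion matrix (cf. Eq. 2.12); with m = x x r the lower-left
# block is +[t]_x R.
Q = np.block([[R, np.zeros((3, 3))],
              [skew(t) @ R, R]])
r_new, m_new = np.split(Q @ np.concatenate([r, m]), 2)

# Ground truth: move two points on the line and recompute (r, m).
p1, p2 = x0, x0 + 2.0 * r
r_ref, m_ref = pluecker(R @ p1 + t, (R @ p2 + t) - (R @ p1 + t))

print(np.allclose(r_new, r_ref), np.allclose(m_new, m_ref))  # True True
```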
The light field parameterization is a convenient representation for the space of light rays, because it is closed under affine combinations of coordinate vectors and we can approximate the geodesic distances between points on the non-linear mani- fold of lines by the Euclidean distance between the coordinate vectors that represent the lines in the light field parameterization. As described by Peternell and Pottman [114], we can divide the space of lines into caps of directions, that are sets of directions close to a central axis ci that defines each cap i. We can then choose two parallel planes Z+i and Z?i perpendicular to the central axisci of each cap and represent all lines not perpendic- ular ci by their intersection with the two planes Z+i and Z?i . This defines now an affine space where the distance between two lines in 3D can be approximated by the Euclidean pointdistancebetweentheaffinecoordinatesoftheintersectionsofthelineswiththetwo planes. These new coordinates are equivalent to the coordinates of the line directions af- ter being stereographically projected onto a plane. The different caps are glued together using interpolation weights so that we end up with a continuous representations. The extent of the caps should be chosen such that we do minimize the error in the directional distances between the rays. This two-plane parameterization was used by [51, 80] to rep- 28 resent the space of light rays. All the lines passing through some space of interest can be parameterizedbysurroundingthisspace(thatcouldcontaineitheracameraoranobject) with two nested cubes and then recording the intersection of the light rays entering the camera or leaving the object with the planar faces of the two cubes. We only describe the parameterization of the rays passing through one pair of faces, the extension to the other pairs of the cube is straight forward. Without loss of generality we choose both planes to be perpendicular to the z-axis and separated by a distance off. We denote one plane as focal plane ?f indexed by coordinates (x,y) and the other plane as image plane ?i indexed by (u,v), where (u,v) is defined in a local coordinate system with respect to (x,y) (see Fig. 2.1a). Both (x,y) and (u,v) are aligned with the (X,Y)-axes of the world coordinates and ?f is at a distance ofZ? from the origin of the world coordinate system. (0,0) ?f: Focal PlaneCamera view points (x,y) x y(u0,v0,t) ?i: Image PlaneImage Sequences (u,v,t) L0(u,v,t) (u,v,t) L(x,yu,v,t) ?y ?x (a) (b) Figure 2.1: (a) Parameterization of light rays passing through a volume in space by recording the intersection of the light rays with the faces of a surrounding cube. (b) Lightfield Parameterization. This enables us to parameterize the light rays that pass through both planes at any 29 timetusing the tuples (x,y,u,v,t) and we can record their intensity in the time-varying light fieldL(x,y,u,v,t). Since the properties of the generating functions determine the intensity (or color) distribution in the space of light rays as described by the plenoptic function, we can construct a catalog of the relations between these properties and the local differential structure of the plenoptic function. This catalog then relates all the observable visual events such as colors, textures, occlusions, and glares to the shape, surface reflectance andilluminationofthescene. Inthissectionwewillstudytheplenopticfunctiondirectly without any reference to an imaging system. The properties of imaging systems will be studied later in chapter 3. 
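Before turning to the information content of plenoptic subspaces, here is a minimal sketch of the two-plane mapping just introduced: a world ray, given by a point and a direction, is converted to the tuple (x, y, u, v) by intersecting it with a focal plane perpendicular to the z-axis and an image plane a distance f further along z, with (u, v) expressed relative to (x, y) as in Fig. 2.1. The helper name `ray_to_lightfield`, the plane depth `z_focal`, and the example ray are assumptions made for this illustration.

```python
import numpy as np

def ray_to_lightfield(origin, direction, z_focal, f):
    """Intersect a ray with the focal plane (z = z_focal) and the image
    plane (z = z_focal + f); return (x, y, u, v) with (u, v) measured
    relative to (x, y), as in the two-plane parameterization."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    if np.isclose(d[2], 0.0):
        raise ValueError("ray parallel to the parameterization planes")
    x, y = (o + (z_focal - o[2]) / d[2] * d)[:2]       # focal-plane hit
    px, py = (o + (z_focal + f - o[2]) / d[2] * d)[:2]  # image-plane hit
    return x, y, px - x, py - y

if __name__ == "__main__":
    # Example: a ray leaving a scene point at depth 5, heading towards -z.
    coords = ray_to_lightfield(origin=[0.4, -0.2, 5.0],
                               direction=[-0.1, 0.05, -1.0],
                               z_focal=0.0, f=1.0)
    print("(x, y, u, v) =", np.round(coords, 4))
```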
2.3 Information content of plenoptic subspaces Recently, computer graphics and computer vision took interest in non-perspective sub- sets of the plenoptic function to represent visual information to be used for image-based rendering. Some examples are light fields [80] and lumigraphs [51], multiple centers of projection images [120] which have been used in cell animation already for quite some time [159], or multi-perspective panoramas [113]. For an overview over the use of multi- perspective images for image-based rendering see Zhang and Chen [163]. Non-perspective images have also been used by several researchers to reconstruct the observed scene from video sequences (for some examples see [14, 132, 26]). The de- scriptions of non-perspective images have been formalized lately by the work of Swami- nathan et al [141] and Hicks [59, 58]. For more information you can also see the page of catadioptricsensordesignhttp://www.mcs.drexel.edu/?ahicks/design/design. html. 30 In their seminal paper [1] Adelson and Bergen demonstrate with examples from early vision, how the importance of different subspaces of the space of light rays is re- flected in the resolution by which a subspace is sampled and the amount of processing that is applied to them. A similar criterion should be applied to design of artificial sen- sors. Adelson and Bergen defined a set of ?periodic tables for early vision? where they relate first and second order derivative measurements in different two-dimensional sub- spaces to visual events in the world. In this section we will extend their results by exam- ining how the correlation between different dimensions can be utilized to improve the imaging process. Ifwecomputethegradientofthespatio-temporallightfield?L,wecananalyzethe eigenvalue structure of the tensor that is formed by the outer product of the gradients to extract information about the scene (see [56] for the analysis for perspective images). 2.3.1 One-dimensional Subspaces If we study one-dimensional subspaces, we can measure the change in static radiance for linearly varying orientations and constant view points, or linearly varying view points and a fixed view direction. If we fix both view direction and view orientation and only the temporal dimension is varying, then we can measure the change in irradiance of a light ray over time. It is to note that the time domain is different from the spatial domain, because it only allows for causal filtering, since we cannot make measurements in the future. Linear subspaces where two or more dimensions are simultaneously varying allow us to fixate on a specific point in the world and measure a slice of the surface light field at that point. 31 2.3.2 Two-dimensional Subspaces In two dimension things become interesting because we can observe correlated behavior between different domains which allows us to extract more complex information about the world. Texture Information The subsets L(x,y) = L(x,y,?,?,?) or L(u,v) = L(?,?,u,v,?) correspond to the standard perspective or orthographic images which enable us to measure the colors and texture of all the visible surfaces of the objects in the scene. The complexity of the textures can be classified by looking at the eigenvalues?1 and?2 of ?L?LT. If both the eigenvalues are zero then the scene consists of a homogeneous region. 
If only one eigenvalue is non- zero, then we have a linear structure such as a brightness edge in the texture, and if both eigenvaluesarelarge,thenthesurfacetextureistoocomplextobedescribedbyfirstorder derivative operators. An example would be a texture that is isotropically changing, or onethatconsistsofacornerorapointfeature. Eachpointintheworldisonlyimagedbya singlelightray,thusforagivennumberofimagingelementswegetthemostinformation about the surface textures in the world, but a single static image does not give us any information about the shape of the objects in the world. Shape and Segmentation Information Any set of light rays that lies on a doubly ruled surface allows us to infer spatial informa- tion about the scene since the light rays intersect in space [112, 129]. Especially the case wheretheorientationvectorandthepositionvectorsvaryinthesameplaneiseasytoan- alyze. These subsets of light rays are known as epi-polar plane images (EPIs) [15]. Exam- 32 ples areL(x,u) = L(x,?,u,?,?) orL(y,v) = L(?,y,?,v,?) and will be analyzed in more de- tail in Section 2.5. Since the points in the world are imaged by multiple imaging elements we can relate the spatial arrangement of the imaging elements to the correlation between visual measurements at different elements to extract spatial information about the scene inview. Sinceweusethecorrelationoftexturalelementstoinferspatialstructurewecan- not always extract spatial information, for example if we compute the structure tensor in an EPI and we have only vanishing eigenvalues, then this corresponds to a texture-less region of the scene where we cannot extract any shape information. Each line in an EPI corresponds to the light passing through a single point in space. If this point lies on the scene surface, then the changes in color and intensity are due to reflection properties at this scene point and should vary slowly (unless it is specular). If the line corresponds to a scene point that does not lie on the surface of an object then it corresponds to a virtual pinholecameraimageofthesceneandtheintensitiescanvaryarbitrarily. Ifthereflection properties of the scene texture are dominated by the Lambertian component, that means that the light intensity reflected is independent of the angle at which we look at the sur- face, then if we fixate on a scene point by varying the rate of change of view point and view direction appropriately, the observed radiance of the scene point will not change and thus form a line in the EPI of uniform color. Thus for most scenes with non-specular reflection properties, we can identify the lines corresponding to scene points by finding the lines where the intensity varies only very little. The slope of this line structure encodes the depth of the scene because the depth is proportional to the ratio of the necessary changes in view direction and view position to fixate on a scene point. This equivalent to triangulating the scene points from a con- tinuum of view positions where correspondence is established by following the tracks of 33 uniform color formed by the projection of a scene point using different centers of pro- jection. We can use the structure tensor to estimate the depth accurately. If we have one vanishing eigenvalue indicating an edge in the EPI, then the depth of the scene can be computed from the direction of the eigenvector corresponding to the non-vanishing eigenvalue. 
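The depth-from-slope argument can be made concrete with a synthetic epipolar plane image. In the minimal sketch below, a Lambertian fronto-parallel plane at depth z0 is rendered into an EPI by shifting a 1D texture proportionally to z0/f across view positions; the averaged structure tensor of the EPI gradients then yields the depth from the eigenvector of the non-vanishing eigenvalue. The texture and the numerical values of f and z0 are arbitrary test choices.

```python
import numpy as np

f, z0 = 1.0, 3.0                      # focal length and true scene depth
xs = np.linspace(-2.0, 2.0, 400)      # view positions (camera centers)
us = np.linspace(-0.5, 0.5, 200)      # view directions (image coordinates)

def texture(s):
    # A smooth 1D Lambertian surface texture (arbitrary choice).
    return (np.sin(7.0 * s) + 0.5 * np.sin(17.0 * s + 1.0)
            + 0.25 * np.sin(41.0 * s + 2.0))

# EPI of a fronto-parallel Lambertian plane: pixel u seen from view
# position x samples the surface texture at s = x + (z0 / f) * u.
X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (z0 / f) * U)

# Averaged structure tensor of the EPI gradients.
Lx = np.gradient(L, xs, axis=0)
Lu = np.gradient(L, us, axis=1)
J = np.array([[np.mean(Lx * Lx), np.mean(Lx * Lu)],
              [np.mean(Lx * Lu), np.mean(Lu * Lu)]])
eigvals, eigvecs = np.linalg.eigh(J)

# For this EPI the gradient is proportional to (1, z0 / f), so the
# eigenvector of the non-vanishing eigenvalue yields the depth directly.
v = eigvecs[:, np.argmax(eigvals)]
print("estimated depth:", f * v[1] / v[0], " true depth:", z0)
```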
The changes in slope across the EPI can be used to infer properties of the surface shape such as surface orientation and surface curvature. In case that both eigenvalues are non-zero, then we have either a specularity or an occlusion in the scene. The difference in dimensionality of the image structure between texture and occlusion discontinuities can be used to easily segment the image captured with the sensor into regions belonging to different objects in the scene. Depth occlusions form y-junctions in the EPI and the angle of the junction can be used to differentiate between self-occlusions due to surface curvature where the intersection angle between the tracks of different features is very small and object occlusions where the angle is larger (corresponding to the magnitude of difference in depth between the objects). Linear Motion Any two-dimensional subset that involves simultaneously time and one of the spatial dimensions allows us to detect and measure the planar motion of the camera or an object in the scene (see Table 2.1). For perspective views (u,t) or (v,t), we can measure the two components of the motion along the optical axis and the coordinate axis, but have to estimate the depth of the scene at the same time, while for orthographic views we can onlymeasurethedisplacementalongthespatialdimension,butwecandothisaccurately without any additional depth estimation. 34 2.3.3 Three-dimensional Subspaces Structure from Motion: Simultaneous 3D Motion and Depth Estimation If we the take conventional perspective image sequences we can estimate the 3D mo- tion of the camera and the depth of scene, a problem known as structure from motion. This problem has been studied in depth and many solutions have been proposed. Unless we have pure rotation or are imaging a plane scene, this estimation is often very diffi- cult since the temporal changes in the image depend not just on the camera and object motions, but also on the depth structure of the scene(Tables 2.1 and 2.2). This forces us to solve the motion and depth segmentation simultaneously with the motion estimation leading to a high dimensional problem which is very sensitive to noise. In this case the dynamic images L(x,y,t) = L(x,y,?,?,t) or L(u,v) = L(?,?,u,v,t) are the subsets of interest. We can have the following cases for the eigenvalues?1,...,?3 of the structure tensor ?L?LT. As before, if all eigenvalues are of equal value (either non-zero or zero) we cannot extract any information, since the scene is either too homo- geneous or too random. If we have 1 non-zero eigenvalue, we have the motion of a line in the image, resulting in a plane in spatio-temporal space. Here we can only determine the component of the motion parallel to the edge in the image, the normal flow, due to the aperture problem. If we have two non-vanishing eigenvalues, then this corresponds to the motion of a point feature along a line in spatio-temporal space and the eigenvector corresponding to the vanishing eigenvector is parallel to the direction of the line. For the orthographic images, this flow is corresponding to the actual motion of the scene point, while in the perspective images, this motion also depends on the distance to the scene and the motion along the optical axes. 35 Scene-independent Planar Motion Estimation and Motion Segmentation If we look at the three-dimensional subspaces that are formed byL(x,u,t) andL(y,v,t), then we can extract the motion of a camera in the epipolar plane in a scene independent manner (see Table 2.2). 
Since changes in the epipolar image over time due to the camera motion form only a low three-dimensional subspace parameterized by the motion para- meters it is easy to detect independently moving objects since their trajectory will give risetogradientsintheEPIthatarenotcompatiblewiththeestimatedrigidmotion. Inthe future I plan to study if subspace factorizations in the spirit of Tomasi and Kanade [146], Irani [67], or Vidal et al. [155] can be utilized to make this factorization. Depth Estimation from Parallax If we capture a plenoptic subset formed by three of the four spatial dimensions (known as epipolar volumes), then we can determine the depth of the scene in all the epipolar planes contained in the plenoptic subset as long the scene texture offers enough informa- tion. In regions of homogeneous color, it is of course impossible to recover information about the scene. Besides depth we can again recover all the information as described in the Section 2.3.2. As before if the local orientation of the scene texture is parallel to the epipolar plane then we cannot estimate the depth because in this case the epipolar lines and the texture line coincide. This fact which is a well-known problem in the stereo literature. 36 2.3.4 Four-dimensional Subspaces Shape Estimation and Recovery of Surface Properties In 4D, we can take a look at the 4D light field which contains information about the re- flectance of a scene point as well as the depth and occlusion properties, thus allowing for computations that utilize the fact that all the dimensions constrain each other. We know for example, that all the gradients in the images should be parallel, but intensity derivatives in view direction space are scaled by the depth with respect to the intensity derivatives in view point space as described before. If we look at the different eigen- value distributions, we see that again we have the non-informative cases of completely vanishing or non-vanishing eigenvalues corresponding to the absence of textures or the presence of too much noise or strong specularity. Motion Stereo Estimation Anythreespatialdimensionsandatemporaldimensionallowsustocomputethemotion of the camera and the objects in the scene. Since we can compute the depth of the scene from the epipolar plane images, motion estimation and depth estimation are decoupled leading to a simpler problem. We can also solve for some motion parameters linearly in terms of plenoptic derivatives and then plug the solution into the non-linear scene- dependent motion equations. Having access to epipolar images and thus depth also simplifies the segmentation of the scene into independently moving objects since their depth is constrained by the plenoptic derivatives. 37 2.3.5 Five-dimensional Subspaces Scene-independent Motion Estimation If we have access to the complete plenoptic function, then we can compute the motion of the camera in a scene-independent manner by solving a linear system of equations in terms of the view point and view direction derivatives. Since these gradient fields only depend on the six motion parameters, it will be easy to detect independently moving objects by a simple clustering procedure using an affine motion model to describe the displacement of a scene region in the plenoptic space. 
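As a small numerical illustration of the structure-tensor reasoning used throughout this section, the sketch below considers a video of a pattern translating parallel to the image plane and recovers the image velocity from the eigenvector of the vanishing eigenvalue of the averaged spatio-temporal structure tensor. The texture and the velocity are arbitrary test values, and the example prescribes the image-plane velocity directly, so it sidesteps the depth dependence that arises for general perspective motion.

```python
import numpy as np

vx, vy = 0.6, -0.3                     # true image velocity of the pattern

xs = np.linspace(0, 4, 64)
ys = np.linspace(0, 4, 64)
ts = np.linspace(0, 1, 16)
X, Y, T = np.meshgrid(xs, ys, ts, indexing="ij")

def tex(x, y):
    # Texture with several orientations, so both spatial eigenvalues are
    # non-zero and the aperture problem does not arise.
    return np.sin(5 * x) + np.cos(7 * y) + 0.5 * np.sin(3 * x + 4 * y)

I = tex(X - vx * T, Y - vy * T)        # brightness-constant image sequence

# Spatio-temporal gradients and the averaged 3x3 structure tensor.
Ix = np.gradient(I, xs, axis=0)
Iy = np.gradient(I, ys, axis=1)
It = np.gradient(I, ts, axis=2)
G = np.stack([Ix.ravel(), Iy.ravel(), It.ravel()])
J = G @ G.T / G.shape[1]

eigvals, eigvecs = np.linalg.eigh(J)
d = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue
flow = d[:2] / d[2]                    # points along (vx, vy, 1) up to scale
print("estimated flow:", np.round(flow, 3), " true flow:", (vx, vy))
```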
2.4 The space of images Of special educational interest are the two-dimensional subspaces of the plenoptic func- tion, because most continuous imaging surfaces can only capture two-dimensional im- agessincethereceptorsneedtolieonatwo-dimensionalsurfaceinspace. Two-dimensional imagesalsoenableustoutilizethecorrelationsbetweendifferentcoordinateaxestoinfer information about the scene. In general any 2D-subset of this lightfield function consti- tutes an image. Recently, Yu and McMillan [161] described how all linear cameras can be described by the affine combination of three points in the light field parameterization. Some of the cameras they describe are: ? Ifalocation(x,y)inthefocalplaneisfixedsothattheimageisoftheformL(x,y,?,?,t), then it corresponds to the image captured by a perspective camera. This image is formed by the pencil of light rays that all pass through the point (x,y) in the focal plane. ? Ifinsteadwefixtheviewdirection(u,v),wecaptureanorthographicimageL(?,?,u,v,t) 38 of the scene. In this case, all the light rays that form the image are parallel where the direction is given by the vector (u,v,f)/bardbl(u,v,f)bardbl. ? If we choose to fix the values on two orthogonal spatial axes, for example (x,v) or (y,u), then the subsets L(x,?,?,y,t) and L(?,y,u,?,t) describe linear push-broom- video sequences, where a one-dimensional perspective slit camera is moved per- pendicular to the slit direction as for example in recordings made by surveillance planes. This camera is an example of the so-called crossed-slit projection camera as described by Zomet et al. [168] These images are formed by the set of rays passing through two slits lying in two planes parallel to the imaging surface. The different kinds of images above are subsets of the plenoptic function where each light ray corresponding to an image pixel intersects the scene at a unique location. In other words this means that none of the light rays intersects another light ray that is captured in the image on the surface of an object in the scene. Given an image of the type described above, it is therefore impossible to infer something about the scene structure without prior knowledge. In the following we will take a closer look at the subsets that correspond to light rays that intersect in space and therefore could potentially encode information about the structure. From [112, 129] we know that to form a stereo geometry, that means rays lie on an epipolar surface and intersect, the rays making up an image need to lie on a doubly-ruled quadric. The only such surfaces are planes, hyperboloids, and parabolic hyperboloids. An example for such a planar configuration are epipolar plane images. These images are formed by fixing two of the parallel axes in the two-plane parameteri- zation, that is (x,u) or (y,v). 39 (a) (b) Figure 2.2: (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar plane image volume formed by the image sequence where the top half of the volume has been cut away to show how a row of the image changes when the camera translates. 2.5 The Geometry of Epipolar Images We are interested in characterizing the frequency structure of the time-varying plenop- tic function captured by a moving polydioptric camera. Since visual information is in general local information, we will first study local neighborhoods of the static plenoptic function. 
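To see how the camera types of the previous section are simply 2D slices of one and the same 4D light field, the short sketch below builds a toy discrete light field L(x, y, u, v) and extracts a perspective image (fix x, y), an orthographic image (fix u, v), and a push-broom-style image (fix x and v). The radiance pattern itself is synthetic; only the slicing pattern follows the definitions in the text.

```python
import numpy as np

# A toy discrete light field L[x, y, u, v]: 16x16 view points on the focal
# plane, 32x32 view directions on the image plane.
nx, ny, nu, nv = 16, 16, 32, 32
x, y, u, v = np.meshgrid(np.linspace(-1, 1, nx), np.linspace(-1, 1, ny),
                         np.linspace(-1, 1, nu), np.linspace(-1, 1, nv),
                         indexing="ij")
# Synthetic radiance pattern; any function of (x, y, u, v) would do here.
L = np.sin(3 * (x + 0.5 * u)) * np.cos(2 * (y + 0.5 * v))

perspective  = L[nx // 2, ny // 2, :, :]   # fix (x, y): pencil of rays
orthographic = L[:, :, nu // 2, nv // 2]   # fix (u, v): parallel rays
pushbroom    = L[nx // 2, :, :, nv // 2]   # fix (x, v): push-broom slice

print(perspective.shape, orthographic.shape, pushbroom.shape)
```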
Most of the local structure of the space of light rays can already be described by ex- amining two-dimensional slices, either perspective, orthographic or epipolar images. In this section we will focus on the epipolar images because in them the geometric structure and surface textures interact the strongest. The rays in the epipolar image intersect either in free space, inside the object or on the object surface. These images exhibit a regular structure, as seen in the example in 40 f Z? x Scene u Origin u0 u 0 x f Z u=u0 - x f/z(u0) z(u0) ?f ?i ?u ?x Bu -Bu (f/zmax) ?u ? ?x = 0 (f/zmin) ?u ? ?x = 0 (f/zopt) ?u ? ?x = 0 Bx = f(1/zmin-1/zmax) Bu (a) (b) Figure 2.3: (a) Light Ray Correspondence (here shown only for the light field slice spanned by axesxandu). (b) Fourier spectrum of the (x,u) light field slice with choice of ?optimal? depthzopt for the reconstruction filter. Fig. 2.2b. The projection of a scene point traces out a linear line in the EPI with a slope is proportionaltothedepthofthescenepoint. Thusthedifferentialstructureofanepipolar plane image encodes the depth, as well as occlusions and the reflectance function of the surface in view. We follow in our description [15] and analyze the motion of the image of a world point x = (x,z) in a moving line camera. If camera and world coordinate system coin- cide, we can apply the usual perspective projection, so that the image of the point in the image linez = f is given byu = f(x/z). To describe the motion of the projected point in dependence on the camera motion for a general camera position and orientation, let c = (cx,cz) be the center of the camera and let the optical axis make an angle of? with thex-axis of the coordinate system. The world coordinates of the point x are x0 = (x0,z0) = Rx + c where the rotation matrix 41 du z Z? x Scene Origin uf Z dx = (z/f) dudx?f ?i ?_u z Z? x Scene Origin uf Z ?_x = (z/f) ?_u?_x?f ?i (a) (b) Figure 2.4: (a) The ratio between the magnitude of perspective and orthographic deriv- atives is linear in the depth of scene. (b) The scale at which the perspective and ortho- graphic derivatives match depends on the depth (matching scale in red) R = [cos(?),?sin(?);sin(?),cos(?)] alignsthecameraandworldcoordinatesystems. This implies that x = RT(x0 ?c), therefore the projection onto the image line in the general case is given by: u = f(x0 ?cx)cos(?) + (z0 ?cz)sin(?)(c x ?x0)sin(?) + (z0 ?cz)cos(?) (2.13) First, we will study the effect of a simple linear motion. Without loss of generality, we can assume that the camera moves with constant speed along the positive x-axis (c = (cx,cz) = (at,0)). This leads to u = f(x0 ?at)cos(?) +z0 sin(?)(at?x 0)sin(?) +z0 cos(?) (2.14) 42 (a) (b) Figure 2.5: (a) Epipolar image of a scene consisting of two translucent fronto-parallel planes. (b) The Fourier transform of (a). Notice how the energy of the signal is concen- trated along two lines. which we can expand to 0 = (asin(?))ut+ (z0 cos(?)?x0 sin(?))u (2.15) + (af cos(?))t?f(x0 cos(?) +z0 sin(?)) which shows that the feature paths are hyperbolic curves. The asymptotes of the hyper- bola are parallel to the coordinate axes as can be seen if we rewrite Eq.2.15 as: (u+f cot(?))(t+ (cot(?)?x0)/a) = fz0asin(?) (2.16) It is to note, that it is always possible to linearize the feature paths by derotating the coor- dinate system, so that the viewing direction is perpendicular to the direction of motion. The transformation is given by: uprime = fucos(?)?f sin(?)f cos(?)?usin(?) 
(2.17) If the viewing direction is perpendicular to the direction of motion (? = 0), then the 43 feature paths degenerate to straight lines z0u+aft?x0f = 0. (2.18) This is the canonical way to parameterize epipolar images. Each feature in the world traces out a line in the epipolar image with a slope that is proportional to the relative depth of the feature with respect to the camera. This is also illustrated in Fig. 2.4as where we see that dx = (z/f)du by the law of similar triangles. Thus, depth can be extracted from the feature paths simply by extract- ing the slope of a featurez = f(dx/du). 2.5.1 Fourier analysis of the epipolar images In contrast to Chai et al. [27] which parameterize the world with respect to a perspective reference image of the scene, we will follow Zhang and Chen [162] and will parameterize the 2D light field slice in terms of the planar surface light field LS(x,?), x = [x,z]T ? R2,? ? [?pi,pi] where ? = 0 corresponds to the direction opposite of the depth axis (??z = [0,?1]). ThesurfacelightfieldisonlydefinedonthesurfaceS oftheobjectswhich in this flat-land case is defined by the curve x(s) which we will parameterize using the arc-length parameters. The epipolar plane image is parameterized by its view point xc = [xc,0] (we as- sume w.l.o.g. that the camera is at depth 0) and view direction, parameterized by the intersection u of the view ray with the image plane at distance f to focal plane. We also assume that the field of the camera is restricted tou? [?u0,u0]. In the following we will examine the effect of depth, slope, reflection properties and occlusions on the frequency structure of the epipolar image. Specifically, we will look at the following scenarios: 44 1. a single fronto-parallel plane with Lambertian surface reflection, 2. a single fronto-parallel plane with band-limited non-Lambertian surface reflection, 3. a single tilted plane, 4. and one fronto-parallel plane occluding a second one. From our previous definition, the view ray has the implicit equation [f,?u]?(x? xc) = 0 or if multiply the equation out, we get xf ?xcf ?uz = 0. (2.19) If the normal to the surface and the viewing ray do not point along the same di- rection in the field of view of the camera, that is N = [u,f] ? ?D(x(u))T < 0 ?u ? [?u0,u0], then we do not have to worry about self-occlusions on the surfaces. Here x(u) = xc +z(xc,u)?u/f is the intercept of the view ray at pixeluas seen from camera position xc. 2.5.2 Scenes of constant depth In this case the equation of the object curve is given by xs = [s,z0]T and we can easily solve for the intersection of the viewing ray with the surface in terms ofswhich leads to the relations s = uz0f +xc and? = ?tan?1(u/f) (2.20) Thus the local light field structure of an EPI is given by the following equation, L(xc,u) = LS(xc + z0f u,?tan?1(u/f)) (2.21) In this case the occlusion condition is satisfied for all non-horizontal viewing rays, and we do not have to worry about self-occlusions. 45 We can compute the depth of the scene from the image derivatives as follows, to simplify the computation, we will use the approximation tan?1(u/f) ?u/f. L(xx,u) = LS(xc + z0f u,?tan?1(u/f)) = LS(xc + z0f u,?u/f) ? ?xcL(xc,u) = ? ?sLS(xc + z0 f u,?u/f) ? ?uL(xc,u) = ? ?sLS(xc + z0 f u,?u/f) z0 f ? ? ??LS(xc + z0 f u,?u/f) 1 f If we examine the ratio between the EPI derivatives ? ?uL(xc,u)/ ? ?xcL(xc,u) = ? ?sLS(...,...) z0 f + ? ??LS(...,...) 1 f ? ?sLS(...,...) = z0f + ? ??LS(...,...) 1 f ? ?sLS(...,...) 
(2.22) we see that the ratio between the derivatives is proportional to the depth of the scene if the scene consists of a fronto-parallel plane and we can neglect the influence of the non-Lambertian effects (see Figure 2.4a). If the derivatives of the reflectance function are involved,thenwehavetoseparatetheinfluenceofreflectionanddepthonthebrightness derivatives. In the Fourier domain, we can analyze the frequency spectrum of a scene at con- stant depth by expressing the light field spectrum in terms of the surface light field spec- trum. ?L(xc,u) = integraldisplayintegraldisplay L(xc,u)exp(?2pii(?xxc + ?uu))dxcdu = integraldisplayintegraldisplay LS(xc + z0f u,?tan?1(u/f))exp(?2pii(?xxc + ?uu))dxcdu We have the identities u = ?f tan(?) and xc = s + z0 tan(?). Here we will make the approximation tan(?) ? ? for small ? (corresponding to a small field of view camera), 46 thus we get ?L(xc,u) = integraldisplayintegraldisplay LS(s,?)exp(?2pii(?x(s+z0?)??uf?)fdsd? = f?LS(?x,?xz0 ??uf) (2.23) As described in [162], if we assume that the surface light field is bandlimited by B?, then we have the equality ?LS(?s,??) = ?LS(?s,??)?1B?(??) where 1B?(??) is an in- dicator function over the interval [?B?,B?]. When we have a Lambertian surface, where B? = 0, ?1B?(??) equals the Dirac functional ?(??) and we get the familiar fact that the frequency spectrum of an EPI observing a Lambertian surface at constant depth is a line in frequency space with a slope proportional to the depth of the surface in the spatial domain. 2.5.3 A slanted surface in space The computations for a slanted surface are more complicated because the depth of the scene is changing when we vary either the view direction or the view position causing a fore-shortening effect of scene texture. This causes the frequency content of the projected texture to change with position, a phenomenon known as chirping [93]. A tilted line in space can be parameterized as (x,z) = (xs,zs) + s(cos(?),sin(?)), thus using Eq. (2.19) we can solve fors: s = (xs ?xc)f ?zsucos(?)f ?sin(?)u. (2.24) We have to make sure that the occlusion property is satisfied for the field of view of the camera [?u0,u0]. Thus we have the constraint on?such that cos(?)f ?sin(?)u< 0 ?u? [?u0,u0]. 47 Computing the Fourier transform of the light field we get ?L(xc,u) = integraldisplayintegraldisplay L(xc,u)exp(?2pii(?xxc + ?uu))dxcdu = integraldisplayintegraldisplay LS( (xs ?xc)f ?zsucos(?)f ?sin(?)u,?tan?1(u/f))exp(?2pii(?xxc + ?uu))dxcdu Again we solve these two equations foruandxc (while again making the approxi- mation tan(?) ??): ? ?? ? u xc ? ?? ?= ? ?? ? ?f? ?s?[sin(?)]?scos(?) +?zs +xs ? ?? ?= ?(s,?) (2.25) Computing the determinant of the Jacobian of the mapping ?(s,?) = (u,xc), we get |det(D?(s,?))| = cos(?) + sin(?)? which leads to ?L(xc,u) = integraldisplayintegraldisplay L(s,?)exp(?2pii([?u,?x]?(s,?)f(cos(?) + sin(?)?)dsd?. (2.26) Sincesand? appear as a product in ?(s,?), in general we cannot easily additively factor ? and express the spectrum of the epipolar image in terms of the spectrum of the surface light field. Nevertheless, if we can factor the surface light field LS(s,?) = ?(s)B(?) where ?(s) describes the texture on the plane, while B(?) captures the view dependent effects, then we can simplify the equations as follows: ?L(xc,u) (2.27) =f integraldisplay ??(?x(cos(?) + sin(?)?))B(?)(cos(?) + sin(?)?)exp(?2pii(?x(?zs +xs) + ?uf?)d? where ?? is the Fourier transform of the surface texture ?(s). 
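Returning to the constant-depth case, the prediction of Eq. (2.23) that the EPI spectrum of a Lambertian scene collapses onto a line whose slope is set by the depth can be checked numerically. The sketch below forms the EPI L(x_c, u) of a fronto-parallel Lambertian texture at depth z0, takes its windowed 2D FFT, and measures how much of the spectral energy lies in a narrow band around the line Omega_u = (z0/f) Omega_x (the form the line takes with the sign and axis conventions of this sketch). The texture, the depth, and the band width are arbitrary test choices.

```python
import numpy as np

f, z0 = 1.0, 2.5                       # focal length and scene depth
nx, nu = 256, 128
xs = np.linspace(-2.0, 2.0, nx)        # view positions
us = np.linspace(-0.5, 0.5, nu)        # image coordinates

def texture(s):                        # random-looking Lambertian texture
    rng = np.random.default_rng(0)
    freqs, phases = rng.uniform(1, 20, 12), rng.uniform(0, 2 * np.pi, 12)
    return sum(np.sin(2 * np.pi * k * s + p) for k, p in zip(freqs, phases))

X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (z0 / f) * U)          # EPI of a constant-depth plane

# Window to reduce spectral leakage from the finite EPI support.
L_win = L * np.hanning(nx)[:, None] * np.hanning(nu)[None, :]
power = np.abs(np.fft.fftshift(np.fft.fft2(L_win))) ** 2

fx = np.fft.fftshift(np.fft.fftfreq(nx, d=xs[1] - xs[0]))[:, None]
fu = np.fft.fftshift(np.fft.fftfreq(nu, d=us[1] - us[0]))[None, :]

# Energy within a narrow band around the predicted line f_u = (z0/f) f_x.
band = np.abs(fu - (z0 / f) * fx) < 3.0 / (us[-1] - us[0])
print("energy fraction on the predicted line:", power[band].sum() / power.sum())
```

The printed fraction should come out close to one, confirming that the spectral energy is concentrated along a single line whose slope is proportional to the depth.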
By choosing a specific tex- ture, we can then evaluate the integral and compute the spectrum of the epipolar image. We see how by varying?, the frequency spectrum is getting spread out between the min- imum and maximum depth. 48 (a) (b) Figure 2.6: (a) Epipolar image of a scene where one signal occludes the other. (b) The Fourier transform of (a). Notice the ringing effects parallel to the occluding signal. 2.5.4 Occlusion To estimate how occlusions influence the frequency structure of L(x,u) we can use the approach described in [8]. They model an occlusion in a local neighborhood by split- ting the light field L(x,u) into two parts L+(x,u) and L?(x,u) using a binary indicator function ?(x,u). L(x,u) = ?(x,u)L+(x,u) + [1??(x,u)]L?(x,u) Applying the Fourier transform toL(x,u) results in ?L(?x,?u) = ??(?x,?u)? ?L+(?x,?u)? ??(?x,?u)]? ?L?(?x,?u) + ?L?(?x,?u) We see that due to the convolution the frequency spectra of ?L+ and ?L? are spread by out the frequency spectrum ??. Since most of the geometrical information of a scene manifests itself only on the local scale, we can assume that locally every occlusion edge is a linear step edge. This casewasstudiedby[8]wheretheylookedattheFourierspacedescriptionoftwomoving 49 textured planes where one plane is occluding the other. They studied the case of constant and linear motion, which basically corresponds to changing the viewpoint in the light field parameterization (constant motion case), or also changing the viewpoint along the optical axis (linear motion case). We follow their convention and model the texture of the scene as a signal that satisfies the Dirichlet conditions. This allows us to express the signal as an infinite exponential series that converges uniformly to the signal. We define a step function ? for one side of a fronto-parallel plane that begins at positions+ at depthz+ using the previous reparametrization as: ?(x,u) = ?s(x+ z+f u) = ?? ?? ??? 1 ifx+ z+f u?s+ 0 otherwise The Fourier transform of a step function is then given by: ??(?x,?u) = ?integraldisplay u=?? ?integraldisplay x=?? ?(x,u)e?j(?xx+?uu)dxdu = parenleftbigg pi?(?x)? i? x parenrightbigg e?j?xxi?(?u ??xz+f ) We see that the occlusion will cause ringing in the frequency domain that is parallel to the line ?u??xzif = 0. The orientation of this line depends on the depth of the occlusion edge (Fig. 2.6). Following [162] we can model more complicated scenes by partitioning the scene intondepth levels where each depth is assumed to have constant depth (fronto-parallel plane). The order of the depth levels be z1 < ... < zn. Each level extends until infinity and is only occluded by the layers in front of it closer to the camera. To model a finite extent for each depth layer, we define the maskMi (i = 1,...,n): Mi(x,u) = ?? ?? ??? 1 if layeriis blocking the path of light ray x 0 otherwise 50 (a) (b) Figure 2.7: (a) Epipolar image of a complex scene with many occlusions.. (b) The Fourier transform of (a). There are multiple ringing effects parallel to all the occluding signals. We will examine the epipolar images Li(x,u) of the layers separately one by one. The Fourier transform of each layer is denoted by ?Li(?x,?u) and the Fourier transform of the occluded signal by ?Loi(?x,?u). The first layer is not occluded, thus its Fourier transform is identical to the unmasked Fourier transform, that is ?Lo1 = ?L1. The second layer is occludedbythefirst,thuswehaveLo2 = L2?M1 anditsFouriertransformis ?Lo2 = ?L2? ?M1. 
WecandothisnowforeverylayerandwegetLoi = Li?producttexti?1j=1Mi anditsFouriertransform is ?Loi = ?Li? ?M1?...? ?Mi?1. Usingtheresultsfrombefore,weseethattheFouriertransform oftheoccludedsignalisconcentratedalongthelinewithslopezi/f withringingartifacts parallel to z1/f,...,zi?1/f (Fig. 2.7). We can also include the windowing effect due to the finite image size by choosing the first mask to be the size of the camera window which will lead to some ringing artifacts that are parallel to the coordinate axes. This also suggests that to reduce the ringing effects, we should window the local plenoptic neighborhood of interest. 51 If we now compute the convolution, between ??i and ?LSi(?x)?(?u ??xzif ), we get ??i(?x)?(?LSi(?x)?(?u ??xzif )) (2.28) = integraldisplay ej(?x??y)x ui +xli 2 sin(xui ?xli2 (?x ??y)) ?x ??y ?LS i(?y)?(?u ??y zi f )d?y (2.29) =ej(?x??u f zi) xui +xli 2 sin(xui ?xli2 (?x ??u fzi)) ?x ??u fzi ?LS i(?u f zi) (2.30) and see how the occlusion again spreads out the Fourier spectrum of the texture compo- nent of the surface light. 2.6 Extension of natural image statistics to light field statistics There exists a large amount of literature about the statistics of natural images which we can use to simulate an ?average scene?. The most noticeable result is that the power spectrum of a natural image falls approximately inversely proportional to the square of the spatial frequency. Dong and Atick [39] demonstrate how a similar scaling law can be derived from first principles for spatio-temporal sequences. We can use their formalism to find an expression for the power spectrum of a light field |?L|2. As we have seen in Chapter 2, the power spectrum |?L|2 depends on the spatial frequencies of the textures in the scene, the orientations of the scene surfaces, as well as the depth and velocity distribution in the scene. For now we will disregard the effect of occlusions and assume that the power spectrum of a perspective image of the scene is rotationally symmetric, that means there is no special direction. For natural scenes this should be roughly satisfied, although it has been observed that horizontal and vertical orientations are often more predominant especially in man-made environments. Thus it is sufficient to find the power spectrum of a light field subspace formed by the axesu-x-t (time-varyingepi-polarplane). Theextensiontothefull5Dlightfieldisstraightforward. 52 Forthepowerspectrumofthestaticperspectiveimagerowwecanassumethatitfollows a power law |?L0(?u)|2 = Kbardbl?ubardblm wherem? 2.3 andK is a normalization constant [39]. As shown in Section 2.5, if we disregard occlusions the energy of the Fourier trans- form of an epipolar plane image (x-u-plane) observing an object at a constant depth z is concentrated along the line ?x = f?u/z. Thus the power spectrum of the epipolar plane image of this scene is given by |?L(?u)|2?(?x ?f?u/z). If the depth is varying, then the power spectrum will spread out to a wedge-shaped region bounded by the minimal and maximal depth (see Fig 2.3c). For a given region in the world at depth z that moves relative to the camera with velocity ?x, we have the brightness invariance of the formI(u,t) = Is(u?f ?xt/z), thus the power spectrum of a perspective spatio-temporal image plane is concentrated along the line ?t = f?u?x/z. For the characterization of the spectrum of an spatio-temporal image observing points following more complicated trajectories see [30]. 
Now we can write the power spectrum of the time-varying epipolar plane as follows: |L(?x,?u,?t,z, ?x)|2 (2.31) = |Lu(?u)|2?(?x ?f?u/z)?(?t ?f?u?x/z) (2.32) GivenaprobabilitydistributionforthevelocitiesD?x(?x)anddepthsDz(z),wecanexpress an ?average? light field power spectrum by integrating over these distributions: |?L(?x,?u,?t)|2 = ?integraldisplay ?? ?integraldisplay 0 |L(?x,?u,?t,z, ?x)|2D?x(?x)Dz(z)d?xdz (2.33) = |Lu(?u)|2 ? ?integraldisplay ?? ?integraldisplay 0 ?(?x ?f?uz )?(?t ?f?u?xz )Dz(z)D?x(?x)dzd?x = Kbardbl? ubardblm D?x(?t? x )Dz(f?u? x ) (2.34) 53 since the integral is only non-zero when z = f?u/?x (2.35) ?x = (?tz)/(f?u) = ?t/?x. (2.36) 2.7 Summary In this chapter we described the information content of the different subspaces of the plenoptic function. We saw that most of the information that relates the image infor- mation to the geometry of the scene is encoded in two-dimensional subspaces known as epipolar plane images. We then examined the frequency spectrum of these two- dimensional subspaces and saw that the energy of these images in the frequency domain is concentrated along regions in space that are determined by the depth of the scene. We willlaterutilizetheseresultswhenwestudythesamplingproblemforpolydioptriccam- eras. In this thesis we will restrict our study to the dynamic plenoptic function observed by a rigidly moving observer. Since in chapter 5 we will focus specifically on this case, we will postpone the description of the details to that chapter. 54 Dim Subspace Object Information Content 0D: ray -color 1D: x,y line of view points ? intensity variation u,v circle of view directions ? intensity variation t temporal changes ? change detection 2D: xy orthographic ? metric measurements image parallel to image plane ? depth-independent fore- shortened 2D-texture uv perspective ? angle measurements image around optical center ? depth-dependent fore- shortened texture xv,yu push-broom image ? cylindrical panorama xt,yt video of ? motion parallel to sensor parallel lines ? temporal occlusion ut,vt video of ? radial motion converging lines ? temporal occlusions xu,yv epi-polar image ? depth from gradients ? occlusions from junctions Table 2.1: Information content the axis-aligned plenoptic subspaces, Part I 55 Dim Subspace Object Information Content 3D: xyu,xyv orthographic ? surface geometry epi-polar volume xyu,xyv perspective ? surface geometry epi-polar volume xyt orthographic video ? motion parallel to parallel lines image plane uvt pinhole camera ? motion around sphere video of directions xvt,yut push-broom video ? motion parallel to line and around circle xut,yvt epi-polar video ? linear planar rigid motion estimation 4D: xyuv light field ? reflection properties xyut,xyvt orthographic epi- ? 3D motion recovery polar volume video and segmentation xuvt,yuvt perspective epi- ? 3D motion recovery polar volume video and segmentation 5D: xyuvt light field video ? complete scene information ? linear metric motion estimation Table 2.2: Information content of axis-aligned plenoptic subspaces Part II 56 Chapter 3 Polydioptric Image Formation: From Light Rays to Images L(x) Camera Design Analysis(Optics)Sampling(Geometry) Prefilter(Image Filter)Synthesis(Representation) I(x) ? ?(x-k) Image Processing ?(?x) ?(x)?(k) ?(k) AquisitionNoise Figure 3.1: Imaging pipeline expressed in a function approximation framework In this Chapter we will describe how the properties of the optics and the sensors affect the imaging process. 
Our model of the image formation pipeline for a polydioptric camera is based on the mathematical framework described in [150] and is summarized in the diagram in Fig. 3.1. The pipeline consists of the following five components: 1. Optics of the lens system: The radiometric and geometrical transformation from light rays in world coordinates to light rays that impinge on the sensor surface. 2. Geometric distribution of the sensor elements: This includes the summation of light over the sensor area and the geometric sampling pattern of the sensor element 57 distribution. 3. Noise characteristics of the acquisition process: The conversion from irradiance into an electrical signal, including shot noise due to the quantum nature of light, and noise due to the readout electronics and quantization of the measurements. 4. Elementary processing on the image: Filtering in the image domain to reduce the noise and prefilter the image for the reconstruction. 5. Continuous representation of discrete measurements: Interpolating the discrete signal to generate a continuous representation. In the next sections we will formalize the five individual steps of the imaging process. 3.1 The Pixel Equation The conversion of irradiance into a digital intensity value at a pixel location can be de- scribed by the measurement equation (also known as pixel equation). For a single pixelithe measured intensityIi at timetj is given by the following equation: Ii(tj) = integraldisplay t?Tj integraldisplay x?Xi integraldisplay r?Ai(x) L(Gx(x;r,t),Gr(x;r,t),t)Pi(x;t))drdxdt. (3.1) In this equation x denotes the position vector on the imaging surface, r is the direction vectortowardsthelenssystem,andtisthetime. Thisequationcanbefurthergeneralized by including wavelength effects such as chromatic aberrations, but in this work we will not consider them further. L is the scene spectral radiance defined in object space, that is the usual plenoptic function. The geometric transformation of a light ray on its path through the camera system, that is the geometry of image formation, is described by the 58 functions Gx and Gr. An example transformations would be the focusing effect of the lens. P models the behavior of the shutter and the sensor response and is dependent on position and time. The integration is done over the positions in pixel coordinates that make up the pixel area Xi, the directions that point towards the lens aperture as a function of the light ray originAi(x), and the shutter intervalTj. Apolydioptriccameraconsistsofamultitudeofsuchpixels. Thepolydioptricsam- pling problem can be easier understood if we model the image formation as a sequence of convolutions on the space of light rays. We can rewrite the pixel equation by rewriting the rendering equation as a non-stationary filter operation in light ray space followed by a discrete sampling operation. The filter models the imaging acquisition process and the samplingpatternmodelsthesensorgeometryanddistribution. ThenewformofEq.(3.1) is then: Ii(tj) = integraldisplay Tj integraldisplay Xi integraldisplay Ai(x) L(Gx(x;r,t),Gr(x;r,t),t)Pi(x;r;t))drdxdt (3.2) = integraldisplay ?Tj integraldisplay ?Xi integraldisplay ?Ai(x) L(x;r;t)Pi(G?1x (x,r,t),G?1r (x,r,t),t))drdxdt (3.3) = integraldisplay R3 integraldisplay S2 integraldisplay R+ L(x;r,t)??(x?xi,r?ri,t?tj)drdxdt (3.4) where ? is a function that integrates the effect of the integration limits, the geometric transformationsG?1x ,G?1r , as well as the optical effects captured byP. 
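To make the "filter then sample" view of the measurement equation concrete, the minimal sketch below evaluates Eq. (3.4) for one pixel by Monte Carlo integration, assuming a box-shaped kernel supported on a small patch of positions around x_i, a small cone of directions around a central ray r_i, and the shutter interval, and assuming identity geometric transformations G_x, G_r. The plenoptic function `plenoptic` and all numerical parameters are made-up test inputs.

```python
import numpy as np

def plenoptic(x, r, t):
    """Stand-in for the scene radiance L(x; r; t); any smooth function of
    ray position, unit direction, and time works for this illustration."""
    return 1.0 + 0.5 * np.sin(3.0 * r[0] + 2.0 * r[1] + t) * np.cos(4.0 * x[0])

def measure_pixel(x_i, r_i, t_j, dx, cone, shutter, n=20000, seed=0):
    """Monte Carlo evaluation of Eq. (3.4) with a box-shaped kernel:
    average L over positions within +- dx/2 of x_i, directions in a small
    cone around r_i, and times within the shutter interval."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n):
        x = x_i + dx * (rng.random(3) - 0.5)          # position jitter
        r = r_i + cone * rng.normal(size=3)           # direction jitter
        r /= np.linalg.norm(r)                        # back onto S^2
        t = t_j + shutter * rng.random()
        acc += plenoptic(x, r, t)
    return acc / n                                    # normalized pixel response

if __name__ == "__main__":
    x_i = np.array([0.01, -0.02, 0.0])                # pixel position
    r_i = np.array([0.0, 0.0, 1.0])                   # central ray direction
    print("pixel response:", measure_pixel(x_i, r_i, t_j=0.0,
                                           dx=1e-3, cone=5e-3, shutter=1e-2))
```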
3.2 Optics of the lens system

This section follows the exposition of a correct and realistic camera model as presented in [76]. We will now describe in detail the different components of this model. We will assume that the coordinate system is located at the center of the image surface and that the axes of the coordinate system are aligned with the axes of the image. In this section we will describe two different idealizations of the lens system of a camera, the pinhole and the thin-lens model.

[Figure 3.2: (a) Pinhole and (b) thin-lens camera geometry.]

Pinhole Camera

The simplest camera geometry is the pinhole camera model, which uses no lens at all. A point aperture is placed in front of the imaging surface, so only the rays through a single point in space pass through the aperture and fall on the imaging surface. Every point in space can be imaged in perfect focus (up to the diffraction limit), and we need only one parameter to describe this lens, namely the distance between the imaging surface and the lens center $d_i$, often also called the focal length $f$. If we choose the focal point as the origin of our fiducial coordinate system, then the projection of a point $P \in \mathbb{R}^3$ onto the image plane results in the image point $x$ whose coordinates are given by the well-known equation

\[ x = -\frac{d_i\,P}{\hat z \cdot P} = -\frac{f\,P}{\hat z \cdot P} \qquad (3.5) \]

where $\hat z$ is the unit vector along the optical axis. In this case the ray pixel equation becomes

\[ I_i = \int_{(x,y)\in X_i}\int_{t\in T_j} L\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix};\ \begin{bmatrix} -x\\ -y\\ d_i\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,dx\,dy\,dt. \qquad (3.6) \]

To simplify the notation we used the shorthand $[r]_N = r/\|r\|$ to describe vectors of unit length. We see that the geometric transformations of the light rays are very simple, that is $G_x(x;r,t) = x$ and $G_r(x;r,t) = -[x]_N$, and the inverse transformations are given by $G_x^{-1}(x;r,t) = 0$ and $G_r^{-1}(x;r,t) = r$.

Thin-lens Camera

In most cases the light gathering power of a simple pinhole camera is not sufficient (or it would necessitate an unreasonably long exposure time); therefore a lens is used to focus the light passing through a larger, finite aperture. Let us assume this aperture is circular and has a radius $R_l$. For the thin-lens approximation, we assume that we have a lens of infinitesimally small extent along the optical axis, which means the two sides of the lens coincide in this idealization. The only parameters we need to know to describe this kind of camera are the focal length $f$ of the lens and the distance $d_i$ between the imaging surface and the lens center. Due to the focusing of the lens, only one plane in object space at depth $d_0$ will be imaged in focus on the imaging surface. The relationship between $d_0$, $d_i$ and $f$ is given by the lens equation:

\[ \frac{1}{d_0} + \frac{1}{d_i} = \frac{1}{f} \qquad (3.7) \]

We see that the focal length of the camera is the distance from the lens at which an object at infinite depth will be imaged in focus. If point $P$ lies at a depth $d$ different from $d_0$, then the point will be imaged on the imaging surface as a blurred circular patch, where the blur radius is given by the formula:

\[ r_b = R_l\, d_i \left(\frac{1}{f} - \frac{1}{d_i} - \frac{1}{d}\right) \qquad (3.8) \]

The blur effect due to defocus has been a popular method to recover depth in the literature (e.g., [127]).
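As a quick numerical illustration of the lens equation (3.7) and the blur radius formula (3.8), the following hedged sketch computes the plane of focus for toy values of $f$ and $d_i$ and the blur radius of points at a few depths; the numbers are invented for illustration only.

```python
f = 0.050        # focal length [m], illustrative
d_i = 0.052      # lens-to-sensor distance [m], illustrative
R_l = 0.010      # aperture radius [m], illustrative

# Plane of focus implied by the lens equation 1/d_0 + 1/d_i = 1/f
d_0 = 1.0 / (1.0 / f - 1.0 / d_i)

def blur_radius(d):
    """Radius of the blur circle for a point at depth d, Eq. (3.8)."""
    return abs(R_l * d_i * (1.0 / f - 1.0 / d_i - 1.0 / d))

print(f"plane of focus at d_0 = {d_0:.3f} m")
for d in (0.5, d_0, 2.0, 10.0):
    print(f"depth {d:6.3f} m -> blur radius {blur_radius(d) * 1e6:8.1f} micrometers")
```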
The fact that the distance between the intersection of the rays that pass through the periphery of the lens with the imaging surface and the intersection of the principal ray with the imaging surface depends on the depth of the point that is imaged is the basic idea behind the "plenoptic camera" of Adelson and Wang [2] and the "depth by optical differentiation" method of Farid and Simoncelli [42]. The projection of a point in the world onto the imaging plane of a thin-lens camera is still described by Eq. (3.5), since the geometric aspects of the projection are completely defined by the principal ray. If this point is not lying on the plane of focus, then radiance leaving this point will be measured not just at a single location on the imaging surface, but over an extended area of the sensor.

Since Eq. (3.8) is symmetric with respect to the distances of the lens to the imaging surface and the lens to the object, we can also go the other direction and use it to define which rays of the world are integrated to form the response at a single pixel. If we define a coordinate system on the lens using the coordinates $u,v$, and denote the set of locations on the lens by the set $A_L := \{(u,v) : \|(u,v)\| \le R_l\}$, then inside the camera we have the following expression for the pixel equation:

\[ I_i = \int_{(x,y)\in X_i}\int_{(u,v)\in A_L}\int_{t\in T_j} L\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix};\ \begin{bmatrix} u-x\\ v-y\\ d_i\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,du\,dv\,dx\,dy\,dt \qquad (3.9) \]

Next, we have to determine what image is projected onto the lens. We know that the points at depth $d_0$ are imaged in perfect focus, thus the principal ray originating at $x = [x,y,-d_i]^T$ intersects the plane of focus at

\[ x_f = \left[-\frac{x d_0}{d_i},\ -\frac{y d_0}{d_i},\ d_0\right]^T = \left[-\frac{xf}{d_i-f},\ -\frac{yf}{d_i-f},\ \frac{fd_i}{d_i-f}\right]^T. \qquad (3.10) \]

Therefore the refracted direction of the ray that intersects the lens at position $x_l = [u,v,0]^T$ is given by $r_f = [x_f - x_l]_N$. This leads to the following equation for the rays that are observed by a pixel:

\[ I_i = \int_{(x,y)\in X_i}\int_{(u,v)\in A_L}\int_{t\in T_j} L\!\left(\begin{bmatrix} u\\ v\\ 0\end{bmatrix};\ \begin{bmatrix} -xf/(d_i-f)-u\\ -yf/(d_i-f)-v\\ d_i f/(d_i-f)\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,du\,dv\,dx\,dy\,dt \qquad (3.11) \]

If the origin of the coordinate system is at the center of the lens, then for an imaging element at distance $d_i$ from the lens, the plane at depth $d_o = f d_i/(f - d_i)$ will be in focus. We use this fact to rewrite the pixel equation for the thin-lens camera in the filter/sampling framework. Given the ray $(x;r)$ with $\hat z \cdot x = -d_i$ leaving the imaging element, it will intersect the lens at location $x_l = x + \frac{d_i}{\hat z \cdot r}\,r$ and the plane of focus at $x_f = -\frac{d_o}{d_i}\,x$. Thus by setting $G_x(x;r,t) = x_l$ and $G_r(x;r,t) = [x_f - x_l]_N$ we can describe the geometric mapping between rays in object space and the rays leaving the sensor surface. Similarly, we can define the inverse functions.
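The following is a hedged sketch of the geometric mapping $G_x$, $G_r$ just described: a sensor-side ray is mapped to its intersection with the lens plane and to the direction towards the conjugate point on the plane of focus. The numerical values are illustrative, and the positive-depth form $d_o = f d_i/(d_i - f)$ of the focus distance is assumed.

```python
import numpy as np

z_hat = np.array([0.0, 0.0, 1.0])

def normalize(v):
    return v / np.linalg.norm(v)

def thin_lens_ray(x, r, d_i, d_o):
    """Map a sensor-side ray (x, r) with z_hat . x = -d_i to its object-side ray."""
    x = np.asarray(x, float)
    r = np.asarray(r, float)
    x_l = x + (d_i / np.dot(z_hat, r)) * r   # G_x: intersection with the lens plane z = 0
    x_f = -(d_o / d_i) * x                   # conjugate point on the plane of focus
    return x_l, normalize(x_f - x_l)         # (G_x, G_r)

d_i, f = 0.052, 0.050                        # illustrative values
d_o = f * d_i / (d_i - f)                    # plane of focus (positive-depth form assumed)
x = np.array([0.001, 0.0, -d_i])             # pixel position on the sensor
r = normalize(np.array([0.002, 0.0, d_i]))   # direction towards a point on the lens
print(thin_lens_ray(x, r, d_i, d_o))
```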
3.3 Radiometry

Since many of the sources of noise are dependent on the intensity of the light falling onto the sensor, we need to analyze how the amount of light depends on the size of the sensor surface as well as the aperture of the lens. Sensor response is a function of exposure, the integral of the irradiance at a point $x$ on the film plane over the time the shutter is open. If we assume constant irradiance over the shutter period (an assumption we have to relax later when using a moving sensor), then we have

\[ I_s(x;t_0) = \int_{t_0}^{t_0+T} E(x;t)\,dt = E(x;t_0)\,T \qquad (3.12) \]

where $E(x;t)$ is the irradiance at location $x$ and time $t$, $T$ is the exposure duration, and $H(x)$ is the exposure at $x$ and time $t_0$. $E(x;t)$ is the result of integrating the radiance at $x$ over the solid angle subtended by the exit pupil, which is modeled as a disk. Denoting the set of locations in this disk by $D$ we get

\[ E(x) = \int_{x'\in D} L\!\left(x;\ \frac{x'-x}{\|x'-x\|}\right) \frac{\cos(\alpha)\cos(\alpha')}{\|x'-x\|^2}\,dA \qquad (3.13) \]

where $L$ is the plenoptic function, $\alpha$ is the angle between the normal of the sensor surface and $x'-x$, $\alpha'$ is the angle between the normal of the aperture stop and $x-x'$, and $dA$ is the differential area. If the sensor is parallel to the disk, with $d_i$ the axial distance between lens and sensor, then Eq. (3.13) can be rewritten as

\[ E(x) = \frac{1}{f^2}\int_{x'\in D} L\!\left(x;\ \frac{x'-x}{\|x'-x\|}\right)\cos^4(\alpha)\,dA. \qquad (3.14) \]

The weighting in the irradiance integral leads to a varying irradiance across the imaging surface. If the exit pupil subtends a small angle from $x$, we can assume that $\alpha$ is constant and equal to the angle between $x$ and the center of the disk. This leads to

\[ E(x) = \frac{L A}{f^2}\cos^4(\alpha). \qquad (3.15) \]

This constant multiplicative factor can easily be accounted for in the imaging process by rescaling the pixel intensity values accordingly.

3.4 Noise characteristics of a CCD sensor

In the description of the different noise processes, we follow Healey and Kondepudy [57]. Other good descriptions of all the elements involved in the imaging process can be found in [74] and [31]. CCDs convert light intensity linearly into a sensor response, but we must account for the black level, intensity scaling, vignetting, etc. in the camera system. An overview of the process can be seen in Fig. 3.3 (redrawn from [149]), which shows the image formation process.

[Figure 3.3: CCD camera imaging pipeline (redrawn from [149]): scene radiance passes through atmospheric attenuation and the lens (geometric distortion) to give the camera irradiance; CCD imaging adds the Bayer pattern, fixed pattern noise, dark current, shot noise and thermal noise; white balancing, gamma correction, D/A and A/D transmission and interpolation yield the digitized image.]

The capture of irradiant light by a CCD sensor is analogous to measuring the amount of rainfall on a field by setting up a regular array of buckets to capture the rain per unit area during the storm and then measuring it by carrying the buckets one by one to a measuring unit. A CCD measures the amount of light falling on a thin wafer of silicon. The impact of a photon on the silicon creates an electron-hole pair (photo-electric effect). These free electrons are then collected at discretely spaced collection sites. Each collection site is formed by growing a thin layer of silicon dioxide on the silicon and depositing a conductive gate structure over the silicon oxide. If a positive electrical potential is applied to the gate, a depletion region is created that can hold the free electrons. By integrating the number of electrons stored at a collection site over a fixed time interval, the light energy is finally converted into an electronic representation.

A process called charge-coupling is used to transfer the stored charge from one collection site to an adjacent collection site.
The fraction of charge that can be effectively transported is known as the charge transfer efficiency of a device. An image is read out by transferring the charge packets integrated at a collection site in parallel along electron conducting channels that connect columns of collection sites. A serial output register with one element for each column receives a new row transfer after each parallel transport. In between parallel transfers, all the charge in the serial output register is transferred in sequence to an output amplifier that generates a signal proportional to the amount of charge. Once all the serial registers have been read out, the next parallel transfer happens. In analog video cameras the signal generated by the CCD is transformed into an analog signal and converted into the digital domain by a frame grabber. In modern digital cameras, the signal from the CCD is directly digitized by the camera electronics, and the frame grabber is not necessary anymore. In the next section, we will examine the different sources of error during this process that one would like to calibrate for during the radiometric calibration step.

Let us start by examining the signal at a single collection site. The signal at this site is proportional to the number of electrons that arrive at the site and can be described as:

\[ I = T \int_{\lambda}\int_{x\in X_i} E(x,y,\lambda)\,P(x,y)\,q(\lambda)\,dx\,dy\,d\lambda \qquad (3.16) \]

Here $I$ is again the intensity of the resulting signal, $T$ is the integration time (assuming constant irradiance over time), and $E(x,y,\lambda)$ denotes the spectral irradiance at the given location of the collection site, that is the integral over all the rays that pass through the lens aperture as described in Section 3.2:

\[ E(x,y,\lambda) = \int_{(u,v)\in A_l(x,y)} L(x,y,u,v,t)\,du\,dv. \qquad (3.17) \]

$P(x,y)$ is the spatial response of the collection site, and $q(\lambda)$ describes the quantum efficiency of the device, that is the ratio of electron flux to incident photon flux as a function of wavelength, which we had omitted in Eq. (3.2). We will now examine the different sources of noise in detail.

Fixed pattern noise (FPN)

Due to processing errors during CCD fabrication there exist small variations in quantum efficiency between different collection sites. Thus even if we have a completely uniform irradiance on the CCD array, we will get a non-uniform response across the array, which is known as fixed pattern noise. Although this fluctuation in quantum efficiency often depends on wavelength, we will disregard this wavelength dependence for now. We model the number of electrons collected at a site by $KI$, where $I$ is defined as in Eq. (3.16) and $K$ accounts for the difference between $q(\lambda)$ and $S(x,y)$ at the collection sites by scaling them appropriately. We assume that the mean of $K$ is 1 and that the spatial variance is small and given by $\sigma_K^2$.

Blooming

Usually one assumes that the number of electrons collected at each site is independent of the number of electrons collected at neighboring sites. This can be violated if a site is illuminated with enough energy to cause the potential well to overflow with charges, which then contaminate the wells at other collection sites. This process is called blooming and in bad cases can affect many surrounding sites in the neighborhood of an overexposed site. Modern cameras build special walls between the sites to reduce these effects. For now we will assume appropriate illumination and thus disregard this effect.

Thermal energy

Thermal energy in silicon generates free electrons. These free electrons are known as dark current.
They can also be stored at collection sites and therefore become indistinguishable from electrons due to light incidence. The expected number of dark electrons is proportional to the integration time $T$ and is highly temperature dependent. Cooling a sensor can reduce the dark current substantially. The dark current also varies from collection site to collection site. We will denote the noise from the dark current as $N_{DC}$.

Photon shot noise

Shot noise is a result of the quantum nature of light and expresses the uncertainty about the actual number of electrons stored at a collection site. The number of electrons that interact with the collection site follows a Poisson distribution, so that its variance equals its mean. The probability to measure a given intensity $I$ for the incoming radiance $L$ is thus given by:

\[ P_{\mathrm{Poisson}}(I = s) = \frac{L^s}{s!}\,e^{-L} \qquad (3.18) \]

with a mean $\mathrm{E}[P_{\mathrm{Poisson}}(I)] = L$ and a variance $\mathrm{Var}[P_{\mathrm{Poisson}}(I)] = L$. We therefore see that the signal-to-noise ratio is illumination and sensor size dependent and increases with the square root of the intensity. Shot noise is a fundamental limitation and cannot be eliminated. Since the dark current increases the number of electrons stored at a site, it will increase the mean and thus the variance of the number of electrons stored. The number of electrons integrated at a collection site is given by

\[ KI + N_{DC} + N_S \qquad (3.19) \]

where $N_S$ is the zero-mean Poisson shot noise with a variance that depends on the number of collected photo-electrons $KI$ and the number of dark current electrons $N_{DC}$.

Read-out noise

The charge transfer efficiency of a real CCD is not perfect and the noise due to charge transfer can be quantified. Modern CCDs, though, achieve transfer efficiencies on the order of 99.999%, so that we can safely disregard this effect. During the next step, the on-chip output amplifier converts the charge collected at each site into a measurable voltage. This generates zero-mean read noise $N_R$ that is independent of the number of collected electrons. Amplifier noise will dominate shot noise at low light levels and determines the read noise floor of the CCD. The voltage signal can then be transformed into a video signal which is electronically low-pass filtered to remove high-frequency noise. The gain from both output amplifier and camera circuitry is denoted by $A$. For a digital camera, the conversion to video is not necessary, so $A$ will only depend on the output amplifier. The video signal leaving the camera can then be described by

\[ V = (KI + N_{DC} + N_S + N_R)\,A \qquad (3.20) \]

Quantization error

To generate a digital image, the analog signal $V$ needs to be converted to a digital signal. This is either done using a frame grabber for analog video cameras or directly in the camera for digital cameras using an analog-to-digital (A/D) converter. The A/D converter approximates the analog voltage $V$ using an integer multiple of the quantization step $q$, so that each value of $V$ that satisfies $(n-0.5)q < V < (n+0.5)q$ is rounded to the digital value $D = nq$, where $n$ must satisfy $0 \le n \le 2^b$, with $b$ the number of bits used to represent $D$. To prevent clipping, $q$ and $b$ need to be chosen such that $\max(V) \le (2^b - 0.5)q$. This quantization step can be modeled by adding a noise term $N_Q$ that is a zero-mean random variable independent of $V$ and uniformly distributed over the range $[-0.5q, 0.5q]$, which results in a variance of $q^2/12$.
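Before turning to dynamic range, the noise model assembled so far (Eqs. (3.18) to (3.20) plus the quantization step) can be summarized in a small simulation sketch. All gain, dark-current, read-noise and quantization parameters below are invented for illustration and are not calibrated values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def ccd_measure(I, K_sigma=0.01, dark=20.0, read_sigma=5.0, gain=0.1,
                q=0.25, n_bits=10):
    """Simulate D = (K*I + N_DC + N_S + N_R)*A + N_Q for one collection site."""
    K = 1.0 + K_sigma * rng.standard_normal()        # fixed pattern noise, mean 1
    electrons = rng.poisson(K * I + dark)            # shot noise on photo- and dark electrons
    V = (electrons + read_sigma * rng.standard_normal()) * gain   # read noise and gain A
    n = np.clip(np.round(V / q), 0, 2 ** n_bits)     # A/D conversion with step q
    return n * q

samples = np.array([ccd_measure(1000.0) for _ in range(10000)])
print(f"mean digital value {samples.mean():.2f}, std {samples.std():.2f}")
```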
Another important influence on the sensor accuracy is the limited dynamic range of today's sensors, which is a major source of error in many applications. Since only a limited range of signal magnitudes can be encoded without local processing on the chip, we have to take into account the tradeoff between a high saturation threshold and a sufficient resolution at lower illumination levels. A detailed study of the benefits of high-dynamic-range imaging is beyond this thesis, and the reader is referred to the literature [92, 101].

Thus, in conclusion, a comprehensive model for the image formation at a single CCD location is given by

\[ D = V + N_Q = (KI + N_{DC} + N_S + N_R)\,A + N_Q \qquad (3.21) \]

3.5 Image formation in shift-invariant spaces

Since all natural signals have finite energy, we can represent a given instance of the space of light rays using the light field parameterization as an element of $L_2(\mathbb{R}^5)$, the space of measurable, square integrable functions defined on $\mathbb{R}^5$, and phrase the light field reconstruction problem as a function approximation problem in $L_2(\mathbb{R}^5)$ using recent results in approximation theory [144, 13].

[Figure 3.4: Image formation diagram showing object, lens and image plane with the object and image distances $d_o$ and $d_i$.]

We will restrict our analysis to regular arrays of densely spaced pinhole cameras and analyze how accurately a specified camera arrangement can reconstruct the continuous plenoptic function under a given model for the environment. We will use the two-plane parameterization and assume that the imaging elements of the camera sample the light field on a regular lattice in the 5-D space of light rays, which corresponds to a choice of camera spacing, image resolution, and frame rate. An example setup would be a set of cameras with their optical axes perpendicular to a plane containing the focal points (see Fig. 2.1). Then we can describe the 5D periodic lattice $\mathcal{A}$ using 5 vectors $[a_1, a_2, \ldots, a_5]$ which form a lattice matrix $A$ such that $\mathcal{A} = \{Ak \mid k \in \mathbb{Z}^5\}$. A unique tiling of the space of light rays can be achieved by associating with each lattice site a Voronoi cell, which contains all points that are closer to the given lattice site than to any other.

The camera output is modeled as the inner product of the light field with different translates of an analysis function $\psi$, which was described in the previous sections:

\[ \tilde c_\psi(k) = \int L(x)\,\psi(A^{-1}x - k)\,dx; \qquad \tilde c_\psi(k) \in l_2(\mathbb{Z}^5). \qquad (3.22) \]

The function $\psi : \mathbb{R}^5 \to \mathbb{R}$ is dilated by the dilation matrix $A^{-1}$ and sampled according to the lattice pattern $\mathcal{A}$, which results in Eq. (3.22). It models the effects of the pixel response function (PRF) such as scattering, blurring, diffraction, flux integration across the pixel's receptive field, shutter time, and other signal degradations, as explained in Section 3.1. As described in Section 3.4, the conversion of the irradiance on the sensor surface into an electrical energy introduces noise into the measurement process, which we denote by $N(\tilde c_\psi, k)$. The value measured by the sensor is then given by:

\[ c_\psi(k) = \tilde c_\psi(k) + N(\tilde c_\psi, k) \qquad (3.23) \]

Since we are interested in a continuous reconstructed light field $I(x)$, we will express it as a linear combination of synthesis functions $\varphi : \mathbb{R}^5 \to \mathbb{R}$ centered on the lattice points. Thus $I(x)$ is represented as

\[ I(x) = \sum_{k\in\mathbb{Z}^5} c_\varphi(k)\,\varphi(A^{-1}x - k); \qquad c_\varphi(k) \in l_2(\mathbb{Z}^5) \qquad (3.24) \]

where $l_2$ is the space of square-summable sequences ($\sum_{k\in\mathbb{Z}^5}|c(k)|^2 < \infty$).
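A minimal one-dimensional analogue of the analysis/synthesis model of Eqs. (3.22) to (3.24): samples are taken by integrating against a box analysis function on a regular lattice, and the signal is re-expanded in shifted linear B-splines. Both kernels and the lattice spacing are illustrative choices, not the camera prefilters or lattices analyzed in this thesis.

```python
import numpy as np

a = 0.05                                      # lattice spacing (1-D stand-in for the matrix A)
x = np.linspace(0.0, 1.0, 5000)
dx = x[1] - x[0]
L = np.cos(2 * np.pi * 3 * x) + 0.2 * np.cos(2 * np.pi * 9 * x)   # toy light field slice

psi = lambda u: (np.abs(u) <= 0.5).astype(float)   # analysis: box (pixel integration)
phi = lambda u: np.maximum(1.0 - np.abs(u), 0.0)   # synthesis: linear B-spline

ks = np.arange(0, int(1.0 / a) + 1)
c = np.array([np.sum(L * psi(x / a - k)) * dx / a for k in ks])    # normalized analysis, Eq. (3.22)

I = sum(c[i] * phi(x / a - k) for i, k in enumerate(ks))           # reconstruction, Eq. (3.24)

print(f"max reconstruction error: {np.max(np.abs(I - L)):.3f}")
```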
If $p > 1$, then the Minkowski inequality states that

\[ \left(\int_a^b |f(x) + g(x)|^p\,dx\right)^{\frac{1}{p}} \le \left(\int_a^b |f(x)|^p\,dx\right)^{\frac{1}{p}} + \left(\int_a^b |g(x)|^p\,dx\right)^{\frac{1}{p}} \]

Similarly, if $p > 1$ and $a_k, b_k > 0$, then the Minkowski inequality is also valid for sums of sequences and we have

\[ \left(\sum_{k=1}^n (a_k + b_k)^p\right)^{\frac{1}{p}} \le \left(\sum_{k=1}^n a_k^p\right)^{\frac{1}{p}} + \left(\sum_{k=1}^n b_k^p\right)^{\frac{1}{p}} \]

A.1.5 Sobolev Space

Let $r$ be a positive real number and $\Omega \subset \mathbb{R}^n$. The Sobolev space $W^r_p$ is defined as

\[ W^r_p(\Omega) := \{ f \in L_p(\Omega) \mid \partial^\alpha f \in L_p(\Omega)\ \forall\ \text{multi-index}\ \alpha,\ |\alpha| \le r \} \]

In the following we will work with functions $f \in W^r_2$ with $r > \frac{1}{2}$. This implies the requirement of Hölder continuity with exponent $r - \frac{1}{2}$.

It is of interest how well a given synthesis function is able to approximate a given function. We will measure this using the $L_2$-norm of the difference between the original function and its approximation, given by

\[ \epsilon_f = \| f - D_{(A,B)} f \|_{L_2} \qquad (A.23) \]

This error term $\epsilon_f$ can be split into two terms: one dominating main term that expresses the idea of an average error over all possible phase positions of a signal, and a perturbation. The main term is computed by integrating $|\hat f(v)|^2$, where $\hat f$ is the Fourier transform of $f$, against an error kernel of the form

\[ E(v) = \left| 1 - \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) \right|^2 + \sum_{k\in\mathbb{Z}^n\setminus\{0\}} \left| \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \qquad (A.24) \]

where $\gamma = |\det(A)|/|\det(B)|$. The additional correction term $e(f,A,B)$ depends on the Sobolev regularity exponent of the function to be approximated. In this chapter we will outline the steps to derive the quantification of the error with respect to camera spacing and frequency of the signal. This is based on the descriptions in the papers [13, 12]. They describe the 1D case, and we will extend it to the n-D case where the sampling pattern is given by a general scaling matrix $A$ and sampling matrix $B$.

Theorem A.4. (Adapted from [12]) For all $f \in W^r_2$ with $r > 1/2$ the approximation error averaged over all possible phase shifts of the function $f$ with regard to the sampling operator $D_{(A,B)}$ is given by

\[ \epsilon_f = \left\langle \| f - D_{(A,B)} f \|_{L_2} \right\rangle = \left[ \int |\hat f(v)|^2\, E(A^T v)\, dv \right]^{1/2} \qquad (A.25) \]

The proof consists of three parts, which we will detail in the next subsections.

A.2.3 $l_2$ Convergence of the Samples

First we will define the following $B^{-T}$-periodic functions

\[ U(v) = \frac{1}{\det(B)} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} m) \qquad (A.26) \]
\[ V(v) = \frac{1}{\det(B)} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\psi}(A^T v + A^T B^{-T} m) \qquad (A.27) \]

and prove that these functions are well defined and belong to $L_2(V_{B^{-T}})$, where $V_{B^{-T}}$ is the Voronoi cell defined by the lattice matrix $B^{-T}$, that is $V_{B^{-T}} := \{ B^{-T} x \mid x \in [-1/2, 1/2]^n \}$. If this is the case, then we can develop these $B^{-T}$-periodic functions into a Fourier series

\[ U(v) = \sum_{k\in\mathbb{Z}^n} a_k \exp(2\pi i\, v^T B k) \qquad (A.28) \]
\[ a_k = \int_{V_{B^{-T}}} |\det(B)|\,\frac{1}{|\det(B)|} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} m)\,\exp(2\pi i\, v^T B k)\,dv \qquad (A.29) \]
\[ = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat\varphi}(A^T v)\,\exp(2\pi i\, v^T B k)\,dv \qquad (A.30) \]

Using the fact that the inverse Fourier transform of the product of one function and the complex conjugate of another in the Fourier domain equals the correlation of the two functions in the signal domain, that is

\[ \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat\varphi}(A^T v)\,\exp(-2\pi i\, v^T(-Bk))\,dv = \mathcal{F}^{-1}\!\left\{ \hat f(v)\,\overline{\hat\varphi}(A^T v) \right\}\!(-Bk) = \frac{1}{|\det(A)|}\,(f \star \varphi\circ A^{-1})(-Bk) = \frac{1}{|\det(A)|} \int_{\mathbb{R}^n} f(\xi)\,\varphi(A^{-1}(\xi - Bk))\,d\xi \]

we can write down an alternative definition for $U$ and $V$ given by

\[ U(v) = \sum_k \left[ \frac{1}{|\det(A)|} \int f(\xi)\,\varphi(A^{-1}(\xi - Bk))\,d\xi \right] \exp(2\pi i\, v^T B k) \]
\[ V(v) = \sum_k \left[ \frac{1}{|\det(A)|} \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi \right] \exp(2\pi i\, v^T B k) \qquad (A.31) \]

Remark A.5.
The following section has not been checked yet! We will prove the well-posedness of the function $U$ as follows (the procedure for $V$ is exactly the same). First we will define the functional sequence

\[ U_K(v) = \frac{1}{|\det(B)|} \sum_{|k| \le K} \hat f(v + B^{-T} k)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k) \]

for $K \in \mathbb{N}^n_+$. We would like to prove that $U_K$ is a Cauchy sequence, which means we need to show that

\[ \lim_{K\to\infty}\ \sup_{K' > K}\ \| U_{K'} - U_K \|_{L_2(I)} = 0 \]

By the Riesz–Fischer theorem this will automatically prove the convergence of $U_K$ towards an $L_2(\mathbb{R}^n)$ function $U$. We choose $K' > K$, and assume that $f \in W^r_2(\mathbb{R}^n)$ with $r > \frac{1}{2}$ and that $\|\hat\varphi\|_\infty \le C < \infty$. Then we define the set of bandpass functions $f_m$, which are defined as

\[ \hat f_m(v) = \begin{cases} \hat f(v) & \text{if } \dfrac{m}{2|\det(B)|} \le \|v\| < \dfrac{m+1}{2|\det(B)|} \\[4pt] 0 & \text{elsewhere} \end{cases} \qquad (A.32) \]

which allows us to write

\[ U_{K'}(v) - U_K(v) = \frac{1}{|\det(B)|} \sum_{m\ge 0}\ \sum_{K < |k| \le K'} \hat f_m(v + B^{-T} k)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k) \]

Not all $f_m$ contribute to the sum on the right-hand side because of their limited support. We only need to consider the terms with $m > 2K$; then, by applying the Minkowski inequality and the bound on $\hat\varphi$ assumed before, we find that

\[ \| U_{K'}(v) - U_K(v) \|_{L_2(I)} \le \frac{C}{|\det(A)|} \sum_{m > 2K} \| f_m \|_{L_2} \]

Looking at the definition of $f_m$ in (A.32), we see that on the support of $f_m$ we have $\|2\pi v\|^{-r} \le (|\det(B)|/(m\pi))^r$, so that we can bound the norm of $f_m$ by the norm of its $r$-th derivative

\[ \| f_m \|_{L_2} \le \left( \frac{|\det(B)|}{m\pi} \right)^{r} \| f_m^{(r)} \|_{L_2} \]

Using the Cauchy–Schwarz inequality for discrete sequences, we find that

\[ \sum_{m > 2K} \| f_m \|_{L_2} \le \left( \frac{\det(B)}{\pi} \right)^{r} \sqrt{ \sum_{m > 2K} m^{-2r} }\ \| f^{(r)} \|_{L_2} \]

This expression tends to zero as $K$ goes to infinity, therefore $U_K$ is a functional Cauchy sequence. This makes sure that the sum of the squared samples converges and proves that $U \in L_2(I)$.

A.2.4 Expression of $\epsilon_f$ in Fourier Variables

In this section, we will expand the $L_2$-error $\epsilon_f$ into three terms

\[ \epsilon_f^2 = \| f \|^2_{L_2} - 2\langle f, D_{(A,B)} f \rangle + \| D_{(A,B)} f \|^2_{L_2} \]

and examine each term. We will start with Eq. (A.20),

\[ D_{(A,B)} f(x) = \frac{1}{|\det(A)|} \sum_k \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,\varphi(A^{-1}(x - Bk))\,d\xi \]

We see that part of the integral looks a lot like the expression for $V$ in Eq. (A.31). We now express $\varphi(A^{-1}(x - Bk))$ in terms of its Fourier transform

\[ \varphi(A^{-1}(x - Bk)) = |\det(A)| \int \hat\varphi(A^T v)\,\exp(2\pi i\, v^T B k)\,\exp(-2\pi i\, x^T v)\,dv \]

and then we can manipulate Eq. (A.20):

\[ D_{(A,B)} f(x) = \int \left[ \sum_k \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi\ \exp(2\pi i\, v^T B k) \right] \hat\varphi(A^T v)\,\exp(-2\pi i\, x^T v)\,dv = |\det(A)| \int V(v)\,\hat\varphi(A^T v)\,\exp(-2\pi i\, x^T v)\,dv \]

We can see that $D_{(A,B)} f(x)$ and $|\det(A)|\,V(v)\,\hat\varphi(A^T v)$ are Fourier transforms of each other and thus have the same $L_2$-norm. Remembering that $V(v)$ is $B^{-T}$-periodic, we can write

\[ \| D_{(A,B)} f(x) \|^2_{L_2} \qquad (A.33) \]
\[ = |\det(A)|^2 \int \overline{V(v)}\,\overline{\hat\varphi}(A^T v)\,\hat\varphi(A^T v)\,V(v)\,dv = |\det(A)|^2 \sum_k \int_{V_{B^{-T}}} \overline{V(v)}\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k)\,V(v)\,dv \]
\[ = |\det(A)|^2 \int_{V_{B^{-T}}} \overline{V(v)}\,A(A^T v)\,V(v)\,dv \qquad (A.34) \]

where

\[ A(A^T v) = \sum_k \overline{\hat\varphi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k) \qquad (A.35) \]

The same trick (splitting or reassembling the infinite integral by using the periodicity) can now be applied to write the product of the two infinite sums in $V(v)$ as a combination of a sum and an infinite integral.
Starting from

\[ \| D_{(A,B)} f \|^2_{L_2} = |\det(A)|^2 \int_{V_{B^{-T}}} \overline{V(v)}\,A(A^T v)\,V(v)\,dv \]
\[ = \frac{|\det(A)|^2}{|\det(B)|^2} \int_{V_{B^{-T}}} \left[ \sum_k \hat f(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k) \right] A(A^T v) \left[ \sum_k \overline{\hat f}(v + B^{-T} k)\,\hat\psi(A^T v + A^T B^{-T} k) \right] dv \]
\[ = \frac{|\det(A)|^2}{|\det(B)|^2} \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v)\,A(A^T v)\,\hat\psi(A^T v + A^T B^{-T} k)\,dv \qquad (A.36) \]

A similar approach can now be applied to $\langle f, D_{(A,B)} f \rangle$. We have

\[ \langle f, D_{(A,B)} f \rangle = \int_{\mathbb{R}^n} f(x) \left[ \frac{1}{|\det(A)|} \sum_{k\in\mathbb{Z}^n} \int_{\mathbb{R}^n} f(\xi)\,\psi(A^{-1}(\xi - Bk))\,\varphi(A^{-1}(x - Bk))\,d\xi \right] dx = \sum_k \left[ \int_{\mathbb{R}^n} f(x)\,\varphi(A^{-1}(x - Bk))\,dx \right] \left[ \frac{1}{|\det(A)|} \int_{\mathbb{R}^n} f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi \right] \]

We now use the previous derivation to write

\[ \langle f, D_{(A,B)} f \rangle = \int_{\mathbb{R}^n} f(x) \left[ |\det(A)| \int_{\mathbb{R}^n} V(v)\,\hat\varphi(A^T v)\,\exp(2\pi i\, x^T v)\,dv \right] dx = |\det(A)| \int_{\mathbb{R}^n} \left[ \int_{\mathbb{R}^n} f(x)\,\exp(2\pi i\, x^T v)\,dx \right] V(v)\,\hat\varphi(A^T v)\,dv = |\det(A)| \int_{\mathbb{R}^n} \hat f(v)\,V(v)\,\hat\varphi(A^T v)\,dv \]

This can now be further simplified:

\[ \langle f, D_{(A,B)} f \rangle = |\det(A)| \int_{V_{B^{-T}}} \sum_k \hat f(v + B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k)\,V(v)\,dv = |\det(A)|\,|\det(B)| \int_{V_{B^{-T}}} U(v)\,V(v)\,dv \]

Again, we write $U$ and $V$ as sums:

\[ = \frac{|\det(A)|}{|\det(B)|} \int_{V_{B^{-T}}} \left[ \sum_k \hat f(v + B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k) \right] \left[ \sum_k \hat f(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k) \right] dv = \frac{|\det(A)|}{|\det(B)|} \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\hat\varphi(A^T v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,dv \]

Thus we can conclude that the final error measure consists of the following terms (we write $\gamma = |\det(A)|/|\det(B)|$):

\[ \epsilon_f^2 = \| f \|^2_{L_2} - 2\langle f, D_{(A,B)} f \rangle + \| D_{(A,B)} f \|^2_{L_2} \]
\[ = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v)\,dv - 2\gamma \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v)\,dv + \gamma^2 \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,A(A^T v)\,\hat\psi(A^T v)\,dv \]

We will now collect the terms for $k = 0$ and $k \ne 0$:

\[ \epsilon_f^2 = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v)\left[ 1 - 2\gamma\,\overline{\hat\psi}(A^T v)\,\hat\varphi(A^T v) + \gamma^2\,\overline{\hat\psi}(A^T v)\,A(A^T v)\,\hat\psi(A^T v) \right] dv \qquad (A.37) \]
\[ + \sum_{k\ne 0} \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\left[ \gamma^2\,A(A^T v)\,\hat\psi(A^T v) - 2\gamma\,\hat\varphi(A^T v) \right] dv = \epsilon_1^2 + \epsilon_2^2 \qquad (A.38) \]

Then we can rewrite $\epsilon_1$ as the integration of $|\hat f(v)|^2$ against the error kernel $E(A^T v)$, where the error kernel becomes:

\[ E(v) = 1 - 2\gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) + \gamma^2\,\overline{\hat\psi}(v)\,A(v)\,\hat\psi(v) = 1 - 2\gamma\,\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\} + \gamma^2\,|\hat\psi(v)\,\hat\varphi(v)|^2 + \gamma^2 \sum_{k\ne 0} \left| \hat\psi(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \]
\[ = \left| 1 - \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) \right|^2 + \sum_{k\ne 0} \left| \gamma\,\hat\psi(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \]

If we use orthonormalized basis functions, we get a simpler expression for the error kernel. That is, we normalize $\varphi$ and $\psi$ by premultiplying with $1/\sqrt{A(v)}$, where

\[ A(v) = \sum_k \hat\varphi(v + A^T B^{-T} k)\,\overline{\hat\varphi}(v + A^T B^{-T} k) = \sum_k |\hat\varphi(v + A^T B^{-T} k)|^2 \qquad (A.39) \]

is the Fourier transform of the sampled auto-correlation sequence $a_\varphi(k)$ of $\varphi$, $a_\varphi(k) = \langle \varphi(x),\, \varphi(x - A^{-1} B k) \rangle$, to compute the orthonormalized basis functions. The error kernel can then be rewritten as

\[ E(v) = 1 - 2\gamma\,\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\} + |\gamma\,\hat\psi(v)|^2\, A(v) \]
\[ = 1 - \frac{|\hat\varphi(v)|^2}{A(v)} + A(v)\left( \frac{|\hat\varphi(v)|^2}{A(v)^2} - 2\gamma\,\frac{\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\}}{A(v)} + |\gamma\,\hat\psi(v)|^2 \right) \]
\[ = 1 - \frac{|\hat\varphi(v)|^2}{A(v)} + A(v)\left| \gamma\,\hat\psi(v) - \frac{\hat\varphi(v)}{A(v)} \right|^2 \]
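To make the error kernel concrete, the following hedged sketch evaluates $E(v)$ of Eq. (A.24) numerically in the simplest one-dimensional setting: unit lattice ($\gamma = 1$), point sampling ($\psi = \delta$, so $\hat\psi = 1$) and a linear B-spline synthesis function ($\hat\varphi(v) = \mathrm{sinc}^2(v)$). These kernel choices are illustrative and are not the camera prefilters studied in the thesis.

```python
import numpy as np

def phi_hat(v):
    return np.sinc(v) ** 2          # Fourier transform of the linear B-spline

def psi_hat(v):
    return np.ones_like(v)          # ideal point sampling (delta analysis function)

def error_kernel(v, n_alias=50):
    """E(v) = |1 - conj(psi_hat)*phi_hat|^2 + sum_{k != 0} |conj(psi_hat(v)) phi_hat(v+k)|^2."""
    main = np.abs(1.0 - np.conj(psi_hat(v)) * phi_hat(v)) ** 2
    alias = sum(np.abs(np.conj(psi_hat(v)) * phi_hat(v + k)) ** 2
                for k in range(-n_alias, n_alias + 1) if k != 0)
    return main + alias

v = np.linspace(0.0, 0.5, 6)
for vi, Ei in zip(v, error_kernel(v)):
    print(f"E({vi:.1f}) = {Ei:.4f}")
# E(0) = 0: a signal with all its energy at zero frequency is reproduced exactly.
```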
A.2.5 Average Approximation Error

This result is very powerful because it does not just give an upper bound on the error; it is actually the exact expression for the average error over all the possible phase shifts of the function $f$ [12], which means we have the identity

\[ \int_{V_B} \| f_u - D_{(A,B)} f_u \|_{L_2}\,du = \left[ \int |\hat f(v)|^2\,E(A^T v)\,dv \right]^{1/2} \qquad (A.40) \]

where $f_u := f(x - u)$.

Proof. The shift by $u$ corresponds to a multiplication of $\hat f(v)$ by $\exp(2\pi i\, u^T v)$. The first error term $\epsilon_1^2$ is unchanged, since $|\hat f(v)|^2$ is not affected by this modulation. The second error term $\epsilon_2^2$ is premultiplied by a factor of the form $\exp(2\pi i\, u^T B^T k)$, which is independent of $v$ and can thus be moved in front of the integral. The integral $\int_{V_B} \exp(2\pi i\, u^T B^T k)\,du$ equals 0 for all $k \ne 0$, and thus $\epsilon_2^2$ vanishes.

BIBLIOGRAPHY

[1] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early vision. In M. Landy and J. A. Movshon, editors, Computational Models of Visual Processing, pages 3–20. MIT Press, Cambridge, MA, 1991.

[2] E. H. Adelson and J. Y. A. Wang. Single lens stereo with a plenoptic camera. IEEE Trans. PAMI, 14:99–106, 1992.

[3] G. Adiv. Inherent ambiguities in recovering 3D motion and structure from a noisy flow field. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 70–77, 1985.

[4] P. Baker, R. Pless, C. Fermuller, and Y. Aloimonos. Camera networks for building shape models from video. In Workshop on 3D Structure from Multiple Images of Large-scale Environments (SMILE 2000), 2000.

[5] P. Baker, R. Pless, C. Fermuller, and Y. Aloimonos. A spherical eye from multiple cameras (makes better models of the world). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[6] Patrick Baker and Yiannis Aloimonos. Translational camera constraints on parallel lines. In Proc. Europ. Conf. Computer Vision, page in press, 2004.

[7] G. Baratoff and Y. Aloimonos. Changes in surface convexity and topology caused by distortions of stereoscopic visual space. In Proc. European Conference on Computer Vision, volume 2, pages 226–240, 1998.

[8] S. S. Beauchemin and J. L. Barron. On the Fourier properties of discontinuous motion. Journal of Mathematical Imaging and Vision (JMIV), 13:155–172, 2000.

[9] P. J. Besl. Active optical range imaging sensors. Machine Vision Appl., 1(2):127–152, 1988.

[10] F. Blais. A review of 20 years of range sensor development. In Videometrics VII, Proceedings of SPIE-IST Electronic Imaging, volume 5013, pages 62–76, 2003.

[11] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. PAMI, 25(9), 2003.

[12] T. Blu and M. Unser. Approximation error for quasi-interpolators and (multi-) wavelet expansions. Applied and Computational Harmonic Analysis, 6(2):219–251, March 1999.

[13] T. Blu and M. Unser. Quantitative Fourier analysis of approximation techniques: Part I, Interpolators and projectors. IEEE Transactions on Signal Processing, 47(10):2783–2795, October 1999.

[14] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7–55, 1987.

[15] R. C. Bolles, H. H. Baker, and D. H. Marimont.
Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7?55, 1987. [16] J. S. De Bonet and P. Viola. Roxels: Responsibility weighted 3d volume reconstruc- tion. In Proceedings of ICCV, September 1999. [17] O. Bottema and B. Roth. Theoretical Kinematics. North-Holland, 1979. [18] T.J.Broida,S.Chandrashekhar, andR.Chellappa. Recursive3-Dmotionestimation from a monocular image sequence. IEEE Transactions on Aerospace and Electronic Systems, 26(4):639?656, 1990. [19] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estia- tion based on a theory for warping. In Proc. European Conference on Computer Vision, volume 4, pages 25?36, 2004. [20] A. Bruss and B. K. P. Horn. Passive navigation. CVGIP: Image Understanding, 21:3? 20, 1983. [21] P. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communication, 31, 1983. [22] E. Camahort and D. Fussell. A geometric study of light field representations. Technical Report TR99-35, Dept. of Computer Sciences, The University of Texas at Austin, 1999. [23] E. Camahort, A. Lerios, and D. Fussell. Uniformly sampled light fields. In Euro- graphics Workshop on Rendering, pages 117?130, 1997. 199 [24] C. Capurro, F. Panerai, and G. Sandini. Vergence and tracking fusing log-polar images. In Proc. International Conference on Pattern Recognition, 1996. [25] R. L. Carceroni and K. Kutulakos. Multi-view scene capture by surfel sampling: From video streams to non-rigid 3d motion, shape, and reflectance. In Proc. Inter- national Conference on Computer Vision, June 2001. [26] J. Chai and H. Shum. Parallel projections for stereo reconstruction. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 493 ?500, 2000. [27] J. Chai, X. Tong, and H. Shum. Plenoptic sampling. In Proc. of ACM SIGGRAPH, pages 307?318, 2000. [28] W. Chen, J. Bouguet, M. Chu, and R. Grzeszczuk. Light field mapping: efficient representation and hardware rendering of surface light fields. ACM Transactions on Graphics (TOG), 21(3):447?456, 2002. [29] Amit Roy Chowdhury and Rama Chellappa. Stochastic approximation and rate- distortion analysis for robust structure and motion estimation. International Journal of Computer Vision, 55(1):27?53, October 2003. [30] G.M. Cortelazzo and M. Balanza. Frequency domain analysis of translations with piecewise cubic trajectories. PAMI, 15(4):411?416, April 1993. [31] R. Costantini and S. S?usstrunk. Virtual sensor design. In Proc. IS&T/SPIE Elec- tronic Imaging 2004: Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications V, volume 5301, pages 408?419, 2004. 200 [32] S. Crossley, N.A. Thacker, and N.L. Seed. Benchmarking of bootstrap temporal stereo using statistical and physical scene modelling. In British Machine Vision Con- ference, pages 346?355, 1998. [33] K. Daniilidis. Zur Fehlerempfindlichkeit in der Ermittlung von Objektbeschreibungen und relativen Bewegungen aus monokularen Bildfolgen. PhD thesis, Fakult?at f?ur Infor- matik, Universit?at Karlsruhe (TH), 1992. [34] K. Daniilidis and M. Spetsakis. Understanding noise sensitivity in structure from motion. In Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, chapter 4, pages 61?88. Lawrence Erlbaum Associates, Hillsdale, NJ, 1997. [35] K. Daniilidis and M. E. Spetsakis. Understanding noise sensitivity in structure from motion. In Y. 
Aloimonos, editor, Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, Advances in Computer Vision, chapter 4. Lawrence Erlbaum Associates, Mahwah, NJ, 1997. [36] J. Davis, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: A unifying framework for depth from triangulation. In 2003 Conference on Computer Vision and Pattern Recognition (CVPR 2003), pages 359?366, June 2003. [37] R. Dawkins. Climbing Mount Improbable. Norton, New York, 1996. [38] D. Demirdjian and T. Darrell. Motion estimation from disparity images. In Proc. International Conference on Computer Vision, volume 1, pages 213?218, July 2001. [39] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Com- putation in Neural Systems, 6(3):345?358, 1995. 201 [40] A.Edelman,T.Arian,andS.Smith. Thegeometryofalgorithmswithorthogonality constraints. SIAM J. Matrix Anal. Appl., 1998. [41] I.A. Essa and A.P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. PAMI, 19:757?763, 1997. [42] H. Farid and E. Simoncelli. Range estimation by optical differentiation. Journal of the Optical Society of America, 15(7):1777?1786, 1998. [43] O. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In Proc. European Conference on Computer Vision, pages 379?393, Freiburg, Germany, 1998. [44] O. D. Faugeras. Three-Dimensional Computer Vision. MIT Press, Cambridge, MA, 1992. [45] O.D. Faugeras, F. Lustman, and G. Toscani. Motion and structure from motion from point and line matches. In Proc. Int. Conf. Computer Vision, pages 25?34, 1987. [46] C. Ferm?uller and Y. Aloimonos. Ambiguity in structure from motion: Sphere ver- sus plane. International Journal of Computer Vision, 28(2):137?154, 1998. [47] C. Ferm?uller and Y. Aloimonos. Observability of 3D motion. International Journal of Computer Vision, 37:43?63, 2000. [48] P. Fua. Regularized bundle-adjustment to model heads from image sequences without calibration data. International Journal of Computer Vision, 38:153?171, 2000. [49] A. Gershun. The light field. Journal of Mathematics and Physics, 12:51?151, 1939. 202 [50] C. Geyer and K. Daniilidis. Catadioptric projective geometry. International Journal of Computer Vision, 43:223?243, 2001. [51] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The lumigraph. In Proceedings of ACM SIGGRAPH 96, Computer Graphics (Annual Conference Series), pages 43? 54, New York, 1996. ACM, ACM Press. [52] M. D. Grossberg and S. K. Nayar. A general imaging model and a method for finding its parameters. In Proc. International Conference on Computer Vision, pages 108?115, 2001. [53] K.J. Hanna and N.E. Okamoto. Combining stereo and motion analysis for direct estimation of scenestructure. In Proc. International Conference on Computer Vision, pages 357?365, 1993. [54] R. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2000. [55] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cam- bridge University Press, Cambridge, UK, 2000. [56] H. Haussecker and B. J?ahne. A tensor approach for precise computation of dense displacement vector fields. In DAGM Symposium, pages 199?208, September 1997. [57] G. Healey and R. Kondepudy. Radiometric ccd camera calibration and noise esti- mation. IEEE Trans. PAMI, 16(3):267?276, 1994. [58] Andrew Hicks. arXiv preprint cs.CV/0303024. 203 [59] R. Andrew Hicks and Ronald K. Perline. Geometric distributions for catadioptric sensor design. In Proc. 
IEEE Conference on Computer Vision and Pattern Recognition, pages 584?589, 2001. [60] H. Hoppe, T. DeRose, T. Duchamp, M. Halstead, H. Jin, J. McDonald, J. Schweitzer, and W. Stuetzle. Piecewise smooth surface reconstruction. In Proc. of ACM SIG- GRAPH, pages 295?302, 1994. [61] B. K. P. Horn. Robot Vision. McGraw Hill, New York, 1986. [62] B. K. P. Horn. Robot Vision. McGraw Hill, New York, 1986. [63] Berthold K.P. Horn. Parallel networks for machine vision. Technical report, MIT A.I. Lab, 1988. [64] J. Huang, A. Lee, and D. Mumford. Statistics of range images. In Proc. IEEE Con- ference on Computer Vision and Pattern Recognition, 2000. [65] A. Hubeli and M. Gross. A survey of surface representations for geometric mod- eling. Technical Report 335, ETH Z?urich, Institute of Scientific Computing, March 2000. [66] F. Huck, C. Fales, and Z. Rahman. Visual Communication. Kluwer, Boston, 1997. [67] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Proc. International Conference on Computer Vision, Corfu, Greece, 1999. [68] B. J?ahne, J. Haussecker, and P. Geissler, editors. Handbook on Computer Vision and Applications. Academic Press, Boston, 1999. 204 [69] H. W. Jensen, S. R. Marschner, M. Levoy, and P. Hanrahan. A practical model for subsurface light transport. In Proc. of ACM SIGGRAPH, 2001. [70] A. D. Jepson and D. J. Heeger. Subspace methods for recovering rigid motion II: Theory. Technical Report RBCV-TR-90-36, University of Toronto, 1990. [71] H. Jin, S. Soatto, and A. J. Yezzi. Multi-view stereo beyond lambert. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 171?178, June 2003. [72] J. T. Kajiya. The rendering equation. In Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pages 143?150. ACM Press, 1986. [73] I.A.KakadiarisandD.Metaxas. Three-dimensionalhumanbodymodelacquisition from multiple views. International Journal of Computer Vision, 30:191?218, 1998. [74] G. Kamberova. Understanding the systematic and random errors in video sensor data. [75] L. Kobbelt, T. Bareuther, and H.-P. Seidel. Multiresolution shape deformations for meshes with dynamic vertex connectivity. In Computer Graphics Forum 19 (2000), Eurographics ?00 issue, pages 249?260, 2000. [76] C. Kolb, D. Mitchell, and P. Hanrahan. A realistic camera model for computer graphics. In Proc. of ACM SIGGRAPH, pages 317?324, 1995. [77] J. Kosecka, Y. Ma, and S. Sastry. Optimization criteria, sensitivity and robustness of motion and structure estimation. In Vision Algorithms Workshop, ICCV, 1999. [78] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38:199?218, 2000. 205 [79] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. PAMI, 16:150?162, 1994. [80] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of ACM SIG- GRAPH 96, Computer Graphics (Annual Conference Series), pages 161?170, New York, 1996. ACM, ACM Press. [81] Y. Liu and T. S. Huang. Estimation of rigid body motion using straight line corre- spondences. CVGIP: Image Understanding, 43(1):37?52, 1988. [82] B. London, J. Upton, K. Kobre, and B. Brill. Photography. Prentice Hall Inc., Upper Saddle River, NJ, 7 edition, 2002. [83] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133?135, 1981. [84] H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal im- age. Proc. R. Soc. 
London B, 208:385?397, 1980. [85] C. Loop. Smooth subdivision surfaces based on triangles. Master?s thesis, Univer- sity of Utah, 1987. [86] Y.Ma, K.Huang, R.Vidal, J.Kosecka, andS.Sastry. Rankconditiononthemultiple view matrix. International Journal of Computer Vision, 59(2), 2004. [87] Y. Ma, J. Kosecka, and S. Sastry. Optimization criteria and geometric algorithms for motion and structure estimation. International Journal of Computer Vision, 44(3):219? 249, 2001. [88] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer Verlag, 2003. 206 [89] Y. Ma, R. Vidal, S. Hsu, and S. Sastry. Optimal motion from multiple views by normalized epipolar constraints. Communications in Information and Systems, 1(1), 2001. [90] S. Malassiotis and M.G. Strintzis. Model-based joint motion and structure estima- tion from stereo images. Computer Vision and Image Understanding, 65:79?94, 1997. [91] C. Mandal, H. Qin, and B.C. Vemuri. Physics-based shape modeling and shape recovery using multiresolution subdivision surfaces. In Proc. of ACM SIGGRAPH, 1999. [92] S. Mann. Pencigraphy with acg: joint parameter estimation in both domain and range of functions in same orbit of projective-wyckoff group. Technical Report 384, MIT Media Lab, December 1994. also appears in: Proceedings of the IEEE International Conference on Image Processing (ICIP?96), Lausanne, Switzerland, September 16?19, 1996, pages 193?196. [93] S. Mann and S. Haykin. The chirplet transform: Physical considerations. IEEE Trans. Signal Processing, 43:2745?2761, November 1995. [94] W. Matusik, C. Buehler, S. J. Gortler, R. Raskar, and L. McMillan. Image based visual hulls. In Proc. of ACM SIGGRAPH, 2000. [95] S. J. Maybank. The angular velocity associated with the optical flowfield arising from motion through a rigid environment. Proc. R. Soc. London A, 401:317?326, 1985. [96] S. J. Maybank. Algorithm for analysing optical flow based on the least-squares method. Image and Vision Computing, 4:38?42, 1986. 207 [97] P. Meer. Robust techniques for computer vision. In G. Medioni and S. B. Kang, editors, Emerging Topics in Computer Vision. Prentice Hall, 2004. [98] P. Moon and D.E. Spencer. The Photic Field. MIT Press, Cambridge, 1981. [99] T. Naemura, T. Yoshida, and H. Harashima. 3-d computer graphics based on inte- gral photography. Optics Express, 10:255?262, 2001. [100] S. Nayar. Catadioptric omnidirectional camera. In Proc. IEEE Conference on Com- puter Vision and Pattern Recognition, pages 482?488, 1997. [101] S. K. Nayar and V. Branzoi. Adaptive dynamic range imaging: Optical control of pixel exposures over space and time. In Proc. International Conference on Computer Vision, Nice, France, 2003. [102] S. K. Nayar, V. Branzoi, and T. Boult. Programmable imaging using a digital mi- cromirror array. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2004. [103] J. Neumann and C. Ferm?uller. Plenoptic video geometry. Visual Computer, 19(6):395?404, 2003. [104] J. Neumann, C. Ferm?uller, and Y. Aloimonos. Eyes from eyes: New cameras for structure from motion. In IEEE Workshop on Omnidirectional Vision 2002, pages 19? 26, 2002. [105] J. Neumann, C. Ferm?uller, and Y. Aloimonos. Polydioptric camera design and 3d motion estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recogni- tion, volume II, pages 294?301, 2003. 208 [106] J. Neumann, C. Ferm?uller, Y. Aloimonos, and V. Brajovic. Compound eye sensor for 3d ego motion estimation. 
In accepted for presentation at the IEEE International Conference on Robotics and Automation, 2004. [107] B. Newhall. Photosculpture. Image, 7(5):100?105, 1958. [108] F.E.Nicodemus,J.C.Richmond,J.J.Hsia,L.W.Ginsberg,andT.Limperis. Geometric Considerations and Nomenclature for Reflectance. National Bureau of Standards (US), 1977. [109] Editors of Time Life, editor. Light and Film. Time Inc., 1970. [110] T. Okoshi. Three-dimensional Imaging Techniques. Academic Press, 1976. [111] J. Oliensis. The error surface for structure from motion. Neci tr, NEC, 2001. [112] T. Pajdla. Stereo with oblique cameras. International Journal of Computer Vision, 47(1/2/3):161?170, 2002. [113] S. Peleg, B. Rousso, A. Rav-Acha, andA. Zomet. Mosaicing onadaptive manifolds. IEEE Trans. on PAMI, pages 1144?1154, October 2000. [114] M. Peternell and H. Pottmann. Interpolating functions on lines in 3-space. In Ch. Rabut A. Cohen and L.L. Schumaker, editors, Curve and Surface Fitting: Saint Malo 1999, pages 351?358. Vanderbilt Univ. Press, Nashville, TN, 2000. [115] R. Pl?ankers and P. Fua. Tracking and modeling people in video sequences. Interna- tional Journal of Computer Vision, 81:285?302, 2001. [116] R. Pless. Using many cameras as one. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 587?593, 2003. 209 [117] M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In ICCV, pages 90? 95, 1998. [118] H. Pottmann and J. Wallner. Computational Line Geometry. Springer Verlag, Berlin, 2001. [119] K. Prazdny. Egomotion and relative depth map from optical flow. Biological Cyber- netics, 36:87?102, 1980. [120] P. Rademacher and G. Bishop. Multiple-center-of-projection images. In Proceed- ings of ACM SIGGRAPH 98, ComputerGraphics(AnnualConferenceSeries), pages 199?206, New York, NY, 1998. ACM, ACM Press. [121] W.Reichardt. Movementperceptionininsects. InWernerReichardt,editor,Process- ing of Optical Information by Organisms and Machines, pages 465?493. Academic Press, 1969. [122] W. Reichardt. Evaluation of optical motion information by movement detectors. Journal of Comparative Physiology A, 161:533?547, 1987. [123] W. Reichardt and R. W. Schl?ogl. A two dimensional field theory for motion com- putation. Biological Cybernetics, 60:23?35, 1988. [124] J. P. Richter, editor. The Notebooks of Leonardo da Vinci, volume 1, p.39. Dover, New York, 1970. [125] M Rioux. Laser range finder based on synchronized scanners. Applied Optics, 23(21):3837?3844, 1984. 210 [126] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two- frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):17?42, 2002. [127] Yoav Y. Schechner and Nahum Kiryati. Depth from defocus vs. stereo: How differ- ent really are they? International Journal of Computer Vision, 89:141?162, 2000. [128] P. Schr?oder and D. Zorin. Subdivision for modeling and animation. Siggraph 2000 Course Notes, 2000. [129] S. Seitz. The space of all stereo images. In Proc. International Conference on Computer Vision, pages 307?314, 2001. [130] S. Seitz and C. Dyer. Photorealistic scene reconstruction by voxel coloring. Interna- tional Journal of Computer Vision, 25, November 1999. [131] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 593 ? 600, 1994. [132] H.Y. Shum, A. Kalai, and S. M. Seitz. Omnivergent stereo. In Proc. 
International Conference on Computer Vision, pages 22?29, 1999. [133] D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. In Proceed- ings IEEE Conf. on Computer Vision and Pattern Recognition, 2000. [134] M. E. Spetsakis and Y. Aloimonos. Structure from motion using line correspon- dences. International Journal of Computer Vision, 4:171?183, 1990. [135] M. E. Spetsakis and Y. Aloimonos. A multi-frame approach to visual motion per- ception. International Journal of Computer Vision, 6(3):245?255, 1991. 211 [136] H. Spies, B. J?ahne, and J.L. Barron. Regularised range flow. In Proc. European Con- ference on Computer Vision, June 2000. [137] J. Stam. Evaluation of loop subdivision surfaces. SIGGRAPH?99 Course Notes, 1999. [138] G.P. Stein and A. Shashua. Model-based brightness constraints: on direct estima- tion of structure and motion. IEEE Trans. PAMI, 22(9):992?1015, 2000. [139] G.W. Stewart. Stochastic perturbation theory. SIAM Review, 32:576?610, 1990. [140] C. Strecha and L. Van Gool. Motion-stereo integration for depth estimation. In ECCV, volume 2, pages 170?185, 2002. [141] R. Swaminathan, M. D. Grossberg, and S. K. Nayar. Framework for designing catadioptric imaging and projection systems. In International Workshop on Projector- Camera Systems, ICCV 2003, 2003. [142] J.TannerandC.Mead. Anintegratedanalogopticalmotionsensor. InR.W.Broder- sen and H.S. Moscovitz, editors, VLSI Signal Processing, volume 2, pages 59?87. IEEE, New York, 1988. [143] G. Taubin. A signal processing approach to fair surface design. In Proc. of ACM SIGGRAPH, 1995. [144] P. Th?evenaz, T. Blu, and M. Unser. Interpolation revisited. IEEE Transactions on Medical Imaging, 19(7):739?758, July 2000. [145] T. Tian, C. Tomasi, and D. Heeger. Comparison of approaches to egomotion com- putation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 315?320, June 1996. 212 [146] C. Tomasi and T. Kanade. Shape and motion from image streams under orthogra- phy: A factorization method. International Journal of Computer Vision, 9(2):137?154, 1992. [147] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjust- ment ? a modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in LNCS, pages 298?373, Corfu, Greece, September 1999. Springer-Verlag. [148] R.Y.TsaiandT.S.Huang. Uniquenessandestimationofthree-dimensionalmotion parameters of rigid objects with curved surfaces. IEEE Trans. PAMI, 6(1):13?27, 1984. [149] Y. Tsin, V. Ramesh, and T. Kanade. Statistical calibration of ccd imaging process. In Proc. International Conference on Computer Vision, volume 1, pages 480?487, 2001. [150] M. Unser and A. Aldroubi. A general sampling theory for nonideal acquisition devices. IEEE Transactions on Signal Processing, 42(11):2915?2925, November 1994. [151] D. Van De Ville, T. Blu, and M. Unser. Recursive filtering for splines on hexagonal lattices. InProceedingsoftheTwenty-EighthIEEEInternationalConferenceonAcoustics, Speech, and Signal Processing (ICASSP?03), volume III, pages 301?304, 2003. [152] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In Proc. International Conference on Computer Vision, Corfu,Greece, September 1999. 213 [153] S. Vedula, S. Baker, S. Seitz, and T. Kanade. Shape and motion carving in 6d. In Proc.IEEEConferenceonComputerVisionandPatternRecognition,HeadIsland,South Carolina, USA, June 2000. [154] S. Vedula, P. Rander, H. Saito, and T. Kanade. 
Modeling, combining, and rendering dynamic real-world events from image sequences. In Proc. of Int. Conf. on Virtual Systems and Multimedia, November 1998. [155] R. Vidal and S. Sastry. Segmentation of dynamic scenes from image intensities. In IEEE Workshop on Vision and Motion Computing, pages 44?49, 2002. [156] G. Ward. Measuring and modeling anisotropic reflection. In Proc. of ACM SIG- GRAPH, volume 26, pages 265?272, 1992. [157] J. Weng, T.S. Huang, and N. Ahuja. Motion and Structure from Image Sequences. Springer-Verlag, 1992. [158] B. Wilburn, M. Smulski, H.-K. Lee, and M. Horowitz. The light field video camera. In Proceedings of Media Processors. SPIE Electronic Imaging, 2002. [159] D. N. Wood, A. Finkelstein, J. F. Hughes, C. E. Thayer, and D. H. Salesin. Multiper- spective panoramas for cel animation. Proc. of ACM SIGGRAPH, pages 243?250, 1997. [160] G.YoungandR.Chellappa. 3-dmotionestimationusingasequenceofnoisystereo images: Models, estimation, and uniqueness results. IEEE Trans. PAMI, 12(8):735? 759, 1990. [161] Jingyi Yu and Leonard McMillan. General linear cameras. In 8th European Confer- ence on Computer Vision, ECCV 2004, Prague, Czech Republic, 2004. 214 [162] C. Zhang and T. Chen. Spectral analysis for sampling image-based rendering data. IEEE Trans. on Circuits and Systems for Video Technology: Special Issue on Image-based Modeling, Rendering and Animation, 13:1038?1050, Nov 2003. [163] C. Zhang and T. Chen. A survey on image-based rendering - representation, sam- pling and compression. EURASIP Signal Processing: Image Communication, 19(1):1? 28, 2004. [164] L.Zhang,B.Curless,andS.M.Seitz. Spacetimestereo: Shaperecoveryfordynamic scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 367?374, 2003. [165] Y. Zhang and C. Kambhamettu. Integrated 3d scene flow and structure recovery from multiview image sequences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages II:674?681, Hilton Head, 2000. [166] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. International Journal of Image and Vision Computing, 15(1):59?76, 1997. [167] W. Zhao, R. Chellappa, J. Phillips, and A. Rosenfeld. Face recognition in still and video images: A literature surrey. ACM Computing Surveys, pages 399?458, 2003. [168] A. Zomet, D. Feldman, S. Peleg, and D. Weinshall. Mosaicing new views: The crossed-slits projection. IEEE Trans. PAMI, pages 741?754, 2003. [169] D. Zorin, P. Schr?oder, and W. Sweldens. Interactive multiresolution mesh editing. In Proc. of ACM SIGGRAPH, 1997. 215