ABSTRACT

Title of dissertation: COMPUTER VISION IN THE SPACE OF LIGHT RAYS: PLENOPTIC VIDEO GEOMETRY AND POLYDIOPTRIC CAMERA DESIGN

Jan Neumann, Doctor of Philosophy, 2004

Dissertation directed by: Professor Yiannis Aloimonos, Department of Computer Science

Most of the cameras used in computer vision, computer graphics, and image processing applications are designed to capture images that are similar to the images we see with our eyes. This enables an easy interpretation of the visual information by a human observer. Nowadays though, more and more processing of visual information is done by computers. Thus, it is worth questioning if these human-inspired "eyes" are the optimal choice for processing visual information using a machine.

In this thesis I will describe how one can study problems in computer vision without reference to a specific camera model by studying the geometry and statistics of the space of light rays that surrounds us. The study of the geometry will allow us to determine all the possible constraints that exist in the visual input and could be utilized if we had a perfect sensor. Since no perfect sensor exists, we use signal processing techniques to examine how well the constraints between different sets of light rays can be exploited given a specific camera model. A camera is modeled as a spatio-temporal filter in the space of light rays, which lets us express the image formation process in a function approximation framework. This framework then allows us to relate the geometry of the imaging camera to the performance of the vision system with regard to the given task.

In this thesis I apply this framework to the problem of camera motion estimation. I show how, by choosing the right camera design, we can solve for the camera motion using linear, scene-independent constraints that allow for robust solutions. This is compared to motion estimation using conventional cameras. In addition, we show how we can extract spatio-temporal models from multiple video sequences using multi-resolution subdivision surfaces.

COMPUTER VISION IN THE SPACE OF LIGHT RAYS: PLENOPTIC VIDEO GEOMETRY AND POLYDIOPTRIC CAMERA DESIGN

by Jan Neumann

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Professor Yiannis Aloimonos, Chair and Advisor
Professor Rama Chellappa, Dean's Representative
Professor Larry Davis
Professor Hanan Samet
Professor Amitabh Varshney

© Copyright by Jan Neumann 2004

ACKNOWLEDGMENTS

I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever. First and foremost I'd like to thank my advisor, Professor Yiannis Aloimonos, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past seven years, while still being free to follow my own paths. He has always given me help and advice, and there has never been an occasion when he hasn't given me time to discuss personal and research questions with him.

I also want to thank Professor Rama Chellappa, Professor Larry Davis, Professor Amitabh Varshney, and Professor Hanan Samet for agreeing to serve on my thesis committee and for sparing their invaluable time reviewing the manuscript.
I also want to thank the late Professor Azriel Rosenfeld, whose guidance and enthusiasm for the field of computer vision made the computer vision lab at the University of Maryland the inspiring place that it is. My colleagues and friends at the computer vision lab have enriched my graduate life in many ways and deserve a special mention: Patrick Baker, Tomas Brodsky, Filip Defoort, Cornelia Fermüller, Gutemberg Guerra, Abhijit Ogale, and Robert Pless. I also want to thank my friends Yoni Wexler, Darko Sedej, Gene Chipmann, Nizar Habash, Adam Lopez, Supathorn Phongikaroon and Theodoros Salonidis for sharing the graduate student experience with all its highs and lows with me, and for understanding if I could not always spend as much time with them as they deserve. I also want to thank Kostas Daniilidis for the many illuminating discussions about the structure of the space of light rays.

I dedicate this thesis to my family - my mother Insa, father Joachim and brother Lars, who have always stood by me and guided me through my career, and especially to my wife Danitza, who has supported me during the most stressful times of my PhD and was willing to accept the negative consequences of being with a PhD student. Thank you.

In the end, I would also like to acknowledge the financial support from the Department of Computer Science and the Graduate School for their two fellowships, and the Deutsche Studienstiftung for their summer schools that truly enriched my studies during the early years of my graduate school career.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
  1.1 Why study cameras in the space of light rays?
  1.2 Example Application: Dynamic 3D photography
  1.3 Plenoptic video geometry: How is information encoded in the space of light rays?
  1.4 Polydioptric Camera Design
  1.5 Overview of the thesis
2 Plenoptic Video Geometry: The Structure of the Space of Light Rays
  2.1 Preliminaries
  2.2 Representations for the space of light rays
    2.2.1 Plenoptic Parametrization
    2.2.2 Plücker coordinates
    2.2.3 Light field parameterization
  2.3 Information content of plenoptic subspaces
    2.3.1 One-dimensional Subspaces
    2.3.2 Two-dimensional Subspaces
    2.3.3 Three-dimensional Subspaces
    2.3.4 Four-dimensional Subspaces
    2.3.5 Five-dimensional Subspaces
  2.4 The space of images
  2.5 The Geometry of Epipolar Images
    2.5.1 Fourier analysis of the epipolar images
    2.5.2 Scenes of constant depth
    2.5.3 A slanted surface in space
    2.5.4 Occlusion
  2.6 Extension of natural image statistics to light field statistics
  2.7 Summary
3 Polydioptric Image Formation: From Light Rays to Images
  3.1 The Pixel Equation
  3.2 Optics of the lens system
  3.3 Radiometry
  3.4 Noise characteristics of a CCD sensor
  3.5 Image formation in shift-invariant spaces
  3.6 Summary
4 Polydioptric Sampling Theory
  4.1 Plenoptic Sampling in Computer Graphics
  4.2 Quantitative plenoptic approximation error
  4.3 Evaluation of Approximation Error based on Natural Image Statistics
  4.4 Non-uniform Sampling
  4.5 Summary
5 Application: Structure from Motion
  5.1 Plenoptic Video Geometry: How is 3D motion information encoded in the space of light rays?
  5.2 Ray incidence constraint
  5.3 Ray Identity Constraint
    5.3.1 Discrete plenoptic motion equations
    5.3.2 Differential plenoptic motion equations
    5.3.3 Differential Light Field Motion Equations
    5.3.4 Derivation of light field motion equations
  5.4 Variational Formulation of the Plenoptic Structure from Motion
  5.5 Feature computation in the space of light rays
  5.6 How much do we need to see of the world?
  5.7 Influence of the field of view on the motion estimation
  5.8 Sensitivity of motion and depth estimation using perturbation analysis
  5.9 Stability Analysis using the Cramer-Rao lower bound
  5.10 Experimental validation of plenoptic motion estimation
    5.10.1 Polydioptric sequences generated using computer graphics
    5.10.2 Polydioptric sequences generated by resampling epipolar volumes
    5.10.3 Polydioptric sequences captured by a multi-camera rig
    5.10.4 Comparison of single view and polydioptric cameras
  5.11 Summary
6 Application: Spatio-temporal Reconstruction from Multiple Video Sequences
  6.1 Preliminaries
  6.2 Multi-Resolution Subdivision Surfaces
  6.3 Subdivision
  6.4 Smoothing and Detail Computation
  6.5 Synthesis
  6.6 Encoding of Detail in Local Frames
  6.7 Multi-Camera Shape and Motion Estimation
  6.8 Shape Initialization
  6.9 Motion Initialization
  6.10 Shape and Motion Refinement through Spatio-Temporal Stereo
  6.11 Results
  6.12 Hierarchies of cameras for 3D photography
  6.13 Conclusion and Future Work
7 Physical Implementation of a Polydioptric Camera
  7.1 The Physical Implementation of Argus and Polydioptric Eyes
  7.2 The plenoptic camera by Adelson and Wang
  7.3 The optically differentiating sensor by Farid and Simoncelli
  7.4 Compound Eyes of Insects
8 Conclusion
A Mathematical Tools
  A.1 Preliminaries
    A.1.1 Fourier Transform
    A.1.2 Poisson Summation Formula
    A.1.3 Cauchy-Schwartz Inequality
    A.1.4 Minkowski Inequality
    A.1.5 Sobolev Space
  A.2 Derivation of Quantitative Error Bound
    A.2.1 Definitions for Sampling and Synthesis Functions
    A.2.2 Hypotheses
    A.2.3 l2 Convergence of the Samples
    A.2.4 Expression of $\epsilon_f$ in Fourier Variables
    A.2.5 Average Approximation Error
Bibliography

LIST OF TABLES

2.1 Information content of the axis-aligned plenoptic subspaces, Part I
2.2 Information content of axis-aligned plenoptic subspaces, Part II
5.1 Quantities that can be computed using the basic light ray constraints for different camera types. ? denotes that two quantities cannot be independently estimated without assumptions about scene or motion.
5.2 Rigid motion constraint equations for plenoptic subspaces of different dimensions. In each row the corresponding subspace for each motion constraint equation is in bold letters.
5.3 Motion Flow Constraint Equations for Rigid Motion Estimation
5.4 Brightness constancy constraint equations for rigid motion estimation. Since we have the brightness invariance along the ray, which can be expressed as $\nabla_r L^T r = \nabla_x L^T r = 0$ and $\nabla_r L^T r = \nabla_m L^T r = 0$, we can omit the projection operators in the constraint equations.
5.5 Single term $M^T_{c_i,p_i} M_{c_i,p_i}$ of the Fisher Information Matrix for Rigid Motion Estimation

LIST OF FIGURES

1.1 Michael Land's landscape of eye evolution.
1.2 The information pipeline in computer vision. Usually computer vision starts from images, but this already assumes a choice of camera. To abstract the notion of a specific sensing device, we need to analyze vision problems in the space of light rays.
1.3 (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar image volume formed by the image sequence where each voxel corresponds to a unique light ray. The top half of the volume has been cut away to show how a row of the image changes when the camera translates. (c) A row of an image taken by a pinhole camera at two time instants (red and green) corresponds to two non-overlapping horizontal line segments in the epipolar plane image, while in (d) the collection of corresponding "rows" of a polydioptric camera at two time instants corresponds to two rectangular regions of the epipolar image that do overlap (yellow region). This overlap enables us to estimate the rigid motion of the camera purely based on the visual information recorded.
1.4 (a) Hierarchy of Cameras for 3D Motion Estimation. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.
1.5 (a) Design of a Polydioptric Camera (b) capturing Parallel Rays and (c) simultaneously capturing Pencil of Rays.
2.1 (a) Parameterization of light rays passing through a volume in space by recording the intersection of the light rays with the faces of a surrounding cube. (b) Lightfield Parameterization.
2.2 (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar plane image volume formed by the image sequence where the top half of the volume has been cut away to show how a row of the image changes when the camera translates.
2.3 (a) Light Ray Correspondence (here shown only for the light field slice spanned by axes x and u). (b) Fourier spectrum of the (x,u) light field slice with choice of "optimal" depth z_opt for the reconstruction filter.
2.4 (a) The ratio between the magnitude of perspective and orthographic derivatives is linear in the depth of the scene. (b) The scale at which the perspective and orthographic derivatives match depends on the depth (matching scale in red).
2.5 (a) Epipolar image of a scene consisting of two translucent fronto-parallel planes. (b) The Fourier transform of (a). Notice how the energy of the signal is concentrated along two lines.
2.6 (a) Epipolar image of a scene where one signal occludes the other. (b) The Fourier transform of (a). Notice the ringing effects parallel to the occluding signal.
2.7 (a) Epipolar image of a complex scene with many occlusions. (b) The Fourier transform of (a). There are multiple ringing effects parallel to all the occluding signals.
3.1 Imaging pipeline expressed in a function approximation framework
3.2 (a) Pinhole and (b) Thin-Lens Camera Geometry
3.3 CCD Camera Imaging Pipeline (redrawn from [149])
3.4 Image formation diagram.
4.1 Every imaging device can only capture the plenoptic function with limited precision. By extending signal processing techniques to the space of light rays we can determine how accurately we can measure the radiance of light rays at non-sample locations.
4.2 (a) Light Ray Correspondence (here shown only for the light field slice spanned by axes x and u). (b) Fourier spectrum of the (x,u) light field slice with choice of "optimal" depth z_opt for the reconstruction filter. (c) Optimal reconstruction filter in Fourier space.
5.1 Illustration of ray incidence and ray identity: (a-b) A multi-perspective system of cameras observes an object in space while undergoing a rigid motion. Each individual camera sees a scene point on the object from a different view point, which makes the correspondence of the rays depend on the scene geometry (ray incidence). (c) By computing the inter- and intra-camera correspondences of the set of light rays between the two time instants, we can recover the motion of the camera without having to estimate the scene structure, since we correspond light rays and not views of scene points (ray identity).
5.2 Examples of the three different camera types: (a) Pinhole camera, (b) Spherical Argus eye, and (c) Polydioptric lenslet camera.
5.3 Deviation from the epipolar constraints for motion estimation from individual cameras (from [5])
5.4 Combination of error residuals for a six camera Argus eye (from [5])
5.5 Singular values of matrix A for different camera setups in dependence on the field of view of the individual cameras: (a) single camera, (b) two cameras facing forward and backward, (c) two cameras facing forward and sideways, and (d) three cameras facing forward, sideways, and upward
5.6 (a) Hierarchy of Cameras for 3D Motion Estimation and (b) 3D Shape estimation. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.
5.7 (a) Subset of an Example Scene, (b) the corresponding light field, (c) Accuracy of Plenoptic Motion Estimation. The plot shows the ratio of the true and estimated motion parameters (vertical axis) in dependence of the distance between the sensor surface and the scene (horizontal axis) for f = 60 and spheres of unit radius.
5.8 Evaluation of camera designs by resampling epipolar volumes. (a) By sampling brightness values in the epipolar volume at different spacings along the view dimensions we can simulate a variety of camera arrangements. (b) Example: 29 camera system with noticeable aliasing between the views.
5.9 Relationship between camera spacing, image smoothing and accuracy of motion estimation based on integrating over many image sequences generated from an epipolar volume. The standard deviation of the Gaussian smoothing filter increases from left to right from 1 over 5 to 11 pixels.
5.10 (a) Example indoor scene for motion estimation. The camera can be seen on the table. It was moved in a planar motion on the table. (b) The epipolar image that was recovered after compensating for varying translation and rotation of the camera. The straight lines indicate that the motion has been accurately recovered. (c) Recovered depth model.
5.11 Example outdoor scene. The camera was moved in a straight line while being turned.
5.12 Motion estimation results for example outdoor scene. The comparison of the parking lot geometry with the recovered path indicates that the motion was accurately recovered.
5.13 Comparison of motion estimation using single and multi-perspective cameras. The errors in the correspondences varied between 0 and 0.04 radians (0-2 degrees), and we computed the mean squared error in the translation and rotation direction for 3 different camera fields of view (30, 60, and 90 degrees). The blue line (+) and green line (squares) are Jepson-Heeger's subspace algorithm and Kanatani's normalized minimization of the instantaneous differential constraint as implemented by [145]. The red lines denote the performance of motion estimation using Eq. (5.19) where the errors in the disparities are normally distributed with a standard deviation that varies between 0 and 5 degrees.
6.1 (a) Sketch of multi camera dome in the Keck Lab to capture a large environment and (b) calibrated camera setup for detailed capture of small objects
6.2 Stencils for Loop Refinement and 1-4 Triangle Split
6.3 Neighbouring patches that influence the (a) 3D limit shape and (b) 3D limit motion field of the spatio-temporal surface
6.4 Control Meshes at different Resolutions and their detail differences
6.5 Detail Encoding in Global and Local Coordinate Systems
6.6 Histogram encoding the magnitude of the normal (black) and tangential (gray, white) components of the detail vectors in the multi-resolution encoding of the mesh
6.7 Flow chart describing the multi-resolution analysis and synthesis
6.8 (a) Multi Camera Silhouette Intersection from (b) Image Silhouettes
6.9 (a) 3D Normal Flow Constraint. The planes formed by corresponding image edges and the optical centers intersect in a line in space. (b) By integrating over a surface patch we can estimate the full 3D Flow
6.10 Four Example Input Views
6.11 Results of 3D Structure and Motion Flow Estimation: Structure. (a-c) Three Novel Views from the Spatio-Temporal Model (d) Left View of Control Mesh (e) Right Side of Final Control Mesh (f) Close Up of Face
6.12 Results: Motion Flow. (a-c) Magnitude of Motion Vectors at Different Levels of Resolution (d) Magnitude of Non-Rigid 3D Flow Summed over the Sequence (e) Non-Rigid 3D Motion Flow (f) Non-Rigid Flow Close Up of Mouth
6.13 (a) Triangulation-Correspondence Tradeoff in dependence on the baseline between cameras. For a small baseline system, we can solve the correspondence problem easily, but have a high uncertainty in the depth structure. For a large baseline system this uncertainty is much reduced, but the correspondence problem is harder. (b) Motion and Shape can be reconstructed directly from feature traces in spatio-temporal and epipolar image volumes.
7.1 Design of a Polydioptric Camera (a) capturing Parallel Rays (b) and simultaneously capturing a Pencil of Rays (c).
7.2 A highly integrated tennis-ball-sized Argus eye.
7.3 Block diagram of an Argus eye.
7.4 Forming a kind of polydioptric eye.
7.5 Plenoptic projection geometry for micro-lenses.
7.6 Plenoptic image processing.
7.7 (a) The inter-ommatidial derivatives L_u and L_v are computed within each ommatidial image. Intra-ommatidial derivatives L_x and L_y are computed across ommatidial images. The constant matrix M_i, which in general will be different for each pixel, encapsulates all geometric relationships of how individual ommatidial images are formed. (b) A curved compound eye design for an ego motion sensor. Gradient index (GRIN) lenses are arranged along a curved surface, forming optical images on a second curved surface; these two surfaces are used to parameterize the light field. The elemental images from each GRIN lens are brought to a planar sensor chip using coherent fiber optic bundles. (c) A two dimensional array of GRIN lenses for the compound eye 3D ego motion sensor.
7.8 The gradient update is computed at each pixel and proportional bits of current are summed into a global wire. One wire is used for each of the six motion parameters. Once the solution for the motion parameters is reached, based on minimizing Eq. (7.11), the gradients (i.e., the total current into each of the six wires) are zero, and thus the steady state voltage solution on the capacitors is reached. The voltage from the six wires is continuously read out from the sensor, representing the instantaneous 6DOF motion measurements. (The capacitors are explicitly drawn in this figure, although in an actual implementation the parasitic capacitance of the wires is sufficient.)

Chapter 1

Introduction

For most of us vision is a very natural process that happens unconsciously and allows us to navigate accurately through the world, detect moving objects and judge their speeds, recognize objects and actions, and interact with the world using visual feedback. Unfortunately, up to now nobody has been able to decipher how humans are able to perform their advanced vision tasks.
Therefore, when we think about vision algorithms, we look beyond the human example and utilize mathematical disciplines such as geometry, calculus, and statistics to find solutions to the tasks mentioned, although many of these methods do not directly map onto any existing biological circuits. Despite the advances in algorithm development, when we think about vision hardware, we still usually think of cameras that capture images similar to the images taken by (two) eyes such as our own - that is, images acquired by camera-type eyes based on the pinhole principle.

Basically all commercial photo and video cameras used in computer vision research, with the exception of a few examples we will study later, are primarily designed to capture images that are as similar as possible to the images that a human eye would capture if placed at the camera's position. This is advantageous because it enables an easy interpretation of the visual information by a human observer. This advantage, though, seems to be of less and less importance nowadays, because more and more image interpretation tasks become automated; humans will interpret less and less of the raw data that is coming from the cameras, but rather the processed output. Therefore, I believe that we should look beyond the pinhole camera principle, and design a camera in dependence on the task we want to perform. Human eyes are not the only types of eyes that exist; the biological world reveals a large variety of eye designs, many of which are not based on the pinhole principle.

Figure 1.1: Michael Land's landscape of eye evolution.

The biological world gives a good example of task specific eye design. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom [37], and these eye designs, and therefore the images they capture, are highly adapted to the tasks the animal has to perform. This evolution of eyes is nicely illustrated in Michael Land's landscape of eye evolution (Fig. 1.1), where every hill and mountain denotes a different independent eye design. This suggests that we should not just focus our efforts on designing algorithms that optimally process a given visual input, but also optimize the design of the imaging sensor with regard to the task at hand, so that the subsequent processing of the visual information is facilitated. The notion that we need to build accurate models of the information processing pipeline to optimize our choice of algorithm in dependence on the statistics of the environment has become more prominent lately ([32], [149]).

Figure 1.2: The information pipeline in computer vision. Usually computer vision starts from images, but this already assumes a choice of camera. To abstract the notion of a specific sensing device, we need to analyze vision problems in the space of light rays.

The exploration of new sensor designs has already begun. Technological advances make it possible to construct integrated imaging devices using electronic and mechanical micro-assembly, micro-optics, and advanced on-chip processing.
These devices are not only of the kind that exists in nature, such as log-polar retinas [24], but also of many other kinds, such as catadioptric cameras [100, 50]. Some initial work has begun on custom mirror design, where a mirror is machined such that the imaging geometry of the mirror-camera combination is optimized to approximate a predefined scene-to-image mapping [59, 141]. Nevertheless, a general framework to relate the design of an imaging sensor to its usefulness for a given task is still missing. As seen in Fig. 1.2, if we phrase our analysis of vision algorithms in terms of images, then we already have chosen a specific camera model to form an image. To abstract from a specific camera geometry, we need to base our analysis on the most general representation for visual information, that is, the space of light rays and its functional representation, the plenoptic function. By analyzing vision problems in the space of light rays, we can optimize over both the sensor and the algorithm to find an optimal solution. In this thesis I develop such a framework by studying the relationship between the subset of light rays captured by a camera and its performance with regard to a task.

In this thesis I will use the term polydioptric camera, introduced in [104], to denote a generalized camera that captures a multi-viewpoint subset of the space of light rays (dioptric: assisting vision by refracting and focusing light). The essential question of polydioptric camera design is how to find the camera design that allows us to perform the tasks of interest as well as possible. For this we need to assess the relative importance of spatial resolution and depth resolution with regard to the problem at hand, and how the best compromise between the two can be found.

This framework is able to simultaneously address the three problems crucial to the design of next generation imaging systems:

- The sampling problem. Which rays of light should be sampled to yield optimal information for a particular function, whether the task at hand is high resolution image reconstruction, imaging from different views, recovery of a scene's 3D structure, or detection of various salient features and targets? How can the appropriate sampling strategies be implemented in the prescribed form and fit? Generally, we are searching for a transform $\Phi$ that captures a set of light rays $L$ and generates signal measurements $I$, that is, $\Phi(L) = I$. $\Phi$ could be a compound transform to account for many optical centers and many different (potentially overlapping) fields of view. In general the transform $\Phi$ has a null space; therefore, direct inversion based on the signal measurements will not be possible (see the sketch after this list).

- The sensing problem. How should the optical signals be sensed so as to yield an optimal SNR, dynamic range, possibly salience-ordered pixel/region readout, programmable spatial resolution, and possible adaptation of detector properties? How do the selected sensor functions contribute to the prescribed system function?

- The processing problem. What kind of scene features can different camera designs extract? Some can only extract features based on texture cues (2D), others in addition also features based on shape cues (3D), as we discussed above. What are the mathematical tools that need to be developed? Or conversely, what sampling and sensing strategy is necessary to achieve efficient processing for a prescribed function? What processing hardware will be necessary to achieve these operations within a prescribed time? How can we analyze how much these features will help us?
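To make the null-space remark concrete, here is a minimal sketch (not from the thesis) in which the transform $\Phi$ is modeled as a linear operator whose rows average disjoint bundles of rays into pixel values; the 1D ray space, the bundle size, and the random radiance values are illustrative assumptions. The rank computation simply confirms that such a $\Phi$ discards information, so the full set of light rays cannot be recovered from the measurements alone.

```python
import numpy as np

# Toy model of the measurement transform Phi(L) = I from the sampling problem:
# a 1D "space of light rays" with n_rays radiance values, and a camera whose
# pixels each average a disjoint bundle of rays (an illustrative assumption).
n_rays, rays_per_pixel = 64, 4
n_pixels = n_rays // rays_per_pixel

# Phi is a linear operator: each row integrates one pixel's bundle of rays.
Phi = np.zeros((n_pixels, n_rays))
for p in range(n_pixels):
    Phi[p, p * rays_per_pixel:(p + 1) * rays_per_pixel] = 1.0 / rays_per_pixel

L = np.random.rand(n_rays)   # unknown radiance of each light ray
I = Phi @ L                  # signal measurements captured by the camera

# Phi maps 64 unknowns to 16 measurements, so it has a 48-dimensional null space:
# any ray pattern in that null space is invisible to the camera and cannot be
# recovered by direct inversion of the measurements.
rank = np.linalg.matrix_rank(Phi)
print(f"rank(Phi) = {rank}, null space dimension = {n_rays - rank}")
```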
These three problems are not orthogonal and must be addressed simultaneously. For example, efficient mathematics will demand a particular sampling of the light rays. Sensors may not be able to deliver sufficient SNR for a given sampling strategy. Therefore, the optimal solution will lie somewhere in the "space of compromises" across the three areas. In this thesis I will focus on the processing and sampling questions because they are the two properties that are much easier for the camera designer to modify than the underlying electronics on the sensor chip itself. Thus, we assume that the properties of the sensor element are fixed and can be described by a simple model.

To find the optimal solution and design a task specific camera, we need to answer the following two questions:

1. How is the relevant visual information that we need to extract to solve our task encoded in the visual data that a camera can capture?

2. What is the camera design and image representation that optimally facilitates the extraction of the relevant information?

To answer the first question, we first have to think about what we mean by visual information. When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own - that is, perspective images acquired by camera-type eyes based on the pinhole principle. These images enable an easy interpretation of the visual information by a human observer. Therefore, most work on sensor design has focused on designing cameras that would result in pictures with higher fidelity (e.g., [66]). Image fidelity has a strong impact on the accuracy with which we can make quantitative measurements of the world, but the qualitative nature of the image we capture (e.g., single versus multiple view point images) also has a major impact on the accuracy of measurements, which cannot be measured by a display-based fidelity measure. Since nowadays most processing of visual information is done by machines, there is no need to confine oneself to the usual perspective images. Instead, as motivated at the start of the introduction, we propose to study how the relevant information is encoded in the geometry of the time-varying space of light rays, which allows us to determine how well we can perform a task given any set of light ray measurements. In Chapter 2 we will examine the structure of the space of light rays and analyze what kind of information can be extracted about a scene. To answer the second question we have to determine how well a given eye can capture the necessary information. We can interpret this as an approximation problem where we need to assess how well the relevant subset of the space of light rays can be reconstructed based on the samples captured by the eye, our knowledge of the transfer function of the optical apparatus, and our choice of function space to represent the image. In Chapter 4 we will model eyes as spatio-temporal sampling patterns in the space of light rays, which allows us to use well developed tools from signal processing and approximation theory to evaluate the suitability of a given eye design for the proposed task and determine the optimal design. The answers to these two questions then allow us to define a fitness function for different camera designs with regard to a given task.

1.1 Why study cameras in the space of light rays?

A conventional camera observes the world only from a single effective view point. It is well known that due to the camera projection the range information about the world gets lost.
That means one can only capture an image of the two-dimensional properties of the object surfaces in the scene. Therefore, any information that is not uniquely determined by the scene texture itself cannot be accurately recovered without assumptions about the world. As an example, if we want to segment the captured image into image regions corresponding to the different objects in the scene, we have to make assumptions beforehand as to which texture belongs to which object.

In contrast, if a camera captures light from many view points, the correlations between the different images can be used to infer information about the three-dimensional structure of the world, for example for shape estimation and for the segmentation of the scene into distinct objects using occlusion events. In this case, if we want to segment the multi-perspective image according to the views of the objects we observed, we can utilize the intrinsic structure of the captured multi-perspective image to detect occlusion events and segment the image according to a combination of two-dimensional texture and three-dimensional depth cues.

If the camera is moving, so that instead of images we capture image sequences, we can use the correlations between different images that observe the same scene to infer more information about the world even with conventional cameras. The problem of structure from motion has been studied in depth [61, 55, 88], but it is still not sufficiently solved to allow for fool-proof algorithms. Thus, if we only rely on texture cues to detect independently moving objects or track objects, it is difficult to distinguish between changes in images due to moving objects or parallax-inducing 3D structures. It is essential that we estimate the motion of the camera, which is in general a highly non-linear problem since it involves the estimation of the 3D structure of the scene as well. The non-linear nature of the problem causes ambiguities in the solution, and due to the complexity of the estimation it has not been possible up to now to perform the motion estimation directly on a camera chip.

In contrast, a multi-perspective camera allows us to compute the rigid motion of a camera solely based on spatio-temporal image derivative measurements by solving a system of linear equations with few (six) unknowns, as we will show in Chapter 5. Efficient solvers for these problems have recently been implemented on cheap graphics hardware chips, thus allowing us to solve for the 3D motion of the camera directly inside the box using the camera hardware. In addition, the low-dimensionality and scene-independent formulation of the changes in the images due to the camera motion allow us to apply simple motion segmentation algorithms to the captured polydioptric images to detect and track moving objects.

These superior abilities of polydioptric imaging with regard to 3D scene structure estimation and segmentation can also be utilized to improve the resolution of the sensor through super-resolution techniques. In nearly all super-resolution techniques it is assumed that the scene consists of a single fronto-parallel plane and that there are no occlusions between different views. In reality this is not necessarily satisfied. A polydioptric camera is able to first compute an estimate of the depth structure of the scene and detect occlusion events, before the super-resolution system is solved with appropriate masking of the input and adaptation of the parameters modeling the imaging process.
1.2 Example Application: Dynamic 3D photography

To evaluate and compare different eye designs in a mathematical framework, I choose the recovery of spatio-temporal scene descriptions from image sequences, that is, structure from motion, as the problem of interest. There are many approaches to structure from motion (e.g., see [54, 88] for an overview), but essentially all these approaches disregard the fact that the way images are acquired already determines to a large degree how difficult it is to solve for the structure and motion of the scene. Since systems have to cope with limited resources, their cameras should be designed to optimize subsequent image processing.

Of most practical importance are the following two specific subcases of the structure from motion problem:

1. Static Structure from Dynamic Cameras: We are given a single camera (or a set of cameras) that moves rigidly in space. Based on the recorded image sequences we would like to estimate the camera motion and the properties of the static objects in the scene (such as shape and textures). If the scene properties are not invariant over time, then we need to use robust statistics to differentiate between the static and non-static parts of the scene.

2. Dynamic Structure from Static Cameras: Given a number of calibrated cameras in a known configuration that capture a sequence of images, find the shape of the objects, their surface properties, and the motion field on the object surface. For some example results see Chapter 6.

1.3 Plenoptic video geometry: How is information encoded in the space of light rays?

In this outline we will illustrate the concept of the polydioptric camera design by examining the plenoptic video geometry for the case of 3D ego-motion estimation. This analysis will be further refined in Chapter 5. At each location $x$ in free space the plenoptic function $L(x;r;t)$, defined on $\mathbb{R}^3 \times S^2 \times \mathbb{R}^+$, measures the radiance, that is, the light intensity or color, from a given direction $r$ at time $t$. Its range is the spectral energy, which equals $\mathbb{R}$ for monochromatic light, $\mathbb{R}^n$ for arbitrary discrete spectra, or could be a function space for a continuous spectrum. $S^2$ is the unit sphere of directions in $\mathbb{R}^3$.

Let us assume that the albedo of every scene point is invariant over time and that we observe a static world under constant illumination. In this case, the radiance of a light ray does not change over time, which implies that the total time derivative of the plenoptic function vanishes: $\frac{d}{dt}L(x;r;t) = 0$.

Figure 1.3: (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar image volume formed by the image sequence where each voxel corresponds to a unique light ray. The top half of the volume has been cut away to show how a row of the image changes when the camera translates. (c) A row of an image taken by a pinhole camera at two time instants (red and green) corresponds to two non-overlapping horizontal line segments in the epipolar plane image, while in (d) the collection of corresponding "rows" of a polydioptric camera at two time instants corresponds to two rectangular regions of the epipolar image that do overlap (yellow region). This overlap enables us to estimate the rigid motion of the camera purely based on the visual information recorded.
If the camera undergoes a rigid motion, then we can describe this motion by an opposite rigid coordinate transformation of the ambient space of light rays in the camera coordinate system. This rigid transformation, parameterized by the rotation matrix R(t) and a translation vector q(t) which maps the time-invariant space of light rays upon itself. Thus we have the following exact equality which we call the discrete plenoptic motion constraint L(R(t)x+q(t);R(t)r;t) = L(x;r;0) (1.1) We see that if a sensor is able to capture a continuous non-degenerate subset of the plenoptic function, then the problem of estimating the rigid motion of this sensor has be- comeanimageregistrationproblemthatisindependentofthescene. Thereforetheonlyfree parameters are the six degrees of freedom of the rigid motion. This global parametriza- tion of the plenoptic motion field by only six parameters leads to a highly constrained estimation problem that can be solved with any multi-dimensional image registration criterion. To illustrate this idea with an example, a camera is translated along the horizon- tal image axis and the images of the sequence are stacked to form an image volume (Figs. 1.3a-1.3b). Due to the horizontal translation scene points always project into the same row in each of the images. Such an image volume is known as an epipolar image volume [14] since corresponding rows lie all in the same epipolar plane. Each pixel in this volume corresponds to a unique light ray. A horizontal slice through the image volume (Figs. 1.3c-1.3d) is called an epipo- lar plane image and contains the light rays lying in this epipolar plane parameterized by view position and direction. A row of an image frame taken by a pinhole camera corresponds to a horizontal line segment in the epipolar plane image (Fig. 1.3c) because 12 we observe the light rays passing through a single point in space (single view point). In contrast, a polydioptric camera captures a rectangular area (multiple view points) of the epipolar image (Fig. 1.3d) because it captures light rays passing through a range of view points. Here we assumed that the viewpoint axis of the polydioptric camera is aligned with the direction of translation used to define the epipolar image volume. If this is not the case, the images can be warped as necessary. We see that a camera rotation around an axis perpendicular to the epipolar image plane corresponds to a horizontal shift of the camera image (change of view direction), while a translation of the camera parallel to an image row, causes a vertical shift (change of view point). These shifts can be different for each pixel depending on the rigid motion of the camera. If we want to recover this rigid transformation based on the images captured using a pinhole camera, we see in (Fig. 1.3c) that we have to match two non-overlapping sets of light rays (shown as a bright green and a dark red line) since each time a pinhole camera captures by definition only the view from a single viewpoint. Therefore, it is necessary for an accurate recovery of the rigid motion that we have a depth estimate of the scene, since the correspondence between pixels in image rows taken from different view points depends on the local depth of the scene. In contrast, we see in (Fig. 1.3d) that for a polydioptric camera the matching can be based purely on the captured image information, since the sets of light rays captured at consecutive times (bright yellow region) overlap. Disregarding sampling issues at the moment, we have a ?true? 
Disregarding sampling issues at the moment, we have a "true" brightness constancy in the region of overlap, because we match a light ray with itself (Eq. (1.1)). This also implies that polydioptric matching is invariant to occlusions and view-dependent visual events such as specularities. We conclude that the correspondence of light rays using a polydioptric camera depends only on the motion of the camera, not on any properties of the scene, thus enabling us to estimate the rigid motion of the camera in a completely scene-independent manner. A more detailed analysis of the structure of the space of light rays can be found in Chapter 2.

1.4 Polydioptric Camera Design

Figure 1.4: Hierarchy of Cameras for 3D Motion Estimation; (a) Static World - Moving Cameras, (b) Moving Object - Static Cameras. The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). The camera models are clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.

It is known that the stability of the structure from motion estimation depends on the collective field of view of all the "sub"-sensors making up the polydioptric camera. This relationship has been studied for example in [3, 33, 35, 70, 96, 47]. Some of these results can be found in Chapter 5. Combining this result with the plenoptic motion equations (1.1) in [104], we can define a coordinate system on the space of camera designs (as shown in Fig. 1.4). The different camera models are classified according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (Dioptric Axis). This in turn determines if structure from motion estimation is a well-posed or an ill-posed problem, and if the estimation is scene dependent or independent (thus implying that the motion parameters are related linearly or non-linearly to the image measurements if we have differential motion, as shown in Section 1.3).

One can see in the figure that the conventional pinhole camera is at the bottom of the hierarchy because the small field of view makes the motion estimation ill-posed and it is necessary to estimate depth and motion simultaneously. Although the estimation of structure and motion for a single-viewpoint spherical camera is stable and robust, it is still scene-dependent, and the algorithms which give the most accurate results are search techniques, and thus rather elaborate. One can conclude that a spherical polydioptric camera is the camera of choice to solve the structure from motion problem since it combines the stability of full field of view motion estimation with the linearity and scene independence of the polydioptric motion estimation. Such a camera would enable us to utilize new scene-independent constraints between the structure of the plenoptic function and the parameters describing the rigid motion of a polydioptric imaging sensor. Using the tools of polydioptric sampling theory described in Chapter 4 enables us to extend this qualitative analysis to a quantitative analysis.
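As a concrete illustration of the scene-independent registration view behind this hierarchy, the following is a minimal numerical sketch of the discrete plenoptic motion constraint (1.1); it is not code from the thesis. The textured plane standing in for the world, the sampled ray bundle, and all numerical values are illustrative assumptions. The point is that the residual is a function of the six rigid-motion parameters only, and it vanishes for the transformation that undoes the camera motion (the "opposite" coordinate transformation of the space of light rays).

```python
import numpy as np

# Static world light field L0(x, r): radiance of the ray through point x with
# unit direction r. The "scene" is a textured plane z = 5 (illustrative only);
# the registration residual below never uses this depth explicitly.
def L0(x, r):
    t = (5.0 - x[2]) / r[2]                      # ray / plane intersection
    p = x + t * r
    return np.sin(3.0 * p[0]) + np.cos(2.0 * p[1])

def rot_y(a):                                     # rotation about the y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Rigid camera motion between time 0 and time t (ground truth to be recovered).
R_cam, q_cam = rot_y(0.03), np.array([0.05, 0.0, 0.0])

# Light field measured by the moved camera, expressed in its own coordinates.
def Lt(x, r):
    return L0(R_cam @ x + q_cam, R_cam @ r)

# Rays sampled by a toy polydioptric camera: a patch of viewpoints, each with a
# bundle of viewing directions (multiple viewpoints, small field of view).
rng = np.random.default_rng(0)
rays = []
for _ in range(500):
    x = np.array([rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), 0.0])
    d = np.array([rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 1.0])
    rays.append((x, d / np.linalg.norm(d)))

def residual(R, q):
    """Sum of squared violations of Eq. (1.1): Lt(R x + q; R r) = L0(x; r).
    Only the six rigid-motion parameters enter; no scene structure is needed."""
    return sum((Lt(R @ x + q, R @ r) - L0(x, r)) ** 2 for x, r in rays)

# Near zero for the transformation undoing the camera motion, clearly nonzero
# for an incorrect motion hypothesis.
print(residual(R_cam.T, -R_cam.T @ q_cam))                   # ~0
print(residual(rot_y(-0.03), np.array([0.0, 0.05, 0.0])))    # > 0
```

In practice the light field is only sampled, so the approximation analysis of Chapter 4 governs how well such a residual can be evaluated from real polydioptric images.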
A polydioptric camera can be implemented in many ways. The simplest design is an array of ordinary cameras placed very close to each other (see Fig. 1.5 or [158]), or one could use specialized optics or lens systems such as described in [2, 42, 99]. In Section 7.1, some examples of these designs are presented.

Figure 1.5: (a) Design of a Polydioptric Camera (b) capturing Parallel Rays and (c) simultaneously capturing a Pencil of Rays.

Whatever design one uses, it is not possible to capture the plenoptic function with arbitrary precision. If we want to use the plenoptic motion constraints, we need to reconstruct a continuous light field from discrete samples. In this thesis I study the implementation of a polydioptric camera using a regular array of densely spaced pinhole cameras. This problem has been studied in the context of light field rendering [27, 162], where the authors examined which rays of a densely captured light field need to be retained to reconstruct the continuous light field. In Chapter 4 we will examine how we can estimate the approximation error between the time-varying light field and a reconstruction based on the images captured. Specifically, we determine how we can compute an estimate of the approximation error given a description of the camera and the statistics of the scene.

1.5 Overview of the thesis

Inspired by how Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclops, the mythical monocular giants, and by the impressive navigational feats of insects with compound eyes, I will show in this thesis how one can analyze vision problems in the space of light rays, and thereby find solutions that are optimal with respect to the joint design of vision sensors and algorithms. Based on the geometrical and statistical properties of these spaces, I introduce a framework to systematically study the relationship between the shape of an imaging sensor and the task performance of the entity using this sensor. I illustrate this concept by analyzing the structure of the time-varying plenoptic function and identifying the important camera design parameters for the case of 3D motion estimation. This analysis is then used to form a hierarchy of camera designs with regard to the task under study. The hierarchy implies that large field of view polydioptric cameras are the optimal cameras for 3D ego motion estimation because they allow for stable and scene-independent motion estimation algorithms. Polydioptric cameras are generalized cameras which capture a multi-perspective subset of the plenoptic function. Using a combination of multi-dimensional sampling analysis and statistical image modeling, I then improve upon the qualitative hierarchy of camera designs by defining a metric on the space of cameras. This metric is based on how well a given polydioptric camera is able to capture the plenoptic function, and on how the approximation errors propagate through the motion estimation to the final parameter estimate. This allows us to determine the best camera arrangement for a task.

Although camera motion estimation is important for tasks in navigation, it is often only part of the information that we want to extract about the world. Thus, at the end of this thesis I also examine the problem of estimating the shape and 3D motion of non-rigidly moving objects in a scene using a distributed camera network.
Chapter 2

Plenoptic Video Geometry: The Structure of the Space of Light Rays

2.1 Preliminaries

All the visual information that can be captured in a volume of space by an imaging sensor is described by the intensity function defined on the space of light rays surrounding us. This function is known as the plenoptic function (from plenus, full, and optic, view) [1]. For each position in space it records the intensity of a light ray for every direction, time, wavelength, and polarization, thus providing a complete description of all uninterpreted visual information. The idea of a function that contains all images of an object was already described by Leonardo Da Vinci in his notebooks. Similar functions were used later by Mehmke in 1898 and by Gershun in 1936 to describe the reflection of light on objects. Gershun was the first one to use the term light field for the vector irradiance field [49]. One can find more about the historical study of the space of light rays in the book by Moon and Spencer about the scalar irradiance field [98].

The time-varying shapes of the objects in a scene, their surface reflectance properties, the illumination of the scene, and the transmittance properties of the ambient space determine the structure and properties of the visual space. The objects cannot transmit their shape and surface properties directly to an observer. Without actively interacting with the object, the observer can only record the intensities of a set of light rays that reflect or emit from the object surface and infer the properties of the objects in the scene based on these "images". This is possible, for example, by utilizing the geometric ray model of light transport, which relates the geometric and reflection properties of the object surface and the illumination to the observed image intensity values. An image can be defined as a collection of rays with a certain intensity, where each captured ray has a position, orientation, time, wavelength and domain of integration (scale). Such a ray element was called a raxel by Grossberg and Nayar [52]. In computer graphics the scene parameters are known and the goal is to generate the view of a scene from a given view point by determining the ray properties for all the rays that make up an image (e.g., using standard computer graphics techniques such as ray tracing and global illumination). The quality that can be achieved is often nearly indistinguishable from actual views, as evidenced by the ubiquitous use of computer generated imagery in today's cinema.

In comparison, computer vision attempts to extract a description of the world based on images; thus one could say that computer graphics and computer vision are "inverse problems" of each other. We are given the images of a scene (in general a scene in the real world) and, based on the information in the images, we would like to find the set of variables that describe the scene. Since it is impossible to find an infinite number of variables, we will use models to approximate the true scene parameters and to recreate the scene. The model parameters can be found based on the images captured, given some a priori assumptions about the scene.
To solve these tasks we have to analyze how the scene and object properties manifest themselves in the global and local structure of the plenoptic function. The structure of the visual space is generated by the objects and their proper- ties in the scene. We have a distance function D(x;t) : R3 ? R+ ? R defined on the four-dimensional spatio-temporal volume (three space dimensions and one time di- mension) that describes the space that is occupied by the objects in the scene. The surface of the objects is defined by the zero level set of the distance function, that is S := {(x;t)|D(x;t) = 0}. The geometric properties of this function describe the shape of the objects and their temporal evolution. We will use the term object surface to denote this zero level set S. The change of the distance function over timedD/dtcaptures only deformations of the shape along its normal which is given by ?xD(x;t) = ? ?? ?? ?? ? ?D(x;t)/?x1 ?D(x;t)/?x2 ?D(x;t)/?x3 ? ?? ?? ?? ? where x = ? ?? ?? ?? ? x1 x2 x3 ? ?? ?? ?? ? . (2.1) We can also define a vector field M(x;t) : S ? R4 on the object surface that de- scribesthetrajectoriesofuniquepointsontheobjectsurface, the3Dmotionfloworscene flow[152]. Usuallywerestrictthetimecomponentofthisvectorfieldtobeofunitmagni- tude (or whichever time step is small enough to allow an accurate first-order description of the deformations of shape and surface properties). This vector field is restricted to transport points only on the surface S, therefore we have the constraint [MT?D] = 0 20 where ?D = [?xD;D/?t]. On the surface S we have the reflection properties of the objects given by the sur- face light field LS : S ?S2 ?R+ ? ?. ? denotes here the spectral energy, and equals R for monochromatic light, Rn for arbitrary discrete spectra, or could be a function space for a continuous spectrum. S2 is the unit sphere of directions in R3. Iftheobjectisnotemittinganylight,thenasurfacelightfieldscanoftenbefactored into a number of physically motivated components such as an illumination component and a component describing the surface reflection. A popular representation in graphics expresses the radiance LS(x;r;t) leaving the surface in direction ?r in terms of the sur- face irradiance IS(x;m;t) = L(x;m;t)(nTm) measured from direction m through the rendering equation [72] LS(x;r;t) = integraldisplay H(n) B(x;r;m;t) IS(x;m;t)dm (2.2) where B(x;r;m;t) is the bidirectional reflection distribution function(BRDF) [108] defined onthesurfaceS. H(n)isthehemisphereofdirectionsH(n) = {m : bardblmbardbl = 1 and mTn> 0}. AnothermorecomplexexampleistheBidirectionalSurfaceScatteringReflectanceDis- tribution Function [108], inshortBSSRDF.TheBRDFonlydescribestheinteractionoflight that enters and exits the surface at the same point, therefore it does not describe the scat- tering of light inside the surface as seen in marble, milky fluids or human skin. The BSSRDF is thus parameterized by two location vectors, one for the entrance and one for the exit location of the light rays. For some nice examples that were rendered using this model see [69]. Even such a general function is not able to adequately describe all the surface reflection phenomena, since we left out the effect of polarization for example. In computer vision we usually only use simple reflection models, because the noise in the 21 image acquisition process makes it difficult to find the parameters of the more complex models under non-laboratory conditions. 
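To make the structure of the rendering integral concrete, the following minimal sketch numerically approximates Eq. (2.2) for a single surface point by Monte Carlo sampling of the hemisphere H(n). The incident radiance field `incident_radiance`, the helper `sample_hemisphere`, and the Lambertian BRDF used here are illustrative stand-ins chosen for this sketch, not part of the thesis; only the structure L_S = ∫ B · I_S dm with I_S = L (n·m) follows the text.

```python
import numpy as np

def sample_hemisphere(n, rng, count):
    """Uniformly sample unit directions m with m . n > 0."""
    v = rng.normal(size=(count, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    v[(v @ n) < 0] *= -1.0
    return v

def lambertian_brdf(albedo):
    # Constant BRDF albedo / pi, independent of in- and outgoing directions.
    return lambda r, m: albedo / np.pi

def incident_radiance(m):
    # Toy illumination: a broad "sky" lobe around the +z direction.
    return max(m[2], 0.0) ** 2

def outgoing_radiance(n, r, brdf, num_samples=20000, seed=0):
    """Monte Carlo estimate of the rendering integral
       L_S(x, r) = int_{H(n)} B(x, r, m) L(x, m) (n . m) dm,
       using uniform hemisphere sampling (pdf = 1 / (2 pi))."""
    rng = np.random.default_rng(seed)
    ms = sample_hemisphere(n, rng, num_samples)
    vals = [brdf(r, m) * incident_radiance(m) * float(n @ m) for m in ms]
    return 2.0 * np.pi * np.mean(vals)

if __name__ == "__main__":
    n = np.array([0.0, 0.0, 1.0])                      # surface normal
    r = np.array([0.0, 0.5, 0.5]) / np.sqrt(0.5)       # outgoing direction
    print("outgoing radiance:", outgoing_radiance(n, r, lambertian_brdf(0.7)))
```

For the Lambertian case the estimate converges to the albedo over pi times the cosine-weighted irradiance; swapping in a view-dependent BRDF changes only the `brdf` callable.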
In computer vision, we are more interested in a representation that can describe the intrinsic complexity of a reflection model. A nice measuretoassessthecomplexityofasurfacelightfieldisdescribedinJinetal.[71]where theydescribetheconceptoftheradiancetensor,thatisthematrixthatisconstructedfrom the observed intensities of a number of surface points in a local neighborhood on the ob- ject surface that are observed from a number of viewing positions. Each column contains the different intensity values of the scene points as seen from a single view point, while the rows of the matrix contain the different views of a single scene point. If the local surface patch is small compared to the distance to the observing cameras and to the light sources in the scene, then this radiance tensor will have a rank of two or less. This is easy to show. Due to the small distance between the points in comparison to the distance to the light sources and cameras, we can assume that the surface irradiance is the same for each point in the patch, that the vector from each scene point to a single camera center is also nearly constant, and that the object surface is locally planar so that the normals and tangent directories are also constant across the points. Then we can factor the surface light field observed at each point as LS(x,r) = Nsummationdisplay i=1 ?i(x,r0)Bi(x0,r) (2.3) whereN is usually smaller then 4. This decomposition was also used before by Chen et al. in the context of surface light field compression [28]. The simplest example of this factorization is the case of Lambertian reflectance of the surface with albedo ?(x;t) which describes a perfectly diffuse reflector. In this case, 22 we can conveniently express LS(x;r;t) for all directions r ?H(?n) by L(x;r;t) = ?(x;t)[n(x;t)Ts(x;t)] (2.4) wheres(x;t) =integraltextH(n)L(m;x;t)(nTm)dmisthenetdirectionalirradianceonthesurface point, given by the surface integral overH(n), (for more details see [62]). If we factor the popular diffuse plus specular model we will get: LS(x;r) = integraldisplay H(n) B(x;r;m) IS(x;m)dm = integraldisplay H(n) (?d(x) +?s(x)B(x0;r;m)) IS(x0;m)dm = ?d(x) integraldisplay H(n) IS(x0;m)dm +?s(x) integraldisplay H(n) B(x0;r;m) IS(x0;m)dm = ?d(x)Bd(x0) +?s(x)Bs(r;x0) (2.5) This factorization indicates that under the correct assumptions the radiance tensor has only rank two. This rank constraint can then be used as a plausibility criterion to decide if the scene parameters are estimated correctly. A realistic model for the reflection properties that can be split in a diffuse and a specular component is given by Ward?s anisotropic (elliptical) Gaussian model [156]. At every point on the surface we can locally define at each location x an orthonormal co- ordinate system t1,t2,n and define a half-way vector h between the incoming m and outgoing rays r: h = (m+r)/bardblm+rbardbl. Then we can write the BRDF for this model as: B(x;r;m) = ?d(x) +?s(x)Bs(x;r;m) (2.6) = ?d(x)pi +?s exp bracketleftBiggparenleftbigg 1? 1(hTn)2 parenrightbiggparenleftBiggparenleftbigg hTt1 ?1 parenrightbigg2 + parenleftbigg hTt2 ?2 parenrightbigg2parenrightBiggbracketrightBigg 4pi?1?2radicalbig(mTn)(rTn) 23 where?d is the diffuse reflection coefficient, ?s is the specular reflectance coefficient and ?1 and ?2 are the standard deviations of the microscopic surface slope (as a measure of surfaceroughness)inthetangentdirectionst1 andt2. These4parameterstogetherdefine a physically valid reflection model. 
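Stepping back to the radiance tensor introduced above, its rank-two property is easy to check numerically. The minimal sketch below builds a synthetic tensor for a small planar patch whose reflectance is a diffuse term plus a single view-dependent specular factor shared by all points, i.e. exactly the factorization of Eq. (2.5) under the distant-light, distant-camera assumption. The coefficients `rho_d`, `rho_s`, `B_d`, and the specular lobe `B_s` are invented test values.

```python
import numpy as np

rng = np.random.default_rng(1)
num_views, num_points = 12, 30

# Per-point material coefficients (columns of the radiance tensor).
rho_d = rng.uniform(0.2, 0.9, size=num_points)   # diffuse albedo
rho_s = rng.uniform(0.0, 0.5, size=num_points)   # specular coefficient

# Under the small-patch assumption the diffuse shading B_d is the same for
# every point, and the specular factor B_s depends only on the view.
B_d = 0.8
view_angles = np.linspace(-0.6, 0.6, num_views)
B_s = np.exp(-view_angles**2 / 0.1)              # toy specular lobe per view

# Radiance tensor: rows = views, columns = surface points (cf. Eq. 2.5).
R = np.outer(np.ones(num_views) * B_d, rho_d) + np.outer(B_s, rho_s)

singular_values = np.linalg.svd(R, compute_uv=False)
print("singular values:", np.round(singular_values, 6))
# Only the first two singular values are numerically non-zero, i.e. the
# tensor has rank two, as predicted by the diffuse-plus-specular factoring.
```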
2.2 Representations for the space of light rays 2.2.1 Plenoptic Parametrization We define the light field as the extension of the surface light field, which is only defined on the object surfaces S, to the free space of R3, that is {x|D(x,t) > 0}. We will assume that the ambient space is a transparent medium, such as air for example, which does not change the color or intensity of light along its path. Thus, the light field along the view direction r is constant and the following equations hold: L(x;r;t) = L(x+?r;r;t) ??s.t. D(x+?r) > 0 ??? [0,?] (2.7) which implies that ?xLTr = ?rLTr = 0 ?x ? R3,f(x;t) > 0 (2.8) Here ?xL and ?rL are the partial derivatives of L with respect to x and r: ?x = (?/?x1,?/?x2,?/?x3)T , and ?r = (?/?r1,?/?r2,?/?r3)T. To initialize this ?differential equation? we define the following equality on the object surface L(x;r;t) = LS(x;r;t) where (x;t) ? S. (2.9) 24 SinceLis constant along r, the set of functionsLS defined onS generate the struc- ture of the visual space, that is the geometry of the plenoptic function, completely. Due to the brightness invariance along the ray direction the plenoptic function in free space reduces locally to five dimensions ? the time-varying space of directed lines for which many representations have been presented (for a comparison of different rep- resentations see Camahort and Fussel [22] or the book by Pottmann and Wallner [118] about line geometry). If we deal with smog, fog, transmitting or partially transparent objects, then this invariance will not hold, but we will not consider these cases in this work. The dependence of the structure of the plenoptic space on the shape of the scene objects can be analyzed using a directional distance function that describes the shortest distance between a point location and a scene surface in a given direction: Z(x;r;t) = min ? D(x+?r;t) = 0 (2.10) Using this function, we can rewrite the equality Eq. (2.9) as L(x,r,t) = LS(x+Z(x,r,t)r,r,t) (2.11) The space of light rays can now be parameterized in a number of different ways which offer different advantages with regard to completeness, redundancy and ease to describe geometric transformations. The question of representation is very important because the space of light rays has a specific structure since it is parameterized by the geometric and signal properties of the objects in the scene which introduce redundancy in the representation. In addition since all physical measurements of light are done with a finite aperture, we always need to account for the specific scale of the measurement in our later computations. This introduces a continuity on the space of measurements 25 which also needs to be reflected in the space and metric that we will use for computa- tions in the space of light rays. We call the parametrization that we used in this section the plenoptic parametrization because it is possible to assign a brightness value for each location, direction and time, thus capturing the full, global structure of the light rays. 2.2.2 Pl?ucker coordinates Sometimes we are interested in a representation for the space of light rays that does not exhibit the redundancy intrinsic to the plenoptic parametrization. As said in the last section, since light rays do not change their color in a transparent medium like air, we can represent light rays in the space of directed lines in three-dimensional space without losing any information. Linear subspaces are best treated in the context of Grassmann coordinates [118]. 
The Grassmann coordinates for lines in three-dimensional space are known as Pl?ucker coordinates and are a very useful to describe lines and their motions in space. The Pl?ucker coordinates of the linel := {x +?r,? ? R} are given by the tuple (r,m) ? R6, containing the direction vector r and the moment vector m defined by m = x?r. We see that the Pl?ucker moment is the normal to the plane that contains the line and the origin, and its magnitude is equal to the distance of the line to the origin. If wenormalizertounitlength,bardblrbardbl = 1,thenwehavearepresentationforanorientedline. This representation was described bye Study [17] to describe the space of oriented lines. Since the norm of all the line elements in the Study representation is of unit length, one also says the that all the coordinates representing oriented lines lie on the Study sphere. This parametrization is equivalent in terms of its sampling properties in line space tothesphere-planerepresentationdescribedin[23]. Herethelinelisdefinedbyitsdirec- tionr andtheintersectionoflwiththeplaneperpendiculartor throughtheorigin. Since 26 r ?m = 0 by definition, m lies in this plane, although not on the line. The intersection point x of the linelwith the plane x?r = 0 is given by x = m?r. Comparing the plenoptic and the Pl?ucker parameterizations, we see that, locally, for each plane through the origin of the coordinate system perpendicular to a fixed di- rection r, the plane xTr = 0, they differ only by a rotation of 90 degrees around r since m = r?x. Thus we can define an intensity function on the space of Pl?ucker lines using the identity: L(m,r,t) = L((m?r,r,t) L can be used to describe the plenoptic function globally whileLcan be used to describe the plenoptic function locally, since it chooses a single plane to describe rays and thus cannot account for occlusions or the changes in the differential structure of the plenoptic functionduetoamotionoftheobserver. Forexamplewhenanobservermovesalongthe main view direction then the image expands/contracts radially. Such a forward motion does not change the brightness of the light ray along the axis of forward motion since we have the brightness invariance along the ray, but it changes the magnitude of the derivatives ?rL and ?rLsince they depend on the distance to the scene. In this case we have ?rL(x,r,t) = Z(x,r,t)?xL(x,r,t) and ?rL(m,r,t) = Z(m,r,t)?mL(m,r,t). The main advantage of the Pl?ucker parameterization is that we can express the motion of lines in space very elegantly in terms of a matrix multiplication. If a line is un- dergoing a rigid motion described by the rotation matrixR and the translation t around the origin of the coordinate system, then this can be expressed by multiplying the line coordinate vector l = (r,m) with a 6?6 motion matrixQ: ? ?? ? rprime mprime ? ?? ?= ? ?? ? R 0 ?[t]xR R ? ?? ? ? ?? ? r m ? ?? ?= Q ? ?? ? r m ? ?? ?. (2.12) 27 This motion equation can be derived easily by moving two points on the line rigidly in spaceandthenrecomputingthePl?uckerlinecoordinateswithrespecttothenewposition of the two points. 2.2.3 Light field parameterization One difficulty with the previous two parameterization is that although we can define a metric between lines, it is difficult to express this distance directly in terms of distances between the line coordinates since they do not lie in a Euclidean space, but on a non- linear manifold. 
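Before moving on, the Plücker motion equation (2.12) above is easy to verify numerically: rigidly move two points on a line, recompute (r, m), and compare with the result of the 6x6 matrix. The sketch below uses the moment convention m = x x r, for which the lower-left block comes out as +[t]_x R; with the opposite convention m = r x x the sign of that block flips, so the block structure should be read as convention-dependent. The specific line and motion are arbitrary test values.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0.0]])

def pluecker(x, r):
    """Pluecker coordinates (r, m) of the line through x with direction r,
    using the moment convention m = x x r."""
    r = r / np.linalg.norm(r)
    return r, np.cross(x, r)

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.0]])

# A line through x0 with direction r0, and a rigid motion (R, t).
x0 = np.array([1.0, -2.0, 0.5])
r0 = np.array([0.3, 0.4, 0.866])
R, t = rotation_z(0.7), np.array([0.5, 1.5, -1.0])

r, m = pluecker(x0, r0)

# 6x6 line-motion matrix (cf. Eq. 2.12); with m = x x r the lower-left
# block is +[t]_x R.
Q = np.block([[R, np.zeros((3, 3))],
              [skew(t) @ R, R]])
r_new, m_new = np.split(Q @ np.concatenate([r, m]), 2)

# Ground truth: move two points on the line and recompute (r, m).
p1, p2 = x0, x0 + 2.0 * r
r_ref, m_ref = pluecker(R @ p1 + t, (R @ p2 + t) - (R @ p1 + t))

print(np.allclose(r_new, r_ref), np.allclose(m_new, m_ref))  # True True
```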
The light field parameterization is a convenient representation for the space of light rays, because it is closed under affine combinations of coordinate vectors and we can approximate the geodesic distances between points on the non-linear mani- fold of lines by the Euclidean distance between the coordinate vectors that represent the lines in the light field parameterization. As described by Peternell and Pottman [114], we can divide the space of lines into caps of directions, that are sets of directions close to a central axis ci that defines each cap i. We can then choose two parallel planes Z+i and Z?i perpendicular to the central axisci of each cap and represent all lines not perpendic- ular ci by their intersection with the two planes Z+i and Z?i . This defines now an affine space where the distance between two lines in 3D can be approximated by the Euclidean pointdistancebetweentheaffinecoordinatesoftheintersectionsofthelineswiththetwo planes. These new coordinates are equivalent to the coordinates of the line directions af- ter being stereographically projected onto a plane. The different caps are glued together using interpolation weights so that we end up with a continuous representations. The extent of the caps should be chosen such that we do minimize the error in the directional distances between the rays. This two-plane parameterization was used by [51, 80] to rep- 28 resent the space of light rays. All the lines passing through some space of interest can be parameterizedbysurroundingthisspace(thatcouldcontaineitheracameraoranobject) with two nested cubes and then recording the intersection of the light rays entering the camera or leaving the object with the planar faces of the two cubes. We only describe the parameterization of the rays passing through one pair of faces, the extension to the other pairs of the cube is straight forward. Without loss of generality we choose both planes to be perpendicular to the z-axis and separated by a distance off. We denote one plane as focal plane ?f indexed by coordinates (x,y) and the other plane as image plane ?i indexed by (u,v), where (u,v) is defined in a local coordinate system with respect to (x,y) (see Fig. 2.1a). Both (x,y) and (u,v) are aligned with the (X,Y)-axes of the world coordinates and ?f is at a distance ofZ? from the origin of the world coordinate system. (0,0) ?f: Focal PlaneCamera view points (x,y) x y(u0,v0,t) ?i: Image PlaneImage Sequences (u,v,t) L0(u,v,t) (u,v,t) L(x,yu,v,t) ?y ?x (a) (b) Figure 2.1: (a) Parameterization of light rays passing through a volume in space by recording the intersection of the light rays with the faces of a surrounding cube. (b) Lightfield Parameterization. This enables us to parameterize the light rays that pass through both planes at any 29 timetusing the tuples (x,y,u,v,t) and we can record their intensity in the time-varying light fieldL(x,y,u,v,t). Since the properties of the generating functions determine the intensity (or color) distribution in the space of light rays as described by the plenoptic function, we can construct a catalog of the relations between these properties and the local differential structure of the plenoptic function. This catalog then relates all the observable visual events such as colors, textures, occlusions, and glares to the shape, surface reflectance andilluminationofthescene. Inthissectionwewillstudytheplenopticfunctiondirectly without any reference to an imaging system. The properties of imaging systems will be studied later in chapter 3. 
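Before turning to the information content of plenoptic subspaces, here is a minimal sketch of the two-plane mapping just introduced: a world ray, given by a point and a direction, is converted to the tuple (x, y, u, v) by intersecting it with a focal plane perpendicular to the z-axis and an image plane a distance f further along z, with (u, v) expressed relative to (x, y) as in Fig. 2.1. The helper name `ray_to_lightfield`, the plane depth `z_focal`, and the example ray are assumptions made for this illustration.

```python
import numpy as np

def ray_to_lightfield(origin, direction, z_focal, f):
    """Intersect a ray with the focal plane (z = z_focal) and the image
    plane (z = z_focal + f); return (x, y, u, v) with (u, v) measured
    relative to (x, y), as in the two-plane parameterization."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    if np.isclose(d[2], 0.0):
        raise ValueError("ray parallel to the parameterization planes")
    x, y = (o + (z_focal - o[2]) / d[2] * d)[:2]       # focal-plane hit
    px, py = (o + (z_focal + f - o[2]) / d[2] * d)[:2]  # image-plane hit
    return x, y, px - x, py - y

if __name__ == "__main__":
    # Example: a ray leaving a scene point at depth 5, heading towards -z.
    coords = ray_to_lightfield(origin=[0.4, -0.2, 5.0],
                               direction=[-0.1, 0.05, -1.0],
                               z_focal=0.0, f=1.0)
    print("(x, y, u, v) =", np.round(coords, 4))
```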
2.3 Information content of plenoptic subspaces Recently, computer graphics and computer vision took interest in non-perspective sub- sets of the plenoptic function to represent visual information to be used for image-based rendering. Some examples are light fields [80] and lumigraphs [51], multiple centers of projection images [120] which have been used in cell animation already for quite some time [159], or multi-perspective panoramas [113]. For an overview over the use of multi- perspective images for image-based rendering see Zhang and Chen [163]. Non-perspective images have also been used by several researchers to reconstruct the observed scene from video sequences (for some examples see [14, 132, 26]). The de- scriptions of non-perspective images have been formalized lately by the work of Swami- nathan et al [141] and Hicks [59, 58]. For more information you can also see the page of catadioptricsensordesignhttp://www.mcs.drexel.edu/?ahicks/design/design. html. 30 In their seminal paper [1] Adelson and Bergen demonstrate with examples from early vision, how the importance of different subspaces of the space of light rays is re- flected in the resolution by which a subspace is sampled and the amount of processing that is applied to them. A similar criterion should be applied to design of artificial sen- sors. Adelson and Bergen defined a set of ?periodic tables for early vision? where they relate first and second order derivative measurements in different two-dimensional sub- spaces to visual events in the world. In this section we will extend their results by exam- ining how the correlation between different dimensions can be utilized to improve the imaging process. Ifwecomputethegradientofthespatio-temporallightfield?L,wecananalyzethe eigenvalue structure of the tensor that is formed by the outer product of the gradients to extract information about the scene (see [56] for the analysis for perspective images). 2.3.1 One-dimensional Subspaces If we study one-dimensional subspaces, we can measure the change in static radiance for linearly varying orientations and constant view points, or linearly varying view points and a fixed view direction. If we fix both view direction and view orientation and only the temporal dimension is varying, then we can measure the change in irradiance of a light ray over time. It is to note that the time domain is different from the spatial domain, because it only allows for causal filtering, since we cannot make measurements in the future. Linear subspaces where two or more dimensions are simultaneously varying allow us to fixate on a specific point in the world and measure a slice of the surface light field at that point. 31 2.3.2 Two-dimensional Subspaces In two dimension things become interesting because we can observe correlated behavior between different domains which allows us to extract more complex information about the world. Texture Information The subsets L(x,y) = L(x,y,?,?,?) or L(u,v) = L(?,?,u,v,?) correspond to the standard perspective or orthographic images which enable us to measure the colors and texture of all the visible surfaces of the objects in the scene. The complexity of the textures can be classified by looking at the eigenvalues?1 and?2 of ?L?LT. If both the eigenvalues are zero then the scene consists of a homogeneous region. 
If only one eigenvalue is non- zero, then we have a linear structure such as a brightness edge in the texture, and if both eigenvaluesarelarge,thenthesurfacetextureistoocomplextobedescribedbyfirstorder derivative operators. An example would be a texture that is isotropically changing, or onethatconsistsofacornerorapointfeature. Eachpointintheworldisonlyimagedbya singlelightray,thusforagivennumberofimagingelementswegetthemostinformation about the surface textures in the world, but a single static image does not give us any information about the shape of the objects in the world. Shape and Segmentation Information Any set of light rays that lies on a doubly ruled surface allows us to infer spatial informa- tion about the scene since the light rays intersect in space [112, 129]. Especially the case wheretheorientationvectorandthepositionvectorsvaryinthesameplaneiseasytoan- alyze. These subsets of light rays are known as epi-polar plane images (EPIs) [15]. Exam- 32 ples areL(x,u) = L(x,?,u,?,?) orL(y,v) = L(?,y,?,v,?) and will be analyzed in more de- tail in Section 2.5. Since the points in the world are imaged by multiple imaging elements we can relate the spatial arrangement of the imaging elements to the correlation between visual measurements at different elements to extract spatial information about the scene inview. Sinceweusethecorrelationoftexturalelementstoinferspatialstructurewecan- not always extract spatial information, for example if we compute the structure tensor in an EPI and we have only vanishing eigenvalues, then this corresponds to a texture-less region of the scene where we cannot extract any shape information. Each line in an EPI corresponds to the light passing through a single point in space. If this point lies on the scene surface, then the changes in color and intensity are due to reflection properties at this scene point and should vary slowly (unless it is specular). If the line corresponds to a scene point that does not lie on the surface of an object then it corresponds to a virtual pinholecameraimageofthesceneandtheintensitiescanvaryarbitrarily. Ifthereflection properties of the scene texture are dominated by the Lambertian component, that means that the light intensity reflected is independent of the angle at which we look at the sur- face, then if we fixate on a scene point by varying the rate of change of view point and view direction appropriately, the observed radiance of the scene point will not change and thus form a line in the EPI of uniform color. Thus for most scenes with non-specular reflection properties, we can identify the lines corresponding to scene points by finding the lines where the intensity varies only very little. The slope of this line structure encodes the depth of the scene because the depth is proportional to the ratio of the necessary changes in view direction and view position to fixate on a scene point. This equivalent to triangulating the scene points from a con- tinuum of view positions where correspondence is established by following the tracks of 33 uniform color formed by the projection of a scene point using different centers of pro- jection. We can use the structure tensor to estimate the depth accurately. If we have one vanishing eigenvalue indicating an edge in the EPI, then the depth of the scene can be computed from the direction of the eigenvector corresponding to the non-vanishing eigenvalue. 
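The depth-from-slope argument can be made concrete with a synthetic epipolar plane image. In the minimal sketch below, a Lambertian fronto-parallel plane at depth z0 is rendered into an EPI by shifting a 1D texture proportionally to z0/f across view positions; the averaged structure tensor of the EPI gradients then yields the depth from the eigenvector of the non-vanishing eigenvalue. The texture and the numerical values of f and z0 are arbitrary test choices.

```python
import numpy as np

f, z0 = 1.0, 3.0                      # focal length and true scene depth
xs = np.linspace(-2.0, 2.0, 400)      # view positions (camera centers)
us = np.linspace(-0.5, 0.5, 200)      # view directions (image coordinates)

def texture(s):
    # A smooth 1D Lambertian surface texture (arbitrary choice).
    return (np.sin(7.0 * s) + 0.5 * np.sin(17.0 * s + 1.0)
            + 0.25 * np.sin(41.0 * s + 2.0))

# EPI of a fronto-parallel Lambertian plane: pixel u seen from view
# position x samples the surface texture at s = x + (z0 / f) * u.
X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (z0 / f) * U)

# Averaged structure tensor of the EPI gradients.
Lx = np.gradient(L, xs, axis=0)
Lu = np.gradient(L, us, axis=1)
J = np.array([[np.mean(Lx * Lx), np.mean(Lx * Lu)],
              [np.mean(Lx * Lu), np.mean(Lu * Lu)]])
eigvals, eigvecs = np.linalg.eigh(J)

# For this EPI the gradient is proportional to (1, z0 / f), so the
# eigenvector of the non-vanishing eigenvalue yields the depth directly.
v = eigvecs[:, np.argmax(eigvals)]
print("estimated depth:", f * v[1] / v[0], " true depth:", z0)
```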
The changes in slope across the EPI can be used to infer properties of the surface shape such as surface orientation and surface curvature. In case that both eigenvalues are non-zero, then we have either a specularity or an occlusion in the scene. The difference in dimensionality of the image structure between texture and occlusion discontinuities can be used to easily segment the image captured with the sensor into regions belonging to different objects in the scene. Depth occlusions form y-junctions in the EPI and the angle of the junction can be used to differentiate between self-occlusions due to surface curvature where the intersection angle between the tracks of different features is very small and object occlusions where the angle is larger (corresponding to the magnitude of difference in depth between the objects). Linear Motion Any two-dimensional subset that involves simultaneously time and one of the spatial dimensions allows us to detect and measure the planar motion of the camera or an object in the scene (see Table 2.1). For perspective views (u,t) or (v,t), we can measure the two components of the motion along the optical axis and the coordinate axis, but have to estimate the depth of the scene at the same time, while for orthographic views we can onlymeasurethedisplacementalongthespatialdimension,butwecandothisaccurately without any additional depth estimation. 34 2.3.3 Three-dimensional Subspaces Structure from Motion: Simultaneous 3D Motion and Depth Estimation If we the take conventional perspective image sequences we can estimate the 3D mo- tion of the camera and the depth of scene, a problem known as structure from motion. This problem has been studied in depth and many solutions have been proposed. Unless we have pure rotation or are imaging a plane scene, this estimation is often very diffi- cult since the temporal changes in the image depend not just on the camera and object motions, but also on the depth structure of the scene(Tables 2.1 and 2.2). This forces us to solve the motion and depth segmentation simultaneously with the motion estimation leading to a high dimensional problem which is very sensitive to noise. In this case the dynamic images L(x,y,t) = L(x,y,?,?,t) or L(u,v) = L(?,?,u,v,t) are the subsets of interest. We can have the following cases for the eigenvalues?1,...,?3 of the structure tensor ?L?LT. As before, if all eigenvalues are of equal value (either non-zero or zero) we cannot extract any information, since the scene is either too homo- geneous or too random. If we have 1 non-zero eigenvalue, we have the motion of a line in the image, resulting in a plane in spatio-temporal space. Here we can only determine the component of the motion parallel to the edge in the image, the normal flow, due to the aperture problem. If we have two non-vanishing eigenvalues, then this corresponds to the motion of a point feature along a line in spatio-temporal space and the eigenvector corresponding to the vanishing eigenvector is parallel to the direction of the line. For the orthographic images, this flow is corresponding to the actual motion of the scene point, while in the perspective images, this motion also depends on the distance to the scene and the motion along the optical axes. 35 Scene-independent Planar Motion Estimation and Motion Segmentation If we look at the three-dimensional subspaces that are formed byL(x,u,t) andL(y,v,t), then we can extract the motion of a camera in the epipolar plane in a scene independent manner (see Table 2.2). 
Since changes in the epipolar image over time due to the camera motion form only a low three-dimensional subspace parameterized by the motion para- meters it is easy to detect independently moving objects since their trajectory will give risetogradientsintheEPIthatarenotcompatiblewiththeestimatedrigidmotion. Inthe future I plan to study if subspace factorizations in the spirit of Tomasi and Kanade [146], Irani [67], or Vidal et al. [155] can be utilized to make this factorization. Depth Estimation from Parallax If we capture a plenoptic subset formed by three of the four spatial dimensions (known as epipolar volumes), then we can determine the depth of the scene in all the epipolar planes contained in the plenoptic subset as long the scene texture offers enough informa- tion. In regions of homogeneous color, it is of course impossible to recover information about the scene. Besides depth we can again recover all the information as described in the Section 2.3.2. As before if the local orientation of the scene texture is parallel to the epipolar plane then we cannot estimate the depth because in this case the epipolar lines and the texture line coincide. This fact which is a well-known problem in the stereo literature. 36 2.3.4 Four-dimensional Subspaces Shape Estimation and Recovery of Surface Properties In 4D, we can take a look at the 4D light field which contains information about the re- flectance of a scene point as well as the depth and occlusion properties, thus allowing for computations that utilize the fact that all the dimensions constrain each other. We know for example, that all the gradients in the images should be parallel, but intensity derivatives in view direction space are scaled by the depth with respect to the intensity derivatives in view point space as described before. If we look at the different eigen- value distributions, we see that again we have the non-informative cases of completely vanishing or non-vanishing eigenvalues corresponding to the absence of textures or the presence of too much noise or strong specularity. Motion Stereo Estimation Anythreespatialdimensionsandatemporaldimensionallowsustocomputethemotion of the camera and the objects in the scene. Since we can compute the depth of the scene from the epipolar plane images, motion estimation and depth estimation are decoupled leading to a simpler problem. We can also solve for some motion parameters linearly in terms of plenoptic derivatives and then plug the solution into the non-linear scene- dependent motion equations. Having access to epipolar images and thus depth also simplifies the segmentation of the scene into independently moving objects since their depth is constrained by the plenoptic derivatives. 37 2.3.5 Five-dimensional Subspaces Scene-independent Motion Estimation If we have access to the complete plenoptic function, then we can compute the motion of the camera in a scene-independent manner by solving a linear system of equations in terms of the view point and view direction derivatives. Since these gradient fields only depend on the six motion parameters, it will be easy to detect independently moving objects by a simple clustering procedure using an affine motion model to describe the displacement of a scene region in the plenoptic space. 
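As a small numerical illustration of the structure-tensor reasoning used throughout this section, the sketch below considers a video of a pattern translating parallel to the image plane and recovers the image velocity from the eigenvector of the vanishing eigenvalue of the averaged spatio-temporal structure tensor. The texture and the velocity are arbitrary test values, and the example prescribes the image-plane velocity directly, so it sidesteps the depth dependence that arises for general perspective motion.

```python
import numpy as np

vx, vy = 0.6, -0.3                     # true image velocity of the pattern

xs = np.linspace(0, 4, 64)
ys = np.linspace(0, 4, 64)
ts = np.linspace(0, 1, 16)
X, Y, T = np.meshgrid(xs, ys, ts, indexing="ij")

def tex(x, y):
    # Texture with several orientations, so both spatial eigenvalues are
    # non-zero and the aperture problem does not arise.
    return np.sin(5 * x) + np.cos(7 * y) + 0.5 * np.sin(3 * x + 4 * y)

I = tex(X - vx * T, Y - vy * T)        # brightness-constant image sequence

# Spatio-temporal gradients and the averaged 3x3 structure tensor.
Ix = np.gradient(I, xs, axis=0)
Iy = np.gradient(I, ys, axis=1)
It = np.gradient(I, ts, axis=2)
G = np.stack([Ix.ravel(), Iy.ravel(), It.ravel()])
J = G @ G.T / G.shape[1]

eigvals, eigvecs = np.linalg.eigh(J)
d = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue
flow = d[:2] / d[2]                    # points along (vx, vy, 1) up to scale
print("estimated flow:", np.round(flow, 3), " true flow:", (vx, vy))
```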
2.4 The space of images Of special educational interest are the two-dimensional subspaces of the plenoptic func- tion, because most continuous imaging surfaces can only capture two-dimensional im- agessincethereceptorsneedtolieonatwo-dimensionalsurfaceinspace. Two-dimensional imagesalsoenableustoutilizethecorrelationsbetweendifferentcoordinateaxestoinfer information about the scene. In general any 2D-subset of this lightfield function consti- tutes an image. Recently, Yu and McMillan [161] described how all linear cameras can be described by the affine combination of three points in the light field parameterization. Some of the cameras they describe are: ? Ifalocation(x,y)inthefocalplaneisfixedsothattheimageisoftheformL(x,y,?,?,t), then it corresponds to the image captured by a perspective camera. This image is formed by the pencil of light rays that all pass through the point (x,y) in the focal plane. ? Ifinsteadwefixtheviewdirection(u,v),wecaptureanorthographicimageL(?,?,u,v,t) 38 of the scene. In this case, all the light rays that form the image are parallel where the direction is given by the vector (u,v,f)/bardbl(u,v,f)bardbl. ? If we choose to fix the values on two orthogonal spatial axes, for example (x,v) or (y,u), then the subsets L(x,?,?,y,t) and L(?,y,u,?,t) describe linear push-broom- video sequences, where a one-dimensional perspective slit camera is moved per- pendicular to the slit direction as for example in recordings made by surveillance planes. This camera is an example of the so-called crossed-slit projection camera as described by Zomet et al. [168] These images are formed by the set of rays passing through two slits lying in two planes parallel to the imaging surface. The different kinds of images above are subsets of the plenoptic function where each light ray corresponding to an image pixel intersects the scene at a unique location. In other words this means that none of the light rays intersects another light ray that is captured in the image on the surface of an object in the scene. Given an image of the type described above, it is therefore impossible to infer something about the scene structure without prior knowledge. In the following we will take a closer look at the subsets that correspond to light rays that intersect in space and therefore could potentially encode information about the structure. From [112, 129] we know that to form a stereo geometry, that means rays lie on an epipolar surface and intersect, the rays making up an image need to lie on a doubly-ruled quadric. The only such surfaces are planes, hyperboloids, and parabolic hyperboloids. An example for such a planar configuration are epipolar plane images. These images are formed by fixing two of the parallel axes in the two-plane parameteri- zation, that is (x,u) or (y,v). 39 (a) (b) Figure 2.2: (a) Sequence of images captured by a horizontally translating camera. (b) Epipolar plane image volume formed by the image sequence where the top half of the volume has been cut away to show how a row of the image changes when the camera translates. 2.5 The Geometry of Epipolar Images We are interested in characterizing the frequency structure of the time-varying plenop- tic function captured by a moving polydioptric camera. Since visual information is in general local information, we will first study local neighborhoods of the static plenoptic function. 
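To see how the camera types of the previous section are simply 2D slices of one and the same 4D light field, the short sketch below builds a toy discrete light field L(x, y, u, v) and extracts a perspective image (fix x, y), an orthographic image (fix u, v), and a push-broom-style image (fix x and v). The radiance pattern itself is synthetic; only the slicing pattern follows the definitions in the text.

```python
import numpy as np

# A toy discrete light field L[x, y, u, v]: 16x16 view points on the focal
# plane, 32x32 view directions on the image plane.
nx, ny, nu, nv = 16, 16, 32, 32
x, y, u, v = np.meshgrid(np.linspace(-1, 1, nx), np.linspace(-1, 1, ny),
                         np.linspace(-1, 1, nu), np.linspace(-1, 1, nv),
                         indexing="ij")
# Synthetic radiance pattern; any function of (x, y, u, v) would do here.
L = np.sin(3 * (x + 0.5 * u)) * np.cos(2 * (y + 0.5 * v))

perspective  = L[nx // 2, ny // 2, :, :]   # fix (x, y): pencil of rays
orthographic = L[:, :, nu // 2, nv // 2]   # fix (u, v): parallel rays
pushbroom    = L[nx // 2, :, :, nv // 2]   # fix (x, v): push-broom slice

print(perspective.shape, orthographic.shape, pushbroom.shape)
```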
Most of the local structure of the space of light rays can already be described by ex- amining two-dimensional slices, either perspective, orthographic or epipolar images. In this section we will focus on the epipolar images because in them the geometric structure and surface textures interact the strongest. The rays in the epipolar image intersect either in free space, inside the object or on the object surface. These images exhibit a regular structure, as seen in the example in 40 f Z? x Scene u Origin u0 u 0 x f Z u=u0 - x f/z(u0) z(u0) ?f ?i ?u ?x Bu -Bu (f/zmax) ?u ? ?x = 0 (f/zmin) ?u ? ?x = 0 (f/zopt) ?u ? ?x = 0 Bx = f(1/zmin-1/zmax) Bu (a) (b) Figure 2.3: (a) Light Ray Correspondence (here shown only for the light field slice spanned by axesxandu). (b) Fourier spectrum of the (x,u) light field slice with choice of ?optimal? depthzopt for the reconstruction filter. Fig. 2.2b. The projection of a scene point traces out a linear line in the EPI with a slope is proportionaltothedepthofthescenepoint. Thusthedifferentialstructureofanepipolar plane image encodes the depth, as well as occlusions and the reflectance function of the surface in view. We follow in our description [15] and analyze the motion of the image of a world point x = (x,z) in a moving line camera. If camera and world coordinate system coin- cide, we can apply the usual perspective projection, so that the image of the point in the image linez = f is given byu = f(x/z). To describe the motion of the projected point in dependence on the camera motion for a general camera position and orientation, let c = (cx,cz) be the center of the camera and let the optical axis make an angle of? with thex-axis of the coordinate system. The world coordinates of the point x are x0 = (x0,z0) = Rx + c where the rotation matrix 41 du z Z? x Scene Origin uf Z dx = (z/f) dudx?f ?i ?_u z Z? x Scene Origin uf Z ?_x = (z/f) ?_u?_x?f ?i (a) (b) Figure 2.4: (a) The ratio between the magnitude of perspective and orthographic deriv- atives is linear in the depth of scene. (b) The scale at which the perspective and ortho- graphic derivatives match depends on the depth (matching scale in red) R = [cos(?),?sin(?);sin(?),cos(?)] alignsthecameraandworldcoordinatesystems. This implies that x = RT(x0 ?c), therefore the projection onto the image line in the general case is given by: u = f(x0 ?cx)cos(?) + (z0 ?cz)sin(?)(c x ?x0)sin(?) + (z0 ?cz)cos(?) (2.13) First, we will study the effect of a simple linear motion. Without loss of generality, we can assume that the camera moves with constant speed along the positive x-axis (c = (cx,cz) = (at,0)). This leads to u = f(x0 ?at)cos(?) +z0 sin(?)(at?x 0)sin(?) +z0 cos(?) (2.14) 42 (a) (b) Figure 2.5: (a) Epipolar image of a scene consisting of two translucent fronto-parallel planes. (b) The Fourier transform of (a). Notice how the energy of the signal is concen- trated along two lines. which we can expand to 0 = (asin(?))ut+ (z0 cos(?)?x0 sin(?))u (2.15) + (af cos(?))t?f(x0 cos(?) +z0 sin(?)) which shows that the feature paths are hyperbolic curves. The asymptotes of the hyper- bola are parallel to the coordinate axes as can be seen if we rewrite Eq.2.15 as: (u+f cot(?))(t+ (cot(?)?x0)/a) = fz0asin(?) (2.16) It is to note, that it is always possible to linearize the feature paths by derotating the coor- dinate system, so that the viewing direction is perpendicular to the direction of motion. The transformation is given by: uprime = fucos(?)?f sin(?)f cos(?)?usin(?) 
(2.17) If the viewing direction is perpendicular to the direction of motion (? = 0), then the 43 feature paths degenerate to straight lines z0u+aft?x0f = 0. (2.18) This is the canonical way to parameterize epipolar images. Each feature in the world traces out a line in the epipolar image with a slope that is proportional to the relative depth of the feature with respect to the camera. This is also illustrated in Fig. 2.4as where we see that dx = (z/f)du by the law of similar triangles. Thus, depth can be extracted from the feature paths simply by extract- ing the slope of a featurez = f(dx/du). 2.5.1 Fourier analysis of the epipolar images In contrast to Chai et al. [27] which parameterize the world with respect to a perspective reference image of the scene, we will follow Zhang and Chen [162] and will parameterize the 2D light field slice in terms of the planar surface light field LS(x,?), x = [x,z]T ? R2,? ? [?pi,pi] where ? = 0 corresponds to the direction opposite of the depth axis (??z = [0,?1]). ThesurfacelightfieldisonlydefinedonthesurfaceS oftheobjectswhich in this flat-land case is defined by the curve x(s) which we will parameterize using the arc-length parameters. The epipolar plane image is parameterized by its view point xc = [xc,0] (we as- sume w.l.o.g. that the camera is at depth 0) and view direction, parameterized by the intersection u of the view ray with the image plane at distance f to focal plane. We also assume that the field of the camera is restricted tou? [?u0,u0]. In the following we will examine the effect of depth, slope, reflection properties and occlusions on the frequency structure of the epipolar image. Specifically, we will look at the following scenarios: 44 1. a single fronto-parallel plane with Lambertian surface reflection, 2. a single fronto-parallel plane with band-limited non-Lambertian surface reflection, 3. a single tilted plane, 4. and one fronto-parallel plane occluding a second one. From our previous definition, the view ray has the implicit equation [f,?u]?(x? xc) = 0 or if multiply the equation out, we get xf ?xcf ?uz = 0. (2.19) If the normal to the surface and the viewing ray do not point along the same di- rection in the field of view of the camera, that is N = [u,f] ? ?D(x(u))T < 0 ?u ? [?u0,u0], then we do not have to worry about self-occlusions on the surfaces. Here x(u) = xc +z(xc,u)?u/f is the intercept of the view ray at pixeluas seen from camera position xc. 2.5.2 Scenes of constant depth In this case the equation of the object curve is given by xs = [s,z0]T and we can easily solve for the intersection of the viewing ray with the surface in terms ofswhich leads to the relations s = uz0f +xc and? = ?tan?1(u/f) (2.20) Thus the local light field structure of an EPI is given by the following equation, L(xc,u) = LS(xc + z0f u,?tan?1(u/f)) (2.21) In this case the occlusion condition is satisfied for all non-horizontal viewing rays, and we do not have to worry about self-occlusions. 45 We can compute the depth of the scene from the image derivatives as follows, to simplify the computation, we will use the approximation tan?1(u/f) ?u/f. L(xx,u) = LS(xc + z0f u,?tan?1(u/f)) = LS(xc + z0f u,?u/f) ? ?xcL(xc,u) = ? ?sLS(xc + z0 f u,?u/f) ? ?uL(xc,u) = ? ?sLS(xc + z0 f u,?u/f) z0 f ? ? ??LS(xc + z0 f u,?u/f) 1 f If we examine the ratio between the EPI derivatives ? ?uL(xc,u)/ ? ?xcL(xc,u) = ? ?sLS(...,...) z0 f + ? ??LS(...,...) 1 f ? ?sLS(...,...) = z0f + ? ??LS(...,...) 1 f ? ?sLS(...,...) 
(2.22) we see that the ratio between the derivatives is proportional to the depth of the scene if the scene consists of a fronto-parallel plane and we can neglect the influence of the non-Lambertian effects (see Figure 2.4a). If the derivatives of the reflectance function are involved,thenwehavetoseparatetheinfluenceofreflectionanddepthonthebrightness derivatives. In the Fourier domain, we can analyze the frequency spectrum of a scene at con- stant depth by expressing the light field spectrum in terms of the surface light field spec- trum. ?L(xc,u) = integraldisplayintegraldisplay L(xc,u)exp(?2pii(?xxc + ?uu))dxcdu = integraldisplayintegraldisplay LS(xc + z0f u,?tan?1(u/f))exp(?2pii(?xxc + ?uu))dxcdu We have the identities u = ?f tan(?) and xc = s + z0 tan(?). Here we will make the approximation tan(?) ? ? for small ? (corresponding to a small field of view camera), 46 thus we get ?L(xc,u) = integraldisplayintegraldisplay LS(s,?)exp(?2pii(?x(s+z0?)??uf?)fdsd? = f?LS(?x,?xz0 ??uf) (2.23) As described in [162], if we assume that the surface light field is bandlimited by B?, then we have the equality ?LS(?s,??) = ?LS(?s,??)?1B?(??) where 1B?(??) is an in- dicator function over the interval [?B?,B?]. When we have a Lambertian surface, where B? = 0, ?1B?(??) equals the Dirac functional ?(??) and we get the familiar fact that the frequency spectrum of an EPI observing a Lambertian surface at constant depth is a line in frequency space with a slope proportional to the depth of the surface in the spatial domain. 2.5.3 A slanted surface in space The computations for a slanted surface are more complicated because the depth of the scene is changing when we vary either the view direction or the view position causing a fore-shortening effect of scene texture. This causes the frequency content of the projected texture to change with position, a phenomenon known as chirping [93]. A tilted line in space can be parameterized as (x,z) = (xs,zs) + s(cos(?),sin(?)), thus using Eq. (2.19) we can solve fors: s = (xs ?xc)f ?zsucos(?)f ?sin(?)u. (2.24) We have to make sure that the occlusion property is satisfied for the field of view of the camera [?u0,u0]. Thus we have the constraint on?such that cos(?)f ?sin(?)u< 0 ?u? [?u0,u0]. 47 Computing the Fourier transform of the light field we get ?L(xc,u) = integraldisplayintegraldisplay L(xc,u)exp(?2pii(?xxc + ?uu))dxcdu = integraldisplayintegraldisplay LS( (xs ?xc)f ?zsucos(?)f ?sin(?)u,?tan?1(u/f))exp(?2pii(?xxc + ?uu))dxcdu Again we solve these two equations foruandxc (while again making the approxi- mation tan(?) ??): ? ?? ? u xc ? ?? ?= ? ?? ? ?f? ?s?[sin(?)]?scos(?) +?zs +xs ? ?? ?= ?(s,?) (2.25) Computing the determinant of the Jacobian of the mapping ?(s,?) = (u,xc), we get |det(D?(s,?))| = cos(?) + sin(?)? which leads to ?L(xc,u) = integraldisplayintegraldisplay L(s,?)exp(?2pii([?u,?x]?(s,?)f(cos(?) + sin(?)?)dsd?. (2.26) Sincesand? appear as a product in ?(s,?), in general we cannot easily additively factor ? and express the spectrum of the epipolar image in terms of the spectrum of the surface light field. Nevertheless, if we can factor the surface light field LS(s,?) = ?(s)B(?) where ?(s) describes the texture on the plane, while B(?) captures the view dependent effects, then we can simplify the equations as follows: ?L(xc,u) (2.27) =f integraldisplay ??(?x(cos(?) + sin(?)?))B(?)(cos(?) + sin(?)?)exp(?2pii(?x(?zs +xs) + ?uf?)d? where ?? is the Fourier transform of the surface texture ?(s). 
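Returning to the constant-depth case, the prediction of Eq. (2.23) that the EPI spectrum of a Lambertian scene collapses onto a line whose slope is set by the depth can be checked numerically. The sketch below forms the EPI L(x_c, u) of a fronto-parallel Lambertian texture at depth z0, takes its windowed 2D FFT, and measures how much of the spectral energy lies in a narrow band around the line Omega_u = (z0/f) Omega_x (the form the line takes with the sign and axis conventions of this sketch). The texture, the depth, and the band width are arbitrary test choices.

```python
import numpy as np

f, z0 = 1.0, 2.5                       # focal length and scene depth
nx, nu = 256, 128
xs = np.linspace(-2.0, 2.0, nx)        # view positions
us = np.linspace(-0.5, 0.5, nu)        # image coordinates

def texture(s):                        # random-looking Lambertian texture
    rng = np.random.default_rng(0)
    freqs, phases = rng.uniform(1, 20, 12), rng.uniform(0, 2 * np.pi, 12)
    return sum(np.sin(2 * np.pi * k * s + p) for k, p in zip(freqs, phases))

X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (z0 / f) * U)          # EPI of a constant-depth plane

# Window to reduce spectral leakage from the finite EPI support.
L_win = L * np.hanning(nx)[:, None] * np.hanning(nu)[None, :]
power = np.abs(np.fft.fftshift(np.fft.fft2(L_win))) ** 2

fx = np.fft.fftshift(np.fft.fftfreq(nx, d=xs[1] - xs[0]))[:, None]
fu = np.fft.fftshift(np.fft.fftfreq(nu, d=us[1] - us[0]))[None, :]

# Energy within a narrow band around the predicted line f_u = (z0/f) f_x.
band = np.abs(fu - (z0 / f) * fx) < 3.0 / (us[-1] - us[0])
print("energy fraction on the predicted line:", power[band].sum() / power.sum())
```

The printed fraction should come out close to one, confirming that the spectral energy is concentrated along a single line whose slope is proportional to the depth.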
By choosing a specific tex- ture, we can then evaluate the integral and compute the spectrum of the epipolar image. We see how by varying?, the frequency spectrum is getting spread out between the min- imum and maximum depth. 48 (a) (b) Figure 2.6: (a) Epipolar image of a scene where one signal occludes the other. (b) The Fourier transform of (a). Notice the ringing effects parallel to the occluding signal. 2.5.4 Occlusion To estimate how occlusions influence the frequency structure of L(x,u) we can use the approach described in [8]. They model an occlusion in a local neighborhood by split- ting the light field L(x,u) into two parts L+(x,u) and L?(x,u) using a binary indicator function ?(x,u). L(x,u) = ?(x,u)L+(x,u) + [1??(x,u)]L?(x,u) Applying the Fourier transform toL(x,u) results in ?L(?x,?u) = ??(?x,?u)? ?L+(?x,?u)? ??(?x,?u)]? ?L?(?x,?u) + ?L?(?x,?u) We see that due to the convolution the frequency spectra of ?L+ and ?L? are spread by out the frequency spectrum ??. Since most of the geometrical information of a scene manifests itself only on the local scale, we can assume that locally every occlusion edge is a linear step edge. This casewasstudiedby[8]wheretheylookedattheFourierspacedescriptionoftwomoving 49 textured planes where one plane is occluding the other. They studied the case of constant and linear motion, which basically corresponds to changing the viewpoint in the light field parameterization (constant motion case), or also changing the viewpoint along the optical axis (linear motion case). We follow their convention and model the texture of the scene as a signal that satisfies the Dirichlet conditions. This allows us to express the signal as an infinite exponential series that converges uniformly to the signal. We define a step function ? for one side of a fronto-parallel plane that begins at positions+ at depthz+ using the previous reparametrization as: ?(x,u) = ?s(x+ z+f u) = ?? ?? ??? 1 ifx+ z+f u?s+ 0 otherwise The Fourier transform of a step function is then given by: ??(?x,?u) = ?integraldisplay u=?? ?integraldisplay x=?? ?(x,u)e?j(?xx+?uu)dxdu = parenleftbigg pi?(?x)? i? x parenrightbigg e?j?xxi?(?u ??xz+f ) We see that the occlusion will cause ringing in the frequency domain that is parallel to the line ?u??xzif = 0. The orientation of this line depends on the depth of the occlusion edge (Fig. 2.6). Following [162] we can model more complicated scenes by partitioning the scene intondepth levels where each depth is assumed to have constant depth (fronto-parallel plane). The order of the depth levels be z1 < ... < zn. Each level extends until infinity and is only occluded by the layers in front of it closer to the camera. To model a finite extent for each depth layer, we define the maskMi (i = 1,...,n): Mi(x,u) = ?? ?? ??? 1 if layeriis blocking the path of light ray x 0 otherwise 50 (a) (b) Figure 2.7: (a) Epipolar image of a complex scene with many occlusions.. (b) The Fourier transform of (a). There are multiple ringing effects parallel to all the occluding signals. We will examine the epipolar images Li(x,u) of the layers separately one by one. The Fourier transform of each layer is denoted by ?Li(?x,?u) and the Fourier transform of the occluded signal by ?Loi(?x,?u). The first layer is not occluded, thus its Fourier transform is identical to the unmasked Fourier transform, that is ?Lo1 = ?L1. The second layer is occludedbythefirst,thuswehaveLo2 = L2?M1 anditsFouriertransformis ?Lo2 = ?L2? ?M1. 
WecandothisnowforeverylayerandwegetLoi = Li?producttexti?1j=1Mi anditsFouriertransform is ?Loi = ?Li? ?M1?...? ?Mi?1. Usingtheresultsfrombefore,weseethattheFouriertransform oftheoccludedsignalisconcentratedalongthelinewithslopezi/f withringingartifacts parallel to z1/f,...,zi?1/f (Fig. 2.7). We can also include the windowing effect due to the finite image size by choosing the first mask to be the size of the camera window which will lead to some ringing artifacts that are parallel to the coordinate axes. This also suggests that to reduce the ringing effects, we should window the local plenoptic neighborhood of interest. 51 If we now compute the convolution, between ??i and ?LSi(?x)?(?u ??xzif ), we get ??i(?x)?(?LSi(?x)?(?u ??xzif )) (2.28) = integraldisplay ej(?x??y)x ui +xli 2 sin(xui ?xli2 (?x ??y)) ?x ??y ?LS i(?y)?(?u ??y zi f )d?y (2.29) =ej(?x??u f zi) xui +xli 2 sin(xui ?xli2 (?x ??u fzi)) ?x ??u fzi ?LS i(?u f zi) (2.30) and see how the occlusion again spreads out the Fourier spectrum of the texture compo- nent of the surface light. 2.6 Extension of natural image statistics to light field statistics There exists a large amount of literature about the statistics of natural images which we can use to simulate an ?average scene?. The most noticeable result is that the power spectrum of a natural image falls approximately inversely proportional to the square of the spatial frequency. Dong and Atick [39] demonstrate how a similar scaling law can be derived from first principles for spatio-temporal sequences. We can use their formalism to find an expression for the power spectrum of a light field |?L|2. As we have seen in Chapter 2, the power spectrum |?L|2 depends on the spatial frequencies of the textures in the scene, the orientations of the scene surfaces, as well as the depth and velocity distribution in the scene. For now we will disregard the effect of occlusions and assume that the power spectrum of a perspective image of the scene is rotationally symmetric, that means there is no special direction. For natural scenes this should be roughly satisfied, although it has been observed that horizontal and vertical orientations are often more predominant especially in man-made environments. Thus it is sufficient to find the power spectrum of a light field subspace formed by the axesu-x-t (time-varyingepi-polarplane). Theextensiontothefull5Dlightfieldisstraightforward. 52 Forthepowerspectrumofthestaticperspectiveimagerowwecanassumethatitfollows a power law |?L0(?u)|2 = Kbardbl?ubardblm wherem? 2.3 andK is a normalization constant [39]. As shown in Section 2.5, if we disregard occlusions the energy of the Fourier trans- form of an epipolar plane image (x-u-plane) observing an object at a constant depth z is concentrated along the line ?x = f?u/z. Thus the power spectrum of the epipolar plane image of this scene is given by |?L(?u)|2?(?x ?f?u/z). If the depth is varying, then the power spectrum will spread out to a wedge-shaped region bounded by the minimal and maximal depth (see Fig 2.3c). For a given region in the world at depth z that moves relative to the camera with velocity ?x, we have the brightness invariance of the formI(u,t) = Is(u?f ?xt/z), thus the power spectrum of a perspective spatio-temporal image plane is concentrated along the line ?t = f?u?x/z. For the characterization of the spectrum of an spatio-temporal image observing points following more complicated trajectories see [30]. 
Now we can write the power spectrum of the time-varying epipolar plane as follows: |L(?x,?u,?t,z, ?x)|2 (2.31) = |Lu(?u)|2?(?x ?f?u/z)?(?t ?f?u?x/z) (2.32) GivenaprobabilitydistributionforthevelocitiesD?x(?x)anddepthsDz(z),wecanexpress an ?average? light field power spectrum by integrating over these distributions: |?L(?x,?u,?t)|2 = ?integraldisplay ?? ?integraldisplay 0 |L(?x,?u,?t,z, ?x)|2D?x(?x)Dz(z)d?xdz (2.33) = |Lu(?u)|2 ? ?integraldisplay ?? ?integraldisplay 0 ?(?x ?f?uz )?(?t ?f?u?xz )Dz(z)D?x(?x)dzd?x = Kbardbl? ubardblm D?x(?t? x )Dz(f?u? x ) (2.34) 53 since the integral is only non-zero when z = f?u/?x (2.35) ?x = (?tz)/(f?u) = ?t/?x. (2.36) 2.7 Summary In this chapter we described the information content of the different subspaces of the plenoptic function. We saw that most of the information that relates the image infor- mation to the geometry of the scene is encoded in two-dimensional subspaces known as epipolar plane images. We then examined the frequency spectrum of these two- dimensional subspaces and saw that the energy of these images in the frequency domain is concentrated along regions in space that are determined by the depth of the scene. We willlaterutilizetheseresultswhenwestudythesamplingproblemforpolydioptriccam- eras. In this thesis we will restrict our study to the dynamic plenoptic function observed by a rigidly moving observer. Since in chapter 5 we will focus specifically on this case, we will postpone the description of the details to that chapter. 54 Dim Subspace Object Information Content 0D: ray -color 1D: x,y line of view points ? intensity variation u,v circle of view directions ? intensity variation t temporal changes ? change detection 2D: xy orthographic ? metric measurements image parallel to image plane ? depth-independent fore- shortened 2D-texture uv perspective ? angle measurements image around optical center ? depth-dependent fore- shortened texture xv,yu push-broom image ? cylindrical panorama xt,yt video of ? motion parallel to sensor parallel lines ? temporal occlusion ut,vt video of ? radial motion converging lines ? temporal occlusions xu,yv epi-polar image ? depth from gradients ? occlusions from junctions Table 2.1: Information content the axis-aligned plenoptic subspaces, Part I 55 Dim Subspace Object Information Content 3D: xyu,xyv orthographic ? surface geometry epi-polar volume xyu,xyv perspective ? surface geometry epi-polar volume xyt orthographic video ? motion parallel to parallel lines image plane uvt pinhole camera ? motion around sphere video of directions xvt,yut push-broom video ? motion parallel to line and around circle xut,yvt epi-polar video ? linear planar rigid motion estimation 4D: xyuv light field ? reflection properties xyut,xyvt orthographic epi- ? 3D motion recovery polar volume video and segmentation xuvt,yuvt perspective epi- ? 3D motion recovery polar volume video and segmentation 5D: xyuvt light field video ? complete scene information ? linear metric motion estimation Table 2.2: Information content of axis-aligned plenoptic subspaces Part II 56 Chapter 3 Polydioptric Image Formation: From Light Rays to Images L(x) Camera Design Analysis(Optics)Sampling(Geometry) Prefilter(Image Filter)Synthesis(Representation) I(x) ? ?(x-k) Image Processing ?(?x) ?(x)?(k) ?(k) AquisitionNoise Figure 3.1: Imaging pipeline expressed in a function approximation framework In this Chapter we will describe how the properties of the optics and the sensors affect the imaging process. 
Our model of the image formation pipeline for a polydioptric camera is based on the mathematical framework described in [150] and is summarized in the diagram in Fig. 3.1. The pipeline consists of the following five components: 1. Optics of the lens system: The radiometric and geometrical transformation from light rays in world coordinates to light rays that impinge on the sensor surface. 2. Geometric distribution of the sensor elements: This includes the summation of light over the sensor area and the geometric sampling pattern of the sensor element 57 distribution. 3. Noise characteristics of the acquisition process: The conversion from irradiance into an electrical signal, including shot noise due to the quantum nature of light, and noise due to the readout electronics and quantization of the measurements. 4. Elementary processing on the image: Filtering in the image domain to reduce the noise and prefilter the image for the reconstruction. 5. Continuous representation of discrete measurements: Interpolating the discrete signal to generate a continuous representation. In the next sections we will formalize the five individual steps of the imaging process. 3.1 The Pixel Equation The conversion of irradiance into a digital intensity value at a pixel location can be de- scribed by the measurement equation (also known as pixel equation). For a single pixelithe measured intensityIi at timetj is given by the following equation: Ii(tj) = integraldisplay t?Tj integraldisplay x?Xi integraldisplay r?Ai(x) L(Gx(x;r,t),Gr(x;r,t),t)Pi(x;t))drdxdt. (3.1) In this equation x denotes the position vector on the imaging surface, r is the direction vectortowardsthelenssystem,andtisthetime. Thisequationcanbefurthergeneralized by including wavelength effects such as chromatic aberrations, but in this work we will not consider them further. L is the scene spectral radiance defined in object space, that is the usual plenoptic function. The geometric transformation of a light ray on its path through the camera system, that is the geometry of image formation, is described by the 58 functions Gx and Gr. An example transformations would be the focusing effect of the lens. P models the behavior of the shutter and the sensor response and is dependent on position and time. The integration is done over the positions in pixel coordinates that make up the pixel area Xi, the directions that point towards the lens aperture as a function of the light ray originAi(x), and the shutter intervalTj. Apolydioptriccameraconsistsofamultitudeofsuchpixels. Thepolydioptricsam- pling problem can be easier understood if we model the image formation as a sequence of convolutions on the space of light rays. We can rewrite the pixel equation by rewriting the rendering equation as a non-stationary filter operation in light ray space followed by a discrete sampling operation. The filter models the imaging acquisition process and the samplingpatternmodelsthesensorgeometryanddistribution. ThenewformofEq.(3.1) is then: Ii(tj) = integraldisplay Tj integraldisplay Xi integraldisplay Ai(x) L(Gx(x;r,t),Gr(x;r,t),t)Pi(x;r;t))drdxdt (3.2) = integraldisplay ?Tj integraldisplay ?Xi integraldisplay ?Ai(x) L(x;r;t)Pi(G?1x (x,r,t),G?1r (x,r,t),t))drdxdt (3.3) = integraldisplay R3 integraldisplay S2 integraldisplay R+ L(x;r,t)??(x?xi,r?ri,t?tj)drdxdt (3.4) where ? is a function that integrates the effect of the integration limits, the geometric transformationsG?1x ,G?1r , as well as the optical effects captured byP. 
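To make the "filter then sample" view of the measurement equation concrete, the minimal sketch below evaluates Eq. (3.4) for one pixel by Monte Carlo integration, assuming a box-shaped kernel supported on a small patch of positions around x_i, a small cone of directions around a central ray r_i, and the shutter interval, and assuming identity geometric transformations G_x, G_r. The plenoptic function `plenoptic` and all numerical parameters are made-up test inputs.

```python
import numpy as np

def plenoptic(x, r, t):
    """Stand-in for the scene radiance L(x; r; t); any smooth function of
    ray position, unit direction, and time works for this illustration."""
    return 1.0 + 0.5 * np.sin(3.0 * r[0] + 2.0 * r[1] + t) * np.cos(4.0 * x[0])

def measure_pixel(x_i, r_i, t_j, dx, cone, shutter, n=20000, seed=0):
    """Monte Carlo evaluation of Eq. (3.4) with a box-shaped kernel:
    average L over positions within +- dx/2 of x_i, directions in a small
    cone around r_i, and times within the shutter interval."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n):
        x = x_i + dx * (rng.random(3) - 0.5)          # position jitter
        r = r_i + cone * rng.normal(size=3)           # direction jitter
        r /= np.linalg.norm(r)                        # back onto S^2
        t = t_j + shutter * rng.random()
        acc += plenoptic(x, r, t)
    return acc / n                                    # normalized pixel response

if __name__ == "__main__":
    x_i = np.array([0.01, -0.02, 0.0])                # pixel position
    r_i = np.array([0.0, 0.0, 1.0])                   # central ray direction
    print("pixel response:", measure_pixel(x_i, r_i, t_j=0.0,
                                           dx=1e-3, cone=5e-3, shutter=1e-2))
```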
3.2 Optics of the lens system

This section follows the exposition of a correct and realistic camera model as presented in [76]. We will now describe in detail the different components of this model. We will assume that the coordinate system is located at the center of the image surface and that the axes of the coordinate system are aligned with the axes of the image. In this section we will describe two different idealizations of the lens system of a camera, the pinhole and the thin-lens model.

[Figure 3.2: (a) Pinhole and (b) thin-lens camera geometry.]

Pinhole Camera

The simplest camera geometry is the pinhole camera model, which uses no lens at all. A point aperture is placed in front of the imaging surface, so only the rays through a single point in space pass through the aperture and fall on the imaging surface. Every point in space can be imaged in perfect focus (up to the diffraction limit), and we need only one parameter to describe this lens, namely the distance between the imaging surface and the lens center $d_i$, often also called the focal length $f$. If we choose the focal point as the origin of our fiducial coordinate system, then the projection of a point $P \in \mathbb{R}^3$ onto the image plane results in the image point $x$ whose coordinates are given by the well-known equation

\[ x = -\frac{d_i\,P}{\hat z \cdot P} = -\frac{f\,P}{\hat z \cdot P} \qquad (3.5) \]

where $\hat z$ is the unit vector along the optical axis. In this case the ray pixel equation becomes

\[ I_i = \int_{(x,y)\in X_i}\int_{t\in T_j} L\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix};\ \begin{bmatrix} -x\\ -y\\ d_i\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,dx\,dy\,dt. \qquad (3.6) \]

To simplify the notation we used the shorthand $[r]_N = r/\|r\|$ to describe vectors of unit length. We see that the geometric transformations of the light rays are very simple, that is $G_x(x;r,t) = x$ and $G_r(x;r,t) = -[x]_N$, and the inverse transformations are given by $G_x^{-1}(x;r,t) = 0$ and $G_r^{-1}(x;r,t) = r$.

Thin-lens Camera

In most cases the light gathering power of a simple pinhole camera is not sufficient (or it would necessitate an unreasonably long exposure time); therefore a lens is used to focus the light passing through a larger, finite aperture. Let us assume this aperture is circular and has a radius $R_l$. For the thin-lens approximation, we assume that we have a lens of infinitesimally small extent along the optical axis, which means the two sides of the lens coincide in this idealization. The only parameters we need to know to describe this kind of camera are the focal length $f$ of the lens and the distance $d_i$ between the imaging surface and the lens center. Due to the focusing of the lens, only one plane in object space at depth $d_0$ will be imaged in focus on the imaging surface. The relationship between $d_0$, $d_i$ and $f$ is given by the lens equation:

\[ \frac{1}{d_0} + \frac{1}{d_i} = \frac{1}{f} \qquad (3.7) \]

We see that the focal length of the camera is the distance from the lens at which an object at infinite depth will be imaged in focus. If point $P$ lies at a depth $d$ different from $d_0$, then the point will be imaged on the imaging surface as a blurred circular patch, where the blur radius is given by the formula:

\[ r_b = R_l\, d_i \left(\frac{1}{f} - \frac{1}{d_i} - \frac{1}{d}\right) \qquad (3.8) \]

The blur effect due to defocus has been a popular method to recover depth in the literature (e.g., [127]).
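As a quick numerical illustration of the lens equation (3.7) and the blur radius formula (3.8), the following hedged sketch computes the plane of focus for toy values of $f$ and $d_i$ and the blur radius of points at a few depths; the numbers are invented for illustration only.

```python
f = 0.050        # focal length [m], illustrative
d_i = 0.052      # lens-to-sensor distance [m], illustrative
R_l = 0.010      # aperture radius [m], illustrative

# Plane of focus implied by the lens equation 1/d_0 + 1/d_i = 1/f
d_0 = 1.0 / (1.0 / f - 1.0 / d_i)

def blur_radius(d):
    """Radius of the blur circle for a point at depth d, Eq. (3.8)."""
    return abs(R_l * d_i * (1.0 / f - 1.0 / d_i - 1.0 / d))

print(f"plane of focus at d_0 = {d_0:.3f} m")
for d in (0.5, d_0, 2.0, 10.0):
    print(f"depth {d:6.3f} m -> blur radius {blur_radius(d) * 1e6:8.1f} micrometers")
```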
The fact that the distance between the intersection of the rays that pass through the periphery of the lens with the imaging surface and the intersection of the principal ray with the imaging surface depends on the depth of the point that is imaged is the basic idea behind the "plenoptic camera" of Adelson and Wang [2] and the "depth by optical differentiation" method of Farid and Simoncelli [42]. The projection of a point in the world onto the imaging plane of a thin-lens camera is still described by Eq. (3.5), since the geometric aspects of the projection are completely defined by the principal ray. If this point is not lying on the plane of focus, then radiance leaving this point will be measured not just at a single location on the imaging surface, but over an extended area of the sensor.

Since Eq. (3.8) is symmetric with respect to the distances of the lens to the imaging surface and the lens to the object, we can also go the other direction and use it to define which rays of the world are integrated to form the response at a single pixel. If we define a coordinate system on the lens using the coordinates $u,v$, and denote the set of locations on the lens by the set $A_L := \{(u,v) : \|(u,v)\| \le R_l\}$, then inside the camera we have the following expression for the pixel equation:

\[ I_i = \int_{(x,y)\in X_i}\int_{(u,v)\in A_L}\int_{t\in T_j} L\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix};\ \begin{bmatrix} u-x\\ v-y\\ d_i\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,du\,dv\,dx\,dy\,dt \qquad (3.9) \]

Next, we have to determine what image is projected onto the lens. We know that the points at depth $d_0$ are imaged in perfect focus, thus the principal ray originating at $x = [x,y,-d_i]^T$ intersects the plane of focus at

\[ x_f = \left[-\frac{x d_0}{d_i},\ -\frac{y d_0}{d_i},\ d_0\right]^T = \left[-\frac{xf}{d_i-f},\ -\frac{yf}{d_i-f},\ \frac{fd_i}{d_i-f}\right]^T. \qquad (3.10) \]

Therefore the refracted direction of the ray that intersects the lens at position $x_l = [u,v,0]^T$ is given by $r_f = [x_f - x_l]_N$. This leads to the following equation for the rays that are observed by a pixel:

\[ I_i = \int_{(x,y)\in X_i}\int_{(u,v)\in A_L}\int_{t\in T_j} L\!\left(\begin{bmatrix} u\\ v\\ 0\end{bmatrix};\ \begin{bmatrix} -xf/(d_i-f)-u\\ -yf/(d_i-f)-v\\ d_i f/(d_i-f)\end{bmatrix}_N;\ t\right) P_i\!\left(\begin{bmatrix} x\\ y\\ -d_i\end{bmatrix},\,t\right)\,du\,dv\,dx\,dy\,dt \qquad (3.11) \]

If the origin of the coordinate system is at the center of the lens, then for an imaging element at distance $d_i$ from the lens, the plane at depth $d_o = f d_i/(f - d_i)$ will be in focus. We use this fact to rewrite the pixel equation for the thin-lens camera in the filter/sampling framework. Given the ray $(x;r)$ with $\hat z \cdot x = -d_i$ leaving the imaging element, it will intersect the lens at location $x_l = x + \frac{d_i}{\hat z \cdot r}\,r$ and the plane of focus at $x_f = -\frac{d_o}{d_i}\,x$. Thus by setting $G_x(x;r,t) = x_l$ and $G_r(x;r,t) = [x_f - x_l]_N$ we can describe the geometric mapping between rays in object space and the rays leaving the sensor surface. Similarly, we can define the inverse functions.
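The following is a hedged sketch of the geometric mapping $G_x$, $G_r$ just described: a sensor-side ray is mapped to its intersection with the lens plane and to the direction towards the conjugate point on the plane of focus. The numerical values are illustrative, and the positive-depth form $d_o = f d_i/(d_i - f)$ of the focus distance is assumed.

```python
import numpy as np

z_hat = np.array([0.0, 0.0, 1.0])

def normalize(v):
    return v / np.linalg.norm(v)

def thin_lens_ray(x, r, d_i, d_o):
    """Map a sensor-side ray (x, r) with z_hat . x = -d_i to its object-side ray."""
    x = np.asarray(x, float)
    r = np.asarray(r, float)
    x_l = x + (d_i / np.dot(z_hat, r)) * r   # G_x: intersection with the lens plane z = 0
    x_f = -(d_o / d_i) * x                   # conjugate point on the plane of focus
    return x_l, normalize(x_f - x_l)         # (G_x, G_r)

d_i, f = 0.052, 0.050                        # illustrative values
d_o = f * d_i / (d_i - f)                    # plane of focus (positive-depth form assumed)
x = np.array([0.001, 0.0, -d_i])             # pixel position on the sensor
r = normalize(np.array([0.002, 0.0, d_i]))   # direction towards a point on the lens
print(thin_lens_ray(x, r, d_i, d_o))
```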
3.3 Radiometry

Since many of the sources of noise are dependent on the intensity of the light falling onto the sensor, we need to analyze how the amount of light depends on the size of the sensor surface as well as the aperture of the lens. Sensor response is a function of exposure, the integral of the irradiance at a point $x$ on the film plane over the time the shutter is open. If we assume constant irradiance over the shutter period (an assumption we have to relax later when using a moving sensor), then we have

\[ I_s(x;t_0) = \int_{t_0}^{t_0+T} E(x;t)\,dt = E(x;t_0)\,T \qquad (3.12) \]

where $E(x;t)$ is the irradiance at location $x$ and time $t$, $T$ is the exposure duration, and $H(x)$ is the exposure at $x$ and time $t_0$. $E(x;t)$ is the result of integrating the radiance at $x$ over the solid angle subtended by the exit pupil, which is modeled as a disk. Denoting the set of locations in this disk by $D$ we get

\[ E(x) = \int_{x'\in D} L\!\left(x;\ \frac{x'-x}{\|x'-x\|}\right) \frac{\cos(\alpha)\cos(\alpha')}{\|x'-x\|^2}\,dA \qquad (3.13) \]

where $L$ is the plenoptic function, $\alpha$ is the angle between the normal of the sensor surface and $x'-x$, $\alpha'$ is the angle between the normal of the aperture stop and $x-x'$, and $dA$ is the differential area. If the sensor is parallel to the disk, with $d_i$ the axial distance between lens and sensor, then Eq. (3.13) can be rewritten as

\[ E(x) = \frac{1}{f^2}\int_{x'\in D} L\!\left(x;\ \frac{x'-x}{\|x'-x\|}\right)\cos^4(\alpha)\,dA. \qquad (3.14) \]

The weighting in the irradiance integral leads to a varying irradiance across the imaging surface. If the exit pupil subtends a small angle from $x$, we can assume that $\alpha$ is constant and equal to the angle between $x$ and the center of the disk. This leads to

\[ E(x) = \frac{L A}{f^2}\cos^4(\alpha). \qquad (3.15) \]

This constant multiplicative factor can easily be accounted for in the imaging process by rescaling the pixel intensity values accordingly.

3.4 Noise characteristics of a CCD sensor

In the description of the different noise processes, we follow Healey and Kondepudy [57]. Other good descriptions of all the elements involved in the imaging process can be found in [74] and [31]. CCDs convert light intensity linearly into a sensor response, but we must account for the black level, intensity scaling, vignetting, etc. in the camera system. An overview of the process can be seen in Fig. 3.3 (redrawn from [149]), which shows the image formation process.

[Figure 3.3: CCD camera imaging pipeline (redrawn from [149]): scene radiance passes through atmospheric attenuation and the lens (geometric distortion) to give the camera irradiance; CCD imaging adds the Bayer pattern, fixed pattern noise, dark current, shot noise and thermal noise; white balancing, gamma correction, D/A and A/D transmission and interpolation yield the digitized image.]

The capture of irradiant light by a CCD sensor is analogous to measuring the amount of rainfall on a field by setting up a regular array of buckets to capture the rain per unit area during the storm and then measuring it by carrying the buckets one by one to a measuring unit. A CCD measures the amount of light falling on a thin wafer of silicon. The impact of a photon on the silicon creates an electron-hole pair (photo-electric effect). These free electrons are then collected at discretely spaced collection sites. Each collection site is formed by growing a thin layer of silicon dioxide on the silicon and depositing a conductive gate structure over the silicon oxide. If a positive electrical potential is applied to the gate, a depletion region is created that can hold the free electrons. By integrating the number of electrons stored at a collection site over a fixed time interval, the light energy is finally converted into an electronic representation.

A process called charge-coupling is used to transfer the stored charge from one collection site to an adjacent collection site.
The fraction of charge that can be effectively transported is known as the charge transfer efficiency of a device. An image is read out by transferring the charge packets integrated at a collection site in parallel along electron conducting channels that connect columns of collection sites. A serial output register with one element for each column receives a new row transfer after each parallel transport. In between parallel transfers, all the charge in the serial output register is transferred in sequence to an output amplifier that generates a signal proportional to the amount of charge. Once all the serial registers have been read out, the next parallel transfer happens. In analog video cameras the signal generated by the CCD is transformed into an analog signal and converted into the digital domain by a frame grabber. In modern digital cameras, the signal from the CCD is directly digitized by the camera electronics, and the frame grabber is not necessary anymore. In the next section, we will examine the different sources of error during this process that one would like to calibrate for during the radiometric calibration step.

Let us start by examining the signal at a single collection site. The signal at this site is proportional to the number of electrons that arrive at the site and can be described as:

\[ I = T \int_{\lambda}\int_{x\in X_i} E(x,y,\lambda)\,P(x,y)\,q(\lambda)\,dx\,dy\,d\lambda \qquad (3.16) \]

Here $I$ is again the intensity of the resulting signal, $T$ is the integration time (assuming constant irradiance over time), and $E(x,y,\lambda)$ denotes the spectral irradiance at the given location of the collection site, that is the integral over all the rays that pass through the lens aperture as described in Section 3.2:

\[ E(x,y,\lambda) = \int_{(u,v)\in A_l(x,y)} L(x,y,u,v,t)\,du\,dv. \qquad (3.17) \]

$P(x,y)$ is the spatial response of the collection site, and $q(\lambda)$ describes the quantum efficiency of the device, that is the ratio of electron flux to incident photon flux as a function of wavelength, which we had omitted in Eq. (3.2). We will now examine the different sources of noise in detail.

Fixed pattern noise (FPN)

Due to processing errors during CCD fabrication there exist small variations in quantum efficiency between different collection sites. Thus even if we have a completely uniform irradiance on the CCD array, we will get a non-uniform response across the array, which is known as fixed pattern noise. Although this fluctuation in quantum efficiency often depends on wavelength, we will disregard this wavelength dependence for now. We model the number of electrons collected at a site by $KI$, where $I$ is defined as in Eq. (3.16) and $K$ accounts for the difference between $q(\lambda)$ and $S(x,y)$ at the collection sites by scaling them appropriately. We assume that the mean of $K$ is 1 and that the spatial variance is small and given by $\sigma_K^2$.

Blooming

Usually one assumes that the number of electrons collected at each site is independent of the number of electrons collected at neighboring sites. This can be violated if a site is illuminated with enough energy to cause the potential well to overflow with charges, which then contaminate the wells at other collection sites. This process is called blooming and in bad cases can affect many surrounding sites in the neighborhood of an overexposed site. Modern cameras build special walls between the sites to reduce these effects. For now we will assume appropriate illumination and thus disregard this effect.

Thermal energy

Thermal energy in silicon generates free electrons. These free electrons are known as dark current.
They can also be stored at collection sites and therefore become indistinguishable from electrons due to light incidence. The expected number of dark electrons is proportional to the integration time $T$ and is highly temperature dependent. Cooling a sensor can reduce the dark current substantially. The dark current also varies from collection site to collection site. We will denote the noise from the dark current as $N_{DC}$.

Photon shot noise

Shot noise is a result of the quantum nature of light and expresses the uncertainty about the actual number of electrons stored at a collection site. The number of electrons that interact with the collection site follows a Poisson distribution, so that its variance equals its mean. The probability to measure a given intensity $I$ for the incoming radiance $L$ is thus given by:

\[ P_{\mathrm{Poisson}}(I = s) = \frac{L^s}{s!}\,e^{-L} \qquad (3.18) \]

with a mean $\mathrm{E}[P_{\mathrm{Poisson}}(I)] = L$ and a variance $\mathrm{Var}[P_{\mathrm{Poisson}}(I)] = L$. We therefore see that the signal-to-noise ratio is illumination and sensor size dependent and increases with the square root of the intensity. Shot noise is a fundamental limitation and cannot be eliminated. Since the dark current increases the number of electrons stored at a site, it will increase the mean and thus the variance of the number of electrons stored. The number of electrons integrated at a collection site is given by

\[ KI + N_{DC} + N_S \qquad (3.19) \]

where $N_S$ is the zero-mean Poisson shot noise with a variance that depends on the number of collected photo-electrons $KI$ and the number of dark current electrons $N_{DC}$.

Read-out noise

The charge transfer efficiency of a real CCD is not perfect and the noise due to charge transfer can be quantified. Modern CCDs, though, achieve transfer efficiencies on the order of 99.999%, so that we can safely disregard this effect. During the next step, the on-chip output amplifier converts the charge collected at each site into a measurable voltage. This generates zero-mean read noise $N_R$ that is independent of the number of collected electrons. Amplifier noise will dominate shot noise at low light levels and determines the read noise floor of the CCD. The voltage signal can then be transformed into a video signal which is electronically low-pass filtered to remove high-frequency noise. The gain from both output amplifier and camera circuitry is denoted by $A$. For a digital camera, the conversion to video is not necessary, so $A$ will only depend on the output amplifier. The video signal leaving the camera can then be described by

\[ V = (KI + N_{DC} + N_S + N_R)\,A \qquad (3.20) \]

Quantization error

To generate a digital image, the analog signal $V$ needs to be converted to a digital signal. This is either done using a frame grabber for analog video cameras or directly in the camera for digital cameras using an analog-to-digital (A/D) converter. The A/D converter approximates the analog voltage $V$ using an integer multiple of the quantization step $q$, so that each value of $V$ that satisfies $(n-0.5)q < V < (n+0.5)q$ is rounded to the digital value $D = nq$, where $n$ must satisfy $0 \le n \le 2^b$, with $b$ the number of bits used to represent $D$. To prevent clipping, $q$ and $b$ need to be chosen such that $\max(V) \le (2^b - 0.5)q$. This quantization step can be modeled by adding a noise term $N_Q$ that is a zero-mean random variable independent of $V$ and uniformly distributed over the range $[-0.5q, 0.5q]$, which results in a variance of $q^2/12$.
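Before turning to dynamic range, the noise model assembled so far (Eqs. (3.18) to (3.20) plus the quantization step) can be summarized in a small simulation sketch. All gain, dark-current, read-noise and quantization parameters below are invented for illustration and are not calibrated values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def ccd_measure(I, K_sigma=0.01, dark=20.0, read_sigma=5.0, gain=0.1,
                q=0.25, n_bits=10):
    """Simulate D = (K*I + N_DC + N_S + N_R)*A + N_Q for one collection site."""
    K = 1.0 + K_sigma * rng.standard_normal()        # fixed pattern noise, mean 1
    electrons = rng.poisson(K * I + dark)            # shot noise on photo- and dark electrons
    V = (electrons + read_sigma * rng.standard_normal()) * gain   # read noise and gain A
    n = np.clip(np.round(V / q), 0, 2 ** n_bits)     # A/D conversion with step q
    return n * q

samples = np.array([ccd_measure(1000.0) for _ in range(10000)])
print(f"mean digital value {samples.mean():.2f}, std {samples.std():.2f}")
```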
Another important influence on the sensor accuracy is the limited dynamic range of today's sensors, which is a major source of error in many applications. Since only a limited range of signal magnitudes can be encoded without local processing on the chip, we have to take into account the tradeoff between a high saturation threshold and a sufficient resolution at lower illumination levels. A detailed study of the benefits of high-dynamic-range imaging is beyond this thesis, and the reader is referred to the literature [92, 101].

Thus, in conclusion, a comprehensive model for the image formation at a single CCD location is given by

\[ D = V + N_Q = (KI + N_{DC} + N_S + N_R)\,A + N_Q \qquad (3.21) \]

3.5 Image formation in shift-invariant spaces

Since all natural signals have finite energy, we can represent a given instance of the space of light rays using the light field parameterization as an element of $L_2(\mathbb{R}^5)$, the space of measurable, square integrable functions defined on $\mathbb{R}^5$, and phrase the light field reconstruction problem as a function approximation problem in $L_2(\mathbb{R}^5)$ using recent results in approximation theory [144, 13].

[Figure 3.4: Image formation diagram showing object, lens and image plane with the object and image distances $d_o$ and $d_i$.]

We will restrict our analysis to regular arrays of densely spaced pinhole cameras and analyze how accurately a specified camera arrangement can reconstruct the continuous plenoptic function under a given model for the environment. We will use the two-plane parameterization and assume that the imaging elements of the camera sample the light field on a regular lattice in the 5-D space of light rays, which corresponds to a choice of camera spacing, image resolution, and frame rate. An example setup would be a set of cameras with their optical axes perpendicular to a plane containing the focal points (see Fig. 2.1). Then we can describe the 5D periodic lattice $\mathcal{A}$ using 5 vectors $[a_1, a_2, \ldots, a_5]$ which form a lattice matrix $A$ such that $\mathcal{A} = \{Ak \mid k \in \mathbb{Z}^5\}$. A unique tiling of the space of light rays can be achieved by associating with each lattice site a Voronoi cell, which contains all points that are closer to the given lattice site than to any other.

The camera output is modeled as the inner product of the light field with different translates of an analysis function $\psi$, which was described in the previous sections:

\[ \tilde c_\psi(k) = \int L(x)\,\psi(A^{-1}x - k)\,dx; \qquad \tilde c_\psi(k) \in l_2(\mathbb{Z}^5). \qquad (3.22) \]

The function $\psi : \mathbb{R}^5 \to \mathbb{R}$ is dilated by the dilation matrix $A^{-1}$ and sampled according to the lattice pattern $\mathcal{A}$, which results in Eq. (3.22). It models the effects of the pixel response function (PRF) such as scattering, blurring, diffraction, flux integration across the pixel's receptive field, shutter time, and other signal degradations, as explained in Section 3.1. As described in Section 3.4, the conversion of the irradiance on the sensor surface into an electrical energy introduces noise into the measurement process, which we denote by $N(\tilde c_\psi, k)$. The value measured by the sensor is then given by:

\[ c_\psi(k) = \tilde c_\psi(k) + N(\tilde c_\psi, k) \qquad (3.23) \]

Since we are interested in a continuous reconstructed light field $I(x)$, we will express it as a linear combination of synthesis functions $\varphi : \mathbb{R}^5 \to \mathbb{R}$ centered on the lattice points. Thus $I(x)$ is represented as

\[ I(x) = \sum_{k\in\mathbb{Z}^5} c_\varphi(k)\,\varphi(A^{-1}x - k); \qquad c_\varphi(k) \in l_2(\mathbb{Z}^5) \qquad (3.24) \]

where $l_2$ is the space of square-summable sequences ($\sum_{k\in\mathbb{Z}^5}|c(k)|^2 < \infty$).
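A minimal one-dimensional analogue of the analysis/synthesis model of Eqs. (3.22) to (3.24): samples are taken by integrating against a box analysis function on a regular lattice, and the signal is re-expanded in shifted linear B-splines. Both kernels and the lattice spacing are illustrative choices, not the camera prefilters or lattices analyzed in this thesis.

```python
import numpy as np

a = 0.05                                      # lattice spacing (1-D stand-in for the matrix A)
x = np.linspace(0.0, 1.0, 5000)
dx = x[1] - x[0]
L = np.cos(2 * np.pi * 3 * x) + 0.2 * np.cos(2 * np.pi * 9 * x)   # toy light field slice

psi = lambda u: (np.abs(u) <= 0.5).astype(float)   # analysis: box (pixel integration)
phi = lambda u: np.maximum(1.0 - np.abs(u), 0.0)   # synthesis: linear B-spline

ks = np.arange(0, int(1.0 / a) + 1)
c = np.array([np.sum(L * psi(x / a - k)) * dx / a for k in ks])    # normalized analysis, Eq. (3.22)

I = sum(c[i] * phi(x / a - k) for i, k in enumerate(ks))           # reconstruction, Eq. (3.24)

print(f"max reconstruction error: {np.max(np.abs(I - L)):.3f}")
```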
If $p > 1$, then the Minkowski inequality states that

\[ \left(\int_a^b |f(x) + g(x)|^p\,dx\right)^{\frac{1}{p}} \le \left(\int_a^b |f(x)|^p\,dx\right)^{\frac{1}{p}} + \left(\int_a^b |g(x)|^p\,dx\right)^{\frac{1}{p}} \]

Similarly, if $p > 1$ and $a_k, b_k > 0$, then the Minkowski inequality is also valid for sums of sequences and we have

\[ \left(\sum_{k=1}^n (a_k + b_k)^p\right)^{\frac{1}{p}} \le \left(\sum_{k=1}^n a_k^p\right)^{\frac{1}{p}} + \left(\sum_{k=1}^n b_k^p\right)^{\frac{1}{p}} \]

A.1.5 Sobolev Space

Let $r$ be a positive real number and $\Omega \subset \mathbb{R}^n$. The Sobolev space $W^r_p$ is defined as

\[ W^r_p(\Omega) := \{ f \in L_p(\Omega) \mid \partial^\alpha f \in L_p(\Omega)\ \forall\ \text{multi-index}\ \alpha,\ |\alpha| \le r \} \]

In the following we will work with functions $f \in W^r_2$ with $r > \frac{1}{2}$. This implies the requirement of Hölder continuity with exponent $r - \frac{1}{2}$.

It is of interest how well a given synthesis function is able to approximate a given function. We will measure this using the $L_2$-norm of the difference between the original function and its approximation, given by

\[ \epsilon_f = \| f - D_{(A,B)} f \|_{L_2} \qquad (A.23) \]

This error term $\epsilon_f$ can be split into two terms: one dominating main term that expresses the idea of an average error over all possible phase positions of a signal, and a perturbation. The main term is computed by integrating $|\hat f(v)|^2$, where $\hat f$ is the Fourier transform of $f$, against an error kernel of the form

\[ E(v) = \left| 1 - \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) \right|^2 + \sum_{k\in\mathbb{Z}^n\setminus\{0\}} \left| \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \qquad (A.24) \]

where $\gamma = |\det(A)|/|\det(B)|$. The additional correction term $e(f,A,B)$ depends on the Sobolev regularity exponent of the function to be approximated. In this chapter we will outline the steps to derive the quantification of the error with respect to camera spacing and frequency of the signal. This is based on the descriptions in the papers [13, 12]. They describe the 1D case, and we will extend it to the n-D case where the sampling pattern is given by a general scaling matrix $A$ and sampling matrix $B$.

Theorem A.4. (Adapted from [12]) For all $f \in W^r_2$ with $r > 1/2$ the approximation error averaged over all possible phase shifts of the function $f$ with regard to the sampling operator $D_{(A,B)}$ is given by

\[ \epsilon_f = \left\langle \| f - D_{(A,B)} f \|_{L_2} \right\rangle = \left[ \int |\hat f(v)|^2\, E(A^T v)\, dv \right]^{1/2} \qquad (A.25) \]

The proof consists of three parts, which we will detail in the next subsections.

A.2.3 $l_2$ Convergence of the Samples

First we will define the following $B^{-T}$-periodic functions

\[ U(v) = \frac{1}{\det(B)} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} m) \qquad (A.26) \]
\[ V(v) = \frac{1}{\det(B)} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\psi}(A^T v + A^T B^{-T} m) \qquad (A.27) \]

and prove that these functions are well defined and belong to $L_2(V_{B^{-T}})$, where $V_{B^{-T}}$ is the Voronoi cell defined by the lattice matrix $B^{-T}$, that is $V_{B^{-T}} := \{ B^{-T} x \mid x \in [-1/2, 1/2]^n \}$. If this is the case, then we can develop these $B^{-T}$-periodic functions into a Fourier series

\[ U(v) = \sum_{k\in\mathbb{Z}^n} a_k \exp(2\pi i\, v^T B k) \qquad (A.28) \]
\[ a_k = \int_{V_{B^{-T}}} |\det(B)|\,\frac{1}{|\det(B)|} \sum_m \hat f(v + B^{-T} m)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} m)\,\exp(2\pi i\, v^T B k)\,dv \qquad (A.29) \]
\[ = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat\varphi}(A^T v)\,\exp(2\pi i\, v^T B k)\,dv \qquad (A.30) \]

Using the fact that the inverse Fourier transform of the product of one function and the complex conjugate of another in the Fourier domain equals the correlation of the two functions in the signal domain, that is

\[ \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat\varphi}(A^T v)\,\exp(-2\pi i\, v^T(-Bk))\,dv = \mathcal{F}^{-1}\!\left\{ \hat f(v)\,\overline{\hat\varphi}(A^T v) \right\}\!(-Bk) = \frac{1}{|\det(A)|}\,(f \star \varphi\circ A^{-1})(-Bk) = \frac{1}{|\det(A)|} \int_{\mathbb{R}^n} f(\xi)\,\varphi(A^{-1}(\xi - Bk))\,d\xi \]

we can write down an alternative definition for $U$ and $V$ given by

\[ U(v) = \sum_k \left[ \frac{1}{|\det(A)|} \int f(\xi)\,\varphi(A^{-1}(\xi - Bk))\,d\xi \right] \exp(2\pi i\, v^T B k) \]
\[ V(v) = \sum_k \left[ \frac{1}{|\det(A)|} \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi \right] \exp(2\pi i\, v^T B k) \qquad (A.31) \]

Remark A.5.
The following section has not been checked yet! We will prove the well-posedness of the function $U$ as follows (the procedure for $V$ is exactly the same). First we will define the functional sequence

\[ U_K(v) = \frac{1}{|\det(B)|} \sum_{|k| \le K} \hat f(v + B^{-T} k)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k) \]

for $K \in \mathbb{N}^n_+$. We would like to prove that $U_K$ is a Cauchy sequence, which means we need to show that

\[ \lim_{K\to\infty}\ \sup_{K' > K}\ \| U_{K'} - U_K \|_{L_2(I)} = 0 \]

By the Riesz–Fischer theorem this will automatically prove the convergence of $U_K$ towards an $L_2(\mathbb{R}^n)$ function $U$. We choose $K' > K$, and assume that $f \in W^r_2(\mathbb{R}^n)$ with $r > \frac{1}{2}$ and that $\|\hat\varphi\|_\infty \le C < \infty$. Then we define the set of bandpass functions $f_m$, which are defined as

\[ \hat f_m(v) = \begin{cases} \hat f(v) & \text{if } \dfrac{m}{2|\det(B)|} \le \|v\| < \dfrac{m+1}{2|\det(B)|} \\[4pt] 0 & \text{elsewhere} \end{cases} \qquad (A.32) \]

which allows us to write

\[ U_{K'}(v) - U_K(v) = \frac{1}{|\det(B)|} \sum_{m\ge 0}\ \sum_{K < |k| \le K'} \hat f_m(v + B^{-T} k)\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k) \]

Not all $f_m$ contribute to the sum on the right-hand side because of their limited support. We only need to consider the terms with $m > 2K$; then, by applying the Minkowski inequality and the bound on $\hat\varphi$ assumed before, we find that

\[ \| U_{K'}(v) - U_K(v) \|_{L_2(I)} \le \frac{C}{|\det(A)|} \sum_{m > 2K} \| f_m \|_{L_2} \]

Looking at the definition of $f_m$ in (A.32), we see that on the support of $f_m$ we have $\|2\pi v\|^{-r} \le (|\det(B)|/(m\pi))^r$, so that we can bound the norm of $f_m$ by the norm of its $r$-th derivative

\[ \| f_m \|_{L_2} \le \left( \frac{|\det(B)|}{m\pi} \right)^{r} \| f_m^{(r)} \|_{L_2} \]

Using the Cauchy–Schwarz inequality for discrete sequences, we find that

\[ \sum_{m > 2K} \| f_m \|_{L_2} \le \left( \frac{\det(B)}{\pi} \right)^{r} \sqrt{ \sum_{m > 2K} m^{-2r} }\ \| f^{(r)} \|_{L_2} \]

This expression tends to zero as $K$ goes to infinity, therefore $U_K$ is a functional Cauchy sequence. This makes sure that the sum of the squared samples converges and proves that $U \in L_2(I)$.

A.2.4 Expression of $\epsilon_f$ in Fourier Variables

In this section, we will expand the $L_2$-error $\epsilon_f$ into three terms

\[ \epsilon_f^2 = \| f \|^2_{L_2} - 2\langle f, D_{(A,B)} f \rangle + \| D_{(A,B)} f \|^2_{L_2} \]

and examine each term. We will start with Eq. (A.20),

\[ D_{(A,B)} f(x) = \frac{1}{|\det(A)|} \sum_k \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,\varphi(A^{-1}(x - Bk))\,d\xi \]

We see that part of the integral looks a lot like the expression for $V$ in Eq. (A.31). We now express $\varphi(A^{-1}(x - Bk))$ in terms of its Fourier transform

\[ \varphi(A^{-1}(x - Bk)) = |\det(A)| \int \hat\varphi(A^T v)\,\exp(2\pi i\, v^T B k)\,\exp(-2\pi i\, x^T v)\,dv \]

and then we can manipulate Eq. (A.20):

\[ D_{(A,B)} f(x) = \int \left[ \sum_k \int f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi\ \exp(2\pi i\, v^T B k) \right] \hat\varphi(A^T v)\,\exp(-2\pi i\, x^T v)\,dv = |\det(A)| \int V(v)\,\hat\varphi(A^T v)\,\exp(-2\pi i\, x^T v)\,dv \]

We can see that $D_{(A,B)} f(x)$ and $|\det(A)|\,V(v)\,\hat\varphi(A^T v)$ are Fourier transforms of each other and thus have the same $L_2$-norm. Remembering that $V(v)$ is $B^{-T}$-periodic, we can write

\[ \| D_{(A,B)} f(x) \|^2_{L_2} \qquad (A.33) \]
\[ = |\det(A)|^2 \int \overline{V(v)}\,\overline{\hat\varphi}(A^T v)\,\hat\varphi(A^T v)\,V(v)\,dv = |\det(A)|^2 \sum_k \int_{V_{B^{-T}}} \overline{V(v)}\,\overline{\hat\varphi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k)\,V(v)\,dv \]
\[ = |\det(A)|^2 \int_{V_{B^{-T}}} \overline{V(v)}\,A(A^T v)\,V(v)\,dv \qquad (A.34) \]

where

\[ A(A^T v) = \sum_k \overline{\hat\varphi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k) \qquad (A.35) \]

The same trick (splitting or reassembling the infinite integral by using the periodicity) can now be applied to write the product of the two infinite sums in $V(v)$ as a combination of a sum and an infinite integral.
Starting from

\[ \| D_{(A,B)} f \|^2_{L_2} = |\det(A)|^2 \int_{V_{B^{-T}}} \overline{V(v)}\,A(A^T v)\,V(v)\,dv \]
\[ = \frac{|\det(A)|^2}{|\det(B)|^2} \int_{V_{B^{-T}}} \left[ \sum_k \hat f(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k) \right] A(A^T v) \left[ \sum_k \overline{\hat f}(v + B^{-T} k)\,\hat\psi(A^T v + A^T B^{-T} k) \right] dv \]
\[ = \frac{|\det(A)|^2}{|\det(B)|^2} \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v)\,A(A^T v)\,\hat\psi(A^T v + A^T B^{-T} k)\,dv \qquad (A.36) \]

A similar approach can now be applied to $\langle f, D_{(A,B)} f \rangle$. We have

\[ \langle f, D_{(A,B)} f \rangle = \int_{\mathbb{R}^n} f(x) \left[ \frac{1}{|\det(A)|} \sum_{k\in\mathbb{Z}^n} \int_{\mathbb{R}^n} f(\xi)\,\psi(A^{-1}(\xi - Bk))\,\varphi(A^{-1}(x - Bk))\,d\xi \right] dx = \sum_k \left[ \int_{\mathbb{R}^n} f(x)\,\varphi(A^{-1}(x - Bk))\,dx \right] \left[ \frac{1}{|\det(A)|} \int_{\mathbb{R}^n} f(\xi)\,\psi(A^{-1}(\xi - Bk))\,d\xi \right] \]

We now use the previous derivation to write

\[ \langle f, D_{(A,B)} f \rangle = \int_{\mathbb{R}^n} f(x) \left[ |\det(A)| \int_{\mathbb{R}^n} V(v)\,\hat\varphi(A^T v)\,\exp(2\pi i\, x^T v)\,dv \right] dx = |\det(A)| \int_{\mathbb{R}^n} \left[ \int_{\mathbb{R}^n} f(x)\,\exp(2\pi i\, x^T v)\,dx \right] V(v)\,\hat\varphi(A^T v)\,dv = |\det(A)| \int_{\mathbb{R}^n} \hat f(v)\,V(v)\,\hat\varphi(A^T v)\,dv \]

This can now be further simplified:

\[ \langle f, D_{(A,B)} f \rangle = |\det(A)| \int_{V_{B^{-T}}} \sum_k \hat f(v + B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k)\,V(v)\,dv = |\det(A)|\,|\det(B)| \int_{V_{B^{-T}}} U(v)\,V(v)\,dv \]

Again, we write $U$ and $V$ as sums:

\[ = \frac{|\det(A)|}{|\det(B)|} \int_{V_{B^{-T}}} \left[ \sum_k \hat f(v + B^{-T} k)\,\hat\varphi(A^T v + A^T B^{-T} k) \right] \left[ \sum_k \hat f(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k) \right] dv = \frac{|\det(A)|}{|\det(B)|} \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\hat\varphi(A^T v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,dv \]

Thus we can conclude that the final error measure consists of the following terms (we write $\gamma = |\det(A)|/|\det(B)|$):

\[ \epsilon_f^2 = \| f \|^2_{L_2} - 2\langle f, D_{(A,B)} f \rangle + \| D_{(A,B)} f \|^2_{L_2} \]
\[ = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v)\,dv - 2\gamma \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,\hat\varphi(A^T v)\,dv + \gamma^2 \sum_k \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\,A(A^T v)\,\hat\psi(A^T v)\,dv \]

We will now collect the terms for $k = 0$ and $k \ne 0$:

\[ \epsilon_f^2 = \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v)\left[ 1 - 2\gamma\,\overline{\hat\psi}(A^T v)\,\hat\varphi(A^T v) + \gamma^2\,\overline{\hat\psi}(A^T v)\,A(A^T v)\,\hat\psi(A^T v) \right] dv \qquad (A.37) \]
\[ + \sum_{k\ne 0} \int_{\mathbb{R}^n} \hat f(v)\,\overline{\hat f}(v + B^{-T} k)\,\overline{\hat\psi}(A^T v + A^T B^{-T} k)\left[ \gamma^2\,A(A^T v)\,\hat\psi(A^T v) - 2\gamma\,\hat\varphi(A^T v) \right] dv = \epsilon_1^2 + \epsilon_2^2 \qquad (A.38) \]

Then we can rewrite $\epsilon_1$ as the integration of $|\hat f(v)|^2$ against the error kernel $E(A^T v)$, where the error kernel becomes:

\[ E(v) = 1 - 2\gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) + \gamma^2\,\overline{\hat\psi}(v)\,A(v)\,\hat\psi(v) = 1 - 2\gamma\,\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\} + \gamma^2\,|\hat\psi(v)\,\hat\varphi(v)|^2 + \gamma^2 \sum_{k\ne 0} \left| \hat\psi(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \]
\[ = \left| 1 - \gamma\,\overline{\hat\psi}(v)\,\hat\varphi(v) \right|^2 + \sum_{k\ne 0} \left| \gamma\,\hat\psi(v)\,\hat\varphi(v + A^T B^{-T} k) \right|^2 \]

If we use orthonormalized basis functions, we get a simpler expression for the error kernel. That is, we normalize $\varphi$ and $\psi$ by premultiplying with $1/\sqrt{A(v)}$, where

\[ A(v) = \sum_k \hat\varphi(v + A^T B^{-T} k)\,\overline{\hat\varphi}(v + A^T B^{-T} k) = \sum_k |\hat\varphi(v + A^T B^{-T} k)|^2 \qquad (A.39) \]

is the Fourier transform of the sampled auto-correlation sequence $a_\varphi(k)$ of $\varphi$, $a_\varphi(k) = \langle \varphi(x),\, \varphi(x - A^{-1} B k) \rangle$, to compute the orthonormalized basis functions. The error kernel can then be rewritten as

\[ E(v) = 1 - 2\gamma\,\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\} + |\gamma\,\hat\psi(v)|^2\, A(v) \]
\[ = 1 - \frac{|\hat\varphi(v)|^2}{A(v)} + A(v)\left( \frac{|\hat\varphi(v)|^2}{A(v)^2} - 2\gamma\,\frac{\Re\{\overline{\hat\psi}(v)\,\hat\varphi(v)\}}{A(v)} + |\gamma\,\hat\psi(v)|^2 \right) \]
\[ = 1 - \frac{|\hat\varphi(v)|^2}{A(v)} + A(v)\left| \gamma\,\hat\psi(v) - \frac{\hat\varphi(v)}{A(v)} \right|^2 \]
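To make the error kernel concrete, the following hedged sketch evaluates $E(v)$ of Eq. (A.24) numerically in the simplest one-dimensional setting: unit lattice ($\gamma = 1$), point sampling ($\psi = \delta$, so $\hat\psi = 1$) and a linear B-spline synthesis function ($\hat\varphi(v) = \mathrm{sinc}^2(v)$). These kernel choices are illustrative and are not the camera prefilters studied in the thesis.

```python
import numpy as np

def phi_hat(v):
    return np.sinc(v) ** 2          # Fourier transform of the linear B-spline

def psi_hat(v):
    return np.ones_like(v)          # ideal point sampling (delta analysis function)

def error_kernel(v, n_alias=50):
    """E(v) = |1 - conj(psi_hat)*phi_hat|^2 + sum_{k != 0} |conj(psi_hat(v)) phi_hat(v+k)|^2."""
    main = np.abs(1.0 - np.conj(psi_hat(v)) * phi_hat(v)) ** 2
    alias = sum(np.abs(np.conj(psi_hat(v)) * phi_hat(v + k)) ** 2
                for k in range(-n_alias, n_alias + 1) if k != 0)
    return main + alias

v = np.linspace(0.0, 0.5, 6)
for vi, Ei in zip(v, error_kernel(v)):
    print(f"E({vi:.1f}) = {Ei:.4f}")
# E(0) = 0: a signal with all its energy at zero frequency is reproduced exactly.
```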
A.2.5 Average Approximation Error

This result is very powerful because it does not just give an upper bound on the error; it is actually the exact expression for the average error over all the possible phase shifts of the function $f$ [12], which means we have the identity

\[ \int_{V_B} \| f_u - D_{(A,B)} f_u \|_{L_2}\,du = \left[ \int |\hat f(v)|^2\,E(A^T v)\,dv \right]^{1/2} \qquad (A.40) \]

where $f_u := f(x - u)$.

Proof. The shift by $u$ corresponds to a multiplication of $\hat f(v)$ by $\exp(2\pi i\, u^T v)$. The first error term $\epsilon_1^2$ is unchanged, since $|\hat f(v)|^2$ is not affected by this modulation. The second error term $\epsilon_2^2$ is premultiplied by a factor of the form $\exp(2\pi i\, u^T B^T k)$, which is independent of $v$ and can thus be moved in front of the integral. The integral $\int_{V_B} \exp(2\pi i\, u^T B^T k)\,du$ equals 0 for all $k \ne 0$, and thus $\epsilon_2^2$ vanishes.

BIBLIOGRAPHY

[1] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early vision. In M. Landy and J. A. Movshon, editors, Computational Models of Visual Processing, pages 3–20. MIT Press, Cambridge, MA, 1991.

[2] E. H. Adelson and J. Y. A. Wang. Single lens stereo with a plenoptic camera. IEEE Trans. PAMI, 14:99–106, 1992.

[3] G. Adiv. Inherent ambiguities in recovering 3D motion and structure from a noisy flow field. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 70–77, 1985.

[4] P. Baker, R. Pless, C. Fermuller, and Y. Aloimonos. Camera networks for building shape models from video. In Workshop on 3D Structure from Multiple Images of Large-scale Environments (SMILE 2000), 2000.

[5] P. Baker, R. Pless, C. Fermuller, and Y. Aloimonos. A spherical eye from multiple cameras (makes better models of the world). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[6] Patrick Baker and Yiannis Aloimonos. Translational camera constraints on parallel lines. In Proc. Europ. Conf. Computer Vision, page in press, 2004.

[7] G. Baratoff and Y. Aloimonos. Changes in surface convexity and topology caused by distortions of stereoscopic visual space. In Proc. European Conference on Computer Vision, volume 2, pages 226–240, 1998.

[8] S. S. Beauchemin and J. L. Barron. On the Fourier properties of discontinuous motion. Journal of Mathematical Imaging and Vision (JMIV), 13:155–172, 2000.

[9] P. J. Besl. Active optical range imaging sensors. Machine Vision Appl., 1(2):127–152, 1988.

[10] F. Blais. A review of 20 years of range sensor development. In Videometrics VII, Proceedings of SPIE-IST Electronic Imaging, volume 5013, pages 62–76, 2003.

[11] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. PAMI, 25(9), 2003.

[12] T. Blu and M. Unser. Approximation error for quasi-interpolators and (multi-) wavelet expansions. Applied and Computational Harmonic Analysis, 6(2):219–251, March 1999.

[13] T. Blu and M. Unser. Quantitative Fourier analysis of approximation techniques: Part I, Interpolators and projectors. IEEE Transactions on Signal Processing, 47(10):2783–2795, October 1999.

[14] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7–55, 1987.

[15] R. C. Bolles, H. H. Baker, and D. H. Marimont.
Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7?55, 1987. [16] J. S. De Bonet and P. Viola. Roxels: Responsibility weighted 3d volume reconstruc- tion. In Proceedings of ICCV, September 1999. [17] O. Bottema and B. Roth. Theoretical Kinematics. North-Holland, 1979. [18] T.J.Broida,S.Chandrashekhar, andR.Chellappa. Recursive3-Dmotionestimation from a monocular image sequence. IEEE Transactions on Aerospace and Electronic Systems, 26(4):639?656, 1990. [19] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estia- tion based on a theory for warping. In Proc. European Conference on Computer Vision, volume 4, pages 25?36, 2004. [20] A. Bruss and B. K. P. Horn. Passive navigation. CVGIP: Image Understanding, 21:3? 20, 1983. [21] P. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communication, 31, 1983. [22] E. Camahort and D. Fussell. A geometric study of light field representations. Technical Report TR99-35, Dept. of Computer Sciences, The University of Texas at Austin, 1999. [23] E. Camahort, A. Lerios, and D. Fussell. Uniformly sampled light fields. In Euro- graphics Workshop on Rendering, pages 117?130, 1997. 199 [24] C. Capurro, F. Panerai, and G. Sandini. Vergence and tracking fusing log-polar images. In Proc. International Conference on Pattern Recognition, 1996. [25] R. L. Carceroni and K. Kutulakos. Multi-view scene capture by surfel sampling: From video streams to non-rigid 3d motion, shape, and reflectance. In Proc. Inter- national Conference on Computer Vision, June 2001. [26] J. Chai and H. Shum. Parallel projections for stereo reconstruction. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 493 ?500, 2000. [27] J. Chai, X. Tong, and H. Shum. Plenoptic sampling. In Proc. of ACM SIGGRAPH, pages 307?318, 2000. [28] W. Chen, J. Bouguet, M. Chu, and R. Grzeszczuk. Light field mapping: efficient representation and hardware rendering of surface light fields. ACM Transactions on Graphics (TOG), 21(3):447?456, 2002. [29] Amit Roy Chowdhury and Rama Chellappa. Stochastic approximation and rate- distortion analysis for robust structure and motion estimation. International Journal of Computer Vision, 55(1):27?53, October 2003. [30] G.M. Cortelazzo and M. Balanza. Frequency domain analysis of translations with piecewise cubic trajectories. PAMI, 15(4):411?416, April 1993. [31] R. Costantini and S. S?usstrunk. Virtual sensor design. In Proc. IS&T/SPIE Elec- tronic Imaging 2004: Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications V, volume 5301, pages 408?419, 2004. 200 [32] S. Crossley, N.A. Thacker, and N.L. Seed. Benchmarking of bootstrap temporal stereo using statistical and physical scene modelling. In British Machine Vision Con- ference, pages 346?355, 1998. [33] K. Daniilidis. Zur Fehlerempfindlichkeit in der Ermittlung von Objektbeschreibungen und relativen Bewegungen aus monokularen Bildfolgen. PhD thesis, Fakult?at f?ur Infor- matik, Universit?at Karlsruhe (TH), 1992. [34] K. Daniilidis and M. Spetsakis. Understanding noise sensitivity in structure from motion. In Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, chapter 4, pages 61?88. Lawrence Erlbaum Associates, Hillsdale, NJ, 1997. [35] K. Daniilidis and M. E. Spetsakis. Understanding noise sensitivity in structure from motion. In Y. 
Aloimonos, editor, Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, Advances in Computer Vision, chapter 4. Lawrence Erlbaum Associates, Mahwah, NJ, 1997. [36] J. Davis, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: A unifying framework for depth from triangulation. In 2003 Conference on Computer Vision and Pattern Recognition (CVPR 2003), pages 359?366, June 2003. [37] R. Dawkins. Climbing Mount Improbable. Norton, New York, 1996. [38] D. Demirdjian and T. Darrell. Motion estimation from disparity images. In Proc. International Conference on Computer Vision, volume 1, pages 213?218, July 2001. [39] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Com- putation in Neural Systems, 6(3):345?358, 1995. 201 [40] A.Edelman,T.Arian,andS.Smith. Thegeometryofalgorithmswithorthogonality constraints. SIAM J. Matrix Anal. Appl., 1998. [41] I.A. Essa and A.P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. PAMI, 19:757?763, 1997. [42] H. Farid and E. Simoncelli. Range estimation by optical differentiation. Journal of the Optical Society of America, 15(7):1777?1786, 1998. [43] O. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In Proc. European Conference on Computer Vision, pages 379?393, Freiburg, Germany, 1998. [44] O. D. Faugeras. Three-Dimensional Computer Vision. MIT Press, Cambridge, MA, 1992. [45] O.D. Faugeras, F. Lustman, and G. Toscani. Motion and structure from motion from point and line matches. In Proc. Int. Conf. Computer Vision, pages 25?34, 1987. [46] C. Ferm?uller and Y. Aloimonos. Ambiguity in structure from motion: Sphere ver- sus plane. International Journal of Computer Vision, 28(2):137?154, 1998. [47] C. Ferm?uller and Y. Aloimonos. Observability of 3D motion. International Journal of Computer Vision, 37:43?63, 2000. [48] P. Fua. Regularized bundle-adjustment to model heads from image sequences without calibration data. International Journal of Computer Vision, 38:153?171, 2000. [49] A. Gershun. The light field. Journal of Mathematics and Physics, 12:51?151, 1939. 202 [50] C. Geyer and K. Daniilidis. Catadioptric projective geometry. International Journal of Computer Vision, 43:223?243, 2001. [51] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The lumigraph. In Proceedings of ACM SIGGRAPH 96, Computer Graphics (Annual Conference Series), pages 43? 54, New York, 1996. ACM, ACM Press. [52] M. D. Grossberg and S. K. Nayar. A general imaging model and a method for finding its parameters. In Proc. International Conference on Computer Vision, pages 108?115, 2001. [53] K.J. Hanna and N.E. Okamoto. Combining stereo and motion analysis for direct estimation of scenestructure. In Proc. International Conference on Computer Vision, pages 357?365, 1993. [54] R. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2000. [55] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cam- bridge University Press, Cambridge, UK, 2000. [56] H. Haussecker and B. J?ahne. A tensor approach for precise computation of dense displacement vector fields. In DAGM Symposium, pages 199?208, September 1997. [57] G. Healey and R. Kondepudy. Radiometric ccd camera calibration and noise esti- mation. IEEE Trans. PAMI, 16(3):267?276, 1994. [58] Andrew Hicks. arXiv preprint cs.CV/0303024. 203 [59] R. Andrew Hicks and Ronald K. Perline. Geometric distributions for catadioptric sensor design. In Proc. 
IEEE Conference on Computer Vision and Pattern Recognition, pages 584?589, 2001. [60] H. Hoppe, T. DeRose, T. Duchamp, M. Halstead, H. Jin, J. McDonald, J. Schweitzer, and W. Stuetzle. Piecewise smooth surface reconstruction. In Proc. of ACM SIG- GRAPH, pages 295?302, 1994. [61] B. K. P. Horn. Robot Vision. McGraw Hill, New York, 1986. [62] B. K. P. Horn. Robot Vision. McGraw Hill, New York, 1986. [63] Berthold K.P. Horn. Parallel networks for machine vision. Technical report, MIT A.I. Lab, 1988. [64] J. Huang, A. Lee, and D. Mumford. Statistics of range images. In Proc. IEEE Con- ference on Computer Vision and Pattern Recognition, 2000. [65] A. Hubeli and M. Gross. A survey of surface representations for geometric mod- eling. Technical Report 335, ETH Z?urich, Institute of Scientific Computing, March 2000. [66] F. Huck, C. Fales, and Z. Rahman. Visual Communication. Kluwer, Boston, 1997. [67] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Proc. International Conference on Computer Vision, Corfu, Greece, 1999. [68] B. J?ahne, J. Haussecker, and P. Geissler, editors. Handbook on Computer Vision and Applications. Academic Press, Boston, 1999. 204 [69] H. W. Jensen, S. R. Marschner, M. Levoy, and P. Hanrahan. A practical model for subsurface light transport. In Proc. of ACM SIGGRAPH, 2001. [70] A. D. Jepson and D. J. Heeger. Subspace methods for recovering rigid motion II: Theory. Technical Report RBCV-TR-90-36, University of Toronto, 1990. [71] H. Jin, S. Soatto, and A. J. Yezzi. Multi-view stereo beyond lambert. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 171?178, June 2003. [72] J. T. Kajiya. The rendering equation. In Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pages 143?150. ACM Press, 1986. [73] I.A.KakadiarisandD.Metaxas. Three-dimensionalhumanbodymodelacquisition from multiple views. International Journal of Computer Vision, 30:191?218, 1998. [74] G. Kamberova. Understanding the systematic and random errors in video sensor data. [75] L. Kobbelt, T. Bareuther, and H.-P. Seidel. Multiresolution shape deformations for meshes with dynamic vertex connectivity. In Computer Graphics Forum 19 (2000), Eurographics ?00 issue, pages 249?260, 2000. [76] C. Kolb, D. Mitchell, and P. Hanrahan. A realistic camera model for computer graphics. In Proc. of ACM SIGGRAPH, pages 317?324, 1995. [77] J. Kosecka, Y. Ma, and S. Sastry. Optimization criteria, sensitivity and robustness of motion and structure estimation. In Vision Algorithms Workshop, ICCV, 1999. [78] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38:199?218, 2000. 205 [79] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. PAMI, 16:150?162, 1994. [80] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of ACM SIG- GRAPH 96, Computer Graphics (Annual Conference Series), pages 161?170, New York, 1996. ACM, ACM Press. [81] Y. Liu and T. S. Huang. Estimation of rigid body motion using straight line corre- spondences. CVGIP: Image Understanding, 43(1):37?52, 1988. [82] B. London, J. Upton, K. Kobre, and B. Brill. Photography. Prentice Hall Inc., Upper Saddle River, NJ, 7 edition, 2002. [83] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133?135, 1981. [84] H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal im- age. Proc. R. Soc. 
London B, 208:385?397, 1980. [85] C. Loop. Smooth subdivision surfaces based on triangles. Master?s thesis, Univer- sity of Utah, 1987. [86] Y.Ma, K.Huang, R.Vidal, J.Kosecka, andS.Sastry. Rankconditiononthemultiple view matrix. International Journal of Computer Vision, 59(2), 2004. [87] Y. Ma, J. Kosecka, and S. Sastry. Optimization criteria and geometric algorithms for motion and structure estimation. International Journal of Computer Vision, 44(3):219? 249, 2001. [88] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer Verlag, 2003. 206 [89] Y. Ma, R. Vidal, S. Hsu, and S. Sastry. Optimal motion from multiple views by normalized epipolar constraints. Communications in Information and Systems, 1(1), 2001. [90] S. Malassiotis and M.G. Strintzis. Model-based joint motion and structure estima- tion from stereo images. Computer Vision and Image Understanding, 65:79?94, 1997. [91] C. Mandal, H. Qin, and B.C. Vemuri. Physics-based shape modeling and shape recovery using multiresolution subdivision surfaces. In Proc. of ACM SIGGRAPH, 1999. [92] S. Mann. Pencigraphy with acg: joint parameter estimation in both domain and range of functions in same orbit of projective-wyckoff group. Technical Report 384, MIT Media Lab, December 1994. also appears in: Proceedings of the IEEE International Conference on Image Processing (ICIP?96), Lausanne, Switzerland, September 16?19, 1996, pages 193?196. [93] S. Mann and S. Haykin. The chirplet transform: Physical considerations. IEEE Trans. Signal Processing, 43:2745?2761, November 1995. [94] W. Matusik, C. Buehler, S. J. Gortler, R. Raskar, and L. McMillan. Image based visual hulls. In Proc. of ACM SIGGRAPH, 2000. [95] S. J. Maybank. The angular velocity associated with the optical flowfield arising from motion through a rigid environment. Proc. R. Soc. London A, 401:317?326, 1985. [96] S. J. Maybank. Algorithm for analysing optical flow based on the least-squares method. Image and Vision Computing, 4:38?42, 1986. 207 [97] P. Meer. Robust techniques for computer vision. In G. Medioni and S. B. Kang, editors, Emerging Topics in Computer Vision. Prentice Hall, 2004. [98] P. Moon and D.E. Spencer. The Photic Field. MIT Press, Cambridge, 1981. [99] T. Naemura, T. Yoshida, and H. Harashima. 3-d computer graphics based on inte- gral photography. Optics Express, 10:255?262, 2001. [100] S. Nayar. Catadioptric omnidirectional camera. In Proc. IEEE Conference on Com- puter Vision and Pattern Recognition, pages 482?488, 1997. [101] S. K. Nayar and V. Branzoi. Adaptive dynamic range imaging: Optical control of pixel exposures over space and time. In Proc. International Conference on Computer Vision, Nice, France, 2003. [102] S. K. Nayar, V. Branzoi, and T. Boult. Programmable imaging using a digital mi- cromirror array. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2004. [103] J. Neumann and C. Ferm?uller. Plenoptic video geometry. Visual Computer, 19(6):395?404, 2003. [104] J. Neumann, C. Ferm?uller, and Y. Aloimonos. Eyes from eyes: New cameras for structure from motion. In IEEE Workshop on Omnidirectional Vision 2002, pages 19? 26, 2002. [105] J. Neumann, C. Ferm?uller, and Y. Aloimonos. Polydioptric camera design and 3d motion estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recogni- tion, volume II, pages 294?301, 2003. 208 [106] J. Neumann, C. Ferm?uller, Y. Aloimonos, and V. Brajovic. Compound eye sensor for 3d ego motion estimation. 
In accepted for presentation at the IEEE International Conference on Robotics and Automation, 2004. [107] B. Newhall. Photosculpture. Image, 7(5):100?105, 1958. [108] F.E.Nicodemus,J.C.Richmond,J.J.Hsia,L.W.Ginsberg,andT.Limperis. Geometric Considerations and Nomenclature for Reflectance. National Bureau of Standards (US), 1977. [109] Editors of Time Life, editor. Light and Film. Time Inc., 1970. [110] T. Okoshi. Three-dimensional Imaging Techniques. Academic Press, 1976. [111] J. Oliensis. The error surface for structure from motion. Neci tr, NEC, 2001. [112] T. Pajdla. Stereo with oblique cameras. International Journal of Computer Vision, 47(1/2/3):161?170, 2002. [113] S. Peleg, B. Rousso, A. Rav-Acha, andA. Zomet. Mosaicing onadaptive manifolds. IEEE Trans. on PAMI, pages 1144?1154, October 2000. [114] M. Peternell and H. Pottmann. Interpolating functions on lines in 3-space. In Ch. Rabut A. Cohen and L.L. Schumaker, editors, Curve and Surface Fitting: Saint Malo 1999, pages 351?358. Vanderbilt Univ. Press, Nashville, TN, 2000. [115] R. Pl?ankers and P. Fua. Tracking and modeling people in video sequences. Interna- tional Journal of Computer Vision, 81:285?302, 2001. [116] R. Pless. Using many cameras as one. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 587?593, 2003. 209 [117] M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In ICCV, pages 90? 95, 1998. [118] H. Pottmann and J. Wallner. Computational Line Geometry. Springer Verlag, Berlin, 2001. [119] K. Prazdny. Egomotion and relative depth map from optical flow. Biological Cyber- netics, 36:87?102, 1980. [120] P. Rademacher and G. Bishop. Multiple-center-of-projection images. In Proceed- ings of ACM SIGGRAPH 98, ComputerGraphics(AnnualConferenceSeries), pages 199?206, New York, NY, 1998. ACM, ACM Press. [121] W.Reichardt. Movementperceptionininsects. InWernerReichardt,editor,Process- ing of Optical Information by Organisms and Machines, pages 465?493. Academic Press, 1969. [122] W. Reichardt. Evaluation of optical motion information by movement detectors. Journal of Comparative Physiology A, 161:533?547, 1987. [123] W. Reichardt and R. W. Schl?ogl. A two dimensional field theory for motion com- putation. Biological Cybernetics, 60:23?35, 1988. [124] J. P. Richter, editor. The Notebooks of Leonardo da Vinci, volume 1, p.39. Dover, New York, 1970. [125] M Rioux. Laser range finder based on synchronized scanners. Applied Optics, 23(21):3837?3844, 1984. 210 [126] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two- frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):17?42, 2002. [127] Yoav Y. Schechner and Nahum Kiryati. Depth from defocus vs. stereo: How differ- ent really are they? International Journal of Computer Vision, 89:141?162, 2000. [128] P. Schr?oder and D. Zorin. Subdivision for modeling and animation. Siggraph 2000 Course Notes, 2000. [129] S. Seitz. The space of all stereo images. In Proc. International Conference on Computer Vision, pages 307?314, 2001. [130] S. Seitz and C. Dyer. Photorealistic scene reconstruction by voxel coloring. Interna- tional Journal of Computer Vision, 25, November 1999. [131] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 593 ? 600, 1994. [132] H.Y. Shum, A. Kalai, and S. M. Seitz. Omnivergent stereo. In Proc. 
International Conference on Computer Vision, pages 22?29, 1999. [133] D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. In Proceed- ings IEEE Conf. on Computer Vision and Pattern Recognition, 2000. [134] M. E. Spetsakis and Y. Aloimonos. Structure from motion using line correspon- dences. International Journal of Computer Vision, 4:171?183, 1990. [135] M. E. Spetsakis and Y. Aloimonos. A multi-frame approach to visual motion per- ception. International Journal of Computer Vision, 6(3):245?255, 1991. 211 [136] H. Spies, B. J?ahne, and J.L. Barron. Regularised range flow. In Proc. European Con- ference on Computer Vision, June 2000. [137] J. Stam. Evaluation of loop subdivision surfaces. SIGGRAPH?99 Course Notes, 1999. [138] G.P. Stein and A. Shashua. Model-based brightness constraints: on direct estima- tion of structure and motion. IEEE Trans. PAMI, 22(9):992?1015, 2000. [139] G.W. Stewart. Stochastic perturbation theory. SIAM Review, 32:576?610, 1990. [140] C. Strecha and L. Van Gool. Motion-stereo integration for depth estimation. In ECCV, volume 2, pages 170?185, 2002. [141] R. Swaminathan, M. D. Grossberg, and S. K. Nayar. Framework for designing catadioptric imaging and projection systems. In International Workshop on Projector- Camera Systems, ICCV 2003, 2003. [142] J.TannerandC.Mead. Anintegratedanalogopticalmotionsensor. InR.W.Broder- sen and H.S. Moscovitz, editors, VLSI Signal Processing, volume 2, pages 59?87. IEEE, New York, 1988. [143] G. Taubin. A signal processing approach to fair surface design. In Proc. of ACM SIGGRAPH, 1995. [144] P. Th?evenaz, T. Blu, and M. Unser. Interpolation revisited. IEEE Transactions on Medical Imaging, 19(7):739?758, July 2000. [145] T. Tian, C. Tomasi, and D. Heeger. Comparison of approaches to egomotion com- putation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 315?320, June 1996. 212 [146] C. Tomasi and T. Kanade. Shape and motion from image streams under orthogra- phy: A factorization method. International Journal of Computer Vision, 9(2):137?154, 1992. [147] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjust- ment ? a modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in LNCS, pages 298?373, Corfu, Greece, September 1999. Springer-Verlag. [148] R.Y.TsaiandT.S.Huang. Uniquenessandestimationofthree-dimensionalmotion parameters of rigid objects with curved surfaces. IEEE Trans. PAMI, 6(1):13?27, 1984. [149] Y. Tsin, V. Ramesh, and T. Kanade. Statistical calibration of ccd imaging process. In Proc. International Conference on Computer Vision, volume 1, pages 480?487, 2001. [150] M. Unser and A. Aldroubi. A general sampling theory for nonideal acquisition devices. IEEE Transactions on Signal Processing, 42(11):2915?2925, November 1994. [151] D. Van De Ville, T. Blu, and M. Unser. Recursive filtering for splines on hexagonal lattices. InProceedingsoftheTwenty-EighthIEEEInternationalConferenceonAcoustics, Speech, and Signal Processing (ICASSP?03), volume III, pages 301?304, 2003. [152] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In Proc. International Conference on Computer Vision, Corfu,Greece, September 1999. 213 [153] S. Vedula, S. Baker, S. Seitz, and T. Kanade. Shape and motion carving in 6d. In Proc.IEEEConferenceonComputerVisionandPatternRecognition,HeadIsland,South Carolina, USA, June 2000. [154] S. Vedula, P. Rander, H. Saito, and T. Kanade. 
Modeling, combining, and rendering dynamic real-world events from image sequences. In Proc. of Int. Conf. on Virtual Systems and Multimedia, November 1998. [155] R. Vidal and S. Sastry. Segmentation of dynamic scenes from image intensities. In IEEE Workshop on Vision and Motion Computing, pages 44?49, 2002. [156] G. Ward. Measuring and modeling anisotropic reflection. In Proc. of ACM SIG- GRAPH, volume 26, pages 265?272, 1992. [157] J. Weng, T.S. Huang, and N. Ahuja. Motion and Structure from Image Sequences. Springer-Verlag, 1992. [158] B. Wilburn, M. Smulski, H.-K. Lee, and M. Horowitz. The light field video camera. In Proceedings of Media Processors. SPIE Electronic Imaging, 2002. [159] D. N. Wood, A. Finkelstein, J. F. Hughes, C. E. Thayer, and D. H. Salesin. Multiper- spective panoramas for cel animation. Proc. of ACM SIGGRAPH, pages 243?250, 1997. [160] G.YoungandR.Chellappa. 3-dmotionestimationusingasequenceofnoisystereo images: Models, estimation, and uniqueness results. IEEE Trans. PAMI, 12(8):735? 759, 1990. [161] Jingyi Yu and Leonard McMillan. General linear cameras. In 8th European Confer- ence on Computer Vision, ECCV 2004, Prague, Czech Republic, 2004. 214 [162] C. Zhang and T. Chen. Spectral analysis for sampling image-based rendering data. IEEE Trans. on Circuits and Systems for Video Technology: Special Issue on Image-based Modeling, Rendering and Animation, 13:1038?1050, Nov 2003. [163] C. Zhang and T. Chen. A survey on image-based rendering - representation, sam- pling and compression. EURASIP Signal Processing: Image Communication, 19(1):1? 28, 2004. [164] L.Zhang,B.Curless,andS.M.Seitz. Spacetimestereo: Shaperecoveryfordynamic scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 367?374, 2003. [165] Y. Zhang and C. Kambhamettu. Integrated 3d scene flow and structure recovery from multiview image sequences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages II:674?681, Hilton Head, 2000. [166] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. International Journal of Image and Vision Computing, 15(1):59?76, 1997. [167] W. Zhao, R. Chellappa, J. Phillips, and A. Rosenfeld. Face recognition in still and video images: A literature surrey. ACM Computing Surveys, pages 399?458, 2003. [168] A. Zomet, D. Feldman, S. Peleg, and D. Weinshall. Mosaicing new views: The crossed-slits projection. IEEE Trans. PAMI, pages 741?754, 2003. [169] D. Zorin, P. Schr?oder, and W. Sweldens. Interactive multiresolution mesh editing. In Proc. of ACM SIGGRAPH, 1997. 215