ABSTRACT

Title of dissertation: THE COMPOSITIONAL CHARACTER OF VISUAL CORRESPONDENCE
Abhijit S. Ogale, Doctor of Philosophy, 2004
Dissertation directed by: Professor Yiannis Aloimonos, Department of Computer Science

Given two images of a scene, the problem of finding a map relating the points in the two images is known as the correspondence problem. Stereo correspondence is a special case in which corresponding points lie on the same row in the two images; optical flow is the general case. In this thesis, we argue that correspondence is inextricably linked to other problems such as depth segmentation, occlusion detection and shape estimation, and cannot be solved in isolation without solving each of these problems concurrently within a compositional framework. We first demonstrate the relationship between correspondence and segmentation in a world devoid of shape, and propose an algorithm based on connected components which solves these two problems simultaneously by matching image pixels. Occlusions are found by using the uniqueness constraint, which forces one pixel in the first image to match exactly one pixel in the second image. Shape is then introduced into the picture, and it is revealed that a horizontally slanted surface is sampled differently by the two cameras of a stereo pair, creating images of different width. In this scenario, we show that pixel matching must be replaced by interval matching, to allow intervals of different width in the two images to correspond. A new interval uniqueness constraint is proposed to detect occlusions. Vertical slant is shown to have a qualitatively different character than horizontal slant, requiring the use of vertical consistency constraints based on non-horizontal edges. Complexities which arise in optical flow estimation in the presence of slant are also examined. For greater robustness and flexibility, the algorithm based on connected components is generalized into a diffusion-like process, which allows the use of new local matching metrics which we have developed in order to create contrast-invariant and noise-resistant correspondence algorithms. Ultimately, it is shown that temporal information can be used to assign correspondences to occluded areas, which also yields ordinal depth information about the scene, even in the presence of independently moving objects. This information can be used for motion segmentation to detect new types of independently moving objects, which are missed by state-of-the-art methods.

THE COMPOSITIONAL CHARACTER OF VISUAL CORRESPONDENCE

by

Abhijit Satishchandra Ogale

Thesis submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2004

Advisory Committee:
Professor Yiannis Aloimonos, Chair
Professor Larry Davis
Professor Steven Zucker (Yale University)
Professor David Jacobs
Professor Cynthia Moss, Dean's representative

© Copyright by Abhijit Satishchandra Ogale 2004

Acknowledgements

I would like to thank my advisor, Professor Yiannis Aloimonos, for giving me an opportunity to explore the exciting world of vision. His deep understanding of the issues in vision, genuine scientific curiosity, and open-mindedness towards new ideas have gone a long way in making my research experience a rewarding one. I thank him for encouraging me and giving me the freedom to explore even the wildest of ideas. His warm personality has made working with him a very pleasant experience.
I would also like to thank Dr. Cornelia Fermüller for discussing various problems from time to time and sharing her valuable insights. My fellow graduate students and friends Jan Neumann, Patrick Baker, Hui Ji and Gutemberg Bezerra have all contributed towards creating an interactive and pleasant atmosphere in the lab. I would like to thank the members of my examining committee, Professor Larry Davis, Professor David Jacobs, Professor Steven Zucker and Professor Cynthia Moss for their comments and suggestions for improving this thesis. I would also like to thank the late Professor Azriel Rosenfeld, who was a member of my proposal committee, for his help and advice in finding valuable references. I thank the members of the staff at the Center for Automation Research, UMIACS and the Department of Computer Science for their help with many administrative matters.

This thesis is dedicated to my parents for their unending love and support, without which none of this would be possible. My father's academic achievements have been a great source of inspiration over the years, and I am grateful to him for introducing me to the pleasures of scientific curiosity and understanding. My mother has played a pivotal role in shaping my career and shaping me as a person, and I owe her a great debt of gratitude. I would also like to thank my sister Deepti and my grandparents for their support and encouragement. Last but not the least, I thank my wife Neeti.

Contents

List of Figures ..... vi
List of Tables ..... xii

1 Introduction ..... 1
  1.1 Modularity vs. Compositionality ..... 3
  1.2 Organization: a road map of the thesis ..... 6

2 Correspondence and Segmentation in Flatland ..... 8
  2.1 Previous work ..... 8
  2.2 Flatland ..... 9
  2.3 Chicken-and-egg problems ..... 10
  2.4 Connected matching regions ..... 11
  2.5 Boundaries and connectivity maximization ..... 12
  2.6 Vertical connectivity ..... 13
  2.7 Occlusions and uniqueness ..... 14
  2.8 An algorithm for Flatland ..... 16
  2.9 Experimental Results ..... 16

3 Shape and Correspondence ..... 20
  3.1 Horizontal slant ..... 21
    3.1.1 Unequal projection lengths and interval matching ..... 21
    3.1.2 Slant affects sampling ..... 22
    3.1.3 Occlusions and the new interval uniqueness constraint ..... 24
    3.1.4 An algorithm to deal with horizontal slant ..... 27
  3.2 Vertical slant ..... 29
    3.2.1 Fundamental differences between vertical and horizontal slant ..... 29
    3.2.2 Vertical connectivity and non-horizontal edges ..... 31
    3.2.3 Cue integration along the vertical direction ..... 31
  3.3 Revisiting the principle of minimum segmentation ..... 34
  3.4 Shape and motion: problems with optical flow ..... 35
  3.5 Experimental results ..... 35

4 Connectivity and Diffusion ..... 41
  4.1 Connectivity becomes diffusion ..... 41
  4.2 Fast diffusion along scanlines ..... 42
  4.3 An algorithm for matching by using diffuse connectivity ..... 45
  4.4 Experimental results ..... 46

5 Contrast Invariance ..... 48
  5.1 Local linear fitting - a first step towards contrast invariance ..... 48
  5.2 Contrast invariance in biological vision ..... 51
  5.3 Multiple spatial frequency channels ..... 55

6 Occlusions and ordinal depth ..... 61
  6.1 Why occlusions must be filled? ..... 61
  6.2 Occlusion filling (rigid scene, no independently moving objects) ..... 62
  6.3 Generalized occlusion filling (in the presence of moving objects) ..... 63
  6.4 Experiments ..... 64

7 Motion segmentation ..... 66
  7.1 Previous work ..... 66
  7.2 Ambiguities in 3D motion estimation ..... 68
  7.3 Types of independently moving objects ..... 70
    7.3.1 Class 1: 3D motion based clustering ..... 70
    7.3.2 Class 2: Ordinal depth conflict between occlusions and structure from motion ..... 70
    7.3.3 Class 3: Cardinal depth conflict ..... 72
  7.4 Motion based clustering ..... 73
  7.5 Algorithm summary ..... 74
  7.6 Experimental results ..... 74
  7.7 Summary ..... 78

8 Conclusion and Future Work ..... 79

A Phase correlation ..... 84
  A.1 Basic process ..... 84
  A.2 Logpolar coordinates ..... 84
  A.3 Four parameter estimation ..... 85
  A.4 Results ..... 86

B Design of a mobile Argus Eye with nine synchronized cameras ..... 88
  B.1 Introduction ..... 88
  B.2 Project Requirements ..... 88
  B.3 Hardware Configuration ..... 89
    B.3.1 Cameras ..... 89
    B.3.2 Computer configuration ..... 90
    B.3.3 Sync Network ..... 91
  B.4 Why Linux? ..... 91
  B.5 Capture Software design ..... 92
    B.5.1 Hardware detection and initialisation ..... 92
    B.5.2 Trigger ..... 93
    B.5.3 Grabber ..... 94
    B.5.4 Writer ..... 94
    B.5.5 User input ..... 95
    B.5.6 Utilities for focusing, calibration, viewing and frame extraction ..... 95
  B.6 Applications and extensions ..... 97

Bibliography ..... 99

List of Figures

1.1 Top row: stereo images of a scene with slanted untextured surfaces. Bottom row: stereo pair with mismatched contrast ..... 2
1.2 Two images at successive time instants of a car emerging from behind a truck. All the objects including the background move to the left. ..... 3
2.1 Top row: Left image, Right image, True disparity (black denotes 0, gray denotes 2). Bottom row: Absolute intensity difference of left and right images for horizontal shift Δx = 0, 1, 2 ..... 11
2.2 Vertical connectivity must not be established across horizontal edges ..... 14
2.3 An Algorithm for Flatland. Besides these steps, the uniqueness constraint is enforced to find occlusions. ..... 15
2.4 Top row: left images from four stereo pairs tsukuba, sawtooth, venus and map. Second row: true disparity maps. Third row: our results with occlusions filled in as required by the Scharstein and Szeliski evaluation. Bottom row: the detected occlusions are shown separately. ..... 17
2.5 Top row: two frames of the table-vase sequence. Second row: computed optical flow field. ..... 18
2.6 First and second rows: Two frames of the yosemite sequence and the computed optical flow. Third and fourth rows: Two frames of the sofa sequence and the computed optical flow. ..... 19
3.1 (a) unequal projection lengths of a horizontally slanted line (b) equal projection lengths of a fronto-parallel line ..... 21
3.2 Sampling problem for a horizontally slanted line ..... 23
3.3 The modified uniqueness constraint operates by preserving a one-to-one correspondence between intervals on the left and right scanlines, instead of pixels. ..... 24
3.4 Stretch and match ..... 25
3.5 Top: stereo pair of images. Bottom: Corresponding intervals on the left and right scanlines can have different length. The order (left-right) of matching intervals can also change (see the blue and gray intervals). ..... 26
3.6 Top: Horizontal changes in horizontal disparity due to a discontinuity create an occlusion. Middle: Horizontal changes in horizontal disparity due to horizontal slant lead to stretching/shrinking but no occlusions. Bottom: Vertical changes in horizontal disparity due to discontinuity and vertical slant cannot be distinguished (since no occlusions occur in either case). ..... 30
3.7 Top: images of a vertically slanted plane. Middle: images overlaid to maximize overlap. Bottom: area of largest overlap ..... 32
3.8 Vertical connections between pixels are established only along non-horizontal edges ..... 33
3.9 Stereo: disparity difference on two sides of (a) horizontal edge, and (b) vertical edge. Optical flow: (c) components of optical flow parallel and perpendicular to an edge. (d) change in parallel flow component across the edge. (e) change in perpendicular flow component across the edge. ..... 36
3.10 Columns 1 to 4: Left image, right image, graph cuts result for the left disparity map, our result for left disparity map. Row 1: horizontally slanted object, Row 2: vertically slanted object. Occlusions are shown in red. ..... 37
3.11 Visual problems such as correspondence, depth segmentation, and shape estimation must be solved simultaneously ..... 38
3.12 Top row (Left frames), Middle row (ground truth), Bottom row (our results). Occlusions were filled in as required by the evaluation procedure. ..... 39
3.13 Top row: tree stereo pair and disparity map. Bottom row: Corridor stereo pair with the disparity map and occlusions (blue regions). Note the disparity variation for the left and right walls of the corridor. ..... 39
4.1 Connected components is a special case of diffusion. On the left (A to E), we show how a measure of the influence of matching pixels on each other is computed using diffusion. On the right, we see how the same measure is obtained using connected component sizes. Note that M(x) is used to denote M(x, Δx), C(x) for C(x, Δx), and so on, for the sake of brevity. ..... 44
4.2 Top row: left images from four stereo pairs tsukuba, sawtooth, venus and map. Second row: true disparity maps. Third row: results of scanline diffusion with occlusions filled in as required by the Scharstein and Szeliski evaluation. Bottom row: the detected occlusions are shown separately. ..... 47
5.1 Top row: the local intensity around a pixel in the left image and the right image is shown. The plot on the extreme right is an intensity-intensity plot, which is a straight line with positive slope, indicating a match. Middle row: another pair of left and right intensity profiles with similar variations also yields an intensity-intensity plot with positive slope. Bottom row: a mismatched pair of intensity profiles is shown. ..... 50
5.2 (a) and (b) denote a left and right image from the map sequence with different contrasts. (c) shows the result of the linear fit algorithm. (d) and (e) are a stereo pair of images from the tree sequence, and (f) shows the resulting disparity map. Note that occlusions are colored white in the disparity maps. ..... 52
5.3 (a) and (b) denote a left and right image from the pentagon sequence, with only a part of the right image having different contrast. (c) shows the result of the linear fit algorithm. (d) and (e) are a random dot stereo pair of images, with a quadratic variation in contrast across the left image, (f) shows the resulting disparity map. Occlusions are colored white in the disparity maps. ..... 53
5.4 Row 1: Tsukuba stereo pair with a quadratic contrast variation across the left image. The disparity map is shown on the right. Row 2: Sawtooth stereo pair with different image contrasts. Row 3: Random dot pair with a Gaussian contrast variation across the left image. Row 4: Leopard stereo pair with different contrast in a square patch in the right image. ..... 59
5.5 (a) Left image from the map sequence. (b) Right image with lower contrast and the addition of noise in the high frequency channel. The noise causes up to 25% variation in the original intensity. (c) shows a portion of a scanline in the right image, where the solid line shows the intensity values before addition of the noise, and the dotted line shows the values after addition of the noise. (d) shows the results of the linear fit algorithm. (e) shows the results of the multiple frequency channel algorithm. ..... 60
6.1 If the occluded region belongs to R_1, then R_1 is behind R_2, and vice versa. ..... 62
6.2 Occlusion Filling: from left (a) to (c). Gray regions indicate occlusions (portions which disappear in the next frame) ..... 63
6.3 Generalized occlusion filling and ordinal depth estimation. (a) Three frames of a video sequence. The yellow region which is visible in F_1 and F_2 disappears behind the tree in F_3. (b) Forward and reverse flow (only the x-components are shown). Occlusions are colored white. (c) Occlusions in u_23 are filled using the segmentation of u_21. Note that the white areas have disappeared. (d) Deduced ordinal depth relation. In a similar manner, we can also fill occlusions in u_21 using the segmentation of u_23 to deduce ordinal depth relations for the right side of the tree. ..... 65
7.1 Motion valley (red) visualized as an error surface in the 2D space of directions of translation. The error is found after finding the optimal rotation and structure for each translation direction. ..... 69
7.2 Toy examples of three classes of moving objects. In each case, the independently moving object is red colored. Portions of objects which disappear in the next frame (i.e. occlusions) are shown in a dashed texture. ..... 71
7.3 Class 1: (a,b,c) show three frames of the teabox sequence. (d,e) show X and Y components of the optical flow using frames (b) and (c). Occlusions are colored white. (f) shows the computed motion valley for the background. (g) shows the cosine of the angular error between the reprojected flow (using the background motion) and the true flow. (h) shows detected Class 1 moving object after thresholding angular error (greater than 45 degrees). ..... 75
7.4 Class 2: (a,b,c) show three frames F_1, F_2, F_3 of the leopardA sequence. (d) shows the X-component of the optical flow u_21 from frame F_2 to F_1. The Y-component (not shown) is zero. White regions denote occlusions. (e) shows the X-component of the optical flow u_23 from frame F_2 to F_3. The Y-component (not shown) is zero. (f) shows an example of ordinal depth (green in front, red behind) obtained by filling u_21 using the segmentation of u_23. (g) shows another example of ordinal depth (green in front, red behind) obtained by filling u_23 using the segmentation of u_21. (h) shows the Class 2 moving object detected using the ordinal depth conflict. (i) shows the computed motion valley. (j) shows the structure from motion, which puts the leopard in front of the red box, whereas occlusions (see (g)) show that the leopard is behind the box. ..... 76
7.5 Class 3: (a,b,c) show three frames F_1, F_2, F_3 of the leopardB sequence. (d) shows computed motion valley. (e,f) show X and Y components of the flow u_23 between F_2 and F_3. White regions denote occlusions. (g) shows inverse depth from motion. (h) shows 3D structure from motion. (p,q) show rectified stereo pair of images. (q) is the same as (b). (r) shows inverse depth from stereo. (s) shows 3D structure from stereo. Compare (s) with (h) to see how the background objects appear closer to the leopard in (s) than in (h). (x) shows the histogram of depth ratios and clusters detected by k-means (k=3). (y) shows cluster labels: cluster 2 (yellow) is the background, cluster 3 (red) is the leopard, cluster 1 (light blue) is mostly due to errors in the disparity. (z) shows the moving objects of Class 3 (clusters other than 2). ..... 77
A.1 An example of a peak generated by the phase correlation method. ..... 85
A.2 An image in Cartesian coordinates (left) and its logpolar representation (right). ..... 86
A.3 Top row shows two input images I_1 and I_2. Image I_2 was created from I_1 by rotating by 5 degrees, scaling by a factor of 1.2, and translating by (-10,20) pixels. Bottom row: The left image shows image I'_2 obtained by unwarping I_2 using the results of the phase correlation. The right hand side shows the absolute intensity difference between I_1 and the unwarped image I'_2 to reveal the accuracy of the registration. ..... 87
B.1 Sample INI file for 2 cameras ..... 96
B.2 Argus eye system being used outdoors to collect data. The cameras are mounted on the octahedral frame, and carried around by a person who stands in the middle holding the frame. ..... 98

List of Tables

3.1 Performance comparison from the Middlebury Stereo Vision Page (overall rank is 6th among 28 algorithms). The table shows only the top ten algorithms. Error percentages and rank (in brackets) in each column are shown. ..... 40
4.1 Performance of scanline diffusion. Numbers indicate percentage of bad pixels overall, in untextured regions and at discontinuities. Numbers in brackets indicate rank in each column (overall rank is 14). ..... 46

Chapter 1

Introduction

The field of Computational Vision has experienced tremendous growth in the past thirty years.
This was due in part to the wide range of applications that may be realized if one understands to some degree how vision works, and in part to the desire to understand how the brains of biological systems are designed. The field has seen several novel theories regarding the solution to core problems such as stereo matching [SS02], computation of image motion (optical flow) [BFB94, BB95, MB96, LHH+98, GMN+98], shape from texture [BL76, Wit81, Alo88, BA89, SB95], structure from motion [Fau93, HN94, MSKS04], and so on (the references given here are mostly surveys and books; see thesis chapters for detailed references). There has also been significant progress in understanding the geometric constraints underlying multiview vision [HZ00].

Despite the tremendous progress, we are still unable to effectively deal with a variety of inputs which are encountered in the real world. For example, although the field has advanced a great deal in the estimation of image motion, state-of-the-art algorithms have difficulty finding occlusions, i.e. places in the scene that were visible at some instant of time and became invisible at the next time instant, or vice versa. Similarly, many stereo algorithms fail when presented with an input where there are few features (i.e. large untextured areas) or strongly slanted surfaces (e.g. the corridor walls in the top row of Figure 1.1), when the contrast of one image differs significantly from the other (e.g. bottom row of Figure 1.1), or when noise is present in one or more frequency channels. Motion segmentation algorithms, i.e. techniques for finding independently moving objects in a video taken by a camera moving in an unrestricted manner, fail miserably when the background motion is similar to the independent motion.

One may consider these cases as exceptional and hardly an obstacle to the realization of practical robust systems. This is, however, not true. Consider the following application of extreme relevance to the automotive industry. Imagine that a pair of cameras (a stereo system) is installed on a car with the goal of finding the spatial layout of the environment in front of the car, as well as the location of independently moving objects, i.e. other vehicles or humans and animals. Imagine that we are driving towards the north in the morning hours, with the sun to our left. The stereo pair of images that the system will acquire will probably be such that the left image will be much brighter than the right one. Unequal contrast may also result from differences in aperture and exposure for the two cameras. Matters are complicated further by the presence of large untextured slanted regions such as roads, walls of buildings or surfaces of other vehicles. Most stereo algorithms will simply fail in this case.

Figure 1.1: Top row: stereo images of a scene with slanted untextured surfaces. Bottom row: stereo pair with mismatched contrast

Imagine further (see Figure 1.2) that as we drive along, on the periphery of the left camera which is not visible in the right camera, a car appears from behind a truck, both moving in the opposite direction to our own. Existing motion segmentation algorithms will miss such a moving object, classifying it as part of the background, which also moves in the same direction. Thus, it becomes essential to reexamine basic visual processes that lie at the heart of low or intermediate level vision.
These are the well-known processes of image correspondence, including stereo matching and optical flow, shape from X, structure from motion, motion segmentation, and the like. They are the processes responsible for creating descriptions of the surfaces surrounding us, their boundaries and discontinuities, as well as descriptions of the movements of the observer or of other objects. These are descriptions of a non-cognitive nature, and their recovery depends on the appropriate utilization of the underlying geometry and physics. The central question is: what are the major issues which are preventing us from making these processes more robust and accurate?

Figure 1.2: Two images at successive time instants of a car emerging from behind a truck. All the objects including the background move to the left.

1.1 Modularity vs. Compositionality

Given a complex multifaceted problem, we often use a modular approach which breaks up the larger problem into multiple modules or black boxes which are then connected together to solve the original problem. Like much of engineering, contemporary computer vision often works in such a modular manner. Consider the example of structure from motion. We have as input at least two images of a scene taken from two different viewpoints (e.g. a video taken by a moving camera), and would like to develop a three-dimensional model of the scene in view. First, the images are matched, by solving the correspondence problem. After this, the 3D camera motion (or the rigid transformation between the views) is estimated using the correspondences and geometric constraints. After this step, one can place the cameras in the world, i.e. we know the viewpoints from which the images were taken. The last step amounts to a triangulation: knowing the correspondence and the rigid transformation between the views, we can compute the location of each point of the scene in view and build a 3D model of the scene structure. One notices immediately that the processing has a modular character, and information is passed from one module to the next in a feedforward manner.
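To make the feedforward character of this pipeline concrete, here is a minimal sketch in Python using modern OpenCV (illustrative tooling only, not anything from this thesis). It assumes grayscale input images and a known calibration matrix K, and simply chains the three modules, with each stage consuming only the output of the previous one.

```python
import cv2
import numpy as np

def modular_sfm(img1, img2, K):
    """Feedforward structure from motion: correspondence -> camera motion ->
    triangulation. Each module consumes only the previous module's output."""
    # Module 1: correspondence (sparse, via KLT feature tracking).
    pts1 = cv2.goodFeaturesToTrack(img1, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)
    pts2, status, _err = cv2.calcOpticalFlowPyrLK(img1, img2, pts1, None)
    pts1, pts2 = pts1[status == 1], pts2[status == 1]

    # Module 2: rigid transformation (R, t) between the views, estimated
    # from the correspondences and the epipolar geometric constraint.
    E, _mask = cv2.findEssentialMat(pts1, pts2, K)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Module 3: triangulation against the two now-known viewpoints.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (X[:3] / X[3]).T  # 3D points, up to scale

```

Note that nothing flows backwards: errors made by the matcher propagate unchecked into the motion estimate and the triangulated structure, which is exactly the weakness of the modular view examined next.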
Consider now the same problem in the presence of independently moving objects, the so-called motion segmentation problem. Finding the scene structure requires that we know the camera motion (the egomotion). Computing the camera motion requires us to first find the background, i.e. parts of the scene which are not independently moving. Finding the background, however, is equivalent to locating the independently moving objects, which cannot be done without first knowing the camera motion and the structure. The three problems of egomotion estimation, structure estimation and moving object detection are completely entangled in a chicken-and-egg loop.

Now let us focus our attention on the correspondence problem, which is required to solve the motion problem mentioned above. Local properties such as color or intensity alone are not sufficient to solve the point correspondence problem between two images, since matching based on these properties alone yields a large set of possibilities. Additional assumptions about scene smoothness are necessary to obtain unique solutions. However, in order to obtain the best results, we need to use smoothness constraints but respect depth discontinuities at the same time. Hence, depth segmentation (in the image space, not 3D) is needed in order to solve the correspondence problem accurately. On the other hand, we do not know the segmentation beforehand, but we could compute it if we knew the correspondence map, by searching for discontinuities. Thus, we have another set of chicken-and-egg problems whose solutions depend on each other.

Now assume that we have a stereo system viewing a planar patch in the scene which is horizontally slanted. The slant causes the patch to be sampled differently by the two cameras, creating images of different width (notice the width of the slanted left wall of the corridor shown in Figure 1.1). If we knew the slant of the patch, we could stretch (or unwarp) one of the images to equalize the sampling, so that we could solve for the correspondence accurately, perhaps by applying signal processing operations for accurate matching; the sketch below shows what this stretching step looks like. But we do not know the slant beforehand, so how can we find it? We could find it if we knew the correspondence. Thus, estimation of shape is also a part of the chicken-and-egg loop containing correspondence and segmentation.
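The following minimal sketch shows the stretching step under the assumption that the slant (here reduced to a stretch factor, with 1.2 as an arbitrary illustration) were already known; the point of the argument above is precisely that it is not known in advance, but must be recovered together with the correspondence.

```python
import numpy as np

def stretch_scanline(scanline, factor):
    """Resample a 1D scanline by 'factor' using linear interpolation,
    equalizing the sampling of a slanted surface that is imaged at
    different widths by the two cameras."""
    n = len(scanline)
    m = max(int(round(n * factor)), 1)
    # Positions in the original scanline that the stretched samples map to.
    src = np.linspace(0, n - 1, m)
    return np.interp(src, np.arange(n), scanline)

# Example: a surface imaged 20% wider in the left view can be matched
# against the right view after stretching the right scanline by 1.2.
right = np.array([10.0, 20.0, 40.0, 80.0, 60.0])
right_stretched = stretch_scanline(right, 1.2)  # now 6 samples long
```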
Another example of a chicken-and-egg pair is formed by the processes of texture segmentation and shape from texture. The list goes on and on. If you consider a process from the low and intermediate level vision repertoire, the chances are that you will find that it depends on another process, which in turn depends on the original process and perhaps other related processes. It appears then that vision, at least at the low and intermediate level, is a compositional problem, consisting of several subprocesses which are parts of an elaborate set of feedback loops. Our role as computational vision theorists is then to discover and implement such loops. The word discover is used since vision is a physical process which exists in the biological world. One of the remarkable discoveries in neuroscience is the presence of extensive feedback between different parts of the visual system. Many details are known about this architecture, but we are far from a complete understanding of how the system functions as a whole. Thus, a theory of vision considered as a compositional process will not only contribute to robust artificial vision systems, but can also serve as a source of hypotheses that can be addressed with specific experiments in the neurosciences.

Although there exist many interdependent processes in single image analysis as well, in this dissertation we shall concentrate on processes that involve more than one image. This is because the positions of corresponding image points in different images are governed by well-known geometric principles, which makes problem formulation easier and better reveals some of the compositional relations involved, as compared to the harder problem of analyzing a single image. As far as low and intermediate level vision is concerned, the problem of correspondence, i.e. matching images, is considered one of the most basic problems, as it represents the foundation of a large number of higher level processes. In this dissertation, we shall concentrate on the relationships between the correspondence problem and well-known problems such as segmentation, shape estimation and occlusion detection. (By segmentation, we mean the process which detects discontinuities in well-defined properties such as disparity, flow or 3D motion.) We shall develop solutions which extend the state of the art to deal with inputs which currently pose serious problems. This dissertation shows how the new framework allows us to build a number of retinotopic maps for a binocular observer in motion, including stereo disparity, image motion (optical flow), occlusions, shape, ordinal depth, and motion segmentation. In the next section, we shall describe how the compositionality framework is progressively developed in the dissertation, and provide a road map of the ideas which will be introduced.

1.2 Organization: a road map of the thesis

Given two images of the same scene from different viewpoints, the dense point correspondence problem deals with finding a piecewise continuous map which relates points in one image to points in the other image. We begin in Chapter 2 by examining the problem in a world which contains no slanted surfaces. In such a fronto-parallel Flatland, two stereo images of the same surface have the same width, which allows us to solve the pixel correspondence problem instead of the point correspondence problem. In this shapeless world, we show how correspondence (stereo and optical flow) and segmentation can be solved together in a compositional fashion by using connected components. Occlusions are also found using the pixel uniqueness constraint. In Chapter 3, we introduce shape into the picture, to show how a horizontally slanted object projects onto stereo images of different width, causing problems due to uneven sampling and the lack of a pixel uniqueness constraint for finding occlusions. These problems are addressed by presenting a compositional algorithm which solves for correspondence, shape and slant simultaneously. We also address problems caused by the presence of untextured regions and vertically slanted surfaces, and discuss the difficulties involved in formulating smoothness constraints for optical flow in the presence of slant. In Chapter 4, we generalize our algorithm based on connected components into a diffusion formulation, which increases robustness and allows for a broader selection of local matching metrics. In Chapter 5, we develop local metrics which perform matching in a manner which is invariant to the contrast of the two images. One of these local metrics is biologically motivated and uses multiple spatial frequency channels to improve robustness with respect to noise in one of the channels. In Chapter 6, we demonstrate how occluded regions can be filled by optical flow values from neighboring regions, even in the presence of independently moving objects. The underlying method uses segmentation obtained from previous frames, thereby presenting another case for the presence of feedback over time. We also show that if the occlusions can be filled, then ordinal depth relations can easily be derived between different regions of the scene, even if the scene contains independent motion. In Chapter 7, occlusion and ordinal depth information obtained in previous chapters is used to find a novel solution to the motion segmentation problem. In fact, if motion information is available, then the problems of formulating smoothness constraints for optical flow are alleviated. Hence, knowledge of the egomotion can actually help us find the correct correspondence, which testifies to the feedback of information all the way back to the roots.
In this thesis, we shall present robust correspondence algorithms which find discontinuities, deal with untextured regions, handle slanted surfaces, find occlusions and ordinal depth, and are able to perform in the presence of large non-uniform changes in contrast between two images. Finally, Chapter 8 explores open problems and possible avenues of further research.

Chapter 2

Correspondence and Segmentation in Flatland

The dense correspondence problem (also referred to as the matching problem) consists of finding a unique mapping between the points belonging to two images of the same scene. If the camera geometry is known, the images can be rectified ([Fau93, HZ00]), and the problem reduces to the stereo correspondence problem, where points in one image can correspond only to points along the same horizontal line in the other image. If the geometry is unknown, then we have the optical flow estimation problem. In both cases, regions in one image which have no counterparts in the other image are referred to as occlusions (or more correctly as half-occlusions).

2.1 Previous work

There exists a considerable body of work on the dense stereo correspondence problem. Scharstein and Szeliski [SS02] have provided an exhaustive review and comparison of dense stereo correspondence algorithms. Dense matching algorithms generally utilize local measurements such as image intensity (or color) and phase, and aggregate information from multiple pixels using smoothness constraints. The simplest method of aggregation is to minimize the matching error within rectangular windows of fixed size [OK93] (see the sketch below). Better approaches utilize multiple windows [GLY92, FRT97], adaptive windows [KO94] which change their size in order to minimize the error, shiftable windows [BI99, TSK01], or predicted windows [MD00], all of which give performance improvements at discontinuities.

Global approaches to solving the stereo correspondence problem rely on the extremization of a global cost function or energy. The energy functions which are used include terms for local property matching (the "data term"), additional smoothness terms, and in some cases, penalties for occlusions. Depending on the form of the energy function, the most efficient energy extremization scheme can be chosen. These include dynamic programming [OK85], simulated annealing [GG84, Bar89], relaxation labeling [Sze90], non-linear diffusion [SS98], maximum flow [RC98] and graph cuts [BVZ01, KZ01]. Maximum flow and graph cut methods provide better computational efficiency than simulated annealing for energy functions which possess a certain set of properties. Some of these algorithms treat the images symmetrically and explicitly deal with occlusions (e.g. [KZ01]). The uniqueness constraint [MP79] is often used to find regions of occlusion. Egnal and Wildes [EW02] provide comparisons of various approaches for finding occlusions. Recently, some algorithms [BT99] have explicitly incorporated the estimation of slant while performing the estimation of horizontal disparity. Lin and Tomasi [LT03] explicitly model the scene using smooth surface patches and also find occlusions; they initialize their disparity map with integer disparities obtained using graph cuts, after which surface fitting and segmentation are performed repeatedly.
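As a concrete illustration of the simplest aggregation scheme mentioned above (fixed rectangular windows with winner-take-all selection), here is a minimal NumPy/SciPy sketch. It is a generic textbook baseline, not the method of any cited paper or of this thesis, and the window size and disparity range are arbitrary choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def window_sad_disparity(left, right, max_disp, win=9):
    """Fixed-window aggregation: for each candidate disparity d, average the
    pixelwise absolute difference over a win x win window, then pick the
    disparity with the lowest aggregated cost at each pixel."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Left pixel (x, y) is compared against right pixel (x - d, y).
        diff = np.abs(left[:, d:].astype(float) - right[:, :w - d].astype(float))
        cost[d, :, d:] = uniform_filter(diff, size=win)  # box-filter aggregation
    return cost.argmin(axis=0)  # winner-take-all over disparities
```

Because the fixed window straddles depth discontinuities, such a scheme blurs disparity across object boundaries, which is exactly the failure mode that the multiple, adaptive, shiftable and predicted window variants above try to mitigate.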
As is the case with stereo correspondence, there exists a large body of literature devoted to the understanding of the optical flow problem, including the dense flow estimation problem. Beauchemin et al. [BB95] and Mitiche et al. [MB96] provide surveys of the various techniques for optical flow estimation, while Barron et al. [BFB94], and more recently, Galvin et al. [GMN+98] and Liu et al. [LHH+98], provide performance comparisons between various optical flow algorithms. Mitiche et al. [MB96] also discuss the problems of finding motion based segmentation and occlusions, and survey related approaches. There also exists some very interesting recent work on explicitly finding occlusions and motion discontinuities [BF00, ADPS02], and also regarding the spectral properties of occlusions [BB00] in the context of optical flow.

2.2 Flatland

When we compute the disparity map or the optical flow from a given pair of images, the desirable property of such a map is that it should explain the observed images while minimizing the number of discontinuities. In other words, we would like to model the disparity (or the optical flow) as a piecewise continuous function which is consistent with the observed images and has the minimum possible number of pieces. To simplify the problem computationally, we often choose more restrictive versions of the general model of a piecewise continuous function. The simplest but most restrictive version models the disparity map as a piecewise constant function. An obvious improvement is to model the depth map as a piecewise linear (i.e. planar) function. We can proceed in this manner towards progressively more complex models in an attempt to get closer to the true property of piecewise continuity.

One aspect to keep in mind is that when we speak of segmentation, we mean depth segmentation on the image. However, surfaces which are connected in the 3D scene may project to multiple disconnected regions due to self overlap. We are not attempting to find the 3D continuity of such surfaces, which may involve other principles.

In this chapter, we shall begin with the simplest but most restrictive case of trying to model the disparity as a piecewise constant function. This assumption models the scene as a shapeless world consisting of a collection of flat fronto-parallel surfaces, hence the name Flatland. We assign unique correspondences to each pixel and do not deal with transparent surfaces. In this world, we show that correspondence and segmentation are chicken-and-egg problems which can only be solved simultaneously.

2.3 Chicken-and-egg problems

Establishing correspondence between two images of a scene involves first selecting a local metric, such as the intensity (gray level) or color of a pixel, which forms the basis for local comparisons. However, matching on the basis of such local information alone is almost impossible, since many pixels have similar intensity or color. To reduce the correspondence possibilities for a pixel to a single possibility, regions around that pixel must be used along with additional continuity or smoothness assumptions about the scene. Thus, information around a pixel must be aggregated to obtain a unique match. Enforcing smoothness without prior knowledge of depth discontinuities (segmentation) will inevitably lead to errors, especially near the discontinuities. Hence, prior knowledge of the segmentation is essential in order to correctly define regions around a pixel for information aggregation. Conversely, if exact correspondence is known, the segmentation may be easily deduced. Thus, if we knew the segmentation, then we could better estimate the correspondence.
But we need correspondence in order to achieve segmentation. Correspondence and segmentation are chicken-and-egg problems: we need one in order to solve the other. Any recipe for solving such cyclic problems must involve feedback, either implicitly or explicitly. In the following section, we present a method which does not separate the problem of finding correspondence from the problem of finding the segmentation at any stage, and explicitly demonstrates the interdependence of correspondence and segmentation. Recently, we have found that a method similar to our own, but using different vertical consistency constraints, also exists in the stereo literature [BVZ98].

Figure 2.1: Top row: Left image, Right image, True disparity (black denotes 0, gray denotes 2). Bottom row: Absolute intensity difference of left and right images for horizontal shift Δx = 0, 1, 2

2.4 Connected matching regions

Let I_1(x, y) and I_2(x, y) be a given pair of rectified stereo images. The absolute intensity difference between the two images is found using equation 2.1, where Δx denotes the relative horizontal shift between the two input images. For the case of optical flow, the shifts will be two-dimensional.

ΔI(x, y, Δx) = |I_1(x, y) − I_2(x + Δx, y)|    (2.1)

The first row of Figure 2.1 shows a random dot pair of stereo images and the true disparity map. The second row shows the absolute intensity difference images for three horizontal shifts Δx = 0, 1, 2. If we observe the intensity difference images (second row), we notice that large connected regions of matching pixels (shown in black) appear for certain values of the shift. By the word "match", we mean that the absolute intensity difference is below a certain threshold t.

ΔI(x, y, Δx)
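As a minimal sketch of equation 2.1 and of the connected matching regions just described (NumPy/SciPy; the threshold value t and the 4-connected labeling are illustrative choices, and the full Flatland algorithm of Figure 2.3 additionally enforces the uniqueness constraint to find occlusions):

```python
import numpy as np
from scipy.ndimage import label

def matching_regions(I1, I2, dx, t=5.0):
    """Evaluate equation 2.1 for a horizontal shift dx, threshold the
    absolute difference at t, and label connected regions of matching
    pixels (the black regions in the second row of Figure 2.1)."""
    h, w = I1.shape
    diff = np.full((h, w), np.inf)
    if dx >= 0:
        diff[:, :w - dx] = np.abs(I1[:, :w - dx].astype(float)
                                  - I2[:, dx:].astype(float))
    else:
        diff[:, -dx:] = np.abs(I1[:, -dx:].astype(float)
                               - I2[:, :w + dx].astype(float))
    match = diff < t                  # "match" = difference below threshold t
    labels, n = label(match)          # 4-connected components of matches
    return match, labels, n

# Sweeping dx over the candidate shifts (0, 1, 2 for Figure 2.1) yields,
# for each shift, the connected regions that support that disparity.
```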