ABSTRACT Title of dissertation: SEEING BEHIND THE SCENE: USING SYMMETRY TO REASON ABOUT OBJECTS IN CLUTTERED ENVIRONMENTS Aleksandrs Ecins, Doctor of Philosophy, 2017 Dissertation directed by: Professor Yiannis Aloimonos Department of Computer Science Rapid advances in robotic technology are bringing robots out of the controlled environments of assembly lines and factories into the unstructured and unpredictable real-life workspaces of human beings. One of the prerequisites for operating in such environments is the ability to grasp previously unobserved physical objects. To achieve this individual objects have to be delineated from the rest of the environment and their shape properties estimated from incomplete observations of the scene. This remains a challenging task due to the lack of prior information about the shape and pose of the object as well as occlusions in cluttered scenes. We attempt to solve this problem by utilizing the powerful concept of symmetry. Symmetry is ubiquitous in both natural and man-made environments. It reveals redundancies in the structure of the world around us and thus can be used in a variety of visual processing tasks. In this thesis we propose a complete pipeline for detecting symmetric objects and recovering their rotational and reflectional symmetries from 3D reconstructions of natural scenes. We begin by obtaining a multiple-view 3D pointcloud of the scene using the Kinect Fusion algorithm. Additionally a voxelized occupancy map of the scene is extracted in order to reason about occlusions. We propose two classes of algorithms for symmetry detection: curve based and surface based. Curve based algorithm relies on extracting and matching surface normal edge curves in the pointcloud. A more efficient surface based algorithm works by fitting symmetry axes/planes to the geometry of the smooth surfaces of the scene. In order to segment the objects we introduce a segmentation approach that uses symmetry as a global grouping principle. It extracts points of the scene that are consistent with a given symmetry candidate. To evaluate the performance of our symmetry detection and segmentation algorithms we construct a dataset of cluttered tabletop scenes with ground truth object masks and corresponding symmetries. Finally we demonstrate how our pipeline can be used by a mobile robot to detect and grasp objects in a house scenario. SEEING BEHIND THE SCENE: USING SYMMETRY TO REASON ABOUT OBJECTS IN CLUTTERED ENVIRONMENTS by Aleksandrs Ecins Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017 Advisory Committee: Professor Yiannis Aloimonos, Chair/Advisor Dr. Cornelia Fermu¨ller, Co-Advisor Professor Nikhil Chopra, Dean’s Representative Professor Ramani Duraiswami Professor Matthias Zwicker c© Copyright by Aleksandrs Ecins 2017 Acknowledgments I am grateful to my advisors Prof. Yiannis Aloimonos and Dr. Cornelia Fermu¨ller for introducing me to the fascinating world of machine perception. You have taught me to think about computer vision problems in terms of their utility for active agents, that move and act in the world. I clearly remember the day when Yiannis came to the lab one Friday afternoon, picked a mug from my table, and pointed out that he had no trouble placing his fingers on the ’back’ of the mug, even though he didn’t see it. ”You can use symmetry to do it. Look around you, everything is symmetric!”. 
This was the defining moment of my PhD, and from that day onward I knew exactly what was going to be the topic of my thesis. While Yiannis inspired, Cornelia was there to help nurture these ideas and translate them into working code and papers, to be shared with the rest of the world. I am glad that I got a chance to work with both of you. I would like to thank my colleagues that shared the lab with me over the 7 years of my PhD: Austin Myers, Yezhou Yang, Ching Lik Teo, Francisco Barranco, Kostas Zampogiannis, Kanishka Ganguly, Greogory Kramida, Chetan Mysore, Nitin Sanket and Snehesh Shreshtra. I will remember the time we spent in and outside of the lab, talking science and having fun. I am happy that two of my fellow students, whom I met right after arriving to Maryland, remained my closest friends throughout the rest of my PhD. We shared a similar kind of craziness with Kotaro Hara, who is officially the best roommate ever. And I enjoyed chatting about our graduate life during long walks with Aishwarya ii Thiruvengadam. My parents, Elena and Sergejs, have always believed in me and supported me in my endeavors. I am thankful to you for giving me the encouragement and freedom to pursue my interests. Finally, I want to thank my dear Sachoques for keeping me going throughout all these years. You were there for me, during the moments of doubt, during the moments of joy, and I can’t imagine completing this journey without you. iii Table of Contents Acknowledgements ii List of Figures vi 1 Introduction 1 1.1 Robots of The Future . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Current Approaches to Perception For Manipulation . . . . . . . . . . 2 1.3 Symmetry is Key! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Contributions of This Thesis . . . . . . . . . . . . . . . . . . . . . . . 7 2 Data Collection 10 2.1 Reconstruction pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Surface reconstruction . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Occupancy mapping . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Cluttered Tabletop Dataset . . . . . . . . . . . . . . . . . . . . . . . 16 3 Symmetry Detection in 3D Pointclouds 18 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Curve-based symmetry detection . . . . . . . . . . . . . . . . . . . . 22 3.4.1 Pointcloud edge detection . . . . . . . . . . . . . . . . . . . . 23 3.4.2 Reflectional symmetry detection . . . . . . . . . . . . . . . . . 23 3.5 Surface-based symmetry detection . . . . . . . . . . . . . . . . . . . . 26 3.5.1 Smooth surface segmentation . . . . . . . . . . . . . . . . . . 27 3.5.2 Rotational symmetry detection . . . . . . . . . . . . . . . . . 28 3.5.3 Reflectional symmetry detection . . . . . . . . . . . . . . . . . 32 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.6.1 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . . . 37 3.6.2 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 39 3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 iv 4 Symmetry Segmentation 42 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Approach . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 45 4.3.1 Unary term . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.1.1 Rotational symmetry . . . . . . . . . . . . . . . . . . 47 4.3.1.2 Reflectional symmetry . . . . . . . . . . . . . . . . . 49 4.3.2 Binary term . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.3 Hypothesis rejection . . . . . . . . . . . . . . . . . . . . . . . 50 4.4 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.5.1 Segmentation evaluation . . . . . . . . . . . . . . . . . . . . . 54 4.5.2 Symmetry evaluation . . . . . . . . . . . . . . . . . . . . . . . 59 4.5.3 Runtime analysis . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.6 Object grasping by a mobile robot . . . . . . . . . . . . . . . . . . . . 63 4.6.1 Robot platform . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.6.2 Perception pipeline . . . . . . . . . . . . . . . . . . . . . . . . 63 4.6.3 Grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.7 Final Conclusions and Outlook . . . . . . . . . . . . . . . . . . . . . 67 Bibliography 69 v List of Figures 1.1 Robot operating environments. (a) Today’s robot operate in environ- ments where shapes and positions of all of the manipulable objects and obstacles are known in advance. (b) In the future robots will work in messy and unpredictable human environments. . . . . . . . . 2 1.2 Examples of symmetry in nature and in man-made objects. . . . . . . 5 1.3 Human perception of symmetry (adapted from Treder [1]). (a) A random blob pattern is perceived as disorganized. (b) The symmetry of a pattern is immediately apparent to a human observer. (c) The sense of structure is increased if pattern has multiple symmetries. . . 6 2.1 Surface reconstruction of a simple tabletop scene. (a) Using default calibration parameters provided by the camera driver. (b) Using cal- ibrated depth camera intrinsics as well as depth distortion calibration. 12 2.2 Pointcloud reconstruction of a tabletop scene. . . . . . . . . . . . . . 13 2.3 Visualization of the occupancy map. Green shows the interface (bound- ary) between the occupied and free space, while red shows the inter- face between occupied and free space. In the right image, notice the trailing ”tail” behind the objects, that corresponds to the occlusions behind the objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 Sample scenes from the Cluttered Tabletop Dataset. . . . . . . . . . . 16 2.5 Labeling for one of the scenes in the Cluttered Tabletop Dataset. Colored points denote individual segments and green planes show corresponding symmetries. . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Symmetry detection from a pair of oriented points. . . . . . . . . . . 24 3.2 Curve-based symmetry detection. (a) Input pointcoloud (d) Segmen- tation (b) Segment boundary points (e) Linked edge curves (c) and (f) Symmetry correspondences and corresponding symmetry hypotheses. 25 3.3 Scene pointcloud pre-segmentation. (a) Scene pointcloud. Pre-segmentation with (b) θsmooth = 10 ◦, and (c) θsmooth = 15 ◦. . . . . . . . . . . . . . . 27 vi 3.4 Constraint imposed on the points of a rotationally symmetric object. The surface normal n of any point of a rotationally symmetric object must lie in the plane formed by the symmetry axis S and the point p itself. . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Rotational symmetry detection stages shown for the blue bowl seg- ment in Figure 3.6. (a) Input pointcloud. (b) Initial symmetries. (c) Refined symmetry. (d) Cloud reconstructed by rotating the segment around the symmetry axis. . . . . . . . . . . . . . . . . . . . . . . . . 30 3.6 Visualization of the reflectional symmetry fitting approach (a) Input object pointcloud and candidate symmetry. Black curve denotes the observed points of the object. (b) Grey curve denotes the reflected pointcloud and blue lines show the estimated symmetric correspon- dences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.7 Reflectional symmetry detection stages shown for a teddy bear seg- ment in Figure 3.6. Top and bottom rows show different sides of the segment. (a) Input pointcloud. (b) One of the initial symmetries. (c) Refined initial symmetry. . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.8 PR curves for symmetry detection evaluation. Stars mark the point of maximum F-measure. . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.1 Symmetry consistency analysis. A square and a triangle are observed by the camera. pr denotes the reflection of point p and p′ denotes the symmetric neighbor of p. (a) p reflects to its symmetric neighbor. (b) p reflects to occluded space. (c) p reflects to unoccluded space. Note that all of the observed points belonging to the square and none of the points of the triangle are consistent with symmetry S. . . . . . . 46 4.2 System flowchart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3 PR curves for segmentation evaluation. Stars mark the point of max- imum F-measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4 Comparison of segmentation results on individual scenes. For each method best segments were automatically selected by choosing the predicted segments that have the highest overlap with ground truth masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.5 PR curves for symmetry detection results for the complete pipeline. Stars mark the point of maximum F-measure. . . . . . . . . . . . . . 60 4.6 Output of our complete pipeline. First image in a row shows the input pointcloud. Following images show the detected objects and their symmetries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.7 Robot grasping application. (a) Robot reconstructs the scene by mov- ing the depth sensor mounted on the left arm. (b) The reconstructed view of the table. (c) Robot grasping the bowl (d) Cylinders fitted to the reconstruction shown in green. . . . . . . . . . . . . . . . . . 66 vii Chapter 1: Introduction 1.1 Robots of The Future Since ancient times humanity has dreamt of artificial contraptions, mechanical or otherwise, that would serve and help us with our daily lives. We live in the age where technology might be finally catching up to our dreams. The 20th century has seen great progress in automation due to the introduction of small and efficient electric motors and digital electronics. Industrial robots can be programmed to execute predefined motions with extreme precision and repeatability. However, due to their limited perceptual capabilities, they can not adapt to dynamic changes in the environment. Specialist programming is required to ”reprogram” them for different narrowly defined tasks. 
As a result industrial robots are only used in large-scale manufacturing settings where operating space can be tailored to the specific needs of the robot. The robots of the future will instead be encountered in the “real-world” workspaces of our houses, shops, schools and hospitals, where they will interact and work together with humans. This capability requires three major components: 1) Navigation - robots need to move around in space without running into static or dynamic obstacles such as sofas and humans; 2) Human robot interaction - robots will need to understand human actions and intent, and at the same time be 1 (a) (b) Figure 1.1: Robot operating environments. (a) Today’s robot operate in environ- ments where shapes and positions of all of the manipulable objects and obstacles are known in advance. (b) In the future robots will work in messy and unpredictable human environments. able to communicate its own state and intent back to the humans in a natural way; 3) Manipulation - to do useful work, robots need the ability to alter the world through physical contact. While progress in all three of these areas is required to build a robotic butler, this thesis focuses on the latter one. Specifically we are addressing the problem of developing perceptual capabilities required for robotic manipulation of everyday objects. 1.2 Current Approaches to Perception For Manipulation Like any perceptual process our pipeline begins with sensing the environment. Usual sensor choices are 2D cameras capturing RGB information or 3D range sensors that capture depth images or pointclouds. Once the data is captured, individual objects must be delineated from the rest of the environment. Next object shape properties are estimated in order to choose appropriate contact points between the 2 gripper and the object and to plan a suitable grasp strategy. This has to be done in typical messy real life scenarios where the objects are not placed in an orderly manner but instead are piled on top of each other. This remains an extremely challenging problem. There are countless variations in possible object shapes and appearances. This problem is compounded by the fact that cluttered scenes contain heavy occlusions, so only incomplete object observations are available in the data captured by the sensors. As a result it is difficult to segment the scene (i.e. to figure out which parts of the scene belong to different objects) and reconstruct the missing shape information for the objects in the scene. Current approaches to this problem can be largely divided into two classes: model-based and shape primitive based. Model-based methods use a priori knowl- edge about object appearance in order to recognize them in the scene. During the training stage a full 3D CAD object model is rendered from multiple views and visual/shape descriptors are computed and associated to each of the views. Object 6DOF pose in the scene is then recovered by extracting the descriptors from the input data and finding the best matching set of the training descriptors. This is often followed by a fine-tuning step that refines object pose by fitting the 3D model to the observed data using the Iterative Closest Points algorithm [2]. Numerous descriptor representations were proposed including contour models [3], feature con- stellations [4] [5] and appearance templates [6]. 
More advanced methods are capable of tracking the pose of multiple objects from multiple cameras by ensuring global consistency between all object hypotheses using a simulation framework [7]. These methods are capable of accurately estimating object pose in highly cluttered scenes. Moreover, the object shape can be fully reconstructed by aligning 3D models of the object to the scene using the estimated pose. The downside of model-based methods is that they are not capable of detecting objects that were not available during the training stage. Manipulation of previously unseen objects is a key feature of truly autonomous robots.

A more general family of methods finds objects by fitting parametric solid shape primitives to the scene. Object shape and pose are obtained by simultaneously estimating primitive pose and parameters. The choice of the primitives used defines the trade-off between the accuracy of shape approximation and the robustness of the fitting process. For cluttered scenes, boxes, cylinders and spheres can be fitted to the input pointclouds using the robust RANSAC method [8] [9]. These methods work well for objects that are well approximated by the primitives used (e.g. boxes and mugs) but are unsuitable for more complex objects (e.g. a teddy bear). Several methods use a more general family of primitives - superquadrics [10] [11]. These methods are limited to single isolated objects and cannot tolerate clutter, since superquadric fitting is very sensitive to noise and outliers [12].

1.3 Symmetry is Key!

Wherever there is structure - there is symmetry. In its most general sense symmetry is repetition of some element or pattern. Lack of repetition implies lack of structure. Hence symmetry and structure are inseparable concepts. While symmetry is not limited to geometry, from the point of view of perception we are interested in symmetries in geometrical shapes.

(a) Butterfly (b) Flower (c) Human face (d) Taj Mahal (e) Car (f) Teapot
Figure 1.2: Examples of symmetry in nature and in man-made objects.

Symmetrical structures have two appealing properties. Firstly, symmetrical structures are inherently stable. Whether it's an animal standing on its feet, or the arrangement of atoms in a crystal lattice, or the aerodynamics of a car - symmetrical structures achieve a certain balance of forces that results in an overall stability of the system. Secondly, symmetric shapes can be represented very efficiently, which allows for certain efficiencies in replication. A fern expands as it grows by repeating the same pattern over and over again. An artist making a pot rotates a piece of clay around an axis and shapes only one side of the vessel. Due to these properties symmetry is pervasive both in natural and man-made artifacts. In nature symmetrical structures can be encountered at all scales, from the double helix of DNA, to plant and animal bodies, all the way to the galaxies. Man-made objects are almost universally symmetric: chairs, cups, screens, buildings, cars are all symmetric. In fact, if you take a look around you right now, you will be hard-pressed to find asymmetrical objects.

Figure 1.3: Human perception of symmetry (adapted from Treder [1]). (a) A random blob pattern is perceived as disorganized. (b) The symmetry of a pattern is immediately apparent to a human observer. (c) The sense of structure is increased if the pattern has multiple symmetries.
Symmetry is so pervasive in natural environments that both human and animal visual systems have evolved to actively exploit it for perception [13] [14]. Symme- try plays a crucial role in the process of figure-ground organization i.e. the process of figuring which parts of the environment form a whole [15]. This is illustrated intuitively in Figure 1.3. A previously unseen pattern is perceived as a whole if it exhibits symmetry. Multiple symmetries strengthen this effect, while a pattern with no symmetry is perceived as lacking structure. Moreover, it was shown that humans identify symmetries more quickly when it belongs to a single object as opposed to being shared between two objects [16]. This finding suggests that object segmen- tation and symmetry detection are tightly interlocked processes. Additionally, it was found that local symmetry can be used to predict human eye movements when looking at images of novel scenes [17]. This implies that symmetry is used as a cue to guide our visual attention. 6 Symmetry information is particularly important for robotic grasping applica- tions. Given a partial 3D scan of an object, symmetry can be used to reconstruct the occluded parts of the object. Furthermore, symmetry is directly linked to the object’s center of mass. Assuming uniform density, center of mass of an object al- ways lies on the symmetry plane/axis of an object, or at their intersection in case of multiple symmetries. Both of these properties can be used to predict a stable grasp for previously unseen objects [11] [18]. 1.4 Contributions of This Thesis Symmetry encodes structural redundancies of our world. This thesis explores approaches to revealing these redundancies and using them to enable general robotic manipulation in complex and unpredictable environments. Specifically, we present a complete system for segmenting objects and extracting their symmetries in densely cluttered scenes. Unlike CAD model-based methods, our system doesn’t require any object specific prior knowledge and can deal with previously unseen objects. Our core assumption is that the three dimensional shape of objects is either reflectional or rotational symmetric. Although this assumption might seem restrictive at first, as discussed earlier, it holds true for a vast majority of rigid man-made objects. Our approach can be seen as a generalization of shape primitive fitting approaches, where instead of fitting shapes to the 3D data, symmetries are fitted instead. This allows our approach to avoid the problem of approximating complex object shapes with simple geometric primitives. Our pipeline consists of three major steps: 7 • Obtaining a multiple view 3D surface reconstruction of a scene • Detecting candidate object symmetries • Segmenting individual objects based on the candidate symmetries The rest of this thesis is organized as follows. Chapter 2 describes the data collection procedure used to acquire 3D reconstructions. In addition to the tradi- tional pointcloud-based reconstruction of the scene surfaces we collect a voxel-based occupancy map of the scene that is required for the later processing stages. We dis- cuss the sensor calibration procedure for a commodity structured light depth sensor that is necessary to achieve high-quality reconstructions. We also present a dataset of densely cluttered tabletop scenes used for testing and evaluating the efficacy of our perception pipeline. Chapter 3 addresses the problem of symmetry detection in 3D pointclouds. We propose two approaches. 
A curve-based approach detects reflectional symmetries by extracting and matching surface normal edge curves in the pointcloud. A surface-based approach can be used to detect both rotational and reflectional symmetries by fitting symmetry axes/planes to the geometry of the smooth surfaces extracted from the scene. This step acts as an attention operator that finds candidate symmetries in the scene.

In Chapter 4 we discuss the algorithms for obtaining high-quality object segmentation masks given a set of candidate symmetries in the scene. The goal of the segmentation stage is to find points in the scene that belong together according to the local grouping principle of convexity, and at the same time are geometrically consistent with a given symmetry candidate. Our segmentation algorithm can be used with slight modifications to segment both reflectionally and rotationally symmetric objects. We demonstrate how a simplified version of our pipeline can be used by a collaborative robot to pick up objects in a kitchen scenario. We conclude this thesis by discussing the avenues for future research.

Chapter 2: Data Collection

In this chapter we discuss the data acquisition process as well as introduce the dataset used to test and evaluate our perception pipeline.

2.1 Reconstruction pipeline

Since our goal is to use symmetry as a constraint on an object's three dimensional shape, we are interested in collecting 3D geometric data. Unlike a static camera, a robot equipped with an imaging sensor fits the paradigm of an active observer [19]: it can actively change the position of its sensors in the world in order to collect additional information about the environment. Hence, our data collection process should not be limited to single views. Instead we want to collect data from multiple views and combine it into a single 3D reconstruction of the scene. In addition to surface reconstruction, we also want to get an occupancy map of the scene that identifies the free and occupied space. This information is used by the symmetry detection and segmentation stages to evaluate the compatibility of scene points with candidate symmetries.

2.1.1 Surface reconstruction

The problem of 3D reconstruction from multiple images, known as Structure from Motion (SFM), is one of the oldest problems in computer vision [20]. Although the field has seen steady progress over the last several decades [21] [22], image based methods still struggle to produce highly accurate, dense and noise-free models. The main limitation of these methods is that both the scene geometry and the camera frame-to-frame poses (ego-motion) need to be estimated simultaneously. This problem can be alleviated by the use of 3D range sensors. 3D sensors directly measure the surface geometry of the scene, thus greatly simplifying the geometry and ego-motion estimation problems. For our pipeline we use the KinectFusion algorithm [23], which uses the Kinect sensor to deliver extremely accurate dense 3D reconstructions while running in real time.

The primary idea behind KinectFusion is to fuse measurements from a sequence of depth images into a single model. The fused model is maintained using a volumetric Truncated Signed Distance Function (TSDF) [24]. Given a voxelized volume, the TSDF represents a surface within the volume by storing for each voxel the signed distance d to the nearest observed surface. Given a new frame, the KinectFusion algorithm starts by back-projecting the depth image to a pointcloud. The camera pose is then estimated by aligning the pointcloud to the TSDF using the ICP algorithm. The depth map is fused into the TSDF volume by updating the voxels that project into the current depth map. The final TSDF is the weighted average of the TSDFs from individual depth maps. This achieves the effect of smoothing noisy measurements from multiple depth maps, which allows recovering dense noise-free scene models. Once the data is captured, the final mesh model can be extracted by running the Marching Cubes algorithm [25] on the TSDF.

Figure 2.1: Surface reconstruction of a simple tabletop scene. (a) Using default calibration parameters provided by the camera driver. (b) Using calibrated depth camera intrinsics as well as depth distortion calibration.
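To make the fusion step concrete, the following is a minimal sketch of the per-voxel weighted TSDF update described above, written for a dense voxel grid and a plain pinhole projection. The function and variable names are ours and purely illustrative; the actual KinectFusion implementation performs this update on the GPU with ray-based voxel traversal rather than the brute-force projection shown here.

import numpy as np

def update_tsdf(tsdf, weights, depth, K, T_cam_world,
                grid_origin, voxel_size, trunc=0.02):
    """Fuse one depth map (metres) into a dense TSDF volume by a weighted
    running average. tsdf and weights are (X, Y, Z) arrays, K is the 3x3
    intrinsic matrix and T_cam_world maps world points into the camera frame."""
    X, Y, Z = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                             indexing="ij")
    # World coordinates of every voxel centre.
    pts = grid_origin + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel_size
    pts = pts.reshape(-1, 3)

    # Transform voxel centres into the camera frame and project them.
    cam = (T_cam_world[:3, :3] @ pts.T + T_cam_world[:3, 3:4]).T
    z = cam[:, 2]
    z_safe = np.maximum(z, 1e-6)
    u = np.round(K[0, 0] * cam[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / z_safe + K[1, 2]).astype(int)

    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0

    # Signed distance along the viewing ray; voxels far behind the surface are skipped.
    sdf = np.clip(d - z, -trunc, trunc)
    valid &= (d - z) > -trunc

    t, w = tsdf.reshape(-1), weights.reshape(-1)
    w_new = w[valid] + 1.0
    t[valid] = (t[valid] * w[valid] + sdf[valid]) / w_new
    w[valid] = w_new
    return tsdf, weights

Averaging the truncated signed distances in this way is what smooths the sensor noise out of the final model: each voxel ends up storing the mean of all the (truncated) distance observations it has received.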
For our experiments we used an Asus Xtion Pro sensor (similar to the Kinect) for data capture and an open-source implementation of the KinectFusion algorithm [26] with a voxel size of 3.1 millimeters. KinectFusion has no bundle adjustment or loop closure mechanism, so errors in frame-to-frame pose estimates accumulate over multiple frames, leading to inconsistencies and tearing in the reconstructed model. Hence, a high quality sensor calibration is required to achieve precise reconstructions. This can be done by calibrating the intrinsics and the radial distortion of the Kinect's infrared camera [27]. Additionally, it was found that Kinect sensors exhibit complex systematic residual errors in their depth measurement values. In a robotic manipulation scenario, these can result in the robot not being able to pick up the object, since it is perceived as either too far from or too close to its true position. To correct these errors we constructed a series of pixel-wise lookup tables that store multiplicative correction values for the depth measurements at different depth ranges [28]. Figure 2.1 shows the output of our surface reconstruction process on a sample tabletop scene.

Figure 2.2: Pointcloud reconstruction of a tabletop scene.

After obtaining the scene mesh we convert it to a pointcloud and remove all points that do not belong to the tabletop objects. This is done by detecting the table plane and extracting the points lying directly above it. Finally we downsample the pointcloud using a voxel grid filter with a voxel size of 5 millimeters. Our experiments show that this resolution is sufficient for our needs and at the same time significantly reduces the processing time of the subsequent stages of our pipeline. An example of the final scene pointcloud is shown in Figure 2.2.

2.1.2 Occupancy mapping

To construct an occupancy map of the scene we use the OctoMap algorithm [29]. Given a sequence of depth maps and corresponding camera poses, it generates a 3D volumetric map, where each voxel is labeled as either occupied, occluded or free space. At runtime each voxel stores the probability of the corresponding space being occupied. Voxel occupancy probabilities are updated by casting the rays defined by the depth maps into the volume. If a voxel is traversed by a ray, its occupancy probability is decreased. Conversely, if a ray endpoint falls within a voxel, its occupancy probability is increased. After integrating all of the depth maps, the occupancy probabilities are thresholded to classify voxels as either free or occupied. Voxels that were never traversed by a ray are considered occluded. The resulting map is expected to have very few tightly localized occupied voxels and large chunks of free and occluded voxels. OctoMap takes advantage of this data sparsity by storing the map in an octree data structure. This significantly reduces the memory footprint of the algorithm and allows storing large maps.
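The per-voxel update just described can be sketched as follows. This is a simplified stand-in rather than OctoMap itself: it uses a dense array instead of an octree, samples each ray at fixed half-voxel steps instead of exact ray casting, and the log-odds increments are illustrative values.

import numpy as np

def integrate_depth_map(log_odds, observed, points_world, cam_origin,
                        grid_origin, voxel_size, hit=0.85, miss=-0.4):
    """Update a dense occupancy grid from one set of measured 3D points.

    log_odds : (X, Y, Z) float array of voxel occupancy log-odds.
    observed : (X, Y, Z) bool array marking voxels ever touched by a ray;
               voxels that stay False are treated as occluded space."""
    def to_index(p):
        return tuple(((p - grid_origin) / voxel_size).astype(int))

    for end in points_world:
        ray = end - cam_origin
        length = np.linalg.norm(ray)
        direction = ray / length
        # Sample the ray at half-voxel steps, stopping just before the endpoint.
        for t in np.arange(0.0, length - voxel_size, voxel_size * 0.5):
            idx = to_index(cam_origin + t * direction)
            if all(0 <= i < s for i, s in zip(idx, log_odds.shape)):
                log_odds[idx] += miss      # traversed -> more likely free
                observed[idx] = True
        idx = to_index(end)
        if all(0 <= i < s for i, s in zip(idx, log_odds.shape)):
            log_odds[idx] += hit           # endpoint -> more likely occupied
            observed[idx] = True

# After all frames: occupied = log_odds > 0, free = observed & (log_odds <= 0),
# and voxels never touched by any ray (~observed) are occluded.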
For our experiments we used the open-source implementation of the OctoMap algorithm [30]. Since OctoMap has no mechanism for frame-to-frame camera pose estimation, we used the poses estimated by KinectFusion instead. The occupancy map was constructed with a voxel size of 5 millimeters. In our experiments, we found that lower resolutions negatively affected the performance of the symmetry detection and segmentation stages of our pipeline. One of the downsides of the OctoMap algorithm is that rays are integrated into the volume sequentially on a CPU, which results in slow performance. To alleviate this bottleneck we only processed every 5th frame of our sequences and used every 10th point of each depth map.

Figure 2.3: Visualization of the occupancy map. Green shows the interface (boundary) between the occupied and free space, while red shows the interface between the occluded and free space. In the right image, notice the trailing "tail" behind the objects, which corresponds to the occlusions behind the objects.

Finally, to reason about scene occlusions we compute the three dimensional Euclidean Distance Transform (EDT) of the occupancy map. In the context of robotics the EDT is usually used for path planning and collision avoidance. Given a binary spatial map where obstacles are labeled with ones, the EDT computes the Euclidean distance from each voxel in the map to the nearest obstacle voxel. The EDT allows quickly evaluating the distance from an arbitrary point in space to the nearest obstacle. Given an occupancy map, we construct the EDT by treating both occluded and occupied voxels as obstacles. As will be discussed in the following chapters, we use the EDT to judge which points of the scene are compatible with a given symmetry by reflecting/rotating the point and checking its distance to the occluded/occupied space. We will use EDT(P_i) to denote the distance from point P_i to the nearest occluded/occupied voxel in the scene. The distance is clamped to a maximum of d^{occl}_{max} in order to reduce the influence of outliers. We used the value d^{occl}_{max} = 3 cm throughout our experiments.

Figure 2.4: Sample scenes from the Cluttered Tabletop Dataset.

2.2 Cluttered Tabletop Dataset

To evaluate our pipeline we propose a new Cluttered Tabletop Dataset, which contains 3D reconstructions of 89 cluttered tabletop scenes. The set of objects used in the dataset ranges from simple objects like boxes to highly non-convex objects such as bowls and stuffed toys. Scenes have varying complexity, from single objects up to 15 objects put side by side and stacked on top of each other, as shown in Figure 2.4. Scenes were captured by manually moving a depth sensor in an approximately 120° horizontal arc around the center of the scene. This strategy was chosen to simulate the environment sensing procedure that could be executed by a robot with a camera mounted on its manipulator. All of the scenes were labeled with ground-truth pointwise segmentation masks for the individual objects, as well as corresponding object rotational and reflectional symmetries, as illustrated in Figure 2.5.

Figure 2.5: Labeling for one of the scenes in the Cluttered Tabletop Dataset. Colored points denote individual segments and green planes show corresponding symmetries.
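Before moving on to symmetry detection, here is a small sketch of how the clamped distance transform EDT(P_i) from Section 2.1.2 could be computed and queried. It assumes a dense voxel grid and hypothetical helper names; the thesis pipeline itself stores the map in OctoMap's octree rather than a dense array.

import numpy as np
from scipy.ndimage import distance_transform_edt

def build_clamped_edt(occupied, occluded, voxel_size, d_max=0.03):
    """Distance (in metres) from every voxel to the nearest occupied or
    occluded voxel, clamped to d_max (3 cm in our experiments)."""
    obstacle = occupied | occluded
    # distance_transform_edt measures the distance of non-zero cells to the
    # nearest zero cell, so the obstacles must be the zeros.
    dist = distance_transform_edt(~obstacle) * voxel_size
    return np.minimum(dist, d_max)

def edt_lookup(edt, point, grid_origin, voxel_size):
    """EDT(P_i): clamped distance from a 3D point to occluded/occupied space."""
    idx = tuple(((point - grid_origin) / voxel_size).astype(int))
    return edt[idx]

The clamping means that a point whose reflection lands anywhere beyond 3 cm from occluded or occupied space is penalized by the same amount, which is how the influence of outliers is bounded in the later stages.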
17 Chapter 3: Symmetry Detection in 3D Pointclouds 3.1 Introduction Symmetry is ubiquitous in natural and man-made objects. Gestalt psychol- ogysts have recognized the role of symmetry in the human visual system as an important cue that serves as input to the higher level visual processes such as seg- mentation [15]. This observation has been mirrored in Computer Vision applica- tions, where symmetry is used as a preprocessing step for a variety of object-centric perception tasks such as object recognition [31], retrieval [32], reconstruction [33] as well as robotic manipulation [11]. Unlike most of the previously proposed approaches that operate on high qual- ity 3D models of single objects, our aim is to detect symmetries of multiple objects in incomplete pointclouds. Our approach is to first extract stable geometric features and then find symmetries within single or multiple feature instances. In order to handle cluttered scenes, the features used have to satisfy two properties: a single feature must belong to a single object and at the same time it must capture the geometry of the underlying object. Although orientable 3D point features may seem like an obvious choice, they are not suitable for our task. Due to the limitations in the resolution of the pointcloud and the nature of the shape of the object, ex- 18 tracting a sufficient number of features may be problematic. For example, a box only has 6 corners that can serve as discriminative feature points and only few of those may be visible due to occlusions. A bowl has no unique feature points at all. Alternatively we propose two families of approaches that use curve and surface features instead. Our curve-based approach detects reflectional symmetries by ex- tracting surface normal edge curves and then searching for pairs of curves that are ”aligned” by a reflectional symmetry. In our surface-based approach we first extract smooth surfaces from the pointcloud and then fit reflectional/rotational symmetry planes/axes to the segments. For rotational symmetry, we define a novel measure of fitness between an oriented point and a symmetry axis that allows us to fit a symme- try axis directly to the segment points. Reflectional symmetries are detected using an ICP-like technique that alternates between finding correspondences between sym- metric points and refining the candidate symmetry plane given the correspondences. For each of the three approaches we define a set of confidence measures that can be used to control the precision-recall tradeoff of the method. 3.2 Related Work A large amount of research in Computer Vision is devoted to symmetry detec- tion in 2D and 3D data. Most of the 2D approaches rely on extracting and matching intensity discontinuities such as edges and point features. One of the first success- ful methods for detecting symmetry in 2D images is the ”Generalized Symmetry Transform” of Reisfeld et al. [34]. The main idea of the method is that two image 19 points provide evidence for the existence of symmetry at a midpoint between them if they ”reflect” onto each other. This principle was used to generate a continuous ”symmetry map” that measures how symmetrical the neighborhood around an im- age point is. This idea was extended by Loy and Eklundh [35] who employed robust oriented SIFT features to improve the quality of symmetrical matching. 
Pairs of symmetrically matching feature points were used to generate reflectional and rota- tional symmetry hypotheses which were then aggregated in a Hough voting scheme to find the dominant symmetries in the image. Some approaches proposed symme- try for attention [36], and symmetry to aid segmentation of symmetric objects [37]. More recently Teo et al. [38] proposed to detect curved reflectional symmetries and used them to drive the segmentation process by embedding a symmetry term into the cost function of the Markov Random Field (MRF) used for segmentation. For three dimensional data most of the existing methods focus on recovering symmetries from complete 3D models. Sun et al. proposed to detect global sym- metries in a 3D model by analyzing its three dimensional surface normal histogram called the Extended Gaussian image (EGI) [39]. The key assumption is that an EGI of an object has the same global symmetries as the object itself. Thus they can be discovered by finding the transformation that maximizes the correlation be- tween the original and transformed EGIs. This assumption however does not hold for noisy and incomplete object models. Podolak et al. [32] defined the ”Planar Reflective Symmetry Transform” that captured the degree of symmetry of an object with respect to all possible planes passing through it. Exact object symmetries were then detected by extracting the transform maxima. In [40], Mitra et al. . relied on 20 orientable feature matching to detect partial symmetries in complete 3D meshes of single objects. Symmetry hypotheses were found by matching uniquely orientable keypoints on the surface of the mesh and filtering out the dominant ones using mean shift clustering. This approach was capable of detecting reflectional, rota- tional, translational and scaling transformations present in the mesh. Thrun and Wegbeit reasoned about occlusions of the scene to detect symmetries in partial 3D pointclouds [41]. Symmetries were found by searching through the space of possible transformations and finding the one that minimizes the number of points that reflect into the unoccluded space and maximizes the match between the original and the reflected clouds. Detected symmetries were then used to reconstruct occluded parts of the object. 3.3 Notation We use the following notation to denote oriented points, symmetries and re- flections/rotations of points throughout this and the following chapters: • P = {p, n} - an oriented point with coordinate p and a normal vector n. • Srefl = {nrefl, drefl} - a reflectional symmetry plane defined by a normal nrefl and distance to origin drefl. • Srefl = {nrefl, prefl} - a reflectional symmetry plane defined by a normal nrefl and a point prefl lying in the symmetry plane. • Srefl(·) - a reflection of either a point or a normal by the reflectional symmetry 21 plane Srefl. • Srot = {drot, prot} - a rotational symmetry axis defined by a point prot of the axis and the direction vector drot. • Srot(·, θ) - a rotation of a point or normal by angle θ around the rotational symmetry axis Srot 3.4 Curve-based symmetry detection Consider a simple toy example of detecting reflectional symmetries of a box. The edges of the box are the straight lines connecting it’s vertices. It is immediately obvious that edges provide sufficient information for discovering the symmetries of the box: we can find pairs of edges for which there exists a symmetry plane that reflects them onto each other. 
Following this process we will find some symmetries that are not correct (between edges that do not lie on the same planar surface of the box), but we are guaranteed not to miss any of the true symmetries. This observation motivates our curve-based symmetry detection approach. Given a pointcloud with surface normal information, we define surface normal edges as ordered sets of points where the surface normal direction changes abruptly. After extracting the edges we find symmetric matches between pairs of such edges. Using a robust clustering method allows us to recover symmetries even from noisy edges that contain a high number of outliers.

3.4.1 Pointcloud edge detection

Our strategy for extracting normal edges is similar to that presented in [42], which utilizes the duality between segmentation and edge detection. We oversegment the cloud into patches whose borders are likely to lie along normal edges and then find the boundaries between them. We use the algorithm of Papon et al. [43] that segments pointclouds into supervoxels with high boundary adherence. To avoid extracting boundaries between supervoxels that belong to the same flat surface, we merge all adjacent supervoxels that have collinear surface normals (Figure 3.2d). Supervoxel boundary points are found by extracting points which have at least one neighbor that belongs to a different supervoxel (Figure 3.2b). Boundary points are then linked together using a Minimum Spanning Tree algorithm to form a set of disjoint edge curves C = {C_1, ..., C_n}. For each of the curves, its pointwise tangent direction vectors are estimated by fitting lines to each of its points and their neighbors. Each curve C_i = (P^1_{C_i}, ..., P^m_{C_i}) is represented by an ordered set of oriented points and describes a surface normal edge in the cloud (Figure 3.2e).

3.4.2 Reflectional symmetry detection

Once the edge curves are extracted, pointcloud symmetries can be recovered by finding symmetrical pairs of edge curves. The basis of this process is the idea of a symmetric correspondence between two oriented points. Consider two points in space P_i and P_j. The coordinates of the points uniquely define a reflectional symmetry plane that is perpendicular to the line connecting the points and contains the midpoint between them. Representing the symmetry plane S^{refl} = {n^{refl}, d^{refl}} by its normal and distance to the origin, its parameters can be expressed as:

n^{refl} = \frac{p_i - p_j}{\|p_i - p_j\|}    (3.1)

d^{refl} = \frac{(p_i + p_j) \cdot n^{refl}}{2}    (3.2)

Figure 3.1: Symmetry detection from a pair of oriented points.

We define the matching score of the symmetric correspondence between P_i and P_j as the angular difference between the normal of one of the points n_i and the reflected normal of the other point S^{refl}(n_j):

MS(P_i, P_j) = \angle(n_i, S^{refl}(n_j))    (3.3)

By putting a threshold on the maximum allowed matching score, we can reject noisy symmetric correspondences. This mechanism allows us to decide whether a valid symmetry plane exists for a pair of points and to recover that symmetry.
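A minimal sketch of this pairwise construction (Eqs. 3.1-3.3) is given below. The function names and the default 15° threshold are illustrative assumptions, not the exact implementation.

import numpy as np

def reflect_direction(n_plane, v):
    """Reflect a direction vector v across a plane with unit normal n_plane."""
    return v - 2.0 * np.dot(v, n_plane) * n_plane

def symmetry_from_point_pair(p_i, n_i, p_j, n_j, alpha_max=np.deg2rad(15)):
    """Candidate reflection plane from two oriented points (Eqs. 3.1-3.3).

    Returns (n_refl, d_refl, score) or None when the matching score MS
    exceeds alpha_max."""
    n_refl = (p_i - p_j) / np.linalg.norm(p_i - p_j)     # Eq. 3.1
    d_refl = np.dot(0.5 * (p_i + p_j), n_refl)           # plane through the midpoint
    n_j_reflected = reflect_direction(n_refl, n_j)
    # Eq. 3.3: angle between n_i and the reflected normal of the other point.
    cos_angle = np.clip(np.dot(n_i, n_j_reflected), -1.0, 1.0)
    score = np.arccos(cos_angle)
    return (n_refl, d_refl, score) if score < alpha_max else None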
Given a pair of edge curves C_i, C_j we exhaustively search for symmetric correspondences between their individual points {P_{C_i}, P_{C_j}} using the symmetric correspondence matching described above.

Figure 3.2: Curve-based symmetry detection. (a) Input pointcloud (d) Segmentation (b) Segment boundary points (e) Linked edge curves (c) and (f) Symmetry correspondences and corresponding symmetry hypotheses.

Curve tangent direction vectors are used as orientation vectors instead of surface normals, since surface normal estimates are inherently noisy at points of high curvature. Matches are filtered by removing correspondences with a match score higher than a threshold, MS(P_{C_i}, P_{C_j}) > \alpha_{max}, and by enforcing a one-to-one correspondence relationship. This results in a set of noisy candidates for the global symmetry plane aligning the two curves. To recover the global symmetry plane we take an approach similar to the one proposed in [40]. Symmetry candidates are treated as samples of the probability density function of the global symmetry plane. The problem of recovering the global symmetry plane then becomes equivalent to clustering these candidates. To do the clustering we represent the candidates as points in three dimensional space by weighting the symmetry plane normal by the distance to the origin, i.e. n^{refl}/d^{refl}. We then use mean shift clustering [44] to find the dominant mode of the distribution and use its maximum as the symmetry plane aligning the two curves. This procedure is run on every pair of edge curves, resulting in a set of 3D reflectional symmetry hypotheses for the input pointcloud S = {S^{refl}_1, ..., S^{refl}_k} (Figure 3.2c). Finally, we filter out symmetries for which the ratio of the number of valid symmetric correspondences to the total number of points in the corresponding curves is lower than a threshold n_{inlier}. This allows us to remove some of the false symmetry hypotheses that will inevitably be detected by our approach.

3.5 Surface-based symmetry detection

While the curve-based approach focuses on the high curvature features of the pointcloud, our surface-based approach makes use of the dense geometric information available from the smooth surfaces. It capitalizes on the fact that many objects contain large continuous surfaces which share the same symmetries as the complete object. First, we oversegment the pointcloud using a region growing algorithm that uses local surface normals and point connectivity to enforce a smoothness constraint. To detect scene symmetries, we employ a geometric fitting approach. Two separate algorithms are used to fit rotational symmetry axes and reflectional symmetry planes to the scene segments. Since the segments are noisy, we ensure that both algorithms are robust to outliers and at the same time are sensitive enough to detect symmetries from limited support.

Figure 3.3: Scene pointcloud pre-segmentation. (a) Scene pointcloud. Pre-segmentation with (b) θ_smooth = 10° and (c) θ_smooth = 15°.

3.5.1 Smooth surface segmentation

We employ a variant of the region growing algorithm with a smoothness constraint [45]. Pointcloud points are sorted in order of increasing curvature and the point with the minimum curvature is selected as the seed point. At each step, six of the nearest neighbors of the current seed point are examined. Segment smoothness is enforced by requiring that the angle between the normals of neighboring points is smaller than a threshold θ_smooth. Unlike the standard approach, in which all of the neighboring points satisfying the smoothness constraint are added as new segment points, we only grow the segment if at least half of the neighbors satisfy the constraint. This prevents the segments from "leaking" over surface normal boundaries due to noisy surface normal estimates. The newly added points are used as new seeds. The process continues until no more valid neighbors can be found or all of the pointcloud points belong to a segment. If any points remain, they are used to initialize a new segment. A sketch of this procedure is given below.
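The sketch makes a few simplifying assumptions: neighbors are found with a k-d tree, the "at least half of the neighbors" rule is applied to the not-yet-labelled neighbors, and all names are ours.

import numpy as np
from scipy.spatial import cKDTree

def smooth_segmentation(points, normals, curvature, theta_smooth_deg=10.0, k=6):
    """Region growing with a smoothness constraint (simplified sketch of Sec. 3.5.1)."""
    cos_thr = np.cos(np.deg2rad(theta_smooth_deg))
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    order = np.argsort(curvature)              # lowest-curvature points first
    current = 0
    for seed in order:
        if labels[seed] != -1:
            continue
        labels[seed] = current
        queue = [seed]
        while queue:
            p = queue.pop()
            _, nbrs = tree.query(points[p], k=k + 1)
            nbrs = [j for j in nbrs[1:] if labels[j] == -1]
            if not nbrs:
                continue
            smooth = [j for j in nbrs
                      if abs(np.dot(normals[p], normals[j])) > cos_thr]
            # Only grow if at least half of the free neighbors are smooth,
            # which keeps segments from leaking across normal edges.
            if len(smooth) * 2 >= len(nbrs):
                for j in smooth:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels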
To increase the likelihood of recovering the necessary segments, we run the above segmentation algorithm at several smoothness thresholds. Specifically we use θ_1 = 10° and θ_2 = 15°. Figure 3.3 shows an example of a scene pre-segmentation. Note how the two segmentations complement each other. With θ_1 = 10° the front faces of the red and violet boxes are segmented correctly, while the teddy bear is over-segmented. On the other hand, with θ_2 = 15° the teddy bear is segmented correctly while the box faces are merged.

3.5.2 Rotational symmetry detection

Rotational symmetry imposes a strong constraint on the shape of an object. We observe that for any point on the surface of a rotationally symmetric object, the corresponding surface normal lies in the plane formed by the symmetry axis and the point itself, as shown in Figure 3.4. This allows us to formulate an optimization scheme that fits a rotational symmetry axis directly to the segment pointcloud.

Figure 3.4: Constraint imposed on the points of a rotationally symmetric object. The surface normal n of any point of a rotationally symmetric object must lie in the plane formed by the symmetry axis S and the point p itself.

Consider a pointcloud consisting of oriented points P_i = {p_i, n_i} and a candidate symmetry axis S^{rot} = {p^{rot}, d^{rot}} described by a point and a unit direction vector. Let \angle(S^{rot}, P_i) denote the angle between the point normal n_i and the plane formed by the symmetry axis S^{rot} and the point p_i. The candidate symmetry axis can be fitted to the pointcloud by minimizing the following function over the symmetry axis parameters:

\sum_i \min(\angle(S^{rot}, P_i), \alpha_{max})    (3.4)

where \alpha_{max} is the maximum angle used to limit the influence of outliers. In our experiments we use \alpha_{max} = 45°. This function can be minimized using a non-linear solver such as Levenberg-Marquardt. To initialize the minimization we use PCA to construct three candidate symmetry axes that go through the center of mass of the pointcloud and are aligned with the segment principal axes. We have found that this initialization procedure ensures that at least one of the three axes will converge to the true symmetry axis. Once all three axes are refined, the one attaining the minimum error is selected as the single final hypothesis. This is motivated by the fact that, except for the degenerate case of a sphere, any shape can have at most one rotational symmetry axis.

Figure 3.5: Rotational symmetry detection stages shown for the blue bowl segment in Figure 3.3. (a) Input pointcloud. (b) Initial symmetries. (c) Refined symmetry. (d) Cloud reconstructed by rotating the segment around the symmetry axis.

Figure 3.5 shows the symmetry detected for a bowl. Note how our approach manages to accurately detect the symmetry from a partial pointcloud, which allows us to reconstruct the complete shape of the object.
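A compact sketch of this fitting step is shown below. It is only an approximation of Eq. 3.4: scipy's least_squares minimizes the sum of squared clipped angles rather than their plain sum, and the axis is parametrized naively by six numbers, so it should be read as an illustration rather than the exact Levenberg-Marquardt formulation used in the thesis.

import numpy as np
from scipy.optimize import least_squares

def axis_point_angle(p_rot, d_rot, points, normals):
    """Angle between each point normal and the plane spanned by the axis and
    the point itself (the fitness measure illustrated in Figure 3.4)."""
    d = d_rot / np.linalg.norm(d_rot)
    plane_normals = np.cross(points - p_rot, d)
    norms = np.linalg.norm(plane_normals, axis=1, keepdims=True)
    plane_normals /= np.maximum(norms, 1e-12)
    dots = np.abs(np.sum(plane_normals * normals, axis=1))
    return np.arcsin(np.clip(dots, 0.0, 1.0))

def fit_rotational_axis(points, normals, p_init, d_init,
                        alpha_max=np.deg2rad(45)):
    """Fit a rotational symmetry axis to an oriented pointcloud (sketch of Eq. 3.4)."""
    def residuals(x):
        angles = axis_point_angle(x[:3], x[3:], points, normals)
        return np.minimum(angles, alpha_max)   # clip to limit outlier influence

    x0 = np.concatenate([p_init, d_init])
    res = least_squares(residuals, x0)
    p_rot = res.x[:3]
    d_rot = res.x[3:] / np.linalg.norm(res.x[3:])
    return p_rot, d_rot, residuals(res.x).mean()

In practice this function would be called three times, once for each PCA-initialized axis, and the result with the smallest mean residual kept as the final hypothesis.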
In order to find symmetries for all of the rotationally symmetric objects in a scene, we apply our detection approach to all of the segments returned by the smooth segmentation stage. While the resulting set of hypotheses is likely to contain all of the true rotational symmetries, it will also contain many false positives, since the scenes are expected to contain objects that have no rotational symmetries (e.g. a box). To filter out these cases we define a set of metrics that measure how well a rotational symmetry axis fits a given segment.

Symmetry score. The symmetry score measures how well a candidate symmetry axis fits the points of a segment. It is calculated as the total fitness error between the final symmetry axis and the segment, normalized to the [0, 1] range:

Sym^{rot} = \frac{1}{N} \sum_i \frac{\min(\angle(S^{rot}, P_i), \alpha_{max})}{\alpha_{max}}    (3.5)

where \alpha_{max} = 60° and N is the number of points in the segment. A lower symmetry score implies a better fit between a symmetry and a segment.

Perpendicularity score. Intuitively, there is no unique way of fitting a rotational symmetry to a flat surface, such as a face of a box. In fact, given our definition of fitness between a symmetry axis and a point, all of the points of a flat segment achieve a fitness score close to zero for any symmetry axis that is perpendicular to the segment. This means that the symmetry score cannot be used to filter out such segments. To handle this corner case we measure how perpendicular a segment is to a symmetry candidate. The perpendicularity score between a point and a symmetry axis is calculated as the angle between the point normal and the symmetry axis:

Perp^{rot}(P_i) = \frac{\angle(d^{rot}, n_i)}{90°}    (3.6)

The perpendicularity between a segment and an axis is defined as the average of the point perpendicularity scores.

Symmetries that satisfy all of the above filtering checks are accepted as final symmetry candidates. Specifically, a symmetry candidate S^{rot} is accepted as valid if it satisfies the following conditions: Sym^{rot} < Sym^{rot}_{max} and Perp^{rot} < Perp^{rot}_{max}. Symmetries that do not satisfy at least one of the measures are discarded.

3.5.3 Reflectional symmetry detection

Unlike the case of rotational symmetry, reflectional symmetry does not impose any constraints on the position or surface normal orientation of single points. Instead, relationships are established between pairs of symmetric points. We say that point P'_i is a symmetric correspondence of P_i under the reflectional symmetry S^{refl} if the two points "reflect" onto each other, i.e. P'_i is similar to S^{refl}(P_i), the reflection of P_i by S^{refl}. This motivates our fitting approach, which alternates between symmetric correspondence estimation and symmetry plane refinement.

We represent a symmetry plane S^{refl} = {p^{refl}, n^{refl}} by a point p^{refl} lying in the plane and the plane's normal n^{refl} (note that this is a different representation from the one used in our curve-based approach). For each point P_i we compute its reflection S^{refl}(P_i) and search for its nearest neighbor P'_i in the pointcloud. Since we are dealing with incomplete object scans, we do not expect every point to have a valid symmetric match. Thus we only establish a correspondence if the distance between the reflected point and its nearest neighbor is less than 1 cm. Additionally, we reject all correspondences for which the angle between the reflected normal S^{refl}(n_i) and the corresponding normal n'_i is greater than 45°.
After establishing the correspondences, we minimize the following function over the parameters of the symmetry plane:

\sum_{\{i\}} d_{p,pl}(S^{refl}(P_i), P'_i)    (3.7)

where d_{p,pl} stands for the point to plane distance and \{i\} is the set of points that have a correspondence. This process is repeated until convergence or until a maximum number of 20 iterations is reached. Since correspondences are only established between points that are already approximately symmetric, this method results in a very accurate alignment between a symmetry and a pointcloud. However, similar to ICP, it requires a good initial symmetry plane estimate [46].

One way to generate initial hypotheses is to jointly sample plane positions from a grid defined around the pointcloud and plane orientations from a unit sphere. This is computationally prohibitive, since sampling from an n × n × n grid is O(n^3) and densely sampling from a sphere incurs a large constant factor [32]. We avoid the costly position sampling by only sampling orientations and then refining the plane positions based on the pointcloud structure. We start by sampling points from the surface of a unit sphere aligned with the principal axes of the segment pointcloud. Specifically, we rotate the unit vector corresponding to the largest principal axis by multiples of 36° around the two remaining principal axes. In spherical coordinates, this corresponds to uniform angular sampling in the polar and azimuth angles. If any two normals are opposites of each other, we discard one of them since they define the same symmetry plane. This results in 25 candidate orientations. For each orientation, we construct a symmetry plane S^{refl} that goes through the segment's center of mass.

To refine the position of a plane we first find all points that could potentially form symmetric correspondences if the symmetry plane were shifted appropriately. Given a point P_i, its potential symmetric neighbors P_j are enclosed in a cylinder centered at P_i with an axis parallel to the plane normal n^{refl}. These "cylindrical" neighbors can be found efficiently by projecting the pointcloud onto the symmetry plane and performing a radius search. We filter out correspondences for which the angle between the normal n_i and the reflected neighbor normal S^{refl}(n_j) is greater than 10°. For each of the remaining correspondences, we calculate the shift distance d_{i,j} as the distance between the symmetry plane and the midpoint between P_i and P_j. We compute the median shift d_{median} from all correspondences and use it to shift the symmetry candidate's point along its normal:

S^{refl} = \{p^{refl} + d_{median} \cdot n^{refl},\ n^{refl}\}    (3.8)

Figure 3.7: Reflectional symmetry detection stages shown for a teddy bear segment in Figure 3.3. Top and bottom rows show different sides of the segment. (a) Input pointcloud. (b) One of the initial symmetries. (c) Refined initial symmetry.

After generating the initial symmetries, we apply the global iterative refinement described above to obtain the final symmetry detections. Figure 3.7 shows the symmetry detected for a teddy bear.
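The alternation between correspondence estimation and plane refinement can be sketched as follows. The refinement step here is a simple closed-form re-fit (the plane normal from the average displacement of corresponding points, the plane position from their midpoints) rather than the exact point-to-plane minimization of Eq. 3.7, and the thresholds mirror the values quoted above; treat it as an illustration under those assumptions.

import numpy as np
from scipy.spatial import cKDTree

def reflect_points(points, p_plane, n_plane):
    """Reflect points across the plane through p_plane with unit normal n_plane."""
    signed = (points - p_plane) @ n_plane
    return points - 2.0 * signed[:, None] * n_plane

def refine_reflectional_symmetry(points, normals, p_plane, n_plane,
                                 max_iter=20, corr_dist=0.01,
                                 corr_angle=np.deg2rad(45)):
    """Alternate correspondence estimation and plane refinement (sketch of Sec. 3.5.3)."""
    tree = cKDTree(points)
    for _ in range(max_iter):
        reflected = reflect_points(points, p_plane, n_plane)
        reflected_normals = normals - 2.0 * (normals @ n_plane)[:, None] * n_plane
        dist, nn = tree.query(reflected)
        angle_ok = np.einsum('ij,ij->i', reflected_normals, normals[nn]) > np.cos(corr_angle)
        mask = (dist < corr_dist) & angle_ok
        if mask.sum() < 3:
            break
        src, dst = points[mask], points[nn[mask]]
        # The plane normal should align with the displacement between
        # corresponding points, and the plane should pass through their midpoints.
        disp = src - dst
        disp[disp @ n_plane < 0] *= -1.0        # keep a consistent sign
        new_n = disp.mean(axis=0)
        norm = np.linalg.norm(new_n)
        if norm < 1e-9:
            break
        n_plane = new_n / norm
        p_plane = 0.5 * (src + dst).mean(axis=0)
    return p_plane, n_plane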
Inlier score. We want to reject symmetries which do not have sufficient support, i.e. that have too few symmetric correspondences. The inlier score is calculated as:

Inlier^{refl} = \frac{N_{corr}}{N}   (3.10)

where N_corr is the number of correspondences under a given symmetry and N is the number of points in the segment.

3.6 Experiments

To compare the proposed approaches we tested their performance on the Cluttered Tabletop Dataset. As discussed in Chapter 2, the scenes in the dataset contain a high amount of clutter, which poses a significant challenge for precise symmetry detection. We evaluate our approaches in terms of their ability to predict the individual symmetries present in the scenes. In Section 3.6.1 we discuss the details of the experimental setup and the evaluation procedure used. We then present the quantitative results and discuss the strengths and shortcomings of the proposed approaches in Section 3.6.2.

3.6.1 Evaluation procedure

Our evaluation procedure is inspired by the evaluation method used for 2D symmetry detection [47]. Similarity between a predicted symmetry S and a ground truth symmetry S_GT is measured based on the angular and positional differences between the two symmetries. The geometric definitions of the angle between two axes and the angle between two planes provide good measures of angular similarity between rotational and reflectional symmetries respectively:

\Delta_{angle}(S, S_{GT}) = \angle(S, S_{GT})   (3.11)

Finding a measure of positional similarity between symmetries is more challenging. The distance between the lines/planes does not provide a meaningful measure of similarity between symmetries. For example, two rotational symmetry axes belonging to different objects in the scene may intersect outside the bounds of the scene. The situation is even more extreme in the case of reflectional symmetries, since any two planes will always intersect unless they are parallel. It is more revealing to measure the distance between the two symmetries in the spatial area occupied by the object that exhibits the ground truth symmetry. Since our dataset contains object segmentation masks associated with every ground truth symmetry, we use the centroids of the segments as estimates of the locations of the objects. Specifically, given a ground truth symmetry S_GT associated with a ground truth segment Seg_GT with centroid Ctr_GT, we define the distance to a predicted symmetry S as the Euclidean distance between the projections of the segment centroid onto the predicted and ground truth symmetries:

\Delta_{pos}(S, S_{GT}) = \| \mathrm{proj}_{S_{GT}}(Ctr_{GT}) - \mathrm{proj}_{S}(Ctr_{GT}) \|   (3.12)

Given the above definitions, we say that a ground truth symmetry S_GT in the scene is predicted correctly if there exists a predicted symmetry S such that the angular difference ∆_angle(S, S_GT) is less than t1 = 10° and the positional difference ∆_pos(S, S_GT) is less than t2 = 10 cm. If several predicted symmetries satisfy these criteria for the same ground truth symmetry, the predicted symmetry achieving the smallest positional difference is counted as a true positive detection, while the rest are considered false positives. Predicted symmetries that do not match any of the ground truth symmetries are considered false positives, while ground truth symmetries that do not get matched to any predicted symmetry are considered false negatives.
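For concreteness, the matching criterion can be expressed in a few lines. The sketch below evaluates the angular and positional differences for a pair of reflectional symmetry planes; the thresholds follow the text, while the plane representation (a point and a unit normal) and the function names are assumptions.

```python
import numpy as np

def project_onto_plane(point, plane_point, plane_normal):
    """Orthogonal projection of a 3D point onto a plane given by a point and a unit normal."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return point - np.dot(point - plane_point, n) * n

def symmetry_match(pred, gt, gt_centroid, t_angle=np.deg2rad(10), t_pos=0.10):
    """Check whether a predicted reflectional symmetry matches a ground truth one.
    pred and gt are (plane_point, plane_normal) tuples; gt_centroid is the centroid
    of the ground truth object segment."""
    (p_pt, p_n), (g_pt, g_n) = pred, gt
    p_n = p_n / np.linalg.norm(p_n)
    g_n = g_n / np.linalg.norm(g_n)
    # Angle between two planes = angle between their normals (sign-invariant).
    d_angle = np.arccos(np.clip(abs(np.dot(p_n, g_n)), 0.0, 1.0))
    # Positional difference: compare the projections of the object centroid onto both planes.
    d_pos = np.linalg.norm(project_onto_plane(gt_centroid, g_pt, g_n)
                           - project_onto_plane(gt_centroid, p_pt, p_n))
    return d_angle < t_angle and d_pos < t_pos, d_angle, d_pos
```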
We use the evaluation procedure described above to perform a precision-recall analysis of our symmetry detection approaches. For each of the three methods we varied the parameters of the confidence measures used to filter the detected symmetries. The parameter values used are summarized in Table 3.1. Each approach was evaluated on the corresponding symmetry type, i.e. the curve-based and surface-based reflectional symmetry detection approaches were evaluated on reflectional ground truth symmetries, while the surface-based rotational symmetry approach was evaluated on ground truth rotational symmetries.

Table 3.1: Parameters used for evaluating symmetry detection. Parameters were varied from the min value to the max value in steps of size step.

Method                        Parameter      min     max     step
Curve-based reflectional      MS             1°      15°     1°
                              n_inlier       0.1     0.5     0.1
Surface-based rotational      Sym^rot        0.004   0.03    0.002
                              Perp^rot       0.6     0.7     0.1
Surface-based reflectional    Sym^refl       0.7     0.95    0.05
                              Inlier^refl    0.2     0.8     0.2

3.6.2 Results and discussion

Figure 3.8 shows the PR curves and the maximum F-scores for the three algorithms. Curve-based reflectional symmetry detection achieves an F-score of 10%. While the algorithm achieves a high level of recall, it comes at the price of very low precision. This can be attributed to the fact that the output of the normal edge detection is very noisy, which results in many false positive symmetric matches between curves that belong to different objects or between non-symmetric curves of the same object. The surface-based detection approaches, on the other hand, achieve a comparably high recall but have much higher precision. The dense geometric information available from the smooth segments allows these methods to detect the symmetries much more reliably, while ignoring many of the false positives. The surface-based reflectional symmetry detection approach achieves an F-score of 40%. Rotational symmetry detection has the highest F-score of 80%. This is due to the fact that rotational symmetry places a much stricter constraint on the shape of an object, compared to reflectional symmetry.

Figure 3.8: PR curves for symmetry detection evaluation (surface-based rotational F = 0.80, surface-based reflectional F = 0.40, curve-based reflectional F = 0.10). Stars mark the point of maximum F-measure.

In general, the main source of errors of the surface-based methods stems from the imperfections of the presegmentation process. Two kinds of errors can be distinguished. Firstly, some of the segments can span multiple objects, which results in false negatives, since it becomes harder to extract the true symmetries of both objects. Secondly, a segment that only spans a part of an object may possess additional symmetries that are not present in the complete object, which results in false positive detections. Additionally, a large number of false positive reflectional symmetries come from rotationally symmetric objects. This is due to the fact that for a rotationally symmetric object, any plane passing through the symmetry axis defines a valid reflectional symmetry. As a result the filtering process cannot distinguish these symmetries as invalid.

3.7 Conclusions

Symmetry detection is an important perceptual capability that serves as a preprocessing step for many visual tasks. In this chapter we presented two families of approaches for reflectional and rotational symmetry detection in pointclouds.
An evaluation on a challenging dataset of cluttered scenes showed that the surface-based approaches deliver the highest quality results, especially for rotational symmetries. All of the methods achieve a very high recall (> 90%) under relaxed filtering conditions. As expected, this comes at the cost of reduced precision. This observation suggests that a more sophisticated filtering approach may retain the same recall performance while increasing the precision by intelligently discarding false positives. In the next chapter we discuss how symmetry detection results can be used to aid object segmentation, and how the segmentation step can in turn be used to filter out false positive symmetry detections.

Chapter 4: Symmetry Segmentation

4.1 Introduction

Figure-ground organization, the process of grouping smaller percepts into a whole, is one of the core abilities required for visual perception. Symmetry is one of the main grouping laws proposed by Gestalt psychologists. In this chapter we propose a method for segmenting reflectional and rotational symmetric objects in cluttered scenes. Given a set of candidate symmetries we initialize multiple figure-ground segmentations. The goal of each segmentation is to find points in the scene that belong together according to the grouping principles of proximity and convexity, and at the same time are geometrically consistent with a given symmetry candidate. The key insight of our segmentation approach is that symmetry hypotheses that correspond to one of the objects in the scene are consistent with all of the observed points of that object, while incorrect symmetries are compatible only with an unorganized set of points that can be easily discarded. Using the results from Chapter 3 we present a complete pipeline for detecting symmetries and using them to segment objects in a pointcloud. We evaluate our approach on a challenging dataset and demonstrate how a simplified version of our system can be used by a mobile robot to manipulate objects in a kitchen scenario.

4.2 Related Work

Image segmentation is one of the fundamental problems in Computer Vision and has been studied extensively. The problem is inspired by the Gestalt theory of perceptual organization, which states that there exist a number of bottom-up "grouping laws" that allow humans to perceive visual stimuli in terms of coherent groups corresponding to meaningful objects and patterns [48]. The laws of proximity, similarity, continuity and closure are implicitly used in all modern segmentation algorithms. Most of the classical segmentation algorithms were developed for analyzing RGB or greyscale images and use texture, color and image contour features for grouping [49, 50, 51, 52]. Although suitable for certain tasks, they are not capable of reliably segmenting objects in cluttered environments. This is due to the fact that image features are too weak to be used in a purely bottom-up fashion. Discontinuities in intensity images are caused both by object boundaries and by surface texture, making the two extremely difficult to distinguish. Nevertheless, these algorithms serve as a foundation for more recent segmentation approaches which rely on 3D information to overcome many of the limitations of their predecessors. One such approach is the active segmentation approach of Mishra et al. [53]. Depth and RGB cues are used to find an edge image of the scene, and objects are segmented by finding a closed contour in the edge map around a given fixation point.
This method is able to segment compact objects in simple scenes. Additional segmentation cues can be obtained by projecting the depth image into three-dimensional space and analyzing the geometry of the resulting 3D pointcloud. In [54] Richtsfeld et al. presegment the scene by fitting planes and NURBS to the pointcloud data and then merge the resulting segments into object hypotheses. The likelihood of two segments belonging to the same object is computed by training an SVM classifier on a number of appearance and geometrical features. The use of robust shape primitives in the presegmentation step allows this method to deal with compact objects in stacked configurations.

Several methods have been proposed that explicitly model the assumption that objects tend to be convex in shape. Stein et al. [55] use surface normals to oversegment the pointcloud into supervoxels and estimate their adjacency using the distance between their 3D centroids. Pairs of adjacent supervoxels are classified as either convex or concave based on the relative position and surface normal orientation difference of their centroids. Supervoxels are then merged to find components that are enclosed by convex edges. A favorable property of the algorithm is that non-convex objects tend to get divided into their convex parts. In another work, Karpathy et al. find objects in 3D meshes of indoor scenes [56]. The input mesh is oversegmented using an adaptation of Felzenszwalb's algorithm [49], producing a set of overlapping object hypotheses. Individual hypotheses are then classified as object or non-object based on a number of features including compactness, convexity, smoothness and symmetry. Zheng et al. proposed to reason about the physical stability of objects in the scene [57]. Their method estimates the volumetric extent of segments extracted from pointcloud data and then combines them into configurations that are stable under the gravity constraint. This method is capable of correctly segmenting non-convex objects, but fails in situations where volumetric reconstruction is unreliable.

Symmetry has been utilized as an attention mechanism to detect potential object locations in the scene. In [58] Kootstra et al. generate 2D saliency maps by finding symmetric configurations of image gradients and show that these maps can be used to guide the segmentation process. Potapova et al. [59] showed that higher quality saliency maps can be obtained by extracting locally symmetric depth image patches. More recently Teo et al. [38] proposed to use symmetry both as an object detector and as a constraint in the segmentation process. Curved reflectional symmetries are detected and used to set up a foreground segmentation that forces the segments to be symmetric with respect to the detected symmetry curves. The idea of using symmetry both as an attention and a grouping mechanism is similar to our approach. However, unlike [38], which operates on single 2D images, our approach works on multiple-view 3D pointclouds, which enables it to produce accurate results even in extremely cluttered scenes.

4.3 Approach

The goal of the segmentation step is to find object hypotheses, i.e. subsets of points in the pointcloud that are consistent with the detected symmetries and at the same time respect the basic grouping law of convexity. To check if a point P_i of the pointcloud is consistent with a symmetry hypothesis S we analyze its reflection/rotation by the symmetry [41]. Consider the case of reflectional symmetry.
We distinguish between three different options for the reflection of a point. If P is reflected to another point P' of the pointcloud, we call the two points symmetric neighbors and both are considered to be supporting the symmetry S (Figure 4.1a). If, on the other hand, it reflects into occluded space, it does not support the symmetry but is still compatible with it (Figure 4.1b). Finally, if a point reflects to unoccluded space, it violates the symmetry S (Figure 4.1c). Similar principles can be used for evaluating consistency with a rotational symmetry, with the difference that the point is rotated around the symmetry axis instead of being reflected.

Figure 4.1: Symmetry consistency analysis. A square and a triangle are observed by the camera. p^r denotes the reflection of point p and p' denotes the symmetric neighbor of p. (a) p reflects to its symmetric neighbor. (b) p reflects to occluded space. (c) p reflects to unoccluded space. Note that all of the observed points belonging to the square and none of the points of the triangle are consistent with the symmetry S.

To capture these properties, we define two pointwise measures: the symmetry score Sym(P) and the occlusion score Ocl(P). These scores are calculated differently for rotational and reflectional symmetry, but in both cases the scores are normalized to lie in the [0, 1] range and a lower score indicates better compatibility with a symmetry hypothesis.

To segment all of the objects in the scene we set up multiple graph-based foreground segmentation problems. A graph is constructed for each of the symmetry hypotheses, where nodes correspond to the points of the pointcloud and edges are established between adjacent points. Let L = {fg, bg} be the labels corresponding to object and background respectively. Object hypotheses are segmented by finding a labeling f that assigns a label f_p ∈ L to every point in the pointcloud, such that the following energy functional is minimized:

E(f) = \sum_{p \in P} D_p(f_p) + \lambda \sum_{\{p_1, p_2\} \in N} V_{p_1,p_2} \cdot \delta(f_{p_1} \neq f_{p_2})   (4.1)

where δ(·) denotes an indicator function. The first term is the unary term D_p(f_p), which ensures that the points consistent with the current symmetry hypothesis are labeled as foreground. V_{p_1,p_2} is the binary term defined between neighbouring points of the pointcloud. It forces segment boundaries to lie along surface normal edges and not across flat surfaces. λ determines the influence of the binary term; in our experiments we set this weight to 2. Once the graphs are constructed they are optimized using the graph-cuts algorithm [60]. We will now discuss how the different terms are computed for rotational and reflectional symmetries.

4.3.1 Unary term

4.3.1.1 Rotational symmetry

As discussed in Section 3.5.2, the fitness score between an oriented point and a rotational symmetry S is calculated as the normalized angle between the point normal and the plane containing the symmetry axis and the point itself:

Sym^{rot}(P) = \frac{\min(\angle(S^{rot}, P), \alpha_{max})}{\alpha_{max}}   (4.2)

For the occlusion score, we repeatedly rotate P around the symmetry axis by θ degrees and check the distance from the rotated point to the nearest occluded voxel. The distance lookup is done efficiently using the EDT computed during the reconstruction process (see Section 2.1.2). The occlusion score for a single point is calculated as the maximum of these distances, normalized by the clamping distance:

Ocl^{rot}(P) = \frac{\max_i \left( EDT(S^{rot}(P, \theta \cdot i)) \right)}{d_{max}}   (4.3)

where θ = 30°, i ∈ {1, ..., 12} and d_max = 3 cm is the clamping distance used for the EDT.
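To make the rotational point-wise scores concrete, the sketch below computes both measures for a single oriented point. It is a simplified sketch rather than the thesis implementation: the symmetry axis is represented as a point and a direction, the occupancy lookup is abstracted as a user-supplied edt() function returning the clamped distance (at most d_max) from a 3D location to the nearest occluded voxel, the 60° clamp is carried over from Section 3.5.2 as an assumption, and the function names are hypothetical.

```python
import numpy as np

def rotate_about_axis(p, axis_point, axis_dir, angle):
    """Rotate point p by `angle` radians around the given axis (Rodrigues' formula)."""
    k = axis_dir / np.linalg.norm(axis_dir)
    v = p - axis_point
    v_rot = (v * np.cos(angle) + np.cross(k, v) * np.sin(angle)
             + k * np.dot(k, v) * (1.0 - np.cos(angle)))
    return axis_point + v_rot

def rotational_point_scores(p, n, axis_point, axis_dir, edt,
                            alpha_max=np.deg2rad(60), d_max=0.03, steps=12):
    """Return (symmetry score, occlusion score) for an oriented point (p, n) with respect
    to a rotational symmetry axis, both normalized to [0, 1]. Assumes p is not on the axis."""
    k = axis_dir / np.linalg.norm(axis_dir)
    # Normal of the plane containing the symmetry axis and the point p.
    plane_n = np.cross(k, p - axis_point)
    plane_n /= np.linalg.norm(plane_n)
    # Angle between the point normal and that plane (zero when n lies in the plane).
    fit_angle = abs(np.arcsin(np.clip(np.dot(plane_n, n), -1.0, 1.0)))
    sym = min(fit_angle, alpha_max) / alpha_max
    # Occlusion score: rotate the point in 30-degree steps and take the largest
    # (clamped) distance to occluded space, normalized by the clamping distance.
    dists = [edt(rotate_about_axis(p, axis_point, k, 2 * np.pi * i / steps))
             for i in range(1, steps + 1)]
    ocl = max(dists) / d_max
    return sym, ocl
```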
The unary weights are set as:

D_p(fg) = (1 - Sym^{rot}(P)) \cdot (1 - Ocl^{rot}(P))   (4.4)
D_p(bg) = Sym^{rot}(P) + Ocl^{rot}(P)   (4.5)

These terms encode the intuition that a point belonging to the surface of a rotationally symmetric object has to satisfy both the symmetry fitness and the occlusion constraints.

4.3.1.2 Reflectional symmetry

In the case of reflectional symmetry, the symmetry fitness score is only defined for the points that have symmetric correspondences. For a point P with a symmetric correspondence P' the fitness score is calculated as the normalized angle between the reflected point normal and the corresponding point normal:

Sym^{refl}(P) = \frac{\min(\angle(S^{refl}(n), n'), \alpha_{max})}{\alpha_{max}}   (4.6)

The occlusion score, which is defined for all of the points in the scene, is computed by reflecting the points by the symmetry plane and checking the distance to the nearest occluded voxel:

Ocl^{refl}(P) = \frac{EDT(P') - d_{max}}{d_{max}}   (4.7)

For a point that has a symmetric neighbor, the unary weights are computed in the same way as in the rotational symmetry case (equations 4.4 and 4.5). If a point has no symmetric correspondence, its foreground weight is set to 0, while the occlusion score is used for the background weight:

D_p(fg) = 0   (4.8)
D_p(bg) = \frac{\mathrm{Ocl}(P, S^{refl})}{d_{max}}   (4.9)

4.3.2 Binary term

Unlike the unary term, the binary term does not depend on the symmetry type and is computed in the same way for rotational and reflectional symmetries. The point adjacency N in the pointcloud is estimated by connecting every point to its 9 nearest neighbors. Binary weights are set between every pair of neighboring points. One viable option is to set the edge weights according to the angle between the surface normals of the neighboring points: a small angle indicates that the points belong to the same surface, while a large angle suggests the existence of a surface normal edge between the points. This approach can be improved by modulating the weights based on a convexity criterion. This modification captures the intuition that objects tend to be convex, while boundaries between two objects lying on top of each other tend to be concave. Given two points P_1 and P_2, we define them to be in a convex arrangement if n_1 · (p_1 - p_2) > 0. The binary weight between them is set as:

V_{p_1,p_2} = \begin{cases} \exp\left(-\frac{n_1 \cdot n_2}{\sigma_{convex}}\right) & \text{if } n_1 \cdot (p_1 - p_2) > 0 \\ \exp\left(-\frac{n_1 \cdot n_2}{\sigma_{concave}}\right) & \text{otherwise} \end{cases}   (4.10)

where σ_convex = 2 and σ_concave = 0.15.
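The pairwise weight of equation (4.10) translates directly into code. The sketch below is a minimal illustration under the parameter values given above; the use of SciPy's cKDTree for the 9-nearest-neighbor search and the function name are assumptions, not the implementation used in the thesis.

```python
import numpy as np
from scipy.spatial import cKDTree

def binary_weights(points, normals, k=9, sigma_convex=2.0, sigma_concave=0.15):
    """Build the pairwise (binary) edge weights of the segmentation graph.
    Each point is connected to its k nearest neighbors; edges across concave
    boundaries receive much smaller weights and are therefore cheap to cut."""
    tree = cKDTree(points)
    edges, weights = [], []
    # Query k+1 neighbors because the closest neighbor of each query point is itself.
    _, nbrs = tree.query(points, k=k + 1)
    for i, neighbors in enumerate(nbrs):
        for j in neighbors[1:]:
            n1, n2 = normals[i], normals[int(j)]
            convex = np.dot(n1, points[i] - points[int(j)]) > 0   # convex arrangement test
            sigma = sigma_convex if convex else sigma_concave
            weights.append(np.exp(-np.dot(n1, n2) / sigma))
            edges.append((i, int(j)))
    return edges, weights
```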
4.3.3 Hypothesis rejection

For each symmetry hypothesis, the output of our segmentation approach is a segmentation mask associated with the hypothesis used to generate it. In order to reject false positive segments we employ a filtering process based on a number of geometrical "objectness" measures. These measures are used to rate each segment hypothesis on how likely it is to correspond to an object in the scene.

Symmetry score. A valid segment predicted by our segmentation approach must be symmetric. The symmetry score of a segment is calculated as the average symmetry score over all of the points of the segment:

Sym(H) = \frac{1}{|H|} \sum_{P_i \in H} Sym(P_i)   (4.11)

where H is the predicted segment.

Occlusion score. As discussed previously, all of the points of the segment must lie in the occluded/occupied space of the scene when rotated/reflected by the symmetry. The segment occlusion score is the average of its point-wise occlusion scores:

Ocl(H) = \frac{1}{|H|} \sum_{P_i \in H} Ocl(P_i)   (4.12)

Smoothness score. Object hypotheses should be smooth, i.e. their boundaries should not cross flat surfaces. We measure the smoothness of a hypothesis H by counting the average number of smooth links between adjacent points of the pointcloud that are broken by the segmentation. To make sure that surface normal edges do not contribute to the score, we only count links for which the angle between the point normals is smaller than a threshold. Let B_H = \{(P_i, P_j) \mid P_i \in H, P_j \in P \setminus H, \{P_i, P_j\} \in N\} be the set of links between neighboring points in the pointcloud that were broken by the object hypothesis H. The smoothness score is calculated as:

Smooth(H) = \frac{1}{|H|} \sum_{\{P_i, P_j\} \in B_H} \delta(\angle(n_i, n_j) < \alpha_{smooth})   (4.13)

Object hypotheses that satisfy Sym(H) < s_sym, Ocl(H) < s_ocl and Smooth(H) < s_smooth are considered valid. The rest of the hypotheses are rejected. By changing these thresholds, our algorithm can be tuned to achieve the necessary precision-recall trade-off.

4.4 Pipeline

In this section we present an approach to combining the symmetry detection methods described in Chapter 3 with the segmentation method described above into a complete pipeline for segmenting rotationally and reflectionally symmetric objects in 3D reconstructions of cluttered scenes. Figure 4.2 shows the flowchart of the system. It consists of two sequential streams: one for rotational and the other for reflectional symmetries. Given an input pointcloud and an EDT of the scene, rotational objects are detected first. Next, the points belonging to the detected rotational object segments are removed from the input pointcloud before passing it to the reflectional symmetry stage. By doing this, we avoid the previously discussed problem of detecting reflectional symmetries on rotationally symmetric surfaces (Section 3.6.2). It has the added benefit of speeding up the reflectional symmetry stage of the pipeline, since it operates on a smaller pointcloud.

Figure 4.2: System flowchart.

After segmenting both rotational and reflectional objects we combine the outputs from both stages. Since some of the objects in the scene may have multiple symmetries (e.g. a box), the segments generated for any of these symmetries should be identical. To handle these cases, we merge the segments for which the intersection over union is greater than 95% and concatenate the symmetries used to generate them. This allows our pipeline to return segments associated with multiple symmetries.

4.5 Experiments

We evaluate our approach on the scenes from the challenging Cluttered Tabletop Dataset. We investigate its ability to predict accurate segmentation masks as well as to detect object symmetries. We evaluate three versions of our approach. The first version uses the complete pipeline with surface-based approaches for rotational and reflectional symmetry detection (Rot + Refl (surface)). To investigate the importance of modeling rotational symmetries, we test two versions of our pipeline that only use reflectional symmetry to segment the scenes. The two versions use the surface-based (Refl (surface)) and curve-based (Refl (curve)) approaches for the symmetry detection stage.

The performance of our pipeline directly depends on the quality of the detected symmetries.
If none of the symmetries of an object are detected, the segmentation stage will not be able to segment it. On the other hand, if an incorrect symmetry is supplied to the segmentation stage, there is a good chance it will be rejected by the segment hypothesis filtering process. Thus we use a set of relaxed filtering parameters for the symmetry detection stages that maximize the recall of the detected symmetries at the cost of precision. The values of the parameters used are shown in Table 4.1.

Table 4.1: Filtering parameters used for the symmetry detection stages in the evaluation of our pipeline.

Method                        Parameter      Value
Curve-based reflectional      MS             7°
                              n_inlier       0.3
Surface-based rotational      Sym^rot        0.02
                              Perp^rot       0.6
Surface-based reflectional    Sym^refl       0.8
                              Inlier^refl    0.3

4.5.1 Segmentation evaluation

Unlike the majority of modern segmentation algorithms, which assign a single object label to every point in the pointcloud, our method returns a set of object segments that can be spatially overlapping. Moreover, some points might not get assigned to any segment at all. We believe that this property does not reflect negatively on our approach. We argue that the utility of a segmentation process for a robotic system should be measured not by the number of points in the scene that were assigned a correct object label, but by the number of objects in the scene for which sufficiently accurate segmentation masks were returned.

This principle motivates our choice of the evaluation metric. An object in the scene is considered to be segmented correctly if there exists a predicted segment that overlaps sufficiently with its ground truth segmentation mask (the intersection over union metric is used to compute the overlap). Recall is calculated as the fraction of the objects in the dataset that were segmented correctly, and precision as the number of correctly segmented objects normalized by the total number of returned segments. Denoting by H_T the set of ground truth masks for all objects in all of the scenes of a dataset and by H_P the set of predicted object hypotheses:

P = \frac{1}{|H_P|} \sum_{H_t \in H_T} \delta\left( \max_{H_p \in H_P} O(H_p, H_t) > \sigma_o \right)   (4.14)

R = \frac{1}{|H_T|} \sum_{H_t \in H_T} \delta\left( \max_{H_p \in H_P} O(H_p, H_t) > \sigma_o \right)   (4.15)

where

O(H_1, H_2) = \frac{|H_1 \cap H_2|}{|H_1 \cup H_2|}   (4.16)

and σ_o is the desired overlap score. In our evaluation we use the value σ_o = 0.9. We choose such a high overlap requirement since small errors in segmentation can lead to significant errors in later stages of the processing pipeline of a robot (e.g. grasp selection and path planning).
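The overlap criterion of equations (4.14)-(4.16) can be illustrated with a short sketch. Here segments are represented as sets of point indices; this representation and the function names are assumptions made for clarity, not the evaluation code used in the thesis.

```python
def overlap(h1, h2):
    """Intersection over union between two segments given as sets of point indices."""
    return len(h1 & h2) / len(h1 | h2)

def segmentation_pr(predicted, ground_truth, sigma_o=0.9):
    """Precision and recall of a set of predicted segments against ground truth masks.
    An object counts as correctly segmented if some prediction overlaps it by more than sigma_o."""
    correct = sum(1 for gt in ground_truth
                  if any(overlap(pred, gt) > sigma_o for pred in predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Example: two ground truth objects, three predicted segments.
gt = [set(range(0, 100)), set(range(100, 180))]
pred = [set(range(0, 95)), set(range(100, 180)), set(range(300, 320))]
print(segmentation_pr(pred, gt))  # (0.666..., 1.0) with the default 0.9 overlap threshold
```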
In addition to the three versions of our pipeline, we evaluate two bottom-up segmentation approaches that rely on convexity as the main grouping principle. The first is the popular Felzenszwalb graph segmentation algorithm adapted to pointcloud data [56]. The second is LCCP [55], which segments the scene by merging supervoxels found in the scene. The meta-parameters used for these algorithms, as well as for the segmentation stages of our pipeline, are described in Table 4.2.

Table 4.2: Parameters used for evaluating segmentation. Parameters were varied from the min value to the max value in steps of size step.

Method                      Parameter    min     max     step
Reflectional segmentation   Sym^refl     0.1     0.6     0.1
                            Ocl^refl     0.005   0.045   0.01
Rotational segmentation     Sym^rot      0.01    0.03    0.01
                            Ocl^rot      0.01    0.015   0.005
Felzenszwalb                k            0.5     7.0     0.5
LCCP                        β_thresh     10°     70°     5°

The PR curves for the evaluated algorithms are shown in Figure 4.3. LCCP and Felzenszwalb achieve similar performance, with maximum F-measures of 31% and 33% respectively. Both methods are significantly outperformed by the symmetry-based methods. This shows that local convexity alone is not sufficient for accurate object segmentation in highly cluttered environments.

Figure 4.3: PR curves for segmentation evaluation (Felzenszwalb F = 0.33, LCCP F = 0.31, Refl (curve) F = 0.60, Refl (surface) F = 0.65, Rot + Refl (surface) F = 0.73). Stars mark the point of maximum F-measure.

Looking at the results in Figure 4.4, we can identify two distinct types of object configurations where the Felzenszwalb and LCCP methods fail:

1. Touching objects where the transition between the objects is not concave are merged together. In row 1 the transition between the speaker and the coffee box is non-concave.

2. Objects separated by an occlusion boundary are oversegmented. In row 2 the box is occluded by the milk carton.

In both of these cases the convexity criterion is not sufficient to correctly group the points into objects. On the other hand, symmetry provides a global, object-level grouping principle, which allows our method to correctly segment these scenes.

Figure 4.4: Comparison of segmentation results on individual scenes. For each method the best segments were automatically selected by choosing the predicted segments that have the highest overlap with the ground truth masks.

Among the symmetry-based methods, Rot + Refl (surface), our complete pipeline with rotational and reflectional symmetry stages, achieves the highest F-score of 73%. The approaches relying solely on reflectional symmetry achieve similar recall, but have lower precision. This can be attributed to two reasons. Approaches that do not explicitly model rotational symmetry can only segment rotational objects if they happen to detect a reflectional symmetry that passes through the rotational symmetry axis. Even if such a symmetry is detected, reflectional segmentation imposes fewer constraints on the segment compared to the rotational method, and as a result it is less likely to segment the object correctly. Additional segmentation errors arise if the axis of a rotationally symmetric object lies in the symmetry plane of a different object: the two will be merged together by a reflectional segmentation approach. Finally, Refl (surface) achieves an F-score of 65%, outperforming Refl (curve) with 60%. The lower performance of the latter can be explained by the fact that Refl (curve) uses an inferior symmetry detection stage.

4.5.2 Symmetry evaluation

We investigate the symmetry detection performance of our complete pipeline, Rot + Refl (surface). For each of the scenes, the symmetries associated with the returned segments are concatenated together and treated as the outputs of a symmetry detection approach. The procedure described in Section 3.6.1 is used for evaluation. Figure 4.5 shows the PR curves for the reflectional and rotational symmetry detection results after the segmentation process. The results of the surface-based symmetry detection approaches (without the segmentation step) are shown for comparison. Rotational symmetries are detected very reliably with an F-score of 89%, while reflectional symmetries get a more modest score of 63%. For both symmetry types, our pipeline significantly outperforms the raw surface-based approaches.
This confirms our hypothesis that segmentation can be viewed as a reliable filtering stage for the detected symmetries. The performance gap between the pre- and post-segmentation results is particularly notable for reflectional symmetries, with an increase of 23%. This is due to the fact that our pipeline removes rotationally symmetric parts of the scene before detecting reflectional symmetries. This has the effect of "decluttering" the scene, making further processing easier.

Figure 4.5: PR curves for symmetry detection results for the complete pipeline (rotational after segmentation F = 0.89, rotational surface-based F = 0.80, reflectional after segmentation F = 0.63, reflectional surface-based F = 0.40). Stars mark the point of maximum F-measure.

Figure 4.6: Output of our complete pipeline. The first image in each row shows the input pointcloud; the following images show the detected objects and their symmetries.

4.5.3 Runtime analysis

The runtime of our complete pipeline (Section 4.4) depends on the number of objects present in the scene, the complexity of their individual shapes, as well as on their arrangement. The computational complexity of symmetry detection (both rotational and reflectional) for a single segment is bounded by the performance of the Levenberg-Marquardt optimization, which has a worst-case computational complexity of O(n^3), where n is the number of points in the pointcloud. The symmetry detection algorithms have to be run on every segment returned by the smooth surface segmentation process. The symmetry segmentation steps have the same complexity of O(n^3) due to the use of the graph-cuts algorithm. A graph-cut has to be run for every symmetry detected during the symmetry detection stage.

Table 4.3 shows the average computational times for the different stages of our pipeline. Performance was measured on a modern Intel i7 CPU. Scenes consisting of 4 or fewer objects required 1.5 seconds on average, while more complex scenes took 8.4 seconds to process.

Table 4.3: Average runtimes for the different stages of our pipeline on the scenes from the Cluttered Tabletop Dataset.

Pipeline stage        4 or fewer objects    More than 10 objects
Rotational stage      0.3 sec               0.9 sec
Reflectional stage    1.2 sec               7.5 sec
Total                 1.5 sec               8.4 sec

4.6 Object grasping by a mobile robot

In this section we describe an application of a simplified version of our perception pipeline in the context of robot manipulation. A mobile robot was tasked with picking up dishes, such as bowls, mugs and plates, from a table in a kitchen scenario. The shapes of the graspable objects were modeled as cylinders. This model generalizes well to a large variety of dishes and can be used by the robot to pick up previously unseen objects.

4.6.1 Robot platform

Our robot consisted of a dual-arm Baxter collaborative robot mounted on an omnidirectional mobile base. The right arm was equipped with a ReFlex 3-fingered gripper and was used for object grasping. An Asus Xtion Pro RGB-D sensor was used as the main vision sensor. It was mounted on the left arm, such that its position could be controlled to get a better view of the scene. The pose of the camera relative to the arm was estimated by running stereo calibration on the RGB images from the depth sensor and the RGB camera installed in the wrist of the robot's arm. The Robot Operating System (ROS) [61] was used to program and control the robot.
4.6.2 Perception pipeline

The perception pipeline consisted of two main stages: reconstruction and detection. To reconstruct the scene, multiple RGB-D frames were captured by moving the depth sensor through a sequence of predefined arm poses. The poses were chosen such that the table and the objects on it would be clearly visible to the camera. Due to the low precision of the motor encoders of the Baxter robot, as well as errors in the camera-to-robot calibration, the camera pose available through the ROS transform tree was too noisy to be used for reconstruction purposes. Instead, a purely computer vision based approach was used: SIFT feature correspondences were extracted from the RGB-D frames and solvePnP was used to find the camera poses. The pointclouds from each of the frames were then stitched together using ICP.

Next, the object pointcloud was acquired by detecting the table plane and extracting the points lying above it. The resulting pointcloud was then segmented using Euclidean clustering. A symmetry axis was then fitted to every segment using the approach described in Section 3.5.2. Finally, the parameters of the bounding cylinders were estimated for every segment for which a valid rotational symmetry was detected. Given a segment and a symmetry axis, the radius of the cylinder was found as the maximum distance between the detected symmetry axis and any point of the segment. The cylinder bounding planes were found by projecting the segment points onto the axis and finding the bounding points.

4.6.3 Grasping

Given the parameters of a bounding cylinder, a set of candidate grasps was proposed. The grasps were generated such that the end effector would envelop the object around the symmetry axis while respecting the size of the cylinder. The final grasp was chosen based on a reachability analysis. Once the desired pose of the end effector was selected, the arm motion was planned using an RRT planner.

4.6.4 Results

Figure 4.7 shows the multiple stages of the perception and grasping processes. In our experiments the robot was able to reliably pick up objects placed in different poses on the table. The perception pipeline was found to be very robust and general enough to handle multiple scenarios that it was not specifically programmed for.

Figure 4.7: Robot grasping application. (a) The robot reconstructs the scene by moving the depth sensor mounted on the left arm. (b) The reconstructed view of the table. (c) The robot grasping a bowl. (d) Cylinders fitted to the reconstruction, shown in green.

4.7 Final Conclusions and Outlook

In this thesis we have investigated the use of rotational and reflectional symmetries for delivering the perceptual constructs required for unconstrained robot object manipulation. We have presented a complete pipeline for segmenting symmetric objects in 3D reconstructions. Our pipeline uses symmetry first as an attention operator that provides seeds for the segmentation process, and then as a grouping principle to segment individual objects. This tandem use of symmetry borrows inspiration from studies of the human visual system. In our experiments we have demonstrated the ability of our approach to deliver high quality segmentations and symmetry detections of objects in extremely cluttered scenes. This confirms our hypothesis that symmetry imposes a strong constraint on the three-dimensional shape of objects, which can and should be exploited by perceptual systems. Finally, we have demonstrated how a system similar to ours can be deployed on a mobile robot.
There are several avenues for improvement of our work. We have observed that rotational symmetry, while more specialized, imposes a stronger constraint on object shape. Modeling additional "combinatorial" symmetry classes that correspond to a more limited, yet common, set of object classes could improve the performance of our system even further. An example is a set of three perpendicular planes, which corresponds to box-like objects. It would be interesting to see if the sound geometrical principles used in our pipeline can be encoded in a more advanced inference framework such as a Deep Convolutional Neural Network. This has the potential of automating some of the parameter learning steps required to fine-tune our system. Additionally, it could significantly reduce the computational load of the system, making it applicable to real-time robotics. Finally, to make our system truly autonomous we would need to automate the data collection process. Symmetry could be used to guide this process by providing cues for selecting the next viewpoint used to observe the scene. While fully unconstrained manipulation remains an elusive task, we believe that the approaches demonstrated in this thesis bring us closer to that goal.

Bibliography

[1] Matthias Sebastian Treder. Behind the looking-glass: A review on human symmetry perception. Symmetry, 2(3):1510-1543, 2010.

[2] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image and vision computing, 10(3):145-155, 1992.

[3] Menglong Zhu, Konstantinos G Derpanis, Yinfei Yang, Samarth Brahmbhatt, Mabel Zhang, Cody Phillips, Matthieu Lecce, and Kostas Daniilidis. Single image 3d object detection and pose estimation for grasping. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3936-3943. IEEE, 2014.

[4] Aitor Aldoma, Federico Tombari, Johann Prankl, Andreas Richtsfeld, Luigi Di Stefano, and Markus Vincze. Multimodal cue integration through hypotheses verification for rgb-d object recognition and 6dof pose estimation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 2104-2111. IEEE, 2013.

[5] Federico Tombari and Luigi Di Stefano. Object recognition in 3d scenes with occlusions and clutter by hough voting. In Image and Video Technology (PSIVT), 2010 Fourth Pacific-Rim Symposium on, pages 349-355. IEEE, 2010.

[6] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision-ACCV 2012, pages 548-562. Springer, 2013.

[7] Karl Pauwels and Danica Kragic. Simtrack: A simulation-based framework for scalable real-time object pose detection and tracking. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 1300-1307. IEEE, 2015.

[8] Radu Bogdan Rusu, Nico Blodow, Zoltan Csaba Marton, and Michael Beetz. Close-range scene segmentation and reconstruction of 3d point cloud maps for mobile manipulation in domestic environments. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 1-6. IEEE, 2009.

[9] Lucian Cosmin Goron, Zoltan-Csaba Marton, Gheorghe Lazea, and Michael Beetz. Robustly segmenting cylindrical and box-like objects in cluttered scenes using depth cameras. In Robotics; Proceedings of ROBOTIK 2012; 7th German Conference on, pages 1-6. VDE, 2012.
[10] Kester Duncan, Sudeep Sarkar, Redwan Alqasemi, and Rajiv Dubey. Multi-scale superquadric fitting for efficient shape and pose recovery of unknown objects. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4238-4243. IEEE, 2013.

[11] Ana Huamán Quispe, Benoît Milville, Marco A Gutiérrez, Can Erdogan, Mike Stilman, Henrik Christensen, and Heni Ben Amor. Exploiting symmetries and extrusions for grasping household objects. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 3702-3708. IEEE, 2015.

[12] Franc Solina and Ruzena Bajcsy. Recovery of parametric models from range images: The case for superquadrics with global deformations. IEEE transactions on pattern analysis and machine intelligence, 12(2):131-147, 1990.

[13] Juan D Delius and Brigitte Nowak. Visual symmetry recognition by pigeons. Psychological research, 44(3):199-212, 1982.

[14] Martin Giurfa, Birgit Eichmann, and Randolf Menzel. Symmetry perception in an insect. Nature, 382(6590):458, 1996.

[15] Jon Driver, Gordon C Baylis, and Robert D Rafal. Preserved figure-ground segregation and symmetry perception in visual neglect. Nature, 360(6399):73-75, 1992.

[16] Marco Bertamini, Jay D Friedenberg, and Michael Kubovy. Detection of symmetry and perceptual organization: The way a lock-and-key process works. Acta psychologica, 95(2):119-140, 1997.

[17] Gert Kootstra and Lambert RB Schomaker. Prediction of human eye fixations using symmetry. In The 31st Annual Conference of the Cognitive Science Society (CogSci09), pages 56-61. Cognitive Science Society, 2009.

[18] Jeannette Bohg, Matthew Johnson-Roberson, Beatriz León, Javier Felip, Xavi Gratal, N Bergstrom, Danica Kragic, and Antonio Morales. Mind the gap - robotic grasping under incomplete observation. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 686-693. IEEE, 2011.

[19] Yiannis Aloimonos. Active perception. Psychology Press, 2013.

[20] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137-154, 1992.

[21] Raúl Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.

[22] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320-2327. IEEE, 2011.

[23] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127-136. IEEE, 2011.

[24] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303-312. ACM, 1996.

[25] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM siggraph computer graphics, volume 21, pages 163-169. ACM, 1987.

[26] Cuda helper library for vision. https://github.com/arpg/Kangaroo, (accessed July 5, 2017).

[27] Jan Smisek, Michal Jancosek, and Tomas Pajdla. 3d with kinect.
In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1154-1160. IEEE, 2011.

[28] Alex Teichman, Stephen Miller, and Sebastian Thrun. Unsupervised intrinsic calibration of depth sensors via SLAM. In Robotics: Science and Systems, 2013.

[29] Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.

[30] An efficient probabilistic 3d mapping framework based on octrees. https://github.com/OctoMap/octomap, (accessed July 5, 2017).

[31] Ehud Barnea and Ohad Ben-Shahar. Depth based object detection from partial pose estimation of symmetric objects. In European Conference on Computer Vision, pages 377-390. Springer, 2014.

[32] Joshua Podolak, Philip Shilane, Aleksey Golovinskiy, Szymon Rusinkiewicz, and Thomas Funkhouser. A planar-reflective symmetry transform for 3d shapes. ACM Transactions on Graphics (TOG), 25(3):549-559, 2006.

[33] Pablo Speciale, Martin R Oswald, Andrea Cohen, and Marc Pollefeys. A symmetry prior for convex variational 3d reconstruction. In European Conference on Computer Vision, pages 313-328. Springer, 2016.

[34] Daniel Reisfeld, Haim Wolfson, and Yehezkel Yeshurun. Context-free attentional operators: the generalized symmetry transform. International Journal of Computer Vision, 14(2):119-130, 1995.

[35] Gareth Loy and Jan-Olof Eklundh. Detecting symmetry and symmetric constellations of features. In Computer Vision-ECCV 2006, pages 508-521. Springer, 2006.

[36] H. Fu, X. Cao, Z. Tu, and D. Lin. Symmetry constraint for foreground extraction. IEEE Trans. on Systems, Man and Cybernetics, 44(5), 2014.

[37] Tammy Riklin-Raviv, Nahum Kiryati, and Nir Sochen. Segmentation by level sets and symmetry. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 1015-1022. IEEE, 2006.

[38] Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Detection and segmentation of 2d curved reflection symmetric structures. In Proceedings of the IEEE International Conference on Computer Vision, pages 1644-1652, 2015.

[39] Changming Sun and Jamie Sherrah. 3d symmetry detection using the extended gaussian image. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(2):164-168, 1997.

[40] Niloy J Mitra, Leonidas J Guibas, and Mark Pauly. Partial and approximate symmetry detection for 3d geometry. ACM Transactions on Graphics (TOG), 25(3):560-568, 2006.

[41] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1824-1831. IEEE, 2005.

[42] Kris Demarsin, Denis Vanderstraeten, Tim Volodine, and Dirk Roose. Detection of closed sharp edges in point clouds using normal estimation and graph theory. Computer-Aided Design, 39(4):276-283, 2007.

[43] Jeremie Papon, Alexey Abramov, Markus Schoeler, and Florentin Wörgötter. Voxel cloud connectivity segmentation - supervoxels for point clouds. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2027-2034. IEEE, 2013.

[44] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603-619, 2002.

[45] Tahir Rabbani, Frank van den Heuvel, and G. Vosselman. Segmentation of point clouds using smoothness constraint. pages 248-253, 2006.

[46] Szymon Rusinkiewicz and Marc Levoy.
Efficient variants of the icp algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145-152. IEEE, 2001.

[47] Jingchen Liu, George Slota, Gang Zheng, Zhaohui Wu, Minwoo Park, Seungkyu Lee, Ingmar Rauschert, and Yanxi Liu. Symmetry detection from realworld images competition 2013: Summary and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2013.

[48] Max Wertheimer. Laws of organization in perceptual forms. 1938.

[49] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, 2004.

[50] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888-905, 2000.

[51] Yuri Y Boykov and M-P Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 1, pages 105-112. IEEE, 2001.

[52] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309-314, 2004.

[53] Ajay Mishra, Yiannis Aloimonos, and Cornelia Fermüller. Active segmentation for robotics. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 3133-3139. IEEE, 2009.

[54] Andreas Richtsfeld, Thomas Mörwald, Johann Prankl, Michael Zillich, and Markus Vincze. Segmentation of unknown objects in indoor environments. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4791-4796. IEEE, 2012.

[55] Simon Christoph Stein, Markus Schoeler, Jeremie Papon, and Florentin Wörgötter. Object partitioning using local convexity. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 304-311. IEEE, 2014.

[56] Andrej Karpathy, Stephen Miller, and Li Fei-Fei. Object discovery in 3d scenes via shape analysis. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 2088-2095. IEEE, 2013.

[57] Bo Zheng, Yibiao Zhao, Joey C Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Beyond point clouds: Scene understanding by reasoning geometry and physics. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3127-3134. IEEE, 2013.

[58] Gert Kootstra, N Bergstrom, and Danica Kragic. Fast and automatic detection and segmentation of unknown objects. In Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference on, pages 442-447. IEEE, 2010.

[59] Ekaterina Potapova, Karthik Mahesh Varadarajan, Andreas Richtsfeld, Michael Zillich, and Markus Vincze. Attention-driven object detection and segmentation of cluttered table scenes using 2.5d symmetry. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 4946-4952. IEEE, 2014.

[60] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(9):1124-1137, 2004.

[61] Morgan Quigley, Josh Faust, Tully Foote, and Jeremy Leibs. Ros: an open-source robot operating system.