ABSTRACT

Title of dissertation: COMPUTATIONAL MID-LEVEL VISION: FROM BORDER OWNERSHIP TO CATEGORICAL OBJECT RECOGNITION

Ching Lik Teo, Doctor of Philosophy, 2015

Dissertation directed by: Professor Yiannis Aloimonos, Department of Computer Science

Since it was proposed in 1890 by Christian von Ehrenfels, Gestalt psychology has remained a key school of thought that explains how one perceives the world ("the whole") from the sum of its individual components ("the parts") or processes. These processes are aptly summarized in the well-known "Rules of Gestalt". In spite of its influence in other fields, the empirical nature of Gestalt rules impedes their widespread adoption in Computer Science. This thesis serves to bridge this apparent divide by making Mid-level Vision, or Computer Vision based on Gestalt rules, not only computationally feasible but also practical for real applications. We address the general problem of figure-ground organization, where the goal is to separate the foreground (or object) from the background. To do this, we first formulate a fast approach that pairs Structured Random Forests (SRFs) with Gestalt-like features for both boundary detection and border ownership assignment. We then show how border ownership information is useful for shape-based recognition of object categories. This is done by embedding ownership information into the image torque, a grouping operator that detects closure patterns in image edges, so that we can modulate the operator in an efficient manner for detecting class-specific contours under clutter and occlusion. Next, we show how symmetry, an important shape-based regularity in Gestalt psychology, can be detected in clutter and used for guiding the segmentation of symmetric foreground regions. Besides shape and symmetry, functionality is another important mid-level cue that supports categorical object recognition.
Based on Gibson's principle of affordance, we introduce a fast technique based on an SRF trained with geometric features that provides pixel-accurate affordances of tool parts. Finally, we describe as future work how language can be exploited to "activate" such mid-level processes so that a joint semantic space can be obtained for linking visual concepts to language to solve even more challenging problems in Computer Vision, effectively reducing the so-called "semantic gap" between these two related domains.

COMPUTATIONAL MID-LEVEL VISION: FROM BORDER OWNERSHIP TO CATEGORICAL OBJECT RECOGNITION

by Ching Lik Teo

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2015

Advisory Committee:
Professor Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller, Co-Advisor
Professor David Jacobs
Professor Donald Perlis
Professor Timothy Horiuchi, Dean's representative

© Copyright by Ching Lik Teo, 2015

Acknowledgments

This thesis would not have been possible without the generous guidance and support of my advisors, Prof. Yiannis Aloimonos and Dr. Cornelia Fermüller, during my 5 years with the Computer Vision Lab. I am grateful for the opportunity and the time spent discussing these works (and more), many of which came to form the basis of this dissertation. I would like to thank my colleagues as well: F. Barranco, A. Ecins, A. Myers and Y. Yang, for many thoughtful discussions, encouragement and fun times that made life in the lab more interesting and memorable. Finally, I am grateful for the sacrifices and support of my wife, my daughter and family through all the weekends and nights spent working for another paper deadline or demo. This thesis is dedicated to you all. There are many other people whom I cannot fully list who have influenced me during my time here in Maryland. To them all, thank you.
Table of Contents

List of Tables
List of Figures
1 Introduction
  1.1 Figure-ground Organization and Bridging the Semantic Gap
  1.2 Mid-Level Vision for Figure-ground Organization
    1.2.1 Biological Motivations: Psychological and Neurological Evidence
    1.2.2 Computational Motivations: Gestalt in Computer Vision
  1.3 Other Related Works
    1.3.1 Scene Understanding
    1.3.2 Vision and Language
  1.4 Contributions of this Thesis
2 Assigning Border Ownership in 2D Images
  2.1 Introduction
  2.2 Related Works
  2.3 Approach
    2.3.1 Border ownership cues
      2.3.1.1 HoG-like descriptors
      2.3.1.2 Extremal edges from PCA of contour tokens
      2.3.1.3 Gestalt-like grouping features
    2.3.2 Border ownership assignment via SRF
  2.4 Experiments
    2.4.1 Datasets, baselines and evaluation procedure
    2.4.2 Comparing spectral components
    2.4.3 Results
  2.5 Applications of Border Ownership
    2.5.1 Guiding image torque using ownership information
    2.5.2 Predicting boundaries and ownership from DVS
      2.5.2.1 Event-based features from DVS
      2.5.2.2 A sequential SRF for continuous DVS data
      2.5.2.3 Boundary and ownership results
      2.5.2.4 Discussion
  2.6 Conclusions
3 Contour-Based Categorical Object Recognition
  3.1 Introduction
  3.2 Related Work
  3.3 Approach
    3.3.1 Contour completion using image torque
    3.3.2 Torque shape context descriptor
      3.3.2.1 Robust contour fragment matching from border ownership information
      3.3.2.2 Rotational invariance via the Fast Fourier Transform
      3.3.2.3 Matching of torque shape context descriptors
    3.3.3 Object sensitive torque via multi-scale matching of supporting contours
  3.4 Experiments
    3.4.1 Evaluation over UMD Hand-Manipulation dataset
    3.4.2 Evaluation over CMU Kitchen Occlusion dataset
    3.4.3 Evaluation over ETHZ-Shapes dataset
    3.4.4 Object recognition in clutter by a mobile robot
  3.5 Conclusions
4 Detecting and Segmenting Symmetrical Regions
  4.1 Introduction
  4.2 Related Works
    4.2.1 Symmetry detection
    4.2.2 Segmenting symmetrical regions
    4.2.3 Contributions of this work
  4.3 Robust bilateral symmetry detection
    4.3.1 Symmetry Attention
      4.3.1.1 The symmetry attention map
      4.3.1.2 Fixation-based segmentation
    4.3.2 Symmetry Refinement
      4.3.2.1 1D search over orientations
      4.3.2.2 1D search over centroid locations
      4.3.2.3 Scoring the symmetry axes
  4.4 Fast curved symmetry detection via SRF
    4.4.1 Patch-based symmetry features
    4.4.2 Symmetry detection via SRF
  4.5 Symmetry-constrained segmentation using graph-cuts
  4.6 Experiments: Bilateral Symmetry Detection
    4.6.1 Datasets, baseline and evaluation procedure
    4.6.2 Results
      4.6.2.1 Performance of individual stages
      4.6.2.2 Performance comparison with baseline
    4.6.3 Discussion
      4.6.3.1 Advantages of a two stage approach
      4.6.3.2 Local features versus statistics-based detection of symmetry
  4.7 Experiments: Curved Symmetry Detection
    4.7.1 Datasets, baselines and evaluation procedures
    4.7.2 Results and discussion
      4.7.2.1 Curved symmetry accuracy over SYMMAX-300
      4.7.2.2 Curved symmetry accuracy over NY-roads
  4.8 Experiments: Bilateral Symmetry-Constrained Segmentation
    4.8.1 Datasets, baselines and evaluation procedure
    4.8.2 Results and discussion
  4.9 Experiments: Curved Symmetry-Constrained Segmentation
    4.9.1 Datasets, baselines and evaluation procedure
    4.9.2 Results and discussion
      4.9.2.1 Symmetric segmentation accuracy over SYMSEG-300, BSD-Parts and WHD
      4.9.2.2 Symmetric segmentation accuracy over NY-roads
  4.10 Conclusions
5 Object-Level Functional Category Detection
  5.1 Introduction
  5.2 Related Works
  5.3 Approach
    5.3.1 Robust geometric and shape features
      5.3.1.1 Depth features
      5.3.1.2 Surface normals (SNorm)
      5.3.1.3 Principal curvatures (PCurv)
      5.3.1.4 Shape-index and curvedness (SI+CV)
    5.3.2 SRF for affordance prediction
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Baselines
    5.4.3 Evaluation procedures
    5.4.4 Results and discussion
  5.5 Conclusions
6 Closing the Semantic Gap using Language
  6.1 Introduction
  6.2 Related Works
    6.2.1 Vision and language from the NLP and MM communities
    6.2.2 Visual attributes
  6.3 Future Research Directions
    6.3.1 Language grounding of affordance-based attributes
    6.3.2 Learning a canonical multimodal space
    6.3.3 Multimodal features from deep networks
  6.4 Final Conclusions and Outlook
A Generalizing the image torque to other patterns
B Summary of Contour-Based Categorical Object Recognition Algorithm
C Simulating Log-polar Coordinates in Cartesian Coordinates
D Separability of orientation from translation components
E Bilateral symmetry detector: supplementary information
  E.1 Implementation Details
  E.2 Description of parameters and their values
    E.2.1 Full approach [AttentionSymSegBB]
    E.2.2 Baseline [Loy-Eklundh]
  E.3 Symmetry complexity coding in the UMD Symmetry dataset
  E.4 Average Precision (AP) scores
  E.5 Running times per dataset
Bibliography

List of Tables

2.1 Border ownership prediction accuracy
2.2 Boundary prediction accuracy
2.3 Descriptions of DVS sequences used
2.4 Performance evaluation of DVS feature ablations
3.1 CMU Kitchen Occlusion dataset: detection rates
3.2 ETHZ-Shapes dataset: AP scores
3.3 ETHZ-Shapes dataset: detection rates
3.4 UMD-clutter dataset: detection rates
4.1 Parameters used in the SRF-based curved symmetry detector
4.2 Performance comparison of mean segmentation accuracy: human annotated axes
4.3 Performance comparison of mean segmentation accuracy: detected symmetry axes
5.1 Performance over the RGB-D Affordance Dataset
5.2 Ablation experiments over RGB-D affordance dataset
5.3 Results on the Cornell Grasping Dataset
E.1 Parameters for the bilateral symmetry detector
E.2 Parameters for Loy-Eklundh symmetry detector
E.3 Symmetry coding nomenclature
E.4 AP scores from variants of bilateral symmetry detector and Loy-Eklundh
E.5 AP scores of bilateral symmetry and Loy-Eklundh over UMD Symmetry dataset
E.6 Running times of bilateral symmetry detector
E.7 Running times of Loy-Eklundh

List of Figures

1.1 A typical scene understanding task
1.2 Visual illusions demonstrating Gestalt principles
1.3 Bregman's illusion (1981)
2.1 Illustrating the border ownership assignment problem
2.2 Boundary prediction and border ownership assignment using our approach
2.3 Border ownership cues used
2.4 Generalizing the image torque for different Gestalt groupings
2.5 Training a SRF for border ownership assignment
2.6 Computing the Gini impurity measure using cluster labels
2.7 Principal components from aligned grayscale patches along object boundaries
2.8 Example results from both BSDS and NYU-Depth datasets
2.9 Ownership-guided torque for object proposals
2.10 Comparing ownership-guided torque vs. standard torque
2.11 Event-based visual features
2.12 Extending a non-sequential SRF (Rns) to a sequential SRF (Rsq)
2.13 Precision-Recall of boundary prediction accuracy
2.14 Boundary and ownership predictions using Rns
2.15 Effect of different wf on Rsq's predictions
3.1 From mid-level contour grouping to object recognition
3.2 Challenges of contour-based categorical object recognition
3.3 Overview of contour-based categorical recognition approach
3.4 Why shape context is insufficient for matching contour fragments in clutter
3.5 Constructing the torque shape context
3.6 Using border ownership information for robust matching in clutter
3.7 Torque shape context: robustness against deformations
3.8 Torque shape context: partial matching in clutter and occlusions
3.9 Estimating the phase lag Og
3.10 Effects of using FFT to estimate Og on matching accuracy
3.11 Multi-scale edge matching
3.12 UMD Hand-Manipulation dataset: detection results
3.13 UMD Hand-Manipulation dataset: evaluation results
3.14 UMD Hand-Manipulation dataset (no rotational invariance): evaluation results
3.15 Limitations of using only contour information
3.16 CMU Kitchen Occlusion dataset: evaluation results
3.17 CMU Kitchen Occlusion dataset: detection results
3.18 ETHZ-Shapes dataset: P-R curves
3.19 ETHZ-Shapes dataset: DR/FPPI curves
3.20 ETHZ-Shapes dataset: detection results
3.21 UMD-clutter dataset: robotic platform and object categories
3.22 UMD-clutter dataset: DR/FPPI curves
3.23 How depth information helps in improving the image torque
3.24 UMD-clutter dataset: detection results
4.1 The symmetry attention map
4.2 Segments from symmetry attention points
4.3 Overview of the symmetry refinement step
4.4 Scoring a symmetry axis via a robust Hough-voting technique
4.5 Training a SRF for curved symmetry detection
4.6 5-way MRF for symmetry-constrained segmentation
4.7 Example symmetry-constrained segmentations
4.8 Effect of the ballooning term, Bpq
4.9 PR curves of bilateral symmetry evaluations
4.10 Example bilateral symmetry detections
4.11 Symmetry axes detected from fixation-based segments and bounding boxes
4.12 Failure cases of bilateral symmetry detection
4.13 Curved symmetry prediction accuracy
4.14 Bilateral symmetry-constrained segmentation results: human annotated axes
4.15 Bilateral symmetry-constrained segmentation results: from detected symmetry axes
4.16 Curved symmetry-constrained segmentation accuracy
4.17 Curved symmetry detection and symmetrical segmentation results
5.1 Affordance prediction of tool parts
5.2 Affordance detection using SRF
5.3 Estimating pixel accurate annotations from the Cornell Grasping Dataset
5.4 Example results of affordance detections
5.5 Grasping locations predicted by SRF
6.1 Using both linguistic and visual representations for object recognition
6.2 Why grounding attributes using affordances makes sense
6.3 Verbal descriptions for a typical object (spoon)
6.4 Soliciting verbal responses from AMT turkers
6.5 Using CCA to associate verbal and linguistic features
D.1 Separability of orientation from translation components

Chapter 1: Introduction

This thesis proposes several techniques that invoke Mid-Level Vision as a central paradigm to solve related problems in Computer Vision.
These problems can be broadly grouped into the area of figure-ground organization, where the goal is to determine, given a 2D image, which parts of the image belong to the foreground (or object) and which parts belong to the background. Numerous tasks in Computer Vision directly or indirectly use figure-ground organization, e.g. scene understanding [294], object proposals [3] and occlusion boundary detection [109]. In this chapter, we introduce the problem and explain its importance in linking high-level (semantic) information with low-level visual signals. Next, we argue for the use of mid-level vision, from both biological and computational perspectives, to solve this problem efficiently. Finally, we survey previous works in related areas and show how this thesis contributes in advancing the state of the art.

1.1 Figure-ground Organization and Bridging the Semantic Gap

Look at the images in Fig. 1.1. Input scenes are shown on the left; predictions on the right show the names (labels) assigned to individual segments. These images illustrate the process of scene understanding, that is, the assignment of semantic labels to scene parts.

Figure 1.1: A typical scene understanding task. (Left) Input images and (right) predicted results from [93] (above) and [301] (below).

Although easy for humans, this remains a difficult problem in Computer Vision as many sub-processes remain unresolved. These include (ordered in terms of increasing complexity): 1) boundary detection, 2) foreground-background segmentation, 3) object detection and recognition, 4) semantic segmentation of objects and scene entities and 5) perceptual organization. These problems are often addressed individually in Computer Vision with different approaches (and names), e.g. [8, 62, 113, 193].
Their main goal, however, is the same: to mimic the capabilities of humans in understanding and perceiving the visual world by assigning semantic labels (names) to its corresponding parts. The exact processes involved in converting visual signals to a visual percept of the world remain an open (and complex) research problem. Behavioral and neurological studies of primates, however, have indicated that the process of figure-ground organization (FGO) (processes 1-3) provides important input for high-level visual processes [43, 194, 306] (processes 4 and 5). Recent neurological findings (fMRI and EEG) have also revealed feedback connections between the visual cortex and higher brain areas involved in memory and words [57, 58, 90]. These connections are most active when participants are tasked to describe an image or a moving video, and are co-located with regions of high saliency.

These findings and observations show that the FGO problem is central for linking high-level knowledge (e.g. words, semantic labels, relationships, etc.) of objects, scenes and other entities in the world with low-level visual signals (e.g. edges, pixels, etc.) to achieve a visual percept (understanding) of the world. In other words, solving the FGO problem bridges the so-called "semantic gap" or "pixels-to-predicates" [96, 186, 289] problem, which is well known in the field of Artificial Intelligence. This is also the main underlying goal that drives and links the different works in this thesis. Central to the FGO problem is how one begins the actual process of selecting the "pixels" so that meaning or predicates can be appropriately assigned to them. In this thesis, we draw inspiration from the field of Gestalt psychology and propose to use "Mid-Level Vision" in our approaches. We introduce these ideas in the next section.
1.2 Mid-Level Vision for Figure-ground Organization

"Mid-level Vision" is a computational paradigm that exploits the use of so-called "mid-level" visual cues to guide certain visual processes. Such cues are derived or inspired from Gestalt psychology, proposed in 1890 by von Ehrenfels, which led to several seminal works and branches of thought, notably by Wertheimer (1912) and Köhler (1920). The basic idea is intuitive: from a set of Gestalt principles that combine information from low-level visual signals, we are able to perceive the world in all its complexity and nuances [135]. These principles or "rules of Gestalt" cover a wide area: 1) shape, 2) proximity, 3) motion, 4) symmetry and 5) common-fate, to name a few. Since it deals with how such cues can be derived from low-level signals to guide higher-level visual processes such as recognition or semantic understanding, the term "mid-level" vision was used by several works [25, 55, 66, 138, 209]. The FGO problem, as noted in a recent survey [285], is a central problem in Gestalt psychology that uses mid-level cues: convexity, symmetry, lower-region, extremal edges, motion synchrony, etc. In this thesis, we show further that it is possible to compute a subset of such cues efficiently via mid-level operators and to use these cues for distinguishing the ownership of an occlusion boundary. In the next two sections, we motivate the use of mid-level vision as a paradigm for solving the FGO problem from both biological and computational perspectives.

1.2.1 Biological Motivations: Psychological and Neurological Evidence

The earliest evidence for Gestalt in FGO comes from the famous "vase-face" illusion of Rubin (1915) (Fig. 1.2 (left)), where, depending on the viewer's interpretation, the percept shifts between two faces and a central vase in the image. This switch in interpretation depends on which side of the boundary is perceived as the figure (foreground) that "owns" it, and points to the existence of a neural circuitry that encodes this ownership information. This process is known in the literature as border ownership assignment (BOWN). Related to BOWN is the process known as contour continuation (CCONT), made famous by Kanizsa's illusory contours [120] that complete an otherwise incomplete shape (Fig. 1.2 (right)).

Figure 1.2: Visual illusions demonstrating Gestalt principles. (Left) Switching of the "vase-face" percept of Rubin (1915). (Right) Illusory contours of Kanizsa (1976) [120].

The importance of BOWN and CCONT for FGO is further demonstrated by Bregman (1981) [30], where a figure initially showing only some unrelated fragments changes instantly into recognizable letters with the simple addition of a dark occluding foreground (Fig. 1.3).

Figure 1.3: Bregman's illusion (1981) [30]. Indiscernible fragments in the background instantly appear as the letters 'B' with the addition of a dark foreground.

Although psychological evidence supporting BOWN and CCONT processes is plentiful [99, 130, 207, 215, 225], it was recordings from the primate visual cortex that demonstrated the existence of specific neurons encoding BOWN and CCONT. von der Heydt et al. [284] showed that primate visual cortex area V2 responds to specific illusory contour stimuli. This was followed by several models of neural circuitry [103, 223] based on the suggestion that cells in V2 are responsible for encoding occlusion boundaries, that is, regions where objects meet with one another or where objects meet with the background. The neural mechanisms of BOWN were discovered by Zhou et al. [308], who showed that V2 and V4 (and to a lesser extent V1) cells encode border ownership selectively. Very recent works showed that such cells encode not only ownership but also depth ordering [226], within a very short amount of time [262] (75 ms of the onset of the stimulus).
There are also studies that provide neurological evidence for the interplay of BOWN and CCONT in FGO, where feedback from higher visual areas guides the process of context integration [43, 117]: the combination of cues indicative of ownership (which are usually far-reaching [306]) and continuity (which are more local [102]).

1.2.2 Computational Motivations: Gestalt in Computer Vision

Several computational models for FGO, based on the biological and neurological evidence presented above, have been proposed by several authors [37, 75, 147, 200, 219, 257, 291]. One of the most recent (and complete) models is suggested by Kogo et al. [136], which uses local cues suggestive of ordinal depth to determine ownership along occlusion boundaries and allows localized spreading to simulate illusory contours. Besides these biologically motivated models, which seek to mimic the neural circuitry, we describe here other computational models that embed Gestalt principles in Computer Vision related tasks.

Most works in Computer Vision embed the Gestalt principles of proximity, good continuation and similarity by using an explicit or implicit Markovian assumption, modeling the problem as a Markov Random Field (MRF) [79] or (more recently) as a Conditional Random Field (CRF) [151]. Williams and Jacobs [291] used convexity to implement a stochastic field for contour completion, which was recently extended by [11] using Tangent Bundle Theory. The principles of closure and proximity were explored in a probabilistic contour framework by Elder et al. [59, 60]. The same principles were applied to extracting closed regions by Stahl and Wang [256], using a grouping cost that captures these two principles within a graph optimization framework, later extended to include the extraction of symmetrical regions [257]. More recently, multiple cues have been employed to extract closed contours from images [202], extended further to extract closed regions from videos [167].
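As a minimal illustration of the Markovian formulation above, the sketch below (an assumption of this text, not code from this thesis; the function name and unary encoding are hypothetical) labels pixels figure/background under a binary Potts MRF, minimized greedily with Iterated Conditional Modes (ICM). Real systems typically use graph cuts or message passing instead.

```python
import numpy as np

def icm_figure_ground(unary_fg, beta=1.0, iters=10):
    """Binary figure-ground labeling on a 4-connected pixel grid,
    modeled as an MRF with a Potts smoothness prior and minimized
    greedily with Iterated Conditional Modes (ICM).
    unary_fg[y, x] < 0 means the pixel locally prefers FIGURE."""
    h, w = unary_fg.shape
    labels = (unary_fg < 0).astype(int)   # 1 = figure, 0 = background
    for _ in range(iters):
        changed = False
        for y in range(h):
            for x in range(w):
                # Data term: background costs 0, figure costs unary_fg.
                cost = [0.0, float(unary_fg[y, x])]
                # Potts pairwise term: pay `beta` per disagreeing neighbor.
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        for lab in (0, 1):
                            if labels[ny, nx] != lab:
                                cost[lab] += beta
                best = int(np.argmin(cost))
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:   # converged
            break
    return labels
```

On a toy unary map with a negative-cost block, ICM recovers the block and smooths away isolated speckle, which is exactly the proximity/similarity prior the MRF encodes.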
Variational methods such as snakes [124] and level-sets [213] have also been employed with success in capturing salient regions of images and videos by incorporating different forms of prior information. Cootes et al. [40] introduced active shape models for representing the shape prior in terms of an average shape and basis vectors that account for shape variability. This led to several other works [31, 277] that embed this model within a variational energy functional. Instead of an average shape, Rousson and Paragios [239] proposed a novel cost function that constrains an implicit surface to evolve towards a predefined shape prior in a variational framework. Extending this, Cremers et al. [44] further showed that, by defining a binary shape functional, embedding the shape prior becomes a convex problem, which results in globally optimal solutions.

Scale-space theory [176], image pyramids [32] and CRFs have also been employed to capture longer-range contextual relationships. Henkel [104] proposed to group edges in scale-space for the purpose of segmenting coherent regions. Latecki and Lakamper [153] showed improved grouping results by matching multiscale shape fragments that correspond to foreground object parts. He et al. [100] introduced multiscale random fields to enforce segmentation labeling consistency. Following this, Cour et al. [42] introduced a novel multiscale spectral clustering technique via graph decomposition so that scale and grouping constraints are enforced to generate coherent segments. Along similar lines, Latecki et al. [154] proposed multiscale random fields with a novel combinatory append operator that enables efficient optimization for the task of detecting class-specific contours in clutter. More recently, Shotton et al. [246] showed improved recognition of object classes by training class-specific local classifiers with multiscale contour fragments via a modified chamfer matching formulation. Kohli et al.
[137] proposed a novel set of potential functions that capture even longer-range relationships in CRFs and showed improved scene segmentation results under occlusion and clutter. Finally, the recent work of Arbelaez et al. [8] combines both multiscale segmentation and object recognition within a single framework. The approach uses a multiscale spectral clustering technique to produce segmentation candidates that are then combined in an efficient manner to form reasonable object proposals. Object proposals are currently a very popular research area and we defer the discussion of such works to Chapter 2.

1.3 Other Related Works

As related works for specific tasks are discussed in their individual chapters, we focus here on works from two areas in Computer Vision: 1) Scene Understanding and 2) Vision and Language, which are not directly addressed by this thesis but are nonetheless important for solving the problem fully.

1.3.1 Scene Understanding

Scene understanding has remained one of the key open research areas in Computer Vision. Early works define "understanding" as a vision system that achieves two key tasks: 1) segmentation into coherent parts and 2) entity (object or foreground) recognition of each part and the entire scene. Numerous works have been proposed that address each task either in isolation or in combination [23, 26, 68, 93, 203, 247, 292, 297], and we briefly review here the most notable ones. Oliva and Torralba [210] defined a scene "gist" measure to broadly classify an entire image into several scene types, based on the notion of a "spatial envelope" or an "attribute" of the scene. Along similar lines, using the recently introduced "SUN attributes" dataset, Patterson et al. [222] analyzed and introduced attribute descriptors for scene-based classification. More recently, Zhou et al. [307] used state-of-the-art deep Convolutional Neural Networks (CNNs) to learn attribute-based features for scenes.
Most other works focused on accurate segmentation and labeling of scene entities. Sali and Ullman [242] combined class-specific contour fragments using a model-based approach to detect and segment target objects. Similarly, Kumar et al. [149] combined state-of-the-art pictorial structures with MRFs so that object-specific segments are extracted from the image. Levin and Weiss [165] extended this work by combining top-down object-specific contours to produce object-specific bottom-up segmentations, training a CRF that takes into account both contour (high-level) and edge (low-level) information. Cao and Li [33] proposed a generative model that learns latent topic models of multiscale object patches, enabling simultaneous detection and segmentation during inference. More recently, using a textual model, Li et al. [168] demonstrated a novel generative model that is capable of performing simultaneous segmentation, classification and captioning of unseen images with textual tags (see §1.3.2 for other works that combine language and vision). By embedding responses of pre-trained object detectors as additional potentials into a CRF, Ladicky et al. [150] improved the final scene segmentation compared to using low-level information alone. Yao et al. [301] extended this work by embedding a shape prior potential over superpixel segments to further improve the segmentation results.

Recognizing the scene only by labeling its segmented entities, however, is only a small part of scene understanding [295]. Understanding how these entities relate with each other so that the result is useful, e.g. for an active agent navigating the scene, is an important next step. Several recent works in scene understanding have pursued this research direction, attempting to predict a 3D room layout from 2D images [92,101,303]. Other works view the problem as generating an image "parse", similar to approaches in natural language processing (NLP), which we discuss next.
1.3.2 Vision and Language

Although language and vision are completely different modalities, they often encode complementary information that differs only in terms of its semantic content. Language, or text, is often used to describe non-visual and, to a lesser extent, visual entities in an image or video. A key research problem is how one can leverage language to improve visual processing (and, in NLP, how to use vision to improve linguistic processing). Works in the Computer Vision community mainly focus on using language as a form of contextual information that reduces ambiguous/uncertain information from visual inputs. [84] use high-level contextual knowledge of objects in scenes (position, color, etc.) to induce a visual saliency map that represents the most likely locations of objects in the scene. This work uses high-level information of the target to directly influence (via learned parameters) the weights of the saliency algorithm so that the target is detected more accurately than with bottom-up methods. In one of the earliest works, [56] showed how nouns can provide constraints that improve image segmentation. This is done by imposing constraints on the nouns (objects) that are likely to co-exist: e.g. sky and plane are more likely to exist together than water and bus. [91] extended this work with the addition of prepositions to enforce spatial constraints in recognizing objects from segmented images. The work of [4] addresses the object search problem in clutter by encoding predefined relationships on likely occurring locations, object co-occurrences, and visual and shape cues into a graphical model that guides a mobile robot in selecting the next place to search. The input is a 2.5D Kinect point cloud, and they used a max-margin learning approach over several object classes in two different kinds of environments: office and home.
Another interesting work that uses contextual information comes from [160], where they create an "object-graph" that encodes relationships of known object categories together with unfamiliar/unknown categories. The goal is to recognize, in a weak sense, via the similarity of the graphs, the common categories of such unknown objects and perhaps assign a more descriptive label to them.

There have also been several works that integrate linguistic information for the purpose of describing visual scenes/images using natural text, which can also be viewed as a manifestation of the scene understanding process. [13] processed news captions to discover names associated with faces in the images, and [119] extended this work to associate poses detected in images with the verbs in the captions. Both approaches use annotated examples from a limited news caption corpus to learn a joint image-text model so that one can easily annotate new unknown images with textual information. Tu et al. [279] view the task of scene understanding as analogous to parsing a sentence in NLP, except that the grammar and entities are visual. More recently, [300] proposed an "image-to-text" parser that combines noisy detections from visual detectors using learned rules to generate a reasonable textual description of the scene. Along the same lines, [148] constructs a model of an image parse, combining objects, their attributes (e.g. color, texture) and spatial relationships into a CRF. Inference over the CRF yields the most likely combination of these components so that a reasonable descriptive paragraph of the image is generated. The work of [65] attempts to "generate" sentences by first learning from a set of human-annotated examples, producing a learned sentence for a new image when image and sentence share common properties in terms of their (Noun-Verb-Scene) triplets.
In another work, [253] views the problem of parsing an image containing superpixel segmentations within the same framework as parsing a parallel textual description of the image, and proposes a recursive neural network (RNN) to model the key components that make up the image and text. By training the network using a structured max-margin learning approach, the model is able to optimally parse both images and text for segmentation in both domains. Very recently, Karpathy and Li [122] introduced a multimodal Recurrent Neural Network that generates a sentence by conditioning the network on objects detected via a separately trained CNN over the input image.

1.4 Contributions of this Thesis

Given the breadth, scope and complexity of the FGO problem, this thesis focuses on a smaller subset of related problems that we believe provide important contributions to a final complete solution. Unlike previous works (§1.2.2) that use Gestalt as a high-level prior containing contextual information, we are motivated from a more biological perspective (§1.2.1) that uses mid-level vision to extract and organize low-level visual signals before feeding them to higher-level visual areas (e.g. TE, TEO) [146]. Specifically, we propose efficient computational methods that detect Gestalt cues or representations in real images and demonstrate their usefulness in higher-level visual tasks, e.g. recognition and segmentation.

Chapter 2 introduces a computationally efficient method for border ownership assignment, which is a central subproblem in FGO. Our approach leverages a state-of-the-art classifier termed the Structured Random Forest (SRF) [140], trained over local and global ownership cues, to predict both boundaries and ownership in real-time.
We demonstrate the usefulness of detecting border ownership in two areas: 1) extracting foreground object locations using a mid-level grouping operator, termed the image torque [209], that is biased towards closure, and 2) layered segmentation of a scene from an event-based camera that mimics the human retina, known as the Dynamic Vision Sensor (DVS) [173].

Next, in Chapter 3, we use border ownership information for enhancing categorical recognition of objects/parts that share common contour fragments. This is achieved by embedding a novel shape-based descriptor with ownership information, followed by modulating the image torque so that it becomes sensitive to the target contours. Compared to other approaches, we show the advantage of using a mid-level approach for this task in terms of handling clutter, occlusion and noisy contours.

In Chapter 4, we detect symmetry, specifically reflection and curved reflection symmetries, and use them to extract symmetrical regions in real images. For reflection (bilateral) symmetry detection, we propose to detect, using a fast histogram comparison of local edges, potential symmetry attention points from which we extract object-centric segments for a more detailed localization of the symmetry axis. For curved reflection symmetry detection, we propose a fast approach by training a SRF classifier sensitive to local symmetry cues. Using the detected symmetries, we embed a symmetry prior into a Markov Random Field (MRF) representation of the image edges so that symmetrical regions can be extracted via graph-cuts [27].

We describe in Chapter 5 a fast approach for detecting part-based functionality, or affordances [82], of tools from local geometric features. Affordances can be seen as an innate object-level attribute that generalizes recognition to larger classes of objects (even unseen ones). For this work, we train a SRF paired with such features that provides pixel-accurate predictions of the target affordance.
As was noted earlier in §1.1, the purpose of this thesis is to provide solutions that ultimately bridge the semantic gap. In Chapter 6, we conclude this thesis by suggesting potential research directions that exploit language together with the approaches presented in the preceding chapters. Specifically, we present ideas that link language and mid-level visual representations in a common canonical space so that the appropriate mid-level concepts, e.g. ownership, symmetry, affordances, can be activated via linguistic cues.

Chapter 2: Assigning Border Ownership in 2D images

Figure 2.1: Illustrating the border ownership assignment problem. (Left) Input image and (right) boundaries and border ownership of foreground (red) and background (blue) regions.

In this chapter, we propose a fast solution for border ownership assignment (BOWN)¹. That is, given the input boundaries (places where objects meet each other or the background), we determine which "side" of the boundary belongs to the foreground (object) and which side belongs to the background (Fig. 2.1).

¹This work was published in [268]. Full results, code and videos are available online: http://www.umiacs.umd.edu/~cteo/BOWN_SRF/

As was noted in Chapter 1, cells sensitive to BOWN have been discovered in areas V2 and V4 [308], and they fire within a very short interval [262]. From a computational perspective, BOWN is one of the key mid-level processes for solving the figure-ground organization (FGO) problem in that it provides important ordinal depth information. It can be regarded as a preprocessing step for higher-level tasks such as foreground-background segmentation [238, 283], semantic segmentation [93] and object proposals [34], and is also closely related to selective attention [43]. In spite of this crucial role, BOWN has remained largely ignored by the computational vision community [146], with only two recent works, Ren et al.
[231] and Leichter and Lindenbaum [162], proposing computational approaches that address this problem. Unlike these two works, which first detect boundaries and then assign ownership in a separate step, our approach predicts both boundaries and ownership directly from the input RGB image. In addition to state-of-the-art ownership predictions compared to [162,231] over two datasets, BSDS [193] and NYU Depth V2 [206], our method runs in real-time: ≈0.1s for a 320×240 image compared to 15s in [162].

2.1 Introduction

Look at the two images in Fig. 2.2 with highlighted boundaries on the right. These are regions in the image where objects meet with one another or with the background. Humans are able to interpret complex scenes such as these and predict their approximate depth orderings with relative ease by integrating both bottom-up and top-down cues. In recent years, so-called boundary detectors have become very popular tools. These detectors use local cues, such as brightness, color, texture, gradients and simple features [193] in image patches to distinguish edge points likely to lie at boundaries of surfaces from others. More recent approaches also include globalization processes using long-range relations of image points [6].

Figure 2.2: Example results of predicted boundaries (blue) and their ownership (red: foreground, yellow: background) from real-world images: BSDS (above) and NYU Depth V2 (below).

However, the image structure in the regions next to an occlusion edge can be used for more than boundary indication; it also encodes information about the relative depth of the edge's two adjacent regions, and about which of the regions the edge belongs to. It has been shown that image cues such as the convexity of the edge [121], the edge junctions, contrast, and the gradient in the intensity and the texture carry this information [218].
In this work, we focus on detecting classes of bottom-up cues that indicate border ownership in 2D images, an important mid-level process for solving the FGO problem discussed in Chapter 1. Fig. 2.2 shows example predictions using our proposed approach with their accuracy scores over two popular datasets: the Berkeley Segmentation Dataset (BSDS) and the NYU Depth V2 (NYU-Depth) [193, 206]. The prediction accuracy is not only state-of-the-art but also outperforms previous approaches [162, 231]. Our method exploits two novel features derived from findings in human psychophysics to determine the ownership of a boundary. The first one, known as extremal edges or image folds [133], captures how changes in the shading of pixels near real boundaries differ between foreground and background. It was shown in [228] that such folds exist in a variety of environments.

The second feature detects Gestalt-like groupings of mid-level cues. Specifically, we introduce a new multi-scale grouping mechanism that implements the concept of contour closure, as well as common patterns such as radial and spiral textures. Since such patterns occur naturally in images, we expect the differences in the distribution of these patterns to be indicative of border ownership. Finally, by embedding these features within a Structured Random Forest (SRF), we are able to predict border ownership in real-time: ≈0.1s for a 320×240 image. Notably, our method predicts both boundary and ownership together in a single step. Compared to previous works that considered border ownership determination as a separate step independent of boundary detection, our single-step approach is not only faster but also more accurate.

2.2 Related Works

Determining border ownership accurately in images is related to works in computer vision from two different areas: 1) depth ordering prediction and 2) object proposals. We briefly review each area in relation to the current work.
Depth ordering prediction. Perceiving ordinal depth from 2D images has been tackled as early as the classical "Blocks World" of Roberts [237]. Hoiem et al. [109] revisited the problem by combining numerous local and global cues (color, gradients, junctions, textures, sky above ground, etc.) into a large conditional random field (CRF) for recovering occlusion boundaries and depth ordering in a 2D image. The CRF weights were obtained from training data to ensure consistency of depth across different segments, which were merged in an iterative process from an initial over-segmentation. Along similar lines, Saxena et al. [244] imposed simple geometric constraints to estimate plane parameters related to the 3D location and orientation of each image patch to create a 3D pop-out of the image. Ren et al. [231] considered local convexity and junction cues and integrated them into a CRF to predict border ownership on Pb boundaries [193]. Leichter and Lindenbaum [162] followed up by computing distributions of ownership cues in ordinal depth (parallelity, image folds, lower-region, etc.) over curves, T-junctions and image segments. Stein and Hebert [261] further imposed motion constraints to detect occlusion boundaries consistently across video frames.

Object proposals. A recent trend in computer vision is to detect object-like foreground regions in an image. Early works [3, 61] combined several "objectness" cues to train detectors. However, the applicability of such methods is limited, as cue detection and integration is computationally expensive. Recently, Cheng et al. [34] introduced a surprisingly simple technique using binarized gradient norms of images that is able to produce high quality proposals at a fraction of the time of previous methods. The Gestalt concept of closure has been exploited by Nishigaki et al. [209, 296] in detecting object-like regions via a mid-level grouping operator termed the "image torque".
Similarly, using a SRF-based structured edge (SE) detector [51], Zitnick et al. [310] count the number of contours that enter and exit a bounding box region to determine if there is enough closure within the proposed region.

Although many of these works have considered the border ownership problem implicitly in their problem formulations, it is often treated as an independent pixel-wise classification step over predicted input boundaries [162,231] or segmentations [109,261]. In order to ensure prediction consistency over larger scales, CRFs are often used at the expense of computation time. Our approach, by contrast, considers border ownership and boundary detection within a single SRF, where consistency over multiple scales is enforced using structured output labels. Our approach is therefore self-contained: we predict both boundaries and border ownership in one single step, unlike previous approaches that require further optimization using a CRF. Consequently, our approach allows us to predict border ownership in real-time.

2.3 Approach

Our approach to determining border ownership via SRFs consists of two key components: 1) features derived from ownership cues and 2) imposing border ownership structure in the SRF. We describe these two components in the sections that follow.

2.3.1 Border ownership cues

We use some local cues reported in prior works [52, 81, 162, 216, 218, 231] that were shown to be important in determining border ownership, together with some new cues. Specifically, we use: 1) shape (convexity/concavity), 2) image folds or extremal edges derived from spectral properties of boundary patches and 3) Gestalt-like grouping features. In addition, our choice of features was influenced by how efficiently we can extract them from local patches.
2.3.1.1 HoG-like descriptors

As reported in several previous works, shape cues such as local convexity and concavity of contours are important features that are indicative of foreground objects: the foreground ownership of a boundary tends to be on the concave side [216]. To capture this cue within a local patch, we construct a HoG-like descriptor [47] of image gradients, where we quantize the gradient directions into 4 orientation bins. In addition, we use the gradient magnitude as an indicator of good boundary localization. The HoG-like descriptor of gradient orientations captures roughly the local shape of the patch, while its magnitude tells us how likely this patch is to contain a real boundary. Notably, as shown in Fig. 2.3 (A), we see that the histograms for typical convex and concave patches are different. For efficiency, we compute these features in terms of "channels" [50] per image patch. Given a patch P of size N×N, this results in an N²×5 feature vector per patch.

Figure 2.3: Border ownership cues used. (Top) Input image and annotations (red: foreground, blue: background) with example patches boxed. (Below) (A) Local shape (HoG + gradient magnitude) showing four discrete orientations, (B) Spectral features derived via PCA from 20 oriented token clusters (foreground at lower half) and their principal components with extremal edge cues in PC2 (boxed) and (C) Gestalt-like grouping target patterns: closure, radial, spiral and hyperbolic. (D) Corresponding responses at one scale for each of the features. See text for details.

2.3.1.2 Extremal edges from PCA of contour tokens

Extremal edges, or image folds, have been known for some time as one of the strongest border ownership cues [52, 81]. Huggins et al. [115] have shown that extremal edges can be reliably detected by computing the so-called shadow flow field in controlled environments.
Recently, [228] showed that extremal edges exist in natural images by performing a principal component analysis (PCA) of aligned oriented boundary image patches. Their key insight is that extremal edges account, after step edges, for most of the gray-level illumination variance at such regions. Motivated by this insight, we derive basis functions using PCA oriented along so-called contour fragments or Sketch Tokens [174], which are similar to shapemes [231], as shown in Fig. 2.3 (B). Since each contour token has an orientation determined by its foreground and background labels, we first orient all patches so that the background and foreground occupy the top and bottom halves of the patch (using the center pixel as a reference) respectively. Clustering these oriented tokens produces a set of C token centers, to each of which we then apply PCA over the S supporting patches, Pc = {P1, ..., PS}, c ∈ {1, ..., C}. By applying PCA over each Pc, we learn a separate orthonormal basis corresponding to each token center. Specifically, given the N²×S data matrix X that contains in each column a vectorized (and de-meaned) patch from Pc, we apply Singular Value Decomposition to its covariance matrix Σ_X to obtain a set of orthonormal basis vectors spanned by the eigenvectors (columns) of U:

Σ_X = U S U⁻¹    (2.1)

where we keep the top K eigenvectors u_k ∈ U, corresponding to the top K eigenvalues in S, to obtain the projection matrix Wc = [u1, ..., uK]. Wc represents a new basis that accounts for most of the variance per contour token center. As features, we reproject X to obtain Y = Wcᵀ X of size K×S, the coordinates of each patch of Pc in the new basis. This yields a feature vector of dimension N²×K. We show in Fig. 2.3 (D-middle) the spectral features derived from the first four principal components (PC).
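To make the projection step concrete, the following is a minimal numpy sketch of eq. (2.1) for a single token cluster: the S supporting patches are vectorized and de-meaned into an N²×S matrix X, an orthonormal basis is recovered from the covariance Σ_X, and the top-K eigenvectors form the projection Wc. The function name and array layout are illustrative, not taken from the released implementation.

```python
import numpy as np

def token_pca(patches, K=4):
    """Sketch of eq. (2.1) for one token cluster: learn the basis W_c
    from S aligned N x N patches, then project the patches onto it.
    `patches` is an S x N x N array; all names are illustrative."""
    S = patches.shape[0]
    X = patches.reshape(S, -1).T.astype(float)   # N^2 x S data matrix
    X -= X.mean(axis=1, keepdims=True)           # de-mean each pixel row
    cov = (X @ X.T) / S                          # covariance Sigma_X
    U, _, _ = np.linalg.svd(cov)                 # Sigma_X = U S U^-1 (U orthonormal)
    W_c = U[:, :K]                               # top-K eigenvectors as columns
    Y = W_c.T @ X                                # K x S coordinates in the new basis
    return W_c, Y
```

Applying the learned basis to a test patch in the same way yields K responses per pixel neighborhood, which stack into the N²×K spectral feature described above.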
Of note are the responses for PC2-PC4, which exhibit a large response only along real boundaries, with positive values encoding foreground ownership and negative values encoding background ownership. In §2.4.2, we show further that PC2 exhibits the characteristics of extremal edges.

2.3.1.3 Gestalt-like grouping features

Gestalt psychologists have developed a set of well-known rules of "Gestalt" that suggest how humans perceive the world from 2D images. Gestalt rules deal with groupings of low-level features (e.g. edges), and can be regarded as a form of mid-level cue that captures the holistic properties of individual visual parts. These properties can then be used to organize these visual parts into more meaningful entities that serve as input to higher-level processes, e.g. segmentation, recognition etc. In this work, we leverage specific grouping patterns: 1) closure, 2) radial, 3) spiral and 4) hyperbolic (Fig. 2.3 (C)). Such patterns are useful for border ownership determination because foreground objects tend to exhibit different grouping patterns compared to the background [217], and such patterns have been observed in area V4 of macaques [78].

Figure 2.4: Generalizing the image torque for different Gestalt groupings. (Top) By rewriting τ(x,y) in terms of a scalar product, we are able to generalize the image torque so that it becomes sensitive to: A) radial, B) spiral and C) hyperbolic patterns. (Bottom) Test toy image with different target patterns and their maximum responses over different scales. Notice the selective nature for each target pattern.

Closure, one of the strongest cues used in foreground object proposal tasks, is detected in this work by computing the image "torque" [209], τP, associated with each patch (Fig. 2.4 (Top-left)).
The image torque is so termed because it is analogous to the torque formulation known from physics: the cross-product between a tangential "force" vector F_q and its corresponding displacement vector d_pq, where p denotes the center pixel of P and q an edge pixel in P. The image torque for each edge point q is thus defined as τ_pq = F_q × d_pq. Summing over all q ∈ P and normalizing with the patch size yields τP:

τP = (1 / 2|N|) Σ_{q∈P} τ_pq = (1 / 2|N|) Σ_{q∈P} (F_q × d_pq)    (2.2)

In practice, we search over several scales s ∈ {5, 6, ..., N} within P and retain the maximum torque response over all scales.

An alternative derivation of τP is to view the detection of closure patterns as detecting iso-contours corresponding to circles in the image. In general, we consider the patterns we want to detect as the iso-contours of some function f. For example, circles are the iso-contours of the function f(x, y) = x² + y². We are interested in the tangent lines of these iso-contours, g(x, y). Given the 2D gradient field ∇f(x, y) = (fx, fy), the corresponding tangent vectors perpendicular to the gradient field are g(x, y) = (−fy, fx). From the iso-contour equation of circles, it follows that the closure tangent vectors are g(x, y) = (−y, x). Given an input test patch P, we first determine its gradient field, denoted ∇P(x, y) = (Px, Py), (x, y) ∈ P, and the edges (tangent vectors) as E(x, y) = (−Py, Px). If a closure pattern exists in E(x, y), then the edges must align well with the tangent vectors g(x, y). A simple measure of alignment for a point (x, y) ∈ P is the scalar product between E(x, y) and g(x, y):

τ(x,y) = E(x, y) · g(x, y) = (−Py, Px) · (−y, x)    (2.3)

which is equivalent to the definition of τ_pq for point q. Replacing τ_pq in eq. (2.2) with eq. (2.3) yields exactly the same result. The key insight from eq. (2.3) is that we can now modify g(x, y) so that eq. (2.3) becomes sensitive to different patterns in the image.
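The scalar-product view of eqs. (2.2) and (2.3) can be sketched as follows. The closure field g(x, y) = (−y, x) is taken from the text; the radial and hyperbolic fields are our own illustrative derivations from their iso-contour equations (the spiral case is omitted), and the normalization treats |N| in eq. (2.2) as the patch area, which is an assumption.

```python
import numpy as np

def pattern_alignment(patch, pattern="closure"):
    """Sketch of eq. (2.3) summed as in eq. (2.2): scalar product between
    the patch's edge (tangent) field E = (-Py, Px) and a target tangent
    field g(x, y). Pattern fields other than closure are illustrative."""
    n = patch.shape[0]
    Py, Px = np.gradient(patch.astype(float))    # dI/dy, dI/dx
    E = np.stack([-Py, Px], axis=-1)             # edge/tangent vectors
    yy, xx = np.mgrid[:n, :n] - (n - 1) / 2.0    # patch-centered coordinates
    fields = {
        "closure":    np.stack([-yy, xx], axis=-1),  # from f = x^2 + y^2
        "radial":     np.stack([xx, yy], axis=-1),   # rays out of the center
        "hyperbolic": np.stack([yy, xx], axis=-1),   # from f = x^2 - y^2
    }
    tau = (E * fields[pattern]).sum(axis=-1)     # per-pixel alignment, eq. (2.3)
    # normalization as in eq. (2.2), with |N| taken as the patch area
    return float(tau.sum()) / (2.0 * n * n)
```

On a bright disk the closure response dominates in magnitude, while the hyperbolic response cancels by symmetry, illustrating the selectivity shown in Fig. 2.4 (Bottom).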
As we show in Appendix A, by writing down different target iso-contour equations, we are able to detect different Gestalt patterns using the same formulation. We show some sample responses using different g(x, y) in Fig. 2.4 (Bottom) for four patterns: closure, radial, spiral and hyperbolic. For efficiency, we have implemented eq. (2.2) as a convolution operation, so that the responses can be used directly as features of size N²×4 for training the SRF. Additionally, the responses of the Gestalt features for an example input image are shown in Fig. 2.3 (D-below). We note that because the background (e.g. sky) tends to be textureless, all the features have a small response there. Notably, we observe that the strongest response occurs for the spiral pattern, which is localized in the forested foreground region.

2.3.2 Border ownership assignment via SRF

We use an extension of the Random Forest (RF) classifier [107], termed the Structured Random Forest (SRF). Similar to the RF, a SRF is an ensemble learning technique that combines t decision trees, (T1, ..., Tt), trained over random permutations of the data to prevent overfitting. The key difference is that, in general, SRFs are able to learn a mapping between inputs of arbitrary complexity (e.g. strings, segmentations, relationships etc.) and similarly complex outputs. Due to their flexibility in representation, SRFs have been used successfully in a variety of computer vision tasks such as boundary detection [51] and semantic scene segmentation [140]. See [45] for a comprehensive review of RFs and their applications. In this work, we show that a SRF can be used as a border ownership classifier by imposing a spatial border ownership structure on the output labels (Fig. 2.5). Similar to [51], we assume that only the target output labels are structured (borders with ownership labels), while the inputs are non-structured (feature vectors derived from image patches).
Figure 2.5: Training a SRF for border ownership assignment. (A) Example image with extracted features xf ∈ Xf and ground truth annotations from the highlighted patch. We derive an orientation coding, Y, from the annotations. (B) By mapping Y to discrete labels, we determine the optimal split parameters θ associated with each split function h(xf, θ) that sends features xf either to the left or to the right child. The leaf nodes store a distribution of border ownership structured labels. (C) During inference, a test patch is assigned to a leaf node within a tree that contains a prediction of the border ownership. Averaging the prediction over all t trees yields the final ownership prediction. We then convert the orientation code into an oriented boundary (blue) that encodes the foreground (red) and background (yellow) predictions.

Let us denote the input as Xf, composed of features xf ∈ Xf derived from a training patch P. The target output is a structured label Y = Z^{N×N} that contains the orientation-coded annotation of the border ownership. Using an 8-way local neighborhood system, this amounts to 8 different possible orientations of border ownership (Fig. 2.5 (A-bottom)) that each decision tree will predict. The goal of training a SRF (or a RF in general) is to determine, for the i-th split (internal) node, the parameters θi associated with a binary split function h(xf, θi) ∈ {0, 1}, such that if h(·) = 1 we send xf to the left child, and to the right child otherwise. We define h(xf, θi) to be an indicator function with θi = (k, ρ) and h(xf, θi) = 1[xf(k) < ρ], where k is the feature dimension corresponding to one of the features described in §2.3.1. Following [80], we select at most √k feature elements for evaluation. ρ is the learned decision threshold that splits the data Di ⊂ Xf × Y at node i into D_i^L and D_i^R for the left and right child nodes respectively.
ρ is chosen by maximizing a standard information gain criterion M_i:

M_i = H(D_i) − Σ_{o ∈ {L,R}} (|D_i^o| / |D_i|) H(D_i^o)    (2.4)

We use the Gini impurity measure: H(D_i) = Σ_y c_y(1 − c_y), with c_y denoting the proportion of features in D_i with ownership label y ∈ Y. For non-structured Y, computing eq. (2.4) is straightforward. In the case of structured labels, we first compute an intermediate mapping Π : Y → L of structured labels into discrete labels l ∈ L following [51], which allows us to compute eq. (2.4) directly. L is a set of labels that corresponds to different types of possible contour token centers (see §2.3.1.2), which means that we can reuse the results from the feature extraction step during training for added efficiency.

Figure 2.6: Computing the Gini impurity measure, H(D_i), using cluster labels. (Left) Using cluster centers derived from the 8 discrete orientation codings, we first apply nearest neighbor (NN) clustering over the input structured annotations y ∈ Y. The labels assigned to each y are then used to compute H(D_i) so that we improve the purity of D_i at each split node i.

Specifically, we apply nearest neighbor (NN) clustering to each data point y so that it is assigned to one of the cluster labels associated with each contour token center. We can then apply eq. (2.4) over these cluster labels l so that we split D_i appropriately (Fig. 2.6). The process is repeated with the remaining data D^o, o ∈ {L,R}, at both child nodes until a terminating criterion is satisfied. Common terminating criteria are: 1) the maximum tree depth d_t is reached, 2) a minimum input size |D| is reached, or 3) the gain in M_i is too small. After training, the leaf nodes of each tree thus contain the predicted local ownership orientation decision y (Fig. 2.5 (B)).
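The Gini impurity and the information gain of eq. (2.4) can be sketched in a few lines; this assumes the structured labels have already been mapped to discrete cluster labels l ∈ L as described above (function names are ours):

```python
from collections import Counter

def gini(labels):
    """Gini impurity H(D) = sum_y c_y (1 - c_y), where c_y is the
    proportion of samples in D carrying label y."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def info_gain(parent, left, right):
    """Information gain M_i = H(D_i) - sum_o (|D_i^o| / |D_i|) H(D_i^o),
    eq. (2.4), with o in {L, R}."""
    n = len(parent)
    return gini(parent) - sum(len(o) / n * gini(o) for o in (left, right))
```

A perfect split of a two-class node yields a gain equal to the parent impurity, while a split that leaves both children mixed yields a gain near zero; the training procedure picks the (k, ρ) maximizing this gain.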
Note that unlike the RF, where a prediction is performed independently per pixel, the SRF enforces spatial consistency in the structured labels at the leaf nodes, so that the final predictions do not change too much along boundaries. In order to account for scale variations, we further sample patches from three different resolutions (original, half and double) of the input image. During inference, we sample test patches densely (at the original resolution) over the entire image and classify them using all t decision trees in the SRF. The final ownership label at each pixel is determined by averaging the predicted orientation labels across all t trees, producing an orientation code that we convert directly into an oriented boundary representation (Fig. 2.5 (C)).

2.4 Experiments

2.4.1 Datasets, baselines and evaluation procedure

We evaluate the performance of border ownership assignment over two publicly available datasets containing real world images: 1) the Berkeley Segmentation Dataset (BSDS) [193] and 2) the NYU Depth V2 (NYU-Depth) dataset [206]. For BSDS, we use a separate subset of 200 labeled images (obtained from the training subset of BSDS-300) that contains ownership annotations. As this dataset was used by the two baseline approaches, 1) the Global-CRF of Ren et al. [231] and 2) the 2.1D-CRF of Leichter and Lindenbaum [162], the results we report in §2.4.3 are directly comparable. We use the same train/test split as both baselines, with 100 images for training and 100 images for testing. The NYU-Depth dataset consists of 1449 RGB-Depth images taken from a variety of indoor environments. The training set consists of 795 images while the remaining 654 images are used for testing. All images in the dataset are hand annotated with 1000+ object class labels.
Following [93], we select the top 35 most frequent object labels (excluding flat surfaces such as walls, floors and ceilings) in order to automatically generate a large number of ownership labels along the boundaries of these objects, using the depth information to produce the ground truth labels for the entire dataset. Compared to BSDS, where only 36.1% of boundary pixels have ownership annotations, we increase the annotation density to nearly 50% in NYU-Depth. Several examples of the input data, ground truths and results are shown in Fig. 2.8.

We report the same accuracy evaluation metric used in [231] and [162], where we count the number of correctly classified border ownership pixels against the ground truth. This is computed via a bipartite graph matching to determine the closest correspondences between the predicted border ownership pixels and the ground truth. Predictions that are not matched are not considered. Following [162], we set the matching threshold to 0.75% of the image diagonal.

The parameters used for training the SRF are the same for both datasets. We use patch sizes of N = 16 with C = 20 token cluster centers (per direction). 200,000 patches are randomly sampled from the training images. We retain the top K = 5 principal components for generating the spectral features. We train a SRF with t = 16 decision trees and we limit all trees to a maximum depth of d_t = 64 levels.

Figure 2.7: Top 20 principal components for BSDS (left) and NYU-Depth (right) for a particular token cluster center. (Bottom row) Components derived from random patches in each dataset.

2.4.2 Comparing spectral components

Before we present evaluation results of the approach, we first perform an analysis of the spectral components produced by applying PCA over clustered token patches in both the indoor (NYU-Depth) and outdoor (BSDS) datasets. We show in Fig.
2.7 a visual comparison of the top 20 principal components (PC) obtained from one token cluster center (horizontal, with the background at the top half and the foreground at the lower half of each patch), baselined against components derived from random patches (bottom row). In both datasets, we sampled 500,000 patches. We make four observations. First, the top component (PC1) is the same for both BSDS and NYU-Depth: a step edge. The second component (PC2, boxed in Fig. 2.7) exhibits the distinctive signature of extremal edges, with shading on the lower half (foreground) and no shading on the top half (background). This confirms the observations made by Ramenahalli et al. [228] on the basis of a much smaller number of images (585), and shows that extremal edges are present across different scenes and environments. Second, we note that the intensity variation in PC2 from NYU-Depth appears “smoother” across the foreground region compared to BSDS. This seems to indicate that extremal edges are more stable in the indoor NYU-Depth dataset. One possible explanation is that the structured lighting in indoor environments supports the existence of extremal edges better than the diffused lighting common in outdoor situations. Third, we note that other ownership cues such as T-junctions and parallel structures are also captured within the top PCs of both datasets (e.g. PC6 and PC9). Finally, as none of the PCs from random patches exhibit the signature of extremal edges (or other ownership cues), this further confirms that the spectral features we use are distinctive along true object boundaries.

2.4.3 Results

We performed a series of quantitative ablation studies over different feature sets in both datasets and compared their performance with the baselines Global-CRF and 2.1D-CRF on the BSDS dataset.
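The spectral components of §2.4.2 are the top principal components of mean-centered, flattened token patches; a minimal sketch of this construction via SVD (the function name is ours, and the thesis additionally learns separate bases per contour token center):

```python
import numpy as np

def top_components(patches, K=5):
    """Top-K principal components of a set of patches: flatten, mean-center,
    and take the leading right singular vectors of the data matrix. Each row
    of the result is one component (e.g. PC1, the step edge)."""
    X = np.asarray([p.ravel() for p in patches], dtype=float)
    X -= X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:K]
```

Projecting a test patch onto these K components yields its spectral feature vector.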
In a second experiment, we also applied the basis functions learned from NYU-Depth (indoor) to the BSDS dataset in order to validate our observation in §2.4.2 that the spectral components from the indoor NYU-Depth scenes are more informative than those obtained from BSDS (outdoor). The full results are summarized in Table 2.1. We show the contribution of individual features, as well as the improvements when each feature is used with other cues. As a point of reference, we note that for BSDS we are classifying over 18,000 pixels, while we are approaching 2,500,000 pixels for NYU-Depth. Finally, since our approach predicts boundaries in addition to ownership, we evaluate its boundary prediction accuracy in a third experiment (Table 2.2).

Figure 2.8: Example results from both BSDS (left panel) and NYU-Depth (right panel) datasets. Eight results per dataset (top-left, counterclockwise): images, ground truth labels (red: foreground, blue: background) and ownership predictions (red: foreground, yellow: background, blue: boundaries).

  Feature set                      | BSDS          | NYU-Depth
  HoG                              | 72.0%         | 66.0%
  + Spectral (no contour tokens)   | 73.1% (72.0%) | 67.0% (65.6%)
  + Spectral (contour tokens)      | 74.0% (72.3%) | 68.1% (66.7%)
  + Gestalt patterns               | 74.4% (72.7%) | 68.4% (66.7%)
  All features + Spectral (NYU)    | 74.7% (72.8%) | -
  Global-CRF [231]                 | 69.1%         | -
  2.1D-CRF [162]                   | 68.9%         | -

Table 2.1: Border ownership prediction accuracy for various ablations compared with the baselines (last two rows). ‘+’ denotes the addition of new features to those above the current row. Numbers in parentheses denote the use of the single feature for prediction.

  Method            | BSDS-500                 | NYU-Depth
  Our approach      | 0.73, 0.74, 0.76         | 0.63, 0.64, 0.60
  gPb-owt-ucm [6]   | 0.73, 0.76, 0.73         | 0.63, 0.66, 0.56
  SE [51]           | 0.73, 0.75, 0.77 (SE-SS) | 0.65, 0.67, 0.65 (SE-RGB)

Table 2.2: Boundary prediction accuracy. The numbers reported in each cell are [ODS, OIS, AP] following [6]. Results for gPb-owt-ucm and SE are reproduced from [51].

Ablation studies of different features.
The first four rows in Table 2.1 summarize the mean accuracy of border ownership assignment when different combinations of feature sets are used. The general trend is that as more cues are used, the ownership prediction improves for both datasets. We note that the results confirm the usefulness of learning separate basis functions corresponding to different contour token centers (third row), where there is around a 1% improvement in accuracy over the case where no contour tokens are used (second row). For the latter, we simply learned a basis over the 8 ownership orientations. We also show the contribution of individual features in parentheses. Of interest is that the Gestalt-like features perform on par with the spectral features in the NYU-Depth dataset, while they have a larger individual influence in BSDS. A likely explanation is that most indoor man-made objects are textureless compared to outdoor environments. Additional experiments in more controlled environments have to be done to confirm this hypothesis.

Applying NYU-Depth (indoor) spectral features to the BSDS dataset. In the second experiment, we applied the basis functions obtained from NYU-Depth to the BSDS dataset. This results in a slight improvement, to 72.8%, of the spectral features' individual contribution. Given this small degree of improvement, more experiments with a more careful selection of indoor patches should be performed to confirm our hypothesis in §2.4.2. Nonetheless, we note that combining the NYU-Depth spectral features with the other features yields the best overall prediction accuracy for BSDS (74.7%) in all experiments.

Comparison with state-of-the-art. The prediction accuracy of the proposed SRF border ownership assignment outperforms the previous state-of-the-art results, 1) Global-CRF and 2) 2.1D-CRF, by at least 2% even when using only simple HoG-like (shape) features in the BSDS dataset. The improvement when all features are combined is even more significant: > 5%, or around 900 pixels that are reclassified correctly.
Compared to 2.1D-CRF, with a reported mean run-time of 15s, inference using the SRF is ≈100 times faster (0.1s).

Boundary prediction accuracy. Our approach (using all features) produces reasonable boundary (not ownership) predictions that are comparable with state-of-the-art boundary detectors, gPb-owt-ucm [6] and structured edges (SE) [51], when evaluated over the larger BSDS-500 [6] and NYU-Depth datasets (Table 2.2). Since our approach evaluates test patches at the original resolution without any depth information, we compared against the closest variants of SE: SE-SS (single scale) and SE-RGB (no depth) in BSDS-500 and NYU-Depth respectively. Ablations of features produce insignificant deviations from these results, which shows that the proposed features are more suited for ownership than for boundary prediction. Furthermore, these results are notable given that our approach is trained on a smaller subset of ownership labels in both datasets.

2.5 Applications of Border Ownership

We demonstrate in this section two extensions of our proposed border ownership assignment approach. First, we show how ownership information can be used to guide the image torque (closure) operator [209] (§2.3.1.3) for the object proposal task, that is, for detecting potential foreground objects. Second, we demonstrate the generalizability of our approach to a completely different sensor, a neuromorphic event-based camera known as the Dynamic Vision Sensor (DVS) [173].

2.5.1 Guiding image torque using ownership information

The image torque closure operator, introduced by Nishigaki et al. [209], is defined as the cross product between a tangential “force” vector F_q along an edge point q and its corresponding displacement vector d_pq towards the center of the patch p. [209] used this operator to detect closed regions, under the reasonable assumption that a closed region tends to correspond to an object; this corresponds to the key Gestalt principle of closure.
However, as we will show in Chapter 3, this assumption is often violated in real images, where clutter and occlusions result in many wrong closure groupings from background boundaries. We present here a simple extension that further improves the results of [209] by imposing an additional ownership constraint so that torque groups only the foreground side of the boundary. The key insight is that ownership information encodes a directed edge, O_q, at each boundary pixel q. Replacing F_q with O_q in the torque definition therefore encourages (by selecting the correct direction or polarity) foreground boundaries to be grouped: τ^O_pq = O_q × d_pq (Fig. 2.9).

Figure 2.9: Using the torque operator with ownership information for object proposals. (Top row) Given the input image and extracted border ownership (BOWN) information, objects at different depth layers will experience an inversion at shared boundaries (left box). By enforcing an ownership orientation (clockwise), we are able to invert ownerships that do not correspond to the torque center, denoted as a blue ‘+’ (right box). (Bottom row) We then use this ownership-guided torque to select only the negative torque points (black crosses) to extract layered segments using the fixation segmentation approach of [199].

An important point to note, and one that is often overlooked, is why one would need a grouping mechanism such as torque when we already have ownership information to begin with. The reason is that although ownership indicates the foreground and background regions along the boundaries, it provides only relative and local ordinal depth information. Segmenting an object requires larger and more global cues, and we note that in many cases, especially in complex scenes with multiple scene depths, inversion of the ownership occurs along shared boundaries between such objects (Fig. 2.9 (top-left)).
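The ownership-guided torque τ^O_pq = O_q × d_pq is a 2D cross product summed over boundary pixels in a patch; a minimal sketch under the sign convention of the text, where a consistently clockwise ownership field around the center yields a negative torque (the function name is ours):

```python
import numpy as np

def ownership_torque(center, edge_pts, ownership_dirs):
    """Sum over boundary pixels q of the z-component of O_q x d_pq,
    where O_q is the ownership direction at q and d_pq the displacement
    from q toward the patch center p. Sketch only: the real operator is
    evaluated densely over patch positions and scales."""
    p = np.asarray(center, dtype=float)
    total = 0.0
    for q, o in zip(edge_pts, ownership_dirs):
        d = p - np.asarray(q, dtype=float)
        o = np.asarray(o, dtype=float)
        total += o[0] * d[1] - o[1] * d[0]   # z-component of the cross product
    return total
```

With this convention, selecting only one torque polarity (negative, for the clockwise ownership orientation) restricts the search to foreground closures.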
Such a situation makes it impossible to segment the object directly from ownership information in a straightforward way. The ownership orientation encoded by O_q (clockwise or counterclockwise), however, should remain consistent throughout. To exploit this, we first extract the ownership orientation codings surrounding a particular torque center. We consider only the ownership furthest from the center, while removing repeats. Next, we select as a start/end point one which has orientation ‘1’ (FG below/BG above). Since the foreground object must be closed, a perfect closure would have the following (target) sequence: [1 → 2 → 3 → · · · → 8 → 1]. Such a situation, however, occurs only for uncluttered, unoccluded foregrounds. In most cases, inversions will occur within the sequence, and we mark them so that when we compute torque, the orientations at such locations are flipped intentionally (Fig. 2.9 (top-right)).

Using the ownership-guided torque in this way presents a robust and elegant mechanism for locating objects at different depth layers. Furthermore, we retain all the advantages of torque: 1) speed, 2) robustness to noisy edges (wrong ownership predictions) and 3) the maximum response provides both the scale and centroid (fixation points) of the object. It also limits the selection of correct torque points: we only need to search for a single torque polarity (which in this case is negative, for the clockwise direction). Finally, this approach closely models neurological studies showing that border ownership tends to have a longer contextual range originating from higher-level regions [43, 306], and the ownership-guided torque is a computationally viable method to realize this.

Since torque produces fixation points at the maxima over multiple scales, we use them in a fixation-based segmentation [199] to extract the final object segments. We show and compare in Fig. 2.10 segmentation results of ownership-guided torque τ^O_pq with standard torque τ_pq.
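The inversion-marking step above walks the orientation codes around a torque center and flags positions that break the expected cyclic progression [1 → 2 → ... → 8 → 1]; a simplified sketch (the function name and the exact flagging policy are ours):

```python
def mark_inversions(seq, n_orient=8):
    """Flag positions in an ownership-orientation code sequence that break
    the expected cyclic progression 1 -> 2 -> ... -> 8 -> 1. Flagged
    positions are where the ownership is flipped before computing torque."""
    flags = [False] * len(seq)
    expected = seq[0]
    for i, code in enumerate(seq):
        if code != expected:
            flags[i] = True                  # inversion: flip ownership here
        else:
            expected = expected % n_orient + 1   # advance cyclically
    return flags
```

A perfect closure produces no flags; an occluding boundary from a different depth layer shows up as a run of flagged codes.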
The key observation is that segmentations extracted from ownership-guided torque usually correspond to the foreground objects, whereas standard torque segments both foreground and background regions. Using ownership-guided torque also produces more consistent foreground segments with little leakage into the background, unlike the results from standard torque. Finally, we note that these qualitative results are extremely promising and indicate further research directions into better using ownership information for foreground-background segmentation.

Figure 2.10: Some results of applying ownership-guided torque vs. standard torque. Four results per column, (L-R): (T) input RGB image, standard torque output and segments; (B) border ownership predictions, ownership-guided torque and segments. We show only the segments produced by the top 2 torque points. For standard torque, we show segments from both torque maxima (green) and minima (red) since we do not have ownership constraints, unlike ownership-guided torque where we show only the segments corresponding to torque minima.

2.5.2 Predicting boundaries and ownership from DVS

The Dynamic Vision Sensor (DVS) is a neuromorphic camera that mimics the human retina [173] in terms of its design and output. It belongs to a group of novel sensors known as event-based cameras, so named because the camera does not create a full image frame using the global clock of conventional CCD cameras. In conventional frame-based cameras, the CCD sensor captures images at a known frame rate and reconstructs the entire image after a preset time. This is not what the retina does. Instead, the DVS outputs asynchronous events, ev(x, y, t, p), parameterized by the (x, y) spatial location within the 128×128 sensor, the timestamp t of when the event occurred, and the polarity p ∈ {+1, −1} of the event.
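Several of the event-based features described below are built on a per-pixel map of the most recent event; a minimal sketch of accumulating a DVS event stream ev(x, y, t, p) into last-timestamp and polarity maps (the function name is ours, and events are assumed to arrive in time order):

```python
import numpy as np

def timestamp_map(events, shape=(128, 128)):
    """Build per-pixel maps of the last event's timestamp and polarity from
    a time-ordered stream of DVS events ev = (x, y, t, p); later events
    overwrite earlier ones at the same pixel."""
    T = np.full(shape, -1.0)        # -1 marks pixels with no event yet
    P = np.zeros(shape, dtype=int)
    for x, y, t, p in events:
        T[y, x] = t
        P[y, x] = p
    return T, P
```

The map T is the time surface from which the temporal, motion and time-texture features of §2.5.2.1 are derived.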
The sign of p is based on whether the log of the intensity at the same pixel location increases or decreases beyond a fixed global threshold compared to the previous event, ev(x, y, t−1, p).

Due to the event-based nature of such cameras, most existing Computer Vision techniques are not suitable for handling such input. This is because the key assumption, that the entire 2D image or 3D frames (for videos) are available, simply does not hold for such cameras. This is the key computational motivation for this work². From a biological perspective, the formation of edges into object-centric contours with ownership information models processes in the visual cortex [102, 146, 284] that are responsible for illusory contours and ownership assignment (see §1.2.1).

Our approach is similar to what has been described above and in [268], with two important novelties: we 1) use DVS event-based features and 2) present a sequential SRF that improves the prediction as more events are observed. We describe these two innovations next.

²Joint work with Francisco Barranco, published in [9]. Code, data and more results are available at http://www.umiacs.umd.edu/research/POETICON/DVSContours/

2.5.2.1 Event-based features from DVS

Several different event-based features derived from the DVS data are used here. They were selected based on: 1) the ownership and boundary information they capture and 2) their ease of computation. These features can be broadly grouped into four categories, illustrated in Fig. 2.11, which we describe in the next few paragraphs.

Event temporal information. We show in Fig. 2.11-A the timestamp of the last event triggered for every pixel, measured in relative time (ms) with respect to the onset of the event.

Event-based orientation. The events are grouped into eight discrete spatial orientations (from 0 to π). Fig. 2.11-B shows the map of orientations for different spatial locations.
For every new event, its timestamp is first compared to the average timestamp of the events in the neighborhood. If the difference exceeds 10 ms, the event is considered an outlier and is discarded. A winner-takes-all strategy is then used to obtain the most likely orientation for the new event, which we admit if the difference between the new orientation and the previous orientation exceeds 2 orientation bins.

Figure 2.11: Event-based visual features. (Left panel) A 3D spatial representation that encodes the timestamp of the last event on the z axis (after 20 ms) for every pixel in the DVS sensor. The image on the top-left shows the original configuration of the scene (captured with a conventional camera). We show more features derived from the highlighted patch (boxed) in the right panel: A) the last timestamp (time); B) event-based orientation (orientation, time); C) event-based motion estimation, ∇T_e, computed by fitting local 5×5 planes to the surface T_e (horizontal component of the motion v_x, vertical component of the motion v_y, time); D) event-based time-texture, obtained from the maximum responses per scale of a bank of Gabor filters with 6 orientations and 3 scales (max response at 1st scale G_s1, max response at 2nd scale G_s2, max response at 3rd scale G_s3, time). Figure adapted from [9].

Event-based motion estimation. Following [12], we use a function T_e that assigns to every position the timestamp of its last event. This function locally defines a surface of size 5×5 pixels. The spatial derivatives of this surface provide the speed and direction of the local motion. Specifically, the gradient vector ∇T_e = (v_x^{-1}, v_y^{-1})^T gives the inverse of the image velocity. In practice, the function T_e is approximated by fitting a local plane P (with normal vector n) to the last timestamps around every location, as illustrated in Fig. 2.11-C. Additionally, a regularization of the data is performed simultaneously with the plane fitting process.
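The plane fit underlying the motion feature can be sketched as a least-squares fit of t = a·x + b·y + c to a 5×5 neighborhood of the time surface T_e; the fitted gradient (a, b) is then the componentwise inverse of the image velocity. This is a bare sketch under those assumptions, without the event-wise updates or the simultaneous regularization described in the text (the function name is ours):

```python
import numpy as np

def local_motion(Te_patch):
    """Least-squares plane fit t = a*x + b*y + c to a patch of the
    last-timestamp surface T_e. Returns (a, b) = (dT/dx, dT/dy), i.e.
    (1/v_x, 1/v_y), the inverse of the local image velocity."""
    N = Te_patch.shape[0]
    ys, xs = np.mgrid[:N, :N].astype(float)
    A = np.c_[xs.ravel(), ys.ravel(), np.ones(N * N)]
    (a, b, _), *_ = np.linalg.lstsq(A, Te_patch.ravel(), rcond=None)
    return a, b
```

For instance, a surface whose timestamps grow by 0.5 ms per pixel in x corresponds to a horizontal speed of 2 pixels/ms.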
For each new event that is reasonably close (< 0.2 pixels), P is updated within a time interval of 7.5 ms.

Event-based time-texture. Instead of the intensity texture gradients used on images, we use a map of the timestamps of the last event triggered at every pixel. This map defines a time-texture surface, to which we apply a bank of Gabor filters with 6 orientations and 3 scales. The three feature maps depicted in Fig. 2.11-D correspond to the maximum response over all orientations at every location, for each of the three spatial scales considered.

All these feature maps are estimated using short time intervals of 20 ms. It is important to note that all feature processing is event-driven: with every new event, all the feature maps and their timestamps are updated with respect to the new event.

2.5.2.2 A sequential SRF for continuous DVS data

Figure 2.12: Extending a non-sequential SRF (R_ns) to a sequential SRF (R_sq) given a sequence of DVS input features. (A) Training R_sq. We train R_sq in exactly the same way as R_ns with one key difference: we first run R_ns over the training data to provide initial predictions E_o (left panel), which are then used as an augmented training feature set U_f = E_o × X_f for learning weights in R_sq (right panel). (B) Inference from sequential data. (Top panel) For the first DVS data at 20 ms, we use R_ns to predict E_o^20: the boundaries and their ownership labels. Using the augmented input feature patch, the sequential R_sq is then used to produce E_o^40 for the second DVS data at 40 ms. (Bottom panel) The process is repeated for all subsequent DVS data using R_sq.

In practice, as events from the DVS arrive in a continuous fashion, our (non-sequential) SRF-based approach for boundary and ownership prediction, R_ns, should benefit from more observations of the scene. We describe here an extension that takes advantage of new DVS data to refine the results further using a sequential SRF, R_sq.
We do this by augmenting the existing DVS features with the output predictions from the previous DVS time step, as shown in Fig. 2.12. Specifically, denoting by n = 20 the first DVS data at 20 ms, we use the existing non-sequential SRF (R_ns in Fig. 2.12 (B)) to produce a prediction of the data, E_o^20. For subsequent DVS times, n + 20, we augment the input DVS features X_f^{n+20} with the previous prediction E_o^n to obtain a larger feature set U_f^{n+20} = E_o^n × X_f^{n+20}, which we then pass into a sequential SRF R_sq. R_sq is a SRF that contains K/2 trees trained using the DVS features X_f and K/2 (K = 8) trees trained using U_f (Fig. 2.12 (A)), which during inference produces two predictions: E_o from the DVS features and E_o^{sq} from the augmented features. E_o is exactly what R_ns predicts for the current DVS data (with half the number of decision trees), while E_o^{sq} is a prediction that takes into account the results from the previous DVS time step. Choosing a weight factor w_f ∈ [0, 1] that combines these two predictions, the final prediction is thus defined as E_o^{n+20} = w_f E_o^{sq} + (1 − w_f) E_o.

2.5.2.3 Boundary and ownership results

In order to validate our results, we ran a series of feature ablation studies over different DVS sequences that vary in terms of the number of objects (depth layers), background and motion of the camera. The datasets used are summarized in Table 2.3. Note that the “NewObj-NewBG” and “Complex-C” sequences are used for testing only. All sequences are used to evaluate R_ns, except “Complex-C”, which is used solely to evaluate R_sq.
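The two sequential-SRF operations, augmenting the feature channels with the previous prediction and blending the two per-pixel predictions, can be sketched directly (function names are ours, and U_f = E_o × X_f is realized here as simple channel stacking):

```python
import numpy as np

def augment(X_f, E_prev):
    """Augmented feature set U_f^{n+20}: stack the previous prediction
    channels E_o^n onto the current DVS feature channels X_f^{n+20}."""
    return np.concatenate([E_prev, X_f], axis=-1)

def combine(E_sq, E_o, wf=0.4):
    """Final prediction E_o^{n+20} = w_f * E_o^{sq} + (1 - w_f) * E_o,
    blending the augmented-feature trees with the plain DVS-feature trees."""
    return wf * E_sq + (1 - wf) * E_o
```

Setting wf = 0 recovers the non-sequential behavior, while larger wf carries more temporal history forward, matching the trade-off discussed in §2.5.2.4.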
The various feature ablations are selected by training the SRF with a subset of the DVS features that capture certain properties sensitive to boundary prediction and ownership assignment.

  Sequence      | Description                                        | # Objects / # Layers / # Motions / {# Train | # Test}
  Rotation      | Mainly rotational motion                           | 1/1/1/{20 | 20}
  Translation   | Mainly translational motion                        | 1/1/1/{18 | 18}
  Zoom          | Mainly zoom motion                                 | 1/1/1/{18 | 18}
  Complex       | Up to 3 objects and clutter, different backgrounds | 3/3/3/{74 | 53}
  NewObj-NewBG  | Only for testing: new objects and backgrounds      | 3/3/3/{- | 47}
  Complex-C     | Mainly translation + rotation                      | 3/3/2/{- | 1}

Table 2.3: Descriptions of the DVS sequences used. Note that “NewObj-NewBG” is a held-out testing sequence and “Complex-C” is used only for testing the sequential SRF.

We first train separate SRFs that use each feature subset (§2.5.2.1) separately: [Timestamp (TS) only], [Motion only], [Orientation (Orient) only] and [Time-Texture only]. Next, we train a SRF that uses all features together [All features]. We evaluate the performance of our approach by reporting the F-measure over Precision and Recall (P-R) for assessing ownership and boundary accuracy. For boundaries, we use the standard evaluation procedure from the Berkeley Segmentation Dataset [193] to generate P-R curves and report the maximal F-score (ODS) per DVS sequence. For ownership, we compute its F-score, F_own, by first matching ownership predictions that are no further than 0.4% of the image diagonal from the ground truth (as in [162]), and we consider a pixel to have the correct ownership when its orientation code is less than 90 degrees from the ground truth.
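Both this ownership metric and the one in §2.4.1 rest on matching predicted pixels to ground-truth pixels within a fraction of the image diagonal (0.4% here, 0.75% in §2.4.1). The sketch below uses a greedy nearest-neighbor matching rather than the full bipartite matching of the text, and the function name is ours:

```python
import numpy as np

def ownership_accuracy(pred_pts, gt_pts, img_shape, frac=0.0075):
    """Greedy one-to-one matching of predicted ownership pixels to ground
    truth within frac * image diagonal (a simplification of the bipartite
    matching used in the text). Unmatched predictions are ignored; the
    score is the fraction of ground-truth pixels that get matched."""
    thr = frac * np.hypot(*img_shape)
    gt = [np.asarray(g, dtype=float) for g in gt_pts]
    matched = 0
    for p in map(np.asarray, pred_pts):
        if not gt:
            break
        d = [np.linalg.norm(p - g) for g in gt]
        j = int(np.argmin(d))
        if d[j] <= thr:          # within threshold: consume this GT pixel
            matched += 1
            gt.pop(j)
    return matched / max(len(gt_pts), 1)
```

An orientation check (predicted code within 90 degrees of the ground truth) would additionally gate each match before it is counted, as described above.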
As a final measure of the combined performance of ownership assignment and boundary accuracy, we report the average of these two scores, denoted F_c.

  Feature ablations | Rotation            | Translation         | Zoom                | Complex             | NewObj-NewBG
  Timestamp Only    | 0.394, 0.641, 0.517 | 0.308, 0.591, 0.449 | 0.239, 0.498, 0.368 | 0.289, 0.494, 0.391 | 0.185, 0.366, 0.276
  Motion Only       | 0.307, 0.558, 0.433 | 0.271, 0.492, 0.381 | 0.251, 0.475, 0.363 | 0.267, 0.478, 0.373 | 0.207, 0.392, 0.300
  Orientation Only  | 0.321, 0.570, 0.445 | 0.323, 0.536, 0.429 | 0.243, 0.494, 0.368 | 0.279, 0.471, 0.375 | 0.200, 0.363, 0.282
  Time-Texture Only | 0.268, 0.552, 0.410 | 0.197, 0.512, 0.354 | 0.223, 0.492, 0.358 | 0.258, 0.460, 0.359 | 0.206, 0.395, 0.300
  All features      | 0.373, 0.661, 0.517 | 0.313, 0.578, 0.445 | 0.268, 0.523, 0.395 | 0.287, 0.502, 0.394 | 0.204, 0.406, 0.305
  Baseline          | –, 0.218, –         | –, 0.237, –         | –, 0.344, –         | –, 0.273, –         | –, 0.302, –

Table 2.4: Performance evaluation of feature ablations over different DVS sequences. For every dataset and ablation, each cell reports the {F_own, ODS, F_c} scores.

Since this is the first approach that detects boundaries from DVS data, there are no other methods to compare with. However, we created a baseline that groups events using their timestamps. This simple method connects edges into long contours if they appear in spatial proximity within a small time interval and if their orientations match. Moreover, it applies non-maximum suppression, as in the Canny edge operator, to make the boundaries cleaner. The evaluation results are summarized in Fig. 2.13 (boundary P-R curves) and in Table 2.4. We also show results using R_ns and R_sq in Figs. 2.14 and 2.15.

Figure 2.13: Precision-Recall of boundary prediction accuracy for all DVS sequences. Top row (L-R): “Rotation”, “Translation” and “Zoom”. Bottom row (L-R): “Complex”, “NewObj-NewBG” and “Complex-C”. See text for details.

Figure 2.14: Example results using R_ns. (Top to bottom) Original scene configuration; hand-annotated segmentation and border ownership ground truths; predicted boundaries (blue) and ownership (red: foreground, yellow: background) from DVS data; baseline contours.

  SRF              | Complex-Continuous
  R_ns [w_f = 0.0] | 0.181, 0.324, 0.252
  R_sq [w_f = 0.4] | 0.212, 0.310, 0.261

Figure 2.15: (Left panel) How three different values of w_f affect the final predictions using the sequential SRF, R_sq. (Left to right) DVS data from the first 120 ms of the “Complex-Continuous” sequence. (Top to bottom) w_f = {0.0, 0.4, 0.7}. Notice that the predictions retain more history with increasing w_f, while a small value of w_f (top) is comparatively more noisy. The image on the right shows the final configuration at the end of the sequence. (Right panel) Evaluation results comparing R_ns (non-sequential SRF) with R_sq: each cell reports the {F_own, ODS, F_c} scores.

2.5.2.4 Discussion

Boundary prediction accuracy, ODS. From the results, it is clear that our approach significantly outperforms the baseline predictions, producing much better boundaries that are closer to the ground truth. Moving on to the individual features, we first note that Timestamp (TS) is an extremely strong feature for predicting the spatial location of object (motion) boundaries, yielding the highest ODS scores in all sequences (with the exception of “NewObj-NewBG”). This highlights the importance of further studies into the use of event timestamps, which are a unique feature of the DVS camera, not present in conventional sensors. Next, we note that in “NewObj-NewBG”, time-textures yield the most accurate results, which may indicate some form of invariance under challenging scenarios not captured by the other features. Further experiments with more precise motions, however, are needed to confirm this. Finally, we note that using all features together improves boundary prediction in all sequences except “Translation” (where TS remains the best).

Ownership assignment accuracy, F_own.
We first note that the best results are obtained by different features for different sequences (motions). This shows that ownership assignment, compared to boundary prediction, is more complicated to capture with the features we investigated: no single feature predicts ownership reliably across different motions (sequences). Interestingly, even though the combination of all features does not yield the best accuracy, it consistently produces one of the top results, which shows the advantage of using the SRF to determine the best feature combination. This also highlights another issue: the dependency of the motion pattern on the 3D motion. We believe that a possible approach in a practical application would be to selectively use features for border ownership characterization according to the predominant 3D motion, i.e. depending on the kind of motion (predominantly parallel translation, zoom, or rotation) we can use specific SRF classifiers tuned for the predicted motion.

Overall performance, Fc. We note that in spite of the selectivity of features for boundary prediction and/or ownership assignment, the best results (with the exception of “Translation”) are obtained when all features are used. This confirms that our choice of features is balanced in terms of these two performance criteria and that the SRF is trained to make the optimal selection to this end.

Qualitative comparisons. From Fig. 2.14, we note that not only are our predictions qualitatively much cleaner and smoother than the baseline, but we are also able to generate them in real time, which is a key requirement for event-based approaches.

Results using Rsq. We illustrate the effects of three different values of wf in Fig. 2.15 (left) over the “Complex-C” sequence: a small value of wf results in noisier predictions, while a large one retains more temporal history, some of which is propagated to the subsequent DVS times.
We have determined that a value of wf between 0.3 and 0.4 provides reasonable predictions, removing temporally inconsistent predictions while reinforcing the strongest predictions over time. This is confirmed experimentally, as shown in Fig. 2.15 (right), where the sequential variant of the SRF, Rsq, outperforms the non-sequential variant Rns (obtained by setting wf = 0.0) in the combined F-score, Fc. Most of the improvement is derived from improved ownership accuracy, at the slight expense of boundary accuracy (due to the blurring of edges across time), which is also observed in the corresponding P-R curves (Fig. 2.13 (bottom-right)).

2.6 Conclusions

In this chapter, we have described a real-time approach for simultaneous boundary and border ownership prediction using a Structured Random Forest (SRF) classifier. Our results are state-of-the-art for two modalities: RGB images from conventional CCD cameras and event-based features from the DVS. We also described a simple and elegant approach for recovering foreground object segments by reformulating the image torque grouping operator using ownership information. Key to the success of our approach are local and global ownership cues that are efficiently extracted from the input RGB/DVS data. For RGB images, we used well-known cues such as convexity/concavity, extremal edges and Gestalt-like patterns, while for DVS data we exploited a variety of time-based features that are indicative of boundary regions. In the next chapter, we move beyond boundaries and border ownership to a higher-level visual task: shape-based recognition of objects, where we use ownership information to improve recognition and matching of contour fragments in clutter before modulating the torque grouping operator towards the target.

Chapter 3: Contour-Based Categorical Object Recognition

In this chapter, we propose a method for detecting generic classes of objects from their representative contours in cluttered environments¹.
The approach uses the image torque closure operator [209] to group edges into contours that likely correspond to object boundaries. This operator is used in two ways, bottom-up on simple edges and top-down incorporating object shape information, thus acting as the intermediary between low-level and high-level information. First, we apply the torque to simple edges to extract likely fixation locations of objects. Using the torque's output, a novel contour-based descriptor is created that extends the shape context descriptor [10] to include border ownership information and to account for rotation. This descriptor is then used in a multi-scale matching approach to modulate the torque operator towards the target, so that it indicates the target's location and size. Unlike other approaches that use edges directly to guide independent edge grouping and matching processes for recognition, the proposed method effectively combines both of these steps. We evaluate the performance of our approach on four diverse datasets containing a variety of object categories under clutter, occlusion and viewpoint changes. Compared with current state-of-the-art approaches, our approach is able to detect the target with fewer false alarms in most object categories. The performance is further improved when we exploit depth information available from the Kinect RGB-Depth sensor by imposing depth consistency when applying the image torque.

¹ This work was published in [271] and further extended to include contour fragments and rotational invariance in [269]. Full results, code and datasets are available at http://www.umiacs.umd.edu/research/POETICON/contour_based_recognition/

3.1 Introduction

Humans have an uncanny ability to recognize objects of various shapes and sizes with relative speed and ease, even in highly cluttered environments, by exploiting a wide variety of visual cues. In this work we seek to use contours as the main cue for recognition.
The problem of object recognition in general, and recognition from contours specifically, is still considered challenging. The problem is particularly difficult in clutter, when objects occlude each other and only parts of an object's boundary are visible. How do we get from the simple edge responses detected by filters to characteristic contours at the boundaries of objects? What approach should we take in our computations? Is there inspiration we can draw from human perception? As we noted earlier in Chapter 1, the Gestalt theorists proposed a very influential theory on how this can be resolved. They suggested that certain principles guide the processing in the vision system with the goal of extracting foreground regions from background (§1.1). Here we focus on two of these principles: the principle of closure, which states that simple feature elements tend to be grouped together if they are parts of a closed figure, and the principle of past experience, implying that visual stimuli are categorized according to past experience.

Figure 3.1: From mid-level contour grouping to object recognition. (a) Attention-based contour grouping: by grouping contours that support the presence of an object, a set of initial fixation points are used for the recognition step. (b) Contour-based recognition at fixation points: using the supporting contours at each fixation point, we score the contour similarity in a hierarchical manner (increasing lengths) against a target contour model. (c) Target object detection: regrouping scored contours using the same mid-level grouping strategy reveals locations, scales and supporting contours of the target object.

We propose a mid-level vision operator to implement these principles. This operator groups edges within regions of different sizes to locate boundaries of objects, and it interacts with low-level and high-level processes. By using it first in a bottom-up fashion to group simple edge responses (Fig. 3.1(a)), it can be used to find potential object locations in parallel. Then, by tuning it to object-characteristic edges (Fig. 3.1(b)) to group boundary edges of objects, even when only parts of the object are visible, it can be used to locate and identify specific objects (Fig. 3.1(c)).

The main advantages of using contour information for recognition are that contours are: 1) extremely easy to obtain and process using recent state-of-the-art edge detectors [51, 174], and 2) robust against changes in lighting in comparison with other appearance- or pixel-based cues (e.g. color and texture), since one considers at least the first-order differences between low-level pixel signals in localizing the edge [193]. In addition, since we are interested in recognizing categories of similarly shaped objects, by using contours we generalize better across object categories which share certain common shapes and functionality in different domains. This has important implications when searching for objects based on descriptions of shape (this work) or functionality, or when the system is asked to suggest plausible alternatives when the actual target is not present. The main drawback of using only 2D contour-based information is that it is affected by changes in viewpoint, which we address through our choice of a robust shape-based descriptor. The result is a simple and straightforward approach that quickly recognizes objects that share common 2D shape properties in cluttered environments. The input is a 2D RGB image or a 2.5D image (RGB with depth information), and we are interested in detecting the contours that correspond to the target object class, e.g. the Hammer class in the UMD Hand-Manipulation dataset or the Bottle class in the ETHZ-Shapes dataset (§3.4), which is defined by a specific outline (or shape) of the most representative contours of the object.
The key challenge is to determine, from the edges derived from the input image, the set of contours that supports the presence of the target object. Although this task seems simple and straightforward, it poses several crucial challenges (Fig. 3.2):

1) Inaccurate and noisy (broken) edges. Since edge detection in 2D or 2.5D images depends inherently on local intensity gradients or surface normals, noise during the image formation process inadvertently results in edges that are inaccurate, incomplete or missing (Fig. 3.2(a),(d)). Additionally, for 2.5D images, boundaries at junctions of smoothly varying depth cannot be accurately localized since their surface normals are ambiguous (Fig. 3.2(e)). One common way of resolving this issue is to first attempt to group pieces of contours using saliency measures and the Gestalt principle of edge continuation, as for example in [127, 198]. Edge grouping techniques, however, will still fail when considerable clutter occurs (see issue 3 below) and when broken edges predominate.

2) Boundary detection and border ownership. In order to distinguish between contours belonging to one object, a key challenge, addressed in Chapter 2 of this thesis, is to determine who “owns” the edge. Once the ownership is determined, we can assign an orientation to the contour (Fig. 3.2(b)), which makes it more discriminative.

Figure 3.2: Challenges of contour-based categorical object recognition. (Top panel) 2D images. (a) Noisy edges: some edges on the head are missing. (b) Border ownership between two targets, with support marked as ‘+’. (c) Detecting partial contours in clutter. (Bottom panel) 2.5D images. (d) Noise and errors in depth/surface estimates, shown as dark blue gaps, make grouping edges at such regions difficult. (e) Edges at smoothly varying depth boundaries (dotted green lines) are hard to localize.

3) Partial matching in clutter.
Related to issue 1 above, occlusions from clutter and self-occlusions from the object's internal contours both produce contours that are broken and fragmented in the image (Fig. 3.2(c)). Unfortunately, since in such situations we detect nearby contours that do not originate from the same physical entity, bottom-up edge grouping techniques will still fail. To overcome this, approaches such as [187, 234] perform partial edge matching. The main limitation of such approaches is that even with good partial matches, a separate edge grouping and scoring step is still needed to determine the location of the object.

Indeed, the main reason for these challenges is that detecting and recognizing objects from edges alone is a very difficult task. An edge, used in isolation, does not convey much discriminatory information. Compounded with the issues raised above, contour-based recognition of objects is therefore extremely challenging. In this work, we argue that by exploiting mid-level contour grouping mechanisms, we are able to effectively address all of the above issues in a simple, holistic object detection framework.

3.2 Related Work

The problem of contour-based object recognition has been studied extensively within the computer vision community. Existing approaches can be classified based on how the edges/contours are obtained, represented and scored, and on the basis of the algorithms used for classification.

Some approaches [212, 246] learn a codebook of shape fragments. The learned class-specific shape fragments are then matched using oriented chamfer matching and voted via a star-shape model to detect objects in the image. More recently, [188] proposed a discriminative sparse coding approach that learns a class-specific dictionary for detecting object-specific contours within clutter.
[161] introduced the notion of an “implicit shape model”, where patches relative to an object center are used to create a codebook that encodes both spatial and appearance-based information for a particular class of objects. Other methods transform the contour representation so that it becomes more amenable to classification. [72] approximates contours with straight adjacent fragments for part-based matching. Similarly, [229] uses curves instead of straight lines, which are more discriminative, together with a novel scoring function. More recently, [288] proposed a deformable “fan-shape” object model that statistically encodes the expected deformation (scale and angle) of matched contour fragments with respect to an assigned center. A score for the object's location is determined via a Hough distance voting metric over several scales.

Many approaches have used local feature descriptors from interest points to match contours with the target. [164] uses simple features based on orientations and pairwise interactions to create a local descriptor for matching. [254] views the problem as a many-to-one matching problem and uses shape context to match long salient contours. Descriptors tuned for matching partial shape fragments were introduced in [234] and used in a discriminative framework in [141]. [276] proposed a novel descriptor known as the “chordiogram” to encode the relative angles of boundaries obtained from an initial super-pixel segmentation step. In [185], the authors used triplets of edge points to create a histogram of angles over all triplets for representing and matching similar contours. Machine learning methods have also been employed to improve the matching function.
[190] viewed the problem as a deformable shape matching problem, where a max-margin learning approach was used to assign discriminative weights to potential contours, while [211] used a kernel-based Support Vector Machine (SVM) [41] with a Hough voting approach to detect object-specific contours. The recent work of Hariharan et al. [97] combines the outputs of trained “poselet” detectors [24] with gPb edges [6] to detect so-called “semantic contours” in images. Instead of hand-designed features, the recent work of Bertasius et al. [15] introduced a novel multi-scale bifurcated deep network that detects object-level boundaries. A recent extension [16] adapts this network to aid the detection of semantic contours, with results far surpassing those of [97].

The work of [106] introduced a very fast 2D line matching technique, known as LINE2D, that precomputes binarized gradient orientations of the model template. By spreading the model orientations over a small region, the approach is shown to be robust against changes in orientation within clutter. However, the method requires a large number of templates for precomputing the response and is memory-intensive. As LINE2D does not explicitly handle occlusions, [111] extended LINE2D with occlusion priors learned from training data. The occlusion prior is obtained from the statistics of which object parts are likely to be occluded, estimated from the geometry of the object, occluder and camera. This yields a probabilistic occlusion prior that indicates which image points in LINE2D are consistent with the unoccluded object for matching with the model. Although the approach showed improved detection results under severe clutter, it requires significant amounts of annotated data to learn the occlusion prior, and it is unclear how the approach performs over different clutter interactions without retraining.
Some works have focused on improving the robustness of shape-based descriptors against a variety of deformations. Since we are interested in detecting manipulated objects (for the UMD Hand-Manipulation dataset), rotational invariance is crucial. [118] proposed searching over all possible rotations and selecting the one that yields the smallest matching score with the shape context descriptor. Extending the idea of searching over pose space, [171] proposed a fan-shaped triangulation technique with a novel optimization scheme to improve the rotational invariance of shape context. Instead of searching over rotations, [298] applied a 2D Fourier Transform to contour points, represented as Euclidean distances with respect to a manually selected center point, to create a descriptor that is invariant to translation, scaling and rotation.

The approach presented here extends [271], where we combined the torque mid-level operator with high-level information about a specific object (of known size and shape) represented by silhouettes obtained from 2.5D Kinect data from various poses. There are also many other works that use the full 2.5D information for object recognition; see [94] for an extensive review. Approaches that use local 2.5D descriptors, such as HONV or FPFH [241, 266], exploit local geometry and surface normals as their main features. To improve their discriminatory power, [20] used hierarchical kernel descriptors to produce larger patch-based features and trained a linear SVM for 2.5D object recognition. A more recent extension [21] proposed a discriminative dictionary learning method termed Hierarchical Matching Pursuit (HMP) to learn, in an unsupervised manner, hierarchical feature representations of image patches containing RGB-Depth data. Using a trained SVM, the approach achieves state-of-the-art recognition results on the RGB-D Object Dataset introduced in [20]. These different approaches share several common characteristics.
First, in order to overcome the noise and clutter that exist in real edge maps, some form of edge grouping is applied. Next, using specific local descriptors, edges that are grouped together are matched to see if they are similar enough to the target object model. However, in the approaches surveyed above, the two steps of grouping and matching are performed independently of each other, and the overall performance can depend on the effectiveness of either step. In addition, many of them do not address the issue of border ownership at all, which is a powerful cue for contour discrimination (see §2.1). Even among approaches that use object centers, either explicitly [161] or implicitly [276], to determine ownership, a key drawback is that the object centers are determined either by hand or from imprecise over-segmentation using superpixels. Our proposed approach, by way of contrast, uses the image torque operator in a holistic manner such that grouped edges are intrinsically endowed with border ownership information via their torque centers, creating a more robust descriptor for matching contour fragments with the target object model. As we will detail in the next section, because objects are represented in terms of partial contours with descriptors that enable us to estimate the amount of rotation with respect to the model, we are able to circumvent the need for an extensive pose search and allow for more complex shape representations compared to our prior work. Our proposed mid-level object recognition approach therefore provides robotic applications with a method that: 1) is effective under a wide variety of imaging conditions, 2) requires minimal training, since only sample contours of the target shape are required, and 3) generalizes well to similarly shaped objects (no retraining needed).

3.3 Approach

The proposed approach consists of several steps and is summarized in Fig. 3.3.
Prior to detection, contours of the target model are obtained from annotated ground truth of the training set (Fig. 3.3(a)). As the ground truth consists of contours of varying sizes and scales, we first apply Generalized Procrustes Analysis [87] to align the contours of objects of the same category. Next, motivated by the classical work on representing contours compactly using codons [233], we take a similar approach of breaking up the contours at locations of minimum and maximum curvature, where a codon is a set of ordered contiguous edge pixels in the image. Each codon from the training set is then represented as a set of B-splines, and we apply EM clustering over the spline coefficients to recover the set of $u$ model codons, $\{b_1, \dots, b_u\}$, which are arranged clockwise in the order they appear on the contour. For matching codons over multiple scales, we group these model codons, creating longer codons by combining neighboring codons in a cumulative way, such that we create a set of $l$ model codons of increasing length, $C_{mo} = \{M_1, \dots, M_l\}$ with $|M_t| < |M_{t+1}|$, until the entire contour is accounted for, per target object class.

Figure 3.3: Overview of proposed approach. (a) Example model contour fragments (codons) obtained via EM clustering of annotated training data: Bottles (top) and Swans (below). (b-1) Input image + edge map. (b-2) Original torque value map with detected proto-object centers $P_c$, sorted by their torque values: black crosses (negative torque), white crosses (positive torque). (c) Multi-scale edge matching at two selected centers (I) and (II) compared to the target Bottles. The codons selected have the strongest torque contribution $\tau_{p_c q_i}$. Matches to the model at one scale are shown in the same color; gray indicates no matches. (d-1) Weighted edge map (red means higher weights), (d-2) modulated torque value map and (d-3) predicted object location and scale at maximum torque. See text for details.
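The model-building steps above (break the contour at curvature extrema into codons, describe each codon, grow codons cumulatively into $C_{mo}$) can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the discrete curvature estimate, the fixed-length resampled descriptor standing in for the B-spline coefficients (the EM clustering step is omitted), and all parameter values are assumptions.

```python
import numpy as np

def curvature(contour):
    """Discrete curvature of a contour (N x 2) via finite differences."""
    dx, dy = np.gradient(contour[:, 0]), np.gradient(contour[:, 1])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / np.maximum((dx**2 + dy**2) ** 1.5, 1e-9)

def split_into_codons(contour, n_breaks=8):
    """Break the contour at the strongest curvature extrema (codon boundaries).
    Wrap-around of the first/last segment is ignored for simplicity."""
    k = np.abs(curvature(contour))
    is_max = (k > np.roll(k, 1)) & (k > np.roll(k, -1))   # cyclic local maxima
    idx = np.sort(np.where(is_max)[0][np.argsort(k[is_max])[::-1][:n_breaks]])
    return [contour[a:b] for a, b in zip(idx, np.r_[idx[1:], len(contour)])
            if b - a > 2]

def codon_descriptor(codon, n_samples=16):
    """Stand-in for the B-spline coefficients: resample each codon to a
    fixed-length, translation/scale-normalised coefficient vector."""
    t = np.linspace(0, 1, len(codon))
    ts = np.linspace(0, 1, n_samples)
    pts = np.column_stack([np.interp(ts, t, codon[:, d]) for d in (0, 1)])
    pts -= pts.mean(axis=0)
    return (pts / max(np.abs(pts).max(), 1e-9)).ravel()

def cumulative_codons(codons):
    """Grow neighbouring codons cumulatively into C_mo = {M_1, ..., M_l}
    with |M_t| < |M_{t+1}|, until the whole contour is covered."""
    groups, acc = [], np.empty((0, 2))
    for c in codons:
        acc = np.vstack([acc, c])
        groups.append(acc.copy())
    return groups
```

In a full pipeline, the descriptors of all training codons would then be clustered (EM over the coefficients, as in the text) to yield the $u$ model codons per category.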
At the detection step, we first obtain from the input image an edge map $I_e$ of size $H \times W$ (height × width) using any standard edge detection technique (Fig. 3.3(b-1)). We detail the remaining steps in the sections that follow (Fig. 3.3(c)-(d)). First, we review the image torque and how it functions as an edge grouping mechanism to locate the contours of proto-objects: regions likely to contain objects in the image. Next, we show how information from the computed torque can be used to enhance the shape context descriptor with border ownership information and rotational invariance for robust matching. Finally, we describe how these matched contours are used in modulating the image torque operator in a multi-scale manner so that class-specific object contours can be extracted for recognition.

3.3.1 Contour completion using image torque

The image torque [209] is a mid-level operator that is tuned to find closed boundaries, which are indicative of the presence of possible objects (proto-objects). Given an image edge map $I_e$, consider an image patch $P \in I_e$ with center point $p$. We denote the set of edge pixels (pixels corresponding to edges in $P$) as $E(P)$. This measure of edge completion is computed by summing the cross products between the tangent vectors at the edge pixels $q \in E(P)$ and the corresponding displacement vectors between $p$ and $q$. Formally, the value of the image torque², $\tau_{pq}$, of an edge pixel $q$ within a discrete image patch with center $p$ is defined as:

$\tau_{pq} = \vec{r}_{pq} \times \vec{F}_q$   (3.1)

where $\vec{r}_{pq}$ is the displacement vector from $p$ to $q$ and $\vec{F}_q$ is the tangent vector³ at $q$. In the original torque implementation $\vec{F}_q$ is a unit vector. $\vec{F}_q$ can be viewed as a "force" unit vector in the image space that can be associated with the relative importance of a particular edge pixel (see eq. (3.4)). The torque of an image patch $P$ is defined as the sum of the torque values of all edge pixels $E(P)$ within the patch:

$\tau_P = \frac{1}{2|P|} \sum_{q \in E(P)} \tau_{pq}$   (3.2)

² Here, with a slight abuse of notation, we describe $\vec{r}_{pq}$ and $\vec{F}_q$ as two-dimensional vectors, and denote the cross product of these two-dimensional vectors as the signed scalar magnitude of the resulting vector obtained by cross-multiplying them. Writing $\vec{r}_{pq}$ and $\vec{F}_q$ as 3D vectors (with 0 in the third component), their cross product $\vec{\tau}_{pq}$ either points "out" (upwards), in which case $\tau_{pq}$ is positive, or downwards, in which case $\tau_{pq}$ is negative.

³ The sign of $\tau_{pq}$ depends on the direction of the tangent vector. In this work, we define the direction based on the image contrast and compute it from the sign of the image gradient. As we have shown in §2.5.1, this direction is fixed given border ownership, which is not considered here.

We compute eq. (3.2) over multiple scales $s \in S$ for every image point, and we extract the largest $\tau_P$ over all scales to create a two-dimensional torque value map $T_I$ (Fig. 3.3(b-2)) with the same dimensions as $I_e$. The extrema in the value map indicate locations in the image that are likely centers of closed contours (crosses in Fig. 3.3(b-2)), denoted as $P_c$, and we consider those with the largest $\tau_P$ as possible proto-object centers. We use the top 20 largest $\tau_P$ in our current implementation. For each extremal center $p_c \in P_c$, we can also compute the torque contribution per edge pixel, $\tau_{p_c q_i}$, via eq. (3.1). Setting a threshold $t_c$ on the torque contribution, we obtain a set of $n$ edge pixels (with $\tau_{p_c q_i} > t_c$) which we denote as $Q_{p_c} = \{q_i\}, i \in \{1, \dots, n\}$ (shown as selected contours in Fig. 3.3(c)).

We highlight two important properties of the operator that make it ideal for grouping edges that support the presence of proto-objects. Firstly, the summation operation in eq. (3.2) strongly biases the operator against edge pixels that have different orientations within the image patch $P$.
This means that randomly oriented edges from noise or textures have a smaller torque contribution to $\tau_P$ compared to edges whose orientations are more coherent towards forming a closed contour. Secondly, the cross product between $\vec{r}$ and $\vec{F}$ will be large if an edge pixel is far away from the center $p$, implying that the patch size associated with an extremal point is a good estimate of the object's scale.

For 2.5D images, we modify the definition of the image torque above so that depth information is incorporated. The key idea is to add an additional depth constraint so that contours with the same depth values as the torque centers are preferred over contours with different depth values. This way, we enforce some form of depth consistency within the torque contour grouping framework when depth information is available. Formally, from eq. (3.1), we apply additional weights $w_{pq}$ that measure the absolute difference in depth values between an edge point $q$ and the center $p$:

$\tau^d_{pq} = \vec{r}_{pq} \times (w_{pq}\vec{F}_q)$   (3.3)

with $w_{pq} = \mathrm{abs}(I_d(p) - I_d(q))$, where $I_d$ is a $W \times H$ depth image that records the depth value per image pixel and $\mathrm{abs}(\cdot)$ denotes the absolute value. The torque of an image patch with depth information is similarly derived from $\tau^d_{pq}$ via eq. (3.2).

In practice, we use an efficient implementation⁴ via the method of Summed Area Tables [46] (integral images) to compute the image torque per patch in constant time. To achieve further efficiency, we use a discrete set of angles to represent the edge vectors (we use 8 in our current implementation). We precompute and sum up the image torque per edge pixel into a summed table per angle. Summing up the responses over all discrete angles enables us to compute $\tau_{pq}$ efficiently. For $\tau^d_{pq}$, which includes depth information, we also create a set of summed tables for $w_{pq}$ per displacement angle, which affords us the same constant-time computational complexity as the non-depth torque $\tau_{pq}$.
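For concreteness, the patch torque of eqs. (3.1) and (3.2) can be computed directly, without the summed-area-table speedup described above. The following is a minimal sketch, not the thesis code; the edge-map and per-pixel tangent-vector inputs are assumptions about the data layout.

```python
import numpy as np

def patch_torque(edges, tangents, center, radius):
    """Direct evaluation of the image torque of a square patch (eqs. 3.1-3.2).

    edges    : (H, W) boolean edge map
    tangents : (H, W, 2) unit tangent vector (row, col components) per edge pixel
    center   : (row, col) patch center p
    radius   : half the patch side length
    """
    H, W = edges.shape
    r0, c0 = center
    tau = 0.0
    for r in range(max(0, r0 - radius), min(H, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(W, c0 + radius + 1)):
            if edges[r, c]:
                ry, rx = r - r0, c - c0      # displacement vector r_pq
                fy, fx = tangents[r, c]      # tangent vector F_q
                tau += rx * fy - ry * fx     # 2D cross product: signed scalar
    area = (2 * radius + 1) ** 2             # |P|
    return tau / (2 * area)
```

On a discretized circle whose tangents wind consistently around the patch center, the torque is positive and larger than at a patch placed away from the circle, which is exactly the closure-seeking behaviour the operator exploits.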
The original image torque [209], however, is a purely bottom-up procedure: it detects potential proto-object locations $p_c$ and supporting contours $Q_{p_c}$ with no preference towards any particular object class. In the next two sections we show that by integrating this bottom-up information from the torque with the shape context local descriptor, we extend the operator so that it becomes sensitive to a target object class.

⁴ Code available online at http://www.umiacs.umd.edu/research/SRVC/NSF-project/

3.3.2 Torque shape context descriptor

Let us return to eq. (3.1), which defines the image torque $\tau_{pq}$ between an edge pixel $q$ and the associated center pixel $p$. Since $\vec{r}_{pq}$ is fixed (edges are fixed in a 2D image), one way to modify $\tau_{pq}$ is to change the weight on $\vec{F}_q$ as follows:

$\tau^{\omega}_{pq} = \vec{r}_{pq} \times \vec{f}(\vec{F}_q)$   (3.4)

where $\vec{f}(\cdot)$ can be any vector-valued function that modifies the tangent unit vector $\vec{F}_q$ appropriately. In this work, we define $\vec{f}(\cdot)$ to be a normalized contour matching score function that is larger if edge pixel $q$ is similar to the target object's contours and smaller otherwise. We detail in the sections that follow how the final form of $\tau^{\omega}_{pq}$ in eq. (3.12) is derived, which tunes the torque mid-level operator towards the target object class for detection and recognition.

There are numerous methods for matching local edge pixels, among which the most popular is the shape context descriptor [10]. Given a set of edge pixels $Q_{p_c} = \{q_1, \dots, q_n\}$, for each point $q_i$ the shape context descriptor $h^{sc}_i$ is defined as a coarse histogram of the relative coordinates of the remaining $n - 1$ points:

$h^{sc}_i(k) = \#\left[\, q_j \neq q_i : (q_j - q_i) \in \mathrm{bin}(k) \,\right], \; j \neq i$   (3.5)

In the above equation, $(q_j - q_i)$ denotes the coordinate difference between $q_j$ and $q_i$ in log-polar space, and $\mathrm{bin}(k)$ denotes the $k$th bin of the log-polar histogram centered over the $i$th edge point, $q_i$.
This descriptor is tolerant to small lo- calized deformations (due to the histogramming of the distances), and is scale and translation invariant. 74 However, when the descriptor is applied on contour fragments, Q′pc ⊆ Qpc by breaking them up into codons there will be some ambiguous edge fragments that can be matched to object contour fragments of different target object classes. The reason is that the shape context in its original form does not encode any mid-level information on how the fragments are related to the object that it is supposed to support (Fig. 3.4 (Middle-r1)). In addition, shape context by construction is not rotationally invariant as the log-polar histograms are defined over a fixed coordinate system. Thus we need to account for target objects that present themselves in a variety of poses (Fig. 3.4(Middle-r2)). To overcome these two shortcomings, we introduce two enhancements to the shape context descriptor by: 1) Embedding border ownership information through image torque to create a more robust descriptor, termed the torque shape context (Fig. 3.4(right)), that can better match contour fragments (§3.3.2.1) and 2) As a pre-processing step, we estimate the amount of rotation between the test and model by computing the cross-correlation of the descriptor’s angular bins via the Fast Fourier Transform (FFT) (§3.3.2.2). Finally, we show how the torque shape context descriptor is matched efficiently via dynamic programming in §3.3.2.3. 3.3.2.1 Robust contour fragment matching from border ownership in- formation To improve the matching of contour fragments, we introduce in this work a new descriptor that extends shape context by embedding within the angular bins 75 Figure 3.4: Why shape context is insufficient for matching contour fragments in clutter. (Left panel) Input image with torque value map. The two torque fixations considered are boxed as r1 and r2 and the model is Saw. Model points are red and test points are blue. 
Black lines indicate correspondences. (Middle panel) Original shape context matchings. r1: Wrong matches due to similar histograms: notice that the Borer object is matched to the handle of the Saw model. r2: Wrong matches of test Saw points as shape context is not rotationally invariant. (Right panel) Robust matching using torque shape context. r1: Fewer points from the Borer object are matched due to the border ownership embedding. r2: Rotational invariance enables matching of the rotated Saw to the model.

Figure 3.5: Constructing the torque shape context: Selected codon highlighted with respective $p_c$, $\vec{r}_{p_c q_i}$ and $\theta_{p_c q_i}$ from torque. $h^{\tau}_i$ is constructed by adding soft-weighted counts from angular bins in $h^{\angle}_i$ that are intersected with $\vec{r}_{p_c q_i}$ (oriented along $O_g$, see Fig. 3.9). Red means more counts, gray means no counts. The sum of the original shape context $h^{sc}_i$ bin counts with $h^{\angle}_i$ produces $h^{\tau}_i$.

of the shape context histogram additional information that indicates the location of $p_c$, i.e. the torque center that this fragment is supporting. Formally, given a shape context histogram $h^{sc}_i(k)$ for edge point $q_i \in Q_{p_c}$ with corresponding torque center $p_c$, we define the torque shape context histogram, $h^{\tau}_i(k)$, as the sum of the original shape context bins $h^{sc}_i(k)$ and "soft weighted" angular bins, $h^{\angle}_i(k)$, that are aligned towards $p_c$ (Fig. 3.5):

$$h^{\tau}_i(k) = h^{sc}_i(k) + h^{\angle}_i(k) = h^{sc}_i(k) + K(\angle\mathrm{bin}(k) \equiv \theta_{p_c q_i}) \tag{3.6}$$

where $\angle\mathrm{bin}(k)$ denotes the angular bins of the shape context histogram $h^{sc}_i(k)$, and $\theta_{p_c q_i}$ is the angle that the vector $\vec{r}_{p_c q_i}$ makes with respect to $O_g$ within the coordinate system of the shape context, as shown in Fig. 3.5. $K(\cdot)$ is a normalized "truncated"

Figure 3.6: Using border ownership information for robust matching in clutter. (Left) Codons in clutter to be matched with the model Bottle. The two codons marked with * are correct. (a) Using only shape context, many mismatches occur because of similar histograms in clutter, e.g. $h^{sc}_i$ and $h^{sc}_j$.
Codon colors code for corresponding matches with the model codons. (b) Using the border ownership information embedded in the torque shape context, many mismatches are avoided since their histograms $h^{\tau}_i$, $h^{\tau}_j$ and corresponding torque centers, $p_c$ and $p'_c$, are more discriminative.

Gaussian $N(\theta_{p_c q_i}, \sigma^2_K)\big|^{\theta_{p_c q_i}+\pi/2}_{\theta_{p_c q_i}-\pi/2}$ that reweighs bin counts only on the side of the shape context histogram pointing towards $p_c$ and has zero influence on the other side (Footnote 5). What $K(\cdot)$ does is to weigh the angular bins of $h_i(k)$ nearest to $\theta_{p_c q_i}$ more than those angular bins that are not aligned towards $\theta_{p_c q_i}$. This truncated "soft weighting" of the angular bins in $h^{\tau}_i(k)$ entails two important properties that are key for matching contour fragments in clutter:

Footnote 5: We define bins that are oriented towards the torque center as those bins that are captured within the half-circle centered along $\vec{r}_{p_c q_i}$, the vector with direction $\theta_{p_c q_i}$. Since the distribution $N(\theta_{p_c q_i}, \sigma^2_K)$ is positive everywhere, truncating the distribution so that it is active only between $\theta_{p_c q_i} - \pi/2$ and $\theta_{p_c q_i} + \pi/2$ achieves the desired effect.

1) As $K(\cdot)$ is active only on the side facing $p_c$, the set of torque shape contexts $\{h^{\tau}_i \mid q_i \in Q_{p_c}\}$ effectively encodes the "ownership" side of the set of edges in $Q_{p_c}$ with respect to the torque center $p_c$. This makes matching contour fragments $Q'_{p_c}$ with a target model much more discriminative in clutter, since similarly-shaped fragments (with similar $h^{sc}_i(k)$) must have the same $p_c$ as support as the model for a strong match to occur. For example, Fig. 3.6 illustrates the case where random fragments that have the same $h^{sc}_i(k)$ (due to noise in the histogram counts or nearby edges) can be differentiated using the additional information from $p_c$.

2) By weighing the bin counts softly via $K(\cdot)$, the matching of contour fragments is also robust against a certain amount of perturbation and deformation of the overall shape that the fragment belongs to.
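To make the soft weighting concrete, the following Python sketch implements a truncated-Gaussian reweighting of the angular bins in the spirit of eq. (3.6). The bin count, the normalization, and the choice to add the angular weights to every radial ring of the histogram are illustrative assumptions:

```python
import numpy as np

def angular_soft_weights(theta_pcqi, n_theta=12, sigma_k=0.5):
    """Sketch of the truncated Gaussian K(.) in eq. (3.6).

    Returns per-angular-bin weights: a Gaussian centered on theta_pcqi
    (the direction towards the torque center p_c), normalized, and set to
    zero outside the half-circle facing p_c.
    """
    centers = (np.arange(n_theta) + 0.5) * 2 * np.pi / n_theta  # bin center angles
    # smallest signed angular difference between bin center and theta_pcqi
    d = np.angle(np.exp(1j * (centers - theta_pcqi)))
    w = np.exp(-d**2 / (2 * sigma_k**2))
    w[np.abs(d) > np.pi / 2] = 0.0        # truncate: only the side facing p_c
    s = w.sum()
    return w / s if s > 0 else w

def torque_shape_context(h_sc, theta_pcqi, n_r=5, n_theta=12):
    """h_i^tau = h_i^sc + h_i^angle (eq. 3.6, sketch): add the soft angular
    counts to the shape context histogram (here broadcast over all radial
    rings; the exact layout is an assumption)."""
    h = h_sc.astype(float).reshape(n_r, n_theta)
    h = h + angular_soft_weights(theta_pcqi, n_theta, sigma_k=0.5)
    return h.ravel()
```

Bins on the half-circle away from $p_c$ receive zero weight, which is what encodes the ownership side of the fragment.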
This is important, since the target model must match fragments in a variety of camera viewpoints. This also motivates why only angular bins are used, since (relative) angles are less likely to change under various image deformations (up to an affine transformation) (Fig. 3.7). We illustrate the advantages of using $h^{\tau}_i(k)$ on real cluttered data in Fig. 3.8, where it enables us to 1) distinguish between ambiguous contour fragments with similar shape contexts but different $p_c$, and 2) perform partial contour matching under occlusion.

3.3.2.2 Rotational invariance via the Fast Fourier Transform

For rotational invariance, the most straightforward approach is to simply define the reference frame to be the tangent vector $\vec{F}_q$ at each edge point $q$.

Figure 3.7: Robustness against deformations. The selected torque shape context remains stable for deformations induced by (a) shearing, (b) scale changes and (c) shifts in the torque center.

Figure 3.8: (Left) Matching in clutter using torque shape context. (a-1) Input image + edge map. (a-2) Selected proto-object center boxed. (b) Comparing the matches to the model codons with (b-1) shape context and (b-2) torque shape context. Notice that fingers and noisy edges do not have the correct support, and are not matched in (b-2). (c) Final modulated edge weights. (c-2) with torque shape context identifies more of the correct edges than (c-1). (Right) Partial contour matching. The saw's handle (boxed) is occluded by the hand (top), but the blade is detected correctly (below).

Figure 3.9: Estimating the phase lag $O_g$ from the angular bins of the torque shape context at the torque center. (Left + Middle) Using the angular bin vectors (numbers indicate the bin ID) from the model and the test edges, we estimate the phase lag $O_g$ from the FFT of the two signals. (Right) Using $O_g$, we "unrotate" the test edges before matching the descriptors (black lines indicate correspondences).

However,
this approach in practice tends to significantly reduce the discriminatory power of the descriptor, because tangents are easily corrupted by noise and discretization effects. Instead, we propose to compute an additional torque shape context descriptor centered at the torque centers of the test and model, $p^g_c$ and $p^m_c$, that estimates the amount of rotation between them, so that we can "unrotate" the test contours before matching them with the model contours (Fig. 3.9).

The key idea is to apply a 1D Fast Fourier Transform (FFT) to a 1D vector, $\vec{a}_g = \langle h^{\angle}_{p^g_c}(1), \cdots, h^{\angle}_{p^g_c}(\kappa) \rangle$, derived from the angular bin counts of a torque shape context located at the torque center, $h^{\angle}_{p^g_c}(k)$, with bin 0 corresponding to the first component of the vector, and so on until all bins are accounted for (we used $\kappa = 60$ bins for this part to get more resolution). This vector succinctly captures the structure of the edge points while being robust against changes in scale and translation. We obtain $\vec{a}_m$ in a similar fashion from $h^{\angle}_{p^m_c}(k)$ of the model. The cross-correlation between the two discrete signals, $(\vec{a}_g \star \vec{a}_m)[\nu]$, is then obtained via the FFT to determine the most significant phase lag, $O_g = \arg\max_{\nu} (\vec{a}_g \star \vec{a}_m)[\nu]$, which is an estimate of the rotation that exists between the test and model edges. We then use $O_g$ to "unrotate" the test contours before matching them with the model. Since the two signals $\vec{a}_g, \vec{a}_m$ are (potentially) circularly shifted versions of each other, there are four possible orientations (one per quadrant) that relate the test contours to the model, and we consider all four orientations when we perform multiscale contour matching (§3.3.3).

Figure 3.10: Effects of using the FFT to estimate $O_g$ on matching accuracy: (a-1) With pose estimation. (a-2) Without pose estimation. The hammer is much better localized and scored using the torque shape context when $O_g$ is applied.
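The phase-lag estimation described above can be sketched with a few lines of NumPy; the circular cross-correlation is computed in the Fourier domain (the function name and interface are our own):

```python
import numpy as np

def estimate_phase_lag(a_g, a_m):
    """Estimate the rotation O_g between test and model angular-bin vectors
    via circular cross-correlation computed with the FFT (sketch of §3.3.2.2).

    a_g, a_m: 1D vectors of angular bin counts (kappa bins each).
    Returns the lag nu (in bins) maximizing the circular cross-correlation.
    """
    A = np.fft.fft(a_g)
    B = np.fft.fft(a_m)
    xcorr = np.real(np.fft.ifft(A * np.conj(B)))  # circular cross-correlation
    return int(np.argmax(xcorr))                  # O_g, in units of bin width
```

With $\kappa = 60$ bins, a lag of $\nu$ corresponds to a rotation of $\nu \cdot 6$ degrees; the test contours are "unrotated" by $O_g$ before descriptor matching.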
Compared to matching over a large number of orientations, this approach drastically reduces the number of orientation poses to search to just four. We demonstrate the effects of imposing rotational invariance in Fig. 3.10. One can see that without imposing $O_g$, the object Hammer is not as well detected compared to the case where $O_g$ is used to define the reference frame. In §3.4.1 we present quantitative results over a challenging hand manipulation dataset that highlight the importance of this procedure in improving the recognition of tools that are often occluded and placed at random orientations.

3.3.2.3 Matching of torque shape context descriptors

Following [10], we compare the torque shape contexts defined in eq. (3.6) using the $\chi^2$ statistic. We use the dynamic programming method of [273] to compute correspondences $\phi$ by minimizing the overall cost of matching, $C_{\phi}$, between two edge fragments $G'_{p^g_c}$ (test) and $M_{p^m_c}$ (model):

$$C_{\phi}(G', M) = \gamma_{sc}\, C_{sc}(G', M) + \gamma_{\angle}\, C_{\angle}(G', M) \tag{3.7}$$

where we drop the subscripts $p^g_c$ and $p^m_c$ to simplify notation. $C_{sc}(\cdot)$ and $C_{\angle}(\cdot)$ are the shape context matching costs for the original shape context (first term in eq. (3.6)) and the angular bin histograms (second term in eq. (3.6)) respectively. We impose $\gamma_{sc} + \gamma_{\angle} = 1$ so that we can control the relative importance of these two histograms in influencing the local matching score within the torque shape context. For simplicity, we denote the SC and angular components of the torque shape context histogram for the $i$th test point and the corresponding $\phi(i)$ model point as $g^t_i$ and $m^t_{\phi(i)}$, with $t \in \{sc, \angle\}$ respectively. The matching costs $C_t(\cdot)$, $t \in \{sc, \angle\}$, for these two components are similarly defined as:

$$C_t(G', M) = \sum_{i=1}^{n'} \chi^2(g^t_i, m^t_{\phi(i)}) \tag{3.8}$$

where we sum up the $\chi^2$ distances computed between the $t$ components of the test points' torque shape contexts $g^t_i$ and their $n'$ corresponding shape contexts $m^t_{\phi(i)}$

Figure 3.11: Multi-scale edge matching: (a) Detail of p8 from Fig. 3.3(b-2).
Neighboring torques $p_c$ with their supporting edges (in similar colors) are combined. (b-1) to (b-3) Increasing scales of combining neighboring codons together for matching.

in the model. $\chi^2$ is defined for two shape context histograms centered at $(g^t_i, m^t_{\phi(i)})$ as:

$$\chi^2(g^t_i, m^t_{\phi(i)}) = \frac{1}{2} \sum_{k=1}^{K} \frac{\left[h^t_i(k) - h^t_{\phi(i)}(k)\right]^2}{h^t_i(k) + h^t_{\phi(i)}(k)} \tag{3.9}$$

Using the correspondences, we define the torque shape context matching distance, $D_{\tau sc}$, as the weighted mean of the shape context matching costs over the $n'$ matched points in $G'$:

$$D_{\tau sc}(G', M) = \frac{1}{n'} \sum_{i=1}^{n'} C_{\phi(i)}(G', M) \tag{3.10}$$

Since $D_{\tau sc}$ is a local measure of similarity of partial edge fragments, we show in §3.3.3 how we use it in a multi-scale approach to develop a mid-level contour matching score function $\vec{f}(\cdot)$ that is sensitive to the target object class.

3.3.3 Object sensitive torque via multi-scale matching of supporting contours

Although the matching of edge fragments enables us to detect possible partial contours that indicate the presence of the target object, it is only a weak indicator, and one needs to check whether there is also sufficient support from neighboring fragments to strengthen the hypothesis. Motivated by this observation, we pursue the following multi-scale approach of progressively combining and matching neighboring edge fragments aided by torque, as shown in Fig. 3.11. From the torque grouped edges $Q_{p_c}$, we first combine neighboring $Q_{r_c}$ belonging to nearby centers that fall within the detected bounding box of $p_c$ to form a larger set of grouped edges $R_{N_c}$, where $N_c$ is a new object center estimated from the center of gravity of all the contributing neighbors' proto-object centers. This combination of neighboring torques is crucial for target object classes (e.g. Giraffes) that have long and thin structures, and can only be represented via multiple torque centers.

Next, we group the edge pixels $r_i$ in $R_{N_c} = \{r_1, \ldots, r_f\}$ into codon fragments so as to obtain a more compact representation of a set of $d$ codons, $C_g = \{R'_1, \ldots, R'_d\}$. Starting at codon $R'_1$, we progressively select and combine the next $J$ neighboring codons, $\{R'_{\{1\}}, \ldots, R'_{\{1+J\}}\}$, for comparison (Footnote 6) with each of the $l$ codons from the model contours, $C_{mo} = \{M_1, \ldots, M_l\}$, by computing $D_{\tau sc}(R'_{\{1\},\ldots,\{1+J\}}, M_{\{1\},\ldots,\{l\}})$ from eq. (3.10), with a slight abuse of notation. This results in a $W \times H \times J \times l$ matrix of distance scores corresponding to each combination. This process is repeated for each of the $d$ codons, which gives us a final $W \times H \times (d \times J \times l)$ matrix that records the value of $D_{\tau sc}$ at every edge pixel location in $R_{N_c}$. We then select the smallest $D_{\tau sc}$ across all $d \times J \times l$ levels to yield the final distance score for each $r_i$, denoted as a 2D torque shape context distance map, $E_{D_{\tau sc}}$. This is repeated over all four possible global orientations $O_g$ described in §3.3.2.2, and we select the orientation that yields the smallest $E_{D_{\tau sc}}$.

Footnote 6: The codons are indexed in a clockwise direction.

A note on the computational complexity of this step. Since $d$ and $l$ are small (typically 15 and 6) and we set $J$ to a small number as well (3 to 5, depending on the object class), we are able to reasonably compare all combinations of codons over several scales with a direct brute-force approach. This is an important advantage of using the compact codon representation (a mid-level representation by itself). In comparison, other methods performing partial edge matching [187, 234] use all edge pixels at once.

In order to convert the distance score for each $r_i$ in $E_{D_{\tau sc}}$ to a normalized weight, we use an exponential function:

$$W_{D_{\tau sc}}(r_i) = \beta_c + \beta_f \exp\left(-E_{D_{\tau sc}}(r_i)/(2\sigma)\right), \tag{3.11}$$

where $\beta_c, \beta_f, \sigma$ are parameters that determine how much we penalize large distances relative to smaller ones. For any edge point $q$, by applying eq.
(3.11) to the scale at which $\vec{F}_q$ was detected, we obtain the modulated image torque that is sensitive to the target object class:

$$\tau^{\omega}_{pq} = \vec{r}_{pq} \times \left( W_{D_{\tau sc}}(q)\, \vec{F}_q \right) \tag{3.12}$$

where $\vec{f}(\vec{F}_q) = W_{D_{\tau sc}}(q)\, \vec{F}_q$, as in eq. (3.4). Finally, we can compute the modulated torque per patch $P$ by replacing $\tau_{pq}$ in eq. (3.2) with $\tau^{\omega}_{pq}$:

$$\tau^{\omega}_P = \frac{1}{2|P|} \sum_{q \in E(P)} \tau^{\omega}_{pq} \tag{3.13}$$

For 2.5D images, we add the depth constraint $w_{pq} = \mathrm{abs}(I_d(p) - I_d(q))$, similarly as described in §3.3.1, to redefine the modulated torque with depth information:

$$\tau^{\omega_d}_{pq} = \vec{r}_{pq} \times \left( w_{pq}\, W_{D_{\tau sc}}(q)\, \vec{F}_q \right) \tag{3.14}$$

and we define the modulated torque per patch in the same way as in eq. (3.13), replacing $\tau^{\omega}_{pq}$ with $\tau^{\omega_d}_{pq}$.

A crucial point to note is that even though our approach does not consider all possible lengths and combinations of the test edges with the model, by embedding $W_{D_{\tau sc}}$ within the mid-level torque operation, we retain all the advantages of the image torque. As long as there is sufficiently strong support for an edge to belong to the target, arranged in a coherent manner with other edges of similar weights, it is a strong indication of the presence of the target object. $\tau^{\omega}_P$ thus transforms the original image torque so that it is now tuned towards the object model: $W_{D_{\tau sc}}(q)$ is large for edge points $q$ from test codons $C_g$ that are similar to the model codons in $C_{mo}$, while it is small for codons that are dissimilar to the model. In addition, because $W_{D_{\tau sc}}(q)$ is derived from codon comparisons with the model, codons from other (unknown) object categories that are similar to the model will be detected as well: e.g. apples and oranges (round), sticks and rods (elongated), etc. Viewed in this way, our approach can generalize to be sensitive to common object parts if we use a generic model of such parts. On the other hand, if the model is extremely specific and has codons that are unique to the particular object class, then our approach becomes very selective.
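For illustration, eqs. (3.11) and (3.12) reduce to a few lines of Python. The default parameter values below are the ones reported in §3.4; the helper names are our own:

```python
import numpy as np

def modulation_weight(E, beta_c=0.05, beta_f=0.95, sigma=0.05):
    """Eq. (3.11): map a torque shape context distance E_{D_tau_sc}(r_i)
    to a normalized edge weight in [beta_c, beta_c + beta_f]."""
    return beta_c + beta_f * np.exp(-E / (2.0 * sigma))

def modulated_torque(r_pq, F_q, w):
    """Eq. (3.12) for one edge pixel: the 2D cross product
    r_pq x (w * F_q), where w = W_{D_tau_sc}(q)."""
    fq = w * np.asarray(F_q, dtype=float)
    return r_pq[0] * fq[1] - r_pq[1] * fq[0]
```

A perfect codon match ($E = 0$) yields the full weight $\beta_c + \beta_f = 1$, while a dissimilar codon decays towards the floor $\beta_c$, so dissimilar edges still contribute a small residual torque.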
The choice of how selective or general we want the model to be is task-dependent, and selecting the appropriate model automatically is part of our future work. We provide a summary of the algorithm in Appendix B. When the depth image $I_d$ is available, the algorithm remains the same, except that we replace eqns. (3.2) and (3.13) with eqns. (3.3) and (3.14) respectively. The run-time complexity of the complete approach is $O(|P_c| \times J \times d \times l)$, as it is dominated by the contour matching step.

3.4 Experiments

We perform experiments over four datasets. The first one, termed the UMD Hand-Manipulation dataset, is collected by a mobile robot observing humans performing manipulation activities using various tools and objects. This dataset is challenging because the hands, tools and objects induce occlusions, clutter and deformations (translation, scale and rotation), which are typical of manipulation activities. The goal is to show that our approach can handle such situations reliably. The second dataset is the CMU Kitchen Occlusion dataset [111], which consists of eight common kitchen objects collected under severe occlusions and clutter. We demonstrate our approach's ability to detect the presence of the target from a single viewpoint and compare its performance with state of the art template based object detectors embedded with a learned occlusion model. To show that our approach compares well with other state of the art contour-based object recognition approaches, we use the ETHZ-Shapes dataset for evaluating object detection and localization performance when there are significant variations in environmental conditions: background, lighting and camera viewpoints. Finally, we demonstrate the feasibility of our approach on a mobile robot platform where the task is to search for a specific object in clutter as the robot moves around the table, inducing occlusions and viewpoint changes.
For all four experiments, we use the following meta-parameters: $\gamma_{sc} = \gamma_{\angle} = 0.5$, $\beta_c = 0.05$, $\beta_f = 0.95$, $\sigma = 0.05$, $\sigma_K = 0.5$. These parameters were determined by optimizing the mean precision rate against ground truth on a separate subset of 100 training images derived from the four datasets used in the experiments. $t_c$, the threshold to select the strongest edges, is set to the 50th percentile of the ranked torque contribution scores from the grouped edges. The number of codon neighbors to combine, $J$, is set to 3 for all object categories except for Giraffes and Swans (from the ETHZ-Shapes dataset), which have $J = 5$ so as to fully account for the long thin structures (neck and legs) that are common in these two categories. It is possible to set $J = 5$ for all categories, but at the cost of longer processing time; the recognition accuracy would not be affected, since we would simply be doing a more extensive search over larger scales of combined codons. We use the Pb edge detector of [193] to derive $I_e$. For computing torque, we search over image patches with sizes ranging from 3 pixels to a quarter of the input image height and width. In practice, we found that the recognition accuracy (mean precision) of the approach is not very sensitive to the parameters used, but setting $J$ and $|P_c|$ to large values slows down recognition significantly. Using the current parameters, typical running times for a 320×240 image are around 15 seconds using a Matlab implementation running on a Core i7 2.4GHz machine.

We predict the target's location and scale from the modulated torque map, $T^m_I$, by selecting the largest modulated torque response over the same image patch scales as noted above. For evaluating object detection performance, we admit a true positive using the PASCAL criterion: the overlap between the predicted object's bounding box and the ground truth bounding box must exceed 50% of the union of the two boxes.
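The PASCAL criterion above amounts to an intersection-over-union test, sketched here (the helper name is ours):

```python
def pascal_overlap(box_a, box_b):
    """PASCAL criterion: intersection area over union area of predicted and
    ground-truth boxes, each given as (x1, y1, x2, y2). A detection counts
    as a true positive when this value exceeds 0.5."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```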
For multiple detections near the ground truth, we select the one with the largest absolute torque value. For scoring the detections, we normalize the modulated torque at the predicted object center, $\tau^m_P$ (replacing $\tau_{pq}$ in eq. (3.2) with $\tau^m_{pq}$ from eq. (3.4)), with $\tau_P$.

3.4.1 Evaluation over UMD Hand-Manipulation dataset

We demonstrate our approach on a dataset collected by a mobile robot that is actively observing a table full of tools/objects in clutter manipulated by humans. This dataset, termed the UMD Hand-Manipulation dataset, consists of 6 video sequences (around 1500 frames each) of 3 different human subjects constructing a partial wooden frame using 5 tool classes: {Borer, Hammer, Ruler, Saw, Screwdriver}. This dataset is challenging because it has significant occlusions and orientation changes due to the hands and the active nature of the frame-making process. The goal is to show that our approach is able to handle partial occlusions under various viewpoints/orientations. In addition, we demonstrate the contribution of estimating $O_g$ using the FFT (§3.3.2.2) to improving the recognition accuracy.

Figure 3.12: Detection results from five sample frames of the UMD Hand-Manipulation dataset. (Rows) Target object class: (Top to Bottom) Borer, Hammer, Ruler, Saw, Screwdriver. (Columns) Left: $W_{D_{\tau sc}}$, where red means higher values, with target model contours at top-right. Middle: Modulated torque showing the top 2 object detections (red and green crosses). Right: RGB frames overlaid with detection results. Note that for Hammer and Saw, the objects are partially occluded by the hands.

We used the meta-parameters and evaluation procedure as indicated above. For obtaining the target model codons, we used the first ten frames and hand-annotated the target tool's contours. We then evaluated the rest of the sequence at sample intervals of 10 frames, which yielded a total of around 800 evaluated frames in the entire dataset.
We show some results from sample frames of the dataset in Fig. 3.12: the final edge weights $W_{D_{\tau sc}}$ and the predicted target objects with centers marked as crosses.

Figure 3.13: (Top-Left) Precision/Recall curves over the 6 videos in the UMD Hand-Manipulation dataset. (Top-Right) Corresponding DR/FPPI curves. (Below) Interpolated average precision (AP) and detection rates at 0.3/0.4 FPPI over the 5 tool categories.

                 Borer       Hammer      Ruler       Saw         Screwdriver
AP               0.77        0.82        0.80        0.89        0.79
0.3/0.4 FPPI     0.74/0.90   0.94/1.00   0.88/0.88   0.95/1.00   0.78/0.83

For evaluation, we report the Precision/Recall (PR) rates and corresponding Detection Rate/False Positives Per Image (DR/FPPI) curves. The results are summarized in Fig. 3.13 for all 5 tools considered, and compared to Fig. 3.14, where no rotational invariance is applied to the procedure.

Figure 3.14: Performance without incorporating rotational invariance via FFT: (Top-Left) Precision/Recall curves over the 6 videos in the UMD Hand-Manipulation dataset. (Top-Right) Corresponding DR/FPPI curves. (Below) Interpolated average precision (AP) and detection rates at 0.3/0.4 FPPI over the 5 tool categories.

                 Borer       Hammer      Ruler       Saw         Screwdriver
AP               0.61        0.10        0.34        0.85        0.41
0.3/0.4 FPPI     0.85/0.85   0.10/0.10   0.35/0.60   0.95/0.95   0.40/0.40

From the results of the full approach, we are able to localize the target objects in clutter with Average Precision (AP) ranging from 0.77 to 0.89, and detection rates at the standard 0.3/0.4 FPPI that range from 0.74 to 1.00. These results are on par with current object recognition approaches. The best detection using the full approach comes from Saw and Hammer, probably because the contours belonging to these two classes are very distinctive (and hence easy to discriminate) compared with the other tools. The worst results (in terms of AP) are from Borer, which is most confused with Screwdriver.
This is not surprising, since both of these tools share many common parts (with similar functions). The contribution of estimating $O_g$ via FFT is clearly shown in Fig. 3.14, where the improvements in AP range from 0.04 (Saw) to 0.72 (Hammer). The improvement is modest for Saw because, for most frames, the target tool was well aligned with the model's original orientation, making the estimation of $O_g$ unnecessary in most of the frames considered. However, for the other tools the improvements are much more significant, since they were placed and manipulated in very different orientations (such as Hammer) compared to the model (see Fig. 3.12, first column, where the model codons are shown at the top right).

Figure 3.15: When contour information alone is not enough. (Left) Model contours of the Marker class. (Right) Contours that have long parallel lines are highlighted (in green boxes): e.g. the handle of the hammer or the sides of a tape.

The decrease in performance of Borer (and to a large extent Screwdriver as well) compared with the other tools, as noted in Fig. 3.13, highlights one of the key shortcomings of the approach: the mid-level groupings over multiple scales do not capture enough global information about parts and their relationships to accurately separate out objects that consist of a subset of contours from other targets. An extreme example is the Marker class, which we have not considered here, but which consists only of two parallel contours, as shown in Fig. 3.15 (left). Due to the small number of contours in the model, such a configuration is highly ambiguous (Fig. 3.15 (right)). This result points to future work that should incorporate additional global mid-level information on the spatial configuration of object parts. For example, the Hammer class consists of two distinctive (functional) parts: 1) the handle and 2) the hammer head.
Modifying the torque operator to enforce grouping at the level of these subparts would enable us to distinguish hammer handles from markers, since a marker consists solely of a single part.

3.4.2 Evaluation over CMU Kitchen Occlusion dataset

We investigate the performance of our approach in severe clutter and occlusion using the single viewpoint subset of the CMU Kitchen Occlusion dataset introduced by [111], and compare it with the state of the art LINE2D algorithm of [106] as a baseline, as well as the robust version, rLINE2D, which compares edge points with the model's gradient orientation to decide if an edge point is consistent with a learned occlusion model. We did not compare against the full approach of [111], which includes a probabilistic occlusion prior, since our approach does not explicitly model occlusion. The dataset consists of eight textureless objects, {bakingpan, colander, cup, pitcher, saucepan, scissors, shaker, thermos}, placed among other common kitchen objects with a severe amount of occlusion. There are 100 testing frames per object class, with a single positive target per test image. For training, we are provided with a single viewpoint of the model as a mask and an image. We used the training image mask to extract the model codons, and used the same meta-parameters and evaluation procedure as described above over all eight object categories. Since [111] used the same PASCAL criterion to generate DR/FPPI curves, we are able to directly compare our results with LINE2D and rLINE2D, as shown in Fig. 3.16. The detection rates at 0.3/0.4/1.0 FPPI are summarized in Table 3.1.

              bakingpan        colander         cup              pitcher
Our Method    0.35/0.44/0.55   0.47/0.54/0.62   0.59/0.62/0.65   0.58/0.62/0.65
LINE2D        0.26/0.29/0.44   0.28/0.31/0.43   0.28/0.29/0.40   0.05/0.07/0.21
rLINE2D       0.27/0.32/0.51   0.48/0.51/0.65   0.47/0.49/0.60   0.45/0.48/0.62

              saucepan         scissors         shaker           thermos
Our Method    0.43/0.49/0.66   0.36/0.42/0.48   0.36/0.39/0.44   0.58/0.63/0.84
LINE2D        0.27/0.31/0.48   0.15/0.18/0.32   0.10/0.11/0.18   0.29/0.32/0.43
rLINE2D       0.50/0.54/0.67   0.27/0.31/0.46   0.20/0.23/0.35   0.55/0.60/0.73

Table 3.1: Comparing detection rates of our method, LINE2D [106] and rLINE2D [111] at 0.3/0.4/1.0 FPPI over the CMU Kitchen Occlusion dataset.

From the DR/FPPI curves, we first note that for all the objects, our method significantly outperforms LINE2D and performs at least on par with or better than rLINE2D, which includes an occlusion model of the target (which our method does not have). Second, from Table 3.1, our approach is able to obtain much better detection rates with a lower number of false positives (lower FPPI) compared to both methods. This shows that our approach is discriminative even when severe clutter is present. This improvement is due to the partial hierarchical matching via codons and the torque shape context, which match edge points with better accuracy while rejecting false positives with different torque centers more effectively than LINE2D or rLINE2D, which use gradient orientations only. We show some example detection results with the modulated torque in Fig. 3.17 that illustrate how the approach performs over this dataset.

Figure 3.16: DR/FPPI curves comparing our approach with LINE2D [106] and rLINE2D [111] over the eight object categories in the CMU Kitchen Occlusion dataset.

Figure 3.17: Detection results for the eight objects in the CMU Kitchen Occlusion dataset. (Rows) Target object class: (Top to Bottom) bakingpan, colander, cup, pitcher, saucepan, scissors, shaker, thermos. (Columns) Left: $W_{D_{\tau sc}}$, where red means higher values, with target model contours at top-right. Middle: Modulated torque showing the top 2 object detections (red and green crosses). Right: RGB frames overlaid with detection results.
3.4.3 Evaluation over ETHZ-Shapes dataset

We further evaluate our approach on the ETHZ-Shapes dataset, which is often used in the computer vision community as a standard baseline for evaluating 2D contour-based object recognition approaches. This dataset is divided into five object categories, {Applelogos, Bottles, Giraffes, Mugs, Swans}, and consists of 255 images containing instances of the objects with varying background, clutter, scale and viewpoint. We follow the same test/train split procedure as suggested by [254] for evaluation: the first half of each category is used to obtain the model codons from the ground truth contours, and the remaining half, together with the rest of the images, is used for testing. Because this dataset is widely used, it enables us to compare the performance of our approach with other state of the art contour-based object recognition approaches.

              Applelogos   Bottles   Giraffes   Mugs    Swans   Mean
Our Method    0.917        0.931     0.796      0.888   0.891   0.885
[190]         0.869        0.724     0.742      0.806   0.716   0.771
[254]         0.845        0.916     0.787      0.888   0.922   0.872
[187]         0.881        0.920     0.756      0.868   0.959   0.877
[288]         0.866        0.975     0.832      0.843   0.828   0.869

Table 3.2: Comparing interpolated average precision (AP) of the proposed method over the ETHZ-Shapes dataset.

We focus our comparisons on recent state of the art contour-based object detection methods [187, 190, 254, 288]. The Precision/Recall (PR) curves of these methods and their interpolated average precision (AP) are compared with the proposed method in Fig. 3.18 and Table 3.2 respectively.

Figure 3.18: Precision/Recall curves comparing [187, 190, 254, 288] to the proposed method over the ETHZ-Shapes dataset.

Figure 3.19: Comparison of DR/FPPI curves over the ETHZ-Shapes dataset.

              Applelogos    Bottles       Giraffes      Mugs          Swans         Mean
Our Method    1/1           1/1           0.930/0.930   0.958/0.958   0.938/0.938   0.965/0.965
[190]         0.95/0.95     0.929/0.964   0.896/0.896   0.936/0.967   0.882/0.882   0.919/0.932
[254]         0.95/0.95     1/1           0.872/0.896   0.936/0.936   1/1           0.952/0.956
[187]         0.92/0.92     0.979/0.979   0.854/0.854   0.875/0.875   1/1           0.926/0.926
[288]         0.90/0.90     1/1           0.92/0.92     0.94/0.94     0.94/0.94     0.940/0.940
[234]         0.933/0.933   0.970/0.970   0.792/0.819   0.846/0.863   0.926/0.926   0.893/0.905
[71]          0.777/0.832   0.798/0.816   0.399/0.445   0.751/0.8     0.632/0.705   0.671/0.72

Table 3.3: Comparing detection rates at 0.3/0.4 FPPI over the ETHZ-Shapes dataset.

Across all five categories, the proposed approach is comparable with state-of-the-art procedures; its most dominant performance is on Applelogos. Averaged over all 5 categories, our approach achieves the best overall mean AP among the compared methods, with a small improvement over [187]. In addition, we plot the Detection Rate/False Positives per Image (DR/FPPI) curves in Fig. 3.19. The detection rates at 0.3 and 0.4 FPPI are compared with several reported results in the literature in Table 3.3. The detection performance at these two levels is consistently on par with the state of the art, with the largest improvements on Applelogos and Giraffes. We show some example results in Fig. 3.20: the modulated torque with the final detections, and some failure cases. Similar to the discussion in the preceding sections, these failure cases occur because some model codons between classes may be very similar (such as between Swans and Giraffes). A more discriminative learning approach that incorporates more global-level part-based information should yield even better results.

Figure 3.20: Some example detection results with their modulated torque.
Edges show values of W_{Dτsc}; green boxes are ground truth, and red and blue boxes are the top min/max modulated torque values. Top row (left to right): Applelogos, Giraffes, Swans. Bottom row (left to right): Bottles, Mugs, and false detections of Applelogos and Giraffes. Best viewed in color.

3.4.4 Object recognition in clutter by a mobile robot

We demonstrate the feasibility of our approach for practical robotic applications on our mobile robot platform (Fig. 3.21 (left)). The robot consists of an Adept Pioneer P3-DX base together with a custom-made frame on which a Kinect RGB-Depth sensor is attached via a Directed Perception PTU-D46 pan-tilt unit (PTU). The robot's software runs over the Robot Operating System (ROS) [227], with appropriate interfaces implemented to send the Kinect RGB-Depth data to Matlab for processing by the proposed method. The robot is tasked to perform random movements using either the base or the PTU while observing a cluttered scene of objects on a table. The goal is to detect objects in clutter while inducing changes in viewpoint and occlusion through the movements. We used the same "UMD-clutter" dataset reported in our previous work [271]: we performed three different collections of Kinect RGB-Depth data with differing amounts of clutter per dataset, with around 1000 frames per sequence.

Figure 3.21: (Left) The mobile robot used in the experiment, with relevant hardware components highlighted. (Right) The four object categories used. For Mug and Spoon, two different instances exist in the dataset.

We focus on detecting four object categories (Fig. 3.21 (right)), {Book, Bowl, Mug, Spoon}, which are located at random positions on a table under various degrees of occlusion. We used the same meta-parameters and evaluation described above. For this dataset, we evaluated frames at intervals of 10 frames, yielding around 300 frames for evaluation.
As a baseline, we compared our approach with our previous work [271], termed "Shape-Torque", which uses multiple shape templates to define a multi-view model that modulates the torque response towards the desired target object. As a further comparison, we used the recent Hierarchical Matching Pursuit (HMP) method of [21], which learns a dictionary of RGB-Depth features for object recognition; a linear SVM classifier is then trained over the features for the four target object categories. For evaluation, we select the top 20 initial torque fixations per test frame, which are processed by the SVM classifier. To evaluate the contribution of depth information to the detection rates of the approach, we compared the standard (no depth) approach for computing the modulated image torque (eq. (3.12)) with the approach that uses depth information (eq. (3.14)). For HMP, we trained two SVM classifiers: one using RGB features only (HMP-RGB) and another using RGB-Depth features (HMP-RGBDepth). Since HMP does not provide bounding boxes, we cannot use the PASCAL criterion for evaluation. Instead, we admit all positive predictions, which results in a much higher detection rate at high recalls compared to the other approaches that localize the prediction with a bounding box. Fig. 3.22 shows the DR/FPPI curves for the entire dataset over the four object categories considered. The detection rates at 0.3/0.4 FPPI are summarized in Table 3.4.
Method                  Book       Bowl       Mug        Spoon      Wood Spoon
Our Method + Depth      0.27/0.37  0.61/0.68  0.57/0.60  0.32/0.37  0.42/0.42
Our Method (no Depth)   0.29/0.39  0.56/0.62  0.54/0.59  0.43/0.46  0.36/0.36
Shape-Torque [271]      0.33/0.33  0.14/0.14  0.43/0.43  0.17/0.17  0.09/0.09
HMP-RGBDepth [21]       0.00/0.01  0.13/0.38  0.00/0.00  0.06/0.06  0.25/0.33
HMP-RGB                 0.00/0.00  0.00/0.01  0.00/0.00  0.00/0.00  0.02/0.05

Table 3.4: Comparing detection rates of our approach with and without depth information against the baseline Shape-Torque [271] and HMP [21] at 0.3/0.4 FPPI over the UMD-clutter dataset.

From the results, we see that, with the exception of Spoon, the proposed method with depth information is on par with or better than the baseline and the standard non-depth approach at both FPPI levels.

Figure 3.22: DR/FPPI curves of the UMD-clutter data evaluated over four object classes. The DR/FPPI curves for the Wood Spoon (boxed) instance are presented to contrast with the results for the Spoon category, which is affected by bad depth estimates; see text for details.

This shows that, given a cluttered environment, using depth information enables us to reduce the influence of false contour groupings whose parts have very different depth values and are hence unlikely to come from the same object. Fig. 3.23 shows an example of how using depth information reduces false groupings to improve detection of the target. Obviously this assumption has its limitations, especially for objects with large depth disparities, e.g. Book, as shown by a slightly worse performance compared with not using depth information at 0.4 FPPI. The method also significantly outperforms both variants of HMP, even the one that uses depth information in addition to RGB. This is indicative of the robustness of using contour information for recognition under such challenging scenarios.
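The DR/FPPI operating points in Table 3.4 come from sweeping a detection-score threshold over ranked detections. A minimal sketch of reading off the detection rate at a given FPPI level (function and parameter names are ours, not the thesis implementation):

```python
import numpy as np

def dr_at_fppi(scores, is_tp, n_gt, n_images, levels=(0.3, 0.4)):
    """Sweep a score threshold over ranked detections and return the
    detection rate (fraction of ground-truth instances found) at the
    requested false-positives-per-image levels."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]  # 1 if detection matched a GT
    tp, fp = np.cumsum(hits), np.cumsum(1.0 - hits)
    dr, fppi = tp / n_gt, fp / n_images
    # best detection rate achievable without exceeding each FPPI level
    return [float(dr[fppi <= lv].max()) if (fppi <= lv).any() else 0.0
            for lv in levels]
```

For example, four ranked detections with hit pattern [1, 0, 1, 0] over 2 ground-truth instances and 10 images never exceed 0.2 FPPI, so both operating points report full recall.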
Finally, it is interesting to note that both variants of our approach improve over the baseline Shape-Torque approach even though we use only a single viewpoint in the model, versus the multiple (6 to 10) viewpoints used in Shape-Torque. This highlights the contribution of our codon-based torque shape context for robust matching under occlusion and clutter. For Spoon, the consistently poorer performance when using depth is caused by the depth estimates from the Black Spoon instances, which are usually wrong: the dark surface coloration tends to absorb the Kinect's IR radiation. The Wood Spoon instance, however, does not suffer from this issue, as shown in the DR/FPPI curves of the Wood Spoon instances only (Fig. 3.22, boxed), where depth information improves the result. We show some results from sample frames of the dataset in Fig. 3.24. Specifically, the figure shows the final edge weights W_{Dτsc}, the modulated torque value map with the depth constraint, and the predicted objects with centers marked as crosses.

Figure 3.23: How depth information helps in improving the image torque. (Left) Input Kinect RGB-D image with the target Bowl in a green box. (Right) Comparing the effects of (a) not using depth information and (b) using depth information. Region t1: Using depth information produces a more depth-consistent grouping in (b) compared to (a): notice there are three fixations corresponding to three objects on the table in (b) compared to four in (a). Region t2: As a result of this grouping, we are able to combine and compare groups of codons more accurately with the model. In (a), codon groupings near Book are erroneously weighted more due to wrong groupings, which are weighed down in (b) as their depth values are inconsistent. This enables the target Bowl to be correctly detected in (b).

Figure 3.24: Detection results using depth information for the four objects in the UMD-clutter dataset.
(Rows) Target object class, top to bottom: Book, Bowl, Mug, Spoon. (Columns) Left: W_{Dτsc}, where red means higher values, with the target model contours at top-right. Middle: Modulated torque with the depth constraint, showing the top two object detections (red and green crosses). Right: RGB frames overlaid with detection results.

3.5 Conclusions

In this chapter, we have presented a Gestalt-based approach to contour-based categorical object recognition that uses the image torque for the selection and grouping of specific target object contours under clutter, occlusions and viewpoint changes. Our approach proceeds in two stages. In the first stage, we use the torque as an attention mechanism to find initial proto-object locations by applying the torque to simple edge responses, possibly augmented with depth information. With the help of these proto-object locations, we then match edges in a multi-scale approach using a new shape context descriptor that takes into account border ownership information and object rotation. In the second stage, we use the torque to group the matched edge responses by modulating their weights within the operator. We evaluated the approach over four datasets: 1) the UMD Hand-Manipulation dataset, 2) the CMU Kitchen Occlusion dataset, 3) the ETHZ-Shapes dataset and 4) the UMD-clutter dataset collected by a moving robot observing a table with clutter. The results highlight the ability of the approach to handle occlusions, partial matches and orientation changes over large variations in environmental conditions, with state-of-the-art performance compared to other contour-based approaches.

The ability to recognize categories of objects using their shape information is, however, just one way of using Gestalt for higher-level visual tasks. Besides shape, symmetry and functionality are two innate attributes that invoke Gestalt-based recognition of objects. We discuss how these cues can be exploited to solve the FGO problem in the next two chapters.
Chapter 4: Detecting and Segmenting Symmetrical Regions

Symmetry, as one of the key components of Gestalt theory, provides an important mid-level cue that serves as input to higher visual processes such as segmentation. In this chapter, we propose a complete approach that links the detection of 1) reflection (bilateral) and 2) curved reflection symmetries to produce symmetry-constrained segments of structures/regions in real images with clutter.

For detecting bilateral symmetry, we propose a two-stage approach that first detects putative symmetrical locations, followed by a more expensive localization step.¹ We evaluate and compare our bilateral symmetry detector with the state-of-the-art feature-based detector of Loy and Eklundh (Loy-Eklundh) [184] over two datasets: 1) the Penn State University 2011/2013 symmetry competition datasets (PSU 2011/2013) and 2) the UMD Symmetry dataset. Extensive experiments with various ablations of the approach show that it retrieves more precise bilateral symmetries than Loy-Eklundh over most recall values.

For curved symmetry detection, we leverage patch-based symmetry features to train a Structured Random Forest (SRF) [140] classifier that detects multiscaled curved symmetries in 2D images. Experimental evaluations over two datasets, 1) SYMMAX-300 [278] and 2) NY-Roads [249], show that our SRF-based curved symmetry detector outperforms two state-of-the-art curved symmetry detectors [159, 278], with performance comparable to that of [249].

Next, using these symmetries, we modulate a novel symmetry-constrained foreground-background segmentation by their symmetry scores so that we enforce global symmetrical consistency in the final segmentation.

¹The detection of bilateral symmetry is currently under revision as [272]. Joint work with Hyoungjune Yi. Full results, datasets and code are available at http://www.umiacs.umd.edu/~cteo/object_symmetry/
This is achieved by imposing a pairwise symmetry prior that encourages symmetric pixels to have the same labels over an MRF-based representation of the input image edges; the final segmentation is obtained via graph-cuts. Experimental results over four publicly available datasets containing annotated symmetric structures, 1) SYMSEG-300, 2) BSD-Parts, 3) Weizmann Horse (both from [159]) and 4) NY-Roads [249], demonstrate the approach's applicability to different environments with state-of-the-art performance.²

²The detection of curved symmetry and symmetry-constrained segmentation was published as [270]. Code, data and more results are available at http://www.umiacs.umd.edu/~cteo/SymmetrySegmentation/

4.1 Introduction

Symmetry is a universal invariant, an innate attribute that is ubiquitous in nature. In Gestalt psychology, symmetry is considered one of the key grouping or mid-level cues for explaining human visual perception, and its detection facilitates early visual processes such as figure-ground segmentation [53]. From a biological viewpoint, it is well known that humans are extremely sensitive to reflection symmetry [39, 280]. There is also evidence that humans know about the symmetry of a figure prior to the onset of recognition: analysis of eye movement patterns by [182] found that symmetrical patterns are scanned with fixations that mostly fall on only one side of the symmetry axes.

In this chapter, we present approaches for detecting 1) reflection (bilateral) symmetry and 2) curved reflection symmetry. For bilateral symmetry, we are concerned with the detection of larger, global symmetry patterns, often associated with foreground objects. Curved reflection symmetries, on the other hand, are reflection symmetries that are much more local in spatial scale. Such a concept is known in the literature as ridges, ribbons or centerlines [177, 249], and is related to the classical Medial Axis Transform of Blum [19].
Equally important is the extraction of the symmetrical regions that support these symmetries, and a key requirement is that the final extracted regions (segments) must be symmetric as well. We consider both of these issues together in this chapter and present a complete approach for detecting symmetries and segmenting such symmetrical regions.

4.2 Related Works

4.2.1 Symmetry detection

The detection of various types of symmetry (bilateral/reflection, rotational and translational) from 2D images has a long history in computer vision; see [181] for an up-to-date survey of past and current techniques. The classic voting approach of the "Generalized Symmetry Transform" (GST) of Reisfeld et al. [230] is now largely surpassed by the feature-based method of Loy and Eklundh [184], which used symmetrical SIFT keypoint descriptors [183] for more robust detection of bilateral and rotational symmetries. GST, however, is still widely used, e.g. [142, 169]. Its main drawback is that, because it compares pointwise orientations, it is computationally expensive and fragile, and works well only in relatively clean images with little noise and texture. By contrast, Loy-Eklundh detects standard SIFT keypoints and derives a "mirrored" SIFT counterpart for each keypoint. Matches based on mirrored counterparts result in local symmetry axis "particles" that are then used to vote (via their location and orientation) in Hough space. The local maxima within the space, selected via an adaptive threshold method, are therefore indicative of the most consistent symmetry axes in the image. An important limitation of [184] is that its performance is heavily dependent on the detectability of SIFT features in the image. Since SIFT keypoints are essentially histograms of local edge orientations, textureless or shadowed regions will not generate a sufficient number of keypoints.
A recent extension by Lee and Liu [158] matched these descriptors within a 3D axis parameter space to detect curved reflection symmetries from keypoints. Instead of using keypoints, Tsogkas and Kokkinos [278] used Multiple Instance Learning to train a curved symmetry detector that combines multiscale patch-based feature histograms of intensity, color, texture and spectral cues, obtaining state-of-the-art detection performance on a large dataset of real images with clutter. Along similar lines, [35] creates reflected copies of local features for training a symmetry detector based on spectral features.

In the domain of biomedical imaging, most works have focused on detecting centerlines of 3D tubular/cylindrical structures: blood vessels, axons, dendrites and spinal columns [86, 155, 286]; in cartography, symmetry has been used for detecting road networks [114]. Although these works produce very good centerline predictions, their applicability is often limited to the specific imaging modality (e.g. CT, MRI or brightfield) and the expected size (scale) of the target tubular structures. Recently, Sironi et al. [249] proposed a novel regression-based technique using Regression Trees (as opposed to classification) that showed state-of-the-art centerline detection in different applications (medical and roads). Their method, however, requires training a large number of regressors to predict the expected scale and location of the tubular structures from the input.

Other methods, [214] and [144], detect the symmetry of a region based on the phase relationship in spatial harmonics. Gabor and log-Gabor filter kernels are applied over the image; regions where the different symmetric frequency components are in phase indicate a possible symmetry axis. This method, however, is extremely sensitive to noise, and the choice of the kernel filter size affects both the scale and the quality of the symmetry axes returned.
[305] proposed a "Symmetry Distance" measure of shapes, defined by the mean square distance that transforms each point in the original shape to the new (symmetrical) shape; a smaller distance means the shape is more symmetric. The method, however, depends on a good selection of initial seed points, and an exhaustive search over various symmetrical shapes limits its applicability to simple scenarios.

4.2.2 Segmenting symmetrical regions

Most previous approaches [142, 170] considered segmentation as a separate step, independent of symmetry detection. [142] used local features to approximate a symmetry axis, followed by the fixation-based segmentation method of Mishra et al. [199] to extract regions with no symmetry constraints. Riklin-Raviv et al. [236] embed symmetry cues dynamically into a level-set functional so that each evolution of the functional improves the symmetric properties of the current segmentation. Like most variational methods, however, the approach requires several iterations and can get stuck at a local minimum.

Sun and Bhanu [264] use a region-merging approach in which homogeneous regions, measured in terms of color and texture, are merged while preserving reflection symmetry. The merging process, however, is sensitive to large variations of color and texture, producing oversegmented (small) regions in these areas. Along similar lines, Levinshtein et al. [166] build an adjacency graph that encodes how superpixels are grouped into symmetrical parts. Lee et al. [159] extend this approach by imposing a more general deformable disc (ellipse) model that better encodes the affinity of superpixels in curvilinear structures. Affinity is computed from shape similarity (parameters of the deformable ellipse) and differences in local color and intensity. Since superpixels are grouped in a pairwise manner via dynamic programming, this approach is limited to grouping homogeneous regions that contain a single curved symmetry (no branches). Fu et al.
[77] focused on extracting foreground salient objects exhibiting reflective symmetry by first computing a symmetry foreground segmentation map from color-contrast cues and a feature-based, symmetry-induced homography, which are then set into the unary and pairwise terms of an MRF-based segmentation. The accuracy of the final segmentation, however, depends primarily on the initial foreground map, which is formed by combining a feature-based estimate of a global reflection homography with a saliency foreground map based on color contrast.

In the medical imaging literature, most works consider segmentation an integral part of centerline detection, where the centerlines and their corresponding radii are solved by the same detector [86, 155, 249]. The final segmentations are therefore a combination of circles or balls located along the centerlines, with no enforcement of global curved symmetry consistency.

4.2.3 Contributions of this work

Our approach, described next, considers the issues of symmetry detection and segmentation of symmetrical regions together, so that we produce accurate segmentations of symmetrical structures from robust symmetry detections in clutter.

Compared to Loy-Eklundh, which requires textured regions for computing SIFT keypoints, our proposed two-step bilateral symmetry detector uses Gestalt principles to first detect putative symmetrical locations, a symmetry "attention" map, by comparing histograms of Gabor edge responses. Although histograms have better tolerance to noise, since they only encode the number of edges within a discrete set of orientations, they are by no means sufficient to determine the precise location, scale and orientation of the symmetry, because spatial information is lost. In the second step, termed the symmetry "refinement" step, we estimate the location (centroid) and orientation of the supporting symmetry axis within the potential symmetry patches.
As a direct 2D search over all locations and orientations of an image patch is computationally expensive, we show that it is possible to decompose the problem into two separate 1D searches: the first in edge orientation space and the second in the (rectified) location space. To achieve robustness, we use kernel density estimates of edge orientations and edge counts in the first and second 1D searches, respectively, and compare their probability distributions using the efficient L1 Earth Mover's Distance (EMD-L1) measure of [178]. Combining both searches admits a small set of local minima in each image patch that suggest likely locations of possible symmetry axes. We then score each axis, via a method similar to Loy and Eklundh [184], by matching local symmetric features using a Hough voting technique. The symmetry axis with the best (highest) score is selected as the final symmetry axis of the image patch.

Compared to other curved symmetry detectors [159, 249, 278], our SRF-based approach learns to associate multiscale symmetrical features with a novel symmetry output annotation structure. The key advantage is that we let the SRF determine from training exemplars the optimum feature combination that predicts the best symmetry axes (location, orientation and scale), without the need to predefine a symmetry or noise model, enabling our approach to work in a large variety of environments and conditions. Additionally, as inference using an SRF is extremely efficient, our curved symmetry detector runs in 0.1 s (after feature extraction, which takes ≈1 min) per 320 × 240 image.

Compared to works that group and merge superpixels or regions [159, 166] or that iteratively improve the final symmetry segmentation [236, 264], our segmentation approach is not only faster but is also able to handle symmetry axes with multiple branches.
This is achieved through a foreground-background segmentation of the structures/regions supporting the predicted curved symmetries, via the addition of a novel pairwise symmetry prior in a Markov Random Field (MRF) representation of the input image edges. Since the symmetry prior is defined locally in the MRF clique, the optimal segmentation can be solved using graph-cuts [27], while handling even convoluted curved symmetry axes with multiple branches. As the predicted symmetries provide an initial measure of how symmetric the region should be, we modulate this prior so that the appropriate amount of symmetry is enforced in the final segmentation, a crucial requirement for natural images, which can exhibit approximate symmetries at different scales.

4.3 Robust bilateral symmetry detection

As noted earlier, our bilateral symmetry detector consists of two related steps: 1) generating a "symmetry attention" map to obtain putative symmetry fixation points, from which we extract initial object-like segments, and 2) a symmetry "refinement" step that localizes the symmetry axes per segment. From an image I, we extract an edge map, I_e, which is used as input to the symmetry computation. In our implementation, we use Gabor filters and the gPb edge detector [6], and we retain edges larger than a threshold of 0.07. The output is a set of T symmetry axes A = {A_1, ..., A_T}, with A_i = {(x_i, y_i), θ_i, v_i} (centroid location, orientation and symmetry axis score) for the ith symmetry axis. We describe these two steps next.

4.3.1 Symmetry Attention

The goal of the symmetry attention stage is to efficiently detect regions in the image that are likely to support a reflection symmetry axis. From the input edge map, I_e, we first determine a symmetry "attention map" (§4.3.1.1), D_sym, with the same dimensions as I_e. From this map, we then extract a set of G fixation points (local maxima), which we denote as F_sym = {f_1, ..., f_G}.
We assume that each f_m is supported by a unique symmetric region, r_m, which we obtain via a (fast) variant of the fixation-based segmentation (§4.3.1.2) of Mishra et al. [199]. The method computes a closed boundary surrounding f_m, and this boundary contour is likely associated with an object in the image. These "object-centric" segments are then used in the refinement stage (§4.3.2) to localize the precise locations of the symmetry axes.

4.3.1.1 The symmetry attention map

We introduce a fast and robust technique for detecting approximate symmetries in the image via the formation of an attention map that localizes potential (noisy) symmetry locations. This is similar in spirit to computing a symmetry "saliency" map [77, 142], but our approach differs in the representation used (oriented Gabor histograms) and in speed (via integral images [46]).

Figure 4.1: (Top) Generating the symmetry attention map: (A) Oriented Gabor edges are detected over several scales. (B) At an image patch c (boxed), we compare histograms h_{c,s}(k) via d_{χ²} to measure local symmetry. (C) Normalized detections over scales, D_s, are combined to form a final attention map D_sym: larger values indicate stronger localized symmetry. (Bottom) Example top three maxima showing scales (boxes) and orientations.

While most edge-based symmetry detection methods consider both the edge locations and their orientations and evaluate perfect symmetry, our method considers only the orientations of edges. Given a possible symmetry axis, the method compares the probability distributions of edges on the two sides of the axis and is thus robust to image deformations and errors.

The steps for generating the attention map are illustrated in Fig. 4.1. Given an input image, I, oriented Gabor edges I_{e,s} are detected over s ∈ S = 4 scales and O = 16 orientations. At each scale s, we compute a histogram h_{c,s}(k) of the orientations of the Gabor edges Q_p = {q_1, ..., q_n} within a patch c centered at pixel (x_c, y_c) ∈ I_{e,s} (Fig. 4.1 (A)):

$$h_{c,s}(k) = \#\{\angle\{q_1, \ldots, q_n\} \in \mathrm{bin}(k)\} \qquad (4.1)$$

where bin(k) denotes an orientation bin in the histogram centered at patch c, with k ∈ [0, π] radians. To check for local symmetry at a patch, we check, for every possible orientation of the symmetry axis (for O discrete orientations), whether the appropriately adjusted histograms match. Specifically, for orientation o ∈ O, we select bins from opposing angles, b_o = h_{c,s}({k_1, ..., k_{o-1}}) and its symmetric counterpart b'_o = h_{c,s}({k_O, ..., k_{o+1}}), and compute their χ² distance:

$$d_{\chi^2}(o) = \chi^2(b_o, b'_o) \qquad (4.2)$$

For example, to detect horizontal symmetries (parallel to the x axis), we would compare bins ranging over (0, π/2] and (π, π/2] (Fig. 4.1 (B)). When the bins on each side are similar, the χ² distance is small, indicating the presence of a strong symmetry. The final symmetry distance measure, d_{c,s}, of a patch is defined as:

$$d_{c,s} = 1 - \min_{o \in O} \big(d_{\chi^2}(o)\big) / n_c, \qquad (4.3)$$

where we select the minimum d_{χ²}(o) (and its corresponding orientation), normalized by n_c, the number of edges in the current patch. This process is repeated over all pixels in I_{e,s}, generating a symmetry attention map D_s that codes the symmetry measure at scale s. The final symmetry attention map, D_sym, combines all the different scales into one map: at each point we take the maximum response over all scales, D_sym = max_{s ∈ S} D_s, which yields a measure of symmetry confidence at each pixel. The set of "fixation points", F_sym = {f_1, ..., f_G}, shown as crosses in Fig. 4.1 (C) and (Bottom), is obtained as the maxima (using non-maxima suppression) of D_sym. We show the top two orientations per point (usually separated by π/2 radians), with the bounding box representing the best scale supporting the symmetry.

For an efficient implementation we use, as in [209, 269], the method of summed area tables (integral images).
Specifically, since each patch c consists of sums of Gabor edges, we precompute, in summed area tables, the sum of edges for each orientation o ∈ O at each location in the image. With these summed tables of Gabor responses per orientation, R_o, we can then compute the total Gabor response in one orientation bin of the histogram, h_{c,s}(o), within a rectangular patch c of size W_c × H_c (width × height) centered at pixel (x_c, y_c), as the sum/difference of four table lookups:

$$h_{c,s}(o) = R_o\!\left(x_c - \tfrac{W_c}{2},\, y_c - \tfrac{H_c}{2}\right) + R_o\!\left(x_c + \tfrac{W_c}{2},\, y_c + \tfrac{H_c}{2}\right) - R_o\!\left(x_c + \tfrac{W_c}{2},\, y_c - \tfrac{H_c}{2}\right) - R_o\!\left(x_c - \tfrac{W_c}{2},\, y_c + \tfrac{H_c}{2}\right) \qquad (4.4)$$

Repeating eq. (4.4) O times, once for each R_o, allows us to compute the orientation histogram of a patch per scale, h_{c,s}, in constant time.

4.3.1.2 Fixation-based segmentation

Given the set of G symmetry fixation points F_sym = {f_1, ..., f_G}, we seek to determine the set of G supporting regions R_sym = {r_1, ..., r_G} associated with the fixation points. Specifically, for the mth fixation point f_m ∈ F_sym, we want to determine the region r_m ∈ R_sym of the appropriate size that explains the putative symmetry centered at f_m. Extracting r_m serves three key purposes. First, it suggests an appropriate scale at which the potential symmetry axis should be detected in the refinement stage. As we have argued, this is a critical aspect that has been largely ignored by the majority of past works but is key to the proper definition of symmetry. Secondly, since we are interested in object-centric symmetries in the image, we require a procedure that extracts, in addition to the right scale, a region that is suggestive of a potential object. In this work, we assume that an object-like region must satisfy the simple criterion that most of its edges are closed, via the Gestalt principle of closure, so that an appropriate neighborhood surrounding the fixation point f_m is found.
Thirdly, by limiting the refinement step to within the set of regions in Rsym, we significantly reduce the search space 124 Figure 4.2: Segmenting a region given the symmetry fixation point fm (blue cross). (a) Weighing edges by κ = 1dq with respect to fm (red means larger κ) simulates a log-polar transformation of the image. (b) The final segmentation. for detecting the symmetry axis since each rm ⊆ I. This speeds up the approach compared to an exhaustive search over the entire image and reduces the number of false positives as well. We use a fixation-based segmentation procedure that is similar to the method proposed by Mishra et al. [199] as it seeks to segment a closed region that surrounds the input fixation point. We use a log-polar transformation, instead of a polar transform. In Appendix C we show that we can simulate the transformation into the log-polar coordinates by weighing the pairwise terms in the standard graph- cuts energy function [28] by a factor κ = 1dq where dq is the Euclidean distance (in pixels) from the fixation center fm to an edge point q (Fig. 4.2 (a)). Additionally, this saves computation time needed for the coordinate transformation, leading to a faster segmentation procedure. Finally, unlike [199] which uses color information as further pairwise constraints for the segmentation, we use only edge pixels from Ie. This provides a better segmentation for cases where there is large variation in color 125 information within the object itself. Note that a straightforward alternative that uses the bounding box associated with each fm to determine a region would be a rough approximation of the desired object segment. In §4.6.2.2 we show experimentally that using the segmentation im- proved the overall performance for images with single symmetric regions, but using the bounding box regions produced the best performance for multiple symmetric regions. 4.3.2 Symmetry Refinement Having obtained the set of regions Rsym = {r1, . . . 
, r_G} from the symmetry attention map, the goal of the symmetry refinement step is to detect, for each r_m ∈ R_sym with dimensions X_{r_m} × Y_{r_m}, the final symmetry axis A_m = {(x_m, y_m), θ_m, v_m}, parameterized by its centroid, orientation, and symmetry score, respectively. Unlike approaches that use local feature matches (e.g. [184, 230]) to determine these parameters, we use a robust approach based on comparing statistics of the edges present in the image. A similar idea was developed in [263], where gradient orientation histograms were compared using an FFT-based technique to find the direction of the orientation axis. In contrast to [263], our approach uses two steps of comparing edge statistics, to find both the orientation and the position of the symmetry axis, and it is not applied to the whole image but searches over the regions provided in the attention step (§4.3.1). We first compare probability density functions (pdfs) derived using kernel density estimates of edge orientations, p_d(θ), and then probability distributions of edge counts, p_{d,θ}(x), over the rectified image at angle θ, as shown in Fig. 4.3.

Figure 4.3: Overview of the refinement step. (a) Comparing pdfs of edge orientations p_d(θ) (middle) and their local minima, marked by crosses (right). (b) Given a selected orientation, we compare pdfs of edge density p_{d,θ_e}(x) (middle) and obtain the final set of possible symmetry axes as local minima, marked by crosses (right).

This effectively reduces the original (expensive) 2D search for the symmetry solution in each r_m into two separate 1D searches. A justification for the separation of orientation from translation (centroid location) in computing the reflection symmetry is given in Appendix D. We describe next these two steps in detail (§4.3.2.1 and §4.3.2.2) and how the final symmetry axis per segment is obtained (§4.3.2.3).
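Both 1D searches compare a pair of candidate distributions with the EMD-L1 distance of [178]. As a minimal illustration (a sketch, not the cited implementation): for 1D histograms with an L1 ground distance, the EMD reduces to the L1 distance between the cumulative sums.

```python
import numpy as np

def emd_l1_1d(p, q):
    """EMD between two 1D distributions with an L1 ground distance:
    the L1 distance between their cumulative distribution functions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())
```

Identical distributions score 0, while mass shifted by k bins scores k, which is why local minima of the sweep scores below mark good candidate axes.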
4.3.2.1 1D search over orientations

We proceed first by taking each r_m and dilating it by a small factor δ_e = 10 pixels so that the expanded segment captures the necessary edges in I_e that are likely to support the symmetry. Let us denote the set of M edge points within the expanded segment as W_{r_m} = {w_1, . . . , w_M} and their corresponding edge orientations as Θ_{r_m} = {θ_1, . . . , θ_M}, with θ_j ∈ [0, π] radians. We then estimate the edge orientation pdf p_d(θ) using kernel density estimates as:

p_d(θ) = (1 / Mβ) Σ_{j=1}^{M} K_N((θ − θ_j) / β)    (4.5)

where K_N is the standard normal kernel and β is the bandwidth parameter, which we derive from the number of modes of the data. We determine from Θ_{r_m} which orientations have significant counts in the data (e.g. greater than t_θ = 50% of the largest edge orientation bin in Θ_{r_m}) to set a reasonable value for β. Next, by sweeping through the orientation space, we check at each test orientation, θ_t ∈ [0, π], whether the potential symmetry axis separates Θ_{r_m} into two distributions p_d(θ_t) and p′_d(θ_t) via an efficient implementation of the EMD-L1 distance measure [178]:

d_EMD(θ_t) = EMD(p_d(θ_t), p′_d(θ_t))    (4.6)

where {p_d(θ_t), p′_d(θ_t)} are derived from p_d(θ) as:

p_d(θ_t) = p_d(i), i ∈ [θ_t − π/2, θ_t]
p′_d(θ_t) = p_d(i), i ∈ [θ_t + ∆θ, θ_t + ∆θ + π/2]    (4.7)

with p_d(θ) = p_d(θ ± π) due to the periodicity of the orientation angles. We set ∆θ = π/180 to obtain a search resolution of 1 degree. (Other statistical measures based on entropy, such as the Jensen-Shannon divergence [175] or the Bhattacharyya distance, were tried, but the EMD-L1 distance was found to give the highest accuracy.) Sweeping through the orientation space gives us a 1D score function defined over θ, Y_sym(θ), that codes the symmetrical distance d_EMD at each evaluated θ (Fig. 4.3 (a)). The set of J local minima obtained from Y_sym(θ), Y_θ = {θ_1, . . . , θ_J}, represents the top J orientations for the potential symmetry axis within r_m.
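The orientation sweep of eqs. (4.5)–(4.7) can be sketched as follows. This is a simplified version that assumes a fixed bandwidth β and a 1-degree grid, whereas the actual bandwidth is derived from the modes of the data as described above:

```python
import numpy as np

GRID = np.linspace(0.0, np.pi, 180, endpoint=False)   # 1-degree resolution

def orientation_pdf(thetas, beta=0.15):
    """KDE of edge orientations (eq. 4.5) on a grid over [0, pi), with the
    kernel wrapped to respect the pi-periodicity of orientations.  The
    normal-kernel constant is dropped since the pdf is renormalized."""
    d = GRID[:, None] - np.asarray(thetas)[None, :]
    d = (d + np.pi / 2.0) % np.pi - np.pi / 2.0       # wrap to [-pi/2, pi/2)
    p = np.exp(-0.5 * (d / beta) ** 2).sum(axis=1)
    return p / p.sum()

def axis_scores(pdf):
    """Y_sym(theta): for each candidate axis theta_t, the EMD-L1 distance
    between the orientation mass on either side (cf. eqs. 4.6-4.7).
    Local minima of this score are the candidate axis orientations."""
    n, half = len(pdf), len(pdf) // 2
    idx = np.arange(1, half + 1)
    scores = np.empty(n)
    for t in range(n):
        left = pdf[(t - idx) % n]                     # theta_t - delta
        right = pdf[(t + idx) % n]                    # theta_t + delta
        cl = np.cumsum(left / left.sum())
        cr = np.cumsum(right / right.sum())
        scores[t] = np.abs(cl - cr).sum()             # 1D EMD
    return scores
```

For edge orientations that are mirror-symmetric about π/2, the score at the 90-degree grid point is near zero, while an off-axis candidate scores substantially higher.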
We therefore focus our search for the centroid of the symmetry axis only along these J orientations.

4.3.2.2 1D search over centroid locations

For each θ_e ∈ Y_θ, we search for the centroid location of the best symmetry axis in the following manner. First, we rectify the expanded segment r_m along θ_e so that the axis becomes parallel to the image y axis, and we then only need to search along the x axis to determine the centroid location. Next, we estimate via kernel density estimates, similar to eq. (4.5), a pdf p_{d,θ_e}(x) that captures the edge counts along the rectified x axis:

p_{d,θ_e}(x) = (1 / Mβ_x) Σ_{j=1}^{M} K_N((x − ξ_j) / β_x)    (4.8)

where each ξ_j ∈ Ξ_x = {ξ_1, ξ_2, . . . , ξ_M} is a rectified x coordinate of one of the M edges in r_m, and β_x is the bandwidth of the normal kernel K_N, which we set as β_x = X_{r_m}/100. Given a test location, x_t, the symmetry distance score between the distributions at x_t, {p_{d,θ_e}(x_t), p′_{d,θ_e}(x_t)}, is again obtained via the EMD-L1 distance:

d_EMD(x_t) = EMD(p_{d,θ_e}(x_t), p′_{d,θ_e}(x_t))    (4.9)

where the two distributions are derived from p_{d,θ_e}(x) as:

p_{d,θ_e}(x_t) = p_{d,θ_e}(i), i ∈ [x_t − R(x_t), x_t]
p′_{d,θ_e}(x_t) = p_{d,θ_e}(i), i ∈ [x_t + 1, x_t + 1 + R(x_t)]    (4.10)

R(x_t) defines the size of the area from which we consider edge points. This parameter can be viewed as the optimal support expected for the symmetry axis within r_m, which means that we search x_t within the range [R(x_t), X_{r_m} − R(x_t)]. In practice, R(x_t) is a fixed value learned from training data or, if this is not available, a function that varies the support adaptively over different x_t: for example, one can set R(x_t) from a normal distribution N(µ_Ξ, σ²_Ξ) over Ξ_x so that we bias a larger support at locations with the densest edge counts. By repeating eq. (4.9) over all test locations, we obtain a 1D score function defined over x, Y_{sym,θ_e}(x), that codes the symmetrical distance d_EMD at each evaluated x (Fig. 4.3 (b)). The set of L local minima obtained from Y_{sym,θ_e}(x), Y_x = {x_1, . . .
, x_L}, then gives the top L centroid locations (after de-rotating the image) per θ_e: Y_{(x,y)} = {(x, y)_1, . . . , (x, y)_L}. To summarize, at the end of the two 1D search procedures, we obtain for each of the J orientation minima a set of top L centroid locations. This results in a combined set of J × L potential axes per r_m: Y_A = {A_1, . . . , A_{J×L}}.

Figure 4.4: Using a robust Hough-voting technique to score an axis. (Left) Dilated and rectified LHS and RHS segments. (Right) Hough space with the scoring region boxed.

4.3.2.3 Scoring the symmetry axes

Given the set of top J × L potential symmetry axes for each r_m, in the final step we associate an appropriate symmetry score with each A_l ∈ Y_A. We use a robust Hough-based voting method derived from matching the localized symmetric SIFT features of Loy and Eklundh [184] (Fig. 4.4). First, we rectify the edge image I_e by A_l so that A_l is parallel to the y axis and is in the center of I_e. Next, we extract from the rectified edge image the relevant edges captured by r_m (rectified according to A_l as well). This creates an edge map I_e(r_m) that is a subset of all edges in I_e. We then dilate I_e(r_m) by a factor δ_s = 20 pixels so that a larger region surrounding the edges is used in detecting and matching symmetrical SIFT features. Each match between symmetrical features is a vote in the linear Hough space H(θ, x_c) (angle, x position of the axis with respect to the image center) for a potential orientation and location of a symmetry axis. Unlike [184], which uses H(θ, x_c) directly to determine the best symmetry axes with the largest votes, our goal here is to obtain the same (normalized) voting score for the selected axis A_l. Doing this is straightforward, since I_e(r_m) is rectified and centered with respect to A_l: the solution is the point located at H(0, 0) in the Hough space.
In order to account for discretization effects in the Hough space, we take a small region of size (δ_H × δ_H), with δ_H = 5 pixels, surrounding H(0, 0) to obtain a mean normalized score v_l per axis A_l. This scoring procedure therefore affords us a robust and symmetry-sensitive score that is directly comparable with the scores used in the baseline [184] (§4.6.2.2). Finally, for each r_m, we select the symmetry axis A_m = {(x_m, y_m), θ_m, v_m} that yields the largest symmetry score, v_m = max{v_1, . . . , v_{J×L}}, from the J × L axes that were compared. Repeating the above procedure for each of the G segments in R_sym yields the final set of T axes, A = {A_1, . . . , A_T}, where T ≤ G because an extra step combines similar segments together. The symmetry score v_m will also be used in §4.5 to modulate the symmetry prior for extracting the final symmetrical segments. Further implementation details for the bilateral symmetry detector can be found in Appendix E, including its training and run-time evaluations. In the next section, we turn our attention to the detection of curved reflection symmetry using SRF.

4.4 Fast curved symmetry detection via SRF

Similar to the detection of bilateral symmetry, the input to our SRF-based curved symmetry detector is an RGB image I, and the output is a set of n curved symmetry axes, A_c = {A_c1, A_c2, · · · , A_cn}. We describe the features used and the training procedure next.

Figure 4.5: Training a SRF for curved symmetry detection. (A) Multiscale intensity, color, texture, spectral and oriented Gabor features are used to compute a set of local symmetry responses, X_f, by comparing histograms of patches. (B) By pairing patch-based features x_f ∈ X_f with their symmetry groundtruth annotations Y, we determine the optimal split parameters θ associated with the split functions h(x_f, θ) that send features x_f either to the left or the right child. The leaf nodes store a distribution of structured labels of symmetry axes.
(C) During inference, a test patch is assigned to a leaf node within a tree that contains a prediction of the location and scale of the symmetry axis. Averaging the prediction over all K trees yields the final symmetry axes and their corresponding strengths (degree of symmetry).

4.4.1 Patch-based symmetry features

In order to detect curved symmetries/centerlines in real images with clutter, a key requirement is the ability to efficiently extract, from the input image, robust features that are suggestive of symmetry. Our feature selection approach is motivated by two issues well known in visual symmetry. First, similar to textures (which are a kind of translation symmetry), curved reflection symmetry is a function of image scale. Second, and related to the first: what features can one use to define symmetry in the image? In this work, we extract multiscale features based on intensity, color (from the L∗, a∗ and b∗ channels), oriented Gabor edges and texture [193] by comparing patches with different orientations (we use 8 discrete orientations) densely in the image (Fig. 4.5 (A)). The reason is that such features capture different forms of symmetry information that are complementary; e.g. edge-based features can suggest symmetry at textureless regions. For efficiency, we adopt the integral image implementation of [278]. For each patch, we compare the empirical distribution of feature histograms using the robust EMD-L1 distance [178], where a small value suggests a region with strong symmetry. In addition to these local features, we compute the symmetric spectral features proposed by [278]. These are similar to the intervening contour cue of [193], except that the curved symmetry responses from the histogram comparisons above are used to construct the affinity matrix prior to extracting the eigenvectors using normalized-cuts [245].
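As a toy illustration of the per-patch histogram comparison (a single intensity channel only; the actual features span color, texton, Gabor and spectral channels over multiple scales and orientations), one can compare the histograms of a patch's left half and mirrored right half, where a small EMD-L1 distance suggests local symmetry:

```python
import numpy as np

def emd_l1_1d(p, q):
    """1D EMD with L1 ground distance: L1 distance between CDFs."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def patch_symmetry_response(patch):
    """Toy single-channel symmetry response: EMD-L1 between the intensity
    histograms of the left half and the mirrored right half of a patch.
    Small values suggest a locally symmetric region."""
    h, w = patch.shape
    left = patch[:, : w // 2]
    right = patch[:, w - w // 2:][:, ::-1]           # mirror the right half
    hl, _ = np.histogram(left, bins=16, range=(0.0, 1.0))
    hr, _ = np.histogram(right, bins=16, range=(0.0, 1.0))
    return emd_l1_1d(hl, hr)
```

A patch whose columns mirror about its center scores near zero; a patch with very different halves scores high.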
The output X_f is a set of local symmetry responses over multiple scales (we use 4 scales here) for each of the 6 feature channels considered.

4.4.2 Symmetry detection via SRF

In this work, we train a SRF in a similar fashion as for border ownership (§2.3.2). We use patch-based symmetry responses of size N × N, x_f ∈ X_f, as features, and binary structured labels of groundtruth curved symmetries, Y ∈ {0, 1}^{N×N}. The goal of training the SRF is to learn, for the ith internal (split) node, the optimal splitting parameters θ_i of each binary split function h(x_f, θ_i) ∈ {0, 1}. If h(·) = 1 we send x_f to the left child, and to the right child otherwise (Fig. 4.5 (B)). h(x_f, θ_i) is an indicator function with θ_i = (d, ρ) and h(x_f, θ_i) = 1[x_f(d) < ρ], where d indexes the feature dimension of one of the input features described above. ρ is chosen by maximizing a standard information gain criterion M_i (eq. (2.4)) that splits the input data D_i ⊂ X_f × Y at node i into D_i^L (left child) and D_i^R (right child), respectively. As was noted in [51], computing eq. (2.4) with structured labels Y becomes feasible if one imposes an intermediate mapping function Π : Y → B from the structured labels onto discrete labels b ∈ B. The number of discrete labels, |B|, is an empirical measure of the diversity of structured curved symmetries that we expect to encounter. To determine Π, we first apply Expectation-Maximization (EM)-based clustering over DAISY [274] descriptors computed from randomly sampled symmetry patches in Y. The final clusters obtained are then used to define B.
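The split function and its gain computation can be sketched as follows, with Shannon entropy standing in for the criterion M_i of eq. (2.4) (whose exact form is given in Chapter 2), applied to pre-computed discrete labels b = Π(y):

```python
import numpy as np

def split(xf, d, rho):
    """Binary split h(x_f, theta) = 1[x_f(d) < rho] with theta = (d, rho)."""
    return 1 if xf[d] < rho else 0

def information_gain(labels, go_left):
    """Information gain of a candidate split over the discrete labels
    b = Pi(y) obtained by clustering the structured patches."""
    def H(b):
        if len(b) == 0:
            return 0.0
        _, counts = np.unique(b, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    labels = np.asarray(labels)
    left, right = labels[go_left], labels[~go_left]
    n = len(labels)
    return H(labels) - (len(left) / n) * H(left) - (len(right) / n) * H(right)
```

A split that separates two pure clusters yields the maximum gain of 1 bit for a balanced two-class node.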
The process is repeated with the remaining data D^o, o ∈ {L, R}, at both child nodes until M_i falls below a fixed threshold or a desired tree depth h_d is reached. The leaf nodes of each tree T_k store a distribution of the curved symmetry labels encountered during training.

Inference using the SRF is straightforward (Fig. 4.5 (C)). We sample test patches densely over the image to obtain test features, X_test, and pass them into the SRF to produce a structured prediction of the symmetry axes per decision tree T_k. Averaging these responses over all K trees produces the final curved symmetry predictions. We then convert these predictions (continuous values) into a set of curved symmetry axes, A_c = {A_c1, A_c2, · · · , A_cn}, where each A_c ∈ A_c is defined as a contiguous, single-pixel-wide segment. Notably, these responses can be seen as an estimate of the symmetry strength of the test patch, denoted A_c(r) for a pixel r ∈ A_c, and we use this estimate to modulate the amount of symmetry enforced in the segmentation step, described in §4.5. The parameters used for training the SRF are summarized in Table 4.1.

Notation | Description | Value
- | Number of feature orientations | 8 (0 to π)
- | Number of feature scales | 4 ([0.1, 0.3, 0.5, 0.75] of image diagonal)
- | Number of feature channels | 6 (L∗, a∗, b∗, textons, spectral, Gabor)
N | Patch size | 16
- | Number of (positive/negative) training samples per dataset | 10^5/10^5
|B| | Size of structured labels | 150
K | Number of trees | 16
h_d | Maximum tree depth | 64
- | Minimum value of M_i | 10^−10
- | Minimum length of A_c (pixels) | 5
- | Minimum symmetry response, A_c(r), per pixel r ∈ A_c | 0.01

Table 4.1: Parameters used in SRFSym.

4.5 Symmetry-constrained segmentation using graph-cuts

We embed symmetry constraints via a modified Markov Random Field (MRF) representation over the binary image edge map (Fig. 4.6 (top)), I_e, derived from gPb [6] (§4.3) or SE [51], where we retain responses >0.07 or >0.03, respectively.
Using I_e is important here, as it ensures that the segmentation results obtained are not influenced by color or intensity similarity but only by the detected symmetry axes, which is our goal. Since this approach works for both bilateral symmetries A (§4.3) and curved symmetries A_c (§4.4), for simplicity and clarity we will denote both kinds of symmetry axes as A in our descriptions.

Each node in the MRF is a pixel in I_e, with links (graph edges) between nodes denoting the local relationship between connected pixels. Pixels that are directly connected with one another form a local neighborhood or clique. In addition to the unary and pairwise terms over links in a standard 4-way neighborhood clique system, we add at each node a link that connects it, based on the detected set of symmetry axes A, to its closest symmetrical neighbor. To do this, we first compute the distance transform of A, D_A. Next, pixels that lie on the same iso-contours on opposite sides of each A ∈ A are linked (Fig. 4.6 (below)). This additional link, called the cross-symmetry term, creates a new 5-way neighborhood clique system that enforces both local (4-way) and global curved symmetry constraints within a single MRF model. This ensures global symmetrical consistency while allowing for small local deformations in the final segmentation. In addition, since this term is computed locally with respect to A, our model handles multiple axes and branching symmetries with no additional modifications.

Figure 4.6: Symmetry-constrained segmentation. (Top) Constructing the 5-way MRF over I_e with cross-symmetry terms, S_pp′, given a (curved) symmetry axis, A. Every node in the MRF consists of N (4-way, green box) and N_sym (cross-symmetry, blue box) neighbors. We detail two symmetric neighbors {p_1, p′_1} and {p_2, p′_2} with their corresponding cross-symmetry terms as red links. Note that not all S_pp′ are shown, for clarity. (Below) Computing the symmetry prior: (a) Given A, we compute its distance transform.
(b) We then link the closest pixels along the same iso-contours on opposite sides of A to form symmetric pairs. (c) Visualization of the symmetry strength used in e_spp′, with red denoting stronger symmetries.

Finally, as the 5-way MRF model retains a local clique neighborhood system, the optimal labeling can be efficiently obtained using standard graph-cuts. We use the popular max-flow/min-cut toolbox of Kolmogorov and Zabih [139] in our implementation.

We now detail the binary energy function E used here. Let L = {0, 1} be the labels of the background and the symmetrical region, respectively. P is the set of all pixels in I_e, with {N, N_sym} denoting the 5-way neighborhood clique system consisting of the 4-way pairwise neighbors (p, q) and the cross-symmetry neighbors (p, p′), respectively. The energy function is defined as:

E(f) = Σ_{p∈P} U_p(f_p) + Σ_{(p,q)∈N} V_pq(f_p, f_q) + Σ_{(p,p′)∈N_sym} S_pp′(f_p, f_p′) + Σ_{(p,q)∈N} B_pq(f_p, f_q)    (4.11)

where f_p ∈ L is the label assigned to pixel p ∈ P and f = {f_p | p ∈ P} is the labeling of all the pixels in the image. The first two terms of eq. (4.11), {U_p, V_pq}, are the standard unary and pairwise terms that encode the foreground prior and boundary information used in the majority of MRF-based segmentation approaches [28, 238]. For the unary term, instead of a foreground model derived from color or intensity information (which we do not have), we set pixels that overlap with any axis A, p_A, as foreground (U_{p_A}(0) = ∞), and pixels along the image boundary, p_B, as background (U_{p_B}(1) = ∞). Similarly, for V_pq we replace the image intensities used in [28] with the edge labels:

V_pq(f_p, f_q) = exp(−(I_e(p) − I_e(q))² / 2σ²) if f_p ≠ f_q, and 0 otherwise    (4.12)

so that the final segmentation aligns with I_e.

Figure 4.7: Example symmetry-constrained segmentations. Notice that we are able to handle symmetry axes with multiple branches and produce more accurate segments with the symmetry prior term.
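A minimal scalar sketch of the unary and pairwise terms above (the actual implementation evaluates these inside the max-flow/min-cut solver of [139]):

```python
import numpy as np

def unary(p, fp, axis_pixels, border_pixels):
    """U_p: hard constraints. Axis pixels are forced to foreground
    (U_pA(0) = inf); image-border pixels to background (U_pB(1) = inf)."""
    if p in axis_pixels:
        return np.inf if fp == 0 else 0.0
    if p in border_pixels:
        return np.inf if fp == 1 else 0.0
    return 0.0

def pairwise(Ie_p, Ie_q, fp, fq, sigma=1.0):
    """V_pq of eq. (4.12): penalize label changes where the edge map Ie
    is locally similar, so that cuts align with detected edges."""
    if fp == fq:
        return 0.0
    return float(np.exp(-((Ie_p - Ie_q) ** 2) / (2.0 * sigma ** 2)))
```

Note that a label change between two equal edge values incurs the maximum cost exp(0) = 1, while a change across a strong edge discontinuity is cheap, which is exactly what drives the cut onto I_e.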
The third term is the symmetry prior term. It sums the cost of assigning different labels to symmetric neighbors (p, p′):

S_pp′(f_p, f_p′) = e_spp′ if f_p ≠ f_p′, and β otherwise    (4.13)

where β < 1 is a small positive value that provides a penalty when symmetrical neighbors are assigned the same label. We set β = 0.006 for all results reported. e_spp′ is a measure of symmetry strength defined as:

e_spp′ = 1 + β − (1/Z) log(1 + ‖D_A(p) − D_A(p′)‖ + ν_pp′)    (4.14)

where D_A(p) ≜ min_{p_a∈A} ‖p − p_a‖ is the distance between pixel p and p_a ∈ A, its closest pixel along the symmetry axis, obtained from the distance transform. ν_pp′ is a symmetry score that depends on the type of symmetry predicted. For bilateral symmetry, A = {(x, y), θ, v} and we define ν_pp′ = 1 − v, based on the symmetry score v derived in §4.3.2.3. For curved symmetries, we define ν_pp′ = 1 − (A(p_A) + A(p′_A))/2 as the symmetry score predicted by the SRF, where we take the mean of the two corresponding symmetry scores along A. Since e_spp′ is obtained by combining two estimates of symmetry, it tends to ameliorate the inherently noisy symmetric pixel correspondences caused by internal edges or textures in I_e. Z = max_{(p,p′)∈P} (log(1 + ‖D_A(p) − D_A(p′)‖ + ν_pp′)) normalizes the second term in eq. (4.14) to [0, 1], and as a result e_spp′ lies in the range [β, 1 + β]. Since eq. (4.14) assigns a large e_spp′ to pixels with different labels that exhibit strong symmetries, it encourages symmetrical pixels to take the same label and, as a consequence, enforces symmetry in the final segmentation. Notably, as e_spp′ is derived from D_A and the SRF-predicted symmetry scores, we are able to modulate the effect of this term, allowing symmetrical and asymmetrical configurations to occur at the appropriate locations. We show some example segmentation results in Fig. 4.7, comparing against the case when no symmetry prior is used (a standard 4-way MRF).
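Eqs. (4.13)–(4.14) can be sketched as follows for a batch of symmetric pairs; the guard on Z is an added safeguard for the degenerate case where every pair is perfectly symmetric (all raw values zero), which the text does not address:

```python
import numpy as np

BETA = 0.006  # beta of eq. (4.13)

def symmetry_strengths(DA_p, DA_pp, nu):
    """e_spp' of eq. (4.14) for a batch of symmetric pairs (p, p').
    DA_p/DA_pp are distance-transform values on either side of the axis
    and nu the per-pair symmetry scores.  Result lies in [BETA, 1 + BETA]."""
    raw = np.log(1.0 + np.abs(np.asarray(DA_p, float) - np.asarray(DA_pp, float))
                 + np.asarray(nu, float))
    Z = max(raw.max(), 1e-12)        # guard against the all-symmetric case
    return 1.0 + BETA - raw / Z

def S(e_spp, fp, fpp):
    """Cross-symmetry term of eq. (4.13)."""
    return e_spp if fp != fpp else BETA
```

A perfectly mirrored pair (equal distances, zero score offset) receives the maximum strength 1 + β, so splitting its labels is expensive, while the worst pair in the image receives only β.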
The final term, B_pq, is a "ballooning" term introduced by Veksler [282] that encourages the final segmentation to expand in opposite directions along both sides of A, so that a reasonably sized symmetric segment is obtained. Without this term, the final segmentation tends to be small, as symmetry strengths are usually largest between the closest pairs (p, p′). Assuming that pixel p is farther than pixel q from A, we have:

B_pq(f_p, f_q) = 0 if f_p = f_q;  ∞ if f_p = 1 and f_q = 0;  ρ_b if f_p = 0 and f_q = 1    (4.15)

where ρ_b is a ballooning cost that we set to control the expansion of the final segmentation. Following [282], the value of ρ_b is usually set to a small negative value. However, when ρ_b ≪ 0, over-expansion occurs, resulting in a degenerate (and undesirable) symmetrical segmentation (Fig. 4.8). We use ρ_b = −0.03 in all experiments.

Figure 4.8: Effect of the ballooning term, B_pq. (L-R) Without B_pq, the final segmentation tends to be small. When ρ_b ≪ 0, over-expansion occurs, resulting in a degenerate segmentation. Using an appropriate value for ρ_b produces an optimal segmentation.

A note on the submodularity of the pairwise terms in eq. (4.11): V_pq and B_pq are submodular by construction, and the symmetry prior term, S_pp′, is also submodular since, by eq. (4.14), e_spp′ ≥ β for all values of β. From [139], E can therefore be minimized exactly via graph-cuts.

4.6 Experiments: Bilateral Symmetry Detection

4.6.1 Datasets, baseline and evaluation procedure

We use three datasets for the experiments. The first two, PSU 2011 and PSU 2013, are publicly available, while the UMD Symmetry dataset is new. Each dataset is separated into two categories: 'singles', containing only one dominant symmetry, and 'multiples', for images that contain multiple symmetric objects.
The PSU 2011 and PSU 2013 datasets (PSU 2011: http://vision.cse.psu.edu/research/symmComp/index.shtml; PSU 2013: http://vision.cse.psu.edu/research/symComp13/index.shtml) consist of images taken under natural conditions, and they contain reflection symmetries from a variety of natural objects. For PSU 2011, we chose the 'real' subset (real images), ignoring the synthetic images. For training, we used only the training set of PSU 2013 (35 images for 'singles', 17 for 'multiples') to tune the parameters for the evaluation of both PSU 2011 and PSU 2013. For PSU 2011, we evaluated our results over the training subset (because it has more images): 79 (singles) and 85 (multiples); for PSU 2013, we used the testing subset: 40 (singles) and 30 (multiples). Human-annotated groundtruth symmetry axes are provided in both the testing and training subsets.

The UMD dataset, which is the largest of its kind so far, consists of 107 (singles) and 123 (multiples) test images that were classified by several paid experts into four empirical categories of increasing symmetry complexity: (P) perfect, (Q) quasi (or approximate) symmetric, (C) corrupted with clutter, and (N) not globally but locally symmetric. See Appendix E.3 for details. An additional 70 images are used as a separate training subset. In addition to the hand-annotated groundtruth symmetry axes, every axis is associated with an elliptical region that specifies the extent of the symmetry region supporting the axis. This allows more precise future evaluations that take into account the estimated symmetry region as well.

As baseline, we use the state-of-the-art method of Loy and Eklundh [184]. We use the Matlab implementation available from the authors' website, with optimally tuned parameters per dataset, to detect reflection (bilateral) symmetries.
We tune four parameters, {t_s, t_a, t_r, t_m}, that control the scale, angular and radial matching distances and the number of matches admitted per symmetric SIFT feature. An offline search procedure is used to obtain the best results per dataset. The code, obtained from the authors' website (http://www.nada.kth.se/~gareth/homepage/local_site/code.htm) and optimized via C++ calls, runs very fast: 1.12±0.70 s for a 320×240 input image. We detail the parameter search procedure and the optimal parameters, together with detailed runtimes for each dataset, in Appendix E.

We adopt the same evaluation procedure used in the 2013 Symmetry Competition [180], where we compare the accuracy of detections via standard Precision-Recall (PR) curves. In order to determine whether a detected axis a_i = {(x_i, y_i), θ_i} is correct with respect to a given groundtruth GT = {(x_gt, y_gt), θ_gt}, we use the following three criteria: 1) the minimum angular difference between θ_i and θ_gt is less than t_1 = 10 degrees; 2) the shortest Euclidean distance from (x_i, y_i) to the GT axis is less than t_2 = 0.2 × min{l_ai, l_GT}, where l_(·) is the length of the axis; and 3) the Euclidean distance between the centroids (x_i, y_i) and (x_gt, y_gt) is less than t_3 = 0.5 × min{l_ai, l_GT}. The first two criteria are from [180], while the third was added to reject detections that are either too small (not at the same scale) or not centered near the desired foreground/object.

Figure 4.9: PR curves over the three datasets. (Left Panel) Columns (L-R): PSU 2011, PSU 2013, UMD Symmetry datasets. Rows: 'singles' (Top), 'multiples' (Bottom). (Right Panel) Results for different symmetry categories in the UMD Symmetry dataset. Average Precision (AP) scores are in Appendix E. See text for details.

4.6.2 Results

In this section, we report a series of detailed experimental evaluations of the proposed approach and compare its performance with the state-of-the-art baseline Loy-Eklundh detector.
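For concreteness, the three matching criteria of §4.6.1 can be sketched as follows, assuming each axis is given as its centroid, orientation in degrees, and length (the lengths l_ai and l_GT referenced by criteria 2 and 3):

```python
import math

def is_correct(axis, gt, t1=10.0):
    """Check a detected axis ((x, y), theta_deg, length) against a
    groundtruth axis using the three criteria of Sec. 4.6.1."""
    (x, y), th, l = axis
    (xg, yg), thg, lg = gt
    # 1) minimum angular difference below t1 degrees (axes are undirected)
    dth = abs(th - thg) % 180.0
    if min(dth, 180.0 - dth) >= t1:
        return False
    lmin = min(l, lg)
    # 2) perpendicular distance from (x, y) to the GT line < 0.2 * min length
    nx, ny = -math.sin(math.radians(thg)), math.cos(math.radians(thg))
    if abs(nx * (x - xg) + ny * (y - yg)) >= 0.2 * lmin:
        return False
    # 3) distance between centroids < 0.5 * min length
    return math.hypot(x - xg, y - yg) < 0.5 * lmin
```

A detection passes only if it survives all three tests; a nearly parallel, well-centered axis is accepted, while one that is rotated too far or displaced off the groundtruth line is rejected.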
In §4.6.2.1, we first evaluate the contribution of each component of the approach, and we then compare variants of the approach against the baseline in §4.6.2.2. (Note that, due to the additional third criterion and the use of optimized baseline parameters, the PR curves reported here are not comparable to those of [180].) All the parameters (optimized with respect to the approach and dataset) are kept the same throughout all experiments and evaluated using the same evaluation procedure to produce comparable results.

4.6.2.1 Performance of individual stages

Before comparing the performance of the final approach with the baseline, an important question that needs to be answered is the contribution of each of the two stages of the approach: 1) the Symmetry Attention stage [SymAttention] and 2) the Symmetry Refinement stage [RefinementOnly]. We ran each of these stages individually and compared their results with the case where both stages are combined via fixation-based segments [AttentionSymSeg]. (As the symmetry refinement stage only returns the best symmetry axis, [RefinementOnly] was evaluated only on the 'singles' subset of each dataset; for the other variants and for 'multiples', the refinement stage is applied over segments obtained from the [SymAttention] stage.) In addition, we investigated the effect of using only bounding boxes as simple segments within R_sym in the combined approach, by not running the fixation-based segmentation at all [AttentionSymBB]. The relevant PR curves evaluated over the three datasets, separated into the 'singles' and 'multiples' categories, are shown in Fig. 4.9 (left).

We highlight two important observations. First, we note that on all the datasets the individual stages, [SymAttention] and [RefinementOnly], have consistently lower average precision (AP) than the case when both are combined in [AttentionSymSeg]. This is especially true for the symmetry attention stage, which has the lowest AP over all datasets considered.
This is not surprising, since the attention map is only meant to produce noisy putative symmetry axes. What the improvement in [AttentionSymSeg] shows is that our two-stage approach makes sense: the refinement step helps remove a large number of false positives from the symmetry attention stage. Second, the contribution of using fixation-based segments in [AttentionSymSeg] compared to using only simple bounding boxes in [AttentionSymBB] is more pronounced in the 'singles' category. For 'multiples', we notice a consistently better performance for [AttentionSymBB] at larger recalls compared to [AttentionSymSeg]. A possible explanation for this behavior is that images with multiple symmetries often occur in more complex/cluttered environments, causing erroneous regions to be segmented, which reduces the accuracy of the returned final symmetry axes (Fig. 4.11 (a-f)). However, when one uses bounding boxes alone on 'singles' images, not enough symmetrical information is captured, due to the fact that most of the symmetries occur on objects that occupy a large part of the image. As bounding boxes and fixation-based segments tend to provide complementary information, we therefore investigate in the next section an approach [AttentionSymSegBB] that combines a subset of the bounding boxes with the fixation-based segments.

4.6.2.2 Performance comparison with baseline

We compare quantitatively the performance of the full approach [AttentionSymSegBB] against the baseline detector of Loy and Eklundh [Loy-Eklundh] via PR curves evaluated over the three datasets, as shown in Fig. 4.9 (left).

Figure 4.10: Example results (from fixation-based segments) and Loy-Eklundh [184] detections (last row). Ten images per dataset (rows). (From top row): PSU 2011 singles, PSU 2011 multiples, PSU 2013 singles, PSU 2013 multiples, UMD singles and UMD multiples. For 'singles', we show the top 2 detections, while for 'multiples', the top 5 detections are shown: symmetry axis (lines) and their support segments (dashes).
Color encodes the relative ranking of the detections: blue, green, red, cyan and magenta (best to last).

Figure 4.11: Complementary symmetry axes detected from fixation-based segments (top) and bounding boxes (below). In cases where the segmentation is wrong or ambiguous (e.g. between objects, or clutter), the predicted symmetries from bounding boxes are more accurate (a-f). In other cases, segmentation provides a more accurate region for prediction (g-l).

As a check on the contribution of the symmetry refinement step, we replaced the refinement step of the full approach with the method of Loy and Eklundh, running their detector only over the segments and bounding boxes in R_sym [AttentionSymLoyBB]. From the results, we first note that the [Loy-Eklundh] baseline performs much better in the 'singles' categories than in the 'multiples' categories, with its best AP on UMD singles (0.865) and its worst AP on UMD multiples (0.321). This indicates that matching keypoints in the cluttered scenes typical of 'multiples' is more challenging than in the simpler scenes of 'singles'. Moving on to the comparisons, the full approach, [AttentionSymSegBB], has similar performance (precision) to the baseline at low recalls but quickly outperforms [Loy-Eklundh] at higher recalls: >0.5 ('singles'), >0.1 ('multiples'); and it is among the top-performing algorithms in terms of AP: ≥0.89 ('singles'), ≥0.66 ('multiples'). This consistent performance across both 'singles' and 'multiples' shows that, by augmenting R_sym with bounding box segments, we capture enough complementary information for the refinement step to accurately localize the best symmetry axis (Fig. 4.11).
Finally, as the PR curves of [AttentionSymLoyBB] are consistently lower in performance compared to the full approach, we can conclude that the symmetry refinement approach is able to localize the correct symmetry axis with much better precision within the segment, compared to using the local feature-based symmetry matching of Loy and Eklundh. The implications of these results are discussed in §4.6.3.

In addition, we compared the full approach against the baseline over different symmetry categories in the UMD Symmetry dataset (Fig. 4.9 (right)). We note that [Loy-Eklundh] has the best AP in the ‘P’ (perfect symmetry) category for ‘singles’ (0.862) and in the ‘Q’ (quasi-symmetric) category for ‘multiples’ (0.439), while it has the worst APs in the ‘C’ (corrupted with clutter) category. By contrast, our approach achieves the best performance in the more challenging ‘N’ (not globally symmetric) and ‘C’ categories for ‘singles’ (0.967) and ‘multiples’ (0.820) respectively. This shows that our approach is able to better handle more complex cluttered situations compared to [Loy-Eklundh].

4.6.3 Discussion

We discuss two key insights provided by the experimental results presented in the preceding section and illustrate them with example outputs in Fig. 4.10.

4.6.3.1 Advantages of a two-stage approach

An important hypothesis of this approach is the proposition that symmetry detection should be performed in a two-stage manner: 1) an attention-based mechanism first quickly determines potential symmetrical regions, and 2) a more expensive symmetry detection step is then applied at each region. The experimental results clearly demonstrate the advantage of this strategy, with the full approach [AttentionSymSegBB] significantly outperforming [Loy-Eklundh] in all the datasets at medium and high recalls. There are three main reasons.
First, we note that many of the detections from [Loy-Eklundh] are relatively small and insignificant compared to the expected groundtruth (Fig. 4.10 (last row)). This is because the approach does not estimate the correct symmetry scale. Our approach, on the other hand, defines scale in terms of the segments or objects, which reduces the chance that insignificant symmetries are detected, thereby improving performance. Of course, as we have noted earlier, errors in segmentation will reduce the performance of the approach when relying on fixation-based segments alone (especially for ‘multiples’). Integrating the segments with simple bounding box regions ameliorates the problem, but using a more sophisticated segmentation mechanism that takes advantage of high-level (e.g. symmetry, object-hood) information may provide a better solution.

Second, combining symmetry attention with segmentation also improves the precision of the symmetry axes detected in the refinement stage. This is because by limiting the search of the symmetry axis to the approximate regions, irrelevant information (texture or edges) is removed, reducing false positives. This is shown clearly in the improved results of both the full approach and when Loy-Eklundh is used in the refinement stage [AttentionSymLoyBB] (except UMD singles).

Finally, using a two-stage approach is more natural for the detection of multiple symmetries. This is because at the first (attentional) stage, we have already detected putative symmetry locations in the image, which we then independently verify in the second (refinement) stage. A single-stage approach, however, often needs to apply a mechanism for separating the input data into various clusters or potential symmetries via ad-hoc similarity measures. In the Loy-Eklundh detector, the approach first checks for matching strengths and for scale consistency to detect different symmetry clusters, by using a pre-determined threshold.
This approach often removes many detections, especially for the ‘multiples’ dataset, resulting in reduced recall rates.

Figure 4.12: Example failure cases of the proposed approach (top row) compared to Loy-Eklundh (bottom row).

4.6.3.2 Local features versus statistics-based detection of symmetry

The experimental results also support the use of robust statistics to detect symmetry compared to using local image-based features as in the baseline. This is demonstrated by comparing the PR curves of [AttentionSymSegBB] (our approach) and [AttentionSymLoyBB]. The latter is the same as the full approach except that the Loy-Eklundh detector was used instead of our statistical approach in the symmetry refinement step. Beyond a recall rate of 0.3 (‘singles’) and 0.05 (‘multiples’), our proposed approach consistently outperforms Loy-Eklundh’s detections even though the same segments from Rsym were used. The main reason for this difference in performance is that many of the segments are devoid of the texture that is required for good feature matching. Furthermore, although local SIFT-based features are robust to occlusions, matching them across clutter in real situations often produces numerous mismatches and (as a consequence) incorrect detections. This also explains why [Loy-Eklundh] has the worst performance in the ‘C’ subset of the UMD Symmetry dataset.

Although using edge-based statistics for detecting symmetries is clearly advantageous in such situations, there are certain limitations as well (Fig. 4.12). Since we only used edge-based features, errors in the edge detection/segmentation, or noise from the background, may result in errors in localizing the symmetry axis. Also, and more importantly, edge statistics alone may not provide sufficient discriminative information to decide between two competing symmetries, especially when the scale (sample size) is small.
One possible solution that we will explore in future work is to incorporate more discriminative features, such as the symmetric SIFT features of the baseline, into the statistical comparison framework.

4.7 Experiments: Curved Symmetry Detection

4.7.1 Datasets, baselines and evaluation procedures

We use the SYMMAX-300 dataset (200 train / 87 test) introduced by [278], which contains curved symmetry annotations for the BSDS-300 dataset [193]. Specifically, automatically generated medial axes are presented to human annotators, who select the axes that best support the groundtruth segments. We follow the same evaluation procedure as [193], where instead of boundaries, the groundtruths are human-annotated curved symmetries, and we report the Precision-Recall (P-R) curves and the ODS, OIS and AP metrics of [7], using the same evaluation parameters suggested by [278], where symmetry pixels close enough to the groundtruth (<0.01% of the image diagonal) are considered correctly matched. As baselines, we compare our SRF-based approach (SRFSym) with two state-of-the-art curved symmetry detectors: 1) the global-symmetry (gSym) approach of Tsogkas and Kokkinos [278] and 2) the symmetric deformable-discs (DefDiscs) approach of Lee et al. [159]. Finally, the same evaluation parameters suggested by [278] are used to compare the symmetry prediction accuracies of all three approaches. We also compared our approach to the recent regression-tree based (RTree) centerline prediction method of Sironi et al. [249] over published results in their “Aerial” dataset of 14 satellite images (7 train / 7 test) of road networks from New York state (NY-roads).

Method                  SYMMAX-300        NY-roads
Our approach, SRFSym    0.38, 0.42, 0.27  0.70, 0.70, 0.63
gSym [278]              0.36, 0.40, 0.22  -
DefDiscs [159]          0.37, 0.41, 0.23  -
RTree [249]             -                 0.85, 0.85, 0.83

Figure 4.13: Curved symmetry prediction accuracy. (Top) Precision-recall curves: SYMMAX-300 (left) and NY-roads (right). (Below) [ODS, OIS, AP] scores [7] in each cell. Best viewed in color.
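For concreteness, the distance-threshold matching used in these protocols (a predicted symmetry pixel counts as correct if it lies close enough to the groundtruth) can be sketched as follows. This brute-force version is ours, not the official benchmark code, which additionally enforces one-to-one pixel assignments:

```python
import numpy as np

def match_symmetry_pixels(pred, gt, tol):
    """Count predicted symmetry pixels lying within `tol` pixels of
    some groundtruth symmetry pixel (brute-force nearest-neighbour
    check). `pred`, `gt` are (N, 2) arrays of pixel coordinates."""
    if len(gt) == 0 or len(pred) == 0:
        return 0
    # Pairwise Euclidean distances between predicted and GT pixels.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    return int((d.min(axis=1) <= tol).sum())
```

With counts of matched predictions (true positives) and the totals of predicted and groundtruth pixels, precision and recall follow directly, and sweeping the detector threshold yields the P-R curves.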
Following [249], we consider symmetry pixels within 2 pixels of the groundtruth axes as correct to obtain the P-R, ODS, OIS and AP metrics.

4.7.2 Results and discussion

Fig. 4.13 summarizes the evaluations performed as described in the previous section. We briefly discuss these results and their implications. We show example results of curved symmetry detections with their corresponding segmentations in Fig. 4.17 in §4.9.

4.7.2.1 Curved symmetry accuracy over SYMMAX-300

Our method, SRFSym, returns the most accurate curved symmetry predictions compared to gSym and DefDiscs in all accuracy metrics [ODS, OIS, AP] (Fig. 4.13 (top-left)). Notably, we see that beyond a recall value of 0.2, SRFSym outperforms gSym consistently at higher recalls. This shows that SRFSym’s curved symmetry predictions are more accurate across a larger range of symmetry scores. This is likely due to: 1) our complementary set of features (gSym does not use edges), which works better at non-textured regions, and 2) the structured predictions over multiple scales, which smooth out wrong predictions across multiple decision trees.

4.7.2.2 Curved symmetry accuracy over NY-roads

In this dataset, SRFSym is unable to match the (almost) perfect curved symmetry predictions of RTree (Fig. 4.13 (top-right)), even with reasonably high precision (>0.8) for most recalls. The reason for the drop in precision at high recalls is that SRFSym responds to other symmetric regions (besides roads) that are not in the groundtruth. This shows that for this particular task and modality, the regression formulation proposed in [249] makes sense compared to our approach, which uses more general features for detecting symmetry. Modifying our approach to take advantage of features derived from the sparse convolutional filters of RTree may also improve our performance further.
Finally, it is also important to note that although SRFSym is comparatively less precise, its inference is extremely fast compared to RTree: seconds compared to the minutes/hours reported in [249].

4.8 Experiments: Bilateral Symmetry-Constrained Segmentation

4.8.1 Datasets, baselines and evaluation procedure

We use three datasets for the experimental evaluation of extracting symmetric regions exhibiting bilateral symmetry. The first dataset is the PSU 2013 dataset9, introduced in §4.6.1, consisting of 20 training and 40 testing images, which we augmented with human-labeled segmentation ground-truth of the symmetrical regions. The second and third datasets come from the UCSD symmetry segmentation dataset of Sun and Bhanu [264]. These datasets consist of selected images from two publicly available datasets: a) the Berkeley Segmentation Dataset (BSDS) [6] (15 images) and b) 93 images from the Caltech-101 object categories [67]. Since both datasets only have segmentation ground-truths, ground-truth symmetry axes associated with the symmetrical regions were added manually. In addition, we selected 10 images from BSDS and 50 images from Caltech-101 (from the same categories), which we used as a separate training set.

Note that we only evaluate with bilateral symmetries predicted from the initial symmetry attention step, [SymAttention]. There are two reasons for this. First, as the full approach [AttentionSymSegBB] consists of the refinement step applied over segmented regions or bounding boxes of initial putative symmetry fixation points, including a symmetric segmentation after the refinement step does not make sense. Second, as other approaches do not apply a “segmentation-in-the-middle” phase either, a proper and fair evaluation of the contribution of the symmetry-constrained segmentation can only be derived meaningfully from [SymAttention]. We compare with two baselines.

9Available online at http://vision.cse.psu.edu/research/symComp13/index.shtml
The first baseline is the standard 4-way MRF with no symmetry constraints, GC. The second baseline is the region-merging approach of Sun and Bhanu [264] that was discussed in §4.2.2. As the segmentation results using Sun and Bhanu are not available, we re-use the results reported in Table 10 of their paper.

A note on the performance metrics used. We capture the accuracy of the segmentation via the segmentation covering score Cseg used in the BSDS dataset [6], which measures the overlap of the regions R′ in the final segmentation S′ with the groundtruth S containing regions R by:

    Cseg(S′ → S) = (1/|P|) Σ_{R∈S} |R| · max_{R′∈S′} O(R, R′)    (4.16)

where O(R, R′) = |R ∩ R′| / |R ∪ R′| measures the overlap of the two regions R, R′, and |P| is the total number of pixels. Additionally, we used the unsupervised and supervised segmentation performance metrics EVA_SEGunsup, EVA_SEGsup used in the UCSD dataset for comparison. EVA_SEGunsup is a simple measure of region contrast between segments computed using image color and texture features. The larger the contrast, the better “separated” the regions are, which leads to a larger EVA_SEGunsup score. EVA_SEGsup measures the overlap between the test and groundtruth regions, similar to O(R, R′), with additional penalties for over- and under-segmentation.

Dataset      Method                  Cseg          EVA_SEGunsup  EVA_SEGsup
PSU 2013     Our Method, SymSegGC    0.72 (+0.08)  0.88 (+0.02)  0.69 (+0.06)
             GC                      0.64          0.86          0.63
Caltech-101  Our Method, SymSegGC    0.69 (+0.07)  0.85 (+0.02)  0.67 (+0.04)
             GC                      0.62          0.82          0.63
             Region-Merging [264]    –             0.83          –
BSDS         SymSegGC                0.78 (+0.09)  0.88 (+0.02)  0.79 (+0.03)
             GC                      0.69          0.86          0.66
             Region-Merging          –             –             0.76

Table 4.2: Performance comparison of mean segmentation accuracy between different approaches over the three datasets. Dashes (–) indicate missing results which were not reported in [264], or, for Cseg, not computed since the final segmentation results were not made available. Improvements (+x) are with respect to the next closest result.
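As a concrete illustration, the covering score of eq. (4.16) can be computed from two integer label maps as follows (a minimal sketch; the function names are ours, not from the thesis implementation):

```python
import numpy as np

def region_overlap(a, b):
    """O(R, R') = |R ∩ R'| / |R ∪ R'| for two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def segmentation_covering(gt, seg):
    """Covering of groundtruth S by segmentation S' (eq. 4.16):
    each groundtruth region R is matched to its best-overlapping
    test region R', weighted by |R| and normalized by the total
    pixel count |P|."""
    total = gt.size
    score = 0.0
    for r in np.unique(gt):
        mask_r = (gt == r)
        best = max(region_overlap(mask_r, seg == rp) for rp in np.unique(seg))
        score += mask_r.sum() * best
    return score / total
```

A perfect segmentation gives a covering of 1.0; merging everything into a single region is penalized in proportion to how much each groundtruth region is diluted.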
4.8.2 Results and discussion

The first set of experiments evaluates the contribution of our symmetry-constrained segmentation approach, SymSegGC, in enforcing symmetry within the final segmentation. For this purpose, we used the human-annotated symmetry axes as input and compare with GC (no symmetry) and the state-of-the-art symmetry-embedded region growing approach of Sun and Bhanu [264]. Table 4.2 summarizes and compares the performance of the various approaches over the three datasets.

Dataset      SymSegGC +          Cseg          EVA_SEGunsup  EVA_SEGsup
PSU 2013     [SymAttention]      0.67 (+0.03)  0.87 (+0.01)  0.65 (+0.02)
             Loy-Eklundh [184]   0.63          0.86          0.62
Caltech-101  [SymAttention]      0.68 (+0.02)  0.84          0.62
             Loy-Eklundh         0.63          0.84          0.62
BSDS         [SymAttention]      0.68 (+0.03)  0.87 (+0.01)  0.55 (+0.01)
             Loy-Eklundh         0.64          0.87          0.51

Table 4.3: Performance comparison of mean segmentation accuracy with two different methods of automatic symmetry axis detection. Improvements (+x) are with respect to the next closest result.

From Table 4.2, several key results are worth highlighting. Firstly, our approach achieves the overall best performance in terms of segmentation accuracy compared to the other two approaches over all of the performance metrics used. The most significant improvement occurs over the fixation-based baseline, highlighting the contribution of the symmetry prior in improving the final segmentation accuracy. Secondly, compared with the symmetry-integrated region-merging approach, our approach performs significantly better using the Cseg metric, less significantly so using EVA_SEGsup, and has almost the same performance for EVA_SEGunsup. This is not surprising, since both Cseg and EVA_SEGsup measure the overall segmentation accuracy with respect to the groundtruth symmetry target(s), while EVA_SEGunsup evaluates segmentations based on simple color and texture contrast. Fig. 4.14 highlights these improvements via example final segmentations of the fixation-based approach compared with the proposed approach.

Figure 4.14: Example final segmentation results, two results per row: (L-R) Input image + symmetry axis, GC, our method SymSegGC. The last row with red boxes indicates typical failure cases: (L) weak symmetry causes the segmentation to leak into the background, (R) a noisy background confuses the symmetry strength measure, with only partial improvements.

In the second set of experiments, we evaluate the complete approach by integrating the results of putative symmetry axis detections derived from the symmetry attention map Dsym as described in §4.3.1.1. We compared the final segmentation results with those using the detections of Loy-Eklundh [184]. In both cases, we used the symmetry axis with the highest response (strongest bilateral symmetry). Table 4.3 compares the performance evaluation of the two approaches over the three datasets used.

As expected, the performance of the complete approach takes a hit when noisy putative symmetry axes are used. However, we see that in most cases, the performance of the complete approach does not differ too much from GC of Table 4.2. This is due to the modulation of the symmetry prior term (eq. (4.14)) for situations where the bilateral symmetry is weak or completely wrong. Another interesting observation is that integrating the symmetry attention axes tends to give slightly better final segmentation accuracies compared to using the detections of Loy-Eklundh. A possible explanation could be that the images used in the three datasets contain symmetrical regions without texture (especially PSU 2013 and Caltech-101), which may cause numerous wrong (but strong) symmetry axes to be detected using Loy-Eklundh. Fig. 4.15 shows some qualitative results when we integrate the detected symmetry axes using the two detection methods.
Figure 4.15: (Top) Example bilateral symmetry-constrained segmentations with automatically detected symmetry axes, two results per row showing [SymAttention] (left column) and Loy-Eklundh detections (right column). The last row with red boxes indicates typical failure cases where segmentation leakage still occurs.

4.9 Experiments: Curved Symmetry-Constrained Segmentation

4.9.1 Datasets, baselines and evaluation procedure

We use three datasets (train/test splits) for our main evaluation: 1) SYMSEG-300 (200/87), 2) BSD-Parts (0/36) and 3) Weizmann Horses, WHD (20/61) (both from [159]). SYMSEG-300 is an extension of SYMMAX-300 where we extract symmetric segments based on the original BSDS-300 groundtruth segments. BSD-Parts and WHD were introduced by [159] as one dataset for evaluating the DefDiscs superpixel grouping approach. For comparisons, we applied our symmetry-constrained segmentation approach (SymSegGC) using symmetry axis predictions from: 1) SRFSym (our approach), 2) gSym and 3) DefDiscs. As an additional demonstration of the contribution of the symmetry priors, {Spp′, Bpq}, we evaluated SRFSym and gSym without these two priors, effectively reducing the segmentation to the standard MRF-based approach (GC) that was used in §4.8. We also compared the grouped superpixel segments obtained from DefDiscs (DefDiscs-SP) as an additional baseline. Following [159], we consider a segment as correct when its standard Intersection-over-Union (IoU) score with respect to the groundtruth exceeds 0.4 over all three datasets, and report the resulting P-R curves and Average Precision (AP) metrics for each method. We also compared SymSegGC with the estimated centerline scales predicted by RTree (RTree-ES) over the NY-roads dataset, where we used symmetry axis predictions from SRFSym with/without symmetry priors, and similarly for RTree.
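The IoU-based correctness criterion above can be sketched as follows (an illustrative Python version with our own function names; the actual evaluation follows [159]):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Standard Intersection-over-Union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def segment_is_correct(pred_mask, gt_mask, thresh=0.4):
    """A predicted symmetric segment counts as correct when its
    IoU with the groundtruth segment exceeds the threshold."""
    return iou(pred_mask, gt_mask) > thresh
```

Ranking the predicted segments by their symmetry scores and marking each as correct or incorrect under this criterion then yields the P-R curves and AP values reported below.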
The same evaluation procedure of [249], which applies an exclusion zone of 0.4% of the groundtruth radius, was used to generate comparable results.

Sym Detector    Segmentation   SYMSEG-300  BSD-Parts  WHD   NY-roads
SRFSym          SymSegGC       0.13        0.27       0.15  0.86
                GC             0.04        0.05       0.04  0.28
gSym [278]      SymSegGC       0.11        0.18       0.06  -
                GC             0.03        0.05       0.01  -
DefDiscs [159]  DefDiscs-SP    0.09        0.17       0.14  -
                SymSegGC       0.10        0.16       0.13  -
RTree [249]     RTree-ES       -           -          -     0.90
                SymSegGC       -           -          -     0.97
                GC             -           -          -     0.19

Figure 4.16: Curved symmetry-constrained segmentation accuracy. (Left) Precision-recall curves: (a) SYMSEG-300, (b) BSD-Parts, (c) WHD and (d) NY-roads. (Right) Corresponding Average Precision (AP) scores per cell.

4.9.2 Results and discussion

Fig. 4.16 summarizes the performance evaluation of curved symmetry-constrained segmentation as described above. Some example results of curved symmetry detection and segmentation of the corresponding symmetric regions using our approach, compared to DefDiscs-SP [159], are shown in Fig. 4.17, and we briefly discuss these results and their implications.

Figure 4.17: Example curved symmetry detection and symmetrical segmentation results, 10 results per panel: (L-R): Groundtruth of symmetrical axes (white) and regions, curved symmetry detections using SRFSym, extracted symmetry axes, curved symmetry-constrained segments using SymSegGC, segmentation using DefDiscs-SP [159].

4.9.2.1 Symmetric segmentation accuracy over SYMSEG-300, BSD-Parts and WHD

The general observation is that our full approach (SRFSym+SymSegGC) reports the best overall AP compared to the other approaches, as shown in Fig. 4.16 (a-c). Removing the symmetry prior in all approaches decreases accuracy by a significant amount, highlighting its importance. The dataset that challenges SRFSym+SymSegGC the most is WHD, where the textureless and small regions of horses (e.g. legs, tails) are better captured by the superpixels computed in DefDiscs-SP.
Nonetheless, our full approach is able to extract symmetric parts with better accuracy from WHD up to a recall of 0.46. An interesting observation is that when we pair the symmetry axes predicted by DefDiscs with SymSegGC, the performance is at least on par with (or, in SYMSEG-300, slightly better than) DefDiscs-SP. This shows that our proposed symmetry-constrained segmentation approach not only makes sense but is flexible enough to work with other approaches. It also highlights the complementary nature of the two approaches: while [159] is a local (and slower) approach that groups superpixels, our proposed approach presents a faster alternative that captures longer-range branched symmetries.

4.9.2.2 Symmetric segmentation accuracy over NY-roads

Although our full approach (SRFSym+SymSegGC) does not outperform RTree-ES in terms of overall AP, our precision is still higher than RTree-ES up to a reasonably high recall of 0.7 (Fig. 4.16 (d)). The rapid drop in precision after this recall is once again due to SRFSym responding to other symmetric regions in the image. Another interesting observation is the improved performance over RTree when we pair the centerlines of RTree with SymSegGC. This highlights the key advantage of enforcing global symmetrical consistency, which greatly improves the accuracy of the final segmentation.

4.10 Conclusions

We have presented a complete approach for detecting and segmenting symmetric structures from real images. Robust approaches for detecting bilateral and curved reflection symmetries were proposed and evaluated over different datasets, with state-of-the-art results compared to other approaches. A novel two-stage approach that uses putative symmetry attention points followed by a symmetry refinement step was proposed to accurately detect and localize bilateral symmetries. For curved reflection symmetries, we developed a fast SRF-based symmetry detector trained on multiscale patch-based symmetry features sensitive to local symmetry.
For segmenting symmetric regions, symmetry constraints are embedded within a novel 5-way MRF via an additional pairwise cross-symmetry term that is appropriately modulated by the predicted symmetry scores. This allows the approach to produce accurate segmentations of the approximately symmetrical structures that are common in real images. The results of the experimental evaluations confirm that our segmentation approach is not only more accurate than the existing state-of-the-art, but is also flexible enough to improve existing segmentation approaches when paired with their symmetry detections.

We have shown here a practical implementation of detecting a specific Gestalt principle (symmetry) and applied it for a specific higher-level visual task (segmentation). This is an important step for solving the FGO problem (§1.1), as we are now able to extract symmetric segments, which are typically salient foreground objects, for further processing. In the next chapter, we demonstrate the detection of functionalities or affordances, a universal concept similar to symmetry, via geometrical features, and apply it to the task of detecting different parts of common household tools.

Chapter 5: Object-Level Functional Category Detection

An important aspect of Gestalt or Mid-Level Vision is the generalizability of the proposed approaches. Unlike approaches that require large amounts of training data or separate training regimes for different situations (e.g. [15, 122, 307]), mid-level vision advocates the use of well-known invariants captured by Gestalt (e.g. symmetry, closure, proximity) that generalize well to numerous conditions and environments. Developing invariant representations is also a key aspect of vision according to Marr (1976) [192] and Gibson (1979) [83].
As noted by Richards [232], although Marr, Gibson and Gestalt offer different and sometimes conflicting views of visual perception, the study of the object’s “value to the observer” [135] or its affordance [83] remains an open research problem that ties these three differing schools of thought together. Motivated by these views, we tackle the preeminent issue of object recognition in Computer Vision from a functional or affordance perspective in this chapter. Specifically, we detect affordances of tool parts by training a SRF-based classifier to associate geometric features with seven affordance categories: {grasp, cut, scoop, contain, pound, support, wrap-grasp}, so as to produce pixel-accurate predictions of the affordances given an RGB-Depth image in real-time. Extensive comparisons with other (slower) approaches using more complex features show that our SRF-based method is able to provide highly accurate predictions within a fraction of the time needed1.

5.1 Introduction

The ability to understand and perceive objects and tools beyond their simple labels (e.g. names) is a vital requirement for Computer Vision to function “in-the-wild” [112, 179, 287, 293, 309]. This capability enables generalization, so that novel objects with similar shared attributes can be recognized and used, and is key to scaling up Computer Vision approaches [48]. The goal of this work is to establish a technique for generalizing object recognition based on the intended functionalities or affordances of objects. Such a capability will enable Computer Vision approaches and mobile agents to: 1) recognize a larger variety of objects based on their functions, 2) suggest meaningful alternatives, and 3) know the correct (and safe) method of manipulating such tools while working with humans. The input is an RGB-Depth (RGB-D) image, and the output is a probabilistic “functional” map that shows pixel-accurate localizations of potential target functional regions (Fig. 5.1).
Key to our approach is the use of view-invariant local geometric features: 1) Depth gradients, 2) Surface normals, 3) Principal curvatures [49], and 4) Shape-index and Curvedness measures [134]. Unlike other approaches (§5.2) that use full 3D (metric measures) or require detailed mesh reconstruction, we show that it is possible to relate local geometry and shape primitives (§5.3.1) to functionality, as long as we impose that the features are view-invariant to some degree. This is achieved via a Structured Random Forest (SRF) [140] classifier trained for affordance prediction, which we detail in §5.3.2. Experiments conducted over a new large RGB-D dataset, containing precise functionality annotations derived from multiple human annotators, showed that our SRF approach is able to achieve reasonable functionality detection in challenging test sequences containing novel objects with clutter and occlusions from different viewpoints (§5.4).

1Joint work with Austin Myers and was published as [205]. Full results, code and videos are available online: http://www.umiacs.umd.edu/~amyers/part_affordance/

Figure 5.1: Predicting novel affordances in clutter (left) and in single objects (right). (Left) Detections of grasp, scoop and support in a cluttered scene. (Right) Novel affordances predicted for turner (spatula): support, cut, grasp (top) and mug: wrap-grasp, contain, pound (bottom). Notice that we are able to predict and localize reasonable locations for novel affordances, even in clutter, and not just on well-defined object parts but on the relevant regions of the object (e.g. the bottom of the mug affords pounding and the edge of the turner affords cutting). Brighter regions indicate higher probability.
In addition, we compare our approach with: 1) the Superpixel Hierarchical Matching Pursuit (S-HMP) introduced by [204] and 2) a recent deep learning method termed the Sparse Autoencoder (SAE) that learns graspable features in common objects [163], and show in §5.4.4 that our approach is able to achieve comparable performance using simpler geometric view-invariant features.

5.2 Related Works

The study of affordance has a rich history in the computer vision and robotics communities. Early work sought a function-based approach to object recognition for 3D CAD models of objects like chairs [258]. More recently, many papers have focused on predicting grasping points for objects from 2D images [22, 243, 259]. [163] exploits a deep learning framework to learn graspable features from RGB-D images of complex objects, and [126] detects the tips of tools being held by a robot. From the computer vision community, [132] classify human hand actions in the context of the objects being used, and Grabner et al. [88] detect surfaces for sitting from 3D data.

Affordances might be considered a subset of object attributes, which have been shown to be powerful for object recognition tasks as well as for transferring knowledge to new categories. Ferrari and Zisserman [73] learn color and 2D shape patterns to recognize the attributes in novel images. Parikh and Grauman [221] show that relative attributes can be used to rank images relative to one another, and Lampert et al. [152] and Yu et al. [304] show that attributes can be used to transfer knowledge to novel object categories. Using RGB-D data, [265] identify color, shape, material, and name attributes of objects selected via bounding boxes. [105] explored, using active manipulation of different objects, the influence of shape, material and weight in predicting good pushable locations. [2] used a full 3D mesh model to learn so-called 0-ordered affordances that depend on object poses and relative geometry. Koppula et al.
[143] view the affordance of objects as a function of interactions, and jointly model both object interactions and activities via a Markov Random Field using 3D geometric relations (‘on top’, ‘below’, etc.) between the tracked human and object as features.

Recently, unsupervised feature learning approaches have been applied to problems with 3D information. [21] propose using hierarchical matching pursuit (HMP), and [251] propose using a convolutional recursive neural network to recognize objects from RGB-D images. For supervised methods, state-of-the-art performance using structured random forests [140] applied over RGB-D data for simultaneous object segmentation and recognition has been reported in [98].

5.3 Approach

In this section, we detail how we train a SRF for affordance detection. Similar to border ownership prediction (§2.3), our SRF-based affordance detector is trained over small local feature patches that capture some form of affordance cues. In this work, we leverage local measures of geometry and shape derived from RGB-D data and associate them with different affordance categories. In contrast to previous works that require accurate metric models [2] or predict attributes for segmented objects [265], we show that such local geometric and shape primitives are sufficient for pixel-accurate functionality detection, compared to those discovered via deep learning (which returns only a bounding box) [163], resulting in a more efficient and simpler implementation that runs in real-time due to the fast inference inherent in SRFs. We first introduce the geometric and shape primitives that we use as patch-based features for training the SRF.

5.3.1 Robust geometric and shape features

The key hypothesis of this work is that shape and geometry are physically grounded qualities which are deeply tied to the affordances of a tool part.
When characterizing the geometric qualities of a part, it is important that the features we compute are robust to variations, such as changes in viewpoint. At the same time, we would like to gain insight into the influence of basic geometric measures. Therefore, we leverage simple geometric features, such as surface normals and curvature, to learn the relationship between geometry and part affordance. In order to detect affordances for a variety of tools in cluttered scenes with occlusions, we derive the following local geometric features from small N×N (N = 16) RGB-D input patches:

5.3.1.1 Depth features

We first apply smoothing and interpolation operators to reduce noise and missing depth values. Then, we remove the mean from the patch to gain robustness to absolute changes in depth. As features, we compute histograms over depth gradients (HoG-Depth). Similar to the 2D Histogram of Gradients (HoG) image descriptor [47], we compute gradients on the depth image and quantize them into four orientations to create a compact histogram feature.

5.3.1.2 Surface normals (SNorm)

We use the depth camera’s intrinsic parameters to recover the 3D point cloud, from which we can estimate 3D surface normals. As with the depth, we remove the patch mean during feature learning, to make the representation more robust to changes in viewpoint.

5.3.1.3 Principal curvatures (PCurv)

The principal curvatures [49] are an extrinsic invariant of the local patch geometry, and are independent of viewpoint. The principal curvatures (κ1, κ2), κ1 > κ2, characterize how the surface bends in different directions.

5.3.1.4 Shape-index and curvedness (SI+CV)

The shape index (SI) and curvedness (CV) measures were introduced by Koenderink et al. [134] to characterize human perception of shape.
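To make these features concrete, the HoG-Depth histogram of §5.3.1.1 and the SI/CV measures defined in Eq. (5.1) can be sketched in a few lines. This is a minimal illustration assuming NumPy depth patches; the function names, magnitude-weighted voting and bin-edge handling are assumptions for exposition, not the exact implementation used in the thesis:

```python
import numpy as np

def hog_depth(patch, n_bins=4):
    """HoG-Depth sketch: histogram of depth-gradient orientations.

    The patch mean is removed for robustness to absolute depth, and each
    pixel votes into one of `n_bins` orientation bins, weighted by its
    gradient magnitude (the binning details are illustrative assumptions).
    """
    patch = patch.astype(float) - patch.mean()
    gy, gx = np.gradient(patch)                 # gradients along y, then x
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)     # unsigned orientation in [0, pi)
    idx = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist

def shape_index_curvedness(k1, k2):
    """Shape index and curvedness from principal curvatures (k1 > k2):
    SI = -(2/pi) * arctan((k1 + k2) / (k1 - k2)), CV = sqrt((k1^2 + k2^2) / 2)."""
    si = -(2.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    cv = np.sqrt((k1 ** 2 + k2 ** 2) / 2.0)
    return si, cv

# A horizontal depth ramp puts all gradient energy into the first bin,
# and a symmetric saddle (k1 = -k2) gives SI = 0, CV = 1.
feat = hog_depth(np.tile(np.arange(16, dtype=float), (16, 1)))
si, cv = shape_index_curvedness(1.0, -1.0)
```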
These measures, which are derived from (κ1, κ2), are also viewpoint invariant and are defined as

SI = −(2/π) arctan((κ1 + κ2)/(κ1 − κ2)),   CV = √((κ1² + κ2²)/2)   (5.1)

SI is continuous in the range [−1,+1] and captures the type of local shape (elliptic, parabolic, etc.), while the curvedness captures its perceived strength.

5.3.2 SRF for affordance prediction

Different from the previous approaches using SRFs for border ownership (§2.3) and curved symmetry detection (§4.4) introduced in this thesis, we impose here a novel structure that relates affordances to the local patch geometry and shape. This contrasts with the previous SRFs, which predict pixel-wise ownership or symmetry labels; here the result is a pixel-accurate prediction over regions in the test RGB-D image. To this end, we train a SRF that takes as input X, features from local N × N patches described in §5.3.1, with pixel-accurate annotations of the target affordance, Y (Fig. 5.2 (B)). The annotations impose the expected spatial structure of how the affordance should appear in the final prediction, which in this case are binary segments (c.f. the annotations used for ownership in Fig. 2.5 and curved symmetry in Fig. 4.5, which are just binary contour representations).

Figure 5.2: Affordance detection using SRF. (A) Input image with example patch highlighted. (B) Features extracted from each patch (top) and sampled annotation patches from data (below). (C) Training different patches, X, with corresponding binary affordance annotations, Y, learns the optimal θj at each split node. The leaf nodes store per-pixel confidence scores for each Y encountered. (D) During inference, a test patch is assigned to a leaf node that contains the affordance prediction. Averaging the predictions over the K trees produces an affordance confidence score per pixel.

For the jth split (internal)
node, we train a binary decision function h(x, θj) ∈ {0, 1} over random subsets, x ∈ X, of the input features, so that the parameters θj = (f, ρ) send x to the left child when h(·) = 1, i.e. when x(f) < ρ (where f is the feature dimension for each feature described in §5.3.1), and to the right child otherwise. Similar to other SRF-based approaches, the decision threshold, ρ, is obtained by maximizing a standard information gain criterion Mj over Dj ⊂ X × Y, the features and annotations, using eq. (2.4) computed via an intermediate mapping Π : Y ↦ L of structured affordance labels into discrete labels l ∈ L, following [51]. To determine Π, we first cluster random annotation patches that have the same affordance labels via k-means, and select the centers of the |L| largest clusters. We repeat the training procedure until a maximum tree depth, dt, is reached, and we store at the leaf nodes per-pixel confidence scores for each affordance annotation patch encountered during training (Fig. 5.2 (C)). Each tree in the SRF therefore jointly learns the 2D spatial structure together with the 2.5D features that describe the affordance within a patch. Inference using the trained SRF is extremely simple and fast. Given a forest of K trees and a testing patch with extracted features, the learned decision thresholds in each split node will send the patch to a leaf node that contains the predicted affordance labeling and confidence scores. We then average all K predictions for the final prediction (Fig. 5.2 (D)). In our implementation, we train a SRF with K = 8 trees with a maximum training depth of dt = 64. We use patches of size N = 16, and we set |L| = 10 cluster centers for Π. Training over the entire affordance RGB-D dataset (§5.4.1) in parallel, with an average of 5000 RGB-D images per split, takes around 20 minutes on a 16-core Xeon 2.9GHz machine with 128GB of RAM.
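The split-node mechanics above can be sketched as follows; a minimal illustration assuming the structured annotation patches have already been mapped to discrete labels l = Π(y) via k-means. The function names are illustrative, and the full structured-label machinery of [51] is omitted:

```python
import numpy as np

def split(x, theta):
    """Binary decision h(x, theta) for theta = (f, rho): returns 1 (go to the
    left child) when feature dimension f of x is below threshold rho."""
    f, rho = theta
    return 1 if x[f] < rho else 0

def information_gain(labels, go_left):
    """Standard entropy-based information gain over discrete labels l = Pi(y)
    (the k-means cluster indices of the structured annotation patches)."""
    def entropy(ls):
        if ls.size == 0:
            return 0.0
        p = np.bincount(ls) / ls.size
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())
    left, right = labels[go_left], labels[~go_left]
    n = labels.size
    return entropy(labels) - (left.size / n) * entropy(left) \
                           - (right.size / n) * entropy(right)

# A theta = (f, rho) that perfectly separates the two label clusters
# achieves the maximum gain of 1 bit on this toy example.
X = np.array([[0.1], [0.2], [0.8], [0.9]])
labels = np.array([0, 0, 1, 1])
theta = (0, 0.5)
go_left = np.array([split(x, theta) == 1 for x in X])
gain = information_gain(labels, go_left)
```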
Inference for a single RGB-D image of size 640 × 480 (width × height) takes an average of 0.1s, which includes the time for feature extraction.

5.4 Experiments

5.4.1 Datasets

We use the RGB-D Part Affordance Dataset, introduced in [204], which focuses on everyday tools and the affordances of their parts. Each part or surface of a tool is hand-annotated with multiple ranked affordances, ordered from their most likely (primary) functionality to their least likely functionality; e.g. the inner surface of a bowl is labeled with contain as its primary affordance, followed by scoop and so on. Seven affordance categories are considered: {grasp, cut, scoop, contain, pound, support, wrap-grasp}. The dataset contains 107 diverse kitchen objects and tools with different appearances, captured on a turntable (to obtain multiple poses) using a Kinect RGB-D camera. This results in 30,000 RGB-D images, of which more than 10,000 have pixel-level ground-truth affordance labels. In addition, we supplement the dataset with three sequences of around 1000 RGB-D frames each, collected by a mobile robot observing novel tools in clutter under changing viewpoints. We also evaluate our approach on a more common but related robotic task: determining where to grasp (a specific affordance). For this purpose, we used the recently introduced Cornell Grasping Dataset of Lenz et al. [163] to compare against their deep-learning method and validate the effectiveness of our approach. The dataset contains 1035 RGB-D images of 280 graspable objects, where objects are captured from a small number of discrete viewpoints. Each image contains a single object, and is annotated with a set of rectangles indicating good or bad graspable locations.
5.4.2 Baselines

We compare our SRF-based affordance prediction method (SRF) with two state-of-the-art baselines: 1) Superpixel HMP (S-HMP) [204] and 2) a deep learning technique for detecting graspable regions from [163], termed the Sparse Autoencoder (SAE). S-HMP combines superpixels derived from the SLIC algorithm [1] with hierarchical sparse codes generated using Hierarchical Matching Pursuit (HMP) [21]. The sparse features generated by S-HMP are then passed into a SVM classifier that predicts the affordance category per pixel. SAE combines a cascade of two deep networks for detecting suitable grasping rectangles (locations and orientations) given the input RGB-D image of an object. The first network is small and runs fast to provide initial rectangles, which are then passed on for further evaluation by a deeper network with more complex features. To combine the features together, the authors explored several strategies and found that a two-stage structured regularization over the features learned between the two networks yields the best performance, which is the variant that we compare with here.

5.4.3 Evaluation procedures

We use two evaluation metrics to provide different perspectives on the performance of our approach over the RGB-D Part Affordance dataset. Both approaches output a probability map over the image for each affordance, which can be evaluated against ground-truth labels to fairly compare their performance. First, we use the Weighted F-Measure, Fwβ, introduced recently by Margolin et al. [191] to evaluate saliency maps with continuous-valued responses against binary-valued ground-truths. Fwβ is an extension of the well-known F-measure Fβ:

Fwβ = (1 + β²) · (Prw · Rcw) / (β² · Prw + Rcw),  with β = 1   (5.2)

where Prw and Rcw are weighted versions of the standard precision Pr = TP/(TP + FP) and recall Rc = TP/(TP + FN) measures. Here, TP, TN, FP and FN refer to true positives, true negatives, false positives and false negatives respectively.
The key insight from [191] is to extend the standard precision and recall measures with weights derived by comparing the binary ground-truth and the continuous-valued responses, in order to reduce biases inherent in the standard measures. To do this, the authors proposed weights that measure the dependency of foreground pixels (pixels clustered together near the ground-truth are weighted higher), and assign lower weights to pixels far from the ground-truth. Since the ground-truth in the RGB-D Affordance dataset provides rankings across multiple affordances, for a second measure we define a rank-weighted Fwβ:

Rwβ = Σr wr · Fwβ(r),  with Σr wr = 1   (5.3)

which sums the weighted Fwβ(r) over their corresponding ranked affordances r. The rank weights wr are chosen so that the top-ranked affordance is given the most weight, followed by the secondary affordance and so on. This allows us to capture whether the detector is generalizing appropriately across multiple affordances. Note that when we impose w1 = 1, (5.3) reduces to (5.2), where we consider only the top-ranked affordance. For the Cornell dataset, we follow the same evaluation procedure described in [163], where we average results from 5 random splits, and report both the recognition accuracy, ra, and detection accuracy, da. For detection, we report the point-wise metric following [163] and [243], which considers a detection a success if it is within some distance from at least one ground-truth rectangle center.

Footnote 2: The F-measure with β = 1 is defined by the harmonic mean of the precision and recall values: Fβ = (1 + β²) · Pr · Rc / (β² · Pr + Rc), and is used as a measure of the accuracy of the Pr and Rc scores. β is a positive weight that gives preference to either Rc (β > 1) or Pr (β < 1).
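The scalar core of these metrics follows directly from the definitions above; a minimal sketch with hypothetical function names, omitting the pixel-dependency weighting of [191]:

```python
def f_beta(pr, rc, beta=1.0):
    """F-measure over precision/recall: (1 + b^2) * Pr * Rc / (b^2 * Pr + Rc).
    With beta = 1 this is the harmonic mean of Pr and Rc."""
    if pr == 0.0 and rc == 0.0:
        return 0.0
    return (1.0 + beta ** 2) * pr * rc / (beta ** 2 * pr + rc)

def rank_weighted(f_scores, weights):
    """Rank-weighted aggregate: sum_r w_r * F(r), with the weights summing
    to 1 and the top-ranked affordance weighted highest."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * f for w, f in zip(weights, f_scores))

top = f_beta(0.6, 0.6)                     # harmonic mean of equal Pr, Rc
r = rank_weighted([top, 0.3], [1.0, 0.0])  # w1 = 1 recovers the top-ranked score
```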
To obtain structured labels from this dataset for training the SRF, we estimated the ground-truth annotations of graspable regions by first applying a mask obtained over all graspable rectangles, followed by an edge detection and hole filling operation (Fig. 5.3). For fairness in comparison, we use the same training and evaluation parameters with the same input features in all experiments.

5.4.4 Results and discussion

Figure 5.3: Estimating pixel-accurate annotations from the Cornell Grasping Dataset. (Left) Input RGB image. (Middle) Overlay of several graspable rectangles. (Right) Edge detection and hole filling produces a pixel-accurate segment.

Figure 5.4: Results of affordance detection across three different input RGB-D frames (left) using S-HMP (middle) and SRF (right) over the cluttered sequence: two target affordances per method – contain (l) and wrap-grasp (r). Brighter means higher probability of the target affordance.

Affordance    Non-cluttered subset (single objects)    Cluttered subset (multiple objects)
              S-HMP (Fwβ, Rwβ)   SRF (Fwβ, Rwβ)        S-HMP (Fwβ, Rwβ)   SRF (Fwβ, Rwβ)
grasp         0.367, 0.149       0.423, 0.173          0.398, 0.268       0.414, 0.286
cut           0.373, 0.043       0.438, 0.051          0.350, 0.143       0.490, 0.200
scoop         0.415, 0.046       0.612, 0.120          0.348, 0.195       0.634, 0.377
contain       0.810, 0.168       0.780, 0.170          0.588, 0.334       0.588, 0.382
pound         0.643, 0.035       0.606, 0.042          0.474, 0.147       0.553, 0.172
support       0.524, 0.030       0.561, 0.047          0.451, 0.116       0.485, 0.171
wrap-grasp    0.767, 0.102       0.800, 0.107          0.394, 0.269       0.504, 0.342
Mean          0.557, 0.082       0.603, 0.101          0.429, 0.210       0.521, 0.275

Table 5.1: Performance over the RGB-D Affordance Dataset. (Left) Non-cluttered subset and (Right) Cluttered subset.

We report results that demonstrate the performance of our approach using the proposed metrics described above: (Fwβ, Rwβ) for affordance detectors trained using
S-HMP and SRF. We used the same train/test splits for both methods, and report averaged results over random splits of the RGB-D Affordance Dataset from [204]. Table 5.1 summarizes the two detectors’ performance over the seven affordance labels considered. From the results, we can see that SRF consistently outperforms S-HMP in both evaluation metrics over both subsets of the RGB-D Affordance Dataset (Table 5.1). The difference is most significant in the cluttered subset, where SRF outperforms S-HMP by 0.092 for Fwβ and 0.065 for Rwβ. This shows that not only are the predictions from SRF more accurate for the top-ranked (primary) affordance, they also give more reasonable secondary affordances compared to S-HMP. This is further confirmed when we compare example outputs in Fig. 5.4. Not only are the SRF predictions better aligned to actual objects (unlike the superpixels of S-HMP), they also generalize to novel categories: e.g. the blue scoop is detected as contain by the SRF, which is a reasonable secondary affordance, while S-HMP has no detectable responses. Another outstanding aspect of SRF compared to S-HMP is that the predictions are performed per pixel in real-time, unlike S-HMP, which takes minutes per image since the SVM classifier has to be run over all superpixels.

Feature Sets                    S-HMP Fwβ        SRF Fwβ
Depth+SNorm+PCurv+[SI/CV]†      0.557 (+0.018)   0.603 (+0.072)
Depth+SNorm+PCurv               0.562 (+0.023)   0.601 (+0.070)
Depth+SNorm                     0.547 (+0.008)   0.599 (+0.068)
Depth                           0.539            0.531

Table 5.2: Ablation experiments. +x indicates the amount of change over Depth. †Since SI and CV are related measures, the best results using either one of them are reported; SI and CV yield the best results for S-HMP and SRF respectively.

Method       ra %    da %
RF           85.3    62.5
SRF          93.5    87.0
SAE [163]    93.7    88.4
S-HMP        95.2    92.0

Table 5.3: Results on the Cornell Grasping Dataset.
In terms of feature ablations (Table 5.2), we see that the SRF improves most with the addition of SNorm, while S-HMP benefits most with the addition of PCurv, and both approaches saturate with SI/CV. A possible explanation for these differences is that the sparse codes of S-HMP are already learning normal-based representations from depth alone, while principal curvatures are not as easily derived. We compared all approaches with SAE over the Cornell grasping dataset using the ra and da metrics, summarized in Table 5.3. In order to highlight the contribution of the structured constraints in the SRF, we trained a standard random forest (RF) with 20 trees over the annotated grasping rectangles in the dataset, using the same feature set as the SAE: RGB + Depth + SNorms. We note first that using a standard RF results only in mediocre performance. By adding the structured constraints and the proposed robust features, the SRF is able to achieve recognition and detection performances comparable to the deep-learning-based SAE. S-HMP outperforms the other approaches by a large margin, achieving state-of-the-art performance for this dataset. It is important to note, however, that the SRF provides very reasonable predictions of graspable locations with pixel-wise accuracy (Fig. 5.5), within a fraction of the time needed for inference using SAE (30s) or S-HMP (90s) vs. 0.1s for SRF.

Figure 5.5: Grasping locations predicted by SRF. (Top) Input RGB-D images for four example objects. (Bottom) Predicted graspable locations. Notice the large difference in shape of the graspable regions. Brighter means higher probability.

5.5 Conclusions

We have described a fast, pixel-accurate affordance detector based on SRF in this chapter. Similar to the other detectors based on SRFs explored in this thesis, our approach draws its success from the robust geometric features extracted from the RGB-D image.
Compared to S-HMP and SAE, its strength lies in its generalizability to novel objects and affordance categories. As it predicts pixel-accurate affordances, the approach is robust to clutter, occlusions and viewpoint changes. Finally, as inference using SRF is extremely efficient, our approach is able to provide reasonable predictions within 0.1s, which makes it more suitable for practical robotic applications than other slower but more accurate approaches (S-HMP and SAE). The ability to detect affordances or functionalities, as was noted in the opening remarks of this chapter, is key to generalizing object recognition beyond outward appearance. In terms of solving the FGO problem (§1.1), we move one step closer towards linking high-level knowledge, as affordances themselves carry important semantic information about object attributes which can be found in natural language. We discuss the role of language, with final remarks and insights, in the concluding chapter of this thesis next.

Chapter 6: Closing the Semantic Gap using Language

In previous chapters, we have shown a gradual progression from Gestalt principles to higher-level visual tasks. Starting from local cues and Gestalt operators, we have proposed an approach for border ownership, which aids contour-based recognition of object categories that share similar shapes (Chapters 2 and 3). From local cues, we detect bilateral and curved symmetries in images with clutter. By embedding a symmetry prior into a MRF-based representation of the image edges, we are able to segment symmetrical regions, linking symmetry (a key Gestalt) with the higher-level task of visual segmentation (Chapter 4). Finally, in Chapter 5, we demonstrated a fast and effective approach for detecting affordances of tool parts in real images with clutter and occlusions, which models Gibson’s notion of “direct perception” [83] and agrees with Marr’s viewpoint of invariant representations [192].
These approaches are important steps towards solving the figure-ground organization (FGO) problem, which we have argued in §1.1 to be important in bridging the so-called “semantic gap” between high-level (semantic) representations of the world and visual representations. The remaining question is how one can leverage the approaches developed in the preceding chapters to actually link up with semantic or linguistic representations. In this concluding chapter, we present future research directions that exploit structure from language, in similar ways as we have done before, to produce simpler linguistic representations that are closer to the mid-level visual features developed so far. In some sense, we are proposing an analog of Gestalt for language (though in the opposite direction), so that it is easier to find a common canonical feature space for linking these two modalities together.

6.1 Introduction

Figure 6.1: Illustrating how language and vision can be used together, e.g. for detecting a bottle in the image. We produce mid-level representations from both language (left) and vision (right) so that one can use text describing the bottle’s attributes, which would then activate the correct visual operators for detection.

Throughout this thesis, we have advocated solving the FGO problem as a key driver for the various visual approaches presented in previous chapters. However, the FGO problem, or the semantic gap in general, has two facets: Vision and Language. In this thesis, we have focused on the former and have largely ignored the latter, which we address here as future work. The main idea is summarized in Fig. 6.1, where the goal is to determine a common representation between these two modalities so that a common task (e.g. object detection or scene understanding) can be enhanced.
We have shown in this thesis that it is possible to use Gestalt in a principled manner for obtaining mid-level visual features, but an analog in language remains elusive. The key challenge, therefore, is finding such a representation in language and combining it with vision in a principled manner for solving multimodal tasks. We present some works from other communities in §6.2 and propose possible research directions in §6.3.

6.2 Related Works

In this section, we expand our review of related works from §1.3 to include works from the Natural Language Processing (NLP) and multimedia (MM) communities that solve similar problems but from different (linguistic and multimodal) perspectives. We also present related works on visual attributes, which we show in §6.3.1 to be a good common representation between language and vision.

6.2.1 Vision and language from the NLP and MM communities

We have discussed the use of language by the Computer Vision community as a contextual tool for object recognition or scene understanding tasks in §1.3.2. Here we present relevant works from the NLP and MM communities that address similar issues. In the NLP community, the most relevant line of work, which explores processing raw linguistic data to obtain semantically meaningful information for integration with vision, is known as “semantic parsing”. Among the most well-known formalisms is the Combinatory Categorial Grammar (CCG) of Steedman [260], which parses text into a logical lambda expression that captures semantic meaning. A CCG is defined by a lexicon and a small set of combinatory rules. Each entry in the lexicon is a word-category pair that encodes semantic relationships. For example, the entry Boston ⊢ N : λx.place(x) pairs the word “Boston” with a category that has syntactic type N (noun) and meaning λx.place(x). The output of the CCG parse of a sentence is a logical expression consisting of lambda expressions that encode the meaning and syntax of the sentence at the same time.
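As a toy illustration of such a lexicon entry, the word-category pairing can be represented directly as data; this is a didactic sketch only (the Python encoding is an assumption for exposition, and a real CCG parser also needs the combinatory rules):

```python
# Toy sketch of a CCG-style lexicon: each entry pairs a word with a
# syntactic type and a lambda-expression meaning, as in
#   Boston |- N : lambda x. place(x)
def place(x):
    return ("place", x)   # the semantics of the predicate place(.)

lexicon = {
    "Boston": {"syntax": "N", "semantics": lambda x: place(x)},
}

entry = lexicon["Boston"]
meaning = entry["semantics"]("boston1")   # apply the lambda to an entity
```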
Using CCG or other related semantic parsers [38, 125, 172], research in NLP has tried to integrate high-level knowledge to improve visual tasks and vice versa. This is known as the “grounding” problem in linguistics, where the goal is to embed real-world meaning so that ambiguity in both the linguistic and vision domains can be reduced. The key is to train the semantic parser so that (in the case of CCG) the appropriate lexicon and mappings relevant to the task can be used. There are numerous works that vary in their learning approaches, representations and tasks. [29] uses Reinforcement Learning (RL) to map instructions to meaning, for the task of mapping high-level instructions to automated system commands to be executed in an operating system’s GUI. There has also been research into grounding instructions in the environment that the agent (robot) is in, with most work focused on robot navigation. [290] used a specialized MRF to encode transition probabilities on how the environment and its objects occur, with observational probabilities derived from natural language directions. Inference on the MRF produced the most likely path given noisy directional instructions from humans. [196] uses supervised learning in a statistical machine translation framework to parse natural language instructions into formal expressions, upon which a path-finding algorithm is applied to determine the optimal path to choose. Similarly, the work of [128] trains, in a supervised manner, a semantic parser for mapping instructions to navigation using automatically induced labels. Specifically, the authors learned a mapping from natural language instructions into a “meaning space” by determining the appropriate parameters for a Probabilistic Context Free Grammar (PCFG) when sentences are paired with some form of ambiguous description.
[267] uses a graphical model representation to encode semantics (known as G3) to learn a mapping between instructions and actions via demonstrations that are to be performed by a robot (e.g. move, load pellets, reverse, etc.). The training data are natural language descriptions of the environments and instructions obtained via Amazon Mechanical Turk. [195] combines visual information of objects on a table with linguistic descriptions to learn a mapping between attributes (color only) and objects. Finally, [145] combines language and visual perception together in a framework known as Logical Semantics with Perception, so that scenes with objects and a linguistic query are parsed into logical expressions and, using a pre-defined environment (essentially a knowledge-base of how objects come together and interact), outputs the appropriate locations of the objects that correspond to the query. There has also been significant development of techniques for fusing language and visual information together in the MM communities. By viewing this as a multimodal fusion problem, several approaches have been proposed in the realms of context-based image retrieval, tag-based retrieval, image-to-image suggestion, image-to-text production, etc. Many of the works are based on a technique known as Canonical Correlation Analysis (CCA), proposed in the 1930s by Hotelling [110]. The idea is simple: CCA takes two feature vectors from two different modalities (text and vision), and attempts to determine a set of basis vectors (for each modality) such that the correlations of the projections of the features in both modalities are mutually maximized. The basis vectors therefore define a new latent semantic space that is shared by both language and vision, simplifying the cross-modal retrieval tasks mentioned earlier.
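The CCA idea can be sketched in a few lines of linear algebra; a minimal sketch assuming NumPy feature matrices, with a small ridge term added for numerical stability (an implementation convenience, not part of classical CCA):

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Minimal linear CCA sketch (not the kernelized variants).

    Finds basis vectors A, B such that correlations between the projections
    X @ A and Y @ B are mutually maximized.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view via Cholesky factors, then SVD the cross-covariance.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :n_components])     # basis for view X
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])  # basis for view Y
    return A, B, s[:n_components]  # s holds the canonical correlations

# Two toy "modalities" sharing one latent signal: the top canonical
# correlation is close to 1, the second close to 0.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.column_stack([z, rng.standard_normal(500)])
Y = np.column_stack([z, rng.standard_normal(500)])
A, B, corrs = cca(X, Y)
```

The kernelized and deep variants discussed next replace the linear projections above with nonlinear feature maps before the same correlation maximization.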
Recent works such as those of [95] have extended CCA to a kernelized version (Kernel CCA) to handle non-linear transformations and more complex cross-modal relationships, by first embedding a kernel to map the original features to higher dimensions. Works such as [116, 129, 197] have shown good results for action recognition, audio-video syncing/detection and object recognition. [85] recently introduced a 3-view approximate KCCA approach such that the kernels can be computed over extremely large datasets, with very good cross-modal retrieval results for tags and images. Finally, [5] proposed an extension to KCCA known as Deep CCA (DCCA), where the kernels for KCCA are first learned using deep neural networks before KCCA is applied.

6.2.2 Visual attributes

Attributes are intrinsic properties of entities that are immutable – that is, they do not change with time, space or location. Systems with such capabilities are able to recognize new, unseen entities simply from these shared properties, enabling them to scale up to a wide variety of categories. Because of these attractive properties, recognition based on shared attributes has been an area of active research in Computer Vision, as it provides strong invariance against changes in viewpoint and environmental conditions. Among the earlier works in this direction is the work of [17], which introduced the idea of “geons”: a set of 32 primitive 3D geometric shapes that were proposed to be crucial for human object recognition. Following similar lines, but from a more neurologically plausible viewpoint, is the work of [235] that proposed “HMAX”, a hierarchical model of the human visual cortex for recognizing objects. The model begins with simple inputs from low-level cells (directed edge responses) that are then fed to more complex composite cells that pool responses over a larger visual field (longer edges).
The work of [74] uses a similar hierarchical approach consisting of contour fragments, but introduces a generative model that learns the appropriate combinations at different layers of the model from training data. Part-based models, introduced by [70] and extended further using HOG (Histogram of Gradients) features trained in a discriminative framework [69], have become a standard baseline for object recognition; they can also be viewed as an effort in this direction, since the approach essentially recognizes shared parts (e.g. arms, torso, eyes, etc.). Attributes in Computer Vision have also been widely used in several learning approaches, notably in the work of [152], where the authors proposed the notions of “zero-shot” and “one-shot” transfer learning, in which attributes learned from a dataset are used to classify a new object that has no (or a single) training exemplar. More recently, attributes have been used in a discriminative approach for object recognition by training specific classifiers for particular hand-selected attributes of objects [64], extended further for recognizing object categories [63]. Silberer et al. [248] directly used such attributes to learn a joint model of textual attributes and visual features for cross-modality tasks. A recent development was the introduction of “relative attributes” [221], where attributes are used in a ranked manner to learn a relative scale for comparison, which yields better results than simply using manually specified attributes to learn a binary decision boundary. Other authors have recently addressed the issue of discovering attributes from data instead of from manually specified terms. [14] uses image captions and annotated labels to learn discriminative visual attributes from web-based images. To ensure that meaningful attributes are obtained, [220] proposed an active learning approach with humans in the loop.
Along similar lines but over a much larger dataset, [222] used the MIT-SUN dataset of scene categories, with the help of Amazon Mechanical Turk, to re-categorize and discover novel scene-level attributes that are shown to be more precise than the original categories found in the dataset. There has also been significant effort in using linguistic resources to discover attributes and to induce semantically meaningful ontologies for comparison and classification. [240] used the ImageNet dataset to learn attributes that parallel the WordNet ontologies of object categories. [275] uses visual concepts obtained from the Internet to design a novel “classeme” descriptor for efficient object recognition. There has also been a recent parallel trend of moving from learning/discovering so-called “nameable” attributes – those with precise semantic meaning understandable by humans – to discovering “un-nameable” attributes: attributes derived directly from low-level visual features. [302] used such un-nameable attributes for object categorization, where the authors designed a category-attribute matrix that allows for object category separation and is at the same time suitable for learning. Along similar lines, the authors in [54] used a latent CRF model to learn an attribute classifier that discovers attributes that are both semantically meaningful and visually detectable. The initial model is further refined in a second training step with humans in the loop to determine meaningful attributes so that a name can be assigned.

6.3 Future Research Directions

We present here three future research directions that we believe will lead to a complete solution of the FGO problem. First, we present ideas for developing an appropriate mid-level representation of language via affordances, which we call grounded affordances.
Second, we propose to use CCA, popular in the MM communities, to associate mid-level linguistic features with the mid-level visual representations developed in earlier chapters. Finally, we discuss the role of deep learning as a modern and viable alternative for multimodal association.

6.3.1 Language grounding of affordance-based attributes

The first task is to determine the appropriate level of representation in language so that it is even possible to learn a reliable mapping. This has remained an open research question (§6.2.1 and §1.3.2). The key challenge is finding a representation that is able to capture the complexity of language while at the same time associating it with its appropriate visual counterparts. As was noted in Chapter 1, this is a very difficult problem, as it deals with the fundamental semantic gap between language and vision (Fig. 6.2 (top)).

Figure 6.2: Why grounding attributes using affordances makes sense. (Top) The semantic gap between language (text) and low-level visual representations is large – one does not naturally describe a complex object in terms of its low-level visual representations or appearances. (Bottom) Grounding attributes using affordances provides a mid-level representation in language to bridge the gap with mid-level visual representations (here we show grouped contours or codons from Chapter 3).

Faced with this formidable challenge, it is clear that using language at the semantic level is not appropriate. We therefore draw our motivation from recent research into attributes (§6.2.2), which are claimed to be a form of “mid-level” visual concept with properties that are shared among objects within the same category. Additionally, we assume the objects of interest have some (small number of) associated functions that we want to categorize, such as tools (see Chapter 5).
Our key hypothesis is that attributes grounded via affordances, or grounded attributes, can be easily associated in both the language and vision spaces (Fig. 6.2 (bottom)). This hypothesis is supported by the observation that, when asked to describe objects in a scene as reported in [222], most subjects expressed the highest degree of confidence when they described the object via its functionalities or affordances. This indicates that when we "reduce" the complexity of a textual description down to the level of affordances, the object or scene is still understandable by others.

Figure 6.3: A typical tool (spoon), with its associated verbs and verbal descriptions. Note that verbs carry high-level semantic meanings while the verbal descriptions are closer to the visual space.

To do this, we propose that the functionality/affordance of a particular target m be expressed as a set of verbal descriptions, Vm = {v1, v2, v3, · · · , vn}, in the language space.¹ Verbal descriptions are not the same as verbs but are related to verbs in a consequential relationship. To be more concrete, let us give an example of a typical tool, e.g. "spoon", with its associated verbs (affordances) and its verbal descriptions (Fig. 6.3). A typical list of associated verbs² would be {drink, sip, stir}, but such verbs themselves contain a significant amount of semantic content, so that it is hard to find any visual association without further processing. However, looking at the verbal descriptions, we have {convex_part_container, small_convex_part_container, flat_edge_with_handle}, which correspond to simplified descriptions of the parts of the spoon that afford a particular function/verb.

¹Objects that do not serve a particular function are not well defined (e.g. a "pear"), and complex objects that have numerous uses (e.g. a "car") are not considered, since it is their parts that afford the functions – "wheels", "doors", etc.
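To make the proposed representation concrete, the spoon example above can be encoded as a simple verb-to-description mapping. This is purely an illustrative sketch of the data structure (the names come from Fig. 6.3; nothing here is part of an existing implementation):

```python
# Hypothetical encoding of the affordances of a "spoon" (Fig. 6.3):
# each verb is paired with the verbal description of the part that,
# as a consequence, induces that verb.
V_spoon = {
    "drink": "convex_part_container",
    "sip": "small_convex_part_container",
    "stir": "flat_edge_with_handle",
}

# The verbal descriptions alone form the set V_m used for learning.
Vm = sorted(set(V_spoon.values()))
```

The consequential relationship is thus stored explicitly: looking up a verb returns the description of the part that affords it.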
Such descriptions can be seen as a form of consequence of the associated verbs: e.g., because a spoon has the verbal description convex_part_container, it induces as a consequence the verb drink. Since the descriptions describe certain visual aspects of the target that contribute to the verb, they are therefore more suitable for learning a visual association with the language space than the actual verbs themselves. Another reason why we choose verbal descriptions is that they relate fundamentally to the innate qualities of the target, essentially defining its purpose and hence identifying it. This is in contrast with other kinds of attributes (whether visual or textual), such as color or texture, which are shared across categories but do not fully define the target. For example, a "red" or "rough" attribute can correspond to many different objects, and is not nearly as useful as the {sharp_edges, has_cutting_edge} verbal descriptions. Since such descriptions are grounded in the physical properties of the target, we call them grounded attributes in order to emphasize this fact.²

An important challenge that one must address is determining Vm. There are several ways to do this. Firstly, one can learn a (limited) set of possible grounded attributes from structured ontologies such as WordNet or from online dictionary definitions (e.g. Wiktionary³). This is done by extracting definitions of the noun or by parsing example sentences that demonstrate how the noun is used (and extracting the correct verbs). Such an approach, however, results in a very limited and noisy set of associated verbs. Take, for example, the definition of "spoon" (from Wiktionary):

spoon (plural spoons): 1. An implement for eating or serving; a scooped utensil whose long handle is straight, in contrast to a ladle. 2. An implement for stirring food while being prepared; a wooden spoon. 3.

²Verbs are the most direct representation of affordances in language.
A measure that will fit into a spoon; a spoonful.

One can see that the definition is written at a high level because it is meant to be understood by humans, so only a few key verbs can be found using this approach: {eat, serve, stir, measure}. Additional functional descriptions of the object such as "straight handle" or "scooped utensil", which are visually important, could be used but require additional processing, since such definitions often imply another word (e.g. "handle", "scooped", "wooden spoon") that may result in a circuitous definition, or use inconsistent word descriptions such that it is hard to associate them with, say, another spoon-like object.³ Another possible way to expand this verb association would be to mine object-verb relationships from large data, as in [299]. However, this approach is problematic because unstructured text from corpora like Gigaword [89] often does not contain a clear definition or description of the object used under a variety of conditions. The reason is that when we write about objects in text, they are almost always understood at the visual level rather than at the textual level. This means that it would be extremely hard to extract the verbs or verbal descriptions from large unstructured corpora.

³http://en.wiktionary.org/

Figure 6.4: Interface used for soliciting verbal description responses from AMT turkers. (A) Ranking the proposed verb in terms of relevance. (B) Selecting locations via bounding boxes which correspond to the verb. (C) Brief description of the location in terms of its appearance.

Clearly, since the data that we need to construct Vm is not readily available or easily processed in the wild, we propose to extract such information from humans directly via crowd-sourcing methods such as Amazon Mechanical Turk (AMT). Starting from a list of target objects/categories, we will use standard syntactic parsing methods (e.g. [224]) together with a dependency parser (e.g.
[36]) to extract the initial list of verbs associated with the object's affordances. We then use this initial list of verbs to design an interface that will enable us to solicit responses from AMT turkers, as shown in Fig. 6.4. A typical image of the target is shown together with one of the selected verbs from the list. The task is simple and consists of two steps: 1) draw a bounding box indicating the part (if possible) of the object that affords the particular verb, and 2) write down a brief (< 10 words) description of the part based on its appearance. Different images will be shown, yielding a large number (at least 20) of parts that are associated with the verb, together with potential responses for their verbal descriptions. With these verbal descriptions, we will apply a clustering step to determine the most consistent and highest-ranked descriptions for Vm, together with a set of described image parts Im that will be used for learning a mapping between the language and vision spaces. How this can be done is described next.

6.3.2 Learning a canonical multimodal space

Given the set of verbal descriptions Vm and image parts Im, we would like to formulate a learning mechanism that learns a mapping between Vm and Im – essentially, a mapping between the components in the language space of Vm and the visual space of Im. Since the two spaces are very different, we propose using Canonical Correlation Analysis (CCA) or its kernelized version (KCCA) [95] to determine this mapping into a common latent space, as shown in Fig. 6.5. Using CCA/KCCA makes sense because: 1) the features extracted from Vm and Im are describing the

Figure 6.5: Illustrating CCA applied between verbal descriptions associated with image features to form a common latent space that generates the information in the two modalities.
The verbal descriptions, within boxes of the same color, are paired with their visual representations (here we use codons described in Chapter 3), highlighted with the same color. The space here is represented by the first canonical variate $\mathbf{w}_1 = \langle \omega_1, \omega_2, \omega_3, \cdots \rangle$, which maximizes the correlation of the two features.

same object parts, and therefore must be generated by a common (latent) source, and 2) the two spaces come from two different views (modalities), so the problem of fusing language and vision can be seen as a cross-modal problem that CCA/KCCA is designed to handle. We first describe CCA, followed by KCCA.

Without loss of generality, we denote by $(X_1, X_2)$ the data from vision and language respectively. CCA [110] attempts to find linear projections $\mathbf{w}_1, \mathbf{w}_2$ of two random vectors $X_1 \in \mathbb{R}^{n_1}$ and $X_2 \in \mathbb{R}^{n_2}$ such that their correlation is mutually maximized:

\[ (\mathbf{w}_1^*, \mathbf{w}_2^*) = \operatorname*{argmax}_{\mathbf{w}_1, \mathbf{w}_2} \operatorname{corr}(\mathbf{w}_1^\top X_1, \mathbf{w}_2^\top X_2) = \operatorname*{argmax}_{\mathbf{w}_1, \mathbf{w}_2} \frac{\mathbf{w}_1^\top \Sigma_{12} \mathbf{w}_2}{\sqrt{\mathbf{w}_1^\top \Sigma_{11} \mathbf{w}_1 \, \mathbf{w}_2^\top \Sigma_{22} \mathbf{w}_2}} \tag{6.1} \]

where $\Sigma_{11} \in \mathbb{R}^{n_1 \times n_1}, \Sigma_{22} \in \mathbb{R}^{n_2 \times n_2}$ are the covariances of $X_1, X_2$ and $\Sigma_{12} \in \mathbb{R}^{n_1 \times n_2}$ their cross-covariance. These covariances are derived from the total covariance matrix $\hat{\Sigma}$, defined as:

\[ \hat{\Sigma} \triangleq \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = \hat{E}\left[ \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^\top \right] \tag{6.2} \]

where $\hat{E}$ is the empirical expectation, defined for a function $f(\mathbf{a}, \mathbf{b})$ as:

\[ \hat{E}\left[f(\mathbf{a}, \mathbf{b})\right] = \frac{1}{m} \sum_{i=1}^{m} f(\mathbf{a}_i, \mathbf{b}_i) \tag{6.3} \]

An important observation is that the objective in (6.1) is invariant to changes in the scales of $\mathbf{w}_1$ and $\mathbf{w}_2$, which means that the same result can be achieved by simply maximizing the numerator while constraining the projections in the denominator to be of unit variance:

\[ (\mathbf{w}_1^*, \mathbf{w}_2^*) = \operatorname*{argmax}_{\mathbf{w}_1^\top \Sigma_{11} \mathbf{w}_1 = 1,\; \mathbf{w}_2^\top \Sigma_{22} \mathbf{w}_2 = 1} \mathbf{w}_1^\top \Sigma_{12} \mathbf{w}_2 \tag{6.4} \]

Eq. (6.4) has a closed-form solution using Lagrange multipliers [95], which can be reduced to a pair of standard eigenproblems:

\[ \begin{cases} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \mathbf{w}_1 = \lambda^2 \mathbf{w}_1 \\ \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \mathbf{w}_2 = \lambda^2 \mathbf{w}_2 \end{cases} \tag{6.5} \]

Solving eq.
(6.5) yields the eigenvectors $(\mathbf{w}_1^*, \mathbf{w}_2^*)$ that define the canonical bases for $(X_1, X_2)$ respectively. Choosing the eigenvectors $(\mathbf{w}_1^1, \mathbf{w}_2^1)$ corresponding to the largest eigenvalue $\lambda^2$ yields the largest correlation between the first canonical variates, $(x_1', x_2') = (\mathbf{w}_1^{1\top} X_1, \mathbf{w}_2^{1\top} X_2)$ (and so on), in the new space.

KCCA extends CCA in a similar way. Instead of finding the linear projections $(\mathbf{w}_1, \mathbf{w}_2)$ of CCA, KCCA finds pairs of non-linear projections by first projecting the data into a higher-dimensional space using the so-called "kernel trick". Denoting a sequence of data of length $m$ from the two views as $X_1 \in \mathbb{R}^{m \times n_1}, X_2 \in \mathbb{R}^{m \times n_2}$ (each sample is a row in the data matrices), we can rewrite the definitions of their covariances in eq. (6.2) as $\Sigma_{11} = X_1^\top X_1, \Sigma_{12} = X_1^\top X_2$ (and similarly for $\Sigma_{21}, \Sigma_{22}$). The projections $\mathbf{w}_1, \mathbf{w}_2$ can also be rewritten as projections of the data matrices with respect to $\alpha_1 \in \mathbb{R}^m, \alpha_2 \in \mathbb{R}^m$, such that $\mathbf{w}_1 = X_1^\top \alpha_1$ and $\mathbf{w}_2 = X_2^\top \alpha_2$. Substituting these into eq. (6.1) yields:

\[ (\alpha_1^*, \alpha_2^*) = \operatorname*{argmax}_{\alpha_1, \alpha_2} \frac{\alpha_1^\top X_1 X_1^\top X_2 X_2^\top \alpha_2}{\sqrt{\alpha_1^\top X_1 X_1^\top X_1 X_1^\top \alpha_1 \; \alpha_2^\top X_2 X_2^\top X_2 X_2^\top \alpha_2}} \tag{6.6} \]

Denoting the kernel matrices for each view as $K_1 = X_1 X_1^\top \in \mathbb{R}^{m \times m}$ and $K_2 = X_2 X_2^\top \in \mathbb{R}^{m \times m}$, we derive from eq. (6.6) the dual form of eq. (6.1):

\[ (\alpha_1^*, \alpha_2^*) = \operatorname*{argmax}_{\alpha_1, \alpha_2} \frac{\alpha_1^\top K_1 K_2 \alpha_2}{\sqrt{\alpha_1^\top K_1^2 \alpha_1 \; \alpha_2^\top K_2^2 \alpha_2}} \tag{6.7} \]

However, eq. (6.7) is problematic in practice because, without constraining the directions of the projections, one can easily find trivial projections in the higher dimensions. The solution is to regularize the projections, in the manner of Partial Least Squares (PLS), by penalizing the norms of the associated weights in the denominator:

\[ (\alpha_1^*, \alpha_2^*) = \operatorname*{argmax}_{\alpha_1, \alpha_2} \frac{\alpha_1^\top K_1 K_2 \alpha_2}{\sqrt{(\alpha_1^\top K_1^2 \alpha_1 + \kappa_1 \|\mathbf{w}_1\|^2)\,(\alpha_2^\top K_2^2 \alpha_2 + \kappa_2 \|\mathbf{w}_2\|^2)}} \tag{6.8} \]

with $\kappa_1, \kappa_2$ as the regularization coefficients. Similar to the primal form of the objective in eq.
(6.6), the objective in dual form is not affected by changes in the scales of $\alpha_1$ and $\alpha_2$, which leads us to the simplified objective where we constrain the terms in the denominator to be unity:

\[ (\alpha_1^*, \alpha_2^*) = \operatorname*{argmax}_{\alpha_1^\top K_1^2 \alpha_1 + \kappa_1\|\mathbf{w}_1\|^2 = 1,\; \alpha_2^\top K_2^2 \alpha_2 + \kappa_2\|\mathbf{w}_2\|^2 = 1} \alpha_1^\top K_1 K_2 \alpha_2 \tag{6.9} \]

The solution to eq. (6.9) can be found using Lagrange multipliers, which again reduces to a pair of standard eigenproblems:

\[ \begin{cases} (K_1 + \kappa_1 I)^{-1} K_2 (K_2 + \kappa_2 I)^{-1} K_1 \alpha_1 = \lambda^2 \alpha_1 \\ (K_2 + \kappa_2 I)^{-1} K_1 (K_1 + \kappa_1 I)^{-1} K_2 \alpha_2 = \lambda^2 \alpha_2 \end{cases} \tag{6.10} \]

Solving eq. (6.10) yields the top eigenvectors $(\alpha_1^*, \alpha_2^*)$, which maximize the correlation of the projections in the new space.

Depending on the complexity of the representations used in the text and vision domains, one can choose the linear CCA projections or the non-linear KCCA projections. CCA is fast and easy to optimize but in practice does not handle data with large non-linearities between the two modalities, while KCCA depends on the assumption that the kernels $K_1, K_2$ can be inverted, which limits its applicability when $m$ gets large, although several fast approximations have been proposed [85, 189, 281] to overcome this limitation.

Once the common latent space has been learned, it is straightforward to generate novel views in vision given input text descriptions (and vice versa) of the target. The most naive (but fastest) way to do this would be a nearest neighbor (NN) approach, where the transformed visual centers closest to the transformed language centers are returned directly. One can also apply Support Vector Machines (SVMs) to learn classifiers within this space.

There are several important issues that need to be addressed for this approach to work. The first is the representation of the language and vision data that will be used in CCA/KCCA. For visual data, we propose to directly use the mid-level representations developed in this thesis: ownership, grouped contours (codons), symmetry and affordances.
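The eigenproblems in eqs. (6.5) and (6.10) are small enough to check numerically. The sketch below is our own NumPy illustration (not the implementation proposed above): it solves the first eigenproblem of each formulation and recovers the paired projection up to scale.

```python
import numpy as np

def cca_first_pair(X1, X2, eps=1e-8):
    """Solve the first CCA eigenproblem of eq. (6.5).
    X1: (m, n1), X2: (m, n2); rows are paired samples."""
    X1 = X1 - X1.mean(0)
    X2 = X2 - X2.mean(0)
    m = X1.shape[0]
    S11 = X1.T @ X1 / m + eps * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / m + eps * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / m
    # Sigma_11^-1 Sigma_12 Sigma_22^-1 Sigma_21 w1 = lambda^2 w1
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    vals, vecs = np.linalg.eig(M)
    w1 = vecs[:, np.argmax(vals.real)].real
    # w2 follows (up to scale) as Sigma_22^-1 Sigma_21 w1
    w2 = np.linalg.solve(S22, S12.T @ w1)
    return w1, w2

def kcca_first_pair(K1, K2, kappa1=0.1, kappa2=0.1):
    """Solve the first regularized KCCA eigenproblem of eq. (6.10)."""
    m = K1.shape[0]
    I = np.eye(m)
    # (K1 + kappa1 I)^-1 K2 (K2 + kappa2 I)^-1 K1 a1 = lambda^2 a1
    M = np.linalg.solve(K1 + kappa1 * I, K2) @ np.linalg.solve(K2 + kappa2 * I, K1)
    vals, vecs = np.linalg.eig(M)
    a1 = vecs[:, np.argmax(vals.real)].real
    # alpha2 follows (up to scale) as (K2 + kappa2 I)^-1 K1 a1
    a2 = np.linalg.solve(K2 + kappa2 * I, K1 @ a1)
    return a1, a2
```

For two views generated from a common latent source, the correlation between the projected variates approaches 1, as the analysis above predicts; with linear kernels the dual solution agrees with the primal one.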
For language, since we are working directly with verbal descriptions of grounded attributes Vm, one simple way would be to first cluster all the attributes in Vm to remove redundant attributes (and reduce the dimensionality) using techniques such as k-means, probabilistic latent semantic analysis (pLSA) [108] or latent Dirichlet allocation (LDA) [18]. Using the reduced vocabulary of attributes, we can then represent each vk ∈ Vm as a binary vector, where 1 indicates the presence of an attribute and 0 otherwise. A linear kernel, $K_v(v_i, v_j) = v_i^\top v_j$, which counts the number of shared attributes, can then be used for learning the mapping in CCA/KCCA. Other forms of representation are possible, such as a sub-sequence kernel [201] that counts the number of word sequences shared between the sentences generated (from Turkers' responses) to describe the functionality of the object.

The second issue, related to the first, is an appropriate distance measure in the shared latent space. Most works surveyed in §6.2.1 have traditionally used the standard Euclidean distance but, as was noted in [85], designing a similarity measure that takes into account the magnitudes of the eigenvalues in CCA produced better results than the Euclidean distance. We intend to experiment with several similarity measures in the latent space in the proposed work as well.

Finally, a key issue if we are to use KCCA is the design of the vision and text kernels. There are numerous choices, such as standard Radial Basis Function (RBF) kernels or d-degree polynomial kernels, that can be used in both modalities. Motivated by the results reported in [5], we plan to experiment with kernels that capture some form of high-level semantics within the data, for example by grouping longer contour fragments together (vision) and/or longer sequences of words/verbal descriptions that would capture long-range semantics.
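A minimal sketch of this binary-vector representation and the linear kernel K_v follows. This is our own illustration; the vocabulary below is hypothetical, reusing the verbal descriptions from the spoon example of §6.3.1 rather than a learned, clustered vocabulary:

```python
import numpy as np

# Hypothetical reduced vocabulary of grounded attributes (post-clustering).
vocab = ["convex_part_container", "small_convex_part_container",
         "flat_edge_with_handle", "sharp_edges", "has_cutting_edge"]

def to_binary(attrs):
    """Binary indicator vector: 1 if the attribute is present, 0 otherwise."""
    return np.array([1.0 if w in attrs else 0.0 for w in vocab])

v_spoon = to_binary({"convex_part_container", "flat_edge_with_handle"})
v_knife = to_binary({"sharp_edges", "has_cutting_edge", "flat_edge_with_handle"})

# Linear kernel K_v(v_i, v_j) = v_i^T v_j counts the shared attributes:
# here only flat_edge_with_handle is shared, so K_v = 1.
K = v_spoon @ v_knife
```

The kernel value is exactly the count of attributes the two objects have in common, which is what makes the linear kernel a natural similarity measure over this representation.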
6.3.3 Multimodal features from deep networks

A recent trend in Computer Vision and NLP is the learning of appropriate feature representations using so-called deep convolutional networks (CNNs) [156]. Such deep networks have also been used successfully by a number of authors for learning cross-modality representations between language and vision, e.g. [76, 123, 131, 208, 250, 252, 255]. The basic idea in earlier works [76, 208, 250, 255] is simple: train two different networks – one for vision and the other for language – and combine their derived representations for solving joint language and vision tasks (e.g. image-based retrieval using text, and vice versa). Recently, [131] replaced the deep language model with a log-bilinear model for generating sentences describing the image (unlike the simple tags used in prior works). Along similar lines, Karpathy et al. [123] demonstrated a deep bilinear model that links sentences to images by embedding fine-grained image patches corresponding to scene/object parts and their spatial relations, resulting in more meaningful generated sentences. Socher et al. [252] combined language features derived from dependency parse trees with image features from a deep network to learn a joint multimodal space capable of retrieving images and generating sentences based on the dependency parse.

These works show that learning appropriate multimodal representations remains challenging due to the large semantic gap noted in §6.3.1 and in Chapter 1. A key issue with these approaches is that joint learning between vision and language occurs only after features are derived from separate networks. This makes learning more difficult, since the two modalities may contain biases that are not recoverable in the final stage. To overcome this, we propose to learn features using deep networks in a more careful and supervised manner. Similar to the Deeply-Supervised Nets of Lee et al.
[157], where additional supervision of layers in the CNN is shown to be useful for improving certain visual tasks, we believe a similar supervision, where mid-level linguistic attributes [248] are paired with our mid-level visual representations, can guide the learning of a joint linguistic-visual feature in a principled manner. Along similar lines as the DCCA of Andrew et al. [5], we can then inject such jointly learned deep features into CCA to perform text- or sentence-based retrieval of images, or to generate image descriptions.

6.4 Final Conclusions and Outlook

Starting from the FGO problem, we have motivated the approaches presented in this thesis as a means to bring low-level visual signals, guided by Gestalt principles, to a richer and more meaningful mid-level representation. However, this is only one (the visual) side of the FGO problem, which is the focus of this thesis. In this chapter, we discussed the linguistic side of the same problem, and proposed to use grounded attributes as a means to simplify linguistic representations for learning a joint canonical space via CCA that links vision and language together. We have also discussed how multimodal features could be derived from deep learning methods, where we propose to inject mid-level linguistic features within the network to guide the learning of visual features (and vice versa).

We believe the results and approaches presented in this thesis have important implications in related fields beyond Computer Vision. In terms of visual psychology, we have confirmed that models of visual Gestalt make sense for tasks such as border ownership, contour-based recognition, detection of symmetries and segmentation of symmetrical parts, and extend to functionality detection as well.
For robotics, our computational algorithms using SRFs and dynamic programming are fast and approach real-time performance, which should be useful for mobile intelligent agents working in real environments containing clutter, occlusions and lighting changes. Finally, for NLP, we have proposed research into mid-level representations of language which would be useful for learning joint visual-language models.

Appendix A: Generalizing the image torque to other patterns

Following the notation used in §2.3.1.3, we write down the following iso-contour functions for: 1) radial $f_r(x, y)$, 2) spiral $f_s(x, y)$ and 3) hyperbolic $f_h(x, y)$ patterns:

\[ f_r(x, y) = \operatorname{atan}\!\left(\frac{y}{x}\right) \;\Rightarrow\; \nabla f_r(x, y) = \begin{bmatrix} \frac{y}{\sqrt{x^2+y^2}} \\ \frac{-x}{\sqrt{x^2+y^2}} \end{bmatrix} \]
\[ f_s(x, y) = x^2 - a y^2 \;\Rightarrow\; \nabla f_s(x, y) = \begin{bmatrix} x \\ -a y \end{bmatrix} \]
\[ f_h(x, y) = \ln\!\left(\sqrt{x^2+y^2}\right) - a \operatorname{atan}\!\left(\frac{y}{x}\right) \;\Rightarrow\; \nabla f_h(x, y) = \frac{1}{x^2+y^2} \begin{bmatrix} a x - y \\ x + a y \end{bmatrix} \tag{A.1} \]

which leads to the following expressions for the tangent vectors $g(x, y)$:

\[ g_r(x, y) = (x, y), \quad g_s(x, y) = (a x - y,\; x + a y), \quad g_h(x, y) = (a y,\; x) \tag{A.2} \]

for values of $a = \{\tfrac{1}{3}, 1, 3\}$. Substituting the corresponding $g(x, y)$ from eq. (A.2) into eq. (2.3) enables us to compute the alignment of the target pattern in the image.

Appendix B: Summary of Contour-Based Categorical Object Recognition Algorithm

Algorithm 1 summarizes the contour-based object recognition approach presented in Chapter 3. The inputs are an image edge map $I_e$ of size $W \times H$ and $C_{mo}$, the set of model codons for the target model. The output is the final modulated torque map $T^m_I$ that contains the modulated torque per patch, $\tau^\omega_P$, $P \in I_e$, at every image point.

Input: Image edge map $I_e$, model codon set $C_{mo} = \{M_1, \ldots, M_l\}$ and torque shape contexts per model codon: $(m^\angle_i, m^{sc}_i)$, $i \in M_l$, for the $i$-th edge point in $M_l$
Output: Modulated torque map $T^m_I$
Step 1: Compute the original torque map $T_I$:
for $(r, c) \leftarrow (1, 1)$ to $(H, W)$ do
    Compute eq.
    (3.2) for every patch $P_s(r, c)$ centered at $(r, c)$ over $I_e$, over all scales $s \in S$;
    $T_I(r, c) \leftarrow \max_{s \in S} \tau_{P_s(r,c)}$;
end
Step 2: Extract the top $|P_c|$ torque centers, $p_c \in P_c$, from $T_I$;
Step 3: Compute the modulated torque map $T^m_I$:
for $p_c \in P_c$ do
    Select contour fragments with torque contribution $> t_c$, $Q_{p_c}$;
    Group neighboring torque centers, $Q_{r_c}$, to form a larger set of $d$ test codons $C_g = \{R'_1, \ldots, R'_d\}$;
    Compute the torque shape context per test codon: $(g^\angle_i, g^{sc}_i)$, $i \in R'_d$, for the $i$-th edge point in $R'_d$;
    Get $O_g$ by computing the cross-correlation between angular bins of the torque shape contexts at the torque center (sec. 3.3.2.2);
    $O_g \leftarrow \{O_g, O_g + 90, O_g + 180, O_g + 270\}$;
    for $o_g \in O_g$ do
        for $R'_e \in C_g$ do
            Group neighboring $J$ test codons: $R'_{\{e\},\ldots,\{e+J\}}$;
            Unrotate all test codons using $o_g$;
            for $(a, b) \leftarrow (1, 1)$ to $(J, l)$ do
                Compute $V(a, b) = D_{\tau sc}(R'_{\{e\},\ldots,\{e+a\}}, M_b)$ via eq. (3.10);
            end
            $ED_{\tau sc}(e, o_g) \leftarrow \min_{J,l} V$;
        end
        $ED_{\tau sc}(o_g) \leftarrow \min_e ED_{\tau sc}(e, o_g)$;
    end
    $ED_{\tau sc} \leftarrow \min_{o_g} ED_{\tau sc}(o_g)$;
    for $e \leftarrow 1$ to $d$ do
        for $r_i \in R'_e$ do
            Convert $ED_{\tau sc}$ to weights $W_{D_{\tau sc}}(r_i)$ via eq. (3.11);
            Compute $\tau^\omega_{p_c r_i}$ via eq. (3.12);
        end
    end
    for $(r, c) \leftarrow (1, 1)$ to $(H, W)$ do
        Compute $\tau^\omega_{P_s(r,c)}$ via eq. (3.13) for every patch $P_s(r, c)$ centered at $(r, c)$ over $I_e$, over all scales $s \in S$;
        $T^m_I(r, c) \leftarrow \max_{s \in S} \tau^\omega_{P_s(r,c)}$;
    end
end
Algorithm 1: Pseudocode for the proposed approach

Appendix C: Simulating Log-polar Coordinates in Cartesian Coordinates

We demonstrate here that the distance between two neighboring points $(p, q)$ in log-polar coordinates is equal to $1/r$ times their distance in Cartesian coordinates.

Proof. We consider a log-polar system, where points have coordinates $(\rho, \theta) = (\ln r, \theta)$, with $r$ the radius from the fixation center (pole) to the point. Let $p$ and $q$ be two neighboring boundary points. Their coordinates are $(x_p, y_p)$ and $(x_q, y_q)$ in the Cartesian system, and $(\rho_p, \theta_p)$ and $(\rho_q, \theta_q)$ in the log-polar coordinate system, respectively. Let us denote

\[ dx = x_p - x_q, \quad dy = y_p - y_q, \quad d\rho = \rho_p - \rho_q, \quad d\theta = \theta_p - \theta_q. \]
The distance between the points $p$ and $q$ in the log-polar coordinate system amounts to

\[ D_{p,q} = \sqrt{d\rho^2 + d\theta^2} \tag{C.1} \]

The relationship between log-polar and Cartesian coordinates is:

\[ \rho = \ln\sqrt{x^2 + y^2}, \quad \theta = \arctan\!\left(\frac{y}{x}\right) \]

Thus, using the transformation

\[ \begin{bmatrix} d\rho \\ d\theta \end{bmatrix} = \begin{bmatrix} \frac{\partial \rho}{\partial x} & \frac{\partial \rho}{\partial y} \\ \frac{\partial \theta}{\partial x} & \frac{\partial \theta}{\partial y} \end{bmatrix} \begin{bmatrix} dx \\ dy \end{bmatrix} \tag{C.2} \]

we obtain

\[ D_{p,q} = \sqrt{d\rho^2 + d\theta^2} = \frac{1}{r}\sqrt{dx^2 + dy^2} \tag{C.3} \]

where $\sqrt{dx^2 + dy^2}$ is the Cartesian distance between the two points. Since in our implementation we use a 4-way MRF neighborhood, this distance is always 1. Invoking this result to weight the binary pairwise term, $V_{pq}$, in the energy minimization, we obtain that we have to weight $V_{pq}$ by $\frac{1}{r}$.

Finally, let us note that if we instead used a simple polar transform with coordinates $(r, \theta)$, the transformation from $(dx, dy)$ to $(dr, d\theta)$ would be $dr = \frac{x}{r}dx + \frac{y}{r}dy$ and $d\theta = -\frac{y}{r^2}dx + \frac{x}{r^2}dy$, following a similar derivation.

Appendix D: Separability of orientation from translation components

We provide a geometric argument to demonstrate the separability of the orientation $\theta_{l_c}$ from its centroid location $(x_{l_c}, y_{l_c})$ in the computation of the symmetry axis $l_c$.

Proof. Without loss of generality, we assume that $l_c$ separates two lines $l_1, l_2$, as illustrated in Fig. D.1. We parameterize lines in Hough space by the coordinates $r$ and $\theta$, which gives us $r_{l_1}, r_{l_2}, r_{l_c}$ (centroids) and $\theta_{l_1}, \theta_{l_2}, \theta_{l_c}$ (orientations). Since $l_c$ bisects $\angle APC$ between $l_1$ and $l_2$, it forms a pair of congruent triangles ($\triangle DPE \cong \triangle EPF$). From this observation, we can write down the first relationship between $\theta_{l_c}$ and $\theta_{l_1}, \theta_{l_2}$:

\[ \angle BOE = \angle BPC \quad (\text{since } \triangle BOE \sim \triangle BPC) = \angle AOD \quad (\text{since } \triangle AOD \sim \triangle APC) = \theta_{l_2} - \theta_{l_c} \]

Similarly, we have $\angle DOG = \theta_{l_c} - \theta_{l_1}$. Now, since $\angle DOG \cong \angle BOE$ because $l_c$ bisects $\angle AOG$, we have

\[ \theta_{l_c} - \theta_{l_1} = \theta_{l_2} - \theta_{l_c} \;\Longrightarrow\; \theta_{l_c} = \frac{\theta_{l_1} + \theta_{l_2}}{2} \tag{D.1} \]

Eq. (D.1) shows that $\theta_{l_c}$, the orientation of the symmetry axis $l_c$, is fully defined by the orientations of $l_1, l_2$ and is independent of its centroid $r_{l_c}$.
For this reason, the orientation of the symmetry axis can be recovered independently of its position. On the other hand, we show next that $r_{l_c}$ is a function of $\theta_{l_c}$. As $r_{l_1} = |OG|$, $r_{l_2} = |OC|$ and $r_{l_c} = |OE|$, this gives us

\[ r_{l_2} = |OF| \cos(\angle AOD), \quad r_{l_1} = |OD| \cos(\angle DOG) = |OD| \cos(\angle AOD) \]

Since $\triangle DPE \cong \triangle EPF$ implies $|DE| = |EF|$, we have

\[ |OF| = |OD| + 2|DE| = |OD| + 2(|OE| - |OD|) = 2|OE| - |OD| \;\Longrightarrow\; |OE| = \frac{|OF| + |OD|}{2} \]

Replacing the distances above with the centroids of the lines and substituting for the angle from eq. (D.1) yields

\[ r_{l_c} = \frac{r_{l_2} + r_{l_1}}{2 \cos(\angle AOD)} = \frac{r_{l_2} + r_{l_1}}{2 \cos(\theta_{l_2} - \theta_{l_c})} = \frac{r_{l_2} + r_{l_1}}{2 \cos\!\left(\frac{\theta_{l_2} - \theta_{l_1}}{2}\right)} \tag{D.2} \]

which completes the proof.

Figure D.1: Illustrating the separability conditions for the symmetry axis $l_c$ of two lines $l_1, l_2$. See text for details.

Appendix E: Bilateral symmetry detector: supplementary information

E.1 Implementation Details

Our approach has a worst-case run time given by the number of segments, G, and the number of final potential axes, J × L, considered per segment. This results in a run-time complexity of O(G × J × L). J and L are usually small, with typical values ranging from 5 to 8, while G can range from 5 to 50 depending on the complexity/size of the image. Reducing or limiting G will therefore decrease the computational time per test image. There are two possible ways to do this. The most direct method is to simply select the top G fixation points in Fsym. However, this approach may miss weak symmetries depending on G, and it is hard to determine a reasonable value of G beforehand. Instead, we choose a slightly different approach, based on the observation that many of the segments rm ∈ Rsym are actually very similar, with large amounts of overlap between them. The implication is that searching over all G segments is likely to return similar results.
To avoid this, we apply a simple filtering step that checks the region overlap between adjacent segments by computing the standard intersection-over-union score: $SO_{r_m} = \frac{|r_m \cap r_b|}{|r_m \cup r_b|}, \forall r_b \in R_{sym}$. We then retain the T segments remaining after discarding those with $SO_{r_m} > t_o$, where $t_o$ is an overlap threshold that can be set to a reasonably large value (e.g. 0.75 to 0.9) or learned from training data. This results in a reduced set of filtered segments, $|R_{sym}| = T \leq G$, and decreases the running time of the approach in practice. Typical values of T range from 5 to 10. As shown in the experimental results (sec. 4.6.2), the best performance is achieved when we add simple bounding-box regions to Rsym. For all the results reported here, we limit this addition to the top 5 fixation points, increasing the size of Rsym slightly: between 10 and 15.

The current implementation runs in Matlab. The mean running time for a 320×240 image using a dual Intel Xeon 2.9GHz CPU with 128GB of RAM is 8.79 ± 1.43s, of which 2.52 ± 0.07s is used for generating the putative fixation points and 6.27 ± 1.36s for the refinement step.

Two key parameters are learned, for each dataset, via a separate offline training procedure: 1) R(xt), the optimal support for the symmetry axis within each segment rm (sec. 4.3.2.2), and 2) to, the overlap threshold between segments in Rsym described earlier in this section. For each parameter, we search over a predefined range while holding the other parameter fixed at its default value. For R(xt), we search over 10 equally spaced discrete factors of the segment's width, Xrm, ranging over [0.1, 1.0] with a default of 0.5. For to, we search over [0.5, 0.95], with a default overlap factor of 0.75. The optimal parameter values that yield the best overall performance on the training set of images are then used during evaluation over the test set. All parameter values used and their running times are listed in the sections that follow.
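The overlap-filtering step above can be sketched as follows. This is our own illustration of the idea (the actual implementation operates on segmentation masks with the learned threshold t_o); here segments are modeled simply as sets of pixel coordinates:

```python
def iou(a, b):
    """Standard intersection-over-union of two pixel sets."""
    return len(a & b) / len(a | b)

def filter_segments(segments, t_o=0.75):
    """Greedily drop near-duplicate segments: a segment is kept only if
    its overlap with every previously kept segment is at most t_o."""
    kept = []
    for s in segments:
        if all(iou(s, k) <= t_o for k in kept):
            kept.append(s)
    return kept
```

Since each incoming segment is compared only against the kept set, a cluster of highly overlapping segments collapses to its first representative, which is what reduces the effective G to T in practice.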
Method subset, notation and description (parameters with values shown are the same over all datasets):
  Edge detection, gPb [6]: t_pb = 0.07 – edge detection threshold
  Symmetry attention: S = 4 – number of Gabor scales searched; O = 16 – number of orientations searched
  Fixation-based segmentation: σ = 3 – edge contrast pairwise penalty standard deviation; t_o – segment overlap clustering threshold
  Symmetry refinement: δ_e = 10 – dilation factor per segment; t_θ = 50% – modal orientation selection threshold; θ = π/180 – orientation search resolution; R(x_t) – optimal centroid search width (factor of the segment width X_rm)
  Symmetry axis scoring: δ_s = 20 – dilation factor over the edges of a segment, I_e(r_m); δ_h = 5 – search size of the scoring region in Hough space

Optimal learned parameters (t_o, R(x_t)) per dataset:
  PSU 2011: singles 0.75, 0.6; multiples 0.75, 0.2
  PSU 2013: singles 0.75, 0.6; multiples 0.75, 0.8
  UMD Symmetry: singles 0.9, 1.0; multiples 0.7, 0.8

Table E.1: (Left) Description of parameters. Those with values are the same over all datasets. (Right) Optimal parameters used per dataset.

E.2 Description of parameters and their values

E.2.1 Full approach [AttentionSymSegBB]

A brief description of the parameters used and their default values is summarized in Table E.1 (left). The values of the two learned optimal parameters, t_o and R(x_t), used by the full approach [AttentionSymSegBB] are listed in Table E.1 (right) for each of the datasets considered. For R(x_t), the value provided is a factor applied to X_rm, the width of the segment. Note that the other variants, [SymAttention], [RefinementOnly], [AttentionSymBB] and [AttentionSymSeg], are ablations of the full approach and use the same parameters listed here as well.
Parameter descriptions, default values and search ranges:
  t_s = 0.2, [0.1, 1.0] – scale ratio threshold for matching features of different scales
  t_a = 3, [1, 10] – angular threshold (degrees) for matching features with different orientations
  t_r = 3, [1, 10] – radial distance threshold (pixels) between matching features
  t_m = 1, [1, 10] – number of matches admitted per feature

Optimal parameters (t_s, t_a, t_r, t_m) per dataset:
  PSU 2011: singles 0.2, 3, 10, 1; multiples 0.6, 8, 10, 3
  PSU 2013: singles 0.2, 5, 3, 1; multiples 0.4, 5, 10, 10
  UMD Symmetry: singles 0.2, 3, 3, 9; multiples 0.7, 8, 8, 2

Table E.2: (Left) Description of parameters with their default values and parameter search ranges. (Right) Optimal parameters used per dataset.

E.2.2 Baseline [Loy-Eklundh]

We modify the original code¹ so that the four key parameters {t_s, t_a, t_r, t_m} described in Table E.2 (left) are tuned from their default values via an offline parameter search procedure: 1) we search one parameter at a time, while holding the remaining three parameters fixed at their default values; 2) for each parameter, we search over the ten discrete values listed in Table E.2 (left) and select the value that yields the highest Average Precision (AP) score over the training subset of each dataset; 3) we then compute the AP scores when the best parameters obtained in step 2) are combined, and select the final parameter combination with the top AP score. The final optimal parameters used per dataset are summarized in Table E.2 (right).

¹http://www.nada.kth.se/~gareth/homepage/local_site/code.htm

E.3 Symmetry complexity coding in the UMD Symmetry dataset

We name images via a systematic file-naming procedure, summarized in Table E.3, that encodes in 10 characters the attributes of the symmetry, including the complexity of the symmetry obtained from a set of 3 paid experts (taking the majority of their votes, with the authors as the tie-breaker).
POSITION                  MNEMONIC       DESCRIPTION
1 (symmetry type)         2 or 3         2: planar 2D symmetry; 3: non-planar 3D symmetry
2 (symmetry number)       S or M         S: Single symmetry; M: Multiple symmetries
3 (image type)            N or S         N: Natural image; S: Synthetic image
4 (symmetry complexity)   P, Q, C or N   P: Perfect; Q: Quasi (or approximate) symmetric; C: Corrupted with clutter; N: Not globally symmetric, but locally symmetric
5-10                                     Unique file number

Table E.3: Symmetry coding nomenclature.

E.4 Average Precision (AP) scores

Table E.4 lists the AP scores for the baseline [Loy-Eklundh] approach and all variants of the proposed approach explored in the experiments over the three datasets. For the symmetry complexity categories in the UMD Symmetry dataset, we report the AP scores of the baseline and the full approach [AttentionSymSegBB] in Table E.5.

Dataset        Type       [Loy-E]  [SymAtt]  [R-Only]  [AttSymLoyBB]  [AttSymBB]  [AttSymSeg]  [AttSymSegBB]
PSU 2011       singles    0.736    0.233     0.659     0.774          0.837       0.847        0.889
               multiples  0.525    0.183     -         0.576          0.673       0.493        0.727
PSU 2013       singles    0.741    0.209     0.631     0.832          0.738       0.899        0.890
               multiples  0.458    0.157     -         0.611          0.761       0.559        0.795
UMD Symmetry   singles    0.865    0.239     0.743     0.765          0.792       0.898        0.891
               multiples  0.321    0.219     -         0.569          0.471       0.432        0.667

Table E.4: Summary of AP scores comparing the baseline and all variants of the approach. Abbreviations used: [Loy-E]=[Loy-Eklundh], [R-Only]=[RefinementOnly], [Att*]=[Attention*]. '-' indicates no experiments were performed for this variant. Best performance per dataset (row) is highlighted in bold.

E.5 Running times per dataset

We summarize the run time performance of the full approach [AttentionSymSegBB] and the baseline [Loy-Eklundh] by the mean, max, min and standard deviation of running times per test image over all three datasets in Table E.6 and Table E.7, respectively. For [AttentionSymSegBB], we provide separate timings for the symmetry attention and refinement stages, in addition to the total processing time.
For the refinement stage, we used a parallelized implementation where the symmetry axes for different segments in Rsym are computed in parallel. We used the optimal parameters described in Sec. E.2 for each approach. For fairness, we timed the results using the same hardware setup (dual Intel Xeon 2.9 GHz CPUs with 128 GB RAM), repeated the timings three times per approach, and took the fastest results. Finally, all input images are resized to 320 × 240 and we exclude all I/O processing (reading in data, resizing and displaying) from the timings.

Dataset                    Complexity   [Loy-Eklundh]   [AttentionSymSegBB]
UMD Symmetry - singles     P            0.862           0.893
                           Q            0.802           0.890
                           C            0.718           0.905
                           N            0.860           0.967
UMD Symmetry - multiples   Q            0.439           0.711
                           C            0.268           0.820
                           N            0.287           0.540

Table E.5: Summary of AP scores comparing the baseline and the full approach over different symmetry complexity categories in the UMD Symmetry dataset. Best performance for each approach (column) per dataset type is highlighted in bold.

Dataset        Type       Runtimes (seconds): mean, max, min, std-dev
                          Sym Attention            Sym Refinement             Total
PSU 2011       singles    2.52, 2.79, 2.37, 0.06   5.84, 10.34, 3.18, 1.40    8.36, 13.13, 5.55, 1.47
               multiples  2.51, 3.15, 2.38, 0.09   6.53, 10.00, 3.34, 1.36    9.04, 13.15, 5.73, 1.45
PSU 2013       singles    2.53, 2.78, 2.43, 0.06   6.21, 9.43, 3.46, 1.31     8.75, 12.21, 5.89, 1.37
               multiples  2.53, 2.84, 2.39, 0.08   6.76, 10.43, 4.33, 1.40    9.29, 13.27, 6.72, 1.47
UMD Symmetry   singles    2.50, 2.85, 2.34, 0.06   6.13, 11.12, 3.34, 1.38    8.63, 13.98, 5.68, 1.44
               multiples  2.53, 2.86, 2.37, 0.07   6.15, 10.43, 3.50, 1.33    8.68, 13.29, 5.88, 1.40
Overall running times     2.52, 2.88, 2.38, 0.07   6.27, 10.29, 3.53, 1.36    8.79, 13.17, 5.91, 1.43

Table E.6: Running times for [AttentionSymSegBB] (full approach).
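The timing protocol above (per-image stage timings, three repeated passes with the fastest kept, I/O excluded) can be sketched as follows. The names `time_stage` and `stage_fn` are illustrative, not the actual implementation, and we assume the reported std-dev is the population standard deviation.

```python
# Sketch of the timing protocol: for each stage, time every test image,
# repeat the whole pass three times, and keep the fastest pass overall.
# Images are assumed to be read and resized to 320 x 240 beforehand, so
# I/O never enters the timed region.
import time
import statistics

def time_stage(stage_fn, images, repeats=3):
    """Return (mean, max, min, std-dev) of per-image runtimes in seconds."""
    best = None
    for _ in range(repeats):
        runtimes = []
        for img in images:
            t0 = time.perf_counter()
            stage_fn(img)  # e.g. symmetry attention or symmetry refinement
            runtimes.append(time.perf_counter() - t0)
        if best is None or sum(runtimes) < sum(best):
            best = runtimes  # keep the fastest of the repeated passes
    return (statistics.mean(best), max(best), min(best),
            statistics.pstdev(best))
```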
Dataset        Type       Runtimes (seconds): mean, max, min, std-dev
PSU 2011       singles    1.13, 3.94, 0.42, 0.69
               multiples  1.25, 7.07, 0.46, 0.91
PSU 2013       singles    1.19, 3.64, 0.45, 0.80
               multiples  1.27, 4.02, 0.52, 0.81
UMD Symmetry   singles    0.83, 2.47, 0.41, 0.45
               multiples  1.04, 3.55, 0.44, 0.57
Overall running times     1.12, 4.11, 0.45, 0.70

Table E.7: Running times for [Loy-Eklundh] (baseline).

Bibliography

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012. 181

[2] A. Aldoma, F. Tombari, and M. Vincze. Supervised learning of hidden and non-hidden 0-order affordances and detection in real scenes. Proc. IEEE Int’l Conf. on Robotics and Automation, pages 1732–1739, 2012. 174, 175

[3] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012. 1, 20

[4] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Contextually guided semantic labeling and search for three-dimensional point clouds. The International Journal of Robotics Research, 32(1):19–34, 2013. 12

[5] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, 2013. 194, 209, 210

[6] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011. 17, 37, 38, 65, 119, 137, 157, 158, 222

[7] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2294–2301, 2009. 154, 155

[8] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proc. IEEE Conf.
on Computer Vision and Pattern Recognition, pages 328–335. IEEE, 2014. 2, 8

[9] F. Barranco, C. L. Teo, C. Fermüller, and Y. Aloimonos. Contour detection and characterization for asynchronous event sensors. In Proc. Int’l Conf. on Computer Vision, Accepted, 2015. 44, 46

[10] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509–522, Apr. 2002. 57, 74, 83

[11] G. Ben-Yosef and O. Ben-Shahar. A tangent bundle theory for visual curve completion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(7):1263–1280, 2012. 7

[12] R. Benosman, C. Clercq, X. Lagorce, S.-H. Ieng, and C. Bartolozzi. Event-based visual flow. IEEE Trans. on Neural Networks and Learning Systems, 25(2):407–417, 2014. 47

[13] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who’s in the picture? In Advances in Neural Information Processing Systems, 2004. 12

[14] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In Proc. European Conference on Computer Vision, pages 663–676. Springer, 2010. 196

[15] G. Bertasius, J. Shi, and L. Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 4380–4389, June 2015. 65, 170

[16] G. Bertasius, J. Shi, and L. Torresani. High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. arXiv preprint arXiv:1504.06201, 2015. 65

[17] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987. 195

[18] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. 208

[19] H. Blum. A transformation for extracting new descriptors of shape.
In Models for the Perception of Speech and Visual Form, pages 362–380. MIT Press, Cambridge, 1967. 113

[20] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1729–1736. IEEE, 2011. 66, 67

[21] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for rgb-d based object recognition. Int’l Symp. on Experimental Robotics, 2012. 67, 104, 105, 174, 181

[22] J. Bohg and D. Kragic. Grasping familiar objects using shape context. Int. Conf. on Advanced Robotics, pages 1–6, 2009. 173

[23] A. Bosch, A. Zisserman, and X. Muñoz. Scene classification using a hybrid generative/discriminative approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(4):712–727, 2008. 9

[24] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In Proc. European Conf. on Computer Vision, pages 168–181. Springer, 2010. 65

[25] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2559–2566. IEEE, 2010. 4

[26] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern recognition, 37(9):1757–1771, 2004. 9

[27] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001. 15, 119

[28] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. Proc. Int’l Conf. on Computer Vision, 1:105–112, 2001. 125, 139

[29] S. Branavan, L. S. Zettlemoyer, and R. Barzilay. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1268–1277. Association for Computational Linguistics, 2010.
192

[30] A. S. Bregman. Asking the ‘what for’ question in auditory perception. Perceptual organization, pages 99–118, 1981. 5, 6

[31] X. Bresson, P. Vandergheynst, and J. Thiran. A priori information in image segmentation: energy functional based on shape statistical model and image information. In Proc. Int’l Conf. on Image Processing, volume 3, pages III–425. IEEE, 2003. 7

[32] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. on Communications, 31(4):532–540, 1983. 8

[33] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In Proc. Int’l Conf. on Computer Vision, pages 1–8. IEEE, 2007. 10

[34] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 3286–3293, 2014. 17, 20

[35] M. Chertok and Y. Keller. Spectral symmetry analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 32(7):1227–1238, 2010. 114

[36] J. D. Choi and M. Palmer. Robust constituent-to-dependency conversion for English. In Proceedings of the 9th International Workshop on Treebanks and Linguistic Theories, pages 55–66, Tartu, Estonia, 2010. 202

[37] P. M. Claessens and J. Wagemans. A Bayesian framework for cue integration in multistable grouping: Proximity, collinearity, and orientation priors in zigzag lattices. Journal of Vision, 8(7):33, 2008. 6

[38] J. Clarke, D. Goldwasser, M.-W. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 18–27. Association for Computational Linguistics, 2010. 192

[39] R. W. Conners and C. T. Ng. Developing a quantitative model of human preattentive vision. IEEE Trans. on Systems, Man and Cybernetics, 19(6):1384–1407, 1989. 113

[40] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J.
Graham. Active shape models-their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995. 7

[41] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995. 65

[42] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1124–1131, June 2005. 8

[43] E. Craft, H. Schütze, E. Niebur, and R. von der Heydt. A neural model of figure-ground organization. J. Neurophysiology, 97(6):4310–4326, 2007. 3, 6, 17, 42

[44] D. Cremers, F. R. Schmidt, and F. Barthel. Shape priors in variational image segmentation: Convexity, Lipschitz continuity and globally optimal solutions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–6. IEEE, 2008. 8

[45] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, February 2013. 28

[46] F. C. Crow. Summed-area tables for texture mapping. SIGGRAPH Computer Graphics, 18(3):207–212, 1984. 73, 122

[47] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 886–893, 2005. 22, 176

[48] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In Proc. European Conf. on Computer Vision, pages 71–84. Springer, 2010. 171

[49] M. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976. 171, 176

[50] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014. 22

[51] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2015. 21, 28, 30, 37, 38, 60, 135, 137, 179

[52] N. Dorfman, D. Harari, and S. Ullman. Learning to perceive coherent objects.
In CogSci, pages 394–399, 2013. 22, 24

[53] J. Driver and G. C. Baylis. Preserved figure-ground segregation and symmetry perception in visual neglect. Nature, 360(6399):73–75, 1992. 112

[54] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3481. IEEE, 2012. 197

[55] L.-Y. Duan, M. Xu, T.-S. Chua, Q. Tian, and C.-S. Xu. A mid-level representation framework for semantic sports video analysis. In Proceedings of the eleventh ACM international conference on Multimedia, pages 33–44. ACM, 2003. 4

[56] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. European Conference on Computer Vision, volume 2353, pages 97–112. Springer, 2002. 11

[57] T. Egner, J. M. Monti, E. H. Trittschuh, C. A. Wieneke, J. Hirsch, and M.-M. Mesulam. Neural integration of top-down spatial and feature-based information in visual search. The Journal of neuroscience, 28(24):6141–6151, 2008. 3

[58] M. Eimer, M. Kiss, and S. Nicholas. What top-down task sets do for us: An ERP study on the benefits of advance preparation in visual search. Journal of Experimental Psychology: Human Perception and Performance, 37(6):1758, 2011. 3

[59] J. H. Elder, A. Krupnik, and L. A. Johnston. Contour grouping with prior models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(6):661–674, 2003. 7

[60] J. H. Elder and S. W. Zucker. Computing contour closure. In Proc. European Conf. on Computer Vision, pages 399–412. Springer, 1996. 7

[61] I. Endres and D. Hoiem. Category-independent object proposals with diverse ranking. IEEE Trans. on Pattern Analysis and Machine Intelligence, 36(2):222–234, 2014. 20

[62] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. Int’l J.
of Computer Vision, 88(2):303–338, 2010. 2

[63] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2352–2359. IEEE, 2010. 196

[64] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009. 196

[65] A. Farhadi, S. M. M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth. In Proc. European Conference on Computer Vision, pages 15–29. Springer, 2010. 13

[66] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 4

[67] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006. 157

[68] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 524–531. IEEE, 2005. 9

[69] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. 195

[70] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005. 195

[71] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. International Journal of Computer Vision, 87(3):284–303, 2010. 102

[72] V. Ferrari, T. Tuytelaars, and L. Van Gool. Object detection by contour segment networks. In Proc. European Conf. on Computer Vision, pages 14–28. Springer, 2006. 64

[73] V. Ferrari and A. Zisserman. Learning visual attributes.
In Advances in Neural Information Processing Systems, 2007. 174

[74] S. Fidler and A. Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 195

[75] D. J. Field, A. Hayes, and R. F. Hess. Contour integration by the human visual system: Evidence for a local “association field”. Vision research, 33(2):173–193, 1993. 6

[76] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013. 209

[77] H. Fu, X. Cao, Z. Tu, and D. Lin. Symmetry constraint for foreground extraction. IEEE Trans. on Systems, Man and Cybernetics, 44(5):644–654, 2014. 116, 120

[78] J. L. Gallant, C. E. Connor, S. Rakshit, J. W. Lewis, and D. C. Van Essen. Neural responses to polar, hyperbolic, and cartesian gratings in area v4 of the macaque monkey. J. Neurophysiology, 76(4):2718–2739, 1996. 26

[79] S. Geman and C. Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, volume 1, page 2, 1986. 7

[80] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine learning, 63(1):3–42, 2006. 30

[81] T. Ghose and S. E. Palmer. Extremal edges versus other principles of figure-ground organization. Journal of Vision, 10(8):3, 2010. 22, 24

[82] J. J. Gibson. The theory of affordances. Hilldale, USA, 1977. 15

[83] J. J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979. 170, 189

[84] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2012. 11

[85] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2013.
194, 208, 209

[86] G. González, F. Aguet, F. Fleuret, M. Unser, and P. Fua. Steerable features for statistical 3d dendrite detection. In Proc. Int’l Conf. on Medical Image Computing and Computer Assisted Intervention, pages 625–632. 2009. 115, 117

[87] J. C. Gower. Generalized procrustes analysis. Psychometrika, 40(1):33–51, 1975. 68

[88] H. Grabner, J. Gall, and L. Van Gool. What makes a chair a chair? Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1529–1536, 2011. 173

[89] D. Graff. English gigaword. In Linguistic Data Consortium, Philadelphia, PA, 2003. 202

[90] A. S. Greenberg, M. Esterman, D. Wilson, J. T. Serences, and S. Yantis. Control of spatial and feature-based attention in frontoparietal cortex. The Journal of Neuroscience, 30(43):14330–14339, 2010. 3

[91] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In Proc. European Conference on Computer Vision, pages 16–29. Springer, 2008. 11

[92] A. Gupta, S. Satkin, A. Efros, and M. Hebert. From 3d scene geometry to human workspace. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1961–1968, June 2011. 11

[93] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from rgb-d images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 564–571, 2013. 2, 9, 17, 33

[94] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced computer vision with microsoft kinect sensor: A review. 2013. 66

[95] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004. 194, 203, 206

[96] J. S. Hare, P. H. Lewis, P. G. Enser, and C. J. Sandom. Mind the gap: another look at the problem of the semantic gap in image retrieval. In Electronic Imaging 2006, pages 607309–607309. International Society for Optics and Photonics, 2006. 3

[97] B.
Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Proc. Int’l Conf. on Computer Vision, pages 991–998, Nov 2011. 65

[98] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Proc. European Conf. on Computer Vision, pages 297–312, 2014. 174

[99] G. Hatfield and W. Epstein. The status of the minimum principle in the theoretical analysis of visual perception. Psychological Bulletin, 97(2):155, 1985. 5

[100] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages II–695. IEEE. 8

[101] V. Hedau, D. Hoiem, and D. Forsyth. Recovering free space of indoor scenes from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2807–2814. IEEE, 2012. 11

[102] F. Heitger, L. Rosenthaler, R. Von Der Heydt, E. Peterhans, and O. Kübler. Simulation of neural contour mechanisms: from simple to end-stopped cells. Vision research, 32(5):963–981, 1992. 6, 45

[103] F. Heitger, R. von der Heydt, E. Peterhans, L. Rosenthaler, and O. Kübler. Simulation of neural contour mechanisms: representing anomalous contours. Image and Vision Computing, 16(6):407–421, 1998. 5

[104] R. D. Henkel. Segmentation in scale space. In Computer Analysis of Images and Patterns, pages 41–48. Springer, 1995. 8

[105] T. Hermans, F. Li, J. M. Rehg, and A. F. Bobick. Learning contact locations for pushing and orienting unknown objects. In Proc. IEEE Int’l Conf. on Humanoid Robots, 2013. 174

[106] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient response maps for real-time detection of textureless objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):876–888, 2012. 65, 95, 96, 97

[107] T. K. Ho. Random decision forests. In Proc. IEEE Int’l Conf.
on Document Analysis and Recognition, volume 1, pages 278–282, 1995. 28

[108] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999. 208

[109] D. Hoiem, A. A. Efros, and M. Hebert. Recovering occlusion boundaries from an image. Int’l J. of Computer Vision, 91(3):328–346, 2011. 1, 20, 21

[110] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936. 194, 205

[111] E. Hsiao and M. Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012. 65, 88, 95, 96, 97

[112] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. 171

[113] Q. Huang, B. Dam, D. Steele, J. Ashley, and W. Niblack. Foreground/background segmentation of color images by integration of multiple cues. In Proc. Int’l Conf. on Image Processing, volume 1, pages 246–249, Oct 1995. 2

[114] X. Huang and L. Zhang. Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines. Int’l J. Remote Sensing, 30(8):1977–1987, 2009. 115

[115] P. S. Huggins, H. F. Chen, P. N. Belhumeur, and S. W. Zucker. Finding folds: On the appearance and identification of occlusion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 718–725, 2001. 24

[116] H. Izadinia, I. Saleemi, and M. Shah. Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE transactions on multimedia, 15(2):378–390, 2013. 194

[117] J. F. Jehee, V. A. Lamme, and P. R. Roelfsema. Boundary assignment in a recurrent network architecture. Vision research, 47(9):1153–1165, 2007. 6

[118] H.
Jiang and S. Yu. Linear solution to scale and rotation invariant object matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2474–2481. IEEE, 2009. 66

[119] L. Jie, B. Caputo, and V. Ferrari. Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems. NIPS, December 2009. 12

[120] G. Kanizsa. Subjective contours. Scientific American, 234(4):48–52, 1976. 5

[121] G. Kanizsa and W. Gerbino. Convexity and symmetry in figure-ground organization. Vision and artifact, pages 25–32, 1976. 18

[122] A. Karpathy and L. Fei-Fei. Real time detection and segmentation of reflectionally symmetric objects in digital images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, 2015. 13, 170

[123] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897. 2014. 209, 210

[124] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Int’l J. of Computer Vision, 1(4):321–331, 1988. 7

[125] R. J. Kate and R. J. Mooney. Learning language semantics from ambiguous supervision. In Proc. National Conference on Artificial Intelligence, volume 7, pages 895–900, 2007. 192

[126] C. C. Kemp and A. Edsinger. Robot manipulation of human tools: Autonomous detection and control of task relevant features. In Proc. Intl. Conf. on Development and Learning, 2006. 173

[127] R. Kennedy, J. Gallier, and J. Shi. Contour cut: identifying salient contours in images by solving a hermitian eigenvalue problem. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2065–2072. IEEE, 2011. 61

[128] J. Kim and R. J. Mooney. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 433–444. Association for Computational Linguistics, 2012. 193

[129] T.-K. Kim, S.-F. Wong, and R. Cipolla. Tensor canonical correlation analysis for action classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 194

[130] R. Kimchi, M. Behrmann, and C. R. Olson. Perceptual organization in vision: Behavioral and neural perspectives. Psychology Press, 2003. 5

[131] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In Proceedings of the International Conference on Machine Learning, pages 595–603. JMLR Workshop and Conference Proceedings, 2014. 209, 210

[132] H. Kjellström, J. Romero, and D. Kragić. Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115(1):81–90, 2011. 173

[133] J. J. Koenderink and A. J. Van Doorn. The singularities of the visual mapping. Biological cybernetics, 24(1):51–59, 1976. 19

[134] J. J. Koenderink and A. J. van Doorn. Surface shape and curvature scales. Image and Vision Computing, 10(8):557–564, 1992. 171, 177

[135] K. Koffka. Principles of Gestalt Psychology. Harcourt, Brace, 1935. 4, 170

[136] N. Kogo, C. Strecha, L. Van Gool, and J. Wagemans. Surface construction by a 2-d differentiation–integration process: A neurocomputational model for perceived border ownership, depth, and lightness in kanizsa figures. Psychological review, 117(2):406, 2010. 6

[137] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009. 8

[138] I. Kokkinos, R. Deriche, P. Maragos, and O. Faugeras. A biologically motivated and computationally tractable model of low and mid-level vision tasks. In Proc. European Conf. on Computer Vision, pages 506–517.
Springer, 2004. 4

[139] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004. 139, 142

[140] P. Kontschieder, S. R. Bulo, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In Proc. Int’l Conf. on Computer Vision, pages 2190–2197, 2011. 14, 28, 111, 173, 174

[141] P. Kontschieder, H. Riemenschneider, M. Donoser, and H. Bischof. Discriminative learning of contour fragments for object detection. In Proceedings of the British Machine Vision Conference, pages 4.1–4.12. BMVA Press, 2011. 64

[142] G. Kootstra, N. Bergstrom, and D. Kragic. Using symmetry to select fixation points for segmentation. Proc. Int’l Conf. on Pattern Recognition, pages 3894–3897, 2010. 114, 116, 120

[143] H. S. Koppula, R. Gupta, and A. Saxena. Learning human activities and object affordances from rgb-d videos. Int’l J. of Robotics Research, 32(8):951–970, 2013. 174

[144] P. Kovesi. Symmetry and asymmetry from local phase. In Tenth Australian Joint Conf. on Artificial Intelligence, volume 190, 1997. 115

[145] J. Krishnamurthy and T. Kollar. Jointly learning to parse and perceive: Connecting natural language to the physical world. In Transactions of the Association for Computational Linguistics, 2013. 193

[146] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott. Deep hierarchies in the primate visual cortex: What can we learn for computer vision? IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(8):1847–1871, 2013. 14, 17, 45

[147] M. Kubovy and J. Wagemans. Grouping by proximity and multistability in dot lattices: A quantitative gestalt theory. Psychological Science, 6(4):225–234, 1995. 6

[148] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, pages 1601–1608. IEEE, 2011. 12

[149] M. Kumar, P. Torr, and A. Zisserman. Obj cut. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 18–25, June 2005. 10

[150] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Graph cut based inference with co-occurrence statistics. In Proc. European Conf. on Computer Vision, pages 239–253. Springer, 2010. 10

[151] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. 7

[152] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 951–958, 2009. 174, 195

[153] L. J. Latecki and R. Lakamper. Shape similarity measure based on correspondence of visual parts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10):1185–1190, 2000. 8

[154] L. J. Latecki, C. Lu, M. Sobel, and X. Bai. Multiscale random fields with application to contour grouping. In Advances in Neural Information Processing Systems, pages 913–920. 2009. 8

[155] M. W. Law and A. C. Chung. Three dimensional curvilinear structure detection using optimally oriented flux. In Proc. European Conf. on Computer Vision, pages 368–382. 2008. 115, 117

[156] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 209

[157] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2015. 210

[158] S. Lee and Y. Liu. Curved glide-reflection symmetry detection. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 34(2):266–278, Feb 2012. 114
[159] T. S. H. Lee, S. Fidler, and S. Dickinson. Detecting curved symmetric parts using a deformable disc model. In Proc. Int’l Conf. on Computer Vision, pages 1753–1760, 2013. 112, 116, 118, 119, 155, 164, 165, 166, 167
[160] Y. J. Lee and K. Grauman. Object-graphs for context-aware visual category discovery. IEEE Transactions on Pattern Analysis and Machine Intelligence. 12
[161] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 2, page 7, 2004. 64, 67
[162] I. Leichter and M. Lindenbaum. Boundary ownership by lifting to 2.1D. In Proc. Int’l Conf. on Computer Vision, pages 9–16, 2009. 17, 18, 20, 21, 22, 32, 33, 36, 50
[163] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. Int’l J. of Robotics Research, 2014. 173, 175, 180, 181, 183, 185
[164] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from pairwise interactions of simple features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 64
[165] A. Levin and Y. Weiss. Learning to combine bottom-up and top-down segmentation. In Proc. European Conf. on Computer Vision, pages 581–594. Springer, 2006. 10
[166] A. Levinshtein, S. Dickinson, and C. Sminchisescu. Multiscale symmetric part detection and grouping. Proc. Int’l Conf. on Computer Vision, pages 2162–2169, 2009. 116, 119
[167] A. Levinshtein, C. Sminchisescu, and S. Dickinson. Optimal Contour Closure. International Journal of Computer Vision, 2012. 7
[168] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2036–2043. IEEE, 2009. 10
[169] W. H. Li and L. Kleeman.
Real time object tracking using reflectional symmetry and motion. Proc. IEEE Int’l Conf. on Robotics and Automation, pages 2798–2803, 2006. 114
[170] W. H. Li, A. M. Zhang, and L. Kleeman. Real time detection and segmentation of reflectionally symmetric objects in digital images. Proc. IEEE Int’l Conf. on Robotics and Automation, pages 4867–4873, 2006. 116
[171] W. Lian and L. Zhang. Rotation invariant non-rigid shape matching in cluttered scenes. In Proc. European Conference on Computer Vision, pages 506–518. Springer, 2010. 66
[172] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013. 192
[173] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, Feb 2008. 14, 39, 44
[174] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 3158–3165, 2013. 24, 60
[175] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. on Information Theory, 37(1):145–151, 1991. 128
[176] T. Lindeberg. Scale-space theory in computer vision. Springer Science & Business Media, 1993. 8
[177] T. Lindeberg. Edge detection and ridge detection with automatic scale selection. Int’l J. of Computer Vision, 30(2):117–156, 1998. 113
[178] H. Ling and K. Okada. An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(5):840–853, 2007. 118, 128, 134
[179] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1996–2003, June 2009. 171
[180] J. Liu, G. Slota, G. Zheng, Z. Wu, M. Park, S. Lee, I. Rauschert, and Y. Liu.
Symmetry detection from real world images competition 2013: Summary and results. Proc. IEEE Conf. on Computer Vision and Pattern Recognition Workshops, pages 200–205, June 2013. 144, 145
[181] Y. Liu, H. Hel-Or, C. S. Kaplan, and L. Van Gool. Computational symmetry in computer vision and computer graphics, volume 5. 2009. 113
[182] P. Locher and C. Nodine. The perceptual value of symmetry. Computers & Mathematics with Applications, 17(4):475–484, 1989. 113
[183] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int’l J. of Computer Vision, 60(2):91–110, 2004. 114
[184] G. Loy and J.-O. Eklundh. Detecting symmetry and symmetric constellations of features. Proc. European Conf. on Computer Vision, pages 508–521, 2006. 111, 114, 118, 126, 131, 132, 144, 148, 160, 162
[185] C. Lu, L. J. Latecki, N. Adluru, X. Yang, and H. Ling. Shape guided contour grouping with particle filters. In Proc. International Conference on Computer Vision, pages 2288–2295. IEEE, 2009. 65
[186] Y. Lu, L. Zhang, Q. Tian, and W.-Y. Ma. What are the high-level concepts with small semantic gaps? In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 3
[187] T. Ma and L. J. Latecki. From partial shape matching through local deformation to robust global shape similarity for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1441–1448. IEEE, 2011. 63, 86, 99, 100, 102
[188] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 64
[189] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 40–47. IEEE, 2009. 208
[190] S. Maji and J. Malik. Object detection using a max-margin Hough transform. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, pages 1038–1045. IEEE, 2009. 65, 99, 100, 102
[191] R. Margolin, L. Zelnik-Manor, and A. Tal. How to evaluate foreground maps? In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 248–255, 2014. 182
[192] D. Marr. Early processing of visual information. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 275(942):483–519, 1976. 170, 189
[193] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004. 2, 17, 18, 20, 32, 50, 60, 89, 134, 154
[194] A. Martinez, L. Anllo-Vento, M. I. Sereno, L. R. Frank, R. B. Buxton, D. Dubowitz, E. C. Wong, H. Hinrichs, H. J. Heinze, and S. A. Hillyard. Involvement of striate and extrastriate visual cortical areas in spatial attention. Nature Neuroscience, 2(4):364–369, 1999. 3
[195] C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox. A joint model of language and perception for grounded attribute learning. In Proceedings of the International Conference on Machine Learning, pages 1671–1678, 2012. 193
[196] C. Matuszek, D. Fox, and K. Koscher. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pages 251–258. IEEE Press, 2010. 193
[197] H. Meng, D. R. Hardoon, J. Shawe-Taylor, and S. Szedmak. Generic object recognition by combining distinct features in machine learning. In Electronic Imaging 2005, pages 90–98. International Society for Optics and Photonics, 2005. 194
[198] Y. Ming, H. Li, and X. He. Connected contours: A new contour completion model that respects the closure effect. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 829–836. IEEE, 2012. 61
[199] A. Mishra, Y. Aloimonos, and C. Fermüller. Active segmentation for robotics.
Proc. IEEE Int’l Conf. on Robotics and Automation, pages 3133–3139, 2009. 40, 42, 116, 120, 125
[200] R. Mohan and R. Nevatia. Perceptual organization for scene segmentation and description. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(6):616–635, 1992. 6
[201] R. J. Mooney and R. C. Bunescu. Subsequence kernels for relation extraction. In Advances in Neural Information Processing Systems, pages 171–178, 2005. 208
[202] V. Movahedi and J. Elder. Combining local and global cues for closed contour extraction. In Proceedings of the British Machine Vision Conference. BMVA Press, 2013. 7
[203] K. Murphy, A. Torralba, W. Freeman, et al. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems, 16:1499–1506, 2003. 9
[204] A. Myers, A. Kanazawa, C. Fermüller, and Y. Aloimonos. Affordance of object parts from geometric features. In Proc. of Robotics: Science and Systems RGB-D Workshop, 2014. 173, 180, 181, 186
[205] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos. Affordance detection of tool parts from geometric features. In Proc. IEEE Int’l Conf. on Robotics and Automation, 2015. 171
[206] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision, pages 746–760, 2012. 17, 18, 32
[207] D. Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353–383, 1977. 5
[208] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the International Conference on Machine Learning, pages 689–696, 2011. 209
[209] M. Nishigaki, C. Fermüller, and D. DeMenthon. The image torque operator: A new tool for mid-level vision. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 502–509, 2012. 4, 14, 20, 26, 39, 41, 57, 70, 73, 123
[210] A. Oliva and A. Torralba.
Modeling the shape of the scene: A holistic representation of the spatial envelope. Int’l J. of Computer Vision, 42(3):145–175, 2001. 9
[211] B. Ommer and J. Malik. Multi-scale object detection by clustering lines. In Proc. International Conference on Computer Vision, pages 484–491. IEEE, 2009. 65
[212] A. Opelt, A. Pinz, and A. Zisserman. A boundary-fragment-model for object detection. In Proc. European Conference on Computer Vision, pages 575–588. Springer, 2006. 64
[213] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988. 7
[214] D. Osorio. Symmetry detection by categorization of spatial phase, a model. Proc. of the Royal Society of London. Series B: Biological Sciences, 263(1366):105–110, 1996. 115
[215] S. E. Palmer. Hierarchical structure in perceptual representation. Cognitive Psychology, 9(4):441–474, 1977. 5
[216] S. E. Palmer. Vision Science: Photons to phenomenology, volume 1. MIT Press, Cambridge, MA, 1999. 22
[217] S. E. Palmer and J. L. Brooks. Edge-region grouping in figure-ground organization and depth perception. Journal of Experimental Psychology: Human Perception and Performance, 34(6):1353–1371, 2008. 25
[218] S. E. Palmer and T. Ghose. Extremal edge: A powerful cue to depth perception and figure-ground organization. Psychological Science, 19(1):77–83, 2008. 18, 22
[219] P. Parent and S. W. Zucker. Trace inference, curvature consistency, and curve detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(8):823–839, 1989. 6
[220] D. Parikh and K. Grauman. Interactively building a discriminative vocabulary of nameable attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1681–1688. IEEE, 2011. 196
[221] D. Parikh and K. Grauman. Relative attributes. Proc. Int’l Conf. on Computer Vision, pages 503–510, 2011. 174, 196
[222] G. Patterson and J. Hays.
SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758, 2012. 9, 196, 199
[223] E. Peterhans and R. von der Heydt. Mechanisms of contour perception in monkey visual cortex. II. Contours bridging gaps. The Journal of Neuroscience, 9(5):1749–1763, 1989. 5
[224] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL/COLING), pages 433–440, 2006. 202
[225] J. R. Pomerantz and M. Kubovy. Theoretical approaches to perceptual organization: Simplicity and likelihood principles. Organization, 36:3, 1986. 5
[226] F. T. Qiu, T. Sugihara, and R. von der Heydt. Figure-ground mechanisms provide structure for selective attention. Nature Neuroscience, 10(11):1492–1499, 2007. 6
[227] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, volume 3, page 5, 2009. 103
[228] S. Ramenahalli, S. Mihalas, and E. Niebur. Extremal edges: Evidence in natural images. In Conf. on Information Sciences and Systems, pages 1–5, 2011. 19, 24, 35
[229] S. Ravishankar, A. Jain, and A. Mittal. Multi-stage contour based detection of deformable objects. In Proc. European Conference on Computer Vision, pages 483–496. Springer, 2008. 64
[230] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context-free attentional operators: the generalized symmetry transform. Int’l J. of Computer Vision, 14(2):119–130, 1995. 114, 126
[231] X. Ren, C. C. Fowlkes, and J. Malik. Figure/ground assignment in natural images. In Proc. European Conf. on Computer Vision, pages 614–627. 2006. 17, 18, 20, 21, 22, 24, 32, 33, 36
[232] W. Richards. Marr, Gibson, and Gestalt: a challenge.
Perception, 41(9):1024, 2012. 170
[233] W. Richards and D. D. Hoffman. Codon constraints on closed 2D shapes. Computer Vision, Graphics, and Image Processing, 31(3):265–281, 1985. 68
[234] H. Riemenschneider, M. Donoser, and H. Bischof. Using partial edge contour matches for efficient object category localization. In Proc. European Conference on Computer Vision, pages 29–42. Springer, 2010. 63, 64, 86, 102
[235] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999. 195
[236] T. Riklin-Raviv, N. Kiryati, and N. Sochen. Segmentation by level sets and symmetry. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1015–1022, 2006. 116, 119
[237] L. Roberts. Machine perception of 3-D solids. PhD thesis, MIT, 1965. 20
[238] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics, 23(3):309–314, 2004. 17, 139
[239] M. Rousson and N. Paragios. Shape priors for level set representations. In Proc. European Conf. on Computer Vision, pages 78–92. Springer, 2002. 8
[240] O. Russakovsky and L. Fei-Fei. Attribute learning in large-scale datasets. In Trends and Topics in Computer Vision, pages 1–14. Springer, 2012. 196
[241] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In Proc. IEEE Int’l Conf. on Robotics and Automation, pages 1848–1853, 2009. 66
[242] E. Sali and S. Ullman. Combining class-specific fragments for object classification. In Proceedings of the British Machine Vision Conference, 1999. 10
[243] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. Int’l J. of Robotics Research, 27(2):157–173, 2008. 173, 183
[244] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009. 20
[245] J.
Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. 134
[246] J. Shotton, A. Blake, and R. Cipolla. Multiscale categorical object recognition using contour fragments. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(7):1270–1281, July 2008. 8, 64
[247] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(2):300–312, 2007. 9
[248] C. Silberer, V. Ferrari, and M. Lapata. Models of Semantic Representation with Visual Attributes, pages 572–582. Association for Computational Linguistics, 2013. 196, 210
[249] A. Sironi, V. Lepetit, and P. Fua. Multiscale centerline detection by learning a scale-space distance transform. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2697–2704, 2014. 112, 113, 115, 117, 118, 155, 156, 157, 164, 165
[250] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943. 2013. 209
[251] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems, 2012. 174
[252] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218, 2014. 209, 210
[253] R. Socher, C. C. Lin, A. Ng, and C. Manning. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the International Conference on Machine Learning, pages 129–136, 2011. 13
[254] P. Srinivasan, Q. Zhu, and J. Shi. Many-to-one contour matching for describing and discriminating object shape. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, pages 1673–1680. IEEE, 2010. 64, 99, 100, 102
[255] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012. 209
[256] J. S. Stahl and S. Wang. Edge grouping combining boundary and region information. IEEE Trans. on Image Processing, 16(10):2590–2606, 2007. 7
[257] J. S. Stahl and S. Wang. Globally optimal grouping for symmetric closed boundaries by combining boundary and region information. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(3):395–411, 2008. 6, 7
[258] L. Stark and K. Bowyer. Function-based generic recognition for multiple object categories. CVGIP: Image Understanding, 59(1):1–21, 1994. 173
[259] M. Stark, P. Lies, M. Zillich, J. Wyatt, and B. Schiele. Functional object class detection based on learned affordance cues. In Computer Vision Systems, pages 435–444. Springer, 2008. 173
[260] M. Steedman. The syntactic process. The MIT Press, 2001. 192
[261] A. N. Stein and M. Hebert. Occlusion boundaries from motion: Low-level detection and mid-level reasoning. Int’l J. of Computer Vision, 82(3):325–357, 2009. 20, 21
[262] T. Sugihara, F. T. Qiu, and R. von der Heydt. The speed of context integration in the visual cortex. Journal of Neurophysiology, 106(1):374–385, 2011. 6, 16
[263] C. Sun and D. Si. Fast reflectional symmetry detection using orientation histograms. Real-Time Imaging, 5(1):63–74, 1999. 126
[264] Y. Sun and B. Bhanu. Reflection symmetry-integrated image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(9):1827–1841, 2012. 116, 119, 157, 158, 159
[265] Y. Sun, L. Bo, and D. Fox. Attribute based object identification. In Proc. IEEE Int’l Conf. on Robotics and Automation, pages 2096–2103, 2013. 174, 175
[266] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Skubic, and S. Lao.
Histogram of oriented normal vectors for object recognition with a depth sensor. In Proc. Asian Conf. on Computer Vision, pages 525–538. Springer, 2013. 66
[267] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Proc. National Conference on Artificial Intelligence, 2011. 193
[268] C. L. Teo, C. Fermüller, and Y. Aloimonos. Fast 2D border ownership assignment. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 5117–5125, June 2015. 16, 45
[269] C. L. Teo, C. Fermüller, and Y. Aloimonos. A Gestaltist approach to contour-based object recognition: Combining bottom-up and top-down cues. The International Journal of Robotics Research, 34(4-5):627–652, 2015. 57, 123
[270] C. L. Teo, C. Fermüller, and Y. Aloimonos. Detection and segmentation of 2D curved reflection symmetric structures. In Proc. Int’l Conf. on Computer Vision, Accepted, 2015. 112
[271] C. L. Teo, A. Myers, C. Fermüller, and Y. Aloimonos. Embedding high-level information into low level vision: Efficient object search in clutter. In Proc. IEEE Int’l Conf. on Robotics and Automation, pages 126–132. IEEE, 2013. 57, 66, 103, 104, 105
[272] C. L. Teo, H. Yi, and C. Fermüller. Object-centric bilateral symmetry detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, submitted. 111
[273] A. Thayananthan, B. Stenger, P. H. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I–127. IEEE, 2003. 83
[274] E. Tola, V. Lepetit, and P. Fua. DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. IEEE Trans. on Pattern Analysis and Machine Intelligence, 32(5):815–830, 2010. 135
[275] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In Proc.
European Conference on Computer Vision, pages 776–789. Springer, 2010. 196
[276] A. Toshev, B. Taskar, and K. Daniilidis. Object detection via boundary structure segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 950–957. IEEE, 2010. 64, 67
[277] A. Tsai, A. Yezzi Jr, W. Wells III, C. Tempany, D. Tucker, A. Fan, W. E. Grimson, and A. Willsky. Model-based curve evolution technique for image segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages I–463. IEEE, 2001. 7
[278] S. Tsogkas and I. Kokkinos. Learning-based symmetry detection in natural images. Proc. European Conf. on Computer Vision, pages 41–54, 2012. 112, 114, 118, 134, 154, 155, 165
[279] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. Int’l J. of Computer Vision, 63(2):113–140, 2005. 12
[280] C. W. Tyler. Human symmetry perception and its computational analysis. Lawrence Erlbaum Associates Publishers, 2002. 113
[281] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012. 208
[282] O. Veksler. Star shape prior for graph-cut image segmentation. In Proc. European Conf. on Computer Vision, pages 454–467. Springer, 2008. 141, 142
[283] L. A. Vese and T. F. Chan. A multiphase level set framework for image segmentation using the Mumford and Shah model. Int’l J. of Computer Vision, 50(3):271–293, 2002. 17
[284] R. von der Heydt, E. Peterhans, and G. Baumgartner. Illusory contours and cortical neuron responses. Science, 224(4654):1260–1262, 1984. 5, 45
[285] J. Wagemans, J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh, and R. von der Heydt. A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychological Bulletin, 138(6):1172, 2012. 4
[286] C. Wang, Y. Li, W. Ito, K.
Shimura, and K. Abe. A machine learning approach to extract spinal column centerline from three-dimensional CT data. In SPIE Medical Imaging, pages 72594T–72594T, 2009. 115
[287] K. Wang and S. Belongie. Word spotting in the wild. In Proc. European Conf. on Computer Vision, pages 591–604. Springer-Verlag, 2010. 171
[288] X. Wang, X. Bai, T. Ma, W. Liu, and L. J. Latecki. Fan shape model for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 151–158. IEEE, 2012. 64, 99, 100, 102
[289] X.-J. Wang, L. Zhang, M. Liu, Y. Li, and W.-Y. Ma. Arista: image search to annotation on billions of web photos. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2987–2994. IEEE, 2010. 3
[290] Y. Wei, E. Brunskill, T. Kollar, and N. Roy. Where to go: Interpreting natural directions using global inference. In Proc. IEEE Int’l Conf. on Robotics and Automation, pages 3761–3767. IEEE, 2009. 192
[291] L. R. Williams and D. W. Jacobs. Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9(4):837–858, 1997. 6, 7
[292] J. Wu and J. M. Rehg. CENTRIST: A visual descriptor for scene categorization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(8):1489–1501, 2011. 9
[293] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75–82, 2014. 171
[294] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 3485–3492, June 2010. 1
[295] J. Xiao, J. Hays, B. C. Russell, G. Patterson, K. A. Ehinger, A. Torralba, and A. Oliva. Basic level scene understanding: categories, attributes and structures. Frontiers in Psychology, 4, 2013. 10
[296] Y. Xu, Y. Quan, Z. Zhang, H. Ji, C. Fermüller, M. Nishigaki, and D. DeMenthon.
Contour-based recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 3402–3409, 2012. 20
[297] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1794–1801. IEEE, 2009. 9
[298] S. Yang and Y. Wang. Rotation invariant shape contexts based on feature-space Fourier transformation. In International Conference on Image and Graphics, pages 575–579. IEEE, 2007. 66
[299] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 444–454. Association for Computational Linguistics, 2011. 202
[300] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508, Aug. 2010. 12
[301] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 702–709. IEEE, 2012. 2, 10
[302] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang. Designing category-level attributes for discriminative visual recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 771–778. IEEE, 2013. 197
[303] S. Yu, H. Zhang, and J. Malik. Inferring spatial layout from a single image via depth-ordered grouping. In IEEE Computer Vision and Pattern Recognition Workshops, pages 1–7. IEEE, 2008. 11
[304] X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero/one training example. In Proc. European Conf. on Computer Vision, pages 127–140, 2010. 174
[305] H. Zabrodsky, S. Peleg, and D. Avnir. Symmetry as a continuous feature. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(12):1154–1166, 1995. 115
[306] N.
R. Zhang and R. von der Heydt. Analysis of the context integration mechanisms underlying figure–ground organization in the visual cortex. The Journal of Neuroscience, 30(19):6482–6496, 2010. 3, 6, 42
[307] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014. 9, 170
[308] H. Zhou, H. S. Friedman, and R. von der Heydt. Coding of border ownership in monkey visual cortex. The Journal of Neuroscience, 20(17):6594–6611, 2000. 6, 16
[309] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 2879–2886, June 2012. 171
[310] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proc. European Conf. on Computer Vision, pages 391–405. 2014. 21