ABSTRACT

Title of dissertation: Semi-supervised and Active Image Clustering with Pairwise Constraints from Humans
Arijit Biswas, Doctor of Philosophy, 2014
Dissertation directed by: Prof. David W. Jacobs, Department of Computer Science, University of Maryland, College Park

Clustering images has been an interesting problem for computer vision and machine learning researchers for many years. However, as the number of categories increases, image clustering becomes extremely hard and is no longer usable for many practical applications. Researchers have proposed several methods that use semi-supervision from humans to improve clustering. Constrained clustering, where users indicate whether an image pair belongs to the same category or not, is a well-known paradigm for semi-supervision. Past research has shown that pairwise constraints have the potential to significantly improve clustering performance.

There are two major components to constrained clustering research: how pairwise constraints can be used to improve clustering (e.g., constrained clustering algorithms, distance or metric learning methods) and determining which constraints are most useful for improving clustering (e.g., active or interactive clustering methods). In this thesis we propose three different approaches to improve pairwise constrained clustering, spanning both of these components. First, we propose a distance learning method in non-vector spaces, where the triangle inequality is used to propagate the pairwise constraints to the unsupervised image pairs. This approach can work with any pairwise distance and does not require any vector representation of images. Second, we propose an algorithm for active image pair selection. A novel method is developed to choose the most useful pairs to show a person, obtaining constraints that improve clustering. Third, we study how pairwise constraints can effectively be used to cluster large image datasets. Complete clustering of large datasets requires an extremely large number of pairwise constraints and may not be feasible in practice. We propose a new algorithm to cluster only a subset of the images (we call this subclustering), which produces a few examples from each class. Subclustering yields smaller but purer clusters and can be used for summarization, category discovery, browsing, image search, etc. Finally, we make use of human input in an active subclustering algorithm to further improve results. We perform experiments on several real-world datasets such as faces, leaves, videos and scenes and empirically show that our approaches can advance the state of the art in clustering.

Semi-supervised and Active Image Clustering with Pairwise Constraints from Humans

by Arijit Biswas

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2014

Advisory Committee:
Prof. David W. Jacobs, Chair/Advisor
Prof. Larry S. Davis
Prof. Hal Daumé III
Prof. Ramani Duraiswami
Prof. Rama Chellappa, Dean's Representative

© Copyright by Arijit Biswas, 2014

Dedication

I dedicate this thesis to my parents.

Acknowledgments

First, I would like to thank my advisor Dr. David Jacobs for the strong support and freedom he has provided me throughout my PhD life. The lively and involved meetings with David are still the most cherished hours of the last five years. Whenever I lost direction in my research, he would guide me until I had new ideas to pursue.
I strongly appreciate his precious advice in the initial years of my PhD life while I was still an amateur in computer vision and machine learning. He has taught me how to find interesting problems, study them and then figure out the best possible way to solve them. His unique style of research has always encouraged me and will continue to do so until the end of my research career.

I would also like to thank my committee members Larry Davis, Rama Chellappa, Hal Daumé and Ramani Duraiswami. They provided many useful comments during my proposal and defense which helped me to significantly improve my PhD thesis. I am also very grateful to the professors at the Computer Science Department who have taught excellent courses and helped me to develop my background in Computer Science. I would like to thank my internship supervisors Dr. Devi Parikh at Virginia Tech (previously at the Toyota Technological Institute at Chicago (TTIC)) and Dr. Eugene Bart at the Xerox Palo Alto Research Center (PARC), who guided me to understand and work on several practical aspects of computer vision research.

I am thankful to many CS and UMIACS personnel without whom my research and official work could have stopped at any time. I would like to make special mention of Jennifer, Fatima, Janice, Adelaide, Arlene and Jonathan.

My PhD studies would not have been so much fun without the friends I had during my stay at UMD. First and foremost, I would like to thank my academic siblings Carlos, Anne, Joao, Daozheng, Abhishek, Angjoo and Jin for the interesting discussions and feedback they have provided about my research during the last five years. I should also thank other members of the Computer Vision lab who have enriched my research experience while I was at UMD: Sumit, Ashish, Raviteja, Austin, Chinglik, Ajay, Kaushik, Nazre, Sima, Jayshankar, Ming-Yu, Mohammed, Sameh, Jonghyan, Vishal and everyone else. I am also grateful to my past and present roommates Ranchu, Biswadip, Siddharth, Anup and Ranjith, who always agreed to cook good food when I was just beginning to attend Cooking 101. I must mention many other friends at UMD without whom my journey would have been incomplete: Ashwin, Vaibhav, Krishna, Soumyadeep, Soham, Shawon, Sofia, Titir, Arti, Anand, Jonathan, Pritam and Wrick. I am also grateful to my friends from high school and Jadavpur University who have always been supportive throughout my life.

This dissertation would not have been possible without the constant support of my parents. Words would not be enough to express my gratitude to them. This dissertation is a testament to their commitment to my education and the sacrifices they have made in shaping my life. I dedicate this dissertation to them.

Table of Contents

List of Figures

1 Introduction
  1.1 Distance Learning in Non-vector Spaces
  1.2 Active Clustering
  1.3 Active Subclustering
    1.3.1 Why Subclustering?
  1.4 Conclusion

2 Distance Learning in Non-Vector Spaces
  2.1 Related Work
  2.2 Our Approach
    2.2.1 High Level Intuition
    2.2.2 Notation
    2.2.3 Problem Formulation
    2.2.4 Graph Based QP (DistLQP)
    2.2.5 Theoretical Analysis
  2.3 Experimental Results
    2.3.1 Discussion of Results
  2.4 Conclusion

3 Active Image Clustering
  3.1 Related Work
  3.2 Our Approach
    3.2.1 High Level Intuition
    3.2.2 Components of Our Algorithm
      3.2.2.1 Consensus Clustering based Pairwise Probability Distribution
      3.2.2.2 Minimum Spanning Forests (MSF)
    3.2.3 Our Algorithm (Active-HACC)
      3.2.3.1 Complexity of the algorithm
    3.2.4 FAST-Active-HACC
      3.2.4.1 Definitions
      3.2.4.2 Data Structures
      3.2.4.3 Primitive Operations
      3.2.4.4 Same Cluster Cannot Links (SCC)
      3.2.4.5 Different Cluster Must Links (DCM)
      3.2.4.6 QUICK-SIMULATE
  3.3 Experimental Results
    3.3.1 Empirical Observations
    3.3.2 Experiments with Real Humans
  3.4 Conclusions

4 Active Subclustering
  4.1 Related Work
  4.2 Passive Subclustering
    4.2.1 Notation
    4.2.2 Our Algorithm
  4.3 Active Subclustering
  4.4 Subclustering Jaccard's Coeff. (SJC)
  4.5 Fast-PAM
  4.6 Experimental Results
    4.6.1 Subclustering
    4.6.2 Restricted Set Size and Subcluster Size Variation
    4.6.3 Active Subclustering
  4.7 Conclusion

5 Conclusion and Future Works

6 Appendix
  6.1 Detailed Proofs for Distance Learning
  6.2 Computational Complexity of Passive Subclustering
  6.3 Details of the Active Subclustering Method
    6.3.1 Thresholding (qth) in Active Learning
  6.4 Fast Partition Around Medoids (Fast-PAM)
    6.4.1 Related Work
    6.4.2 Algorithm Description
    6.4.3 Experimental Results

List of Figures

1 A diagram that motivates how the problems that we solve are related to pairwise constrained clustering.
2 We show some example face images from the Pubfig dataset [49]. Each row shows some example face images of a single person. Variation in lighting, expression and pose is so high that automatic clustering algorithms fail badly, especially with a large number of clusters.
3 We show some example leaf images from the dataset used in Leafsnap [47]. Each row shows some example leaf images of a particular plant species. Due to the presence of specularity and shadow, automatic clustering algorithms often fail badly, especially with a large number of clusters.
4 Pipeline of our system: Active-HACC is our proposed active algorithm for selecting data-pairs to get constraints and HACC is a constrained clustering algorithm.
5 The pipeline of our proposed subclustering algorithms. First we start with an unlabeled set of images. Next, we apply our passive subclustering algorithm. Green, red and blue boxes are used to denote three different subclusters of size three. Then our active algorithm chooses pairs of images to show to a user to obtain constraints; we update the subclustering with those constraints. At any point, we can produce a subclustering of the image data that users can browse easily.
6 10 example subclusters produced after we run our proposed algorithms. Each row contains one person's faces.

1 Toy examples to motivate our distance learning approach. We have three two-dimensional points A, B and C with a must-link/can't-link between A and B. Initial pairwise distances are written in black. (a) Must-link constraint (shown in green). (b) Distances learned using our approach (in blue). (c) Can't-link constraint (shown in red). (d) Distances learned using our approach (written in blue).
2 Simple examples in 2-d Euclidean space (green denotes must-link edge and red denotes can't-link edge) to explain our proposed approach.
3 Comparison of our proposed distance learning approach with other constrained clustering methods.

1 Toy example to motivate our approach.
2 Pairwise probability distribution of real data.
3 Complete graph G=(V,E) with edge weights on the edges and a sorted edge list.
4 Simple examples for explaining the faster version of our algorithm.
5 Performance plot showing how the Jaccard's Coefficient increases with the number of questions. Our proposed algorithm Active-HACC significantly outperforms all other algorithms we compare.
6 Scene-300 dataset.
7 Performance plot showing how the Jaccard's Coefficient increases with the number of questions when we get pairwise constraints from real humans on Mechanical Turk.
1 Comparison of our passive subclustering algorithm with K-medoids (the medoid and the 2 points closest to the medoid are displayed in each cluster) in a small-scale 2-d clustering problem. We see that subclusters produced by our proposed algorithm are more meaningful (30 points from 3 clusters with a subcluster size of 3).
2 A subclustering example with 3 subclusters and with 4 images in each subcluster. We ask humans about within-class pairs like (c_j, I_j^1) before (c_i, I_i^1). (See text.)
3 Some subclusters (n = 10) and the fraction they contribute to a Subclustering Jaccard's Coefficient calculation (K clusters). Red boxes denote errors in a subcluster.
4 Comparison of our proposed passive subclustering algorithm with other baselines for Face-40, 80, 120, 160, 200 (|R| = 20K).
5 Varying restricted set size and subcluster size. (a) The restricted set size |R| is varied from 5K to 40K. (b)-(f) The subcluster size is varied from 5 to 25 and performance for all algorithms is reported. The restricted set size was fixed at 20K. Since Fast-PAM is slow, we do not report that algorithm for this part of the experimental evaluation.
6 Proposed active algorithm's performance for several datasets.

1 Simple examples in 2-d Euclidean space (green denotes must-link edge and red denotes can't-link edge) to explain our proposed approach.
2 Performance and time comparison of Fast-PAM with other algorithms.

Chapter 1: Introduction

Clustering, or unsupervised learning, is a critical part of the analysis of data. There has been a huge volume of past work on clustering [40], producing many interesting and effective algorithms. Clustering images has also been a problem of interest, where a set of unlabeled images is automatically grouped based on identity or a common characteristic. We discuss a few possible applications of image clustering below:

• Labeling: Clustering images can reduce labeling cost significantly. If we have a set of unlabeled images that need to be labeled, we can cluster them and then someone can label each cluster just by looking at a few images from that cluster. This can significantly reduce the labeling cost compared to an exhaustive approach in which a person has to go through each image individually.

• Video Surveillance: Nowadays many places such as airports, parking lots and shopping centers are kept under constant surveillance. Cameras usually store very large amounts of video data every day. It is often important to locate a person or an action of interest in those videos, and it is not possible for someone to go through all of the video frames to find them. However, we can cluster frames based on the actions or the persons present in them, and then one can look at a few of these frames in each cluster to detect anomalies in those video streams. This can reduce the time required from humans by a significant amount.

• Category Discovery: Computer vision algorithms have recently been applied to many species identification tasks, such as identifying leaves [48] and birds [12, 27]. Users often upload images without any labels to these systems.
It is extremely important to know if any of their species are unknown to the community or if they are found in a new location. This information could potentially increase our existing knowledge about biodiversity. We also note that a biologist or an enthusiast might not have time to go through a list of millions of images to find such an exception. If we cluster the unlabeled images, it becomes significantly easier for someone to browse through a few images from each cluster to discover an unknown category.

Although getting reasonable image clustering results with a low number of clusters is possible, performance drops significantly as the number of clusters increases. With large numbers of clusters, images from different categories are more likely to overlap in the feature space, making clustering an extremely difficult task. Semi-supervised clustering [4] is a possible way to improve clustering performance by incorporating human supervision. Human supervision can take many possible forms. Pairwise constrained clustering is a well-known form of supervision in which humans indicate whether an image pair belongs to the same category (must-link constraints) or not (can't-link constraints). These constraints can be used to improve clustering performance.

Pairwise constraints are useful because it is often difficult for people to assign a label to an image (they may not be able to assign a name to a face in a surveillance video, or to a particular plant specimen), but much easier to assess whether two images come from the same class (humans can compare face images with 99% accuracy [49]). For example, while clustering images of biological organisms such as leaves or birds, a person without previous knowledge might find it difficult to label an image, but she might be able to say whether two leaf or bird images belong to the same species.

In this thesis we focus our attention on developing methods to improve pairwise constrained clustering. As an example of a semi-supervised problem, suppose we have access to surveillance video data from an airport and we would like to group the face images based on identity. This makes it easier to scan the people who have been present in the airport. Automatic clustering of these images is extremely hard because of the large number of clusters and the large variation in pose, expression and lighting present in these faces. However, we know that faces that are tracked through the video should be faces of the same person, giving rise to many useful must-link constraints. Also, if two face tracks appear together in some frames, then the faces in these two tracks must belong to different persons, even if they look very similar, providing several can't-link constraints. These constraints can be used to improve face clustering in videos [20, 73, 76]. Semi-supervision also arises when people have labeled a subset of images, in which case these labels can be converted to must-link or can't-link constraints.
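The track-based constraints described above are straightforward to generate automatically. The sketch below is only an illustration (it is not code from this thesis, and the function name and track representation are assumptions): each track is given as a list of (frame_id, face_id) detections, faces within a track become must-links, and faces from two tracks that co-occur in at least one frame become can't-links.

```python
from itertools import combinations

def constraints_from_tracks(tracks):
    """Derive pairwise constraints from face tracks.

    tracks: list of tracks, each a list of (frame_id, face_id) tuples.
    Returns (must_links, cannot_links) as sets of face-id pairs.
    """
    must_links, cannot_links = set(), set()

    # Faces within the same track show the same person -> must-link.
    for track in tracks:
        face_ids = sorted({face for _, face in track})
        must_links.update(combinations(face_ids, 2))

    # Two tracks that appear in the same frame show different people -> can't-link.
    for track_a, track_b in combinations(tracks, 2):
        frames_a = {frame for frame, _ in track_a}
        frames_b = {frame for frame, _ in track_b}
        if frames_a & frames_b:  # the tracks co-occur in at least one frame
            for _, face_a in track_a:
                for _, face_b in track_b:
                    cannot_links.add(tuple(sorted((face_a, face_b))))

    return must_links, cannot_links
```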
While we use pairwise constraints to improve clustering, we should consider the two major components of constrained clustering:

• How do we improve clustering with pairwise constraints? Much past research has studied how pairwise must-link and can't-link constraints can be used to improve clustering. There are two possible ways to do this. First, cluster the elements such that the specified constraints are satisfied as much as possible; there are several past works [6, 7, 74] in this area. Second, learn better distances between images using the pairwise constraints and use the new set of distances to cluster the images. A recent survey [10] describes most of the important work in this area. However, all of the past distance/metric learning works assume that a Euclidean distance is computed between images in a vector space, and they require a vector representation of all the images.

• Which pairwise constraints are most useful for improving clustering? If there are a total of N images to be clustered, it is possible to obtain O(N^2) pairwise constraints. However, not all of these constraints are equally important. If two images are very close in the feature space, it is not very useful to know that they are must-linked; this constraint is unlikely to improve clustering. Similarly, if two points are very far from each other in the feature space, a can't-link between them will not improve clustering. Also, obtaining all O(N^2) pairwise constraints is expensive and not feasible in practice. So we need to determine which image pairs, if constrained, will provide the most information for better clustering, and try to obtain as many of those constraints as possible.

In the pairwise constrained clustering setup we solve three related problems that can improve image clustering. In all these problems we assume that we know the number of clusters beforehand. The first problem is an approach for improving clustering using the must-link/can't-link constraints. The next two problems address how we can obtain useful pairwise constraints from humans to improve clustering of medium and large sized datasets. We briefly discuss all of the problems here and then discuss each of them in detail in Sections 1.1-1.3.

1. We propose a method for learning better image distances from pairwise constraints and use these distances to cluster images. When we obtain a set of pairwise constraints, it is important to propagate them to the unlabeled image pairs. Our approach uses triangle inequality constraints to propagate constraints to the pairs for which no supervision is available. The proposed approach is a general method for distance learning that can work with any kind of pairwise distance and does not require any vector representation of images.

2. We propose an active method for obtaining pairwise constraints from humans. This method chooses the image pair for which the overall expected change in clustering is maximized. We try to obtain complementary information from humans that we do not already have from the algorithms. Our approach produces state-of-the-art active clustering results in three different domains.

3. Finally, we propose a method for clustering only a subset of the data (we call this subclustering), and study how pairwise constraints can help in subclustering. Subclustering is a novel problem and can help in browsing, summarization, category discovery, etc. when the database size is very large (50,000 images or more). We first propose a passive method for subclustering which does not require any human intervention. Then we propose an active algorithm that chooses image pairs to improve the passive subclustering solution.

In Figure 1, we summarize the problems that we solve in this thesis and how they are connected to the constrained clustering research area.
Figure 1: A diagram that motivates how the problems that we solve are related to pairwise constrained clustering.

We perform experiments on clustering face images, leaf images and scene images. We also perform some experiments on clustering face images extracted from videos. We discuss the experimental details in the relevant chapters. Some example face images and leaf images are shown in Figures 2 and 3 respectively.

1.1 Distance Learning in Non-vector Spaces

First, we develop a method for learning better image distances from pairwise constraints. There are several previous metric learning approaches that can use pairwise constraints to modify distances for unsupervised pairs, when the image distances can be represented as Euclidean distances in vector spaces [24, 77]. However, in many cases, distances between images are not computed this way. Many authors have proposed manifold based distances [29, 55, 64], in which a geodesic distance between two images is computed on a manifold. Robust distance measures [25, 39] are also used, where large differences in feature values are not taken into account while computing the pairwise distance between images. Often a classifier is trained to tell whether two images belong to the same class, and the classifier output [9] is converted to an image distance. In shape analysis, distances between shapes are often computed using combinatorial approaches such as bipartite matching [11, 52]. We are particularly interested in semi-supervised clustering of fine-grained categories, as these have many meaningful categories that are challenging to distinguish without supervision. If we look at the top 10 distance measures in each of two important domains related to fine-grained classification (LFW faces [1] and leaf shapes [2]), we find that 80% of the methods use non-vector space distances. These distances can be used for clustering images, but they are not suitable for existing metric learning algorithms. It is also not clear how one could develop a parametric method for distance learning that would work effectively with all of the above-mentioned distances.

We note that in order to propagate constraints from supervised to unsupervised pairs, some structure must be assumed on the set of possible distances. Otherwise, the distance between supervised pairs could be altered without affecting the distance between unsupervised pairs. Perhaps the weakest assumption that we can make about a distance is that it obeys the triangle inequality, which states that if a, b, c are three points and d(., .) is a distance function, then for all a, b, c: d(a, b) + d(b, c) ≥ d(a, c). Enforcing the triangle inequality allows us to propagate constraints; if a constraint alters one distance, other distances must also change to maintain the triangle inequalities. In Section 2.2.1 (Chapter 2) we use toy examples to explain how triangle inequality constraints effectively propagate pairwise constraints to unlabeled image pairs.

The triangle inequality always holds for metric distances, even if the distance is computed in a non-vector space (e.g., manifold distances). For many other interesting distances, the triangle inequality is not guaranteed to hold. However, we empirically find that the triangle inequality almost always holds for distances computed for fine-grained classification, even when not explicitly enforced.
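One simple way to check this empirically is to sample triplets of images and count how many satisfy the inequality. The sketch below is a hypothetical helper (not from the thesis) that estimates this fraction for a precomputed distance matrix D.

```python
import numpy as np

def triangle_inequality_fraction(D, num_samples=100_000, seed=0):
    """Estimate the fraction of (i, j, k) triplets of a distance matrix D
    that satisfy D[i, k] <= D[i, j] + D[j, k]."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    i, j, k = rng.integers(0, n, size=(3, num_samples))
    valid = (i != j) & (j != k) & (i != k)   # drop degenerate triplets
    i, j, k = i[valid], j[valid], k[valid]
    holds = D[i, k] <= D[i, j] + D[j, k]
    return holds.mean()
```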
For example, with the distances we use for Pubfig [9, 49] faces and Leafsnap [47] leaves, 99.98% and 97.48% of the triangle inequality constraints hold, respectively. This strongly motivates us to enforce the triangle inequalities when we alter distances to incorporate the pairwise constraints. We then find empirically that by enforcing the triangle inequality we can improve performance on several real-world datasets.

Our main contribution is a novel method for using the triangle inequality to learn image distances from pairwise constraints. We use these distances along with a constrained clustering algorithm [74] to achieve state-of-the-art clustering results. The proposed approach is a general framework that can learn from any kind of pairwise distances between the images to be clustered and does not require any vector space representation of images.

Our approach formulates a quadratic optimization problem, where the pairwise distances between images are modified such that the pairwise constraints and triangle inequality constraints are satisfied as much as possible. Since enforcing O(N^3) triangle inequalities is computationally expensive, we propose a graph based approach, where only O(n(M + C)) triangle inequalities are sampled for use in the QP (N is the total number of images, n is the number of edges in the n-nearest neighbor graph, and M and C are the numbers of must-link and can't-link constraints respectively). We empirically show that this sampling approach works well in practice. We theoretically analyze a simplified case in which only one constraint is present, to gain insight into our fast approach. Our sampling approach is based on the intuition that clustering is predominantly affected by small distances, and is not sensitive to the exact value of larger distances. We prove that our sampling approach produces the same set of small distances that would be obtained by enforcing all constraints.

We perform experiments on leaf and face image datasets and show that distances obtained by our method achieve state-of-the-art clustering results. We also run experiments on a real-world video dataset, where extracted faces are clustered based on identity. Our approach outperforms recent constrained clustering methods on this video dataset.

1.2 Active Clustering

In this second work we propose an active method for image pair selection to obtain constraints from humans. Our goal is to meld human and automatic algorithms by directing valuable human attention to those judgments that are most critical to improving the clustering produced by automatic means. We may then summarize the active clustering setup which we work on as having the following characteristics:

• We begin with a collection of objects that can be grouped into a set of disjoint clusters.

• We have an automatic algorithm that can give us some useful information about the similarity of two objects.

• A person can make these judgments with much greater accuracy than existing algorithms.

Given this problem formulation, we have proposed an algorithm that does the following (a sketch of this loop is given after the list):

1. Cluster objects into groups, combining automatically computed distances with any constraints provided by people.

2. Choose a useful question to ask a person. The person will compare two objects and indicate whether they belong to the same group (or answer "don't know").

3. Repeat, using the information provided by the person as a new constraint.

4. Continue asking questions until a reasonable clustering is achieved or the human budget is exhausted.
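A minimal sketch of this loop is given below. It only illustrates the control flow: `constrained_cluster` and `select_most_informative_pair` stand in for HACC and the Active-HACC pair-selection criterion described in Chapter 3, and `ask_human` stands in for the human (or Mechanical Turk) oracle; none of these function names come from the thesis.

```python
def active_clustering_loop(distances, num_clusters, budget,
                           constrained_cluster, select_most_informative_pair, ask_human):
    """Skeleton of the active constrained-clustering loop (illustrative only)."""
    must_links, cannot_links = set(), set()
    clustering = constrained_cluster(distances, num_clusters, must_links, cannot_links)

    for _ in range(budget):                      # stop when the human budget is exhausted
        pair = select_most_informative_pair(clustering, distances,
                                            must_links, cannot_links)
        answer = ask_human(pair)                 # "same", "different", or "don't know"
        if answer == "same":
            must_links.add(pair)
        elif answer == "different":
            cannot_links.add(pair)
        # re-cluster with the new constraint before asking the next question
        clustering = constrained_cluster(distances, num_clusters,
                                         must_links, cannot_links)
    return clustering
```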
We show the pipeline of our algorithm in Figure 4 (face images from [58]).

We perform experiments to evaluate our algorithm in three different domains: face, leaf and scene images. We demonstrate that our novel approach is much better than state-of-the-art algorithms. Since we assume that people are highly accurate, in experiments we can simulate their behavior using ground truth. We also relax the assumption that humans are always correct and perform some experiments with real humans in the loop using Mechanical Turk. We find that our algorithm still produces useful results and is better than the state-of-the-art approaches.

1.3 Active Subclustering

Humans often search for images in large databases. Consider the example of video surveillance, in which a user might want to know whether a person of interest was present in the airport on a particular day. If we can create an organized collection of face images containing a few different images of each person present in the airport, it will be significantly easier for an analyst to browse them to detect the presence of that person of interest. Similarly, for a category discovery application it is useful to create a leaf/bird image collection from the unlabeled images that contains only a few images from each class present in the data, allowing easy browsing.

In both examples we start with a large collection of unlabeled images. These images can be clustered using traditional clustering algorithms [40, 44, 56, 66] to facilitate browsing, allowing a user to examine meaningful subsets of images. As we pointed out earlier, accurate clustering is extremely difficult when there are a large number of clusters. To cope with this challenge, we introduce the idea of subclustering, in which we cluster only a small subset of images. Our goal is to create clusters that represent each object in the image set with a few images. For example, in the surveillance scenario, subclustering aims to create clusters containing a few images of each individual. Subclustering has the potential to produce more efficient algorithms, and to produce more accurate clusters by ignoring images that are hard to categorize. We also assume that a modest amount of labeling effort can be obtained from humans to improve subclustering. We again abstract human input as providing must-link and can't-link constraints.¹ Our two main contributions in subclustering are:

1. We propose a new passive subclustering algorithm where no human input is required. We propose a cost function suitable for subclustering and optimize it with an algorithm motivated by the Partition Around Medoids (PAM) [44] algorithm (a background sketch of PAM follows this list). In experiments we show that our approach produces much better subclusters than those produced by several baseline algorithms. Our subclustering algorithm reduces run-time by O(N) compared to a similar complete clustering algorithm like PAM, where N is the total number of images in the dataset.

2. We also propose a new active subclustering algorithm that uses pairwise constraints on the clusters. The algorithm repeatedly selects an image pair and asks a user whether they belong to the same category or not. Pairs of images are chosen such that existing errors in a subclustering are corrected with limited human interaction. In experiments we show that we can accurately subcluster a face image dataset with 51,418 images of 200 classes with around 143,000 pairwise constraints from humans (equivalent Mechanical Turk cost: $358). In general, active algorithms for complete clustering require much more human supervision [6, 78] and are often too slow to run on large scale datasets [14].

¹We will use classes/categories to refer to the ground truth and clusters/subclusters to refer to algorithms' output throughout this thesis.
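For readers unfamiliar with PAM, the following is a compact background sketch of the classic k-medoids swap heuristic that contribution 1 above is motivated by. It is standard PAM (with a random initialization instead of PAM's BUILD phase, for brevity), not the subclustering algorithm of Chapter 4.

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """Classic Partition Around Medoids (k-medoids) on a distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))  # random init for brevity

    def total_cost(meds):
        # every point is assigned to its closest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:mi] + [candidate] + medoids[mi + 1:]
                trial_cost = total_cost(trial)
                if trial_cost < cost:          # greedy swap if it lowers the cost
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    labels = np.asarray(D[:, medoids]).argmin(axis=1)
    return medoids, labels
```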
We run experiments in two different domains: face images (the complete Pubfig dataset [49]) and leaf images [47]. We show that our proposed algorithms perform much better than baseline algorithms. A pipeline of our proposed system for subclustering is depicted in Figure 5.

1.3.1 Why Subclustering?

Over the last decade, active/semi-supervised clustering algorithms [6, 14, 78], including our own work, have produced much better clustering results. However, as the data size (number of images and classes) increases, the required number of pairwise constraints grows so much that human-in-the-loop clustering algorithms become impractical. Therefore we introduce subclustering, which aims to produce a clustering that contains n images from each category with n < N_k (where N_k is the actual number of instances of class k present in the full dataset). This produces smaller but purer clusters and requires significantly less human input. Since large unlabeled image collections are highly prevalent these days, subclustering can have a wide range of applications, including browsing image databases, image search, summarizing image databases and category discovery.

The input to a subclustering algorithm includes a pairwise distance matrix, the number of clusters and the number of images required (n ≥ 1) from each class. We emphasize that a subclustering algorithm's goal is to extract any n images from each class. For large real-world applications, where each class is expected to have many (more than 100) images, we find that n = 10 is reasonable: not so large that subclustering becomes error-prone, and not so small that images from the same class show no variation. In Section 4.6.2 (Chapter 4) we perform experiments to see how robust our approach is to the subcluster size. In Figure 6 we show some typical subclusters produced after running our passive and active algorithms. Although the number of examples from each class is small, they have enough intra-class variation that humans can determine how these people look and compare them to a new face image. We note that subclusters are not intended for training classifiers, because they are small and may not represent the complete distribution of each class. Subclustering and its use for browsing large image databases is partially motivated by the fact that humans can learn reasonable object models even from a small set of examples [51].

Although it is possible to create subclusters from complete clusters, algorithms built solely to solve subclustering create purer subclusters. Intuitively it is clear that points that are difficult to cluster can affect the results of clustering algorithms in a way that degrades the clustering of easier points.

1.4 Conclusion

Although image clustering has a wide range of applications, clustering is an extremely hard problem, especially when the number of clusters is high. Pairwise constraints from humans can help image clustering significantly. In this thesis we have proposed three methods to improve the performance of pairwise constrained clustering.
The first approach uses the constraints to learn better distances between image pairs. The next two methods improve clustering by obtaining useful pairwise constraints from humans for medium and large scale datasets. We performed experiments on real-world datasets of faces, leaves, videos and scenes and showed that our approaches can improve state-of-the-art image clustering by significant margins.

Figure 2: We show some example face images from the Pubfig dataset [49]. Each row shows some example face images of a single person. Variation in lighting, expression and pose is so high that automatic clustering algorithms fail badly, especially with a large number of clusters.

Figure 3: We show some example leaf images from the dataset used in Leafsnap [47]. Each row shows some example leaf images of a particular plant species. Due to the presence of specularity and shadow, automatic clustering algorithms often fail badly, especially with a large number of clusters.

Figure 4: Pipeline of our system: Active-HACC is our proposed active algorithm for selecting data-pairs to get constraints and HACC is a constrained clustering algorithm.

Figure 5: The pipeline of our proposed subclustering algorithms. First we start with an unlabeled set of images. Next, we apply our passive subclustering algorithm. Green, red and blue boxes are used to denote three different subclusters of size three. Then our active algorithm chooses pairs of images to show to a user to obtain constraints; we update the subclustering with those constraints. At any point, we can produce a subclustering of the image data that users can browse easily.

Figure 6: 10 example subclusters produced after we run our proposed algorithms. Each row contains one person's faces.

Chapter 2: Distance Learning in Non-Vector Spaces

In this chapter, we propose an approach that uses the triangle inequality to propagate pairwise constraints to the unsupervised data and learn a new set of distances. We assume that no vector space representation of images is available. The learned distances are used for clustering, where all the training data and pairwise constraints are available ahead of time. We note that we are not considering an out-of-sample extension in this work.

First, we formulate a brute-force Quadratic Programming (QP) method that modifies the distances such that the total change in distances is minimized while the final distances obey the triangle inequality. Then we propose a much faster version of the QP that enforces only a subset of the inequalities. We show experimentally that this faster QP produces stronger clustering results on datasets of face, leaf and video images, outperforming state-of-the-art methods for constrained clustering. To gain insight into the effectiveness of this algorithm, we analyze a special case of the semi-supervised clustering problem, and show that the subset of constraints that we sample still preserves key properties of the distances that would be produced by enforcing all constraints.

2.1 Related Work

Our work draws on prior work in constrained clustering and metric learning. One of the earliest constrained clustering algorithms was proposed by Wagstaff et al. [74] in 2001. The authors in [74] proposed an algorithm (COP-Kmeans) in which points are assigned to clusters such that none of the specified constraints are violated.
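The constraint check at the heart of that assignment rule is simple; the sketch below is an illustrative version of the standard COP-KMeans-style logic (names and data structures are ours, not code from [74] or from this thesis): a point may join a cluster only if no can't-linked point is already in it and no must-linked partner has been assigned elsewhere.

```python
def violates_constraints(point, cluster_id, assignment, must_links, cannot_links):
    """Return True if assigning `point` to `cluster_id` breaks any constraint.

    assignment: dict mapping already-assigned points to their cluster ids.
    must_links / cannot_links: iterables of (a, b) point pairs.
    """
    for a, b in must_links:
        if point in (a, b):
            other = b if point == a else a
            if other in assignment and assignment[other] != cluster_id:
                return True   # must-linked partner sits in a different cluster
    for a, b in cannot_links:
        if point in (a, b):
            other = b if point == a else a
            if assignment.get(other) == cluster_id:
                return True   # can't-linked point is already in this cluster
    return False
```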
The authors in [7] proposed a method in which a Hidden Markov Random Field (HMRF) is formulated along with must-link and can't-link constraints and MAP estimation is performed on the HMRF. The authors in [6] proposed a soft constrained clustering algorithm, where the cost function for clustering takes the pairwise constraints into account; however, this approach is better suited to the active constrained clustering setup proposed in [6], where constraints are obtained in an interactive manner. The authors in [53, 54] proposed spectral clustering methods where pairwise constraints are propagated to the entire dataset, but spectral methods usually require a vector representation of the images for clustering. In [45] the authors proposed a method for constrained clustering where the must-links are propagated to other image pairs by computing the shortest path between all pairs; a straightforward modification of the Floyd-Warshall algorithm [21] is used to perform the fast all-pairs-shortest-path computation.

There is a significant amount of past research on metric learning. A recent survey [10] nicely describes the past work in this area; we discuss a couple of well-known methods. The authors in [77] proposed a method in which a Mahalanobis distance metric is learned such that must-linked image pairs come close to each other and can't-linked pairs move away from each other. However, as pointed out by [24], this method can be slow and inaccurate. The authors in [24] proposed a widely used metric learning approach called Information Theoretic Metric Learning (ITML), which works well in practice; both of these methods require a vector representation of images. Perhaps the most similar past work is on non-parametric kernel learning methods [35, 82], where a kernel matrix is learned from the similarity matrix and user-provided constraints using semi-definite programming. However, we demonstrate in the experimental results that such approaches do not work well for image datasets, especially when there are many clusters.

Recently several researchers have applied constrained clustering techniques to clustering faces in videos. In [20], the authors proposed a method for metric learning from constraints and used it for face identification. The authors in [73] also proposed an approach to cluster face images with pairwise constraints for movie content analysis. In the most recent work on face clustering [76], a method was proposed in which a Hidden Markov Random Field based objective function is formulated that incorporates the pairwise constraints; it is optimized using the simulated field algorithm [18].

2.2 Our Approach

Semi-supervised clustering with pairwise constraints is most useful when the constraints are propagated to other image pairs as well. In this chapter we propose an approach for learning new distances using only the initial pairwise distances between images. We formulate a quadratic program that modifies the distances such that the total distance modification is minimized while the final distances obey the triangle inequality. We define the ideal scenario QP, in which all constraints are enforced, in Section 2.2.3 and discuss the high computational cost of solving the ideal case. Next, in Section 2.2.4 we propose a graph based QP formulation which is much faster and applicable to larger datasets in practice. In Section 2.2.5, we theoretically demonstrate why the graph based approach is a good alternative to the brute-force method.
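Before going into the details, a small worked example shows how enforcing the triangle inequality propagates a single must-link constraint to an unsupervised pair. The numbers below are hypothetical (the exact assignment of the initial distances in Figure 1 is not recoverable from the text); the derivation previews the intuition developed in Section 2.2.1 and the result of Lemma 1.

```latex
Suppose three points $A$, $B$, $C$ have initial distances
\[
  d_I(A,B) = 1, \qquad d_I(A,C) = 4, \qquad d_I(B,C) = 3.5 .
\]
A must-link between $A$ and $B$ sets $d_F(A,B) = 0$. The triangle inequalities
\[
  d_F(A,C) \le d_F(A,B) + d_F(B,C), \qquad
  d_F(B,C) \le d_F(A,B) + d_F(A,C)
\]
force $|d_F(A,C) - d_F(B,C)| \le d_F(A,B) = 0$, so the two remaining distances
must become equal. The smallest total squared change that achieves this is
\[
  d_F(A,C) = d_F(B,C) = \frac{4 + 3.5}{2} = 3.75 ,
\]
i.e.\ $C$ becomes equidistant from the two must-linked points.
```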
2.2.1 High Level Intuition

In this subsection, we describe toy examples to explain how our approach uses the triangle inequality to propagate pairwise constraints. In Figure 1a, we have three two-dimensional points A, B and C and a must-link between A and B. Since there is a must-link constraint between A and B, the distance between them (referred to using the function d(., .) from now on) is set to zero. Intuitively it is clear that C should then become equidistant from both A and B. Similarly, in Figure 1c, a can't-link between A and B alters d(A, B) to a very high value (say, 10). In this scenario C cannot stay close to both A and B and should move away from both, or at least from one of them. We find that by enforcing triangle inequality constraints while changing the distances as little as possible, we can set d(A, C) equal to d(B, C) (Figure 1b) and move C far from both A and B (Figure 1d). These examples show how we can learn better distances by enforcing triangle inequality constraints.

Figure 1: Toy examples to motivate our distance learning approach. We have three two-dimensional points A, B and C with a must-link/can't-link between A and B. Initial pairwise distances are written in black. (a) Must-link constraint (shown in green). (b) Distances learned using our approach (in blue). (c) Can't-link constraint (shown in red). (d) Distances learned using our approach (written in blue).

2.2.2 Notation

We begin with a set of N unlabeled images U (x ∈ U) from K classes. We are also provided with initial distances between all image pairs, i.e., d_I(x_i, x_j) is given for all i, j (d_I denotes the initial distance). Let M denote the set of must-link constraints, such that any pair (x_i, x_j) ∈ M implies that x_i and x_j belong to the same class. Similarly, C denotes the set of can't-link constraints, such that any pair (x_i, x_j) ∈ C implies that x_i and x_j belong to different classes.

2.2.3 Problem Formulation

We learn a new set of distances for all image pairs given the pairwise constraints M and C. A quadratic optimization problem is formulated to find these distances. Let the set of final distances be given by d_F(x_i, x_j), which we obtain by solving the following quadratic optimization problem:

\begin{aligned}
\underset{d_F}{\text{minimize}} \quad & \sum_{(x_i, x_j) \notin \mathcal{M} \cup \mathcal{C}} \left( d_F(x_i, x_j) - d_I(x_i, x_j) \right)^2 \\
\text{subject to} \quad
& \text{(i)}\;\; d_F(x_i, x_j) \le U, \quad (x_i, x_j) \in \mathcal{M} \\
& \text{(ii)}\;\; d_F(x_i, x_j) \ge L, \quad (x_i, x_j) \in \mathcal{C} \\
& \text{(iii)}\;\; d_F(x_i, x_j) + d_F(x_j, x_k) \ge d_F(x_i, x_k), \quad \forall i, j, k \\
& \text{(iv)}\;\; d_F(x_i, x_j) \ge 0, \quad \forall i, j
\end{aligned} \tag{2.1}

In the formulation in equation 2.1, distances between all image pairs are modified such that constraints (i) to (iv) are satisfied. The objective function minimizes the total sum of squared changes in the pairwise distances. The constraints in (i) are added so that distances corresponding to must-link image pairs become as small as possible, i.e., are upper bounded by a small user-defined constant U. Similarly, the constraints in (ii) are added so that can't-link image pairs move as far apart as possible, i.e., are lower bounded by a large user-defined constant L. One of the major contributions of our approach is to add the triangle inequality constraints in (iii), which enforce that the final distances d_F(x_i, x_j) obey the triangle inequality for all possible triplets of images. These triangle inequality constraints propagate the information about the must-link and can't-link constraints to other image pairs. The final set of constraints in (iv) ensures that all the quadratic optimization variables, i.e., the pairwise distances, remain non-negative.

In the QP formulation stated in equation 2.1, there are O(N^2) QP variables and O(N^3) triangle inequality constraints. Although this brute-force approach can be applied to small problems with 100 or fewer images, as the problem size grows the QP becomes extremely large and cannot be solved in practice. Also, with all the pairwise and triangle inequality constraints, a feasible solution may not exist. To avoid these issues we reduce the size of the QP significantly by determining, through a novel graph-based formulation, which triangle inequalities are crucial for the purpose of clustering.

2.2.4 Graph Based QP (DistLQP)

In this section we describe a faster approach, based on enforcing a subset of the constraints in equation 2.1. We make use of the intuition that when a dataset is clustered, the comparatively small distances usually determine the clusters. When two points are far apart, the exact distance between them is not critical in determining the clusters. We can build a nearest neighbor graph to determine which distances are small and possibly important for clustering. Distances smaller than or equal to a fixed threshold th are called nearest neighbors or "small" and distances larger than th are called "large". We focus our method on accurately determining the small distances, while allowing some inaccuracy in the large distances.

In Section 2.2.5, we prove a theorem stating that if we enforce a subset of the triangle inequalities based on the nearest-neighbor graph, then when one must-link or can't-link constraint is added we still compute the same distances as the brute-force method for small distances, while accurately computing the qualitative values of larger distances. That is, distances that are large before and after the full set of constraints is enforced will also be large using the sampled constraints, though their exact values may not be preserved. Since large distances are unlikely to affect clustering, it is not necessary to compute the exact change in those distances. This theorem gives useful insight and motivates us to use the faster version, which we describe next. Although we use a hard threshold th for these proofs, in practice we find it more reasonable to adapt the threshold locally, so we use an n-nearest neighbor graph, in which the parameter n is selected depending on the problem domain.

First, an n-nearest neighbor graph G = (V, E) is created from the N images, where each node (image) is connected by edges to its n nearest neighbors. We also add edges when there is a must-link or can't-link constraint between two images but the corresponding distance is not lower than U or higher than L respectively. If two images are already very close and must-linked, the corresponding distance does not need to be modified; similarly, if two images are very far apart and can't-linked, we do not need to change the corresponding distance. E denotes the nearest neighbor edges and must-link/can't-link edges of the graph.

One of our chief contributions is to determine which of the O(N^3) possible triangle inequalities should be enforced in the QP. We only enforce triangle inequalities for triplets that contain at least one nearest neighbor edge and at least one must-link/can't-link edge. This reduces the number of triangle inequalities to O(n(M + C)).
In Section 2.2.5, we prove that, under some simplified assumptions, even with this small subset of triangle inequalities we obtain distances with the same key properties as those obtained by enforcing all the constraints.

Now E is augmented to form E′ = E ∪ E_A, where E_A includes edges between two nodes that are connected to a common third node by one nearest neighbor edge and one must-link/can't-link edge in E. All edges in E′ are used as the variables of the QP.

We also add slack variables to the must-link and can't-link constraints in the QP so that a feasible solution can be obtained even when not all of the pairwise constraints are satisfied. Slack variables could also be added to the triangle inequality constraints if required; however, that would increase the number of QP variables significantly, and we have not found it to be necessary. The final QP that we optimize (with the graph-based formulation and the slack variables for pairwise constraints) is provided below, where \hat{d}_F(x_i, x_j) denotes the final distances obtained from this QP:

\begin{aligned}
\underset{\hat{d}_F,\ \xi_m,\ \xi_c}{\text{minimize}} \quad
& \sum_{(x_i, x_j) \in \mathcal{E}' \setminus (\mathcal{M} \cup \mathcal{C})}
  \left( \hat{d}_F(x_i, x_j) - d_I(x_i, x_j) \right)^2
  + \lambda_1 \sum_{m} \xi_m + \lambda_2 \sum_{c} \xi_c \\
\text{subject to} \quad
& \text{(i)}\;\; \hat{d}_F(x_i, x_j) \le U + \xi_m, \quad (x_i, x_j) \in \mathcal{M} \cap \mathcal{E}' \\
& \text{(ii)}\;\; \hat{d}_F(x_i, x_j) \ge L - \xi_c, \quad (x_i, x_j) \in \mathcal{C} \cap \mathcal{E}' \\
& \text{(iii)}\;\; \hat{d}_F(x_i, x_j) + \hat{d}_F(x_j, x_k) \ge \hat{d}_F(x_i, x_k), \quad
  \forall i, j, k \ \text{such that all three edges} \\
& \qquad \{(x_i, x_j), (x_i, x_k), (x_j, x_k)\} \subseteq \mathcal{E}'
  \ \text{and the triangle contains two distinct edges } e_p \ne e_q \\
& \qquad \text{with } e_p \in \mathcal{M} \cup \mathcal{C} \ \text{and} \ e_q \in \mathcal{E} \\
& \text{(iv)}\;\; \hat{d}_F(x_i, x_j) \ge 0, \quad \forall i, j \\
& \text{(v)}\;\; \xi_m \ge 0, \quad \forall m \\
& \text{(vi)}\;\; \xi_c \ge 0, \quad \forall c
\end{aligned} \tag{2.2}

where the variables ξ_m and ξ_c are the slack variables corresponding to the must-link and can't-link constraints respectively, and λ_1 and λ_2 are user-defined constants. The QP in equation 2.2 is a convex problem and can be solved using any standard QP solver.
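For concreteness, the graph-based QP of equation 2.2 can be prototyped with an off-the-shelf convex solver. The sketch below uses cvxpy and is an illustrative reconstruction under stated assumptions (all edge keys are (i, j) tuples with i < j, `edges` is the augmented edge set E′, `triplets` comes from the sampling step above, and the default bounds and lambda values are placeholders, not values from the thesis); it is not the implementation used in this dissertation, which solves the QP with an interior point method as noted next.

```python
import cvxpy as cp

def graph_based_distance_qp(D_init, edges, triplets, must, cannot,
                            U=0.0, L=10.0, lam1=1.0, lam2=1.0):
    """Sketch of the equation 2.2 objective: variables only for edges in E',
    slack variables for the pairwise constraints, sampled triangle inequalities."""
    d = {e: cp.Variable(nonneg=True) for e in edges}          # \hat{d}_F, one per edge
    xi_m = {e: cp.Variable(nonneg=True) for e in must if e in d}
    xi_c = {e: cp.Variable(nonneg=True) for e in cannot if e in d}

    supervised = set(must) | set(cannot)
    obj = sum((d[e] - D_init[e]) ** 2 for e in edges if e not in supervised)
    obj = obj + lam1 * sum(xi_m.values()) + lam2 * sum(xi_c.values())

    cons = [d[e] <= U + xi_m[e] for e in xi_m]                # (i)  must-links
    cons += [d[e] >= L - xi_c[e] for e in xi_c]               # (ii) can't-links
    for i, j, k in triplets:                                  # (iii) sampled triangles
        eij, eik, ejk = tuple(sorted((i, j))), tuple(sorted((i, k))), tuple(sorted((j, k)))
        if eij in d and eik in d and ejk in d:
            cons += [d[eij] + d[ejk] >= d[eik],
                     d[eij] + d[eik] >= d[ejk],
                     d[eik] + d[ejk] >= d[eij]]

    cp.Problem(cp.Minimize(obj), cons).solve()
    return {e: var.value for e, var in d.items()}
```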
inequality). We state and prove a theorem in this section which shows that in this case, even with the subset of triangle inequalities we enforce, key properties of the brute-force QP solution are preserved. Note that dI(xi,xj) are the initial distances, dF (xi,xj) are the distances obtained by solving the QP in equation 2.1 and d?F (xi,xj) are the distances obtained by solving the QP in equation 2.2. All the lemmas and the theorem are proved for four points xi, xj , xk, xl in Figure 2b and 2c for simplicity. However they can easily be extended to any number of points in general. The final distance is set to be 0 for a must-link and L for a can?t-link, where L is the maximum of all initial distances. Due to space constraints we provide most of the proofs of our lemmas and the theorem in the Appendix 6.1 except lemma 1. We include the proof for lemma 1 in this chapter to give the reader some general idea about our proofs. Here we note some initial triangle inequality constraints, which we will use to prove 30 the lemmas: dI(xi,xk) + dI(xk,xl) ? dI(xi,xl) (2.3a) dI(xj,xk) + dI(xk,xl) ? dI(xj,xl) (2.3b) dI(xi,xk) + dI(xi,xl) ? dI(xk,xl) (2.3c) dI(xj,xk) + dI(xj,xl) ? dI(xk,xl) (2.3d) Lemma 1. If there is a must-link constraint between points xi and xj , for any other point xk, dF (xi,xk) ? min(dI(xi,xk), dI(xj,xk)). Proof. We will refer to Figure 2b and the QP in 2.1 for this proof. We find the minimum cost solutions for the graph (Figure 2b) considering triangles ?xixjxk and ?xixjxl and show that this solution does not violate the triangle inequality for ?xixkxl and ?xjxkxl. When an optimal solution obtained using a subset of the constraints (?xixjxk and ?xixjxl) also satisfy other constraints (?xixkxl and ?xjxkxl) which were not en- forced, it is not required to change the optimal solution. Thus the solution which we obtain in this proof is optimal. There is a must-link constraint between xi and xj . To minimize the objective function in 2.1, and to avoid violation of the triangle inequality for triangle ?xixkxj , we will have dF (xi,xk) = dF (xj,xk) = dI(xi,xk)+dI(xj ,xk) 2 . Similarly for ?xixlxj , dF (xi,xl) = dF (xj,xl) = dI(xi,xl)+dI(xj ,xl) 2 . Now we show that the triangle inequality holds for ?xixkxl and ?xjxkxl with dF (xk,xl) = dI(xk,xl). Adding equations 6.1a and 6.1b, we get dF (xi,xk)+dI(xk,xl) ? 31 dF (xi,xl). Similarly adding equations 6.1c and 6.1d, we get dF (xi,xk) + dF (xi,xl) ? dI(xk,xl). By symmetry dF (xi,xl) + dI(xk,xl) ? dF (xi,xk). So the triangle in- equality holds for ?xixkxl. Similarly, we can prove that triangle inequality holds for ?xjxkxl. So dF (xk,xl) = dI(xk,xl) is not required to change. Thus dF (xi,xk) ? min(dI(xi,xk), dI(xj,xk)) because dF (xi,xk) = dI(xi,xk)+dI(xj ,xk) 2 . Lemma 2. If there is a must-link constraint between points xi and xj , only distances dF (xi,xk) and dF (xj,xk),?k, differ from the initial distances. All other distances remain the same. Proof. Follows from the proof for Lemma 1. Lemma 3. If there is a can?t-link constraint between points xi and xj , for any other point xk, dF (xi,xk) ? dI(xi,xk). Proof. This proof is provided in Appendix 6.1. Lemma 4. If there is a can?t-link constraint between points xi and xj , only distances dF (xi,xk) and dF (xj,xk),?k, may differ from the initial distances. All other distances remain the same. Proof. Follows from proof for Lemma 3. 
Now we state a theorem which shows that even with the triangle inequality constraints that we enforce in the QP in equation 2.2, distances change in a similar way as they would have, had we enforced all the triangle inequality constraints (as in equation 2.1). The proof requires the above lemmas and is provided in Appendix 6.1. We note that distances smaller than or equal to a fixed threshold th are called nearest neighbors or "small", and distances larger than th are called "large", for all of d_I, d_F and d'_F.

Theorem 5. If there are N points {x_1, x_2, ..., x_N} with one must-link/can't-link constraint present among them, then for any fixed threshold th and for any pair {x_i, x_k}:

  I.   if d_I(x_i,x_k) > th and d_F(x_i,x_k) > th, then d'_F(x_i,x_k) > th;
  II.  if d_I(x_i,x_k) > th and d_F(x_i,x_k) ≤ th, then d'_F(x_i,x_k) = d_F(x_i,x_k);
  III. if d_I(x_i,x_k) ≤ th and d_F(x_i,x_k) > th, then d'_F(x_i,x_k) = d_F(x_i,x_k);
  IV.  if d_I(x_i,x_k) ≤ th and d_F(x_i,x_k) ≤ th, then d'_F(x_i,x_k) = d_F(x_i,x_k).

In other words, when one must-link or can't-link constraint is added, the distances learned using only a subset of the triangle inequalities are exactly the same as the distances learned by the brute-force approach, except when a large (> th) distance remains large under the brute-force approach. Since large distances are unlikely to affect clustering, we only prove that if a large distance remains large under the brute-force approach it will also remain large under our fast approach; we are not required to compute the exact change in those distances. We note that with our method a single pairwise constraint can modify up to 4n distances, far more than the single pairwise distance that is modified when no constraint propagation happens. If a total of (M + C) pairwise constraints is available, 4(M + C)n pairwise distances may be modified.

2.3 Experimental Results

We run experiments on three different domains: leaves, faces and faces detected in video frames. We describe the datasets below:

- Leaf-1042 [14]: This dataset contains 1042 leaf images from 62 classes. Chi-square distances [14] between curvature histograms are used as the initial distances for our algorithm.

- Face-1000 & Face-10000 [49]: These datasets are subsets of the PubFig dataset. The complete PubFig dataset has around 58,797 celebrity images of 200 persons. Since images in PubFig are taken in completely uncontrolled settings with non-cooperative subjects, there is large variation in pose, lighting, expression, and imaging conditions, which makes it extremely hard to cluster face images from PubFig. Distances are calculated based on the output of a pre-trained classifier [9]. Face-1000 has 1000 images from 50 classes, each class having 20 images. Face-10000 is a much larger dataset with 10000 images from 50 classes, each class having 200 images.

- BF0502-subset-1 and BF0502-subset-2 [26]: These datasets contain faces extracted from videos. The original video dataset BF0502 [26] has 27,504 extracted faces of 11 main cast members and others from the TV series "Buffy the Vampire Slayer". BF0502-subset-1 has 687 faces of 6 persons and is exactly the same as the BF0502-subset used in [76]. Since BF0502-subset-1 has only 6 clusters, we created another subset called BF0502-subset-2, which contains 600 face images of 11 persons. Unlike the other datasets, previous work on this video data has shown strong results using vector space representations.
So one goal of these experiments is to provide a comparison to a state-of-the-art method on its home turf. Since our method does not require a vector space representation, we are free to use a non-vector-space, robust distance on this data, which we find improves results. For each image pair, the squared difference is computed in each dimension; only the smallest 80% of those squared differences are then summed and the square root is taken to obtain a robust distance. Note that while such robust distances are common in computer vision, they cannot be represented as a Euclidean distance in any vector space.

We describe below the algorithms that we compare in our experimental evaluation. We compare our approach to previous vector space methods by first using Multidimensional Scaling (MDS) [15] to embed the images in a vector space.

- K-means [56]: In this approach the distances corresponding to must-link pairs are set to zero and those corresponding to can't-link pairs are set to a very high value. This version of K-means works better than K-means without any distance modification. (Since we do not have a vector representation of the images, we compute the medoid instead of the mean in the update step.)

- K-means + DistLQP: Distances learned with our approach are used along with the K-means clustering algorithm.

- COP-Kmeans [74]: This is a traditional constrained clustering algorithm, which avoids constraint violations.

- COP-Kmeans + DistLQP: Distances learned with our approach are used along with the constrained clustering algorithm COP-Kmeans.

- ITML [24] + MDS [15]: MDS is used to project the distances back to a vector space and Information Theoretic Metric Learning (ITML) is used to learn distances.

- HMRF-com [76] + MDS [15]: MDS is used to project the distances back to a vector space and HMRF-com is used for clustering.

- COP-Kmeans + NPKL [35]: Distances are learned using a Non-parametric Kernel Learning (NPKL) approach and are used along with COP-Kmeans.

The major steps in our approaches K-means + DistLQP and COP-Kmeans + DistLQP are:

1. We have the initial distances and the randomly generated pairwise constraints.
2. We set the distances corresponding to must-links to 0 and those corresponding to can't-links to a high value.
3. We learn a new set of distances using our approach.
4. K-means or COP-Kmeans is run with the new distances.
5. We obtain a clustering and evaluate the clustering solution.

The baselines K-means and COP-Kmeans use all of the above steps except step 3, in which a new set of distances is learned with our approach.

We use Jaccard's coefficient [68], which varies from 0 to 1 (1 is best), as the clustering evaluation metric for Leaf-1042, Face-1000 and Face-10000. The Jaccard's coefficient of a clustering C with respect to a ground truth clustering G is defined as:

  JCC_G(C) = SS / (SS + SD + DS)    (2.4)

where SS is the number of pairs that are in the same cluster in both C and G, SD is the number of pairs that are in the same cluster in G but in different clusters in C, and DS is the number of pairs that are in different clusters in G but in the same cluster in C. For BF0502-subset-1 and BF0502-subset-2, we use the same evaluation metric as [76], which computes accuracy based on the confusion matrix derived from the predicted labels and the ground truth labels.

Each algorithm is run with 25 random initializations and with 5 different random sets of pairwise constraints for a fixed constraint set size, and the average over all runs is reported. Since the video data has one fixed set of constraints, for those datasets the average over 25 random initializations is reported.
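For clarity, here are minimal sketches of the robust trimmed distance used for the video data and of the Jaccard's coefficient in equation 2.4 (illustrative code with our own function names, not the evaluation scripts used in the thesis):

```python
import numpy as np

def robust_distance(x, y, keep_frac=0.80):
    """Trimmed L2 distance: keep only the smallest `keep_frac` fraction of the
    per-dimension squared differences, sum them, and take the square root."""
    sq = np.sort((np.asarray(x) - np.asarray(y)) ** 2)
    k = int(np.ceil(keep_frac * sq.size))
    return float(np.sqrt(sq[:k].sum()))

def jaccard_coefficient(pred_labels, true_labels):
    """Equation 2.4: SS / (SS + SD + DS), comparing a clustering to the ground truth."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    iu = np.triu_indices(len(pred), k=1)            # all unordered pairs
    same_pred = (pred[:, None] == pred[None, :])[iu]
    same_true = (true[:, None] == true[None, :])[iu]
    ss = np.sum(same_pred & same_true)              # same cluster in both
    sd = np.sum(same_true & ~same_pred)             # same in ground truth, split by the clustering
    ds = np.sum(same_pred & ~same_true)             # merged by the clustering, different in ground truth
    return ss / float(ss + sd + ds)
```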
For Leaf-1042 and Face-1000, the total number of constraints is varied from 1000 to 5000, with 20% of the constraints always being must-link constraints. For Face-10000, one fixed constraint set size of 25,000 (20% must-link) is used. For the video dataset BF0502-subset-1 we have a fixed set of 687 must-link constraints and 180 can't-link constraints, and for BF0502-subset-2 we have 600 must-link and 189 can't-link constraints; these constraints are produced following the same method as [76]. The nearest-neighbor graph size n is set to a fixed value of 20 for all datasets. The must-link threshold U is always set to a low value of 0.01, whereas the can't-link threshold varies from one dataset to another depending on the maximum pairwise image distance: we set L = max_{i,j} d_I(x_i,x_j) / θ, where θ is a parameter. θ is set to 8 for leaves, 3 for faces and 2 for the video subsets. λ_1 is set to 10 and λ_2 to 1 for all datasets.

Figure 3: Comparison of our proposed distance learning approach with other constrained clustering methods on (a) Leaf-1042 and (b) Face-1000. Each panel plots Jaccard's coefficient against the number of constraints (1000 to 5000) for COP-Kmeans [74], COP-Kmeans + DistLQP, K-means [56], K-means + DistLQP, ITML [24] + MDS [15], HMRF-com [76] + MDS [15] and COP-Kmeans + NPKL [35].

Table 2.1: Results for Face-10000 (HMRF-com [76] and NPKL [35] were not fast enough for this dataset).
  K-means [56]            0.0784
  K-means + DistLQP       0.0803
  COP-Kmeans [74]         0.0818
  COP-Kmeans + DistLQP    0.0831
  ITML [24] + MDS         0.0711

Table 2.2: Results for BF0502-subset-1 and BF0502-subset-2 (confusion matrix based evaluation [76]).
  Algorithm                 BF0502-subset-1   BF0502-subset-2
  K-means [56]              0.3640            0.2578
  K-means + DistLQP         0.3749            0.2635
  COP-Kmeans [74]           0.3865            0.2603
  COP-Kmeans + DistLQP      0.4035            0.2695
  ITML [24]                 0.3566            0.2177
  HMRF-com [76]             0.4576            0.2633
  COP-Kmeans + NPKL [35]    0.3223            0.2440

2.3.1 Discussion of Results

Our approach outperforms all other algorithms on all datasets except BF0502-subset-1, where HMRF-com uses the actual vector representation of the images and only six clusters are present. HMRF-com usually works well with a small number of clusters and when the actual vector representation of the images is available. When the number of clusters increases, however, this approach does not perform well, because there are usually not enough samples from each class to learn the parameters of the Gaussian distributions used in their approach. Also, when we perform MDS to convert the distances to a vector space representation, the representation is often inaccurate; when these vectors are used with HMRF-com and ITML, their performance degrades because of the poor vector space representations. We used 60-dimensional MDS for all the datasets except the video subsets, where we had access to the original vector space representations.
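The MDS embedding used to feed the vector-space baselines can be sketched as follows (assuming scikit-learn's MDS; the exact implementation and settings used in the thesis, beyond the 60 dimensions, are not specified here):

```python
# Embed a precomputed distance matrix into a 60-d vector space for the
# vector-based baselines (ITML, HMRF-com). Illustrative sketch only.
import numpy as np
from sklearn.manifold import MDS

def embed_for_baselines(dist, n_components=60, random_state=0):
    mds = MDS(n_components=n_components,
              dissimilarity="precomputed",   # we only have pairwise distances, not features
              random_state=random_state)
    return mds.fit_transform(np.asarray(dist))
```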
We analyze the results for each dataset below:

- Leaf-1042: We provide the results for Leaf-1042 in Figure 3a. Our approach outperforms all other algorithms by a significant margin. For example, with 5,000 pairwise constraints, our approach reduces the error (1 − Jaccard's coefficient) from 0.42 (with the second best approach) to 0.38, eliminating 9.5% of the error. The proposed method takes around 4 minutes (in MATLAB) with 5,000 pairwise constraints to learn a new set of distances for this dataset.

- Face-1000: Results for Face-1000 are shown in Figure 3b. On this dataset too, our approach outperforms all other approaches by a significant margin. For example, when 5,000 pairwise constraints are available, our approach reduces the error from 0.72 (with the second best approach) to 0.67, eliminating 7% of the error. Our approach takes around 2 minutes to learn a new set of distances with 5,000 pairwise constraints for this dataset.

- Face-10000: On this dataset as well, our approach outperforms the other approaches (see Table 2.1). We ran our approach on this dataset to see whether it can help to cluster larger datasets. Although the proposed approach improves clustering here, the result is still far from perfect. Judging from the performance of all the algorithms, clustering such a large dataset is an extremely hard problem, and it is probably difficult to obtain a reasonable clustering with state-of-the-art image features. It takes around 40 minutes for our approach to learn the new distances with the 25,000 constraints we use. HMRF-com [76] and NPKL [35] were not fast enough for us to run them on Face-10000.

- BF0502-subset-1 and BF0502-subset-2: Results for these datasets are provided in Table 2.2. For these two datasets we used the original vector space representation (after dimensionality reduction with Principal Component Analysis, as suggested by [76]) for HMRF-com [76] and ITML [24]; we did not have to use Multidimensional Scaling to obtain a vector space representation. Although HMRF-com outperforms our approach by a significant margin on BF0502-subset-1, we note that we do not use the actual vector space representation in our approach. For BF0502-subset-2, however, we outperform all other approaches, which suggests that HMRF-com's performance may be more sensitive to the number of clusters. For the video subsets, it took around 1.3 minutes to learn a new set of distances for clustering. We note that while our approach is not really designed for this scenario, in which a vector space representation of the images is available, it still performs very well, outperforming the other approaches when the number of clusters is 11. In part, this is because our approach offers the freedom to use robust distances that deviate from the original vector space.

2.4 Conclusion

In this chapter we propose a novel method for learning distances between images when pairwise must-link and can't-link constraints are provided. Our method uses only the pairwise distances between images and does not require a vector space representation of the images. We apply these learned distances to clustering images, running experiments on leaf and face clustering as well as on faces extracted from videos. We demonstrate that our approach outperforms other approaches and produces state-of-the-art clustering results. Although the experimental results are shown for images, the proposed approach can be applied to any other domain.

Chapter 3: Active Image Clustering

In this chapter we present an algorithm that actively solicits human input to cluster images using pairwise constraints.
Algorithms give information about the similarity of images, but humans are much more accurate at judging whether two images belong to the same category or not. We incorporate that information (must-link and can't-link constraints) from humans and use it in a clustering algorithm. Since human effort is costly, we would like to minimize the total human effort required for perfect clustering. We propose a novel algorithm to choose, out of O(N^2) image pairs, the pair to show to a user in order to obtain a constraint. In a clustering assignment, each image pair is either in the same cluster or in different clusters. We simulate inverting those pairwise relationships and see how that affects the overall clustering produced with human input; we then choose the pair for which the expected change in the overall clustering is maximum.

3.1 Related Work

Recently, computer vision researchers have been interested in melding active learning [3, 65] with computer vision. We discuss a few recent works in this area. In [72], the authors propose a method for active selection of a batch of images for labeling using an SVM classifier, where the total cost of labeling at each iteration is constrained by a predefined budget; the proposed method is shown to be useful for object recognition, activity recognition and image retrieval. The authors in [70] propose a method for training a large-scale object detector in an active manner. The proposed algorithm iteratively poses annotation requests to humans and gathers relevant images for labeling from Flickr, where the true labels of the data are not known beforehand. In [71], the authors propose a method for active frame selection so that a video sequence can be labeled (at the pixel level) with as little human input as possible; they propose two methods, a basic flow-based approach and a more sophisticated approach that uses a flow model within a Markov Random Field [19]. The authors in [17] propose an interactive, human-in-the-loop method for bird classification: humans answer simple questions about visual attributes of birds so that a bird can be classified with limited human interaction. In [16], the authors propose a method for large-scale interactive learning of deformable part models [28]. The authors in [67] propose an active technique for simultaneously learning appearance and contextual models for scene understanding; this is the first multi-class active learning technique that takes contextual interactions between different parts of a scene image into account.

The authors in [32] propose an active learning method for labeling unlabeled data given a set of labeled instances. They use an optimistic, information-theoretic way of using the unlabeled data: find the instance whose best possible label assignment would maximize the mutual information about the labels of the remaining unlabeled instances. In [22], the authors propose an active learning method that uses an information-theoretic diversity measure to label samples from a large collection of unlabeled images. In [43], the authors propose a method for active learning that uses the expected value of information to label images. However, all of these approaches aim to obtain labels for images and do not try to partition the dataset as we do. Note that we only compute the expected change in clustering, which is different from pure information-theoretic approaches, because it is not clear what the best way is to quantify information gain for active learning in clustering.
However, our approach is inspired by information-theoretic approaches to active learning in general. In [36], the authors propose an active learning technique for object recognition that aims to minimize the entropy of the current unlabeled set of images given a labeled set. The authors in [42] proposed the first multi-class active learning technique that requires only binary (yes or no) user inputs. This method points out the difficulty of providing category labels for large multi-class problems and suggests using pairwise similarity questions for labeling, with access to an already available labeled set; in a clustering problem we usually do not have any labeled data available for selecting such pairwise questions. In [41], the authors proposed an active learning scheme for large multi-class problems whose goal is to select the data point that most strengthens the discriminative capabilities of the existing classification model.

Clustering has been an interesting topic of research in machine learning, computer vision and natural language processing for more than 50 years; we refer to [40] for a comprehensive review. Combining active learning with constrained clustering [8] has long been an area of interest for machine learning researchers. In [6], the authors propose an active clustering algorithm. They suggest two major phases in an active learning setting, namely "Explore" (cluster center initialization) and "Consolidate" (data points are added to cluster centers). In problems with a large number of clusters (which is very common in the image clustering domain), the "Explore" stage itself takes a large number of questions to initialize the distinct cluster centers. The authors in [57] propose an approach that uses a min-max criterion to find informative questions; they also rely on an initial "Explore" stage, as [6] does. There are also a couple of active clustering algorithms [75, 79] based on spectral eigenvectors, but they are suitable for two-cluster problems only. Very recently, the authors in [78] proposed an active spectral clustering algorithm with k-nearest-neighbor graphs that can be used for problems with more than two clusters; we compare with their approach and show that our proposed approach is much better. In [38], the authors propose an active framework for constrained document clustering. This work is philosophically similar to our approach, i.e., they also try to ask questions that maximize the gain. They begin with a skeleton structure of neighborhoods covering all the clusters and then search for an informative data pair to match an unlabeled data point to one of the centroids of the existing neighborhoods. However, they too use an "Explore" stage to build the initial skeleton structure, which we already know to be a potential problem when there is a large number of clusters. Another approach, by Huang et al. [37], is an active constrained clustering algorithm for documents based on language modeling; it is not clear how we could adapt this algorithm to image clustering. In [5], the authors propose a semi-supervised method for clustering using seeding, where labeled data is used to generate better initial cluster centers and to constrain the clustering solution; this approach does not have any active learning component.
Some methods [30] have also been developed to use multiple humans to cluster datasets, where each user is asked to locally cluster a part of the dataset and those results are finally aggregated into a complete partitioning. Some active constrained clustering approaches are also known for image clustering [13, 31]. In [13], the authors propose a heuristic that works well but has several parameters that must be tuned properly for different datasets. In [31], the authors take a fuzzy-clustering-based approach to find images near cluster boundaries to form useful data pairs.

As described in this section, there are many approaches in active learning that aim to maximize the information obtained from humans. Some of these methods were proposed for tasks that are different from clustering with pairwise constraints. There are, however, some approaches [38, 57] that were proposed specifically for clustering with constraints and were designed to maximize the information gained from humans. These active approaches measure information gain locally: they do not consider the effect of a constraint on the complete dataset. For example, some constraints might be useful locally, i.e., they can change the clustering of a few points, but may not have high impact when the overall clustering of the dataset is considered. The novelty of our approach is in judging the effectiveness of a constraint by measuring how much it changes the overall clustering (a global approach), i.e., we expect a useful constraint to change the assignment of many points in the dataset. A local approach can be useful when we already have a reasonable clustering to start with and want to correct a few existing mistakes. A global approach like ours, on the other hand, is more suitable for a problem like image clustering, where the initial clustering itself is very poor and large changes are needed to reach a reasonable clustering. Thus, our method can ask useful questions from the very beginning and can achieve a good clustering by asking fewer questions than other methods.

3.2 Our Approach

We now describe our approach to active clustering. We first motivate the approach with a simple example, and then describe the technical details.

3.2.1 High Level Intuition

We have developed a novel algorithm that determines which questions to ask a person in order to improve clustering. This algorithm is based on the intuitive idea of asking questions that will have the largest expected effect on the clustering. This really has two components. First, if we ask a person to compare two objects, her answer will only affect the clustering immediately if it differs from the current clustering; that is, if the person says either that two objects that are not currently in the same cluster should be, or that two objects that are in the same cluster should not be. Any answer that differs from the current clustering must result in moving at least one object to a different cluster, but some answers will affect many more objects. So the second component of our algorithm is to ask questions whose answers might have a big effect on the way objects are clustered.

To provide better intuition about our approach, we consider the simple toy example of clustering a set of 2D points. Figure 1 shows a collection of such points as black disks. The circles indicate four clusters that might be formed by an automatic clustering algorithm. We have marked five of these points with the letters "A" to "E" for ease of reference.
Now suppose that an expert can compare two of these points and tell us whether they truly belong in the same cluster.

Figure 1: Toy example to motivate our approach.

Considering the following possibilities, we find:

- Comparing B and C is not that desirable, since it is likely that we have already correctly placed them in different clusters. A human opinion about these two points is unlikely to change anything.

- Comparing A and B (or A and C) will tell us in which cluster A truly belongs. Since A is between two clusters, it is quite possible that this question will change the cluster to which we assign A. However, the answer will only affect A.

- Comparing D and E will provide the most information. These two clusters are close, and it is somewhat ambiguous whether they should be separated into two clusters or joined into one. So it is reasonably likely that a person might say D and E belong in the same cluster. If they do, this will lead us not only to treat D and E differently, but in fact to join the two clusters together, affecting the grouping of many points.

Consequently, we select questions for human attention that maximize the product of the probability that the answer will cause a change in our current clustering and the size of this change, should it occur. Finding the best such question can potentially require a large amount of computation: if we are clustering N objects, there are O(N^2) same-or-different questions to consider, and for each we must determine its possible effects. For this reason, we adopt a simple, greedy clustering algorithm. Without human assistance this algorithm does not perform well, but by using a simple clustering algorithm we can more easily and quickly select good questions to ask a human, and rapidly improve our clustering performance. In Section 3.2.3 we explain the brute-force version of our proposed algorithm, and in Section 3.2.4 we describe a faster version; this faster version is possible due to the use of a simple, greedy clustering algorithm.

In order to further speed up our algorithm, we have also experimented with two additional heuristics. First, when our estimate of the probable response of a human indicates that it is very likely that the response will agree with the current clustering, we do not bother to simulate the results of a different response. For all datasets we exclude simulation of pairs that are very unlikely to be in the same cluster. For larger datasets (of size more than 1000), we initially use K-means [56] to group very close points together; these points are always clustered together, we represent them by their centroids, and then run our algorithm. We refer to this heuristic as H1. Second, we observe that when we simulate an additional constraint between a data pair, the change in clustering assignments is often limited to the clusters that contain the points of that pair. Determining those changes is much faster than checking for all possible changes. We perform experiments with this approximation and find that it makes our algorithm's performance slightly worse but much faster. We refer to this heuristic as H2.

3.2.2 Components of Our Algorithm

We now explain our algorithm and its components formally. We define the best image pair to present to a person for labeling as:

  (d_i*, d_j*) = argmax_{d_i, d_j} EJ(d_i, d_j)    (3.1)

where EJ(d_i, d_j) is the expected change in the clustering if a question is asked about the image pair d_i and d_j.
Since we do not know the ground truth clustering, we cannot choose image pairs that are guaranteed to increase the quality of the clustering. One of the main insights of our work is that by finding pairs that most rapidly change the clustering we quickly converge to a good clustering. We now formally describe how we compute the components needed to determine EJ(d_i, d_j), given a matrix containing all pairwise distances for a dataset.

3.2.2.1 Consensus Clustering based Pairwise Probability Distribution

We first need to estimate the probability that each data pair belongs to the same cluster. We borrow ideas from consensus clustering [62] (also referred to as aggregation of clusterings) to find those probabilities. Consensus clustering typically computes multiple clusterings of the same dataset and then produces a single clustering assignment that reflects a consensus. This can be formulated as an optimization problem that is known to be NP-complete. Consensus clustering in unsupervised learning is similar to ensemble learning in a supervised framework. However, we avoid the optimization and simply use multiple clusterings to estimate the probability that a pair of points belongs to the same cluster.

Specifically, if we have N data points and S clusterings, let A_s be the N × N matrix whose element (i, j) is 1 if the i-th and j-th data points (referred to as d_i and d_j respectively from now on) are in the same cluster in the s-th clustering, and zero otherwise. We use the K-means algorithm to produce the clusterings, each time beginning with different random initial cluster centroids. If P is the probability matrix for the N data points, where element (i, j) is the pairwise probability of d_i and d_j being in the same cluster, then

  P = (1/S) Σ_{s=1}^{S} A_s    (3.2)

If (d_i, d_j) is any pair of points and R is the random variable corresponding to the pair (d_i, d_j) resulting from the above random process, then we estimate:

  P(d_i, d_j) = prob(d_i = d_j | R) = R    (3.3)

where d_i = d_j denotes that d_i and d_j are from the same class. We experimentally verify that this method produces reasonable values. In Figure 2 (using images and pairwise distances from [48]), we plot prob(d_i = d_j | R) = R together with a histogram, generated from the results of those experiments, showing the fraction of times that a pair with a given R is from the same cluster. Our heuristic provides a good estimate of prob(d_i = d_j | R), and it becomes more accurate with larger datasets.

Figure 2: Pairwise probability distribution of real data.
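A minimal sketch of the estimate in equation 3.2 is shown below. For simplicity it runs scikit-learn's K-means on vector features; the thesis instead uses a medoid-based K-means variant that works directly on pairwise distances, so treat the clustering routine as a stand-in.

```python
# Estimate the pairwise same-cluster probability matrix P of equation 3.2
# by averaging co-assignment matrices A_s over S randomly initialized clusterings.
import numpy as np
from sklearn.cluster import KMeans

def pairwise_probability(X, n_clusters, S=50, seed=0):
    n = X.shape[0]
    P = np.zeros((n, n))
    for s in range(S):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed + s).fit_predict(X)
        P += (labels[:, None] == labels[None, :]).astype(float)   # A_s
    return P / S        # P[i, j] estimates prob(d_i and d_j are in the same cluster)
```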
3.2.2.2 Minimum Spanning Forests (MSF)

We perform clustering by creating a Minimum Spanning Forest (MSF). We define an MSF as the collection of trees we obtain if we run Kruskal's algorithm [46] and stop when we have K spanning trees; we can thus think of clustering as finding a forest of K spanning trees starting from a set of disconnected nodes. We use a constrained clustering algorithm that is very similar to Kruskal's algorithm but also respects constraints; in the clustering community a similar approach is well known as bottom-up or hierarchical agglomerative clustering (HAC).

We assume we are provided with the distance matrix for a given dataset. We consider each image of the dataset as an individual node in a complete graph G = (V, E) and let the mutual distances be the edge weights. We may also have information that a pair of images is in the same cluster ("must-link" constraints) or in different clusters ("can't-link" constraints). Let us create a copy of the graph G without any edges and call it F = (V, ∅). First, we add the edges corresponding to must-link constraints to the forest F. Next, we sort the edges in E in ascending order of their weights and store them in a queue E_p. We then pop edges from E_p and add them to F. While doing this, we always maintain the tree structure, i.e., we do not add an edge between two nodes if the nodes are already connected. We also do not add an edge between two trees if there is a can't-link constraint between any pair of nodes in those two trees. We continue until we have K trees in the forest (referred to as the MSF from now on). We refer to this algorithm as HAC with constraints (HACC).

We selected the MSF as the base clustering algorithm for several reasons. First, the MSF involves no randomness. Since we want to measure the change in clustering for all possible pairs, we would not like the change in clustering to be affected by anything but the added constraint. Second, our high-level idea is to simulate the results of all questions so that we can choose the question with the greatest expected effect on the clustering. Doing this in a brute-force way, even with the simplest clustering algorithm, is not practical, so it is critical that we use a semi-supervised clustering algorithm for which O(N^2) incremental clustering operations can be performed efficiently; we have been able to develop fast ways of simulating the incremental effects of single constraints while using the MSF as the semi-supervised clustering algorithm. Third, the MSF is sensitive to constraints: the addition of some constraints can potentially change the MSF clustering result by a large amount, which helps us measure the effectiveness of a constraint better. However, we emphasize that our use of this constrained version of minimum spanning forests is not meant to suggest that this semi-supervised clustering algorithm is by itself competitive in performance with other methods. Rather, we expect MSF to work well together with the pairwise constraints that we obtain using our proposed active algorithm, and this is demonstrated in the experimental results (Section 3.3).

Since we are working with hierarchical clustering with constraints, we have to discuss feasibility and dead-end issues [23]. The feasibility problem asks: given a set of data and a set of constraints, does there exist a partitioning of the data into K clusters? In our problem the answer is obviously yes, because all of the constraints are true. However, determining whether there is a feasible solution that satisfies all constraints is in general NP-complete [23]. In HACC, dead-end situations (reaching an iteration with more than K clusters in which no further pair of clusters can be joined without violating a constraint) can occur in principle, but in practice we do not encounter this problem.
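A compact sketch of HACC is given below, assuming the pairwise distance matrix and the constraint lists as inputs. It is a simplified rendering: it tracks can't-links with per-cluster sets on top of a union-find structure (rather than the lower-triangular matrix described in the complexity discussion below), and it omits the bookkeeping that the fast simulation in Section 3.2.4 requires.

```python
# Sketch of HACC: Kruskal-style merging that respects must-link and can't-link constraints.
import numpy as np

def hacc(dist, K, must_links=(), cant_links=()):
    n = dist.shape[0]
    parent = list(range(n))

    def find(a):                               # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    cl = {i: set() for i in range(n)}          # can't-links recorded per cluster root
    for a, b in cant_links:
        cl[a].add(b); cl[b].add(a)

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb or rb in cl[ra]:
            return False                       # already connected, or blocked by a can't-link
        parent[rb] = ra
        cl[ra] |= cl[rb]                       # merge can't-link sets of the two trees
        for r in cl[rb]:
            cl[r].discard(rb); cl[r].add(ra)
        return True

    n_clusters = n
    for a, b in must_links:                    # must-link edges are added first
        n_clusters -= union(a, b)
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:                      # then edges in ascending order of weight
        if n_clusters <= K:
            break
        n_clusters -= union(i, j)
    return np.array([find(i) for i in range(n)])   # cluster identity = root of each tree
```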
3.2.3 Our Algorithm (Active-HACC)

In Section 3.2.2.1, we estimated a distribution that allows us to determine the probability that a person will provide a constraint that differs from the current clustering. In this section, we determine the magnitude of the change such a constraint will have on the current clustering. To do this, we define a measure of similarity between two clustering assignments, which we call the Relative Jaccard's Coefficient, by analogy with the Jaccard's coefficient [68] (see Chapter 2). If C1 and C2 are two clustering assignments of the same dataset, the Relative Jaccard's Coefficient of clustering C2 with respect to C1 is defined as:

  JCC_{C1}(C2) = SS / (SS + SD + DS)    (3.4)

where SS is the number of pairs that are in the same cluster in both C1 and C2, SD is the number of pairs that are in the same cluster in C1 but in different clusters in C2, and DS is the number of pairs that are in different clusters in C1 but in the same cluster in C2. This becomes the traditional Jaccard's coefficient if C1 is the ground truth clustering.

Now we describe our algorithm, assuming that we have asked q questions and need to determine the (q + 1)-th question to ask. We define:

  D: the dataset to be clustered.
  N: the number of images in the dataset.
  K: the number of clusters.
  d_i, d_j: any pair of images from D.
  C_q: the set of can't-link constraints obtained after asking q questions.
  M_q: the set of must-link constraints obtained after asking q questions. Note that |C_q| + |M_q| = q.
  H_q = HACC(D, C_q, M_q): HACC is the clustering function on a dataset D using the must-link constraints M_q and can't-link constraints C_q; H_q is the clustering that is produced.
  J^{d_i,d_j,y}_{q+1} = JCC_{H_q}(HACC(D, C_q, M_q ∪ {d_i = d_j})): the Relative Jaccard's Coefficient, with respect to H_q, of the clustering after q + 1 questions if the (q + 1)-th constraint is that d_i and d_j are in the same cluster.
  J^{d_i,d_j,n}_{q+1} = JCC_{H_q}(HACC(D, C_q ∪ {d_i ≠ d_j}, M_q)): the Relative Jaccard's Coefficient, with respect to H_q, of the clustering after q + 1 questions if the (q + 1)-th constraint is that d_i and d_j are not in the same cluster.

We then ask the user about the pair:

  (d_i*, d_j*) = argmax_{d_i, d_j} EJ(d_i, d_j)    (3.5)

where EJ(d_i, d_j) is the expected change in the Relative Jaccard's Coefficient, defined as:

  EJ(d_i, d_j) = | JCC_{H_q}(H_q) − ( P(d_i, d_j) J^{d_i,d_j,y}_{q+1} + (1 − P(d_i, d_j)) J^{d_i,d_j,n}_{q+1} ) |    (3.6)

Note that JCC_{H_q}(H_q) = 1 and P(d_i, d_j) is the probability of d_i and d_j being in the same cluster. If points d_i and d_j are in the same cluster after q questions, then HACC(D, C_q, M_q ∪ {d_i = d_j}) = H_q, and if they are in different clusters, then HACC(D, C_q ∪ {d_i ≠ d_j}, M_q) = H_q. So we have:

- If d_i and d_j are in the same cluster after q questions:
    EJ(d_i, d_j) = | (1 − P(d_i, d_j)) (1 − J^{d_i,d_j,n}_{q+1}) |    (3.7)

- If d_i and d_j are in different clusters after q questions:
    EJ(d_i, d_j) = | P(d_i, d_j) (1 − J^{d_i,d_j,y}_{q+1}) |    (3.8)

Using this approach, we find the data pair that will produce the greatest expected change in the clustering. After receiving an answer from the user, we update our constraints and continue. When we plot the clustering performance as the number of questions increases, we show the Jaccard's coefficient relative to the ground truth; this curve does not always increase monotonically, but it generally increases. Pseudo-code of our proposed active algorithm is shown in Algorithm 1.

Algorithm 1 Active-HACC
  Given: D, Max_questions, M_q = ∅, C_q = ∅, num_questions = 0
  while num_questions ≤ Max_questions do
    HACC(D, M_q, C_q)
    For all pairs (d_i, d_j), evaluate EJ(d_i, d_j)
    Find the pair (d_i, d_j) with maximum EJ(d_i, d_j)
    Ask the user and update M_q and C_q
    num_questions = num_questions + 1
  end while
  Output: HACC(D, M_q, C_q)
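A direct, brute-force rendering of equations 3.7–3.8 and of the inner step of Algorithm 1 is sketched below. It reuses the hacc sketch above, takes the pairwise probability matrix P of Section 3.2.2.1, and uses our own helper names; the actual implementation avoids rerunning HACC from scratch, as the following sections describe.

```python
# Brute-force selection of the most informative pair (illustrative only).
import numpy as np

def relative_jaccard(c1, c2):
    """Equation 3.4: pairwise agreement of clustering c2 relative to c1."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    iu = np.triu_indices(len(c1), k=1)
    s1 = (c1[:, None] == c1[None, :])[iu]
    s2 = (c2[:, None] == c2[None, :])[iu]
    ss = np.sum(s1 & s2)
    denom = ss + np.sum(s1 & ~s2) + np.sum(~s1 & s2)
    return ss / float(denom) if denom else 1.0

def select_question(dist, K, P, must, cant):
    """must and cant are lists of index pairs gathered so far."""
    base = hacc(dist, K, must, cant)                     # H_q
    n = dist.shape[0]
    best_pair, best_ej = None, -1.0
    for i in range(n):
        for j in range(i + 1, n):
            if base[i] == base[j]:
                # only a "no" answer changes the clustering (equation 3.7)
                sim = hacc(dist, K, must, cant + [(i, j)])
                ej = (1.0 - P[i, j]) * (1.0 - relative_jaccard(base, sim))
            else:
                # only a "yes" answer changes the clustering (equation 3.8)
                sim = hacc(dist, K, must + [(i, j)], cant)
                ej = P[i, j] * (1.0 - relative_jaccard(base, sim))
            if ej > best_ej:
                best_pair, best_ej = (i, j), ej
    return best_pair
```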
3.2.3.1 Complexity of the algorithm

We now discuss the rather high complexity of the brute-force version of Active-HACC, and then consider several optimizations and heuristics to improve its run time. For each question we have O(N^2) possible pairs (N is the number of images), and to simulate what will happen for each pair in a brute-force manner we would have to run the constrained clustering algorithm for every pair.

We now describe the complexity of Active-HACC. Kruskal's algorithm [46] for minimum spanning trees runs in O(N^2 log N) time (if |E| = O(N^2)). HACC also has O(N^2 log N) complexity by a very similar analysis, except for the issue of keeping track of the can't-links. To do this efficiently, we maintain an l × l lower triangular matrix A, where l is the current number of clusters: A(m, n) = 1 if m > n and there is a can't-link constraint between clusters m and n, and A(m, n) = 0 otherwise. Before merging two clusters m and n, we check that A(m, n) = 0 (m > n). In this case, we assign cluster n the identity m. We update A by setting row m equal to the OR of rows m and n, and removing row and column n. This update takes O(N) time and can occur in O(N) iterations, so enforcing the can't-link constraints adds O(N^2) time to Kruskal's algorithm, which still runs in O(N^2 log N) time. If we run this variation of Kruskal's algorithm for O(N^2) pairs, the complexity of choosing each question is O(N^4 log N). Even with moderate N (say N = 100), we cannot ask O(N) questions at this cost per question, so we propose a much less computationally complex version of Active-HACC in Section 3.2.4. In part, this complexity motivates our use of a very simple clustering algorithm such as HACC. Since we must simulate O(N^2) possibilities for each question, it is important that the clustering algorithm be relatively cheap. Moreover, as we will see, HACC lends itself to incremental clustering, in which simulating the effect of one constraint can be done efficiently. At the same time, HACC is quite interesting because the addition of one constraint can in some cases significantly alter the clustering.

3.2.4 FAST-Active-HACC

Our previous analysis assumes that we rerun HACC O(N^2) times to simulate the effects of every question we consider asking. This is wasteful, since most new constraints have only a small effect on the clustering. We save a good deal of effort by incrementally computing these changes starting from the existing clustering. However, these changes can be somewhat complex. When a can't-link constraint is added between points in the same cluster, we must remove an edge from the current MSF to disconnect those points. Removing one edge can have a domino effect, since there may be other edges that would have been added had that edge not been present. Similarly, adding a must-link constraint might require us to merge two clusters that have can't-link constraints between them, requiring us to remove edges to separate any points connected by a can't-link constraint. We must simulate all effects of these changes. Our complete algorithm is quite complex and is described below; here we provide a couple of key intuitions.

First, we save a good deal of time by creating, once, data structures that can be reused in all O(N^2) simulations. Whenever we add a can't-link constraint between two points in a cluster, we must remove the last edge added to the MSF that lies on the path between these points. To facilitate this, we keep track of the path in the MSF between all pairs of points in the same cluster. Also, as we perform the initial clustering, whenever an edge is not added to the MSF because of a can't-link constraint or because it would destroy the tree structure, we keep track of the edges that block it. That is, edge B blocks edge A if edge A would have been added if not for the presence of edge B. Then, whenever an edge is removed, we can reconsider adding any edge that was blocked by it.
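The "blocked edge" bookkeeping just described can be sketched with two hash maps. This is a simplified, illustrative structure: it frees an edge for reconsideration only when all of its recorded blockers are gone, whereas the actual rules, described in Section 3.2.4.2, differ for cannot-link edges and redundant edges.

```python
# Minimal sketch of blocked-edge bookkeeping for the fast simulation.
from collections import defaultdict

class BlockedEdgeIndex:
    def __init__(self):
        self.blocked_by = defaultdict(set)    # active edge -> skipped edges it blocks
        self.blockers = defaultdict(set)      # skipped edge -> active edges blocking it

    def record(self, skipped_edge, blocking_edges):
        """Called during the initial HACC run when an edge is skipped."""
        for b in blocking_edges:
            self.blocked_by[b].add(skipped_edge)
            self.blockers[skipped_edge].add(b)

    def on_removal(self, active_edge):
        """Return skipped edges that become candidates once `active_edge` is removed."""
        freed = []
        for e in self.blocked_by.pop(active_edge, set()):
            self.blockers[e].discard(active_edge)
            if not self.blockers[e]:          # no remaining blockers recorded
                freed.append(e)
        return freed
```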
Of course, as we "go back in time" and make changes to the MSF, we must carefully account for any later decisions that might also change.

In a second optimization, we notice that we do not need to simulate the effects of all can't-link constraints. If a cluster has N_c elements, there are N_c(N_c − 1)/2 possible can't-link constraints but only N_c possible edges that can be removed. This means that many can't-link constraints will cause us to remove the same edge from the MSF and will have the same effect on the clustering; these can be simulated together. Since this overall procedure is complex, the code has been made publicly available.

3.2.4.1 Definitions

We are given a set of points and pairwise distances. As mentioned earlier, we consider a complete graph G = (V, E) in which each pair of distinct nodes is connected by a unique edge and each edge has a weight given by the distance matrix. We compute a minimum spanning forest for G. To do this, we sort the edges E by distance, place them in a priority queue, and pop them off the queue in turn. As we progress, we can divide E into four disjoint groups.

Active Edges (AE): the set of edges that have been added to the MSF. The number of active edges may vary from (N − K − M) to (N − K), where M is the number of must-link constraints.

Cannot Link Edges (CLE): the set of edges that have been popped from the queue but have not been added to the MSF due to cannot-link restrictions. A Cannot Link Edge should not be confused with a cannot-link constraint between two points.

Redundant Edges (RE): an edge is included in RE if it was not added to the MSF because the pair of points it connects was already in the same cluster; since we want to maintain a forest, we avoid all cycles.

Other Edges (OE): all the edges in E that are still on the queue.

We also put a time stamp on each edge, which is the index of the edge in the sorted list; an edge with lower weight has a lower time stamp than an edge with higher weight. Figure 3 illustrates how we build the sorted list and apply time stamps. We use the abbreviations AE, CLE, RE and OE both for the sets and for elements of these sets.

Figure 3: Complete graph G = (V, E) with edge weights on the edges and a sorted edge list.

3.2.4.2 Data Structures

In this section we discuss some data structures required for the faster version of Active-HACC. The basic intuition is as follows. When we simulate adding a cannot-link edge within one tree of the MSF, we must remove some edge to break the tree in two, separating it into two components that each contain one of the nodes of the cannot-link edge. We must then simulate HACC's behavior from the point at which the removed link had initially been added.

Figure 4: Simple examples for explaining the faster version of our algorithm. (a) When we add an edge between two clusters we update Store-simple-path with paths between all nodes in the two trees. (b) Simulation of SCC. (c) An example showing how CLEs depend on active edges (see text). (d) An example showing how REs depend on active edges (see text). (e) Simulation of DCM.
Similarly, if we simulate adding a must link edge that connects two trees that contain cannot link edges, we must join these two trees, and then remove edges to break this new tree apart, so that again no tree contains a cannot link edge. In either event, we must undo some edge or edges that were added to the MSF, and determine which subsequent decisions might be altered by this change. To make this efficient, we maintain data structures that keep track of the dependencies between various decisions we have made. It is important to note that these data structures need to be constructed only once, for all O(N2) questions whose answers we simulate. For each question, we only need to simulate changes to the data structures that can occur in simulating the changes to the cluster that this question induces; these changes are typically not large. Store simple path between all pairs of points: Store-simple-path is a structure that stores the simple path between any pair of points in the same tree in the forest. A simple path in a graph is a path containing edges, where no vertices repeat. In a tree a simple path between any two points is unique. We update Store-simple-path whenever we add an edge while running HACC. For example in Figure 4a when we add an edge between trees 1 and 2 (i.e, add an edge between C and F), we can store the simple paths between any pair of points in T1?T2, where T1 = {A,B,C,D} and T2 = {E,F,G,H, J, L}. For example, the simple path between A and L is Store ? simple ? path(A,L) = Store ? simple?path(A,C)?Store?simple?path(F,L)? (C,F ). Store-simple-path should already have stored the simple paths between A and C, and F and L. Active edge and Cannot Link edge dependency: AEs and CLEs are mutually dependent. While we build a MSF, a CLE is not added because some set of active edges, 63 with lower time stamps, were already present in the forest. If one or more of those ac- tive edges should be removed there is a possibility that a CLE can be added back to the MSF, without violating any constraints. In FAST-Active-HACC, when we simulate a new constraint, we look at which AEs must be removed to incorporate a simulated constraint and as a result which CLEs can be added to the forest. But these additions and removals of edges should be consistent with the given set of constraints including the one we are simulating. We consider an example scenario in Figure 4c to understand the modeling of de- pendency between AEs and CLEs. Let us assume we are building a MSF with edges from a sorted edge list E. We have added some edges to the forest. The next edge to be considered from the list for addition to MSF is an edge between B and F. B and F are part of trees TREE-1 and TREE-2 respectively. Although there is no cannot link constraint between B and F there are other cannot links between AE, CG and DH. So we do not add edge BF to MSF and include BF in set CLE. Say BF has a time stamp t. Now, BF was not added because a subset of active edges {AB,BC,CD,EF, FG,GH} were added before time t. We create lists that will contain those active edges. Since there are three cannot links between trees TREE-1 and TREE-2, we have three lists for BF corresponding to each of the cannot links; we call them LISTBFAE , LIST BF CG , LIST BF DH . In LIST BF AE , we have {BC,CD,FG,GH}. We also have to keep track of the multiplicity of dependency, e.g, edges BC and FG are present in both LISTBFCG and LIST BF DH . So BC and FG have multiplicity of two. 
If we remove either BC or FG, we can delete two lists together: had either BC or FG not been added when we built the MSF, we would not have had the cannot-link restrictions due to the cannot-links between C and G or between D and H. Now assume we remove active edge BC, so we can get rid of those two lists. There is still no chance we can add BF and make it active unless either AB or EF is inactive, because the cannot-link constraint between A and E will prevent the addition of BF. If either of those edges becomes inactive, we can also delete the list LIST^{BF}_{AE} and add BF to the MSF. So for all CLEs we keep lists of active edges and their multiplicities. Whenever we delete an active edge, we check which of the lists can be deleted. If a cannot-link edge is free from all of its lists, it can be considered for addition to the MSF; however, if a CLE is itself a cannot-link constraint, it can never be added. The discussion above describes how the removal of an active edge makes way for other CLEs to be added.

Redundant edge and Cannot Link edge dependency: A redundant edge is not added to the MSF because the points it connects are already connected by a set of active edges. So we also note which active edges are responsible for not adding an RE. Consider the example in Figure 4d, where a set of active edges has already been added to the MSF. If AF is the next edge from the sorted list E to be considered, we cannot add it, as that would create a cycle; so AF is included in RE. Now, when we simulate constraints, if one of the active edges in {AB, BC, CD, DE, EF} is removed, can we add AF to the MSF? No, as long as all the other edges between A and F in TREE-1 remain active. Suppose active edge CD is removed to satisfy some cannot-link constraint, giving two subtrees TREE-11 and TREE-12. If the redundant edge AF were added in this scenario, the constraint for which CD was removed would be violated again. But if any other active edge in {AB, BC, DE, EF} is removed, AF comes under consideration for possible addition to the MSF. One may wonder what happens if some other edge from {BG, BH, DI, EJ} is removed. If that removal gets rid of the constraint for which CD was removed, we would add CD back and would not add AF, because AF has a higher time stamp than CD. So we need not worry about any edges other than the active edges on the simple path between A and F: if two or more active edges on that path are removed, we start considering AF. We therefore keep a separate list of active edges for every redundant edge and check whether two or more edges have been removed from the list.

We have now discussed how the removal of active edges brings CLEs and REs into consideration. But when we add an edge from CLE or RE to the MSF, how does this change the active edges? At a high level, whenever we are about to add a CLE or RE, we remove all active edges with a later time stamp than that of the CLE/RE from the two trees it connects; we then reconsider them for addition to the MSF. This is quite fast in practice: the addition or removal of edges is very local, so in reality we do not need to remove or consider adding a huge number of edges.

Priority Queue (PQ): This is the most important structure we use while simulating an SCC or a DCM. In this queue we store edges that can possibly be added to the MSF, sorted by their time stamps. The elements of this queue have some additional properties, which we describe below:

- Edges in the priority queue must not be present in the MSF.

- The MSF should always be consistent;
  that is, it can have more or fewer than K clusters, but all the must-link and cannot-link constraints must be satisfied.

- Once we start popping elements from the priority queue, if we have considered adding an edge at time t, we never look at an edge with time stamp T ≤ t. This implies that no edge with T ≤ t can be added to the priority queue from then on.

In our framework we keep a priority queue with a set of edges that can possibly be added to the MSF. We want to minimize the total number of elements (say |PQ|) added to PQ; the minimum set is the set of edges that we can definitely add to the MSF. Although it is easy to find which CLEs/REs are affected when we remove an active edge, tracking which active edges could be affected when we add a CLE/RE is complicated. It is hard to find a polynomial-time precomputation for this part that can be computed once and used for all possible question simulations; such a precomputation would involve taking care of all possible cycles of a complete graph, which could take exponential time. So we use a simple workaround that works well in practice: we try to keep |PQ| as small as possible, but we do not try to obtain the exact minimum of |PQ|. We make sure the algorithm is correct, at the cost of possibly doing some extra computation.

3.2.4.3 Primitive Operations

These are the basic operations we perform while simulating same-cluster cannot-links or different-cluster must-links:

- Insert with priority: Add an edge to the priority queue with a priority based on its time stamp.

- Pop highest priority element: Pop the edge with the lowest time stamp from the priority queue.

- Add an edge to the MSF: Consider an edge for addition to the MSF such that all constraints are satisfied and, if it is added, relabel the necessary nodes.

- Remove an edge from the MSF: Remove an edge and relabel nodes.
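The priority queue behind these primitive operations can be sketched with Python's heapq, keyed by time stamp (an illustrative stand-in for the released implementation):

```python
# Time-stamp priority queue used in the simulations (illustrative sketch).
import heapq

class EdgeQueue:
    def __init__(self):
        self._heap = []
        self._last_popped = -1                 # time stamps at or below this are frozen

    def insert_with_priority(self, edge, time_stamp):
        # once an edge at time t has been considered, edges with T <= t may not enter
        if time_stamp > self._last_popped:
            heapq.heappush(self._heap, (time_stamp, edge))

    def pop_highest_priority(self):
        # the edge with the lowest time stamp (i.e., smallest weight) comes out first
        time_stamp, edge = heapq.heappop(self._heap)
        self._last_popped = time_stamp
        return edge
```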
3.2.4.4 Same Cluster Cannot Links (SCC)

In this section, we consider an additional optimization that speeds up the simulation of adding a cannot-link edge between two nodes of the same cluster. In fact, we do not need to consider each possible pair within the same cluster separately: it turns out that many possible cannot-link edges, if added, have exactly the same effect. If we have n points within a cluster, we would in principle have to simulate the results of cannot-link edges between n(n − 1)/2 possible pairs, but in reality we only have to simulate n − 1 of them.

Let us take the example in Figure 4b to explain this (the edge weights are shown on the edges). Suppose we want to simulate the effect of a cannot-link constraint between D and E. If there were a cannot-link between D and E, all the active edges would still have been added except CF, because CF is the highest-weighted edge: when we tried to add CF, the cannot-link constraint between D and E would have blocked it. Now suppose we want to simulate a cannot-link constraint between A and L. Again we would not have added edge CF, because it has the maximum weight on the path connecting A and L. So in both cases we have to remove CF to simulate the cannot-link constraint. More generally, to simulate a cannot-link constraint between any pair with one point in T1 = {A, B, C, D} and the other in T2 = {E, F, G, H, J, L}, we have to remove CF, because it is the maximum-weighted edge on any simple path between a point of T1 and a point of T2; for all of those pairs we would ultimately not add edge CF.

So we can proceed as follows. We find the maximum-weighted edge (say E_M) in a tree and the subtrees (say T1 and T2) connected by this edge; for all possible pairs with one point in T1 and one in T2, we would remove edge E_M. We then proceed to the next maximum-weighted edge and do the same thing, but we do not update the simulation for pairs that have already been considered. We continue until we reach the lowest-weighted edge. In this way, by simulating the effects of removing the n − 1 possible edges, we determine the effect of every possible cannot-link edge that could be added to this cluster. When we simulate an SCC, i.e., the removal of an edge, we first remove that edge (say its time stamp is t_M) and then run a module called QUICK-SIMULATE to complete the simulation of the SCC. We provide a brief high-level idea and detailed pseudo-code of this module later; the same module is also used in the DCM simulation.
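The SCC grouping above can be computed with a small Kruskal-style pass over the tree's own edges: processing them in ascending weight order assigns to every pair the maximum-weight edge on its path, which is exactly the edge whose removal simulates a can't-link between that pair. This is an illustrative sketch, equivalent to the descending-order procedure described above; simulations can then be grouped by the assigned edge so that only n − 1 removals are actually run.

```python
# Map each within-tree pair to the edge whose removal simulates a can't-link between them.
def edge_removed_per_pair(tree_edges):
    """tree_edges: list of (weight, u, v) edges of one tree of the MSF."""
    nodes = {u for _, u, v in tree_edges} | {v for _, u, v in tree_edges}
    comp = {u: {u} for u in nodes}            # components built from lighter edges only
    assignment = {}
    for w, u, v in sorted(tree_edges):        # ascending weight, as in Kruskal's algorithm
        cu, cv = comp[u], comp[v]
        for a in cu:
            for b in cv:
                # (w, u, v) is the heaviest edge on the unique a-b path in the tree,
                # hence the edge removed when simulating a can't-link between a and b
                assignment[frozenset((a, b))] = (w, u, v)
        merged = cu | cv
        for x in merged:
            comp[x] = merged
    return assignment
```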
3.2.4.6 QUICK-SIMULATE

This module is used for both SCC and DCM. Once we have removed the required set of edges, we run this module. Whatever happened in HACC before time stamp t_M remains the same in the simulation. We know which CLE/REs come under consideration when an active edge is removed, and we insert them into the priority queue. We then pop the first element (say with time stamp t) from the priority queue and try to add it to the forest. Recall that any edge with time stamp T ≤ t will not be included in the priority queue in the future. If the popped edge is a CLE or RE with time stamp t, we remove all the active edges (with time stamp T ≥ t) in the two subtrees that are about to be connected by that CLE/RE and put them into the priority queue. We also check whether any other CLE/RE can now be added due to the removal of these edges. One of our observations is that if we do not remove the active edges with T ≥ t, we may sometimes end up with a different clustering than HACC would obtain. As mentioned earlier, this does not take much time in real experiments, because we remove a few edges very locally and the total number of edges removed in this process is very small. We continue until the priority queue is empty. Then, if we have more clusters than required, we pull the minimum-time-stamped element from OE and add it to the forest; if we have fewer than the required number of clusters, we remove the maximum-weighted edge from the forest.

We obtain the clustering, but we are not done yet: to complete the simulation we need to compute the change in the Relative Jaccard's Coefficient. Computing Jaccard's Coefficient from scratch is expensive, requiring O(N^2) work, but we can easily keep track of which points change assignments or move from one cluster to another. We use that information to update Jaccard's Coefficient, and we have empirically observed that on average this takes no more than O(N) time. Once we have found the effect of simulating all possible constraints, we select the pair with the maximum expected change in the Relative Jaccard's Coefficient.

Algorithm 2 QUICK-SIMULATE
  Given: Priority-Queue with edges to be removed
  while Priority-Queue ≠ ∅ do
    Pop the first element e from Priority-Queue
    e connects clusters C1 and C2 and has time stamp t
    if e is a CLE or RE then
      Remove all active edges with time stamp T ≥ t from C1 and C2
      Insert them into Priority-Queue
    end if
  end while
  If required, remove the last added edge or add more edges from OE to maintain K clusters

We present that pair to a user, get feedback, and continue until our human budget is exhausted or we obtain a reasonable clustering. Since FAST-Active-HACC is complex, code for it has been made publicly available at http://www.umiacs.umd.edu/~arijit/projects/Active_clustering/active_clustering.

3.3 Experimental Results

We have experimented in three different domains: leaf, face and scene images. For leaves, we create three subsets from a large corpus of leaf images used for Leafsnap [48]. All leaf images are iPhone images of leaves on a white background. The leaf subsets are called Leaf-100 (100 images from 10 species), Leaf-250 (250 images from 25 species) and Leaf-1042 (1042 images from 62 species). Leaf-100 and Leaf-250 have the same number of leaves from every species, but in Leaf-1042 the number of leaves per species varies from 4 to 31. Similarly for faces, we have extracted three subsets of images from a face dataset called PubFig [49]. The PubFig database is a large, real-world face dataset consisting of 58,797 images of 200 people collected from the Internet. Unlike most other existing face datasets, these images are taken in completely uncontrolled situations with non-cooperative subjects, so there is large variation in pose, lighting, expression, scene, camera, imaging conditions, etc. The face subsets are called Face-100 (100 images from 10 people), Face-250 (250 images from 25 people) and Face-500 (500 images from 50 people); all of them have the same number of images in each class. For scenes, we have created a dataset called Scene-300, which has 300 images from 30 classes with 10 images per class.
Scene-300 is a subset of the Indoor Scene Recognition dataset [63], which has images of different kinds of places such as airports, hospitals, libraries, kitchens, etc. We first run experiments with perfect human responses, and then run a second set of experiments to evaluate the effect of real human responses.

The distance matrices for face images and leaf images are calculated with the algorithms described in [9] and [48] respectively. For scenes, 512-dimensional GIST features are extracted from each image and Euclidean distances are computed between all pairs of images to create the distance matrix. Once we have the distance matrix, we can run our proposed algorithm on all of these datasets. As described in Section 3.2.1, we use two heuristics, H1 and H2, to further speed up our algorithm. In H1, if the pairwise probability measure indicates that the human response is going to be very similar to the current clustering, we do not bother to simulate a different response; in our experiments, if the probability that an image pair is from different classes is more than 0.9 and the two images are actually in different clusters, we do not simulate a must-link constraint for that pair. For larger datasets with more than 1000 images we initially use K-means [56] to group very close points together; for example, for Leaf-1042 we cluster the 1042 images into 700 clusters and then run our algorithm on the centroids of those 700 clusters. In H2, instead of computing the change in the overall clustering, we only compute local changes involving the image pair under consideration. When a new constraint is added, the effects are often very local and involve only the clusters containing the image pair, and we can determine these changes much faster than the overall change in clustering. The main objective of running experiments on the smaller datasets of size 100 and 250 is to show that the heuristics barely affect the performance of the algorithm; the algorithm without heuristics is too slow to run on larger datasets. Accordingly, we run our algorithm on Face/Leaf-100 without any heuristic, with only H1, and with both H1 and H2. For Face/Leaf-250 we run experiments with only H1 and with both H1 and H2. For Face-500, Leaf-1042 and Scene-300, we run experiments using both H1 and H2.

3.3.1 Empirical Observations

We have run our algorithm on all of the datasets described above. All the algorithms we compare against are described in Table 3.1. Figures 5 and 6 (where the Jaccard's Coefficient corresponding to one misclassification per cluster is marked with green squares) show performance evaluations of all algorithms on Leaf-100, Face-100, Leaf-250, Face-250, Leaf-1042, Face-500 and Scene-300. We run K-means clustering 100 times (S = 100) on each dataset with different random initial centroids to estimate the pairwise probability distributions, and we observe how the Jaccard's Coefficient changes with the number of questions.

Table 3.1: Summary of the algorithms that we compare.

  Active-HACC                      Proposed active learning version for image clustering
  FAST-Active-HACC                 Faster version of our proposed algorithm, without any heuristic
  FAST-Active-HACC-H1              Faster version of our proposed algorithm, with H1 only
  FAST-Active-HACC-H1_H2           Faster version of our proposed algorithm, with both H1 and H2
  Random Constraints               Seek pairwise constraints randomly and use HACC
  K-means without questions [56]   Simple K-means without any human intervention
  CAC1 [13]                        A heuristic to find the best pair
  Active-PCKMeans [6]              An active constrained clustering algorithm
  COP-Kmeans [74]                  A semi-supervised clustering algorithm (with pairwise constraints)
  [78]                             A spectral active clustering approach based on k-nearest-neighbor graph purification

Since we have the ground truth for these datasets, we were able to calculate the Jaccard's Coefficient with respect to the ground truth. In real-world situations we will not have the ground truth and will have to decide when we have reached a good clustering; one possible stopping criterion is when the clustering assignments stop changing even with additional constraints. One advantage of FAST-Active-HACC is that we do not need to set any parameters. One of our main interests is in problems that grow large because the number of clusters becomes large. We make the following observations from the experiments:

- In all of these experiments our algorithm significantly outperforms all other algorithms, greatly reducing the number of constraints required to reach a reasonable clustering.

- We run both Active-HACC and FAST-Active-HACC on the Leaf-100 and Face-100 datasets, and find that FAST-Active-HACC is 25-30 times faster than Active-HACC for a dataset of size 100. Overall we expect FAST-Active-HACC to be O(N) faster than Active-HACC. Since Active-HACC is slow, we could not run it on larger datasets. We also observe that FAST-Active-HACC-H1 produces exactly the same results as FAST-Active-HACC on Leaf-100/Face-100. For a dataset of size 1042, FAST-Active-HACC-H1_H2 takes around a minute per question; we believe this could be sped up further with more optimized code (our current implementation is in MATLAB) or parallel processing, as our algorithm is highly parallelizable.

- In Figure 5c, we compare the different algorithms on the Leaf-1042 dataset. For this dataset only, we use K-means initially as part of H1 to reduce the number of points to 700. Even so, we reach a Jaccard's Coefficient of 0.8152 (one misclassification per cluster on average) by asking just 3782 questions.

- We wanted to compare our algorithm with [31] and [38], but a direct comparison on our image datasets is not possible due to the complexity of their algorithms and the lack of publicly available code. However, we also run experiments on the iris dataset to compare with [31], on which they have reported results. This is a relatively easy dataset with 150 points from 3 clusters in 4 dimensions. They use the ratio of well-categorized points to the total number of points as an evaluation metric (call it RWCP). Our algorithm reaches an RWCP of 0.97 within three questions, whereas they take ten questions to reach the same RWCP.

- One of the major differences between these domains is the quality of the distance matrix: the leaf image distances are more accurate than the face image distances, which in turn are more accurate than the scene image distances. So even though we have three similarly sized datasets from three different domains, the amount of human interaction needed differs across the domains.
Figure 5 (panels: (a) Leaf-100, (b) Leaf-250, (c) Leaf-1042, (d) Face-100, (e) Face-250, (f) Face-500; each panel plots Jaccard's Coefficient against the number of questions, comparing the FAST-Active-HACC variants with Random Constraints, K-means without questions, CAC1 [13], Active-PCKMeans [6], COP-Kmeans [74] and [78]): Performance plots showing how the Jaccard's Coefficient increases with the number of questions. Our proposed algorithm Active-HACC significantly outperforms all other algorithms we compare.

Figure 6 (Jaccard's Coefficient against the number of questions): Scene-300 dataset.

3.3.2 Experiments with Real Humans

In the previously described experiments we assumed that humans are always correct when they provide pairwise constraints. We note that all previous active clustering algorithms [6, 37, 38, 57, 79] have explicitly made this assumption. The authors of [49] demonstrate that this assumption is reasonable for face images, as humans are 99.2% accurate in judging the similarity of two faces. However, we wanted to verify how robust our algorithm is when we relax this assumption and let real humans judge the similarity of images on Amazon Mechanical Turk. To allow extensive quantitative evaluations while still using feedback from real users, we collected exhaustive human feedback for all possible pairs of images in the Leaf-250 and Face-250 datasets using Mechanical Turk: for each image pair, a Mechanical Turk worker decided whether the two images belong to the same category or not. In general, the workers found this task fun and interesting; their comments included "nice", "good" and "again provide these types of work". However, some workers provided random or noisy responses throughout a HIT, which we had to filter out. Overall, we find that humans are wrong around 1.2% of the time in judging the similarity of face images and around 1.9% of the time for leaf images.

We plot the experimental results with real human input in Figure 7. With real human feedback we never achieve perfect clustering (JCC = 1) because of the errors introduced by humans. Also note that after some number of questions, every algorithm's performance starts to become worse. Our experiments were run until our proposed algorithm stopped improving; determining a good stopping point automatically remains an interesting research problem. We find that our algorithm is still better than the other active and passive algorithms. Although the heuristic proposed by [13] has a good starting point for Leaf-250, its performance quickly becomes worse as real human input is introduced, and for Face-250, [13] does not perform well even in the beginning. Our proposed algorithm achieves a maximum Jaccard's Coefficient (JCC) of 0.5 for Leaf-250 and 0.35 for Face-250. Note that JCC = 0.5 corresponds to 20 clusters with 2 mistakes and 5 clusters with only one mistake on average (cluster size is 10); similarly, JCC = 0.35 corresponds to 20 clusters with 3 mistakes and 5 clusters with 2 mistakes on average.

Figure 7 (panels: (a) Leaf-250, (b) Face-250; Jaccard's Coefficient against the number of questions): Performance plot showing how the Jaccard's Coefficient increases with the number of questions when we obtain pairwise constraints from real humans on Mechanical Turk.

3.4 Conclusions

We have presented an approach for image clustering that incorporates pairwise constraints from humans. An algorithm is developed for choosing data pairs to use when querying a person for additional constraints. Since a brute-force version of our algorithm is time consuming, we also formulate a more complex but faster version in this chapter. Our algorithm outperforms all state-of-the-art results in image clustering. Although this work focused on image clustering, the idea could be extended to any clustering domain.

Chapter 4: Active Subclustering

In this chapter we discuss clustering only a subset of a dataset. Although there are many excellent clustering algorithms, clustering images with many classes is a hard problem. We address these challenges in two ways. First, we propose a passive algorithm to cluster a subset of the data only (which we call subclustering); subclustering produces smaller but purer clusters. Second, we get pairwise constraints from humans to correct errors in a subclustering solution.
We run experiments on a face image dataset (with 51,418 images from 200 classes) and a leaf image dataset, and show that our proposed algorithms perform better than baseline methods.

4.1 Related Work

As far as we know, there is no previous work on the proposed subclustering approach. Perhaps the most similar work is Bregman Bubble Clustering, proposed by Gupta et al. [33]. Their approach aims to provide a robust and scalable clustering framework that finds K dense clusters in the dataset while removing outliers. The authors of [80] propose a robust clustering algorithm that maximizes a total similarity objective related to an approximate density shape estimate. However, these approaches show experimental results only for a very small number of clusters. We implemented the BBC-S algorithm [33], modified for our subclustering framework, for comparison. The authors of [69] proposed a face annotation framework that involves partial clustering and interactive labeling; their partial clustering approach models images that are not grouped tightly enough with a uniform background noise distribution. However, it is often hard to model such images with only one uniform background distribution, especially when there are many classes. Note that our experiments use 51,418 images from 200 clusters, whereas [69] uses only 1,147 images from 34 clusters. The authors of [50] proposed a method for visual category discovery that focuses on the easiest examples first and then progressively expands its repertoire to include more complex objects; however, this approach is initialized with a familiar set of category classifiers. There are also a couple of recent works that enforce diversity in mixture models [61, 83], so that mixture components do not represent redundant information. The authors of [34, 81] proposed methods for discriminant analysis in which the data from each class is represented by a mixture of Gaussians. These methods are useful when the data from a single category belong to multiple clusters instead of a single cluster. Although they have been shown to improve classification results, finding the actual number of mixture components in each class remains a difficult problem. Like these methods, our approach assumes that data from different categories do not form well-separated clusters in which any single cluster contains all the data points from a single category. However, our approach is developed for clustering problems where no class labels are available, and in subclustering, instead of representing the data from each category with multiple clusters, we find one small but dense group of points from each category. Since we discussed past work on active learning for computer vision and clustering in detail in the previous chapter (Chapter 3), we do not repeat that discussion here.

4.2 Passive Subclustering

Standard clustering algorithms usually perform a complete partition of the dataset. However, state-of-the-art image features do not always map images to a space in which images from the same class remain close to each other and far from images of other classes. Consequently, standard clustering approaches [44, 56, 66], which optimize a cost function that takes all images into account, often end up with many impure clusters. In subclustering, the cost function takes into account only a few examples from each class.
We formulate subclustering as the problem of maximizing the probability that all images in a cluster come from the same class and that every pair of cluster centroids comes from different classes. Intuitively, this means that we want the cluster centroids to be as far from each other as possible, while the images assigned to each centroid should be as close to that centroid as possible in feature space.

4.2.1 Notation

We now introduce some notation. We are given a set of N unlabeled images U to be clustered into K clusters. N_k denotes the number of images in the k-th cluster ($\sum_{k=1}^{K} N_k = N$). T denotes the number of iterations an algorithm is run. We want to produce subclusters of size n (provided by the user), and we assume n ≤ N_k for every k. C (with elements c ∈ C) is the set of K cluster centroids and C ⊂ U, i.e., centroids are members of the dataset. R (with |R| = R) denotes a restricted set of images, randomly chosen from U, that is larger than the number of clusters but smaller than the total number of images (R ⊂ U and K < R < N). P(I_i = I_j) denotes the probability that images I_i and I_j are from the same class (I_i, I_j ∈ U). In all our experiments we assume access to the pairwise distances between every pair of images; any kind of image features and distance measure can be used to obtain them. We convert each pairwise distance to a pairwise probability using a simple exponential function P(I_i = I_j) = exp(−d(I_i, I_j)/σ) [66], where d(I_i, I_j) is the distance between images I_i and I_j and σ is a positive scaling parameter.

4.2.2 Our Algorithm

In subclustering we want to choose a set of centroids C* such that the following product of probabilities is maximized:

$$\mathcal{P} \;=\; \underbrace{\prod_{k=1}^{K}\prod_{j=2}^{n} P(c_k = I_k^j)}_{P_1}\;\cdot\;\underbrace{\prod_{\substack{k_1,k_2\\ k_1\neq k_2}}\bigl(1 - P(c_{k_1} = c_{k_2})\bigr)}_{P_2} \qquad (4.1)$$

where I_k^j denotes the j-th image assigned to cluster k; the first image assigned to the k-th cluster is c_k itself. The first part (P_1) ensures that the n−1 images {I_k^j} assigned to centroid c_k are close to c_k. The second part (P_2) makes the probability of dissimilarity between all pairs of centroids in C* as high as possible.

Next we convert the maximization into a minimization by taking the negative log of (4.1), i.e., Cost = −log(P):

$$\mathrm{Cost} \;=\; -\sum_{k=1}^{K}\sum_{j=2}^{n} \log P(c_k = I_k^j) \;-\; \sum_{\substack{k_1,k_2\\ k_1\neq k_2}} \log\bigl(1 - P(c_{k_1} = c_{k_2})\bigr) \qquad (4.2)$$

The optimal centroid set is C* = argmin_C Cost. The cost function in equation 4.1 is formulated so that images in the same subcluster are tightly grouped and far from images in other subclusters. Ideally, we would extract features such that images from the same category stay close to each other in feature space and images from different categories are far apart. State-of-the-art computer vision features are not accurate enough for all images to behave this way, but we expect that the features of some of the images will follow this ideal structure, and our subclustering cost is formulated so that optimizing it outputs exactly those images. There are other possible ways to formulate a subclustering cost, but they may not work as well as our approach. For example, one such cost is the traditional complete clustering cost, in which we assign each point to the nearest cluster center and then minimize the average distance between the images and the centroid in each cluster.
From the complete clustering solution, we could take the n images nearest to each cluster center to get a subclustering output. Although such an approach can be used to create subclusters from complete clusters, cost functions built specifically for subclustering create purer subclusters. Intuitively, points that are difficult to cluster can affect clustering algorithms in a way that degrades the clustering of the easier points; it therefore makes more sense to directly solve the problem we actually want to solve (subclustering), rather than solving a harder problem (complete clustering) as an intermediate step. This is demonstrated by the experimental results in Section 4.6.

For each cluster centroid we have to find the n−1 images {I_k^j} to assign to centroid c_k. This is achieved by simply assigning the n−1 images closest to c_k in cluster k. Since this can lead to overlapping subclusters, we add an extra step to our algorithm to minimize the overlap between clusters, discussed later in this section. We denote the set consisting of a centroid c and the n−1 images assigned to it by S_c (we use c_k and c interchangeably for elements of C; c_k additionally indicates that it is the k-th centroid).

We optimize Cost with an iterative algorithm. We initialize with random cluster centers and assign the n−1 closest images to each of them. Then, at each iteration, we swap one of the cluster centers with a new center chosen from a pool of possible replacements in the restricted set R, choosing the replacement that most reduces Cost. In detail:

(a) First, we precompute the change in the first part of Cost if any r ∈ R is added to or removed from a centroid set. We can precompute this because our simple assignment procedure determines which images will be assigned to each centroid; taking advantage of this fact significantly cuts down the computational cost.

(b) Next, we randomly choose an initial centroid set C ⊂ R. We maintain an additional vector of size R that stores the change in the second part of Cost if a centroid r ∈ R is added to the current centroid set C. This vector is updated at the start of each iteration.

(c) We calculate the current subclustering cost using (4.2) and denote it Cost_t at the t-th iteration.

(d) We consider replacing each centroid c ∈ C with each centroid from the restricted set, r ∈ R.

  (i) A replacement of c with r is valid only if S_r has no more overlap with the existing subclusters than S_c does. Two subclusters overlap if they share one or more images. We define an indicator function 1(S_{c_i}, S_{c_j}) that is 1 if subclusters S_{c_i} and S_{c_j} overlap and 0 otherwise. A valid centroid replacement must satisfy $\sum_{j=1}^{K-1} \mathbb{1}(S_c, S_{c_j}) \geq \sum_{j=1}^{K-1} \mathbb{1}(S_r, S_{c_j})$, where c_j ∈ (C − c).

  (ii) We calculate the changed cost Cost_{t+1}(r, c) for all of these valid replacements and check whether any of them is less than the current cost Cost_t. If so, we make the replacement with the maximum decrease in cost, i.e., replace the corresponding centroid c* with r* in C.

(e) Go to step (c) until the algorithm converges, i.e., no replacement decreases Cost_t.

The centroid set at the final iteration, together with the n−1 closest images to each centroid, is the output of our subclustering algorithm; a simplified sketch of the cost and the swap step is given below.
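To make the cost in (4.2) and the swap step concrete, here is a simplified sketch in Python. It is an illustration under assumptions, not the released MATLAB implementation: it recomputes the cost from scratch instead of using the precomputed increments of steps (a)-(b), and it omits the overlap-validity check of step d(i).

```python
import numpy as np

def pairwise_prob(D, sigma):
    """P(I_i = I_j) = exp(-d(I_i, I_j) / sigma), as in Section 4.2.1."""
    return np.exp(-D / sigma)

def subcluster_cost(P, centroids, n, eps=1e-12):
    """Negative log of equation (4.1): assignment term plus separation term."""
    cost = 0.0
    for c in centroids:
        order = np.argsort(-P[c])                        # most similar first
        members = [i for i in order if i != c][: n - 1]  # the n-1 closest images
        cost -= np.sum(np.log(P[c, members] + eps))
    for i, c1 in enumerate(centroids):
        for c2 in centroids[i + 1:]:
            # each unordered centroid pair appears twice in the sum of (4.2)
            cost -= 2.0 * np.log(1.0 - P[c1, c2] + eps)
    return cost

def greedy_swap(P, init_centroids, restricted, n, max_iter=200):
    """Replace one centroid per iteration with the restricted-set candidate
    that lowers the cost the most; stop at a local optimum."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        current = subcluster_cost(P, centroids, n)
        best_delta, best_pos, best_r = 0.0, None, None
        for pos, c in enumerate(centroids):
            for r in restricted:
                if r in centroids:
                    continue
                candidate = centroids[:pos] + [r] + centroids[pos + 1:]
                delta = current - subcluster_cost(P, candidate, n)
                if delta > best_delta:
                    best_delta, best_pos, best_r = delta, pos, r
        if best_pos is None:            # converged: no replacement helps
            break
        centroids[best_pos] = best_r
    return centroids
```

In the actual algorithm, the change in cost for each candidate replacement is obtained in constant time from the precomputed per-centroid terms, which is what yields the O(RN + KRT) runtime analyzed in Appendix 6.2.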
Optimizing the subclustering cost function in equation 4.2 is an NP-hard problem (it is similar to subset selection problems, which are known to be NP-hard), so obtaining a global solution in polynomial time is not possible. We use an iterative algorithm that finds a locally optimal solution in polynomial time. The iterative optimization is motivated by the optimization used in the Partitioning Around Medoids (PAM) clustering algorithm [44]. Also note that our cost function is lower bounded by zero and that at each iteration the cost either decreases or remains the same, so our algorithm is guaranteed to converge. We have made the code for this algorithm available at http://www.umiacs.umd.edu/~arijit/projects/Active_subclustering/active_subclustering. We demonstrate the proposed subclustering algorithm's performance on a simple 2-d example in Figure 1.

Figure 1 (panels: (a) Our Subclustering Algorithm, (b) K-medoids for Subclustering): Comparison of our passive subclustering algorithm with K-medoids (the medoid and the 2 points closest to it are displayed in each cluster) on a small-scale 2-d clustering problem with 30 points from 3 clusters and subcluster size 3. The subclusters produced by our proposed algorithm are more meaningful.

Complexity: The total computational cost of our proposed algorithm is O(RN + KRT) (see Appendix 6.2 for a proof), where K is the number of clusters, T is the number of iterations the algorithm is run, R is the restricted set size and N is the total number of images. In our experiments we find that for any given K our algorithm converges within a maximum of 100 iterations (T = 200). KT therefore typically remains comparable to N, so our runtime is proportional to KRT. This is O(N) faster than PAM [44] (O(KRNT)) with the same-sized restricted set; note, however, that PAM in general does not use a restricted set.

4.3 Active Subclustering

In active subclustering we improve our passive subclustering solution by asking humans whether a pair of images is from the same class. A subclustering solution generally contains two kinds of errors:

- Between-class errors: One of our objectives in subclustering is to choose centroids covering as many different image classes as possible, so we want to eliminate centroids that are from the same class, using constraints provided by humans. We refer to these errors as between-class errors and to a centroid pair of the form (c_i, c_j) as a between-class pair.

- Within-class errors: In subclustering we also want the images in a cluster to be from the same class as its centroid. So we look for images whose classes differ from the classes of the centroids they are assigned to and correct them. These are called within-class errors, and a centroid-image pair of the form (c_k, I_k^j) is called a within-class pair.

One of the major challenges in active subclustering is that we want to correct both between-class and within-class errors simultaneously. In a subclustering solution we have K centroids with n images assigned to each. At any iteration we can ask about any of the O(K^2) between-class centroid pairs or the O(Kn) within-class centroid-image pairs. We choose the image pair that is most useful for subclustering (formalized later as a utility function). One factor in determining the usefulness of an image pair is the likelihood that it reveals an error.
So, to correct between-class errors, we ask about centroid pairs that are likely to be from the same class. The usefulness of a within-class pair, however, depends not only on the similarity between image I_k^j and the centroid c_k, but also on the likelihood that c_k will remain in the centroid set C after the between-class errors have been corrected. If c_k is very unlikely to remain in C after between-class error correction, the corresponding within-class pairs (c_k, I_k^j) are considered less useful for subclustering (even if they are likely to be erroneous), and we should not spend much of the human budget on correcting them. At each iteration we therefore have to estimate the probability that a centroid c_k will remain in C. This is hard to compute exactly, as it depends on the other centroids, their pairwise similarities, and the changes made to C during between-class error correction. To estimate these probabilities we assume that, for each c_k, its probability of remaining in C is determined by the other centroid c_k' ∈ (C − c_k) that is most likely to be similar to c_k: high similarity between c_k and c_k' implies that c_k is unlikely to remain in C after between-class error correction. There are other possible ways to estimate this probability, but this method is very efficient and works well in practice.

Let us look at the toy example in Figure 2 to understand intuitively what this estimate means. In this subclustering example, I_i^1 is clearly less likely to belong to the same class as c_i than I_j^1 is to belong to the same class as c_j. However, since c_i is very close to c_k, there is a good chance that c_i will be removed from C because of its similarity to c_k. On the other hand, c_j is unlikely to be from the same class as c_k and should remain in C. So in our approach we prefer to ask users about pairs like (c_j, I_j^1) before (c_i, I_i^1).

Figure 2: A subclustering example with 3 subclusters and 4 images in each subcluster. We ask humans about within-class pairs like (c_j, I_j^1) before (c_i, I_i^1) (see text).

Based on the estimates described above, we show either a between-class or a within-class pair to humans and update our subclustering solution with their response, continuing as long as the human budget permits. As we get human input, the pairwise probabilities P(I_i = I_j) also change: same-class image pairs are assigned a probability of 1 and different-class image pairs a probability of 0. We also update the subclustering solution using our passive algorithm and these new probabilities (details are provided later). We want algorithms that depend only on pairwise image distances and not on image vectors, so we do not learn a new distance metric as we obtain additional constraints. We also believe that the amount of pairwise supervision obtained from users in active subclustering would not be enough to alter the distances significantly; for example, for a dataset of 51,418 images, we eventually get supervision for only 0.0108% of all possible pairs to reach almost perfect subclustering.

We now define a utility function U(I_i, I_j) over all within-class and between-class image pairs, where I_i and I_j are any pair of images. At each iteration we choose the image pair with the maximum utility. For a between-class pair the utility is defined as U(c_i, c_j) = P(c_i = c_j), where c_i and c_j are centroids in C. For a within-class pair (c_k, I_k^j):

$$U(c_k, I_k^j) = \bigl(1 - P(c_k = I_k^j)\bigr)\,\bigl(1 - P(c_k = c_k')\bigr) \qquad (4.3)$$

where $c_k' = \arg\max_{c \in (C - c_k)} P(c_k = c)$. A priority queue PQ of size O(K^2 + Kn) is created that contains the utility values of the O(K^2) between-class and O(Kn) within-class pairs, with highest priority given to the maximum-utility pair.
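A compact sketch of how these utilities could be computed and ranked, using the same exponential pairwise probabilities as in Section 4.2.1 (the data layout and names here are illustrative assumptions):

```python
import heapq

def build_utility_queue(P, centroids, assignments):
    """Build a priority queue of candidate questions for active subclustering.

    P           : pairwise same-class probability matrix
    centroids   : list of current centroid indices (the set C)
    assignments : dict mapping a centroid to the images assigned to it
    Returns a max-priority queue implemented as a min-heap of negated utilities.
    """
    heap = []
    # Between-class pairs: U(c_i, c_j) = P(c_i = c_j)
    for i, ci in enumerate(centroids):
        for cj in centroids[i + 1:]:
            heapq.heappush(heap, (-P[ci, cj], ('between', ci, cj)))
    # Within-class pairs: equation (4.3), with c_k' the centroid most similar to c_k
    for ck in centroids:
        p_same_as_nearest = max(P[ck, c] for c in centroids if c != ck)
        for img in assignments[ck]:
            u = (1.0 - P[ck, img]) * (1.0 - p_same_as_nearest)
            heapq.heappush(heap, (-u, ('within', ck, img)))
    return heap

# The next question to show a user is heapq.heappop(queue)[1].
```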
The pseudocode of our active algorithm is given in Algorithm 3, and the details are provided in Appendix 6.3. The steps of our active approach are designed so that we can correct both between-class and within-class errors simultaneously, with minimal human interaction. The image pair selection for active input is based on a utility function over image pairs; this utility indicates the likelihood that an error will be revealed by asking about that pair, and also whether detecting that error is still required, since the clustering solution changes as we obtain human feedback.

Algorithm 3 Active Subclustering
  Given: PQ (priority queue), Max_questions
  while num_questions ≤ Max_questions do
    Pop from PQ and ask about the corresponding image pair
    if the selected pair is a between-class pair (c_i, c_j) then
      if the user says "no" then
        Update the corresponding probability
      else
        Update the corresponding probability; the current solution is no longer locally optimal, so return to the optimization procedure of the passive algorithm until another optimal centroid set is found
      end if
    else {the selected pair is a within-class pair (c_k, I_k^j)}
      if the user says "yes" then
        Update the corresponding probability
      else
        Update the corresponding probability; replace I_k^j with the image that is next closest to c_k and has not already been assigned to any other centroid
      end if
    end if
    Update the utility functions and the priority queue PQ
  end while
  Output: An improved subclustering solution

4.4 Subclustering Jaccard's Coeff. (SJC)

Many methods exist for evaluating the quality of a clustering relative to ground truth, but we feel they are not ideal for evaluating subclustering. We therefore propose a novel metric called the Subclustering Jaccard's Coefficient (SJC), motivated by Jaccard's coefficient [68] for complete clustering (detailed in Chapter 2). SJC mainly captures two properties of a subclustering solution:

- Coverage: we want the K subclusters to represent K different ground-truth classes.
- Purity: when a subcluster represents a ground-truth class, it should be as pure as possible.

We first compute a simple mapping between the subclusters and the ground-truth classes: each class is matched with the subcluster that contains the most elements from that class, which selects at most one subcluster per class. We then evaluate the quality of these subclusters using a metric similar to Jaccard's coefficient. Specifically, a subclustering S is evaluated with respect to the ground truth G as

$$SJC_G(S) \;=\; \frac{\sum_{k=1}^{K} ind_k \cdot \frac{SS_k}{\,^{n}C_2}}{K} \qquad (4.4)$$

For each class k, SS_k is the number of pairs of elements from class k in the subcluster associated with k, and ^nC_2 is the total number of pairs of elements in that subcluster. For example, if the subcluster matched to class k has six out of ten elements from class k, its contribution to the sum is $\frac{^{6}C_2}{^{10}C_2} = \frac{15}{45}$ (Figure 3). Additionally, we require that a subcluster can represent a class only if it contains a sufficient number of elements from that class. In our experiments, we require a subcluster to have more than half of its elements from a single class; subclusters without this minimal level of purity will not be very useful for browsing, so a subcluster of size ten containing fewer than six elements of class k is not considered useful. ind_k is an indicator variable that is zero if no subcluster contains a sufficient number of elements from class k. This requirement also ensures that a subcluster can represent only one class.
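A small sketch of how SJC in (4.4) could be computed from a subclustering and ground-truth labels, using the majority-purity rule just described (helper names are illustrative):

```python
from math import comb

def sjc(subclusters, labels, K):
    """Subclustering Jaccard's Coefficient, equation (4.4).

    subclusters : list of lists of image indices, each of size n
    labels      : dict mapping an image index to its ground-truth class
    K           : number of classes / subclusters
    """
    n = len(subclusters[0])
    score = 0.0
    for k in set(labels.values()):
        # Match class k with the subcluster containing the most elements of k.
        best_count = max(sum(1 for img in sc if labels[img] == k)
                         for sc in subclusters)
        # ind_k is 1 only if that subcluster is majority-pure for class k
        # (e.g. at least 6 of 10); this also means a subcluster can
        # represent at most one class.
        if best_count > n // 2:
            score += comb(best_count, 2) / comb(n, 2)   # SS_k / nC2
    return score / K
```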
Figure 3: Some subclusters (n = 10) and the fraction they contribute to a Subclustering Jaccard's Coefficient calculation (K clusters). Red boxes denote errors in a subcluster.

4.5 Fast-PAM

Because there are no prior algorithms for subclustering, we adapt other clustering algorithms to the problem so that we can use them as baselines. K-medoids algorithms are easy to adapt to subclustering: we select each cluster's medoid and the n−1 points closest to it as a subcluster. We find that PAM [44] is much more effective on large image sets than a baseline K-medoids algorithm (a baseline K-medoids is similar to K-means [56], but the medoid is computed instead of the mean in the update step); we provide some experimental evidence for this in Appendix 6.4. However, PAM is slow and does not scale to medium or large datasets. A contribution of this work is a faster algorithm that optimizes the same cost as PAM; it can cluster datasets of over 10,000 images in around 7 minutes (MATLAB implementation) and could be sped up further with an optimized C implementation. Implementation details of Fast-PAM are given in Appendix 6.4.

4.6 Experimental Results

We run experiments in two different domains: face images from Pubfig [49] and leaf images from Leafsnap [47]. Unlike most other face datasets, the images in Pubfig are taken in completely uncontrolled settings with non-cooperative subjects, so there is large variation in pose, lighting, expression, scene, camera, imaging conditions and parameters, etc. For the passive algorithm, we create several subsets of Pubfig [49] containing 40, 80, 120, 160 and 200 clusters, called Face-40, Face-80, Face-120, Face-160 and Face-200 respectively. Face-200 is the complete Pubfig dataset, containing 51,418 images from 200 classes (the total number of images we use is less than the official count of 58,797 because of inactive image links). The number of images per class in Face-200 varies from 56 to 1434, and we use a subcluster size of n = 10. We also create a leaf image dataset containing 1535 images from 27 classes, called Leaf-27, to test our algorithms in a different image domain; each class has at least 50 images and we use a subcluster size of 5. This is a subset of the leaf image data used in Leafsnap [47]. We also create a small face dataset containing 250 images from 10 classes (called Face-10) and a leaf dataset containing 250 images from 25 classes (called Leaf-25), with subcluster sizes of 5 and 3 respectively.

In our experiments we required pairwise distances between all images. The distance matrices for face images and leaf images are calculated with the algorithms in [9] and [47] respectively. The scaling parameter σ is chosen using a small holdout training dataset; we use σ = 5 for the face images and σ = 0.2 for the leaf images.

Figure 4 (panels: (a) SJC comparison, (b) Time comparison; x-axis: number of clusters): Comparison of our proposed passive subclustering algorithm with other baselines on Face-40, 80, 120, 160 and 200 (|R| = 20K).

4.6.1 Subclustering

Our passive algorithm is compared with four other algorithms: Fast-PAM (Section 4.5), the K-medoids baseline, Normalized Cut [66] and BBC-S [33]; the first three are traditional complete clustering algorithms. From the complete clustering results, the centroid of each cluster together with the n−1 images nearest to it forms a subcluster. This seems a natural way of creating subclusters from clusters and provides useful baselines for comparison. We use SJC as the evaluation metric. Each algorithm is run 10 times on each dataset and the mean SJC is reported. From Figure 4 and Table 4.1, we can see that our proposed algorithm significantly outperforms the other baselines: for example, on Face-200 it achieves 26% higher SJC than Fast-PAM and is 30 times faster than Fast-PAM (even with the fast implementation).

Table 4.1: Subclustering comparison for Leaf-27.

  Algorithm                                  SJC      time (s)
  BBC-S algorithm [33]                       0.5023   0.4114
  K-medoids baseline                         0.5778   0.4244
  Normalized cut [66]                        0.6037   0.7709
  Fast-PAM (Section 4.5)                     0.6103   77.32
  Proposed Passive Subclustering Algorithm   0.6159   9.7857

4.6.2 Restricted Set Size and Subcluster Size Variation

One of the important innovations that allows our algorithm to run efficiently is selecting the set of centroids from a restricted set only. If the restricted set is smaller, our algorithm is more efficient but the results may be less accurate; with a larger restricted set we are likely to get a much better solution, but at the cost of efficiency. We empirically find that restricted sets of size 5K-25K (K is the number of clusters) give a reasonable trade-off between efficiency and accuracy. In Figure 5a we show how the clustering performance (after the passive algorithm) varies for all the face datasets as we change the restricted set size. The runtime increases with the restricted set size; e.g., for Face-40 it grows from 19 seconds to 95 seconds as the restricted set size increases from 5K to 40K.

We expect that a user will provide the subcluster size based on her requirements. A subcluster size of 10 seems reasonable for most of the likely applications of subclustering, such as browsing, categorization and summarization: it is not so large that the subclusters become erroneous, and not so small that there is no intra-class variation within each subcluster. Nevertheless, we performed experiments varying the subcluster size to see how sensitive our approach is to it. In Figures 5b-5f, we compare the proposed approach with other approaches as the subcluster size varies from 5 to 25.
As we increase the subcluster size, the performance of all algorithms becomes a little worse, but our algorithm still outperforms the other approaches by a significant margin.

Figure 5 (panels: (a) restricted set size varied from 5K to 40K; (b)-(f) subcluster sizes 5, 10, 15, 20 and 25; x-axis: number of clusters, y-axis: Subclustering Jaccard's Coefficient): Varying restricted set size and subcluster size. (a) The restricted set size |R| is varied from 5K to 40K. (b)-(f) The subcluster size is varied from 5 to 25 and the performance of all algorithms is reported, with the restricted set size fixed at 20K. Since Fast-PAM is slow, we do not report it for this part of the experimental evaluation.

4.6.3 Active Subclustering

- We compare our active algorithm to three previous active clustering algorithms [6, 14, 78], adapted to subclustering by selecting n images from each cluster to use as subclusters.

- Face-200: Our approach requires around 143,000 questions to reach an SJC of 0.92 (equivalent to 120 perfect subclusters and 80 subclusters with only one mistake) for Face-200. Asking 143,000 questions on Mechanical Turk costs only around $358. Basu et al.'s algorithm [6] achieves an SJC of just 0.44 after 143,000 questions; its "Explore" stage alone takes more than 200,000 questions, and complete clustering of the dataset takes as many as 2.3 million questions (Mturk cost: $5800). Xiong et al.'s approach [78] also falls short, producing an SJC of only 0.2 after 143,000 questions. In Figure 6a we plot SJC against the number of questions for Face-200. We could not run [14] on Face-200, because their approach is not fast enough for such a large dataset.

- Leaf-25 and Face-10: Although our active algorithm is aimed at problems with a large number of clusters, we also run it on datasets with a small number of clusters (Leaf-25 and Face-10, in Figures 6b and 6c respectively) for comparison, and we still outperform all approaches. For example, we reach perfect subclustering by asking just 654 and 161 questions for Leaf-25 and Face-10 respectively, where the next best approaches take 876 and 639 questions.

- While we feel that these comparisons with previous active clustering algorithms provide a useful baseline, we stress that these prior approaches were not designed for subclustering, and poor performance on these tasks should not be taken as a criticism of the overall algorithms.

Figure 6 (panels: (a) Face-200, (b) Leaf-25, (c) Face-10; SJC against the number of questions): The proposed active algorithm's performance on several datasets.

4.7 Conclusion

We propose a novel clustering problem called subclustering and show its application to large image datasets. We presented a passive and an active algorithm for subclustering. We compare our passive algorithm with several baselines and show that it is better and scales to large datasets, and we compare our active subclustering algorithm with active complete clustering algorithms and demonstrate that it is much better. We also present a metric for subclustering evaluation. Although experimental results are shown only for images, the idea can be extended to any clustering domain.

Chapter 5: Conclusion and Future Works

In this thesis we have proposed three approaches to improve image clustering with pairwise constraints that indicate whether a pair of images belongs to the same category or not. We have shown that these approaches can improve clustering significantly. However, there is still plenty of scope to improve image clustering. Although extracting better features may improve clustering, perfect unsupervised clustering still seems unlikely in the near future, which strongly motivates further research on semi-supervised clustering. There are many interesting open problems in this area; we point out only a few that are directly related to the research carried out in this thesis:

- In Chapter 2 we proposed a method for learning better image distances using pairwise constraints. This is a general approach that can learn from any kind of pairwise distance between images and does not require any vector representation of the images. Our approach was developed for clustering, where all distances and constraints are available when the new distances are learned. The out-of-sample problem, in which a query image retrieves its nearest neighbors using the learned distances, is also useful. Extending our distance learning approach to the out-of-sample scenario is possible future work that would let us compute distances for new images that were not part of the training data.

- Our proposed approach for distance learning is general but non-parametric, and can work with any kind of pairwise distance. It is not clear how a parametric distance learning method could be developed that works with all kinds of distances; perhaps parametric methods can be developed that are tailored to each type of distance individually. These approaches, if developed in the future, might produce better distances between images than a general framework such as ours that applies to all kinds of distances.
- In Chapter 3 we proposed an active learning technique for image clustering in which useful pairwise constraints are obtained from humans interactively. In most of the experiments we assumed that humans are always correct in judging the similarity of two images. We also performed some experiments with real humans on Amazon Mechanical Turk and found that it is not possible to reach perfect clustering if any of the human inputs is erroneous. Although our approach performs better than the state-of-the-art with erroneous human inputs, it is far from perfect, so it would be useful to build active models that incorporate erroneous pairwise constraints in a more robust way.

- In Chapters 3 and 4 we proposed active learning methods for clustering and subclustering. These approaches assume that only one pairwise constraint can be obtained at each iteration. However, on real platforms such as Amazon Mechanical Turk we can avail ourselves of multiple annotators providing constraints simultaneously. It is therefore important to develop active methods in which multiple image pairs can be chosen at each iteration. This requires fast methods for evaluating all possible sets of pairs and then choosing the best set for obtaining constraints. This is another interesting direction for future work, which would let us collect a large number of constraints at every iteration and use platforms like Mechanical Turk more effectively.

- As further future work, we envision a large-scale interactive system involving humans and images. The proposed system would contain a very large collection of images, and multiple humans would interact with those images in different ways: uploading, labeling, providing pairwise constraints, retrieving, playing simple image-based games, or browsing the collection, depending on their expertise or interest. The system would need multiple algorithmic components, such as clustering with constraints, retrieval, active selection of tasks, assignment of tasks based on human expertise, learning from human interactions, etc. Building such a system would require expertise in computer vision, machine learning, algorithms, database management, programming and many other related areas, and the system should evolve over time as more images are uploaded and more human interactions happen. In this thesis we have looked at only a small part of this system, namely clustering with constraints, but there are many fascinating problems that must be solved to develop the system as proposed, and they should be pursued in the future.

Chapter 6: Appendix

6.1 Detailed Proofs for Distance Learning

Here we note some initial triangle inequality constraints, which we will use to prove the Lemmas:

$$d_I(x_i, x_k) + d_I(x_k, x_l) \geq d_I(x_i, x_l) \qquad (6.1a)$$
$$d_I(x_j, x_k) + d_I(x_k, x_l) \geq d_I(x_j, x_l) \qquad (6.1b)$$
$$d_I(x_i, x_k) + d_I(x_i, x_l) \geq d_I(x_k, x_l) \qquad (6.1c)$$
$$d_I(x_j, x_k) + d_I(x_j, x_l) \geq d_I(x_k, x_l) \qquad (6.1d)$$

Proof for Lemma 1. We refer to Figure 1a and the QP in 2.1 for this proof. We find the minimum-cost solution for the graph in Figure 1a considering triangles ∆x_ix_jx_k and ∆x_ix_jx_l, and show that this solution does not violate the triangle inequality for ∆x_ix_kx_l and ∆x_jx_kx_l.
When an optimal solution obtained using a subset of the constraints (∆x_ix_jx_k and ∆x_ix_jx_l) also satisfies the other constraints (∆x_ix_kx_l and ∆x_jx_kx_l) that were not enforced, the optimal solution does not need to change; hence the solution obtained in this proof is optimal.

Figure 1 (panels: (a) Must-link constraint, (b) Can't-link constraint, (c) Four points in 2-d): Simple examples in 2-d Euclidean space (green denotes a must-link edge and red a can't-link edge) to explain our proposed approach.

There is a must-link constraint between x_i and x_j. To minimize the objective function in 2.1 and avoid violating the triangle inequality for ∆x_ix_kx_j, we will have $d_F(x_i,x_k) = d_F(x_j,x_k) = \frac{d_I(x_i,x_k) + d_I(x_j,x_k)}{2}$. Similarly, for ∆x_ix_lx_j, $d_F(x_i,x_l) = d_F(x_j,x_l) = \frac{d_I(x_i,x_l) + d_I(x_j,x_l)}{2}$. Now we show that the triangle inequality holds for ∆x_ix_kx_l and ∆x_jx_kx_l with d_F(x_k,x_l) = d_I(x_k,x_l). Adding equations 6.1a and 6.1b gives d_F(x_i,x_k) + d_I(x_k,x_l) ≥ d_F(x_i,x_l). Similarly, adding equations 6.1c and 6.1d gives d_F(x_i,x_k) + d_F(x_i,x_l) ≥ d_I(x_k,x_l), and by symmetry d_F(x_i,x_l) + d_I(x_k,x_l) ≥ d_F(x_i,x_k). So the triangle inequality holds for ∆x_ix_kx_l, and similarly for ∆x_jx_kx_l. Therefore d_F(x_k,x_l) = d_I(x_k,x_l) is not required to change. Thus $d_F(x_i,x_k) \geq \min(d_I(x_i,x_k), d_I(x_j,x_k))$, because $d_F(x_i,x_k) = \frac{d_I(x_i,x_k) + d_I(x_j,x_k)}{2}$.

Proof for Lemma 2. Follows from the proof of Lemma 1.

Proof for Lemma 3. We refer to Figure 1b and the QP in 2.1 for this proof. Recall that the can't-link threshold distance is L. Since there is a can't-link constraint between x_i and x_j, the corresponding distance is set to L. We find the minimum-cost solution for the graph in Figure 1b considering triangles ∆x_ix_jx_k and ∆x_ix_jx_l, and show that this solution does not violate the triangle inequality for ∆x_ix_kx_l and ∆x_jx_kx_l. There are three possible cases.

Case 1: The triangle inequality is violated for both ∆x_ix_jx_l and ∆x_ix_jx_k. To avoid any violation and to minimize cost, the new distances will be
$$d_F(x_j,x_k) = \tfrac{L}{2} + \tfrac{d_I(x_j,x_k) - d_I(x_i,x_k)}{2}, \quad d_F(x_i,x_k) = \tfrac{L}{2} + \tfrac{d_I(x_i,x_k) - d_I(x_j,x_k)}{2},$$
$$d_F(x_j,x_l) = \tfrac{L}{2} + \tfrac{d_I(x_j,x_l) - d_I(x_i,x_l)}{2}, \quad d_F(x_i,x_l) = \tfrac{L}{2} + \tfrac{d_I(x_i,x_l) - d_I(x_j,x_l)}{2}.$$
Now we have to prove that, with d_F(x_k,x_l) = d_I(x_k,x_l) and this new set of distances, the triangle inequalities hold for both ∆x_ix_kx_l and ∆x_jx_kx_l. We know that L ≥ d_I(x_i,x_k) + d_I(x_j,x_k) and L ≥ d_I(x_i,x_l) + d_I(x_j,x_l) (this is why the triangle inequality was violated for ∆x_ix_jx_k and ∆x_ix_jx_l). Dividing these two inequalities by 2, adding them together, and then adding the triangle inequality d_I(x_i,x_k) + d_I(x_i,x_l) ≥ d_I(x_l,x_k), we get d_F(x_i,x_k) + d_F(x_i,x_l) ≥ d_I(x_k,x_l). We also have d_I(x_i,x_k) + d_I(x_k,x_l) ≥ d_I(x_i,x_l) and d_I(x_j,x_l) + d_I(x_k,x_l) ≥ d_I(x_j,x_k); adding these two inequalities shows that d_F(x_i,x_k) + d_I(x_k,x_l) ≥ d_F(x_i,x_l). By symmetry, d_F(x_i,x_l) + d_I(x_k,x_l) ≥ d_F(x_i,x_k). Thus the triangle inequality is proved for ∆x_ix_kx_l, and similarly for ∆x_jx_kx_l. We also see that $d_F(x_i,x_k) - d_I(x_i,x_k) = \tfrac{L}{2} - \tfrac{d_I(x_i,x_k) + d_I(x_j,x_k)}{2} \geq 0$, because L ≥ d_I(x_i,x_k) + d_I(x_j,x_k); this implies d_F(x_i,x_k) ≥ d_I(x_i,x_k). Similarly, d_F(x_j,x_k) ≥ d_I(x_j,x_k), d_F(x_i,x_l) ≥ d_I(x_i,x_l) and d_F(x_j,x_l) ≥ d_I(x_j,x_l).

Case 2: The triangle inequality is violated for exactly one of ∆x_ix_jx_k and ∆x_ix_jx_l. Without loss of generality, assume it is violated for ∆x_ix_jx_k.
Proof for Lemma 3. We will refer to Figure 1b and the QP in 2.1 for this proof. Recall that the can't-link threshold distance is L. Since there is a can't-link constraint between $x_i$ and $x_j$, the corresponding distance is set to L. We find the minimum-cost solution for the graph in Figure 1b considering triangles $\triangle x_i x_j x_k$ and $\triangle x_i x_j x_l$, and show that this solution does not violate the triangle inequality for $\triangle x_i x_k x_l$ and $\triangle x_j x_k x_l$. There are three possible cases.

Case 1: The triangle inequality is violated for both $\triangle x_i x_j x_k$ and $\triangle x_i x_j x_l$. To avoid any violation and to minimize the cost, the new distances will be $d_F(x_j, x_k) = \frac{L}{2} + \frac{d_I(x_j, x_k) - d_I(x_i, x_k)}{2}$, $d_F(x_i, x_k) = \frac{L}{2} + \frac{d_I(x_i, x_k) - d_I(x_j, x_k)}{2}$, $d_F(x_j, x_l) = \frac{L}{2} + \frac{d_I(x_j, x_l) - d_I(x_i, x_l)}{2}$, and $d_F(x_i, x_l) = \frac{L}{2} + \frac{d_I(x_i, x_l) - d_I(x_j, x_l)}{2}$. We now have to prove that, with $d_F(x_k, x_l) = d_I(x_k, x_l)$ and the new set of distances, the triangle inequalities hold for both $\triangle x_i x_k x_l$ and $\triangle x_j x_k x_l$. We know that $L \geq d_I(x_i, x_k) + d_I(x_j, x_k)$ and $L \geq d_I(x_i, x_l) + d_I(x_j, x_l)$ (that is why the triangle inequality was violated for $\triangle x_i x_j x_k$ and $\triangle x_i x_j x_l$). We divide these two inequalities by 2 and add them together; adding the resulting inequality to $d_I(x_i, x_k) + d_I(x_i, x_l) \geq d_I(x_l, x_k)$, we get $d_F(x_i, x_k) + d_F(x_i, x_l) \geq d_I(x_k, x_l)$. We also have $d_I(x_i, x_k) + d_I(x_k, x_l) \geq d_I(x_i, x_l)$ and $d_I(x_j, x_l) + d_I(x_k, x_l) \geq d_I(x_j, x_k)$; adding these two inequalities we can show that $d_F(x_i, x_k) + d_I(x_k, x_l) \geq d_F(x_i, x_l)$. By symmetry we can prove that $d_F(x_i, x_l) + d_I(x_k, x_l) \geq d_F(x_i, x_k)$. Thus the triangle inequality is proved for $\triangle x_i x_k x_l$, and similarly for $\triangle x_j x_k x_l$. We also see that $d_F(x_i, x_k) - d_I(x_i, x_k) = \frac{L}{2} - \frac{d_I(x_i, x_k) + d_I(x_j, x_k)}{2} \geq 0$, because $L \geq d_I(x_i, x_k) + d_I(x_j, x_k)$; that implies $d_F(x_i, x_k) \geq d_I(x_i, x_k)$. Similarly it can be proved that $d_F(x_j, x_k) \geq d_I(x_j, x_k)$, $d_F(x_i, x_l) \geq d_I(x_i, x_l)$ and $d_F(x_j, x_l) \geq d_I(x_j, x_l)$.

Case 2: The triangle inequality is violated for either $\triangle x_i x_j x_k$ or $\triangle x_i x_j x_l$. Without loss of generality, we assume the triangle inequality is violated for $\triangle x_i x_j x_k$. In this case the new distances will be $d_F(x_j, x_k) = \frac{L}{2} + \frac{d_I(x_j, x_k) - d_I(x_i, x_k)}{2}$ and $d_F(x_i, x_k) = \frac{L}{2} + \frac{d_I(x_i, x_k) - d_I(x_j, x_k)}{2}$, while $d_F(x_j, x_l) = d_I(x_j, x_l)$ and $d_F(x_i, x_l) = d_I(x_i, x_l)$, because $d_I(x_i, x_l) + d_I(x_j, x_l) \geq L$. We now have to prove that, with $d_F(x_k, x_l) = d_I(x_k, x_l)$ and the new set of distances, the triangle inequalities hold for both $\triangle x_i x_k x_l$ and $\triangle x_j x_k x_l$. Since $d_F(x_i, x_k) \geq d_I(x_i, x_k)$, it can easily be shown that $d_F(x_i, x_k) + d_F(x_i, x_l) \geq d_I(x_k, x_l)$ and $d_F(x_i, x_k) + d_I(x_k, x_l) \geq d_F(x_i, x_l)$. We also have $d_I(x_i, x_l) + d_I(x_j, x_l) \geq L$, $d_I(x_k, x_l) + d_I(x_j, x_k) \geq d_I(x_j, x_l)$ and $d_I(x_i, x_l) + d_I(x_k, x_l) \geq d_I(x_i, x_k)$; adding all three inequalities, we can show that $d_F(x_i, x_l) + d_I(x_k, x_l) \geq d_F(x_i, x_k)$. Thus the triangle inequality is proved for $\triangle x_i x_k x_l$, and similarly for $\triangle x_j x_k x_l$. Then, as in Case 1, it can be proved that $d_F(x_i, x_k) \geq d_I(x_i, x_k)$ and $d_F(x_j, x_k) \geq d_I(x_j, x_k)$, while $d_F(x_i, x_l) = d_I(x_i, x_l)$ and $d_F(x_j, x_l) = d_I(x_j, x_l)$.

Case 3: No triangle inequality is violated. In this case all distances remain the same as the initial distances, so $d_F(x_i, x_k) \geq d_I(x_i, x_k)$ holds trivially.

Proof for Lemma 4. Follows from the proof for Lemma 3.

With the following theorem we show that, even with only the triangle inequality constraints that we enforce in the QP in equation 2.2, distances change in a similar way as they would have if we had enforced all the triangle inequality constraints (as in equation 2.1). We note that distances smaller than or equal to a fixed threshold $th$ are called nearest neighbors or "small", and distances larger than $th$ are called "large", for all of $d_I$, $d_F$ and $d_F'$. We refer to Figure 1c in the following theorem. In the QP in 2.1, the enforced triangle inequalities are those for triangles $\triangle x_i x_j x_k$, $\triangle x_i x_j x_l$, $\triangle x_k x_j x_l$ and $\triangle x_k x_i x_l$. In the QP in 2.2, only the triangle inequalities for $\triangle x_i x_j x_k$ and $\triangle x_i x_j x_l$ might be enforced. Note that when a must-link or can't-link constraint is present between $x_j$ and $x_l$, $d_F(x_i, x_k) = d_I(x_i, x_k)$ even when all the triangle inequality constraints are enforced (from Lemmas 2 and 4). Without loss of generality, we consider the case in which there is a must-link or can't-link constraint between $x_i$ and $x_j$, and we study how $d_I(x_i, x_k)$ is modified when we run the QPs in equations 2.1 and 2.2.

Proof for Theorem 5. I. Case 1: There is a must-link between $x_i$ and $x_j$ and $d_I(x_j, x_k) > th$. By Lemma 1, $d_F(x_i, x_k) > th$ when the QP in 2.1 is used. With the QP in 2.2, we do not consider the triangle $\triangle x_i x_j x_k$, which was responsible for the change in $d_I(x_i, x_k)$, because $x_k$ is not a nearest neighbor of either $x_i$ or $x_j$. Thus $d_F'(x_i, x_k) = d_I(x_i, x_k) > th$.

Case 2: There is a must-link between $x_i$ and $x_j$ and $d_I(x_j, x_k) \leq th$. The triangle $\triangle x_i x_j x_k$, which is responsible for the change in $d_I(x_i, x_k)$, is considered in both the QP in 2.1 and the QP in 2.2. Thus $d_F'(x_i, x_k) = d_F(x_i, x_k)$.

Case 3: There is a can't-link between $x_i$ and $x_j$ and $d_I(x_j, x_k) > th$. By Lemma 3, $d_F(x_i, x_k) > th$ when the QP in 2.1 is used. With the QP in 2.2, we do not consider the triangle $\triangle x_i x_j x_k$, which was responsible for the change in $d_I(x_i, x_k)$, because $x_k$ is not a nearest neighbor of either $x_i$ or $x_j$. Thus $d_F'(x_i, x_k) = d_I(x_i, x_k) > th$.

Case 4: There is a can't-link between $x_i$ and $x_j$ and $d_I(x_j, x_k) \leq th$. The triangle $\triangle x_i x_j x_k$, which is responsible for the change in $d_I(x_i, x_k)$, is considered in both the QP in 2.1 and the QP in 2.2. Thus $d_F'(x_i, x_k) = d_F(x_i, x_k)$.

Therefore $d_F'(x_i, x_k) > th$ whenever $d_I(x_i, x_k) > th$ and $d_F(x_i, x_k) > th$.

II. $d_I(x_i, x_k)$ can become small ($\leq th$) only when there is a must-link between $x_i$ and $x_j$ and $d_I(x_j, x_k) \leq th$.
In that case the triangle $\triangle x_i x_j x_k$, which is responsible for the change in $d_I(x_i, x_k)$, is considered in both the QP in 2.1 and the QP in 2.2, because $x_i x_j$ is a must-link edge and $x_j x_k$ is a nearest-neighbor edge. Thus $d_F'(x_i, x_k) = d_F(x_i, x_k)$.

III. The triangle $\triangle x_i x_j x_k$, which is responsible for the change in $d_I(x_i, x_k)$, is considered in both the QP in 2.1 and the QP in 2.2, because $x_i x_j$ is a must-link or can't-link edge and $x_i x_k$ is a nearest-neighbor edge. Thus $d_F'(x_i, x_k) = d_F(x_i, x_k)$.

IV. The same proof as part III applies.

6.2 Computational Complexity of Passive Subclustering

In this section we calculate the computational complexity of our passive algorithm. The notation used in this section is the same as in Chapter 4. We assume that we are given the distances between all pairs of images in U. In fact, our algorithm only requires distances between images in U and images in R; if these are not available a priori, they may be computed in $O(RND)$ time (where D is the dimension of the feature vectors).

For step (a) (Section 4.2.2 in Chapter 4) of our algorithm we have to find the $n-1$ closest points to each centroid in the restricted set. Finding the $n-1$ smallest elements in an array of size N can be done in $O(N)$ time using a linear-time selection algorithm [21] (the subcluster size n is assumed to be constant), so this takes $O(RN)$ time for the R centroids; a minimal sketch of such a partial selection is given below.
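The sketch below illustrates this partial selection with numpy; the function name, array names and sizes are ours and purely illustrative. numpy's argpartition provides the linear-time selection step, and only the selected $n-1$ entries are then sorted.

# Illustrative sketch of step (a): pick the n-1 closest images to each candidate
# centroid in the restricted set without fully sorting each distance row.
import numpy as np

def closest_members(dist_RN, n):
    """dist_RN: (R, N) array of distances from each restricted-set image to all images.
    Returns an (R, n-1) array with the indices of the n-1 nearest images per row."""
    idx = np.argpartition(dist_RN, n - 1, axis=1)[:, :n - 1]   # linear-time selection
    # order only the selected entries by distance (O(n log n) per row)
    order = np.take_along_axis(dist_RN, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

# Tiny example with hypothetical sizes R = 100, N = 5000, n = 10:
rng = np.random.default_rng(1)
print(closest_members(rng.random((100, 5000)), 10).shape)      # (100, 9)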
For step (b), we maintain a vector of size R and update it at every iteration, which takes $O(R)$ time; the total cost is $O(RT)$. Another major computational cost comes from tracking the overlap between clusters in step d(i). We can do this by maintaining a matrix of size K-by-R and updating it at every iteration; this requires $O(KR)$ time per iteration, $O(KRT)$ in total. For each replacement, overlap can then be checked in constant time. The final major computational cost comes from considering the KR replacements (step d(ii)); this is carried out T times and takes $O(KRT)$ time. Each replacement is done in constant time using the precomputed cost of adding or removing a centroid (see steps (a) and (b)). The other steps in our algorithm are not significant with respect to computational cost. So the total computational cost of the proposed algorithm is $O(RN + KRT)$. In our experiments we find that KT is typically similar to N, so our runtime is proportional to KRT. This is a factor of $O(N)$ faster than PAM [44] ($O(KRNT)$) with a restricted set of the same size; we note, however, that PAM in general does not use a restricted set.

6.3 Details of the Active Subclustering Method

Here we formally describe our active algorithm for the correction of between-class and within-class errors:

(a) A priority queue PQ of size $O(K^2 + Kn)$ is created that contains utility function (U) values for the $O(K^2)$ between-class and $O(Kn)$ within-class pairs. Highest priority is given to the maximum-utility pair.

(b) We pop from the priority queue and ask about the corresponding image pair.

(c) If the selected pair is a between-class pair $(c_i, c_j)$:
• User says "no": The cost of the present centroid set decreases (because the positive cost component contributed by $-\log(1 - P(c_i = c_j))$ becomes zero once we know $c_i$ and $c_j$ are from different classes), so it remains locally optimal. We update the corresponding pairwise probability.
• User says "yes": The cost of the current centroid set blows up (as $-\log(1 - P(c_i = c_j))$ becomes infinite) and the current solution is no longer locally optimal. At that point we return to the optimization procedure used in the passive algorithm until we find another optimal centroid set.

(d) If the selected pair is a within-class pair $(c_k, I_k^j)$:
• User says "yes": We update the pairwise probability.
• User says "no": We replace $I_k^j$ with another image $I_k^{j'}$ in the database that is the next closest to $c_k$ and has not already been assigned to any other centroid (to avoid any overlap induced by our active algorithm).

(e) Within-class thresholding: we keep a threshold $q_{th}$ on the maximum number of questions we ask to correct one particular image assignment for a centroid in a within-class pair. Once we decide to replace a cluster element, that replacement is always the highest priority in the queue until we find a good replacement. However, we do not want to get stuck repeatedly asking questions about one cluster when finding a replacement is difficult. So after $q_{th}$ questions we stop asking about a $(c_k, I_k^{j'})$ pair and adjust the corresponding probability so that we may ask about it later. $q_{th}$ is updated at every iteration so that, once we have corrected most of the between-class errors, we gradually shift our emphasis to within-class errors and start asking more questions about them (see Appendix 6.3.1 for details).

(f) We update the utility functions and the priority queue PQ, and go to step (b).

(g) We continue asking about image pairs until the human budget is exhausted.

6.3.1 Thresholding ($q_{th}$) in Active Learning

We define variables $H_{bc}$ and $H_{wc}$, which keep track of how often humans agree with the algorithm regarding between-class pairs and within-class pairs respectively. If more questions are required to get a disagreement from users (a same-class centroid pair for between-class, or a different-class centroid-image pair for within-class), that implies humans agree more with the present subclustering. $H_{bc}$ is defined as the average number of questions asked about between-class pairs to obtain a between-class disagreement; $H_{wc}$ is defined in a similar way, but for questions asked about within-class pairs only. These variables are updated after every iteration.

Within-class thresholding (step (e) of the active subclustering algorithm): we keep a threshold $q_{th}$ on the maximum number of questions we ask to correct one particular image assignment for a centroid in a within-class pair. Replacing a cluster element is always the highest priority in the queue until we find a good replacement. However, we do not want to get stuck repeatedly asking questions about one cluster when finding a replacement is difficult. So after $q_{th}$ questions we stop asking about a $(c_k, I_k^{j'})$ pair and adjust the corresponding probability so that we may ask about it later. $q_{th}$ is set to $H_{bc}/H_{wc}$ and is updated at every iteration. Intuitively, this means that as we keep correcting between-class errors, we gradually shift our emphasis to within-class errors and start asking more questions about them. A skeletal sketch of this question-asking loop is given below.
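To make the control flow above concrete, the following minimal Python sketch mirrors steps (b)-(g) under simplifying assumptions. The callables ask_human, reoptimize and replace_member are illustrative placeholders rather than part of our implementation, the utility values are taken as given, and the incremental utility updates of step (f) are omitted; only the priority-queue bookkeeping and the $q_{th} = H_{bc}/H_{wc}$ update are shown.

# Skeletal sketch of the active question-asking loop (illustrative names only).
import heapq

def active_question_loop(scored_pairs, ask_human, reoptimize, replace_member, budget):
    """scored_pairs: iterable of (utility, kind, pair) with kind 'between' or 'within'.
    ask_human(pair) returns True for a "yes" answer and False for a "no" answer."""
    pq = [(-u, kind, pair) for (u, kind, pair) in scored_pairs]
    heapq.heapify(pq)                          # step (b): highest-utility pair first
    q_bc = q_wc = 0                            # questions asked, per pair type
    dis_bc = dis_wc = 0                        # disagreements observed, per pair type
    q_th = 1.0
    for _ in range(budget):                    # step (g): stop when the budget runs out
        if not pq:
            break
        _, kind, pair = heapq.heappop(pq)
        says_yes = ask_human(pair)
        if kind == 'between':                  # step (c): centroid pair (c_i, c_j)
            q_bc += 1
            if says_yes:                       # same class: cost blows up, re-optimize
                dis_bc += 1
                reoptimize()
        else:                                  # step (d): centroid-member pair (c_k, I_k^j)
            q_wc += 1
            if not says_yes:                   # wrong member: try the next closest image
                dis_wc += 1
                replace_member(pair)
        # step (e): H_bc, H_wc = average questions per disagreement; q_th caps how many
        # questions may be spent correcting a single within-class assignment.
        H_bc = q_bc / max(dis_bc, 1)
        H_wc = q_wc / max(dis_wc, 1)
        q_th = H_bc / H_wc
        # step (f): recomputing utilities and re-pushing affected pairs is omitted here.
    return q_th

In a full implementation, ask_human would be backed by an annotation interface such as Mechanical Turk, and reoptimize would rerun the passive subclustering optimization until a new locally optimal centroid set is found.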
6.4 Fast Partition Around Medoids (Fast-PAM)

6.4.1 Related Work

We use PAM [44], proposed by Kaufman and Rousseeuw in 1987, as one of our major baselines. Although PAM is slow for large-scale datasets, K-medoids algorithms like PAM are in general known to be more robust to outliers and are invariant to translations and orthogonal transformations of the data points [59]. There have been several approaches to making PAM faster, such as CLARA [44], CLARANS [59] and [60]. We provide a faster implementation of PAM, which we call Fast-PAM.

6.4.2 Algorithm Description

First, we introduce some notation to be used in this section. We are given a set of N unlabeled images U to be clustered into K clusters. $N_k$ denotes the number of images in the k-th cluster ($\sum_{k=1}^{K} N_k = N$). T denotes the number of time-steps or iterations for which an algorithm is run. C ($c \in C$) represents the set of K cluster centroids, with $C \subset U$, i.e., centroids are members of the dataset. R ($|R| = R$) denotes a restricted set of images (randomly chosen from U), which is larger than the number of clusters but smaller than the total number of images ($R \subset U$ and $K < R < N$). $d(I_i, I_j)$ is the distance between images $I_i$ and $I_j$ ($I_i, I_j \in U$).

The state-of-the-art implementation of PAM has computational cost $O(TKN^2)$ [59], whereas our implementation takes only $O(TRN)$. Our approach uses some additional data structures and precomputation that allow each centroid replacement of PAM (c with r) to be carried out in only $O(n_{avg})$ time, where $n_{avg}$ is the average number of images in each cluster; this reduces the running time of PAM by a factor of $O(K)$. For simplicity, we will assume $N = K n_{avg}$. Also, in Fast-PAM we choose the replacements from a restricted set of images rather than from all non-centroids as in PAM; this gives an additional speed-up of $O(N/R)$. In experiments we see that as the number of clusters increases, Fast-PAM becomes more efficient relative to PAM.

The cost function in clustering is of the form $\mathrm{Cost} = \sum_{k=1}^{K} \sum_{I_k^j \in S_k} d(c_k, I_k^j)$, where d measures the distance between two images, $c_k$ is the k-th centroid, $I_k^j$ is the j-th image assigned to the k-th cluster and $S_k$ is the set of images in cluster k. In PAM, at each iteration, a cluster center is replaced by the data point that reduces the Cost by the maximum amount. This continues until convergence, that is, until no replacement is found that decreases the Cost.

In the original implementation of PAM [59] each replacement (c with r) takes $O(N)$ time (since $N \gg K$). We reduce this replacement cost from $O(N)$ to $O(n_{avg})$. When we take a centroid $c \in C$ out of the present set of centroids and add $r \in R$ to C, the cost changes mainly because of two sets of points. First, the points that were earlier assigned to c as their closest centroid now need to be reassigned to some other centroid; we call this set of points $I_1$. Second, there is a set of points that will change their present assignment and be assigned to r, because they are closer to r than to their present assignment; we call this set of points $I_2$. As mentioned earlier, for simplicity we assume $|I_1| = |I_2| = n_{avg}$ when calculating the complexity of our implementation.

Now we discuss the additional matrices that we maintain to reduce the cost of replacements. Given a set of centroids C, we maintain two matrices $CL_1$ and $CL_2$, each of size $R \times N$. $CL_1(i, j)$ and $CL_2(i, j)$ hold the first and second closest centroid, for the j-th point in the dataset, under the extended centroid set $C' = C \cup \{r\}$ (where r is the i-th element of R). Before we start each iteration of the total KR replacements, we also create, for each $r \in R$, a list that holds the points closest to r given the centroid set $C'$; we call this list set RL. A minimal sketch of how these structures can be precomputed is given below.
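The following numpy sketch shows one way the structures just described could be precomputed. The function and array names are ours and purely illustrative, and the distances from the current centroids and from the restricted-set candidates to all points (d_to_C, d_to_R) are assumed to be available as dense arrays, consistent with the per-iteration update cost discussed below.

# Sketch of the Fast-PAM bookkeeping: CL1, CL2 and RL for extended centroid sets.
import numpy as np

def build_cl_rl(d_to_C, d_to_R):
    """d_to_C: (K, N) distances from the current centroids to all points (K >= 2).
    d_to_R: (R, N) distances from the restricted-set candidates to all points.
    Returns CL1, CL2 of shape (R, N) holding, for the set C u {r_i}, the index of the
    closest / second-closest centroid of every point (the value K stands for r_i itself),
    and RL, where RL[i] lists the points whose closest centroid would be r_i."""
    K, N = d_to_C.shape
    R = d_to_R.shape[0]
    order = np.argsort(d_to_C, axis=0)          # closest centroids among C only
    c1, c2 = order[0], order[1]
    d1 = d_to_C[c1, np.arange(N)]               # distance to the closest centroid
    d2 = d_to_C[c2, np.arange(N)]               # distance to the second closest
    CL1 = np.empty((R, N), dtype=int)
    CL2 = np.empty((R, N), dtype=int)
    RL = []
    for i in range(R):
        beats1 = d_to_R[i] < d1                 # r_i becomes the closest centroid
        beats2 = ~beats1 & (d_to_R[i] < d2)     # r_i becomes the second closest
        CL1[i] = np.where(beats1, K, c1)
        CL2[i] = np.where(beats1, c1, np.where(beats2, K, c2))
        RL.append(np.nonzero(beats1)[0])        # points that would move to r_i's cluster
    return CL1, CL2, RL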
Now we look at how we reassign the points in $I_1$ and $I_2$ when a centroid replacement is made.

• For $I_1$: For each $I \in I_1$, we have to find the second closest assignment given r and the other centroids except c. This can be done using the matrices $CL_1$ and $CL_2$ in constant time for each image, so the total computational cost for this set of points is $O(n_{avg})$.

• For $I_2$: We already have the set $I_2$ from the lists RL, which we compute before the replacement stage starts in each iteration, although we have to remove from $I_2$ those points that were in $I_1$, as they have already been taken care of. Now, for each $I \in I_2$, we have to replace the corresponding old cost $d(I, c_{old})$ (where $c_{old}$ is the earlier assignment of I) with the new cost $d(I, r)$. The total computational cost for $I_2$ is $O(n_{avg})$.

At the end of each iteration the centroid set C is updated, so we have to update $CL_1$, $CL_2$ and RL. The computational cost of these updates is $O(RN)$, because each element in $CL_1$ and $CL_2$ can be updated in constant time, and the list set RL can be obtained from $CL_1$ in $O(RN)$ time. So the total computational cost for each iteration becomes $O(RN + KRn_{avg}) = O(RN)$, because we assume $N = Kn_{avg}$. The total computational cost of the complete algorithm is therefore $O(TRN)$, which is $O(KN/R)$ faster than the original PAM implementation.

6.4.3 Experimental Results

We compare Fast-PAM with two other algorithms, CLARANS [59] and a K-medoids baseline [56] (similar to K-means, but in the update step we compute the medoid instead of the mean), in terms of performance and running time. We note that we could also use a restricted set in the original PAM; the complexity would then be $O(TKRN)$. If we use the same restricted set as in Fast-PAM, PAM produces exactly the same result as Fast-PAM, but Fast-PAM is $O(K)$ faster, so we do not run PAM with a restricted set in our experiments. PAM without a restricted set (complexity $O(TKN^2)$) is too slow to run on these datasets; Fast-PAM is $O(KN/R)$ faster than PAM without a restricted set, i.e., the original PAM implementation [44].

We use several subsets of the face image dataset Pubfig [49]. We vary the number of clusters from 5 to 50, with each cluster containing 200 images. Jaccard's Coefficient (JCC) [68] is used as the evaluation metric (a small reference implementation is sketched at the end of this section). Each algorithm is run 20 times, and the mean JCC and running time are plotted in Figure 2.

Although CLARANS produces results comparable to Fast-PAM, it requires a couple of parameters (numlocal and maxneighbour) to be set; we used numlocal = 100 and maxneighbour = 1000 (selected after trying several settings) for CLARANS. Although the K-medoids baseline algorithm is fast, its performance is not on a par with Fast-PAM. The difference between Fast-PAM and the K-medoids baseline is more pronounced towards the beginning of the performance curve (Figure 2a). As the size of the dataset increases, clustering performance degrades for all the algorithms, so the difference in JCC between Fast-PAM and K-medoids is not very significant. However, we see that as the number of clusters increases, Fast-PAM tends to reach better (2–3% lower) local minima of the Cost than the K-medoids baseline; for example, in the dataset with 50 clusters and 10000 images, the local minimum achieved by Fast-PAM is 34007, compared to 34873 for the K-medoids baseline. Overall, we feel that Fast-PAM is comparable to CLARANS but better than the K-medoids baseline in terms of performance for complete clustering.

Figure 2: Performance and time comparison of Fast-PAM with CLARANS [59] and the K-medoids baseline on the Pubfig subsets: (a) Jaccard's Coefficient and (b) running time (in seconds), both as a function of the number of clusters (i.e., the size of the dataset).
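For completeness, the pairwise Jaccard's Coefficient used as the evaluation metric above can be computed as in the short sketch below, assuming the standard pairs-based definition of [68] (same-cluster/same-class pairs divided by the pairs grouped together in either the clustering or the ground truth); the function name is illustrative.

# Pairwise Jaccard's Coefficient for evaluating a clustering against ground truth.
from itertools import combinations

def jaccard_coefficient(pred, true):
    ss = sd = ds = 0                       # same-same, same-different, different-same pairs
    for i, j in combinations(range(len(pred)), 2):
        same_p = pred[i] == pred[j]        # same cluster in the predicted clustering
        same_t = true[i] == true[j]        # same class in the ground truth
        ss += same_p and same_t
        sd += same_p and not same_t
        ds += (not same_p) and same_t
    return ss / (ss + sd + ds)

print(jaccard_coefficient([0, 0, 1, 1, 2], [0, 0, 1, 1, 1]))   # 0.5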
Publications

• Journals:

1. "Active Image Clustering with Pairwise Constraints from Humans"; Arijit Biswas, David Jacobs; International Journal of Computer Vision (IJCV), 2014.

2. "Active Subclustering"; Arijit Biswas, David Jacobs; Computer Vision and Image Understanding (CVIU), Elsevier Press, 2014.

• Conferences:

1. "Distance Learning Using the Triangle Inequality for Semi-supervised Clustering"; Arijit Biswas, David Jacobs; European Conference on Computer Vision (ECCV), 2014, Submitted.

2. "Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback"; Arijit Biswas, Devi Parikh; IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

3. "Active Image Clustering: Seeking Constraints from Humans to Complement Algorithms"; Arijit Biswas, David Jacobs; IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

4. "Leafsnap: A Computer Vision System for Automatic Plant Species Identification"; Neeraj Kumar, Peter Belhumeur, Arijit Biswas, David Jacobs, W. John Kress, Ida Lopez, Joao Soares; European Conference on Computer Vision (ECCV), 2012. (Oral presentation)

5. "Odd Leaf Out: Improving Visual Recognition with Games"; Derek Hansen, David Jacobs, Darcy Lewis, Arijit Biswas, Jennifer Preece, Dana Rotman, Eric Stevens; Third IEEE International Conference on Social Computing (SocialCom), 2011.

• Workshops:

1. "Large Scale Image Clustering with Active Pairwise Constraints"; Arijit Biswas, David Jacobs; Combining Learning Strategies to Reduce Label Cost: International Conference on Machine Learning (ICML) Workshop, 2011.

• Demos:

1. "Relative Attributes for Enhanced Human-Machine Communication"; Naman Agarwal, Arijit Biswas, Adriana Kovashka, Kristen Grauman, Devi Parikh; IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Bibliography

[1] http://vis-www.cs.umass.edu/lfw/.

[2] http://www.dabi.temple.edu/~shape/mpeg7/results.html.

[3] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987.

[4] E. Bair. Semi-supervised clustering methods. CoRR, abs/1307.0252, 2013.

[5] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In C. Sammut and A. G. Hoffmann, editors, ICML, pages 27–34. Morgan Kaufmann, 2002.

[6] S. Basu, A. Banerjee, and R. J. Mooney. Active semi-supervision for pairwise constrained clustering. In Fourth SIAM ICDM, 2004.

[7] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In ACM SIGKDD, 2004.

[8] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms. Data Mining and Knowledge Discovery Series. IEEE Computer Society, 2008.

[9] P. Belhumeur. Personal communication with author. 2011.

[10] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.

[11] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, pages 831–837. MIT Press, 2000.

[12] T. Berg and P. N. Belhumeur. How do you tell a blackbird from a crow? In ICCV, pages 9–16. IEEE, 2013.

[13] A. Biswas and D. Jacobs. Large scale image clustering with active pairwise constraints. International Conference on Machine Learning (ICML) 2011 Workshop on Combining Learning Strategies to Reduce Label Cost.

[14] A. Biswas and D. W. Jacobs. Active image clustering: Seeking constraints from humans to complement algorithms. In CVPR. IEEE, 2012.
[15] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, New York, 1997.

[16] S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In D. N. Metaxas, L. Quan, A. Sanfeliu, and L. J. V. Gool, editors, ICCV, pages 1832–1839. IEEE, 2011.

[17] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In K. Daniilidis, P. Maragos, and N. Paragios, editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 438–451. Springer, 2010.

[18] G. Celeux, F. Forbes, and N. Peyrard. EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognition, 2003.

[19] R. Chellappa and A. K. Jain. Markov Random Fields: Theory and Applications. Academic Press, 1993.

[20] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. 2011.

[21] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.

[22] C. K. Dagli, S. Rajaram, and T. S. Huang. Utilizing information theoretic diversity for SVM active learning. In ICPR, pages II: 506–511, 2006.

[23] I. Davidson and S. S. Ravi. Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min. Knowl. Discov., 18(2):257–282, 2009.

[24] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, 2007.

[25] B. Erol and F. Kossentini. A robust distance measure for the retrieval of video objects. In SSIAI, pages 40–44, 2002.

[26] M. R. Everingham, J. Sivic, and A. Zisserman. Hello! My name is Buffy: Automatic naming of characters in TV video. In BMVC, 2006.

[27] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In D. N. Metaxas, L. Quan, A. Sanfeliu, and L. J. V. Gool, editors, ICCV, pages 161–168. IEEE, 2011.

[28] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, pages 1–8, 2008.

[29] O. Freifeld and M. J. Black. Lie bodies: A manifold representation of 3D human shape. In A. W. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, ECCV (1), volume 7572 of Lecture Notes in Computer Science, pages 1–14. Springer, 2012.

[30] R. Gomes, P. Welinder, A. Krause, and P. Perona. Crowdclustering. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, NIPS, pages 558–566, 2011.

[31] N. Grira, M. Crucianu, and N. Boujemaa. Active semi-supervised fuzzy clustering for image database categorization. In Multimedia Information Retrieval, pages 9–16. ACM, 2005.

[32] Y. Guo and R. Greiner. Optimistic active-learning using mutual information. In M. M. Veloso, editor, IJCAI, pages 823–829, 2007.

[33] G. Gupta and J. Ghosh. Bregman bubble clustering: A robust framework for mining dense clusters. TKDD, 2008.

[34] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58:158–176, 1996.

[35] S. C. H. Hoi, R. Jin, and M. R. Lyu. Learning nonparametric kernel matrices from pairwise constraints. In ICML, 2007.

[36] A. D. Holub, P. Perona, and M. C. Burl. Entropy-based active learning for object recognition. In Online Learning for Classification Workshop, pages 1–8, 2008.

[37] R. Huang and W. Lam. An active learning framework for semi-supervised document clustering with language modeling. Data Knowl. Eng., 68(1), 2009.

[38] R. Huang, W. Lam, and Z. Zhang. Active learning of constraints for semi-supervised text clustering. In SDM. SIAM, 2007.

[39] D. W. Jacobs, D. Weinshall, and Y. Gdalyahu. Classification with nonmetric distances: Image retrieval and class representation. IEEE Trans. Pattern Anal. Mach. Intell., 2000.

[40] A. K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 2010.

[41] P. Jain and A. Kapoor. Active learning for large multi-class problems. In CVPR, pages 762–769, 2009.

[42] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Breaking the interactive bottleneck in multi-class classification with active selection and binary feedback. In CVPR, pages 2995–3002. IEEE, 2010.

[43] A. Kapoor, E. Horvitz, and S. Basu. Selective supervision: Guiding supervised learning with decision-theoretic active learning. In M. M. Veloso, editor, IJCAI, pages 877–882, 2007.

[44] L. Kaufman. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

[45] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, 2002.

[46] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. of the AMS, 2:48–50, 1956.

[47] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. B. Soares. Leafsnap: A computer vision system for automatic plant species identification. In ECCV, 2012.

[48] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. B. Soares. Leafsnap: A computer vision system for automatic plant species identification. ECCV (2), 7573:502–516, 2012.

[49] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV. IEEE, 2009.

[50] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, pages 1721–1728. IEEE, 2011.

[51] K. Levi, M. Fink, and Y. Weiss. Learning from a small number of training examples by exploiting object categories. In IEEE Workshop on Learning in Computer Vision and Pattern Recognition, page 96, 2004.

[52] H. B. Ling and D. W. Jacobs. Shape classification using the inner-distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(2):286–299, Feb. 2007.

[53] Z. Lu and H. H.-S. Ip. Constrained spectral clustering via exhaustive and efficient constraint propagation. In ECCV. Springer, 2010.

[54] Z. D. Lu and M. A. C. Perpinan. Constrained spectral clustering through affinity propagation. In CVPR, 2008.

[55] Y. M. Lui and J. R. Beveridge. Grassmann registration manifolds for face recognition. In ECCV, pages II: 44–57, 2008.

[56] J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symp. on Mathematics Statistics and Probability, 1967.

[57] P. K. Mallapragada, R. Jin, and A. K. Jain. Active query selection for semi-supervised clustering. In ICPR, 2008.

[58] A. Martinez and R. Benavente. The AR Face Database. CVC Technical Report #24, 1998.

[59] R. T. Ng and J. Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng., 14(5):1003–1016, 2002.

[60] H.-S. Park and C.-H. Jun. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl., 36(2):3336–3341, 2009.

[61] F. Petralia, V. Rao, and D. B. Dunson. Repulsive mixtures. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1898–1906, 2012.

[62] K. Punera and J. Ghosh. Consensus-based ensembles of soft clusterings. Applied Artificial Intelligence, 22(7&8):780–810, 2008.

[63] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, pages 413–420, 2009.

[64] F. R. Schmidt, M. Clausen, and D. Cremers. Shape matching by variational computation of geodesics on a manifold. In DAGM, pages 142–151, 2006.

[65] B. Settles. Active learning literature survey. Technical report, 2010.

[66] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 1997.

[67] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, pages 2979–2986. IEEE, 2010.

[68] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005.

[69] Y. D. Tian, W. Liu, R. Xiao, F. Wen, and X. Tang. A face annotation framework with partial clustering and interactive labeling. In CVPR, pages 1–8, 2007.

[70] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, pages 1449–1456. IEEE, 2011.

[71] S. Vijayanarasimhan and K. Grauman. Active frame selection for label propagation in videos. In A. W. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, ECCV (5), volume 7576 of Lecture Notes in Computer Science, pages 496–509. Springer, 2012.

[72] S. Vijayanarasimhan, P. Jain, and K. Grauman. Far-sighted active learning on a budget for image and video recognition. In CVPR, pages 3035–3042. IEEE, 2010.

[73] N. Vretos, V. Solachidis, and I. Pitas. A mutual information based face clustering algorithm for movie content analysis. Image Vision Comput., 2011.

[74] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In ICML, pages 577–584, 2001.

[75] X. Wang and I. Davidson. Active spectral clustering. In ICDM. IEEE Computer Society, 2010.

[76] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained clustering and its application to face clustering in videos. In CVPR, 2013.

[77] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, 2002.

[78] C. Xiong, D. Johnson, and J. J. Corso. Spectral active clustering via purification of the k-nearest neighbor graph. In ECDM, 2012.

[79] Q. Xu, M. desJardins, and K. Wagstaff. Active constrained clustering by examining spectral eigenvectors. In Discovery Science, volume 3735. Springer, 2005.

[80] M.-S. Yang and K.-L. Wu. A similarity-based robust clustering method. IEEE Trans. Pattern Anal. Mach. Intell., 26(4), 2004.

[81] M. L. Zhu and A. M. Martinez. Subclass discriminant analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(8):1274–1286, Aug. 2006.

[82] J. Zhuang, I. W. Tsang, and S. C. H. Hoi. SimpleNPKL: Simple non-parametric kernel learning. In ICML, 2009.

[83] J. Y. Zou and R. P. Adams. Priors for diversity in generative latent variable models. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 3005–3013, 2012.