ABSTRACT

Title of dissertation: NEW APPROACHES TO SIMILARITY SEARCHING IN METRIC SPACES
Cengiz Celik, Doctor of Philosophy, 2006
Dissertation directed by: Professor David Mount, Department of Computer Science

The complex and unstructured nature of many types of data, such as multimedia objects, text documents, and protein sequences, requires the use of similarity search techniques for retrieval of information from databases. One popular approach to similarity searching is mapping database objects into feature vectors, which introduces an undesirable element of indirection into the process. A more direct approach is to define a distance function directly between objects. Typically such a function is taken from a metric space, which satisfies a number of properties, such as the triangle inequality. Index structures that work for metric spaces have been shown to provide satisfactory performance, and have been reported to outperform vector-based counterparts in many applications. Metric spaces also provide a more general framework, and for some domains defining a distance between objects can be accomplished more intuitively than mapping objects to feature vectors.

In this thesis we will investigate new efficient methods for similarity searching in metric spaces. We will first show that current solutions to indexing in metric spaces have several drawbacks. Tree-based solutions do not provide the best trade-offs between construction time and query performance. Tree structures are also difficult to make dynamic without further degrading their performance. There is also a family of flat structures that address some of the deficiencies of tree-based indices, but they introduce their own unique problems in terms of higher construction cost, higher space usage, and extra CPU overhead.

In this thesis a new family of flat structures will be introduced, which are very flexible and simple. We will show that dynamic operations can easily be performed, and that they can be customized to work under different performance requirements. They also address many of the general drawbacks of flat structures outlined above. A new framework, composite metrics, will also be introduced, which provides a more flexible similarity searching process by allowing several metrics to be combined in one search structure. Two indexing structures will be introduced that can handle similarity queries in this setting, and it will be shown that they provide competitive query performance with respect to data structures for standard metrics.

NEW APPROACHES TO SIMILARITY SEARCHING IN METRIC SPACES

by Cengiz Celik

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Dr. David Mount, Chair/Advisor
Dr. Lise Getoor
Dr. Hanan Samet
Dr. Yavuz A. Oruc
Dr. Amitabh Varshney

© Copyright by Cengiz Celik 2006

TABLE OF CONTENTS

List of Figures
1 Introduction
2 Related Work
  2.1 Clustering-Based Methods
  2.2 Local Pivot-Based Methods
  2.3 Vantage-Point Methods
3 Issues in Metric Space Indexing
  3.1 Pivoting Operation
  3.2 Evaluating Indexing Methods
  3.3 Modeling Query Cost
  3.4 Experiment Data
4 Pivot Prioritization Schemes
  4.1 Prioritized Vantage Points
  4.2 The Kvp Structure
  4.3 Secondary Storage
  4.4 Memory Usage
  4.5 Comparison of Kvp and Tree-Based Structures
  4.6 Conclusions
5 The HKvp Structure
  5.1 Tradeoffs for Existing Index Structures
  5.2 The HKvp Structure
    5.2.1 Pivot Selection Policy
    5.2.2 Drop Rate
    5.2.3 Indirect Elimination
    5.2.4 Pivot Limit
  5.3 Query Performance of HKvp
  5.4 Summary
6 The EcKvp Structure
  6.1 Introduction
  6.2 The Pivot Index
    6.2.1 The HKvp Index Structure
    6.2.2 Cost of the Inner Queries
    6.2.3 Information Received from the Inner Queries
    6.2.4 HKvp Parameters
  6.3 Performance of EcKvp
    6.3.1 The Construction Cost Ratio
    6.3.2 The Construction Cost and Performance Trade-off
    6.3.3 Comparing EcKvp with HKvp for Similar Construction Costs
  6.4 Conclusions
7 Composite Metrics
  7.1 Introduction
  7.2 Motivation for Composite Metrics
  7.3 The C-Tree and C-Forest Structures
  7.4 Performance
    7.4.1 Data Sets
    7.4.2 Experiment Setting
    7.4.3 Experimental Results
  7.5 Summary and Conclusions
8 Conclusions
  8.1 Future Research
Bibliography

LIST OF FIGURES

2.1 Space decomposition by the vp-tree (left) and the mvp-tree (right)
3.1 A general case and two extreme examples of distance relations between 3 objects.
3.2 Distance distribution in unit hyper-cube for 5, 10 and 20 dimensions (left) and distances normalized to 1.0 (right) using a bucket width of 0.1
3.3 Distribution of various types of vector data for a bucket width of 0.1 in 20 dimensions.
3.4 Distribution of various types of clustered vector data in 20 dimensions for a bucket width of 0.1.
3.5 Distance distribution of English words under edit distance.
3.6 Distance distribution of image data under two separate metrics.
3.7 The density function (a) and cumulative density function (b) of web page data under cosine distance using a bucket width of 0.005
4.1 Distribution of the number of objects eliminated by pivots based on the objects' distances to the pivots for a query radius of 0.6 on a database of 100,000 objects uniformly distributed in 20 dimensions.
4.2 Query objects Q1 and Q2 have distances of 8 and 14 to the pivot. The areas represent the set of objects that cannot be eliminated by this pivot.
4.3 Comparison of the rate of object elimination by the regular and prioritized vantage points approaches for a query radius of 0.4 on a database of 100,000 objects uniformly distributed in 20 dimensions.
4.4 A sample database of 9 vectors in 2-dimensional space, and an example of the Kvp structure on this database that keeps 2 distance values per database object. (a) The location of objects. Boxes represent objects that have been selected as pivots. (b) The distance matrix between pivots and regular database objects. For each object, the 2 most promising pivot distances are selected to be stored in Kvp (indicated by using a gray background color). (c) The first three object entries in the Kvp. Each object entry keeps the id of the object, and an array of pivot distances.
4.5 Query performance of the Kvp structure, for vectors uniformly distributed in 20 dimensions.
4.6 Query performance in terms of distance computations when the total memory usage is kept constant and the number of bits per distance value and the number of pivots stored are varied. The database has 10,000 uniformly distributed vectors in 20 dimensions.
4.7 Effects of changing the search radius and the amount of total memory on performance. The database has 10,000 uniformly distributed vectors in 20 dimensions.
4.8 Comparison of Kvp and the vp-tree in 20-dimensional vector space for query radii of 0.3 (left) and 0.4.
4.9 Comparison of the Kvp and the vp-tree for queries performed in the metric space of English words for query radii 1 (left) and 2.
5.1 Query performance of Kvp in terms of the number of distance computations for varying numbers of pivots. We see that there is an optimal setting for the number of pivots when either the query is relatively easy (left) or when the number of pivots is high (right). For both figures, the database is composed of 10,000 vectors uniformly distributed in 40-dimensional Euclidean space.
5.2 Query performance of vp-tree (left) and GNAT (right) in terms of the number of distance computations versus the construction cost for a query radius of 0.44. The database is composed of 10,000 vectors uniformly distributed in 40 dimensions.
5.3 Performance comparison of three structures by their construction costs for a query radius of 0.44. The database is composed of 10,000 vectors uniformly distributed in 40 dimensions.
5.4 A sample database of 9 vectors in 2-dimensional space, and an example of the HKvp structure on this database that keeps 2 distance values per database object. (a) The location of objects. Boxes represent objects that have been selected as pivots. (b) The distance matrix between pivots and regular database objects. For each object, the 2 most promising pivot distances are selected to be stored in HKvp (indicated by using a gray background color). (c) The HKvp consists of the distances between pivots and the object entries. Each object entry keeps the id of the object, and an array of pivot distances.
5.5 A sample run of the first phase of the HKvp range query algorithm on the database given in Figure 5.4. The query object is the vector (0.73, 0.07), and the query radius is 0.2. First, the pivot o4 is selected for processing. Based on their distances to o4, the other pivots improve their bounds on the distance to q. After this step, o3 and o6 are eliminated and removed from the processing queue. This leaves only o0 to process. o3 and o6 improve their bounds even more based on their distance to o0, since this may improve the likelihood of the elimination of regular database objects later on. After processing o0 the range query may proceed to processing the regular database objects based on the pivot distance information compiled in pivotBounds and finalBounds.
5.6 Pivot reduction in dimension 20 for a query radius of 0.9 and 500 pivots
5.7 Query costs for varying Drop Rate values for a query radius of 0.9 in 10 dimensions
5.8 The crossing point between drop rates of 1.0 and 0.9 in 10 dimensions for various database sizes (left) and query radius values (right).
5.9 (a) Comparison of direct cuts and approximate cuts in HKvp (left). (b) The number of approximate and direct pivots for HKvp (right). For both graphs, the database consists of 2000 points uniformly distributed in 10 dimensions.
5.10 Breakdown of indirect and direct elimination of objects for varying query radius in 10 dimensions on a database with 500 pivots and 1500 ordinary objects.
5.11 (a) Using different Pivot Limit parameters (left). (b) Optimizing the drop rate for better pivot limit utilization (right). The database size is 4000; the objects are uniformly distributed in 10 dimensions.
5.12 Ratio of the costs of vp-tree to Kvp (left) and vp-tree to HKvp (right) for various construction cost ratios on a database of 10,000 uniformly distributed random vectors in 40 dimensions.
5.13 Comparison of HKvp with Kvp on a database of 1000 vectors.
5.14 HKvp compared to LAESA for various settings. The drop parameters were optimized for both structures.
5.15 Ratio of the costs of LAESA to HKvp for query radius 0.7 on a database of 4,000 uniformly distributed random vectors.
6.1 Cost ratio for various settings in 10 dimensions. (a) Cost ratio by database size for a fixed query radius of 0.5. (b) Various database sizes by the query ratio.
6.2 Number of exact distances by inner query radius for a pivot pool of size 1500 uniformly distributed in 15-dimensional space
6.3 Distribution of the distance information for various inner query radii. (right) Distribution weighted by the size of the population
6.4 (a) Total efficiency versus the inner query radius. (b) Cost per efficiency versus the inner query radius
6.5 Cost per unit efficiency for various query radius values by the inner index size for an inner query radius of 0.3 in 10 dimensions.
6.6 Query cost versus the number of pivots for the inner index in 10 dimensions for a query radius of 0.5.
6.7 Comparison of query performances of Kvp, vp-tree and GNAT versus their construction costs in 40 dimensions for a database of size 10000.
6.8 The real construction cost ratio versus the projected ratio for varying database sizes and for a pivot pool of size 1500 in 20 dimensions.
6.9 EcKvp performance for varying inner query radius values using a pivot pool of 1500.
6.10 The construction cost and query performance ratios of EcKvp to HKvp for different values of (a) dimensionality, (b) query radius, (c) number of pivots, (d) database size. In the experiments, unless specified otherwise, the database size is 3000, the objects are uniformly distributed random vectors in 15 dimensions, the query radius is 0.5, and the number of pivots is 1500.
6.11 Comparison of EcKvp to HKvp for a varying number of clusters for a database of 3000 vectors uniformly distributed in 15 dimensions for a cluster standard deviation of (a) 0.1 and (b) 0.2.
6.12 Comparison of EcKvp with an HKvp structure that uses the same construction cost ratio (cr). The pivot pool size is 2000, the query radius is 0.8, and objects are drawn from a uniform distribution in 20-dimensional space.
6.13 Comparison of EcKvp with an HKvp that uses the same total construction cost. The database size is 100,000, the query radius is 0.8, and objects are drawn from a uniform distribution in 20-dimensional space.
6.14 Comparison of the index structures using the web page data of size 1000.
7.1 Varying the number of components (left) and complexity of the components (right) on the c-tree.
7.2 Size of the database and query performance in v(1,5,1) with a c-tree having a branching factor of 2
7.3 Effects of varying fanout on query performance and setup cost for r = 0.04 in v(1,5,3) having 10000 objects
7.4 Query performance versus setup cost for various c-tree and c-forest configurations for r = 0.04 (left) and r = 0.08 (right) for v(1,5,3).
7.5 c-tree performance on image data.
7.6 Performance by construction cost in m(uniform,10,5) having 40K objects for query radius 0.08.
7.7 Comparison of the structures in various distributions as defined in Chapter 3 for m(?,5,10). The database size is 40K, query range 0.08.
7.8 Comparison of the best settings of the structures for various query radius values. The database size is 40K.

Chapter 1

Introduction

Computers excel at storing vast amounts of information. Many applications in computer science depend on efficient storage and retrieval of data. As the cardinality of data sets increases, it is important to be able to retrieve data based on certain criteria. One important example of this is similarity search, where a query object is given, and similar objects are retrieved by the system. Two important queries used in similarity matching are range queries and k-nearest neighbor queries. A range query is given a query object q and a number r, and returns all the objects that are within distance r of q. A k-nearest neighbor query returns the k closest objects to a given query object, for some given positive integer k. A few applications of similarity search include audio and image databases [20], video, text files, fingerprints [31], face recognition [30], and protein sequences [15].

Although they support essentially the same interface, similarity search systems differ dramatically in how they model objects and how distances are defined. In particular, most of the research to date has focused on objects represented as coordinate vectors. In these systems, it is assumed that the objects can be decomposed into or represented as vectors over some multi-dimensional space, and distances are measured using geometric distance functions like the standard Euclidean distance. A large number of index structures have been created based on this framework. However, the coordinate vector approach is too restrictive for many classes of objects. An alternative direction for research has been similarity search in the more general setting of metric spaces. A metric space is defined to be a set of objects X together with a distance function d on pairs of objects that satisfies the following properties for all a, b, c ∈ X:

- Positivity: d(a,b) ≥ 0.
- Symmetry: d(a,b) = d(b,a).
- Triangle Inequality: d(a,b) + d(b,c) ≥ d(a,c).

The metric space model has the capability of capturing a large variety of similarity search applications. In many cases, it is the most natural manner in which to approach a problem. For example, the distance between two character strings may easily be determined by the edit distance, which is a metric [28].

Metric-based methods have the simplicity of not needing to make any assumptions about the internal structure of the objects. All they require is a distance function that can be invoked on any pair of objects. This high level of abstraction, however, limits the flexibility of index structures. For example, vector-based methods can enhance efficiency by processing the dimensions of the vector one at a time. An example of this is incremental distance computation [4], where the distance of the query object to a bounding box is computed one dimension at a time. Another example is the TV-tree [29]. In the TV-tree, new dimensions are introduced only as they are needed.

The triangle inequality dictates that the distance between two objects is closely related to their distances to a third object. Metric-space indexing structures exploit this fact by appointing a small set of objects to represent the whole population. These objects are called pivots or vantage points. The distances between the pivots and a set of database objects are pre-computed and stored in the index structure.
At query time, the distance between some of the pivots and the query object is computed. Using the triangle inequality, the distance between a regular database object and the query object can be bounded by their distances to the pivots. If the lower bound of the distance between a database object and the query object is greater than the query radius, it follows that the object is outside the query range, and the object can be eliminated from consideration. In a similar fashion, if the upper bound of the distance is less than the query radius, it follows that the object lies within the range. We call this operation pivoting. Objects that have been classified in this manner are said to be eliminated. Database objects that are not eliminated must have their distances to the query object computed explicitly. The efficiency of an index structure is directly related to the fraction of database objects that can be eliminated through pivoting.

Almost all existing index structures for metric similarity search are built around the concept of pivoting. They differ in the way they select pivots, decide which objects should be represented by each pivot, and organize the pivot distances. These differences also affect how the querying process is carried out.

In this thesis we will focus on cases where the distance computation is relatively expensive and dominates the overall cost of a query. Although we will also address other issues like space and additional query-time overhead, we will use the number of distance computations as our primary measure of the cost of answering a query. We will use the terms CPU overhead and computational overhead when referring to other organizational and bookkeeping operations performed by the index.

Throughout the thesis we will focus primarily on distributions that we characterize as being difficult. Informally, a distribution is difficult if it does not allow a pivot to eliminate many objects. This occurs when distances tend to be concentrated around a small number of values. When this happens, a large number of the database objects are at roughly the same distance to a typical pivot. As a result the process of elimination through pivoting is least efficient. The phenomenon of distance concentration is known to occur for uniformly distributed point sets in vector spaces of high dimension. Although there is no widely agreed upon definition of dimension for metric spaces, we will sometimes refer to such difficult distributions informally as being "high-dimensional". We also characterize a query as being difficult if many database objects are at a distance from the query object that is close to the range's radius. Again, this definition is operational because it represents a situation where pivoting will be least efficient in eliminating objects.

Before discussing the contributions of this thesis, let us briefly review some of the existing work in this area. (A more detailed review can be found in Chapter 2.) Much of the previous work on similarity search in metric spaces has focused on structures residing in main memory. These usually depend on a hierarchical organization of the pivots and objects. Among these, the vp-tree [42] stands out as one requiring only a small amount of memory and being able to be constructed efficiently. However, it is inferior in terms of query performance to methods like the mvp-tree [8] and GNAT [9], both of which improve performance at the cost of greater space and construction time.
The M-tree [17] and Slim-tree [23] are disk-based structures, and support dynamic manipulations of the index while maintaining the balance of the tree. In order to be able to handle split and merge operations efficiently, however, they keep less precise data than the comparable GNAT structure. This results in poorer query performance.

Tree structures typically allow an object to have only as many pivots as the height of the tree. This may not be satisfactory for difficult distributions and queries. For this reason tree-based structures are not flexible enough to provide greater elimination power when needed. In contrast, vantage point structures like LAESA [34], Spaghettis [12] and FQA [13] represent another family of solutions. They use more space and construction time, but provide greater efficiency at query time. Although other tree structures also have some parameters that can be adjusted, their improvements are not as pervasive or as dramatic. The shortcomings of vantage points-based methods are the extra computational overhead that they incur, higher construction costs, and higher space usage. If they are allowed to use a sufficient number of pivots, these methods have been shown to outperform other methods in terms of the number of distance computations performed. Some of the structures in this family offer improvements to the common problems of high space and construction time. The Spaghettis structure reduces computational overhead but uses more space than the common approach. The FQA also reduces overhead, but it uses less precision in the distance information it stores, resulting in reduced performance in terms of the number of distance computations.

In this thesis, we study a number of practical improvements to data structures and algorithms for similarity searching in metric spaces. Because of our interest in practical performance, our analysis will be primarily empirical in nature. A recurring theme in our methods will be the observation that some distance information is more important than the rest. More specifically, we will show that a pivot is most effective for objects that are either very close to or very distant from it.

We will first introduce a method called prioritized vantage points, which reduces the computational overhead of vantage point-based methods. The basic idea is to only process pivots that are either close to or far from the query object. Unlike similar solutions proposed before, this method does not compromise on the quality of the distance information (this is one of the principal weaknesses of the FQA structure). Also, our method does not introduce extra space overhead (as does the Spaghettis structure).

In the following chapters we will introduce a series of index structures that provide successive improvements to each other, in the sense that each structure accumulates all the benefits of the previous ones and makes additional enhancements. The first is the Kvp structure. This structure is unique in that it improves both the storage and the computational overhead of the classical vantage-points approach. The Kvp structure offers a number of benefits:

- It is a simple data structure and can be implemented relatively easily.
- It can support dynamic operations like insertion and deletion.
- It is easily adapted for use as a disk-based structure and its access patterns minimize the number of disk-seek operations.
- Queries may be executed in parallel.
We will show that tree-based structures like the vp-tree may offer suboptimal performance, even when the given distribution and query are not particularly difficult (for example, when the distribution is of low dimension and when the query radius is small). Vantage point structures suffer even more in this respect, since the number of pivots is not bounded by the height of the tree. We introduce a new structure, called HKvp, which overcomes these shortcomings since it does not necessarily compute the distance of the query object to all the pivots; that is, it allows the elimination of less promising ones. Even though this feature was implemented by some of the earliest structures, they lacked the capability of limiting space and computational overheads. We will show that HKvp works more efficiently than its counterparts, while offering all the benefits of the Kvp structure.

Next we introduce a structure, called EcKvp, which addresses one of the major weaknesses of global pivot-based structures, namely construction complexity. Since the pivots are global, we need to compute their distances to all the objects in the database. Most of this distance information is relatively useless. On the other hand, tree structures attempt to cluster relevant objects together, and therefore a selected pivot will represent similar objects. This helps, because the pivot's distances are computed against these similar objects. The EcKvp structure also avoids computing distances to all of the objects. We will show that it does a better job of finding objects similar to a given pivot. This allows EcKvp to be constructed more efficiently, with only a negligible performance penalty.

Finally, we propose a new framework for similarity searching in metric spaces, called composite metrics. A composite metric is a way of incorporating several metrics into the process of distance computation. We present examples that motivate this concept in two different ways: when there are alternative definitions of distance, and when the distance depends on several unrelated criteria. The queries in this framework can specify relative weights for each of these metrics, which makes querying more flexible. We will introduce two new index structures that are specifically designed for composite metrics: the c-tree and the c-forest. We will show that they perform very competitively compared to similar structures that work on standard metric spaces. The c-tree uses only one of the metrics per node, and incorporates the others at different levels of the tree. The c-forest is a composition of c-tree structures. It operates by executing the query independently in each of the c-trees. These c-tree structures are used for eliminating objects that do not lie within the query range. The c-forest only computes the distances to the objects that were not eliminated by any of the c-tree structures. We will present experiments that show that c-trees and c-forests provide an efficient solution to similarity searches over composite metrics.

The organization of the rest of this thesis is as follows. In Chapter 2, we present a brief survey of the existing work, followed by a discussion of general issues in metric space indexing in Chapter 3. In Chapters 4, 5, and 6 we present the Kvp, HKvp and EcKvp structures, respectively. In Chapter 7 we introduce the composite metrics framework along with the c-tree and the c-forest structures. Finally, in Chapter 8 we offer concluding remarks.
Chapter 2

Related Work

There has been considerable work on similarity searching in metric spaces from different disciplines, sometimes unaware of each other. Two very good surveys can be found in [15] and [21].

For index structures that reside in main memory, query performance is based on two principal components: the time spent on computing distances between objects and any additional time needed to process, evaluate, and combine the results of these distance computations. Henceforth, we will refer to the latter as CPU overhead or computational overhead.

We will classify approaches into three broad categories: clustering-based methods, local pivot-based methods, and vantage points-based methods.

2.1 Clustering-Based Methods

The basic theme behind clustering-based methods is the use of a hierarchical, tree-based decomposition of the space, where the subtrees are designed to group close objects together. We also observe that each subtree is represented by a single object from the database that is ideally located near the center of the group of objects stored in this subtree.

J. Uhlmann [42] defined the gh-tree, short for generalized hyperplane tree, as one of the first examples in this category. The idea is to pick two objects from the current subset as representatives, and partition the rest of the set into two classes, depending on which representative is closer.

The GNAT tree, presented by S. Brin [9], is a generalization of the gh-tree in which there are more than two representatives. A simple algorithm is given to pick the representatives. According to the algorithm, if we are to select k representatives, we first pick 3k points randomly. Then, starting with an initial set consisting of one random representative, we incrementally grow the set by adding the point that maximizes the minimum distance to the other representatives.

In addition to its representatives, each node can also maintain the radius of the associated region, that is, the maximum distance of the objects inside a representative's region. This method was used in the M-tree (described below). Another enhancement would be to include the distances between the representatives as well. An even more precise approach is used in the GNAT tree: every representative stores the minimum and maximum distances to the objects in every other subset.

The performance of GNAT was compared to another structure, the vp-tree, which will be introduced later. In the best case, GNAT has been reported to make more than a factor of 6 fewer distance computations than the vp-tree, while requiring about a factor of 14 more distance computations for its construction. However, it was reported to be worse than the vp-tree in some cases. The original study [9] also showed that GNAT was outperformed by a variant of the vantage points structure, although no data was given about the parameters used in the construction of this structure. Recent experiments presented in [15, 13] indeed show that GNAT performs consistently worse than variants of vantage points structures in terms of the number of distance computations, while it consumes less space and has less computational overhead.
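The greedy selection rule described above (pick 3k random candidates, then repeatedly add the candidate farthest from its nearest chosen representative) is easy to state in code. The following is a minimal sketch under our own assumptions; the function names and the use of a Euclidean distance are illustrative and are not taken from the original GNAT implementation.

```python
import math
import random

def euclidean(a, b):
    # Stand-in metric; any metric d(x, y) could be substituted.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_representatives(objects, k, dist=euclidean, seed=0):
    """Greedy max-min selection of k representatives from 3k random candidates."""
    rng = random.Random(seed)
    candidates = rng.sample(list(objects), min(3 * k, len(objects)))
    reps = [candidates.pop(rng.randrange(len(candidates)))]  # one random seed representative
    while len(reps) < k and candidates:
        # Add the candidate whose minimum distance to the chosen set is largest.
        best = max(candidates, key=lambda c: min(dist(c, rep) for rep in reps))
        candidates.remove(best)
        reps.append(best)
    return reps
```

Each node would then partition its objects among the nearest representatives and recurse on the resulting subsets.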
The M-tree [17] is designed to be a dynamic structure, with emphasis placed on the structure's ability to perform queries efficiently and to optimize I/O performance after a sequence of data insertions. Similar to the SS-tree [46], it keeps the distance to the farthest object in a sub-tree. Maintaining the radius of the representative objects allows it to easily reorganize disk blocks. Splitting a node involves selecting two new representatives and redistributing the objects associated with this node among these two new nodes. The M-tree considers all possibilities for a split and chooses the one with the tightest covering radius. The Slim-tree [23] employs a more efficient splitting method: the minimum spanning tree of the objects is generated, and the longest arc of the spanning tree is removed to partition the set of objects into two subsets.

2.2 Local Pivot-Based Methods

The structures in this category are also tree-based; however, the partitions are based on the distances of objects to either one or two selected objects called pivots. Objects that have similar distances to the pivots are put inside the same subtree, but that does not necessarily mean they are in close proximity to each other. The pivots are only used within their subset, and this is why we call them local pivots.

W. Burkhard and R. Keller [10] suggested selecting a random object in the data set and partitioning the rest such that every object having the same distance to the pre-selected object is placed in the same subset. The tree construction continues recursively on the subset of points at the same distance. Since their application domain produced discrete distance values, it was possible for many points to be at the same distance.

An adaptation of the same basic idea to continuous distance values is the vp-tree, which was introduced by J. Uhlmann [42]. Such a tree is defined by a branching factor. In order to construct a vp-tree with a given branching factor k, at a given node, one of the objects is selected as the vantage point, and the distances from the other objects to this vantage point are calculated. Then these objects are partitioned into k groups of roughly equal size based on these distances. In this way a node can have k branches, with each subtree having roughly m/k objects, where m denotes the number of objects for that node. The only information that needs to be stored is the vantage point itself and the k - 1 distance values, denoted cutoff[1..k], defining the ranges of distances for each subtree.

A range query of radius r centered at a query point q is answered as follows: at any given node, the distance d between q and the node's vantage point is calculated. If d is smaller than r, the vantage point is added to the result set. For every subset i of the node defined by the cutoff values, if the interval of the subset, [cutoff[i-1], cutoff[i]], intersects the interval [d - r, d + r], then subset i is searched recursively.
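Putting the construction and range query just described together, a minimal sketch might look as follows. This is our own illustrative reading of the vp-tree, not the original implementation: the vantage point is simply taken to be the first object of each subset, small subsets are kept as leaf buckets, and euclidean is only a stand-in for an arbitrary metric.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class VPNode:
    def __init__(self, objects, k, dist):
        self.vantage = objects[0]                     # arbitrary vantage point choice
        rest = sorted(objects[1:], key=lambda o: dist(o, self.vantage))
        self.children, self.cutoffs = [], []
        step = max(1, math.ceil(len(rest) / k))       # k groups of roughly equal size
        for i in range(0, len(rest), step):
            group = rest[i:i + step]
            self.cutoffs.append(dist(group[-1], self.vantage))  # upper cutoff of this group
            self.children.append(VPNode(group, k, dist) if len(group) > k else group)

    def range_query(self, q, r, dist, out=None):
        out = [] if out is None else out
        d = dist(q, self.vantage)
        if d <= r:
            out.append(self.vantage)
        lo = 0.0
        for child, hi in zip(self.children, self.cutoffs):
            # Visit subset i only if [cutoff[i-1], cutoff[i]] intersects [d - r, d + r].
            if hi >= d - r and lo <= d + r:
                if isinstance(child, VPNode):
                    child.range_query(q, r, dist, out)
                else:
                    out.extend(o for o in child if dist(q, o) <= r)  # leaf bucket: check directly
            lo = hi
        return out
```

A call such as VPNode(db, 3, euclidean).range_query(q, 0.2, euclidean) would then return all database objects within distance 0.2 of q.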
A nice feature of the vp-tree is that it is possible to divide the space into many divisions through a single distance calculation. As a result, when doing a search, we need only perform one distance calculation per node. However, as the dimensionality of the data distribution grows, it is well known that for many distributions the objects tend to cluster around a single distance value [7]. As a result, almost all of the objects are at the same distance to the vantage point. Thus, the distance to the vantage point loses its discriminating power with respect to the objects. Another common way to describe the situation is to visualize it in a 3-dimensional space, where the median spheres dividing the branches have very similar radii, subdividing space into thin spherical shells. As a result, objects that are grouped under the same subtree tend to be spread around the space rather than being close to one another.

The mvp-tree [8] uses two vantage points per node. After partitioning the points with one primary vantage point, the partitions are further divided using a second vantage point. This way, if we divide the space into m different regions by the first vantage point, we will have a total of up to m^2 subsets. It should be noted that the second vantage point uses different cutoff values for each partition of the first vantage point. This allows the tree to maintain balance by assigning approximately the same number of points to each subset. This comes at the cost of more space consumption per node. The value of this partitioning approach is that, instead of dividing the space into very thin shells, it strives to produce more tightly clustered subsets, while still achieving the same fanout. Figure 2.1 illustrates the contrast between the space decompositions of the vp-tree and the mvp-tree.

Figure 2.1: Space decomposition by the vp-tree (left) and the mvp-tree (right)

The mvp-tree stores distances to two vantage points at the leaf nodes, making it a hybrid of the vantage-point structures. It is reported to perform up to 80% fewer distance calculations compared to the vp-tree. Experiments in [15, 13] show that the non-hybrid version of the mvp-tree is consistently outperformed by vantage-point variants in terms of the number of distance computations.

2.3 Vantage-Point Methods

In vantage-point methods the pivots are used to control processing for the entire set of objects, instead of having local scope as they do in the previously described methods. A subset of the objects is selected as vantage points. The distances between the pivots and the rest of the objects are computed at initialization time and stored in the database. At query time these precomputed distances are used to eliminate candidates in a way that is similar to local methods. If there are k vantage points, then the basic method performs k·n distance computations at construction and keeps k·n distance values in the index structure. A range query accesses these distance values to determine which objects can be eliminated based on their distances to the vantage points. Finally, a pass through all objects not eliminated by use of pivots is performed.

Note that vantage-point methods require extra processing compared to local methods, where determination of the partition at a node is done only for the objects covered by the node. Local methods require storage that is only linear in the database size, whereas vantage-point methods require O(k·n) storage. A powerful aspect of these methods is that it is possible to use as many pivots as desired at the cost of construction time, which results in higher storage requirements and extra preprocessing time. Nonetheless, this additional effort and space can yield progressively better query performance in terms of the number of distance computations.

To the best of our knowledge, the first vantage-point structure that appeared in the literature was LAESA [34], as a special case of AESA [43]. There have been some improvements over the basic LAESA algorithm, such as keeping the distances to the vantage points sorted and doing binary searches to identify which objects can be eliminated from consideration [36]. The TLAESA structure [33] was proposed as a hybrid method between the LAESA and the gh-tree.
The pivots are organized as in a gh-tree, but a distance matrix is also used to provide lower bounds on the distance of the query object to the node representatives. Their experiments were performed in low dimensions, and although the structure was superior to LAESA in terms of total CPU cost, it was inferior in terms of the number of distance computations.

The Spaghettis structure [12] was introduced as a method designed to further decrease the computational overhead. Here the distances are sorted in a similar fashion. In addition, every distance has a pointer to the same object's distance in the next array of distances. As done in the case of sorted distances, the feasible ranges are computed for each array using binary search. For each point, its path starting from the first array is traced using the pointers. Once the object falls out of range in any of the arrays, we may infer that the object cannot lie within the query region.

The Fixed Queries Array (FQA) [13] is one of the recent global pivot-based methods. It sorts the points according to their distances to the first vantage point, then on the second, and so on. It decreases the precision with which distances are measured, for otherwise the points effectively would be sorted only by their distance to the first pivot. Using this sorted structure, the query algorithm performs binary searches within each distance range. The first pivot is processed as in the sorted-array approach; after that, for each range of objects that has the same discretized distance to the first vantage point, we perform a binary search to find the range that is valid for the second pivot. The search continues in this fashion, performing binary searches within ranges. FQA is unique among vantage-point methods that are designed to reduce computational overhead in that it does not require any additional storage. However, it does not work very well if too many bits are used for the distance values, since the structure would then effectively be sorted only by the first pivot. This creates an additional trade-off between the number of bits used for distance storage and the extra CPU processing time needed. This comes in addition to the trade-off between the number of bits and query performance in terms of distance computations. Their experiments show great improvements in low dimensions, but for 20-dimensional data and a database of one million objects, they estimate FQA would take only 37.6% less time than a naive approach.

Chapter 3

Issues in Metric Space Indexing

In this chapter, we discuss some of the factors that influence indexing in metric spaces. Specifically, we analyze the pivoting operation in more detail, list some of the criteria we may use to evaluate different indexing methods, briefly summarize some of the previous work on modeling query cost, and present the datasets we will use in later chapters.

3.1 Pivoting Operation

Recall that metric space indexing methods commonly use the pivoting operation to eliminate objects without having to compute their real distance to the query object. Given a pivot object p and any database object o, where the distance between p and o, dpo, is pre-computed, we first compute the distance between p and the query object q. Based on this distance dpq, we can deduce that the distance between q and o has to lie in the range [|dpo - dpq|, dpo + dpq]. Figure 3.1 depicts this situation in a 2-dimensional setting.
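This bound translates directly into a small helper. The sketch below is illustrative only (the names are ours); it shows the test for a single pivot, and also how bounds from several pivots are combined, as discussed next, by taking the maximum of the lower bounds and the minimum of the upper bounds.

```python
def pivot_bounds(d_po, d_pq):
    """Triangle-inequality bounds on d(q, o), given d(p, o) and d(p, q)."""
    return abs(d_po - d_pq), d_po + d_pq

def classify(stored_dists, query_dists, r):
    """Classify one database object for a range query of radius r.

    stored_dists[i] = d(pivot_i, o), precomputed at construction time
    query_dists[i]  = d(pivot_i, q), computed at query time
    """
    lower, upper = 0.0, float("inf")
    for d_po, d_pq in zip(stored_dists, query_dists):
        lo, hi = pivot_bounds(d_po, d_pq)
        lower = max(lower, lo)   # overall lower bound: maximum over pivots
        upper = min(upper, hi)   # overall upper bound: minimum over pivots
    if lower > r:
        return "outside"   # eliminated: o cannot lie in the query range
    if upper <= r:
        return "inside"    # eliminated: o certainly lies in the query range
    return "unknown"       # d(q, o) must be computed explicitly
```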
Sometimes the pivoting operation is performed on a subset of objects. In this case, the pivot usually maintains the minimum and maximum distances to the subset. This approach is used, for example, in the vp-tree and the GNAT structures. Alternatively, just the maximum distance is stored, as done in the M-tree and the slim-tree. Of course, unless the subset of objects is very tightly clustered, upper and lower bounds for a subset will be weaker than bounds computed for an individual object.

Figure 3.1: A general case and two extreme examples of distance relations between 3 objects.

For a given database object, there will be a number of pivots that can be processed. The overall lower bound of the distance between the object and the query object is the maximum of the lower bounds acquired from all the pivots. Similarly, the overall upper bound is the minimum of the upper bounds acquired from all the pivots. In order to avoid computing the actual distance of an object to the query object, either the object's overall lower bound should be greater than the query radius, or the upper bound should be less than the query radius. In other words, for a given pivot p, we want the quantity |dpo - dpq| to be as large as possible.

Recall that for difficult distributions and high-dimensional spaces, distances tend to cluster, and so objects tend to be at about the same distance from each other. Assuming that the query object follows this expected pattern, in order for pivoting to be effective we need either a high or a low value of dpo. This observation will be supported by experimental results later. Tree structures attempt to cluster relevant objects together, so that dpo is minimized. When we have a symmetrical distance distribution, we will show that high values of dpo are as important for object pruning as low values. The new structures that we will introduce will utilize this observation.

3.2 Evaluating Indexing Methods

The performance characteristics of similarity searching in metric spaces are affected by a number of factors. It is almost always possible to contrive a situation that favors any one index structure over another. The first question that needs to be asked is whether it is worth using an indexing structure at all. In the paper "When is 'Nearest Neighbor' Meaningful?" [7], the authors rightly argued that, in high dimensions, many of the published index structures actually perform worse than a brute-force sequential scan. They give the example of relational database optimizers that refuse to use an index unless it can filter out at least 90% of the records. This is based on the fact that seek operations on disk are much more expensive than sequential reads. Although metric structures are usually meant to handle large and complex objects, the point is still valid to a degree given the extra cost of using an index structure.

There are a number of factors that can affect how well a particular scheme may perform. The distribution of objects, the query range, and the computational cost of distance computations can all have different effects on various methods. Some methods can handle objects in low dimensions better, and some may work better for small ranges. The computational overhead (elements of the search other than distance computations) of an algorithm becomes less important as the computational cost of distance computations increases. Factors that determine the efficiency of a structure include construction time complexity, storage requirements, and how costly it is to process dynamic manipulations.
3.3 Modeling Query Cost

Here we will offer a theoretical analysis of the query performance of global vantage points-based methods. Our analysis assumes that the distance distribution is known. The actual distance distribution can be approximated by taking a sample of object distances.

Given a database object o, pivot p, and query radius r, we define the function PF(o,p,r) to be the probability that p fails to eliminate object o for a query of radius r. Let f denote the distance probability density function and let F denote the corresponding cumulative distribution function. Denoting the distance between o and p as dop, and assuming that all objects conform reasonably with the distribution function, we can approximate this failure probability as the fraction of objects that are expected to have distances to p between dop - r and dop + r, that is,

PF(o,p,r) = \int_{d_{op}-r}^{d_{op}+r} f(x) \, dx = F(d_{op}+r) - F(d_{op}-r)    (3.1)

In order to generalize Equation (3.1) to any object and any pivot, we assume that pivots and objects are distributed independently, which is the case for the classic vantage points approach. Under this assumption, the expected probability of the failure of the elimination of an object, EPF(r), is as follows.

EPF(r) = \int_{-\infty}^{\infty} f(x) \, (F(x+r) - F(x-r)) \, dx    (3.2)

If a vantage points approach uses k pivots, then the total cost of a query, qcost(r), involves computing the distance of the query object to the pivots and then computing distances to the objects that have not been eliminated by the pivots. Thus we have

qcost(r) = k + n \cdot (EPF(r))^k    (3.3)

The M-tree structure has a cost model that also depends on the distribution function [16]. There has also been some work to approximate distance distributions. It was argued that most real data follows an exponent rule [22], where the number of objects within a distance of r can be approximated by r raised to a constant. However, this assumption may be too general, especially for clustered datasets. Another approximation of the distance function was based on the mean and standard deviation of the distance distribution. This approach is also insufficient when the distribution has several peaks or when the peaks are not symmetrical.

To the best of our knowledge, Equation (3.2) has not been used for query cost estimation before. Equation (3.3) has been used [14] to successfully predict the query cost of global vantage points-based methods that have a suitable distribution for their model.
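Equations (3.1)-(3.3) are easy to evaluate numerically once a sample of pairwise distances is available. The sketch below is our own illustration of the model (it is not the estimator of [14]); it replaces f and F with an empirical distribution built from the sample.

```python
import bisect
import math
import random

def empirical_cdf(sample):
    """Return F(x), the empirical cumulative distribution of the sampled distances."""
    xs = sorted(sample)
    return lambda x: bisect.bisect_right(xs, x) / len(xs)

def epf(sample, r):
    """Monte-Carlo estimate of EPF(r) = E_x[F(x + r) - F(x - r)]  (Equation 3.2)."""
    F = empirical_cdf(sample)
    return sum(F(x + r) - F(x - r) for x in sample) / len(sample)

def qcost(sample, r, k, n):
    """qcost(r) = k + n * EPF(r)^k  (Equation 3.3)."""
    return k + n * epf(sample, r) ** k

# Example: distances among 200 random vectors in 20 dimensions stand in for a
# (hypothetical) sample of the distance distribution of a database of 10,000 objects.
if __name__ == "__main__":
    pts = [[random.random() for _ in range(20)] for _ in range(200)]
    sample = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    print(qcost(sample, r=0.4, k=50, n=10000))
```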
Laplace coordinates are drawn from a Laplacian distribution of zero mean and unit variance. 23 0 2 4 6 8 10 12 14 16 0 0.5 1 1.5 2 2.5 3 percentage distance dim. 5 dim. 10 dim. 20 0 2 4 6 8 10 12 14 16 18 0 0.2 0.4 0.6 0.8 1 percentage distance dim. 5 dim. 10 dim. 20 Figure 3.2: Distance distribution in unit hyper-cube for 5,10 and 20 dimensions (left) and distances normalized to 1.0 (right) using a bucket width of 0.1 Clust Gauss(nclust, cdev) coordinates are partitioned intonclust number of clusters whose centers are distributed uniformly in a unit hypercube, and within each cluster the coordinates are drawn from a Gaussian distribution having the cluster center as mean and a standard deviation of cdev. Figure 3.3 shows the histogram of distances for these distributions. Note that the clustered Gaussian distribution has two separate peaks; the smaller and closer one representing the distances within the cluster, the other one representing the distances between clusters. Figure 3.4 provides more configurations for clustered data. We see that as we increase the number of clusters, the number of objects per cluster drops, and the peak that represents the objects in the same cluster becomes smaller. We also experimented with real metric data sets. One dataset we employed involved a set of about 20,000 English words collected from the Unix operating system. We used the edit distance as our metric. The distribution is shown in 24 0 1 2 3 4 5 6 7 8 9 0 2 4 6 8 10 frequency (%) distance gaussian clust_gauss(10, 0.1) laplacian uniform Figure 3.3: Distribution of various types of vector data for a bucket width of 0.1 in 20 dimensions. 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 percentage distance clust_gauss(10, 0.1) clust_gauss(20, 0.1) clust_gauss(10, 0.2) Figure 3.4: Distribution of various types of clustered vector data in 20 dimensions for a bucket width of 0.1. 25 0 2 4 6 8 10 12 14 16 0 5 10 15 20 25 30 percentage edit distance Figure 3.5: Distance distribution of English words under edit distance. Figure 3.5. Even though words of similar spellings would be expected to demonstrate some clustering properties, the clusters are too small to be evident in the graph. The distribution looks very much like our synthetic vector data, having a fairly symmetric form. Another dataset we used was a set of 1800 images having a resolution of 128 by 96 pixels taken from the Corel collection [1]. For each image there is a 64-dimensional color histogram and a 62-dimensional feature vector generated by Gabor texture filters [32]. Figure 3.6 illustrates the distance distribution of these images. We again note that these distributions are very much like others we have presented earlier. Finally, we considered a document data set consisting of a set of web pages obtained through Google Inc. [2]. The web pages were converted into term vectors, and cosine distance [38, 27] was used as a metric. We see in Figure 3.7 that this distribution is very difficult in the sense described in Chapter 1 because of the high concentration of distances near the maximum value. 26 0 2 4 6 8 10 12 14 16 0 0.1 0.2 0.3 0.4 frequency (%) distance color texture Figure 3.6: Distance distribution of image data under two separate metrics. 
Figure 3.3 shows the histogram of distances for these distributions. Note that the clustered Gaussian distribution has two separate peaks; the smaller and closer one represents the distances within a cluster, while the other represents the distances between clusters. Figure 3.4 provides more configurations for clustered data. We see that as we increase the number of clusters, the number of objects per cluster drops, and the peak that represents the objects in the same cluster becomes smaller.

Figure 3.3: Distribution of various types of vector data for a bucket width of 0.1 in 20 dimensions.

Figure 3.4: Distribution of various types of clustered vector data in 20 dimensions for a bucket width of 0.1.

We also experimented with real metric data sets. One dataset we employed involved a set of about 20,000 English words collected from the Unix operating system. We used the edit distance as our metric. The distribution is shown in Figure 3.5. Even though words of similar spellings would be expected to demonstrate some clustering properties, the clusters are too small to be evident in the graph. The distribution looks very much like our synthetic vector data, having a fairly symmetric form.

Figure 3.5: Distance distribution of English words under edit distance.

Another dataset we used was a set of 1800 images having a resolution of 128 by 96 pixels taken from the Corel collection [1]. For each image there is a 64-dimensional color histogram and a 62-dimensional feature vector generated by Gabor texture filters [32]. Figure 3.6 illustrates the distance distribution of these images. We again note that these distributions are very much like the others we have presented earlier.

Figure 3.6: Distance distribution of image data under two separate metrics.

Finally, we considered a document data set consisting of a set of web pages obtained through Google Inc. [2]. The web pages were converted into term vectors, and cosine distance [38, 27] was used as a metric. We see in Figure 3.7 that this distribution is very difficult in the sense described in Chapter 1 because of the high concentration of distances near the maximum value.

Figure 3.7: The density function (a) and cumulative density function (b) of web page data under cosine distance using a bucket width of 0.005

Chapter 4

Pivot Prioritization Schemes

In this chapter, we introduce a new approach to improve the performance of pivot-based methods, which we call prioritized vantage points. Recall that query processing in similarity search is in essence a process of identifying which database objects lie within the query range and which do not. Pivot-based methods operate by using precomputed distances from database objects to a collection of distinguished objects called pivots. A database object is said to be eliminated by a pivot if this object can be classified as lying inside or outside the query range based solely on its distance to one or more pivots, without explicitly computing its distance to the query object. The effectiveness of a pivot is defined to be the expected fraction of database objects this pivot succeeds in eliminating.

We will show that pivots have varying degrees of effectiveness in eliminating objects depending on their distance to the query object. Based on this observation, the prioritized vantage points structure only processes a subset of the pivots that look promising. This results in a reduction in CPU overhead at the cost of making (ideally relatively few) more distance computations.

Research to date on pivot-based methods has focused on structures that reside in main memory. This has limited the number of pivots used in experiments. The FQA structure [13] overcomes this by using fewer bits to encode the distance information per pivot per database object, but using fewer bits also reduces the accuracy of pivots. Rather than uniformly reducing the number of bits of precision, we introduce a new approach, called Kvp, which intuitively seeks to keep the most effective distance information. The Kvp structure is designed to prioritize pivot distance data to reduce space requirements. The Kvp structure is an enhancement of prioritized vantage points. In addition to saving space, we will see that it also provides savings in CPU overhead.

4.1 Prioritized Vantage Points

The usual vantage points-based method, which we will call vps for short, stores k·n pieces of information, where k is the number of pivots and n is the database size. As mentioned above, we present ways to prioritize or ignore some of this data to improve space or CPU overhead at the cost of a relatively small number of additional distance computations.

As mentioned above, a pivot's effectiveness in eliminating database objects depends on its distance to the query object. Figure 4.1 shows the number of objects eliminated as a function of the pivot's distance to the query object. The data set consists of 100,000 uniformly distributed points in 20-dimensional space for a query radius of 0.6. Observe that the pivot is most effective when it is either close to or far from the query object. Figure 4.2 illustrates the intuition behind this observation. Here we have range queries of radius 1 on a database of English words.
As the query object gets closer to the bulk of the population, the area under the curve grows larger, and therefore more database objects are saved from elimination.

Figure 4.1: Distribution of the number of objects eliminated by pivots based on the objects' distances to the pivots, for a query radius of 0.6 on a database of 100,000 objects uniformly distributed in 20 dimensions.

Figure 4.2: Query objects Q1 and Q2 have distances of 8 and 14 to the pivot. The areas represent the set of objects that cannot be eliminated by this pivot.

In the basic vantage points approach, the distances between a query object and all of the pivots are computed, and all of these pivots are processed in an arbitrary order. Instead of processing all of the pivots, we prioritize their use based on their distances to the query object, such that close or far pivots are processed earlier than others. Note that this does not add any extra distance computations to the process. Figure 4.3 illustrates the resulting reduction in the size of the (unpruned) search space. Here, the y-axis shows the number of objects not eliminated from the candidate list. As more pivots are processed, this number decreases. We see that it is possible to eliminate database objects much more quickly with priority vantage points.

Figure 4.3: Comparison of the rate of object elimination by the regular and prioritized vantage points approaches, for a query radius of 0.4 on a database of 100,000 objects uniformly distributed in 20 dimensions.

We can use a large number of pivots for our structure, but only use a portion of them at query time depending on the query object. Like the FQA structure, this scheme does not need any additional structures other than the k·n distance values to the pivots. Unlike FQA, in order to achieve its CPU efficiency it does not need to reduce the number of bits of precision for the distance values, and so it is possible to use as much precision as desired.

4.2 The Kvp Structure

In this section we introduce the Kvp structure. As mentioned above, it is desirable to use pivots that are particularly close to the query object. In a similar fashion, we can expect a pivot to be more effective for objects that are close to or distant from it. At index creation time, we can find such pivots, and choose to keep only the distances to these promising pivots. We will call the result a Kvp structure.

With priority vantage points, we do not know where the query object will be in advance, and so we have to keep all pivot distances. However, with Kvp the distance relations between the pivots and database elements are computed beforehand at construction time. In addition to reducing CPU overhead by first processing the most promising pivots, we can eliminate distance computations to the less promising pivots, thus decreasing the space requirements.

There are two ways this can be implemented. One way would involve the usual layout, where every pivot stores an array of distances to all the database objects. The object distances can be sorted so that binary search can be used to quickly determine the set of objects that are eliminated.
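As a rough illustration of this first layout (a sketch of ours, not code from the implementation), the objects that a single pivot fails to eliminate form a contiguous run in the sorted distance array, which binary search finds directly:

    import bisect

    def surviving_interval(sorted_dists, d_qp, r):
        # Objects whose distance to the pivot lies outside [d(q,p) - r, d(q,p) + r]
        # are eliminated by the triangle inequality; the rest remain candidates.
        lo = bisect.bisect_left(sorted_dists, d_qp - r)
        hi = bisect.bisect_right(sorted_dists, d_qp + r)
        return lo, hi   # candidates are the objects in sorted_dists[lo:hi]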
Another way to implement the basic idea is to have a collection of object entries, where each object entry stores the distances to its selected pivots. The benefit of this latter approach is that it is very easy to insert or delete objects from the database, since there is no global data structure that keeps information about the objects. We preferred to implement Kvp using the second approach. Figure 4.4 shows an example.

Figure 4.4: A sample database of 9 vectors in 2-dimensional space, and an example of the Kvp structure on this database that keeps 2 distance values per database object. (a) The locations of the objects. Boxes represent objects that have been selected as pivots. (b) The distance matrix between pivots and regular database objects. For each object, the 2 most promising pivot distances are selected to be stored in Kvp (indicated with a gray background). (c) The first three object entries in the Kvp. Each object entry keeps the id of the object and an array of pivot distances.

Other than the fact that Kvp stores only a subset of the pivot distances, the way it processes queries is identical to the classical global pivot-based method. For each database object it maintains a lower and an upper bound on the distance to the query object. Each pivot is used to attempt to tighten these bounds. After processing all available pivot distances, if the bounds are good enough to either discard the object as outside the query range or prove that it is within the query range, we avoid computing the actual distance between the object and the query object. Otherwise we compute this distance.

Figure 4.5 shows the query performance of Kvp as a function of the number of pivots stored, for a query radius of 0.4 in 20 dimensions. The results labeled "random" choose the next pivot to be used randomly, simulating a classic vantage-points structure. Kvp methods first process close and distant vantage points. For example, assume we have a Kvp structure that has a pool of 50 prioritized vantage points, which we refer to as Kvp 50. In the sorted array of pivot distances 0 through 49, the processing proceeds in the order 0, 49, 1, 48, 2, and so on. As the number of pivots in the pool is increased, the chance of finding a better suited pivot also increases. Varying the number of pivots provides flexibility to improve query performance by spending more time at construction time without increasing space and CPU overhead.

We can see from the graphs that Kvp, like priority vantage points, can eliminate database objects much faster than the classic approach. Even though it uses less space than priority vantage points, it produces similar query results. The reductions in CPU overhead and space are very closely related. For example, if we were to store and process two pivots instead of ten, we would be saving a factor of 5 in both storage and computational overhead. In the case of radius 0.4 for 20 dimensions, we observed the same query performance in terms of the number of distance computations.

Figure 4.5: Query performance of the Kvp structure, for vectors uniformly distributed in 20 dimensions.
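To make the per-object query processing concrete, the following is a minimal sketch of the bound-tightening loop described above (our own illustration, not the dissertation's code; entry is assumed to hold the stored (pivot id, distance) pairs, pivot_dists the query-to-pivot distances, and dist the metric):

    def process_entry(entry, pivot_dists, dist, q, r):
        lower, upper = 0.0, float("inf")
        for pivot_id, d_op in entry.pivot_distances:   # stored at construction time
            d_qp = pivot_dists[pivot_id]               # computed once per query
            lower = max(lower, abs(d_qp - d_op))       # triangle inequality
            upper = min(upper, d_qp + d_op)
            if lower > r:
                return False                           # eliminated: outside the range
            if upper <= r:
                return True                            # proven to be inside the range
        return dist(entry.obj, q) <= r                 # bounds inconclusive: compute d(o, q)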
4.3 Secondary Storage

Access patterns of pivot-based structures are targeted toward minimizing CPU time, but they are not always suitable for storage on disk. For example, performing binary search in secondary storage is expensive, as it involves many seek operations. Disks are much better at performing sequential scans.

The Kvp structure is quite amenable to data stored on disk. It only requires a sequential scan of distance values. It does not involve a heavy processing burden, so processing time does not dominate I/O time. It requires relatively little memory, since only the vantage objects, the query object, and the distance vector of the processed object are needed.

4.4 Memory Usage

To the best of our knowledge, Kvp and its variants HKvp and EcKvp, which will be introduced in later chapters, are the only structures to store fewer distance values than the classic vantage-point methods. As with FQA, another way of storing less information is discretization, so that fewer bits are used for the distance values. Consider its simplest form, where the intervals have equal width, using b bits in a metric space where the maximum distance is $D_{max}$. This will map distances into buckets of width $D_w = D_{max}/2^b$. Since all the distances in the same bucket will be assigned the same distance value, we will have a maximum error of $D_w$ per distance value. Assuming query objects are distributed uniformly, we can approximate the error by $D_w/2$. Therefore, if a pivoting operation with radius r extends its range of objects that cannot be eliminated to $r + D_w/2$, we will be able to simulate a query process where only b bits of distance information are available.

Experiments were performed where the number of bits used per distance value was varied, and the number of pivots per object was adjusted accordingly to keep the total size fixed. In the following figures, a label "r, b, Vps" stands for the classical vantage-points algorithm using b bits per database object for a query radius of r, and "r, b, Kvp k" stands for Kvp with k pivots.

We can see from Figure 4.6 that using more bits is more effective than using more pivots for values below 7 bits. Also observe that using more bits is almost as important as using more pivots even beyond that point.

Figure 4.6: Query performance in terms of distance computations when the total memory usage is kept constant and the number of bits per distance value and the number of pivots stored are varied. The database has 10,000 uniformly distributed vectors in 20 dimensions.

One of the results that emerged from these experiments is that, ignoring construction cost, for a given Kvp structure and a metric space there is an optimal number of bits. This is due to the trade-off between the number of pivots and the number of bits. Another important result is that using a precision less than a critical value (such as 6 bits in this case) results in very high performance penalties. There is a sharp decline in query costs as we increase the number of bits. This explains why a structure like the vp-tree, which effectively uses very few bits of precision by not actually storing the distances but only a limited number of distance ranges, is significantly inferior in performance. As we increase the number of bits above this critical value and decrease the number of pivots, we see that performance degrades, but not as dramatically.
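The discretization described at the beginning of this section is easy to simulate; the following sketch (ours, under the stated equal-width assumption) shows the bucket mapping and the enlarged pruning radius discussed above:

    def quantize(d, d_max, b):
        # map a distance to one of 2**b equal-width buckets
        width = d_max / (2 ** b)
        return min(int(d / width), 2 ** b - 1)

    def effective_radius(r, d_max, b):
        # with an average quantization error of width/2, pruning uses r + width/2
        return r + (d_max / (2 ** b)) / 2.0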
Typical tree-based structures rely on the partition information implicit in each internal node to prune subtrees from the search. Since nodes are divided into a limited number of subtrees, this amounts to using insufficient information for pruning. Although vantage-point algorithms are very simple in structure, it is their ability to use a large number of bits of distance information that gives them such good performance.

Figure 4.7 compares the two methods under varying radii and memory parameters. Observe that the Kvp structure performs consistently better than Vps. Table 4.1 summarizes the speedups that Kvp provides over Vps in terms of the number of distance computations.

Figure 4.7: Effects of changing the search radius and the amount of total memory on performance. The database has 10,000 uniformly distributed vectors in 20 dimensions.

Table 4.1: A sample of the improvements Kvp provides over Vps.

    Radius   #Bits   Cost Ratio (Vps/Kvp)
    0.7      200     2.56
    0.5      100     5.22
    0.5      200     3.53
    0.5      400     1.73
    0.4      200     1.84

In general, vantage-point structures need more space and construction time than the relatively lightweight vp-tree. The Kvp structure adds flexibility that allows the index structure to better fit a given domain. In contrast, the vp-tree does not have this flexibility to trade off space for better query time. We ran some experiments to see how this space-query time tradeoff works for the Kvp structure.

We assume a balanced tree is stored in the typical "heap layout" in an array, such that a node at index i has its subtrees at indices 2i and 2i + 1 [26]. Therefore each internal node does not require any pointers, and only needs to store a distance value using 64 bits. Since the leaves of the tree, which make up roughly half of the nodes, do not need to store anything other than the object itself, this amounts to about 32 bits per object.

Given a database of size 10,000, the vp-tree performs approximately log2 10,000 ≈ 13 distance computations per database object at construction time. To produce a fair comparison, we created Kvp structures that use various multiples of 13 pivots and 32 bits of distance information. The construction cost and space are the independent variables. The construction cost dictated how many pivots the Kvp structure could use, and for a given number of pivots, the space dictated how many bits we could use per database object, which must be at least the bits per distance value times the number of pivot distances stored per object. For a given space limitation of B bits per object, we varied the number of bits per pivot distance, b, and calculated the corresponding number of pivot distances to store as $\lfloor B/b \rfloor$.

Figures 4.8 and 4.9 summarize the experimental results. The x-axis, construction cost, is measured as the ratio of the construction cost used by Kvp to a construction cost of 13 pivots per database object. The y-axis, space, indicates the multiple of 32 bits used per database object. The z-axis, query improvement, is the ratio of the vp-tree query cost to the particular Kvp query cost in terms of distance computations.
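For concreteness, the grid of configurations described above could be enumerated as follows (a hypothetical sketch; the particular values of B and b used in the experiments are those reported in the figures):

    BITS_PER_OBJECT_VP_TREE = 32
    for space_multiple in (1, 2, 4, 8):
        B = space_multiple * BITS_PER_OBJECT_VP_TREE   # space budget per object
        for b in (4, 5, 6, 7, 8, 9, 10):               # bits per stored pivot distance
            print(space_multiple, b, B // b)           # pivot distances stored per object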
Figure 4.8 shows a query improvement of more than 70-fold for Kvp over the vp-tree when using 8 times the storage and 2 times the construction cost of the vp-tree for vectors in 20 dimensions, whereas we have an almost 50-fold improvement for English words. When we use approximately the same resources as the vp-tree, we obtain query cost improvements of up to a factor of 2.59 for vectors and 6.97 for English words.

Note that Figures 4.8 and 4.9 do not show optimal results, since we see some performance decrease even as we increase construction cost. Using more pivots does not help if we do not have the space to store them. Since Kvp computes distances to all the pivots, we have a problem when the optimal number of distance computations is below the number of pivots used by Kvp. For a given space bound, there is an optimal construction cost that Kvp should use, and this trade-off is obviously not reflected in our graphs. In the next chapter, we will introduce a new structure, called HKvp, that can overcome this problem.

Figure 4.8: Comparison of Kvp and the vp-tree in 20-dimensional vector space for query radii of 0.3 (left) and 0.4 (right).

Figure 4.9: Comparison of Kvp and the vp-tree for queries performed in the metric space of English words for query radii 1 (left) and 2 (right).

4.5 Comparison of Kvp and Tree-Based Structures

Using a Kvp structure, one can easily vary a number of parameters, including the construction cost, the number of pivots used per object, the number of pivots stored per object, the number of pivots processed at query time per object, and the number of bits used per distance value. It is possible to observe how these changes affect performance.

In a sense, it is possible to view most of the existing structures as variants of the vantage point-based methods. For example, in a vp-tree with a branching factor of k, there is one pivot per node, all the objects in the subtrees can be eliminated with their distances to this pivot, and the number of branches has an effect similar to the number of bits used. For a database object, there are approximately as many pivots as the height of the tree. This view explains why changing k in the vp-tree has little effect on query performance: as k increases and pivots become more precise (which is similar to using more bits), the height of the tree becomes shorter and there are fewer pivots per database object.

A major problem with the vp-tree is that the only data that are used are the cutoff values. The individual distances of objects to the pivots are computed but then discarded. From the perspective of vantage points, it is also easier to see why GNAT with branching factor k improves on the vp-tree.
In GNAT, there are k pivots per node, and the distance ranges of the k subtrees to these pivots are stored. One slight disadvantage of GNAT is that the ranges of distances to a pivot can overlap. However, instead of having just one pivot per node, objects in GNAT make use of k pivots.

Tree-based methods have two advantages over the classical vantage points methods. First, whereas a pivoting operation involves one object in Vps, it usually involves groups of objects in tree-based structures. As explained before, this only improves the CPU overhead, and has a negative impact on the number of distance computations. Secondly, tree-based methods attempt to divide the space into clusters in order to benefit from the locality of pivots. This is similar to what priority vantage points and Kvp try to accomplish. While tree structures have varying degrees of success in clustering similar objects together, Kvp takes a direct approach and precisely computes the closest pivots. In addition, Kvp properly makes use of far pivots as well.

4.6 Conclusions

In this chapter we presented two new structures, Priority Vantage Points and Kvp. We showed that pivots are more effective if they are close to or distant from the query object. The Priority Vantage Points structure exploits this fact to process only the promising pivots, saving CPU overhead. Kvp takes this idea further by storing and processing only the relevant distances between database objects and the pivots. We showed that there is a performance penalty, in terms of the number of distance computations, when we ignore some of the distances, but it is not significant thanks to our prioritization schemes, which process the more promising pivots earlier.

We also showed that the Kvp structure can support dynamic insertions and deletions. It can also be stored in secondary memory very easily and requires only sequential scans.

Chapter 5
The HKvp Structure

One way to evaluate metric space indexing methods is to compare their query performance, in terms of the number of distance computations, against their construction costs, ignoring space consumption. We shall see that, while some index structures can take advantage of added preprocessing to improve query time, others either cannot or produce only minor improvements. We will show that global pivot-based methods like LAESA, Spaghettis, FQA, Omni Sequential and Kvp take advantage of their preprocessing time more effectively than tree-based structures. Although LAESA is one of the earliest pivot-based methods, it remains the best performer by this criterion. In part this is because more recent methods have focused on reducing the relatively high space and extra processing times that arise with global pivot-based methods.

In this chapter we introduce a new index structure, called HKvp. We will show that this structure can reduce query times by up to 49% compared to LAESA. HKvp is based on the Kvp structure, so it also has the ability to decrease both the space usage and the extra processing time, which LAESA lacks.

5.1 Tradeoffs for Existing Index Structures

Before introducing the HKvp structure, we will take a closer look at the relationship between construction cost and query performance for three different structures, which represent the three different families of metric space index structures described in Chapter 2.
GNAT was chosen as the standard example of clustering methods, since others in the same family sacrifice pivot precision to allow dynamic operations. We chose Kvp as the representative of global pivot-based methods. The various methods within this class that we have introduced usually have identical query performance in terms of the number of distance computations; they differ only in their space and computational overheads. For local pivot-based solutions, there were only two alternatives for continuous distance values, the vp-tree and the mvp-tree. The vp-tree was chosen over the mvp-tree because of its simplicity and the fact that a majority of other algorithms use it as a benchmark for comparison. Also, the mvp-tree puts some distance information in leaf nodes, making it a hybrid between the vp-tree and global pivot-based methods [15].

The basic assumptions underlying the working principles of global pivot-based methods are that we have a very large database with a high intrinsic dimension and a very expensive distance function. Therefore, it is assumed that the number of pivots is quite limited compared to the database size. There are cases when this assumption does not hold, however; for example, when the database size is limited or the application requires a high number of pivots to meet a given set of performance requirements. In other words, the ratio of the number of pivots to the database size is not always as low as assumed.

Given a query object q, a typical pivot-based structure begins the search process by computing the distances between all k pivots and the query object q. Assume that the average probability of a pivot not eliminating an object for a given radius is f. In this case, after processing k pivots, an expected $n \cdot f^k$ objects remain uneliminated. Therefore the total cost of the query would be

$\mathrm{Cost}(q, r) = k + n \cdot f^k$.

It can be argued that this function has an optimal value of k for a given query object and radius. This observation is supported in Figure 5.1. The cost of a query using k pivots is at least k; if there is a solution that would require fewer distance computations, typical pivot-based structures will fail to find it. Figure 5.1 also shows that after the optimal number of pivots is reached, the cost is dominated by the number of pivots.

Figure 5.1: Query performance of Kvp in terms of the number of distance computations for varying numbers of pivots. We see that there is an optimal setting for the number of pivots when either the query is relatively easy (left) or when the number of pivots is high (right). For both figures, the database is composed of 10,000 vectors uniformly distributed in 40-dimensional Euclidean space.

In the vp-tree, the vantage points in the nodes govern only the associated subtree. The out-degree of the nodes of the tree determines its height and therefore the total number of pivots. The higher the tree, the more internal nodes are produced, resulting in more pivots, more efficient query processing, but higher construction costs. On the other hand, increasing the branching factor results in narrower distance ranges, increasing the effectiveness of the pivots as discussed earlier. Figure 5.2 shows the query performance of the vp-tree as a function of the construction cost for different branching factors. We see that there is a strong correlation between construction cost and query performance. Similar to Kvp, there is an improvement in query cost as we increase the construction cost. The reason there is not much difference between the different configurations is probably the precision loss we encounter for low branching factor values, as discussed before. Observe that as construction time increases, the vp-tree does not improve query times as effectively as Kvp does. Otherwise, the two graphs are quite similar.

In GNAT, increasing the branching factor of the tree decreases the height in a similar fashion as in the vp-tree. The difference is that the number of pivots per level also increases in direct proportion to the branching factor. As a result, increasing the branching factor of GNAT results in much higher construction times than for the vp-tree. Figure 5.2 also offers a review of GNAT for different branching factors. We can see that the construction cost can grow as high as 50 million distance computations, compared to about 100 thousand for the vp-tree.

Figure 5.2: Query performance of the vp-tree (left) and GNAT (right) in terms of the number of distance computations versus the construction cost for a query radius of 0.44. The database is composed of 10,000 vectors uniformly distributed in 40 dimensions.

Figure 5.3 summarizes the overall picture. In order to make the data presentable, we show only a limited number of GNAT configurations. We can see that GNAT has very high construction times, and in some cases yields worse performance than the vp-tree even when given more construction time. Due to the nature of the vp-tree, it is not possible to freely increase the construction cost, since it achieves its maximum when the branching factor is 2. The reason GNAT outperforms the vp-tree is that it can use more pivots, and thus can store more distance information. We also see that Kvp yields much better results when using the same amount of setup time as the other tree methods.

Figure 5.3: Performance comparison of three structures by their construction costs for a query radius of 0.44. The database is composed of 10,000 vectors uniformly distributed in 40 dimensions.

As Figure 5.1 reveals, however, Kvp is not perfect in its utilization of construction time. We see that it may perform worse when allowed more pivots. To solve this problem, we need a way of eliminating pivots as well as ordinary database objects, which would result in the possibility of processing only a subset of the available pivot set.

AESA is a pivot-based structure in which all the database objects are selected as pivots [43]. All the distances between objects are computed, and all this information is used to successively process pivots and eliminate other pivots. LAESA is a modification of this approach in which only a subset of the database objects are used as pivots [35]. The basic algorithm used by LAESA, adapted to a range search for a query object q with radius r over a set of objects O, is presented in Algorithm 1. The algorithm maintains a global array holding, for each object, a lower bound on its distance to q. An object can be eliminated if its lower bound is greater than r.
The lower bounds are also used to choose the next most promising pivot from the unprocessed list. It was reported [35] that choosing the pivot with the least lower bound gives the best results. The criterion for eliminating pivots is based on a pre-determined ratio. LAESA is one of the earliest index structures, and so it has been surpassed by more recent pivot-based structures that use space and processing time more efficiently.

Algorithm 1: Range Query for LAESA. Note that all the distances between pivots are pre-computed.
Input: query object q, radius r, set of database objects O, set of pivots P.
Output: resultSet, the set of objects that qualify for the query.

    toProcess ← O
    choose an arbitrary object s ∈ P
    while toProcess ≠ {}
        compute Dsq = d(s, q)
        toProcess ← toProcess \ {s}
        if Dsq ≤ r
            resultSet ← resultSet ∪ {s}
        for all u ∈ toProcess
            if s ∈ P
                update u's approximate distance to q
            if u can be eliminated from the search
                if u ∈ P and the pivot elimination criterion holds
                    toProcess ← toProcess \ {u}
                else if u is an ordinary object
                    toProcess ← toProcess \ {u}
        set s to be the next most promising object in toProcess

5.2 The HKvp Structure

We introduce a new structure, called HKvp, which stands for "High performance Kvp". HKvp maintains the complete set of distances between pivots, and eliminates pivots as well as ordinary database objects, like AESA. Moreover, HKvp is an extension of Kvp, and therefore it exploits Kvp's capability to reduce space and processing times. Applying other well-known CPU or space reduction methods to LAESA is not straightforward. An example HKvp is shown in Figure 5.4.

Figure 5.4: A sample database of 9 vectors in 2-dimensional space, and an example of the HKvp structure on this database that keeps 2 distance values per database object. (a) The locations of the objects. Boxes represent objects that have been selected as pivots. (b) The distance matrix between pivots and regular database objects. For each object, the 2 most promising pivot distances are selected to be stored in HKvp (indicated with a gray background). (c) The HKvp consists of the distances between pivots and the object entries. Each object entry keeps the id of the object and an array of pivot distances.

LAESA only makes use of lower bounds for pivot and object elimination. HKvp also makes use of upper bounds, both for object elimination and for choosing the next pivot to process. Another improvement of HKvp is the maintenance of distance bounds for pivots. Rather than eliminating a pivot right away, it waits until all pivots are known to be inside or outside the query range, when we have the greatest information about the bounds of a pivot, and then it chooses which pivots to explore further and calculates their exact distances to the query object. Instead of discarding the pivot bounds that are only approximate, it uses them for elimination as well.
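The bound maintenance itself is simply the triangle inequality applied between pivots. A minimal sketch of ours (with hypothetical names) of how the interval of an unprocessed pivot v is tightened after some pivot p has been compared against the query:

    def update_pivot_bounds(bounds, v, d_qp, d_pv):
        # bounds[v] is the current interval [lower, upper] on d(q, v);
        # d_qp = d(q, p) was just computed, d_pv = d(p, v) is known from construction.
        lower, upper = bounds[v]
        bounds[v] = (max(lower, abs(d_qp - d_pv)), min(upper, d_qp + d_pv))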
The first phase of HKvp involves computing the distance bounds for the pivots. Some of these distances to the query object are computed exactly, and the rest are only approximations based on distance relations to other pivots. In the second phase, the objects are visited and the pivot bounds are used to eliminate them. The corresponding algorithm is presented in Algorithm 2. An illustration of the first phase of HKvp is given in Figure 5.5.

Algorithm 2: Range Query for HKvp. Note that all the distances between pivots are pre-computed.
Input: query object q, radius r, set of database objects O, set of pivots P.
Output: resultSet, the set of objects that qualify for the query.

    initialize every pivot's bounds to [−∞, +∞] in pivotBounds
    toProcess ← P
    finalBounds ← {}
    while toProcess ≠ {}
        get the most promising pivot p based on pivotBounds
        compute Dpq = d(p, q)
        if Dpq ≤ r
            resultSet ← resultSet ∪ {p}
        toProcess ← toProcess \ {p}
        remove the bounds of p from pivotBounds and put them into finalBounds
        for all bounds [Sv, Ev] ∈ pivotBounds
            update [Sv, Ev] based on Dpq and d(v, p)
            if Ev ≤ r
                resultSet ← resultSet ∪ {v}
                toProcess ← toProcess \ {v}
            else if Sv > r
                toProcess ← toProcess \ {v}
    ncompute ← (1 − pivotDropRate) · |pivotBounds|
    repeat ncompute times
        get the most promising pivot p based on pivotBounds
        compute Dpq = d(p, q)
        if Dpq ≤ r
            resultSet ← resultSet ∪ {p}
        remove the bounds of p from pivotBounds and put them into finalBounds
        update the bounds of the other pivots
    put the remaining elements of pivotBounds into finalBounds
    process the regular database objects as in Kvp, using finalBounds

Figure 5.5: A sample run of the first phase of the HKvp range query algorithm on the database given in Figure 5.4. The query object is the vector (0.73, 0.07), and the query radius is 0.2. First, the pivot o4 is selected for processing. Based on their distances to o4, the other pivots improve their bounds on the distance to q. After this step, o3 and o6 are eliminated and removed from the processing queue. This leaves only o0 to process. o3 and o6 improve their bounds even further based on their distances to o0, since this may improve the likelihood of the elimination of regular database objects later on. After processing o0, the range query may proceed to processing the regular database objects based on the pivot distance information compiled in pivotBounds and finalBounds.

5.2.1 Pivot Selection Policy

We have seen that HKvp progressively computes distances to pivots that are expected to be promising. In this section, we explain how we determine how valuable a pivot is expected to be. Note that this discussion is not about the selection of which subset of the objects to use as pivots at construction time.

Previous experiments have shown that the best way to select the next pivot to process is to choose the one with the lowest lower bound on the distance to the query object [40, 35]. This chosen pivot is then compared against the query object, and based on this distance, the bounds for the other pivots are updated. We explore two new approaches for selecting the next pivot: the pivot with the highest upper bound, and the pivot with the greatest difference between its lower and upper bounds.
We also consider a combination of these policies. The idea is that using several policies at the same time will improve the evaluation of pivots that have different strengths. In order to control the relative importance of these policies in more detail, we make use of two parameters. FarToCloseRatio dictates the ratio of "presumed" far pivots to close pivots. Likewise, WideToCloseRatio is used to control the proportion of "wide" candidates. In the following figures, a ratio a::b denotes a FarToCloseRatio of a and a WideToCloseRatio of b. Using 1::1 makes use of the three policies in equal amounts, whereas 0::0 uses only close pivots, inf::0 uses only far pivots, and 0::inf uses only wide pivots.

Active pivots are stored in a processing queue. As each pivot is processed, it eliminates some other pivots from the processing queue. In the following figures we plot the number of pivots remaining to be processed as a function of the number of pivots already processed.

Figure 5.6 shows results for 500 pivots chosen from a uniform distribution in 20-dimensional space. We get the worst results from processing far pivots, whereas wide pivots provide a slight improvement. The best of the pure policies is using close pivots, which is supported by prior research [35]. There is still room for improvement, however, as shown by the equal combination policy denoted by 1::1. When we mix the policies in proportion to their individual performances, such as 0.7::0.9, we obtained an improvement of up to 10% over the original idea.

Figure 5.6: Pivot reduction in dimension 20 for a query radius of 0.9 and 500 pivots.

Our results suggest that, especially in relatively high dimensions, selecting the next pivot based on a combination of factors gives the best results. These ratios can be tuned further for a given radius, dimension and pivot number, but we do not explore the matter in further detail here, given that using an equal combination policy seems to provide acceptable improvements.

5.2.2 Drop Rate

In the first phase, pivots are processed until everything is known about their inclusion in or exclusion from the query range. If all the objects were also pivots, as in AESA, this step would be sufficient to complete the query processing. However, there are some ordinary database objects that still need to be explored. At one extreme we have Kvp, which computes all the distances to pivots first. At the other extreme, we can process the ordinary objects using only what we know about the pivots from the first phase. This decision is controlled by a parameter called the drop rate.

Figure 5.7: Query costs for varying drop rate values, for a query radius of 0.9 in 10 dimensions.

When the first phase is executed, we have a set of pivots PP that have their exact distance to the query object calculated, and the rest of the pivots PR know their bounds with sufficient accuracy to be excluded from or included in the query range. We continue processing the set PR to obtain better bounds for pivots, but this time we restrict the processing to a fraction of the cardinality of PR. We drop a portion of PR and process the rest of the pivots in PR.
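In other words, only a fraction of the surviving pivots PR is processed further. A minimal sketch of ours of this step, with a hypothetical most_promising selector:

    def second_pass(PR, drop_rate, most_promising, process_pivot):
        # drop_rate = 1.0 reproduces the bare first phase; 0.0 processes all of PR
        ncompute = int((1.0 - drop_rate) * len(PR))
        remaining = list(PR)
        for _ in range(ncompute):
            p = most_promising(remaining)   # e.g. the pivot with the lowest lower bound
            remaining.remove(p)
            process_pivot(p)                # compute d(p, q) and tighten the other bounds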
Earlier experiments showed that, for a fixed number of database objects, as the number of pivots is increased, more of them should be dropped for better performance. The probability that a pivot will be useful for the elimination of ordinary objects increases as there are more objects per pivot. Figure 5.7 compares the query performance of HKvp in 10 dimensions for a radius of 0.9 for drop rates of 1.0, 0.9, and 0.8. Other settings yielded similar results, and are not presented here. As we see, as the ratio of pivots decreases, we need to make use of more pivots.

The point where the performance of the drop rate value of 1.0 equals that of 0.9 is particularly interesting, since this is when there is a substantial need to start adjusting the drop-rate parameter to use more pivots. We ran some experiments to see how the size of the database and the query radius affect this critical point. Figure 5.8 summarizes our results. As expected, we see that larger databases need to start dropping pivots later than smaller ones. We also see that as we increase the search radius, and therefore make the query more difficult, we need to process more pivots, and so the point where we should stop eliminating all pivots moves in that direction.

Figure 5.8: The crossing point between drop rates of 1.0 and 0.9 in 10 dimensions for various database sizes (left) and query radius values (right).

5.2.3 Indirect Elimination

One of the improvements of HKvp over its predecessors is that it uses pivots without a direct distance to the query object for the elimination of ordinary database objects. Recall that after the pivot elimination step, we will have lower and upper bounds on the distances of some pivots. We make use of these bounds to deduce distance bounds from the ordinary objects to the query object.

In order to see the extent to which indirect elimination contributes to query performance, we ran some experiments where we counted the number of indirect cuts compared to regular cuts. Figure 5.9(a) summarizes the results for a 10-dimensional database with 2,000 objects. We see that approximate cuts provide a fair amount of object elimination. As the number of pivots grows, we see that approximate cuts are of greater value. Figure 5.9(b) provides an explanation. Here we plot the size of PP versus the size of PR. We see that the former grows much more slowly, and the ratio of approximate pivots increases as we increase the total number of pivots.

Figure 5.9: (a) Comparison of direct cuts and approximate cuts in HKvp (left). (b) The number of approximate and direct pivots for HKvp (right). For both graphs, the database consists of 2000 points uniformly distributed in 10 dimensions.

Figure 5.10 compares indirect versus direct cuts for varying radius values using 500 pivots. We see that direct cuts are dominant for difficult queries since they provide greater certainty; however, as the queries become easier, approximate elimination becomes quite important.

Figure 5.10: Breakdown of indirect and direct elimination of objects for varying query radius in 10 dimensions, on a database with 500 pivots and 1500 ordinary objects.
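A sketch of how such an approximate pivot can still prune an object (our own illustration): given only an interval [Sv, Ev] on d(q, v) and the stored distance d(v, o), the triangle inequality yields bounds on d(q, o).

    def indirect_bounds(Sv, Ev, d_vo):
        # [Sv, Ev] bounds d(q, v); d_vo = d(v, o) is known from construction.
        lower = max(d_vo - Ev, Sv - d_vo, 0.0)
        upper = d_vo + Ev
        return lower, upper   # o is cut if lower > r, or accepted if upper <= r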
5.2.4 Pivot Limit

The pivot limit is a parameter of Kvp that indicates the number of pivot distances stored and used per database object. As an extension of Kvp, HKvp can also adjust this parameter. Rather than storing the distances to all pivots, keeping only the most relevant pivots reduces both storage and processing times simultaneously.

Figure 5.11(a) shows the effect of adjusting the pivot limit on query times for a database of size 4,000 in 10 dimensions for different radius values. A standard drop rate of 1.0 was used in these experiments, which means that only the necessary pivots were computed against the query object. The total number of pivots was 500. We see that the pivot limit parameter is more helpful for lower radius values. For radii of 0.5 and 0.6, storing only 300 pivots produced results very close to those of using the full 500 pivots.

Although the results look similar to typical Kvp experiments, there is one slight difference: the pivot limit does not necessarily reflect the actual number of pivots used per database object. Some of the pivots are not usable since they have been eliminated in the first phase of HKvp. By adjusting the drop-rate parameter, we can increase the number of pivots that are actually used, and the optimal drop-rate parameter will be different for different pivot-limit values. Figure 5.11(b) explores the effect of the drop-rate parameter on the pivot-limit parameter. Varying drop rates with increments of 0.1 were applied and the best one was taken as the optimized result. We see that with a good choice of drop rate, we can decrease the performance penalty that results from limiting the number of pivots.

Figure 5.11: (a) Using different pivot limit parameters (left). (b) Optimizing the drop rate for better pivot limit utilization (right). The database size is 4000; the objects are uniformly distributed in 10 dimensions.

Figure 5.12: Ratio of the costs of the vp-tree to Kvp (left) and the vp-tree to HKvp (right) for various construction cost ratios, on a database of 10,000 uniformly distributed random vectors in 40 dimensions.

5.3 Query Performance of HKvp

We mentioned earlier that the number of pivots in a pivot-based structure has an optimal value. Figure 5.12 compares Kvp with the vp-tree for three different query radius values. We see that Kvp can perform up to 120 times better using 3 times the construction cost of the vp-tree on a database of 40-dimensional uniformly distributed vectors. Even though the improvement we obtain by increasing the construction cost is impressive, we see that the performance of Kvp begins to decline once the optimal number of pivots is reached.
The optimal number of pivots increases as the query radius increases, because more pivots are needed for queries that are more difficult.

We compared the performance of HKvp against Kvp to determine the performance gain due to its better use of pivots. Figure 5.13 summarizes our results on a database of 1,000 vectors using varying numbers of pivots. We see that HKvp with a drop-rate parameter of 1.0 improves on Kvp when the number of pivots is high, but produces poorer results otherwise. The reason is that the pivot elimination process is targeted at minimizing the number of pivots being processed, but the unprocessed pivots decrease the likelihood of the elimination of ordinary objects. This shortcoming can be rectified by adjusting the drop-rate parameter according to the ratio of pivots to the database size. The figure also shows the result of HKvp with the drop-rate parameter optimized. In this case it outperforms the two extreme cases in all our tests.

Figure 5.13: Comparison of HKvp with Kvp on a database of 1000 vectors.

Figure 5.12 also shows a comparison of HKvp with the vp-tree. In that figure, Kvp was seen to perform worse as more pivots were used. We see that this shortcoming is eliminated in HKvp.

Our experiments comparing HKvp to LAESA showed that it performs up to 49% fewer distance computations. Some of our results are summarized in Figure 5.14. In all experiments, the same greedy algorithm was used to determine the optimal value of the drop-rate parameter for both structures. In order to obtain a better understanding of the query cost improvements, Figure 5.15 presents two of the settings in Figure 5.14 in terms of the ratio of the query costs. Observe that when the number of pivots is low, both structures work very much like Kvp and use all the pivots available. This explains why HKvp does not provide improvements over LAESA for very low pivot numbers. Otherwise, we see that there is a consistent improvement over LAESA for the cases HKvp was designed for. We chose a pivot selection policy of 1::1 for these experiments, which yielded query costs up to 43% lower than LAESA. Optimizing this policy provided up to 6% further improvement in query cost.

Figure 5.14: HKvp compared to LAESA for various settings. The drop parameters were optimized for both structures.

Figure 5.15: Ratio of the costs of LAESA to HKvp for query radius 0.7, on a database of 4,000 uniformly distributed random vectors.

5.4 Summary

The efficiency of a metric space indexing algorithm may be measured by a number of factors, including space usage, construction cost, processing time and the number of distance computations needed to process a query. In this chapter, we focused our attention on the effects of construction cost and the number of distance computations.
We evaluated three major structures from this point of view, and showed that the performance of these structures is highly correlated with their construction times. Our experiments clearly show that global pivot-based methods offer a better performance-to-construction-cost ratio. We also showed that the optimal number of pivots varies with the choice of method and query radius. Easier queries, that is, queries executed on databases having lower dimensions or queries with small radius values, tend to require fewer pivots than more difficult queries. A pivot-based method may perform worse even when given more setup time, because it has more pivots to process than needed.

Although it lacks some of the features of its modern descendants, LAESA has the ability to eliminate pivots as well as ordinary database objects, and produces better results when the ratio of pivots to the number of ordinary database objects is high. We also introduced a new structure, called HKvp, which performs significantly better than LAESA while providing greater flexibility. HKvp achieves this flexibility by trading off query performance (in terms of distance computations) against lower overheads in space and processing times.

The major drawback of HKvp is the need to adjust the drop-rate parameter to produce the best results. A possible solution is to modify the HKvp search to try different drop-rate values and train itself on which drop rates work best for different query radius values. This is purely a query-time adjustment, and varying the drop rate would not result in any structural changes to HKvp.

Chapter 6
The EcKvp Structure

In this chapter we introduce a new structure, called EcKvp, which is based on the Kvp structure. It shares with Kvp the advantages of low space and query time, but with only a small performance penalty it offers significantly lower preprocessing time. It achieves lower construction costs by storing the pivots in an index structure and retrieving distance values between objects and pivots by querying this structure, rather than computing all the distances between them.

6.1 Introduction

In order to better understand how EcKvp works, let us consider a typical global pivot-based structure. Given a set of objects O, we first choose P as global pivots. For each object o, we compute the distances to all the pivots in P and create a pivot distance vector PDo containing these distances. Thus, the index structure is basically a distance matrix between P and O \ P.

Let us consider a range query centered about a query object q with radius r. Recall that our basic strategy is to eliminate database objects from consideration without explicitly computing the distance between the database object and the query object. We first compute the distances between q and all the pivots in P. For each object o, we use the information in PDo to attempt to avoid computing the actual distance between o and q. Let E denote the set of such eliminated objects. Given that PDop contains the distance between each pivot p and each non-pivot o, it follows from the triangle inequality that o can be eliminated if its distance to p is less than d(q,p) − r or if its distance to p is greater than d(q,p) + r. That is,

$o \in E \iff \exists\, p \in P \; \bigl( (PD_{op} < d(q,p) - r) \lor (PD_{op} > d(q,p) + r) \bigr)$   (6.1)

Observe that this test does not require that any distances be computed at query time other than those between the pivots p and the query object q.
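As a minimal sketch of this test (our illustration, not the dissertation's code; PD is assumed to map (object, pivot) pairs to the precomputed distances, and d_qp to hold the query-to-pivot distances):

    def eliminated(o, PD, d_qp, pivots, r):
        # Equation (6.1): o is eliminated if some pivot certifies that it lies
        # outside the query range, using only precomputed distances.
        return any(PD[o, p] < d_qp[p] - r or PD[o, p] > d_qp[p] + r
                   for p in pivots)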
Only when we fail to eliminate an object based on all the stored pivot distances in PDo will we compute the actual distance between o and q to determine whether d(q,o) < r. Thus, the total cost of the query, in terms of the number of distance computations performed, is $|O \setminus E|$.

Variants of this basic idea differ in the way they store the distance information and the way they process that information to eliminate objects. For instance, Kvp purges the less effective distances in PDo to avoid storing and processing the complete distance matrix.

The EcKvp structure introduces a new method for creating the pivot distance vector. It makes use of an internal index containing only pivots to acquire this information. We call this structure the pivot index or the inner index. The pivots of the global structure are regarded as ordinary database objects inside the inner index. The database objects of the global structure are used as query objects at construction time. By running a query on the inner index we discover distance relationships between pivots and regular objects. We call this an inner query. The structure to use as the inner index and the type of the inner query to be executed on it are the parameters of the EcKvp structure.

Because of the special needs of the EcKvp construction process, it is important to augment the information returned by similarity search queries applied to the inner index. Typical similarity search queries, such as range and k-nearest neighbor queries, only identify which database objects meet the search criteria; it is not a requirement to return actual distances to the query object. But this distance information will be of value to the EcKvp search algorithm. Also, in the process of answering the similarity queries we may discover useful information about the distance relationships for objects that are not in the result set. This information is discarded when the search is completed, but may be of value to the EcKvp search.

Let us consider how to augment the information returned from a query for the purposes of building the EcKvp structure. Assume that database object o is being inserted into an EcKvp structure, and as part of the insertion, we run a query on the inner index to determine PDo. Let us assume that the inner index is a global pivot-based structure. When answering the query on o, the search algorithm seeks to eliminate objects of the inner index. Let us say that in the middle of the query algorithm, we are processing the pivot pi of the inner index on the object p to decide whether p should be in the result set. Recall that p can be an ordinary object in the context of the inner structure, but it is one of the pivots of the EcKvp structure. Let Do denote the distance between pi and o, Dp denote the distance between pi and p, and r denote the query radius. Letting Dop denote the distance between p and o, we can conclude that

$|D_o - D_p| \le D_{op} \le D_o + D_p$.   (6.2)

In this way, we can determine upper and lower bounds on the value of Dop. We call this the pivot bound of object o to pivot p, or PBop, and the set of all pivot bounds for object o is denoted by PBo. Thus, Equation (6.1), which describes the cases in which an object can be eliminated, can be expressed as

$o \in E \impliedby (PB_{op}.\mathrm{upperBound} < d(q,p) - r) \lor (PB_{op}.\mathrm{lowerBound} > d(q,p) + r)$.   (6.3)

If the pivot bound PBop is strong enough to prove that o is either inside the query range or outside the query range, we stop the processing of o; otherwise we move on to process other pivots.
Each pivot we process will tend to improve PBop, and if none of the pivots are strong enough to eliminate o, we compute the actual distance between o and p. Thus, although this process is oriented towards answering the range search with as few distance computations as possible, it produces some very useful information in the form of PBo as a byproduct. Recall that the typical range query discards information about objects that lie outside of the query radius. Our special query returns all the pivot bounds produced. Another modification we make is to let the query process look at all the pivots of the inner index, even if the object can be proven to be inside or outside of the result set. This way we can further strengthen the pivot bounds without incurring additional distance computation costs during preprocessing. 71 The query radius used for the pivot index, which we will call inner query radius, is a parameter. Selecting a small inner query radius value will result in low cost query executions, but the distance information it provides will also be limited. Therefore, there is a trade-off to consider in the choice of the radius parameter, which we will explore later in the chapter. In summary, the EcKvp structure consists of an internal pivot index and PB, the set of all pivot bounds for all the objects in the database. The construction algorithm is given in Algorithm 1.The algorithm for range search is very similar to that of Kvp. The only difference is we need to take into account that we may not have exact distance of objects to pivots. The algorithm is given in Algorithm 2. The algorithm for producing PBo is independent of the EcKvp structure. Algorithm 3 presents the algorithm for a global pivot-based inner index PI using a modification of the range search. 6.2 The Pivot Index The performance of the pivot index is crucial for the efficiency of the EcKvp construction. If EcKvp were to use sequential scan for the inner queries, effectively computing the distances between each database object and all pivots, it would perform exactly the same amount of work as the Kvp structure. Therefore, any index structure that works better than sequential scan will provide us with improvements on the overall construction cost. The exact amount of improvement can be quantified as follows. Given a query range, define its cost ratio to be the ratio of the average 72 Algorithm 1 EcKvp Construction Input O, set of database objects Output PB, the pivot bounds that make up the index. PB?{} select P ?O as pivots. construct the inner structure PI as the pivot index consisting of objects in P for all o?O?P do execute a query on PI using o as the query object and produce PBo remove the unpromising entries in PBo sort PBo so that entries that are more promising appear in front PB?PB?PBo 73 Algorithm 2 EcKvp Query Input EcKvp structure E, query object q, query radius r. Output RS, the set of objects within query radius. RS?{} for all pivots p?E do Compute d(p,q) for all objects o?E do for all PBop?PBo do if r > d(p,q) +PBop.upperBound RS?RS?o move on to the next object if [d(p,q)?r,d(p,q) +r] and PBop does not intersect move on to the next object Compute Doq as d(o,q) if Doq < r RS?RS?o 74 Algorithm 3 Algorithm for constructing PBo Input Owner of the entry o, pivot index PI Output PBo PBo?{} for each p?PI do PBop? 
6.2 The Pivot Index

The performance of the pivot index is crucial for the efficiency of the EcKvp construction. If EcKvp were to use sequential scan for the inner queries, effectively computing the distances between each database object and all pivots, it would perform exactly the same amount of work as the Kvp structure. Therefore, any index structure that works better than sequential scan will provide us with improvements in the overall construction cost. The exact amount of improvement can be quantified as follows. Given a query range, define its cost ratio to be the ratio of the average query cost for this range, in terms of the number of distance computations, to the size of the index.

In this chapter, we will make the reasonable assumption that the number of pivots is significantly smaller than the size of the database, and therefore the construction cost of the pivot index will be a negligible component of the overall preprocessing costs. This means the inner index will consist of a small database of objects, and so we are willing to spend a significant amount of time in constructing the inner index in order to speed up query processing. Given these requirements, we will show that the HKvp structure is an excellent choice for the pivot index. The experiments we will perform on HKvp will be based on uniformly distributed vector spaces.

6.2.1 The HKvp Index Structure

As introduced in Chapter 5, HKvp is a variant of the Kvp structure that has the advantage of eliminating pivots as well as database objects during a search. In this way, HKvp can avoid computing the distances to all the pivots. In order to eliminate pivots based on the distance of the query to another pivot, HKvp stores the distances between pivots. Therefore, the construction cost and space of the HKvp structure grow quadratically with the number of pivots.

Recall that HKvp query processing starts with the processing of the pivots. First, it computes the distance of the query object to some of the pivots, and based on these distances, it computes bounds on the distances to the rest of the pivots. The important point is that, at the end of this phase, the bounds of every pivot are good enough to infer whether it is inside or outside the search radius. In the second phase it processes the non-pivot database objects based on the pivot bounds computed in the first phase. The first phase works in an incremental fashion. At each step we find the most promising pivot based on some policy, compute the distance from this pivot to the query object, and update the bounds of the other unprocessed pivots based on this new information. The process continues until all the pivot bounds are sufficient as described above.

6.2.2 Cost of the Inner Queries

The number of pivots, which determines the size of the pivot index, and the inner query radius are both parameters of the EcKvp construction that influence the cost of the inner queries. Let n be the size of the database and p be the number of pivots selected out of these n objects. As mentioned earlier, we assume that p is significantly smaller than n. Since our inner index is an HKvp structure in which all objects are pivots, the construction of the pivot index involves O(p²) distance computations. On the other hand, assuming that HKvp provides sub-linear query performance, the cost ratio of HKvp will improve as we increase p. Figure 6.1(a) supports this observation, showing that the cost ratio sharply improves as the size of the pivot index grows.

Figure 6.1: Cost ratio for various settings in 10 dimensions: (a) cost ratio by database size for a fixed query radius of 0.5; (b) cost ratio by query radius for various database sizes.
Keeping in mind that the improvement in preprocessing time of EcKvp over Kvp is related to the cost ratio, for the setting in Figure 6.1 we see that an improvement of 80% is possible with as few as 200 pivots. The improvements are more dramatic at the lower portion of the graph, and for this particular setting a pivot pool size of 1000 to 1500 seems to offer a good compromise. Figure 6.1(b) shows that the results are consistent under different inner query radius values.

6.2.3 Information Received from the Inner Queries

We have seen that increasing the size of the inner index tends to improve the cost ratio, while increasing the query radius tends to make the cost ratio worse. Although our results show that improvements in query times are possible, they do not provide a clear picture of how query performance depends on the partial information obtained from the inner queries. Query performance will depend on both the quantity and the quality of the information received from the inner queries.

One way to measure the quantity of information received from the inner queries is to count the number of exact bounds, that is, bounds whose lower and upper ends are equal and hence provide the exact distance between the object and the pivot. For any index structure, this is also equal to the cost of the query in terms of the number of distance computations. Therefore, while it is true that the cost of the query is expected to grow sub-linearly with the size of the inner index, the amount of exact information also follows the same pattern. Figure 6.2 shows the amount of information viewed from this perspective. We see that the amount of information grows rapidly as we increase the inner query radius, and that it is possible to pick a radius value that yields any desired inner query cost.

Figure 6.2: Number of exact distances by inner query radius for a pivot pool of size 1500 uniformly distributed in 15-dimensional space.

The number of exact distances reflects only the quantity of the information, not its "quality". This is because pivots have varying degrees of success in their ability to prune database objects from consideration. Specifically, we know that pivots are most valuable when they are either close to or far from the query object [11]. To investigate this issue, we ran experiments to analyze the distribution of exact distances between query objects and pivots for various inner query radius values. Figure 6.3 summarizes our results. We see that smaller radius values produce higher quality distance information, since the pivot distances are closer to the object. This behavior is symmetric, meaning that as the inner query radius approaches the other end of the distance spectrum, we again see that more restrictive queries produce information with more favorable distributions.

Figure 6.3: Distribution of the distance information for various inner query radii; (right) distribution weighted by the size of the population.

The discussion about the set of pivots for which we have discovered the exact distances is not sufficient to give us the whole picture.
We need to somehow incorporate the rest of the pivot bounds, those that are not exact. We define the efficiency of a pivot bound PBop with respect to a query radius r to be

    Ef(PBop, r) = the probability that o will be eliminated by p for a radius of r.

Given F(), the cumulative distribution function of the distance from a query object to the pivot, we can approximate the efficiency as follows:

    Ef(PBop, r) ≈ F(PBop.lowerBound − r) + (1 − F(PBop.upperBound + r))          (6.4)

Note that Equation (6.4) is directly derived from Equation (6.3), and corresponds to the portion of the distance distribution for which o can be eliminated by p. When we run a query for object o on the inner index, we get a set of pivot bounds, PBo. We define the total efficiency of PBo with respect to a query radius r to be

    Ef(PBo, r) = Σ_{PBop ∈ PBo} Ef(PBop, r)                                      (6.5)

Figure 6.4: (a) Total efficiency versus the inner query radius. (b) Cost per efficiency versus the inner query radius.

Figure 6.4(a) shows the total efficiency values achieved for different inner query radius values. We see that the total efficiency values show similar trends for different difficulty levels. This shows that the choice of the EcKvp parameters does not depend on the query radius, which may vary and is only available at query time. At this point, it is natural to ask how the cost per unit of efficiency changes as a function of the inner query radius. Figure 6.4(b) shows that there is an optimal radius value in this respect. Figure 6.4 also presents evidence as to why EcKvp has great potential for improving preprocessing costs: the inner query radius value where information is most expensive is also the point where the query is most difficult. This is exactly the path taken by other global pivot-based methods. In contrast, EcKvp can generate information with significantly lower computational cost.

We know that the cost of generating exact pivot distances is the same for all inner index sizes: each distance computation produces one exact distance. We performed experiments to determine how the total efficiency output is affected by the inner index size. Figure 6.5 shows this relationship for a constant inner query radius value of 0.2. We see that larger index sizes produce information more efficiently than smaller ones. The more difficult the queries, the wider the gap between the performance of different inner index sizes.

Figure 6.5: Cost per unit efficiency versus the inner index size for various query radius values, for an inner query radius of 0.3, in 10 dimensions.
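As an illustration of Equations (6.4) and (6.5), the following Python sketch estimates the total efficiency of a set of pivot bounds. The cumulative distribution `F` would in practice be estimated from a sample of query-to-pivot distances, and the bound representation is an assumption made for the example.

```python
def bound_efficiency(lower, upper, r, F):
    """Approximate Ef(PB_op, r), Equation (6.4): the probability that the
    bound [lower, upper] on d(p, o) eliminates o for a query of radius r,
    where F is the cumulative distribution of the query-to-pivot distance."""
    return F(lower - r) + (1.0 - F(upper + r))

def total_efficiency(pb_o, r, F):
    """Ef(PB_o, r), Equation (6.5): sum of the efficiencies of all bounds in PB_o."""
    return sum(bound_efficiency(lower, upper, r, F)
               for (lower, upper) in pb_o.values())
```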
6.2.4 HKvp Parameters

The HKvp structure has a number of parameters whose values can be chosen to optimize its performance as the pivot index. The pivot limit parameter controls how many pivot distances per object are stored. In our experiments we stored all possible distances, but the pivot limit can be used to reduce the space and CPU requirements of the pivot index. Another parameter is the pivot selection policy, which determines the order in which pivots are processed. Although several alternatives exist, we have not pursued optimizing this parameter in our experiments; as the default ratio, we used 1:1 as explained in Chapter 5.

Yet another parameter is the number of pivots to use. Even though the EcKvp parent structure may have specified the number of pivots to keep in the pivot index, HKvp does not have to appoint all of them as pivots in its own scope. Figure 6.6 shows query results for various choices of the number of pivots. The database size was 2500 and the number of pivots was varied, so that with k pivots there are 2500 − k ordinary database objects. We see that in this example the performance tends to level off after assigning 1500 pivots. Using fewer pivots for the pivot index improves its construction cost considerably, but at the price of slightly poorer query performance.

Figure 6.6: Query cost versus the number of pivots for the inner index in 10 dimensions for a query radius of 0.5.

6.3 Performance of EcKvp

In this section, we compare the performance of EcKvp against the HKvp structure. To simplify the presentation, in our comparisons we will ignore space usage and focus instead on construction costs. As we have mentioned, EcKvp uses the same space-saving techniques as Kvp.

We first compare HKvp with two well-known methods from the other two families of solutions outlined in Chapter 2. We chose GNAT from the clustering family since it is a well-known structure with established performance. We chose the vp-tree from the local pivot-based family because of its simplicity. The best performer in that family, the mvp-tree, was reported to make up to 80% fewer distance computations, using twice the construction cost of the vp-tree; however, it is a hybrid structure since it keeps some extra distance information at the leaves. We generated several instances of vp-trees and GNAT trees by varying their branching factors. Figure 6.7 summarizes our results for one particular setting. We observe that global pivot-based methods, represented by Kvp, are far superior to the others.

Figure 6.7: Comparison of the query performance of Kvp, the vp-tree, and GNAT versus their construction costs in 40 dimensions for a database of size 10000.

6.3.1 The Construction Cost Ratio

Here we outline how we compare the construction costs of EcKvp and HKvp. The preprocessing performance of EcKvp is sensitive to the size of the database. We will quantify performance through a statistic called the construction cost ratio, which is independent of the database size. Given a database of n objects, and choosing p of the objects as pivots, the total construction cost of EcKvp using an inner query radius of r over an inner index PI is

    QC(PI, r)·(n − p) + p²/2,

where QC(PI, r) is the average cost of queries run against PI using radius r. The total construction cost of Kvp is

    p·n − p²/2.

The ratio between these two costs is approximately

    p/(2n − p) + QC(PI, r)/p.

Assuming n is far larger than p, the first term of the above sum is negligible relative to the second term. The second term is equal to what we defined earlier as the cost ratio. In the context of EcKvp, we will also call this the construction cost ratio to be more specific.
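A minimal sketch of this projected ratio follows; the value of QC(PI, r) would in practice be measured by running sample inner queries against the pivot index, and the numbers in the comment are purely illustrative.

```python
def construction_cost_ratio(n, p, avg_inner_query_cost):
    """Projected ratio of EcKvp to Kvp construction cost:
    p / (2n - p) + QC(PI, r) / p.  When p is much smaller than n the first
    term is negligible and the ratio reduces to the cost ratio QC / p."""
    return p / (2 * n - p) + avg_inner_query_cost / p

# Example: 1500 pivots, a database of 100,000 objects, and inner queries
# averaging 300 distance computations:
#   construction_cost_ratio(100_000, 1500, 300)  ->  about 0.0076 + 0.2 = 0.21
```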
From now on, we will use this term to compare the construction times of EcKvp and HKvp. Figure 6.8 provides some justification for our decision to use the construction cost ratio as the main criterion in evaluating the preprocessing performance of EcKvp. We see that as the database size grows, the actual ratio of the construction costs of EcKvp and HKvp quickly decreases to a value that is comparable to the construction cost ratio. As expected, the construction cost ratio remains around the same value for different database sizes. This allows us to run our experiments on modest database sizes, yet accurately predict the cost ratios for significantly larger databases.

Figure 6.8: The real construction cost ratio versus the projected ratio for varying database sizes, for a pivot pool of size 1500 in 20 dimensions.

6.3.2 The Construction Cost and Performance Trade-off

We have mentioned that executing expensive inner queries provides us with more distance data, thus improving the query performance of EcKvp. On the other hand, expensive inner queries increase the construction cost of EcKvp. In this subsection we show various settings in which we explore the trade-off between query performance and construction cost. Throughout, we compare the query performance of the two structures as the ratio of the query times of EcKvp to HKvp, called the query performance ratio. In theory, the HKvp structure should never perform worse than EcKvp, implying that this ratio should never be smaller than 1, but due to minor variations in the order in which pivots are processed, we sometimes observed ratios slightly smaller than 1.

One way to generate EcKvp structures with various construction cost ratios is to vary the inner query radius. We ran experiments to see how varying the inner query radius influences the construction cost ratio and query performance. Figure 6.9 offers a summary for a particular setting in two different dimensions. One point worth mentioning is that, for a given dimension, there appear to be two performance trends. The slightly worse trend is produced by using high inner query radius values; Figure 6.4 suggests that high radius values produce less efficient pivot bounds. Given that increasing the inner query radius rapidly increases the inner query costs, one idea that we experimented with was to run two queries, one close and one far, and then combine their outcomes. Figure 6.3 suggests that the distributions of these two inner queries are expected to be very different, so that the pivot information obtained from these two separate, relatively inexpensive queries should have very little in common, and the extra query should provide some new distance information. Unfortunately, our experiments with this approach did not yield a significant improvement.

Figure 6.9: EcKvp performance for varying inner query radius values using a pivot pool of 1500.

In practice, determining a suitable inner query radius is not a trivial design issue. Because of this, we implemented a revised version of the EcKvp structure that is given a target construction cost ratio and dynamically changes the inner query radius in order to reach this target value. Figure 6.10 shows the query performance of this implementation for various settings.

Figure 6.10: The construction cost ratio and query performance ratio of EcKvp relative to HKvp for different values of (a) dimensionality, (b) query radius, (c) number of pivots, and (d) database size. Unless specified otherwise, the database size is 3000, the objects are uniformly distributed random vectors in 15 dimensions, the query radius is 0.5, and the number of pivots is 1500.

So far, all of our experiments have involved uniformly distributed data sets. Next, we performed a series of experiments to investigate how EcKvp performs on clustered data sets.
We generated synthetic data by partitioning vectors into clusters whose centers are distributed uniformly within a unit hypercube; within each cluster the coordinates are drawn from a Gaussian distribution having the cluster center as mean and a specified standard deviation. Our results are summarized in Figure 6.11. We see that as the cost ratio increases, the performance of the EcKvp structure approaches the performance of HKvp, but with a construction cost of only 6% that of HKvp. Comparing Figures 6.11(a) and 6.11(b), we see that as the cluster standard deviation decreases and the number of clusters increases, higher cost ratios are needed to match HKvp's performance. One explanation is that with clustered data, finding a pivot within the same cluster as the query object provides a pivot that is very close to the query point, and by the observations made earlier about close pivots, such a pivot has very strong discrimination power. By using all the pivots available, HKvp is more likely to find such powerful pivots.

Figure 6.11: Comparison of EcKvp to HKvp for varying numbers of clusters for a database of 3000 vectors in 15 dimensions, for a cluster standard deviation of (a) 0.1 and (b) 0.2.

6.3.3 Comparing EcKvp with HKvp for Similar Construction Costs

Up to this point we have viewed EcKvp as a way of achieving, at lower construction cost, query performance comparable to that of HKvp. Another way to look at EcKvp is that it should provide better query performance when given the same construction cost as HKvp. We ran a number of experiments to investigate this point of view. We created an EcKvp structure with a particular construction cost ratio, and then compared it with an HKvp structure that uses the same number of distance computations per regular database object. That is, given an EcKvp with p pivots and a construction cost ratio of cr, the HKvp structure with which we compared it would use p·cr pivots. Note that this may not be a fair comparison when the database size is very small; the query performance is then dominated by the elimination of pivots, which in general is handled more efficiently than that of regular objects. Because of this, we ran our experiments for increasing numbers of database objects in order to control this side effect.
Figure 6.12 summarizes our results. We see that EcKvp can reduce query times by up to 70% compared to HKvp.

Figure 6.12: Comparison of EcKvp with an HKvp structure that uses the same construction cost ratio (cr). The pivot pool size is 2000, the query radius is 0.8, and objects are drawn from a uniform distribution in 20-dimensional space.

An important parameter of EcKvp is the number of pivots. We argued before that, in general, having a greater pool of pivots increases the efficiency of the inner queries. We also ignored the construction cost of the pivot index, since the number of pivots is typically much smaller than the database size. If this cost is incorporated into the total construction cost, then the greater the pivot pool size, the less we will be able to spend on inner queries. We ran experiments in which we fixed the total construction cost, including the inner index construction, and varied the number of pivots. We also compared our results with those of an HKvp structure that has the same construction cost. Let the total construction cost permitted be denoted by ccost, the database size by n, and the number of pivots by p. We can compute the number of pivots needed to achieve this construction cost by solving the quadratic equation

    ccost = p²/2 + (n − p)·p.

Our results are presented in Figure 6.13. We see that EcKvp can provide up to a 60% improvement over HKvp in query performance.

Figure 6.13: Comparison of EcKvp with an HKvp that uses the same total construction cost. The database size is 100,000, the query radius is 0.8, and objects are drawn from a uniform distribution in 20-dimensional space.

So far we have used only synthetically generated pseudo-random data to test our structures. In order to validate our results on real metric data, we obtained a set of web-page data through Google [2]. We applied the cosine distance [38] to the term vectors to carry out our range queries; Figure 3.6 shows the distribution of this data. In a similar fashion to the way we generated Figure 6.7, we generated several different configurations of the GNAT tree and the vp-tree by varying their branching factors, and we varied the total construction cost of HKvp and EcKvp for the same reason. Our results are summarized in Figure 6.14. We see that the results are similar to those of our earlier experiments; EcKvp provides a performance improvement of up to 42% compared to HKvp.

Figure 6.14: Comparison of the index structures using the web page data of size 1000.

6.4 Conclusions

In this chapter we have introduced a new data structure for similarity searching, called EcKvp, which retains all the positive features of the Kvp and HKvp structures but has significantly lower preprocessing times. We have shown that EcKvp can also perform better than HKvp for the same construction cost because it is able to use more pivots.

EcKvp has a number of weaknesses. Although we are advised to use as many pivots as possible, when the size of the database is small, the optimal pivot pool size will be smaller than expected.
We have shown that there seems to be an optimal inner query radius from the cost-per-efficiency point of view. This might mean that when the number of pivots is increased, the EcKvp structure could sometimes perform worse because it uses a suboptimal inner query radius. In our experiments this has not been an issue, but dealing with it in a more rigorous way is an interesting problem for future research.

In an attempt to better understand the EcKvp structure, one can draw parallels to tree structures. In many respects, Kvp can be considered a bottom-up version of a tree-like structure. Tree structures require less space than flat global pivot-based structures because pivots have local scope: they only carry information about the objects within their own subtree. The Kvp structure eliminates useless pivot bounds, and so, in a way, a given pivot has a sort of "local scope" over the objects that have found it to be useful. Tree-based structures attempt to cluster relevant objects together, so that pivots are used for similar objects. The objects that are in the same subtree as a given pivot have landed there based on a number of comparisons to other pivots starting from the root of the tree. The objects and the pivot must have compared similarly to the upper-level pivots, increasing the likelihood that they are in close proximity. This is one factor that current global pivot-based methods are missing; they blindly compute distances between all pivots and all objects. EcKvp improves the process by explicitly performing a query on the pool of pivots. Tree structures use the same weak channels to organize both the pivots and the objects, even though pivots are much more important. In contrast, EcKvp organizes the pivots in an expensive but powerful structure. As an example, if EcKvp used a vp-tree as the inner structure, and used an approximate greedy nearest neighbor search (without the usual back-tracking) as the inner query, then it would have the same information as a vp-tree, merely organized in a different way. Instead, EcKvp is free to use a better pivot organization and a more involved inner query.

Chapter 7

Composite Metrics

In this chapter, we define a new framework for similarity matching called composite metrics. A composite metric is a weighted sum of a set of metrics. A range query in composite metrics specifies not only a query object and a radius value, but also the relative weights to be used for each metric. This enables users to adjust the definition of the distance at query time according to their needs. This approach can be very useful when the definition of similarity is affected by a number of independent factors, or when there are multiple alternative distance functions that complement each other. To provide some concrete motivation for our framework, we cite several examples from pattern recognition and machine learning. Finally, we introduce two new structures, the c-tree and c-forest, for indexing in composite metrics. Our experiments show that it is possible to provide competitive query performance while providing this extra flexibility.

7.1 Introduction

Recall from Chapter 1 that a metric space is a set of objects O and a distance function between objects that obeys the properties of positivity, symmetry, and the triangle inequality [24].
Given such a distance function d, a range query Range(q, r, O) is defined as:

    Range(q, r, O) = {o ∈ O | d(q,o) ≤ r}                                        (7.1)

Given a set of metrics {d1, d2, ..., dc} and associated weights {w1, w2, ..., wc}, we define a composite metric to be a linear combination of these distance metrics. That is,

    d(a,b) = Σ_{i=1}^{c} wi · di(a,b)                                            (7.2)

where a and b are two objects from the database. It is straightforward to show that if each of these distance functions is a metric, their linear combination is also a metric.

Note that a family of non-linear combinations can also be made to work in this framework. For example, given the distance function

    d(a,b) = ( Σ_{i=1}^{c} (wi · di(a,b))^p )^{1/p},

the range query Range(q, r, O) in this setting can be transformed into Range(q, r^p, O) using the composite distance function

    d′(a,b) = Σ_{i=1}^{c} w′i · d′i(a,b),   where d′i(a,b) = (di(a,b))^p and w′i = wi^p.

Range searches in composite metrics can be applied to provide flexibility in the choice of the overall distance function. In addition to taking a query object and a radius as input, the search can also specify custom weights to be used in the distance function. For example, we can define a generalized weighted range query

    Range(q, r, W, O) = {o ∈ O | Σ_{i=1}^{c} wi · di(q,o) ≤ r},

where a vector of constants W = (w1, w2, ..., wc) is used to assign custom weights to each individual metric. This provides the flexibility, at query time, to adjust the relative importance of the various components of the distance metric.
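As a small, hedged illustration of these definitions (the metric list, weights, and brute-force scan below are placeholders; an index such as the c-tree introduced later avoids examining every object):

```python
from typing import Callable, Sequence

def composite_distance(a, b, metrics: Sequence[Callable], weights: Sequence[float]) -> float:
    """Weighted linear combination of component metrics, as in Equation (7.2)."""
    return sum(w * d(a, b) for w, d in zip(weights, metrics))

def weighted_range_query(q, r, weights, objects, metrics):
    """Generalized weighted range query Range(q, r, W, O), evaluated here by
    a brute-force scan so that the weights can be supplied at query time."""
    return [o for o in objects
            if composite_distance(q, o, metrics, weights) <= r]
```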
7.2 Motivation for Composite Metrics

Metric spaces are based on a distance function that is typically presented as a black box. In contrast, in vector spaces the individual components of the distance computation, namely the coordinates, are all visible. One of the motivations for composite metrics is that, by having more information about the inner workings of the distance function, we can improve our index structures. For this point of view to be valid, however, one would need to know the weights of the individual metrics at preprocessing time. Our goal here is to provide greater flexibility: we want to improve the usability of the index structure by allowing the user to customize the weights of the individual metrics at query time. This can be very useful when the user wishes to inject an element of feedback into the search process, or when there is an underlying system that learns weights based on some pre-established success measure. It is understood that the price of this added flexibility will be some loss in overall performance, but our goal will be to minimize this loss.

There are a number of reasons for considering the combination of several metrics. In many applications, the measure of similarity is a function of a number of independent factors, where each of these factors is itself a similarity measure. For example, the similarity of two persons could be measured as a combination of their similarity in age, location of residence, gender, profession, interests, and so on. Another case is when there are several different methods for defining similarity between objects. For example, image similarity can be measured by color histograms, texture, and many other methods that provide varying degrees of success depending on the specific application.

Another application where composite metrics are useful arises in machine learning. Machine learning algorithms provide a mapping from objects with a set of attributes or features to either class labels or real values. They are typically given a set of training data with known mappings and, based on these past experiences, estimate the outcome of a new instance. Instance-based learning algorithms [3] base their results on experiences that are similar to the new instance. A distance function can be used to measure the similarity between objects; a detailed discussion of distance functions is presented in [47]. In order to compute the distance between two objects, the distances between each feature are combined through a set of weights [5]. When the feature weights are fixed, a typical metric space index structure can be used to speed up the retrieval of relevant training data. Atkeson et al. [5] cite examples of systems that use the k-d tree [6] for retrieval; however, k-d trees only work in vector spaces. Some algorithms modify feature weights to improve the quality of the learner at run time [45, 44, 39]. These dynamic cases cannot be handled by a traditional metric space index structure without rebuilding the entire index as the weights are changed. Index structures for composite metrics, however, would be flexible enough to be used in such situations.

Multi-classifier systems in machine learning combine the results of a number of different methods [47, 18]. This has been shown to produce better results than the individual learning methods. One of the reasons is that different classifiers have different regions where they perform better, and a composite approach can maximize the influence of the most effective classifiers [18]. In the literature of multi-classifier systems, using classifiers based on different feature sets is called parallel combining, whereas using different classifiers based on the same feature set is called stacked combining [19].

In some applications, such as face recognition and fingerprint matching, each training instance is a class of its own, namely the owner of the face or fingerprint. In these applications, an approximation of the output probability distribution can be constructed directly from the distances of the instance being matched to each of the training instances in the database. If the combination strategy is based on summing the individual parts, then this process can be completely captured within the framework of composite metrics. It has been shown that this method performs better than other methods [25]. Lu et al. [30] used this approach to sum matching scores from three different methods to produce an overall matching score, and their results outperformed the individual classifiers. Similarly, Ross et al. [37] combined two methods using different weights for fingerprint matching. Some multi-classifier systems maintain a set of weights that are updated dynamically [41]. This is another example that motivates the custom-weighted querying framework in composite metric spaces.

7.3 The C-Tree and C-Forest Structures

To the best of our knowledge, no structure exists in the literature that can answer dynamically weighted queries in the composite metrics framework. A straightforward approach to indexing composite metrics would be to treat the metric space as having just one composed distance function, and to apply one of the existing methods. This has the disadvantage, however, that one must fix the weights before constructing the structure; it therefore does not allow one to query with weights chosen at query time. In this section we introduce the c-tree structure for answering queries over composite metrics.
It can be viewed as a combination of the k-d tree [6] and the vp-tree [42]. We first start with a brief discussion of these two structures.

The k-d tree [6] is a method for indexing vector spaces. It uses one dimension per node to partition the objects, and it rotates the choice of partitioning plane among the various dimensions with each level of descent in the tree. The median of the coordinate values of the selected dimension is used to partition the current subset into two subsets that are associated with the left child and right child of the node.

In contrast, the vp-tree works in metric spaces. As described in Chapter 2, the construction of a vp-tree with a branching factor of k proceeds in a recursive manner, partitioning the nodes that represent a particular subset of the data. At each node, one of the objects is selected as the vantage point, and the distances from the other objects to this vantage point are calculated. These objects are then partitioned into k groups of roughly equal size based on these distances. In this way, a node can have k branches with each subtree holding n/k objects, n being the number of objects for that node. The only information that needs to be kept is the vantage point itself and k − 1 distance values, denoted cutoff[1..k−1], defining the ranges of distances for each subtree. A range query of radius r centered at a point q is performed as follows: at any given node, the distance d between q and the node's vantage point is calculated. If d is smaller than r, the vantage point is added to the result set. For every subset j of the node defined by the cutoff values, if the interval of the subset, [cutoff[j−1], cutoff[j]], intersects the interval [d − r, d + r], then subset j is searched recursively.

The c-tree is also a tree-based structure. It uses one vantage point per node, like the vp-tree, and partitions the objects according to their distances using only one of the metrics. Therefore, each internal node contains a pivot, a set of ranges of distances for a given metric, and child node pointers associated with each of these ranges. Recall that c denotes the number of metrics in the composition. While traversing the tree to process a query, the search maintains two c-element vectors, denoted min and max, that represent the set of objects rooted at the given node and help decide whether the current node can be eliminated from the search. The value min_i contains the minimum possible distance between the region represented by the current subtree and the query object in terms of metric di, and max_i contains the maximum such distance. We also maintain scalars Min and Max that hold the weighted sums of the min and max vectors respectively. That is,

    Min = Σ_{i=1}^{c} wi · min_i    and    Max = Σ_{i=1}^{c} wi · max_i.

We can eliminate a subtree from the range search if Min is greater than the query radius. If Max is less than the radius, the entire subtree lies inside the query range and no further processing is necessary, other than collecting the list of objects in the subtree.

Suppose that, during the traversal of the tree, we are examining a node p having vectors min^p and max^p and the vantage point vp that partitions the objects based on metric di. A child node u, whose objects lie between distances s and e of metric di to vp, will have vectors min^u and max^u that have the same values as min^p and max^p in every component except i, where:

    min^u_i = max(min^p_i, [s,e] − di(vp, q))    and    max^u_i = min(max^p_i, e + di(vp, q)).

The distance between a range and a distance value is defined as follows:

    [s,e] − di(vp, q) =  di(vp,q) − e,   if di(vp,q) > e
                         s − di(vp,q),   if di(vp,q) < s
                         0,              otherwise.
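The following Python fragment sketches this bound maintenance; the function names are ours and the node layout is assumed for the example.

```python
def interval_point_distance(s, e, d):
    """Distance between the range [s, e] and the value d = d_i(vp, q):
    zero if d falls inside the range, otherwise the gap to the nearer end."""
    if d > e:
        return d - e
    if d < s:
        return s - d
    return 0.0

def child_bounds(min_p, max_p, i, s, e, d_vp_q):
    """Compute (min^u, max^u) for a child u covering distances [s, e] of
    metric d_i to the vantage point, from the parent's bounds and the
    query's distance d_vp_q to the vantage point under metric d_i."""
    min_u, max_u = list(min_p), list(max_p)
    min_u[i] = max(min_p[i], interval_point_distance(s, e, d_vp_q))
    max_u[i] = min(max_p[i], e + d_vp_q)
    return min_u, max_u

def weighted_bounds(min_u, max_u, weights):
    """Scalars Min and Max used for the elimination tests at query time."""
    lo = sum(w * m for w, m in zip(weights, min_u))
    hi = sum(w * m for w, m in zip(weights, max_u))
    return lo, hi
```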
One thing to note about this structure is that it makes it possible to defer defining the metric weights until query time. The structure depends only on the individual metrics for its organization; it is only when we actually perform the query that we determine the distance between a given region and the query object.

The choice of which metric to use as the representative of each node is a parameter of the structure. Our current implementation permutes the metrics randomly and rotates among them in this order with each level of descent in the tree.

Unfortunately, the vp-tree is not considered to be one of the best performing index structures. For difficult queries, where the dimensionality or the query radius is high, it does not carry sufficient information to eliminate a large fraction of the objects from consideration. In order to address this deficiency, we also introduce the c-forest structure, which maintains a collection of c-tree structures and combines their results to process a query. These c-trees have different sets of pivots, which increases the likelihood that an object will be eliminated during a search. To accomplish this, we implemented a special type of query on the c-tree which, upon reaching a leaf node, does not compute the actual distance but instead flags the object associated with the node as uneliminated. The only distance computations performed are the ones made at internal nodes to navigate through the tree. After repeating this process for all trees, we compute the intersection of the sets of uneliminated objects. As a final step, we compute the distances between the query object and each element in this common intersection to determine whether they should be included in the final result set.

To avoid repeated distance computations, the c-forest caches the distances computed during the processing of a query so that they can be reused across the trees. The same optimization could also be applied to the construction of the trees, but our current implementation does not have this feature, and so the construction cost of the c-forest grows linearly with the number of trees.

7.4 Performance

Because we know of no other index structures for handling composite metrics, the only basis for comparison for the c-tree and c-forest structures is sequential scan. We use the term cost ratio to denote the ratio of the total number of distance computations of our structures to that of sequential scan, that is, to the size of the database. Beyer et al. [7] point out that, due to the nature of modern storage systems, an index structure is considered to be effective if it reduces the number of candidates by at least 90%, which corresponds to a cost ratio of 0.1. We expect objects in our domain to be larger than records in a traditional relational database, so the ratio of disk seek time to total time should be lower than for typical database applications. Thus, cost ratio values higher than 0.1 could still be considered efficient, depending on the size of the objects. Of course, seek times are only an issue for disk-based implementations.
If the database resides in main memory, it is much easier to demonstrate improvements over linear scan.

Even though traditional index structures are not designed to work in the composite setting, we have chosen to include two well-known metric space indexing methods, the vp-tree and GNAT, in some of our experiments. Our approach is to let the traditional methods work with fixed weights for both the construction and query phases. Note that this comparison is not entirely fair to either side: on the one hand, the c-tree and c-forest provide greater flexibility and have to handle weights that are not given until query time; on the other hand, the vp-tree and GNAT have to compute the full composite distance on the whole object and cannot perform partial distance computations on parts of the objects. Nevertheless, we would like to investigate the performance cost incurred by the additional flexibility of allowing weights to be given at query time.

7.4.1 Data Sets

We ran experiments on two types of data sets, one representing images and the other consisting of synthetically generated vectors. The images present us with the case where the metrics are defined to produce similar results, and the vectors were used to model cases where the components are independent.

Our image database consists of 1800 images. For each image there is a 64-dimensional color histogram and a 62-dimensional feature vector generated by Gabor texture filters [32]. This provides us with a setting where there are two complementary metrics. Figure 3.5, presented in Chapter 3, shows the distance distribution of the images based on these two distance metrics.

The majority of our experiments were performed using random vectors. Each object has a collection of vectors sampled from the unit hypercube, generated independently of each other. We used the Euclidean distance between vectors, with distances normalized to the range [0,1]. In order to simulate a setting where components might have varying degrees of complexity, we used distributions that we label v(d0, d1, k), where each object is represented as a collection of uniformly distributed random vectors having dimensions between d0 and d1, with k instances of each dimensionality. In other words, there are a total of k·(d1 − d0 + 1) vectors having dimensions d0, d0+1, ..., d1. The normalization of the distances ensures that only the query weights determine the relative importance of the components, not their dimensionality.

We also examined various other distributions of vectors. For this purpose we created multiples, a set of data sets labeled m(distr, dim, mult), in which mult vectors are sampled from a given distribution distr in dimension dim. We used the distributions for the vectors described in Chapter 3; Figure 3.2 shows the histogram of distances for these distributions.
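A sketch of how such a v(d0, d1, k) object and its component metrics might be generated (the function names and the normalization by the hypercube diameter are assumptions made for illustration; the dissertation only specifies that component distances are normalized to [0, 1]):

```python
import random

def make_object(d0, d1, k):
    """One synthetic object of the v(d0, d1, k) family: k uniformly
    distributed vectors for each dimensionality d0, d0+1, ..., d1."""
    return [[random.random() for _ in range(dim)]
            for dim in range(d0, d1 + 1)
            for _ in range(k)]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def component_metrics(dims_per_component):
    """One metric per component: compares the corresponding vectors of two
    objects, divided by sqrt(dim), the diameter of the unit hypercube.
    Example: dims_per_component = [dim for dim in range(d0, d1 + 1) for _ in range(k)]."""
    return [lambda a, b, j=j, dim=dim: euclidean(a[j], b[j]) / dim ** 0.5
            for j, dim in enumerate(dims_per_component)]
```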
7.4.2 Experiment Setting

In all of our queries the custom query weights were randomly generated, and their sum was normalized to 1 in an effort to keep the difficulty of the queries constant. In general, using high weights has an effect similar to using low radius values, making the query easier. For the vp-tree and GNAT, we made all the weights equal so that their sum is 1.

The major parameter of the c-tree, vp-tree, and GNAT that can be adjusted by the user is the tree's branching factor (that is, the out-degree of each node). Some of our graphs are scatter plots that show these different possible settings of the same structure. Usually, a branching factor that results in higher construction cost stores more distance information and yields relatively better performance. Our graphs show the trade-off between the construction cost and the cost ratio.

The c-forest structure can use as many c-trees as desired. These c-trees might have varying branching factors, but for our experiments we fixed the branching factors of all the c-trees. A label "c-forest m" in our graphs represents the set of c-forests that use m as the branching factor of their internal trees. We obtained different construction cost settings by varying the number of trees inside the c-forest.

7.4.3 Experimental Results

In this section we outline the performance characteristics of our structures under different conditions. The difficulty of a composite metric query is determined both by the number of components and by their individual complexity; this is explored in Figure 7.1.

Figure 7.1: Varying the number of components (left) and the complexity of the components (right) on the c-tree.

The cardinality of the database is another important parameter that determines the query cost. When the cost of the query is sublinear, the cost ratio values are expected to improve as the size of the database grows. Figure 7.2 summarizes our results. We see that the results show similar characteristics across different radius values; the rest of our experiments assume a fixed query radius to emphasize other factors. We have also seen that larger cardinalities provide more favorable results in terms of the cost ratio, but in the following experiments we limit the database to moderate sizes in order to present a more balanced view.

Figure 7.2: Size of the database and query performance in v(1,5,1) with a c-tree having a branching factor of 2.

The branching factor of the c-tree is its most important parameter. Different branching factors produce different possibilities for construction cost and query performance, as demonstrated in Figure 7.3. Higher branching factors produce shallower trees, and therefore the total number of pivots decreases and so does the construction cost.

Figure 7.3: Effects of varying fanout on query performance and setup cost for r = 0.04 in v(1,5,3) having 10000 objects.

The c-forest structure offers even more flexibility for trade-offs between construction time and query performance. Here, the important parameters are the number of c-tree structures used, as well as the fanout of these internal trees. Figure 7.4 summarizes our experiments for random vectors. We see that the c-forest can provide better performance at the expense of greater construction costs; it also appears that, for our settings, a branching factor of 4 gave the best results. Figure 7.5 shows the performance results of the c-tree for the image data. Once again we see that the c-tree provides improvements over sequential scan.

In our experiments, we observed that GNAT consumes too much construction time relative to its query performance.
It eventually produces competitive results, but at too great a cost. Figure 7.6 provides a typical comparison between GNAT and the other structures.

Figure 7.4: Query performance versus setup cost for various c-tree and c-forest configurations for r = 0.04 (left) and r = 0.08 (right) in v(1,5,3).

Figure 7.5: c-tree performance on the image data.

Figure 7.6: Performance by construction cost in m(uniform,10,5) having 40K objects for query radius 0.08.

Finally, we compared our structures with their natural non-composite counterpart, the vp-tree. Figure 7.7 illustrates the relative cost ratios of these structures for various distributions. The uniform distribution proves to be the most difficult one, as Figure 3.2 already suggested; in this distribution only the c-forest provides adequate query performance. In the other distributions, we see that we can go well below the cost ratio threshold of 0.1. As we move to less difficult distributions, the extra power provided by the c-forest is less evident; in some cases we actually observe poorer performance as we increase the number of inner c-trees. We also see that, for our settings, the c-tree matches the query performance of the vp-tree with lower construction costs, and sometimes outperforms it. Given that the c-tree and vp-tree are closely related in terms of their internal structures, this shows that the composite metrics point of view can indeed improve performance by avoiding the full distance computation when possible. We also see that the c-tree and c-forest are relatively less successful for the given clustered distribution, while the vp-tree seems unaffected.

Figure 7.7: Comparison of the structures for various distributions, as defined in Chapter 3, for m(·, 5, 10). The database size is 40K and the query range is 0.08.

Figure 7.8 presents the comparison of the best settings of the structures for various query radius values. For the c-tree, we used branching factors between 2 and 50; as pointed out before, low fanout values produced the best results. For the c-forest, the internal branching factors were set to 4, and the number of internal c-tree structures varied between 1 and 10. For the vp-tree, we used branching factors up to 50, as we did for the c-tree structure.

Figure 7.8: Comparison of the best settings of the structures for various query radius values. The database size is 40K.
7.5 Summary and Conclusions

In this chapter we introduced a concept called composite metrics, in which a distance function is defined as a linear combination of several metrics. We also defined a new type of range query where the weights of the metrics are given as parameters of the query. We presented examples from machine learning and multi-classifier systems where composite metrics can be of significant value. In multi-classifier systems, classification is based on a number of factors that can be alternatives to each other, as in stacked combining of classifiers, or unrelated components of the object, as in parallel combining.

We introduced two structures, the c-tree and the c-forest, both of which can answer queries in our framework. The c-tree is a hierarchical organization of the objects based on distances to local pivots, similar to the vp-tree. It can be constructed without dealing with component weights, since it handles the components one at a time. The c-forest is a collection of c-tree structures that are used only for the elimination of objects, not for computing the actual query result; the query is finalized by a sweep over the objects that are not eliminated by any of the c-trees. We showed that the c-tree and c-forest can provide significant improvements over sequential scan, and even over the well-known vp-tree and GNAT structures.

Chapter 8

Conclusions

In this thesis, we have presented a number of practical improvements to data structures and algorithms for similarity searching in metric spaces, both in the standard context and in a new context called composite metrics. We have evaluated the efficiency of these data structures through a number of empirical analyses. A recurring theme in our methods has been the idea of selecting the most effective pivots, that is, the pivots that are either very close to or very distant from the query object.

We showed that there is a high correlation between the construction cost of a structure and its query performance. The pivoting operation is the essential means used in these structures for eliminating objects without explicitly computing the distance between those objects and the query object. The greater the construction cost, the more pivots a structure can store, and hence the fewer distance computations are needed at query time. Tree structures like the vp-tree and the M-tree variants, where the number of pivots depends on the tree height, are very restricted from this point of view: it is very difficult to invest more in the construction phase in order to improve query times. In the GNAT structure, the representatives stored in a node do not only store information about their own subtrees; as a result, it is possible to increase the construction cost by increasing the branching factor of the tree.

For structures having the same construction cost, and ignoring computational overhead, success at query time depends on the efficient use of the pivots. Global pivot-based structures use all the information available: they store the exact distances between the pivots and all of the objects. Other structures typically attempt to group similar objects together and store ranges of distances to these groups. Although this approach may reduce the computational overhead, since the elimination of a single subtree means the elimination of all the objects in it, it reduces the power of the pivots: the likelihood that two distance ranges intersect is higher than the likelihood that a single distance value lies within a distance range.
As the complexity of the data distribution increases, it becomes more difficult to cluster relevant objects together, making the pivots even weaker. This observation is supported by experimental evidence, which indicates that tree structures perform more poorly for the same amount of construction cost.

Although vantage-point-based methods make full use of the pivots, this also means they use more space, more computational overhead, and greater construction cost. The main flaw here is that a pivot governs the whole population. We have shown that a pivot has varying degrees of effectiveness, and that it is particularly effective for objects that are either close to it or far from it. It may happen to help eliminate other objects, especially when the query object is close to the database object, but these contributions are negligible compared to the more closely related cases.

One of our main contributions, the Kvp structure, uses this fact to eliminate unpromising distance values from the structure. This means there is less information to store and less information to process at query time. It also eliminates the need to use more complicated, global data structures in order to reduce computational overhead. That is why we proposed a flat, unordered structure in which each object only stores its distances to relevant pivots. This organization makes it straightforward to implement insertion and deletion operations. Unlike most of the other methods, it requires just a sequential scan of the indexing data, which makes it an excellent candidate for storage on secondary memory.

We also introduced the EcKvp structure, which offers a solution to the construction cost problem. Even though Kvp eventually discards the less useful distance information, it has to compute the distance between the object and the pivot before it can decide that the distance value is unpromising; many of these costly distance computations are wasted. EcKvp avoids computing all the distances between pivots and objects by organizing the pivots themselves in an HKvp structure. At construction time, each object executes a range query on the pivots to retrieve the distance values to a subset of them. Our experiments show that these queries are successful in returning promising pivot distances.

We introduced a variant of the Kvp structure, called the HKvp structure, which helps when there are more pivots than necessary to process a query. It offers the flexibility of eliminating pivots as well as database objects, and while eliminating pivots it uses a measure of pivot importance. We have shown that HKvp can identify the promising pivots, and that it retains the same space and computational savings as the Kvp structure.

Finally, we defined the concept of composite metrics to provide a flexible setting for similarity queries, where the distance is defined to be a weighted linear combination of a given set of metrics. The user can pose queries in which the weights are given at query time. We introduced two index structures, the c-tree and the c-forest, for handling composite metric queries, and we demonstrated experimentally that they can answer such queries efficiently.

8.1 Future Research

Overall, we feel that the structures and concepts introduced in this thesis provide a significant improvement to our understanding of efficient index structures for similarity search in metric spaces. There are, however, a number of areas for future research.
Finally, we defined the concept of composite metrics to provide a flexible setting for similarity queries, in which the distance is defined to be a weighted linear combination of a given set of metrics and the user can pose queries whose weights are supplied at query time. We introduced two index structures, the c-tree and the c-forest, for handling composite metric queries, and we demonstrated experimentally that they can answer such queries efficiently.

8.1 Future Research

Overall, we feel that the structures and concepts introduced in this thesis provide a significant improvement to our understanding of efficient index structures for similarity search in metric spaces. There are, however, a number of areas for future research.

The EcKvp structure, for example, has a number of shortcomings. The drop rate parameter introduced with HKvp needs to be adjusted according to the difficulty of the object distribution at hand. Since this parameter is used only at query time, it would be possible to incorporate a simple learning component that selects the best drop rate for a given query radius. In Chapter 3 we suggested possible approaches for assessing the difficulty of a query and for calculating the optimal number of pivots to use. This matter merits further investigation to make our solutions more practical.

We showed that, in general, using more pivots is better for EcKvp inner queries, but more pivots mean that the construction cost of the inner index grows quadratically, which increases preprocessing time considerably. This suggests that there is an optimal number of pivots at which the benefit of additional pivots is balanced against the construction cost. For high-dimensional synthetic data, for which we have a well-defined distance distribution, this optimal value was very high and was not an issue; this may not be the case for other distributions, however, and determining the best value is a challenging problem. One approach might be to include an optimizer module that tries different combinations of inner index size and inner query radius in order to keep the total construction cost at a desired level. Statistics such as the inner query radius that achieves the minimum cost per unit of efficiency could be used to improve the optimizer. This could be done independently, without causing any structural changes in the index.

EcKvp uses a variation of the range query to discover relevant pivots, but different approaches could be used to improve this inner query. Note that we have a great deal of flexibility here, since we are under no obligation to return all the objects that qualify for the query.

In this thesis we focused only on range queries. There are other types of queries, most notably the k-nearest-neighbor query, which we have not investigated. One reason is that such queries can be solved by a series of range queries, as shown in the sketch at the end of this section. However, it is also possible to develop special algorithms that solve these problems directly.

The methods we analyzed and introduced all produce exact results; in other words, they return all the objects that lie within the query range, and nothing else. Another approach to improving efficiency is approximation. In approximate range searching it is permissible to misclassify objects that lie sufficiently close to the boundary of the query range, and such algorithms are expected to run faster in return for their imprecise results. Although some of these approximation methods can be applied to all of the structures in general, there may be further, structure-specific ways of improving query performance or success rates.
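As a concrete starting point for the k-nearest-neighbor direction mentioned above, the sketch below shows the standard reduction of a k-NN query to a series of range queries with a growing radius. The index and dist arguments refer to the hypothetical flat index sketched earlier, and the initial radius and doubling schedule are illustrative assumptions rather than methods proposed in the thesis.

```python
def knn_by_range_queries(index, dist, q, k, r0=1.0):
    """Answer a k-nearest-neighbor query by issuing range queries with a
    doubling radius until at least k objects are found, then keeping the
    k closest.  Assumes the database holds at least k objects; otherwise
    the loop would not terminate."""
    r = r0
    while True:
        candidates = index.range_query(q, r)
        if len(candidates) >= k:
            return sorted(candidates, key=lambda o: dist(q, o))[:k]
        r *= 2.0
```

A dedicated k-NN algorithm could, of course, prune more aggressively than this generic reduction, which is precisely the kind of specialized solution left for future work.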