Hashing Technique: A New Index Method for High Dimensional Data

Zhexuan Song (zsong@cs.umd.edu), Department of Computer Science, University of Maryland, College Park, Maryland 20742
Nick Roussopoulos (nick@cs.umd.edu), Department of Computer Science & Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742

September 24, 1999
CS-TR-4059

Abstract

As shown in [17], when the dimensionality grows high, sequential scan processing becomes more efficient than most index-based query processing. In this paper, we propose a new index method for high-dimensional data spaces, based on a hashing technique. The basic idea is: first, find a hashing function which puts the given d-dimensional data into d̄-dimensional buckets, where d̄ < d; then use existing index techniques to manage those buckets. We define some properties of a good hashing function and give four hashing functions. To demonstrate the efficiency of our idea, we experimentally compared our algorithms with sequential scan and the Pyramid-Technique. The results demonstrate that this method outperforms the others for skewed data sets, and it always beats sequential scan, using only about half the elapsed time for range queries. However, if the data has a uniform distribution, the Pyramid-Technique is still the best method.

1 Introduction

More and more high-dimensional data now appear in a variety of new database applications. For example, new database applications such as Cubetree [9] create high-dimensional data when storing and retrieving data warehousing views. In other areas such as multimedia, content-based search [10] is an attractive new idea which is based on feature vectors. In order to support those applications, a database system must be able to handle high-dimensional data. The low-dimensional case (when the dimensionality equals 2 or 3) is well solved [3].
However, recent research [11, 12] shows that essentially none of the querying and indexing techniques provides good results on both low-dimensional and high-dimensional data for large queries, because of the so-called "curse of dimensionality". The approach taken now is to find methods to lower the dimensionality of the data and use a low-dimensional index structure. For example, while we have no method to handle 50-dimensional data directly, if we can transform the data into 2-dimensional data, we have many methods to store and retrieve it. However, when we apply such a transformation, some information of the original data set is lost, and we have to sacrifice some query efficiency to compensate for the lost information. The cost we pay is either inaccuracy of the result, as in the SVD method, or many useless "false alarms" in the query result, as in the Pyramid-Technique.

In this paper, we develop a hashing-based technique. The basic idea is to find a hashing function which puts the high-dimensional data into low-dimensional buckets. When processing a high-dimensional range query, we first find the corresponding low-dimensional query. Then we retrieve all the buckets which are inside the transformed query range. Finally, we check all the data in those buckets and discard the "false alarms". As shown in our experiments, like sequential scan, this method performs very well no matter how the data set is distributed, and it always outperforms sequential scan, using only about half the time. Some state-of-the-art methods work better when the data has a uniform distribution; we believe that our method is the best choice on very skewed data sets. Furthermore, our index structure does not need to be re-adjusted when data is dynamically added. Another advantage of our method is that we can use any existing low-dimensional index structure to store the data items and take advantage of all the nice properties of those structures.
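The transform-then-filter pipeline described above can be sketched end to end. The following is a minimal illustrative sketch, not the implementation evaluated in this paper: it uses the distance mapping of Section 3.2 (distance to a fixed base point) as the hashing function, and a sorted Python list stands in for the B+-tree; all function names are our own.

```python
import bisect
import math

def h(p, base):
    # 1-D hash key: Euclidean distance from the data point to the base point
    return math.dist(p, base)

def build_index(points, base):
    # Stand-in for the B+-tree: points sorted by key, keys kept for binary search
    data = sorted(points, key=lambda q: h(q, base))
    keys = [h(q, base) for q in data]
    return keys, data

def transformed_range(low, high, base):
    # Map the hyper-rectangle query [low, high] to the 1-D interval
    # [Dmin, Dmax] of possible distances from the base point to the rectangle
    dmin_sq = dmax_sq = 0.0
    for lo, hi, y in zip(low, high, base):
        dmin_sq += max(lo - y, 0.0, y - hi) ** 2
        dmax_sq += max(abs(lo - y), abs(y - hi)) ** 2
    return math.sqrt(dmin_sq), math.sqrt(dmax_sq)

def range_query(index, base, low, high):
    keys, data = index
    dmin, dmax = transformed_range(low, high, base)
    i = bisect.bisect_left(keys, dmin)
    j = bisect.bisect_right(keys, dmax)
    # Discard the "false alarms": candidates whose key falls in [Dmin, Dmax]
    # but which lie outside the original rectangle
    return [q for q in data[i:j]
            if all(lo <= x <= hi for lo, x, hi in zip(low, q, high))]
```

Every point inside the query rectangle is guaranteed to have its key inside [Dmin, Dmax], so the final filter only removes false alarms and never true answers.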
The hashing-based technique can easily be implemented on top of any existing DBMS.

The rest of the paper is organized as follows: In Section 2, we give a brief description of related work in high-dimensional indexing. Section 3 describes our algorithm and gives some discussion. Section 4 presents our experimental results. Section 5 gives the conclusions and directions for future research.

2 Related Work

Recently, many high-dimensional index structures have been proposed. Berchtold, Keim, and Kriegel presented the X-tree [11], which adapts the algorithms of R*-trees to high-dimensional data. The basic idea of their algorithm is to try to find an overlap-free split; if that split would lead to an unbalanced split, they omit it and construct a supernode instead. The major drawback of the X-tree is that, in order to guarantee storage utilization, it must use the 50%-quantile when splitting a data page. As shown in [14], this is the worst case in high-dimensional indexing, because the resulting pages have an access probability close to 100%. The same thing happens with the R*-tree [4], KD-tree [2], SS-tree [12] and TV-tree [8].

The Pyramid-Technique [14] is a solution for the above drawback. It first splits the d-dimensional space into 2d parts. Then, the algorithm creates a mapping from the given space to a 1-dimensional space. Finally, a B+-tree is used to manage the transformed data. Once a range query arrives, it checks the 2d parts separately and finds the result. The Pyramid-Technique outperforms the X-tree by a factor of up to 2500 for range queries. The authors also extend the method for skewed data sets. But in order to maintain efficiency, the algorithm must find the median point before splitting the space. In a dynamic environment, the median is not always the same. Their solution is to re-build the whole index if the median point drifts too far, which is very expensive.
Another approach is the Singular Value Decomposition (SVD) method [6, 7]. The basic idea is to condense most of the information in a data set into a few dimensions by applying SVD. The data in those few condensed dimensions are then indexed to support fast retrieval. The major drawback of this approach is that SVD is very expensive to compute: once new data are added to the dataset, the whole transformation must be re-computed. Therefore, it is not readily applicable to dynamic databases either. The other problem is that the SVD method cannot give 100% correct results, because some information is lost in the dimensionality reduction.

To overcome the first drawback, Kanth, Agrawal and Singh recently proposed an efficient way to incorporate the re-computed SVD-transform into the existing index structure [15]. However, since their method is based on SVD, the second drawback cannot be overcome.

3 The Hashing Technique

The basic idea of the Hashing Technique is to put the d-dimensional data points into d̄-dimensional buckets by using a hashing function H, where d̄ < d, and then use an efficient index structure such as a B+-tree [1] (when d̄ = 1) or an R*-tree [4] (when d̄ > 1) to store and access the buckets. Basically, any efficient low-dimensional index method can be used; in this paper, we use the R*-tree and B+-tree. Figure 1 gives an overview of the method.

[Figure 1: Overview of the Hashing Technique — the hashing function H maps the d-dimensional dataset into d̄-dimensional buckets and a d-dimensional range query into a d̄-dimensional range query; insert, delete and query operations go through a low-dimensional index structure.]

In the leaves of the low-dimensional index structure, we store the d-dimensional information along with the d̄-dimensional key, so we do not need to provide an inverse transformation. The remaining problem is how to find a good hashing function.

3.1 Hashing Functions

In order to find a good hashing function H, some criteria must be met.
First of all, in order to guarantee correctness, if a point p is in the range R, then after transformation H(p) must be in the transformed range H(R).

Property 3.1 For any point p and range R, if p is in R, then H(p) is in H(R).

We do not need the converse to hold. Actually, if we want to lower the dimension, it is impossible to guarantee the converse.

The second property is set to guarantee the efficiency of the hashing function. If two points p1 and p2 are close in the d-dimensional space, H(p1) and H(p2) should be close in the d̄-dimensional space. Otherwise, suppose we have a range query R which contains p1 and p2. After transformation, in order to satisfy the first property, H(R) must contain H(p1) and H(p2). If H(p1) and H(p2) are very far apart in the d̄-dimensional space, H(R) becomes a very large query range, which decreases the efficiency of the method.

Property 3.2 For any ε > 0, there exists a δ such that for any p1, p2, if |p1 − p2| < ε, then |H(p1) − H(p2)| < δ.

If we cannot find such a small δ for ε, it means we cannot give an upper bound on the size of the transformed query range. Of course, the lower δ is, the more efficient the function is. Here "lower" means the size of δ relative to the whole low-dimensional data space.

We call the hashing functions which satisfy the above properties good hashing functions.

3.2 Distance mapping

First, we use a distance function as our hashing function. The idea is to first pick a point p in the d-dimensional space. For any data point p′, find the distance between p and p′, and define H(p′) = |p − p′|. Use H(p′) as a key and save the data into a B+-tree. The point p can be selected at random; it is not necessary to pick a data point.

Once we have a query range R, the remaining problem is to find H(R). Define D_min and D_max as the minimum and maximum distance of any point in the range R to p. Then H(R) = [D_min, D_max].
D_min and D_max can be easily found from the following lemma:

Lemma 3.3 Given a hyper-rectangle R = [x_1^low, x_1^high] × ... × [x_d^low, x_d^high] and p = (y_1, ..., y_d),

    D_min = sqrt( Σ_{i=1..d} D_i² ), where
      D_i = 0                 if x_i^low ≤ y_i ≤ x_i^high
      D_i = x_i^low − y_i     if y_i < x_i^low
      D_i = y_i − x_i^high    if x_i^high < y_i

and

    D_max = sqrt( Σ_{i=1..d} D_i² ), where D_i = max{ |x_i^low − y_i|, |y_i − x_i^high| }.

Proof: We will only prove the correctness of D_min, because the proof for D_max is almost the same. In each dimension, D_i is the minimum possible distance along dimension i between p and any point in R, so D_min is the minimum value. The important point is that we can find a p′ = (y′_1, ..., y′_d) in R which satisfies H(p′) = D_min. Define

      y′_i = y_i          if x_i^low ≤ y_i ≤ x_i^high
      y′_i = x_i^low      if y_i < x_i^low
      y′_i = x_i^high     if x_i^high < y_i

It is obvious that p′ is in R. □

Now we can transform a d-dimensional hyper-rectangle query into a 1-dimensional range query. For each query range R, we first find D_min and D_max, then retrieve all the buckets in that range from the B+-tree. Checking the points in the buckets and filtering out the "false alarms" gives the final result.

Theorem 3.4 Distance mapping, with H(p′) = |p − p′|, is a good hashing function.

Proof: First, if p′ is in R, from the lemma above we know that D_min ≤ |p − p′| ≤ D_max, which means H(p′) ∈ [D_min, D_max]. Second, for any ε > 0, define δ = ε; for any p1, p2 with |p1 − p2| < ε, the triangle inequality gives |H(p1) − H(p2)| ≤ |p1 − p2| < δ. □

The detailed algorithm is listed in Figure 2.

3.3 Multi-distance mapping

Since there are many good index methods for low-dimensional data, we further extend our method in order to increase its efficiency.
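The per-dimension computation of Lemma 3.3 can be transcribed directly, including the witness point p′ from the proof that attains D_min. This is our own illustrative code, not the authors' implementation; the function name is an assumption.

```python
import math

def lemma_3_3(low, high, p):
    """Lemma 3.3: given hyper-rectangle R = [low, high] and base point p,
    return (Dmin, Dmax, witness), where witness is the point p' of the
    proof, lying in R and attaining distance Dmin from p."""
    d_min_sq = d_max_sq = 0.0
    witness = []
    for x_lo, x_hi, y in zip(low, high, p):
        if y < x_lo:          # p lies below R in this dimension
            d_i, y_i = x_lo - y, x_lo
        elif y > x_hi:        # p lies above R in this dimension
            d_i, y_i = y - x_hi, x_hi
        else:                 # p's coordinate falls inside [x_lo, x_hi]
            d_i, y_i = 0.0, y
        witness.append(y_i)
        d_min_sq += d_i ** 2
        d_max_sq += max(abs(x_lo - y), abs(y - x_hi)) ** 2
    return math.sqrt(d_min_sq), math.sqrt(d_max_sq), witness
```

For a base point at the origin and R = [1, 2] × [1, 2], this yields D_min = √2 attained at the witness (1, 1), and D_max = √8 attained at the far corner (2, 2).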
    // p is the base point in d-dimensional space
    Point_Set rangeQuery_1 (Range R, Point p)
    {
        Point_Set res;
        find Dmin and Dmax using p;        // using Lemma 3.3
        cs = btreeQuery (Dmin, Dmax);
        for (c = cs.first; !cs.end; c = cs.next) {
            if (inside (R, c))
                res.add (c);
        }
        return res;
    }

Figure 2: Range query algorithm 1

This time we select a set of points {p_1, p_2, ..., p_d̄}. For each point p in the d-dimensional space, find the distance from p to each point of the set. Now we have a d̄-dimensional vector <d_1, d_2, ..., d_d̄>. Obviously, d̄ must be a small number; otherwise we gain no improvement. Use that d̄-dimensional vector as a key and save the point into an R*-tree. The insert, delete and update operations are straightforward.

For any range query R, we find D_min^i and D_max^i for each of the d̄ dimensions using the same method as in Section 3.2. Now we have a new range R′ in our low-dimensional space. Running the query and filtering the "false alarms" gives the result.

Theorem 3.5 Multi-distance mapping, with H(p) = <|p_1 − p|, |p_2 − p|, ..., |p_d̄ − p|>, is a good hashing function.

Proof: First, if p is in R, we know that D_min^i ≤ |p_i − p| ≤ D_max^i, which means H(p)_i ∈ [D_min^i, D_max^i] for each i ∈ [1, ..., d̄]. So H(p) is in H(R). Second, for any ε > 0, define δ = √d̄ · ε; for any p1, p2 with |p1 − p2| < ε, the triangle inequality gives |H(p1)_i − H(p2)_i| ≤ ε for each i ∈ [1, ..., d̄], so |H(p1) − H(p2)| ≤ √d̄ · ε = δ. □

The detailed algorithm is listed in Figure 3.

    // d_bar is the number of points we selected
    // p is the array of base points in d-dimensional space
    Point_Set rangeQuery_2 (Range R, int d_bar, Point p[])
    {
        Point_Set res;
        // transform the range query
        for (i = 0; i < d_bar; i++)
            find Dmin[i] and Dmax[i] using p[i];
        cs = rtreeQuery (Dmin, Dmax, d_bar);
        for (c = cs.first; !cs.end; c = cs.next) {
            if (inside (R, c))
                res.add (c);
        }
        return res;
    }

Figure 3: Range query algorithm 2

3.4 Max-dimension and Multi max-dimension functions

The distance function is not very efficient.
Look at the following example in a 2-dimensional space (Figure 4). Here p is the point we selected as the base point. For a range R, D_min and D_max are the minimum and maximum distances of points in R to p. All the points between D_min and D_max will be retrieved. Since R is a rectangle while the loci of D_min and D_max are two circles, wherever R is, there is always some wasted space.

[Figure 4: Example of a distance mapping — the query rectangle R lies between the two circles of radius D_min and D_max around the base point p.]

In order to solve this problem, we define the following function.

Definition 3.6 Suppose p1 = (x_1, x_2, ..., x_d) and p2 = (y_1, y_2, ..., y_d) are two d-dimensional points. Define Dist(p1, p2) = max{ |x_i − y_i| }, where i ∈ [1, ..., d].

The meaning of the function Dist(p, p′) can be viewed as follows: generate a series of hypercubes with p as their center. There must exist a hypercube C which passes through p′. The distance between p and any side of C equals Dist(p, p′). Note that Dist(p, p′) = Dist(p′, p).

Select a point p in the d-dimensional space; for each data point p′, find Dist(p, p′) and use it as a key to save p′ into a B+-tree. As with the distance mapping, for any range R we need to find D_min and D_max. The method is almost the same:

Lemma 3.7 Given a hyper-rectangle R = [x_1^low, x_1^high] × ... × [x_d^low, x_d^high] and p = (y_1, ..., y_d),

    D_min = max{ D_i }, where
      D_i = 0                 if x_i^low ≤ y_i ≤ x_i^high
      D_i = x_i^low − y_i     if y_i < x_i^low
      D_i = y_i − x_i^high    if x_i^high < y_i

and

    D_max = max{ D_i }, where D_i = max{ |x_i^low − y_i|, |y_i − x_i^high| }.

We omit the proof here because it is almost the same as the previous one. Next we want to show that Dist is a good hashing function.

Theorem 3.8 The function Dist, with H(p′) = Dist(p, p′), is a good hashing function.

Proof: First, if p′ is in R, we know that D_min ≤ Dist(p, p′) ≤ D_max by the definition of Dist. So H(p′) is in H(R).
[Figure 5: Max-dimension mapping is a good hashing function — two hypercubes C1 and C2 centered at p pass through points A and B; the gap between the cubes is H(B) − H(A), which is at most |AB|.]

Second, for any ε > 0, define δ = ε; for any two points A and B with |AB| < ε, we want to show that |H(A) − H(B)| < δ. Look at Figure 5: p is the selected point and A and B are two arbitrary data points. There are two hypercubes C1 and C2 centered at p which pass through A and B. The distance between the two cubes is |H(B) − H(A)|, which is the nearest possible distance between any points on C1 and C2. That means |H(B) − H(A)| ≤ |AB| < ε = δ. □

The query algorithm is the same as for distance mapping, so we do not list the details here. As with distance mapping, we can also select a set of points instead of one. The basic idea is almost the same, and it is not hard to prove that it is a good hashing function too.

3.5 Discussion

We presented four hashing functions in the previous parts. The two remaining problems are the location of the base points and the number of dimensions after transformation.

Definition 3.9 The degree of a hashing function is the dimensionality of the transformed data.

The degree selection is very important in the hashing method. In most cases, as the degree increases, the query becomes more efficient; that is, the data in the buckets which fall in the transformed query region are more likely to be the data we need. However, at the same time, we need to use a more complicated index structure, which creates more overhead. According to [17, 14], when the degree is bigger than 10, most index methods become inefficient, so for any hashing function the degree should be no more than 10. According to our experiments, if the selectivity of the query is very low, the degree can be big; otherwise, the degree should be no more than 3. The details can be found in the next section.

The other issue is the selection of the base points. The objective of base point selection is to decrease the "false alarms" for queries. Theoretically, any positions are acceptable, even outside the data space.
However, when the degree is bigger than 1, things are a little different. For example, suppose we use the multi max-dimension function as our hashing function and choose the degree to be 2. First of all, we need to select two base points. If those two points (A and B) are very close, then for any data point p in the space, Dist(A, p) and Dist(B, p) are very close, so the two key values are strongly correlated. The objective of selecting more than one base point is to include more information; in the above example, we gain very little benefit by introducing the second base point.

This example shows us that when we select more than one base point, those points should be far enough apart to generate different information for the data points. If all the data points are in a hypercube, corner points can be a very good selection.

4 Experimental evaluation

To assess the merits of our algorithm, we implemented it and performed an experimental evaluation of the Hashing Technique. The following competing techniques are used:

- Pyramid-Technique
- Sequential Scan

The Pyramid-Technique was chosen for comparison because it is the best high-dimensional index technique designed to date. According to [14], the Pyramid-Technique outperforms the X-tree by a factor of up to 2500 (total elapsed time) for range queries, which is an impressive result.

Sequential scan processing is inefficient in low-dimensional data spaces. However, according to recent research [17], it yields better performance in high-dimensional data spaces than most index-based query processing. Therefore, we included sequential scan in our experiments.

In both the Pyramid-Technique and our method, the index key is transformed data. We stored all the relevant information in the leaf nodes, so no additional object fetches are needed; this makes the comparison fairer. The experiments were run on a Sun Ultra 1 machine with 128 MB of memory.

In the first part, we compared the performance of the four different hashing functions.
Then in the second part, we used the best hashing function to perform the comparison with the other two techniques. In both parts, range queries are used, since the range query is a basic operation underlying other queries. The queries are hypercubes selected randomly from the data space, but each hypercube has the same volume. This volume can be viewed as the selectivity of a query if the data points are uniformly distributed.

4.1 Evaluation of different hashing functions

In the last section, we introduced four hashing functions; in this part, we compare their performance. The data set we used contains 1,000,000 uniformly distributed points in a 40-dimensional data space. We chose the distance mapping, degree 2 multi-distance mapping, max-dimension mapping, and degree 2 and degree 3 multi max-dimension mapping functions. Since the data points are uniformly distributed, the selectivity is the ratio of the volume of the query hypercube to the total volume of the data space. The result is shown in Figure 6.

[Figure 6: Performance behavior over different hashing functions — total elapsed time (s) versus selectivity (log scale) for the distance mapping, degree 2 multi-distance mapping, max-dimension mapping, and degree 2 and degree 3 multi max-dimension mapping functions.]

In our first experiment, we measured the performance of the five function configurations on a given set of data points. The selectivity varied from 10^−5 to 0.1. The page size is 4096 bytes, which leads to an effective page capacity of 25 objects per page. Three methods use an R*-tree whose minimum number of entries per page is 10 objects. All multi-degree functions use corner points as base points. We observed that among these five results, degree 2 multi max-dimension mapping is the most efficient, especially when the selectivity is low, and degree 2 multi-distance mapping is the worst.
The max-dimension mapping is always better than the distance mapping. As the selectivity approaches 1, all five configurations show no big difference.

Now look at the three max-dimension mapping configurations. When the degree increases from one to two, the decrease in "false alarms" makes the method more efficient; however, at the same time, we must spend more overhead on the two-dimensional index structure. Here the gain outweighs the loss, so the total elapsed time decreases. As the degree increases again from two to three, the gain no longer outweighs the loss, and the total time increases. We expect the multi max-dimension method to become even more inefficient as the degree grows beyond 3.

4.2 Evaluation of different techniques

In this part, we compare our method with sequential scan and the Pyramid-Technique. In the last part, we found that degree 2 multi max-dimension mapping is the best hashing function, so we use that function in this part along with the Pyramid-Technique and sequential scan.

We used two data sets, both containing 500,000 50-dimensional data points. In the first set, all the data are uniformly distributed; in the second set, they are very skewed.

In the first experiment (Figure 7) we find that the Pyramid-Technique is the best method on the uniformly distributed data set. Our method is better than sequential scan (using about 1/2 of the total elapsed time) but worse than the Pyramid-Technique, especially when the selectivity is low.

[Figure 7: Performance of the three techniques on uniformly distributed data — total elapsed time (s) versus selectivity (log scale) for degree 2 multi max-dimension mapping, the Pyramid-Technique, and sequential scan.]

In the second experiment, we use a skewed data set. It is d-dimensional data, but only k dimensions are independent; the remaining d − k dimensions have some relation to those k dimensions. This time, we varied k and set d = 50.
The number of data points is 500,000, and the volume of the query range is 0.01% of the total volume. The result is displayed in Figure 8. As Figure 8 shows, our method outperforms the Pyramid-Technique when k is low. However, as k increases, i.e. the data become less skewed, the Pyramid-Technique shows its power. Sequential scan could not compete with our method for any of these queries.

Summarizing the results of our experiments, we make the following observations:

1. Of all our hashing functions, the degree 2 multi max-dimension mapping function is the best one.

[Figure 8: Performance of the three techniques on skewed data — total elapsed time (s) versus the number of independent attributes k for degree 2 multi max-dimension mapping, sequential scan, and the Pyramid-Technique.]

2. For uniformly distributed data sets, the Pyramid-Technique is the best method. However, if the data set is very skewed, both sequential scan and our method are faster, and our method beats sequential scan in all cases.

5 Conclusions

In this paper, we proposed a new indexing method, the hashing-based index method. The basic idea of this method is to select a hashing function, put the high-dimensional data into low-dimensional buckets, and use existing methods to store and retrieve those buckets. The most important issue in this method is the selection of the hashing function. We gave some properties of a good hashing function and presented four different functions.

The concepts of the Hashing Technique are designed for hypercube range queries in a high-dimensional space. For skewed data sets, our method performs better than other index structures. However, if the data has a uniform distribution, the Pyramid-Technique is still the best choice. We plan to find more hashing functions in our future work.

References

[1] Comer, D. The Ubiquitous B-tree. ACM Computing Surveys, 1979.

[2] Robinson, J. T.
The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proc. ACM SIGMOD, 1981.

[3] Edelsbrunner, H. Algorithms in Combinatorial Geometry. Springer-Verlag, 1987.

[4] Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. ACM SIGMOD, 1990.

[5] Chiueh, T. Content-Based Image Indexing. Proc. of VLDB, 1994.

[6] Hull, D. Improving text retrieval for the routing problem using latent semantic indexing. Proc. ACM SIGMOD, 1994.

[7] Ng, R., Sedighian, A. Evaluating multi-dimensional indexing structures for images transformed by principal component analysis. Proc. SPIE, 1994.

[8] Lin, K., Jagadish, H. V., Faloutsos, C. The TV-Tree: An Index Structure for High-Dimensional Data. VLDB, 1995.

[9] Roussopoulos, N., Kotidis, Y., Roussopoulos, M. Cubetree: Organization of and Bulk Incremental Updates on the Data Cube.

[10] Korn, P., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z. Fast Nearest-Neighbor Search in Medical Image Databases. Conf. on VLDB, 1996.

[11] Berchtold, S., Keim, D. A., Kriegel, H.-P. The X-tree: An Index Structure for High-Dimensional Data. Proc. of VLDB, 1996.

[12] White, D., Jain, R. Similarity Indexing with the SS-tree. Proc. of ICDE, 1996.

[13] Brin, S. Nearest Neighbor Search in Large Metric Spaces. Proc. of VLDB, 1997.

[14] Berchtold, S., Böhm, C., Kriegel, H.-P. The Pyramid-Technique: Toward Breaking the Curse of Dimensionality. Proc. of the ACM SIGMOD, 1998.

[15] Ravi Kanth, K. V., Agrawal, D., Singh, A. Dimensionality Reduction for Similarity Search in Dynamic Databases. Proc. of ACM SIGMOD, 1998.

[16] Berchtold, S., Ertl, B., Keim, D., Kriegel, H.-P., Seidl, T. Fast Nearest Neighbor Search in High Dimensional Space. Proc. of ACM SIGMOD, 1998.

[17] Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U. When is "Nearest Neighbor" Meaningful?
Submitted for publication, 1998.