Similarity Searching in Peer-to-Peer Databases Indrajit Bhattacharya, Srinivas R. Kashyap, and Srinivasan Parthasarathy findrajit, raaghav, srig@cs.umd.edu Department of Computer Science, University of Maryland, College Park, MD 20742. Abstract We consider the problem of handling similarity queries in peer-to-peer databases. Given a query for a data object, we propose an indexing and searching mechanism which returns the set of objects in the database that are semantically related to the query. Our schemes can be implemented on a variety of structured overlays such as CAN, CHORD, Pastry, and Tapestry. We provide analytical and experimen- tal evaluation of our schemes in terms of the search accuracy, search cost, and load balancing. Our an- alytical guarantees perfectly predict the experimen- tally observed trends for the search accuracy. 1 Introduction Structured peer-to-peer systems such as CAN, CHORD, Pastry, and Tapestry [6, 9, 7, 11] have received a lot of attention lately. These systems implement a Distributed Hash Table (DHT) func- tionality on top of a self-organizing overlay of the network nodes. The main abstraction supported by these DHTs is the lookup. Given a query in- volving a particular index, the lookup solves the problem of nding the network node which owns the index. Although all DHTs implement the basic lookup functionality e ciently, most real-life applica- tions demand more. For instance, consider an In- formation Retrieval (IR) application where nodes export a collection of text documents. Each doc- ument is characterized by a d-dimensional vector. The eld of IR is replete with vector-space meth- ods for such document characterizations (for e.g, [2, 4]). A user query consists of a vector and the user needs all documents in the database which match this vector or which are relevant to it. DHTs do not e ciently support applications like the one above. The fundamental reason which renders DHTs ine ective in these situa- tions is that data objects in a DHT are dis- tributed uniformly at random across the network nodes. While this ensures that no node stores too many objects, it also scatters semantically related objects across the network. Thus, when a query is issued, the only way the DHT can return all objects relevant to it would be to flood the en- tire network which leads to unacceptable network loads. Our focus in this work is to e ciently sup- port similarity queries in DHT based overlay net- works. We introduce a query model where users issue queries of the form (x; ). Here x is a data object and is a distance measure. The search algorithm needs to return all data objects y in the network such that f(x; y) ,wherefis an ap- plication speci c distance function. The schemes presented in this paper are geared towards the Cosine distance metric which is de ned as follows: f(x; y)=cos ?1 x y jxjjyj ,wherex yis the dot prod- uct between the vectors and j jis the Euclidean (l 2 ) norm. The Cosine distance is a widely used distance function in IR applications. The key driver behind our techniques is the notion of similarity preserving hash functions (SPHs). SPHs provide a powerful and interest- ing property in the context of our work. Given a set of points which are at a small distance from each other, with high probability an SPH maps these points into a \small" set of related indices. Such a mapping of data objects onto indices leads to a simple search strategy as follows: a node u which has a query (x; ), computes the set of indices which are relevant to object x; u then queries all the nodes which own these indices us- ing the lookup primitive supported by the under- lying DHT. The queried nodes return the set of relevant objects back to u. Several research proposals exist for handling complex queries in peer-to-peer databases. Our work is in the same spirit as the pSearch system proposed by Tang et al. [10]. The pSearch system supports similarity queries based on the Cosine distance measure. It is built on top of CAN and uses LSI [4] to index documents. Object coordi- nates derived from these indices are used for rout- ing and object location. However, to the best of our knowledge, the pSearch system is not extensi- ble for overlay topologies other than CAN. Gupta et al. [5] and Schmidt et al. [8] use SPHs to dis- tribute the objects on top of a CHORD overlay. The former supports approximate range queries in peer-to-peer databases and the latter supports exact range queries. However, neither of them support similarity queries. Also, to the best of our knowledge, their techniques do not extend beyond the CHORD overlay. Several other sys- tems exist where the search algorithm is guided by summaries of neighbors? contents (Bloom l- ters), routing indices, or inverted indices. Our work departs from these in signi cant ways. Speci cally, we view the following as the main contributions of this work. We present a scheme for indexing vector data and a search al- gorithm for e cient similarity searches in peer- to-peer databases. Our techniques are speci - cally geared towards the Cosine distance mea- sure which is of interest in IR applications. Our indexing scheme and search algorithm de ne a broad framework which accomodates a variety of overlay topologies and can be implemented on top of CAN, CHORD, Pastry, and Tapestry. We provide analytical guarantees which present the trade-o between the accuracy of the search re- sults vs. the search cost (in terms of the number of nodes queried). These guarantees are indepen- dent of the overlay topology as well as the data distribution. The analytical guarantees are also in perfect agreement with the experimental re- sults obtained through simulations. 2 Design Details The two main components of our design are in- dexing and searching. Each data object in the peer-to-peer database is associated with an in- dex. The set of indices are distributed across all the nodes and each index is owned by a unique node. We now propose a hash function h which takes a d-dimensional data object x as input and produces a k-bit string h(x) as output. The string h(x) is the index of object x. 2.1 The Indexing Scheme Let r be a d-dimensional unit vector. Corre- sponding to this vector, we de ne the function b r as follows: b r (x)= 1ifr x 0 0ifr x<0 b r (x) de nes the orientation of x w.r.t. r.This function was proposed by Charikar [3] for esti- mating cosine distances between points in high di- mensional space. He also observed that if r is cho- sen uniformly at random from all d-dimensional unit vectors, then for any two vectors x and y, Pr[b r (x) 6= b r (y)] = = ,where =cos ?1 x y jxjjyj is the angle between the two vectors in radians. Our hash function h is parametrized by a setofunitvectorsr 1 ;:::;r k ,eachofwhichis chosen uniformly at random from the set of all d-dimensional unit vectors. The hash value h(x) is simply the concatenation of the bits b r 1 (x);:::;b r k (x). We may construct multiple hash functions for placing multiple replicas of an object. Speci cally, if each object has t replicas, then we construct hash functions h 1 ;:::;h t as de- scribed above. For any object x, these hash func- tions yield t di erent indices h 1 (x);:::;h t (x). In general, t di erent nodes own these indices in the DHT and each of these t nodes acquire a replica when x is published. 2.2 The Search Algorithm The search algorithm is parametrized by a ra- dius r, which is a non-negative integer. A node u which generates a query (x; ) rst computes the index h(x). It then computes the set S of all indices whose hamming distance from h(x)isat most r (i.e., the set of indices which di er from h(x)inatmostrbit positions; note that S al- ways includes h(x)). Let V be the set of nodes in the network which own the indices in S.Nodeu queries each of the nodes in V .NodesinVreturn all data objects which match u?s query. How is the search radius r determined? The search radius r is a ected by various parameters such as k, t, the query parameter ,andthede- sired search accuracy. Fixing all other variables, an increase in the value of r would in general re- sult in more objects which match the query being returned. Of course, this increased accuracy is also achieved at an increased search cost. We ex- amine the e ect of r on the search accuracy and cost in Section 4. The search algorithm may be easily extended when there are multiple replicas of an object. Querying the set of nodes which own the rele- vant indices is achieved by performing a lookup operation for each index. DHTs typically imple- ment the lookup primitive such that the routing cost for each lookup is logarithmic in the size of the network. Although this value is typically low, several optimizations of our basic search algo- rithm are possible which reduce the routing over- head further. We now discuss one such optimiza- tion for Pastry and Tapestry. 3 Discussion 3.1 Routing optimizations for Pas- try and Tapestry In an idealized scenario, Pastry and Tapestry can be viewed as implementations of a hypercube net- work. Each node in the overlay has a unique ID which is a k-bit binary string. The number of nodes in the system is exactly n =2 k . Two nodes are overlay neighbors of each other if and only if the hamming distance of their IDs is one (i.e., they di er in exactly one bit). One may view the nodes as occupying the vertices of d-dimensional unit hypercube. The ID of the node is the k- bit string obtained from the k-dimensional f0; 1g- coordinates of the hypercube vertex occupied by the node. An object is stored by a node if and only if the object index matches with the node ID. In this scenario, a node u with a query (x; ) performs a single lookup for x which terminates in a node v whose ID is h(x). Node v performs a local search by flooding the query to all its r-hop overlay neighbors where r is the search radius. These nodes return their local search results to v which gathers and returns the union of all the results to u. Note that this optimization does not reduce the number of nodes being queried. However, it reduces the routing load substantially since each lookup is now replaced by a single over- lay message. 3.2 Routing Optimization with Holes The ideal setting described in the previous sec- tion may not hold in practice. The index size k is determined during network creation when the number of nodes is not known. Further, the num- ber of nodes will vary dynamically in the system. We get around this by xing k such that 2 k is an upper bound on the number of nodes in the network. However, this would result in some of the hypercube vertices being unoccupied leading to holes. In our scheme, any hole u in the hypercube is adopted by a randomly selected node u a in the network. This node will be responsible for host- ing the objects assigned to this hole as well as performing the routing for it. Nodes which own the neighboring hypercube vertices of the hole u, update their routing tables to point to u a .This ensures that if vertex u was within distance r of some other vertex v, then it remains within dis- tance r from v after being adopted by u a .Conse- quently, while the r-hop neighborhood of a node may change, the set of data objects in its r-hop neighborhood does not reduce. Our scheme thus provides the nice guarantee that the retrieval ac- curacy does not deteriorate due to holes. 4 Analysis Consider a query (x; ). Let S be the set of all points in the database which matches this query. Let S 0 be the set of objects returned by the search algorithm. Recall that t is the number of replicas, k is the the number of bits in the index and r is the search radius. We de ne the accuracy of the search to be jS 0 j=jSj, i.e., the fraction of objects in the database which match the query and which are returned by the search. E[jS 0 j=jSj] denotes the expected accuracy. The following theorems hold. Theorem 4.1 E[jS 0 j=jSj] 1? 1 ? r X i=0 k i i 1? k?i ! t (1) Theorem 4.2 Let the search cost (the number of nodes being queried) be C. Then, C = t r X i=0 k i (2) We note that, in comparison with uniform hash functions, the use of SPHs generally results in an uneven distribution of data objects across in- dices and hence across nodes. Obtaining analyti- cal load balancing guarantees without sacri cing accuracy is an involved issue and is a subject of future research. 5 Experimental Evaluation We have implemented a prototype simulator for Pastry to evaluate our algorithms. We use the CHORD simulator which is publicly available [1]. The main goal of our simulations is to evaluate the search accuracy and the storage load as a function of various system parameters. Recall the de nition of accuracy from Section 4. To evalu- ate storage load, we sort the nodes in decreasing order of number objects they store. The sorted list of nodes is bucketed such that each bucket contains 5% of the total number of nodes. For each of the 20 buckets, we plot the percentage of total number of objects that are stored in the nodes of the bucket. The baseline for comparsion is the perfectly uniform distribution where each bucket stores 5% of the objects. 5.1 Experimental Setup The data samples for our experiments are drawn as follows: each coordinate of the data vector is sampled uniformly and independently at random from a standard normal distribution. For each query (x; ), the parameter x is sampled from the same distribution as that of the data. We observe the e ect of the number of replicas t,thesizeof the index k, the dimensionality of the data d,the number of nodes n, the number of data objects N, and the query parameter on the search ac- curacy and storage load. The default values of these parameters are in the table below. Results are averaged over 100 trials. N d k t r n 50,000 15 10 1 1 0.75 2 k =1024 Table 1: Default values for network parameters used in the experiments 5.1.1 Observations Figures (a)-(g) plot the e ect of the various sys- tem parameters on the accuracy for the Pastry system. We plot both the experimentally ob- served values as well as the analytically predicted ones. The accuracy results from our CHORD simulations are identical to those of Pastry as ex- pected. The accuracy increases as a function of the number of replicas t and the search radius r. It does does not vary much as a function of the data dimension d or the number of nodes n or the number of data objects N in the system. However, the accuracy decreases with the size of the index k as well as the the query parameter . Amazingly, our analysis predicts the experimen- tal trends extremely accurately in all the trials. This suggests that the accuracy guarantees pro- vided by our analysis do not only hold in expec- tation, but also with high probability. Also note that the experimentally observed values are al- ways higher than the analytically predicted ones. This is explained by the fact that our analysis always yields a lower bound on the expected ac- curacy rather than the exact value. Figures (h)-(l) plot the e ect of the system pa- rameters on the storage load across nodes. Fig- ures (j) and (i) respectively indicate that increas- ing the size of the index k adversely a ects the storage load balance while increasing the num- ber of replicas t aids load balance. Varying other parameters does not seem to change the storage distribution across the nodes. The observed load balancing trends are similar for both CHORD and Pastry. 6 Conclusion and Future Work We have presented a framework for indexing and searching data objects in peer-to-peer informa- tion retrieval systems. Our schemes use SPHs to map semantically related data objects to a small set of indices leading to a simple and e cient search algorithm. This framework can be imple- mented on a wide variety of structured overlays such as CAN, CHORD, Pastry and Tapestry. We plan to extend our work in the future through extensive experimental evaluation. In particular, we plan to evaluate our schemes with data sets obtained from real applications and compare the performance of our scheme with ex- isting systems such as pSearch. We also plan to evaluate the performance of our schemes under dynamic network conditions. Finally, load bal- ancing mechanisms which evenly distribute the indices and query load across nodes without sac- ri cing accuracy will be a major focus of future studies. References [1] http://www.pdos.lcs.mit.edu/chord/. [2] Michael W. Berry, Zlatko Drmac, and Eliz- abeth R. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335{362, 1999. [3] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380{388. ACM Press, 2002. [4] S.C. Deerwester, S.T. Dumais, T.K. Lan- dauer, G.W. Furnas, and R.A. Harshman. Indexing by latent semantic analysis. Jour- nal of the American Society for Information Science, 41(6):391{407, 1990. [5] Abhishek Gupta, Divyakant Agrawal, and Amr El Abbadi. Approximate range se- lection queries in peer-to-peer systems. In CIDR, 2003. [6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content- addressable network. In ACM SIGCOMM, 2001. [7] Antony Rowstron and Peter Druschel. Pas- try: Scalable, distributed object location and routing for large-scale peer-to-peer sys- tems. In IFIP/ACM International Confer- ence on Distributed Systems Platforms (Mid- dleware), pages 329{350, November 2001. [8] Christina Schmidt and Manish Parashar. Flexible information discovery in decentral- ized distributed systems. In IEEE Interna- tional Symposium on High-Performance Dis- tributed Computing (HPDC-12), 2003. [9] Ion Stoica, Robert Morris, David Liben- Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Bal- akrishnan. Chord: a scalable peer-to- peer lookup protocol for internet applica- tions. IEEE/ACM Transactions on Net- working (TON), 11(1):17{32, 2003. [10] Chunqiang Tang, Zhichen Xu, and Sand- hya Dwarkadas. Peer-to-peer information re- trieval using self-organizing semantic overlay networks. In Proceedings of the 2003 confer- ence on Applications, technologies, architec- tures, and protocols for computer communi- cations, pages 175{186. ACM Press, 2003. [11] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John Kubiatowicz. Tapestry: A resilient global- scale overlay for service deployment. To ap- pear in IEEE Journal on Selected Areas in Communications. 0 10 20 30 40 50 60 70 80 90 100 6 7 8 9 10 11 12 13 14 15 accuracy in % index size k accuracy vs. k observed predicted (a) 0 10 20 30 40 50 60 70 80 90 100 200 400 600 800 1024 accuracy in % no. of nodes n accuracy vs. no. of nodes observed predicted (b) 0 10 20 30 40 50 60 70 80 90 100 10 11 12 13 14 15 16 17 accuracy in % data dimension d accuracy vs. d observed predicted (c) 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 10 accuracy in % no of replicas t accuracy vs. no. of replicas. observed predicted (d) 0 10 20 30 40 50 60 70 80 90 100 0 1 2 accuracy in % search radius r accuracy vs. search radius observed predicted (e) 0 10 20 30 40 50 60 70 80 90 100 0.6 0.7 0.8 0.9 1 accuracy in % query parameter delta (in radians) accuracy vs. delta observed predicted (f) 0 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 accuracy in % no of documents N (in thousands) accuracy vs. N observed predicted (g) 0 5 10 20 30 40 50 2 4 6 8 10 12 14 16 18 20 % of docs stored buckets Pastry: load balance vs. number of nodes n=1024 n=819 n=612 n=408 n=204 baseline (h) 0 5 10 20 30 40 50 2 4 6 8 10 12 14 16 18 20 % of docs stored buckets Pastry: load balance vs. number of replicas h=1 h=2 h=3 h=4 h=5 baseline (i) 0 5 10 20 30 40 50 2 4 6 8 10 12 14 16 18 20 % of docs stored buckets Pastry: load balance vs. index size k=8 k=10 k=12 k=14 baseline (j) 0 5 10 20 30 40 50 2 4 6 8 10 12 14 16 18 20 % of docs stored buckets CHORD: load balance vs. no. of docs n=10,000 n=20,000 n=30,000 n=40,000 n=50,000 baseline (k) 0 5 10 20 30 40 50 2 4 6 8 10 12 14 16 18 20 % of docs stored buckets CHORD: load balance vs. dim. of data d=11 d=12 d=13 d=14 d=15 baseline (l) Figure 1: Experimental Results for Accuracy and Load Balance