Similarity Searching in Peer-to-Peer Databases
Indrajit Bhattacharya, Srinivas R. Kashyap, and Srinivasan Parthasarathy
findrajit, raaghav, srig@cs.umd.edu
Department of Computer Science, University of Maryland, College Park, MD
20742.
Abstract
We consider the problem of handling similarity
queries in peer-to-peer databases. Given a query for
a data object, we propose an indexing and searching
mechanism which returns the set of objects in the
database that are semantically related to the query.
Our schemes can be implemented on a variety of
structured overlays such as CAN, CHORD, Pastry,
and Tapestry. We provide analytical and experimen-
tal evaluation of our schemes in terms of the search
accuracy, search cost, and load balancing. Our an-
alytical guarantees perfectly predict the experimen-
tally observed trends for the search accuracy.
1 Introduction
Structured peer-to-peer systems such as CAN,
CHORD, Pastry, and Tapestry [6, 9, 7, 11] have
received a lot of attention lately. These systems
implement a Distributed Hash Table (DHT) func-
tionality on top of a self-organizing overlay of the
network nodes. The main abstraction supported
by these DHTs is the lookup. Given a query in-
volving a particular index, the lookup solves the
problem of  nding the network node which owns
the index.
Although all DHTs implement the basic lookup
functionality e ciently, most real-life applica-
tions demand more. For instance, consider an In-
formation Retrieval (IR) application where nodes
export a collection of text documents. Each doc-
ument is characterized by a d-dimensional vector.
The  eld of IR is replete with vector-space meth-
ods for such document characterizations (for e.g,
[2, 4]). A user query consists of a vector and the
user needs all documents in the database which
match this vector or which are relevant to it.
DHTs do not e ciently support applications
like the one above. The fundamental reason
which renders DHTs ine ective in these situa-
tions is that data objects in a DHT are dis-
tributed uniformly at random across the network
nodes. While this ensures that no node stores too
many objects, it also scatters semantically related
objects across the network. Thus, when a query
is issued, the only way the DHT can return all
objects relevant to it would be to flood the en-
tire network which leads to unacceptable network
loads.
Our focus in this work is to e ciently sup-
port similarity queries in DHT based overlay net-
works. We introduce a query model where users
issue queries of the form (x;  ). Here x is a data
object and  is a distance measure. The search
algorithm needs to return all data objects y in the
network such that f(x; y)   ,wherefis an ap-
plication speci c distance function. The schemes
presented in this paper are geared towards the
Cosine distance metric which is de ned as follows:
f(x; y)=cos
?1
x y
jxjjyj
,wherex yis the dot prod-
uct between the vectors and j jis the Euclidean
(l
2
) norm. The Cosine distance is a widely used
distance function in IR applications.
The key driver behind our techniques is the
notion of similarity preserving hash functions
(SPHs). SPHs provide a powerful and interest-
ing property in the context of our work. Given a
set of points which are at a small distance from
each other, with high probability an SPH maps
these points into a \small" set of related indices.
Such a mapping of data objects onto indices leads
to a simple search strategy as follows: a node
u which has a query (x;  ), computes the set of
indices which are relevant to object x; u then
queries all the nodes which own these indices us-
ing the lookup primitive supported by the under-
lying DHT. The queried nodes return the set of
relevant objects back to u.
Several research proposals exist for handling
complex queries in peer-to-peer databases. Our
work is in the same spirit as the pSearch system
proposed by Tang et al. [10]. The pSearch system
supports similarity queries based on the Cosine
distance measure. It is built on top of CAN and
uses LSI [4] to index documents. Object coordi-
nates derived from these indices are used for rout-
ing and object location. However, to the best of
our knowledge, the pSearch system is not extensi-
ble for overlay topologies other than CAN. Gupta
et al. [5] and Schmidt et al. [8] use SPHs to dis-
tribute the objects on top of a CHORD overlay.
The former supports approximate range queries
in peer-to-peer databases and the latter supports
exact range queries. However, neither of them
support similarity queries. Also, to the best of
our knowledge, their techniques do not extend
beyond the CHORD overlay. Several other sys-
tems exist where the search algorithm is guided
by summaries of neighbors? contents (Bloom  l-
ters), routing indices, or inverted indices.
Our work departs from these in signi cant
ways. Speci cally, we view the following as the
main contributions of this work. We present a
scheme for indexing vector data and a search al-
gorithm for e cient similarity searches in peer-
to-peer databases. Our techniques are speci -
cally geared towards the Cosine distance mea-
sure which is of interest in IR applications. Our
indexing scheme and search algorithm de ne a
broad framework which accomodates a variety of
overlay topologies and can be implemented on
top of CAN, CHORD, Pastry, and Tapestry. We
provide analytical guarantees which present the
trade-o between the accuracy of the search re-
sults vs. the search cost (in terms of the number
of nodes queried). These guarantees are indepen-
dent of the overlay topology as well as the data
distribution. The analytical guarantees are also
in perfect agreement with the experimental re-
sults obtained through simulations.
2 Design Details
The two main components of our design are in-
dexing and searching. Each data object in the
peer-to-peer database is associated with an in-
dex. The set of indices are distributed across all
the nodes and each index is owned by a unique
node. We now propose a hash function h which
takes a d-dimensional data object x as input and
produces a k-bit string h(x) as output. The string
h(x) is the index of object x.
2.1 The Indexing Scheme
Let r be a d-dimensional unit vector. Corre-
sponding to this vector, we de ne the function
b
r
as follows:
b
r
(x)=
 
1ifr x 0
0ifr x<0
b
r
(x) de nes the orientation of x w.r.t. r.This
function was proposed by Charikar [3] for esti-
mating cosine distances between points in high di-
mensional space. He also observed that if r is cho-
sen uniformly at random from all d-dimensional
unit vectors, then for any two vectors x and y,
Pr[b
r
(x) 6= b
r
(y)] =  = ,where =cos
?1 x y
jxjjyj
is
the angle between the two vectors in radians.
Our hash function h is parametrized by a
setofunitvectorsr
1
;:::;r
k
,eachofwhichis
chosen uniformly at random from the set of
all d-dimensional unit vectors. The hash value
h(x) is simply the concatenation of the bits
b
r
1
(x);:::;b
r
k
(x). We may construct multiple
hash functions for placing multiple replicas of an
object. Speci cally, if each object has t replicas,
then we construct hash functions h
1
;:::;h
t
as de-
scribed above. For any object x, these hash func-
tions yield t di erent indices h
1
(x);:::;h
t
(x). In
general, t di erent nodes own these indices in the
DHT and each of these t nodes acquire a replica
when x is published.
2.2 The Search Algorithm
The search algorithm is parametrized by a ra-
dius r, which is a non-negative integer. A node
u which generates a query (x;  )  rst computes
the index h(x). It then computes the set S of all
indices whose hamming distance from h(x)isat
most r (i.e., the set of indices which di er from
h(x)inatmostrbit positions; note that S al-
ways includes h(x)). Let V be the set of nodes in
the network which own the indices in S.Nodeu
queries each of the nodes in V .NodesinVreturn
all data objects which match u?s query.
How is the search radius r determined? The
search radius r is a ected by various parameters
such as k, t, the query parameter  ,andthede-
sired search accuracy. Fixing all other variables,
an increase in the value of r would in general re-
sult in more objects which match the query being
returned. Of course, this increased accuracy is
also achieved at an increased search cost. We ex-
amine the e ect of r on the search accuracy and
cost in Section 4. The search algorithm may be
easily extended when there are multiple replicas
of an object.
Querying the set of nodes which own the rele-
vant indices is achieved by performing a lookup
operation for each index. DHTs typically imple-
ment the lookup primitive such that the routing
cost for each lookup is logarithmic in the size of
the network. Although this value is typically low,
several optimizations of our basic search algo-
rithm are possible which reduce the routing over-
head further. We now discuss one such optimiza-
tion for Pastry and Tapestry.
3 Discussion
3.1 Routing optimizations for Pas-
try and Tapestry
In an idealized scenario, Pastry and Tapestry can
be viewed as implementations of a hypercube net-
work. Each node in the overlay has a unique ID
which is a k-bit binary string. The number of
nodes in the system is exactly n =2
k
. Two nodes
are overlay neighbors of each other if and only if
the hamming distance of their IDs is one (i.e.,
they di er in exactly one bit). One may view the
nodes as occupying the vertices of d-dimensional
unit hypercube. The ID of the node is the k-
bit string obtained from the k-dimensional f0; 1g-
coordinates of the hypercube vertex occupied by
the node. An object is stored by a node if and
only if the object index matches with the node
ID.
In this scenario, a node u with a query (x;  )
performs a single lookup for x which terminates
in a node v whose ID is h(x). Node v performs a
local search by flooding the query to all its r-hop
overlay neighbors where r is the search radius.
These nodes return their local search results to
v which gathers and returns the union of all the
results to u. Note that this optimization does
not reduce the number of nodes being queried.
However, it reduces the routing load substantially
since each lookup is now replaced by a single over-
lay message.
3.2 Routing Optimization with
Holes
The ideal setting described in the previous sec-
tion may not hold in practice. The index size k
is determined during network creation when the
number of nodes is not known. Further, the num-
ber of nodes will vary dynamically in the system.
We get around this by  xing k such that 2
k
is
an upper bound on the number of nodes in the
network. However, this would result in some of
the hypercube vertices being unoccupied leading
to holes.
In our scheme, any hole u in the hypercube is
adopted by a randomly selected node u
a
in the
network. This node will be responsible for host-
ing the objects assigned to this hole as well as
performing the routing for it. Nodes which own
the neighboring hypercube vertices of the hole u,
update their routing tables to point to u
a
.This
ensures that if vertex u was within distance r of
some other vertex v, then it remains within dis-
tance r from v after being adopted by u
a
.Conse-
quently, while the r-hop neighborhood of a node
may change, the set of data objects in its r-hop
neighborhood does not reduce. Our scheme thus
provides the nice guarantee that the retrieval ac-
curacy does not deteriorate due to holes.
4 Analysis
Consider a query (x;  ). Let S be the set of all
points in the database which matches this query.
Let S
0
be the set of objects returned by the search
algorithm. Recall that t is the number of replicas,
k is the the number of bits in the index and r is
the search radius. We de ne the accuracy of the
search to be jS
0
j=jSj, i.e., the fraction of objects in
the database which match the query and which
are returned by the search. E[jS
0
j=jSj] denotes
the expected accuracy. The following theorems
hold.
Theorem 4.1
E[jS
0
j=jSj]  1?
 
1 ?
r
X
i=0
 
k
i
  
 
 
 
i
 
1?
 
 
 
k?i
!
t
(1)
Theorem 4.2 Let the search cost (the number of
nodes being queried) be C. Then,
C = t
r
X
i=0
 
k
i
 
(2)
We note that, in comparison with uniform hash
functions, the use of SPHs generally results in
an uneven distribution of data objects across in-
dices and hence across nodes. Obtaining analyti-
cal load balancing guarantees without sacri cing
accuracy is an involved issue and is a subject of
future research.
5 Experimental Evaluation
We have implemented a prototype simulator for
Pastry to evaluate our algorithms. We use the
CHORD simulator which is publicly available [1].
The main goal of our simulations is to evaluate
the search accuracy and the storage load as a
function of various system parameters. Recall the
de nition of accuracy from Section 4. To evalu-
ate storage load, we sort the nodes in decreasing
order of number objects they store. The sorted
list of nodes is bucketed such that each bucket
contains 5% of the total number of nodes. For
each of the 20 buckets, we plot the percentage
of total number of objects that are stored in the
nodes of the bucket. The baseline for comparsion
is the perfectly uniform distribution where each
bucket stores 5% of the objects.
5.1 Experimental Setup
The data samples for our experiments are drawn
as follows: each coordinate of the data vector is
sampled uniformly and independently at random
from a standard normal distribution. For each
query (x;  ), the parameter x is sampled from the
same distribution as that of the data. We observe
the e ect of the number of replicas t,thesizeof
the index k, the dimensionality of the data d,the
number of nodes n, the number of data objects
N, and the query parameter  on the search ac-
curacy and storage load. The default values of
these parameters are in the table below. Results
are averaged over 100 trials.
N d k t r  n
50,000 15 10 1 1 0.75 2
k
=1024
Table 1: Default values for network parameters
used in the experiments
5.1.1 Observations
Figures (a)-(g) plot the e ect of the various sys-
tem parameters on the accuracy for the Pastry
system. We plot both the experimentally ob-
served values as well as the analytically predicted
ones. The accuracy results from our CHORD
simulations are identical to those of Pastry as ex-
pected. The accuracy increases as a function of
the number of replicas t and the search radius
r. It does does not vary much as a function of
the data dimension d or the number of nodes n
or the number of data objects N in the system.
However, the accuracy decreases with the size of
the index k as well as the the query parameter  .
Amazingly, our analysis predicts the experimen-
tal trends extremely accurately in all the trials.
This suggests that the accuracy guarantees pro-
vided by our analysis do not only hold in expec-
tation, but also with high probability. Also note
that the experimentally observed values are al-
ways higher than the analytically predicted ones.
This is explained by the fact that our analysis
always yields a lower bound on the expected ac-
curacy rather than the exact value.
Figures (h)-(l) plot the e ect of the system pa-
rameters on the storage load across nodes. Fig-
ures (j) and (i) respectively indicate that increas-
ing the size of the index k adversely a ects the
storage load balance while increasing the num-
ber of replicas t aids load balance. Varying other
parameters does not seem to change the storage
distribution across the nodes. The observed load
balancing trends are similar for both CHORD
and Pastry.
6 Conclusion and Future
Work
We have presented a framework for indexing and
searching data objects in peer-to-peer informa-
tion retrieval systems. Our schemes use SPHs to
map semantically related data objects to a small
set of indices leading to a simple and e cient
search algorithm. This framework can be imple-
mented on a wide variety of structured overlays
such as CAN, CHORD, Pastry and Tapestry.
We plan to extend our work in the future
through extensive experimental evaluation. In
particular, we plan to evaluate our schemes with
data sets obtained from real applications and
compare the performance of our scheme with ex-
isting systems such as pSearch. We also plan to
evaluate the performance of our schemes under
dynamic network conditions. Finally, load bal-
ancing mechanisms which evenly distribute the
indices and query load across nodes without sac-
ri cing accuracy will be a major focus of future
studies.
References
[1] http://www.pdos.lcs.mit.edu/chord/.
[2] Michael W. Berry, Zlatko Drmac, and Eliz-
abeth R. Jessup. Matrices, vector spaces,
and information retrieval. SIAM Review,
41(2):335{362, 1999.
[3] Moses S. Charikar. Similarity estimation
techniques from rounding algorithms. In
Proceedings of the thiry-fourth annual ACM
symposium on Theory of computing, pages
380{388. ACM Press, 2002.
[4] S.C. Deerwester, S.T. Dumais, T.K. Lan-
dauer, G.W. Furnas, and R.A. Harshman.
Indexing by latent semantic analysis. Jour-
nal of the American Society for Information
Science, 41(6):391{407, 1990.
[5] Abhishek Gupta, Divyakant Agrawal, and
Amr El Abbadi. Approximate range se-
lection queries in peer-to-peer systems. In
CIDR, 2003.
[6] S. Ratnasamy, P. Francis, M. Handley,
R. Karp, and S. Shenker. A scalable content-
addressable network. In ACM SIGCOMM,
2001.
[7] Antony Rowstron and Peter Druschel. Pas-
try: Scalable, distributed object location
and routing for large-scale peer-to-peer sys-
tems. In IFIP/ACM International Confer-
ence on Distributed Systems Platforms (Mid-
dleware), pages 329{350, November 2001.
[8] Christina Schmidt and Manish Parashar.
Flexible information discovery in decentral-
ized distributed systems. In IEEE Interna-
tional Symposium on High-Performance Dis-
tributed Computing (HPDC-12), 2003.
[9] Ion Stoica, Robert Morris, David Liben-
Nowell, David R. Karger, M. Frans
Kaashoek, Frank Dabek, and Hari Bal-
akrishnan. Chord: a scalable peer-to-
peer lookup protocol for internet applica-
tions. IEEE/ACM Transactions on Net-
working (TON), 11(1):17{32, 2003.
[10] Chunqiang Tang, Zhichen Xu, and Sand-
hya Dwarkadas. Peer-to-peer information re-
trieval using self-organizing semantic overlay
networks. In Proceedings of the 2003 confer-
ence on Applications, technologies, architec-
tures, and protocols for computer communi-
cations, pages 175{186. ACM Press, 2003.
[11] Ben Y. Zhao, Ling Huang, Jeremy Stribling,
Sean C. Rhea, Anthony D. Joseph, and John
Kubiatowicz. Tapestry: A resilient global-
scale overlay for service deployment. To ap-
pear in IEEE Journal on Selected Areas in
Communications.
0
10
20
30
40
50
60
70
80
90
100
6 7 8 9 10 11 12 13 14 15
accuracy in %
index size k
accuracy vs. k
observed
predicted
(a)
0
10
20
30
40
50
60
70
80
90
100
200 400 600 800 1024
accuracy in %
no. of nodes n
accuracy vs. no. of nodes
observed
predicted
(b)
0
10
20
30
40
50
60
70
80
90
100
10 11 12 13 14 15 16 17
accuracy in %
data dimension d
accuracy vs. d
observed
predicted
(c)
0
10
20
30
40
50
60
70
80
90
100
1 2 3 4 5 6 7 8 9 10
accuracy in %
no of replicas t
accuracy vs. no. of replicas.
observed
predicted
(d)
0
10
20
30
40
50
60
70
80
90
100
0 1 2
accuracy in %
search radius r
accuracy vs. search radius
observed
predicted
(e)
0
10
20
30
40
50
60
70
80
90
100
0.6 0.7 0.8 0.9 1
accuracy in %
query parameter delta (in radians)
accuracy vs. delta
observed
predicted
(f)
0
10
20
30
40
50
60
70
80
90
100
10 20 30 40 50 60 70 80
accuracy in %
no of documents N (in thousands)
accuracy vs. N
observed
predicted
(g)
0
5
10
20
30
40
50
2 4 6 8 10 12 14 16 18 20
% of docs stored
buckets
Pastry: load balance vs. number of nodes
n=1024
n=819
n=612
n=408
n=204
baseline
(h)
0
5
10
20
30
40
50
2 4 6 8 10 12 14 16 18 20
% of docs stored
buckets
Pastry: load balance vs. number of replicas
h=1
h=2
h=3
h=4
h=5
baseline
(i)
0
5
10
20
30
40
50
2 4 6 8 10 12 14 16 18 20
% of docs stored
buckets
Pastry: load balance vs. index size
k=8
k=10
k=12
k=14
baseline
(j)
0
5
10
20
30
40
50
2 4 6 8 10 12 14 16 18 20
% of docs stored
buckets
CHORD: load balance vs. no. of docs
n=10,000
n=20,000
n=30,000
n=40,000
n=50,000
baseline
(k)
0
5
10
20
30
40
50
2 4 6 8 10 12 14 16 18 20
% of docs stored
buckets
CHORD: load balance vs. dim. of data
d=11
d=12
d=13
d=14
d=15
baseline
(l)
Figure 1: Experimental Results for Accuracy and Load Balance