ABSTRACT

Title of dissertation: SEARCH, REPLICATION AND GROUPING FOR UNSTRUCTURED P2P NETWORKS
Dimitrios Tsoumakos, Doctor of Philosophy, 2006
Dissertation directed by: Professor Nicholas Roussopoulos, Department of Computer Science

In my dissertation, I present a suite of protocols that assist in efficient content location and distribution in unstructured Peer-to-Peer overlays. The basis of these schemes is their ability to learn from past interactions, increasing their performance with time.

Peer-to-Peer (P2P) networks are gaining increasing attention from both the scientific and the large Internet user community. Popular applications utilizing this new technology offer many attractive features to a growing number of users. P2P systems have two basic functions: Content search and dissemination. Search (or lookup) protocols define how participants locate remotely maintained resources. In data dissemination, users transmit or receive content from single or multiple sites in the network.

P2P applications traditionally operate under purely decentralized and highly dynamic environments. Unstructured systems represent a particularly interesting class of P2P networks. Peers form an overlay in an ad-hoc manner, without any guarantees relative to lookup performance or content availability. Resources are locally maintained, while participants have limited knowledge, usually confined to their immediate neighborhood in the overlay.

My work aims at providing effective and bandwidth-efficient searching and data sharing. A suite of algorithms which provide peers in unstructured P2P overlays with the state necessary in order to efficiently locate, disseminate and replicate objects is presented. The Adaptive Probabilistic Search (APS) scheme utilizes directed walkers to forward queries on a hop-by-hop basis. Peers store success probabilities for each of their neighbors in order to efficiently route towards object holders. AGNO performs implicit grouping of peers according to the demand incentive and utilizes state maintained by APS in order to route messages from content holders towards interested peers, without requiring any subscription process. Finally, the Adaptive Probabilistic REplication (APRE) scheme expands on the state that AGNO builds in order to replicate content inside query-intensive areas according to demand.

SEARCH, REPLICATION AND GROUPING FOR UNSTRUCTURED P2P NETWORKS

by Dimitrios Tsoumakos

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Professor Nicholas Roussopoulos, Chair/Advisor
Professor Lise Getoor
Professor Amol Deshpande
Professor Louiqa Raschid
Professor Virgil Gligor

© Copyright by Dimitrios Tsoumakos, 2006

DEDICATION

To my beloved father, Nikiforos.

ACKNOWLEDGMENTS

I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will value forever.

First and foremost I would like to thank my parents. My father and mother have always stood by, supported and actively encouraged me to follow my dreams throughout life. Without their moral, psychological and material support, I would not have been able to achieve my goals. I owe them my deepest gratitude and love.
I offer my wholehearted thanks to my advisor, Professor Nick Roussopoulos, for trusting my abilities and giving me the opportunity to study in one of the best graduate schools. He provided me with the freedom to follow my research interests and the support under his research program. His technical and moral advice followed me throughout the duration of my stay here. It has been a pleasure to work with and learn from such an extraordinary individual.

I would also like to thank the members of my committee for the valuable advice they provided during and after my proposal. Dr. Lise Getoor and Dr. Jim Reggia offered some very useful pointers about Reinforcement Learning and Dr. Bobby Bhattacharjee offered his expert opinion on a variety of occasions. Special thanks are due to all my course professors, with whom I still have excellent relations, who helped me enhance my understanding of many diverse notions in computer science. I would also like to thank Dr. Timos Sellis, my advisor at the National Technical University of Athens, who both guided my decision to pursue my PhD and co-advised me on the GrouPeer project.

My colleagues and close friends have enriched my graduate life in many ways and deserve a special mention. Dr. Antonios Deligiannakis, who besides being a close friend and roommate of 5 years, offered valuable advice on research and technical issues. Konstantinos Bitsakos, my neighbor and colleague of years, who proved a rare friend and collaborator. The same is true for Emmanuel Hourdakis, Athanasios Chryssis, Dr. Yiannis Sismanis, Polyvios Pratikakis, Dr. Alexandros Lambrinidis, Konstantinos Spiliopoulos, Vassilios Botopoulos and all my friends here in College Park. I thank them all for their friendship and support.

Deepest thanks are due to my dear friends in Greece: Georgios Gkoumas, Georgios Andritsopoulos, and the rest of my friends who have constantly supported me in both good and bad periods of my stay here. Among them is my very good friend and colleague, Verena Kantere. Our collaboration was fruitful and utterly satisfying for me. I thank her both for her cooperation and for her support all those years.

Finally, I would like to acknowledge the financial support from the U.S. Army Research Laboratory and the U.S. Army Research Office for all the projects discussed herein, as well as the continuous efforts of our project leader, Dr. Barras. Without their support, all this would have been near-impossible.

Special thanks are due to the administrative staff of CS and ISR who have accommodated all my needs effortlessly. The same is true for the technical support offered by the CS and UMIACS staffs. It is impossible to remember all, and I apologize to those I've inadvertently left out.

Lastly, thank you all and thank God!

TABLE OF CONTENTS

List of Figures

1 Introduction
1.1 The Notion of Peer-to-Peer and its Internet Origins
1.2 Categorization of Peer-to-Peer Overlays
1.3 A General Model for an Unstructured P2P Overlay
1.4 Our Contribution

2 Searching in Unstructured P2P Overlays: The APS Method
2.1 Overview
2.2 The APS Method
2.2.1 Algorithm Description
2.2.2 Discussion
2.2.3 Algorithm Improvements
2.2.4 APS and Reinforcement Learning
2.2.4.1 Elements of Reinforcement Learning
2.2.4.2 Problem Formulation
2.3 Simulation Results
2.3.1 Comparing the Index Update Functions
2.3.2 Basic Performance Analysis
2.3.3 Discovered Objects vs. Distance from Requesters
2.3.4 Effect of Object Popularity
2.3.5 Results for Different Topologies
2.3.6 Comparison with GUESS
2.4 Summary

3 Content Dissemination to Groups of Peers: AGNO
3.1 Overview
3.2 AGNO Protocol Description
3.2.1 Protocol Specifics
3.3 Simulation Results
3.3.1 Basic Performance Analysis
3.3.2 Sensitivity to λr
3.3.3 Changes in Group Size
3.3.4 Sensitivity to the Binning Scheme
3.3.5 Real Traces
3.4 Summary

4 Adaptive Replication for Unstructured Overlays
4.1 Introduction and Overview of our Approach
4.2 Adaptive Replica Expansion/Contraction: APRE
4.2.1 Protocol Implementation
4.3 Performance Evaluation
4.3.1 Basic Performance Comparison
4.3.2 Load Distribution Between Replicas
4.3.3 Flash Crowds
4.3.4 Effect of the Maximum Replication Ratio and Limitup
4.3.5 Simulations with Different Topologies
4.4 Conclusions

5 Related Work
5.1 Search Algorithms for P2P Systems
5.1.1 Blind Search Methods
5.1.2 Informed Search Methods
5.2 Data Dissemination
5.2.1 Application-layer Multicast
5.2.2 Multicast over P2P Overlays
5.2.3 Group Communication in Unstructured Overlays
5.3 Replication

6 Conclusions

Appendices

A Analysis and Comparison of P2P Search Methods
A.1 Overview
A.2 Performance Evaluation
A.2.1 Algorithm Implementations
A.2.2 Basic Comparison
A.2.3 Results on Denser Graphs
A.2.4 Increased Number of Objects
A.3 Conclusions

B Sharing Relational Data in Unstructured Overlays
B.1 Introduction
B.1.1 Motivating Example
B.1.2 Our Contribution: GrouPeer
B.2 Clustering Peers for Accurate Query Answers
B.2.1 Query Reformulation and Similarity for P2P Database Systems
B.2.2 Description of the Clustering Process
B.2.3 GrouPeer Protocol Internals
B.3 Interest Groups in GrouPeer
B.3.1 Group Inference
B.3.2 Discussion on the Group Inference Process
B.4 Experimental Evaluation
B.4.1 Clustering Results
B.4.2 Group Inference Results
B.5 Related Work
B.6 Summary

Bibliography

LIST OF FIGURES

1.1 Client-Server architecture
1.2 Peer-to-Peer architecture
1.3 Centralized P2P network
1.4 Pure P2P network
1.5 Hybrid P2P architecture
1.6 Schematic description of an overlay network
1.7 An identifier circle (ring) with 10 nodes in a DHT. Node 8 issues a lookup(54) command and this is routed to the host node (56)
1.8 Pictorial description of our framework. Links represent logical connections in the overlay. Two searches take place, one from node A (for object k) and the other from node J for object i
2.1 Search for an object stored at node F using the pessimistic approach of APS with two walkers. The table shows how various index values change, where X→Y denotes the index value stored at node X for neighbor Y relative to the requested object.
2.2 The model of interaction between agent and environment
2.3 Part of an overlay for our example. Nodes 2 and 7 obtain the object in search
2.4 No unlearning, first setting
2.5 No unlearning, second setting
2.6 Learning with decay, first setting
2.7 Learning with decay, second setting
2.8 Standard update, first setting
2.9 Standard update, second setting
2.10 Linear update, first setting
2.11 Linear update, second setting
2.12 Distance-based update, first setting
2.13 Distance-based update, second setting
2.14 Percentage of finding the closest object for the various index update methods
2.15 Success rate, message production, number of hits and number of duplicate messages of the two methods vs. number of deployed walkers in the three different settings
2.16 Success rate vs. number of requests per object
2.17 Hits per query vs. hop distance (static setting)
2.18 Hits per query vs. hop distance (more dynamic setting)
2.19 Ratio of hits per query vs. hop distance for w-APS
2.20 Ratio of hits per query vs. hop distance for s-APS
2.21 Distribution of index values according to object popularity for one peer and 10 objects
2.22 Distribution of index values versus object popularity in our default setting
2.23 Individual success rate vs. object popularity
3.1 Graphic explanation of AGNO reverse indices. The filled table represents the reverse index values stored at node A, which coincide with the APS index values that nodes B, C, D, E store regarding A
3.2 The black nodes search for an object stored at node s (left). On the right, s initiates a push phase in order to contact the requesters
3.3 Example of computation of a reverse index value
3.4 Sample binning scheme with the respective Thresh_i and Thresh_i − b_{i−1} values
3.5 Success rate over variable number of searches
3.6 Stress over variable number of searches
3.7 Utilization of pure forwarding vs. backpointers
3.8 Stress and success rate over variable group size
3.9 Success rate over variable λr values (Tn = 10sec)
3.10 Stress values over variable λr values (Tn = 10sec)
3.11 Success rate for different values of T (Tn = 10sec)
3.12 Adaptation to a change in λr by a factor of 20
3.13 Stress and success rates when a different ratio of peers depart at time t = 100sec (λr = 1/sec, Tn = 10sec)
3.14 Success rate after a series of member departures and arrivals (λr = 0.5, Tn = 10)
3.15 Stress after a series of member departures and arrivals (λr = 0.5, Tn = 10)
3.16 Comparison of 100 different binning configurations to the original one
3.17 Average results for one-day periods for the CNN and BBC news front pages
3.18 Results for a 7-day period for the Microsoft, SUN and Oracle quotes between 11:00am and 11:59am
4.1 Part of the overlay network of our model. Dark nodes inside the bold dotted ellipse represent Mi, while those inside the thin dotted ellipse represent Mj. Peers with a file attached also serve objects i or j
4.2 State transitions in our system
4.3 The shaded oval represents a server set for a specific object. Our system expands by creating replicas inside two areas where demand (depicted by arrows) is high.
4.4 Due to low demand in certain regions of the server set (depicted as white areas inside the dotted line), our system contracts its replica set
4.5 Visual representation of a sample power-law graph, after several searches for a single object using the APS method. Solid line arcs show high index value links between nodes
4.6 After searches for an object at s take place, reverse index values are updated and a push phase creates new replicas inside areas of high demand (dotted links)
4.7 Variation in load and |Si| over increasing λr values
4.8 Ratio of overloaded servers vs. variable λr
4.9 Percentage of change in |Si| vs. variable λr
4.10 Variation in the average load vs. variable λr (5000 requesters)
4.11 Percentage of change in |Si| vs. variable λr (5000 requesters)
4.12 Load and |Si| over time for 5000 requesters and multiple λr values
4.13 Variation of the load deviation vs. variable λr (2000 requesters)
4.14 Average values of G as a function of λr for the ClosestFirst strategy
4.15 Si load distribution for λr = 4/sec (2000 requesters)
4.16 Si load distribution for λr = 10/sec (2000 requesters)
4.17 Effect of flash crowds on load and |Si| in two different settings
4.18 |Si| variation for different maximum allowed replication ratios
4.19 Load variation for different maximum allowed replication ratios
4.20 Percentage of overloaded nodes for different maximum allowed replication ratios
4.21 Percentage of change in Si for different maximum allowed replication ratios
4.22 Load variation for different values of Limitup_{s,i}
4.23 Percentage of overloaded nodes for different values of Limitup_{s,i}
4.24 Average load for 1k and 2k requesters in power-law topologies (λr = 6/sec)
A.1 Success rate and message production of the methods using a set of 10,000-node random graphs with average degree d = 4
A.2 Hits per query of the methods using the set of 10,000-node random graphs with average degree d = 4
A.3 Hits per hop distance from the requesters
A.4 Accuracy and message production vs. object popularity in the dynamic setting
B.1 Propagation of queries among acquaintees. The size of the rectangles reflects the amount of degradation after a rewriting. Q1''' ≠ Q2'' because the queries followed different paths
B.2 Query directed towards a group schema which holds mappings with all group members
B.3 Part of a P2P system from a health-related environment
B.4 Two schemas to be semantically merged
B.5 SIG is initialized to SDavisDB and there is a mapping M1 between SIG and SLuDB
B.6 Relations Disease and Sickness of Figure B.5 are merged
B.7 Relations Disease/Sickness and Treatment of Figure B.6 are merged
B.8 Similarity of answers to the original and rewritten query versions over variable queries per requester
B.9 Similarity of answers to join queries over variable queries per requester
B.10 Ratio of GrouPeer's clustering versus the optimal, given an equal number of acquaintees
B.11 % Increase in answer similarity over variable MaxP and t
B.12 % Increase in number of answers over variable MaxP and t
B.13 % Increase in answer similarity over variable group creation time
B.14 % Increase in number of answers over variable group creation time
B.15 Relationship between initiator and inferred schema similarity
B.16 Similarity and number of answers of the initial and merged groups vs. creation time

Chapter 1
Introduction

1.1 The Notion of Peer-to-Peer and its Internet Origins

Peer-to-Peer computing (henceforth P2P) represents the notion of sharing resources available at the edges of the Internet [1]. The P2P paradigm dictates a fully-distributed, cooperative network design, where nodes collectively form a system without any supervision. Most importantly, they operate in a symmetric manner, running the same protocols and communicating freely and equally with each other.

The Internet started out as a system operating under the aforementioned basic P2P properties [2]. The original ARPANET was a network among equal, interconnected computers. Every site could contact and accept connections from every other site in this network. Examples of such applications are USENET [3] and DNS [4].

USENET is a distributed worldwide system that allows users to post, read and exchange messages by directly connecting with each other. An important characteristic of USENET is the lack of any requirement for a central administration or controlling host to manage the network. DNS combines principles of P2P with a hierarchical organization in order to achieve efficient sharing of naming information: instead of replicating and managing a single hosts.txt file, DNS allows the delegation of responsibility through the use of name servers. These hosts operate as both clients and servers, making, answering and forwarding DNS requests.

Figure 1.1: Client-Server architecture
Figure 1.2: Peer-to-Peer architecture

With the explosion of the Internet, its nature gradually shifted from being symmetric and cooperative towards an asymmetric and disjoint environment.
Web browsing, the dominant application during the Internet's surge in popularity, is based on the client/server architecture: Client machines send requests to a small number of powerful, well-known sites running special software, retrieve answers and display them locally (see Figure 1.1). The immense popularity also brought the need for security and control. Large sub-networks got behind firewalls, denying the majority of their hosts direct access to or from the outside world. The same effect is produced by the numerous NAT boxes which are used to provide a single point of contact between the Internet and the numerous local networks. Finally, the Internet Service Providers, realizing that the dominant trend is to request and not to disseminate data, engineered and provided asymmetric bandwidth services which inherently changed the ability to equally share content.

With the emergence of file-sharing P2P applications (especially [5, 6]), users started massively sharing multimedia resources freely and equally, without any central control (see Figure 1.2). A large number of systems and architectures that utilize this technology have emerged since ([7, 8], etc.). Its advantages (although application-dependent in many cases) include robustness to failures, extensive resource-sharing, self-organization, load balancing, data persistence, anonymity, etc.

According to very conservative estimates [1], there exist more than 10×10^9 MHz of CPU power and 10,000 TB of storage not utilized at the edges of the Internet. According to a different report [9], bandwidth consumption attributed to popular file-sharing applications amounts to a considerable fraction (up to 60%) of the total Internet traffic. These two reports identify two different challenges: On one hand, there is a vast amount of "untapped" potential over the Internet. On the other hand, current resource-sharing applications are responsible for huge amounts of data transmissions over the network. P2P technology can play a key role in our efforts to tackle both issues.

1.2 Categorization of Peer-to-Peer Overlays

An overlay network is a computer network built on top of one or more existing ones (often the Internet itself). Its connectivity usually differs from the underlying physical connectivity. Nodes can be thought of as being connected by logical links, each of which corresponds to a path of one or more physical links (see Figure 1.6). Because of NAT, firewalls and private IP addresses, the implementation of applications for end-to-end communication over the IP network eventually requires some sort of overlay structure. Several P2P overlays have been proposed by both academia and industry in the last few years. Their primary functionality is to provide a routing substrate between nodes identified by a mechanism other than their IP addresses.

Figure 1.3: Centralized P2P network
Figure 1.4: Pure P2P network
Figure 1.5: Hybrid P2P architecture

We can roughly classify P2P architectures into two categories: Centralized approaches utilize a central directory for object location, ID assignment, etc. (see Figure 1.3). Decentralized approaches abandon this solution to employ a distributed directory structure. Pure decentralized systems exhibit a fully distributed behavior, with all peers equally making, answering and forwarding requests (Figure 1.4). In hybrid systems, nodes are categorized as leaf-nodes or super-peers (also referred to as supernodes or ultrapeers, see Figure 1.5).
Super-peers are responsible for returning results to the queries posed by their neighboring leaf-nodes. They usually achieve that by indexing the repositories of all their leaf-nodes and communicating with a number of different super-peers.

Another taxonomy classifies P2P networks into structured and unstructured, according to the degree of control over the topology and routing infrastructure they provide. Structured networks provide strict rules for file placement and object discovery, while unstructured approaches offer arbitrary network topology, file placement and search.

Several researchers have proposed Distributed Hash Tables (DHTs) as a means of organizing a P2P overlay (e.g., [10-13]). In these systems, files and node-IDs are associated with a key produced by hashing filenames or addresses. Each node is responsible for a range of keys in this namespace: Object locations are stored at the node(s) whose ID(s) is (are) numerically closest to the given key. The basic operation these DHT systems implement is lookup(key), which returns the identity of the node storing the object with that key. When a lookup(key) is issued, the message is routed through the overlay network, each time bringing the request to nodes numerically closer to the key, until the node responsible for it is reached (see Figure 1.7). DHTs provide a very efficient (O(log n), with n equal to the size of the network) routing mechanism. This comes at the cost of maintaining state about a number of overlay nodes that assist in routing.

Figure 1.6: Schematic description of an overlay network
Figure 1.7: An identifier circle (ring) with 10 nodes in a DHT. Node 8 issues a lookup(54) command and this is routed to the host node (56)

Today, some of the most popular P2P applications operate on unstructured networks. In contrast to DHTs, peers connect in an ad-hoc fashion, the location of documents is not controlled by the system and no guarantees for the success or the complexity of a search are offered to the users. More importantly, peers obtain only local knowledge of a network where nodes enter and leave frequently. For such systems, searching for an object is traditionally implemented by either broadcast-based schemes [6] or randomized walks [14]. Queries utilizing exact-match object-IDs or keywords are propagated inside the overlay on a hop-by-hop basis. Each time a peer receives a request, it evaluates it against its local repository and (if necessary) forwards it to a number of its overlay neighbors (i.e., nodes directly connected to it).

1.3 A General Model for an Unstructured P2P Overlay

In many realistic scenarios, the topology cannot be controlled and thus DHTs cannot be used (e.g., ad-hoc networks or current large-scale unstructured overlays). In our work, we focus on pure decentralized unstructured P2P systems. Such systems have been shown to attract large user populations and to have great impact on the network community [9].

We now describe our system model for search and content-sharing in unstructured P2P networks. We assume a pure P2P model, with no imposed hierarchy over the set of participating peers. All of them may equally serve and make requests for various objects. Peers and documents are assumed to have unique identifiers, with object IDs used to specify the query target. Ignoring physical connectivity and topology from our discussion, we assume that peers are aware of their one-hop neighbors in the overlay.
Neighboring nodes are connected with direct logical links and can contact each other with one overlay message. Throughout its lifetime, a node periodically checks the availability and status of its neighbors. The system can generally exhibit a dynamic behavior, with peers entering and leaving at will and also updating their local repositories. We should also note that we do not expect the overlay structure to be static, since nodes are not guaranteed to connect to the same neighbors each time they return from an off-line state.

Peers overcome some of these deficiencies by keeping soft state, i.e., auxiliary information stored at a node, erased after a short amount of time and the loss of which will not keep the node from functioning. For example, peers temporarily store the unique ID of each query they process, enabling them to distinguish between new queries and duplicate ones.

Each peer retains a local collection of documents (or objects), while it makes requests for those it wishes to obtain. The documents are stored at various nodes across the network, without the system dictating a relationship between content and its location (unlike DHTs). Objects are assumed to be of varying popularity, which affects the respective number of replicas and received requests. Objects are distributed over the network according to the replication distribution, which dictates the number and identity of objects stored at each node. Each peer makes requests according to a query distribution, which controls how many requests are made for each object (e.g., popular objects get many more requests than unpopular ones). A search is successful if it discovers at least one replica of the requested object. The ratio of successful to total searches made is called the success rate (or accuracy). A search can result in multiple discoveries (or hits), which are replicas of the same object stored at distinct nodes. A global time-to-live (TTL) parameter represents the maximum hop-distance a query can travel before it gets discarded.

Figure 1.8 shows a sample overlay that demonstrates the concepts of this model. Our system consists of 10 nodes with IDs A-L. Node E holds objects i and k. Node A initiates a search for object k, indicated by the arrows. Searches are propagated among peers, usually in a hop-by-hop fashion. Search messages have their TTL value set to 2, so that no node more than 2 hops away from A can receive them. Each search message contains a unique identifier (e.g., s1), the initiator's ID, the requested object and the remaining hops, reduced by one at each node. Peer E replies directly to A, notifying it that it holds the desired object (dotted arrow). A similar search, identified as s15, takes place from J. Node E now notifies J that it can share object i.

Figure 1.8: Pictorial description of our framework. Links represent logical connections in the overlay. Two searches take place, one from node A (for object k) and the other from node J (for object i).

1.4 Our Contribution

In our previous discussion, we argued about the potential and impact of the P2P paradigm in the modern Internet. A variety of sources attest to the importance that P2P has received over the last few years. On one hand, we have economic incentives that follow the success and popularity of sharing content available at the edges of the Internet.
On the other hand, we cannot but notice the profound economic, social and practical impact that such applications have: Their operation often challenges pre-defined notions of copyright, trust, accountability and security. We have witnessed a barrage of legal disputes, which represent the materialization of the clash between powerful economic interests and the users' desire to freely share. While such issues evolve and hold great significance, we cannot overlook some equally important practical issues. The unsupervised, ad-hoc character of P2P systems puts inherent strains on their ability to operate both efficiently and at low cost. Valuable resources such as bandwidth, processing power, and connection time must be carefully distributed according to demand and the system's resources.

In this dissertation, we exclusively deal with the practical/technical challenges in P2P networking. Our goal is to provide functional, adaptive and bandwidth-efficient algorithms for unstructured Peer-to-Peer networks. Our main contribution is an efficient search algorithm for locating content. Its most notable characteristic is the utilization of a learning feature that enables an increase in accuracy as more requests are generated in the network. Extending this, we also present our protocol for content dissemination to groups of peers in a cost-effective manner. We conclude by presenting a scheme which, building on our previous algorithms, achieves adaptive replication in order to perform efficient content sharing in high-demand scenarios. Hence, our contribution can be divided into three major parts:

1. Adaptive Probabilistic Search for Unstructured P2P Networks:

We propose a new search algorithm that achieves efficient lookups with low bandwidth consumption, the Adaptive Probabilistic Search method (APS) [15]. In APS, a node deploys k walkers for object discovery, but the forwarding process is probabilistic instead of random. Peers effectively direct walkers using feedback from previous searches, while keeping information only about their neighbors. As we show in this work, APS exhibits many desirable characteristics, such as:

- High accuracy
- Low bandwidth consumption
- Robust and adaptive behavior in rapidly-changing environments

These features come as a result of our algorithm's learning character, which enables peers to share, refine and adjust their search knowledge with time. Furthermore, APS induces zero overhead on the network at join/leave/update operations. We present a formulation of our method by defining it as a Reinforcement Learning problem. This formulation explains many of our empirical observations.

Concurrently with this work, Appendix A presents a detailed comparison of contemporary search algorithms for unstructured overlays. Our work in [16, 17] describes a detailed categorization, description and performance evaluation of current approaches. Our focus lies on the behavior of these algorithms with respect to each of the following metrics:

- Efficiency in object discovery (accuracy and number of hits)
- Bandwidth consumption
- Adaptation to changing topologies and workloads

To evaluate our analysis, we simulate eight of those methods (alongside APS) and present a direct quantitative comparison of their performance.
We believe this is an important contribution that can pro- vide a better understanding of the various search mechanisms and assist in choosing an algorithm that best fits a particular application. 2. Adaptive Group Notification for Unstructured P2P Networks: In this part, we propose a novel approach to content dissemination based on the demand incentive. The goal of our Adaptive Group Notification (AGNO) [18] protocol is to enable peers to disseminate important updates/notifications that relate to shared ob- jects in the overlay. AGNO combines the utilization of state accumulated during the APS search process together with a set number of probabilistically stored requester addresses to contact groups of nodes defined implicitly through lookups. Our method builds its knowledge by only monitoring the independently conducted searches and avoids the cost of explicit multicast group formation. 3. Adaptive Probabilistic Replication for Unstructured P2P Networks: This part of our work describes APRE (Adaptive Probabilistic REplication) [19]. It represents the third member of our suite of algorithms that build on probabilistic soft state in order to provide higher-order functionality. Our goal is to design and implement a repli- cation protocol that will provide efficient sharing of objects (servers operating under low load), scalability and bandwidth-efficiency. APRE is a distributed scheme that automat- ically adjusts the replication ratio of every shared item according to the current demand for it. By utilizing inexpensive routing indices during searches, loaded servers are able to identify ?hot? areas inside the unstructured overlay with a customizable push phase. Chosen nodes receive copies thus sharing part of the load. Under-utilized servers become 11 freed and can host other content. The rationale behind APRE is the tight coupling be- tween replication and the lookup protocol which controls how searches get disseminated in the overlay. By combining the Adaptive Probabilistic Search (APS) state with APRE, we are able to identify in real-time ?hot? or ?cold? paths and avoid the need of advertising constantly created replicas. We show that this method proves very efficient in minimizing the number of overloaded peers and achieving a robust and well-balanced distribution in a variety of settings. The remainder of this thesis is organized as follows: Chapter 2 presents our work on searching in P2P overlays, presenting the Adaptive Probabilistic Search. Chapter 3 describes the group notification scheme, which implicitly groups peers according to their demand for an object. In Chapter 4 we present the last member from this family of algorithms, a replication scheme that adaptively expands or contracts the replicas of any given object based on local demand computation. Our presentation ends with the related work (Chapter 5) and our conclusions (Chapter 6). Appendix A presents the performance evaluation of a number of search methods presented in Chapter 5. Finally, Appendix B describes GrouPeer, a system that adopts our goal of efficient content sharing through learning and grouping in the area of relational peer-databases. 12 Chapter 2 Searching in Unstructured P2P Overlays: The APS Method 2.1 Overview Searching for information has been a fundamental tool in society?s continuous ef- fort for progress. We are witnesses to a series of breakthroughs in technology which, in turn, fundamentally alter the way humans communicate with each other. 
With the rise and popularity of the Internet, immense amounts of information have become available to an increasing number of people. Searching and processing this ocean of information has become an absolute necessity. As an example, a nationwide survey of Internet users in 2004 [20] shows how important search engines have become: Over 85% of Internet users were reported to search daily for content, ranking this activity second only to email. Steadily, these "consumers" become producers, adding their own content, in a self-reinforcing process.

The primary goal of P2P systems is to allow large peer populations to interconnect and share content. In these systems, each peer individually decides on its availability, conformity to protocols and identity of objects to share. Due to the decentralization and heterogeneity of these environments, it is vital that efficient lookup schemes are provided to their users. The lack of a centralized directory or global knowledge forces searches to take place in a distributed manner, with peers directing queries to a greater part of the system. This, combined with the large popularity and enormous volumes of data being exchanged, necessitates bandwidth-efficient P2P searches. Finally, it is important to note that popular P2P networks display a highly dynamic behavior, with most users connecting for small periods of time and then leaving the system [21], locally managing their object repositories. Any algorithm that fails to scale along this pattern inevitably puts an excessive burden on network traffic.

A search process includes aspects such as the query-forwarding method, the set of nodes that receive query-related messages, the form of these messages, local processing, stored indices and their maintenance, etc. We associate the performance of an algorithm with its success rate and number of hits, while its cost relates to the number of messages produced, either directly during the search or indirectly during index updates, object relocations, etc.

We can categorize search schemes according to the query-forwarding method into flood-based (utilizing the standard flooding scheme or one of its variations, e.g., [22]), non-flood-based (e.g., hop-by-hop [23], direct contact [24]), or combinations of the two (e.g., [25]). According to the type of information used, there exist two general strategies to search for an object: Blind and informed searches.

Blind schemes try to propagate the query to a sufficient number of nodes in order to satisfy the request. Current methods waste a lot of bandwidth to achieve high accuracy: every search requires contacting many nodes within a distance called the time-to-live (TTL), generating huge overhead in the network. This approach aims at finding the maximum number of results within an area of the network with the originating node at the center and a radius given by a TTL-related parameter.

Several search protocols have been proposed with the intention of reducing the overhead of the original flooding scheme. In the Random Walks algorithm [14], the requesting node sends out k query messages to an equal number of randomly chosen neighbors. Each of these queries follows its own path, having intermediate nodes forward it to a randomly chosen neighbor at each step. These queries are known as walkers. While this approach manages to reduce messages by more than an order of magnitude, it exhibits low accuracy due to its random nature and inability to adapt to different query loads.
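As a point of reference for the protocols that follow, the sketch below illustrates the k-walker Random Walks scheme just described as a minimal in-memory simulation. It is only an illustration under our own assumptions: the function and variable names (random_walk_search, neighbors, holders) are ours and are not taken from the cited algorithm's implementation, and the message accounting is simplified.

    import random

    def random_walk_search(neighbors, holders, requester, obj, k=4, ttl=5):
        """Deploy k walkers; each hop forwards to a uniformly random neighbor.

        neighbors: dict mapping node -> list of neighbor nodes
        holders:   dict mapping node -> set of objects stored locally
        Returns (hit_found, messages_used)."""
        messages = 0
        hit = False
        first_hops = neighbors[requester]
        for start in random.sample(first_hops, min(k, len(first_hops))):
            node, hops = start, ttl
            while hops > 0:
                messages += 1                        # one overlay message per forwarding step
                if obj in holders.get(node, set()):  # walker terminates successfully on a hit
                    hit = True
                    break
                hops -= 1
                if not neighbors[node]:
                    break
                node = random.choice(neighbors[node])  # blind, uniformly random next hop
        return hit, messages

    # Toy usage: a 5-node line A-B-C-D-E where only E stores object "x".
    nbrs = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
    objs = {"E": {"x"}}
    print(random_walk_search(nbrs, objs, "A", "x", k=2, ttl=4))

Because the next hop is chosen blindly, repeated runs of this toy example succeed or fail at random; this is exactly the behavior APS improves upon by biasing the choice with learned index values.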
Informed approaches, on the other hand, utilize stored or created information in order to locate content in the overlay. The semantics of the used information range from simple forwarding hints to exact object locations. The placement of this information may also vary: In centralized approaches (e.g., [5]), a central directory known to all peers exists. Distributed approaches can be further subdivided into pure and hybrid. In purely distributed approaches (e.g., [15, 22, 26]), all participating peers maintain some portion of the information. In hybrid schemes, certain nodes assume the role of a super-peer and the rest become leaf-nodes. Each super-peer acts as a proxy for its leaf-nodes by indexing all their documents and serving their requests. GUESS [27] is an example from this category. Peers are ranked as ultrapeers or leaf-nodes. A search is conducted with the requester's ultrapeer iteratively contacting different (not necessarily neighboring) ultrapeers and having them ask all their leaf-nodes, until a number of objects are retrieved.

The semantics of the stored indices in informed approaches can be used for another categorization. Indices might relate to exact object locations (e.g., [25]), the probability of discovery through a link (e.g., [15]), the number of objects reachable through a link (e.g., [26]), or other metrics (e.g., [28]). Informed methods use their indices in order to achieve high accuracy (by choosing "good" neighbors to forward the query to) and to reduce overhead. The shortcoming of most informed methods is the maintenance cost of the indices while peers join/leave the network or update their collections. In most cases, these events trigger floods of update messages, increasing network traffic.

In this thesis, we propose a new search algorithm that achieves high performance at low cost, the Adaptive Probabilistic Search method (APS). In APS, a node deploys k walkers which are probabilistically directed using index values that each peer stores regarding its neighbors. The indices are updated along the paths of every walk according to the outcome of the search. This enables searches to become more accurate as more requests are generated in the network, while each peer stores only a small part of this knowledge. Because of the nature of these indices, APS induces zero overhead on the network at join/leave/update operations. As we show in this work, APS achieves high accuracy and maintains low message consumption in both static and dynamically changing environments.

In the remainder of this chapter we make the following contributions:

1. Define the APS algorithm for search in unstructured P2P networks. We describe the main idea, the indexing scheme, the search and update procedures, and analyze its performance.

2. Present two improved versions of the algorithm which exhibit significant gains in message reduction and in the number of objects discovered near the requesters, respectively.

3. Formulate our problem as a Reinforcement Learning problem and show that convergence of the index values to optimal ones can be achieved.

4. Perform extensive simulations and compare APS with the Random Walks and GUESS methods over different environments. Our algorithm achieves excellent results in the success rate, number of discovered objects, message consumption and adaptation to changing topologies.

2.2 The APS Method

2.2.1 Algorithm Description

In APS, each node keeps a local index consisting of one entry per neighbor for each object it has requested, or forwarded a request for.
The value kept for each index entry reflects the relative probability of this node's neighbor to be chosen as the next hop in a future request for the specific object. Searching is based on the simultaneous deployment of k walkers and probabilistic forwarding: A node forwards to k (if it initiates a search) or one (if it is an intermediate node) of its neighbors, not randomly, but using the probabilities computed from the stored index values.

The search message is defined by the tuple (requesterID, objectID, searchID, TTL, v[TTL]). The requester node includes its identity, the identity of the object in search, the unique ID of the search, its scope (TTL) and an initially empty array of TTL values. Entry v(d) identifies the node visited by this walker after d forwarding steps. The requester chooses k out of its N neighbors (if k ≥ N, the query is sent to all neighbors) to forward the request to. Each of these nodes evaluates the query against its local repository and, if a hit occurs, the walker terminates successfully. On a miss, the query is forwarded to one of the node's neighbors. This procedure continues until all k walkers have terminated, either with a success or a failure. At each forwarding step, the current node appends its identifier to the search message (the corresponding entry in the v table) and keeps soft state about the search it has just processed (the requesterID, searchID pair). If two walkers from the same request cross paths (i.e., a node receives a duplicate message due to a cycle), the second walker is assumed to have terminated with a failure and the duplicate message is discarded.

Index values stored at peers are updated in the following manner: When a node forwards the request to one or k of its neighbors, it pro-actively either increases the relative probability of the peer(s) it picked, assuming the walker(s) will be successful (optimistic approach), or decreases the relative probability of the chosen peer(s), assuming the walker(s) will fail (pessimistic approach). Upon walker termination, if the walker is successful, there is nothing to be done in the optimistic approach. If the walker fails, index values relative to the requested object along the walker's path must be corrected. Using information available inside the search message, the last node in the path sends an "update" message to the preceding node. This node, after receiving the update message, decreases its index value for the last node to reflect the failure. The update procedure continues along the reverse path towards the requester, with intermediate nodes decreasing their local index values relative to the next hops for that walker. Finally, the requester decreases the index value that relates to its neighbor for that walker. If we employ the pessimistic approach, this update procedure takes place after a walker succeeds, having nodes increase the index values along the walker's path. There is nothing to be done when a walker fails.

Indices    Initially    After walkers finish    After updates
A→B        30           20                      20
B→C        30           20                      20
C→D        30           20                      20
A→E        30           20                      40
E→F        30           20                      40
A→G        30           30                      30

Figure 2.1: Search for an object stored at node F using the pessimistic approach of APS with two walkers. The table shows how various index values change, where X→Y denotes the index value stored at node X for neighbor Y relative to the requested object.

Figure 2.1 shows an example of how the search process works. Node A initiates a request for an object owned by node F using two walkers.
Assume that all index values relative to this object are initially equal to 30 and the pessimistic approach is used. The paths of the two walkers are shown with thicker arrows. During the search, the index value for a chosen neighbor is reduced by 10. One walker with path (A, B, C, D) fails, while the second with path (A, E, F) finds the object. The update process is initiated for the successful walker on the reverse path (along the dotted arrows). First node E, then node A increase the value of their indices for their next hops (nodes F and E respectively) by 20 to indicate object discovery through that path. In a subsequent search for the same object, peer A will choose peer B with probability 2/9 (= 20/(20+40+30)), peer E with probability 4/9 and peer G with probability 3/9.

Our method utilizes "probabilistic" walkers with a learning feature that incorporates knowledge from past and present searches to enhance future performance. The learning process adaptively directs the walkers to promising parts of the network, while keeping bandwidth consumption low. APS requires no message exchange on any dynamic operation such as node arrivals or departures and object insertions or deletions. Because the indices do not depend on content or its location but rather on the success or failure of search paths, the handling of these operations is simple: If a node detects the arrival of a new neighbor, it will associate an initial index value with that neighbor when a search takes place. If a neighbor disconnects from the network, the node removes the relevant entries and stops considering it in future queries. No action is required after object updates, since indices are not related to file content. So, although APS actively uses information, its maintenance cost on any of these events is zero, a major advantage over most current approaches.

2.2.2 Discussion

Each node stores a relative probability (e.g., an unsigned integer value) for each of its neighbors for each (directly or indirectly) requested object. So, for R such objects and N neighbors, O(R·N) space is needed. For a typical network node, this amount of space is not a burden. In nodes with limited storage capacity, index values for objects not requested for some time can be erased. This can be achieved by assigning a time-to-expire value to each newly created or updated index, or by expunging the least recently (or frequently) used indices.

Let us calculate how many messages it takes for the APS method to terminate (in success or failure). In the worst case, where all walkers travel TTL hops and then invoke the update procedure, the number of messages exchanged will be 2k·TTL, so the method has the same complexity as the Random Walks algorithm, O(k·TTL). The only extra messages that occur in APS are due to the update process along the reverse path. This is where our two index update policies are used: If we expect, or experience after a while, that for a specific number of walkers k only few of them terminate successfully, then the pessimistic mode should be employed. Conversely, if many of our walkers hit their targets on average, the optimistic approach should be considered.

Along the paths of all k walkers, indices are updated so that better next-hop choices are made with higher probability. Our learning feature includes both positive and negative feedback from the walkers in both update approaches.
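To make the forwarding and update procedures concrete, the following is a minimal sketch of a peer's local APS state together with the two operations described above: probabilistic next-hop selection and the reverse-path reward of the pessimistic policy. The numeric constants (initial value 30, decrease of 10, increase to +20 on success) come from the running example of Figure 2.1; the class name, helper names and data structures are our own assumptions, not the dissertation's actual implementation.

    import random

    class APSPeer:
        """Per-node APS state: one index value per (objectID, neighbor) pair."""

        INIT = 30    # initial index value, as in the example of Figure 2.1
        STEP = 10    # pessimistic decrease applied when a neighbor is chosen
        BONUS = 20   # increase applied on the reverse path of a successful walker

        def __init__(self, neighbors):
            self.neighbors = list(neighbors)
            self.index = {}  # (objectID, neighbor) -> relative probability value

        def _value(self, obj, nbr):
            return self.index.setdefault((obj, nbr), self.INIT)

        def pick_next_hop(self, obj, exclude=()):
            """Choose a neighbor with probability proportional to its index value."""
            candidates = [n for n in self.neighbors if n not in exclude]
            if not candidates:
                return None
            weights = [self._value(obj, n) for n in candidates]
            chosen = random.choices(candidates, weights=weights, k=1)[0]
            # Pessimistic policy: pre-emptively lower the chosen neighbor's value
            # (kept above zero here, an assumption for the sketch).
            self.index[(obj, chosen)] = max(1, self._value(obj, chosen) - self.STEP)
            return chosen

        def reward_next_hop(self, obj, nbr):
            """Reverse-path update after a successful walker: the +20 increase more
            than offsets the earlier -10, so the value ends up above its initial one."""
            self.index[(obj, nbr)] = self._value(obj, nbr) + self.BONUS

After the two walkers of the example finish and the successful path is rewarded, node A's values for neighbors B, E and G are 20, 40 and 30, so pick_next_hop would select them with probabilities 2/9, 4/9 and 3/9, matching the numbers computed above.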
In the pessimistic approach, each node on the walker's path decreases the relative probability of its next hop for the requested object concurrently with the search. If the walker succeeds, the update procedure increases those index values by more than the subtracted amount (positive feedback). So, if the initial index value for a neighbor for a certain object was J, it becomes bigger than J if the object is discovered through (or at) that node and smaller than J if the walker fails. This is the only invariant we require from our update process. In the next section, we compare several index update functions to empirically decide on their performance. The learning process in the optimistic approach operates in an opposite fashion, with negative feedback taking place after a walker fails.

Our algorithm exhibits both learning and unlearning characteristics: Learning is important to achieve both high performance and discovery of newly inserted objects. Unlearning helps our search process adjust to object deletions and node departures, redirecting the walkers elsewhere.

Another characteristic of the algorithm is its ability to learn faster with more queries. The more feedback from the walkers, the more precise the indices become. This particularly suits the discovery of popular objects in the P2P network, which, according to studies [21], constitute over 60% of all searches. Another observation is that all nodes participating in a search benefit from the process. This is a distinctive feature of our method, with indices being constantly updated during searches and not after object updates. In our case, both requesters and peers on the paths of all walkers actively adjust their knowledge about the specific object. A node that has never before requested an object but is "near" peers that have done so inherits this knowledge by proximity. Besides standard resource-sharing in P2P systems, our algorithm achieves the distribution of search knowledge over a large number of peers.

2.2.3 Algorithm Improvements

APS produces update messages to adjust index values along the paths of some walkers. Our goal is to minimize these messages in order to further reduce the level of bandwidth consumption. Obviously, if fewer than k/2 walkers are successful, then the pessimistic approach should be employed instead of the optimistic, and vice versa. Choosing one strategy over the other for queries over all objects is not optimal, as many unnecessary update messages would be produced for both popular and unpopular object requests. An improved version of APS is the swapping-APS (or s-APS), where the algorithm constantly monitors the ratio of successful walkers for each object and accordingly switches to the update policy that produces fewer messages. This makes our s-APS improvement even more bandwidth-efficient. The number of objects for which nodes monitor the successful-walker ratio depends on available node storage, although the overhead will be very small in most cases.

Another improvement relates to the index update procedure. The idea is to give preference to objects located near the requesters. In the original scheme, all index values are updated without any regard to the hop-distance from the requesters. In the weighted approach (w-APS), we incorporate a distance-based function for modifying the indices stored at each node. Index values for peers closer to the discovered object are increased more than those for distant nodes.
Both improved versions impose no extra burden on the search process, while they aim at reducing its average response time. This is achieved either by decreasing the number of produced messages or by decreasing the distance to the objects.

2.2.4 APS and Reinforcement Learning

In this section we discuss the formulation of our problem as a reinforcement learning problem. First, we briefly describe the area of reinforcement learning and the general components of the problems it addresses. We then show that our search scheme can be similarly formulated and that a unique optimal policy can be reached.

2.2.4.1 Elements of Reinforcement Learning

Reinforcement learning is defined as the task of learning how to behave in a certain environment [29]. Specifically, an agent or learner is expected to learn a mapping from states to actions which will eventually maximize its feedback from the environment (reward). The agent has to learn its behavior through trial and error, unlike supervised learning, where the correct behavior is given through a series of training examples.

Figure 2.2 shows the model that reinforcement learning uses. At each time step t the agent is in state s_t. It can choose an action a_t ∈ A(s_t), that is, one of the actions available given the state it is in. Upon acting, it finds itself in state s_{t+1}, while the environment signals a numerical value r_{t+1} as a reward for entering that state. The goal of the agent is to select a policy π (a choice of an action for each possible state) that maximizes the total reward it receives in the long run:

R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}

Figure 2.2: The model of interaction between agent and environment.

The parameter γ ∈ (0,1] is the discount factor that allows the same formulation for both finite (T < ∞) and infinite (T = ∞) horizon problems. It represents the natural notion that rewards received in the future should not weigh as much as the immediate reward.

Besides the model and the reward function, the third important element in reinforcement learning methods is the notion of the state value. The value of a state s, V^π(s), represents the expected total reward for the agent starting at state s and following policy π thereafter. Intuitively, state values represent a metric of goodness of the agent being in that state. Their computation is the ultimate goal in a reinforcement learning problem.

A very important property that greatly assists in the formulation and solution of reinforcement learning problems is the Markov property. It states that, given a finite number of states and actions, next or future states depend solely on the current state and action. Formally, the probability of transitioning from state s_i to state s_j by taking action a_k is P{s_{t+1} = s_j | s_t = s_i, a_k}, i.e., it depends only on the current state and action. If a problem has this property, then it can be formulated as a Markov Decision Process (MDP).
An MDP is formally defined by the tuple (S, A, {P_sa}, γ, R), where S is the set of states, A is the set of actions, {P_sa} is the set of transition probabilities, γ ∈ [0,1) is the discount factor and R : S → ℝ is a reward function. As we described before, the MDP proceeds in the following manner: At each time step t, the agent finds itself in state s_t. Choosing an action a_t ∈ A, drawn according to the probabilities {P_{s_t a_t}}, brings the agent to the next state s_{t+1}, where it receives a reward of R(s_{t+1}). The total reward (or return) is given by the discounted sum of rewards, and the goal then becomes the maximization of the return. The value function of a state, as we mentioned before, is defined as the expected return when the agent starts from that state. The value function satisfies the Bellman equation:

V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S,\, a \in A(s)} P_{sa} V^{\pi}(s')

Thus, the value of a state equals the immediate reward for being in that state plus the discounted values of the future states, weighted through the transition probability distribution.

The core of reinforcement learning methods is the estimation of these value functions for each state, since they represent a measure of how useful each state is in achieving a high return. The Bellman equations, which hold for a fixed policy π, define an |S| × |S| linear system that can be solved to give the respective V(s) values. Among all policies, there exists at least one that maximizes the expected sum of rewards. Following that policy (π*) produces the optimal value functions V*(s), which also satisfy Bellman's equation:

V^{*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa} V^{*}(s')

This equation means that the optimal value function of state s equals the immediate reward plus the maximum (over all possible actions from that state) discounted future sum of rewards. For finite MDPs, there exists a unique set of V*(s) values that satisfy the Bellman equations, regardless of the policy. Given the optimal value functions, one merely has to perform greedy single-step choices that maximize them: Given any start state s, the optimal policy(-ies) are those that choose actions maximizing V*(s). Thus, solving finite-state MDPs becomes equal to the task of producing efficient algorithms (e.g., dynamic programming, temporal-difference learning, Monte Carlo methods, etc.) to estimate the value functions.

2.2.4.2 Problem Formulation

In this section we present a simple formulation of our problem based on Markov Decision Processes. It shows that many parts of our scheme already fit into a framework that provably converges to optimal state values and an optimal policy. First, we show that the Markov property holds for our system. Indeed, each time a query arrives at a node, the path to be followed and the discovery of an object through this node do not depend on the previously visited nodes: What happens next depends on the peer that currently processes the request and the neighbor it chooses to forward it to. In the following, neighbors(x,y) is a function that returns TRUE if nodes x, y are neighbors and FALSE otherwise. Similarly, has(x,o) returns TRUE if node x holds object o and FALSE otherwise. V is the set of nodes that have already received query q. Finally, HL is a variable that initially equals TTL and is reduced by 1 each time q is forwarded.

Having asserted that our problem can be formulated as an MDP, we now define the tuple (S, A, {P_{sas'}}, γ, R) that describes our problem:

S: The set of states. Let S = {S_1, S_2, ..., S_N}, where S_i represents a node in our overlay and |S| = N is the size of the network.
We map the position of q at each point in time to a state of our MDP.

A: The set of actions. Let A(s) = {a_1, a_2, ..., a_N}, where a_i represents the action of node s choosing node i to forward the query to.

P_{sas'}: The transition probability matrix. The quantity P_{iaj} is the probability of transitioning from state i to state j given that action a was taken. For our setting, P_{iaj} = 1 if neighbors(i,j) and a = a_j and HL > 0 and j ∉ V and ¬has(i,o), and P_{iaj} = 0 otherwise. The transition probabilities state that a transition is only allowed if the object has not yet been found, there are still hops left to travel, and we forward to a neighboring node that the query has not previously visited.

γ: The discount factor, with 0 ≤ γ < 1.

R: Numerical rewards associated with each state. Let r_i be the reward we receive when the query reaches state (node) i. For our system, r_i = 1 if has(i,o) and r_i = 0 otherwise.

Given this formulation, we can use Bellman's equation on the optimal state value functions, which, given our definition of P_{sas'}, becomes V*(s) = max_{s'} {r(s) + γ·V*(s')}, with neighbors(s,s') = TRUE.

To illustrate how these equations can actually help us determine, in an optimal way, the query forwarding policy at any peer, we consider the following example: Figure 2.3 shows a subgraph with eight nodes/states. Let us compute V*(s_1), assuming HL = 6:

Figure 2.3: Part of an overlay for our example. Nodes 2 and 7 hold the object being searched for.

V*(s_1) = r_1 + γ·max{V(s_2), V(s_3), V(s_4)}
        = γ·max{1, r_3 + γ·V(s_5), r_4 + γ·V(s_6)}
        = γ·max{1, γ²·max{V(s_2), V(s_7), V(s_8)}, 0}
        = γ·max{1, γ²·max{1, 1, 0}, 0}
        = γ·max{1, γ², 0} = γ

Similarly:

V*(s_3) = r_3 + γ·max{V(s_1), V(s_5)}
        = γ·max{r_1 + γ·max{V(s_2), V(s_4)}, r_5 + γ·max{V(s_2), V(s_7), V(s_8)}}
        = γ·max{γ·max{1, r_4 + γ·V(s_6)}, γ·max{1, 1, γ·r_8}}
        = γ·max{γ·max{1, γ·r_6}, γ·max{1, 1, γ·r_8}} = γ²

With this method, we arrive at the optimal solution: V*(s_1) = γ, V*(s_2) = 1, V*(s_3) = γ², V*(s_4) = γ², V*(s_5) = γ, V*(s_6) = γ³, V*(s_7) = 1, V*(s_8) = γ².

Given these values and assuming γ < 1, it is easy to implement the optimal forwarding policy: From any node s, if V(s) < 1, forward to a neighbor s' that maximizes V(s'). For example, node 1 should forward to node 2, node 5 can forward to either one of nodes 2 or 3, etc. The above hold for the deployment of a single walker per query (k = 1). For k > 1, we initially forward to the top-k neighbors according to their values. The optimal policy behaves greedily with respect to the optimal value functions. Moreover, given the discounted model we used, the query is directed towards the nearest replica in the graph. Nodes with no chance of locating an object have a value of zero. All other states maintain a value of γ raised to the length of the shortest path to a replica. This is the index update model that our w-APS approach utilizes.

Value iteration is one of the methods we can use to compute this solution. We update all values in steps: Initially, V*^(0)(s_i) = r_i. Then, we compute V*^(t)(s_i) = max_{s'} {r(s_i) + γ·V*^(t−1)(s')} for all i and t = 1, 2, ... until the values converge. Initially we have V*^(0)(s_1) = V*^(0)(s_3) = V*^(0)(s_4) = V*^(0)(s_5) = V*^(0)(s_6) = V*^(0)(s_8) = 0 and V*^(0)(s_2) = V*^(0)(s_7) = 1.
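As an illustration of this computation, the sketch below runs value iteration on the overlay of Figure 2.3, with the adjacency reconstructed from the derivation above (nodes 2 and 7 hold the object) and the HL bound omitted, since every shortest path in this example fits within it. It converges to V*(s_i) = γ raised to the shortest distance to a replica; the code is illustrative and not part of the protocol itself.

/* Value iteration for the eight-node example of Figure 2.3 under the
 * simplified Bellman equation V(s) = max over neighbors s' of
 * { r(s) + gamma * V(s') }, with object holders treated as terminal. */
#include <stdio.h>
#include <math.h>

#define N     8
#define GAMMA 0.3

int main(void) {
    /* Undirected adjacency (1-based node ids), as used in the derivation. */
    const int adj[N][N] = {
        /*        1  2  3  4  5  6  7  8 */
        /* 1 */ { 0, 1, 1, 1, 0, 0, 0, 0 },
        /* 2 */ { 1, 0, 0, 0, 1, 0, 0, 0 },
        /* 3 */ { 1, 0, 0, 0, 1, 0, 0, 0 },
        /* 4 */ { 1, 0, 0, 0, 0, 1, 0, 0 },
        /* 5 */ { 0, 1, 1, 0, 0, 0, 1, 1 },
        /* 6 */ { 0, 0, 0, 1, 0, 0, 0, 0 },
        /* 7 */ { 0, 0, 0, 0, 1, 0, 0, 0 },
        /* 8 */ { 0, 0, 0, 0, 1, 0, 0, 0 },
    };
    const double r[N] = { 0, 1, 0, 0, 0, 0, 1, 0 };   /* nodes 2, 7 hold it */

    double V[N], Vnew[N];
    for (int i = 0; i < N; i++) V[i] = r[i];          /* V^(0)(s_i) = r_i  */

    for (int t = 0; t < 100; t++) {
        double diff = 0.0;
        for (int i = 0; i < N; i++) {
            if (r[i] > 0) { Vnew[i] = r[i]; continue; } /* search stops here */
            double best = 0.0;
            for (int j = 0; j < N; j++)
                if (adj[i][j] && r[i] + GAMMA * V[j] > best)
                    best = r[i] + GAMMA * V[j];
            Vnew[i] = best;
        }
        for (int i = 0; i < N; i++) { diff += fabs(Vnew[i] - V[i]); V[i] = Vnew[i]; }
        if (diff < 1e-12) break;                       /* converged          */
    }
    for (int i = 0; i < N; i++)                        /* gamma^(distance)   */
        printf("V*(s%d) = %.4f\n", i + 1, V[i]);
    return 0;
}

In effect, the γ^h reinforcement of w-APS targets this same value profile without ever solving the system explicitly.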
Small changes in this formulation can alter the resulting policy. For example, since we assume each search has a limited scope, we can set γ = 1. This would make all state values equal to 1, meaning an object can potentially be discovered from any state.¹ Another formulation could drop the restriction that the search terminates upon object discovery. For this reformulation, the optimal value functions are: V*(s_1) = γ + γ³, V*(s_2) = V*(s_7) = 1 + γ², V*(s_3) = V*(s_4) = γ² + γ⁴, V*(s_5) = γ, V*(s_6) = γ³ + γ⁵, V*(s_8) = γ². We notice that the values of the states have increased. Indeed, each state's value now has one term for each object it can reach with a path of length at most HL.

¹ Note that this choice may lead to non-optimal policies when deciding on state values. To avoid that, we should instead compute the optimal action-value functions.

The previous analysis describes a theoretical formulation of our problem and the guarantee that, under certain assumptions, dynamic programming (among other) techniques can be used to show convergence to optimal state values (and thus to an optimal query forwarding policy). Nevertheless, in most realistic scenarios, this computation is either very expensive or not desirable. In the area of reinforcement learning, simple greedy algorithms such as ε-greedy, pursuit and softmax methods are considered very effective means of solving similar problems [29]. This is also apparent in the ant-based routing algorithms. They represent a family of reinforcement learning algorithms that base their operation on biological ants and their collective behavior. Such algorithms have proved extremely successful in providing shortest-path routes in dynamic networks [30], yet they incorporate a variety of empirically-tunable parameters and have variable convergence rates. In the chapter presenting related work we describe these approaches in more detail.

2.3 Simulation Results

To simulate the P2P overlay, we mainly used the random graph topology with the pure P2P model. We also experimented with the hybrid model for a comparison with GUESS. In the pure model, all peers equally pose and answer requests; in the hybrid model, nodes are organized in an ultrapeer-leaf hierarchy. Some experiments were run over power-law [31] graphs. We utilized two well-known topology generators: GT-ITM [32] for the pure and hybrid random graph models and Inet-3.0 [33] for the power-law graph model.

For the object placement and query strategies, we choose from two different distributions, namely uniform and zipf. Requesters are randomly chosen and always represent a noticeable fraction (10% or more) of the network. The default graph has 10,000 nodes with an average out-degree d = 9. The default value for k is 12 and for TTL it is 5 hops. The minimum value of an index is 1, so that no nodes are precluded from the forwarding process.

To simulate dynamic network behavior, we insert new nodes and remove online ones with varying frequencies. In the first setting (static), there are no dynamic operations. In the less dynamic setting, the topology changes more than 300 times during each run, while in the more dynamic one it changes more than 3000 times. Periodically, a portion of peers depart from the overlay and offline ones return. We always keep approximately 80% of the network nodes online. Departing nodes clear their local cache of all built knowledge.

We used 100 objects in most simulations for simplicity and speed. Objects are of varying popularity, which affects the respective number of replicas and received requests. An increase in the number of objects did not affect the quality of the results. We modeled the query and object placement strategies using a zipfian distribution to achieve results similar to the observations in [21]. The highest-ranked 10% of objects amount to over 40% of the total number of stored objects and receive about half of the requests. With our default parameters, the most popular object is stored in more than 10% of the peers, while the least popular in only 0.25% of them.
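As one plausible way to generate such a workload, the sketch below draws object ranks from a discrete zipf distribution through its cumulative distribution function; the skew values follow Table 2.1, but the generator itself is only illustrative and is not taken from our simulator.

/* Zipf-distributed object ranks for the query (a = 0.9) and replication
 * (a = 0.82) workloads over 100 objects. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define OBJECTS 100

/* Build the CDF of P(rank = i) proportional to 1 / i^a, i = 1..OBJECTS. */
static void zipf_cdf(double a, double cdf[OBJECTS + 1]) {
    double norm = 0.0, acc = 0.0;
    for (int i = 1; i <= OBJECTS; i++) norm += 1.0 / pow((double)i, a);
    for (int i = 1; i <= OBJECTS; i++) {
        acc += (1.0 / pow((double)i, a)) / norm;
        cdf[i] = acc;
    }
}

/* Inverse-transform sampling: return a rank in [1, OBJECTS]. */
static int zipf_rank(const double cdf[OBJECTS + 1]) {
    double u = (double)rand() / RAND_MAX;
    for (int i = 1; i <= OBJECTS; i++)
        if (u <= cdf[i]) return i;
    return OBJECTS;
}

int main(void) {
    double qcdf[OBJECTS + 1];
    zipf_cdf(0.9, qcdf);                      /* query skew from Table 2.1 */

    int hits[OBJECTS + 1] = { 0 };
    for (int q = 0; q < 100000; q++) hits[zipf_rank(qcdf)]++;
    printf("requests for the most popular object: %d, for the least popular: %d\n",
           hits[1], hits[OBJECTS]);
    return 0;
}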
Table 2.1 summarizes our simulation parameters and their default values. In the figures that follow, the label "APS" is used when all variations of our method have very similar performance in a particular metric. If the results were taken under either of the two dynamic settings, this is shown in parentheses.

Table 2.1: Simulation parameters and their default values

    Simulation Parameters                      Default Values
    Number of Nodes (N)                        10000
    Graph/P2P model                            Random/Pure
    Average node out-degree (d)                9
    Walkers deployed (k)                       12
    TTL                                        5
    Replication Distribution                   Zipf (a = 0.82)
    Query Distribution                         Zipf (a = 0.9)
    Number of Requester Nodes                  1000
    Number of Queries per Requester Node       3162
    Reinforcement Parameter (γ)                0.3

2.3.1 Comparing the Index Update Functions

Previously, we described that our index update strategy increases or decreases index values along the walkers' paths in order to direct future searches. One can identify a variety of strategies to achieve this. Clearly, not every function is equally efficient in achieving fast learning of paths or in redirecting walkers after objects relocate. In our first set of simulations, we examine the behavior of several index update functions. We take several 20-node connected parts of our main graphs and make queries originating from a single node each time. We consider two settings: In the first one, a single object exists 2 hops away from the initiator. After 10 requests, it gets deleted and relocated at another node (on a completely different path) 4 hops away. In the second setting, we instead place the item at a different peer 2 hops away.
Note that this is more challenging than simply removing the node with the initial replica. Keeping that node active forces APS to consider it for future requests with the already accumulated index knowledge. We monitor the accuracy achieved by several functions after the deletion and present the results for five of them in Figures 2.4 through 2.13 (for different values of k). The accuracy achieved just before the deletion is shown in parentheses. We evaluate a function with no negative reinforcement (no unlearning), one with "temporal" decay (a small negative reinforcement at each time step), a flat update function (indices change by a set value each time, see Figure 2.1), a linear function (the amount of change is a linear function of the current index value) and a weighted function (as described for the w-APS method).

Figures 2.4 through 2.13 plot the success rate against the number of requests (for k = 1, 2, 3): Figures 2.4/2.5 show no unlearning in the first/second setting, Figures 2.6/2.7 learning with decay, Figures 2.8/2.9 the standard update, Figures 2.10/2.11 the linear update and Figures 2.12/2.13 the distance-based update.

We notice that all functions "learn" with more queries, although they do so with varying speed. Trying to learn the location of an object 4 hops away is harder than finding a new one 2 hops away, as we would expect. Utilizing more walkers mitigates this problem, as more resources are then available for exploring the network. The results for k ≥ 2 do not differ significantly, since in this experiment we deal with only two replicas. The linear method clearly performs better when k = 1, in both accuracy and fast unlearning. When k > 1, the standard, linear and weighted update schemes perform similarly. Methods with negligible or no negative reinforcement show worse performance.

Figure 2.14: Percentage of times the closest object is discovered, for the various index update methods.
These results show that both learning and unlearning are necessary: The linear function increases its accuracy faster to match the initial success rate. Our observations show that unlearning is more effective if the amount of index decrease is proportional to the index value. Similarly, the w-APS scheme proves more effective when the positive reinforcement is proportional to γ^h. Obviously, the rate at which nodes (and therefore paths to objects) depart affects the efficiency of the unlearning process. Also, these experiments do not take into account the interaction between different queries, which would enable the failure or success of previous queries to be considered by the current search.

In a similar experiment for the w-APS method, we monitor the percentage of hits for each replica, having only 2 of them at distances of 2 and 4 hops away from the requester respectively (k = 1, Figure 2.14). The un-weighted functions find the nearest object about 45% to 60% of the time. A function with the amount of increase proportional to γ^h and γ = 0.3 discovers the nearest replica with increasing frequency (i.e., over 95% of the time). This function will be used to evaluate the w-APS version. However, for larger k this advantage diminishes, since the walkers quickly establish paths to objects and search for alternative ones. Indeed, for k ≥ 2, our experiment shows that the method converges fast to an equal discovery ratio for the two objects.

2.3.2 Basic Performance Analysis

For the default graph, our simulations show that the standard flooding scheme with TTL=4 is successful in over 99% of its searches, while producing over 9,000 messages per query. These values are well-known, but are mentioned here for direct comparison with the Random Walks and APS algorithms. In the following figures, if one or more of our algorithm's variations are compared, they are specifically mentioned by name (e.g., w-APS). The label "APS" is used to denote the s-APS version of the protocol with the linear index update scheme.

In our basic set of simulations, we try to validate the analysis of Section 2.2.2. We vary the number of walkers deployed (k) from 1 to 15 for the default parameters and test the two algorithms in all three settings. Figure 2.15 presents the detailed comparison on the three important metrics (accuracy, hits and messages per query). Random Walks exhibits a low success rate (below 50%) as a result of its nature. Moreover, it barely averages one hit per query in the dynamic settings, even with many walkers. Its message production is reduced during the dynamic runs (mainly when more walkers are utilized), since some of the unsuccessful paths inside the network are cut short by the departures of nodes. The performance decrease is relatively small though, as walkers are not directed according to object locations but randomly across the network.
Figure 2.15: Success rate, message production, number of hits and number of duplicate messages of the two methods vs. the number of deployed walkers in the three different settings.

On the other hand, APS achieves high-quality results in all these metrics. APS manages to maintain high levels of robustness for a variety of reasons: In the static environments, the learning process achieves fast direction of walkers towards objects. This is achieved with increasing accuracy as more queries are collected in the system. Nodes utilize indices built by all their neighbors or even by other nearby peers.

In the dynamic environments, two things that affect a search may happen: First, objects may be removed and/or inserted at different locations. Second, peers may disconnect from the system, disrupting established paths. Because APS query forwarding is a probabilistic process, the nodes with the largest values do not necessarily get chosen. Thus, no peers are excluded because of a low probability, enabling recovery from bad choices during query routing. Moreover, our algorithm performs unlearning (negative reinforcement), which enables walkers to be redirected if previously discovered objects are found missing. Finally, the probability of query failure is greatly reduced by the use of a large number of walkers. This achieves both exploitation of high-index paths and exploration of less accurate neighbors in order to determine new object locations. Changes in topology or object locations must simultaneously affect all successful paths in order for a failure to occur. The metric we expect to be noticeably affected is the number of hits per search, as some paths to discovered objects frequently "disappear".

We can see that APS achieves very high success rates (about 40% more accurate than Random Walks) even with few deployed walkers. As predicted above, the accuracy is not greatly influenced by node departures. For the less dynamic run, the amount of decrease is almost zero, while it remains within only 5% for relatively large (k ≥ 8) values. One would expect our method to produce a much larger number of messages compared to Random Walks due to the update process, but this is not the case, as the majority of walkers in APS are successful and only few of them travel TTL hops away. In Random Walks, about 70% of the walkers fail and travel TTL hops each. A lesser factor is that objects are discovered roughly equally at all possible distances by the random method, while our scheme discovers more objects closer to the requesters. The results confirm our case: Using only the pessimistic approach, APS produces around 15 more messages per search than Random Walks. This proves that a single update policy is not suitable for all ranges of requests. The s-APS improvement has the same very low message production as the random algorithm. This effect is enhanced if we recall that no message exchange is necessary for peer join/leave/update operations.
Only in the highly dynamic setting do we see an increase in the average production, which is at most 5 to 7 messages per search. This gap appears because of the frequently broken paths to objects, which cause walkers to travel further inside the network.

Moreover, APS puts the walkers to much better use, discovering around 4 times as many objects as the competing method. This is extremely important for current popular P2P applications, giving the user a much broader choice for download. This characteristic comes as a result of its high success rate and the minimization of walker collisions (two walkers that cross paths, forcing one of them to fail). In the dynamic settings, the maximum reduction in the number of hits is around 25% and 40% for the less dynamic and more dynamic runs respectively. These numbers occur for large values of k, where the probability of node departures affecting the walkers increases.

The last graph of Figure 2.15 displays the vast reduction that APS achieves in the number of duplicate messages. These occurrences are considered failure states for our walkers, therefore the learning process makes adjustments in order to minimize them. Our method constantly outperforms Random Walks, producing 1 to 2 orders of magnitude fewer duplicate messages. This is also important because it increases the useful processing time of each peer. The weighted approach exhibits almost 20% fewer duplicate messages than our default method.

To demonstrate how APS increases its accuracy as more queries enter the system, we vary the number of requests per object on the default graph, using a uniform replication ratio of 0.2% and 1%. The results are presented in Figure 2.16. We can see that the accuracy of our method improves significantly with only a small increase in requests. For replication ratios greater than or equal to 2%, our method exhibits almost perfect results. It is noteworthy that even for the rarest of objects, APS manages to build paths leading to them through learning and cooperation. At the same time, Random Walks stays steadily below 40% and 10% respectively, regardless of the number of requests.

Figure 2.16: Success rate vs. number of requests per object.

2.3.3 Discovered Objects vs. Distance from Requesters

Figure 2.17 shows how the hits are distributed over their distance from the requesters, for the default parameters in the static setting. While Random Walks discovers about the same number of objects throughout the 1-to-TTL range, APS makes an effort to discover closer ones. It displays a symmetric curve, finding the most objects 3 hops away from the requesters. The reason for this is its learning feature, which promptly locates the closest objects (one and two hops away). The rest of the walkers are directed towards more distant content. Such objects exist in larger quantity (since the number of nodes increases exponentially with distance) but are less easily accessible (more paths, walker collisions, etc.). The results for the dynamic settings are similar, the only difference being the reduction in hits we mentioned.

Figure 2.17: Hits per query vs. hop distance (static setting).
Figure 2.18: Hits per query vs. hop distance (more dynamic setting).
We have also noticed that our algorithm becomes more biased towards discovering nearby objects as the number of replicas inside the network increases. This happens because the walkers have a broader selection of paths to objects and can, therefore, choose the shortest one.

The w-APS technique marginally improves on the s-APS performance by locating a small amount of extra content two and three hops away. While one would expect the weighted version to locate considerably more objects closer to the requesters, this is not the case: As k increases, paths to the nearest replicas are exploited by both methods. Furthermore, as Figure 2.18 shows, the un-weighted update method shows superior performance under dynamic environments, exhibiting both higher accuracy and a higher hit count compared to the weighted index update method. Nevertheless, given a different setting, one can notice the difference in performance: By applying a uniform 10% replication ratio (to allow for more choice) and 100 requests per object, we measure the ratio of objects discovered at different distances by the weighted and un-weighted versions of APS in Figures 2.19 and 2.20. As k increases, the difference between the steeper w-APS curve and the flatter s-APS one diminishes. For larger values of k, the two schemes almost coincide (see Figure 2.17).

Figure 2.19: Ratio of hits per query vs. hop distance for w-APS.
Figure 2.20: Ratio of hits per query vs. hop distance for s-APS.

2.3.4 Effect of Object Popularity

Next, we analyze the behavior of our scheme's index values. APS is an inherently adaptive search algorithm, whose power lies in the use of the local indices. For the next experiment, we choose only one node from our default graph with degree 12 and examine how its local indices change. We make requests for 10 objects, with object 1 being the most popular and object 10 the least popular. Replication and request distributions take their default values. Figure 2.21 displays the number of high-valued indices at that node for all 10 objects. Object popularity decreases from left to right on the x-axis. We monitor indices with large values (more than 20 hits) and indices that have a fairly large value (more than 5 but fewer than 20 hits). We notice that many indices with large values exist for the very popular objects, while this number decreases as popularity drops. Still, some indices with a relatively large value always exist for less popular objects. APS exhibits high precision for very popular objects, building up its "confidence" through large index values. On the other hand, the few fairly large indices for unpopular objects point out the algorithm's ability to locate them with good probability.

Figure 2.21: Distribution of index values according to object popularity for one peer and 10 objects.
Figure 2.22: Distribution of index values versus object popularity in our default setting.

Figure 2.22 presents how the entirety of index values changes in our default setting. The majority of the O(N·d) indices are not used, since only 10% of the nodes are requesters.
We notice that high-valued indices exist mostly for the 20% most popular items and that medium-valued ones are prominent roughly between the top 20% and 40% of objects. Nevertheless, some exist even for the least replicated content, giving a few of APS's walkers viable paths to discovery.

We conclude our analysis of object popularity and APS with the results of Figure 2.23. We show the success rates for individual objects grouped according to their popularity, using all default parameters (in the non-dynamic setting). Popularity decreases from left to right on the x-axis. APS shows almost perfect results for popular objects and displays a "graceful" decline for unpopular requests, while w-APS slightly improves on this for unpopular requests. On the other hand, Random Walks' accuracy drops significantly after requests for the highest-ranked 10% of objects, reaching a mere 11% for the least popular objects.

Figure 2.23: Individual success rate vs. object popularity.

2.3.5 Results for Different Topologies

In this section, we compare s-APS with Random Walks over four different graphs: the default one, a 10,000-node random graph with d = 4 (similar to Gnutella-type graphs), a 50,000-node random graph with d = 10 and a 10,000-node power-law (PLAW) graph with d = 4.4. Table 2.2 presents the two algorithms' performance in the highly dynamic setting, with the respective results from the static runs in parentheses.

Table 2.2: Results for more environments (static-run results in parentheses)

                              s-APS                                 Random Walks
    Graph (Distribution)      Succ%        Mesg         Hits         Succ%        Mesg         Hits
    10K-Rand (d=10, Unif)     94.1 (96.1)  58.5 (53.5)  4.3  (7.2)   32.3 (38.2)  41.8 (49.6)  0.4 (0.5)
    10K-Rand (d=4, Zipf)      70.0 (82.2)  17.3 (18.2)  1.4  (2.25)  26.0 (34.5)  12.0 (15.0)  0.3 (0.5)
    50K-Rand (d=10, Zipf)     79.3 (87.6)  48.4 (47.0)  2.4  (5.7)   55.6 (57.6)  39.5 (45.7)  1.3 (1.4)
    10K-PLAW (d=4.4, Zipf)    67.6 (76.1)  13.0 (14.9)  1.11 (1.76)  21.0 (31.6)  9.0  (12.0)  0.3 (0.5)

First, we test the methods using a uniform distribution for both requests and storage in the default graph. The replication ratio for each object is set to 1% and each object receives 30 queries from each requester node. We clearly notice that s-APS greatly benefits from such a setup, delivering over 94% in success rate (a mere 2% decrease from the static run) and discovering more than 10 times more objects than Random Walks. On a similar graph with a smaller out-degree and k = 5, s-APS is still 40% to 50% more accurate, 5 times more effective in locating objects and almost as bandwidth-efficient as the random method. The results are worse compared to the default graph because of the smaller out-degree and the fewer walkers used.

Our simulations on the 50,000-node random graph justify our prediction that the graph size does not influence the performance of APS. The results were slightly worse than those for the original graph because the quality of the new graph was worse (many more disconnected components were present). We notice that the success rate is about 8% lower than in the static case, while the number of discovered objects is almost halved. Our results on the 10,000-node power-law graph show an even greater gap in the performance of the two algorithms. Our method delivers about 4 times more results and exhibits a success rate three times higher than Random Walks'. The success rate for s-APS drops by around 9% and the discovered objects decrease by 37%, while message production slightly decreases.
In these simulations, our method kept its message production at the same levels as in the static runs, adding at most 5 extra messages per search, direct proof that it does not impose additional burden on network traffic. As expected, the success rate shows only a small decrease, ranging from 2% to 12%. These results also show that our method maintains its relative performance gains over the different environments.

2.3.6 Comparison with GUESS

Lastly, we present results comparing s-APS with an implementation of GUESS [27] on a random hybrid graph with 6,500 peers, 500 of them being super-peers (or ultrapeers in GUESS). Each ultrapeer is connected to 12 leaf nodes on average. Links exist only between ultrapeers and between an ultrapeer and its leaf nodes. In our GUESS implementation, initiating ultrapeers forward queries to k randomly chosen neighbor ultrapeers. Query and object placement distributions are set to their default values. Since it is impossible to directly compare the two methods for the same k and TTL values, we select simulations where the two algorithms had similar performance in one of two important metrics: messages and hits per query. The results are presented in Table 2.3; the metric used for each comparison labels the corresponding group of rows. For similar message consumption, our scheme exhibits higher success rates and delivers 4 to 5 times more results. For similar hits per search, our scheme produces 4 to 5 times fewer messages and always outperforms GUESS in accuracy. APS achieves these results while taking no advantage of the hybrid topology that GUESS utilizes.

Table 2.3: Comparison with GUESS

                           s-APS                       GUESS
    Comparison metric      Succ%   Mesg   Hits         Succ%   Mesg   Hits
    Messages               97.7    16.3    5.22        63.9    16.1   1.28
                           98.6    22.0    7.01        65.6    22.2   1.87
                           99.7    33.2   11.39        84.0    33.1   2.55
    Hits                   81.0     3.2    1.33        63.9    16.1   1.28
                           94.6     8.7    3.42        86.4    45.0   3.70
                           97.9    16.5    5.42        94.5    65.1   5.60

2.4 Summary

APS deploys probabilistically directed walkers by utilizing information from past searches regarding their success or failure. This allows for fast learning with low message consumption. Peers are required to keep indices only relative to their neighbors, while no message exchange is necessary for any dynamic network event, local or global. Our results show that APS is effective while being almost as bandwidth-efficient as Random Walks. It discovers 4 times as many objects and delivers very high success rates compared to the Random Walks and GUESS methods, maintaining these features in dynamic environments. Appendix A contains a direct performance comparison between APS, Random Walks, GUESS and six more representative schemes described in the related work section.

Chapter 3
Content Dissemination to Groups of Peers: AGNO

3.1 Overview

Mass communication is defined as the process of distributing data to a large number of people at the same time. The importance and applications of group communication schemes in computer networks, and in distributed systems in particular, have been well established in past and recent research work (e.g., [34–36]). A multicast transmission is defined as the dissemination of information to several hosts within a network. These hosts are interested in receiving the same content from an authority node (such as a web server) and naturally form a group. The lack of deployment of multicast communication in the IP layer has led to the development of various application-level multicast protocols, in which the end hosts are responsible for implementing this functionality.
One-to-many communication is a very useful mechanism for a variety of network applications (e.g., [37, 38]). As the applications that embrace the P2P paradigm grow, a number of methods have also been proposed to implement multicast communication over some popular P2P overlays (e.g., [34, 35, 39, 40]). Nevertheless, these approaches take advantage of the structure that DHTs provide. As we mentioned before, there exist many realistic scenarios where the topology cannot be controlled and thus DHTs cannot be used (e.g., ad-hoc networks or existing large-scale unstructured overlays). Explicit group formation schemes require frequent communication overhead between group members. Nodes must go through a subscription process by contacting a special node and announcing their intent to receive/transmit/forward group messages. These techniques often prove unsuitable because of the traffic generated for large and dynamically changing group populations.

In the area of unstructured P2P overlays, contacting large numbers of nodes is implemented by either broadcast-based schemes (e.g., Gnutella [6], Modified-BFS [22]) or gossip-based approaches (e.g., [36, 41, 42]). Both produce large numbers of messages by contacting many peers inside the network. Our work aims at providing peers in dynamic, unstructured environments with an effective yet inexpensive mechanism to disseminate content-related information to groups of nodes interested in their content. Specifically, we intend to provide a scheme that is:

- Efficient: It should be able to contact a high percentage of interested peers with low message overhead.
- Scalable: The scheme should be able to scale to very large group sizes (thousands of peers).
- Robust: We would like to avoid the necessity of a single point of contact or group leader, as well as the burden of costly message exchanges in case of member arrivals and departures.
- Adaptive: It should adapt to changes in the group size and to dynamic workloads.

We assume a fully distributed and unstructured system, where peers share and request resources replicated inside the network. Users are interested in objects with changing content, such as the results of a sports meeting in real time, temperature readings, weather maps, stock quotes, security updates, etc. There exist some nodes (similar to web servers or mirror sites in the Internet) that provide fresh content, but their connectivity or availability varies, as happens with all other network nodes. Peers that are interested in retrieving the newest version of the content conduct searches for it in order to locate a fresh or closer replica. In this environment, interest in a specific object is tied to the lookups generated for it. We argue for a push-based approach, where a server node forwards notifications (or other object-specific information) towards the interested hosts. Our assumption is that peers which have recently searched for or retrieved an object would also be interested in receiving important updates about it. For example, it is safe to assume that a host frequently querying for the price of a quote or the temperature of an area would like to be informed about an update or another object-related notification. It is important to note here that peers still search for and retrieve objects in a distributed manner. The notification itself may or may not be directly related to a specific object: A severe weather alert taking effect in the next 3 hours is not related to the current area temperature.
A change in the scores or in quote prices, on the other hand, is directly linked to the content of the object. Group communication (especially for large groups) requires a considerable amount of bandwidth. Content providers can assess the importance of various updates/notifications and choose to push those that would be the most beneficial.

On a more technical note, the forwarding path between any two given peers in a DHT remains the same with high probability. This is a feature that many approaches utilize in order to construct efficient multicast paths. This is not the case for unstructured P2P networks: Peers have multiple (and dynamically changing) communication paths to each other. Therefore, a notification scheme for such networks can also be used to simulate that functionality and identify reverse paths from the destination (the location of an object) back to the requesters. This information can in turn be used in a variety of problems (e.g., to assist in dynamic replication, see Chapter 4).

In this chapter, we present the Adaptive Group Notification (AGNO) method. Our approach combines the utilization of state accumulated during the search process with probabilistically stored shortcuts. The first indicates the amount of demand for a specific object and can be used to infer membership and guide the dissemination of updates on a hop-by-hop basis. By also allowing peers to locally store a constant number of requester addresses (called backpointers), we show that AGNO achieves robust, scalable behavior in a variety of environments and group sizes. Our method utilizes a simple binning scheme as well as adaptive index aging to adjust its performance to different workloads and member joins/leaves. AGNO does not require any global knowledge, the existence of a special contact node, or any membership message exchange. It builds its knowledge by exclusively monitoring the independently conducted lookups. Finally, its performance can be easily tuned to fit specific application requirements.

3.2 AGNO Protocol Description

A multicast transmission (also referred to as the notification or push phase hereafter) in our setting is initiated by a content-holding peer (or server) and its target is to contact as many "group" members (i.e., requester nodes) as possible with the least amount of overlay messages. The focus of this work is to describe an efficient mechanism for such transmissions and not to define their content. The message relayed during the push phase will be referred to as a notification or push message and always relates to a specific object that is shared in the network.

The rationale behind AGNO relates to the observation that efficient group communication comes at a cost. In current approaches, this cost is paid by either a membership management protocol or an overlay infrastructure. Our goal is to provide the missing state that can allow for content dissemination to a group of peers, but in a way consistent with the nature of an unstructured P2P system. In AGNO, the equivalent of group membership is the demand for an object (or a collection of objects), realized through the searches and object sharing that are independently conducted by peers. The granularity can be as coarse or as fine-grained as the application requires. For the remainder of our discussion, we assume a per-object level of granularity.

After each search using the APS algorithm, peers accumulate knowledge about the relative success of a search through each of their neighbors.
Intuitively, the overlay paths that comprise high index values are the ones most frequently used to connect requesters and object holders. In AGNO, nodes utilize this information in order to forward group messages towards possible group members during the push phase. Note here that, although we utilize the APS method as a means to provide this state, our approach can be used with a variety of search mechanisms, as long as they support a similar demand incentive.

We now describe the nature of the index values that are stored at each peer. APS keeps a local view (an index value) for each neighbor. For AGNO, each peer P needs to maintain the index values that P's neighbors hold relative to P. Let X_i→Y denote the APS index value stored at node X for neighbor Y and object i. Then, peer P must know X_i→P for each neighbor X (see Figure 3.1).

Figure 3.1: Graphic explanation of AGNO reverse indices. The filled table represents the reverse index values stored at node A, which coincide with the APS index values that nodes B, C, D, E store regarding A.

These values can be made known to P either implicitly or explicitly: In the first case, peer P can infer the index X_i→P if it knows the update process used (optimistic or pessimistic) and its initial value. In the explicit approach, whenever a search for object i is conducted and X forwards to P, it piggybacks X_i→P. We call these newly stored values the reverse indices, to distinguish them from the indices used by APS in searches. For the rest of our discussion, we assume that the explicit approach is used.

Reverse indices are not the only state that our method utilizes. During a search, intermediate nodes decide with probability p_r whether or not to cache the requester's address. Thus, for a search path h hops long, the address will be stored at h·p_r peers on average. With this scheme, we create a number of soft-state shortcuts, called backpointers, along the search paths which point to group members. Each peer can individually decide on the maximum number of backpointers stored. For simplicity, we assume that all nodes can store a maximum of c backpointer values. Backpointers are soft state that gets invalidated after some amount of time.

Notifications are issued by peers that (authoritatively) serve objects. They are of the form (nodeID, nodeIP, notificationID, objectID, TTL, content), where (nodeID, nodeIP) is the server's identifier and IP address, notificationID is a unique identifier for each push message generated by nodeID (to eliminate duplicate receptions), TTL is the maximum distance the message is allowed to travel and content holds the actual content of the notification that refers to object objectID. During the push phase, peers issuing or receiving a notification forward it to their neighbors using the reverse index values. We consider the following forwarding schemes:

- Probabilistically forward to k ≥ 1 neighbors using the reverse indices, or forward to those with the top-k values
- Forward to all neighbors with a reverse index value larger than a defined threshold

Moreover, a peer sends the push message directly to each of its valid backpointers with probability p_n. These messages have a TTL of 1 and do not travel further. Whenever an overlay link is crossed, the TTL field is decremented. A push message is discarded either when its TTL value reaches zero or when it is received more than once due to a cycle. Therefore, our scheme combines a selective, BFS-like forwarding augmented with shortcuts in order to contact the group members; a sketch of a single forwarding step is given below. This is shown pictorially in Figure 3.2.

Figure 3.2: The black nodes search for an object stored at node s (left). On the right, s initiates a push phase in order to contact the requesters.
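The minimal C sketch below illustrates one such forwarding step, assuming the threshold-based neighbor selection and a per-backpointer probability p_n; the struct fields and helper names are illustrative, and only the notification tuple follows the format given above.

/* One AGNO push-forwarding step at a peer: relay the notification to every
 * neighbor whose reverse index exceeds the local threshold, and to each
 * cached backpointer with probability pn. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NEIGHBORS 16
#define MAX_BACKPTRS   4

struct notification {
    int server_id;        /* nodeID of the issuing server              */
    int notification_id;  /* unique per server, used to drop duplicates */
    int object_id;
    int ttl;              /* decremented on every overlay hop           */
    const char *content;
};

struct peer_state {
    int    n_neighbors;
    int    neighbor_id[MAX_NEIGHBORS];
    double reverse_index[MAX_NEIGHBORS];  /* aged before use            */
    int    n_backptrs;
    int    backptr_id[MAX_BACKPTRS];      /* soft-state requester IDs   */
};

static void send_to(int dest, const struct notification *m) {
    printf("  -> node %d (notification %d, TTL %d)\n",
           dest, m->notification_id, m->ttl);
}

/* thresh and pn come from the local binning scheme of Section 3.2.1. */
static void push_forward(const struct peer_state *p, struct notification m,
                         double thresh, double pn) {
    if (m.ttl <= 0) return;                      /* hop budget exhausted  */
    m.ttl--;                                     /* crossing the next link */
    for (int i = 0; i < p->n_neighbors; i++)     /* threshold-based relay  */
        if (p->reverse_index[i] > thresh)
            send_to(p->neighbor_id[i], &m);
    struct notification direct = m;
    direct.ttl = 1;                              /* backpointer messages   */
    for (int i = 0; i < p->n_backptrs; i++)      /* do not travel further  */
        if ((double)rand() / RAND_MAX < pn)
            send_to(p->backptr_id[i], &direct);
}

int main(void) {
    struct peer_state p = { 3, { 11, 12, 13 }, { 120.0, 35.0, 80.0 },
                            2, { 21, 22 } };
    struct notification m = { 1, 42, 7, 3, "object 7 was updated" };
    push_forward(&p, m, 60.0, 0.4);
    return 0;
}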
We now discuss how the aforementioned state is maintained at each peer. The backpointer values expire after a certain amount of time. Since our incentive to push a message is the demand on a per-object basis, new backpointers replace the oldest valid ones (if a node already has c valid backpointers). As searches take place inside the system, the backpointer repositories get updated, while the probabilistic fashion in which they are stored guarantees a diverse collection of (ID, address) pairs.

Reverse indices get updated during searches, but this is not enough: There may be peers that have searched for an object and built large index values, but are no longer interested in receiving notifications (i.e., they have stopped querying for that object). If searches are no longer routed through those peers, the reverse index values (which reflect APS indices) will not be updated and will remain high. To correct this situation, we add an aging factor α to the reverse indices, which forces their values to decrease with time. Peers need to keep track of the time that a reverse index was last updated in order to compute its correct value before using it. When a peer receives a search message, it sets the corresponding reverse index to the piggybacked value and its last-modified field to the time of receipt. Figure 3.3 shows how this process works: The value of the index decreases exponentially, while two searches at times t1, t2 reset its value; a push message received at time t3 will use the value as shown in the figure. The last-modified value is also reset when a reverse index is used, since a peer computes its current value before using it.

Figure 3.3: Example of the computation of a reverse index value.

Obviously, a fixed value for α performs suboptimal aging, by either reducing the reverse indices too much or by failing to reduce them enough for the push phase to prune out disinterested peers. The next section describes in more detail how our protocol computes the parameters described above.

3.2.1 Protocol Specifics

1) Space Requirements: The amount of space required by a peer is O(2d + 2c) per object, where d is the average node degree in the overlay and c is the maximum number of backpointers stored. Each peer stores, per object, one reverse index value with its modification time for each neighbor and one backpointer with its creation time. Even if nodes want to keep track of large numbers of objects, the space requirements are in the order of a few tens of megabytes, definitely affordable for the vast majority of modern hosts (with typical 1GB main memory configurations). For about 1 million objects, assuming c = d = 4, each peer would need approximately 64MB of memory for AGNO.

2) Forwarding: Nodes use a threshold parameter Thresh in order to choose the neighbors to which a notification will be forwarded. Neither the probabilistic nor the top-k value scheme is suitable, as they fail in certain cases. Consider, for example, a peer with very low values for all its neighbors. Thresholding enables peers to forward to the most "promising" (active in searches) parts of the overlay.
A good first approximation is for each peer to use the average of all its neighbors' indices as Thresh. Nevertheless, both the average and the median values also fail in various circumstances (e.g., when all indices have very similar low or high values). Thus, we have to identify a value for Thresh that enables more high-quality indices to be selected and fewer (or none) of the low-quality ones.

3) Local Threshold Computation: After each peer computes the average of its neighbors' reverse index values ⟨RIV⟩_t at time t, it uses a system-wide binning scheme to come up with the actual value for Thresh. The binning method divides the space of reverse index values into a set number of bins, {Bin_i = ([a_i, b_i), Thresh_i)}. Bin_i is characterized by its lower and upper limit values a_i, b_i (with a_0 < b_0 = a_1 < b_1 = a_2 ...) and a Thresh_i value. The final threshold value is Thresh = Thresh_i if ⟨RIV⟩_t ∈ [a_i, b_i). For example, assume we use a 2-bin scheme, {Bin_0 = ([0, 50), 40), Bin_1 = ([50, ∞), 100)}. If ⟨RIV⟩_t = 75, that node will forward to all neighbors with a reverse index value over 100.

Bins represent an approximation that maps reverse indices to a value representing their quality. Higher-numbered bins represent higher-quality indices. The Thresh_i values are chosen such that Thresh_{i−1} − b_{i−2} > Thresh_i − b_{i−1} and Thresh_{i−1} < Thresh_i, where we assume that b_{−1} = a_0. For small i values we should pick few neighbors (therefore a threshold that is high relative to the bin's interval), while for large i (i.e., high-quality bins) most of the neighbors need to be chosen. Note that we do not require Thresh_i to belong to [a_i, b_i), nor do we require that b_i − a_i = b_j − a_j for i ≠ j. As a simple heuristic that produces good results for selecting the Thresh_i values, given Thresh_0 for bin [a_0, b_0), we set Thresh_i = Thresh_{i−1} + (b_{i−1} − b_{i−2})/2. Figure 3.4 gives a graphic description of our binning scheme. Its granularity, controlled by the number of defined bins, can be as fine-grained or as coarse as our application requires.

Figure 3.4: Sample binning scheme with the respective Thresh_i and Thresh_i − b_{i−1} values.

4) Reverse Index Aging: APS updates its index values after either a success or a failure, achieving learning in both situations. This is very important for AGNO as well: Peers that lose interest in an object should be left out of the push phase as quickly as possible. Our scheme uses the aging factor α together with the last-modified time of each reverse index to reduce the influence of inactive ones. Assuming index P→Q was last modified at time t_last, its value at time t ≥ t_last is P→Q(t) = (1−α)^(t−t_last) · P→Q(t_last), where α ∈ [0,1]. For α = 0.2, a reverse index value will be at 80% of its last-modified value after one time unit.

The value of α dictates how aggressive our aging will be. It depends on the rate at which requests occur (and therefore index updates): The larger the rate of searches λ_r, the more aggressive the aging can be. Nevertheless, it is still application-dependent, since the rate λ_n at which notifications occur (or even their content) largely affects the aging factor. For example, in sharing stock market data, it can be assumed that a user is always interested in her portfolio for the duration of the peer's online time. For the remainder of this chapter, we assume that peers use the same value for α, which satisfies the inequality:

(1−α)^T · max_reduced_Thresh < min_i(Thresh_i)    (1)

In effect, we pick α such that any reverse index with a value less than or equal to max_reduced_Thresh will be reduced below the lowest threshold (and thus will not be selected) if it is not used for T time steps (T is defined as our "tolerance" parameter). The maximum Thresh_i represents the minimum high-quality index value, as defined by our binning scheme. Therefore, by setting max_reduced_Thresh = max_i(Thresh_i), we choose α such that all reverse indices up to that level of quality are discarded after a period of time T without getting updated. Choosing larger values results in more aggressive aging; the same is true for choosing smaller T values. Assuming that, in the vast majority of cases, notifications are considerably less frequent than requests, we set T = O(1/λ_r), which defines the tolerance interval to be in the order of the average request interarrival period. This is done in order to quickly identify and decrease idle indices in the overlay.
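The sketch below combines the aging rule with the bin lookup: reverse indices are decayed by a factor of (1−α) per time unit since their last update, their average selects a bin, and that bin's Thresh_i becomes the local forwarding threshold. The bin limits, Thresh_i values and α used here are example numbers, not values prescribed by AGNO.

/* Reverse-index aging and bin-based local threshold selection. */
#include <stdio.h>
#include <math.h>

#define BINS 3

/* Aged value of a reverse index last set to `value` at time t_last. */
static double aged_value(double value, double t_last, double now, double alpha) {
    return pow(1.0 - alpha, now - t_last) * value;
}

/* Illustrative system-wide binning scheme: Bin_i = ([a_i, b_i), Thresh_i). */
static const double bin_low[BINS]    = {   0.0,  50.0, 150.0 };
static const double bin_high[BINS]   = {  50.0, 150.0, 1e9   };
static const double bin_thresh[BINS] = {  40.0, 110.0, 180.0 };

/* Pick Thresh from the average of the (aged) neighbor reverse indices. */
static double local_threshold(const double riv[], int n,
                              const double t_last[], double now, double alpha) {
    double avg = 0.0;
    for (int i = 0; i < n; i++)
        avg += aged_value(riv[i], t_last[i], now, alpha) / n;
    for (int b = 0; b < BINS; b++)
        if (avg >= bin_low[b] && avg < bin_high[b]) return bin_thresh[b];
    return bin_thresh[BINS - 1];
}

int main(void) {
    double riv[3]    = { 200.0, 90.0, 30.0 };  /* reverse indices, 3 neighbors */
    double t_last[3] = { 10.0, 4.0, 1.0 };     /* their last-modified times    */
    double alpha = 0.2;                        /* 20% decay per time unit      */

    printf("aged index of neighbor 0 at t=12: %.1f\n",
           aged_value(riv[0], t_last[0], 12.0, alpha));
    printf("local Thresh at t=12: %.1f\n",
           local_threshold(riv, 3, t_last, 12.0, alpha));
    return 0;
}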
such that any reverse index with value less than or equal to max_reduced_Thresh will be reduced below the lowest threshold (and thus will not be selected) if not used for T time steps (T is defined as our "tolerance" parameter). The maximum Thresh_i represents the minimum high-quality index value, as this is defined by our binning scheme. Therefore, by setting max_reduced_Thresh = max_i(Thresh_i), we choose α such that all reverse indices up to that level of quality are discarded after a period of time T without getting updated. Choosing larger values for max_reduced_Thresh results in more aggressive aging. The same is true for choosing smaller T values. Assuming that, in the vast majority of cases, notifications are considerably less frequent than requests, we set T = O(1/λ_r), which defines the tolerance interval to be in the order of the average request interarrival period. This is done in order to quickly identify and decrease idle indices in the overlay.

5) Estimation of λ_r: In order for our scheme to work without requiring a priori knowledge of the request rate, but also to be able to adapt to changes in the workload, we need an effective yet inexpensive mechanism to estimate its value and compute the new α before each push. This value is then piggybacked downstream and used by all receiving nodes. In order to estimate λ_r, we need the zeroth and first frequency moments of the request sequence arriving at a server. F0 is the number of distinct IDs that appear in the sequence, while F1 is the length of the sequence (number of requests). Servers can easily monitor the number of incoming requests inside a time interval. Many efficient schemes to estimate F0 within a factor of 1 ± ε have been proposed (e.g., [43, 44]). We use one of the schemes in [43], which requires only O(1/ε² + log(m)) extra memory bits (at server nodes), where m is the number of distinct node IDs. In reality, m is in the order of the distinct peers within TTL hops from a server, since only these nodes can reach it: m ≈ d^TTL ⇒ log(m) ≈ TTL·log(d), which is usually very small. After each push phase, both estimates are reset and a new estimation cycle begins.

6) Backpointer Selection: Finally, we specify which backpointers are used by a node that receives a group notification message. Clearly, following the same number of backpointers at different peers and times is not efficient. Our method utilizes the local threshold computation to assist in the process of selecting valid backpointers. As we mentioned before, the threshold value is representative of the average quality of a peer's reverse indices (higher bins choose on average more neighbors to forward to). Given that a peer's threshold bin is i at time t, the probability with which each stored backpointer will be followed is pn_i, drawn from the set {pn_1, pn_2, ..., pn_i, ...} (i.e., one pn value for each bin). We choose those values such that pn_i > pn_j, ∀ i < j, since better-quality bins forward to more neighbors and need not waste more bandwidth on shortcuts. With this scheme, AGNO adaptively balances the amount of forwarded messages per peer between the shortcuts and the neighbors according to the current quality of its reverse indices.

7) Summary: AGNO is a probabilistic group notification scheme that integrates search indices with a constant amount of shortcuts to effectively route messages in an unstructured overlay. It utilizes a binning scheme to extract the right amount of useful information from each source, and an aging mechanism to gracefully adapt to member departures, requiring no explicit cooperation on their part.
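To make the above concrete, the following is a small consolidation of the two computations of this subsection: mapping the average reverse index value to its bin's Thresh_i, and deriving α from inequality (1) for a given tolerance T. It is an illustrative sketch under the stated definitions, not the dissertation's implementation; the 2-bin configuration is the example used earlier.

```c
#include <math.h>

typedef struct { double lo, hi, thresh; } bin_t;

/* Example 2-bin scheme from the text: {([0,50), 40), ([50,inf), 100)} */
static const bin_t bins[] = { { 0.0, 50.0, 40.0 }, { 50.0, 1e300, 100.0 } };
static const int   num_bins = 2;

/* Thresh = Thresh_i for the bin that contains the average reverse index. */
double thresh_from_avg(double avg_riv)
{
    for (int i = 0; i < num_bins; i++)
        if (avg_riv >= bins[i].lo && avg_riv < bins[i].hi)
            return bins[i].thresh;
    return bins[num_bins - 1].thresh;
}

/* Solve (1 - alpha)^T * max_reduced_Thresh = min_i(Thresh_i) for alpha,
 * so that indices of value up to max_reduced_Thresh fall below the lowest
 * threshold after T time steps without updates. */
double alpha_from_tolerance(double T, double min_thresh,
                            double max_reduced_thresh)
{
    return 1.0 - pow(min_thresh / max_reduced_thresh, 1.0 / T);
}
```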
3.3 Simulation Results

We use a message-level simulator written in C (about 2,100 lines of code) which runs on a Linux-based platform using an Athlon 2.1GHz processor and 1GB of main memory. Requesters make searches for objects using APS at rate λ_r (exponentially distributed interarrival times), while servers initiate push transmissions at rate λ_n. At each run, we randomly choose a single node that plays the role of a server and a number of requesters, also uniformly at random. Results are averaged over several tens of runs.

We present results for both random and power-law graphs. There has been strong evidence [45] that connects large-scale unstructured P2P networks to a power-law topology. We utilize the BRITE [46] and Inet-3.0 [33] topology generators to create the random and power-law graphs respectively. We consider 10K-node graphs with average node degrees around 4 (similar to Gnutella snapshots [45]). Results for graphs up to 50K nodes and larger average degrees are qualitatively similar. Finally, the following basic metrics are used to evaluate the performance of a scheme: the success notification rate (or success rate in brief), which is the ratio of contacted group members versus the total number of group nodes, and the bandwidth stress, which we define as the ratio of the produced messages over the minimum number of messages needed to contact all members.

AGNO Parameters: We choose to set c ≈ d, which reserves an amount of space for backpointers roughly equal to the average node degree. Potentially, for each of its neighbors, any peer can keep one backpointer address during a search. Ref. [45] shows that over 90% of the node pairs in Gnutella are around 5 hops away. Given this value as an estimate for the TTL parameter, we set p_r ≈ 1/TTL, so that on average one peer on the search path will store the requester's address. While we experimented with different distributions (e.g., favoring storage of backpointers for closer nodes), the results did not vary considerably from the uniform policy. Given the index update policy used by APS, we employ a simple 3-bin scheme. The first bin represents indices below the initial value (very few to no successes, pn_1 = 0.4), the second those with some hits (pn_2 = 0.15) and the last one those with more successes (pn_3 = 0.05). The pn values for the second and third bins are chosen deliberately low, since the values of the reverse indices are high enough and backpointers are less frequently useful. Finally, from equation (1) and setting T = 2·T_r (where T_r = 1/λ_r is the average request interarrival period), we have: α = 1 − 0.4^(0.5·λ_r). The value of λ_r (and therefore α) is estimated right before each server push using ε = 0.1.
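For reference, the simulation configuration above can be gathered in one place. The sketch below is illustrative only (the struct layout and names are assumptions, not the simulator's code); α follows from equation (1) with T = 2/λ_r.

```c
#include <math.h>

struct agno_config {
    int    c;        /* max backpointers per object (roughly the degree)  */
    double p_r;      /* prob. of storing a backpointer on a search hop    */
    double pn[3];    /* per-bin probability of following a backpointer    */
    double eps;      /* accuracy parameter of the F0 estimator            */
};

/* alpha = 1 - 0.4^(0.5 * lambda_r), from equation (1) with T = 2/lambda_r */
double agno_alpha(double lambda_r)
{
    return 1.0 - pow(0.4, 0.5 * lambda_r);
}

struct agno_config default_config(int avg_degree, int ttl)
{
    struct agno_config cfg = {
        .c   = avg_degree,            /* c ~ d                 */
        .p_r = 1.0 / ttl,             /* p_r ~ 1/TTL (TTL ~ 5) */
        .pn  = { 0.40, 0.15, 0.05 },  /* 3-bin scheme          */
        .eps = 0.1,
    };
    return cfg;
}
```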
Compared Methods: We compare our method against the SCAMP protocol [47], which defines explicit membership procedures, and the two rumor-spreading schemes in [41]: Rumor Mongering (RM) and its deterministic version (det-RM), where peers have complete topology information. All three schemes are gossip-based approaches for update dissemination/group communication in unstructured overlays. Furthermore, they do not require a single point of contact or frequent refresh messages, similar to AGNO.

In SCAMP, joining members subscribe by contacting a random existing member. Upon receiving a subscription request, a member forwards it to all the members in its local repository. Nodes decide probabilistically whether to store or forward the subscription. For the unsubscription process, a node notifies the locally known members to replace its ID with the IDs of the members it has received messages from. Group communication is performed in the standard gossip-based manner. SCAMP is shown to converge to a local state of slightly over log(n) member IDs, which guarantees with high probability that all members will receive a notification.

In [41], peers that have received a message fewer than F times forward it to B randomly selected neighbors, but only to those that the node knows have not yet received it. The deterministic version of that algorithm requires global knowledge of the overlay. Nodes forward messages to all neighbors with degree equal to 1, plus to the B remaining neighbors that have the smallest degrees. For SCAMP, we first run the membership phase, in which we favor the method by assuming joining peers know all already-joined members. The parameters for those three methods are the branching factor B, which represents how many other peers shall be contacted per forwarding step, and the seen value F, which represents how many times a peer can receive the same message before dropping it.

Finally, for demonstration purposes, we design and implement a pure shortcut selection scheme (Shortcuts) inspired by DHT-based multicast tree creation. Search packets carry the (ID, address) values of the last node along the path interested in the object so far. Initially, this pair contains the requester node's information. During the search, an interested peer that receives a search message decides with probability p_r whether to store the last member's ID or not. Moreover, it replaces this ID with its own before forwarding the request. With this scheme, we create a small sub-overlay of soft-state backpointers with direction from the object holders towards the group members. For simplicity, we assume the same maximum number of shortcuts as in AGNO. In the push phase, a peer forwards to all valid shortcuts, using the standard TTL scheme (unlike AGNO, where backpointers are contacted with a TTL = 1).

Figure 3.5: Success rate over a variable number of searches
Figure 3.6: Stress over a variable number of searches

Table 3.1: (Success rate, Stress) results for the remaining methods (500 requesters)
                  SCAMP         RM             det-RM
10K Random        (89%, 2.7)    (89%, 34.5)    (98%, 31.1)
10K Power-law     (68%, 2.1)    (27%, 13.6)    (65%, 10.8)

3.3.1 Basic Performance Analysis

In this first set of experiments, using a group of 500 requesters, we vary the number of lookups each of them makes before a single push phase occurs. We report the results averaged over sets of 10,000-node random and power-law topologies (d = 4 and d = 4.1 respectively). Figures 3.5 and 3.6 present the results for AGNO and Shortcuts, the two methods affected by the number of searches. We notice that the pure shortcut scheme cannot provide an efficient notification method by itself. AGNO quickly contacts the majority of requesters after only a few searches take place, while maintaining a low stress factor.
As our scheme creates better-quality indices, there exists a slight variation in the stress. This is due to the fact that, after a certain number of queries, peers switch to a different (higher) bin on average. In the power-law topologies, where about 34% of the peers have degree one, fewer paths are used compared to the random graphs. This, combined with the fact that α = 0 in these experiments, explains why the stress for AGNO slightly increases with more requests.

The respective results for the remaining methods (not affected by searches) are shown in Table 3.1. AGNO proves very accurate (in the big majority of runs) and also the most bandwidth-efficient of the compared methods. All three rumor-spreading schemes show considerably worse numbers in the power-law topologies. det-RM is much more effective than RM in such graphs, which is in accordance with the findings of [41].

Figure 3.7: Utilization of pure forwarding vs. backpointers

Figure 3.7 shows the percentage of contacted members and messages of AGNO purely attributed to forwarding (not backpointers). As we move from less to more precise reverse indices (from fewer to more queries), our method uses a decreasing number of backpointers. These results also depict the usefulness of the backpointer scheme, as for less accurate indices backpointers can provide over 50% of the contacted members.

Table 3.2: Effect of parameter c
         10 queries/member         20 queries/member
         success rate   stress     success rate   stress
c=1      68.7%          1.17       90.3%          1.16
c=2      73.5%          1.27       91.5%          1.20
c=4      77.9%          1.42       91.6%          1.23
c=8      79.6%          1.80       92.5%          1.37
c=16     81.2%          2.80       92.9%          1.49

Table 3.2 summarizes the effect that a change in the maximum number of stored backpointers (c) has on the performance of AGNO. We select two runs from the previous experiment, where each of the 500 members makes 10 or 20 queries in the random topologies. For 10 queries per requester, many peers fall into bins 1 and 2 on average, while the majority of nodes operate on bin 3 with twice as many queries. With fewer queries (and larger backpointer usage), the increase in the success rate over our selected c = 4 is very small compared to the increase in stress. As the indices get more accurate, the method becomes almost insensitive to the value of c.

Figure 3.8: Stress and success rate over variable group size

Next, we measure the scalability of our method with group sizes ranging from 10 to 2,000 peers using the random topologies. Requesters make only 10 searches on average, immediately followed by a single push phase from the server. For SCAMP, the membership protocol is run before each different group size. For RM, det-RM and SCAMP, we set B = 3, F = 1, which proves the best combination taking into consideration both the success rate and the stress metric.

Figure 3.9: Success rate over variable λ_r values (T_n = 10sec)
Figure 3.10: Stress values over variable λ_r values (T_n = 10sec)

Figure 3.8 presents the results. Our method is very successful in all group sizes, deteriorating only slightly as the members increase.
This happens because with more requesters, their average distance from the server increases (the number of peers reachable from a node increases exponentially with the hop distance). This makes APS searches (and its indices) less accurate for some requesters. The RM schemes produce a similar number of messages regardless of the group size, while the closest competitor (SCAMP) has roughly twice the stress value of AGNO, without including the overhead of the membership phase. Our method manages to contact a very high percentage of the members (86-99.5%) using an almost constant message ratio over the group size.

3.3.2 Sensitivity to λ_r

In this section, we evaluate the effectiveness of our λ_r estimator and the computed α values over the random topologies. Results for the power-law graphs are qualitatively similar.

Figure 3.11: Success rate for different values of T (T_n = 10sec)
Figure 3.12: Adaptation to a change in λ_r by a factor of 20

Assuming a group size of 1,000 peers, we measure the performance of AGNO for different λ_r values. Figures 3.9 and 3.10 show the results. Not surprisingly, the larger the value of λ_r, the faster the increase in the success rate, since indices become accurate faster. Another observation is that, regardless of the average request rate, our method asymptotically manages to contact all interested peers and reach a very low stress level (below 1.3). For most realistic scenarios (T_n ≫ T_r), the choice of T_n does not affect AGNO's performance. In the very rare cases where T_n < T_r, we just set T = O(T_n) to achieve comparable adaptation. In all cases, our adaptive aging mechanism selects a suitable value for α such that the stress remains almost constant and below 1.4, half the value of the best of the remaining schemes (SCAMP). For small request rates, peers adapt using initially low- and then higher-quality bins (thus the slight variation in stress). The smaller the value of λ_r, the longer this adaptation takes.

The value of T defines how aggressive the aging is. The smaller it gets, the bigger α becomes and thus the bigger the reduction in the reverse index values. Figure 3.11 shows how the success rate of AGNO, given 1,000 peers making requests at λ_r = 1/sec (and T_n = 10sec), varies by changing the value of T relative to the average request period T_r = 1/λ_r. Our default choice of T = 2T_r yields very good results, while choosing values close to the request period also produces fast learning. As T decreases further, the success rates increase at a much smaller rate. Surprisingly, even if we employ an aging twice as aggressive as the average request rate, over 80% of the members will be contacted after three AGNO pushes (30 seconds). Nevertheless, it is not safe to assume that the larger the value of T, the better. This would be the case if, for example, we had a static group size (no aging necessary); a significant number of member departures combined with a large value for T would delay the adaptation to the new group size and cause more messages to be created than necessary.

Finally, Figure 3.12 shows how effective our adaptive λ_r estimation scheme is. We simulate the extreme case where the 1,000 requesters suddenly change their query rates by a factor of 20 (from λ_r = 4/sec to λ_r = 0.2/sec and vice versa).
Our goal for the transition from high to low rate is to quickly decrease α so that our success rate is not affected. For the transition from low to high rate, we wish to quickly adjust the new α value according to the increased requests, such that no more than the necessary indices increase their value. We name our two runs high-low-high and low-high-low respectively: Starting with a rate of λ_r = 4/sec (0.2/sec), requesters drop (increase) their average number of requests to 0.2/sec (4/sec) at time t = 100sec. At time t = 200sec, they increase (decrease) their rates back to 4 queries/sec (0.2/sec). The top two lines correspond to success rates, while the bottom two correspond to the respective stress values. The maximum observed decrease in the success rates at 100 or 200 seconds is only 2%, while the stress values remain almost unaffected (an increase equal to 0.01).

Figure 3.13: Stress and success rates when a different ratio of peers departs at time t = 100sec (λ_r = 1/sec, T_n = 10sec)

3.3.3 Changes in Group Size

We now evaluate the performance of AGNO under dynamic changes in the group size. Our goal is to allow members to join or leave the group with the minimum amount of performance degradation. By tying group membership to the interest (or lack thereof) of peers in objects, we require no coordination between members nor any single authority node.

Figure 3.13 shows how our two metrics are affected by having 10%-80% of the 1,000 requesters leave the group (stop making queries) at time t = 100sec. We assume that all these nodes jointly and instantly decide to leave the group (as a worst-case scenario). In all runs, the stress value peaks at the time of the departures, since the same number of peers are notified but fewer are now considered as members. The size of the departing sub-group directly affects the stress increase. The stress value instantly drops due to our aging mechanism, but it does not reach its previous value (though it decreases very slowly). This is due to the fact that a peer's indices get updated not only when it makes a request but also when any request passes through it. Therefore, while shortcuts for departing peers expire, indices leading to them may still have large values, depending on the relative positions of other requesters in the overlay. The amount of increase for {10%, 20%, 50% and 80%} of the members departing is {7%, 12%, 38% and 100%} respectively. The amount of increase gets smaller as the original group size gets smaller, which proves our previous point: Assuming 200 initial members instead, the respective stress increase percentages are {7%, 9%, 16% and 25%}. On the other hand, as the included graph shows, our success rate is not affected at all. We show next that the decrease in stress after new members join compensates for the increase after peer departures.

Figure 3.14: Success rate after a series of member departures and arrivals (λ_r = 0.5, T_n = 10)
Figure 3.15: Stress after a series of member departures and arrivals (λ_r = 0.5, T_n = 10)

Figures 3.14 and 3.15 display the performance of the compared methods under a combination of member joins and leaves. At times t = {200, 350}sec, 50% of the current group members decide to leave.
At t = {250, 280, 300, 400, 420, 440}sec, 50% of the non-active requesters re-join the group. Members make requests at λ_r = 0.5/sec, while the group notification phase is performed every 10 secs. The success rate shows an instant decrease at the exact time of arrival, which is proportional to the number of joining peers. Nevertheless, always more than 85% of the current members are contacted, and AGNO has learned of their presence by the very next transmission. In the next push phases, the method quickly reaches its previous levels. On the other hand, the value of stress decreases after member joins and balances the small increase that occurs after member departures. SCAMP and the two rumor-spreading schemes show big variations in the stress metric. For RM and det-RM, this happens because of the change in the group size (same number of messages regardless of peer membership), while for SCAMP this is due to the subscription and unsubscription processes. AGNO contacts the vast majority of members at a cost 1 to 10 times lower than the closest compared method (SCAMP).

3.3.4 Sensitivity to the Binning Scheme

In all our experiments, we used the same binning scheme. The question is how sensitive AGNO is to different binning configurations. An adaptive process that would adjust an initial binning configuration according to the method's performance is a difficult task: Even if the server knows the number of interested peers that received a notification (by members acknowledging through piggybacking), finding how many messages were sent in a distributed manner requires extra overhead. Furthermore, success rate and stress are often conflicting goals.

Assuming the simple solution of a single binning scheme, we evaluate AGNO using different bin configurations. We measure the success rate and stress of a single push phase to a group of 1,000 peers, each having made 20 requests, on our set of random graphs. We produce 100 different binning configurations with 2, 3 and 4 bins for direct comparison to the original scheme. Our study confirms that if we follow the empirical conditions of Section 3.2.1, even with coarse granularity (2 bins), our method's performance exhibits small variation. On the other hand, random choices for the bin limits and/or threshold values result in significantly degraded performance.

Figure 3.16: Comparison of 100 different binning configurations to the original one

In Figure 3.16, each point represents the percentage variation in our two basic metrics of a bin configuration compared to the original one; points are marked according to whether the configuration follows our empirical rules or not. Even random choices of the binning scheme which reasonably respect our simple conditions exhibit less than 10% variation.

3.3.5 Real Traces

We now present results from simulations using real traces. In our first experiment, we monitor the change in content for two very popular web sites, CNN and BBC News. We retrieve their home pages (http://www.cnn.com and http://news.bbc.co.uk respectively) at a minute granularity and record the time that their content has been modified.

Figure 3.17: Average results for one-day periods for the CNN and BBC News front pages
To determine that, we extract the official "Last Updated" string from the page and also directly compare the files (a method developed as part of a project for the CS724 Database graduate course at the University of Maryland). Each page is preprocessed with HTML Tidy [48]. Taking advantage of the fact that the overall structure of the same page rarely changes, we discard code, advertisements and pictures that change after each browser refresh, focusing on content. We monitor the changes over a period of 2 weeks, from Feb. 16th to Mar. 1st, 2004. The CNN home page changes every 18.1 minutes on average, while BBC's news page changes every 8.6 minutes. In our experiments, we use the same 10,000-node power-law graphs of the previous sections and a group size of 1,000 requesters, making requests with exponentially distributed interarrival times (λ_r = 0.1/min) for those two pages. The notification phases occur each time a page is updated, as given by our collected data. At exponentially distributed intervals (a rate of 1/15 per minute, i.e., one event every 15 minutes on average), we choose with equal probability among the following events: 10% of the members stop requesting the pages, or 80% of inactive members resume their requests, or nothing happens. On average, we vary our setup over 60 times per run. Figure 3.17 shows the results over the 14 one-day periods (averaged over all graphs with multiple runs for each). AGNO manages to exhibit very high accuracy and adapts its notification mechanism such that the stress value always remains stable between 1.6 and 1.7.

Finally, we test the behavior of our scheme in a much more dynamic environment. We use real traces taken from NYSE stock trades, which describe the accesses, volumes and values of all quotes in a 10-day period (Apr. 3-14, 2000). Aggregating to minute granularity, we monitor quote activity (accesses-updates) during a busy time interval (11:00-11:59am) each day. For our simulation, using the same power-law topologies as in the previous experiment, we assume a standard client population (group members) equal to the maximum number of accesses recorded at any minute per individual quote. We model our system such that, given there were Q accesses at a given minute, only the first Q clients are assumed to query for that object. This is equivalent to having a variable request rate for each member. Pushes were conducted whenever a quote's value was updated, with a maximum of one notification per minute.

Figure 3.18 shows the results for three of the most active quotes, SUNW (Sun Microsystems Inc.), MSFT (Microsoft Corp.) and ORCL (Oracle Corp.). The statistics for each of these quotes are presented in Table 3.3. The interesting statistic here is the high standard deviation value for all three quotes, which translates to a wide range of different λ_r rates for each requester in our experiments.

Figure 3.18: Results for a 7-day period for the Microsoft, SUN and Oracle quotes between 11:00am and 11:59am

Table 3.3: Access statistics for the three quotes
         Mean    Max     STD
SUNW     148     1037    118
MSFT     240     1171    184
ORCL     165     1137    101

Updates (i.e., push transmissions) were performed almost every minute. For all three datasets, AGNO achieves a high success rate with a few small spike-shaped decreases occurring.
A more detailed analysis of the data shows that these coincide with sudden increases (often more than 400%) in the group size (or accesses per minute) observed in the traces. Given traces for more days, those spikes would have less weight in the averages. We also depict the average stress values for the quotes, which are kept at a very low level throughout the whole interval. These results also show that our adaptive forwarding and aging mechanisms work effectively even in the most dynamic environments. Results for less popular quotes or for time intervals outside high-access periods are qualitatively similar and were not included, since the average group size was less than 100.

3.4 Summary

AGNO is an adaptive and scalable message dissemination scheme for unstructured Peer-to-Peer networks. Our method integrates knowledge accumulated during searches to enable content providers to contact interested peers with very small overhead. We described in detail our adaptive mechanisms to regulate message forwarding according to the quality of existing knowledge, as well as to ensure efficient performance in all group operations. A variety of simulations using both synthetic and real traces show that AGNO adapts quickly to variable request rates and group sizes, being at least twice as bandwidth-efficient as the compared methods.

Chapter 4
Adaptive Replication for Unstructured Overlays

4.1 Introduction and Overview of our Approach

While Peer-to-Peer's success can still be largely attributed to file-sharing applications (e.g., [49-51]), an increasing number of different utilizations of this technology have emerged. P2P has been proposed to assist in web caching [52], instant messaging [53], e-mails [54], update propagation [36], conferencing [37], etc.

A basic requirement for every P2P system is fault-tolerance. Since the primary objective is resource location and sharing, we require that this basic operation takes place in a reliable manner. Nevertheless, in a variety of situations, the distributed and dynamic nature of the environment stresses the system's ability to operate smoothly. For example, the demand for certain content can become overwhelming for the peers serving these objects, forcing them to reject connections. Flash crowds, regularly documented surges in the popularity of certain content, are also known to cause severe congestion and degradation of service [55]. Failing or departing nodes further reduce the availability of various content. Consequently, resources become scarce, servers get overloaded and throughput can diminish due to high workloads.

Data replication techniques are commonly utilized in order to remedy these situations. Replicating critical or frequently accessed system resources is a well-known technique utilized in many areas of computer science (distributed systems, databases, file systems, etc.) in order to achieve reliability, fault-tolerance and increased performance. Resources such as content, locations of replicas, routing indices, topology information, etc., are cached/replicated by multiple nodes, alleviating single points of contact in routing and sharing of data. This has the additional benefit of reducing the average distance to the objects. Replication can be performed in a variety of manners: mirroring, Content Distribution Networks (CDNs [56, 57]), web caching [58], etc. However, these approaches often require full control and provide static replication.
Static replication schemes require a priori knowledge of the popularity/workload distribution in order to compute the number of replicas needed. In large-scale unstructured P2P networks, peers usually operate on local knowledge, having variable network connectivity patterns and no control over the induced topology or workload. Data availability and efficient sharing dictate replication in this challenging environment. Structured P2P systems (DHTs) provide the state necessary to accurately identify the paths that requests take. This information can be used to point out, with high probability, all possible replication locations. However, such information is not available in unstructured overlays. File-sharing applications implicitly handle replication through object downloads, while some force their users to maintain the new replicas for the benefit of others. Yet, this does not tackle the issue of real-time replication responsive to workload for unstructured environments.

In this part of our work we present APRE (Adaptive Probabilistic REplication), a replication method for unstructured overlays based on soft-state routing indices. Our approach focuses on providing an adaptive solution to the problem of availability, together with minimizing the instances of server overloads and serious service degradation. Our system dynamically "expands" and "contracts" its resources according to the workload as perceived locally. New replicas are created in areas of high demand in the overlay, thus removing the need to advertise them. Moreover, this is done in a completely decentralized manner, with minimal communication overhead and using affordable memory space per node.

Figure 4.1: Part of the overlay network of our model. Dark nodes inside the bold dotted ellipse represent M_i, while those inside the thin dotted ellipse represent M_j. Peers with a file attached also serve objects i or j.

The framework we use to describe our system is a model as general and realistic as possible, avoiding many unnecessary assumptions, thus following the general description of Section 1.3. As a motivating example, assume an unstructured P2P system where peers share and request replicated resources. Objects are assumed to be requested regularly, since their content changes over time: results of a live sports event, weather maps, security updates, real-time aggregated statistics, tactical data, etc. Some of the nodes provide fresh content, while others share versions they have recently downloaded. Peers that are interested in an object conduct searches for it in order to locate a fresh or closer replica.

Figure 4.1 gives a graphic representation of the APRE framework. For each object i, there exists a set of peers called the server set S_i = {s_i1, s_i2, ..., s_ik} that serve the specific object. These are the nodes that, at a given time, are online, store object i and are willing to share it. A subset of S_i, the mirror set M_i ⊆ S_i (the shaded peers), represents the set of peers that, if online, always serve i. This does not imply that all peers in M_i will always be online, that their connectivity in the overlay will remain the same, or that they will never refuse connections. But we can assume, without loss of generality, that these nodes will be mostly available. Our assumption is not unrealistic: Imagine that these servers can represent mirror sites/authority nodes that provide up-to-date content.
Apart from the mirror set, other peers that already host or have recently retrieved an object can serve requests for it (the nodes with files attached to them in Figure 4.1). A server set comprises these nodes plus the corresponding mirror set. Naturally, peers may belong to server or mirror sets for multiple objects.

While this is a symmetric environment, it is clear that nodes exhibit different sharing abilities. A variety of parameters, including storage and CPU capability, popularity of stored objects, system workload, connectivity, etc., contribute to this fact. Some of these factors remain more or less static over time (e.g., processing power or the maximum available bandwidth of a host), while others change dynamically. Whichever the case, it is safe to assume that each peer in this system imposes a limit on the services it provides to other peers. This is something that is already utilized by several file-sharing applications (e.g., Kazaa [49], FTP servers, etc.). A variety of metrics can be used to realize those limits. Peers may set restrictions on the number of concurrent connections, their upload bandwidth, the number of shared files, the rate of received requests, etc. In this work, we focus on two of these parameters, namely workload and object popularity, as they are manifested through a single observable quantity, the request rate λ. It is obvious that servers of popular (or temporarily popular) items receive a larger number of requests, which can possibly affect their sharing ability as well as the system's behavior.

Given this general framework, our goal is to design and implement a replication protocol that will provide efficient sharing of objects (in terms of providing low-load operation), scalability and bandwidth-efficiency. APRE is a distributed protocol that automatically adjusts the replication ratio of every shared item according to the current demand for it. By utilizing inexpensive routing indices during searches, loaded servers are able to identify "hot" areas inside the unstructured overlay with a customizable push phase. Chosen nodes receive copies, thus sharing part of the load. Under-utilized replicas are released, allowing their hosts to store more popular content. The rationale behind APRE is the tight coupling between replication and the lookup protocol, which controls how searches get disseminated in the overlay. By utilizing search state, in a manner similar to AGNO, we are able to identify in real time "hot" or "cold" paths and avoid the need to advertise constantly created replicas. Our experimental evaluation shows that this method proves very efficient in a variety of metrics and environments.

4.2 Adaptive Replica Expansion/Contraction: APRE

Our main goal is to provide a completely decentralized mechanism through which the system will adaptively expand its replica size when demand increases and shrink it when demand falls.

Figure 4.2: State transitions in our system

APRE is based on two basic operations: Expand and Contract. The high-level behavior of our system can be described using a simple model (Figure 4.2): In normal mode, nodes can adequately serve requests and also retrieve objects. As load increases due to incoming requests, some nodes reach their self-imposed limits. By invoking the Expand process, we aim at bringing the node status back to normal and lowering the average load for a specific object through the creation of more replicas.
Normal operation through the distribution of load will not necessarily be achieved in a single step. Consider, for example, that a peer initiating Expand may receive requests for multiple objects. Expanding with respect to one of them will probably lower its load, but will not necessarily bring its level back to normal. As load decreases, nodes can free up space (and the respective resources) and thus share a bigger portion of the workload.

Let us now discuss why the system would benefit from these two operations. When parts of the server set S_i receive too many requests for object i, the following may occur: Clients' connections get refused, while servers receive an increasing amount of requests and their performance deteriorates. Both groups would benefit from an increase in the number of replicas available, especially if those replicas were placed inside the areas of high demand in the overlay. Conversely, consider that one or more subsets of S_i have recently received very few requests for object i. This practically means that an amount of their storage space is under-utilized. They could remove i to free up space or replace it with another object of higher demand. We stress that the system will not force any peer to store or serve an object until this becomes necessary. Peers with available storage can play that role. Contract will also be invoked when a peer is called to join S_i but cannot do so without exceeding its limits (e.g., available storage). Note that peers can still choose to reject a certain action, e.g., refuse to remove an object in order to serve a new one.

Figure 4.3: The shaded oval represents a server set for a specific object. Our system expands by creating replicas inside two areas where demand (depicted by arrows) is high.

Algorithm 1 Expand
1: if Replica i at node s reaches its limit then
2:    P ← FindPossibleServers(i)   {P ∩ S_i = ∅}
3:    Activate(i) at Y ⊆ P   {Replicate at a subset of the nodes in the high-demand area}
4: end if

Algorithm 1 describes the high-level operation of the Expand process. It is invoked by peers receiving more requests than they are willing to accept. Overloaded peers have to identify the set P, i.e., candidate nodes for replication inside query-intensive areas. A subset Y of these nodes is selected and, upon their agreement, the new replicas are transferred (Activate). Figure 4.3 shows an example of our system expanding in response to increased demand for a specific object. On the left, we see some initial server set (gray oval) and the demand for i (arrows from various parts of the network). Servers in two areas are overloaded with requests, thus forcing extra replicas in those two areas to be activated. S_i expands, as we see on the right part of the picture, in response to the current demand for object i.

Figure 4.4: Due to low demand in certain regions of the server set (depicted as white areas inside the dotted line), our system contracts its replica set.

Algorithm 2 Contract
1: if (Replica i at node s is under-utilized) or (s receives Activate(j)) then
2:    i ← ChooseObject()   {i is among the candidates for eviction}
3:    Deactivate(i)
4:    if (s received an Activate(j)) then
5:       Activate(j)
6:    end if
7: end if

Algorithm 2 describes our Contract process. It is invoked by a peer that either receives a low amount of requests for the object(s) it serves or is requested to serve a more popular one but cannot do so without freeing up some space. In any case, peers stop serving the object(s) that fall into these categories (Deactivate). The function ChooseObject decides at each point which object should be deactivated at nodes that have decided to serve a new object (i.e., received an Activate) but have reached their storage capacities. Natural choices are to have the new replica replace the least recently requested or the least popular one. Figure 4.4 shows that two areas of the server set (the areas inside the dotted line) do not receive any requests for object i. This leads to the contraction of S_i, which is now the gray oval on the right part of the figure.
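A compact C sketch of the per-replica trigger logic behind Algorithms 1 and 2 is given below. The struct fields and the push_expand/deactivate helpers are illustrative stand-ins for the mechanisms detailed in Section 4.2.1; this is not the simulator's actual code.

```c
#include <stddef.h>

typedef struct {
    int    object_id;
    double load;         /* requests per time unit observed locally      */
    double limit_up;     /* overload threshold  Limit_up(s, i)           */
    double limit_down;   /* under-utilization threshold Limit_down(s, i) */
    int    active;       /* 1 if this node currently serves the object   */
} replica_t;

/* Illustrative stubs for the push phase and replica deactivation. */
static void push_expand(int object_id, double overload) { (void)object_id; (void)overload; }
static void deactivate(replica_t *r) { r->active = 0; }

/* Expand: invoked when the local load for an object exceeds its limit. */
void expand_check(replica_t *r)
{
    if (r->active && r->load > r->limit_up)
        push_expand(r->object_id, r->load - r->limit_up);   /* D_i(t) */
}

/* Contract: invoked when a replica is under-utilized, or when the node
 * must activate a new object (incoming != NULL) but has no free space. */
void contract_check(replica_t *r, replica_t *incoming)
{
    if (r->active && (r->load < r->limit_down || incoming != NULL)) {
        deactivate(r);                 /* Deactivate(i)              */
        if (incoming != NULL)
            incoming->active = 1;      /* Activate(j) after eviction */
    }
}
```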
Our goal is to achieve a system behavior that resembles the buffer management techniques in databases: Viewing the P2P network as a large buffer, we want to decide (in a distributed and dynamic manner) the ratios of objects in the buffer according to user-specified queries (i.e., the workload).

4.2.1 Protocol Implementation

In this section we describe the actual implementation of the APRE protocol as described by the Expand and Contract algorithms. We assume that servers measure load and perform replication on a per-object basis, at the same level of granularity as the lookup and reverse indices of APS and AGNO respectively. Vital to the success of our scheme are the following:

1. A mechanism to identify object popularity
2. A mechanism to create replicas inside high-demand areas
3. Minimization of communication within each server set

The conditions of line 1 in Algorithms 1 and 2 describe when Expand or Contract are initiated. We believe that each peer can independently choose when to initiate an expansion or when to deactivate a replica. Therefore, there is no need for any message exchange between servers. We assume that each server s defines the maximum number of requests that replica i can accept per time unit, Limit_up(s,i). If it receives fewer than Limit_down(s,i) requests for object i, this replica is deactivated/deleted from the node's cache without any further communication. Alternative measures, such as the maximum number of allowed connections, can be used. These limits can optionally be advertised inside the network upon connection or replica activation. If a peer cannot sustain its advertised rates, it may choose to advertise new maximum capacities. This can potentially assist requesters by prompting them to avoid very high request rates. Nevertheless, it is not required by our approach. Obviously, the total maximum capacity for server s is equal to Σ_i Limit_up(s,i), where i ranges over every object that s serves.

Figure 4.5: Visual representation of a sample power-law graph, after several searches for a single object using the APS method. Solid-line arcs show high index value links between nodes.

Figure 4.6: After searches for an object at s take place, reverse index values are updated and a push phase creates new replicas inside areas of high demand (dotted links).

In order to identify "hot" areas inside the overlay (i.e., locate the set P), we utilize a push phase similar to the one featured in AGNO. Paths with large APS-index values connect the requesters to the content providers (Figure 4.5). Peers store reverse index values for each of their neighbors. Reverse indices are used to identify (paths to) active requesters in the overlay on a per-object basis.
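During the push phase a peer therefore needs the neighbors whose reverse index values are largest. A small illustrative helper that selects the k best neighbors is sketched below; the names and the fixed-size scratch array are assumptions made for brevity, not APRE's actual code.

```c
/* Select the k neighbors with the highest reverse index values.
 * A simple selection pass is sufficient for the small node degrees (~4)
 * considered here; assumes degree <= 64 for this sketch. */
int top_k_neighbors(const double *reverse_index, int degree,
                    int k, int *selected /* out, size >= k */)
{
    int chosen[64] = {0};
    int count = 0;

    for (int round = 0; round < k && round < degree; round++) {
        int best = -1;
        for (int n = 0; n < degree; n++)
            if (!chosen[n] && (best < 0 || reverse_index[n] > reverse_index[best]))
                best = n;
        chosen[best] = 1;
        selected[count++] = best;
    }
    return count;   /* number of neighbors actually selected */
}
```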
In order to discover candidate new servers to host replicas of i, whenever the local load for object i (measured in requests per time unit) λ_s,i(t) exceeds the limit Limit_up(s,i), the respective server s issues a special message which is forwarded to the k neighbors with the k highest reverse index values. The push message contains the tuple (s, i, size(i), TTL, D_i(t)): the overloaded server's ID, the ID and size of the object that caused the overload, the hop distance left and the amount of overload D_i(t) = λ_s,i(t) − Limit_up(s,i). Each node that receives this message independently decides whether to join S_i according to our implemented replication policy. This phase continues with each intermediate node forwarding the message to k neighbors in a similar fashion until either its TTL value reaches zero or a duplicate reception is detected. Figure 4.6 shows an example of our scheme at work: Black nodes represent requesters of the item held at node s. APS searches are depicted by arrows. In the push phase, paths with high index values are visited (links with dotted lines). The new shaded nodes with bold outline represent possible replicas created.

Reverse indices get updated in the same manner as in AGNO, while an aging factor forces their values to decrease with time. In AGNO, push phases are assumed to be less frequent than request rates (λ_r ≫ λ_push), thus the need for an estimate of λ_r, through which the tolerance parameter T is defined (see Section 3.2.1). However, since our priority is for the system to be as reactive as possible, we assume that servers may initiate Expand frequently (checking for overloads on every time unit). In this case, the aging scheme can be made simpler by avoiding the estimation of λ_r. Instead, we can set T ≈ 1/λ_s,i(t) in equation (1) of Section 3.2.1, substituting the estimator with the observed rate at each time step.

Each node on the path independently decides whether it will join S_i according to our replication policy. Currently, we have implemented three: FurthestFirst, ClosestFirst and Uniform. In FurthestFirst, the probability of a node joining S_i increases with the message distance, while the opposite occurs in ClosestFirst. All nodes are given the same chance in Uniform. After the subset Y has been identified, replicas are transmitted and activated.

In order for APRE to adapt to various workloads and avoid system oscillation [59] due to replicas with perceived load a little above or below the limits frequently entering and leaving S_i, we introduce a scaled replication policy: We regulate the number of replicas activated per push phase according to the amount of overload for object i, D_i(t), as observed by the server initiating the push at time t. To achieve that, we define a set of intervals {d_1, d_2, ..., d_m} that group the different values of D_i. Each interval d_k : {(l_k, u_k), {p_k1, p_k2, ..., p_kTTL}} is defined by a lower and upper value and TTL probability values, one for each hop distance. For the interval limits, we require that l_1 < u_1 = l_2 < u_2 < ... < u_m. When a peer receives a push message, it joins S_i with probability p_kℓ, if l_k < D_i ≤ u_k and the TTL value in the message is ℓ. Probability values increase as D falls into higher-numbered intervals (i.e., p_kℓ < p_(k+1)ℓ). Thus, a heavily overloaded server will create more replicas than a less overloaded one, and marginally overloaded peers will not alter S_i significantly. We note here that each server locally estimates λ_i(t), the number of requests for object i per time unit.
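The scaled join decision can be sketched as follows. The three overload intervals and the per-hop probabilities mirror the ClosestFirst configuration reported in Section 4.3; the type and function names, and the convention of indexing probabilities by the number of hops travelled (index 0 being the hop closest to the pushing server), are illustrative assumptions.

```c
#include <stdlib.h>

#define PUSH_TTL      5
#define NUM_INTERVALS 3

typedef struct {
    double lo, hi;        /* overload interval (lo, hi]                 */
    double p[PUSH_TTL];   /* join probability per hop distance          */
} overload_interval_t;

/* ClosestFirst configuration (see Section 4.3). */
static const overload_interval_t intervals[NUM_INTERVALS] = {
    { 0.0,  5.0,   { .12, .06, .03, .02, .01 } },
    { 5.0,  20.0,  { .22, .10, .06, .04, .02 } },
    { 20.0, 1e300, { .35, .18, .10, .07, .04 } },
};

/* hops: 1..PUSH_TTL hops away from the overloaded server.
 * Returns 1 if this peer decides to join S_i, 0 otherwise. */
int join_decision(double overload_D, int hops)
{
    for (int k = 0; k < NUM_INTERVALS; k++) {
        if (overload_D > intervals[k].lo && overload_D <= intervals[k].hi) {
            double r = (double)rand() / RAND_MAX;
            return r < intervals[k].p[hops - 1];
        }
    }
    return 0;
}
```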
Our experimental evaluation confirms that expanding relative to the size of the excess load D is necessary to achieve smooth changes in the server set. Our results also show that peers can further avoid oscillations by monitoring the number of times they joined or left S_i inside a small window of time. Peers that repeatedly leave and re-join a server set can choose a single state (either host the object or not) for a fixed number of subsequent time steps. If the size of the object and the peer's free space allow it, it is preferable that the node serves object i for that period, regardless of whether λ_i(t) < Limit_down(i).

4.3 Performance Evaluation

We test the effectiveness of APRE using a message-level simulator written in C. Requests for object i occur at rate λ_i with exponentially distributed inter-arrival times. At each run, we randomly choose a single node that plays the role of the initial M_i = S_i set and a number of requesters, also uniformly at random. Our experiments involve a single object each time (thus λ_i ≡ λ_r). This is done for two reasons: First, the only dependency that exists between replication of different objects relates to a possible deactivation of an object before the activation of a new replica. In the previous section we described how APRE tackles this issue. The more practical reason relates to the amount of memory required to simulate multiple objects on our graphs. Periodically, 10% of the current requesters stop querying for the object and are replaced by an equal number of other (previously non-requester) nodes. Results are averaged over several tens of runs over sets of 10,000-node random and power-law graphs with average node degrees around 4 (similar to Gnutella snapshots [45]). These are created with the BRITE [46] and Inet-3.0 [33] topology generators.

To evaluate the replication scheme, we utilize the following metrics: the average load, which is the number of received requests per time unit averaged over the number of servers |S_i|. Obviously, regarding our load-balancing requirement, we also need to measure the disparity of the load distribution. To that end, we compute the standard deviation of the load and the Gini coefficient [60]. High values for both these metrics indicate that load is unevenly balanced across S_i. Besides the size of the server set, we also keep track of the number of replica activations/deactivations. Frequent changes in S_i incur huge overheads in terms of messages and bytes transferred.

APRE Parameters: We assume that (Limit_up(s), Limit_down(s)) = (18, 3) requests/sec for each server s. To calculate the decay of the reverse indices, we choose an aggressive value of T = 0.5/λ_i(t). Different values of T ≤ 1 sec produce similar results. During the Expand push phase, peers forward to the two neighbors with the largest reverse index values. Servers check whether to initiate Expand and Contract every time unit for fast response. We assume no item can be replicated at more than 40% of the network nodes (maximum replication ratio). This external condition simulates the natural limitations in storage that exist in most systems. We present experiments that show our method's performance when altering this ratio. We utilize a scheme with 3 distinct intervals for the values of D: [0,5], (5,20] and (20,∞). The results did not vary considerably for schemes with more intervals. As the intervals get fewer, APRE becomes less responsive to D, with the number of created replicas having smaller variation.
This affects the performance of the replication during the warm-up state as well as during sudden surges in requests. The chosen configuration works well and is used in the entirety of our simulations. For the first interval and ClosestFirst, we use probabilities p_1 = {.12, .06, .03, .02, .01}, p_2 = {.22, .10, .06, .04, .02}, p_3 = {.35, .18, .10, .07, .04}. These are reversed when we use FurthestFirst. Uniform uses a (.05, .08, .15) probability for each of the 3 intervals. Thus, we roughly double the p_kℓ value from one interval to the next and halve it from one hop to the next within the same interval. For Uniform, we roughly select the average of the p_kℓ's of the other methods as the p_k value. Increasing these probabilities causes more replicas to be created, often reaching the maximum replica count, while much smaller values delay the responsiveness of APRE (in high-demand settings). Increasing the difference between p_kℓ and p_k(ℓ+1) changes the ratio of replicas created at different distances, while decreasing it produces an effect similar to Uniform. While we experimented with different configurations, our results are based on the described one.

We compare APRE against the following methods: In the random replication scheme (hence Random), we randomly create, at the start of the simulation, the same number of replicas as our method has in its steady state. In path replication (hence Path-cache), each time a server is overloaded we replicate the object along the reverse path to the requester. This is similar to the replication applied by Freenet [61]. In all cases, the APS method is used for lookups, while in Path-cache replica deactivation occurs using our Contract scheme. Obviously, by varying the push method and the replication probabilities, APRE can behave either as Path-cache, as Random or in between, with a variable rate of replica creation per workload. This allows for full customization according to the system's primary objectives, namely low load (more replicas) or space (replicas only where necessary).

Figure 4.7: Variation in the average load and |S_i| over increasing λ_r values

4.3.1 Basic Performance Comparison

For our default setting, we assume 2000 requesters and vary their λ_r. The results are presented in Figure 4.7. APRE effectively expands S_i in order to accommodate the increased demand and manages to keep the average load within the "Normal Operation" zone, well below Limit_up(s) (identified by the bold horizontal line). Our first observation is that FurthestFirst achieves lower average load values by creating more replicas than ClosestFirst. The paths traversed during the push phase contact an increasing number of nodes as their distance from the initiator increases, thus giving FurthestFirst an increased probability of replication. Uniform behaves in between, creating replicas equally at all distances. Path-cache exhibits a steeper increase in the average load and fails to keep its value within the acceptable region for large λ_r. Replicating only along single successful paths prevents the algorithm from performing further replication. Increased demand merely forces the algorithm to utilize a few more paths, which is the reason why this method fails to grow the replica set enough to meet the limits.
Figure 4.8: Ratio of overloaded servers vs. variable λ_r
Figure 4.9: Percentage of change in |S_i| vs. variable λ_r

Figure 4.8 displays the average percentage of overloaded servers at any time for all three methods. Our technique clearly outperforms the two competing methods: For λ_r < 10/sec, less than 4% of servers are overloaded, while about 10% and 25% are documented as overloaded for the largest demand. Random, having the same number of servers, exhibits twice as many overloaded nodes. Even though the learning feature of APS helps in redirecting queries, the load cannot be evenly distributed. Path-cache shows the worst performance (at least 3 times larger ratio of overloaded peers than APRE), reaching 75% at the highest λ_r value. Replicating closer to requesters creates, as we saw, more service points, thus marginally reducing the number of overloaded instances for FurthestFirst (Uniform exhibits the same curve).

Moreover, we show that APRE achieves a much more robust replication. The stability of the server population constitutes an important metric in the evaluation of a replication scheme. This is measured by the average ratio of new replicas entering the server set per replication phase over the size of the server set. This quantity approximates the amount of marginally under-utilized replicas in the overlay: Receiving few requests, they get deactivated. Server overloads force them to get re-activated, producing an oscillating effect. Obviously, this is a highly undesirable situation: network and local resources are burdened by a multiplicative factor, since replicas need both control messages and data transfer for reactivation. Figure 4.9 shows that APRE is particularly robust, altering at most 3% of S_i per push phase, while Path-cache oscillates and performs poorly in most runs, altering a large percentage of the server set. The variability in the amount of oscillation is due to the effect we described before: An increase in the demand is not always followed by an increase in the number of replicas. In these situations, the existing ones receive the extra amount of requests (assisted by the APS scheme), thus reducing the marginally idle servers.

Figure 4.10: Variation in the average load vs. variable λ_r (5000 requesters)
Figure 4.11: Percentage of change in |S_i| vs. variable λ_r (5000 requesters)

The same experiment is repeated with 5,000 requesters, which constitute 50% of the overlay (see Figure 4.10, where we annotate the respective |S_i| value over each point). APRE again manages to keep the system within its limits, except for the two cases where even the largest replica set cannot achieve that (75K and 100K total queries per second).
Figure 4.11 shows the amount of change in the server sets of the two methods. APRE is much more stable in the Si population for both strategies, while Path-cache deteriorates compared to the results for 2000 requesters.
Figure 4.12: Average load and |Si| over time for 5000 requesters and multiple λr values
Finally, Figure 4.12 shows how the average load and |Si| vary with time, using ClosestFirst. For all values of λr, APRE manages to bring the load to a steady state within a few time steps, a state which is then maintained with almost no deviation. The same holds for the size of Si, with the exception that for high total demand it takes longer to reach the steady state. This is due to the limit on the maximum amount of replication per push phase in our method (as there is for path-cache), which delays reaching the constant values.
Table 4.1 summarizes our observations for this setup by documenting the performance of the three schemes under a variety of metrics. APRE manages to keep bandwidth consumption steadily low in all runs: the number of push messages during Expand remains constant, while replicating inside query-intensive areas allows for an active reduction of the average distance between requesters and servers. The Random method shows an increased average distance to the objects compared to APRE and path-cache, which replicate along search paths.

Table 4.1: Performance comparison under a variety of metrics (5000 requesters)

Method                   λr = 1/sec  λr = 4/sec  λr = 8/sec  λr = 20/sec
APRE (ClosestFirst)
  Search Mesg                   9.6         7.8         7.8          7.8
  Push Mesg                    41.7        39.7        39.2         39.4
  Hit Distance                  3.4         2.2         2.0          1.9
  % Overloaded servers          1.6         6.8        20.2         60.8
Path-cache
  Search Mesg                  10.2         8.4         7.9          7.7
  Hit Distance                  3.6         2.5         2.2          2.0
  % Overloaded servers         11.1        23.5        37.8         94.3
Random
  Search Mesg                  10.4        10.3         9.4          9.0
  Hit Distance                  3.7         3.0         2.5          2.2
  % Overloaded servers          8.9        15.0        23.0         54.4

Our method exhibits a far smaller percentage of overloaded servers compared to path-cache, but is comparable to Random for the largest value of λr. This happens because APRE has reached the maximum replication ratio by exhausting all possible paths along which requests arrive in the overlay.

4.3.2 Load Distribution Between Replicas

While a low number of overloaded servers and a low average load are important, we have not yet investigated how load is distributed among the replicas. Obviously, balanced distributions are preferred to those showing a high degree of variation. Various metrics that quantify the degree of disparity of a set of measurements have been proposed. We investigate two of them, namely the standard deviation and the Gini coefficient.
Figure 4.13: Variation of the standard deviation of load vs. variable λr (2000 requesters)
Returning to our 2000-requester experiment, Figure 4.13 compares the standard deviation of the load for the three methods. We note that APRE exhibits small standard-deviation values, ranging
from 3.3 to 11. It increases to 14.9 only when λr = 20/sec. These values are either smaller than or at most comparable to the average load itself, a good indication of load balancing. On the other hand, randomly placing the same number of replicas yields significantly worse load distributions, with standard-deviation values roughly twice as large. This is a clear indication of the need for correct placement inside structureless multi-path overlays. Finally, path-cache behaves in between, with larger deviation values than APRE that converge as load increases. This happens because both methods base their replication on paths connecting requesters and servers. Our method utilizes multiple paths that consistently carry requests, and thus outperforms path-cache in almost every setting.
The Gini coefficient (or Gini ratio) G is a summary statistic that serves as a measure of inequality in a population [60]. It is calculated as the sum of the differences between every possible pair of individuals, divided by the mean size:

    G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \mu},

where n is the number of observations whose values are given by x_i, and \mu = \frac{1}{n} \sum_{i=1}^{n} x_i is their mean. The Gini coefficient has been used as a measure of inequality in size and fecundity in plant populations in numerous studies (e.g., Weiner 1985, Geber 1989, Knox et al. 1989, Preston 1998). Its value ranges between 0 and 1, where 0 corresponds to perfect equality and 1 corresponds to the theoretical case of an infinite population in which only one individual has a non-zero value. Recent work [62] proposed its use as a load-balancing metric. Taking as our population the number of requests received by each replica, we calculate the value of G as an index of load distribution among servers. Note that a low value of G is a strong indication that load is equally distributed among the servers, but it does not necessarily imply that this load is low.
Figure 4.14: Average values of G as a function of λr for the ClosestFirst strategy
Figure 4.14 shows the average values of G for all request rates in both settings (2k and 5k requesters). In low-load runs, servers show very similar loads. As the total load increases (either through an increase in λr or in the requester population), so does the inequality between the received requests. The authors in [62] identify that G < 0.5 represents very well-balanced configurations, while values of G approximately between 0.5 and 0.65 correspond to relatively fair distributions. Our scheme, while not explicitly providing any mechanism to balance load inside Si, manages to provide very well-balanced configurations for low to medium loads and fair ones for medium to high loads. The reason is that the push phase (and thus, by extension, the creation of new replicas) operates symmetrically on multiple (k) high-quality paths. So, neither does the originating server get starved of requests, nor do the newly established servers differ substantially in their positioning. Only when |Si| approaches our artificial limit of 4,000 peers do we notice that the load allotment becomes uneven.
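As a side note, G can be computed directly from the per-replica request counts; the short sketch below (illustrative only, not part of the simulator) mirrors the definition given earlier.

    def gini(loads):
        """Gini coefficient of a list of per-replica loads (request counts)."""
        n = len(loads)
        mu = sum(loads) / n
        if mu == 0:
            return 0.0  # no requests at all: treat as perfectly equal
        diff_sum = sum(abs(x - y) for x in loads for y in loads)
        return diff_sum / (2 * n * n * mu)

    # Example: one heavily loaded replica among mostly idle ones yields a high G.
    print(gini([10, 10, 10, 10]))   # 0.0  (perfect equality)
    print(gini([40, 0, 0, 0]))      # 0.75 (highly unequal)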
To visualize and confirm these findings, Figures 4.15 and 4.16 plot the load distribution of the server set at a random point in time for λr = 4/sec and λr = 10/sec.
Figure 4.15: Si load distribution for λr = 4/sec (2000 requesters)
Figure 4.16: Si load distribution for λr = 10/sec (2000 requesters)
Servers are sorted in decreasing order of load. First, we notice that in both cases APRE has fewer servers above Limitup (the dotted line). Our method exhibits a less steep curve, with fewer groups of replicas sharing similar loads. Path-cache shows a more unbalanced load and a larger number of servers below Limitdown. Random replication causes an even more unbalanced load in all runs.
Thus far, we have established our basic premise: replication along high-demand paths in the overlay proves an effective and highly robust solution across a variety of metrics and workloads. Although our method does not explicitly offer load balancing, it achieves a well-proportioned load distribution. We also showed that our method is preferable to randomly replicating inside the network or to merely choosing a single path and fully replicating along it. In the first case, few replicas receive the majority of requests, while in the second case the composition of the replica sets changes very frequently. Our method outperforms both alternatives by keeping fewer peers over the sharing limit and showing less disparity in the distribution of load among servers.

4.3.3 Flash Crowds

In the next set of simulations, we examine the behavior of our method under a sudden surge in the workload. This is often referred to as a flash crowd: an unexpected rise in requests towards specific content, typically due to some newsworthy event that just took place. Flash crowds have been regularly documented in web traffic history (e.g., September 11th) and are known to cause severe congestion at the network layer. Requests may never reach the servers, while others do so with significant delays caused by packet loss and retransmission attempts. Content holders are unable to handle the volume of requests, while end-users experience long delays and failures in their queries.
To simulate this situation, we initiate our system with 500 requesters querying at rate λr = 2/sec. At time t = 401sec, 10 times as many requesters start querying for this item at rate λr = 12/sec. The parameters return to their initial values at time t = 601sec. On average, the total demand during the flash-crowd period increases by a factor of over 70. Note that this is the worst-case scenario, where both the number of requesters and the rates increase simultaneously. We present the variations in the average load and |Si| in the first two graphs of Figure 4.17.
Figure 4.17: Effect of flash crowds on the average load and |Si| in two different settings
APRE promptly manages to meet the surge in requests by increasing the replication
ratio by a factor of 30. Excluding a very short window due to our mechanism's response time, our method succeeds in keeping the load factor below the limit (with a standard deviation of load below 10) and steady through time. At both times that the load changes, replicas are activated and de-activated rapidly to meet the extra requests and the reduced traffic. While path-cache shows similar response speed, it creates more servers in the low-workload period and fewer than the minimum number required to keep content providers from overloading during the surge.
The bottom two graphs show how the same two metrics vary in a more challenging flash-crowd setting. Here, we initially have 500 requesters with λr = 1/sec, while for time t ∈ (400, 480] we set λr = 10/sec for 5000 requesters. On average, the workload inside the overlay increases by a factor of 120. Our results show that, even for shorter and steeper changes, APRE very successfully adapts to the surge in requests. On average, Si is expanded by a factor of 175 in order to reduce and balance load (our results document an average standard deviation of load of roughly 8.6).

4.3.4 Effect of the Maximum Replication Ratio and Limitup

Our default scenarios assumed a value of 0.4N for the maximum allowed |Si| (where N = 10k is the size of our overlay). In the next figures, we plot the performance of APRE (using ClosestFirst and 2000 requesters) while varying the maximum replication ratio from 0.1 to 1.0. Too small values should force the system to saturate quickly, while complete freedom to replicate should exhibit the best behavior. The results are presented in Figures 4.18, 4.19, 4.20 and 4.21.
Figure 4.18: |Si| variation for different maximum allowed replication ratios
Figure 4.19: Average load variation for different maximum allowed replication ratios
Figure 4.20: Percentage of overloaded nodes for different maximum allowed replication ratios
Figure 4.21: Percentage of change in Si for different maximum allowed replication ratios
When the ratio is too small (at most 1000 nodes are allowed to host the object), Si quickly reaches this limit. This affects the value of the average load as well as the number of servers over Limitup, which both rapidly increase. As more peers are allowed to become servers, |Si| increases and so does the percentage of servers below the limit.
The interesting observation here is that even in the most optimistic case (no replication restriction), APRE manages to keep the ratio of change in the server set below 3%.
In the next experiment, we vary the maximum advertised capacity Limitup uniformly. Sample results are presented in Figures 4.22 and 4.23.
Figure 4.22: Average load variation for different values of Limitup_{s,i}
Figure 4.23: Percentage of overloaded nodes for different values of Limitup_{s,i}
As we would expect, the smaller the upper limit gets, the faster our algorithm reacts to load, thus creating replicas sooner. Obviously, given some storage restriction (such as a maximum allowed number of peers able to enter Si), small Limitup values cause system saturation and more overloaded instances. On the other hand, for larger upper limits, the server set increases more gracefully and significantly fewer server overloads are observed.
Figure 4.24: Average load for 1k and 2k requesters in power-law topologies (λr = 6/sec)

4.3.5 Simulations with Different Topologies

We tested our method on a set of 4,000-node power-law graphs created with the Inet-3.0 generator [33]. These graphs have an average degree of d = 4.3 (the maximum degree equals 855), while over 30% of the nodes have only one neighbor. Figure 4.24 shows how the average load varies with time for ClosestFirst and FurthestFirst, using 1000 or 2000 nodes as requesters.
These topologies noticeably affect performance compared to our previous simulations. Even for average-range λr values, the load moves close to the overload line, while Expand shows a diminished ability to extend Si. This is consistent with results documented in previous work [15]. The tested topologies offer fewer paths between servers and clients, while a large percentage of the nodes have only one neighbor. This also explains why FurthestFirst outperforms ClosestFirst: favoring replication close to the requesters quickly saturates the available nodes due to the lack of alternate paths. Nevertheless, it is worth noticing that our method still manages to keep the load at lower levels. Even in the 2k-ClosestFirst run, where the average load exceeds Limitup, 14% of the servers are overloaded compared to 20% for path-cache.
We must note here that the replication protocol is not always responsible for overloaded servers. On many occasions, the amount of demand or the overlay connectivity does not allow for more extensive or balanced replication. As we experiment with more densely connected graphs, APRE performs inside the load limits where it failed to do so over sparser overlays. Moreover, with biased forwarding, as happens with most informed approaches, certain nodes will unavoidably receive the bulk of the requests. This situation can only be corrected by maintaining additional state at each peer (such as the location of other servers) and changing the forwarding scheme. In an environment with rapid changes in workload and server sets, this locally maintained metadata can frequently become stale, thus incurring a larger communication (besides the local storage) overhead. Our approach, on the other hand, does not require a change in APS, but relies on its ability to independently create and store that state through new object discoveries and reverse-index build-up.

4.4 Conclusions

In this part of our work, we presented our adaptive replication scheme for unstructured Peer-to-Peer systems based on probabilistic soft state.
APRE aims at providing a direct response to workload changes by creating server points in needy areas and releasing redundant servers in areas of low demand. Our approach couples lookup indices with an aging mechanism in order to identify, in real time, query-intensive areas inside the overlay. Peers then individually decide on the time and extent of replication, based on a local workload computation.
Our work shows that it is important to couple replication with the search protocol in unstructured systems. Random replication performs poorly with informed lookup schemes, unless extra state is added to enhance searches. Applying APRE over a scheme such as APS solves this problem. APS indices store local, per-object state to direct queries to objects. While peers only keep metadata about their neighbors, this information can be used to identify, hop-by-hop, where the queries are coming from. Moreover, our scheme is highly customizable, allowing control of both the size and the location (as defined through the reverse indices) of replication.
Using thorough simulations, we show that APRE is extremely robust in eliminating server overloads, while minimizing the communication overhead and balancing the load. Specifically, we show that replicating along the reverse path is an extreme case of our protocol. By effectively discovering all reverse paths, APRE manages to distribute content proportionally to demand in a variety of overlays and workloads. Finally, we show that our method succeeds in creating a very stable server set with a minimal amount of oscillation.

Chapter 5
Related Work

Peer-to-Peer networks have been studied extensively in the last few years. A large amount of information on P2P computing, with taxonomies, definitions, current trends, applications and related companies, can be obtained at [63, 64], as well as from individual sources (e.g., [7, 8]). P2P computing is also described in [65], with basic terminology, taxonomies and descriptions of some systems. A brief summarization of Gnutella [6] and Napster [5], together with approaches for structured networks, is also included. Gnutella and Napster are the focus of two measurement studies: Reference [66] attempts a detailed characterization of the participating end-hosts, while the work in [21] measures the locality of stored and transferred documents. In [67], a traffic measurement for three popular P2P networks is conducted at the border routers of a large ISP. Extensive results for traffic attributed to HTTP, Akamai and P2P systems are also presented in [68].
In this part of our dissertation, we present work related to each of our contributions: search algorithms, group communication schemes and replication methods for unstructured P2P networks.

5.1 Search Algorithms for P2P Systems

As part of this thesis, we present a thorough description of many representative search algorithms for unstructured P2P networks. We first describe blind search algorithms and proceed with several informed techniques. Appendix A presents a direct experimental comparison between many of these schemes.

5.1.1 Blind Search Methods

GNUTELLA [6]: The original Gnutella algorithm (or flooding scheme) contacts all accessible nodes within TTL hops. Its basic characteristics are its simplicity and the huge overhead it produces by contacting many nodes (and possibly contacting each multiple times).
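To make this baseline concrete, the following is a minimal sketch of TTL-limited flooding over an overlay given as an adjacency list (our own illustrative code, not the Gnutella implementation); duplicate forwarding is suppressed here with a visited set, which real deployments approximate with message identifiers.

    from collections import deque

    def flood(overlay, source, ttl):
        """Return the peers reached by a TTL-limited flood from `source` and the message count.
        `overlay` maps each peer to a list of its neighbors."""
        visited = {source}
        frontier = deque([(source, ttl)])
        messages = 0
        while frontier:
            peer, hops_left = frontier.popleft()
            if hops_left == 0:
                continue
            for neighbor in overlay[peer]:
                messages += 1                 # every forwarded copy costs one message
                if neighbor not in visited:   # duplicates are received but not re-forwarded
                    visited.add(neighbor)
                    frontier.append((neighbor, hops_left - 1))
        return visited, messages

    # Example: a small overlay where a TTL of 2 already reaches every node.
    overlay = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
    print(flood(overlay, source=1, ttl=2))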
Modified-BFS [22]: In this variation of the flooding scheme, peers randomly choose only a fraction of their neighbors to forward the query to. This reduces the average message production, but still contacts a large number of peers.
Iterative Deepening: Two similar approaches that use consecutive BFS searches at increasing depths are described in [14, 69]. These algorithms achieve their best results when the search termination condition relates to a user-defined number of hits and searching at small depths is likely to satisfy the query. Otherwise, they produce even bigger loads than the standard flooding mechanism.

5.1.2 Informed Search Methods

Super-Peer approaches: In Gnutella2 (G2) [70], when a super-peer (or hub) receives a query from a leaf, it forwards it to its relevant leaves and also to its neighboring hubs. These hubs process the query locally and forward it to their relevant leaves. No other nodes are visited by this algorithm. Neighboring hubs regularly exchange local repository tables to filter out unnecessary traffic.
Both G2 and GUESS [27] rely on a dynamic hierarchical structure of the network. They present similar solutions for reducing the effects of flooding by utilizing the structure of hybrid networks. The number of leaf nodes per super-peer must be kept high, even after node arrivals/departures. This is the most important condition for reducing message forwarding and increasing the number of discovered objects.
Intelligent-BFS [22]: This is an informed version of Modified-BFS. Nodes store (query, neighborID) tuples for recently answered requests from (or through) their neighbors in order to rank them. First, a peer identifies all queries similar to the current one, according to a query similarity metric; it then chooses to forward to a set number of its neighbors that have returned the most results for these queries. If a hit occurs, the query takes the reverse path to the requester and updates the local indices. This approach focuses more on object discovery than on message reduction. At the cost of an increased message production compared to Modified-BFS (because of the update process), the algorithm increases the number of hits. It achieves high accuracy, enables knowledge sharing and induces no overhead during node arrivals/departures. On the other hand, its message production is very large and only increases with time as knowledge spreads over the nodes. It shows no easy adaptation to object deletions or peer departures, because the algorithm does not utilize negative feedback and forwarding is based on ranking. Finally, its accuracy depends highly on the assumption that nodes specialize in certain documents.
Local Indices (LI) [69]: Each node indexes the objects stored at every peer within a certain radius r and can answer queries on behalf of all of them. A search is performed in a BFS-like manner, but only nodes at certain depths from the requester process the query. To minimize the overhead, the hop distance between two consecutive such depths must be 2r + 1. This approach resembles the two search schemes for hybrid networks. The method's accuracy and hits are very high, due to the indexing scheme. On the other hand, message production is comparable to flooding, even if the processing time is smaller because many nodes just forward the query. The scheme requires a flood with TTL = r whenever a node joins/leaves the network or updates its local repository, so the overhead becomes even larger in dynamic environments.
GIA [28]: In GIA, requesting nodes deploy biased walkers in order to discover various objects.
Each peer chooses to forward the query to the neighbor with the highest announced capacity. This is a user-defined metric that reflects the processing power of a node inside the system. Moreover, the protocol requires that each peer indexes the documents of its neighbors. This scheme also utilizes a topology-adaptation algorithm which re-configures the overlay connectivity such that each node is connected to a number of peers proportional to its capacity. The biased walkers are then directed towards highly connected neighbors and, probabilistically, to those with the highest number of indexed objects. Finally, the scheme provides a flow-control mechanism which allows peers to control the rate at which they can accept and process requests from their neighbors. Once the topology has been set, we expect GIA to perform very bandwidth-efficient searches with several hits. On the other hand, the adaptation algorithm plus the indexing of the neighbors' repositories increase the responsibilities of each peer as well as the communication overhead. Another issue is how fast the algorithm can work for joining peers and at what cost for their neighborhood.
Routing Indices (RI) [26]: Documents are assumed to fall into a number of thematic categories. Each node stores an approximate number of documents from every category that can be retrieved through each outgoing link (i.e., not only from that neighbor but from all nodes accessible through it). The forwarding process is similar to DFS: a node that cannot satisfy the query stop condition with its local repository will forward it to the neighbor with the highest "goodness" value. Three different functions which rank the out-links according to the expected number of documents discovered through them are also defined. The algorithm backtracks if more results are needed. This approach trades index maintenance overhead for increased accuracy. While a search is very bandwidth-efficient, RIs require flooding in order to be created and updated, so the method is not suitable for highly dynamic networks. Moreover, stored indices can be inaccurate due to thematic correlations, errors in the categorization of documents and network cycles.
In [23], each node holds a number of Bloom filters for each neighbor. The ith filter summarizes documents that can be found i hops away through that specific link. Nodes forward queries to the neighbor whose smallest-depth Bloom filter matches a hashed representation of the object ID. After a certain number of steps, if the search is unsuccessful, it is handled by a deterministic algorithm instead of backtracking. The scheme's expectation is to find only one replica of the object with high probability. Index maintenance requires flooding messages initiated from nodes that arrive or update their collections.
Distributed Resource Location Protocol (DRLP) [24]: Nodes with no information about the location of a document forward the query to each of their neighbors with a certain probability. If an object is found, the query takes the reverse path to the requester, storing the document location at those nodes. In subsequent requests, nodes with indexed location information directly contact the specific node. If that node does not currently hold the document, it simply initiates a new search as described before. This algorithm initially utilizes flooding to find the locations of an object. In subsequent requests, it might use a single message to discover it.
A low message production is achieved only with a large workload that enables the initial cost to be amortized over many searches. In rapidly changing networks, this approach fails and more nodes have to perform blind searches. This also affects the number of hits: if many blind searches are made, then many results are found; if many direct queries take place, then only one replica is discovered.
Gnutella with Shortcuts (GS) [25]: In this work, the authors propose the addition of shortcuts (i.e., direct links to peers that have recently proved useful in answering queries) to a Gnutella-like overlay. The original flooding mechanism is initially used to locate documents. Peers that provide answers are indexed by the requesters, following the assumption that they could provide answers to more requests. When a new query is made, nodes first forward it to their shortcuts (ranked in descending order of usefulness, usually their success rates). If all shortcuts fail, the standard flooding scheme is again used to locate the object. This approach resembles the DRLP scheme, but it stores more than one pointer and keeps statistics on them. For semantically related queries, we expect it to quickly identify relevant peers and mostly use the shortcuts for object location. Moreover, we anticipate a very high success rate, since the fall-back mechanism is flooding. On the other hand, if peers make many unrelated queries or do not store relevant content, it is possible that the shortcuts will fail, which in turn means that the system pays the price with a full-scale flooding. The same is true when objects are removed or peers depart frequently.
New Approaches: Recently, there has been an effort to combine the advantages of structured systems (DHTs) and unstructured ones. In [71], an immediate neighborhood area is defined for each peer. Object placement inside these overlapping areas is performed in a DHT-like fashion. Searches use the standard flooding mechanism, except that only certain areas are probed. In [72], peers are grouped into possession rules, according to whether they contain a specific item or not. Nodes search inside one possession rule in a blind fashion. The possession rule is chosen by a greedy mechanism according to past query results. Finally, the work in [73] combines random walks in unstructured overlays with DHT-like replica placement: the owner of each object places replicas of the object on several nodes. The replicas are assigned to nodes which have IDs numerically close to the object. During a search, random walks are used to locate several minima for a given object (i.e., nodes inside a neighborhood that have the closest ID to the object).
Finally, we describe a family of algorithms which are based on traditional reinforcement learning and inspired by dynamics observed in biological colonies. Several algorithms have been proposed to mimic the collective foraging behavior of ants that self-organize in order to locate and transfer food back to the nest in an almost-optimal manner. They are known as ant-based algorithms [30]. The problem of routing data packets in dynamic communication networks has characteristics well-suited to ant-based solutions. Indeed, a variety of schemes based on mobile agents (or ants) have been proposed in order to discover shortest routes between any pair of nodes in data networks (e.g., [74, 75]). These schemes utilize some ideas similar to our probabilistic APS walkers.
The main characteristics of ant-based routing (as seen in AntNet [75]) can be summarized as follows:
- Each node holds probability values per neighbor per destination. These values are used to guide the ants. Moreover, it holds statistics for the network traffic as seen locally.
- At regular intervals, forward ants are launched from nodes to randomly selected destinations. Forward ants keep memory of the visited nodes and of the traffic characteristics observed along their route.
- If the destination is reached, the agent creates a backward ant that travels along the reverse path and updates the probability values and local traffic measurements. Specifically, the probability of taking the successful path is increased using the standard reinforcement-learning rule: P ← P + r(1 - P).
These algorithms feature a plethora of tunable parameters that actively affect performance: the rate at which forward ants are created, the reinforcement-learning parameter r (which should depend on the round-trip time and the local traffic measurements), the probability of exploration versus exploitation, etc. APS differs from such schemes in that it updates probabilities on both success and failure with respect to message minimization. Second, it deploys multiple walkers, thus actively exploring and exploiting at the same time, while it requires neither a regular query dissemination nor the calibration of many parameters.

5.2 Data Dissemination

The problem of distributing content to multiple hosts is well-studied. We categorize existing methods into general application-layer multicast protocols, multicast for structured P2P overlays and, finally, approaches for unstructured networks.

5.2.1 Application-layer Multicast

Proposed approaches roughly fall into three categories. In the mesh-first category (e.g., Narada [76]), nodes form a random mesh between them and then compute unicast paths for each pair of members. This approach requires control overhead quadratic in the group size, due to refresh messages. In the tree-first approach (e.g., Yoid [77]), peers directly form a data delivery tree and also maintain a few extra links to exchange control messages. Finally, in the implicit approach (e.g., NICE [38]), both control and delivery structures are implicitly defined by the underlying routing protocol. For example, NICE arranges members into a hierarchy of layers and clusters and defines processes for member arrival/departure and cluster merge/split. All these approaches require the existence of a designated host to initiate the membership process, the periodic exchange of control messages and also significant overhead for member joins/leaves.

5.2.2 Multicast over P2P Overlays

The work in [40] describes a broadcast mechanism that operates over CAN [13]. Nodes forward to their neighbors in the d-dimensional space, as this is defined in CAN. There are also provisions made to eliminate duplicate messages and prevent looping of the packets around the coordinate space.
Scribe [39] is implemented on Pastry [10]. Interested hosts route their requests towards the node responsible for the group's key (the root). Each node on the path checks whether it is a current member of the group. If this is the case, it registers the source node as its child in the multicast tree and stops the forwarding process. Otherwise, it stores the ID of the source and makes a join request towards the root. Scribe is a decentralized and scalable protocol that takes advantage of the overlay structure to produce a balanced delivery tree.
Bayeux [34] is implemented on Tapestry [11]. The difference from Scribe is that join/leave operations go through the root of the tree, making it less scalable. Overcast [35] also requires coordination with the root node, while it builds its multicast tree in a manner similar to Yoid. The work in [78] contains thorough descriptions and performance comparisons of representative schemes from this category.

5.2.3 Group Communication in Unstructured Overlays

Many search schemes for unstructured P2P networks have been proposed that implement flooding or its modifications in order to contact large numbers of nodes. Examples include the Gnutella flooding algorithm [6], the Modified-BFS scheme [22], the iterative deepening method [69], etc. All these techniques produce a large number of messages, cannot adapt to variable group sizes and use blind forwarding, which results in many non-members receiving the message.
An alternative solution to the problem is offered by a variety of gossip algorithms, where each member is responsible for forwarding a notification to a randomly selected subset of the group. These approaches have been used in a variety of different scenarios (e.g., distributed databases [79], publish-subscribe systems [80]) and have proved to be a robust solution in the face of member/network failures, at the cost of inducing extra traffic in the network.
In Lpbcast [80], membership is achieved by periodic gossiping of subscriptions: peers transmit a set of subscriptions that they recently heard of to a random subset of their locally known group members. Upon receiving such a message, nodes replace a random subscription from their local lists with the new one. To achieve the probabilistic guarantees offered by similar schemes, the size of the group and the local list size must be fixed, which is not the case in highly dynamic networks.
SCAMP [47] is a decentralized membership protocol that utilizes gossiping. Joining members subscribe by contacting a random existing member. Upon receiving a subscription request, a member forwards it to all the members in its local repository. Nodes decide probabilistically whether to store or forward the subscription. For the unsubscription process, a node notifies the locally known members to replace its ID with the IDs of the members it has received messages from. Group communication is performed in the standard gossip-based manner. SCAMP is shown to converge to a local state of slightly over log(n) member IDs, which guarantees with high probability that all members will receive a notification.
In [36], the push phase of an update algorithm for unstructured P2P networks is a rumor-spreading scheme: each peer receives an update message along with a partial list of other peers to which the update has already been sent. If the update has not been received before, it is forwarded to a different subset of members with a certain probability. In [41], peers that have received a message fewer than F times forward it to B randomly selected neighbors, but only to those that the node knows have not yet received it. The deterministic version of that algorithm requires global knowledge of the overlay: nodes forward messages to all neighbors with degree equal to 1, plus to the B remaining neighbors that have the smallest degrees.
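As an illustration of this style of probabilistic forwarding, the sketch below loosely follows the description of [41]; it is our own simplified code (class and parameter names are hypothetical), in which a peer forwards a newly seen message to B randomly chosen neighbors it has not heard the message from, and stays silent once it has received the message F times.

    import random

    class GossipPeer:
        def __init__(self, name, neighbors=None):
            self.name = name
            self.neighbors = neighbors or []   # other GossipPeer objects
            self.receive_count = {}            # msg_id -> times received
            self.heard_from = {}               # msg_id -> neighbors it arrived from

        def receive(self, msg_id, sender=None, B=3, F=2):
            """Forward to B random neighbors not known to hold the message,
            as long as it has been received fewer than F times."""
            self.receive_count[msg_id] = self.receive_count.get(msg_id, 0) + 1
            if sender is not None:
                self.heard_from.setdefault(msg_id, set()).add(sender)
            if self.receive_count[msg_id] >= F:
                return                          # seen often enough; do not forward
            known_holders = self.heard_from.get(msg_id, set())
            candidates = [n for n in self.neighbors if n not in known_holders]
            for target in random.sample(candidates, min(B, len(candidates))):
                target.receive(msg_id, sender=self, B=B, F=F)

    # Example: a notification injected at one peer spreads over a 3-node clique.
    a, b, c = GossipPeer("a"), GossipPeer("b"), GossipPeer("c")
    a.neighbors, b.neighbors, c.neighbors = [b, c], [a, c], [a, b]
    a.receive("update-1")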
In contrast, our approach requires no group subscription/unsubscription process nor any centralized or distributed storage of the current group members. Its forwarding scheme is an adaptive selection between neighbors and shortcuts, relative to the quality of the local search knowledge.

5.3 Replication

Replication is a well-known technique for achieving high availability and fault tolerance in large-scale systems. While it has been applied in a variety of contexts, we focus on the area of distributed (P2P) systems.
Structured overlays (DHTs) balance routing between network nodes, due to the nature of the hashing functions used. Moreover, in systems like CFS [81] and PAST [82], each item (or chunk of it) is replicated on a set number of network nodes. DHTs take advantage of the routing structure, which in effect allows for almost-deterministic paths between two nodes, thus identifying "hot" areas easily. Nevertheless, DHTs are not optimized for skewed access patterns and direct such traffic to the few nodes responsible for popular content.
DHash [83] is a replication method applied on Chord [12]. The protocol allows r copies to be stored at the r immediate successors of the initial copy's home. In [84], the authors propose the storage of at most R replicas for an object. Their location is determined by a hash function, allowing requesters to pro-actively redirect their queries. The work in [85] proposes replicating one hop closer to requester nodes as soon as peers are overloaded.
Lar [86] is a DHT-based approach similar to APRE, in that it adapts in response to the current workload. Overloaded peers replicate at the query initiator and create routing hints on the reverse path. Hints contain other locations where the content has previously been replicated, so queries are randomly redirected during routing. The method takes advantage of the DHT substrate in order to place the hints. Our scheme does not attempt to re-route queries or shed load to the initiator, but rather places replicas inside forwarding-intensive areas using multiple paths. Moreover, the state we keep is accessible at any time, not only at the time of the query arrival. Finally, it appears that Lar would suffer from slow propagation of hints in lower-demand scenarios, as well as from stale caches in dynamic settings.
HotRoD [62] presents a load-balancing approach for DHTs handling range queries for relational database systems. It is based on a locality-preserving DHT and on the replication of overloaded arcs (consecutive nodes on the DHT ring). The work in [87] employs a minimization function that combines high availability with low load to replicate video content inside a DHT. The approach requires knowledge of peer availabilities, workload and data popularity. In [59], the authors show that load balancing based on periodic load statistics suffers from oscillation. By directing queries towards the maximum-capacity replica location, both heterogeneity and oscillation issues are tackled. However, this approach assumes knowledge of all existing replicas and requires that replicas regularly advertise their capacities to the network.
The work in [14] discusses static replication in unstructured networks that use Random Walks as a lookup method. Various replication strategies are compared, and it is concluded that replicating proportionally to the square root of the access frequencies of objects (which must be known a priori) minimizes the size of a search. In [88], replicas install pointers to their locations on O(α√n) random peers using a random walk (n being the number of peers and α a parameter).
Searches are conducted in the same manner, contacting O(α√n) random nodes with a single walk. This approach utilizes replication of object locations in order to provide guarantees for the success of a search, not for load-balancing or adaptive-replication purposes.
The replication method utilized by PlanetP [89] attempts to tackle the problem of resource availability in unstructured environments. Peers regularly gossip metadata about their online status, free space and stored objects to other nodes. Each peer periodically chooses an object it hosts and decides, based on the information collected from all peers, on its availability in the network. Given a low estimate, it fragments the file and pushes all fragments to nodes using hints about their free space. This approach relies heavily on the collection of data from all network nodes to achieve high availability.
In most P2P file-sharing applications, replication is naturally handled through content sharing among users. In general, the following two approaches exist: files comprise equal-size chunks and are individually indexed, or peers dynamically decide the portion that is retrieved from each source peer. The first approach is utilized by Overnet [90], BitTorrent [91] and Slurpie [92]. Each file is divided into a number of standard-size fragments (9500KB, 256KB and 256KB for those systems, respectively). A peer may then download different fragments from various sources. Upon completion, each fragment becomes available for sharing with other nodes. The second approach [93] (or modifications of it [94]) is currently used by other P2P applications (e.g., Morpheus [95]). A requester contacts many source peers and retrieves small portions of the file from each of them. When each small chunk is retrieved, more is requested from that specific source. There also exist several schemes (e.g., [96, 97]) which allow for increased robustness in reconstructing a file by receiving a few extra parts of it.
There has also been a considerable amount of work on flash-crowd avoidance. In [98], overloaded servers redirect future requests to mirror nodes to which content has been pushed. This approach does not tackle the issue of which node to replicate to. PROOFS [99] explicitly constructs a randomized overlay to locate content under heavy load conditions or with unwilling participants. In effect, the method relies on the combination of a custom overlay and a gossip-based lookup scheme to locate objects, without involving any replication.

Chapter 6
Conclusions

In the last few years, the research community has provided a plethora of powerful tools in the area of distributed communications. The interest in P2P computing has produced a variety of systems and schemes that facilitate the two important primitives in large decentralized environments: content sharing and open communication. While we are still unsure about the future applications of P2P, no one can deny their popularity and the attractive features that make them a strong candidate to become the basis of future platforms.
Our thesis focuses exclusively on providing adaptive, bandwidth-efficient protocols for data search, retrieval and one-to-many communication in unstructured overlays. Our schemes offer deployable, low-cost solutions for current applications (in the case of APS-AGNO-APRE), with a look towards the future and scientific collaborations (in the case of GrouPeer, described in Appendix B). There exists a set of common characteristics in all these methods.
First and foremost, we aim for algorithms that are adaptive to the environment they operate in. To achieve that, we enable a learning feature in each of our protocols: peers learn from experience and from interactions with other peers in order to both increase their performance and adapt to changes. Second, we aim for collaborative operation among the users. To achieve that, we design our schemes such that individual experience (in the form of state stored at nodes) can be shared and refined collectively. Third, we identify the need for bandwidth-efficient operation in such systems. Consequently, we utilize directed walkers instead of flooding in order to locate content, and we regulate our push phases such that only the interested peers participate. Finally, clustering/grouping of peers according to content or demand is used in order to improve data sharing and our communication flexibility.
In the future, we expect an increase in the number and size of P2P collaborative applications. While our focus will still be on efficient distributed algorithms, more attention will be given to security, reputation and trust issues. The combination of the lack of a central authority with the reality that not all users are equal or play fair is, in our opinion, the biggest bet that P2P has to win, and decisively so.

Appendix A
Analysis and Comparison of P2P Search Methods

A.1 Overview

With the increasing interest in P2P systems, a plethora of search schemes for unstructured P2P networks has been proposed. In this supporting work, we analyze the performance of many representative lookup protocols alongside APS. We focus on the behavior of these algorithms with respect to each of the following metrics: efficiency in object discovery (accuracy and number of hits), bandwidth consumption and adaptation to changes in topology and object locations. While discovering many objects is very important, as it enables efficient object retrieval, minimizing search messages always represents a high-priority goal for distributed systems. Finally, it is important that any search algorithm adapts to changing conditions, since in most P2P networks users frequently enter and leave the system, as well as update their collections.
To evaluate our analysis, we simulate nine methods and present a direct quantitative comparison of their performance. We identify the relative advantages and disadvantages of each method, as well as the conditions under which they can be most or least effective. To our knowledge, this is the first work that attempts a direct comparison of such a diverse set of search techniques proposed for unstructured P2P systems. We believe this is an important contribution that can provide a better understanding of the various mechanisms and assist in choosing the algorithm that best fits a particular application.

A.2 Performance Evaluation

A.2.1 Algorithm Implementations

In this section we present results for nine of the methods described in the Related Work chapter: G2, Modified-BFS, Intelligent-BFS, Local Indices, DRLP, GS and GIA, together with the already described APS and Random Walks. The simulated methods are representative blind and informed schemes, both flood-based and non-flood-based, with or without user-initiated index updates (that is, updates triggered strictly by the search process). In our experiments, we utilize the GT-ITM [32] and Inet-3.0 [33] topology generators to produce sets of random and power-law graphs, respectively.
For each setup, the results are averaged over a set of 10 similar graphs for each described topology. We also present results on a real Gnutella graph [100], with 61,685 nodes and an average degree of d = 4.6. For the default parameters, we mainly follow the model described for the APS evaluation (see Table 2.1). Requester nodes are randomly chosen and represent about 20% of the total number of nodes. Each requester makes about 1,500 queries over a time period. We do not allow extra replicas to be stored (i.e., we only consider the search phase, not object retrieval). Finally, besides keeping a dynamic node population, we also redistribute objects to model file insertions and deletions. Object relocation always follows the initial distribution parameters.
The Intelligent-BFS method was modified to allow for object-ID requests. Index values at peers now represent the number of replies for an object through each neighbor, and nodes choose the neighbors with the highest index values when forwarding a query. For the flood-based searches of Modified-BFS, DRLP and Intelligent-BFS, nodes choose an equal number of neighbors to forward a query to, in order to make direct comparisons. For G2, peers randomly choose k neighbors to forward the query to. The chosen nodes forward the query to all their neighbors. By modifying the value of k, we can simulate the operation of both G2 (with k always larger than the number of neighbors) and GUESS. In our simulations, G2/GUESS operate on a pure (instead of a hybrid) model in order to achieve uniformity in our results. Moreover, they both function in a blind manner, so no cache or repository table exchange takes place. We name this approach HG2 (Hybrid G2/GUESS). For our LI implementation, nodes index the objects of their neighbors (r = 1). To ensure that the search is equivalent to a flood with TTL = 5, only peers at depths 1 and 4 process the query. We also ensure that no object from the same peer is discovered multiple times. Finally, our GIA implementation deploys k walkers, with each peer forwarding to the neighbor with the highest out-degree, while the overlay adaptation process is not simulated. Peers index the documents of their immediate neighbors. For our GS implementation, we use 5 shortcuts and rank them by their success rates.

A.2.2 Basic Comparison

In our first set of experiments, we use a set of 10,000-node random graphs (average degree d = 4) to compare the nine methods over 5 different environments: a static one, two with low/high object relocation frequency and two with low/high peer departure frequency. In the two low-frequency scenarios, relocations and departures/arrivals occur about 300 times per run, while in the high-frequency ones they occur 10 times more often.
Nevertheless, only LI and GS with the pure-flooding scheme achieve very high accuracy. This happens because of the small out-degree of our network. We also notice that blind and flood-based techniques do not get affected by object relocation, but only by peer joins/leaves. While our relo- 129 cation process does not substantially alter anything in those algorithms? operation, peer arrivals/departures alter the topology and the amount of available resources. Mod/Int-BFS show relatively high accuracy and return many hits. Their perfor- mance is very similar, with the informed method showing marginally better results. For environments resembling this setup, Mod-BFS will be preferred, since its performance is equally high and it is much simpler. We expect the informed method to perform better in richer or more specialized environments (like the one described in [22]), mainly in the number of hits. Random Walks displays low accuracy (<34%) and finds less than 0.5 objects on average. Its bandwidth consumption is quite low (about 15 messages), while its perfor- mance is hardly affected by the dynamic operations. HG2 behaves similarly, with the exception of producing about 5 more messages per search. In general, these algorithms exhibit poor performance and appear very robust to increased network variability. This is reasonable, as walkers are randomly directed with no regard to topology or previous results. The s-APS method achieves a success rate of over 75% in the static run, a number that drops by around 30% in the highly dynamic settings, but only around 12% in the two less dynamic ones. The metric that is reasonably affected is the number of discovered objects, which are almost cut to a third. This happens because it takes some time for the learning feature to adapt to the new topology and paths to discovered objects frequently ?disappear?. On the other hand, it manages to keep its messages almost as low as Ran- dom Walks. The scheme is equally affected by relocations and departures/arrivals, since walkers are directed towards specific locations which are altered by both types of events. 130 Int-BFSLI HG2 RWalkss-APSMod-BFS DRLP GS GIA0.1 1 10 100 Hits per Request High relocationLow relocation StaticLow departures High departures Figure A.2: Hits per query of the methods using the set of 10,000-node random graphs with average degree d = 4 Nevertheless, it exhibits a good overall performance compared to the non-BFS related schemes, without indexing other peers? repositories. The DRLP algorithm exhibits some interesting characteristics. First, its message production is very low (less than 6 messages per request). Our simulations count the direct contact of a node (both for DRLP and GS) as one message, although a link between them might not exist in the overlay. Dynamic behavior causes the stored addresses to become more frequently ?stale?, thus the initial flooding is performed more often. This is the reason for the decrease in its accuracy from 99% in the static case to 77% and 15% in the highly dynamic ones. DRLP produces the same amount of messages for its initial search with Modified-BFS, so it needs many successful requests to amortize this initial cost. The number of objects it discovers is very small, ranging from 1.4 to 0.2. If DRLP is forced to use flooding many times, then the number of hits increases. If it is successful and produces few messages, then it only finds one replica per request. 
Despite this, we notice that it proves very bandwidth-efficient and flooding is scarcely used. This is due to 131 the fact that, with many nodes making requests, most of them obtain a pointer for every object after a while. So, even if some node initiates a flood, most of its neighbors will only forward to one other node. The large number of requests per run helps DRLP achieve a very low average message consumption. This scheme seems ideal for relatively static environments and large workloads, with the exception that the number of hits will be very close to one. Another important observation is that DRLP is affected far more by object relocation than by node departures. This is reasonable if we consider that with departures there still exist nodes with a valid pointer to an object, whereas object relocation may make many pointers become stale at once. The LI scheme proves the most productive in terms of discovered locations and the most costly in message production. It produces one order of magnitude more messages than the other BFS-related methods but also discovers about 10-20 times more objects, taking advantage of its index scheme. Its performance is only affected by the dynamic joins and leaves, with a decrease of more than 50% in located objects. The cost of the index updates, even under the more dynamic settings, is negligible compared to the cost of a search (at most 2% over the total number of messages). On the other hand, this cost is considerable for nodes that stay idle (and possibly alter their local repositories), since it induces traffic without any search involved. GS shows very high accuracy, since it can always fall back to the flooding scheme. Nevertheless, when peers do not have shortcuts or when these fail (this happens mostly when objects get relocated), message consumption increases dramatically. On the other hand, similarly to DRLP, the more flood searches are performed, the more objects are discovered. Shortcuts are mostly used in the static and dynamic arrival/departure modes, 132 Table A.1: Comparison on 10,000-node random graphs with degree d = 10 Metric ModBFS IntBFS LI HG2 RWALKS s-APS DRLP GS GIA Success(%) 98.8 99.8 100 70.2 53.4 91.7 100 100 97.0 Messages 875 1233 39710 108.7 43.6 43.0 8.0 2344 35.0 Duplicates(%) 10.3 0.4 18.7 8.3 0.2 0.1 1.8 17.8 0.9 Hits 20.2 32.6 300.0 2.9 1.2 6.1 1.4 18.9 9.5 Hit Distance 4.58 4.61 3.99 1.88 2.78 3.16 1.90 4.60 3.1 since 5 shortcuts proved sufficient for at least one of them to provide an answer most of the times. Finally, GIA manages to perform as well as Mod/Int-BFS but being more bandwidth- efficient. The combination of one hop indexing and biased walkers achieves a good, ro- bust performance at relatively low cost. Only in the high relocation setting we notice a considerable increase (200%) in the average message consumption since peers have to refresh their indices frequently. A.2.3 Results on Denser Graphs In the next set of simulations we use a random graph set with an average degree d = 10 to compare the 9 methods over two different environments: A static one, and one where both object relocation and peer departures occur about 600 times per run. DRLP and Int/Mod-BFS forward to 4 neighbors at each step, while k = 12 for s-APS, Random Walks, HG2, GIA. All other parameters remain the same. The results for the static case are shown in Table A.1. We also report the percentage of messages per search that are duplicates and the average distance of the hits in overlay hops. 
Blind forwarding causes a large amount of messages to be dropped. Informed methods with no direct indices perform much better (s-APS and Int-BFS waste only 0.1% and 0.4% of their messages respectively). Flood-based schemes also exhibit large hop distances for their hits.

Figure A.3: Hits per hop distance from the requesters

All algorithms produce a larger number of messages per request in the new graph, taking advantage of the larger number of connections. DRLP still averages less than 10 messages per request. Random Walks and s-APS roughly double their hits and increase their accuracy. On the other hand, Int/Mod-BFS produce 10 times more messages. HG2 performs in between, producing about 5 times more messages. LI increases its bandwidth production by more than an order of magnitude. The overhead due to update messages is even less apparent now, since its search messages overshadow their effect. GS's performance increases similarly to LI's, since they use the same underlying mechanism. Finally, GIA exhibits a very good performance again, having low message consumption and increased accuracy/hits.

Another interesting metric is the percentage of hits discovered at various distances by the methods (Figure A.3). It shows how many objects each method locates with few or more messages. Our discussion is based on the static setting. Flood-based schemes discover the vast majority of the objects TTL hops away, since the available nodes increase exponentially with distance. LI always locates about 99% of its objects 4 hops away, and the rest only 1 hop away from the requesters (since only nodes at these two depths process the queries), while HG2 discovers about 90% of the objects with its flooding phase (2 hops away). Random Walks discovers almost the same number of objects per distance, since the query forwarding is done randomly. GIA also uses walkers and exhibits a similar behavior, as requesters are randomly chosen in our simulations. DRLP finds almost 70% of its hits using its indices (which also explains why its hit average is close to one). s-APS displays a symmetric curve. After a certain distance, possible paths become too many and the accuracy of the indices drops. Finally, we notice that GS only discovers about 5% of its hits using the shortcuts, whereas in the smaller graph the respective number was 50%. This can be explained by the fact that the flooding scheme now finds 2 orders of magnitude more objects than in the previous graph, while shortcuts still find one object.

Figure A.4: Accuracy and message production vs. object popularity in the dynamic setting

Figure A.4 shows how object popularity affects the methods' accuracy and message production in the dynamic environment. Popularity decreases as we move to the right along the x-axis. The first data point represents the accuracy/messages of the methods for the top-10%, the second for objects ranked between 11-20%, etc. This is an important comparison, because different applications or users target objects of varying popularity. The three BFS-related methods together with GS exhibit very high accuracy, with Mod-BFS showing a noticeable decrease only for the least popular items.
Random Walks, HG2, s-APS and GIA show decreasing accuracy as popularity drops, with GIA and s-APS clearly performing better. DRLP performs very poorly for the very popular documents (about 20%), but its accuracy increases as popularity drops. This can be explained by the fact that less popular objects receive considerably fewer queries. Therefore, object relocations and node departures, which affect the algorithm, happen less frequently during requests for such objects. All algorithms except DRLP and GS waste roughly the same amount of messages per request for each popularity group. DRLP and GS increase their consumption as popularity decreases, for the sole reason that the cost of the initial floods is now amortized over a smaller number of requests. Finally, we noticed that all algorithms except DRLP and GS discover a decreasing number of objects as popularity drops, exactly because this means there exist fewer objects to be located.

In the dynamic environment, we also measure the percentage of messages per request sent due to index updates (for the relevant methods only). We found that Int-BFS requires 11% (131 messages) of its messages for index updates. The respective numbers for LI, GIA and s-APS are 14.4% (2,968 messages), 31.7% (14 messages) and 18.5% (8 messages). Although GIA and s-APS appear to require a larger portion of updates, they are much more bandwidth-efficient than the other methods in absolute numbers.

Our previous simulations depicted the relative performance characteristics of the nine algorithms. To some extent, that sort of comparison was not direct, either because of the different nature of the methods or because of the single choice of the various parameters. Since it is impossible to directly compare the methods for the same parameter values (e.g., k, TTL), we select simulations on a third set of 10,000-node random graphs (d = 20) where the algorithms had similar performance in one of two important metrics: messages and hits per query. These results were obtained by experimenting with various values for k, TTL, the number of neighbors to forward to and the number of requester nodes. The results are presented in Table A.2; the first group of rows matches the methods on messages per query and the second group matches them on hits per query.

Table A.2: Comparison on 10,000-node random graphs with degree d = 20

Matched on   Metric      Mod-BFS  Int-BFS    HG2  RWALKS  s-APS   DRLP     GS    GIA
Messages     Success(%)    63.6     67.6    63.5    62.2   93.4    100    90.8   99.9
             Messages      73.4     83.0    77.0    72.5   70.6   79.2    77.0   70.0
             Hits           1.9      2.3     2.1     2.0   10.7    5.3    1.12   14.9
Hits         Success(%)    75.8     77.0    71.9    75.0   80.2  100.0   100.0   92.2
             Messages     134.4    117.1   115.1   125.2   31.4   43.0   356.5   32.1
             Hits           3.5      3.2     3.1     3.2    3.8    3.4     3.6    3.8

LI is omitted from this table because its large number of messages and hits could not be matched by the other methods. For similar message consumption, first GIA, then s-APS discover the most objects (followed by DRLP with about 10 extra messages per search). These three methods also prove extremely accurate, while the rest of the schemes (either flood-based or random) do not perform well. For similar hits per search, again GIA and s-APS stand out above DRLP, which wastes a few more messages but is perfectly accurate. From the rest of the methods, only GS is 100% successful, but it exhibits the highest message consumption.
A.2.4 Increased Number of Objects

Our previous model was mainly tailored for a system where peers continuously search for specific objects. The wide range of replication ratios together with the network dynamics best enables us to observe the effect of popularity, dynamic behavior and forwarding scheme. We now consider a more general situation, with a large number of objects (20,000) and 5,000 requester nodes, each making 2,000 queries. This could be an example of a P2P search engine application, with users having their own preferences (changing with time). Table A.3 presents our comparison using three sets of graphs: our original 10,000-node set (d = 4, RAND), a 10,000-node power-law graph set (d = 4.4, PLAW) and a Gnutella topology snapshot (d = 4.6, GNUT). For larger graphs (simulations up to 50,000 nodes), results are qualitatively similar.

Table A.3: Comparison of the nine methods with a 20,000-object pool

Graph           Mod-BFS  Int-BFS      LI     HG2  RWALKS  s-APS  DRLP     GS    GIA
RAND  Succ(%)     68.4     69.7     89.9    30.7    29.8   75.2  99.0   89.2   74.4
      Mesg       118.8    115.4   1511.6    24.9    18.6   24.1   7.1  563.5   18.3
      Hits         2.3      2.4     37.7     0.5     0.4    2.2   1.2    5.0    3.2
PLAW  Succ(%)     56.8     62.3     93.3    76.7    22.9   75.7  98.3   88.4   85.7
      Mesg        73.3     82.0   1473.0   750.3    13.1   15.1   5.0  355.9   19.1
      Hits         1.5      1.8     86.1    17.7     0.3    1.9   1.2    3.0   13.9
GNUT  Succ(%)     67.8     76.2     94.7    63.3    33.7   70.1  99.1   83.6   78.8
      Mesg       145.6    217.4   1325.1   282.1    24.3   33.1  17.1  886.5   20.9
      Hits         2.6      4.4     59.8     5.7     0.5    3.0   2.0   15.3    6.0

Compared to the previous results, we clearly notice a small performance degradation, which is natural if we consider that now more queries are made for sparsely located objects, while flooding is used more by some of the methods. Nevertheless, first DRLP, followed by s-APS and GIA, achieve numbers closest to the original ones. With the power-law topology, although the average out-degree is the same as with the random graphs, various neighborhoods differ substantially, since there are few nodes with very high connectivity. GIA clearly takes advantage of this to increase its discovered objects. Another observation is that pure flood-based schemes also discover substantially more objects (compared to the respective runs over the random topologies with 20,000 objects). HG2 achieves more than 10 times more hits with a 150% increase in accuracy, using 30 times more messages. LI doubles its hits without any message increase. The rest of the schemes perform very similarly to the previous simulation. The results for the real topology resemble those for the power-law graphs if we also take into account the size increase as well as an increase in the average out-degree and the number of poorly connected neighborhoods (possibly due to crawling imperfections). In general, most methods show increased messages and hits compared to the random topologies. While they effectively locate popular objects, they either fail to be as accurate or greatly increase their message production for the bulk of the non-popular items.

A.3 Conclusions

In this work we presented many of the search techniques available for unstructured P2P networks, along with a quantitative comparison through simulation. Our analysis focuses on the performance metrics of search accuracy, bandwidth consumption, discovered objects and behavior under dynamic operations.

The specifics of the problem play a big role in choosing the appropriate method. Each scheme has its own goals and it is important that these goals match the application's.
Important parameters that could influence our decision include the primary purpose of the application (e.g., fast discovery, many hits, bandwidth-efficient and accurate operation, easy deployment, etc.), the underlying topology, the expected workload, etc. We offer some general-purpose observations based on our analysis and simulations, hoping they will prove useful in evaluating the plethora of different schemes.

a) Blind forwarding cannot provide both high numbers of hits and low message production.

b) Index semantics play an important role: Direct location information is effective but sensitive to changes and more demanding (it becomes obsolete if a failure/relocation occurs and requires update messages). Indirect information (e.g., success rates in s-APS and Int-BFS, or connectivity/capacity in GIA) is much more robust but less accurate.

c) Indexing other peers' repositories is very useful but must be carefully applied, since it requires updates to keep the indices up-to-date.

d) Adaptation is a key characteristic through which peers that have a prolonged stay in the network enhance their knowledge with time. GS, s-APS and Int-BFS learn from system searches and improve their performance.

e) In many cases, the simple protocols are the preferred ones. The simplicity of the mechanisms behind flooding or Random Walks makes them powerful and easy to implement. They can be used either by themselves or in combination with other schemes to improve their performance.

Appendix B

Sharing Relational Data in Unstructured Overlays

B.1 Introduction

In this Appendix we describe the problem of sharing relational data in unstructured overlays. This is joint work with Verena Kantere and Professor Timos Sellis of the Department of Electrical and Computer Engineering, National Technical University of Athens, Greece.

In the past few years, there has been a growing interest in the Peer-to-Peer (P2P) paradigm, primarily boosted by popular applications that enable massive data sharing among millions of users. Our research, thus far, has been focusing on applications with exact-match queries: users request an object by either providing a unique identifier (e.g., filename, system-wide file-ID, etc.) or a single attribute-value pair, always evaluated as TRUE or FALSE. While this formulation covers a significant portion of real-life scenarios, it is certainly not all-encompassing. Scientific collaborations, enterprise data integration and sharing in the World Wide Web are only examples of applications that require more powerful data and query formulations.

In contrast to data integration architectures, P2P data sharing systems do not assume a mediated schema to which all sources of the system should conform in order to share data. In such a system, each peer is an autonomous source that has a local schema and individually stores and manages its data, revealing only part of its schema to the rest of the peers. Due to the lack of a global schema, users express and answer queries based on their local schema. In a P2P data management system, peers also perform local coordination with their acquaintees, i.e., their one-hop neighbors in the overlay. During the acquaintance procedure, the two peers exchange information about part of their local schema and create a mediating mapping semi-automatically [101]. The establishment of an acquaintance implies an agreement for the performance of data coordination between the acquaintees based on the respective schema mapping.
However, peers do not have to conform to any kind of data or schema transformation to establish acquaintances with other peers and participate in the system.

As we mentioned before, many popular P2P applications operate on unstructured networks, with peers joining and leaving the system in an ad-hoc fashion, while maintaining only local knowledge. In such systems, joining peers usually become acquainted with the first randomly available nodes and not with the ones that best meet their need for information. Therefore, they have to direct queries not only to their neighbors, but to a greater part of the system.

One can roughly identify two common approaches in order to query and retrieve answers in such a system. The first approach is to propagate queries on paths of bounded length in the overlay. At each routing step, the query is rewritten to the schema of its new host based on the respective acquaintance mappings (see Figure B.1). A query may have to be rewritten several times from peer to peer until it reaches nodes that are able to answer it sufficiently, in terms of quality but also quantity. It is obvious that the successive rewritings decrease or restrict the information that can be returned by a query and, thus, also reduce the possibility of accurate query answering. Moreover, it is the case that peers may not be able to sufficiently answer received queries, not because their local schema does not match the initial query adequately, but because the incoming rewritten version has been gradually reduced or corrupted. Therefore, the performance of the query processing procedure is degraded during the rewritings on intermediate peers.

Figure B.1: Propagation of queries among acquaintees. The size of the rectangles reflects the amount of degradation after a rewriting. Q1''' ≠ Q2'' because the queries followed different paths

Figure B.2: Query directed towards a group schema which holds mappings with all group members

In the second approach, nodes are organized (usually by one or more administrators and application experts) into groups of peers that store semantically related data. The administrators, using schema matching tools as well as domain knowledge, create a mediated schema that is representative of the group. Group schemas hold mappings with each of the local databases. This configuration corresponds to multiple data integration system realizations, one per semantic group. Queries are then globally expressed on this mediated schema (see Figure B.2). Obviously, this approach requires human involvement, extensive peer coordination and repetition of the process each time the group changes.

Figure B.3: Part of a P2P system from a health-related environment. The exported schemas are:
DavisDB: Visits(Pid, Date, Did), Disease(Did, DisDescr, Ache), Treatment(Did, Drug, Dosology)
LuDB: Disease(Did, AvgFever, Drug), Patients(Insurance#, Did, Age, Ache)
StuartDB: Treatment(Pid, Did, Date, Symptom, TreatDescr, DisDescr)

B.1.1 Motivating Example

As a motivating example, envision a P2P system where the participating peers are databases of private doctors of various specialties, diagnostic laboratories and databases of hospitals. Figure B.3 depicts a small part of this system, where the nodes are: DavisDB, the database of the private doctor Dr. Davis; LuDB, the database of the pediatrician Dr. Lu; and StuartDB, the database of the pharmacist, Mr. Stuart.
A P2P layer on top of each database is responsible for all data exchange between a peer and its acquaintees. The P2P layer is also responsible for the creation and maintenance of mappings between local schemas during the establishment of acquaintances, along the lines of [101]. Moreover, each peer owns a query rewriting and a query-schema matching mechanism. The local schemas exported by these peers are shown in Figure B.3.

Suppose that Dr. Davis would like to collect from the system general information about patients that have had diseases. He expresses the following query on his database:

Qorig:  SELECT V.Pid, D.DisDescr, D.Ache, T.Drug, T.Dosology
        FROM Disease D, Treatment T, Visits V
        WHERE V.Did = D.Did AND D.Did = T.Did

Having only one acquaintance, the pharmacist's database, Dr. Davis's database propagates Qorig to it. We assume GAV, LAV, or GLAV (i.e., Global, Local, or Global and Local As View) mappings between acquaintees [102]. We assume the following LAV mapping between the DavisDB and StuartDB databases:

M^StuartDB_DavisDB: Treatment(Pid, _, _, Symptom, TreatDescr, DisDescr) :- Visits(Pid, _, Did), Disease(Did, DisDescr, Ache), Treatment(Did, Drug, _),

where the correspondences Symptom = Ache and TreatDescr = Drug are implied. (The mapping is actually a view defined on StuartDB.Treatment, which is matched with a join on DavisDB relations such as: View1(Pid, Symptom, TreatDescr, DisDescr) :- Treatment(Pid, Did, Date, Symptom, TreatDescr, DisDescr) and View1(Pid, Ache, Drug, DisDescr) :- Visits(Pid, Date, Did), Disease(Did, DisDescr, Ache), Treatment(Did, Drug, Dosology). We summarize mappings by omitting view definitions and using '_' for attributes that are not matched.) Thus, the rewritten query on StuartDB is the following:

Q^StuartDB_sr:  SELECT T.Pid, T.DisDescr, T.Symptom, T.TreatDescr
                FROM Treatment T

Obviously, the new query has lost the attribute referring to information about drug dosology, since it cannot be mapped in StuartDB. The node of Mr. Stuart passes the rewritten version Q^StuartDB_sr to Dr. Lu, with whom he has the following GAV mapping:

M^StuartDB_LuDB: Treatment(Pid, _, _, Symptom, _, _) :- Disease(Did, AvgFever, _), Patients(Insurance#, Did, _, _), Age < 13 (because he treats children),

where the correspondences Pid = Insurance# and Symptom = AvgFever are implied. Thus, the rewritten query on LuDB is the following:

Q^LuDB_sr:  SELECT P.Insurance#, D.AvgFever
            FROM Disease D, Patients P
            WHERE D.Did = P.Did, P.Age < 13

Clearly, the new query has lost more attributes, which refer to the description of the disease and the respective drug. Moreover, the new query is more restrictive than the original, since it has an additional condition on "Age". Finally, it is clear that the "Ache" attribute of the original query has been poorly rewritten to "AvgFever", even though the schema of LuDB contains an attribute that represents the exact same concept. Yet, if Dr. Davis was acquainted with Dr. Lu, the supported mappings could include:

M'^LuDB_DavisDB: Visits(Pid, _, Did), Disease(Did, _, Ache), Treatment(Did, Drug, _) :- Disease(Did, _, Drug), Patients(Insurance#, Did, _, Ache),

where the correspondence Pid = Insurance# is implied.
Using the above mapping, Dr. Davis would ideally like his query to be translated as follows:

Q^LuDB_ideal:  SELECT P.Insurance#, D.Ache, D.Drug
               FROM Disease D, Patients P
               WHERE D.Did = P.Did

The above version overcomes the degradation of successive rewriting, in terms of query information loss and further query restriction, as well as the poor matching of the "Ache" attribute.

Our approach enables DavisDB to evaluate Dr. Lu's query translations (e.g., suggest that Ache = AvgFever is not a good correspondence and Pid = Insurance# is a good one) and gradually help him improve the quality of his query rewriting. Through iterative evaluations, Dr. Davis notices that the average answer quality from Dr. Lu is high enough to add him as an acquaintee.

Following the clustering of peers into semantic neighborhoods, our system can initiate the creation of a mediating schema G representative of all three databases. G holds mappings with each of the creating nodes and functions as a point of contact for all incoming queries, whether from inside or outside the cluster. Thus, requesters need only evaluate answers and mappings against one schema, instead of multiple ones. Furthermore, they can effectively speed up the learning process by directing queries to semantically relevant clusters known system-wide.

B.1.2 Our Contribution: GrouPeer

GrouPeer is a system designed to enable accurate query evaluation through semantic overlay clustering and the automatic creation and maintenance of semantic groups in relational P2P databases, without prior schema or meta-schema information. GrouPeer's contributions are twofold:

- Clustering of semantically related peers: Nodes individually decide whether to answer the successively rewritten query or to automatically rewrite its original version. Requesters evaluate the replies along with the returned rewritings and gradually build mappings with remote peers. Eventually, peers with similar local schemas become acquainted and clusters are created around active peers.

- Group schema creation and maintenance: Nodes in well-formed semantic clusters are candidates for initiating the group inference process. The process contacts nodes similar to the initiator inside the already formed cluster, creating a schema representative of the participants. Group schemas are then propagated inside the network, enabling all nodes to direct relevant queries towards a single mediated schema.

Our focus will be on the second part of GrouPeer, i.e., the creation of group schemas from semantically similar nodes in a completely distributed manner. In Section B.2 we present a brief overview of the clustering process. A detailed description can be found in [103]. Section B.3 discusses group schema creation, while Section B.4 presents our experimental evaluation.

B.2 Clustering Peers for Accurate Query Answers

GrouPeer proposes a procedure that avoids the degradation caused by successive rewritings along every query's propagation path. This methodology enables peers to discover others with similar interests and schemas, given that no form of global knowledge (e.g., each peer's schema [104]) is assumed system-wide. Learning is performed by making queries and evaluating their answers, and is materialized as mappings between the respective schemas. As pairs of peers build more mappings, query rewriting becomes more accurate. Eventually, peers with similar schemas become acquainted, gradually restructuring the overlay into semantic neighborhoods. This process is referred to as the clustering process.
We should note here that GrouPeer focuses on queries and mappings that can be expressed as SPJ queries (or, equivalently, conjunctive queries with arithmetic comparisons).

B.2.1 Query Reformulation and Similarity for P2P Database Systems

The goal of the reformulation mechanism is to transform a query so that it can be answered, fully or in part, by an acquaintee. We assume that each peer exports a relational schema to its acquaintees. For the remainder of this work, we shall refer to it as the peer's schema, regardless of whether it represents the whole schema or a part of it (e.g., for security/anonymity purposes). Each pair of acquaintees holds peer mappings between their schemas, which are considered to be of the well-known GAV/LAV/GLAV form.

The available query rewriting algorithms restrict their usage to queries that can be completely rewritten under a set of mappings. Yet, this is not suitable for a P2P environment, where peers are satisfied with information that shares characteristics similar to those of their query, not necessarily (and precisely) all of them. In GrouPeer, peers utilize a modified reformulation mechanism based on existing rewriting algorithms. Our mechanism allows a rewritten version of a query to maintain only the attributes and conditions that "survive" the query translation performed among acquaintees. After the translation of a query to their local schemas, peers can proceed with its computation.

Our goal is to measure the similarity between different versions of a query and its original formulation in order to decide which constitutes a more accurate rewriting. The similarity function Msim proposed in GrouPeer measures the semantic deviation of the target query Qrewr from the original Qorig. Assuming that every query is defined as a set of elements, one for each "select" attribute and one for each "where" condition, the rewritten version can deviate from the original one in the following ways: elements of Qorig cannot be mapped in Qrewr, or extra conditions are introduced in Qrewr. Formally, for two query versions Qorig, Qrewr and a set of user-specified weights w_i that denote the importance of each element in the semantics of Qorig, the similarity of the rewritten version to the original is:

Msim(Qorig, Qrewr) = 1 - ( Σ_{elements i not mapped} w_i + #{conditions in Qrewr but not in Qorig} ) / Σ_i w_i

Msim is structured such that dissimilar elements diminish its value. Perfect similarity is represented by Msim = 1. (An illustrative sketch of this computation is given below.)

B.2.2 Description of the Clustering Process

In order to achieve the discovery of remote relevant peers, the key idea of our method is to propagate along the query path not only the successively rewritten version, but also the original one. In this way, peers can individually decide which one to answer. Peers are assumed to be equipped with a query rewriting mechanism and an automatic schema-matching tool. The rewriting mechanism is used to reformulate queries received from acquaintees based on the respective mappings. The automatic schema-matching tool is used in order to translate queries (or parts of them) expressed on schemas for which mappings are not available.

Successive query reformulation produces query versions that deviate from the original query. Obviously, if the chain of peer mappings used for the rewriting is poor in information relevant to the query (i.e., query elements cannot be reformulated accurately), this can result in fast degradation within a few hops.
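Returning to the similarity function Msim of Section B.2.1, the following is a minimal, illustrative Python sketch of its computation. The query representation (a dictionary of weighted elements) and all names are hypothetical, not part of GrouPeer's actual implementation, and the formula follows the reconstruction given above.

# Minimal sketch of the Msim similarity of Section B.2.1 (hypothetical structures).
# A query is modeled as a mapping from each element (a "select" attribute or a
# "where" condition) to its user-specified weight w_i in the original query.

def msim(orig_elements, surviving_elements, extra_conditions):
    """orig_elements: dict {element: weight} describing Qorig.
       surviving_elements: set of Qorig elements that were mapped in Qrewr.
       extra_conditions: number of conditions present in Qrewr but not in Qorig."""
    total_weight = sum(orig_elements.values())
    unmapped_weight = sum(w for e, w in orig_elements.items()
                          if e not in surviving_elements)
    # Unmapped elements and extra conditions diminish the value;
    # Msim = 1 corresponds to a perfect rewriting.
    return 1.0 - (unmapped_weight + extra_conditions) / total_weight

# Example: one of four equally weighted elements is lost, no extra conditions.
q_orig = {"select:Pid": 1.0, "select:DisDescr": 1.0,
          "select:Drug": 1.0, "where:Did=Did": 1.0}
print(msim(q_orig, {"select:Pid", "select:DisDescr", "where:Did=Did"}, 0))  # 0.75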
Query elements that cannot be translated through existing mappings are eliminated in the rewritten version. Although the following nodes on the query path may encapsulate the eliminated concepts in their schemas, they still cannot contribute them to the original query, because the version they receive does not include them. Our approach keeps the eliminated concepts and tries to match them in subsequent reformulations.

Overall, an initiated query Qorig is propagated along the query path. On each node, the query is rewritten through mappings with the previous node to Qsr, which is augmented with automatically rewritten query elements to Qsra. Also, Qorig is automatically rewritten from scratch to Qar. The answering node compares the two rewritten versions with the original one using our similarity function, and answers the version that is most similar to it. The query initiator evaluates the satisfiability of the received answer and sends its feedback to the answering peer about the query version it chose to reply to. According to the evaluation, the query replier keeps a record of bad and good rewritings of the initiator's schema elements. Gradually, content providers build mappings with the initiators through the queries they receive and answer on their behalf. Moreover, the initiators log the evaluation of query answers from each replier. Based on this, the initiators can decide that they have common interests with a remote peer and ask to become acquainted. New acquaintees can base their communication on mappings already created.

B.2.3 GrouPeer Protocol Internals

In the following we describe basic algorithm internals, specifically the query routing scheme and the addition/deletion of acquaintances.

1) Routing: Our method utilizes informed walks with a TTL parameter in order to propagate queries to nodes in the overlay. The requester deploys k walkers, each following independent paths. A node forwards a query to the neighbor(s) whose schemas have the highest similarity value with respect to this query. Note that these values can be computed, since neighbors share this information by default in our protocol.

2) Adding/dropping acquaintees: We augment our clustering algorithm by allowing the dropping of existing neighbors in order to gradually improve on the random initial setup: new acquaintees are added whenever the local evaluation average is over θ_PI and existing ones are dropped when its value is below θ_PILow, provided we have received at least THR replies from that node. This confidence parameter is important to ensure that the local evaluation is based on an adequate number of queries. We also define a maximum number of connections per peer, MAXDEGREE, which forces a neighbor addition to be preceded by the dropping of the neighbor with the smallest schema similarity if this limit is reached. A link is dropped whenever the local evaluation average is below θ_PILow, provided the degrees of both nodes are at least MINDEGREE. This ensures that peers do not get disconnected from the network. (A small illustrative sketch of this add/drop rule is given below.)

B.3 Interest Groups in GrouPeer

We now describe GrouPeer's group schema creation process. Our goal is to materialize the creation of semantic clusters by combining the overlay clustering presented in the previous section with a distributed process that iteratively merges local schemas into the final group schema.

After the performed clustering, peers with similar information are close in the overlay, achieving an increase in the number and quality of answers.
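Before moving on to explicit groups, the following minimal Python sketch recaps the acquaintance add/drop rule of Section B.2.3. All data structures and function names are hypothetical stand-ins; the threshold values are the ones later used in Section B.4.1, while the degree limits are arbitrary example values.

# Illustrative sketch of the neighbor add/drop rule of Section B.2.3 (hypothetical names).
THETA_PI, THETA_PI_LOW, THR = 0.7, 0.3, 5   # thresholds as set in Section B.4.1
MAXDEGREE, MINDEGREE = 8, 2                 # assumed example limits

def update_acquaintances(neighbors, similarity, degree_of, remote, evaluations):
    """neighbors: set of this peer's current acquaintee ids.
       similarity: dict peer_id -> schema similarity with this peer.
       degree_of: dict peer_id -> current degree of that peer.
       evaluations: answer-quality scores this peer has logged for `remote`."""
    if len(evaluations) < THR:                          # confidence requirement
        return neighbors
    avg = sum(evaluations) / len(evaluations)
    if avg > THETA_PI and remote not in neighbors:      # promising remote peer
        if len(neighbors) >= MAXDEGREE:                 # make room: drop least similar
            neighbors.discard(min(neighbors, key=lambda n: similarity.get(n, 0.0)))
        neighbors.add(remote)
    elif avg < THETA_PI_LOW and remote in neighbors:    # poorly evaluated acquaintee
        if len(neighbors) > MINDEGREE and degree_of.get(remote, 0) > MINDEGREE:
            neighbors.discard(remote)                   # keep both endpoints connected
    return neighbors

# Example: after 5 good replies, the remote peer "D" is added as an acquaintee.
print(update_acquaintances({"B", "C"}, {"B": 0.4, "C": 0.2}, {"D": 3},
                           "D", [0.9, 0.8, 0.8, 0.7, 0.9]))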
Yet, the overlay clustering obtained this way is implicit, in that there is no information about the identity and characteristics of a cluster or the peers that participate in it. We define explicit knowledge of a cluster to consist of knowledge about its participants, their schemas, the cluster's schema and the relations between each participant's schema and the group schema. Such information has multiple advantages.

First, it enables peers to direct relevant queries towards a single, authoritative schema. Instead of traversing multiple paths and performing learning with multiple sources, query initiators interact with a single "virtual" schema (the group schema). Participants already hold complete mappings with this target schema. Joining nodes can also benefit by selecting appropriate acquaintees or speeding up the learning process instead of choosing random entry points. Finally, since our system operates in a dynamic environment, with node arrivals/departures and possible schema or workload changes, dynamically created group metadata can be automatically refreshed.

We call these explicit clusters interest or semantic groups, and the process of creating them the group inference process.

B.3.1 Group Inference

The process comprises the following steps: (a) Initialization: who initiates the group inference and when, (b) Propagation: how the process advances among peers of the same group, and (c) Termination and Refinement: when the process is over or reiterated.

Initialization: There are two main considerations in the initialization process. First, the nature of our application requires that the group inference is performed in a distributed manner, without global coordination. Hence, peers should independently decide to start the process that creates the respective schema. Second, we must ensure that the initiator(s) are qualified representatives of a semantic group, given that the only evidence of a group's existence in our system is the changed overlay topology and the local state stored at each node. In GrouPeer, a peer may consider itself part of a semantic cluster and initiate the group inference process if the following requirements are met:

- The similarity of answered queries to the original ones (measured at this node) is above a certain threshold To.

- The average rate of queries sent from this node is over a certain threshold.

For example, we may require that any prospective initiator has received replies of average similarity greater than 0.7 and that it has made at least 50 queries in the last 2 hours. These requirements satisfy both conditions described before, by ensuring that the respective peer is an active participant (rate of queries) and a member of a well-formed cluster (similarity threshold).

Since any prospective initiator is a qualified representative of the group, its local schema will also become a point of reference for the inferred one. Thus, the peer schemas considered for the formation of the group schema should not substantially differ (in semantic distance) from the schema of the initiator. The following function calculates the directed semantic similarity, SS, of two relational schemas:

SS(S, T) = ( Σ_i Σ_j w_ij · Mapped_T(SR_ij) ) / ( Σ_i Σ_j w_ij · SR_ij )

In the above function, S is the source schema and T is the target schema. SS calculates the portion of S's attributes (SR) that are mapped on T. Obviously, SS(S, T) ≠ SS(T, S) in general. In order for SS to be computable, we have to know the mapping between S and T.
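As a small illustration of the SS function just defined, here is a minimal Python sketch. The schema representation (per-relation attribute weights and a set of mapped attributes) and all identifiers are assumptions made for this example only.

# Illustrative sketch of the directed semantic similarity SS(S, T) (hypothetical names).
# source: {relation_name: {attribute_name: weight}} describing schema S.
# mapped: set of (relation, attribute) pairs of S that have a correspondence on T.

def ss(source, mapped):
    total = sum(w for attrs in source.values() for w in attrs.values())
    covered = sum(w for rel, attrs in source.items()
                  for attr, w in attrs.items() if (rel, attr) in mapped)
    return covered / total      # portion of S's weighted attributes mapped on T

davis = {"Visits": {"Pid": 1, "Date": 1, "Did": 1},
         "Disease": {"Did": 1, "DisDescr": 1, "Ache": 1}}
# Suppose only Visits.Pid, Visits.Did and Disease.Ache map onto the target schema:
print(ss(davis, {("Visits", "Pid"), ("Visits", "Did"), ("Disease", "Ache")}))  # 0.5

Note that the direction matters: swapping the roles of S and T generally gives a different value, since coverage is measured with respect to the source schema only.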
Producing this mapping requires a composition of mappings between acquaintees until a mapping from S to T is obtained. Related work [105] describes efficient composition schemes that can be utilized. Nevertheless, GrouPeer assumes mappings that are simple 1-1 correspondences and can be easily composed. SS manages to measure semantic similarity because it takes into consideration the mapping of concepts beyond their structural interpretation at the schema level. Moreover, since SS ignores the schema structure, it is very easily calculated.

In GrouPeer, we require that all considered local schemas be at least t-similar to the initiator's schema: SS(S_I, T) ≥ t, for all T. The initiator peer I is called the originator of the group, its schema S_I is the origin of the group schema, and the maximum similarity distance between the origin and the peer schemas that participate in this process is the semantic radius t of the group.

Propagation: Initiator I (with schema S_I) initializes the group schema to its own and creates a stack ST(I) with its acquaintees that are part of the cluster. Specifically, ST(I) = {A_1, A_2, ..., A_m} is an ordered set of elements A_j = {P_j, SS(S_I, S_Pj)}, where P_j is a peer with schema S_Pj. The elements A_j refer to I's most similar acquaintees: SS(S_I, S_Pj) ≥ t for j = 1, ..., m and SS(S_I, S_Pj) ≥ SS(S_I, S_Pj+1) for j = 1, ..., m-1. The initiator propagates the inference procedure to the first peer on the stack.

Each intermediate node P merges its own schema with the group schema it receives. P then determines its acquaintees P_j for which SS(S_I, S_Pj) ≥ t, adds the respective pairs {P_j, SS(S_I, S_Pj)} to ST(I) and orders it. SS(S_I, S_Pj) is calculated indirectly, as the product SS(S_I, S_P) · SS(S'_P, S_Pj), where S'_P is the part of S_P mapped on S_I. Essentially, SS(S_I, S_P) aims to measure how much of the semantics of S_I can be found in schema S_P, independently of other semantics that the latter captures. The only way to measure this (without automatic matching) is through the chain of mappings from S_I all the way to S_P. As such, the value of SS(S_I, S_P) depends on the path that the inference process follows and fails to consider concepts that exist both in S_I and S_P but not in the schemas of intermediate nodes. However, this formula produces a satisfactory result, since nodes are visited in decreasing order of similarity with I and clustering precedes this process, so a peer P will have higher similarity with the originator than successor nodes in the stack. Moreover, if a peer P already in ST(I) is considered for addition, the entry with the highest SS(S_I, S_P) value is kept.

Even though the participation of peers in the inference process is judged by a part of their schemas, their whole schema contributes to the inferred group schema. The goal of the inference process is to produce a schema that represents the semantics encapsulated in the cluster. In order to determine the cluster's semantic borders, we use the semantics of the initiator as a reference. In this way, the process is safe from producing a schema much broader than, or distorted from, the initiator's interests.

Termination: The group inference procedure ends when the stack of participating peers becomes empty. However, if many peers have schemas very similar to the originator's schema, or the similarity threshold t is small (i.e., the semantic radius is big), then it may be the case that the stack grows at each step. The inference procedure is then prolonged, taking into account a large number of peers.
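The propagation step just described can be summarized by the following simplified Python sketch. All structures (the schema dictionaries, the acquaintance lists, the schema-merging callback) are hypothetical stand-ins, the pairwise SS values are assumed to be precomputed between acquaintees, and the MaxP cap anticipates the termination condition discussed next.

# Simplified sketch of the group inference propagation of Section B.3.1
# (hypothetical structures; not GrouPeer's actual implementation).

def infer_group(initiator, schemas, acquaintees, ss_direct, merge_schema, t, max_p):
    """initiator: id of the originator I.
       schemas: dict peer_id -> local schema.
       acquaintees: dict peer_id -> list of neighbor ids.
       ss_direct: dict (a, b) -> SS between the schemas of acquaintees a and b.
       merge_schema: function merging a peer schema into the group schema.
       t: semantic radius threshold; max_p: cap on visited peers (MaxP, see below)."""
    group_schema = schemas[initiator]
    visited = {initiator}
    # Stack entries are (peer id, estimated SS(S_I, S_P)), kept in decreasing SS order.
    stack = sorted(((p, ss_direct[(initiator, p)]) for p in acquaintees[initiator]
                    if ss_direct[(initiator, p)] >= t), key=lambda e: -e[1])
    while stack and len(visited) < max_p:
        peer, sim = stack.pop(0)
        if peer in visited:          # duplicate entry: the higher estimate was already used
            continue
        visited.add(peer)
        group_schema = merge_schema(group_schema, schemas[peer])
        for q in acquaintees[peer]:
            # Composed estimate of SS(S_I, S_q); the full schema of `peer` is used
            # here as an approximation of the part of it that is mapped on S_I.
            est = sim * ss_direct[(peer, q)]
            if q not in visited and est >= t:
                stack.append((q, est))
        stack.sort(key=lambda e: -e[1])
    return group_schema, visited

In this sketch a peer reachable through several paths may appear in the stack more than once; since entries are kept sorted by decreasing similarity, the entry with the highest estimate is the one that gets processed, matching the rule stated above.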
After a certain number of iterations, there is usually no point in considering more schemas, because they do not contribute significantly. In order to reduce the duration of the inference and save valuable network resources, we add a limit on the maximum number of encountered peer schemas, MaxP, as a termination condition. MaxP is not a TTL condition, since successive hops are not always on the same path; MaxP refers to the total number of participating nodes. Finally, there may be situations where the inference procedure terminates due to MaxP while important semantic information is still being added, or continues until MaxP is reached while little information is assimilated. To rectify this, GrouPeer also considers the degree of change that occurs to the inferred schema during each merging step. In case of a poorly chosen MaxP value, this criterion can be used to calibrate the parameter.

B.3.2 Discussion on the Group Inference Process

In this section we briefly cover issues related to the inferred groups, such as schema creation and merging, group broadcast, maintenance and interaction.

Group Schema Creation: As mentioned earlier, the inference of the interest group schema is achieved gradually by merging the schemas of peers in consecutive steps. The goal of this procedure is to produce a schema that represents the majority of the peers that belong to the respective cluster. Therefore, the merged schema is neither the intersection nor the union of the members of the cluster. Assuming that such a cluster comprises numerous peers, it is straightforward that the intersection of their schemas would probably be empty and their union would be too large. Thus, our merging procedure has to incorporate only the most "popular" elements of the respective peer schemas in the merged schema. Yet, inferred schemas should also be representative of almost all their source peer schemas; therefore, our merging procedure should also perform high compression before discarding schema elements. Finally, we note that the whole procedure is based only on information available at the peers, i.e., schemas and mappings between them. Specifically, we assume that peer mappings are GAV/LAV/GLAV and peer schemas are relational (i.e., the only internal mappings are foreign key constraints). A mapping is considered to be a set of 1-1 correspondences between attributes that hold with an optional set of value constraints on some attributes. Moreover, peers do not carry semantic information about their schemas and mappings. The merging algorithm is described below:

Input: the merged schema S_IG; the peer schema S_P and a set of mappings M between them; a set of intra-schema mappings M_i
Output: the new merged schema S'_IG, a set of mappings M' with the following node on the path, a set of intra-schema mappings M'_i and a dictionary D
Initialization: S_IG = ∅, M = ∅, M_i = ∅, D = ∅

On each peer of the network path perform the following steps:
Step 1: Add to S_IG all the relations of S_P
Step 2: If M = ∅, set S'_IG = S_IG and go to Step 7
Step 3: Merge relations that share the same key
Step 4: While the number of relations is over the limit do:
  a. Select pairs of relations that have the most correspondences between their attributes and that do not depend on value constraints
  b. From the pairs of (a), select the pairs of relations that have the fewest unmapped attributes and merge them
  c. Remove from M the mappings used for the merge of (b) and add the involved correspondences to the dictionary D
Step 5: Set S'_IG = S_IG, M'_i = M_i ∪ M
Step 6: Select the next node, P', of the network path from the acquaintees of peer P. Set M' equal to the set of mappings between P and P'. Change attribute and relation names of P in M' to the respective names in the merged schema S'_IG
Step 7: Send S'_IG, M'_i and M' to P'

Steps 3 and 4 refer to the merging of a pair of relations. The following procedure performs the merging of two relations:

Input: a pair of relations R1, R2 and a set of mappings M
Output: the merged relation R
Initialization: R = ∅
Step 1: Add to R all attributes of the relations R1, R2
Step 2: While the number of attributes is above the limit, if it is possible do:
  a. if there are any, merge attributes that are involved in only one correspondence; else go to b
  b. merge the attributes that are involved in at least one correspondence, starting from those participating in the fewest correspondences

At the end of the schema merging procedure, i.e., when all relevant peer schemas have been merged, relations and relation attributes that have been met very rarely during the procedure can be dropped.

Example: Assume that Dr. Davis is a doctor with a peer database whose schema is:

S_DavisDB: Visits(Pid, Date, Did)
           Disease(Did, DisDescr, Symptom)
           Treatment(Did, Drug, Dosology)

and Dr. Lu is another doctor with a peer database whose schema is:

S_LuDB: Sickness(Did, AvgFever, Drug)
        Patients(Insurance#, Did, Age, Ache)

The schemas of DavisDB and LuDB are presented in Figure B.4. The databases have the following mapping:

M1^LuDB_DavisDB: Disease(Did, _, Symptom), Treatment(Did, Drug, _) :- Sickness(Did, AvgFever, Drug),

where the correspondences Symptom = AvgFever and Disease = Sickness are implied. In this case, as shown in Figure B.5, there are three correspondences encapsulated in mapping M1. We assume that the peer of Dr. Davis initializes the schema merge. Thus, S_IG is initialized to S_DavisDB. After the first step of the schema merging algorithm, S_IG contains all the relations of S_DavisDB and S_LuDB. Since there is a mapping among the relations, the algorithm goes on to Step 3: relations Disease and Sickness are merged into one, since they share the same key (see Figure B.6).

Figure B.4: Two schemas to be semantically merged
Figure B.5: S_IG is initialized to S_DavisDB and there is mapping M1 between S_IG and S_LuDB
Figure B.6: Relations Disease and Sickness of Figure B.5 are merged
Figure B.7: Relations Disease/Sickness and Treatment of Figure B.6 are merged

Thus, attributes Symptom and AvgFever are merged. The correspondence Disease/Sickness.Drug = Treatment.Drug is kept as an internal one. Also, the dictionary D is enriched with the correspondences Disease = Sickness and Symptom = AvgFever; the schema actually keeps one name for each relation or attribute from among the alternative ones. At the end of the schema merging procedure we propose that the schema keeps, for relation and attribute names, the most common ones encountered during the procedure. Assuming that the algorithm goes on to Step 4, relations Disease/Sickness and Treatment are merged (Figure B.7), since they are the only ones related with a mapping. Now there is one attribute named "Drug" and it is part of the relation key, even though just one of the attributes that were merged was a key.
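As a small illustration of Step 3 of the merging algorithm above, the following Python sketch merges two relations that share the same key, reproducing the Disease/Sickness step of the example. The data structures and function names are hypothetical, and only the 1-1 correspondences of the mapping are used.

# Illustrative sketch of merging two relations that share the same key (Step 3).
# A relation is represented as (name, key, set_of_attributes); hypothetical structures.

def merge_relations(rel_a, rel_b, correspondences, dictionary):
    """correspondences: dict attribute_of_b -> attribute_of_a (1-1 correspondences).
       dictionary: accumulates alternative names, like the dictionary D of the text."""
    name_a, key_a, attrs_a = rel_a
    name_b, key_b, attrs_b = rel_b
    assert key_a == key_b, "Step 3 only merges relations sharing the same key"
    merged = set(attrs_a)
    for attr in attrs_b:
        if attr in correspondences:                   # e.g. AvgFever corresponds to Symptom
            dictionary[attr] = correspondences[attr]  # keep one name, remember the other
        else:
            merged.add(attr)                          # attributes without a match are kept
    dictionary[name_b] = name_a                       # e.g. Sickness is recorded as Disease
    return (name_a, key_a, merged)

d = {}
disease = ("Disease", "Did", {"Did", "DisDescr", "Symptom"})
sickness = ("Sickness", "Did", {"Did", "AvgFever", "Drug"})
merged = merge_relations(disease, sickness, {"AvgFever": "Symptom"}, d)
print(merged)   # Disease keeps Did, DisDescr, Symptom and gains Drug
print(d)        # records AvgFever = Symptom and Sickness = Disease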
Additional iterations can merge relations based on foreign key constraints, since no other internal mappings exist.

The schema merging procedure produces the interest group schema, but also a set of internal mappings and a dictionary. The internal mappings are the peer mappings that were not consumed in the successive schema merges. These hold additional syntactic and, implicitly, semantic information for the interest group schema elements; thus, they can be very helpful to peers that would like to join the group and create mappings with their local schema. Moreover, this set of mappings contains the collection of all mappings with value constraints met during the merging procedure. This kind of mapping cannot be consumed: the involved relations/attributes cannot be merged, since they are mapped under certain conditions (the value constraints).

Group Broadcast: After a group schema is created, metadata about this group is periodically propagated across the overlay. This metadata includes the group schema, some or all of the IDs of the participating nodes (the contact list), the time of creation and the originator. Any peer in the system can rewrite its queries to the available group schemas. Queries can then be directly forwarded to the group members. In this way, we manage to bypass the information loss of multiple rewritings, since a query is translated only once, through the group schema. Making the participating nodes known to all peers enables any remote node to enter the cluster. Peers can now become acquainted with group nodes whose schemas are very similar to theirs, without having to wait to be gradually clustered.

Group Interaction and Merging: While our completely decentralized approach to group creation is necessary, it also raises some consistency issues, since more than one group can be created, even simultaneously. This can affect correct behavior only if nodes similar to the initiator choose to create a group and the two processes overlap in the overlay. Topologically close peers initiating the process over different semantic groups pose no problem. The same is true if the initiators' hop distance is such that it would not allow either procedure to incorporate both groups in its progress.

In order to avoid extended negotiation rounds between competing potent originators, we require that initiators announce their intention to create a group to their neighborhood. In effect, this forces competing initiators with schemas similar to the first initiator's to postpone or abort their process, if they are inside the announcement neighborhood. We note that the announcement neighborhood must have a radius proportional to the semantic radius of the group to be inferred. If such peers do not eventually participate in the group inference, they can add themselves to the overlay neighborhood or participate in the subsequent maintenance of the group. Nevertheless, peers are eligible to initiate a new group if they have not received a relevant announcement or if they incorrectly calculate their similarity with a known initiator. It is possible that such originators will create groups that have a significant semantic overlap with existing ones. Thus, these groups are subject to being merged into a unified schema. After both groups are advertised, the respective originators can detect the similarity between the inferred schemas and initiate the merging process.
This involves choosing a new originator among the two existing ones, merging the two schemas and advertising the new group using the new initiator and the union of the contact lists. We must note that an important property must hold: groups created by similar initiators will also be similar, and groups created by dissimilar initiators will be dissimilar. This is essential because it justifies that authoritative peers can independently initiate the process (and thus block other similar ones from doing it). GrouPeer's clustering process assures that this property holds, something also evident in our evaluation.

Group Maintenance: The maintenance process refers to updates in the contact list as well as the group schema itself. Maintenance is necessary, since peers join the group while others that belong to the group leave or change their local databases over time. There are two ways to decide how to maintain a group schema: the first is to allow the originator to initiate the inference process periodically; the second is to allow any eligible peer to restart the process. In order for both approaches to work, we define an epoch factor to represent the maximum life-span of a group, after which it becomes invalid. Then, the originator can invoke the inference process every epoch minutes and re-transmit the new group inside the overlay. This way, group metadata is kept as a form of soft state inside our network and gets promptly updated. By allowing any eligible peer to undertake the role of the originator, we eliminate inconsistencies created by changes in the original initiator and also ensure that the inferred schema does not over-specialize. Obviously, there is a trade-off between the cost of repeating the process and the anticipated query performance when using stale groups.

B.4 Experimental Evaluation

To evaluate the performance of GrouPeer, we use a message-level simulator written in C. By default, we randomly choose 100 nodes that play the role of the requesters, each making 100 queries to the system. We present results for 1,000-node random graphs (an adequate number of participants for our motivating application) with average node degrees around 4, created by the BRITE [46] topology generator. Results are averaged over 20 graphs of the same type and size, with 100 runs in each.

For the schemas stored at each node, we use two initial relational schemas, whose tables and attributes are uniformly distributed at nodes. The initial schema comprises 5 tables and 33 attributes. Seven attributes are keys, with a total of 11 mappings (correspondences) between them. Each peer stores 10 table columns (attributes) on average. Queries are formed on a single table or on multiple tables if applicable (join queries). We experimented with larger schemas (90 attributes over 12 tables) and a flat 100-attribute single table (no mappings between attributes). Because the creation of the individual schemas is computer-generated, an increase in the schema size reduces the amount of default similarity between nodes (unless more attributes are distributed per node). Nevertheless, the important observation is that, in all cases, GrouPeer maintains its relative advantages and behaves in a similar fashion.

Our basic performance metrics are the average similarity or accuracy of answers to the original queries (i.e., the similarity of the answered query to the original one, evaluated at the requester), as well as the number of nodes that provide an answer.
B.4.1 Clustering Results

For the automatic rewriting of the original query, we simulate a possibly erroneous outcome by altering the "perfect" rewriting by 50%. This is then gradually ameliorated through our learning process. We set the maximum number of allowed hops per query to TTL = 6 and the number of deployed walkers to k = 3, with θ_PI = 0.7 and θ_PILow = 0.3, using a confidence threshold of THR = 5 replies. Finally, we assume that the returned tuples do not play any role in the answer evaluation.

Figure B.8 shows the performance of our algorithm when varying the number of queries posed by each of the 100 randomly selected requesters. Our method manages to return far more accurate results, achieving a similarity of around 85% in the steady state. The accuracy increases quickly as more queries are created, since new acquaintees are added and neighbors with no contribution are dropped. We also present the respective values for answering the original and the rewritten versions of the query. Both the original and the consecutive rewritten queries are answered with more precision. Our method's learning feature allows the automatic rewriting of the original query to improve over time, as mappings are built between requester-replier pairs. Our clustering mechanism helps bring more information-rich nodes closer to requesters, which also increases the accuracy of the consecutive rewritings. Our scheme is compared against Naive, which uses the same forwarding scheme as our method but answers only the successively rewritten query version. Our method can never fall below Naive's performance, but steadily performs better with more queries. Finally, it is almost as bandwidth-efficient as Naive, since the few additional messages reported are due to the communication between sources and requesters during the learning mechanism, as well as the message exchange when a new acquaintance is made.

Figure B.8: Similarity of answers to the original and rewritten query versions over variable queries per requester
Figure B.9: Similarity of answers to join queries over variable queries per requester

Next, we monitor GrouPeer's performance by specifically tracking join queries in the same setting as the previous experiment. Figure B.9 shows the results for our method and two different versions of Naive: the regular one we described before (which, like GrouPeer, allows the rewriting of a join query even if the join is not mapped) and one that returns an empty query if the join(s) are not preserved. As before, we notice that GrouPeer performs at least as well as the original naive method and quickly increases its accuracy as more queries are generated. The stricter naive method returns more similar results for few queries compared to our scheme. This happens because this method favors a complete (and thus more accurate) rewriting. Nevertheless, this comes at the cost of retrieving an answer from about 1/3 of the peers that GrouPeer gets answers from.

Figure B.10: Ratio of GrouPeer's clustering versus the optimal, given an equal number of acquaintees

We also examine the quality of the clustering process as a means of locating nodes with similar schemas.
For each requester, we measure the average similarity with its acquaintees at the end of the querying process and compare it with the best possible scenario: having all top-m nodes in the overlay with schemas most similar to the initiator as its acquaintees, where m is equal to the total number of acquaintees this node has at the end of the querying process. We report the ratio of the actual average similarity to this optimal value in Figure B.10.

Our methodology achieves clustering that is very close to the best achievable value in the steady state, while its quality quickly reaches that level. As more nodes become active, the process improves, since in GrouPeer nodes can take advantage of their neighbors' knowledge/connectivity. The ideal restructuring is hard to achieve because of the random initial connectivity: the most similar nodes may not all receive queries and thus are not considered by the clustering process. Specifically, nodes may either be outside the query range or be left out of the walkers' paths. By having more active nodes, our method effectively reduces the influence of the latter, since query initiators get replies from better nodes, taking advantage of other requesters' clustering. Figure B.10 shows that in the steady state and with 10, 100 and 500 requesters, GrouPeer achieves 77%, 88% and 91% of the optimal clustering respectively. We can identify 88% of the optimal nodes in the entire network by having only 10% active nodes, with each of them contacting at most k·TTL = 18 nodes per query (this amounts to less than 2% of the peers).

Table B.1 summarizes the performance of GrouPeer with a different number of query attributes (each requester making 100 or 500 queries). As the number of attributes per query increases, the accuracy of the answers slightly drops, since a smaller percentage of attributes has the chance to be satisfied. Note that the quality of the clustering increases up to a point, after which it starts to slightly decrease. This is due to the fact that there are two competing factors that affect the clustering process: the more attributes in a query, the more precise the clustering process becomes, since the initiator learns more information for its schema as a whole; the query similarity (which affects clustering through the Ev function), on the other hand, decreases with the number of attributes.

Table B.1: Performance varying the number of query attributes

                           Similarity   Clustering
attr = 2, queries = 100       0.87        80.2%
attr = 2, queries = 500       0.89        82.1%
attr = 4, queries = 100       0.80        86.1%
attr = 4, queries = 500       0.84        88.4%
attr = 6, queries = 100       0.71        83.0%
attr = 6, queries = 500       0.76        84.5%
attr = 8, queries = 100       0.67        80.0%
attr = 8, queries = 500       0.71        81.0%

We tested our method on graphs of different sizes (from 100 to 4K nodes) and different connectivities (power-law). The results of these runs are qualitatively similar to the presented ones.

B.4.2 Group Inference Results

In this section we present results on the group schema creation of GrouPeer. Our basic setup remains the same, with the exception that queries on created groups are reformulated using the inferred schema(s). Our metrics are the percentile increase/decrease in accuracy and number of replies compared to clustering, as these are measured on the first created group. The maximum size of the inferred schema is always in the order of the size of the initial schema used to produce the local ones during start-up.
B.4.2 Group Inference Results

In this section we present results on the creation of group schemas in GrouPeer. Our basic setup remains the same, with the exception that queries on created groups are reformulated using the inferred schema(s). Our metrics are the percentile increase or decrease in accuracy and in the number of replies compared to plain clustering, as measured on the first created group. The maximum size of the inferred schema is always on the order of the size of the initial schema used to produce the local ones during start-up.

When the first group is created, we direct relevant queries to the inferred schema and measure their similarity compared to the clustering produced at the time of group creation. Initiators that belong to the group hold the complete mappings with the group schema, avoiding reformulation errors. Non-members utilize the same learning feature as normal nodes, assuming a "virtual" host holding the group schema as their contact.

First, we vary the maximum group size limit, MaxP, as well as the minimum similarity of participating peers to the initiator node, t. Figures B.11 and B.12 show the obtained results for 100 requesters and a maximum of 100 queries each.

Figure B.11: % Increase in answer similarity over variable MaxP and t

Figure B.12: % Increase in number of answers over variable MaxP and t

As t increases, the group becomes more specialized and less general. In contrast, small similarity values produce groups that are too general and incorporate many concepts foreign to the initiator. Initiators choose to send queries to a schema only if they deem it advantageous. As a result, specialized groups (i.e., high values of t) receive fewer queries, while more "general" ones receive more but cannot answer them all satisfactorily. Thus, there exists a point beyond which grouping ceases to increase its relative gains over clustering, as our graphs show. Both metrics increase with MaxP. This is reasonable, since more nodes can participate and produce results. Very specialized grouping causes significantly less populated groups, which in turn reduces the number of returned answers. As groups become more general (around t = 0.6), an improvement of 13-23% in accuracy is achieved, while the gains in replies range from 40% to 900%. As t decreases, the gains in accuracy decrease but more results are generated. These curves show that a value of t around 0.65 (similarity to the group initiator) combined with MaxP = 80 achieves good results without excessive generalization. These are our default values for the rest of this discussion. Also, in all experiments we set To = 0.7. A sketch of the group formation implied by these two parameters follows.
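The sketch below only illustrates how t and MaxP bound group membership; it is not GrouPeer's actual group-inference algorithm, which is described earlier in this appendix. It assumes the initiator can obtain similarity estimates for candidate peers, and all names are hypothetical.

    def form_group(initiator, candidates, similarity, t=0.65, max_p=80):
        """Select group members: peers at least t-similar to the initiator,
        keeping at most max_p of the most similar ones (the cap is applied to
        members; whether the initiator counts toward MaxP is not modeled)."""
        eligible = [(similarity(initiator, p), p) for p in candidates
                    if similarity(initiator, p) >= t]
        eligible.sort(key=lambda sp: sp[0], reverse=True)
        members = [p for _, p in eligible[:max_p]]
        return [initiator] + members

Higher values of t shrink the eligible set (more specialized groups), while MaxP caps how large a general group can grow, matching the trends in Figures B.11 and B.12.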
Next, we try to determine the quality of the created group based on its creation time, i.e., the number of queries at which it was created. Figures B.13 and B.14 show the percentile improvement in our basic metrics when the first group is created at various points in the clustering process.

Figure B.13: % Increase in answer similarity over variable group creation time

Figure B.14: % Increase in number of answers over variable group creation time

We observe a decrease in the relative gains in accuracy and an increase in the corresponding number of answers. This happens because clustering improves with time, while the number of results slightly decreases due to the forwarding process: more walkers now cross paths on relevant nodes. What is important is that groups created as soon as possible (which we expect to be the frequent case) show about 20% more accurate answers and return about three times more results compared to clustering, even though the inference procedure is performed on a less optimally clustered overlay. Groups that are created later also exhibit noticeable gains, especially in terms of the number of replies. When more requesters are active, the clustering process is expedited, which suits the purposes of grouping.

Table B.2 shows the exact performance figures using our default parameters for various combinations of requesters and queries per requester. The figures in parentheses show the percentile increase compared to simple clustering for the same number of queries.

Table B.2: Performance comparison with clustering
                 100 requesters                 400 requesters
    qu/requ      Sim             #Answ          Sim             #Answ
    10           0.68 (+17.8%)   55.0 (+411%)   0.70 (+19.9%)   53.7 (+387%)
    50           0.71 (+17.7%)   51.8 (+370%)   0.71 (+19.0%)   61.6 (+461%)
    100          0.72 (+18.2%)   55.8 (+413%)   0.72 (+19.2%)   60.0 (+444%)

We notice that querying the inferred groups results in an average 18% increase in accuracy and around a 400% increase in the number of replies. This holds regardless of the number of requesters or their querying rates. It is interesting to note that, in all these results, the queries from nodes inside the created groups are less than 10% of the total. This shows that group creation and propagation effectively benefit all nodes in the overlay.

One of the basic assumptions of our scheme is that each peer can individually choose to initiate the group inference process. This allows for completely distributed behavior only if semantically close initiators produce similar groups, and dissimilar initiators do not. We measure the similarity between the first initiator and subsequent, randomly selected initiators, as well as the similarity between the group schemas they respectively create. Figure B.15 displays results over different runs, in which the two initiators were either more than 70% or less than 40% similar.

Figure B.15: Relationship between initiator and inferred schema similarity

Clearly, for very similar initiators the process yields very similar groups. On the other hand, for fairly dissimilar initial schemas, the created groups are 40-50% similar. This value is a little higher than expected due to the high overlap and semantic relations between the attributes stored at various peers. When data is placed in a non-overlapping manner, such groups are less than 20% similar. There is, therefore, a clear correlation between initiator and inferred-schema similarity.

As we just showed, peers with similar schemas generate similar groups. Doing so simultaneously is undesirable for two reasons: first, the system would perform a redundant operation and, second, it would force our merging process to be invoked regularly. As we mentioned in Section B.3.2, initiators broadcast their intention to create a semantic group. Nevertheless, broadcasts that reach many nodes are very costly. Furthermore, our clustering process ensures that a non-negligible number of semantically close nodes will also be close to the initiator in the hop-distance metric. To demonstrate this, we measure the hop-distance distribution of peers not included in the group creation process whose similarity to the initiator is greater than or equal to D, given our default parameters. Table B.3 presents our results.
Table B.3: Estimating group broadcast range
                         D = 0.5   D = 0.7   D = 0.8   D = 0.9   D = 1.0
    Min/Max Distance     1.1/5.9   1.9/5.6   2.1/5.3   2.9/4.8   3.8/4.2
    #nodes               597       235       113       27        17
    %nodes <= 4 hops     78        80        80        78        70

We notice that the minimum distance increases as we search for more similar peers, while the maximum decreases. This is due to the clustering process: similar peers move closer to each other in the overlay. Grouping includes most of these peers, so the minimum distance to a non-grouped similar node increases. Moreover, the peers left out of the group inference are now closer than before. The results show that a broadcast range of 4 hops reaches around 80% of our target nodes. Nevertheless, as D increases, these nodes become scarce. Thus, assuming that D ≈ 0.65 for practical reasons, a TTL = 4 broadcast suffices. In our experiments, a broadcast of that scope blocks an increasing number of nodes with time. For larger values of D, broadcasting with a large range causes the majority of messages to be delivered to dissimilar peers.

Finally, we present some results concerning GrouPeer's merging process. When two similar groups are identified (through broadcasting of the group metadata), the merge process is initiated. We measure the similarity and number of replies of the two individual groups as well as of the merged one, and present the results in Figure B.16.

Figure B.16: Similarity and number of answers of the initial and merged groups vs creation time

We notice that, while the two groups and the merged one do not substantially differ in the accuracy of their results (although the merged group always outperforms them), the new schema delivers almost twice as many answers. A very important observation is that the creation time of the individual groups plays almost no role in their performance, which shows that GrouPeer will keep operating without performance degradation. The broadcast-range estimate of Table B.3 can also be expressed as a small computation over the measured hop distances, as sketched below.
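The following is a minimal sketch of that computation. It assumes the hop distance from the initiator to each similar, ungrouped peer is available from the measurement; all names are hypothetical.

    def broadcast_range_stats(hop_distances, radius=4):
        """Summarize hop distances of similar, ungrouped peers (cf. Table B.3).

        hop_distances: hop distances from the initiator to the peers whose
        schema similarity is at least D.
        Returns (min, max, count, fraction reachable within `radius` hops).
        """
        if not hop_distances:
            return None
        within = sum(1 for d in hop_distances if d <= radius)
        return (min(hop_distances), max(hop_distances),
                len(hop_distances), within / len(hop_distances))

For example, applied to the peers behind the D = 0.7 column of Table B.3, a 4-hop broadcast would be expected to reach roughly 80% of the 235 similar, ungrouped peers.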
B.5 Related Work

The Chatty Web [106] considers P2P systems that share (semi-)structured information. The authors are concerned with the gradual degradation, in terms of syntax and semantics, of a query propagated along a network path. This approach considers peers that own very simple relational schemas and GAV mappings with their acquaintees. Instead, we are interested in more complex schemas and we consider GAV, LAV or GLAV mappings.

In [104], the authors propose optimization techniques for query reformulation in P2P database systems. They focus on minimizing the rewriting of a query and pruning the propagation path in order to avoid redundant reformulations. Pre-computation of the query-reformulation path-tree is shown to accelerate the procedure, despite the maintenance cost of the pre-computed mappings. Our approach is designed for large-scale unstructured overlays. First, it avoids reformulation at peers poor in query-relevant information by adaptively choosing the version of the query to be answered. Also, while [104] requires central knowledge of the system structure, our scheme enables nodes to operate in a completely decentralized fashion, utilizing the standard lookup operations to refine their local knowledge.

PeerDB [107] features relational data sharing without schema knowledge. Query matching and rewriting are based on keywords. First, all nodes within a TTL radius are contacted and return prospective answer metadata. The user then selects the metadata relevant to the query, and the selected sources are contacted directly for the results to the various rewritten versions of the query. Instead, our approach employs an automated technique based on a combination of successive query rewriting and query-schema matching, and it utilizes bandwidth-efficient walks instead of the costly flooding scheme.

Some of the well-known projects that deal with the data heterogeneity problem in P2P systems are [108-111]. Piazza [108] presents a solution to the heterogeneity issue in P2P data management systems and proposes a language for schema mediation between peers. It also presents algorithms for query reformulation based on GAV/LAV query answering.

Edutella [109] is a schema-based network that holds RDF data. Peers offer services (e.g., querying, mapping, mediation) that they share with other peers. Peers can formulate complex queries that are translated by wrappers into queries on the Edutella Common Data Model. Peers register the query types they can answer with mediators, which route queries to the appropriate peers. Edutella is an effort towards solving the heterogeneity problem for data and services. However, it does not focus on semantic clustering, nor does it propose sophisticated methods for distributing queries to semantically relevant peers.

GridVine [110] and pSearch [111] are based on structured P2P overlays. GridVine hashes and indexes RDF data and schemas, while pSearch represents both documents and queries as semantic vectors, which serve as the keys of a DHT structure. Both projects base their search efficiency on the underlying DHT and thus do not solve the semantic diversity problem in an unstructured P2P system. Another disadvantage of pSearch is that documents of newly joined peers, with terms that are not encapsulated in the existing vector, cannot be indexed.

Beyond semantic clustering, the work in [112] looks into the problem of discovering connectivity clusters of nodes in P2P networks, detecting the transmission of the same query multiple times at the same node. In [72], peers are grouped into possession rules, according to whether or not they contain a specific item. Nodes search inside one possession rule in a blind fashion. The possession rule is chosen by a greedy mechanism according to past query results.

B.6 Summary

GrouPeer is a system that effectively implements both popular approaches to answering queries in P2P data management systems: propagation along paths of bounded depth and querying a mediated schema. First, it performs a gradual formation of semantically similar clusters. Our system then creates and maintains, in an automated way, a schema representative of the cluster. Requesters can direct relevant queries to the advertised groups and join relevant (or interesting) ones in a more effective and timely manner. Our results show that grouping yields significant gains in both answer quality and quantity compared to the original clustering method.

BIBLIOGRAPHY

[1] C. Shirky. What Is P2P ... And What Isn't. OpenP2P.com, 2000.
[2] A. Oram, editor. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly, 2001.
[3] RFC 1036. Standard for Interchange of USENET Messages.
[4] RFC 1034. Domain Names - Concepts and Facilities.
[5] http://www.napster.com. Napster website.
[6] http://www.gnutella.com. Gnutella website.
[7] http://www.jxta.org. Project JXTA.
[8] http://www.microsoft.com/net. Microsoft .NET.
[9] The Impact of File Sharing on Service Provider Networks. Industry White Paper, Sandvine Inc.
[10] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Lecture Notes in Computer Science, 2001.
[11] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, 2001.
[12] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable Peer-to-Peer lookup service for internet applications. In SIGCOMM, 2001.
[13] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. Technical Report TR-00-010, University of Berkeley, CA, 2000.
[14] C. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In ICS, 2002.
[15] D. Tsoumakos and N. Roussopoulos. Adaptive Probabilistic Search for Peer-to-Peer Networks. In 3rd IEEE Intl Conference on P2P Computing, 2003.
[16] D. Tsoumakos and N. Roussopoulos. A Comparison of Peer-to-Peer Search Methods. In WebDB, 2003.
[17] D. Tsoumakos and N. Roussopoulos. Analysis and Comparison of P2P Search Methods. In INFOSCALE, 2006.
[18] D. Tsoumakos and N. Roussopoulos. AGNO: An Adaptive Group Communication Scheme for Unstructured P2P Networks. In Euro-Par, 2005.
[19] D. Tsoumakos and N. Roussopoulos. APRE: An Adaptive Probabilistic Replication Method for Unstructured P2P Networks. In CoopIS, 2006.
[20] D. Fallows, L. Rainie, and G. Mudd. The popularity and importance of search engines, 2004. ComScore Data Memo.
[21] J. Chu, K. Labonte, and B. Levine. Availability and Locality Measurements of Peer-to-Peer File Systems. In SPIE, 2002.
[22] V. Kalogeraki, D. Gunopulos, and D. Zeinalipour-Yazti. A Local Search Mechanism for Peer-to-Peer Networks. In CIKM, 2002.
[23] S. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. In INFOCOM, 2002.
[24] D. Menascé and L. Kanchanapalli. Probabilistic Scalable P2P Resource Location Services. SIGMETRICS Perf. Eval. Review, 2002.
[25] K. Sripanidkulchai, B. Maggs, and H. Zhang. Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems. In INFOCOM, 2003.
[26] A. Crespo and H. Garcia-Molina. Routing Indices for Peer-to-Peer Systems. In ICDCS, July 2002.
[27] S. Daswani and A. Fisk. Gnutella UDP Extension for Scalable Searches (GUESS) v0.1.
[28] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. Making Gnutella-like P2P Systems Scalable. In SIGCOMM, 2003.
[29] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[30] E. Bonabeau, M. Dorigo, and G. Theraulaz. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, 1999.
[31] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On Power-law Relationships of the Internet Topology. In SIGCOMM, 1999.
[32] E. Zegura, K. Calvert, and S. Bhattacharjee. How to model an internetwork. In INFOCOM, 1996.
[33] C. Jin, Q. Chen, and S. Jamin. Inet: Internet Topology Generator. Technical Report CSE-TR443-00, Department of EECS, University of Michigan, 2000.
[34] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination. In NOSSDAV, 2001.
[35] J. Jannotti, D. Gifford, K. Johnson, F. Kaashoek, and J. O'Toole. Overcast: Reliable multicasting with an overlay network. In OSDI, 2000.
[36] A. Datta, M. Hauswirth, and K. Aberer. Updates in highly unreliable, replicated peer-to-peer systems. In ICDCS, 2003.
[37] Y. Chu, S. Rao, S. Seshan, and H. Zhang. Enabling conferencing applications on the internet using an overlay multicast architecture. In SIGCOMM, 2001.
[38] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable application layer multicast. In SIGCOMM, 2002.
[39] A. Rowstron, A. Kermarrec, M. Castro, and P. Druschel. Scribe: The design of a large-scale event notification infrastructure. In NGC, 2001.
[40] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level multicast using content-addressable networks. Lecture Notes in Computer Science, 2001.
[41] M. Portmann and A. Seneviratne. Cost-effective broadcast for fully decentralized peer-to-peer networks. Computer Communications, 26, 2003.
[42] A. Ganesh, A. Kermarrec, and L. Massoulie. Peer-to-peer membership management for gossip-based protocols. IEEE Trans. Comp., 2003.
[43] Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 2002.
[44] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, 1996.
[45] M. Ripeanu and I. Foster. Mapping the Gnutella network: Macroscopic properties of large-scale peer-to-peer systems. In IPTPS, 2002.
[46] A. Medina, A. Lakhina, I. Matta, and J. Byers. BRITE: An Approach to Universal Topology Generation. In MASCOTS, 2001.
[47] A. Ganesh, A. Kermarrec, and L. Massoulie. SCAMP: Peer-to-peer lightweight membership service for large-scale group communication. In Networked Group Communication, 2001.
[48] http://tidy.sourceforge.net/. HTML Tidy Project Page.
[49] http://www.kazaa.com. Kazaa website.
[50] http://www.emule-project.net/. eMule project.
[51] http://www.bittorrent.com/index.html. BitTorrent home page.
[52] S. Iyer, A. Rowstron, and P. Druschel. Squirrel: A decentralized peer-to-peer web cache. In PODC, 2002.
[53] http://web.icq.com/. ICQ web site.
[54] J. Kangasharju, K. Ross, and D. Turner. Secure and Resilient Peer-to-Peer E-Mail: Design and Implementation. In IEEE Intl Conf. on P2P Computing, 2003.
[55] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites. In WWW, 2002.
[56] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally Distributed Content Delivery. IEEE Internet Computing, September 2002.
[57] M. Freedman, E. Freudenthal, and D. Mazières. Democratizing Content Publication with Coral. In NSDI, 2004.
[58] http://www.squid-cache.org/. Squid Web Proxy Cache.
[59] M. Roussopoulos and M. Baker. Practical load balancing for content requests in peer-to-peer networks. Technical Report cs.NI/0209023, Stanford University, 2003.
[60] C. Damgaard and J. Weiner. Describing Inequality in Plant Size or Fecundity. Ecology, 81, 2000.
[61] I. Clarke, O. Sandberg, B. Wiley, and T. Hong. Freenet: A Distributed Anonymous Information Storage and Retrieval System. Lecture Notes in Computer Science, 2001.
[62] T. Pitoura, N. Ntarmos, and P. Triantafillou. Replication, Load Balancing and Efficient Range Query Processing in DHTs. In EDBT, 2006.
[63] http://www.openp2p.com. openP2P website.
[64] http://www.peer-to-peerwg.org/. Peer-to-Peer working group.
[65] D. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-Peer Computing. Technical Report HPL-2002-57, HP, 2002.
[66] S. Saroiu, P. Gummadi, and S. Gribble. A measurement study of peer-to-peer file sharing systems. Technical Report UW-CSE-01-06-02, University of Washington, 2001.
[67] S. Sen and J. Wang. Analyzing peer-to-peer traffic across large networks. In SIGCOMM Internet Measurement Workshop, 2002.
[68] S. Saroiu, K. Gummadi, R. Dunn, S. Gribble, and H. Levy. An Analysis of Internet Content Delivery Systems. In OSDI, 2002.
[69] B. Yang and H. Garcia-Molina. Improving Search in Peer-to-Peer Networks. In ICDCS, 2002.
[70] M. Stokes. Gnutella2 Specifications Part One. http://www.gnutella2.com/gnutella2 search.htm.
[71] P. Ganesan, Q. Sun, and H. Garcia-Molina. YAPPERS: A peer-to-peer lookup service over arbitrary topology. In INFOCOM, 2003.
[72] E. Cohen, A. Fiat, and H. Kaplan. Associative search in peer to peer networks: Harnessing latent semantics. In INFOCOM, 2003.
[73] R. Morselli, B. Bhattacharjee, M. Marsh, and A. Srinivasan. Efficient Lookup on Unstructured Topologies. In PODC, 2005.
[74] D. Subramanian, P. Druschel, and J. Chen. Ants and reinforcement learning: A case study in routing dynamic networks. In IJCAI, 1997.
[75] G. Di Caro and M. Dorigo. AntNet: Distributed Stigmergetic Control for Communications Networks. Journal of Artificial Intelligence Research, 9:317-365, 1998.
[76] Y. Chu, S. Rao, and H. Zhang. A case for end system multicast. In SIGMETRICS, 2000.
[77] P. Francis. Yoid: Extending the internet multicast architecture, 2000. White Paper.
[78] M. Castro, M. Jones, A. Kermarrec, A. Rowstron, M. Theimer, H. Wang, and A. Wolman. An evaluation of scalable application-level multicast built using peer-to-peer overlays. In INFOCOM, 2003.
[79] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In PODC, 1987.
[80] P. Eugster, S. Handurukande, R. Guerraoui, A. Kermarrec, and P. Kouznetsov. Lightweight probabilistic broadcast. In DSN, 2001.
[81] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In SOSP, 2001.
[82] A. Rowstron and P. Druschel. Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-peer Storage Utility. In SOSP, 2001.
[83] J. Cates. Robust and efficient data management for a distributed hash table. Master's thesis, Massachusetts Institute of Technology, May 2003.
[84] M. Waldvogel, P. Hurley, and D. Bauer. Dynamic replica management in distributed hash tables. Technical Report RZ-3502, IBM, 2003.
[85] T. Stading, P. Maniatis, and M. Baker. Peer-to-peer caching schemes to address flash crowds. In IPTPS, 2002.
[86] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher. Adaptive replication in peer-to-peer systems. In ICDCS, 2004.
[87] W. Poon, J. Lee, and D. Chiu. Comparison of Data Replication Strategies for Peer-to-Peer Video Streaming. In ICICS, 2005.
[88] R. Ferreira, M. Ramanathan, A. Awan, A. Grama, and S. Jagannathan. Search with Probabilistic Guarantees in Unstructured Peer-to-Peer Networks. In IEEE P2P, 2005.
[89] F. Cuenca-Acuna, R. Martin, and T. Nguyen. Autonomous Replication for High Availability in Unstructured P2P Systems. In SRDS-22, 2003.
[90] http://www.overnet.com/. eDonkey2000-Overnet.
[91] B. Cohen. Incentives build robustness in BitTorrent, 2003.
[92] R. Sherwood, R. Braud, and B. Bhattacharjee. Slurpie: A cooperative bulk data transfer protocol. In IEEE INFOCOM, 2004.
[93] P. Rodriguez and E. W. Biersack. Dynamic parallel access to replicated content in the Internet. IEEE/ACM Transactions on Networking, 10(4), August 2002.
[94] J. W. Byers, M. Luby, and M. Mitzenmacher. Accessing multiple mirror sites in parallel: Using tornado codes to speed up downloads. In INFOCOM, 1999.
[95] http://www.morpheus.com. Morpheus website.
[96] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, D. A. Spielman, and V. Stemann. Practical loss-resilient codes. In STOC, 1997.
[97] M. O. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. JACM, 36(2), 1989.
[98] P. Felber, T. Kaldewey, and S. Weiss. Proactive hot spot avoidance for web server dependability, 2004.
[99] A. Stavrou, D. Rubenstein, and S. Sahu. A lightweight, robust P2P system to handle flash crowds. In ICNP, 2002.
[100] M. Ripeanu and I. Foster. Mapping the Gnutella Network: Macroscopic Properties of Large-Scale Peer-to-Peer Systems. In IPTPS, 2002.
[101] V. Kantere, I. Kiringa, J. Mylopoulos, A. Kementsientidis, and M. Arenas. Coordinating P2P Databases Using ECA Rules. In DBISP2P, 2003.
[102] M. Lenzerini. Data Integration: A Theoretical Perspective. In 21st ACM PODS, 2002.
[103] V. Kantere, D. Tsoumakos, T. Sellis, and N. Roussopoulos. GrouPeer: Dynamic Clustering of P2P Databases. Technical Report DBLAB-2006/4, National Technical University of Athens, Department of Electrical and Computer Engineering, 2005. http://www.dbnet.ece.ntua.gr/pubs/.
[104] I. Tatarinov and A. Halevy. Efficient Query Reformulation in Peer-Data Management Systems. In SIGMOD, 2004.
[105] J. Madhavan and A. Halevy. Composing mappings among data sources. In VLDB, 2003.
[106] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. The Chatty Web: Emergent Semantics Through Gossiping. In WWW Conference, 2003.
[107] B. Ooi, Y. Shu, K. L. Tan, and A. Y. Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In ICDE, 2003.
[108] A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema Mediation in Peer Data Management Systems. In ICDE, 2003.
[109] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. Edutella: A P2P networking infrastructure based on RDF. In WWW, 2002.
[110] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, and T. Van Pelt. GridVine: Building internet-scale semantic overlay networks. In International Semantic Web Conference, 2004.
[111] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003.
[112] L. Ramaswamy, B. Gedik, and L. Liu. A Distributed Approach to Node Clustering in Decentralized Peer-to-Peer Networks. IEEE Transactions on Parallel and Distributed Systems, 2005.
Index

acquaintance, 142, 144, 149
acquaintee, 142, 147, 151
Adaptive Group Notification, see AGNO
Adaptive Probabilistic REplication, see APRE
Adaptive Probabilistic Search, see APS
AGNO, 11, 52, 83
  aging, 56, 59, 69
  backpointers, 54, 56, 61
  Estimation of λr, 60, 68, 70
  performance, 65
  reverse indices, 54, 56
  stress, 62
ant-based algorithms, 31, 115
AntNET, 115
APRE, 11, 80, 83, 87, 107
  ClosestFirst, 90, 94, 104
  Contract, 81, 84, 87, 93
  Expand, 81, 84, 87, 106
  FurthestFirst, 90, 94
  Gini coefficient, 99
  mirror set, 82
  path-cache, 93
  Performance, 94
  replication policy, 90
  server set, 82, 84, 94
  Uniform, 90
APS, 9, 16, 17, 53, 62, 88, 107
  learning, 22
  optimistic, 18, 21, 22
  performance, 37, 127
  pessimistic, 18, 21, 22, 39
  s-APS, 23, 37
  w-APS, 23, 30, 35, 36, 42
bloom filters, 113
client/server architecture, 2
DHT, 4, 51, 80, 114, 120
DNS, 1
Flash crowds, 79, 102
Freenet, 93
Gnutella, 50, 62, 110, 114
Gnutella2, 110, 127
gossip-based algorithms, 50
GrouPeer, 178
  automatic matching, 150
  clustering, 147, 150, 153
  clustering evaluation, 166
  grouping, 148, 153
  grouping evaluation, 170
  initiator, 155
  originator, see initiator
  query rewriting, 150
  schema semantic similarity, 155
  semantic radius, 155
GUESS, 15, 31, 47, 127
Internet, 1, 2, 13
leaf-node, 15, 110
Markov Decision Process, see MDP
multicast, 49
overlay, 3
  neighbors, 6
P2P, 1, 13, 79, 144
  Centralized, 4
  Decentralized, 4, 6
  Hybrid, 4, 31
  Pure, 4, 6, 31
  Structured, see DHT
  Unstructured, 4-6, 50, 51, 62, 80, 142
P2P Data Management System, 142
Peer-to-Peer computing, 1, 141
query distribution, 7, 32
query rewriting, 144
Random Walks, 15, 37, 127
Reinforcement Learning, 10, 24
  Bellman's equation, 26, 28
  discount factor γ, 25, 28
  MDP, 25, 27
  policy, 25
  reward, 24
  state value, 25
  optimal value functions, 26, 28
  Value Iteration, 30
replication distribution, 7, 32
SCAMP, 63, 67
search, 14
  accuracy, see success rate
  blind, 14, 110, 129
  hit, 7, 126, 131, 134, 139
  informed, 15, 110
  success rate, 7, 129, 134, 136, 139
soft state, 7
super-peer, 4, 15, 110
system model, 6
time-to-live (TTL), 7, 14
USENET, 1