XJoin: Getting Fast Answers From Slow and Bursty Networks

Tolga Urhan
Institute for Advanced Computer Studies
Computer Science Department
University of Maryland, College Park, MD 20742
urhan@cs.umd.edu

Michael J. Franklin
Institute for Advanced Computer Studies
Computer Science Department
University of Maryland, College Park, MD 20742
franklin@cs.umd.edu

CS-TR-3994, UMIACS-TR-99-13

Abstract

The combination of increasingly ubiquitous Internet connectivity and advances in heterogeneous and semi-structured databases has the potential to enable database-style querying over data from sources distributed around the world. Traditional query processing techniques, however, fail to deliver acceptable performance in such a scenario for two main reasons: First, they optimize for delivery of the entire query result, while on-line users would typically benefit from receiving initial results as quickly as possible. Second, slow or bursty delivery of data from remote sources can stall query execution, making the already inadequate batch-like behavior even worse. Both of these problems can be addressed using fully pipelined query execution. The symmetric hash join operator supports such pipelining, but it requires all base data and intermediate results to be memory-resident, which is unacceptable for complex queries over large datasets. In this paper we present a multi-threaded extension of the symmetric hash join, called XJoin, that can execute effectively with far less memory. By reactively scheduling background processing, XJoin hides intermittent delays in data arrival to produce more tuples earlier. XJoin includes a very efficient, on-the-fly algorithm for preventing duplicates from being created by its independently running threads. We have implemented the XJoin operator and added it to the PREDATOR Object-Relational DBMS. Using this implementation along with traces obtained by monitoring Internet data delivery, we show that XJoin is an effective solution for providing fast query responses to users even in the presence of slow and bursty remote sources.

1 Introduction

1.1 Wide-Area Query Processing

The explosive growth of the Internet and the World Wide Web has made tremendous amounts of data available on-line. Currently, searching for information in this huge database is usually done using navigational methods, such as following links, or by submitting a few terms to a search engine. Such limited querying capability arises, in part, due to the lack of structure and semantics in most web data sources. Fortunately, however, this situation is beginning to change due to emerging standards such as XML, as well as to the development of technology for wrapping sources to provide relational-style interfaces. As a result, it is becoming possible to pose more sophisticated, declarative queries over data sources that are widely distributed across the Internet.

Beyond the issues of structure and semantics, however, there remain significant technical obstacles to building responsive, usable query processing systems for wide-area environments. A key performance issue that arises in such environments is response-time unpredictability. Data access over wide-area networks involves a large number of remote data sources, intermediate sites, and communications links, all of which are vulnerable to overloading, congestion, and failures. Such problems can cause significant and unpredictable delays in the access of information from remote sources.
These delays, in turn, cause traditional distributed query processing strategies to break down, resulting in unresponsive and hence unusable systems. In previous work [AFTU96] we identified three classes of delays that can affect the responsiveness of query processing: 1) initial delay, in which there is a longer than expected wait until the first tuple arrives from a remote source; 2) slow delivery, in which data arrive at a fairly constant but slower than expected rate; and 3) bursty arrival, in which data arrive in a fluctuating manner, with bursts of data followed by long periods of no arrivals. With traditional query processing techniques, query execution can become blocked even if only one of the accessed data sources experiences such delays. We developed Query Scrambling to address this problem and showed how it can be used to hide initial delays [UFA98] and bursty arrivals [AFT98]. Query Scrambling is a reactive approach to query execution; it reacts to data delivery problems by on-the-fly rescheduling of query operators and restructuring of the query execution plan. Query Scrambling is aimed at improving the response time for the entire query, and may actually slow down the return of some initial results in order to minimize the time required to produce the remaining portion of a query answer once all necessary data has been obtained from all of the remote sources.

In this paper, we explore a complementary approach, based on a non-blocking join operator we call XJoin. XJoin extends the symmetric hash join (SHJ) [WA90, HS93] to use secondary storage, which allows it to be used with large inputs and to run concurrently with other query operators in a bushy query plan. Simply extending SHJ to use secondary storage, however, is insufficient for tolerating significant delays in receiving data from remote sources. For this reason, a key component of XJoin is a reactively scheduled background process, which opportunistically utilizes delays to produce more tuples earlier. We show that by using XJoins it is possible to produce query execution plans that can better cope with data delivery problems and that can deliver initial results orders of magnitude faster than traditional techniques, with, in many cases, little or no degradation in the time required to deliver the entire result.

1.2 Solution Overview

The XJoin approach is based on two fundamental principles:

1. It is optimized for producing results incrementally as they become available. When used in a fully pipelined query plan, answer tuples can be returned to the user as soon as they are produced. The early delivery of initial answers can provide tremendous improvements in the responsiveness of the system. Furthermore, in many situations, users require only a small subset of the total query answer [CK97], so returning initial results quickly is the key to system usability.

2. It allows progress to be made even when one or more sources experience delays. There are two reasons for this. First, by using less memory XJoin allows for bushier query plans than are possible with other pipelined join methods. Thus, some parts of a query plan can continue while others are stalled waiting for input. Second, by employing background processing on previously received input, an XJoin operator can run and produce results even when both of its inputs become blocked.

The symmetric hash join, on which XJoin is based, was aimed at addressing similar issues.
As originally proposed, however, symmetric hash join requires that hash tables for both of its inputs be kept in main memory until all of the tuples have been received from both of its inputs. As a result, symmetric hash join cannot be used for joins with large inputs, and the ability to run multiple joins (e.g., in a bushy query plan) is severely limited. XJoin avoids these problems by allowing tuples from one or both of the inputs to be temporarily spooled to secondary storage. In a sense, XJoin provides for symmetric hash join the same flexibility that hybrid hash join provides for the classic hash join [Sha86]. Not surprisingly, similarly to hybrid hash join, it is based on partitioning. The main challenges in developing XJoin include the following:

- Managing the flow of tuples between memory and secondary storage.
- Controlling the background processing that is initiated when inputs are delayed.
- Ensuring that the full answer is ultimately produced (i.e., no answers should be lost).
- Ensuring that no duplicate tuples are inadvertently produced.

In this paper, we describe the design and performance of XJoin, focusing on how it addresses the above issues. In order to evaluate its efficiency and its ability to deal with delays, we have implemented it in the context of the PREDATOR Object-Relational DBMS [SP97]. Using network traces recorded by transferring data from various remote sites in the Internet, along with workloads derived from the Wisconsin benchmark, we show that XJoin is indeed an effective solution for providing fast query responses to users in the presence of slow and bursty remote sources.

The work described in this paper is related to other recent projects on improving the responsiveness of query processing, including techniques for returning initial answers more quickly [BM96, CK97] and those for returning continually improving answers to long running queries [VL93, HHW97]. Our work differs from this other research due to (among other reasons) the focus on coping with unpredictable delays arising from wide-area remote data access. As mentioned previously, the XJoin approach is complementary to our earlier work on Query Scrambling for unpredictable delays in distributed query processing [AFTU96, AFT98, UFA98], as well as to other dynamic approaches such as [ONK+97, TRV96]. A recent paper [IFFL+99] describes the Tukwila system, which contains an operator similar to XJoin that can adapt to limited memory, but that differs from XJoin in several ways. Most importantly, that operator does not include the reactively scheduled background processing of XJoin (called the "second stage"), which is a key to its performance. This and other related work is discussed in more detail in Section 5.

Figure 1: Symmetric Hash Join [WA90, HS93].
Figure 2: Handling the partitions.

2 XJoin

XJoin is based on the Symmetric Hash Join (SHJ) [WA90, HS93], which was originally designed to allow a high degree of pipelining in parallel database systems. Unlike traditional hash joins, which build a hash table on one input and then probe it with tuples from the other input, SHJ (shown in Figure 1) builds two hash tables, one for each source.
When a tuple arrives on one of the inputs, it is first inserted into the hash table for that input, and then immediately used to probe the hash table of the other input. Thus, even if one source becomes temporarily blocked, the operator can still produce results. It may not be immediately apparent that the algorithm is correct, i.e., that 1) it produces all the tuples in the result, and 2) it produces no extra duplicate result tuples. The correctness of symmetric hash join is due to the fact that probing, which produces the output tuples, is performed for all matching pairs once and only once, namely when the later-arriving tuple of the pair is processed. As we will see, preserving these correctness properties in XJoin requires some additional attention, due to its more asynchronous nature.

XJoin extends the symmetric hash join to use less memory by allowing parts of the hash tables to be moved to secondary storage. It does this by partitioning its inputs, similar to the way that hybrid hash join solves the memory problems of classic hash join.

2.1 Partitioning

XJoin splits both of its inputs into a number of partitions based on a hash function. (The number of partitions is determined using the formula $\sqrt{F \cdot \|R\|}$, where F is the "fudge" factor and $\|R\|$ is the number of pages in the smaller input [Sha86].) Each partition is composed of a memory-resident portion and a disk-resident portion. The memory-resident portion contains the tail (i.e., recently arrived tuples) of the partition, and the disk-resident portion contains the rest. The memory-resident portions are maintained as hash tables, as in symmetric hash join. Each memory-resident portion (for both sources) has at least one block of memory reserved for it at all times. The remaining memory (if any) is divided evenly between the two sources and is used to allow the memory-resident portions to grow as tuples arrive from the sources.

When XJoin receives a tuple from one of its sources, it inserts the tuple into its corresponding partition, which is found by applying a hash function to its join attribute (Figure 2). When the memory becomes full, the tuples of the partition with the largest memory-resident portion are written to the disk. This grows the size of the disk-resident portion (e.g., partition k of source B in Figure 2). The memory-resident portion of that partition is then reduced to a single (initially empty) block, and the remaining free blocks are made available for use by any of the partitions as new tuples arrive. The flushing process is repeated whenever the memory becomes full.

In the remainder of the paper we use the following notation: P_iA denotes the i-th partition of source A. The memory- and disk-resident portions of P_iA are referred to as MP_iA and DP_iA respectively. Thus, we always have MP_iA ∪ DP_iA = P_iA and MP_iA ∩ DP_iA = ∅. The two input sources will be referred to as A and B.

2.2 The Three Stages of XJoin

XJoin proceeds in three stages, each of which is performed by a separate thread. The first stage joins tuples in the memory-resident portions of the partitions, acting similarly to the standard symmetric hash join. The second stage joins tuples from disk with tuples that have not yet been flushed to disk. The third stage is a clean-up stage, which performs any necessary matching to catch any results missed by the first two stages. The first and second stages run in an interleaved fashion: the second stage takes over when the first becomes blocked due to a lack of input. These stages are terminated after all input has been received, at which point the third stage is initiated.
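To make the partition bookkeeping of Section 2.1 and the memory-to-memory probing of the first stage concrete, the following Python sketch shows one way the insert/probe/flush cycle could be organized. It is a minimal illustration under our own assumptions, not the PREDATOR implementation: all class and method names are ours, memory is counted in tuples rather than blocks, and the separate threads, paging, and duplicate checks of Section 2.3 are omitted.

```python
# Minimal sketch (not the actual implementation) of XJoin partition
# bookkeeping and first-stage probing; all names are illustrative.
from types import SimpleNamespace   # stand-in for a real tuple structure

class Partition:
    def __init__(self):
        self.mem = {}     # memory-resident portion: join value -> list of tuples
        self.disk = []    # disk-resident portion (appended on flush)

class Source:
    def __init__(self, n):
        self.parts = [Partition() for _ in range(n)]

class XJoinSketch:
    def __init__(self, n_partitions, mem_budget):
        self.A, self.B = Source(n_partitions), Source(n_partitions)
        self.n = n_partitions
        self.mem_budget = mem_budget    # max tuples held in memory (stand-in for blocks)
        self.in_memory = 0
        self.clock = 0                  # counter bumped on every arrival and flush

    def insert_and_probe(self, tup, key, own, other):
        """First stage: store an arriving tuple, then probe the other source."""
        self.clock += 1
        tup.ats = self.clock                        # arrival timestamp (ATS)
        if self.in_memory >= self.mem_budget:
            self.flush_largest_partition()
        i = hash(key) % self.n                      # partition on the join attribute
        own.parts[i].mem.setdefault(key, []).append(tup)
        self.in_memory += 1
        for match in other.parts[i].mem.get(key, []):
            yield (tup, match)                      # result tuples, output immediately

    def flush_largest_partition(self):
        """Evict the largest memory-resident portion to disk, stamping DTS."""
        _, victim = max(((s, p) for s in (self.A, self.B) for p in s.parts),
                        key=lambda sp: sum(len(v) for v in sp[1].mem.values()))
        for tuples in victim.mem.values():
            for t in tuples:
                self.clock += 1
                t.dts = self.clock                  # departure timestamp (DTS)
                victim.disk.append(t)
                self.in_memory -= 1
        victim.mem.clear()

# Example: tuples can be any objects that allow attribute assignment, e.g.
#   t = SimpleNamespace(key=42, payload=...)
#   for a, b in xj.insert_and_probe(t, t.key, xj.A, xj.B): ...
```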
If care is not taken, spurious duplicate result tuples can be produced by the interaction of the three stages. In order to prevent duplicates, XJoin employs a fast, on-the-fly duplicate detection mechanism, which will be presented in Section 2.3. We now describe the three stages in more detail.

First Stage

The first stage works similarly to the original symmetric hash join. The main difference is that in XJoin, the tuples are organized in partitions based on their join attribute values. Figure 3 shows an example of the first stage when a tuple is received from Source A (respectively, Source B) that hashes to partition i (resp. partition j). If there is room for the tuple in memory, then the tuple is stored in its partition and used to probe the memory-resident portion of the corresponding partition for the other source. Any matches that occur are output as results. If, however, memory is full, then one of the memory-resident portions is chosen as a victim and flushed to disk as described in Section 2.1. Join processing then continues as usual. The first stage runs as long as at least one of its inputs is producing tuples. When the first stage times out on both of its inputs (e.g., due to some unexpected delays), it blocks and the second stage is allowed to run. The first stage terminates when it has received all of the tuples from both of its inputs.

Figure 3: Stage 1 - Memory-to-memory joins.

Second Stage

The second stage (shown in Figure 4) is activated whenever the first stage blocks. It picks a disk-resident portion of one of the partitions, say DP_iA (the partition is chosen using optimizer-generated estimates of the output cardinality and the cost of performing the stage using that partition), and uses its tuples to probe the memory-resident portion of the corresponding partition of the other source (i.e., MP_iB, in this case). Any matches found are output (subject to duplicate detection as described in Section 2.3) as result tuples. After a disk-resident portion has been completely processed, the operator checks to see if either of the join inputs is ready to begin producing tuples again; if so, then the second stage halts and the first stage is resumed. If not, then a different disk-resident portion is chosen and the second stage is continued.

Figure 4: Stage 2 - Disk-to-memory joins.

In contrast to the first stage, this stage does not depend on the availability of tuples from the sources. It deals with the tuples that have already been stored locally (either in memory or on disk) and can produce output even when both relations are delayed. The second stage may use the same partition multiple times, as the partition grows over the course of the join execution. Once the second stage begins processing a disk-resident portion of a partition, it runs until that portion is completely processed. Only after the partition has been processed does the operator check to see if either of the input sources has become unblocked. If so, then the first stage is resumed; otherwise, the second stage continues by choosing another partition to process.
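Continuing the sketch above, the control flow of the second stage can be summarized as follows. The helper names (inputs_blocked, choose_partition, already_joined, record_probe) are placeholders for the timeout logic, the optimizer-guided partition choice, the duplicate test of Section 2.3, and the history-list update described later; none of them come from the actual system.

```python
# Rough rendering of the second stage's control flow; "op" is an operator
# object like the XJoinSketch above, and the helpers are placeholders.
def run_second_stage(op):
    while op.inputs_blocked():                       # both sources timed out
        side, i = op.choose_partition()              # estimated most productive
        own = op.A if side == "A" else op.B
        other = op.B if side == "A" else op.A
        # the chosen disk-resident portion is processed to completion ...
        for disk_tup in own.parts[i].disk:
            for mem_tup in other.parts[i].mem.get(disk_tup.key, []):
                if not op.already_joined(disk_tup, mem_tup):
                    yield (disk_tup, mem_tup)
        op.record_probe(side, i)                     # append {DTS_last, ProbeTS}
        # ... and only then is the state of the inputs re-checked
```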
This coarse-grained approach simplifies the implementation of the second stage, but it also raises some performance risks. The second stage incurs overhead in the hope of generating result tuples. This overhead is essentially "free" as long as the inputs of the XJoin are delayed, as no progress could be made in that situation anyway. If, however, one or both of the inputs becomes unblocked, the additional costs of the second stage could delay the delivery of results. This aspect of the second stage is investigated in the experiments described in Section 4.

Third Stage

The third stage executes after all tuples have been received from both inputs. It is a clean-up stage that makes sure that all the tuples that should be in the result set are ultimately produced. This step is necessary because the first and second stages may only partially compute the result. The first stage fails to join tuples that were not in memory at the same time. If two matching tuples are received far apart in time, one of them might have already been flushed to disk by the time the other tuple arrived. Such pairs of tuples cannot be joined by the first stage. The second stage similarly fails to join two tuples if one of them is not in memory when the other is brought from disk and used to probe the memory-resident tuples.

The third stage joins all the partitions (both the memory-resident and disk-resident portions) of the two sources. It distinguishes between the smaller and the larger of the input relations while operating. Assume source A produces the smaller input. The third stage first brings all the tuples of DP_1A into memory and creates a hash table for them. Of course, this process might require some memory-resident portions of other partitions to be flushed to disk. The hash table is then probed with the tuples from both MP_1B and DP_1B. This process is repeated for all the remaining partitions. The operation of the third stage is similar to the processing done by hybrid hash join after it has partitioned its inputs. The third stage typically will perform less work, however, because it produces only those result tuples that haven't already been produced by the first and second stages, which can result in significant savings in CPU usage.

2.3 Handling Duplicates in XJoin

As stated in the previous section, the multiple stages of XJoin may produce spurious duplicate tuples because they can potentially perform overlapping work. Duplicates can be created in both the second and third stages. The first stage does not generate spurious duplicates for the same reason that duplicates are not created by the original symmetric hash join, namely, that it matches pairs of memory-resident tuples only at the instant that the later of the two tuples arrives from its source. The creation of duplicates in the second and third stages and the mechanism used to avoid these problems are described in the following two sections.

The duplicate prevention mechanisms rely on timestamp values that are maintained by the XJoin operator. Timestamps are implemented using a counter whose value is incremented every time a new tuple is received from a source or flushed to the disk. XJoin augments the structure of tuples to contain two timestamps that are set during the first stage. One timestamp, called the arrival timestamp (ATS), is assigned when the tuple is first received from its source. The second timestamp, called the departure timestamp (DTS), is assigned when the tuple is flushed from memory to make room for additional input, as described previously.
The ATS and DTS together describe the time interval during which a tuple was in the memory-resident portion of its partition. The ATS and DTS of a tuple are never changed once they are assigned.

2.3.1 Detecting duplicates in the second and third stages

As stated previously, duplicates cannot be generated in the first stage, but both the second and third stages do have the potential of creating duplicates. Tuples that are matched during these stages could have been matched by the first stage, or by a previous run of the second stage.

Checking for the matches from the first stage is easy. For a pair of tuples to have been matched by the first stage, they both must have been in memory at the same time; thus they must have overlapping ATS-DTS ranges. Any such pair of tuples is not considered for joining by the second or third stages. Figure 5 shows two cases. The tuples in Figure 5(a) have overlapping ranges (in fact, Tuple A was in memory the entire time that Tuple B was in memory), so these tuples should not be joined again by the later stages. In contrast, the tuples in Figure 5(b) were never in memory at the same time during the first stage, so they could not have been joined during the first stage.

Figure 5: Detecting tuples joined in the first stage: (a) tuples with overlapping ATS-DTS ranges were joined in the first stage; (b) tuples with non-overlapping ranges were not.

Figure 6: Detecting tuples joined in the second stage, using the history list kept for the corresponding partition.

The ATS and DTS can be used to avoid duplicates of results produced by the first stage, but they do not solve the potential for duplicates created by multiple runs of the second stage. Recall that the second stage uses tuples from one source that have been evicted from memory to probe the memory-resident portion of the corresponding partition of the other source. If a disk-resident tuple is used by several different runs of the second stage, then spurious duplicates could be created. This latter problem is solved by using timestamps to record the times that the second stage has used each disk-resident portion.

These additional timestamps are recorded in a linked list associated with each disk-resident portion. The linked list contains an entry for each time the second stage has processed that portion. The entries in the list are of the form {DTS_last, ProbeTS}, where DTS_last is the DTS value of the last tuple of the disk-resident portion that was used to probe the memory-resident tuples, and ProbeTS is the timestamp value at the time that the second stage was executed. (Note that the timestamp value remains unchanged during an execution of the second stage, since no tuples can be added to or evicted from memory while it is executing.) These entries can be used to infer that all tuples of the disk-resident portion having DTS values up to (and including) DTS_last were used by the second stage at time ProbeTS.

Checking for tuples matched in the second phase is then performed as follows: For two matching tuples Tuple A and Tuple B, belonging to partitions DP_iA and MP_iB respectively, we first scan the linked list constructed for DP_iA to find out if and when Tuple A was used to probe the memory-resident tuples. Then we check whether Tuple B was in memory during one of these probings, by using its ATS and DTS values. If so, we deduce that Tuple A was already matched against Tuple B, so these two tuples are not matched again.
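A minimal sketch of these two checks, assuming every tuple carries its ATS and DTS (a tuple that is still memory-resident can be treated as having the current clock value as its DTS) and that each disk-resident portion keeps the history list described above; the entry and attribute names are ours, not the system's:

```python
# Sketch of the duplicate checks of Section 2.3.1; names are illustrative.
from collections import namedtuple

HistoryEntry = namedtuple("HistoryEntry", ["dts_last", "probe_ts"])  # {DTS_last, ProbeTS}

def overlapped_in_memory(a, b):
    """True if the [ATS, DTS] intervals overlap, i.e. the pair was already
    joined by the first stage (Figure 5)."""
    return a.ats <= b.dts and b.ats <= a.dts

def probed_in_second_stage(disk_tup, mem_tup, history):
    """True if disk_tup was already used to probe memory while mem_tup was
    resident; history is the list kept for disk_tup's partition (Figure 6)."""
    for entry in history:
        if disk_tup.dts <= entry.dts_last:                       # used in that run
            if mem_tup.ats <= entry.probe_ts <= mem_tup.dts:     # resident at that time
                return True
    return False

def already_joined(disk_tup, mem_tup, history):
    return (overlapped_in_memory(disk_tup, mem_tup) or
            probed_in_second_stage(disk_tup, mem_tup, history))
```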
Note that this same approach is used to avoid generating spurious duplicates in the third stage as well, by performing the check symmetrically (i.e., in both directions). Figure 6 shows an example of the use of the linked lists. From the top list, it can be deduced that Tuple A was matched against memory-resident tuples twice, at times 550 and 700. By examining the ATS and DTS of Tuple B, it is learned that Tuple B was in memory from time 500 to 600. Thus, Tuple B must have already been matched (by the second stage) with Tuple A at time 550, so they should not be matched again.

2.4 Further optimizations

2.4.1 Adding a cache

The previous sections described the basic XJoin algorithm and the techniques it uses to avoid generating spurious duplicate result tuples. In this section we discuss an important optimization that can be applied to the second stage. Recall that the second stage picks a disk-resident portion of a partition, say DP_iA, and joins it with MP_iB. In the basic algorithm, the tuples of DP_iA are discarded after they are used. If instead some tuples from DP_iA could be retained (i.e., cached) in memory, then a subsequent run of the second stage joining DP_iB and MP_iA could also use the cached tuples from DP_iA. Figure 7 shows two consecutive runs of the second stage with the caching optimization enabled. In the first run, the second stage algorithm reads DP_iA from the disk and uses it to probe MP_iB. Some of its tuples are also stored in the cache. In the next run of the second stage, DP_iB is read from the disk and joined with MP_iA and with the tuples of DP_iA that have been stored in the cache.

Figure 7: Two consecutive runs of the second stage with caching.

This optimization has the potential to produce a significant number of additional result tuples at the expense of some extra space allocated to the cache. (In our implementation we use 10% of the memory allocated to XJoin for the cache. If the cache is not large enough to hold all the tuples of a disk-resident portion, only its leading tuples are stored.) This optimization, however, introduces a third potential point of duplicate creation. In the following, we describe a very simple and fast algorithm for avoiding such duplicates.

The algorithm is based on the observation that the join between the cached tuples from a disk-resident portion DP_iA and the tuples from DP_iB can be thought of as a join between two intervals. One interval summarizes which tuples of DP_iA were in the cache, and the other interval summarizes which tuples of DP_iB were used to probe the cache. Such intervals can be represented by storing only two timestamp values for each partition: the DTS of the last tuple from DP_iA that was able to get into the cache, and the DTS of the last tuple from DP_iB that was used to probe the cache. We call these two values DTS_cache and DTS_probe. If we picture the DTSs of the tuples from these two partitions being laid out as the axes of a matrix, then DTS_cache and DTS_probe define a rectangular region on this matrix which is justified to the origin (Figure 8). In the second and third stages, any matching pair of records whose DTS values are found to fall into this rectangle are not joined again.
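Assuming that the pair (DTS_cache, DTS_probe) is kept for each ordered pair of corresponding partitions as described above, the rectangle test reduces to two comparisons; the sketch below is illustrative only, with names of our choosing:

```python
# Sketch of the origin-justified rectangle test for the caching optimization.
def joined_via_cache(cached_tup, probing_tup, rect):
    """rect = (dts_cache, dts_probe) for (cached partition, probing partition)."""
    dts_cache, dts_probe = rect
    return cached_tup.dts <= dts_cache and probing_tup.dts <= dts_probe

def already_joined_through_cache(tup_a, tup_b, rect_a_cached, rect_b_cached):
    # check both directions: A cached / B probing, then B cached / A probing
    return (joined_via_cache(tup_a, tup_b, rect_a_cached) or
            joined_via_cache(tup_b, tup_a, rect_b_cached))
```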
What is nice about keeping this history as a rectangular region is that when two partitions are joined the same way again, the new rectangular region will completely overlap the previous region. This is because at least as many tuples as were cached in the previous run will be able to get into the cache, and at least as many tuples of the probing partition as were used in the previous run will be used again this time. Thus the rectangular region grows monotonically larger, overlapping the previous one. This property makes it safe to keep only one interval per partition of each source.

Figure 8: Detecting the disk-to-cache joins.

Figure 8 shows the case where two tuples from partitions DP_iA and DP_iB, with DTS values of 20 and 30 respectively, are about to be joined. Two rectangular regions have already been constructed: the first rectangle summarizes the tuples joined when DP_iA was cached and probed with DP_iB, and the second rectangle shows when DP_iB was cached and probed with DP_iA. The first rectangle denotes that tuples of DP_iA having DTS values up to 32 were cached and probed with tuples of DP_iB having DTS values up to 34. These values are 15 and 50 respectively for the second rectangle. When two tuples, Tuple A and Tuple B, are about to be matched, we first check (using the first rectangle) whether Tuple A was cached and Tuple B was used to probe it. The second rectangle is then used in the same manner to check whether Tuple B was cached and was probed by Tuple A. In the example of Figure 8, the point represented by the DTSs of these two tuples, i.e. (20, 30), is found to fall within the first rectangle; thus we deduce that Tuple A was cached and Tuple B was used to probe it, so these tuples are not joined again.

2.4.2 Controlling the second stage

Recall that the main idea behind the second stage is to perform additional processing during delays in order to produce more tuples. As will be seen in Section 4, this reactive operation has the potential to dramatically improve the speed with which results are provided to the user. As described in Section 2.2, however, the overhead incurred by the second stage is hidden only when both inputs to the XJoin experience delays. Furthermore, due to the complexities of scheduling and duplicate detection, the operation of the second stage is fairly coarse grained; once processing begins on a partition, that processing is run to completion before the first stage can be resumed. As a result, there is a tradeoff between the aggressiveness with which the second stage is run and the benefits to be obtained by using it. To address this tradeoff, our implementation includes a mechanism that can be used to restrict the second stage to processing only those partitions that are likely to yield a significant number of result tuples. This activation threshold is specified as a percentage of the total number of result tuples expected to be produced from a partition during the course of the entire join. For example, if the join of a partition is expected to contribute 1000 tuples to the result, a threshold value of 0.01 will allow the second stage to process the partition as long as it is expected to produce 10 or more tuples. Thus, a lower activation threshold results in a more aggressive use of the second stage. The calculations of the expected numbers of tuples are performed using statistics as well as the linked lists which store the history of the execution of the second stage for the partitions.
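A hedged sketch of how such an activation test might look; the two estimate calls are placeholders for the optimizer statistics and the per-partition history lists mentioned above, not actual PREDATOR interfaces:

```python
# Illustrative activation-threshold test for the second stage.
def second_stage_worthwhile(partition, threshold):
    expected_total = partition.expected_total_output()   # over the whole join
    expected_new = partition.expected_new_output()        # not yet produced
    return expected_new >= threshold * expected_total

# Example from the text: with expected_total = 1000 and threshold = 0.01 the
# partition qualifies as long as at least 10 new result tuples are expected.
```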
3 Experimental Environment

In order to study the performance of XJoin we have implemented it in an extended version of PREDATOR [SP97], an Object-Relational DBMS that uses SHORE as its underlying storage manager. In previous work we extended PREDATOR to support Query Scrambling [UFA98]. Most of the relational engine has been modified to support both a data-driven and a control-driven execution model, and to process arbitrary bushy plans. In addition to adding the XJoin operator itself, we extended the query optimizer to account for the operator and to provide some of the statistics and calculations required by XJoin.

Because performing experiments directly on the Internet would not provide repeatable results, we instead modeled the behavior of the network using trace data that could be easily replayed. These traces provide detailed information about the characteristics of data transfer as perceived by the receiver, including connection delays, when and how much data was received, the silent periods between the data transfers, etc. In order to obtain these traces we performed experiments on the Internet by fetching large files (such as images, large texts, etc.) from 15 randomly chosen sites. The sources were sampled every hour for two days, and statistics were collected. We observed that the sources exhibited widely different degrees of burstiness, and we saw average transfer rates that ranged between 5 KBytes/sec and 180 KBytes/sec, with most values occurring between 20 KBytes/sec and 120 KBytes/sec.

From the arrival patterns we collected, we chose two as representatives of the behavior of a bursty and a fast source (Figures 9 and 10). The arrival patterns in these figures show the quantity of data received at the query site. Each trace is bucketized by dividing the time axis into 100 buckets and aggregating the amount of data transferred within each bucket. Each impulse gives the magnitude of the data transfer (in bytes) during the corresponding time bucket. The bursty and fast patterns have average transfer rates of 23.5 KBytes/sec and 129.6 KBytes/sec respectively. Since the first pattern has a slow overall transfer rate, we refer to it as the "slow" arrival pattern.

Figure 9: Bursty arrival. Average rate 23.5 KBytes/sec (total bytes: 3,822,054; bucket: 1578 msecs).
Figure 10: Fast arrival. Average rate 129.6 KBytes/sec (total bytes: 514,780; bucket: 38 msecs).

The arrival behavior of a remote relation is modeled by using one of the network traces in the following manner. First the network trace is read and stored in memory (network traces are usually small). The tuples of the relation are then generated on-the-fly in the PREDATOR scan operator, but rather than sending tuples of the relation directly to the parent (i.e., join) operator, the scan introduces artificial delays corresponding to the arrival pattern in the chosen network trace.
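The following sketch illustrates the general idea of such a trace-driven scan: tuples are handed to the parent operator only as fast as the recorded trace says bytes arrived. The trace format (elapsed-time/cumulative-bytes pairs) and all names are assumptions made for this example, not the actual PREDATOR scan code.

```python
# Illustrative trace replay: release tuples at the rate the trace prescribes.
import time

def delayed_scan(tuples, trace, tuple_size):
    """tuples: an iterator of tuples; trace: list of (elapsed_s, cum_bytes)."""
    start = time.time()
    sent_bytes = 0
    it = iter(tuples)
    for elapsed_s, cum_bytes in trace:
        wait = elapsed_s - (time.time() - start)
        if wait > 0:
            time.sleep(wait)               # reproduce the silent period
        while sent_bytes + tuple_size <= cum_bytes:
            try:
                yield next(it)             # hand the next tuple to the join
            except StopIteration:
                return
            sent_bytes += tuple_size
```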
In the experiments we use a database containing up to six 100,000-tuple Wisconsin benchmark relations [BDT83]. Each tuple is 288 bytes, for a total of 28.8 MB per relation. For some of the experiments we project these tuples down to 86 bytes, or 8.6 MB total. Each relation is populated according to the Wisconsin benchmark specifications, using different random seeds. All of the experiments described here use a unique, unclustered integer attribute (unique1) as the join attribute, so the result cardinality of all of the queries is also 100,000 tuples.

We ran the experiments on a Sun Ultra 5 workstation running Solaris 2.6, with 128 MBytes of real memory and approximately 4 GBytes of disk space. Each disk and memory page is 8 KBytes. In all the experiments the SHORE (i.e., storage manager) buffer size was set at 800 KBytes; the amount of main memory allocated to each XJoin operator is 3 MBytes, unless noted otherwise.

4 Results

In this section we investigate the performance of XJoin and compare it to that of Hybrid Hash Join (HHJ). (We also compared the performance of XJoin to the basic Symmetric Hash Join (SHJ). In the case where there is sufficient memory to run SHJ, XJoin was found to perform nearly identically to SHJ. When there is less memory, as is the case in the experiments reported here, SHJ was found to thrash, at which point it became impractical to use. Thus, we do not report results for SHJ here.) We examine three variations of XJoin, in order to separate out the contributions of the major components of the algorithm. The first variation, labeled XJoin-No2nd, does not use the second stage at all. Thus, it simply runs the first stage (flushing tuples to disk as necessary) until all input tuples have been received, at which point the third stage is initiated. The second variation, labeled XJoin-NoCache, uses the second stage, but without the caching optimization described in Section 2.4.1. The third variation, labeled XJoin, is the full implementation of XJoin as described in Section 2 (i.e., including the second stage with caching).

We have examined two variations of HHJ: one in which tuples from remote sources are fetched only when they are required by the operator, and one in which tuples from all of the remote sources are fetched in parallel, and those that are not needed when they arrive are temporarily spooled to the local disk. Reading the tuples in the background reduces the amount of time that the operator must spend waiting for the tuples to arrive (especially for slow sources), so the latter approach was found to perform better in all of our experiments. Thus, the results we report for HHJ in this section are those for the version that reads all inputs in parallel. This improvement to HHJ levels the playing field, since XJoin also inherently accesses its inputs in parallel.

4.1 Experiment 1 - Basic performance of XJoin

In the first set of experiments, we examine the basic performance of XJoin and HHJ using a simple two-way join query and four different delay scenarios. The join in this query is the 1-to-1 join of the 100,000-tuple Wisconsin relations described in the previous section. The delay scenarios are constructed by applying the four combinations of the slow and fast arrival patterns shown in Figures 9 and 10 to the two input relations. In this experiment, the join operators are allocated 3 MBytes of memory, and the input relations contain 28.8 MBytes each. Note that for this first set of experiments, the activation threshold for the second stage is set to 0.01, which is a fairly aggressive setting.
This threshold is the focus of the second set of experiments, described in Section 4.2.

Figures 11 through 14 show the cumulative response times for the four algorithms (the three variations of XJoin plus HHJ) as result tuples are produced, for the four combinations of network traces. The x-axis shows a count of the result tuples produced (from one to 100,000) and the y-axis shows the time at which that result tuple was produced. As can be seen in the figures, in all cases the three variants of XJoin produce the first answers several orders of magnitude faster than HHJ, thereby providing far superior interactive performance. Perhaps even more surprisingly, the XJoin variations are competitive even in terms of the completion time in most of the cases.

Among the XJoin variants, for cases with at least one slow source, the full XJoin algorithm performs best, and the variant without the second stage performs worst over the entire execution of the query. In fact, in these cases, XJoin-No2nd provides little if any advantage over HHJ after the delivery of the first 10% of the answer. These results demonstrate the importance of the reactive background processing of the second stage for coping with delays and bursty arrival. In the case where both sources are fast, this ordering of the XJoin variants holds for the delivery of the initial results, but inverts for the later results. In fact, for this last case, HHJ performs somewhat better than XJoin and XJoin-NoCache for the delivery of the second half of the result. This latter result demonstrates the pitfalls of overly aggressive use of the second stage in cases where tuple delivery is relatively fast. These results are described in more detail in the following sections.

4.1.1 Slow Network

Figure 11 shows the case when both sources are slow. HHJ starts producing tuples very late, since it waits for the entire build relation to arrive before it produces any tuples. It then produces the remaining tuples fairly quickly, since it has already prefetched most of the probe relation while processing the build relation. In contrast, all of the XJoin variants produce all the tuples faster than HHJ, with several orders of magnitude improvement in getting the initial results.

A comparison of the XJoin variants reveals the benefits of the second stage and of using a cache. XJoin-No2nd, which does not use the second stage, performs comparably to the other two XJoin variants for the first few results, but its performance quickly degrades. The first stage stays active until the 850th second, i.e., until both sources arrive completely. During this time it produces only about 9000 tuples, due to the limited memory. Recall that the first stage can only join tuples while they are memory-resident; in this case, only about 5% of the tuples can be memory-resident at a given time. The third stage takes over after all the input has arrived and produces the remaining 91,000 of the tuples in 45 seconds. Thus, although XJoin-No2nd still performs better than HHJ, it produces the majority of the tuples only after a large latency.

The effect of enabling the second stage is shown by the XJoin-NoCache curve.
In this case the second stage is interleaved with the first stage, and both stages are executed in this fashion until the 851st second. In the same time that it takes XJoin-No2nd to produce only 9000 tuples, XJoin-NoCache produces about 77,000 tuples. Therefore the second stage considerably boosts (by a factor of about 8.5) the performance of XJoin in this case. Recall that the advantage of the second stage is that it joins memory-resident tuples with disk-resident tuples while waiting for additional input to arrive. Without the second stage, these joins would be delayed until all input was received. Comparing the XJoin and XJoin-NoCache lines, it can be seen that adding a cache to the second stage provides further improvement in this case by allowing additional tuples to be joined during delays.

Figure 11: Slow build, slow probe.
Figure 12: Slow build, fast probe.
Figure 13: Fast build, slow probe.
Figure 14: Fast build, fast probe.

In this case, for example, the first 5,000 tuples are produced 29% faster when a cache is used. Interestingly, the performance benefits of XJoin carry over even to the last tuple delivered in this case, indicating that the third stage of XJoin is as fast or faster than the later stages of HHJ once all input tuples have arrived. The reasons for this are twofold. First, recall that HHJ spools tuples of the probe relation to disk until they are needed. Thus, the probe relation tuples must be partitioned before they can be used. If this overhead were removed (which, conceivably, it could be), HHJ would produce its last tuple slightly faster, namely at about the time that XJoin-No2nd completes, but still later than the other XJoin variants. The more fundamental difference is that, due to the tuples produced by the second stage, XJoin-NoCache and XJoin both produce far fewer tuples in the third stage. The creation of result tuples incurs CPU overhead; incurring this overhead in the second stage allows it to be overlapped with the arrival delays, while in contrast, incurring the overhead after all tuples have arrived (as is done by XJoin-No2nd and HHJ) simply adds to the completion time of the query.

This experiment showed the following results for a slow or bursty environment: (a) XJoin can provide major improvements in the delivery of initial answers to users, (b) the use of reactive background processing to exploit delays (as embodied in the second stage) is crucial for attaining good performance beyond just the initial tuples, and (c) the use of a cache can further improve performance. We now turn to the other three delay scenarios studied in this experiment.

4.1.2 Mixed Network

Figures 12 and 13 show the results when one of the sources is fast and the other is slow.
As in the slow/slow case, the XJoin variants all perform as well as or better than HHJ, and the XJoin variants that use the second stage perform significantly better. The XJoin variants exhibit the same behavior in both the slow/fast and fast/slow cases, since they do not distinguish between the build and probe relation. Compared to the slow/slow case, the XJoin variants that include the second stage perform better, as they are able to exploit the faster arrival of either of the inputs. XJoin-No2nd produces its initial tuples slightly earlier than in the slow/slow case, but it still incurs significant blocking, which causes it to lag far behind the other XJoins for most of the execution. The performance of HHJ for the initial results is dominated by the speed of the build relation; thus it has much better initial performance for the fast/slow case than for the slow/fast case. Still, in either situation, HHJ incurs significant slowdown compared to XJoin-NoCache and the full XJoin.

4.1.3 Fast Network

XJoin is intended to solve the problem of bursty and slow arrivals, and the previous three cases have shown that XJoin effectively deals with these arrival problems. Now we examine how XJoin performs when the network is relatively fast. In this case (shown in Figure 14) all three XJoin variants still provide substantial benefits in the delivery of the initial results. Users of HHJ would have to wait over 2.5 minutes before seeing their first results, whereas XJoin would give the first result after only about 2 seconds. As the query progresses, however, the benefits of XJoin diminish in this case. In fact, HHJ is able to deliver the second half of the result faster than XJoin-NoCache and the full XJoin here. Furthermore, XJoin-No2nd, which performed the worst of the three variants in the previous cases, loses here initially, but ultimately delivers the last 60% of the result faster than the variants that use the second stage.

These results demonstrate the tradeoffs of the background processing done by the second stage that were identified in Section 2.2: namely, that the overhead incurred by the second stage pays off only when it is used to offset delays. Recall that the fairly coarse-grained nature of the second stage causes it to commit to some amount of work each time it is executed. In this experiment, the second stage was run in a very aggressive manner (i.e., activation threshold = 0.01). This aggressiveness is beneficial in the presence of delays, but as shown in this case, can hurt in their absence. Thus, in the next set of experiments we investigate a way of controlling the aggressiveness of the second stage.

4.2 Experiment 2 - Controlling the Second Stage

The previous set of experiments showed that the second stage of XJoin is crucial for good interactive performance when dealing with slow and bursty data sources, but that with more reliable sources, it could adversely affect the overall query response time. One way to reduce this overhead is to employ the second stage less aggressively, i.e., less often. Recall that the ability of the second stage to operate can be manipulated by varying the activation threshold presented in Section 2.4.2. A higher threshold value makes it less likely for the second stage to be executed. Looking more closely at the results of the previous experiment, it can be seen that the benefits of the second stage are highest during the early parts of the query execution and that they diminish as the execution progresses.
This behavior arises because as the disk-resident portions of the partitions grow, the coarseness of the second stage increases (recall that a disk-resident portion is the unit of granularity at which the second stage executes). This increases the risk of executing the second stage. Because of this behavior we have developed a version of XJoin, called XJoin-Dyn, that is more aggressive in the early stages of the query and becomes less aggressive as more of the result is produced. XJoin-Dyn starts with a low threshold value (0.01), and linearly increases it to a much larger value (0.20) as the percentage of result tuples produced grows.

Figure 15: Two slow relations.
Figure 16: Two fast relations.

The effectiveness of XJoin-Dyn is examined in Figures 15 and 16, which show the performance for the query used in the previous experiments with the slow/slow and fast/fast network settings respectively. In addition to XJoin-Dyn, the figures show the HHJ and XJoin (i.e., threshold = 0.01) algorithms, as well as a conservative version of XJoin with the threshold fixed at 0.20 (labeled XJoin-Conservative in the figures).

As seen in the previous experiments, when both sources are slow (Figure 15) the more aggressive XJoin performs best. The less aggressive setting (XJoin-Conservative) provides poor initial performance, but provides reasonable (although still worse than XJoin) performance as the query progresses. XJoin-Dyn initially follows XJoin but then becomes more like XJoin-Conservative as it becomes increasingly conservative; that is, it performs slightly worse than the aggressive setting. With two fast sources (Figure 16), as seen in the previous section, the aggressive setting (XJoin) provides good performance initially, but quickly falls behind, eventually losing even to HHJ. In this case XJoin-Dyn is able to combine the best aspects of both conservative and aggressive behavior. Its initial aggressiveness leads to good interactive performance, while its gradual conversion to a more conservative behavior reduces the overhead. Thus, in this case, XJoin-Dyn is able to provide excellent interactive performance initially, while performing better than HHJ throughout the entire query. Because of its good behavior for fast networks and small costs for slow networks, we adopt this dynamic version of XJoin for the remaining experiments, and simply refer to it as "XJoin".

4.3 Experiment 3 - The effect of memory size

Recall that a prime motivation for designing XJoin was the huge memory requirements of the symmetric hash join. The memory requirements of XJoin are reduced by allowing the staging of data to disk. This staging, however, adds overhead, both in terms of disk I/O and in terms of duplicate detection. In order to explore the effect of memory size on XJoin, we have performed experiments varying the size of the memory. For these experiments, we used a single-join query similar to the one used in the previous sections, but with the size of the input relations projected down to 8.6 MBytes (rather than the 28.8 MBytes used earlier). We use three different memory allocations: 3 MB, 10 MB, and 20 MB.
3 MB represents the limited-memory case, where neither of the inputs fits into memory, while the other two represent cases where one input, and both inputs, fit into memory respectively.

Figures 17 and 18 show the result delivery times of XJoin and HHJ for the three memory sizes in the slow/slow and fast/fast cases respectively. In all the cases XJoin performs better than HHJ, with orders of magnitude improvement in the interactive region. XJoin also wins in terms of the completion time in all the cases. Note that when both relations fit in memory (e.g., 20 MB), the basic Symmetric Hash Join (SHJ) is a viable alternative. In this case, however, its performance (not shown) is virtually identical to that of XJoin for all delay scenarios.

Figure 17: Slow network, varying memory.
Figure 18: Fast network, varying memory.

4.4 Experiment 4 - Stress testing XJoin

Finally, we briefly report on one additional set of experiments that investigates the impact of query complexity on XJoin. We varied the number of relations in the query from 2 to 6 (i.e., 1 to 5 joins). Again, all joins were on the unique1 attribute, so the cardinality of all the results was 100,000 tuples. Up to 6 different relations were used, each projected down to 8.6 MBytes. The available memory was scaled with the queries by allocating 3 MBytes to each join operator. We repeated the experiment for two network settings: Figure 19 contains the production times of various tuples for XJoin and HHJ when all the sources are slow. Each row corresponds to a different degree of complexity in terms of the number of relations joined. Figure 20 compares the performance of both operators in a similar manner, but for an extreme case, i.e., when none of the inputs is problematic.

Figure 19: Tuple production times of XJoin and HHJ in seconds, slow network. Each cell shows XJoin / HHJ.

Rels   First tuple     5,000           20,000          50,000          Last tuple
 2       5 /  823      195 /  826      452 /  836      668 /  836      860 /  878
 3      17 /  862      340 /  873      561 /  901      763 /  930      865 /  949
 4      46 /  916      482 /  938      684 /  979      786 /  992      907 / 1018
 5      85 /  353      563 /  965      709 / 1019      864 / 1036      981 / 1066
 6      79 /  992      400 / 1075      631 / 1120      860 / 1144      952 / 1174

Figure 20: Tuple production times of XJoin and HHJ in seconds, fast network. Each cell shows XJoin / HHJ.

Rels   First tuple     5%              20%             50%             Last tuple
 2       1 /  150       36 /  153       87 /  163      127 /  181      178 /  201
 3       4 /  184       90 /  195      157 /  223      233 /  252      296 /  270
 4      17 /  285      175 /  307      282 /  348      378 /  362      470 /  387
 5      35 /  353      274 /  417      406 /  469      538 /  487      635 /  516
 6      75 /  476      381 /  559      598 /  605      803 /  629      892 /  660

In general we see that in all the cases, for both network settings, XJoin continues to deliver very good performance in delivering the initial portions of the query results. When the network is slow (Figure 19), XJoin performs better than HHJ in all the cases, including the query completion time. Thus, XJoin is effective in coping with slow and bursty sources even for fairly complex queries.

Figure 20 shows the performance of both algorithms in the unlikely case where none of the remote sources are problematic. Here too, XJoin performs better than HHJ in delivering the initial results (e.g., up to the first 20,000 tuples) for all of the cases. For example, even with 6 relations, XJoin produces the first tuple over 6 times faster than HHJ. With increasing query complexity and the lack of delays, however, the delivery of the later tuples by XJoin begins to lag behind that of HHJ. This behavior arises because the second stages of individual operators are controlled on a per-operator basis rather than globally, and thus the second stage of one operator can interfere with the first- and third-stage processing of other operators. While this issue could be addressed by a more global scheduling mechanism, even in the extreme case here (with 6 fast relations) the degradation in the delivery of the last tuple is only 35%.

5 Related work

Carey and Kossmann [CK97] worked on returning the first N tuples of the result of queries including an ORDER BY clause.
Their method tries to minimize the time it takes to produce the first N tuples, although the user may not get any result until all N tuples are computed. Their algorithm does not attempt to compute more than N tuples, and it does not extend to bursty and slow environments, since joins would still block when a relation exhibits bursty behavior.

Bayardo and Miranker [BM96] worked on returning the first result of a query. In doing so they have modified the well-known nested loops join algorithm to detect and minimize the wasted work in a long pipeline of nested loops join operators. The limitation of their work is twofold. First, nested loops join is an inefficient join for large inputs when there are no indices on the join attributes of the relations. Thus nested loops does not scale well if the user wishes to get more than one result fast. Second, the output characteristics of the nested loops join are affected by the behavior of the inner relation. If the inner relation is slow or bursty, the output of the join will also be slow or bursty.

The online aggregation work presented in [HHW97] allows the output of a query (in their case an aggregate query) to be presented to the user as it improves in real time. This requires joins that are non-blocking, i.e., that supply their output early and in a steady fashion. In [Hel97], Hellerstein discusses different versions of nested loops join algorithms with different traversal patterns. One could change the traversal pattern dynamically if one of the input sources blocks. However, the query execution is still likely to suffer from the inefficient nature of the nested loops join when the inputs are large.

The Approximate query processor [VL93] first computes a semantically approximate answer to a query. This answer is then refined over time. However, their work requires that the base relations be partitioned on some of the selection attributes for efficient processing. This may not be true in a widely distributed heterogeneous system. Also, their method is not directly extensible to bursty environments.

Our previous work on Query Scrambling [UFA98] tries to react to the changes in the network by dynamically restructuring the query plan on the fly. This approach is aimed primarily at improving the overall response time of the entire query, rather than improving the response of the first few tuples. As such, we have developed XJoin as a complementary approach to Query Scrambling.

Finally, as mentioned in the introduction, we have just recently become aware of a related project that has been developed independently from and concurrently with the XJoin work. This is the Tukwila system being developed at Washington [IFFL+99]. This system includes techniques based on Query Scrambling, but also includes a pipelined join operator based on Symmetric Hash Join.
6 Conclusion and Future Work

In this paper, we addressed the problem of getting fast query answers in an unpredictable communications environment such as the Internet. We presented a multi-staged join algorithm, called XJoin, that extends the symmetric hash join. XJoin has similar performance when memory is plentiful, but can operate efficiently with far less memory, making it much more practical for this environment. XJoin lowers its memory requirement by partitioning its inputs, and it achieves a high degree of pipelining by running its stages in an interleaved fashion (a minimal sketch of this partitioned processing is given at the end of this section). We also presented a fast, on-the-fly duplicate elimination algorithm that removes the potential duplicate tuples that could otherwise be created due to the overlapping nature of the stages.

An important feature of XJoin is its ability to react to delays and take advantage of silent periods to produce more tuples earlier. The algorithm includes a dynamic mechanism by which its aggressiveness in exploiting delays is adjusted during the execution of a query.

We have implemented XJoin in the PREDATOR Object-Relational DBMS and compared its performance with that of hybrid hash join using real network traces. We performed a detailed experimental study that investigated the performance of XJoin in the presence of different data delivery rates, memory sizes, and query complexities. In all the cases studied, XJoin had much better interactive performance (i.e., in terms of producing the initial portions of the result) than hybrid hash join, often by several orders of magnitude, and in most cases it also performed better than hybrid hash join for the entire query, delivering even the final result tuple as fast or faster.

In terms of future work, we plan to combine the present work with our previous work on Query Scrambling to provide a unified set of techniques for dealing with the problem of unpredictable data delivery in wide-area networks. We also plan to work on delivering more "interesting" portions of a result (such as some subset of columns) faster in wide-area environments. Such query behavior is desirable when the semantics of the application are such that some identifiable portions of the data are substantially more important than others.
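As an illustration of the partitioned, memory-bounded processing summarized above, the following sketch shows an insert-and-probe step that keeps a memory-resident and a disk-resident portion per partition and flushes the largest memory-resident partition when a budget is exceeded. The constants, names, and flush policy are our own assumptions for this sketch, not the PREDATOR implementation.

    NUM_PARTITIONS = 8
    MEMORY_BUDGET = 10_000       # max tuples kept in memory per input (assumed)

    def partition_of(key):
        # Hash-partition on the join key so matching tuples land in the same partition.
        return hash(key) % NUM_PARTITIONS

    class JoinInput:
        def __init__(self):
            self.mem = [dict() for _ in range(NUM_PARTITIONS)]   # partition -> {key: [tuples]}
            self.disk = [[] for _ in range(NUM_PARTITIONS)]      # flushed (disk-resident) portions
            self.resident = 0

        def flush_largest_partition(self):
            p = max(range(NUM_PARTITIONS),
                    key=lambda i: sum(len(v) for v in self.mem[i].values()))
            for key, tuples in self.mem[p].items():
                self.disk[p].extend(tuples)
            self.resident -= sum(len(v) for v in self.mem[p].values())
            self.mem[p] = {}

    def first_stage_insert(tup, key, this_input, other_input):
        """Insert an arriving tuple and immediately probe the other input's
        memory-resident portion of the same partition (pipelined output)."""
        p = partition_of(key)
        if this_input.resident >= MEMORY_BUDGET:
            this_input.flush_largest_partition()   # stay within the memory budget
        this_input.mem[p].setdefault(key, []).append(tup)
        this_input.resident += 1
        for match in other_input.mem[p].get(key, []):
            yield (tup, match)                     # results flow as soon as data arrives

    # Example: with an empty right input, the first insert produces no matches yet.
    left, right = JoinInput(), JoinInput()
    results = list(first_stage_insert(("a", 1), 1, left, right))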
References

[AFTU96] L. Amsaleg, M. J. Franklin, A. Tomasic, and T. Urhan. Scrambling Query Plans to Cope With Unexpected Delays. PDIS Conf., Miami, FL, 1996.

[AFT98] L. Amsaleg, M. J. Franklin, and A. Tomasic. Dynamic Query Operator Scheduling for Wide-Area Remote Access. Journal of Distributed and Parallel Databases, 6(3), July 1998.

[BDT83] D. Bitton, D. J. DeWitt, and C. Turbyfill. Benchmarking Database Systems: A Systematic Approach. VLDB Conf., Florence, Italy, 1983.

[BM96] R. Bayardo and D. Miranker. Processing Queries for the First Few Answers. Proc. CIKM Conf., Rockville, MD, 1996.

[CK97] M. J. Carey and D. Kossmann. On Saying "Enough Already!" in SQL. ACM SIGMOD Conf., Tucson, AZ, 1997.

[Gra93] G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, 25(2), 1993.

[GRVB98] J. Gruser, L. Raschid, M. E. Vidal, and L. Bright. Wrapper Generation for Web Accessible Data Sources. Int. Conf. on Cooperative Information Systems, New York City, NY, 1998.

[HHW97] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. ACM SIGMOD Conf., Tucson, AZ, 1997.

[Hel97] J. M. Hellerstein. Online Processing Redux. Data Engineering Bulletin, 20(3), 1997.

[HS93] W. Hong and M. Stonebraker. Optimization of Parallel Query Execution Plans in XPRS. Distributed and Parallel Databases, 1(1):9-32, 1993.

[IFFL+99] Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld. An Adaptive Query Execution System for Data Integration. To appear in ACM SIGMOD Conf., Philadelphia, PA, 1999.

[ONK+97] F. Ozcan, S. Nural, P. Koksal, C. Evrendilek, and A. Dogac. Dynamic Query Optimization in Multidatabases. Data Engineering Bulletin, 20(3), September 1997.

[Sha86] L. D. Shapiro. Join Processing in Database Systems with Large Main Memories. ACM Transactions on Database Systems, 11(3), 1986.

[SP97] P. Seshadri and M. Paskin. PREDATOR: An OR-DBMS with Enhanced Data Types. ACM SIGMOD Conf., Tucson, AZ, 1997.

[TRV96] A. Tomasic, L. Raschid, and P. Valduriez. Scaling Heterogeneous Databases and the Design of DISCO. ICDCS Conf., Hong Kong, 1996.

[UFA98] T. Urhan, M. J. Franklin, and L. Amsaleg. Cost Based Query Scrambling for Initial Delays. ACM SIGMOD Conf., Seattle, WA, 1998.

[VL93] S. V. Vrbsky and J. W. S. Liu. Approximate: A Query Processor that Produces Monotonically Improving Approximate Answers. IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.

[WA90] A. N. Wilschut and P. M. G. Apers. Pipelining in Query Execution. Conf. on Databases, Parallel Architectures, and their Applications, Miami, FL, 1990.