Beyond Uniformity and Independence : Analysis of R-trees Using
the Concept of Fractal Dimension
Christos Faloutsos

and Ibrahim Kamel
Department of Computer Science and
Institute for Systems Research (ISR)
Univ. of Maryland at College Park
February 24, 1994
Abstract
We propose the concept of fractal dimension of a set of points, in order to quantify the
deviation from the uniformity distribution. Using measurements on real data sets (road inter-
sections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.) we
provide evidence that real data indeed are skewed, and, moreover, we show that they behave as
mathematical fractals, with a measurable, non-integer fractal dimension.
Armed with this tool, we then show its practical use in predicting the performance of spatial
access methods, and specically of the R-trees. We provide the rst analysis of R-trees for
skewed distributions of points: We develop a formula that estimates the number of disk accesses
for range queries, given only the fractal dimension of the point set, and its count. Experiments
on real data sets show that the formula is very accurate: the relative error is usually below 5%,
and it rarely exceeds 10%.
We believe that the fractal dimension will help replace the uniformity and independence
assumptions, allowing more accurate analysis for any spatial access method, as well as better
estimates for query optimization on multi-attribute queries.
1 Introduction
In this work we study the distribution of real datasets with D-dimensional points. Sets of multi-
dimensional points appear in several database applications: records in relational DBMSs can be

This research waspartially funded by the National Science Foundation under Grants IRI-9205273 and IRI-8958546
(PYI), with matching funds from EMPRESS Software Inc. and Thinking Machines Inc.
1
viewed as points in attribute space; geographic information systems (GIS) [21] contain point data,
such as cities on a 2-dimensional map; medical image databases with, eg., 3-dimensional MRI brain
scans, require thestorageand retrieval of point-sets, such asdigitized surfaces ofbrain structures[2],
etc. For the balance of this work, the term point-set and dataset will be used interchangeably.
An interesting problem is the estimation of the search eort for range queries, when the
points are stored in a spatial access method (SAM), such as an R-tree. A closely related problem
is the estimation of selectivity (ie., number of qualifying points) for range queries: this is useful
in the query optimization in an RDBMS with multidimensional queries [14] or in a GIS [1]. The
traditional assumptions are the uniformity and independence assumption, which make the analysis
tractable.
However, these assumptions do not hold on real data; moreover, they lead to pessimistic
estimates [5]. For a single attribute, the uniformity assumption has been relaxed, eg., [10], typically
using the Zipf distribution [24]. Distributions of real attributes indeed follow the Zipf distribution
or the generalized Zipf distribution: for example, word frequencies in the English language (as well
as other languages); salaries [24]; rst names and last names of people [7], etc.
However, no attempt has been made to model multi-dimensional distributions. Theoretical
analyses in such cases assume that points are uniformly distributed in the address space [6],[1],
which also implies that the attributes are uncorrelated. Even in simulation studies, researchers
on spatial access methods and multi-attribute query optimization are forced to use ad-hoc, non-
uniform distributions, such as the gaussian distribution [15], some sort of clustered distributions
(with points clustering around uniformly distributed sites [17], or points clustering around curves,
like the sinusoidal curve [4]), or even to obey the independence assumption anyway [14]. Although
non-uniform, it is unclear whether these distributions resemble real distributions.
Evidence that real distributions violate both assumptions is overwhelming. Figure 1 provides
a counter-example for both assumptions, using two real datasets. Figure 1(a) shows the population
vs. area scatter-plot for 181 nations, from the World Factbook (available, eg., with anonymous
ftp from ocf.berkeley.edu, /ftp/pub/Library/Reference/World Factbook). Notice that the
points follow a highly skewed distribution: there are a few large and/or highly populated nations
(eg., China, India), while the vast majority of points is close to the origin. Figure 1(b) shows the
same scatter plot in doubly logarithmic scales, to highlight the strong correlation between area and
population.
Figure 1 (c) shows the road intersections for the Montgomery County of Maryland. Notice
that the distribution is clearly non-uniform. Also notice that it does not resemble a gaussian
distribution either : there are several high-concentration areas (one for each city, like Bethesda,
Rockville, Gaithersburg etc).
2
 
Population x 109
6Area x 10
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
1.10
1.20
0.00 2.00 4.00 6.00 8.00
 
Population
Area
5
1e+04
2
5
1e+05
2
5
1e+06
2
5
1e+07
2
5
1e+08
2
5
1e+09
2
1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07
 x 103
3 x 10
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
22.00
24.00
26.00
28.00
30.00
0.00 5.00 10.00 15.00 20.00 25.00 30.00
(a) (b) (c)
Figure 1: (a) WORLD dataset (b) WORLD dataset - log-log scale (c) Cross-roads of Montgomery
County of Maryland
In conclusion, we need a way to describe distributions of real multi-dimensional points. The
problem we want to solve is the following:
GIVEN real distributions of multi-dimensional points
FIND a way to characterize their deviation from uniformity
TO PREDICT the search performance of a spatial access method (SAM).
The description should allow us to analyze several types of queries, such as 'range' queries, 'nearest
neighbor' queries, 'spatial joins' etc. In this paper we focus on range queries.
For the above problem, we propose the concept of fractal dimension as the solution. We show
that several real distributions indeed exhibit fractal behavior, that is, they are self-similar: portions
of the point-set are statistically similar to the whole set (see Figure 3 for an example). We describe
how to compute the fractal dimension of a set of points, and then we show how to use it to predict
the performance of spatial access methods that store this point-set. Specically, we examine the
R-trees and provide the rst known analysis for them, for real distributions.
The paper is organized as follows: Section 2 provides the background information about
R-trees. Section 3 gives the denition of the fractal dimension, along with examples from real
data. Section 4 presents the analysis. Section 5 gives the experimental results and Section 6 lists
the conclusions.
3
2 Background
In this section, we provide a brief description of R-trees [9], which will be the SAM that we use to
illustrate our analysis. Additional SAMs include the quad-tree based methods [8, 17] and grid le
and its variants [16]; see [20] for a survey. We focus on the R-tree because it seems to be one of
the most successful methods.
The original R-tree [9] was suggested by Guttman; it is the extension of the B-tree for
multidimensional objects. A geometric object is represented by its minimum bounding rectangle
(MBR). Non-leaf nodes contain entries of the form (ptr,R) where ptr is a pointer to a child node in
the R-tree; R is the MBR that covers all rectangles in the child node. Leaf nodes contain entries
of the form (obj-id, R) where obj-id is a pointer to the object description, and R is the MBR of
the object. The main innovation in the R-tree is that father nodes are allowed to overlap. This
way, the R-tree can guarantee at least 50% space utilization and at the same time remain balanced.
Figure 2 illustrates data rectangles (in black), organized in an R-tree with fanout 3.
0 1
0
1
4
5
6
7
8
9
10
11 12
1
2
3
Figure 2: Data (dark rectangles) organized in an R-tree with fanout=3
Subsequent work on R-trees includes the packed- [19] and Hilbert-packed R-trees [12], the
R
+
-tree [23], R-trees using Minimum Bounding Polygons [11] and the R

-tree [4]. The latter seems
to give the best search times, mainly thanks to the idea of deferring the splits by 'force-reinserting'
some of the entries of the overowing nodes.
The last part in this section is a summary of previous attempts to analyze the R-trees. In [6]
we assumed that the points are uniformly distributed in the address space. Recently, we provided
formulas [12] that assume that the R-tree has been built and that we can measure the MBR of
each node of the R-tree. The results are as follows. Consider the n-th node of the tree, and let its
MBR be x
1;n
:::x
D;n
. We denote it compactly as:
~x
n
= (x
1;n
;:::;x
D;n
) (1)
4
Similarly, let
~q = (q
1
;:::;q
D
) (2)
denote a range query with side q
i
on the i-th dimension. Then, we have the following theorem:
Theorem 1 The average number P(~q) of nodes ( pages) accessed by a query of sides ~q is given
by
P(~q) =
X
n
D
Y
i=1
(x
i;n
+q
i
) (3)
where the summation extends over all the nodes of the tree.
Proof: See [12].
A similar analysis, mainly focusing on square queries, was published independently in [18].
Notice that Theorem 1 allows us to calculate the node accesses for any level of the tree: we only
need to restrict the summation over the nodes of the level(s) of interest.
In the next sections we shall see how to predict the size of the MBRs of the nodes of the tree,
before even the tree is built. To do that, we show that we only need the fractal dimension d of the
point-set, and, of course, the number of points N and the fanout of the tree.
3 Fractal Dimension
Intuitively, a set of points is a fractal if it exhibits self-similarity overall scales. This is illustrated by
an example: Figure 3(a) shows the rst few steps in constructing the so-called Sierpinski triangle.
Figure 3(b) gives 5,000 points that belong to this triangle. Theoretically, the Sierpinski triangle is
derived from an equilateral triangle ABC, by excluding its middle (triangle A'B'C') and recursively
repeating this procedure for each of the resulting smaller triangles. The resulting set of points
exhibits `holes' in any scale; moreover, each smaller triangle is a miniature replica of the whole
triangle. In general, the characteristic of fractals is this self-similarity property: parts of the fractal
are similar (exactly or statistically) to the whole fractal. For our experiments we use 50,000 sample
points from the Sierpinski triangle (`SIERPINSKI' dataset).
The Sierpinski triangle gives an example of points which follow a highly non-uniform dis-
tribution; yet, the distribution is deterministic, and easy to describe, at least in English. There
should be an equally easy, way to describe it mathematically.
3.1 Formal denitions and Measurements
Consider a geometrical object (eg., like the Sierpinski triangle) with the 'self-similarity' property,
consisting of a set of points in D-dimensional space. The dimensionality D of the address space is
5
B
CA A C
B
A C
B
C?
?
A?
B?
C?
B?
A?
(a)
Y x 103
3X x 10
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
22.00
24.00
26.00
28.00
30.00
0.00 10.00 20.00 30.00 40.00 50.00 60.00
1
10
100
1000
10000
100000
100 1000 10000
log(N(r))
log(r)
"SIERPINSKI"
exp(16.2) * x**-1.59
1-d object
2-d object
(b) (c)
Figure 3: (a)Three steps in generating sierpinski triangle (b) 5000 points forming Sierpinski triangle
(c) The slope of the solid line gives the fractal dimension of Sierpinski triangle  1.59
dened as the embedding dimension.
The fractal dimension d, (or, more accurately, the box-counting fractal dimension) is dened
as follows [22]: Divide the D-dimensional space into (hyper-)cubic grid cells of side r. Let N(r)
denote the number of cells that are penetrated by the fractal (i.e., contain 1 or more points of it).
Then the (box-counting) fractal dimension d of a fractal is dened as
d  lim
r!0
logN(r)
log(1=r)
(4)
This denition is useful for mathematical fractals, that consist of innite number of points.
For a nite sample of points that belong to a fractal, we use the boxcount-side plot to estimate the
fractal dimension. By boxcount-side plot we mean the plot of log(N(r)) vs log(r). Its usefulness
lies in the fact that, if the point-set is self-similar, the plot is almost a straight line. Moreover, the
slope of the line is the (negated) fractal dimension of the point-set. This is the standard way to
measure the fractal dimension for real point-sets [22]:
Denition 1 For a point-set that has the self-similarity property, its fractal dimension d is mea-
sured as
d = BnZr
@ log(N(r))
@log(r)
= constant (5)
6
Corollary 1 For a point-set with the self-similarity property, we have
N(r) = K=r
d
(6)
where K is an integration constant.
Proof: Since d remains constant with r, we obtain Eq. 6 by integrating Eq. 5.
Next we give some examples to illustrate how the method works.
(b)
(1,1)
(0,0)(a)(0,0)
(1,1)
r = 1/4 r = 1/8
(1,0)
(0,1) (0,1)
(1,0)
Figure 4: Computing the fractal dimension of a line segment
Example 1: Figure 4 illustrates the method when the object is the diagonal line segment (0,0)-
(1,1). Figure 4(a) shows that the line segment penetrates 4 boxes of side r = 1=4, and Figure 4(b)
shows it penetrating 8 boxes of side r = 1=8. Thus, we have
N(r) = (1=r) = (1=r)
1
r  1 (7)
and, therefore,
d = BnZr
@log((1=r)
1
)
@log(r)
= 1 (8)
which is intuitively expected, since a line segment is a 1-dimensional object.
The fact that the fractal dimension of a line segment reduces to its Euclidean dimension is
not a coincidence: Notice that Euclidean objects, like lines, circles, planes etc. trivially fulll the
self-similarity requirement: for example, a part of a line segment is a miniature replica of the whole
segment. Based on the above, we have [13]:
Observation 1 For Euclidean objects, their fractal dimension reduces to their Euclidean dimen-
sion.
Thus, lines, line segments, circles, and all the standard curves have d=1; planes, disks and standard
surfaces have d=2; euclidian volumes in D-dimensional space have d = D.
7
Example 2. This is an example of a fractal with a non-integer d. Figure 3-(c) shows the boxcount-
side plot for the SIERPINSKI set of points of Figure 3-(b) The line has slope (-)1.59, while the
theoretical number is log3=log2 = 1.5849 [13]. The plot also contains two lines. The former corre-
sponds to a 2-dimensional object (like unit square) and the latter corresponds, to a 1-dimensional
object, such as the line segment of example 1, with slopes -2 and -1 respectively.
3.2 Fractal dimensions of real datasets
In the above examples, all the involved point-sets are known to have the self-similarity property,
and therefore it is expected that their boxcount-side plots will be straight lines. The question is
whether real datasets exhibit self-similarity. Table 1 shows the characteristics of the datasets we
used:
The `MGCounty'and `LBCounty' datasets are part of the TIGER database of the US Bureau
of Census and they contain the road intersections of the Montgomery county, MD and Long Beach
county, CA, respectively. The `IUE' contains observation points (latitude and longitude of stars)
from the International Ultraviolet Explorer (IUE) satellite of NASA. The WORLD dataset has
been discussed in the introduction (see Figure 1).
In the upcoming plots we also used synthetic datasets: In addition to the SIERPINSKI one
that we have discussed we also used the 2D-UNIFORM, 1D-Uniform point sets. As their names
reveal, they consisted of points uniformly distributed in the unit square and on a line, respectively
Figure 5 (a)-(d) shows the boxcount-side plots for all these datasets. Notice that the plots
are indeed straight lines, conrming that real point-sets exhibit fractal behavior. This implies that
we can use Eq. 6 to do useful predictions, as we show next. In all of gures 5 (a)-(d) we also give
the boxcount-side plot for the two uniform synthetic datasets, 2D-UNIFORM and 1D-UNIFORM.
The reason is that we want to highlight the fact that the slopes for the real sets are non-integers.
This means that the real datasets are self similar, but clearly non-uniform.
The conclusion is that we have conrmed once more that real data sets are far from uniformly
distributed - the only dierence is that, this time, we have a usable measure of their skewness!
Before we continue with the analysis, we list a few more observations:
Observation 2 The fractal dimension can model point-sets with highly correlated attributes, even
if the correlation is non-linear.
The reason is that, if two attributes are strongly correlated (even in a quadratic, logarithmic or
some other non-linear fashion), the resulting set of points in attribute space will be a curve, with
d=1, as we just discussed in Observation 1. The input point-set will be correctly characterized as
a linear object, as opposed to a 2-dimensional object.
8
1
10
100
1000
10000
100000
1e+06
1e+07
10 100 1000 10000 100000
log(N(r))
log(r)
"MGCounty"
slope = -1.67
"1D-UNIFORM"
slope = -0.99
"2D-UNIFORM"
slope = -1.98
1
10
100
1000
10000
100000
1e+06
1e+07
10 100 1000 10000 100000
log(N(r))
log(r)
"LBCounty"
slope = -1.7
"1D-UNIFORM"
slope = -0.99
"2D-UNIFORM"
slope = -1.98
(a) MGCounty - slope = -1.67 (b) LBCounty - slope = -1.70
1
10
100
1000
10000
100000
1e+06
1e+07
10 100 1000 10000 100000
log(N(r))
log(r)
"IUE"
slope = -1.69
"1D-UNIFORM"
slope = -0.99
"2D-UNIFORM"
slope = -1.98
1e-10
1
1e+10
10 100 1000 10000 100000 1e+06 1e+07 1e+08
log(N(r))
log(r)
"World"
slope = -0.54
"1D-UNIFORM"
slope = -0.99
"2D-UNIFORM"
slope = -1.98
(c) IUE - slope = -1.69 (d) WORLD - slope = -0.54
Figure 5: 'boxcount-side' plots for real datasets (solid lines); slopes -1 and -2 (dotted lines)
Observation 3 A set of D-dimensional points with attributes that obey the 'uniformity' and 'in-
dependence' assumptions has fractal dimension equal to the number of attributes:
d = D (9)
This is justied, because the points cover the whole space, forming practically a D-dimensional
solid. Thus, a data set with 'uniform' and 'independent' attributes can be trivially modeled as a
fractal, with fractal dimension d = D.
4 Analysis of R-trees
In the previous section, we claimed that real datasets will obey Eq. 6:
N(r) =
K
r
d
9
Name Description number of points d
MGCounty road intersections, Montgomery Cty, MD 79,438 1.67
LBCounty road intersections, Long Beach Cty, CA 68,849 1.70
IUE star coordinates (NASA's Infrared-Ultraviolet Explorer) 11,281 1.69
WORLD population vs. area of all countries - World Almanac 181 0.54
2D-UNIFORM uniformly dist. points on a plane - synthetic 50,000 1.98
SIERPINSKI points on Sierpinski triangle - synthetic 50,000 1.59
1D-UNIFORM uniformly dist. points on a line - synthetic 50,000 0.99
Table 1: Datasets used
where K is a constant. Assuming that the address space is normalized to the unit (hyper)-cube,
we have that K = 1, or equivalently
N(r) =
1
r
d
(10)
where d is the (box counting) fractal dimension of the dataset. In this section we want to derive
a formula to predict the number of disk accesses for range queries, when these points are stored
in an R-tree. First we estimate the number of leaf nodes only; then we extend the analysis for all
levels of the R-tree. We dene the eective capacity C
eff
of the nodes of the R-tree as the average
number of entries per node:
C
eff
= C u (11)
where u is the average node utilization (typically, 70% for the R*-trees).
Thus the problem is as follows:
Given: N: the number of points in D-dimensional space
d: their fractal dimension
C
eff
: the eective capacity of the nodes
Find: the expected size ~s = (s
1
;s
2
;:::;s
D
) of the MBRs for the leaf nodes. That is, the average
MBR will be a hyper-rectangle of size s
1
s
2
:::s
D
Once we have estimated ~s, we can immediately use Eq. 3 to calculate the expected number
of leaf accesses for any given query ~q. The number of leaf nodes is
N
l
= N=C
eff
(12)
We make the following (optimistic) assumption:
10
Assumption A1: The algorithms of the R-tree are 'good', in the sense that they will result in
tight, square-like MBRs, roughly of the same size. That is
s
1
= s
2
= ::: = s
D
  (13)
From this assumption we expect that the MBRs of the leaf nodes will be roughly similar,
square-like hyper-rectangles of side . Notice that this setting resembles very much the setting of
the denition of the fractal dimension: We have N
l
boxes of side , that cover all the points of the
dataset. Assuming that the address space is normalized to the unit D-dimensional cube, we can
use Eq. 10 as follows:
N
l
= 1=
d
(14)
N=C
eff
= 1=
d
(15)
 = (
C
eff
N
)
1=d
(16)
Combining Eq. 3 with the above equation, we can estimate the number of leaf accesses P(~q)
for a query ~q:
P(~q) =
X
all leaf nodes
D
Y
i=1
( +q
i
) (17)
or
P(~q) =
N
C
eff
D
Y
i=1
( +q
i
) (18)
We have just shown how to estimate the node accesses at the leaf level. The analysis can
be similarly extended to any level of the R-tree. Assuming that the average fanout is C
eff
at
every level, we can estimate the number of nodes N
j
at each level j, as well as the side 
j
of the
(D-dimensional cubic) MBRs. The nal formula for the total number of nodes accessed P
all
(~q) is
given by adding the node accesses at each level. Thus
P
all
(~q) =
hBnZr1
X
j=0
N
C
eff
hBnZrj
D
Y
i=1
(
j
+q
i
) (19)
where h is the height of the tree (the root is assumed at level j = 0 and the leaves at level j = hBnZr1);
and 
j
given by:

j
= (
C
eff
hBnZrj
N
)
1=d
j = 0;:::;hBnZr1 (20)
11
Symbols Denitions
C max. number of rectangles per page(= node)
C
eff
eective page capacity = C u
d fractal dimension
D embedding dimension (= # of attributes/axes)
h height of the R-tree
N number of data rectangles
N
l
number of leaf nodes
~q query hyper-rectangle q
1
q
2
::: q
k
q
i
length of the query in the i-th dimension
P(~q) avg. leaf pages retrieved by a query ~q
P
all
(~q) avg. pages (at all levels) retrieved by a query ~q

j
side of the ()square MBR of a node at level j
u avg. node utilization
Table 2: Summary of Symbols and Denitions
5 Experimental results
We carried out several experiments, to compare our analytical results with the results of an R*-
tree [4]. The R*-tree was written in C under UNIX and the experiments ran on DEC 5000 work-
station.
Except for the 'WORLD' dataset, which was too small, we used all the real and synthetic
datasets that we used in section 3. Their characteristics are summarized in Table 1.
In all cases, the address space was normalized to the unit square. The queries were squares
with side varying from 0 to 0.8. For each query side we report the average response time over
1000 uniformly distributed queries. Queries that were not completely inside the address space were
'wrapped around'.
We ran two sets of experiments: in the rst, we measured only the leaf accesses and in the
second we measured node accesses at all levels.
The results of the rst set are plotted in Figures 6(a)-(e). For the predictions we used Eq.
18. We put more emphasis on the leaf accesses for two reasons: (a) the majority of the node
accesses will be on leaf nodes and (b) most of the non-leaf nodes are likely to t in main memory.
Figures 6 (a);(b)and (c) show the number of leaf accesses vs. the query side q, for the real datasets
(IUE, LBCounty and MGCounty respectively). They plot the actual results with a solid line and
the predicted ones with a dotted line. Figures 6 (d) and (e) show the same measurements for the
synthetic datasets (2D-UNIFORM and SIERPINSKI respectively). The common observation in all
12
?IUE? Dataset
Experimental
analytical
Pages Touched
-3Qside x 10
0.00
20.00
40.00
60.00
80.00
100.00
120.00
140.00
160.00
180.00
200.00
220.00
240.00
260.00
280.00
0.00 200.00 400.00 600.00 800.00
?LBCounty? Dataset
Experimental
Analytical
Pages Touched x 103
-3Qside x 10-0.05
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
0.55
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
0.00 100.00 200.00 300.00 400.00 500.00 600.00
(a) IUE - Leaf accesses vs. query side (b) LBCounty - Leaf accesses vs. query side
?MGCounty? Dataset
Experimental
Analytical
Pages Touched x 103
-3Qside x 10
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
1.10
1.20
1.30
1.40
1.50
1.60
1.70
1.80
1.90
2.00
0.00 200.00 400.00 600.00 800.00
?Uniform? Dataset
Experimental
Analytical
Pages Touched x 103
-3Qside x 10-0.05
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
0.55
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
0.00 200.00 400.00 600.00 800.00
(c) MGCounty - Leaf accesses vs. query side (d) 2D-UNIFORM - Leaf accesses vs. query side
?Sierpinski? Dataset
Experimental
Analytical
Pages Touched
-3Qside x 10
0.00
50.00
100.00
150.00
200.00
250.00
300.00
350.00
400.00
450.00
500.00
550.00
0.00 100.00 200.00 300.00 400.00 500.00 600.00
?MGCounty? Dataset - all nodes on the disk
Experimental
Analytical
Pages Touched
-3Qside x 10
0.00
50.00
100.00
150.00
200.00
250.00
300.00
350.00
400.00
450.00
500.00
550.00
600.00
650.00
700.00
750.00
800.00
0.00 100.00 200.00 300.00 400.00 500.00
(e) SIERPINSKI - Leaf accesses vs. query side (f) MGCounty - total node accesses vs. query side
Figure 6: Real response time vs. analytical for dierent query sizes
13
the graphs is that the analytical estimate is very close the actual result: the relative error is usually
below 5% and rarely above 10%.
The second set of experiments test the accuracy of Eq. 19, which computes the number of
node accesses at all levels. This will translate to the actual number of disk accesses, in case that
all the levels of the tree reside on disk. The results were similar for all the datasets. For brevity,
we present only the experiments with the MGCounty dataset, in Figure 6(f). Again, our analysis
gives accurate predictions, with relative error usually below 7% and rarely above 12%.
6 Discussion - Conclusions
There are two contributions in this work. The major one is the proposal to use the fractal dimen-
sion to quantify the skewness of real point sets. Up to now, the 'uniformity' and 'independence'
assumptions have been (rightly) challenged; however no satisfactory multi-dimensional distribution
models have been proposed to replace them. We showed that the fractal dimension provides an
excellent measure of the deviation from the above assumptions.
The fractal dimension has several desirable characteristics:
 it constitutes a simple way to describe the non-uniformity of the data set, using just a single
number.
 it is applicable to real point-sets, as our experiments showed.
 it includes the uniform distribution as a special case (d = D).
 it is based on a well developed theory [13, 22]
In addition to the above theoretically pleasing properties, we showed that the fractal di-
mension has practical applications in the performance analysis of spatial access methods on real
datasets. Using it, we provided the rst analysis of R-trees on real data; the resulting formula is
simple, and, as showed experimentally, it is very accurate, usually within 5% of the experimental
results. This is the second contribution of this work: Despite the fact that R-trees are known for
almost a decade, there has been only one attempt for their analysis [6]; even that one used the uni-
formity assumption, pressumably leading to pessimistic estimates. The current analysis superseeds
the old one, since, according to (Observation 3), the uniform case is just a special case of a fractal
distribution.
We believe that the fractal dimension will become a powerful modeling tool for multi-variate
distributions in relational and spatial databases. Future work could examine its potential applica-
tions, such as:
14
 Analysis of other spatial access methods, such as quadtrees/octrees, grid les etc. For ex-
ample, in [3] we stored the 3-dimensional MRI-scans of human brains using an oct-tree de-
composition; we observed that the number of octants required to cover the surface of human
brains increased exponentially with the resolution, with an exponent of 2.6 (close to the frac-
tal dimension 2.7 of mammal brains [13],p.113). Thus, knowledge of the fractal dimension of
a surface (or set of points, in general), is useful in the prediction of the storage requirements
for the resulting quadtrees/octrees.
 Query optimization, for multi-attribute queries and for geometric/geographic databases. For
example, Eq. 18 can be used to predict the selectivity of a range query (number of qualifying
points) by setting the leaf capacity C=1.
 Generation of synthetic but realistic data, to study the performance of spatial access meth-
ods: instead of generating points that follow the uniform distribution, or gaussian or some
other ad-hoc distribution, we propose to generate points that have the desirable fractal di-
mension. Methods to generate point-sets with a given fractal dimension are described, eg.,
in [13](chapter 32), using the so-called `Levy ights'.
Acknowledgement : we would like to thank Erik Hoel for providing MGCounty and LB-
County datasets.
References
[1] Walid G. Aref and Hanan Samet. Optimization strategies for spatial query processing. Proc.
of VLDB (Very Large Data Bases), pages 81{90, September 1991.
[2] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. Qbism: a
prototype 3-d medical image database system. IEEE Data Engineering Bulletin, 16(1):38{42,
March 1993.
[3] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. Qbism:
Extending a dbms to support 3d medical images. Tenth Int. Conf. on Data Engineering
(ICDE), February 1994. (to appear).
[4] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The r*-tree: an ecient and robust
access method for points and rectangles. ACM SIGMOD, pages 322{331, May 1990.
[5] S. Christodoulakis. Implication of certain assumptions in data base performance evaluation.
ACM TODS, June 1984.
15
[6] C. Faloutsos, T. Sellis, and N. Roussopoulos. Analysis of object oriented spatial access meth-
ods. Proc. ACM SIGMOD,pages 426{439426{439,May1987. also available asSRC-TR-87-30,
UMIACS-TR-86-27, CS-TR-1781.
[7] Christos Faloutsos and H.V. Jagadish. On b-tree indices for skewed distributions. In 18th
VLDB Conference, pages 363{374, Vancouver, British Columbia, August 1992.
[8] I. Gargantini. An eective way to represent quadtrees. Comm. of ACM (CACM), 25(12):905{
910, December 1982.
[9] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proc. ACM SIGMOD,
pages 47{57, June 1984.
[10] Yannis E. Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of
join results. Proc. of ACM SIGMOD, pages 268{277, May 1991.
[11] H. V. Jagadish. Spatial search with polyhedra. Proc. Sixth IEEE Int'l Conf. on Data Engi-
neering, February 1990.
[12] Ibrahim Kamel and Christos Faloutsos. On packing r-trees. Second Int. Conf. on Information
and Knowledge Management (CIKM), November 1993. to appear.
[13] B. Mandelbrot. Fractal Geometry of Nature. W.H. Freeman, New York, 1977.
[14] M. Muralikrishna and David J. DeWitt. Equi-depth histograms for estimating selectivity
factors for multi-dimensional queries. In Proc. ACM SIGMOD, pages 28{36, Chicago, IL,
June 1988.
[15] R. Nelson and H. Samet. A population analysis of quadtrees with variable node size. Tech.
Report CAR-TR-241, also CS-TR-1740, DCR-86-05557, Computer Science Department, Univ.
of Maryland, College Park, December 1986.
[16] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid le: an adaptable, symmetric
multikey le structure. ACM TODS, 9(1):38{71, March 1984.
[17] J. Orenstein. Spatial query processing in an object-oriented database system. Proc. ACM
SIGMOD, pages 326{336, May 1986.
[18] B. Pagel, H. Six, H. Toben, and P. Widmayer. Towardsan analysis of range query performance.
In Proc. PODS 93, pages 214{221, Washington, D.C., May 1993.
16
[19] N. Roussopoulos and D. Leifker. Direct spatial search on pictorial databases using packed
r-trees. Proc. ACM SIGMOD, May 1985.
[20] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.
[21] H. Samet. Applications of Spatial Data Structures Computer Graphics, Image Processing and
GIS. Addison-Wesley, 1990.
[22] Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes From an Innite Paradise. W.H.
Freeman and Company, New York, 1991.
[23] T. Sellis, N. Roussopoulos, and C. Faloutsos. The r+ tree: a dynamic index for multi-
dimensional objects. In Proc. 13th International Conference on VLDB, pages 507{518, Eng-
land,, September 1987. also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.
[24] G.K. Zipf. Human Behavior and Principle of Least Eort: an Introduction to Human Ecology.
Addison Wesley, Cambridge, Massachusetts, 1949.
17
Contents
1 Introduction 1
2 Background 4
3 Fractal Dimension 5
3.1 Formal denitions and Measurements : : : : : : : : : : : : : : : : : : : : : : : : : : 5
3.2 Fractal dimensions of real datasets : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8
4 Analysis of R-trees 9
5 Experimental results 12
6 Discussion - Conclusions 14
18