A Parallel Sorting Algorithm With an Experimental Study (Preliminary Draft)

David R. Helman    David A. Bader*    Joseph JaJa†
Institute for Advanced Computer Studies, and Department of Electrical Engineering,
University of Maryland, College Park, MD 20742
{helman, dbader, joseph}@umiacs.umd.edu

December 29, 1995

Abstract

Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the efficiency and scalability of our algorithm across different platforms. In fact, it seems to outperform all similar algorithms known to the authors on these platforms, and its performance is invariant over the set of input distributions, unlike previous efficient algorithms. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.

Keywords: Parallel Algorithms, Generalized Sorting, Integer Sorting, Sample Sort, Parallel Performance.

* The support by NASA Graduate Student Researcher Fellowship No. NGT-50951 is gratefully acknowledged.
† Supported in part by NSF grant No. CCR-9103135 and NSF HPCC/GCAG grant No. BIR-9318183.

1 Introduction

Sorting is arguably the most studied problem in computer science, both because of its intrinsic theoretical importance and its use in so many applications. Its significant requirements for interprocessor communication bandwidth and the irregular communication patterns that are typically generated have earned its inclusion in several parallel benchmarks such as NAS [7] and SPLASH [35]. Moreover, its practical importance has motivated the publication of a number of empirical studies seeking to identify the most efficient sorting routines. Yet, parallel sorting strategies have still generally fallen into one of two groups, each with its respective disadvantages. The first group, using the classification of Li and Sevcik [24], is the single-step algorithms, so named because data is moved exactly once between processors. Examples of this include sample sort [20, 10], parallel sorting by regular sampling [32, 25], and parallel sorting by overpartitioning [24]. The price paid by these single-step algorithms is an irregular communication scheme and difficulty with load balancing. The other group of sorting algorithms is the multi-step algorithms, which include bitonic sort [9], column sort [23], rotate sort [26], hyperquicksort [29], flashsort [30], B-flashsort [19], smoothsort [28], and Tridgell and Brent's sort [33]. Generally speaking, these algorithms accept multiple rounds of communication in return for better load balancing and, in some cases, regular communication.

In this paper, we present a novel variation on the sample sort algorithm which addresses the limitations of previous implementations. We exchange the single step of irregular communication for two steps of regular communication.
In return, we reduce the problem of poor load balancing because we are able to sustain a very high oversampling ratio at virtually no cost. Second, we obtain predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message passing standards, such as MPI [27], which seek to guarantee the availability of very efficient (often machine specific) implementations of certain basic collective communication routines.

Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using a variety of benchmarks that we identified to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the scalability and efficiency of our algorithm across different platforms. In fact, it seems to outperform all similar algorithms known to the authors on these platforms, and its performance is indifferent to the set of input distributions, unlike previous efficient algorithms.

The high-level language used in our studies is Split-C [14], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented. The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [6], are similar to those of the MPI [27], the IBM POWERparallel [8], and the Cray MPP systems [13] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives. Scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 6, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives.
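For readers more familiar with MPI than with our Split-C primitives, the sketch below shows how these four collective primitives correspond to standard MPI collectives. This is only an illustrative mapping under the assumption of an MPI implementation; it is not the Split-C library actually used in our experiments.

```c
#include <mpi.h>

/* Illustrative mapping of the collective primitives used in this paper onto
 * standard MPI calls (an assumption for exposition, not our Split-C code).
 * Each processor contributes or receives blocks of `blocksize` elements.     */

/* transpose: all-to-all personalized communication with equal-sized blocks. */
void transpose(int *sendbuf, int *recvbuf, int blocksize)
{
    MPI_Alltoall(sendbuf, blocksize, MPI_INT,
                 recvbuf, blocksize, MPI_INT, MPI_COMM_WORLD);
}

/* bcast: copy a block of data from processor `root` to all other processors. */
void bcast(int *buf, int count, int root)
{
    MPI_Bcast(buf, count, MPI_INT, root, MPI_COMM_WORLD);
}

/* scatter: divide a single array on `root` into equal-sized blocks, one per
 * processor.                                                                  */
void scatter(int *sendbuf, int *recvbuf, int blocksize, int root)
{
    MPI_Scatter(sendbuf, blocksize, MPI_INT,
                recvbuf, blocksize, MPI_INT, root, MPI_COMM_WORLD);
}

/* gather: coalesce the per-processor blocks back into a single array on `root`. */
void gather(int *sendbuf, int *recvbuf, int blocksize, int root)
{
    MPI_Gather(sendbuf, blocksize, MPI_INT,
               recvbuf, blocksize, MPI_INT, root, MPI_COMM_WORLD);
}
```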
The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.

2 The Parallel Computation Model

We use a simple model to analyze the performance of our parallel algorithms. Each of our hardware platforms can be viewed as a collection of powerful processors connected by a communication network that can be modeled as a complete graph on which communication is subject to the restrictions imposed by the latency and the bandwidth properties of the network. We view a parallel algorithm as a sequence of local computations interleaved with communication steps, and we allow computation and communication to overlap. We account for communication costs as follows. Assuming no congestion, the transfer of a block consisting of m contiguous words between two processors takes O(τ + σm) time, where τ is an upper bound on the latency of the network and σ is the time per word at which a processor can inject or receive data from the network. The cost of each of the collective communication primitives will be modeled by O(τ + σ max(m, p)), where m is the maximum amount of data transmitted or received by a processor. Such a cost (which is an overestimate) can be justified by using our earlier work [22, 21, 6, 5]. Using this cost model, we can evaluate the communication time T_comm(n, p) of an algorithm as a function of the input size n, the number of processors p, and the parameters τ and σ. The coefficient of τ gives the total number of times collective communication primitives are used, and the coefficient of σ gives the maximum total amount of data exchanged between a processor and the remaining processors. This communication model is close to a number of similar models (e.g. [16, 34, 1]) that have recently appeared in the literature and seems to be well-suited for designing parallel algorithms on current high performance platforms.

We define the computation time T_comp as the maximum time it takes a processor to perform all the local computation steps. In general, the overall performance T_comp + T_comm involves a tradeoff between T_comp and T_comm. Our aim is to develop parallel algorithms that achieve T_comp = O(T_seq/p) such that T_comm is minimum, where T_seq is the complexity of the best sequential algorithm. Such optimization has worked very well for the problems we have looked at, but other optimization criteria are possible. The important point to notice is that, in addition to scalability, our optimization criterion requires that the parallel algorithm be an efficient sequential algorithm (i.e., the total number of operations of the parallel algorithm is of the same order as T_seq).

3 A New Sample Sort Algorithm

Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sample sort is to find a set of p - 1 splitters to partition the n input elements into p groups indexed from 0 up to p - 1 such that every element in the i-th group is less than or equal to each of the elements in the (i + 1)-th group, for 0 ≤ i ≤ p - 2. Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how well we divide the input, and this in turn depends on how well we choose the splitters. One way to choose the splitters is by randomly sampling the input elements at each processor, hence the name sample sort.

Previous versions of sample sort [20, 10, 17, 15] have randomly chosen s samples from the n/p elements at each processor, routed them to a single processor, sorted them at that processor, and then selected every s-th element as a splitter. Each processor P_i then performs a binary search on these splitters for each of its input values and then uses the results to route the values to the appropriate destination, after which local sorting is done to complete the sorting process.
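To make the routing step of these earlier schemes concrete, the following sketch shows how a processor can locate the destination group of a value by binary search over the sorted splitters. The function name and the convention that the splitters are stored in ascending order are our own illustrative assumptions, not code from any of the cited implementations.

```c
/* Return the index of the destination group for `value`, given the p-1
 * splitters stored in ascending order in splitter[0..p-2].  Group i receives
 * the values greater than splitter[i-1] and at most splitter[i]; group p-1
 * receives everything above splitter[p-2].  (Hypothetical helper illustrating
 * the binary search used by splitter-based sample sorts.)                    */
int find_destination(long value, const long *splitter, int p)
{
    int lo = 0, hi = p - 1;          /* the destination lies in [lo, hi]      */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (value <= splitter[mid])
            hi = mid;                /* value belongs to group mid or earlier */
        else
            lo = mid + 1;            /* value belongs to a later group        */
    }
    return lo;
}
```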
The first difficulty with this approach is the work involved in gathering and sorting the splitters. A larger value of s results in better load balancing, but it also increases the overhead. The other difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [27]. In our solution, we incur no overhead in obtaining n/p^2 samples from each processor and in sorting these samples to identify the splitters. Because of this very high oversampling, we are able to replace the irregular routing with exactly two calls to our transpose primitive.

The pseudocode for our algorithm is as follows:

Step (1): Each processor P_i (0 ≤ i ≤ p - 1) randomly assigns each of its n/p elements to one of p buckets. With high probability, no bucket will receive more than c_1 n/p^2 elements, where c_1 is a constant to be defined later.

Step (2): Each processor P_i routes the contents of bucket j to processor P_j, for 0 ≤ i, j ≤ p - 1. Since with high probability no bucket will receive more than c_1 n/p^2 elements, this is equivalent to performing a transpose operation with block size c_1 n/p^2 (see the sketch following Step (8)).

Step (3): Each processor P_i sorts the (α_1 n/p ≤ c_1 n/p) values received in Step (2) using an appropriate sequential sorting algorithm. For integers we use the radix sort algorithm, whereas for floating point numbers we use the merge sort algorithm.

Step (4): From its sorted list of (α_1 n/p ≤ c_1 n/p) elements, processor P_0 selects each (j n/p^2)-th element as a splitter, for 1 ≤ j ≤ p - 1. By default, the first and last splitters are respectively the smallest and largest values allowed by the data type used.

Step (5): Processor P_0 broadcasts the p - 1 intermediate splitters to the other p - 1 processors.

Step (6): Each processor P_i finds the positions of the splitters in its local array of sorted elements by performing a binary search for each of these splitters.

Step (7): Each processor P_i routes the subsequence falling between splitter j and splitter j + 1 to processor P_j, for 0 ≤ i, j ≤ p - 1. Since with high probability no subsequence will contain more than c_2 n/p^2 elements, where c_2 is a constant to be defined later, this is equivalent to performing a transpose operation with block size c_2 n/p^2.

Step (8): Each processor P_i merges the p sorted subsequences received in Step (7) to produce the i-th column of the sorted array. Note that, with high probability, no processor has received more than α_2 n/p elements, where α_2 is a constant to be defined later.
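As an illustration of how cheaply the oversampling in Steps (1) and (2) can be realized, the following sketch implements the random bucketing and the first transpose using MPI_Alltoall with a fixed block size. The buffer layout, the padding sentinel, the constant C1, and the use of rand() are illustrative assumptions; our actual implementation uses Split-C and the transpose primitive of [6].

```c
#include <stdlib.h>
#include <mpi.h>

#define C1 2   /* bucket-capacity constant c_1 (Lemma 1 below suggests c_1 >= 2) */

/* Steps (1)-(2): each processor randomly assigns its n/p keys to p buckets of
 * capacity C1*n/p^2 and exchanges bucket j with processor j via one regular
 * all-to-all.  Unused slots carry a caller-chosen `pad` value.  Returns the
 * number of real keys received in recvbuf.                                    */
int random_bucket_and_transpose(const long *keys, int n_local, int p,
                                long pad, long *sendbuf, long *recvbuf)
{
    int block = (C1 * n_local) / p;       /* block size c_1 * n/p^2            */
    int *count = calloc(p, sizeof(int));  /* elements placed in each bucket    */
    int i, received = 0;

    for (i = 0; i < p * block; i++)       /* pre-fill with the pad value       */
        sendbuf[i] = pad;

    for (i = 0; i < n_local; i++) {       /* Step (1): random bucket choice    */
        int j = rand() % p;
        if (count[j] == block)            /* by Lemma 1, overflow is unlikely; */
            MPI_Abort(MPI_COMM_WORLD, 1); /* a full implementation would       */
        sendbuf[j * block + count[j]++] = keys[i];   /* resample or resize.    */
    }

    /* Step (2): one regular all-to-all personalized communication.            */
    MPI_Alltoall(sendbuf, block, MPI_LONG,
                 recvbuf, block, MPI_LONG, MPI_COMM_WORLD);

    for (i = 0; i < p * block; i++)       /* count the non-pad keys received   */
        if (recvbuf[i] != pad)
            received++;

    free(count);
    return received;
}
```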
We can establish the complexity of this algorithm with high probability, that is, with probability ≥ (1 - n^{-ε}) for some positive constant ε. But before doing this, we need to establish the results of the following four lemmas.

Lemma 1: At the completion of Step (1), the number of elements in each bucket is at most c_1 n/p^2 with high probability, for any c_1 ≥ 2 and p^2 ≤ n/(3 ln n).

Proof: The probability that exactly c_1 n/p^2 elements are placed in a particular bucket in Step (1) is given by the binomial distribution

    b(s; r, q) = \binom{r}{s} q^s (1 - q)^{r-s},                        (1)

where s = c_1 n/p^2, r = n/p, and q = 1/p. Using the following Chernoff bound [12] for estimating the tail of a binomial distribution,

    Σ_{s ≥ (1+ε)rq} b(s; r, q) ≤ e^{-ε^2 rq/3},                        (2)

the probability that a particular bucket will contain at least c_1 n/p^2 elements can be bounded by e^{-(c_1-1)^2 n/(3p^2)}. Hence, the probability that any of the p^2 buckets contains at least c_1 n/p^2 elements can be bounded by p^2 e^{-(c_1-1)^2 n/(3p^2)}, and Lemma 1 follows.

Lemma 2: At the completion of Step (2), the total number of elements received by processor P_0, which comprise the set of samples from which the splitters are chosen, is at most α_1 n/p with high probability, for any α_1 > 1 and p^2 ≤ n/(3 ln n).

Proof: The probability that processor P_0 receives exactly α_1 n/p elements is given by the binomial distribution b(α_1 n/p; n, 1/p). Using the Chernoff bound for estimating the tail of a binomial distribution, the probability that processor P_0 receives at least α_1 n/p elements can be bounded by e^{-(α_1-1)^2 n/(3p)}, and Lemma 2 follows.

Lemma 3: At the completion of Step (7), the number of elements received by each processor is at most α_2 n/p with high probability, for any α_2 ≥ 1.33 and p^2 ≤ n/(3 ln n).

Proof: Establishing a bound on the number of elements received by any processor in Step (7) is equivalent to establishing a bound on the number of elements which fall between any two consecutive splitters in the sorted order. But as Blelloch et al. [10] observed, the number of elements which fall between any two consecutive splitters in the sorted order can only be greater than α_2 n/p if in the sorted order there are fewer than n/p^2 samples drawn from the α_2 n/p elements which follow the first splitter. Since every element has an equal and independent probability of being a sample, the probability that exactly n/p^2 samples will be found amongst the next α_2 n/p elements is given by the binomial distribution b(n/p^2; α_2 n/p, 1/p). Using the following "Chernoff"-type bound [18] for estimating the head of a binomial distribution,

    Σ_{s ≤ εrq} b(s; r, q) ≤ e^{-(1-ε)^2 rq/2},                        (3)

where s = n/p^2, r = α_2 n/p, and q = 1/p, the probability that n/p^2 or fewer samples will be found amongst the next α_2 n/p elements following any of the p splitters can be bounded by p e^{-(1 - 1/α_2)^2 α_2 n/(2p^2)}, and Lemma 3 follows.

Lemma 4: The number of elements exchanged by any two processors in Step (7) is at most c_2 n/p^2 with high probability, for any c_2 ≥ 2.48 and p^2 ≤ n/(3 ln n).

Proof: Since with high probability no processor can receive more than α_2 n/p elements in Step (7), and since the randomization in Step (1) means that each of these elements can originate with equal probability from any of the p processors, the probability that exactly c_2 n/p^2 elements are exchanged by any two particular processors is given by the binomial distribution b(c_2 n/p^2; α_2 n/p, 1/p). Using the Chernoff bound for estimating the tail of the binomial distribution, the probability that any two of the p processors exchange at least c_2 n/p^2 elements can be bounded by p^2 e^{-(c_2/α_2 - 1)^2 α_2 n/(3p^2)}, and Lemma 4 follows.
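To make the use of the tail bound (2) explicit, the following worked substitution (ours, added for clarity) spells out how the Lemma 1 bound is obtained; Lemmas 2 and 4 follow the same pattern with their respective values of r, q, and threshold.

```latex
% Worked substitution for Lemma 1 (added for clarity; not part of the original proof).
% A fixed bucket receives r = n/p keys, each landing in it independently with
% probability q = 1/p, so rq = n/p^2.  The threshold c_1 n/p^2 equals
% (1+\epsilon) rq with \epsilon = c_1 - 1.  Applying (2):
\Pr\Big[\text{a fixed bucket receives at least } c_1 \tfrac{n}{p^2} \text{ elements}\Big]
  \;\le\; e^{-\epsilon^2 rq/3}
  \;=\; e^{-(c_1-1)^2 \frac{n}{3p^2}} .
% A union bound over all p^2 buckets multiplies this estimate by p^2.
```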
With these bounds on the values of c_1, α_2, and c_2, the analysis of our sample sort algorithm is as follows. Steps (1), (3), (4), (6), and (8) involve no communication and are dominated by the cost of the sequential sorting in Step (3) and the merging in Step (8). Sorting integers using radix sort requires O(n/p) time, whereas sorting floating point numbers using merge sort requires O(n/p log(n/p)) time. Step (8) requires O(n/p log p) time if we merge the sorted subsequences in a binary tree fashion. Steps (2), (5), and (7) call the communication primitives transpose, bcast, and transpose, respectively. The analysis of these primitives in [6] shows that with high probability these three steps require T_comm(n, p) ≤ (τ + 2(n/p^2)(p - 1)σ), T_comm(n, p) ≤ (τ + (p - 1)σ), and T_comm(n, p) ≤ (τ + 2.48(n/p^2)(p - 1)σ), respectively. Hence, with high probability, the overall complexity of our sample sort algorithm is given (for floating point numbers) by

    T(n, p) = T_comp(n, p) + T_comm(n, p) = O((n/p) log n + τ + (n/p)σ)              (4)

for p^2 < n/(3 ln n).

Clearly, our algorithm is asymptotically optimal with very small coefficients. But a theoretical comparison of our running time with previous sorting algorithms is difficult, since there is no consensus on how to model the cost of the irregular communication used by the most efficient algorithms. Hence, it is very important to perform an empirical evaluation of an algorithm using a wide variety of benchmarks, as we will do next.

4 Performance Evaluation

Sample sort was implemented using Split-C [14] and run on a variety of machines and processors, including the Thinking Machines CM-5, the IBM SP-2-WN and SP-2-TN2, and the Cray Research T3D. For every platform, we tested our code on six different benchmarks, each of which had both a 32-bit integer version (64-bit on the Cray T3D) and a 64-bit double precision floating point number (double) version.

4.1 Sorting Benchmarks

Our six sorting benchmarks are defined as follows, in which MAX is (2^31 - 1) for integers and approximately 1.8 × 10^308 for doubles:

1. Uniform [U], a uniformly distributed random input, obtained by calling the C library random number generator random(). This function, which returns integers in the range 0 to (2^31 - 1), is initialized by each processor P_i with the value (23 + 1001i). For the double data type, we "normalize" these values by first assigning the integer returned by random() a randomly chosen sign bit and then scaling the result by MAX/(2^31 - 1).

2. Gaussian [G], a Gaussian distributed random input, approximated by adding four calls to random() and then dividing the result by four. For the double type, we first normalize the values returned by random() in the manner described for [U].

3. Zero [Z], a zero entropy input, created by setting every value to a constant such as zero.

4. Bucket Sorted [B], an input that is sorted into p buckets, obtained by setting the first n/p^2 elements at each processor to be random numbers between 0 and (MAX/p - 1), the second n/p^2 elements at each processor to be random numbers between MAX/p and (2 MAX/p - 1), and so forth (see the sketch after this list).

5. g-Group [g-G], an input created by first dividing the processors into groups of consecutive processors of size g, where g can be any integer which partitions p evenly. If we index these groups in consecutive order, then for group j we set the first n/(pg) elements to be random numbers between ((jg + p/2) mod p) MAX/p and (((jg + p/2 + 1) mod p) MAX/p - 1), the second n/(pg) elements at each processor to be random numbers between ((jg + p/2 + 1) mod p) MAX/p and (((jg + p/2 + 2) mod p) MAX/p - 1), and so forth.

6. Staggered [S], created as follows: if the processor index i is less than p/2, then we set all n/p elements at that processor to be random numbers between (2i + 1) MAX/p and ((2i + 2) MAX/p - 1); otherwise, we set all n/p elements to be random numbers between 2(i - p/2) MAX/p and ((2(i - p/2) + 1) MAX/p - 1).
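The following sketch shows one way to generate the Bucket Sorted [B] input described in item 4 above. The function name, the per-processor seed, and the use of random() are illustrative assumptions consistent with the descriptions above, not the exact benchmark-generation code used in our experiments.

```c
#include <stdlib.h>

#define MAX_KEY 2147483647L   /* MAX = 2^31 - 1 for the integer benchmarks */

/* Generate the Bucket Sorted [B] input on processor `id` of `p` processors:
 * the k-th group of n_local/p keys is drawn uniformly from
 * [k*MAX/p, (k+1)*MAX/p - 1], so that group k is destined for processor k. */
void generate_bucket_sorted(long *keys, int n_local, int p, int id)
{
    long range = MAX_KEY / p;        /* width of each destination interval   */
    int per_bucket = n_local / p;    /* n/p^2 keys aimed at each destination */
    int i;

    srandom(23 + 1001 * id);         /* per-processor seed, as for [U]       */
    for (i = 0; i < n_local; i++) {
        int k = i / per_bucket;      /* destination bucket for this key      */
        if (k > p - 1) k = p - 1;    /* fold any remainder into the last one */
        keys[i] = k * range + random() % range;
    }
}
```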
We selected these six benchmarks for a variety of reasons. Previous researchers have used the Uniform, Gaussian, and Zero benchmarks, and so we too included them for purposes of comparison. But benchmarks should be designed to elicit the worst case behavior from an algorithm, and in this sense the Uniform benchmark is not appropriate. For example, for n ≫ p, one would expect that the optimal choice of the splitters in the Uniform benchmark would be those which partition the range of possible values into equal intervals. Thus, algorithms which try to guess the splitters might perform misleadingly well on such an input. In this respect, the Gaussian benchmark is more telling. But we also wanted to find benchmarks which would evaluate the cost of irregular communication. Thus, we wanted to include benchmarks for which an algorithm which uses a single phase of routing would find contention difficult or even impossible to avoid. A naive approach to rearranging the data would perform poorly on the Bucket Sorted benchmark. Here, every processor would try to route data to the same processor at the same time, resulting in poor utilization of communication bandwidth. This problem might be avoided by an algorithm in which at each processor the elements are first grouped by destination and then routed according to the specifications of a sequence of p destination permutations. Perhaps the most straightforward way to do this is by iterating over the possible communication strides. But such a strategy would perform poorly with the g-Group benchmark, for a suitably chosen value of g. In this case, using stride iteration, those processors which belong to a particular group all route data to the same subset of g destination processors. This subset of destinations is selected so that, when the g processors route to this subset, they choose the processors in exactly the same order, producing contention and possibly stalling. Alternatively, one can synchronize the processors after each permutation, but this in turn will reduce the communication bandwidth by a factor of p/g. In the worst case scenario, each processor needs to send data to a single processor a unique stride away. This is the case of the Staggered benchmark, and the result is a reduction of the communication bandwidth by a factor of p. Of course, one can correctly object that both the g-Group benchmark and the Staggered benchmark have been tailored to thwart a routing scheme which iterates over the possible strides, and that another sequence of permutations might be found which performs better. This is possible, but at the same time we are unaware of any single phase deterministic algorithm which could avoid an equivalent challenge.

4.2 Experimental Results

For each experiment, the input is evenly distributed amongst the processors. The output consists of the elements in non-descending order arranged amongst the processors so that the elements at each processor are in sorted order and no element at processor P_i is greater than any element at processor P_j, for all i < j.

Two variations were allowed in our experiments. First, radix sort was used to sequentially sort integers, whereas merge sort was used to sort double precision floating point numbers (doubles). Second, different implementations of the communication primitives were allowed for each machine.
Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [8]. As one might expect, they were faster than the high level Split-C implementation.

The graphs in Figures 1 and 2 display the performance of our sample sort as a function of input distribution for a variety of input sizes. In each case, the performance is essentially independent of the input distribution. These figures present results obtained on a 64 node Cray T3D; results obtained from other machines validate this claim as well. Because of this independence, the remainder of this section will only discuss the performance of our sample sort on the single benchmark [U].

Figure 1: Performance is independent of input distribution for integers.

Figure 2: Performance is independent of input distribution for doubles.

Sample Sorting of 4M Integers
                            Number of Processors
Machine           4       8       16      32      64      128
CRAY T3D          -       1.66    0.894   0.486   0.272   0.149
IBM SP-2-WN       2.04    1.03    -       -       -       -
IBM SP-2-TN2      2.97    1.34    0.755   -       -       -
TMC CM-5          -       -       3.61    1.67    0.761   0.444

Table I: Total execution time (in seconds) for sorting 4M integers on a variety of machines and processors. A hyphen indicates that that particular platform was unavailable to us.

Sample Sorting of 4M Doubles
                            Number of Processors
Machine           4       8       16      32      64      128
CRAY T3D          -       2.61    1.32    0.683   0.361   0.191
IBM SP-2-WN       6.96    3.67    -       -       -       -
IBM SP-2-TN2      8.78    4.43    2.28    -       -       -
TMC CM-5          -       -       6.51    3.31    1.85    0.915

Table II: Total execution time (in seconds) for sorting 4M doubles on a variety of machines and processors. A hyphen indicates that that particular platform was unavailable to us.

The results in Tables I and II together with their graphs in Figure 3 examine the scalability of our sample sort as a function of machine size. Results are shown for the CM-5, the SP-2-WN, the SP-2-TN2, and the T3D. Bearing in mind that these graphs are log-log plots, they show that for a given input size n the execution time scales almost inversely with the number of processors p. While this is certainly the expectation of our analytical model for doubles, it might at first appear to exceed our prediction of an O(n/p log p) computational complexity for integers. However, the appearance of an inverse relationship is still quite reasonable when we note that this O(n/p log p) complexity is entirely due to the merging in Step (8), and in practice, as we show later with Figure 6, Step (8) only accounts for about 25% of the observed execution time. Note that the complexity of Step (8) could be reduced to O(n/p) for integers using radix sort, but the resulting execution time would be slower.

Figures 4 and 5 examine the scalability of our sample sort as a function of problem size, for differing numbers of processors. They show that for a fixed number of processors there is an almost linear dependence between the execution time and the total number of elements n. While this is certainly the expectation of our analytic model for integers, it might at first appear to exceed our prediction of an O(n/p log n) computational complexity for floating point values. However, this appearance of a linear relationship is still quite reasonable when we consider that for the range of values shown log n differs by only a factor of 1.2.

Next, the graphs in Figures 6 and 7 examine the relative costs of the eight steps in our sample sort on a 64 node T3D. Notice that the sequential sorting and merging performed in Steps (3) and
(8) consume nearly 80% of the execution time, whereas the two transpose operations in Steps (2) and (7) together consume only about 20% of the execution time (and less for doubles). Similar results were obtained for all of our benchmarks, showing that our algorithm is extremely efficient in its communication performance.

Figure 3: Scalability of sorting integers and doubles with respect to machine size.

Finally, Table III shows the experimentally derived expected value (E) and sample standard deviation (STD) of the coefficients c_1, α_1, c_2, and α_2 used to describe the complexity of our algorithm in Section 3. For each input size, the values were obtained by analyzing data collected while sorting the [G], [B], [2-G], [4-G], and [S] benchmarks. Each of these benchmarks was generated and sorted 20 times, each time using a different seed for the random number generator. The experimentally derived values for c_1, α_1, c_2, and α_2 agree closely with the theoretically derived values of c_1 (2), α_1 (≤ c_1), c_2 (2.48), and α_2 (1.33) for p^2 ≤ n/(3 ln n).

keys/proc   E(c_1)  STD(c_1)  E(α_1)  STD(α_1)  E(c_2)  STD(c_2)  E(α_2)  STD(α_2)
4K          2.02    0.091     1.08    0.017     2.64    0.94      1.55    0.18
8K          1.70    0.066     1.06    0.012     1.98    0.40      1.37    0.11
16K         1.48    0.040     1.04    0.007     1.66    0.23      1.25    0.07
32K         1.33    0.031     1.03    0.005     1.43    0.13      1.18    0.05
64K         1.23    0.025     1.02    0.003     1.29    0.08      1.12    0.03
128K        1.16    0.012     1.01    0.002     1.20    0.05      1.09    0.02
256K        1.11    0.011     1.01    0.002     1.14    0.04      1.06    0.02
512K        1.08    0.008     1.01    0.001     1.10    0.02      1.05    0.01
1M          1.06    0.004     1.00    0.001     1.07    0.02      1.03    0.01

Table III: Statistical evaluation of the experimentally observed values of the algorithm coefficients on a 64 node T3D.

Figure 4: Scalability of sorting integers with respect to problem size, for differing numbers of processors (panels: TMC CM-5, IBM SP-2-WN, IBM SP-2-TN2, Cray T3D).

Figure 5: Scalability of sorting doubles with respect to problem size, for differing numbers of processors (panels: TMC CM-5, IBM SP-2-WN, IBM SP-2-TN2, Cray T3D).

Figure 6: Distribution of execution time amongst the eight steps of sample sort for integers. Times are obtained on a 64 node T3D.

Figure 7: Distribution of execution time amongst the eight steps of sample sort for doubles. Times are obtained on a 64 node T3D.

4.3 Comparison with Previous Results

Despite the enormous theoretical interest in parallel sorting, we were able to locate relatively few empirical studies. Of these, only a few were done on machines which either were available to us for comparison or involved code which could be ported to these machines for comparison. In Tables IV and V, we compare the performance of our sample sort algorithm with two other sample sort algorithms. In all cases, the code was written in Split-C. In the case of Alexandrov et al. [1], the times were determined by us directly on a 32 node CM-5 using code supplied by the authors which had been optimized for a Meiko CS-2. In the case of Dusseau [17], the times were obtained from the graphed results reported for a 64 node CM-5.
            [U]            [G]            [2-G]          [B]            [S]
int./proc.  HBJ    AIS     HBJ    AIS     HBJ    AIS     HBJ    AIS     HBJ    AIS
4K          0.051  0.153   0.050  0.152   0.051  1.05    0.055  0.181   0.049  †
8K          0.090  0.197   0.090  0.192   0.092  1.09    0.094  0.193   0.087  †
16K         0.183  0.282   0.182  0.281   0.184  1.16    0.189  0.227   0.179  †
32K         0.360  0.450   0.359  0.449   0.363  1.34    0.364  0.445   0.361  †
64K         0.725  0.833   0.730  0.835   0.735  1.76    0.731  0.823   0.740  †
128K        1.70   2.02    1.70   2.02    1.70   2.83    1.72   1.99    2.02   †
256K        3.81   4.69    3.80   4.59    3.80   5.13    3.81   4.56    4.69   †
512K        8.12   10.0    8.04   9.91    8.11   9.58    8.10   9.98    10.0   †

Table IV: Total execution time (in seconds) required to sort a variety of benchmarks and problem sizes, comparing our version of sample sort (HBJ) with that of Alexandrov et al. (AIS) on a 32-node CM-5. († We were unable to run the (AIS) code on this input.)

            [U]            [B]            [Z]
int./proc.  HBJ    DUS     HBJ    DUS     HBJ    DUS
1M          16.6   21      12.2   91      10.6   11

Table V: Time required per element (in microseconds) to sample sort 64M integers, comparing our results (HBJ) with those obtained from the graphed results reported by Dusseau (DUS) on a 64 node CM-5.

Finally, there are the results for the NAS Parallel Benchmark [31] for integer sorting (IS). The name of this benchmark is somewhat misleading. Instead of requiring that the integers be placed in sorted order as we do, the benchmark only requires that they be ranked without any reordering, which is a significantly simpler task. Table VI compares our results on the Class A NAS Benchmark with the best times reported for the TMC CM-5 and the Cray T3D. We believe that our results, which were obtained using high-level, portable code, compare favorably with the other reported times, which were obtained by the vendors using machine-specific implementations and perhaps system modifications.

Comparison of Class A NAS (IS) Benchmark Times
Machine     Number of Processors   Best Reported Time   Our Time
CM-5        32                     43.1                 29.4
            64                     24.2                 14.0
            128                    12.0                 7.13
Cray T3D    16                     7.07                 12.6
            32                     3.89                 7.05
            64                     2.09                 4.09
            128                    1.05                 2.26

Table VI: Comparison of our execution time (in seconds) with the best reported times for the Class A NAS Parallel Benchmark for integer sorting. Note that while we actually place the integers in sorted order, the benchmark only requires that they be ranked without actually reordering.

The only performance studies we are aware of on similar platforms for generalized sorting are those of Tridgell and Brent [33], who report the performance of their algorithm using a 32 node CM-5 on a uniformly distributed random input of signed integers, as described in Table VII.

               [U]
Problem Size   HBJ     TB
8M             4.57    5.48

Table VII: Total execution time (in seconds) required to sort 8M signed integers, comparing our results (HBJ) with those of Tridgell and Brent (TB) on a 32 node CM-5.

5 Conclusion

In this paper, we introduced a novel variation on sample sort and conducted an experimental study of its performance on a number of platforms using widely different benchmarks. Our results illustrate the efficiency and scalability of our algorithm across the different platforms and appear to improve on all similar results known to the authors. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.

We have also studied several variations on our algorithm which use differing strategies to ensure that every bucket in Step (1) receives an equal number of elements. The results obtained for these variations were very similar to those reported in this paper.
On no platform did the improvements exceed approximately 5%, and in many instances they actually ran more slowly. We believe that a significant improvement of our algorithm would require the enhancement of the sequential sorting and merging in Steps (3) and (8), and that there is little room for significant improvement in either the load balance or the communication efficiency.

6 Acknowledgements

We would like to thank Ronald Greenberg of UMCP's Electrical Engineering Department for his valuable comments and encouragement. We would also like to thank the CASTLE/Split-C group at The University of California, Berkeley, especially for the help and encouragement from David Culler, Arvind Krishnamurthy, and Lok Tin Liu. The University of California, Santa Barbara, parallel radix sort code was provided to us by Mihai Ionescu. Also, Klaus Schauser, Oscar Ibarra, Chris Scheiman, and David Probert of UC Santa Barbara provided help and access to the UCSB 64-node Meiko CS-2. The Meiko CS-2 Computing Facility was acquired through NSF CISE Infrastructure Grant number CDA-9218202, with support from the College of Engineering and the UCSB Office of Research, for research in parallel computing. Arvind Krishnamurthy provided additional help with his port of Split-C to the Cray Research T3D [2]. The Jet Propulsion Lab/Caltech 256-node Cray T3D Supercomputer used in this investigation was provided by funding from the NASA Offices of Mission to Planet Earth, Aeronautics, and Space Science. We also acknowledge William Carlson and Jesse Draper from the Center for Computing Science (formerly Supercomputing Research Center) for writing the parallel compiler AC (version 2.6) [11] on which the T3D port of Split-C has been based. This work also utilized the CM-5 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, under grant number ASC960008N. We would like to acknowledge the use of the UMIACS 16-node IBM SP-2-TN2, which was provided by an IBM Shared University Research award and an NSF Academic Research Infrastructure Grant No. CDA9401151.

Please see http://www.umiacs.umd.edu/~dbader for additional performance information. In addition, all the code used in this paper will be freely available for interested parties from our anonymous ftp site, ftp://ftp.umiacs.umd.edu/pub/dbader. We encourage other researchers to compare with our results for similar inputs.

References

[1] A. Alexandrov, M. Ionescu, K. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation. In 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95-105, Santa Barbara, CA, July 1995.

[2] R.H. Arpaci, D.E. Culler, A. Krishnamurthy, S.G. Steinberg, and K. Yelick. Empirical Evaluation of the CRAY-T3D: A Compiler Perspective. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 320-331, Santa Margherita Ligure, Italy, June 1995.

[3] D.A. Bader, D.R. Helman, and J. JaJa. Practical Parallel Algorithms for Personalized Communication and Integer Sorting. Technical Report CS-TR-3548 and UMIACS-TR-95-101, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, November 1995. Submitted to ACM Journal of Experimental Algorithmics.

[4] D.A. Bader and J. JaJa. Parallel Algorithms for Image Histogramming and Connected Components with an Experimental Study.
Technical Report CS-TR-3384 and UMIACS-TR-94-133, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, December 1994. To appear in Journal of Parallel and Distributed Computing.

[5] D.A. Bader and J. JaJa. Parallel Algorithms for Image Histogramming and Connected Components with an Experimental Study. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-133, Santa Barbara, CA, July 1995.

[6] D.A. Bader and J. JaJa. Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection. Technical Report CS-TR-3494 and UMIACS-TR-95-74, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, July 1995. To be presented at the 10th International Parallel Processing Symposium, Honolulu, HI, April 15-19, 1996.

[7] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS Parallel Benchmarks. Technical Report RNR-94-007, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, Moffett Field, CA, March 1994.

[8] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154-164, 1995.

[9] K. Batcher. Sorting Networks and Their Applications. In Proceedings of the AFIPS Spring Joint Computer Conference 32, pages 307-314, Reston, VA, 1968.

[10] G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, and M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 3-16, July 1991.

[11] W.W. Carlson and J.M. Draper. AC for the T3D. Technical Report SRC-TR-95-141, Supercomputing Research Center, Bowie, MD, February 1995.

[12] H. Chernoff. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations. Annals of Math. Stat., 23:493-509, 1952.

[13] Cray Research, Inc. SHMEM Technical Note for C, October 1994. Revision 2.3.

[14] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings of Supercomputing '93, pages 262-273, Portland, OR, November 1993.

[15] D.E. Culler, A.C. Dusseau, R.P. Martin, and K.E. Schauser. Fast Parallel Sorting Under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing, chapter 4, pages 71-98. John Wiley & Sons, 1993.

[16] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[17] A.C. Dusseau. Modeling Parallel Sorts with LogP on the CM-5. Technical Report UCB//CSD-94-829, Computer Science Division, University of California, Berkeley, 1994.

[18] T. Hagerup and C. Rüb. A Guided Tour of Chernoff Bounds. Information Processing Letters, 33:305-308, 1990.

[19] W.L. Hightower, J.F. Prins, and J.H. Reif. Implementations of Randomized Sorting on Large Parallel Machines. In Proceedings of the 4th Symposium on Parallel Architectures and Algorithms, pages 158-167, San Diego, CA, July 1992.

[20] J.S. Huang and Y.C. Chow. Parallel Sorting and Data Partitioning by Sampling.
In Proceedings of the 7th Computer Software and Applications Conference, pages 627-631, November 1983.

[21] J. JaJa and K.W. Ryu. The Block Distributed Memory Model. Technical Report CS-TR-3207, Computer Science Department, University of Maryland, College Park, January 1994. To appear in IEEE Transactions on Parallel and Distributed Systems.

[22] J.F. JaJa and K.W. Ryu. The Block Distributed Memory Model for Shared Memory Multiprocessors. In Proceedings of the 8th International Parallel Processing Symposium, pages 752-756, Cancun, Mexico, April 1994. (Extended Abstract).

[23] F.T. Leighton. Tight Bounds on the Complexity of Parallel Sorting. IEEE Transactions on Computers, C-34:344-354, 1985.

[24] H. Li and K.C. Sevcik. Parallel Sorting by Overpartitioning. Technical Report CSRI-295, Computer Systems Research Institute, University of Toronto, Canada, April 1994.

[25] X. Li, P. Lu, J. Schaeffer, J. Shillington, P.S. Wong, and H. Shi. On the Versatility of Parallel Sorting by Regular Sampling. Parallel Computing, 19:1079-1103, 1993.

[26] J.M. Marberg and E. Gafni. Sorting in Constant Number of Row and Column Phases on a Mesh. Algorithmica, 3:561-572, 1988.

[27] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical report, University of Tennessee, Knoxville, TN, June 1995. Version 1.1.

[28] C.G. Plaxton. Efficient Computation on Sparse Interconnection Networks. Technical Report STAN-CS-89-1283, Department of Computer Science, Stanford University, Stanford, CA, September 1989.

[29] M.J. Quinn. Analysis and Benchmarking of Two Parallel Sorting Algorithms: Hyperquicksort and Quickmerge. BIT, 29:239-250, 1989.

[30] J.H. Reif and L.G. Valiant. A Logarithmic Time Sort for Linear Sized Networks. Journal of the ACM, 34:60-76, 1987.

[31] S. Saini and D.H. Bailey. NAS Parallel Benchmarks Results 12-95. Report NAS-95-021, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, Moffett Field, CA, December 1995.

[32] H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14:361-372, 1992.

[33] A. Tridgell and R.P. Brent. An Implementation of a General-Purpose Parallel Sorting Algorithm. Technical Report TR-CS-93-01, Computer Sciences Laboratory, Australian National University, Canberra, Australia, February 1993.

[34] L.G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103-111, 1990.

[35] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.