ABSTRACT
Title of dissertation: DESCRIBING AND MODELING
REPETITIVE SEQUENCES IN DNA
Suzanne S. Sindi, Doctor of Philosophy, 2006
Dissertation directed by: Professor James A. Yorke
Department of Mathematics
Department of Physics
and
Professor Brian R. Hunt
Department of Mathematics
A significant fraction of the genome, i.e. the complete DNA sequence, of most
organisms is comprised of sequences for which there are similar copies somewhere
within the genome. While most repetitive DNA was originally thought to have no
function, there is a growing body of literature to suggest that repetitive sequences
are vital to the genome.
The goal of this dissertation is to analyze statistical properties of repetitive
sequences in the genomes of a variety of organisms. We find a variety of striking
features of repetitive sequence in the human genome and the genomes of C. elegans
(worm), A. thaliana (mustard seed) and D. melanogaster (fruit fly) with some com-
parison to S. cerevisiae (yeast) and E. coli (a bacteria). We find that the number
of times each 40-mer (sequence of 40 bases) occurs in a genome is approximated
by a power law distribution. We analyze in detail the separation between copies
of 40-mers that occur exactly twice in a chromosome and observe that a significant
portion of these pairs, that we call ?proximal?, have extremely small separations,
while the remaining ?distant? pairs have a distribution more consistent with being
uniformly distributed throughout the chromosome. We introduce a type of exactly
repetitive region, which we call a ?repeat string,? and find the distribution of lengths
of repeat strings is roughly a power law.
Since these properties have been verified for the genomes of a variety of or-
ganisms there may be a common explanation of their origin. When possible, we
suggest evolutionary mechanisms that could cause the emergence of such statistical
properties. In particular, we developed a model of the evolution of repeat strings in
a genome. We find that, under quite general conditions, the stationary distribution
of our evolutionary model is the Pareto distribution, a close relative of the power
law distribution.
DESCRIBING AND MODELING
REPEATED SEQUENCES IN DNA
by
Suzanne S. Sindi
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2006
Advisory Committee:
Dr. James A. Yorke, Chair/Advisor
Dr. Brian R. Hunt, Co-Advisor
Dr. Stephen Altschul
Dr. David Mount
Dr. Steven Mount
c? Copyright by
Suzanne Sindi
2006
ACKNOWLEDGMENTS
There are many at this University and elsewhere that I owe a tremendous debt
of gratitude. I know that I will certainly forget to list some, and for that I apologize
in advance. While it ?takes a village to raise a child?, it takes at least that many
to help someone earn a PhD.
I would first like to acknowledge my advisors James A. Yorke and Brian R.
Hunt. Over the years they, more than anyone, have helped shape me into the
researcher that I am today. Through them I learned that, surprisingly, the most
difficult part of research can be simply formulating an interesting question that you
can answer.
The support I have received from my fellow graduate students has been so
important to my progress through graduate school and general sanity. I would
especially like to thank Cathy Jones and D.J. Patil; they really looked after me and
their love and attention have been so important to me. I was also lucky enough
to have many wonderful students enter graduate school around the same time as
me. This was such an amazing group of people and I am so pleased to call them
friends. I would especially like to thank Brandy Rapatski who was my first Maryland
friend; Greggo Johnson and Blake Pelzer for many nights of television viewing;
Richard Hallquist for being such a good friend; JT Halbert and Aaron Lott for the
many outdoor excursions and frequent conversations about ?life, the universe and
ii
everything?.
In addition, I was lucky to get to know a wonderful person, Somantika Datta.
Because of her stories and view on life my thoughts on the world have been broad-
ened. I will never forget one of the first things we did together; she drew a map of
India and told me about food, culture and history of some of the different regions.
I feel very special to gotten to know some of her family as well.
I thank my family: Maria, Abdullah and Fred, for providing me with encour-
agement and reminding me just why I wanted to do this PhD thing anyway.
A few years ago I was fortunate enough to meet Ben Jewell. In addition to
being an amazing cook, Ben has been a wonderful companion and partner. His daily
support and positive attitude have been vital to my progress, especially in this last
year. I feel so fortunate to get to share the rest of my life with him.
I have also benefited tremendously from discussions about my research with
Nathan Edwards, Stephen Altschul and Steven Salzberg. Their comments and crit-
icism has helped shape the direction of my research. Throughout graduate school,
my research was funded in part by NSF Grants DMS 0104087 and DMS 0312360
and under NIH Grant 1R01HG0294501.
I must also thank the employees and owners of The Bagel Place for providing
a location conducive to pondering coutour integrals. It is in part because of their
reasonably priced breakfasts and endless coffee refills that I passed the Analysis
qualifying examination.
(Thanks Dennis, Ruth, Patrick, Abbey, Alex, Helena, Jen, Joe, Magda, Maria
and all the Leaders - I did it!)
iii
TABLE OF CONTENTS
List of Tables vi
List of Figures vii
1 Introduction 1
1.1 Repetitive DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Whole Genome Shotgun Assembly . . . . . . . . . . . . . . . . . . . 2
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Distribution of Long Word Counts in DNA Sequences 6
2.1 Low Count Repetitive Sequence . . . . . . . . . . . . . . . . . . . . . 8
2.2 Distribution of 40-mer Counts. . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Sensitivity to Word Length . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 40-mer Counts in Coding DNA . . . . . . . . . . . . . . . . . . . . . 12
2.5 Relation to Previous Work . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Separations Between Repeats that Occur Exactly Twice 17
3.1 Repeat Pairs and Proximal Separations . . . . . . . . . . . . . . . . . 19
3.2 Repeat Pair Separations for E. coli . . . . . . . . . . . . . . . . . . . 28
3.3 Extension of Proximal Repeat Pairs . . . . . . . . . . . . . . . . . . . 28
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Mechanisms Creating Proximal and Distant Repeat Pairs . . . 31
3.4.2 Selecting Regions with Count Exactly Two . . . . . . . . . . . 34
3.4.3 Implication for Genome Assembly: BAC Assembly . . . . . . 34
4 Evolution by Random Segmental Duplication Accurately Predicts Repeat
String Distribution 36
4.1 Distribution of Lengths of Repeat Strings . . . . . . . . . . . . . . . . 38
4.2 Models of Genome Evolution Through Random Segmental Duplica-
tions and Deletions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Fixed Length Model . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Stationary Distribution of Fixed Length Model . . . . . . . . 45
4.2.3 Variable Length Model . . . . . . . . . . . . . . . . . . . . . . 50
4.2.4 Growing Genome Model . . . . . . . . . . . . . . . . . . . . . 51
4.3 Unitary Repeat Strings . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Chromosome Repeat Strings . . . . . . . . . . . . . . . . . . . 60
4.4.2 Implications of Model . . . . . . . . . . . . . . . . . . . . . . 62
4.4.3 Implication for Genome Assembly . . . . . . . . . . . . . . . . 63
A Mathematical Formulations 65
A.1 Choosing a Word Length . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2 Showing H(f) is a Contraction . . . . . . . . . . . . . . . . . . . . . 68
A.3 Deriving the Stationary Distribution for the General Model . . . . . . 69
iv
Bibliography 72
v
LIST OF TABLES
3.1 Repeat Pairs. We list for each organism the length of its genome
and the number of 40-mers with count two. Strikingly, most 40-mers
with count two have both their copies on the same chromosome. We
call such 40-mers a repeat pair. . . . . . . . . . . . . . . . . . . . . 21
3.2 Proximal Repeat Pairs. We say a repeat pair is proximal if the
separation between the 40-mers is less than 0.003 times the length
of the chromosome it belongs to. We list the percent of repeat pairs
that are proximal in each chromosome averaged over all chromosomes
in a genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Proximal Cutoffs for Selected Chromosomes. We analyze hu-
man chromosome 21 because it is similar in length to the chromo-
somes for the other organisms. In all cases we find that proximal
separations make up the majority of repeat pairs, the most extreme
case being D. melanogaster, which had less than 3,000 distant repeat
pairs. Notice that in contrast to the 46% proximal fraction for the
human genome in Table 3.1, here over 70% of the repeat pairs is
proximal for human chromosome 21. . . . . . . . . . . . . . . . . . . 22
4.1 Repeat String Distribution Slope. The distribution of repeat
string length L for the organisms we study roughly follows a power
law. The slope of the logarithm of the cumulative distribution versus
log(L) was calculated, using a least squares fit, over a range deter-
mined appropriate for each organism. We find a diversity of slopes,
but the distributions appear to be split into two categories. Distri-
butions for the human genome, C. elegans have slopes fairly close to-
gether. The slope of the cumulative distribution for D. melanogaster
and S. cerevisiae are similar and significantly more shallow that the
slopefortheother three genomes. With theexception of D. melanogaster,
the range of the power law is correlated with genome length. . . . . . 40
4.2 Unitary Repeat Strings. We select a subset of repeat strings we
call unitary repeat strings. For all genomes we considered, unitary
repeat strings were the majority of repeat strings. . . . . . . . . . . 56
vi
4.3 Unitary Repeat String Distribution Slope. The power law
slope, as determined by a least square fit, for the unitary repeat
strings more closely matches the slope predicted by our Fixed and
Variable Length Models for three of our genomes. The distributions
of the two, D. melanogaster and S. cerevisiae, have a slope that is still
more shallow than predicted by our Fixed and Variable Length Mod-
els, but that could be consistent with the Growing Genome Model.
Notice that the range approximated by a power law has decreased
from the distributions of repeat strings. . . . . . . . . . . . . . . . . 59
LIST OF FIGURES
2.1 Distribution of 40-mer Counts. The number, N(c), of distinct
40-mers with count c for C. elegans and A. thaliana is roughly a
power law over the range 3 ? c ? 70; that is, the distributions can be
approximated by the straight line segment shown above. The plot for
the human genome is nearly a straight line on the range 10 ? c ? 700.
The plot for D. melanogaster fluctuates substantially but follows the
trend of the line segment approximating C. elegans and A. thaliana.
The peak in the distribution of D. melanogaster denoted ?roo peak?
is caused by the roo transposon (see text). The power law exponent
for each genome studied is roughly, ?2.5. . . . . . . . . . . . . . . . 10
2.2 DistributionofCountsin C.elegans for Different WordLengths.
The distribution ofk-mer counts for C. elegans follows a similar power
law for all the values of k shown. In a random sequence of {A,C,G,T}
of the same length as C. elegans, we expect every 10-mer to occur
roughly 190 times. Because of this effect, the power law behavior for
k = 10 does not emerge until well beyond a count of 200. . . . . . . 13
2.3 Distribution of 40-mers in the Coding DNA of C. elegans.
We present the distribution of 40-mer counts in the genes of C.elegans
as well in as the distribution of 40-mer counts restricted to the exons.
For comparison, we plot the distribution of 40-mers for the entire
genome of C.elegans. The gene counts follow an approximate power
law similar to that for the whole genome, while the exon counts de-
crease with a similar slope over a shorter range. . . . . . . . . . . . . 15
vii
3.1 Repeat Pair Separations - Human Chromosome 21. The cu-
mulative fraction F(s) of repeat pairs that have separation less than s
for human chromosome 21 appears to hit the vertical axis at roughly
0.73. The vertical jump that we label ?Segmental Duplication? is due
to a large segmental duplication in the genome; a region of approxi-
mately 200,000 bases is duplicated (with some local rearrangements)
at a separation of ?15 million bases [81]. . . . . . . . . . . . . . . . . 23
3.2 Repeat Pair Separations - C. elegans, A. thaliana and D. melanogaster.
We plot the cumulative fraction F(s) of separations less than s for
repeat pairs in the first chromosomes of C. elegans and A. thaliana
and chromosome arm 2L for D. melanogaster. The dots on the dis-
tributions indicate the separation chosen as proximal cutoff for each
chromosome. Notice that if this were on a linear scale, the dots would
appear to lie on the vertical axis. . . . . . . . . . . . . . . . . . . . . 24
3.3 Distant Separations. For the same chromosomes as in Figures
3.2 and 3.3, we plot the cumulative fraction of separations between
0.003 and ?. That is, proximal repeat pairs are removed and we
have normalized the chromosomes to have unit length. In addition,
we show the separation distribution under a model when we pick
locations of each 40-mer in a repeat pair from a uniform random
distribution, namely F(?) = 2???2. While none of the distributions
are well matched by the uniform random model we see that the shape
of the distribution is more characteristic of the uniform random model
than the distant separations, especially for human chromosome 21.
For chromosome 1 of C. elegans and A. thaliana small separations
are still over-represented and for arm 2L of D. melanogaster small
separations are under-represented. . . . . . . . . . . . . . . . . . . . 26
3.4 Repeat Separations - E. coli. The cumulative fraction of sep-
arations between repeat pairs in E. coli appears to hit the vertical
at 0.20 and 20% of the repeat pairs are proximal. Since the genome
of E. coli is circular, the largest separation possible between repeat
pairs is half the genome length, or approximately 2.65 million let-
ters. The straight line shown with the distribution is the cumulative
distribution of separations expected from a uniform random model. . 29
3.5 Extending the Match for a Proximal Repeat Pair. We attempt
to extend the match between proximal 40-mers in both directions to
determine if the repeat pair belongs to the same inexact repeat region.
We align the sequences labeled R and L with the sequence between
the repeat pair, labeled S. For the cases we sampled from the various
genomes we find roughly half the time over 90% of the separation S
has a high quality match in either L or R. . . . . . . . . . . . . . . . 32
viii
4.1 Distribution of Lengths of Repeat Strings. We plot the cumula-
tive distribution of lengths of repeat strings for a variety of organisms.
In each case we find this distribution to be approximated by a power
law for a part of the range of L. Along with the data, we plot a line
segment illustrating the approximate range and slope of the power law
for C. elegans. For D. melanogaster and S. cerevisiae the range of
the power law is shorter than for human, C. elegans and A. thaliana.
Additionally, the exponent of the power law, corresponding to the
slope on this log-log plot, for D. melanogaster and S. cerevisiae is
shallower than for the other organisms. . . . . . . . . . . . . . . . . 41
4.2 Simulation of Fixed Length Genome Evolution Model. We
fix parameters W = 108, S = 104 and show results for M = 100
and M = 1,000. The simulation for M = 100 was carried out for 1
million steps and for M = 1,000 for 10 million steps, so that we made
approximately 10,000 segmental duplications in each simulation. We
overlay the shifted length distribution with the stationary distribu-
tion, a Pareto distribution, predicted by our analysis. Notice that
over a significant range the computer simulations agree well with our
theoretical findings, especially for M = 1000. . . . . . . . . . . . . . 44
4.3 Simulation of Variable Length Genome Evolution Model. We
plot the length distribution from runs of our Variable Length Model
where we have fixed W = 108 and M = 100. We show the result from
three runs where different probability distributions, all having mean
S = 104, were used as the length distribution for segmental duplica-
tions and deletions. We use the exponential distribution with mean
S, uniform distribution on [0,2?104] and normal distribution with
mean and standard deviation S. In the case that the length deter-
mined by the normal distribution is negative the value is ignored and
a new length selected from the normal distribution. In this manner,
the actual mean of the distribution is somewhat larger than S. In all
three sample cases shown, the distribution of repeat strings converges
to a distribution consistent with the stationary distribution derived
for modeling the length of segmental mutations as constant. . . . . . 52
ix
4.4 Simulation of Growing Genome Evolutionary Model. We plot
the output for two runs of our Growing Genome Simulation. In each
case the distribution of segmental mutations was determined from
a uniform random variable on [0,2 ? 104]. The model was halted
when the genome reached the specified length. In both cases the
distribution can beapproximated a power law over a part ofthe range.
In the case that the genome grew from length 106 to 5?106, the power
law exponent determined by a least square fit was ?1.5 over the range
[1040,106]. For the other simulation, where the genome grew from
2?106 to 107, the power law exponent ?1.8 over [300,7000]. . . . . 55
4.5 Distribution of Unitary Repeat Strings. The distributions of
unitary repeat strings for the organisms we consider are consistent
with a power law distribution and have steeper slopes that for all
repeat strings. The unitary repeat strings are more consistent with
the evolutionary dynamics described by our model. The distributions
for the human genome, C. elegans and A. thaliana are consistent with
the slope predicted by our Fixed and Variable Length Models. Since
the slopes of the distributions of D. melanogaster and S. cerevisiae
are shallower than ?2, the Growing Genome Model provides the best
characterization of their distributions. . . . . . . . . . . . . . . . . . 58
4.6 Chromosome Repeat Strings. We plot the distribution of repeat
strings as determined for only the chromosome for human chromo-
somes 3 and 9 along with the entire human genome. We notice that
while the distribution of repeat strings in each case can be approxi-
mated by a power law ofer a range of the distribution the slopes of
the distribution are different. . . . . . . . . . . . . . . . . . . . . . . 61
x
Chapter 1
Introduction
1.1 Repetitive DNA
The availability of complete genomes for a variety of organisms has resulted
in an explosion of work in analyzing the statistics of DNA sequences. One area
of scientific and mathematical interest has been studying sequences of DNA with
highly similar copies in the DNA sequence. Such repetitive subsequences constitute
a significant fraction of the genomes of many organisms (for example, [3, 7, 19,
24, 47, 56, 77, 79, 80]). In fact more than half of the human genome is repetitive
[46, 82]. Numerous studies (for example, [5, 34, 38, 42, 49, 50, 69]) have found
evidence that repetitive DNA is involved with important processes. In the case of
humans, some repetitive sequences have been linked to genetic disorders [23]. As
James Shapiro wrote, ?...the distribution of repetitive DNA sequence elements is
a key determinant of how a particular genome functions (i.e., replicates, transmits
to future generations, and encodes phenotypic traits.)? [74]
The goal of this dissertation is to analyze the structure of repetitive sequences
in the DNA sequences of a variety of organisms. We study the human genome and
the genomes of C. elegans (worm), A. thaliana (mustard seed), D. melanogaster
(fruit fly) with some comparison to the genomes of S. cerevisiae (yeast) and E. coli
(a bacteria). The genomes we study have been sequenced and analyzed beyond the
1
draft phase we will describe in the next section. These organisms were selected, in
part, because their published sequences are among the most complete and accurate.
Through studying the structure of these repetitive sequences in these genomes we
seek to develop evolutionary models that provide plausible explanations for the
structures we observe.
1.2 Whole Genome Shotgun Assembly
One of our motivations for studying repetitive sequences in DNA is the com-
plications repetitive DNA poses to genome assembly, the process of determining
the DNA sequence of an organism. Throughout this dissertation we will make ref-
erence to the relationship of our results to genome assembly. There are a number of
different genome assembly programs in use (for example [8,29,35,54,63]). Although
there are differences in the details of their algorithms, the underlying procedure is
similar to what we next describe.
In the dominant method of genome assembly, termed shotgun assembly
many copies of a genome are broken into millions of overlapping reads of about 700
bases (on average). No information about the location of the reads in the genome
is obtained. The sequence of the reads has an error rate of roughly 1%; that is,
typically several bases from each read are incorrect. Given this information, one
must computationally assemble these reads into an estimate of the sequence of the
genome. The first step in this process is typically to determine which reads come
from overlapping parts of the genome; two such reads are said to overlap. Further
2
steps combine the reads into larger segments and the collection of all such segments
is the draft assembly. (For further details on genome assembly refer to [55,65,66].)
Repetitive DNA is the major obstacle to shotgun assembly because it causes
reads from different parts of the genome to appear to overlap. This makes it difficult
to determine the correct location of each read relative to other reads. Generally,
repetitive regions shorter than the read length are not so problematic because a
single read can span a repeat, connecting the nonrepetitive sequence on each side.
Repeat copies that are less than 98% similar are unlikely to pose a problem for a
high quality assembler. Indeed, a small number of differences (? 1/2%) are often
sufficient to distinguish between repeat copies. High fidelity repeats are the major
cause of difficulty in genome assembly as well as the major source of errors in draft
assemblies (see [25,72]).
1.3 Outline of Thesis
In Chapter 2, we observe and discuss an approximate power law in the dis-
tribution of counts (number of occurrences) of length 40 words, 40-mers in the
human genome and the genomes of C. elegans, A. thaliana, and D. melanogaster.
Our results are not sensitive to the word length k = 40; we obtain similar distribu-
tions for 20 ? k ? 100. Previous studies have examined count distributions of much
shorter words - those of length k for 3 ? k ? 10 - in DNA sequences and sometimes
utilized ranked distributions. We discuss how using longer word lengths provides a
distribution more consistent with the counts of repetitive sequences in the genome.
3
In Chapter 3, we discuss a preference for small separations between copies of
40-mers that occur exactly twice in a single chromosome. We classify these pairs
as either ?proximal? or ?distant?. Proximal pairs are those that are separated
by less than 0.3% of the chromosome length. In all five of the cases we examined,
over 20% of the pairs are proximal, and in three cases over half are proximal. We
explore extending proximal repeat pairs to an inexact match to study if both 40-
mers are actually part of the same repeat region. We find that for most organisms a
significant fraction of the cases we studied proximal repeat pairs belong to the same
inexact repeat.
In Chapter 4, we study the distribution of lengths of repetitive sequences
in a variety of genomes. We say a k-mer is repetitive, in a specified genome,
if it occurs at least twice in the genome. We say a sequence in the genome of
length ? k is a repeat string if each k-mer in it is repetitive and the string is
maximal.1 In all cases we observed that the distribution of repeat string lengths
can be approximated by a power law. We develop a model of the evolution of
repeat strings by random segmental duplications and point mutations. We show
that under general conditions the distribution produced by this model is a Pareto
Distribution2, a distribution closely related to the power law [59]. The convergence
of our model suggests distributions we observe in the genomes could be related to
1By maximal we mean that the subsequence is not contained in a longer sequence that is
repetitive.
2A power law distribution is a function of the form f(x) = axb while a Pareto distribution is of
the form g(x) = a(x + c)b where a,b,c,x ?R and x > 0.
4
evolution through segmental duplications.
Our results are computed from GenBank [11] sequences of the human genome
and the genomes of C. elegans, A. thaliana, D. melanogaster, S. cerevisiae and
E. coli. The genome sequences and gene annotations we use in this research were
current as of March 2006.
5
Chapter 2
Distribution of Long Word Counts in DNA Sequences
The term genome refers to the complete DNA sequence of an organism. DNA
can be represented as one or more sequences of bases denoted A,C,G and T. Some
recent papers have studied the number of occurrences (or counts) of short words (10
or fewer letters) in genomes [4, 22, 28, 32, 48, 51, 52, 60, 84]. Since a genome contains
information complex enough to determine an entire organism, there should be no
expectation that such a sequence would have statistics consistent with a randomly
generated sequence of A, C, G and Ts. One of the features of DNA that distinguishes
it from a randomly generated sequence is the presence of subsequences that occur
repeatedly throughout the genome. Repetitive DNA comprises a significant fraction
of the genomes of many organisms [3, 7, 19, 46, 47, 77, 79, 80, 82].
In this investigation we determine how many times each length?k word, or
k-mer, occurs in a genome. If a k-mer occurs c times in the genome, we say it has
count c. DNA consists of a double strand,
Because DNA is a double stranded helix where one strand is the complement
of the other, the count of a k-mer includes all occurrences of its reverse complement
as well.1 We analyze the distribution of the number of k-mers with count c. We
1DNA consists of two complementary strands (or sequences); As are paired with Ts and Gs with
Cs. Furthermore, the two strands are read in reverse direction and are called reverse comple-
ments.Hence, AAC is the reverse complement of the string GTT. When searching for a particular
6
generally study words of length 40, 40-mers, because in a random sequence (each
base equally likely) of length similar to the genomes we analyze, 40 is sufficiently
large so that we expect no repeated words of length ? 40 (see Appendix). However,
the results we present are not sensitive to this value and similar results hold for a
broad range of word lengths between 20 and 100.
We study these distributions for the human genome and the genomes of C. el-
egans, A. thaliana and D. melanogaster. For all the genomes we study, we observe a
power-law-like distribution in the counts of repetitive length 40 words over a signif-
icant range of counts, from 3 to roughly 70. When the data are approximated by a
power law distribution, we find that the power law exponent, ??2.5, is similar for
all of the genomes analyzed. In addition, this exponent is preserved over different
word lengths.
Previous studies, such as [48], have observed a power law distribution in the
counts of significantly shorter words, k-mers where k ? 10. Several other studies
analyzed the distribution of ranked word counts; that is, the counts of k-mers plot-
ted in decreasing order (such as [51]). Both of these types of analysis reflect the
distribution of the most frequently occurring words, those with counts in the hun-
dreds or thousands. Some properties reported in [51] have been found to hold for
randomly generated sequences [14]. In this paper we characterize the distribution
of counts of longer words, 40-mers, and show they provide results over a range of
counts more consistent with the majority of repetitive DNA for the organisms we
string, we also look for its reverse complement. The count of a k-mer is actually the number of
occurrences of it or its reverse complement.
7
study.
2.1 Low Count Repetitive Sequence
We now demonstrate that repetitive sequences in the genomes we study typi-
cally have a count less than 70 (for two different ways of counting repetitions). This
is in the range of counts that is emphasized by our 40-mer distributions but not
by previous studies [48,51]. For three of the genomes we study, over half of the
repetitive 40-mers have a count of exactly two.
There is also a position-based way to count repetitions. Each position (or base)
in the genome is the beginning of a 40-mer (except near the end of a chromosome),
so we can refer to the count of the position (meaning the count of its 40-mer). We
say a position is repetitive if its 40-mer is repetitive. For all our genomes, we find
a typical repetitive position has a relatively low count. Over 2/3 of the positions,
for each genome we study, had count at most 70.
To illustrate the difference between the distribution of 40-mer counts and po-
sition counts, imagine that a genome has only 101 repetitive 40-mers, 100 occurring
twice and one 100 times. The counts by position are dominated by the most fre-
quently occurring 40-mers. The average count of a repetitive 40-mer is (1)100+(100)2101 ?
3 while the average count of the 300 repetitive positions is (1)1002+(100)22300 ? 35. No-
tice that in this case the median repetitive position count is two. The significant
fraction of repetitive positions having high counts in the genomes we study can be
illustrated as follows.
8
? For the human genome, the average repetitive 40-mer count is 5.08, while the
average repetitive position count is 726. The median repetitive position count
is 9. The maximum count of a 40-mer (and thus of a position) is 53,022.
? For C. elegans, the count of the average repetitive 40-mer is 3.34, while the
average repetitive position count is 15.6. The median repetitive position count
is 3. The maximum count of a 40-mer (and position) is 1,890.
The graphs in this paper show the distributions of repetitive 40-mer counts,
not repetitive position counts. The distributions of 40-mer counts and position
counts are related as follows. If N(c) is the number of 40-mers with count c, then
since each of these 40-mers occurs in c positions, the number of positions with count
c is cN(c). In this paper we graph N(c), but the power law behavior we observe
applies equally well to cN(c) with an exponent one greater, i.e. ??1.5.
2.2 Distribution of 40-mer Counts.
Recall, by the count of a 40-mer we mean the number of times a 40-mer or its
reverse complement appears in a genome. In Figure 2.1 we graph the number N(c)
of 40-mers with count c for the genomes we considered. A power law distribution,
N(c) ? acb, for numbers a and b, is reflected by a linear relationship on the log-log
scale, log(N(c)) ? blog(c) + log(a). We indicate with a straight line segment the
range best approximated by a power law for three of the genomes we study. The
line segment approximates the counts from 3 ? c ? 70 and has a slope ? ?2.5.
This indicates that for all our genomes, N(c) is approximately proportional to c?2.5
9
Figure 2.1: Distribution of 40-mer Counts. The number, N(c), of distinct 40-
mers with count c for C. elegans and A. thaliana is roughly a power law over the
range 3 ? c ? 70; that is, the distributions can be approximated by the straight
line segment shown above. The plot for the human genome is nearly a straight line
on the range 10 ? c ? 700. The plot for D. melanogaster fluctuates substantially
but follows the trend of the line segment approximating C. elegans and A. thaliana.
The peak in the distribution of D. melanogaster denoted ?roo peak? is caused by
the roo transposon (see text). The power law exponent for each genome studied is
roughly, ?2.5.
10
in this range. For C. elegans, N(c) remains close to this line over the range, while
for A. thaliana the distribution drops more quickly for c > 30. The distribution
for D. melanogaster fluctuates considerably more than the others but follows the
general trend of the line. The data for the human genome follows a power law closely
over the range 10 ? c ? 700.
We further investigated instances where 40-mer counts for D. melanogaster de-
viate from the power law (see Figure 2.1). We find that these peaks are primarily due
to the presence of high fidelity copies of transposons (i.e. transposable elements2)
whose sequence can be found at the Berkeley Drosophila Genome Project [12]. For
example, the peak at count c = 104 is due to the roo elements, which is the trans-
posable element in D. melanogaster having the highest copy number. It also has
high sequence conservation as described in [37]. We can similarly relate some of the
other peaks in the distribution to high fidelity transposons in D. melanogaster. Since
the distribution of 40-mer counts is significantly smoother for the other genomes,
we believe this may indicate that they have comparatively fewer high fidelity copies
of transposable elements.
In all cases we study, the data deviates significantly from the power law for the
40-mers with the highest counts. These 40-mers consist largely of low complexity
40-mers that comprise microsatellites3 and other types of tandem sequence.
2Transposable elements are a class of repetitive sequences in DNA. These sequences can create
a copy of themselves and insert the copy at another location in the genome.
3Microsatellites are subsequences of DNA composed of the same short word, typically less than
length 6, repeated many times in succession. For example, CATCATCATCATCATCATCAT.
11
2.3 Sensitivity to Word Length
We compared the distribution of word counts for a variety word lengths, k, for
the genomes we study. There is a significant qualitative change in the distribution
of short word lengths, i.e., words that are sufficiently short that they will occur
repeatedly by chance, compared to longer words where the number of words in the
genome is significantly less than the number of words of that length. Figure 2.2
shows the distribution of k-mers in C. elegans for a variety of word lengths. For
k = 10 only the tail of the distribution, c ? 300, is consistent with a power law.
Notice that the peak is near c = 40, which is somewhat lower than the value of
c ? 190 we would see for a purely random genome. For k ? 20 the full distribution
is more consistent with a power law. The behavior of the distribution is similar for
these higher values of k and the range of counts from3 ? c ? 70 is well approximated
by a power law with exponent ??2.5. As seen in Figure 2.2, as k increases beyond
40 the data become more sparse. However, the distributions are still characterized
by a power law with similar slope as for the lower values of k, but over a shorter
range.
2.4 40-mer Counts in Coding DNA
We analyzed the distributions of counts of 40-mers in the genes and exons
of the genomes of C. elegans, A. thaliana and D. melanogaster. We find that the
distribution of word counts in the genes and exons is characterized by a power law of
similar exponent to the power law for the entire genome. The behavior for C. elegans
12
Figure 2.2: Distribution of Counts in C.elegans for Different Word
Lengths. The distribution of k-mer counts for C. elegans follows a similar power
law for all the values of k shown. In a random sequence of {A,C,G,T} of the same
length as C. elegans, we expect every 10-mer to occur roughly 190 times. Because
of this effect, the power law behavior for k = 10 does not emerge until well beyond
a count of 200.
13
is typical for the other two genomes studied and therefore we present results for only
C. elegans.
The distribution of counts of 40-mers in the genes, exons and the entire genome
for C. elegans is shown in Figure 2.3. The distribution of the 40-mers in the genes is
similar to the power law behavior seen for the entire genome. There is significantly
less sequence in the exons, but the distribution of counts is reasonably consistent
with a power-law-like distribution over a much shorter range (Figure 2.3).
Genes often occur in families that code for similar proteins. A power law has
been shown to hold for the distribution of the number of members of gene families
[43] as well as for other quantities related to coding DNA such as protein folds and
pseudogene families [48]. Since genes in the same family may have highly similar
sequence and the size of families of genes is distributed according to a power law, one
may expect the distribution of 40-mers in coding regions of DNA to follow a power
law as well. As is evident from Figure 2.3, the 40-mer counts for genes represent
only a fraction of the overall counts. Thus, the distribution of 40-mer counts is
preserved throughout the genome and cannot be explained by the distribution in
only the coding regions of DNA.
2.5 Relation to Previous Work
Several papers have described power laws in the counts of short (k ? 10) words
(see for example, [48] and [51]). Other work has suggested that the short word data
is better fit by a distribution with more parameters such as the Yule distribution
14
Figure 2.3: Distribution of 40-mers in the Coding DNA of C. elegans. We
present the distribution of 40-mer counts in the genes of C.elegans as well in as the
distribution of 40-mer counts restricted to the exons. For comparison, we plot the
distribution of 40-mers for the entire genome of C.elegans. The gene counts follow
an approximate power law similar to that for the whole genome, while the exon
counts decrease with a similar slope over a shorter range.
15
[52]. We do not claim that a pure power law distribution is the best fit to the
distribution of 40-mers counts. Rather, we claim that the power law distribution
provides a reasonable characterization of the distribution of 40-mers in the range of
relatively low copy repeats, such as c ? 100.
As described earlier, for a random sequence having the same length as the
genome of C. elegans, the expected count of each 10-mer is roughly 190. Thus, repeat
regions with a count that is significantly less than 200 will not have a noticeable
effect on the distribution of 10-mer counts. Indeed, the power laws observed in [48]
all occur for counts well over 200.
Recall that in the genomes we studied, most of the bases in repetitive 40-mers
(and repetitive positions in the genome) have count less than 100. Thus, the power
law we observe reflects the majority of the repeat content in these genomes with high
enough fidelity (similarity between copies) that many of the 40-mers they contain
are identical. On the other hand, lower fidelity repeat regions (e.g., those with 90%
similarity) will contain relatively few repetitive 40-mers. But then, based on our
discussion above, repeat regions with sufficiently low fidelity and low count will not
contribute significantly to the distribution of k-mer counts for any k.
Recent papers describe models of the evolution of gene families [39,68]. These
papers may provide an explanation of the power law in genes (see Figure 2.3) [39,68].
However, because most repetitive 40-mers are not in genes these models as developed
do not provide a complete explanation of the behavior in the counts of long words.
16
Chapter 3
Separations Between Repeats that Occur Exactly Twice
Repetitive sequence is typically classified according to sequence composition,
copy number and mode of propagation in the genome. Some types of repetitive
DNA, such as ALUs and LINEs, have millions of copies throughout the genome [67].
Other types of repetitive DNA have copies that occur adjacently and their method
of propagation is through slippage or unequal crossing over (see [20, 44, 76]). The
evolutionary dynamics of some types of repetitive DNA have been studied through
direct observation and modeling (see references [10, 15, 16, 17, 18, 21, 30, 73]).
However, there are other types of repetitive DNA whose mechanisms for propagation
remain less certain. In this chapter we explore the distribution of a type of repetitive
sequence that is not well characterized by previous classification methods.
In this chapter we analyze repetitive sequence having exactly two copies in the
genome where both copies are within the same chromosome. Our goal is to see how
these are distributed. There is a technical problem in identifying these repeats when
the copies are of low fidelity (similarity). If the repeats are of low fidelity they can
be hard to find, and if we find two such copies, we do not know if there are other
copies that are of such low fidelity that they can not be detected. To avoid this
problem, we select a certain high fidelity subset of the duplicates. We identify these
high fidelity regions by first finding 40-mers that occur exactly twice in the genome.
17
We generally use length 40 because 40 is sufficiently large that the probability of
any 40-mer occurring more than once by chance in a random sequence the length of
the genomes (or chromosomes) is small (see Appendix).
Recall from Chapter 2 that we refer to the number of times a 40-mer, or its
reverse complement, appear in a genome as the count of the 40-mer. A 40-mer
with a count of at least two is called a repeat 40-mer because it has at least one
other copy. The most common copy number for a repeat 40-mer in the genomes we
study is exactly two (see Chapter 2). Strikingly, 40-mers with count two typically
have both copies on the same chromosome (see Table 3.1). We call 40-mers with
this property a repeat pair. We refer to the length of the sequence separating
the 40-mers in a repeat pair as the separation. We analyze in greater detail re-
peat pairs in the human genome and the genomes of C. elegans, A. thaliana and
D. melanogaster with some comparison to E. coli. In addition to presenting re-
sults for the complete genomes, we focus specifically on a typical chromosome from
each genome; chromosomes 1 of C. elegans and A. thaliana, chromosome arm 2L of
D. melanogaster and human chromosome 21.
In the next section, we classify these pairs as either ?proximal? or ?distant?;
proximal pairs are those that are separated by less than 0.3% of the chromosome
length. For all five of the genomes we examined, over 40% of the pairs are proximal,
and in three cases over half are proximal. We also find a preference for proximal
repeat pairs to occur in the same orientation1.
1The 40-mers in a repeat pair are said to occur in the same orientation if they are not reverse
complements of one another.
18
We extend the match between proximal repeat pairs in the same orientation
and find that for some organisms roughly half of these belong to two ?nearly adja-
cent? repeats. In other words, some lie in an approximate tandem repeat as we will
discuss later.
A recent paper discusses separations between intra-chromosomal duplications
[78]. The goal of this work was to analyze the separations between short exact
duplications that were not part of a larger inexact duplication. In contrast, we seek
to study segmental duplications in a more general setting. In particular, we wish to
extend exact matches to larger inexact repeats, in the case of proximal repeats, to
study if these could have originated from tandem duplication.
Although we can not identify the mechanisms responsible for the creation of
such repeats in DNA we are able to observe their effects. By examining these intra-
chromosomal repeat pairs we can determine characteristics of the mechanisms that
created them.
3.1 Repeat Pairs and Proximal Separations
In Table 3.1 we describe the lengths of the genomes we study and how many
40-mers with count two have both copies on the same chromosome. In all cases
we find that a significant fraction of 40-mers with count two occur on the same
chromosome.
Figure 3.1 shows the fraction of the repeat pairs in human chromosome 21 that
appear within a particular separation. The graph seems to hit the vertical axis at
19
about 0.73 because 73% of the repeat pairs have separations less than 0.003 times the
length of the chromosome. These are the pairs we call proximal for this chromosome.
There are over 120 times as many proximal pairs as would be expected if repeat pair
locations were uniformly distributed. (For human chromosome 21, most of the first
13 million bases were undetermined, so we consider only the final 34.2 million bases.
[81])
In all the organisms studied we find that repeat pairs with small separations
are over represented; we define a proximal cutoff of 0.003 times the chromosome
length to designate repeat pairs that are proximal. Repeat pairs that occur at a
separation larger than the proximal cutoff are called distant. While our choice of
the cutoff between proximal and distant pairs is somewhat arbitrary we generally
see a natural division in the distribution of separations somewhere close to 0.003
times the chromosome length. In Table 3.2, we describe the percent of repeat pairs
that are proximal for the selected chromosomes from the organisms we study.
The Proximal Cutoff. We now illustrate the distribution of proximal sep-
arations more explicitly. Figure 3.2 shows the fraction, F(s), of repeat pairs with
separation less than s in the chosen chromosomes of C. elegans, A. thaliana and D.
menalogaster. The horizontal axis has a logarithmic scale, unlike Figure 3.1. Table
3.3 lists the proximal cutoffs for these chromosomes and the fraction of repeat pairs
that are proximal. For each chromosome in Figure 3.2, there are about 100 times
more proximal pairs than if the repeat pair locations were uniformly distributed.
20
Organism Genome Length # of Count 2 # of # of Proximal
(in millions) 40-mers Repeat Pairs Repeat Pairs
(in thousands) (in thousands) (in thousands)
C. elegans 100 1,618 1,286 909
A. thaliana 119 1,404 924 611
D. melanogaster 120 551 472 407
human 3,000 38,767 18,104 8,321
Table 3.1: Repeat Pairs. We list for each organism the length of its genome and
the number of 40-mers with count two. Strikingly, most 40-mers with count two
have both their copies on the same chromosome. We call such 40-mers a repeat
pair.
Organism Genome Length ?% Proximal
(in millions) Repeat Pairs
C. elegans 100 70.7
A. thaliana 119 66.2
D. melanogaster 120 86.2
human 3,000 46.0
Table 3.2: Proximal Repeat Pairs. We say a repeat pair is proximal if the
separation between the 40-mers is less than 0.003 times the length of the chromo-
some it belongs to. We list the percent of repeat pairs that are proximal in each
chromosome averaged over all chromosomes in a genome.
21
Chromosome Length Proximal Cutoff ?% Proximal
(in millions) (in thousands)
C. elegans chr 1 15 45 80.7
A. thaliana chr 1 30 91 64.1
D. melanogaster arm 2L 28 67 97.9
human chr 21 34.2 141 73.1
Table 3.3: Proximal Cutoffs for Selected Chromosomes. We analyze human
chromosome 21 because it is similar in length to the chromosomes for the other
organisms. In all cases we find that proximal separations make up the majority
of repeat pairs, the most extreme case being D. melanogaster, which had less than
3,000 distant repeat pairs. Notice that in contrast to the 46% proximal fraction for
the human genome in Table 3.1, here over 70% of the repeat pairs is proximal for
human chromosome 21.
22
Figure 3.1: Repeat Pair Separations - Human Chromosome 21. The cu-
mulative fraction F(s) of repeat pairs that have separation less than s for human
chromosome 21 appears to hit the vertical axis at roughly 0.73. The vertical jump
that we label ?Segmental Duplication? is due to a large segmental duplication in
the genome; a region of approximately 200,000 bases is duplicated (with some local
rearrangements) at a separation of ?15 million bases [81].
23
Figure 3.2: Repeat Pair Separations - C. elegans, A. thaliana and
D. melanogaster. We plot the cumulative fraction F(s) of separations less than
s for repeat pairs in the first chromosomes of C. elegans and A. thaliana and chro-
mosome arm 2L for D. melanogaster. The dots on the distributions indicate the
separation chosen as proximal cutoff for each chromosome. Notice that if this were
on a linear scale, the dots would appear to lie on the vertical axis.
24
Distant 40-mer Pairs. The cumulative distribution in Figure 3.1, for human
chromosome 21, looks like a quadratic distribution with a vertical discontinuity. To
investigate the separations of distant repeat pairs, we plot the distribution for the
selected chromosomes after removing the proximal repeat pairs.
In Figure 3.4, we show the cumulative distributions of separations for distant
repeat pairs in the chromosomes we studied. These distributions are shown along
with the quadratic that is the cumulative distribution for the uniform random model
in which the locations of each copy of a 40-mer in a repeat pair is determined by a
uniform random variable.
The distant separations for chromosomes 1 of C. elegans and A. thaliana are
significantly more likely to be small than when compared to the uniform random
model. In contrast, the distant separations for D. melanogaster arm 2L are larger
than would be expected under a uniform random model. For human chromosome 21
the cumulative distribution is more consistent with being determined by a uniform
random model, with the exception of the single large segmental duplication. If we
remove repeat pairs from the large segmental distribution for this chromosome, we
find a distant distribution that can be reasonably approximated by the uniform
random model.
Segmental Jumps. As we have said, the large jump in Figure 3.2 is due to a
large segmental duplication corresponding to many repeat pairs at approximately
the same separation. Segmental jumps indicate that there is a relatively long highly
conserved repeat (not necessarily exact). The size of the jump will be related both to
the length and fidelity of the repeat. Such a jump can occur within either proximal
25
Figure 3.3: Distant Separations. For the same chromosomes as in Figures 3.2 and
3.3, we plot the cumulative fraction of separations between 0.003 and ?. That is,
proximal repeat pairs are removed and we have normalized the chromosomes to have
unit length. In addition, we show the separation distribution under a model when we
pick locations of each 40-mer in a repeat pair from a uniform random distribution,
namely F(?) = 2? ??2. While none of the distributions are well matched by the
uniform random model we see that the shape of the distribution is more characteris-
tic of the uniform random model than the distant separations, especially for human
chromosome 21. For chromosome 1 of C. elegans and A. thaliana small separations
are still over-represented and for arm 2L of D. melanogaster small separations are
under-represented.
26
or distant separations. The large segmental jump in Figure 3.1 is due to a sequence
of approximately 200,000 bases duplicated with a separation of about 15 million
bases. The two copies are 96% identical, with some local rearrangements [81]. If
the 4% differences were distributed at random, then about 0.9640, or 20%, of the
40-mers in each segment would have a duplicate in the corresponding position of
the other segment copy.
Orientation of Repeat Pairs. A repeat pair is in the reverse orientation
if the two copies of the repeated 40-mer are reverse complements; otherwise they
are in the same orientation. In general, we find that proximal repeat pairs are
significantly more likely to occur in the same orientation than distant repeat pairs.
The only exception is C. elegans for which we see essentially no correlation between
proximality and orientation.
Similarly, over all repeat pairs, repeat pairs with the same orientation have
smaller separations than repeat pairs with reverse orientation. For D. melanogaster
arm 2L the median separation for pairs in the same orientation is ? 3,000 bases
and ? 5.8?106 bases for reverse oriented pairs. For A. thaliana chromosome 1 the
median separation for pairs with the same orientation is approximately 12,500 bases
and 320,000 bases for reverse oriented pairs. For human chromosome 21 the median
separation for pairs with the same orientation is? 260,000 bases and? 7?106 bases
for pairs with reverse orientation. However, for C. elegans the median separation
both types of repeat pairs was similar, approximately 8,000 bases.
As an additional anomaly, C. elegans chromosome 1 had approximately the
same number of pairs in each orientation while the other chromosomes studied had
27
a preference for repeat pairs with the same orientation.
3.2 Repeat Pair Separations for E. coli
For a comparison to the repeat structure of the human genome and then
genomes of C. elegans, A. thaliana and D. melanogaster we investigated repeat pair
separations in the genome of E. coli. Since E. coli has a single (circular) chromosome
[13], all 40-mers with count two will be, by default, repeat pairs. As with the other
genomes, we find that a significantly higher fraction of 40-mer repeat pairs have
small separation.
The separation between repeat pairs for E. coli is shown in Figure 3.4. The
proximal cutoff is 11,500 bases and, roughly 20% of the pairs are proximal. Nearly all
proximal repeat pairs are in the same orientation. The median separation between
repeat pairs with the same orientation is roughly 140,000 bases and just over 700,000
bases for pairs with reverse orientation. This confirms that repeat pairs in the same
orientation occur closer together than would be expected if both types had the same
distribution.
3.3 Extension of Proximal Repeat Pairs
In this section we analyze in more detail the proximal repeat pairs. In partic-
ular, we want to determine if each repeat pair is from two adjacent repeat regions,
i.e., a tandem repeat. We find in many cases the length of the sequence separat-
ing proximal repeat pairs can be decreased substantially by extending the match
28
Figure 3.4: Repeat Separations - E. coli. The cumulative fraction of separations
between repeat pairs in E. coli appears to hit the vertical at 0.20 and 20% of
the repeat pairs are proximal. Since the genome of E. coli is circular, the largest
separation possible between repeat pairs is half the genome length, or approximately
2.65 million letters. The straight line shown with the distribution is the cumulative
distribution of separations expected from a uniform random model.
29
between them as an inexact repeat.
In this study we use the two 40-mers in a proximal repeat pair as seeds and
attempt to extend the match (see Figure 3.5). We present results for extending the
match for repeat pairs with the same orientation only. We sample the collection
of repeat pairs to avoid ?discovering? the same repeat more than once as follows.
Of the 40-mers in a repeat pair, we select the one closest to the beginning of the
chromosome and attempt to extend the alignment. We proceed to the next proximal
repeat pair along the chromosome, ignoring any repeat pairs having a 40-mer in
between the copies of previously considered repeat pairs.
We attempt to extend the sequence (in both directions) around both 40-mers in
the repeat pair by running the NCBI-BLAST algorithm2[57] between the sequences
labeled L and S and then R and S in Figure 3.5. We determine what fraction of
the separation sequence, indicated S in Figure 3.5, matches (part of) the sequence
in L and R.
For C. elegans we sample 2,816 proximal repeat pairs with the same orienta-
tion. We find that for 1,410 cases over 90% of the sequence between the repeat pairs
can be matched by sequence in L or R. In this case we say the repeat pairs could
be joined. Repeat pairs with this property are likely to belong to the same inexact
repeat region that will exhibit some tandem-like structure. A similar fraction of
repeat pairs could be similar joined with this method for D. melanogaster. Using
our sampling method we select 814 repeat pairs. Of these, 411 were joined.
2The program blastall was used to run blastn between two sequences. The minimum word
length was W = 20 and every match identified by blastn was utilized.
30
For A. thaliana the situation is significantly different. Our sampling method
selected 2,022 same oriented proximal repeat pairs; only 540 cases could be joined.
For the case of the human genome, we applied a different sampling process. Be-
cause the proximal cutoff for some human chromosomes was significantly larger than
for the other genomes, we reduced the execution time by eliminating all proximal
repeat pairs having separation larger than 106 bases. We then continued sampling
repeat pairs in the same manner as described above. Of the 20,678 repeat pairs in
our sample 10,395 of them were joined.
This suggests that about 50% (25% for A. thaliana) of the proximal repeat
pairs belong to the same repeat region that seem to have originated from a tandem
duplication event.
3.4 Discussion
In this section we discuss some possible consequences and significance of our
findings. One of our main motivations for studying repetitive DNA is its importance
in genome assembly, and in this section we discuss the implication of our results in
this context.
3.4.1 Mechanisms Creating Proximal and Distant Repeat Pairs
As stated in the introduction, there are many mechanisms that create new
repeat regions in the genome [17,41]. Some, such as slippage [44] and unequal
crossing over [20,76] cause the formation of repetitive regions that are in tandem.
31
Figure 3.5: Extending the Match for a Proximal Repeat Pair. We attempt
to extend the match between proximal 40-mers in both directions to determine if
the repeat pair belongs to the same inexact repeat region. We align the sequences
labeled R and L with the sequence between the repeat pair, labeled S. For the cases
we sampled from the various genomes we find roughly half the time over 90% of the
separation S has a high quality match in either L or R.
32
As time passes, identical tandem repeats become less similar due to the accumulation
of mutations. Other mechanisms, such as the mechanisms of transposable elements
and chromosome/genome duplication, create copies that can be separated by large
distances. Wecan broadly classify these mechanisms to beeitherproximal, creating
nearby copies, or distant creating copies with little bias regarding location on the
chromosome.
Our studies suggest that for different organisms, distant and proximal mecha-
nisms for creating intra-chromosomal repeats occur with different relative frequency.
In all cases, we find the local mechanisms occur more than the global mechanisms.
Between different species, the same ratio between these mechanisms is not preserved.
The proximal fraction, that is fraction of repeat pairs that are proximal, is much
greater for D. melanogaster than our other genomes (see Table 3.1), while human
has by far the smallest proximal fraction.
Within individual genomes, theproximal fraction between chromosomes varies.
In the case of human and C. elegans the X chromosome had the greatest proxi-
mal fraction over all other chromosomes in these genomes. The proximal fractions
are 88.85% for chromosome X of human and 85.50% for C. elegans. However, for
D. melanogaster the situation is reversed. With the exception of the smallest chro-
mosome, which is chromosome arm 4, the X chromosome had the smallest proximal
fraction 71.98%.
33
3.4.2 Selecting Regions with Count Exactly Two
Notice that a repeat pair might in fact belong to a repeat with a count higher
than two due to minor differences in other copies. For example, the two 40-mers
in the repeat pair may be part of a low-fidelity repeat of count three where one of
the copies has a mutation causing some 40-mers to have count two. We remedy this
situation by looking at the counts of 20-mers within the 40-mers in the repeat pair.
We require that each 20-mer in the 40-mer occur only twice in the genome.
We find that requiring all 20-mers within the repeat pair to have count exactly
two eliminates roughly half of all repeat pairs. This suggests that the eliminated
repeat pairs are part of an inexact repeat in the genome with count greater than
two. Repeat pairs that are proximal are preferentially eliminated by this process.
3.4.3 Implication for Genome Assembly: BAC Assembly
One method that assembly groups have undertaken to combat the presence
of repetitive DNA in genome assembly is to first create a tiling of the genome by
Bacterial Artificial Chromosomes (BACs) which are regions of about 150,000 bases
[9]. Then each of these BACs is sequenced according to the shotgun procedure
described in Chapter 1. While this reduces the problems caused by repetitive DNA
occurring on different chromosomes or with separations greater than the length of
a BAC, the problem of ?proximal? repetitive DNA still exists. Further study of
problems caused by repetitive DNA reveal that repeat regions with a nearby copy
are especially problematic for genome assembly [66,72]. As we have found in our
34
analysis of repeat pairs in many cases the majority of repetitive DNA with two
copies in the genome occurs nearby and in the same chromosome. Thus, simply
switching to a BAC by BAC shotgun procedure will not provide a solution for the
problem of repetitive DNA.
35
Chapter 4
Evolution by Random Segmental Duplication Accurately Predicts
Repeat String Distribution
The DNA sequence, or genome, of an organism is one or more sequences of
four nucleic acids (or bases) represented by letters from the alphabet {A,C,G,T}.
The length of the genome is several million letters for typical bacteria and about
three billion for a mammalian genome. Because DNA sequences come from a small
alphabet we expect to see subsequences that occur more than once. If a sequence of
DNA is longer than 4k +k for some positive integer k, then the sequence must have
at least one repeated k-letter subsequence or k-mer. However, in practice there
are many sequences in a genome that have highly similar copies more often than
would be predicted by a random sequence of {A,C,G,T}. Such repetitive regions
compose a major part of the DNA sequence of many organisms [3, 7, 19, 79, 80].
Although there have been numerous papers describing algorithms for finding
repeats, such as [6, 36, 45, 64, 83], the concept of a repetitive region is typically left
imprecise. This is in part because the term includes both exact and inexact, but
similar, copies. In the latter case distinct copies of repeat regions have differences
between them. After accumulation of many differences, two copies of a repeat region
may not appear to be related at all. Identifying all copies of a repeated region may
involve finding regions of varying similarity and thus depends on the methods used
36
to define similarity.
Another difficulty is the complicated structure of repeat regions. Repetitive
sequences can overlap by arbitrary amounts and rearrangements in copies of a repeat
region further complicate their identification [64]. In these situations each part of
a subsequence of the genome will have a matching subsequence elsewhere in the
genome; however, the entire subsequence will not have a complete match anywhere.
For these reasons and others, the notion of length distribution of repeat regions is
not defined in general.
To identify and study the length distribution of repetitive DNA in an unam-
biguous fashion, we develop a formulation we call a repeat string. We say a k-mer is
a repetitive k-mer, in a specified genome, if it (or its reverse complement) appears
at least twice in the genome. We say a subsequence is repetitive if, for a fixed k,
each k-mer within it is repetitive. For simplicity in this dissertation we fix k = 40
because in a random sequence of {A,C,G,T} the length of a mammalian genome
we do not expect any repetitive 40-mers (see Appendix). However, the results we
present are valid over a range of sufficiently large values of k.
We define a sequence, S, to be a repeat string if it is repetitive and not
contained in a longer repetitive sequence. Notice that for k ? 40, each k-mer with
an exact duplicate in the genome must lie in a repeat string.
In this chapter we find an approximate power law describes the distribution
of lengths of repeat strings in the human genome and the genomes of C. elegans,
A. thaliana, D. melanogaster and S. cerevisiae. Because the same distribution oc-
curs for a variety of organisms the processes responsible the emergence of these
37
distributions should be quite general. We aim at determining general mechanisms
that can create such power laws.
In this chapter, we develop a model of the genome evolution employing point
mutation and segmental duplications and deletions. We show that under quite
general conditions the stationary distribution of lengths of repeat strings for our
model is a Pareto Distribution, a close relative of the power law [59]. The consistency
of our model results with the repeat string length distributions we find in several
genomes suggest the distributions we observe could have emerged through evolution
by random segmental duplications and deletions.
4.1 Distribution of Lengths of Repeat Strings
We find an approximate power law relationship between the length, L, of a
repeat string and the number, N(L), of repeat strings of that length. Figure 4.1
shows the cumulative number, C(L) = N(L) + N(L + 1) +L(N + 2)..., of repeat
strings with length at least L for the human genome and the genomes of C. elegans,
A. thaliana, D. melanogaster and S. cerevisiae. For all genomes, the distribution
in the log-log plot of Figure 4.1 can be well approximated by a line segment over
a significant range of the repeat string lengths. This linear fit corresponds to a
distribution of lengths having the form C(L) ? aLb, where a > 0 and b < 0.
Values of the exponent, b, of C(L) were determined by a least squares fit and
given in Table 4.1. While the different distributions are qualitative similar for the
various organisms, namely a roughly power law distribution, there are significant
38
quantitative differences.
The exponents for the genomes we study seem to fall into two classes. The
exponents for human and C. elegans are similar to one another, approximately ?1.7.
Similarly, the exponents for D. melanogaster and S. cerevisiae are roughly ?1. The
distribution for A. thaliana is similar to that of C. elegans, but the log-log graph
for A. thaliana is significantly less straight than C. elegans. The exponent ?1.36
for A. thaliana is most representative of the range 1,000 ? L ? 10,000.
Another difference in the distributions is the range approximated by a power
law. For human, C. elegans, A. thaliana nearly the entire range of the distribution in
Figure 4.1 is well approximated by a line segment. This illustrates that the longest
repeat string in a genome does not seem to grow linearly with the genome length.
For S. cerevisiae the distribution decays quite rapidly after L = 5,000, but
the genome is significantly shorter than the others. The length of the genome of
D. melanogaster is similar to C. elegans and A. thaliana; however, the distribution
decays more rapidly than a power law beyond L = 5,000. With the exception of
D. melanogaster, the range approximated by a power law increases with the length
of the genome.
4.2 Models of Genome Evolution Through Random Segmental Du-
plications and Deletions
Because thedistribution of repeatstrings isqualitatively similar for thegenomes
we study, we expect that there are common mechanisms influencing the distribu-
39
Table 4.1: Repeat String Distribution Slope. The distribution of repeat string
length L for the organisms we study roughly follows a power law. The slope of
the logarithm of the cumulative distribution versus log(L) was calculated, using a
least squares fit, over a range determined appropriate for each organism. We find
a diversity of slopes, but the distributions appear to be split into two categories.
Distributions for the human genome, C. elegans have slopes fairly close together.
The slope of the cumulative distribution for D. melanogaster and S. cerevisiae are
similar and significantly more shallow that the slope for the other three genomes.
With the exception of D. melanogaster, the range of the power law is correlated
with genome length.
Genome Length Slope Range of Power Law
(in millions) Approximation
human 3,000 -1.70 40 ? L ? 30,000
A. thaliana 119 -1.36 40 ? L ? 10,000
D. melanogaster 120 -0.99 40 ? L ? 5,000
C. elegans 100 -1.69 40 ? L ? 10,000
S. cerevisiae 12 -0.99 40 ? L ? 6,000
40
Figure 4.1: Distribution of Lengths of Repeat Strings. We plot the cumulative
distribution of lengths of repeat strings for a variety of organisms. In each case we
find this distribution to be approximated by a power law for a part of the range of
L. Along with the data, we plot a line segment illustrating the approximate range
and slope of the power law for C. elegans. For D. melanogaster and S. cerevisiae
the range of the power law is shorter than for human, C. elegans and A. thaliana.
Additionally, the exponent of the power law, corresponding to the slope on this
log-log plot, for D. melanogaster and S. cerevisiae is shallower than for the other
organisms.
41
tion. Since a genome represents only a snap-shot in evolution it is not possible to
observe the processes responsible for creating repetitive DNA from the sequence
data alone. We view the genome as the current state of a dynamical system describ-
ing evolutionary mechanisms. Our goal is, by observing the current distribution,
to determine evolutionary mechanisms that could have lead to the emergence of
behavior we observe.
Distinct types of repetitive DNA have different modes of duplication [1, 2, 21,
24, 41]. As described in [10, 34, 44, 76] there are many mechanisms for the growth
of existing repeat regions and the creation of new repeat regions and repeat strings.
Our goal is not to model specific types of repetitive DNA such as microsatellites
and transposable elements; the dynamics of these sequences have been modeled
extensively (see [30, 44, 61, 70, 71, 73]). Our aim is to understand more broadly the
evolution of the genome. We develop a simplified model of segmental duplications
and deletions that occur in two copies.
4.2.1 Fixed Length Model
In thisinitial model, we maintain afixed length for segmental duplications/deletions
in the genome. Additionally, we fix equal rates of segmental duplication and dele-
tion to ensure that the length of the genome stays roughly the same. Our model has
three parameters: W ? the genome length; S ? the segmental duplication/deletion
length; and M ? a constant that relates the point mutation rate to the segmental
mutations rate.
42
Our model begins with a random genome of length W where each letter in the
sequence is generated independently and is equally likely to be an A, C, G or T.
Then, for W on the scale of the lengths of the genomes we study, we do not expect
any repetitive 40-mers, and consequently repeat strings (see Appendix).
At each step in the model we induce a mutation, either a segmental mutation
(which is a segmental duplication or deletion) or a point mutation. With probability
1/(M + 2) we create a segmental duplication: a random subsequence of length
S is chosen from the genome and inserted at a random location. A segmental
deletion occurs with the same probability, 1/(M + 2). We choose a subsequence of
length S from the genome at random and delete it. With the remaining probability,
M/(M + 2), we introduce a point mutation by changing the letter at a randomly
chosen location in the genome.
We implemented our model as a computer simulation and ran experiments
over many parameter values. After a specified amount of time, typically several
million iterations, the simulation was halted and the repeat strings in the resulting
genome were determined. We provide the results of some simulations in Figure
4.2 along with the stationary distribution for the model, which we will derive in
the next section. Figure 4.2 indicates that the distribution from the simulation is
qualitatively similar to the distributions observed for the genomes we study. That
is, the distributions produced by our simulation can be roughly approximated by a
line on a log-log plot.
Notice that nearly all repeat strings have length less than the copy length S.
The exception is a repeat string introduced by copying a segment of the genome
43
Figure 4.2: Simulation of Fixed Length Genome Evolution Model. We fix
parameters W = 108, S = 104 and show results for M = 100 and M = 1,000.
The simulation for M = 100 was carried out for 1 million steps and for M = 1,000
for 10 million steps, so that we made approximately 10,000 segmental duplications
in each simulation. We overlay the shifted length distribution with the stationary
distribution, a Pareto distribution, predicted by our analysis. Notice that over a
significant range the computer simulations agree well with our theoretical findings,
especially for M = 1000.
44
with both ends in existing repeat strings. However, repeat strings of this form occur
relatively infrequently and make up no more than 1% of the total number of repeat
strings produced by the model as long as S/(M + 2) is not too large.
4.2.2 Stationary Distribution of Fixed Length Model
We can explicitly analyze the stationary distribution of an approximation to
the Fixed Length Model by making additional assumptions. Using a continuous
approximation we let x ? [0,S] represent the possible lengths of repeat strings
and denote by f(x) the density of repeat strings of length x at a given time. As
discussed above we may create a repeat string longer than S but since these will
be very uncommon we assume, for simplicity, that repeat strings have a maximum
length of S.
We further assume that all repeat strings contain only repetitive 40-mers with
count two. In doing so we are able to model a point mutation in an existing repeat
string as splitting it into two repeat strings, the sum of whose length is equal to the
length of the original repeat strings (here we ignore the 1 base that is lost due to
the point mutation). If a repeat string had a 40-mer with count greater than two,
this assumption would be violated.
We analyze the distribution of repeat strings produced by this process. Since
our model begins with no repetitive sequences, all repeats are introduced by seg-
mental duplications. Changes in the lengths of previously duplicated segments are
caused by: (1) deleting (or partially deleting) an existing repeat string; (2) intro-
45
ducing a point mutation to an existing repeat string; (3) inserting a segmental du-
plication into an existing repeat string. Thus to analyze the distribution of lengths
of repeat strings we consider the probability a repeat string is deleted or is mutated
during a step of our simulation.
Let Lf be the expected density after one step of the model. We want to
identify the stationary distribution, that is, the limit as n goes to infinity of Lnf.
We begin by identifying the explicit form of Lf. We claim that Lf is an affine
transformation, whose linear part we call Hf:
(Lf)(x) = (Hf)(x)+ 1M + 2?(x?S) (4.1)
where
(Hf)(x) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy. (4.2)
We will derive equations (4.1) and (4.2) by considering the expected action
of each of our mutations. We first consider the expected action of introducing a
segmental duplication to the distribution of repeat strings. Each time we introduce
a duplication we create a repeat string of length exactly S. Because the length
of repeat strings are continuous in the range [0,S] this corresponds to changing f
by adding a delta function centered at S.1. The other operation that is part of a
segmental duplication is inserting the repeat string to a randomly chosen position
in the genome. If the location of the insertion were in an existing repeat string of
length x, this would split the repeat string into two separate repeat strings whose
1The Dirac delta function is denoted ?(x). This ?function? is defined as having unit integral,
integraltext+?
?? ?(x)dx = 1 but ?(x) = 0 when x negationslash= 0.
46
total lengths sum to x. Since the insertion point is chosen random from the W
positions in the genome, a position in a repeat string of length x will be chosen with
probability xW . After an insertion the fraction of repeat strings with length x that
will still have length x is (1? xW).
If a repeat string has length y > x there are exactly two insertion points that
would create a repeat string of length x. Similarly the probability of choosing the
insertion point in a string of length y is simply y/W for each repeat string of length
y, which is f(y). Combining these ideas, the expected action on of one segmental
duplication is given by the operator C:
(Cf)(x) =
parenleftBig
1? xW
parenrightBig
f(x) +
integraldisplay S
x
2
Wf(y)dy +?(x?S). (4.3)
A segmental deletion selects at random a position in the genome and deletes
the S bases following the position. An existing repeat string may be entirely or
partially deleted by this process. For each repeat string of length x the probability
that the length of the repeat string is changed by the segmental deletion is, S+xW .
The fraction of repeat strings with length x that are not partially deleted by the
deletion operation is (1? S+xW ).
Similarly to the duplication process a repeat string of length x can be created
if the deletion point is in a repeat string of length y > x. For each repeat string there
are two choices for deletion positions that would create a repeat string of length x.
Thus, the expected action on f of a segmental deletion of length S is given by the
operator D:
(Df)(x) =
parenleftbigg
1? x+SW
parenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy. (4.4)
47
Finally, we consider the action of a single point mutation. A point mutation
will change the repeat structure only if it lands in a repeat string. As described
above, the repeat string will split into two parts. Similar to the insertion mutation,
the fraction of repeat strings of length x that will survive a point mutation is (1? xW).
Additionally, as in our two previous operations, a mutation occurring in a repeat
string of length y > x can create a repeat string of length x if the mutation chosen to
be in one of two possible positions. The expected action of a single point mutation
is given by E:
(Ef)(x) =
parenleftBig
1? xW
parenrightBig
f(x) +
integraldisplay S
x
2
Wf(y)dy. (4.5)
Notice that that the operators C and E differ by only the term from creating
a repeat string of length S. This is because the effect of an insertion on existing
repeat strings is essentially equivalent to that of a point mutation.
The expected action on f overall is given by a linear combination of these
operators:
(Lf)(x) = 1M + 2 ((Cf)(x) + (Df)(x)) + MM + 2(Ef)(x).
This can be simplified to the form given in equations (4.1) and (4.2):
(Lf)(x) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy +
1
M + 2?(x?S).
Let Hf be the first two terms in the above expression for Lf. Observe that
Hf is linear in f:
(Hf)(x) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy.
48
We claim that Hf is a contraction in the norm
bardblfbardbl? :=
integraldisplay S
0
|xf(x)|dx. (4.6)
Indeed, a simple calculation (see Appendix A.2) shows that H is a contraction:
bardbl(H(f))bardbl? =
parenleftbigg
1? SW(M + 2)
parenrightbigg
bardblfbardbl?
Because the linear part of the operator L is a contraction, the entire operator
is a contraction and must have a unique attracting fixed point g. By inspection we
determine this fixed point, the stationary distribution, to be
g(x) = ?
parenleftbigg
x+ SM + 2
parenrightbigg?3
+??(x?S). (4.7)
? and ? are constants that depend on W,M and S.
(See Appendix A.3 for a derivation of constants ? and ?.)
Observe that g is a global attractor, and thus the stationary distribution does
not depend on the initial distribution of repeat intervals. That is, our assump-
tion that there were no initial repeat intervals can be removed and not impact the
stationary distribution.
The stationary distribution, given in equation (4.7), is a Pareto distribution (a
shifted power-law) plus a multiple of a delta function. We refer to SM+2 as the shift.
The delta function component of g(x) is an artifact of all segmental duplications
created with the same length S in this model. The power ?3 in the density function
corresponds to a power ?2 in the cumulative distribution function. Notice that
some of the slopes in Figure 4.1 are roughly ?1.7, which is near our theoretical
49
value of ?2. In a later section we discuss a refinement of the notion of repeat string
that more closely matches the types of repeat strings our model represents.
Notice in our analysis we assumed that S is the maximum length of a repeat
string. As discussed previously, there may be repeat strings longer than S created.
We find that when approximating our theoretical distribution of repeat strings from
our model with the actual distribution of repeat strings produced by a computer
simulation that the two distributions are quite similar (see Figure 4.2).
Recall we also assumed the count of a repeat string was two. We note that
the results from our simulation agree better with our theoretical distribution when
M = 1,000 that for M = 100 (see Figure 4.2). We find that 40.3% of repetitive
40-mers had count greater than two when M = 100 as opposed to only 13.9% when
M = 1,000. Larger values of M correspond to lower 40-mer counts; this suggests
that our stationary distribution is more accurate for larger values of M.
Notice that the length of two different repeat strings will change if the location
of an insertion, deletion or point mutation occurs within a repeat string. However,
we only explicitly describe the impact of these mutations on one repeat string.
Changing Lf to describe changes in the length of both repeat strings multiplies some
of the terms by a factor of two, but does not change the stationary distribution.
4.2.3 Variable Length Model
Our Fixed Length Model has several assumptions that we loosen to analyze
how general the convergence to a Pareto distribution is. We first change the length
50
distribution of segmental duplications and deletions. We now model the length of
segmental mutations as coming from a probability distribution with mean S. We
find that the distribution of repeat strings remains approximated by the same Pareto
distribution as found for the constant length model.
We present results in Figure 4.3 for W = 108, M = 100 and the segmental
mutations as an exponential, normal and uniform distribution with mean S = 104.
Note we plot the length distribution with the same shift as predicted by our analysis
of the Fixed Length Model. We find that the stationary distribution derived in the
previous section is consistent with the repeat string distribution from the Variable
Length Model.
4.2.4 Growing Genome Model
Another restriction of our Fixed Length Model was that, on average, the length
of the genome remained constant. Certainly the length of the genome is not a fixed
value. There are frequent insertions and even whole genome duplications especially
in plants [27,40]. In this model we allow the length of the genome to vary, but so
the length of the genome does not go to zero we have a preference for duplications
over deletions.
We developed a simulation, based on our Fixed and Variable Length Mod-
els, which allows the length of a genome to increase. The model has the same
parameters; W ? the initial genome length; S ? the average length of a segmental
duplication or deletion and M ? a constant that relates the point mutation rate to
51
Figure 4.3: Simulation of Variable Length Genome Evolution Model. We
plot the length distribution from runs of our Variable Length Model where we have
fixed W = 108 and M = 100. We show the result from three runs where different
probability distributions, all having mean S = 104, were used as the length distribu-
tion for segmental duplications and deletions. We use the exponential distribution
with mean S, uniform distribution on [0,2?104] and normal distribution with mean
and standard deviation S. In the case that the length determined by the normal
distribution is negative the value is ignored and a new length selected from the nor-
mal distribution. In this manner, the actual mean of the distribution is somewhat
larger than S. In all three sample cases shown, the distribution of repeat strings
converges to a distribution consistent with the stationary distribution derived for
modeling the length of segmental mutations as constant.
52
the segmental mutations rate. We perform a point mutation with the same proba-
bility M/(M +2), but the probabilities of segmental duplications and deletions are
no longer symmetric. With probability 1.1/(M + 2) we perform a segmental dupli-
cation and probability 0.9/(M+2) we perform a segmental deletion. The lengths of
the segmental duplications and deletions are selected from a probability distribution
with mean S. Instead of halting the simulation after a specified number of steps we
halt when the length of the genome has reached a specified value.
We show the output of two simulations in Figure 4.4. Note that we plot the
length distributions with the same shift as for the Fixed and Variable Models. For
these simulations the lengths of segmental duplication and deletions were selected
from a uniform distribution on [0,2?104].
We note that in Figure 4.4 the distributions of repeat string lengths is char-
acterized by a relatively linear region (on a log-log scale) and then a decaying
tail. The decay of the tail is more consistent with the decay in the distribution
of D. melanogaster (see Figure 4.1).
The distinguishing featureofthe distribution produced by theGrowing Genome
Model from the Fixed and Variable Length Models is the slope in the linear region
of the distribution. The slope produced by the Growing Genome model is signifi-
cantly shallower. For the two simulations shown in Figure 4.4 the slope of the linear
regions are roughly ?1.5 and ?1.8 when determined using a least-squares fit over
the intervals [1040,106] and [300,7000] respectively. Notice that these slopes are
closer to the values for the slope determined for D. melanogaster and S. cereviaise
than produced by the other two models. Because the genome of S. cerevisiae has
53
undergone genome duplication [40], the dynamics of the Growing Genome Model
are perhaps more consistent with its evolutionary history than the dynamics of the
other two models we developed.
4.3 Unitary Repeat Strings
We have shown the distributions produced by our Fixed and Variable Models
converge to a Pareto distribution with a constant exponent of ?3, corresponding to
a cumulative slope of ?2. However, as shown from Figure 4.1 and Table 4.1, none
of our organisms have a slope of ?2. In this section, we consider a refinement of the
definition of a ?repeat string? and find that our Fixed and Variable Length Models
model produces results that are more consistent with this refinement for three of
the genomes we study.
Repetitive sequence can be created through tandem duplication. This is
when a duplicate copy is created adjacent to the original sequence. In this case, the
length of a repeat string would increase and a new repeat string may not be created.
The models we developed in the previous section do not include the dynamics of
tandem duplication.
To attempt to remove repeat strings that may have been created through a
tandem duplication we define a (n,k) unitary repeat string. Recall that a repeat
string is a sequence where, for a fixed k, every k-mer in the sequence occurs more
than once in the genome. We now define a (n,k) unitary repeat string as a repeat
string where, for a fixed n ? k, every n-mer within the repeat string occurs only once
54
Figure 4.4: Simulation of Growing Genome Evolutionary Model. We plot
the output for two runs of our Growing Genome Simulation. In each case the
distribution of segmental mutations was determined from a uniform random variable
on [0,2?104]. The model was halted when the genome reached the specified length.
In both cases the distribution can be approximated a power law over a part of the
range. In the case that the genome grew from length 106 to 5?106, the power law
exponent determined by a least square fit was ?1.5 over the range [1040,106]. For
the other simulation, where the genome grew from 2 ? 106 to 107, the power law
exponent ?1.8 over [300,7000].
55
Table 4.2: Unitary Repeat Strings. We select a subset of repeat strings we call
unitary repeat strings. For all genomes we considered, unitary repeat strings were
the majority of repeat strings.
Genome Length % of Non-Unitary Repeat Strings
(in millions)
human genome 3,000 2.0%
A. thaliana 119 2.2%
D. melanogaster 120 12.5%
C. elegans 100 8.0%
S. cerevisiae 12 5.8%
in the repeat string. Since the unitary repeat strings do not contain any duplicate
n-mers, we are able to say with some confidence that they do not represent a tandem
repeat. (The examples in this dissertation are for the distribution of (20,40) unitary
repeat strings.)
Notice that for all genomes the unitary repeat strings are the majority of
repeat strings (see Table 4.2). In all cases at least 87% (and as much as 98%) of the
repeat strings were preserved.
After determining the unitary repeat strings for the various organisms we plot
the resulting distribution of repeat strings (see Figure 4.5). As in Figure 4.1, we
plot the length L of a repeat string against C(L) the number of repeat strings in the
genome with length at least L. We find that the distributions are again consistent
with an approximate power law.
56
Notice from Figure 4.5, most of the longest repeat strings in the genomes we
study are not (20,40) unitary repeat strings. This is consistent with our observations
that the longest repeat strings for the various organisms were tandem repeats. In
particular, the longest repeat string for C. elegans was a near-perfect tandem repeat
of over 200 copies of a 72 letter sequence.
Comparing the distributions in Figure 4.5 to those in Figure 4.1, we note
a number of differences. In some cases, such as A. thaliana, the distribution is
straighter. The slope of the distribution of unitary repeat strings for each genome
differs somewhat from the slope for repeat strings (compare Table 4.1 with Table
4.3). For three organisms, human, C. elegans and A. thaliana, the slope determined
for the distribution of unitary repeat strings is close to the slope of the stationary
distribution of the Fixed and Variable Length Models, ?2. In addition, the tail of
the distribution produced by the Growing Genome Model is more consistent with
the decay of the distribution of D. melanogaster.
The stationary distribution from our Fixed and Variable Length Models pro-
vide a reasonable match to the distribution of unitary repeat strings in C. elegans,
A. thaliana and the human genome. While we have not determined parameters for
our Growing Genome Model that would produce the same slope as the unitary re-
peat string distribution for D. melanogaster and S. cerevisiae we believe that their
distributions are consistent with the results provided by our model.
57
Figure 4.5: Distribution of Unitary Repeat Strings. The distributions of
unitary repeat strings for the organisms we consider are consistent with a power law
distribution and have steeper slopes that for all repeat strings. The unitary repeat
strings are more consistent with the evolutionary dynamics described by our model.
The distributions for the human genome, C. elegans and A. thaliana are consistent
with the slope predicted by our Fixed and Variable Length Models. Since the slopes
of the distributions of D. melanogaster and S. cerevisiae are shallower than ?2, the
Growing Genome Model provides the best characterization of their distributions.
58
Table 4.3: Unitary Repeat String Distribution Slope. The power law slope,
as determined by a least square fit, for the unitary repeat strings more closely
matches the slope predicted by our Fixed and Variable Length Models for three of
our genomes. The distributions of the two, D. melanogaster and S. cerevisiae, have
a slope that is still more shallow than predicted by our Fixed and Variable Length
Models, but that could be consistent with the Growing Genome Model. Notice
that the range approximated by a power law has decreased from the distributions
of repeat strings.
Genome Length Slope Range of Power Law
(in millions) Approximation
human genome 3,000 -2.20 40 ? L ? 8,000
A. thaliana 119 -2.03 40 ? L ? 5,000
D. melanogaster 120 -1.16 40 ? L ? 4,000
C. elegans 100 -1.83 40 ? L ? 5,000
S. cerevisiae 12 -1.03 40 ? L ? 1,000
59
4.4 Discussion
In this section we provide more general interpretations of our results. We
discuss some implications ofour models for genome evolution. We close by discussing
the relation of our findings to the process of genome assembly.
4.4.1 Chromosome Repeat Strings
The genomes of organisms we study are composed of several chromosomes.
Additionally, there are several mechanisms related to the propagation of repeats
that have a preference for creating copies in the same chromosome [1,20,75,76].
To study the repeat structure at the chromosome level, we define a chromosome
repeat string to be a subsequence of a chromosome where, for a fixed k, every k-
mer is repetitive within that chromosome. The collection of genome repeat strings
implicitly contains every chromosome repeat string.
We plot in Figure 4.6 the distribution of chromosome repeat strings for human
chromosomes 3, 9 along with the repeat string distribution for the entire human
genome. Although all three distributions have a power law approximation, the
exponent of the power law for these various distributions is different. This result
suggests that there may be different evolutionary models operating on different
chromosomes.
60
Figure 4.6: Chromosome Repeat Strings. We plot the distribution of repeat
strings as determined for only the chromosome for human chromosomes 3 and 9
along with the entire human genome. We notice that while the distribution of
repeat strings in each case can be approximated by a power law ofer a range of the
distribution the slopes of the distribution are different.
61
4.4.2 Implications of Model
The convergence of our computer simulations and the convergence of our ab-
stracted model strongly support the idea that simple mechanisms of repeat evolution
? point mutation and sequence duplication ? can explain the behavior seen in the
distributions of lengths of repeat strings seen in the genomes we study. In partic-
ular, our findings suggest that the genomes of A. thaliana, C. elegans and human
have had an evolutionary history consistent with our Fixed and Variable Length
Models. While D. melanogaster and S. cerevisiae have a structure of repeat strings
that is more consistent with a evolutionary past characterized greater duplications
than deletions, such as our Growing Genome Model.
However, there were several other statistical properties of repetitive DNA dis-
cussed in this dissertation. A natural question to ask of our evolutionary models is
if the other features of repetitive DNA were demonstrated.
Proximal Repeat Separations. Since there is no preference for inserting
duplications near the original copy in our models, the distribution of repeat pairs
is similar to that of the uniform random model discussed in Chapter 3. A more
sophisticated model would include several methods of repeat creation each of which
could have preferences related to the separation from the original copy.
Counts of 40-mers. The distribution of 40-mer counts in human genome and
the genomes of C. elegans, A. thaliana and D. melanogaster was shown to roughly
follow power law distribution with exponent ?2.5 (see Chapter 2).
The models we have developed fail to duplicate the power law relationship
62
in counts of 40-mers. In the output of the Fixed and Variable Length models, the
distribution 40-mer counts follows a roughly exponential distribution. One of the
reasons is a lack of preference for which sequence is duplicated. When the model
begins all 40-mers in the sequence have count one, thus they are all equally likely
to be duplicated by the first step in the duplication process.
Our Growing Genome model produces a distribution more similar to a power
law. However, the reasons for this are somewhat artificial because the final genome
was produced by a smaller pool of 40-mers than the final genome from the other
models. The Growing Genome model creates greater heterogeneity of counts of
40-mers as opposed to the other two models.
Similar models to our Growing Genome Model were developed in [33] and
[85] others here were developed. For example, in [33] Hsieh and Lee developed a
model for a bacterial genome where the initial genome length of 1000 bases grew
to 1,000,000 bases (the length increased to 1,000 times the original length). Their
model was shown to produce a similar distribution to that found in the word counts
of short words, k ? 10.
4.4.3 Implication for Genome Assembly
As stated in Chapter 1, one of our reasons for investigating repetitive DNA
was the difficulties repetitive DNA poses for genome assembly. The power law
relationship we find between the length L of a repeat string and the number, N(L)
of such strings that exist in a genome suggests a relationship between the read length
63
R and the number of repeat strings that are problematic for assembly. Namely if
C(L) ? Lb, the number of repeat strings that are longer than a read scales like Rb.
If we assume that the slope of the repeat string distribution is b = ?2, then for
every factor of ?2 that the read length is increased, the number of repeat strings
greater than the read length will be halved. This provides a method of comparing
read length to the computational complexity of the assembly problem.
64
Appendix A
Mathematical Formulations
A.1 Choosing a Word Length
If this dissertation we study repetitive DNA by identifying repetitive 40-mers.
As mentioned, the value of 40 is not an essential part of our results and similar
behavior holds for sufficiently long k-mers. That is, k sufficiently large that we do
not expect to find any repeated k-mers by chance in a sequence the length of the
genomes we study. To make more precise what we mean by ?k sufficiently large?
we compute the expected number of pairs of locations in a random genome that are
the starting position of identical k-mers.
Consider a sequence W of length N from the alphabet {A,C,G,T} with the
letter at each position determined independently with equal probability of being
any of the four letters. Select two locations in the genome at random, we consider
the k-mers that begin at those locations. The probability that the letter at both
locations is the same is simply:
1
16 +
1
16 +
1
16 +
1
16 =
1
4
Since the letter at each position is chosen independently, the probability that
the k-mers at both locations are identical is 14k = 4?k. When we also count reverse
complements as matches the probability is doubled, 2(4?k).
65
The number of pairs of locations in W is approximately N22 when N ? k.
Because expectation is linear, the expected number, E, of pairs of locations with
identical k-mers is
E = 4?kN2.
The expected number E is less than 1 when
k > 2log(N)log(4) = log(N)log(2)
When N = 3?109, as in the case of the human genome, we require k > 31.5
to ensure that we do not expect any repeated k-mers in the sequence W. (Of course
k must be integer valued.)
We now investigate how this value changes when the occurrence of each letter
is not equally likely. Assume that A occurs with probability PA and similarly for C,
G and T with the condition that PA = PT, PC = PG and PA + PC + PG + PT = 1.
Then the probability of the letter at two randomly chosen positions in W matching
is given by, P where
P = PA2 +PC2 +PG2 +PT2.
Whatever P is, the preceeding calculation of the expected number of pairs of loca-
tions with identical k-mers becomes,
E = PkN2.
We have E < 1 when k > 2log(N)/log(1/P). The more skewed the base compo-
sition, the greater the minimum value of k such that E < 1 is. For example, with
N = 3?109, PA = PT = 38 and PC = PG = 18 (P = 5/16) we require k > 38.2.
66
Sequence composition varies throughout a genome. Various papers have clas-
sified the varying CG content throughout the genome, such as [58] and [62]. These
findings indicate that the most skewed regions are when the GC content is roughly
30%. Our above calculation with P = 5/16 corresponds to a GC content of 25%.
Selecting k = 40 ensures that, even for regions of low GC content, we expect no
k-mers to have an identical copy in the genome.
Genome assembly (see Chapter 1) often uses seeds of length 20, a choice that
is related to our choice of 40-mers. In genome assembly the length of the seed is
chosen, in part, so that the number of occurrences of a typical k-mer is 1.
Consider a random genome of length N, where PA = PT, PC = PG and
PA +PC +PG +PT = 1. Pick an arbitrary k-mer and let E?(k) denote the expected
number of occurrences of that k-mer in a random genome of length N where N ? k.
From our analysis above,
E?(k) = PkN
and
E(k) = PkN2
.
Now, we select a value for k? so that E?(k?) = 1 and k so that E(k) = 1,
(allowing the fiction that k and k? need not be integers), yields
2k? = k
Thus, when k? = 20, k = 40. This is an additional reason for selecting k = 40
to illustrate our results.
67
A.2 Showing H(f) is a Contraction
In Section 4.2.2 we derived the stationary distribution for our Fixed Length
Model. Here we provide more details of the calculation of the stationary distribution.
In Section 4.2.2 we show that, under appropriate assumptions and approximations,
the expected action of one mutation on the distribution of repeat strings of length
x ? [0,S], which we denote f(x) is given by
(Lf)(x) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy +
1
M + 2?(x?S).
We denote by H the terms in L that are linear in f:
(Hf)(x) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy.
We now show by calculation that H is a contraction in the norm
bardblfbardbl? :=
integraldisplay S
0
|xf(x)|dx
.
By definition we have,
bardbl(H(f))bardbl? =
integraldisplay S
0
|x(Hf)(x)|dx
=
integraldisplay S
0
vextendsinglevextendsingle
vextendsinglevextendsinglex
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x) +
integraldisplay S
x
2
Wf(y)dy
vextendsinglevextendsingle
vextendsinglevextendsingledx
Because f(x) ? 0, x ? 0 and W > parenleftbigS + SM+2parenrightbig for reasonable values of M,S, and
W, we can remove the absolute value signs.
=
integraldisplay S
0
x
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
f(x)dx+
integraldisplay S
0
x
integraldisplay S
x
2
Wf(y)dydx
68
We interchange the bounds of integration using Fubini?s Theorem:
=
integraldisplay S
0
xf(x)dx?
integraldisplay S
0
x2 1Wf(x)dx?
integraldisplay S
0
xS
W(M + 2)f(x)dx+
integraldisplay S
0
integraldisplay y
0
x 2Wf(y)dxdy
= bardblfbardbl? ?
integraldisplay S
0
x2
Wf(x)dx?
S
W(M + 2)bardblfbardbl? +
integraldisplay S
0
y2
Wf(y)dy
=
parenleftbigg
1? SW(M + 2)
parenrightbigg
bardblfbardbl?
Thus, bardbl(H(f))bardbl? =
parenleftBig
1? SW(M+2)
parenrightBig
bardblfbardbl?. Because 0 < SW(M+2) < 1 the operator H
is a contraction.
A.3 Deriving the Stationary Distribution for the General Model
In Section 4.2.2 we state that the stationary distribution from the approxima-
tion to the Fixed Length Model is a Pareto Distribution g(x):
g(x) = ?
parenleftbigg
x+ SM + 2
parenrightbigg?3
+??(x?S).
We now explicitly determine the values of constants A and B. We begin by studying
L(g(x)) = g(x):
L(g(x)) = H(g(x))+ 1M + 2?(x?S) (A.1)
Because H is a linear operator we can write H(g(x)) in terms of each component of
g(x):
H(g(x)) = ?H
parenleftBiggparenleftbigg
x+ SM + 2
parenrightbigg?3parenrightBigg
+?H(?(x?S)) (A.2)
69
We now consider each component separately:
H
parenleftBiggparenleftbigg
x+ SM + 2
parenrightbigg?3parenrightBigg
=
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbiggparenleftbigg
x+ SM + 2
parenrightbigg?3
+
integraldisplay S
x
2
W
parenleftbigg
y + SM + 2
parenrightbigg?3
dy
=
parenleftbigg
x+ SM + 2
parenrightbigg?3
? 1W
parenleftbigg
x+ SM + 2
parenrightbigg?2
? 1W
parenleftbigg
y + SM + 2
parenrightbiggvextendsinglevextendsinglevextendsingle
vextendsinglevextendsingle
S
x
=
parenleftbigg
x+ SM + 2
parenrightbigg?3
? 1W
parenleftbigg
x+ SM + 2
parenrightbigg?2
+ 1W
parenleftbigg
x+ SM + 2
parenrightbigg?2
? 1W
parenleftbigg
S + SM + 2
parenrightbigg?2
Thus we have,
?H
parenleftBiggparenleftbigg
x+ SM + 2
parenrightbigg?3parenrightBigg
= ?
parenleftbigg
x+ SM + 2
parenrightbigg?3
? ?W
parenleftbigg
S + SM + 2
parenrightbigg?2
(A.3)
Similarly for the term involving ? in equation (A.2),
H(?(x?S)) =
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
?(x?S)+
integraldisplay S
x
2
W?(x?S)dy
=
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
?(x?S) + 2W
So we can express,
?H(?(x?S)) = ?
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
?(x?S) + 2?W (A.4)
We combine equations (A.1)-(A.4) to solve for ? and ? so that
?
parenleftbigg
x+ SM + 2
parenrightbigg?3
+??(x?S) = ?
parenleftbigg
x+ SM + 2
parenrightbigg?3
? ?W
parenleftbigg
S + SM + 2
parenrightbigg?2
70
+?
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
?(x?S) + 2?W + 1M + 2?(x?S)
When x < S, ?(x?S) = 0 so the above equation becomes:
?
parenleftbigg
x+ SM + 2
parenrightbigg?3
= ?
parenleftbigg
x+ SM + 2
parenrightbigg?3
? ?W
parenleftbigg
S + SM + 2
parenrightbigg?2
+ 2?W
The terms involving x cancel and we can solve for ? in terms of ?.
? = 2?
parenleftbigg
S + SM + 2
parenrightbigg2
(A.5)
When x = S we can simplify our expression to involve only ?:
??(x?S) = ?
parenleftbigg
1? 1W
parenleftbigg
x+ S(M + 2)
parenrightbiggparenrightbigg
?(x?S) + 1M + 2?(x?S)
Equating the coefficients on both sides yields:
? = W(M + 2)parenleftbigS + S
M+2
parenrightbig (A.6)
.
We can rewrite equation (A.5) for ? as:
? = 2WM + 2
parenleftbigg
S + SM + 2
parenrightbigg
(A.7)
71
BIBLIOGRAPHY
[1] Achaz G,Netter Pand Coissac E. 2001. Study ofintrachromosomal duplications
among the eukaryote genomes. Mol. Biol. Evol. 18(12):2280-2288.
[2] Achaz G, Rocha E, Netter P and Cossiac E. 2002. Origina and fate of repeats
in bacteria. Nucleic Acids Research. 30(13):2987-2994.
[3] Adams M et al. 2000. The genome sequence of Drosophila melanogaster. Sci-
ence. 287:1185-1195.
[4] Anh V, Lau K, Yu Z. 2001. Multifractal characterization of complete genomes.
J. Phys A: Math. Gen. 34:7127-7139.
[5] Aminetzach Y, Macpherson M and Petrov D. 2005. Pesticide resistance
via transposition-mediated adaptive gene truncation in Drosophila. Science.
309:764-767.
[6] Bao, Z., and Eddy, S. Automated De Novo Identification of Repeat Sequence
Families in Sequenced Genomes. Genome Research. 12:1269-1276. (2002).
[7] Bartolome C, Maside X and Charlesworth B. 2002. On the abundance and
distribution of transposable elements in the genome of Drosophila melanogaster.
Mol. Biol. Evol. 19(6):926-937.
[8] Batzoglou S, et al. 2002. ARACHNE: a whole-genome shotgun assembler.
Genome Research. 12(1):177-189.
[9] Batzoglou S, et al. 1999. Sequencing a genome by walking with clone-end se-
quences: A mathematical analysis. Genome Research. 9:1163-1174.
[10] Bell G and Jurka J. 1997. The length distribution of perfect dimmer repeti-
tive DNA is consistent with its evolution by an unbiased single step mutation
process. J. Mol. Evol. 44:414-421.
[11] Benson et al. 2006. GenBank. Nucleic Acids Res. 33:D34-D38.
[12] Berkeley Drosophila Genome Project. http://www.flybase.org.
[13] Blattner F, et al. 1997. The complete genome sequence of Escherichia coli K-12.
Science. 277:1453-1462.
[14] Bonhoeffer S, Herz A, Boerlijst M, Nee S, Nowak M, and May R. 1996. No
Signs of Hidden Language in Noncoding DNA. Phys. Rev. Lett. 76:1977.
72
[15] Borstnik B and Pumpernik D. 2004. Mutational dynamics of short tandem
repeats in human genome. Eurpohysics Letters. 65(2):290-296.
[16] Borstnih B and Pumpernik D. 2005. Evidence on DNA slippage step-length
distribution. Phy Rev E. 71:031913.
[17] Brookfield J. 2005. The ecology of the genome mobile DNA elements and their
hosts. Nature Reviews Genetics. 6:128-136.
[18] Brookfield J and Badge R. 1997. Populations genetics models of transposable
elements. Genetica. 100:281-294.
[19] Celniker S et al. 2002. Finishing a whole-genome shotgun: Release 3 of
the Drosophila melanogaster euchromatic genome sequence. Genome Biology.
3:research0079.1-0079.14.
[20] Charlesworth B, Langley C and Wolfgang S. 1986. The evolution of restricted
recombination and the accumulation of repeated DNA sequences. Genetics.
112:947-962.
[21] Charlesworth, B., Sniegowski, P. and Wolfgang S., The Evolutionary Dynamics
of Repetitive DNA in Eukaryotes. Nature 371:215-220. (1994).
[22] Chaudhuri P and Das S. 2001. Statistical analysis of large DNA sequences using
distribution of DNA words. Current Science. 80(9):1161-1166.
[23] Djian P. 1998. Evolution of simple repeats in DNA and their relation to human
disease. Cell. 94:155-160.
[24] Edler J and Turner B. 1995. Concerted evolution of repetitive DNA sequences
in eukaryotes. Quarterly Review of Biology. 70(3):297-320.
[25] Eichler E. 2001. Segmental duplications: whats missing, misassigned and mis-
assembled and should we care? Genome Research. 11:653-656.
[26] Eickbush T and Furano A. 2002. Fruit flies and humans respond differently to
retrotransposons. Curr Opin in Genetics and Development. 12:669-674.
[27] Gregory T. 2005. The Evolution of the Genome. Pp3 151. Elsevier Academic
Press.
[28] Hao B, Lee H, Zhang S. 2000. Fractals related to long DNA sequences and
complete genomes. Chaos, Solitons and Fractals. 11:825-836.
73
[29] Havlak P, et al. 2001. The Atlas whole-genome assembler. Currents Trends in
Computational Molecular Biology, Montreal, Canada, 2001.
[30] Hedges D, Cordaux R, Xing J, Witherspoon D, Rogers A, Jorde L and Batzer
M. 2005. Modeling the amplification dynamics of the human Alu retrotrans-
posons. PLoS Computational Biology. 1(4):333-340.
[31] Holste D, Grosse I, Beirer S, Schieg P and Herzel H. 2003. Repeats and corre-
lations in human DNA sequences. Phy Rev E. 67:061913.
[32] Hsieh L et al. 2003. Minimal model for genome evolution and growth. PRL.
90(1):018101(4).
[33] Hsieh L and Lee H. 2002. Model for the growth of bacterial genomes. Modern
Physics Letters B. 16(22):821-827.
[34] Hurst G and Werren J. 2001. The role of selfish genetic elements in eukaryotic
evolution. Nature Reviews Genetics. 2:597-606.
[35] Jaffe D, et al. 2003. Whole-genome sequence assembly for mammalian genomes:
arachne 2. Genome Research. 13(1):91-96.
[36] Jurka J. 194. Approaches to identification and analysis of interspersed repetitive
DNA sequences. In Automated DNA Sequencing and Analysis (Adams et al,
eds.) p. 294-298, Academic Press.
[37] Kaminker J et al. 2002. The transposable elements of the Drosophila
melanogaster euchromatin: a genomics perspective. Genome Biology.
3(12):research0084.1-0084.20.
[38] Kazazian H. 2004. Mobile elements: drivers of genome evolution. Science.
303:1626-1632.
[39] Karev G, Wolf Y, Rzhetzky A, Berezovskaya, F and Koonon E. 2002. Birth
and death of protein domains: a simple model of evolution explains power law
behavior. BMC Evolutionary Biology. 2:18.
[40] Kellis M, Birren B and Lander E. 2004. Proof and evolutionary analysis of
ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature.
428:617-624.
74
[41] Kent W, Baertsch R, Hinrichs A, Miller W and Haussler D. 2003. Evolutions
cauldron: duplication, deletion and rearrangement in the mouse and human
genomes. PNAS. 100(20):11484-11489.
[42] Kidwell M and Lisch D. 2001. Perspective: transposable elements, parasitic
DNA, and genome evolution. Evolution. 55(1):1-24.
[43] Koonin E, Wolf Y and Karev G. 2002. The structure of the protein universe
and genome evolution. Nature. 420:218-223.
[44] Kruglyak S, Durrett R, Schut M and Aquadro C. 1998. Equilibrium distribu-
tions of microsatellite repeat length resulting from a balance between slippage
events and point mutations. PNAS. 95:10774-10778.
[45] Kurtz S, Choudhuri J, Ohlebusch E, Schleiermacher C, Stoye J and Gierich
R. 2001. REPuter: the manifold applications of repeat analysis on a genomic
scale. Nuc. Acids Res. 29(22):4633-4642.
[46] Lander E et al. 2001. Initial sequencing and analysis of the human genome.
Nature. 409:860-921.
[47] Liu G, NISC Comparative Sequencing Program, Zhao S, Bailey J, Sahinalp C,
Alkan C, Tuzun E, Green E and Eichler E. 2003. Analysis of primate genomic
variation reveals a repeat-drive expansion of the human genome. Genome Re-
sarch. 13:358-368.
[48] Luscombe N et al. 2002. The dominance of the population by a selected few:
power-law behaviour applies to a wide variety of genomic properties.Genome
Biology 3(8):research0040.1-0040.7.
[49] Majewski J and Ott J. 2000. GT repeats are associated with recombination on
human chromosome 22. Genome Research. 10:1108-1114.
[50] Makalowski W. 2000. Genomic scrap yard: how genomes utilize all that junk.
Gene. 259:61-67.
[51] Mantegna R et al. 1994. Linguistic features of noncoding DNA sequences. Phys-
ical Review Letters. 73(23):3169-3172.
[52] Martindale C and Konopka A. 1996. Oligonucleotide frequencies in DNA follow
a Yule distribution. Computers and Chemistry. 20(1):35-38.
75
[53] McNeil J, Smith K, Hall L and Lawrence J. 2006. Word frequency analysis
reveals enrichment of dinucleotide repeats on the human X chromosome and
[GATA]n in the X escape region. Genome Research. 16:477-484.
[54] Mullikin J and Ning Z. 2003. The phusion assembler. Genome Research.
13(1):81-90.
[55] Myers G. 1999. Whole-genome DNA sequencing. IEEE: Computational Bi-
ology. 33-42.
[56] Myers E et al. A whole-genome assembly of Drosophila. 2000. Science.
287:2196-2204.
[57] NCBI-BLAST ftp://ftp.ncbi.nlm.nih.gov/blast/executables/
[58] Nekrutenko A and Li W. 2000. Assessment of compositional heterogeneity
within and between eukaryotic genomes. Genome Research. 10:1986-1995.
[59] Newman M. 2005. Power laws, Pareto distributions and Zipfs law. Contempo-
rary Physics. 46(5):323-351.
[60] Nikolaou C and Almirantia Y. 2005. ?Word? preference in the genomic text and
genome evolution: different modes of n-tuplet usage in coding and noncoding
sequences. J. Mole. Evol. 61:23-35.
[61] Ohta T and Kimura M. 1981. Some calculations on the amount of selfish DNA.
PNAS. 78(2):1129-1132.
[62] Oliver J, Bernaola-Galvan P, Carpena P and Roman-Roldan R. 2001. Isochore
chromosome maps of the eukaryotic genomes. Gene. 276:47-56.
[63] Pevzner P, Thang Haixu and Waterman M. 2001. An Eulerian path approach
to DNA fragment assembly. PNAS. 98(17):9748=9753.
[64] Pevzner P, Tang H and Tesler G. 2004. De novo repeat classification and frag-
ment assembly. Genome Research. 14:1786-1796.
[65] Pop M, Salzberg S and Shumway M. 2002. Genome sequence assembly: algo-
rithms and issues. IEEE Computer. 35(7):47-54.
[66] Pop M. 2004. Shotgun Sequence Assembly. Advances in Computers.60:193-245.
76
[67] Price A, Eskin E, and Pevzner P. 2004. Whole-genome analysis of ALU repeat
elements reveals complex evolutionary history. Genome Research.14:2245-2253.
[68] Qian J, Luscombe N and Gerstein M. 2001. Protein family and fold occurrence
in genomes: power-law behavior and evolutionary model. J. Mol. Biol. 313:673-
681.
[69] Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A and Platt D.
2006. Short blocks from the noncoding parts of the human genome have in-
stances within nearly all known genes and relate to biological processes. PNAS.
103(17):6605-6610.
[70] Le Rouzic A and Deceliere G. 2005. Models of the population genetics of trans-
posable elements. Genet. Res. Camb. 85:171-181.
[71] Rudd M, Wray G and Willard H. 2006. The evolutionary dynamics of ?-
satellite. Genome Research. 16:88-96.
[72] Salzberg SL and Yorke JA. 2005. Beware of mis-assembled genomes. Bioinfor-
matics. 21(24):4320-4321.
[73] Schlotterer C. 2000. Evolutionary dynamics of microsatellite DNA. Chromo-
soma. 109:365-371.
[74] Shapiro JA. (2002) Genome organization and reorganization in evolution: for-
matting for computation and function. Ann NY Acad Sci. 981:111-134.
[75] Slamovits, C. and Rossi M. Satellite DNA: Agent of Chromosomal Evolution
in Mammals. J. Neotrop. Mammal 9(2):297-308. (2002).
[76] Smith, G. Evolution of Repeated DNA Sequences by Unequal Crossingover.
Science. 191:528-535. (1976).
[77] Stein L et al. 2003. The genome sequence of C. briggsae: a platform for com-
parative genomics. PLoS Biol. 1(2):166-192.
[78] Thomas E, Srebro N, Sebat J, Navlin N, Healey J, Mishra B and Wigler M.
2004. Distribution of short paired duplications in mammalian genomes. PNAS.
101(28):10349-10354.
[79] The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of
the flowering plant Arabidopsis thaliana. Nature. 408:796-815.
77
[80] The C. elegans Sequencing Consortium. 1998. Genome sequence of the nema-
tode C. elegans: a platform for investigating biology. Science. 282:2012-2018.
[81] The chromosome 21 mapping and sequencing consortium. 2000. The DNA se-
quence of human chromosome 21. Nature. 405:311-319.
[82] Venter J et al. 2001. The sequence of the human genome. Nature.
291(5507):1304-1351.
[83] Volfovsky, N., Hass, B., and Salzberg, S. A Clustering Method for Repeat
Analysis in DNA Sequences. Genome Biology 2(8):1-11(2001).
[84] Yu Z, Anh V, Lau K. 2003. Iterated function system and multifractal analysis
of biological sequences. Int. J. of Modern Physics B. 17:4367-4375.
[85] Zhou Y and Mishra B. 2004. Models of genome evolution. Modeling in Molecular
Biology, G. Ciobanu, G. Rozenberg (Eds.), Natural Computing Series, pages
287-304, Lecture Notes in Computer Science.
78