ABSTRACT
Title of dissertation: Investigating the Distribution of CRISPR
Adaptive Immune Systems Among Prokaryotes
Jake L. Weissman
Doctor of Philosophy, 2019
Dissertation directed by: Professor Philip L.F. Johnson, Department of Biology
Professor William F. Fagan, Department of Biology
Just as larger organisms face the constant threat of infection by pathogens,
so too do bacteria and archaea. In response, prokaryotes employ a diverse set of
strategies to simultaneously cope with their viral and physical environments.
Here I explore the ecology and evolution of the CRISPR adaptive immune
system, a powerful form of protection against viruses that is the only known exam-
ple of adaptive immunity in prokaryotes. CRISPR systems are widespread across
diverse bacterial and archaeal lineages, suggesting that CRISPR effectively defends
against viruses in a broad array of environments. Nevertheless, this defense system
is nearly absent in many bacterial groups, and in many environments. I focus on
understanding these patterns in CRISPR incidence and the ecological drivers behind
them.
First, I identify the ecological conditions that favor the adoption of a CRISPR-
based defense strategy. I develop a phylogenetically-conscious machine learning
approach to build a predictive model of CRISPR incidence using data on over 100
phenotypic traits across over 2600 species and discover a strong but
hitherto-unknown negative interaction between CRISPR and aerobicity.
I then consider the multiplicity of CRISPR arrays on a genome, testing whether
or not selection favors redundancy in immunity. I use a comparative genomics
approach, looking across all prokaryotes to demonstrate that on average, organisms
are under selection to maintain more than one CRISPR array. I then explain this
surprising result with a theoretical model demonstrating that a trade-off between
memory span and learning speed could select for paired "long-term memory" and
"short-term memory" CRISPR arrays.
Finally, I provide a theoretical examination of the phenomenon of immune
loss, specically in the context of CRISPR immunity. In doing so, I propose an
additional mechanism to answer the perennial question: How do bacteria and bac-
teriophage coexist stably over long time-spans? I show that the regular loss of
immunity by the bacterial host can produce host-phage coexistence more reliably
than other mechanisms, pairing a general model of immunity with an experimental
and theoretical case study of CRISPR-based immunity.
Investigating the Distribution of CRISPR
Adaptive Immune Systems Among Prokaryotes
by
Jake L. Weissman
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2019
Advisory Committee:
Professor Philip L.F. Johnson, Chair/Advisor
Professor William F. Fagan, Co-Advisor
Professor Stephanie A. Yarwood
Professor Pierre-Emmanuel Jabin
Professor Charles F. Delwiche
© Copyright by
Jake L. Weissman
2019

Preface
This dissertation contains an overview (Chapter 1), three research chapters
in manuscript form (Chapters 2, 3, and 4), and appendices to the chapters which
include all supplemental information (text, tables, and figures) for the publications
on which these chapters are based. A single bibliography is provided at the end for
literature cited throughout the dissertation.
This dissertation is based on the following publications:
Chapter 2: Jake L. Weissman, Rohan M.R. Laljani, William F. Fagan, and Philip
L.F. Johnson. Visualization and prediction of CRISPR incidence in microbial
trait-space to identify drivers of antiviral immune strategy. The ISME Journal,
2019. https://doi.org/10.1038/s41396-019-0411-2
Chapter 3: Jake L. Weissman, William F. Fagan, and Philip L.F. Johnson. Selec-
tive maintenance of multiple CRISPR arrays across prokaryotes. The CRISPR
Journal, 1(6):405-413, 2018. http://doi.org/10.1089/crispr.2018.0034
Chapter 4: Jake L. Weissman, Rayshawn Holmes, Rodolphe Barrangou, Sylvain
Moineau, William F. Fagan, Bruce Levin, and Philip L.F. Johnson. Immune
loss as a driver of coexistence during host-phage coevolution. The ISME Jour-
nal, 12(2):585-597, February 2018. https://doi.org/10.1038/ismej.2017.194
Acknowledgments
Many individuals and organizations contributed to the work before you, and
to my graduate education more generally. First I would like to thank my sources of
funding, whose generosity provided me with flexibility in pursuing my research goals.
I have been supported by a COMBINE Network Science Fellowship (NSF award
DGE-1632976) as well as a GAANN Fellowship (U.S. Department of Education).
I have also been supported by the U.S. Army Research Laboratory and the U.S.
Army Research Oce under Grant W911NF-14-1-0490.
Next, my advisor Philip L.F. Johnson and co-advisor William F. Fagan have
been instrumental in advancing my training as a scientist. They have given me the
freedom to explore and follow independent research paths, as well as the support I
needed to develop successful projects. Any line of research inevitably faces setbacks,
but Philip taught me to even celebrate sideways progress.
I thank my committee, Stephanie A. Yarwood, Pierre-Emmanuel Jabin, and
Charles F. Delwiche, for their guidance and careful critique throughout my PhD.
Special thanks to Stephanie for her course on microbial ecology that helped initiate
my transition into this field. I am also grateful to Michelle Girvan and Daniel
Serrano for their mentorship as part of the COMBINE program.
A number of collaborators have been involved in this work. Specifically,
Rayshawn Holmes, under the supervision of Bruce Levin, generated the experi-
mental data described in Chapter 4. Rodolphe Barrangou and Sylvain Moineau
also contributed sequencing data for that chapter, as well as guidance while writing
the final manuscript. Rohan M.R. Laljani contributed to the analysis of restriction
modification systems in Chapter 2 as an undergraduate working under my supervision.
My lab-mates study everything that I don't, from T-cells to turtles. I thank Sil-
via Alvarez, Nina Attias, Nicole Barbour, Noelle Beckman, Sharon Bewick, Eleanor
Brush, Fabian Casas-Arenas, Jeff Demers, Xianghui Dong, Andy Foss-Grant, Eliezer
Gurarie, Allison Howard, Kumar Mainali, Julie M. Mallon, Shauna Rasband, Phillip
Staniczenko, Anshuman Swain, Wei Xiao, Hao Y. Yiu, and Jenny Zambrano for con-
stantly exposing me to new ideas and for occasionally forcing me to also think about
eukaryotes. Special thanks to Hao for creating the cartoon elements that are the
basis for Fig 1.1, and also for collaborating on outreach efforts [1].
The BEES community has been constantly supportive during my time at the
University of Maryland. Our student group, BEESst, has steadily worked to build
camaraderie among students and create new opportunities for student-faculty in-
teraction. I thank the members of this community at large, and especially those
who have taken on leadership roles and dedicated their time to this community's
improvement.
Table of Contents
Preface
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
1.1 Prokaryotic Antiviral Defense Systems
1.2 CRISPR, What is it?
1.3 The Distribution of CRISPR Among Prokaryotes
1.4 Outline of Dissertation
2 Visualization and prediction of CRISPR incidence in microbial trait-space to identify drivers of antiviral immune strategy
2.1 Abstract
2.2 Introduction
2.3 Methods
2.3.1 Data
2.3.1.1 Trait Data
2.3.1.2 Genomic Data and Immune Systems
2.3.2 Phylogeny
2.3.3 Visualizing CRISPR/RM Incidence
2.3.4 CRISPR/RM Prediction from ProTraits
2.4 Results
2.4.1 Visualizing CRISPR Incidence in Trait Space
2.4.2 Predicting CRISPR Incidence
2.4.3 Predicting CRISPR Type
2.4.4 NHEJ, CRISPR, and Oxygen
2.4.5 Predicting RM Incidence
2.5 Discussion
3 Selective maintenance of multiple CRISPR arrays across prokaryotes
3.1 Abstract
3.2 Introduction
3.3 Methods
3.3.1 Dataset
3.3.2 Test for selection maintaining multiple arrays
3.3.3 CRISPR spacer turnover model
3.4 Results
3.4.1 Having more than one CRISPR array is common
3.4.2 Selection maintains multiple CRISPR arrays
3.4.3 A tradeoff between memory span and acquisition rate could select for multiple arrays in a genome
3.4.4 Selection varies between taxa and system types
3.5 Discussion
3.5.1 Selection maintains multiple CRISPR arrays across prokaryotes
3.5.2 Why have multiple CRISPR-Cas systems?
4 Immune Loss as a Driver of Coexistence During Host-Phage Coevolution
4.1 Abstract
4.2 Introduction
4.3 General Immune Loss Model
4.4 A Case Study: CRISPR-Phage Coevolution
4.4.1 Experiments
4.4.2 CRISPR-phage Coevolutionary Model
4.4.2.1 Stable Host-Dominated Coexistence
4.4.2.2 Transient Coexistence with Low Density Phage
4.5 Discussion
A Supplemental Information For: Visualization and prediction of CRISPR incidence in microbial trait-space to identify drivers of antiviral immune strategy
A.1 Outline of Analyses
A.2 ProTraits Without Genomic Data
A.3 Resampling Genomes
A.4 Additional Models and Phylogenetic Corrections
A.5 CRISPR in the Tara Oceans Data
A.6 Number of CRISPR Arrays
A.7 NHEJ-Oxygen Model
B Supplemental Information For: Selective maintenance of multiple CRISPR arrays across prokaryotes
B.1 Validation of functional / non-functional classification
B.2 Deriving the distribution of number of arrays per genome under a neutral accumulation model
B.3 Instantaneous array loss vs. gradual decay
B.4 Model Analysis
B.5 Confirming selection
B.6 Neo-CRISPR Arrays
B.7 Validation of CRISPRDetect array predictions
B.8 Autoimmunity Constrains Unprimed Spacer Acquisition Rates
B.9 Bet Hedging Against Memory Loss
B.10 No evidence for array specialization
C Supplemental Information For: Immune Loss as a Driver of Coexistence During Host-Phage Coevolution
C.1 Parameter Values
C.2 Alternative Costs of Immunity
C.2.0.1 Simulation Parameters
C.2.0.2 Varying adsorption
C.3 Analysis and Simulation Methods
C.3.0.1 Analysis
C.3.0.2 Simulations
C.4 Experimental Methods
References
List of Tables
2.1 Top 10 variable loadings on the first three principal components of the PCA performed on the microbial traits dataset, shown in Figs 2.1 and A.9
2.2 Predictive ability of models of CRISPR incidence on the Proteobacteria test set
3.1 CRISPR array and cas multiplicity across prokaryotic genomes
4.1 Definitions and oft used values/initial values of variables, functions, and parameters for the general mathematical model
4.2 Sequencing data shows four first-order spacers that persist as a high-frequency cohort over time
4.3 Definitions and oft used values/initial values of variables, functions, and parameters for the simulation model
A.1 Predictors added to each logistic regression model during forward selection
A.2 Phylogenetic logistic regression of CRISPR incidence as predicted by 16S rRNA count on 10 bootstrapped trees
A.3 Phylogenetic regression of number of restriction enzymes as predicted by temperature or oxygen on 10 bootstrapped trees
A.4 Phylogenetic logistic regression of CRISPR incidence as predicted by Ku and oxygen on 10 bootstrapped trees
B.1 Test for selection maintaining multiple arrays applied to different subsets of the RefSeq data
B.2 Species specific values of ?? with bootstrapped 95% CIs
B.3 Definitions of relevant variables and parameters for CRISPR array model
B.4 Definitions of relevant variables and parameters for autoimmunity model
C.1 Equilibrium of general model without coevolution
C.2 Spacer dynamics for long term coevolution experiment (Experiment 1)
List of Figures
1.1 Outline of CRISPR immunity
2.1 Organisms with CRISPR separate from those without in trait space
2.2 Organisms with CRISPR partially cluster in trait space away from those without
2.3 Importance of top ten predictors in the RF model of CRISPR incidence using the ProTraits predictors
2.4 Temperature range and oxygen requirement are strong predictors of CRISPR incidence
2.5 Type II CRISPR systems appear to be more prevalent in host-associated microbes
3.1 Selection maintains more than one CRISPR array on average across prokaryotes
3.2 Immune memory is maximized at intermediate and low spacer acquisition rates, creating a tradeoff with the speed of immune response to novel threats
4.1 Model behavior under variations in the rates of autoimmunity (?) and CRISPR-Cas system loss (?)
4.2 Serial transfer experiments carried out with S. thermophilus and lytic phage 2972. Bacteria are resource-limited rather than phage-limited by day five and phages can either (a) persist at relatively low density in the system on long timescales (greater than 1 month) or (b) collapse relatively quickly
4.3 Distribution of phage extinction times in bacterial-dominated cultures with different possible combinations of coexistence mechanisms
4.4 Distribution of phage extinction times in bacterial-dominated cultures with different rates of PAM back mutation in phages
A.1 Phylogeny with CRISPR incidence visualized
A.2 Pipeline for generating trait and immunity dataset and matching phylogeny
A.3 Phylogeny generated from PhyloSift marker genes
A.4 Repeated t-SNE decomposition of ProTraits data with CRISPR incidence visualized for varied perplexity values
A.5 Flowchart showing the decision-making process that would lead to the various modeling approaches used in Chapter 2
A.6 A conceptual example of the differences between blocked and random folds for cross validation
A.7 Comparison of variable importance across predictive models
A.8 Comparison of variable importance across predictive models (Pearson's correlation)
A.9 Organisms with CRISPR do not separate from those without along the first principal component of trait space
A.10 Trait distributions over t-SNE reduced dataset
A.11 Variable importance scores from sPLS-DA model for top 10 predictors on the 5 components included in model
A.12 Variable importance scores from MINT sPLS-DA model for top 10 predictors on the single component included in model
A.13 Importance of top ten predictors in each of the five forests included in the RF ensemble model
A.15 The link between oxygen requirement and CRISPR incidence is apparent even when sub-setting to only mesophiles
A.16 Importance of top ten predictors in the RF model built excluding the phyletic profile and gene neighborhood information sources from ProTraits
A.17 The incidence of the Ku protein in trait space
A.18 Importance of top ten predictors in the RF model of Ku incidence
A.19 CRISPR and Ku are negatively associated in aerobes but not anaerobes
A.20 The incidence of restriction enzymes in trait space
A.21 Importance of top ten predictors in the RF model of restriction enzyme incidence
A.22 Resampling genomes has little effect on our overall outcome
A.23 Functional profiles for cas genes from Tara Oceans Project with corresponding oxygen metadata
B.1 Dataset restricted to genomes with one or fewer sets of cas genes
B.2 Alternative model results with array length cap agree qualitatively with those of the primary model
B.3 Priming increases the region of memory washout and thus deepens the memory span versus acquisition rate tradeoff
B.4 Signature of multi-array selection in archaeal genomes
B.5 Boxplots of array counts associated with genomes carrying a particular type of cas targeting machinery
B.6 Dataset with subsampled genomes of overrepresented taxa
B.7 Dataset restricted to genomes with one or fewer sets of cas genes
B.8 Similar CRISPRDetect score distributions in non-functional and functional arrays
B.9 Arrays in functional genomes are longer on average than arrays in non-functional genomes
B.10 Short, functional arrays do not drive the result of our non-parametric test for selection
B.11 Short, functional arrays do not drive the result of our parametric test for selection
B.12 Outline of model analysis
B.13 Relationship between the number of cas1 genes and the number of CRISPR arrays in a genome
B.14 Species-specific ??k values mapped onto the SILVA Living Tree 16S rRNA Tree
B.15 In two-array genomes there is a slight positive association in both array score and array length between arrays within the same genome
B.16 Equilibrium host density values from autoimmunity model (B.8) over varying spacer acquisition rates
B.17 Pairwise distance between consensus repeats from arrays within a genome
B.18 Consensus repeat diversity across datasets (CRISPRDetect left vs. CRISPRdb right) in two-array genomes (a,b) and all genomes (c,d)
B.19 Even restricting to arrays with identical consensus repeats, functional genomes are more likely to have multiple CRISPR arrays
C.1 Equilibria with alternative costs of immunity
C.2 Equilibria with each coexistence mechanism in isolation
C.3 Numerical solutions to model at 80 days with realistic initial conditions
C.4 Simulations of perturbed starting conditions (small perturbations)
C.5 Simulations of perturbed starting conditions (intermediate perturbations)
C.6 Simulations of perturbed starting conditions (large perturbations)
C.7 Simulations of perturbed starting conditions (very large perturbations)
C.8 Mean population size with perturbed starting conditions (intermediate perturbations)
C.9 Phase diagram of general model with phage coevolution
C.10 Phase diagram of general model with innate immunity
C.11 Replicate serial transfer experiments
C.12 Mean sequenced order of host over time in serial transfer experiments 1 and 2
C.13 Optimal host order for phages to infect over time
C.14 Effect on simulations of varied phage adsorption rates
C.15 Representative simulation with a floor on the susceptible host population and high autoimmunity
C.16 Transient phage survival at low density. Example of low-level phage persistence due to slow evolutionary dynamics
C.17 Effect of changes in PAM mutation cost (c)
C.18 Phase diagram of general model with phage coevolution
C.19 Equilibrium phage population during coexistence
C.20 Distribution of phage extinction times in bacterial-dominated cultures with an MOI of 10
C.21 Distributions of phage extinction times in bacterial-dominated cultures with various burst sizes
C.22 Representative example of a simulation demonstrating stable coexistence under a loss mechanism
List of Abbreviations
BIC Bayesian Information Criterion
bp Base Pair
BREX Bacteriophage Exclusion
Cas CRISPR-Associated
CRISPR Clustered Regularly Interspaced Short Palindromic Repeats
CRISPRDetect CRISPR Detection Algorithm
crRNA CRISPR-RNA
CV Cross-Validation
DISARM Defense Island System Associated With Restriction-Modification
E-value Expect Value
FTP File Transfer Protocol
HMM Hidden Markov Model
HGT Horizontal Gene Transfer
KEGG Kyoto Encyclopedia of Genes and Genomes
MINT sPLS-DA Multivariate Integrative sPLS-DA
ML Maximum Likelihood
MOI Multiplicity of Infection
MSE Mean Squared Error
NB Negative Binomial
NCBI National Center for Biotechnology Information
NHEJ Non-Homologous End Joining
OG Orthologous Group
PAM Protospacer Adjacent Motif
PCA Principal Component Analysis
PCR Polymerase Chain Reaction
Pfam Protein Family Database
PLS Partial Least Squares Regression
ProTraits Prokaryotic Traits Database
REBASE Restriction Enzyme Database
RefSeq NCBI Reference Sequence Database
RM Restriction Modification
RF Random Forest
rRNA Ribosomal RNA
sPLS-DA Sparse Partial Least Squares Discriminant Analysis
t-SNE t-Distributed Stochastic Neighbor Embedding
Chapter 1: Introduction
1.1 Prokaryotic Antiviral Defense Systems
Viruses of bacteria and archaea severely impact their hosts' population and
evolutionary dynamics [2, 3]. In an ecological context, these viruses lead to the
release of important nutrients back into the environment [4] and may play a role in
maintaining microbial diversity [5, 6, 7]. In an evolutionary context, viruses drive
the evolution of host immune strategy, often leading to iterative co-evolutionary
dynamics [8, 9]. In the microbial world these two contexts are not distinct, with
demographic and genetic changes occurring at similar rates, making any separation
of scales infeasible. This is especially true at the interface of virus-host interactions,
where the set of host defense systems is diverse and fast-evolving [10].
Prokaryotic antiviral defense systems range in complexity from crude membrane
modifications that prevent viral attachment, all the way to CRISPR (clustered
regularly interspaced short palindromic repeats) adaptive immune systems, which
are able to record memories of past infections in order to specifically target those
viruses in the future [10, 11]. In between these extremes lies a great deal of strategic
diversity, including restriction-modification and prokaryotic Argonaute systems
which degrade viral DNA, altruistic abortive-infection systems which kill the host
cell upon infection, and an ever-growing set of recently-discovered, mechanistically
diverse novel defense-systems [12, 13]. This strategic diversity is combinatorial,
where many organisms employ multiple strategies [14], or even multiple seemingly-
redundant versions of the same strategy [15]. Additionally, microbes can rapidly
swap out strategy sets when defense-related genes are lost and gained via horizontal
gene transfer. These transfers occur frequently, making defense genes one of
the most labile classes of genes [16], and often leading to a great deal of immune
diversity even among closely related sets of strains [6, 17].
The apparent complexity of microbial antiviral defenses immediately leads us
to a number of pressing questions: In the face of such incredible functional diversity,
is there any discernible link between an organism's ecology and its immune strategy?
Why would an organism employ more than one defense system, especially in cases
where those systems appear to be redundant? Considering how frequently defense
genes are gained and lost, what are the implications of these evolutionary dynamics
on the ecological dynamics of host and virus communities? These questions form
the core of this dissertation, motivating, respectively, Chapters 2, 3, and 4. As my
focus I take the ecology and evolution of CRISPR immunity, described in detail
below, expanding to antiviral defense systems in general where appropriate.
1.2 CRISPR, What is it?
CRISPR is a powerful form of protection against viruses and other mobile ge-
netic elements that is the only known example of adaptive immunity in prokaryotes.
CRISPR systems can rapidly acquire novel and highly specific immune memory
and then use this memory to degrade viral genetic material [18, 19]. In general,
these systems are composed of two parts:
• The CRISPR array serves as a repository for the immune memories acquired
from viruses. These genomic loci consist of variable numbers of short (~30 bp)
conserved repeat sequences (repeats) interspaced with short variable regions
(~30 bp) called spacers [18, 20], preceded by a leader sequence which is
important for array transcription and the integration of novel spacers [19, 20,
21, 22, 23, 24]. Each spacer is an individual immune memory, corresponding
to a matching target on a viral genome or some other mobile genetic element
(the protospacer; although self-targeting is also possible, e.g., [25, 26, 27]).
In fact, the first hints that CRISPR might be an adaptive immune system were
that many of these spacers matched known viral sequences [18]. Importantly,
CRISPR immune memory is encoded on the host genome, meaning it will be
vertically transferred along a host lineage [28, 29].
• The CRISPR-associated (cas) genes serve as the machinery used to acquire
spacers and subsequently target viruses and are typically located adjacent to
the CRISPR array on the genome [20, 30].
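The repeat-spacer layout of the array described above can be illustrated with a toy example. The sketch below assumes an idealized array with a perfectly conserved repeat and uses made-up, shortened sequences (real repeats and spacers are roughly 30 bp, and real repeats can vary slightly). The function name `extract_spacers` is hypothetical and is not taken from any CRISPR detection tool.

```python
def extract_spacers(array_seq: str, repeat: str) -> list[str]:
    """Return the spacers of a toy CRISPR array in the order they appear.

    Assumes the idealized form repeat-spacer-repeat-...-repeat with an
    exactly conserved repeat (an assumption made for illustration only).
    """
    if not array_seq.startswith(repeat) or not array_seq.endswith(repeat):
        raise ValueError("array must start and end with the repeat")
    # Splitting on the repeat leaves the spacers plus empty flanking strings.
    return [piece for piece in array_seq.split(repeat) if piece]


# Made-up sequences, shortened for readability (not real data).
REPEAT = "GTTTTAGA"
SPACERS = ["ACGTACGTAA", "TTGACCATGC", "CCATGGATTC"]
ARRAY = REPEAT + "".join(spacer + REPEAT for spacer in SPACERS)

recovered = extract_spacers(ARRAY, REPEAT)
```

Each recovered spacer would then be compared against viral sequences to look for a matching protospacer.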
Immunity proceeds in three stages:
1. During acquisition Cas proteins cut out a piece of viral DNA and integrate
this short sequence into the host genome at the leading end of the CRISPR
array as a novel spacer [19, 23]. Spacers are inserted progressively at the
leader end, creating a linear history of infection along the array as the host
encounters novel viral species [24, 31]. All CRISPR systems share the same
core acquisition genes, cas1 and cas2, though the acquisition process may
dier in some details between systems (with some systems using additional
acquisition proteins [30], and some even aacquiring spacers from RNA [32]).
2. Arrays are then transcribed and processed into short CRISPR-RNA (crRNA)
molecules, which associate with the Cas targeting machinery and surveil the
cell for their corresponding protospacer [33, 34, 35, 36, 37].
3. Finally, if the crRNA-Cas complex finds its matching protospacer, the target
is degraded [19, 34, 35].
For a caricature of this process see Fig 1.1. More details on the subtleties of CRISPR
biology will be provided in each chapter as needed, but this general picture should
suce for the time being. For the CRISPR initiate or layperson, I recommend our
review of CRISPR immunity for Frontiers Young Minds, which is targeted towards
young readers (ages 8-12) but should be comprehensible to a general audience [1].
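The three stages of immunity outlined above can be sketched as a small toy model. This is a deliberately simplified illustration, not a description of any real system: it assumes exact string matching between guide and target, ignores PAMs, strandedness, and transcription details, and uses made-up sequences with shortened spacers. The class name `ToyCrisprHost` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCrisprHost:
    """Toy host whose array keeps the newest spacer first (leader end)."""

    spacer_len: int = 8  # shortened; real spacers are roughly 30 bp
    spacers: list[str] = field(default_factory=list)

    def acquire(self, viral_genome: str, start: int) -> str:
        """Stage 1: copy a piece of viral DNA into the leader end of the array."""
        spacer = viral_genome[start:start + self.spacer_len]
        if len(spacer) < self.spacer_len:
            raise ValueError("protospacer runs past the end of the genome")
        self.spacers.insert(0, spacer)
        return spacer

    def crrnas(self) -> list[str]:
        """Stage 2: one guide per spacer (processing itself is not modeled)."""
        return list(self.spacers)

    def is_targeted(self, viral_genome: str) -> bool:
        """Stage 3: the target is degraded if any guide matches it exactly."""
        return any(guide in viral_genome for guide in self.crrnas())


# Made-up viral genomes (not real data).
phage_a = "ATGGCTAGCTAGGATCCGATTACA"
phage_b = "TTCCGGAATTCCGGTTAACCGGAA"

host = ToyCrisprHost()
naive_response = host.is_targeted(phage_a)   # no memory yet
host.acquire(phage_a, start=4)
host.acquire(phage_b, start=0)               # newest spacer sits at index 0
immune_response = host.is_targeted(phage_a)  # memory of phage_a now exists
```

Because new spacers are always inserted at index 0, reading `host.spacers` from the last element to the first gives the order in which the toy host encountered each virus.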
Finally, I must note that the discovery of CRISPR, now over a decade ago
[18, 38], has generated great interest among biologists, who have repurposed the
programmable targeting specificity of CRISPR-associated (Cas) proteins to create
novel genome-editing tools [39, 40]. Here I will focus on the natural distribution and
evolution of these systems, happily avoiding any discussions of applications for the
remainder of this dissertation (fair warning for those who opened this text looking
for genome-editing wisdom: there is none to be found here).
[Figure 1.1 schematic. Labels: Bacterium; Virus; No Infection; Infection; Viral Genome; Host Genome; CRISPR Array (Spacer, Repeat); Cas Acquisition Machinery; Spacer Acquisition; crRNA + Cas Targeting Machinery; Surveillance; Targeting.]
Figure 1.1: Outline of CRISPR immunity. Spacers are acquired from viral genetic
material and then used to guide proteins to degrade those target sequences in the
future. I note that many details are omitted in this simple cartoon, and some
CRISPR systems work somewhat differently [30]. I provide further details on the
more subtle aspects of CRISPR immunity as needed in each individual chapter.
This figure is adapted from Weissman et al. [1].
1.3 The Distribution of CRISPR Among Prokaryotes
For the biologist interested in studying the ecology and evolution of microbial
immune strategy using a comparative framework, CRISPR exists in something of
a sweet spot. Some defense systems are extremely common among microbes, such
as restriction-modification systems, which are nearly ubiquitous [41], while others
are extremely rare, such as the BREX and DISARM systems, which are present in
< 10% of sequenced prokaryotic genomes [42, 43]. CRISPR, on the other hand, is
present in about half of sequenced bacterial genomes (∼40%, though much more
common in archaea; [20, 44, 45, 46, 47, 48, 49]), and because it is frequently hori-
zontally transferred and lost [16, 50, 51], its distribution among species likely cannot
be explained by shared evolutionary history alone. In fact, CRISPR is found across
the prokaryotic tree (Fig A.1), in both bacteria and archaea and even in members
of the Candidate Phyla Radiation, which was thought to largely disfavor this immune
system [52, 53, 54]. Additionally, organisms differ greatly in both the number of
CRISPR systems they encode on their genomes [15, 55] and the number of spacers
included in any given CRISPR array [56, 57, 58], implying that the relative impor-
tance of CRISPR as a primary line of defense against viruses varies greatly between
organisms. Thus we might glean some insight into what factors drive CRISPR's
distribution by comparing the characteristics of taxa that tend to favor or disfavor
CRISPR immunity.
More broadly, CRISPR provides a tractable model for the evolution of mem-
ory where memories are discrete, observable objects (spacers). The heterogeneous
incidence of CRISPR across species suggests that memory is not always adaptive.
In fact, the core questions of my dissertation can be re-framed in the context of the
evolution of memory: Under what environmental conditions can we expect memory
to evolve (Chapter 2)? What processes result in the evolution of short or long-term
memory, together or in isolation (Chapter 3)? What dynamics occur in an antag-
onistic system (i.e., host-virus) when memory is lost (Chapter 4)? Incidentally, I
was first drawn to this system while thinking about spatial memory in the context
of animal movement, realizing that the immediately observable states of CRISPR
memory are easily represented with theoretical treatments, in contrast to the many
abstractions required to model animal memory.
What, then, is known about the distribution of CRISPR systems among
prokaryotes? CRISPR incidence varies consistently along certain environmental
gradients. For example, surveys of public genomic databases show that CRISPR sys-
tems are far more prevalent in thermophiles than in mesophiles [44, 45, 46, 47, 48, 49].
Very recently, a survey of CRISPR immune diversity in the oceans revealed that the
total number of immune memories associated with CRISPR increases along a depth
gradient [59], which agrees with my own observation that the incidence of CRISPR
immunity increases with decreasing metabolic oxygen requirement (see Chapter 2;
[49]).
What mechanisms can explain such environmental trends? Theoretical work
has supported the hypothesis that if the local viral community is very diverse, then
a memory-based system like CRISPR will not be advantageous [47, 48], because an
individual cell is unlikely to encounter the same virus twice. Similarly, the presence
of viral anti-CRISPR proteins will strongly and adversely affect the adaptive advantage
of having this system [60].
It is also possible that having an active CRISPR system is simply too costly
in some environments. CRISPR incurs costs via self-targeting (i.e., autoimmunity;
[25, 27, 61]), expression [62], and lost opportunities for beneficial horizontal gene
transfer (e.g., of antibiotic resistance genes [51]). The frequency of viral infection
can influence the favorability of CRISPR by altering cost structures, as expression
of these systems is often very costly but inducible, unlike membrane modifications,
which are generally constitutive but possibly less costly [63]. Costs will also vary
based on the competitive environment, with recent work showing that CRISPR
is favored in competitive environments that constrain the evolution of cell surface
molecules (making intracellular immunity the only option for antiviral defense; [64]).
Alternatively, I and others have suggested that the abiotic environment may
impact selection for or against CRISPR immunity more directly (via negative in-
terference with certain DNA repair pathways [49] or via constraints on membrane
evolution [65, 66, 67]). Thus we are left with a complex set of potential drivers of
immune strategy, where the abiotic environment, local viral community, and host
community may all play a role. Determining the ecological drivers of microbial
immune strategy, it seems, is no easy task.
1.4 Outline of Dissertation
Just as larger organisms face the constant threat of infection by pathogens,
so too do bacteria and archaea. This work focuses on these interactions, and how
prokaryotes employ a diverse set of strategies to simultaneously cope with their viral
and physical environments.
More specifically, my dissertation explores the ecology and evolution of the
CRISPR adaptive immune system, a powerful form of protection against viruses that
is the only known example of adaptive immunity in prokaryotes [18, 19]. CRISPR
rapidly incorporates novel and highly specific immune memory and then uses this
memory to degrade selfish genetic elements such as bacteriophage and plasmids
[19]. CRISPR systems are widespread across diverse bacterial and archaeal lineages,
suggesting that CRISPR effectively defends against viruses in a broad array
of environments [30, 45, 68]. Nevertheless, this defense system is nearly absent in
many bacterial groups [52], and in many environments [44, 45, 46, 47, 48, 49, 59].
In this dissertation I focus on understanding these patterns in CRISPR incidence
and the ecological drivers behind them.
In my second chapter I identify the ecological conditions that favor the adoption
of a CRISPR-based defense strategy [49]. I develop a phylogenetically conscious
machine learning approach to build a predictive model of CRISPR presence/absence
across over 2600 species using a large microbial trait database. I find evidence for
a strong negative interaction between CRISPR and DNA repair processes in the
cell. It seems that tradeoffs may constrain the evolution of memory in microbes,
which contrasts with other work that implicates the pathogenic environment and
local competition as determinants of the adaptiveness of CRISPR immune memory.
My third chapter focuses on the multiplicity of CRISPR arrays on a genome,
testing whether or not selection favors redundancy in immunity [15]. Around 20% of
sequenced prokaryotic genomes have more than one CRISPR array. While immune
diversity likely reduces the chance of pathogen evolutionary escape, it remains puz-
zling why many prokaryotes also have multiple, seemingly redundant, copies of the
same type of immune system. Adapting population genetic models to build a neu-
tral model of gene content evolution, I demonstrate that on average, prokaryotes are
under selection to maintain more than one CRISPR array. I explain this surprising
result with a theoretical model demonstrating that trade-offs between memory span
and learning speed can favor paired long- and short-term memory arrays.
In my fourth chapter I provide a theoretical examination of the phenomenon of
immune loss, where it has been shown that CRISPR systems can lose functionality
at a high rate [69]. In doing so, I propose an additional mechanism to answer the
perennial question: "How do bacteria and bacteriophage coexist stably over long
time-spans?" In well-mixed host-phage systems we typically expect to see a runaway
evolutionary arms race, ultimately leading to the extinction of one species.
Nevertheless, in many systems, host and pathogen coexist with minimal coevolu-
tion. I show that the regular loss of immunity by the bacterial host could produce
host-phage coexistence more reliably than other tradeoff-based mechanisms, pairing
a general model of immunity with an experimental and theoretical case study of
CRISPR-based immunity.
Chapter 2: Visualization and prediction of CRISPR incidence in mi-
crobial trait-space to identify drivers of antiviral immune
strategy
2.1 Abstract
Bacteria and archaea are locked in a near-constant battle with their viral
pathogens. Despite previous mechanistic characterization of numerous prokaryotic
defense strategies, the underlying ecological drivers of different strategies remain
largely unknown, and predicting which species will take which strategies remains
a challenge. Here, we focus on the CRISPR immune strategy and develop
a phylogenetically-corrected machine learning approach to build a predictive model
of CRISPR incidence using data on over 100 traits across over 2600 species. We
discover a strong but hitherto-unknown negative interaction between CRISPR and
aerobicity, which we hypothesize may result from interference between CRISPR
associated proteins and non-homologous end-joining DNA repair due to oxidative
stress. Our predictive model also quantitatively confirms previous observations of
an association between CRISPR and temperature. Finally, we contrast the environmental
associations of different CRISPR system types (I, II, III) and
restriction-modification systems, all of which act as intracellular immune systems.
2.2 Introduction
In the world of prokaryotes, infection by viruses poses a constant threat to con-
tinued existence (e.g., [70]). In order to evade viral predation, bacteria and archaea
employ a range of defense mechanisms that interfere with one or more stages of the
viral life-cycle. Modifications to the host's cell surface can prevent viral entry in
the first place. Alternatively, if a virus is able to enter the host cell, then intracellular
immune systems, such as the clustered regularly inter-spaced short palindromic
repeat (CRISPR) adaptive immune system or restriction-modification (RM) innate
immune systems, may degrade viral genetic material and thus prevent replication
[11, 17, 18, 19, 38, 71]. Despite our increasingly in-depth understanding of the mech-
anisms behind each of these defenses, we lack a comprehensive understanding of the
factors that cause selection to favor one defense strategy over another.
Here we focus on the CRISPR adaptive immune system, which is a par-
ticularly interesting case study due to its uneven distribution across prokaryotic
taxa and environments. Previous analyses have shown that bacterial thermophiles
and archaea (both mesophilic and thermophilic) frequently have CRISPR systems
(∼90%), whereas less than half of mesophilic bacteria have CRISPR (∼40%;
[44, 45, 46, 47, 48]). Environmental samples have revealed that many uncultured
bacterial lineages have few or no representatives with CRISPR systems, and that
the apparent lack of CRISPR in these lineages may be linked to an obligately sym-
biotic lifestyle and/or a highly reduced genome [52]. Nevertheless, no systematic
exploration of the ecological conditions that favor the evolution and maintenance
of CRISPR immunity has been made. Additionally, though these previous results
appear broadly true [72], no explicit accounting has been made for the potentially
confounding effects of phylogeny in linking CRISPR incidence to particular traits.
What mechanisms might shape the distribution of CRISPR systems across
microbes? Some researchers have emphasized the role of the local viral community,
suggesting that when viral diversity and abundance are high CRISPR will fail,
and thus be selected against [47, 48, 63]. Others have focused on the tradeoff between
constitutively expressed defenses like membrane modification and inducible
defenses such as CRISPR [63]. Yet others have noted that hot, and possibly other
extreme, environments can constrain membrane evolution, necessitating the evolution
of intracellular defenses like CRISPR or RM systems [65, 66, 67]. Many have
observed that since CRISPR prevents horizontal gene transfer, it may be selected
against when such transfers are beneficial (e.g., [51, 73]). More recently it has
been shown that at least one CRISPR-associated (Cas) protein can suppress non-
homologous end-joining (NHEJ) DNA repair, which may lead to selection against
having CRISPR in some taxa [74]. In order to determine the relative importance
of these different mechanisms, we must first identify the habitats and microbial
lifestyles associated with CRISPR immunity.
Here we aim to expand on previous analyses of CRISPR incidence in three
ways: (1) by drastically expanding the number of environmental and lifestyle traits
considered as predictors using the combination of a large prokaryotic trait database
and machine learning approaches, (2) by incorporating appropriate statistical cor-
rections for non-independence among taxa due to shared evolutionary history, which
has not always been done, and (3) by simultaneously looking for patterns in RM systems,
which will help us distinguish environments that specifically favor CRISPR
adaptive immunity from those that favor DNA-degrading intracellular immune
systems in general (RM and CRISPR).
2.3 Methods
2.3.1 Data
For a schematic outlining the entire data compilation process see Fig A.2. For
a list of all visualizations, predictive models, and statistical tests see Appendix A.1.
2.3.1.1 Trait Data
We downloaded the ProTraits microbial traits database [75], which describes
424 traits in 3046 microbial species. These traits include metabolic phenotypes, preferred
habitats, and specific behaviors like motility, among many others. ProTraits
was built using a semi-supervised text-mining approach, drawing from several online
databases and the literature. All traits are binary, with categorical traits split up
into dummy variables (e.g., oxygen requirement listed as "aerobic", "anaerobic", and
"facultative"). For each trait in each species, two confidence scores in the range
[0, 1] are given, corresponding to the confidence of the text mining approach that a
particular species does ($c_+$) or does not ($c_-$) have a particular trait.
We derived a single score (p) that captured the confidences both that a species
does and does not have a particular trait. Assuming we want our score to lie in the
interval [0, 1], such a score should be zero when we are completely confident that a
species does not have a trait, one when we are completely confident that a species
has a trait, and 0.5 when we are completely uncertain whether or not a species has
a trait (i.e., equally confident that it does and does not have the trait). In the
following formula, $c_+/(c_+ + c_-)$ captures the relative confidence that a species does rather
than does not have a trait, which we then scale by the overall maximal confidence
(so that as overall confidence decreases the score shrinks towards 0.5):
$$p = \frac{1}{2} + \left(\frac{c_+}{c_+ + c_-} - \frac{1}{2}\right)\max(c_+, c_-). \qquad (2.1)$$
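The behavior of Eq. 2.1 at its limits can be checked with a short Python sketch. The function name `trait_score` is hypothetical, and the handling of the case where both confidences are zero (returning 0.5) is an assumption added here to avoid division by zero; the text does not state how that case was treated.

```python
def trait_score(c_plus, c_minus):
    """Combine the two confidences into one score in [0, 1], following Eq. 2.1."""
    total = c_plus + c_minus
    if total == 0:
        # Assumption: with no confidence either way, treat the trait as fully uncertain.
        return 0.5
    relative = c_plus / total  # confidence that the trait is present rather than absent
    return 0.5 + (relative - 0.5) * max(c_plus, c_minus)


print(trait_score(1.0, 0.0))  # completely confident the trait is present
print(trait_score(0.0, 1.0))  # completely confident the trait is absent
print(trait_score(0.3, 0.3))  # equally confident either way
print(trait_score(0.2, 0.0))  # weak evidence for presence stays close to 0.5
```

The last call shows the shrinkage described above: even when all of the available confidence points towards presence, a low maximal confidence keeps the score near 0.5.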
Many of the scores are missing for particular species-trait combinations (18%),
indicating situations in which the text mining approach was unable to make a trait
prediction. Our downstream analyses do not tolerate missing data, and so we im-
puted missing values using a random forest approach (R package missForest; [76]).
There is a set of summary traits in the ProTraits dataset that were created de
novo using a machine learning approach, as well as a number of traits describing
the growth substrates a particular species can use. We removed both summary and
substrate traits from the dataset for increased interpretability (post-imputation; 174
traits remaining).
We note that the authors of ProTraits also used genomic data to help them
infer trait scores, though we found that the exclusion of this data does not affect
our overall outcome (Appendix A.2).
2.3.1.2 Genomic Data and Immune Systems
For each species listed in the ProTraits dataset we downloaded a single genome
from NCBI's RefSeq database, with a preference for completely assembled reference
or representative genomes. See Appendix A.3 for a confirmation that our results are
robust to the resampling of genomes. A number of species (333) had no genomes
available in RefSeq, or only had genomes that had been suppressed since submission,
and we discarded these species from the ProTraits dataset.
CRISPR incidence in each genome was determined using CRISPRDetect [77].
Additionally, data on the number of CRISPR arrays found among all available
RefSeq genomes from a species were taken from Weissman et al. ([15]).
We downloaded the REBASE Gold database of experimentally verified RM
proteins and performed blastx searches of our genomes against this database [78, 79].
The distribution of E-values we observed was bimodal, providing a natural cutoff
($E < 10^{-19}$).
To assess the ability of a microbe to perform non-homologous end-joining
(NHEJ) DNA repair we used hmmsearch to search the HMM profile of the Ku protein
implicated in NHEJ against all RefSeq genomes (E-value cutoff of $10^{-2}$/(number of genomes);
Pfam PF02735; [80, 81, 82]). We also used the annotated number of 16S rRNA genes
in each downloaded RefSeq genome as a proxy for growth rate and the annotated
cas3, cas9, and cas10 genes as indicators of system type [83]. Where available as
meta-data from NCBI, we also downloaded the oxygen (1949 records) and temper-
ature requirements (1094 records) for the biosample record associated with each
RefSeq genome. The NCBI trait data was used exclusively for building Fig 2.4 and
the analyses implicating Ku in the CRISPR versus oxygen association.
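The genome-count-scaled E-value cutoff used for the Ku search above can be sketched in Python as follows. The function names, the genome identifiers, the E-values, and the use of a strict inequality are all hypothetical choices for illustration; only the form of the cutoff ($10^{-2}$ divided by the number of genomes searched) comes from the text.

```python
def scaled_evalue_cutoff(n_genomes, base=1e-2):
    """E-value cutoff divided by the number of genomes searched."""
    return base / n_genomes


def genomes_with_hit(best_evalues, cutoff):
    """Return the genomes whose best hit has an E-value below the cutoff."""
    # Assumption: a strict inequality is used at the threshold.
    return sorted(g for g, e in best_evalues.items() if e < cutoff)


# Made-up best-hit E-values for four hypothetical genomes.
best_evalues = {"genome_A": 1e-30, "genome_B": 1e-3, "genome_C": 5e-3, "genome_D": 0.5}
cutoff = scaled_evalue_cutoff(len(best_evalues))
print(cutoff)
print(genomes_with_hit(best_evalues, cutoff))
```

In this toy case `genome_C` would pass an unscaled cutoff of $10^{-2}$ but is rejected once the cutoff is divided by the number of genomes.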
2.3.2 Phylogeny
We used PhyloSift to locate and align a large set of marker genes (738) found
broadly across microbes, generally as a single copy [84, 85]. Of these marker genes,
67 were found in at least 500 of our genomes, and we limited our analysis to just this
set. Additionally, eight genomes had few (< 20) representatives of any marker genes
and were excluded from further analysis. We concatenated the alignments for these
67 marker genes and used FastTree (general-time reversible and CAT options; [86])
to build a phylogeny (Fig A.3). In order to analyze the effect of tree uncertainty on
our phylogenetic regressions, we bootstrapped our dataset using seqboot and built
a new tree from each replicate.
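The two filtering steps described above (retaining widely detected marker genes, then dropping genomes with too few markers) can be sketched in Python. The function name, the toy identifiers, and the small thresholds are hypothetical; the analysis above used thresholds of 500 genomes per marker and 20 markers per genome. Counting a genome's markers only among the retained marker set is an assumption of this sketch, since the text does not specify which set was counted.

```python
def filter_markers_and_genomes(presence, min_genomes_per_marker, min_markers_per_genome):
    """presence maps each genome ID to the set of marker genes detected in it."""
    # Count how many genomes carry each marker gene.
    counts = {}
    for markers in presence.values():
        for marker in markers:
            counts[marker] = counts.get(marker, 0) + 1
    kept_markers = {m for m, n in counts.items() if n >= min_genomes_per_marker}
    # Assumption: genomes are judged on the retained markers only.
    kept_genomes = {}
    for genome, markers in presence.items():
        retained = markers & kept_markers
        if len(retained) >= min_markers_per_genome:
            kept_genomes[genome] = retained
    return kept_markers, kept_genomes


# Toy data with small thresholds so the example is easy to follow.
presence = {
    "g1": {"m1", "m2", "m3"},
    "g2": {"m1", "m2"},
    "g3": {"m1", "m2", "m4"},
    "g4": {"m1"},
}
kept_markers, kept_genomes = filter_markers_and_genomes(presence, 3, 2)
print(sorted(kept_markers))
print(sorted(kept_genomes))
```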
2.3.3 Visualizing CRISPR/RM Incidence
The size of the ProTraits dataset, both in terms of number of species and
number of traits, and the probable complicated interactions between variables ne-
cessitate techniques that can handle complex, large scale data. To visualize the
structure of microbial trait space and the distribution of immune strategies within
that space we made use of two unsupervised machine learning techniques, princi-
pal component analysis (PCA, prcomp() function in R) and t-distributed stochastic
neighbor embedding (t-SNE, perplexity = 50 and 5000 iterations using Rtsne() func-
tion in Rtsne R package, otherwise default parameters, perplexity varied in Fig A.4;
[87, 88]).
PCA is a widely used technique in ecology that allows us to reduce the dimensionality
of a dataset for effective visualization in two-dimensional space. Essentially,
we collapse our trait dataset into two or three composite traits and observe whether
species with a particular immune strategy tend to vary systematically in terms of
where they fall in this trait space. A newer variant of this approach, t-SNE, per-
forms a similar process, but unlike PCA allows for non-linear transformations of
trait space. Therefore, local structure and non-linear interactions between traits in
high dimensional space are preserved by t-SNE but often not captured by PCA [87].
On the other hand, t-SNE axes are less easily interpreted precisely because they
represent non-linear rather than linear combinations of variables.
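To make the idea of a "composite trait" concrete, the following Python sketch extracts the first principal component of a tiny, made-up trait matrix. It uses power iteration on the covariance matrix rather than the prcomp() routine used in our analysis, so it illustrates the concept rather than reproducing the method; the function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Leading principal axis of `rows` (species x traits) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the traits (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumption: the all-ones start vector is not orthogonal to the leading axis.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    # Position of each species along the composite trait.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, explained, scores


# Four hypothetical species scored on three traits; traits 1 and 2 covary strongly.
toy_traits = [
    [0.9, 0.8, 0.5],
    [0.8, 0.9, 0.4],
    [0.1, 0.2, 0.5],
    [0.2, 0.1, 0.6],
]
axis, explained, scores = first_principal_component(toy_traits)
print([round(x, 2) for x in axis])
print(round(explained, 2))
print([round(s, 2) for s in scores])
```

The entries of `axis` play the role of trait weights (loadings), and `scores` places each species along the resulting linear composite trait, which is why PCA axes are easier to interpret than t-SNE axes.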
2.3.4 CRISPR/RM Prediction from ProTraits
In order to predict the distribution of CRISPR and RM systems, we applied a
number of supervised machine learning approaches to our dataset (see Fig A.5 for a
flow-chart describing the logic behind our model choices). In order to obtain accurate
estimates of model performance, we initially set aside a portion of the data as a test
set to be used exclusively in model assessment after all models were constructed
(no fitting to this set). Because of the underlying evolutionary relationships in the
data, we chose a test set that is phylogenetically independent of our training set.
If we were instead to draw a test set at random from the microbial species,
we would risk underestimating our prediction errors due to non-independence of the
training and test sets [89]. We chose the Proteobacteria as a test set because they
are well-represented in the dataset (1139 species), ecologically diverse, and highly
heterogeneous in terms of CRISPR incidence (Fig A.1). The remaining phyla were
used to train our models.
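A clade-based split of this kind can be sketched in a few lines of Python. The function name and the four-species example are hypothetical and serve only to show that the test set is defined by clade membership rather than by random sampling.

```python
def split_by_clade(species_to_phylum, test_phylum):
    """Hold out every species from one phylum; train on all remaining species."""
    train = sorted(s for s, p in species_to_phylum.items() if p != test_phylum)
    test = sorted(s for s, p in species_to_phylum.items() if p == test_phylum)
    return train, test


# A tiny illustrative mapping, not the dataset analyzed here.
species_to_phylum = {
    "Escherichia coli": "Proteobacteria",
    "Vibrio cholerae": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
    "Sulfolobus solfataricus": "Crenarchaeota",
}
train, test = split_by_clade(species_to_phylum, "Proteobacteria")
print(train)
print(test)
```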
First we built a series of linear models to classify species by immune strategy
(CRISPR present or absent) using logistic regression. We had a large number of
predictor variables (100+), which necessitated a model-selection approach in order
to build a reasonably (and optimally) sized model. We used a forward selection
algorithm to select the optimal set of predictors for each model size, with mean
squared error under cross validation (CV) as our optimality criterion. We then
selected model size by comparing BIC among these optimal models (i.e., selecting
the model with the lowest score).
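The final step, choosing among the best models of each size by BIC, can be sketched as follows. The forward-selection step itself is not shown; the log-likelihoods below are made-up numbers standing in for the best model of each size, and counting one parameter per predictor plus an intercept is an assumption of this sketch.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def choose_model_size(loglik_by_size, n_obs):
    """Pick the model size whose best model has the lowest BIC."""
    # Assumption: a model with k predictors has k + 1 parameters (intercept included).
    scores = {k: bic(ll, k + 1, n_obs) for k, ll in loglik_by_size.items()}
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods of the best 1-, 2-, and 3-predictor models.
toy_loglik = {1: -120.0, 2: -100.0, 3: -99.5}
best_size, bic_scores = choose_model_size(toy_loglik, n_obs=200)
print(best_size)
print({k: round(v, 1) for k, v in bic_scores.items()})
```

Here the third predictor improves the fit only slightly, so the penalty term outweighs the gain and the two-predictor model is selected.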
Similar to choosing a test set, care must be taken when performing CV on
phylogenetically-structured data. CV assumes that when the data is partitioned
into folds, each of these folds is independent of the others. If we draw species at
random from a phylogeny, this assumption is violated, since the same hierarchical
tree-structure will underlie each fold. Therefore, it is better to perform blocked
CV than random CV [89], wherein folds are chosen based on divergent groups on
the tree (e.g. phyla). If each group has diverged far enough in the past from the
others, we can consider these folds to be essentially evolutionarily independent in
terms of trait evolution (see Fig A.6 for a conceptual example). Therefore blocked
CV is essentially a non-parametric method (i.e., no explicit evolutionary model)
to account for the non-independence arising from the shared evolutionary history
between species. We use both random and blocked CV to build models. We clus-
tered the data into blocked folds using the pairwise distances between tips on our
tree (partitioning around medoids, pam() function in R package cluster, five folds
so that k = 5; [90, 91]). A key assumption we make here is that our folds can
be taken as independent from one another (i.e., no effect of shared evolutionary
history). Since these clusters correspond roughly to Phylum-level splits, and since
CRISPR and other prokaryotic immune systems are rapidly gained and lost over
evolutionary time [16], we are comfortable making this assumption. We also re-
peated this analysis using phylogenetic logistic regression to more formally correct
for phylogeny (R package phylolm; [92, 93]). Phylogenetic logistic regression is a
more powerful method since it fits an explicit model of trait evolution, although it
relies on the assumption that traits evolve according to the chosen model and can
give misleading results otherwise.
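The difference between blocked and random CV lies only in how species are assigned to folds. The following Python sketch generates leave-one-block-out splits from a list of block labels; the function name and the labels are hypothetical, and the clustering step that produces the blocks (pam() in our analysis) is not reproduced here.

```python
def blocked_folds(block_labels):
    """Yield (held_out_block, train_indices, test_indices), one split per block."""
    for block in sorted(set(block_labels)):
        test = [i for i, label in enumerate(block_labels) if label == block]
        train = [i for i, label in enumerate(block_labels) if label != block]
        yield block, train, test


# Hypothetical block (clade) membership for seven species.
labels = ["A", "A", "B", "C", "B", "C", "C"]
splits = list(blocked_folds(labels))
for block, train, test in splits:
    print(block, train, test)
```

Because every species in a held-out block shares the same label, no clade ever contributes to both the training and the validation portion of a split, which is the property that random fold assignment fails to guarantee.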
Stepwise methods for variable selection, such as those used above (i.e., for-
ward subset selection), are simple, computationally feasible, and easy to implement
and interpret, but perform poorly when variables in the dataset covary with one
another (i.e. multicollinearity; [94, 95]). As it so happens, the trait data used here
exhibit strong multicollinearity (R package mctest; [96, 97]). Therefore, we sought
out methods that deal well with this type of data, specifically partial least squares
regression (PLS; [94]). Briefly, PLS combines features of PCA and linear regression
to find the linear combination of predictors that maximizes the variance of the data
in the space of outcome variables. We use a variant of PLS, sparse partial least
squares discriminant analysis (sPLS-DA), where "sparse" refers to a built-in
variable selection process in the model-fitting algorithm and "discriminant analysis"
refers to the fact that we are focused on a classification problem (i.e., presence vs.
absence of a particular immune strategy; we used tune.splsda() to perform 5-fold cross
validation, repeated 50 times, to select the optimal number of components n to in-
clude and splsda() to perform variable selection and model selection simultaneously
given n as an input; functions in R package mixOmics; [98, 99]).
We also attempt to ameliorate the effects of shared evolutionary history on our
PLS model by using a philosophically similar approach to our blocked CV method
above. Multivariate integrative (MINT) sPLS-DA is a variant of PLS that can ac-
count for systematic variation between groups of data when those groupings are
known (e.g., our phylogenetically-blocked folds from above). It was originally devel-
oped for use in situations where multiple experiments testing the same hypothesis
could show systematic biases from one another. In our case, the history of prokary-
otic evolution is our experiment, and deep branching lineages are our replicates. We
apply MINT sPLS-DA to the data, using the same blocked folds we used for CV
(we used tune.mint.splsda() to perform 5-fold blocked cross validation to select the
optimal number of components n to include and mint.splsda() to perform variable
selection and model selection simultaneously given n as an input; functions in R
package mixOmics; [99, 100]).
While regression provides easily interpretable trait weights and is computationally
efficient, in order to capture higher-order relationships between microbial
traits we needed more powerful methods. Random forests (RF) are an attractive
choice for our aims since they produce a readily-interpretable output and can in-
corporate nonlinear relationships between predictor variables [101]. We built an RF
classifier on our training data from 5000 trees (otherwise default settings in R package
randomForest so that the number of variables tried at each split is the square
root of the total number of predictors; [102]). To prevent fitting to phylogeny, we
took an ensemble approach which was similar in philosophy to our blocked CV and
MINT sPLS-DA approaches above. Using the phylogenetically blocked folds defined
above we fit five individual forests, each leaving out one of the five folds. We
then weighted these forests by their relative predictive ability on the respective fold
excluded during the fitting process (measured as Cohen's κ; [103]). We predicted
using our ensemble of forests by choosing the predicted outcome with the greatest
total weight.
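The weighting and voting steps of this ensemble can be sketched in Python without fitting any forests. The function names and the example predictions are hypothetical; in particular, how negative κ values and the degenerate case of certain chance agreement were handled is not stated above, so the choices below are assumptions.

```python
def cohens_kappa(truth, predicted):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(truth)
    labels = set(truth) | set(predicted)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    expected = sum((truth.count(l) / n) * (predicted.count(l) / n) for l in labels)
    if expected == 1.0:
        return 0.0  # assumption: kappa is undefined here, so give the model no weight
    return (observed - expected) / (1.0 - expected)


def weighted_vote(predictions, weights):
    """Return the label with the greatest total weight across ensemble members."""
    totals = {}
    for label, weight in zip(predictions, weights):
        # Assumption: members with negative kappa contribute zero weight.
        totals[label] = totals.get(label, 0.0) + max(weight, 0.0)
    return max(totals, key=totals.get)


# Kappa of one hypothetical forest on its held-out fold (1 = CRISPR, 0 = no CRISPR).
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))

# Three hypothetical forests vote on one species, weighted by their kappas.
print(weighted_vote(["CRISPR", "no CRISPR", "no CRISPR"], [0.9, 0.3, 0.4]))
```

In the second call the single high-κ forest outweighs the two weaker forests, so the ensemble predicts "CRISPR" even though a simple majority vote would not. Rescaling the weights to sum to one would not change which label wins.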
2.4 Results
Below, we associate specic microbial immune strategies with a diverse list of
microbial traits. The traits span a range of scales including aspects of habitat (e.g.
aquatic), morphology (e.g., coccus), and physiology (e.g., heterotroph) [75].
While this variety of scales poses a modeling challenge to traditional approaches
including linear regression, machine learning algorithms provide an elegant means of
integrating such multi-scale traits in a statistically rigorous predictive framework. In
particular, we apply algorithms that excel at identifying both linear and non-linear
combinations of traits with high predictive ability. For a systematic comparison of
the output of our predictive models, discussed individually below, please see Figs
A.7 and A.8.
2.4.1 Visualizing CRISPR Incidence in Trait Space
We visualized CRISPR incidence in microbial trait space using two unsuper-
vised algorithms to collapse high-dimensional data (174 binary traits assessed in
2679 species; see Methods) into fewer dimensions. Both methods revealed clear
differences between the placement of CRISPR-encoding and CRISPR-lacking organisms
in trait space, despite the fact that no explicit information about CRISPR
was included.
First, principal components analysis (PCA) of the trait data reveals several
previously recognized patterns of microbial lifestyle choice and CRISPR incidence.
The first principal component (17% variance explained) corresponds broadly to an
axis running from host-associated to free-living microbes (Table 2.1), as observed by
others [104, 105]. CRISPR-encoding and CRISPR-lacking microbes are not differentiated
along this axis (Fig A.9). We see CRISPR-encoding and CRISPR-lacking
organisms beginning to separate along the second (10% variance explained) and
third (7% variance explained) principal components (Fig 2.1). The second com-
ponent roughly represents a split between extremophilic species typically living in
low-productivity environments and mesophilic, plant-associated species (Table 2.1).
Optimal growth temperature appears to be an important predictor of CRISPR in-
cidence, as previously noted by others [47, 48]. The third component is not as
easy to interpret, but appears to indicate a spectrum from group living microbes
(e.g., biofilms) to microbes that tend to live as lone, motile cells (Table 2.1). That
CRISPR is possibly favored in group-living microbes is not entirely surprising, considering
the increased risk of viral outbreak at high population density, and that
some species up-regulate CRISPR during biofilm formation [106].
[Figure 2.1 plot: scatter of species along PC2 and PC3, with marginal density panels for each component; legend: CRISPR, No CRISPR.]
Figure 2.1: Organisms with CRISPR separate from those without in trait space.
The second and third components from a PCA of the microbial traits dataset are
shown, where each point is a single species. CRISPR incidence is indicated by color
(green with, orange without), but was not included when constructing the PCA. Notice
the separation of organisms with and without CRISPR along both components.
Marginal densities along each component are shown to facilitate interpretation. See
Fig A.9 for the first component.
PC1                              Weight   PC2                                 Weight   PC3                            Weight
ecosystemcategory_human           -0.16   temperaturerange_mesophilic           0.19   growth_in_groups                -0.24
specificecosystem_sediment         0.16   temperaturerange_thermophilic        -0.19   gram_stain_positive             -0.24
ecosystem_environmental            0.16   oxygenreq_strictanaero               -0.19   cellarrangement_singles          0.21
knownhabitats_host                -0.15   temperaturerange_hyperthermophilic   -0.18   cellarrangement_filaments       -0.20
ecosystemsubtype_intertidalzone    0.15   knownhabitats_hotspring              -0.17   sporulation                     -0.20
ecosystem_hostassociated          -0.15   exosystemtype_rhizoplane              0.17   energysource_chemoorganotroph   -0.19
habitat_hostassociated            -0.15   habitat_specialized                  -0.16   cellarrangement_clusters        -0.18
habitat_freeliving                 0.15   metabolism_methanogen                -0.16   shape_tailed                    -0.18
ecosystemtype_digestivesystem     -0.14   ecosystemcategory_plants              0.15   habitat_terrestrial             -0.18
specificecosystem_fecal            0.14   ecosystemtype_thermalsprings         -0.15   motility                         0.17
Table 2.1: Top 10 variable loadings on the first three principal components of the PCA performed on the microbial traits
dataset, shown in Figs 2.1 and A.9. These three components explain 17%, 10%, and 7% of the total variance, respectively.
Second, we visualized the trait data using t-distributed stochastic neighbor
embedding (t-SNE), which is a nonlinear method that can often detect more subtle
relationships in a dataset (Fig 2.2; [87]). This method reveals a clustering of
CRISPR-encoding microbes in trait space, further emphasizing that microbial
immune strategy is influenced by ecological conditions. Because the axes of t-SNE
plots are not easily interpretable, we mapped the top weighted traits from the PCA
above (Table 2.1) onto the t-SNE reduced data (Fig A.10). Surprisingly, the most
clearly aligned trait with CRISPR incidence is having an obligately anaerobic
metabolism.
[Figure 2.2 image omitted: scatter plot of the two t-SNE dimensions, with marginal
density plots along each axis; legend: CRISPR, No CRISPR.]
Figure 2.2: Organisms with CRISPR partially cluster in trait space away from those
without. The two-dimensional output of t-SNE dimension reduction of the microbial
traits dataset is shown, where each point is a single species (same dataset as in
Fig 2.1). CRISPR incidence is indicated by color (green with, orange without), but
was not included when performing dimension reduction. The axes of t-SNE plots
have no clear interpretation due to the non-linearity of the transformation.
2.4.2 Predicting CRISPR Incidence
The above unsupervised approaches (i.e. uninformed about the outcome variable,
CRISPR) revealed that CRISPR incidence appears to be impacted by other
microbial traits. In order to more formally characterize these patterns, and exploit
them for their predictive ability, we applied several supervised prediction methods
(i.e. trained with information about CRISPR incidence) to the complete trait
dataset.
Unlike traditional statistical techniques focused on assigning p-values to
particular input variables, with our machine learning approach we assessed model
performance in terms of predictive ability. For unbiased error estimates, we chose
an independent test set to withhold during the model fitting process and to be
used only during model assessment. We consider effective prediction of CRISPR
incidence in this independent dataset as support that our model encodes real
information about how different microbial traits influence the ecological
advantages of the CRISPR system. We then examined the structure of these models,
and which variables play an outsize role in their performance, in order to select
candidate traits associated with CRISPR incidence. Importantly, we chose the
Proteobacteria as our test set because they represent a phylogenetically-independent
group from our training set (see Methods).
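The hold-out design described above can be sketched in a few lines. This is a
minimal illustration rather than the pipeline used in this chapter: the record
layout, the field names `phylum` and `has_crispr`, and the toy species are all
hypothetical. It shows only the idea of withholding an entire clade (here the
Proteobacteria) so that no lineage in the test set contributes to model fitting.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: one dict per species with a phylum label and
# a CRISPR presence flag (trait values would be stored alongside these).
Record = Dict[str, object]


def split_by_clade(records: List[Record], held_out: str) -> Tuple[List[Record], List[Record]]:
    """Withhold every species in one phylum as the test set."""
    train = [r for r in records if r["phylum"] != held_out]
    test = [r for r in records if r["phylum"] == held_out]
    return train, test


# Toy records for illustration only; these are not data from this study.
species = [
    {"name": "species_a", "phylum": "Firmicutes", "has_crispr": True},
    {"name": "species_b", "phylum": "Proteobacteria", "has_crispr": False},
    {"name": "species_c", "phylum": "Euryarchaeota", "has_crispr": True},
    {"name": "species_d", "phylum": "Proteobacteria", "has_crispr": True},
    {"name": "species_e", "phylum": "Actinobacteria", "has_crispr": False},
]

train_set, test_set = split_by_clade(species, held_out="Proteobacteria")
train_phyla = {r["phylum"] for r in train_set}
test_phyla = {r["phylum"] for r in test_set}
```

Splitting on clade membership rather than at random is what makes the test set
conservative: closely related species, which tend to share both traits and
immune systems, never appear on both sides of the split.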
All models we implemented showed improved predictive ability over a null
model only accounting for the relative frequency of CRISPR among species
(Cohen's κ > 0; Table 2.2), indicating that there is some ecological signal in
CRISPR incidence, though overall predictive performance was not overwhelming. Of
these models the random forest (RF) model ranked highest, and did reasonably well
(κ = 0.241). The percent incidences of CRISPR in the training (56%) and test
sets (36%) are considerably different, which may have been difficult for these
models to overcome. It is also possible that the Proteobacteria vary systematically
from other phyla in terms of ecology and immune strategy, making them a particularly
difficult (and thus conservative) test set. Nevertheless, the trait data clearly held
some information about CRISPR incidence. We will primarily focus here on the RF
model since it performed best, but see Appendix A.4 for further discussion of the
performance of our other models.
                 Phylogenetic Correction                                        Performance
Model Type       Non-Parametric   Parametric   Model Size                       Accuracy   κ       TPR
Log. Reg.        No               No           18                               66.1%      0.152   0.233
Log. Reg.        Yes              No           9                                67.5%      0.168   0.209
Log. Reg.        No               Yes          10                               67.7%      0.188   0.246
Log. Reg.        Yes              Yes          6                                67.4%      0.160   0.294
sPLS-DA          No               No           [7, 159, 4, 169, 50] (5 comp.)   68.4%      0.190   0.219
MINT sPLS-DA     Yes              No           32 (1 comp.)                     60.5%      0.173   0.538
RF               No               No           -                                68.8%      0.241   0.327
RF Ensemble      Yes              No           -                                68.6%      0.240   0.332
Table 2.2: Predictive ability of models of CRISPR incidence on the Proteobacteria test set. Model size refers to number of
variables chosen overall, or per-component in the case of the partial least squares models. Accuracy is measured as the total
number of correct predictions over the total attempted and κ is Cohen's κ, which corrects for uneven class counts that can
inflate accuracy even if discriminative ability is low. Roughly, κ expresses how much better the model predicts the data than
one that simply knows the frequency of different classes (κ = 0 being no better, κ > 0 indicating improved predictive ability).
The true positive rate (TPR) is the number of correctly identified genomes having CRISPR divided by the total number of
genomes having CRISPR in the test set. The non-parametric correction for phylogeny refers to our phylogenetically blocked
folds, whereas the parametric correction refers to our use of phylogenetic logistic regression [92]. Observe that the RF model
appears to perform best at prediction in general.
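The three performance measures defined in the caption of Table 2.2 can be
computed from a 2x2 confusion matrix as sketched below. The function name and
the counts are hypothetical and chosen only for illustration; they are not the
confusion matrices behind Table 2.2. The second example mimics a test set in
which 36% of genomes have CRISPR and the model always predicts "no CRISPR".

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (accuracy, Cohen's kappa, true positive rate)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, given how often each class is
    # predicted and how often it actually occurs.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)
    tpr = tp / (tp + fn)
    return accuracy, kappa, tpr


# Hypothetical counts: a model with some discriminative ability.
acc_a, kappa_a, tpr_a = classification_metrics(tp=20, fp=10, fn=30, tn=40)

# Hypothetical counts: always predicting the majority class ("no CRISPR").
acc_b, kappa_b, tpr_b = classification_metrics(tp=0, fp=0, fn=36, tn=64)
```

In the second case accuracy is 64% even though the model has learned nothing,
while κ is zero, which is why κ rather than raw accuracy is used to rank models
when class frequencies are uneven.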
While each of our models revealed a distinct set of top predictors of CRISPR
incidence, there was broad agreement overall (Table A.1 and Figs 2.3, A.11, and
A.12). Keywords indicating a thermophilic lifestyle (e.g. thermophilic, hot springs,
hyperthermophilic, thermal springs) appeared across all models as either the most
important or second most important predictor of CRISPR incidence. Keywords
relating to oxygen requirement (e.g. anaerobic, aerobic) also appeared across nearly
all models as top predictors, excluding only the two worst performing models
(Table A.1). In the case of the RF and sPLS-DA models, oxygen requirement was
always one of the top three predictors, and often the top predictor of CRISPR
incidence (Figs 2.3, A.11, A.12, and A.13). Other predictors that frequently appeared
across model types included termite hosts (host_insectstermites), the degradation
of polycyclic aromatic hydrocarbons (PAH; metabolism_pahdegrading), freshwater
habitat (knownhabitats_freshwater), and growth as filaments (shape_filamentous).
In general, the sPLS-DA, MINT sPLS-DA, RF, and RF ensemble models agreed
with each other rather closely. Finally, we built an RF model using only traits
related to temperature range, oxygen requirement, and thermophilic lifestyle (hot
springs, thermal springs, hydrothermal vents). This temperature- and oxygen-only
RF model outperformed all non-RF models (κ = 0.191). These traits alone appear
to hold the majority of information about CRISPR incidence in the dataset.
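One of the importance scores used to rank predictors in the RF model is the
mean decrease in Gini impurity. The sketch below illustrates the quantity for a
single split on a single binary trait, under simplifying assumptions: an RF
averages such decreases over many trees and many splits, and the labels and
traits here are toy values, not data from this study. The function names are
hypothetical.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(labels: Sequence[int], trait: Sequence[int]) -> float:
    """Reduction in impurity from splitting the labels on a binary trait."""
    with_trait = [y for y, x in zip(labels, trait) if x]
    without_trait = [y for y, x in zip(labels, trait) if not x]
    n = len(labels)
    weighted_child_impurity = (
        len(with_trait) / n * gini_impurity(with_trait)
        + len(without_trait) / n * gini_impurity(without_trait)
    )
    return gini_impurity(labels) - weighted_child_impurity


# Toy outcome: 1 = has CRISPR, 0 = lacks CRISPR.
has_crispr = [1, 1, 1, 0, 0, 0, 1, 0]
informative_trait = [1, 1, 1, 0, 0, 0, 1, 0]    # separates the classes perfectly
uninformative_trait = [1, 1, 0, 1, 1, 0, 0, 0]  # each side stays half and half

decrease_informative = gini_decrease(has_crispr, informative_trait)
decrease_uninformative = gini_decrease(has_crispr, uninformative_trait)
```

A trait that cleanly separates organisms with and without CRISPR removes all of
the impurity at that node, whereas a trait unrelated to the outcome removes
none, so larger decreases flag traits worth following up.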
[Figure 2.3 image omitted: two variable importance panels for the RF model.
Predictors are listed in the order plotted, top to bottom.
Mean Decrease in Gini Impurity Index: oxygenreq_strictanaero, host_insectstermites,
knownhabitats_hotspring, temperaturerange_thermophilic, ecosystemtype_thermalsprings,
temperaturerange_mesophilic, temperaturerange_hyperthermophilic,
knownhabitats_freshwater, oxygenreq_strictaero, knownhabitats_hydrothermalvent.
Mean Decrease in Accuracy: oxygenreq_strictanaero, temperaturerange_thermophilic,
knownhabitats_hotspring, host_insectstermites, ecosystemtype_thermalsprings,
temperaturerange_mesophilic, oxygenreq_strictaero, knownhabitats_humanoralcavity,
knownhabitats_freshwater, metabolism_pahdegrading.]
Figure 2.3: Importance of top ten predictors in the RF model of CRISPR incidence
using the ProTraits predictors. The mean decrease in accuracy measures the
reduction in model accuracy when a variable is randomly permuted in the dataset.
The Gini impurity index is a common score used to measure the performance of
decision-tree based models (e.g. RF models). Briefly, when a decision tree is built
the Gini impurity index measures how well separated the different classes of
outcome variable are at the terminal nodes of the tree (i.e., how pure each of the
nodes is). The mean decrease in Gini impurity measures the estimated reduction in
impurity (increase in purity) when a given variable is added to the model. These
importance scores are useful to rank variables as candidates for further study, but in
themselves should not be taken as statistical support or effect sizes similar to those
seen in linear regression. RF models may include non-linear combinations of
variables, and therefore the contribution of any one variable is not as easily
interpreted as with a linear model, a drawback of this approach. See Fig A.14 for all
predictor importances.
As an additional check that these candidate trait-CRISPR associations are real
and not due to some irregularity in our dataset, we downloaded metadata
available from NCBI. We were able to reproduce the result that thermophiles
strongly prefer CRISPR (92% with CRISPR as opposed to 49% in mesophiles, Fig
2.4a; [47, 48]). Though we have too few genomes categorized as psychrotolerant
(35) or psychrophilic (14) to make any strong claims, these genomes seem to lack
CRISPR most of the time, suggesting that CRISPR incidence decreases continuously
as environmental temperatures decrease [46]. We were also able to confirm
that, in agreement with our visualizations and predictive modeling, aerobes disfavor
CRISPR immunity (34% with CRISPR) while anaerobes favor CRISPR immunity
(67% with CRISPR, Fig 2.4b). This is true independent of growth temperature,
with mesophiles showing a similarly strong oxygen-CRISPR link (Fig A.15).
Overall, both oxygen (χ² = 254.04, p < 2.2 × 10⁻¹⁶, categories with < 10 observations
excluded) and temperature (χ² = 98.86, p < 2.2 × 10⁻¹⁶, categories with < 10
observations excluded) had significant effects on incidence (for breakdown see Fig
2.4).
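The χ² statistics reported above come from tests of independence between a
categorical trait and CRISPR incidence. A minimal sketch of how such a statistic
is computed from a contingency table follows. The counts are hypothetical, no
continuity correction is applied (an assumption, since the text does not say
whether one was used), and converting the statistic to a p-value would
additionally require a χ² distribution function from a statistics library.

```python
from typing import List, Tuple


def chi_square_independence(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts. Rows: aerobes, anaerobes.
# Columns: genomes with CRISPR, genomes without CRISPR.
counts = [[30, 70], [60, 40]]
statistic, dof = chi_square_independence(counts)
```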
[Figure 2.4 image omitted: bar charts of the percentage of genomes with CRISPR.
(a) By temperature range, with the number of genomes in each category:
psychrophile (14), psychrotolerant (35), mesophile (911), thermotolerant (13),
thermophile (92), hyperthermophile (23). (b) By oxygen requirement. (c) By oxygen
requirement (aerobe, anaerobe), split by presence of Ku (Ku, No Ku).]
Figure 2.4: Temperature range and oxygen requirement are strong predictors of
CRISPR incidence. Trait data taken from NCBI. (a) Thermophiles strongly favor
CRISPR immunity, while mesophiles appear ambivalent. (b) Anaerobes favor
CRISPR immunity, while aerobes tend to lack CRISPR and facultative species fall
somewhere in between. (c) CRISPR and the Ku protein are negatively associated in
aerobes but not anaerobes. Error bars are 99% binomial confidence intervals
(non-overlapping intervals can be taken as evidence for a statistically significant
difference at the p < 0.01 level). Total number of genomes in each trait category
shown at the bottom of each bar. Categories represented by fewer than 10 genomes
were omitted.
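The error bars in Figure 2.4 are 99% binomial confidence intervals on the
proportion of genomes with CRISPR. The sketch below shows one common way to
compute such an interval, the Wilson score interval. This choice is an
assumption, since the interval method is not stated here, and the counts are
hypothetical rather than taken from the figure.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.99) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical categories: 50 of 100 genomes with CRISPR versus 90 of 100.
low_a, high_a = wilson_interval(50, 100)
low_b, high_b = wilson_interval(90, 100)
intervals_overlap = high_a >= low_b
```

With these toy counts the two intervals do not overlap, which is the visual
criterion the figure caption suggests for a difference between categories.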
Following previous suggestions that CRISPR incidence might be negatively
associated with host population density and growth rate [47, 48, 63], and that this
could be driving the link between CRISPR incidence and optimal temperature range,
we sought to determine if growth rate was a major determinant of CRISPR
incidence. The number of 16S rRNA genes in a genome is an oft-used, if imperfect,
proxy for microbial growth rates and an indicator of copiotrophic lifestyle in general
[107, 108, 109]. While CRISPR-encoding genomes had slightly more 16S genes than
CRISPR-lacking ones (3.1 and 2.9 on average, respectively), the 16S rRNA gene
count in a genome was not a significant predictor of CRISPR incidence (logistic
regression, p = 0.05248). When correcting for phylogeny, however, 16S gene count
does seem to be significantly positively associated with CRISPR incidence
(phylogenetic logistic regression, m = 0.06277, p = 6.651 × 10⁻⁵), though the effect
was not consistent across bootstrapped trees (Table A.2). This is the opposite of
what we would expect if growth rate were driving the CRISPR-temperature
relationship.
As a secondary conrmation of the link between oxygen and CRISPR, we
examined metagenomic data from the Tara Oceans Project [110], and found that
across a large set of ocean metagenome samples CRISPR prevalence was inversely
related to environmental oxygen concentration (Appendix A.5).
We also attempted to predict the number of CRISPR arrays in a genome
given that that genome had at least one array, though this attempt was entirely
unsuccessful (Appendix A.6).
2.4.3 Predicting CRISPR Type
Each CRISPR system type is associated with a signature cas targeting gene
unique to that type (cas3, cas9, and cas10 for type I, II, and III systems respec-
tively). There are many species in the dataset with cas3 (605), but relatively few
with cas9 (160) and cas10 (222), suggesting that the traits correlated with CRISPR
incidence probably correspond primarily to type I systems (the dominance of type I
systems has been noted previously [30]). We mapped the incidence of each of these
genes onto the PCA we constructed earlier (see Fig A.9 and Table 2.1), and found
that cas9 separates from cas3 and cas10 along the first component (Fig 2.5a).
Broadly, this indicates that type II systems are more commonly found in host-
associated than free-living microbes, the opposite of the other two system types.
We built an RF model of cas9 incidence, with the Proteobacteria as the test
set. Because our training set had so few cases of cas9 incidence (10% of set), we
performed stratified sampling during the RF construction process to ensure
representative samples of organisms with and without cas9. Surprisingly, despite the
extremely small number of organisms with cas9 in the training and test sets (160
and 58 respectively), this model was able to accurately predict type II CRISPR
incidence and had some discriminative ability (Accuracy = 93.0%, κ = 0.164), though
it missed many of the positive cases (TPR = 0.172). This model also suggested
that a host-associated lifestyle seems to be a major factor influencing the incidence
of type II systems, with many of the top-ranking variables in terms of importance
corresponding to keywords having to do with the split between host-associated and
free-living organisms (Fig 2.5b).
[Figure 2.5 image omitted. (a) Scatter plot of PC1 (x-axis) against PC2 (y-axis),
with marginal density plots along each axis; legend: cas10, cas3, cas9. (b) Two
variable importance panels for the RF model of cas9 incidence. Predictors are
listed in the order plotted, top to bottom.
Mean Decrease in Gini Impurity Index: cellarrangement_filaments,
knownhabitats_humanoralcavity, habitat_freeliving, ecosystemtype_digestivesystem,
ecosystemcategory_terrestrial, ecosystem_environmental, metabolism_sulfuroxidizer,
ecosystemsubtype_oral, cellarrangement_chains, habitat_hostassociated.
Mean Decrease in Accuracy: ecosystemsubtype_vagina, ecosystemtype_thermalsprings,
ecosystemtype_reproductivesystem, ecosystemtype_geologic,
mammalian_pathogen_oral_cavity, habitat_single, mammalian_pathogen_urogenital,
knownhabitats_humanairways, host_insectstermites,
mammalian_pathogen_oportunisticnosocomial.]
Figure 2.5: Type II CRISPR systems appear to be more prevalent in host-associated
microbes. (a) The cas targeting genes associated with type I, type II, and type III
systems (cas3, cas9, and cas10 respectively) mapped onto the PCA in Fig A.9.
Organisms without any targeting genes were omitted from the plot for readability.
Recall from Table 2.1 that PC1 roughly corresponds to a spectrum running from
host-associated to free-living microbes. (b) A variable importance plot from an RF
model of cas9 incidence. Observe that keywords related to a host-associated lifestyle
appear many times.
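The stratified sampling used when building the RF model of cas9 incidence can be
sketched as drawing each tree's bootstrap sample separately within each class.
The function below is a hypothetical illustration, not the routine used in this
study, and the per-class sample sizes are assumptions chosen for the example.

```python
import random
from typing import Dict, Hashable, List, Sequence


def stratified_bootstrap(
    labels: Sequence[Hashable],
    sizes: Dict[Hashable, int],
    rng: random.Random,
) -> List[int]:
    """Indices of one bootstrap sample, drawn with replacement within each class."""
    by_class: Dict[Hashable, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    sample: List[int] = []
    for label, size in sizes.items():
        sample.extend(rng.choices(by_class[label], k=size))
    return sample


# Toy training set in which 10% of organisms carry cas9.
labels = [True] * 10 + [False] * 90
rng = random.Random(0)
indices = stratified_bootstrap(labels, {True: 10, False: 10}, rng)
n_positive = sum(labels[i] for i in indices)
```

Fixing the number of draws per class guarantees that every tree sees organisms
with cas9, which a plain bootstrap of a rare class does not.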
2.4.4 NHEJ, CRISPR, and Oxygen
Recently, Bernheim et al. [74] demonstrated that the type II-A CRISPR sys-
tem interferes with the NHEJ DNA repair pathway, leading to an inverse relation-
ship between the presence of type II-A systems and the NHEJ pathway in microbial
genomes. We hypothesized that this negative relationship between CRISPR and
NHEJ might be more widespread across system types. We also hypothesized that
this could explain the negative relationship between CRISPR and aerobicity we ob-
serve, since reactive oxygen species produced during aerobic respiration can induce
double-strand breaks, thus selecting for the presence of NHEJ repair in aerobic or-
ganisms [111, 112]. We use the presence of Ku protein as a proxy for the NHEJ
pathway, since this protein is central to the pathway.
There was a clear interaction between the presence of Ku and aerobicity on the
incidence of CRISPR (Fig 2.4c, using aerobicity metadata from NCBI for this and
the analyses below). Using our full set of RefSeq genomes, we found a weak negative
association between CRISPR and Ku incidence overall (Pearson's correlation, ρ =
-0.012; χ² = 15.015, p = 1.067 × 10⁻⁴), but restricting only to aerobes the negative
association between Ku and CRISPR was much stronger (Pearson's correlation, ρ =
-0.250, p = 9.109 × 10⁻¹⁶), whereas in anaerobes it was nonexistent (ρ = -0.023,
p = 0.704). This pattern was consistent when correcting for phylogeny (Appendix
A.7), and was true for both type I and III systems individually, though it was not
significant for type II systems, of which there were fewer in the dataset (Fig A.19).
Similar to our CRISPR analysis, we used PCA and an RF model to find if and
where Ku-possessing organisms clustered in trait space. We found that the NHEJ
pathway clusters strongly in trait space (Fig A.17), and is favored in soil-dwelling,
spore-forming, aerobic microbes, consistent with expectations of where NHEJ will
be most important (Fig A.18; [111, 112]).
2.4.5 Predicting RM Incidence
So far, our analyses have not distinguished if temperature and oxygen predict
whether a microbe has an intracellular immune system that degrades DNA in
general, or whether these traits are specific to CRISPR adaptive immunity. We tested
these two possibilities by building an RF model of restriction enzyme incidence using
the same stratified sampling approach that we used for CRISPR system type. This
model showed decent predictive ability (κ = 0.317). However, the correlation
between variable importance scores for the CRISPR and restriction enzyme RF models
was low (Fig 2.3 vs. Fig A.21; Pearson's correlation, ρ = 0.169 for mean decrease
in Gini Impurity Index, ρ = -0.0487 for mean decrease in accuracy; also Figs A.7
and A.8). This result implies that RM systems have different traits determining
their incidence than do CRISPR systems (also note PCA plot, Fig A.20). When
we directly tested for an association with temperature and oxygen we also found
that the number of restriction enzymes was, unlike CRISPR incidence, negatively
associated with an anaerobic lifestyle (m = -4.53877, p = 2 × 10⁻¹⁶, phylogenetic
linear regression), and only marginally significantly associated with a thermophilic
lifestyle (m = 1.51063, p = 0.03779, phylogenetic linear regression). These results
were consistent across bootstrapped trees (Table A.3).
2.5 Discussion
We detected a clear association between microbial traits and the incidence of
the CRISPR immune system across species. We found that two predictors were
especially important for predicting CRISPR incidence: thermophilicity and aerobicity.
The links between these two traits and CRISPR were confirmed with annotations
from NCBI, and in the case of aerobicity with metagenomic data from the Tara
Oceans Project (Appendix A.5; [110]). The relationship between temperature and
CRISPR is well known [44, 45, 46], but we lend further support here by formally
correcting for shared evolutionary history in our statistical analyses using both
parametric and non-parametric approaches.
Previous theoretical models predict that CRISPR will be selected against in
environments with dense and diverse viral communities [47, 48], since hosts are less
likely to repeatedly encounter the same virus in such environments. These models
in turn predict that in high-density host communities CRISPR will not be adaptive,
since high host density leads to high viral diversity [47, 48], and that this might
explain why potentially slow-growing thermophiles favor CRISPR immunity (as op-
posed to copiotrophic mesophiles). Our results show a marginal positive association
between growth rate and CRISPR incidence, and that group-living microbes seem
to favor CRISPR immunity, calling these prior viral diversity and density based
explanations into question. Additionally, our analysis suggests that psychrophilic
and psychrotolerant species disfavor CRISPR more strongly than mesophiles, which
is not clearly explained or predicted by hypotheses based on host density.
We suspect that another factor could be affecting the degree of viral diversity
that a host encounters, so that viral diversity is high in colder environments and
low in hotter ones. Differences in dispersal limitation among viruses could lead
to lower immigration rates in hot environments, as viral decay rates may be low
at lower temperatures and high at higher temperatures [113], though this is highly
speculative. We note that host dispersal rates are unlikely to affect the viral diversity
seen by a host on average unless most of the host population is dispersing, an
unrealistic expectation.
Surprisingly, we find that oxygen requirement appears to be just as important
a predictor of CRISPR incidence as temperature, and that this pattern is
independent of any effect of temperature. Possibly, this association can be explained
by inhibitory effects of CRISPR on NHEJ DNA repair. Type II-A CRISPR systems
have been shown to directly interfere with the action of the NHEJ DNA repair
pathway in prokaryotes [74]. Reactive oxygen species are produced during aerobic
metabolism and can cause DNA damage [111], making NHEJ potentially
particularly important in aerobes. Thus, if CRISPR interferes with the NHEJ repair
pathway, and this pathway is important in aerobes, we would expect CRISPR
incidence to be inversely related to the presence of oxygen. Our data showed a clear
interaction between aerobicity and the NHEJ machinery in determining CRISPR
incidence that suggests that the link between CRISPR and aerobicity may be
mediated by the presence of the NHEJ pathway (Fig 2.4c). The Cas proteins share many
structural similarities with proteins implicated in DNA repair, and in some cases
prefer to associate with double-strand breaks (DSBs), so it is perhaps unsurprising
that they appear to broadly inhibit the NHEJ pathway, whose proteins may be
competing for substrate [114]. Nevertheless, the evidence supporting this hypothesis
is only preliminary. The negative interaction between CRISPR and Ku should be
experimentally confirmed in type I and type III systems. Additionally, our repair
versus immunity tradeoff hypothesis could be tested using an experimental evolution
setup in which organisms with CRISPR are exposed to DNA damage.
The link that we propose between aerobic metabolism and NHEJ repair is
somewhat tenuous. Reactive oxygen species (ROS) are thought to directly produce
single-strand breaks, which are most often converted to double-strand breaks during
cell growth, the precise time when repair may be possible via homologous
recombination due to the presence of multiple genome copies. That being said, ROS
can lead to double-strand breaks during stationary phase when damage is
spatially clustered on the genome [115, 116], when cells experience specific types
of starvation that lead to vulnerable single-stranded DNA gaps [117, 118], or when
ROS occur in conjunction with other damaging agents including cyanide [119] and
irradiation [120, 121, 122]. Furthermore, while NHEJ certainly will be important
during stationary phase, its relevance during growth is unknown. The pathway
itself does appear to be more prevalent in environments with oxygen (Figs A.17 and
A.18). Nevertheless, we have no ability to assess causality presently, and the strong
interaction between Ku and aerobicity on CRISPR incidence we observed could be
the result of some other, as yet unrevealed driver. For example, NHEJ is thought to
be important for desiccation resistance [123, 124], and many organisms facing this
specific threat are likely to be aerobic.
As an alternative to our NHEJ hypothesis, could patterns in viral diversity
explain the relationship between aerobicity and CRISPR incidence? The viral-decay
hypothesis we proposed to explain the enrichment of thermophiles with CRISPR
does not make sense in this context, since we might expect viruses to decay more
readily in the presence of oxygen rather than under anoxic conditions. It is unclear
to us why the viruses of anaerobes would be more dispersal limited. Nevertheless, if
the viral communities infecting anaerobes were shown to be less diverse than those
infecting aerobes this could also explain the increased incidence of CRISPR among
these organisms.
We found no strong link between the incidence or number of RM systems on a
genome and a thermophilic or anaerobic lifestyle, suggesting that the major drivers
of CRISPR incidence are indeed CRISPR-specific, consistent with our viral-diversity
and NHEJ-inhibition hypotheses.
We were also able to show that CRISPR types vary in terms of the environments
they are found in, with type II systems appearing primarily in host-associated
microbes. This phenomenon could be due in part to phylogenetic biases in the
dataset, but our use of a phylogenetically independent test set lends credence to the
overall trend. We have no clear mechanistic understanding of why cas9-containing
microbes tend to favor a host-associated lifestyle. Nevertheless, this result may have
practical implications for CRISPR genome editing, since it has recently been found
that humans frequently have a preexisting adaptive immune response to variants of
the Cas9 protein [125]. We note that type I and III systems do not appear to have
a strong link to host-associated lifestyles.
While our dataset spanned a broad phylogenetic range (with some notable
exceptions such as the Candidate Phyla Radiation [126]), we had a limited number
of microbial traits, which may have obscured some important CRISPR-trait
associations. With the number of microbial genomes in public databases constantly
expanding, so too should efforts to provide metadata about each of the organisms
represented by those genomes. At least part of the problem lies in the lack of a
universally accepted controlled vocabulary for microbial traits (similar to that
provided by the Gene Ontology Consortium [127]), although some admirable attempts
have been made [128, 129]. Such a vocabulary would both facilitate the construction
of more expansive trait databases and help deal with the issue of comparing traits
that span many different scales.
The ecological drivers of microbial immune strategy are likely as diverse as
the ever-increasing number of known prokaryotic defense systems [13, 42]. The
exploratory, database-centered approach we take here can be complemented by
targeted studies examining shifts in immune strategy across environmental gradients
(e.g., Appendix A.5) to provide a more fine-grained understanding of how microbial
populations adapt to their local pathogenic and abiotic environments. Ultimately,
experimental manipulations will provide the power to fully validate proposed
mechanisms behind ecological patterns in immune strategy.
Chapter 3: Selective maintenance of multiple CRISPR arrays across
prokaryotes
3.1 Abstract
Prokaryotes are under nearly constant attack by viral pathogens. To protect
against this threat of infection, bacteria and archaea have evolved a wide array
of defense mechanisms, singly and in combination. While immune diversity in a
single organism likely reduces the chance of pathogen evolutionary escape, it remains
puzzling why many prokaryotes also have multiple, seemingly redundant, copies of
the same type of immune system. Here, we focus on the highly flexible CRISPR
adaptive immune system, which is present in multiple copies in a surprising 28%
of the prokaryotic genomes in RefSeq. We use a comparative genomics approach
looking across all prokaryotes to demonstrate that, on average, organisms are under
selection to maintain more than one CRISPR array. Given this surprising conclusion,
we consider several hypotheses concerning the source of selection and include a
theoretical analysis of the possibility that a tradeoff between memory span and
learning speed could select for both long-term memory and short-term memory
CRISPR arrays.
3.2 Introduction
Just as larger organisms must cope with the constant threat of infection by
pathogens, so too must bacteria and archaea. To defend themselves in a given
pathogenic environment, prokaryotes may employ a range of different defense
mechanisms, and oftentimes more than one [11, 12, 17]. While having multiple types
of immune systems may decrease the chance of pathogen evolutionary escape [130],
having multiple instances of the same type of system is rather more puzzling. Here
we explore this apparent redundancy in the context of CRISPR-Cas immunity.
The CRISPR-Cas immune system is a powerful defense mechanism against
mobile genetic elements such as viruses and plasmids, and is the only known example
of adaptive immunity in prokaryotes [45, 131]. This system allows prokaryotes to
acquire specific immune memories, called spacers, in the form of short viral genomic
sequences which they store in CRISPR arrays in their own genomes [18, 19, 38].
These sequences are then transcribed and processed into short RNA fragments that
guide CRISPR-associated (Cas) proteins to degrade matching foreign DNA or RNA
[19, 132, 133]. Thus the CRISPR array is the genomic location in which memories
are recorded, while the Cas proteins act as the machinery of the immune system.
CRISPR systems appear to be widespread across diverse bacterial and
archaeal lineages, with previous analyses of genomic databases indicating that ~40%
of bacteria and ~80% of archaea have at least one CRISPR system [53, 68, 134].
These systems vary widely in cas gene content and targeting mechanism, although
the cas1 and cas2 genes involved in spacer acquisition are universally required for
a system to be fully functional [19, 68]. Such prevalence suggests that CRISPR
systems effectively defend against phage in a broad array of environments. The
complete story seems to be more complicated, with recent analyses of
environmental samples revealing that some major bacterial lineages almost completely
lack CRISPR systems and that the distribution of CRISPR systems across prokaryotic
lineages is highly uneven [52]. Other studies suggest that particular environmental
factors can be important in determining whether or not CRISPR immunity is
effective (e.g., in thermophilic environments [47, 48]). While previous work has
focused on the presence or absence of CRISPR across lineages and habitats, little
attention has been paid to the number of systems in a genome.
In fact, the multiplicity of CRISPR systems per individual genome varies
greatly, with many bacteria having multiple CRISPR arrays and some having mul-
tiple sets of cas genes as well (e.g., [56, 58]). CRISPR and other immune systems
are horizontally transferred at a high rate relative to other genes in bacteria [16],
meaning that any apparent redundancy of systems may simply be the result of the
selectively neutral accumulation of systems within a genome. Alternatively, some
microbes may experience selection for multiple sets of cas genes or CRISPR arrays.
We suspected that prokaryotes may be under selection to maintain multiple
CRISPR arrays, given that it is common for organisms across lineages to have mul-
tiple systems (as detailed below) and, in some clades, these appear to be conserved
over evolutionary time (e.g. [135, 136]). Because microbial genomes have a deletion
bias [137, 138], we would expect extraneous systems to be removed over time. Here
we construct a test of neutral CRISPR array accumulation via horizontal transfer
and loss. Using publicly available genome data we show that the number of CRISPR
arrays in a wide range of prokaryotic lineages deviates from this neutral expecta-
tion by approximately two arrays. Thus we conclude that, on average, prokaryotes
are under selection to have multiple CRISPR arrays. We go on to discuss several
hypotheses for why having multiple arrays might be adaptive. Finally, we suggest
that a tradeoff between the rate of acquisition of immune memory and the span of
immune memory could lead to selection for multiple CRISPR arrays.
3.3 Methods
3.3.1 Dataset
All available completely sequenced prokaryotic genomes (all assembly lev-
els, bacteria and archaea) were downloaded from NCBI's non-redundant RefSeq
database FTP site ([139]) on December 23, 2017. Genomes were scanned for the
presence of CRISPR arrays using the CRISPRDetect v2.2 software [77]. We used
default settings except that we did not take the presence of cas genes into account in
the scoring algorithm (to avoid circularity in our arguments), and accordingly used
a quality score cutoff of three, following the recommendations in the CRISPRDetect
documentation. CRISPRDetect also identifies the consensus repeat sequence and
determines the number of repeats for each array. Presence or absence of cas genes
was determined using genome annotations from NCBI's automated genome anno-
tation pipeline for prokaryotic genomes [83]. We discarded genomes that lacked a
CRISPR array in any known members of their species. In this way we only examined
genomes known to be compatible with CRISPR immunity.
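The species-level filter described above can be sketched as follows. This is a minimal illustration, assuming each genome is a record with hypothetical `species` and `n_arrays` fields; these names are ours and are not taken from the original pipeline.

```python
def keep_crispr_compatible(genomes):
    """Drop genomes from species in which no member carries a CRISPR array.

    `genomes` is a list of dicts with the hypothetical keys 'species' (str)
    and 'n_arrays' (int). The field names are illustrative assumptions.
    """
    # Species for which at least one genome has one or more detected arrays.
    species_with_array = {g["species"] for g in genomes if g["n_arrays"] > 0}
    # Keep every genome of those species, including members with zero arrays.
    return [g for g in genomes if g["species"] in species_with_array]


toy = [
    {"species": "A", "n_arrays": 0},
    {"species": "A", "n_arrays": 2},
    {"species": "B", "n_arrays": 0},
]
kept = keep_crispr_compatible(toy)
```

Note that genomes with zero arrays are retained whenever a conspecific genome has an array, so the 0-array category remains in the array count distributions.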
3.3.2 Test for selection maintaining multiple arrays
We detect selection by comparing non-functional (i.e., neutrally-evolving) and
functional (i.e., potentially-selected) CRISPR arrays. Since all known CRISPR
systems require the presence of cas1 and cas2 genes in order to acquire new spacers,
we use the presence of both genes on a genome as a marker for functionality of
arrays on that genome and the absence of one or both genes as a marker for non-
functionality (validated in Appendix B.1). This differentiation allows us to consider
the probability distributions of the number of CRISPR arrays i in non-functional
($N_i$) and functional ($F_i$) genomes, respectively.
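As a rough sketch of this categorization, the snippet below splits per-genome array counts by the presence of both cas1 and cas2. The record fields `cas_genes` and `n_arrays` are hypothetical names chosen for illustration.

```python
def is_functional(cas_genes):
    """A genome is treated as functional if both cas1 and cas2 are annotated."""
    return {"cas1", "cas2"} <= set(cas_genes)


def split_array_counts(genomes):
    """Return (nonfunctional_counts, functional_counts) of arrays per genome.

    Each genome is a dict with hypothetical keys 'cas_genes' (iterable of
    gene names) and 'n_arrays' (int).
    """
    nonfunctional, functional = [], []
    for g in genomes:
        target = functional if is_functional(g["cas_genes"]) else nonfunctional
        target.append(g["n_arrays"])
    return nonfunctional, functional


toy = [
    {"cas_genes": ["cas1", "cas2", "cas3"], "n_arrays": 3},
    {"cas_genes": ["cas1"], "n_arrays": 1},
    {"cas_genes": [], "n_arrays": 0},
]
n_counts, f_counts = split_array_counts(toy)
```

The empirical distributions of these two lists play the roles of $N_i$ and $F_i$ in the tests that follow.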
We start with our null hypothesis that in genomes with functional CRISPR
systems possession of a single array is highly adaptive (i.e. viruses are present and
will kill any susceptible host) but additional arrays provide no additional advantage.
Thus these additional arrays will appear and disappear in a genome as the result of
a neutral birth/death horizontal transfer and loss process, where losses are assumed
to remove an array in its entirety. This hypothesis predicts that the non-functional
distribution will look like the functional distribution shifted by one ($S_i$):
$$H_0 : N_i \sim S_i = \frac{F_{i+1}}{\sum_{j=1}^{\infty} F_j} \tag{3.1}$$
for $i \geq 0$ ($S_i$ renormalized to account for loss of the 0-array category).
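The shifted distribution can be computed directly from a list of per-genome array counts. The following is a small sketch under that assumption; the function name and input format are illustrative.

```python
from collections import Counter


def shifted_distribution(functional_counts, s=1):
    """Empirical S_i = F_{i+s} / sum_{j >= s} F_j, returned as {i: probability}.

    Genomes with fewer than s arrays are dropped and the remainder is
    renormalized, mirroring the loss of the low-count categories.
    """
    kept = [c - s for c in functional_counts if c >= s]
    if not kept:
        raise ValueError("no genomes with at least s arrays")
    total = len(kept)
    return {i: n / total for i, n in sorted(Counter(kept).items())}


# Toy functional genomes with 0, 1, 1, 2, and 3 arrays.
S = shifted_distribution([0, 1, 1, 2, 3], s=1)
```

With `s=1` this gives the null-hypothesis distribution of Equation 3.1; other values of `s` give the larger shifts used in the non-parametric test.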
We take two approaches to testing this hypothesis: one parametric from first
principles and one non-parametric with less power but fewer assumptions. In our
parametric approach, we construct a stochastic model of neutral array accumulation
and find that both $N_i$ and $S_i$ should fit a negative binomial distribution at equi-
librium (see Appendix B.2 for derivation). We calculate point maximum likelihood
estimates of the means of these fitted distributions ($\hat{\mu}_N$ and $\hat{\mu}_S$). We expect that
$\hat{\mu}_S > \hat{\mu}_N$ if more than one array is selectively maintained, and we bootstrap confi-
dence intervals on these estimates by resampling with replacement from our func-
tional and non-functional array count distributions in order to determine whether
the effect is significant.
We also construct a non-parametric test for selection by determining at what
shift $s$ the mismatch between $F_{i+s}/\sum_{j=s}^{\infty} F_j$ and $N_i$, measured as the sum of squared
differences between the distributions, is minimized:
$$s^* = \operatorname*{argmin}_{s} \sum_{i=0}^{\infty} \left( N_i - \frac{F_{i+s}}{\sum_{j=s}^{\infty} F_j} \right)^2. \tag{3.2}$$
Under our null hypothesis $s^* = 1$, and a value of $s^* > 1$ implies that selection
maintains more than one array. Our parametric test is superior to $s^*$ because it can
detect if selection maintains more than one array across the population on average,
but not in all taxa, so that the optimal shift is fractional.
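The non-parametric statistic can be sketched as a search over integer shifts. This is an illustrative implementation assuming lists of per-genome array counts as input; the `max_shift` bound is a hypothetical convenience parameter.

```python
from collections import Counter


def normalized(counts):
    """Empirical probability of each array count."""
    total = len(counts)
    return {i: n / total for i, n in Counter(counts).items()}


def sum_sq_diff(nonfunctional_counts, functional_counts, s):
    """Sum of squared differences between N_i and the s-shifted F distribution."""
    kept = [c - s for c in functional_counts if c >= s]
    if not kept:
        return float("inf")  # no functional genomes remain at this shift
    n_dist = normalized(nonfunctional_counts)
    f_dist = normalized(kept)
    support = set(n_dist) | set(f_dist)
    return sum((n_dist.get(i, 0.0) - f_dist.get(i, 0.0)) ** 2 for i in support)


def optimal_shift(nonfunctional_counts, functional_counts, max_shift=5):
    """Integer shift s in [0, max_shift] minimizing the squared mismatch."""
    return min(
        range(max_shift + 1),
        key=lambda s: sum_sq_diff(nonfunctional_counts, functional_counts, s),
    )


# Toy check: functional counts are the non-functional counts plus two.
N = [0] * 6 + [1] * 3 + [2] * 1
F = [c + 2 for c in N]
s_star = optimal_shift(N, F)
```

Because the search is over integers, this statistic cannot return a fractional shift, which is the limitation noted above.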
We note that the array accumulation process underlying these methods as-
sumes that CRISPR arrays are primarily lost all-at-once (e.g. due to recombination
between flanking insertion sequences [50, 140]) rather than through a process of
gradual decay due to spacer loss. Experimental evidence supports spontaneous loss
of the entire CRISPR array [51], as do comparisons between closely related genomes
[50]. We discuss this assumption and provide evidence supporting spontaneous loss
in Appendix B.3.
3.3.3 CRISPR spacer turnover model
We develop a simple deterministic model of the spacer turnover dynamics
in a single CRISPR array of a bacterium exposed to n viral species (i.e., disjoint
protospacer sets; Appendix B.4). This model allows us to specify the strength of
priming (i.e., if a CRISPR array has a spacer targeting a particular viral species,
the rate of spacer acquisition towards that species is increased; [141, 142]) and a
functional form for spacer loss over time.
Using this model we can determine the optimal spacer acquisition rate given
a particular pattern of pathogen recurrence in the environment. If the optima for
distinct recurrence patterns do not overlap, it indicates that multiple arrays would
be required to simultaneously combat viral species with these distinct recurrence
patterns. For model analysis see Appendix B.4.
We consider two functional forms for spacer loss based on known features
of CRISPR biology. (1) The rate of per-spacer loss increases linearly with locus
length. This form is based on the observation that spacer loss appears to occur via
homologous recombination between repeats [31, 143, 144], which becomes more likely
with increasing numbers of spacers (and thus repeats). (2) The length of an array is
capped at some fixed effective number of spacers. This form is based on evidence
that mature crRNA transcripts from the leading end of the CRISPR array are far
more abundant than those from the trailing end, and that this decay over the array
happens quickly (most transcripts are from the first few spacers; [145, 146, 147]). We
analyze both models (Appendix B.4), though they give qualitatively similar results,
and so we focus on case (1) in the Results.

Genome Set     With CRISPR array   > 1 CRISPR arrays   > 1 signature cas genes   > 1 type of signature cas gene
Full dataset   44%                 28%                 5%                        2%
Subsampled     40%                 24%                 9%                        5%
Table 3.1: CRISPR array and cas multiplicity across prokaryotic genomes.
3.4 Results
3.4.1 Having more than one CRISPR array is common
Almost half of the prokaryotic genomes in the RefSeq database have at least
one CRISPR array, and around a quarter have multiple CRISPR arrays (Table 3.1).
In contrast to this result, having more than one set of cas targeting genes is not
nearly as common. We counted the number of signature targeting genes diagnostic
for type I, II, and III systems in each genome (cas3, cas9, and cas10 respectively
[30]). Only 5% of all genomes have more than one targeting gene. Of these cases,
about half correspond to cases of multiple types of targeting genes in the same
genome (Table 3.1).
Some species are overrepresented in RefSeq (e.g. because of medical relevance),
and we wanted to avoid results being driven by just those few particular species.
We controlled for this bias by randomly sub-sampling 10 genomes from each species
with more than 10 genomes in the database and found broadly similar results (Table
3.1).
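The per-species subsampling can be sketched as below. The record field `species`, the function name, and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def subsample_by_species(genomes, max_per_species=10, seed=0):
    """Randomly keep at most `max_per_species` genomes from each species.

    `genomes` is a list of dicts with a hypothetical 'species' key.
    Species with `max_per_species` or fewer genomes are kept in full.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for g in genomes:
        by_species[g["species"]].append(g)

    sampled = []
    for members in by_species.values():
        if len(members) > max_per_species:
            members = rng.sample(members, max_per_species)  # without replacement
        sampled.extend(members)
    return sampled


toy = [{"species": "A", "id": i} for i in range(25)]
toy += [{"species": "B", "id": i} for i in range(3)]
subset = subsample_by_species(toy)
```

Sampling without replacement caps the influence of heavily sequenced species while leaving rarely sequenced species untouched.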
3.4.2 Selection maintains multiple CRISPR arrays
We leveraged the dierence between functional and non-functional genomes,
within each of which the process of CRISPR array accumulation should be dis-
tinct (Fig 3.1 and Table B.1). Non-functional CRISPR arrays should accumulate
neutrally in a genome following background rates of horizontal gene transfer and
gene loss (see Methods). We constructed two point estimates of this background
accumulation process using our parametric model to infer the distribution of the
number of arrays. One estimate came directly from the non-functional genomes
($\hat{\mu}_N$, Fig 3.1(a)). The other came from the functional genomes, assuming that hav-
ing one array is adaptive in these genomes, but that additional arrays accumulate
neutrally ($\hat{\mu}_S$, Fig 3.1(b)). If selection maintains multiple (functional) arrays, then
we should find that $\hat{\mu}_N < \hat{\mu}_S$. We found this to be overwhelmingly true, with about
two arrays on average seeming to be evolutionarily maintained across prokaryotic
taxa ($\Delta\mu = \hat{\mu}_S - \hat{\mu}_N = 1.09 \pm 0.03$). We bootstrapped 95% confidence intervals of
our estimates by resampling genomes (Table B.1) and found that the bootstrapped
distributions did not overlap, indicating a highly significant result (Fig 3.1(d)). To
control for the possibility that multiple sets of cas genes in a small subset of genomes
could be driving this selective signature, we restricted our dataset only to genomes
with one or fewer signature targeting genes (cas3, cas9, or cas10 [30, 68]) and one
or fewer copies each of the genes necessary for spacer acquisition (cas1 and cas2 ).
Even in this restricted set selection maintains more than one (functional) CRISPR
array, though the eect size is smaller (?? = 0.61? 0.02, Fig B.1).
In order to further conrm our results we (1) subsampled overrepresented taxa
in the dataset, (2) performed phylogenetically-corrected tests to account for possible
evolutionary correlation in rates of horizontal gene transfer (HGT), (3) considered
the eects of potential physical linkage between cas genes and CRISPR arrays, (4)
looked for artifacts as a factor of genome assembly level, (5) considered the potential
eects of CRISPR immunity on rates of HGT [148], and, nally, (6) merged arrays
with identical repeats to account for the potential formation of neo-CRISPR arrays
by o-target spacer integration [149] as well as other array duplication events. In all
cases our qualitative result of selection (?? > 0) holds (Appendices B.5 and B.6).
Additionally, we explored the possibility that the CRISPR detection algorithm we
used could be biased and/or suering from a high rate of false positives, and found
our qualitative result did not change when using a higher score cuto, restricting to
arrays with experimentally veried repeat sequences, or using an alternative algo-
rithm (Appendix B.7).
[Figure 3.1: four panels. (a, b) Number of Genomes vs. Number of CRISPR Arrays
Per Genome; (c) Sum of Squared Differences vs. Shift (s); (d) Bootstrap Freq. vs.
$\hat{\mu}$, with the $\hat{\mu}_N$ and $\hat{\mu}_S$ distributions marked.]
Figure 3.1: Selection maintains more than one CRISPR array on average across
prokaryotes. (a-b) Distribution of number of arrays per genome in (a) genomes
with non-functional CRISPR immunity and (b) genomes with putatively functional
CRISPR immunity. The tails of these distributions are cut off for ease of visual
comparison (24 genomes with > 10 arrays in (a) and 498 genomes with > 10 arrays
in (b)). In (a) the black circles show the negative binomial fit to the distribution of
arrays in non-functional genomes. In (b) black circles indicate the negative binomial
fit to the single-shifted distribution (s = 1) and pink triangles to the double-shifted
distribution (s = 2). Note that the fit to the double-shifted distribution (pink
triangles in b) visually resembles the distribution of non-functional arrays shown
in (a). (c) We formally quantify the difference between the non-functional/shifted
functional distributions and find an optimal shift of $s^* = 2$. (d) The bootstrapped
distributions of the parameter estimates of $\hat{\mu}_S$ and $\hat{\mu}_N$ show no overlap with 1000
bootstrap replicates.
3.4.3 A tradeoff between memory span and acquisition rate could select
for multiple arrays in a genome
We built a simple model of spacer turnover dynamics in a single CRISPR array.
We consider three patterns of viral residency in the environment corresponding to the
major threats prokaryotes are likely to face: (1) "background" viruses that coexist
with their hosts over long time periods [150], (2) periodic outbreaks of a particular
"transient" virus that enters and leaves the system [151], and (3) "novel" viruses
that a host has not previously encountered (see Methods and Appendix B.4). For
very high spacer acquisition rates, a host will be able to effectively defend against
all three types of viral species simultaneously, because the acquisition of immunity
will be nearly instantaneous ("short-term memory"/"fast-learning" in Figs 3.2 and
B.2). Such high rates are unrealistic due to physical constraints on the speed of
adaptation as well as the evolutionary constraint of autoimmunity (Appendix B.8,
[25, 27, 61, 152, 153]). CRISPR adaptation is rapid, but it is not instantaneous, and
infected but susceptible hosts will often perish before a spacer can be acquired [154].
This is precisely why the memory-like quality of CRISPR immunity is advantageous.
Our analysis also reveals a region of parameter space with low spacer acqui-
sition rates in which immunity is maintained towards both background and tran-
sient viruses ("long-term memory"/"slow-learning" in Fig 3.2(a)). The "long-term
memory"/"slow-learning" region of parameter space is separated from the "short-
term memory"/"fast-learning" region of parameter space by a "memory-washout"
region in which spacer turnover is high so that memory is lost but acquisition is
not rapid enough to quickly re-acquire immunity towards the transient virus. This
sets up a tradeoff between the ability of a host to defend against both transient
and novel viruses, since the response time towards novel threats in the long-term
memory/slow-learning region of parameter space is slow (Fig 3.2(b)), but memory
of transient threats is lost if spacer acquisition rates are increased. Thus, in order to
maximize novel spacer acquisition and memory span simultaneously, a two-system
solution will be required.

[Figure 3.2: two panels. (a) Phase diagram of Spacer Acquisition Rate ($\mu_A$) vs.
Ratio of Viral Densities ($v_B/v_T$), with regions labeled "Short-Term Memory"/
"Fast-Learning", Memory Washout (immunity towards "background" virus main-
tained), "Long-Term Memory"/"Slow-Learning", and Incomplete Immunity at Equi-
librium. (b) Speed of Response to Return of "Transient" Virus ($1/(1+t_I)$) vs. Speed
of Response to "Novel" Virus ($1/(1+t_N)$).]
Figure 3.2: Immune memory is maximized at intermediate and low spacer ac-
quisition rates, creating a tradeoff with the speed of immune response to novel
threats. (a) Phase diagram of the behavior of our CRISPR array model with two
viral species, a constant background population and a transient population that
leaves and returns to the system at some fixed interval. The yellow region indi-
cates that immunity towards both viral species was maintained. The green region
indicates where immune memory was lost towards the transient phage species, but
reacquired almost immediately upon phage reintroduction ($t_I < 10^{-5}$, where $t_I$ is
the time to first spacer acquisition after the return of the species to the system fol-
lowing an interval of absence). The light blue region indicates that only immunity
towards the background species was maintained (i.e., immune memory was rapidly
lost and $t_I > 10^{-5}$). Dark blue indicates where equilibrium spacer content towards
one or both species did not exceed one despite both species being present in the
system (Appendix B.4). (b) The tradeoff between memory span and learning speed.
The speed of immune response to the transient virus is plotted against the speed
of response to a novel virus to which the system has not been previously exposed
(so that there are no spacers targeting this virus), over a range of spacer acquisition
rates ($\mu_A \in [10^{-3.5}, 1]$), and letting the densities of transient and background viruses
be equal. The speed of immune response to a virus is defined as $1/(1 + t)$ where
$t$ is time to first spacer acquisition ($t = 0$ if memory is maintained). The speed of
response to the novel virus is therefore $1/(1+t_N)$ where $t_N$ is the time to first spacer
acquisition towards this virus. For specifics on calculating $t_I$ and $t_N$ see Appendix
B.4. Note that $\mu_A$ is the number of spacers expected to be acquired per viral particle
adsorbed to the host cell.
Additionally, priming expands the washout region of parameter space, be-
cause high spacer uptake from background viruses will crowd out long term immune
memory (Fig B.3). This suggests that priming strengthens the learning vs. memory
tradeo and makes a two-array solution more likely.
3.4.4 Selection varies between taxa and system types
A handful of species in the dataset were represented by a large number of
genomes (> 1000), with at least one each of functional and non-functional genomes.
We performed our test for selection on each of these species individually and found a
large amount of variation between species (Table B.2). Notably, genomes of Campy-
lobacter jejuni, Escherichia coli, and Salmonella enterica show evidence for selection
against having a functional CRISPR array (negative $\Delta\mu$), indicating that CRISPR
immunity is selected against on average in some groups of organisms. Previous work
has shown that CRISPR in E. coli and S. enterica appears to be non-functional as an
immune system under natural conditions [155, 156]. We had relatively few archaeal
genomes (< 1% of dataset), but they showed a clear signal of selection maintaining
multiple arrays ($\Delta\mu = 1.05 \pm 0.56$, Fig B.4).
While we do not have direct information on system type for the majority of
arrays in our dataset, we can subdivide genomes into those containing the signature
cas targeting genes for type I, II, or III CRISPR systems (cas3, cas9, and cas10
respectively) as a proxy for system type [30]. The number of arrays per genome
diered signicantly among system types (Fig B.5), and the largest dierence was
between genomes with class 1 targeting proteins, which had around 2 arrays on
average (type I and type III, 2.10 and 1.96 respectively), and class 2 targeting
proteins, which only had one array on average (type II, 1.05). We excluded genomes
with multiple types of targeting genes for this analysis.
We cannot run our test for selection directly on these subsets of the data, since
they exclude genomes without arrays or cas genes. Instead we classified species into
types if the only observed targeting gene type among all representatives of that
species corresponded to a particular type. Thus we can test for our signature of
selection among species that favor a particular type of CRISPR system. All types
showed a signature of multi-array selection ($\Delta\mu = 1.09 \pm 0.05$, $0.62 \pm 0.02$, $1.79 \pm 0.06$
respectively). In particular type III species had an exceptionally strong signal, and
organisms in this group may be under selection to maintain three arrays.
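The species-level typing rule can be sketched as follows. The mapping from signature genes to types follows the text (cas3, cas9, and cas10 for types I, II, and III); the record fields `species` and `cas_genes` are hypothetical names used only for illustration.

```python
SIGNATURE_TO_TYPE = {"cas3": "I", "cas9": "II", "cas10": "III"}


def classify_species(genomes):
    """Assign a species to a type only if a single signature gene type is
    observed across all of its genomes.

    Each genome is a dict with hypothetical keys 'species' and 'cas_genes'.
    Species with no signature genes, or with more than one type, are omitted.
    """
    observed = {}
    for g in genomes:
        types = {SIGNATURE_TO_TYPE[x] for x in g["cas_genes"] if x in SIGNATURE_TO_TYPE}
        observed.setdefault(g["species"], set()).update(types)
    return {sp: next(iter(t)) for sp, t in observed.items() if len(t) == 1}


toy = [
    {"species": "A", "cas_genes": ["cas1", "cas2", "cas3"]},
    {"species": "A", "cas_genes": []},
    {"species": "B", "cas_genes": ["cas9"]},
    {"species": "B", "cas_genes": ["cas10"]},
    {"species": "C", "cas_genes": ["cas1"]},
]
species_types = classify_species(toy)
```

Under this rule, genomes without arrays or cas genes still count toward their species, which is what allows the test for selection to be run within each type.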
3.5 Discussion
3.5.1 Selection maintains multiple CRISPR arrays across prokaryotes
On average, prokaryotes are under selection to maintain more than one CRISPR
array. The number of CRISPR arrays in a genome appears to follow a negative bi-
nomial distribution quite well (Figs 3.1, B.1, B.6, and B.7), consistent with our
theoretical prediction. We note that, due to the large size of this dataset, formal
goodness-of-t tests to the negative binomial distribution always reject the t due
to small but statistically signicant divergences from the theoretical expectation.
Our test for selection is conservative with respect to the miscategorization of arrays
as functional or non-functional. Miscategorizations could occur for several reasons:
preexisting spacers may continue to confer immunity, some CRISPR arrays
may be conserved for non-immune purposes (e.g. [155, 157]), and intact acquisition
machinery is no guarantee of system functionality. Our test is conservative precisely
because of such miscategorizations, as they should drive $\hat{\mu}_N$ and $\hat{\mu}_S$ closer to each
other. Selection against having a CRISPR array in non-functional genomes could
produce a false signature of multi-array selection, but this is unlikely because non-
functional arrays probably carry extremely low or nonexistent associated costs [158].
Our test for selection is also robust to false positive or negative array discovery
rates because it relies on relative differences between array counts in functional and
non-functional genomes, not their absolute values. The only problem could arise
if the discovery error rates were different between the two categories; however, the
array detection process did not take functionality into account and we found only a
marginal dierence in CRISPRDetect condence scores between the two groups (Fig
B.8). We further conrmed this robustness to peculiarities of the detection algorithm
by changing our CRISPRDetect score threshold and comparing to the distribution
of arrays per genome in the independently-generated CRISPR Database (Appendix
B.7; [159]).
Finally, we note that $\hat{\mu}_N$ and $\hat{\mu}_S$ take on a range of values depending on
what subset of taxa/genomes is considered. This is to be expected as each set of
species will occupy a distinct environment in terms of both the rate of horizontal
gene transfer and the usefulness of CRISPR immunity. Nevertheless, our qualitative
signature of selection is robust to this quantitative variability.
3.5.2 Why have multiple CRISPR-Cas systems?
Possibly, multiple arrays could be selectively maintained even in the absence
of any fitness advantage if, by chance, each array acquired complementary spacer
content towards distinct viral targets. In type I and II systems, if arrays share
acquisition machinery then such complementarity is unlikely because priming will
ensure both arrays contain spacers towards any target encountered, meaning that
the content of the two arrays will be largely redundant [141]. Type III systems
are unprimed and have slow spacer acquisition rates [160], and therefore may be
maintained via spacer complementarity, perhaps explaining why species favoring
type III systems appear to experience selection maintaining three rather than just
two CRISPR arrays. Even in type I and II systems, if each array is associated with
a separate set of spacer acquisition machinery, then cross-priming will be less likely
and complementarity could arise. Nevertheless, this does not explain the multi-array
conservation we see in genomes with only a single set of cas genes.
Therefore we are left with two broadly dened reasons why having multiple
CRISPR arrays might be adaptive: (1) multiple similar systems could lead to im-
proved immunity through redundancy and (2) multiple dissimilar systems could
allow specialization towards distinct types of threats.
In the case of similar systems, immunity could be improved by (a) an increased
spacer acquisition rate, (b) an increased rate of targeting, or (c) a longer time to
expected loss of immunity. Duplication of cas genes could increase uptake (a) and
targeting rates (b), but again this could not explain our results with a single set of
cas genes. Alternatively, duplication of CRISPR arrays could increase targeting (b)
by producing a larger number of crRNA transcripts or increase memory duration
(c) through spacer redundancy. However, the effectiveness of crRNA may actually
decrease in the presence of competing crRNAs [158, 161, 162], and spacer redun-
dancy across multiple arrays has little advantage over redundancy within a single
array (Appendix B.9). At a larger scale, redundancy of either arrays or cas genes
might be a form of bet-hedging against mutation-induced loss of functionality of the
CRISPR system [51, 69].
Alternatively, dissimilar systems could help defend against diverse threats.
Diverse cas genes may allow hosts to evade broadly-acting anti-CRISPR proteins
encoded by some viruses [60, 163]. Indeed, promiscuous type III Cas proteins are
often encoded alongside type I systems and can cooperate to target phages that
have mutated to escape type I targeting [164]. Empirically, we see the inclusion of
genomes with multiple cas targeting genes increases the effect size of our test for
selection, suggesting these factors may play a role. However, these cas-diversity hy-
potheses cannot explain the signature for multi-array adaptiveness observed among
genomes with only a single set of targeting proteins. We note that we observed our
signature of selection on multiple arrays both when limiting our analyses to arrays
with identical (Appendix B.10) and dissimilar (Appendix B.6) repeat sequences.
Therefore selective maintenance of multiple arrays does not appear to be isolated
to genomes with arrays of the same type or different types, but rather to be a much
more general phenomenon. Additionally, given the very small number of genomes
with multiple types of cas targeting genes in our dataset, it is unlikely that selection
for multiple types of systems is particularly widespread even if it does exist in some
cases.
We develop a hypothesis that diversity in spacer acquisition rate among ar-
rays could lead to selection for multiple arrays. Our theoretical model illustrates
how factors intrinsic to the mechanism of CRISPR immunity could create a trade-
off between memory span and learning speed. Either the physical loss of spacers
due to homologous recombination or the effective loss of spacers due to differential
transcription along the array leads to a qualitatively similar result. In both cases,
rapid spacer uptake causes rapid spacer loss (either physical or effective), producing
the aforementioned tradeoff. A low acquisition rate system is unlikely to pick up
a spacer from a single viral exposure, but, over a long time-frame, it may acquire
spacers from viruses that periodically reappear in the system. Additionally, recom-
bination between arrays [165] could potentially facilitate the passage of memories
between fast and slow arrays, allowing short-term memories to become long-term
ones.
While we do not have empirical evidence that rate variation drives the observed
signature of selection of multiple arrays, this hypothesis remains attractive since it
can explain the signature even in the absence of multiple sets of cas genes. Acquisi-
tion rates vary between arrays, even on the same genome [63, 150], and even when
those arrays share cas genes and have an identical or nearly identical repeat sequence
[166, 167]. We found no clear link between the diversity of repeat sequences and a
proxy for spacer acquisition rates (Appendix B.10). Further, we found indications
of selection even when restricting to arrays with identical repeats (Appendix B.10).
Thus the factors inuencing acquisition rate appear to be idiosyncratic, perhaps
related to the genomic position of the CRISPR array.
When partial spacer-target matches exist, variability in spacer acquisition rates
among arrays will be largely irrelevant because priming will ensure rapid acquisition
of new spacers. On the other hand, when no match exists, either due to spacer loss
or the introduction of a truly novel viral species into the environment, primed spacer
uptake will not occur. Thus the rate at which a host encounters novel threats will
determine the importance of the baseline spacer acquisition rate. In environments
where novel viruses are frequently encountered, small differences in acquisition rate
can be important for host fitness, whereas in environments where host and virus pairs
consistently coevolve over time priming will be the more important phenomenon.
Finally, our examination of immune configuration is likely relevant to the full
range of prokaryotic defense mechanisms. In contrast to previous work focusing on
mechanistic diversity (e.g. [48, 63, 130, 152]), we emphasize the importance of the
multiplicity of immune systems in the evolution of host defense. As we suggest, a
surprising amount of strategic diversity may masquerade as simple redundancy.
Chapter 4: Immune Loss as a Driver of Coexistence During Host-Phage
Coevolution
4.1 Abstract
Bacteria and their viral pathogens face constant pressure for augmented im-
mune and infective capabilities, respectively. Under this reciprocally imposed selec-
tive regime, we expect to see a runaway evolutionary arms race, ultimately leading
to the extinction of one species. Despite this prediction, in many systems host and
pathogen coexist with minimal coevolution even when well-mixed. Previous work
explained this puzzling phenomenon by invoking fitness tradeoffs, which can dimin-
ish an arms race dynamic. Here we propose that the regular loss of immunity by the
bacterial host can also produce host-phage coexistence. We pair a general model of
immunity with an experimental and theoretical case study of the CRISPR-Cas im-
mune system to contrast the behavior of tradeoff and loss mechanisms in well-mixed
systems. We find that, while both mechanisms can produce stable coexistence, only
immune loss does so robustly within realistic parameter ranges.
4.2 Introduction
While the abundance of bacteria observed globally is impressive [126, 168,
169], any apparent microbial dominance is rivaled by the ubiquity, diversity, and
abundance of predatory bacteriophages (or phages), which target these microbes
[170, 171, 172, 173, 174]. As one might expect, phages are powerful modulators of
microbial population and evolutionary dynamics, and of the global nutrient cycles
these microbes control [168, 170, 172, 173, 175, 176, 177, 178, 179, 180]. Despite this
ecological importance, we still lack a comprehensive understanding of the dynam-
ical behavior of phage populations. More specifically, it is an open question what
processes sustain phages in the long term across habitats.
Bacteria can evade phages using both passive forms of resistance (e.g. receptor
loss, modication, and masking) and active immune systems that degrade phages
(e.g. restriction-modication systems, CRISPR-Cas) [71]. These defenses can incite
an escalating arms race dynamic in which host and pathogen each drive the evolution
of the other [8, 9]. However, basic theory predicts that such an unrestricted arms race
will generally be unstable and sensitive to initial conditions [181]. Additionally, if
phages have limited access to novel escape mutations, an arms race cannot continue
indenitely [182, 183, 184]. This leads to an expectation that phage populations
will go extinct in the face of host defenses [183].
While typically this expectation holds [e.g. 185], phages sometimes coexist with
their hosts, both in natural [e.g. 186, 187] and laboratory settings [e.g. 150, 181, 183,
188, 189, 190, 191, 192]. These examples motivate a search for mechanisms to explain
the deescalation and eventual cessation of a coevolutionary arms race dynamic, even
in the absence of any spatial structure to the environment. Previous authors have
identied (1) uctuating selection and (2) costs of defense as potential drivers of
coexistence in well-mixed systems. Here we propose (3) the loss of immunity, wherein
the host defense mechanism ceases to function, as an additional mechanism. We
focus on intracellular immunity (e.g., CRISPR-Cas) in which immune hosts act as
a sink for phages rather than extracellular resistance (e.g., receptor modifications),
since the former poses more of an obstacle for phages and thus more of a puzzle for
explaining long-term coexistence.
Under a fluctuating selection dynamic, frequencies of immune and infective alleles
in the respective host and phage populations cycle over time [193, 194, 195, 196].
That is, old, rare genotypes periodically reemerge because the dominant host or
pathogen genotype faces negative frequency dependent selection. Fluctuating se-
lection is likely in situations where host immune and phage infectivity phenotypes
match up in a one-to-one lock and key type manner [195], and there is evidence
that arms races do give way to fluctuating selection in some host-phage systems
[184]. Fluctuating selection cannot always proceed, though. When novel pheno-
types correspond to increased generalism we do not expect past phenotypes to recur
[195, 196] since they will no longer be adaptive. Such expanding generalism during
coevolution has been seen in other host-phage systems [197]. Thus the relevance
of fluctuating selection depends on the nature of the host-phage immune-infective
phenotype interaction.
Other possible drivers of coexistence are costs incurred by tradeoffs between
growth and immunity (for host) or host range and immune evasion (for phage)
[190, 198, 199, 200]. A tradeoff between immunity and growth rate in the host
can lead to the maintenance of a susceptible host population on which phages can
persist [183, 189, 190, 199, 201, 202]. Tradeoffs often imply a high cost of immunity
that does not always exist [e.g. 181], particularly in the case of intracellular host
immunity, as we show later.
Finally, in large host populations typical of bacteria, even low rates of immune
loss could produce a substantial susceptible host subpopulation, which, in turn,
could support phage reproduction and coexistence. Such loss of function in the host
defenses could be due to either mutation or stochastic phenotypic changes. Delbrück
[203] initially described this hypothesis of loss of defense via back-mutation in order
to challenge the evidence for lysogeny. Lenski [204] reiterated this hypothesis in
terms of phenotypic plasticity and noted that conditioning the production of a sus-
ceptible host population on a resistant one could lead to very robust, host-dominated
coexistence. More recently, Meyer et al. [205] presented an empirical example of
a system in which stochastic phenotypic loss of resistance leads to persistence of a
coevolving phage population.
We hypothesize that coexistence equilibria will be more robust under an immune
loss mechanism than under a tradeoff mechanism [204]. We build a general
mathematical model to demonstrate this point and then use a combination of ex-
perimental evidence and simulation-based modeling to apply this result to the co-
evolution of Streptococcus thermophilus and its lytic phage 2972 in the context of
CRISPR immunity.
4.3 General Immune Loss Model
We begin with a general model that considers two populations of host (defended
with a functional immune system; undefended without) and one population
of pathogen. Starting from classical models of bacteria-phage dynamics
[198, 206], we add key terms to capture the effects of autoimmunity (i.e., a tradeoff),
immune loss, and the implicit effects of coevolution. This relatively simple model
allows us to analyze steady states and parameter interactions analytically. Later, we
examine the CRISPR-Cas immune system in detail and build a model with explicit
coevolutionary dynamics.
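We examine the chemostat system with resources:
\[ \dot{R} = w(A - R) - \frac{evR}{z+R}(D + U), \tag{4.1} \]
defended host:
\[ \dot{D} = D\left(\frac{vR}{z+R} - \delta\phi_d P - \alpha - \mu - w\right), \tag{4.2} \]
undefended host:
\[ \dot{U} = U\left(\frac{vR}{z+R} - \delta\phi_u P - w\right) + \mu D, \tag{4.3} \]
and phage:
\[ \dot{P} = P\left(\delta U(\phi_u\beta - 1) + \delta D(\phi_d\beta - 1) - w\right), \tag{4.4} \]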
where parameter definitions and values can be found in Table 4.1 and rationale/references
for parameter values in Appendix C.1. Here we describe the parameters
of direct relevance to coexistence.
First, we allow for defended host to come with the tradeoff of autoimmunity
($\alpha$), which applies naturally to the CRISPR-Cas system examined later. While
autoimmunity could either decrease the host growth rate [207] or be lethal, we
focus on the latter as lethality will increase the stabilizing effect of this tradeoff
[26, 207, 208]. However, we also find similar general results when applying a penalty
to the resource affinity or maximum growth rate of the defended host (Appendix
C.2, Figs C.1-C.8).
Second, we add flow from the defended to undefended host populations representing
loss of immunity at rate $\mu$.
Finally, we model the effect of coevolution by allowing a fraction of even the
defended host population to remain susceptible ($0 < \phi_d \le 1$). In a symmetric
fashion, even nominally undefended host may have secondary defenses against phage
($0 < \phi_u \le 1$).
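As an illustration only, the right-hand side of the general model can be written as a short Python function. This is a sketch rather than the code used for the analyses: it assumes the symbol names $\delta$ (adsorption rate), $\beta$ (burst size), $\phi_u$ and $\phi_d$ (susceptibilities), $\alpha$ (autoimmunity), and $\mu$ (immune loss) for Equations 4.1-4.4, takes its default values from Table 4.1, and uses the hypothetical names `Params` and `rhs`.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Default values follow Table 4.1; the field names are illustrative."""
    e: float = 5e-7       # resource consumed per cell (ug/cell)
    v: float = 1.4        # maximum bacterial growth rate (per hr)
    z: float = 1.0        # resource concentration for half-maximal growth (ug/mL)
    A: float = 350.0      # resource pool concentration (ug/mL)
    w: float = 0.3        # flow rate (per hr)
    delta: float = 1e-8   # adsorption rate
    beta: float = 80.0    # burst size
    phi_u: float = 1.0    # susceptibility of undefended host
    phi_d: float = 0.0    # susceptibility of defended host
    alpha: float = 2.5e-5 # autoimmunity rate
    mu: float = 5e-4      # immune loss rate


def rhs(state, p):
    """Return (dR, dD, dU, dP) for state = (R, D, U, P), assuming Eqs. 4.1-4.4."""
    R, D, U, P = state
    growth = p.v * R / (p.z + R)  # Monod growth term vR / (z + R)
    dR = p.w * (p.A - R) - p.e * growth * (D + U)
    dD = D * (growth - p.delta * p.phi_d * P - p.alpha - p.mu - p.w)
    dU = U * (growth - p.delta * p.phi_u * P - p.w) + p.mu * D
    dP = P * (p.delta * U * (p.phi_u * p.beta - 1)
              + p.delta * D * (p.phi_d * p.beta - 1) - p.w)
    return (dR, dD, dU, dP)
```

The returned tuple can be handed to any ODE integrator. With these defaults, a defended population of $10^6$ cells/mL feeds roughly 500 cells/mL per hour into the undefended pool through the loss term alone, which is the flow that the immune loss mechanism relies on.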
Symbol     Definition                                        Value
R          Resources                                         $R_0 = 350$ μg/mL
D          Defended host                                     $D_0 = 10^6$ cells/mL
U          Undefended host                                   $U_0 = 10^2$ cells/mL
P          Phage                                             $P_0 = 10^6$ particles/mL
e          Resource consumption rate of growing bacteria     $5 \times 10^{-7}$ μg/cell
v          Maximum bacterial growth rate                     1.4 divisions/hr
z          Resource concentration for half-maximal growth    1 μg/mL
A          Resource pool concentration                       350 μg/mL
w          Flow rate                                         0.3 mL/hr
$\delta$   Adsorption rate                                   $10^{-8}$ mL per cell per phage per hr
$\beta$    Burst size                                        80 particles per infected cell
$\phi_u$   Degree of susceptibility of undefended host       1
$\phi_d$   Degree of susceptibility of defended host         0
$\alpha$   Autoimmunity rate                                 $2.5 \times 10^{-5}$ deaths per individual per hr
$\mu$      Rate of immune inactivation/loss                  $5 \times 10^{-4}$ losses per individual per hr
Table 4.1: Definitions and oft-used values/initial values of variables, functions, and parameters for the general mathematical
model
We analyze our model analytically as well as numerically to verify which equi-
libria are reachable from plausible (e.g., experimental) starting values (Appendix
C.3).
Assuming no phage coevolution ($\phi_d = 0$), this model has a single analytic
equilibrium in which all populations coexist (Table C.1). In Fig 4.1, we explore
model behavior under varying rates of autoimmunity ($\alpha$) and immune loss ($\mu$).
Clearly when autoimmunity and loss rates surpass unity, defended host go extinct
in the face of excessive immune loss and autoimmune targeting. At the opposite
parameter extreme, we see coexistence disappear from the numeric solutions (Fig
4.1b) as phage populations collapse. This leads to a band of parameter space where
coexistence is possible, stable, and robust. In this band, autoimmunity and/or
immune loss occur at high enough rates to ensure maintenance of coexistence, but
not so high as to place an excessive cost on immunity. Crucially, this band is
much more constrained in the $\alpha$-dimension, with autoimmunity restricted to an
implausibly high and narrow region of parameter space. This suggests a greater
robustness of coexistence under an immune loss mechanism even at low loss rates
(Fig 4.1, Figs C.2-C.8). To assess more directly the degree of robustness of each
driver of coexistence we can perturb our system and see its response. We move
our system away from equilibrium $X^*$ so that $\tilde{X} = X^* \exp\left(\gamma\left(Y - \frac{1}{2}\right)\right)$ where
$Y \sim \mathrm{Uniform}[0, 1]$, and then solve numerically using $\tilde{X}$ as our initial condition. Under
increasing levels of perturbation the system is less likely to reach stable coexistence,
specifically in the $\alpha$-dimension, indicating that autoimmunity produces a far less
robust coexistence regime (Fig 4.1c-e, Figs C.2-C.8).
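The perturbation step described above can be sketched in a few lines of Python. The sketch assumes that the perturbation strength is a single scalar (called `gamma` here) and that an independent uniform draw is made for each state variable; the text does not say whether one draw is shared across variables, so that choice and the function name `perturb` are assumptions.

```python
import math
import random


def perturb(equilibrium, gamma, rng=random):
    """Scale each equilibrium value by exp(gamma * (Y - 1/2)), with Y ~ Uniform[0, 1).

    Assumption: one independent draw of Y per state variable.
    """
    return [x * math.exp(gamma * (rng.random() - 0.5)) for x in equilibrium]
```

Under this form each perturbed value lies within a factor of $e^{\gamma/2}$ of its equilibrium value, so larger values of the perturbation strength start the system farther away, on a log scale, from the equilibrium.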
[Figure 4.1, panels A-E: phase diagrams with CRISPR loss ($\mu$) on the horizontal axis and
autoimmunity ($\alpha$) on the vertical axis, both on log scales from $10^{-6}$ to $10^{0}$. Labeled
regions: no coexistence, phage-dominated coexistence, host-dominated coexistence.
Panels C, D, and E use $\gamma = 0.1$, $\gamma = 1$, and $\gamma = 10$, respectively.]
Figure 4.1: Model behavior under variations in the rates of autoimmunity ($\alpha$) and
CRISPR-Cas system loss ($\mu$). Equilibria (Table C.1) derived from Equations 4.1-4.4
are shown in (a) where orange indicates a stable equilibrium with all populations
coexisting and defended host dominating phage populations, green indicates that all
populations coexist but phages dominate, and blue indicates that defended bacteria
have gone extinct but phages and undefended bacteria coexist. In (b) we find numerical
solutions to the model at 80 days using realistic initial conditions more specific
to the experimental setup ($R(0) = 350$, $D(0) = 10^6$, $U(0) = 100$, $P(0) = 10^6$).
In this case orange indicates coexistence at 80 days with defended host at higher
density than phages, green indicates a phage-dominated coexistence at 80 days, and
blue indicates that coexistence did not occur. Numerical error is apparent as noise
near the orange-blue boundary. We neglect coevolution and innate immunity in this
analysis ($\phi_u = 1$, $\phi_d = 0$). (c-e) Phase diagrams with perturbed starting conditions.
Numerical simulations with starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$)
perturbed by a proportion of the equilibrium condition, $X(0) = X^* \exp\left(\gamma\left(Y - \frac{1}{2}\right)\right)$
where $Y \sim U[0, 1]$ and $X^*$ signifies an equilibrium value, to explore how robust the
equilibria are to starting conditions. A single simulation was run for each parameter
combination.
If we add large amounts of innate immunity to undefended host ($\phi_u < 0.5$), we
find phage-dominated coexistence for a wider range of $\alpha$ (Fig C.10). This result is in
line with the counterintuitive suggestion that higher immunity may increase phage
density by allowing the host population to increase in size [48]. However, secondary
defense has minimal effects for more plausible levels of protection ($\phi_u$ closer to 1).
In the case of phage coevolution ($\phi_d > 0$), the equilibria still have closed
forms, but are not easily representable as simple equations and so are not written
here. When $\phi_d > 1/\beta$, defended host contribute positively to phage growth, eventually
shifting the coexistence equilibrium from host to phage dominance (Fig C.9).
4.4 A Case Study: CRISPR-Phage Coevolution
The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)
prokaryotic adaptive immune system incorporates specific immune memory in the
form of short sequences of DNA acquired from foreign genetic elements (spacers)
and then uses this memory to target the corresponding sequences (protospacers)
during subsequent infections [18, 19, 38, 209]. CRISPR can lead to rapidly escalating
arms races between bacteria and phages [150, 210, 211], in which evolutionary and
population dynamics occur on the same timescale [150, 212, 213, 214].
CRISPR-Cas can quickly drive phages extinct in an experimental setting [185],
but in some cases long-term CRISPR-phage coexistence has been observed [150].
Previous theoretical and limited experimental work has explained short-term coexistence
through tradeoffs and spacer loss [215], and long-term coexistence by invoking
continued coevolution via fluctuating selection [214] or tradeoffs with host switching
to a constitutive defense strategy such as surface receptor modification [63, 216].
However, these previous hypotheses are insufficient to explain simple coevolution
experiments with Streptococcus thermophilus (type II-A CRISPR-Cas system)
and its lytic phage 2972 resulting in long-term coexistence [26, 150]. In these ex-
periments, bacteria are resource-limited and appear immune to phages, implying
they have won the arms race and that phages are persisting on a small susceptible
subpopulation of hosts. Deep sequencing of the same experimental system shows
dominance by a few spacers that drift in frequency over time, inconsistent with a fluctuating
selection dynamic [26]. Specifically, these results contradict the coexistence
regime seen in the Childs et al. [214, 217] model, wherein host are phage-limited
and the system undergoes a fluctuating selection dynamic. Thus either (1) costs
associated with CRISPR immunity or (2) the loss of CRISPR immunity is playing
a role in maintaining susceptible host subpopulations on which phages can persist.
In this system, the primary cost of a functional CRISPR-Cas system is au-
toimmunity via the acquisition of self-targeting spacers. It is unclear how or if
bacteria distinguish self from non-self during the acquisition step of CRISPR immu-
nity [25, 27, 61, 152, 153]. In S. thermophilus, experimental evidence suggests that
there is no mechanism of self vs. non-self recognition and that self-targeting spacers
are acquired frequently [27], which implies that autoimmunity may be a significant
cost.
Outright loss of CRISPR immunity at a high rate could also lead to coex-
istence. The bacterium Staphylococcus epidermidis loses phenotypic functionality
in its CRISPR-Cas system, either due to wholesale deletion of the relevant loci or
mutation of essential sequences (i.e. the leader sequence or cas genes), at a rate
of $10^{-4}$-$10^{-3}$ inactivation/loss events per individual per generation [51]. Functional
CRISPR loss has been observed in other systems as well [143, 218].
Below we replicate the serial-transfer coevolution experiments performed by
Paez-Espino et al. [26, 150] and develop a simulation-based coevolutionary model
to explain the phenomenon of coexistence.
4.4.1 Experiments
We performed long-term daily serial transfer experiments with S. thermophilus
and its lytic phage 2972 in milk, a model system for studying CRISPR evolution
(see Appendix C.4 for detailed methods). We measured bacteria and phage densities
on a daily basis. Further, on selected days we PCR-amplified and sequenced the
CRISPR1 and CRISPR3 loci, the two adaptive CRISPR loci in this bacterial strain.
From the perspective of density, phages transiently dominated the system early
on, but the bacteria quickly took over and by day five appeared to be resource-limited
rather than phage-limited (Fig 4.2a,b). This switch to host-dominance corresponded
to a drop in phage populations to a titer two to three orders of magnitude below
that of the bacteria. Once arriving at this host-dominated state, the system either
maintained quasi-stable coexistence on an extended timescale (over a month and a
half), or phages continued to decline and went extinct relatively quickly (Fig 4.2a,b).
We performed six additional replicate experiments which confirmed this dichotomy
between either extended coexistence (4 lines quasi-stable for > 2 weeks) or quick
phage extinction (2 lines < 1 week) (Fig C.11).
Sequencing of the CRISPR1 and CRISPR3 loci revealed the rapid gain of a
single spacer (albeit different spacers in different sequenced clones) in CRISPR1
followed by minor variation in spacer counts with time (Fig C.12), with CRISPR1
being more active than CRISPR3. We tracked the identity of the first novel spacer in
the CRISPR1 array over time. We found a cohort of four spacers that persisted over
time and were repeatedly seen despite a small number of samples taken at each time
point (less than 10 per time point; Table 4.2). Other spacers were sampled as well,
but this small cohort consistently reappeared while other spacers were only found
at one or two timepoints, indicating this cohort was dominating the system (Table
C.2). Such a pattern is inconsistent with a fluctuating selection hypothesis. Further,
we did not observe frequent spacer loss in the CRISPR1 or CRISPR3 arrays.
[Figure 4.2, panels (a)-(e): population density of bacteria and phage (log scale, $10^{0}$ to $10^{10}$)
against time (days, 0 to 50).]
Figure 4.2: Serial transfer experiments carried out with S. thermophilus and lytic
phage 2972. Bacteria are resource-limited rather than phage-limited by day five and
phages can either (a) persist at relatively low density in the system on long timescales
(greater than 1 month) or (b) collapse relatively quickly. These results agree with
those of Paez-Espino [150] where coexistence was observed in S. thermophilus and
phage 2972 serially transferred culture for as long as a year. Experiments were
initiated with identical starting populations and carried out following the same
procedure. In (c-e) we show that our simulations replicate the qualitative patterns
seen in the data, with an early phage peak, followed by host-dominated coexistence
that can either be (c) stable, (d) sustained but unstable, or (e) short-lived. Each plot
is a single representative simulation and simulations were ended when phages went
extinct. Note that experimental data has a resolution of one time point per day,
preventing conclusions about the underlying population dynamics (e.g., cycling),
whereas simulations are continuous in time.
          Spacer ID
Time   D   E   F   G   Total in Cohort   Total Sampled Sequences   Percent Samples in Cohort
1      0   0   0   0   0                 3                         0
2      1   1   1   0   3                 4                         75
3      2   0   1   1   4                 5                         80
4      0   0   0   0   0                 1                         0
5      0   0   2   1   3                 7                         43
11     1   2   0   2   5                 7                         71
15     1   1   1   0   3                 7                         43
25     0   5   0   0   5                 8                         63
35     0   1   0   2   3                 6                         50
40     0   0   0   2   2                 9                         22
Table 4.2: Sequencing data shows four first-order spacers that persist as a high-frequency cohort over time. Samples identified
by the first novel spacer added to the array as compared to the wild-type. See Table C.2 for complete spacer dynamics.
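As a small worked check, the last column of Table 4.2 can be recomputed from the per-spacer counts and the number of sequences sampled. The snippet below is only an illustration; the names `ROWS` and `percent_in_cohort` are hypothetical, and half-up rounding is assumed because it reproduces the reported value of 63 for day 25 (5 of 8 samples, or 62.5%).

```python
import math

# (time point, counts for cohort spacers D, E, F, G, total sequences sampled),
# transcribed from Table 4.2.
ROWS = [
    (1, (0, 0, 0, 0), 3),
    (2, (1, 1, 1, 0), 4),
    (3, (2, 0, 1, 1), 5),
    (4, (0, 0, 0, 0), 1),
    (5, (0, 0, 2, 1), 7),
    (11, (1, 2, 0, 2), 7),
    (15, (1, 1, 1, 0), 7),
    (25, (0, 5, 0, 0), 8),
    (35, (0, 1, 0, 2), 6),
    (40, (0, 0, 0, 2), 9),
]


def percent_in_cohort(counts, total):
    """Percentage of sampled sequences carrying a cohort spacer, rounded half up."""
    # math.floor(x + 0.5) is used because Python's round() rounds 62.5 to 62.
    return math.floor(100 * sum(counts) / total + 0.5)


PERCENTS = {time: percent_in_cohort(counts, total) for time, counts, total in ROWS}
```

From day 2 onward the cohort accounts for between 22% and 80% of the sampled sequences at every time point except day 4, where only a single sequence was sampled.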
4.4.2 CRISPR-phage Coevolutionary Model
We next built a hybrid deterministic/stochastic lineage-based model similar to
an earlier model by Childs et al. [214, 217] that explicitly models the coevolutionary
dynamics of the CRISPR-phage system wherein bacteria acquire spacers to gain
immunity and phages escape spacers via mutations. Our simulations also replicate
the resource dynamics of a serial dilution experiment, wherein the system undergoes
large daily perturbations.
We model phage mutations only in the protospacer adjacent motif (PAM)
region, which is the dominant location of CRISPR escape mutations [150], to prevent
the possibility of spacer re-acquisition. This approach differs from previous models
which considered mutations in the protospacer region itself [e.g. 47, 48, 214] and
thus allowed for the possibility of spacer re-acquisition. We justify modeling only
PAM mutations with three arguments. First, the probability of spacer re-acquisition
will be quite low if there are many protospacers. Second, re-acquired spacers will
already have undergone selection for escape mutation by phage, and, assuming that
there are therefore diverse escape mutations in the phage population, these spacers
will thus provide limited benefit to the host. Third, as we move away from the PAM
along the protospacer sequence, more substitutions are tolerated by the CRISPR
matching machinery [219], meaning that mutations farther away from the PAM will
be less effective at escaping immunity [220].
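We model population dynamics using differential equations for resources:
\[ \dot{R} = -\frac{evR}{z+R}\left(U + \sum_i D_i\right), \tag{4.5} \]
CRISPR-enabled bacteria with spacer set $X_i$:
\[ \dot{D}_i = D_i\left(\frac{vR}{z+R} - \delta\sum_j \left(1 - M(X_i, Y_j)\right)P_j - \alpha - \mu_L\right), \tag{4.6} \]
a pool of undefended bacteria with a missing or defective CRISPR-Cas system:
\[ \dot{U} = U\left(\frac{vR}{z+R} - \delta\sum_i P_i\right) + \mu_L\sum_i D_i, \tag{4.7} \]
and phages with protospacer set $Y_i$:
\[ \dot{P}_i = \delta P_i\left(U(\beta_i - 1) + \sum_j D_j\left(\beta_i\left(1 - M(X_j, Y_i)\right) - 1\right)\right), \tag{4.8} \]
and stochastic events occur according to a Poisson process with rate $\lambda$:
\[ \lambda = \sum_i \lambda_{B_i} + \sum_i \lambda_{P_i} + \sum_i \lambda_{Q_i}, \tag{4.9} \]
which is a sum of the total per-strain spacer-acquisition rates:
\[ \lambda_{B_i} = \mu_b\delta D_i\sum_j P_j, \tag{4.10} \]
total per-strain PAM mutation rates:
\[ \lambda_{P_i} = \mu_p\beta_i\delta P_i\left(U + \sum_j \left(1 - M(X_j, Y_i)\right)D_j\right), \tag{4.11} \]
and total per-strain PAM back mutation rates:
\[ \lambda_{Q_i} = \mu_q\beta_i\delta P_i\left(U + \sum_j \left(1 - M(X_j, Y_i)\right)D_j\right). \tag{4.12} \]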
In this way each unique CRISPR genotype ($X_i$), defined as a set of linked
spacers sharing the same array, is modeled individually, as is each phage genotype
(Yi). As new spacers are added and new PAMs undergo mutation, new pairs of
genotypes and equations are added to the system. Host that have undergone immune
loss are modeled separately (U), as if they have no CRISPR-Cas system.
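To illustrate how stochastic events governed by a total Poisson rate might be scheduled, the following Python sketch draws an exponential waiting time from the summed rate and then picks which event occurs in proportion to its share of that rate. This is a generic Gillespie-style step offered under stated assumptions, not the simulation procedure of Appendix C.3; the function names, the dictionary of labeled rates, and the symbol names `mu_b` and `delta` for the spacer acquisition and adsorption rates are all hypothetical.

```python
import random


def acquisition_rate(mu_b, delta, host_density, phage_densities):
    """Spacer-acquisition rate of one host strain, assuming the form of Eq. 4.10."""
    return mu_b * delta * host_density * sum(phage_densities)


def draw_next_event(rates, rng=random):
    """Draw (waiting time, event label) for a Poisson process with total rate sum(rates).

    `rates` maps a hypothetical event label to its non-negative rate.
    Returns None when no event can occur.
    """
    total = sum(rates.values())
    if total <= 0:
        return None
    waiting_time = rng.expovariate(total)  # exponential with mean 1 / total
    labels = list(rates)
    label = rng.choices(labels, weights=[rates[k] for k in labels], k=1)[0]
    return waiting_time, label
```

For example, with the Table 4.3 values for spacer acquisition and adsorption and densities of $10^6$ per mL for both a host strain and a phage strain, `acquisition_rate(5e-7, 1e-8, 1e6, [1e6])` gives a rate of about $5 \times 10^{-3}$ events per hour.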
The function M(Xi, Yj) is a binary matching function between (proto)spacer
content of bacterial and phage genomes that determines the presence or absence of
immunity. We refer to the order of a host or phage strain, which is the number
of evolutionary events that strain has undergone, $|X_i|$ or $n_s - |Y_i|$ respectively. The
PAM back mutation rate $\mu_q$ describes the rate at which we expect a mutated PAM to
revert to its original sequence (assuming the mutation is a substitution). While back
mutation is not required to generate stable host-dominated coexistence, it greatly
expands the relevant region of parameter space because it allows phages to avoid the
cost we will impose on PAM mutations, discussed below, when those immune escape
mutations are no longer beneficial. Recombination among viral strains could have a
similar effect by providing another route to an un-mutated or less mutated genome.
Páez-Espino et al. [150] suggest that recombination can produce stable host-dominated
coexistence, although we reject such diversity-driven hypotheses [e.g. 214] based on
our sequencing data.
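The binary matching function can be made concrete with a small Python sketch. It assumes one particular representation that the text does not spell out: a host genotype is the set of protospacer indices its spacers target, a phage genotype is the set of protospacers that still carry an intact PAM, and the host is immune whenever the two sets share at least one element. The names `matches` and `effective_phage_pressure` are hypothetical.

```python
def matches(spacers, protospacers):
    """Binary matching M(X_i, Y_j): 1 if the host is immune to the phage, else 0.

    Assumption: immunity requires at least one spacer that targets a
    protospacer still present (PAM intact) in the phage genotype.
    """
    return 1 if set(spacers) & set(protospacers) else 0


def effective_phage_pressure(spacers, phages):
    """Sum over phage strains of (1 - M) * P_j, the term entering the host equation.

    `phages` is an iterable of (protospacer set, density) pairs.
    """
    return sum((1 - matches(spacers, y)) * density for y, density in phages)
```

Under this representation a host carrying a spacer against protospacer 3 is immune to a phage with intact protospacers {1, 2, 3} but not to an escape mutant reduced to {1, 2}, so only the escape mutant contributes to the phage pressure that host experiences.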
We assume that the number of PAM mutations in a single phage genome is
constrained by a tradeoff with phage fitness, as this is necessary to prevent the total
clearance of protospacers from a single strain at high mutation rates. Increases
in host breadth at the species level generally come at a cost for viruses due to
pleiotropic effects [221]. More broadly, mutations tend to be deleterious on average
[e.g. 222]. It is reasonable to speculate that phages have evolved under pressure to
lose any active PAMs on their genomes, and thus that the persisting PAMs may
have been preserved because their loss is associated with a fitness cost.
The function
\[ \beta_i = -\frac{c\,\beta_{base}}{n_s}|Y_i| + \beta_{base} \tag{4.13} \]
incorporates a linear cost of mutation into the phage burst size. See Table 4.3
for further definitions of variables, functions, and parameters in Equations 4.5-4.13.
Simulation procedures and rationale for parameter values, including phage genome
size, are detailed in Appendix C.3.
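The linear cost of mutation can be illustrated with a short Python function. The sketch takes its argument to be the order of the phage strain (the number of PAM mutations it carries), following the description of burst size in Table 4.3, and uses the table's values as defaults; the function name `burst_size` is hypothetical, and no lower bound is applied because the text does not specify one.

```python
def burst_size(order, beta_base=80.0, c=3.0, n_s=10):
    """Burst size after `order` PAM mutations, assuming the linear form of Eq. 4.13.

    Each mutation removes c * beta_base / n_s particles from the maximum burst size.
    """
    return beta_base - c * beta_base / n_s * order
```

With these defaults each PAM mutation costs 24 particles, so an unmutated phage has a burst size of 80, a first-order escape mutant 56, and a second-order mutant 32.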
Symbol            Definition                                                     Value
R                 Resource concentration                                         350 μg/mL
$D_i$             Population size of CRISPR+ bacterial strain i                  $10^6$
$P_i$             Population size of phage strain i                              $10^6$
U                 Population size of CRISPR- bacteria                            $10^2$
$\lambda_{B_i}$   Mutation rate of bacterial strain i                            n/a
$\lambda_{P_i}$   PAM mutation rate of phage strain i                            n/a
$\lambda_{Q_i}$   PAM back mutation rate of phage strain i                       n/a
$\lambda$         Total rate of mutation events occurring in model               n/a
$M(X_i, Y_j)$     Matching function between spacer set of bacterial strain i
                  and protospacer set of phage strain j                          no matches initially
$\beta(|Y_i|)$    Burst size as a function of the order of phage strain i        $\beta(0) = 80$
$|X_i|$           Order of bacterial strain i                                    0
$|Y_i|$           Order of phage strain i                                        0
e                 Resource consumption rate of growing bacteria                  $5 \times 10^{-7}$ μg
v                 Maximum bacterial growth rate                                  1.4/hr
z                 Resource concentration for half-maximal growth                 1 μg
$\delta$          Adsorption rate                                                $10^{-8}$ mL per cell per phage per hr
$\beta_{base}$    Maximum burst size                                             80 particles per infected cell
$n_s$             Size of phage genome                                           10 protospacers
c                 Cost weight per PAM mutation                                   3
$\mu_L$           Per individual per generation CRISPR inactivation/loss rate    $5 \times 10^{-4}$
$\alpha$          Rate of autoimmunity                                           $50\mu_b$ deaths per individual per hr
$\mu_b$           Spacer acquisition rate                                        $5 \times 10^{-7}$ acquisitions per individual per hr
$\mu_p$           Per-protospacer PAM mutation rate                              $5 \times 10^{-8}$ mutations per spacer per individual per hr
$\mu_q$           PAM back mutation rate                                         $5 \times 10^{-9}$ mutations per spacer per individual per hr
Table 4.3: Definitions and oft-used values/initial values of variables, functions, and parameters for the simulation model
4.4.2.1 Stable Host-Dominated Coexistence
Simulations with immune loss reliably produce extended coexistence within a
realistic region of the parameter space (Fig 4.3) thus replicating our experimental
results (Fig 4.2), and confirming our qualitative results from the simpler deterministic
model (Fig 4.1). We observed no simulations in which autoimmunity alone
produced stable coexistence. This agrees with our earlier numerical results from
the general model where unrealistically high rates of autoimmunity were required
to produce coexistence.
Similar to our experimental results, for a single set of parameters this model
can stochastically fall into either stable coexistence or a phage-free state (Fig 4.3).
The relative frequencies with which we see each outcome, as well as the distribution
of times that phages are able to persist, depend on the specific set of parameters
chosen. In particular, increasing the PAM back mutation rate will increase the prob-
ability of the coexistence outcome (Fig 4.4), although even in the absence of back
mutation the system will occasionally achieve stable coexistence. This dependence
on back mutation is caused by the combined effects of the cumulative cost we impose
on PAM mutations and the inability of phages to keep up with host in a continuing
arms race. In the early stages of the arms race it is optimal for phages to continue
undergoing PAM mutations as the most abundant available hosts are high-order
CRISPR variants, whereas once hosts are able to pull sufficiently ahead of phages
in the arms race it becomes optimal for phages to feed on the lower-density but
consistently available CRISPR-lacking host population (Fig C.13).
[Figure 4.3: four histograms of frequency (0 to 50) against time to phage clearance (days; bins from 10 to > 70).
Panel titles: $\mu_L = 5 \times 10^{-4}$, $\alpha = 0$; $\mu_L = 5 \times 10^{-4}$, $\alpha = 50\mu_b$; $\mu_L = 0$, $\alpha = 0$; $\mu_L = 0$, $\alpha = 50\mu_b$.]
Figure 4.3: Distribution of phage extinction times in bacterial-dominated cultures
with different possible combinations of coexistence mechanisms. The peak at $\geq 75$
corresponds to what we call stable coexistence (simulations ran for a maximum of 80
days). There is no significant difference between the top two panels in the number
of simulations reaching the 80 day mark ($\chi^2 = 2.8904$, df = 1, p-value = 0.08911).
Back mutation was set at $\mu_q = 5 \times 10^{-9}$.
[Figure 4.4: three histograms of frequency (0 to 50) against time to extinction (days; bins from 10 to > 70).
Panel titles: $\mu_q = 5 \times 10^{-9}$; $\mu_q = 5 \times 10^{-10}$; $\mu_q = 5 \times 10^{-11}$.]
Figure 4.4: Distribution of phage extinction times in bacterial-dominated cultures
with different rates of PAM back mutation in phages ($\mu_q$). The peak at 80 corresponds
to what we call stable coexistence (simulations ran for a maximum of 80
days). These results are shown for a locus-loss mechanism only ($\mu_L = 5 \times 10^{-4}$,
$\alpha = 0$). The histogram for $\mu_q = 5 \times 10^{-8}$ is omitted as it is nearly identical to that
for $\mu_q = 5 \times 10^{-9}$, indicating that the height of the coexistence peak saturates at
high back mutation.
The adsorption rate, on a coarse scale, has an important effect on how the
model behaves (Fig C.14). At high values of $\delta$ where we would expect phages to
cause host extinction in the absence of CRISPR immunity ($\delta = 10^{-7}$) we see that
long-term coexistence occurs rarely, and is negatively associated with the phage
back mutation rate. In this case phages will rapidly consume the susceptible host
population and crash to extinction unless they have undergone PAM mutations that
lower their growth rate. This causes a reversal in the previous trend seen with back
mutation where the ability of phages to escape the costs of PAM mutation was
essential to their persistence. A decrease in the adsorption rate to a very low value
($\delta = 10^{-9}$) leads to most simulations persisting in host-dominated coexistence until
the 80 day cutoff. Because both evolutionary and demographic dynamics occur
much more slowly in this case, long term persistence does not necessarily imply
actual stability, as suggested by our and previous [150] experimental results in which
coexistence eventually ends. In general, lower adsorption rates lead to longer periods
of host-dominated coexistence and reduce the chance of phage extinction.
The failure of autoimmunity to produce coexistence warrants further investi-
gation. Upon closer examination, it is clear that in the early stages of the arms
race where CRISPR-enabled bacteria have not yet obtained spacers or been se-
lected for in the host population, phages are able to proliferate to extremely high
levels and greatly suppress the CRISPR-lacking host. Because autoimmunity as a
mechanism of coexistence relies on the continued presence of immune-lacking host,
it may not be able to function in the face of this early phage burst if susceptible
host are driven extinct. There is a possibility that very low locus loss rates that
reintroduce CRISPR-lacking bacteria but do not appreciably contribute to their
density combined with high rates of autoimmunity could maintain high enough den-
sity susceptible host populations to sustain phage. To investigate this possibility we
imposed a floor of $U > 1$ and ran further simulations. Even with very high rates of
autoimmunity based on an upper limit of likely spacer acquisition rates ($\alpha = 50\mu_b$,
$\mu_b = 10^{-5}$) the susceptible host population does not grow quickly enough to sufficiently
high levels to sustain phage (Fig C.15). Thus it is not early dynamics that
rule out autoimmunity but the insufficiency of the mechanism itself for maintaining
large enough susceptible host populations.
4.4.2.2 Transient Coexistence with Low Density Phage
While we do not observe stable coexistence in any case where there is not loss
of the CRISPR-Cas immune system, we did observe prolonged phage persistence
in some cases where $\mu_L = \alpha = 0$ (Fig 4.3) and in cases with autoimmunity only
($\mu_L = 0$). Phages were able to persist at very low density ($\sim$10-100 particles/mL)
for as long as two months in a host-dominated setting without the presence of a
CRISPR-lacking host subpopulation (Fig 4.3, Fig C.16). It appears that in these
cases phages are at sufficiently low density as to have a minimal effect on their host
population and thus that host strain is selected against very slowly. Because the
phages have undergone many PAM mutations at this point they are unable to pro-
liferate rapidly enough between dilution events to have an easily measurable impact
on the host population. Essentially, phages delay their collapse by consuming their
host extremely slowly (Fig C.16). However, with an active locus loss mechanism
(i.e., $\mu_L > 0$), we did not see this sustained but unstable coexistence occur, likely
because the undefended hosts would have driven the phage population to higher
levels and increased selection on the susceptible CRISPR variants.
4.5 Discussion
We paired a general model of immunity with a case study of the CRISPR
immune system to characterize and contrast the potential drivers of long-term host-
phage coexistence in well-mixed systems. We found that a tradeoff mechanism
does not lead to a robust coexistence equilibrium in the case of intracellular host
immunity. We also ruled out coevolutionary drivers of coexistence in the S. ther-
mophilus-phage 2972 system based on a combination of our own sequencing data
and previous work on the same system [26]. Since some mechanism(s) must be pro-
ducing susceptible hosts on which phages can replicate, we are left with an immune
loss hypothesis as the best remaining explanation for our empirical results. Our
simulations showed that the addition of early coevolutionary dynamics alongside
immune loss replicates key features of our experimental results, including stochastic
switching between the possible outcomes of long term coexistence and rapid phage
clearance. Therefore we predict that the CRISPR-Cas immune system is lost at
a nontrivial rate in S. thermophilus in addition to S. epidermidis [51], and possibly
other species.
With regards to CRISPR, while our experiments do not speak to the relative
importance of locus loss versus costly autoimmunity, our theoretical results reject
autoimmunity as a realistic mechanism of phage persistence. Our experimental
setup was in serial dilution, which subjects the culture to large daily perturbations,
ruling out any mechanism that does not produce a robust coexistence regime.
We emphasize that CRISPR immunity, and immunity in general, is still likely
costly [62]. Nevertheless, in cases of intracellular host immunity those costs are
insucient to drive continued phage persistence in the environment. Intracellular
immunity destroys phages rather than simply preventing phage replication. Thus
the threshold density of susceptible host for phage persistence is higher than in
systems where hosts have an extracellular defense strategy (i.e. receptor/envelope
modication), meaning the cost of immunity must be higher. When hosts escape
89
phage predation via receptor modications, a growth-resistance tradeo may lead
to coexistence.
Our sequencing results in the S. thermophilus system reject coevolutionary
mechanisms for coexistence. We can directly reject an arms race dynamic since
it predicts the rapid, continued accumulation of spacers, which does not occur in
our data. A fluctuating selection dynamic makes the more subtle prediction that
the frequencies of spacers in the population should cycle over time, with different
spacers dominating at different times. Even with relatively small sample sizes (<10
CRISPR loci sequenced per timepoint), we see a small cohort of spacers increase in
frequency early in the experiment and continue to be detected at later timepoints
(Table 4.2). These results are consistent with those of Paez-Espino et al. [26]
who performed deep sequencing with the same phage-host system and observed
dominant spacers that drifted in frequency over time. This continued detection and
dominance of particular spacers rules out strong fitness differences between these
spacers, which, in turn, contradicts the expectation of fluctuating selection that
fitnesses change over time. Our stochastic simulations agree, with coevolutionary
dynamics in the absence of loss or cost most often yielding rapid phage extinction
and only occasionally showing coexistence for over a month, but never exhibiting
sustained coexistence (Fig 4.3).
A similar model by Childs et al. [217] found that a fluctuating selection dy-
namic could lead to long term coexistence in a CRISPR-phage system when arrays
were saturated, in the sense that they were filled to some preset maximum capac-
ity with spacers, which we do not observe in our experimental data. The fact that
we see little expansion of the array suggests that hosts are completely immune to
phages, as rapid phage genome degradation inside the CRISPR-immune cells can
prevent further uptake of spacers [223].
While we conclude that immune loss plays a key role in our system, it is
not immediately clear why bacterial immune systems would lose functionality at
such a high rate. Our sequencing of the S. thermophilus CRISPR loci did not
reveal pervasive spacer loss events, indicating that immune loss is at the system
rather than spacer level. Perhaps in the case of CRISPR there is some inherent
instability of the locus, leading to high rates of horizontal transfer [50, 143, 218,
224, 225]. Jiang et al. [51] propose that CRISPR loss is a bet-hedging strategy
that allows horizontal gene transfer to occur in stressful environments (e.g., under
selection for antibiotic resistance). This proposal is consistent with evidence that
CRISPR does not inhibit horizontal gene transfer on evolutionary timescales [226].
A high rate of CRISPR loss and inactivation could produce pressure for bacteria
to frequently acquire new CRISPR-Cas systems through horizontal gene transfer,
perhaps explaining why strains with multiple CRISPR-Cas systems are frequently
observed, including S. thermophilus [56, 58]. This is consistent with a broader view
in which prokaryotic defense systems appear to be labile, having higher rates of gain
and loss than other genetic content [16].
While some clear anecdotes of immune loss exist [51, 205], other examples of
this phenomenon may have been missed because it is difficult to detect. Phages will
quickly destroy any evidence of loss, and loss rates can be low while still affecting
population dynamics. Jiang et al. [51] go to great lengths to demonstrate loss in
their system. Particularly with complex systems like CRISPR-Cas, a mutation in
any number of components can lead to inactivation, making loss hard to detect
from genetic screens. Phenotypic screens like those of Jiang et al. [51] require
the engineering of CRISPR spacer content and/or plasmid sequence as well as an
otherwise competent host.
Other paths to sustained coexistence between CRISPR-enabled hosts and
phages may also exist. There is a great diversity of CRISPR-Cas system types
and modes of action [30] and the particular mechanism of each system may lead to
distinct host-phage dynamics. That being said, our model of CRISPR evolutionary
dynamics is rather general, and we recovered similar qualitative results over a wide
range of parameter values apart from the S. thermophilus-specific parameter space.
Finally, our results show that the regular loss of immunity can sustain a viable
phage population, leading to the maintenance of selective pressure and thus keeping
immunity prevalent in the population overall. Even though long-term coexistence
with phages may not affect overall host population density, we suggest that, coun-
terintuitively, the periodic loss of individual immunity may drive the maintenance
of a high population immune prevalence.
Appendix A: Supplemental Information For: Visualization and pre-
diction of CRISPR incidence in microbial trait-space to
identify drivers of antiviral immune strategy
A.1 Outline of Analyses
Visualizations
• CRISPR Incidence
  - PCA (Figs 2.1 and A.9, Table 2.1)
  - t-SNE (Figs 2.2, A.10, and A.4)
• CRISPR Type
  - PCA (Fig 2.5)
• RM Incidence
  - PCA (Fig A.20)
• Ku Incidence
  - PCA (Fig A.17)
Predictive Models (Proteobacteria Test Set)
Comparison of all predictive models in Figs A.7 and A.8
• CRISPR Incidence (Table 2.2, Appendix A.4, Figs A.7 and A.8)
  - Logistic Regression with Forward Subset Selection and Random CV (Table A.1)
  - Logistic Regression with Forward Subset Selection and Blocked CV (Table A.1)
  - Phylogenetic Logistic Regression with Forward Subset Selection and
    Random CV (Table A.1)
  - Phylogenetic Logistic Regression with Forward Subset Selection and
    Blocked CV (Table A.1)
  - sPLS-DA (Fig A.11)
  - MINT sPLS-DA (Fig A.12)
  - Random Forest (Figs 2.3 and A.14)
  - Random Forest Ensemble (A.13)
  - Random Forest, no genetic information (Appendix A.2, Fig A.16)
• CRISPR Incidence With only Temperature and Oxygen as Predictors to Train
  Model
  - Random Forest
• Number of CRISPR Systems
  - Random Forest (regression; Appendix A.6)
• Type II CRISPR Incidence (cas9)
  - Random Forest (Fig 2.5)
• RM Incidence
  - Random Forest (Fig A.21)
• Ku Incidence
  - Random Forest (Fig A.18)
Phylogenetic Regressions
• CRISPR vs 16s rRNA count (also on bootstrapped trees; Table A.2)
• CRISPR vs Ku and Oxygen (also on bootstrapped trees, oxygen use from
  NCBI metadata; Appendix A.7, Table A.4)
• Number of Restriction Enzymes vs Temperature (also on bootstrapped trees;
  Table A.3)
• Number of Restriction Enzymes vs Oxygen (also on bootstrapped trees;
  Table A.3)
Metagenomic Data
• cas1,2,3,9,10 Coverage vs Dissolved Oxygen (Appendix A.5, Fig A.23)
Other
• CRISPR vs Temperature and Oxygen (NCBI metadata)
  - Binomial confidence intervals (Figs 2.4 and A.15)
  - χ² test
• Correlation between CRISPR and Ku
• Resampling genomes for CRISPR incidence (Appendix A.3, Fig A.22)
A.2 ProTraits Without Genomic Data
The ProTraits database, from which we take our trait data, combines vari-
ous sources of text-based and genomic information to make trait predictions [75].
While the inclusion of genomic sources of information considerably improves the trait
confidence scores, some of these sources explicitly consider gene presence/absence,
and we worried that this could introduce circularity into our arguments (e.g. if cas
gene presence were used to predict a trait, which was then used to predict CRISPR
incidence).
Therefore we repeated our predictive analyses excluding the phyletic profile and
gene neighborhood sources in ProTraits. We took the maximum confidence scores
for having and lacking a trait respectively across all other sources in the database
to produce a positive and negative trait score. We integrated these into a single
score as described in Methods. We then built an RF model of CRISPR incidence,
as this was the highest performing model on the complete dataset. This model had
comparable predictive ability (κ = 0.243). We also found similar predictors to when
the full dataset was used (Fig A.16). A notable change is that termite host and
PAH degradation no longer appear as important predictors in the model.
A.3 Resampling Genomes
For our main analysis we sampled one genome from RefSeq per species to
assess CRISPR incidence, with a preference for completely assembled reference and
representative genomes. Often, CRISPR is lost by members of a species [51], and
incidence can vary among strains [227]. Therefore, we attempted to determine the
potential eects of our sampling process. In general, it is better to sample single
genomes to assess CRISPR incidence rather than averaging across all genomes for a
given species, since species are unevenly represented in RefSeq and thus the variances
in incidence between species will not be equal. A drawback of sampling is that it
throws away information, although strong trends should still be apparent if species
have consistent tendencies to either possess or lack CRISPR. In fact, for 84% of the
species in our trait dataset, the available genomes either all have or all lack CRISPR
(no within-species variability). Thus, sampling should have relatively minor effects
on our outcome.
We verified this by repeatedly resampling CRISPR incidences from the set of
all RefSeq genomes (previously determined in [15]). First we randomly resampled a
new genome with known CRISPR incidence for each species in the dataset, then we
split the data into training and test sets (using Proteobacteria again as the test set)
and built an RF model as in the main text. This process was repeated 1000 times,
and the resulting κ values and top predictors are reported in Fig A.22. The results
of this analysis were consistent with our analysis in the main text.
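
The resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code: the `toy_genomes` mapping, the species names, and both function names are hypothetical, and the RF fitting that follows each resample is omitted.

```python
import random


def resample_one_genome_per_species(genomes, rng):
    """Draw one genome's CRISPR call (True/False) at random for each species.

    `genomes` maps a species name to the list of CRISPR incidences of its
    available genomes (hypothetical data structure).
    """
    return {species: rng.choice(calls) for species, calls in genomes.items()}


def fraction_invariant(genomes):
    """Fraction of species whose genomes all have or all lack CRISPR."""
    invariant = sum(1 for calls in genomes.values() if len(set(calls)) == 1)
    return invariant / len(genomes)


# Hypothetical toy data: three species with several genomes each.
toy_genomes = {
    "species_A": [True, True, True],
    "species_B": [False, False],
    "species_C": [True, False, True, True],
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [
    resample_one_genome_per_species(toy_genomes, rng) for _ in range(1000)
]
```

Only species with within-species variability (here `species_C`) can change between replicates, which is why a high fraction of invariant species implies that sampling has a minor effect.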
A.4 Additional Models and Phylogenetic Corrections
In addition to the RF models described in the main text, we built several other
models (described in Methods). Here we discuss their performance. For the logis-
tic regression models, taking phylogeny into consideration, both via blocked cross
validation (CV) (κ = 0.168) and an explicit model of trait evolution
(κ = 0.188), improved predictive ability relative to the phylogenetically-uninformed
models. When combined, these two corrections appeared to conflict with one an-
other (κ = 0.160). This is to be expected based on the different ways these two
approaches deal with the problem of shared evolutionary history. Blocked cross val-
idation prevents overfitting to the underlying tree by leaving out contiguous portions
of the data during the fitting process (see Methods). Phylogenetic logistic regres-
sion assumes an explicit model of trait evolution and attempts to fit that model
to the data using a provided phylogeny. Because blocked CV leaves out chunks of
the tree, phylogenetic logistic regression is unable to fit to those missing pieces of
the tree, and thus the method's performance is reduced. In other words, blocked
CV and phylogenetic logistic regression can both improve model performance when
working with phylogenetically structured data, but combining these two approaches
is unlikely to work well.
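
To make the contrast between blocked and random folds concrete, the sketch below builds both kinds of folds for the twelve-tip conceptual example of Fig A.6. It is an illustration only: the function names and the hard-coded clade labels are hypothetical, and in practice the blocks would be chosen from the phylogeny as described in Methods.

```python
import random


def blocked_folds(clade_of):
    """Group species into folds by clade, so each fold is one phylogenetic block."""
    folds = {}
    for species, clade in clade_of.items():
        folds.setdefault(clade, []).append(species)
    return list(folds.values())


def random_folds(species, k, rng):
    """Shuffle species and deal them into k folds, ignoring phylogeny."""
    shuffled = list(species)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Tip labels and clade membership follow the conceptual example in Fig A.6.
clade_of = {s: 1 for s in "ABCDE"}
clade_of.update({s: 2 for s in "FGH"})
clade_of.update({s: 3 for s in "IJKL"})

blocked = blocked_folds(clade_of)
randomized = random_folds(clade_of, k=3, rng=random.Random(1))
```

With blocked folds every held-out set is a whole clade, so a model is always evaluated on species that share no recent history with its training data; with random folds close relatives of the test species usually remain in the training set.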
Moving on to our partial least squares models (see Methods), sPLS-DA per-
formed better than all logistic regression models, indicating that multicollinearity
was likely a significant hurdle for logistic regression with subset selection (even more
so than phylogenetic structure). Our cluster-based approach to phylogenetic cor-
rection (MINT) for sPLS-DA reduced overall predictive ability, but dramatically
improved the true positive rate of the prediction (TPR = 0.538), at the cost of
an increased false positive rate. In general there is always a tradeoff between false
positive and false negative rates, but it is unclear to us why MINT sPLS-DA set
its threshold for detection so low in this case. This is possibly an artefact of the
differences in CRISPR incidence between our training and test sets, where MINT
sPLS-DA learned to predict CRISPR presence at too low a threshold due to an
overly CRISPR-enriched training set.
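
For readers unfamiliar with the performance measures used here, the sketch below computes Cohen's κ, the true positive rate, and the false positive rate from a set of binary predictions. The counts are hypothetical toy values chosen for easy arithmetic, not results from our models, and the function names are ours.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def cohens_kappa(tp, fp, fn, tn):
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: both say "present" or both say "absent" independently.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


def true_positive_rate(tp, fn):
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    return fp / (fp + tn)


# Hypothetical test set: 25 species with CRISPR, 75 without.
actual = [True] * 25 + [False] * 75
predicted = [True] * 20 + [False] * 5 + [True] * 10 + [False] * 65

tp, fp, fn, tn = confusion_counts(actual, predicted)
kappa = cohens_kappa(tp, fp, fn, tn)
tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
```

Lowering a classifier's detection threshold moves predictions from the negative to the positive class, which raises both the TPR and the FPR; κ summarizes the net effect on agreement beyond chance.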
The RF and phylogenetically-informed RF ensemble models had nearly iden-
tical performance. We note though, that the ensemble approach gave a much more
reliable estimate of predictive ability on the training set (mean κ = 0.258 predict-
ing on excluded clusters) than the internal estimate automatically generated by the
global RF model (out-of-bag estimate, κ = 0.441). In general, with phylogenetically
structured data the internal error estimates generated by an RF model will be mis-
leading, and the blocked cross-validation approach we employ is one way to correct
these estimates.
A.5 CRISPR in the Tara Oceans Data
An alternative, and complementary approach to the one we took here is to
directly measure the change in prevalence of a particular immune strategy (e.g.
CRISPR) across environmental gradients using metagenomics. This approach has
its own pitfalls and will require its own solutions. For example, in complex communi-
ties it may be difficult to link CRISPR proteins to particular genomes or organisms,
meaning that it will be difficult to differentiate between changes in CRISPR preva-
lence due to differential gene content in the same set of organisms and changes in
prevalence due to shifts in community composition. The situation is further compli-
cated by the fact that many organisms have multiple CRISPR systems, or conversely
have partial and non-functional systems, and that CRISPR and other defense sys-
tems are extremely labile, being gained and lost frequently [15, 16, 51]. This makes
the metagenome assembly process significantly more difficult with respect to cor-
rectly mapping CRISPR to host. We also note that our current dataset integrates
microbial traits across many scales, whereas a metagenomic approach will only link
CRISPR prevalence to the coarsest scale of environmental parameters. Even consid-
ering oxygen, in many environments there is a possibility for extremely fine-grained
variation that allows aerobes and anaerobes to live in close proximity (e.g. anoxic
sediments in wetlands aerated by plant roots). In other words, our approach in
the main text labels microbes as "is this", whereas relating environmental gradients
to metagenomic data labels microbes as "lives here", where "here" is by necessity
an average across the sample. A metagenomics approach links immune strategy to
microbial traits indirectly via environment.
Nevertheless, metagenomics is an attractive alternative because it allows us to
analyze strategy shifts actually occurring in the environment. While it is beyond the
scope of the current study to perform extensive analyses of metagenomic datasets,
we wish to provide an encouraging example to motivate future work in what we
think is an exciting area. We used data from the Tara Oceans project [110], a
global study of the microbial communities in earth's oceans, as our case study. The
dataset consisted of a set of functional profiles provided by Tara, in which reads
were mapped to particular orthologous groups (OG) using the KEGG orthologous
groups database, as well as environmental metadata for each sample [110]. We
identied the OGs for cas1 and cas2 (universal CRISPR marker genes; K15342 and
K09951), cas3 (type I marker; K070012), cas9 (type II; K09952), and cas10 (type
III; K07016). We then normalized the coverage of each OG by total coverage in a
given sample, and paired this data with the dissolved oxygen concentration for each
sample.
Similar to our results based on ProTraits, we found a negative association
between oxygen and CRISPR (Fig A.23). We found a significant negative correlation
between oxygen and cas1 (Pearson's product moment correlation, ρ = -0.1757433,
p = 0.00668), cas2 (ρ = -0.2254487, p = 0.0004696), cas3 (ρ = -0.1939399,
p = 0.002714), and cas10 (ρ = -0.4018567, p = 1.304 × 10⁻¹⁰). The relationship
between oxygen and cas9 was not significant (ρ = -0.03446256, p = 0.5976). We
note that this data doesn't strictly represent an oxygen gradient since dissolved
oxygen content appears to be bimodal, with peaks corresponding to oxic and anoxic
conditions (Fig A.23).
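
The normalization and correlation steps described above can be sketched as follows. This is a minimal illustration with hypothetical toy numbers (four samples, made-up coverages and oxygen values); the function names are ours, and the p-values reported above are not reproduced here.

```python
import math


def normalize_coverage(og_coverage, total_coverage):
    """Divide each sample's OG coverage by that sample's total coverage."""
    return [c / t for c, t in zip(og_coverage, total_coverage)]


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical toy samples: cas1 coverage falls as dissolved oxygen rises.
cas1_coverage = [4.0, 3.0, 2.0, 1.0]
total_coverage = [1000.0, 1000.0, 1000.0, 1000.0]
dissolved_oxygen = [0.0, 50.0, 150.0, 250.0]

cas1_relative = normalize_coverage(cas1_coverage, total_coverage)
r = pearson_r(dissolved_oxygen, cas1_relative)
```

Normalizing by total coverage keeps differences in sequencing depth between samples from being mistaken for differences in cas gene abundance.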
A.6 Number of CRISPR Arrays
Many prokaryotes have multiple CRISPR arrays, and this multiplicity is po-
tentially maintained by selection [15]. We sought to assess whether we could predict
the multiplicity of CRISPR arrays on a genome using our trait data. CRISPRDetect
identies individual arrays, so that our original dataset already included informa-
tion about array multiplicity as well as incidence. We excluded all species lacking
CRISPR so as not to confound the question of incidence (who has CRISPR?) with
multiplicity (how many CRISPR arrays do they have?). Random forests can be used
on continuous outcome variables (regression), and so we built an RF model using the
same procedure as in the main text, but with multiplicity rather than incidence
as the outcome variable. This model performs extremely poorly, with essentially
no predictive ability (MSE = 4.26, R² = 0.008). The predicted and actual values
on the test set were barely significantly correlated (Pearson's correlation, ρ = 0.09,
p = 0.048). This is not entirely surprising, as regression is generally more difficult
than classification. In other words, it is harder to predict whether an organism has
one, two, or three CRISPR arrays than it is to predict if it has CRISPR at all.
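
To give a sense of what an R² near zero means, the sketch below computes the mean squared error and one common definition of R² (one minus the ratio of residual to total sum of squares) for a model that always predicts the mean. The array counts are hypothetical, and we do not claim this is the exact R² definition reported by the RF software used above.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - SS_residual / SS_total, relative to always predicting the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical array counts for four CRISPR-bearing species.
actual_arrays = [1, 2, 3, 4]
# A model with no predictive ability: it predicts the mean for every species.
mean_only = [2.5, 2.5, 2.5, 2.5]

mse = mean_squared_error(actual_arrays, mean_only)
r2 = r_squared(actual_arrays, mean_only)
```

Under this definition a model that simply predicts the mean number of arrays has R² = 0, so a value of 0.008 indicates essentially no improvement over that baseline.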
A.7 NHEJ-Oxygen Model
Using our annotations for Ku and the NCBI annotations for oxygen require-
ment (aerobes and anaerobes only, facultative organisms excluded) we compiled a
set of 1473 genomes for which both pieces of information were available. We built
a phylogeny for these genomes using the method described in the main text. We
then built a phylogenetically corrected linear model with CRISPR incidence as the
binary outcome variable, Ku and aerobicity as binary predictors, as well as an inter-
action term (phylogenetic logistic regression, phylolm R package; [92, 93]). Ives and
Garland [92] recommend that, when categories have small sample sizes (as does our
anaerobe-with-Ku category at 33 genomes), p-values for phylogenetic logistic
regression be obtained via bootstrapping, although this method is more computa-
tionally intensive. We performed 1000 bootstrap replicates (the 'boot' option in the
phyloglm() function) to assess the statistical significance of each term in the model.
We repeated this analysis with the cas3, cas9, and cas10 genes, which are diagnostic
of CRISPR system type, in order to see if any Ku-oxygen-CRISPR interaction was
type-specific.
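
Written out, the fixed-effects part of the model described above takes the following form. The coefficient symbols are notation introduced here for illustration, with Ku and O2 coded as binary indicators for species i; the phylogenetic correlation structure estimated by phylogenetic logistic regression is omitted from this sketch.

```latex
\operatorname{logit}\,\Pr(\mathrm{CRISPR}_i = 1)
  = \beta_0
  + \beta_{\mathrm{Ku}}\,\mathrm{Ku}_i
  + \beta_{\mathrm{O_2}}\,\mathrm{O}_{2,i}
  + \beta_{\mathrm{Interaction}}\,\mathrm{Ku}_i\,\mathrm{O}_{2,i}
```

The interaction coefficient measures how much the association between Ku and CRISPR incidence changes between aerobes and anaerobes.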
Our bootstrapped p-values for both Ku and oxygen, as well as their inter-
action, were all below 0.001 (all bootstrapped coefficients differed from zero in a
consistent direction across all replicates). These p-values differed from the maxi-
mum likelihood estimates generated from the phylogenetic logistic regression model
(notably, the interaction between Ku and oxygen was not significant using these es-
timates, at p = 0.088164, though the effects of Ku, p = 1.53 × 10⁻⁵, and oxygen,
p = 0.001183, remained significant), though this should not be surprising as the
behavior of these estimates is not well characterized at low sample sizes and
bootstrap estimates are generally the favored approach [92].
For type I and III systems, the results were generally consistent. In the case of
type I systems all model terms were significant under bootstrapping (Ku p < 0.001,
oxygen p = 0.016, interaction p < 0.001) but when using p-values from the ML
estimate oxygen was not a significant predictor of cas3 incidence (Ku p = 4.164 ×
10⁻¹⁰, oxygen p = 0.1138959, interaction p = 0.0004477). The same was true for
type III systems in terms of bootstrapped p-values (Ku p < 0.001, oxygen p = 0.039,
interaction p = 0.002) and those from the ML estimates (Ku p = 0.0002035, oxygen
p = 0.3048236, interaction p = 0.0014942). For type II systems only the effects
of Ku were significant, and only in the bootstrapped case (Ku p = 0.005, oxygen
p = 0.052, interaction p = 0.0546), not for the ML estimates (Ku p = 0.1176, oxygen
p = 0.2542, interaction p = 0.7550).
For all of these phylogenetic regressions, results were consistent on 10 boot-
strapped trees (Table A.4).
Logistic Regression
Random CV Blocked CV
temperaturerange_thermophilic (+) temperaturerange_thermophilic (+)
mammalian_pathogen_oral_cavity (+) mammalian_pathogen_oral_cavity (+)
knownhabitats_freshwater (+) metabolism_carbondioxidefixation (+)
ecosystemtype_marine (-) host_insectstermites (+)
pathogenic_in_mammals (-) ecosystemtype_geologic (+)
knownhabitats_hydrothermalvent (+) energysource_autotroph (+)
metabolism_carondioxidexation (+) ecosystemsubtype_vagina (+)
host_insectstermites (+) metabolism_sulfuroxidizer (-)
shape_tailed (+) habitat_terrestrial (-)
knownhabitats_soil (-)
knownhabitats_creosotecontaminatedsoil (-)
energysource_heterotroph (-)
cellarrangement_tetrads (-)
ecosystemsubtype_vagina (+)
knownhabitats_insectendosymbiont (-)
ecosystemtype_thermalsprings (+)
habitat_hostassociated (+)
cellarrangement_singles (-)
Phylogenetic Logistic Regression
Random CV Blocked CV
knownhabitats_hotspring (+) knownhabitats_hotspring (+)
mammalian_pathogen_oral_cavity (+) mammalian_pathogen_oral_cavity (+)
host_insectstermites (+) host_insectstermites (+)
shape_lamentous (+) shape_lamentous (+)
oxygenreq_strictaero (-) oxygenreq_strictaero (-)
ecosystemtype_reproductivesystem (+) energysource_heterotroph (-)
mammalian_pathogen_respiratory_lundisease (-)
ecosystemtype_marine (-)
knownhabitats_hydrothermalvent (+)
ecosystemcategory_plants (-)
Table A.1: Predictors added to each logistic regression model during forward
selection (top to bottom in order of addition). Plus and minus signs indicate whether
a variable is positively or negatively associated with CRISPR incidence.
Outcome Variable Bootstrap β_16s p_16s
CRISPR 1 0.05444265 0.0004987372
CRISPR 2 0.06871256 1.650863E-05
CRISPR 3 0.05602856 0.0003348601
CRISPR 4 -0.06244074 4.65824E-05
CRISPR 5 -0.06051718 7.066252E-05
CRISPR 6 -0.0656118 1.96243E-05
CRISPR 7 -0.06858516 9.275297E-06
CRISPR 8 -0.06523327 2.200228E-05
CRISPR 9 -0.06482068 2.414822E-05
CRISPR 10 0.06773283 1.521424E-05
Table A.2: Phylogenetic logistic regression of CRISPR incidence as predicted by 16s
rRNA count on 10 bootstrapped trees.
Outcome Variable Bootstrap β_Temperature p_Temperature
No. R Enzymes 1 1.46992099 0.04000844
No. R Enzymes 2 1.47619604 0.0395859
No. R Enzymes 3 0.05938679 0.67987639
No. R Enzymes 4 1.47619604 0.0395859
No. R Enzymes 5 1.49642946 0.03825504
No. R Enzymes 6 1.46992099 0.04000844
No. R Enzymes 7 1.46196112 0.04055124
No. R Enzymes 8 1.51134694 0.0373039
No. R Enzymes 9 0.0593766 0.67990236
No. R Enzymes 10 1.51134694 0.0373039
Outcome Variable Bootstrap β_O2 p_O2
No. R Enzymes 1 -4.5032905 4.775284E-35
No. R Enzymes 2 -4.5236085 3.343288E-35
No. R Enzymes 3 -0.9838951 1.133434E-08
No. R Enzymes 4 -4.5262046 3.194396E-35
No. R Enzymes 5 -4.540566 2.482726E-35
No. R Enzymes 6 -4.5216758 3.458622E-35
No. R Enzymes 7 -4.5195994 3.586961E-35
No. R Enzymes 8 -4.5446124 2.312513E-35
No. R Enzymes 9 -0.9838037 1.135213E-08
No. R Enzymes 10 -4.5419952 2.421222E-35
Table A.3: Phylogenetic regression of number of restriction enzymes as predicted
by temperature or oxygen on 10 bootstrapped trees.
Outcome Variable Bootstrap β_Ku p_Ku β_O2 p_O2 β_Interaction p_Interaction
CRISPR 1 -0.7776371 0.001 0.66532 0.001 0.8076997 0.001
CRISPR 2 -0.7762393 0.001 0.6650369 0.001 0.7553189 0.001
CRISPR 3 -0.7543548 0.001 0.6523729 0.001 0.7743154 0.002
CRISPR 4 -0.7475297 0.001 0.6470698 0.001 0.7970853 0.001
CRISPR 5 -0.7542887 0.001 0.6499091 0.001 0.7606847 0.001
CRISPR 6 -0.7750393 0.001 0.6560871 0.001 0.7523748 0.001
CRISPR 7 -0.752069 0.001 0.6479241 0.001 0.8017921 0.001
CRISPR 8 -0.7570104 0.001 0.6461722 0.001 0.7978345 0.001
CRISPR 9 -0.7931716 0.001 0.676406 0.001 0.7285855 0.001
CRISPR 10 -0.7782145 0.001 0.6634043 0.001 0.7725446 0.002
cas3 1 -1.309984 0.001 0.3429482 0.009 1.586328 0.001
cas3 2 -1.339081 0.001 0.3289939 0.007 1.582142 0.001
cas3 3 -1.304121 0.001 0.3257619 0.017 1.581483 0.001
cas3 4 -1.311048 0.001 0.2938582 0.011 1.590015 0.001
cas3 5 -1.333989 0.001 0.3291178 0.023 1.586987 0.001
cas3 6 -1.315037 0.001 0.3253995 0.008 1.585864 0.001
cas3 7 -1.308 0.001 0.2901924 0.014 1.58649 0.001
cas3 8 -1.325072 0.001 0.3139974 0.023 1.59048 0.001
cas3 9 -1.351763 0.001 0.3449322 0.007 1.590153 0.001
cas3 10 -1.328922 0.001 0.3384614 0.014 1.584191 0.001
cas9 1 -0.6166104 0.001 0.3386759 0.05 -0.3086753 0.536
cas9 2 -0.5713407 0.011 0.4218407 0.019 -0.3035605 0.538
cas9 3 -0.6371981 0.002 0.380116 0.04 -0.3139595 0.545
cas9 4 -0.2427449 0.05 1.0055809 0.025 -0.3538297 0.451
cas9 5 -0.598588 0.005 0.3807694 0.04 -0.304255 0.556
cas9 6 -0.6344803 0.006 0.4011055 0.02 -0.3178936 0.565
cas9 7 -0.6162547 0.01 0.3864871 0.017 -0.2999418 0.555
cas9 8 -0.6261813 0.002 0.3399736 0.059 -0.3096002 0.564
cas9 9 -0.6106022 0.006 0.4121876 0.011 -0.3016357 0.552
cas9 10 -0.5917813 0.003 0.381879 0.051 -0.3111323 0.523
cas10 1 -2.78319 0.001 0.338453 0.048 3.0924286 0.001
cas10 2 -2.76847 0.001 0.3371324 0.028 3.0689795 0.001
cas10 3 -0.735233 0.002 0.2797778 0.039 0.7382052 0.028
cas10 4 -1.209831 0.001 0.3202466 0.027 1.2494469 0.014
cas10 5 -2.927175 0.001 0.3800277 0.047 3.1038217 0.001
cas10 6 -2.750308 0.001 0.3269329 0.05 3.080121 0.001
cas10 7 -2.706733 0.001 0.3376516 0.032 2.9835339 0.001
cas10 8 -2.839044 0.001 0.3164233 0.064 3.116515 0.001
cas10 9 -1.944355 0.001 0.3521944 0.026 2.244925 0.002
cas10 10 -2.891364 0.001 0.3741208 0.03 3.1046525 0.001
Table A.4: Phylogenetic logistic regression of CRISPR incidence as predicted by Ku
and oxygen on 10 bootstrapped trees. Bootstrapped p-values shown as discussed in
Appendix A.7.
[Figure A.1: phylogeny image not reproduced. Legend: No CRISPR, CRISPR.]
Figure A.1: Phylogeny generated from PhyloSift marker genes (as in Fig A.3).
Color indicates CRISPR incidence.
[Figure A.2: flowchart image not reproduced. Box labels: ProTraits; NCBI's RefSeq
Genomes; PhyloSift Marker Gene; Convert to Trait Score; Find CRISPR arrays with
CRISPRDetect; BLAST REBASE to find RMs; HMMER + Pfam to find Ku; FastTree; Impute
missing values with missForest(); Sample one genome per species. Output labels:
Traits; CRISPR, RM, Ku; Phylogeny.]
Figure A.2: Pipeline for generating trait and immunity dataset and matching phy-
logeny. See Methods for details.
[Figure A.3: phylogeny image not reproduced. Legend (phyla): Actinobacteria,
Alphaproteobacteria, Bacteroidetes, Betaproteobacteria, Cyanobacteria,
Deltaproteobacteria, Euryarchaeota, Firmicutes, Gammaproteobacteria, Other,
Spirochaetes, NA.]
Figure A.3: Phylogeny generated from PhyloSift marker genes. Phylum indicated
by color, with taxonomic classifications taken from NCBI.
[Figure A.4: image not reproduced. Ten t-SNE panels, (a) through (j), at
perplexity = 20, 40, 60, 80, 100, 120, 140, 160, 180, and 200, each with marginal
density plots. Points are colored as CRISPR or No CRISPR.]
Figure A.4: Repeated t-SNE decomposition of ProTraits data with CRISPR in-
cidence visualized for varied perplexity values. The CRISPR versus no-CRISPR
separation is somewhat less apparent for very high perplexity values.
[Figure A.5: flowchart image not reproduced. Starting from "Modeling Approaches",
the chart branches on whether to account for shared evolutionary history.
Accounting for it by fitting an explicit model of trait evolution leads to
phylogenetic logistic regression (simple linear model; assumes some evolutionary
model). Accounting for it by preventing fitting to the underlying tree (leaving
sections out) leads to logistic regression with blocked CV (simple linear model),
MINT sPLS-DA (deals with multicollinearity), and the RF ensemble (includes
nonlinear relationships); these treat shared evolutionary history as a problem of
"overfitting" to the underlying tree. Not accounting for it leads to logistic
regression with random CV, sPLS-DA (partial least squares), and RF (random
forest), with the caveat that if datapoints share evolutionary history, results
will not be generalizable or ecologically relevant.]
Figure A.5: Flowchart showing the decision-making process that would lead to the
various modeling approaches used here. Major considerations for each approach are
noted. See Methods for details on each approach.
[Figure A.6: example phylogeny with tips A through L plus an outgroup, alongside the fold assignments below.]
         Blocked Folds    Random Folds
Fold 1   A, B, C, D, E    D, H, K, L
Fold 2   F, G, H          A, B, G, I
Fold 3   I, J, K, L       C, E, F, J
Figure A.6: A conceptual example of the differences between blocked and random
folds for cross validation. Cross validation (CV) relies on the assumption that folds
are independent from one another, but when species share an evolutionary history
this assumption is violated. By instead choosing folds based on phylogenetic groups
that have diverged from each other sufficiently far in the past, we can better avoid
the inclusion of phylogenetic signal in our model fit. In other words, in blocked CV
we attempt to choose evolutionarily independent folds.
[Figure A.7: grid of models (rows) by predictors (columns), with Cohen's κ for each model plotted at right (axis ticks at 0.15, 0.20, 0.25). Rows, top to bottom:]
Random_Forest_NoGeneInference
Random_Forest
Ensemble_Random_Forest_Mod5
Ensemble_Random_Forest_Mod4
Ensemble_Random_Forest_Mod3
Ensemble_Random_Forest_Mod2
Ensemble_Random_Forest_Mod1
sPLSDA_Component5
sPLSDA_Component4
sPLSDA_Component3
sPLSDA_Component2
sPLSDA_Component1
Phylogenetic_Logistic_Regression
MINT
Logistic_Regression_BlockedCV
Phylogenetic_Logistic_Regression_BlockedCV
Logistic_Regression
Figure A.7: Comparison of variable importance across predictive models. The top 10 most important variables (columns)
for each predictive model of CRISPR incidence (rows) are indicated by filled black cells. Models are ordered top to bottom in
order of decreasing performance in terms of Cohen's κ (shown at right). Note that for the high performing models, temperature
variables and oxygen (anaerobe and aerobe) are consistently found in the top 10 predictors. All models incorporate temperature
as an important predictor, and the only models without oxygen as a top predictor are the two logistic regression models that
were not formally corrected for phylogeny (and were low-performing). In general, moderate and high performing models are
largely in agreement about a core set of important variables. NoGeneInference corresponds to the model built in Appendix
A.2.
[Figure A.7 column labels (predictors), in the order extracted:]
temperaturerange_thermophilic
mammalian_pathogen_oral_cavity
knownhabitats_freshwater
ecosystemtype_marine
pathogenic_in_mammals
knownhabitats_hydrothermalvent
metabolism_carbondioxidefixation
host_insectstermites
shape_tailed
knownhabitats_soil
ecosystemtype_geologic
energysource_autotroph
ecosystemsubtype_vagina
metabolism_sulfuroxidizer
habitat_terrestrial
knownhabitats_plantsymbiont
knownhabitats_hotspring
shape_filamentous
oxygenreq_strictaero
ecosystemtype_reproductivesystem
mammalian_pathogen_respiratory_lungdisease
ecosystemcategory_plants
energysource_heterotroph
host_mammalsherbivores
oxygenreq_strictanaero
ecosystemtype_thermalsprings
temperaturerange_mesophilic
temperaturerange_hyperthermophilic
metabolism_pahdegrading
energysource_chemolithotroph
habitat_aquatic
ecosystemsubtype_blood
ecosystem_environmental
ecosystem_hostassociated
ecosystemcategory_animal
knownhabitats_humanoralcavity
cellarrangement_filaments
ecosystemcategory_mammals
energysource_photoautotroph
d.fucose
ecosystemsubtype_oral
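Because the models in Figure A.7 are ranked by Cohen's κ, a short sketch of how κ is computed from a two-class confusion matrix may be useful. The function name and the example counts are hypothetical and serve only to illustrate the formula; they are not results from this work.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of predicted vs. observed labels.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 70% raw agreement, 50% expected by chance.
print(cohens_kappa(tp=20, fp=5, fn=10, tn=15))
```

Unlike raw accuracy, κ is zero for a classifier that does no better than chance given the class frequencies, which matters when CRISPR presence and absence are unevenly represented.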
Figure A.8: Comparison of variable importance across predictive models. Pearson's
correlation between variable importance scores for all predictive models (CRISPR
incidence and otherwise) in the paper, measured as % increase in node purity for
random forests and variable importance projections for PLS models; logistic regression
models were excluded because their importance is measured as a rank (i.e., the order
in which the variable was added to the model). Note the high agreement among the models
predicting CRISPR incidence, and some agreement with the model predicting
number of CRISPR systems. Also note that models predicting the incidence of RM
systems and Ku appear to have distinct predictors (these models performed well
at prediction tasks in the main text). NoGeneInference corresponds to the model
built in Appendix A.2.
[Figure A.9: scatter plot of PC1 (x-axis) against PC2 (y-axis), with points labeled CRISPR or No CRISPR and marginal density plots along each axis.]
Figure A.9: Organisms with CRISPR do not separate from those without along
the first principal component of trait space. The first and second components from
a PCA of the microbial traits dataset are shown. CRISPR incidence is indicated
by color (green with, orange without), but was not included when constructing the
PCA. Marginal densities along each component are shown to facilitate interpretation.
See Fig 2.1 for the third component.
[Figure A.10 panels, each a t-SNE embedding colored by one trait: (a) ecosystemcategory_human, (b) specificecosystem_sediment, (c) ecosystem_environmental, (d) temperaturerange_mesophilic, (e) temperaturerange_thermophilic, (f) oxygenreq_strictanaero, (g) growth_in_groups, (h) gram_stain_positive, (i) cellarrangement_singles.]
Figure A.10: Trait distributions over t-SNE reduced dataset. Each point is an
organism mapped onto our t-SNE decomposition of trait space. Instead of coloring
points by presence/absence of CRISPR as shown in Fig 2.2, we color each organism
by its score for selected microbial traits in our trait dataset (set of traits shown
chosen because they were highly weighted in our PCA). Recall that scores range
from zero (blue) to one (red). We note that, in a general sense, the region occupied
by anaerobic microbes appears to correspond to the densest regions of CRISPR
incidence in Fig 2.2.
[Figure A.11 panels: top 10 predictors by variable importance in the projection on each component (x-axis, 0 to 8), listed top to bottom as plotted.]
C1: temperaturerange_thermophilic, knownhabitats_hotspring, oxygenreq_strictanaero, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_hyperthermophilic, temperaturerange_mesophilic, ecosystemcategory_animal, ecosystem_hostassociated, ecosystem_environmental
C2: temperaturerange_thermophilic, knownhabitats_hotspring, oxygenreq_strictanaero, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_hyperthermophilic, temperaturerange_mesophilic, mammalian_pathogen_oral_cavity, knownhabitats_humanoralcavity, metabolism_carbondioxidefixation
C3: temperaturerange_thermophilic, knownhabitats_hotspring, shape_filamentous, oxygenreq_strictanaero, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_hyperthermophilic, temperaturerange_mesophilic, cellarrangement_filaments, mammalian_pathogen_oral_cavity
C4: temperaturerange_thermophilic, knownhabitats_hotspring, shape_filamentous, oxygenreq_strictanaero, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_hyperthermophilic, temperaturerange_mesophilic, cellarrangement_filaments, mammalian_pathogen_oral_cavity
C5: temperaturerange_thermophilic, knownhabitats_hotspring, shape_filamentous, oxygenreq_strictanaero, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_hyperthermophilic, temperaturerange_mesophilic, ecosystemcategory_mammals, energysource_photoautotroph
Figure A.11: Variable importance scores from sPLS-DA model for top 10 predictors
on the 5 components included in model. Variable importance scores generated by
the vip() function in the mixOmics package for R.
[Figure A.12: top 10 predictors by variable importance in the projection on C1 (x-axis, 0 to 5), listed top to bottom as plotted.]
C1: temperaturerange_thermophilic, knownhabitats_hotspring, oxygenreq_strictanaero, temperaturerange_hyperthermophilic, host_insectstermites, ecosystemtype_thermalsprings, temperaturerange_mesophilic, metabolism_pahdegrading, knownhabitats_hydrothermalvent, oxygenreq_strictaero
Figure A.12: Variable importance scores from MINT sPLS-DA model for top 10
predictors on the single component included in model. Variable importance scores
generated by the vip() function in the mixOmics package for R.
[Figure A.13 panels: top 10 predictors in each of the five forests by mean decrease in Gini impurity index (x-axis, 0 to 15), listed top to bottom as plotted.]
Panel 1: knownhabitats_hotspring, temperaturerange_thermophilic, ecosystemtype_thermalsprings, oxygenreq_strictaero, oxygenreq_strictanaero, temperaturerange_mesophilic, metabolism_pahdegrading, host_insectstermites, knownhabitats_hydrothermalvent, energysource_chemolithotroph
Panel 2: oxygenreq_strictanaero, knownhabitats_hotspring, host_insectstermites, temperaturerange_thermophilic, ecosystemtype_thermalsprings, temperaturerange_mesophilic, temperaturerange_hyperthermophilic, knownhabitats_freshwater, knownhabitats_hydrothermalvent, oxygenreq_strictaero
Panel 3: oxygenreq_strictanaero, host_insectstermites, temperaturerange_thermophilic, knownhabitats_hotspring, temperaturerange_mesophilic, ecosystemtype_thermalsprings, habitat_aquatic, metabolism_pahdegrading, ecosystemsubtype_blood, temperaturerange_hyperthermophilic
Panel 4: host_insectstermites, knownhabitats_hotspring, oxygenreq_strictanaero, temperaturerange_thermophilic, ecosystemtype_thermalsprings, ecosystemsubtype_blood, knownhabitats_freshwater, temperaturerange_mesophilic, temperaturerange_hyperthermophilic, oxygenreq_strictaero
Panel 5: oxygenreq_strictanaero, temperaturerange_thermophilic, host_insectstermites, knownhabitats_hotspring, ecosystemtype_thermalsprings, temperaturerange_mesophilic, temperaturerange_hyperthermophilic, mammalian_pathogen_oral_cavity, oxygenreq_strictaero, knownhabitats_freshwater
Figure A.13: Importance of top ten predictors in each of the five forests included
in the RF ensemble model, as measured by the mean decrease in the Gini impurity
index or accuracy when that variable is excluded from the respective forest. The
relative ranking of the top ten predictors does vary somewhat over the five forests,
but the set of top predictors is largely consistent across the forests.
oxygenreq_strictanaero oxygenreq_strictanaero
host_insectstermites temperaturerange_thermophilic
knownhabitats_hotspring knownhabitats_hotspring
temperaturerange_thermophilic host_insectstermites
ecosystemtype_thermalsprings ecosystemtype_thermalsprings
temperaturerange_mesophilic temperaturerange_mesophilic
temperaturerange_hyperthermophilic oxygenreq_strictaero
knownhabitats_freshwater knownhabitats_humanoralcavity
oxygenreq_strictaero knownhabitats_freshwater
knownhabitats_hydrothermalvent metabolism_pahdegrading
metabolism_pahdegrading temperaturerange_hyperthermophilic
energysource_chemolithoautotroph knownhabitats_hydrothermalvent
ecosystemsubtype_blood ecosystemtype_digestivesystem
ecosystemsubtype_hydrothermalvents ecosystemsubtype_blood
mammalian_pathogen_respiratory mammalian_pathogen_respiratory
habitat_aquatic habitat_aquatic
ecosystemtype_digestivesystem ecosystemsubtype_oral
mammalian_pathogen_oral_cavity ecosystemcategory_human
knownhabitats_humanoralcavity ecosystemsubtype_hydrothermalvents
energysource_chemolithotroph habitat_specialized
ecosystemsubtype_oral mammalian_pathogen_oral_cavity
shape_filamentous metabolism_nitrogenfixation
host_plantsexceptlegumes ecosystemtype_dairyproducts
energysource_heterotroph energysource_chemolithotroph
knownhabitats_mud habitat_single
metabolism_nitrogenfixation shape_filamentous
ecosystemcategory_human knownhabitats_gastrointestinaltract
habitat_specialized energysource_heterotroph
ecosystemtype_geologic specificecosystem_fecal
host_mammalsman ecosystemsubtype_oceanic
energysource_phototroph host_mammalsman
halophilic knownhabitats_marine
ecosystemsubtype_oceanic knownhabitats_humanintestinalmicroflora
mammalian_pathogen_respiratory_lungdisease host_plantsexceptlegumes
metabolism_carbondioxidefixation ecosystemcategory_foodproduction
knownhabitats_marine habitat_multiple
ecosystemsubtype_mine cellarrangement_pairs
knownhabitats_plants shape_bacilli
ecosystemtype_marine ecosystemsubtype_mine
energysource_chemoorganotroph habitat_freeliving
energysource_lithotroph mammalian_pathogen_enteric
mammalian_pathogen_enteric ecosystemtype_soil
knownhabitats_gastrointestinaltract habitat_terrestrial
habitat_terrestrial knownhabitats_host
shape_bacilli knownhabitats_soil
energysource_chemoheterotroph growth_in_groups
mammalian_pathogen_oportunisticnosocomial energysource_chemoorganotroph
shape_tailed ecosystemtype_marine
ecosystemcategory_insecta energysource_chemolithoautotroph
habitat_single pathogenic_in_mammals
knownhabitats_oilfields knownhabitats_plants
metabolism_cellulosedegrader ecosystemtype_respiratorysystem
ecosystemtype_skin cellarrangement_chains
specificecosystem_sputum ecosystemsubtype_lentic
knownhabitats_humanintestinalmicroflora knownhabitats_mud
energysource_photoautotroph mammalian_pathogen_oportunisticnosocomial
growth_in_groups halophilic
mammalian_pathogen_skin_softtissues knownhabitats_humanvaginalmicroflora
specificecosystem_fecal ecosystemtype_circulatorysystem
energysource_photosynthetic knownhabitats_aquatic
ecosystemsubtype_lentic shape_tailed
knownhabitats_aquatic ecosystemsubtype_nasopharyngeal
ecosystemtype_circulatorysystem ecosystemtype_skin
ecosystemsubtype_saltcrystallizerponds habitat_hostassociated
energysource_autotroph energysource_lithotroph
cellarrangement_singles metabolism_carbondioxidefixation
host_insectsgeneral metabolism_cellulosedegrader
ecosystemcategory_wastewater mammalian_pathogen_urogenital
cellarrangement_filaments ecosystemsubtype_intertidalzone
pathogenic_in_mammals flagellarpresence
temperaturerange_psychrophilic ecosystemcategory_aquatic
ecosystemcategory_plants ecosystemsubtype_vagina
bioticrelationship_symbiotic ecosystem_hostassociated
cellarrangement_chains energysource_phototroph
mammalian_pathogen_cardiovascular_heart_bloodvessels mobility
habitat_multiple knownhabitats_oilfields
cellarrangement_pairs metabolism_methanogen
knownhabitats_food mammalian_pathogen_respiratory_lungdisease
ecosystemtype_rhizoplane oxygenreq_facultative
ecosystemtype_respiratorysystem knownhabitats_sludge
ecosystemcategory_foodproduction gram_stain_positive
sporulation ecosystemcategory_plants
ecosystemsubtype_intertidalzone energysource_chemoheterotroph
metabolism_methanogen ecosystemcategory_mammals
ecosystemsubtype_fecal ecosystem_environmental
oxygenreq_microaerophilic mammalian_pathogen_cardiovascular_heart_bloodvessels
cellarrangement_clusters shape_coccus
shape_coccus ecosystemcategory_terrestrial
oxygenreq_facultative energysource_photoautotroph
ecosystemtype_soil host_plantslegumes
knownhabitats_sediment knownhabitats_food
habitat_freeliving specificecosystem_rumen
metabolism_ammoniaoxidizer shape_spirilla
phenotypes_acidophile cellarrangement_clusters
ecosystemcategory_aquatic specificecosystem_sputum
knownhabitats_sludge energysource_photosynthetic
metabolism_sulfuroxidizer ecosystemcategory_animal
mammalian_pathogen_respiratory_throatnoseeyes bioticrelationship_symbiotic
mammalian_pathogen_urogenital_urinary sporulation
ecosystemsubtype_vagina ecosystemtype_geologic
knownhabitats_deepsea energysource_autotroph
knownhabitats_humanvaginalmicroflora ecosystemcategory_fish
ecosystemtype_dairyproducts specificecosystem_sediment
host_fish knownhabitats_creosotecontaminatedsoil
knownhabitats_skin cellarrangement_singles
knownhabitats_host ecosystemcategory_wastewater
pathogenic_in_fish knownhabitats_humanfecal
ecosystemtype_industrialwastewater pathogenic_in_fish
specificecosystem_rumen knownhabitats_feces
mammalian_pathogen_nervous_system metabolism_ammoniaoxidizer
ecosystemcategory_fish ecosystemtype_rhizoplane
ecosystemsubtype_cerebrospinalfluid host_fish
metabolism_biomassdegrader metabolism_sulfuroxidizer
knownhabitats_soil phenotypes_alkaliphile
flagellarpresence ecosystemtype_reproductivesystem
knownhabitats_rootnodule cellarrangement_filaments
mammalian_pathogen_urogenital knownhabitats_deepsea
knownhabitats_feces ecosystemsubtype_largeintestine
knownhabitats_oralcavity ecosystemcategory_insecta
ecosystemcategory_animal ecosystemtype_phylloplane
metabolism_ironreducer ecosystemsubtype_cerebrospinalfluid
ecosystemcategory_mammals ecosystemsubtype_saltcrystallizerponds
knownhabitats_endosymbiont motility
metabolism_storespolyhydroxybutyrate phenotypes_acidophile
host_mammalspig knownhabitats_oralcavity
specificecosystem_sediment host_insectsgeneral
phenotypes_alkaliphile mammalian_pathogen_skin_softtissues
knownhabitats_humanfecal metabolism_biomassdegrader
ecosystemcategory_terrestrial mammalian_pathogen_respiratory_throatnoseeyes
cellarrangement_tetrads knownhabitats_bovine
shape_spirilla metabolism_nitrifying
mammalian_pathogen_urogenital_reproductive mammalian_pathogen_urogenital_reproductive
mobility ecosystemtype_composting
energysource_methylotroph knownhabitats_humanairways
ecosystemsubtype_largeintestine knownhabitats_plantsymbiont
metabolism_nitrifying ecosystemsubtype_fecal
ecosystemsubtype_lotic metabolism_nitrogenproducer
ecosystemtype_composting knownhabitats_sediment
ecosystem_hostassociated ecosystemtype_industrialwastewater
knownhabitats_bovine metabolism_ironreducer
mammalian_pathogen_bone mammalian_pathogen_urogenital_urinary
knownhabitats_intestinaltract cellarrangement_tetrads
ecosystemsubtype_nasopharyngeal knownhabitats_intestinaltract
metabolism_sulfurrespiration ecosystemsubtype_foregut
host_mammalsherbivores metabolism_sulfurrespiration
energysource_chemoautotroph ecosystemcategory_solidwaste
ecosystemcategory_birds oxygenreq_microaerophilic
ecosystemtype_metal knownhabitats_insectendosymbiont
metabolism_sulfatereducer temperaturerange_psychrophilic
ecosystemcategory_solidwaste pathogenic_in_plants
knownhabitats_humanairways ecosystemcategory_birds
knownhabitats_surfacewater knownhabitats_skin
gram_stain_positive ecosystemcategory_bioremediation
ecosystemsubtype_foregut metabolism_hydrogensulfidegasrelease
pathogenic_in_plants ecosystemsubtype_lotic
ecosystemtype_reproductivesystem mammalian_pathogen_nervous_system
knownhabitats_wastewater host_mammalspig
energysource_oligotroph knownhabitats_surfacewater
knownhabitats_plantroot knownhabitats_wastewater
motility knownhabitats_rootnodule
metabolism_nitrogenproducer energysource_methylotroph
habitat_hostassociated mammalian_pathogen_bone
radioresistance host_mammalsherbivores
metabolism_hydrogensulfidegasrelease metabolism_sulfatereducer
metabolism_ironoxidizer energysource_chemoautotroph
ecosystem_environmental knownhabitats_plantroot
host_plantslegumes ecosystemtype_metal
ecosystemcategory_bioremediation knownhabitats_rhizosphere
ecosystemsubtype_wetlands radioresistance
knownhabitats_rhizosphere metabolism_storespolyhydroxybutyrate
ecosystemtype_phylloplane ecosystemsubtype_wetlands
knownhabitats_creosotecontaminatedsoil metabolism_ironoxidizer
knownhabitats_insectendosymbiont knownhabitats_endosymbiont
knownhabitats_plantsymbiont energysource_oligotroph
[Figure A.14 axes: the left column above ranks predictors by Mean Decrease in Gini Impurity Index (ticks 2 to 16); the right column ranks them by Mean Decrease in Accuracy (ticks 0 to 40).]
Figure A.14: Importance of all predictors in CRISPR RF model, as measured by
the mean decrease in the Gini impurity index or accuracy when that variable is
excluded from the respective forest. Note the elbow in the Gini importance ranking
after the first ten predictors.
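[Figure A.15: bar chart of Percentage With CRISPR (y-axis) by Oxygen Requirement (x-axis). Number of genomes per category: obligate aerobe, 16; aerobe, 439; facultative, 133; microaerophilic, 13; anaerobe, 154; obligate anaerobe, 42.]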
Figure A.15: The link between oxygen requirement and CRISPR incidence is
apparent even when sub-setting to only mesophiles. Error bars are 99% binomial
confidence intervals. Total number of genomes in each trait category shown at the
bottom of each bar. Categories represented by fewer than 10 genomes were omitted.
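To illustrate how the width of a 99% binomial confidence interval depends on the number of genomes in a category, the sketch below computes a Wilson score interval. The choice of the Wilson method is an assumption made for this example (the interval method behind the error bars is not stated here), and the success counts are hypothetical; only the category sizes (13 and 439) are taken from the figure.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, level=0.99):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical success counts; n values match two category sizes in the figure.
small = wilson_interval(successes=6, n=13)
large = wilson_interval(successes=200, n=439)
print(small, large)
```

With similar proportions, the interval for the 13-genome category is several times wider than for the 439-genome category, which is why sparsely represented categories carry little weight in the comparison.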
[Figure A.16 panels: top 10 predictors, listed top to bottom as plotted.]
Mean Decrease in Gini Impurity Index (x-axis, 0 to 15): oxygenreq_strictanaero, temperaturerange_thermophilic, knownhabitats_hotspring, temperaturerange_mesophilic, ecosystemtype_thermalsprings, d.fucose, knownhabitats_freshwater, temperaturerange_hyperthermophilic, knownhabitats_humanoralcavity, ecosystemsubtype_oral
Mean Decrease in Accuracy (x-axis, 0 to 40): temperaturerange_thermophilic, oxygenreq_strictanaero, temperaturerange_mesophilic, knownhabitats_hotspring, ecosystemtype_thermalsprings, d.fucose, ecosystemsubtype_oral, knownhabitats_humanoralcavity, ecosystemsubtype_blood, knownhabitats_freshwater
Figure A.16: Importance of top ten predictors in the RF model built excluding the
phyletic profile and gene neighborhood information sources from ProTraits, as
measured by the mean decrease in the Gini impurity index or accuracy when that
variable is excluded from the model.
[Figure A.17: (a) PC1 against PC2 and (b) PC3 against PC2, with points labeled Ku or No Ku and marginal density plots along each axis.]
Figure A.17: The incidence of the Ku protein in trait space. PCA as in Figs 2.1
and A.9.
[Figure A.18 panels: top 10 predictors, listed top to bottom as plotted.]
Mean Decrease in Gini Impurity Index: ecosystemtype_soil, habitat_terrestrial, knownhabitats_soil, ecosystemcategory_terrestrial, oxygenreq_strictaero, sporulation, knownhabitats_creosotecontaminatedsoil, metabolism_pahdegrading, metabolism_sulfurrespiration, metabolism_sulfuroxidizer
Mean Decrease in Accuracy: habitat_terrestrial, sporulation, ecosystemtype_soil, oxygenreq_strictaero, knownhabitats_soil, shape_coccus, metabolism_pahdegrading, ecosystemcategory_terrestrial, knownhabitats_creosotecontaminatedsoil, shape_bacilli
Figure A.18: Importance of top ten predictors in the RF model of Ku incidence.
This model had high predictive ability (κ = 0.578).
[Figure A.19: three bar charts showing the percentage of genomes with (a) cas3, (b) cas9, and (c) cas10 (y-axes, 0 to 100) by Oxygen Requirement (aerobe, anaerobe), with separate bars for Ku and No Ku. Genome counts printed beneath the bars in each panel: 520, 495, 33, 430.]
Figure A.19: CRISPR and Ku are negatively associated in aerobes but not anaerobes.
Percentage of genomes with Cas proteins associated with a particular system
type. Error bars are 99% binomial confidence intervals. Total number of genomes
in each trait category shown at the bottom of each bar. Of the 1047 genomes
represented here 253 have cas3, 61 have cas9, and 54 have cas10.
[Figure A.20: (a) PC1 against PC2 and (b) PC3 against PC2, with points labeled No Restriction Enzymes or Restriction Enzyme(s) and marginal density plots along each axis.]
Figure A.20: The incidence of restriction enzymes in trait space. PCA as in Figs
2.1 and A.9.
[Figure A.21 panels: top 10 predictors, listed top to bottom as plotted.]
Mean Decrease in Gini Impurity Index (x-axis, 0.0 to 0.8): host_insectsgeneral, cellarrangement_pairs, cellarrangement_chains, energysource_photosynthetic, energysource_photoautotroph, metabolism_nitrogenfixation, habitat_multiple, knownhabitats_insectendosymbiont, cellarrangement_singles, mammalian_pathogen_cardiovascular_heart_bloodvessels
Mean Decrease in Accuracy (x-axis, 0 to 8): energysource_phototroph, metabolism_carbondioxidefixation, knownhabitats_freshwater, knownhabitats_sediment, energysource_chemolithoautotroph, ecosystemtype_geologic, cellarrangement_pairs, ecosystemcategory_wastewater, ecosystemtype_circulatorysystem, pathogenic_in_fish
Figure A.21: Importance of top ten predictors in the RF model of restriction enzyme
incidence, as measured by the mean decrease in the Gini impurity index or accuracy
when that variable is excluded from the model.
[Figure A.22: (a) histogram of κ values (x-axis ticks 0.21 to 0.30) against Frequency (y-axis). (b, c) Proportion of resampled datasets in which each predictor fell in the top 10 (x-axis, 0 to 1), listed top to bottom as plotted.]
(b) Mean Decrease Accuracy: habitat_aquatic, oxygenreq_strictaero, temperaturerange_hyperthermophilic, knownhabitats_freshwater, temperaturerange_thermophilic, temperaturerange_mesophilic, oxygenreq_strictanaero, knownhabitats_hotspring, host_insectstermites, ecosystemtype_thermalsprings
(c) Mean Decrease Gini: oxygenreq_strictaero, knownhabitats_hydrothermalvent, temperaturerange_hyperthermophilic, knownhabitats_freshwater, temperaturerange_thermophilic, temperaturerange_mesophilic, oxygenreq_strictanaero, knownhabitats_hotspring, host_insectstermites, ecosystemtype_thermalsprings
Figure A.22: Resampling genomes has little effect on our overall outcome. (a)
Distribution of κ values for 1000 RF models built with resampled datasets. Mean
(blue) and 95% CIs (red) indicated with vertical lines. (b-c) The proportion of
resampled datasets for which each predictor fell within the set of top 10 predictors
based on variable importance scores.
[Figure A.23: panels (a) through (e), one per cas gene (cas1, cas2, cas3, cas9, cas10), each plotting normalized coverage (y-axis, log scale) against Dissolved Oxygen in µM/kg (x-axis, 0 to 300). Legend: cas present, TRUE or FALSE.]
Figure A.23: Functional profiles for cas genes from Tara Oceans Project with corresponding
oxygen metadata. Values for each cas gene shown are the coverage mapping
to that orthologous group normalized by the total coverage in the metagenome.
Zero values for coverage plotted along x-axis in red (since data plotted on log scale).
Trend-lines plotted on log transformed data for ease of interpretation.
Appendix B: Supplemental Information For: Selective maintenance
of multiple CRISPR arrays across prokaryotes
B.1 Validation of functional / non-functional classification
Our power to detect selection depends critically on our ability to classify
genomes as CRISPR functional vs. non-functional. Functional CRISPR arrays
should, on average, contain more spacers than non-functional arrays [226]. Thus we
compared the number of repeats in CRISPR arrays in genomes with both cas1 and
cas2 present (functional, 16.01 repeats on average) to the number of repeats in
genomes lacking either or both genes (non-functional, 12.23 repeats on average)
and confirmed that the former have significantly more than the latter ($t = -36.516$,
df = 55340, $p < 2.2 \times 10^{-16}$; Fig B.9). This difference in length (3.80 repeats) is
not as large as one might expect, possibly because some systems are able to acquire
or duplicate spacers via homologous recombination [165] and arrays may have been
inherited recently from strains with active cas machinery. The mean array length
across the dataset was 15.12 repeats.
B.2 Deriving the distribution of number of arrays per genome under a
neutral accumulation model
Recall our null hypothesis that in genomes with functional CRISPR systems
possession of a single array is highly adaptive (i.e. viruses are present and will kill
any susceptible host) but additional arrays provide no additional advantage. Thus
these additional arrays will appear and disappear in a genome as the result of a
neutral birth/death horizontal transfer and loss process, where losses are assumed
to remove an array in its entirety. This hypothesis predicts that the non-functional
distribution will look like the functional distribution shifted by one ($S_i$):
$$H_0 : N_i \approx S_i = \frac{F_{i+1}}{\sum_{j=1}^{\infty} F_j} \tag{B.1}$$
for $i \geq 0$.
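The shifted distribution in Eq. B.1 can be computed directly from a table of functional-genome array counts. The following is a minimal sketch; the function name and the example counts are hypothetical and are not data from this study.

```python
def shifted_distribution(functional_counts):
    """Return S, where S[i] = F[i+1] / sum_{j>=1} F[j] for i >= 0.

    functional_counts[i] is the number (or proportion) of functional
    genomes carrying exactly i CRISPR arrays.
    """
    total = sum(functional_counts[1:])
    if total == 0:
        raise ValueError("no functional genomes with at least one array")
    return [f / total for f in functional_counts[1:]]

# Hypothetical counts of functional genomes with 0, 1, 2, 3, 4 arrays.
F = [0, 40, 30, 20, 10]
S = shifted_distribution(F)
print(S)  # S[0] is the share of functional genomes with exactly one array
```

Dropping the zero-array class and renormalizing expresses the null expectation that the first array is always kept while any further arrays accumulate neutrally.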
We begin by deriving a functional form for the distribution $N_i$ from first principles
following a neutral process. If CRISPR arrays arrive in a given genome at a
constant rate via rare horizontal transfer events, then we can model their arrivals
using a Poisson process with rate $\eta$. Assuming arrays are also lost independently
at a constant rate, the lifetime of each array in the genome will be exponentially
distributed with rate $\nu$. This leads to a linear birth-death process of array accumulation,
which yields a Poisson equilibrium distribution with rate $\lambda = \eta/\nu$. While this rate
might be constant for a given taxon, it will certainly vary across taxa due to different
intrinsic (e.g. cell wall and membrane structure) and extrinsic factors (e.g. density
of neighbors, environmental pH and temperature) [16]. We model this variation by
allowing genome $j$ to have rate $\lambda_j = \eta_j/\nu_j$ and assuming $\lambda_j \sim \mathrm{Gamma}(\alpha, \theta)$, which
we pick for its flexibility and analytic tractability. This combination of gamma and
Poisson distributions leads to the number of arrays $i$ in a random genome following
a negative binomial distribution $N_i = \mathrm{NB}(r, p)$ where $r = \alpha$ and $p = \theta/(1+\theta)$.
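The step from the gamma-Poisson mixture to the negative binomial can be written out explicitly. The derivation below is a sketch under assumed notation: $\lambda$ denotes the per-genome Poisson rate, and $\alpha$ and $\theta$ denote the shape and scale of the gamma distribution (these symbol choices are made for this illustration).

```latex
\begin{align*}
P(N = i)
  &= \int_0^\infty \frac{\lambda^{i} e^{-\lambda}}{i!}
     \cdot \frac{\lambda^{\alpha-1} e^{-\lambda/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}
     \, d\lambda \\
  &= \frac{1}{i!\,\Gamma(\alpha)\,\theta^{\alpha}}
     \int_0^\infty \lambda^{i+\alpha-1} e^{-\lambda(1+\theta)/\theta} \, d\lambda \\
  &= \frac{\Gamma(i+\alpha)}{i!\,\Gamma(\alpha)}
     \cdot \frac{1}{\theta^{\alpha}}
     \left(\frac{\theta}{1+\theta}\right)^{i+\alpha} \\
  &= \frac{\Gamma(i+\alpha)}{i!\,\Gamma(\alpha)}
     \left(\frac{\theta}{1+\theta}\right)^{i}
     \left(\frac{1}{1+\theta}\right)^{\alpha}.
\end{align*}
```

This is the negative binomial probability mass function with $r = \alpha$ and $p = \theta/(1+\theta)$, whose mean $pr/(1-p) = \alpha\theta$ equals the mean of the gamma distribution of rates.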
Now we can fit this distribution to data to find maximum likelihood estimates
of $r$ and $p$ for the distribution of array counts in both the set of non-functional
genomes ($N_i$) and the set of functional genomes as shifted under our null hypothesis
($S_i$). This allows us to construct a parametric test of selection. We expect that
$\hat{r}_N \approx \hat{r}_S$ and $\hat{p}_N \approx \hat{p}_S$ under our null hypothesis (where subscripts correspond to
the distribution to which the parameters were fit). When our null hypothesis is
violated it should shift the means of these distributions. Therefore we estimate and
compare these means $\mu_x = \frac{p_x r_x}{1 - p_x}$, $x \in \{N, S\}$. We expect that $\hat{\mu}_S > \hat{\mu}_N$ if more than
one array is selectively maintained, and we bootstrap confidence intervals on these
estimates by resampling with replacement from our functional and non-functional
array count distributions in order to determine whether the effect is significant.
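The bootstrap comparison of means can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: it uses the fact that, for a negative binomial sample, the maximum likelihood estimate of the mean $pr/(1-p)$ coincides with the sample mean, so no numerical optimizer is needed to estimate the mean itself. The percentile interval, the 95% level, the number of resamples, and the example counts are all hypothetical choices.

```python
import random

def sample_mean(counts):
    """Sample mean; equals the negative binomial MLE of the mean p*r/(1-p)."""
    return sum(counts) / len(counts)

def bootstrap_mean_ci(counts, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean, resampling with replacement."""
    rng = random.Random(seed)
    means = sorted(
        sample_mean(rng.choices(counts, k=len(counts))) for _ in range(n_boot)
    )
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Hypothetical array counts, for illustration only.
shifted_functional = [2, 3, 4, 3, 2, 5, 4, 3] * 10  # functional counts minus one
nonfunctional = [0, 1, 0, 1, 2, 0, 1, 0] * 10

ci_S = bootstrap_mean_ci(shifted_functional)
ci_N = bootstrap_mean_ci(nonfunctional, seed=2)
# Non-overlapping intervals with the shifted-functional mean higher would be
# read as evidence that more than one array is selectively maintained.
print(ci_S, ci_N, ci_S[0] > ci_N[1])
```

Estimating $r$ and $p$ separately would still require a numerical likelihood fit; the shortcut here applies only to the mean.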
B.3 Instantaneous array loss vs. gradual decay
There are two possible routes to complete CRISPR array loss: (1) an all-at-once
loss of the array (e.g. due to recombination between flanking insertion
sequences [50, 140]) and (2) gradual decay due to spacer loss. Previous experimental
evidence supports route (1), spontaneous loss of the entire CRISPR array [51],
as do comparisons between closely related genomes [50]. The distinction above is
important, because if CRISPR array loss were to occur primarily via route (2), gradual
decay, then functional genomes would have an intrinsically lower rate of array
loss than non-functional genomes. This is because in functional genomes spacer
acquisition would counteract spacer loss, reducing the rate of array decay, whereas
this compensation would not occur in non-functional genomes. This could lead us
to spuriously accept a result of selection maintaining multiple arrays.
If arrays were primarily lost via gradual decay, we would expect our data to
show a positive relationship between the number of arrays in a genome and the
average array length in a genome, because arrays experiencing more decay (either
due to increased spacer loss rates or reduced acquisition rates) should be shorter
and prone to eventual deletion. In functional genomes with the complete spacer
acquisition machinery (cas1 and cas2) this trend would be due to the higher probability
of stochastically reaching a 0-spacers state in shorter arrays, and arrays will in
general be shorter in genomes with lower spacer acquisition rates. In non-functional
genomes that lack the complete spacer acquisition machinery, this trend would result
from differences in time since loss of acquisition machinery, where genomes that had
lost that machinery farther in the past would have both shorter arrays and fewer
arrays on average.
Overall we see no relationship between mean array length and array count in a
genome (linear regression, $m = -0.001$, $p = 0.109$, $R^2 = 5.55 \times 10^{-5}$). Surprisingly,
in functional genomes we find a slightly negative linear relationship between the
number of arrays in a genome and mean array length in that genome ($m = -0.0081$,
$p < 2 \times 10^{-16}$, $R^2 = 0.0032$). In non-functional genomes we see a slightly positive
relationship ($m = 0.0054$, $p = 7.23 \times 10^{-10}$, $R^2 = 0.0026$). While both of these
relationships are significant, they are extremely weak and probably spurious. This
lack of a clear relationship suggests that gradual decay is not the primary cause
of array loss. Nevertheless, because spacer turnover (acquisition and loss) and array
turnover (acquisition via HGT and loss) are easily confounded processes, this
evidence is not conclusive.
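To make the distinction between a significant slope and a weak relationship concrete, the slope $m$ and $R^2$ of a simple linear regression can be computed as in the sketch below. The function name and the choice of array count as predictor and mean array length as response are assumptions made for illustration only.

```python
def linear_regression(x, y):
    """Ordinary least squares fit of y = m * x + b.

    Returns (m, b, r_squared). In this illustration x would hold array
    counts per genome and y mean array lengths (an assumed orientation).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # R^2 is the fraction of variance in y explained by the fitted line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return m, b, r_squared
```

With tens of thousands of genomes even a very small slope can yield a small p-value, which is why the $R^2$ values above are the more informative quantity.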
As an additional, more direct test of whether array decay or instantaneous
loss drives CRISPR array loss dynamics, we consider the difference in array length
between putatively functional and non-functional arrays. On average the pres-
ence of the Cas acquisition machinery leads to the addition of about 4 repeats to
a given array (Fig B.9). We expect short arrays, of approximately this length, to
be the most rapidly lost upon loss of array functionality. Thus, if array loss occurs
primarily through gradual decay, then our result of selection maintaining multiple
arrays would be driven primarily by short, functional arrays. By removing any
functional arrays below some threshold length from the dataset, we can test this
hypothesis. Even upon removal of all functional arrays less than 10 spacers long
from the dataset, we still observe a substantial signature of selection maintaining
multiple arrays (Figs B.10 and B.11). This signature does weaken slightly
as the threshold length for functional arrays is increased, but this is to be expected,
since removing arrays from the functional dataset must also decrease the difference
in mean number of arrays between the functional and non-functional datasets
(no removals are made from the non-functional set). It is notable, though,
that this decrease in signal is small (especially when removing arrays < 6 spacers
long), demonstrating that the selective signal is not being driven primarily by short,
functional arrays.
B.4 Model Analysis
We develop a simple deterministic model of the spacer turnover dynamics
in a single CRISPR array of a bacterium exposed to n viral species (i.e., disjoint
protospacer sets). Let Ci be the number of spacers in the array that target viral
species i:
$$ \frac{dC_i}{dt} = \underbrace{a_i(t, C_i)}_{\text{Acquisition}} - \underbrace{\mu_L \, l_i(C_1, \ldots, C_n)}_{\text{Loss}}, \tag{B.2} $$
$$ a_i(t, C_i) = \mu_A v_i f_i(t) g(C_i), \tag{B.3} $$
where $\mu_L$ is the per-spacer loss rate parameter, $l_i(C_1, \ldots, C_n)$ is a function describing
how spacer loss depends on the array length, $\mu_A$ is the per-infection spacer
acquisition rate, $v_i$ is a composite parameter describing the infection intensity in
the environment (viral density times adsorption rate), $f_i(t) \in [0, 1]$ is a function
describing the fluctuations of viral population $i$ over time, and
$$ g(C_i) = \begin{cases} 1 & C_i < 1 \\ p & C_i \geq 1 \end{cases} \tag{B.4} $$
is a function determining whether or not the system is primed towards viral species
i (i.e., if a CRISPR array has a spacer targeting a particular viral species, the rate
of spacer acquisition towards that species is increased; [141, 142]), where p > 1 is
the degree of priming (see Table C.1 for summary of model parameters and units).
Using this model we can determine the optimal spacer acquisition rate given a
particular pattern of pathogen recurrence in the environment, fi(t). If the optima for
distinct recurrence patterns do not overlap, it indicates that multiple arrays would
be required to simultaneously combat viral species with these distinct recurrence
patterns.
We analyze the ability of the array to respond to three types of viral threats:
(1) background species representing the set of all viruses persisting over time in
the environment ($f_B(t) = 1$), (2) transient species that leave the system and return
after some interval of time ($f_T(t) \in \{0, 1\}$), and (3) novel species that have not been
previously encountered. Thus we can compare how CRISPR balances the need for
consistent immunity, long-term memory, and rapid adaptation (full analysis below).
We consider two forms for the function li based on known features of CRISPR
biology. (1) The rate of per-spacer loss increases linearly with locus length. This
form is based on the observation that spacer loss appears to occur via homologous
recombination between repeats [31, 143, 144], which becomes more likely with increasing
numbers of spacers (and thus repeats). (2) The length of an array is capped
at some fixed effective number of spacers. This form is based on evidence that
mature crRNA transcripts from the leading end of the CRISPR array are far more
abundant than those from the trailing end, and that this decay over the array happens
quickly (most transcripts are from the first few spacers; [145, 146, 147]). We
analyze both models below, though they give qualitatively similar results, and so
we focus on case (1) in the main text Results.
Primary Model: Array-Length-Dependent Spacer Loss
First we consider a version of the model above where the per-spacer loss rate
increases linearly with the total number of spacers in the array (i.e., physical loss
via homologous recombination):
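$$ \frac{dC_i}{dt} = \underbrace{a_i(t, C_i)}_{\text{Acquisition}} - \underbrace{\mu_L C_i \sum_{j=1}^{n} C_j}_{\text{Loss}}, \tag{B.5} $$
$$ a_i(t, C_i) = \mu_A v_i f_i(t) g(C_i), \tag{B.6} $$
$$ g(C_i) = \begin{cases} 1 & C_i < 1 \\ p & C_i \geq 1 \end{cases}. \tag{B.7} $$
That is, $l_i(C_1, \ldots, C_n) = C_i \sum_{j=1}^{n} C_j$. Parameter values can be found in Table C.1.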
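As a hedged illustration, the right-hand side of Eqs B.5-B.7 can be written as a short function. The names `priming` and `spacer_rhs` are hypothetical, and the viral presence functions $f_i(t)$ are passed in as already-evaluated numbers; this is a sketch of the equations as stated, not the code used for the analysis.

```python
def priming(c_i, p):
    """Eq B.7: acquisition is boosted by a factor p once C_i >= 1."""
    return p if c_i >= 1.0 else 1.0


def spacer_rhs(C, v, f, mu_a, mu_l, p):
    """Eqs B.5-B.6: dC_i/dt for every viral species i.

    C: spacer counts per species, v: infection intensities,
    f: current values of f_i(t) in [0, 1], mu_a: unprimed acquisition rate,
    mu_l: per-spacer loss rate, p: priming factor.
    """
    total = sum(C)  # array length, which sets the per-spacer loss rate
    return [
        mu_a * v_i * f_i * priming(c_i, p) - mu_l * c_i * total
        for c_i, v_i, f_i in zip(C, v, f)
    ]
```

For an empty array the loss term vanishes and each species is acquired at the unprimed rate $\mu_A v_i f_i(t)$.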
For the purposes of simplifying our analysis we consider the case where there
are two viral species in the system, one that persists at some background level
(background B, fB(t) = 1) and another that returns in sharp bursts at periodic
intervals (transient T , fT (t) ? {0, 1}).
This two-virus situation captures the conflict between maintaining
immune memory of periodically recurring infections and defending against ever-present,
persisting viral enemies. We are interested in the time lag between the return
of virus $T$ to the system and the development of host immunity. In the case where
memory is maintained during the absence of virus $T$ the lag will be zero, otherwise
it will depend on the base spacer uptake rate ($\mu_A$) and the relative densities of the
two viral populations.
We examine the time to reacquisition of immunity towards virus T using the
following procedure (Fig B.12). Note that for simplicity we let time have units of
virus return time (so that one unit of time equals the interval in which T is absent
from the system). Our method is as follows: (1) start the system at its equilibrium
state with both viral species present, (2) remove species T for a single unit of time
and track the decay of CT , and (3) return T to the system and calculate how long it
takes after this return for CT to exceed one (time to reacquisition of immunity, tI).
Let tI = 0 if CT remains above one despite the absence of T (i.e. no loss of immune
memory). In more detail:
1. Find the equilibrium of our system ($C^*_B$, $C^*_T$) when $T$ is present, assuming that
at the equilibrium state $C^*_B, C^*_T \geq 1$. We have that
$$ C^*_B = \frac{v_B \sqrt{\mu_A p}}{\sqrt{\mu_L (v_B + v_T)}} \tag{B.8} $$
and
$$ C^*_T = \frac{v_T \sqrt{\mu_A p}}{\sqrt{\mu_L (v_B + v_T)}}. \tag{B.9} $$
2. Use this equilibrium as an initial condition for a system where virus $T$ has
been removed (where $f_T(t) = 0$, so $\frac{dC_T}{dt} = -\mu_L C_T (C_T + C_B)$) and solve (numerically)
for the state of the system at time $t = 1$ assuming that $T$ remains
absent ($C_T(1)$). In practice we use the ode15s solver available in MATLAB
(2016a, MathWorks, Inc., Natick, Massachusetts, United States).
3. Find the time to reacquisition of immunity ($t_I$ such that $C_T(t_I + 1) = 1$) by
solving the unprimed system where we ignore loss (since no spacers yet exist
to be lost, loss should not be important). Define $\tilde{C}_T(t)$ so that
$$ \tilde{C}_T(t) = \mu_A v_T t + C_T(1). \tag{B.10} $$
We are interested in $t_I$ where $\tilde{C}_T(t_I) = 1$, so that
$$ t_I = \frac{1 - C_T(1)}{\mu_A v_T} \tag{B.11} $$
where $C_T(1)$ is the solution from step (2). This equation only holds if $C_T(1) < 1$
(let $t_I = 0$ otherwise). This time to reacquisition is our measure of fitness,
with a lower $t_I$ indicating higher fitness of the host. We define some $\epsilon$ such that
if $t_I < \epsilon$ we say immunity is maintained. Ideally, $\epsilon$ should be shorter than
the time it takes viruses to cause irreparable damage to an infected bacterium
after initial infection.
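The three steps above can be sketched in a few lines. The sketch below is illustrative only: it substitutes a fixed-step fourth-order Runge-Kutta integrator for the ode15s solver (an assumption that is reasonable only when the system is not stiff for the chosen parameters), and all function names are hypothetical.

```python
import math


def equilibrium(v_b, v_t, mu_a, mu_l, p):
    """Eqs B.8-B.9: primed equilibrium with both viruses present."""
    scale = math.sqrt(mu_a * p / (mu_l * (v_b + v_t)))
    return v_b * scale, v_t * scale


def absence_rhs(c_b, c_t, v_b, mu_a, mu_l, p):
    """Dynamics while T is absent (f_T = 0): only B is still acquired."""
    g_b = p if c_b >= 1.0 else 1.0
    total = c_b + c_t
    return mu_a * v_b * g_b - mu_l * c_b * total, -mu_l * c_t * total


def time_to_reacquisition(v_b, v_t, mu_a, mu_l, p, steps=10000):
    """Steps 1-3: return t_I from Eq B.11 (0 if memory is retained)."""
    c_b, c_t = equilibrium(v_b, v_t, mu_a, mu_l, p)  # step 1
    h = 1.0 / steps
    args = (v_b, mu_a, mu_l, p)
    for _ in range(steps):  # step 2: integrate over one unit of time (RK4)
        k1 = absence_rhs(c_b, c_t, *args)
        k2 = absence_rhs(c_b + 0.5 * h * k1[0], c_t + 0.5 * h * k1[1], *args)
        k3 = absence_rhs(c_b + 0.5 * h * k2[0], c_t + 0.5 * h * k2[1], *args)
        k4 = absence_rhs(c_b + h * k3[0], c_t + h * k3[1], *args)
        c_b += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        c_t += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    if c_t >= 1.0:  # memory retained, no lag
        return 0.0
    return (1.0 - c_t) / (mu_a * v_t)  # step 3, Eq B.11
```

Scanning this function over a grid of acquisition rates and background viral densities would give a phase diagram of the kind described in the main text, under the stated assumptions.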
We can also use Eq B.11 to find the time of immune acquisition towards a
novel viral species ($N$) arriving in the environment. When no spacers towards that
species are already contained in the CRISPR array ($C_N(0) = 0$), the time to novel
acquisition is
$$ t_N = \frac{1}{\mu_A v_N}. \tag{B.12} $$
Note that we assume here that the spacer loss rate does not affect $\frac{dC_N}{dt}$ when $C_N < 1$
in order to simplify our analysis. During analysis we let $v_N = v_T$ for simplicity.
Alternative Model: Leader-End crRNA Processing
We are interested in having a hard cutoff on the array length in this version
of the model (corresponding to a fixed cap on the number of transcribed spacers).
Therefore, we consider an effective array of fixed length $L$ and let $C_i$ be the proportion
of the array taken up by spacers targeting viral species $i$ ($n$ total viral species,
$\sum_{i=1}^{n} C_i = 1$). This model generally follows a form similar to that in Eqs B.2-B.4:
$$ \frac{dC_i}{dt} = a_i(t, C_i) - C_i \sum_{j} a_j(t, C_j) \tag{B.13} $$
where,
$$ a_i(t, C_i) = \mu_A v_i f_i(t) g(C_i). \tag{B.14} $$
We modify the definition of the priming function $g$ to match the new scenario so
that
$$ g(C_i) = \begin{cases} 1 & C_i < \frac{1}{L} \\ p & C_i > \frac{1}{L} \end{cases} \tag{B.15} $$
Observe that in Eq B.13 the flow rates into and out of the system match so that the
overall number of effective spacers should not change (in this case
$l_i(t, C_1, \ldots, C_n) = C_i \sum_j a_j(t, C_j)$ is also a function of $t$ and we let $\mu_L = 1$).
When $f_i(t) = 1 \; \forall i$ and we assume the system is primed towards all viruses,
then we have equilibrium
$$ C^*_i = \frac{v_i}{\sum_j v_j}. \tag{B.16} $$
Let's return to our simple two-virus system with background (B) and transient (T )
viral species. We perform a three step analysis similar to the one used for the linear
spacer loss model above:
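1. Concentrating on an interval where both viruses are present ($f_T = 1$)
$$ \frac{dC_i}{dt} = \begin{cases}
\mu_A v_i p - p \mu_A C_i (v_i + v_{j \neq i}) & C_i, C_{j \neq i} > \frac{1}{L} \\
\mu_A v_i - \mu_A C_i (v_i + p v_{j \neq i}) & C_i < \frac{1}{L},\ C_{j \neq i} > \frac{1}{L} \\
\mu_A v_i p - \mu_A C_i (p v_i + v_{j \neq i}) & C_i > \frac{1}{L},\ C_{j \neq i} < \frac{1}{L} \\
\mu_A v_i - \mu_A C_i (v_i + v_{j \neq i}) & C_i, C_{j \neq i} < \frac{1}{L}
\end{cases} \qquad i, j \in \{B, T\}. \tag{B.17} $$
We derive the equilibrium spacer content, assuming that the system starts
from a doubly primed condition so that
$$ C^*_i = \begin{cases}
\frac{v_i}{v_i + v_{j \neq i}} & \frac{1}{L} < \frac{v_i}{v_i + v_{j \neq i}} < 1 - \frac{1}{L} \\
\frac{v_i}{v_i + p v_{j \neq i}} & \frac{v_i}{v_i + v_{j \neq i}} < \frac{1}{L} \\
\frac{p v_i}{p v_i + v_{j \neq i}} & \frac{v_i}{v_i + v_{j \neq i}} > 1 - \frac{1}{L}
\end{cases} \qquad i, j \in \{B, T\}. \tag{B.18} $$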
2. We now focus on the dynamics of the system after the transient viral species
leaves (fT = 0) assuming the system starts from the equilibrium in Eq B.18.
The decay of spacers targeting the transient viral population will follow
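$$ \frac{dC_T}{dt} = \begin{cases}
-\mu_A v_B p C_T & C_T < 1 - \frac{1}{L} \\
-\mu_A v_B C_T & C_T > 1 - \frac{1}{L}
\end{cases}. \tag{B.19} $$
We can solve for $C_T(t)$ with the given initial condition so that
$$ C_T(t) = \begin{cases}
\left(\frac{v_T}{v_T + v_B}\right) e^{-p \mu_A v_B t} & C^*_T, C^*_B > \frac{1}{L} \\
\left(\frac{v_T}{v_T + p v_B}\right) e^{-p \mu_A v_B t} & C^*_T < \frac{1}{L},\ C^*_B > \frac{1}{L} \\
\left(\frac{p v_T}{p v_T + v_B}\right) e^{-\mu_A v_B t} & t \leq t_B,\ C^*_B < \frac{1}{L} \\
\left(\frac{p v_T}{p v_T + v_B}\right) e^{-\mu_A v_B (t_B + p (t - t_B))} & t > t_B,\ C^*_B < \frac{1}{L}
\end{cases} \tag{B.20} $$
where
$$ t_B = \frac{-\ln\left(\frac{(p v_T + v_B)(L - 1)}{p v_T L}\right)}{\mu_A v_B}. \tag{B.21} $$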
3. Let us assume that time is measured in units of viral return intervals so that
the return time of the transient ($T$) species is $t = 1$. Then our goal is to find
$C_T(1)$, assess whether this value has dropped below $\frac{1}{L}$ (immune loss), and if
so, find the time to immune reacquisition, $t_I$ such that $C_T(t_I + 1) = \frac{1}{L}$. Let us define a new
function $\tilde{C}_T(t)$ that tracks the re-acquisition of immunity after $T$ returns to
the system so that $\tilde{C}_T(0) = C_T(1)$ and, assuming no decay of spacers when
$\tilde{C}_T < \frac{1}{L}$,
$$ \frac{1}{L} = \tilde{C}_T(t_I) = \mu_A v_T t_I + C_T(1). \tag{B.22} $$
Then
$$ t_I = \frac{\frac{1}{L} - C_T(1)}{\mu_A v_T} \tag{B.23} $$
and we have $C_T(1)$ from Eq B.20.
This model gives qualitatively similar results to those found with the primary model
(Fig B.2).
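Because the alternative model has a closed-form solution, the reacquisition time can be evaluated directly from Eqs B.18, B.20, B.21, and B.23. The sketch below is one possible transcription of those equations; the function names are hypothetical, the boundary case $C^*_B = \frac{1}{L}$ is assigned to the primed branch by assumption, and $t_I$ is set to zero when memory is retained, mirroring the convention of the primary model.

```python
import math


def equilibrium_fraction(v_i, v_j, p, L):
    """Eq B.18: equilibrium share of the array targeting species i."""
    frac = v_i / (v_i + v_j)
    if frac < 1.0 / L:
        return v_i / (v_i + p * v_j)
    if frac > 1.0 - 1.0 / L:
        return p * v_i / (p * v_i + v_j)
    return frac


def switch_time(v_b, v_t, mu_a, p, L):
    """Eq B.21: time at which C_T falls to 1 - 1/L and B becomes primed."""
    return -math.log((p * v_t + v_b) * (L - 1) / (p * v_t * L)) / (mu_a * v_b)


def transient_fraction(t, v_b, v_t, mu_a, p, L):
    """Eq B.20: C_T(t) after the transient virus leaves at t = 0."""
    c_t0 = equilibrium_fraction(v_t, v_b, p, L)
    c_b0 = equilibrium_fraction(v_b, v_t, p, L)
    if c_b0 >= 1.0 / L:  # B already primed: decay at the primed rate
        return c_t0 * math.exp(-p * mu_a * v_b * t)
    t_b = switch_time(v_b, v_t, mu_a, p, L)
    if t <= t_b:  # B unprimed until C_T drops to 1 - 1/L
        return c_t0 * math.exp(-mu_a * v_b * t)
    return c_t0 * math.exp(-mu_a * v_b * (t_b + p * (t - t_b)))


def reacquisition_time(v_b, v_t, mu_a, p, L):
    """Eq B.23, with t_I = 0 when C_T(1) has not dropped below 1/L."""
    c_t1 = transient_fraction(1.0, v_b, v_t, mu_a, p, L)
    if c_t1 >= 1.0 / L:
        return 0.0
    return (1.0 / L - c_t1) / (mu_a * v_t)
```

With $p = 1$, $L = 5$, and $v_T = v_B = 100$, for example, the array starts at $C^*_T = 0.5$ and memory is lost within one return interval only if $\mu_A$ is large enough for $0.5\,e^{-100 \mu_A}$ to fall below $0.2$.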
B.5 Conrming selection
In order to further conrm our result of selective maintenance of multiple
CRISPR arrays, we (1) subsampled overrepresented taxa in the dataset, (2) per-
formed phylogenetically-corrected tests to conrm both that dierential rates of
horizontal gene transfer (HGT) between species were not driving our results and
that the pattern of selection we observed was not isolated to any particular group,
(3) showed that potential linkage between cas genes and CRISPR arrays was not
producing the signature of selection we observed, and (4) demonstrated that the
142
genome assembly level had no eect on our outcome. Additionally, (5) we discuss
why the potential eects of CRISPR on rates of HGT cannot account for the selec-
tion we see.
(1) Sub-sampling to reduce the influence of overrepresented taxa altered our
parameter estimates slightly, but did not change our overall result (sampled 10
genomes from each species with > 10 genomes, $\Delta\mu = 1.13 \pm 0.09$, Fig B.6, Table
B.1). To control for the possibility that multiple sets of cas genes in a small subset of
genomes could be driving this selective signature, we restricted our dataset only to
genomes with one or fewer signature targeting genes and one or fewer copies each of
the genes necessary for spacer acquisition. Even in this restricted and subsampled set
of genomes selection maintains more than one (functional) CRISPR array, though
the effect size is smaller ($\Delta\mu = 0.18 \pm 0.08$, Fig B.7).
(2) The number of CRISPR arrays is positively related to the number of cas
genes in a genome (Fig B.13). Differential rates of horizontal gene transfer (HGT)
among species could produce an observed correlation between cas1 and cas2 presence
and array count in the absence of selection. To control for this potentially
confounding effect we performed a species-wise parametric test. For each species $k$
we calculated a species-specific $\Delta\mu_k = \hat{\mu}_{S_k} - \hat{\mu}_{N_k}$ and then bootstrapped the mean
of the distribution of these values ($\overline{\Delta\mu}$) to evaluate significance. This test confirms
a signature of multi-array selection ($\overline{\Delta\mu} = 0.70 \pm 0.14$) despite low power due to
most species having few sequenced genomes. Additionally, in order to determine if
the signature of selection was confined to a particular set of clades, we mapped all
species-specific $\Delta\mu_k$ values onto the SILVA Living Tree 16S rRNA tree [228]. Of
the 623 species with at least one functional and one non-functional genome, 568
were represented on the tree. Positive and strongly positive (> 1) values of $\Delta\mu_k$
were distributed across the tree, indicating this phenomenon was not isolated to
a particular group (Fig B.14). Formal testing revealed no significant phylogenetic
signal in the $\Delta\mu_k$ values ($K = 1.88 \times 10^{-9}$, $p = 0.604$; [229, 230]). Briefly, this
test for phylogenetic signal involves randomly permuting the underlying phylogeny
and comparing the fit of the observed trait data (in this case $\Delta\mu_k$) to these random
trees (test developed in [229] and implemented with the phylosig function in the
phytools R package [230]).
(3) Often CRISPR arrays and cas genes are collocated such that loss of one
may be linked to loss of the other. At equilibrium, the distribution of array counts
per genome will be unaffected by such collocation. To test this assumption directly,
we used regression to check if the minimum distance between CRISPR arrays and
cas genes in a genome drives the species-specific signature of selection, $\Delta\mu_k$ (using
only completely assembled genomes). We saw a slight positive relationship between
CRISPR-cas distance and our signature of multi-array selection, the opposite of
what we would expect if linkage were driving our results ($m = 3.163 \times 10^{-7}$,
$p = 8.52 \times 10^{-6}$, $R^2 = 0.009937$).
(4) Finally, to conrm that assembly level had no eect on our conclusion, we
ran our parametric test restricted to completely assembled genomes in the dataset
(6263 genomes, ?? = 0.98? 0.09).
(5) CRISPR immunity may generically increase rates of horizontal gene trans-
fer [148], including transfer of CRISPR arrays themselves. Under our model of
array accumulation, an increase in the rate of the arrival of new arrays in functional
genomes would certainly increase the mean of the distribution of arrays per func-
tional genome. Nevertheless, assuming only selection on a single array (our null
hypothesis), we would still expect this functional distribution to have negative bi-
nomial form shifted by one array (see Chapter 3 Methods and Appendix B.2). It is
clear from Fig 3.1(b) that the distribution must be shifted by at least two arrays in
order to resemble a negative binomial distribution.
B.6 Neo-CRISPR Arrays
O-target spacer integration into the genome can spawn novel CRISPR arrays
in E. coli [149]. This could create a spurious signature of selection maintaining
multiple arrays using our test, since the production of neo-CRISPR arrays would
only occur in functional genomes. A simple way to control for this is to merge all
CRISPR arrays with identical consensus repeat sequences in a genome, thus remov-
ing any duplicates. Doing this, we nd that the signature of multi-array selection
remains, albeit being somewhat less strong (?? = 0.46? 0.02). We were consider-
ably surprised that this signature of selection still remained after merging, since such
merging will also remove a large portion of arrays acquired through horizontal trans-
fer, assuming such transfers most often happen between closely related individuals.
In any case, while the production of neo-CRISPR arrays may be driving our result
in part, it cannot account for the overall signal. It is unclear if neo-CRISPR arrays
are commonly produced in bacteria via o-target integration, though [149] found
145
circumstantial evidence it may occur in two other species. The CRISPR system of
E. coli is not naturally active [231] and requires articial up-regulation of the spacer
acquisition machinery, so that its dynamics may not be representative of CRISPR
systems at large. Nevertheless, this mechanism may explain the large number of
arrays found in some genomes (e.g., Clostridium dicile genomes typically have
nine arrays; [135, 136]).
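The merging control described above amounts to counting distinct consensus repeats per genome. A minimal sketch follows, assuming a hypothetical input format (a mapping from genome identifier to a list of consensus repeat strings) and exact string matching; whether reverse-complemented repeats should also be treated as identical is left open here.

```python
def merged_array_counts(genome_repeats):
    """Count arrays per genome after merging identical consensus repeats.

    genome_repeats: dict mapping a genome identifier to the list of
    consensus repeat sequences of its CRISPR arrays (hypothetical format).
    Arrays are merged only when their repeat strings match exactly.
    """
    return {
        genome: len(set(repeats))
        for genome, repeats in genome_repeats.items()
    }
```

The merged counts could then be passed to the same test for selection in place of the raw array counts.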
We note that this control also applies to any other potential array duplication
process, since repeat sequence should be preserved during duplication.
In the case of both neo-CRISPR arrays and other duplication events, we might
expect the new array, formed via off-target integration or fragmentation of
a larger array, to be shorter and potentially lower scoring than the first.
Comparing arrays in two-array genomes, we do not see this
pattern emerge (i.e., no negative correlation between arrays in terms of length or
score). In fact, in both cases we see slight positive correlations between arrays,
though with little explanatory value (Fig B.15).
B.7 Validation of CRISPRDetect array predictions
We ran our tests for selective maintenance of multiple arrays on the same
dataset excluding arrays with a CRISPRDetect score lower than 6 (double the default
threshold). We found no qualitative differences in our results when we used
this increased detection threshold ($\Delta\mu = 1.00 \pm 0.02$). By default, CRISPRDetect
identifies arrays with repeats matching experimentally-verified CRISPR arrays as
well as de novo repeats. When we restricted to only arrays with a positive hit on this list
we again found the same pattern ($\Delta\mu = 0.76 \pm 0.03$).
We also downloaded the set of CRISPR arrays and array-lacking genomes
available on the CRISPR Database [159]. This database uses an alternative algorithm
for array detection [232] and thus serves as an independent verification of
our results. This dataset showed a clear signature of selection maintaining multiple
arrays ($\Delta\mu = 1.49 \pm 0.17$).
B.8 Autoimmunity Constrains Unprimed Spacer Acquisition Rates
Empirical evidence suggests that there is no self versus non-self recognition
mechanism in the CRISPR systems of Streptococcus thermophilus, a popular model
system for CRISPR research [27]. Thus any increase in spacer acquisition will also
increase the rate of autoimmune targeting. In the absence of viruses or once im-
munity has been established, we can model the growth of bacteria (B) experiencing
autoimmune targeting in a chemostat as
$$ \dot{B} = B \left( \frac{vR}{z + R} - \alpha - w \right) \tag{B.24} $$
with resources ($R$)
$$ \dot{R} = w(A - R) - \frac{evR}{R + z} B \tag{B.25} $$
where $\mu_A$ is the spacer acquisition rate and $\alpha = 50\mu_A$ is an estimate of the rate
of autoimmunity based on the relative genome sizes of S. thermophilus and its lytic
phage 2972 [38, 233] (other parameters in Table C.2).
As shown in Fig B.16, there is little effect of autoimmunity on the equilibrium
density of bacteria for $\mu_A < 10^{-3}$, but after a point around $\mu_A = 10^{-2}$ there is a
rapid drop in density. This puts a theoretical cap on the maximum rate of spacer
uptake in our system and imposes a severe cost on spacer uptake rates greater than
$10^{-2}$ given the parameters used here, based on the S. thermophilus system. Other
taxa do seem to exhibit some degree of self versus non-self recognition, but still
frequently incorporate self-spacers [25, 61, 153], suggesting that our general result
holds across taxa though the threshold is likely to be variable. We note that in Fig
B.16 even a 50% reduction in the rate of autoimmunity only shifts the threshold
spacer acquisition rate by a small amount. Additionally, although CRISPR may
provide a competitive advantage when viruses are present in the system, this ad-
vantage cannot help the host overcome the autoimmunity cap on acquisition rate
after which the population is no longer viable.
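The equilibrium bacterial density implied by Eqs B.24 and B.25 can be obtained in closed form by setting both derivatives to zero, which gives $R^* = \frac{z(\alpha + w)}{v - \alpha - w}$ and $B^* = \frac{w(A - R^*)}{e(\alpha + w)}$ whenever the population is viable. The sketch below evaluates this expression; the function name is hypothetical and the default parameter values are those listed in the autoimmunity parameter table, used here purely for illustration.

```python
def equilibrium_density(mu_a, e=5e-7, v=1.4, z=1.0, w=0.3, A=350.0,
                        alpha_per_mu=50.0):
    """Equilibrium bacterial density B* for Eqs B.24-B.25.

    mu_a: spacer acquisition rate; alpha = alpha_per_mu * mu_a is the
    autoimmunity rate. Returns 0.0 when the population washes out.
    """
    alpha = alpha_per_mu * mu_a
    loss = alpha + w  # total per-capita loss: autoimmunity plus outflow
    if loss >= v:  # growth can never balance loss
        return 0.0
    r_star = z * loss / (v - loss)  # resource level where growth = loss
    if r_star >= A:  # required resource exceeds the supply
        return 0.0
    return w * (A - r_star) / (e * loss)
```

Under these assumed parameters the density declines modestly as $\mu_A$ approaches $10^{-3}$, falls more steeply near $10^{-2}$, and reaches zero once $\alpha + w \geq v$.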
B.9 Bet Hedging Against Memory Loss
Spacer loss in the CRISPR array most likely occurs via homologous recombination
of repeat sequences [31, 143, 144]. Thus the time to immune loss will increase
with the number of arrays targeting a particular viral species. Assuming that immunity
towards a given virus in a single array has an exponentially distributed lifetime
with expected value $\tau$ (i.e., time to loss of all spacers targeting that virus in that
array), in the absence of novel acquisitions the expected time to complete immune
loss is $\tau \sum_{i=1}^{N} \frac{1}{i}$, where $N$ is the number of arrays that initially target the virus in
question. Clearly, the advantage conferred in terms of memory span decreases with
each additional array, though this effect is important for the first few added arrays.
In fact, it is more appropriate to model the lifetime of individual spacers with an
exponential distribution such that the expected time to complete immune loss is
$\tau_s \sum_{i=1}^{n} \frac{1}{i}$, where $n$ is the total number of spacers in all arrays and $\tau_s$ the expected
lifetime of each spacer. Note that we assume here that arrays are of comparable
length (so that spacer loss rates remain constant). Thus the relative advantage of
multiple arrays is further reduced in the case where each array can have multiple
spacers targeting the same virus, assuming that spacer loss rates are similar across
arrays (appropriate in the case of identical arrays near some equilibrium length).
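The harmonic-sum expression above follows from the distribution of the maximum of independent exponential lifetimes. A brief worked version is given below, assuming (as the expression implies) that the $N$ arrays lose immunity independently of one another.

```latex
% Assumption: lifetimes T_1, ..., T_N are independent and exponentially
% distributed with mean \tau; immunity is lost when the last array fails.
\begin{align*}
P\left(\max_i T_i \le t\right) &= \left(1 - e^{-t/\tau}\right)^{N}, \\
E\left[\max_i T_i\right] &= \tau \sum_{i=1}^{N} \frac{1}{i}, \\
N = 1:\ \tau, \qquad N = 2:\ 1.5\,\tau, \qquad
N = 3:&\ \approx 1.83\,\tau, \qquad N = 4:\ \approx 2.08\,\tau.
\end{align*}
```

The $N$th array therefore adds only $\tau / N$ to the expected memory span, which is the diminishing return noted above.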
If spacers vary in their effectiveness in attacking a viral target then we would
expect this to increase the relative payoff of a bet-hedging strategy since it will essentially
reduce the number of effective spacers in any given array. There is evidence
that spacers vary in their targeting efficiency [234] in some systems. Nevertheless,
if a system experiences priming then it is extremely likely that a single array would
have many spacers towards the same target, making a bet hedging strategy less
likely.
B.10 No evidence for array specialization
In genomes with multiple arrays, the dissimilarity between consensus repeat
sequences of arrays in a single genome spanned a wide range of values (Levenshtein
Distance, Figs B.17 and B.18), though the mode was at zero (i.e., identical consensus
repeats). When limiting our scope to only genomes with exactly two CRISPR arrays,
we saw a bimodal distribution of consensus repeat dissimilarity, with one peak
corresponding to identical arrays within a genome and the other corresponding to
arrays with essentially randomly drawn repeat sequences except for a few conserved
sites between them (S7D Fig). We also observed that among functional genomes,
the area of the peak corresponding to dissimilar repeat sequences was significantly
higher than among non-functional genomes ($\chi^2 = 61.432$, $df = 1$, $p < 4.582 \times 10^{-15}$,
Fig B.17). This suggests that the observed signature of selection may be related to
the diversity of consensus repeat sequences among CRISPR arrays in a genome. On
the other hand, this enrichment of functional genomes with dissimilar arrays was
not observed in an independently-generated dataset, calling this result into question
(CRISPRdb [159], Appendix B.7, Fig B.18). Even when looking only at arrays
with identical consensus repeats, there is a clear interaction between functionality
and having multiple arrays, suggesting that selection maintaining multiple arrays is
present even in these cases (Fig B.19).
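For reference, the Levenshtein distance between two consensus repeats, and its mean over all pairs of arrays in a genome, can be computed as in the sketch below. The function names are hypothetical, and the distance is left unnormalized here, which is an assumption.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_pairwise_distance(repeats):
    """Mean Levenshtein distance over all pairs of consensus repeats."""
    pairs = list(combinations(repeats, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
```

A genome whose arrays all share one consensus repeat has a mean pairwise distance of zero, corresponding to the mode described above.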
Finally, we sought to assess if this observed variability in repeat sequences
among arrays might have functional implications for CRISPR immunity, even when
arrays share a set of cas genes. One measure of system functionality is array length,
as we expect it to be correlated with the rate of spacer acquisition [226]. Therefore,
we determined whether the mean pairwise dissimilarity between array consensus
repeat sequences in a genome was associated with the variance of array lengths in
that genome. Array length was measured as the number of repeats in an array.
In genomes with exactly two arrays, the mean pairwise distance between consensus
repeats within a genome was positively associated with the variance of the number
of repeats across arrays in a genome, but this relationship was poorly predictive,
not significant considering multiple testing, and likely spurious (linear regression,
$R^2 = 0.002557$, $p = 0.0382$).
Only $\leq 1$ cas set   Sub-sampled   $\hat{\mu}_S$   $\hat{\mu}_N$   $\Delta\mu$   Bootstrap 2.5%   Bootstrap 97.5%   $s^*$
No                      No            1.56            0.46            1.09          1.07             1.12              2
No                      Yes           2.26            1.13            1.13          1.05             1.22              2
Yes                     No            1.05            0.45            0.61          0.59             0.62              2
Yes                     Yes           1.26            1.07            0.18          0.11             0.26              1
Table B.1: Test for selection maintaining multiple arrays applied to different subsets
of the RefSeq data. See Figs 3.1, B.1, B.6, and B.7.
Species                       $\Delta\mu$
Staphylococcus aureus         $1.15 \pm 0.37$
Klebsiella pneumoniae         $0.76 \pm 0.06$
Shigella sonnei               $0.72 \pm 0.17$
Listeria monocytogenes        $0.67 \pm 0.08$
Mycobacterium tuberculosis    $0.41 \pm 0.05$
Pseudomonas aeruginosa        $0.35 \pm 0.16$
Campylobacter jejuni          $-0.12 \pm 0.05$
Escherichia coli              $-0.20 \pm 0.04$
Salmonella enterica           $-0.54 \pm 0.06$
Table B.2: Species-specific values of $\Delta\mu$ with bootstrapped 95% CIs.
[Figure B.1, panels (a)-(d). Axes: (a, b) Number of CRISPR Arrays Per Genome vs. Number of Genomes; (c) Shift (s) vs. Sum of Squared Differences; (d) bootstrapped estimates of $\hat{\mu}_N$ and $\hat{\mu}_S$ vs. Bootstrap Freq.]
Figure B.1: Dataset restricted to genomes with one or fewer sets of cas genes
(one copy or less each of cas1, cas2, and a cas targeting gene). (a-b) Distribution
of number of arrays per genome in (a) non-functional genomes and (b) functional
genomes. In (a) the black circles show the negative binomial fit to the distribution of
arrays in non-functional genomes. In (b) black circles indicate the negative binomial
fit to the single-shifted distribution ($s = 1$) and pink triangles to the double-shifted
distribution ($s = 2$). (c) The optimal shift is where the difference between the two
distributions is minimized. (d) The bootstrapped distributions of the parameter
estimates of $\hat{\mu}_S$ and $\hat{\mu}_N$ show no overlap with $n = 1000$ samples drawn.
Symbol Denition Value
?A Unprimed Spacer Acquisition Rate varied Figs 3.2(a) and B.3, spacersinfection
?L Per-Spacer Loss Rate 10?2 1spacers?time
vT Density Virus T?Adsorption 102 infectionstime
vB Density Virus B?Adsorption 102 infections , varied in Fig 3.2(a)time
p Priming Factor 104, varied in Fig B.3
Table B.3: Denitions of relevant variables and parameters for CRISPR array model.
Infection refers to the adsorption and injection of a phage into the host. Time is
in units of viral return intervals (i.e. the amount of time the transient phage is
absent from the system).
Symbol Denition Value
e Resource Consumption Rate of Growing Bacteria 5? 10?7 ?g/mL
v Maximum Bacterial Growth Rate 1.4 divisions/hr
z Resource Concentration for Half-Maximal Growth 1 ?g/mL
w Flow Rate 0.3 mL/hr
A Resource Pool 350 ?g/mL
Table B.4: Denitions of relevant variables and parameters for autoimmunity model.
Figure B.2: Alternative model results with array length cap agree qualitatively
with those of the primary model ($p = 1$ (no priming), $L = 5$, and $v_T = 100$). Blue
signifies memory washout ($t_I \geq 10^{-5}$) and yellow signifies immune maintenance
towards the transient virus, $T$ ($t_I < 10^{-5}$).
Figure B.3: Priming increases the region of memory washout and thus deepens the
memory span versus acquisition rate tradeoff. Phase diagram of the behavior of our
CRISPR array model with two viral species, a constant background population and
a transient population that leaves and returns to the system at some fixed interval
(Appendix B.4, Fig B.12). The yellow region indicates that immunity towards both
viral species was maintained. The green region indicates where immune memory
towards the transient viral species was lost, but reacquired almost immediately upon
viral reintroduction. The light blue region indicates that only immunity towards
the background species was maintained (i.e., immune memory towards the transient
viral species was rapidly lost but not rapidly reacquired). Dark blue indicates where
equilibrium spacer content towards one or both species did not exceed one despite
both species being present in the system. The parameter p is the priming factor
by which acquisition rate is increased when spacers towards a given target already
exist in the array.
[Figure B.4, panels (a)-(d). Axes: (a, b) Number of CRISPR Arrays Per Genome vs. Number of Genomes; (c) Shift (s) vs. Sum of Squared Differences; (d) bootstrapped estimates of $\hat{\mu}_N$ and $\hat{\mu}_S$ vs. Bootstrap Freq.]
Figure B.4: Signature of multi-array selection in archaeal genomes. (a-b)
Distribution of number of arrays per genome in (a) non-functional genomes and (b)
functional genomes. In (a) the black circles show the negative binomial fit to the
distribution of arrays in non-functional genomes. In (b) black circles indicate the
negative binomial fit to the single-shifted distribution (s = 1) and pink triangles to
the double-shifted distribution (s = 2). (c) The optimal shift is where the difference
between the two distributions is minimized. (d) The bootstrapped distributions of
the parameter estimates of $\hat{\mu}_S$ and $\hat{\mu}_N$ show significant overlap
with n = 1000 samples drawn. We note that the large majority of archaeal genomes
had CRISPR arrays and were also functional, making our approach less powerful.
Further, if those few non-functional genomes lost their cas spacer acquisition
machinery recently, then our power would be reduced even more because these
genomes might still bear the remnants of past selection. In general, as more archaeal
genomes become available in public databases we will have more power to search
for selection using our test, as the overall sample size is currently small.
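The shift search shown in panel (c) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis. It assumes that shifting by s means subtracting s from each functional genome's array count and dropping genomes with fewer than s arrays, and that both distributions are normalized before they are compared. The function names and the toy data are hypothetical.

```python
from collections import Counter


def normalized_counts(counts, max_k):
    """Return the proportions [P(0), ..., P(max_k)] for per-genome array counts."""
    tally = Counter(counts)
    total = len(counts)
    return [tally.get(k, 0) / total for k in range(max_k + 1)]


def shifted(counts, s):
    """Shift array counts down by s, dropping genomes with fewer than s arrays
    (an assumption about how the shifted distribution is defined)."""
    return [c - s for c in counts if c >= s]


def sum_sq_diff(non_functional, functional, s):
    """Sum of squared differences between the non-functional distribution and
    the s-shifted functional distribution."""
    s_counts = shifted(functional, s)
    if not s_counts:
        return float("inf")  # nothing left to compare at this shift
    max_k = max(max(non_functional), max(s_counts))
    p_n = normalized_counts(non_functional, max_k)
    p_s = normalized_counts(s_counts, max_k)
    return sum((a - b) ** 2 for a, b in zip(p_n, p_s))


def optimal_shift(non_functional, functional, max_shift=5):
    """Return the shift with the smallest sum of squared differences, plus all scores."""
    scores = {s: sum_sq_diff(non_functional, functional, s)
              for s in range(max_shift + 1)}
    return min(scores, key=scores.get), scores


# Toy data: functional genomes carry exactly one more array than non-functional ones.
toy_non_functional = [0, 0, 0, 1, 1, 2]
toy_functional = [c + 1 for c in toy_non_functional]
best_shift, all_scores = optimal_shift(toy_non_functional, toy_functional)
```

On the toy data the minimum falls at a shift of one, which is the pattern the caption describes as the signature of selection for an additional array.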
[Figure B.5 plot: boxplots of Number of Arrays for Type I, Type II, and Type III
systems.]
Figure B.5: Boxplots of array counts associated with genomes carrying a particular
type of cas targeting machinery. System type was determined by the type of cas
targeting gene found on the genome (genomes with no signature targeting genes
or multiple types are excluded). Outlier points for genomes with > 10 arrays not
shown for readability.
[Figure B.6 plot. Panels (a, b): Number of Genomes vs. Number of CRISPR Arrays
Per Genome. Panel (c): Sum of Squared Differences vs. Shift (s). Panel (d):
Bootstrap Freq. of the parameter estimates, with series N and S.]
Figure B.6: Dataset with subsampled genomes of overrepresented taxa. (a-b)
Distribution of number of arrays per genome in (a) non-functional genomes and (b)
functional genomes. In (a) the black circles show the negative binomial fit to the
distribution of arrays in non-functional genomes. In (b) black circles indicate the
negative binomial fit to the single-shifted distribution (s = 1) and pink triangles to
the double-shifted distribution (s = 2). (c) The optimal shift is where the difference
between the two distributions is minimized. (d) The bootstrapped distributions of
the parameter estimates of $\hat{\mu}_S$ and $\hat{\mu}_N$ show no overlap with
n = 1000 samples drawn.
[Figure B.7 plot. Panels (a, b): Number of Genomes vs. Number of CRISPR Arrays
Per Genome. Panel (c): Sum of Squared Differences vs. Shift (s). Panel (d):
Bootstrap Freq. of the parameter estimates, with series N and S.]
Figure B.7: Dataset restricted to genomes with one or fewer sets of cas genes (one
copy or less each of cas1, cas2, and a cas targeting gene) with subsampled genomes
of overrepresented taxa. (a-b) Distribution of number of arrays per genome in (a)
non-functional genomes and (b) functional genomes. In (a) the black circles show
the negative binomial fit to the distribution of arrays in non-functional genomes. In
(b) black circles indicate the negative binomial fit to the single-shifted distribution
(s = 1) and pink triangles to the double-shifted distribution (s = 2). (c) The
optimal shift is where the difference between the two distributions is minimized.
(d) The bootstrapped distributions of the parameter estimates of $\hat{\mu}_S$ and
$\hat{\mu}_N$ show nearly no overlap with n = 1000 samples drawn.
[Figure B.8 plot: distributions of Score for Non-Functional and Functional arrays.]
Figure B.8: Similar CRISPRDetect score distributions in non-functional and
functional arrays. Non-functional arrays have slightly higher scores (Wilcoxon
rank-sum test, p = 0.01254), although the effect size is marginal and statistical
significance at this level is questionable after considering the number of tests
conducted in this study.
[Figure B.9 plot. Panels (a, b): Density vs. Number of Repeats Per Array. Panel
(c): Density vs. Mean Number of Repeats Per Array in a Genome. Series:
Non-Functional and Functional.]
Figure B.9: Arrays in functional genomes are longer on average than arrays in
non-functional genomes (t-test, $p < 2 \times 10^{-16}$). (a) Full dataset. (b) Data
from genomes with one or fewer sets of cas. (c) Functional genomes tend to have
more repeats on average in their CRISPR arrays than non-functional genomes (array
length first averaged over arrays in each genome, so each datapoint is a genome
rather than an array). (a,b,c) In blue is the distribution of mean array length in
functional genomes. In red (overlaid) is the distribution of mean array length in
non-functional genomes. Vertical lines with corresponding colors indicate the means
of these distributions.
[Figure B.10 plot. Panels (a) to (g): Sum of Squared Differences vs. Shift (s).]
Figure B.10: Short, functional arrays do not drive the result of our non-parametric
test for selection. The results for our non-parametric test for selection when
removing all functional arrays shorter than (a) 4 repeats, (b) 5 repeats, (c) 6
repeats, (d) 7 repeats, (e) 8 repeats, (f) 9 repeats, and (g) 10 repeats. Note that in
all cases the test indicates selection maintaining two arrays.
[Figure B.11 plot: $\Delta\mu$ vs. Minimum array length for functional arrays.]
Figure B.11: Short, functional arrays do not drive the result of our parametric test
for selection. The results for our parametric test for selection when removing all
functional arrays shorter than a given threshold. Note that even when removing all
functional arrays less than 10 repeats long we still see a substantial signature of
selection maintaining multiple arrays, and that up to a threshold of 6 repeats this
signature is rather strong. By the nature of this test, $\Delta\mu$ must decrease
monotonically as the threshold increases (Appendix B.3).
Figure B.12: Outline of model analysis. Hypothetical time-course of $C_T$ during the
departure and return of virus T from the system. Virus T leaves at time t = 0 and
returns at time t = 1. $C_T$ exceeds a value of one after viral reintroduction at time
$1 + t_I$.
[Figure B.13 plot. Axes: Number of cas1 genes and Number of CRISPR arrays.]
Figure B.13: Relationship between the number of cas1 genes and the number of
CRISPR arrays in a genome.
Category sample sizes shown in Figure B.13: 0 (n = 10801), 1 (n = 32156),
2 (n = 2891), 3 (n = 392), 4 (n = 100), 5 (n = 27), 6 (n = 3), 7 (n = 2),
8 (n = 1), 9 (n = 1).
[Figure B.14 plot: phylogeny with species names at the tips. Color legend,
$\Delta\mu$ (Species-Specific): $\Delta\mu < 0$, $0 \le \Delta\mu \le 1$,
$\Delta\mu > 1$. Tip size legend, Number of Genomes: 10, 100, 1000.]
Figure B.14: Species-specific $\Delta\mu_k$ values mapped onto the SILVA Living Tree
16S rRNA Tree. In red are species experiencing apparent selection against having a
functional CRISPR array. In blue are species showing a strong signature of selection
for multiple arrays. Number of genomes is represented by tip size, and is a rough
indicator of power.
[Figure B.15 plot. Panel (a): Score of second array vs. Score of first array, with
marginal densities. Panel (b): Number of repeats in second array vs. Number of
repeats in first array, with marginal densities.]
Figure B.15: In two-array genomes there is a slight positive association in both
array score and array length between arrays within the same genome. (a) Most
array scores are centered around 8, but there is some positive association (linear
regression, m = 0.49, $p < 2 \times 10^{-16}$, $R^2 = 0.2387$). (b) A similar
relationship holds for length (linear regression, m = 0.39, $p < 2 \times 10^{-16}$,
$R^2 = 0.1458$).
Figure B.16: Equilibrium host density values from autoimmunity model (Appendix
B.8) over varying spacer acquisition rates. Curves end because at extremely high
spacer acquisition rates the equilibrium no longer exists if we restrict both host
density and resource concentration to be non-negative.
[Figure B.17 plot. Panels (a, b): Frequency vs. Pairwise Distance Between
Consensus Repeats Within a Genome. Panels (c, d): Density vs. Mean Pairwise
Distance Between Consensus Repeats Within a Genome.]
Figure B.17: Pairwise distance between consensus repeats from arrays within a
genome (only genomes with two arrays shown). Distance calculated as Levenshtein
Distance between each pair divided by the length of the longest repeat in the pair.
(a) Non-functional genomes. (b) Functional genomes. The functional genomes are
enriched with dissimilar arrays ($\chi^2 = 26.406$, df = 1, $p = 2.766 \times 10^{-7}$,
dissimilarity cutoff at 3 based on median across all two-array genomes). (c) The
distributions of pairwise differences between arrays in functional genomes (blue, a)
and non-functional genomes (red, b) are overlaid alongside a distribution of distances
between random sequences with lengths drawn from the empirical distribution of
repeat lengths in the full dataset (gray). The green line indicates the bottom 0.1%
of this random (gray) distribution, which can be used as an alternative similarity
cutoff ($\chi^2 = 54.653$, df = 1, $p = 1.505 \times 10^{-13}$). (d) Same as (c) but
with 4 bases held constant across all repeats to simulate some degree of universal
sequence conservation at one end of the repeat as observed among type II-A CRISPR
systems [235] ($\chi^2 = 64.168$, df = 1, $p = 1.142 \times 10^{-15}$). In (c,d) the
simulated distribution takes into account the overall frequency of each base across
repeat sequences. Histograms drawn using default settings of hist() function in R
(right-closed/left-open intervals, except for the first interval which includes the
lower bound, i.e. zero).
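The distance described in this caption can be illustrated with a short Python sketch. It is not the code used to produce the figure. The function names are hypothetical, and the example sequences are made up rather than taken from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings,
    computed with the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


# Made-up consensus repeats that differ at a single position.
example = normalized_distance("GTTTTAGAGC", "GTTTTAGAGA")
```

Dividing by the longer repeat bounds the distance between zero and one, which keeps pairs of repeats with different lengths comparable.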
[Figure B.18 plot. Panels (a) to (d): Frequency vs. Mean Pairwise Distance
Between Consensus Repeats Within a Genome.]
Figure B.18: Consensus repeat diversity across datasets (CRISPRDetect left vs.
CRISPRdb right) in two-array genomes (a,b) and all genomes (c,d). (a) Pairwise
distance between consensus repeats from arrays within a genome (only genomes
with exactly two arrays shown). Distance calculated as Levenshtein Distance
between each pair. Histogram drawn using default settings of hist() function in R
(right-closed/left-open intervals, except for the first interval which includes the
lower bound, i.e. zero). (b) The same as (a) for the CRISPR Database dataset. (c,d)
Mean pairwise distance between consensus repeats from all arrays in a genome,
including genomes with more than two arrays, for (c) the CRISPRDetect and (d) the
CRISPR Database datasets. While the distribution of pairwise distances between
repeat sequences in two-array genomes had approximately the same shape in both
datasets, the relationship between diversity and functionality was reversed in the
CRISPR Database dataset, with non-functional genomes having more diverse
consensus repeats among their arrays ($\chi^2 = 4.3952$, df = 1, p = 0.03604). This
opposing result calls into question the potential link between selection on multiple
arrays and consensus repeat diversity observed in the CRISPRDetect data, though
this may be due to the smaller size of the CRISPR Database dataset.
[Figure B.19 plot. Panels: Functional and Non-Functional. Frequency vs. Arrays
Per Genome.]
Figure B.19: Even restricting to arrays with identical consensus repeats, functional
genomes are more likely to have multiple CRISPR arrays. For each genome with
CRISPR we counted the incidence of each unique consensus repeat sequence and
retained only arrays with the most common sequence (choosing one at random
in the case of a tie, which has no effect on the outcome). We then plotted the
frequency of arrays per genome for this dataset, and found a clear excess of 2-array
vs. 1-array genomes in the functional category as compared to the non-functional
category ($\chi^2 = 1475.9$, df = 1, $p < 2.2 \times 10^{-16}$).
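The filtering step described in this caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original analysis code. The input layout (a dictionary from genome identifier to a list of consensus repeats, one per array) and all names are hypothetical.

```python
from collections import Counter


def arrays_with_modal_repeat(consensus_repeats):
    """Count the arrays in one genome that share its most common consensus repeat.

    Ties are broken by first occurrence here rather than at random. Tied
    sequences have equal counts, so the retained number of arrays is the same.
    """
    if not consensus_repeats:
        return 0
    _, count = Counter(consensus_repeats).most_common(1)[0]
    return count


def arrays_per_genome_frequency(genomes):
    """Tally how many genomes retain 1, 2, 3, ... arrays after filtering."""
    retained = {gid: arrays_with_modal_repeat(reps) for gid, reps in genomes.items()}
    return Counter(retained.values())


# Made-up genomes: g1 keeps two arrays, g2 and g3 keep one each.
toy_genomes = {
    "g1": ["AAA", "AAA", "CCC"],
    "g2": ["GGG"],
    "g3": ["AAA", "CCC"],
}
toy_frequency = arrays_per_genome_frequency(toy_genomes)
```

Running the tally separately for functional and non-functional genomes would give the two frequency distributions that the caption compares.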
Appendix C: Supplemental Information For: Immune Loss as a Driver
of Coexistence During Host-Phage Coevolution
C.1 Parameter Values
For both our analytical and simulation models we attempt to constrain parameter
values within realistic ranges where possible. Resource uptake parameters
(e, v, z) and burst size ($\beta$) are taken from Levin et al. [236], although they are
rough estimates in some cases. We let $\delta = 10^{-8}$ be our base value for the rate
of adsorption. Levin et al. [236] fit a model to data to estimate a value of
$\delta = 10^{-7}$, but they also incorporate a lag time, which we approximate by
lowering $\delta$ tenfold.
C.2 Alternative Costs of Immunity
While autoimmunity represents one class of costs that may be associated with
prokaryotic immune systems (i.e. lethality via an additional death term), other cost
structures exist that may be applied to growth. Immune hosts may suffer from either
reduced resource affinity (z) or reduced maximal growth rate (v). Thus we can write the
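chemostat system with resources:
\[ \dot{R} = w(A - R) - \frac{evR}{z + R}(D + U), \tag{C.1} \]
defended host:
\[ \dot{D} = D\left(\frac{v_d R}{z_d + R} - \delta\phi_d P - \alpha - \mu - w\right), \tag{C.2} \]
undefended host:
\[ \dot{U} = U\left(\frac{v_u R}{z_u + R} - \delta\phi_u P - w\right) + \mu D, \tag{C.3} \]
and phage:
\[ \dot{P} = P\left(\delta U(\phi_u\beta - 1) + \delta D(\phi_d\beta - 1) - w\right), \tag{C.4} \]
where we let
\[ z_d = c_z z_u \tag{C.5} \]
and
\[ v_d = \frac{v_u}{c_v}, \tag{C.6} \]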
so that the resource affinity penalty, $c_z$, and growth rate penalty, $c_v$, describe the
costs applied to each aspect of host population growth respectively. It is possible
that alternative cost regimes are more capable of producing robust coexistence under
realistic parameter ranges than is autoimmunity ($\alpha$).
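To make the structure of the system (C.1)-(C.6) concrete, the following Python sketch evaluates its right-hand side and advances it with a fourth-order Runge-Kutta step. It is a minimal sketch, not the simulation code behind the figures. The parameter names (delta for adsorption, beta for burst size, phi_d and phi_u for susceptibility of defended and undefended hosts, alpha for autoimmunity, mu for immune loss) are assumed labels for the symbols in the equations. The unsubscripted v and z in (C.1) are assumed to be the undefended-host values. The numerical values are placeholders chosen only so that the example runs.

```python
def chemostat_rhs(state, p):
    """Right-hand side of (C.1)-(C.4) for state = (R, D, U, P)."""
    R, D, U, P = state
    z_d = p["c_z"] * p["z_u"]  # (C.5): resource affinity penalty
    v_d = p["v_u"] / p["c_v"]  # (C.6): growth rate penalty
    growth_d = v_d * R / (z_d + R)
    growth_u = p["v_u"] * R / (p["z_u"] + R)
    # Assumption: v and z in (C.1) are taken to be v_u and z_u.
    uptake = p["e"] * p["v_u"] * R / (p["z_u"] + R)
    dR = p["w"] * (p["A"] - R) - uptake * (D + U)
    dD = D * (growth_d - p["delta"] * p["phi_d"] * P
              - p["alpha"] - p["mu"] - p["w"])
    dU = U * (growth_u - p["delta"] * p["phi_u"] * P - p["w"]) + p["mu"] * D
    dP = P * (p["delta"] * U * (p["phi_u"] * p["beta"] - 1)
              + p["delta"] * D * (p["phi_d"] * p["beta"] - 1) - p["w"])
    return (dR, dD, dU, dP)


def rk4_step(state, p, dt):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def nudge(x, k, h):
        return tuple(xi + h * ki for xi, ki in zip(x, k))

    k1 = chemostat_rhs(state, p)
    k2 = chemostat_rhs(nudge(state, k1, dt / 2), p)
    k3 = chemostat_rhs(nudge(state, k2, dt / 2), p)
    k4 = chemostat_rhs(nudge(state, k3, dt), p)
    return tuple(x + dt / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))


# Placeholder values for illustration only; they are not the values used here.
params = {
    "A": 350.0, "w": 0.3, "e": 5e-7, "v_u": 2.0, "z_u": 1.0,
    "c_v": 1.0, "c_z": 1.0, "delta": 1e-8, "beta": 80.0,
    "phi_u": 1.0, "phi_d": 0.0, "alpha": 0.0, "mu": 5e-4,
}

# With no hosts or phage and R = A, the system sits at the washout equilibrium.
washout = (params["A"], 0.0, 0.0, 0.0)
washout_next = rk4_step(washout, params, dt=0.1)
```

Setting c_v or c_z above one applies the growth rate or resource affinity cost to the defended host, while a nonzero alpha applies the autoimmunity cost instead, so the three cost regimes can be compared within the same function.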
Both alternative cost regimes can produce stable coexistence (Fig C.1), although
they are applied to a nonlinear term in the growth equations and thus
behave differently than autoimmunity (Fig C.2). We see that under realistic initial
conditions immune loss can produce coexistence over a wider range of parameter
space than the other mechanisms, but all can do so over some range, and the ranges
of values for $\alpha$ and $\mu$ are not easily comparable with the ranges for $c_z$ and $c_v$ (Fig C.3).
All mechanisms can produce coexistence with initial conditions perturbed away from
the equilibrium condition (Figs C.4-C.7), but only immune loss reliably produces
coexistence over any part of parameter space when very large perturbations to the
system occur (on the order of those expected with serial dilution; Fig C.7). We also
note that the condition we call coexistence in Figs C.4-C.7 is that both immune
hosts and phages are present at 80 days at a level likely to be detected experimentally
(density of 100/mL), which can be taken as the most general requirement for
coexistence. If we observe the distribution of outcomes of these simulations in Fig
C.8 we see that growth rate and resource affinity costs tend to produce coexistence
regimes that are dominated by susceptible hosts even with milder (though still
severe) perturbations.
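The perturbation scheme described in the captions of Figs C.4-C.7 and the coexistence criterion above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the uniform draw is taken independently for each state variable (the captions do not say whether one draw is shared), and "present" is read as a density at or above the 100/mL threshold.

```python
import random

DETECTION_THRESHOLD = 100.0  # density per mL, taken from the text

def perturb_initial_state(equilibrium, scale, rng):
    """Scale each equilibrium value by (1 + scale * Y), with Y uniform on [0, 1].

    Assumption: Y is drawn independently for each of R, D, U, and P.
    """
    return [x * (1.0 + scale * rng.random()) for x in equilibrium]

def is_coexistence(final_state, threshold=DETECTION_THRESHOLD):
    """True when both defended hosts and phages are detectable at the end of a run."""
    _, defended, _, phage = final_state
    return defended >= threshold and phage >= threshold

def coexistence_fraction(final_states):
    """Fraction of replicate runs that end in coexistence."""
    return sum(is_coexistence(s) for s in final_states) / len(final_states)

rng = random.Random(1)
equilibrium = [0.3, 1e6, 1e3, 1e5]  # hypothetical [R, D, U, P] values
start = perturb_initial_state(equilibrium, 0.1, rng)
```

Each perturbed state would then be passed to the numerical solver, and the states at 80 days collected for `coexistence_fraction`.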
C.2.0.1 Simulation Parameters
The rate of loss of functionality in the CRISPR immune system of S. epidermidis
has been shown to fall in the range $10^{-4}$ to $10^{-3}$ losses per individual per
generation [51]. We choose to use a value in the middle of this range ($\mu_L = 5 \times 10^{-4}$)
for our simulations. Similarly, based on the fact that there appears to be no self
vs. non-self recognition in the CRISPR system of S. thermophilus [27], and that
the genome of S. thermophilus is roughly 50 times the size of its lytic phage 2972
[38, 233], the rate of incorporation of a self-spacer by the CRISPR system should be
approximately 50 times the spacer acquisition rate per adsorbed phage. Assuming
that incorporation of a self-spacer is instantaneously toxic and leads to unavoidable
cell death, we can take this value as our rate of autoimmunity ($\alpha = 50\mu_b$). This
gives us a value on the high end of possible $\alpha$ values, as it is possible that there
is self recognition that has not been observed experimentally or that the action of
autoimmunity can be delayed or avoided through spacer loss or corruption, as has
been found in some experiments [26, 51].
We set $n_s = 10$ due to computational constraints, as small increases in cassette
length lead to a large increase in computational time. The experiments we compare
our models to saw small expansions of the CRISPR cassette (2-3 spacers) and our
system reaches either a phage-cleared or stable coexistence state where coevolution
is halted well before hosts have obtained the complete set of spacers. While the phage
genome has many protospacers (200+), in the S. thermophilus-phage 2972 system
the large majority of spacers come from a small subset of possible protospacers on the
phage genome ($\sim 30$), and thus our limited protospacer set may be an appropriate
approximation of reality [26]. Our value for the cost per PAM mutation ($c$) is set
arbitrarily, but is linked to the value we choose for $n_s$. Our results are robust to the
value of this parameter (Fig C.17).
Childs et al. [214] use a value of $\mu_p = 5 \times 10^{-7}$ for the protospacer mutation
rate in their model of CRISPR-phage coevolution. We choose to introduce newly
mutated strains at a population size of 100 individuals to eliminate the effects of
drift in our model and thus speed up our simulations, which requires a corresponding
decrease in the mutation rate. Additionally, we choose to consider PAM mutations
only, and since the PAM region is considerably shorter than the protospacer itself,
this also warrants a decrease in mutation rate. Thus we reduce their parameter
estimate tenfold for use in our simulations ($\mu_p = 5 \times 10^{-8}$). Similarly we reduce
previous estimates of the spacer acquisition rate ($\mu_b = 10^{-6}$) fivefold to account for
our introduction of novel strains at a higher population size [19, 210, 211, 214, 236].
We use an initial multiplicity of infection (MOI) of 1 phage per host, corresponding
to experimental values. We simulated outcomes with an initial MOI of 10
to confirm robustness to initial conditions (Fig C.20), as seen in previous experimental
work [150]. Burst size ($\beta$) estimates for phage are imprecise. We ran additional
simulations at high and low burst sizes to confirm that our qualitative results are
robust to changes in this parameter (Fig C.21).
C.2.0.2 Varying adsorption
Because we do not have a good estimate of the rate of adsorption in this system
[236], and because we choose to depress our adsorption rate as a compensation for
the lack of latent period in our models, we explore the response of our results to
large variations in $\delta$ (Figs C.14 and C.18).
C.3 Analysis and Simulation Methods
C.3.0.1 Analysis
We found equilibria of our analytical model using Wolfram Mathematica (Version
11.0, Wolfram Research Inc., Champaign, IL, 2016). We assessed stability by
linearizing the system around each equilibrium point via the Jacobian. We performed
robustness analysis by solving our system numerically using a variable order
method for stiff systems (MATLAB 9.0, The MathWorks Inc., Natick, MA, 2016;
ode15s) at 80 days from initial conditions described in Main Text Fig. 4.1.
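The linearization step can be written out explicitly. As a sketch, taking the state ordering $(R, D, U, P)$ and the shorthands $g(R) = vR/(z+R)$, $g_d(R) = v_d R/(z_d+R)$, and $g_u(R) = v_u R/(z_u+R)$ (notation introduced here only for compactness), the Jacobian of Eqs. C.1-C.4 in Appendix C.2 is:

```latex
J =
\begin{pmatrix}
-w - e\,g'(R)(D+U) & -e\,g(R) & -e\,g(R) & 0 \\
D\,g_d'(R) & g_d(R) - \delta\phi_d P - \alpha - \mu - w & 0 & -\delta\phi_d D \\
U\,g_u'(R) & \mu & g_u(R) - \delta\phi_u P - w & -\delta\phi_u U \\
0 & \delta P(\phi_d\beta - 1) & \delta P(\phi_u\beta - 1) & \delta U(\phi_u\beta - 1) + \delta D(\phi_d\beta - 1) - w
\end{pmatrix},
\qquad
g'(R) = \frac{vz}{(z+R)^2},
```

with $g_d'$ and $g_u'$ defined analogously from the subscripted parameters. An equilibrium is locally asymptotically stable when every eigenvalue of $J$ evaluated at that point has a negative real part.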
C.3.0.2 Simulations
We solve our system numerically using a variable order method for stiff systems
(MATLAB 9.0, The MathWorks Inc., Natick, MA, 2016; ode15s), pausing the solver
to add strains due to spacer acquisition and PAM mutation events, and to perform
serial dilutions at 24 hour intervals. When we reach a serial dilution the resource
concentration is reset to its initial value and all populations are reduced by a factor
of 100. Spacer acquisition and PAM mutation rates are updated at the next strain
addition or dilution event or at a preset maximum update interval (1 hr) and used
to draw the time of next addition from an exponential distribution.
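The scheduling logic in this paragraph can be sketched as below. This is a minimal illustration, not the MATLAB code used for the simulations: time is assumed to be measured in hours, all function and constant names are hypothetical, and the ODE integration between stops is omitted.

```python
import random

DILUTION_INTERVAL_H = 24.0    # serial dilution every 24 hours
MAX_UPDATE_INTERVAL_H = 1.0   # rates are refreshed at least this often
DILUTION_FACTOR = 100.0       # all populations reduced by a factor of 100

def draw_next_addition(now, total_rate, rng):
    """Draw the time of the next strain addition from an exponential distribution."""
    if total_rate <= 0.0:
        return float("inf")  # no event can occur at the current rates
    return now + rng.expovariate(total_rate)

def next_stop(now, next_addition, next_dilution):
    """Earliest time at which the solver should pause, with the reason for pausing."""
    candidates = [
        (next_addition, "addition"),
        (next_dilution, "dilution"),
        (now + MAX_UPDATE_INTERVAL_H, "rate_update"),
    ]
    return min(candidates)

def serial_dilution(populations, resource_initial):
    """Reset the resource to its initial value and dilute every population."""
    diluted = {name: size / DILUTION_FACTOR for name, size in populations.items()}
    return resource_initial, diluted

rng = random.Random(0)
t_add = draw_next_addition(0.0, 0.5, rng)
```

Because the exponential distribution is memoryless, redrawing the addition time whenever the rates are refreshed does not bias the event times as long as the rates are treated as constant between updates.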
When an addition event occurs the type of event is drawn with each type's
probability being proportional to its calculated rate. We then draw the strain in the
population to serve as the base for the new strain, with the probability of choosing
each strain proportional to its strain-specific acquisition, mutation, or recombination
rate. In the case of spacer acquisition a spacer is drawn with probabilities based on
the overall prevalence of each corresponding protospacer in the phage population.
In the case of PAM mutation or back mutation a protospacer is drawn uniformly
from the chosen strain's genome. In all cases new strains are added at a population
size of 100 individuals so that we focus only on strains that are able to establish
themselves and neglect drift to speed up computation. Accordingly we set $\mu_b$, $\mu_p$, and
$\mu_q$ lower than might otherwise be expected (see Appendix C.1). During simulations
we dynamically adjust our rates so as to avoid adding strains already present in the
system.
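The three nested draws described above (event type, base strain, then spacer or protospacer) can be sketched as follows. The data layout, event labels, and function names are hypothetical choices made for this illustration, and the dynamic rate adjustment that avoids re-adding existing strains is not included.

```python
import random

NEW_STRAIN_SIZE = 100  # population size at which every new strain is introduced

def weighted_choice(weights, rng):
    """Pick a key with probability proportional to its (non-negative) weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

def draw_addition_event(event_rates, strain_rates, protospacer_prevalence,
                        strain_genomes, rng):
    """Draw the event type, the base strain, and the affected (proto)spacer."""
    event = weighted_choice(event_rates, rng)
    base = weighted_choice(strain_rates[event], rng)
    if event == "spacer_acquisition":
        # Spacer drawn by prevalence of its protospacer in the phage population.
        target = weighted_choice(protospacer_prevalence, rng)
    else:
        # PAM mutation or back mutation: uniform over the base strain's genome.
        target = rng.choice(strain_genomes[base])
    return {"event": event, "base_strain": base,
            "target": target, "size": NEW_STRAIN_SIZE}

rng = random.Random(0)
acquisition = draw_addition_event(
    {"spacer_acquisition": 1.0, "pam_mutation": 0.0, "pam_back_mutation": 0.0},
    {"spacer_acquisition": {"host_0": 1.0}},
    {"p1": 0.0, "p2": 3.0},
    {},
    rng,
)
```

Drawing the event type first and the strain second is equivalent to a single draw over all (event, strain) pairs weighted by their individual rates, provided the type-level rates are the sums of the strain-level rates.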
C.4 Experimental Methods
Powdered skim milk (Publix) was diluted in distilled water (10 g per 100 ml
water). The suspension was autoclaved at 110 °C for 12 minutes. Three ml of the
milk was then put into 13 mm × 100 mm glass tubes. Streptococcus thermophilus
(DGCC7710) was grown overnight in a broth, LM17 with added calcium [236]. To
initiate the serial transfer cultures, 30 µl of the overnight broth was added to the
tubes either alone or with the phage from an LM17Ca lysate. The initial densities
of the bacteria and phage were estimated by serial dilution and plating on LM17Ca
agar for the bacteria and with LM17Ca soft agar for the phages (see [236]). Each day,
the cultures were vortexed to suspend the bacteria and phages (the milk fermented),
densities were estimated, and 30 µl of the cultures were transferred to fresh tubes
with 3 ml milk. These cultures were serially passaged for the noted number of days,
with bacteria and phage densities estimated daily.
To test for bacteriophage immune mutants, periodically, single colonies from
the sampling plates were grown up on LM17 and used as lawns to test for their
sensitivity to the original phage and the phage in their respective cultures. For the
latter, LM17 lysates were made from single plaques taken from the sampled plates.
In this way, we were able to test for CRISPR escape mutants. For example, if the
bacteria from the culture appeared immune to the original phage, we would then
test for its sensitivity to phage from the serial passage culture. In this way, we
were able to follow some of the co-evolution that was occurring in the cultures, via
the acquisition of new spacers generating host and phage mutants evolving in these
cultures. For more details and a more extensive consideration of this co-evolution, see
[26, 150].
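$$
\begin{array}{l|cccc}
 & \text{Eq. 1} & \text{Eq. 2} & \text{Eq. 3} & \text{Eq. 4} \\ \hline
R^* & A & \frac{z(\alpha+\mu+w)}{v-\alpha-\mu-w} & \frac{1}{2}\left(A - \frac{evU^*}{w} - z + \sqrt{4Az + \left(A - \frac{evU^*}{w} - z\right)^2}\right) & \frac{zw}{v-w} \\
D^* & 0 & (\phi_u\beta - 1)U^* - \frac{w}{\delta} & 0 & 0 \\
U^* & 0 & \frac{1}{\phi_u\beta}\left(\frac{w(A-R^*)(z+R^*)}{evR^*} + \frac{w}{\delta}\right) & \frac{w}{\delta(\phi_u\beta-1)} & \frac{w(A-R^*)(z+R^*)}{evR^*} \\
P^* & 0 & \frac{1}{\delta\phi_u}\left(\frac{vR^*}{z+R^*} - w + \mu\frac{D^*}{U^*}\right) & \frac{1}{\delta\phi_u}\left(\frac{vR^*}{z+R^*} - w\right) & 0
\end{array}
$$
Table C.1: Equilibrium of general model without coevolution. Equilibria for the autoimmunity/locus-loss model where $\alpha > 0$
and $\mu > 0$. Not all substitutions have been made here in the interest of readability, but equilibrium values can be easily
computed using the above expressions. The stability of these equilibria can be assessed by linearizing our system around them
(i.e., taking the Jacobian) as described in Appendix C.3.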
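The equilibrium expressions can be checked numerically by substituting them back into the model and confirming that every derivative vanishes. The sketch below does this for the full-coexistence equilibrium (Eq. 2) and the phage-free, undefended-only equilibrium (Eq. 4), assuming the general model with shared $v$ and $z$ and no coevolution ($\phi_d = 0$). The parameter values and function names are hypothetical and chosen only so that all populations are positive.

```python
def rhs(state, p):
    """General model without alternative costs (shared v and z for both hosts)."""
    R, D, U, P = state
    g = p["v"] * R / (p["z"] + R)
    dR = p["w"] * (p["A"] - R) - p["e"] * g * (D + U)
    dD = D * (g - p["delta"] * p["phi_d"] * P - p["alpha"] - p["mu"] - p["w"])
    dU = U * (g - p["delta"] * p["phi_u"] * P - p["w"]) + p["mu"] * D
    dP = P * (p["delta"] * U * (p["phi_u"] * p["beta"] - 1.0)
              + p["delta"] * D * (p["phi_d"] * p["beta"] - 1.0) - p["w"])
    return [dR, dD, dU, dP]

def coexistence_equilibrium(p):
    """Eq. 2 of Table C.1 (all populations present; requires phi_d = 0)."""
    loss = p["alpha"] + p["mu"] + p["w"]
    R = p["z"] * loss / (p["v"] - loss)
    supply = p["w"] * (p["A"] - R) * (p["z"] + R) / (p["e"] * p["v"] * R)
    U = (supply + p["w"] / p["delta"]) / (p["phi_u"] * p["beta"])
    D = (p["phi_u"] * p["beta"] - 1.0) * U - p["w"] / p["delta"]
    g = p["v"] * R / (p["z"] + R)
    P = (g - p["w"] + p["mu"] * D / U) / (p["delta"] * p["phi_u"])
    return [R, D, U, P]

def undefended_only_equilibrium(p):
    """Eq. 4 of Table C.1 (no defended hosts, no phages)."""
    R = p["z"] * p["w"] / (p["v"] - p["w"])
    U = p["w"] * (p["A"] - R) * (p["z"] + R) / (p["e"] * p["v"] * R)
    return [R, 0.0, U, 0.0]

# Hypothetical parameter values for illustration only.
params = {
    "w": 0.3, "A": 350.0, "e": 5e-7, "v": 1.4, "z": 1.0,
    "delta": 1e-8, "beta": 80.0, "phi_u": 1.0, "phi_d": 0.0,
    "alpha": 2.5e-5, "mu": 5e-4,
}
eq2 = coexistence_equilibrium(params)
eq4 = undefended_only_equilibrium(params)
```

The same residual check applies to Eq. 3 once its $R^*$ is computed from the quadratic expression in the table.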
Spacer ID
Transfer A B C D E F G H I J K L M N O P Q R S T U V
1 1 1 1 1
2 1 1 1 1
3 1 2 1 1
4 1
5 2 2 1 1 1
11 1 1 2 1 1 1
15 1 1 2 2 1
25 5 1 1 1
35 1 2 1 2
40 2 2 3 1 1
Table C.2: Spacer dynamics for long term coevolution experiment (Experiment 1). Spacer dynamics in the CRISPR1 locus
for serial transfer experiment 1. Each letter corresponds to a unique spacer sequence. All sampled sequences are shown with the
number of times each was observed at each timepoint.
(a) $c_v = 1$ (b) $c_z = 1$
Figure C.1: Equilibria with alternative costs of immunity. Model behavior under
variations in the immune system loss rate and (a) resource affinity coefficient or (b)
growth rate penalty. Equilibria derived from our equations in Appendix C.2 are
shown where orange indicates a stable equilibrium with all populations coexisting
and defended host dominating phage populations, green indicates that all populations
coexist but phages dominate, light blue indicates that defended bacteria have
gone extinct but phages and undefended bacteria coexist, and dark blue indicates
that there is no stable equilibrium. We neglect coevolution and innate immunity
in this analysis ($\phi_u = 1$, $\phi_d = 0$) and do not consider the effects of autoimmunity
($\alpha = 0$).
Figure C.2: Equilibria with each coexistence mechanism in isolation. Behavior of
coexistence equilibrium when (a) there is only CRISPR loss without autoimmunity,
(b) there is only autoimmunity without CRISPR loss, (c) there is only a cost applied
to resource affinity (Appendix C.2), and (d) there is only a cost applied to maximum
growth rate (Appendix C.2). Notice that immune loss and autoimmune mechanisms
essentially act in the same manner, except that the loss mechanism produces a larger
phage population by flushing extra susceptible bacteria into the system. This is
consistent with theoretical results showing that increasing resource availability in a
host-phage system increases phage rather than host populations [237]. The upper
bound of the x-axis in (a-d) represents the upper limit of the cost of immunity,
above which coexistence will not occur because immune host cannot survive.
Figure C.3: Numerical solutions to model at 80 days with realistic initial conditions.
Numerical solutions to the alternative cost model (Appendix C.2) at 80 days using
realistic initial conditions more specific to the experimental setup ($R(0) = 350$,
$D(0) = 10^6$, $U(0) = 100$, $P(0) = 10^6$). Results only shown for cases in which
all three populations remained extant. Results in each panel correspond to each
mechanism in isolation.
Figure C.4: Simulations of perturbed starting conditions (small perturbations). We
find numerical solutions to the alternative cost model (Appendix C.2) at 80 days with
starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$) perturbed by a proportion of
the equilibrium condition, $X(0) = X^*(1 + \gamma Y)$, where $Y \sim U[0, 1]$ and $X^*$ signifies an
equilibrium value, to explore how robust the equilibria are to starting conditions. We
ran 50 simulations for each condition. We let $\gamma = 0.1$. Lines correspond to the left
axis and purple dots correspond to the right axis. Results in each panel correspond
to each mechanism in isolation.
Figure C.5: Simulations of perturbed starting conditions (intermediate perturbations).
We find numerical solutions to the alternative cost model (Appendix C.2)
at 80 days with starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$) perturbed by
a proportion of the equilibrium condition, $X(0) = X^*(1 + \gamma Y)$, where $Y \sim U[0, 1]$
and $X^*$ signifies an equilibrium value, to explore how robust the equilibria are to
starting conditions. We ran 50 simulations for each condition. We let $\gamma = 1$. Lines
correspond to the left axis and purple dots correspond to the right axis. Results in
each panel correspond to each mechanism in isolation.
Figure C.6: Simulations of perturbed starting conditions (large perturbations). We
find numerical solutions to the alternative cost model (Appendix C.2) at 80 days with
starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$) perturbed by a proportion of
the equilibrium condition, $X(0) = X^*(1 + \gamma Y)$, where $Y \sim U[0, 1]$ and $X^*$ signifies an
equilibrium value, to explore how robust the equilibria are to starting conditions. We
ran 50 simulations for each condition. We let $\gamma = 10$. Lines correspond to the left
axis and purple dots correspond to the right axis. Results in each panel correspond
to each mechanism in isolation.
Figure C.7: Simulations of perturbed starting conditions (very large perturbations).
We find numerical solutions to the alternative cost model (Appendix C.2) at
80 days with starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$) perturbed by a
proportion of the equilibrium condition, $X(0) = X^*(1 + \gamma Y)$, where $Y \sim U[0, 1]$ and
$X^*$ signifies an equilibrium value, to explore how robust the equilibria are to starting
conditions. We ran 50 simulations for each condition. We let $\gamma = 100$. Lines correspond
to the left axis and purple dots correspond to the right axis. Results in each
panel correspond to each mechanism in isolation.
Figure C.8: Mean population size with perturbed starting conditions (intermediate
perturbations). We find numerical solutions to the alternative cost model (Appendix
C.2) at 80 days with starting conditions ($X(0) = [R(0), D(0), U(0), P(0)]$) perturbed
by a proportion of the equilibrium condition, $X(0) = X^*(1 + \gamma Y)$, where $Y \sim U[0, 1]$
and $X^*$ signifies an equilibrium value, to explore how robust the equilibria are to
starting conditions. We ran 50 simulations for each condition. We let $\gamma = 10$. Mean
population across all simulations (including cases of phage or host extinction) is shown
by the bold line, and two standard deviations away from the mean are represented by
the thin lines. Results in each panel correspond to each mechanism in isolation.
(a) $\phi_d = 0.01$ (b) $\phi_d = 0.0131$
(c) $\phi_d = 0.0132$ (d) $\phi_d = 0.015$
Figure C.9: Phase diagram of general model with phage coevolution. Phase diagrams
of simple coevolutionary model behavior under variations in the rates of
autoimmunity ($\alpha$) and CRISPR system loss ($\mu$) over various coevolutionary scenarios
($\phi_d$). Values of $\phi_d$ were chosen so as to demonstrate the rapid shift that occurs
from host to phage dominated equilibrium as the infected fraction of defended host
increases. Orange indicates a stable equilibrium with all populations coexisting and
defended host dominating phage populations, green indicates that all populations
coexist but phages dominate, and blue indicates that defended bacteria have gone
extinct but phages and undefended bacteria coexist.
(a) $\phi_u = 1$ (b) $\phi_u = 0.5$
(c) $\phi_u = 0.1$ (d) $\phi_u = 0.05$
Figure C.10: Phase diagram of general model with innate immunity. Phase diagram
of model behavior under variations in the rates of autoimmunity ($\alpha$) and CRISPR
system loss ($\mu$) for different values of $\phi_u$. Orange indicates a stable equilibrium
with all populations coexisting and defended host dominating phage populations,
green indicates that all populations coexist but phages dominate, and blue indicates
that defended bacteria have gone extinct but phages and undefended bacteria
coexist.
(a) Phage
(b) Bacteria
Figure C.11: Replicate serial transfer experiments. Densities of (a) phage and (b)
bacteria measured daily at serial transfer. All replicate experiments start with the
same conditions and strains as in the main text.
Figure C.12: Mean sequenced order of host over time in serial transfer experiments
1 and 2. Sum over CRISPR1 and CRISPR3 loci.
Figure C.13: Optimal host order for phages to infect over time. The optimal
host strain that is either currently infected by a phage strain or one PAM mutation
away from being infected. Optimality is defined in terms of population size times
the burst size of the phage strain that does or could infect that strain, so that the
balance between abundant host and mutation cost is taken into account. In (a)
we track the order of the best available host strain at any given point in a single
example simulation (Fig C.22), and in (b) look at the timing of the peak optimal
order across 100 simulations (Fig 4.4, $\mu_q = 5 \times 10^{-9}$). Note that after the initial
arms race dynamic the best available host strain is the CRISPR-lacking host in
all simulations. The order of the best host strain peaks early on in all simulations
and then drops to zero (CRISPR-lacking), signifying an early end to the arms race
between host and phage.
[Figure C.14. Upper panel titles: $\delta = 10^{-7}$, $\delta = 10^{-8}$, $\delta = 10^{-9}$, each with $\mu_q$ = 5e-9, 5e-10, 5e-11. Lower panel titles: $\delta = 10^{-7}$, $\mu_q = 5 \times 10^{-11}$ and $\delta = 10^{-8}$, $\mu_q = 5 \times 10^{-9}$. Axis labels: Frequency, $\mu_q$, Time to Extinction.]
Figure C.14: Effect on simulations of varied phage adsorption rates. Adsorption
rate has a profound effect on the outcome of host-phage interactions. At a high
adsorption rate ($\delta = 10^{-7}$) either phages or bacteria tend to go extinct early on in
the simulation, and phages are able to drive their host extinct approximately 50%
of the time (49/100 simulations for all back-mutation rates). We see a reversed
relationship of time to extinction with $\mu_q$ from our base adsorption rate ($\delta = 10^{-8}$, Fig
4.4), although in general at a very high adsorption rate few simulations demonstrate
long-term coexistence as phage consume all host early on. This suggests that the
costs associated with PAM mutations are required to keep phage growth rate low
enough to prevent overconsumption of host, and indeed upon closer examination of
individual simulations it is clear that back mutations to lower phage orders precipitate
phage collapse. In the lowest panel we demonstrate that coexistence in the
long term at high $\delta$ is associated with a high mean phage order over the course of
a simulation, while the opposite is true of our typical intermediate $\delta$. At a low
adsorption rate ($\delta = 10^{-9}$) we see populations coexisting until the 80 day mark (max
simulation length) in almost all simulations.
[Figure C.14, lower panels. Axis labels: Mean Phage Order, Time to Extinction.]
[Figure C.15. Legend: Bac CRISPR-, Bac CRISPR+, Phage. x-axis: Time (Days), 0 to 80; y-axis ticks from $10^0$ to $10^{12}$.]
Figure C.15: Representative simulation with a floor on the susceptible host
population and high autoimmunity. We let $B_s > 1\ \forall t$ and $\alpha = 5 \times 10^{-4}$.
Figure C.16: Transient phage survival at low density. Example of low-level phage
persistence due to slow evolutionary dynamics. Here we see that (a) in the absence
of the constant production of susceptible bacteria by CRISPR-enabled strains (i.e.,
$\mu = 0$) phages are still able to paradoxically persist despite a clear advantage to
bacteria in the arms race and an absence of other sustaining mechanisms. In (b)
a small fraction of the CRISPR-enabled bacterial population is maintained that
lacks spacers towards the infecting phages and in (c) we zoom in to show that this
population is declining due to this infection, but extremely slowly, implying this
coexistence is not stable in the long term. The number of bacterial strains that can
be infected by phages over time is shown in (d), and (e) shows how the richness of
phage and bacterial strains changes over time. Note that although the number of
bacterial strains increases asymptotically, after an initial spike the number of strains
that can be infected by phages drops dramatically to the single digits. This keeps
the overall phage population growth constrained (balanced by adsorption to immune
bacteria). In fact, over time phage act to suppress their own growth by negatively
affecting the competitiveness of their host (although this effect is so small that
phages can persist for an extended period of time seemingly stably). What looks
flat is actually monotonically decreasing. All parameters as in Main Text Table 4.3
and $\mu = 0$.
[Figure C.17. Panel titles: $c = 1$, $c = 1.5$, $c = 2$, $c = 3$, $c = 4$. x-axis: Time to Extinction (Days), bins 10, 30, 50, > 70; y-axis: Frequency, 0 to 60.]
Figure C.17: Effect of changes in PAM mutation cost ($c$). Distribution of phage
extinction times in bacterial-dominated cultures with different costs on PAM
mutation in phage ($c$). The peak at 80 corresponds to stable coexistence (simulations
ran for a maximum of 80 days). These results are for a locus-loss mechanism only
($\mu_L = 5 \times 10^{-4}$, $\alpha = 0$).
(a) $\delta = 10^{-9}$ (b) $\delta = 10^{-8}$
(c) $\delta = 10^{-7}$ (d) $\delta = 10^{-6}$
Figure C.18: Phase diagram of general model with phage coevolution. Phase
diagrams of model behavior without coevolution or other forms of immunity ($\phi_d = 0$,
$\phi_u = 1$) under variations in the rates of autoimmunity ($\alpha$) and CRISPR system loss
($\mu$) over various adsorption rates ($\delta$). Orange indicates a stable equilibrium with
all populations coexisting and defended host dominating phage populations, green
indicates that all populations coexist but phages dominate, and blue indicates that
defended bacteria have gone extinct but phages and undefended bacteria coexist.
There is an apparent increase in the area of the coexistence region in which host
dominate as adsorption rate increases.
Figure C.19: Equilibrium phage population during coexistence. Equilibrium population
of phages when there is full coexistence over a range of $\alpha$ and $\mu_L$ values for
our general model without coevolution ($\phi_u = 1$, $\phi_d = 0$).
[Figure C.20. Panel title: MOI = 10, $\mu_q$ = 5e-8, $\mu_L$ = 5e-4, $\alpha = 0$. x-axis: Time to Extinction (Days), bins 10, 30, 50, > 70; y-axis ticks 0 to 60.]
Figure C.20: Distribution of phage extinction times in bacterial-dominated cultures
with an MOI of 10. The peak at 80 corresponds to what we call stable coexistence
(simulations ran for a maximum of 80 days).
[Figure C.21. Panel titles: $\beta = 50$, $\beta = 80$, $\beta = 110$. x-axis: Time to Extinction (Days), bins 10, 30, 50, > 70; y-axis: Frequency, 0 to 60.]
Figure C.21: Distributions of phage extinction times in bacterial-dominated cul-
tures with various burst sizes. The peak at 80 corresponds to what we call stable
coexistence (simulations ran for a maximum of 80 days).
Figure C.22: Representative example of a simulation demonstrating stable coexistence
under a loss mechanism ($\mu_L = 5 \times 10^{-4}$, $\alpha = 0$, $\mu_q = 5 \times 10^{-9}$). In (a) we show
the archetypal shift from phage dominance during an initial arms race, to unstable
host-dominated coexistence where fluctuating selection dynamics are observed, to
stable host-dominated predator-prey cycling of phages and CRISPR-lacking hosts
as seen in (d), where evolution ceases to occur. In (b) we see that a drop in mean
phage order leads to stable cycling and in (c) that this corresponds to a single phage
strain becoming dominant after previous cycling of strains. This corresponds to a
shift away from fluctuating selection dynamics. In (c) the colors specify different
phage strains.
References
[1] Jake L. Weissman, Hao H. Yiu, and Philip L.F. Johnson. What bacteria do
when they get sick. Frontiers for Young Minds, 7(102), 2019.
[2] John H Paul. Microbial gene transfer: an ecological perspective. Journal of
Molecular Microbiology and Biotechnology, 1(1):45–50, 1999.
[3] Curtis A Suttle. Marine viruses-major players in the global ecosystem. Nature
Reviews Microbiology, 5(10):801, 2007.
[4] Lionel Guidi, Samuel Chaffron, Lucie Bittner, Damien Eveillard, Abdelhalim
Larhlimi, Simon Roux, Youssef Darzi, Stéphane Audic, Léo Berline, Jennifer R
Brum, et al. Plankton networks driving carbon export in the oligotrophic
ocean. Nature, 532(7600):465, 2016.
[5] T Frede Thingstad. Elements of a theory for the mechanisms controlling
abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic
systems. Limnology and Oceanography, 45(6):1320–1328, 2000.
[6] Sarit Avrani, Omri Wurtzel, Itai Sharon, Rotem Sorek, and Debbie Lindell.
Genomic island variability facilitates Prochlorococcus-virus coexistence.
Nature, 474(7353):604, 2011.
[7] Marcia F Marston, Francis J Pierciey, Alicia Shepard, Gary Gearin, Ji Qi,
Chandri Yandava, Stephan C Schuster, Matthew R Henn, and Jennifer BH
Martiny. Rapid diversification of coevolving marine Synechococcus and a virus.
Proceedings of the National Academy of Sciences, 109(12):4544–4549, 2012.
[8] Sergey N. Rodin and Vadim A. Ratner. Some theoretical aspects of protein
coevolution in the ecosystem phage-bacteria I. The problem. Journal of
Theoretical Biology, 100(2):185–195, January 1983.
[9] Sergey N. Rodin and Vadim A. Ratner. Some theoretical aspects of protein
coevolution in the ecosystem phage-bacteria II. The deterministic model of
microevolution. Journal of Theoretical Biology, 100(2):197–210, January 1983.
[10] Eugene V Koonin, Kira S Makarova, and Yuri I Wolf. Evolutionary genomics
of defense systems in archaea and bacteria. Annual Review of Microbiology,
71:233–261, 2017.
[11] Stineke van Houte, Angus Buckling, and Edze R. Westra. Evolutionary ecology
of prokaryotic immune mechanisms. Microbiology and Molecular Biology
Reviews, 80(3):745–763, September 2016.
[12] Kira S. Makarova, Yuri I. Wolf, Sagi Snir, and Eugene V. Koonin. Defense
islands in bacterial and archaeal genomes and prediction of novel defense systems.
Journal of Bacteriology, 193(21):6039–6056, November 2011.
[13] Shany Doron, Sarah Melamed, Gal Ofir, Azita Leavitt, Anna Lopatina, Mai
Keren, Gil Amitai, and Rotem Sorek. Systematic discovery of antiphage defense
systems in the microbial pangenome. Science, page eaar4120, January
2018.
[14] Kira S. Makarova, Vivek Anantharaman, L. Aravind, and Eugene V. Koonin.
Live virus-free or die: coupling of antivirus immunity and programmed suicide
or dormancy in prokaryotes. Biology Direct, 7:40, 2012.
[15] Jake L Weissman, William F Fagan, and Philip LF Johnson. Selective maintenance
of multiple CRISPR arrays across prokaryotes. The CRISPR Journal,
1(6):405–413, 2018.
[16] Pere Puigbò, Kira S. Makarova, David M. Kristensen, Yuri I. Wolf, and
Eugene V. Koonin. Reconstruction of the evolution of microbial defense systems.
BMC Evolutionary Biology, 17:94, April 2017.
[17] Kira S. Makarova, Yuri I. Wolf, and Eugene V. Koonin. Comparative genomics
of defense systems in archaea and bacteria. Nucleic Acids Research,
41(8):4360–4377, April 2013.
[18] Francisco J. M. Mojica, Cesar Díez-Villaseñor, Jesús García-Martínez, and
Elena Soria. Intervening Sequences of Regularly Spaced Prokaryotic Repeats
Derive from Foreign Genetic Elements. Journal of Molecular Evolution,
60(2):174–182, 2005.
[19] Rodolphe Barrangou, Christophe Fremaux, Hélène Deveau, Melissa Richards,
Patrick Boyaval, Sylvain Moineau, Dennis A. Romero, and Philippe Horvath.
CRISPR provides acquired resistance against viruses in prokaryotes. Science,
315(5819):1709–1712, March 2007.
[20] Ruud Jansen, Jan DA van Embden, Wim Gaastra, and Leo M Schouls.
Identification of genes that are associated with DNA repeats in prokaryotes.
Molecular Microbiology, 43(6):1565–1575, 2002.
[21] Reidun K. Lillestøl, Shiraz A. Shah, Kim Brügger, Peter Redder, Hien Phan,
Jan Christiansen, and Roger A. Garrett. CRISPR families of the crenarchaeal
genus Sulfolobus: bidirectional transcription and dynamic properties.
Molecular Microbiology, 72(1):259–272, April 2009.
[22] Ümit Pul, Reinhild Wurm, Zihni Arslan, René Geißen, Nina Hofmann, and
Rolf Wagner. Identification and characterization of E. coli CRISPR-Cas promoters
and their silencing by H-NS. Molecular Microbiology, 75(6):1495–1512,
2010.
[23] Yunzhou Wei, Megan T. Chesne, Rebecca M. Terns, and Michael P. Terns.
Sequences spanning the leader-repeat junction mediate CRISPR adaptation
to phage in Streptococcus thermophilus. Nucleic Acids Research,
43(3):1749–1758, February 2015.
[24] Anders F. Andersson and Jillian F. Banfield. Virus population dynamics and
acquired virus resistance in natural microbial communities. Science (New
York, N.Y.), 320(5879):1047–1050, May 2008.
[25] Adi Stern, Leeat Keren, Omri Wurtzel, Gil Amitai, and Rotem Sorek.
Self-targeting by CRISPR: gene regulation or autoimmunity? Trends in Genetics,
26(8):335–340, August 2010.
[26] David Paez-Espino, Wesley Morovic, Christine L. Sun, Brian C. Thomas,
Kenichi Ueda, Buffy Stahl, Rodolphe Barrangou, and Jillian F. Banfield. Strong
bias in the bacterial CRISPR elements that confer immunity to phage. Nature
Communications, 4:1430, February 2013.
[27] Yunzhou Wei, Rebecca M. Terns, and Michael P. Terns. Cas9 function and
host genome sampling in Type II-A CRISPR-Cas adaptation. Genes &
Development, 29(4):356–361, February 2015.
[28] Eugene V. Koonin and Yuri I. Wolf. Is evolution Darwinian or/and
Lamarckian? Biology Direct, 4:42, November 2009.
[29] Eugene V. Koonin and Yuri I. Wolf. Just how Lamarckian is CRISPR-Cas
immunity: the continuum of evolvability mechanisms. Biology Direct, 11:9,
2016.
[30] Kira S. Makarova, Yuri I. Wolf, Omer S. Alkhnbashi, Fabrizio Costa, Shiraz A.
Shah, Sita J. Saunders, Rodolphe Barrangou, Stan J. J. Brouns, Emmanuelle
Charpentier, Daniel H. Haft, Philippe Horvath, Sylvain Moineau, Francisco
J. M. Mojica, Rebecca M. Terns, Michael P. Terns, Malcolm F. White,
Alexander F. Yakunin, Roger A. Garrett, John van der Oost, Rolf Backofen, and
Eugene V. Koonin. An updated evolutionary classification of CRISPR-Cas
systems. Nature Reviews Microbiology, 13(11):722–736, November 2015.
[31] Ariel D. Weinberger, Christine L. Sun, Mateusz M. Pluciński, Vincent J.
Denef, Brian C. Thomas, Philippe Horvath, Rodolphe Barrangou, Michael S.
Gilmore, Wayne M. Getz, and Jillian F. Banfield. Persisting viral sequences
shape microbial CRISPR-based immunity. PLoS Computational Biology,
8(4):e1002475, April 2012.
[32] Sukrit Silas, Georg Mohr, David J Sidote, Laura M Markham, Antonio
Sanchez-Amat, Devaki Bhaya, Alan M Lambowitz, and Andrew Z Fire. Direct
CRISPR spacer acquisition from RNA by a natural reverse transcriptaseCas1
fusion protein. Science, 351(6276):aad4234, 2016.
[33] Caryn Hale, Kyle Kleppe, Rebecca M Terns, and Michael P Terns. Prokaryotic
silencing (psi) RNAs in Pyrococcus furiosus. RNA, 14(12):25722579, 2008.
[34] Caryn R Hale, Peng Zhao, Sara Olson, Michael O Du, Brenton R Graveley,
Lance Wells, Rebecca M Terns, and Michael P Terns. RNA-guided RNA
cleavage by a CRISPR RNA-Cas protein complex. Cell, 139(5):945956, 2009.
[35] Stan JJ Brouns, Matthijs M Jore, Magnus Lundgren, Edze R Westra, Rik JH
Slijkhuis, Ambrosius PL Snijders, Mark J Dickman, Kira S Makarova, Eu-
gene V Koonin, and John Van Der Oost. Small CRISPR RNAs guide antiviral
defense in prokaryotes. Science, 321(5891):960–964, 2008.
[36] Hong Li. Structural principles of CRISPR RNA processing. Structure,
23(1):1320, 2015.
[37] Jason Carte, Ruiying Wang, Hong Li, Rebecca M Terns, and Michael P Terns.
Cas6 is an endoribonuclease that generates guide RNAs for invader defense in
prokaryotes. Genes & Development, 22(24):3489–3496, 2008.
[38] Alexander Bolotin, Benoit Quinquis, Alexei Sorokin, and S. Dusko Ehrlich.
Clustered regularly interspaced short palindrome repeats (CRISPRs) have
spacers of extrachromosomal origin. Microbiology, 151(8):2551–2561, 2005.
[39] Giedrius Gasiunas, Rodolphe Barrangou, Philippe Horvath, and Virginijus
Siksnys. Cas9crRNA ribonucleoprotein complex mediates specic DNA cleav-
age for adaptive immunity in bacteria. Proceedings of the National Academy
of Sciences, 109(39):E2579E2586, 2012.
[40] Martin Jinek, Krzysztof Chylinski, Ines Fonfara, Michael Hauer, Jennifer A
Doudna, and Emmanuelle Charpentier. A programmable dual-RNA–guided
DNA endonuclease in adaptive bacterial immunity. Science, 337(6096):816–821,
2012.
[41] Pedro H. Oliveira, Marie Touchon, and Eduardo P. C. Rocha. The interplay
of restriction-modication systems with mobile genetic elements and their
prokaryotic hosts. Nucleic Acids Research, 42(16):1061810631, September
2014.
[42] Tamara Goldfarb, Hila Sberro, Eyal Weinstock, Or Cohen, Shany Doron,
Yoav Charpak-Amikam, Shaked Afik, Gal Ofir, and Rotem Sorek. BREX is a
novel phage resistance system widespread in microbial genomes. The EMBO
Journal, 34(2):169–183, January 2015.
[43] Gal Or, Sarah Melamed, Hila Sberro, Zohar Mukamel, Shahar Silverman,
Gilad Yaakov, Shany Doron, and Rotem Sorek. DISARM is a widespread
bacterial defence system with broad anti-phage activities. Nature Microbiology,
3(1):90, 2018.
[44] Francisco J. M. Mojica, Cesar Díez-Villaseñor, Elena Soria, and Guadalupe
Juez. Biological significance of a family of regularly spaced repeats in the
genomes of Archaea, Bacteria and mitochondria. Molecular Microbiology,
36(1):244–246, January 2002.
[45] Kira S. Makarova, Nick V. Grishin, Svetlana A. Shabalina, Yuri I. Wolf, and
Eugene V. Koonin. A putative RNA-interference-based immune system in
prokaryotes: computational analysis of the predicted enzymatic machinery,
functional analogies with eukaryotic RNAi, and hypothetical mechanisms of
action. Biology Direct, 1:7, March 2006.
[46] Rika E. Anderson, William J. Brazelton, and John A. Baross. Using CRISPRs
as a metagenomic tool to identify microbial hosts of a diffuse flow hydrothermal
vent viral assemblage. FEMS Microbiology Ecology, 77(1):120–133, July 2011.
[47] Ariel D. Weinberger, Yuri I. Wolf, Alexander E. Lobkovsky, Michael S.
Gilmore, and Eugene V. Koonin. Viral diversity threshold for adaptive immunity
in prokaryotes. mBio, 3(6):e00456-12, December 2012.
[48] Jaime Iranzo, Alexander E. Lobkovsky, Yuri I. Wolf, and Eugene V. Koonin.
Evolutionary dynamics of the prokaryotic adaptive immunity system CRISPR-Cas
in an explicit ecological context. Journal of Bacteriology, 195(17):3834–3844,
September 2013.
[49] Jake L Weissman, Rohan MR Laljani, William F Fagan, and Philip LF John-
son. Visualization and prediction of CRISPR incidence in microbial trait-space
to identify drivers of antiviral immune strategy. The ISME journal, 2019.
[50] Shiraz A. Shah and Roger A. Garrett. CRISPR/Cas and Cmr modules, mo-
bility and evolution of adaptive immune systems. Research in Microbiology,
162(1):2738, January 2011.
[51] Wenyan Jiang, Inbal Maniv, Fawaz Arain, Yaying Wang, Bruce R. Levin,
and Luciano A. Marrani. Dealing with the evolutionary downside of crispr
immunity: Bacteria and benecial plasmids. PLOS Genetics, 9(9):e1003844,
September 2013.
[52] David Burstein, Christine L. Sun, Christopher T. Brown, Itai Sharon, Karthik
Anantharaman, Alexander J. Probst, Brian C. Thomas, and Jillian F. Banfield.
Major bacterial lineages are essentially devoid of CRISPR-Cas viral
defence systems. Nature Communications, 7:10613, February 2016.
[53] David Burstein, Lucas B. Harrington, Steven C. Strutt, Alexander J. Probst,
Karthik Anantharaman, Brian C. Thomas, Jennifer A. Doudna, and Jillian F.
Baneld. New CRISPRCas systems from uncultivated microbes. Nature,
542(7640):237241, February 2017.
[54] Lin-Xing Chen, Basem Al-Shayeb, Raphaël Méheust, Wen-Jun Li, Jennifer A
Doudna, and Jillian F Banfield. Candidate Phyla Radiation Roizmanbacteria
from hot springs have novel and unexpectedly abundant CRISPR-Cas systems.
Frontiers in Microbiology, 10:928, 2019.
[55] Aude Bernheim, David Bikard, Marie Touchon, and Eduardo PC Rocha. Co-
occurrence of multiple CRISPRs and cas clusters suggests epistatic interac-
tions. bioRxiv, page 592600, 2019.
[56] Philippe Horvath, Anne-Claire Coûté-Monvoisin, Dennis A. Romero, Patrick
Boyaval, Christophe Fremaux, and Rodolphe Barrangou. Comparative anal-
ysis of CRISPR loci in lactic acid bacteria genomes. International Journal of
Food Microbiology, 131(1):62–70, April 2009.
[57] C. Diez-Villasenor, C. Almendros, J. Garcia-Martinez, and F. J. M. Mojica.
Diversity of CRISPR loci in Escherichia coli. Microbiology, 156(5):1351–1361,
May 2010.
[58] Fei Cai, Seth D. Axen, and Cheryl A. Kerfeld. Evidence for the widespread
distribution of CRISPR-Cas system in the Phylum Cyanobacteria. RNA Biology,
10(5):687–693, May 2013.
[59] Daniel J Nasko, Barbra D Ferrell, Ryan M Moore, Jaysheel D Bhavsar,
Shawn W Polson, and K Eric Wommack. CRISPR spacers indicate preferential
matching of specific virioplankton genes. mBio, 10(2):e02651-18, 2019.
[60] Joe Bondy-Denomy, April Pawluk, Karen L. Maxwell, and Alan R. David-
son. Bacteriophage genes that inactivate the CRISPR/Cas bacterial immune
system. Nature, 493(7432):429–432, January 2013.
[61] Ido Yosef, Moran G. Goren, and Udi Qimron. Proteins and DNA elements
essential for the CRISPR adaptation process in Escherichia coli. Nucleic Acids
Research, page gks216, March 2012.
[62] Pedro F. Vale, Guillaume Lafforgue, Francois Gatchitch, Rozenn Gardan, Sylvain
Moineau, and Sylvain Gandon. Costs of CRISPR-Cas-mediated resistance
in Streptococcus thermophilus. Proceedings. Biological Sciences / The Royal
Society, 282(1812):20151270, August 2015.
[63] Edze R. Westra, Stineke van Houte, Sam Oyesiku-Blakemore, Ben Makin,
Jenny M. Broniewski, Alex Best, Joseph Bondy-Denomy, Alan Davidson, Mike
Boots, and Angus Buckling. Parasite Exposure Drives Selective Evolution
of Constitutive versus Inducible Defense. Current Biology, 25(8):1043–1049,
April 2015.
[64] Ellinor O Alseth, Elizabeth Pursey, Adela M Luján, Isobel McLeod, Clare
Rollie, and Edze R Westra. Bacterial biodiversity drives the evolution of
CRISPR-based phage resistance in Pseudomonas aeruginosa. bioRxiv, page
586115, 2019.
[65] Yong Joon Chung, Christel Krueger, David Metzgar, and Milton H. Saier.
Size comparisons among integral membrane transport protein homologues in
bacteria, archaea, and eucarya. Journal of Bacteriology, 183(3):1012–1021,
February 2001.
[66] Luciano Brocchieri and Samuel Karlin. Protein length in eukaryotic and
prokaryotic proteomes. Nucleic Acids Research, 33(10):3390–3400, 2005.
[67] Heidi Ledford. Five big mysteries about CRISPR's origins. Nature News,
541(7637):280, January 2017.
[68] Kira S. Makarova, Daniel H. Haft, Rodolphe Barrangou, Stan J. J. Brouns,
Emmanuelle Charpentier, Philippe Horvath, Sylvain Moineau, Francisco J. M.
Mojica, Yuri I. Wolf, Alexander F. Yakunin, John van der Oost, and Eugene V.
Koonin. Evolution and classication of the CRISPRCas systems. Nature
Reviews Microbiology, 9(6):467477, June 2011.
[69] Jake L. Weissman, Rayshawn Holmes, Rodolphe Barrangou, Sylvain Moineau,
William F. Fagan, Bruce Levin, and Philip L. F. Johnson. Immune loss as
a driver of coexistence during host-phage coevolution. The ISME Journal,
12(2):585597, February 2018.
[70] Jacob H. Munson-McGee, Shengyun Peng, Samantha Dewer, Ramunas
Stepanauskas, Rachel J. Whitaker, Joshua S. Weitz, and Mark J. Young. A
virus or more in (nearly) every cell: ubiquitous networks of virus–host interactions
in extreme environments. The ISME Journal, page 1, February 2018.
[71] Simon J. Labrie, Julie E. Samson, and Sylvain Moineau. Bacteriophage resistance
mechanisms. Nature Reviews. Microbiology, 8(5):317–327, May 2010.
[72] Kira S. Makarova, Yuri I. Wolf, and Eugene V. Koonin. The basic building
blocks and evolution of CRISPRCas systems. Biochemical Society Transac-
tions, 41(6):13921400, December 2013.
[73] David Bikard, Asma Hatoum-Aslan, Daniel Mucida, and Luciano A Marraffini.
CRISPR interference can prevent natural transformation and virulence
acquisition during in vivo bacterial infection. Cell Host & Microbe, 12(2):177–186,
2012.
[74] Aude Bernheim, Alicia Calvo-Villamañán, Clovis Basier, Lun Cui, Eduardo
P. C. Rocha, Marie Touchon, and David Bikard. Inhibition of NHEJ re-
pair by type II-A CRISPR-Cas systems in bacteria. Nature Communications,
8(1):2094, December 2017.
[75] Maria Brbić, Matija Piškorec, Vedrana Vidulin, Anita Kriško, Tomislav Šmuc,
and Fran Supek. The landscape of microbial phenotypic traits and associated
genes. Nucleic Acids Research, 44(21):10074–10090, December 2016.
[76] Daniel J. Stekhoven and Peter Bühlmann. MissForest – non-parametric missing
value imputation for mixed-type data. Bioinformatics, 28(1):112–118,
January 2012.
[77] Ambarish Biswas, Raymond H.J. Staals, Sergio E. Morales, Peter C. Fineran,
and Chris M. Brown. CRISPRDetect: A flexible algorithm to define CRISPR
arrays. BMC Genomics, 17:356, 2016.
[78] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason
Papadopoulos, Kevin Bealer, and Thomas L. Madden. BLAST+: architecture
and applications. BMC bioinformatics, 10:421, December 2009.
[79] Richard J. Roberts, Tamas Vincze, Janos Posfai, and Dana Macelis. REBASE –
a database for DNA restriction and modification: enzymes, genes and
genomes. Nucleic Acids Research, 38(suppl_1):D234–D236, January 2010.
[80] S. R. Eddy. Profile hidden Markov models. Bioinformatics (Oxford, England),
14(9):755–763, 1998.
[81] Aidan J. Doherty, Stephen P. Jackson, and Geoffrey R. Weller. Identification
of bacterial homologues of the Ku DNA repair proteins. FEBS Letters,
500(3):186–188, July 2001.
[82] L. Aravind and Eugene V. Koonin. Prokaryotic homologs of the eukaryotic
DNA-end-binding protein Ku, novel domains in the Ku protein and predic-
tion of a prokaryotic double-strand break repair system. Genome Research,
11(8):13651374, August 2001.
[83] Tatiana Tatusova, Michael DiCuccio, Azat Badretdin, Vyacheslav Chetvernin,
Eric P. Nawrocki, Leonid Zaslavsky, Alexandre Lomsadze, Kim D. Pruitt,
Mark Borodovsky, and James Ostell. NCBI prokaryotic genome annotation
pipeline. Nucleic Acids Research, 44(14):6614–6624, August 2016.
[84] Jenna Morgan Lang, Aaron E. Darling, and Jonathan A. Eisen. Phylogeny of
bacterial and archaeal genomes using conserved genes: Supertrees and super-
matrices. PLOS ONE, 8(4):e62510, April 2013.
[85] Aaron E. Darling, Guillaume Jospin, Eric Lowe, Frederick A. Matsen IV,
Holly M. Bik, and Jonathan A. Eisen. PhyloSift: phylogenetic analysis of
genomes and metagenomes. PeerJ, 2:e243, January 2014.
[86] Morgan N. Price, Paramvir S. Dehal, and Adam P. Arkin. FastTree 2 –
approximately maximum-likelihood trees for large alignments. PLOS ONE,
5(3):e9490, March 2010.
[87] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[88] Jesse H. Krijthe. Rtsne: t-Distributed Stochastic Neighbor Embedding using
Barnes-Hut Implementation, 2015. R package version 0.15.
[89] David R. Roberts, Volker Bahn, Simone Ciuti, Mark S. Boyce, Jane Elith,
Gurutzeta Guillera-Arroita, Severin Hauenstein, José J. Lahoz-Monfort,
Boris Schröder, Wilfried Thuiller, David I. Warton, Brendan A. Wintle,
Florian Hartig, and Carsten F. Dormann. Cross-validation strategies for data
with temporal, spatial, hierarchical, or phylogenetic structure. Ecography,
40(8):913–929, August 2017.
[90] A. P. Reynolds, G. Richards, B. de la Iglesia, and V. J. Rayward-Smith.
Clustering rules: A comparison of partitioning and hierarchical clustering
algorithms. Journal of Mathematical Modelling and Algorithms, 5(4):475–504,
December 2006.
[91] Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt
Hornik. cluster: Cluster Analysis Basics and Extensions, 2018. R package
version 2.0.7-1.
[92] Anthony R. Ives and Theodore Garland. Phylogenetic logistic regression for
binary dependent variables. Systematic Biology, 59(1):9–26, January 2010.
[93] Lam si Tung Ho and Cécile Ané. A linear-time algorithm for Gaussian and
non-Gaussian trait evolution models. Systematic Biology, 63(3):397–408, May
2014.
[94] Il-Gyo Chong and Chi-Hyuck Jun. Performance of some variable selection
methods when multicollinearity is present. Chemometrics and Intelligent
Laboratory Systems, 78(1-2):103–112, 2005.
[95] Olga Morozova, Olga Levina, Anneli Uusküla, and Robert Heimer. Comparison
of subset selection methods in linear regression in the context of health-related
quality of life and substance abuse in Russia. BMC medical research
methodology, 15(1):71, 2015.
[96] Donald E. Farrar and Robert R. Glauber. Multicollinearity in regression analysis:
The problem revisited. The Review of Economics and Statistics, 49(1):92–107,
1967.
[97] Muhammad Imdadullah, Muhammad Aslam, and Saima Altaf. mctest: An R
package for detection of collinearity among regressors. The R Journal, 8(2),
December 2016.
[98] Kim-Anh Lê Cao, Simon Boitard, and Philippe Besse. Sparse PLS discriminant
analysis: biologically relevant feature selection and graphical displays for
multiclass problems. BMC Bioinformatics, 12:253, June 2011.
[99] Florian Rohart, Benoît Gautier, Amrit Singh, and Kim-Anh Lê Cao.
mixOmics: An R package for 'omics feature selection and multiple data integration.
PLoS Computational Biology, 13(11):e1005752, November 2017.
[100] Florian Rohart, Aida Eslami, Nicholas Matigian, Stéphanie Bougeard, and
Kim-Anh Lê Cao. MINT: a multivariate integrative method to identify reproducible
molecular signatures across independent experiments and platforms.
BMC Bioinformatics, 18:128, February 2017.
[101] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, October 2001.
[102] Andy Liaw, Matthew Wiener, and others. Classification and regression by
randomForest. R news, 2(3), 2002.
[103] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1):37–46, April 1960.
[104] Ruth E. Ley, Catherine A. Lozupone, Micah Hamady, Rob Knight, and Jeffrey I.
Gordon. Worlds within worlds: evolution of the vertebrate gut microbiota.
Nature Reviews Microbiology, 6(10):776–788, October 2008.
[105] Luke R. Thompson, Jon G. Sanders, Daniel McDonald, Amnon Amir, Joshua
Ladau, Kenneth J. Locey, Robert J. Prill, Anupriya Tripathi, Sean M. Gibbons,
Gail Ackermann, Jose A. Navas-Molina, Stefan Janssen, Evguenia Kopylova,
Yoshiki Vázquez-Baeza, Antonio González, James T. Morton, Siavash
Mirarab, Zhenjiang Zech Xu, Lingjing Jiang, Mohamed F. Haroon, Jad
Kanbar, Qiyun Zhu, Se Jin Song, Tomasz Kosciolek, Nicholas A. Bokulich,
Joshua Leer, Colin J. Brislawn, Gregory Humphrey, Sarah M. Owens, Jar-
rad Hampton-Marcell, Donna Berg-Lyons, Valerie McKenzie, Noah Fierer,
Jed A. Fuhrman, Aaron Clauset, Rick L. Stevens, Ashley Shade, Katherine S.
Pollard, Kelly D. Goodwin, Janet K. Jansson, Jack A. Gilbert, Rob Knight,
The Earth Microbiome Project Consortium, Jose L. Agosto Rivera, Lisa Al-
Moosawi, John Alverdy, Katherine R. Amato, Jason Andras, Largus T. An-
genent, Dionysios A. Antonopoulos, Amy Apprill, David Armitage, Kate
Ballantine, Jiří Bárta, Julia K. Baum, Allison Berry, Ashish Bhatnagar,
Monica Bhatnagar, Jennifer F. Biddle, Lucie Bittner, Bazartseren Boldgiv,
Eric Bottos, Donal M. Boyer, Josephine Braun, William Brazelton, Fran-
cis Q. Brearley, Alexandra H. Campbell, J. Gregory Caporaso, Cesar Car-
dona, JoLynn Carroll, S. Craig Cary, Brenda B. Casper, Trevor C. Charles,
Haiyan Chu, Danielle C. Claar, Robert G. Clark, Jonathan B. Clayton, Jose C.
Clemente, Alyssa Cochran, Maureen L. Coleman, Gavin Collins, Rita R. Colwell,
Mónica Contreras, Benjamin B. Crary, Simon Creer, Daniel A. Cristol,
Byron C. Crump, Duoying Cui, Sarah E. Daly, Liliana Davalos, Russell D.
Dawson, Jennifer Defazio, Frédéric Delsuc, Hebe M. Dionisi, Maria Gloria
Dominguez-Bello, Robin Dowell, Eric A. Dubinsky, Peter O. Dunn, Danilo
Ercolini, Robert E. Espinoza, Vanessa Ezenwa, Nathalie Fenner, Helen S.
Findlay, Irma D. Fleming, Vincenzo Fogliano, Anna Forsman, Chris Free-
man, Elliot S. Friedman, Giancarlo Galindo, Liza Garcia, Maria Alexandra
Garcia-Amado, David Garshelis, Robin B. Gasser, Gunnar Gerdts, Molly K.
Gibson, Isaac Giord, Ryan T. Gill, Tugrul Giray, Antje Gittel, Peter
Golyshin, Donglai Gong, Hans-Peter Grossart, Kristina Guyton, Sarah-Jane
Haig, Vanessa Hale, Ross Stephen Hall, Steven J. Hallam, Kim M. Hand-
ley, Nur A. Hasan, Shane R. Haydon, Jonathan E. Hickman, Glida Hidalgo,
Kirsten S. Hofmockel, Jeff Hooker, Stefan Hulth, Jenni Hultman, Embriette
Hyde, Juan Diego Ibáñez-Álamo, Julie D. Jastrow, Aaron R. Jex, L. Scott
Johnson, Eric R. Johnston, Stephen Joseph, Stephanie D. Jurburg, Diogo Ju-
relevicius, Anders Karlsson, Roger Karlsson, Seth Kauppinen, Colleen T. E.
Kellogg, Suzanne J. Kennedy, Lee J. Kerkhof, Gary M. King, George W. Kling,
Anson V. Koehler, Monika Krezalek, Jordan Kueneman, Regina Lamendella,
Emily M. Landon, Kelly Lane-deGraaf, Julie LaRoche, Peter Larsen, Bon-
nie Laverock, Simon Lax, Miguel Lentino, Iris I. Levin, Pierre Liancourt,
Wenju Liang, Alexandra M. Linz, David A. Lipson, Yongqin Liu, Manuel E.
Lladser, Mariana Lozada, Catherine M. Spirito, Walter P. MacCormack, Aurora
MacRae-Crerar, Magda Magris, Antonio M. Martín-Platero, Manuel
Martín-Vivaldi, L. Margarita Martínez, Manuel Martínez-Bueno, Ezequiel M.
Marzinelli, Olivia U. Mason, Gregory D. Mayer, Jamie M. McDevitt-Irwin,
James E. McDonald, Krista L. McGuire, Katherine D. McMahon, Ryan McMinds,
Mónica Medina, Joseph R. Mendelson, Jessica L. Metcalf, Folker
Meyer, Fabian Michelangeli, Kim Miller, David A. Mills, Jeremiah Minich,
Stefano Mocali, Lucas Moitinho-Silva, Anni Moore, Rachael M. Morgan-Kiss,
Paul Munroe, David Myrold, Josh D. Neufeld, Yingying Ni, Graeme W. Nicol,
Shaun Nielsen, Jozef I. Nissimov, Kefeng Niu, Matthew J. Nolan, Karen
Noyce, Sarah L. O'Brien, Noriko Okamoto, Ludovic Orlando, Yadira Ortiz
Castellano, Olayinka Osuolale, Wyatt Oswald, Jacob Parnell, Juan M. Peralta-Sánchez,
Peter Petraitis, Catherine Pfister, Elizabeth Pilon-Smits, Paola Piombino,
Stephen B. Pointing, F. Joseph Pollock, Caitlin Potter, Bharath
Prithiviraj, Christopher Quince, Asha Rani, Ravi Ranjan, Subramanya Rao,
Andrew P. Rees, Miles Richardson, Ulf Riebesell, Carol Robinson, Karl J.
Rockne, Selena Marie Rodriguezl, Forest Rohwer, Wayne Roundstone, Re-
becca J. Safran, Naseer Sangwan, Virginia Sanz, Matthew Schrenk, Mark D.
Schrenzel, Nicole M. Scott, Rita L. Seger, Andaine Seguin-Orlando, Lucy
Seldin, Lauren M. Seyler, Baddr Shakhsheer, Gabriela M. Sheets, Congcong
Shen, Yu Shi, Hakdong Shin, Benjamin D. Shogan, Dave Shutler, Jeffrey
Siegel, Steve Simmons, Sara Sjöling, Daniel P. Smith, Juan J. Soler, Martin
Sperling, Peter D. Steinberg, Brent Stephens, Melita A. Stevens, Safiyh
Taghavi, Vera Tai, Karen Tait, Chia L. Tan, Neslihan Tas, D. Lee Tay-
lor, Torsten Thomas, Ina Timling, Benjamin L. Turner, Tim Urich, Luke K.
Ursell, Daniel van der Lelie, William Van Treuren, Lukas van Zwieten, Daniela
Vargas-Robles, Rebecca Vega Thurber, Paola Vitaglione, Donald A. Walker,
William A. Walters, Shi Wang, Tao Wang, Tom Weaver, Nicole S. Webster,
Beck Wehrle, Pamela Weisenhorn, Sophie Weiss, Jeffrey J. Werner, Kristin
West, Andrew Whitehead, Susan R. Whitehead, Linda A. Whittingham, Eske
Willerslev, Allison E. Williams, Stephen A. Wood, Douglas C. Woodhams,
Yeqin Yang, Jesse Zaneveld, Iratxe Zarraonaindia, Qikun Zhang, and Hongxia
Zhao. A communal catalogue reveals Earth's multiscale microbial diversity.
Nature, 551(7681):457–463, November 2017.
[106] Adrian G. Patterson, Simon A. Jackson, Corinda Taylor, Gary B. Evans,
George P. C. Salmond, Rita Przybilski, Raymond H. J. Staals, and Peter C.
Fineran. Quorum sensing controls adaptive immunity through the regulation
of multiple CRISPR-Cas systems. Molecular Cell, 64(6):1102–1108, December
2016.
[107] C. Condon, D. Liveris, C. Squires, I. Schwartz, and C. L. Squires. rRNA
operon multiplicity in Escherichia coli and the physiological implications of
rrn inactivation. Journal of Bacteriology, 177(14):4152–4156, July 1995.
[108] Sara Vieira-Silva and Eduardo P. C. Rocha. The systemic imprint of growth
and its uses in ecological (meta)genomics. PLOS Genetics, 6(1):e1000808,
January 2010.
[109] Benjamin R. K. Roller, Steven F. Stoddard, and Thomas M. Schmidt. Exploit-
ing rRNA operon copy number to investigate bacterial reproductive strategies.
Nature Microbiology, 1(11):16160, November 2016.
[110] Shinichi Sunagawa, Luis Pedro Coelho, Samuel Chaffron, Jens Roat Kultima,
Karine Labadie, Guillem Salazar, Bardya Djahanschiri, Georg Zeller, Daniel R
Mende, Adriana Alberti, et al. Structure and function of the global ocean
microbiome. Science, 348(6237):1261359, 2015.
[111] Zarir E. Karanjawala, Niamh Murphy, David R. Hinton, Chih-Lin Hsieh, and
Michael R. Lieber. Oxygen metabolism causes chromosome breaks and is
associated with the neuronal apoptosis observed in DNA double-strand break
repair mutants. Current Biology, 12(5):397–402, March 2002.
[112] Robert S. Pitcher, Nigel C. Brissett, and Aidan J. Doherty. Nonhomologous
end-joining in bacteria: a microbial perspective. Annual Review of Microbiology,
61:259–282, 2007.
[113] E Jończyk, M Kłak, R Międzybrodzki, and A Górski. The influence of external
factors on bacteriophages. Folia Microbiologica, 56(3):191–200, 2011.
[114] Guilhem Faure, Kira S Makarova, and Eugene V Koonin. CRISPR-Cas: Com-
plex functional networks and multiple roles beyond adaptive immunity. Jour-
nal of Molecular Biology, 2018.
[115] Grigory L Dianov, Tatyana V Timehenko, Olga I Sinitsina, Andrew V Kuzmi-
nov, Oleg A Medvedev, and Rudolf I Salganik. Repair of uracil residues closely
spaced on the opposite strands of plasmid DNA results in double-strand break
and deletion formation. Molecular and General Genetics MGG, 225(3):448–452,
1991.
[116] Stanislav G Kozmin, Yuliya Sedletska, Anne Reynaud-Angelin, Didier Gas-
parutto, and Evelyne Sage. The formation of double-strand breaks at multiply
damaged sites is driven by the kinetics of excision/incision at base damage in
eukaryotic cells. Nucleic Acids Research, 37(6):1767–1777, 2009.
[117] Yuzhi Hong, Liping Li, Gan Luan, Karl Drlica, and Xilin Zhao. Contribution
of reactive oxygen species to thymineless death in Escherichia coli. Nature
Microbiology, 2(12):1667, 2017.
[118] Sarah S Henrikus, Camille Henry, John P McDonald, Yvonne Hellmich, Eliz-
abeth A Wood, Roger Woodgate, Michael M Cox, Antoine M van Oijen, Har-
shad Ghodke, and Andrew Robinson. DNA double-strand breaks induced by
reactive oxygen species promote DNA polymerase IV activity in Escherichia
coli. bioRxiv, page 533422, 2019.
[119] Tulip Mahaseth and Andrei Kuzminov. Prompt repair of hydrogen peroxide-
induced DNA lesions prevents catastrophic chromosomal fragmentation. DNA
repair, 41:4253, 2016.
[120] Thomas Bonura, Christopher D Town, Kendric C Smith, and Henry S Kaplan.
The influence of oxygen on the yield of DNA double-strand breaks in
x-irradiated Escherichia coli K-12. Radiation Research, 63(3):567–577, 1975.
[121] Michael J Tilby and Pamela S Loverock. Measurements of DNA double-
strand break yields in E. coli after rapid irradiation and cell inactivation:
the eects of inactivation technique and anoxic radiosensitizers. Radiation
research, 96(2):309321, 1983.
[122] GP Van der Schans and Joh Blok. The influence of oxygen and sulphhydryl
compounds on the production of breaks in bacteriophage DNA by gamma-rays.
International Journal of Radiation Biology and Related Studies in Physics,
Chemistry and Medicine, 17(1):25–38, 1970.
[123] Robert S Pitcher, Andrew J Green, Anna Brzostek, Malgorzata Korycka-Machala,
Jaroslaw Dziadek, and Aidan J Doherty. NHEJ protects mycobacteria
in stationary phase against the harmful effects of desiccation. DNA repair,
6(9):1271–1276, 2007.
[124] Pierre Dupuy, Benjamin Gourion, Laurent Sauviac, and Claude Bruand. DNA
double-strand break repair is involved in desiccation resistance of Sinorhizo-
bium meliloti, but is not essential for its symbiotic interaction with Medicago
truncatula. Microbiology, 163(3):333–342, 2017.
[125] Carsten T Charlesworth, Priyanka S Deshpande, Daniel P Dever, Joab
Camarena, Viktor T Lemgart, M Kyle Cromer, Christopher A Vakulskas,
Michael A Collingwood, Liyang Zhang, Nicole M Bode, et al. Identification of
preexisting adaptive immunity to Cas9 proteins in humans. Nature medicine,
25(2):249, 2019.
[126] Laura A. Hug, Brett J. Baker, Karthik Anantharaman, Christopher T. Brown,
Alexander J. Probst, Cindy J. Castelle, Cristina N. Butterfield, Alex W.
Hernsdorf, Yuki Amano, Kotaro Ise, Yohey Suzuki, Natasha Dudek, David A.
Relman, Kari M. Finstad, Ronald Amundson, Brian C. Thomas, and Jillian F.
Baneld. A new view of the tree of life. Nature Microbiology, 1:16048, April
2016.
[127] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein,
Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S
Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of
biology. Nature genetics, 25(1):25, 2000.
[128] Marcus C Chibucos, Adrienne E Zweifel, Jonathan C Herrera, William Meza,
Shabnam Eslamfam, Peter Uetz, Deborah A Siegele, James C Hu, and
Michelle G Giglio. An ontology for microbial phenotypes. BMC Microbiology,
14(1):294, 2014.
[129] V H Tierrafría, C Mejía-Almonte, J M Camacho-Zaragoza, H Salgado,
K Alquicira, S Gama-Castro, C Ishida, and J Collado-Vides. MCO: towards an
ontology and unified vocabulary for a framework-based annotation of microbial
growth conditions. Bioinformatics, page bty689, 2018.
[130] Jaime Iranzo, Alexander E. Lobkovsky, Yuri I. Wolf, and Eugene V. Koonin.
Immunity, suicide or both? Ecological determinants for the combined evo-
lution of anti-pathogen defense systems. BMC Evolutionary Biology, 15:43,
2015.
[131] Moran Goren, Ido Yosef, Rotem Edgar, and Udi Qimron. The bacterial
CRISPR/Cas system as analog of the mammalian adaptive immune system.
RNA Biology, 9(5):549–554, May 2012.
[132] Luciano A. Marrani and Erik J. Sontheimer. CRISPR interference limits
horizontal gene transfer in staphylococci by targeting DNA. Science (New
York, N.Y.), 322(5909):1843–1845, December 2008.
[133] Luciano A. Marrani. CRISPR-Cas immunity in prokaryotes. Nature,
526(7571):5561, October 2015.
[134] Rotem Sorek, C. Martin Lawrence, and Blake Wiedenheft. CRISPR-mediated
adaptive immune systems in bacteria and archaea. Annual Review of Biochemistry,
82(1):237–266, 2013.
[135] Pierre Boudry, Ekaterina Semenova, Marc Monot, Kirill A. Datsenko, Anna
Lopatina, Ognjen Sekulovic, Maicol Ospina-Bedoya, Louis-Charles Fortier,
Konstantin Severinov, Bruno Dupuy, and Olga Soutourina. Function of the
CRISPR-Cas system of the human pathogen Clostridium difficile. mBio,
6(5):e01112-15, October 2015.
[136] Joakim M. Andersen, Madelyn Shoup, Cathy Robinson, Robert Britton,
Katharina E. P. Olsen, and Rodolphe Barrangou. CRISPR diversity and
microevolution in Clostridium difficile. Genome Biology and Evolution,
8(9):2841–2855, September 2016.
[137] A. Mira, H. Ochman, and N. A. Moran. Deletional bias and the evolution of
bacterial genomes. Trends in Genetics: TIG, 17(10):589–596, October 2001.
[138] Chih-Horng Kuo and Howard Ochman. Deletional bias across the three domains
of life. Genome Biology and Evolution, 1:145–152, January 2009.
[139] Nuala A. O'Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana
Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-
White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao,
Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric
Cox, Olga Ermolaeva, Catherine M. Farrell, Tamara Goldfarb, Tripti Gupta,
Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S. Joardar, Vamsi K.
Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M. McGarvey,
Michael R. Murphy, Kathleen O'Neill, Shashikant Pujar, Sanjida H. Rangwala,
Daniel Rausch, Lillian D. Riddick, Conrad Schoch, Andrei Shkeda, Susan S.
Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E.
Tully, Anjana R. Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J.
Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Ter-
ence D. Murphy, and Kim D. Pruitt. Reference sequence (RefSeq) database
at NCBI: current status, taxonomic expansion, and functional annotation.
Nucleic Acids Research, 44(Database issue):D733–D745, January 2016.
[140] Cristóbal Almendros, Francisco JM Mojica, César Díez-Villaseñor, Noemí M
Guzmán, and Jesús García-Martínez. CRISPR-Cas functional module exchange
in Escherichia coli. mBio, 5(1):e00767-13, 2014.
[141] Kirill A. Datsenko, Ksenia Pougach, Anton Tikhonov, Barry L. Wanner, Kon-
stantin Severinov, and Ekaterina Semenova. Molecular memory of prior infec-
tions activates the CRISPR/Cas adaptive bacterial immunity system. Nature
Communications, 3:945, July 2012.
[142] Daan C. Swarts, Cas Mosterd, Mark W. J. van Passel, and Stan J. J. Brouns.
CRISPR interference directs strand specific spacer acquisition. PLoS ONE,
7(4):e35888, April 2012.
[143] Roger A. Garrett, Shiraz A. Shah, Gisle Vestergaard, Ling Deng, Soley Gud-
bergsdottir, Chandra S. Kenchappa, Susanne Erdmann, and Qunxin She.
CRISPR-based immune systems of the Sulfolobales: complexity and diversity.
Biochemical Society Transactions, 39(1):51–57, February 2011.
[144] Soley Gudbergsdottir, Ling Deng, Zhengjun Chen, Jaide V. K. Jensen,
Linda R. Jensen, Qunxin She, and Roger A. Garrett. Dynamic properties
of the Sulfolobus CRISPR/Cas and CRISPR/Cmr systems when challenged
with vector-borne viral and plasmid genes and protospacers. Molecular Microbiology,
79(1):35–49, January 2011.
[145] David L. Bernick, Courtney L. Cox, Patrick P. Dennis, and Todd M. Lowe.
Comparative genomic and transcriptional analyses of CRISPR systems across
the genus Pyrobaculum. Frontiers in Microbiology, 3, July 2012.
[146] Caryn R. Hale, Sonali Majumdar, Joshua Elmore, Neil Pfister, Mark Compton,
Sara Olson, Alissa M. Resch, Claiborne V. C. Glover, Brenton R. Graveley,
Rebecca M. Terns, and Michael P. Terns. Essential features and rational
design of CRISPR RNAs that function with the Cas RAMP module complex
to cleave RNAs. Molecular Cell, 45(3):292–302, February 2012.
[147] Hagen Richter, Judith Zoephel, Jeanette Schermuly, Daniel Maticzka, Rolf
Backofen, and Lennart Randau. Characterization of CRISPR RNA processing
in Clostridium thermocellum and Methanococcus maripaludis. Nucleic Acids
Research, 40(19):9887–9896, October 2012.
[148] Bridget NJ Watson, Raymond HJ Staals, and Peter C Fineran. CRISPR-Cas-
mediated phage resistance enhances horizontal gene transfer by transduction.
mBio, 9(1):e02406-17, 2018.
[149] Je Nivala, Seth L Shipman, and George M Church. Spontaneous CRISPR loci
generation in vivo by non-canonical spacer integration. Nature Microbiology,
page 1, 2018.
[150] David Paez-Espino, Itai Sharon, Wesley Morovic, Buffy Stahl, Brian C.
Thomas, Rodolphe Barrangou, and Jillian F. Banfield. CRISPR immunity
drives rapid phage genome evolution in Streptococcus thermophilus. mBio,
6(2):e00262-15, May 2015.
[151] Cheryl-Emiliane T Chow and Jed A Fuhrman. Seasonality and monthly
dynamics of marine myovirus communities. Environmental microbiology,
14(8):21712183, 2012.
[152] M. Senthil Kumar, Joshua B. Plotkin, and Sridhar Hannenhalli. Regulated
CRISPR modules exploit a dual defense strategy of restriction and abortive
infection in a model of prokaryote-phage coevolution. PLoS Computational
Biology, 11(11):e1004603, November 2015.
[153] Asaf Levy, Moran G. Goren, Ido Yosef, Oren Auster, Miriam Manor, Gil Ami-
tai, Rotem Edgar, Udi Qimron, and Rotem Sorek. CRISPR adaptation biases
explain preference for acquisition of foreign DNA. Nature, 520(7548):505–510,
April 2015.
[154] Alexander P. Hynes, Manuela Villion, and Sylvain Moineau. Adaptation in
bacterial CRISPR-Cas immunity can be driven by defective phages. Nature
Communications, 5:4399, July 2014.
[155] Marie Touchon and Eduardo P. C. Rocha. The small, slow and special-
ized CRISPR and anti-CRISPR of Escherichia and Salmonella. PLoS ONE,
5(6):e11126, June 2010.
[156] Marie Touchon, Sophie Charpentier, Olivier Clermont, Eduardo P. C. Rocha,
Erick Denamur, and Catherine Branger. CRISPR distribution within the
Escherichia coli species is not suggestive of immunity-associated diversifying
selection. Journal of Bacteriology, 193(10):2460–2467, May 2011.
[157] Rongpeng Li, Lizhu Fang, Shirui Tan, Min Yu, Xuefeng Li, Sisi He, Yuquan
Wei, Guoping Li, Jianxin Jiang, and Min Wu. Type I CRISPR-Cas targets
endogenous genes and regulates virulence to evade mammalian host immunity.
Cell Research, 26(12):1273–1287, December 2016.
[158] Alexander Martynov, Konstantin Severinov, and Iaroslav Ispolatov. Opti-
mal number of spacers in CRISPR arrays. PLoS Computational Biology,
13(12):e1005891, 2017.
[159] Ibtissem Grissa, Gilles Vergnaud, and Christine Pourcel. The CRISPRdb
database and tools to display CRISPRs and to generate dictionaries of spacers
and repeats. BMC bioinformatics, 8:172, May 2007.
[160] Nora C Pyenson, Kaitlyn Gayvert, Andrew Varble, Olivier Elemento, and
Luciano A Marraffini. Broad targeting specificity during bacterial type
III CRISPR-Cas immunity constrains viral escape. Cell host & microbe,
22(3):343–353, 2017.
[161] Aris-Edda Stachler and Anita Marchfelder. Gene repression in Haloarchaea
using the CRISPR (clustered regularly interspaced short palindromic repeats)-Cas
I-B system. Journal of Biological Chemistry, page jbc.M116.724062,
May 2016.
[162] Aris-Edda Stachler, Israela Turgeman-Grott, Ella Shtifman-Segal, Thorsten
Allers, Anita Marchfelder, and Uri Gophna. High tolerance to self-targeting
of the genome by the endogenous CRISPR-Cas system in an archaeon. Nucleic
Acids Research, March 2017.
[163] April Pawluk, Joseph Bondy-Denomy, Vivian HW Cheung, Karen L Maxwell,
and Alan R Davidson. A new group of phage anti-CRISPR genes inhibits the type
I-E CRISPR-Cas system of Pseudomonas aeruginosa. mBio, 5(2):e00896-14,
2014.
[164] Sukrit Silas, Patricia Lucas-Elio, Simon A Jackson, Alejandra Aroca-Crevillén,
Loren L Hansen, Peter C Fineran, Andrew Z Fire, and Antonio Sánchez-Amat.
Type III CRISPR-Cas systems can provide redundancy to counteract viral
escape from type I systems. eLife, 6, 2017.
[165] Anne Kupczok, Giddy Landan, and Tal Dagan. The contribution of genetic
recombination to CRISPR array evolution. Genome Biology and Evolution,
7(7):19251939, July 2015.
[166] Raymond H. J. Staals, Simon A. Jackson, Ambarish Biswas, Stan J. J. Brouns,
Chris M. Brown, and Peter C. Fineran. Interference-driven spacer acquisition
is dominant over naive and primed adaptation in a native CRISPR-Cas system.
Nature Communications, 7:12853, October 2016.
[167] Haiyan Zeng, Jumei Zhang, Chensi Li, Tengfei Xie, Na Ling, Qingping Wu,
and Yingwang Ye. The driving force of prophages and CRISPR-Cas system
in the evolution of Cronobacter sakazakii. Scientific Reports, 7:40206, January
2017.
[168] William B. Whitman, David C. Coleman, and William J. Wiebe. Prokary-
otes: The unseen majority. Proceedings of the National Academy of Sciences,
95(12):65786583, June 1998.
[169] Patrick D. Schloss, Rene A. Girard, Thomas Martin, Joshua Edwards, and
J. Cameron Thrash. Status of the archaeal and bacterial census: an update.
mBio, 7(3):e00201-16, July 2016.
[170] Steven W. Wilhelm and Curtis A. Suttle. Viruses and nutrient cycles in the
sea: Viruses play critical roles in the structure and function of aquatic food
webs. BioScience, 49(10):781, Oct 1999.
[171] K E Wommack and R R Colwell. Virioplankton: viruses in aquatic ecosystems.
Microbiology and Molecular Biology Reviews : MMBR, 64:69–114, March 2000.
[172] Curtis A. Suttle. Viruses in the sea. Nature, 437(7057):356–361, Sep 2005.
[173] Joshua S. Weitz and Steven W. Wilhelm. Ocean viruses and their effects on
microbial communities and biogeochemical cycles. F1000 Biology Reports, 4,
September 2012.
[174] Charles H Wigington, Derek Sonderegger, Corina P D Brussaard, Alison
Buchan, Jan F Finke, Jed A Fuhrman, Jay T Lennon, Mathias Middel-
boe, Curtis A Suttle, Charles Stock, William H Wilson, K Eric Wommack,
Steven W Wilhelm, and Joshua S Weitz. Re-examination of the relationship
between marine virus and microbial cell abundances. Nature Microbiology,
1:15024, January 2016.
[175] Øivind Bergh, Knut Yngve Børsheim, Gunnar Bratbak, and Mikal Heldal.
High abundance of viruses found in aquatic environments. Nature,
340(6233):467–468, August 1989.
[176] John McN. Sieburth, Paul W. Johnson, and Paul E. Hargraves. Ultrastructure
and ecology of Aureococcus anophagefferens gen. et sp. nov. (Chrysophyceae):
The dominant picoplankter during a bloom in Narragansett Bay, Rhode Island,
summer 1985. Journal of Phycology, 24(3):416–425, September 1988.
[177] Lita M. Proctor and Jed A. Fuhrman. Viral mortality of marine bacteria and
cyanobacteria. Nature, 343(6253):60–62, January 1990.
[178] Gunnar Bratbak, Mikal Heldal, Svein Norland, and T. Frede Thingstad.
Viruses as Partners in Spring Bloom Microbial Trophodynamics. Applied and
Environmental Microbiology, 56(5):1400–1405, May 1990.
[179] Gunnar Bratbak, T. Frede Thingstad, and Mikal Heldal. Viruses and the
microbial loop. Microbial Ecology, 28(2):209–221, 1994.
[180] Markus G Weinbauer and Fereidoun Rassoulzadegan. Are viruses driving
microbial diversication and diversity? Environmental microbiology, 6:111,
January 2004.
[181] S. J. Schrag and J. E. Mittler. Host-parasite coexistence: The role of spatial
refuges in stabilizing bacteria-phage interactions. The American Naturalist,
148(2):348377, 1996.
[182] Richard E. Lenski. Coevolution of bacteria and phage: Are there endless
cycles of bacterial defenses and phage counterdefenses? Journal of Theoretical
Biology, 108(3):319–325, June 1984.
[183] Richard E. Lenski and Bruce R. Levin. Constraints on the coevolution of
bacteria and virulent phage: A model, some experiments, and predictions for
natural communities. The American Naturalist, 125(4):585–602, 1985.
[184] Alex R. Hall, Pauline D. Scanlan, Andrew D. Morgan, and Angus Buckling.
Hostparasite coevolutionary arms races give way to uctuating selection.
Ecology Letters, 14(7):635642, July 2011.
[185] Stineke van Houte, Alice K. E. Ekroth, Jenny M. Broniewski, Hélène Chabas,
Ben Ashby, Joseph Bondy-Denomy, Sylvain Gandon, Mike Boots, Steve Paterson,
Angus Buckling, and Edze R. Westra. The diversity-generating benefits
of a prokaryotic adaptive immune system. Nature, 532(7599):385–388, April
2016.
[186] John B. Waterbury and Frederica W. Valois. Resistance to co-occurring phages
enables marine Synechococcus communities to coexist with cyanophages abundant
in seawater. Applied and Environmental Microbiology, 59(10):3393–3399,
October 1993.
[187] Pedro Gómez and Angus Buckling. Bacteria-phage antagonistic coevolution
in soil. Science, 332(6025):106–109, April 2011.
[188] M. T. Horne. Coevolution of Escherichia coli and bacteriophages in chemostat
culture. Science, 168(3934):992–993, 1970.
[189] Simon A. Levin and J. Daniel Udovic. A mathematical model of coevolving
populations. The American Naturalist, 111(980):657–675, 1977.
[190] Lin Chao, Bruce R. Levin, and Frank M. Stewart. A complex community in
a simple habitat: An experimental study with bacteria and phage. Ecology,
58(2):369378, March 1977.
[191] Brendan J. M. Bohannan and Richard E. Lenski. Effect of prey
heterogeneity on the response of a model food chain to
resource enrichment. The American Naturalist, 153(1):73–82, 1999.
[192] Yan Wei, Amy Kirby, and Bruce R. Levin. The population and evolutionary
dynamics of Vibrio cholerae and its bacteriophage: Conditions for maintaining
phage-limited communities. The American Naturalist, 178(6):715–725, 2011.
[193] L. Van Valen. A new evolutionary law. Evolutionary Theory, 1:1–30, 1973.
[194] Leigh Van Valen. Molecular evolution as predicted by natural selection.
Journal of Molecular Evolution, 3(2):89–101, June 1974.
[195] Aneil Agrawal and Curtis M. Lively. Infection genetics: gene-for-gene ver-
sus matching-alleles models and all points in between. Evolutionary Ecology
Research, 4(1):91–107, 2002.
[196] S. Gandon, A. Buckling, E. Decaestecker, and T. Day. Host–parasite
coevolution and patterns of adaptation across time and space. Journal of
Evolutionary Biology, 21(6):1861–1866, November 2008.
[197] A. Buckling and P. B. Rainey. Antagonistic coevolution between a bacterium
and a bacteriophage. Proceedings of the Royal Society of London B: Biological
Sciences, 269(1494):931–936, May 2002.
[198] Bruce R. Levin, Frank M. Stewart, and Lin Chao. Resource-limited growth,
competition, and predation: A model and experimental studies with bacteria
and bacteriophage. The American Naturalist, 111(977):3–24, 1977.
[199] Luis F. Jover, Michael H. Cortez, and Joshua S. Weitz. Mechanisms of
multi-strain coexistence in host–phage systems with nested infection networks.
Journal of Theoretical Biology, 332:65–77, September 2013.
[200] Justin R. Meyer, Devin T. Dobias, Sarah J. Medina, Lisa Servilio, Animesh
Gupta, and Richard E. Lenski. Ecological speciation of bacteriophage lambda
in allopatry and sympatry. Science, 354(6317):1301–1304, December 2016.
[201] Takehito Yoshida, Stephen P. Ellner, Laura E. Jones, Brendan J. M. Bohan-
nan, Richard E. Lenski, and Nelson G. Hairston Jr. Cryptic population dy-
namics: Rapid evolution masks trophic interactions. PLoS Biology, 5(9):e235,
September 2007.
[202] Laura E. Jones and Stephen P. Ellner. Effects of rapid prey evolution on
predator–prey cycles. Journal of Mathematical Biology, 55(4):541–573, May
2007.
[203] M. Delbrück. Bacterial viruses or bacteriophages. Biological Reviews,
21(1):30–40, January 1946.
[204] Richard E. Lenski. Dynamics of interactions between bacteria and virulent
bacteriophage. In K. C. Marshall, editor, Advances in Microbial Ecology,
number 10 in Advances in Microbial Ecology, pages 1–44. Springer US, 1988.
[205] Justin R. Meyer, Devin T. Dobias, Joshua S. Weitz, Jeffrey E. Barrick, Ryan T.
Quick, and Richard E. Lenski. Repeatability and contingency in the evolution
of a key innovation in phage lambda. Science, 335(6067):428–432, January
2012.
[206] Joshua S. Weitz. Quantitative Viral Ecology: Dynamics of Viruses and Their
Microbial Hosts. Princeton University Press, January 2016.
[207] Reuben B. Vercoe, James T. Chang, Ron L. Dy, Corinda Taylor, Tamzin Grist-
wood, James S. Clulow, Corinna Richter, Rita Przybilski, Andrew R. Pitman,
and Peter C. Fineran. Cytotoxic chromosomal targeting by CRISPR/Cas
systems can reshape bacterial genomes and expel or remodel pathogenicity
islands. PLOS Genetics, 9(4):e1003454, April 2013.
[208] Ron L. Dy, Andrew R. Pitman, and Peter C. Fineran. Chromosomal target-
ing by CRISPR-Cas systems can contribute to genome plasticity in bacteria.
Mobile Genetic Elements, 3(5):e26831, September 2013.
[209] Josiane E. Garneau, Marie-Ève Dupuis, Manuela Villion, Dennis A. Romero,
Rodolphe Barrangou, Patrick Boyaval, Christophe Fremaux, Philippe Horvath,
Alfonso H. Magadán, and Sylvain Moineau. The CRISPR/Cas bacterial
immune system cleaves bacteriophage and plasmid DNA. Nature,
468(7320):67–71, November 2010.
[210] Hélène Deveau, Rodolphe Barrangou, Josiane E. Garneau, Jessica Labonté,
Christophe Fremaux, Patrick Boyaval, Dennis A. Romero, Philippe Horvath,
and Sylvain Moineau. Phage response to CRISPR-encoded resistance in
Streptococcus thermophilus. Journal of Bacteriology, 190(4):1390–1400, February
2008.
[211] Philippe Horvath, Dennis A. Romero, Anne-Claire Coûté-Monvoisin, Melissa
Richards, Hélène Deveau, Sylvain Moineau, Patrick Boyaval, Christophe Fremaux,
and Rodolphe Barrangou. Diversity, activity, and evolution of CRISPR
loci in Streptococcus thermophilus. Journal of Bacteriology, 190(4):1401–1412,
February 2008.
[212] Philip J. Gerrish and Richard E. Lenski. The fate of competing beneficial
mutations in an asexual population. Genetica, 102-103(0):127, 1998.
[213] Michael M. Desai, Daniel S. Fisher, and Andrew W. Murray. The speed
of evolution and maintenance of variation in asexual populations. Current
Biology, 17(5):385–394, March 2007.
[214] Lauren M. Childs, Nicole L. Held, Mark J. Young, Rachel J. Whitaker, and
Joshua S. Weitz. Multiscale model of CRISPR-induced coevolutionary
dynamics: Diversification at the interface of Lamarck and Darwin. Evolution,
66(7):2015–2029, July 2012.
[215] Serena Bradde, Marija Vucelja, Tiberiu Teşileanu, and Vijay Balasubramanian.
Dynamics of adaptive immunity against phage in bacterial populations.
PLoS Computational Biology, 13:e1005486, April 2017.
[216] Hélène Chabas, Stineke van Houte, Nina Molin Høyland-Kroghsbo, Angus
Buckling, and Edze R. Westra. Immigration of susceptible hosts triggers the
evolution of alternative parasite defence strategies. Proceedings. Biological
Sciences / The Royal Society, 283(1837), August 2016.
[217] Lauren M. Childs, Whitney E. England, Mark J. Young, Joshua S. Weitz,
and Rachel J. Whitaker. CRISPR-induced distributed immunity in microbial
populations. PLoS ONE, 9(7):e101710, July 2014.
[218] Kelli L. Palmer and Michael S. Gilmore. Multidrug-resistant Enterococci lack
CRISPR-Cas. mBio, 1(4):e00227-10, October 2010.
[219] Ekaterina Semenova, Matthijs M. Jore, Kirill A. Datsenko, Anna Semenova,
Edze R. Westra, Barry Wanner, John van der Oost, Stan J. J. Brouns, and
Konstantin Severinov. Interference by clustered regularly interspaced short
palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proceedings
of the National Academy of Sciences, 108(25):10098–10103, June 2011.
[220] Bruno Martel and Sylvain Moineau. CRISPR-Cas: an efficient tool for genome
engineering of virulent bacteriophages. Nucleic Acids Research,
42(14):9504–9513, 2014.
[221] Martin T. Ferris, Paul Joyce, and Christina L. Burch. High frequency of
mutations that expand the host range of an RNA virus. Genetics,
176(2):1013–1022, June 2007.
[222] Lin Chao. Fitness of RNA virus decreased by Muller's ratchet. Nature,
348(6300):454455, November 1990.
[223] Ekaterina Semenova, Ekaterina Savitskaya, Olga Musharova, Alexandra
Strotskaya, Daria Vorontsova, Kirill A. Datsenko, Maria D. Logacheva, and
Konstantin Severinov. Highly efficient primed spacer acquisition from targets
destroyed by the Escherichia coli type I-E CRISPR-Cas interfering complex.
Proceedings of the National Academy of Sciences, 113(27):7626–7631, July
2016.
[224] James S. Godde and Amanda Bickerton. The repetitive DNA elements called
CRISPRs and their associated genes: Evidence of horizontal transfer among
prokaryotes. Journal of Molecular Evolution, 62(6):718–729, June 2006.
[225] Sajib Chakraborty, Ambrosius P. Snijders, Rajib Chakravorty, Musaddeque
Ahmed, Ashek Md. Tarek, and M. Anwar Hossain. Comparative network
clustering of direct repeats (DRs) and cas genes confirms the possibility of the
horizontal transfer of CRISPR locus among bacteria. Molecular Phylogenetics
and Evolution, 56(3):878–887, September 2010.
[226] Uri Gophna, David M Kristensen, Yuri I Wolf, Ovidiu Popa, Christine Drevet,
and Eugene V Koonin. No evidence of inhibition of horizontal gene transfer by
CRISPRCas on evolutionary timescales. The ISME Journal, 9(9):20212027,
September 2015.
[227] Edze R. Westra, Andrea J. Dowling, Jenny M. Broniewski, and Stineke van
Houte. Evolution and ecology of CRISPR. Annual Review of Ecology,
Evolution, and Systematics, 47(1):307–331, 2016.
[228] Pablo Yarza, Michael Richter, Jörg Peplies, Jean Euzeby, Rudolf Amann,
Karl-Heinz Schleifer, Wolfgang Ludwig, Frank Oliver Glöckner, and Ramon
Rosselló-Móra. The All-Species Living Tree project: A 16S rRNA-based
phylogenetic tree of all sequenced type strains. Systematic and Applied Microbiology,
31(4):241–250, September 2008.
[229] Simon P Blomberg, Theodore Garland Jr, and Anthony R Ives. Testing for
phylogenetic signal in comparative data: behavioral traits are more labile.
Evolution, 57(4):717–745, 2003.
[230] Liam J Revell. phytools: an R package for phylogenetic comparative biology
(and other things). Methods in Ecology and Evolution, 3(2):217–223, 2012.
[231] Ekaterina Savitskaya, Anna Lopatina, Sofia Medvedeva, Mikhail Kapustin,
Sergey Shmakov, Alexey Tikhonov, Irena I Artamonova, Maria Logacheva,
and Konstantin Severinov. Dynamics of Escherichia coli type I-E CRISPR
spacers over 42 000 years. Molecular Ecology, 26(7):2019–2026, 2017.
[232] Ibtissem Grissa, Gilles Vergnaud, and Christine Pourcel. CRISPRFinder: a
web tool to identify clustered regularly interspaced short palindromic repeats.
Nucleic Acids Research, 35(Web Server issue):W52–57, July 2007.
[233] Céline Lévesque, Martin Duplessis, Jessica Labonté, Steve Labrie, Christophe
Fremaux, Denise Tremblay, and Sylvain Moineau. Genomic organization
and molecular analysis of virulent bacteriophage 2972 infecting an
exopolysaccharide-producing Streptococcus thermophilus strain. Applied and
Environmental Microbiology, 71(7):4057–4068, July 2005.
[234] Chaoyou Xue, Arun S. Seetharam, Olga Musharova, Konstantin Severinov,
Stan J. J. Brouns, Andrew J. Severin, and Dipali G. Sashital. CRISPR in-
terference and priming varies with individual spacer sequences. Nucleic Acids
Research, 43(22):10831–10847, December 2015.
[235] Mason J. Van Orden, Peter Klein, Kesavan Babu, Fares Z. Najar, and Rakhi
Rajan. Conserved DNA motifs in the type II-A CRISPR leader region. PeerJ,
5:e3161, 2017.
[236] Bruce R. Levin, Sylvain Moineau, Mary Bushman, and Rodolphe Barran-
gou. The population and evolutionary dynamics of phage and bacteria with
CRISPRmediated immunity. PLoS Genet, 9(3):e1003312, March 2013.
[237] Brendan J. M. Bohannan and Richard E. Lenski. Effect of resource enrichment
on a chemostat community of bacteria and bacteriophage. Ecology,
78(8):2303–2315, December 1997.