Editorial

Global Ocean Sampling Collection

Hemai Parthasarathy, Emma Hill, Catriona MacCallum

Today, PLoS Biology publishes landmark metagenomics papers from the J. Craig Venter Institute’s Global Ocean Sampling expedition [1–3]. These papers describe the initial analyses of several gigabasepairs’ worth of sequence data from oceanic microbes collected during the Sorcerer II expedition, as the ship made her way down from Canada, through the Panama Canal, and finally out beyond the Galapagos Islands well into the tropical Pacific and the South Pacific Gyre. Results from the first foray of this research mission into the Sargasso Sea were published three years ago [4]. As described in the accompanying Synopsis [5], the new voyage has added information from multiple biomes and several-fold more data.

Analysis of these data poses not only scientific challenges [6], but also significant legal hurdles. Craig Venter is no stranger to issues of intellectual property: his previous incarnation as the president of Celera saw him embroiled in controversy over the decision to “privatize” aspects of his company’s work in sequencing the human genome. Now, at the head of the Global Ocean Sampling project, Venter finds himself on the side of greater accessibility, negotiating the claims of individual governments on the genomic wealth within their waters. In particular, as of this writing, there is an active negotiation with the Ecuadorian government (which has seen more than one change of power since the expedition began) over restricting commercial reuse of these data. Henry Nicholls describes this tangled legal landscape in an accompanying Feature [8].

… of that fraction of the trace data acquired from Ecuadorian coastal waters), annotated with extensive geographical and physicochemical metadata. The assemblies and associated annotated peptides will be delivered to GenBank (http://www.ncbi.nlm.nih.gov/Genbank) around the time of publication, and will become available after GenBank has processed them. More immediately, and potentially more usefully, these data are also freely available through a specially built database, CAMERA (Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which provides greater annotation and analysis capabilities [7]. (CAMERA was funded by the Gordon and Betty Moore Foundation, which also supports PLoS.)

The proponents of open-access publishing, ourselves included, often cite as an inspiration the power that open access to DNA sequence databases has had in transforming scientific discovery. As our founders noted in the inaugural issue of PLoS Biology, “With great foresight, it was decided in the early 1980s that published DNA sequences should be deposited in a central repository, in a common format, where they could be freely accessed and used by anyone. Simply giving scientists free and unrestricted access to the raw sequences led them to develop the powerful methods, tools, and resources that have made the whole much greater than the sum of the individual sequences....Now imagine the possibilities if the same creative explosion that was fueled by open access to DNA sequences were to occur for the much larger body of published scientific results.” [9]

… publisher Web sites, but their use remains restricted, and to claim that freedom to read an article is the main benefit of open access is to miss the promise inspired by DNA sequence databases.

While we and other open-access journals have both enjoyed and been grateful for strong support from the genomics community, we are also disappointed that authors of landmark genomics papers, who adamantly support open access to sequence data, have not taken the opportunity to provide further leadership for their community by promoting open access to the scientific literature. We encourage all researchers to apply the same standards to their papers as they would to their data, regardless of the publisher. As Jensen et al. stated in a recent review about the benefits of text mining for the scientific community, “It is the restricted access to the full text of papers…that is currently the greatest limitation…” [10].

Citation: Parthasarathy H, Hill E, MacCallum C (2007) Global Ocean Sampling collection. PLoS Biol 5(3): e83. doi:10.1371/journal.pbio.0050083

Copyright: © 2007 Parthasarathy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Hemai Parthasarathy is Managing Editor, Emma Hill is Associate Editor, and Catriona MacCallum is Senior Editor at PLoS Biology.

* To whom correspondence should be addressed. E-mail: hemai@plos.org

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
PLoS Biology | www.plosbiology.org | S1 0369 Special Section from March 2007 | Volume 5 | Issue 3 | e83
Acknowledgments

PLoS Biology relies on the support of our academic editors and reviewers in selecting and improving manuscripts for publication. We would like to extend particular thanks to our editorial board members Sean Eddy, Jonathan Eisen, and Nancy Moran, our guest editors Simon Levin and Tony Pawson, and our anonymous peer reviewers for their contributions to this collection of articles.

References

1. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
2. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
3. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
4. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 58–60.
5. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085
6. Eisen JA (2007) Environmental shotgun sequencing: The potential and challenges of random and fragmented sampling of the hidden world of microbes. PLoS Biol 5: e82. doi:10.1371/journal.pbio.0050082
7. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/journal.pbio.0050075
8. Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5: e74. doi:10.1371/journal.pbio.0050074
9. Brown PO, Eisen MB, Varmus HE (2003) Why PLoS became a publisher. PLoS Biol 1: e36. doi:10.1371/journal.pbio.0000036
10. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet 7: 119–129.
Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity
Liza Gross | doi:10.1371/journal.pbio.0050085
Box 1. Following the Sorcerer II’s Hunt for Microbes
The Sorcerer II expedition was inspired by the British Challenger expedition (1872–1876), a pioneering oceanography research project that discovered hundreds of new genera and nearly 5,000 new marine species. Its gun stations replaced with research stations, the Challenger circumnavigated the oceans, stopping every 320 kilometers to recover specimens from bottom, intermediate, and surface depths to explore the diversity of macroscopic marine life. At each stop, the crew recorded the location, what they used to extract the sample, the …
… such environments. Continuing down the Atlantic seaboard, the expedition stopped near Cape Hatteras, North Carolina, and the Florida Keys before passing through the Caribbean and ending near Panama, where the crew collaborated with scientists at the Smithsonian Tropical Research Institute.
The fourth leg of the voyage sampled sites in the Eastern Pacific, including Cocos Island, about 500 kilometers southwest of Costa Rica. A highly productive ecosystem inhabits the waters off the island, a result of ocean currents buffeting the coast and
causing nutrient upwellings that mix with warm surface waters.
The crew made one last stop in the open ocean, then headed for
the Galapagos Islands.
Owing in part to its position near major ocean currents and
atmospheric transition zones, the Galapagos Archipelago
sits within a hydrographically complex region. Unique
oceanographic features there support a diverse set of habitats
and endemic species, found within several discrete zones
distinguished by temperature. This microbial mother lode held
the crew’s attention for two months, while they extensively
sampled the region.
By early March 2004, the crew had collected the last three
samples used in these studies, from two open ocean sites
and a lagoon in a coral reef in the South Pacific Gyre. Follow
these links to learn more about the Sorcerer II (http://www.sorcerer2expedition.org/version1/HTML/main.htm) and the
Challenger (http://hercules.kgs.ku.edu/hexacoral/expedition/challenger_1872-1876/challenger.html) expeditions.
doi:10.1371/journal.pbio.0050085.g001
… organisms. Most of the GOS sequences failed to be identified, in part because so few surface water microbes have been sequenced.

A novel comparative genomic method. Focusing on the reads that recruited to these most abundant genera, Rusch et al. generated “fragment recruitment plots.” These graphics represent relatedness and diversity of environmental sequences to a reference genome by showing where a read aligns with the reference genome (indicated by a horizontal bar) and its degree of similarity to the reference sequence (indicated by its vertical position). Recruited reads were color-coded based on sample origin to indirectly depict their associated metadata (for example, salinity and pH). These plots provided a visual tool to explore genetic diversity at the sequence and gene level, genome structure and evolution, and taxonomic and evolutionary relationships. (For more on fragment recruitment plots, see the accompanying poster, doi:10.1371/journal.pbio.0050077.sd001.)

Distinct recruitment patterns, easily detected by bands of color, emerged for each organism. In some cases, a single reference genome had multiple color bands, distinguished by their similarity and sample provenance. Because bands appeared to represent unique, closely related, and geographically distinct populations, and showed a novel level of diversity across the entire genome, the researchers termed each band a subtype. A tremendous amount of sequence diversity appeared in the subtypes, which also harbored substantial sequence variation at the protein level, some likely reflecting adaptations to local environments. This finding reveals a potential locus of microbial diversity, at the level of subtype rather than at the level of species, or ribotype (based on a segment of a ribosomal RNA gene called 16S rRNA), and offers clues to why it emerged (perhaps in response to local pressures) and how it evolved.

A novel sequence assembly method. Because such high levels of sequence diversity among organisms confound standard whole genome assembly software, and most of the GOS data correspond to organisms for which there is no appropriate reference genome, Rusch et al. used an “extreme assembly” approach to investigate the genomes of other abundant GOS populations. They used greatly reduced requirements for sequence similarity in the assemblers to generate longer contigs and capture more of the GOS data in an assembly. While some of the resulting larger assemblies corresponded to known reference genomes, others did not, allowing the researchers to study microbes without cultivated or sequenced counterparts. And because these larger assemblies could potentially provide functional insights into uncharacterized organisms, they might identify conditions that would allow scientists to grow them in the lab.

Many of the large contigs failed to align in any significant way with known genomes, so the researchers tried to match them with “seed fragments” from known taxonomic groups. By starting assembly from reads mated to the 16S rRNA gene (one of the most common marker genes used for classifying microbes), they could generate large contigs associated with many of the abundant GOS ribotypes. Fragment recruitment plots of these assemblies again revealed multiple subtypes, providing further support for the presence of multiple evolutionarily distinct subtypes within a given ribotype.

Evidence for environmental adaptations. A computational approach designed to identify groups of samples with similar genomic content revealed that tropical and temperate samples shared the least amount of genomic material. Some samples, however, were very similar.

While untangling all the factors that may affect genetic makeup of a sample is beyond current datasets and methods, the researchers demonstrated that specific genetic differences can be related to environmental factors. Several genes occurred up to seven times more frequently in a pair of samples from the Caribbean than they did in a pair from the eastern Pacific, even though both pairs had similar ribotype and genetic profiles. Many of these genes govern the metabolism and transport of phosphate (required for microbial growth), likely reflecting functional adaptations in the microbial communities to the measured differences in phosphate availability in the Caribbean and Pacific samples.

The researchers also explored diversity at the gene level by looking for evidence of functional differences in one gene family, proteorhodopsins, light-activated proton pumps with a slightly murky biological role. Proteorhodopsins were abundant in all the GOS and Sargasso Sea samples. In keeping with the diverse light environments sampled during the expedition, the researchers found a strong correlation between sequence variation and sample provenance. They hypothesize that the distribution of given variants reflects adaptation to the most abundant light spectra in their habitats.

Altogether, these results reveal the power of metagenomic approaches to capture the true measure of microbial diversity by uncovering genomic differences that would not have been apparent using traditional marker-based approaches. The breadth of this newly revealed diversity may come as a surprise to even inveterate microbe hunters.

The Expanding Protein Universe

Along with insights into microbial diversity, metagenomics promises to help us understand the vast number of proteins in nature. By randomly sampling DNA sequences from communities of organisms, metagenomic studies overcome selection and culturing biases that arise from focusing on a particular organism or a set of proteins, to provide an expansive view of protein diversity and evolution.

Proteins are typically grouped into families based on their evolutionary relationship, which can then be used to guide investigations of their biological roles. Proteins in the same family share similar amino acid sequences and three-dimensional conformations. Using amino acid sequence similarity as a measure to identify and group protein sequences from the GOS data with sequences from a comprehensive set of known proteins, Shibu Yooseph et al. evaluated the impact of the GOS data on our understanding of known proteins and studied the rate of discovery of protein families with new sequences. To group related sequences and predict proteins, they developed a novel sequence clustering technique based on full-length sequence similarity.

Identifying proteins in metagenomics data. Hypothetical proteins can be predicted by searching for open reading frames (ORFs), sequences flanked by nucleotide triplets (called codons) that signal the beginning and end of translation but don’t necessarily encode a protein. Because the GOS data contain many fragmentary sequences, Yooseph et al. allowed ORFs to be terminated at the end of a sequence, resulting in a partial or truncated ORF. They used
Box 2. Bioinformatic Methods at a Glance
Bioinformatics relies on statistics and computer power to synthesize and interpret huge datasets. Here’s a brief introduction to some of the environmental genomics methods used in the GOS studies.

Shotgun sequencing decodes genetic material by randomly shredding it into millions of fragments. The DNA sequence of each end of a fragment is determined; the two ends of a given fragment (or insert) can be associated, and constitute a “mate pair.” These random sequencing “reads” are then reassembled with a computer. Based on sequence similarity, overlapping reads are identified and merged into longer sequences called “contigs.” Contigs are organized into larger (but not necessarily continuous) pieces of a genome, called “scaffolds,” based on mate pairs. The resulting assemblies can link genes to their regulatory elements, guide investigations of biological pathways, and connect unknown sequences with taxonomic markers to suggest evolutionary relationships.

Sequence similarity detection allows functional and taxonomic characterization of genomic sequences. Once the shotgunned sequences have been organized into a library of sequence “scaffolds” and translated into hypothetical proteins, the next step uses sequence similarity to figure out what the proteins are and to identify families. Similarity can also associate a new sequence with an approximate location on the tree of life.

Sequence–sequence (pairwise) methods, the first step for identifying closely related sequences, compare all sequences to all other sequences in a pairwise manner. These methods (such as BLAST) allow all collected sequences to be compared with one another (and with all sequences already available in public databases) and reliably clustered into families of related sequences with high sequence similarity, or homology.

Profile methods are used to identify more remote relationships. Profile methods use multiple sequence alignments of previously identified families to compute “position-specific scoring matrixes” (PSSMs). Each position in the alignment is associated with a set of scores that reward or penalize the alignment of a given amino acid to the position. Profile methods can be more sensitive than simple sequence similarity methods because they give more weight to signals at sites that are conserved within a protein family and less weight to more variable positions.

Initially, the advantages of profile methods for detecting remote homology were limited to well-characterized families, as construction of a profile required some expertise. However, this changed with the fully automated integration of this step into PSI-BLAST. PSI-BLAST begins with a pairwise (sequence–sequence) similarity search, but then iteratively runs alternating steps of building a profile from the current set of similar sequences and using the profile to re-search the database for additional matching sequences.

Hidden Markov models (HMMs) employ statistical methods to model the likelihood of different amino acids at any given position of the sequence in an underlying alignment. Like some profile methods, HMMs use a probability-based method to determine the score of aligning an observed amino acid to a given position in a protein family, but HMMs improve upon profiles by more sophisticated modeling of variation in protein length, storing the probabilities of insertions or deletions at each position of the model. HMMs have a good track record for identifying more distantly related protein sequences.

Profile–profile methods are the most recent enhancement to sequence homology detection methods. As the name suggests, profile–profile methods compare one profile to another. Because each profile implicitly encodes more information than a single sequence, these methods identify relationships that cannot be detected by comparing individual sequences.
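
As a rough illustration of the profile idea described in Box 2, the following Python sketch builds a toy position-specific scoring matrix from a small alignment and slides it along a longer sequence. It is not the procedure used in the GOS studies, which relied on PSI-BLAST and hidden Markov models. The four-sequence alignment, the uniform background frequency, the pseudocount of 1, and the function names `build_pssm`, `score`, and `best_window` are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical five-column alignment of a made-up protein family.
TOY_ALIGNMENT = ["GKSTL", "GKTTL", "GRSTL", "GKSSL"]


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Assumption: every amino acid is equally likely in the background.
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive score: residue is enriched at this position.
            column_scores[aa] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


def score(pssm, segment):
    """Sum the per-position scores of a segment as long as the profile."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must match profile length")
    # Unknown residue symbols contribute a neutral score of 0.
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    profile = build_pssm(TOY_ALIGNMENT)
    print(best_window(profile, "MMGKSTLMM"))
```

Conserved columns (such as the first and last in the toy alignment) yield larger rewards for the conserved residue than variable columns do, which is the weighting toward conserved sites that the Box describes. Real profile tools also model gaps and use empirically derived background frequencies, which this sketch leaves out.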
the ORFs to generate a set of predicted proteins based on the results of a series of clustering steps and statistical analyses.

After performing pairwise comparisons (of every sequence against every other sequence) of the more than 28 million sequences in the combined dataset, the researchers identified conserved groups of sequences after accounting for redundancy due to identical and near-identical sequences. They then used profile methods to merge and expand these groups of sequences. While pairwise comparisons capture the most closely related sequences (or homologs), profile methods (the researchers used both PSI-BLAST and hidden Markov models) detect more distantly related sequences by combining homologs into multiple sequence alignments to generate “profiles.” (For more on these methods, see Box 2.)

From the clusters obtained by the above procedure, clusters of spurious sequences (that overlap true protein regions on the genome) were identified in addition to clusters of noncoding conserved sequences (based on tests showing no selection on their codons). Sequences in these clusters were removed; those remaining were labeled as predicted proteins. The researchers identified nearly 6 million proteins in the GOS dataset, 1.8 times the number already in public databases. Comparing the predicted protein clusters to known prokaryotic and nonprokaryotic protein databases revealed GOS counterparts in nearly all known prokaryotic protein families; nearly 2,000 clusters appeared unique to the GOS dataset.

Since they couldn’t use sequence similarity to infer function for the unique GOS sequences, the researchers relied on the assumption that proteins with similar roles are more likely to reside in the same genomic neighborhood. This analysis implicated several GOS-only clusters in photosynthesis or electron transport. Such clusters may come from viruses, as many viral parasites of photosynthetic bacteria express the photosynthetic genes of their hosts. Interestingly, though most of the sequences in GOS-only clusters appeared to be bacterial, a higher than expected proportion of them were flagged as viral. If such novel GOS protein families pan out as viral, the researchers argue, “we are far from exploring the molecular diversity of viruses.”

Insights into evolutionary and functional diversity. To compare ocean versus terrestrial life at the biochemical level, Yooseph et al. compared GOS sequences to those of land-dwelling prokaryotes. Nearly 70% of protein domains varied between the two classes of microbes, mostly reflecting the distinct biochemical requirements of the two environments, as well as the different taxonomic groupings in the two datasets. The researchers were surprised to find little evidence
of domains specific to gram-positive bacteria (defined by their unique cell wall), even though this bacterial group makes up nearly 12% of the GOS dataset. They also found a relative dearth of components related to flagella (the whip-like tail of microbial motility), possibly reflecting the reduced need for self-propulsion in the ocean.

Using a comprehensive protein family database (called Pfam), the researchers compared the kingdom distribution of known protein domains in the GOS data to that of proteins in public databases. In this process, some families that were previously thought to be single-kingdom turned out to have members in multiple kingdoms. For example, indoleamine 2,3-dioxygenase (IDO), an enzyme linked to the immune system in mammals, was considered unique to eukaryotes. But the IDO Pfam search turned up matches to ten GOS sequences identified as bacterial, suggesting that the proteins may have arisen much earlier than previously thought, or perhaps arose through lateral gene transfer (from an unrelated organism).

The sheer size of the GOS dataset, which nearly doubles the number of proteins, greatly expands the functional diversity of known protein families, providing valuable insights into their evolution. For example, the researchers found a 10-fold increase in the number and type of proteins involved in repairing ultraviolet radiation damage, likely reflecting the hazards of living in surface waters. A similar boost in phosphatases, which function in such fundamental biological processes as cell signaling, development, and cell division, highlighted important differences in the way one phosphatase (protein phosphatase 2C) functions in prokaryotes and eukaryotes.

And the unexpected abundance of a nitrogen metabolism catalyst typically associated with eukaryotes (type II glutamine synthetase) suggested two possible evolutionary mechanisms: either lateral gene transfer from eukaryotes, or gene duplication prior to the divergence of prokaryotes and eukaryotes. (The researchers suspect gene duplication.) The diversity of the GOS sequences also promises to characterize sequences with no similarity to known sequences (known as ORFans): over 6,000 ORFans pair up with GOS sequences representing some 600 organisms, paving the way for further study of their identity and function.

As GOS protein predictions are tested, some of these proteins will expand existing protein families while others will carve out GOS-specific families. Both results will help researchers determine priority targets for structural studies, an essential strategy for dealing with the flood of protein discoveries. And given that the GOS sequences represent mostly microbes from the ocean’s surface, yet point to substantial viral diversity as well, the rate of protein discovery indicates that a comprehensive catalog of proteins in nature is far from complete.

Variations on a Theme: A Single Fold Spawns a Diverse Kinase Superfamily

Cellular life chugs along under the power of enzymes, proteins that catalyze the scores of chemical reactions required for life. One of the largest protein families in eukaryotes, the eukaryotic protein kinases (ePKs), regulates the activity of a large fraction of all proteins and almost all biological pathways by phosphorylating proteins. Phosphorylation activates its target by transferring a phosphate group from adenosine triphosphate (ATP) to a specific amino acid on the protein, releasing energy and inducing structural changes that alter the protein’s activity. (Dephosphorylation removes the phosphate group, restoring the protein to its original conformation and inactive state.) One cell can contain hundreds of different protein kinases, each charged with phosphorylating one or many different proteins.

Bacteria and other prokaryotes, conventional wisdom held, rely mostly on structurally distinct kinases (histidine kinases) to mediate protein phosphorylation and cell signaling. But it now emerges that ePK-like kinases (ELKs), once thought to be minor players, are more prevalent and widespread than the histidine kinases. Although ePKs and ELKs typically exhibit very low sequence similarity, they share similar phosphorylation mechanisms and the same structural fold (the protein kinase–like, or PKL, fold).

Since PKL kinases conserve both fold and mechanism of action, they provide a robust model for determining how sequence variation corresponds to functional diversity. Unfortunately, comprehensive comparisons had been frustrated by a lack of sequence information for the prokaryotic ELK families relative to the well-studied eukaryotic domains. But now, thanks to the Sorcerer II expedition, sequence databases are brimming with microbial sequences, including a 3-fold increase in ELK sequences. Taking advantage of the bounty, Natarajan Kannan, Gerard Manning, and colleagues surveyed the global PKL landscape, and identified over 45,000 PKLs, which they classified into 20 families. Surprisingly, PKLs appear to usurp the histidine kinases as the core regulator of prokaryotic signaling and cell behavior.

Cataloging the number and diversity of PKL families. To detect kinase sequences, Kannan et al. searched over 17 million predicted proteins in the GOS dataset and 5 million-plus predicted and known protein sequences in public databases. Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs along with a model that predicts kinases on the basis of a few ultra-conserved motifs. The sensitivity of the HMMs allowed the researchers to discover very remote new members of these families and to classify and organize the tens of thousands of sequences. Both approaches iterate through multiple runs of the clustered results to refine the family alignments and to classify clusters with little similarity to known PKL families as potentially novel. (For more on these methods, see Box 2.)

The public databases, it turned out, harbored nearly 25,000 ePKs and over 5,000 ELKs. Over 16,000 GOS sequences fell into 20 PKL families, doubling the size of most families. Three main superfamily clusters emerged, distinguished by the most abundant members: choline and aminoglycoside kinases (CAKs), a “particularly diverse” family harboring kinases that facilitate colonization by beneficial and pathogenic bacteria; ePKs, almost exclusively eukaryotic except for a similar bacterial kinase (pknB); and a cluster of kinases, including Rio and Bud32, that are conserved between archaea and eukaryotes. Three families bore no sequence similarity to any other families save for a group of key motifs.

Overall, the 20 families exhibit significant functional and sequence diversity. Most of the families have not yet been fully investigated, though they do include some characterized members. Those with known kinase activity target small
molecules (such as lipids and amino acids) and seem to play regulatory roles, in contrast to many other structurally unrelated small molecule kinases, which affect metabolism.

Functional diversity springs from a set of core residues. Because sequence similarity ranged from “very low to almost undetectable,” the researchers used sequence profiles—models built from entire families to highlight their core characteristics—to both discover and classify kinase sequences. They found several novel families, and greatly extended the breadth of previously defined families. With these methods to refine the relationships within and between PKL families, the researchers explored the traits that unite or distinguish them.

Ten key amino acid residues of the catalytic domain consistently turned up in each family. This “core pattern of conservation,” the researchers explain, represents an ancient evolutionary innovation, spanning not just the three divisions of life—which diverged 1–2 billion years ago—but also the diverse families. The conservation of these residues across and within the families suggests that they play an essential role. And, indeed, six of those already characterized mediate ATP binding and catalysis.

Yet despite the seemingly universal presence of the ten residues, their occurrence in individual subfamilies showed a surprising pattern: all but one of these “core” residues had either disappeared or changed in individual families—though the proteins retained their fold and function—suggesting an unexpected flexibility for catalytic cores. To test this possibility, the researchers focused on one of the ten residues—the catalytic lysine K72, which repositions ATP’s phosphates. Present in ePKs, K72 is replaced by a different conserved amino acid in three CAK subfamilies. These subfamilies had corresponding substitutions near other key motifs, and structural modeling showed how these coordinated replacements could still result in an active enzyme.

A number of features (including amino acid motifs and secondary structure) emerged as family-specific, being highly conserved within but not between families. And as was seen in the CAK analysis, many family-specific residues occur near one of the ten key residues, suggesting that they may help direct substrates to the catalytic core or influence the nature of the reaction.

Evolutionary insights and beyond. Altogether these results reveal the vast functional and phylogenetic diversity that can occur in even just a subset of proteins, even though they retain a common catalytic fold and function. The massive sequence comparisons in this study not only identified the core of the PKL kinase, but also revealed the specific motifs underlying each family, including the ePKs. And the flexibility of several key regions within ePKs may underlie the huge expansion of these enzymes in eukaryotes. This structural flexibility may give kinases the ability to integrate multiple regulatory signals, and account for their almost universal involvement in the regulation of eukaryotic pathways.

These results set the stage for more in-depth structural and biochemical studies to elucidate the diverse functions carried out by these critical regulators of cell behavior. This study also demonstrates how metagenomic datasets, by covering an unbiased diversity of life, can refine our understanding of well-studied protein families, such as the ePKs, and shed light on their evolution. Kannan et al. hope that others take advantage of the environmental metagenomic largesse to pursue “similar insights into virtually every gene family with prokaryotic relatives.”

Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. doi:10.1371/journal.pbio.0050077

Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. doi:10.1371/journal.pbio.0050016

Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. doi:10.1371/journal.pbio.0050017

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Feature
Citation: Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5(3): e74. doi:10.1371/journal.pbio.0050074

Copyright: © 2007 Henry Nicholls. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: UNCLOS, United Nations Convention on the Law of the Sea

Henry Nicholls is a freelance science journalist based in London, United Kingdom. His book Lonesome George was nominated for the 2006 Guardian First Book Award. E-mail: henry.nicholls@tiscali.co.uk

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Craig Venter is not short of ambition. With the human genome fresh off the sequencing machines, he set his sights on a project of even grander scale: to describe the immense wealth of genetic information living in the world’s oceans. This voyage into biologically uncharted waters was, according to the Web site of the expedition vessel Sorcerer II, inspired in part by the voyage of H. M. S. Beagle [1]. Venter, it seems, would like to be remembered as the Charles Darwin of the 21st century (Figure 1).

Figure 1. Two of a Kind?
The young Charles Darwin (left) and Craig Venter (right). (Photo: J. Craig Venter Institute)
doi:10.1371/journal.pbio.0050074.g001

This is the largest effort to describe the genetic diversity in the world’s oceans. The voyage around national and international waters, collecting from around 150 sites and interrogating samples at the level of the gene rather than at the level of the organism, has already turned up between 5 and 6 million genes. Most of these genes have never been seen before, says Venter. Analysing this immense collection of data, the researchers discovered that many of the genes encode proteins that fall outside standard classification schemes. Proteins grouped within their own unique kingdoms are turning up in other kingdoms as well—forcing the team to reconsider the evolutionary relationships of established kingdoms. “This project is revealing some of the biggest discoveries about the environment,” says Venter. (For more on these discoveries see the synopsis of the research articles [2].)

Untapped Diversity
The Sorcerer II probably captured only a tiny fraction of the genetic diversity out there, says Mitchell Sogin, Director of the Josephine Bay Paul Center in Comparative Molecular Biology and Evolution at the Marine Biological Laboratory in Woods Hole, Massachusetts. In August 2006, Sogin and his colleagues published a detailed analysis of variable stretches of ribosomal RNA collected from the marine microbial world (Figure 2) [3]. “We estimate there are at least 25,000 different kinds of microbes per litre of seawater,” says Sogin. “But I wouldn’t be surprised if it turns out there are 100,000 or more.” A few of these microbes are common, and Venter will probably use them to recover complete gene sequences, he says. “The vast majority of low-abundance organisms are going undetected.”

Venter is more than aware that there’s a lot more to be discovered, but for the moment the goal is to sequence as many genes, in their entirety, as possible from these ecologically rich environments. These data raise a host of intriguing questions: in particular, what is the structure and function of the novel proteins these genes encode, and what role do they play in the metabolism of these undescribed microbes? Just as Darwin’s work drove a change in the way we see the world, so Venter is hoping these marine data will do the same in years to come.

Legal Framework
But times have changed. In the 21st century, there are plenty of hurdles to clear before the collecting and describing of biodiversity—even microscopic biodiversity—can go ahead.
PLoS Biology | www.plosbiology.org | S9 0380 Special Section from March 2007 | Volume 5 | Issue 3 | e74
The 1982 United Nations Convention on the Law of the Sea (UNCLOS) endowed coastal nations with the sovereign right to explore and exploit all resources within their “exclusive economic zone”—usually a body of water stretching 200 nautical miles out to sea [4]. Most coastal states exercise this right, granting permits to outsiders wanting to conduct research in their waters.

The 1992 Convention on Biological Diversity went on to set out some basic principles that might encourage sharing of benefits arising from genetic resources [5]. Where parties to the convention have got round to incorporating these principles into their own legislation, the result has been that anyone wishing to conduct research on these resources must agree to terms set by the host government.

Beyond national waters (with a few exceptions) are the “high seas”. Here, there is little regulation. According to UNCLOS, mineral resources on the deep seabed are considered the “common heritage of mankind”; this means that any benefits deriving from them should be shared with the international community. But when it comes to biological resources, just about anything goes.

The Rise of Bioprospecting
In areas beyond national jurisdiction, there has been an increase in so-called bioprospecting, the search for and exploitation of commercially valuable compounds from genetic resources. In 2005, researchers at the United Nations University scoured patent office databases for inventions based on the genomic features of deep seabed organisms [6]. They found that private companies such as Roche, Diversa, and New England Biolabs are after patents on DNA polymerases developed from deep-sea thermophilic bacteria that promise to enhance the molecular biologist’s expanding toolbox. Others like Sederma (based in France) and California Tan (based in the US) have used enzymes from similar microorganisms to develop skin products boasting UV- and heat-resistant properties.

Figure 2. A Remotely Operated Platform Samples Vent Fluids from the Northeast Pacific Ocean
(Photo: NOAA, http://oceanexplorer.noaa.gov)
doi:10.1371/journal.pbio.0050074.g002

There are plenty of not-for-profit organisations interested in the applications of discoveries from the deep. For example, Harbor Branch Oceanographic Institution, an oceanographic research and education institution based in Florida, is after compounds from marine organisms that might have biomedical potential. The institution has patents on, among others, potential anti-cancer agents derived from the marine sponges Discodermia dissoluta and Forcepia triabilis (Figure 3).

Deep-sea exploration, and the lengthy research and development that follows, is an expensive business. This means it’s a realistic option for only the world’s wealthiest nations. At least that’s the concern being expressed by some developing countries that would like to see a piece of this action, says David Leary of the Centre for Environmental Law at Macquarie University in Sydney, Australia.

These countries are seeking a change to UNCLOS that requires biological resources to be treated in the same way as mineral resources and any benefits deriving from them to be shared with the wider community. But others fear tighter regulation of such activities will only stifle pure marine scientific research. The Philippines was one of the first countries to regulate access to its genetic resources, says Sam Johnston, an expert on international environmental law based in Melbourne, Australia, and a senior research fellow at the United Nations University Institute of Advanced Studies in Yokohama, Japan. “It basically closed down all research,” he says. “A lot of researchers around the world have found the red tape prohibitive.”

Finding a balance between the unregulated status quo and cumbersome controls over research on marine biodiversity is now the concern of a United Nations working group [7]. “Some countries see this as the early stage of negotiating a new UNCLOS,” says Leary. But, he warns, “this could take 10 or 15 years before we see a result.”

One compromise might be for coastal states to allow all research on their genetic resources with the proviso that exploitation of any commercial application is subject to further negotiation. Another possibility is for the patent system to take responsibility for seeing that benefits are shared fairly, only granting patents based on biological resources if a royalty is paid into a global commons trust fund.

Ecological Impact
Whilst the UN goes in search of this kind of middle ground, both pure and applied research in the high seas continues apace—and this is cause for another concern. “There’s a number of sites that are so popular that there’s concern about the intensity of research,” says Leary. Repeated visits to the same deep-sea spot could not only result in unsustainable collection of some species and influence local hydrological and environmental conditions, but increase the likelihood that one person’s experiment will influence that of another. So far, little thought has been devoted to this consequence of unregulated access, says Leary. “I haven’t yet seen any clear
scientific data on the extent of the environmental impact of bioprospecting or marine scientific research,” he says.

Clearly, the environmental impact of carrying off 150-odd barrels of seawater for analysis isn’t something that Venter and his colleagues had to worry about. But navigating the complex legal territory was. “If Darwin were alive today trying to do his experiments, he would not have been allowed to,” says Venter.

At least, that is, without help from a lawyer. Sorcerer II collected samples in the waters of 17 coastal states and obtained all necessary permits, says Bob Friedman, Vice President for Environmental and Energy Policy at the J. Craig Venter Institute. Some countries required detailed agreements thrashing out how benefits deriving from these data would be shared. All of these are posted on the Sorcerer II Web site, says Friedman [8]. Most countries, however, have not decided how they might regulate access to their genetic resources, he says.

In addition to getting the paperwork in order, Venter encouraged collaboration with local scientists. What’s more, the entire metagenomic database will be put in the public domain. The gene sequences should be of tremendous value to each of the countries involved, says Venter. In particular, it will help them monitor and manage the health of their marine ecosystems more effectively, he predicts. To ensure that this vast dataset will be available to all, the Gordon and Betty Moore Foundation has stumped up $24.5 million for a seven-year project to design a new database to host it and new tools to interrogate it (Box 1).

Box 1. Zooming in on CAMERA
CAMERA is the convenient acronym for the cumbersomely named Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis. “This resource will focus on providing easy-to-use tools for uploading, downloading, searching, and analysis of genomic datasets,” says Paul Gilna, CAMERA’s executive director, based at the California Institute for Telecom and Information Technology in La Jolla, California.

Researchers will also be able to clothe the bare genetic sequences in a wealth of other data, such as GPS coordinates and depth of collection, the water temperature, its oxygen content, salinity and pH. The site could well draw upon other resources that enrich these metadata, says Gilna. For example, satellite imagery associated with the sampling sites, and other data types, such as microscopy stills and high-definition video, could become important metadata that help researchers characterise the environments from which samples were taken.

Crucially, CAMERA will allow researchers to record the source of each genetic sequence. Many coastal countries now want a share of commercial applications that derive from their marine resources. Countries may be happy to see genetic sequences placed in CAMERA provided they are acknowledged and commercial exploitation of their sequence is not permitted without their consent.

But handling such immense datasets poses considerable technological challenges. The GOS database alone contains around 6 billion bases—the equivalent of two entire human genomes. And the number and size of this kind of database will only mushroom in coming years, making it necessary to develop high-speed optical networks, grid-based computing, and new visualisation technologies. “We are quickly approaching a ‘tipping point’,” says Gilna. “These datasets will start to follow exponential, rather than linear trends, much as was the case for DNA sequencing.”

Finally, there’s the tricky task of satisfying all researchers who could benefit from this resource. “The scientific communities—from studies on biodiversity and biogeochemistry to evolution and genomes—have different interests, different data expectations, different vocabularies, and different levels of experience with using computational tools and databases,” says John Wooley, a pharmacologist at the University of California, San Diego, who is working on CAMERA. “Before metagenomics, no effort ever attempted to incorporate data from such vastly divergent sources to meet the needs of such a wide range of scientific interests.” For more on CAMERA, see the Community Page article by Seshadri et al. [13].

Figure 3. Marine Sponges That Have Generated Products with Anti-Cancer Promise
(A) Discodermia dissoluta. (Photo: NOAA)
(B) Forcepia triabilis. (Photo: T. Piper, NOAA)
doi:10.1371/journal.pbio.0050074.g003

Yet, it seems, all these undertakings and assurances have not been enough to steer this expedition clear of controversy. In 2004, when the Sorcerer II dropped anchor just off Hiva Oa, an island in the Marquesas archipelago in the Pacific Ocean, tensions escalated. Although the plan to sample seawater around the islands had the backing of local French Polynesian authorities and scientists, the French government in Paris had other ideas, says Venter. “We were placed under house arrest.” Eventually, after a further round of intense negotiations, the Sorcerer II was allowed out of the harbour to collect its seawater samples and continue on its way.

Last year, a Canadian-based non-governmental organisation—the Action Group on Erosion, Technology and Concentration—dedicated to “the advancement of cultural
and ecological diversity and human rights” labelled Venter a “biopirate”, accusing him of “flagrant disregard for national sovereignty over biodiversity” [9]. In several countries, there’s real concern about how he managed his collecting, claims Pat Mooney, Executive Director of the group. Although the data are going into the public domain, it is laboratories like Venter’s that are best placed to exploit it, he argues. “There’s a handful of folk around the planet that can understand such stuff,” says Mooney.

Venter is adamant that this whole project is just pure, clean marine scientific research. Indeed, the Sorcerer II Web site explicitly states that “no intellectual property rights will be sought by the Venter Institute on these genomic sequence data” [10]. Venter sums up the goal of the project: “We were just trying to answer some basic questions about the diversity of microbes on the planet,” he says.

But, says environmental lawyer Johnston, the distinction between pure and applied research is becoming increasingly blurred. To illustrate this, he cites a strain of thermophilic Bacillus collected from Antarctica in the early 1980s as part of a study into the worldwide distribution and characteristics of such extremophiles. Years later, the same sample, taken out of storage and subjected to further study, turned out to contain a talented enzyme that has the promise to revolutionise DNA extraction for forensic analysis [11]. “The collector undertook the act in the purest form but ultimately the use of it has changed in the course of two decades,” says Johnston. “So much depends on the perspective at which you look at the issue.”

This means that there are likely to be several different takes on the same research. What for one person is pure marine scientific research can be another person’s bioprospecting and yet another’s biopiracy. There are very few cases where everyone agrees there has been outright theft of a biological resource and very few cases where everyone is happy there’s been proper benefit sharing, says Johnston. “Even the best-designed programmes where there’s enormous consultation with the local people have found it’s difficult to get the right kind of consensus and buy-in,” he says [12].

So, keen as Venter might be to put the controversy of his human-genome-sequencing days behind him, this kind of research strays into unknown biological, legal, and ethical territory. And in this environment, allegations of biopiracy are almost inevitable. This, however, is unlikely to deter a man like Venter. “If it’s in the Darwin school of biopiracy, then fine,” he says.

References
1. Sorcerer II Expedition (2005) Expedition info—Environmental genomics. Available: http://www.sorcerer2expedition.org. Accessed 19 January 2007.
2. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085
3. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.
4. United Nations (1982) United Nations convention on the law of the sea of 10 December 1982. New York: United Nations. Available: http://www.un.org/Depts/los/convention_agreements/convention_overview_convention.htm. Accessed 18 January 2007.
5. Secretariat of the Convention on Biological Diversity (2002) Bonn guidelines on access to genetic resources and fair and equitable sharing of the benefits arising out of their utilization. Montreal: Secretariat of the Convention on Biological Diversity. Available: https://www.biodiv.org/doc/publications/cbd-bonn-gdls-en.pdf. Accessed 16 January 2007.
6. Arico S, Salpin C (2005) UNU-IAS report—Bioprospecting of genetic resources in the deep seabed: Scientific, legal and policy aspects. Yokohama (Japan): United Nations University Institute of Advanced Studies. Available: http://www.ias.unu.edu/binaries2/DeepSeabed.pdf. Accessed 16 January 2007.
7. International Institute for Sustainable Development (2006) Ad Hoc Open-ended Informal Working Group to study issues relating to the conservation and sustainable use of marine biological diversity beyond areas of national jurisdiction. Winnipeg (Canada): International Institute for Sustainable Development. Available: http://www.iisd.ca/oceans/marinebiodiv. Accessed 16 January 2007.
8. Sorcerer II Expedition (2005) Collaborative agreements. Available: http://www.sorcerer2expedition.org/permits. Accessed 22 January 2007.
9. Coalition Against Biopiracy (2006) Captain Hook awards for biopiracy 2006. Available: http://www.captainhookawards.org/winners/2006_pirates. Accessed 18 January 2007.
10. Sorcerer II Expedition (2005) Agreements. Available: http://www.sorcerer2expedition.org. Accessed 19 January 2007.
11. Moss D, Harbison AS, Saul DJ (2003) An easily automated, closed-tube forensic DNA extraction procedure using a thermostable proteinase. Int J Legal Med 117: 340–349.
12. Laird SA, Wynberg R, Johnston S (2006) Recent trends in the biological prospecting. 29th Antarctic Treaty Consultative Meeting. Available: http://www.ias.unu.edu/binaries2/ATCM29_May2006.doc. Accessed 16 January 2007.
13. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/journal.pbio.0050075
Essay
Citation: Eisen JA (2007) Environmental shotgun sequencing: Its potential and challenges for studying the hidden world of microbes. PLoS Biol 5(3): e82. doi:10.1371/journal.pbio.0050082

Series Editor: Simon Levin, Princeton University, United States of America

Since their discovery in the 1670s by Anton van Leeuwenhoek, an incredible amount has been learned about microorganisms and their importance to human health, agriculture, industry, ecosystem functioning, global biogeochemical cycles, and the origin and evolution of life. Nevertheless, it is what is not known that is most astonishing. For example, though there are certainly at least 10 million species of bacteria, only a few thousand have been formally described [1]. This contrasts with the more than 350,000 described species of beetles [2]. This is one of many examples indicative of the general difficulties encountered in studying organisms that we cannot readily see or collect in large samples for future analyses. It is thus not surprising that most major advances in microbiology can be traced to methodological

without culturing them [5–7], and the use of high-throughput “shotgun” methods to sequence the genomes of cultured species [8]. We are now in the midst of another such revolution—this one driven by the use of genome sequencing methods to study microbes directly in their natural habitats, an approach known as metagenomics, environmental genomics, or community genomics [9].

In this essay I focus on one particularly promising area of metagenomics—the use of shotgun genome methods to sequence random fragments of DNA from microbes in an environmental sample. The randomness and breadth of this environmental shotgun sequencing (ESS)—first used only a few years ago [10,11] and now being used to assay every microbial system imaginable from the human gut [12] to waste water sludge [13]—has the potential to reveal novel and fundamental insights into the hidden world of microbes and their impact on our world. However, the complexity of analysis required to realize this potential poses unique interdisciplinary challenges, challenges that make the approach both fascinating and frustrating in equal measure.

taxa may look the same. This vexing problem was partially overcome in the 1980s through the use of rRNA-PCR (Table 1). This method allows microorganisms in a sample to be phylogenetically typed and counted based on the sequence of their rRNA genes, genes that are present in all cell-based organisms. In essence, a database of rRNA sequences [14,15] from known organisms functions like a bird field guide, and finding a rRNA-PCR product is akin to seeing a bird through binoculars. Rather than counting species, this approach focuses on “phylotypes,” which are defined as organisms whose rRNA sequences are very similar to each other (a cutoff of >97% or >99% identical is frequently used). The ability to use phylotyping to determine who was out there in any microbial sample has revolutionized environmental microbiology [16], led to many discoveries [e.g.,17], and convinced many people (myself included) to become microbiologists.
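The idea of grouping rRNA sequences into phylotypes by an identity cutoff can be illustrated with a small sketch. It is not the procedure used in any of the cited studies. It assumes the sequences have already been aligned to equal length, and the function names `percent_identity` and `cluster_phylotypes` are hypothetical. Real analyses rely on proper alignment and more careful clustering.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def cluster_phylotypes(seqs: List[str], cutoff: float = 97.0) -> List[List[int]]:
    """Greedy grouping (an assumed, simplified scheme): each sequence joins the
    first cluster whose representative (its first member) it matches at more
    than `cutoff` percent identity; otherwise it founds a new cluster."""
    clusters: List[List[int]] = []
    for i, seq in enumerate(seqs):
        for members in clusters:
            if percent_identity(seqs[members[0]], seq) > cutoff:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    ref = "ACGT" * 25                 # toy 100-base "rRNA" sequence
    close = "TT" + ref[2:]            # 2 mismatches, 98% identical
    distant = "T" * 10 + ref[10:]     # 8 mismatches, 92% identical
    print(cluster_phylotypes([ref, close, distant]))
```

Counting the members of each cluster then gives a rough tally per phylotype, which is the "typing and counting" step described above.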
PLoS Biology | www.plosbiology.org | S13 0384 Special Section from March 2007 | Volume 5 | Issue 3 | e82
Table 1. Some Major Methods for Studying Individual Microbes Found in the Environment
Method | Summary | Comments
Microscopy | Microbial phenotypes can be studied by making them more visible. In conjunction with other methods, such as staining, microscopy can also be used to count taxa and make inferences about biological processes. | The appearance of microbes is not a reliable indicator of what type of microbe one is looking at.
Culturing | Single cells of a particular microbial type are grown in isolation from other organisms. This can be done in liquid or solid growth media. | This is the best way to learn about the biology of a particular organism. However, many microbes are uncultured (i.e., have never been grown in the lab in isolation from other organisms) and may be unculturable (i.e., may not be able to grow without other organisms).
rRNA-PCR | The key aspects of this method are the following: (a) all cell-based organisms possess the same rRNA genes (albeit with different underlying sequences); (b) PCR is used to make billions of copies of basically each and every rRNA gene present in a sample; this amplifies the rRNA signal relative to the noise of thousands of other genes present in each organism’s DNA; (c) sequencing and phylogenetic analysis places rRNA genes on the rRNA tree of life; the position on the tree is used to infer what type of organism (a.k.a. phylotype) the gene came from; and (d) the numbers of each microbe type are estimated from the number of times the same rRNA gene is seen. | This method revolutionized microbiology in the 1980s by allowing the types and numbers of microbes present in a sample to be rapidly characterized. However, there are some biases in the process that make it not perfect for all aspects of typing and counting.
Shotgun genome sequencing of cultured species | The DNA from an organism is isolated and broken into small fragments, and then portions of these fragments are sequenced, usually with the aid of sequencing machines. The fragments are then assembled into larger pieces by looking for overlaps in the sequence each possesses. The complete genome can be determined by filling in gaps between the larger pieces. | This has now been applied to over 1,000 microbes, as well as some multicellular species, and has provided a much deeper understanding of the biology and evolution of life. One limitation is that each genome sequence is usually a snapshot of one or a few individuals.
Metagenomics | DNA is directly isolated from an environmental sample and then sequenced. One approach to doing this is to select particular pieces of interest (e.g., those containing interesting rRNA genes) and sequence them. An alternative is ESS, which is shotgun genome sequencing as described above, but applied to an environmental sample with multiple organisms, rather than to a single cultured organism. | This method allows one to sample the genomes of microbes without culturing them. It can be used both for typing and counting taxa and for making predictions of their biological functions.
doi:10.1371/journal.pbio.0050082.t001
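The shotgun row of Table 1 describes assembling fragments "by looking for overlaps in the sequence each possesses." A toy sketch of that idea follows. It is a simplified greedy merge with error-free reads, not the algorithm of any real assembler. The names `overlap` and `greedy_assemble` and the `min_len` parameter are assumptions for illustration, and real reads would also require handling sequencing errors, repeats, reverse complements, and fragments contained within others.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no such overlap reaches `min_len` bases."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of fragments with the largest overlap
    until no pair overlaps by at least `min_len`; return the contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]  # toy reads from one short sequence
    print(greedy_assemble(reads))
```

When sampling is too shallow for reads to overlap, the loop stops early and returns several separate contigs, which mirrors the gaps "between the larger pieces" that the table mentions.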
The selective targeting of a single gene makes rRNA-PCR an efficient method for deep community sampling [18]. However, this efficiency comes with limitations, most of which are complemented or circumvented by the randomness and breadth of ESS. For example, examination of the random samples of rRNA sequences obtained through ESS has already led to the discovery of new taxa—taxa that were completely missed by PCR because of its inability to sample all taxa equally well (e.g., [19]). In addition, ESS provides the first robust sampling of genes other than rRNA, and many of these genes can be more useful for some aspects of typing and counting. Some universal protein coding genes are better than rRNA both for distinguishing closely related strains (because of third position variation in codons) and for estimating numbers of individuals (because they vary less in copy number between species than do rRNA genes) [10]. Perhaps most significantly, ESS is providing groundbreaking insights into the diversity of viruses [20,21], which lack rRNA genes and thus were left out of the previous revolution.

Certainly, many challenges remain before we can fully realize the potential of ESS for the typing and counting of species, including making automated yet accurate phylogenetic trees of every gene, determining which genes are most useful for which taxa, combining data from different genes even when we do not know if they come from the same organisms, building up databases of genes other than rRNA, and making up for the lack of depth of sampling. If these challenges are met, ESS has the potential to rewrite much of what we thought we knew about the phylogenetic diversity of microbial life.

What Are They Doing? Top Down and Bottom Up Approaches to Understanding Functions in Communities
A community is, of course, more than a list of types of organisms. One approach to understanding the properties and functioning of a microbial community is to start with studies of the different types of organisms and build up from these individuals to the community. Ideally, to do this one would culture each of the phylotypes and study its properties in the lab. Unfortunately, many, if not most, key microbes have not yet been cultured [22]. Thus, for many years, the only alternative was to make predictions about the biology of particular phylotypes based on what was known about related organisms. Unfortunately, this too does not work well for microbes since very closely related organisms frequently have major biological differences. For example, Escherichia coli K12 and E. coli O157:H7 are strains of the same species (and considered to be the same phylotype), with genomes containing only about 4,000 genes, yet each possesses hundreds of functionally important genes not seen in the other strain [23]. Such differences are routine in microbes, and thus one cannot make any useful inferences about what particular phylotypes are doing (e.g., type of metabolism, growth properties, role in nutrient cycling, or pathogenicity) based on the activities of their relatives.

These difficulties—the inability to culture most microbes and the functional disparities between close relatives—led to one of the first kinds
PLoS Biology | www.plosbiology.org | S14 0385 Special Section from March 2007 | Volume 5 | Issue 3 | e82
Table 2. Methods of Binning

Method: Genome assembly
Description: Identify regions of overlap between different fragments from the same organism to build larger contiguous pieces (contigs).
Comments: Getting deep enough sampling for this to work is very expensive except for low diversity systems or for very abundant taxa.

Method: Reference genome alignment
Description: Identify ESS fragments or contigs that are very similar to already assembled sections of the genome of single microbial types.
Comments: (a) One of the most effective ways to sort through ESS data, if the reference genome is very closely related to an organism in the sample; (b) the reason why more reference genomes are needed; (c) does not handle regions present in uncultured organisms but not in the reference.

Method: Phylogenetic analysis
Description: Build evolutionary trees of genes encoded by ESS fragments or contigs. Assign fragments or contigs to taxonomic groups based on nearest neighbor(s) in trees.
Comments: (a) Very powerful, but level of resolution depends on whether fragments encode useful phylogenetic markers and on how well sampled the database is for the neighbor analysis; (b) would work much better if more genomes were available from across the tree of life.

Method: Word frequency and nucleotide composition analysis
Description: Measure word frequency and composition of each fragment. Group by clustering algorithms or principal component analysis.
Comments: (a) Has the potential to work because organisms sometimes have “signatures” of word frequencies that are found throughout the genome and are different between species; (b) very challenging for small fragments.

Method: Population genetics
Description: Build alignments of fragments or contigs with similarity to each other (but not as much as needed for assembly). Examine haplotype structure, predicted effective population size, and synonymous and nonsynonymous substitution patterns.
Comments: May be most useful as a way of subdividing bins created by other methods.

Note that some methods can be applied to ESS fragments or to bins identified by other methods.
doi:10.1371/journal.pbio.0050082.t002
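
The word frequency row of Table 2 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the method used in any of the cited studies: it counts tetranucleotide "words" in each fragment, then greedily groups fragments whose frequency profiles lie within a chosen Euclidean distance. The function names (`kmer_profile`, `bin_fragments`), the greedy clustering rule, and the `threshold` parameter are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return the normalized frequency of every A/C/G/T word of length k."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing other characters (e.g., N) are ignored.
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def bin_fragments(fragments, k=4, threshold=0.1):
    """Greedy binning (an assumed, simplistic clustering rule).

    A fragment joins the first bin whose centroid is within `threshold`;
    otherwise it starts a new bin. `fragments` is a list of (name, sequence).
    """
    bins = []  # each bin: {"centroid": [...], "members": [names]}
    for name, seq in fragments:
        profile = kmer_profile(seq, k)
        for b in bins:
            if distance(profile, b["centroid"]) <= threshold:
                n = len(b["members"])
                # Update the running mean of the bin's profile.
                b["centroid"] = [(c * n + x) / (n + 1)
                                 for c, x in zip(b["centroid"], profile)]
                b["members"].append(name)
                break
        else:
            bins.append({"centroid": profile, "members": [name]})
    return [b["members"] for b in bins]


# Toy fragments with deliberately extreme composition (made-up data).
toy = [("a", "AT" * 50), ("b", "TA" * 50), ("c", "GC" * 50), ("d", "CG" * 50)]
print(bin_fragments(toy, threshold=0.1))  # two bins: AT-rich and GC-rich
print(bin_fragments(toy, threshold=0.0))  # too stringent: one fragment per bin
print(bin_fragments(toy, threshold=2.0))  # too relaxed: everything in one bin
```

The three calls mirror the stringency trade-off: a threshold that is too strict leaves most bins with a single fragment, while one that is too loose merges fragments from clearly different sources. Real fragments are far noisier than this toy input, which is one reason Table 2 calls the approach very challenging for small fragments.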
of metagenomic analyses, wherein predictions of function were made from analysis of the sequence of large DNA fragments from representatives of known phylotypes. This approach has provided some stunning insights, such as the discovery of a novel form of phototrophy in the oceans [24]. However, this large insert approach has the same limitation as predicting properties from characterized relatives: a single cell cannot possibly represent the biological functions of all members of a phylotype.

ESS provides an alternative, more global way of assessing biological functions in microbial communities. As when using the large insert approach, functions can be predicted from sequences. However, in this case the predicted functions represent a random sampling of those encoded in the genomes of all the organisms present. This approach has unquestionably been wildly successful in terms of gene discovery. For example, analysis of ESS data has revealed novel forms of every type of gene family examined, as well as a great number of completely novel families (e.g., [25]). However, there is a major caveat when using ESS data to make community-level inferences. Ecosystems are more than just a bag of genes; they are made up of compartments (e.g., cells, chromosomes, and species), and these compartments matter. The key challenge in analyzing ESS data is to sort the DNA fragments (which are usually less than 1,000 base pairs long relative to genome sizes of millions or billions of bases) into bins that correspond to compartments in the system being studied.

A recent study by myself and colleagues illustrates the importance of compartments when interpreting ESS data. When we analyzed ESS data from symbionts living inside the gut of the glassy-winged sharpshooter (an insect that has a nutrient-limited diet), we were able to bin the data to two distinct symbionts [26]. We then could infer from those data that one of the symbionts synthesizes amino acids for the host while the other synthesizes the needed vitamins and cofactors. Modeling and understanding of this ecosystem are greatly enhanced by the demonstration of this complementary division of labor, in comparison to simply knowing that amino acids, vitamins, and cofactors are made by “symbionts.”

How does one go about binning ESS data? A variety of approaches have been developed, some of which are described in Table 2. In considering the different binning methods and their limitations, the first question one needs to ask is, what are we trying to bin? Is it fragments from the same chromosome from a single cell, which would be useful for studying chromosome structure? If so, then perhaps genome assembly methods are the best. What if instead, as in the sharpshooter example, we are trying to have each bin include every fragment that came from a particular species, knowledge which may be useful for predicting community metabolic potential? If the level of genetic polymorphism among individual cells from the same species is high, then genome assembly methods may not work well (the polymorphisms will break up assemblies). A better approach might be to look for species-specific “word” frequencies in the DNA, such as ones created by patterns in codon usage. The challenge is, how do we tune the methods to find the right target level of resolution? If we are too stringent, most bins will include only a few fragments. But if we are too relaxed, we will create artificial constructs that may prove biologically misleading, such as grouping together sequences from different species. To make matters more complex, most likely the stringency needed will vary for different taxa present in the sample.

Another critical issue is the diversity of the system under study. Generally, binning works better when there are
few different phylotypes present, all of which are distantly related and form discrete populations. This is why binning works well for the sharpshooter system and other relatively isolated, low diversity environments. Binning increases in difficulty exponentially as the number of species increases: the populations and species start to merge together, and the populations get more and more polymorphic and variable in relative abundance (such as in the paper about the Global Ocean Sampling expedition in this issue [27]). Further complicating binning is the phenomenon of lateral gene transfer, where genes are exchanged between distantly related lineages at rates that are high enough that random sampling of a genome will frequently include genes with multiple histories.

Despite these challenges, I believe we can develop effective binning methods for complex communities. First, we can combine different approaches together, such as using one method to sort in a relaxed manner and then using another to subdivide the bins provided by the first method. Second, we can incorporate new approaches such as population genetics into the analysis [28]. In addition, the lessons learned here can be applied to other aspects of metagenomics (e.g., the counting and typing discussed above) and provide insights into the nature of microbial genomes and the structure of microbial populations and communities.

Comparative Metagenomics

So far, I have discussed issues relating mostly to intrasample analysis of ESS data. However, the area with perhaps the most promise involves the comparative analysis of different samples. This work parallels the comparative analysis of genomes of cultured species. Initial studies of that type compared distantly related taxa with enormous biological differences. What has been learned from these studies pertains mostly to core housekeeping functions, such as translation and DNA metabolism, and to other very ancient processes [29,30]. It was not until comparisons were made between closely related organisms that we began to understand events that occurred on shorter time scales, such as selection, gene transfer, and mutation processes [31]. Similarly, the initial comparisons of ESS data involved comparisons of wildly different environments [32], yielding insights into the general structure of communities. But as more comparisons are made between similar communities [33,34], such as those sampled during vertical and horizontal ocean transects [27,35–37], we will begin to learn about shorter time scale processes such as migration, speciation, extinction, responses to disturbance, and succession. It is from a combination of both approaches, comparing both similar and very divergent communities, that we will be able to understand the fundamental rules of microbial ecology and how they relate to ecological principles seen in macro-organisms.

Conclusions

In promoting some of the exciting opportunities with ESS, I do not want to give the impression that it is flawless. It is helpful in this respect to compare ESS to the Internet. As with the Internet, ESS is a global portal for looking at what occurs in a previously hidden world. Making sense of it requires one to sort through massive, random, fragmented collections of bits of information. Such searches need to be done with caution because any time you analyze such a large amount of data, patterns can be found. In addition, as with the Internet, there is certainly some hype associated with ESS that gives relatively trivial findings more attention than they deserve. Overall, though, I believe the hype is deserved. As long as we treat ESS as a strong complement to existing methods, and we build the tools and databases necessary for people to use the information, it will live up to its revolutionary potential.

Acknowledgments

I thank Simon Levin, Joshua Weitz, Jonathan Dushoff, Maria-Inés Benito, Doug Rusch, Aaron Halpern, and Shibu Yooseph for helpful discussions, and Melinda Simmons, Merry Youle, and three anonymous reviewers for helpful comments on the manuscript. The writing of this paper was supported by National Science Foundation Assembling the Tree of Life Grant 0228651 to Jonathan A. Eisen and by the Defense Advanced Research Projects Agency under grants HR0011-05-1-0057 and FA9550-06-1-0478.

References

1. Gould SJ (1996) Full house: The spread of excellence from Plato to Darwin. New York: Harmony Books. 244 p.
2. Evans AV, Bellamy CL (1996) An inordinate fondness for beetles. New York: Holt. 208 p.
3. Woese C, Fox G (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc Natl Acad Sci U S A 74: 5088–5090.
4. Mullis K, Faloona F (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155: 335–350.
5. Reysenbach AL, Giver LJ, Wickham GS, Pace NR (1992) Differential amplification of rRNA genes by polymerase chain reaction. Appl Environ Microbiol 58: 3417–3418.
6. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The characterization of enzymatically amplified eukaryotic 16S-like ribosomal RNA-coding regions. Gene 71: 491–500.
7. Weisburg W, Barns S, Pelletier D, Lane D (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697–703.
8. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
9. Handelsman J (2004) Metagenomics: Application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–685.
10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
11. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
12. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312: 1355–1359.
13. Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24: 1263–1269.
14. Olsen GJ, Larsen N, Woese CR (1991) The ribosomal RNA database project. Nucleic Acids Res 19: 2017–2021.
15. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, et al. (2007) The ribosomal database project (RDP-II): Introducing myRDP space and quality controlled public data. Nucleic Acids Res 35: D169–D172.
16. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
17. Hugenholtz P, Pitulle C, Hershberger KL, Pace NR (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.
18. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.
19. Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, et al. (2006) Lineages of acidophilic archaea revealed by community genomic analysis. Science 314: 1933–1935.
20. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368
21. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
22. Leadbetter JR (2003) Cultivation of recalcitrant microbes: Cells are alive, well and revealing their secrets in the 21st century laboratory. Curr Opin Microbiol 6: 274–281.
23. Perna NT, Plunkett G 3rd, Burland V, Mau B, Glasner JD, et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533.
24. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
25. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
26. Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, et al. (2006) Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters. PLoS Biol 4: e188. doi:10.1371/journal.pbio.0040188
27. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
28. Johnson PL, Slatkin M (2006) Inference of population genetic parameters in metagenomics: A clean look at messy data. Genome Res 16: 1320–1327.
29. Koonin EV, Mushegian AR (1996) Complete genome sequences of cellular life forms: Glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 6: 757–762.
30. Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93: 10268–10273.
31. Eisen JA (2001) Gastrogenomics. Nature 409: 463, 465–466.
32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
33. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, et al. (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7: 57.
34. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.
35. DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3: 459–469.
36. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
37. Worden AZ, Cuvelier ML, Bartlett DH (2006) In-depth analyses of marine microbial community genomics. Trends Microbiol 14: 331–336.
Community Page

Microbes are responsible for most of the chemical transformations that are crucial to sustaining life on Earth. Their ability to inhabit almost any environmental niche suggests that they possess an incredible diversity of physiological capabilities. However, we have little to no information on a majority of the millions of microbial species that are predicted to exist, mainly because of our inability to culture them in the laboratory.

A growing discipline called metagenomics allows us to study these uncultured organisms by deciphering their genetic information from DNA that is extracted directly from their environment, thus effectively bypassing the laboratory culture step. Metagenomics allows us to address the questions “who’s there?”, “what are they doing?”, and “how are they doing it?”, offering insights into the evolutionary history as well as previously unrecognized physiological abilities of uncultured communities.

Studies such as the J. Craig Venter Institute’s Global Ocean Sampling (GOS) expedition (in this issue) reveal a remarkable breadth and depth of microbial diversity in the oceans. To date, researchers have made significant but largely preliminary inroads into understanding the biogeography of microbial populations across ecosystems. We know even less about the dynamic physiological processes [...] affect global climate. Although we now have numerous global and real-time methods to measure physical and chemical parameters within the ocean, few methods or concepts have been developed to measure important microbial processes on a global scale. Even if the technology to make such measurements existed, we would presently not know what to measure or how to interpret those measurements. We need a systematic way to explore the structure and function of ocean ecosystems, and their impact on global carbon processing and climate.

Metagenomics has the potential to shed light on the genetic controls of these processes by investigating the key players, their roles, and community compositions that may change as a function of time, climate, nutrients, carbon dioxide, and anthropogenic factors. These studies include a substantial informatics component, requiring researchers to take on complex computational and mathematical challenges. Nonetheless, microbiologists have been quick to seize upon this modern technique, resulting in a deluge of sequence data, and an ever-widening gap between the rates of collecting data and interpreting it.

The Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis [...] metagenomics and enable researchers to unravel the biology of environmental microorganisms (Figure 1). CAMERA’s database includes environmental metagenomic and genomic sequence data, associated environmental parameters (“metadata”), pre-computed search results, and software tools to support powerful cross-analysis of environmental samples.

Citation: Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5(3): e75. doi:10.1371/journal.pbio.0050075

Copyright: © 2007 Seshadri et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: CAMERA, Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; GOS, Global Ocean Sampling

Rekha Seshadri, Saul A. Kravitz, and Marvin Frazier are at the J. Craig Venter Institute (JCVI) in Rockville, Maryland, United States of America. Larry Smarr and Paul Gilna are at the California Institute for Telecommunications and Information Technology (Calit2), a University of California San Diego (UCSD)/University of California Irvine partnership, La Jolla, California, United States of America. Larry Smarr is also the Harry E. Gruber Professor of Computer Science and Engineering at UCSD, La Jolla, California, United States of America. CAMERA is being developed by Calit2 at UCSD in collaboration with the JCVI, UCSD’s Center for Earth Observations and Applications (anchored by the Scripps Institution of Oceanography), the San Diego Supercomputer Center, and the University of California Davis.

* To whom correspondence should be addressed. E-mail: rseshadri@venterinstitute.org

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.
PLoS Biology | www.plosbiology.org | S18 0394 Special Section from March 2007 | Volume 5 | Issue 3 | e75
The initial release will include data and tools associated with the companion set of GOS expedition publications [2–4]; metagenome data from the Hawaii Ocean Time Series Station ALOHA [5] and marine viromes from four different oceanic regions [6]; standard nonredundant sequence databases (e.g., nrnt for nucleotides and nraa for amino acids [7]); and collections of microbial genome sequences, including a set of 155 marine microbial genomes funded by the Gordon and Betty Moore Foundation. The focal point for the CAMERA project is its Web site: http://camera.calit2.net. We invite the research community to submit its metagenomics data to CAMERA, and are establishing mechanisms to streamline this process. Here we describe some of the key challenges and features of the CAMERA project.

Accessibility of Metadata

Existing data repositories provide limited support for metadata and metadata-based queries (including any supplemental information for the sequence data, such as pH and temperature of water at the collection site), and therefore these metadata go underutilized by the research community. CAMERA will integrate sequence data with all available, relevant metadata, including physical information (e.g., temperature and sample method), chemical information (e.g., salinity and pH), temporal information, geospatial information, methodology and instrumentation used for data collection, and satellite images of the collection site. These contextual data allow researchers to derive correlations between deciphered ecology and the environmental conditions that may favor one community structure over another. One can envision a future where metadata from satellites and weather stations, and other physicochemical data, can be used to help interpret and inform scientists on how these factors affect microbial processes as well as community composition. CAMERA is working with other groups (e.g., the Genomic Standards Consortium) to establish standards for the information content and format of metagenomic data and metadata submissions.

Figure 1. Schematic of Intended Core Functions of the CAMERA Project
CBD, Convention on Biological Diversity.
doi:10.1371/journal.pbio.0050075.g001

New-Generation Bioinformatics Tools

Analysis and comparison of complex metagenomic data are driving the development of a new class of bioinformatics and visualization software. CAMERA will integrate these tools with its database, couple them with large-scale compute resources, and make them widely available to the research community. Initially, CAMERA will support analytical tools used for analyses in the GOS publications [2–6]. An example is shown in Figure 2: a subset of metagenome sequence reads from GOS environmental samples is compared to a reference genome sequence (Synechococcus spp.) using BLASTN. The results and underlying metadata are displayed through an interactive graphical viewer, which helps users quickly identify sequence reads that are similar to the reference genome sequence, and potentially identify metabolic similarities between microbes in environmental samples and a reference microbe. A detailed description of this tool and its applications is provided in the GOS companion paper by Rusch et al. [2]. CAMERA will work closely with the community to identify and incorporate additional tools and workflows.

Large-Scale, Robust, and Expandable Cyberinfrastructure

The sheer size of metagenomics datasets requires terascale computation and storage facilities. CAMERA is building a state-of-the-art computational infrastructure to provide high-performance networking access and grid-based computing (applying the resources of many computers in a network to a single problem at the same time), and to support new ways of visualizing and interacting with the data. The distributed architecture of the CAMERA computational engine will be based on the National Science Foundation–funded OptIPuter project [8,9], which allows for use of dedicated 1- or 10-Gbps optical fiber links between remote user laboratory clusters and the CAMERA compute complex. The data server complex itself will contain a large amount of rotating storage (ultimately several tens of terabytes replicated) and a large computational cluster (upwards of a thousand processors). It will be augmented on demand by a scalable back end provided by the recently upgraded National Science Foundation TeraGrid.

Recognition of the Sources of Samples

The Convention on Biological Diversity grants countries certain rights over their genetic resources, including, for example, metagenomic sequence data of marine microbes taken from a country’s territorial waters. Many countries require, at minimum, that databases explicitly identify the country of origin of the DNA. Rules vary by country, and it is not a simple task to find out what might be required. International harmonization of these rules is currently being debated by the over 150 countries that are party to the Convention on Biological Diversity. Agreements about the use of genetic resources are negotiated on a case-by-case basis with each researcher who wishes to sample within a country’s “exclusive economic zone,” typically 200 nautical miles from its shoreline. Some of these “memoranda of understanding” impose additional requirements on the researchers. For example, the J. Craig Venter Institute’s agreement with Australia requires us to “use reasonable effort to notify Australia as soon as possible of any inquiries for commercial purposes.”

Current databases do not allow the original investigators to inform others about the details of an agreement, thus creating a significant roadblock to both the collection and public release
doi:10.1371/journal.pbio.0050075.g002
of metagenomics data. To address this issue, CAMERA data will only be made available to users who register by supplying a suitable E-mail address and who acknowledge the potential restriction on commercial use by countries from which the data were collected. To further comply with the Convention on Biological Diversity, all data objects served by CAMERA will possess a mapping to the country of origin of the underlying DNA sample.
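
As a rough illustration of the two ideas above (metadata-based queries and a mandatory country-of-origin mapping behind a registration step), the sketch below models a sample record and a filtered query. It is not CAMERA's actual schema or API: the class names, field names, and the `query_samples` function are hypothetical, and the sample values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical record: every sample carries a country of origin."""
    sample_id: str
    country_of_origin: str            # required, never optional
    temperature_c: Optional[float] = None
    ph: Optional[float] = None
    salinity_psu: Optional[float] = None


@dataclass
class User:
    """Hypothetical registered user."""
    email: str
    acknowledged_restrictions: bool = False


def query_samples(records, user, **ranges):
    """Return records whose metadata fall inside every (low, high) range.

    Access is refused unless the user supplied an e-mail address and
    acknowledged possible restrictions on commercial use. Records that
    lack a requested metadata value are excluded from the result.
    """
    if "@" not in user.email or not user.acknowledged_restrictions:
        raise PermissionError("registration and acknowledgment required")
    hits = []
    for rec in records:
        for name, (low, high) in ranges.items():
            value = getattr(rec, name)
            if value is None or not (low <= value <= high):
                break
        else:
            hits.append(rec)
    return hits


# Invented records, for illustration only.
records = [
    SampleRecord("S1", "Country A", temperature_c=11.2, ph=8.1),
    SampleRecord("S2", "Country B", temperature_c=26.5, ph=8.0),
    SampleRecord("S3", "Country B", temperature_c=27.8),
]
user = User("researcher@example.org", acknowledged_restrictions=True)
warm = query_samples(records, user, temperature_c=(25.0, 30.0))
print([(r.sample_id, r.country_of_origin) for r in warm])
```

Making `country_of_origin` a required field means a record cannot be constructed without it, which is one simple way to guarantee that every data object keeps its mapping to the source country.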
Outreach and Training

Since the ultimate success of CAMERA will depend on the broader research community’s ability to make use of the novel cyberinfrastructure, a series of on-site and Web-based training programs will be provided to keep users apprised of CAMERA’s functionalities and to support integration of CAMERA’s service-oriented architecture into their computational fabrics. Finally, we envision interacting with the community on several fronts, including standardization of ontology, metadata, nomenclature, and tools, and incorporation or federation of existing tools and resources with CAMERA.

We believe that the data and community cyberinfrastructure provided by CAMERA will help researchers to advance understanding of the codependence or feedback between microbial communities and biogeochemical processes in oceans over time, and of how perturbations in the environment cause compositional changes (including extinction). Eventually, the expanded global environmental metagenomics datasets will enable better monitoring of environmental change and the processes that control climate. Systematic and routine monitoring of genomic signatures of global microbial populations and processes overlaid with meteorological information and other metadata may help researchers explain past shifts in global climate as well as predict future changes. This knowledge may someday guide decisions about acceptable atmospheric levels of greenhouse gases, or guide strategies to increase sequestration of atmospheric carbon dioxide by changing ocean microbial compositions, in order to reverse the effects of global warming.

Acknowledgments

The authors benefited from many discussions with members of the CAMERA team. We wish to thank Robert Friedman, Michael Press, Jasmine Pollard, and Matthew LaPointe at the J. Craig Venter Institute for their assistance in preparing the manuscript.

Funding. The authors acknowledge funding from the Gordon and Betty Moore Foundation to the California Institute for Telecommunications and Information Technology at the University of California, San Diego, and from National Science Foundation OptIPuter grant SCI-0225642.

References

1. Smarr L (2006 March 21) The ocean of life: Creating a community cyberinfrastructure for advanced marine microbial ecology research and analysis (a.k.a. CAMERA). Friday Harbor (Washington): Strategic News Service.
2. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
3. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
4. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
5. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368
7. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33: D501–D504.
8. Smarr L, Chien AA, DeFanti T, Leigh J, Papadopoulos PM (2003) The OptIPuter. Commun ACM 46: 58–67.
9. Taesombut N, Uyeda F, Chien AA, Smarr L, DeFanti T, et al. (2006) The OptIPuter: High-performance, QoS-guaranteed network service for emerging e-science applications. IEEE Commun 4: 38–45.
PLoS Biology | www.plosbiology.org | S21 0397 Special Section from March 2007 | Volume 5 | Issue 3 | e75
PLoS BIOLOGY
The world’s oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both
genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which
surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition.
These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and
ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp).
Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with
85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff.
Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and
assembly methods. One comparative genomic method, termed “fragment recruitment,” addressed questions of
genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes
and gene families. A second method, termed “extreme assembly,” made possible the assembly and reconstruction of
large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found
extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions
throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual
sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3)
hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into
genetically isolated populations that have overlapping but independent distributions, implying distinct environmental
preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show
how they may be grouped into several community types. Specific functional adaptations can be identified both within
individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or
absence of the phosphate-binding gene PstS.
Citation: Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern
tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077
Academic Editor: Nancy A. Moran, University of Arizona, United States of America
Received July 14, 2006; Accepted January 16, 2007; Published March 13, 2007
Copyright: © 2007 Rusch et al. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author
and source are credited.
Abbreviations: CAMERA, Cyberinfrastructure for Advanced Marine Microbial
Ecology Research and Analysis; GOS, Global Ocean Sampling; NCBI, National
Center for Biotechnology Information
* To whom correspondence should be addressed. E-mail: DRusch@venterinstitute.org
This article is part of the Global Ocean Sampling collection in PLoS Biology. The full
collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
PLoS Biology | www.plosbiology.org | S22 0398 Special Section from March 2007 | Volume 5 | Issue 3 | e77
Sorcerer II GOS Expedition
the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis. DNA was isolated from the collected organisms, and genome shotgun sequencing methods were used to identify more than 1.2 million new genes, providing evidence for substantial microbial taxonomic diversity [19]. Several hundred new and diverse examples of the proteorhodopsin family of light-harvesting genes were identified, documenting their extensive abundance and pointing to a possible important role in energy metabolism under low-nutrient conditions. However, substantial sequence diversity resulted in only limited genome assembly. These results generated many additional questions: would the same organisms exist everywhere in the ocean, leading to improved assembly as sequence coverage increased; what was the global extent of gene and gene family diversity, and can we begin to exhaust it with a large but achievable amount of sequencing; how do regions of the ocean differ from one another; and how are different environmental pressures reflected in organisms and communities? In this paper we attempt to address these issues.

Results
Sampling and the Metagenomic Dataset
Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (Figure 1), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast (Table 1).
Several size fractions were isolated for every site (see Materials and Methods). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-µm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequencing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (<2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ~6.4 M contiguous sequences, totaling ~5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19] and more than a 90-fold increase over the other large marine metagenomic dataset [20].

Assembly
Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied “taxo-
Table 1. Sampling Locations and Environmental Data
ID | Sample Location | Country | Date (mm/dd/yy) | Time | Location | Sample Depth (m) | Water Depth (m) | T (°C)a | Sb (ppt) | Size Fraction (µm) | Habitat Type | Chl a, Sample Month (Annual ± SE), mg/m3 | Good Sequences
GS00a | Sargasso Stations 13 and 11 | Bermuda (UK) | 02/26/03 | 3:00 / 10:10 | 31°32′6″N; 63°35′42″W / 31°10′50″N; 64°19′27″W | 5.0 | >4,200 | 20.0 / 20.5 | 36.6 / 36.7 | 0.1–0.8 | Open ocean | 0.17 (0.0.9 ± 0.02) | 644,551
GS00b | Sargasso Stations 13 and 11 | Bermuda (UK) | 02/26/03 | 3:35 / 10:43 | 31°32′10″N; 63°35′70″W / 31°10′50″N; 64°19′27″W | 5.0 | >4,200 | 20.0 / 20.5 | 36.6 / 36.7 | 0.22–0.8 | Open ocean | 0.17 (0.0.9 ± 0.02) | 317,180
GS00c | Sargasso Stations 3 | Bermuda (UK) | 02/25/03 | 13:00 | 32°09′30″N; 64°00′36″W | 5.0 | >4,200 | 19.8 | 36.7 | 0.22–0.8 | Open ocean | 0.17 (0.0.9 ± 0.02) | 368,835
GS00d | Sargasso Stations 13 | Bermuda (UK) | 02/25/03 | 17:00 | 31°32′6″N; 63°35′42″W | 5.0 | >4,200 | 20.0 | 36.6 | 0.22–0.8 | Open ocean | 0.17 (0.0.9 ± 0.02) | 332,240
GS01a | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,200 | 22.9 | 36.7 | 3.0–20.0 | Open ocean | 0.10 (0.10 ± 0.01) | 142,352
GS01b | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,201 | 22.9 | 36.7 | 0.8–3.0 | Open ocean | 0.10 (0.10 ± 0.01) | 90,905
GS01c | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,202 | 22.9 | 36.7 | 0.1–0.8 | Open ocean | 0.1 (0.1 ± 0.01) | 92,351
GS02 | Gulf of Maine | USA | 08/21/03 | 6:32 | 42°30′11″N; 67°14′24″W | 1.0 | 106 | 18.2 | 29.2 | 0.1–0.8 | Coastal | 1.4 (1.12 ± 0.19) | 121,590
GS03 | Browns Bank, Gulf of Maine | Canada | 08/21/03 | 11:50 | 42°51′10″N; 66°13′2″W | 1.0 | 119 | 11.7 | 29.9 | 0.1–0.8 | Coastal | 1.4 (1.12 ± 0.19) | 61,605
GS04 | Outside Halifax, Nova Scotia | Canada | 08/22/03 | 5:25 | 44°8′14″N; 63°38′40″W | 2.0 | 142 | 17.3 | 28.3 | 0.1–0.8 | Coastal | 0.4 (0.78 ± 0.17) | 52,959
GS15 | Off Key West, FL | USA | 01/08/04 | 6:25 | 24°29′18″N; 83°4′12″W | 2.0 | 47 | 25.3 | 36.0 | 0.1–0.8 | Coastal | 0.2 (0.27 ± 0.09) | 127,362
GS16 | Gulf of Mexico | USA | 01/08/04 | 14:15 | 24°10′29″N; 84°20′40″W | 2.0 | 3,333 | 26.4 | 35.8 | 0.1–0.8 | Coastal sea | 0.16 (0.11 ± 0.01) | 127,122
GS17 | Yucatan Channel | Mexico | 01/09/04 | 13:47 | 20°31′21″N; 85°24′49″W | 2.0 | 4,513 | 27.0 | 35.8 | 0.1–0.8 | Open ocean | 0.13 (0.09 ± 0.01) | 257,581
GS18 | Rosario Bank | Honduras | 01/10/04 | 8:12 | 18°2′12″N; 83°47′5″W | 2.0 | 4,470 | 27.4 | 35.4 | 0.1–0.8 | Open ocean | 0.14 (0.09 ± 0.01) | 142,743
GS19 | Northeast of Colón | Panama | 01/12/04 | 9:03 | 10°42′59″N; 80°15′16″W | 2.0 | 3,336 | 27.7 | 35.4 | 0.1–0.8 | Coastal | 0.23 (0.15 ± 0.02) | 135,325
GS20 | Lake Gatun | Panama | 01/15/04 | 10:24 | 9°9′52″N; 79°50′10″W | 2.0 | 4 | 28.5 | 0.06 | 0.1–0.8 | Fresh water | | 296,355
GS21 | Gulf of Panama | Panama | 01/19/04 | 16:48 | 8°7′45″N; 79°41′28″W | 2.0 | 76 | 27.6 | 30.7 | 0.1–0.8 | Coastal | 0.50 (0.73 ± 0.22) | 131,798
GS22 | 250 miles from Panama City | Panama | 01/20/04 | 16:39 | 6°29′34″N; 82°54′14″W | 2.0 | 2,431 | 29.3 | 32.3 | 0.1–0.8 | Open ocean | 0.33 (0.28 ± 0.02) | 121,662
GS23 | 30 miles from Cocos Island | Costa Rica | 01/21/04 | 15:00 | 5°38′24″N; 86°33′55″W | 2.0 | 1,139 | 28.7 | 32.6 | 0.1–0.8 | Open ocean | 0.07 (0.19 ± 0.02) | 133,051
GS25 | Dirty Rock, Cocos Island | Costa Rica | 01/28/04 | 10:51 | 5°33′10″N; 87°5′16″W | 1.1 | 30 | 28.3 | 31.4 | 0.8–3.0 | Fringing reef | 0.11 (0.19 ± 0.01) | 120,671
GS26 | 134 miles NE of Galapagos | Ecuador | 02/01/04 | 16:16 | 1°15′51″N; 90°17′42″W | 2.0 | 2,376 | 27.8 | 32.6 | 0.1–0.8 | Open ocean | 0.22 (0.28 ± 0.02) | 102,708
GS27 | Devil’s Crown, Floreana | Ecuador | 02/04/04 | 11:41 | 1°12′58″S; 90°25′22″W | 2.0 | 2.3 | 25.5 | 34.9 | 0.1–0.8 | Coastal | 0.40 (0.38 ± 0.03) | 222,080
GS28 | Coastal Floreana | Ecuador | 02/04/04 | 15:47 | 1°13′1″S; 90°19′11″W | 2.0 | 156 | 25.0c | | 0.1–0.8 | Coastal | 0.35 (0.35 ± 0.02) | 189,052
GS29 | North James Bay, Santigo | Ecuador | 02/08/04 | 18:03 | 0°12′0″S; 90°50′7″W | 2.0 | 12 | 26.2 | 34.5 | 0.1–0.8 | Coastal | 0.40 (0.39 ± 0.03) | 131,529
GS30 | Warm seep, Roca Redonda | Ecuador | 02/09/04 | 11:42 | 0°16′20″N; 91°38′0″W | 19.0 | 19 | 26.9 | | 0.1–0.8 | Warm seep | | 359,152
GS31 | Upwelling, Fernandina | Ecuador | 02/10/04 | 14:43 | 0°18′4″S; 91°39′6″W | 12.0 | 19 | 18.6 | | 0.1–0.8 | Coastal upwelling | 0.35 (0.39 ± 0.03) | 436,401
GS32 | Mangrove, Isabella | Ecuador | 02/11/04 | 11:30 | 0°35′38″S; 91°4′10″W | 0.3 | 0.67 | 25.4 | | 0.1–0.8 | Mangrove | | 148,018
GS33 | Punta Cormorant Lagoon, Floreana | Ecuador | 02/19/04 | 13:35 | 1°13′42″S; 90°25′45″W | 0.2 | 0.33 | 37.6 | 46 | 0.1–0.8 | Hypersaline | | 692,255
GS34 | North Seamore | Ecuador | 02/19/04 | 17:06 | 0°22′59″S; 90°16′47″W | 2.0 | 35 | 27.5 | | 0.1–0.8 | Coastal | 0.36 (0.35 ± 0.02) | 134,347
GS35 | Wolf Island | Ecuador | 03/01/04 | 16:44 | 1°23′21″N; 91°49′1″W | 2.0 | 71 | 21.8 | 34.5 | 0.1–0.8 | Coastal | 0.28 (0.31 ± 0.02) | 140,814
GS36 | Cabo Marshall, Isabella | Ecuador | 03/02/04 | 12:52 | 0°1′15″S; 91°11′52″W | 2.0 | 67 | 25.8 | 34.6 | 0.1–0.8 | Coastal | 0.65 (0.45 ± 0.05) | 77,538
GS37 | Equatorial Pacific TAO Buoy | International | 03/17/04 | 16:38 | 1°58′26″S; 95°0′53″W | 2.0 | 3,334 | 28.8 | | 0.1–0.8 | Open ocean | 0.21 (0.24 ± 0.02) | 65,670
GS47 | 201 miles from French Polynesia | International | 03/28/04 | 15:25 | 10°7′53″S; 135°26′58″W | 30.0 | 2,400 | 28.6 | 37.3 | 0.1–0.8 | Open ocean | | 66,023
GS51 | Rangirora Atoll | French Polynesia | 05/22/04 | 7:04 | 15°8′37″S; 147°26′6″W | 1.0 | 10 | 27.3 | 34.2 | 0.1–0.8 | Coral reef atoll | | 128,982
Total | | | | | | | | | | | | | 7,697,926
a Temperature.
b Salinity.
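The Sampling and the Metagenomic Dataset section notes that overlapping mated reads from small-insert clones were combined into single contiguous sequences. The sketch below illustrates one way such a merge could work; it is not the authors' pipeline. The function names, the `min_overlap` default, and the restriction to exact (mismatch-free) overlaps are assumptions made purely for illustration.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_mates(forward: str, reverse: str, min_overlap: int = 20) -> Optional[str]:
    """Merge two mated reads when the end of `forward` overlaps the start of
    the reverse-complemented mate; return None when no overlap is found.

    Hypothetical simplification: only exact overlaps are accepted, whereas
    real reads would need a mismatch-tolerant alignment.
    """
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    # Try the longest candidate overlap first so the merge is maximal.
    for k in range(longest, min_overlap - 1, -1):
        if forward[-k:] == mate[:k]:
            return forward + mate[k:]
    return None
```

A pair that returns `None` would simply stay as two separate mated reads in this sketch.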
Table 4. Microbial Genera that Recruited the Bulk of the GOS Reads
a Reads aligned at or above 80% identity over the entire length of the read.
b Reads aligned at or above 90% identity over the entire length of the read.
doi:10.1371/journal.pbio.0050077.t004
stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow-
green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062.
(B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These
sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads).
(C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower
identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity.
(D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were
recruited from the Sargasso samples.
(E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in
vertical stacks that correspond to the locations of conserved genes. On the left side of this panel is a very distinctive pattern of recruitment that
corresponds to the highly conserved 16S and 23S rRNA gene operon.
(F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are
seen corresponding to the location of conserved genes.
(G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate
identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible.
(H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences
from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at
moderate identities.
(I) Synechococcus CC9902 is only distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences
as the WH8102 and CC9605 strains, they recruit at significantly lower identity.
(J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is not only
possible to assemble contigs with strong similarities to known genomes but to identify contigs from previously uncultured genomes. In each case a
100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other.
(J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS
sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison).
(K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. (K) is strikingly similar to previous
recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences
primarily from the Sargasso Sea samples (red).
(M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages.
doi:10.1371/journal.pbio.0050077.g002
question often recruiting to distantly related sets of genomes. Most microbial genomes, including many of the marine microbes (e.g., the ubiquitous genus Vibrio), demonstrated this nonspecific pattern of recruitment.
The relationship between the similarity of an individual sequencing read to a given genome and the sample from which the read was isolated can provide insight into the structure, evolution, and geographic distribution of microbial populations. These relationships were assessed by constructing a “percent identity plot” [27] in which the alignment of a read to a reference sequence is shown as a bar whose horizontal position indicates location on the reference and whose vertical position indicates the percent identity of the alignment. We colored the plotted reads according to the samples to which they belonged, thus indirectly representing various forms of metadata (geographic, environmental, and laboratory variables). We refer to these plots that incorporate metadata as fragment recruitment plots. Fragment recruitment plots of GOS sequences recruited to the entire genomes of Pelagibacter ubique HTCC1062, Prochlorococcus marinus MIT9312, and Synechococcus WH8102 are presented in Poster S1.

Within-Ribotype Population Structure and Variation
Characteristic patterns of recruitment emerged from each of these abundant marine microbes consisting of horizontal bands made up of large numbers of GOS reads. These bands seem constrained to a relatively narrow range of identities that tile continuously (or at least uniformly, in the case when abundance/coverage is lower) along ~90% of the reference sequence. The uninterrupted tiling indicates that environmental genomes are largely syntenic with the reference genomes. Multiple bands, distinguished by degree of similarity to the reference and by sample makeup, may arise on a single reference (Poster S1D and S1F). Each of these bands appears to represent a distinct, closely related population we refer to as a subtype. In some cases, an abundant subtype is highly similar to the reference genome, as is the case for P. marinus MIT9312 (Poster S1) and Synechococcus RS9917 (unpublished data). P. ubique HTCC1062 and other Synechococcus strains like WH8102 show more complicated banding patterns (Poster S1D and S1F) because of the presence of multiple subtypes that produce complex often overlapping bands in the plots. Though the recruitment patterns can be quite complex they are also remarkably consistent over much of the reference genome. In these more complicated recruitment plots, such as the one for P. ubique HTCC1062, individual bands can show sudden shifts in identity or disappear altogether, producing a gap in recruitment that appears to be specific to that band (see P. ubique recruitment plots on Poster S1B and S1E, and specifically between 130–140 kb). Finally, phylogenetic analysis indicates that separate bands are indeed evolutionarily distinct at randomly selected locations along the genome.
The amount of sequence variation within a given band cannot be reliably determined from the fragment recruitment plots themselves. To examine this variation, we produced multiple sequence alignments and phylogenies of reads that recruited to several randomly chosen intervals along given reference genomes to show that there can be considerable within-subtype variation (Figure 3A–3B). For example, within the primary band found in recruitment plots to P. marinus MIT9312, individual pairs of overlapping reads typically differ on average between 3%–5% at the nucleotide level (depending on exact location in the genome). Very few reads that recruited to MIT9312 have perfect (mismatch-free) overlaps with any other read or to MIT9312, despite ~100-fold coverage. While many of these differences are silent (i.e., do not change amino acid sequences), there is still considerable variation at the protein level (unpublished data). The amount of variation within subtypes is so great that it is likely that no two sequenced cells contained identical genomes.
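As an illustration of the fragment recruitment plot described above, the following sketch turns a list of read alignments into plottable bars: horizontal extent on the reference, vertical position as percent identity, and a color taken from the sample of origin. It is a minimal sketch, not the authors' software. The `Alignment` fields, the `min_identity` cutoff, the fallback color, and the sample-to-color mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """One read aligned to a reference (all field names are hypothetical)."""
    read_id: str
    sample: str        # sample the read came from, e.g. "GS17"
    ref_start: int     # 0-based start on the reference
    ref_end: int       # end on the reference (exclusive)
    matches: int       # number of identical aligned positions
    aligned_len: int   # alignment length, including mismatches and gaps


# (ref_start, ref_end, percent_identity, color)
Bar = Tuple[int, int, float, str]


def percent_identity(aln: Alignment) -> float:
    """Percent of aligned positions that are identical."""
    return 100.0 * aln.matches / aln.aligned_len


def recruitment_bars(
    alignments: Iterable[Alignment],
    sample_colors: Dict[str, str],
    min_identity: float = 50.0,  # assumed cutoff, not taken from the paper
) -> List[Bar]:
    """Convert alignments into bars for a fragment recruitment plot."""
    bars: List[Bar] = []
    for aln in alignments:
        pid = percent_identity(aln)
        if pid < min_identity:
            continue  # read does not recruit at this cutoff
        color = sample_colors.get(aln.sample, "gray")
        bars.append((aln.ref_start, aln.ref_end, pid, color))
    return sorted(bars)
```

Each bar could then be drawn as a horizontal segment from `ref_start` to `ref_end` at height `percent_identity`; reads from one subtype would be expected to cluster into a band at similar heights.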
Identifying Genomic Structural Variation with Metagenomic Data
Variation in genome structure in the form of rearrangements, duplications, insertions, or deletions of stretches of DNA can also be explored via fragment recruitment. The use of mated sequencing reads (pairs of reads from opposite ends of a clone insert) provides a powerful tool for assessing structural differences between the reference and the environmental sequences. The cloning and sequencing process determines the orientation and approximate distance between two mated sequencing reads. Genomic structural variation can be inferred when these are at odds with the way in which the reads are recruited to a reference sequence. Relative location and orientation of mated sequences provide a form of metadata that can be used to color-code a fragment recruitment plot (Figure 4). This makes it possible to visually identify and classify structural differences and similarities between the reference and the environmental sequences (Figure 5). For the abundant marine microbes, a high proportion of mated reads in the “good” category (i.e., in the proper orientation and at the correct distance) show that synteny is conserved for a large portion of the microbial population. The strongest signals of structural differences typically reflect a variant specific to the reference genome and not found in the environmental data. In conjunction with the requirement that reads be recruited over their entire length without interruption, recruitment plots result in pronounced recruitment gaps at locations where there is a break in synteny. Other rearrangements can be partially present or penetrant in the environmental data and thus may not generate obvious recruitment gaps. However, given sufficient coverage, breaks in synteny should be clearly identifiable using the recruitment metadata based on the presence of “missing” mates (i.e., the mated sequencing read that was recruited but whose mate failed to recruit; Figure 4). The ratio of missing mates to “good” mates determines how penetrant the rearrangement is in the environmental population.
In theory, all genome structure variations that are large
enough to prevent recruitment can be detected, and all such
rearrangements will be associated with missing mates.
Depending on the type of rearrangement present, other
recruitment metadata categories will be present near the
rearrangements’ endpoints. This makes it possible to distin-
guish among insertions, deletions, translocations, inversions,
and inverted translocations directly from the recruitment
plots. Examples of the patterns associated with different
rearrangements are presented in Figure 5. This provides a
rapid and easy visual method for exploring structural
variation between natural populations and sequenced repre-
sentatives (Poster S1A and S1B).
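The mate-based reasoning in this section can be sketched as a small classifier. The example below labels each mated pair from the way its two reads were placed on the reference and then computes the ratio of missing mates to good mates. Only the "good" and "missing" categories come from the text; the other labels, the `Placement` fields, and the insert-size bounds are hypothetical choices for illustration, not the categories used in Figure 4.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Placement:
    """Where one read was recruited on the reference (hypothetical fields)."""
    start: int      # leftmost reference coordinate of the read
    end: int        # rightmost reference coordinate (exclusive)
    forward: bool   # True if the read aligns to the forward strand


def classify_pair(
    a: Optional[Placement],
    b: Optional[Placement],
    min_insert: int,
    max_insert: int,
) -> str:
    """Label a mated pair; None means that read failed to recruit."""
    if a is None and b is None:
        return "unrecruited"
    if a is None or b is None:
        return "missing_mate"
    left, right = sorted((a, b), key=lambda p: p.start)
    # Mates from one clone should face each other: left forward, right reverse.
    if not (left.forward and not right.forward):
        return "bad_orientation"
    span = right.end - left.start
    if span < min_insert:
        return "too_close"
    if span > max_insert:
        return "too_far"
    return "good"


def missing_to_good_ratio(labels: Iterable[str]) -> Optional[float]:
    """Ratio of missing mates to good mates; None if there are no good mates."""
    counts = Counter(labels)
    if counts["good"] == 0:
        return None
    return counts["missing_mate"] / counts["good"]
```

Applied to the pairs flanking one candidate breakpoint, a ratio near zero would suggest the rearrangement is rare in the sampled population, while a large ratio would suggest it is widespread.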
represent reference-specific differences that are not found in the environmental populations rather than a cloning bias that identifies genes or gene segments that are toxic or unclonable in E. coli. The presence of missing mates flanking these gaps indicates that the associated clones do exist, and therefore that cloning issues are not a viable explanation for the absence of recruited reads. Although the reference-specific differences are quite apparent due to the recruitment gaps they generate, there are also sporadic rearrangements associated with single clones, mostly resulting from small insertions or deletions.
Careful examination of the unrecruited mates of the reads flanking the gaps allowed us to identify, characterize, and quantify specific differences between the reference genomes and their environmental relatives. The results of this analysis for P. ubique and P. marinus have been summarized in Table 5. With few exceptions, small gaps resulted from the insertion or deletion of only a few genes. Many of the genes associated with these small insertions and deletions have no annotated function. In some cases the insertions display a degree of variability such that different sets of genes are found at these locations within a portion of the population. In contrast, many of the larger gaps are extremely variable to the extent that every clone contains a completely unrelated or highly divergent sequence when compared to the reference or to other clones associated with that gap. These segments are hypervariable and change much more rapidly than would be expected given the variation in the rest of the genome. Sites containing a hypervariable segment nearly always contained some insert. We identified two exceptions, both associated with P. ubique. The first is located at approximately the 166-kb position in the P. ubique HTCC1062 genome. Though no large gap is present, the mated reads indicate that under many circumstances a highly variable insert is often present. The second is a gap on HTCC1062 that appears between 50 and 90 kb. This gap appears to be less variable than other hypervariable segments and is occasionally absent based on the large numbers of flanking long mated reads (Poster S1A). Interestingly, the long mated reads around this gap seem to be disproportionately from the Sargasso Sea samples, suggesting that this segment may be linked to geographic and/or environmental factors. Thus, hypervariable segments are highly variable even within the same sample, can on occasion be unoccupied, and the variation, or lack thereof, can be sample dependent.
Hypervariable segments have been seen previously in a wide range of microbes, including P. marinus [28], but their precise source and functional role, especially in an environmental context, remain a matter of ongoing research. For clues to these issues we examined the genes associated with the missing mates flanking these segments and the nucleotide composition of the gapped sequences in the reference genomes. In some rare cases the genes identified on reads that should have recruited within a hypervariable gap were highly similar to known viral genes. For example, a viral integrase was associated with the P. ubique HTCC1062 hypervariable gap between 516 and 561 kb. However, in the majority of cases the genes associated with these gaps were uncharacterized, either bearing no similarity to known genes or resembling genes of unknown function. If these genes were indeed acquired through horizontal transfer, then we might expect that they would have obvious compositional biases. Oligonucleotide frequencies along the P. ubique HTCC1062 and Synechococcus WH8102 genomes are quite different in the large recruitment gaps in comparison to the well-represented portions of the genome (Poster S1). Surprisingly, this was less true for P. marinus MIT9312, where the gaps have been linked to phage activity [28]. These results suggest that these hypervariable segments of the genome are widespread among marine microbial populations, and that they are the product of horizontal transfer events perhaps mediated by phage or transposable elements. These results are consistent with and expand upon the hypothesis put forward by Coleman et al. [28] suggesting that these segments are phage mediated, and conflict with initial claims that the HTCC1062 genome was devoid of genes acquired by horizontal transfer [29].
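The comparison of oligonucleotide frequencies between recruitment gaps and the rest of a genome can be illustrated with a short sketch. The text does not state the oligonucleotide length or the distance measure that were used, so the choice of tetranucleotides (`k = 4`) and of a summed absolute difference below are assumptions, as are the function names.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Relative frequency of every A/C/G/T k-mer in `seq`.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


def composition_distance(segment: str, genome: str, k: int = 4) -> float:
    """Sum of absolute frequency differences between a segment and a genome.

    Ranges from 0.0 (identical composition) to 2.0 (no k-mers shared).
    """
    seg = kmer_frequencies(segment, k)
    ref = kmer_frequencies(genome, k)
    return sum(abs(seg[m] - ref[m]) for m in seg)
```

Under these assumptions, a segment acquired by horizontal transfer would be expected to score a larger distance against its host genome than a typical window of the same length.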
MIT9312 36,401 38,311 2,132 Variable deletion 12 out of 66 clones support simple deletion. Remaining clones show considerable
sequence variation amongst themselves.
MIT9312 124,448 125,219 771 Variable insertion Associated with ASN tRNA gene; in the environment, half the reads identify pair
of small inserts with no similarity to known genes; the other half point to small
and large inserts of undetermined nature.
MIT9312 233,826 233,910 84 Insertion All clones support small insert (270 bp) with no clear sequence similarity to
known genes or sequences.
MIT9312 243,296 245,115 424 Variable deletion 24 of 42 clones support simple deletion of hypothetical protein. 13 support
slightly larger deletion. Remaining not clearly resolved.
MIT9312 296,818 300,888 4,344 Variable deletion 44 clones support deletion of 3,070 bp segment containing 4 genes (3 hypotheti-
cal; 1 carbamoyltransferase). 17 clones support alternative sequences with little or
no similarity to each other.
gMIT9312 342,404 342,662 326 Variable insert In environment is 93% chance of finding deoxyribodopyrimiden photolyase with
7% chance of finding an ABC type Fe3þ siderophore transport system permease
component.
MIT9312 345,933 365,351 19,418 Hypervariable Very limited similarity among clones indicates that this is a hypervariable seg-
ment. Note that it is closely associated with a site-specific integrase/recombinase.
MIT9312 551,347 552,025 678 Deletion Two small deletions within a hypothetical protein.
MIT9312 617,914 621,556 3,642 Hypervariable About 50% of the clones support small deletions among several hypothetical proteins. The remaining clones support significant variability, suggesting a hypervariable segment. At least two of the missed mates contain integrase-like genes.
MIT9312 646,340 652,375 6,035 Deletion/Hypervariable Majority of clones support simple deletion of eight hypothetical and hypothetical genes. Small number of clones indicate this may be hypervariable as well.
MIT9312 655,241 655,800 559 Deletion Small deletion between hypothetical proteins.
MIT9312 665,000 678,000 12,000 Deletions Complex set of deletions and replacements that vary with geographic location.
MIT9312 665,824 666,380 556 Insertion Small inserted hypothetical protein.
MIT9312 670,747 671,933 1,186 Deletion Deletes hypothetical gene.
MIT9312 736,266 736,289 23 Insertion All clones support insertion of fructose-bisphosphate aldolase and fructose-1,6-bisphosphate aldolase.
MIT9312 762,156 762,717 561 Deletion Small deletion between hypothetical proteins.
MIT9312 779,006 779,309 303 Insertion Small hypothetical protein inserted.
MIT9312 874,349 874,913 564 Insertion Small insertion including gene with similarity to RNA-dependent RNA-polymerase.
MIT9312 943,389 946,997 3,608 Variable Several small changes.
MIT9312 1,043,129 1,131,874 88,745 Hypervariable
MIT9312 1,140,922 1,141,412 490 Insertion Small insertion.
MIT9312 1,144,307 1,144,790 483 Insertion Small insertion of several genes.
MIT9312 1,155,123 1,156,440 1,317 Deletion Deletes high-light–inducible protein.
MIT9312 1,172,609 1,177,292 4,683 Variable Several genes have been replaced or deleted.
MIT9312 1,202,643 1,274,335 71,692 Hypervariable
MIT9312 1,288,481 1,290,367 1,886 Variable deletion Deletes polysaccharide export-related periplasmic protein (28 out of 55). Other deletions are variable and may include replacement with alternate sequences.
MIT9312 1,323,606 1,324,523 917 Deletion Environmental sequences lack a small hypothetical protein.
MIT9312 1,369,637 1,369,996 359 Insertion NAD-dependent DNA ligase absent from MIT9312; has possible paralog.
MIT9312 1,381,273 1,382,049 776 Deletion Small insert in MIT9312 not present in environment.
MIT9312 1,384,664 1,385,110 446 Deletion Deletes delta(12)-fatty acid dehydrogenase and replaces gene with small (~100 bp) sequence. There is some variation in the exact location and replacement sequence.
MIT9312 1,388,430 1,389,718 1,288 Replacement Segment between two high-light–inducible proteins swapped for different sequence with no similarity.
MIT9312 1,392,865 1,392,976 111 Replacement Small replacement deletes hypothetical gene and replaces it with small, unknown sequence. There is some variation in the precise boundaries of the deletion and in the replacement sequences.
MIT9312 1,397,696 1,420,005 22,309 Hypervariable
MIT9312 1,486,145 1,487,971 231 Variable Small insertion of dolichyl-phosphate-mannose-protein mannosyltransferase; alternately deletes glycosyl transferase (8 out of 56).
MIT9312 1,519,810 1,520,860 1,050 Variable deletion Deletes a single hypothetical protein; about half of the deletions contain variable sequences of unknown origin.
MIT9312 1,568,049 1,569,121 1,072 Replacement Typically 928-bp portion of MIT9312 replaced by 175-bp stretch in environment; some small amount of variation in environmental replacement sequence (11 out of 51).
HTCC1062 50,555 93,942 43,387 Variable deletion Low recruitment segment containing many hypothetical, transporter, and secretion genes.
HTCC1062 146,074 146,415 341 Deletion Often deleted segment containing DoxD-like and ferredoxin-dependent glutamate synthase peptide.
HTCC1062 166,600 166,700 100 Variable replacement GOS sequences indicate that variable blocks of genes are frequently inserted here.
HTCC1062 308,720 309,633 913 Deletion Potential sulfotransferase domain deleted.
HTCC1062 339,545 339,951 406 Deletion Deletes a predicted O-linked N-acetylglucosamine transferase.
HTCC1062 385,348 386,224 876 Deletion SAM-dependent methyltransferase deleted.
PLoS Biology | www.plosbiology.org | S33 0409 Special Section from March 2007 | Volume 5 | Issue 3 | e77
Sorcerer II GOS Expedition
Table 5. Continued.
(a) Begin indicates the approximate bp position which marks the beginning of the gap in recruitment.
(b) End indicates the approximate bp position which marks the ending of the gap in recruitment.
(c) The type of change indicates what would have to happen to the reference genome to produce the sequences seen in the environment (e.g., a deletion indicates that the indicated portion of the reference would have to be deleted to generate the variant(s) seen in the environment).
doi:10.1371/journal.pbio.0050077.t005
Though insertions and deletions accounted for many of the obvious regions of structural variation,
we also looked for rearrangements. The high levels of local synteny associated with P. ubique and
P. marinus suggested that large-scale rearrangements were rare in these populations. To investigate
this hypothesis we used the recruitment data to examine how frequently rearrangements besides
insertions and deletions could be identified. We looked for rearrangements consisting of large
(greater than 50 kb) inversions and translocations associated with P. marinus; however, we did not
identify any such rearrangements that consistently distinguished environmental populations from
sequenced cultivars. Rare inversions and translocations were identified in the dominant subtype
associated with MIT9312 (Table 6). Based on the amount of sequence that contributed to the analysis,
we estimate that one inversion or translocation will be observed for every 2.6 Mbp of sequence
examined (less than once per P. marinus genome).
A further observation concerns the uniformity along a genome of the evolutionary history among and
within subtypes. For instance, the similarity between GOS reads and P. marinus MIT9312 is typically
85%–95%, while the similarity between MIT9312 and P. marinus MED4 is generally ~10% lower. However,
there are several instances where the divergence of MIT9312 and MED4 abruptly decreases to no more
than that between the GOS sequences and MIT9312 (Poster S1G). These results are consistent either
with horizontal transfer (recombination) or with inhomogeneous
Table 6. Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype
Columns: Group; Low Genome Begin; Low Genome End; High Genome Begin; High Genome End; Read ID; Low Read Begin; Low Read Breakpoint; Low Read Strand; High Read Breakpoint; High Read End; High Read Strand; Read Length; Inversion; Sample
doi:10.1371/journal.pbio.0050077.t006
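As a rough worked example of the estimate quoted in the text (one inversion or translocation per 2.6 Mbp of sequence examined), the sketch below converts that rate into an expected number of events per genome. The genome length of about 1.71 Mbp for P. marinus MIT9312 is an assumption supplied for illustration and is not stated in this section; the function name is hypothetical.

```python
# Rate quoted in the text: one inversion or translocation per 2.6 Mbp examined.
MBP_PER_EVENT = 2.6


def expected_events(sequence_mbp: float, mbp_per_event: float = MBP_PER_EVENT) -> float:
    """Expected number of inversions/translocations in `sequence_mbp` of sequence."""
    if mbp_per_event <= 0:
        raise ValueError("mbp_per_event must be positive")
    return sequence_mbp / mbp_per_event


# ASSUMPTION: ~1.71 Mbp is used as an approximate MIT9312 genome length;
# this value does not come from the section itself.
ASSUMED_GENOME_MBP = 1.71

per_genome = expected_events(ASSUMED_GENOME_MBP)
# A value below 1 corresponds to "less than once per P. marinus genome".
print(f"{per_genome:.2f} expected events per genome")
```

Read the other way, six events at this rate would imply on the order of 15.6 Mbp of sequence examined, assuming the estimate is simply the number of events in Table 6 divided by the amount of sequence analyzed.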
Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be
Separated by Extreme Assembly
Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily
labeled (A–X) for reference in Figure 8.
doi:10.1371/journal.pbio.0050077.g007
distinct subtypes of SAR11 by repeatedly seeding extreme assembly with fragments mated to a
SAR11-like 16S sequence. Figure 7 compares the first 20 kb from each of 24 independent assemblies.
Eighteen of these segments could be aligned full-length to a portion of the HTCC1062 genome just
upstream of 16S, while six appeared to reflect rearrangements relative to HTCC1062. The rearranged
segments were associated with more divergent 16S sequences (8%–14% diverged from the 16S of
HTCC1062), while those without rearrangements corresponded to less divergent 16S (averaging less
than 3% different from HTCC1062). In each segment, many reads were recruited above 90% identity, but
different samples dominated different assemblies. Phylogenetic trees support the inference of
evolutionarily distinct subtypes with distinctive sample distributions (Figure 8).
Taxonomic Diversity
Environmental surveys provide a cultivation-independent means to examine the diversity and
complexity of an environmental sample and serve as a basis to compare the populations between
different samples. Typically, these surveys use PCR to amplify ubiquitous but slowly evolving genes
such as the 16S rRNA or recA genes. These in turn can be used to distinguish microbial populations.
Since PCR can introduce various biases, we identified 16S genes directly from the primary GOS
assembly. In total, 4,125 distinct full-length or partial 16S sequences were identified. Clustering
of these sequences at 97% identity gave a total of 811 distinct ribotypes. Nearly half (48%) of the
GOS ribotypes and 88% of the GOS 16S sequences were assigned to ribotypes previously deposited in
public databases. That is, more than half the ribotypes in the GOS dataset were found to be novel
at what is typically considered the species level [30]. The overall taxonomic distribution of the
GOS ribotypes sampled by shotgun sequencing is consistent with previously published PCR-based
studies of marine environments (Table 7) [31]. A smaller fraction (16%) of GOS ribotypes and 3.4%
of the GOS 16S sequences diverged by more than 10% from any publicly available 16S sequence, thus
being novel to at least the family level.
A census of microbial ribotypes allows us to identify the abundant microbial lineages and estimate
their contribution
Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7
Correspond to Monophyletic Subtypes
Coloring of branches indicates that the corresponding reads align at >90% identity to the extreme assembly segments shown in Figure 7; colored
labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned.
doi:10.1371/journal.pbio.0050077.g008
(a) Taxonomic classifications based on Hugenholtz ARB database. Labels indicate the most specific taxonomic assignment that could be confidently assigned to each ribotype. “Type a,” “type b,” etc., used to arbitrarily discriminate separate 97% ribotypes that would otherwise be given the same name.
(b) Note that the 16S rRNA gene can be multicopy.
(c) Matching GenBank entries required full-length matches at 98% identity.
(d) Less than 1× coverage outside described range.
doi:10.1371/journal.pbio.0050077.t008
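The ribotype definition used in the text (16S sequences clustered at 97% identity) can be sketched with a simple greedy procedure. This is an illustrative sketch only: the source does not state which clustering algorithm was used, the sequences are assumed to be pre-aligned to equal length, and the function names `identity` and `greedy_ribotypes` are hypothetical.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal aligned length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_ribotypes(seqs: List[str], threshold: float = 0.97) -> List[List[int]]:
    """Assign each sequence to the first cluster whose representative (its first
    member) it matches at >= threshold identity; otherwise start a new cluster.
    The result is greedy and depends on input order."""
    clusters: List[List[int]] = []  # each cluster is a list of indices into seqs
    for i, seq in enumerate(seqs):
        for members in clusters:
            if identity(seq, seqs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy data (hypothetical): 100-bp sequences. "N" never matches A/C/G/T here,
# so `near` is 98% identical to `ref` and `far` is 90% identical.
ref = "ACGT" * 25
near = "NN" + ref[2:]
far = "N" * 10 + ref[10:]

print(greedy_ribotypes([ref, near, far]))
```

With these toy inputs the first two sequences fall into one ribotype and the third forms its own. A production analysis would align the sequences first and would likely use a dedicated clustering tool rather than this quadratic-time loop.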
organisms but rather are reflected in multiple microbial lineages.
It is tempting to view the groups of similar samples as constituting community types. Sample
similarities based on genomic sequences correlated significantly with differences in the
environmental parameters (Table 1), particularly water temperature and salinity (unpublished data).
Samples that are very similar to each other had relatively small differences in temperature and
salinity. However, not all samples that had similar temperature and salinity had high community
similarities. Water depth, primary productivity, fresh water input, proximity to land, and filter
size appeared consistent with the observed groupings. Other factors such as nutrients and light for
phototrophs and fixed carbon/energy for chemotrophs may ultimately prove better predictors, but
these results demonstrate the potential of using metagenomic data to tease out such relationships.
Examining the groupings in Figure 11 in light of habitat and physical characteristics, the
following may be observed. The first two samples, a hypersaline pond in the Galapagos Islands (GS33)
and the freshwater Lake Gatun in the Panama Canal (GS20) are quite distinct from the rest. Salinity
(both higher and lower than the remaining coastal and ocean samples) is the simplest explanation.
Twelve samples form a strong temperate cluster as seen in the similarity matrix of Figure 11 as a
darker square bounded by GS06 and GS12. Embedded within the temperate cluster are three subclusters.
The first subcluster includes five samples from Nova Scotia through the Gulf of Maine. This is
followed by a subcluster of four samples between Rhode Island and North Carolina. The northern
subcluster was sampled in August, the southern subcluster in November and
December. Though all samples were collected in the top few meters, the southern samples were in
shallower waters, 10 to 30 m deep, whereas most of the northern samples were in waters greater than
100 m deep. Monthly average estimates of chlorophyll a concentrations were typically higher in the
southern samples as well (Table 1). All of these factors (temperature, system primary production,
and depth of the sampled water body) likely contribute to the differences in microbial community
composition that result in the two well-defined clusters. The final temperate subgroup includes two
estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11), distinguished by their lower salinity and
higher productivity. However, GS11 is markedly similar not only to GS12 but also to coastal samples,
whereas the latter appears much more unique. Interestingly, the Bay of Fundy estuary sample (GS06)
clearly did not group with the two other estuaries, but rather with the northern subgroup, perhaps
reflecting differences in the rate or degree of mixing at the sampling site.
Continuing to the right and downward in Figure 11, one can see a large cluster of 25 samples from
the tropics and Sargasso Sea, bounded by GS47 and GS00b. This can be further subdivided into several
subclusters. The first subcluster (a square bounded by GS47 and GS14) includes 14 samples, about
half of which were from the Galapagos. The second distinct subcluster (a square bounded by GS16 and
GS26) includes seven samples from Key West, Florida, in the Atlantic Ocean to a sample close to the
Galapagos Islands in the Pacific Ocean. Loosely associated with this subcluster is a sample from a
larger filter size taken en route to the Galapagos (GS25). The remaining samples group weakly with
the tropical cluster. GS32 was taken in a coastal mangrove in the Galapagos. The thick organic
sediment at a depth of less than a meter is the likely cause for it being unlike the other samples.
Sample 00a was from the Sargasso Sea and contained a large fraction of sequence reads from
apparently clonal Burkholderia and Shewanella species that are atypical. When this
Table 9. Continued.
TIGR00996 302 GS33 2.2 Cellular Pathogenesis Virulence factor Mce family protein
TIGR01254 419 GS33 2.2 Transport Other ABC transporter periplasmic binding protein, thiB subfamily
TIGR01278 508 GS33 2.2 Biosynthesis Chlorophyll Light-independent protochlorophyllide reductase, B subunit
TIGR01512 1,820 GS33 2.2 Transport Cations Cadmium-translocating P-type ATPase
TIGR01525 2,037 GS33 2.2 Heavy metal translocating P-type ATPase
TIGR01543 418 GS33 2.2 Protein Other Phage prohead protease, HK97 family
TIGR01857 1,495 GS33 2.2 Purines Purine Phosphoribosylformylglycinamidine synthase
TIGR02015 183 GS33 2.2 Energy Photosynthesis Chlorophyllide reductase subunit Y
TIGR02072 1,052 GS33 2.2 Biosynthesis Biotin Biotin biosynthesis protein BioC
TIGR02099 273 GS33 2.2 Hypothetical Conserved Conserved hypothetical protein TIGR02099
TIGR00203 168 GS33 2.1 Energy Electron Cytochrome d ubiquinol oxidase, subunit II
TIGR00218 141 GS33 2.1 Energy Sugars Mannose-6-phosphate isomerase, class I
TIGR00915 4,834 GS33 2.1 Transport Other Transporter, hydrophobe/amphiphile efflux-1 (HAE1) family
TIGR01315 319 GS33 2.1 FGGY-family pentulose kinase
TIGR01330 253 GS33 2.1 3'(2'),5'-bisphosphate nucleotidase
TIGR01508 848 GS33 2.1 Diaminohydroxyphosphoribosylaminopyrimidine reductase
TIGR01511 1,928 GS33 2.1 Transport Cations Copper-translocating P-type ATPase
TIGR01764 387 GS33 2.1 Unknown General DNA binding domain, excisionase family
TIGR02014 298 GS33 2.1 Energy Photosynthesis Chlorophyllide reductase subunit Z
TIGR02047 222 GS33 2.1 Cd(II)/Pb(II)-responsive transcriptional regulator
TIGR00586 990 GS33 2 DNA DNA Mutator mutT protein
TIGR00853 114 GS33 2 Signal PTS PTS system, lactose/cellobiose family IIB component
TIGR00937 364 GS33 2 Transport Anions Chromate transporter, chromate ion transporter (CHR) family
TIGR01030 221 GS33 2 Protein Ribosomal Ribosomal protein L34
TIGR01214 2,834 GS33 2 Cell Biosynthesis dTDP-4-dehydrorhamnose reductase
TIGR01698 608 GS33 2 Purine nucleotide phosphorylase
TIGR02190 465 GS33 2 Glutaredoxin-family domain
(a) Reads associated with Shewanella and Burkholderia have been excluded.
(b) TIGRFAM is this many times more abundant than in the next most abundant sample.
doi:10.1371/journal.pbio.0050077.t009
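The footnote on fold abundance, together with the selection criteria given in the text (TIGRFAMs annotating 100 or more genes that are more than 2-fold more frequent in one sample than in any other), suggests a simple filter. The sketch below is a minimal illustration under stated assumptions: abundances are taken to be already normalized per sample, and the data layout, function name, and toy values are hypothetical rather than the authors' pipeline.

```python
from typing import Dict, List, Tuple


def sample_specific_families(
    abundance: Dict[str, Dict[str, float]],  # family -> {sample: normalized abundance}
    gene_counts: Dict[str, int],             # family -> total number of genes annotated
    min_genes: int = 100,
    min_fold: float = 2.0,
) -> List[Tuple[str, str, float]]:
    """Return (family, top_sample, fold) tuples, where fold is the abundance in the
    top sample divided by the abundance in the next most abundant sample."""
    hits = []
    for family, per_sample in abundance.items():
        if gene_counts.get(family, 0) < min_genes or len(per_sample) < 2:
            continue
        ranked = sorted(per_sample.items(), key=lambda kv: kv[1], reverse=True)
        (top_sample, top), (_, second) = ranked[0], ranked[1]
        if second <= 0:
            # Fold is undefined here; a real analysis would need an explicit rule.
            continue
        fold = top / second
        if fold > min_fold:
            hits.append((family, top_sample, fold))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Toy values (hypothetical, not taken from Table 9).
toy_abundance = {
    "FAM_A": {"GS33": 4.4, "GS20": 2.0, "GS12": 1.0},  # 2.2-fold over the next sample
    "FAM_B": {"GS33": 3.0, "GS20": 2.0, "GS12": 1.0},  # 1.5-fold: below the cutoff
    "FAM_C": {"GS33": 9.0, "GS20": 1.0, "GS12": 1.0},  # too few genes annotated
}
toy_counts = {"FAM_A": 302, "FAM_B": 500, "FAM_C": 40}

print(sample_specific_families(toy_abundance, toy_counts))
```

Only `FAM_A` passes both criteria in this toy input. Comparing against the next most abundant sample, rather than against the mean of all other samples, is what makes the filter select families concentrated in a single sample.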
sample is reanalyzed to exclude reads identified as belonging to these two groups, sample GS00a
groups loosely with GS00b, GS00c, and GS00d (unpublished data). Finally, three subsamples from a
single Sargasso sample (GS01a, GS01b, GS01c) group together, despite representing three distinct
size fractions (3.0–20, 0.8–3.0, and 0.1–0.8 μm, respectively; Table 1).
The complete set of sample similarities is more complex than described above, and indeed is more
complex than can be captured by a hierarchical clustering. For instance, the southern temperate
samples are appreciably more similar to the tropical cluster than are the northern temperate
samples. GS22 appears to constitute a mix of tropical types, showing strong similarity not only to
the GS47–GS14 subcluster to which it was assigned, but also to the other tropical samples.
These results may be compared to the more traditional view of community structure afforded by 16S
sequences (Figure 9). Some of the same groupings of samples are visible using both analyses. Several
ribotypes recapitulated the temperate/tropical clustering described above. Others were restricted
to the single instances of nonmarine habitats. Several of the most abundant organisms from the
coastal mangrove, hypersaline lagoon, and freshwater lake were found exclusively in these respective
samples. However, while several ribotypes recapitulated the temperate/tropical distinction revealed
by the genomic sequence, others crosscut it. A few dominant 16S ribotypes, related to SAR11, SAR86,
and SAR116, were found in every marine sample. The brackish waters from two mid-Atlantic estuaries
(GS11 and GS12) contained a mixture of otherwise exclusively marine and freshwater ribotypes;
similarity of these sites to the freshwater sample (GS20) was minimal at the metagenomic level,
while the greater similarity of GS11 to coastal samples visible at the metagenomic level was not
readily visible here. A fuller comparison of metagenome-based measurements of diversity based on a
large dataset of PCR-derived 16S sequences will be presented in another paper (in preparation).
Variation in Gene Abundance
Differences in gene content between samples can identify functions that reflect the lifestyles of
the community in the context of its local environment [20,32]. We examined the relative abundance
of genes belonging to specific functional categories in the distinct GOS samples. Genes were binned
into functional categories using TIGRFAM hidden Markov models [18], which are well annotated and
manually curated [33].
The results can be filtered in various ways to highlight genes associated with specific
environments. One catalog of possible interest is genes that were predominantly found in a single
sample. We identified 95 TIGRFAMs that annotated large sets of genes (100 or more) that were
significantly more frequent (greater than 2-fold) in one sample than in any other sample (Table 9).
Not surprisingly, this approach disproportionately singles out genes from the samples collected on
larger filters (GS01a, GS01b, and GS25) and from the nonmarine environments, particularly the
hypersaline pond (sample GS33). Another contrast might be between the
Table 10. Relative Abundance of TIGRFAM Matches in Temperate and Tropical Waters
TIGR01153 729 GS15–GS19 32.7 Energy Photosynthesis Photosystem II 44 kDa subunit reaction center protein
TIGR02093 673 GS15–GS19 29.6 Energy Biosynthesis Glycogen/starch/alpha-glucan phosphorylases
TIGR01335 813 GS15–GS19 26.6 Energy Photosynthesis Photosystem I core protein PsaA
TIGR01336 806 GS15–GS19 26.5 Energy Photosynthesis Photosystem I core protein PsaB
TIGR00975 648 GS15–GS19 11 Transport Anions Phosphate ABC transporter, phosphate-binding protein
TIGR00297 261 GS15–GS19 8.6 Hypothetical Conserved Conserved hypothetical protein TIGR00297
TIGR00992 302 GS15–GS19 8.5 Transport Amino Chloroplast envelope protein translocase, IAP75 family
TIGR02030 560 GS15–GS19 7.5 And Chlorophyll Magnesium chelatase ATPase subunit I
TIGR02041 359 GS15–GS19 6 Central Sulfur Sulfite reductase (NADPH) hemoprotein, beta-component
TIGR01151 2,095 GS15–GS19 4.7 Energy Photosynthesis Photosystem q(b) protein
TIGR01152 1,865 GS15–GS19 4.7 Energy Photosynthesis Photosystem II D2 protein (photosystem q(a) protein)
TIGR02031 800 GS15–GS19 4.2 Biosynthesis Chlorophyll Magnesium chelatase ATPase subunit D
TIGR01790 629 GS15–GS19 4 Lycopene cyclase family protein
TIGR02100 512 GS15–GS19 4 Energy Biosynthesis Glycogen debranching enzyme GlgX
TIGR00073 284 GS15–GS19 3.4 Protein Protein Hydrogenase accessory protein HypB
TIGR00159 497 GS15–GS19 3 Hypothetical Conserved Conserved hypothetical protein TIGR00159
TIGR01515 594 GS15–GS19 3 Energy Biosynthesis 1,4-alpha-glucan branching enzyme
TIGR00217 601 GS15–GS19 2.7 Energy Biosynthesis 4-alpha-glucanotransferase
TIGR01486 505 GS15–GS19 2.7 Mannosyl-3-phosphoglycerate phosphatase family
TIGR01098 720 GS15–GS19 2.6 Transport Carbohydrates Phosphonate ABC transporter, periplasmic phosphonate-binding protein
TIGR00101 495 GS15–GS19 2.5 Central Nitrogen Urease accessory protein UreG
TIGR01273 567 GS15–GS19 2.4 Central Polyamine Arginine decarboxylase
TIGR01470 179 GS5–GS10 25.7 Biosynthesis Heme Siroheme synthase, N-terminal domain
TIGR00361 374 GS5–GS10 12.1 Cellular DNA DNA internalization-related competence protein ComEC/Rec2
TIGR01537 333 GS5–GS10 6.2 Mobile Prophage Phage portal protein, HK97 family
TIGR00201 291 GS5–GS10 6 Cellular DNA comF family protein
TIGR00879 420 GS5–GS10 5 Transport Carbohydrates Sugar transporter
TIGR02018 373 GS5–GS10 4.1 Regulatory DNA Histidine utilization repressor
TIGR02183 294 GS5–GS10 4.1 Glutaredoxin, GrxA family
TIGR00427 602 GS5–GS10 4 Hypothetical Conserved Conserved hypothetical protein TIGR00427
TIGR01109 219 GS5–GS10 3.6 Energy Other Sodium ion-translocating decarboxylase, beta subunit
TIGR01262 840 GS5–GS10 2.8 Energy Amino Maleylacetoacetate isomerase
(a) The average abundance of the TIGRFAM in the given samples is this many times its average abundance in the other set of samples (in this case, GS15–GS19 were compared with GS5–GS10).
doi:10.1371/journal.pbio.0050077.t010
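The ratio described in the table footnote can be sketched as the mean abundance of a TIGRFAM across one set of samples divided by its mean abundance across the other set. The sketch is illustrative only: abundances are assumed to have been normalized upstream, and the function name, variable names, and toy numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def group_fold(
    per_sample: Dict[str, float],  # sample -> normalized abundance of one TIGRFAM
    group_a: Sequence[str],
    group_b: Sequence[str],
) -> float:
    """Mean abundance in group_a divided by mean abundance in group_b.
    Samples missing from per_sample are treated as having abundance 0."""
    mean_a = mean(per_sample.get(s, 0.0) for s in group_a)
    mean_b = mean(per_sample.get(s, 0.0) for s in group_b)
    if mean_b == 0:
        raise ZeroDivisionError("family absent from comparison group; fold undefined")
    return mean_a / mean_b


# The two sample sets compared in Table 10.
set_a = ["GS15", "GS16", "GS17", "GS18", "GS19"]
set_b = ["GS5", "GS6", "GS7", "GS8", "GS9", "GS10"]

# Toy abundances (hypothetical): uniform within each set for easy checking.
toy = {**{s: 6.0 for s in set_a}, **{s: 2.0 for s in set_b}}

fold = group_fold(toy, set_a, set_b)
print(f"{fold:.1f}-fold more abundant in set_a")
```

A family would be reported for whichever set gives a ratio above the 2-fold cutoff mentioned in the text; swapping the two arguments gives the enrichment in the opposite direction.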
temperate and tropical clusters (Figures 10 and 11). We identified 32 proteins that were more than
2-fold more frequent in one or the other group (Table 10). The presence of various
Prochlorococcus-associated genes in this list highlights some of the potential challenges with this
sort of approach. Overrepresentation may reflect: a direct response to particular environmental
pressures (as the excess of salt transporters plausibly does in the hypersaline pond); a
lineage-restricted difference in functional repertoire (as exemplified by the excess of
photosynthesis genes in samples containing Prochlorococcus); or a more incidental “hitchhiking” of a
protein found in a single organism that happens to be present.
We explored whether clearer and more informative differences could be discovered between
communities by focusing on groups of samples that are highly similar in overall taxonomic/genetic
content. Two pairs of samples provide a particularly nice illustration of this approach. Samples
GS17 and GS18 from the western Caribbean Sea and samples GS23 and GS26 from the eastern Pacific
Ocean were all very similar based on the presence of abundant ribotypes and overall similarity in
genetic content (Figures 9–11). Despite these similarities, several genes are found to be up to
seven times more common in the pair of Caribbean samples than the Pacific pair (Table 11). No genes
are more than 2-fold higher in the Pacific than the Caribbean pair of samples. Several of the most
differentially abundant genes are related to phosphate transport and utilization. It is very
plausible that this is a reflection of a functional adaptation: these differences correlate well
with measured differences in phosphate abundance between the Atlantic and eastern Pacific samples
[34,35], and phosphate abundance plays a critical role in microbial growth [36,37]. Indeed, the
ability to acquire phosphate, especially under conditions where it is limited, is thought to
determine the relative fitness of Prochlorococcus strains [38].
The single greatest difference between GS17 and GS18 on the one hand and GS23 and GS26 on the other
was attributed to a set of genes annotated by the hidden Markov model TIGR02136 as a
phosphate-binding protein (PstS). This TIGRFAM identified a single gene in both P. marinus MIT9312
and P. ubique HTCC1062. In P. marinus MIT9312, this gene is located at 672 kb lying roughly in the
middle of a 15-kb segment of the genome that recruits almost no GOS sequences from the Pacific
sampling sites (Poster S1H). In P. ubique HTCC1062, the PstS gene is found at 1,133 kb in a 5-kb
segment that also recruited far fewer GOS sequences from all the Pacific samples except for GS51
(Poster S1E). These genomic segments differ structurally among isolates but they are no more
variable than the flanking regions, and thus are
Table 11. Relative Abundance of TIGRFAM Matches in Atlantic and Pacific Open Ocean Waters
(a) Relative Abundance: the average abundance of the TIGRFAM in the given samples is at least this many times its average abundance in the other set of samples (in this case, GS17–GS18 were compared with GS23 and GS26).
doi:10.1371/journal.pbio.0050077.t011
not hypervariable in the sense used previously (unpublished data). Nor are they particularly
conserved when present, indicating that they are not the result of a recent lateral transfer.
Phylogenetic analyses outside these segments did not produce any evidence of a Pacific versus
Caribbean clade of either Prochlorococcus or SAR11 (Figure 3A–3B). The presence or absence of
phosphate transporters is not limited to these two types of organisms. The number of phosphate
transporters that were found in the Caribbean far exceeds the number that can be attributed to
HTCC1062- and MIT9312-like organisms. However, these results indicate that within individual strains
or subtypes the ability to acquire phosphate (in one or more of its forms) can vary without
detectable differences in the surrounding genomic sequences.
Biogeographic Distribution of Proteorhodopsin Variants
Variation in gene content is only one aspect of the tremendous diversity in the GOS data. The
functional significance of all the polymorphic differences between homologous proteins remains
largely unknown. To look for functional differences, we analyzed members of the proteorhodopsin
gene family. Proteorhodopsins are fast, light-driven proton pumps for which considerable functional
information is available though their biological role remains unknown. Proteorhodopsins were highly
abundant in the Sargasso Sea samples [19] and continue to be highly abundant and evenly distributed
(relative to recA abundance) in all the GOS samples. A total of 2,674 putative proteorhodopsin genes
were identified in the GOS dataset. Although many of the sequences are fragmentary, 1,874 of these
genes contain the residue that is primarily responsible for tuning the light-absorbing properties of
the protein [39–41], and these properties have been shown to be selected for under different
environmental conditions [42]. Variation at this residue is strongly correlated with sample of
origin (Figure 12). The leucine (L) or green-tuned variant was highly abundant in the North Atlantic
samples and in the nonmarine environments like the fresh water sample from Lake Gatun (GS20). The
glutamine (Q) or blue-tuned variant dominated in the remaining mostly open ocean samples.
Given our limited understanding of the biological role for proteorhodopsin, the reason for this
differential distribution is not immediately clear. In coastal waters where nutrients are more
abundant, phytoplankton is dominant. Phytoplankton absorbs primarily in the blue and red spectra;
consequently, the water appears green [43]. Conversely, in the open ocean nutrients are rare and
phytoplanktonic biomass is low, so waters appear blue because in the absence of impurities the red
wavelengths are absorbed preferentially [44]. It may be that proteorhodopsin-carrying microbes have
simply adapted to take advantage of the most abundant wavelengths of light in these systems.
Proteorhodopsins encoded on reads that were recruited to P. ubique HTCC1062 account for a fraction
(~25%) of all the proteorhodopsin-associated reads, suggesting that the remainder must be associated
with a variety of marine microbial taxa (see also [45–47]). Phylogenetic analysis of the
SAR11-associated proteins revealed that each variant has arisen independently at least two times in
the SAR11 lineage (Figure 3C). Consistent with other findings that proteorhodopsins are widely
distributed throughout the microbial world [48], we conclude that multiple microbial lineages are
responsible for proteorhodopsin spectral variation and that the abundance of a given variant
reflects selective pressures rather than taxonomic effects. Similar mechanisms seem to be involved
in the evolution and diversification of opsins that mediate color vision in vertebrates [49].
Discussion
Our results highlight the astounding diversity contained within microbial communities, as revealed
through whole-genome shotgun sequencing carried out on a global scale. Much of this microbial
diversity is organized around phylogenetically related, geographically dispersed populations we
refer to as subtypes. In addition, there is tremendous variation within subtypes, both in the form
of sequence variation and in hypervariable genomic islands. Our ability to make these observations
derived not only from the large volumes of data but also from the development of new tools and
techniques to filter and organize the information in manageable ways.
Variation and Diversity
Our data demonstrate to an unprecedented degree the nature and evolution of genetic variation below
the species level. Variation can be analyzed in several ways, including observed differences in
sequence, genomic structure, and
gene complement. The observed patterns of variation shed In principle, this variation could reflect some combination of
light on the mechanisms by which marine prokaryotes evolve. physical barriers (true biogeography), short-term stochastic
Gene synteny seems to be more highly conserved than the effects, and/or functional differentiation. Given the confound-
nucleotide and protein sequences. This variation is seen over ing variables of geography, time, and environmental conditions
essentially the entire genome in every abundant group of in the current collection of samples, it is difficult to definitively
organisms sufficiently related for us to recognize a popula- separate these effects, but various observations argue for
tion by fragment recruitment. (These include, but are not functional differentiation between subtypes (i.e., they con-
limited to, the organisms shown in Figure 2 and Poster S1.) stitute distinct ecotypes). First, individual subtypes may be
Notably, we found no evidence of widespread low-diversity found in a wide range of locations; P. ubique HTCC1062 was
organisms such as B. anthracis [50]. isolated in the Pacific Ocean off the coast of Oregon [55], but
Phylogenetic trees and fragment recruitment plots (Figures closely related sequences are relatively abundant in our samples
7 and 8) indicate that the variation within a species is not an taken in the Atlantic Ocean. Second, geography per se cannot
unstructured swarm or cloud of variants all equally diverged fully explain differences in subtype distributions, as multiple
from one another. Instead, there are clearly distinct subtypes, subtypes are found simultaneously in a single sample. Third, the
in terms of sequence similarity, gene content, and sample collection of samples in which a given subtype was found
distribution. Similar findings have been shown for specific generally exhibits similar environmental conditions. A strong
organisms, based on evaluation of one or a few loci [2,51–53]. independent illustration of this comes from the correlation of
These results rule out certain trivial models of population temperature with the distribution of Prochlorococcus subtypes
history and evolution for what is commonly considered a [56]. Fourth, the extensive variation within each subtype (i.e.,
bacterial ‘‘species.’’ For instance, they argue against a recent the fact that subtypes are not clonal populations) indicates that
explosive population growth from a single successful indi- it cannot be chance alone that makes genetically similar
vidual (selective sweep) [54]. Equally, they argue against a organisms have similar observed distributions.
perfectly mixed population, suggesting instead some barriers Taken together, these results argue that subtype classifica-
to competition and exchange of genetic material. tion is more informative for categorizing microbial popula-
tions than classification using 16S-based ribotypes, or finger- inevitably have at least minor functional differences such as
printing techniques based on length polymorphism, such as T- in the optimal temperature or pH for the activity of some
RFLPs [57] or ARISA [58]. For example, the grouping of such enzyme. At the level of gene content, the observation of
disparate microbial populations under the umbrella P. marinus hypervariable segments ([28] and here) implies that there is
dilutes the significance of the term ‘‘species.’’ Indeed, an additional dimension to functional variability. Hyper-
numerous papers have been devoted to comparing and variable genomic islands with preferential insertion sites
contrasting the differences and variability in P. marinus isolates could potentially be associated with a wide range of
to better understand how this particularly abundant group of functions, though to date they have been most closely
organisms has evolved and adapted within the dynamic marine examined for their role in pathogenicity (for a review, see
environment [28,52,56,59–66]. Prior to the widespread use of [74]). However, given their apparent variability within even a
marker-based phylogenetic approaches, microbial systematics single sampling site, it seems unlikely that these elements
relied on a wide range of variables to distinguish microbial reflect a specific adaptive advantage to the local population.
populations [67]. Subtypes bring us back to these more Identifying the source(s), diversity, and range of functionality
comprehensive approaches since they reflect the influences associated with these islands by fully sequencing a large
of a wide range of factors in the context of an entire genome. number of these segments and understanding how their
Although subtypes are a salient feature of our data, individual abundances fluctuate should be quite informative.
variation within a ribotype does not stop at the level of Some might still argue that these differences must be moot
subtypes. Variation within subtypes is so extensive that few for the purpose of understanding the role these organisms play
GOS reads can be aligned at 100% identity to any other GOS in an ecosystem. Yet even small differences in optimal
read, despite the deep coverage of several taxonomic groups. conditions may have profound effects. They may prevent any
Related findings have been shown for the ITS region in single genotype from being universally fittest, allowing and/or
various organisms [2,51,52], and in a limited number of necessitating the coexistence of multiple variants [2,51,69].
organisms for individual protein coding and intergenic Moreover, variation within subtype might afford a form of
regions [2,53,68]. High levels of diversity within the ribotype functional ‘‘buffering,’’ such that the population as a whole
can be convincingly demonstrated in the 16S gene itself [69]. may be more stable in its ecosystem role than any one clone
The applicability of these results over the entire genome was could be (see also [51]). That is, while any one strain of
recently shown for P. marinus [28] using data from the Prochlorococcus might thrive and provide energy input to the
Sargasso Sea samples taken as a pilot project for the rest of the community at a limited range of temperatures, light
expedition reported here [19]. We have definitively demon- conditions, etc., the ensemble might provide such inputs over a
strated the generality of these findings, greatly increased our wider range of environmental conditions. In this way, micro-
understanding of the minimum number of variants of a given diversity might provide system stability or robustness through
organism, and shown that these observations apply to the functional redundancy and the ‘‘insurance effect’’ (reviewed in
entire genome for a wide range of abundant taxonomic [75]). Thus, while the extent of microdiversity suggests that
groups and across a wide range of geographic locations. knowing the behavior of any one isolate in exquisite detail
Average pairwise differences of several percent between might not be as useful to reductionist modeling as one might
overlapping P. marinus or SAR11 reads imply that this hope, this buffering could afford a more stable ensemble
variation did not arise recently. If one uses substitution rates behavior, facilitating the development and maintenance of an
estimated for E. coli [70], one could conclude that on average ecosystem and allowing for system-level modeling.
any two P. marinus cells must have diverged millions of years A direct equation of subtypes with ecotypes is tempting,
ago. Mutational rates are notoriously variable and hard to but not entirely clear-cut. The correlation of PstS distribution
estimate, and assumptions of molecular clocks are equally with phosphate abundance suggests a functional adaptation,
chancy, but clearly within-subtype variants have persisted but within Prochlorococcus and SAR11 the presence or absence
side by side for quite some time. This raises a question related of PstS subdivides subtypes without apparent respect for
to the classic ‘‘Paradox of the Plankton’’: how can so many phylogenetic structure. This contrasts markedly with the
similar organisms have coexisted for so long [71,72]? One distribution of proteorhodopsin-tuning variants within
explanation, which we favor, is that not only subtypes but also SAR11, which, despite a few convergent substitutions, are
individual variants are sufficiently different phenotypically to strongly congruent with phylogeny. It is interesting to ask
prevent any one strain from completely replacing all others what distinguishes pressures or adaptations that respect (or
(discussed further below; see [71] for a recent theoretical that lead to) lineage splits from those that show little or no
treatment). An alternative is that recombination might phylogenetic structuring. These two specific examples plau-
prevent selective sweeps within ecotypes, as proposed by sibly reflect two different mechanisms (i.e., convergent but
Cohan (reviewed in [73]). independent mutation in proteorhodopsin genes and the
acquisition by horizontal transfer of genes involved in
The Significance of Within-Subtype Variation phosphate uptake). Yet, we must wonder: given the evidence
Given the apparent generality of subtypes and intra- that proteorhodopsin has been transferred laterally [48], and
subtype variation, it is important to understand if and how that only a small number of mutations, in some circumstances
these subpopulations are functionally distinct. At the level of even a single base-pair change, are required to switch
DNA sequence, a substantial fraction of substitutions are between the blue-absorbing and green-absorbing forms
silent in terms of amino acid sequence, and others may be [39,40], why should proteorhodopsin variants show any
nonsynonymous but functionally neutral. However, two lineage restriction? Perhaps this relates to the modularity of
organisms that differ by 5% in their genetic sequence (e.g., the system in question: proteorhodopsin tuning may be part
100,000 substitutions in 2 Mbp of shared sequence) will of a larger collection of synergistic adaptations that are
collectively not easily evolved, acquired, or lost, while the PstS approaches rely on fairly well-known techniques but have
and surrounding genes may represent a functional unit that been modified to take greater advantage of the metadata.
can be readily added and removed over relatively short The technique of fragment recruitment and the corresponding
evolutionary time scales. If so, perhaps subtypes are indeed fragment recruitment plots have proven highly useful for
ecotypes, but rapidly evolving characters can lead to examining the biogeography and genomic variation of
phenotypes that crosscut or subdivide ecotypes. abundant marine microbes when a close reference genome
Phage provide one possible mechanism for rapid evolution exists. Ultimately, this approach derives from the percent
of microbial populations or strains, and have been found in identity plots of PipMaker [27]. Similar approaches have been
abundance with this and other marine metagenomic datasets used to examine variation with respect to metagenomic
[18,20]. It has been proposed that hypervariable islands are datasets. For example, hypervariable segments and sequence
phage mediated [28]. However, there are reasons to be variation have been visualized in P. marinus MIT9312 using
cautious about invoking phage as an explanation for rapidly the Sargasso Sea data [28] and in human gut microbes
evolving characteristics. While we see variability of PstS and Bifidobacterium longum and Methanobrevibacter smithii [77].
neighboring genes in both SAR11 and P. marinus populations, Our primary advance associated with fragment recruit-
this variation does not seem to be linked to recent phage ment plots is the incorporation of metadata associated with
activity. Initially, the distribution of PstS seems similar to the the isolation or production of the sequencing data. While
variation associated with the hypervariable islands, which simple in nature, the resulting plots can be extremely
may be phage mediated [28]. Indeed, phosphate-regulating informative due to the volume of data being presented.
genes including PstS have been identified in phage genomes Being able to present the sequence similarity and metadata
[64], presumably because enhanced phosphate acquisition is visually allows a researcher to quickly identify interesting
required during the replication portion of their life cycle. portions of the data for further examination. This is one of
However, the regions containing the PstS genes in both the first tools to make extensive use of the metadata collected
SAR11 and Prochlorococcus do not behave in the same fashion during a metagenomic sequencing project. The use of sample
as clearly hypervariable regions, being effectively bimorphic and recruitment metadata is just the beginning. It is not
(modulo the level of sequence variation observed elsewhere in difficult to imagine displaying other variations such as water
the genome), whereas clearly hypervariable regions are so temperature, salinity, phosphate abundance, and time of year
diverse that nearly every sampled clone falling in such a with this approach. Even sample independent metadata such
region appears completely unrelated to every other. Nor do as phylogenetic information may produce informative views
the other genes in PstS-containing regions appear to be phage of the data. The usefulness of this and related approaches will
associated. These observations suggest that differences in PstS only grow as the robust collection of metadata becomes
presence or absence arose in the distant past, or that routine and the variables that are most relevant to microbial
different mechanisms are at work. It seems likely that phage communities are further elucidated.
may mediate lateral transfer of PstS and other phosphate The greatest limitation of fragment recruitment is the lack
acquisition genes, but it is unclear whether these genes then of appropriate reference sequences, particularly finished
can become fixed within the population. Phage require genomes. Using a series of modifications to the Celera
enhanced phosphate acquisition as part of their life cycle Assembler referred to as ‘‘extreme assembly,’’ we have
[64], so regulatory or functional differences in these genes produced large assemblies for cultivated and uncultivated
may limit their suitability for being acquired by the host cell marine microbes. On its own, the extreme assembly approach
for its own purposes. The rate of phage-mediated horizontal would be excessively prone to producing chimeric sequences.
transfer of genes may reflect a combination of the gene’s However, when extreme assemblies are used as references for
value to the host and to the agent mediating the transfer (e.g., fragment recruitment, the metadata provides additional
phage), suggesting that PstS may have much greater immedi- criteria to validate the sampling consistency along the length
ate value than do proteorhodopsin genes and their variants. of the scaffold. Chimeric joins can be rapidly detected and
In practical terms, these results highlight the limitations avoided. This argues that future metagenomic assemblers
associated with marker-based analysis and the use of these could be specifically designed to make use of the metadata to
approaches to infer the physiology of a particular microbial produce more accurate assemblies, and that metagenomic
population. At the resolution used here, marker-based assemblies will be improved by using data from multiple
approaches are not always informative regarding differences sources. Finding ways to represent the full diversity in these
in gene content (e.g., the PstS gene as well as neighboring assemblies remains a pressing issue.
genes), especially those associated with hypervariable seg- Extreme assembly can produce much larger assemblies but
ments. Though phosphate acquisition is known to vary within it is still limited by overall coverage. While many ribotypes are
different strains of P. marinus [64,76], our results clearly show presumably present in sufficient quantities that reasonable
that this variability can happen within a single subtype (as assemblies of these genomes might be expected, this did not
represented by MIT9312), effectively identifying distinct occur even for the most abundant organisms, including
ecotypes. Given the correct samples from the appropriate SAR11 and P. marinus. Many of the problems can be
environments, other core genes might also show similar attributed to the diversity associated with the hypervariable
variation and allow us to more fully assess the reliability of segments where the effective coverage drops precipitously. If
reference genomes as indicators of physiological potential. these are indeed commonplace in the microbial world, it is
unlikely that complete genomes will be produced using the
Tools and Techniques small insert libraries presented here. However, the ability to
Analysis of the GOS dataset has benefited from the bin the larger sequences based on their coverage profiles
development of new tools and techniques. Many of these across multiple samples, oligonucleotide frequency profiles,
and phylogenetic markers suggests that large portions of a populations, while providing some information about how
microbial genome can be reconstructed from the environ- these factors are structured along phylogenetic and environ-
mental data. This in turn should provide critical insights into mental factors. At the same time, many questions remain
the physiology and biochemistry of these microbial lineages unanswered. For example, although microbial populations
that will inform culture techniques to allow cultivation of are structured and therefore genetically isolated, we do not
these recalcitrant organisms under laboratory conditions. understand the mechanisms that lead to this isolation. Their
Not every technique described herein relies on metadata. isolation seems contradictory given overwhelming evidence
The marker-less, overlap-based metagenomic comparison that horizontal gene transfer associated with hypervariable
provides a quantitative approach to comparing the overall islands is a common phenomenon in marine microbial
genetic similarity of two samples (Figures 10 and 11). In populations. Whatever the mechanism, the role and rate at
essence, genomic similarity acts as a proxy for community which gene exchange occurs between populations will be
similarity. Marker-based approaches such as ARISA including crucial to understanding population structure within micro-
the use of 16S sequences described herein can also be used to bial communities and whether these communities are chance
infer community similarity, though these approaches more associations or necessary collections. The hypervariable
aptly generate a census of the community members islands could be a source for tremendous genetic innovation
[51,69,78,79]. This census is biased to the extent that 16S and novelty as evidenced by the rate of discovery of novel
genes can vary in copy number and relies on linkage of the protein families in the GOS dataset [18]. However, it is not
marker gene to infer genome composition. While our clear whether these entities are the main source of this
metagenome comparison does not directly provide a census, novelty or whether this novelty resides in the vast numbers of
the sensitivity can be tuned by restricting the identity of rare microbes [4] that cannot be practically accessed using
matches. This means that even subtype-level differences can current metagenomic approaches. Altogether, this research
be detected across samples. It would also identify the reaffirms our growing wealth and complexity of data and
substantial gene content differences between the K12 and paucity of understanding regarding the biological systems of
O157:H7 E. coli strains [12]. Such large-scale gene content the oceans.
differences have yet to be seen between closely related marine
microbes, but may be a factor in other environments.
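The idea of using shared sequence as a proxy for community similarity, with sensitivity tuned by an identity cutoff, can be illustrated with a minimal sketch. It assumes that pairwise read overlaps have already been computed and are available as (read, read, identity) tuples, and that each read is labeled with its sample of origin. The function name `sample_similarity`, the input layout, and the choice to average the two per-sample fractions are illustrative assumptions, not the similarity measure actually used in this study.

```python
def sample_similarity(overlaps, read_sample, sample_a, sample_b, min_identity=0.90):
    """Hypothetical proxy: mean fraction of reads in each sample that overlap
    at least one read from the other sample at or above min_identity."""
    hit_a, hit_b = set(), set()
    for r1, r2, identity in overlaps:
        if identity < min_identity:
            continue  # raising the cutoff restricts matches to closer relatives
        s1, s2 = read_sample[r1], read_sample[r2]
        if s1 == sample_a and s2 == sample_b:
            hit_a.add(r1)
            hit_b.add(r2)
        elif s1 == sample_b and s2 == sample_a:
            hit_a.add(r2)
            hit_b.add(r1)
    n_a = sum(1 for s in read_sample.values() if s == sample_a)
    n_b = sum(1 for s in read_sample.values() if s == sample_b)
    if n_a == 0 or n_b == 0:
        return 0.0
    return 0.5 * (len(hit_a) / n_a + len(hit_b) / n_b)


# Toy data: two reads per sample, one high-identity and one low-identity overlap.
read_sample = {"a1": "GS02", "a2": "GS02", "b1": "GS03", "b2": "GS03"}
overlaps = [("a1", "b1", 0.99), ("a2", "b2", 0.80)]
print(sample_similarity(overlaps, read_sample, "GS02", "GS03", min_identity=0.90))
```

In this toy case, lowering `min_identity` to 0.70 counts the second overlap as well and the score rises, which mirrors how a stricter cutoff would separate samples that differ only at the subtype level.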
Materials and Methods
Although the requisite amount of data will vary with the
complexity of the environment or the degree of resolution Sampling sites. A more detailed description of the sampling sites
provides additional context in which to understand the individual
required, we have found that 10,000 sequencing reads is samples. The northernmost site (GS05) was at Compass Buoy in the
sufficient to reliably measure the similarity of two surface highly eutrophic Bedford Basin, a marine embayment encircled by
water samples (unpublished data). This analysis may become a Halifax, Nova Scotia, that has a 15-y weekly record of biological,
general tool for allocating sequencing resources by allowing a physical, and chemical monitoring (http://www.mar.dfo-mpo.gc.ca/
science/ocean/BedfordBasin/index.htm). Other temperate sites in-
shallow survey of many samples followed by deep sequencing cluded a coastal station sample near Nova Scotia (GS4), a station in
of a select number of ‘‘interesting’’ ones. the Bay of Fundy estuary at outgoing tide (GS06), and three Gulf of
The application of this technique for comparing samples Maine stations (GS02, GS03, and GS07). These were followed by
sampling coastal stations from the New England shelf region of the
along with detailed analysis of fragments recruiting to a given Middle Atlantic Bight (Newport Harbor through Delaware Bay;
reference sequence can also help explicate differences among GS08–GS11). The Delaware Bay (GS11) was one of several estuary
communities in gene content or sequence variation. For samples along the Global Expedition path. Estuaries are complex
example, recent metagenomic studies have reported differ- hydrodynamic environments that exhibit strong gradients in oxygen,
nutrients, organic matter, and salinity and are heavily impacted by
ences in abundance of various gene families or differing anthropogenic nutrients. The Chesapeake Bay (GS12) is the largest
functional roles between samples. Some of these differences estuary in the United States and has microbial assemblages that are
correspond to plausible differences in physiology and diverse mixtures of freshwater and marine-specific organisms [80].
GS13 was collected near Cape Hatteras, North Carolina, inside and
biochemistry, such as the relative overabundance of photo- north of the Gulf Stream, and GS14 was taken along the western
synthetic or light-responsive genes in surface water samples boundary frontal waters of the Gulf Stream off the coast of
[20,32]. Other differences however are less obvious, such as Charleston, South Carolina. The vessel stopped at five additional
the abundance of ribosomal proteins at 130 m or the stations as it transited through the Caribbean Sea (GS15–GS19) to the
Panama Canal. In Panama, we sampled the freshwater Lake Gatun,
abundance of transposase at 4,000 m [20]; some of these may which drains into the Panama Canal (GS20). The first of the eastern
reflect ‘‘taxonomic hitchhiking,’’ such that a sample rich in Pacific coastal stations GS21, GS22, and GS23 were sampled on the
Archaea or Firmicutes or Cyanobacteria, etc., has an over- way to Cocos Island (~500 km southwest of Costa Rica), followed by a
coastal Cocos Island sample (GS25). Near the island, ocean currents
representation of genes more reflective of their recent diverge and nutrient rich upwellings mix with warm surface waters to
evolutionary history than of a response to environmental support a highly productive ecosystem. Cocos Island is distinctive in
conditions. Being able to control or account for these the eastern Pacific because it belongs to one of the first shallow
taxonomic effects is crucial to understanding how microbial undersea ridges in the region encountered by the easterly flowing
North Equatorial Counter/Cross Current in the Far Eastern Pacific
populations have adapted to environmental conditions and [81,82]. After departing Cocos Island, the vessel continued southwest
how they may behave under changing conditions. The to the Galapagos Islands, stopping for an open ocean station (GS26).
metagenomic comparison method described here provides An intensive sampling program was then conducted in the Galapagos.
The Galapagos Archipelago straddles the equator 960 km west of
a new tool to more accurately measure the impact of mainland Ecuador in the eastern Pacific. These islands are in a
taxonomic effects. hydrographically complex region due to their proximity to the
In conclusion, this study reveals the wealth of biological Equatorial Front and other major oceanic currents and regional
information that is contained within large multi-sample front systems [83]. The coastal and marine parts of the Galapagos
Islands ecosystem harbor an array of distinctive habitats, processes,
environmental datasets. We have begun to quantify the and endemic species. Several distinct zones were targeted including a
amount and structure of the variation in natural microbial shallow-water, warm seep (GS30), below the thermocline in an
upwelling zone (GS31), a coastal mangrove (GS32), and a hypersaline frequency of single inserts. Clones were sequenced from both ends
lagoon (GS33). The last stations were collected from open ocean sites to produce pairs of linked sequences representing ~820 bp at the end
(GS37 and GS47) and a coral reef atoll lagoon (GS51) in the immense of each insert.
South Pacific Gyre. The open ocean samples come from a region of Template preparation. Libraries were transformed, and cells were
lower nutrient concentrations where picoplankton are thought to plated onto large format (16 × 16 cm) diffusion plates prepared by
represent the single most abundant and important factor for layering 150 ml of fresh molten, antibiotic-free agar onto a previously
biogeochemical structuring and nutrient cycling [84–87]. In the atoll set 50-ml layer of agar containing antibiotic. Colonies were picked for
systems, ambient nutrients are higher, and bacteria are thought to template preparation using the Qbot or QPix colony-picking robots
constitute a large biomass that is one to three times as large as that of (Genetix, http://www.genetix.com), inoculated into 384-well blocks
the phytoplankton [88–90]. containing liquid media, and incubated overnight with shaking. High-
Sample collection. A YSI (model 6600) multiparameter instrument purity plasmid DNA was prepared using the DNA purification
(http://www.ysi.com) was deployed to determine physical character- robotic workstation custom-built by Thermo CRS (http://www.
istics of the water column, including salinity, temperature, pH, thermo.com) and based on the alkaline lysis miniprep [92]. Bacterial
dissolved oxygen, and depth. Using sterilized equipment [91], 40–200 l cells were lysed, cell debris was removed by centrifugation, and
of seawater, depending on the turbidity of the water, was pumped plasmid DNA was recovered from the cleared lysate by isopropanol
through a 20-µm nytex prefilter into a 250-l carboy. From this sample, precipitation. DNA precipitate was washed with 70% ethanol, dried,
two 20-ml subsamples were collected in acid-washed polyethylene and resuspended in 10 mM Tris HCl buffer containing a trace of blue
bottles and frozen (−20 °C) for nutrient and particle analysis. At each dextran. The typical yield of plasmid DNA from this method is
station the biological material was size fractionated into individual approximately 600–800 ng per clone, providing sufficient DNA for at
‘‘samples’’ by serial filtration through 20-µm, 3-µm, 0.8-µm, and 0.1- least four sequencing reactions per template.
µm filters that were then sealed and stored at −20 °C until transport Automated cycle sequencing. Sequencing protocols were based on
back to the laboratory. Between 44,160 and 418,176 clones per station the di-deoxy sequencing method [93]. Two 384-well cycle-sequencing
were picked and end sequenced from short-insert (1.0–2.2 kb) reaction plates were prepared from each plate of plasmid template
sequencing libraries made from DNA extracted from filters [19]. DNA for opposite-end, paired-sequence reads. Sequencing reactions
Data from these six Sorcerer II expedition legs (37 stations) were were completed using the Big Dye Terminator chemistry and
combined with the results from samples in the Sargasso Sea pilot standard M13 forward and reverse primers. Reaction mixtures,
study (four stations; GS00a–GS00d and GS01a–GS01c) [19]. The thermal cycling profiles, and electrophoresis conditions were
majority of the sequence data presented came from the 0.8- to 0.1-µm optimized to reduce the volume of the Big Dye Terminator mix
size fraction sample that concentrated mostly bacterial and archaeal (Applied Biosystems, http://www.appliedbiosystems.com) and to ex-
microbial populations. Two samples (GS01a, GS01b) from the tend read lengths on the AB3730xl sequencers (Applied Biosystems).
Sargasso Sea pilot study dataset and one GOS sample (GS25) came Sequencing reactions were set up by the Biomek FX (Beckman
from other filter size fractions (Table 1). Coulter, http://www.beckmancoulter.com) pipetting workstations.
Filtration and storage. Microbes were size fractionated by serial Robots were used to aliquot and combine templates with reaction
filtration through 3.0-µm, 0.8-µm, and 0.1-µm membrane filters mixes consisting of deoxy- and fluorescently labeled dideoxynucleo-
(Supor membrane disc filter; Pall Life Sciences, http://www.pall.com), tides, DNA polymerase, sequencing primers, and reaction buffer in a
and finally through a Pellicon tangential flow filtration (Millipore, 5 µl volume. Bar-coding and tracking promoted error-free template
http://www.millipore.com) fitted with a Biomax-50 (polyethersulfone) and reaction mix transfer. After 30–40 consecutive cycles of
cassette filter (50 kDa pore size) to concentrate a viral fraction to 100 amplification, reaction products were precipitated by isopropanol,
ml. Filters were vacuum sealed with 5 ml sucrose lysis buffer (20 mM dried at room temperature, and resuspended in water and trans-
EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl [pH 8.0]) and ferred to one of the AB3730xl DNA analyzers. Set-up times were less
frozen to −20 °C on the vessel until shipment back to the Venter than 1 h, and 12 runs per day were completed with average trimmed
Institute, where they were transferred to a −80 °C freezer until DNA sequence read length of 822 bp.
extraction. Glycerol was added (10% final concentration) as a Fosmid end sequencing. Fosmid libraries [24] were constructed
cryoprotectant for the viral/phage sample. using approximately 1 µg DNA that was sheared using bead beating to
DNA isolation. In the laboratory, the impact filters were generate cuts in the DNA. The staggered ends or nicks were repaired
aseptically cut into quarters for DNA extraction. Unused quarters by filling with dNTPs. A size selection process followed on a pulse
of the filter were refrozen at −80 °C for storage. Quarters used for field electrophoresis system with lambda ladder to select for 39–40 Kb
extraction were aseptically cut into small pieces and placed in fragments. The DNA was then recovered from a gel, ligated to the
individual 50-ml conical tubes. TE buffer (pH 8) containing 50 mM blunt-ended pCC1FOS vector, packaged into lambda packaging
EGTA and 50 mM EDTA was added until filter pieces were barely extracts, incubated with the host cells, and plated to select for the
covered. Lysozyme was added to a final concentration of 2.5 mg/ clones containing an insert. Sequencing was performed as described
ml, and the tubes were incubated at 37 °C for 1 h in a shaking for plasmid ends.
water bath. Proteinase K was added to a final concentration of 200 Metagenomic assembly. Assembly was conducted with the Celera
µg/ml, and the samples were frozen in dry ice/ethanol followed by Assembler [21], with modifications as follows. The ‘‘genome length’’
thawing at 55 °C. This freeze–thaw cycle was repeated once. SDS was artificially set at the length of the dataset divided by 50 to allow
(final concentration of 1%) and an additional 200 µg/ml of unitigs of abundant organisms to be treated as unique, as previously
proteinase K were added to the sample, and samples were incubated described [19]. Several distinct assemblies were computed. In the
at 55 °C for 2 h with gentle agitation followed by three aqueous primary assembly, all pairs of mated reads were tested to see whether
phenol extractions and one phenol/chloroform extraction. The the paired reads overlapped one another; if so, they were merged into
supernatant was then precipitated with two volumes of 100% a single pseudo-read that replaced the two original reads; further,
ethanol, and the DNA pellet was washed with 70% ethanol. Finally, only overlaps of 98% identity or higher were used to construct
the DNA was treated with CTAB to remove enzyme inhibitors. Size unitigs. A second assembly was conducted in the same fashion with
fraction samples not utilized in this study were archived for future the exception of using a 94% identity cutoff to construct unitigs.
analysis. Finally, series of assemblies at various stringencies were computed for
Library construction. DNA was randomly sheared via nebulization, subsets of the GOS data; in these assemblies, overlapping mates were
end-polished with consecutive BAL31 nuclease and T4 DNA not preassembled and the Celera Assembler code was modified
polymerase treatments, and size-selected using gel electrophoresis slightly to allow for overlapping and multiple sequence alignment at
on 1% low-melting-point agarose. After ligation to BstXI adapters, lower stringency.
DNA was purified by three rounds of gel electrophoresis to remove Construction of a low-identity overlap database. An all-against-all
excess adapters, and the fragments were inserted into BstXI- comparison of unassembled (but merged and duplicate-stripped)
linearized medium-copy pBR322 plasmid vectors. The resulting sequences from the combined dataset was performed using a
library was electroporated into E. coli. To ensure construction of modified version of the overlapper component of the Celera
high-quality random plasmid libraries with few to no clones with no Assembler [21]. The code was modified to find overlap alignments
inserts, and no clones with chimeric inserts, we used a series of (global alignments allowing free end gaps) starting from pairs of reads
vectors (pHOS) containing BstXI cloning sites that include several that share an identical substring of at least 14 bp. An alignment
features: (1) the sequencing primer sites immediately flank the BstXI extension was then performed with match/mismatch scores set to
cloning site to avoid excessive resequencing of vector DNA; (2) yield a positive outcome if an overlap alignment was found with
elimination of strong promoters oriented toward the cloning site; 65% identity. Overlaps involving alignments of 40 bp were
and (3) the use of BstXI sites for cloning facilitates the preparation of retained for various analyses. For the GOS dataset described here,
libraries with a low incidence of no-insert clones and a high this process resulted in a dataset of 1.2 billion overlaps. Due to the 14-
bp requirement and certain heuristics for early termination of apparently hopeless extensions, not all alignments at ≥65% were found. In addition, some of the lowest-identity overlaps are bound to be chance matches; however, this was a relatively uncommon event. Approximately one in $5 \times 10^{6}$ pairs of 800-bp random sequences (all sites independent, A = C = G = T = 25%) can be aligned to overlap ≥40 bp at ≥65% identity using the same procedure. At a 70% cutoff, the value is reduced to one in $4 \times 10^{7}$, and one in $5 \times 10^{8}$ at a 75% cutoff.

Extreme assembly. Like many assembly algorithms, the extreme assembler proceeds in three phases: overlap, layout, and consensus. The overlap phase is provided by the all-against-all comparison described above. The consensus phase is performed by a version of the Celera Assembler, modified to accept higher rates of mismatch. The layout phase begins with a single sequencing read (“seed”) that is chosen at random or specified by the user and is considered the “current” read. The following steps are performed off one or both ends of the seed. (1) Starting from the current fragment end, add the fragment with the best overlap off that end and mark the current fragment as “used,” thus making the added fragment the new current fragment. (2) Mark as used any alternative overlap that would have resulted in a shorter extension. The simplest notion of “best overlap” is simply the one having the highest identity alignment, but more complicated criteria have certain advantages. A simple but useful refinement is to favor fragments whose other ends have overlaps over those which are dead ends. For an unsupervised extreme assembly, when the sequence extension terminates because there are no more overlaps, a new unused fragment is chosen as the next seed and the process is repeated until all fragments have been marked used.

Construction of multiple SAR11 variants. Sequencing reads mated to SAR11-like 16S sequences but themselves outside of the ribosomal operon (n = 348) were used as seeds in independent extreme assemblies. Since the assemblies were independent, the results were highly redundant, with a given chain of overlapping fragments typically being used in multiple assemblies. A subset of 24 assemblies that shared no fragments over their first 20 kb was identified as follows. (1) Connected components were determined in a graph defined by nodes corresponding to extreme assemblies. If the assemblies shared at least one fragment in the first 20 kb of each assembly, the two nodes were connected by an edge. (2) A single assembly was chosen at random from each of the connected components. The consensus sequence over the 20-kb segment of each such representative was used as the reference for fragment recruitment.

Phylogeny. Phylogenies of sequences homologous to a given portion of a reference sequence (typically 500 bp) were determined in the following manner. A set of homologous fragments was identified based on fragment recruitment to the reference as described above. Fragments that fully spanned the segment of interest and had almost full-length alignments to the reference sequence of a user-defined percent identity (typically, 70%) were used for further analysis. A preliminary master–slave multiple sequence alignment of the recruited reads (slaves) to the reference segment (master) was performed with a modified version of the consensus module of the Celera Assembler. Based on this alignment, reads were trimmed to the portion aligning to the reference segment of interest. A refined multiple sequence alignment was then computed with MUSCLE [94]. Distance based phylogenies were computed using the programs DNADIST and NEIGHBOR from the PHYLIP package [95] using default settings. Trees were visualized using HYPERTREE [96].

Measurement of library-to-library similarity. Based on the low-identity overlap database described above, the similarity of a library $i$ to another library $j$ at a given percent identity cutoff was computed as follows. For each sequence $s$ of $i$, let $n_{s,i}$ = the number of overlaps to other fragments of $i$ satisfying the cutoff; $n_{s,j}$ = the number of overlaps to fragments of $j$ satisfying the cutoff; and $f_{s,i} = n_{s,i}/(n_{s,i} + n_{s,j})$ = fraction of reads overlapping $s$ from $i$ or $j$ that are from $i$.

$$r_{i,i} = \sum_{s} f_{s,i} \tag{1}$$

$$r_{i,j} = \sum_{s} \left(1 - f_{s,i}\right) \tag{2}$$

$$s_{i,j} = \frac{0.5\,(r_{i,j} + r_{j,i})}{\sqrt{r_{i,i}\,r_{j,j}}} \tag{3}$$

$$S_{i,j} = S_{j,i} = \frac{2\,s_{i,j}}{1 + s_{i,j}} \tag{4}$$

A read that can be overlapped to another at sufficiently high-sequence identity was taken to indicate that they were from similar organisms, and, relatedly, that similar genes were present in the samples. Only reads with such overlaps contributed to the calculation. Other reads reflect genes or segments of genomes that were so lightly sampled (i.e., at such low abundance) that they were not informative regarding the similarity of two samples. Consequently, the analysis automatically corrects for differences in the amount of sequencing, and can be computed over sets of samples that vary considerably in diversity. The resulting measure of similarity $S_{i,j}$ takes on a value between 0 and 1, where 0 implies no overlaps between $i$ and $j$, and 1 implies that a fragment from $i$ and a fragment from $j$ are as likely to overlap one another as are two fragments from $i$ or two fragments from $j$. As with the Bray-Curtis coefficient [97], abundance of categories affects the computation. In an idealized situation where two libraries can each be divided into some number $k$ of “species” at equal abundance, and the libraries have $l$ of the species in common, the similarity statistic will approach $l/k$ for large samplings; in this sense, $S_{i,j} = x$ indicates that the two samples share approximately a fraction $x$ of their genetic material. It is frequently useful to define $D_{i,j} = 1 - S_{i,j}$, the “dissimilarity” or distance between two samples.

Ribotype clustering and identification of representatives. An all-against-all comparison of predicted 16S sequences was performed to determine the alignment between pairs of overlapping sequences using a version of an extremely fast bit-vector algorithm [98]. A hierarchical clustering was determined using percent-mismatch in the resulting alignments as the distance between pairs of sequences. Order of clustering and cluster identity scores were based on the average-linkage criterion, with distances between nonoverlapping partial sequences treated as missing data. Ribotypes were the maximal clusters with an identity score above the cutoff (typically 97%). Representative sequences were chosen for each cluster based on both length and highest average identity to other sequences in the cluster.

Taxonomic classification. Taxonomic classification of 16S sequences was conducted using phylogenetic techniques based on clade membership of similar sequences with 16S sequences with defined taxonomic membership. Representative sequences from clustered sequences were analyzed as described previously [19,99] and by addition into an ARB database of small subunit rDNAs [100,101]. Results were spot-checked against the Ribosomal Database Project II Classifier server [102] and the taxonomic labels of the best BLASTN hits against the nonredundant database at NCBI.

Fragment recruitment. Global ocean sequences were aligned to genomic sequences of different bacteria and phage using NCBI BLASTN [26]. The following blast parameters were designed to identify alignments as low as 55% identity that could contain large gaps: -F "m L" -U T -p blastn -e 1e-4 -r 8 -q -9 -z 3000000000 -X 150. Reads were filtered in several steps to identify the reads that were aligned over more or less their entire length. Reads had to be aligned for more than 300 bp at >30% identity with less than 25 bp of unaligned bases on either end, or reads had to be aligned over more than 100 bp at >30% identity with less than 20 bp of overhang off either end. Identity was calculated ignoring gaps. In some instances a read might be placed, but the mate would not be placed under these criteria. In such cases, if 80% or more of the mate were successfully aligned, then the mate would be rescued and considered successfully aligned.

Generation of shredded artificial reads from finished genomes. Random pieces of DNA from the genome in question with a length between 1,800 to 2,500 bp were selected. For each piece a read length N1 was selected from the distribution of lengths using the GOS dataset. If that GOS sequence had a mate pair, then a second length N2 was again randomly selected. The length N1 was used to generate a read from the 5′ end of the DNA. The piece of DNA was then reverse complemented and if appropriate, a second length N2 was used to generate a second read. The relationship between these two reads was then recorded and used to produce a fasta file. This approach successfully mimics the types of reads found in the GOS data with similar rates of missing mates.

Abundance of proteorhodopsin variants. A total of 2,644 proteorhodopsin genes were identified from the clustered open reading frames derived from the GOS assembly [18]. These genes could be linked back to 3,608 GOS clones. Open reading frames were predicted from these clones as described in [18]. The peptide sequences were aligned with NCBI blastpgp with the following parameters: -j 5 -U T -e 10 -W 2 -v 5 -b 5000 -F "m L" -m 3. The search was performed with a previously described blue-absorbing proteorhodopsin protein BPR (gi|32699602) as the query. The amino acid associated with light absorption is found within a short
conserved motif RYVDWLLTVPL*IVEF, where the asterisk indicates tuning amino acid [39–41]. In total, 1,938 clones were found to contain this motif. Clones and the sample metadata were then associated with the tuning amino acid to determine the relative abundance of the different amino acids at these positions. Clones could be associated with SAR11 if both mated sequencing reads (when available) were recruited to P. ubique HTCC1062.

Site abundance estimates and comparisons. Given a set of genes identified on the GOS sequences, we can identify the scaffolds on which these genes were annotated. A vector indicating the number of sequences contributed by every sample is determined for every gene. This vector reflects the number of sequences from every sample that assembled into the scaffold on which the gene was identified after normalizing for the proportion of scaffold covered by the gene. For example, if a 10-kb scaffold contains a 1-kb gene, then each sample will contribute one GOS sequence for every ten GOS sequences it contributed to the entire scaffold. The vectors are then summed and normalized to account for either the total number of GOS sequences obtained from each sample or based on the number of typically single copy recA genes (identified as in [18]). Unless stated otherwise, recA was used to normalize abundance across samples. When comparisons using groups of samples were performed, the average value for the samples was compared.

Oligonucleotide composition profile. A 1-D profile representing oligonucleotide frequencies was computed as follows. A sequence was converted into a series of overlapping 10,000-bp segments, each segment offset by 1,000 bp from the previous one, using perl and shell scripting. Dinucleotide frequencies are computed on each segment using a C program written for this purpose. Higher-order oligonucleotides were examined and gave similar results for the genomes of interest. Remaining calculations were performed using the R package [103]. Principal component analysis (function princomp with default settings) was applied to the matrix of frequencies per window position. The value of the first component for each position was normalized by the standard deviation of these values, and truncated to the range [−5, 5]. For visualizations, the resulting values were plotted at the center of each window.

Estimating frequency of large-scale translocations and inversions. The unrecruited mated sequencing reads of reads recruited to P. marinus MIT9312 at or above 80% identity were examined. An unrecruited mate indicated a potential translocation or inversion if it aligned to the MIT9312 genome in two and only two distinct alignments separated by at least 50 kb, if each aligned portion was at least 250 bp long, if there was less than 100 bp of unaligned sequence and no more than 100 bp of overlapping sequence between the two aligned portions in read coordinates, and if each aligned portion was anchored to one end of the sequencing read with less than 25 bp of unaligned sequence from each end. In total, 18 rearrangements were identified, six of which appear to be unique events.

The rate of discovery was estimated by determining the number of rearrangements in a given volume of sequence. We estimated the volume of sequence that was potentially examined by identifying recruited mated sequencing reads that fit the “good” category (i.e., which were recruited in the correct orientation at the expected distance from each other). For a given read, if the mate was recruited at greater than or equal to 80% identity, then the expected amount of sequence examined should be the current (as opposed to mate) read length minus 500 bp. This produces an estimate of the search space to be ~47 Mbp. Given 18 rearrangements, this leads to an estimate of one rearrangement per 2.6 Mbp.

Quantification and assessment of sequences associated with gaps. GOS reads assigned to the “missing mate” category that were recruited at greater than 80% identity outside the gap in question were identified. The mates of these reads were then identified and clustering was attempted with Phrap (http://www.phrap.org). Reads that were incorporated end to end into the Phrap assemblies were identified. For most small gaps a single assembly included all the missing mate reads and identified the precise difference between the reference and the environmental sequences. For the hypervariable segments, most of the reads failed to assemble at all, and those that did show greater sequence divergence than typically seen. In the case of SAR11-recruited reads, to increase the number of reads associated with the hypervariable gaps we identified reads that did not recruit to the P. ubique HTCC1062 but aligned in a single HSP (high-scoring pair) over at least 500 bp with one end unaligned because it extended into the hypervariable gap.

Data and tool release. To facilitate continued analysis of this and other metagenomic datasets, the tools presented here along with their source code will be available via the Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) website (http://camera.calit2.net). The dataset and associated metadata will be accessible via CAMERA (using the dataset tag CAM_PUB_Rusch07a). Given the exceptional abundance of Burkholderia and Shewanella sequences in the first Sargasso Sea sample and the feeling that these may be contaminants, we are also providing a list of the scaffold IDs and sequencing read IDs associated with these organisms to facilitate analyses with or without the sequences. In addition to CAMERA, the GOS scaffolds and annotations will be available via the public sequence repositories such as NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=13694), and the reads will be available via the Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?).

Supporting Information

Poster S1. Fragment Recruitment of GOS Data to Finished Microbial Genomes
Found at doi:10.1371/journal.pbio.0050077.sd001 (21 MB PDF).

Accession Numbers

The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession number for proteorhodopsin protein BPR is gi|32699602.

Acknowledgments

We acknowledge the Department of Energy (DOE), Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation for funding to undertake this study. We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the governments of Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, French Polynesia, and France for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. Canada’s Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford Basin. The Universidad Nacional Autónoma de México (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Representatives from Costa Rica’s Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortés), and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements, and scientific analysis. Our visit to the Galapagos Islands was facilitated by assistance from the Galapagos National Park Service Director, Washington Tapia, and the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat. We especially thank Greg Estes (guide), Héctor Cháuz Campo (Institute of Oceanography of the Ecuador Navy), and a National Park Representative, Simon Ricardo Villemar Tigrero, for field assistance while in the Galapagos Islands. Martin Wilkalski (Princeton University) and Rod Mackie (University of Illinois) provided planning advice for the Galapagos sampling plan. We thank Matthew Charette (Woods Hole Oceanographic Institute) for nutrient data analysis. We also acknowledge the help of Michael Ferrari and Jennifer Clark for remote sensing data. The U.S. Department of State facilitated Governmental communications on multiple occasions. John Glass (J. Craig Venter Institute [JCVI]) provided valuable assistance in methods development. The dedicated efforts of the quality systems, library construction, template, and sequencing teams at the JCVI Joint Technology Center produced the high quality sequence data that was the basis of this paper. We thank Matthew LaPointe, Creative Director of JCVI, for assistance with figure design, and the JCVI information technology support team who facilitated many of the vessel related technical needs. Special thanks are due to Charles H. Howard, captain of the Sorcerer II, and fellow crew members Cyrus Foote and Brooke A. Dill for their time and effort in support of this research. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical related issues for the crew of the Sorcerer II.

Author contributions. DBR, ALH, KB, HS, CAP, JFH, MF, and JCV conceived and designed the experiments. DBR, ALH, JMH, KB, BT, HBT, CS, JT, JF, CAP, and JCV performed the experiments. DBR, ALH, GS, KBH, SW, DW, JAE, KR, JEV, TU, YHR, MRF, KN, and RF analyzed the data. DBR, ALH, GS, KBH, SY, JMH, KR, KB, BT, HS, HBT, CS, JT, JF, CAP, KL, SK, JFH, TU, YHR, LIF, VS, GBR, LEE, DMK, SS, TP, EB, VG, GTC, MRF, RLS, MF, and JCV contributed reagents/materials/analysis tools. DBR, ALH, GS, KBH, SW, SY, JAE, RLS, KN, RF, MF, and JCV wrote the paper.

Funding. Funding for this study was received from the US Department of Energy, Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation.

Competing interests. The authors have declared that no competing interests exist.
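
Illustrative sketch of the library similarity measure. The measure defined in Equations 1–4 under “Measurement of library-to-library similarity” can be written out in a few lines of code. What follows is a minimal Python sketch under stated assumptions, not the code used in the study: the function name, the input format (pairs of overlapping read IDs that already satisfy the chosen identity cutoff, plus a lookup from read ID to library), and the choice to raise an error when a library has no usable overlaps are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def library_similarity(overlaps, library_of, i, j):
    """Return S_ij for libraries i and j (Equations 1-4).

    overlaps   -- iterable of (read_a, read_b) pairs that already satisfy the
                  percent-identity cutoff (assumed input format)
    library_of -- dict mapping each read ID to its library label
    """
    # counts[s][lib] = number of overlaps from read s to reads of library lib
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in overlaps:
        counts[a][library_of[b]] += 1
        counts[b][library_of[a]] += 1

    def r_terms(own, other):
        """Sum f_{s,own} and (1 - f_{s,own}) over reads s of library `own`."""
        r_own = r_other = 0.0
        for s, c in counts.items():
            if library_of[s] != own:
                continue
            n_own, n_other = c.get(own, 0), c.get(other, 0)
            if n_own + n_other == 0:
                continue  # read overlaps only third libraries: no contribution
            f = n_own / (n_own + n_other)
            r_own += f          # Equation 1
            r_other += 1.0 - f  # Equation 2
        return r_own, r_other

    r_ii, r_ij = r_terms(i, j)
    r_jj, r_ji = r_terms(j, i)
    if r_ii == 0.0 or r_jj == 0.0:
        # Assumption: treat this case as undefined rather than guess a value.
        raise ValueError("no within-library overlaps; similarity is undefined")
    s_ij = 0.5 * (r_ij + r_ji) / sqrt(r_ii * r_jj)  # Equation 3
    return 2.0 * s_ij / (1.0 + s_ij)                # Equation 4


# Two libraries whose reads overlap only within their own library.
library_of = {"a1": "i", "a2": "i", "b1": "j", "b2": "j"}
disjoint = [("a1", "a2"), ("b1", "b2")]
print(library_similarity(disjoint, library_of, "i", "j"))  # 0.0
```

Reads with no overlaps never enter `counts`, which mirrors the statement that only reads with overlaps contribute to the calculation. For very small inputs the value can exceed 1, because a read is never counted as overlapping itself; the 0-to-1 interpretation given in the text applies to large samplings.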
References

1. Whitman WB, Coleman DC, Wiebe WJ (1998) Prokaryotes: The unseen majority. Proc Natl Acad Sci U S A 95: 6578–6583.
2. Beja O, Koonin EV, Aravind L, Taylor LT, Seitz H, et al. (2002) Comparative genomic analysis of archaeal genotypic variants in a single population and in two different oceanic provinces. Appl Environ Microbiol 68: 335–345.
3. DeLong EF, Pace NR (2001) Environmental diversity of bacteria and archaea. Systematic Biol 50: 1–9.
4. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc Natl Acad Sci U S A 103: 12115–12120.
5. Garrity GM (2001) Bergey’s manual of systematic bacteriology. New York: Springer-Verlag.
6. Madigan M, Martinko JM, Parker J (2000) Brock biology of microorganisms. Upper Saddle River (NJ): Prentice Hall. 991 p.
7. Fuhrman JA, McCallum K, Davis AA (1992) Novel major archaebacterial group from marine plankton. Nature (London) 356: 148–149.
8. Giovannoni SJ, Britschgi TB, Moyer CL, Field KG (1990) Genetic diversity in Sargasso Sea bacterioplankton. Nature 345: 60–63.
9. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
10. Rappe MS, Giovannoni SJ (2003) The uncultured microbial majority. Ann Rev Microbiol 57: 369–394.
11. Stackebrandt E, Goebel BM (1994) Taxonomic note: A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44: 846–849.
12. Welch RA, Burland V, Plunkett G 3rd, Redford P, Roesch P, et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99: 17020–17024.
13. Linklater E (1972) The voyage of the Challenger. Garden City (NJ): Doubleday. 280 p.
14. Mosley HN (1879) Notes by a naturalist on the “Challenger,” being an account of various observations made during the voyage of H.M.S. “Challenger” round the world, in the years 1872–1876. London: Macmillian and Company. 540 p.
15. Thompson SCW, Murray SJ, Nares GS, Thompson FT (1895) Report on the scientific results of the voyage of H.M.S. Challenger during the years 1873–76 under the command of Captain George S. Nares, R.N., F.R.S. and the late Captain Frank Tourle Thomson, R.N. Prepared under the superintendence of the late Sir C. Wyville Thomson, 1885–1895: Edinburgh: printed for H.M. Stationery off. (by order of Her Majesty’s Government).
16. Fuhrman JA, McCallum K, Davis AA (1993) Phylogenetic diversity of subsurface marine microbial communities from the Atlantic and Pacific Oceans. Appl Environ Microbiol 59: 1294–1302.
17. Hewson I, Steele JA, Capone DG, Fuhrman JA (2006) Temporal and spatial scales of variation in bacterioplankton assemblages of oligotrophic surface waters. Mar Ecol Prog Ser 311: 67–77.
18. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
19. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
20. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
21. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.
22. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2: 231–239.
23. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129–149.
24. Kim UJ, Shizuya H, Dejong PJ, Birren B, Simon MI (1992) Stable propagation of cosmid sized human DNA inserts in an F-factor based vector. Nucleic Acids Res 20: 1083–1085.
25. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF (1996) Characterization of uncultivated prokaryotes: Isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol 178: 591–599.
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
27. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, et al. (2000) PipMaker: A Web server for aligning two genomic DNA sequences. Genome Res 10: 577–586.
28. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.
29. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
30. Hagstrom A, Pommier T, Rohwer F, Simu K, Stolte W, et al. (2002) Use of 16S ribosomal DNA for delineation of marine bacterioplankton species. Appl Environ Microbiol 68: 3628–3633.
31. Giovannoni S, Rappe M (2002) Evolution, diversity and molecular ecology of marine Prokaryotes. In: Kirchman DL, editor. Microbial ecology of the oceans. New York: Wiley-Liss. pp. 47–84.
32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
33. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.
34. Conkright M, Levitus S, Boyer T (1994) World ocean atlas 1994. Volume 1: Nutrients. Washington, D.C.: U.S. Department of Commerce.
35. Levitus S, Burgett R, Boyer T (1994) World ocean atlas 1994. Volume 3: Nutrients. Washington, D.C.: U.S. Department of Commerce.
36. Parekh P, Follows MJ, Boyle E (2004) Modeling the global ocean iron cycle. Global Biogeochem Cycles 18: GB1002.
37. Scanlan DJ, Wilson WH (1999) Application of molecular techniques to addressing the role of P as a key effector in marine ecosystems. Hydrobiologia 401: 149–175.
38. Moore L, Ostrowski M, Scanlan D, Feren K, Sweetsir T (2005) Ecotypic variation in phosphorus acquisition mechanisms within marine picocyanobacteria. Aquat Microb Ecol 39: 257–269.
39. Kelemen BR, Du M, Jensen RB (2003) Proteorhodopsin in living color: Diversity of spectral properties within living bacterial cells. Biochim Biophys Acta 1618: 25–32.
40. Man D, Wang W, Sabehi G, Aravind L, Post AF, et al. (2003) Diversification and spectral tuning in marine proteorhodopsins. EMBO J 22: 1725–1731.
41. Man-Aharonovich D, Sabehi G, Sineshchekov OA, Spudich EN, Spudich JL, et al. (2004) Characterization of RS29, a blue-green proteorhodopsin variant from the Red Sea. Photochem Photobiol Sci 3: 459–462.
42. Bielawski JP, Dunn KA, Sabehi G, Beja O (2004) Darwinian adaptation of proteorhodopsin to different light intensities in the marine environment. Proc Natl Acad Sci U S A 101: 14824–14829.
43. Johnsen S, Sosik H (2005) Shedding light on light in the ocean. Oceanus Mag 43: 24–28.
44. Braun C, Smirnov S (1993) Why is water blue. J Chem Educ 70: 612–615.
45. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
46. de la Torre JR, Christianson LM, Beja O, Suzuki MT, Karl DM, et al. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci U S A 100: 12830–12835.
47. Sabehi G, Beja O, Suzuki MT, Preston CM, DeLong EF (2004) Different SAR86 subgroups harbour divergent proteorhodopsins. Environ Microbiol 6: 903–910.
48. Frigaard NU, Martinez A, Mincer TJ, DeLong EF (2006) Proteorhodopsin lateral gene transfer between marine planktonic bacteria and archaea. Nature 439: 847–850.
49. Yokoyama S (2000) Phylogenetic analysis and experimental approaches to study color vision in vertebrates. Methods Enzymol 315: 312–325.
50. Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, et al. (2002) Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science 296: 2028–2033.
51. Brown MV, Fuhrman JA (2005) Marine bacterial microdiversity as revealed by internal transcribed spacer analysis. Aquat Microb Ecol 41: 15–23.
52. Rocap G, Distel DL, Waterbury JB, Chisholm SW (2002) Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol 68: 1180–1191.
53. Schleper C, DeLong EF, Preston CM, Feldman RA, Wu KY, et al. (1998) Genomic analysis reveals chromosomal variation in natural populations of the uncultured psychrophilic archaeon Cenarchaeum symbiosum. J Bacteriol 180: 5003–5009.
54. Rogers AR, Harpending H (1992) Population growth makes waves in the distribution of pairwise genetic differences. Mol Biol Evol 9: 552–569.
55. Rappe MS, Connon SA, Vergin KL, Giovannoni SJ (2002) Cultivation of
the ubiquitous SAR11 marine bacterioplankton clade. Nature 418: 630–633.
56. Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM, et al. (2006) Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science 311: 1737–1740.
57. Liu WT, Marsh TL, Cheng H, Forney LJ (1997) Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 63: 4516–4522.
58. Fisher MM, Triplett EW (1999) Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol 65: 4630–4636.
59. Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S, et al. (2001) The photosynthetic apparatus of Prochlorococcus: Insights through comparative genomics. Photosynth Res 70: 53–71.
60. Martiny AC, Coleman ML, Chisholm SW (2006) Phosphate acquisition genes in Prochlorococcus ecotypes: Evidence for genome-wide adaptation. Proc Natl Acad Sci U S A 103: 12552–12557.
61. Moore LR, Chisholm SW (1999) Photophysiology of the marine cyanobacterium Prochlorococcus: Ecotypic differences among cultured isolates. Limnol Oceanogr 44: 628–638.
79. Garcia-Martinez J, Rodriguez-Valera F (2000) Microdiversity of uncultured marine prokaryotes: The SAR11 cluster and the marine Archaea of Group I. Mol Ecol 9: 935–948.
80. Jenkins BD, Steward GF, Short SM, Ward BB, Zehr JP (2004) Fingerprinting Diazotroph communities in the Chesapeake Bay by using a DNA macroarray. Appl Environ Microbiol 70: 1767–1776.
81. Legeckis R (1988) Upwelling off the Gulfs of Panama and Papagayo in the tropical Pacific during March 1985. J Geophys Res 93: 15485–15489.
82. McCreary JP, Lee HS, Enfield DB (1989) The response of the coastal ocean to strong offshore winds: With application to circulation in the gulfs of Tehuantepec and Papagayo. J Mar Res 47: 81–109.
83. Palacios DM (2003) Oceanographic conditions around the Galápagos Archipelago and their influence on Cetacean community structure. [PhD diss]. Corvallis: Oregon State University. 178 p.
84. Christian JR, Lewis MR, Karl DM (1997) Vertical fluxes of carbon, nitrogen, and phosphorus in the North Pacific Subtropical Gyre near Hawaii. J Geophys Res 102: 15667–15677.
85. Doney SC, Abbott MR, Cullen JJ, Karl DM, Rothstein L (2004) From genes to ecosystems: The ocean’s new frontier. Front Ecol Environ 2: 457–466.
86. McGillicuddy DJ, Anderson LA, Doney SC, Maltrud ME (2003) Eddy-driven sources and sinks of nutrients in the upper ocean: Results from a 0.1°
62. Moore LR, Rocap G, Chisholm SW (1998) Physiology and molecular resolution model of the North Atlantic. Global Biogeochem Cycles 17: 1035.
phylogeny of coexisting Prochlorococcus ecotypes. Nature 393: 464–467. 87. van der Staay SYM, van der Staay GWM, Guillou L, Vaulot D, Claustre H,
63. Rocap G, Larimer FW, Lamerdin J, Malfatti S, Chain P, et al. (2003) et al. (2000) Abundance and diversity of prymnesiophytes in the
Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche picoplankton community from the equatorial Pacific Ocean inferred
differentiation. Nature 424: 1042–1047. from 18S rDNA sequences. Limnol Oceanogr 45: 98–109.
64. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) 88. Blanchot J, Charpy L, Borgne RL (1989) Size composition of particulate
Three Prochlorococcus cyanophage genomes: Signature features and organic matter in the lagoon of Tikehau Atoll (Tuiamotu Archipelago).
ecological interpretations. PLoS Biol 3: e144. Mar Biol 102: 329–339.
65. Ting CS, Rocap G, King J, Chisholm SW (2002) Cyanobacterial photosyn- 89. Torréton JP, Dufour P (1996) Bacterioplankton production determined by
thesis in the oceans: The origins and significance of divergent light- DNA synthesis, protein synthesis and frequency of dividing cells in
harvesting strategies. Trends Microbiol 10: 134–142. Tuamotu atoll lagoons and surrounding ocean. Microb Ecol 32: 185–202.
66. Zinser ER, Coe A, Johnson ZI, Martiny AC, Fuller NJ, et al. (2006) 90. Torréton JP, Dufour P (1996) Temporal and spatial stability of
Prochlorococcus ecotype abundances in the North Atlantic Ocean as bacterioplankton biomass and productivity in an atoll lagoon. Aquat
revealed by an improved quantitative PCR method. Appl Environ Microb Ecol 11: 251–261.
Microbiol 72: 723–732. 91. Rutala WA, Weber DJ (1997) Uses of inorganic hypochlorite (bleach) in
67. Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, et al. (1987) health-care facilities. Clin Microbiol Rev 10: 597–610.
Report of the ad hoc committee on reconciliation of approaches to 92. Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning. A laboratory
bacterial systematics. Int J Syst Bacteriol 37: 463–464. manual. Cold Spring Harbor (NY): Cold Spring Laboratory Press.
68. Thompson JR, Pacocha S, Pharino C, Klepac-Ceraj V, Hunt DE, et al. 93. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-
(2005) Genotypic diversity within a natural coastal bacterioplankton terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
population. Science 307: 1311–1313. 94. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high
69. Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, et al. (2004) Fine- accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
scale phylogenetic architecture of a complex bacterial community. Nature 95. Felsenstein J (1989) PHYLIP: Phylogeny Inference Package (Version 3.2).
430: 551–554. Cladistics 5: 164–166.
70. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia 96. Bingham J, Sudarsanam S (2000) Visualizing large hierarchical clusters in
coli genome. Proc Natl Acad Sci U S A 95: 9413–9417. hyperbolic space. Bioinformatics 16: 660–661.
71. Scheffer M, Rinaldi S, Huisman J, Weissing FJ (2003) Why plankton 97. Bray JR, Curtis JT. (1957) An ordination of upland forest communities of
communities have no equilibrium: Solutions to the paradox. Hydro- southern Wisconsin. Ecol Monogr 27: 325–349.
biologia 491: 9–18. 98. Myers G (1999) A fast bit-vector algorithm for approximate string
72. Hutchinson GE (1961) The paradox of the plankton. Am Nat 95: 137–145. matching based on dynamic programming. J ACM 46: 395–415.
73. Cohan F (2002) Concepts of bacterial biodiversity for the age of genomics. 99. Penn K, Wu D, Eisen JA, Ward N (2006) Characterization of bacterial
In: Fraser CM, Read TD, Nelson KE, editors. Microbial genomes. Totowa communities associated with deep-sea corals on Gulf of Alaska Seamounts.
(New Jersey): Humana Press. pp. 175–194. Appl Environ Microbiol 72: 1680–1683.
74. Hacker J, Blum-Oehler G, Hochhut B, Dobrindt U (2003) The molecular 100. Ludwig W, Strunk O, Westram R, Richter L, Meier H, et al. (2004) ARB: A
basis of infectious diseases: Pathogenicity islands and other mobile genetic software environment for sequence data. Nucleic Acids Res 32: 1363–1371.
elements. A review. Acta Microbiol Immunol Hung 50: 321–330. 101. Hugenholtz P (2002) Exploring prokaryotic diversity in the genomic era.
75. McCann KS (2000) The diversity-stability debate. Nature 405: 228–233. Genome Biol 3: REVIEWS0003.
76. Fuller NJ, West NJ, Marie D, Yallop M, Rivlin T, et al. (2005) Dynamics of 102. Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al.
community structure and phosphate status of picocyanobacterial pop- (2005) PHACCS, an online tool for estimating the structure and diversity
ulations in the Gulf of Aqaba, Red Sea. Limnol Oceanogr 50: 363–375. of uncultured viral communities using metagenomic information. BMC
77. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Bioinformatics 6: 41.
Metagenomic analysis of the human distal gut microbiome. Science 312: 103. R Development Core Team (2004) R: A language and environment for
1355–1359. statistical computing [computer program]. Vienna, Austria: R Foundation
78. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA (2005) Coupling 16S- for Statistical Computing. http://www.R-project.org.
ITS rDNA clone libraries and automated ribosomal intergenic spacer 104. Gomez-Consarnau L, Gonzalez JM, Coll-Llado M, Gourdon P, Pascher T, et
analysis to show marine microbial diversity: Development and application al. (2007) Light stimulates growth of proteorhodopsin-containing marine
to a time series. Environ Microbiol 7: 1466–1479. Flavobacteria. Nature 445: 210–213.
PLoS Biology | www.plosbiology.org | S55 0431 Special Section from March 2007 | Volume 5 | Issue 3 | e77
PLoS BIOLOGY
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein
families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of
sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million
Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total
of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no
detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of
sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in
the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously
categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans)
from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset
is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins,
the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their
evolution. These observations are illustrated using several protein families, including phosphatases, proteases,
ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS
data has implications for choosing targets for experimental structure characterization as part of structural genomics
efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the
addition of new sequences, implying that we are still far from discovering all protein families in nature.
Citation: Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families.
PLoS Biol 5(3): e16. doi:10.1371/journal.pbio.0050016
Academic Editor: Sean Eddy, Washington University St. Louis, United States of
America
Received March 24, 2006; Accepted August 15, 2006; Published March 13, 2007
Copyright: © 2007 Yooseph et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author
and source are credited.
Abbreviations: aa, amino acid; ENS, Ensembl; EST, expressed sequence tag; GO,
Gene Ontology; GOS, Global Ocean Sampling; GS, glutamine synthetase; HMM,
hidden Markov model; IDO, indoleamine 2,3-dioxygenase; NCBI, National Center for
Biotechnology Information; ORF, open reading frame; PDB, Protein Data Bank; PG,
prokaryotic genomes; PP2C, protein phosphatase 2C; PSI, Protein Structure
Initiative; RLP, RuBisCO-like protein; TGI, TIGR gene indices; TC, trusted cutoff;
UVDE, UV dimer endonuclease
* To whom correspondence should be addressed. E-mail: Shibu.Yooseph@venterinstitute.org
This article is part of the Global Ocean Sampling collection in PLoS Biology. The full
collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
PLoS Biology | www.plosbiology.org | S56 0432 Special Section from March 2007 | Volume 5 | Issue 3 | e16
Expanding the Protein Family Universe
Table 1. The Complete Dataset Consisted of Sequences from NCBI-nr, ENS, TGI-EST, PG, and GOS, for a Total of 28,610,944 Sequences

Dataset: NCBI-nr
  Source: NCBI
  Number of sequences: 2,317,995
  Mean length (aa): 339
  Description: Consists of protein sequences submitted to SWISS-PROT, PDB, PIR, and PRF, and also predicted proteins from both finished and unfinished genomes in GenBank, EMBL, and DDBJ.

Dataset: PG ORFs
  Source: NCBI
  Number of sequences: 3,049,695
  Mean length (aa): 160
  Description: ORFs identified from 222 prokaryotic genome projects. Organisms are listed in Protocol S1.

Dataset: TGI-EST ORFs
  Source: TIGR Gene Index
  Number of sequences: 5,458,820
  Mean length (aa): 119
  Description: ORFs identified from 72 datasets in which each dataset consists of EST assemblies. Organisms are listed in Protocol S1.

Dataset: ENS
  Source: Ensembl
  Number of sequences: 361,668
  Mean length (aa): 466
  Description: Sequences from 12 species, including human, mouse, rat, chimp, zebrafish, fruit fly, mosquito, honey bee, dog, two species of puffer fish, chicken, and worm.

Dataset: GOS ORFs
  Source: J. Craig Venter Institute
  Number of sequences: 17,422,766
  Mean length (aa): 134
  Description: ORFs identified from an assembly of 7.7 million reads. These reads include both the reads from the Sorcerer II GOS Expedition and the reads from the earlier Sargasso Sea study. Also included are 36,318 ORFs identified from an assembly of sequences collected from the viral size (< 0.1 μm) fraction of one sample.

doi:10.1371/journal.pbio.0050016.t001
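As a quick consistency check on Table 1, the per-dataset sequence counts can be summed and compared with the total given in the table title. The sketch below is illustrative only: the variable names and the dictionary layout are assumptions made for this example, and the counts are copied from the table.

```python
# Sequence counts per dataset, copied from Table 1.
# The names used here are illustrative and not part of the original study.
table1_counts = {
    "NCBI-nr": 2_317_995,
    "PG ORFs": 3_049_695,
    "TGI-EST ORFs": 5_458_820,
    "ENS": 361_668,
    "GOS ORFs": 17_422_766,
}

# Total number of input sequences across the five datasets.
total_sequences = sum(table1_counts.values())

# Share of the combined dataset contributed by each source.
dataset_share = {name: n / total_sequences for name, n in table1_counts.items()}

if __name__ == "__main__":
    print(f"total: {total_sequences:,}")
    for name, share in dataset_share.items():
        print(f"{name}: {share:.1%}")
```

The counts sum to 28,610,944, and the GOS ORFs account for roughly 61% of the input sequences.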
by considering translations of the DNA sequence in all six frames. For ORFs from the PG and TGI-EST datasets, we used the appropriate codon usage table for the known organism. For GOS ORFs from the assembled sequences, we used translation table 11 (the code for bacteria, archaea, and prokaryotic viruses) [31]. We did not include alternate codon translations in this analysis. For all datasets, only ORFs containing at least 60 amino acids (aa) were considered. Not all ORFs are proteins. In this paper, ORFs that have reasonable evidence for being proteins are called predicted proteins; other ORFs are called spurious ORFs.
In summary, the total input data for this study (Table 1) consisted of 28,610,944 sequences from NCBI-nr, PG, TGI-EST, ENS, and GOS. All data and analysis results will be made publicly available (see Materials and Methods).
We used sequence similarity clustering to group related sequences and subsequently predicted proteins from this grouping. This approach to protein prediction was adopted for two reasons. First, the GOS data make up a major portion of the dataset being analyzed, and a large fraction of GOS ORFs are fragmentary sequences. Traditional annotation pipelines/gene finders, which presume complete or near-complete genomic data, perform unsatisfactorily on this type of data. Second, protein prediction based on the comparison of ORFs to known protein sequences imposes limits on the protein families that can be explored. In particular, novel proteins that belong to known families will not be detected if they are sufficiently distant from known members of that family. This is the case even though there may be other novel proteins that can transitively link them to the known proteins. Similarly, truly novel protein families will also not be detected.
As the primary input to our clustering process, we computed the pairwise sequence similarity of the 28.6 million aa sequences in our dataset using an all-against-all BLAST search [38]. This required more than 1 million CPU hours on two large compute clusters (see Materials and Methods). The sequences were clustered in four steps (see Materials and Methods). In the first step, we identified a nonredundant set of sequences from the entire dataset using only pairwise matches with ≥98% similarity and involving ≥95% of the length of the shorter sequence. This step served the dual role of identifying highly conserved groups of sequences (where each group was represented by a nonredundant sequence) and removing redundancy in the dataset due to identical and near-identical sequences. Only nonredundant sequences were considered for further steps in our clustering procedure. In the second step, we identified core sets of similar sequences using only matches between two sequences involving ≥80% of the length of the longer sequence. We used a graph-theoretic procedure to identify dense subgraphs (the core sets) within a graph defined by these matches. While the match parameters we used in this step were more relaxed than those in the first step, we chose them to reduce the grouping of unrelated sequences while simultaneously reducing the unnecessary splitting of families. In the third step, these core sets were transformed into profiles, and we used a profile–profile method [39] to merge related core sets into larger groups. In the final step, we recruited sequences to core sets using sequence-profile matching (PSI-BLAST [40]) and BLAST matches to core set members. We required the match to involve ≥60% of the length of the sequence being recruited.
We identified and removed clusters containing likely spurious ORFs using two filters (see Materials and Methods). The first filter identified clusters containing shadow ORFs. The second filter identified clusters containing conserved but noncoding sequences, as indicated by a lack of selection at the codon level. Only clusters that remained after the two filtering steps and contained at least two nonredundant sequences are reported in this analysis.
We examined the distribution of known protein domains in the full dataset using profile HMMs [41] from the Pfam [15] and TIGRFAM [22] databases (see Materials and Methods). We labeled sequences that end up in clusters (containing at least two nonredundant sequences) or that have HMM matches as predicted proteins. The inclusion of the PG ORF set allowed for the evaluation of protein prediction using our clustering approach. A comparison of proteins predicted in the PG ORF set by our clustering against PG ORFs annotated as proteins by whole-genome annotation techniques revealed that our protein prediction method via clustering has a
Table 2. Clustering and HMM Profiling Results Showing the Number of Predicted Proteins (Including Both Redundant and Nonredundant Sequences) in Each Dataset
Columns: Dataset; Original Set; Clustering (A); HMM Profiling (B); A ∩ B; A − B; B − A; Total Predicted Proteins (A ∪ B); Mean Length of Sequence
A ∩ B denotes the number of predicted proteins common to both the clustering and the HMM profiling; A − B, the number of predicted proteins in clusters but not in the HMM profile set; B − A, the number of predicted proteins in the HMM profile set but not in clusters; and A ∪ B, the total number of predicted proteins in each dataset.
doi:10.1371/journal.pbio.0050016.t002
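The set notation in the Table 2 footnote can be made concrete with the combined totals quoted in the Protein Prediction section. The sketch below is an illustration, not the authors' code: the variable names are invented here, and the sizes of A ∩ B and B are derived by subtraction rather than read from the table.

```python
# Totals quoted in the Protein Prediction text (all five datasets combined).
in_clusters = 9_978_637           # |A|: predicted proteins from clustering
hmm_only = 226_743                # |B - A|: HMM matches that are not in clusters
clusters_without_hmm = 3_550_901  # |A - B|: clustered sequences lacking HMM matches

# Derived quantities (assumption: A and B are treated as plain sets of sequences).
in_both = in_clusters - clusters_without_hmm  # |A ∩ B|
with_hmm = in_both + hmm_only                 # |B|
total_predicted = in_clusters + hmm_only      # |A ∪ B| = |A| + |B - A|

# Fraction of clustered sequences that also have an HMM match.
fraction_clustered_with_hmm = in_both / in_clusters

if __name__ == "__main__":
    print(f"|A ∩ B| = {in_both:,}")
    print(f"|B|     = {with_hmm:,}")
    print(f"|A ∪ B| = {total_predicted:,}")
    print(f"clustered sequences with HMM matches: {fraction_clustered_with_hmm:.1%}")
```

Under these assumptions the union comes to 10,205,380 predicted proteins, and about 64% of clustered sequences have an HMM match, in agreement with the figures stated in the text.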
sensitivity of 83% and a specificity of 86% (see Materials and Methods). The HMM profiling allowed for the evaluation of our clustering technique’s grouping of sequences. We used Pfam models in two different ways for this assessment (see Materials and Methods) and make three observations. First, using a simple Pfam domain architecture-based evaluation, these clusters are mostly consistent, as reflected by 93% of clusters having less than 2% unrelated pairs of sequences in them. Second, these clusters are quite conservative and can split domain families, with 58% of domain architectures being confined to single clusters and 88% of domain architectures having more than half of their occurrences in a single cluster. Third, the size distribution of these clusters is quite similar to the size distribution of clusters induced by Pfams.
Protein Prediction
Of the initial 28,610,944 sequences, we labeled 9,978,637 sequences (35%) as predicted proteins based on the clustering, of which nearly 60% are from GOS (Table 2). The HMM profiling labeled only an additional 226,743 (0.8%) sequences as predicted proteins, for a total of 10,205,380 predicted proteins. This indicates that our clustering method captures most of the sequences found by profile HMMs. For sequences both in clusters and with HMM matches, on average 73.5% of their length is covered by HMM matches. For sequences not in clusters but with HMM matches, this value is only 45.3%. Furthermore, while 64% of sequences in clusters have HMM matches, there are 3,550,901 sequences that are grouped into clusters but do not have HMM matches. Most of these clusters either correspond to families lacking profile HMMs or contain sequences that are too remote to match above the cutoffs used. The latter is an indication of the diversity added to known families that is not picked up by current profile HMMs.
Using our method, the predicted proteins constitute different fractions of the totals for the five datasets, with 87% for NCBI-nr, nearly 20% for both PG ORFs and TGI-EST ORFs, 92% for ENS, and 35% for GOS. The high rate of prediction for ENS is a reflection of the high degree of conservation of proteins across the metazoan genomes, whereas the prediction rates for PG ORFs and TGI-EST ORFs are similar to rates seen in other protein prediction approaches. The 13% of NCBI-nr sequences that we marked as spurious may constitute contaminants in the form of false predictions or organism-specific proteins. Nearly two-thirds of these sequences are labeled “hypotheticals,” “unnamed,” or “unknown.” This is more than twice the fraction of similarly labeled sequences (30%) in the full NCBI-nr dataset. Of the remaining one-third, half are less than 100 aa in length. This suggests that they are either fast-evolving short peptides, spurious predictions, or proteins that failed to meet the length-based thresholds in the clustering.
Based on the clustering and the HMM profiling, there is evidence for 6,123,395 proteins in the GOS dataset (Table 2). Given the fragmentary nature of the GOS ORFs (as a result of the GOS assembly [10,30]), it is not surprising that the average length of a GOS-predicted protein (199 aa) is smaller than the average length of predicted proteins in NCBI-nr (359 aa), PG ORFs (325 aa), TGI-EST ORFs (207 aa), and ENS (489 aa). The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (34%) compared to PG ORFs (19%). This could be due to a large number of false-positive protein predictions in the GOS dataset. However, this is unlikely for a variety of reasons. Nearly 4.64 million GOS ORFs (26.6%) have significant BLAST matches (with an E-value ≤ 1 × 10^-10) to NCBI-nr sequences. The PG ORFs do not have a high false-positive rate compared to the submitted annotation for the prokaryotic genomes (see Materials and Methods). Most importantly, based on the fragmentary nature of GOS sequencing compared to PG sequencing, the number of shadow (spurious) ORFs ≥ 60 aa is significantly reduced (see Materials and Methods).
Some pairs of GOS-predicted proteins that belong to the same cluster are adjacent in the GOS assembly. While some of them correspond to tandem duplicate genes, an overwhelming fraction of the pairs are on mini-scaffolds [10], indicating that they are potentially pieces of the same protein (from the same clone) that we split into fragments. We estimate that this effect applies to 3% of GOS-predicted proteins. Sequencing errors and the use of the wrong translation table can also result in the ORF generation process producing split ORF fragments.
The combined set of predicted proteins in NCBI-nr, PG, TGI-EST, and ENS, as expected, has a lot of redundancy. For instance, most of the PG protein predictions are in NCBI-nr. Removing exact substrings of longer sequences (i.e., 100% identity) reduces this combined set to 3,167,979 predicted proteins. When we perform the same filtering on the GOS dataset, 5,654,638 predicted proteins remain. Thus, the GOS-
Table 3. Cluster Size Distribution and the Distribution of Sequences in These Clusters
Columns: Cluster Size; Number of Clusters; Total Sequences; NCBI-nr; PG; TGI-EST; ENS; GOS
The size of a cluster is the number of nonredundant sequences in it. Column three shows the total number of sequences (both redundant and nonredundant) in these clusters. The
succeeding columns show their breakdown by the five datasets. There are 17,067 medium- and large-size clusters.
doi:10.1371/journal.pbio.0050016.t003
Clusters were annotated using the most commonly matching Pfam domains. Many of these clusters correspond to families that have expanded and functionally diversified.
doi:10.1371/journal.pbio.0050016.t004
only clusters (23.40%) emphasizing the significant novelty provided by the GOS data. The next section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data.
Novelty Added by GOS Data
There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The
Table 5. Neighbor-Based Inference of Function for Novel Clusters of GOS Sequences

Novel cluster 8837
  Inferred function: GO:0006260, DNA replication
  p-Value (a): 4.70 × 10^-4
  Neighboring clusters with contributing GO annotation: 812, ATPase involved in DNA replication; 2,655, DNA polymerase family B
  Other neighbors of interest (b): Phage Mu Mom DNA modification enzyme; DNA methylase
  Comments: Profile–profile match: DNA polymerase processivity factor

Novel cluster 12519
  Inferred function: GO:0006118, Electron transport
  p-Value (a): 4.54 × 10^-3
  Neighboring clusters with contributing GO annotation: 1,362, Cytochrome c oxidase subunit III; 1,771, SCO1/SenC - biogenesis of photosynthetic systems
  Comments: Profile–profile match: PF03626, cytochrome c oxidase subunit IV; 3 predicted transmembrane helices

Novel cluster 11010151
  Inferred function: GO:0017004, Cytochrome complex assembly
  p-Value (a): 1.00 × 10^-5
  Neighboring clusters with contributing GO annotation: 8,136, Thioredoxin
  Comments: >20 diverse profile–profile matches, one of which is cytochrome c biogenesis factor ccmH_2

[...]
  [...] colicins, iron, and phage DNA
  Neighboring clusters with contributing GO annotation: 9,569, Biopolymer transport protein ExbD/TolR

Novel cluster 14360
  Inferred function: GO:0006777, Mo-molybdopterin cofactor biosynthesis
  p-Value (a): 1.00 × 10^-5
  Neighboring clusters with contributing GO annotation: 9,745, MoaC family; 9,948, MoaE protein; 255, Radical SAM superfamily
  Other neighbors of interest (b): Sulfite oxidase; Predicted thioesterase
  Comments: SAR11 blast match annotated as probable moaD; profile–profile matches to ThiS and molybdopterin converting factor; <0.05% of sequences have PFAM match to ThiS family

Novel cluster 8397
  Inferred function: GO:0017004, Cytochrome complex assembly
  p-Value (a): 1.00 × 10^-5
  Neighboring clusters with contributing GO annotation: 8,136, Thioredoxin; 9,364, Uncharacterized cytochrome c biogenesis protein
  Other neighbors of interest (b): SMC superfamily (homologous to ABC family)
  Comments: Blast match to “periplasmic or inner membrane–associated protein”; two predicted TM helices; 0.7% of sequences have PFAM match to cytochrome c biogenesis protein

Novel cluster 13909
  Inferred function: GO:0015979, Photosynthesis
  p-Value (a): 1.00 × 10^-5
  Neighboring clusters with contributing GO annotation: 13,990, Photosystem II reaction centre N protein (psbN); 5,184, Photosynthetic reaction centre protein D1 (psbA); 7,664, Ferredoxin-dependent bilin reductase
  Comments: Predicted soluble; single blast match to cyanophage P-SSM2 hypothetical protein; many phage proteins as minor neighbors

(a) p-Values were computed by simulating 100,000 neighbor cluster sets of equivalent size.
(b) Not all clusters could be mapped to a GO term.
doi:10.1371/journal.pbio.0050016.t005
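The Table 5 footnote states that p-values were computed by simulating 100,000 neighbor cluster sets of equivalent size. A minimal sketch of an empirical p-value of that kind follows. The test statistic (the number of neighboring clusters annotated with a given GO term), the uniform random sampling from a background pool, the floor of one over the number of simulations, and all function and parameter names are assumptions made for illustration; the source does not specify them.

```python
import random


def empirical_p_value(background, term, set_size, observed_hits,
                      n_sim=100_000, seed=0):
    """Estimate how often a random neighbor set is at least as enriched.

    background    -- list of sets; each set holds the GO terms of one cluster
                     (hypothetical data layout)
    term          -- GO term whose enrichment is being tested
    set_size      -- number of neighboring clusters in the observed set
    observed_hits -- number of observed neighbors annotated with `term`
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        # Draw a random neighbor set of the same size, without replacement.
        sample = rng.sample(background, set_size)
        hits = sum(1 for annotations in sample if term in annotations)
        if hits >= observed_hits:
            at_least_as_extreme += 1
    # Assumption: report a floor of 1/n_sim instead of an exact zero.
    return max(at_least_as_extreme, 1) / n_sim


if __name__ == "__main__":
    # Toy background: 5 of 100 clusters carry the term of interest.
    toy_background = [{"GO:0017004"}] * 5 + [set()] * 95
    p = empirical_p_value(toy_background, "GO:0017004",
                          set_size=5, observed_hits=2, n_sim=10_000)
    print(f"empirical p-value: {p}")
```

With 100,000 simulations, the floor used in this sketch is 1 × 10^-5, which is also the smallest p-value listed in Table 5.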
Figure 5. Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size
The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given
kingdom weights as described in Materials and Methods. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, while the ten
largest Pfam families are shown as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100
sequences. Pfam families that are smaller than the ten largest are binned together in each column’s bottom section. Pfam covers public-100 better than
GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain,
while only 27.5% of GOS-100s have a sequence match.
doi:10.1371/journal.pbio.0050016.g005
Mysterious Lack of Characteristic Gram-Positive Domains
Gram-positive bacteria (Firmicutes and Actinobacteria) represent 26.7% of PG and ~12% of GOS [30]. Given the larger size of the GOS dataset, one might predict Gram-positive–specific domains to be ~2.4-fold enriched in GOS. Instead, the opposite is consistently seen. Of 15 firmicute-specific spore-associated domains, PG has 503 members, but GOS has none. For another 22 firmicute-restricted domains of varying or unknown function, the PG/GOS ratio is 1797:77 (Table 6). Hence, it appears that GOS Gram-positive lineages lack most of their characteristic protein domains. Two sequenced marine Gram-positives (Oceanobacillus iheyensis [60] and Bacillus sp. NRRL B-14911) have a large complement of these domains. However, another recently assembled genome from Sargasso Sea surface waters, the actinomycete Janibacter sp. HTCC2649, has just two of these domains, and may reveal a whole-genome context for this curious loss of characteristic domains.
Flagellae and Pili Are Selectively Lost from Oceanic Species
Flagellum components from both eubacteria and archaea are significantly underrepresented in the GOS dataset by about 2-fold (Table 6). Ironically, at a bacterial scale, swimming may be worthwhile on an almost dry surface, but not in open water. The chemotaxis (che) operon that often directs flagellar activity is also rare in GOS. Another directional appendage, the pilus, is even more reduced, though its taxonomic distribution (mostly in proteobacteria, predominantly γ-proteobacteria) would have predicted enrichment.
Skew in Core Cellular Pathways
While taxonomically specialized domains are likely to be skewed by taxonomic differences, core pathways found in many or all organisms paint a different picture. We used GO term mapping and text mining to group domains into major functions and to look for consistent skews across several domains. Several core functions, including DNA-associated proteins (DNA polymerase, gyrase, topoisomerase), ribosomal subunits shared by all three kingdoms, marker proteins such as recA and dnaJ, and TCA cycle enzymes all tend to be GOS enriched. This suggests that oceanic genomes may be more compact than sequenced genomes and so have a higher proportion of core pathways.
Characteristics and Kingdom Distribution of Known Protein Domains
A decade ago, databases were highly biased towards proteins of known function. Today, whole-genome sequencing and structural genomics efforts have presumably reduced the biases that are a result of targeted protein sequencing. We used the Pfam database to compare the characteristics and kingdom distribution of known protein domains in the GOS dataset to that of proteins in the publicly available datasets (NCBI-nr, PG, TGI-EST, and ENS). Such an effort can be used to assess biases in these datasets, help direct future sampling efforts (of underrepresented organisms, proteins, and protein families), make more informed generalizations about the protein universe, and provide important context for determination of protein evolutionary relationships (as biased sampling could indicate expected but missing sequences).
For this analysis we used the nonredundant datasets (at 100% identity) discussed in Figure 1. We refer to the set of 3,167,979 nonredundant sequences from NCBI-nr, PG, TGI-EST, and ENS as the public-100 set and the similarly filtered set of 5,654,638 sequences from the GOS data as the GOS-100 set. About 70% of public-100 sequences and 56% of GOS-100 sequences significantly match at least one Pfam model. The most obvious difference between the sets is that the vast majority of GOS sequences are bacterial, and this has to be taken into account when comparing the numbers. Since different Pfam families appear with different frequencies in the kingdoms, we considered the results for each kingdom separately (Figure 5). We then evaluated all kingdoms together, with results normalized by relative abundance of members from the different kingdoms. A domain found commonly and exclusively in eukaryotes and abundant in
PLoS Biology | www.plosbiology.org | S65 0441 Special Section from March 2007 | Volume 5 | Issue 3 | e16
Expanding the Protein Family Universe
public-100 would be expected to be found rarely in GOS-100. We used a conservative BLAST-based kingdom assignment method to assign kingdoms to the GOS sequences (see Materials and Methods).
In each kingdom, sequences in GOS-100 are less likely to match a Pfam family than those in public-100 (Figure 5). For the cellular kingdoms, these differences are comparatively modest. While diversity of the GOS data accounts for some of this difference, it might also be explained in part by the fragmentary nature of the GOS sequences. Viruses tell a dramatic and different story. Of public-100 viral sequences, 89.1% match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match. This tremendous difference appears to be due to heavy enrichment of the public data for minor variants of a few protein families, indicated by the sizes of the ten most populous Pfams in each kingdom (Figure 5). Sequences from three Pfam families (envelope glycoprotein GP120, reverse transcriptase, and retroviral aspartyl protease) account for a third of all public viral sequences. By contrast, the most populous three families in the GOS-100 data (bacteriophage T4-like capsid assembly protein [Gp20], major capsid protein Gp23, and phage tail sheath protein) account for only about 7% of public-100 sequences. Such a difference may be due to intentional oversampling of proteins that come from disease-causing organisms in the public dataset.
While the total proportion of proteins with a Pfam hit is fairly similar between public-100 (70%) and GOS-100 (56%) datasets, there are considerable differences with regard to the distributions of protein families within these two datasets. The most highly represented Pfam families in GOS-100 compared to public-100 are shown in Table 7. Notably, we found that while many known viral families are absent in GOS-100, viral protein families dominate the list of the families more highly represented in GOS-100; this is presumably because of biases in the collection of previously known viral sequences. Surprisingly few bacterial families were among the most represented in GOS-100 compared with public-100. By contrast, we also observed that those families found more rarely in GOS-100 than public-100 were frequently bacterial (Table 7). This appears to be a result of the large number of key bacterial and viral pathogen proteins in public-100 that are comparatively less abundant in the oceanic samples and/or less intensively sampled.

GOS-100 Data Suggest That a Number of "Kingdom-Specific" Pfams Actually Are Represented in Multiple Kingdoms
Of the 7,868 Pfam models in Pfam 17.0, 4,050 match proteins from only a single kingdom in public-100. The additional sequences from GOS-100 reveal that some of these families actually have representatives in multiple kingdoms. Table 8 shows 12 families that have a Pfam match to at least one GOS-100 protein with an E-value ≤ 1 × 10^-10, and which we confidently assigned to a kingdom different from that of all the public-100 matches. Because our criteria for a "confident" kingdom assignment are conservative, there are only one or a few confident assignments for each Pfam domain to a "new" kingdom. Our "confident" criteria are especially difficult to meet in the case of kingdom-crossing, due to the votes contributed by the crossing protein (see Materials and Methods). Thus, many scaffolds have no confident kingdom assignment. Our examination of each of the scaffolds responsible for a determination of kingdom-crossing confirms that each one had both a highly significant match to the Pfam model in question and an overwhelming number of votes for the unexpected kingdom. These scaffold assemblies were also manually inspected. No clear anomalies were observed. In most instances, the assemblies in question were composed of a single unitig, and as such are high-confidence assemblies. Mate pair coverage and consistent depth of coverage provide further support for the correctness of those assemblies that are built from multiple unitigs. Examples of kingdom-crossing families include indoleamine 2,3-dioxygenase (IDO), MAM domain, and MYND finger [15], which have previously only been seen in eukaryotes, but we find them also to be present in bacteria. These Pfams now cross kingdoms, due either to their being more ancient than previously realized or to lateral transfer.
We explored the IDO family further. This family has representatives in vertebrates, invertebrates, and multiple fungal lineages [15,61] in public-100. Members of the IDO family are heme-binding, and mammalian IDOs catalyze the rate-limiting step in the catabolic breakdown of tryptophan [62], while family members in mollusks have a myoglobin function [63]. In mammals, IDO also appears to have a role in the immune system [62,64–66]. The IDO Pfam has matches to 66 proteins in public-100, all of which are eukaryotic. However, it also has matches to ten GOS-100 sequences that we confidently labeled as bacterial proteins and matches to 206 GOS-100 sequences for which a confident kingdom assignment could not be made (many of these are likely bacterial sequences due to the GOS sampling bias). To reconstruct a phylogeny of the IDO family, we searched a recent version of NCBI-nr (March 5, 2006) for IDO proteins that were not included in the public-100 dataset. The search identified two bacterial proteins from the whole genomes of the marine bacteria Erythrobacter litoralis and Nitrosococcus oceani, and 24 eukaryotic proteins (see Materials and Methods). The phylogeny in Figure 6 shows 54% bootstrap support for a separation of the clade containing exclusively public-100 and NCBI-nr 2006 eukaryotic sequences from a clade with the GOS-100 sequences as well as the two NCBI-nr E. litoralis and N. oceani sequences. We confirmed this feature of the tree topology with multiple other phylogeny reconstruction methods. Curiously, there is considerable intermixing of bacterial and eukaryotic sequences in the clade of GOS-100 sequences and the two NCBI-nr bacteria. A manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as bacterial overwhelmingly supports the kingdom assignment. However, a manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as eukaryotes presents a less convincing picture. These scaffolds are short, with most of them containing only two voting ORFs. Since the NCBI-nr version used in the public-100 set has IDO from eukaryotes only, the ORF with the IDO domain itself would cast four votes for eukaryotes. Thus, these GOS-100 eukaryotic labelings are not nearly as confident as the ones labeled bacterial.

Structural Genomics Implications
Knowledge about global protein distributions can be used to inform priorities in related fields such as structural
Table 7. Top Pfam Families Represented More Highly or Less Highly in GOS-100 than in Public-100
Columns: Pfam accession; description; matches in public-100 by kingdom (five columns) and in total; expected matches in GOS-100; observed matches in GOS-100; representation ratio; chi-squared p-value
Families represented more highly
PF07068 Major capsid protein Gp23 0 0 0 41 0 41 8 1,818 23,450% <1 × 10^-303
PF03420 Prohead core protein protease 0 0 0 11 0 11 6 1,223 22,176% <1 × 10^-303
PF06841 T4-like virus tail tube protein gp19 0 0 0 13 0 13 6 795 14,036% <1 × 10^-303
PF04451 Iridovirus major capsid protein 0 0 1 138 0 139 15 1,692 11,269% <1 × 10^-303
PF07230 Bacteriophage T4-like capsid assembly protein (Gp20) 0 0 0 211 0 211 20 1,633 7,992% <1 × 10^-303
PF01818 Bacteriophage translational regulator 0 0 0 10 0 10 5 405 7,444% <1 × 10^-303
PF01231 Indoleamine 2,3-dioxygenase 0 0 66 0 0 66 7 226 3,471% <1 × 10^-303
PF03322 Gamma-butyrobetaine hydroxylase 0 13 117 0 0 130 60 1,807 3,004% <1 × 10^-303
PF04777 Erv1/Alr family 0 0 177 10 0 187 10 309 2,996% <1 × 10^-303
PF05367 Phage endonuclease I 0 2 0 10 0 12 13 290 2,152% <1 × 10^-303
PF04832 SOUL heme-binding protein 3 8 173 0 1 185 43 714 1,648% <1 × 10^-303
PF03159 XRN 5'-3' exonuclease N-terminus 0 0 214 2 0 216 11 170 1,584% <1 × 10^-303
Families represented less highly
PF00516 Envelope glycoprotein GP120 0 0 1 41,115 11 41,127 3,071 0 0% <1 × 10^-303
PF00077 Retroviral aspartyl protease 0 0 153 26,747 9 26,909 2,004 0 0% <1 × 10^-303
PF04650 YSIRK type signal peptide 0 469 0 0 3 472 1,889 0 0% <1 × 10^-303
PF03507 CagA exotoxin 0 333 0 0 0 333 1,343 0 0% 4 × 10^-294
PF03482 sic protein 0 285 0 0 0 285 1,150 0 0% 4 × 10^-252
PF01308 Chlamydia major outer membrane protein 0 264 0 0 0 264 1,066 0 0% 8 × 10^-234
PF02707 Major outer sheath protein N-terminal region 0 264 0 0 0 264 1,066 0 0% 8 × 10^-234
PF00934 PE family 0 249 0 0 0 249 1,005 0 0% 1 × 10^-220
PF00820 Borrelia lipoprotein 0 223 0 0 0 223 901 0 0% 6 × 10^-198
PF02722 Major outer sheath protein C-terminal region 0 223 0 0 0 223 901 0 0% 6 × 10^-198
PF00921 Borrelia lipoprotein 0 202 0 0 0 202 816 0 0% 1 × 10^-179
PF02876 Staphylococcal/streptococcal toxin, beta-grasp domain 0 197 3 1 2 203 797 0 0% 3 × 10^-175
PF01856 Outer membrane protein 0 176 0 0 0 176 712 0 0% 7 × 10^-157
PF01123 Staphylococcal/streptococcal toxin, OB-fold domain 0 166 3 1 2 172 672 0 0% 4 × 10^-148
PF02474 Nodulation protein A (NodA) 0 157 0 0 0 157 636 0 0% 3 × 10^-140
PF06458 MucBP domain 0 155 2 0 0 157 628 0 0% 2 × 10^-138
PF03323 Bacillus/clostridium GerA spore germination protein 0 149 0 0 0 149 603 0 0% 3 × 10^-133
PF07548 Chlamydia polymorphic membrane protein middle domain 0 146 0 0 0 146 591 0 0% 1 × 10^-130
PF02255 PTS system, lactose/cellobiose-specific IIA subunit 0 141 0 0 0 141 571 0 0% 3 × 10^-126
Green indicates exclusively bacterial in public-100; blue, exclusively eukaryotic in public-100; red, exclusively viral in public-100. Expected number of matches in GOS-100 to each Pfam model was calculated as described in Materials and Methods. This calculation is based on the number of matches to each Pfam in public-100 and corrected for the different kingdom proportions in GOS-100 and public-100. For each Pfam model, the percentage representation ratio is the number of observed GOS-100 matches to that Pfam divided by the number expected, and expressed as a percentage. The top half of the table shows the top 20 most highly represented proteins that have representation ratios > 1,000% and have chi-squared p-value < 1 × 10^-303. Numbers of observed matches to these Pfams in public-100 are also indicated according to kingdom. A number of Pfams highly represented in GOS-100 appear to occur exclusively or almost exclusively in a particular kingdom in public-100. For example, Pfams that are characteristically viral in public-100 (colored in red) dominate the top of this list, and an intriguing protein family (IDO) with a known immune function in higher eukaryotes (blue) also appears. The bottom half of the table shows the 20 Pfam domains not observed in GOS-100 with the highest expectation based on public-100 (or equivalently, with the most significant chi-squared p-values). Thus, a large number of key bacterial and viral pathogen proteins in public-100 are not observed in the oceanic samples.
doi:10.1371/journal.pbio.0050016.t007
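The representation ratio described in the Table 7 legend can be sketched as follows. This is a minimal illustration, not the procedure from Materials and Methods: the kingdom correction used here (scaling each kingdom's public-100 match count by the ratio of GOS-100 to public-100 sequences in that kingdom) is an assumption, and the function names and the kingdom totals in the example are hypothetical.

```python
def expected_gos_matches(public_matches, public_totals, gos_totals):
    """Expected GOS-100 matches to one Pfam model.

    Assumption (not taken from the paper): each kingdom's public-100 match
    count is scaled by the GOS-100 / public-100 size ratio for that kingdom.
    All three arguments are dicts keyed by kingdom name.
    """
    expected = 0.0
    for kingdom, n_matches in public_matches.items():
        if n_matches == 0:
            continue  # kingdoms without matches contribute nothing
        expected += n_matches * gos_totals[kingdom] / public_totals[kingdom]
    return expected


def representation_ratio(observed, expected):
    """Observed GOS-100 matches divided by expected matches, as a percentage."""
    if expected <= 0:
        raise ValueError("expected number of matches must be positive")
    return 100.0 * observed / expected


if __name__ == "__main__":
    # Hypothetical kingdom totals, chosen only to make the arithmetic visible.
    exp = expected_gos_matches(
        {"eukaryota": 66, "viruses": 0},
        {"eukaryota": 1000, "viruses": 500},
        {"eukaryota": 100, "viruses": 50},
    )
    print(round(exp, 2))
    # Rounded Table 7 values for indoleamine 2,3-dioxygenase: 226 observed, 7 expected.
    print(round(representation_ratio(226, 7)))
```

With the rounded Table 7 values for indoleamine 2,3-dioxygenase (226 observed, 7 expected) the ratio comes to about 3,229%. The table reports 3,471%, presumably because the expected count used there is not rounded.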
Some Pfam domains observed exclusively in one kingdom in public-100 are found in a different kingdom in GOS-100. The number of sequences in the public dataset that match each Pfam model is listed above the number of sequences in GOS with a confident kingdom assignment and a highly significant match to the model. The TC bit score is provided for each model, together with the bit score and E-value of the best match to the model in an unexpected kingdom. For this analysis, Pfam matches are filtered with an E-value cutoff of 1 × 10^-10. In every case, the bit score is at least five bits greater than the TC for the model, because of the larger size of the GOS dataset relative to those used for creating the TC thresholds. In addition to passing the "confident" criteria (see Materials and Methods), the kingdom assignments are all confirmed by visual inspection of the BLAST kingdom vote distributions for the respective scaffolds.
Best E-Value for Match in Novel Kingdom
10^-100
10^-71
10^-11
10^-16
3.50 × 10^-13
9.70 × 10^-60
10^-17
10^-14
10^-22
10^-12
1.40 × 10^-14
2.20 × 10^-72
×
×
×
×
×
×
×
×
2.00
7.00
6.10
4.30
5.40
3.50
2.50
3.40
Best Score for Match in Novel Kingdom
247.79
342.32
185.71
250.97
35.88
57.91
41.89
60.46
50.82
74.63
40.63
51.44
Pfam TC
107.1
42.1
47.8
38.4
40.7
34.2
45
78
matched the IDO Pfam model and satisfied multiple alignment quality
criteria. The IDO family is eukaryotic specific in public-100. The
0/206
0/165
0/204
0/239
0/13
0/20
0/80
0/92
0/15
0/9
0/2
0/0
0/0
0/3
0/0
0/0
0/0
0/0
0/0
sequences (orange).
doi:10.1371/journal.pbio.0050016.g006
Eukaryota
66/10
712/1
798/1
197/0
100/1
173/6
0/0
0/0
108/250
0/1
0/0
0/8
0/3
0/2
0/1
21/0
0/0
0/0
0/0
0/0
0/0
0/1
0/0
0/1
8/0
Ribosomal LX protein
PF01231
PF00629
PF01753
PF02089
PF05019
PF02919
PF06945
PF04234
PF04967
PF01911
PF01889
PF06626
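The kingdom-crossing criterion behind Table 8 (a Pfam model matched in only one kingdom in public-100, plus at least one GOS-100 match at the E-value cutoff of 1 × 10^-10 that is confidently assigned to a different kingdom) can be sketched as below. This is an illustrative sketch only: the input layout, the function name, and the toy records are hypothetical, and the BLAST vote-based "confident" kingdom assignment itself is not reimplemented; it is assumed to have been done already, with `None` standing for "no confident assignment".

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10  # cutoff stated in the Table 8 legend


def kingdom_crossing_families(public_kingdoms, gos_hits, cutoff=E_VALUE_CUTOFF):
    """Return {pfam: set of new kingdoms} for single-kingdom Pfams seen elsewhere.

    public_kingdoms: dict mapping Pfam accession -> set of kingdoms with
        matches in the public set.
    gos_hits: iterable of (pfam, e_value, kingdom) tuples, where kingdom is
        None when no confident assignment exists (hypothetical layout).
    """
    crossing = defaultdict(set)
    for pfam, e_value, kingdom in gos_hits:
        known = public_kingdoms.get(pfam, set())
        if len(known) != 1:
            continue  # only families restricted to one kingdom can "cross"
        if kingdom is None or e_value > cutoff:
            continue  # needs a confident kingdom and a significant match
        if kingdom not in known:
            crossing[pfam].add(kingdom)
    return dict(crossing)


if __name__ == "__main__":
    # Toy records for illustration; not data from the paper.
    public = {"PF01231": {"eukaryota"}, "PF00001": {"bacteria", "eukaryota"}}
    hits = [
        ("PF01231", 1e-100, "bacteria"),   # confident, significant, new kingdom
        ("PF01231", 1e-50, None),          # no confident kingdom assignment
        ("PF01231", 1e-5, "bacteria"),     # fails the E-value cutoff
        ("PF00001", 1e-80, "viruses"),     # family already spans two kingdoms
    ]
    print(kingdom_crossing_families(public, hits))
```

Keeping the unassigned hits out of the result mirrors the conservative stance in the text: a family is only reported as crossing when at least one match carries a confident assignment.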
The GOS sequences will affect Pfam in two ways: some will be classified in existing protein families, thus increasing the size of these families; others may eventually be classified into new GOS-specific families. Both of these will alter the relative sizes of different families, and thus their prioritization for structural genomics studies. We calculated the sizes for all Pfam families based on the number of occurrences of each family in the public-100 dataset. Proteins in GOS-100 were then added and the family sizes were recalculated. A total of 190 families that are not in the Pfam5000 based on public-100 are moved into the Pfam5000 after addition of the GOS data. The 30 largest such families are shown in Table 9. As 20 of the 30 families are annotated as domains of unknown function in Pfam, structural characterization might be helpful in identifying their cellular or molecular functions. Reshuffling the Pfam5000 to prioritize these 190 families would improve structural coverage of GOS sequences after completion of the Pfam5000 by almost 1% relative to the original Pfam5000 (from 55.4% to 56.1%), with only a small decrease in coverage of public-100 sequences (from 67.7% to 67.5%).
The Pfam5000 would be further reprioritized by the classification of clusters of GOS sequences into Pfam. Assuming each cluster of pooled GOS-100 and public-100 sequences without a current Pfam match would be classified as a single Pfam family, 885 such families would replace existing families in the Pfam5000. These 885 clusters contain a total of 383,019 proteins in GOS-100 and public-100. The reprioritized Pfam5000 would also retain 1,183 families of unknown structure from the current Pfam5000; these families comprise a total of 1,040,330 proteins in GOS-100 and public-100.

Known Protein Families and Increased Diversity Due to GOS Data
Several protein families serve as examples to further highlight the diversity added by the GOS dataset. In this paper, we examined UV irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO (Table 10). The RecA family (unpublished data) and the kinase family [77] have also been explored in the context of the GOS data. There are more than 5,000 RecA and RecA-like sequences in the GOS dataset (Table 10). An analysis of the RecA phylogeny including the GOS data reveals several completely new RecA subfamilies. A detailed study of kinases in the GOS dataset demonstrated the power of additional sequence diversity in defining and exploring protein families [77]. The discovery of 16,248 GOS protein kinase–like enzymes enabled the definition and analysis of 20 distinct kinase-like families. The diverse sequences allowed the definition of key residues for each family, revealing novel core motifs within the entire superfamily, and predicted structural adaptations in individual families. This data enabled the fusion of choline and aminoglycoside kinases into a single family, whose sequence diversity is now seen to be at least as great as the eukaryotic protein kinases themselves.

Proteins Involved in the Repair of UV-Induced DNA Damage
Much of the attention in studies of the microbes in the world's oceans has justifiably focused on phototrophy, such as that carried out by the proteorhodopsin proteins. Previously, in the Sargasso Sea study [10] it was shown that shotgun sequencing reveals a much greater diversity of proteorhodopsin-like proteins than was previously known from cloning and PCR studies. However, along with the potential benefits of phototrophy come many risks, such as the damage caused to cells by exposure to solar irradiation, especially the UV wavelengths. Organisms deal with the potential damage from UV irradiation in several ways, including protection (e.g., UV absorption), tolerance, and repair [78]. Our examination of the protein family clusters reveals that the GOS data provides an order of magnitude increase in the diversity (in both numbers and types) of homologs of proteins known to be involved in pathways specifically for repairing UV damage.
One aspect of the diversity of UV repair genes is seen in the overrepresentation of photolyase homologs in the GOS data (see Table 10). Photolyases are enzymes that chemically reverse the UV-generated inappropriate covalent bonds in cyclobutane pyrimidine dimers and 6–4 photoproducts [79]. The massive numbers of homologs of these proteins in the GOS data (11,569 GOS proteins in four clusters; see Table 10) is likely a reflection of their presence in diverse species and the existence of novel functions in this family. New repair functions could include repair of other forms of UV dimers (e.g., involving altered bases), use of novel wavelengths of light to provide the energy for repair, repair of RNA, or repair in different sequence contexts. In addition, some of these proteins may be involved in regulating circadian rhythms, as seen for photolyase homologs in various species. Our findings are consistent with the recent results of a comparative metagenomic survey of microbes from different depths that found an overabundance of photolyase-like proteins at the surface [51].
A good deal was known about the functions and diversity of photolyases prior to this project. However, much less is known about other UV damage–specific repair enzymes, and examination of the GOS data reveals a remarkable diversity of each of these. For example, prior to this project, there were only some 25 homologs of UV dimer endonucleases (UVDEs) available [80], and most of these were from the Bacillus species. There are 420 homologs of UVDE (cluster 6239) in the GOS data representing many new subfamilies (Figure 7A and Materials and Methods). A similar pattern is seen for spore lyases (which repair a UV lesion specific to spores [81]) and the pyrimidine dimer endonuclease (DenV, which was originally identified in T4 phage [82]). We believe this will also be true for UV dimer glycosylases [83], but predictions of function for homologs of these genes are difficult since they are in a large superfamily of glycosylases.
Our analysis of the kingdom classification assignments suggests that the diversity of UV-specific repair pathways is seen for all types of organisms in the GOS samples. This apparently extends even to the viral world (e.g., 51 of the UVDE homologs are assigned putatively to viruses), suggesting that UV damage repair may be a critical function that phages provide for themselves and their hosts in ocean surface environments. Based on the sheer numbers of genes, their sequence diversity, and the diversity of types of organisms in which they are apparently found, we conclude that many novel UV damage–repair processes remain to be discovered in organisms from the ocean surface water.
Table 9. The 30 Largest Structural Genomics Target Families Added to the Pfam5000 Based on Inclusion of GOS Sequences
Accession Number Description Family Size after GOS Family Size before GOS
The 30 largest families after inclusion of GOS data that were not among the 5000 largest families before inclusion of GOS data are shown here. Family size was calculated as the number of
matches in public-100 (before GOS) and in the combined GOS-100 and public-100 datasets (after GOS).
doi:10.1371/journal.pbio.0050016.t009
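The reprioritization summarized in Table 9 (rank families by size in public-100, add the GOS-100 matches, re-rank, and report families that enter the top 5,000) can be sketched as follows. This is a minimal sketch under stated assumptions: family size is taken to be a plain match count as in the legend, ties are broken alphabetically by accession (an arbitrary choice made here), and the function names and toy counts are hypothetical.

```python
def top_families(sizes, n):
    """Return the set of the n largest families (ties broken by accession)."""
    ranked = sorted(sizes, key=lambda fam: (-sizes[fam], fam))
    return set(ranked[:n])


def newly_prioritized(public_sizes, gos_sizes, n=5000):
    """Families in the top n after adding GOS counts but not before.

    public_sizes: dict of family -> matches in the public set ("before GOS").
    gos_sizes: dict of family -> matches in the GOS set.
    Returns accessions ordered by combined size, largest first.
    """
    combined = dict(public_sizes)
    for fam, count in gos_sizes.items():
        combined[fam] = combined.get(fam, 0) + count
    before = top_families(public_sizes, n)
    after = top_families(combined, n)
    return sorted(after - before, key=lambda fam: (-combined[fam], fam))


if __name__ == "__main__":
    # Toy counts with n=2 standing in for the 5,000-family list.
    public = {"FAM_A": 10, "FAM_B": 8, "FAM_C": 1}
    gos = {"FAM_C": 20, "FAM_A": 1}
    print(newly_prioritized(public, gos, n=2))
```

In the toy example FAM_C rises from last place to first once the GOS counts are added and displaces FAM_B, which is the kind of movement the 190 families in the text undergo.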
Evidence of Reversible Phosphorylation in the Oceans
Reversible phosphorylation of proteins represents a major mechanism for cellular processes, including signal transduction, development, and cell division [84]. The activities of protein kinases and phosphatases serve as antagonistic regulators of the cellular response. Protein phosphatases are divided into three major groups based on substrate specificity [85]. The Mg2+- or Mn2+-dependent phosphoserine/phosphothreonine protein phosphatase family, exemplified by the human protein phosphatase 2C (PP2C), represents the
Table 10. Clustering of Sequences in Families That Are Explored in This and Companion Papers
Protein Family Cluster ID Nonredundant Sequences Total Sequences NCBI-nr PG TGI-EST ENS GOS
doi:10.1371/journal.pbio.0050016.t010
Figure 7. Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined
Kingdom assignments of the sequences are indicated by color: yellow, GOS-eukaryotic; navy blue, GOS-bacterial/archaeal; aqua, GOS-viral; orange,
NCBI-nr–eukaryotic; lime green, NCBI-nr–bacterial/archaeal; pink, NCBI-nr–viral; gray, unclassified.
(A) Phylogeny of UVDE homologs.
(B) Phylogeny of PP2C-like sequences.
(C) Phylogeny of type II GS gene family. In addition to the large amount of diversity of bacterial type II GS in the GOS data, a large group of GOS viral
sequences and eukaryotic GS co-occur at the top of the tree with the eukaryotic virus Acanthamoeba polyphaga mimivirus (shown in pink). The red stars
indicate the locations of eight type II GS sequences found in the type I–type II GS gene pairs. They are located in different branches of the phylogenetic
tree. The rest of the type II GS sequences were filtered out by the 98% identity cutoff.
(D) Phylogeny of the homologs of RuBisCO large subunit. A large portion of the RuBisCO sequences from the GOS data forms new branches that are
distinct from the previously known RuBisCO sequences in the NCBI-nr database.
doi:10.1371/journal.pbio.0050016.g007
smallest group in number. An understanding of their physiological roles has only recently begun to emerge. In eukaryotes, one of the major roles of PP2C activity is to reverse stress-induced kinase cascades [86–89].
We identified 613 PP2C-like sequences in the GOS dataset, and they are grouped into two clusters (Table 10). These sequences contain at least seven motifs known to be important for phosphatase structure and function [90,91]. Invariant residues involved in metal binding (aspartate in motifs I, II, VIII) and phosphate ion binding (arginine in motif I) are highly conserved among the GOS sequences. Using the catalytic domain portion of these sequences we
constructed a phylogeny showing that despite the overall conserved structure of the PP2C family of proteins, the known bacterial PP2C-like sequences group together with the GOS bacterial PP2C-like sequences (Figure 7B, Materials and Methods). Furthermore, the eukaryotic PP2Cs display a much greater degree of sequence divergence compared to the bacterial PP2C sequences.
We also examined the combined dataset of PP2C-like phosphatases further for potential differences in amino acid composition between the bacterial and eukaryotic groups. We observed a striking distinction between the eukaryotic and bacterial PP2C-like phosphatases in motif II, where a histidine residue (His62 in human PP2Cα) is conserved in more than 90% of eukaryotic sequences, but not observed in the bacterial group. The bacterial PP2C group contains a methionine (at the corresponding position) in the majority of the cases (70%). This histidine residue is involved in the formation of a beta hairpin in the crystal structure of human PP2C [91]. Furthermore, His62 is proposed to act as a general acid for PP2C catalysis [92]. Both amino acids lie in the proximity of the phosphate-binding domain, but at this time it is unclear how the difference at this position would contribute to the overall structure and function of the two PP2C groups. Nonetheless, the large number of diverse PP2C-like phosphatases in this dataset allowed us to identify a previously unrecognized key difference between bacterial and eukaryotic PP2Cs.
Bacterial genes that perform closely related functions can be organized in close proximity to each other and often in functional units. Linked Ser/Thr kinase-phosphatase genetic units have been described in several bacterial species, including Streptococcus pneumoniae, Bacillus subtilis, and Mycobacterium tuberculosis [93–96]. Two major neighboring clusters are found to be associated with the set of PP2C-like phosphatases in the GOS bacterial group. We observed that one of these clusters contained a protein serine/threonine kinase domain as its most common Pfam domain. An additional neighboring cluster found to be associated with the GOS set of bacterial PP2Cs was identified as a set of sequences containing a PASTA (penicillin-binding protein and serine/threonine kinase–associated) domain. This domain is unique to bacterial species, and is believed to play important roles in regulating cell wall biosynthesis [97].
Our identification of a conserved group of unique PP2C-like phosphatases in the GOS dataset significantly increases the number and diversity of this enzyme family. This analysis of the NCBI-nr, PG ORFs, TGI-EST ORFs, and ENS datasets along with the sequences obtained from the GOS dataset significantly increases the overall number of PP2C-like sequences from that estimated just a year ago [98]. The presence of genes encoding bacterial serine/threonine kinase domains located adjacent to PP2Cs in the GOS data supports the notion that the process of reversible phosphorylation on Ser/Thr residues controls important physiological processes in bacteria.

Proteases in GOS Data
Proteases are a group of enzymes that degrade other proteins and, as such, play important roles in all organisms [99]. On the basis of their catalysis mechanism, proteases are divided into six distinct catalytic types: aspartic, cysteine, metallo, serine, threonine, and glutamic proteases [99]. They differ from each other by the presence of specific amino acids in the active site and by their mode of action. The MEROPS database [100] is a comprehensive source of information for this large divergent group of sequences and provides a widely accepted classification of proteases into families, based on amino acid sequence comparison, and then into clans based on the similarity of their 3-D structures.
We identified 222,738 potential proteases in the GOS dataset based on similarity to sequences in MEROPS (see Materials and Methods). According to our clustering method, 95% of these sequences are grouped into 190 clusters, with each cluster on average containing more than 1,100 GOS sequences. These sequences were compared to proteases in NCBI-nr. There are groups of proteases in NCBI-nr that are highly redundant. For example, there are a large number of viral proteases from HIV-1 and hepatitis C viruses that dominate the NCBI-nr protease set. Thus, we computed a nonredundant set of NCBI-nr proteases and, for the sake of consistency, a nonredundant set of proteases from the GOS set using the same parameters. Both sets are dominated by cysteine, metallo, and serine proteases. The GOS dataset is dominated by proteases belonging to the bacterial kingdom. That is not surprising, given the filter sizes used to collect the samples. In NCBI-nr the proteases are more evenly distributed between the bacterial and the eukaryotic kingdoms.
Our comparison of the protease clan distribution of the bacterial sequences in the NCBI-nr and GOS sets reveals that the distribution of clans is very similar for metallo- and serine proteases. However, the distribution of clans in aspartic and cysteine proteases is different in the two datasets. Among aspartic proteases, the most visible difference is the increased ratio of proteases of the AC clan and the decreased ratio in the AD clan. Proteases in the former clan are involved in bacterial cell wall production, while those in the latter clan are involved in pilin maturation and toxin secretion [99]. Among cysteine proteases, the most apparent difference is the decrease in the CA clan and an increase in the number of proteases from the PB(C) clan. Bacterial members of the CA clan are mostly involved in degradation of bacterial cell wall components and in various aspects of biofilm formation [99]. It is possible that both activities are less important for marine bacteria present in surface water. Proteases from the PB(C) clan are involved in activation (including self-activation) of enzymes from the acetyltransferase family. In fungi this family is involved in penicillin synthesis, while their function in bacteria is unknown [99].
We were unable to detect any caspases (members of the CD clan) in the GOS data. This is consistent with the apoptotic cell death mechanism being present only in multicellular eukaryotes, which, based on the filter sizes, are expected to be very rare in the GOS dataset.

Metabolic Enzymes in the GOS Data
To gain insights into the diversity of metabolism of the organisms in the sea, we studied the abundance and diversity of glutamine synthetase (GS) and ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO), two key enzymes in nitrogen and carbon metabolism.
GS is the central player of nitrogen metabolism in all organisms on earth. It is one of the oldest enzymes in evolution [101]. It converts ammonia and glutamate into
glutamine that can be utilized by cells. GS can be classified into three types based on sequence [101]. Type I has been found only in bacteria, and it forms a dodecameric structure [102,103]. Type II has been found mainly in eukaryotes, and in some bacteria. Type III GS is less well studied, but has been found in some anaerobic bacteria and cyanobacteria. There are 18 active site residues in both bacterial and eukaryotic GS that play important roles in binding substrates and catalyzing the enzymatic reactions [104].
We found 9,120 GS and GS-like sequences in the GOS data (Table 10). Using profile HMMs [41,105] constructed from known GS sequences of different types, we were able to classify 4,350 sequences as type I GS, 1,021 sequences as type II GS, and 469 sequences as type III GS (see Materials and Methods).
The number of type II GS sequences found in the GOS data is surprisingly high, since previously type II GS were considered to be mainly eukaryotic and very few eukaryotic organisms were expected to be included in the GOS sequencing (Figure 7C and Materials and Methods). We used gene neighbor analysis to classify the origin of GS genes by the nature of other proteins found on the same scaffold. Using this approach, most of the neighboring genes of the type II GS in the GOS data are identified as bacterial genes. The neighboring genes of the type II GS include nitrogen regulatory protein PII, signal transduction histidine kinase, NH3-dependent NAD+ synthetase, A/G-specific adenine glycosylase, coenzyme PQQ synthesis protein c, pyridoxine biosynthesis enzyme, aerobic-type carbon monoxide dehydrogenase, etc. We were able to assign more than 90% of the type II GS sequences in the GOS data to bacterial scaffolds based on a BLAST-based kingdom assignment method (see Materials and Methods). Both neighboring genes and kingdom assignments suggest that most of the type II GS sequences in the GOS data come from bacterial organisms. In comparison, the same type II GS profile HMM detects only 12 putative type II GS sequences from the PG dataset of 222 prokaryotic genomes. Within these, there are only seven unique type II GS sequences and six unique bacterial species represented. The reason why bacteria in the ocean have so many type II GS genes is unclear.
Two hypotheses have been raised to explain the origin of type II GS in bacterial genomes: lateral gene transfer from eukaryotic organisms [106] and gene duplication prior to the divergence of prokaryotes and eukaryotes [101]. The type II GS sequences in the predominantly bacterial GOS data are not only abundant, but also diverse and divergent from most known eukaryotic GS sequences (Figure 7C). This makes the hypothesis of lateral gene transfer less favorable. If the GS gene duplication preceded the prokaryote–eukaryote divergence according to the gene duplication hypothesis, it is possible that many oceanic organisms retained type II GS genes during evolution.
Interestingly, we found 19 cases where a type I GS gene is adjacent to a type II GS gene on the same scaffold. Both GS genes seem to be functional based on the high degree of conservation of active site residues. The same gene arrangement was observed previously in Frankia alni CpI1 [107]. The functional significance of maintaining two types of GS genes adjacent to one another in the genome remains to be elucidated. Most of the sequences of these GS genes are highly similar. We examined the geographic distribution of these adjacent GS sequences across all the GOS samples. They are mainly found in the samples taken from two sites. Their geographic distribution is significantly different from the distributions of types I and II GS across the samples. The high sequence similarity among the adjacent GS pairs and their geographic distribution suggest that these adjacent GS sequences may come from only a few closely related organisms. This is consistent with the protein sequence tree of type II GS, where the type II GS sequences from the GS gene pairs mainly reside in two distinct branches (Figure 7C).
The active site residues are very well conserved in all GS sequences in the GOS data, except one residue, Y179, which coordinates the ammonium-binding pocket. We observed substitutions of Y179 to phenylalanine in about half of the type II GS sequences. The activity of type I GS in some bacteria is regulated by adenylylation at residue Tyr397. In the GOS data, Tyr397 is relatively conserved in type I GS, with variations to phenylalanine and tryptophan in about half of the sequences. This indicates that the activity of some of the type I GS is not regulated by adenylylation, as shown previously in some Gram-positive bacteria [108,109].
RuBisCO is the key enzyme in carbon fixation. It is the most abundant enzyme on earth [110] and plays an important role in carbon metabolism and the CO2 cycle. RuBisCO can be classified into four forms. Form I has been found in both plants and bacteria, and has an octameric structure. Form II has been found in many bacteria, and it forms a dimer in Rhodospirillum rubrum. Form III is mainly found in archaea, and forms various oligomers. Form IV, also called the RuBisCO-like protein (RLP), has been recently discovered from bacterial genome-sequencing projects [111,112]. RLP represents a group of proteins that do not have RuBisCO activity, but resemble RuBisCO in both sequence and structure [111,113]. The functions of RLPs are largely unknown and seem to differ from each other.
In contrast to the large number of GS sequences, we identified only 428 sequences homologous to the RuBisCO large subunit in the GOS data. The small number of RuBisCO sequences may partly be due to the fact that larger-sized bacterial organisms were not included in the sequencing because of size filtering. However, it could also indicate that CO2 is not the major carbon source for these sequenced ocean organisms.
The RuBisCO homologs in the GOS data are more diverse than the currently known RuBisCOs (Figure 7D, Materials and Methods). Six of 19 active site residues (N123, K177, D198, F199, H327, and G404) are not well conserved in all sequences, suggesting that the proteins with these mutations may have evolved to have new functions, such as in the case of RLPs. From the studies of the RLPs from Chlorobium tepidum and B. subtilis [111,114], it has been shown that the active site of RuBisCO can accommodate different substrates and is potentially capable of evolving new catalytic functions [113,114]. On the other hand, two sequence motifs, helices αB and α8, that are not involved in substrate binding and catalytic activity are well conserved in the GOS RuBisCO sequences. The higher degree of conservation of these nonactive site residues than that of active site residues suggests that these motifs are important for their structure, function, or interaction with other proteins.
We found 47 (31 at 90% identity filtering) GOS sequences in the branch with known RLP sequences in a phylogenetic
They include four restriction endonucleases, three hypothetical proteins, and a glucosyltransferase.
GOS sequences can play an important role in identifying the functions of existing ORFans or in confirming protein predictions. For example, we found that the hypothetical protein AF1548, which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the query against a combined set of GOS and NCBI-nr identified several significant restriction endonucleases after three iterations. With the support of 3-D structure and multiple sequence alignment of AF1548 and its GOS matches, we predict that AF1548 along with its GOS homologs are restriction endonucleases (Figure 10). When combined with an established consensus of active sites of the related endonuclease families [118], we predicted three catalytic residues.
Genome Sequencing Projects and Protein Exploration
With respect to protein exploration and novel family discovery, microbial sequencing offers more promise compared to sequencing more mammalian genomes. This is illustrated by Figure 11, where the number of clusters that protein predictions from various finished mammalian genomes fall into was compared to the number of clusters that similar-sized random subsets of microbial sequences fall into (see Materials and Methods). As the figure shows, the rate of protein family discovery is higher for microbes than for mammals. Indeed, the rate of new family discovery is plateauing for mammalian sequences. This is not surprising,
[Table 11, footnote a] Total number of proteins of this organism deposited at NCBI; may have redundant entries.
doi:10.1371/journal.pbio.0050016.t011
as mammalian divergence from a common ancestor is much more recent than microbial divergence from a common ancestor, which suggests that mammals will share a larger core set of less-diverged proteins. Microbial sequencing is also more cost effective than mammalian sequencing for acquiring protein sequences because microbial protein density is typically 80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course, other reasons to sequence mammalian genomes, such as understanding mammalian evolution and mammalian gene regulation.
Conclusions
The rate of protein family discovery is approximately linear in the (current) number of protein sequences. Additional sequencing, especially of microbial environments, is expected to reveal many more protein families and subfamilies. The potential for discovering new protein families is also supported by the GOS diversity seen at the nucleotide level across the different sampling sites [30]. Averaged over the sites, 14% of the GOS sequence reads from a site are unique (at 70% nucleotide identity) to that site [30].
The GOS data provide almost complete coverage of known prokaryotic protein families. In addition, they add a great deal of diversity to many known families and offer new insights into the evolution of these families. This is illustrated using several protein families, including UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA (unpublished data), and kinases [77]. Only a handful of protein families have been examined thus far, and many thousands more remain to be explored.
The protein analysis presented indicates that we are far from having explored the diversity of viruses. This is reflected in several of the analyses. The GOS-only clusters show an overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of viruses in a variety of environmental settings is only now beginning to be understood [57,119–121]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows that multiple viral protein clusters contain significant numbers of host-derived
Figure 11. Rate of Cluster Discovery for Mammals Compared to That for Microbes
The x-axis denotes the number of sequences (in thousands), and the y-axis denotes the number of clusters (in thousands). Five mammalian genomes
are considered for the ‘‘Mammalian’’ dataset, and the plot shows the number of clusters that are hit when each additional genome is added. For the
‘‘Mammalian Random’’ dataset, the order of the sequences from the ‘‘Mammalian’’ dataset is randomized. For the NCBI-nr prokaryotic and GOS
datasets, random subsets of size similar to that of the mammalian set are considered.
doi:10.1371/journal.pbio.0050016.g011
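The kind of curve plotted in Figure 11 can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the `cluster_of` mapping from sequence identifier to cluster identifier, and the toy identifiers are all hypothetical. The sketch only shows how the number of distinct clusters hit grows as sequences are added, either in a given order or in a seeded random order.

```python
import random


def discovery_curve(cluster_of, order):
    """Return (sequences added, distinct clusters hit) after each sequence.

    cluster_of: dict mapping sequence id -> cluster id (hypothetical input).
    order: iterable of sequence ids in the order they are added.
    """
    seen_clusters = set()
    curve = []
    for count, seq_id in enumerate(order, start=1):
        seen_clusters.add(cluster_of[seq_id])
        curve.append((count, len(seen_clusters)))
    return curve


def randomized_discovery_curve(cluster_of, seed=0):
    """Same curve, but with the sequence order shuffled reproducibly."""
    order = sorted(cluster_of)
    random.Random(seed).shuffle(order)
    return discovery_curve(cluster_of, order)


# Toy assignment of five sequences to three clusters (illustrative only).
toy_clusters = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c3", "s5": "c2"}
ordered = discovery_curve(toy_clusters, ["s1", "s2", "s3", "s4", "s5"])
shuffled = randomized_discovery_curve(toy_clusters, seed=1)
```

A curve that flattens as more sequences are added corresponds to the plateau described for the mammalian sequences, whereas a curve that keeps rising corresponds to continued family discovery.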
proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans.
Data generated by this GOS study and similar environmental shotgun sequencing studies present their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile construction, phylogeny inference, etc.) are generally designed and optimized to work with full sequences. They have to be tailored to analyze the mostly fragmentary sequences that are generated by these projects. Nevertheless, these data are a valuable source of new discoveries. These data have the potential to refine old hypotheses and make new observations about proteins and their evolution. Our preliminary exploration of the GOS data identified novel protein families and also showed that many ORFan sequences from current databases have homologs in these data. The diversity added by GOS data to protein families also allows for the building of better profile models and thereby improves remote homology detection. The discovery of kingdom-crossing protein families that were previously thought to be kingdom-specific presents evidence that the GOS project has excavated proteins of more ancient lineage than that previously known, or that have undergone lateral gene transfer. This is another example of how metagenomics studies are changing our understanding of protein sequences, their evolution, and their distribution across the various forms of life and environments. Biases in the currently published databases due to oversampling of some proteins or organisms are illuminated by environmental surveys that lack such biases. Such knowledge can help us make better predictions of the real distribution patterns of proteins in the natural world and indicate where increased sampling would be likely to uncover new families or family members of tremendous diversity (such as in the viral kingdom).
These data have other significant implications for the fields of protein evolution and protein structure prediction. Having several hundreds or even tens of thousands of diverse proteins from a family or examples of a specific protein fold should provide new approaches for developing protein structure prediction models. Development of algorithms that consider the alignments of all these family members/protein folds and analyze how amino acid sequence can vary without significantly altering the tertiary structure or function may provide insights that can be used to develop new ab initio methods for predicting protein structures. These same datasets could also be used to begin to understand how a protein evolves a new function. Finally, this large database of amino acid sequence data could help to better understand and predict the molecular interactions between proteins. For example, they may be used to predict the protein–protein interactions so critical for the formation of specific functional complexes within cells.
The GOS data also have implications for nearly all computational methods relying on sequence data. The increase in the number of known protein sequences presents challenges to many algorithms due to the increased volume of sequences. In most cases this increase in sequence data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the increase in the diversity of protein families, with many new divergent clades present. With more protein similarity relationships falling into the twilight zone overlapping with random sequence similarity, the number of false positives for homology detection methods increases, making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of protein sequence and family diversity introduces unprecedented opportunities to mine similarity relationships for clues on molecular function and molecular interactions as well as providing much expanded data for all methods utilizing homologous sequence information.
The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing projects in exploring proteins. These projects offer an unbiased view of proteins and protein families in an environmental sample. However, it should be noted that the GOS data reported here are limited to mostly ocean surface microbes. Even with this targeted sampling a tremendous amount of diversity is added to known families, and there is evidence for a large number of novel families. Additional data from larger filter sizes (that will sample more eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins.
Materials and Methods
Data description. NCBI-nr [31,32] is the single largest publicly available protein resource and includes protein sequences submitted to SWISS-PROT (curated protein database) [122], PDB (a database of amino acid sequences with solved structures) [123], PIR (Protein Information Resource) [124], and PRF (Protein Research Foundation). In addition, NCBI-nr also contains protein predictions from DNA sequences from both finished and unfinished genomes in GenBank [125], EMBL [126], and DNA Databank of Japan (DDBJ) [127]. The nonredundancy in NCBI-nr is only to the level of distinct sequences, and any two sequences of the same length and content are merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully curated database. Therefore it also contains contaminants in the form of sequences that are falsely predicted to be proteins.
Expressed sequence tag (EST) databases also provide the potential to add a great deal of information to protein exploration and contain information that is not well represented in NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices [34], an EST database, were included in this study. To minimize redundancy, only EST assemblies from those organisms for which the full genome is not yet known were included. The protein predictions on metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl database [35,36] in this study.
Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were included. The protein predictions from the individual sequencing projects are submitted to NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of evaluating our approach and also for the purpose of identifying any proteins that were missed by the annotation process used in these projects.
Thus, for this study the following publicly available datasets, all downloaded on February 10, 2005, were used: NCBI-nr, PG, TGI-EST, and ENS. The organisms in the PG set and the TGI-EST set are listed in Protocol S1.
Assembly of the GOS dataset. Initial assembly (construction of "unitigs") was performed so that only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were accepted. False assemblies at this phase of the assembler are extremely rare, even in the presence of complex datasets [37,128]. Paired-end (also known as mate-pair) data were then used to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate pair data were used more aggressively than is typical in assembly of a single genome in that depth-of-coverage information was largely ignored [10]. This potentially allows chimeric assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a conclusion that relies on the correctness of a single assembly involving multiple unitigs should be considered
tentative until the assembly can be confirmed in some way. Assemblies involved in key results in this paper were subjected to expert manual review based on thickness of overlaps, presence of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency of depth of coverage.
Data release and availability. All the GOS protein predictions will be submitted to GenBank. In addition, all the data supporting this paper, including the clustering and the various analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which is funded by the Gordon and Betty Moore Foundation.
All-against-all BLASTP search. We used two sets of computer resources. At the J. Craig Venter Institute, 125 dual 3.06-GHz Xeon processor systems with 2 GB of memory per system were used. Each system had 80 GB local storage and was connected by GBit Ethernet with storage area network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A total of 466,366 CPU hours was used on this system. In addition, access to the National Energy Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including 380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this system. The dataset of 28.6 million sequences was searched against itself in a half-matrix using NCBI BLAST [38] with the following parameters: -F "m L" -U T -p blastp -e 1 × 10^-10 -z 3 × 10^9 -b 8000 -v 10. In this paper, similarity of an alignment is defined to be the fraction of aligned residues with a positive score according to the BLOSUM62 substitution matrix [129] used in the BLAST searches.
Identification of nonredundant sequences. Given a set of sequences S and a threshold T, a nonredundant subset S′ of S was identified by first partitioning S (using the threshold T) and then picking a representative from each partition. The set of representatives constitutes the nonredundant set S′. The process was implemented using the following graph-theoretic approach. A directed graph G = (V, E) is constructed with vertex set V and edge set E. Each vertex in V represents a sequence from S. A directed edge (u, v) ∈ E exists if sequence u is longer than sequence v and their sequence comparison satisfies the threshold T; for sequences of identical length, the sequence with the lexicographically larger id is considered the longer of the two. Note that G does not have any cycles. Source vertices (i.e., vertices with no in-degree) are sorted in decreasing order of their out-degrees and processed in this order (from largest out-degree to smallest). A source vertex u is processed as follows: mark all vertices that have not been seen before and are reachable from vertex u as being redundant and mark vertex u as their representative.
We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in the first stage of the clustering and the latter was used in the HMM profile analysis. For the 98% similarity threshold, two sequences satisfy the threshold if the following three criteria are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%.
For the 100% identity threshold, two sequences satisfy the threshold if their match identity is 100%.
Description of the clustering algorithm. The starting point for the clustering was the set of pairwise sequence similarities identified using the all-against-all BLASTP search. Because of both the volume and nature of the data, the clustering was carried out in four steps: redundancy removal, core set identification, core set merging, and final recruitment.
A set of nonredundant sequences (at 98% similarity) was identified using the procedure given in Materials and Methods (Identification of nonredundant sequences). Only the nonredundant sequences were considered in further steps of the clustering process.
The aim of the core set identification step was to identify core sets of highly related sequences. In graph-theoretic terms, this involves looking for dense subgraphs in a graph where the vertices correspond to sequences and an edge exists between two sequences if their sequence match satisfies some reasonable threshold (for instance, 40% similarity match over 80% of at least one sequence and are clearly homologous based on the BLAST threshold). Dense subgraphs were identified by using a heuristic. This approach utilizes long edges. These are edges where the match threshold is computed relative to the longer sequence. This was done to prevent, as much as possible, unrelated proteins from being put into the same core set. If all sequences were full length, using long edges would have offered a good solution to keeping unrelated proteins apart. However, the situation here is complicated by the presence of a large amount of fragmentary sequence data of varying lengths. This was dealt with somewhat by working with rather stringent match thresholds and a two-stage process to identify the core sets. We used the concept of strict long edges and weak long edges. A strict long edge exists between two vertices (sequences) if their match has the following properties: (1) 90% of the longer sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the match is at least 60% of the self-score of the longer sequence. A weak long edge exists between two vertices (sequences) if their match has the following properties: (1) 80% of the longer sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the match is at least 30% of the self-score of the longer sequence. Core set identification had two substages: large core initialization and core extension. The large core initialization step identified sets of sequences where these sets were of a reasonable size and the sequences in them were very similar to each other. Furthermore, these sets could be extended in the core extension step by adding related sequences. In the large core initialization step, a directed graph G was constructed on the sequences using strict long edges, with each long edge being directed from the longer to the shorter sequence. For each vertex v in G, let S(v) denote the friends set of v consisting of v and all neighbors that v has an out-going edge to.
Initially all the vertices in G are unmarked. Consider the set of all friends sets in the decreasing order of their size. For S(v) that is currently being considered, do the following: (1) initialize seed set A = S(v); (2) while there exists some v′ such that |S(v) ∩ S(v′)| ≥ k, set A = A ∪ S(v′) (note: k = 10 is chosen); (3) output set A and mark all vertices in A; and (4) update all friends sets to contain only unmarked vertices.
In the core extension step, we constructed a graph G using weak long edges. All vertices in seed sets (computed from the large core initialization step) were marked and the rest of the vertices unmarked. Each seed set was then greedily extended to be a core set by adding a currently unmarked vertex that has at least k neighbors (k = 10 is chosen) in the set; the added vertex was marked. After this process, a clique-finding heuristic was used to identify smaller cliques (of size at most k − 1) consisting of currently unmarked vertices; these were also extended to become core sets. A final step involved merging the computed core sets on the basis of weak edges connecting them.
In the core set merging step, we constructed an FFAS (Fold and Function Assignment System) profile [39] for each core set using the longest sequence in the core set as query. FFAS was then used to carry out profile–profile comparisons in order to merge the core sets into larger sets of related sequences. Due to computational constraints imposed by the number of core sets, profiles were built on only core sets containing at least 20 sequences.
Final recruitment involved constructing a PSI-BLAST profile [40] on core sets of size 20 or more (using the longest sequence in the core set as query) and then using PSI-BLAST (-z 1 × 10^9, -e 10) to recruit as yet unclustered sequences or small-sized clusters (size less than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match had to cover at least 60% of the length of the sequence with an E-value ≤ 1 × 10^-7. In a final step, unclustered sequences were recruited to the clusters using their BLAST search results. A length-based threshold was used to determine if the sequence is to be recruited.
Identification of clusters containing shadow ORFs. A well-known problem in predicting coding intervals for DNA sequences is shadow ORFs. The key requirement that coding intervals not contain in-frame stop codons requires that coding intervals be subintervals of ORFs. Long ORFs are therefore obvious candidates to be coding intervals. Unfortunately, the constraints on the coding interval to be an ORF often cause subintervals and overlapping intervals of the coding interval to also be ORFs in one of the five other reading frames (two on the same strand and three on the opposite strand). These coincidental ORFs are called shadow ORFs since they are found in the shadow of the coding ORF. In rare cases (and more frequently in certain viruses) coding intervals in different reading frames can overlap but usually only slightly. Overwhelmingly, distinct coding intervals do not overlap. However, this constraint is not as strict for ORFs that contain a coding interval, as the exact extent of the coding interval is not known. Prokaryotes predominate in these data and are the focus of the ORF predictions. The 3′ end of an ORF is very likely to be part of the coding interval because a stop codon is a clear signal for the termination of both the ORF and the coding interval (this signal could be obscured by frameshift errors in the sequencing). The 5′ end is more problematic because the true start codon is not so easily identified and so the longest ORF with a reasonable start codon is chosen and this may extend the ORF beyond the true coding interval. For this reason different criteria
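The graph procedure described under "Identification of nonredundant sequences" can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the dictionary input format, and the `satisfies_threshold` callable (standing in for the pairwise comparison against threshold T) are hypothetical, and the all-pairs loop is only practical for toy inputs.

```python
from collections import defaultdict


def nonredundant_representatives(sequences, satisfies_threshold):
    """Map every sequence id to the id of its representative.

    sequences: dict of id -> sequence string (hypothetical input format).
    satisfies_threshold: callable(id_a, id_b) -> bool, a stand-in for the
        pairwise comparison against threshold T.
    """
    def rank(seq_id):
        # Longer sequence wins; ties go to the lexicographically larger id.
        return (len(sequences[seq_id]), seq_id)

    ids = sorted(sequences)
    out_edges = defaultdict(list)
    in_degree = {seq_id: 0 for seq_id in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if satisfies_threshold(a, b):
                u, v = (a, b) if rank(a) > rank(b) else (b, a)
                out_edges[u].append(v)  # directed edge: longer -> shorter
                in_degree[v] += 1

    # Source vertices, processed from largest out-degree to smallest.
    sources = [s for s in ids if in_degree[s] == 0]
    sources.sort(key=lambda s: len(out_edges[s]), reverse=True)

    representative = {}
    for u in sources:
        representative[u] = u
        stack = list(out_edges[u])
        while stack:
            v = stack.pop()
            if v in representative:
                continue  # already claimed by an earlier source
            representative[v] = u
            stack.extend(out_edges[v])
    return representative


# Toy data: "b" matches "a", "c" matches "b", and "d" matches nothing.
toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQ", "c": "MKTAYI", "d": "GGSGGS"}
similar_pairs = {frozenset(("a", "b")), frozenset(("b", "c"))}
reps = nonredundant_representatives(
    toy, lambda x, y: frozenset((x, y)) in similar_pairs
)
nonredundant_set = sorted(set(reps.values()))
```

Because every edge points from a longer to a shorter sequence, the graph has no cycles, so every vertex is reachable from at least one source and receives exactly one representative.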
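The strict and weak long edge definitions and the greedy core extension step from the clustering description can be sketched as below. Several points are assumptions made for illustration: the `Match` record and its field names are hypothetical, the stated percentages are treated as minimum values, and `neighbors` is assumed to be an undirected adjacency map already built from weak long edges. The value k = 10 is the one given in the text; the toy example uses k = 2 to stay small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    coverage_of_longer: float    # fraction of the longer sequence in the match
    similarity: float            # fraction of aligned residues scoring positive
    score: float                 # alignment score
    self_score_of_longer: float  # score of the longer sequence against itself


def is_long_edge(m, min_coverage, min_similarity, min_score_fraction):
    return (m.coverage_of_longer >= min_coverage
            and m.similarity >= min_similarity
            and m.score >= min_score_fraction * m.self_score_of_longer)


def is_strict_long_edge(m):
    return is_long_edge(m, 0.90, 0.70, 0.60)


def is_weak_long_edge(m):
    return is_long_edge(m, 0.80, 0.40, 0.30)


def extend_core(seed, neighbors, marked, k=10):
    """Greedily add unmarked vertices with at least k neighbors in the core.

    neighbors: dict of vertex -> set of adjacent vertices (weak long edges).
    marked: set of vertices already assigned; updated in place.
    """
    core = set(seed)
    changed = True
    while changed:
        changed = False
        for v in sorted(neighbors):
            if v in marked or v in core:
                continue
            if len(neighbors[v] & core) >= k:
                core.add(v)
                marked.add(v)
                changed = True
    return core


# Toy graph with k = 2: "c" and then "d" are recruited, "e" is not.
neighbors = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": {"a"},
}
marked = {"a", "b"}
core = extend_core({"a", "b"}, neighbors, marked, k=2)
```

Repeating the scan until no vertex is added lets a vertex such as "d" join once an earlier recruit has raised its neighbor count, which is one plausible reading of the greedy extension.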
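The shadow ORF problem can be made concrete with a small Python sketch that lists stop-free stretches in all six reading frames and reports those overlapping a chosen coding interval. It is only an illustration: treating an ORF as a maximal stretch without in-frame stop codons, the function names, the coordinate convention, and the toy sequence are assumptions, and the 60-codon default merely echoes the minimum ORF length mentioned in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(_COMPLEMENT)[::-1]


def stop_free_intervals(dna, min_codons=60):
    """Return (strand, frame, start, end) tuples for stop-free stretches.

    start and end are 0-based, end-exclusive, on the forward strand.
    """
    n = len(dna)
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start = frame
            frame_end = frame + 3 * ((n - frame) // 3)
            # The position frame_end acts as a sentinel closing the last run.
            for pos in range(frame, frame_end + 1, 3):
                at_end = pos == frame_end
                if at_end or seq[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        lo, hi = run_start, pos
                        if strand == "-":
                            lo, hi = n - pos, n - run_start
                        found.append((strand, frame, lo, hi))
                    run_start = pos + 3
    return found


def shadows_of(coding, intervals):
    """Intervals other than `coding` that overlap it on the forward strand."""
    _, _, c_lo, c_hi = coding
    return [iv for iv in intervals
            if iv != coding and iv[2] < c_hi and c_lo < iv[3]]


# Toy gene: start codon, five GCC codons, stop codon (21 nt in total).
toy_dna = "ATG" + "GCC" * 5 + "TAA"
intervals = stop_free_intervals(toy_dna, min_codons=3)
coding = ("+", 0, 0, 18)
shadows = shadows_of(coding, intervals)
```

In this toy case every one of the five other reading frames yields a stop-free stretch overlapping the coding interval, two on the same strand and three on the opposite strand, which is the situation the text calls shadow ORFs.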
(recall that the ORF calling procedure only produced ORFs of length ≥ 60 aa). The clustered set and submitted set had 493,756 ORFs in common. Of the 107,155 sequences that were only in the submitted set, 24,217 sequences (23%) had HMM matches. As with other unclustered HMM matches, most were weak or partial. These sequences had an average of only 48% of their lengths covered by HMMs. Of the remaining 82,938 sequences that did not have an HMM match, 13,724 (17%) were removed by the filters used, and the rest fell into clusters with only one nonredundant sequence (and thus were not labeled as predicted proteins by the clustering analysis). Based on NCBI-nr sequences in them, these clusters were mostly labeled as "hypothetical," "unnamed," or "unknown." Our clustering method identified 81,973 ORFs not predicted by the genome projects, of which 16,042 (20%) were validated by HMM matches (with average HMM coverage of 69% of sequence length) and an additional 27,120 (33%) had significant BLAST matches (E-value ≤ 1 × 10^-10) to sequences in NCBI-nr. Thus, if the submitted set is considered as truth, then protein prediction via clustering produces 493,756 true positives (TP), 81,973 false positives (FP), and 107,155 false negatives (FN), thereby having a sensitivity (TP/[TP + FN]) of 83% and specificity (TP/[TP + FP]) of 86%. However, if truth is considered as those sequences that are common to both the clustered and submitted sets in addition to those sequences with HMM matches, then our protein prediction method via clustering has 95% sensitivity and 89% specificity, while protein prediction by the prokaryotic genome projects has 97% sensitivity and 86% specificity.

Evaluation of protein clustering. We used Pfams to evaluate the clustering method in two ways. For both evaluations the clustering was restricted to only those sequences with Pfam matches. It should be kept in mind that there are redundancies among Pfams in that there can be more than one Pfam for a homologous domain family (for instance, the kinase domain Pfams: PF00069 protein kinase domain and PF07714 protein tyrosine kinase), and these redundancies can affect the evaluation statistics reported below.

For the first evaluation, each sequence was represented by the set of Pfams that match it. This is referred to as the domain architecture for a sequence. While Pfams provide a domain-centric view of proteins, the domain architecture attempts to approximate the full sequence-based approach used here, and thus could be used to shed light on the general performance of the clustering. We measured how often unrelated sequences were present in a given cluster. Two sequences were defined to be unrelated if their domain architectures each had at least one Pfam that was not present in the other's domain architecture. Note that this measure did not penalize the case when the domain architecture of one sequence was a proper subset of the domain architecture of the other sequence. This was done to allow fragmentary sequences in clusters to be included in the evaluation as well (and also because it is not always easy to determine whether an amino acid sequence is fragmentary or not). For each cluster, we computed the percentage of sequence pairs that are unrelated under this measure. A total of 92% of the clusters had at most 2% unrelated pairs. Then we carried out an assessment of how many instances of a given domain architecture appear in a single cluster. A total of 58% of the domain architectures were confined to single clusters (i.e., 100% of their occurrences are in one cluster), and for 88% of the domain architectures, >50% of their occurrences are in one cluster.

For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to that of clusters from the clustering method. Figure 12A shows the comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that the cluster size distributions are quite similar, with both methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions.

Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, at least 94% of sequences (relative to the smaller set) are in common between clustering and HMM profiling.

Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set. The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp compared to full-length genomes for the PG data. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect that GOS fragmentation had on ORF lengths is estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value.

Kingdom assignment strategy and its evaluation. We used several
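The truncation estimate for the PG and GOS ratios can be checked with a few lines of arithmetic. The sketch below is only one plausible reading of that calculation: it assumes that each ORF class loses a fraction 60/(median length) of its members and that the PG ratio of 0.1888 is then renormalized over the surviving ORFs. The function names are hypothetical and do not come from the paper.

```python
def truncation_loss(length_aa: float, min_len_aa: float = 60.0) -> float:
    """Probability that a truncated ORF falls below the minimum ORF length,
    assuming a uniformly distributed truncation point (as stated in the text)."""
    return min(1.0, min_len_aa / length_aa)


def expected_clustered_ratio(pg_ratio: float,
                             clustered_len: float,
                             unclustered_len: float,
                             min_len_aa: float = 60.0) -> float:
    """Expected clustered/total ORF ratio after truncation.

    Assumption (not spelled out in the source): both ORF classes are thinned
    independently and the ratio is recomputed over the survivors.
    """
    kept_clustered = pg_ratio * (1.0 - truncation_loss(clustered_len, min_len_aa))
    kept_unclustered = (1.0 - pg_ratio) * (1.0 - truncation_loss(unclustered_len, min_len_aa))
    return kept_clustered / (kept_clustered + kept_unclustered)


# Median ORF lengths reported for the PG data: 280 aa (clustered), 87 aa (unclustered).
print(round(100 * truncation_loss(280)))   # 21
print(round(100 * truncation_loss(87)))    # 69
print(round(expected_clustered_ratio(0.1888, 280, 87), 4))  # 0.3708
```

Under these assumptions the sketch reproduces the reported 21%, 69%, and 0.3708, which suggests, but does not prove, that this is how the expected ratio was obtained.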
Table 13. BLAST-Based Classification Rate per Kingdom

Kingdom      Total Number   Correct Classification   Percent Correct
Eukaryota    440,951        422,173                  95.7
Bacteria     465,692        430,014                  92.3
Archaea      36,894         25,527                   69.2
Viruses      36,346         32,381                   89.0

doi:10.1371/journal.pbio.0050016.t013

Table 14. The Values for $C_{\geq d}(n)$, the Number of Clusters of Size ≥ d, as a Function of the Power Law Exponent b and Constant a

b            a              $C_{\geq d}(n)$
b < 1        $n^{b-1}$      $1$
b = 1        $1$            $\ln n$
1 < b < 2    $n^{b-1}$      $(n/d)^{b-1}$
b = 2        $n/\ln n$      $n/(d^{b-1} \ln n)$
b > 2        $n$            $n/d^{b-1}$

doi:10.1371/journal.pbio.0050016.t014
approaches to assign kingdoms for GOS sequences. They are all fundamentally based upon a strategy that takes into account top BLAST matches of a GOS sequence to sequences in NCBI-nr, and then voting on a majority.

We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in Materials and Methods [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset S. For each sequence in S, its top four BLAST matches to other sequences in S (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in Table 13.

While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference).

The kingdom proportion for the GOS dataset reported in Figure 1 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scaffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom.

Cluster size distribution, the power law, and the rate of protein family discovery. Earlier studies of protein family sizes in single organisms [137–139] have suggested that P(d), the frequency of protein families of size d, satisfies a power law: that is, $P(d) \approx d^{-b}$ with exponent b reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [42,43]. Figure 12B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent b via linear regression separately in each size regime. For the core set distribution, the exponent b = 1.99 (R² = 0.994) for clusters of size ≤ 2,500, and b = 3.34 (R² = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent b = 1.72 (R² = 0.995) for clusters of size ≤ 2,500, and b = 2.72 (R² = 0.995) for clusters of size > 2,500. The estimates for b are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study [140] of protein families contained in the ProDom, Protomap, and COG databases, where the exponent b reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases.

Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. Figure 2 shows the number of core sets of varying sizes (≥3, ≥5, ≥10, and ≥20) as a function of the number of nonredundant sequences for each dataset.

The observed linear growth in number of families with increase in sample size n is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with n vertices such that the cluster sizes obey a power law distribution. Let $C_d(n)$ [respectively, $C_{\geq d}(n)$] denote the number of clusters of size d (respectively, ≥ d). Since the distribution of cluster sizes follows a power law, there exist constants a, b such that for all $x \leq n$, $C_x(n) = a x^{-b}$.

As every vertex of the graph is a member of exactly one cluster,

\[
n = \sum_{x=1}^{n} x\, C_x(n) = \sum_{x=1}^{n} a\, x^{1-b} \approx
\begin{cases}
a\, \dfrac{n^{2-b} - 1}{2-b} & b \neq 2 \\[1ex]
a \ln n & b = 2
\end{cases}
\tag{1}
\]

The number of clusters of size at least d is

\[
C_{\geq d}(n) = \sum_{x=d}^{n} C_x(n) \approx
\begin{cases}
a\, \dfrac{n^{1-b} - d^{1-b}}{1-b} & b \neq 1 \\[1ex]
a \ln n & b = 1
\end{cases}
\tag{2}
\]

Combining the two equations, we obtain values (up to a multiplicative constant) for $C_{\geq d}(n)$ as shown in Table 14. In all cases with b > 1, the number of clusters $C_{\geq d}(n)$ increases as n increases, and as d decreases. Specifically, for b > 2, the growth is linear in n for all d, with slope decreasing as d increases. For 1 < b < 2, the growth is sublinear in n for all d.

Note that while the observed distribution of protein family sizes is fit by two different power laws, one for clusters of size less than 2,500 with b = 1.99 and another for clusters of size greater than 2,500 with b = 3.34 for the current number of (nonredundant) sequences, the contribution of large families to the rate of growth is negligible compared to the small families.

The above formulas for $C_{\geq d}(n)$ also suggest the dependence of the rate of growth of clusters on the cluster size d. For example, in the case when b is very close to 2,

\[
C_{\geq d}(n) = m\, \frac{n}{d^{\,b-1}} \tag{3}
\]
Figure 13. Log–Log Plot of Slopes m(d) of Linear Regression Fit to the Rate of Growth in Figure 2 for Different Values of Cluster Size d
According to the equation derived in the text, $m(d) = m\, d^{1-b}$ for some constant m. The best linear fit to log [m(d)] gives a line with slope −0.91 (R² = 0.98) that is close to the predicted value 1 − b = −0.99.
doi:10.1371/journal.pbio.0050016.g013
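The relation in the Figure 13 caption can be illustrated numerically. The sketch below generates slopes from $m(d) = m\, d^{1-b}$ and recovers the exponent with an ordinary least-squares fit on log–log axes. The cluster sizes 3, 5, 10, and 20 are the core-set thresholds mentioned for Figure 2; the constant m = 1.0 and the function names are arbitrary assumptions for illustration, and the data are synthetic rather than the paper's measured slopes.

```python
import math


def predicted_slope(d: float, b: float, m: float = 1.0) -> float:
    """Growth-rate slope predicted for cluster size d: m(d) = m * d**(1 - b)."""
    return m * d ** (1.0 - b)


def loglog_slope(xs, ys) -> float:
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    sxx = sum((x - mean_x) ** 2 for x in lx)
    return sxy / sxx


sizes = [3, 5, 10, 20]                       # cluster-size thresholds d
slopes = [predicted_slope(d, b=1.99) for d in sizes]
fitted = loglog_slope(sizes, slopes)
print(fitted)                                # approximately 1 - b = -0.99
```

With measured slopes in place of the synthetic ones, the same fit would yield the empirical exponent that the caption compares against 1 − b.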
for some constant m. Thus, the rate of growth of cluster sizes is linear, and the slope m(d) of rate of growth is given by $m(d) = m\, d^{1-b}$. Figure 13 shows how well the observed rates of growth match the values predicted by this equation. A fit to a sublinear function (not shown) also gives similar results as in Figure 13.

GOS versus known prokaryotic versus known nonprokaryotic. Examples of top five clusters in the various categories (except GOS-only) are given below. The cluster identifiers are in parentheses. Known prokaryotic only: (Cluster ID 1319) outer surface protein in Anaplasma ovis, Wolbachia, Ehrlichia canis; (Cluster ID 10911) nitrite reductase in uncultured bacterium; (Cluster ID 1266) outer membrane lipoprotein in Borrelia; (Cluster ID 8595) methyl-coenzyme M reductase subunit A in uncultured archaeon; (Cluster ID 2959) outer membrane protein in Helicobacter. Known nonprokaryotic only: (Cluster ID 2226) Pol polyprotein HIV sequences; (Cluster ID 4023) maturase K; (Cluster ID 6257) NADH dehydrogenase subunit 2; (Cluster ID 8644) HIV protease; (Cluster ID 12196) MHC class I and II antigens. GOS and known prokaryotic only: (Cluster ID 3369) carbamoyl transferase; (Cluster ID 688) apolipoprotein N-acyltransferase; (Cluster ID 3726) potassium uptake proteins; (Cluster ID 300) primosomal protein N′; (Cluster ID 4605) DNA polymerase III delta subunit. GOS and known nonprokaryotic only: (Cluster ID 186) seven transmembrane helix receptors; (Cluster ID 2069) zinc finger proteins; (Cluster ID 3092) MAP kinase; (Cluster ID 1413) potential mitochondrial carrier proteins; (Cluster ID 233) pentatricopeptide (PPR) repeat-containing protein. Known prokaryotic and known nonprokaryotic only: (Cluster ID 3510) immunoglobulin (and immunoglobulin-binding) proteins; (Cluster ID 600) expansin; (Cluster ID 50) pectin methylesterase; (Cluster ID 6492) lectin; (Cluster ID 986) BURP domain-containing protein. GOS and known prokaryotic and known nonprokaryotic: (Cluster ID 2568) ABC transporters; (Cluster ID 49) short-chain dehydrogenases; (Cluster ID 4294) epimerases; (Cluster ID 1239) AMP-binding enzyme; (Cluster ID 2630) envelope glycoprotein.

Neighbor functional linkage methods. For the sequences in each GOS-only cluster, we determined if neighboring ORFs occurring on the same strand had a similar biological process in the GO [49]. If this shared biological process of the neighbors occurred statistically more often than expected by chance, we inferred a potential operon linkage and a biological process term for the GOS-only cluster. This approach weighted ORFs by sequence similarity to reduce the skewing effect of sequences from highly related organisms.

For definition of linked ORFs, we collected pairs of same-strand ORF protein predictions with intergenic distances less than 500 bp. Negative distances were possible if the 5′ end of the downstream ORF in the pair occurred 5′ to the 3′ end of the upstream ORF. We used a probability function to estimate the probability that two putative genes belong to the same operon given their intergenic distance [47]. Because sequences come from a variety of unknown organisms, the probability distribution was created by averaging properties of 33 randomly chosen divergent genomes. The exact choice of genomes did not greatly affect the ability of the distribution to separate experimentally determined same-operon gene pairs from adjacent, same-strand gene pairs in different known operons annotated in a version of RegulonDB downloaded on March 29, 2005 [141].

We measured the functional linkage between two protein clusters by searching for all occurrences of nearby pairs of ORFs belonging to the two clusters of interest. Sufficiently close pairs were more likely to be encoded in the same operon. We devised a scoring mechanism to reward those pairs of clusters for which many divergent examples of likely operon pairs existed in the set of ORF pairs. For each pair of clusters, a weight was applied to the contribution of each pair of ORFs, and this was proportional to how similar the pair of ORFs was to other example pairs. Thus, many near-identical pairs of ORFs, likely from the same or similar species, are not overrepresented in the final cluster pair score, while conserved examples of neighboring position from more divergent sequences contribute an increased weight. The score for each cluster pair is calculated as:

\[
S(C_1 C_2) = 1 - \prod_{i=1}^{n} \left[ 1 - \Pr\!\left(O_{g_{i1} g_{i2}} \mid \mathrm{dist}\right) w_{i1}\, w_{i2} \right] \tag{4}
\]

where $S(C_1 C_2)$ is the linkage score of clusters $C_1$ and $C_2$. The probability $\Pr(O_{g_{i1} g_{i2}} \mid \mathrm{dist})$ that any two genes $g_{i1}$ from $C_1$ and $g_{i2}$ from $C_2$ are in an operon is dependent on the distance between them as calculated by [47], and is weighted according to the sequence weights $w_{i1}$ and $w_{i2}$ described below, for all example pairs i.

Figure 14. Receiver Operating Characteristic Curve Used to Evaluate Various Methods of Scoring Pairs of Clusters for Functional Similarity
Pairs of clusters with ≥1 example of neighboring ORFs and assigned GO terms were divided into a set of functionally related (true positive) and functionally unrelated (true negative) cluster pairs based on the similarity of their GO terms. The scoring methods evaluated are described in the text.
doi:10.1371/journal.pbio.0050016.g014

We calculated sequence weights in a manner similar to that used in progressive multiple sequence alignment [142]. Briefly, neighbor-joining trees were built for all clusters using the QuickJoin [143] and QuickTree programs [144] based on a distance matrix constructed from all-against-all BLAST scores within a cluster, normalized to self-scores. For those few clusters with more than 30,000 members, trees were not built. Instead, equal sequence weights for all members were assigned because of computational limitations. The root of each tree was placed at the midpoint of the tree by using the retree package in PHYLIP [145]. The individual sequence weights were then computed by summing the distance from each leaf to the root after dividing each branch's weight by the number of nodes in the subtree below it. Weights were normalized so that the sum of weights in any given tree was equal to 1.0. This weighting scheme is superior to one in which weights are normalized to the largest weight in the tree, one that does not weight sequences according to divergence, and one that only considers the number of example pairs seen (Figure 14). To compare the different scoring methods, pairs of clusters annotated with GO terms that contained adjacent ORFs in the data were gathered. These pairs were divided into functionally related and unrelated clusters based on a measure of GO term similarity (p-value ≤ 0.01) [146]. We evaluated scoring methods for the ability to recover
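The cluster-pair linkage score of Equation 4 is a noisy-OR style combination of weighted operon probabilities, and it can be sketched in a few lines. This is an illustrative sketch only: the function name and the input layout (one tuple per example ORF pair holding the operon probability and the two sequence weights) are assumptions, and the numbers in the usage example are made up rather than taken from the GOS data.

```python
from math import prod


def linkage_score(example_pairs) -> float:
    """Linkage score S(C1 C2) = 1 - prod_i [1 - Pr(operon | dist)_i * w_i1 * w_i2].

    example_pairs: iterable of (p_operon, w1, w2) tuples, one per pair of
    neighboring ORFs drawn from the two clusters (hypothetical layout).
    """
    return 1.0 - prod(1.0 - p * w1 * w2 for p, w1, w2 in example_pairs)


# Two made-up example pairs: (operon probability, weight in C1, weight in C2).
pairs = [(0.9, 0.5, 0.5), (0.8, 0.25, 0.5)]
score = linkage_score(pairs)
print(score)  # 1 - (1 - 0.225) * (1 - 0.1), i.e. about 0.3025
```

Because the weights within a tree sum to 1.0, a group of near-identical sequences shares a small total weight, so many near-duplicate example pairs move the score less than a few divergent ones.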
functionally similar pairs. In all analyses, linkages between clusters were ignored if there were fewer than five examples of cluster member ORFs adjacent to each other on a scaffold.

Function for novel families was inferred as follows.

(1) Assignment of GO terms to clusters. We downloaded the GO [49] database on September 21, 2005, from http://www.geneontology.org, along with the files gene_association.goa_uniprot and pfam2go.txt dated July 12, 2005. Only the biological process component of the ontology was considered. If a cluster had at least 10% of its redundant sequences annotated by the most abundant Pfam domain for that cluster, and that Pfam domain had a GO biological process term provided by the pfam2go mapping, then we assigned the cluster the GO term of its most abundant Pfam annotation. In addition, if at least 20% of a cluster's Uniprot GO annotations were the same, the cluster was assigned that GO term. For each cluster, redundant GO terms found on the same path to the root were removed.

(2) Identification of neighbors to GOS-only clusters. Neighbors of GOS-only clusters were defined as those clusters that had a cluster linkage score above a predetermined threshold (1 × 10^-6) and had at least five examples of cluster members adjacent to each other in the data. These neighbors were then screened for those that had been annotated with a GO term by the process described above.

(3) Overrepresentation of neighbor GO terms. We attempted to define GO terms for a set of GOS-only neighbors that were statistically overrepresented. Because of the highly dependent nature of the terms in the GO, a simulation-based approach was chosen to determine which terms might be overrepresented. Annotated neighbors to a cluster of unknown function were identified as described above. For each annotated neighbor, counts for the associated GO term and all terms on the path to the root of the ontology were incremented. A total of 100,000 simulated neighbor lists of the same size as the true neighbor list were computed by selecting without replacement from those clusters with annotated GO terms, and an identical counting scheme was performed for each simulation. Overrepresentation of neighbor terms was calculated for each term on the ontology by asking how many times out of the 100,000 simulations the count for each GO term in the ontology met or exceeded the observed count for the actual neighbors. This fraction of simulations was interpreted as a p-value. If a term is unusually prevalent in the true observed neighbors, it should be relatively infrequent in the simulated data. For the purpose of the metric used here, "is-a" and "part-of" relationships were treated equally. In cases where a cluster had more than one GO term assigned to it, any redundant terms occurring on each other's path to the root were first removed. For any remaining clusters with nonredundant, multiple GO annotations, all possible lists of functions for each list of neighbor clusters were enumerated, and one function from each cluster was chosen. Each node in the ontology was assigned the maximum count observed from the enumerated function lists. We consistently applied this rule for the observed and simulated data.

The following descriptive measures of the novel GOS-only cluster set were obtained. Transmembrane helix prediction was carried out with the programs TMHMM [147] and SPLIT4 [148]. GC content was calculated as (G + C)/(G + C + A + T) bases for each ORF in a cluster, and averaged for each cluster within a set. The GC content, reported as the mean and standard deviation of the cluster averages, is as follows for each cluster set: Group I, 36.7% ± 8.0%; Group II, 35.9% ± 7.9%; Group I size-matched sample, 48.8% ± 11.1%; Group II size-matched sample, 49.5% ± 11.2%; Group I viral fraction, 37.8% ± 5.1%; Group II viral fraction, 37.3% ± 4.6%. To address the interconnectivity of the novel clusters within the context of all operon linkages, we constructed a graph with clusters as nodes and inferred operon linkages (with score ≥ 1 × 10^-6) as edges. We then asked for every node in the set of novel clusters what was the cumulative fraction of novel nodes that could be reached within a varying edge distance from the starting node. The expectation of this fraction was calculated at each distance, and the procedure was repeated for the set of size-matched clusters (Figure 15).

Figure 15. Novel GOS-Only Clusters Are More Interconnected Than a Size-Matched Sample of Clusters
Red line, novel clusters; green line, size-matched sample; blue line (right axis), log2 ratio of fraction novel clusters recovered divided by fraction sample clusters recovered.
doi:10.1371/journal.pbio.0050016.g015

We tried three different BLAST-based approaches for kingdom assignment of ORFs. The first method, used in the analysis, required a majority of the four top BLAST matches to vote for the same kingdom (archaea, bacteria, eukaryota, or viruses; see Materials and Methods [Kingdom assignment strategy and its evaluation]). The second method required all eight top BLAST matches to vote for the same kingdom. The last method we used was the scaffold-based kingdom assignment described in Materials and Methods (Kingdom assignment strategy and its evaluation). Figure 16 shows the results of using these assignments to infer the kingdom of GOS-only clusters (Figure 16D–16F) and their neighboring ORFs (Figure 16A–16C). GOS-only clusters were assigned a kingdom only if >50% of their neighboring ORFs were assigned the same kingdom. The general trends observed are the same for each method, though the coverage decreases slightly for the more stringent methods.

Characteristics and kingdom distribution of known protein domains. For these analyses we used the predicted proteins from the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The public dataset contains multiple identical copies of some sequences due to overlaps between the source datasets. For example, many sequences in PG are also found in NCBI-nr. We filtered the public set at 100% identity to avoid overcounting these sequences. Because this filtering was necessary for the public dataset, the GOS dataset was also filtered at 100% identity. If two or more sequences were 100% identical at the residue level, but were of different lengths, only the longest sequence was kept. The resulting datasets of nonredundant proteins are referred to as public-100 and GOS-100.

We assigned each protein in public-100 to a kingdom based on the species annotations provided in the source datasets (NCBI-nr, Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to determine the kingdom of each species. Of 3,167,979 protein sequences in public-100, 3,158,907 can be annotated by kingdom. The remaining 9,072 sequences are largely synthetic.

Determining the kingdom of origin of an environmental sequence can be difficult; while an unambiguous assignment can be made for some sequences, others can be assigned only tentatively or not at all. Therefore, we took a probabilistic approach (kingdom-weighting method), calculating "weights" or probabilities that each protein sequence originated from a given kingdom.

The top four BLAST matches (E-value < 1 × 10^-10) of GOS ORFs to NCBI-nr were considered. The kingdom of origin for each match was determined. We pooled these "kingdom votes" for each scaffold, since (presuming accurate assembly) each scaffold must come from a single species and hence from a single kingdom. Each ORF on a scaffold contributed up to four votes. If an ORF had fewer than four BLAST matches with an E-value < 1 × 10^-10, then it contributed fewer votes. ORFs with no BLAST matches contributed no votes.

In many cases, the votes were not unanimous, indicating that some uncertainty must be associated with any kingdom assignment. An additional source of uncertainty is the finite number of votes. We accounted for these statistical issues by applying the following procedure to each scaffold. First, two pseudocounts were added to the votes for the "unknown" kingdom to represent the uncertainty that remains even when votes are unanimous (especially when there are few votes). The frequency of votes for each kingdom was calculated. The vote frequency for a kingdom provides the maximum likelihood estimate of the kingdom probability (i.e., the vote frequency that would have been observed on a scaffold of similar composition but with infinitely many voting ORFs). However, that estimate may not be accurate or precise. Therefore, the multinomial standard deviation was calculated for each vote frequency p as SQRT[p × (1 − p)/(n − 1)], where n is the number of votes. A distance of two standard deviations from the mean corresponds roughly to a 95% confidence interval. Thus, two standard deviations were subtracted
doi:10.1371/journal.pbio.0050016.t015
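The first steps of the kingdom-weighting procedure (pooling votes per scaffold, adding two pseudocounts to "unknown", and computing each vote frequency p with standard deviation SQRT[p × (1 − p)/(n − 1)]) can be sketched as follows. The function name and data layout are hypothetical, and the sketch assumes that n counts the pseudocounts as votes, which the text does not state. It stops at the standard deviation because the description of the following step is cut off in this copy of the text.

```python
from collections import Counter
from math import sqrt


def kingdom_vote_stats(votes, pseudocounts: int = 2):
    """Return {kingdom: (vote frequency p, standard deviation)} for one scaffold.

    votes: list of kingdom labels pooled over all ORFs on the scaffold.
    Assumption: the pseudocounts added to "unknown" are included in n.
    """
    counts = Counter(votes)
    counts["unknown"] += pseudocounts
    n = sum(counts.values())
    stats = {}
    for kingdom, count in counts.items():
        p = count / n
        sd = sqrt(p * (1.0 - p) / (n - 1))
        stats[kingdom] = (p, sd)
    return stats


# Made-up scaffold: three ORFs, each contributing up to four BLAST votes.
votes = ["bacteria"] * 8 + ["archaea"] * 2
for kingdom, (p, sd) in kingdom_vote_stats(votes).items():
    print(f"{kingdom}: p = {p:.3f}, sd = {sd:.3f}")
```

Even a unanimous scaffold keeps a nonzero "unknown" frequency under this scheme, and that frequency shrinks as more ORFs vote.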
eukaryotic sequences, 73%; public bacterial sequences, 14%; GOS-eukaryotic sequences, 2%; GOS-bacterial sequences, 10%; and GOS-viral and GOS-unknown sequences, less than 1% combined.

For the type II GS family, sequences in GOS and NCBI-nr were searched with a type II GS HMM constructed from 17 previously known bacterial and eukaryotic type II GS sequences. Matching sequences from NCBI-nr and GOS were filtered separately for redundancy at 98% identity; the combined set of sequences was aligned and a neighbor-joining tree was constructed.

For the RuBisCO family, matching RuBisCO sequences from GOS and NCBI-nr were filtered separately for redundancy at 90% identity, resulting in 724 sequences in total. The 724 RuBisCO sequences were then aligned and a neighbor-joining tree was constructed.

Identification of proteases. We clustered sequences in the MEROPS Peptidase Database [100] using CD-HIT [116,117] at 40% similarity level. This resulted in 7,081 sequences, which were then divided into groups based on catalytic type and Clan identifier. These sequences were used as queries to search against a clustered version of NCBI-nr (clustered at 60% similarity threshold) using BLASTP (E-value ≤ 1 × 10^-10). A similar search was carried out against GOS (clustered at 60% similarity threshold). Figure 17 shows the content of protease types in NCBI-nr and GOS together with the kingdom distributions. Figure 18 shows the content of bacterial protease clans.

Metabolic enzymes in GOS. Hmmsearch from the HMMER package [105] was used to search the GOS sequences for different GS types. The GlnA TIGRFAM model was used for finding GSI sequences. The HMMs built from known examples of 17 GSII and 18 GSIII sequences from NCBI-nr were used to search the GOS sequences.

Supporting Information

Protocol S1. Supplementary Information
Found at doi:10.1371/journal.pbio.0050016.sd001 (25 KB DOC).

Accession Numbers

All NCBI-nr sequences from February 10, 2005 were used in our analysis. Protocol S1 lists the GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers of (1) the genomic sequences used in the PG set, (2) the sequences used in building GS profiles, and (3) the NCBI-nr sequences used in building the IDO phylogeny. The other GenBank sequences discussed in this paper are Bacillus sp. NRRL B-14911 (89089741), Janibacter sp. HTCC2649 (84385106), Erythrobacter litoralis (84785911), and Nitrosococcus oceani (76881875). The Pfam (http://pfam.cgb.ki.se) structures discussed in this paper are envelope glycoprotein GP120 (PF00516), reverse transcriptase (PF00078), retroviral aspartyl protease (PF00077), bacteriophage T4-like capsid assembly protein (Gp20) (PF07230), major capsid protein Gp23 (PF07068), phage tail sheath protein (PF04984), IDO (PF01231), poxvirus A22 protein family (PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM (http://www.tigr.org/TIGRFAMs) used in the paper is GlnA: glutamine synthetase, type I (TIGR00653). The PDB (http://www.rcsb.org/pdb) identifiers and the names of the eight PDB ORFans with GOS matches are: restriction endonuclease MunI (1D02), restriction endonuclease BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction endonuclease HincII (1TX3), alpha-glucosyltransferase (1Y8Z), hypothetical protein PA1492 (1T1J), putative protein (1T6T), and hypothetical protein AF1548 (1Y88).

Acknowledgments
Identification of ORFans in NCBI-nr. ORFans are proteins that do We are indebted to a large group of individuals and groups for
not have any recognizable homologs in known protein databases. A facilitating our sampling and analysis. We thank the governments of
straightforward way to identify ORFans is through all-against-all Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and
sequence comparison using relaxed match parameters. However, this French Polynesia/France for facilitating sampling activities. All
is not computationally practical. An effective approach is to first sequencing data collected from waters of the above-named countries
remove the non-ORFans that can be easily found, and then to identify remain part of the genetic patrimony of the country from which they
ORFans from the remaining sequences. were obtained. We also acknowledge TimeLogic (Active Motif, Inc.)
We identified non-ORFans by clustering the NCBI-nr with CD-HIT and in particular Chris Hoover and Joe Salvatore for helping make
[116,117], an ultrafast sequence clustering program. A multistep the DeCypher system available to us; the Department of Energy for
iterated clustering was performed with a series of decreasing use of their NERSC Seaborg compute cluster; Marty Stout, Randy
similarity thresholds. NCBI-nr was first clustered to NCBI-nr90, Doering, Tyler Osgood, Scott Collins, and Marshall Peterson (J. Craig
where sequences with .90% similarities were grouped. NCBI-nr90 Venter Institute) for help with the compute resources; Peter Davies
was then clustered to NCBI-nr80/70/60/50 and finally NCBI-nr30. and Saul Kravitz (J. Craig Venter Institute) for help with data
After each clustering stage, the total number of clusters of NCBI-nr accessibility issues; Kelvin Li and Nelson Axelrod (J. Craig Venter
was decreased and non-ORFans were identified. A one-step clustering Institute) for discussions on data formats; K. Eric Wommack
from NCBI-nr directly to NCBI-nr30 can be performed. However, the (University of Delaware, Newark) and the captain and crew of the
multistep clustering is computationally more efficient. R/V Cape Henlopen for their assistance in field collection of
At the 30% similarity level, all the NCBI-nr proteins were grouped Chesapeake Bay virioplankton samples; John Glass (J. Craig Venter
into 391,833 clusters, including 259,571 singleton clusters. The Institute) for assistance with the collection and processing of the
proteins in nonsingleton clusters are by definition non-ORFans. virioplankton samples; Beth Hoyle and Laura Sheahan (J. Craig
However, proteins that remain as singletons are not necessarily Venter Institute) for help with paper editing; and Matthew LaPointe
ORFans, because their similarity to other proteins may not be and Jasmine Pollard (J. Craig Venter Institute) for help with figure
reported for two reasons: (1) significant sequence similarity can be formatting. STM, MPJ, CvB, DAS, and SEB acknowledge Kasper
,30%; and (2) in order to prevent a cluster from being too diverse, Hansen for statistical advice. We also acknowledge the reviewers for
CD-HIT, like all other clustering algorithms, may not add a sequence their valuable comments.
to that cluster even if the similarity between this sequence and a Author contributions. SY contributed to the design and imple-
sequence in that cluster meet the similarity threshold. mentation of the clustering process, and the subsequent analyses of
The 259,571 singletons were compared to NCBI-nr with BLASTP the clusters; he also contributed to and coordinated all of the analyses
[38] to identify real ORFans from them. The default low-complexity in the paper, and wrote a large portion of the paper. GS contributed
to the design and analysis of the clustering process, contributed ideas, analysis, and also wrote parts of the paper. DBR identified ORFs from the assemblies, performed the all-against-all BLAST searches, contributed to GOS kingdom assignment, and contributed analysis tools and ideas. ALH performed the assembly of GOS sequences, and contributed analysis tools and ideas. SW contributed to the analysis of viral sequences. KR contributed to project planning and paper writing. JAE performed the analysis of UV damage repair enzymes, and also contributed to paper writing. KBH, RF, and RLS contributed to project planning. GM performed the profile HMM searches, carried out the domain analysis, and contributed to paper writing. WL and AG carried out the ORFan analysis and contributed to paper writing. LJ contributed to the profile-profile search process. PC and AG carried out the analysis of proteases and contributed to paper writing. CSM, HL, and DE carried out the analysis of novel clusters, the analysis of metabolic enzymes and contributed to paper writing. YZ contributed to the profile HMM searches and domain analysis. STM, MPJ, CvB, DAS, and SEB carried out the analysis of Pfam domain distributions in GOS and current proteins, analysis of IDO, contributed to GOS kingdom assignment, and also contributed to paper writing. DAS and SEB also contributed to the Ka/Ks test. JMC and SEB carried out the analysis on the implications for structural genomics and contributed to paper writing. SL, KN, SST, and JED carried out the phosphatase analysis and contributed to paper writing. SST and JED also contributed to project planning. BJR and VB contributed to the analysis of cluster size distribution, family discovery rate, and contributed to paper writing. MF contributed to paper writing, project planning, and ideas for analysis. JCV conceived and coordinated the project, and supplied ideas.
Funding. The authors acknowledge the Department of Energy Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study. GM acknowledges funding from the Razavi-Newman Center for Bioinformatics and was also supported by National Cancer Institute grant P30 CA014195. PC was partially supported by a Center for Proteolytic Pathways (CPP)–National Institutes of Health (NIH) grant 5U54 RR020843–02. CSM, HL, and DE acknowledge the support of DOE Biological and Environmental Research (BER). SL and JED were supported by research grants from NIH. BJR was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. Support for the Brenner lab work was provided by NIH K22 HG00056 and an IBM Shared University Research grant. STM was supported by NIH Genomics Training Grant 5T32 HG00047. MPJ was supported by NIH P20 GM068136 and NIH K22 HG00056. CvB was supported in part by the Haas Scholars Program. DAS was supported by a Howard Hughes Medical Institute Predoctoral Fellowship. JMC was supported by NIH grant R01 GM073109, and by the US Department of Energy Genomics: GTL program through contract DE-AC02-05CH11231.
Competing interests. The authors have declared that no competing interests exist.
analysis: Probabilistic models of proteins and nucleic acids. New York: Cambridge University Press. 356 p.
42. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512.
43. Barabasi AL, Oltvai ZN (2004) Network biology: Understanding the cell’s functional organization. Nat Rev Genet 5: 101–113.
44. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) Three Prochlorococcus cyanophage genomes: Signature features and ecological interpretations. PLoS Biol 3: e144.
45. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
46. Strong M, Mallick P, Pellegrini M, Thompson MJ, Eisenberg D (2003) Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: A combined computational approach. Genome Biol 4: R59.
47. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, et al. (2004) Prolinks: A database of protein functional linkages derived from coevolution. Genome Biol 5: R35.
48. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. (2005) STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433–D437.
49. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
50. Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, et al. (2004) Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci U S A 101: 11013–11018.
51. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
52. Paul JH, Sullivan MB (2005) Marine phage genomics: What have we learned? Curr Opin Biotechnol 16: 299–307.
53. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
54. Daubin V, Ochman H (2004) Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res 14: 1036–1042.
55. Hsiao WW, Ung K, Aeschliman D, Bryan J, Finlay BB, et al. (2005) Evidence of a large novel gene pool associated with prokaryotic genomic islands. PLoS Genet 1: e62.
56. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.
57. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. (2002) Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99: 14250–14255.
58. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, et al. (2003) Origins of highly mosaic mycobacteriophage genomes. Cell 113: 171–182.
59. Wilson GA, Bertrand N, Patel Y, Hughes JB, Feil EJ, et al. (2005) Orphans as taxonomically restricted and ecologically important genes. Microbiology 151: 2499–2501.
60. Takami H, Takaki Y, Uchiyama I (2002) Genome sequence of Oceanobacillus iheyensis isolated from the Iheya Ridge and its unexpected adaptive capabilities to extreme environments. Nucleic Acids Res 30: 3927–3935.
61. Wellcome Trust Sanger Institute (2005) Pfam db [database]. Release 17. Cambridge (U.K.): Wellcome Trust Sanger Institute. Available: http://www.sanger.ac.uk/Software/Pfam.
62. Mellor AL, Munn DH (2004) IDO expression by dendritic cells: Tolerance and tryptophan catabolism. Nat Rev Immunol 4: 762–774.
63. Suzuki T, Yokouchi K, Kawamichi H, Yamamoto Y, Uda K, et al. (2003) Comparison of the sequences of Turbo and Sulculus indoleamine dioxygenase-like myoglobin genes. Gene 308: 89–94.
64. Fallarino F, Asselin-Paturel C, Vacca C, Bianchi R, Gizzi S, et al. (2004) Murine plasmacytoid dendritic cells initiate the immunosuppressive pathway of tryptophan catabolism in response to CD200 receptor engagement. J Immunol 173: 3748–3754.
65. Hayashi T, Beck L, Rossetto C, Gong X, Takikawa O, et al. (2004) Inhibition of experimental asthma by indoleamine 2,3-dioxygenase. J Clin Invest 114: 270–279.
66. Muller AJ, DuHadaway JB, Donover PS, Sutanto-Ward E, Prendergast GC (2005) Inhibition of indoleamine 2,3-dioxygenase, an immunoregulatory target of the cancer suppression gene Bin1, potentiates cancer chemotherapy. Nat Med 11: 312–319.
67. Burley SK, Bonanno JB (2003) Structural genomics. Methods Biochem Anal 44: 591–612.
68. Blundell TL, Mizuguchi K (2000) Structural genomics: An overview. Prog Biophys Mol Biol 73: 289–295.
69. Brenner SE (2001) A tour of structural genomics. Nat Rev Genet 2: 801–809.
70. Montelione GT (2001) Structural genomics: An approach to the protein folding problem. Proc Natl Acad Sci U S A 98: 13488–13489.
71. Chance MR, Bresnick AR, Burley SK, Jiang JS, Lima CD, et al. (2002) Structural genomics: A pipeline for providing structures for the biologist. Protein Sci 11: 723–738.
72. Chandonia JM, Brenner SE (2006) The impact of structural genomics: expectations and outcomes. Science 311: 347–351.
73. Chandonia JM, Brenner SE (2005) Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 58: 166–179.
74. Chandonia JM, Brenner SE (2005) Update on the Pfam5000 strategy for selection of structural genomics targets. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China 27: 751–755.
75. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294: 93–96.
76. Service R (2005) Structural biology. Structural genomics, round 2. Science 307: 1554–1558.
77. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
78. Friedberg E (1985) DNA repair. New York: W. H. Freeman and Co. 614 p.
79. Sancar GB (2000) Enzymatic photoreactivation: 50 years and counting. Mutat Res 451: 25–37.
80. Bowman KK, Sidik K, Smith CA, Taylor JS, Doetsch PW, et al. (1994) A new ATP-independent DNA endonuclease from Schizosaccharomyces pombe that recognizes cyclobutane pyrimidine dimers and 6–4 photoproducts. Nucleic Acids Res 22: 3026–3032.
81. Setlow P (2001) Resistance of spores of Bacillus species to ultraviolet light. Environ Mol Mutagen 38: 97–104.
82. Morikawa K, Ariyoshi M, Vassylyev D, Katayanagi K, Nakamura H, et al. (1994) Crystal structure of T4 endonuclease V. An excision repair enzyme for a pyrimidine dimer. Ann N Y Acad Sci 726: 198–207.
83. Piersen CE, Prince MA, Augustine ML, Dodson ML, Lloyd RS (1995) Purification and cloning of Micrococcus luteus ultraviolet endonuclease, an N-glycosylase/abasic lyase that proceeds via an imino enzyme-DNA intermediate. J Biol Chem 270: 23475–23484.
84. Hunter T (1995) Protein kinases and phosphatases: The yin and yang of protein phosphorylation and signaling. Cell 80: 225–236.
85. Kennelly PJ (2001) Protein phosphatases - A phylogenetic perspective. Chem Rev 101: 2291–2312.
86. Leroy C, Lee SE, Vaze MB, Ochsenbien F, Guerois R, et al. (2003) PP2C phosphatases Ptc2 and Ptc3 are required for DNA checkpoint inactivation after a double-strand break. Mol Cell 11: 827–835.
87. Meskiene I, Baudouin E, Schweighofer A, Liwosz A, Jonak C, et al. (2003) Stress-induced protein phosphatase 2C is a negative regulator of a mitogen-activated protein kinase. J Biol Chem 278: 18945–18952.
88. Takekawa M, Maeda T, Saito H (1998) Protein phosphatase 2Calpha inhibits the human stress-responsive p38 and JNK MAPK pathways. EMBO J 17: 4744–4752.
89. Warmka J, Hanneman J, Lee J, Amin D, Ota I (2001) Ptc1, a type 2C Ser/Thr phosphatase, inactivates the HOG pathway by dephosphorylating the mitogen-activated protein kinase Hog1. Mol Cell Biol 21: 51–60.
90. Bork P, Brown NP, Hegyi H, Schultz J (1996) The protein phosphatase 2C (PP2C) superfamily: Detection of bacterial homologues. Protein Sci 5: 1421–1425.
91. Das AK, Helps NR, Cohen PT, Barford D (1996) Crystal structure of the protein serine/threonine phosphatase 2C at 2.0 Å resolution. EMBO J 15: 6798–6809.
92. Jackson MD, Fjeld CC, Denu JM (2003) Probing the function of conserved residues in the serine/threonine phosphatase PP2Calpha. Biochemistry 42: 8513–8521.
93. Novakova L, Saskova L, Pallova P, Janecek J, Novotna J, et al. (2005) Characterization of a eukaryotic type serine/threonine protein kinase and protein phosphatase of Streptococcus pneumoniae and identification of kinase substrates. FEBS J 272: 1243–1254.
94. Obuchowski M, Madec E, Delattre D, Boel G, Iwanicki A, et al. (2000) Characterization of PrpC from Bacillus subtilis, a member of the PPM phosphatase family. J Bacteriol 182: 5634–5638.
95. Boitel B, Ortiz-Lombardia M, Duran R, Pompeo F, Cole ST, et al. (2003) PknB kinase activity is regulated by phosphorylation in two Thr residues and dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase, in Mycobacterium tuberculosis. Mol Microbiol 49: 1493–1508.
96. Chopra P, Singh B, Singh R, Vohra R, Koul A, et al. (2003) Phosphoprotein phosphatase of Mycobacterium tuberculosis dephosphorylates serine-threonine kinases PknA and PknB. Biochem Biophys Res Commun 311: 112–120.
97. Yeats C, Finn RD, Bateman A (2002) The PASTA domain: A beta-lactam-binding domain. Trends Biochem Sci 27: 438.
98. Schweighofer A, Hirt H, Meskiene I (2004) Plant PP2C phosphatases: Emerging functions in stress signaling. Trends Plant Sci 9: 236–243.
99. Barrett AJ, Rawlings ND, Woesner JF, editors (2004) Handbook of proteolytic enzymes. Amsterdam: Elsevier. 2,140 p.
100. Rawlings ND, Morton FR, Barrett AJ (2006) MEROPS: The peptidase database. Nucleic Acids Res 34: D270–D272.
101. Kumada Y, Benson DR, Hillemann D, Hosted TJ, Rochefort DA, et al. (1993) Evolution of the glutamine synthetase gene, one of the oldest existing and functioning genes. Proc Natl Acad Sci U S A 90: 3009–3013.
102. Valentine RC, Shapiro BM, Stadtman ER (1968) Regulation of glutamine
synthetase. XII. Electron microscopy of the enzyme from Escherichia coli. Biochemistry 7: 2143–2152.
103. Almassy RJ, Janson CA, Hamlin R, Xuong NH, Eisenberg D (1986) Novel subunit-subunit interactions in the structure of glutamine synthetase. Nature 323: 304–309.
104. Eisenberg D, Gill HS, Pfluegl GM, Rotstein SH (2000) Structure-function relationships of glutamine synthetases. Biochim Biophys Acta 1477: 122–145.
105. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755–763.
106. Carlson T, Chelm B (1986) Apparent eukaryotic origin of glutamine synthetase II from the bacterium Bradyrhizobium japonicum. Nature 322: 568–570.
107. Hosted TJ, Rochefort DA, Benson DR (1993) Close linkage of genes encoding glutamine synthetases I and II in Frankia alni CpI1. J Bacteriol 175: 3679–3684.
108. Deuel TF, Ginsburg A, Yeh J, Shelton E, Stadtman ER (1970) Bacillus subtilis glutamine synthetase. Purification and physical characterization. J Biol Chem 245: 5195–5205.
109. Fisher SH, Sonenshein AL (1984) Bacillus subtilis glutamine synthetase mutants pleiotropically altered in glucose catabolite repression. J Bacteriol 157: 612–621.
110. Ellis RJ (1979) The most abundant protein in the world. Trends Biochem Sci 4: 241–244.
111. Hanson TE, Tabita FR (2001) A ribulose-1,5-bisphosphate carboxylase/oxygenase (RubisCO)-like protein from Chlorobium tepidum that is involved with sulfur metabolism and the response to oxidative stress. Proc Natl Acad Sci U S A 98: 4397–4402.
112. Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, et al. (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99: 9509–9514.
113. Li H, Sawaya MR, Tabita FR, Eisenberg D (2005) Crystal structure of a RuBisCO-like protein from the green sulfur bacterium Chlorobium tepidum. Structure (Camb) 13: 779–789.
114. Ashida H, Saito Y, Kojima C, Kobayashi K, Ogasawara N, et al. (2003) A functional link between RuBisCO-like protein of Bacillus and photosynthetic RuBisCO. Science 302: 286–290.
115. Fischer D, Eisenberg D (1999) Finding families for genomic ORFans. Bioinformatics 15: 759–762.
116. Li W, Jaroszewski L, Godzik A (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17: 282–283.
117. Li W, Jaroszewski L, Godzik A (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18: 77–82.
118. Bujnicki JM, Rychlewski L (2001) Identification of a PD-(D/E)XK-like domain with a novel configuration of the endonuclease active site in the methyl-directed restriction enzyme Mrr and its homologs. Gene 267: 183–191.
119. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. (2004) Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci 271: 565–574.
120. Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, et al. (2003) Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185: 6220–6223.
121. Cann AJ, Fandrich SE, Heaphy S (2005) Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 30: 151–156.
122. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31: 365–370.
123. Westbrook J, Feng Z, Chen L, Yang H, Berman HM (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res 31: 489–491.
124. Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, et al. (2003) The Protein Information Resource. Nucleic Acids Res 31: 345–347.
125. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2003) GenBank. Nucleic Acids Res 31: 23–27.
126. Stoesser G, Baker W, van den Broek A, Garcia-Pastor M, Kanz C, et al. (2003) The EMBL Nucleotide Sequence Database: Major new developments. Nucleic Acids Res 31: 17–22.
127. Miyazaki S, Sugawara H, Gojobori T, Tateno Y (2003) DNA Data Bank of Japan (DDBJ) in XML. Nucleic Acids Res 31: 13–16.
128. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, et al. (2002) Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol 3: RESEARCH0079.
129. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89: 10915–10919.
130. Ochman H (2002) Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes. Trends Genet 18: 335–337.
131. Nekrutenko A, Makova KD, Li WH (2002) The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: An empirical and simulation study. Genome Res 12: 198–202.
132. Li WH (1997) Molecular Evolution. Sunderland (MA): Sinauer Associates, Inc. 487 p.
133. Nei M, Kumar S (2000) Molecular evolution and phylogenetics. New York: Oxford University Press. 333 p.
134. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
135. Yang Z (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555–556.
136. Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449.
137. Huynen MA, van Nimwegen E (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15: 583–589.
138. Yanai I, Camacho CJ, DeLisi C (2000) Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification. Phys Rev Lett 85: 2641–2644.
139. Qian J, Luscombe NM, Gerstein M (2001) Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model. J Mol Biol 313: 673–681.
140. Unger R, Uliel S, Havlin S (2003) Scaling law in sizes of protein sequence families: From super-families to orphan genes. Proteins 51: 569–576.
141. Salgado H, Gama-Castro S, Martinez-Antonio A, Diaz-Peredo E, Sanchez-Solano F, et al. (2004) RegulonDB (version 4.0): Transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32: D303–D306.
142. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.
143. Mailund T, Pedersen CN (2004) QuickJoin - Fast neighbour-joining tree reconstruction. Bioinformatics 20: 3261–3262.
144. Howe K, Bateman A, Durbin R (2002) QuickTree: Building huge neighbour-joining trees of protein sequences. Bioinformatics 18: 1546–1547.
145. Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) 3.6 edition [computer program]. Seattle: Department of Genome Sciences, University of Washington, Seattle.
146. Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation. Bioinformatics 19: 1275–1283.
147. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol 305: 567–580.
148. Juretic D, Zoranic L, Zucic D (2002) Basic charge clusters and predictions of membrane protein topology. J Chem Inf Comput Sci 42: 620–632.
149. Joachimiak MP, Cohen FE (2002) JEvTrace: Refinement and variations of the evolutionary trace in JAVA. Genome Biol 3: RESEARCH0077.
150. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.
151. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18: 502–504.
152. Bruno WJ, Socci ND, Halpern AL (2000) Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol Biol Evol 17: 189–197.
PLoS BIOLOGY
The eukaryotic protein kinase (ePK) domain mediates the majority of signaling and coordination of complex events in
eukaryotes. By contrast, most bacterial signaling is thought to occur through structurally unrelated histidine kinases,
though some ePK-like kinases (ELKs) and small molecule kinases are known in bacteria. Our analysis of the Global
Ocean Sampling (GOS) dataset reveals that ELKs are as prevalent as histidine kinases and may play an equally
important role in prokaryotic behavior. By combining GOS and public databases, we show that the ePK is just one
subset of a diverse superfamily of enzymes built on a common protein kinase–like (PKL) fold. We explored this huge
phylogenetic and functional space to cast light on the ancient evolution of this superfamily, its mechanistic core, and
the structural basis for its observed diversity. We cataloged 27,677 ePKs and 18,699 ELKs, and classified them into 20
highly distinct families whose known members suggest regulatory functions. GOS data more than tripled the count of
ELK sequences and enabled the discovery of novel families and classification and analysis of all ELKs. Comparison
between and within families revealed ten key residues that are highly conserved across families. However, all but one
of the ten residues has been eliminated in one family or another, indicating great functional plasticity. We show that
loss of a catalytic lysine in two families is compensated by distinct mechanisms both involving other key motifs. This
diverse superfamily serves as a model for further structural and functional analysis of enzyme evolution.
Citation: Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5(3): e17. doi:10.1371/journal.pbio.0050017
PLoS Biology | www.plosbiology.org | S91 0467 Special Section from March 2007 | Volume 5 | Issue 3 | e17
Microbial Kinome
Author Summary
Table 1. The 20 PKL Families, Their Gene Counts in GOS and NCBI-nr, and Functional Notes
doi:10.1371/journal.pbio.0050017.t001
distinguished by its exclusive bacterial specificity, as opposed to the mostly eukaryotic ePK family. The other major cluster is centered on the large and divergent CAK (choline and aminoglycoside kinase) family, and includes three other families of small-molecule kinases. CAK itself is particularly diverse, containing subfamilies that are specific for choline/ethanolamine and aminoglycosides, as well as many novel subfamilies, some of which are specific to eukaryotic sublineages. A looser cluster is formed between the Rio and Bud32 families, which are universal among both eukaryotes and archaea, and the bacterial lipopolysaccharide kinase family KdoK. An additional four families (UbiB, revK, MalK, CapK) are distantly related to all three clusters, and are distinct from another set (PI3K, AlphaK, and IDHK), which have even less similarity to any other kinase; for PI3K and AlphaK, the relationship to kinases was determined by structural comparisons [11], while IDHK displays only conservation of the key residues and motifs found in all PKL kinases.
Sequence similarity between these 20 families varies from very low (~20%) to almost undetectable. Sequence-profile methods are generally required to align families within the oval clusters of Figure 1, while alignments between clusters require profile–profile methods. The diversity of this collection is demonstrated by comparison with the automated sequence- and profile-based clustering of the overall GOS analysis [22], which assigns 93% of these sequences into 32 clusters, each of which is largely specific to one of our 20 families.
Key Conserved Residues Unify Diverse Kinase Families
Comparison between all families reveals a set of ten key residues that not only account for one-third of the residues conserved within each family, but also are consistently
Figure 2. The Conserved Core and Variable Regions of the Catalytic Domain
The conserved core in three distinct families, namely ePK (PKA [52]), Rio (A. fulgidus Rio2 [14]), and CAK (APH(3′)-IIIa [12]). The conserved regions are
shown in ribbon representation and the variable regions in surface representation. The illustrations were created in PyMOL (http://www.pymol.org).
Some highly conserved residues (see Figure 3) and their associated interactions are shown.
doi:10.1371/journal.pbio.0050017.g002
conserved between families, constituting a core pattern of conservation that helps define this superfamily (Table 2, Figure 2, Figure 3). These residues are conserved across the major divisions of life, which diverged one to two billion years ago, and across diverse families, which presumably diverged even earlier. Thus, they are likely to mediate core functions of the catalytic domain rather than merely maintaining their structures. Six of these residues are known to be involved in ATP and substrate binding and catalysis (G52, K72, E91, D166, N171, D184; residues numbered based on PKA structure 1ATP except where otherwise noted; see Table 3). The full functions of the other four remain unclear, though three of them (H158, H164, and D220) are part of a hydrogen-bonding network that links the catalytically important DFG motif with substrate binding regions (Figure 2). The conservation of this network across diverse PKL structures suggested a role for this network in coupling DFG motif-associated conformational changes with substrate binding and release [16].

Despite this ancient conservation, different families of ePKs have lost individual members of this triad without destroying structure or catalytic function: H164 is changed to a tyrosine in PKA and many other AGC families; H158 is lost in most tyrosine kinases; and D220 is lost in the Pim family. The Pim1 structure retains an ePK-like structure, perhaps in part due to stabilization of the catalytic loop by the activation loop, a function normally performed by D220 [26], suggesting a novel mode of coupling ATP and substrate binding in this family. The individual loss of each member of this triad suggests that they have independent functions yet to be understood.

Sequence and Structural Diversity

Family-specific functions are mediated by features that are highly conserved within families, but that are divergent between families (Figure 4). Many family-selective residues map to the motifs surrounding the ten key residues, or to the divergent C-terminal substrate-binding region (Tables 2 and S1). The proximity of these residues to the active site suggests that they are key in selecting substrates or tuning mechanism of action. For instance, the 4–amino acid (aa) stretch between the HxD166 and N171 residues is highly conserved but distinct between families (Figure 4), and provides a discriminative signature that defines each family. Within ePKs, tyrosine and serine/threonine-specific kinases display distinct patterns of conservation within this 4-aa stretch [27]. Serine/threonine kinases conserve a [LI]KPx motif within this stretch, while tyrosine kinases conserve a [LI]AAR motif. These variations alter the surface electrostatics of the substrate-binding pocket, thereby contributing to substrate specificity [27].

The C-terminal region of ~100 aa following the DFG motif is highly divergent between families, apart from the conserved D220 at the beginning of the F-helix (Figure 2; Dataset S3). Secondary structure is generally predicted to be helical, but the poor sequence conservation and known structures [11] suggest that the overall orientation of the helices may be different between families. Notably, in the crystal structures of APH bound to its substrate, kanamycin [28], the relative positioning of the substrate-binding helices (αH–αI) is distinct from that of ePKs (Figure 2). The presence of unique patterns of conservation in each family (Table 2) also suggests that this region is involved in family-specific functions.

Several families contain sizeable (~30–100 aa) insert segments between core subdomains that are specific to clusters of families. Most CAK members have an insert segment between subdomains VIa and VIb. There is very little sequence similarity within this segment across CAK members, but structures of APH and ChoK indicate some structural similarity and highlight its role in substrate binding [28,29]. An equivalent insert is seen in the other CAK cluster families, FruK, HSK2, and MTRK. Similarly, KdoK and Rio contain an insert between subdomains II and III, which shows some sequence similarity between these families. In the Rio2 structure, this insert is disordered, but the presence of a conserved threonine suggests a possible regulatory role [14]. This region also contains an insert in the distinct UbiB family.
Figure 3. Conservation of Secondary Structure, Key Motifs, and Residues between Families
The ePK secondary structure is shown with standard annotations of subdomains [53] and structural elements. Subdomains I–IX are generally conserved
in all PKLs. Key residues are bolded and numbered; dashed lines point to positions within secondary structure elements. The table below shows the
conservation (% identity) of the ten key residues, showing their broad conservation across families, but the successful replacement of almost all of them
in at least one family. Parentheses indicate changes to another conserved residue and dashes indicate unconserved positions. Key residues are
numbered based on their position in PKA: G52, K72, E91, P104 (VPKA), H158, H164 (YPKA), D166, N171, D184, and D220. More detailed figures are shown
in Dataset S3.
doi:10.1371/journal.pbio.0050017.g003
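
The per-family conservation values (% identity) reported for the key residues can be illustrated with a minimal sketch. It assumes an alignment held as equal-length strings and follows the Methods in omitting gaps from the counts. The function names, the gap characters, and the three-sequence toy alignment are hypothetical and are not taken from the authors' custom script.

```python
from collections import Counter

def column_identity(column, gap_chars="-."):
    """Return (most common residue, fraction) for one alignment column.

    Gap characters are dropped before counting, so the fraction is
    relative to the number of aligned residues, not to all sequences.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return None, 0.0
    residue, count = Counter(residues).most_common(1)[0]
    return residue, count / len(residues)

def conserved_positions(alignment, threshold=0.9):
    """List (index, residue, fraction) for columns above the threshold."""
    hits = []
    for i, column in enumerate(zip(*alignment)):
        residue, frac = column_identity(column)
        if residue is not None and frac > threshold:
            hits.append((i, residue, frac))
    return hits

# Hypothetical toy alignment (three sequences, three columns).
toy_alignment = ["GKE", "GKD", "G-E"]
print(conserved_positions(toy_alignment))
```

In the toy alignment the gap in the second column is ignored, so that column still counts as fully conserved, while the third column (two E, one D) falls below the 90% threshold.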
Finally, the ePK, pknB, and HRK families contain an extended activation loop between subdomains VIII and IX. These kinases are generally activated by phosphorylation of this loop, the negative charge of which helps to coordinate key structural elements during the activation process, including a family-selective HRD arginine in the catalytic loop [30,31].

Mechanistic Diversity of the Catalytic Core

A surprising finding was that while ten key residues are conserved both within and between families, all but one of them were dispensable in one family or another (Figure 3), indicating that even catalytic residues are malleable in the appropriate context. Here we explore the effect of loss of the "catalytic lysine" K72, which typically positions the α and β phosphates of ATP (Figure 5A). Mutation of this lysine in ePKs is a common method to make inactive kinases [32]. Yet this residue is conserved as an arginine (R111ChoK) in most CAK subfamilies, as a methionine in the CAK-chloro subfamily, and as a threonine in the related HSK2 family (Figure 4).

In the two major CAK subfamilies with a conserved R72 (FadE and choline kinase [ChoK]), we see correlated changes in the glycine-rich and DFG motifs (Figure 4). Specifically, the Phe and Gly within the GxGxFG motif (F54 and G55) are changed to Ser/Thr and Asn, respectively (S86ChoK, N87ChoK), and G186 within the DFG motif is changed to E. Both the GxGxFG and DFG motifs are spatially proximal to K72 (Figure 5A). Thus, correlated changes in these two motifs could structurally account for the K-to-R change. Indeed, in the ChoK crystal structure [13], N55 protrudes into the ATP binding pocket, and hydrogen bonds to R72. In addition, the conserved E91 in helix C, which typically forms a salt bridge with K72, is hydrogen bonded (via a water molecule) to the covarying E186, thus linking these three correlated changes and stabilizing R72 in a unique conformation (Figure 5B). By contrast, the two solved APH structures (1ND4 and 2BKK) retain the "ancestral" sequence state with K72 and G186, and lack N55.

Mutation of R72 or E186 to alanine in ChoK reduces the catalytic rate by several fold [33]. To test the possible role of these residues in the ChoK catalytic mechanism, we modeled an ATP in the active site of ChoK (based on the nucleotide-bound structures of APH and PKA). This revealed that R72 partially occludes the ATP binding site and is likely to move upon ATP binding. Notably, a K72-to-R mutation in Erk2 [34] also exhibits a conformational change in R72 upon nucleotide
Table 2. Distribution of Residues That Are >90% Identical within Each Family

Family   Key residues   Surrounding motifs   C-terminal   Semi-conserved   Other      Total
ePK      8              4                    1            4                0          17
pknB     9              3                    0            8                0          20
BLRK     8              12                   4            4                18         46
HRK      5              2                    0            1                0          8
Bub1     8              7                    3            5                4          27
GLK      7              1                    2            2                0          12
Bud32    10             6                    3            2                2          23
Rio      9              4                    0            1                0          14
KdoK     2              0                    0            1                0          3
CAK      5              0                    0            0                0          5
HSK2     9              9                    11           7                4          40
FruK     9              6                    5            5                2          27
MTRK     9              10                   4            2                8          33
UbiB     8              6                    0            1                4          19
MalK     8              13                   15           0                10         46
revK     5              1                    2            0                5          13
CapK     1              0                    0            0                0          1
PI3K     4              3                    1            1                2          11
AlphaK   5              4                    6            1                7          23
IDHK     9              17                   10           1                23         60
Total    138 (31%)      108 (24%)            67 (15%)     46 (10%)         89 (20%)   448 (100%)

Across the ~250-aa domain, almost one-third of >90% conserved residues map to the ten key residues that are also conserved between families, and more than half map to these key residues or their surrounding motifs (GxGxxGxxxx, vaiK, E, vP, LxxLH, xxHxDxxxNxx, xxDxGxx, DLA; boldfacing indicates the ten key residues). An additional 15% are found in the largely unalignable region C-terminal of DxG, strongly suggesting family-specific functions, and 10% are semi-conserved, being found in some, but not most families. See Dataset S3 for details.
doi:10.1371/journal.pbio.0050017.t002
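
The totals and percentages in the last row of Table 2 can be checked from the per-family counts. The following is a minimal sketch: the counts are transcribed from the table, while the column labels (in particular "other" for the fifth column, which the caption does not name) and the function names are assumptions made here for illustration.

```python
# Counts of >90% identical residues per family, transcribed from Table 2.
# Column labels are assumptions inferred from the table caption.
COLUMNS = ("key", "motif", "c_terminal", "semi_conserved", "other")
TABLE2 = {
    "ePK": (8, 4, 1, 4, 0),    "pknB": (9, 3, 0, 8, 0),
    "BLRK": (8, 12, 4, 4, 18), "HRK": (5, 2, 0, 1, 0),
    "Bub1": (8, 7, 3, 5, 4),   "GLK": (7, 1, 2, 2, 0),
    "Bud32": (10, 6, 3, 2, 2), "Rio": (9, 4, 0, 1, 0),
    "KdoK": (2, 0, 0, 1, 0),   "CAK": (5, 0, 0, 0, 0),
    "HSK2": (9, 9, 11, 7, 4),  "FruK": (9, 6, 5, 5, 2),
    "MTRK": (9, 10, 4, 2, 8),  "UbiB": (8, 6, 0, 1, 4),
    "MalK": (8, 13, 15, 0, 10), "revK": (5, 1, 2, 0, 5),
    "CapK": (1, 0, 0, 0, 0),   "PI3K": (4, 3, 1, 1, 2),
    "AlphaK": (5, 4, 6, 1, 7), "IDHK": (9, 17, 10, 1, 23),
}

def column_totals(table):
    """Sum each residue class over all families."""
    return tuple(sum(row[i] for row in table.values())
                 for i in range(len(COLUMNS)))

def percent_shares(totals):
    """Express each class total as a rounded percentage of the grand total."""
    grand = sum(totals)
    return tuple(round(100 * t / grand) for t in totals)

totals = column_totals(TABLE2)
for name, total, pct in zip(COLUMNS, totals, percent_shares(totals)):
    print(f"{name:15s} {total:4d} {pct:3d}%")
print(f"{'total':15s} {sum(totals):4d}")
```

The key-residue and motif columns together account for 246 of the 448 residues, which is the basis for the caption's statement that more than half of the highly conserved residues map to the key residues or their surrounding motifs.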
binding (Figure 5C). A similar conformational change in ChoK upon ATP binding could result in formation of an R72–E91 salt bridge similar to the activation of ePKs (Figure 5A). In this conformation, R72 could potentially hydrogen bond to both E91 and the covarying E186 in ChoKs, which might explain the covariation of R72 and E186 in these families.

Variation on a Theme

Other CAK members display distinct coordinated changes at the G55, K72, and G186 positions. The chloro subfamily of CAK loses the positive charge at position 72 altogether, replacing it with methionine, and has concurrent changes to R55 and Q186 (Figure 4). This may reflect a shift of the positive charge from position 72 to 55, an event that also happened in Wnk kinases, the only functional ePK family that lacks K72. The conserved K55 of Wnks is required for catalysis and has been shown to interact with ATP similarly to K72 of PKA [35] (Figure 5D). Hence, two evolutionary inventions may have converted the same core motif residue from one function to another. In CAK-chloro, the unpaired E91 position loses its charge to become a conserved Phe. The function of this Phe is unknown, but is likely to be important since it is also conserved in HSK2, a related family, and the only other kinase family to conserve a Phe at the E91 position (Figure 4).

Evolution of Conformational Flexibility and Regulation in ePKs

The ePK catalytic domain is highly flexible and undergoes extensive conformational changes upon ATP binding [36]. In contrast, crystal structures of APH, solved in both ATP-bound and -unbound forms, revealed modest structural changes in the ATP-binding pocket [37]. This difference in conformational flexibility is reflected in the patterns of conservation at key positions within the ATP-binding glycine-rich loop (Figure 4). Specifically, two conserved glycines (G50 and G55), which contribute to the conformational flexibility of this loop in ePKs, are replaced by non-glycines in APH. These two glycines are absent in several PKL families (Figure 4) while G52, which is involved in catalysis, is present in most, suggesting that the conformational flexibility of the nucleotide-binding loop is a feature of selected PKL families such as ePKs. Since conformational flexibility allows for regulation, it is likely that modest structural changes associated with nucleotide binding gradually evolved into quite dramatic structural rearrangements required to ensure that key players in various signaling pathways act only at the right place and at the right time. The conserved glycine (G186) within the catalytically important DFG motif may likewise have evolved for regulatory functions in ePKs [38]. This glycine is highly conserved in the ePK cluster but is absent from most other
Figure 4. Sequence Logos Depicting Conservation of Core Motifs and Neighboring Sequences across Most Kinase Families and Selected CAK
Subfamilies
Motifs are GxGxxGxxxx, VAIK, E, LxxLH, xxHxDxxxxNxx, xxDFGxx, and Dxx. The size of the letters corresponds to their information content [54]. Families
with fewer than 100 members (BLRK, GLK) are omitted. The diverse CAK family is represented by four distinct subfamilies: APH contains many
aminoglycoside resistance kinases and ChoK includes most ChoKs, while FadE and chloro are less well described. For the HRK family, the first two motif
logos omit the viral subfamily that lacks these motifs.
doi:10.1371/journal.pbio.0050017.g004
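
The letter sizes in a sequence logo reflect the information content of each alignment column. As a minimal sketch, assuming a 20-letter amino acid alphabet and ignoring any small-sample correction that logo tools may apply, the information content of a column can be computed as log2(20) minus the Shannon entropy of the observed residue frequencies. The function name and gap characters below are hypothetical.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20, gap_chars="-."):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) - H, where H is the Shannon entropy
    of the residue frequencies. Gaps are ignored, and no small-sample
    correction is applied (a simplifying assumption).
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n)
                   for k in Counter(residues).values())
    return math.log2(alphabet_size) - entropy

# An invariant column carries the maximum log2(20), about 4.32 bits;
# a column split evenly between two residues carries one bit less.
print(round(information_content("KKKK"), 2))
print(round(information_content("KKRR"), 2))
```

Within a column, each letter's height is then its frequency multiplied by this column total, so an invariant key residue such as K72 would fill the whole stack.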
Residue   Location                        Function
G52       Glycine-rich loop               The backbone of this glycine coordinates the γ-phosphate of ATP and facilitates phosphoryl transfer [71].
K72       β3 strand                       Hydrogen bonds to the α-oxygen and β phosphate of ATP [72].
E91       C-helix                         Forms a salt bridge interaction with K72 and functions as a regulatory switch in ePKs [73].
P104      αC-β4 loop                      Unknown function. Absent from ePKs.
H158      C-terminus of E helix           Hydrogen bonds to D220 in the F-helix and is part of a hydrogen bond network that couples ATP and substrate-binding regions [16].
H(Y164)   Catalytic loop αE-β6            Hydrogen bonds to the backbone of the residue before the DFG-Asp and integrates substrate and ATP binding regions [16].
D166      Catalytic loop                  Catalytic base [72].
N171      Catalytic loop                  Coordinates the second Mg2+ ion and involved in phosphoryl transfer.
D184      N-terminus of activation loop   Coordinates the first Mg2+ ion.
D220      N-terminus of F helix           Hydrogen bonds to the backbone of the catalytic loop and positions this loop relative to substrate-binding regions [16].
doi:10.1371/journal.pbio.0050017.t003
PKL families. However, within the small subfamily of magnesium-dependent Mnk ePK kinases, G186 is changed to aspartate (DFD). In the Mnk2 crystal structure, this DFD motif adopts an "out" conformation in which F185 protrudes into the ATP-binding site. This is in contrast to the "in" conformation, where it packs up below the C-helix [39]. Mutation of the Mnk2 D186 "back" to glycine results in both in and out conformations of the DFG motif, supporting the role of G186 in DFG-associated conformational changes. Such conformational transitions may facilitate regulation of activity since the conformation of the catalytic aspartate is also changed during this transition [38]. This may also explain why the ePK-specific extended activation loop, which is phosphorylated and undergoes dramatic conformational changes, is directly attached to the DFG motif (Figure 6A).

In addition to the flexible catalytic core, the substrate-binding regions appear to have evolved for tight regulation of ePK activity. In particular, the conserved G helix, which was recently shown to undergo conformational changes upon substrate binding [40], is uniquely oriented in ePK/pknB (Figure 6A). Several ePK-conserved residues and motifs are at the interface between the G helix and the catalytic core (Figure 6B). These include the APE motif, located at the C-terminal end of the activation loop, a W-[SA]-X-[G] motif in the F-helix, and an arginine (R280) at the beginning of the I helix (Figure 6B). These three motifs structurally interact with each other and form a network that couples the substrate- and ATP-binding regions (Figure 6B). This network also involves conserved buried water molecules, which are known to contribute to the conformational flexibility of proteins [41]. Thus, this ePK/pknB-conserved network may also facilitate regulation by increasing the conformational flexibility of the substrate-binding regions [16].

Discussion

Data from the GOS voyage provide a huge increase in available sequences for most prokaryotic gene families, enabling new studies in discovery, classification, and evolutionary and structural analysis of a wide array of gene families. Even for a eukaryotic family such as ePK kinases, GOS provides insights by greatly increasing understanding of related PKL families. GOS increases the number of known ELK sequences more than 3-fold, and has enabled both the discovery of novel families of kinases as well as a detailed analysis of conservation patterns and subfamilies within known families. We believe that the GOS data, coupled with
the recent strong growth in whole-genome sequencing, provide the opportunity for similar insights into virtually every gene family with prokaryotic relatives.

PKL kinases are largely involved in regulatory functions, as opposed to the metabolic activities of other kinases with different folds [25]. The characteristics of this fold that lead to the explosion of diverse regulatory functions of eukaryotic ePKs have also been exploited for many different functions within prokaryotes. While these kinases reflect only ~0.25% of genes in both GOS and microbial genomes (ePKs represent ~2% of eukaryotic genes [42]), indicating a simpler prokaryotic lifestyle, they now outnumber the count of ~12,000 histidine kinases that we observe in GOS [22], suggesting that ELKs may be at least as important in bacterial cellular regulation as the "canonical" histidine kinases.

PKL kinases cross huge phylogenetic and functional spaces while still retaining a common fold and biochemical function of ATP-dependent phosphorylation. The presence of Rio and Bud32 genes in all eukaryotic and archaeal genomes suggests that at least this cluster dates back to the common ancestor of these domains of life. Similarly, the presence of UbiB in all eukaryotes and most bacterial groups, the close similarity of pknB/ePK families, and the widespread bacterial/eukaryotic distribution of FruK suggest their origins before the emergence of eukaryotes, or from an early horizontal transfer. Their ancient divergence leaves little or no trace of their shared structure within their protein sequence other than at functional motifs, which include a set of ten key residues that are highly conserved across all PKLs.

Despite the huge attention paid to ePKs, four key residues (P104, H158, H164, D220), three of which are highly conserved in ePKs, are still functionally obscure and worthy of greater attention, both in ELKs and ePKs. Conversely, it appears that nine of the ten key residues have been eliminated or transformed in individual families while maintaining fold and function, showing that almost anything is malleable in evolution given the right context. That right context is frequently a set of additional changes in the family-specific motifs surrounding these key residues, and we see that in the case of K72, a substitution to arginine triggers a cascade of other core substitutions that serve to retain basic function, while a substitution to methionine involves a shift of the positive charge normally provided by K72 to another conserved residue, in both CAK-chloro and Wnk kinases. Other core changes are also seen independently in very distinct families, such as the G55-to-A change in UbiB and the chloro subfamily of CAK, or the E91-to-F change in both chloro and HSK2, suggesting that these kinases are sampling a limited space of functional replacements.

These families vary greatly in diversity. While the ePK family has expanded to scores of deeply conserved functions [42], other families, including Bud32, Rio, Bub1, and UbiB, usually have just one or a handful of members per genome, suggesting critical function but an inability to innovate. The largely prokaryotic CAK family is also functionally and structurally diverse, containing several known functions and many distinct subfamilies likely to have novel functions. The diversity of both CAK and KdoK sequences may be related to their involvement in antibiotic resistance and immune evasion, likely to be evolutionarily accelerated processes. Comparison of CAK to the related and more functionally constrained HSK2, FruK, and MTRK families may reveal adaptive changes such as the ePK-specific flexibility changes that may assist in its diversity of functions.

GOS data are rich in highly divergent viral sequences, and accordingly we find a number of new subfamilies of viral kinases, including two of the three subfamilies of HRK and a subfamily of CapK. In both cases we see loss of N-terminal–conserved elements, suggesting that these kinases may have alternative functions or even act as inactive competitors to host kinases.

These patterns of sequence conservation and diversity raise many questions that can only be fully addressed by structural methods. The combination of structural and phylogenetic analysis for ChoK enabled insights that were not clear from the structure alone, and enabled us to reject other inferences from the crystal structure that were not conserved within this family, highlighting the value of combining these approaches. The relative ease of crystallization of PKL domains, the emergence of high-throughput structural genomics, and our understanding of the diversity of these families make them attractive targets for structure determination of selected members, and position this family as a model for analysis of deep structural and functional evolution.

Materials and Methods

Discovery and classification of kinase genes. Sequences used consisted of 17,422,766 open reading frames from GOS, 3,049,695 predicted open reading frames from prokaryotic genomes, and 2,317,995 protein sequences from NCBI-nr of February 10, 2005, as described [22]. Profile HMM searches were performed with a Time Logic Decypher system (Active Motif, http://timelogic.com) using in-house profiles for ePK, Haspin, Bub1, Bud32, Rio, ABC1 (UbiB), PI3K, and AlphaK domains, as well as Pfam profiles [43] for ChoK, APH, KdoK, and FruK, and TIGRFAM profiles [44] for HSK2 (thrB_alt), UbiB, and MTRK. A number (69) of additional ePK-annotated models from Superfamily 1.67 [45] were used to capture initial hits but not for further classification. Initial hits were clustered and rerun against all models, and each model was rebuilt and rerun three to seven times using ClustalW [46], MUSCLE [47], and hmmalign (http://hmmer.janelia.org) to align, followed by manual adjustment of alignments using Clustal and Pfaat [48] and model building with hmmbuild. Low-scoring members of each family (e > 1 × 10^-5) were used as seeds to build new putative families, and profile–profile and sequence–profile alignments were used to merge families into a minimal set (Dataset S2). A motif-based Markov chain Monte Carlo multiple alignment model [49] based on the conserved motifs of Figure 3 was run independently and used to verify HMM hits and seed new potential families for BLAST-based clustering, model building, and examination for conserved residues. Final family assignment was by scoring against the set of HMM models, with manual examination of sequences with borderline scores (e > 1 × 10^-5 or difference in e-values between best two models >.01).

Family annotations. Annotations of chromosomal neighbors used SMART [50] and a custom analysis of GOS neighbors ([22]; C. Miller, H. Li, D. Eisenberg, unpublished data). Annotation analysis was based on GenBank annotations and PubMed references. Taxonomic analysis used a mapping of GOS scaffolds to taxonomic groupings [22] and NCBI taxonomy tools.

Family alignments and logos. Residue conservation (Dataset S3) was counted from the final alignment using a custom script that omitted gap counts. These counts were then used to construct family logos using WebLogo (http://weblogo.berkeley.edu; [51]).

Family comparisons. Relatedness between families was estimated using several methods. HMM–HMM alignments and scores were computed using PRC (http://supfam.org/PRC), and sequence–profile alignments using hmmalign were analyzed using custom scripts and by inspection. Both full-length and motif multiple alignments were also created and used for the family comparisons.

Supporting Information

Dataset S1. FastA-Formatted Sequence Files for Each of the 20 Kinase Families, Including Both GOS and Public Sequences
Found at doi:10.1371/journal.pbio.0050017.sd001 (10 MB BZ2).
Dataset S2. HMM Profiles for the 20 Kinase Families in HMMer Format
Found at doi:10.1371/journal.pbio.0050017.sd002 (2.4 MB HMM).

Dataset S3. Domain Profiles for 20 PKL Families
These 20 spreadsheets show the conservation profile at each residue of the kinase domain for each family, including annotations and classifications of individual residues. Each worksheet details the alignment of one kinase family to its HMM. Every row corresponds to a position within the alignment, listing the four most common amino acids (aa) in that row along with their fractional popularity. The number of aa's and number of gaps at that position within the alignment is also listed. The "Notes" column annotates conservation status of selected residues and other notes, while the ">90% Conserved" column annotates those corresponding residues as to their class (Core, Motif, Motif-Associated, Semi-Conserved, C-terminal, Unique, or external to the kinase domain). A number of color highlights are used. (1) Positions with few aa's in the alignment (typically inserts within the domain that are not of great interest) are shaded gray: typically dark gray for 20 aa at that position, and light gray for >20 but still low (the range varies depending on the depth of the alignment). Rows highlighted in gray have no highlights in any other columns and are assumed not to be part of the core domain. (2) Core motifs are highlighted in bold and blue. (3) The fractional count for the most popular aa is labeled green if 1, dark yellow if >0.9, and light yellow if >0.8 and <0.9.
Found at doi:10.1371/journal.pbio.0050017.sd003 (1.6 MB XLS).

Accession Numbers

The Protein Databank (http://www.pdb.org) accession numbers for the structures discussed in this paper are PKA (1ATP), A. fulgidus Rio2 (1TQP), C. elegans choline kinase (1NW1), Erk2 (1GOL), Wnk1 (1T4H), and APH(3′)-IIIa (1J7L). The Pfam (http://pfam.cgb.ki.se) accession numbers for the profiles discussed in this paper are ChoK (PF01633.8), APH (PF01636.9), KdoK (PF06293.3), and FruK (PF03881.4). The TIGRFAM (http://www.tigr.org/TIGRFAMs) accession numbers for the profiles discussed in this paper are HSK2 (TIGR00938), UbiB (TIGR01982), and MTRK (TIGR01767).

Acknowledgments

We thank the Governments of Bermuda, Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, and French Polynesia for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. We thank Chris Miller, Huiying Li, and David Eisenberg for access to analyses of chromosomal neighbors for function prediction, and Doug Rusch, Shibu Yooseph, and other members of the Venter Institute GOS team for taxonomic predictions, geographic analysis, and other data and tools. We thank Eric Scheeff and Tony Hunter for critical comments and structural insights, and Nina Haste for help with PyMOL.

Author contributions. JCV proposed and enabled collaboration. SST conceived of collaboration and provided structural insights and critical evaluation. NK and GM conceived and designed the experiments. NK, YZ, and GM performed the experiments. NK and GM analyzed the data. YZ and JCV contributed reagents/materials/analysis tools. NK and GM wrote the paper.

Funding. This work was supported by funding from the Razavi-Newman Center for Bioinformatics to GM and National Institutes of Health grant IP01DK54441 to SST. We gratefully acknowledge the US Department of Energy (DOE) Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, and the J. Craig Venter Science Foundation for funding of the GOS expedition.

Competing interests. The authors have declared that no competing interests exist.
33. Yuan C, Kent C (2004) Identification of critical residues of choline kinase A2 from Caenorhabditis elegans. J Biol Chem 279: 17801–17809.
34. Robinson MJ, Harkins PC, Zhang J, Baer R, Haycock JW, et al. (1996) Mutation of position 52 in ERK2 creates a nonproductive binding mode for adenosine 5′-triphosphate. Biochemistry 35: 5641–5646.
35. Xu B, English JM, Wilsbacher JL, Stippec S, Goldsmith EJ, et al. (2000) WNK1, a novel mammalian serine/threonine protein kinase lacking the catalytic lysine in subdomain II. J Biol Chem 275: 16795–16801.
36. Akamine P, Madhusudan, Wu J, Xuong NH, Ten Eyck LF, et al. (2003) Dynamic features of cAMP-dependent protein kinase revealed by apoenzyme crystal structure. J Mol Biol 327: 159–171.
37. Thompson PR, Boehr DD, Berghuis AM, Wright GD (2002) Mechanism of aminoglycoside antibiotic kinase APH(3′)-IIIa: Role of the nucleotide positioning loop. Biochemistry 41: 7001–7007.
38. Levinson NM, Kuchment O, Shen K, Young MA, Koldobskiy M, et al. (2006) A SRC-like inactive conformation in the abl tyrosine kinase domain. PLoS Biol 4: e144.
39. Jauch R, Jakel S, Netter C, Schreiter K, Aicher B, et al. (2005) Crystal structures of the Mnk2 kinase domain reveal an inhibitory conformation and a zinc binding site. Structure 13: 1559–1568.
40. Dar AC, Dever TE, Sicheri F (2005) Higher-order substrate recognition of eIF2alpha by the RNA-dependent protein kinase PKR. Cell 122: 887–900.
41. Fischer S, Verma CS (1999) Binding of buried structural water increases the flexibility of proteins. Proc Natl Acad Sci U S A 96: 9613–9615.
42. Goldberg JM, Manning G, Liu A, Fey P, Pilcher KE, et al. (2006) The Dictyostelium kinome—Analysis of the protein kinases from a simple model organism. PLoS Genet 2: e38.
43. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. (2002) The Pfam protein families database. Nucleic Acids Res 30: 276–280.
44. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, et al. (2001) TIGRFAMs: A protein family resource for the functional identification of proteins. Nucleic Acids Res 29: 41–43.
45. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313: 903–919.
46. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.
47. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
48. Johnson JM, Mason K, Moallemi C, Xi H, Somaroo S, et al. (2003) Protein family annotation in a multiple alignment viewer. Bioinformatics 19: 544–545.
basic patch, and nucleotide exchange mechanisms in light of a canonical structure for Rab, Rho, Ras, and Ran GTPases. Genome Res 13: 673–692.
55. Higgins JM (2001) Haspin-like proteins: A new family of evolutionarily conserved putative eukaryotic protein kinases. Protein Sci 10: 1677–1684.
56. Facchin S, Lopreiato R, Ruzzene M, Marin O, Sartori G, et al. (2003) Functional homology between yeast piD261/Bud32 and human PRPK: Both phosphorylate p53 and PRPK partially complements piD261/Bud32 deficiency. FEBS Lett 549: 63–66.
57. Downey M, Houlsworth R, Maringele L, Rollie A, Brehme M, et al. (2006) A genome-wide screen identifies the evolutionarily conserved KEOPS complex as a telomere regulator. Cell 124: 1155–1168.
58. LaRonde-LeBlanc N, Wlodawer A (2005) The RIO kinases: An atypical protein kinase family required for ribosome biogenesis and cell cycle progression. Biochim Biophys Acta 1754: 14–24.
59. White KA, Lin S, Cotter RJ, Raetz CR (1999) A Haemophilus influenzae gene that encodes a membrane bound 3-deoxy-D-manno-octulosonic acid (Kdo) kinase. Possible involvement of kdo phosphorylation in bacterial virulence. J Biol Chem 274: 31391–31400.
60. Zhao X, Lam JS (2002) WaaP of Pseudomonas aeruginosa is a novel eukaryotic type protein-tyrosine kinase as well as a sugar kinase essential for the biosynthesis of core lipopolysaccharide. J Biol Chem 277: 4722–4730.
61. Serino L, Virji M (2000) Phosphorylcholine decoration of lipopolysaccharide differentiates commensal Neisseriae from pathogenic strains: Identification of licA-type genes in commensal Neisseriae. Mol Microbiol 35: 1550–1559.
62. Wright GD, Thompson PR (1999) Aminoglycoside phosphotransferases: Proteins, structure, and mechanism. Front Biosci 4: D9–D21.
63. Delpierre G, Collard F, Fortpied J, Van Schaftingen E (2002) Fructosamine 3-kinase is involved in an intracellular deglycation pathway in human erythrocytes. Biochem J 365: 801–808.
64. Fortpied J, Gemayel R, Stroobant V, van Schaftingen E (2005) Plant ribulosamine/erythrulosamine 3-kinase, a putative protein-repair enzyme. Biochem J 388: 795–802.
65. Tower PA, Alexander DB, Johnson LL, Riscoe MK (1993) Regulation of methylthioribose kinase by methionine in Klebsiella pneumoniae. J Gen Microbiol 139: 1027–1031.
66. Sekowska A, Mulard L, Krogh S, Tse JK, Danchin A (2001) MtnK, methylthioribose kinase, is a starvation-induced protein in Bacillus subtilis. BMC Microbiol 1: 15.
67. Poon WW, Davis DE, Ha HT, Jonassen T, Rather PN, et al. (2000) Identification of Escherichia coli ubiB, a gene required for the first monooxygenase step in ubiquinone biosynthesis. J Bacteriol 182: 5139–5146.
68. Do TQ, Hsu AY, Jonassen T, Lee PT, Clarke CF (2001) A defect in coenzyme Q biosynthesis is responsible for the respiratory deficiency in Saccharomyces cerevisiae abc1 mutants. J Biol Chem 276: 18161–18168.
49. Neuwald AF, Liu JS (2004) Gapped alignment of protein sequence motifs
through Monte Carlo optimization of a hidden Markov model. BMC 69. Jarling M, Cauvet T, Grundmeier M, Kuhnert K, Pape H (2004) Isolation of
Bioinformatics 5: 157. mak1 from Actinoplanes missouriensis and evidence that Pep2 from
50. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: A web- Streptomyces coelicolor is a maltokinase. J Basic Microbiol 44: 360–373.
based tool for the study of genetically mobile domains. Nucleic Acids Res 70. Cozzone AJ, El-Mansi M (2005) Control of isocitrate dehydrogenase
28: 231–234. catalytic activity by protein phosphorylation in Escherichia coli. J Mol
51. Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: A Microbiol Biotechnol 9: 132–146.
sequence logo generator. Genome Res 14: 1188–1190. 71. Aimes RT, Hemmer W, Taylor SS (2000) Serine-53 at the tip of the glycine-
52. Knighton DR, Zheng JH, Ten Eyck LF, Ashford VA, Xuong NH, et al. (1991) rich loop of cAMP-dependent protein kinase: Role in catalysis, P-site
Crystal structure of the catalytic subunit of cyclic adenosine mono- specificity, and interaction with inhibitors. Biochemistry 39: 8325–8332.
phosphate-dependent protein kinase. Science 253: 407–414. 72. Johnson DA, Akamine P, Radzio-Andzelm E, Madhusudan M, Taylor SS
53. Hanks SK, Quinn AM, Hunter T (1988) The protein kinase family: (2001) Dynamics of cAMP-dependent protein kinase. Chem Rev 101:
Conserved features and deduced phylogeny of the catalytic domains. 2243–2270.
Science 241: 42–52. 73. Huse M, Kuriyan J (2002) The conformational plasticity of protein kinases.
54. Neuwald AF, Kannan N, Poleksic A, Hata N, Liu JS (2003) Ran’s C-terminal, Cell 109: 275–282.