Plos Biology Venter Collection Low

PUBLIC LIBRARY of SCIENCE | plosbiology.org | Special Collection | MARCH 2007
Special Oceanic Metagenomics Collection
A collection of articles from the J. Craig Venter Institute’s Global Ocean Sampling expedition
Committed to making scientific and medical literature a public resource | www.plos.org
Editorial Staff
Hemai Parthasarathy, Managing Editor
Natalie Bouaravong, Editorial Assistant
Jami Milton Dantzker, Associate Editor
Jacob Evans, Editorial Assistant
Liza Gross, Science Writer
Emma Hill, Associate Editor
Catriona MacCallum, Senior Editor
Robert Shields, Senior Editor
Janelle Weaver, Associate Editor

Board of Directors
Harold Varmus, Chairman & Co-founder
Patrick O. Brown, Co-founder
Michael B. Eisen, Co-founder
Brian Druker
Paul Ginsparg
Allan Golston
Calestous Juma
Lawrence Lessig
Elizabeth Marincola
Richard Smith
Rosalind L. Smyth
Beth Weil

Editorial Board
Anurag Agrawal, Julie Ahringer, Shizuo Akira, Richard Aldrich, Göran Arnqvist, James Ashe, Anthony Barnosky, Nick Barton, Konrad Basler, Michael Bate, Peter Becker, Pamela Björkman, Peer Bork, Henry Bourne, Lon Cardon, James Carrington, Lars Chittka, Joanne Chory, Jeffrey Dangl, Titia De Lange, Frans de Waal, Joseph DeRisi, Andrew Dillin, Andy Dobson, Ford Ebner, Sean Eddy, Thomas Edlund, Thomas Egwang, Jonathan Eisen, Steve Elledge, Stephen Ellner, Michael Emerman, Manfred Fahle, Susan Gasser, Mikhail Gelfand, Richard Gibbs, Margaret Goodell, Douglas Green, Bryan Grenfell, James Haber, Hiroshi Hamada, William Harris, Paul Harvey, Nicholas Hastie, R. Scott Hawley, Anders Hedenström, Joseph Heitman, Dan Herschlag, Winston Hide, David Hillis, Brigid Hogan, Fred Hughson, Tim Hunt, Laurence Hurst, Gerald Joyce, Jim Kadonaga, Laurent Keller, Christopher Kemp, Chaitan Khosla, Joel Kingsolver, Thomas Kirkwood, Tom Kornberg, Mark Krasnow, Arthur D. Lander, Andre Levchenko, Michael Lichten, Susan Lindquist, David Lipman, Edison Liu, Michel Loreau, Georgina Mace, Philippa Marrack, Alfonso Martínez-Arias, Rowena Matthews, Markus Meister, Bénédicte Michel, Emmanuel Mignot, Tom Misteli, Nancy Moran, Craig Moritz, David Nemazee, Eric Nestler, Mohamed Noor, Roel Nusse, Steve O’Rahilly, Svante Pääbo, Nipam Patel, David Penny, Greg Petsko, Lennart Philipson, Ron Plasterk, Dietmar Plenz, Hidde Ploegh, Walt Reid, Callum Roberts, Richard Roberts, Sarah Rowland-Jones, Gerry Rubin, Mick Rugg, Ueli Schibler, Manfred Schliwa, David Schneider, Matthew P. Scott, Idan Segev, Ben Sheldon, Daniel Simberloff, Kai Simons, Mandyam Srinivasan, Derek Stemple, Charles Stevens, Bill Sugden, Sally Temple, Janet Thornton, Chris Tyler-Smith, Leslie Ungerleider, Joan Valentine, Matt van de Rijn, Antonio Vidal-Puig, Herbert W. Virgin, Matt Waldor, Peter Walter, Gary Ward, Detlef Weigel, Jonathan S. Weissman, Marv Wickens, Ken H. Wolfe, Phillip D. Zamore, Robert Zatorre, Huda Y. Zoghbi

Production
Ashley Clark
Anthony Flores
Alexis Wynne Mogul
Chelsea E. Scholl

Marketing
Liz Allen, Director
Allison Hawxhurst
Catherine Silvestre

IT & Web
Richard Cave, Director
Andrew Bergeron
Susanne DeRisi
Josh Klavir
Céline Nadeau
Tim Sullivan
Russell Uman
Elisa Webb

Staff
Mark Gritton, Chief Executive Officer
Mark Patterson, Director of Publishing
Steve Borostyan, Chief Financial Officer
Barbara Cohen, PLoS Executive Editor
Janice Pettey, Development Director
Isis Choto
Donna Okubo
Robert Viera

Publisher Information
PLoS Biology (ISSN-1544-9173, eISSN-1545-7885) is published monthly by the Public Library of Science. All works published in PLoS journals are open access, subject to the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/). Copyright is retained by the authors. PLoS Biology is freely available online: http://plosbiology.org

Correspondence
Public Library of Science
185 Berry St., Ste. 3100
San Francisco, CA 94107 USA
email: plos@plos.org
phone: +1 415.624.1200
fax: +1 415.546.4090

PLoS European Editorial Office
7 Portugal Place
Cambridge CB5 8AF UK
email: plosuk@plos.org
phone: +44 (0)1223-463-330
fax: +44 (0)1223-463-348

Display Advertising
Patric Donaghy, pdonaghy@plos.org
phone: +1 415.564.8612

Manuscript Submission
online: http://biology.plosjms.org
PLoS Biology | www.plosbiology.org i March 2007 | Oceanic Metagenomics Collection

Editorial
Global Ocean Sampling Collection (e83)
Hemai Parthasarathy, Emma Hill, Catriona MacCallum

Synopses of Research Articles
Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity (e85)
Liza Gross

Feature
Sorcerer II: The Search for Microbial Diversity Roils the Waters (e74)
Henry Nicholls

Essay
Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes (e82)
Jonathan A. Eisen

Community Page
CAMERA: A Community Resource for Metagenomics (e75)
Rekha Seshadri, Saul A. Kravitz, Larry Smarr, Paul Gilna, Marvin Frazier

Research Articles
The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific (e77)
Douglas B. Rusch, Aaron L. Halpern, Granger Sutton, et al.

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families (e16)
Shibu Yooseph, Granger Sutton, Douglas B. Rusch, et al.

Structural and Functional Diversity of the Microbial Kinome (e17)
Natarajan Kannan, Susan S. Taylor, Yufeng Zhai, et al.

About the Cover
Aboard the Sorcerer II, the Global Ocean Sampling expedition made its way from Canada, through the Panama Canal, into the South Pacific, collecting genomic sequences from marine microorganisms. The resulting data are explored in three papers in this special collection from the March 2007 issue of PLoS Biology (see Rusch et al., e77; Yooseph et al., e16; and Kannan et al., e17).
Cover credit: Image provided by the J. Craig Venter Institute
doi:10.1371/journal.pbio.0050088.g001

Every paper we publish is freely available online for you to read, download, copy, distribute and use, with no permissions required. All articles are archived in PubMed Central.

Editorial
Global Ocean Sampling Collection

Hemai Parthasarathy*, Emma Hill, Catriona MacCallum
Today, PLoS Biology publishes landmark metagenomics papers from the J. Craig Venter Institute’s Global Ocean Sampling expedition [1–3]. These papers describe the initial analyses of several gigabasepairs’ worth of sequence data from oceanic microbes collected during the Sorcerer II expedition, as the ship made her way down from Canada, through the Panama Canal, and finally out beyond the Galapagos Islands well into the tropical Pacific and the South Pacific Gyre. Results from the first foray of this research mission into the Sargasso Sea were published three years ago [4]. As described in the accompanying Synopsis [5], the new voyage has added information from multiple biomes and several-fold more data.

Analysis of these data poses not only scientific challenges [6], but also significant legal hurdles. Craig Venter is no stranger to issues of intellectual property: his previous incarnation as the president of Celera saw him embroiled in controversy over the decision to “privatize” aspects of his company’s work in sequencing the human genome. Now, at the head of the Global Ocean Sampling project, Venter finds himself on the side of greater accessibility, negotiating the claims of individual governments on the genomic wealth within their waters. In particular, as of this writing, there is an active negotiation with the Ecuadorian government (which has seen more than one change of power since the expedition began) over restricting commercial reuse of these data. Henry Nicholls describes this tangled legal landscape in an accompanying Feature [7].

Although extensive in scope, the papers presented here only touch the surface of the wealth of information to be gleaned from these data, which are freely available for all to explore from their desktops: the trace reads and processed data have been deposited in the National Center for Biotechnology Information’s Trace Archive (http://www.ncbi.nlm.nih.gov/Traces) (with the exception of that fraction of the trace data acquired from Ecuadorian coastal waters), annotated with extensive geographical and physicochemical metadata. The assemblies and associated annotated peptides will be delivered to GenBank (http://www.ncbi.nlm.nih.gov/Genbank) around the time of publication, and will become available after GenBank has processed them. More immediately, and potentially more usefully, these data are also freely available through a specially built database, CAMERA, the Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (http://camera.calit2.net), which provides greater annotation and analysis capabilities [8]. (CAMERA was funded by the Gordon and Betty Moore Foundation, which also supports PLoS.)

The proponents of open-access publishing, ourselves included, often cite as an inspiration the power that open access to DNA sequence databases has had in transforming scientific discovery. As our founders noted in the inaugural issue of PLoS Biology, “With great foresight, it was decided in the early 1980s that published DNA sequences should be deposited in a central repository, in a common format, where they could be freely accessed and used by anyone. Simply giving scientists free and unrestricted access to the raw sequences led them to develop the powerful methods, tools, and resources that have made the whole much greater than the sum of the individual sequences....Now imagine the possibilities if the same creative explosion that was fueled by open access to DNA sequences were to occur for the much larger body of published scientific results.” [9]

But the publishing reality in genomics research has been less inspiring. Although sequence data are publicly available and free to be reused by the community, the same creative license has not yet been awarded to the key papers resulting from the major genome projects, which are commonly published in subscription-based journals. Many of these genomics papers are “freely” available from publisher Web sites, but their use remains restricted, and to claim that freedom to read an article is the main benefit of open access is to miss the promise inspired by DNA sequence databases.

While we and other open-access journals have both enjoyed and been grateful for strong support from the genomics community, we are also disappointed that authors of landmark genomics papers, who adamantly support open access to sequence data, have not taken the opportunity to provide further leadership for their community by promoting open access to the scientific literature. We encourage all researchers to apply the same standards to their papers as they would to their data, regardless of the publisher. As Jensen et al. stated in a recent review about the benefits of text mining for the scientific community, “It is the restricted access to the full text of papers…that is currently the greatest limitation…” [10].

Citation: Parthasarathy H, Hill E, MacCallum C (2007) Global Ocean Sampling collection. PLoS Biol 5(3): e83. doi:10.1371/journal.pbio.0050083

Copyright: © 2007 Parthasarathy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Hemai Parthasarathy is Managing Editor, Emma Hill is Associate Editor, and Catriona MacCallum is Senior Editor at PLoS Biology.

* To whom correspondence should be addressed. E-mail: hemai@plos.org

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Acknowledgments
PLoS Biology relies on the support of our academic editors and reviewers in selecting and improving manuscripts for publication. We would like to extend particular thanks to our editorial board members Sean Eddy, Jonathan Eisen, and Nancy Moran, our guest editors Simon Levin and Tony Pawson, and our anonymous peer reviewers for their contributions to this collection of articles.

References
1. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
2. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
3. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
4. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 58–60.
5. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085
6. Eisen JA (2007) Environmental shotgun sequencing: The potential and challenges of random and fragmented sampling of the hidden world of microbes. PLoS Biol 5: e82. doi:10.1371/journal.pbio.0050082
7. Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5: e74. doi:10.1371/journal.pbio.0050074
8. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/journal.pbio.0050075
9. Brown PO, Eisen MB, Varmus HE (2003) Why PLoS became a publisher. PLoS Biol 1: e36. doi:10.1371/journal.pbio.0000036
10. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet 7: 119–129.

Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity
Liza Gross | doi:10.1371/journal.pbio.0050085
Being invisible to the naked eye, microbes managed to escape scientific scrutiny until the mid-17th century, when Leeuwenhoek invented the microscope. These cryptic organisms continued to thwart scientists’ efforts to probe, describe, and classify them until about 40 years ago, owing largely to a limited morphology that defies traditional taxonomic methods and an enigmatic physiology that makes them notoriously difficult to cultivate.

Most of what we know about the biochemical diversity of microbes comes from the tiny fraction that submit to lab investigations. Not until scientists determined that they could use molecular sequences to identify species and determine their evolutionary heritage, or phylogeny, did it begin to become apparent just how diverse microbes are. We now know that microbes are the most widely distributed organisms on earth, having adapted to environments as diverse as boiling sulfur pits and the human gut. Accounting for half of the world’s biomass, microbes provide essential ecosystem services by cycling the mineral nutrients that support life on earth. And marine microbes remove so much carbon dioxide from the atmosphere that some scientists see them as a potential solution to global warming.

Yet even as scientists describe seemingly endless variations on the cosmopolitan microbial lifestyle, the concept of a bacterial species remains elusive. Some bacterial species (such as anthrax) appear to have little genetic variation while in others (such as Escherichia coli) individuals can have completely different sets of genes, challenging scientists to explain the observed diversity.

The emerging field of environmental genomics (or metagenomics) aims to capture the full measure of microbial diversity by trading the lens of the microscope (and biochemistry) for the lens of genomics (and bioinformatics). By recovering communities of microbial genes where they live, environmental genomics avoids the need to culture uncooperative organisms. And by linking these data to details relating to sequence collection sites, such as pH, salinity, and water temperature, it sheds light on the biological processes encoded in the genes.

The largest metagenomic dataset collected so far comes from the Sorcerer II expedition, named after the yacht J. Craig Venter transformed into a marine research vessel. In a pilot study of the Sargasso Sea, Venter’s team identified 1.2 million genes and inferred the presence of at least 1,800 bacterial species. But the genetic and taxonomic diversity of the data imposed new challenges on existing genome assembly methods and other analysis techniques. The researchers designed the Sorcerer II Global Ocean Sampling (GOS) expedition to see if collecting more samples would improve their assembly and lead to a better estimate of the number and diversity of microbial genes in the oceans.

And now, in three new studies, Venter’s team has combined the expedition’s latest bounty (6.5 million sequencing “reads”) with the Sargasso Sea data. The result is a geographically diverse environmental genomic dataset of 6.3 billion base pairs, twice the size of the human genome. (To learn about the voyage and sampling methods, see Box 1.) In the first paper, Douglas Rusch, Aaron Halpern, and colleagues attempt to describe the immense amount of microbial diversity in the seas, and determine how (or if) that diversity is structured and what might be shaping that structure. In the second paper, Shibu Yooseph et al. study the millions of proteins in the GOS sequences to see if we’re close to discovering all the proteins in nature. And in the third study, Natarajan Kannan, Gerard Manning, and colleagues classify thousands of kinases into 20 distinct families, revealing their structural and functional diversity and an unexpected importance in prokaryotic regulation.

Extracting Meaning from Metagenomic Datasets

The GOS samples used in the Rusch et al. study were collected over the course of a year from a wide range of aquatic environments (including estuaries, lakes, and open oceans), then pumped through serial filters. After extracting the genetic material from the microbe-encrusted filters, Rusch et al. used shotgun sequencing to study the genes present in the samples. DNA is forced through a tiny nozzle that smashes it into bits; the fragments are cloned and the letters of the genetic code are scanned from both ends to create “reads.” Reads are then assembled, much like a jigsaw puzzle, starting with contiguous fragments (“contigs”) that are then mapped onto “scaffolds,” which order and orient sets of contigs on a chromosome. (For more on shotgun sequencing, see Box 2.) Using a conservative sequence similarity requirement, most reads failed to assemble, suggesting that the samples contained great microbial diversity.

With only a bare bones assembly to guide their investigation, Rusch et al. tried a different approach. They used the 584 completed and draft microbial genomes already available in public databases as points of reference and relaxed search parameters to detect even remote similarity to GOS sequences. Although the majority of GOS reads matched up with one or more of the reference genomes, the loose criteria prevented the researchers from drawing meaningful inferences about kinship.

To boost their inference power, they required that similarity to a reference genome extend nearly the full length of a read (producing “recruited reads”). A substantial majority of reads failed this criterion, with only 30% of the GOS data being recruited. The bulk of these aligned to three genera of widely distributed marine microbes (Pelagibacter, Synechococcus, and Prochlorococcus), which accounted for about 15% of the recruited reads. The remaining recruited reads appear to signal conserved genes rather than closely related
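The recruitment criterion described above (similarity to a reference genome must extend nearly the full length of a read) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Alignment` record, its field names, the `recruit` helper, and the 0.9 coverage cutoff are all hypothetical, since the synopsis gives no numeric threshold or data format.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    read_length: int
    aligned_start: int  # 0-based, inclusive, in read coordinates
    aligned_end: int    # 0-based, exclusive, in read coordinates
    reference: str
    identity: float     # fraction of identical aligned positions


def is_recruited(aln, min_coverage=0.9):
    """True if the alignment spans nearly the whole read.

    min_coverage=0.9 is an assumed stand-in for "nearly the full length".
    """
    covered = aln.aligned_end - aln.aligned_start
    return covered / aln.read_length >= min_coverage


def recruit(alignments, min_coverage=0.9):
    """Keep, for each read, its highest-identity alignment that passes the filter."""
    recruited = {}
    for aln in alignments:
        if not is_recruited(aln, min_coverage):
            continue
        best = recruited.get(aln.read_id)
        if best is None or aln.identity > best.identity:
            recruited[aln.read_id] = aln
    return recruited


if __name__ == "__main__":
    demo = [
        Alignment("read1", 800, 5, 795, "ref_A", 0.82),
        Alignment("read1", 800, 0, 800, "ref_B", 0.95),
        Alignment("read2", 800, 100, 400, "ref_A", 0.99),  # only a local match
    ]
    for read_id, aln in recruit(demo).items():
        print(read_id, aln.reference, aln.identity)
```

The point of the design is that a short, high-identity local match (like `read2` here) is rejected, which is why a loose similarity search can match most reads while a full-length requirement recruits far fewer.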
Box 1. Following the Sorcerer II’s Hunt for Microbes
The Sorcerer II expedition was inspired by the British Challenger expedition (1872–1876), a pioneering oceanography research project that discovered hundreds of new genera and nearly 5,000 new marine species. Its gun stations replaced with research stations, the Challenger circumnavigated the oceans, stopping every 320 kilometers to recover specimens from bottom, intermediate, and surface depths to explore the diversity of macroscopic marine life. At each stop, the crew recorded the location, what they used to extract the sample, the depth of the sample, and several observations related to water and atmospheric conditions. The Sorcerer II followed a similar sampling schedule, traveling nearly 9,000 kilometers to collect samples of microbial marine life and record the water’s location, depth, pH, salinity, and temperature.

The GOS crew collected samples from surface waters of diverse, mostly marine aquatic environments. The samples were collected between August 2003 and May 2004 during a six-leg journey that followed a path from northeastern Canada to the South Pacific Gyre. Venter’s crew collected microbial samples by pumping 200 liters of surface seawater through a series of increasingly fine filters, which they labeled, froze, and sent back to the lab of the J. Craig Venter Institute in Maryland.

After a stop in the Gulf of Maine, the expedition sampled three sites along Nova Scotia, including a “highly eutrophic” coastal embayment in Halifax. The crew set sail again in November, starting in Newport Harbor, Rhode Island, and ending in the Delaware Bay, one of several estuaries targeted on the journey.

The next leg began in Chesapeake Bay. The largest US estuary, Chesapeake Bay contains a rich mix of freshwater and marine organisms. Estuaries are complex hydrodynamic environments that are highly sensitive to runoff from agricultural and urban development (which can dump massive amounts of nitrogen and phosphorus into watersheds). Microbial communities collected from estuaries promise to provide valuable insights into the metabolic and physiological adaptations required by such environments. Continuing down the Atlantic seaboard, the expedition stopped near Cape Hatteras, North Carolina, and the Florida Keys before passing through the Caribbean and ending near Panama, where the crew collaborated with scientists at the Smithsonian Tropical Research Institute.

The fourth leg of the voyage sampled sites in the Eastern Pacific, including Cocos Island, about 500 kilometers southwest of Costa Rica. A highly productive ecosystem inhabits the waters off the island, a result of ocean currents buffeting the coast and causing nutrient upwellings that mix with warm surface waters. The crew made one last stop in the open ocean, then headed for the Galapagos Islands.

Owing in part to its position near major ocean currents and atmospheric transition zones, the Galapagos Archipelago sits within a hydrographically complex region. Unique oceanographic features there support a diverse set of habitats and endemic species, found within several discrete zones distinguished by temperature. This microbial mother lode held the crew’s attention for two months, while they extensively sampled the region.

By early March 2004, the crew had collected the last three samples used in these studies, from two open ocean sites and a lagoon in a coral reef in the South Pacific Gyre. Follow these links to learn more about the Sorcerer II (http://www.sorcerer2expedition.org/version1/HTML/main.htm) and the Challenger (http://hercules.kgs.ku.edu/hexacoral/expedition/challenger_1872-1876/challenger.html) expeditions.

H.M.S. Challenger (Image: NOAA, Steve Nicklas)
The Sorcerer II (Image: J. Craig Venter Institute)
doi:10.1371/journal.pbio.0050085.g001
organisms. Most of the GOS sequences failed to be identified, in part because so few surface water microbes have been sequenced.

A novel comparative genomic method. Focusing on the reads that recruited to these most abundant genera, Rusch et al. generated “fragment recruitment plots.” These graphics represent relatedness and diversity of environmental sequences to a reference genome by showing where a read aligns with the reference genome (indicated by a horizontal bar) and its degree of similarity to the reference sequence (indicated by its vertical position). Recruited reads were color-coded based on sample origin to indirectly depict their associated metadata (for example, salinity and pH). These plots provided a visual tool to explore genetic diversity at the sequence and gene level, genome structure and evolution, and taxonomic and evolutionary relationships. (For more on fragment recruitment plots, see the accompanying poster, doi:10.1371/journal.pbio.0050077.sd001.)

Distinct recruitment patterns, easily detected by bands of color, emerged for each organism. In some cases, a single reference genome had multiple color bands, distinguished by their similarity and sample provenance. Because bands appeared to represent unique, closely related, and geographically distinct populations (and showed a novel level of diversity across the entire genome), the researchers termed each band a subtype. A tremendous amount of sequence diversity appeared in the subtypes, which also harbored substantial sequence variation at the protein level, some likely reflecting adaptations to local environments. This finding reveals a potential locus of microbial diversity, at the level of subtype rather than at the level of species, or ribotype (based on a segment of a ribosomal RNA gene called 16S rRNA), and offers clues to why it emerged (perhaps in response to local pressures) and how it evolved.

A novel sequence assembly method. Because such high levels of sequence diversity among organisms confound standard whole genome assembly software, and most of the GOS data correspond to organisms for which there is no appropriate reference genome, Rusch et al. used an “extreme assembly” approach to investigate the genomes of other abundant GOS populations. They used greatly reduced requirements for sequence similarity in the assemblers to generate longer contigs and capture more of the GOS data in an assembly. While some of the resulting larger assemblies corresponded to known reference genomes, others did not, allowing the researchers to study microbes without cultivated or sequenced counterparts. And because these larger assemblies could potentially provide functional insights into uncharacterized organisms, they might identify conditions that would allow scientists to grow them in the lab.

Many of the large contigs failed to align in any significant way with known genomes, so the researchers tried to match them with “seed fragments” from known taxonomic groups. By starting assembly from reads mated to the 16S rRNA gene, one of the most common marker genes used for classifying microbes, they could generate large contigs associated with many of the abundant GOS ribotypes. Fragment recruitment plots of these assemblies again revealed multiple subtypes, providing further support for the presence of multiple evolutionarily distinct subtypes within a given ribotype.

Evidence for environmental adaptations. A computational approach designed to identify groups of samples with similar genomic content revealed that tropical and temperate samples shared the least amount of genomic material. Some samples, however, were very similar.

While untangling all the factors that may affect genetic makeup of a sample is beyond current datasets and methods, the researchers demonstrated that specific genetic differences can be related to environmental factors. Several genes occurred up to seven times more frequently in a pair of samples from the Caribbean than they did in a pair from the eastern Pacific, even though both pairs had similar ribotype and genetic profiles. Many of these genes govern the metabolism and transport of phosphate (required for microbial growth), likely reflecting functional adaptations in the microbial communities to the measured differences in phosphate availability in the Caribbean and Pacific samples.

The researchers also explored diversity at the gene level by looking for evidence of functional differences in one gene family, proteorhodopsins, light-activated proton pumps with a slightly murky biological role. Proteorhodopsins were abundant in all the GOS and Sargasso Sea samples. In keeping with the diverse light environments sampled during the expedition, the researchers found a strong correlation between sequence variation and sample provenance. They hypothesize that the distribution of given variants reflects adaptation to the most abundant light spectra in their habitats.

Altogether, these results reveal the power of metagenomic approaches to capture the true measure of microbial diversity by uncovering genomic differences that would not have been apparent using traditional marker-based approaches. The breadth of this newly revealed diversity may come as a surprise to even inveterate microbe hunters.

The Expanding Protein Universe

Along with insights into microbial diversity, metagenomics promises to help us understand the vast number of proteins in nature. By randomly sampling DNA sequences from communities of organisms, metagenomic studies overcome selection and culturing biases that arise from focusing on a particular organism or a set of proteins, to provide an expansive view of protein diversity and evolution.

Proteins are typically grouped into families based on their evolutionary relationship, which can then be used to guide investigations of their biological roles. Proteins in the same family share similar amino acid sequences and three-dimensional conformations. Using amino acid sequence similarity as a measure to identify and group protein sequences from the GOS data with sequences from a comprehensive set of known proteins, Shibu Yooseph et al. evaluated the impact of the GOS data on our understanding of known proteins and studied the rate of discovery of protein families with new sequences. To group related sequences and predict proteins, they developed a novel sequence clustering technique based on full-length sequence similarity.

Identifying proteins in metagenomics data. Hypothetical proteins can be predicted by searching for open reading frames (ORFs), sequences flanked by nucleotide triplets (called codons) that signal the beginning and end of translation but don’t necessarily encode a protein. Because the GOS data contain many fragmentary sequences, Yooseph et al. allowed ORFs to be terminated at the end of a sequence, resulting in a partial or truncated ORF. They used
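The idea of letting an ORF run off the end of a fragmentary sequence can be illustrated with a small scanner. This is a simplified sketch, not the procedure used by Yooseph et al.: it assumes an `ATG` start codon and the three standard stop codons, scans only the three forward reading frames, and allows truncation only at the 3' end. The function name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"                  # assumed start signal for this sketch


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame, partial) tuples for forward-strand ORFs.

    start/end are 0-based, end-exclusive nucleotide positions. A complete ORF
    includes its stop codon. If the sequence ends before a stop codon is
    reached, the ORF is reported with partial=True (a truncated ORF).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one whole codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame, False))
                start = None
        # An ORF still open at the end of the fragment is kept as partial.
        if start is not None:
            end = start + ((len(seq) - start) // 3) * 3
            if (end - start) // 3 >= min_codons:
                orfs.append((start, end, frame, True))
    return orfs


if __name__ == "__main__":
    fragment = "ATGAAATTTTAG" + "CC" + "ATGGGGCCCAAA"
    for orf in find_orfs(fragment):
        print(orf)
```

In this example the first ORF is complete, while the second begins near the end of the fragment and is reported as partial rather than discarded, which is the behavior the paragraph above describes for fragmentary metagenomic reads.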
Box 2. Bioinformatic Methods at a Glance
Bioinformatics relies on statistics and computer power to synthesize and interpret huge datasets. Here’s a brief introduction to some of the environmental genomics methods used in the GOS studies.

Shotgun sequencing decodes genetic material by randomly shredding it into millions of fragments. The DNA sequence of each end of a fragment is determined; the two ends of a given fragment (or insert) can be associated, and constitute a “mate pair.” These random sequencing “reads” are then reassembled with a computer. Based on sequence similarity, overlapping reads are identified and merged into longer sequences called “contigs.” Contigs are organized into larger (but not necessarily continuous) pieces of a genome, called “scaffolds,” based on mate pairs. The resulting assemblies can link genes to their regulatory elements, guide investigations of biological pathways, and connect unknown sequences with taxonomic markers to suggest evolutionary relationships.

Sequence similarity detection allows functional and taxonomic characterization of genomic sequences. Once the shotgunned sequences have been organized into a library of sequence “scaffolds” and translated into hypothetical proteins, the next step uses sequence similarity to figure out what the proteins are and to identify families. Similarity can also associate a new sequence with an approximate location on the tree of life.

alignments of previously identified families to compute “position-specific scoring matrixes” (PSSMs). Each position in the alignment is associated with a set of scores that reward or penalize the alignment of a given amino acid to the position. Profile methods can be more sensitive than simple sequence similarity methods because they give more weight to signals at sites that are conserved within a protein family and less weight to more variable positions.

Initially, the advantages of profile methods for detecting remote homology were limited to well-characterized families, as construction of a profile required some expertise. However, this changed with the fully automated integration of this step into PSI-BLAST. PSI-BLAST begins with a pairwise (sequence–sequence) similarity search, but then iteratively runs alternating steps of building a profile from the current set of similar sequences and using the profile to re-search the database for additional matching sequences.

Hidden Markov models (HMMs) employ statistical methods to model the likelihood of different amino acids at any given position of the sequence in an underlying alignment. Like some profile methods, HMMs use a probability-based method to determine the score of aligning an observed amino acid to a given position in a protein family, but HMMs improve upon profiles by more sophisticated modeling of variation in protein
Sequence–sequence (pairwise) methods, the first step for length, storing the probabilities of insertions or deletions at
identifying closely related sequences, compare all sequences each position of the model. HMMs have a good track record for
to all other sequences in a pairwise manner. These methods identifying more distantly related protein sequences.
(such as BLAST) allow all collected sequences to be compared Profile–profile methods are the most recent enhancement to
with one another (and with all sequences already available in sequence homology detection methods. As the name suggests,
public databases) and reliably clustered into families of related profile–profile methods compare one profile to another. Because
sequences with high sequence similarity, or homology. each profile implicitly encodes more information than a single
Profile methods are used to identify more remote sequence, these methods identify relationships that cannot be
relationships. Profile methods use multiple sequence detected by comparing individual sequences.
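The position-specific scoring idea described in Box 2 can be illustrated with a minimal Python sketch. It is not the pipeline used in the GOS studies: the toy alignment, the uniform background frequency, the flat pseudocount, and the function names (build_pssm, score_segment, best_window) are all assumptions made for illustration. It shows only how a conserved alignment column ends up rewarding a match more strongly than a variable one.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (not taken from the GOS papers): equal-length aligned
    sequences, a uniform background frequency, and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gap characters such as "-" are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the per-position scores; unknown characters contribute 0."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, segment))


def best_window(pssm, sequence):
    """Slide the profile along a sequence; return (best score, start index)."""
    width = len(pssm)
    return max(
        (score_segment(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy alignment of a hypothetical three-residue motif.
toy_alignment = ["GKS", "GKT", "GKS", "AKS"]
toy_pssm = build_pssm(toy_alignment)

# Compare the score for K in the fully conserved column with S in a variable one.
print(round(toy_pssm[1]["K"], 2), round(toy_pssm[2]["S"], 2))
# Locate the best-scoring window in a hypothetical query sequence.
print(best_window(toy_pssm, "MMGKSMM"))
```

In this toy case the invariant K column scores a match higher than the column that mixes S and T, which is the "more weight to conserved sites" behaviour the box describes. Real tools use empirically derived background frequencies and more careful pseudocount schemes.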
the ORFs to generate a set of predicted proteins based on the results of a series of clustering steps and statistical analyses. After performing pairwise comparisons (of every sequence against every other sequence) of the more than 28 million sequences in the combined dataset, the researchers identified conserved groups of sequences after accounting for redundancy due to identical and near-identical sequences. They then used profile methods to merge and expand these groups of sequences. While pairwise comparisons capture the most closely related sequences (or homologs), profile methods (the researchers used both PSI-BLAST and hidden Markov models) detect more distantly related sequences by combining homologs into multiple sequence alignments to generate “profiles.” (For more on these methods, see Box 2.)
From the clusters obtained by the above procedure, clusters of spurious sequences (that overlap true protein regions on the genome) were identified in addition to clusters of noncoding conserved sequences (based on tests showing no selection on their codons). Sequences in these clusters were removed; those remaining were labeled as predicted proteins. The researchers identified nearly 6 million proteins in the GOS dataset—1.8 times the number already in public databases. Comparing the predicted protein clusters to known prokaryotic and nonprokaryotic protein databases revealed GOS counterparts in nearly all known prokaryotic protein families; nearly 2,000 clusters appeared unique to the GOS dataset.
Since they couldn’t use sequence similarity to infer function for the unique GOS sequences, the researchers relied on the assumption that proteins with similar roles are more likely to reside in the same genomic neighborhood. This analysis implicated several GOS-only clusters in photosynthesis or electron transport. Such clusters may come from viruses, as many viral parasites of photosynthetic bacteria express the photosynthetic genes of their hosts. Interestingly, though most of the sequences in GOS-only clusters appeared to be bacterial, a higher than expected proportion of them were flagged as viral. If such novel GOS protein families pan out as viral, the researchers argue, “we are far from exploring the molecular diversity of viruses.”
Insights into evolutionary and functional diversity. To compare ocean versus terrestrial life at the biochemical level, Yooseph et al. compared GOS sequences to those of land-dwelling prokaryotes. Nearly 70% of protein domains varied between the two classes of microbes, mostly reflecting the distinct biochemical requirements of the two environments, as well as the different taxonomic groupings in the two datasets. The researchers were surprised to find little evidence
of domains specific to gram-positive bacteria (defined by their unique cell wall), even though this bacterial group makes up nearly 12% of the GOS dataset. They also found a relative dearth of components related to flagella (the whip-like tail of microbial motility), possibly reflecting the reduced need for self-propulsion in the ocean.
Using a comprehensive protein family database (called Pfam), the researchers compared the kingdom distribution of known protein domains in the GOS data to that of proteins in public databases. In this process, some families that were previously thought to be single-kingdom turned out to have members in multiple kingdoms. For example, indoleamine 2,3-dioxygenase (IDO), an enzyme linked to the immune system in mammals, was considered unique to eukaryotes. But the IDO Pfam search turned up matches to ten GOS sequences identified as bacterial—suggesting that the proteins may have arisen much earlier than previously thought, or perhaps arose through lateral gene transfer (from an unrelated organism).
The sheer size of the GOS dataset—which nearly doubles the number of proteins—greatly expands the functional diversity of known protein families, providing valuable insights into their evolution. For example, the researchers found a 10-fold increase in the number and type of proteins involved in repairing ultraviolet radiation damage, likely reflecting the hazards of living in surface waters. A similar boost in phosphatases—which function in such fundamental biological processes as cell signaling, development, and cell division—highlighted important differences in the way one phosphatase (protein phosphatase 2C) functions in prokaryotes and eukaryotes.
And the unexpected abundance of a nitrogen metabolism catalyst typically associated with eukaryotes (type II glutamine synthetase) suggested two possible evolutionary mechanisms: either lateral gene transfer from eukaryotes, or gene duplication prior to the divergence of prokaryotes and eukaryotes. (The researchers suspect gene duplication.) The diversity of the GOS sequences also promises to characterize sequences with no similarity to known sequences (known as ORFans): over 6,000 ORFans pair up with GOS sequences representing some 600 organisms, paving the way for further study of their identity and function.
As GOS protein predictions are tested, some of these proteins will expand existing protein families while others will carve out GOS-specific families. Both results will help researchers determine priority targets for structural studies—an essential strategy for dealing with the flood of protein discoveries. And given that the GOS sequences represent mostly microbes from the ocean’s surface—yet point to substantial viral diversity as well—the rate of protein discovery indicates that a comprehensive catalog of proteins in nature is far from complete.
Variations on a Theme: A Single Fold Spawns a Diverse Kinase Superfamily
Cellular life chugs along under the power of enzymes, proteins that catalyze the scores of chemical reactions required for life. One of the largest protein families in eukaryotes, the eukaryotic protein kinases (ePKs), regulates the activity of a large fraction of all proteins and almost all biological pathways by phosphorylating proteins. Phosphorylation activates its target by transferring a phosphate group from adenosine triphosphate (ATP) to a specific amino acid on the protein, releasing energy and inducing structural changes that alter the protein’s activity. (Dephosphorylation removes the phosphate group, restoring the protein to its original conformation and inactive state.) One cell can contain hundreds of different protein kinases, each charged with phosphorylating one or many different proteins.
Bacteria and other prokaryotes, conventional wisdom held, rely mostly on structurally distinct kinases (histidine kinases) to mediate protein phosphorylation and cell signaling. But it now emerges that ePK-like kinases (ELKs), once thought to be minor players, are more prevalent and widespread than the histidine kinases. Although ePKs and ELKs typically exhibit very low sequence similarity, they share similar phosphorylation mechanisms and the same structural fold (the protein kinase–like, or PKL, fold).
Since PKL kinases conserve both fold and mechanism of action, they provide a robust model for determining how sequence variation corresponds to functional diversity. Unfortunately, comprehensive comparisons had been frustrated by a lack of sequence information for the prokaryotic ELK families relative to the well-studied eukaryotic domains. But now, thanks to the Sorcerer II expedition, sequence databases are brimming with microbial sequences, including a 3-fold increase in ELK sequences. Taking advantage of the bounty, Natarajan Kannan, Gerard Manning, and colleagues surveyed the global PKL landscape, and identified over 45,000 PKLs, which they classified into 20 families. Surprisingly, PKLs appear to usurp the histidine kinases as the core regulator of prokaryotic signaling and cell behavior.
Cataloging the number and diversity of PKL families. To detect kinase sequences, Kannan et al. searched over 17 million predicted proteins in the GOS dataset and 5 million-plus predicted and known protein sequences in public databases. Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs along with a model that predicts kinases on the basis of a few ultra-conserved motifs. The sensitivity of the HMMs allowed the researchers to discover very remote new members of these families and to classify and organize the tens of thousands of sequences. Both approaches iterate through multiple runs of the clustered results to refine the family alignments and to classify clusters with little similarity to known PKL families as potentially novel. (For more on these methods, see Box 2.)
The public databases, it turned out, harbored nearly 25,000 ePKs and over 5,000 ELKs. Over 16,000 GOS sequences fell into 20 PKL families—doubling the size of most families. Three main superfamily clusters emerged, distinguished by the most abundant members: choline and aminoglycoside kinases (CAKs), a “particularly diverse” family harboring kinases that facilitate colonization by beneficial and pathogenic bacteria; ePKs, almost exclusively eukaryotic except for a similar bacterial kinase (pknB); and a cluster of kinases, including Rio and Bud32, that are conserved between archaea and eukaryotes. Three families bore no sequence similarity to any other families save for a group of key motifs.
Overall, the 20 families exhibit significant functional and sequence diversity. Most of the families have not yet been fully investigated, though they do include some characterized members. Those with known kinase activity target small
molecules (such as lipids and amino acids) and seem to play regulatory roles, in contrast to many other structurally unrelated small molecule kinases, which affect metabolism.
Functional diversity springs from a set of core residues. Because sequence similarity ranged from “very low to almost undetectable,” the researchers used sequence profiles—models built from entire families to highlight their core characteristics—to both discover and classify kinase sequences. They found several novel families, and greatly extended the breadth of previously defined families. With these methods to refine the relationships within and between PKL families, the researchers explored the traits that unite or distinguish them.
Ten key amino acid residues of the catalytic domain consistently turned up in each family. This “core pattern of conservation,” the researchers explain, represents an ancient evolutionary innovation, spanning not just the three divisions of life—which diverged 1–2 billion years ago—but also the diverse families. The conservation of these residues across and within the families suggests that they play an essential role. And, indeed, six of those already characterized mediate ATP binding and catalysis.
Yet despite the seemingly universal presence of the ten residues, their occurrence in individual subfamilies showed a surprising pattern: all but one of these “core” residues had either disappeared or changed in individual families—though the proteins retained their fold and function—suggesting an unexpected flexibility for catalytic cores. To test this possibility, the researchers focused on one of the ten residues—the catalytic lysine K72, which repositions ATP’s phosphates. Present in ePKs, K72 is replaced by a different conserved amino acid in three CAK subfamilies. These subfamilies had corresponding substitutions near other key motifs, and structural modeling showed how these coordinated replacements could still result in an active enzyme.
A number of features (including amino acid motifs and secondary structure) emerged as family-specific, being highly conserved within but not between families. And as was seen in the CAK analysis, many family-specific residues occur near one of the ten key residues, suggesting that they may help direct substrates to the catalytic core or influence the nature of the reaction.
Evolutionary insights and beyond. Altogether these results reveal the vast functional and phylogenetic diversity that can occur in even just a subset of proteins, even though they retain a common catalytic fold and function. The massive sequence comparisons in this study not only identified the core of the PKL kinase, but also revealed the specific motifs underlying each family, including the ePKs. And the flexibility of several key regions within ePKs may underlie the huge expansion of these enzymes in eukaryotes. This structural flexibility may give kinases the ability to integrate multiple regulatory signals, and account for their almost universal involvement in the regulation of eukaryotic pathways.
These results set the stage for more in-depth structural and biochemical studies to elucidate the diverse functions carried out by these critical regulators of cell behavior. This study also demonstrates how metagenomic datasets, by covering an unbiased diversity of life, can refine our understanding of well-studied protein families, such as the ePKs, and shed light on their evolution. Kannan et al. hope that others take advantage of the environmental metagenomic largesse to pursue “similar insights into virtually every gene family with prokaryotic relatives.”
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. doi:10.1371/journal.pbio.0050077
Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. doi:10.1371/journal.pbio.0050016
Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. doi:10.1371/journal.pbio.0050017
This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Feature
Sorcerer II: The Search for Microbial Diversity Roils the Waters
Henry Nicholls
Citation: Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5(3): e74. doi:10.1371/journal.pbio.0050074
Copyright: © 2007 Henry Nicholls. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abbreviations: UNCLOS, United Nations Convention on the Law of the Sea
Henry Nicholls is a freelance science journalist based in London, United Kingdom. His book Lonesome George was nominated for the 2006 Guardian First Book Award. E-mail: henry.nicholls@tiscali.co.uk
This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Craig Venter is not short of ambition. With the human genome fresh off the sequencing machines, he set his sights on a project of even grander scale: to describe the immense wealth of genetic information living in the world’s oceans. This voyage into biologically uncharted waters was, according to the Web site of the expedition vessel Sorcerer II, inspired in part by the voyage of H. M. S. Beagle [1]. Venter, it seems, would like to be remembered as the Charles Darwin of the 21st century (Figure 1).
Figure 1. Two of a Kind? The young Charles Darwin (left) and Craig Venter (right). (Photo: J. Craig Venter Institute)
This is the largest effort to describe the genetic diversity in the world’s oceans. The voyage around national and international waters, collecting from around 150 sites and interrogating samples at the level of the gene rather than at the level of the organism, has already turned up between 5 and 6 million genes. Most of these genes have never been seen before, says Venter. Analysing this immense collection of data, the researchers discovered that many of the genes encode proteins that fall outside standard classification schemes. Proteins grouped within their own unique kingdoms are turning up in other kingdoms as well—forcing the team to reconsider the evolutionary relationships of established kingdoms. “This project is revealing some of the biggest discoveries about the environment,” says Venter. (For more on these discoveries see the synopsis of the research articles [2].)
“If Darwin were alive today trying to do his experiments, he would not have been allowed to.”
Untapped Diversity
The Sorcerer II probably captured only a tiny fraction of the genetic diversity out there, says Mitchell Sogin, Director of the Josephine Bay Paul Center in Comparative Molecular Biology and Evolution at the Marine Biological Laboratory in Woods Hole, Massachusetts. In August 2006, Sogin and his colleagues published a detailed analysis of variable stretches of ribosomal RNA collected from the marine microbial world (Figure 2) [3]. “We estimate there are at least 25,000 different kinds of microbes per litre of seawater,” says Sogin. “But I wouldn’t be surprised if it turns out there are 100,000 or more.” A few of these microbes are common, and Venter will probably use them to recover complete gene sequences, he says. “The vast majority of low-abundance organisms are going undetected.”
Venter is more than aware that there’s a lot more to be discovered, but for the moment the goal is to sequence as many genes, in their entirety, as possible from these ecologically rich environments. These data raise a host of intriguing questions: in particular, what is the structure and function of the novel proteins these genes encode, and what role do they play in the metabolism of these undescribed microbes? Just as Darwin’s work drove a change in the way we see the world, so Venter is hoping these marine data will do the same in years to come.
Legal Framework
But times have changed. In the 21st century, there are plenty of hurdles to clear before the collecting and describing of biodiversity—even microscopic biodiversity—can go ahead.
The 1982 United Nations Convention on the Law of the Sea (UNCLOS) endowed coastal nations with the sovereign right to explore and exploit all resources within their “exclusive economic zone”—usually a body of water stretching 200 nautical miles out to sea [4]. Most coastal states exercise this right, granting permits to outsiders wanting to conduct research in their waters.
The 1992 Convention on Biological Diversity went on to set out some basic principles that might encourage sharing of benefits arising from genetic resources [5]. Where parties to the convention have got round to incorporating these principles into their own legislation, the result has been that anyone wishing to conduct research on these resources must agree to terms set by the host government.
Beyond national waters (with a few exceptions) are the “high seas”. Here, there is little regulation. According to UNCLOS, mineral resources on the deep seabed are considered the “common heritage of mankind”; this means that any benefits deriving from them should be shared with the international community. But when it comes to biological resources, just about anything goes.
The Rise of Bioprospecting
In areas beyond national jurisdiction, there has been an increase in so-called bioprospecting, the search for and exploitation of commercially valuable compounds from genetic resources. In 2005, researchers at the United Nations University scoured patent office databases for inventions based on the genomic features of deep seabed organisms [6]. They found that private companies such as Roche, Diversa, and New England Biolabs are after patents on DNA polymerases developed from deep-sea thermophilic bacteria that promise to enhance the molecular biologist’s expanding toolbox. Others like Sederma (based in France) and California Tan (based in the US) have used enzymes from similar microorganisms to develop skin products boasting UV- and heat-resistant properties.
There are plenty of not-for-profit organisations interested in the applications of discoveries from the deep. For example, Harbor Branch Oceanographic Institution, an oceanographic research and education institution based in Florida, is after compounds from marine organisms that might have biomedical potential. The institution has patents on, among others, potential anti-cancer agents derived from the marine sponges Discodermia dissoluta and Forcepia triabilis (Figure 3).
Deep-sea exploration, and the lengthy research and development that follows, is an expensive business. This means it’s a realistic option for only the world’s wealthiest nations. At least that’s the concern being expressed by some developing countries that would like to see a piece of this action, says David Leary of the Centre for Environmental Law at Macquarie University in Sydney, Australia.
“No effort ever attempted to incorporate data from such vastly divergent sources to meet the needs of such a wide range of scientific interests.”
These countries are seeking a change to UNCLOS that requires biological resources to be treated in the same way as mineral resources and any benefits deriving from them to be shared with the wider community. But others fear tighter regulation of such activities will only stifle pure marine scientific research. The Philippines was one of the first countries to regulate access to its genetic resources, says Sam Johnston, an expert on international environmental law based in Melbourne, Australia, and a senior research fellow at the United Nations University Institute of Advanced Studies in Yokohama, Japan. “It basically closed down all research,” he says. “A lot of researchers around the world have found the red tape prohibitive.”
Finding a balance between the unregulated status quo and cumbersome controls over research on marine biodiversity is now the concern of a United Nations working group [7]. “Some countries see this as the early stage of negotiating a new UNCLOS,” says Leary. But, he warns, “this could take 10 or 15 years before we see a result.”
One compromise might be for coastal states to allow all research on their genetic resources with the proviso that exploitation of any commercial application is subject to further negotiation. Another possibility is for the patent system to take responsibility for seeing that benefits are shared fairly, only granting patents based on biological resources if a royalty is paid into a global commons trust fund.
Figure 2. A Remotely Operated Platform Samples Vent Fluids from the Northeast Pacific Ocean (Photo: NOAA, http://oceanexplorer.noaa.gov)
Ecological Impact
Whilst the UN goes in search of this kind of middle ground, both pure and applied research in the high seas continues apace—and this is cause for another concern. “There’s a number of sites that are so popular that there’s concern about the intensity of research,” says Leary. Repeated visits to the same deep-sea spot could not only result in unsustainable collection of some species and influence local hydrological and environmental conditions, but increase the likelihood that one person’s experiment will influence that of another. So far, little thought has been devoted to this consequence of unregulated access, says Leary. “I haven’t yet seen any clear
scientific data on the extent of the environmental impact of bioprospecting or marine scientific research,” he says.
Figure 3. Marine Sponges That Have Generated Products with Anti-Cancer Promise (A) Discodermia dissoluta. (Photo: NOAA) (B) Forcepia triabilis. (Photo: T. Piper, NOAA) doi:10.1371/journal.pbio.0050074.g003
Clearly, the environmental impact of carrying off 150-odd barrels of seawater for analysis isn’t something that Venter and his colleagues had to worry about. But navigating the complex legal territory was. “If Darwin were alive today trying to do his experiments, he would not have been allowed to,” says Venter.
At least, that is, without help from a lawyer. Sorcerer II collected samples in the waters of 17 coastal states and obtained all necessary permits, says Bob Friedman, Vice President for Environmental and Energy Policy at the J. Craig Venter Institute. Some countries required detailed agreements thrashing out how benefits deriving from these data would be shared. All of these are posted on the Sorcerer II Web site, says Friedman [8]. Most countries, however, have not decided how they might regulate access to their genetic resources, he says.
In addition to getting the paperwork in order, Venter encouraged collaboration with local scientists. What’s more, the entire metagenomic database will be put in the public domain. The gene sequences should be of tremendous value to each of the countries involved, says Venter. In particular, it will help them monitor and manage the health of their marine ecosystems more effectively, he predicts. To ensure that this vast dataset will be available to all, the Gordon and Betty Moore Foundation has stumped up $24.5 million for a seven-year project to design a new database to host it and new tools to interrogate it (Box 1).
Box 1. Zooming in on CAMERA
CAMERA is the convenient acronym for the cumbersomely named Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis. “This resource will focus on providing easy-to-use tools for uploading, downloading, searching, and analysis of genomic datasets,” says Paul Gilna, CAMERA’s executive director, based at the California Institute for Telecom and Information Technology in La Jolla, California.
Researchers will also be able to clothe the bare genetic sequences in a wealth of other data, such as GPS coordinates and depth of collection, the water temperature, its oxygen content, salinity and pH. The site could well draw upon other resources that enrich these metadata, says Gilna. For example, satellite imagery associated with the sampling sites, and other data types, such as microscopy stills and high-definition video, could become important metadata that help researchers characterise the environments from which samples were taken.
Crucially, CAMERA will allow researchers to record the source of each genetic sequence. Many coastal countries now want a share of commercial applications that derive from their marine resources. Countries may be happy to see genetic sequences placed in CAMERA provided they are acknowledged and commercial exploitation of their sequence is not permitted without their consent.
But handling such immense datasets poses considerable technological challenges. The GOS database alone contains around 6 billion bases—the equivalent of two entire human genomes. And the number and size of this kind of database will only mushroom in coming years, making it necessary to develop high-speed optical networks, grid-based computing, and new visualisation technologies. “We are quickly approaching a ‘tipping point’,” says Gilna. “These datasets will start to follow exponential, rather than linear trends, much as was the case for DNA sequencing.”
Finally, there’s the tricky task of satisfying all researchers who could benefit from this resource. “The scientific communities—from studies on biodiversity and biogeochemistry to evolution and genomes—have different interests, different data expectations, different vocabularies, and different levels of experience with using computational tools and databases,” says John Wooley, a pharmacologist at the University of California, San Diego, who is working on CAMERA. “Before metagenomics, no effort ever attempted to incorporate data from such vastly divergent sources to meet the needs of such a wide range of scientific interests.” For more on CAMERA, see the Community Page article by Seshadri et al. [13].
Yet, it seems, all these undertakings and assurances have not been enough to steer this expedition clear of controversy. In 2004, when the Sorcerer II dropped anchor just off Hiva Oa, an island in the Marquesas archipelago in the Pacific Ocean, tensions escalated. Although the plan to sample seawater around the islands had the backing of local French Polynesian authorities and scientists, the French government in Paris had other ideas, says Venter. “We were placed under house arrest.” Eventually, after a further round of intense negotiations, the Sorcerer II was allowed out of the harbour to collect its seawater samples and continue on its way.
Last year, a Canadian-based non-governmental organisation—the Action Group on Erosion, Technology and Concentration—dedicated to “the advancement of cultural
and ecological diversity and human rights” labelled Venter a So, keen as Venter might be to put the controversy of his
“biopirate”, accusing him of “flagrant disregard for national human-genome-sequencing days behind him, this kind of
sovereignty over biodiversity” [9]. In several countries, there’s research strays into unknown biological, legal, and ethical
real concern about how he managed his collecting, claims territory. And in this environment, allegations of biopiracy
Pat Mooney, Executive Director of the group. Although the are almost inevitable. This, however, is unlikely to deter a
data are going into the public domain, it is laboratories like man like Venter. “If it’s in the Darwin school of biopiracy,
Venter’s that are best placed to exploit it, he argues. “There’s then fine,” he says.
a handful of folk around the planet that can understand such References
stuff,” says Mooney. 1. Sorcerer II Expedition (2005) Expedition info—Environmental genomics.
Venter is adamant that this whole project is just pure, clean Available: http:⁄⁄www.sorcerer2expedition.org. Accessed 19 January 2007.
2. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial
marine scientific research. Indeed, the Sorcerer II Web site biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085
explicitly states that “no intellectual property rights will be 3. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006)
Microbial diversity in the deep sea and the underexplored “rare biosphere”.
sought by the Venter Institute on these genomic sequence Proc Natl Acad Sci U S A 103: 12115–12120.
data” [10]. Venter sums up the goal of the project: “We were 4. United Nations (1982) United Nations convention on the law of the sea of
just trying to answer some basic questions about the diversity 10 December 1982. New York: United Nations. Available: http:⁄⁄www.un.
org/Depts/los/convention_agreements/convention_overview_convention.
of microbes on the planet,” he says. htm. Accessed 18 January 2007.
But, says environmental lawyer Johnston, the distinction 5. Secretariat of the Convention on Biological Diversity (2002) Bonn
between pure and applied research is becoming increasingly guidelines on access to genetic resources and fair and equitable sharing
of the benefits arising out of their utilization. Montreal: Secretariat of the
blurred. To illustrate this, he cites a strain of thermophilic Convention on Biological Diversity. Available: https:⁄⁄www.biodiv.org/doc/
Bacillus collected from Antarctica in the early 1980s as part publications/cbd-bonn-gdls-en.pdf. Accessed 16 January 2007.
6. Arico S, Salpin C (2005) UNU-IAS report—Bioprospecting of genetic
of a study into the worldwide distribution and characteristics resources in the deep seabed: Scientific, legal and policy aspects. Yokohama
of such extremophiles. Years later, the same sample, taken (Japan): United Nations University Institute of Advanced Studies. Available:
out of storage and subjected to further study, turned out http:⁄⁄www.ias.unu.edu/binaries2/DeepSeabed.pdf. Accessed 16 January
2007.
to contain a talented enzyme that has the promise to 7. International Institute for Sustainable Development (2006) Ad Hoc Open-
revolutionise DNA extraction for forensic analysis [11]. “The ended Informal Working Group to study issues relating to the conservation
and sustainable use of marine biological diversity beyond areas of national
collector undertook the act in the purest form but ultimately jurisdiction. Winnipeg (Canada): International Institute for Sustainable
the use of it has changed in the course of two decades,” says Development. Available: http:⁄⁄www.iisd.ca/oceans/marinebiodiv. Accessed
Johnston. “So much depends on the perspective at which you 16 January 2007.
8. Sorcerer II Expedition (2005) Collaborative agreements. Available:
look at the issue.” http:⁄⁄www.sorcerer2expedition.org/permits. Accessed 22 January 2007.
This means that there are likely to be several different 9. Coalition Against Biopiracy (2006) Captain Hook awards for biopiracy
2006. Available: http:⁄⁄www.captainhookawards.org/winners/2006_pirates.
takes on the same research. What for one person is Accessed 18 January 2007.
pure marine scientific research can be another person’s 10. Sorcerer II Expedition (2005) Agreements. Available: http:⁄⁄www.
bioprospecting and yet another’s biopiracy. There are very sorcerer2expedition.org. Accessed 19 January 2007.
11. Moss D, Harbison AS, Saul DJ (2003) An easily automated, closed-tube
few cases where everyone agrees there has been outright forensic DNA extraction procedure using a thermostable proteinase. Int J
theft of a biological resource and very few cases where Legal Med 117: 340–349.
12. Laird SA, Wynberg R, Johnston S (2006) Recent trends in the biological
everyone is happy there’s been proper benefit sharing,
prospecting. 29th Antarctic Treaty Consultative Meeting. Available:
says Johnston. “Even the best-designed programmes where http:⁄⁄www.ias.unu.edu/binaries2/ATCM29_May2006.doc. Accessed 16
there’s enormous consultation with the local people have January 2007.
13. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A
found it’s difficult to get the right kind of consensus and community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/
buy-in,” he says [12]. journal.pbio.0050075
Essay
Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes
Jonathan A. Eisen

Essays articulate a specific perspective on a topic of broad interest to scientists.

Citation: Eisen JA (2007) Environmental shotgun sequencing: Its potential and challenges for studying the hidden world of microbes. PLoS Biol 5(3): e82. doi:10.1371/journal.pbio.0050082

Series Editor: Simon Levin, Princeton University, United States of America

Copyright: © 2007 Jonathan A. Eisen. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: ESS, environmental shotgun sequencing; PCR, polymerase chain reaction; rRNA, ribosomal RNA

Jonathan A. Eisen is at the University of California Davis Genome Center, with joint appointments in the Section of Evolution and Ecology and the Department of Medical Microbiology and Immunology, Davis, California, United States of America. Web site: http://phylogenomics.blogspot.com. E-mail: jaeisen@ucdavis.edu

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Since their discovery in the 1670s by Anton van Leeuwenhoek, an incredible amount has been learned about microorganisms and their importance to human health, agriculture, industry, ecosystem functioning, global biogeochemical cycles, and the origin and evolution of life. Nevertheless, it is what is not known that is most astonishing. For example, though there are certainly at least 10 million species of bacteria, only a few thousand have been formally described [1]. This contrasts with the more than 350,000 described species of beetles [2]. This is one of many examples indicative of the general difficulties encountered in studying organisms that we cannot readily see or collect in large samples for future analyses. It is thus not surprising that most major advances in microbiology can be traced to methodological advances rather than scientific discoveries per se.

Examples of these key revolutionary methods (Table 1) include the use of microscopes to view microbial cells, the growth of single types of organisms in the lab in isolation from other types (culturing), the comparison of ribosomal RNA (rRNA) genes to construct the first tree of life that included microbes [3], the use of the polymerase chain reaction (PCR) [4] to clone rRNA genes from organisms without culturing them [5–7], and the use of high-throughput “shotgun” methods to sequence the genomes of cultured species [8]. We are now in the midst of another such revolution—this one driven by the use of genome sequencing methods to study microbes directly in their natural habitats, an approach known as metagenomics, environmental genomics, or community genomics [9].

In this essay I focus on one particularly promising area of metagenomics—the use of shotgun genome methods to sequence random fragments of DNA from microbes in an environmental sample. The randomness and breadth of this environmental shotgun sequencing (ESS)—first used only a few years ago [10,11] and now being used to assay every microbial system imaginable from the human gut [12] to waste water sludge [13]—has the potential to reveal novel and fundamental insights into the hidden world of microbes and their impact on our world. However, the complexity of analysis required to realize this potential poses unique interdisciplinary challenges, challenges that make the approach both fascinating and frustrating in equal measure.

Who Is Out There? Typing and Counting Microbes in the Environment

One of the most important and conceptually straightforward steps in studying any ecosystem involves cataloging the types of organisms and the numbers of each type. For a long time, such typing and counting was an almost insurmountable problem in microbiology. This is largely because physical appearance does not provide a valid taxonomic picture in microbes. Appearance evolves so rapidly that two closely related taxa may look wildly different and two distantly related taxa may look the same. This vexing problem was partially overcome in the 1980s through the use of rRNA-PCR (Table 1). This method allows microorganisms in a sample to be phylogenetically typed and counted based on the sequence of their rRNA genes, genes that are present in all cell-based organisms. In essence, a database of rRNA sequences [14,15] from known organisms functions like a bird field guide, and finding a rRNA-PCR product is akin to seeing a bird through binoculars. Rather than counting species, this approach focuses on “phylotypes,” which are defined as organisms whose rRNA sequences are very similar to each other (a cutoff of >97% or >99% identical is frequently used). The ability to use phylotyping to determine who was out there in any microbial sample has revolutionized environmental microbiology [16], led to many discoveries [e.g., 17], and convinced many people (myself included) to become microbiologists.
Table 1. Some Major Methods for Studying Individual Microbes Found in the Environment

Method: Microscopy
Summary: Microbial phenotypes can be studied by making them more visible. In conjunction with other methods, such as staining, microscopy can also be used to count taxa and make inferences about biological processes.
Comments: The appearance of microbes is not a reliable indicator of what type of microbe one is looking at.

Method: Culturing
Summary: Single cells of a particular microbial type are grown in isolation from other organisms. This can be done in liquid or solid growth media.
Comments: This is the best way to learn about the biology of a particular organism. However, many microbes are uncultured (i.e., have never been grown in the lab in isolation from other organisms) and may be unculturable (i.e., may not be able to grow without other organisms).

Method: rRNA-PCR
Summary: The key aspects of this method are the following: (a) all cell-based organisms possess the same rRNA genes (albeit with different underlying sequences); (b) PCR is used to make billions of copies of basically each and every rRNA gene present in a sample; this amplifies the rRNA signal relative to the noise of thousands of other genes present in each organism’s DNA; (c) sequencing and phylogenetic analysis places rRNA genes on the rRNA tree of life; the position on the tree is used to infer what type of organism (a.k.a. phylotype) the gene came from; and (d) the numbers of each microbe type are estimated from the number of times the same rRNA gene is seen.
Comments: This method revolutionized microbiology in the 1980s by allowing the types and numbers of microbes present in a sample to be rapidly characterized. However, there are some biases in the process that make it not perfect for all aspects of typing and counting.

Method: Shotgun genome sequencing of cultured species
Summary: The DNA from an organism is isolated and broken into small fragments, and then portions of these fragments are sequenced, usually with the aid of sequencing machines. The fragments are then assembled into larger pieces by looking for overlaps in the sequence each possesses. The complete genome can be determined by filling in gaps between the larger pieces.
Comments: This has now been applied to over 1,000 microbes, as well as some multicellular species, and has provided a much deeper understanding of the biology and evolution of life. One limitation is that each genome sequence is usually a snapshot of one or a few individuals.

Method: Metagenomics
Summary: DNA is directly isolated from an environmental sample and then sequenced. One approach to doing this is to select particular pieces of interest (e.g., those containing interesting rRNA genes) and sequence them. An alternative is ESS, which is shotgun genome sequencing as described above, but applied to an environmental sample with multiple organisms, rather than to a single cultured organism.
Comments: This method allows one to sample the genomes of microbes without culturing them. It can be used both for typing and counting taxa and for making predictions of their biological functions.

doi:10.1371/journal.pbio.0050082.t001
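The typing and counting step of rRNA-PCR (Table 1, steps c and d) can be illustrated with a small Python sketch. It is not the procedure used in any of the cited studies: real analyses place sequences on a phylogenetic tree, whereas this sketch substitutes a simple greedy grouping by percent identity. It assumes the rRNA reads have already been aligned to equal length, and the function names (`percent_identity`, `assign_phylotypes`, `phylotype_counts`) are hypothetical. The default cutoff mirrors the ">97% identical" convention mentioned in the text.

```python
from collections import Counter


def percent_identity(a, b):
    """Fraction of matching positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def assign_phylotypes(seqs, cutoff=0.97):
    """Greedy grouping (an illustrative assumption, not a tree-based method).

    Each read joins the first phylotype whose representative it matches
    at more than `cutoff` identity; otherwise it founds a new phylotype.
    Returns one integer label per read plus the list of representatives.
    """
    representatives = []
    labels = []
    for seq in seqs:
        for index, rep in enumerate(representatives):
            if percent_identity(seq, rep) > cutoff:
                labels.append(index)
                break
        else:
            # No existing representative is similar enough.
            representatives.append(seq)
            labels.append(len(representatives) - 1)
    return labels, representatives


def phylotype_counts(seqs, cutoff=0.97):
    """Estimate abundance as the number of reads assigned to each phylotype."""
    labels, _ = assign_phylotypes(seqs, cutoff)
    return Counter(labels)
```

Raising the cutoff to 0.99 splits reads more finely, which is one way to see why the choice of threshold changes how many "types" a sample appears to contain.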
The selective targeting of a single gene makes rRNA-PCR an efficient method for deep community sampling [18]. However, this efficiency comes with limitations, most of which are complemented or circumvented by the randomness and breadth of ESS. For example, examination of the random samples of rRNA sequences obtained through ESS has already led to the discovery of new taxa—taxa that were completely missed by PCR because of its inability to sample all taxa equally well (e.g., [19]). In addition, ESS provides the first robust sampling of genes other than rRNA, and many of these genes can be more useful for some aspects of typing and counting. Some universal protein coding genes are better than rRNA both for distinguishing closely related strains (because of third position variation in codons) and for estimating numbers of individuals (because they vary less in copy number between species than do rRNA genes) [10]. Perhaps most significantly, ESS is providing groundbreaking insights into the diversity of viruses [20,21], which lack rRNA genes and thus were left out of the previous revolution.

Certainly, many challenges remain before we can fully realize the potential of ESS for the typing and counting of species, including making automated yet accurate phylogenetic trees of every gene, determining which genes are most useful for which taxa, combining data from different genes even when we do not know if they come from the same organisms, building up databases of genes other than rRNA, and making up for the lack of depth of sampling. If these challenges are met, ESS has the potential to rewrite much of what we thought we knew about the phylogenetic diversity of microbial life.

What Are They Doing? Top Down and Bottom Up Approaches to Understanding Functions in Communities

A community is, of course, more than a list of types of organisms. One approach to understanding the properties and functioning of a microbial community is to start with studies of the different types of organisms and build up from these individuals to the community. Ideally, to do this one would culture each of the phylotypes and study its properties in the lab. Unfortunately, many, if not most, key microbes have not yet been cultured [22]. Thus, for many years, the only alternative was to make predictions about the biology of particular phylotypes based on what was known about related organisms. Unfortunately, this too does not work well for microbes since very closely related organisms frequently have major biological differences. For example, Escherichia coli K12 and E. coli O157:H7 are strains of the same species (and considered to be the same phylotype), with genomes containing only about 4,000 genes, yet each possesses hundreds of functionally important genes not seen in the other strain [23]. Such differences are routine in microbes, and thus one cannot make any useful inferences about what particular phylotypes are doing (e.g., type of metabolism, growth properties, role in nutrient cycling, or pathogenicity) based on the activities of their relatives.

These difficulties—the inability to culture most microbes and the functional disparities between close relatives—led to one of the first kinds
Table 2. Methods of Binning

Method: Genome assembly
Description: Identify regions of overlap between different fragments from the same organism to build larger contiguous pieces (contigs).
Comments: Getting deep enough sampling for this to work is very expensive except for low diversity systems or for very abundant taxa.

Method: Reference genome alignment
Description: Identify ESS fragments or contigs that are very similar to already assembled sections of the genome of single microbial types.
Comments: (a) One of the most effective ways to sort through ESS data, if the reference genome is very closely related to an organism in the sample; (b) the reason why more reference genomes are needed; (c) does not handle regions present in uncultured organisms but not in the reference.

Method: Phylogenetic analysis
Description: Build evolutionary trees of genes encoded by ESS fragments or contigs. Assign fragments or contigs to taxonomic groups based on nearest neighbor(s) in trees.
Comments: (a) Very powerful, but level of resolution depends on whether fragments encode useful phylogenetic markers and on how well sampled the database is for the neighbor analysis; (b) would work much better if more genomes were available from across the tree of life.

Method: Word frequency and nucleotide composition analysis
Description: Measure word frequency and composition of each fragment. Group by clustering algorithms or principal component analysis.
Comments: (a) Has the potential to work because organisms sometimes have “signatures” of word frequencies that are found throughout the genome and are different between species; (b) very challenging for small fragments.

Method: Population genetics
Description: Build alignments of fragments or contigs with similarity to each other (but not as much as needed for assembly). Examine haplotype structure, predicted effective population size, and synonymous and nonsynonymous substitution patterns.
Comments: May be most useful as a way of subdividing bins created by other methods.

Note that some methods can be applied to ESS fragments or to bins identified by other methods.
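As a rough illustration of the "word frequency" row of Table 2, the following Python sketch computes the frequency of every DNA word of length k in each fragment and then groups fragments whose frequency profiles lie close together. It is a minimal sketch under stated assumptions, not the method of any cited study: the greedy threshold grouping stands in for the "clustering algorithms or principal component analysis" named in the table, and the names `word_frequencies`, `profile_distance`, `bin_fragments`, and the `max_distance` parameter are hypothetical.

```python
from itertools import product
from math import sqrt


def word_frequencies(fragment, k=4):
    """Relative frequency of each possible k-letter DNA word in a fragment."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if word in counts:  # skip words containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def profile_distance(u, v):
    """Euclidean distance between two word-frequency profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bin_fragments(fragments, k=4, max_distance=0.2):
    """Greedy binning (illustrative assumption): a fragment joins the first
    bin whose founding profile is within `max_distance`, else starts a new bin.
    Returns one integer bin label per fragment."""
    founders = []
    labels = []
    for fragment in fragments:
        profile = word_frequencies(fragment, k)
        for index, founder in enumerate(founders):
            if profile_distance(profile, founder) <= max_distance:
                labels.append(index)
                break
        else:
            founders.append(profile)
            labels.append(len(founders) - 1)
    return labels
```

The `max_distance` parameter plays the role of binning stringency: a very small value leaves most fragments in bins of their own, while a large value merges fragments with different composition into the same bin. Short fragments give noisy profiles, which is the difficulty flagged in the table's comments.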
of metagenomic analyses, wherein predictions of function were made from analysis of the sequence of large DNA fragments from representatives of known phylotypes. This approach has provided some stunning insights, such as the discovery of a novel form of phototrophy in the oceans [24]. However, this large insert approach has the same limitation as predicting properties from characterized relatives—a single cell cannot possibly represent the biological functions of all members of a phylotype.

ESS provides an alternative, more global way of assessing biological functions in microbial communities. As when using the large insert approach, functions can be predicted from sequences. However, in this case the predicted functions represent a random sampling of those encoded in the genomes of all the organisms present. This approach has unquestionably been wildly successful in terms of gene discovery. For example, analysis of ESS data has revealed novel forms of every type of gene family examined, as well as a great number of completely novel families (e.g., [25]). However, there is a major caveat when using ESS data to make community-level inferences. Ecosystems are more than just a bag of genes—they are made up of compartments (e.g., cells, chromosomes, and species), and these compartments matter. The key challenge in analyzing ESS data is to sort the DNA fragments (which are usually less than 1,000 base pairs long relative to genome sizes of millions or billions of bases) into bins that correspond to compartments in the system being studied.

A recent study by myself and colleagues illustrates the importance of compartments when interpreting ESS data. When we analyzed ESS data from symbionts living inside the gut of the glassy-winged sharpshooter (an insect that has a nutrient-limited diet), we were able to bin the data to two distinct symbionts [26]. We then could infer from those data that one of the symbionts synthesizes amino acids for the host while the other synthesizes the needed vitamins and cofactors. Modeling and understanding of this ecosystem are greatly enhanced by the demonstration of this complementary division of labor, in comparison to simply knowing that amino acids, vitamins, and cofactors are made by “symbionts.”

How does one go about binning ESS data? A variety of approaches have been developed, some of which are described in Table 2. In considering the different binning methods and their limitations, the first question one needs to ask is, what are we trying to bin? Is it fragments from the same chromosome from a single cell, which would be useful for studying chromosome structure? If so, then perhaps genome assembly methods are the best. What if instead, as in the sharpshooter example, we are trying to have each bin include every fragment that came from a particular species, knowledge which may be useful for predicting community metabolic potential? If the level of genetic polymorphism among individual cells from the same species is high, then genome assembly methods may not work well (the polymorphisms will break up assemblies). A better approach might be to look for species-specific “word” frequencies in the DNA, such as ones created by patterns in codon usage. The challenge is, how do we tune the methods to find the right target level of resolution? If we are too stringent, most bins will include only a few fragments. But if we are too relaxed, we will create artificial constructs that may prove biologically misleading, such as grouping together sequences from different species. To make matters more complex, most likely the stringency needed will vary for different taxa present in the sample.

Another critical issue is the diversity of the system under study. Generally, binning works better when there are
few different phylotypes present, all of which are distantly related and form discrete populations. This is why binning works well for the sharpshooter system and other relatively isolated, low diversity environments. Binning increases in difficulty exponentially as the number of species increases: the populations and species start to merge together, and the populations get more and more polymorphic and variable in relative abundance (such as in the paper about the Global Ocean Sampling expedition in this issue [27]). Further complicating binning is the phenomenon of lateral gene transfer, where genes are exchanged between distantly related lineages at rates that are high enough that random sampling of a genome will frequently include genes with multiple histories.

Despite these challenges, I believe we can develop effective binning methods for complex communities. First, we can combine different approaches together, such as using one method to sort in a relaxed manner and then using another to subdivide the bins provided by the first method. Second, we can incorporate new approaches such as population genetics into the analysis [28]. In addition, the lessons learned here can be applied to other aspects of metagenomics (e.g., the counting and typing discussed above) and provide insights into the nature of microbial genomes and the structure of microbial populations and communities.

Comparative Metagenomics

So far, I have discussed issues relating mostly to intrasample analysis of ESS data. However, the area with perhaps the most promise involves the comparative analysis of different samples. This work parallels the comparative analysis of genomes of cultured species. Initial studies of that type compared distantly related taxa with enormous biological differences. What has been learned from these studies pertains mostly to core housekeeping functions, such as translation and DNA metabolism, and to other very ancient processes [29,30]. It was not until comparisons were made between closely related organisms that we began to understand events that occurred on shorter time scales, such as selection, gene transfer, and mutation processes [31]. Similarly, the initial comparisons of ESS data involved comparisons of wildly different environments [32], yielding insights into the general structure of communities. But as more comparisons are made between similar communities [33,34], such as those sampled during vertical and horizontal ocean transects [27,35–37], we will begin to learn about shorter time scale processes such as migration, speciation, extinction, responses to disturbance, and succession. It is from a combination of both approaches—comparing both similar and very divergent communities—that we will be able to understand the fundamental rules of microbial ecology and how they relate to ecological principles seen in macro-organisms.

Conclusions

In promoting some of the exciting opportunities with ESS, I do not want to give the impression that it is flawless. It is helpful in this respect to compare ESS to the Internet. As with the Internet, ESS is a global portal for looking at what occurs in a previously hidden world. Making sense of it requires one to sort through massive, random, fragmented collections of bits of information. Such searches need to be done with caution because any time you analyze such a large amount of data, patterns can be found. In addition, as with the Internet, there is certainly some hype associated with ESS that gives relatively trivial findings more attention than they deserve. Overall, though, I believe the hype is deserved. As long as we treat ESS as a strong complement to existing methods, and we build the tools and databases necessary for people to use the information, it will live up to its revolutionary potential.

Acknowledgments

I thank Simon Levin, Joshua Weitz, Jonathan Dushoff, Maria-Inés Benito, Doug Rusch, Aaron Halpern, and Shibu Yooseph for helpful discussions, and Melinda Simmons, Merry Youle, and three anonymous reviewers for helpful comments on the manuscript. The writing of this paper was supported by National Science Foundation Assembling the Tree of Life Grant 0228651 to Jonathan A. Eisen and by the Defense Advanced Research Projects Agency under grants HR0011-05-1-0057 and FA9550-06-1-0478.

References
1. Gould SJ (1996) Full house: The spread of excellence from Plato to Darwin. New York: Harmony Books. 244 p.
2. Evans AV, Bellamy CL (1996) An inordinate fondness for beetles. New York: Holt. 208 p.
3. Woese C, Fox G (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc Natl Acad Sci U S A 74: 5088–5090.
4. Mullis K, Faloona F (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155: 335–350.
5. Reysenbach AL, Giver LJ, Wickham GS, Pace NR (1992) Differential amplification of rRNA genes by polymerase chain reaction. Appl Environ Microbiol 58: 3417–3418.
6. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The characterization of enzymatically amplified eukaryotic 16S-like ribosomal RNA-coding regions. Gene 71: 491–500.
7. Weisburg W, Barns S, Pelletier D, Lane D (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697–703.
8. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
9. Handelsman J (2004) Metagenomics: Application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–685.
10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
11. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
12. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312: 1355–1359.
13. Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24: 1263–1269.
14. Olsen GJ, Larsen N, Woese CR (1991) The ribosomal RNA database project. Nucleic Acids Res 19: 2017–2021.
15. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, et al. (2007) The ribosomal database project (RDP-II): Introducing myRDP space and quality controlled public data. Nucleic Acids Res 35: D169–D172.
16. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
17. Hugenholtz P, Pitulle C, Hershberger KL, Pace NR (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.
18. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.
19. Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, et al. (2006) Lineages of acidophilic archaea revealed by community genomic analysis. Science 314: 1933–1935.
20. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368
21. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
22. Leadbetter JR (2003) Cultivation of recalcitrant microbes: Cells are alive, well and revealing their secrets in the 21st century laboratory. Curr Opin Microbiol 6: 274–281.
23. Perna NT, Plunkett G 3rd, Burland V, Mau B, Glasner JD, et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533.
24. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
25. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
26. Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, et al. (2006) Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters. PLoS Biol 4: e188. doi:10.1371/journal.pbio.0040188
27. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
28. Johnson PL, Slatkin M (2006) Inference of population genetic parameters in metagenomics: A clean look at messy data. Genome Res 16: 1320–1327.
29. Koonin EV, Mushegian AR (1996) Complete genome sequences of cellular life forms: Glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 6: 757–762.
30. Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93: 10268–10273.
31. Eisen JA (2001) Gastrogenomics. Nature 409: 463, 465–466.
32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
33. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, et al. (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7: 57.
34. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.
35. DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3: 459–469.
36. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
37. Worden AZ, Cuvelier ML, Bartlett DH (2006) In-depth analyses of marine microbial community genomics. Trends Microbiol 14: 331–336.
Community Page

CAMERA: A Community Resource for Metagenomics
Rekha Seshadri*, Saul A. Kravitz, Larry Smarr, Paul Gilna, Marvin Frazier

The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.

Citation: Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5(3): e75. doi:10.1371/journal.pbio.0050075

Copyright: © 2007 Seshadri et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: CAMERA, Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; GOS, Global Ocean Sampling

Rekha Seshadri, Saul A. Kravitz, and Marvin Frazier are at the J. Craig Venter Institute (JCVI) in Rockville, Maryland, United States of America. Larry Smarr and Paul Gilna are at the California Institute for Telecommunications and Information Technology (Calit2), a University of California San Diego (UCSD)/University of California Irvine partnership, La Jolla, California, United States of America. Larry Smarr is also the Harry E. Gruber Professor of Computer Science and Engineering at UCSD, La Jolla, California, United States of America. CAMERA is being developed by Calit2 at UCSD in collaboration with the JCVI, UCSD’s Center for Earth Observations and Applications (anchored by the Scripps Institution of Oceanography), the San Diego Supercomputer Center, and the University of California Davis.

* To whom correspondence should be addressed. E-mail: rseshadri@venterinstitute.org

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Microbes are responsible for most of the chemical transformations that are crucial to sustaining life on Earth. Their ability to inhabit almost any environmental niche suggests that they possess an incredible diversity of physiological capabilities. However, we have little to no information on a majority of the millions of microbial species that are predicted to exist, mainly because of our inability to culture them in the laboratory.

A growing discipline called metagenomics allows us to study these uncultured organisms by deciphering their genetic information from DNA that is extracted directly from their environment, thus effectively bypassing the laboratory culture step. Metagenomics allows us to address the questions “who’s there?”, “what are they doing?”, and “how are they doing it?”, offering insights into the evolutionary history as well as previously unrecognized physiological abilities of uncultured communities.

Studies such as the J. Craig Venter Institute’s Global Ocean Sampling (GOS) expedition (in this issue) reveal a remarkable breadth and depth of microbial diversity in the oceans. To date, researchers have made significant but largely preliminary inroads into understanding the biogeography of microbial populations across ecosystems. We know even less about the dynamic physiological processes and complex interactions that impact global carbon cycles and ocean productivity. Marine microbes are thought to act as part of the biological conduit that transports carbon dioxide from the surface to the deep oceanic realms. By removing carbon from the atmosphere and sequestering it (in the form of organic matter), marine microorganisms may significantly affect global climate. Although we now have numerous global and real-time methods to measure physical and chemical parameters within the ocean, few methods or concepts have been developed to measure important microbial processes on a global scale. Even if the technology to make such measurements existed, we would presently not know what to measure or how to interpret those measurements. We need a systematic way to explore the structure and function of ocean ecosystems, and their impact on global carbon processing and climate.

Metagenomics has the potential to shed light on the genetic controls of these processes by investigating the key players, their roles, and community compositions that may change as a function of time, climate, nutrients, carbon dioxide, and anthropogenic factors. These studies include a substantial informatics component, requiring researchers to take on complex computational and mathematical challenges. Nonetheless, microbiologists have been quick to seize upon this modern technique, resulting in a deluge of sequence data, and an ever-widening gap between the rates of collecting data and interpreting it.

The Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) project [1] is an important first step in attempting to bridge these gaps and in developing global methods for monitoring microbial communities in the ocean and their response to environmental changes. The aim is to create a rich, distinctive data repository and bioinformatics tools resource that will address many of the unique challenges of metagenomics and enable researchers to unravel the biology of environmental microorganisms (Figure 1). CAMERA’s database includes environmental metagenomic and genomic sequence data, associated environmental parameters (“metadata”), pre-computed search results, and software tools to support powerful cross-analysis of environmental samples.
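
To make the idea of coupling sequence data with environmental metadata more concrete, here is a minimal Python sketch of a metadata-based query. It is an illustration only and does not reflect CAMERA's actual schema or interface: the `SampleRecord` class, its field names, and the `samples_in_temperature_range` function are all hypothetical. The three example records use the sample IDs, water temperatures, salinities, and sequence counts listed in Table 1 of the companion GOS paper [2].

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class SampleRecord:
    """A sample plus a few metadata fields (hypothetical names, not CAMERA's schema)."""
    sample_id: str
    country: str
    temperature_c: Optional[float]  # water temperature, degrees Celsius
    salinity_ppt: Optional[float]   # salinity, parts per thousand
    good_sequences: int             # number of usable sequencing reads


def samples_in_temperature_range(
    samples: Sequence[SampleRecord], low_c: float, high_c: float
) -> List[SampleRecord]:
    """Return samples whose recorded temperature lies in [low_c, high_c].

    Samples with no recorded temperature are skipped rather than guessed at.
    """
    return [
        s for s in samples
        if s.temperature_c is not None and low_c <= s.temperature_c <= high_c
    ]


# Values copied from Table 1 of the companion GOS paper [2].
SAMPLES = [
    SampleRecord("GS00c", "Bermuda (UK)", 19.8, 36.7, 368_835),
    SampleRecord("GS00d", "Bermuda (UK)", 20.0, 36.6, 332_240),
    SampleRecord("GS01a", "Bermuda (UK)", 22.9, 36.7, 142_352),
]

if __name__ == "__main__":
    for s in samples_in_temperature_range(SAMPLES, 19.0, 21.0):
        print(s.sample_id, s.temperature_c, s.good_sequences)
```

A real repository would index such fields in a database rather than scan a list, but the filter shows the kind of question (which reads came from water in a given temperature range?) that is hard to ask when metadata are stored apart from the sequences.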
The initial release will include data and tools associated with the companion set of GOS expedition publications [2–4]; metagenome data from the Hawaii Ocean Time Series Station ALOHA [5] and marine viromes from four different oceanic regions [6]; standard nonredundant sequence databases (e.g., nrnt for nucleotides and nraa for amino acids [7]); and collections of microbial genome sequences, including a set of 155 marine microbial genomes funded by the Gordon and Betty Moore Foundation. The focal point for the CAMERA project is its Web site: http://camera.calit2.net. We invite the research community to submit its metagenomics data to CAMERA, and are establishing mechanisms to streamline this process. Here we describe some of the key challenges and features of the CAMERA project.

Figure 1. Schematic of Intended Core Functions of the CAMERA Project
CBD, Convention on Biological Diversity.
doi:10.1371/journal.pbio.0050075.g001

Accessibility of Metadata
Existing data repositories provide limited support for metadata and metadata-based queries (including any supplemental information for the sequence data, such as pH and temperature of water at the collection site), and therefore these metadata go underutilized by the research community. CAMERA will integrate sequence data with all available, relevant metadata, including physical information (e.g., temperature and sample method), chemical information (e.g., salinity and pH), temporal information, geospatial information, methodology and instrumentation used for data collection, and satellite images of the collection site. These contextual data allow researchers to derive correlations between deciphered ecology and the environmental conditions that may favor one community structure over another. One can envision a future where metadata from satellites and weather stations, and other physicochemical data, can be used to help interpret and inform scientists on how these factors affect microbial processes as well as community composition. CAMERA is working with other groups (e.g., Genomic Standards Consortium) to establish standards for the information content and format of metagenomic data and metadata submissions.

New-Generation Bioinformatics Tools
Analysis and comparison of complex metagenomic data is driving the development of a new class of bioinformatics and visualization software. CAMERA will integrate these tools with its database, couple them with large-scale compute resources, and make them widely available to the research community. Initially, CAMERA will support analytical tools used for analyses in the GOS publications [2–6]. An example is shown in Figure 2: a subset of metagenome sequence reads from GOS environmental samples is compared to a reference genome sequence (Synechococcus spp.) using BLASTN. The results and underlying metadata are displayed through an interactive graphical viewer, which helps users quickly identify sequence reads that are similar to the reference genome sequence, and potentially identify metabolic similarities between microbes in environmental samples and a reference microbe. A detailed description of this tool and its applications is provided in the GOS companion paper by Rusch et al. [2]. CAMERA will work closely with the community to identify and incorporate additional tools and workflows.

Large-Scale, Robust, and Expandable Cyberinfrastructure
The sheer size of metagenomics datasets requires terascale computation and storage facilities. CAMERA is building a state-of-the-art computational infrastructure to provide high-performance networking access and grid-based computing (applying the resources of many computers in a network to a single problem at the same time), and to support new ways of visualizing and interacting with the data. The distributed architecture of the CAMERA computational engine will be based on the National Science Foundation–funded OptIPuter project [8,9], which allows for use of dedicated 1- or 10-Gbps optical fiber links between remote user laboratory clusters and the CAMERA compute complex. The data server complex itself will contain a large amount of rotating storage (ultimately several tens of terabytes replicated) and a large computational cluster (upwards of a thousand processors). It will be augmented on demand by a scalable back end provided by the recently upgraded National Science Foundation TeraGrid.

Recognition of the Sources of Samples
The Convention on Biological Diversity grants countries certain rights over their genetic resources, including, for example, metagenomic sequence data of marine microbes taken from a country’s territorial waters. Many countries require, at minimum, that databases explicitly identify the country of origin of the DNA. Rules vary by country, and it is not a simple task to find out what might be required. International harmonization of these rules is currently being debated by the more than 150 countries that are party to the Convention on Biological Diversity. Agreements about the use of genetic resources are negotiated on a case-by-case basis with each researcher who wishes to sample within a country’s “exclusive economic zone,” typically 200 miles from its shoreline. Some of these “memoranda of understanding” impose additional requirements on the researchers. For example, the J. Craig Venter Institute’s agreement with Australia requires us to “use reasonable effort to notify Australia as soon as possible of any inquiries for commercial purposes.”

Current databases do not allow the original investigators to inform others about the details of an agreement, thus creating a significant roadblock to both the collection and public release
of metagenomics data. To address this issue, CAMERA data will only be made available to users who register by supplying a suitable E-mail address and who acknowledge the potential restriction on commercial use by countries from which the data were collected. To further comply with the Convention on Biological Diversity, all data objects served by CAMERA will possess a mapping to the country of origin of the underlying DNA sample.

Figure 2. CAMERA Fragment Recruitment Viewer
This tool graphically displays the results of a BLASTN sequence comparison of an available microbial genome against selected sequence read datasets. The example shown displays the abundance and distribution of Synechococcus spp. genome sequence in the selected sampling sites. The Synechococcus spp. genome coordinates are shown on the x-axis, while the y-axis shows the percent identity scores of the alignment to the selected Sargasso Sea and GOS sequence reads. The viewer incorporates metadata associated with the reads, allowing a user to quickly identify data of interest for further examination. The utility of the plot is to examine the biogeography and genomic variation of abundant microbes when a close reference genome exists.
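
The plot described in the Figure 2 caption (genome coordinate on the x-axis, percent identity on the y-axis) can be approximated from ordinary BLASTN output. The Python sketch below is a simplified illustration, not the viewer's implementation. It assumes that the reads were the BLASTN queries and the reference genome the subject, that results were saved in the standard 12-column tabular format (`-outfmt 6` in BLAST+), and that only the best-scoring hit per read is plotted. The function name, the `min_identity` cutoff, and the read identifiers in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def recruitment_points(
    blast_lines: Iterable[str], min_identity: float = 50.0
) -> List[Tuple[int, float]]:
    """Turn tabular BLASTN hits into (genome_position, percent_identity) points.

    Expected columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    # read id -> (leftmost subject coordinate, percent identity, bit score)
    best: Dict[str, Tuple[int, float, float]] = {}
    for raw in blast_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
        read_id = fields[0]
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        if pident < min_identity:
            continue
        # Keep only the highest-scoring hit for each read.
        if read_id not in best or bitscore > best[read_id][2]:
            # Minus-strand hits have sstart > send, so take the smaller value.
            best[read_id] = (min(sstart, send), pident, bitscore)
    return sorted((pos, ident) for pos, ident, _ in best.values())
```

Passing the returned points to any scatter-plot routine gives a basic recruitment plot. For example, a read `read1` with one hit at 98.5% identity starting at reference position 1000 and a weaker second hit elsewhere contributes the single point `(1000, 98.5)`. Coloring points by sample metadata, as the viewer does, would require carrying the sample ID alongside each read.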
Outreach and Training
Since the ultimate success of CAMERA will depend on the broader research community’s ability to make use of the novel cyberinfrastructure, a series of on-site and Web-based training programs will be provided to keep users apprised of CAMERA’s functionalities and to support integration of CAMERA’s service-oriented architecture into their computational fabrics. Finally, we envision interacting with the community on several fronts, including standardization of ontology, metadata, nomenclature, and tools, and incorporation or federation of existing tools and resources with CAMERA.

We believe that the data and community cyberinfrastructure provided by CAMERA will help researchers to advance understanding of the codependence or feedback between microbial communities and biogeochemical processes in oceans over time, and of how perturbations in the environment cause compositional changes (including extinction). Eventually, the expanded global environmental metagenomics datasets will enable better monitoring of environmental change and the processes that control climate. Systematic and routine monitoring of genomic signatures of global microbial populations and processes overlaid with meteorological information and other metadata may help researchers explain past shifts in global climate as well as predict future changes. This knowledge may someday guide decisions about acceptable atmospheric levels of greenhouse gases, or guide strategies to increase sequestration of atmospheric carbon dioxide by changing ocean microbial compositions, in order to reverse the effects of global warming.

Acknowledgments
The authors benefited from many discussions with members of the CAMERA team. We wish to thank Robert Friedman, Michael Press, Jasmine Pollard, and Matthew LaPointe at the J. Craig Venter Institute for their assistance in preparing the manuscript.

Funding. The authors acknowledge funding from the Gordon and Betty Moore Foundation to the California Institute for Telecommunications and Information Technology at the University of California, San Diego, and from National Science Foundation OptIPuter grant SCI-0225642.

References
1. Smarr L (2006 March 21) The ocean of life: Creating a community cyberinfrastructure for advanced marine microbial ecology research and analysis (a.k.a. CAMERA). Friday Harbor (Washington): Strategic News Service.
2. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
3. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
4. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
5. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368
7. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33: D501–D504.
8. Smarr L, Chien AA, DeFanti T, Leigh J, Papadopoulos PM (2003) The OptIPuter. Commun ACM 46: 58–67.
9. Taesombut N, Uyeda F, Chien AA, Smarr L, DeFanti T, et al. (2006) The OptIPuter: High-performance, QoS-guaranteed network service for emerging e-science applications. IEEE Commun 4: 38–45.
PLoS BIOLOGY
The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific
Douglas B. Rusch1*, Aaron L. Halpern1, Granger Sutton1, Karla B. Heidelberg1,2, Shannon Williamson1, Shibu Yooseph1,
Dongying Wu1,3, Jonathan A. Eisen1,3, Jeff M. Hoffman1, Karin Remington1,4, Karen Beeson1, Bao Tran1,
Hamilton Smith1, Holly Baden-Tillson1, Clare Stewart1, Joyce Thorpe1, Jason Freeman1, Cynthia Andrews-Pfannkoch1,
Joseph E. Venter1, Kelvin Li1, Saul Kravitz1, John F. Heidelberg1,2, Terry Utterback1, Yu-Hui Rogers1, Luisa I. Falcón5,
Valeria Souza5, Germán Bonilla-Rosso5, Luis E. Eguiarte5, David M. Karl6, Shubha Sathyendranath7, Trevor Platt7,
Eldredge Bermingham8, Victor Gallardo9, Giselle Tamayo-Castillo10, Michael R. Ferrari11, Robert L. Strausberg1,
Kenneth Nealson1,12, Robert Friedman1, Marvin Frazier1, J. Craig Venter1
1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 Department of Biological Sciences, University of Southern California, Avalon, California, United
States of America, 3 Genome Center, University of California Davis, Davis, California, United States of America, 4 Your Genome, Your World, Rockville, Maryland, United States
of America, 5 Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Mexico City, Mexico, 6 Department of Oceanography,
University of Hawaii, Honolulu, Hawaii, United States of America, 7 Bedford Institute of Oceanography, Dartmouth, Nova Scotia, Canada, 8 Smithsonian Tropical Research
Institute, Balboa, Ancon, Republic of Panama, 9 Departamento de Oceanografía, Universidad de Concepción, Concepción, Chile, 10 Escuela de Química, Universidad de Costa
Rica, San Pedro, Costa Rica, 11 Department of Environmental Sciences, Rutgers University, New Brunswick, New Jersey, United States of America, 12 Department of Earth
Sciences, University of Southern California, Los Angeles, California, United States of America
The world’s oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both
genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which
surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition.
These samples, collected across a several-thousand-km transect from the North Atlantic through the Panama Canal and
ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp).
Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with
85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff.
Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and
assembly methods. One comparative genomic method, termed “fragment recruitment,” addressed questions of
genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes
and gene families. A second method, termed “extreme assembly,” made possible the assembly and reconstruction of
large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found
extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions
throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual
sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3)
hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into
genetically isolated populations that have overlapping but independent distributions, implying distinct environmental
preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show
how they may be grouped into several community types. Specific functional adaptations can be identified both within
individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or
absence of the phosphate-binding gene PstS.
Citation: Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern
tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077
Academic Editor: Nancy A. Moran, University of Arizona, United States of America
Received July 14, 2006; Accepted January 16, 2007; Published March 13, 2007
Copyright: © 2007 Rusch et al. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author
and source are credited.
Abbreviations: CAMERA, Cyberinfrastructure for Advanced Marine Microbial
Ecology Research and Analysis; GOS, Global Ocean Sampling; NCBI, National
Center for Biotechnology Information
* To whom correspondence should be addressed. E-mail: DRusch@venterinstitute.org
This article is part of the Global Ocean Sampling collection in PLoS Biology. The full
collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Sorcerer II GOS Expedition
Author Summary

Marine microbes remain elusive and mysterious, even though they are the most abundant life form in the ocean, form the base of the marine food web, and drive energy and nutrient cycling. We know so little about the vast majority of microbes because only a small percentage can be cultivated and studied in the lab. Here we report on the Global Ocean Sampling expedition, an environmental metagenomics project that aims to shed light on the role of marine microbes by sequencing their DNA without first needing to isolate individual organisms. A total of 41 different samples were taken from a wide variety of aquatic habitats collected over 8,000 km. The resulting 7.7 million sequencing reads provide an unprecedented look at the incredible diversity and heterogeneity in naturally occurring microbial populations. We have developed new bioinformatic methods to reconstitute large portions of both cultured and uncultured microbial genomes. Organism diversity is analyzed in relation to sampling locations and environmental pressures. Taken together, these data and analyses serve as a foundation for greatly expanding our understanding of individual microbial lineages and their evolution, the nature of marine microbial communities, and how they are impacted by and impact our world.

Introduction

The concept of microbial diversity is not well defined. It can either refer to the genetic (taxonomic or phylogenetic) diversity as commonly measured by molecular genetics methods, or to the biochemical (physiological) diversity measured in the laboratory with pure or mixed cultures. However, we know surprisingly little about either the genetic or biochemical diversity of the microbial world [1], in part because so few microbes have been grown under laboratory conditions [2,3], and also because it is likely that there are immense numbers of low abundance ribotypes that have not been detected using molecular methods [4]. Our understanding of microbial physiological and biochemical diversity has come from studying the less than 1% of organisms that can be maintained in enrichments or cultivated, while our understanding of phylogenetic diversity has come from the application of molecular techniques that are limited in terms of identifying low-abundance members of the communities.

Historically, there was little distinction between genetic and biochemical diversity because our understanding of genetic diversity was based on the study of cultivated microbes. Biochemical diversity, along with a few morphological features, was used to establish genetic diversity via an approach called numerical taxonomy [5,6]. In recent years the situation has dramatically changed. The determination of genetic diversity has relied almost entirely on the use of gene amplification via PCR to conduct taxonomic environmental gene surveys. This approach requires the presence of slowly evolving, highly conserved genes that are found in otherwise very diverse organisms. For example, the gene encoding the small ribosomal subunit RNA, known as 16S, based on sedimentation coefficient, is most often used for distinguishing bacterial and archaeal species [7–10]. The 16S rRNA sequences are highly conserved and can be used as a phylogenetic marker to classify organisms and place them in evolutionary context. Organisms whose 16S sequences are at least 97% identical are commonly considered to be the same ribotype [11], otherwise referred to as species, operational taxonomic units, or phylotypes.

Although rRNA-based analysis has revolutionized our view of genetic diversity, and has allowed the analysis of a large part of the uncultivated majority, it has been less useful in predicting biochemical diversity. Furthermore, the relationship between genetic and biochemical diversity, even for cultivated microbes, is not always predictable or clear. For instance, organisms that have very similar ribotypes (97% or greater homology) may have vast differences in physiology, biochemistry, and genome content. For example, the gene complement of Escherichia coli O157:H7 was found to be substantially different from the K12 strain of the same species [12].

In this paper, we report the results of the first phase of the Sorcerer II Global Ocean Sampling (GOS) expedition, a metagenomic study designed to address questions related to genetic and biochemical microbial diversity. This survey was inspired by the British Challenger expedition that took place from 1872–1876, in which the diversity of macroscopic marine life was documented from dredged bottom samples approximately every 200 miles on a circumnavigation [13–15]. Through the substantial dataset described here, we identified 60 highly abundant ribotypes associated with the open ocean and aquatic samples. Despite this relative lack of diversity in ribotype content, we confirm and expand upon previous observations that there is tremendous within-ribotype diversity in marine microbial populations [4,7,8,16,17]. New techniques and tools were developed to make use of the sampling and sequencing metadata. These tools include: (1) the fragment recruitment tool for performing and visualizing comparative genomic analyses when a reference sequence is available; (2) new assembly techniques that use metadata to produce assemblies for uncultivated abundant microbial taxa; and (3) a whole metagenome comparison tool to compare entire samples at arbitrary degrees of genetic divergence. Although there is tremendous diversity within cultivated and uncultivated microbes alike, this diversity is organized into phylogenetically distinct groups we refer to as subtypes. Subtypes can occupy similar environments yet remain genetically isolated from each other, suggesting that they are adapted for different environmental conditions or roles within the community. The variation between and within subtypes consists primarily of nucleotide polymorphisms but includes numerous small insertions, deletions, and hypervariable segments. Examination of the GOS data in these terms sheds light on patterns of evolution and also suggests approaches towards improving the assembly of complex metagenomic datasets. At least some of this variation can be associated with functional characters that are a direct response to the environment. More than 6.1 million proteins, including thousands of new protein families, have been annotated from this dataset (described in the accompanying paper [18]). In combination, these papers bring us closer to reconciling the genetic and biochemical disconnect and to understanding the marine microbial community.

We describe a metagenomic dataset generated from the Sorcerer II expedition. The GOS dataset, which includes and extends our previously published Sargasso Sea dataset [19], now encompasses a total of 41 aquatic, largely marine locations, constituting the largest metagenomic dataset yet produced with a total of ~7.7 million sequencing reads. In
the pilot Sargasso Sea study, 200 l of surface seawater was filtered to isolate microorganisms for metagenomic analysis. DNA was isolated from the collected organisms, and genome shotgun sequencing methods were used to identify more than 1.2 million new genes, providing evidence for substantial microbial taxonomic diversity [19]. Several hundred new and diverse examples of the proteorhodopsin family of light-harvesting genes were identified, documenting their extensive abundance and pointing to a possible important role in energy metabolism under low-nutrient conditions. However, substantial sequence diversity resulted in only limited genome assembly. These results generated many additional questions: would the same organisms exist everywhere in the ocean, leading to improved assembly as sequence coverage increased; what was the global extent of gene and gene family diversity, and can we begin to exhaust it with a large but achievable amount of sequencing; how do regions of the ocean differ from one another; and how are different environmental pressures reflected in organisms and communities? In this paper we attempt to address these issues.

Results

Sampling and the Metagenomic Dataset
Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (Figure 1), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast (Table 1).

Figure 1. Sampling Sites
Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200-mile (320-km) intervals along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study.

Several size fractions were isolated for every site (see Materials and Methods). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-µm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequencing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (<2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ~6.4 M contiguous sequences, totaling ~5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19] and more than a 90-fold increase over the other large marine metagenomic dataset [20].

Assembly
Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied ‘‘taxo-
Table 1. Sampling Locations and Environmental Data
ID  Sample Location  Country  Date (mm/dd/yy), Time  Location  Sample Depth (m)  Water Depth (m)  T (°C)a  Sb (ppt)  Size Fraction (μm)  Habitat Type  Chl a, Sample Month (Annual ± SE), mg/m3  Good Sequences
GS00a Sargasso Stations 13 and 11 Bermuda (UK) 02/26/03 3:00 31°32′6″ N; 63°35′42″ W 5.0 >4,200 20.0 20.5 36.6 0.1–0.8 Open ocean 0.17 (0.0.9 ± 0.02) 644,551
10:10 31°10′50″ N; 64°19′27″ W 36.7
GS00b Sargasso Stations 13 and 11 Bermuda (UK) 02/26/03 3:35 31°32′10″ N; 63°35′70″ W 5.0 >4,200 20.0 20.5 36.6 0.22–0.8 Open ocean 0.17 (0.0.9 ± 0.02) 317,180
10:43 31°10′50″ N; 64°19′27″ W 36.7
GS00c Sargasso Stations 3 Bermuda (UK) 02/25/03 13:00 32°09′30″ N; 64°00′36″ W 5.0 >4,200 19.8 36.7 0.22–0.8 Open ocean 0.17 (0.0.9 ± 0.02) 368,835
GS00d Sargasso Stations 13 Bermuda (UK) 02/25/03 17:00 31°32′6″ N; 63°35′42″ W 5.0 >4,200 20.0 36.6 0.22–0.8 Open ocean 0.17 (0.0.9 ± 0.02) 332,240
GS01a Hydrostation S Bermuda (UK) 05/15/03 11:40 32°10′00″ N; 64°30′00″ W 5.0 >4,200 22.9 36.7 3.0–20.0 Open ocean 0.10 (0.10 ± 0.01) 142,352
GS01b Hydrostation S Bermuda (UK) 05/15/03 11:40 32°10′00″ N; 64°30′00″ W 5.0 >4,201 22.9 36.7 0.8–3.0 Open ocean 0.10 (0.10 ± 0.01) 90,905
GS01c Hydrostation S Bermuda (UK) 05/15/03 11:40 32°10′00″ N; 64°30′00″ W 5.0 >4,202 22.9 36.7 0.1–0.8 Open ocean 0.1 (0.1 ± 0.01) 92,351
GS02 Gulf of Maine USA 08/21/03 6:32 42°30′11″ N; 67°14′24″ W 1.0 106 18.2 29.2 0.1–0.8 Coastal 1.4 (1.12 ± 0.19) 121,590
GS03 Browns Bank, Gulf of Maine Canada 08/21/03 11:50 42°51′10″ N; 66°13′2″ W 1.0 119 11.7 29.9 0.1–0.8 Coastal 1.4 (1.12 ± 0.19) 61,605
GS04 Outside Halifax, Nova Scotia Canada 08/22/03 5:25 44°8′14″ N; 63°38′40″ W 2.0 142 17.3 28.3 0.1–0.8 Coastal 0.4 (0.78 ± 0.17) 52,959
GS05 Bedford Basin, Nova Scotia Canada 08/22/03 16:21 44°41′25″ N; 63°38′14″ W 1.0 64 15.0 30.2 0.1–0.8 Embayment 6 (6.76 ± 0.98) 61,131
GS06 Bay of Fundy, Nova Scotia Canada 08/23/03 10:47 45°6′42″ N; 64°56′48″ W 1.0 11 11.2 0.1–0.8 Estuary 2.8 (1.87 ± 0.18) 59,679
GS07 Northern Gulf of Maine Canada 08/25/03 8:25 43°37′56″ N; 66°50′50″ W 1.0 139 17.9 31.7c 0.1–0.8 Coastal 1.4 (1.12 ± 0.19) 50,980
GS08 Newport Harbor, RI USA 11/16/03 16:45 41°29′9″ N; 71°21′4″ W 1.0 12 9.4 26.5c 0.1–0.8 Coastal 2.2 (1.59 ± 0.17) 129,655
GS09 Block Island, NY USA 11/17/03 10:30 41°5′28″ N; 71°36′8″ W 1.0 32 11.0 31.0c 0.1–0.8 Coastal 4.0 (2.72 ± 0.24) 79,303
GS10 Cape May, NJ USA 11/18/03 4:30 38°56′24″ N; 74°41′6″ W 1.0 10 12.0 31.0c 0.1–0.8 Coastal 2.0 (2.75 ± 0.33) 78,304
GS11 Delaware Bay, NJ USA 11/18/03 11:30 39°25′4″ N; 75°30′15″ W 1.0 8 11.0 0.1–0.8 Estuary 4.8 (9.23 ± 1.02) 124,435
GS12 Chesapeake Bay, MD USA 12/18/03 11:32 38°56′49″ N; 76°25′2″ W 1.0 25 3.2 3.47c 0.1–0.8 Estuary 21.0 (15.0 ± 1.01) 126,162
GS13 Off Nags Head, NC USA 12/19/03 6:28 36°0′14″ N; 75°23′41″ W 1.0 20 9.3 0.1–0.8 Coastal 3.0 (2.24 ± 0.25) 138,033
GS14 South of Charleston, SC USA 12/20/03 17:12 32°30′25″ N; 79°15′50″ W 1.0 31 18.6 0.1–0.8 Coastal 1.70 (1.92 ± 0.25) 128,885
0401
GS15 Off Key West, FL USA 01/08/04 6:25 24°29′18″ N; 83°4′12″ W 2.0 47 25.3 36.0 0.1–0.8 Coastal 0.2 (0.27 ± 0.09) 127,362
GS16 Gulf of Mexico USA 01/08/04 14:15 24°10′29″ N; 84°20′40″ W 2.0 3,333 26.4 35.8 0.1–0.8 Coastal sea 0.16 (0.11 ± 0.01) 127,122
GS17 Yucatan Channel Mexico 01/09/04 13:47 20°31′21″ N; 85°24′49″ W 2.0 4,513 27.0 35.8 0.1–0.8 Open ocean 0.13 (0.09 ± 0.01) 257,581
GS18 Rosario Bank Honduras 01/10/04 8:12 18°2′12″ N; 83°47′5″ W 2.0 4,470 27.4 35.4 0.1–0.8 Open ocean 0.14 (0.09 ± 0.01) 142,743
GS19 Northeast of Colón Panama 01/12/04 9:03 10°42′59″ N; 80°15′16″ W 2.0 3,336 27.7 35.4 0.1–0.8 Coastal 0.23 (0.15 ± 0.02) 135,325
GS20 Lake Gatun Panama 01/15/04 10:24 9°9′52″ N; 79°50′10″ W 2.0 4 28.5 0.06 0.1–0.8 Fresh water 296,355
GS21 Gulf of Panama Panama 01/19/04 16:48 8°7′45″ N; 79°41′28″ W 2.0 76 27.6 30.7 0.1–0.8 Coastal 0.50 (0.73 ± 0.22) 131,798
GS22 250 miles from Panama City Panama 01/20/04 16:39 6°29′34″ N; 82°54′14″ W 2.0 2,431 29.3 32.3 0.1–0.8 Open ocean 0.33 (0.28 ± 0.02) 121,662
GS23 30 miles from Cocos Island Costa Rica 01/21/04 15:00 5°38′24″ N; 86°33′55″ W 2.0 1,139 28.7 32.6 0.1–0.8 Open ocean 0.07 (0.19 ± 0.02) 133,051
GS25 Dirty Rock, Cocos Island Costa Rica 01/28/04 10:51 5°33′10″ N; 87°5′16″ W 1.1 30 28.3 31.4 0.8–3.0 Fringing reef 0.11 (0.19 ± 0.01) 120,671
GS26 134 miles NE of Galapagos Ecuador 02/01/04 16:16 1°15′51″ N; 90°17′42″ W 2.0 2,376 27.8 32.6 0.1–0.8 Open ocean 0.22 (0.28 ± 0.02) 102,708
GS27 Devil’s Crown, Floreana Ecuador 02/04/04 11:41 1°12′58″ S; 90°25′22″ W 2.0 2.3 25.5 34.9 0.1–0.8 Coastal 0.40 (0.38 ± 0.03) 222,080
GS28 Coastal Floreana Ecuador 02/04/04 15:47 1°13′1″ S; 90°19′11″ W 2.0 156 25.0c 0.1–0.8 Coastal 0.35 (0.35 ± 0.02) 189,052
GS29 North James Bay, Santigo Ecuador 02/08/04 18:03 0°12′0″ S; 90°50′7″ W 2.0 12 26.2 34.5 0.1–0.8 Coastal 0.40 (0.39 ± 0.03) 131,529
GS30 Warm seep, Roca Redonda Ecuador 02/09/04 11:42 0°16′20″ N; 91°38′0″ W 19.0 19 26.9 0.1–0.8 Warm seep 359,152
GS31 Upwelling, Fernandina Ecuador 02/10/04 14:43 0°18′4″ S; 91°39′6″ W 12.0 19 18.6 0.1–0.8 Coastal upwelling 0.35 (0.39 ± 0.03) 436,401
GS32 Mangrove, Isabella Ecuador 02/11/04 11:30 0°35′38″ S; 91°4′10″ W 0.3 0.67 25.4 0.1–0.8 Mangrove 148,018
c
GS33 Punta Cormorant Lagoon, Floreana Ecuador 02/19/04 13:35 1°13′42″ S; 90°25′45″ W 0.2 0.33 37.6 46 0.1–0.8 Hypersaline 692,255
GS34 North Seamore Ecuador 02/19/04 17:06 0°22′59″ S; 90°16′47″ W 2.0 35 27.5 0.1–0.8 Coastal 0.36 (0.35 ± 0.02) 134,347
GS35 Wolf Island Ecuador 03/01/04 16:44 1°23′21″ N; 91°49′1″ W 2.0 71 21.8 34.5 0.1–0.8 Coastal 0.28 (0.31 ± 0.02) 140,814
GS36 Cabo Marshall, Isabella Ecuador 03/02/04 12:52 0°1′15″ S; 91°11′52″ W 2.0 67 25.8 34.6 0.1–0.8 Coastal 0.65 (0.45 ± 0.05) 77,538
GS37 Equatorial Pacific TAO Buoy International 03/17/04 16:38 1°58′26″ S; 95°0′53″ W 2.0 3,334 28.8 0.1–0.8 Open ocean 0.21 (0.24 ± 0.02) 65,670
GS47 201 miles from French Polynesia International 03/28/04 15:25 10°7′53″ S; 135°26′58″ W 30.0 2,400 28.6 37.3 0.1–0.8 Open ocean 66,023
GS51 Rangirora Atoll French Polynesia 05/22/04 7:04 15°8′37″ S; 147°26′6″ W 1.0 10 27.3 34.2 0.1–0.8 Coral reef atoll 128,982
Total 7,697,926
a Temperature.
b Salinity.
Special Section from March 2007 | Volume 5 | Issue 3 | e77

c Measurements were acquired from nearby vessels and/or research stations.
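The Assembly discussion and footnote c of Table 2 both lean on the idealized shotgun model cited as [22]. The sketch below assumes that this model is the Lander-Waterman expectation for contig length in a repeat-free, clonal genome; the function name, the choice of mean intig length as the read length, and the default of no minimum overlap are illustrative assumptions rather than values taken from the paper.

```python
import math


def expected_contig_length(coverage, read_len, min_overlap_frac=0.0):
    """Lander-Waterman expected contig length in bp (idealized, repeat-free genome).

    coverage         -- average fold coverage of the genome
    read_len         -- mean read length in bp
    min_overlap_frac -- fraction of a read that must overlap to be detected
                        (hypothetical default of 0, i.e. any overlap counts)
    """
    sigma = 1.0 - min_overlap_frac
    return read_len * ((math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma))


# Values from Table 2 (intigs are the sequences actually fed to the assembler).
total_intig_bp = 5_883_982_712
n_intigs = 6_389_523
mean_len = total_intig_bp / n_intigs          # roughly 0.9 kb per sequence

# Thought experiment from the text: a clonal organism at 1% of the data
# with a 10-Mbp genome.
clonal_bp = 0.01 * total_intig_bp             # roughly 60 Mbp
coverage = clonal_bp / 10_000_000             # roughly 6-fold

print(f"coverage: {coverage:.1f}x")
print(f"expected contig at that coverage: {expected_contig_length(coverage, mean_len):,.0f} bp")
print(f"expected contig at 4.1x: {expected_contig_length(4.1, mean_len):,.0f} bp")
```

Under these assumptions the 6-fold case comes out in the tens of kilobases and the 4.1-fold case near 10 kb, the same order of magnitude as the figures quoted in the text and in Table 2. The exact values shift with the assumed read length and minimum overlap.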
Table 2. Summary Assembly Statistics
Category  Statistic  Value
Assembly inputs  Number of reads used for assembly  7,697,926
  Total read length (bp)  6,325,208,303
  Number of ‘‘intigs’’ used for assemblya  6,389,523
  Total intig length (bp)a  5,883,982,712
Assembly outputs  Number of assembliesb  3,081,849
  Total assembled consensus length (bp)  4,460,027,783
  Percentage of unassembled reads  53%
  Percentage of assembly at >1× coverage  15.3%
  Base pairs in contigs ≥10 kb  39,427,102
  Base pairs in contigs ≥50 kb  15,723,513
  Base pairs in scaffolds ≥2.5 kb (consensus bases)  458,196,599
  Base pairs in scaffolds ≥5 kb (consensus bases)  138,137,150
  Base pairs in scaffolds ≥10 kb (consensus bases)  65,238,481
  Base pairs in scaffolds ≥50 kb (consensus bases)  20,738,836
  Base pairs in scaffolds ≥100 kb (consensus bases)  16,005,244
  Base pairs in scaffolds ≥300 kb (consensus bases)  8,805,668
  Percentage of assembly in scaffolds ≥10 kbc  1.5%
  Length of longest contig (bp)  977,960
  Length of longest scaffold (bp)  2,097,794
  N1d assembly bp  15,915
  N10d assembly bp  2,533
  N50d assembly bp  1,611
  N1d contig bp  8,994
  N10d contig bp  2,447
  N50d contig bp  (single reads)
a Intigs are overlapping mated reads that have been collapsed into a single sequence as input into the assembler.
b Assemblies refers to the total number of scaffolds, pairs of mated nonoverlapping singletons, and singleton unmated reads.
c For comparison purposes, 10 kb is the average contig size predicted for 4.1× coverage for an idealized shotgun assembly of a repeat-free, clonal genome [22].
d N1 indicates the length of the next largest assembled sequence or contig such that 1% of the sequence data falls into longer assemblies or contigs. N10 and N50 indicate that 10% and 50% of the data fell into larger assemblies or contigs.

nomic markers’’ such as 16S or recA, thus clearly identifying the taxonomic group with which they are associated. The primary assembly of the combined GOS dataset was performed using the Celera Assembler [21] with modifications as previously described [19] and as given in Materials and Methods. The assembly was performed with quite stringent criteria, beginning with an overlap cutoff of 98% identity to reduce the potential for artifacts (e.g., chimeric assemblies or consensus sequences diverging substantially from the genome of any given cell). This assembly was the substrate for annotation (see the accompanying paper by Yooseph et al. [18]).

The degree of assembly of a metagenomic sample provides an indication of the diversity of the sample. A few substantial assemblies notwithstanding, the primary assembly was strikingly fragmented (Table 2). Only 9% of sequencing reads went into scaffolds longer than 10 kbp. A majority (53%) of the sequencing reads remained unassembled singletons. Scaffolds containing more than 50 kb of consensus sequence totaled 20.7 Mbp; of these, >75% were produced from a single Sargasso Sea sample and correspond to the Burkholderia or Shewanella assemblies described previously [19]. These results highlight the unusual abundance of these two organisms in a single sample, which significantly affected our expectations regarding the current dataset. Given the large size of the combined dataset and the substantial amount of sequencing performed on individual filters, the overall lack of assembly provides evidence of a high degree of diversity in surface planktonic communities. To put this in context, suppose there were a clonal organism that made up 1% of our data, or ~60 Mbp. Even a genome of 10 Mbp, enormous by bacterial standards, would be covered ~6-fold. Such data might theoretically assemble with an average contig approaching 50 kb [22]. While real assemblies generally fall short of theory for various reasons, Shewanella data make up <1% of the total GOS dataset, and yet most of the relevant reads assemble into scaffolds >50 kb. Thus, with few scaffolds of significant length, we could conclude that there are very few clonal organisms present at even 1% in the GOS dataset.

To investigate the nature of the implied diversity and to see whether greater assembly could be achieved, we explored several alternative approaches. Breaks in the primary assembly resulted from two factors: incomplete sequence coverage and conflicts in the data. Conflicts can break assemblies when there is no consistent way to chain together all overlapping sequencing reads. As it was possible that there would be fewer conflicts within a single sample (i.e., that diversity within a single sample would be lower), assemblies were attempted with individual samples. However, the results did not show any systematic improvements even in those samples with greater coverage (unpublished data). Upon manual inspection, most assembly-breaking conflicts were found to be local in nature. These observations suggested that reducing the degree of sequence identity required for assembly could ameliorate both factors limiting assembly: effective coverage would increase and many minor conflicts would be resolved.

Accordingly, we produced a series of assemblies based on 98%, 94%, 90%, 85%, and 80% identity overlaps for two subsets of the GOS dataset, again using the Celera Assembler. Assembly lengths increased as the overlap cutoff decreased from 98% to 94% to 90%, and then leveled off or even dropped as stringency was reduced below 90% (Table 3). Although larger assemblies could be generated using lower identity overlaps, significant numbers of overlaps satisfying the chosen percent identity cutoff still went unused in each assembly. This is consistent with a high rate of conflicting overlaps and in turn diagnostic of significant polymorphism.

In mammalian sequencing projects the use of larger insert libraries is critical to producing larger assemblies because of their ability to span repeats or local polymorphic regions [23]. The shotgun sequencing libraries from the GOS filters were typically constructed from inserts shorter than 2 kb. Longer plasmid libraries were attempted but were much less stable. We obtained paired-end sequences from 21,419 fosmid clones (average insert size, 36 kb; [24,25]) from the 0.1-micron fraction of GS-33. The effect of these long mate pairs on the GS-33 assembly was quite dramatic, particularly at high stringency (e.g., improving the largest scaffold from 70 kb to 1,247 kb and the largest contig from 70 kb to 427 kb). At least for GS-33 this suggests that many of the polymorphisms affect small, localized regions of the genome that can be spanned using larger inserts. This degree of improvement may be greater than what could be expected in general, as the diversity of GS-33 is by far the lowest of any of the currently
sequenced GOS samples, yet it clearly indicates the utility of including larger insert libraries for assembly.

Fragment Recruitment
In the absence of substantial assembly, direct comparison of the GOS sequencing data to the genomes of sequenced microbes is an alternative way of providing context, and also allows for exploration of genetic variation and diversity. A large and growing set of microbial genomes are available from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov). At the time of this study, we used 334 finished and 250 draft microbial genomes as references for comparison with the GOS sequencing reads. Comparisons were carried out in nucleotide-space using the sequence alignment tool BLAST [26]. BLAST parameters were designed to be extremely lenient so as to detect even distant similarities (as low as 55% identity). A large proportion of the GOS reads, 70% in all, aligned to one or more genomes under these conditions. However, many of the alignments were of low identity and used only a portion of the entire read. Such low-quality hits may reflect distant evolutionary relationships, and therefore less information is gained based on the context of the alignment. More stringent criteria could be imposed requiring that the reads be aligned over nearly their entire length without any large gaps. Using this stringent criterion only about 30% of the reads aligned to any of the 584 reference genomes. We refer to these fully aligned reads as ‘‘recruited reads.’’ Recruited reads are far more likely to be from microbes closely related to the reference sequence (same species) than are partial alignments. Despite the large number of microbial genomes currently available, including a large number of marine microbes, these results indicate that a substantial majority of GOS reads cannot be specifically related to available microbial genomes.

The amount and distribution of reads recruited to any given genome provides an indication of the abundance of closely related organisms. Only genomes from the five bacterial genera Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia yielded substantial and uniform recruitment of GOS fragments over most of a reference genome (Table 4). These genera include multiple reference genomes, and we observed significant differences in recruitment patterns even between organisms belonging to the same species (Figure 2A–2I). Three genera, Pelagibacter (Figure 2A), Prochlorococcus (Figure 2B–2F), and Synechococcus (Figure 2G–2I), were found abundantly in a wide range of samples and together accounted for roughly 50% of all the recruited reads (though only ~15% of all GOS sequencing reads). By contrast, although every genome tested recruited some GOS reads, most recruited only a small number, and these reads clustered at lower identity to locations corresponding to large highly conserved genes (for typical examples see Figure 2E–2F). We refer to this pattern as nonspecific recruitment as it reflects taxonomically nonspecific signals, with the reads in

Table 3. Evaluation of Alternative Assembly Methods
Dataset  Type  Percent Identity  Base Pairs in ≥10 k Contigs  Base Pairs in ≥100 k Contigs
GS33 plasmids  WGSa  98  13,669,678  0
    94  19,536,324  2,749,543
    90  20,996,826  3,729,765
    85  20,327,989  3,505,324
    80  19,245,637  4,195,959
  E-asmb  98  22,000,579  5,604,857
    94  22,781,462  7,302,801
    90  22,702,764  7,600,441
    85  22,570,933  7,937,079
    80  20,335,558  4,779,684
GS33 with fosmids  WGSa  98  15,031,557  1,306,992
    94  22,310,335  4,449,710
    90  22,944,278  5,585,959
    85  22,251,738  5,485,013
    80  21,088,975  5,684,925
GS17,18,23,26  WGSa  98  185,058  0
    94  5,422,366  213,755
    90  10,694,783  373,822
    85  11,514,421  800,290
    80  9,004,221  879,401
  E-asmb  98  2,047,524  0
    94  10,668,547  1,184,881
    90  15,215,981  2,634,227
    85  15,786,515  3,132,152
    80  13,767,929  2,942,160
Combined GOS  WGSa  98  39,427,102  11,488,828
    94  98,887,937  12,376,236
  E-asmb  98  91,526,091  16,444,304
    94  163,612,717  25,564,163
    90  186,614,813  28,752,198
    85  181,887,218  27,154,335
    80  161,160,091  23,794,832
a Whole-genome shotgun (WGS) assembly performed with the Celera Assembler.
b Assemblies performed using extreme assembly approach (E-asm).
doi:10.1371/journal.pbio.0050077.t003
Table 4. Microbial Genera that Recruited the Bulk of the GOS Reads
Genus  Read Count: All Reads  80%+a  90%+b  Best Strain: All Reads  80%+a  90%+b
Pelagibacter  922,677  195,539  36,965  HTCC1062  HTCC1062  HTCC1062
Prochlorococcus  208,999  159,102  84,325  MIT9312  MIT9312  MIT9312
Synechococcus  60,650  26,365  21,594  CC9902  RS9917  RS9917
Burkholderia  151,123  108,610  93,081  383  383  383
Shewanella  59,086  34,138  27,693  MR-1  MR-1  MR-1
Remaining  43,244  2,367  564  Buchnera aphidicola Str. Sg  Buchnera aphidicola Str. APS  Alteromonas macleodii
a Reads aligned at or above 80% identity over the entire length of the read.
b Reads aligned at or above 90% identity over the entire length of the read.
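The recruitment criterion and the identity bins of Table 4 can be sketched as a small filter. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, the 90% aligned-fraction threshold standing in for "nearly their entire length", the 25-bp limit standing in for "large gaps", and the rule of keeping only each read's highest-identity hit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    read_id: str
    genus: str          # genus of the reference genome that was hit
    read_len: int       # full read length in bp
    aligned_len: int    # number of read bases covered by the alignment
    identity: float     # percent identity of the alignment
    largest_gap: int    # longest gap in the alignment, in bp


# Hypothetical thresholds; the text gives no numeric values for these.
MIN_ALIGNED_FRACTION = 0.9
MAX_GAP = 25


def is_recruited(aln):
    """True if the read aligns over nearly its whole length with no large gap."""
    return (aln.aligned_len / aln.read_len >= MIN_ALIGNED_FRACTION
            and aln.largest_gap <= MAX_GAP)


def tally_by_genus(alignments):
    """Count recruited reads per genus in the three bins used by Table 4."""
    best = {}
    for aln in alignments:
        if not is_recruited(aln):
            continue
        prev = best.get(aln.read_id)
        # Assumption: a read is credited to its highest-identity hit only.
        if prev is None or aln.identity > prev.identity:
            best[aln.read_id] = aln
    counts = {}
    for aln in best.values():
        row = counts.setdefault(aln.genus, {"all": 0, "80+": 0, "90+": 0})
        row["all"] += 1
        if aln.identity >= 80:
            row["80+"] += 1
        if aln.identity >= 90:
            row["90+"] += 1
    return counts


demo = [
    Alignment("r1", "Pelagibacter", 900, 890, 93.0, 2),
    Alignment("r2", "Pelagibacter", 900, 880, 72.0, 5),
    Alignment("r3", "Prochlorococcus", 850, 400, 95.0, 0),   # partial hit, not recruited
    Alignment("r1", "Synechococcus", 900, 870, 61.0, 3),     # weaker second hit for r1
]
counts = tally_by_genus(demo)
print(counts)
```

In the demo, the partial alignment of r3 is discarded even though its identity is high, which is the distinction the text draws between recruited reads and low-quality partial hits.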
Figure 2. Fragment Recruitment Plots

The horizontal axis of each panel corresponds to a 100-kb segment of genomic sequence from the indicated reference microbial genome. The vertical
axis indicates the sequence identity of an alignment between a GOS sequence and the reference genomic sequence. The identity ranges from 100%
(top) to 50% (bottom). Individual GOS sequencing reads were colored to reflect the sample from which they were isolated. Geographically nearby
samples have similar colors (see Poster S1 for key). Each organism shows a distinct pattern of recruitment reflecting its origin and relationship to the
environmental data collected during the course of this study.
(A) P. ubique HTCC1062 recruits the greatest density of GOS sequences of any genome examined to date. The GOS sequences show geographic
stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow-
green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062.
(B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These
sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads).
(C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower
identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity.
(D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were
recruited from the Sargasso samples.
(E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in
vertical stacks that correspond to the locations of conserved genes. On the left side of this panel is a very distinctive pattern of recruitment that
corresponds to the highly conserved 16S and 23S rRNA gene operon.
(F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are
seen corresponding to the location of conserved genes.
(G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate
identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible.
(H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences
from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at
moderate identities.
(I) Synechococcus CC9902 is distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences
as the WH8102 and CC9605 strains, they recruit at significantly lower identity.
(J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is not only
possible to assemble contigs with strong similarities to known genomes but to identify contigs from previously uncultured genomes. In each case a
100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other.
(J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS
sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison).
(K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. (K) is strikingly similar to previous
recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences
primarily from the Sargasso Sea samples (red).
(M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages.
question often recruiting to distantly related sets of genomes. Most microbial genomes, including many of the marine microbes (e.g., the ubiquitous genus Vibrio), demonstrated this nonspecific pattern of recruitment.

The relationship between the similarity of an individual sequencing read to a given genome and the sample from which the read was isolated can provide insight into the structure, evolution, and geographic distribution of microbial populations. These relationships were assessed by constructing a ‘‘percent identity plot’’ [27] in which the alignment of a read to a reference sequence is shown as a bar whose horizontal position indicates location on the reference and whose vertical position indicates the percent identity of the alignment. We colored the plotted reads according to the samples to which they belonged, thus indirectly representing various forms of metadata (geographic, environmental, and laboratory variables). We refer to these plots that incorporate metadata as fragment recruitment plots. Fragment recruitment plots of GOS sequences recruited to the entire genomes of Pelagibacter ubique HTCC1062, Prochlorococcus marinus MIT9312, and Synechococcus WH8102 are presented in Poster S1.

Within-Ribotype Population Structure and Variation
Characteristic patterns of recruitment emerged from each of these abundant marine microbes consisting of horizontal bands made up of large numbers of GOS reads. These bands seem constrained to a relatively narrow range of identities that tile continuously (or at least uniformly, in the case when abundance/coverage is lower) along ~90% of the reference sequence. The uninterrupted tiling indicates that environmental genomes are largely syntenic with the reference genomes. Multiple bands, distinguished by degree of similarity to the reference and by sample makeup, may arise on a single reference (Poster S1D and S1F). Each of these bands appears to represent a distinct, closely related population we refer to as a subtype. In some cases, an abundant subtype is highly similar to the reference genome, as is the case for P. marinus MIT9312 (Poster S1) and Synechococcus RS9917 (unpublished data). P. ubique HTCC1062 and other Synechococcus strains like WH8102 show more complicated banding patterns (Poster S1D and S1F) because of the presence of multiple subtypes that produce complex often overlapping bands in the plots. Though the recruitment patterns can be quite complex they are also remarkably consistent over much of the reference genome. In these more complicated recruitment plots, such as the one for P. ubique HTCC1062, individual bands can show sudden shifts in identity or disappear altogether, producing a gap in recruitment that appears to be specific to that band (see P. ubique recruitment plots on Poster S1B and S1E, and specifically between 130–140 kb). Finally, phylogenetic analysis indicates that separate bands are indeed evolutionarily distinct at randomly selected locations along the genome.

The amount of sequence variation within a given band cannot be reliably determined from the fragment recruitment plots themselves. To examine this variation, we produced multiple sequence alignments and phylogenies of reads that recruited to several randomly chosen intervals along given reference genomes to show that there can be considerable within-subtype variation (Figure 3A–3B). For example, within the primary band found in recruitment plots to P. marinus MIT9312, individual pairs of overlapping reads typically differ on average between 3%–5% at the nucleotide level (depending on exact location in the genome). Very few reads that recruited to MIT9312 have perfect (mismatch-free) overlaps with any other read or to MIT9312, despite ~100-fold coverage. While many of these differences are silent (i.e., do not change amino acid sequences), there is still considerable variation at the protein level (unpublished data). The amount of variation within subtypes is so great that it is likely that no two sequenced cells contained identical genomes.
Figure 3. Population Structure and Variation as Revealed by Phylogeny

Phylogenies were produced using neighbor-joining. There is significant within-clade variation as well as an absence of strong geographic structure to
variants of SAR11 (P. ubique HTCC1062) and P. marinus MIT9312. Similar reads are not necessarily from similar locations, and reads from similar locations
are not necessarily similar.
(A) Geographic distribution of SAR11 proteorhodopsin variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(B) Geographic distribution of Prochlorococcus variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(C) Origins of spectral tuning of SAR11 proteorhodopsins. Reads are colored according to whether they contain the L (green) or Q (blue) variant at the
spectral tuning residue described in the text. The selection of tuning residue is lineage restricted, but each variant must have arisen on two separate
occasions.
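The within-subtype variation reported in the text (overlapping reads differing by 3%–5% on average) is a pairwise statistic over aligned reads. The sketch below shows one way such a figure could be computed from rows of a multiple sequence alignment. The row encoding ('.' where a read does not cover a column, '-' for an alignment gap), the decision to skip gap columns, and the minimum-overlap parameter are assumptions for illustration, not the authors' stated procedure.

```python
from itertools import combinations


def percent_difference(a, b, min_overlap=20):
    """Percent of mismatching columns between two aligned rows.

    Columns where either row is uncovered ('.') or gapped ('-') are skipped.
    Returns None when fewer than min_overlap columns can be compared.
    """
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in ".-" or y in ".-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared < min_overlap:
        return None
    return 100.0 * mismatches / compared


def mean_pairwise_difference(rows, min_overlap=20):
    """Average percent difference over all row pairs that overlap enough."""
    values = []
    for a, b in combinations(rows, 2):
        d = percent_difference(a, b, min_overlap)
        if d is not None:
            values.append(d)
    return sum(values) / len(values) if values else None


# Toy alignment: three short "reads" tiled over a 12-column window.
r1 = "ACGTACGTAC.."
r2 = "..GTACGAACGT"
r3 = "ACGTAC......"
print(percent_difference(r1, r2, min_overlap=4))
print(mean_pairwise_difference([r1, r2, r3], min_overlap=4))
```

With real reads the minimum overlap would be set much higher than in this toy case, so that short chance overlaps do not dominate the average.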
Identifying Genomic Structural Variation with Metagenomic Data
Variation in genome structure in the form of rearrangements, duplications, insertions, or deletions of stretches of DNA can also be explored via fragment recruitment. The use of mated sequencing reads (pairs of reads from opposite ends of a clone insert) provides a powerful tool for assessing structural differences between the reference and the environmental sequences. The cloning and sequencing process determines the orientation and approximate distance between two mated sequencing reads. Genomic structural variation can be inferred when these are at odds with the way in which the reads are recruited to a reference sequence. Relative location and orientation of mated sequences provide a form of metadata that can be used to color-code a fragment recruitment plot (Figure 4). This makes it possible to visually identify and classify structural differences and similarities between the reference and the environmental sequences (Figure 5). For the abundant marine microbes, a high proportion of mated reads in the ‘‘good’’ category (i.e., in the proper orientation and at the correct distance) show that synteny is conserved for a large portion of the microbial population. The strongest signals of structural differences typically reflect a variant specific to the reference genome and not found in the environmental data. In conjunction with the requirement that reads be recruited over their entire length without interruption, recruitment plots result in pronounced recruitment gaps at locations where there is a break in synteny. Other rearrangements can be partially present or penetrant in the environmental data and thus may not generate obvious recruitment gaps. However, given sufficient coverage, breaks in synteny should be clearly identifiable using the recruitment metadata based on the presence of ‘‘missing’’ mates (i.e., the mated sequencing read that was recruited but whose mate failed to recruit; Figure 4). The ratio of missing mates to ‘‘good’’ mates determines how penetrant the rearrangement is in the environmental population.
In theory, all genome structure variations that are large
enough to prevent recruitment can be detected, and all such
rearrangements will be associated with missing mates.
Depending on the type of rearrangement present, other
recruitment metadata categories will be present near the
rearrangements’ endpoints. This makes it possible to distin-
guish among insertions, deletions, translocations, inversions,
and inverted translocations directly from the recruitment
plots. Examples of the patterns associated with different
rearrangements are presented in Figure 5. This provides a
rapid and easy visual method for exploring structural
variation between natural populations and sequenced repre-
sentatives (Poster S1A and S1B).
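The eight recruitment metadata categories (Figure 4) amount to a decision rule on the placement of a read and its mate. The sketch below is one reading of that rule. The `Hit` record, the strand conventions assigned to ‘‘normal’’ (both reads forward), ‘‘anti-normal’’ (both reverse), and ‘‘outie’’ (reads pointing away from each other), and the insert-size bounds are assumptions labeled for illustration; the paper does not spell them out in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    start: int     # leftmost reference coordinate of the recruited read
    end: int       # rightmost reference coordinate of the recruited read
    strand: str    # '+' or '-' relative to the reference


def classify_mate(read_hit, has_mate, mate_hit: Optional[Hit],
                  min_insert, max_insert):
    """Assign a recruited read to one of the eight categories."""
    if not has_mate:
        return "no mate"          # no mated read was available at all
    if mate_hit is None:
        return "missing"          # mate exists but was not recruited
    left, right = sorted((read_hit, mate_hit), key=lambda h: h.start)
    # Orientation problems take precedence over separation.
    if left.strand == right.strand:
        return "normal" if left.strand == "+" else "anti-normal"
    if left.strand == "-":
        return "outie"            # reads point away from each other
    span = right.end - left.start
    if span < min_insert:
        return "short"
    if span > max_insert:
        return "long"
    return "good"


def missing_to_good_ratio(categories):
    """Ratio used in the text as a rough measure of penetrance."""
    good = categories.count("good")
    return categories.count("missing") / good if good else None


read = Hit(100, 900, "+")
labels = [
    classify_mate(read, True, Hit(1500, 2300, "-"), 1000, 3000),
    classify_mate(read, True, Hit(200, 700, "-"), 1000, 3000),
    classify_mate(read, True, Hit(9000, 9800, "-"), 1000, 3000),
    classify_mate(read, True, Hit(1500, 2300, "+"), 1000, 3000),
    classify_mate(Hit(100, 900, "-"), True, Hit(1500, 2300, "-"), 1000, 3000),
    classify_mate(Hit(100, 900, "-"), True, Hit(1500, 2300, "+"), 1000, 3000),
    classify_mate(read, True, None, 1000, 3000),
    classify_mate(read, False, None, 1000, 3000),
]
print(labels)
print(missing_to_good_ratio(labels))
```

Checking orientation before separation mirrors the statement in the Figure 4 legend that mis-orientation trumps issues of separation.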
Figure 4. Categories of Recruitment Metadata
The recruitment metadata distinguishes eight different general categories based on the relative placement of paired end sequencing reads (mated reads) when recruited to a reference sequence in comparison to their known orientation and separation on the clone from which they were derived. Assuming orientation is correct, two mated reads can be recruited closer together, further apart, or within expected distances given the size of the clone from which the sequences were derived. These sequences are categorized as ‘‘short,’’ ‘‘long,’’ or ‘‘good,’’ respectively. Alternately, the mated reads may be recruited in a mis-oriented fashion, which trumps issues of separation. These reads can be categorized as ‘‘normal,’’ ‘‘anti-normal,’’ or ‘‘outie.’’ In addition, there are two other categories. ‘‘No mate’’ indicates that no mated read was available for recruitment, possibly due to sequencing error. Perhaps most useful of any of the recruitment categories, ‘‘missing’’ mates indicate that while a mated sequence was available, it was not recruited to the reference. ‘‘Missing’’ mates identify breaks in synteny between the environmental data and the reference sequence.
doi:10.1371/journal.pbio.0050077.g004

Genomic Structural Variation in Abundant Marine Microbes
Variation in genome structure potentially results in functional differences. Of particular interest are those differences between sequenced (reference) microbes and environmental populations. These differences can indicate how representative a cultivated microbe might be and shed light on the evolutionary forces driving change in microbial populations. Fragment recruitment in conjunction with the mate metadata helped us to identify both the consistent and the rare structural differences between the genomes of microbial populations in the GOS data and their closest sequenced relatives. Our analysis has thus far been confined to the three microbial genera that were widespread in the GOS dataset as represented by the finished genomes of P. marinus MIT9312, P. ubique HTCC1062, and to a lesser extent Synechococcus WH8102. Each of these genomes is characterized by large and small segments where little or no fragment recruitment took place. We refer to these segments as ‘‘gaps.’’ These gaps
Figure 5. Fragment Recruitment at Sites of Rearrangements

Environmental sequences recruited near breaks in synteny have characteristic patterns of recruitment metadata. Indeed, each of five basic
rearrangements (i.e., insertion, deletion, translocation, inversion, and inverted translocation) produced a distinct pattern when examining the
recruitment metadata. Here, example recruitment plots for each type of rearrangement have been artificially generated. The ‘‘good’’ and ‘‘no mate’’
categories have been suppressed. In each case, breaks in synteny are marked by the presence of stacks of ‘‘missing’’ mate reads. The presence or
absence of other categories distinguishes each type of rearrangement from the others.
represent reference-specific differences that are not found in the environmental populations rather than a cloning bias that identifies genes or gene segments that are toxic or unclonable in E. coli. The presence of missing mates flanking these gaps indicates that the associated clones do exist, and therefore that cloning issues are not a viable explanation for the absence of recruited reads. Although the reference-specific differences are quite apparent due to the recruitment gaps they generate, there are also sporadic rearrangements associated with single clones, mostly resulting from small insertions or deletions.
Careful examination of the unrecruited mates of the reads flanking the gaps allowed us to identify, characterize, and quantify specific differences between the reference genomes and their environmental relatives. The results of this analysis for P. ubique and P. marinus have been summarized in Table 5. With few exceptions, small gaps resulted from the insertion or deletion of only a few genes. Many of the genes associated with these small insertions and deletions have no annotated function. In some cases the insertions display a degree of variability such that different sets of genes are found at these locations within a portion of the population. In contrast, many of the larger gaps are extremely variable to the extent that every clone contains a completely unrelated or highly divergent sequence when compared to the reference or to other clones associated with that gap. These segments are hypervariable and change much more rapidly than would be expected given the variation in the rest of the genome. Sites containing a hypervariable segment nearly always contained some insert. We identified two exceptions, both associated with P. ubique. The first is located at approximately the 166-kb position in the P. ubique HTCC1062 genome. Though no large gap is present, the mated reads indicate that a highly variable insert is often present. The second is a gap on HTCC1062 that appears between 50 and 90 kb. This gap appears to be less variable than other hypervariable segments and is occasionally absent based on the large numbers of flanking long mated reads (Poster S1A). Interestingly, the long mated reads around this gap seem to be disproportionately from the Sargasso Sea samples, suggesting that this segment may be linked to geographic and/or environmental factors. Thus, hypervariable segments are highly variable even within the same sample, can on occasion be unoccupied, and the variation, or lack thereof, can be sample dependent.
Hypervariable segments have been seen previously in a wide range of microbes, including P. marinus [28], but their precise source and functional role, especially in an environmental context, remain a matter of ongoing research. For clues to these issues we examined the genes associated with the missing mates flanking these segments and the nucleotide composition of the gapped sequences in the reference genomes. In some rare cases the genes identified on reads that should have recruited within a hypervariable gap were highly similar to known viral genes. For example, a viral integrase was associated with the P. ubique HTCC1062 hypervariable gap between 516 and 561 kb. However, in the majority of cases the genes associated with these gaps were uncharacterized, either bearing no similarity to known genes or resembling genes of unknown function. If these genes were indeed acquired through horizontal transfer, then we might expect that they would have obvious compositional biases. Oligonucleotide frequencies along the P. ubique HTCC1062 and Synechococcus WH8102 genomes are quite different in the large recruitment gaps in comparison to the well-represented portions of the genome (Poster S1). Surprisingly, this was less true for P. marinus MIT9312, where the gaps have been linked to phage activity [28]. These results suggest that these hypervariable segments of the genome are widespread among marine microbial populations, and that they are the product of horizontal transfer events perhaps mediated by phage or transposable elements. These results are consistent with and expand upon the hypothesis put forward by Coleman et al. [28] suggesting that these segments are phage mediated, and conflict with initial claims that the HTCC1062 genome was devoid of genes acquired by horizontal transfer [29].
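The compositional comparison described above (oligonucleotide frequencies in recruitment gaps versus the well-represented portions of a genome) can be illustrated with a small sketch. The choice of tetranucleotides (k = 4), the window and step sizes, and the Euclidean distance between normalized frequency vectors are all assumptions made for illustration; the text does not state how the frequencies were compared.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq: str, k: int = 4) -> dict:
    """Normalized k-mer frequencies of a DNA sequence (k-mers over ACGT only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def profile_distance(p: dict, q: dict) -> float:
    """Euclidean distance between two k-mer frequency profiles."""
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in p))


def window_deviation(genome: str, window: int = 5000, step: int = 1000, k: int = 4):
    """Yield (start, distance) of each window's profile from the genome-wide profile."""
    ref = kmer_profile(genome, k)
    for start in range(0, max(len(genome) - window + 1, 1), step):
        win = kmer_profile(genome[start:start + window], k)
        yield start, profile_distance(win, ref)
```

Windows whose distance stands well above the rest would be candidates for compositionally atypical segments; whether they coincide with recruitment gaps would still have to be checked against the recruitment data.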
Table 5. Atypical Segments in P. marinus MIT9312 and P. ubique HTCC1062 (SAR11)

| Reference Genome | Begin^a | End^b | Size, bp | Type of Variant^c | Description |
|---|---|---|---|---|---|
| MIT9312 | 36,401 | 38,311 | 2,132 | Variable deletion | 12 out of 66 clones support simple deletion. Remaining clones show considerable sequence variation amongst themselves. |
| MIT9312 | 124,448 | 125,219 | 771 | Variable insertion | Associated with ASN tRNA gene; in the environment, half the reads identify pair of small inserts with no similarity to known genes; the other half point to small and large inserts of undetermined nature. |
| MIT9312 | 233,826 | 233,910 | 84 | Insertion | All clones support small insert (270 bp) with no clear sequence similarity to known genes or sequences. |
| MIT9312 | 243,296 | 245,115 | 424 | Variable deletion | 24 of 42 clones support simple deletion of hypothetical protein. 13 support slightly larger deletion. Remaining not clearly resolved. |
| MIT9312 | 296,818 | 300,888 | 4,344 | Variable deletion | 44 clones support deletion of 3,070 bp segment containing 4 genes (3 hypothetical; 1 carbamoyltransferase). 17 clones support alternative sequences with little or no similarity to each other. |
| MIT9312 | 342,404 | 342,662 | 326 | Variable insert | In environment is 93% chance of finding deoxyribodipyrimidine photolyase with 7% chance of finding an ABC type Fe3+ siderophore transport system permease component. |
| MIT9312 | 345,933 | 365,351 | 19,418 | Hypervariable | Very limited similarity among clones indicates that this is a hypervariable segment. Note that it is closely associated with a site-specific integrase/recombinase. |
| MIT9312 | 551,347 | 552,025 | 678 | Deletion | Two small deletions within a hypothetical protein. |
| MIT9312 | 617,914 | 621,556 | 3,642 | Hypervariable | About 50% of the clones support small deletions among several hypothetical proteins. Remaining support significant variability suggesting hypervariable segment. At least two of the missed mates contain integrase-like genes. |
| MIT9312 | 646,340 | 652,375 | 6,035 | Deletion/Hypervariable | Majority of clones support simple deletion of eight hypothetical and hypothetical genes. Small number of clones indicate this may be hypervariable as well. |
| MIT9312 | 655,241 | 655,800 | 559 | Deletion | Small deletion between hypothetical proteins. |
| MIT9312 | 665,000 | 678,000 | 12,000 | Deletions | Complex set of deletions and replacements that vary with geographic location. |
| MIT9312 | 665,824 | 666,380 | 556 | Insertion | Small inserted hypothetical protein. |
| MIT9312 | 670,747 | 671,933 | 1,186 | Deletion | Deletes hypothetical gene. |
| MIT9312 | 736,266 | 736,289 | 23 | Insertion | All clones support insertion of fructose-bisphosphate aldolase and fructose-1,6-bisphosphate aldolase. |
| MIT9312 | 762,156 | 762,717 | 561 | Deletion | Small deletion between hypothetical proteins. |
| MIT9312 | 779,006 | 779,309 | 303 | Insertion | Small hypothetical protein inserted. |
| MIT9312 | 874,349 | 874,913 | 564 | Insertion | Small insertion including gene with similarity to RNA-dependent RNA-polymerase. |
| MIT9312 | 943,389 | 946,997 | 3,608 | Variable | Several small changes. |
| MIT9312 | 1,043,129 | 1,131,874 | 88,745 | Hypervariable | |
| MIT9312 | 1,140,922 | 1,141,412 | 490 | Insertion | Small insertion. |
| MIT9312 | 1,144,307 | 1,144,790 | 483 | Insertion | Small insertion of several genes. |
| MIT9312 | 1,155,123 | 1,156,440 | 1,317 | Deletion | Deletes high-light–inducible protein. |
| MIT9312 | 1,172,609 | 1,177,292 | 4,683 | Variable | Several genes have been replaced or deleted. |
| MIT9312 | 1,288,481 | 1,290,367 | 1,886 | Variable deletion | Deletes polysaccharide export-related periplasmic protein (28 out of 55). Other deletions are variable and may include replacement with alternate sequences. |
| MIT9312 | 1,323,606 | 1,324,523 | 917 | Deletion | Environmental sequences lack a small hypothetical protein. |
| MIT9312 | 1,369,637 | 1,369,996 | 359 | Insertion | NAD-dependent DNA ligase absent from MIT9312; has possible paralog. |
| MIT9312 | 1,381,273 | 1,382,049 | 776 | Deletion | Small insert in MIT9312 not present in environment. |
| MIT9312 | 1,384,664 | 1,385,110 | 446 | Deletion | Deletes delta(12)-fatty acid dehydrogenase and replaces gene with small (~100 bp) sequence. There is some variation in the exact location and replacement sequence. |
| MIT9312 | 1,388,430 | 1,389,718 | 1,288 | Replacement | Segment between two high-light–inducible proteins swapped for different sequence with no similarity. |
| MIT9312 | 1,392,865 | 1,392,976 | 111 | Replacement | Small replacement deletes hypothetical gene and replaces it with small, unknown sequence. There is some variation in the precise boundaries of the deletion and in the replacement sequences. |
| MIT9312 | 1,486,145 | 1,487,971 | 231 | Variable | Small insertion of dolichyl-phosphate-mannose-protein mannosyltransferase; alternately deletes glycosyl transferase (8 out of 56). |
| MIT9312 | 1,519,810 | 1,520,860 | 1,050 | Variable deletion | Deletes a single hypothetical protein; about half of the deletions contain variable sequences of unknown origin. |
| MIT9312 | 1,568,049 | 1,569,121 | 1,072 | Replacement | Typically 928-bp portion of MIT9312 replaced by 175-bp stretch in environment; some small amount of variation in environmental replacement sequence (11 out of 51). |
| HTCC1062 | 50,555 | 93,942 | 43,387 | Variable deletion | Low recruitment segment containing many hypothetical, transporter, and secretion genes. |
| HTCC1062 | 146,074 | 146,415 | 341 | Deletion | Often deleted segment containing DoxD-like and ferredoxin dependent glutamate synthase peptide. |
| HTCC1062 | 166,600 | 166,700 | 100 | Variable replacement | GOS sequences indicate that variable blocks of genes are frequently inserted here. |
| HTCC1062 | 308,720 | 309,633 | 913 | Deletion | Potential sulfotransferase domain deleted. |
| HTCC1062 | 339,545 | 339,951 | 406 | Deletion | Deletes a predicted O-linked N-acetylglucosamine transferase. |
| HTCC1062 | 385,348 | 386,224 | 876 | Deletion | SAM-dependent methyltransferase deleted. |
| HTCC1062 | 441,074 | 441,152 | 78 | Deletion | Deletes possible methyltransferase FkbM. |
| HTCC1062 | 516,041 | 561,604 | 45,563 | Variable replacement | Several high identity “missed” mates match phage genes, including an integrase. |
| HTCC1062 | 660,413 | 660,978 | 565 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 675,141 | 676,399 | 1,258 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 766,022 | 768,263 | 2,241 | Deletion | Deletes four hypothetical proteins. |
| HTCC1062 | 814,015 | 816,386 | 2,371 | Deletion | Deletes steroid monooxygenase and short-chain dehydrogenase. |
| HTCC1062 | 893,450 | 922,604 | 29,154 | Deletion | Deletes large segment including a 7317-aa hypothetical gene. |
| HTCC1062 | 941,203 | 942,403 | 1,200 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 991,461 | 997,861 | 6,400 | Deletion | Deletes several hypotheticals and mix of other genes; adjacent to recombinase. |
| HTCC1062 | 1,117,126 | 1,143,483 | 26,357 | Deletion | Large number of hypothetical and transporters that are deleted. |
| HTCC1062 | 1,160,908 | 1,166,589 | 5,681 | Deletion | Deletes a small cluster of peptides with various functions. |
| HTCC1062 | 1,188,887 | 1,189,231 | 344 | Deletion | Deletes portion of winged helix DNA-binding protein but inserts sequences with similarity to gi\|71082757 sodium bile symporter family protein found in large gap between 50555 and 93942. |

^a Begin indicates the approximate bp position which marks the beginning of the gap in recruitment.
^b End indicates the approximate bp position which marks the ending of the gap in recruitment.
^c The type of change indicates what would have to happen to the reference genome to produce the sequences seen in the environment (e.g., a deletion indicates that the indicated portion of the reference would have to be deleted to generate the variant(s) seen in the environment).
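The Begin and End columns of Table 5 mark gaps in recruitment. As an illustration of how such gaps could be located, the sketch below builds per-base coverage from the intervals of recruited reads and reports runs where coverage is at or below a threshold. The half-open interval representation, the `max_cov` threshold, and the `min_len` filter are hypothetical choices, not parameters taken from the study.

```python
def recruitment_gaps(intervals, genome_len, max_cov=0, min_len=1):
    """Return (begin, end) half-open runs of the reference with coverage <= max_cov.

    intervals: iterable of (start, end) half-open placements of recruited reads.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_len + 1)
    for start, end in intervals:
        start, end = max(start, 0), min(end, genome_len)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    gaps, cov, gap_start = [], 0, None
    for pos in range(genome_len):
        cov += diff[pos]  # running sum gives coverage at this position
        if cov <= max_cov:
            if gap_start is None:
                gap_start = pos
        elif gap_start is not None:
            if pos - gap_start >= min_len:
                gaps.append((gap_start, pos))
            gap_start = None
    # Close a gap that runs to the end of the reference.
    if gap_start is not None and genome_len - gap_start >= min_len:
        gaps.append((gap_start, genome_len))
    return gaps
```

Raising `min_len` restricts the output to the larger gaps, which is one simple way to separate small insertion or deletion sites from candidate hypervariable segments.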
Table 6. Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype

| Group | Low Genome Begin | Low Genome End | High Genome Begin | High Genome End | Read ID | Low Read Begin | Low Read Breakpoint | Low Read Strand | High Read Breakpoint | High Read End | High Read Strand | Read Length | Inversion | Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 34,428 | 34,904 | 1,536,417 | 1,536,774 | 1092255385627 | 0 | 476 | 1 | 477 | 834 | 1 | 834 | No | 15 |
| 2 | 607,778 | 608,467 | 1,131,368 | 1,131,621 | 1093017685727 | 270 | 959 | 0 | 0 | 251 | 1 | 959 | Yes | 18 |
| 3 | 618,997 | 619,375 | 1,172,217 | 1,172,728 | 1092963065572 | 19 | 397 | 1 | 425 | 939 | 1 | 939 | No | 17 |
| 3 | 618,997 | 619,372 | 1,172,203 | 1,172,728 | 1095433012642 | 0 | 375 | 1 | 403 | 931 | 1 | 941 | No | 26 |
| 3 | 618,997 | 619,251 | 1,172,216 | 1,172,728 | 1091140913752 | 2 | 256 | 1 | 284 | 799 | 1 | 799 | No | 19 |
| 3 | 618,997 | 619,331 | 1,172,280 | 1,172,728 | 1641121 | 2 | 337 | 1 | 365 | 816 | 1 | 816 | No | 00d |
| 3 | 619,007 | 619,375 | 1,172,223 | 1,172,728 | 1092963490951 | 19 | 387 | 1 | 425 | 933 | 1 | 933 | No | 17 |
| 4 | 652,933 | 653,483 | 1,369,774 | 1,370,077 | 200560 | 325 | 875 | 0 | 0 | 303 | 1 | 875 | Yes | 00a |
| 4 | 652,979 | 653,350 | 1,369,774 | 1,370,200 | 1492647 | 0 | 371 | 1 | 439 | 865 | 0 | 867 | Yes | 00d |
| 4 | 652,979 | 653,496 | 1,369,774 | 1,370,086 | 1092256128910 | 10 | 527 | 1 | 595 | 905 | 0 | 906 | Yes | 15 |
| 4 | 652,979 | 653,496 | 1,369,774 | 1,370,061 | 1092405979387 | 14 | 531 | 1 | 599 | 886 | 0 | 886 | Yes | 25 |
| 4 | 652,979 | 653,353 | 1,369,774 | 1,370,132 | 1093017637883 | 2 | 376 | 1 | 444 | 802 | 0 | 802 | Yes | 18 |
| 4 | 652,980 | 653,496 | 1,369,774 | 1,370,111 | 1092343389654 | 6 | 522 | 1 | 591 | 928 | 0 | 928 | Yes | 25 |
| 5 | 1,049,484 | 1,049,793 | 1,172,301 | 1,172,782 | 1400802 | 518 | 827 | 0 | 1 | 483 | 0 | 827 | No | 00d |
| 6 | 1,219,993 | 1,220,519 | 1,485,834 | 1,486,222 | 1092256207885 | 2 | 528 | 1 | 532 | 919 | 1 | 919 | No | 17 |
| 6 | 1,219,993 | 1,220,496 | 1,485,925 | 1,486,222 | 380485 | 300 | 803 | 0 | 0 | 296 | 0 | 803 | No | 00a |
| 6 | 1,219,993 | 1,220,477 | 1,485,782 | 1,486,204 | 1484583 | 0 | 484 | 1 | 506 | 927 | 1 | 927 | No | 00d |
| 6 | 1,220,005 | 1,220,388 | 1,485,733 | 1,486,216 | 1682914 | 507 | 890 | 0 | 2 | 484 | 0 | 913 | No | 00d |

Though insertions and deletions accounted for many of the obvious regions of structural variation, we also looked for rearrangements. The high levels of local synteny associated with P. ubique and P. marinus suggested that large-scale rearrangements were rare in these populations. To investigate this hypothesis we used the recruitment data to examine how frequently rearrangements besides insertions and deletions could be identified. We looked for rearrangements consisting of large (greater than 50 kb) inversions and translocations associated with P. marinus; however, we did not identify any such rearrangements that consistently distinguished environmental populations from sequenced cultivars. Rare inversions and translocations were identified in the dominant subtype associated with MIT9312 (Table 6). Based on the amount of sequence that contributed to the analysis, we estimate that one inversion or translocation will be observed for every 2.6 Mbp of sequence examined (less than once per P. marinus genome).
A further observation concerns the uniformity along a genome of the evolutionary history among and within subtypes. For instance, the similarity between GOS reads and P. marinus MIT9312 is typically 85%–95%, while the similarity between MIT9312 and P. marinus MED4 is generally ~10% lower. However, there are several instances where the divergence of MIT9312 and MED4 abruptly decreases to no more than that between the GOS sequences and MIT9312 (Poster S1G). These results are consistent either with horizontal transfer (recombination) or with inhomogeneous
selectional pressures. Similar patterns are present in the two high-identity subtypes seen on the P. ubique HTCC1062 genome (Poster S1D). Other regions show local increases in similarity between MIT9312 and the dominant subtype that are not reflected in the MIT9312/MED4 divergence (e.g., near positions 50 kb, 288 kb, 730 kb, 850 kb, and 954 kb on MIT9312; also see Poster S1G). These latter regions might reflect either regions of homogenizing recombination or regions of higher levels of purifying selection. However, the lengths of the intervals (several are 10 kb or more) are longer than any single gene and correspond to genes that are not extremely conserved over greater taxonomic distances (in contrast to the ribosomal RNA operon). Equally, if widespread horizontal transfer of an advantageous segment explains these intervals, the transfers occurred long enough ago for appreciable variation to accumulate (unpublished data).

Extreme Assembly of Uncultivated Populations

The analyses described above have been confined to those organisms with representatives in culture and for which genomes were readily available. Producing assemblies for other abundant but uncultivated microbial genera would provide valuable physiological and biochemical information that could eventually lead to the cultivation of these organisms, help elucidate their role in the marine community, and allow analyses of their evolution and variation similar to those performed on sequenced organisms. Previous assembly efforts and the fragment recruitment plots showed that there is considerable and in many cases conflicting variation among related organisms. Such variation is known to disrupt whole-genome assemblers. This led us to try an assembly approach that aggressively resolves conflicts. We call this approach “extreme assembly” (see Materials and Methods). This approach currently does not make use of mate-pairing data and therefore produces only contigs, not scaffolded sequences. Using this approach, contigs as large as 900 kb could be aligned almost in their entirety to the P. marinus MIT9312 and P. ubique HTCC1062 genomes (Figure 2J–2L). Consistent patterns of fragment recruitment (see below) generally provided evidence of the correctness of contigs belonging to otherwise-unsequenced organisms. Accordingly, large contigs from these alternate assemblies were used to investigate genetic and geographic population structure, as described below. However, the more aggressive assemblies demonstrably suffered from higher rates of assembly artifacts, including chimerism and false consensus sequences (Figure 6). Thus, the more stringent primary assembly was employed for most assembly-based analyses, as manual curation was not practical.

Figure 6. Examples of Chimeric Extreme Assemblies
(A) Fragment recruitment to an extreme assembly contig indicates the assembly is chimeric between two organisms, based on dramatic shifts in density of recruitment, level of conservation, and sample distribution.
(B) Fragment recruitment to a SAR11-related extreme assembly. Changes in color, density, and vertical location toward the top of the figure indicate transitions among multiple subtypes of SAR11.
doi:10.1371/journal.pbio.0050077.g006

As just noted, many of the large contigs produced by the more aggressive assembly methods described above did not align to any great degree with known genomes. Some could be tentatively classified based on contained 16S sequences, but the potential for computationally generated chimerism within the rRNA operon is sufficiently high that inspection of the assembly or other means of confirming such classifications is essential. An alternative to an unguided assembly that facilitates the association of assemblies with known organisms is to start from seed fragments that can be identified as belonging to a particular taxonomic group. We employed fragments outside the ribosomal RNA operon that were mated to a 16S-containing read, limiting extension to the direction away from the 16S operon. This produced contigs of 100 kb or more for several of the ribotypes that were abundant in the GOS dataset. When evaluated via fragment recruitment (Figure 2M–2O), these assemblies revealed patterns analogous to those seen for the sequenced genomes described above: multiple subtypes could be distinguished along the assembly, differing in similarity to the reference sequence and sample distribution, with occasional gaps. Hypervariable segments by definition were not represented in these assemblies, but they may help explain the termination of the extreme assemblies for P. marinus and SAR11 and provide a plausible explanation for termination of assemblies of the other deeply sampled populations as well.
This directed approach to assembly can also be used to investigate variation within a group of related organisms (e.g., a 16S ribotype). We explored the potential to assemble
distinct subtypes of SAR11 by repeatedly seeding extreme assembly with fragments mated to a SAR11-like 16S sequence. Figure 7 compares the first 20 kb from each of 24 independent assemblies. Eighteen of these segments could be aligned full-length to a portion of the HTCC1062 genome just upstream of 16S, while six appeared to reflect rearrangements relative to HTCC1062. The rearranged segments were associated with more divergent 16S sequences (8%–14% diverged from the 16S of HTCC1062), while those without rearrangements corresponded to less divergent 16S (averaging less than 3% different from HTCC1062). In each segment, many reads were recruited above 90% identity, but different samples dominated different assemblies. Phylogenetic trees support the inference of evolutionarily distinct subtypes with distinctive sample distributions (Figure 8).

Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be Separated by Extreme Assembly
Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily labeled (A–X) for reference in Figure 8.

Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7 Correspond to Monophyletic Subtypes
Coloring of branches indicates that the corresponding reads align at >90% identity to the extreme assembly segments shown in Figure 7; colored labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned.

Taxonomic Diversity
Environmental surveys provide a cultivation-independent means to examine the diversity and complexity of an environmental sample and serve as a basis to compare the populations between different samples. Typically, these surveys use PCR to amplify ubiquitous but slowly evolving genes such as the 16S rRNA or recA genes. These in turn can be used to distinguish microbial populations. Since PCR can introduce various biases, we identified 16S genes directly from the primary GOS assembly. In total, 4,125 distinct full-length or partial 16S sequences were identified. Clustering of these sequences at 97% identity gave a total of 811 distinct ribotypes. Nearly half (48%) of the GOS ribotypes and 88% of the GOS 16S sequences were assigned to ribotypes previously deposited in public databases. That is, more than half the ribotypes in the GOS dataset were found to be novel at what is typically considered the species level [30]. The overall taxonomic distribution of the GOS ribotypes sampled by shotgun sequencing is consistent with previously published PCR-based studies of marine environments (Table 7) [31]. A smaller fraction (16%) of GOS ribotypes and 3.4% of the GOS 16S sequences diverged by more than 10% from any publicly available 16S sequence, thus being novel to at least the family level.
A census of microbial ribotypes allows us to identify the abundant microbial lineages and estimate their contribution
to the GOS dataset. Of the 811 ribotypes, 60 contain more than 8-fold coverage of the 16S gene (Table 8); jointly, these 60 ribotypes accounted for 73% of all the 16S sequence data. All but one of the 60 have been detected previously, yet only a few are represented by close relatives with complete or nearly complete genome sequencing projects (see Fragment Recruitment for further details). Several other abundant 16S sequences belong to well-known environmental ribotypes that do not have cultivated representatives (e.g., SAR86, Roseobacter NAC-1–2, and branches of SAR11 other than those containing P. ubique). Interestingly, archaea are nearly absent from the list of dominant organisms in these near-surface samples.

Table 7. Taxonomic Makeup of GOS Samples Based on 16S Data from Shotgun Sequencing

| Phylum or Class | Fraction^a |
|---|---|
| Alpha Proteobacteria | 0.32 |
| Unclassified Proteobacteria | 0.155 |
| Gamma Proteobacteria | 0.132 |
| Bacteroidetes | 0.13 |
| Cyanobacteria | 0.079 |
| Firmicutes | 0.075 |
| Actinobacteria | 0.046 |
| Marine Group A | 0.022 |
| Beta Proteobacteria | 0.017 |
| OP11 | 0.008 |
| Unclassified Bacteria | 0.008 |
| Delta Proteobacteria | 0.005 |
| Planctomycetes | 0.002 |
| Epsilon Proteobacteria | 0.001 |

^a Values shown are averages over all samples.

The distribution of these ribotypes reveals distinct microbial communities (Figure 9 and Table 8). Only a handful of the ribotypes appear to be ubiquitously abundant; these are dominated by relatives of SAR11 and SAR86. Many of the ribotypes that are dominant in one or more samples appear to reside in one of three separable marine surface habitats. For example, several SAR11, SAR86, and alpha Proteobacteria, as well as an Acidimicrobidae group, are widespread in the surface waters, while a second niche delineated by tropical samples contains several different SAR86, Synechococcus and Prochlorococcus (both cyanobacterial groups), and a Rhodospirillaceae group. Other ribotypes related to Roseobacter RCA, SAR11, and gamma Proteobacteria are abundant in the temperate samples but were not observed in the tropical or Sargasso samples. Not surprisingly, samples taken from nonmarine environments (GS33, GS20, GS32), estuaries (GS11, GS12), and larger-sized fraction filters (GS01a, GS01b, GS25) have distinguishing ribotypes. Furthermore, as the complete genomes of these dominant members are obtained, the capabilities responsible for their abundances may well lend insight into the community metabolism in various oceanic niches.

Sample Comparisons
The most common approach for comparing the microbial community composition across samples has been to examine the ribotypes present as indicated by 16S rRNA genes or by analyzing the less-conserved ITS located between the 16S and 23S gene sequences [7,8,16,17]. However, a clear observation emerging from the fragment recruitment views was that the reference ribotypes recruit multiple subtypes, and that these subtypes were distributed unequally among samples (Figures 2, 7, 8; Poster S1D, S1F, and S1I).
We developed a method to assess the genetic similarity between two samples that potentially makes use of all portions of a genome, not just the 16S rRNA region. This similarity measure is assembly independent; under certain circumstances, it is equivalent to an estimate of the fraction of sequence from one sample that could be considered to be in the other sample. Whole-metagenomic similarities were computed for all pairs of samples. Results are presented for comparisons at 98% and 90% identity. No universal cutoff consistently divides sequences into natural subsets, but the 98% identity cutoff provides a relatively high degree of resolution, while the 90% cutoff appears to be a reasonable heuristic for defining subtypes. For instance, a 90% cutoff treats most of the reads specifically recruited to P. marinus MIT9312 as similar (those more similar to MED4 notably excepted), while reasonably separating clades of SAR11 (Figures 7 and 8). Reads with no qualifying overlap alignment to any other read in a pair of samples are uninformative for this analysis, as they correspond to lineages that were so lightly sequenced that their presence in one sample and absence in another may be a matter of chance. For the 90% cutoff, 38% of the sequence reads contributed to the analysis. The resulting similarities reveal clear and consistent groupings of samples, as well as the outlier status of certain samples (Figures 10 and 11).
The broadest contrast was between samples that could be loosely labeled “tropical” (including samples from the Sargasso Sea [GS00b, GS00c, GS00d] and samples that are temperate by the formal definition but under the influence of the Gulf Stream [GS14, GS15]) and “temperate.” Further subgroups can be identified within each of these categories, as indicated in Figures 10 and 11. In some cases, these groupings were composed of samples taken from different ocean basins during different legs of the expedition. A few pairs of samples with strikingly high similarity were observed, including GS17 and GS18, GS23 and GS26, GS27 and GS28, and GS00b and GS00d. In each case, the two samples were collected consecutively or nearly consecutively. However, the same could be said of many other pairs of samples that do not show this same degree of similarity. Indeed, geographically and temporally separated samples taken in the Atlantic (GS17, GS18) and Pacific (GS23, GS26) during separate legs of the expedition are more similar to one another than were most pairs of consecutive samples. The samples with least similarity to any other sample were from unique habitats. Thus, similarity cannot be attributed to geographic separation alone.
The groupings described above can be reconstructed from taxonomically distinct subsets of the data. Specifically, the major groups of samples visible in Figure 10 were reproduced when sample similarities were determined based only on fragments recruiting to P. ubique HTCC1062 (unpublished data). Likewise, the same groupings were observed when the fragments recruiting to either HTCC1062 or P. marinus MIT9312, or both, were excluded from the calculations (unpublished data). Thus, the factors influencing sample similarities do not appear to rely solely on the most abundant
Table 8. Most Abundant Ribotypes (97% Identity Clusters)
Ribotype Classification [a] | Depth of Coverage [b] | Range | Number of Matching GenBank Entries [c]
SAR11 Surface 1 | 581 | Widespread | 100+
SAR11 Surface 2 | 182 | Sargasso and GS31 [d] | 100+
Burkholderia | 139 | 00a | 100+
Acidomicrobidae type a | 133 | Tropical and Sargasso [d] | 77
Prochlorococcus | 112 | Tropical and Sargasso | 76
SAR11 Surface 3 | 109 | Widespread | 63
SAR86-like type a | 108 | Widespread | 88
Shewanella | 80 | 00a | 49
Synechococcus | 59 | Tropical [d] | 100+
Rhodospirillaceae | 50 | Tropical and Sargasso [d] | 15
SAR86-like type b | 47 | Hypersaline pond [d] | 24
SAR86-like type c | 47 | Tropical and Sargasso | 75
Chlorobi-like | 40 | Hypersaline pond | 1
Alpha Proteobacteria type a | 38 | Widespread | 10
Roseobacter type a | 35 | Tropical and Sargasso [d] | 42
Cellulomonadaceae type a | 34 | Hypersaline pond | 12
SAR86-like type d | 28 | Widespread | 30
Alpha Proteobacteria type b | 27 | Widespread | 43
SAR86-like type e | 26 | Widespread | 71
Cytophaga type a | 24 | Tropical and Sargasso | 21
Bacteroidetes type a | 23 | Widespread | 17
Bdellovibrionales type a | 21 | Tropical and Sargasso | 36
Acidomicrobidae type b | 21 | Temperate | 41
SAR116-like | 21 | Widespread | 28
Marine Group A type a | 20 | Tropical and Sargasso [d] | 9
Remotely SAR11-like type a | 19 | Sargasso and tropical | 12
Frankineae type a | 19 | Fresh and estuary | 6
Frankineae type b | 18 | Hypersaline pond [d] | 71
SAR86-like type f | 18 | Tropical | 15
SAR86-like type g | 17 | Tropical | 13
Remotely SAR11-like type b | 16 | Tropical and Sargasso | 4
Gamma Proteobacteria type a | 16 | Sargasso and GS14 | 18
Microbacteriaceae | 15 | Hypersaline pond | 1
SAR102/122-like | 15 | Tropical and Sargasso | 15
SAR86-like type h | 15 | Tropical and Sargasso | 27
Bacteroidetes type b | 14 | Tropical [d] | 14
Remotely SAR11-like type c | 14 | Tropical and Sargasso | 18
SAR86-like type i | 14 | Sargasso [d] | 13
Rhodobium-like type a | 14 | Sargasso and GS31 [d] | 9
Marine Group A type b | 13 | Tropical and Sargasso [d] | 28
Gamma Proteobacteria type b | 12 | S. temperate and GS31 | 22
Oceanospirillaceae | 12 | Mangrove | 1
Gamma Proteobacteria type c | 12 | Widespread | 14
SAR11-like type a | 12 | Temperate | 17
SAR11-like type b | 11 | Fresh and estuary | 11
SAR11-like type c | 11 | Sargasso and GS31 [d] | 12
SAR86-like type j | 11 | Tropical and Sargasso | 1
Roseobacter Algicola | 11 | Widespread | 20
Remotely SAR102/122-like | 11 | Hypersaline pond | 0
Frankineae type c | 10 | Fresh and estuary | 11
Rhodobium-like type b | 10 | Widespread | 9
Roseobacter RCA | 10 | Temperate | 39
Remotely SAR11-like type d | 10 | Tropical | 32
Acidobacteria | 9 | Fresh | 4
Remotely SAR11-like type e | 9 | Tropical and Sargasso | 8
Frankineae type d | 9 | Estuary and fresh | 89
SAR11-like type d | 9 | Sargasso and GS31 | 4
Methylophilus | 9 | Temperate [d] | 41
Archaea C1 C1a | 8 | Sargasso and GS31 | 2
Cytophaga type b | 8 | Widespread | 9
[a] Taxonomic classifications based on Hugenholtz ARB database. Labels indicate the most specific taxonomic assignment that could be confidently assigned to each ribotype. “Type a,” “type b,” etc., are used to arbitrarily discriminate separate 97% ribotypes that would otherwise be given the same name.
[b] Note that the 16S rRNA gene can be multicopy.
[c] Matching GenBank entries required full-length matches at 98% identity.
[d] Less than 1× coverage outside described range.
Figure 9. Presence and Abundance of Dominant Ribotypes

The relative abundance of various ribotypes (rows) in each filter (columns) is represented by the area of the corresponding spot (if any). The listed
ribotypes each satisfied the following criteria in at least one filter: the ribotype was among the five most abundant ribotypes detected in the shotgun
data, and was represented by at least three sequencing reads. Relative abundance is based on the total number of 16S sequences in a given filter. Order
and grouping of filters is based on the clustering of genomic similarity shown in Figure 11. Ribotype order was determined based on similarity of
sample distribution. A marked contrast between temperate and tropical groups is visible. Estuarine samples GS11 and GS12 contained a mix of
ribotypes seen in freshwater and temperate marine samples, while samples from nonmarine habitats or larger filter sizes were pronounced outliers. The
presence of large amounts of Burkholderia and Shewanella in one Sargasso Sea sample (GS00a) makes this sample look much less like other Sargasso
and tropical marine samples than it otherwise would. Note that 16S is not a measure of cell abundance since 16S genes can be multicopy.
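The selection rule in the caption of Figure 9 can be sketched in a few lines of Python. This is only an illustrative sketch: the input layout, the function names, and the toy counts are assumptions made here, and the caption does not say how ties at the fifth rank were handled (the sketch simply keeps the sort order).

```python
def dominant_ribotypes(counts, top_n=5, min_reads=3):
    """Return ribotypes that pass both caption criteria in at least one filter.

    counts: {filter_name: {ribotype_name: number_of_16S_reads}} (hypothetical layout).
    """
    selected = set()
    for ribotypes in counts.values():
        # Rank ribotypes within this filter by read count, most abundant first.
        ranked = sorted(ribotypes.items(), key=lambda kv: kv[1], reverse=True)
        for name, reads in ranked[:top_n]:
            if reads >= min_reads:
                selected.add(name)
    return selected


def relative_abundance(counts):
    """Fraction of each filter's 16S reads assigned to each ribotype."""
    return {
        filt: {name: reads / sum(ribotypes.values()) for name, reads in ribotypes.items()}
        for filt, ribotypes in counts.items()
    }


# Toy read counts, invented for illustration only.
toy_counts = {
    "filter_A": {"ribotype_1": 40, "ribotype_2": 10, "ribotype_3": 2},
    "filter_B": {"ribotype_4": 30, "ribotype_1": 5},
}
selected = dominant_ribotypes(toy_counts)
fractions = relative_abundance(toy_counts)
```

In this toy input, ribotype_3 ranks among the top five of filter_A but is dropped because it has fewer than three reads.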
organisms but rather are reflected in multiple microbial lineages.
It is tempting to view the groups of similar samples as constituting community types. Sample
similarities based on genomic sequences correlated significantly with differences in the
environmental parameters (Table 1), particularly water temperature and salinity (unpublished
data). Samples that are very similar to each other had relatively small differences in temperature
and salinity. However, not all samples that had similar temperature and salinity had high
community similarities. Water depth, primary productivity, fresh water input, proximity to land,
and filter size appeared consistent with the observed groupings. Other factors such as nutrients
and light for phototrophs and fixed carbon/energy for chemotrophs may ultimately prove better
predictors, but these results demonstrate the potential of using metagenomic data to tease out
such relationships.
Examining the groupings in Figure 11 in light of habitat and physical characteristics, the
following may be observed. The first two samples, a hypersaline pond in the Galapagos Islands
(GS33) and the freshwater Lake Gatun in the Panama Canal (GS20), are quite distinct from the rest.
Salinity (both higher and lower than in the remaining coastal and ocean samples) is the simplest
explanation.
Twelve samples form a strong temperate cluster, seen in the similarity matrix of Figure 11 as a
darker square bounded by GS06 and GS12. Embedded within the temperate cluster are three
subclusters. The first subcluster includes five samples from Nova Scotia through the Gulf of
Maine. This is followed by a subcluster of four samples between Rhode Island and North Carolina.
The northern subcluster was sampled in August, the southern subcluster in November and
Figure 10. Similarity between Samples in Terms of Shared Genomic Content

Genomic similarity, as described in the text, is an estimate of the amount of the genetic material in two filters that is “the same” at a given percent
identity cutoff: not the amount of sequence in common in a finite dataset, but rather in the total set of organisms present on each filter. Similarities are
shown for 98% identity.
(A) Hierarchical clustering of samples based on pairwise similarities.
(B) Pairwise similarities between samples, represented as a symmetric matrix of grayscale intensities; a darker cell in the matrix indicates greater
similarity between the samples corresponding to the row and column, with row and column ordering as in (A). Groupings of similar filters appear as
subtrees in (A) and as squares consisting of two or more adjacent rows and columns with darker shading. Colored bars highlight groups of samples
described in the text; labels are approximate characterizations rather than being strictly true of every sample in a group.
Figure 11. Sample Similarity at 90% Identity

Similarity between samples in terms of shared genomic content, as in Figure 10, except that the plots were made using a 90% identity cutoff, which
has proven reasonable for separating some moderately diverged subtypes.
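Panel (A) of Figures 10 and 11 is a hierarchical clustering built from pairwise similarities. The captions do not state which linkage rule was used, so the sketch below assumes average linkage purely for illustration. The sample names S1 to S4 and their similarity values are invented, and the function names are hypothetical.

```python
from itertools import combinations


def mean_similarity(a, b, sim):
    """Average pairwise similarity between two clusters (tuples of sample names)."""
    values = [sim[frozenset((x, y))] for x in a for y in b]
    return sum(values) / len(values)


def average_linkage(names, sim):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    sim maps frozenset({sample_a, sample_b}) to a similarity in [0, 1].
    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: mean_similarity(pair[0], pair[1], sim),
        )
        merges.append((a, b, mean_similarity(a, b, sim)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy similarity values, invented for illustration only.
toy_sim = {
    frozenset(("S1", "S2")): 0.9,
    frozenset(("S1", "S3")): 0.7,
    frozenset(("S2", "S3")): 0.6,
    frozenset(("S1", "S4")): 0.1,
    frozenset(("S2", "S4")): 0.1,
    frozenset(("S3", "S4")): 0.2,
}
merges = average_linkage(["S1", "S2", "S3", "S4"], toy_sim)
```

Here S1 and S2 merge first, S3 joins them next, and S4 joins last at low similarity, which is the kind of outlier behavior the text describes for samples from unique habitats.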
December. Though all samples were collected in the top few meters, the southern samples were in
shallower waters, 10 to 30 m deep, whereas most of the northern samples were in waters greater
than 100 m deep. Monthly average estimates of chlorophyll a concentrations were typically higher
in the southern samples as well (Table 1). All of these factors (temperature, system primary
production, and depth of the sampled water body) likely contribute to the differences in microbial
community composition that result in the two well-defined clusters. The final temperate subgroup
includes two estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11), distinguished by their
lower salinity and higher productivity. However, GS11 is markedly similar not only to GS12 but
also to coastal samples, whereas GS12 appears much more distinct. Interestingly, the Bay of Fundy
estuary sample (GS06) clearly did not group with the two other estuaries, but rather with the
northern subgroup, perhaps reflecting differences in the rate or degree of mixing at the sampling
site.
Continuing to the right and downward in Figure 11, one can see a large cluster of 25 samples from
the tropics and Sargasso Sea, bounded by GS47 and GS00b. This can be further subdivided into
several subclusters. The first subcluster (a square bounded by GS47 and GS14) includes 14 samples,
about half of which were from the Galapagos. The second distinct subcluster (a square bounded by
GS16 and GS26) includes seven samples ranging from Key West, Florida, in the Atlantic Ocean to a
sample close to the Galapagos Islands in the Pacific Ocean. Loosely associated with this
subcluster is a sample from a larger filter size taken en route to the Galapagos (GS25). The
remaining samples group weakly with the tropical cluster. GS32 was taken in a coastal mangrove in
the Galapagos. The thick organic sediment at a depth of less than a meter is the likely cause for
it being unlike the other samples. Sample GS00a was from the Sargasso Sea and contained a large
fraction of sequence reads from apparently clonal Burkholderia and Shewanella species that are
atypical. When this
Table 9. Relative Abundance of TIGRFAMs Associated with a Specific Sample
TIGRFAM | Number of Peptides | Sample [a] | Relative Abundance [b] | Major Category | Minor Category | Description
TIGR01526 131 GS01a 3.4 Nicotinamide-nucleotide adenylyltransferase

TIGR00661 214 GS01a 2.6 Hypothetical Conserved Conserved hypothetical protein
TIGR01833 135 GS01a 2 Hydroxymethylglutaryl-CoA synthase
TIGR01408 267 GS01b 4.5 Ubiquitin-activating enzyme E1
TIGR01678 144 GS01b 2.6 Sugar 1,4-lactone oxidases
TIGR01879 758 GS01b 2.5 Amidase, hydantoinase/carbamoylase family
TIGR00890 131 GS01b 2.4 Transport Carbohydrates, Oxalate/Formate Antiporter
TIGR01767 112 GS01b 2.4 5-methylthioribose kinase
TIGR00101 495 GS01b 2.3 Central Nitrogen Urease accessory protein UreG
TIGR01659 186 GS01b 2.2 Sex-lethal family splicing factor
TIGR00313 455 GS01b 2.1 Biosynthesis Heme, Cobyric acid synthase CobQ
TIGR00317 306 GS01b 2.1 Biosynthesis Heme, Cobalamin 5'-phosphate synthase
TIGR00601 186 GS01b 2.1 DNA DNA UV excision repair protein Rad23
TIGR01792 904 GS01b 2.1 Central Nitrogen Urease, alpha subunit
TIGR00749 483 GS01b 2 Energy Glycolysis/gluconeogenesis Glucokinase
TIGR01001 485 GS01b 2 Amino Aspartate Homoserine O-succinyltransferase
TIGR02238 165 GS32 5.1 Meiotic recombinase Dmc1
TIGR02239 159 GS32 3.5 DNA repair protein RAD51
TIGR02232 140 GS32 2.6 Myxococcus cysteine-rich repeat
TIGR02153 212 GS32 2.5 Protein tRNA Glutamyl-tRNA(Gln) amidotransferase, subunit D
TIGR00519 289 GS32 2.4 L-asparaginases, type I
TIGR00288 248 GS32 2 Hypothetical Conserved Conserved hypothetical protein TIGR00288
TIGR01681 136 GS32 2 HAD-superfamily phosphatase, subfamily IIIC
TIGR02236 143 GS32 2 DNA DNA DNA repair and recombination protein RadA
TIGR00028 110 GS33 14.7 Mycobacterium tuberculosis PIN domain family
TIGR01550 131 GS33 9.1 Unknown General Death-on-curing family protein
TIGR01552 200 GS33 5.3 Mobile Other Prevent-host-death family protein
TIGR00143 151 GS33 4.6 Protein Protein [NiFe] hydrogenase maturation protein HypF
TIGR01641 131 GS33 4.3 Mobile Prophage Phage putative head morphogenesis protein, SPP1 gp7 family
TIGR01710 1,926 GS33 3.9 Cellular Pathogenesis General secretion pathway protein G
TIGR01539 251 GS33 3.6 Mobile Prophage Phage portal protein, lambda family
TIGR00016 217 GS33 3.5 Energy Fermentation Acetate kinase
TIGR00942 117 GS33 3.5 Transport Cations Multicomponent Na+:H+ antiporter
TIGR01836 174 GS33 3.3 Fatty Biosynthesis Poly(R)-hydroxyalkanoic acid synthase, class III, PhaC subunit
TIGR02110 135 GS33 3.2 Biosynthesis Other Coenzyme PQQ biosynthesis protein PqqF
TIGR01106 902 GS33 3.1 Energy ATP-proton Na,H/K antiporter P-type ATPase, alpha subunit
TIGR01497 726 GS33 3.1 Transport Cations K+-transporting ATPase, B subunit
TIGR01524 603 GS33 3.1 Transport Cations Magnesium-translocating P-type ATPase
TIGR02140 166 GS33 3.1 Transport Anions Sulfate ABC transporter, permease protein CysW
TIGR01409 122 GS33 3 Tat (twin-arginine translocation) pathway signal sequence
TIGR02195 282 GS33 3 Cell Biosynthesis Lipopolysaccharide heptosyltransferase II
TIGR01522 696 GS33 2.9 Calcium-transporting P-type ATPase, PMR1-type
TIGR01523 766 GS33 2.9 Potassium/sodium efflux P-type ATPase, fungal-type
TIGR00202 228 GS33 2.8 Regulatory RNA Carbon storage regulator
TIGR01116 798 GS33 2.8 Transport Cations Calcium-translocating P-type ATPase, SERCA-type
TIGR02094 204 GS33 2.8 Alpha-glucan phosphorylases
TIGR02051 296 GS33 2.7 Regulatory DNA Hg(II)-responsive transcriptional regulator
TIGR01005 259 GS33 2.6 Transport Carbohydrates, Exopolysaccharide transport protein family
TIGR01222 129 GS33 2.6 Cellular Cell Septum site-determining protein MinC
TIGR01334 157 GS33 2.6 Unknown General modD protein
TIGR01554 152 GS33 2.6 Mobile Prophage Phage major capsid protein, HK97 family
TIGR00640 1,059 GS33 2.5 Methylmalonyl-CoA mutase C-terminal domain
TIGR01708 801 GS33 2.5 Cellular Pathogenesis General secretion pathway protein H
TIGR02018 373 GS33 2.5 Regulatory DNA Histidine utilization repressor
TIGR00554 182 GS33 2.4 Biosynthesis Pantothenate Pantothenate kinase
TIGR01202 152 GS33 2.4 Biosynthesis Chlorophyll Chlorophyll synthesis pathway, bchC
TIGR01583 102 GS33 2.4 Energy Electron Formate dehydrogenase, gamma subunit
TIGR02092 241 GS33 2.4 Energy Biosynthesis Glucose-1-phosphate adenylyltransferase, GlgD subunit
TIGR00052 263 GS33 2.3 Hypothetical Conserved Conserved hypothetical protein TIGR00052
TIGR00824 166 GS33 2.3 Signal PTS PTS system, mannose/fructose/sorbose family, IIA component
TIGR01003 321 GS33 2.3 Transport Carbohydrates, Phosphocarrier, HPr family
TIGR01439 191 GS33 2.3 Regulatory DNA Transcriptional regulator, AbrB family
TIGR01457 220 GS33 2.3 HAD-superfamily subfamily IIA hydrolase, TIGR01457
TIGR01517 684 GS33 2.3 Calcium-translocating P-type ATPase, PMCA-type
TIGR02028 632 GS33 2.3 Biosynthesis Chlorophyll Geranylgeranyl reductase
TIGR00452 217 GS33 2.2 Unknown Enzymes Methyltransferase, putative
TIGR00609 2,555 GS33 2.2 DNA DNA Exodeoxyribonuclease V, beta subunit
TIGR00876 265 GS33 2.2 Energy Pentose Transaldolase
TIGR00996 302 GS33 2.2 Cellular Pathogenesis Virulence factor Mce family protein
TIGR01254 419 GS33 2.2 Transport Other ABC transporter periplasmic binding protein, thiB subfamily
TIGR01278 508 GS33 2.2 Biosynthesis Chlorophyll Light-independent protochlorophyllide reductase, B subunit
TIGR01512 1,820 GS33 2.2 Transport Cations Cadmium-translocating P-type ATPase
TIGR01525 2,037 GS33 2.2 Heavy metal translocating P-type ATPase
TIGR01543 418 GS33 2.2 Protein Other Phage prohead protease, HK97 family
TIGR01857 1,495 GS33 2.2 Purines Purine Phosphoribosylformylglycinamidine synthase
TIGR02015 183 GS33 2.2 Energy Photosynthesis Chlorophyllide reductase subunit Y
TIGR02072 1,052 GS33 2.2 Biosynthesis Biotin Biotin biosynthesis protein BioC
TIGR02099 273 GS33 2.2 Hypothetical Conserved Conserved hypothetical protein TIGR02099
TIGR00203 168 GS33 2.1 Energy Electron Cytochrome d ubiquinol oxidase, subunit II
TIGR00218 141 GS33 2.1 Energy Sugars Mannose-6-phosphate isomerase, class I
TIGR00915 4,834 GS33 2.1 Transport Other Transporter, hydrophobe/amphiphile efflux-1 (HAE1) family
TIGR01315 319 GS33 2.1 FGGY-family pentulose kinase
TIGR01330 253 GS33 2.1 3'(2'),5'-bisphosphate nucleotidase
TIGR01508 848 GS33 2.1 Diaminohydroxyphosphoribosylaminopyrimidine reductase
TIGR01511 1,928 GS33 2.1 Transport Cations Copper-translocating P-type ATPase
TIGR01764 387 GS33 2.1 Unknown General DNA binding domain, excisionase family
TIGR02014 298 GS33 2.1 Energy Photosynthesis Chlorophyllide reductase subunit Z
TIGR02047 222 GS33 2.1 Cd(II)/Pb(II)-responsive transcriptional regulator
TIGR00586 990 GS33 2 DNA DNA Mutator mutT protein
TIGR00853 114 GS33 2 Signal PTS PTS system, lactose/cellobiose family IIB component
TIGR00937 364 GS33 2 Transport Anions Chromate transporter, chromate ion transporter (CHR) family
TIGR01030 221 GS33 2 Protein Ribosomal Ribosomal protein L34
TIGR01214 2,834 GS33 2 Cell Biosynthesis dTDP-4-dehydrorhamnose reductase
TIGR01698 608 GS33 2 Purine nucleotide phosphorylase
TIGR02190 465 GS33 2 Glutaredoxin-family domain
[a] Reads associated with Shewanella and Burkholderia have been excluded.
[b] TIGRFAM is this many times more abundant than in the next most abundant sample.
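The filter behind Table 9 (families annotating 100 or more genes that are more than 2-fold more frequent in one sample than in any other) can be sketched as follows. This is a minimal sketch under assumptions: the input layout, the function name, and the toy numbers are invented here, and the normalization of frequencies across samples is taken as already done, since this section does not describe it.

```python
def single_sample_enriched(freq, n_peptides, min_peptides=100, min_fold=2.0):
    """Find families predominantly found in a single sample.

    freq: {family: {sample: normalized frequency}} (hypothetical layout).
    n_peptides: {family: total number of peptides annotated by the family}.
    Returns {family: (top_sample, fold_over_next_most_abundant_sample)}.
    """
    hits = {}
    for family, by_sample in freq.items():
        if n_peptides.get(family, 0) < min_peptides or len(by_sample) < 2:
            continue
        ranked = sorted(by_sample.items(), key=lambda kv: kv[1], reverse=True)
        (top_sample, top), (_, runner_up) = ranked[0], ranked[1]
        # Compare against the next most abundant sample, as in the table footnote.
        fold = top / runner_up if runner_up > 0 else float("inf")
        if fold > min_fold:
            hits[family] = (top_sample, fold)
    return hits


# Toy values, invented for illustration only.
toy_freq = {
    "FAM_A": {"sample_X": 14.7, "sample_Y": 1.0, "sample_Z": 0.5},
    "FAM_B": {"sample_X": 1.5, "sample_Y": 1.0},   # below the 2-fold threshold
    "FAM_C": {"sample_Y": 9.0, "sample_X": 1.0},   # too few peptides
}
toy_peptides = {"FAM_A": 110, "FAM_B": 500, "FAM_C": 40}
hits = single_sample_enriched(toy_freq, toy_peptides)
```

Only FAM_A survives both thresholds in this toy input. Comparing against the runner-up sample rather than the mean of all samples matches the footnote definition of relative abundance.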
sample is reanalyzed to exclude reads identified as belonging to these two groups, sample GS00a
groups loosely with GS00b, GS00c, and GS00d (unpublished data). Finally, three subsamples from a
single Sargasso sample (GS01a, GS01b, GS01c) group together, despite representing three distinct
size fractions (3.0–20, 0.8–3.0, and 0.1–0.8 µm, respectively; Table 1).
The complete set of sample similarities is more complex than described above, and indeed is more
complex than can be captured by a hierarchical clustering. For instance, the southern temperate
samples are appreciably more similar to the tropical cluster than are the northern temperate
samples. GS22 appears to constitute a mix of tropical types, showing strong similarity not only to
the GS47–GS14 subcluster to which it was assigned, but also to the other tropical samples.
These results may be compared to the more traditional view of community structure afforded by 16S
sequences (Figure 9). Some of the same groupings of samples are visible using both analyses.
Several ribotypes recapitulated the temperate/tropical clustering described above. Others were
restricted to the single instances of nonmarine habitats. Several of the most abundant organisms
from the coastal mangrove, hypersaline lagoon, and freshwater lake were found exclusively in these
respective samples. However, while several ribotypes recapitulated the temperate/tropical
distinction revealed by the genomic sequence, others crosscut it. A few dominant 16S ribotypes,
related to SAR11, SAR86, and SAR116, were found in every marine sample. The brackish waters from
two mid-Atlantic estuaries (GS11 and GS12) contained a mixture of otherwise exclusively marine and
freshwater ribotypes; similarity of these sites to the freshwater sample (GS20) was minimal at the
metagenomic level, while the greater similarity of GS11 to coastal samples visible at the
metagenomic level was not readily visible here. A fuller comparison of metagenome-based
measurements of diversity based on a large dataset of PCR-derived 16S sequences will be presented
in another paper (in preparation).

Variation in Gene Abundance
Differences in gene content between samples can identify functions that reflect the lifestyles of
the community in the context of its local environment [20,32]. We examined the relative abundance
of genes belonging to specific functional categories in the distinct GOS samples. Genes were
binned into functional categories using TIGRFAM hidden Markov models [18], which are well
annotated and manually curated [33].
The results can be filtered in various ways to highlight genes associated with specific
environments. One catalog of possible interest is genes that were predominantly found in a single
sample. We identified 95 TIGRFAMs that annotated large sets of genes (100 or more) that were
significantly more frequent (greater than 2-fold) in one sample than in any other sample (Table
9). Not surprisingly, this approach disproportionately singles out genes from the samples
collected on larger filters (GS01a, GS01b, and GS25) and from the nonmarine environments,
particularly the hypersaline pond (sample GS33). Another contrast might be between the
Table 10. Relative Abundance of TIGRFAM Matches in Temperate and Tropical Waters
TIGRFAM | Number of Peptides | Sample(s) | Relative Abundance [a] | Major Category | Minor Category | Description
TIGR01153 729 GS15–GS19 32.7 Energy Photosynthesis Photosystem II 44 kDa subunit reaction center protein
TIGR02093 673 GS15–GS19 29.6 Energy Biosynthesis Glycogen/starch/alpha-glucan phosphorylases
TIGR01335 813 GS15–GS19 26.6 Energy Photosynthesis Photosystem I core protein PsaA
TIGR01336 806 GS15–GS19 26.5 Energy Photosynthesis Photosystem I core protein PsaB
TIGR00975 648 GS15–GS19 11 Transport Anions Phosphate ABC transporter, phosphate-binding protein
TIGR00297 261 GS15–GS19 8.6 Hypothetical Conserved Conserved hypothetical protein TIGR00297
TIGR00992 302 GS15–GS19 8.5 Transport Amino Chloroplast envelope protein translocase, IAP75 family
TIGR02030 560 GS15–GS19 7.5 And Chlorophyll Magnesium chelatase ATPase subunit I
TIGR02041 359 GS15–GS19 6 Central Sulfur Sulfite reductase (NADPH) hemoprotein, beta-component
TIGR01151 2,095 GS15–GS19 4.7 Energy Photosynthesis Photosystem q(b) protein
TIGR01152 1,865 GS15–GS19 4.7 Energy Photosynthesis Photosystem II D2 protein (photosystem q(a) protein)
TIGR02031 800 GS15–GS19 4.2 Biosynthesis Chlorophyll Magnesium chelatase ATPase subunit D
TIGR01790 629 GS15–GS19 4 Lycopene cyclase family protein
TIGR02100 512 GS15–GS19 4 Energy Biosynthesis Glycogen debranching enzyme GlgX
TIGR00073 284 GS15–GS19 3.4 Protein Protein Hydrogenase accessory protein HypB
TIGR00159 497 GS15–GS19 3 Hypothetical Conserved Conserved hypothetical protein TIGR00159
TIGR01515 594 GS15–GS19 3 Energy Biosynthesis 1,4-alpha-glucan branching enzyme
TIGR00217 601 GS15–GS19 2.7 Energy Biosynthesis 4-alpha-glucanotransferase
TIGR01486 505 GS15–GS19 2.7 Mannosyl-3-phosphoglycerate phosphatase family
TIGR01098 720 GS15–GS19 2.6 Transport Carbohydrates Phosphonate ABC transporter, periplasmic phosphonate-binding protein
TIGR00101 495 GS15–GS19 2.5 Central Nitrogen Urease accessory protein UreG
TIGR01273 567 GS15–GS19 2.4 Central Polyamine Arginine decarboxylase
TIGR01470 179 GS5–GS10 25.7 Biosynthesis Heme Siroheme synthase, N-terminal domain
TIGR00361 374 GS5–GS10 12.1 Cellular DNA DNA internalization-related competence protein ComEC/Rec2
TIGR01537 333 GS5–GS10 6.2 Mobile Prophage Phage portal protein, HK97 family
TIGR00201 291 GS5–GS10 6 Cellular DNA comF family protein
TIGR00879 420 GS5–GS10 5 Transport Carbohydrates Sugar transporter
TIGR02018 373 GS5–GS10 4.1 Regulatory DNA Histidine utilization repressor
TIGR02183 294 GS5–GS10 4.1 Glutaredoxin, GrxA family
TIGR00427 602 GS5–GS10 4 Hypothetical Conserved Conserved hypothetical protein TIGR00427
TIGR01109 219 GS5–GS10 3.6 Energy Other Sodium ion-translocating decarboxylase, beta subunit
TIGR01262 840 GS5–GS10 2.8 Energy Amino Maleylacetoacetate isomerase
[a] The average abundance of the TIGRFAM in the given samples is this many times its average abundance in the other set of samples (in this case, GS15–GS19 were
compared with GS5–GS10).
temperate and tropical clusters (Figures 10 and 11). We identified 32 proteins that were more than
2-fold more frequent in one or the other group (Table 10). The presence of various
Prochlorococcus-associated genes in this list highlights some of the potential challenges with
this sort of approach. Overrepresentation may reflect: a direct response to particular
environmental pressures (as the excess of salt transporters plausibly does in the hypersaline
pond); a lineage-restricted difference in functional repertoire (as exemplified by the excess of
photosynthesis genes in samples containing Prochlorococcus); or a more incidental “hitchhiking” of
a protein found in a single organism that happens to be present.
We explored whether clearer and more informative differences could be discovered between
communities by focusing on groups of samples that are highly similar in overall taxonomic/genetic
content. Two pairs of samples provide a particularly nice illustration of this approach. Samples
GS17 and GS18 from the western Caribbean Sea and samples GS23 and GS26 from the eastern Pacific
Ocean were all very similar based on the presence of abundant ribotypes and overall similarity in
genetic content (Figures 9–11). Despite these similarities, several genes are found to be up to
seven times more common in the pair of Caribbean samples than the Pacific pair (Table 11). No
genes are more than 2-fold higher in the Pacific than the Caribbean pair of samples. Several of
the most differentially abundant genes are related to phosphate transport and utilization. It is
very plausible that this is a reflection of a functional adaptation: these differences correlate
well with measured differences in phosphate abundance between the Atlantic and eastern Pacific
samples [34,35], and phosphate abundance plays a critical role in microbial growth [36,37].
Indeed, the ability to acquire phosphate, especially under conditions where it is limited, is
thought to determine the relative fitness of Prochlorococcus strains [38].
The single greatest difference between GS17 and GS18 on the one hand and GS23 and GS26 on the
other was attributed to a set of genes annotated by the hidden Markov model TIGR02136 as a
phosphate-binding protein (PstS). This TIGRFAM identified a single gene in both P. marinus MIT9312
and P. ubique HTCC1062. In P. marinus MIT9312, this gene is located at 672 kb, lying roughly in
the middle of a 15-kb segment of the genome that recruits almost no GOS sequences from the Pacific
sampling sites (Poster S1H). In P. ubique HTCC1062, the PstS gene is found at 1,133 kb in a 5-kb
segment that also recruited far fewer GOS sequences from all the Pacific samples except for GS51
(Poster S1E). These genomic segments differ structurally among isolates but they are no more
variable than the flanking regions, and thus are
Table 11. Relative Abundance of TIGRFAM Matches in Atlantic and Pacific Open Ocean Waters
TIGRFAM | Number of Peptides | Sample(s) | Relative Abundance [a] | Major Category | Minor Category | Description
TIGR02136 | 1,130 | GS17, GS18 | 7.2 | Transport | Anions | Phosphate-binding protein
TIGR00974 | 2,122 | GS17, GS18 | 3.5 | Transport | Anions | Phosphate ABC transporter, permease protein PstA
TIGR00975 | 648 | GS17, GS18 | 3.5 | Transport | Anions | Phosphate ABC transporter, phosphate-binding protein
TIGR02138 | 2,139 | GS17, GS18 | 3.4 | Transport | Anions | Phosphate ABC transporter, permease protein PstC
TIGR00206 | 459 | GS17, GS18 | 2.8 | Cellular | Chemotaxis | Flagellar M-ring protein FliF
TIGR01782 | 1,297 | GS17, GS18 | 2.4 | Transport | Unknown | TonB-dependent receptor
TIGR00642 | 862 | GS17, GS18 | 2.3 | Central | Other | Methylmalonyl-CoA mutase, small subunit
TIGR02135 | 899 | GS17, GS18 | 2.3 | Transport | Anions | Phosphate transport system regulatory protein PhoU
[a] Relative abundance: the average abundance of the TIGRFAM in the given samples is at least this many times its average abundance in the other set of samples (in this
case, GS17 and GS18 were compared with GS23 and GS26).
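The group-versus-group ratio defined in the footnotes of Tables 10 and 11 amounts to a ratio of group means. The sketch below illustrates that arithmetic only; the function name, the placeholder sample labels, and the toy frequencies are assumptions and do not reproduce any value from the tables.

```python
from statistics import mean


def group_fold(freq, group_a, group_b):
    """Ratio of the mean abundance in group_a to the mean abundance in group_b.

    freq: {sample: normalized abundance of one TIGRFAM} (hypothetical layout).
    """
    return mean(freq[s] for s in group_a) / mean(freq[s] for s in group_b)


# Toy abundances for a single family, invented for illustration only.
toy_freq = {"carib_1": 8.0, "carib_2": 6.4, "pac_1": 1.0, "pac_2": 1.0}
fold = group_fold(toy_freq, ["carib_1", "carib_2"], ["pac_1", "pac_2"])
reverse_fold = group_fold(toy_freq, ["pac_1", "pac_2"], ["carib_1", "carib_2"])
```

Computing the ratio in both directions mirrors the comparison in the text, where genes were enriched in the Caribbean pair but none exceeded 2-fold in the Pacific pair.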
not hypervariable in the sense used previously (unpublished data). Nor are they particularly
conserved when present, indicating that they are not the result of a recent lateral transfer.
Phylogenetic analyses outside these segments did not produce any evidence of a Pacific versus
Caribbean clade of either Prochlorococcus or SAR11 (Figure 3A–3B). The presence or absence of
phosphate transporters is not limited to these two types of organisms. The number of phosphate
transporters that were found in the Caribbean far exceeds the number that can be attributed to
HTCC1062- and MIT9312-like organisms. However, these results indicate that within individual
strains or subtypes the ability to acquire phosphate (in one or more of its forms) can vary
without detectable differences in the surrounding genomic sequences.

Biogeographic Distribution of Proteorhodopsin Variants
Variation in gene content is only one aspect of the tremendous diversity in the GOS data. The
functional significance of all the polymorphic differences between homologous proteins remains
largely unknown. To look for functional differences, we analyzed members of the proteorhodopsin
gene family. Proteorhodopsins are fast, light-driven proton pumps for which considerable
functional information is available, though their biological role remains unknown.
Proteorhodopsins were highly abundant in the Sargasso Sea samples [19] and continue to be highly
abundant and evenly distributed (relative to recA abundance) in all the GOS samples. A total of
2,674 putative proteorhodopsin genes were identified in the GOS dataset. Although many of the
sequences are fragmentary, 1,874 of these genes contain the residue that is primarily responsible
for tuning the light-absorbing properties of the protein [39–41], and these properties have been
shown to be selected for under different environmental conditions [42]. Variation at this residue
is strongly correlated with sample of origin (Figure 12). The leucine (L) or green-tuned variant
was highly abundant in the North Atlantic samples and in the nonmarine environments like the
fresh water sample from Lake Gatun (GS20). The glutamine (Q) or blue-tuned variant dominated in
the remaining mostly open ocean samples.
Given our limited understanding of the biological role for proteorhodopsin, the reason for this
differential distribution is not immediately clear. In coastal waters where nutrients are more
abundant, phytoplankton is dominant. Phytoplankton absorbs primarily in the blue and red spectra;
consequently, the water appears green [43]. Conversely, in the open ocean nutrients are rare and
phytoplanktonic biomass is low, so waters appear blue because in the absence of impurities the red
wavelengths are absorbed preferentially [44]. It may be that proteorhodopsin-carrying microbes
have simply adapted to take advantage of the most abundant wavelengths of light in these systems.
Proteorhodopsins encoded on reads that were recruited to P. ubique HTCC1062 account for a fraction
(~25%) of all the proteorhodopsin-associated reads, suggesting that the remainder must be
associated with a variety of marine microbial taxa (see also [45–47]). Phylogenetic analysis of
the SAR11-associated proteins revealed that each variant has arisen independently at least two
times in the SAR11 lineage (Figure 3C). Consistent with other findings that proteorhodopsins are
widely distributed throughout the microbial world [48], we conclude that multiple microbial
lineages are responsible for proteorhodopsin spectral variation and that the abundance of a given
variant reflects selective pressures rather than taxonomic effects. Similar mechanisms seem to be
involved in the evolution and diversification of opsins that mediate color vision in vertebrates
[49].

Discussion
Our results highlight the astounding diversity contained within microbial communities, as revealed
through whole-genome shotgun sequencing carried out on a global scale. Much of this microbial
diversity is organized around phylogenetically related, geographically dispersed populations we
refer to as subtypes. In addition, there is tremendous variation within subtypes, both in the form
of sequence variation and in hypervariable genomic islands. Our ability to make these observations
derives not only from the large volumes of data but also from the development of new tools and
techniques to filter and organize the information in manageable ways.

Variation and Diversity
Our data demonstrate to an unprecedented degree the nature and evolution of genetic variation
below the species level. Variation can be analyzed in several ways, including observed differences
in sequence, genomic structure, and
Figure 12. Distribution of Common Proteorhodopsin Variants across GOS Samples

The leucine (L) and methionine (M) variants absorb maximally in the green spectrum (Oded Beja, personal communication) while the glutamine (Q)
variant absorbs maximally in the blue spectrum. The relative abundance of each variant is shown as a percentage (x-axis) per sample (y-axis). Total
abundance for all variants, in read equivalents normalized by the abundance of recA protein, is shown on the right side of the y-axis. The L and Q
variants show a nonrandom distribution. The L variant is abundant in temperate Atlantic waters close to the U.S. and Canadian coast. The Q variant is
abundant in warmer waters further from land. The M variant is moderately abundant in a wide range of samples with no obvious geographic/
environmental association.
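The two quantities plotted in Figure 12 can be illustrated with a small sketch: the percentage of each tuning variant within a sample, and the total variant abundance divided by the recA abundance for that sample. This is only an illustration under stated assumptions. The function name, sample labels, and counts below are hypothetical, and the exact read-equivalent calculation behind the figure is not given in this section.

```python
from typing import Dict, Tuple


def variant_profile(
    variant_reads: Dict[str, float], reca_reads: float
) -> Tuple[Dict[str, float], float]:
    """Return (percentage of each variant, total abundance relative to recA)."""
    total = sum(variant_reads.values())
    if total <= 0 or reca_reads <= 0:
        raise ValueError("need positive variant and recA counts")
    # Share of each variant within the sample (one bar of the figure).
    percentages = {v: 100.0 * n / total for v, n in variant_reads.items()}
    # Total variant abundance expressed relative to the recA abundance.
    return percentages, total / reca_reads


# Hypothetical read-equivalent counts for two made-up samples (not GOS data).
EXAMPLE_SAMPLES = {
    "coastal_example": ({"L": 60.0, "M": 25.0, "Q": 15.0}, 50.0),
    "open_ocean_example": ({"L": 5.0, "M": 20.0, "Q": 75.0}, 40.0),
}

if __name__ == "__main__":
    for name, (counts, reca) in EXAMPLE_SAMPLES.items():
        pct, normalized = variant_profile(counts, reca)
        print(name, {v: round(p, 1) for v, p in pct.items()}, round(normalized, 2))
```

Dividing by recA, as the caption describes, puts samples with very different sequencing depth on a roughly per-genome scale, which is why the sketch keeps the percentage and the normalized total as separate outputs.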
gene complement. The observed patterns of variation shed light on the mechanisms by which marine prokaryotes evolve. Gene synteny seems to be more highly conserved than the nucleotide and protein sequences. This variation is seen over essentially the entire genome in every abundant group of organisms sufficiently related for us to recognize a population by fragment recruitment. (These include, but are not limited to, the organisms shown in Figure 2 and Poster S1.) Notably, we found no evidence of widespread low-diversity organisms such as B. anthracis [50].

Phylogenetic trees and fragment recruitment plots (Figures 7 and 8) indicate that the variation within a species is not an unstructured swarm or cloud of variants all equally diverged from one another. Instead, there are clearly distinct subtypes, in terms of sequence similarity, gene content, and sample distribution. Similar findings have been shown for specific organisms, based on evaluation of one or a few loci [2,51–53]. These results rule out certain trivial models of population history and evolution for what is commonly considered a bacterial “species.” For instance, they argue against a recent explosive population growth from a single successful individual (selective sweep) [54]. Equally, they argue against a perfectly mixed population, suggesting instead some barriers to competition and exchange of genetic material. In principle, this variation could reflect some combination of physical barriers (true biogeography), short-term stochastic effects, and/or functional differentiation. Given the confounding variables of geography, time, and environmental conditions in the current collection of samples, it is difficult to definitively separate these effects, but various observations argue for functional differentiation between subtypes (i.e., they constitute distinct ecotypes). First, individual subtypes may be found in a wide range of locations; P. ubique HTCC1062 was isolated in the Pacific Ocean off the coast of Oregon [55], but closely related sequences are relatively abundant in our samples taken in the Atlantic Ocean. Second, geography per se cannot fully explain differences in subtype distributions, as multiple subtypes are found simultaneously in a single sample. Third, the collection of samples in which a given subtype was found generally exhibits similar environmental conditions. A strong independent illustration of this comes from the correlation of temperature with the distribution of Prochlorococcus subtypes [56]. Fourth, the extensive variation within each subtype (i.e., the fact that subtypes are not clonal populations) indicates that it cannot be chance alone that makes genetically similar organisms have similar observed distributions.

Taken together, these results argue that subtype classification is more informative for categorizing microbial populations than classification using 16S-based ribotypes, or fingerprinting techniques based on length polymorphism, such as T-RFLPs [57] or ARISA [58]. For example, the grouping of such disparate microbial populations under the umbrella P. marinus dilutes the significance of the term “species.” Indeed, numerous papers have been devoted to comparing and contrasting the differences and variability in P. marinus isolates to better understand how this particularly abundant group of organisms has evolved and adapted within the dynamic marine environment [28,52,56,59–66]. Prior to the widespread use of marker-based phylogenetic approaches, microbial systematics relied on a wide range of variables to distinguish microbial populations [67]. Subtypes bring us back to these more comprehensive approaches since they reflect the influences of a wide range of factors in the context of an entire genome.

Although subtypes are a salient feature of our data, individual variation within a ribotype does not stop at the level of subtypes. Variation within subtypes is so extensive that few GOS reads can be aligned at 100% identity to any other GOS read, despite the deep coverage of several taxonomic groups. Related findings have been shown for the ITS region in various organisms [2,51,52], and in a limited number of organisms for individual protein-coding and intergenic regions [2,53,68]. High levels of diversity within the ribotype can be convincingly demonstrated in the 16S gene itself [69]. The applicability of these results over the entire genome was recently shown for P. marinus [28] using data from the Sargasso Sea samples taken as a pilot project for the expedition reported here [19]. We have definitively demonstrated the generality of these findings, greatly increased our understanding of the minimum number of variants of a given organism, and shown that these observations apply to the entire genome for a wide range of abundant taxonomic groups and across a wide range of geographic locations.

Average pairwise differences of several percent between overlapping P. marinus or SAR11 reads imply that this variation did not arise recently. If one uses substitution rates estimated for E. coli [70], one could conclude that on average any two P. marinus cells must have diverged millions of years ago. Mutational rates are notoriously variable and hard to estimate, and assumptions of molecular clocks are equally chancy, but clearly within-subtype variants have persisted side by side for quite some time. This raises a question related to the classic “Paradox of the Plankton”: how can so many similar organisms have coexisted for so long [71,72]? One explanation, which we favor, is that not only subtypes but also individual variants are sufficiently different phenotypically to prevent any one strain from completely replacing all others (discussed further below; see [71] for a recent theoretical treatment). An alternative is that recombination might prevent selective sweeps within ecotypes, as proposed by Cohan (reviewed in [73]).

The Significance of Within-Subtype Variation

Given the apparent generality of subtypes and intra-subtype variation, it is important to understand if and how these subpopulations are functionally distinct. At the level of DNA sequence, a substantial fraction of substitutions are silent in terms of amino acid sequence, and others may be nonsynonymous but functionally neutral. However, two organisms that differ by 5% in their genetic sequence (e.g., 100,000 substitutions in 2 Mbp of shared sequence) will inevitably have at least minor functional differences such as in the optimal temperature or pH for the activity of some enzyme. At the level of gene content, the observation of hypervariable segments ([28] and here) implies that there is an additional dimension to functional variability. Hypervariable genomic islands with preferential insertion sites could potentially be associated with a wide range of functions, though to date they have been most closely examined for their role in pathogenicity (for a review, see [74]). However, given their apparent variability within even a single sampling site, it seems unlikely that these elements reflect a specific adaptive advantage to the local population. Identifying the source(s), diversity, and range of functionality associated with these islands by fully sequencing a large number of these segments and understanding how their abundances fluctuate should be quite informative.

Some might still argue that these differences must be moot for the purpose of understanding the role these organisms play in an ecosystem. Yet even small differences in optimal conditions may have profound effects. They may prevent any single genotype from being universally fittest, allowing and/or necessitating the coexistence of multiple variants [2,51,69]. Moreover, variation within subtype might afford a form of functional “buffering,” such that the population as a whole may be more stable in its ecosystem role than any one clone could be (see also [51]). That is, while any one strain of Prochlorococcus might thrive and provide energy input to the rest of the community at a limited range of temperatures, light conditions, etc., the ensemble might provide such inputs over a wider range of environmental conditions. In this way, microdiversity might provide system stability or robustness through functional redundancy and the “insurance effect” (reviewed in [75]). Thus, while the extent of microdiversity suggests that knowing the behavior of any one isolate in exquisite detail might not be as useful to reductionist modeling as one might hope, this buffering could afford a more stable ensemble behavior, facilitating the development and maintenance of an ecosystem and allowing for system-level modeling.

A direct equation of subtypes with ecotypes is tempting, but not entirely clear-cut. The correlation of PstS distribution with phosphate abundance suggests a functional adaptation, but within Prochlorococcus and SAR11 the presence or absence of PstS subdivides subtypes without apparent respect for phylogenetic structure. This contrasts markedly with the distribution of proteorhodopsin-tuning variants within SAR11, which, despite a few convergent substitutions, are strongly congruent with phylogeny. It is interesting to ask what distinguishes pressures or adaptations that respect (or that lead to) lineage splits from those that show little or no phylogenetic structuring. These two specific examples plausibly reflect two different mechanisms (i.e., convergent but independent mutation in proteorhodopsin genes and the acquisition by horizontal transfer of genes involved in phosphate uptake). Yet, we must wonder: given the evidence that proteorhodopsin has been transferred laterally [48], and that only a small number of mutations, in some circumstances even a single base-pair change, are required to switch between the blue-absorbing and green-absorbing forms [39,40], why should proteorhodopsin variants show any lineage restriction? Perhaps this relates to the modularity of the system in question: proteorhodopsin tuning may be part of a larger collection of synergistic adaptations that are
collectively not easily evolved, acquired, or lost, while the PstS and surrounding genes may represent a functional unit that can be readily added and removed over relatively short evolutionary time scales. If so, perhaps subtypes are indeed ecotypes, but rapidly evolving characters can lead to phenotypes that crosscut or subdivide ecotypes.

Phage provide one possible mechanism for rapid evolution of microbial populations or strains, and have been found in abundance in this and other marine metagenomic datasets [18,20]. It has been proposed that hypervariable islands are phage mediated [28]. However, there are reasons to be cautious about invoking phage as an explanation for rapidly evolving characteristics. While we see variability of PstS and neighboring genes in both SAR11 and P. marinus populations, this variation does not seem to be linked to recent phage activity. Initially, the distribution of PstS seems similar to the variation associated with the hypervariable islands, which may be phage mediated [28]. Indeed, phosphate-regulating genes including PstS have been identified in phage genomes [64], presumably because enhanced phosphate acquisition is required during the replication portion of their life cycle. However, the regions containing the PstS genes in both SAR11 and Prochlorococcus do not behave in the same fashion as clearly hypervariable regions, being effectively bimorphic (modulo the level of sequence variation observed elsewhere in the genome), whereas clearly hypervariable regions are so diverse that nearly every sampled clone falling in such a region appears completely unrelated to every other. Nor do the other genes in PstS-containing regions appear to be phage associated. These observations suggest that differences in PstS presence or absence arose in the distant past, or that different mechanisms are at work. It seems likely that phage may mediate lateral transfer of PstS and other phosphate acquisition genes, but it is unclear whether these genes can then become fixed within the population. Phage require enhanced phosphate acquisition as part of their life cycle [64], so regulatory or functional differences in these genes may limit their suitability for being acquired by the host cell for its own purposes. The rate of phage-mediated horizontal transfer of genes may reflect a combination of the gene’s value to the host and to the agent mediating the transfer (e.g., phage), suggesting that PstS may have much greater immediate value than do proteorhodopsin genes and their variants.

In practical terms, these results highlight the limitations associated with marker-based analysis and the use of these approaches to infer the physiology of a particular microbial population. At the resolution used here, marker-based approaches are not always informative regarding differences in gene content (e.g., the PstS gene as well as neighboring genes), especially those associated with hypervariable segments. Though phosphate acquisition is known to vary within different strains of P. marinus [64,76], our results clearly show that this variability can happen within a single subtype (as represented by MIT9312), effectively identifying distinct ecotypes. Given the correct samples from the appropriate environments, other core genes might also show similar variation and allow us to more fully assess the reliability of reference genomes as indicators of physiological potential.

Tools and Techniques

Analysis of the GOS dataset has benefited from the development of new tools and techniques. Many of these approaches rely on fairly well-known techniques but have been modified to take greater advantage of the metadata.

The technique of fragment recruitment and the corresponding fragment recruitment plots have proven highly useful for examining the biogeography and genomic variation of abundant marine microbes when a close reference genome exists. Ultimately, this approach derives from the percent identity plots of PipMaker [27]. Similar approaches have been used to examine variation with respect to metagenomic datasets. For example, hypervariable segments and sequence variation have been visualized in P. marinus MIT9312 using the Sargasso Sea data [28] and in the human gut microbes Bifidobacterium longum and Methanobrevibacter smithii [77].

Our primary advance associated with fragment recruitment plots is the incorporation of metadata associated with the isolation or production of the sequencing data. While simple in nature, the resulting plots can be extremely informative due to the volume of data being presented. Being able to present the sequence similarity and metadata visually allows a researcher to quickly identify interesting portions of the data for further examination. This is one of the first tools to make extensive use of the metadata collected during a metagenomic sequencing project. The use of sample and recruitment metadata is just the beginning. It is not difficult to imagine displaying other variations such as water temperature, salinity, phosphate abundance, and time of year with this approach. Even sample-independent metadata such as phylogenetic information may produce informative views of the data. The usefulness of this and related approaches will only grow as the robust collection of metadata becomes routine and the variables that are most relevant to microbial communities are further elucidated.

The greatest limitation of fragment recruitment is the lack of appropriate reference sequences, particularly finished genomes. Using a series of modifications to the Celera Assembler referred to as “extreme assembly,” we have produced large assemblies for cultivated and uncultivated marine microbes. On its own, the extreme assembly approach would be excessively prone to producing chimeric sequences. However, when extreme assemblies are used as references for fragment recruitment, the metadata provides additional criteria to validate the sampling consistency along the length of the scaffold. Chimeric joins can be rapidly detected and avoided. This argues that future metagenomic assemblers could be specifically designed to make use of the metadata to produce more accurate assemblies, and that metagenomic assemblies will be improved by using data from multiple sources. Finding ways to represent the full diversity in these assemblies remains a pressing issue.

Extreme assembly can produce much larger assemblies, but it is still limited by overall coverage. While many ribotypes are presumably present in sufficient quantities that reasonable assemblies of these genomes might be expected, this did not occur even for the most abundant organisms, including SAR11 and P. marinus. Many of the problems can be attributed to the diversity associated with the hypervariable segments, where the effective coverage drops precipitously. If these are indeed commonplace in the microbial world, it is unlikely that complete genomes will be produced using the small insert libraries presented here. However, the ability to bin the larger sequences based on their coverage profiles across multiple samples, oligonucleotide frequency profiles,
and phylogenetic markers suggests that large portions of a microbial genome can be reconstructed from the environmental data. This in turn should provide critical insights into the physiology and biochemistry of these microbial lineages that will inform culture techniques to allow cultivation of these recalcitrant organisms under laboratory conditions.

Not every technique described herein relies on metadata. The marker-less, overlap-based metagenomic comparison provides a quantitative approach to comparing the overall genetic similarity of two samples (Figures 10 and 11). In essence, genomic similarity acts as a proxy for community similarity. Marker-based approaches such as ARISA, including the use of 16S sequences described herein, can also be used to infer community similarity, though these approaches more aptly generate a census of the community members [51,69,78,79]. This census is biased to the extent that 16S genes can vary in copy number and relies on linkage of the marker gene to infer genome composition. While our metagenome comparison does not directly provide a census, the sensitivity can be tuned by restricting the identity of matches. This means that even subtype-level differences can be detected across samples. It would also identify the substantial gene content differences between the K12 and O157:H7 E. coli strains [12]. Such large-scale gene content differences have yet to be seen between closely related marine microbes, but may be a factor in other environments.

Although the requisite amount of data will vary with the complexity of the environment or the degree of resolution required, we have found that 10,000 sequencing reads are sufficient to reliably measure the similarity of two surface water samples (unpublished data). This analysis may become a general tool for allocating sequencing resources by allowing a shallow survey of many samples followed by deep sequencing of a select number of “interesting” ones.

The application of this technique for comparing samples, along with detailed analysis of fragments recruiting to a given reference sequence, can also help explicate differences among communities in gene content or sequence variation. For example, recent metagenomic studies have reported differences in abundance of various gene families or differing functional roles between samples. Some of these differences correspond to plausible differences in physiology and biochemistry, such as the relative overabundance of photosynthetic or light-responsive genes in surface water samples [20,32]. Other differences, however, are less obvious, such as the abundance of ribosomal proteins at 130 m or the abundance of transposase at 4,000 m [20]; some of these may reflect “taxonomic hitchhiking,” such that a sample rich in Archaea or Firmicutes or Cyanobacteria, etc., has an overrepresentation of genes more reflective of their recent evolutionary history than of a response to environmental conditions. Being able to control or account for these taxonomic effects is crucial to understanding how microbial populations have adapted to environmental conditions and how they may behave under changing conditions. The metagenomic comparison method described here provides a new tool to more accurately measure the impact of taxonomic effects.

In conclusion, this study reveals the wealth of biological information that is contained within large multi-sample environmental datasets. We have begun to quantify the amount and structure of the variation in natural microbial populations, while providing some information about how these factors are structured along phylogenetic and environmental factors. At the same time, many questions remain unanswered. For example, although microbial populations are structured and therefore genetically isolated, we do not understand the mechanisms that lead to this isolation. Their isolation seems contradictory given overwhelming evidence that horizontal gene transfer associated with hypervariable islands is a common phenomenon in marine microbial populations. Whatever the mechanism, the role and rate at which gene exchange occurs between populations will be crucial to understanding population structure within microbial communities and whether these communities are chance associations or necessary collections. The hypervariable islands could be a source for tremendous genetic innovation and novelty as evidenced by the rate of discovery of novel protein families in the GOS dataset [18]. However, it is not clear whether these entities are the main source of this novelty or whether this novelty resides in the vast numbers of rare microbes [4] that cannot be practically accessed using current metagenomic approaches. Altogether, this research reaffirms our growing wealth and complexity of data and paucity of understanding regarding the biological systems of the oceans.

Materials and Methods

Sampling sites. A more detailed description of the sampling sites provides additional context in which to understand the individual samples. The northernmost site (GS05) was at Compass Buoy in the highly eutrophic Bedford Basin, a marine embayment encircled by Halifax, Nova Scotia, that has a 15-y weekly record of biological, physical, and chemical monitoring (http://www.mar.dfo-mpo.gc.ca/science/ocean/BedfordBasin/index.htm). Other temperate sites included a coastal station sample near Nova Scotia (GS4), a station in the Bay of Fundy estuary at outgoing tide (GS06), and three Gulf of Maine stations (GS02, GS03, and GS07). These were followed by sampling coastal stations from the New England shelf region of the Middle Atlantic Bight (Newport Harbor through Delaware Bay; GS08–GS11). The Delaware Bay (GS11) was one of several estuary samples along the Global Expedition path. Estuaries are complex hydrodynamic environments that exhibit strong gradients in oxygen, nutrients, organic matter, and salinity and are heavily impacted by anthropogenic nutrients. The Chesapeake Bay (GS12) is the largest estuary in the United States and has microbial assemblages that are diverse mixtures of freshwater and marine-specific organisms [80]. GS13 was collected near Cape Hatteras, North Carolina, inside and north of the Gulf Stream, and GS14 was taken along the western boundary frontal waters of the Gulf Stream off the coast of Charleston, South Carolina. The vessel stopped at five additional stations as it transited through the Caribbean Sea (GS15–GS19) to the Panama Canal. In Panama, we sampled the freshwater Lake Gatun, which drains into the Panama Canal (GS20). The first of the eastern Pacific coastal stations (GS21, GS22, and GS23) were sampled on the way to Cocos Island (~500 km southwest of Costa Rica), followed by a coastal Cocos Island sample (GS25). Near the island, ocean currents diverge and nutrient-rich upwellings mix with warm surface waters to support a highly productive ecosystem. Cocos Island is distinctive in the eastern Pacific because it belongs to one of the first shallow undersea ridges in the region encountered by the easterly flowing North Equatorial Counter/Cross Current in the Far Eastern Pacific [81,82]. After departing Cocos Island, the vessel continued southwest to the Galapagos Islands, stopping for an open ocean station (GS26). An intensive sampling program was then conducted in the Galapagos. The Galapagos Archipelago straddles the equator 960 km west of mainland Ecuador in the eastern Pacific. These islands are in a hydrographically complex region due to their proximity to the Equatorial Front and other major oceanic currents and regional front systems [83]. The coastal and marine parts of the Galapagos Islands ecosystem harbor an array of distinctive habitats, processes, and endemic species. Several distinct zones were targeted including a shallow-water, warm seep (GS30), below the thermocline in an
upwelling zone (GS31), a coastal mangrove (GS32), and a hypersaline lagoon (GS33). The last stations were collected from open ocean sites (GS37 and GS47) and a coral reef atoll lagoon (GS51) in the immense South Pacific Gyre. The open ocean samples come from a region of lower nutrient concentrations where picoplankton are thought to represent the single most abundant and important factor for biogeochemical structuring and nutrient cycling [84–87]. In the atoll systems, ambient nutrients are higher, and bacteria are thought to constitute a large biomass that is one to three times as large as that of the phytoplankton [88–90].

Sample collection. A YSI (model 6600) multiparameter instrument (http://www.ysi.com) was deployed to determine physical characteristics of the water column, including salinity, temperature, pH, dissolved oxygen, and depth. Using sterilized equipment [91], 40–200 l of seawater, depending on the turbidity of the water, was pumped through a 20-µm nytex prefilter into a 250-l carboy. From this sample, two 20-ml subsamples were collected in acid-washed polyethylene bottles and frozen (−20 °C) for nutrient and particle analysis. At each station the biological material was size fractionated into individual “samples” by serial filtration through 20-µm, 3-µm, 0.8-µm, and 0.1-µm filters that were then sealed and stored at −20 °C until transport back to the laboratory. Between 44,160 and 418,176 clones per station were picked and end sequenced from short-insert (1.0–2.2 kb) sequencing libraries made from DNA extracted from filters [19]. Data from these six Sorcerer II expedition legs (37 stations) were combined with the results from samples in the Sargasso Sea pilot study (four stations; GS00a–GS00d and GS01a–GS01c; [19]). The majority of the sequence data presented came from the 0.8- to 0.1-µm size fraction sample that concentrated mostly bacterial and archaeal microbial populations. Two samples (GS01a, GS01b) from the Sargasso Sea pilot study dataset and one GOS sample (GS25) came from other filter size fractions (Table 1).

Filtration and storage. Microbes were size fractionated by serial filtration through 3.0-µm, 0.8-µm, and 0.1-µm membrane filters (Supor membrane disc filter; Pall Life Sciences, http://www.pall.com), and finally through a Pellicon tangential flow filtration (Millipore, http://www.millipore.com) fitted with a Biomax-50 (polyethersulfone) cassette filter (50 kDa pore size) to concentrate a viral fraction to 100 ml. Filters were vacuum sealed with 5 ml sucrose lysis buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl [pH 8.0]) and frozen at −20 °C on the vessel until shipment back to the Venter Institute, where they were transferred to a −80 °C freezer until DNA extraction. Glycerol was added (10% final concentration) as a cryoprotectant for the viral/phage sample.

DNA isolation. In the laboratory, the impact filters were aseptically cut into quarters for DNA extraction. Unused quarters of the filter were refrozen at −80 °C for storage. Quarters used for extraction were aseptically cut into small pieces and placed in individual 50-ml conical tubes. TE buffer (pH 8) containing 50 mM EGTA and 50 mM EDTA was added until filter pieces were barely covered. Lysozyme was added to a final concentration of 2.5 mg/ml, and the tubes were incubated at 37 °C for 1 h in a shaking water bath. Proteinase K was added to a final concentration of 200 µg/ml, and the samples were frozen in dry ice/ethanol followed by thawing at 55 °C. This freeze–thaw cycle was repeated once. SDS (final concentration of 1%) and an additional 200 µg/ml of proteinase K were added to the sample, and samples were incubated at 55 °C for 2 h with gentle agitation followed by three aqueous phenol extractions and one phenol/chloroform extraction. The supernatant was then precipitated with two volumes of 100% ethanol, and the DNA pellet was washed with 70% ethanol. Finally, the DNA was treated with CTAB to remove enzyme inhibitors. Size fraction samples not utilized in this study were archived for future analysis.

Library construction. DNA was randomly sheared via nebulization, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected using gel electrophoresis on 1% low-melting-point agarose. After ligation to BstXI adapters,

frequency of single inserts. Clones were sequenced from both ends to produce pairs of linked sequences representing ~820 bp at the end of each insert.

Template preparation. Libraries were transformed, and cells were plated onto large format (16 × 16 cm) diffusion plates prepared by layering 150 ml of fresh molten, antibiotic-free agar onto a previously set 50-ml layer of agar containing antibiotic. Colonies were picked for template preparation using the Qbot or QPix colony-picking robots (Genetix, http://www.genetix.com), inoculated into 384-well blocks containing liquid media, and incubated overnight with shaking. High-purity plasmid DNA was prepared using the DNA purification robotic workstation custom-built by Thermo CRS (http://www.thermo.com) and based on the alkaline lysis miniprep [92]. Bacterial cells were lysed, cell debris was removed by centrifugation, and plasmid DNA was recovered from the cleared lysate by isopropanol precipitation. DNA precipitate was washed with 70% ethanol, dried, and resuspended in 10 mM Tris HCl buffer containing a trace of blue dextran. The typical yield of plasmid DNA from this method is approximately 600–800 ng per clone, providing sufficient DNA for at least four sequencing reactions per template.

Automated cycle sequencing. Sequencing protocols were based on the di-deoxy sequencing method [93]. Two 384-well cycle-sequencing reaction plates were prepared from each plate of plasmid template DNA for opposite-end, paired-sequence reads. Sequencing reactions were completed using the Big Dye Terminator chemistry and standard M13 forward and reverse primers. Reaction mixtures, thermal cycling profiles, and electrophoresis conditions were optimized to reduce the volume of the Big Dye Terminator mix (Applied Biosystems, http://www.appliedbiosystems.com) and to extend read lengths on the AB3730xl sequencers (Applied Biosystems). Sequencing reactions were set up by the Biomek FX (Beckman Coulter, http://www.beckmancoulter.com) pipetting workstations. Robots were used to aliquot and combine templates with reaction mixes consisting of deoxy- and fluorescently labeled dideoxynucleotides, DNA polymerase, sequencing primers, and reaction buffer in a 5 µl volume. Bar-coding and tracking promoted error-free template and reaction mix transfer. After 30–40 consecutive cycles of amplification, reaction products were precipitated by isopropanol, dried at room temperature, resuspended in water, and transferred to one of the AB3730xl DNA analyzers. Set-up times were less than 1 h, and 12 runs per day were completed with an average trimmed sequence read length of 822 bp.

Fosmid end sequencing. Fosmid libraries [24] were constructed using approximately 1 µg DNA that was sheared using bead beating to generate cuts in the DNA. The staggered ends or nicks were repaired by filling with dNTPs. A size selection process followed on a pulse field electrophoresis system with lambda ladder to select for 39–40 kb fragments. The DNA was then recovered from a gel, ligated to the blunt-ended pCC1FOS vector, packaged into lambda packaging extracts, incubated with the host cells, and plated to select for the clones containing an insert. Sequencing was performed as described for plasmid ends.

Metagenomic assembly. Assembly was conducted with the Celera Assembler [21], with modifications as follows. The “genome length” was artificially set at the length of the dataset divided by 50 to allow unitigs of abundant organisms to be treated as unique, as previously described [19]. Several distinct assemblies were computed. In the primary assembly, all pairs of mated reads were tested to see whether the paired reads overlapped one another; if so, they were merged into a single pseudo-read that replaced the two original reads; further, only overlaps of 98% identity or higher were used to construct unitigs. A second assembly was conducted in the same fashion with the exception of using a 94% identity cutoff to construct unitigs. Finally, a series of assemblies at various stringencies were computed for subsets of the GOS data; in these assemblies, overlapping mates were not preassembled and the Celera Assembler code was modified slightly to allow for overlapping and multiple sequence alignment at lower stringency.
DNA was purified by three rounds of gel electrophoresis to remove Construction of a low-identity overlap database. An all-against-all
excess adapters, and the fragments were inserted into BstXI- comparison of unassembled (but merged and duplicate-stripped)
linearized medium-copy pBR322 plasmid vectors. The resulting sequences from the combined dataset was performed using a
library was electroporated into E. coli. To ensure construction of modified version of the overlapper component of the Celera
high-quality random plasmid libraries with few to no clones with no Assembler [21]. The code was modified to find overlap alignments
inserts, and no clones with chimeric inserts, we used a series of (global alignments allowing free end gaps) starting from pairs of reads
vectors (pHOS) containing BstXI cloning sites that include several that share an identical substring of at least 14 bp. An alignment
features: (1) the sequencing primer sites immediately flank the BstXI extension was then performed with match/mismatch scores set to
cloning site to avoid excessive resequencing of vector DNA; (2) yield a positive outcome if an overlap alignment was found with
elimination of strong promoters oriented toward the cloning site; 65% identity. Overlaps involving alignments of 40 bp were
and (3) the use of BstXI sites for cloning facilitates the preparation of retained for various analyses. For the GOS dataset described here,
libraries with a low incidence of no-insert clones and a high this process resulted in a dataset of 1.2 billion overlaps. Due to the 14-
bp requirement and certain heuristics for early termination of apparently hopeless extensions, not all alignments at ≥65% were found. In addition, some of the lowest-identity overlaps are bound to be chance matches; however, this was a relatively uncommon event. Approximately one in 5 × 10⁶ pairs of 800-bp random sequences (all sites independent, A = C = G = T = 25%) can be aligned to overlap ≥40 bp at ≥65% identity using the same procedure. At a 70% cutoff, the value is reduced to one in 4 × 10⁷, and one in 5 × 10⁸ at a 75% cutoff.
Extreme assembly. Like many assembly algorithms, the extreme assembler proceeds in three phases: overlap, layout, and consensus. The overlap phase is provided by the all-against-all comparison described above. The consensus phase is performed by a version of the Celera Assembler, modified to accept higher rates of mismatch. The layout phase begins with a single sequencing read (“seed”) that is chosen at random or specified by the user and is considered the “current” read. The following steps are performed off one or both ends of the seed. (1) Starting from the current fragment end, add the fragment with the best overlap off that end and mark the current fragment as “used,” thus making the added fragment the new current fragment. (2) Mark as used any alternative overlap that would have resulted in a shorter extension. The simplest notion of “best overlap” is simply the one having the highest identity alignment, but more complicated criteria have certain advantages. A simple but useful refinement is to favor fragments whose other ends have overlaps over those which are dead ends. For an unsupervised extreme assembly, when the sequence extension terminates because there are no more overlaps, a new unused fragment is chosen as the next seed and the process is repeated until all fragments have been marked used.
Construction of multiple SAR11 variants. Sequencing reads mated to SAR11-like 16S sequences but themselves outside of the ribosomal operon (n = 348) were used as seeds in independent extreme assemblies. Since the assemblies were independent, the results were highly redundant, with a given chain of overlapping fragments typically being used in multiple assemblies. A subset of 24 assemblies that shared no fragments over their first 20 kb was identified as follows. (1) Connected components were determined in a graph defined by nodes corresponding to extreme assemblies. If the assemblies shared at least one fragment in the first 20 kb of each assembly, the two nodes were connected by an edge. (2) A single assembly was chosen at random from each of the connected components. The consensus sequence over the 20-kb segment of each such representative was used as the reference for fragment recruitment.
Phylogeny. Phylogenies of sequences homologous to a given portion of a reference sequence (typically 500 bp) were determined in the following manner. A set of homologous fragments was identified based on fragment recruitment to the reference as described above. Fragments that fully spanned the segment of interest and had almost full-length alignments to the reference sequence of a user-defined percent identity (typically, 70%) were used for further analysis. A preliminary master–slave multiple sequence alignment of the recruited reads (slaves) to the reference segment (master) was performed with a modified version of the consensus module of the Celera Assembler. Based on this alignment, reads were trimmed to the portion aligning to the reference segment of interest. A refined multiple sequence alignment was then computed with MUSCLE [94]. Distance-based phylogenies were computed using the programs DNADIST and NEIGHBOR from the PHYLIP package [95] using default settings. Trees were visualized using HYPERTREE [96].
Measurement of library-to-library similarity. Based on the low-identity overlap database described above, the similarity of a library i to another library j at a given percent identity cutoff was computed as follows. For each sequence s of i, let $n_{s,i}$ = the number of overlaps to other fragments of i satisfying the cutoff; $n_{s,j}$ = the number of overlaps to fragments of j satisfying the cutoff; and $f_{s,i} = n_{s,i}/(n_{s,i} + n_{s,j})$ = fraction of reads overlapping s from i or j that are from i.

$$ r_{i,i} = \sum_{s} f_{s,i} \qquad (1) $$

$$ r_{i,j} = \sum_{s} \left(1 - f_{s,i}\right) \qquad (2) $$

$$ s_{i,j} = \frac{0.5\,(r_{i,j} + r_{j,i})}{\sqrt{r_{i,i}\, r_{j,j}}} \qquad (3) $$

$$ S_{i,j} = S_{j,i} = \frac{2\, s_{i,j}}{1 + s_{i,j}} \qquad (4) $$

A read that can be overlapped to another at sufficiently high sequence identity was taken to indicate that they were from similar organisms, and, relatedly, that similar genes were present in the samples. Only reads with such overlaps contributed to the calculation. Other reads reflect genes or segments of genomes that were so lightly sampled (i.e., at such low abundance) that they were not informative regarding the similarity of two samples. Consequently, the analysis automatically corrects for differences in the amount of sequencing, and can be computed over sets of samples that vary considerably in diversity. The resulting measure of similarity $S_{i,j}$ takes on a value between 0 and 1, where 0 implies no overlaps between i and j, and 1 implies that a fragment from i and a fragment from j are as likely to overlap one another as are two fragments from i or two fragments from j. As with the Bray-Curtis coefficient [97], abundance of categories affects the computation. In an idealized situation where two libraries can each be divided into some number k of “species” at equal abundance, and the libraries have l of the species in common, the similarity statistic will approach l/k for large samplings; in this sense, $S_{i,j} = x$ indicates that the two samples share approximately a fraction x of their genetic material. It is frequently useful to define $D_{i,j} = 1 - S_{i,j}$, the “dissimilarity” or distance between two samples.
Ribotype clustering and identification of representatives. An all-against-all comparison of predicted 16S sequences was performed to determine the alignment between pairs of overlapping sequences using a version of an extremely fast bit-vector algorithm [98]. A hierarchical clustering was determined using percent-mismatch in the resulting alignments as the distance between pairs of sequences. Order of clustering and cluster identity scores were based on the average-linkage criterion, with distances between nonoverlapping partial sequences treated as missing data. Ribotypes were the maximal clusters with an identity score above the cutoff (typically 97%). Representative sequences were chosen for each cluster based on both length and highest average identity to other sequences in the cluster.
Taxonomic classification. Taxonomic classification of 16S sequences was conducted using phylogenetic techniques based on clade membership of similar sequences with 16S sequences with defined taxonomic membership. Representative sequences from clustered sequences were analyzed as described previously [19,99] and by addition into an ARB database of small subunit rDNAs [100,101]. Results were spot-checked against the Ribosomal Database Project II Classifier server [102] and the taxonomic labels of the best BLASTN hits against the nonredundant database at NCBI.
Fragment recruitment. Global ocean sequences were aligned to genomic sequences of different bacteria and phage using NCBI BLASTN [26]. The following blast parameters were designed to identify alignments as low as 55% identity that could contain large gaps: -F "m L" -U T -p blastn -e 1e-4 -r 8 -q -9 -z 3000000000 -X 150. Reads were filtered in several steps to identify the reads that were aligned over more or less their entire length. Reads had to be aligned for more than 300 bp at >30% identity with less than 25 bp of unaligned bases on either end, or reads had to be aligned over more than 100 bp at >30% identity with less than 20 bp of overhang off either end. Identity was calculated ignoring gaps. In some instances a read might be placed, but the mate would not be placed under these criteria. In such cases, if 80% or more of the mate were successfully aligned, then the mate would be rescued and considered successfully aligned.
Generation of shredded artificial reads from finished genomes. Random pieces of DNA from the genome in question with a length between 1,800 and 2,500 bp were selected. For each piece a read length N1 was selected from the distribution of lengths using the GOS dataset. If that GOS sequence had a mate pair, then a second length N2 was again randomly selected. The length N1 was used to generate a read from the 5′ end of the DNA. The piece of DNA was then reverse complemented and, if appropriate, a second length N2 was used to generate a second read. The relationship between these two reads was then recorded and used to produce a fasta file. This approach successfully mimics the types of reads found in the GOS data with similar rates of missing mates.
Abundance of proteorhodopsin variants. A total of 2,644 proteorhodopsin genes were identified from the clustered open reading frames derived from the GOS assembly [18]. These genes could be linked back to 3,608 GOS clones. Open reading frames were predicted from these clones as described in [18]. The peptide sequences were aligned with NCBI blastpgp with the following parameters: -j 5 -U T -e 10 -W 2 -v 5 -b 5000 -F "m L" -m 3. The search was performed with a previously described blue-absorbing proteorhodopsin protein BPR (gi|32699602) as the query. The amino acid associated with light absorption is found within a short
conserved motif RYVDWLLTVPL*IVEF, where the asterisk indicates the tuning amino acid [39–41]. In total, 1,938 clones were found to contain this motif. Clones and the sample metadata were then associated with the tuning amino acid to determine the relative abundance of the different amino acids at these positions. Clones could be associated with SAR11 if both mated sequencing reads (when available) were recruited to P. ubique HTCC1062.
Site abundance estimates and comparisons. Given a set of genes identified on the GOS sequences, we can identify the scaffolds on which these genes were annotated. A vector indicating the number of sequences contributed by every sample is determined for every gene. This vector reflects the number of sequences from every sample that assembled into the scaffold on which the gene was identified after normalizing for the proportion of scaffold covered by the gene. For example, if a 10-kb scaffold contains a 1-kb gene, then each sample will contribute one GOS sequence for every ten GOS sequences it contributed to the entire scaffold. The vectors are then summed and normalized to account for either the total number of GOS sequences obtained from each sample or based on the number of typically single-copy recA genes (identified as in [18]). Unless stated otherwise, recA was used to normalize abundance across samples. When comparisons using groups of samples were performed, the average value for the samples was compared.
Oligonucleotide composition profile. A 1-D profile representing oligonucleotide frequencies was computed as follows. A sequence was converted into a series of overlapping 10,000-bp segments, each segment offset by 1,000 bp from the previous one, using Perl and shell scripting. Dinucleotide frequencies were computed on each segment using a C program written for this purpose. Higher-order oligonucleotides were examined and gave similar results for the genomes of interest. Remaining calculations were performed using the R package [103]. Principal component analysis (function princomp with default settings) was applied to the matrix of frequencies per window position. The value of the first component for each position was normalized by the standard deviation of these values, and truncated to the range [−5, 5]. For visualizations, the resulting values were plotted at the center of each window.
Estimating frequency of large-scale translocations and inversions. The unrecruited mated sequencing reads of reads recruited to P. marinus MIT9312 at or above 80% identity were examined. An unrecruited mate indicated a potential translocation or inversion if it aligned to the MIT9312 genome in two and only two distinct alignments separated by at least 50 kb, if each aligned portion was at least 250 bp long, if there was less than 100 bp of unaligned sequence and no more than 100 bp of overlapping sequence between the two aligned portions in read coordinates, and if each aligned portion was anchored to one end of the sequencing read with less than 25 bp of unaligned sequence from each end. In total, 18 rearrangements were identified, six of which appear to be unique events.
The rate of discovery was estimated by determining the number of rearrangements in a given volume of sequence. We estimated the volume of sequence that was potentially examined by identifying recruited mated sequencing reads that fit the “good” category (i.e., which were recruited in the correct orientation at the expected distance from each other). For a given read, if the mate was recruited at greater than or equal to 80% identity, then the expected amount of sequence examined should be the current (as opposed to mate) read length minus 500 bp. This produces an estimate of the search space to be ~47 Mbp. Given 18 rearrangements, this leads to an estimate of one rearrangement per 2.6 Mbp.
Quantification and assessment of sequences associated with gaps. GOS reads assigned to the “missing mate” category that were recruited at greater than 80% identity outside the gap in question were identified. The mates of these reads were then identified and clustering was attempted with Phrap (http://www.phrap.org). Reads that were incorporated end to end into the Phrap assemblies were identified. For most small gaps a single assembly included all the missing mate reads and identified the precise difference between the reference and the environmental sequences. For the hypervariable segments, most of the reads failed to assemble at all, and those that did assemble showed greater sequence divergence than typically seen. In the case of SAR11-recruited reads, to increase the number of reads associated with the hypervariable gaps we identified reads that did not recruit to the P. ubique HTCC1062 but aligned in a single HSP (high-scoring pair) over at least 500 bp with one end unaligned because it extended into the hypervariable gap.
Data and tool release. To facilitate continued analysis of this and other metagenomic datasets, the tools presented here along with their source code will be available via the Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) website (http://camera.calit2.net). The dataset and associated metadata will be accessible via CAMERA (using the dataset tag CAM_PUB_Rusch07a). Given the exceptional abundance of Burkholderia and Shewanella sequences in the first Sargasso Sea sample and the feeling that these may be contaminants, we are also providing a list of the scaffold IDs and sequencing read IDs associated with these organisms to facilitate analyses with or without the sequences. In addition to CAMERA, the GOS scaffolds and annotations will be available via the public sequence repositories such as NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=13694), and the reads will be available via the Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?).

Supporting Information
Poster S1. Fragment Recruitment of GOS Data to Finished Microbial Genomes
Found at doi:10.1371/journal.pbio.0050077.sd001 (21 MB PDF).

Accession Numbers
The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession number for proteorhodopsin protein BPR is gi|32699602.

Acknowledgments
We acknowledge the Department of Energy (DOE), Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation for funding to undertake this study. We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the governments of Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, French Polynesia, and France for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. Canada’s Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford Basin. The Universidad Nacional Autónoma de México (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Representatives from Costa Rica’s Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortés), and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements, and scientific analysis. Our visit to the Galapagos Islands was facilitated by assistance from the Galapagos National Park Service Director, Washington Tapia, and the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat. We especially thank Greg Estes (guide), Héctor Cháuz Campo (Institute of Oceanography of the Ecuador Navy), and a National Park Representative, Simon Ricardo Villemar Tigrero, for field assistance while in the Galapagos Islands. Martin Wilkalski (Princeton University) and Rod Mackie (University of Illinois) provided planning advice for the Galapagos sampling plan. We thank Matthew Charette (Woods Hole Oceanographic Institute) for nutrient data analysis. We also acknowledge the help of Michael Ferrari and Jennifer Clark for remote sensing data. The U.S. Department of State facilitated Governmental communications on multiple occasions. John Glass (J. Craig Venter Institute [JCVI]) provided valuable assistance in methods development. The dedicated efforts of the quality systems, library construction, template, and sequencing teams at the JCVI Joint Technology Center produced the high-quality sequence data that was the basis of this paper. We thank Matthew LaPointe, Creative Director of JCVI, for assistance with figure design, and the JCVI information technology support team who facilitated many of the vessel-related technical needs. Special thanks are due to Charles H. Howard, captain of the Sorcerer II, and fellow crew members Cyrus Foote and Brooke A. Dill for their time and effort in support of this research. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical-related issues for the crew of the Sorcerer II.
Author contributions. DBR, ALH, KB, HS, CAP, JFH, MF, and JCV conceived and designed the experiments. DBR, ALH, JMH, KB, BT, HBT, CS, JT, JF, CAP, and JCV performed the experiments. DBR, ALH, GS, KBH, SW, DW, JAE, KR, JEV, TU, YHR, MRF, KN, and RF analyzed the data. DBR, ALH, GS, KBH, SY, JMH, KR, KB, BT, HS, HBT, CS, JT, JF, CAP, KL, SK, JFH, TU, YHR, LIF, VS, GBR, LEE, DMK, SS, TP, EB, VG, GTC, MRF, RLS, MF, and JCV contributed reagents/materials/analysis tools. DBR, ALH, GS, KBH, SW, SY, JAE, RLS, KN, RF, MF, and JCV wrote the paper.
Funding. Funding for this study was received from the US Department of Energy, Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation.
Competing interests. The authors have declared that no competing interests exist.
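Illustrative sketch: library-to-library similarity. The similarity measure defined by Equations 1–4 in the Methods can be illustrated with a short Python sketch. This is not the authors' code; it assumes the overlaps have already been filtered at the chosen percent identity cutoff and are supplied as pairs of read identifiers. The function name library_similarity, its argument names, and the choice to raise an error when a library has no usable within-library overlaps are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def library_similarity(overlaps, library_of, lib_i, lib_j):
    """Return S_{i,j} (Equations 1-4) for two distinct libraries.

    overlaps   -- iterable of (read_a, read_b) pairs that satisfy the cutoff
    library_of -- dict mapping each read identifier to its library label
    """
    if lib_i == lib_j:
        raise ValueError("expected two distinct libraries")

    n_own = defaultdict(int)    # n_{s,i}: overlaps to reads of the same library
    n_other = defaultdict(int)  # n_{s,j}: overlaps to reads of the other library
    for a, b in overlaps:
        if a == b:
            continue  # ignore self-overlaps
        lib_a, lib_b = library_of[a], library_of[b]
        if {lib_a, lib_b} - {lib_i, lib_j}:
            continue  # overlap involves a library outside the pair
        for read, read_lib, partner_lib in ((a, lib_a, lib_b), (b, lib_b, lib_a)):
            if read_lib == partner_lib:
                n_own[read] += 1
            else:
                n_other[read] += 1

    r_same = {lib_i: 0.0, lib_j: 0.0}   # r_{i,i} and r_{j,j}, Equation 1
    r_cross = {lib_i: 0.0, lib_j: 0.0}  # r_{i,j} and r_{j,i}, Equation 2
    # Only reads with at least one overlap contribute, as stated in the text.
    for read in set(n_own) | set(n_other):
        f = n_own[read] / (n_own[read] + n_other[read])
        r_same[library_of[read]] += f
        r_cross[library_of[read]] += 1.0 - f

    denom = math.sqrt(r_same[lib_i] * r_same[lib_j])
    if denom == 0.0:
        # Assumption: treat the measure as undefined in this case.
        raise ValueError("a library has no within-library overlaps")
    s = 0.5 * (r_cross[lib_i] + r_cross[lib_j]) / denom  # Equation 3
    return 2.0 * s / (1.0 + s)                           # Equation 4


if __name__ == "__main__":
    libs = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
    pairs = [("a1", "a2"), ("b1", "b2"), ("a1", "b1")]
    print(library_similarity(pairs, libs, "A", "B"))  # about 0.5
```

In the toy input, each library has one within-library overlap and the two libraries share a single cross-library overlap, so r_{A,A} = r_{B,B} = 1.5 and r_{A,B} = r_{B,A} = 0.5, giving s = 1/3 and S of about 0.5. The distance D = 1 − S follows directly from the returned value.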
References
1. Whitman WB, Coleman DC, Wiebe WJ (1998) Prokaryotes: The unseen majority. Proc Natl Acad Sci U S A 95: 6578–6583.
2. Beja O, Koonin EV, Aravind L, Taylor LT, Seitz H, et al. (2002) Comparative genomic analysis of archaeal genotypic variants in a single population and in two different oceanic provinces. Appl Environ Microbiol 68: 335–345.
3. DeLong EF, Pace NR (2001) Environmental diversity of bacteria and archaea. Systematic Biol 50: 1–9.
4. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc Natl Acad Sci U S A 103: 12115–12120.
5. Garrity GM (2001) Bergey’s manual of systematic bacteriology. New York: Springer-Verlag.
6. Madigan M, Martinko JM, Parker J (2000) Brock biology of microorganisms. Upper Saddle River (NJ): Prentice Hall. 991 p.
7. Fuhrman JA, McCallum K, Davis AA (1992) Novel major archaebacterial group from marine plankton. Nature (London) 356: 148–149.
8. Giovannoni SJ, Britschgi TB, Moyer CL, Field KG (1990) Genetic diversity in Sargasso Sea bacterioplankton. Nature 345: 60–63.
9. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
10. Rappe MS, Giovannoni SJ (2003) The uncultured microbial majority. Ann Rev Microbiol 57: 369–394.
11. Stackebrandt E, Goebel BM (1994) Taxonomic note: A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44: 846–849.
12. Welch RA, Burland V, Plunkett G 3rd, Redford P, Roesch P, et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99: 17020–17024.
13. Linklater E (1972) The voyage of the Challenger. Garden City (NJ): Doubleday. 280 p.
14. Mosley HN (1879) Notes by a naturalist on the “Challenger,” being an account of various observations made during the voyage of H.M.S. “Challenger” round the world, in the years 1872–1876. London: Macmillian and Company. 540 p.
15. Thompson SCW, Murray SJ, Nares GS, Thompson FT (1895) Report on the scientific results of the voyage of H.M.S. Challenger during the years 1873–76 under the command of Captain George S. Nares, R.N., F.R.S. and the late Captain Frank Tourle Thomson, R.N. Prepared under the superintendence of the late Sir C. Wyville Thomson, 1885–1895: Edinburgh: printed for H.M. Stationery off. (by order of Her Majesty’s Government).
16. Fuhrman JA, McCallum K, Davis AA (1993) Phylogenetic diversity of subsurface marine microbial communities from the Atlantic and Pacific Oceans. Appl Environ Microbiol 59: 1294–1302.
17. Hewson I, Steele JA, Capone DG, Fuhrman JA (2006) Temporal and spatial scales of variation in bacterioplankton assemblages of oligotrophic surface waters. Mar Ecol Prog Ser 311: 67–77.
18. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
19. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
20. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
21. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.
22. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2: 231–239.
23. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129–149.
24. Kim UJ, Shizuya H, Dejong PJ, Birren B, Simon MI (1992) Stable propagation of cosmid sized human DNA inserts in an F-factor based vector. Nucleic Acids Res 20: 1083–1085.
25. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF (1996) Characterization of uncultivated prokaryotes: Isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol 178: 591–599.
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
27. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, et al. (2000) PipMaker: A Web server for aligning two genomic DNA sequences. Genome Res 10: 577–586.
28. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.
29. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
30. Hagstrom A, Pommier T, Rohwer F, Simu K, Stolte W, et al. (2002) Use of 16S ribosomal DNA for delineation of marine bacterioplankton species. Appl Environ Microbiol 68: 3628–3633.
31. Giovannoni S, Rappe M (2002) Evolution, diversity and molecular ecology of marine Prokaryotes. In: Kirchman DL, editor. Microbial ecology of the oceans. New York: Wiley-Liss. pp. 47–84.
32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
33. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.
34. Conkright M, Levitus S, Boyer T (1994) World ocean atlas 1994. Volume 1: Nutrients. Washington, D.C.: U.S. Department of Commerce.
35. Levitus S, Burgett R, Boyer T (1994) World ocean atlas 1994. Volume 3: Nutrients. Washington, D.C.: U.S. Department of Commerce.
36. Parekh P, Follows MJ, Boyle E (2004) Modeling the global ocean iron cycle. Global Biogeochem Cycles 18: GB1002.
37. Scanlan DJ, Wilson WH (1999) Application of molecular techniques to addressing the role of P as a key effector in marine ecosystems. Hydrobiologia 401: 149–175.
38. Moore L, Ostrowski M, Scanlan D, Feren K, Sweetsir T (2005) Ecotypic variation in phosphorus acquisition mechanisms within marine picocyanobacteria. Aquat Microb Ecol 39: 257–269.
39. Kelemen BR, Du M, Jensen RB (2003) Proteorhodopsin in living color: Diversity of spectral properties within living bacterial cells. Biochim Biophys Acta 1618: 25–32.
40. Man D, Wang W, Sabehi G, Aravind L, Post AF, et al. (2003) Diversification and spectral tuning in marine proteorhodopsins. EMBO J 22: 1725–1731.
41. Man-Aharonovich D, Sabehi G, Sineshchekov OA, Spudich EN, Spudich JL, et al. (2004) Characterization of RS29, a blue-green proteorhodopsin variant from the Red Sea. Photochem Photobiol Sci 3: 459–462.
42. Bielawski JP, Dunn KA, Sabehi G, Beja O (2004) Darwinian adaptation of proteorhodopsin to different light intensities in the marine environment. Proc Natl Acad Sci U S A 101: 14824–14829.
43. Johnsen S, Sosik H (2005) Shedding light on light in the ocean. Oceanus Mag 43: 24–28.
44. Braun C, Smirnov S (1993) Why is water blue. J Chem Educ 70: 612–615.
45. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
46. de la Torre JR, Christianson LM, Beja O, Suzuki MT, Karl DM, et al. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci U S A 100: 12830–12835.
47. Sabehi G, Beja O, Suzuki MT, Preston CM, DeLong EF (2004) Different SAR86 subgroups harbour divergent proteorhodopsins. Environ Microbiol 6: 903–910.
48. Frigaard NU, Martinez A, Mincer TJ, DeLong EF (2006) Proteorhodopsin lateral gene transfer between marine planktonic bacteria and archaea. Nature 439: 847–850.
49. Yokoyama S (2000) Phylogenetic analysis and experimental approaches to study color vision in vertebrates. Methods Enzymol 315: 312–325.
50. Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, et al. (2002) Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science 296: 2028–2033.
51. Brown MV, Fuhrman JA (2005) Marine bacterial microdiversity as revealed by internal transcribed spacer analysis. Aquat Microb Ecol 41: 15–23.
52. Rocap G, Distel DL, Waterbury JB, Chisholm SW (2002) Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol 68: 1180–1191.
53. Schleper C, DeLong EF, Preston CM, Feldman RA, Wu KY, et al. (1998) Genomic analysis reveals chromosomal variation in natural populations of the uncultured psychrophilic archaeon Cenarchaeum symbiosum. J Bacteriol 180: 5003–5009.
54. Rogers AR, Harpending H (1992) Population growth makes waves in the distribution of pairwise genetic differences. Mol Biol Evol 9: 552–569.
55. Rappe MS, Connon SA, Vergin KL, Giovannoni SJ (2002) Cultivation of
the ubiquitous SAR11 marine bacterioplankton clade. Nature 418: 630– 79. Garcia-Martinez J, Rodriguez-Valera F (2000) Microdiversity of uncultured
633. marine prokaryotes: The SAR11 cluster and the marine Archaea of Group
56. Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM, et al. (2006) I. Mol Ecol 9: 935–948.
Niche partitioning among Prochlorococcus ecotypes along ocean-scale 80. Jenkins BD, Steward GF, Short SM, Ward BB, Zehr JP (2004) Finger-
environmental gradients. Science 311: 1737–1740. printing Diazotroph communities in the Chesapeake Bay by using a DNA
57. Liu WT, Marsh TL, Cheng H, Forney LJ (1997) Characterization of macroarray. Appl Environ Microbiol 70: 1767–1776.
microbial diversity by determining terminal restriction fragment length 81. Legeckis R (1988) Upwelling off the Gulfs of Panama and Papagayo in the
polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 63: tropical Pacific during March 1985. J Geophys Res 93: 15485–15489.
4516–4522. 82. McCreary JP, Lee HS, Enfield DB (1989) The response of the coastal ocean
58. Fisher MM, Triplett EW (1999) Automated approach for ribosomal to strong offshore winds: With application to circulation in the gulfs of
intergenic spacer analysis of microbial diversity and its application to Tehuantepec and Papagayo. J Mar Res 47: 81–109.
freshwater bacterial communities. Appl Environ Microbiol 65: 4630–4636. 83. Palacios DM (2003) Oceanographic conditions around the Galápagos
59. Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S, et al. (2001) The Archipelago and their influence on Cetacean community structure. [PhD
photosynthetic apparatus of Prochlorococcus: Insights through comparative diss]. Corvallis: Oregon State University. 178 p.
genomics. Photosynth Res 70: 53–71. 84. Christian JR, Lewis MR, Karl DM (1997) Vertical fluxes of carbon,
60. Martiny AC, Coleman ML, Chisholm SW (2006) Phosphate acquisition nitrogen, and phosphorus in the North Pacific Subtropical Gyre near
genes in Prochlorococcus ecotypes: Evidence for genome-wide adaptation. Hawaii. J Geophys Res 102: 15667–15677.
Proc Natl Acad Sci U S A 103: 12552–12557. 85. Doney SC, Abbott MR, Cullen JJ, Karl DM, Rothstein L (2004) From genes
61. Moore LR, Chisholm SW (1999) Photophysiology of the marine cyano- to ecosystems: The ocean’s new frontier. Front Ecol Environ 2: 457–466.
bacterium Prochlorococcus: Ecotypic differences among cultured isolates. 86. McGillicuddy DJ, Anderson LA, Doney SC, Maltrud ME (2003) Eddy-driven
Limnol Oceanogr 44: 628–638. sources and sinks of nutrients in the upper ocean: Results from a 0.18
62. Moore LR, Rocap G, Chisholm SW (1998) Physiology and molecular resolution model of the North Atlantic. Global Biogeochem Cycles 17: 1035.
phylogeny of coexisting Prochlorococcus ecotypes. Nature 393: 464–467. 87. van der Staay SYM, van der Staay GWM, Guillou L, Vaulot D, Claustre H,
63. Rocap G, Larimer FW, Lamerdin J, Malfatti S, Chain P, et al. (2003) et al. (2000) Abundance and diversity of prymnesiophytes in the
Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche picoplankton community from the equatorial Pacific Ocean inferred
differentiation. Nature 424: 1042–1047. from 18S rDNA sequences. Limnol Oceanogr 45: 98–109.
64. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) 88. Blanchot J, Charpy L, Borgne RL (1989) Size composition of particulate
Three Prochlorococcus cyanophage genomes: Signature features and organic matter in the lagoon of Tikehau Atoll (Tuiamotu Archipelago).
ecological interpretations. PLoS Biol 3: e144. Mar Biol 102: 329–339.
65. Ting CS, Rocap G, King J, Chisholm SW (2002) Cyanobacterial photosyn- 89. Torréton JP, Dufour P (1996) Bacterioplankton production determined by
thesis in the oceans: The origins and significance of divergent light- DNA synthesis, protein synthesis and frequency of dividing cells in
harvesting strategies. Trends Microbiol 10: 134–142. Tuamotu atoll lagoons and surrounding ocean. Microb Ecol 32: 185–202.
66. Zinser ER, Coe A, Johnson ZI, Martiny AC, Fuller NJ, et al. (2006) 90. Torréton JP, Dufour P (1996) Temporal and spatial stability of
Prochlorococcus ecotype abundances in the North Atlantic Ocean as bacterioplankton biomass and productivity in an atoll lagoon. Aquat
revealed by an improved quantitative PCR method. Appl Environ Microb Ecol 11: 251–261.
Microbiol 72: 723–732. 91. Rutala WA, Weber DJ (1997) Uses of inorganic hypochlorite (bleach) in
67. Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, et al. (1987) health-care facilities. Clin Microbiol Rev 10: 597–610.
Report of the ad hoc committee on reconciliation of approaches to 92. Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning. A laboratory
bacterial systematics. Int J Syst Bacteriol 37: 463–464. manual. Cold Spring Harbor (NY): Cold Spring Laboratory Press.
68. Thompson JR, Pacocha S, Pharino C, Klepac-Ceraj V, Hunt DE, et al. 93. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-
(2005) Genotypic diversity within a natural coastal bacterioplankton terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
population. Science 307: 1311–1313. 94. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high
69. Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, et al. (2004) Fine- accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
scale phylogenetic architecture of a complex bacterial community. Nature 95. Felsenstein J (1989) PHYLIP: Phylogeny Inference Package (Version 3.2).
430: 551–554. Cladistics 5: 164–166.
70. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia 96. Bingham J, Sudarsanam S (2000) Visualizing large hierarchical clusters in
coli genome. Proc Natl Acad Sci U S A 95: 9413–9417. hyperbolic space. Bioinformatics 16: 660–661.
71. Scheffer M, Rinaldi S, Huisman J, Weissing FJ (2003) Why plankton 97. Bray JR, Curtis JT. (1957) An ordination of upland forest communities of
communities have no equilibrium: Solutions to the paradox. Hydro- southern Wisconsin. Ecol Monogr 27: 325–349.
biologia 491: 9–18. 98. Myers G (1999) A fast bit-vector algorithm for approximate string
72. Hutchinson GE (1961) The paradox of the plankton. Am Nat 95: 137–145. matching based on dynamic programming. J ACM 46: 395–415.
73. Cohan F (2002) Concepts of bacterial biodiversity for the age of genomics. 99. Penn K, Wu D, Eisen JA, Ward N (2006) Characterization of bacterial
In: Fraser CM, Read TD, Nelson KE, editors. Microbial genomes. Totowa communities associated with deep-sea corals on Gulf of Alaska Seamounts.
(New Jersey): Humana Press. pp. 175–194. Appl Environ Microbiol 72: 1680–1683.
74. Hacker J, Blum-Oehler G, Hochhut B, Dobrindt U (2003) The molecular 100. Ludwig W, Strunk O, Westram R, Richter L, Meier H, et al. (2004) ARB: A
basis of infectious diseases: Pathogenicity islands and other mobile genetic software environment for sequence data. Nucleic Acids Res 32: 1363–1371.
elements. A review. Acta Microbiol Immunol Hung 50: 321–330. 101. Hugenholtz P (2002) Exploring prokaryotic diversity in the genomic era.
75. McCann KS (2000) The diversity-stability debate. Nature 405: 228–233. Genome Biol 3: REVIEWS0003.
76. Fuller NJ, West NJ, Marie D, Yallop M, Rivlin T, et al. (2005) Dynamics of 102. Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al.
community structure and phosphate status of picocyanobacterial pop- (2005) PHACCS, an online tool for estimating the structure and diversity
ulations in the Gulf of Aqaba, Red Sea. Limnol Oceanogr 50: 363–375. of uncultured viral communities using metagenomic information. BMC
77. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Bioinformatics 6: 41.
Metagenomic analysis of the human distal gut microbiome. Science 312: 103. R Development Core Team (2004) R: A language and environment for
1355–1359. statistical computing [computer program]. Vienna, Austria: R Foundation
78. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA (2005) Coupling 16S- for Statistical Computing. http://www.R-project.org.
ITS rDNA clone libraries and automated ribosomal intergenic spacer 104. Gomez-Consarnau L, Gonzalez JM, Coll-Llado M, Gourdon P, Pascher T, et
analysis to show marine microbial diversity: Development and application al. (2007) Light stimulates growth of proteorhodopsin-containing marine
to a time series. Environ Microbiol 7: 1466–1479. Flavobacteria. Nature 445: 210–213.
Note Added in Proof

Recently, Gomez-Consarnau et al. provided credible evidence for the
biological role of proteorhodopsins [104]. These results indicate that
proteorhodopsins blur the line between heterotrophic and autotrophic
microbes by allowing a wide range of organisms to harness light energy for
respiration and growth. This reinforces the notion that the differential
distribution of proteorhodopsin variants identified here reflects functional
adaptation to the wavelengths of available light. Furthermore, these adaptations
may be driven by the makeup of the microbial community. Thus, these
distributional differences could reflect competition between microbes for light
resources.
PLoS BIOLOGY
The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families
Shibu Yooseph1*, Granger Sutton1, Douglas B. Rusch1, Aaron L. Halpern1, Shannon J. Williamson1, Karin Remington1,
Jonathan A. Eisen1,2, Karla B. Heidelberg1, Gerard Manning3, Weizhong Li4, Lukasz Jaroszewski4, Piotr Cieplak4,
Christopher S. Miller5, Huiying Li5, Susan T. Mashiyama6, Marcin P. Joachimiak6, Christopher van Belle6,
John-Marc Chandonia6,7, David A. Soergel6, Yufeng Zhai3, Kannan Natarajan8, Shaun Lee8, Benjamin J. Raphael9,
Vineet Bafna8, Robert Friedman1, Steven E. Brenner6, Adam Godzik4, David Eisenberg5, Jack E. Dixon8,
Susan S. Taylor8, Robert L. Strausberg1, Marvin Frazier1, J. Craig Venter1
1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 University of California, Davis, California, United States of America, 3 Razavi-Newman Center for
Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America, 4 Burnham Institute for Medical Research, La Jolla, California, United States of
America, 5 University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America, 6 University
of California Berkeley, Berkeley, California, United States of America, 7 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United
States of America, 8 University of California San Diego, San Diego, California, United States of America, 9 Brown University, Providence, Rhode Island, United States of
America
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein
families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of
sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million
Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total
of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no
detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of
sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in
the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously
categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans)
from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset
is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins,
the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their
evolution. These observations are illustrated using several protein families, including phosphatases, proteases,
ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS
data has implications for choosing targets for experimental structure characterization as part of structural genomics
efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the
addition of new sequences, implying that we are still far from discovering all protein families in nature.
Citation: Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families.
PLoS Biol 5(3): e16. doi:10.1371/journal.pbio.0050016
Academic Editor: Sean Eddy, Washington University St. Louis, United States of
America
Received March 24, 2006; Accepted August 15, 2006; Published March 13, 2007
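Copyright: © 2007 Yooseph et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited.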
Abbreviations: aa, amino acid; ENS, Ensembl; EST, expressed sequence tag; GO,
Gene Ontology; GOS, Global Ocean Sampling; GS, glutamine synthetase; HMM,
hidden Markov model; IDO, indoleamine 2,3-dioxygenase; NCBI, National Center for
Biotechnology Information; ORF, open reading frame; PDB, Protein Data Bank; PG,
prokaryotic genomes; PP2C, protein phosphatase 2C; PSI, Protein Structure
Initiative; RLP, RuBisCO-like protein; TGI, TIGR gene indices; TC, trusted cutoff;
UVDE, UV dimer endonuclease
* To whom correspondence should be addressed. E-mail: Shibu.Yooseph@venterinstitute.org
collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.
Expanding the Protein Family Universe
Author Summary

The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique, shotgun sequencing, allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data, nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.

Introduction

Despite many efforts to classify and organize proteins [1–6] from both structural and functional perspectives, we are far from a clear understanding of the size and diversity of the protein universe [7–9]. Environmental shotgun sequencing projects, in which genetic sequences are sampled from communities of microorganisms [10–14], are poised to make a dramatic impact on our understanding of proteins and protein families. These studies are not limited to culturable organisms, and there are no selection biases for protein classes or organisms. These studies typically provide a gene-centric (as opposed to an organism-centric) view of the environment and allow the examination of questions related to protein family evolution and diversity. The protein predictions from some of these studies are characterized both by their sheer number and diversity. For instance, the recent Sargasso Sea study [10] resulted in 1.2 million protein predictions and identified new subfamilies for several known protein families.
Protein exploration starts by clustering proteins into groups or families of evolutionarily related sequences. The notion of a protein family, while biologically very relevant, is hard to realize precisely in mathematical terms, thereby making the large-scale computational clustering and classification problem nontrivial. Techniques for these problems typically rely on sequence similarity to group sequences. Proteins can be grouped into families based on the highly conserved structural units, called domains, that they contain [15,16]. Alternatively, proteins are grouped into families based on their full sequence [17,18]. Many of these classifications, together with various expert-curated databases [19] such as Swiss-Prot [20], Pfam [15,21], and TIGRFAM [22,23], or integrated efforts such as UniProt [24] and InterPro [25], provide rich resources for protein annotation. However, a vast number of protein predictions remain unclassified both in terms of structure and function. Given varying rates of evolution, there is unlikely to be a single similarity threshold or even a small set of thresholds that can be used to define every protein family in nature. Consequently, estimates of the number of families that exist in nature vary considerably based on the different thresholds used and assumptions made in the classification process [26–29].
In this study, we explored proteins using a comprehensive dataset of publicly available sequences together with environmental sequence data generated by the Sorcerer II Global Ocean Sampling (GOS) expedition [30]. We used a novel clustering technique based on full-length sequence similarity both to predict proteins and to group related sequences. The goals were to understand the rate of discovery of protein families with the increasing number of protein predictions, explore novel families, and assess the impact of the environmental sequences from the expedition on known proteins and protein families. We used hidden Markov model (HMM) profiling to examine the relative biases in protein domain distributions in the GOS data and existing protein databases. This profiling was also used to assess the impact of the GOS data on target selection for protein structure characterization efforts. We carried out in-depth analyses on several protein families to validate our clustering approach and to understand the diversity and evolutionary information that the GOS data added; the families included ultraviolet (UV) irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO.

Results/Discussion

Data Generation, Sequence Clustering, and HMM Profiling
We used the following publicly available datasets in this study (Table 1): the National Center for Biotechnology Information (NCBI)’s nonredundant protein database (NCBI-nr) [31,32], NCBI Prokaryotic Genomes (PG) [31,33], TIGR Gene Indices (TGI-EST) [34], and Ensembl (ENS) [35,36]. The rationale for including these datasets is discussed in Materials and Methods. All datasets were downloaded on February 10, 2005.
None of the above-mentioned databases contained sequences from the Sargasso Sea study [10], the largest environmental survey to date, and so we pooled reads from the Sargasso Sea study with the reads from the Sorcerer II GOS expedition [30], creating a combined set that we call the GOS dataset. The GOS dataset was assembled using the Celera Assembler [37] as described in [30] (see Materials and Methods). The GOS dataset was primarily generated from the 0.1 μm to 0.8 μm size filters and thus is expected to be mostly microbial [30]. The data also included a small set of sequences from a viral size (<0.1 μm) fraction (Table 1).
We identified open reading frames (ORFs) from the DNA sequences in the PG, TGI-EST, and GOS datasets. An ORF is commonly defined as a translated DNA sequence that begins with a start codon and ends with a stop codon. To accommodate partial DNA sequences, we extended this definition to allow an ORF to be bracketed by either a start codon or the start of the DNA sequence, and by either a stop codon or the end of the DNA sequence. ORFs were generated
Table 1. The Complete Dataset Consisted of Sequences from NCBI-nr, ENS, TGI-EST, PG, and GOS, for a Total of 28,610,944 Sequences
Dataset | Source | Number of Amino Acid Sequences | Mean Sequence Length | Brief Description
NCBI-nr | NCBI | 2,317,995 | 339 | Consists of protein sequences submitted to SWISS-PROT, PDB, PIR, and PRF, and also predicted proteins from both finished and unfinished genomes in GenBank, EMBL, and DDBJ.
PG ORFs | NCBI | 3,049,695 | 160 | ORFs identified from 222 prokaryotic genome projects. Organisms are listed in Protocol S1.
TGI-EST ORFs | TIGR Gene Index | 5,458,820 | 119 | ORFs identified from 72 datasets in which each dataset consists of EST assemblies. Organisms are listed in Protocol S1.
ENS | Ensembl | 361,668 | 466 | Sequences from 12 species, including human, mouse, rat, chimp, zebrafish, fruit fly, mosquito, honey bee, dog, two species of puffer fish, chicken, and worm.
GOS ORFs | J. Craig Venter Institute | 17,422,766 | 134 | ORFs identified from an assembly of 7.7 million reads. These reads include both the reads from the Sorcerer II GOS Expedition and the reads from the earlier Sargasso Sea study. Also included are 36,318 ORFs identified from an assembly of sequences collected from the viral size (<0.1 μm) fraction of one sample.
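Three of the datasets in Table 1 are ORF sets. The extended ORF definition given in the surrounding text (an ORF may be bounded by either end of the DNA sequence, is read in all six frames, and must span at least 60 aa) can be sketched roughly as follows. This is an illustrative reading of that description and not the authors' pipeline: the function names, the restriction to three initiator codons, and the handling of ambiguous bases are all assumptions.

```python
from itertools import product

# Standard codon-to-amino-acid mapping in NCBI "TCAG" order; translation
# table 11 shares these assignments and differs only in its start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common initiators only
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, min_aa):
    """Yield translated ORFs read from position 0 of dna (hypothetical helper)."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    segment, touches_start = [], True
    for codon in codons + [None]:  # None marks the end of the sequence
        is_stop = codon is not None and CODON_TABLE.get(codon) == "*"
        if codon is not None and not is_stop:
            segment.append(codon)
            continue
        if not touches_start:
            # Interior ORFs must begin at a start codon; a partial ORF at the
            # start of the sequence is kept whole.
            starts = [i for i, c in enumerate(segment) if c in START_CODONS]
            segment = segment[starts[0]:] if starts else []
        if len(segment) >= min_aa:
            # Unknown codons (e.g. containing N) become X; an initiator is
            # translated by the table rather than forced to M.
            yield "".join(CODON_TABLE.get(c, "X") for c in segment)
        segment, touches_start = [], False


def six_frame_orfs(dna, min_aa=60):
    """Collect ORFs from all three offsets on both strands."""
    dna = dna.upper()
    found = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            found.extend(orfs_in_frame(strand[offset:], min_aa))
    return found
```

Because a segment that reaches either end of the sequence is retained, fragmentary reads still yield candidate ORFs, which is the stated purpose of the extended definition.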
by considering translations of the DNA sequence in all six frames. For ORFs from the PG and TGI-EST datasets, we used the appropriate codon usage table for the known organism. For GOS ORFs from the assembled sequences, we used translation table 11 (the code for bacteria, archaea, and prokaryotic viruses) [31]. We did not include alternate codon translations in this analysis. For all datasets, only ORFs containing at least 60 amino acids (aa) were considered. Not all ORFs are proteins. In this paper, ORFs that have reasonable evidence for being proteins are called predicted proteins; other ORFs are called spurious ORFs.
In summary, the total input data for this study (Table 1) consisted of 28,610,944 sequences from NCBI-nr, PG, TGI-EST, ENS, and GOS. All data and analysis results will be made publicly available (see Materials and Methods).
We used a sequence similarity clustering to group related sequences and subsequently predicted proteins from this grouping. This approach of protein prediction was adopted for two reasons. First, the GOS data make up a major portion of the dataset being analyzed, and a large fraction of GOS ORFs are fragmentary sequences. Traditional annotation pipelines/gene finders, which presume complete or near-complete genomic data, perform unsatisfactorily on this type of data. Second, protein prediction based on the comparison of ORFs to known protein sequences imposes limits on the protein families that can be explored. In particular, novel proteins that belong to known families will not be detected if they are sufficiently distant from known members of that family. This is the case even though there may be other novel proteins that can transitively link them to the known proteins. Similarly, truly novel protein families will also not be detected.
As the primary input to our clustering process, we computed the pairwise sequence similarity of the 28.6 million aa sequences in our dataset using an all-against-all BLAST search [38]. This required more than 1 million CPU hours on two large compute clusters (see Materials and Methods). The sequences were clustered in four steps (see Materials and Methods). In the first step, we identified a nonredundant set of sequences from the entire dataset using only pairwise matches with ≥98% similarity and involving ≥95% of the length of the shorter sequence. This step served the dual role of identifying highly conserved groups of sequences (where each group was represented by a nonredundant sequence) and removing redundancy in the dataset due to identical and near-identical sequences. Only nonredundant sequences were considered for further steps in our clustering procedure. In the second step, we identified core sets of similar sequences using only matches between two sequences involving ≥80% of the length of the longer sequence. We used a graph-theoretic procedure to identify dense subgraphs (the core sets) within a graph defined by these matches. While the match parameters we used in this step were more relaxed than those in the first step, we chose them to reduce the grouping of unrelated sequences while simultaneously reducing the unnecessary splitting of families. In the third step, these core sets were transformed into profiles, and we used a profile–profile method [39] to merge related core sets into larger groups. In the final step, we recruited sequences to core sets using sequence-profile matching (PSI-BLAST [40]) and BLAST matches to core set members. We required the match to involve ≥60% of the length of the sequence being recruited.
We identified and removed clusters containing likely spurious ORFs using two filters (see Materials and Methods). The first filter identified clusters containing shadow ORFs. The second filter identified clusters containing conserved but noncoding sequences, as indicated by a lack of selection at the codon level. Only clusters that remained after the two filtering steps and contained at least two nonredundant sequences are reported in this analysis.
We examined the distribution of known protein domains in the full dataset using profile HMMs [41] from the Pfam [15] and TIGRFAM [22] databases (see Materials and Methods). We labeled sequences that end up in clusters (containing at least two nonredundant sequences) or that have HMM matches as predicted proteins. The inclusion of the PG ORF set allowed for the evaluation of protein prediction using our clustering approach. A comparison of proteins predicted in the PG ORF set by our clustering against PG ORFs annotated as proteins by whole-genome annotation techniques revealed that our protein prediction method via clustering has a
Table 2. Clustering and HMM Profiling Results Showing the Number of Predicted Proteins (Including Both Redundant and Nonredundant Sequences) in Each Dataset
Dataset | Original Set | Clustering (A) | HMM Profiling (B) | A ∩ B | A − B | B − A | Total Predicted Proteins (A ∪ B) | Mean Length of Sequence
NCBI-nr | 2,317,995 | 1,939,056 | 1,645,146 | 1,566,123 | 372,933 | 79,023 | 2,018,079 | 359
PG ORFs | 3,049,695 | 575,729 | 448,159 | 418,503 | 157,226 | 29,656 | 605,385 | 325
TGI-EST ORFs | 5,458,820 | 1,097,083 | 606,779 | 576,532 | 520,551 | 30,247 | 1,127,330 | 207
ENS | 361,668 | 319,855 | 253,007 | 241,671 | 78,184 | 11,336 | 331,191 | 489
GOS ORFs | 17,422,766 | 6,046,914 | 3,701,388 | 3,624,907 | 2,422,007 | 76,481 | 6,123,395 | 199
Total | 28,610,944 | 9,978,637 | 6,654,479 | 6,427,736 | 3,550,901 | 226,743 | 10,205,380 | —
A ∩ B denotes the number of predicted proteins common to both the clustering and the HMM profiling; A − B, the number of predicted proteins in clusters but not in the HMM profile set; B − A, the number of predicted proteins in the HMM profile set but not in clusters; and A ∪ B, the total number of predicted proteins in each dataset.
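The set relationships defined in the Table 2 footnote can be checked directly from the tabulated counts. The sketch below transcribes the five dataset rows and verifies that A = (A ∩ B) + (A − B), B = (A ∩ B) + (B − A), and A ∪ B = A + (B − A) hold in each row; the variable and function names are illustrative only.

```python
# Rows of Table 2:
# (original, A, B, A_and_B, A_minus_B, B_minus_A, A_or_B)
TABLE2 = {
    "NCBI-nr":      (2317995, 1939056, 1645146, 1566123, 372933, 79023, 2018079),
    "PG ORFs":      (3049695, 575729, 448159, 418503, 157226, 29656, 605385),
    "TGI-EST ORFs": (5458820, 1097083, 606779, 576532, 520551, 30247, 1127330),
    "ENS":          (361668, 319855, 253007, 241671, 78184, 11336, 331191),
    "GOS ORFs":     (17422766, 6046914, 3701388, 3624907, 2422007, 76481, 6123395),
}


def row_is_consistent(row):
    """Check the inclusion-exclusion identities for one dataset row."""
    original, a, b, both, a_only, b_only, union = row
    return (
        a == both + a_only
        and b == both + b_only
        and union == a + b_only
        and union <= original
    )


def percent_predicted(name):
    """Predicted proteins (A ∪ B) as a rounded percentage of the original set."""
    original, *_, union = TABLE2[name]
    return round(100 * union / original)


total_predicted = sum(row[-1] for row in TABLE2.values())
```

With these rows, `percent_predicted` gives 87 for NCBI-nr, 92 for ENS, and 35 for GOS, and `total_predicted` equals the 10,205,380 reported in the Total row.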
sensitivity of 83% and a specificity of 86% (see Materials and Methods). The HMM profiling allowed for the evaluation of our clustering technique’s grouping of sequences. We used Pfam models in two different ways for this assessment (see Materials and Methods) and make three observations. First, using a simple Pfam domain architecture-based evaluation, these clusters are mostly consistent as reflected by 93% of clusters having less than 2% unrelated pairs of sequences in them. Second, these clusters are quite conservative and can split domain families, with 58% of domain architectures being confined to single clusters and 88% of domain architectures having more than half of their occurrences in a single cluster. Third, the size distribution of these clusters is quite similar to the size distribution of clusters induced by Pfams.

Protein Prediction
Of the initial 28,610,944 sequences, we labeled 9,978,637 sequences (35%) as predicted proteins based on the clustering, of which nearly 60% are from GOS (Table 2). The HMM profiling labeled only an additional 226,743 (0.8%) sequences as predicted proteins, for a total of 10,205,380 predicted proteins. This indicates that our clustering method captures most of the sequences found by profile HMMs. For sequences both in clusters and with HMM matches, (on average) 73.5% of their length is covered by HMM matches. For sequences not in clusters but with HMM matches, this value is only 45.3%. Furthermore, while 64% of sequences in clusters have HMM matches, there are 3,550,901 sequences that are grouped into clusters but do not have HMM matches. Most of these clusters either correspond to families lacking profile HMMs or contain sequences that are too remote to match above the cutoffs used. The latter is an indication of the diversity added to known families that is not picked up by current profile HMMs.
Using our method, the predicted proteins constitute different fractions of the totals for the five datasets, with 87% for NCBI-nr, nearly 20% for both PG ORFs and TGI-EST ORFs, 92% for ENS, and 35% for GOS. The high rate of prediction for ENS is a reflection of the high degree of conservation of proteins across the metazoan genomes, whereas the prediction rates for PG ORFs and TGI-EST ORFs are similar to rates seen in other protein prediction approaches. The 13% of NCBI-nr sequences that we marked as spurious may constitute contaminants in the form of false predictions or organism-specific proteins. Nearly two-thirds of these sequences are labeled “hypotheticals,” “unnamed,” or “unknown.” This is more than twice the fraction of similarly labeled sequences (30%) in the full NCBI-nr dataset. Of the remaining one-third, half of them are less than 100 aa in length. This suggests that they are either fast-evolving short peptides, spurious predictions, or proteins that failed to meet the length-based thresholds in the clustering.
Based on the clustering and the HMM profiling, there is evidence for 6,123,395 proteins in the GOS dataset (Table 2). Given the fragmentary nature of the GOS ORFs (as a result of the GOS assembly [10,30]), it is not surprising that the average length of a GOS-predicted protein (199 aa) is smaller than the average length of predicted proteins in NCBI-nr (359 aa), PG ORFs (325 aa), TGI-EST ORFs (207 aa), and ENS (489 aa). The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (34%) compared to PG ORFs (19%). This could be due to a large number of false-positive protein predictions in the GOS dataset. However, this is unlikely for a variety of reasons. Nearly 4.64 million GOS ORFs (26.6%) have significant BLAST matches (with an E-value ≤ 1 × 10⁻¹⁰) to NCBI-nr sequences. The PG ORFs do not have a high false-positive rate compared to the submitted annotation for the prokaryotic genomes (see Materials and Methods). Most importantly, based on the fragmentary nature of GOS sequencing compared to PG sequencing, the number of shadow (spurious) ORFs ≥60 aa is significantly reduced (see Materials and Methods).
Some pairs of GOS-predicted proteins that belong to the same cluster are adjacent in the GOS assembly. While some of them correspond to tandem duplicate genes, an overwhelming fraction of the pairs are on mini-scaffolds [10], indicating that they are potentially pieces of the same protein (from the same clone) that we split into fragments. We estimate that this effect applies to 3% of GOS-predicted proteins. Sequencing errors and the use of the wrong translation table can also result in the ORF generation process producing split ORF fragments.
The combined set of predicted proteins in NCBI-nr, PG, TGI-EST, and ENS, as expected, has a lot of redundancy. For instance, most of the PG protein predictions are in NCBI-nr. Removing exact substrings of longer sequences (i.e., 100% identity) reduces this combined set to 3,167,979 predicted proteins. When we perform the same filtering on the GOS dataset, 5,654,638 predicted proteins remain. Thus, the GOS-
predicted protein set is 1.8 times the size of the predicted protein set from current publicly available datasets. We used a simple BLAST-based scheme to assign kingdoms for the GOS sequences (see Materials and Methods). Of the sequences that we could annotate by kingdom, 63% of the sequences in the public datasets are from the eukaryotic kingdom, and 90.8% of the sequences in the GOS set are from the bacterial kingdom (Figure 1).

Figure 1. Proportion of Sequences for Each Kingdom
(A) The combined set of NCBI-nr, PG, TGI-EST, and ENS has 3,167,979 sequences. The eukaryotes account for the largest portion, which is more than twice the bacterial fraction.
(B) Predicted kingdom proportion of sequences in GOS. Out of the 5,654,638 GOS sequences, 5,058,757 are assigned kingdoms using a BLAST-based scheme. The bacterial kingdom forms by far the largest fraction in the GOS set.

Protein Clustering
The 9,978,637 protein sequences predicted by our clustering method are grouped into 297,254 clusters of size two or more, where the size of a cluster is defined to be the number of nonredundant sequences in the cluster. There are 280,187 small clusters (size < 20), 12,992 medium clusters (size between 20 and 200), and 4,075 large clusters (size > 200). While the 17,067 medium- and large-sized clusters constitute only 6% of the total number of clusters, they account for 85% of all the sequences that are clustered (Table 3). Many of the largest clusters correspond to families that have functionally diversified and expanded (Table 4). While some large families, such as the HIV envelope glycoprotein family and the immunoglobulins, also reflect biases in sequence databases, many more, including ABC transporters, kinases, and short-chain dehydrogenases, reflect their expected abundance in nature.

Rate of Discovery of Protein Families
We examined the rate of discovery of protein families using our clustering method to determine whether our sampling of the protein universe is reaching saturation. We find that for the present number of sequences there is an approximately linear trend in the rate of discovery of clusters with the addition of new (i.e., nonredundant) sequences (Figure 2). Moreover, the observed distribution of cluster sizes is well approximated by a power law [42,43], and this observed power law can be used to predict the rate of growth of the number of clusters of a given size (see Materials and Methods). This rate is dependent on the value of the power law exponent and decreases with increasing cluster sizes. We find good agreement between the observed and predicted growth rates for different cluster sizes. The approximately linear relationship between the number of clusters and the number of protein sequences indicates that there are likely many more protein families (either novel or subfamilies distantly related to known families) remaining to be discovered.

GOS versus Known Prokaryotic versus Known Nonprokaryotic
We also examined the GOS coverage of known proteins and protein families. Based on the cell-size filtering performed while collecting the GOS samples, we expected that the sample would predominantly be a size-limited subset of prokaryotic organisms [30]. We studied the content of the 17,067 medium- and large-sized clusters across three groupings: (1) GOS, (2) known prokaryotic (PG together with bacterial and archaeal portions of NCBI-nr), and (3) known nonprokaryotic (TGI-EST and ENS together with viral and eukaryotic portions of NCBI-nr). The Venn diagram in Figure 3 shows the breakdown of these clusters by content (see Materials and Methods). The largest section contains GOS-
Table 3. Cluster Size Distribution and the Distribution of Sequences in These Clusters
Cluster Size | Number of Clusters | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS
2–4 | 214,033 | 756,269 | 194,297 | 87,699 | 149,687 | 32,920 | 291,666
5–9 | 48,348 | 415,166 | 97,759 | 30,565 | 71,414 | 14,828 | 200,600
10–19 | 17,806 | 350,918 | 90,682 | 19,904 | 60,783 | 23,493 | 156,056
20–49 | 7,255 | 310,770 | 78,153 | 13,809 | 58,496 | 26,486 | 133,826
50–99 | 3,086 | 337,296 | 80,470 | 14,342 | 55,190 | 26,150 | 161,144
100–199 | 2,631 | 595,903 | 165,846 | 28,100 | 107,490 | 40,465 | 254,002
200–499 | 2,134 | 1,036,567 | 218,940 | 57,131 | 164,581 | 49,797 | 546,118
500–999 | 799 | 914,207 | 148,084 | 54,077 | 90,020 | 24,047 | 597,979
1,000–2,000 | 620 | 1,503,116 | 205,196 | 79,348 | 105,866 | 21,883 | 1,090,823
>2,000 | 542 | 3,758,425 | 659,629 | 190,754 | 233,556 | 59,786 | 2,614,700
Total | 297,254 | 9,978,637 | 1,939,056 | 575,729 | 1,097,083 | 319,855 | 6,046,914
The size of a cluster is the number of nonredundant sequences in it. Column three shows the total number of sequences (both redundant and nonredundant) in these clusters. The succeeding columns show their breakdown by the five datasets. There are 17,067 medium- and large-size clusters.
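As a quick arithmetic check on the statement that medium and large clusters make up about 6% of clusters but about 85% of clustered sequences, the following minimal sketch recomputes both shares from the first three columns of Table 3. The row tuples are transcribed from the table; the names `TABLE_3`, `SMALL_BINS`, and `medium_large_share` are illustrative and not part of the original analysis.

```python
# Rows transcribed from Table 3: (size bin, number of clusters, total sequences).
TABLE_3 = [
    ("2-4", 214_033, 756_269),
    ("5-9", 48_348, 415_166),
    ("10-19", 17_806, 350_918),
    ("20-49", 7_255, 310_770),
    ("50-99", 3_086, 337_296),
    ("100-199", 2_631, 595_903),
    ("200-499", 2_134, 1_036_567),
    ("500-999", 799, 914_207),
    ("1,000-2,000", 620, 1_503_116),
    (">2,000", 542, 3_758_425),
]

# Bins that the text classes as "small" (size < 20).
SMALL_BINS = {"2-4", "5-9", "10-19"}


def medium_large_share(rows):
    """Return (count, share of clusters, share of sequences) for non-small bins."""
    total_clusters = sum(n for _, n, _ in rows)
    total_seqs = sum(s for _, _, s in rows)
    ml_clusters = sum(n for b, n, _ in rows if b not in SMALL_BINS)
    ml_seqs = sum(s for b, _, s in rows if b not in SMALL_BINS)
    return ml_clusters, ml_clusters / total_clusters, ml_seqs / total_seqs


clusters, cluster_share, seq_share = medium_large_share(TABLE_3)
print(clusters, f"{cluster_share:.1%}", f"{seq_share:.1%}")
```

The computed shares round to the 6% and 85% quoted in the text.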
Table 4. List of the Top 25 Clusters from the Clustering Process
Cluster ID | Cluster Annotation | Nonredundant Sequences | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS
3510 | Immunoglobulin | 37,227 | 51,944 | 49,206 | 0 | 1,649 | 1,089 | 0
2568 | ABC transporter | 34,130 | 69,010 | 8,886 | 6,248 | 150 | 13 | 53,713
49 | Short chain dehydrogenase | 33,406 | 56,266 | 7,607 | 3,055 | 2,852 | 747 | 42,005
4294 | NAD dependent epimerase/dehydratase | 29,445 | 35,555 | 2,745 | 1,265 | 1,500 | 111 | 29,934
1239 | AMP-binding enzyme | 22,111 | 37,598 | 3,838 | 1,614 | 2,246 | 613 | 29,287
2630 | Envelope glycoprotein | 21,161 | 41,205 | 41,189 | 2 | 10 | 0 | 4
157 | Glycosyl transferases group 1 | 20,366 | 27,012 | 2,766 | 1,446 | 557 | 42 | 22,201
183 | Integral membrane protein | 17,627 | 33,079 | 2,154 | 1,298 | 1,198 | 95 | 28,334
530 | Aldehyde dehydrogenase | 15,851 | 30,929 | 3,116 | 1,349 | 1,589 | 388 | 24,487
1308 | Aminotransferase class-V and DegT/DnrJ/EryC1/StrS aminotransferase | 15,757 | 22,484 | 1,849 | 1,086 | 413 | 71 | 19,065
244 | Kinase family, including pknb, epk, c6 | 15,112 | 21,641 | 6,384 | 83 | 10,809 | 2,761 | 1,604
336 | Histidine kinase–, DNA gyrase B–, and HSP90-like ATPase | 14,724 | 23,355 | 3,809 | 2,469 | 54 | 4 | 17,019
357 | Tetratricopeptide repeat | 14,323 | 17,058 | 1,598 | 609 | 1,320 | 315 | 13,216
4325 | Alpha/Beta hydrolase fold | 13,806 | 20,886 | 2,828 | 1,334 | 1,625 | 196 | 14,903
113 | Aminotransferase class I and II | 13,006 | 22,186 | 2,931 | 1,534 | 1,239 | 120 | 16,362
333 | Zinc-binding dehydrogenase | 12,737 | 22,298 | 4,055 | 1,370 | 2,383 | 269 | 14,221
1315 | tRNA synthetases class I (I, L, M, and V) | 12,545 | 19,992 | 1,152 | 600 | 472 | 131 | 17,637
26 | Acyl-CoA dehydrogenase | 12,150 | 22,340 | 2,081 | 1,152 | 541 | 179 | 18,387
159 | ABC transporter and ABC transporter transmembrane | 11,984 | 17,650 | 2,697 | 1,442 | 797 | 170 | 12,544
3357 | Cytochrome P450 | 11,929 | 17,302 | 5,355 | 249 | 6,994 | 1,399 | 3,305
4556 | Response regulator | 11,928 | 21,903 | 5,387 | 3,320 | 348 | 5 | 12,843
1720 | TonB-dependent receptor | 11,890 | 17,080 | 1,789 | 1,090 | 34 | 2 | 14,165
514 | NADH dehydrogenase (various subunits) | 11,224 | 25,068 | 11,624 | 635 | 253 | 10 | 12,546
4235 | Glycosyl transferase family 2 | 10,954 | 13,593 | 1,236 | 724 | 74 | 14 | 11,545
186 | 7 transmembrane receptor | 10,654 | 22,252 | 13,943 | 0 | 1,475 | 6,829 | 5
Clusters were annotated using the most commonly matching Pfam domains. Many of these clusters correspond to families that have expanded and functionally diversified.
only clusters (23.40%), emphasizing the significant novelty provided by the GOS data. The next section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data.

Novelty Added by GOS Data
There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The
Figure 2. Rate of Discovery of Clusters as (Nonredundant) Sequences Are Added
The x-axis denotes the number of sequences (in millions) and the y-axis denotes the number of clusters (in thousands). Seven datasets with increasing numbers of (nonredundant) sequences are chosen as described in the text. The blue curve shows the number of core sets of size 3 for the seven datasets. Curves for core set sizes 5, 10, and 20 are also shown. Linear regression gives slopes 0.027 (R² = 0.999), 0.011 (R² = 0.999), 0.0053 (R² = 0.999), and 0.0024 (R² = 0.996) for size 3, size 5, size 10, and size 20, respectively.
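The slopes and R² values quoted in the Figure 2 caption are the kind of output an ordinary least-squares line fit produces. The sketch below shows one way such a fit could be computed; it assumes plain least squares, which the caption does not spell out, and the function name `linear_fit` and the sample points are hypothetical, not the seven datasets behind the figure.

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, R^2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical points lying exactly on a line; NOT data from the paper.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.03 * x + 0.01 for x in xs]
slope, intercept, r_squared = linear_fit(xs, ys)
print(slope, intercept, r_squared)
```

An R² close to 1, as reported for all four core set sizes, means the straight line accounts for nearly all of the variation in cluster count across the seven datasets.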
remaining clusters are completely novel families. In exploring the 3,995 GOS-only clusters, 44.9% of them contain sequences that have HMM matches, or BLAST matches to sequences in a more recent snapshot of NCBI-nr (downloaded in August 2005) than was used in this study. The recent NCBI-nr matches include phage sequences from cyanophages (P-SSM2 and P-SSM4) [44] and sequences from the SAR-11 genome (Candidatus Pelagibacter ubique HTCC1062) [45]. We used profile–profile searches [39] to show that an additional 12.5% of the GOS-only clusters can be linked to profiles built from Protein Data Bank (PDB), COG, or Pfam. The 2,295 clusters with detected homology are referred to as Group I clusters. The remaining 1,700 (42.6%) GOS-only clusters with no detectable homology to known families are labeled as Group II clusters.

Figure 3. Venn Diagram Showing Breakdown of the 17,067 Medium and Large Clusters by Three Categories—GOS, Known Prokaryotic, and Known Nonprokaryotic
doi:10.1371/journal.pbio.0050016.g003

We applied a guilt-by-association operon method to annotate the GOS-only clusters with a strategy that did not rely on direct sequence homology to known families. Function was inferred for the GOS-only clusters by examining their same-strand neighbors on the assembly (see Materials and Methods). Similar strategies have been successfully used to infer protein function in finished microbial genomes [46–48]. Despite minimal assembly of GOS reads, many scaffolds and mini-scaffolds contain at least partial fragments of more than one predicted ORF, thereby making this approach feasible. For 90 (5.3%) of the Group II clusters, and for 214 (9.3%) of the Group I clusters, at least one Gene Ontology (GO) [49] biological process term at p-value ≤ 0.05 can be inferred. The inferred functions and neighbors of some of these GOS-only clusters are highlighted in Table 5. We observed that for Group I clusters, the neighbor-inferred function is often bolstered by some information from weak homology to known sequences. While neighboring clusters as a whole are of diverse function, a number of GOS-only clusters seem to be next to clusters implicated in photosynthesis or electron transport. These GOS-only clusters could be of viral origin, as cyanophage genomes contain and express some photosynthetic genes that appear to be derived from their hosts [44,50,51]. In support of these observations, we identified five photosynthesis-related clusters containing hundreds to thousands of viral sequences, including psbA, psbD, petE, SpeD, and hli in the GOS data; furthermore, our nearest-neighbor analysis of these sequences reveals the presence of multiple viral proteins (unpublished data).

Although the majority of GOS-only sequences are bacterial, a higher than expected proportion of the GOS-only clusters are predicted to be of viral origin, implying that viral sequences and families are poorly explored relative to other microbes. To assign a kingdom to the GOS-only clusters, we first inferred the kingdom of neighboring sequences based on the taxonomy of the top four BLAST matches to the NCBI-nr database (see Materials and Methods). A possible kingdom was assigned to the GOS-only cluster if more than 50% of assignable neighboring sequences belong to the same kingdom. Viewed in this way, 11.8% of Group I clusters and 17.3% of Group II clusters with at least one kingdom-assigned neighbor have more than 50% viral neighbors (Figure 4). Only 3.3% and 3.4% of random samples of clusters with size distributions matching that of Group I and Group II clusters have more than 50% viral neighbors, while 7.7% of all clusters pass this criterion. A total of 547 GOS-only clusters contain sequences collected from the viral size fraction included in the GOS dataset. For these clusters, 38.9% of the Group I subset and 27.5% of the Group II subset with one or more kingdom-assigned neighbors would be inferred as viral, based on the conservative criterion of having more than 50% viral assignable neighbors. Several alternative kingdom assignment methods were tried (see Materials and Methods) and support a similar conclusion.

The GOS-only clusters also tend to be more AT-rich than sequences from a random size-matched sample of clusters (35.9% ± 8% GC content for Group II clusters versus 49.5% ± 11% GC content for sample). Phage genomes with a Prochlorococcus host [44] are also AT rich (37% average GC content). Our analysis of the graph constructed based on inferred operon linkages between all clusters indicates that the GOS-only clusters may constitute large sets of cotranscribed genes (see Materials and Methods).

The high proportion of potentially viral novel clusters observed here is reasonable, as 60%–80% of the ORFs in most finished marine phage genomes are not homologous to known protein sequences [52]. Viral metagenomics projects have reported an equally high fraction of novel ORFs [53], and a recent marine metagenomics project estimated that up to 21% of photic zone sequences could be of viral origin [51]. It has also been reported that 40% of ORFans (sequences that lack similarity to known proteins and predicted proteins) exist in close spatial proximity to each other in bacterial genomes, and this combined with proximity to integration signals has been used to suggest a viral horizontally transferred origin for many bacterial ORFans [54]. Others have noted a clustering of ORFans in genome islands and suggested they derive from a phage-related gene pool [55]. A recent analysis of genome islands from related Prochlorococcus found that phage-like genes and novel genes cohabit these dynamic areas of the genome [56]. In our GOS-only clusters, 37 of the 1,700 clusters with no detectable similarity
Table 5. Neighbor-Based Inference of Function for Novel Clusters of GOS Sequences
Novel Cluster ID | Inferred Function (GO ID, Biological Process) | p-Value (a) | Neighboring Clusters with Contributing GO Annotation (Cluster ID: GO Annotation) | Other Neighbors of Interest (b) | Comments
8837 | GO:0006260, DNA replication | 4.70 × 10^-4 | 812: ATPase involved in DNA replication; 2,655: DNA polymerase family B | Phage Mu Mom DNA modification enzyme; DNA methylase | Profile–profile match: DNA polymerase processivity factor
12519 | GO:0006118, Electron transport | 4.54 × 10^-3 | 1,362: Cytochrome c oxidase subunit III; 1,771: SCO1/SenC—biogenesis of photosynthetic systems | - | Profile–profile match: PF03626—cytochrome c oxidase subunit IV; 3 predicted transmembrane helices
11010151 | GO:0017004, Cytochrome complex assembly | 1.00 × 10^-5 | 8,136: Thioredoxin; 9,364: Cytochrome c biogenesis protein; 1,317: Cytochrome c assembly protein | - | >20 diverse profile–profile matches, one of which is cytochrome c biogenesis factor ccmH_2
18456 | GO:0009252, Peptidoglycan biosynthesis | 1.00 × 10^-5 | 1,252: FAD binding domain; 10,764: UDP-N-acetylenolpyruvoylglucosamine reductase | Extracytoplasmic function (ECF) sigma factor 24; Viral RNA helicase | One predicted TM helix
14219 | GO:0009628, Response to abiotic stimulus | 3.10 × 10^-4 | 5,936: Colicin V production protein | MatE multidrug efflux pump | Predicted soluble; exclusively neighbors to just two clusters
11480 | GO:0015031, Protein transport | 3.00 × 10^-4 | 4,177: MotA/TolQ/ExbB proton channel family; 9,569: Biopolymer transport protein ExbD/TolR | - | Four predicted TM helices; Tol proteins facilitate transport of colicins, iron, and phage DNA
14360 | GO:0006777, Mo-molybdopterin cofactor biosynthesis | 1.00 × 10^-5 | 9,745: MoaC family; 9,948: MoaE protein; 255: Radical SAM superfamily | Sulfite oxidase; Predicted thioesterase | SAR11 blast match annotated as probable moaD; profile–profile matches to ThiS and molybdopterin converting factor; <0.05% of sequences have PFAM match to ThiS family
8397 | GO:0017004, Cytochrome complex assembly | 1.00 × 10^-5 | 8,136: Thioredoxin; 9,364: Uncharacterized cytochrome c biogenesis protein | SMC superfamily (homologous to ABC family) | Blast match to "periplasmic or inner membrane–associated protein"; two predicted TM helices; 0.7% of sequences have PFAM match to cytochrome c biogenesis protein
13909 | GO:0015979, Photosynthesis | 1.00 × 10^-5 | 13,990: Photosystem II reaction centre N protein (psbN); 5,184: Photosynthetic reaction centre protein D1 (psbA); 7,664: Ferredoxin-dependent bilin reductase | - | Predicted soluble; single blast match to cyanophage P-SSM2 hypothetical protein; many phage proteins as minor neighbors
(a) p-Values were computed by simulating 100,000 neighbor cluster sets of equivalent size.
(b) Not all clusters could be mapped to a GO term.
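The footnote on p-values says only that 100,000 neighbor cluster sets of equivalent size were simulated. One common way to turn such simulations into an empirical p-value is sketched below. This is an illustration under stated assumptions, not the authors' procedure: the test statistic (how many neighbors carry a given GO term), sampling without replacement, and the names `empirical_p_value`, `population`, and `has_term` are all hypothetical.

```python
import random


def empirical_p_value(observed, population, set_size, has_term,
                      n_sims=100_000, seed=0):
    """Fraction of random neighbor sets with >= `observed` members carrying the term.

    Assumption: neighbor sets are drawn uniformly without replacement from
    `population`; the paper does not state its sampling scheme.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = rng.sample(population, set_size)
        if sum(1 for cluster in sample if has_term(cluster)) >= observed:
            hits += 1
    # With n_sims draws, nothing smaller than 1 / n_sims can be resolved,
    # so zero hits is reported as that floor rather than as 0.
    return max(hits, 1) / n_sims


# Hypothetical toy universe: 1,000 cluster IDs, the first 50 carry the GO term.
population = list(range(1000))
carries_term = lambda cluster_id: cluster_id < 50

# Small n_sims here only to keep the example fast.
p_trivial = empirical_p_value(0, population, 10, carries_term, n_sims=1000)
p_floor = empirical_p_value(11, population, 10, carries_term, n_sims=1000)
print(p_trivial, p_floor)
```

Under this reading, 100,000 simulations cannot resolve anything below 1/100,000 = 10^-5, which may be why several entries in the table sit at exactly 1.00 × 10^-5.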

(2.2%) have at least ten bacterial-classified and ten viral-classified neighboring ORFs. This is 6.2-fold higher than the rate seen for the size-matched sample of all clusters (six clusters, 0.35%). This would seem to add more support to a phage origin for at least some ORFans found in bacterial genomes.

If a sizable portion of the novel families in the GOS data are in fact of viral origin, it suggests that we are far from fully exploring the molecular diversity of viruses, a conclusion echoed in previous studies of viral metagenomes [53,57,58]. In studies of bacterial genomes, discovery of new ORFans shows no sign of reaching saturation [59]. Coverage of many phage families in the GOS data may be low, given that there are inherent differences in the abundance of their presumed bacterial hosts. These GOS-only clusters were operationally defined as having at least 20 nonredundant sequences. Reducing this threshold to ten nonredundant sequences adds 7,241 additional clusters. Whether this vast diversity represents new families or is a reflection of the inability to detect distant homology will require structural and biochemical studies, as well as continued development of computational methods to identify remotely related sequences.

Comparison of Domain Profiles in GOS and PG Datasets
We used HMM profiling to address the question of which biochemical and biological functions are expanded or contracted in GOS compared to the largely terrestrial genomes in PG. Significant differences are seen in 68% of domains (4,722 out of the 6,975 domains that match either GOS or PG; p-value < 0.001, chi-square test). These differences reflect several factors, including differing biochemical needs of oceanic life and taxonomic biases in the two datasets. An initial comparison of these domain profiles helps shed light on these factors. Of the GOS-only domains, 91% (964/1,056) are viral and/or eukaryotic specific (by Pfam annotation). Most of the remaining 92 domains are rare (63 domains have fewer than ten copies in GOS), are predominantly eukaryotic/viral, or are specific to narrow bacterial taxa without completed genome sequences. Most of the 879 PG-only domains are also rare (444 have ten or fewer members), and/or are restricted to tight lineages, such as Mycoplasma (104 matches to five domains) or largely extremophile archaeal-specific domains (1,254 matches to 99 domains). Highly PG-enriched domains also tend to belong in these categories.

Many moderately skewed domains reflect the taxonomic skew between PG and GOS. For instance, we found that a set of six sarcosine oxidase-related domains are 4.8-fold enriched in GOS (Table 6). They are mostly found in α- and γ-proteobacteria, which are widespread in GOS. Normalizing to the taxonomic class level predicts a 1.8-fold enrichment in GOS, indicating that taxonomy alone cannot fully explain the prevalence of these proteins in oceanic bacteria.

Figure 4. Enrichment in the GOS-Only Set of Clusters for Viral Neighbors
Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM, or profile-profile homology (Group I); II, GOS-only clusters with no detectable homology (Group II); I-S, a sample from all clusters chosen to have the same size distribution as Group I; II-S, a sample from all clusters chosen to have the same size distribution as Group II; I-V, a subset of clusters in Group I containing sequences collected from the viral size fraction; II-V, a subset of clusters in Group II from the viral size fraction; and all clusters. Notice that although predominantly bacterial, GOS-only clusters are assigned as viral based on their neighbors more often than the size-matched samples and the set of all clusters.
doi:10.1371/journal.pbio.0050016.g004

Table 6. Functions Skewed in Domain Representation between PG and GOS
Process | Number of HMMs | Number in PG | Number in GOS | GOS Enrichment
Sarcosine oxidase | 6 | 686 | 19,295 | 4.766
Oxidative stress | 5 | 524 | 9,804 | 3.170
Ubiquinone synthesis | 4 | 245 | 4,035 | 2.790
RecA | 1 | 215 | 3,728 | 2.938
Topoisomerase IV | 4 | 2,163 | 33,472 | 2.622
Photosynthesis | 41 | 919 | 13,889 | 2.561
DNA polymerase | 20 | 3,682 | 51,224 | 2.357
tRNA synthetases | 11 | 5,499 | 71,294 | 2.197
Transketolase | 4 | 2,127 | 26,440 | 2.106
DNA gyrase | 7 | 4,146 | 49,677 | 2.030
TCA cycle | 30 | 12,057 | 135,294 | 1.901
Shikimate metabolism | 8 | 2,393 | 24,316 | 1.722
DnaJ | 3 | 1,103 | 12,389 | 1.891
Universal ribosomal components (found in all three kingdoms) | 39 | 8,555 | 80,321 | 1.591
UVR exonuclease operon | 6 | 4,108 | 38,223 | 1.577
ABC transporter | 39 | 193,689 | 727,314 | 0.636
Flagellum | 38 | 3,771 | 12,988 | 0.584
Sugar transport | 7 | 3,601 | 4,453 | 0.210
Transposase | 13 | 4,354 | 4,365 | 0.170
Che operon (chemotaxis) | 7 | 1,142 | 1,119 | 0.166
Ethanolamine | 9 | 231 | 218 | 0.160
Hydrogenases | 16 | 1,179 | 1,061 | 0.152
Pilus | 14 | 700 | 623 | 0.151
PTS phosphotransferase system | 32 | 11,439 | 6,661 | 0.099
Gas vesicle | 6 | 49 | 19 | 0.066
Gr+ nonspore | 22 | 1,063 | 52 | 0.008
Gr+ spores | 15 | 503 | 0 | 0.000
Functionally related families of domains were grouped by GO terms or by inspection to sum up total domain counts in GOS and PG. There were 8,935,364 domain matches in the GOS data (corresponding to 3,701,388 sequences) and 1,513,880 domain matches in the PG data (corresponding to 448,159 sequences). The GOS enrichment ratio is computed from columns three and four, and then normalized to account for the 5.9-fold greater number of domain matches in GOS compared to PG.
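The GOS enrichment ratio described in the Table 6 footnote can be sketched as the GOS-to-PG count ratio divided by the ratio of total domain matches in the two datasets. The totals below are the ones quoted in the footnote; dividing by their exact quotient rather than the rounded 5.9 is an assumption of this sketch, and the function and constant names are illustrative.

```python
# Total domain matches quoted in the Table 6 footnote.
GOS_DOMAIN_MATCHES = 8_935_364
PG_DOMAIN_MATCHES = 1_513_880


def gos_enrichment(n_pg, n_gos,
                   gos_total=GOS_DOMAIN_MATCHES, pg_total=PG_DOMAIN_MATCHES):
    """(GOS count / PG count), normalized by the overall GOS/PG match ratio (~5.9)."""
    if n_pg == 0:
        raise ValueError("PG count must be nonzero to form a ratio")
    return (n_gos / n_pg) / (gos_total / pg_total)


# (Number in PG, Number in GOS) for a few rows of Table 6.
rows = {
    "Sarcosine oxidase": (686, 19_295),
    "ABC transporter": (193_689, 727_314),
    "Flagellum": (3_771, 12_988),
    "Gr+ spores": (503, 0),
}

for process, (n_pg, n_gos) in rows.items():
    print(f"{process}: {gos_enrichment(n_pg, n_gos):.3f}")
```

Values above 1 indicate functions overrepresented in GOS relative to PG after accounting for dataset size, and values below 1 indicate underrepresentation; the rows used here agree with the tabulated enrichments to within rounding.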
Figure 5. Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size
The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given kingdom weights as described in Materials and Methods. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, while the ten largest Pfam families are shown as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100 sequences. Pfam families that are smaller than the ten largest are binned together in each column's bottom section. Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match.
Mysterious Lack of Characteristic Gram-Positive Domains
Gram-positive bacteria (Firmicutes and Actinobacteria) represent 26.7% of PG and ~12% of GOS [30]. Given the larger size of the GOS dataset, one might predict Gram-positive–specific domains to be ~2.4-fold enriched in GOS. Instead, the opposite is consistently seen. Of 15 firmicute-specific spore-associated domains, PG has 503 members, but GOS has none. For another 22 firmicute-restricted domains of varying or unknown function, the PG/GOS ratio is 1797:77 (Table 6). Hence, it appears that GOS Gram-positive lineages lack most of their characteristic protein domains. Two sequenced marine Gram-positives (Oceanobacillus iheyensis [60] and Bacillus sp. NRRL B-14911) have a large complement of these domains. However, another recently assembled genome from Sargasso Sea surface waters, the actinomycete Janibacter sp. HTCC2649, has just two of these domains, and may reveal a whole-genome context for this curious loss of characteristic domains.

Flagellae and Pili Are Selectively Lost from Oceanic Species
Flagellum components from both eubacteria and archaea are significantly underrepresented in the GOS dataset by about 2-fold (Table 6). Ironically, at a bacterial scale, swimming may be worthwhile on an almost dry surface, but not in open water. The chemotaxis (che) operon that often directs flagellar activity is also rare in GOS. Another directional appendage, the pilus, is even more reduced, though its taxonomic distribution (mostly in proteobacteria, predominantly γ-proteobacteria) would have predicted enrichment.

Skew in Core Cellular Pathways
While taxonomically specialized domains are likely to be skewed by taxonomic differences, core pathways found in many or all organisms paint a different picture. We used GO term mapping and text mining to group domains into major functions and to look for consistent skews across several domains. Several core functions, including DNA-associated proteins (DNA polymerase, gyrase, topoisomerase), ribosomal subunits shared by all three kingdoms, marker proteins such as recA and dnaJ, and TCA cycle enzymes all tend to be GOS enriched. This suggests that oceanic genomes may be more compact than sequenced genomes and so have a higher proportion of core pathways.

Characteristics and Kingdom Distribution of Known Protein Domains
A decade ago, databases were highly biased towards proteins of known function. Today, whole-genome sequencing and structural genomics efforts have presumably reduced the biases that are a result of targeted protein sequencing. We used the Pfam database to compare the characteristics and kingdom distribution of known protein domains in the GOS dataset to that of proteins in the publicly available datasets (NCBI-nr, PG, TGI-EST, and ENS). Such an effort can be used to assess biases in these datasets, help direct future sampling efforts (of underrepresented organisms, proteins, and protein families), make more informed generalizations about the protein universe, and provide important context for determination of protein evolutionary relationships (as biased sampling could indicate expected but missing sequences).

For this analysis we used the nonredundant datasets (at 100% identity) discussed in Figure 1. We refer to the set of 3,167,979 nonredundant sequences from NCBI-nr, PG, TGI-EST, and ENS as the public-100 set and the similarly filtered set of 5,654,638 sequences from the GOS data as the GOS-100 set. About 70% of public-100 sequences and 56% of GOS-100 sequences significantly match at least one Pfam model. The most obvious difference between the sets is that the vast majority of GOS sequences are bacterial, and this has to be taken into account when comparing the numbers. Since different Pfam families appear with different frequencies in the kingdoms, we considered the results for each kingdom separately (Figure 5). We then evaluated all kingdoms together, with results normalized by relative abundance of members from the different kingdoms. A domain found commonly and exclusively in eukaryotes and abundant in
public-100 would be expected to be found rarely in GOS-100. We used a conservative BLAST-based kingdom assignment method to assign kingdoms to the GOS sequences (see Materials and Methods).

In each kingdom, sequences in GOS-100 are less likely to match a Pfam family than those in public-100 (Figure 5). For the cellular kingdoms, these differences are comparatively modest. While diversity of the GOS data accounts for some of this difference, it might also be explained in part by the fragmentary nature of the GOS sequences. Viruses tell a dramatic and different story. Of public-100 viral sequences, 89.1% match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match. This tremendous difference appears to be due to heavy enrichment of the public data for minor variants of a few protein families, indicated by the sizes of the ten most populous Pfams in each kingdom (Figure 5). Sequences from three Pfam families (envelope glycoprotein GP120, reverse transcriptase, and retroviral aspartyl protease) account for a third of all public viral sequences. By contrast, the most populous three families in the GOS-100 data (bacteriophage T4-like capsid assembly protein [Gp20], major capsid protein Gp23, and phage tail sheath protein) account for only about 7% of public-100 sequences. Such a difference may be due to intentional oversampling of proteins that come from disease-causing organisms in the public dataset.

While the total proportion of proteins with a Pfam hit is fairly similar between public-100 (70%) and GOS-100 (56%) datasets, there are considerable differences with regard to the distributions of protein families within these two datasets. The most highly represented Pfam families in GOS-100 compared to public-100 are shown in Table 7. Notably, we found that while many known viral families are absent in GOS-100, viral protein families dominate the list of the families more highly represented in GOS-100; this is presumably because of biases in the collection of previously known viral sequences. Surprisingly few bacterial families were among the most represented in GOS-100 compared with public-100. By contrast, we also observed that those families found more rarely in GOS-100 than public-100 were frequently bacterial (Table 7). This appears to be a result of the large number of key bacterial and viral pathogen proteins in public-100 that are comparatively less abundant in the oceanic samples and/or less intensively sampled.

GOS-100 Data Suggest That a Number of “Kingdom-Specific” Pfams Actually Are Represented in Multiple Kingdoms
Of the 7,868 Pfam models in Pfam 17.0, 4,050 match proteins from only a single kingdom in public-100. The additional sequences from GOS-100 reveal that some of these families actually have representatives in multiple kingdoms.

confident kingdom assignment. Our examination of each of the scaffolds responsible for a determination of kingdom-crossing confirms that each one had both a highly significant match to the Pfam model in question and an overwhelming number of votes for the unexpected kingdom. These scaffold assemblies were also manually inspected. No clear anomalies were observed. In most instances, the assemblies in question were composed of a single unitig, and as such are high-confidence assemblies. Mate pair coverage and consistent depth of coverage provide further support for the correctness of those assemblies that are built from multiple unitigs. Examples of kingdom-crossing families include indoleamine 2,3-dioxygenase (IDO), MAM domain, and MYND finger [15], which have previously only been seen in eukaryotes, but we find them also to be present in bacteria. These Pfams now cross kingdoms, due either to their being more ancient than previously realized or to lateral transfer.

We explored the IDO family further. This family has representatives in vertebrates, invertebrates, and multiple fungal lineages [15,61] in public-100. Members of the IDO family are heme-binding, and mammalian IDOs catalyze the rate-limiting step in the catabolic breakdown of tryptophan [62], while family members in mollusks have a myoglobin function [63]. In mammals, IDO also appears to have a role in the immune system [62,64–66]. The IDO Pfam has matches to 66 proteins in public-100, all of which are eukaryotic. However, it also has matches to ten GOS-100 sequences that we confidently labeled as bacterial proteins and matches to 206 GOS-100 sequences for which a confident kingdom assignment could not be made (many of these are likely bacterial sequences due to the GOS sampling bias). To reconstruct a phylogeny of the IDO family, we searched a recent version of NCBI-nr (March 5, 2006) for IDO proteins that were not included in the public-100 dataset. The search identified two bacterial proteins from the whole genomes of the marine bacteria Erythrobacter litoralis and Nitrosococcus oceani, and 24 eukaryotic proteins (see Materials and Methods). The phylogeny shown in Figure 6 shows 54% bootstrap support for a separation of the clade containing exclusively public-100 and NCBI-nr 2006 eukaryotic sequences from a clade with the GOS-100 sequences as well as the two NCBI-nr E. litoralis and N. oceani sequences. We confirmed this feature of the tree topology with multiple other phylogeny reconstruction methods. Curiously, there is considerable intermixing of bacterial and eukaryotic sequences in the clade of GOS-100 sequences and the two NCBI-nr bacteria. A manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as bacterial, overwhelmingly supports the kingdom assignment. However, a manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as eukaryotes presents a less convincing
Table 8 shows 12 families that have a Pfam match to at least picture. These scaffolds are short, with most of them
one GOS-100 protein with an E-value 1 3 1010, and which containing only two voting ORFs. Since the NCBI-nr version
we confidently assigned to a kingdom different from that of used in the public-100 set has IDO from eukaryotes only, the
all the public-100 matches. Because our criteria for a ORF with the IDO domain itself would cast four votes for
‘‘confident’’ kingdom assignment are conservative, there are eukaryotes. Thus, these GOS-100 eukaryotic labelings are not
only one or a few confident assignments for each Pfam nearly as confident as the ones labeled bacterial.
domain to a ‘‘new’’ kingdom. Our ‘‘confident’’ criteria are
especially difficult to meet in the case of kingdom-crossing, Structural Genomics Implications
due to the votes contributed by the crossing protein (see Knowledge about global protein distributions can be used
Materials and Methods). Thus, many scaffolds have no to inform priorities in related fields such as structural
Table 7. Top Pfam Families Represented More Highly or Less Highly in GOS-100 than in Public-100
Accession Number | Description | Public-100 Hits: Archaea | Bacteria | Eukaryota | Viruses | Unknown | Total | GOS-100 Hits: Expected Based on Public-100 | Observed | Observed/Expected | Chi Square
Families represented more highly
PF07068 | Major capsid protein Gp23 | 0 | 0 | 0 | 41 | 0 | 41 | 8 | 1,818 | 23,450% | <1 × 10^-303
PF03420 | Prohead core protein protease | 0 | 0 | 0 | 11 | 0 | 11 | 6 | 1,223 | 22,176% | <1 × 10^-303
PF06841 | T4-like virus tail tube protein gp19 | 0 | 0 | 0 | 13 | 0 | 13 | 6 | 795 | 14,036% | <1 × 10^-303
PF04451 | Iridovirus major capsid protein | 0 | 0 | 1 | 138 | 0 | 139 | 15 | 1,692 | 11,269% | <1 × 10^-303
PF07230 | Bacteriophage T4-like capsid assembly protein (Gp20) | 0 | 0 | 0 | 211 | 0 | 211 | 20 | 1,633 | 7,992% | <1 × 10^-303
PF01818 | Bacteriophage translational regulator | 0 | 0 | 0 | 10 | 0 | 10 | 5 | 405 | 7,444% | <1 × 10^-303
PF01231 | Indoleamine 2,3-dioxygenase | 0 | 0 | 66 | 0 | 0 | 66 | 7 | 226 | 3,471% | <1 × 10^-303
PF03322 | Gamma-butyrobetaine hydroxylase | 0 | 13 | 117 | 0 | 0 | 130 | 60 | 1,807 | 3,004% | <1 × 10^-303
PF04777 | Erv1/Alr family | 0 | 0 | 177 | 10 | 0 | 187 | 10 | 309 | 2,996% | <1 × 10^-303
PF05367 | Phage endonuclease I | 0 | 2 | 0 | 10 | 0 | 12 | 13 | 290 | 2,152% | <1 × 10^-303
PF04832 | SOUL heme-binding protein | 3 | 8 | 173 | 0 | 1 | 185 | 43 | 714 | 1,648% | <1 × 10^-303
PF03159 | XRN 5'-3' exonuclease N-terminus | 0 | 0 | 214 | 2 | 0 | 216 | 11 | 170 | 1,584% | <1 × 10^-303
PF06213 | Cobalamin biosynthesis protein CobT | 0 | 33 | 0 | 0 | 0 | 33 | 137 | 2,155 | 1,569% | <1 × 10^-303
PF01786 | Alternative oxidase | 0 | 5 | 239 | 0 | 0 | 244 | 31 | 479 | 1,527% | <1 × 10^-303
PF00274 | Fructose-bisphosphate aldolase class-I | 0 | 28 | 932 | 0 | 0 | 960 | 143 | 2,076 | 1,453% | <1 × 10^-303
PF03291 | mRNA capping enzyme | 0 | 0 | 149 | 33 | 0 | 182 | 11 | 157 | 1,395% | <1 × 10^-303
PF04724 | Glycosyltransferase family 17 | 0 | 1 | 118 | 0 | 0 | 119 | 12 | 155 | 1,296% | <1 × 10^-303
PF00940 | DNA-dependent RNA polymerase | 0 | 5 | 208 | 23 | 0 | 236 | 32 | 394 | 1,222% | <1 × 10^-303
PF03030 | Inorganic H+ pyrophosphatase | 11 | 83 | 382 | 0 | 0 | 476 | 355 | 4,213 | 1,187% | <1 × 10^-303
PF02747 | Proliferating cell nuclear antigen, C-terminal domain | 19 | 0 | 175 | 9 | 0 | 203 | 21 | 243 | 1,153% | <1 × 10^-303
Families represented less highly
PF01617 | Surface antigen | 0 | 991 | 0 | 0 | 0 | 991 | 3,987 | 0 | 0% | <1 × 10^-303
PF00516 | Envelope glycoprotein GP120 | 0 | 0 | 1 | 41,115 | 11 | 41,127 | 3,071 | 0 | 0% | <1 × 10^-303
PF00077 | Retroviral aspartyl protease | 0 | 0 | 153 | 26,747 | 9 | 26,909 | 2,004 | 0 | 0% | <1 × 10^-303
PF04650 | YSIRK type signal peptide | 0 | 469 | 0 | 0 | 3 | 472 | 1,889 | 0 | 0% | <1 × 10^-303
PF03507 | CagA exotoxin | 0 | 333 | 0 | 0 | 0 | 333 | 1,343 | 0 | 0% | 4 × 10^-294
PF03482 | sic protein | 0 | 285 | 0 | 0 | 0 | 285 | 1,150 | 0 | 0% | 4 × 10^-252
PF01308 | Chlamydia major outer membrane protein | 0 | 264 | 0 | 0 | 0 | 264 | 1,066 | 0 | 0% | 8 × 10^-234
PF02707 | Major outer sheath protein N-terminal region | 0 | 264 | 0 | 0 | 0 | 264 | 1,066 | 0 | 0% | 8 × 10^-234
PF00934 | PE family | 0 | 249 | 0 | 0 | 0 | 249 | 1,005 | 0 | 0% | 1 × 10^-220
PF00820 | Borrelia lipoprotein | 0 | 223 | 0 | 0 | 0 | 223 | 901 | 0 | 0% | 6 × 10^-198
PF02722 | Major outer sheath protein C-terminal region | 0 | 223 | 0 | 0 | 0 | 223 | 901 | 0 | 0% | 6 × 10^-198
PF00921 | Borrelia lipoprotein | 0 | 202 | 0 | 0 | 0 | 202 | 816 | 0 | 0% | 1 × 10^-179
PF02876 | Staphylococcal/streptococcal toxin, beta-grasp domain | 0 | 197 | 3 | 1 | 2 | 203 | 797 | 0 | 0% | 3 × 10^-175
PF01856 | Outer membrane protein | 0 | 176 | 0 | 0 | 0 | 176 | 712 | 0 | 0% | 7 × 10^-157
PF01123 | Staphylococcal/streptococcal toxin, OB-fold domain | 0 | 166 | 3 | 1 | 2 | 172 | 672 | 0 | 0% | 4 × 10^-148
PF02474 | Nodulation protein A (NodA) | 0 | 157 | 0 | 0 | 0 | 157 | 636 | 0 | 0% | 3 × 10^-140
PF06458 | MucBP domain | 0 | 155 | 2 | 0 | 0 | 157 | 628 | 0 | 0% | 2 × 10^-138
PF03323 | Bacillus/clostridium GerA spore germination protein | 0 | 149 | 0 | 0 | 0 | 149 | 603 | 0 | 0% | 3 × 10^-133
PF07548 | Chlamydia polymorphic membrane protein middle domain | 0 | 146 | 0 | 0 | 0 | 146 | 591 | 0 | 0% | 1 × 10^-130
PF02255 | PTS system, lactose/cellobiose-specific IIA subunit | 0 | 141 | 0 | 0 | 0 | 141 | 571 | 0 | 0% | 3 × 10^-126
Green indicates exclusively bacterial in public-100; blue, exclusively eukaryotic in public-100; red, exclusively viral in public-100. Expected number of matches in GOS-100 to each Pfam model was calculated as described in Materials and Methods. This calculation is based on the number of matches to each Pfam in public-100 and corrected for the different kingdom proportions in GOS-100 and public-100. For each Pfam model, the percentage representation ratio is the number of observed GOS-100 matches to that Pfam divided by the number expected, and expressed as a percentage. The top half of the table shows the top 20 most highly represented proteins that have representation ratios > 1,000% and have chi-squared p-value < 1 × 10^-303. Numbers of observed matches to these Pfams in public-100 are also indicated according to kingdom. A number of Pfams highly represented in GOS-100 appear to occur exclusively or almost exclusively in a particular kingdom in public-100. For example, Pfams that are characteristically viral in public-100 (colored in red) dominate the top of this list, and an intriguing protein family (IDO) with a known immune function in higher eukaryotes (blue) also appears. The bottom half of the table shows the 20 Pfam domains not observed in GOS-100 with the highest expectation based on public-100 (or equivalently, with the most significant chi-squared p-values). Thus, a large number of key bacterial and viral pathogen proteins in public-100 are not observed in the oceanic samples.
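The representation ratio defined in the Table 7 legend can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the kingdom correction in `expected_hits` is only one plausible reading of "corrected for the different kingdom proportions" (the actual procedure is given in Materials and Methods, which is not reproduced here).

```python
def representation_ratio(observed: int, expected: float) -> float:
    """Observed GOS-100 matches as a percentage of the expected number."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 100.0 * observed / expected


def expected_hits(public_hits_by_kingdom: dict, scale_by_kingdom: dict) -> float:
    """ASSUMPTION: scale each kingdom's public-100 hit count by a
    per-kingdom factor (GOS-100 proteins / public-100 proteins in that
    kingdom) and sum over kingdoms. The factors are hypothetical inputs."""
    return sum(
        hits * scale_by_kingdom[kingdom]
        for kingdom, hits in public_hits_by_kingdom.items()
    )


# PF07068 (major capsid protein Gp23): 1,818 observed hits against the
# rounded expectation of 8 printed in Table 7.
ratio_gp23 = representation_ratio(1818, 8)
```

With the rounded expectation of 8, the ratio comes out at 22,725%, close to the 23,450% printed in Table 7; the gap is presumably because the table's ratio was computed from an unrounded expectation.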
Table 8. New Multi-Kingdom Pfams
Pfam Accession Number | Model Description
PF01231 | Indoleamine 2,3-dioxygenase
PF00629 | MAM domain
PF01753 | MYND finger
PF02089 | Palmitoyl protein thioesterase
PF05019 | Coenzyme Q (ubiquinone) biosynthesis protein Coq4
PF02919 | Eukaryotic DNA topoisomerase I, DNA binding fragment
PF06945 | Protein of unknown function (DUF1289)
PF04234 | Copper resistance protein CopC
PF04967 | HTH DNA-binding domain
PF01911 | Ribosomal LX protein
PF01889 | Membrane protein of unknown function (DUF63)
PF06626 | Protein of unknown function (DUF1152)
Remaining columns (values listed in extracted order):
Kingdom Specificity in Public-100 Database: A only, B only, E only
Matches in Public-100/Matches in GOS-100, Archaea: 88/3, 27/0, 21/0, 0/0, 0/0, 0/0, 0/0, 0/0, 0/1, 0/0, 0/1, 8/0
Matches in Public-100/Matches in GOS-100, Bacteria: 108/250, 91/42, 0/10, 0/6, 0/1, 0/1, 0/1, 0/0, 0/8, 0/3, 0/2, 0/1
Matches in Public-100/Matches in GOS-100, Eukaryota: 66/10, 712/1, 798/1, 197/0, 100/1, 173/6, 0/0, 0/0, 0/0, 0/0, 0/0, 0/0
Matches in Public-100/Matches in GOS-100, Viruses: 0/0, 0/0, 0/0, 0/0, 0/0, 0/0, 0/3, 0/0, 0/0, 0/0, 0/0, 0/0
Matches in Public-100/Matches in GOS-100, Unknown: 0/206, 0/165, 0/204, 0/239, 0/17, 0/13, 0/20, 0/80, 0/92, 0/15, 0/9, 0/2
Pfam TC: 107.1, 17.8, 15.1, 14.3, 36.2, 42.1, 47.8, 38.4, 40.7, 34.2, 45, 78
Best Score for Match in Novel Kingdom: 247.79, 342.32, 185.71, 250.97, 35.88, 57.91, 41.89, 60.46, 50.82, 74.63, 40.63, 51.44
Best E-Value for Match in Novel Kingdom: 3.50 × 10^-13, 9.70 × 10^-60, 1.40 × 10^-14, 2.20 × 10^-72; coefficients of the remaining eight values: 2.00, 7.00, 6.10, 4.30, 5.40, 3.50, 2.50, 3.40; powers of ten of the remaining eight values: 10^-100, 10^-71, 10^-11, 10^-16, 10^-17, 10^-14, 10^-22, 10^-12
Some Pfam domains observed exclusively in one kingdom in public-100 are found in a different kingdom in GOS-100. The number of sequences in the public dataset that match each Pfam model is listed above the number of sequences in GOS with a confident kingdom assignment and a highly significant match to the model. The TC bit score is provided for each model, together with the bit score and E-value of the best match to the model in an unexpected kingdom. For this analysis, Pfam matches are filtered with an E-value cutoff of 1 × 10^-10. In every case, the bit score is at least five bits greater than the TC for the model, because of the larger size of the GOS dataset relative to those used for creating the TC thresholds. In addition to passing the "confident" criteria (see Materials and Methods), the kingdom assignments are all confirmed by visual inspection of the BLAST kingdom vote distributions for the respective scaffolds.

Figure 6. Maximum Likelihood Phylogeny for the IDO Family
The phylogeny is based on an alignment of 93 sequences from GOS-100 and 51 sequences from public-100 and NCBI-nr from March 2006 that matched the IDO Pfam model and satisfied multiple alignment quality criteria. The IDO family is eukaryotic specific in public-100. The phylogeny shows a clade with all the GOS sequences, predicted to be bacterial (navy blue), eukaryotic (yellow), or unknown (gray), along with two sequences from the marine bacteria Erythrobacter litoralis and Nitrosococcus oceani (lime green) submitted to the sequence database after February 2005, and a public-only clade of only eukaryotic sequences (orange).

genomics. Structural genomics is an international effort to determine the 3-D shapes of all important biological macromolecules, with a primary focus on proteins [67–72]. Previous studies have shown that an efficient strategy for covering the protein structure universe is to choose protein targets for experimental structure characterization from among the largest families with unknown structure [73,74]. If the structure of one family member is determined, it may be used to accurately infer the fold of other family members, even if the sequence similarity between family members is too low to enable accurate structural modeling [75]. Therefore, large families are a focus of the production phase of the Protein Structure Initiative (PSI), the National Institutes of Health–funded structural genomics project that commenced in October 2005 [76].

In March 2005, 2,729 (36%) of 7,677 Pfam families had at least one member of known structure; these families could be used to infer folds for approximately 51% of all pre-GOS prokaryotic proteins (covering 44% of residues) [74]. The Pfam5000 strategy is to solve one structure from each of the largest remaining families, until a total of 5,000 families have at least one member with known structure [73]. As this strategy is similar to that being used at PSI centers to choose targets, projections based on the Pfam5000 should reflect PSI results. Completion of the Pfam5000, a tractable goal within the production phase of PSI, would enable accurate fold assignment for approximately 65% of all pre-GOS prokaryotic proteins. In the GOS-100 dataset, we observed that 46% of the proteins might currently be assigned a fold based on Pfam families of known structure (see Materials and Methods). Completion of the Pfam5000 would increase this coverage to 55%.
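The Pfam5000 strategy described above (solve one structure from each of the largest remaining families until 5,000 families are covered) amounts to a greedy selection. The sketch below is a hedged illustration only: the function names, family names, and sizes are hypothetical, and it assumes each protein is counted in exactly one family, which real multi-domain proteins violate.

```python
def pfam5000_targets(family_sizes, solved, goal=5000):
    """Pick the largest families without a known structure until `goal`
    families in total have at least one structure."""
    needed = max(0, goal - len(solved))
    unsolved = [fam for fam in family_sizes if fam not in solved]
    # Largest families first; ties broken by name for a deterministic order.
    unsolved.sort(key=lambda fam: (-family_sizes[fam], fam))
    return unsolved[:needed]


def fold_coverage(family_sizes, covered, total_proteins):
    """Fraction of proteins that fall in a family with a known structure.
    ASSUMPTION: every protein belongs to at most one family."""
    return sum(family_sizes[fam] for fam in covered) / total_proteins


# Toy data: hypothetical family names and sizes, 200 proteins overall.
sizes = {"FAM_A": 50, "FAM_B": 30, "FAM_C": 15, "FAM_D": 5}
solved = {"FAM_A"}
targets = pfam5000_targets(sizes, solved, goal=3)
coverage_before = fold_coverage(sizes, solved, total_proteins=200)
coverage_after = fold_coverage(sizes, solved | set(targets), total_proteins=200)
```

In this toy case the two largest unsolved families are chosen, and coverage rises from 25% to 47.5%, mirroring in miniature the kind of projected gain the text reports for GOS-100 (46% to 55%).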
The GOS sequences will affect Pfam in two ways: some will be classified in existing protein families, thus increasing the size of these families; others may eventually be classified into new GOS-specific families. Both of these will alter the relative sizes of different families, and thus their prioritization for structural genomics studies. We calculated the sizes for all Pfam families based on the number of occurrences of each family in the public-100 dataset. Proteins in GOS-100 were then added and the family sizes were recalculated. A total of 190 families that are not in the Pfam5000 based on public-100 are moved into the Pfam5000 after addition of the GOS data. The 30 largest such families are shown in Table 9. As 20 of the 30 families are annotated as domains of unknown function in Pfam, structural characterization might be helpful in identifying their cellular or molecular functions. Reshuffling the Pfam5000 to prioritize these 190 families would improve structural coverage of GOS sequences after completion of the Pfam5000 by almost 1% relative to the original Pfam5000 (from 55.4% to 56.1%), with only a small decrease in coverage of public-100 sequences (from 67.7% to 67.5%).

The Pfam5000 would be further reprioritized by the classification of clusters of GOS sequences into Pfam. Assuming each cluster of pooled GOS-100 and public-100 sequences without a current Pfam match would be classified as a single Pfam family, 885 such families would replace existing families in the Pfam5000. These 885 clusters contain a total of 383,019 proteins in GOS-100 and public-100. The reprioritized Pfam5000 would also retain 1,183 families of unknown structure from the current Pfam5000; these families comprise a total of 1,040,330 proteins in GOS-100 and public-100.

Known Protein Families and Increased Diversity Due to GOS Data
Several protein families serve as examples to further highlight the diversity added by the GOS dataset. In this paper, we examined UV irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO (Table 10). The RecA family (unpublished data) and the kinase family [77] have also been explored in the context of the GOS data. There are more than 5,000 RecA and RecA-like sequences in the GOS dataset (Table 10). An analysis of the RecA phylogeny including the GOS data reveals several completely new RecA subfamilies. A detailed study of kinases in the GOS dataset demonstrated the power of additional sequence diversity in defining and exploring protein families [77]. The discovery of 16,248 GOS protein kinase–like enzymes enabled the definition and analysis of 20 distinct kinase-like families. The diverse sequences allowed the definition of key residues for each family, revealing novel core motifs within the entire superfamily, and predicted structural adaptations in individual families. This data enabled the fusion of choline and aminoglycoside kinases into a single family, whose sequence diversity is now seen to be at least as great as the eukaryotic protein kinases themselves.

Proteins Involved in the Repair of UV-Induced DNA Damage
Much of the attention in studies of the microbes in the world's oceans has justifiably focused on phototrophy, such as that carried out by the proteorhodopsin proteins. Previously, in the Sargasso Sea study [10] it was shown that shotgun sequencing reveals a much greater diversity of proteorhodopsin-like proteins than was previously known from cloning and PCR studies. However, along with the potential benefits of phototrophy come many risks, such as the damage caused to cells by exposure to solar irradiation, especially the UV wavelengths. Organisms deal with the potential damage from UV irradiation in several ways, including protection (e.g., UV absorption), tolerance, and repair [78]. Our examination of the protein family clusters reveals that the GOS data provides an order of magnitude increase in the diversity (in both numbers and types) of homologs of proteins known to be involved in pathways specifically for repairing UV damage.

One aspect of the diversity of UV repair genes is seen in the overrepresentation of photolyase homologs in the GOS data (see Table 10). Photolyases are enzymes that chemically reverse the UV-generated inappropriate covalent bonds in cyclobutane pyrimidine dimers and 6–4 photoproducts [79]. The massive numbers of homologs of these proteins in the GOS data (11,569 GOS proteins in four clusters; see Table 10) is likely a reflection of their presence in diverse species and the existence of novel functions in this family. New repair functions could include repair of other forms of UV dimers (e.g., involving altered bases), use of novel wavelengths of light to provide the energy for repair, repair of RNA, or repair in different sequence contexts. In addition, some of these proteins may be involved in regulating circadian rhythms, as seen for photolyase homologs in various species. Our findings are consistent with the recent results of a comparative metagenomic survey of microbes from different depths that found an overabundance of photolyase-like proteins at the surface [51].

A good deal was known about the functions and diversity of photolyases prior to this project. However, much less is known about other UV damage–specific repair enzymes, and examination of the GOS data reveals a remarkable diversity of each of these. For example, prior to this project, there were only some 25 homologs of UV dimer endonucleases (UVDEs) available [80], and most of these were from the Bacillus species. There are 420 homologs of UVDE (cluster 6239) in the GOS data representing many new subfamilies (Figure 7A and Materials and Methods). A similar pattern is seen for spore lyases (which repair a UV lesion specific to spores [81]) and the pyrimidine dimer endonuclease (DenV, which was originally identified in T4 phage [82]). We believe this will also be true for UV dimer glycosylases [83], but predictions of function for homologs of these genes are difficult since they are in a large superfamily of glycosylases.

Our analysis of the kingdom classification assignments suggests that the diversity of UV-specific repair pathways is seen for all types of organisms in the GOS samples. This apparently extends even to the viral world (e.g., 51 of the UVDE homologs are assigned putatively to viruses), suggesting that UV damage repair may be a critical function that phages provide for themselves and their hosts in ocean surface environments. Based on the sheer numbers of genes, their sequence diversity, and the diversity of types of organisms in which they are apparently found, we conclude that many novel UV damage–repair processes remain to be discovered in organisms from the ocean surface water.
Table 9. The 30 Largest Structural Genomics Target Families Added to the Pfam5000 Based on Inclusion of GOS Sequences
Accession Number | Description | Family Size after GOS | Family Size before GOS
PF06213.2 | Cobalamin biosynthesis protein CobT | 2,188 | 33
PF04244.3 | Deoxyribodipyrimidine photolyase-related protein | 1,628 | 51
PF07021.1 | Methionine biosynthesis protein MetW | 1,305 | 50
PF03420.3 | Prohead core protein protease | 1,234 | 11
PF06347.2 | Protein of unknown function (DUF1058) | 1,114 | 40
PF06439.1 | Domain of unknown function (DUF1080) | 1,021 | 48
PF06253.1 | Trimethylamine methyltransferase (MTTB) | 942 | 38
PF06242.1 | Protein of unknown function (DUF1013) | 915 | 36
PF06841.2 | T4-like virus tail tube protein gp19 | 808 | 13
PF05992.2 | SbmA/BacA-like family | 746 | 26
PF04018.3 | Domain of unknown function (DUF368) | 720 | 54
PF06041.1 | Bacterial protein of unknown function (DUF924) | 416 | 59
PF03209.5 | PUCC protein | 415 | 48
PF06146.1 | Phosphate-starvation-inducible E | 393 | 44
PF06175.1 | tRNA-(MS[2]IO[6]A)-hydroxylase (MiaE) | 342 | 46
The 30 largest families after inclusion of GOS data that were not among the 5000 largest families before inclusion of GOS data are shown here. Family size was calculated as the number of
matches in public-100 (before GOS) and in the combined GOS-100 and public-100 datasets (after GOS).
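The comparison behind Table 9 (which families enter the set of largest families once GOS matches are counted) can be sketched as a before-and-after ranking. This is an illustrative sketch, not the authors' pipeline: `top_n` and `newly_included` are hypothetical names, and FAM_X and FAM_Y are invented families with invented sizes; only the two PF entries use sizes taken from Table 9.

```python
def top_n(sizes, n):
    """Names of the n largest families (ties broken by name)."""
    return sorted(sizes, key=lambda fam: (-sizes[fam], fam))[:n]


def newly_included(before, after, n):
    """Families in the top n after adding GOS that were not in it before,
    listed largest first."""
    old_top = set(top_n(before, n))
    return [fam for fam in top_n(after, n) if fam not in old_top]


# Family sizes before GOS (public-100 only) and after GOS (public-100 plus
# GOS-100). FAM_X and FAM_Y are hypothetical.
before = {"PF06213": 33, "PF04244": 51, "FAM_X": 500, "FAM_Y": 400}
after = {"PF06213": 2188, "PF04244": 1628, "FAM_X": 520, "FAM_Y": 410}
added = newly_included(before, after, n=2)
```

Here both Pfam families displace the two hypothetical ones from the top two. The paper applies the same idea with n = 5,000 and reports 190 families moving in.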
Evidence of Reversible Phosphorylation in the Oceans
Reversible phosphorylation of proteins represents a major mechanism for cellular processes, including signal transduction, development, and cell division [84]. Protein kinases and phosphatases serve as antagonistic regulators of the cellular response. Protein phosphatases are divided into three major groups based on substrate specificity [85]. The Mg2+- or Mn2+-dependent phosphoserine/phosphothreonine protein phosphatase family, exemplified by the human protein phosphatase 2C (PP2C), represents the
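Table 10. Clustering of Sequences in Families That Are Explored in This and Companion Papers
Protein Family | Cluster ID | Nonredundant Sequences | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS
RecA | 1146 | 2,897 | 7,423 | 1,683 | 235 | 288 | 104 | 5,113
UVDE | 6239 | 417 | 484 | 38 | 25 | 1 | 0 | 420
Photolyase | 411 | 1,387 | 2,261 | 19 | 9 | 0 | 0 | 2,233
Photolyase | 1285 | 5,907 | 9,796 | 302 | 145 | 182 | 15 | 9,152
Photolyase | 3077 | 319 | 482 | 149 | 2 | 176 | 42 | 113
Photolyase | 3454 | 67 | 73 | 1 | 1 | 0 | 0 | 71
Spore lyase | 5283 | 237 | 331 | 39 | 25 | 0 | 0 | 267
PP2C phosphatase | 78 | 2,917 | 3,933 | 762 | 112 | 2,295 | 199 | 565
PP2C phosphatase | 3673 | 62 | 106 | 39 | 0 | 22 | 45 | 0
PP2C phosphatase | 9118 | 68 | 73 | 0 | 0 | 72 | 1 | 0
PP2C phosphatase | 11012181 | 36 | 69 | 34 | 0 | 15 | 20 | 0
PP2C phosphatase | 11021747 | 19 | 72 | 13 | 11 | 0 | 0 | 48
PP2C phosphatase | 11066319 | 19 | 38 | 14 | 0 | 15 | 9 | 0
Glutamine synthetase (type I, II, III) | 3709 | 4,284 | 11,322 | 1,504 | 320 | 489 | 48 | 8,961
Glutamine synthetase (type I, II, III) | 3072 | 159 | 192 | 46 | 11 | 6 | 0 | 129
Glutamine synthetase (type I, II, III) | 4547 | 30 | 32 | 1 | 0 | 1 | 0 | 30
RuBisCO (large subunit) | 3734 | 1,979 | 14,149 | 13,532 | 41 | 148 | 0 | 428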
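The per-family GOS totals quoted in the text (11,569 photolyase homologs, 613 PP2C-like sequences, 9,120 GS and GS-like sequences) can be recovered by summing the GOS column of Table 10 across each family's clusters. The short Python check below is a sketch written for this purpose; the data structure and function names are hypothetical, and the numbers are copied from three families in Table 10.

```python
# Per cluster: (NCBI-nr, PG, TGI-EST, ENS, GOS, reported Total Sequences),
# copied from Table 10 for three of the families.
TABLE_10 = {
    "Photolyase": {
        411: (19, 9, 0, 0, 2233, 2261),
        1285: (302, 145, 182, 15, 9152, 9796),
        3077: (149, 2, 176, 42, 113, 482),
        3454: (1, 1, 0, 0, 71, 73),
    },
    "PP2C phosphatase": {
        78: (762, 112, 2295, 199, 565, 3933),
        3673: (39, 0, 22, 45, 0, 106),
        9118: (0, 0, 72, 1, 0, 73),
        11012181: (34, 0, 15, 20, 0, 69),
        11021747: (13, 11, 0, 0, 48, 72),
        11066319: (14, 0, 15, 9, 0, 38),
    },
    "Glutamine synthetase": {
        3709: (1504, 320, 489, 48, 8961, 11322),
        3072: (46, 11, 6, 0, 129, 192),
        4547: (1, 0, 1, 0, 30, 32),
    },
}


def gos_total(family):
    """Sum the GOS column over all clusters of one protein family."""
    return sum(row[4] for row in TABLE_10[family].values())


def inconsistent_rows():
    """Clusters whose five source counts do not add up to the reported total."""
    return [
        (family, cluster_id)
        for family, clusters in TABLE_10.items()
        for cluster_id, row in clusters.items()
        if sum(row[:5]) != row[5]
    ]
```

The same row-sum check is a quick way to catch transcription errors when a flattened table such as this one is rebuilt from extracted text.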
Figure 7. Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined
Kingdom assignments of the sequences are indicated by color: yellow, GOS-eukaryotic; navy blue, GOS-bacterial/archaeal; aqua, GOS-viral; orange,
NCBI-nr–eukaryotic; lime green, NCBI-nr–bacterial/archaeal; pink, NCBI-nr–viral; gray, unclassified.
(A) Phylogeny of UVDE homologs.
(B) Phylogeny of PP2C-like sequences.
(C) Phylogeny of type II GS gene family. In addition to the large amount of diversity of bacterial type II GS in the GOS data, a large group of GOS viral
sequences and eukaryotic GS co-occur at the top of the tree with the eukaryotic virus Acanthamoeba polyphaga mimivirus (shown in pink). The red stars
indicate the locations of eight type II GS sequences found in the type I–type II GS gene pairs. They are located in different branches of the phylogenetic
tree. The rest of the type II GS sequences were filtered out by the 98% identity cutoff.
(D) Phylogeny of the homologs of RuBisCO large subunit. A large portion of the RuBisCO sequences from the GOS data forms new branches that are
distinct from the previously known RuBisCO sequences in the NCBI-nr database.
smallest group in number. An understanding of their physiological roles has only recently begun to emerge. In eukaryotes, one of the major roles of PP2C activity is to reverse stress-induced kinase cascades [86–89].

We identified 613 PP2C-like sequences in the GOS dataset, and they are grouped into two clusters (Table 10). These sequences contain at least seven motifs known to be important for phosphatase structure and function [90,91]. Invariant residues involved in metal binding (aspartate in motifs I, II, VIII) and phosphate ion binding (arginine in motif I) are highly conserved among the GOS sequences. Using the catalytic domain portion of these sequences we constructed a phylogeny showing that despite the overall conserved structure of the PP2C family of proteins, the known bacterial PP2C-like sequences group together with the GOS bacterial PP2C-like sequences (Figure 7B, Materials and Methods). Furthermore, the eukaryotic PP2Cs display a much greater degree of sequence divergence compared to the bacterial PP2C sequences.

We also examined the combined dataset of PP2C-like phosphatases further for potential differences in amino acid composition between the bacterial and eukaryotic groups. We observed a striking distinction between the eukaryotic and bacterial PP2C-like phosphatases in motif II, where a histidine residue (His62 in human PP2Cα) is conserved in more than 90% of eukaryotic sequences, but not observed in the bacterial group. The bacterial PP2C group contains a methionine (at the corresponding position) in the majority of the cases (70%). This histidine residue is involved in the formation of a beta hairpin in the crystal structure of human PP2C [91]. Furthermore, His62 is proposed to act as a general acid for PP2C catalysis [92]. Both amino acids lie in the proximity of the phosphate-binding domain, but at this time it is unclear how the difference at this position would contribute to the overall structure and function of the two PP2C groups. Nonetheless, the large number of diverse PP2C-like phosphatases in this dataset allowed us to identify a previously unrecognized key difference between bacterial and eukaryotic PP2Cs.

Bacterial genes that perform closely related functions can be organized in close proximity to each other and often in functional units. Linked Ser/Thr kinase-phosphatase genetic units have been described in several bacterial species, including Streptococcus pneumoniae, Bacillus subtilis, and Mycobacterium tuberculosis [93–96]. Two major neighboring clusters are found to be associated with the set of PP2C-like phosphatases in the GOS bacterial group. We observed that one of these clusters contained a protein serine/threonine kinase domain as its most common Pfam domain. An additional neighboring cluster found to be associated with the GOS set of bacterial PP2Cs was identified as a set of sequences containing a PASTA (penicillin-binding protein and serine/threonine kinase–associated) domain. This domain is unique to bacterial species, and is believed to play important roles in regulating cell wall biosynthesis [97].

Our identification of a conserved group of unique PP2C-like phosphatases in the GOS dataset significantly increases the number and diversity of this enzyme family. This analysis of the NCBI-nr, PG ORFs, TGI-EST ORFs, and ENS datasets along with the sequences obtained from the GOS dataset significantly increases the overall number of PP2C-like sequences from that estimated just a year ago [98]. The presence of genes encoding bacterial serine/threonine kinase domains located adjacent to PP2Cs in the GOS data supports the notion that the process of reversible phosphorylation on Ser/Thr residues controls important physiological processes in bacteria.

Proteases in GOS Data
Proteases are a group of enzymes that degrade other proteins and, as such, play important roles in all organisms [99]. On the basis of their catalysis mechanism, proteases are divided into six distinct catalytic types: aspartic, cysteine, metallo, serine, threonine, and glutamic proteases [99]. They differ from each other by the presence of specific amino acids in the active site and by their mode of action. The MEROPS database [100] is a comprehensive source of information for this large divergent group of sequences and provides a widely accepted classification of proteases into families, based on amino acid sequence comparison, and then into clans based on the similarity of their 3-D structures.

We identified 222,738 potential proteases in the GOS dataset based on similarity to sequences in MEROPS (see Materials and Methods). According to our clustering method, 95% of these sequences are grouped into 190 clusters, with each cluster on average containing more than 1,100 GOS sequences. These sequences were compared to proteases in NCBI-nr. There are groups of proteases in NCBI-nr that are highly redundant. For example, there are a large number of viral proteases from HIV-1 and hepatitis C viruses that dominate the NCBI-nr protease set. Thus, we computed a nonredundant set of NCBI-nr proteases and, for the sake of consistency, a nonredundant set of proteases from the GOS set using the same parameters. Both sets are dominated by cysteine, metallo, and serine proteases. The GOS dataset is dominated by proteases belonging to the bacterial kingdom. That is not surprising, given the filter sizes used to collect the samples. In NCBI-nr the proteases are more evenly distributed between the bacterial and the eukaryotic kingdoms.

Our comparison of the protease clan distribution of the bacterial sequences in the NCBI-nr and GOS sets reveals that the distribution of clans is very similar for metallo- and serine proteases. However, the distribution of clans in aspartic and cysteine proteases is different in the two datasets. Among aspartic proteases, the most visible difference is the increased ratio of proteases of the AC clan and the decreased ratio in the AD clan. Proteases in the former clan are involved in bacterial cell wall production, while those in the latter clan are involved in pilin maturation and toxin secretion [99]. Among cysteine proteases, the most apparent difference is the decrease in the CA clan and an increase in the number of proteases from the PB(C) clan. Bacterial members of the CA clan are mostly involved in degradation of bacterial cell wall components and in various aspects of biofilm formation [99]. It is possible that both activities are less important for marine bacteria present in surface water. Proteases from the PB(C) clan are involved in activation (including self-activation) of enzymes from the acetyltransferase family. In fungi this family is involved in penicillin synthesis, while their function in bacteria is unknown [99].

We were unable to detect any caspases (members of the CD clan) in the GOS data. This is consistent with the apoptotic cell death mechanism being present only in multicellular eukaryotes, which, based on the filter sizes, are expected to be very rare in the GOS dataset.

Metabolic Enzymes in the GOS Data
To gain insights into the diversity of metabolism of the organisms in the sea, we studied the abundance and diversity of glutamine synthetase (GS) and ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO), two key enzymes in nitrogen and carbon metabolism.

GS is the central player of nitrogen metabolism in all organisms on earth. It is one of the oldest enzymes in evolution [101]. It converts ammonia and glutamate into
glutamine that can be utilized by cells. GS can be classified into three types based on
sequence [101]. Type I has been found only in bacteria, and it forms a dodecameric structure
[102,103]. Type II has been found mainly in eukaryotes, and in some bacteria. Type III GS is
less well studied, but has been found in some anaerobic bacteria and cyanobacteria. There are
18 active site residues in both bacterial and eukaryotic GS that play important roles in
binding substrates and catalyzing the enzymatic reactions [104].
We found 9,120 GS and GS-like sequences in the GOS data (Table 10). Using profile HMMs [41,105]
constructed from known GS sequences of different types, we were able to classify 4,350
sequences as type I GS, 1,021 sequences as type II GS, and 469 sequences as type III GS (see
Materials and Methods).
The number of type II GS sequences found in the GOS data is surprisingly high, since previously
type II GS were considered to be mainly eukaryotic and very few eukaryotic organisms were
expected to be included in the GOS sequencing (Figure 7C and Materials and Methods). We used
gene neighbor analysis to classify the origin of GS genes by the nature of other proteins found
on the same scaffold. Using this approach, most of the neighboring genes of the type II GS in
the GOS data are identified as bacterial genes. The neighboring genes of the type II GS include
nitrogen regulatory protein PII, signal transduction histidine kinase, NH3-dependent NAD+
synthetase, A/G-specific adenine glycosylase, coenzyme PQQ synthesis protein c, pyridoxine
biosynthesis enzyme, aerobic-type carbon monoxide dehydrogenase, etc. We were able to assign
more than 90% of the type II GS sequences in the GOS data to bacterial scaffolds based on a
BLAST-based kingdom assignment method (see Materials and Methods). Both neighboring genes and
kingdom assignments suggest that most of the type II GS sequences in the GOS data come from
bacterial organisms. In comparison, the same type II GS profile HMM detects only 12 putative
type II GS sequences from the PG dataset of 222 prokaryotic genomes. Within these, there are
only seven unique type II GS sequences and six unique bacterial species represented. The reason
why bacteria in the ocean have so many type II GS genes is unclear.
Two hypotheses have been raised to explain the origin of type II GS in bacterial genomes:
lateral gene transfer from eukaryotic organisms [106] and gene duplication prior to the
divergence of prokaryotes and eukaryotes [101]. The type II GS sequences in the predominantly
bacterial GOS data are not only abundant, but also diverse and divergent from most known
eukaryotic GS sequences (Figure 7C). This makes the hypothesis of lateral gene transfer less
favorable. If the GS gene duplication preceded the prokaryote–eukaryote divergence according to
the gene duplication hypothesis, it is possible that many oceanic organisms retained type II GS
genes during evolution.
Interestingly, we found 19 cases where a type I GS gene is adjacent to a type II GS gene on the
same scaffold. Both GS genes seem to be functional based on the high degree of conservation of
active site residues. The same gene arrangement was observed previously in Frankia alni CpI1
[107]. The functional significance of maintaining two types of GS genes adjacent to one another
in the genome remains to be elucidated. Most of the sequences of these GS genes are highly
similar. We examined the geographic distribution of these adjacent GS sequences across all the
GOS samples. They are mainly found in the samples taken from two sites. Their geographic
distribution is significantly different from the distributions of types I and II GS across the
samples. The high sequence similarity among the adjacent GS pairs and their geographic
distribution suggest that these adjacent GS sequences may come from only a few closely related
organisms. This is consistent with the protein sequence tree of type II GS, where the type II
GS sequences from the GS gene pairs mainly reside in two distinct branches (Figure 7C).
The active site residues are very well conserved in all GS sequences in the GOS data, except
one residue, Y179, which coordinates the ammonium-binding pocket. We observed substitutions of
Y179 to phenylalanine in about half of the type II GS sequences. The activity of type I GS in
some bacteria is regulated by adenylylation at residue Tyr397. In the GOS data, Tyr397 is
relatively conserved in type I GS, with variations to phenylalanine and tryptophan in about
half of the sequences. This indicates that the activity of some of the type I GS is not
regulated by adenylylation, as shown previously in some Gram-positive bacteria [108,109].
RuBisCO is the key enzyme in carbon fixation. It is the most abundant enzyme on earth [110] and
plays an important role in carbon metabolism and the CO2 cycle. RuBisCO can be classified into
four forms. Form I has been found in both plants and bacteria, and has an octameric structure.
Form II has been found in many bacteria, and it forms a dimer in Rhodospirillum rubrum. Form
III is mainly found in archaea, and forms various oligomers. Form IV, also called the
RuBisCO-like protein (RLP), has been recently discovered from bacterial genome-sequencing
projects [111,112]. RLP represents a group of proteins that do not have RuBisCO activity, but
resemble RuBisCO in both sequence and structure [111,113]. The functions of RLPs are largely
unknown and seem to differ from each other.
Contrary to the large number of GS sequences, we identified only 428 sequences homologous to
the RuBisCO large subunit in the GOS data. The small number of RuBisCO sequences may partly be
due to the fact that larger-sized bacterial organisms were not included in the sequencing
because of size filtering. However, it could also indicate that CO2 is not the major carbon
source for these sequenced ocean organisms.
The RuBisCO homologs in the GOS data are more diverse than the currently known RuBisCOs (Figure
7D, Materials and Methods). Six of 19 active site residues (N123, K177, D198, F199, H327, and
G404) are not well conserved in all sequences, suggesting that the proteins with these
mutations may have evolved to have new functions, such as in the case of RLPs. From the studies
of the RLPs from Chlorobium tepidum and B. subtilis [111,114], it has been shown that the
active site of RuBisCO can accommodate different substrates and is potentially capable of
evolving new catalytic functions [113,114]. On the other hand, two sequence motifs, helices αB
and α8, that are not involved in substrate binding and catalytic activity are well conserved in
the GOS RuBisCO sequences. The higher degree of conservation of these nonactive site residues
than that of active site residues suggests that these motifs are important for their structure,
function, or interaction with other proteins.
We found 47 (31 at 90% identity filtering) GOS sequences in the branch with known RLP sequences
in a phylogenetic
tree of RuBisCO (Figure 7D). In this phylogenetic tree, in addition to the clades for each of
the four forms of RuBisCO, there are also new groups of 65 (58 at 90% identity filtering) GOS
sequences that do not cluster with any known RuBisCO sequences. This indicates that there could
be more than one type of RuBisCO-like protein existing in organisms. The novel groups of
RuBisCO homologs in the GOS data also suggest that we have not fully explored the entire
RuBisCO family of proteins (Figure 7D).

GOS Data and Remote Homology Detection
The addition of GOS sequences may help greatly in defining the range and diversity of many
known protein families, both by addition of many new sequences and by the increased diversity
of GOS sequences. Our comparison of HMM scores for GOS sequences with those from the other four
datasets shows that GOS sequences consistently tend to have lower scores, which indicates
additional diversity from that captured in the original HMM (Figure 8). The addition of GOS
data into domain profiles may broaden the profile and allow it to detect additional remote
family members in both GOS and other datasets. As a trial, we rebuilt the Pfam model PF01396,
which describes a zinc finger domain within bacterial DNA topoisomerase. The original model
finds 821 matches to 481 proteins in NCBI-nr. Our model that includes GOS sequences reveals
1,497 matches to 722 sequences, an increase of 50% in sequences and 82% in domains (most
topoisomerases have three such domains, of which one is divergent and difficult to detect). Of
these new matches, 104 are validated by the presence of additional topoisomerase domains, or
they are annotated as topoisomerase, while most others are unannotated or similar to other
DNA-modifying enzymes not previously thought to have zinc finger domains.
HMM profiles can be further exploited by using matches beyond the conservative trusted cutoff
(TC) used in this study. For instance, the Pfam for the poxvirus A22 protein family has no GOS
matches above the TC, but 137 matches with E-values of 1 × 10⁻³ to 1 × 10⁻¹⁰ that contain a
short conserved motif overlapping with A22 proteins. Alignment of these matches shows an
additional two short motifs in common with A22, establishing their homology, and using a
profile HMM, we found a total of 269 family members in GOS and eight family members in NCBI-nr.
Many members of this new family are surrounded by other novel clusters, or are in putative
viral scaffolds, suggesting that these weak matches are an entry point into a new clade of
viruses.

Figure 8. Distribution of Average HMM Score Difference between GOS and Public (NCBI-nr, MG,
TGI-EST, and ENS)
Only matches to the full length of an HMM are considered, and only HMMs that have at least 100
matches to each of GOS and public databases are considered. This results in 1,686 HMMs whose
average scores to GOS and public databases are considered. The mean of the distribution is −50,
showing that GOS sequences tend to score lower than sequences in public, thereby reflecting
diversity compared to sequences in public.
doi:10.1371/journal.pbio.0050016.g008

ORFans with Matches in GOS Data
Further evidence of the diversity added by GOS sequences is provided by their matches to
ORFans. ORFans are sequences in current protein databases that do not have any recognizable
homologs [117]. ORFan sequences (discounting those that may be spurious gene predictions)
represent genes with organism-specific functions or very remote homologs of known families.
They have the potential to shed light on how new proteins emerge and how old ones diversify.
We identified 84,911 ORFans (5,538 archaea, 35,292 bacteria, 37,427 eukaryotic, 5,314 virus,
and 1,340 unclassified) from the NCBI-nr dataset using CD-HIT [116,117] and BLAST (see
Materials and Methods). Of these, 6,044 have matches to GOS sequences using BLAST (E-value ≤
1 × 10⁻⁶). Figure 9 shows the distribution of the matched ORFans grouped by organisms, number
of their GOS matches, and the lowest E-value of the matches. We found matches to GOS sequences
for 13%, 6.3%, 0.89%, and 8.9% of bacterial, archaeal, eukaryotic, and viral ORFans,
respectively. While most of these ORFans have very few GOS matches, 626 of them have ≥20 GOS
matches. The similarities between GOS sequences and eukaryotic ORFans are much weaker than
those between GOS sequences and noneukaryotic ORFans. The average sequence identity between
eukaryotic ORFans and their closest GOS matches is 38%. This is 6% lower than the identity
between noneukaryotic ORFans and their closest GOS matches.
The ORFans that match GOS sequences are from approximately 600 organisms. Table 11 lists the 20
most populated organisms. Out of the 6,044 matched ORFans, approximately 2,000 are from these
20 organisms. For example, Rhodopirellula baltica SH 1, a marine bacterium, has 7,325 proteins
deposited in NCBI-nr. We identified 1,418 ORFans in this organism, of which 322 have GOS
matches. Another interesting example in this list is Escherichia coli. Although there are >20
different strains sequenced, 168 ORFans are identified in strain CFT073, and 67 of them have
GOS matches. The only eukaryotic organism in this list is Candida albicans SC5314, a fungal
human pathogen, which has 49 ORFans with GOS matches.
We examined a small but interesting subset of the ORFans that have 3-D structures deposited in
PDB. Out of 65 PDB ORFans, GOS matches for eight of them are found (see Supporting Information
for their PDB identifiers and names).
They include four restriction endonucleases, three hypothetical proteins, and a
glucosyltransferase.

Figure 9. Pie Chart of ORFans That Had GOS Matches
ORFans are grouped by organism (left), number of their GOS matches (middle), and the lowest
E-value to their GOS matches in negative logarithm form (right). For both middle and right
charts, inner and outer circles represent noneukaryotic and eukaryotic ORFans, respectively.
From the middle chart it is seen that 626 (= 404 + 180 + 21 + 21) ORFans form significant
protein families with ≥20 GOS matches.

GOS sequences can play an important role in identifying the functions of existing ORFans or in
confirming protein predictions. For example, we found that the hypothetical protein AF1548,
which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the
query against a combined set of GOS and NCBI-nr identified several significant restriction
endonucleases after three iterations. With the support of 3-D structure and multiple sequence
alignment of AF1548 and its GOS matches, we predict that AF1548 along with its GOS homologs are
restriction endonucleases (Figure 10). When combined with an established consensus of active
sites of the related endonuclease families [118], we predicted three catalytic residues.

Genome Sequencing Projects and Protein Exploration
With respect to protein exploration and novel family discovery, microbial sequencing offers
more promise compared to sequencing more mammalian genomes. This is illustrated by Figure 11,
where the number of clusters that protein predictions from various finished mammalian genomes
fall into was compared to the number of clusters that similar-sized random subsets of microbial
sequences fall into (see Materials and Methods). As the figure shows, the rate of protein
family discovery is higher for microbes than for mammals. Indeed, the rate of new family
discovery is plateauing for mammalian sequences. This is not surprising,
Table 11. Top 20 Organisms with Most ORFans Matched by GOS

Organism                                  Total Proteins (a)  Total ORFans  ORFans Matched
Rhodopirellula baltica SH 1                            7,325         1,418             322
Shewanella oneidensis MR-1                             4,472           292             206
Cytophaga hutchinsonii                                 3,686           555             170
Bdellovibrio bacteriovorus HD100                       3,587           753             152
Kineococcus radiotolerans SRS30216                     4,559         1,070             125
Synechococcus sp. WH 8102                              2,517           143             116
Burkholderia cepacia R18194                            7,717           198             100
Aeropyrum pernix K1                                    1,841         1,312              95
Burkholderia cepacia R1808                             7,915           292              94
Magnetospirillum magnetotacticum MS-1                 10,146           826              92
Microbulbifer degradans 2–40                           4,038           386              85
Burkholderia fungorum LB400                            7,994           190              84
Desulfitobacterium hafniense DCB-2                     4,389           758              75
Escherichia coli CFT073                                5,379           168              67
Bradyrhizobium japonicum USDA 110                      8,317           580              66
Acanthamoeba polyphaga mimivirus                         911           304              57
Caulobacter crescentus CB15                            3,737           333              56
Rubrivivax gelatinosus PM1                             4,307           287              53
Mesorhizobium loti MAFF303099                          7,272           370              53
Candida albicans SC5314                               14,107         1,647              49

(a) Total number of proteins of this organism deposited at NCBI; may have redundant entries.
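
As a quick consistency check on Table 11, the rows can be tabulated and summed. The sketch below is illustrative only and is not part of the original analysis: the names TABLE_11, TOTAL_MATCHED_ORFANS, and matched_fraction are assumptions introduced here, and the figure of 6,044 is the total number of matched ORFans quoted in the text.

```python
# Rows transcribed from Table 11:
# (organism, total proteins, total ORFans, ORFans matched by GOS)
TABLE_11 = [
    ("Rhodopirellula baltica SH 1", 7325, 1418, 322),
    ("Shewanella oneidensis MR-1", 4472, 292, 206),
    ("Cytophaga hutchinsonii", 3686, 555, 170),
    ("Bdellovibrio bacteriovorus HD100", 3587, 753, 152),
    ("Kineococcus radiotolerans SRS30216", 4559, 1070, 125),
    ("Synechococcus sp. WH 8102", 2517, 143, 116),
    ("Burkholderia cepacia R18194", 7717, 198, 100),
    ("Aeropyrum pernix K1", 1841, 1312, 95),
    ("Burkholderia cepacia R1808", 7915, 292, 94),
    ("Magnetospirillum magnetotacticum MS-1", 10146, 826, 92),
    ("Microbulbifer degradans 2-40", 4038, 386, 85),
    ("Burkholderia fungorum LB400", 7994, 190, 84),
    ("Desulfitobacterium hafniense DCB-2", 4389, 758, 75),
    ("Escherichia coli CFT073", 5379, 168, 67),
    ("Bradyrhizobium japonicum USDA 110", 8317, 580, 66),
    ("Acanthamoeba polyphaga mimivirus", 911, 304, 57),
    ("Caulobacter crescentus CB15", 3737, 333, 56),
    ("Rubrivivax gelatinosus PM1", 4307, 287, 53),
    ("Mesorhizobium loti MAFF303099", 7272, 370, 53),
    ("Candida albicans SC5314", 14107, 1647, 49),
]

# Total number of NCBI-nr ORFans with GOS matches, as quoted in the text.
TOTAL_MATCHED_ORFANS = 6044


def matched_fraction(row):
    """Fraction of an organism's ORFans that have at least one GOS match."""
    _, _, orfans, matched = row
    return matched / orfans


# Matched ORFans contributed by the 20 listed organisms, and their share of all matches.
top20_matched = sum(row[3] for row in TABLE_11)
top20_share = top20_matched / TOTAL_MATCHED_ORFANS

# Organism whose ORFans are most completely covered by GOS matches.
best = max(TABLE_11, key=matched_fraction)

if __name__ == "__main__":
    print(f"Top-20 matched ORFans: {top20_matched} ({top20_share:.1%} of all matched)")
    print(f"Highest matched fraction: {best[0]} ({matched_fraction(best):.1%})")
```

The top-20 total comes to 2,117 matched ORFans, in line with the "approximately 2,000" stated in the text.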
Figure 10. Structure and GOS Homologs of Hypothetical Protein AF1548

Yellow bars represent β-strands. Highlighted are predicted catalytic residues: 38D, 51E, and 53K.
as mammalian divergence from a common ancestor is much more recent than microbial divergence
from a common ancestor, which suggests that mammals will share a larger core set of
less-diverged proteins. Microbial sequencing is also more cost effective than mammalian
sequencing for acquiring protein sequences because microbial protein density is typically
80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but
issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course,
other reasons to sequence mammalian genomes, such as understanding mammalian evolution and
mammalian gene regulation.

Figure 11. Rate of Cluster Discovery for Mammals Compared to That for Microbes
The x-axis denotes the number of sequences (in thousands), and the y-axis denotes the number of
clusters (in thousands). Five mammalian genomes are considered for the “Mammalian” dataset, and
the plot shows the number of clusters that are hit when each additional genome is added. For
the “Mammalian Random” dataset, the order of the sequences from the “Mammalian” dataset is
randomized. For the NCBI-nr prokaryotic and GOS datasets, random subsets of size similar to
that of the mammalian set are considered.

Conclusions
The rate of protein family discovery is approximately linear in the (current) number of protein
sequences. Additional sequencing, especially of microbial environments, is expected to reveal
many more protein families and subfamilies. The potential for discovering new protein families
is also supported by the GOS diversity seen at the nucleotide level across the different
sampling sites [30]. Averaged over the sites, 14% of the GOS sequence reads from a site are
unique (at 70% nucleotide identity) to that site [30].
The GOS data provide almost complete coverage of known prokaryotic protein families. In
addition, they add a great deal of diversity to many known families and offer new insights into
the evolution of these families. This is illustrated using several protein families, including
UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA
(unpublished data), and kinases [77]. Only a handful of protein families have been examined
thus far, and many thousands more remain to be explored.
The protein analysis presented indicates that we are far from exploring the diversity of
viruses. This is reflected in several of the analyses. The GOS-only clusters show an
overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM
profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the
other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes
and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of
viruses in a variety of environmental settings is only now beginning to be understood
[57,119–121]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows
that multiple viral protein clusters contain significant numbers of host-derived
proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans.
Data generated by this GOS study and similar environmental shotgun sequencing studies present
their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile
construction, phylogeny inference, etc.) are generally designed and optimized to work with full
sequences. They have to be tailored to analyze the mostly fragmentary sequences that are
generated by these projects. Nevertheless, these data are a valuable source of new discoveries.
These data have the potential to refine old hypotheses and make new observations about proteins
and their evolution. Our preliminary exploration of the GOS data identified novel protein
families and also showed that many ORFan sequences from current databases have homologs in
these data. The diversity added by GOS data to protein families also allows for the building of
better profile models and thereby improves remote homology detection. The discovery of
kingdom-crossing protein families that were previously thought to be kingdom-specific presents
evidence that the GOS project has excavated proteins of more ancient lineage than that
previously known, or that have undergone lateral gene transfer. This is another example of how
metagenomics studies are changing our understanding of protein sequences, their evolution, and
their distribution across the various forms of life and environments. Biases in the currently
published databases due to oversampling of some proteins or organisms are illuminated by
environmental surveys that lack such biases. Such knowledge can help us make better predictions
of the real distribution patterns of proteins in the natural world and indicate where increased
sampling would be likely to uncover new families or family members of tremendous diversity
(such as in the viral kingdom).
These data have other significant implications for the fields of protein evolution and protein
structure prediction. Having several hundreds or even tens of thousands of diverse proteins
from a family or examples of a specific protein fold should provide new approaches for
developing protein structure prediction models. Development of algorithms that consider the
alignments of all these family members/protein folds and analyze how amino acid sequence can
vary without significantly altering the tertiary structure or function may provide insights
that can be used to develop new ab initio methods for predicting protein structures. These same
datasets could also be used to begin to understand how a protein evolves a new function.
Finally, this large database of amino acid sequence data could help to better understand and
predict the molecular interactions between proteins. For example, they may be used to predict
the protein–protein interactions so critical for the formation of specific functional complexes
within cells.
The GOS data also have implications for nearly all computational methods relying on sequence
data. The increase in the number of known protein sequences presents challenges to many
algorithms due to the increased volume of sequences. In most cases this increase in sequence
data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times
to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the
increase in the diversity of protein families, with many new divergent clades present. With
more protein similarity relationships falling into the twilight zone overlapping with random
sequence similarity, the number of false positives for homology detection methods increases,
making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of
protein sequence and family diversity introduces unprecedented opportunities to mine similarity
relationships for clues on molecular function and molecular interactions as well as providing
much expanded data for all methods utilizing homologous sequence information data.
The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing
projects in exploring proteins. These projects offer an unbiased view of proteins and protein
families in an environmental sample. However, it should be noted that the GOS data reported
here are limited to mostly ocean surface microbes. Even with this targeted sampling a
tremendous amount of diversity is added to known families, and there is evidence for a large
number of novel families. Additional data from larger filter sizes (that will sample more
eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep
sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins.

Materials and Methods
Data description. NCBI-nr [31,32] is the single largest publicly available protein resource and
includes protein sequences submitted to SWISS-PROT (curated protein database) [122], PDB (a
database of amino acid sequences with solved structures) [123], PIR (Protein Information
Resource) [124], and PRF (Protein Research Foundation). In addition, NCBI-nr also contains
protein predictions from DNA sequences from both finished and unfinished genomes in GenBank
[125], EMBL [126], and DNA Databank of Japan (DDBJ) [127]. The nonredundancy in NCBI-nr is only
to the level of distinct sequences, and any two sequences of the same length and content are
merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully
curated database. Therefore it also contains contaminants in the form of sequences that are
falsely predicted to be proteins.
Expressed sequence tag (EST) databases also provide the potential to add a great deal of
information to protein exploration and contain information that is not well represented in
NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices [34], an EST
database, were included in this study. To minimize redundancy, only EST assemblies from those
organisms for which the full genome is not yet known were included. The protein predictions on
metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl
database [35,36] in this study.
Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were
included. The protein predictions from the individual sequencing projects are submitted to
NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of
evaluating our approach and also for the purpose of identifying any proteins that were missed
by the annotation process used in these projects.
Thus, for this study the following publicly available datasets, all downloaded on February 10,
2005, were used: NCBI-nr, PG, TGI-EST, and ENS. The organisms in the PG set and the TGI-EST set
are listed in Protocol S1.
Assembly of the GOS dataset. Initial assembly (construction of “unitigs”) was performed so that
only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were
accepted. False assemblies at this phase of the assembler are extremely rare, even in the
presence of complex datasets [37,128]. Paired-end (also known as mate-pair) data were then used
to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or
a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate
pair data were used more aggressively than is typical in assembly of a single genome in that
depth-of-coverage information was largely ignored [10]. This potentially allows chimeric
assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a
conclusion that relies on the correctness of a single assembly involving multiple unitigs
should be considered
tentative until the assembly can be confirmed in some way. Assemblies involved in key results
in this paper were subjected to expert manual review based on thickness of overlaps, presence
of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency
of depth of coverage.
Data release and availability. All the GOS protein predictions will be submitted to GenBank. In
addition, all the data supporting this paper, including the clustering and the various
analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure
for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which
is funded by the Gordon and Betty Moore Foundation.
All-against-all BLASTP search. We used two sets of computer resources. At the J. Craig Venter
Institute, 125 dual 3.06-GHz Xeon processor systems with 2 GB of memory per system were used.
Each system had 80 GB local storage and was connected by GBit ethernet with storage area
network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A
total of 466,366 CPU hours was used on this system. In addition, access to the National Energy
Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including
380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB
of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this
system. The dataset of 28.6 million sequences was searched against itself in a half-matrix
using NCBI BLAST [38] with the following parameters: -F “m L” -U T -p blastp -e 1 × 10⁻¹⁰
-z 3 × 10⁹ -b 8000 -v 10. In this paper, similarity of an alignment is defined to be the
fraction of aligned residues with a positive score according to the BLOSUM62 substitution
matrix [129] used in the BLAST searches.
Identification of nonredundant sequences. Given a set of sequences S and a threshold T, a
nonredundant subset S′ of S was identified by first partitioning S (using the threshold T) and
then picking a representative from each partition. The set of representatives constitutes the
nonredundant set S′. The process was implemented using the following graph-theoretic approach.
A directed graph G = (V, E) is constructed with vertex set V and edge set E. Each vertex in V
represents a sequence from S. There is a directed edge (u,v) ∈ E if sequence u is longer than
sequence v and their sequence comparison satisfies the threshold T; for sequences of identical
length, the sequence with the lexicographically larger id is considered the longer of the two.
Note that G does not have any cycles. Source vertices (i.e., vertices with no in-degree) are
sorted in decreasing order of their out-degrees and (from largest out-degree to smallest)
processed in this order. A source vertex u is processed as follows: mark all vertices that have
not been seen before and are reachable from vertex u as being redundant and mark vertex u as
their representative.
We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in
the first stage of the clustering and the latter was used in the HMM profile analysis. For the
98% similarity threshold, two sequences satisfy the threshold if the following three criteria
are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence
is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%.
For the 100% identity threshold, two sequences satisfy the threshold if their match identity is
100%.

fragmentary sequence data of varying lengths. This was dealt with somewhat by working with
rather stringent match thresholds and a two-stage process to identify the core sets. We used
the concept of strict long edges and weak long edges. A strict long edge exists between two
vertices (sequences) if their match has the following properties: (1) 90% of the longer
sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the
match is at least 60% of the self-score of the longer sequence. A weak long edge exists between
two vertices (sequences) if their match has the following properties: (1) 80% of the longer
sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the
match is at least 30% of the self-score of the longer sequence. Core set identification had two
substages: large core initialization and core extension. The large core initialization step
identified sets of sequences where these sets were of a reasonable size and the sequences in
them were very similar to each other. Furthermore, these sets could be extended in the core
extension step by adding related sequences. In the large core initialization step, a directed
graph G was constructed on the sequences using strict long edges, with each long edge being
directed from the longer to the shorter sequence. For each vertex v in G, let S(v) denote the
friends set of v consisting of v and all neighbors that v has an out-going edge to.
Initially all the vertices in G are unmarked. Consider the set of all friends sets in the
decreasing order of their size. For S(v) that is currently being considered, do the following:
(1) initialize seed set A = S(v); (2) while there exists some v′ such that |S(v) ∩ S(v′)| ≥ k,
set A = A ∪ S(v′). (Note: k = 10 is chosen); (3) output set A and mark all vertices in A; and
(4) update all friends sets to contain only unmarked vertices.
In the core extension step, we constructed a graph G using weak long edges. All vertices in
seed sets (computed from the large core initialization step) were marked and the rest of the
vertices unmarked. Each seed set was then greedily extended to be a core set by adding a
currently unmarked vertex that has at least k neighbors (k = 10 is chosen) in the set; the
added vertex was marked. After this process, a clique-finding heuristic was used to identify
smaller cliques (of size at most k − 1) consisting of currently unmarked vertices; these were
also extended to become core sets. A final step involved merging the computed core sets on the
basis of weak edges connecting them.
In the core set merging step, we constructed an FFAS (Fold and Function Assignment System)
profile [39] for each core set using the longest sequence in the core set as query. FFAS was
then used to carry out profile–profile comparisons in order to merge the core sets into larger
sets of related sequences. Due to computational constraints imposed by the number of core sets,
profiles were built on only core sets containing at least 20 sequences.
Final recruitment involved constructing a PSI-BLAST profile [40] on core sets of size 20 or
more (using the longest sequence in the core set as query) and then using PSI-BLAST
(-z 1 × 10⁹, -e 10) to recruit as yet unclustered sequences or small-sized clusters (size less
than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match
had to cover at least 60% of the length of the sequence with an E-value ≤ 1 × 10⁻⁷. In a final
step, unclustered sequences were recruited to the clusters using their BLAST search results. A
length-based threshold was used to determine if the sequence is to be recruited.
Description of the clustering algorithm. The starting point for the Identification of clusters containing shadow ORFs. A well-known
clustering was the set of pairwise sequence similarities identified problem in predicting coding intervals for DNA sequences is shadow
using the all-against-all BLASTP compute. Because of both the ORFs. The key requirement that coding intervals not contain in-
volume and nature of the data, the clustering was carried out in four frame stop codons requires that coding intervals be subintervals of
steps: redundancy removal, core set identification, core set merging, ORFs. Long ORFs are therefore obvious candidates to be coding
and final recruitment. intervals. Unfortunately, the constraints on the coding interval to be
A set of nonredundant sequences (at 98% similarity) was identified an ORF often cause subintervals and overlapping intervals of the
using the procedure given in Materials and Methods (Identification of coding interval to also be ORFS in one of the five other reading
nonredundant sequences). Only the nonredundant sequences were frames (two on the same strand and three on the opposite strand).
considered in further steps of the clustering process. These coincidental ORFs are called shadow ORFs since they are
The aim of the core set identification step was to identify core sets of found in the shadow of the coding ORF. In rare cases (and more
highly related sequences. In graph-theoretic terms, this involves frequently in certain viruses) coding intervals in different reading
looking for dense subgraphs in a graph where the vertices correspond frames can overlap but usually only slightly. Overwhelmingly distinct
to sequences and an edge exists between two sequences if their coding intervals do not overlap. However, this constraint is not as
sequence match satisfies some reasonable threshold (for instance, strict for ORFs that contain a coding interval, as the exact extent of
40% similarity match over 80% of at least one sequence and are the coding interval is not known. Prokaryotes predominate in these
clearly homologous based on the BLAST threshold). Dense subgraphs data and are the focus of the ORF predictions. Their 39 end of an
were identified by using a heuristic. This approach utilizes long edges. ORF is very likely to be part of the coding interval because a stop
These are edges where the match threshold is computed relative to codon is a clear signal for the termination of both the ORF and the
the longer sequence. This was done to prevent, as much as possible, coding interval (this signal could be obscured by frameshift errors in
unrelated proteins from being put into the same core set. If all the sequencing). The 59 end is more problematic because the true start
sequences were full length, using long edges would have offered a codon is not so easily identified and so the longest ORF with a
good solution to keeping unrelated proteins apart. However, the reasonable start codon is chosen and this may extend the ORF
situation here is complicated by the presence of a large amount of beyond the true coding interval. For this reason different criteria
were set for when ORFs have a significant overlap depending on the orientation (or the 5′ or 3′ ends) of the ORFs involved. Two ORFs on the same strand are considered overlapping if their intervals overlap by at least 100 bp. Two ORFs that are on opposite strands are considered overlapping either if their intervals overlap by at least 50 bp and their 3′ ends are within each other's intervals, or if their intervals overlap by at least 150 bp and the 5′ end of one is in the interval of the other.
ORFs for coding intervals are clustered based on sequence similarity. In most cases this sequence similarity is due to the ORFs evolving from a common ancestral sequence. Due to functional constraints on the protein being coded for by the ORF, some sequence similarity is retained. There are no known explicit constraints on the shadow ORFs to constrain drift from the ancestral sequences. However, the shadow ORFs still tend to cluster together for some obvious reasons. The drift has not yet obliterated the similarity. There are implicit constraints due to the functional constraints on the overlapping coding ORF. There are also other possible unknown functional constraints beyond the coding ORF. At first it was surmised that within shadow ORF clusters the diversity should be higher than for the coding ORF, but this did not prove to be a reliable signal. The apparent problem is that the shadow ORFs tend to be fractured into more clusters due to the introduction of stop codons that are not constrained because the shadow ORFs are noncoding. What rapidly became apparent is that the most reliable signal that a cluster was made up of shadow ORFs is that the cluster was smaller than the coding cluster containing the ORFs overlapping the shadow ORFs.
The basic rule for labeling a cluster as a shadow ORF cluster is that the size of the shadow ORF cluster is less than the size of another cluster that contained a significant proportion of the overlapping ORFs for the shadow ORF cluster. A specific set of rules was used to label shadow ORF clusters based on comparison to other clusters that contained ORFs overlapping ORFs in the shadow ORF cluster (called the overlapping cluster for this discussion). First, the overlapping cluster cannot be the same cluster as the shadow ORF cluster (there are sometimes overlapping ORFs within the same cluster due to frameshifts). Second, both the redundant and nonredundant sizes of the shadow ORF cluster must be smaller than the corresponding sizes of the overlapping cluster. Third, at least one-third of the shadow ORFs must have overlapping ORFs in the overlapping cluster. Fourth, less than one-half of the shadow ORFs are allowed to contain their overlapping ORFs (this test is rarely needed but did eliminate the vast majority of the very few obvious false positives that were found using these rules). Finally, the majority of the shadow ORFs that overlapped must overlap by more than half their length.
When using this rule, 1,274,919 clusters were labeled as shadow ORF clusters, and 6,570,824 singletons were labeled as shadow ORFs. The rules need to be somewhat conservative so as not to eliminate coding clusters. To test these rules, clusters containing at least two NCBI-nr sequences were examined. Two sequences were used instead of one because occasional spurious shadow ORFs have been submitted to NCBI-nr. There were 989 shadow ORF clusters containing at least two NCBI-nr sequences and with more than one-tenth as many NCBI-nr sequences as the overlapping cluster. This was 0.86% of all clusters (114,331 in total) with at least two NCBI-nr sequences. Of these 989, a few were obvious mistakes, and the others involved very few NCBI-nr sequences of dubious curation, such as “hypothetical.” Just to be conservative, all of these 989 clusters were rescued and not labeled as shadow ORF clusters.
Ka/Ks test to determine if sequences in a cluster are under selective pressure. For a cluster containing conserved but noncoding sequences, it is expected that there is no selection at the codon level. We checked this by computing the ratio of nonsynonymous to synonymous substitutions (Ka/Ks test) [130,131] on the DNA sequences from which the ORFs in the cluster were derived. For most proteins, Ka/Ks ≪ 1, and for proteins that are under strong positive selection, Ka/Ks ≫ 1. A Ka/Ks value close to 1 is an indication that sequences are under no selective pressure and hence are unlikely to encode proteins [134,135]. Weakly selected but legitimate coding sequences can have a Ka/Ks value close to 1. These were identified to some extent by using a model in which different partitions of the codons experience different levels of selective pressure. A cluster was rejected only if no partition was found to be under purifying selection at the amino acid level.
The Ka/Ks test [130,131] was run only on those clusters (remaining after the shadow ORF filtering step) that did not contain sequences with HMM matches or have NCBI-nr sequences in them. Only the nonredundant sequences in a cluster were considered. Sequences in each of the clusters were aligned with MUSCLE [134]. For each cluster, a strongly aligning subset of sequences was selected for the Ka/Ks analysis. The codeml program from PAML [135,136] was run using model M0 to calculate an overall (i.e., branch- and position-independent) Ka/Ks value for the cluster. Clusters with Ka/Ks ≤ 0.5, indicating purifying selection and therefore very likely coding, were considered as passing the Ka/Ks filter. In addition, the remaining clusters were examined by running codeml with model M3. This partitioned the positions of the alignment into three classes that may be evolving differently (typically, a few positions may be under positive selection while the remainder of the sequence is conserved). A likelihood ratio test was applied to select clusters for which M3 explained the data significantly better than M0 [136]. If a cluster was thus selected, and if one of the resulting partitions had a Ka/Ks ≤ 0.5 and comprised at least 10% of the sequence, then that cluster was also considered as passing the Ka/Ks filter. All other clusters were marked as containing spurious ORFs.
Statistics for the various stages of the clustering process. The number of sequences that remain after redundancy removal (at 98% similarity) for each dataset is given in Table 12. Recall that the size of a cluster is the number of nonredundant sequences in it.

Table 12. The Number of Sequences in NCBI-nr, PG ORFs, TGI-EST ORFs, ENS, and GOS ORFs prior to and after the Redundancy Removal Step of Our Clustering

Data             Number of Amino Acid Sequences
                 Original Set     Nonredundant Set
NCBI-nr           2,317,995        1,017,058
PG ORFs           3,049,695        2,424,016
TGI-EST ORFs      5,458,820        5,085,945
ENS                 361,668          137,057
GOS ORFs         17,422,766       14,134,842
Total            28,610,944       22,798,918

doi:10.1371/journal.pbio.0050016.t012

Number of core sets of size two or more totals 1,586,454; number of nonredundant sequences in core sets of size two or more totals 8,337,256; and total number of sequences in core sets of size two or more is 12,797,641.
Total number of clusters after profile merging and (PSI-BLAST and BLAST) recruitment is 1,871,434; number of clusters of size two or more totals 1,388,287; number of nonredundant sequences in clusters of size two or more totals 11,494,078; total number of sequences in clusters of size two or more is 16,565,015.
The final clustering statistics (after shadow ORF detection and Ka/Ks tests) are as follows: number of clusters of size two or more totals 297,254; number of nonredundant sequences in clusters of size two or more totals 6,212,610; total number of sequences in clusters of size two or more is 9,978,637.
In the final BLAST recruitment step, a pattern was seen involving highly compositionally biased sequences that recruited unrelated sequences to clusters. This was reflected in the pre- and post-BLAST recruitment numbers, where the postrecruitment sizes were more than three to four times the size of the prerecruitment numbers. There were 75 such clusters, and these were removed.
Searching sequences using profile HMMs. The full set of 7,868 Pfam release 17 models was used, along with additional nonredundant profiles from TIGRFAM (1,720 of 2,443 profiles; version 4.1). HMM profiling was carried out using a TimeLogic DeCypher system (Active Motif, Inc., http://www.activemotif.com) and took 327 hours in total (on an eight-card machine). A sequence was considered as matching a Pfam (fragment model) if its sequence score was above the TC score for that Pfam and had an E-value ≤ 1 × 10⁻³. It was considered as matching a TIGRFAM if the match had an E-value ≤ 1 × 10⁻⁷.
Evaluation of protein prediction via clustering. Our evaluation of protein prediction via the clustering shows a very favorable comparison to currently used protein prediction methods for prokaryotic genomes. We used the PG dataset for this evaluation (Table 2). Of the 3,049,695 PG ORFs, 575,729 sequences (19%) were clustered (the clustered set). Of the 614,100 predictions made by the genome projects, 600,911 sequences could be mapped to the PG ORF set (the submitted set); 93% of the unmapped sequences were <60 aa
Figure 12. Log–Log Plots of Cluster Size Distributions

The x-axis is logarithm of the cluster size X and the y-axis is the logarithm of the number of clusters of size at least X; logarithms are base 10.
(A) Plot comparing the sizes of clusters produced by our clustering approach (red) to those of clusters produced by Pfams (green). The curves track each
other quite well, with both of them having an inflection point around cluster size 2,500 (approximately 3.4 on the x-axis). Each sequence is assigned to
the highest scoring Pfam that it matches. Two sequences that are assigned to the same Pfam can nevertheless be assigned to different clusters by the
full-sequence–based clustering approach if they differ in the remaining portion. This is especially true for commonly occurring domains that are present
in different multidomain proteins. Thus, there tends to be a larger number of big clusters in the Pfam approach as compared to the full-sequence–
based approach. Hence, the green curve is above the red curve at the higher sizes.
(B) Plot of the cluster size distributions for core sets (green) and for final clusters (red). Both curves have an inflection point around cluster size 2,500
(approximately 3.4 on the x-axis). Note that these plots give the cumulative distribution function (cdf), while the power law exponents reported in the
text are for the number of clusters of size X (i.e., the probability density function [pdf]). The relationship between these exponents is b_pdf = 1 + b_cdf.
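The relationship between the two exponents in the caption can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `cumulative_counts` and `loglog_slope` are hypothetical, and the power-law values are synthetic. It builds the "number of clusters of size at least X" curve plotted in Figure 12 and shows that a density falling off as x^(-b) has a tail falling off as x^(-(b-1)).

```python
import math
from collections import Counter


def cumulative_counts(sizes):
    """Map each distinct cluster size X to the number of clusters of size >= X."""
    per_size = Counter(sizes)
    result = {}
    running = 0
    for x in sorted(per_size, reverse=True):
        running += per_size[x]
        result[x] = running
    return result


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    return sxy / sxx


# Synthetic illustration (assumed values): a density a*x^(-b) with b = 3
# has the continuous tail a*x^(-(b-1))/(b-1), so the two log-log slopes differ by 1.
b_pdf = 3.0
xs = [10.0, 100.0, 1000.0]
pdf_slope = loglog_slope(xs, [x ** -b_pdf for x in xs])
cdf_slope = loglog_slope(xs, [x ** (1 - b_pdf) / (b_pdf - 1) for x in xs])
print(round(-pdf_slope, 6), round(-cdf_slope, 6))
```

The regression is done on base-10 logarithms to match the axes described in the caption; the slope itself does not depend on the base.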
(recall that the ORF calling procedure only produced ORFs of length ≥60 aa). The clustered set and submitted set had 493,756 ORFs in common. Of the 107,155 sequences that were only in the submitted set, 24,217 sequences (23%) had HMM matches. As with other unclustered HMM matches, most were weak or partial. These sequences had an average of only 48% of their lengths covered by HMMs. Of the remaining 82,938 sequences that did not have an HMM match, 13,724 (17%) were removed by the filters used, and the rest fell into clusters with only one nonredundant sequence (and thus were not labeled as predicted proteins by the clustering analysis). Based on NCBI-nr sequences in them, these clusters were mostly labeled as “hypothetical,” “unnamed,” or “unknown.” Our clustering method identified 81,973 ORFs not predicted by the genome projects, of which 16,042 (20%) were validated by HMM matches (with average HMM coverage of 69% of sequence length) and an additional 27,120 (33%) had significant BLAST matches (E-value ≤ 1 × 10⁻¹⁰) to sequences in NCBI-nr. Thus, if the submitted set is considered as truth, then protein prediction via clustering produces 493,756 true positives (TP), 81,973 false positives (FP), and 107,155 false negatives (FN), thereby having a sensitivity (TP/[TP + FN]) of 83% and specificity (TP/[TP + FP]) of 86%. However, if truth is considered as those sequences that are common to both the clustered and submitted sets in addition to those sequences with HMM matches, then our protein prediction method via clustering has 95% sensitivity and 89% specificity, while protein prediction by the prokaryotic genome projects has 97% sensitivity and 86% specificity.
Evaluation of protein clustering. We used Pfams to evaluate the clustering method in two ways. For both evaluations the clustering was restricted to only those sequences with Pfam matches. It should be kept in mind that there are redundancies among Pfams in that there can be more than one Pfam for a homologous domain family (for instance, the kinase domain Pfams PF00069 protein kinase domain and PF07714 protein tyrosine kinase), and these redundancies can affect the evaluation statistics reported below.
For the first evaluation, each sequence was represented by the set of Pfams that match it. This is referred to as the domain architecture for a sequence. While Pfams provide a domain-centric view of proteins, the domain architecture attempts to approximate the full sequence-based approach used here, and thus could be used to shed light on the general performance of the clustering. We measured how often unrelated sequences were present in a given cluster. Two sequences were defined to be unrelated if their domain architectures each had at least one Pfam that was not present in the other's domain architecture. Note that this measure did not penalize the case when the domain architecture of one sequence was a proper subset of the domain architecture of the other sequence. This was done to allow fragmentary sequences in clusters to be included in the evaluation as well (and also because it is not always easy to determine whether an amino acid sequence is fragmentary or not). For each cluster, we computed the percentage of sequence pairs that are unrelated under this measure. A total of 92% of the clusters had at most 2% unrelated pairs. Then we carried out an assessment of how many instances of a given domain architecture appear in a single cluster. A total of 58% of the domain architectures were confined to single clusters (i.e., 100% of their occurrences are in one cluster), and for 88% of the domain architectures, >50% of their occurrences are in one cluster.
For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to that of clusters from the clustering method. Figure 12A shows the comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that cluster size distributions are quite similar, with both methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions.
Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, at least 94% of sequences (relative to the smaller set) are in common between clustering and HMM profiling.
Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set. The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp compared to full-length genomes for the PG data. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect that GOS fragmentation had on ORF lengths is estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value.
Kingdom assignment strategy and its evaluation. We used several
Table 13. BLAST-Based Classification Rate per Kingdom

Kingdom      Total Number    Correct Classification    Percent Correct
Eukaryota    440,951         422,173                   95.7
Bacteria     465,692         430,014                   92.3
Archaea       36,894          25,527                   69.2
Viruses       36,346          32,381                   89.0

Table 14. The Values for $C_{\ge d}(n)$, the Number of Clusters of Size ≥ d, as a Function of the Power Law Exponent b and Constant a

b            a              $C_{\ge d}(n)$
b < 1        $n^{b-1}$      $1$
b = 1        $1$            $\ln n$
1 < b < 2    $n^{b-1}$      $(n/d)^{b-1}$
b = 2        $n/\ln n$      $n/(d^{b-1}\ln n)$
b > 2        $n$            $n/d^{b-1}$
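The rates in Table 13 come from a strict-majority vote over the top four BLAST matches of each sequence. The following is a minimal sketch of such a vote, assuming that "strict majority" means more than half of the available hits and that a sequence without a strict majority is left unassigned; the source does not say how ties or sequences with fewer than four hits were handled. The function names are hypothetical. The second part recomputes the overall rate from the counts in Table 13.

```python
from collections import Counter


def strict_majority_kingdom(top_hit_kingdoms):
    """Return the kingdom held by more than half of the hits, else None.

    Assumption: ties and empty hit lists yield None (unassigned).
    """
    if not top_hit_kingdoms:
        return None
    kingdom, votes = Counter(top_hit_kingdoms).most_common(1)[0]
    return kingdom if votes * 2 > len(top_hit_kingdoms) else None


# kingdom: (total sequences, correctly classified), counts taken from Table 13
TABLE_13 = {
    "Eukaryota": (440951, 422173),
    "Bacteria": (465692, 430014),
    "Archaea": (36894, 25527),
    "Viruses": (36346, 32381),
}


def overall_rate(table):
    """Fraction of all sequences whose predicted kingdom was correct."""
    total = sum(t for t, _ in table.values())
    correct = sum(c for _, c in table.values())
    return correct / total


print(strict_majority_kingdom(["Bacteria", "Bacteria", "Archaea", "Bacteria"]))
print(strict_majority_kingdom(["Bacteria", "Bacteria", "Archaea", "Archaea"]))
print(round(100 * overall_rate(TABLE_13)))
```

With four hits, a kingdom needs at least three votes under this reading, so a two-against-two split stays unassigned. The pooled rate rounds to 93%, which matches the overall figure quoted in the text.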
approaches to assign kingdoms for GOS sequences. They are all fundamentally based upon a strategy that takes into account top BLAST matches of a GOS sequence to sequences in NCBI-nr, and then voting on a majority.
We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in Materials and Methods [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset S. For each sequence in S, its top four BLAST matches to other sequences in S (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in Table 13.
While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference).
The kingdom proportion for the GOS dataset reported in Figure 1 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scaffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom.
Cluster size distribution, the power law, and the rate of protein family discovery. Earlier studies of protein family sizes in single organisms [137–139] have suggested that P(d), the frequency of protein families of size d, satisfies a power law: that is, $P(d) \approx d^{-b}$ with exponent b reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [42,43]. Figure 12B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent b via linear regression separately in each size regime. For the core set distribution, the exponent b = 1.99 (R² = 0.994) for clusters of size ≤ 2,500, and b = 3.34 (R² = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent b = 1.72 (R² = 0.995) for clusters of size ≤ 2,500, and b = 2.72 (R² = 0.995) for clusters of size > 2,500. The estimates for b are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study [140] of protein families contained in the ProDom, Protomap, and COG databases, where the exponent b reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases.
Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. Figure 2 shows the number of core sets of varying sizes (3, 5, 10, and 20) as a function of the number of nonredundant sequences for each dataset.
The observed linear growth in number of families with increase in sample size n is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with n vertices such that the cluster sizes obey a power law distribution. Let $C_d(n)$ [respectively, $C_{\ge d}(n)$] denote the number of clusters of size d (respectively, ≥ d). Since the distribution of cluster sizes follows a power law, there exist constants a, b such that for all x ≤ n, $C_x(n) = a x^{-b}$.
As every vertex of the graph is a member of exactly one cluster,

\[
n = \sum_{x=1}^{n} x\,C_x(n) = \sum_{x=1}^{n} a\,x^{1-b} \approx
\begin{cases}
a\,\dfrac{n^{2-b} - 1}{2-b}, & b \neq 2 \\[6pt]
a \ln n, & b = 2
\end{cases}
\tag{1}
\]

The number of clusters of size at least d is

\[
C_{\ge d}(n) = \sum_{x=d}^{n} C_x(n) \approx
\begin{cases}
a\,\dfrac{n^{1-b} - d^{1-b}}{1-b}, & b \neq 1 \\[6pt]
a \ln n, & b = 1
\end{cases}
\tag{2}
\]

Combining the two equations, we obtain values (up to a multiplicative constant) for $C_{\ge d}(n)$ as shown in Table 14. In all cases with b > 1, the number of clusters $C_{\ge d}(n)$ increases as n increases, and as d decreases. Specifically, for b > 2, the growth is linear in n for all d, with slope decreasing as d increases. For 1 < b < 2, the growth is sublinear in n for all d.
Note that while the observed distribution of protein family sizes is fit by two different power laws, one for clusters of size less than 2,500 with b = 1.99 and another for clusters of size greater than 2,500 with b = 3.34 for the current number of (nonredundant) sequences, the contribution of large families to the rate of growth is negligible compared to the small families.
The above formulas for $C_{\ge d}(n)$ also suggest the dependence of the rate of growth of clusters on the cluster size d. For example, in the case when b is very close to 2,

\[
C_{\ge d}(n) = m\,\frac{n}{d^{\,b-1}}
\tag{3}
\]
Figure 13. Log–Log Plot of Slopes m(d) of Linear Regression Fit to the Rate of Growth in Figure 2 for Different Values of Cluster Size d
According to the equation derived in the text, $m(d) = m\,d^{1-b}$ for some constant m. The best linear fit to log [m(d)] gives a line with slope −0.91 (R² = 0.98) that is close to the predicted value 1 − b = −0.99.
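The fit described in this caption has two stages: a linear fit of cluster count against sample size for each d, then a linear fit of log m(d) against log d. The following is a minimal sketch of that procedure on synthetic data, not the authors' analysis. The growth curves are generated from the model form m·n/d^(b−1), with an arbitrary constant m = 0.05 and arbitrary sample sizes, and the helper names are hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y against x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def growth_exponent(curves):
    """curves maps d to a list of (n, clusters of size >= d) points.

    Stage 1 fits the growth slope m(d) for each d; stage 2 returns the
    slope of log10 m(d) against log10 d, which the model predicts is 1 - b.
    """
    log_d, log_m = [], []
    for d, points in sorted(curves.items()):
        m_d = ols_slope([n for n, _ in points], [c for _, c in points])
        log_d.append(math.log10(d))
        log_m.append(math.log10(m_d))
    return ols_slope(log_d, log_m)


# Synthetic growth curves (assumed values, for illustration only).
b, m = 1.99, 0.05
sample_sizes = [5e6, 1e7, 1.5e7, 2e7]
curves = {d: [(n, m * n * d ** (1 - b)) for n in sample_sizes] for d in (3, 5, 10, 20)}
fitted = growth_exponent(curves)
print(round(fitted, 2))
```

On noise-free synthetic input the second-stage slope recovers 1 − b. Real growth curves are noisy and only approximately linear, so an observed slope would deviate from the prediction.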
for some constant m. Thus, the rate of growth of cluster sizes is linear,
and the slope m(d) of rate of growth is given by $m(d) = m\,d^{1-b}$. Figure 13
shows how well the observed rates of growth match the values
predicted by this equation. A fit to a sublinear function (not shown)
also gives similar results as in Figure 13.
GOS versus known prokaryotic versus known nonprokaryotic. Examples of top five clusters in the various categories (except GOS-only) are given below. The cluster identifiers are in parentheses.
Known prokaryotic only: (Cluster ID 1319) outer surface protein in Anaplasma ovis, Wolbachia, Ehrlichia canis; (Cluster ID 10911) nitrite reductase in uncultured bacterium; (Cluster ID 1266) outer membrane lipoprotein in Borrelia; (Cluster ID 8595) methyl-coenzyme M reductase subunit A in uncultured archaeon; (Cluster ID 2959) outer membrane protein in Helicobacter.
Known nonprokaryotic only: (Cluster ID 2226) Pol polyprotein HIV sequences; (Cluster ID 4023) maturase K; (Cluster ID 6257) NADH dehydrogenase subunit 2; (Cluster ID 8644) HIV protease; (Cluster ID 12196) MHC class I and II antigens.
GOS and known prokaryotic only: (Cluster ID 3369) carbamoyl transferase; (Cluster ID 688) apolipoprotein N-acyltransferase; (Cluster ID 3726) potassium uptake proteins; (Cluster ID 300) primosomal protein N′; (Cluster ID 4605) DNA polymerase III delta subunit.
GOS and known nonprokaryotic only: (Cluster ID 186) seven transmembrane helix receptors; (Cluster ID 2069) zinc finger proteins; (Cluster ID 3092) MAP kinase; (Cluster ID 1413) potential mitochondrial carrier proteins; (Cluster ID 233) pentatricopeptide (PPR) repeat-containing protein.
Known prokaryotic and known nonprokaryotic only: (Cluster ID 3510) immunoglobulin (and immunoglobulin-binding) proteins; (Cluster ID 600) expansin; (Cluster ID 50) pectin methylesterase; (Cluster ID 6492) lectin; (Cluster ID 986) BURP domain-containing protein.
GOS and known prokaryotic and known nonprokaryotic: (Cluster ID 2568) ABC transporters; (Cluster ID 49) short-chain dehydrogenases; (Cluster ID 4294) epimerases; (Cluster ID 1239) AMP-binding enzyme; (Cluster ID 2630) envelope glycoprotein.

Figure 14. Receiver Operating Characteristic Curve Used to Evaluate Various Methods of Scoring Pairs of Clusters for Functional Similarity
Pairs of clusters with ≥1 example of neighboring ORFs and assigned GO terms were divided into a set of functionally related (true positive) and functionally unrelated (true negative) cluster pairs based on the similarity of their GO terms. The scoring methods evaluated are described in the text.
doi:10.1371/journal.pbio.0050016.g014

by searching for all occurrences of nearby pairs of ORFs belonging to the two clusters of interest. Sufficiently close pairs were more likely to be encoded in the same operon. We devised a scoring mechanism to reward those pairs of clusters for which many divergent examples of likely operon pairs existed in the set of ORF pairs. For each pair of clusters, a weight was applied to the contribution of each pair of ORFs, and this was proportional to how similar the pair of ORFs was to other example pairs. Thus, many near-identical pairs of ORFs, likely from the same or similar species, are not overrepresented in the final cluster pair score, while conserved examples of neighboring position from more divergent sequences contribute an increased weight. The score for each cluster pair is calculated as:

\[
S(C_1 C_2) = 1 - \prod_{i=1}^{n} \left[ 1 - \Pr\!\left(O_{g_{i1} g_{i2}} \mid \mathrm{dist}\right) w_{i1}\, w_{i2} \right]
\tag{4}
\]

where $S(C_1 C_2)$ is the linkage score of clusters $C_1$ and $C_2$. The probability $\Pr(O_{g_{i1} g_{i2}} \mid \mathrm{dist})$ that any two genes $g_{i1}$ from $C_1$ and $g_{i2}$ from $C_2$ are in an operon is dependent on the distance between them as calculated by [47], and is weighted according to the sequence weights $w_{i1}$ and $w_{i2}$ described below, for all example pairs i.
Neighbor functional linkage methods. For the sequences in each We calculated sequence weights in a manner similar to that used in
GOS-only cluster, we determined if neighboring ORFs occurring on progressive multiple sequence alignment [142]. Briefly, neighbor-
the same strand had a similar biological process in the GO [49]. If this joining trees were built for all clusters using the QuickJoin [143] and
shared biological process of the neighbors occurred statistically more QuickTree programs [144] based on a distance matrix constructed
often than expected by chance, that inferred a potential operon from all-against-all BLAST scores within a cluster, normalized to self-
linkage and a biological process term for the GOS-only cluster. This scores. For those few clusters with more than 30,000 members, trees
approach weighted ORFs by sequence similarity to reduce the were not built. Instead, equal sequence weights for all members were
skewing effect of sequences from highly related organisms. assigned because of computational limitations. The root of each tree
For definition of linked ORFs, we collected pairs of same-strand was placed at the midpoint of the tree by using the retree package in
ORF protein predictions with intergenic distances less than 500 bp. PHYLIP [145]. The individual sequence weights were then computed
Negative distances were possible if the 59 end of the downstream ORF by summing the distance from each leaf to the root after dividing
in the pair occurred 59 to the 39 end of the upstream ORF. We used a each branch’s weight by the number of nodes in the subtree below it.
probability function to estimate the probability that two putative Weights were normalized so that the sum of weights in any given tree
genes belong to the same operon given their intergenic distance [47]. was equal to 1.0. This weighting scheme is superior to one in which
Because sequences come from a variety of unknown organisms, the weights are normalized to the largest weight in the tree, one that does
probability distribution was created by averaging properties of 33 not weight sequences according to divergence, and one that only
randomly chosen divergent genomes. The exact choice of genomes considers the number of example pairs seen (Figure 14). To compare
did not greatly affect the ability of the distribution to separate the different scoring methods, pairs of clusters annotated with GO
experimentally determined same-operon gene pairs from adjacent, terms that contained adjacent ORFs in the data were gathered. These
same-strand gene pairs in different known operons annotated in a pairs were divided into into functionally related and unrelated
version of RegulonDB downloaded on March 29, 2005 [141]. clusters based on a measure of GO term similarity (p-value 0.01)
We measured the functional linkage between two protein clusters [146]. We evaluated scoring methods for the ability to recover
with the programs TMHMM [147] and SPLIT4 [148]. GC content was
calculated as (G + C)/(G + C + A + T) bases for each ORF in a cluster,
and averaged for each cluster within a set. The GC content, reported
as the mean and standard deviation of the cluster averages, is as
follows for each cluster set: Group I, 36.7% ± 8.0%; Group II, 35.9%
± 7.9%; Group I size-matched sample, 48.8% ± 11.1%; Group II
size-matched sample, 49.5% ± 11.2%; Group I viral fraction, 37.8% ±
5.1%; Group II viral fraction, 37.3% ± 4.6%. To address the
interconnectivity of the novel clusters within the context of all
operon linkages, we constructed a graph with clusters as nodes and
inferred operon linkages (with score above 1 × 10^-6) as edges. We then
asked for every node in the set of novel clusters what was the
cumulative fraction of novel nodes that could be reached within a
varying edge distance from the starting node. The expectation of this
fraction was calculated at each distance, and the procedure was
repeated for the set of size-matched clusters (Figure 15).
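The interconnectivity measure described above can be sketched as a breadth-first search over the cluster graph. This is only an illustrative sketch: the adjacency-dict representation, the function names, and the choice to exclude the starting node from both the numerator and the denominator are assumptions, not details given in the text.

```python
from collections import deque

def reachable_fraction(adj, novel, start, max_dist):
    """Fraction of the other novel nodes reachable from `start`
    within `max_dist` edges (hypothetical helper)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not expand beyond the allowed edge distance
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    others = novel - {start}
    if not others:
        return 0.0
    return len(seen & others) / len(others)

def mean_reachable_fraction(adj, novel, max_dist):
    """Expectation of the reachable fraction over all novel start nodes."""
    return sum(reachable_fraction(adj, novel, s, max_dist) for s in novel) / len(novel)

# Toy graph: novel clusters a and b are linked through a known cluster x;
# novel cluster c has no inferred operon linkage.
adj = {"a": ["x"], "x": ["a", "b"], "b": ["x"], "c": []}
novel = {"a", "b", "c"}
curve = [mean_reachable_fraction(adj, novel, d) for d in range(4)]
```

Running the same function on a size-matched control set and plotting both curves against edge distance would give a comparison in the spirit of Figure 15.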
We tried three different BLAST-based approaches for kingdom
assignment of ORFs. The first method, used in the analysis, required a
majority of the four top BLAST matches to vote for the same
kingdom (archaea, bacteria, eukaryota, or viruses; see Materials and
Methods [Kingdom assignment strategy and its evaluation]). The
second method required all eight top BLAST matches to vote for the
same kingdom. The last method we used was the scaffold-based
kingdom assignment described in Materials and Methods (Kingdom
assignment strategy and its evaluation). Figure 16 shows the results of
using these assignments to infer the kingdom of GOS-only clusters
(Figure 16D–16F) and their neighboring ORFs (Figure 16A–16C).
GOS-only clusters were assigned a kingdom only if >50% of their
neighboring ORFs were assigned the same kingdom. The general
trends observed are the same for each method, though the coverage
decreases slightly for the more stringent methods.
Characteristics and kingdom distribution of known protein
domains. For these analyses we used the predicted proteins from
the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The
public dataset contains multiple identical copies of some sequences
due to overlaps between the source datasets. For example, many
sequences in PG are also found in NCBI-nr. We filtered the public set
at 100% identity to avoid overcounting these sequences. Because this
filtering was necessary for the public dataset, the GOS dataset was
also filtered at 100% identity. If two or more sequences were 100%
identical at the residue level, but were of different lengths, only the
longest sequence was kept. The resulting datasets of nonredundant
proteins are referred to as public-100 and GOS-100.
We assigned each protein in public-100 to a kingdom based on the
species annotations provided in the source datasets (NCBI-nr,
Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to
determine the kingdom of each species. Of 3,167,979 protein
sequences in public-100, 3,158,907 can be annotated by kingdom.
The remaining 9,072 sequences are largely synthetic.
Determining the kingdom of origin of an environmental sequence
can be difficult; while an unambiguous assignment can be made for
some sequences, others can be assigned only tentatively or not at all.
Therefore, we took a probabilistic approach (kingdom-weighting
method), calculating ‘‘weights’’ or probabilities that each protein
sequence originated from a given kingdom.
The top four BLAST matches (E-value < 1 × 10^-10) of GOS ORFs to
NCBI-nr were considered. The kingdom of origin for each match was
determined. We pooled these ‘‘kingdom votes’’ for each scaffold,
since (presuming accurate assembly) each scaffold must come from a
single species and hence from a single kingdom. Each ORF on a
scaffold contributed up to four votes. If an ORF had fewer than four
BLAST matches with an E-value < 1 × 10^-10, then it contributed
fewer votes. ORFs with no BLAST matches contributed no votes.
In many cases, the votes were not unanimous, indicating that some
uncertainty must be associated with any kingdom assignment. An
additional source of uncertainty is the finite number of votes. We
accounted for these statistical issues by applying the following
procedure to each scaffold. First, two pseudocounts were added to
the votes for the ‘‘unknown’’ kingdom to represent the uncertainty
that remains even when votes are unanimous (especially when there
are few votes). The frequency of votes for each kingdom was
calculated. The vote frequency for a kingdom provides the maximum
likelihood estimate of the kingdom probability (i.e., the vote
frequency that would have been observed on a scaffold of similar
composition but with infinitely many voting ORFs). However, that
estimate may not be accurate or precise. Therefore, the multinomial
standard deviation was calculated for each vote frequency p as SQRT
[p × (1 − p)/(n − 1)], where n is the number of votes. A distance of two
standard deviations from the mean corresponds roughly to a 95%
confidence interval. Thus, two standard deviations were subtracted

Figure 15. Novel GOS-Only Clusters Are More Interconnected Than a
Size-Matched Sample of Clusters
Red line, novel clusters; green line, size-matched sample; blue line (right
axis), log2 ratio of fraction novel clusters recovered divided by fraction
sample clusters recovered.
doi:10.1371/journal.pbio.0050016.g015

functionally similar pairs. In all analyses, linkages between clusters
were ignored if there were fewer than five examples of cluster
member ORFs adjacent to each other on a scaffold.
Function for novel families was inferred as follows. (1) Assignment
of GO terms to clusters. We downloaded the GO [49] database on
September 21, 2005, from http://www.geneontology.org, along with the
files gene_association.goa_uniprot and pfam2go.txt dated July 12,
2005. Only the biological process component of the ontology was
considered. If a cluster had at least 10% of its redundant sequences
annotated by the most abundant Pfam domain for that cluster, and
that Pfam domain had a GO biological process term provided by the
pfam2go mapping, then we assigned a cluster the GO term of its most
abundant Pfam annotation. In addition, if a cluster contained at least
20% of its Uniprot GO annotations the same, it was assigned that GO
term. For each cluster, redundant GO terms found on the same path
to the root were removed. (2) Identification of neighbors to GOS-only
clusters. Neighbors of GOS-only clusters were defined as those
clusters that had a cluster linkage score above a predetermined
threshold (1 × 10^-6) and had at least five examples of cluster members
adjacent to each other in the data. These neighbors were then
screened for those that had been annotated with a GO term by the
process described above. (3) Overrepresentation of neighbor GO
terms. We attempted to define GO terms for a set of GOS-only
neighbors that were statistically overrepresented. Because of the
highly dependent nature of the terms in the GO, a simulation-based
approach was chosen to determine which terms might be
overrepresented. Annotated neighbors to a cluster of unknown function
were identified as described above. For each annotated neighbor,
counts for the associated GO term and all terms on the path to the
root of the ontology were incremented. A total of 100,000 simulated
neighbor lists of the same size as the true neighbor list were computed
by selecting without replacement from those clusters with annotated
GO terms, and an identical counting scheme was performed for each
simulation. Overrepresentation of neighbor terms was calculated for
each term on the ontology by asking how many times out of the
100,000 simulations the count for each GO term in the ontology met
or exceeded the observed count for the actual neighbors. This
fraction of simulations was interpreted as a p-value. If a term is
unusually prevalent in the true observed neighbors, it should be
relatively infrequent in the simulated data. For the purpose of the
metric used here, ‘‘is-a’’ and ‘‘part-of’’ relationships were treated
equally. In cases where a cluster had more than one GO term assigned
to it, any redundant terms occurring on each other’s path to the root
were first removed. For any remaining clusters with nonredundant,
multiple GO annotations, all possible lists of functions for each list of
neighbor clusters were enumerated, and one function from each
cluster was chosen. Each node in the ontology was assigned the
maximum count observed from the enumerated function lists. We
consistently applied this rule for the observed and simulated data.
The following descriptive measures of the novel GOS-only cluster
set were obtained. Transmembrane helix prediction was carried out
possible (e.g., to calculate the expected kingdom distribution of a

given set of proteins by summing the weights). However, it was
necessary in some cases to use discrete assignments of a single
kingdom to each ORF. A tentative assignment can be made for a
given scaffold by choosing the kingdom with the highest weight. The
possibility remains, in this case, that a fraction of the ‘‘unknown’’
weight should rightfully belong to a different kingdom. However, if a
kingdom weight is greater than 0.5, then this danger is averted, and a
‘‘confident’’ assignment of the scaffold and its constituent ORFs to
that kingdom can be made.
Given the uncertainty penalty above, achieving a kingdom weight
greater than 0.5 generally requires overwhelming support for one
kingdom over the others. In particular, on a given scaffold, at least
eight unanimous votes for a kingdom are needed (i.e., two ORFs
contributing four votes each) to make a confident assignment to that
kingdom. Any disagreement between the votes increases the required
number rapidly: for instance, 15 votes for a single kingdom are
required to override four votes for other kingdoms.
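The kingdom-weighting procedure and the 0.5 cutoff can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names and the dict-based vote representation are hypothetical, and taking n to include the two ‘‘unknown’’ pseudocounts is an assumption, chosen because it reproduces the thresholds quoted above (eight unanimous votes, or 15 votes against four).

```python
import math

def kingdom_weights(votes, pseudocounts=2, n_sd=2.0):
    """Conservative kingdom weights for one scaffold.

    votes: dict mapping kingdom name -> number of BLAST votes.
    Assumption: n counts the real votes plus the "unknown" pseudocounts.
    """
    n = sum(votes.values()) + pseudocounts
    weights = {}
    for kingdom, count in votes.items():
        p = count / n                            # vote frequency
        sd = math.sqrt(p * (1.0 - p) / (n - 1))  # multinomial standard deviation
        weights[kingdom] = max(0.0, p - n_sd * sd)
    # Whatever is left over is the "unknown weight".
    weights["unknown"] = 1.0 - sum(weights.values())
    return weights

def confident_kingdom(weights, threshold=0.5):
    """Return the kingdom whose weight exceeds the threshold, else None."""
    known = [k for k in weights if k != "unknown"]
    best = max(known, key=weights.get, default=None)
    if best is not None and weights[best] > threshold:
        return best
    return None

w8 = kingdom_weights({"Bacteria": 8})
w7 = kingdom_weights({"Bacteria": 7})
w15 = kingdom_weights({"Bacteria": 15, "Eukaryota": 4})
w14 = kingdom_weights({"Bacteria": 14, "Eukaryota": 4})
```

Under these assumptions, eight unanimous votes give a weight just above 0.5 while seven do not, and 15 votes outweigh four dissenting votes while 14 do not, which matches the figures stated in the text.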
‘‘Confident’’ kingdom assignments were made for 2,626,178 (46%)
of the 5,654,638 proteins in GOS-100.
In the analysis that identified new multi-kingdom Pfams, we used
the subset of confidently kingdom-annotated proteins. Here, a Pfam
model was designated as ‘‘kingdom-specific’’ in public-100 if there
were only matches to proteins in one particular kingdom, and no
‘‘unknown’’ matches. A Pfam model that was kingdom specific in
public-100 was further designated as newly ‘‘multi-kingdom’’ if it had
matches to one or more GOS-100 proteins that were confidently
labeled as belonging to a kingdom different from that found in the
public-100 matches. Also, we filtered Pfam matches with an E-value
cutoff of 1 × 10^-10. In every case, the bit score is at least five bits
greater than the trusted cutoff for the model. In addition to passing
the ‘‘confident’’ criteria, the kingdom assignments were all confirmed
by visual inspection of the BLAST kingdom vote distributions for the
respective scaffolds. Because the criteria for a ‘‘confident’’ kingdom
assignment were conservative, there were only one or a few confident
assignments for each domain to a ‘‘new’’ kingdom. The ‘‘confident’’
criteria are especially difficult to meet in the case of
kingdom-crossing due to the votes contributed by the crossing protein.
For instance, because the IDO domain itself always contributes four
votes for ‘‘Eukaryota,’’ at least 15 votes for ‘‘Bacteria’’ were required
to call a scaffold ‘‘bacterial.’’ Thus, many scaffolds have no confident
kingdom assignment.
We compared the relative diversities of protein families between
GOS-100 and public-100 as represented by Pfam sequence models. In
order to do this, the number of matches expected to be found for
each Pfam model in the GOS-100 data was computed, assuming that
the matches were distributed among the models in the same
proportions that they were in the public-100 data. These ‘‘expected’’
match counts were compared with the observed counts to identify
domains that are more diverse in GOS-100 than in public-100 and
vice versa.
Because kingdoms differ in their protein usage, Pfam models
match sequences from different kingdoms with different frequencies,
and some models match sequences exclusively from one kingdom.
Thus, to calculate the expected number of matches to a given Pfam in
GOS-100 based on the number of matches observed in public-100, we
corrected for the radically different kingdom composition of the two
datasets.
The expected proportion of all Pfam matches in GOS-100 that are
to a given model M was calculated as follows. First, we made a
simplifying assumption that sequences from different kingdoms were
equally likely to have a Pfam hit, and thus that the Pfam matches in
GOS-100 would be distributed among the kingdoms according to the
kingdom proportions calculated using the weighted method above
(for instance, it is assumed that 97% of the matches would be to
bacterial sequences). Probability that a Pfam hit in GOS-100 is from K
≈ pGOS-Pfam(K) (for sequences in GOS-100 with at least one Pfam hit)
for kingdoms K in {Archaea, Bacteria, Eukaryotes, Viruses}.
Second, we assumed that Pfam models match with the same
relative rates within each kingdom in GOS-100 as they do in
public-100. For instance, since twice as many SH3 domains as SH2
domains are found in public-100 eukaryotic sequences, the same ratio is
expected to be found in GOS-100 eukaryotic sequences. Using the
public-100 data, we calculated the frequency of matches for each
Pfam model M within each kingdom, relative to the total number of
Pfam matches to that kingdom. Pseudocounts of one were added to
both the ‘‘match’’ and ‘‘no match’’ counts (i.e., using a uniform
Dirichlet prior), to allow proper statistical treatment of families with
few or no matches in the public databases for some kingdom. In
Equation 5 below, Obspublic(M,K) is the observed number of public-

Figure 16. GOS-Only Clusters Are Enriched for Sequences of Viral Origin
Independently of the Kingdom Assignment Method Employed
For each panel, clusters are as in Figure 4. For (A–C), a kingdom is
assigned to each neighboring ORF within each cluster set; the
percentage of all neighboring ORFs with a given kingdom assignment
is plotted. For (D–F), a kingdom is assigned to each cluster if more than
50% of all that cluster’s neighbors with a kingdom assignment share the
same assignment; the percentage of clusters in each set with a given
assignment is plotted. In (A) and (D), a kingdom is assigned to a
neighboring ORF by a majority vote of the top four BLAST matches to a
protein in NCBI-nr (Materials and Methods). In (B) and (E), a kingdom is
assigned if all eight highest-scoring BLAST matches agree in kingdom. In
(C) and (F), all ORFs on a scaffold are assigned the same kingdom by
voting among all ORFs with BLAST matches to NCBI-nr on that scaffold
(Materials and Methods). In all graphs, only clusters with at least one
assignable neighbor are considered. When compared to the
size-matched controls, in all cases the GOS-only clusters show enrichment
for viral sequences.
doi:10.1371/journal.pbio.0050016.g016

from each vote frequency, and called the result (or zero, if the result
was negative) the ‘‘kingdom weight.’’ This ‘‘kingdom weight’’ is a
conservative estimate. There is a 95% chance that the actual kingdom
probability is greater.
The kingdom weights do not sum to one because of the standard
deviation penalty. The difference between the sum of the kingdom
weights and unity is a measure of the total uncertainty about the
kingdom assignment. This is called the ‘‘unknown weight.’’
Finally, we assigned each ORF the kingdom weights calculated for
the scaffold as a whole. This procedure assigned kingdom weights to
many ORFs with no BLAST matches. Overall, 4,745,649 (84%) of the
5,654,638 proteins in GOS-100 receive nonzero kingdom weights.
The kingdom weights calculated in this way provide a basis for
estimating the proportion of sequences originating from each
kingdom, pGOS(K). The weights over all sequences in GOS-100 were
summed for each of the known kingdoms, and divided by the sum of
the weights for all kingdoms (excluding the unknown weight). This
procedure suggested that 96% of the sequences are bacterial, a
somewhat higher proportion than is estimated by the method
described in Materials and Methods (Kingdom assignment strategy
and its evaluation). Similarly, kingdom proportions, pGOS-Pfam(K),
were calculated for the subset of GOS-100 sequences that have a
significant Pfam hit, and 97% are found to be bacterial.
We used the kingdom weights directly in the analyses where
In summary, calculation of the expected number of Pfam hits to a

model M in GOS-100 for all kingdoms can be expressed in one
equation as follows:
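$$\left( \sum_{K \in \{A, B, E, V\}} \left[ \frac{\mathrm{Obs}_{\mathrm{public}}(M,K) + 1}{\mathrm{Obs}_{\mathrm{public}}(K) + 2} \times p_{\text{GOS-Pfam}}(K) \right] \right) \times \mathrm{Obs}_{\mathrm{GOS}} \qquad (8)$$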
where Obspublic(M,K) is the observed number of public-100 hits to
model M in K, Obspublic(K) is the observed number of public-100 hits
to all models in K, pGOS-Pfam(K) is the proportion of GOS-100
sequences that have at least one Pfam hit in K, and ObsGOS is the total
number of Pfam hits to all models in GOS-100.
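Equation 8 translates directly into a short function. The sketch below is illustrative: the function name, the dict-based inputs, and the single-letter kingdom keys are hypothetical, and the counts in the usage example are invented purely to show the arithmetic.

```python
KINGDOMS = ("A", "B", "E", "V")  # archaea, bacteria, eukaryota, viruses

def expected_hits(obs_public_mk, obs_public_k, p_gos_pfam_k, obs_gos):
    """Expected number of GOS-100 hits to one Pfam model M (Equation 8).

    obs_public_mk: public-100 hits to model M, per kingdom (missing = 0)
    obs_public_k:  public-100 hits to all models, per kingdom
    p_gos_pfam_k:  kingdom proportions among GOS-100 sequences with a Pfam hit
    obs_gos:       total number of Pfam hits to all models in GOS-100
    """
    expected_proportion = 0.0
    for k in KINGDOMS:
        # Pseudocounts: +1 on "match", +2 on the total (match + no match).
        p_m_given_k = (obs_public_mk.get(k, 0) + 1) / (obs_public_k[k] + 2)
        expected_proportion += p_m_given_k * p_gos_pfam_k[k]
    return expected_proportion * obs_gos

# Invented numbers, for illustration only.
example = expected_hits(
    obs_public_mk={"B": 9, "E": 0},
    obs_public_k={"A": 98, "B": 998, "E": 498, "V": 48},
    p_gos_pfam_k={"A": 0.01, "B": 0.97, "E": 0.01, "V": 0.01},
    obs_gos=1000,
)
```

With these made-up inputs the bacterial term dominates the sum, which mirrors how the 97% bacterial proportion weights the expectation for GOS-100.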
The ratio of the observed to the predicted number of hits for each
Pfam model is a measure of the relative diversity of that Pfam family
in GOS-100 compared to public-100, corrected for the differing
kingdom proportions in the two datasets. We computed the
significance of this ratio using the CHITEST function in Excel, which
implements the standard Pearson’s Chi-square test with one degree of
freedom and expresses the result as a probability. For many protein
families, the difference in diversity between the two datasets was so
pronounced that Excel reports a probability of zero due to numerical
underflow, indicating a p-value less than 1 × 10^-303.
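A Pearson chi-square test with one degree of freedom can be reproduced outside Excel with the Python standard library. The sketch below is an assumption about how the comparison might be set up, since the text does not spell it out: it treats the data as two categories (hits to model M versus hits to all other models), and the function name is hypothetical. For one degree of freedom the upper-tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math

def chi_square_1df(observed_m, expected_m, total_hits):
    """Pearson chi-square statistic and p-value for a two-category split.

    Assumed layout: (hits to M, hits to everything else), observed versus
    expected, both summing to total_hits.
    """
    observed = (observed_m, total_hits - observed_m)
    expected = (expected_m, total_hits - expected_m)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

chi2_small, p_small = chi_square_1df(60, 50.0, 100)
chi2_large, p_large = chi_square_1df(5000, 1000.0, 100000)
```

For a very large statistic the double-precision result underflows to exactly zero, which is the same behavior the text reports for Excel.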
IDO analysis. The GOS-100 and public-100 sequences selected for
the IDO family alignment matched the PF01231 Pfam fs model with a
score above the trusted bit-score cutoff at the sequence level. In
addition, the sequences were required to have the width of their
matching region spanning over 50% of the Pfam IDO HMM model
length. Next, all sequence matches to the Pfam IDO model from the
NCBI-nr database downloaded on March 6, 2006, were added (these
also satisfied the trusted score cutoff and model alignment span
criteria). An additional 26 IDO sequences were found in the new
sequence database relative to the GOS public sequence data freeze
after filtering for identical and 1 aa different sequences and presence
of first and last residues in the final trimmed alignment. Jevtrace
(version 3.14) [149] was used to assess alignment quality, to remove
sequences problematic for alignment, to remove sequence redun-
dancy (at the 0-aa and 1-aa difference levels) while allowing for
redundant nonoverlapping sequences, to trim the alignment to a
block of aligned columns, to delete columns with more than 50%
gaps, and to remove sequences with missing first or last residues. One
sequence (GenBank ID 72038700) was likely a multidomain protein
problematic for alignment and was removed manually. This set of
procedures produced a block sequence alignment of 144 sequences
and 231 characters. We aligned sequences with MUSCLE (version
3.52) [134] using default parameters. The final alignment was used to
reconstruct phylogenies with a series of phylogeny reconstruction
methods: PHYML [150], Tree-Puzzle [151], Weighbor [152], and the
protpars program from the PHYLIP package (version 3.6a3) [145].
Bootstrapping was performed with the protpars program using 1,000
bootstrap replicates, each with 100 jumbles; the majority consensus
tree was produced by the consense program in the PHYLIP package.
Structural genomics implications. The Pfam5000 families used in
this study were chosen from among the manually curated (Pfam-A)
families in Pfam version 17. We added 2,932 families with a
structurally characterized representative as of October 27, 2005, to
the Pfam5000 in descending order by family size, followed by 2,068
additional families without a structurally characterized
representative, in descending order by family size. Pre-GOS family size
was calculated as the number of sequences in public-100 that had a match
to the Pfam family. Post-GOS family size was calculated as the
number of sequences in public-100 and GOS-100 that matched each
family. We used the results of the HMM profiling effort (using Pfams)
for this analysis.
Coverage of GOS-100 and public-100 sequences by both versions
of the Pfam5000 was measured using the subset of families in Pfam 17
that were also in Pfam 16. This was done in order to enable direct
comparison of coverage results with a previous study of coverage of
fully sequenced bacterial and eukaryotic genomes [73]. The versions
of Pfam are similar in size (Pfam 16 contains 7,677 families, and Pfam
17 contains 7,868 families).
Phylogeny construction for various families. For the UVDE family,
sequences were aligned using MUSCLE [134] and a tree was built
using QuickTree [144].
For the PP2C family, the catalytic domain portions of the
sequences were identified and aligned using the PP2C Pfam model.
Sequences that contained 70% nongaps in this alignment were used
to generate a phylogenetic tree of all the PP2C-like sequences. The
phylogeny was inferred using the protdist and neighbor-joining
programs in PHYLIP [145]. We used 1,941 total PP2C-like sequences
for the phylogenetic analysis. The breakdown was as follows: public

Figure 17. Content of Protease Types in NCBI-nr and GOS, and Kingdom
Distribution of All Proteases
Due to the highly redundant nature of some NCBI-nr protease groups,
nonredundant sets for both NCBI-nr and GOS are computed; these
nonredundant sets are referred to as NCBI-nr60 and GOS60.
doi:10.1371/journal.pbio.0050016.g017

100 hits to M in K, and Obspublic(K) is the observed number of
public-100 hits to all models in K.

$$p_{\text{GOS-Pfam}}(M \mid K) \approx p_{\text{pub-Pfam}}(M \mid K) = \frac{\mathrm{Obs}_{\mathrm{public}}(M,K) + 1}{\mathrm{Obs}_{\mathrm{public}}(K) + 2} \qquad (5)$$

By multiplying the conditional probability of each model given a
kingdom by the respective kingdom probability (pGOS-Pfam(K),
calculated as described above in ‘‘Kingdom annotation of GOS-100
proteins: kingdom weighting method’’), the proportions of Pfam
matches in GOS-100 due to each combination of kingdom and Pfam
model were then predicted. Finally, these predictions were summed
across kingdoms to obtain the expected proportion of matches to
each model.

$$p_{\text{GOS-Pfam}}(M) = \sum_{K \in \{A, B, E, V\}} \left[ p_{\text{GOS-Pfam}}(M \mid K)\, p_{\text{GOS-Pfam}}(K) \right] \qquad (6)$$

Relatively fewer GOS-100 sequences than public-100 sequences
have a Pfam hit (likely because Pfam is based on sequences in the
public databases). To avoid systematically overestimating the number
of GOS-100 hits for each Pfam model due to this global effect, the
predicted counts were based on the observed total number of Pfam
matches to all models in GOS-100, and an attempt was made to predict
only how these matches are distributed among models. Thus, the
expected number of Pfam hits to a given model in GOS-100 is equal to
the expected proportion of hits to that model, as calculated above,
multiplied by the total number of Pfam hits. In the equation below,
ObsGOS is the total number of Pfam hits to all models in GOS-100.

$$\text{Expected count of hits to } M \text{ in GOS-100} = p_{\text{GOS-Pfam}}(M) \times \mathrm{Obs}_{\mathrm{GOS}} \qquad (7)$$

Figure 18. Content of Bacterial Protease Clans

Table 15. Clustering Information for Ensembl Sequences for H. sapiens,
M. musculus, R. norvegicus, C. familiaris, and P. troglodytes

Genome          Number of Sequences   Number of Sequences   Number of
                from Ensembl          in Clusters           Clusters
H. sapiens      33,860                31,268                10,536
M. musculus     32,442                30,025                 9,734
R. norvegicus   28,545                27,486                 9,485
C. familiaris   30,308                29,041                 9,397
P. troglodytes  38,822                34,697                 9,978

filter was enabled in the BLAST comparisons, and similarity threshold
in the form of an E-value was set to 1 × 10^-6. In the end, 84,911
proteins with at least 100 aa are identified as ORFans. About 100,000
short ORFans less than 100 aa were removed from this study, because
they may not be real proteins.
Genome sequencing projects and rate of discovery. We used
Ensembl sequences for Homo sapiens, Mus musculus, Rattus norvegicus,
Canis familiaris, and Pan troglodytes. Their clustering information is
shown in Table 15. When we considered the datasets in the order HS,
HS + MM, HS + MM + RN, HS + MM + RN + CF, and HS + MM + RN +
CF + PT, the numbers of distinct clusters were 10,536, 12,731, 13,605,
14,606, and 14,993, respectively. These numbers were compared
against a random subset of NCBI-nr bacterial sequences (of a similar
size) and also against a random subset of GOS sequences. We also
randomized the order of the mammalian sequences to produce a
dataset that was independent of the genome order being considered.
eukaryotic sequences, 73%; public bacterial sequences, 14%;
GOS-eukaryotic sequences, 2%; GOS-bacterial sequences, 10%; and
GOS-viral and GOS-unknown sequences, less than 1% combined.
For the type II GS family, sequences in GOS and NCBI-nr were
searched with a type II GS HMM constructed from 17 previously
known bacterial and eukaryotic type II GS sequences. Matching
sequences from NCBI-nr and GOS were filtered separately for
redundancy at 98% identity; the combined set of sequences was
aligned and a neighbor-joining tree was constructed.
For the RuBisCO family, matching RuBisCO sequences from GOS
and NCBI-nr were filtered separately for redundancy at 90% identity,
resulting in 724 sequences in total. The 724 RuBisCO sequences were
then aligned and a neighbor-joining tree was constructed.
Identification of proteases. We clustered sequences in the MEROPS
Peptidase Database [100] using CD-HIT [116,117] at 40% similarity
level. This resulted in 7,081 sequences, which were then divided into
groups based on catalytic type and Clan identifier. These sequences
were used as queries to search against a clustered version of NCBI-nr
(clustered at 60% similarity threshold) using BLASTP (E-value 1 ×
10^-10). A similar search was carried out against GOS (clustered at 60%
similarity threshold). Figure 17 shows the content of protease types in
NCBI-nr and GOS together with the kingdom distributions. Figure 18
shows the content of bacterial protease clans.
Metabolic enzymes in GOS. Hmmsearch from the HMMER
package [105] was used to search the GOS sequences for different
GS types. The GlnA TIGRFAM model was used for finding GSI
sequences. The HMMs built from known examples of 17 GSII and 18
GSIII sequences from NCBI-nr were used to search the GOS
sequences.

Supporting Information
Protocol S1. Supplementary Information
Found at doi:10.1371/journal.pbio.0050016.sd001 (25 KB DOC).

Accession Numbers
All NCBI-nr sequences from February 10, 2005 were used in our
analysis. Protocol S1 lists the GenBank
(http://www.ncbi.nlm.nih.gov/Genbank) accession numbers of (1) the
genomic sequences used in the PG set, (2) the sequences used in
building GS profiles, and (3) the NCBI-nr sequences used in building
the IDO phylogeny. The other GenBank sequences discussed in this paper
are Bacillus sp. NRRL B-14911 (89089741), Janibacter sp. HTCC2649
(84385106), Erythrobacter litoralis (84785911), and Nitrosococcus
oceani (76881875). The Pfam (http://pfam.cgb.ki.se) structures
discussed in this paper are envelope glycoprotein GP120 (PF00516),
reverse transcriptase (PF00078), retroviral aspartyl protease
(PF00077), bacteriophage T4-like capsid assembly protein (Gp20)
(PF07230), major capsid protein Gp23 (PF07068), phage tail sheath
protein (PF04984), IDO (PF01231), poxvirus A22 protein family
(PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM
(http://www.tigr.org/TIGRFAMs) used in the paper is GlnA: glutamine
synthetase, type I (TIGR00653). The PDB (http://www.rcsb.org/pdb)
identifiers and the names of the eight PDB ORFans with GOS matches
are: restriction endonuclease MunI (1D02), restriction endonuclease
BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction
endonuclease HincII (1TX3), alpha-glucosyltransferase (1Y8Z),
hypothetical protein PA1492 (1T1J), putative protein (1T6T), and
hypothetical protein AF1548 (1Y88).

Acknowledgments
Identification of ORFans in NCBI-nr. ORFans are proteins that do We are indebted to a large group of individuals and groups for
not have any recognizable homologs in known protein databases. A facilitating our sampling and analysis. We thank the governments of
straightforward way to identify ORFans is through all-against-all Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and
sequence comparison using relaxed match parameters. However, this French Polynesia/France for facilitating sampling activities. All
is not computationally practical. An effective approach is to first sequencing data collected from waters of the above-named countries
remove the non-ORFans that can be easily found, and then to identify remain part of the genetic patrimony of the country from which they
ORFans from the remaining sequences. were obtained. We also acknowledge TimeLogic (Active Motif, Inc.)
We identified non-ORFans by clustering the NCBI-nr with CD-HIT and in particular Chris Hoover and Joe Salvatore for helping make
[116,117], an ultrafast sequence clustering program. A multistep the DeCypher system available to us; the Department of Energy for
iterated clustering was performed with a series of decreasing use of their NERSC Seaborg compute cluster; Marty Stout, Randy
similarity thresholds. NCBI-nr was first clustered to NCBI-nr90, Doering, Tyler Osgood, Scott Collins, and Marshall Peterson (J. Craig
where sequences with .90% similarities were grouped. NCBI-nr90 Venter Institute) for help with the compute resources; Peter Davies
was then clustered to NCBI-nr80/70/60/50 and finally NCBI-nr30. and Saul Kravitz (J. Craig Venter Institute) for help with data
After each clustering stage, the total number of clusters of NCBI-nr accessibility issues; Kelvin Li and Nelson Axelrod (J. Craig Venter
was decreased and non-ORFans were identified. A one-step clustering Institute) for discussions on data formats; K. Eric Wommack
from NCBI-nr directly to NCBI-nr30 can be performed. However, the (University of Delaware, Newark) and the captain and crew of the
multistep clustering is computationally more efficient. R/V Cape Henlopen for their assistance in field collection of
At the 30% similarity level, all the NCBI-nr proteins were grouped Chesapeake Bay virioplankton samples; John Glass (J. Craig Venter
into 391,833 clusters, including 259,571 singleton clusters. The Institute) for assistance with the collection and processing of the
proteins in nonsingleton clusters are by definition non-ORFans. virioplankton samples; Beth Hoyle and Laura Sheahan (J. Craig
However, proteins that remain as singletons are not necessarily Venter Institute) for help with paper editing; and Matthew LaPointe
ORFans, because their similarity to other proteins may not be and Jasmine Pollard (J. Craig Venter Institute) for help with figure
reported for two reasons: (1) significant sequence similarity can be formatting. STM, MPJ, CvB, DAS, and SEB acknowledge Kasper
,30%; and (2) in order to prevent a cluster from being too diverse, Hansen for statistical advice. We also acknowledge the reviewers for
CD-HIT, like all other clustering algorithms, may not add a sequence their valuable comments.
to that cluster even if the similarity between this sequence and a Author contributions. SY contributed to the design and imple-
sequence in that cluster meet the similarity threshold. mentation of the clustering process, and the subsequent analyses of
The 259,571 singletons were compared to NCBI-nr with BLASTP the clusters; he also contributed to and coordinated all of the analyses
[38] to identify real ORFans from them. The default low-complexity in the paper, and wrote a large portion of the paper. GS contributed
to the design and analysis of the clustering process, contributed ideas, analysis, and also wrote parts of the paper. DBR identified ORFs from the assemblies, performed the all-against-all BLAST searches, contributed to GOS kingdom assignment, and contributed analysis tools and ideas. ALH performed the assembly of GOS sequences, and contributed analysis tools and ideas. SW contributed to the analysis of viral sequences. KR contributed to project planning and paper writing. JAE performed the analysis of UV damage repair enzymes, and also contributed to paper writing. KBH, RF, and RLS contributed to project planning. GM performed the profile HMM searches, carried out the domain analysis, and contributed to paper writing. WL and AG carried out the ORFan analysis and contributed to paper writing. LJ contributed to the profile-profile search process. PC and AG carried out the analysis of proteases and contributed to paper writing. CSM, HL, and DE carried out the analysis of novel clusters, the analysis of metabolic enzymes and contributed to paper writing. YZ contributed to the profile HMM searches and domain analysis. STM, MPJ, CvB, DAS, and SEB carried out the analysis of Pfam domain distributions in GOS and current proteins, analysis of IDO, contributed to GOS kingdom assignment, and also contributed to paper writing. DAS and SEB also contributed to the Ka/Ks test. JMC and SEB carried out the analysis on the implications for structural genomics and contributed to paper writing. SL, KN, SST, and JED carried out the phosphatase analysis and contributed to paper writing. SST and JED also contributed to project planning. BJR and VB contributed to the analysis of cluster size distribution, family discovery rate, and contributed to paper writing. MF contributed to paper writing, project planning, and ideas for analysis. JCV conceived and coordinated the project, and supplied ideas.
Funding. The authors acknowledge the Department of Energy Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study. GM acknowledges funding from the Razavi-Newman Center for Bioinformatics and was also supported by National Cancer Institute grant P30 CA014195. PC was partially supported by a Center for Proteolytic Pathways (CPP)–National Institutes of Health (NIH) grant 5U54 RR020843–02. CSM, HL, and DE acknowledge the support of DOE Biological and Environmental Research (BER). SL and JED were supported by research grants from NIH. BJR was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. Support for the Brenner lab work was provided by NIH K22 HG00056 and an IBM Shared University Research grant. STM was supported by NIH Genomics Training Grant 5T32 HG00047. MPJ was supported by NIH P20 GM068136 and NIH K22 HG00056. CvB was supported in part by the Haas Scholars Program. DAS was supported by a Howard Hughes Medical Institute Predoctoral Fellowship. JMC was supported by NIH grant R01 GM073109, and by the US Department of Energy Genomics: GTL program through contract DE-AC02-05CH11231.
Competing interests. The authors have declared that no competing interests exist.
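As an illustration of the multistep iterated clustering described under "Identification of ORFans in NCBI-nr", the following is a minimal sketch, not the CD-HIT implementation used in the study. It assumes a toy similarity measure (Python's `difflib.SequenceMatcher` ratio in place of alignment-based identity), unique input sequences, and a simple greedy incremental scheme; the function names `similarity`, `greedy_cluster`, `multistep_cluster`, and `split_singletons` are hypothetical. It shows how only cluster representatives advance to the next, lower threshold, and how nonsingleton clusters yield non-ORFans while singletons remain candidates for a follow-up search.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy stand-in for alignment-based identity (an assumption, not CD-HIT's measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: process longest sequences first; each
    sequence joins the first representative it matches at or above
    `threshold`, otherwise it becomes the representative of a new cluster."""
    clusters = {}  # representative -> sequences assigned to it
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def multistep_cluster(seqs, thresholds=(0.9, 0.8, 0.7, 0.6, 0.5, 0.3)):
    """Cluster at a series of decreasing thresholds. Only the representatives
    of one stage are clustered at the next stage, so each stage works on a
    smaller set than the one before it."""
    members = {s: [s] for s in seqs}  # representative -> original sequences
    for t in thresholds:
        stage = greedy_cluster(list(members), t)
        members = {rep: [m for r in reps for m in members[r]]
                   for rep, reps in stage.items()}
    return members


def split_singletons(members):
    """Sequences in nonsingleton clusters are non-ORFans; singletons are only
    ORFan candidates and still need a direct database search."""
    non_orfans = [m for ms in members.values() if len(ms) > 1 for m in ms]
    singletons = [rep for rep, ms in members.items() if len(ms) == 1]
    return non_orfans, singletons
```

In this sketch the singletons returned by `split_singletons` correspond to the sequences that the text then compares against NCBI-nr with BLASTP; that search step is not modeled here.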
References

1. Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28: 33–36.
2. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540.
3. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. (1997) CATH—A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
4. Thornton JM, Orengo CA, Todd AE, Pearl FM (1999) Protein folds, functions and evolution. J Mol Biol 293: 333–342.
5. Todd AE, Orengo CA, Thornton JM (2001) Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 307: 1113–1143.
6. Coulson AF, Moult J (2002) A unifold, mesofold, and superfold model of protein fold use. Proteins 46: 61–71.
7. Rost B (2002) Did evolution leap to create the protein universe? Curr Opin Struct Biol 12: 409–416.
8. Kinch LN, Grishin NV (2002) Evolution of protein structures and functions. Curr Opin Struct Biol 12: 400–408.
9. Galperin MY, Koonin EV (2000) Who’s your neighbor? New computational approaches for functional genomics. Nat Biotechnol 18: 609–613.
10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
11. Tringe SG, Rubin EM (2005) Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet 6: 805–814.
12. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
13. Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, et al. (2004) Reverse methanogenesis: Testing the hypothesis with environmental genomics. Science 305: 1457–1462.
14. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
15. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.
16. Corpet F, Gouzy J, Kahn D (1998) The ProDom database of protein domain families. Nucleic Acids Res 26: 323–326.
17. Sasson O, Vaaknin A, Fleischer H, Portugaly E, Bilu Y, et al. (2003) ProtoNet: Hierarchical classification of the protein space. Nucleic Acids Res 31: 348–352.
18. Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, et al. (2002) ProClust: Improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 18: S182–S191.
19. Apweiler R, Bairoch A, Wu CH (2004) Protein sequence databases. Curr Opin Chem Biol 8: 76–80.
20. Gasteiger E, Jung E, Bairoch A (2001) SWISS-PROT: Connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 3: 47–55.
21. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26: 320–322.
22. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.
23. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, et al. (2001) TIGRFAMs: A protein family resource for the functional identification of proteins. Nucleic Acids Res 29: 41–43.
24. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. (2004) UniProt: The Universal Protein knowledgebase. Nucleic Acids Res 32: D115–D119.
25. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res 33: D201–D205.
26. Heger A, Holm L (2003) Exhaustive enumeration of protein domain families. J Mol Biol 328: 749–767.
27. Liu X, Fan K, Wang W (2004) The number of protein folds and their distribution over families in nature. Proteins 54: 491–499.
28. Kunin V, Cases I, Enright AJ, de Lorenzo V, Ouzounis CA (2003) Myriads of protein families, and still counting. Genome Biol 4: 401.
29. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, et al. (2005) The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Res 33: D212–D215.
30. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
31. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, et al. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 34: D173–D180.
32. National Center for Biotechnology Information (2005) Blast db [database]. Washington (D.C.): National Center for Biotechnology Information. Available: ftp://ftp.ncbi.nih.gov/blast/db. Accessed 10 February 2005.
33. National Center for Biotechnology Information (2005) Microbial Genome Projects db [database]. Washington (D.C.): National Center for Biotechnology Information. Available: ftp://ftp.ncbi.nih.gov/genomes/Bacteria. Accessed 10 February 2005.
34. Quackenbush J, Liang F, Holt I, Pertea G, Upton J (2000) The TIGR gene indices: Reconstruction and representation of expressed gene sequences. Nucleic Acids Res 28: 141–145.
35. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, et al. (2004) Ensembl 2004. Nucleic Acids Res 32: D468–D470.
36. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, et al. (2004) An overview of Ensembl. Genome Res 14: 925–928.
37. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.
38. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
39. Rychlewski L, Jaroszewski L, Li W, Godzik A (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 9: 232–241.
40. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
41. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence
analysis: Probabilistic models of proteins and nucleic acids. New York: Cambridge University Press. 356 p.
42. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512.
43. Barabasi AL, Oltvai ZN (2004) Network biology: Understanding the cell’s functional organization. Nat Rev Genet 5: 101–113.
44. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) Three Prochlorococcus cyanophage genomes: Signature features and ecological interpretations. PLoS Biol 3: e144.
45. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
46. Strong M, Mallick P, Pellegrini M, Thompson MJ, Eisenberg D (2003) Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: A combined computational approach. Genome Biol 4: R59.
47. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, et al. (2004) Prolinks: A database of protein functional linkages derived from coevolution. Genome Biol 5: R35.
48. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. (2005) STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433–D437.
49. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
50. Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, et al. (2004) Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci U S A 101: 11013–11018.
51. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
52. Paul JH, Sullivan MB (2005) Marine phage genomics: What have we learned? Curr Opin Biotechnol 16: 299–307.
53. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
54. Daubin V, Ochman H (2004) Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res 14: 1036–1042.
55. Hsiao WW, Ung K, Aeschliman D, Bryan J, Finlay BB, et al. (2005) Evidence of a large novel gene pool associated with prokaryotic genomic islands. PLoS Genet 1: e62.
56. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.
57. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. (2002) Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99: 14250–14255.
58. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, et al. (2003) Origins of highly mosaic mycobacteriophage genomes. Cell 113: 171–182.
59. Wilson GA, Bertrand N, Patel Y, Hughes JB, Feil EJ, et al. (2005) Orphans as taxonomically restricted and ecologically important genes. Microbiology 151: 2499–2501.
60. Takami H, Takaki Y, Uchiyama I (2002) Genome sequence of Oceanobacillus iheyensis isolated from the Iheya Ridge and its unexpected adaptive capabilities to extreme environments. Nucleic Acids Res 30: 3927–3935.
61. Wellcome Trust Sanger Institute (2005) Pfam db [database]. Release 17. Cambridge (U.K.): Wellcome Trust Sanger Institute. Available: http://www.sanger.ac.uk/Software/Pfam.
62. Mellor AL, Munn DH (2004) IDO expression by dendritic cells: Tolerance and tryptophan catabolism. Nat Rev Immunol 4: 762–774.
63. Suzuki T, Yokouchi K, Kawamichi H, Yamamoto Y, Uda K, et al. (2003) Comparison of the sequences of Turbo and Sulculus indoleamine dioxygenase-like myoglobin genes. Gene 308: 89–94.
64. Fallarino F, Asselin-Paturel C, Vacca C, Bianchi R, Gizzi S, et al. (2004) Murine plasmacytoid dendritic cells initiate the immunosuppressive pathway of tryptophan catabolism in response to CD200 receptor engagement. J Immunol 173: 3748–3754.
65. Hayashi T, Beck L, Rossetto C, Gong X, Takikawa O, et al. (2004) Inhibition of experimental asthma by indoleamine 2,3-dioxygenase. J Clin Invest 114: 270–279.
66. Muller AJ, DuHadaway JB, Donover PS, Sutanto-Ward E, Prendergast GC (2005) Inhibition of indoleamine 2,3-dioxygenase, an immunoregulatory target of the cancer suppression gene Bin1, potentiates cancer chemotherapy. Nat Med 11: 312–319.
67. Burley SK, Bonanno JB (2003) Structural genomics. Methods Biochem Anal 44: 591–612.
68. Blundell TL, Mizuguchi K (2000) Structural genomics: An overview. Prog Biophys Mol Biol 73: 289–295.
69. Brenner SE (2001) A tour of structural genomics. Nat Rev Genet 2: 801–809.
70. Montelione GT (2001) Structural genomics: An approach to the protein folding problem. Proc Natl Acad Sci U S A 98: 13488–13489.
71. Chance MR, Bresnick AR, Burley SK, Jiang JS, Lima CD, et al. (2002) Structural genomics: A pipeline for providing structures for the biologist. Protein Sci 11: 723–738.
72. Chandonia JM, Brenner SE (2006) The impact of structural genomics: expectations and outcomes. Science 311: 347–351.
73. Chandonia JM, Brenner SE (2005) Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 58: 166–179.
74. Chandonia JM, Brenner SE (2005) Update on the Pfam5000 strategy for selection of structural genomics targets. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China 27: 751–755.
75. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294: 93–96.
76. Service R (2005) Structural biology. Structural genomics, round 2. Science 307: 1554–1558.
77. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017
78. Friedberg E (1985) DNA repair. New York: W. H. Freeman and Co. 614 p.
79. Sancar GB (2000) Enzymatic photoreactivation: 50 years and counting. Mutat Res 451: 25–37.
80. Bowman KK, Sidik K, Smith CA, Taylor JS, Doetsch PW, et al. (1994) A new ATP-independent DNA endonuclease from Schizosaccharomyces pombe that recognizes cyclobutane pyrimidine dimers and 6–4 photoproducts. Nucleic Acids Res 22: 3026–3032.
81. Setlow P (2001) Resistance of spores of Bacillus species to ultraviolet light. Environ Mol Mutagen 38: 97–104.
82. Morikawa K, Ariyoshi M, Vassylyev D, Katayanagi K, Nakamura H, et al. (1994) Crystal structure of T4 endonuclease V. An excision repair enzyme for a pyrimidine dimer. Ann N Y Acad Sci 726: 198–207.
83. Piersen CE, Prince MA, Augustine ML, Dodson ML, Lloyd RS (1995) Purification and cloning of Micrococcus luteus ultraviolet endonuclease, an N-glycosylase/abasic lyase that proceeds via an imino enzyme-DNA intermediate. J Biol Chem 270: 23475–23484.
84. Hunter T (1995) Protein kinases and phosphatases: The yin and yang of protein phosphorylation and signaling. Cell 80: 225–236.
85. Kennelly PJ (2001) Protein phosphatases—A phylogenetic perspective. Chem Rev 101: 2291–2312.
86. Leroy C, Lee SE, Vaze MB, Ochsenbien F, Guerois R, et al. (2003) PP2C phosphatases Ptc2 and Ptc3 are required for DNA checkpoint inactivation after a double-strand break. Mol Cell 11: 827–835.
87. Meskiene I, Baudouin E, Schweighofer A, Liwosz A, Jonak C, et al. (2003) Stress-induced protein phosphatase 2C is a negative regulator of a mitogen-activated protein kinase. J Biol Chem 278: 18945–18952.
88. Takekawa M, Maeda T, Saito H (1998) Protein phosphatase 2Calpha inhibits the human stress-responsive p38 and JNK MAPK pathways. EMBO J 17: 4744–4752.
89. Warmka J, Hanneman J, Lee J, Amin D, Ota I (2001) Ptc1, a type 2C Ser/Thr phosphatase, inactivates the HOG pathway by dephosphorylating the mitogen-activated protein kinase Hog1. Mol Cell Biol 21: 51–60.
90. Bork P, Brown NP, Hegyi H, Schultz J (1996) The protein phosphatase 2C (PP2C) superfamily: Detection of bacterial homologues. Protein Sci 5: 1421–1425.
91. Das AK, Helps NR, Cohen PT, Barford D (1996) Crystal structure of the protein serine/threonine phosphatase 2C at 2.0 Å resolution. EMBO J 15: 6798–6809.
92. Jackson MD, Fjeld CC, Denu JM (2003) Probing the function of conserved residues in the serine/threonine phosphatase PP2Calpha. Biochemistry 42: 8513–8521.
93. Novakova L, Saskova L, Pallova P, Janecek J, Novotna J, et al. (2005) Characterization of a eukaryotic type serine/threonine protein kinase and protein phosphatase of Streptococcus pneumoniae and identification of kinase substrates. FEBS J 272: 1243–1254.
94. Obuchowski M, Madec E, Delattre D, Boel G, Iwanicki A, et al. (2000) Characterization of PrpC from Bacillus subtilis, a member of the PPM phosphatase family. J Bacteriol 182: 5634–5638.
95. Boitel B, Ortiz-Lombardia M, Duran R, Pompeo F, Cole ST, et al. (2003) PknB kinase activity is regulated by phosphorylation in two Thr residues and dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase, in Mycobacterium tuberculosis. Mol Microbiol 49: 1493–1508.
96. Chopra P, Singh B, Singh R, Vohra R, Koul A, et al. (2003) Phosphoprotein phosphatase of Mycobacterium tuberculosis dephosphorylates serine-threonine kinases PknA and PknB. Biochem Biophys Res Commun 311: 112–120.
97. Yeats C, Finn RD, Bateman A (2002) The PASTA domain: A beta-lactam-binding domain. Trends Biochem Sci 27: 438.
98. Schweighofer A, Hirt H, Meskiene I (2004) Plant PP2C phosphatases: Emerging functions in stress signaling. Trends Plant Sci 9: 236–243.
99. Barrett AJ, Rawlings ND, Woessner JF, editors (2004) Handbook of proteolytic enzymes. Amsterdam: Elsevier. 2,140 p.
100. Rawlings ND, Morton FR, Barrett AJ (2006) MEROPS: The peptidase database. Nucleic Acids Res 34: D270–D272.
101. Kumada Y, Benson DR, Hillemann D, Hosted TJ, Rochefort DA, et al. (1993) Evolution of the glutamine synthetase gene, one of the oldest existing and functioning genes. Proc Natl Acad Sci U S A 90: 3009–3013.
102. Valentine RC, Shapiro BM, Stadtman ER (1968) Regulation of glutamine
synthetase. XII. Electron microscopy of the enzyme from Escherichia coli. Biochemistry 7: 2143–2152.
103. Almassy RJ, Janson CA, Hamlin R, Xuong NH, Eisenberg D (1986) Novel subunit-subunit interactions in the structure of glutamine synthetase. Nature 323: 304–309.
104. Eisenberg D, Gill HS, Pfluegl GM, Rotstein SH (2000) Structure-function relationships of glutamine synthetases. Biochim Biophys Acta 1477: 122–145.
105. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755–763.
106. Carlson T, Chelm B (1986) Apparent eukaryotic origin of glutamine synthetase II from the bacterium Bradyrhizobium japonicum. Nature 322: 568–570.
107. Hosted TJ, Rochefort DA, Benson DR (1993) Close linkage of genes encoding glutamine synthetases I and II in Frankia alni CpI1. J Bacteriol 175: 3679–3684.
108. Deuel TF, Ginsburg A, Yeh J, Shelton E, Stadtman ER (1970) Bacillus subtilis glutamine synthetase. Purification and physical characterization. J Biol Chem 245: 5195–5205.
109. Fisher SH, Sonenshein AL (1984) Bacillus subtilis glutamine synthetase mutants pleiotropically altered in glucose catabolite repression. J Bacteriol 157: 612–621.
110. Ellis RJ (1979) The most abundant protein in the world. Trends Biochem Sci 4: 241–244.
111. Hanson TE, Tabita FR (2001) A ribulose-1,5-bisphosphate carboxylase/oxygenase (RubisCO)-like protein from Chlorobium tepidum that is involved with sulfur metabolism and the response to oxidative stress. Proc Natl Acad Sci U S A 98: 4397–4402.
112. Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, et al. (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99: 9509–9514.
113. Li H, Sawaya MR, Tabita FR, Eisenberg D (2005) Crystal structure of a RuBisCO-like protein from the green sulfur bacterium Chlorobium tepidum. Structure (Camb) 13: 779–789.
114. Ashida H, Saito Y, Kojima C, Kobayashi K, Ogasawara N, et al. (2003) A functional link between RuBisCO-like protein of Bacillus and photosynthetic RuBisCO. Science 302: 286–290.
115. Fischer D, Eisenberg D (1999) Finding families for genomic ORFans. Bioinformatics 15: 759–762.
116. Li W, Jaroszewski L, Godzik A (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17: 282–283.
117. Li W, Jaroszewski L, Godzik A (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18: 77–82.
118. Bujnicki JM, Rychlewski L (2001) Identification of a PD-(D/E)XK-like domain with a novel configuration of the endonuclease active site in the methyl-directed restriction enzyme Mrr and its homologs. Gene 267: 183–191.
119. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. (2004) Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci 271: 565–574.
120. Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, et al. (2003) Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185: 6220–6223.
121. Cann AJ, Fandrich SE, Heaphy S (2005) Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 30: 151–156.
122. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31: 365–370.
123. Westbrook J, Feng Z, Chen L, Yang H, Berman HM (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res 31: 489–491.
124. Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, et al. (2003) The Protein Information Resource. Nucleic Acids Res 31: 345–347.
125. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2003) GenBank. Nucleic Acids Res 31: 23–27.
126. Stoesser G, Baker W, van den Broek A, Garcia-Pastor M, Kanz C, et al. (2003) The EMBL Nucleotide Sequence Database: Major new developments. Nucleic Acids Res 31: 17–22.
127. Miyazaki S, Sugawara H, Gojobori T, Tateno Y (2003) DNA Data Bank of Japan (DDBJ) in XML. Nucleic Acids Res 31: 13–16.
128. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, et al. (2002) Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol 3: RESEARCH0079.
129. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89: 10915–10919.
130. Ochman H (2002) Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes. Trends Genet 18: 335–337.
131. Nekrutenko A, Makova KD, Li WH (2002) The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: An empirical and simulation study. Genome Res 12: 198–202.
132. Li WH (1997) Molecular Evolution. Sunderland (MA): Sinauer Associates, Inc. 487 p.
133. Nei M, Kumar S (2000) Molecular evolution and phylogenetics. New York: Oxford University Press. 333 p.
134. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
135. Yang Z (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555–556.
136. Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449.
137. Huynen MA, van Nimwegen E (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15: 583–589.
138. Yanai I, Camacho CJ, DeLisi C (2000) Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification. Phys Rev Lett 85: 2641–2644.
139. Qian J, Luscombe NM, Gerstein M (2001) Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model. J Mol Biol 313: 673–681.
140. Unger R, Uliel S, Havlin S (2003) Scaling law in sizes of protein sequence families: From super-families to orphan genes. Proteins 51: 569–576.
141. Salgado H, Gama-Castro S, Martinez-Antonio A, Diaz-Peredo E, Sanchez-Solano F, et al. (2004) RegulonDB (version 4.0): Transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32: D303–D306.
142. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.
143. Mailund T, Pedersen CN (2004) QuickJoin—Fast neighbour-joining tree reconstruction. Bioinformatics 20: 3261–3262.
144. Howe K, Bateman A, Durbin R (2002) QuickTree: Building huge neighbour-joining trees of protein sequences. Bioinformatics 18: 1546–1547.
145. Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) 3.6 edition [computer program]. Seattle: Department of Genome Sciences, University of Washington, Seattle.
146. Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation. Bioinformatics 19: 1275–1283.
147. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol 305: 567–580.
148. Juretic D, Zoranic L, Zucic D (2002) Basic charge clusters and predictions of membrane protein topology. J Chem Inf Comput Sci 42: 620–632.
149. Joachimiak MP, Cohen FE (2002) JEvTrace: Refinement and variations of the evolutionary trace in JAVA. Genome Biol 3: RESEARCH0077.
150. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.
151. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18: 502–504.
152. Bruno WJ, Socci ND, Halpern AL (2000) Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol Biol Evol 17: 189–197.
PLoS BIOLOGY
Structural and Functional Diversity of the Microbial Kinome
Natarajan Kannan1,2, Susan S. Taylor1,2, Yufeng Zhai3, J. Craig Venter4, Gerard Manning3*
1 Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, California, United States of America, 2 Howard Hughes Medical Institute, University
of California San Diego, La Jolla, California, United States of America, 3 Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California,
United States of America, 4 J. Craig Venter Institute, Rockville, Maryland, United States of America
The eukaryotic protein kinase (ePK) domain mediates the majority of signaling and coordination of complex events in
eukaryotes. By contrast, most bacterial signaling is thought to occur through structurally unrelated histidine kinases,
though some ePK-like kinases (ELKs) and small molecule kinases are known in bacteria. Our analysis of the Global
Ocean Sampling (GOS) dataset reveals that ELKs are as prevalent as histidine kinases and may play an equally
important role in prokaryotic behavior. By combining GOS and public databases, we show that the ePK is just one
subset of a diverse superfamily of enzymes built on a common protein kinase–like (PKL) fold. We explored this huge
phylogenetic and functional space to cast light on the ancient evolution of this superfamily, its mechanistic core, and
the structural basis for its observed diversity. We cataloged 27,677 ePKs and 18,699 ELKs, and classified them into 20
highly distinct families whose known members suggest regulatory functions. GOS data more than tripled the count of
ELK sequences and enabled the discovery of novel families and classification and analysis of all ELKs. Comparison
between and within families revealed ten key residues that are highly conserved across families. However, all but one
of the ten residues has been eliminated in one family or another, indicating great functional plasticity. We show that
loss of a catalytic lysine in two families is compensated by distinct mechanisms both involving other key motifs. This
diverse superfamily serves as a model for further structural and functional analysis of enzyme evolution.
Citation: Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5(3): e17. doi:10.1371/journal.pbio.0050017
Academic Editor: Tony Pawson, Samuel Lunenfeld Research Institute, Canada
Received May 18, 2006; Accepted September 20, 2006; Published March 13, 2007
Copyright: © 2007 Kannan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abbreviations: aa, amino acid; CAK, choline and aminoglycoside kinase; ChoK, choline kinase; ELK, ePK-like kinase; ePK, eukaryotic protein kinase; GOS, Global Ocean Sampling; HMM, hidden Markov model; NCBI, National Center for Biotechnology Information; PKL, protein kinase–like
* To whom correspondence should be addressed. E-mail: manning@salk.edu
The collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Introduction
The eukaryotic protein kinase (ePK) domain is the most abundant catalytic domain in eukaryotic genomes and mediates the control of most cellular processes, by phosphorylation of a significant fraction of cellular proteins [1–3]. Most prokaryotic protein phosphorylation and signaling is thought to occur through structurally distinct histidine-aspartate kinases [4]. However, there is growing evidence for the existence and importance of different families of ePK-like kinases (ELKs) in prokaryotes [5–10]. ePKs and ELKs share the protein kinase–like (PKL) fold [11] and similar catalytic mechanisms, but ELKs generally display very low sequence identity (7%–17%) to ePKs and to each other. Crystal structures of ELKs such as aminoglycoside, choline, and Rio kinases reveal striking similarity to ePKs [12–14], and other ELKs have been defined by remote homology methods [6,15] and motif conservation [16]. Another set of even more divergent PKL kinases are undetectable by sequence methods, but retain structural and mechanistic conservation with ePKs. These include the phosphatidyl inositol kinases (PI3K) and related protein kinases, alpha kinases, the slime mold actin fragmin kinases, and the phosphatidyl inositol 5′ kinases [17–20].
These studies demonstrate that PKL kinases conserve both fold and catalytic mechanisms in the presence of tremendous sequence variation, which allows for an equivalent diversity in substrate binding and function. This makes the PKL fold a model system to investigate how sequence variation maps to functional specialization. Previous studies along these lines include the study of ePK-specific regulatory mechanisms, through ePK–ELK comparison [16], and the sequence determinants of functional specificity within one group (CMGC [CDK, MAPK, GSK3, and CLK kinases]) of ePKs [21].
Previous studies have been hampered by poor annotation and classification of ELK families and their low representation in sequence databases relative to ePKs. Recent large-
Microbial Kinome
Author Summary
The huge growth in sequence databases allows the characterization of
every protein sequence by comparison with its relatives. Sequence
comparisons can reveal both the key conserved functional motifs that
define protein families and the variations specific to individual
subfamilies, thus decorating any protein sequence with its evolu-
tionary context. Inspired by the massive sequence trove from the
Global Ocean Sampling project, the authors looked in depth at the
protein kinase–like (PKL) superfamily. Eukaryotic protein kinases
(ePKs) are the pre-eminent controllers of eukaryotic cell biology and
among the best studied of enzymes. By contrast, their prokaryotic
relatives are much more poorly known. The authors hoped to both
characterize and better understand these prokaryotic enzymes, and
also, by contrast, provide insight into the core mechanisms of the
eukaryotic protein kinases. The authors used remote homology
methods, and bootstrapped on their discoveries to detect more than
45,000 PKL sequences. These clustered into 20 major families, of which
the ePKs were just one. Ten residues are conserved between these
families: six were known to be important in catalysis, but four more,
including three highly conserved in ePKs, are still poorly understood,
despite their ancient conservation. Extensive family-specific features
were found, including the surprising loss of all but one of the ten key
residues in one family or another. The authors explored some of these
losses and found several cases in which changes in one key motif
substitute for changes in another, demonstrating the plasticity of
these sequences. Similar approaches can be used to better under-
stand any other family of protein sequences.
scale microbial genomic sequencing, coupled with Global Ocean Sampling (GOS) metagenomic data, now allow a much more comprehensive analysis of these families. In particular, the GOS data provides more than 6 million new peptide sequences, mostly from marine bacteria [22,23], and more than triples the number of ELK sequences. Here, for the first time, we define the extent of 20 known and novel PKL families, define a set of ten key conserved residues within the catalytic domain, and explore specific elaborations that mediate the unique functions of distinct families. These highlight both underappreciated aspects of the catalytic core and unique family-specific features, which in several cases reveal correlated changes that map to concerted variations in structure and mechanism.

Results
Discovery and Classification of PKL Kinase Families
Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs as well as with a motif model focused on key conserved PKL motifs [16,24]. Results of each approach were used to iteratively build, search, and refine new sets of HMMs, using both public and GOS data. Weak but significant sequence matches were used as seeds to define and elaborate novel families. The final result was 16,248 GOS sequences (Dataset S1) classified into 20 HMM-defined PKL families (Table 1; Figure 1; Dataset S2). A similar analysis of the National Center for Biotechnology Information nonredundant public database (NCBI-nr) revealed 24,924 ePK and 5,151 ELK sequences (Dataset S1). More than 1,400 of the NCBI ELK sequences were annotated as hypothetical or unknown, and several hundred more are misannotated or have no functional annotation. GOS data at least doubles the size of most families, and permits an in-depth analysis of family structure and conservation. Two families that are more than 10-fold enriched in GOS (CapK and HSK2) are found largely in α-proteobacteria, which are also highly enriched in GOS. Both CapK and HRK contain viral-specific subfamilies that are also greatly GOS enriched, indicating that differences in kinase distribution between databases are largely due to taxonomic biases. As expected, eukaryotic-specific families (ePK, Bub1, PI3K, AlphaK) are underrepresented in GOS.

Figure 1. Sequence and Structure Based Clustering of PKL Families
Despite minimal sequence similarity, relationships between families can be estimated by profile–profile matching and alignments restricted to conserved motifs. Three main clusters of families are seen (shaded ovals): CAK, ePK, and KdoK. Four more families (towards bottom) are distantly related to these clusters, while three more (PI3K, AlphaK, IDHK, at bottom) have no sequence similarity outside a subset of key motifs. The area of each sphere represents the family size within GOS data.
doi:10.1371/journal.pbio.0050017.g001

Functional Diversity of PKL Families
These 20 PKL families display great functional and sequence diversity, though common sequence motifs and functional themes recur. Some families are entirely uncharacterized, and few have been well studied, though most have some characterized members, many with known kinase activity. Their substrates include proteins and small molecules such as lipids, sugars, and amino acids, and they generally appear to have regulatory functions (Table 1). This is in contrast to the diversity of several other structurally unrelated, small-molecule kinase families that play largely metabolic roles [25]. Profile–profile alignments show clear but distant relationships between several families, which are enclosed by ovals in Figure 1. The ePK cluster includes pknB, which is highly similar to but distinct from ePK and is
Table 1. The 20 PKL Families, Their Gene Counts in GOS and NCBI-nr, and Functional Notes
Family | GOS Count/NR Count | Description
ePK | 2753/24924 | Eukaryotic protein kinase, almost exclusively eukaryotic.
pknB | 1525/1047 | Bacterial-specific, enriched in several phyla, closely related to ePK.
BLRK | 21/28 | Bacterial leucine-rich kinase, an uncharacterized ePK-like family containing leucine-rich repeats. Mostly restricted to proteobacteria.
GLK | 38/17 | Glycosylase-linked kinase. Previously unannotated. Sequences from bacterial phyla fused to or neighbors of a DNA glycosylase domain. Archaeal members are neighbors of tRNA-associated genes.
HRK | 259/78 | Haspin-related kinase. Consists of eukaryotic haspin protein kinases [55] and two distinct sets of largely viral kinases, one of which lacks the GxGxxG and VAIK motifs, suggesting that they may be catalytically inactive and/or interfere with host kinase signaling.
Bub1 | 9/112 | Pan-eukaryotic protein kinase, functions in mitotic spindle assembly.
Bud32 | 139/123 | Universal/single copy gene in eukaryotes and archaea. In vertebrates it phosphorylates p53, while in S. cerevisiae it is involved in bud site selection, both unconserved processes [56]. Recently implicated in telomere regulation [57].
Rio | 133/249 | Universal eukaryotic/archaeal protein kinase, implicated in control of translation, a function that is highly conserved between eukaryotes and archaea [58]. Also found in some bacteria, particularly proteobacteria.
KdoK | 389/199 | Small family of bacterial kinases known to phosphorylate sugar moieties of LPS [59]. Reportedly autophosphorylates on tyrosine [60]. High sequence variation, even at key motifs, suggests diverse functions.
CAK | 3997/1427 | Choline and aminoglycoside kinases. Includes many novel subfamilies. Bacterial choline kinase (ChoK, licA) modifies LPS, enabling mucosal binding for several human-commensal and pathogenic bacteria [61]. Expression is controlled by phase variation. Metazoan choline kinases are involved in the production of phosphatidyl choline and acetylcholine, and metazoans also have related ethanolamine kinases. Aminoglycoside kinases (aminoglycoside phosphotransferases, APH) phosphorylate and inactivate aminoglycosides, antibiotics that target the bacterial ribosome [62]. They are produced as antidotes by aminoglycoside-producing bacteria, and by many of their targets.
HSK2 | 1649/93 | One of several structurally distinct homoserine kinases, involved in threonine biosynthesis. Found mostly in α-proteobacteria, mirroring the distribution of HSK1 in γ-proteobacteria. Unlike HSK1, it does not chromosomally cluster with other threonine biosynthesis genes, but is usually linked to the lytB gene involved in isoprenoid biosynthesis, to RNAse H1 and to clusters of novel genes (NK, GM, unpublished data). These suggest additional functions for this family.
FruK | 390/136 | Fructosamine kinase. Initiates repair of aging proteins by phosphorylating residues damaged by glycosylation, leading to their repair [63]. Found in most eukaryotes and many bacteria, and may also have sugar kinase activities [64].
MTRK | 144/38 | MethylThioRibose kinase. Involved in a sulphur salvage pathway of methionine synthesis. Expression is controlled by methionine levels in K. pneumoniae [65] and by starvation in B. subtilis [66]. Present in select bacteria and plants, but not in higher eukaryotes.
UbiB | 4110/623 | UbiB (ABC1 in eukaryotes). Regulates the ubiquinone (co-enzyme Q) biosynthesis pathway in both prokaryotes and yeast [67,68]. It is speculated to activate an unknown mono-oxygenase in the ubiquinone biosynthesis pathway, possibly in response to aerobic induction. Ubiquitous in eukaryotes and widespread in bacteria.
MalK | 29/80 | Maltose kinase. Contains two members shown biochemically to be maltose kinases [69]. Most public members are annotated as trehalose synthases, based on transitive annotation from a member that is fused to trehalose synthase.
RevK | 116/77 | Reverse kinase. Novel family that lacks the N-terminal ATP-binding GxGxxG loop, but on the C-terminus is usually fused to a P-type ATPase domain, including an ATP-binding GxxGxG motif. No functional annotation.
CapK | 308/21 | Capsule kinase. Uncharacterized family. Chromosomal neighbors in two bacterial phyla involved in capsule synthesis. A viral subset lacks obvious motifs upstream of the H164xD motif, reminiscent of the viral HRK subfamily.
PI3K | 79/702 | Eukaryotic PI3′ and PI4′ lipid kinases and associated PIKK protein kinases.
AlphaK | 13/100 | Eukaryotic protein kinases with diverse functions.
IDHK | 111/58 | Isocitrate dehydrogenase kinase (AceK). Small, highly conserved family. Activates the glyoxylate bypass in E. coli, used for survival on acetate or fatty acids, by phosphorylating and inhibiting isocitrate dehydrogenase [70].
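The GOS enrichment discussed in the Results can be recomputed directly from the counts in Table 1. The following is a minimal sketch, not the authors' analysis: the function names and the 10-fold and 0.2 thresholds are illustrative assumptions, and the ratio is simply the GOS count divided by the NCBI-nr count for each family.

```python
# Family counts transcribed from Table 1 as (GOS count, NCBI-nr count).
TABLE1_COUNTS = {
    "ePK": (2753, 24924), "pknB": (1525, 1047), "BLRK": (21, 28),
    "GLK": (38, 17), "HRK": (259, 78), "Bub1": (9, 112),
    "Bud32": (139, 123), "Rio": (133, 249), "KdoK": (389, 199),
    "CAK": (3997, 1427), "HSK2": (1649, 93), "FruK": (390, 136),
    "MTRK": (144, 38), "UbiB": (4110, 623), "MalK": (29, 80),
    "RevK": (116, 77), "CapK": (308, 21), "PI3K": (79, 702),
    "AlphaK": (13, 100), "IDHK": (111, 58),
}


def gos_enrichment(counts):
    """Return {family: GOS count / NCBI-nr count} (hypothetical helper)."""
    return {fam: gos / nr for fam, (gos, nr) in counts.items()}


def families_above(counts, fold):
    """Families whose GOS/NCBI-nr ratio is greater than `fold`."""
    return sorted(f for f, r in gos_enrichment(counts).items() if r > fold)


def families_below(counts, fold):
    """Families whose GOS/NCBI-nr ratio is smaller than `fold`."""
    return sorted(f for f, r in gos_enrichment(counts).items() if r < fold)


if __name__ == "__main__":
    ranked = sorted(gos_enrichment(TABLE1_COUNTS).items(), key=lambda kv: -kv[1])
    for family, ratio in ranked:
        print(f"{family:7s} {ratio:6.2f}")
    print("more than 10-fold enriched:", families_above(TABLE1_COUNTS, 10.0))
```

With these counts, only CapK and HSK2 exceed a 10-fold ratio, and the four eukaryotic-specific families (ePK, Bub1, PI3K, AlphaK) are the only ones with a ratio below 0.2, in line with the text.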
distinguished by its exclusive bacterial specificity, as opposed to the mostly eukaryotic ePK family. The other major cluster is centered on the large and divergent CAK (choline and aminoglycoside kinase) family, and includes three other families of small-molecule kinases. CAK itself is particularly diverse, containing subfamilies that are specific for choline/ethanolamine and aminoglycosides, as well as many novel subfamilies, some of which are specific to eukaryotic sublineages. A looser cluster is formed between the Rio and Bud32 families, which are universal among both eukaryotes and archaea, and the bacterial lipopolysaccharide kinase family KdoK. An additional four families (UbiB, revK, MalK, CapK) are distantly related to all three clusters, and are distinct from another set (PI3K, AlphaK, and IDHK), which have even less similarity to any other kinase; for PI3K and AlphaK, the relationship to kinases was determined by structural comparisons [11], while IDHK displays only conservation of the key residues and motifs found in all PKL kinases.
Sequence similarity between these 20 families varies from very low (~20%) to almost undetectable. Sequence-profile methods are generally required to align families within the oval clusters of Figure 1, while alignments between clusters require profile–profile methods. The diversity of this collection is demonstrated by comparison with the automated sequence- and profile-based clustering of the overall GOS analysis [22], which assigns 93% of these sequences into 32 clusters, each of which is largely specific to one of our 20 families.

Key Conserved Residues Unify Diverse Kinase Families
Comparison between all families reveals a set of ten key residues that not only account for one-third of the residues conserved within each family, but also are consistently
Figure 2. The Conserved Core and Variable Regions of the Catalytic Domain
The conserved core in three distinct families, namely ePK (PKA [52]), Rio (A. fulgidus Rio2 [14]), and CAK (APH(3′)-IIIa [12]). The conserved regions are
shown in ribbon representation and the variable regions in surface representation. The illustrations were created in PyMOL (http://www.pymol.org).
Some highly conserved residues (see Figure 3) and their associated interactions are shown.
conserved between families, constituting a core pattern of conservation that helps define this superfamily (Table 2, Figure 2, Figure 3). These residues are conserved across the major divisions of life, which diverged one to two billion years ago, and across diverse families, which presumably diverged even earlier. Thus, they are likely to mediate core functions of the catalytic domain rather than merely maintaining their structures. Six of these residues are known to be involved in ATP and substrate binding and catalysis (G52, K72, E91, D166, N171, D184; residues numbered based on PKA structure 1ATP except where otherwise noted; see Table 3). The full functions of the other four remain unclear, though three of them (H158, H164, and D220) are part of a hydrogen-bonding network that links the catalytically important DFG motif with substrate binding regions (Figure 2). The conservation of this network across diverse PKL structures suggested a role for this network in coupling DFG motif-associated conformational changes with substrate binding and release [16]. Despite this ancient conservation, different families of ePKs have lost individual members of this triad without destroying structure or catalytic function: H164 is changed to a tyrosine in PKA and many other AGC families; H158 is lost in most tyrosine kinases; and D220 is lost in the Pim family. The Pim1 structure retains an ePK-like structure, perhaps in part due to stabilization of the catalytic loop by the activation loop, a function normally performed by D220 [26], suggesting a novel mode of coupling ATP and substrate binding in this family. The individual loss of each member of this triad suggests that they have independent functions yet to be understood.

Sequence and Structural Diversity
Family-specific functions are mediated by features that are highly conserved within families, but that are divergent between families (Figure 4). Many family-selective residues map to the motifs surrounding the ten key residues, or to the divergent C-terminal substrate-binding region (Tables 2 and S1). The proximity of these residues to the active site suggests that they are key in selecting substrates or tuning mechanism of action. For instance, the 4–amino acid (aa) stretch between the HxD166 and N171 residues is highly conserved but distinct between families (Figure 4), and provides a discriminative signature that defines each family. Within ePKs, tyrosine and serine/threonine-specific kinases display distinct patterns of conservation within this 4-aa stretch [27]. Serine/threonine kinases conserve a [LI]KPx motif within this stretch, while tyrosine kinases conserve a [LI]AAR motif. These variations alter the surface electrostatics of the substrate-binding pocket, thereby contributing to substrate specificity [27].
The C-terminal region of ~100 aa following the DFG motif is highly divergent between families, apart from the conserved D220 at the beginning of the F-helix (Figure 2; Dataset S3). Secondary structure is generally predicted to be helical, but the poor sequence conservation and known structures [11] suggest that the overall orientation of the helices may be different between families. Notably, in the crystal structures of APH bound to its substrate, kanamycin [28], the relative positioning of the substrate-binding helices (αH–αI) is distinct from that of ePKs (Figure 2). The presence of unique patterns of conservation in each family (Table 2) also suggests that this region is involved in family-specific functions.
Several families contain sizeable (~30–100 aa) insert segments between core subdomains that are specific to clusters of families. Most CAK members have an insert segment between subdomains VIa and VIb. There is very little sequence similarity within this segment across CAK members, but structures of APH and ChoK indicate some structural similarity and highlight its role in substrate binding [28,29]. An equivalent insert is seen in the other CAK cluster families, FruK, HSK2, and MTRK. Similarly, KdoK and Rio contain an insert between subdomains II and III, which shows some sequence similarity between these families. In the Rio2 structure, this insert is disordered, but the presence of a conserved threonine suggests a possible regulatory role [14]. This region also contains an insert in the distinct UbiB family.
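The four-residue signature between HxD166 and N171 described above lends itself to a simple pattern test. The following is a minimal sketch, not the method used in this study: the regular expressions, the function name, and the decision to accept Y as well as H at position 164 (as in PKA) are illustrative assumptions, and the input strings in the usage note are toy catalytic-loop fragments rather than curated sequences.

```python
import re

# Catalytic-loop window: [H/Y]-x-D (D166), a 4-aa signature, then N (N171).
# Accepting Y at the first position is an assumption made for this sketch.
LOOP = re.compile(r"[HY].D(?P<sig>.{4})N")
SER_THR_SIG = re.compile(r"[LI]KP.")  # [LI]KPx, serine/threonine kinases
TYR_SIG = re.compile(r"[LI]AAR")      # [LI]AAR, tyrosine kinases


def classify_signature(seq):
    """Return (signature, label) for the first loop-like window, or None."""
    match = LOOP.search(seq.upper())
    if match is None:
        return None
    sig = match.group("sig")
    if SER_THR_SIG.fullmatch(sig):
        label = "Ser/Thr-like"
    elif TYR_SIG.fullmatch(sig):
        label = "Tyr-like"
    else:
        label = "other"
    return sig, label


if __name__ == "__main__":
    for fragment in ("YRDLKPENLL", "HRDLAARNVL", "HGDLHPGNIL"):
        print(fragment, classify_signature(fragment))
```

A fragment with the signature LKPE is labeled Ser/Thr-like, one with LAAR is labeled Tyr-like, and any other four-residue stretch falls through to "other", which is where the family-specific signatures of non-ePK families would land.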
Figure 3. Conservation of Secondary Structure, Key Motifs, and Residues between Families
The ePK secondary structure is shown with standard annotations of subdomains [53] and structural elements. Subdomains I–IX are generally conserved
in all PKLs. Key residues are bolded and numbered; dashed lines point to positions within secondary structure elements. The table below shows the
conservation (% identity) of the ten key residues, showing their broad conservation across families, but the successful replacement of almost all of them
in at least one family. Parentheses indicate changes to another conserved residue and dashes indicate unconserved positions. Key residues are
numbered based on their position in PKA: G52, K72, E91, P104 (V in PKA), H158, H164 (Y in PKA), D166, N171, D184, and D220. More detailed figures are shown
in Dataset S3.
Finally, the ePK, pknB, and HRK families contain an extended activation loop between subdomains VIII and IX. These kinases are generally activated by phosphorylation of this loop, the negative charge of which helps to coordinate key structural elements during the activation process, including a family-selective HRD arginine in the catalytic loop [30,31].

Mechanistic Diversity of the Catalytic Core
A surprising finding was that while ten key residues are conserved both within and between families, all but one of them was dispensable in one family or another (Figure 3), indicating that even catalytic residues are malleable in the appropriate context. Here we explore the effect of loss of the “catalytic lysine” K72, which typically positions the α and β phosphates of ATP (Figure 5A). Mutation of this lysine in ePKs is a common method to make inactive kinases [32]. Yet this residue is conserved as an arginine (R111ChoK) in most CAK subfamilies, as a methionine in the CAK-chloro subfamily, and as a threonine in the related HSK2 family (Figure 4).
In the two major CAK subfamilies with a conserved R72 (FadE and choline kinase [ChoK]), we see correlated changes in the glycine-rich and DFG motifs (Figure 4). Specifically, the Phe and Gly within the GxGxFG motif (F54 and G55) are changed to Ser/Thr and Asn, respectively (S86ChoK, N87ChoK), and G186 within the DFG motif is changed to E. Both the GxGxFG and DFG motifs are spatially proximal to K72 (Figure 5A). Thus, correlated changes in these two motifs could structurally account for the K-to-R change. Indeed, in the ChoK crystal structure [13], N55 protrudes into the ATP binding pocket, and hydrogen bonds to R72. In addition, the conserved E91 in helix C, which typically forms a salt bridge with K72, is hydrogen bonded (via a water molecule) to the covarying E186, thus linking these three correlated changes and stabilizing R72 in a unique conformation (Figure 5B). By contrast, the two solved APH structures (1ND4 and 2BKK) retain the “ancestral” sequence state with K72 and G186, and lack N55.
Mutation of R72 or E186 to alanine in ChoK reduces the catalytic rate by several fold [33]. To test the possible role of these residues in the ChoK catalytic mechanism, we modeled an ATP in the active site of ChoK (based on the nucleotide-bound structures of APH and PKA). This revealed that R72 partially occludes the ATP binding site and is likely to move upon ATP binding. Notably, a K72-to-R mutation in Erk2 [34] also exhibits a conformational change in R72 upon nucleotide
Table 2. Distribution of Residues That Are >90% Identical within Each Family
Family | Key Residues | Motif-Associated | C-Term Unique | Semiconserved | Other | Total
ePK | 8 | 4 | 1 | 4 | 0 | 17
pknB | 9 | 3 | 0 | 8 | 0 | 20
BLRK | 8 | 12 | 4 | 4 | 18 | 46
HRK | 5 | 2 | 0 | 1 | 0 | 8
Bub1 | 8 | 7 | 3 | 5 | 4 | 27
GLK | 7 | 1 | 2 | 2 | 0 | 12
Bud32 | 10 | 6 | 3 | 2 | 2 | 23
Rio | 9 | 4 | 0 | 1 | 0 | 14
KdoK | 2 | 0 | 0 | 1 | 0 | 3
CAK | 5 | 0 | 0 | 0 | 0 | 5
HSK2 | 9 | 9 | 11 | 7 | 4 | 40
FruK | 9 | 6 | 5 | 5 | 2 | 27
MTRK | 9 | 10 | 4 | 2 | 8 | 33
UbiB | 8 | 6 | 0 | 1 | 4 | 19
MalK | 8 | 13 | 15 | 0 | 10 | 46
revK | 5 | 1 | 2 | 0 | 5 | 13
CapK | 1 | 0 | 0 | 0 | 0 | 1
PI3K | 4 | 3 | 1 | 1 | 2 | 11
AlphaK | 5 | 4 | 6 | 1 | 7 | 23
IDHK | 9 | 17 | 10 | 1 | 23 | 60
Total | 138 (31%) | 108 (24%) | 67 (15%) | 46 (10%) | 89 (20%) | 448 (100%)
Across the ~250-aa domain, almost one-third of >90% conserved residues map to the ten key residues that are also conserved between families, and more than half map to these key residues or their surrounding motifs (GxGxxGxxxx, vaiK, E, vP, LxxLH, xxHxDxxxxNxx, xxDxGxx, DLA; boldfacing indicates the ten key residues). An additional 15% are found in the largely unalignable region C-terminal of DxG, strongly suggesting family-specific functions, and 10% are semi-conserved, being found in some, but not most families. See Dataset S3 for details.
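The column totals and percentages in the last row of Table 2 follow from the per-family counts. As a minimal sketch, with the per-family rows transcribed from the table and helper names that are purely illustrative, they can be recomputed as follows.

```python
# Per-family counts from Table 2, in column order:
# (key residues, motif-associated, C-term unique, semiconserved, other).
TABLE2 = {
    "ePK": (8, 4, 1, 4, 0), "pknB": (9, 3, 0, 8, 0),
    "BLRK": (8, 12, 4, 4, 18), "HRK": (5, 2, 0, 1, 0),
    "Bub1": (8, 7, 3, 5, 4), "GLK": (7, 1, 2, 2, 0),
    "Bud32": (10, 6, 3, 2, 2), "Rio": (9, 4, 0, 1, 0),
    "KdoK": (2, 0, 0, 1, 0), "CAK": (5, 0, 0, 0, 0),
    "HSK2": (9, 9, 11, 7, 4), "FruK": (9, 6, 5, 5, 2),
    "MTRK": (9, 10, 4, 2, 8), "UbiB": (8, 6, 0, 1, 4),
    "MalK": (8, 13, 15, 0, 10), "revK": (5, 1, 2, 0, 5),
    "CapK": (1, 0, 0, 0, 0), "PI3K": (4, 3, 1, 1, 2),
    "AlphaK": (5, 4, 6, 1, 7), "IDHK": (9, 17, 10, 1, 23),
}


def column_totals(table):
    """Sum each category over all families."""
    return tuple(sum(column) for column in zip(*table.values()))


def percentages(totals):
    """Express each category total as a rounded percentage of the grand total."""
    grand_total = sum(totals)
    return tuple(round(100 * t / grand_total) for t in totals)


if __name__ == "__main__":
    totals = column_totals(TABLE2)
    print(totals, sum(totals), percentages(totals))
```

The key-residue and motif-associated categories together give 138 + 108 = 246 of 448 residues, which is the "more than half" quoted in the table note.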
binding (Figure 5C). A similar conformational change in ChoK upon ATP binding could result in formation of a R72–E91 salt bridge similar to the activation of ePKs (Figure 5A). In this conformation, R72 could potentially hydrogen bond to both E91 and the covarying E186 in ChoKs, which might explain the covariation of R72 and E186 in these families.

Variation on a Theme
Other CAK members display distinct coordinated changes at the G55, K72, and G186 positions. The chloro subfamily of CAK loses the positive charge at position 72 altogether, replacing it with methionine, and has concurrent changes to R55 and Q186 (Figure 4). This may reflect a shift of the positive charge from position 72 to 55, an event that also happened in Wnk kinases, the only functional ePK family that lacks K72. The conserved K55 of Wnks is required for catalysis and has been shown to interact with ATP similarly to K72 of PKA [35] (Figure 5D). Hence, two evolutionary inventions may have converted the same core motif residue from one function to another. In CAK-chloro, the unpaired E91 position loses its charge to become a conserved Phe. The function of this Phe is unknown, but is likely to be important since it is also conserved in HSK2, a related family, and the only other kinase family to conserve a Phe at the E91 position (Figure 4).

Evolution of Conformational Flexibility and Regulation in ePKs
The ePK catalytic domain is highly flexible and undergoes extensive conformational changes upon ATP binding [36]. In contrast, crystal structures of APH, solved in both ATP-bound and -unbound forms, revealed modest structural changes in the ATP-binding pocket [37]. This difference in conformational flexibility is reflected in the patterns of conservation at key positions within the ATP-binding glycine-rich loop (Figure 4). Specifically, two conserved glycines (G50 and G55), which contribute to the conformational flexibility of this loop in ePKs, are replaced by non-glycines in APH. These two glycines are absent in several PKL families (Figure 4) while G52, which is involved in catalysis, is present in most, suggesting that the conformational flexibility of the nucleotide-binding loop is a feature of selected PKL families such as ePKs. Since conformational flexibility allows for regulation, it is likely that modest structural changes associated with nucleotide binding gradually evolved into quite dramatic structural rearrangements required to ensure that key players in various signaling pathways act only at the right place and at the right time. The conserved glycine (G186) within the catalytically important DFG motif may likewise have evolved for regulatory functions in ePKs [38]. This glycine is highly conserved in the ePK cluster but is absent from most other
Figure 4. Sequence Logos Depicting Conservation of Core Motifs and Neighboring Sequences across Most Kinase Families and Selected CAK Subfamilies
Motifs are GxGxxGxxxx, VAIK, E, LxxLH, xxHxDxxxxNxx, xxDFGxx, and Dxx. The size of the letters corresponds to their information content [54]. Families with fewer than 100 members (BLRK, GLK) are omitted. The diverse CAK family is represented by four distinct subfamilies: APH contains many aminoglycoside resistance kinases and ChoK includes most ChoKs, while FadE and chloro are less well described. For the HRK family, the first two motif logos omit the viral subfamily that lacks these motifs.
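The motif strings listed in this caption use "x" for any residue, so they can be translated mechanically into search patterns. The sketch below is only an illustration of that notation: the study itself relied on HMM profiles and a motif model [16,24], and the helper names and the toy sequence here are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'xxHxDxxxxNxx' into a regex string.

    Lowercase 'x' matches any residue; every other character is literal.
    """
    return "".join("." if ch == "x" else re.escape(ch) for ch in motif)


def find_motif(motif, seq):
    """Return 0-based start positions of all matches, including overlaps."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequence carrying the core motifs in order.
    toy = "MAGTGSFGRVAIKILEHRDLKPENLLIDFGAA"
    for motif in ("GxGxxGxxxx", "VAIK", "xxHxDxxxxNxx", "xxDFGxx"):
        print(motif, find_motif(motif, toy))
```

Exact-pattern matching of this kind is far less sensitive than the profile methods described in the Results, so it is best read as a way to make the motif notation concrete rather than as a detection method.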
Table 3. Structural/Functional Role of Highly Conserved Residues
Residue (PKA Number) | Structural Location | Structural/Functional Role
G52 | Glycine-rich loop | The backbone of this glycine coordinates the γ-phosphate of ATP and facilitates phosphoryl transfer [71].
K72 | β3 strand | Hydrogen bonds to the α-oxygen and β phosphate of ATP [72].
E91 | C-helix | Forms a salt bridge interaction with K72 and functions as a regulatory switch in ePKs [73].
P104 | αC-β4 loop | Unknown function. Absent from ePKs.
H158 | C-terminus of E helix | Hydrogen bonds to D220 in the F-helix and is part of a hydrogen bond network that couples ATP and substrate-binding regions [16].
H164 (Y in PKA) | Catalytic loop αE-β6 | Hydrogen bonds to the backbone of the residue before the DFG-Asp and integrates substrate and ATP binding regions [16].
D166 | Catalytic loop | Catalytic base [72].
N171 | Catalytic loop | Coordinates the second Mg2+ ion and is involved in phosphoryl transfer.
D184 | N-terminus of activation loop | Coordinates the first Mg2+ ion.
D220 | N-terminus of F helix | Hydrogen bonds to the backbone of the catalytic loop and positions this loop relative to substrate-binding regions [16].
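Table 3 fixes the canonical identity of each key residue in PKA numbering, while the text describes families that replace some of them. A minimal sketch of how such replacements could be tabulated is given below; the function name is hypothetical, and the observed residues in the usage example are limited to substitutions stated in the text (R72 in ChoK, M72 and F91 in CAK-chloro, T72 and F91 in HSK2, Y164 in PKA).

```python
# Canonical residue at each of the ten key positions (PKA numbering, Table 3).
CANONICAL = {
    52: "G", 72: "K", 91: "E", 104: "P", 158: "H",
    164: "H", 166: "D", 171: "N", 184: "D", 220: "D",
}


def substitutions(observed, canonical=CANONICAL):
    """Map position -> (canonical, observed) where the two differ.

    `observed` may cover only some key positions; positions that are
    not supplied are simply not checked.
    """
    return {
        pos: (canonical[pos], aa)
        for pos, aa in sorted(observed.items())
        if pos in canonical and aa != canonical[pos]
    }


if __name__ == "__main__":
    examples = {
        "ChoK": {72: "R", 91: "E"},
        "CAK-chloro": {72: "M", 91: "F"},
        "HSK2": {72: "T", 91: "F"},
        "PKA": {72: "K", 164: "Y"},
    }
    for family, residues in examples.items():
        print(family, substitutions(residues))
```

Keeping the canonical set in one place makes it easy to see which of the ten positions a given family has replaced.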
Figure 5. Mechanistic Diversity of the ATP-Binding Pocket.

(A) PKA showing structural interactions associated with K72 in active ATP-bound state. The salt bridge interaction between K72 and E91 is shown by
dotted lines.
(B) Structural interactions associated with Arg111ChoK in ChoK.
(C) Conformational changes associated with Arg52Erk2 in the Erk2 mutant structure. Here, the arginine does not form a salt bridge interaction with
Glu69Erk2 (E91), but moves closer towards Glu69Erk2 upon ATP binding.
(D) Inactive state of Wnk1: K72 is shifted over to the G-loop (K233Wnk1) and E91 (Glu268Wnk1) hydrogen bonds to a conserved Arg (R348Wnk1 within the
HRD motif) in the catalytic loop.
(A–D) Residues conserved across all the major families are colored in magenta, while family-specific residues are colored in gold. Hydrogen bonds are
indicated in dotted lines.
Figure 6. ePK-Specific Motifs and Interactions in the Substrate-Binding Region

(A) The ePK-specific activation loop and G-helix are shown in PKA [52]. The corresponding regions are shown in Rio (A. fulgidus Rio2 [14]). The
activation loop and G-helix are colored in red, and the core-conserved residues are shown in stick representation.
(B) The three ePK-specific motifs in the C-terminal substrate-binding lobe and their structural interactions are shown. Hydrogen bonds are indicated by
dotted lines. The conserved buried water is shown in CPK representation.
PKL families. However, within the small subfamily of magnesium-dependent Mnk ePK kinases, G186 is changed to aspartate (DFD). In the Mnk2 crystal structure, this DFD motif adopts an “out” conformation in which F185 protrudes into the ATP-binding site. This is in contrast to the “in” conformation, where it packs up below the C-helix [39]. Mutation of the Mnk2 D186 “back” to glycine results in both in and out conformations of the DFG motif, supporting the role of G186 in DFG-associated conformational changes. Such conformational transitions may facilitate regulation of activity since the conformation of the catalytic aspartate is also changed during this transition [38]. This may also explain why the ePK-specific extended activation loop, which is phosphorylated and undergoes dramatic conformational changes, is directly attached to the DFG motif (Figure 6A).
In addition to the flexible catalytic core, the substrate-binding regions appear to have evolved for tight regulation of ePK activity. In particular, the conserved G helix, which was recently shown to undergo a conformational change upon substrate binding [40], is uniquely oriented in ePK/pknB (Figure 6A). Several ePK-conserved residues and motifs are at the interface between the G helix and the catalytic core (Figure 6B). These include the APE motif, located at the C-terminal end of the activation loop, a W-[SA]-X-[G] motif in the F-helix, and an arginine (R280), at the beginning of the I helix (Figure 6B). These three motifs structurally interact with each other and form a network that couples the substrate- and ATP-binding regions (Figure 6B). This network also involves conserved buried water molecules, which are known to contribute to the conformational flexibility of proteins [41]. Thus, this ePK/pknB-conserved network may also facilitate regulation by increasing the conformational flexibility of the substrate-binding regions [16].

Discussion
Data from the GOS voyage provide a huge increase in available sequences for most prokaryotic gene families, enabling new studies in discovery, classification, and evolutionary and structural analysis of a wide array of gene families. Even for a eukaryotic family such as ePK kinases, GOS provides insights by greatly increasing understanding of related PKL families. GOS increases the number of known ELK sequences more than 3-fold, and has enabled both the discovery of novel families of kinases as well as a detailed analysis of conservation patterns and subfamilies within known families. We believe that the GOS data, coupled with
the recent strong growth in whole-genome sequencing, provide the opportunity for similar insights into virtually every gene family with prokaryotic relatives.
PKL kinases are largely involved in regulatory functions, as opposed to the metabolic activities of other kinases with different folds [25]. The characteristics of this fold that lead to the explosion of diverse regulatory functions of eukaryotic ePKs have also been exploited for many different functions within prokaryotes. While these kinases reflect only ~0.25% of genes in both GOS and microbial genomes (ePKs represent ~2% of eukaryotic genes [42]), indicating a simpler prokaryotic lifestyle, they now outnumber the count of ~12,000 histidine kinases that we observe in GOS [22], suggesting that ELKs may be at least as important in bacterial cellular regulation as the “canonical” histidine kinases.
PKL kinases cross huge phylogenetic and functional spaces while still retaining a common fold and biochemical function of ATP-dependent phosphorylation. The presence of Rio and Bud32 genes in all eukaryotic and archaeal genomes suggests that at least this cluster dates back to the common ancestor of these domains of life. Similarly, the presence of UbiB in all eukaryotes and most bacterial groups, the close similarity of pknB/ePK families, and the widespread bacterial/eukaryotic distribution of FruK suggest their origins before the emergence of eukaryotes, or from an early horizontal transfer. Their ancient divergence leaves little or no trace of their shared structure within their protein sequence other than at functional motifs, which include a set of ten key residues that are highly conserved across all PKLs.
Despite the huge attention paid to ePKs, four key residues (P104, H158, H164, D220), three of which are highly conserved in ePKs, are still functionally obscure and worthy of greater attention, both in ELKs and ePKs. Conversely, it appears that nine of the ten key residues have been eliminated or transformed in individual families while maintaining fold and function, showing that almost anything is malleable in evolution given the right context. That right context is frequently a set of additional changes in the family-specific motifs surrounding these key residues, and we see that in the case of K72, a substitution to arginine triggers a cascade of other core substitutions that serve to retain basic function, while a substitution to methionine involves a shift of the positive charge normally provided by K72 to another conserved residue, in both CAK-chloro and Wnk kinases. Other core changes are also seen independently in very distinct families, such as the G55-to-A change in UbiB and the chloro subfamily of CAK, or the E91-to-F change in both chloro and HSK2, suggesting that these kinases are sampling a limited space of functional replacements.
These families vary greatly in diversity. While the ePK family has expanded to scores of deeply conserved functions [42], other families, including Bud32, Rio, Bub1, and UbiB, usually have just one or a handful of members per genome, suggesting critical function but an inability to innovate. The largely prokaryotic CAK family is also functionally and structurally diverse, containing several known functions and many distinct subfamilies likely to have novel functions. The diversity of both CAK and KdoK sequences may be related to their involvement in antibiotic resistance and immune evasion, likely to be evolutionarily accelerated processes. Comparison of CAK to the related and more functionally constrained HSK2, FruK, and MTRK families may reveal adaptive changes such as the ePK-specific flexibility changes that may assist in its diversity of functions.
GOS data are rich in highly divergent viral sequences, and accordingly we find a number of new subfamilies of viral kinases, including two of the three subfamilies of HRK and a subfamily of CapK. In both cases we see loss of N-terminal–conserved elements, suggesting that these kinases may have alternative functions or even act as inactive competitors to host kinases.
These patterns of sequence conservation and diversity raise many questions that can only be fully addressed by structural methods. The combination of structural and phylogenetic insights for ChoK enabled insights that were not clear from the structure alone, and enabled us to reject other inferences from the crystal structure that were not conserved within this family, highlighting the value of combining these approaches. The relative ease of crystallization of PKL domains, the emergence of high-throughput structural genomics, and our understanding of the diversity of these families make them attractive targets for structure determination of selected members, and position this family as a model for analysis of deep structural and functional evolution.

Materials and Methods
Discovery and classification of kinase genes. Sequences used consisted of 17,422,766 open reading frames from GOS, 3,049,695 predicted open reading frames from prokaryotic genomes, and 2,317,995 protein sequences from NCBI-nr of February 10, 2005, as described [22]. Profile HMM searches were performed with a Time Logic Decypher system (Active Motif, http://timelogic.com) using in-house profiles for ePK, Haspin, Bub1, Bud32, Rio, ABC1 (UbiB), PI3K, and AlphaK domains, as well as Pfam profiles [43] for ChoK, APH, KdoK, and FruK, and TIGRFAM profiles [44] for HSK2 (thrB_alt), UbiB, and MTRK. A number (69) of additional ePK-annotated models from Superfamily 1.67 [45] were used to capture initial hits but not for further classification. Initial hits were clustered and re-run against all models, and each model was rebuilt and rerun three to seven times using ClustalW [46], MUSCLE [47], and hmmalign (http://hmmer.janelia.org) to align, followed by manual adjustment of alignments using Clustal and Pfaat [48] and model building with hmmbuild. Low-scoring members of each family (e > 1 × 10⁻⁵) were used as seeds to build new putative families, and profile–profile and sequence–profile alignments were used to merge families into a minimal set (Dataset S2). A motif-based Markov chain Monte Carlo multiple alignment model [49] based on the conserved motifs of Figure 3 was run independently and used to verify HMM hits and seed new potential families for blast-based clustering, model building, and examination for conserved residues. Final family assignment was by scoring against the set of HMM models, with manual examination of sequences with borderline scores (e > 1 × 10⁻⁵ or difference in e-values between best two models >.01).
Family annotations. Annotations of chromosomal neighbors used SMART [50] and a custom analysis of GOS neighbors ([22]; C. Miller, H. Li, D. Eisenberg, unpublished data). Annotation analysis was based on GenBank annotations and PubMed references. Taxonomic analysis used a mapping of GOS scaffolds to taxonomic groupings [22] and NCBI taxonomy tools.
Family alignments and logos. Residue conservation (Dataset S3) was counted from the final alignment using a custom script that omitted gap counts. These counts were then used to construct family logos using WebLogo (http://weblogo.berkeley.edu; [51]).
Family comparisons. Relatedness between families was estimated using several methods. HMM–HMM alignments and scores were computed using PRC (http://supfam.org/PRC), and sequence–profile alignments using hmmalign were analyzed using custom scripts and by inspection. Both full-length and motif multiple alignments were also created and used for the family comparisons.

Supporting Information
Dataset S1. FastA-Formatted Sequence Files for Each of the 20 Kinase Families, Including Both GOS and Public Sequences
Found at doi:10.1371/journal.pbio.0050017.sd001 (10 MB BZ2).
Dataset S2. HMM Profiles for the 20 Kinase Families in HMMer Format
Found at doi:10.1371/journal.pbio.0050017.sd002 (2.4 MB HMM).
Dataset S3. Domain Profiles for 20 PKL Families
These 20 spreadsheets show the conservation profile at each residue of the kinase domain for each family, including annotations and classifications of individual residues. Each worksheet details the alignment of one kinase family to its HMM. Every row corresponds to a position within the alignment, listing the four most common amino acids (aa) in that row along with their fractional popularity. The number of aa’s and number of gaps at that position within the alignment are also listed. The “Notes” column annotates conservation status of selected residues and other notes, while the “>90% Conserved” column annotates those corresponding residues as to their class (Core, Motif, Motif-Associated, Semi-Conserved, C-terminal, Unique, or external to the kinase domain). A number of color highlights are used. (1) Positions with few aa’s in the alignment (typically inserts within the domain that are not of great interest) are shaded gray: typically dark gray for 20 aa at that position, and light gray for >20 but still low (the range varies depending on the depth of the alignment). Rows highlighted in gray have no highlights in any other columns and are assumed not to be part of the core domain. (2) Core motifs are highlighted in bold and blue. (3) The fractional count for the most popular aa is labeled green if 1, dark yellow if >0.9, and light yellow if >0.8 and <0.9.
Found at doi:10.1371/journal.pbio.0050017.sd003 (1.6 MB XLS).

Accession Numbers
The Protein Databank (http://www.pdb.org) accession numbers for the structures discussed in this paper are PKA (1ATP), A. fulgidus Rio2 (1TQP), C. elegans choline kinase (1NW1), Erk2 (1GOL), Wnk1 (1T4H), and APH(3′)-IIIa (1J7L). The Pfam (http://pfam.cgb.ki.se) accession numbers for the structures discussed in this paper are ChoK (PF01633.8), APH (PF01636.9), KdoK (PF06293.3), and FruK (PF03881.4). The TIGRFAM (http://www.tigr.org/TIGRFAMs) accession numbers for the structures discussed in this paper are HSK2 (TIGR00938), UbiB (TIGR01982), and MTRK (TIGR01767).

Acknowledgments
We thank the Governments of Bermuda, Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, and French Polynesia for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. We thank Chris Miller, Huiying Li, and David Eisenberg for access to analyses of chromosomal neighbors for function prediction, and Doug Rusch, Shibu Yooseph, and other members of the Venter Institute GOS team for taxonomic predictions, geographic analysis, and other data and tools. We thank Eric Scheeff and Tony Hunter for critical comments and structural insights, and Nina Haste for help with PyMOL.
Author contributions. JCV proposed and enabled collaboration. SST conceived of collaboration and provided structural insights and critical evaluation. NK and GM conceived and designed the experiments. NK, YZ, and GM performed the experiments. NK and GM analyzed the data. YZ and JCV contributed reagents/materials/analysis tools. NK and GM wrote the paper.
Funding. This work was supported by funding from the Razavi-Newman Center for Bioinformatics to GM and National Institutes of Health grant 1P01DK54441 to SST. We gratefully acknowledge the US Department of Energy (DOE) Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, and the J. Craig Venter Science Foundation for funding of the GOS expedition.
Competing interests. The authors have declared that no competing interests exist.
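As an illustration of the per-position conservation profile described for Dataset S3, the following Python sketch counts residues in each alignment column while omitting gaps, reports the most common amino acids with their fractional popularity, and applies the highlight thresholds given in that description. It is not the custom script used in this study. The function names (conservation_profile, highlight), the gap symbols, the choice to compute fractions over non-gap residues only, and the treatment of a fraction of exactly 0.9 are all assumptions made for the example.

```python
from collections import Counter

# Assumption: these characters mark gaps in the aligned sequences.
GAP_CHARS = {"-", "."}


def highlight(fraction):
    """Return the colour label for the most popular residue at a position."""
    if fraction == 1:
        return "green"
    if fraction > 0.9:
        return "dark yellow"
    if fraction > 0.8:
        # The text gives ">0.8 and <0.9"; a value of exactly 0.9 is
        # placed in this class by assumption.
        return "light yellow"
    return None


def conservation_profile(aligned_seqs, top_n=4):
    """Build one row per alignment column.

    Each row lists the top_n most common residues with their fraction of
    the non-gap residues, plus the residue count and the gap count.
    """
    if not aligned_seqs:
        return []
    width = len(aligned_seqs[0])
    if any(len(seq) != width for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    rows = []
    for col in range(width):
        column = [seq[col].upper() for seq in aligned_seqs]
        residues = [c for c in column if c not in GAP_CHARS]
        counts = Counter(residues)
        # Fractions are taken over residues only, so gaps do not dilute them.
        top = [(aa, n / len(residues)) for aa, n in counts.most_common(top_n)]
        rows.append({
            "position": col + 1,
            "top": top,
            "n_residues": len(residues),
            "n_gaps": len(column) - len(residues),
            "highlight": highlight(top[0][1]) if top else None,
        })
    return rows


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy = ["GKD", "GKE", "G-D", "GRD", "GKD"]
    for row in conservation_profile(toy):
        print(row)
```

Computing fractions over non-gap residues keeps sparsely populated insert columns from looking artificially unconserved, but it also means such columns should be judged together with their residue counts, as the gray shading rule in the Dataset S3 description suggests.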
References

1. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S (2002) The protein kinase complement of the human genome. Science 298: 1912–1934.
2. Cohen P (2002) Protein kinases—The major drug targets of the twenty-first century? Nat Rev Drug Discov 1: 309–315.
3. Manning G, Plowman GD, Hunter T, Sudarsanam S (2002) Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci 27: 514–520.
4. Parkinson JS (1993) Signal transduction schemes of bacteria. Cell 73: 857–871.
5. Kennelly PJ, Potts M (1999) Life among the primitives: Protein O-phosphatases in prokaryotes. Front Biosci 4: D372–D385.
6. Leonard CJ, Aravind L, Koonin EV (1998) Novel families of putative protein kinases in bacteria and archaea: Evolution of the “eukaryotic” protein kinase superfamily. Genome Res 8: 1038–1047.
7. Krupa A, Srinivasan N (2005) Diversity in domain architectures of Ser/Thr kinases and their homologues in prokaryotes. BMC Genomics 6: 129.
8. Zhang CC (1996) Bacterial signalling involving eukaryotic-type protein kinases. Mol Microbiol 20: 9–15.
9. Young TA, Delagoutte B, Endrizzi JA, Falick AM, Alber T (2003) Structure of Mycobacterium tuberculosis PknB supports a universal activation mechanism for Ser/Thr protein kinases. Nat Struct Biol 10: 168–174.
10. Kennelly PJ (2002) Protein kinases and protein phosphatases in prokaryotes: A genomic perspective. FEMS Microbiol Lett 206: 1–8.
11. Scheeff ED, Bourne PE (2005) Structural evolution of the protein kinase-like superfamily. PLoS Comput Biol 1: e49.
12. Hon WC, McKay GA, Thompson PR, Sweet RM, Yang DS, et al. (1997) Structure of an enzyme required for aminoglycoside antibiotic resistance reveals homology to eukaryotic protein kinases. Cell 89: 887–895.
13. Peisach D, Gee P, Kent C, Xu Z (2003) The crystal structure of choline kinase reveals a eukaryotic protein kinase fold. Structure (Camb) 11: 703–713.
14. LaRonde-LeBlanc N, Wlodawer A (2004) Crystal structure of A. fulgidus Rio2 defines a new family of serine protein kinases. Structure 12: 1585–1594.
15. Cheek S, Ginalski K, Zhang H, Grishin NV (2005) A comprehensive update of the sequence and structure classification of kinases. BMC Struct Biol 5: 6.
16. Kannan N, Neuwald AF (2005) Did protein kinase regulatory mechanisms evolve through elaboration of a simple structural component? J Mol Biol 351: 956–972.
17. Grishin NV (1999) Phosphatidylinositol phosphate kinase: A link between protein kinase and glutathione synthase folds. J Mol Biol 291: 239–247.
18. Walker EH, Pacold ME, Perisic O, Stephens L, Hawkins PT, et al. (2000) Structural determinants of phosphoinositide 3-kinase inhibition by wortmannin, LY294002, quercetin, myricetin, and staurosporine. Mol Cell 6: 909–919.
19. Steinbacher S, Hof P, Eichinger L, Schleicher M, Gettemans J, et al. (1999) The crystal structure of the Physarum polycephalum actin-fragmin kinase: An atypical protein kinase with a specialized substrate-binding domain. EMBO J 18: 2923–2929.
20. Yamaguchi H, Matsushita M, Nairn AC, Kuriyan J (2001) Crystal structure of the atypical protein kinase domain of a TRP channel with phosphotransferase activity. Mol Cell 7: 1047–1057.
21. Kannan N, Neuwald AF (2004) Evolutionary constraints associated with functional specificity of the CMGC protein kinases MAPK, CDK, GSK, SRPK, DYRK, and CK2α. Protein Sci 13: 2059–2077.
22. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
23. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077
24. Neuwald AF, Liu JS, Lawrence CE (1995) Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci 4: 1618–1632.
25. Cheek S, Zhang H, Grishin NV (2002) Sequence and structure classification of kinases. J Mol Biol 320: 855–881.
26. Qian KC, Wang L, Hickey ER, Studts J, Barringer K, et al. (2005) Structural basis of constitutive activity and a unique nucleotide binding mode of human Pim-1 kinase. J Biol Chem 280: 6130–6137.
27. Taylor SS, Radzio-Andzelm E, Hunter T (1995) How do protein kinases discriminate between serine/threonine and tyrosine? Structural insights from the insulin receptor protein-tyrosine kinase. FASEB J 9: 1255–1266.
28. Nurizzo D, Shewry SC, Perlin MH, Brown SA, Dholakia JN, et al. (2003) The crystal structure of aminoglycoside-3′-phosphotransferase-IIa, an enzyme responsible for antibiotic resistance. J Mol Biol 327: 491–506.
29. Thompson PR, Schwartzenhauer J, Hughes DW, Berghuis AM, Wright GD (1999) The COOH terminus of aminoglycoside phosphotransferase (3′)-IIIa is critical for antibiotic recognition and resistance. J Biol Chem 274: 30697–30706.
30. Nolen B, Taylor S, Ghosh G (2004) Regulation of protein kinases: Controlling activity through activation segment conformation. Mol Cell 15: 661–675.
31. Boitel B, Ortiz-Lombardia M, Duran R, Pompeo F, Cole ST, et al. (2003) PknB kinase activity is regulated by phosphorylation in two Thr residues and dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase, in Mycobacterium tuberculosis. Mol Microbiol 49: 1493–1508.
32. Gibbs CS, Zoller MJ (1991) Rational scanning mutagenesis of a protein kinase identifies functional regions involved in catalysis and substrate interactions. J Biol Chem 266: 8923–8931.
33. Yuan C, Kent C (2004) Identification of critical residues of choline kinase A2 from Caenorhabditis elegans. J Biol Chem 279: 17801–17809.
34. Robinson MJ, Harkins PC, Zhang J, Baer R, Haycock JW, et al. (1996) Mutation of position 52 in ERK2 creates a nonproductive binding mode for adenosine 5′-triphosphate. Biochemistry 35: 5641–5646.
35. Xu B, English JM, Wilsbacher JL, Stippec S, Goldsmith EJ, et al. (2000) WNK1, a novel mammalian serine/threonine protein kinase lacking the catalytic lysine in subdomain II. J Biol Chem 275: 16795–16801.
36. Akamine P, Madhusudan, Wu J, Xuong NH, Ten Eyck LF, et al. (2003) Dynamic features of cAMP-dependent protein kinase revealed by apoenzyme crystal structure. J Mol Biol 327: 159–171.
37. Thompson PR, Boehr DD, Berghuis AM, Wright GD (2002) Mechanism of aminoglycoside antibiotic kinase APH(3′)-IIIa: Role of the nucleotide positioning loop. Biochemistry 41: 7001–7007.
38. Levinson NM, Kuchment O, Shen K, Young MA, Koldobskiy M, et al. (2006) A SRC-like inactive conformation in the abl tyrosine kinase domain. PLoS Biol 4: e144.
39. Jauch R, Jakel S, Netter C, Schreiter K, Aicher B, et al. (2005) Crystal structures of the Mnk2 kinase domain reveal an inhibitory conformation and a zinc binding site. Structure 13: 1559–1568.
40. Dar AC, Dever TE, Sicheri F (2005) Higher-order substrate recognition of eIF2alpha by the RNA-dependent protein kinase PKR. Cell 122: 887–900.
41. Fischer S, Verma CS (1999) Binding of buried structural water increases the flexibility of proteins. Proc Natl Acad Sci U S A 96: 9613–9615.
42. Goldberg JM, Manning G, Liu A, Fey P, Pilcher KE, et al. (2006) The Dictyostelium kinome—Analysis of the protein kinases from a simple model organism. PLoS Genet 2: e38.
43. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. (2002) The Pfam protein families database. Nucleic Acids Res 30: 276–280.
44. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, et al. (2001) TIGRFAMs: A protein family resource for the functional identification of proteins. Nucleic Acids Res 29: 41–43.
45. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313: 903–919.
46. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.
47. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
48. Johnson JM, Mason K, Moallemi C, Xi H, Somaroo S, et al. (2003) Protein family annotation in a multiple alignment viewer. Bioinformatics 19: 544–545.
49. Neuwald AF, Liu JS (2004) Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model. BMC Bioinformatics 5: 157.
50. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: A web-based tool for the study of genetically mobile domains. Nucleic Acids Res 28: 231–234.
51. Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: A sequence logo generator. Genome Res 14: 1188–1190.
52. Knighton DR, Zheng JH, Ten Eyck LF, Ashford VA, Xuong NH, et al. (1991) Crystal structure of the catalytic subunit of cyclic adenosine monophosphate-dependent protein kinase. Science 253: 407–414.
53. Hanks SK, Quinn AM, Hunter T (1988) The protein kinase family: Conserved features and deduced phylogeny of the catalytic domains. Science 241: 42–52.
54. Neuwald AF, Kannan N, Poleksic A, Hata N, Liu JS (2003) Ran’s C-terminal, basic patch, and nucleotide exchange mechanisms in light of a canonical structure for Rab, Rho, Ras, and Ran GTPases. Genome Res 13: 673–692.
55. Higgins JM (2001) Haspin-like proteins: A new family of evolutionarily conserved putative eukaryotic protein kinases. Protein Sci 10: 1677–1684.
56. Facchin S, Lopreiato R, Ruzzene M, Marin O, Sartori G, et al. (2003) Functional homology between yeast piD261/Bud32 and human PRPK: Both phosphorylate p53 and PRPK partially complements piD261/Bud32 deficiency. FEBS Lett 549: 63–66.
57. Downey M, Houlsworth R, Maringele L, Rollie A, Brehme M, et al. (2006) A genome-wide screen identifies the evolutionarily conserved KEOPS complex as a telomere regulator. Cell 124: 1155–1168.
58. LaRonde-LeBlanc N, Wlodawer A (2005) The RIO kinases: An atypical protein kinase family required for ribosome biogenesis and cell cycle progression. Biochim Biophys Acta 1754: 14–24.
59. White KA, Lin S, Cotter RJ, Raetz CR (1999) A Haemophilus influenzae gene that encodes a membrane bound 3-deoxy-D-manno-octulosonic acid (Kdo) kinase. Possible involvement of kdo phosphorylation in bacterial virulence. J Biol Chem 274: 31391–31400.
60. Zhao X, Lam JS (2002) WaaP of Pseudomonas aeruginosa is a novel eukaryotic type protein-tyrosine kinase as well as a sugar kinase essential for the biosynthesis of core lipopolysaccharide. J Biol Chem 277: 4722–4730.
61. Serino L, Virji M (2000) Phosphorylcholine decoration of lipopolysaccharide differentiates commensal Neisseriae from pathogenic strains: Identification of licA-type genes in commensal Neisseriae. Mol Microbiol 35: 1550–1559.
62. Wright GD, Thompson PR (1999) Aminoglycoside phosphotransferases: Proteins, structure, and mechanism. Front Biosci 4: D9–D21.
63. Delpierre G, Collard F, Fortpied J, Van Schaftingen E (2002) Fructosamine 3-kinase is involved in an intracellular deglycation pathway in human erythrocytes. Biochem J 365: 801–808.
64. Fortpied J, Gemayel R, Stroobant V, van Schaftingen E (2005) Plant ribulosamine/erythrulosamine 3-kinase, a putative protein-repair enzyme. Biochem J 388: 795–802.
65. Tower PA, Alexander DB, Johnson LL, Riscoe MK (1993) Regulation of methylthioribose kinase by methionine in Klebsiella pneumoniae. J Gen Microbiol 139: 1027–1031.
66. Sekowska A, Mulard L, Krogh S, Tse JK, Danchin A (2001) MtnK, methylthioribose kinase, is a starvation-induced protein in Bacillus subtilis. BMC Microbiol 1: 15.
67. Poon WW, Davis DE, Ha HT, Jonassen T, Rather PN, et al. (2000) Identification of Escherichia coli ubiB, a gene required for the first monooxygenase step in ubiquinone biosynthesis. J Bacteriol 182: 5139–5146.
68. Do TQ, Hsu AY, Jonassen T, Lee PT, Clarke CF (2001) A defect in coenzyme Q biosynthesis is responsible for the respiratory deficiency in Saccharomyces cerevisiae abc1 mutants. J Biol Chem 276: 18161–18168.
69. Jarling M, Cauvet T, Grundmeier M, Kuhnert K, Pape H (2004) Isolation of mak1 from Actinoplanes missouriensis and evidence that Pep2 from Streptomyces coelicolor is a maltokinase. J Basic Microbiol 44: 360–373.
70. Cozzone AJ, El-Mansi M (2005) Control of isocitrate dehydrogenase catalytic activity by protein phosphorylation in Escherichia coli. J Mol Microbiol Biotechnol 9: 132–146.
71. Aimes RT, Hemmer W, Taylor SS (2000) Serine-53 at the tip of the glycine-rich loop of cAMP-dependent protein kinase: Role in catalysis, P-site specificity, and interaction with inhibitors. Biochemistry 39: 8325–8332.
72. Johnson DA, Akamine P, Radzio-Andzelm E, Madhusudan M, Taylor SS (2001) Dynamics of cAMP-dependent protein kinase. Chem Rev 101: 2243–2270.
73. Huse M, Kuriyan J (2002) The conformational plasticity of protein kinases. Cell 109: 275–282.