
Available online at www.sciencedirect.com

Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis
Matthew B Scholz1,2, Chien-Chi Lo1,2 and Patrick SG Chain1,2
The recent technological advances in next generation
sequencing have brought the field closer to the goal of
reconstructing all genomes within a community by presenting
high throughput sequencing at much lower costs. While these
next-generation sequencing technologies have allowed a
massive increase in available raw sequence data, there are a
number of new informatics challenges and difficulties that must
be addressed to improve the current state, and fulfill the
promise of, metagenomics.
Addresses
1 Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM 87545, United States
2 Microbial and Metagenome Program, Joint Genome Institute, Walnut Creek, CA 94598, United States
Corresponding author: Chain, Patrick SG (pchain@lanl.gov)

Current Opinion in Biotechnology 2012, 23:9-15
This review comes from a themed issue on Analytical biotechnology
Edited by Wei E. Huang and Jizhong Zhou
Available online 9th December 2011
0958-1669/$ - see front matter
© 2011 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.copbio.2011.11.013

The discipline of metagenomics
Until recently, analysis of any bacterial community was
severely limited by available technology and a shortage of
reference genomes. The advent of next generation
sequencing (NGS; Table 1) has allowed an explosion
in sequencing of individual genomes, and started a revolution
in metagenomic sequencing and analysis. The
increased throughput and decrease in costs of sequencing,
coupled with additional technological advances have
transformed the landscape of metagenomics.
The goal for any metagenome sequencing project is the
full characterization of a community (essentially "who's
there?", "what are they doing?"), and thus usually includes
efforts to understand: 1) community composition/structure,
including the taxonomic breakdown and relative
abundance of the various species, 2) genic contribution
of each member of the community, including number and
functional capacity, and 3) intra-species or intra-population
heterogeneity of the genes. Ideally, one would be able to
completely reconstruct all genomes within a given sample;
however, this particular goal has been largely unattainable
owing to technological limitations in bacterial isolation/
DNA recovery as well as in sequencing capacity (cost and
throughput).

Community profiling
Early efforts to describe "who's there" have relied upon
cataloging species as designated by conserved changes in
their rDNA sequence, either via targeted sequencing of
amplicons, microarray technologies (Phylochip [1]), or
electrophoretic sizing techniques that are primarily
restricted to differentiating between communities (DGGE
and T-RFLP [2-4]). Each methodology has its own set of
limitations, and most share a reliance on the known set of
target 16S rDNA sequences, as well as the assumption
that 16S sequence can serve as a sufficient marker for
species level identification.

Functional gene metagenomic studies are of interest to
those specifically investigating "what are they doing" in
detail. A methodologically similar approach to 16S community profiling can also be applied using gene families of
particular enzymatic function, which again can be
explored using either microarray (e.g. Geochip [5]) or
direct sequencing techniques [6]. While not mutually
exclusive, there is little, if any, data to correlate the
presence of a particular species as indicated by 16S rDNA
sequence to the enzymes identified by functional gene
assays. These profiling approaches are now becoming less
cost effective and more time consuming than direct
sequencing of a community.

Shotgun sequencing of metagenomes


Limited sequencing throughput constrained most early
metagenomic shotgun sequencing efforts to the characterization of simple microbial communities: enriched
bioreactor communities or those found in extreme
environments [7]. One high-cost deep metagenomic survey of the ocean was performed using this technology
(GOS), although in-depth analysis could still only be
performed with the most dominant organisms [8].
NGS technologies have allowed exploration of complex
communities by sequencing at lower costs and higher
throughput than Sanger based sequencing (Table 1). This
new sequencing capacity can be utilized for more comprehensive characterization of more diverse and complex
microbial communities such as animal-associated microflora (e.g. termite hindgut [9], human intestinal tract [10],
human saliva [11,12], and cow rumen [2,13]), and has
even made it feasible to begin investigating soil and
pelagic microbial communities of extreme complexity.

Table 1
List of NGS sequencing platforms and their expected throughputs, error types and error rates. Each platform has distinct advantages
owing to cost, error rate, read length, and so on

Platform            Run time (h)      Read length (bp)         Throughput per run (Mb)  Error type    Error rate (%)
Roche
  454 FLX+          18-20             700                      900                      Indel         1
  454 FLX Titanium  10                400                      500                      Indel         1
  454 GS            10                400                      50                       Indel         1
Illumina
  GAIIx             14                2 × 150                  96,000                   Substitution  >0.1
  HiSeq 2000        8                 2 × 100                  400,000                  Substitution  >0.1
  HiSeq 2000 V3     10                2 × 150                  <600,000                 Substitution  >0.1
  MiSeq             1                 2 × 150                  1000                     Substitution  >0.1
Life technologies
  SOLiD 4           12                50 × 35                  71,000                   A-T Bias      >0.06
  SOLiD 5500xl      8                 75 × 35 PE, 60 × 60 MP   155,000                  A-T Bias      >0.01
Ion torrent
  PGM 314 Chip      3                 100                      10                       Indel         1
  PGM 316 Chip      3                 100+                     100                      Indel         1
  PGM 318 Chip      3                 200                      1000                     Indel         1
Pacific biosciences
  RS                14/8 Smart Cells  1500                     45/SC                    Insertions    15

New technologies for metagenomic sequencing: advantages and drawbacks
As new NGS technologies continue to emerge, the field of
metagenomics has adapted to the new types of sequencing data. However, each NGS technology has advantages
and disadvantages that must be addressed to enable
appropriate use of metagenome sequencing data. The
volume of NGS data coupled with their relatively short
reads raises the question of how to analyze these data so as
to maximize their scientific value. While 454 pyrosequencing has been used for metagenome analyses, the number
of reads obtained from the Illumina (and SOLiD) platforms (Table 1) makes this short read technology the
best suited to deep-coverage sequencing and
analysis of a shotgun metagenome. However, these reads
are very short (currently 36-150 bp), making many read
based analyses difficult, incomplete, or impossible.
Therefore, new methods are required to analyze Illumina-sequenced metagenomes.
Novel algorithms able to process millions to billions of
very short reads are being developed worldwide, for
sequence alignment, assembly, as well as read annotation
[14] among other applications (Table 2). Although this
provides many useful resources, it has delayed or prevented the selection of standard best practice tools for
analysis. Any NGS research in metagenomics will require
significant computational resources, and a core of bioinformaticians with skills to install, update, and run the
latest tools. Furthermore, using NGS platforms can
necessitate planning for at least several hundred gigabytes of data storage per sample.

Analysis methods for metagenomics
While the ultimate goal is to reconstruct all the genomes
within an environment, this is not feasible owing to the
computational complexity involved. Instead there are two
general types of analyses performed as a proxy for complete genome reconstruction (Figure 1), both with
inherent limitations: 1) assembling the reads into contigs,
and performing taxonomic classification and functional
assignments, or 2) read-based reconstruction of the functional and taxonomic components of the metagenome.
Several problems particular to the assembly of metagenomes exist. The algorithms for short read assembly require
crippling amounts of memory to reconstruct the many
genomes found within a metagenome. Additionally, the
wide range of abundances of the genomes within a sample
further complicates the issue, and low abundance genomes
may not assemble at all. Furthermore, population heterogeneity within lineages can fragment otherwise contiguous
sequences, and the similarity of closely related lineages can
result in chimeric assembly. Read-based methods to
achieve both functional and taxonomic classification of
NGS data suffer from the sheer number of reads (long
analysis time) and from short read length (high error rates).
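The effect of uneven abundance on assembly can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from any tool discussed here; it assumes that reads are sampled in proportion to each genome's share of the community DNA, and the run size, genome size and abundances are hypothetical values chosen only for illustration.

```python
def expected_coverage(total_bases, relative_abundance, genome_size):
    """Mean depth of coverage expected for one community member.

    Assumption: reads are drawn in proportion to the genome's share of the
    community DNA (relative_abundance is a fraction between 0 and 1).
    """
    return total_bases * relative_abundance / genome_size


# Hypothetical run: 10 Gb of sequence, community members with 5 Mb genomes.
total_bases = 10_000_000_000
genome_size = 5_000_000

for abundance in (0.5, 0.01, 0.0001):
    depth = expected_coverage(total_bases, abundance, genome_size)
    print(f"relative abundance {abundance:.2%}: ~{depth:.1f}x coverage")
```

Under these assumptions a dominant member is covered about a thousand-fold, while a member making up 0.01% of the community DNA receives well under one-fold coverage, which is far too little for its reads to overlap into contigs.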
Assembly

The currently accepted methods most capable of assembling NGS data utilize k-mer de Bruijn graph traversal-based methods, including programs such as Velvet, SOAPdenovo, ALLPATHS, ABySS, the CLC Bio commercial
product as well as a number of newer assemblers under
development [15,16-22] (Table 2). The k-mer approach
was developed in an attempt to overcome the time limitations of traditional overlap-based assembly strategies. The
primary limitation of overlap-based assemblers is that they
require a number of calculations proportional to N² (the
square of the number of reads in the assembly). As NGS
output surpassed the multi-million read mark, this process
became prohibitively slow. While there are several excellent reviews of k-mer-based assembly [15], it is important
to note two things: first, this method reduces the time of
assembly, but at the cost of requiring significant RAM
which is proportional to the size of the genome(s) being
assembled and/or the amount of data, which de facto limits
the total size of the metagenome being assembled; second,
this method is non-deterministic. Because reads are broken
down into smaller pieces of defined length (k-mers), reads
themselves are no longer the target of assembly, leading to
the potential introduction of assembly errors.

Table 2
Currently available software tools for analysis and assembly of metagenomes

Category                 Software/algorithm  References
Annotation and analysis  MG-RAST             [38]
                         IMG-M               [39]
                         Ergatis             [27]
                         DIYA                [40]
                         CloVR               http://clovr.org/
                         RATT                [41]
                         VMGAP               [31]
                         CAMERA              [42]
                         METAREP             [43]
Assembly                 RAY                 [44]
                         Velvet              [21]
                         SOAPdenovo          [45]
                         Newbler             [20]
                         ABySS               [46]
                         ALLPATHS            [16]
                         Genovo              [24]
                         CLCbio              http://clcbio.com
                         Meta-IDBA           [47]
                         MetaVelvet          http://metavelvet.dna.bio.keio.ac.j
Mapping/alignment        BWA                 [48]
                         Bowtie              [49]
                         Novoalign           [50]
                         SOAP                [51]
                         MrFAST              [52]
                         CloudBurst          [53]
                         BFAST               [54]
                         MUMmer              [26]
                         MOSAIK              http://bioinformatics.bc.edu/marthlab/Mosaik
                         BLAST               [26]
                         MAQ
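To make the k-mer approach from the Assembly discussion concrete, the following minimal sketch builds a de Bruijn graph from a few toy reads and also counts the read pairs an all-against-all overlap step would have to consider. It is an illustration only, not the data structure of Velvet or any other assembler named here; the function names and the example reads are hypothetical, and real assemblers additionally handle reverse complements, sequencing errors and coverage.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a dict mapping each prefix (k-1)-mer to the set of suffix
    (k-1)-mers that follow it in at least one read.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def pairwise_comparisons(n_reads):
    """Number of read pairs an all-against-all overlap step must consider."""
    return n_reads * (n_reads - 1) // 2


reads = ["ACGTAC", "GTACGG", "TACGGA"]  # toy reads, not real data
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))

# One million reads already imply roughly 5 x 10^11 candidate pairs.
print(pairwise_comparisons(1_000_000))
```

In this toy graph the node ACG has two successors (CGG and CGT), so a traversal has to choose between paths; the graph itself no longer records which read each k-mer came from, which is how assembling k-mers rather than whole reads can introduce errors.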
Metagenomes have presented a number of additional
assembly challenges. Multiple genomes are represented
disproportionately owing to uneven community composition resulting in poor or no coverage of many parts of
many genomes. Assembly of reads into contigs is often
limited owing to complex population composition [23],
thus more data are required to cover the diverse genomes,
at the cost of computational memory and time requirements. Species level (and population) heterogeneity can
further confuse assembly by presenting "forks" or "bubbles"
in the de Bruijn graph. While there are a number of
effective assemblers for single genomes, only three
(Genovo, MetaVelvet and Meta-IDBA [24]) currently
attempt to address these metagenome issues, and one
(Genovo) is limited to a few million reads. MetaVelvet
and Meta-IDBA attempt to bin genomes by abundance
in the population by separating short-read data by k-mer
frequency and/or collapsing the graph. Alternative
approaches include finding ways to partition or bin reads
into discrete groups that are more amenable to assembly.
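The idea of partitioning reads by k-mer frequency can be sketched as follows. This is a deliberately simplified illustration and not the algorithm implemented by MetaVelvet or Meta-IDBA: it assumes that the median count of a read's k-mers is a usable proxy for the abundance of its source genome, and the function names, the two-bin split and the threshold are hypothetical choices.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bin_reads_by_abundance(reads, k, threshold):
    """Split reads into 'high' and 'low' bins by median k-mer count."""
    counts = count_kmers(reads, k)
    bins = {"high": [], "low": []}
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k, so it carries no k-mers
        depth = median(counts[kmer] for kmer in read_kmers)
        bins["high" if depth >= threshold else "low"].append(read)
    return bins


# Toy data: one sequence seen five times, another seen once.
reads = ["AAAACCCC"] * 5 + ["GGGGTTTT"]
bins = bin_reads_by_abundance(reads, k=4, threshold=3)
print(len(bins["high"]), "high-abundance reads;", len(bins["low"]), "low-abundance reads")
```

Each bin could then be assembled separately, so that the abundant member does not dictate coverage cutoffs for the rare one. A fixed threshold is the weakest part of this sketch; in practice the cutoffs would have to be derived from the observed k-mer count distribution.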
Read-based analyses

Read-based approaches can also allow exploration of gene
functions and taxonomic classification. However, only a
subset of the data can be assigned to either function or
species, and of those that can be classified, the degree of
confidence can vary as a result of both the reference
databases, and the shortness of reads. When coupled with
the sheer volume of reads, which threatens to overwhelm
currently available online annotation and local analysis
systems, these issues strengthen the argument to use both
assembly and read-based approaches together to examine
metagenomes.

Figure 1
[Flowchart: NGS-Based Metagenome Analysis Methods. Starting point: your sequencing method of choice. Amplicon sequencing branch: 16S ribotags; gene-targeted metagenomics; character- or homology-based approaches; OTU-based and hypothesis-testing approaches; taxonomy/function-subfamily community profiling; function-subfamily profiling. Whole sample sequencing branch: shotgun metagenomics; reads; assembly; read binning; contig binning; gene calling; contig annotation; read annotation; contig-based taxonomy and functional profiling; read-based taxonomy and functional profiling. End point: final analytical data set for analysis and community (metagenome) comparisons.]
Analytical stages and steps for analysis of metagenomic data from either amplicon sequencing or whole sample shotgun metagenome sequencing.
Read mappers such as BWA and Bowtie, which are based on the
Burrows-Wheeler transform, along with MAQ and others (Table 2),
can perform fast alignments of reads to a given set of
reference sequences [14]. This method has specific limitations with regards to the reference sequences, but is
extremely rapid. MUMmer [25] is less limited by database
size, but is also much slower. BLAST [26] is more sensitive
and can be used to find distant homologous sequences for
taxonomy and functional attributions. However, this comes
at too high a cost in terms of CPU time. Depending on
computational resources and method of alignment, results
for tens to hundreds of millions of reads can generally be
obtained within hours (Burrows-Wheeler methods), days
(Nucmer) or weeks (BLAST). Other fast approaches, such
as using pre-developed Hidden Markov Models (HMMs),
can also be used to find conserved domains; however, these
do not function very well for short reads, and typically have
a limited range (e.g. a single gene of interest).
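The speed of Burrows-Wheeler based mapping comes from indexing the reference once and then locating each read by backward search, one character at a time. The sketch below shows that principle for exact matches only. It is a teaching example under stated simplifications, not how BWA or Bowtie are implemented: the suffix array is built naively, the occurrence table is uncompressed, mismatches and gaps are not handled, and the function names and toy reference are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_index(reference):
    """Build the suffix array, first-column offsets and occurrence table."""
    text = reference + "$"  # '$' sorts before A, C, G and T
    sa = suffix_array(text)
    # BWT: the character preceding each sorted suffix (index 0 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(read, index):
    """Return sorted reference positions where the read matches exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("ACGTACGTTACG")  # toy reference, not real data
print(find_exact("ACG", index))
print(find_exact("TTA", index))
print(find_exact("GGG", index))
```

Each read costs a number of table lookups proportional to its length, independent of reference size, which is consistent with the speed advantage described above. The dependence on a prebuilt index also reflects the stated limitation: a read can only be placed if a sufficiently similar sequence is present in the reference set.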

Metagenome analysis and comparative metagenomics

The annotation of metagenomic contigs is often viewed
as simply an extension of bacterial genome annotation.
However, for most metagenomes, there will also be viral
and eukaryotic components. There is currently no centralized method to annotate such diverse sequences at the
same time. Bacterial and archaeal annotations can be
performed in-house by use of available gene calling
and functional annotation workflows, including those in
the Ergatis web-based workflow management system
[27] and the command-line cg-pipeline [28]. Online metagenome annotation services, such as IMG-M [29] and
MG-RAST [30], are also available. While a virus specific
annotation pipeline, VMGAP [31] also exists, there is
currently no available system that combines all necessary
pipelines for complex metagenome samples, thus a strong
need for such development exists.
CAMERA provides a framework to design and implement data analysis pipelines. Other installable workflow
management systems like Galaxy [32], Ergatis, and
others, already have hundreds of tools for analyzing
genomic and metagenomic data [33]. Customized features can be added by users, and analysis pipelines can be
built and shared easily among scientists within the
research community.

IMG-M, CAMERA and MG-RAST allow some analyses
beyond annotation, including taxonomic and functional
assignments, and pathway reconstruction, making these
resources popular portals that can provide a single platform
for depositing, locating, analyzing, visualizing and sharing
data. These resources are incredibly useful for those individual laboratories without access to high performance
computers to perform analysis, and have grown to accommodate browsing functional and taxonomic comparisons of
metagenomes in detail and as a whole. METAREP is
another tool developed for high-performance comparative
metagenomics. Users can analyze and compare annotated
metagenomics datasets by graphical summaries for taxonomic and functional classifications and perform statistical
tests. If metadata is available, these sites/tools can provide
multivariate statistical analysis, principal component
analysis (PCA) and nonmetric multidimensional scaling
to cluster samples and show the major factors which
contribute most to the observed functional or taxonomic
compositions of the metagenomic populations [11].

Finally, the comparison of multiple metagenomes can
reveal relationships between different environments or
different time points within the same environment, based
on taxonomic and functional profiling. Dinsdale et al. conducted a broad metagenomic comparison, contrasting
15 million sequences from 45 distinct microbiomes
and 42 distinct viromes [34]. More recently, a number of
other metagenomic comparisons have been conducted,
such as comparisons between deep-sea hydrothermal
microbial communities [35], biogas fermenters [36], oceanic viral communities [37], and saliva microbiomes [12].
Metagenome comparison platforms will become more and
more important to gain scientific understanding of how
community structure and functional profiles relate to
environmental perturbations.

Future directions
NGS has enabled us to peer at the genetic composition of
complex communities in a way not thought possible only
a few years ago. While novel tools have been developed
specifically for such massively parallel high-throughput
sequencers, the complexity of metagenomic samples has
presented difficult challenges and exposed a number of
analytical bottlenecks. Despite problems inherent with
assembly or read-based analysis, both approaches need to
be examined for a more complete understanding of any
metagenome project and begin to answer the basic questions in metagenomics (who, what, how). Figure 1 illustrates the types of analyses that are possible for
metagenomes with NGS technologies and the interrelatedness of read-based and assembly based analyses. There
are still drastic improvements that can be made to these
tools, and novel algorithms that are sorely needed to
process NGS data; however, it is clear that metagenomic
sequencing/analysis is finally hitting its stride.

It is expected that sequencing will soon cease to be the
limiting factor in metagenomic studies. The computational
resources required for assembly, annotation and analysis
have already become the main bottlenecks for metagenomics projects. It is highly probable that sequencing
centers will begin to serve principally as bioinformatics
resources that lend computational resources and expertise
to the community. However, without novel algorithms for
assembly and analysis, it is clear that the sheer volume of
sequencing data will overwhelm available resources.

Acknowledgements
This study was supported partly by Laboratory-Directed Research and
Development of Los Alamos National Laboratory under grant number
20100034DR, by the U.S. Department of Energy Joint Genome Institute
through the Office of Science of the U.S. Department of Energy under
Contract No. DE-AC02-05CH11231, and by grants from the U.S. Defense
Threat Reduction Agency under contract numbers B104153I and B084531I.

References and recommended reading
Papers of particular interest, published within the annual period of
review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Loy A, Lehner A, Lee N, Adamczyk J, Meier H, Ernst J,
Schleifer KH, Wagner M: Oligonucleotide microarray for 16S
rRNA gene-based detection of all recognized lineages of
sulfate-reducing prokaryotes in the environment. Applied and
Environmental Microbiology 2002, 68:5064-5081.
2. Deng WD, Xi DM, Mao HM, Wanapat M: The use of molecular
techniques based on ribosomal RNA and DNA for rumen
microbial ecosystem studies: a review. Molecular Biology
Reports 2008, 35:265-274.
3. Kumari N, Srivastava AK, Bhargava P, Rai LC: Molecular
approaches towards assessment of cyanobacterial
biodiversity. African Journal of Biotechnology 2009, 8:4284-4298.
4. Shi WB, Syrenne R, Sun JZ, Yuan JS: Molecular approaches to
study the insect gut symbiotic microbiota at the omics age.
Insect Science 2010, 17:199-219.
5. He ZL, Deng Y, Van Nostrand JD, Tu QC, Xu MY, Hemme CL,
Li XY, Wu LY, Gentry TJ, Yin YF et al.: GeoChip 3.0 as a high-throughput tool for analyzing microbial community
composition, structure and functional activity. ISME Journal
2010, 4:1167-1179.

6. Iwai S, Chai B, Sul WJ, Cole JR, Hashsham SA, Tiedje JM:
Gene-targeted-metagenomics reveals extensive diversity of
aromatic dioxygenase genes in the environment. ISME Journal
2009, 4:279-285.
This study is the first to target the dioxygenase gene family to understand
its diversity. This is one of few studies that target genes other than the
ribosomal RNA genes yet relies on many tools developed for 16S community profiling studies.
7. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ,
Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF:
Community structure and metabolism through reconstruction
of microbial genomes from the environment. Nature 2004,
428:37-43.
8. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D,
Eisen JA, Wu DY, Paulsen I, Nelson KE, Nelson W et al.:
Environmental genome shotgun sequencing of the Sargasso
Sea. Science 2004, 304:66-74.

9. Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M,
Richardson TH, Stege JT, Cayouette M, McHardy AC, Djordjevic G,
Aboushadi N et al.: Metagenomic and functional analysis of
hindgut microbiota of a wood-feeding higher termite. Nature
2007, 450:560-U517.
10. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R,
Gordon JI: The human microbiome project. Nature 2007,
449:804-810.
11. Willner D, Furlan M, Schmieder R, Grasis JA, Pride DT, Relman DA,
Angly FE, McDole T, Mariella RP Jr, Rohwer F et al.:
Metagenomic detection of phage-encoded platelet-binding
factors in the human oral cavity. In Proceedings of the National
Academy of Sciences of the United States of America 2011,
108(Suppl. 1):4547-4553.
12. Yang F, Zeng X, Ning K, Liu KL, Lo CC, Wang W,
Chen J, Wang D, Huang R, Chang X et al.: Saliva microbiomes
distinguish caries-active from healthy human populations.
ISME Journal doi:10.1038/ismej.2011.71, advance online
publication, 30 June 2011.
13. Brulc JM, Antonopoulos DA, Miller ME, Wilson MK, Yannarell AC,
Dinsdale EA, Edwards RE, Frank ED, Emerson JB, Wacklin P et al.:
Gene-centric metagenomics of the fiber-adherent bovine
rumen microbiome reveals forage specific glycoside
hydrolases. In Proceedings of the National Academy of Sciences
of the United States of America 2009, 106:1948-1953.
14. Flicek P, Birney E: Sense from sequence reads: methods
for alignment and assembly. Nature Methods 2009,
6:S6-S12.
15. Miller JR, Koren S, Sutton G: Assembly algorithms for next
generation sequencing data. Genomics 2010, 95:315-327.
This is a thorough review of the state of the art in assembly algorithms and
assembly tools, for next generation sequencing data, including their
advantages over former tools and their limitations.
16. Laserson J, Jojic V, Koller D: Genovo: de novo assembly for
metagenomes. Journal of Computational Biology 2011, 18:429-443.
17. Liu YC, Schmidt B, Maskell DL: Parallelized short read assembly
of large genomes using de Bruijn graphs. BMC Bioinformatics
2011, 12:.
18. Misra S, Agrawal A, Liao WK, Choudhary A: Anatomy of a hashbased long read sequence mapping algorithm for next
generation DNA sequencing. Bioinformatics 2011, 27:189-195.
19. Narzisi G, Mishra B: Scoring-and-unfolding trimmed tree
assembler: concepts, constructs and comparisons.
Bioinformatics 2011, 27:153-160.
20. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I:
ABySS: a parallel assembler for short read sequence data.
Genome Research 2009, 19:1117-1123.
21. Zerbino DR, Birney E: Velvet: algorithms for de novo short read
assembly using de Bruijn graphs. Genome Research 2008,
18:821-829.
22. Schmieder R, Edwards R: Fast identification and removal of
sequence contamination from genomic and metagenomic
datasets. PLoS ONE 2011, 6:.
23. Xie G, Chain PS, Lo CC, Liu KL, Gans J, Merritt J, Qi F: Community
and gene composition of a human dental plaque microbiota
obtained by metagenomic sequencing. Molecular Oral
Microbiology 2010, 25:391-405.
Using a blend of sequencing approaches, this study presents the first
shotgun metagenome and assembled gene catalog of the dental plaque
microbiota from a healthy human. This study compares the results and
interpretations of sequencing data from two popular platforms, 454 and
Illumina.
24. Peng Y, Leung HCM, Yiu SM, Chin FYL: Meta-IDBA: a de novo
assembler for metagenomic data. Bioinformatics 2011,
27:I94-I101.
Using a novel method to compress assembly graph structure, coupled
with abundance information, this study reports a new memory efficient
tool designed specifically for the assembly of communities where
population (genome) heterogeneity could otherwise confuse other
assemblers.

25. Delcher AL, Phillippy A, Carlton J, Salzberg SL: Fast algorithms
for large-scale genome alignment and comparison. Nucleic
Acids Research 2002, 30:2478-2483.
26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Research
1997, 25:3389-3402.
27. Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E,
Nampally S, Riley D, Sundaram JP, Felix V et al.: Ergatis: a web
interface and scalable software system for bioinformatics
workflows. Bioinformatics 2010, 26:1488-1492.
28. Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB,
Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D
et al.: A computational genomics pipeline for prokaryotic
sequencing projects. Bioinformatics 2010, 26:1819-1826.
29. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K,
Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I et al.: IMG/
M: a data management and analysis system for metagenomes.
Nucleic Acids Research 2008, 36:D534-D538.
30. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M,
Paczian T, Rodriguez A, Stevens R, Wilke A et al.: The
metagenomics RAST server - a public resource for the
automatic phylogenetic and functional analysis of
metagenomes. BMC Bioinformatics 2008, 9:.
31. Lorenzi HA, Hoover J, Inman J, Safford T, Murphy S, Kagan L,
Williamson SJ: The Viral MetaGenome Annotation
Pipeline (VMGAP): an automated tool for the functional
annotation of viral metagenomic shotgun sequencing data.
Standards in Genomic Sciences 2011, 4:418-429.
32. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P,
Zhang Y, Blankenberg D, Albert I, Taylor J et al.: Galaxy: a
platform for interactive large-scale genome analysis. Genome
Research 2005, 15:1451-1455.
33. Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G,
Chung WY, Taylor J, Nekrutenko A: Windshield splatter analysis
with the Galaxy metagenomic pipeline. Genome Research
2009, 19:2144-2153.
This is one of the first studies to use a next-generation sequencing data
analysis pipeline within a workflow management system (Galaxy) for the
analysis of a metagenome. Analysis workflows of all kinds will be developed as modular pipelines using such systems in the future.
34. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM,
Furlan M, Desnues C, Haynes M, Li L et al.: Functional
metagenomic profiling of nine biomes. Nature 2008, 452:629-632.
35. Xie W, Wang F, Guo L, Chen Z, Sievert SM, Meng J, Huang G, Li Y,
Yan Q, Wu S et al.: Comparative metagenomics of microbial
communities inhabiting deep-sea hydrothermal vent
chimneys with contrasting chemistries. ISME Journal 2011,
5:414-426.
36. Jaenicke S, Ander C, Bekel T, Bisdorf R, Droge M, Gartemann KH,
Junemann S, Kaiser O, Krause L, Tille F et al.: Comparative and
joint analysis of two metagenomic datasets from a biogas
fermenter obtained by 454-pyrosequencing. PLoS ONE 2011,
6:e14519.
37. Sharon I, Battchikova N, Aro EM, Giglione C, Meinnel T, Glaser F,
Pinter RY, Breitbart M, Rohwer F, Beja O: Comparative
metagenomics of microbial traits within oceanic viral
communities. ISME Journal 2011, 5:1178-1190.
38. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA,
Formsma K, Gerdes S, Glass EM, Kubal M et al.: The RAST
Server: rapid annotations using subsystems technology. BMC
Genomics 2008, 9:75.
39. Markowitz VM, Mavromatis K, Ivanova NN, Chen IM, Chu K,
Kyrpides NC: IMG ER: a system for microbial genome
annotation expert review and curation. Bioinformatics 2009,
25:2271-2278.
40. Stewart AC, Osborne B, Read TD: DIYA: a bacterial annotation
pipeline for any genomics lab. Bioinformatics 2009, 25:962-963.
41. Otto TD, Dillon GP, Degrave WS, Berriman M: RATT: Rapid
Annotation Transfer Tool. Nucleic Acids Research 2011, 39:e57.

42. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M: CAMERA: a
community resource for metagenomics. PLoS Biology 2007,
5:e75.
43. Goll J, Rusch DB, Tanenbaum DM, Thiagarajan M, Li K, Methe BA,
Yooseph S: METAREP: JCVI metagenomics reports - an open
source tool for high-performance comparative
metagenomics. Bioinformatics 2010, 26:2631-2632.
44. Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly
of reads from a mix of high-throughput sequencing
technologies. Journal of Computational Biology 2010,
17:1519-1533.
45. Chaisson MJ, Pevzner PA: Short read fragment assembly of
bacterial genomes. Genome Research 2008, 18:324-330.
46. Maccallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A,
Malek J, McKernan K, Ranade S, Shea TP et al.: ALLPATHS 2: small
genomes assembled accurately and with high continuity from
short paired reads. Genome Biology 2009, 10:R103.
47. Li H, Durbin R: Fast and accurate short read alignment with
Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.
This study reports one of the most popular short read alignment tools in
the industry, BWA. It is a fast, accurate and gap-aware alignment method
that can identify a best match of a read to a reference dataset.
48. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and
memory-efficient alignment of short DNA sequences to the
human genome. Genome Biology 2009, 10:R25.
49. Krawitz P, Rodelsperger C, Jager M, Jostins L, Bauer S,
Robinson PN: Microindel detection in short-read sequence
data. Bioinformatics 2010, 26:722-729.
50. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2:
an improved ultrafast tool for short read alignment.
Bioinformatics 2009, 25:1966-1967.
51. Hach F, Hormozdiari F, Alkan C, Birol I, Eichler EE, Sahinalp SC:
mrsFAST: a cache-oblivious algorithm for short-read
mapping. Nature Methods 2010, 7:576-577.
52. Schatz MC: CloudBurst: highly sensitive read mapping with
MapReduce. Bioinformatics 2009, 25:1363-1369.
53. Homer N, Merriman B, Nelson SF: BFAST: an alignment tool
for large scale genome resequencing. PLoS ONE 2009,
4:e7767.
54. Khan Z, Bloom JS, Kruglyak L, Singh M: A practical algorithm
for finding maximal exact matches in large sequence
datasets using sparse suffix arrays. Bioinformatics 2009,
25:1609-1616.
