
Open Computational Problems in Metagenomics

Gabriel Valiente

Algorithms, Bioinformatics, Complexity and Formal Methods Research Group


Technical University of Catalonia

Computational Biology and Bioinformatics Research Group


Research Institute of Health Science, University of the Balearic Islands

Centre for Genomic Regulation


Barcelona Biomedical Research Park

University of Zaragoza
18 May 2012
Abstract

Next-generation sequencing technologies allow for the genetic
study of complex microbial communities, which have so far remained
largely unknown because they cannot be cultured in the
laboratory.

The core problem of metagenomics is to determine and
quantify the composition of a sample consisting of a mixture
of different, and possibly unknown, microbial species (National
Research Council, 2007).

Solving this core biological problem involves a series of
algorithmic and computational problems, ranging from the
simulation of metagenomic samples to the alignment or
mapping of sequence reads, the non-taxonomic assignment or
binning of sequence reads, and the taxonomic assignment of
sequence reads (Ribeca and Valiente, 2011).
Biological background I
Biological background II
Kingdom Archaea Bacteria Eukaryota Viruses

Phylum Streptophyta

Class Streptophytina

Order Solanales

Family Solanaceae

Genus Solanum

Species Solanum lycopersicum

Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis


simplicibus
Biological background III
Kingdom Archaea Bacteria Eukaryota Viruses

Phylum Chordata

Class Mammalia

Order Primates

Family Hominidae

Genus Gorilla Homo Pan Pongo

Species Homo sapiens


Computational background I
Computational background II

(Van de Peer et al., 2000)


Computational background III

(Ashelford et al., 2005)


Computational background IV

(Schloss, 2009)
Simulating metagenomic samples I
Different metagenomic analysis pipelines produce different results,
and standardized simulated data are essential to their evaluation.

The simulated sequence reads have to reflect the diverse taxonomic
composition of a metagenomic dataset. Usual experimental
biological settings involve targeted or random sequencing,
single-end or paired-end reads, and one or many dominant species.

Ribeca and Valiente (2011) showed that current simulation
tools (Richter et al., 2008) do not properly reflect the diverse
taxonomic composition of a metagenomic dataset.

Problem 1 Devise an algorithm to simulate a metagenomic dataset
of single-end reads.

Problem 2 Devise an algorithm to simulate a metagenomic dataset
of paired-end reads.
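As a point of reference for Problems 1 and 2, the following is a minimal sketch of a naive read simulator, not a solution to either problem. It assumes that genomes are plain strings over ACGT, that each read is drawn from a genome chosen in proportion to a given abundance (ignoring genome length), and that errors are uniform substitutions only. The function names, the `insert_size` parameter, and the error model are all illustrative assumptions.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def mutate(seq, error_rate, rng):
    """Substitute each base independently with probability error_rate (assumed error model)."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(genomes, abundances, n_reads, read_len,
                   paired=False, insert_size=300, error_rate=0.0, seed=0):
    """Draw reads from genomes chosen in proportion to their abundance.

    Returns (genome_name, read) tuples for single-end reads and
    (genome_name, read1, read2) tuples for paired-end reads.
    """
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundances[name] for name in names]
    span = insert_size if paired else read_len
    if span < read_len or any(len(genomes[n]) < span for n in names):
        raise ValueError("fragment span must fit the read and every genome")
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - span + 1)
        fragment = genome[start:start + span]
        first = mutate(fragment[:read_len], error_rate, rng)
        if paired:
            # Second mate is read from the opposite strand at the other fragment end.
            mate = reverse_complement(fragment[-read_len:])
            reads.append((name, first, mutate(mate, error_rate, rng)))
        else:
            reads.append((name, first))
    return reads
```

A sampler of this kind controls only the abundance profile. Capturing the taxonomic diversity of a real sample is the part the two problems leave open.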
Simulating metagenomic samples II
Domain    Phylum               Class                        Genomes
Bacteria  Actinobacteria       Actinobacteria                     9
          Bacteroidetes        Cytophagia                         1
          Chlorobi             Chlorobia                          7
          Chloroflexi          Chloroflexi                        1
          Cyanobacteria        Cyanobacteria                      6
          Deinococcus-Thermus  Deinococci                         1
          Firmicutes           Bacilli                           13
                               Clostridia                         8
          Proteobacteria       Alphaproteobacteria               17
                               Betaproteobacteria                13
                               Gammaproteobacteria               25
                               Deltaproteobacteria                6
                               Epsilonproteobacteria              1
                               unclassified Proteobacteria        1
Archaea   Euryarchaeota        Methanomicrobia                    3
                               Thermoplasmata                     1
Simulating metagenomic samples III
• Low-complexity microbial community with one dominant
population
• Medium-complexity microbial community with three dominant
populations flanked by low-abundance populations
• High-complexity microbial community with no dominant
population at all

Population        simLC    simMC    simHC
Most abundant    28,861   22,956    2,384
2nd abundant      9,277   16,577    2,248
3rd abundant      5,168   10,484    2,191
4th abundant      1,149    6,107    2,127
5th abundant      1,109    4,868    2,083
6th abundant      1,074    1,146    2,051
Rest             50,857   52,319  103,687

(Alonso-Alemany et al., 2011)
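To make "dominant population" concrete, the short sketch below computes the share of each dataset taken by its most abundant populations, using the counts from the table above. The dictionary layout and the `dominance` helper are illustrative assumptions.

```python
# Counts copied from the table above; the last entry of each list is "Rest".
reads_per_population = {
    "simLC": [28861, 9277, 5168, 1149, 1109, 1074, 50857],
    "simMC": [22956, 16577, 10484, 6107, 4868, 1146, 52319],
    "simHC": [2384, 2248, 2191, 2127, 2083, 2051, 103687],
}


def dominance(counts, top=1):
    """Fraction of the dataset contributed by the `top` most abundant populations."""
    ranked = sorted(counts[:-1], reverse=True)  # exclude the "Rest" row
    return sum(ranked[:top]) / sum(counts)


for name, counts in reads_per_population.items():
    print(f"{name}: total={sum(counts)} top1={dominance(counts):.1%} "
          f"top3={dominance(counts, top=3):.1%}")
```

The most abundant population accounts for roughly 30% of simLC and 20% of simMC, but only about 2% of simHC.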


Mapping sequence reads I
The composition of a metagenomic dataset can be assessed by
aligning or mapping the sequence reads to a reference database of
known sequences from a large set of different organisms. The high
yield of high-throughput sequencing technologies requires
extremely efficient mapping programs, ruling out traditional
alignment programs like BLAST (Altschul et al., 1990).

Current mapping programs achieve efficiency at the price of
accuracy, by not being exhaustive and allowing only a small number of
insertions and mismatches, and by only reporting single matches in
case of ambiguities (Ribeca and Valiente, 2011).

Problem 3 Devise an exhaustive algorithm to map sequence reads
with long insertions and an arbitrary number of mismatches.
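To illustrate what "exhaustive" means here, the sketch below reports every position in every reference at which a read matches with at most k mismatches. It is a deliberately naive baseline: it handles substitutions only (no insertions, so it does not address Problem 3), and its running time is proportional to read length times total reference length. The function names are assumptions made for the example.

```python
def hamming_hits(read, reference, k):
    """Return (start, mismatches) for every window of `reference`
    that matches `read` with at most k mismatches."""
    hits = []
    m = len(read)
    for start in range(len(reference) - m + 1):
        mismatches = 0
        for a, b in zip(read, reference[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches: abandon this window
        else:
            hits.append((start, mismatches))
    return hits


def map_read(read, references, k):
    """Map one read against all references, keeping every hit (no tie-breaking)."""
    result = {}
    for name, seq in references.items():
        hits = hamming_hits(read, seq, k)
        if hits:
            result[name] = hits
    return result


references = {"g1": "ACGTACGTAC", "g2": "TTGTACGAAA"}
print(map_read("GTACG", references, k=0))
```

Because all hits are kept, a read that matches several references stays visibly ambiguous instead of being silently assigned to one of them.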
Mapping sequence reads II
• Align the 328,723 simulated sequence reads to the 113
microbial genomes using BLAST (a larger database is often
used when the target sequences are not known beforehand)
• Ambiguities arise when a sequence read is aligned with more
than one target sequence
• Take as candidate alignments all those sequences with the
same E-value as the top BLAST hit
• Sequence reads with no hit in the database of microbial
genomes are due to sequencing errors

Data set   No hit   One hit   Ambiguous     Total
simLC          59    76,513      20,923    97,495
simMC          76    86,705      27,676   114,457
simHC         100    99,619      17,052   116,771

(Alonso-Alemany et al., 2011)
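The rule for candidate alignments (all targets tied with the top hit's E-value) can be sketched as follows. The input format, a list of `(target, evalue)` pairs per read, is an assumption for illustration, and parsing of actual BLAST output is left out.

```python
def top_hits(hits):
    """Targets whose E-value equals the best (lowest) E-value among the hits."""
    if not hits:
        return []
    best = min(evalue for _, evalue in hits)
    return sorted({target for target, evalue in hits if evalue == best})


def classify(hits):
    """Label a read as 'no hit', 'one hit', or 'ambiguous'."""
    n = len(top_hits(hits))
    if n == 0:
        return "no hit"
    return "one hit" if n == 1 else "ambiguous"


# Hypothetical hits for three reads.
example = {
    "read1": [],
    "read2": [("genomeA", 1e-30), ("genomeB", 1e-12)],
    "read3": [("genomeA", 1e-30), ("genomeC", 1e-30), ("genomeB", 1e-5)],
}
for read, hits in example.items():
    print(read, classify(hits), top_hits(hits))
```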


Binning sequence reads
Those sequence reads that cannot be mapped to any sequence in a
reference database of known sequences are usually assumed to
come from unknown species. Pairwise similarities among sequence
reads are used to group them into clusters of related species.

Ribeca and Valiente (2011) showed that current binning
tools (Schloss et al., 2009; Caporaso et al., 2010) provide an
overestimation of diversity and richness in a simulated
metagenomic dataset.

Problem 4 Devise a binning algorithm that reflects the microbial
composition of a metagenomic dataset.
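The idea of grouping reads by pairwise similarity can be sketched with a simple single-linkage clustering. Everything specific here is an assumption chosen for brevity: similarity is the Jaccard index of k-mer sets, reads are joined when it reaches a fixed threshold, and all pairs are compared. This is not the method of the cited tools, and it makes no attempt to solve Problem 4.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bin_reads(reads, k=4, threshold=0.5):
    """Single-linkage bins: reads are connected when their k-mer
    Jaccard similarity is at least `threshold`. Returns lists of read indices."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    profiles = [kmers(read, k) for read in reads]
    for i, j in combinations(range(len(reads)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    bins = {}
    for i in range(len(reads)):
        bins.setdefault(find(i), []).append(i)
    return sorted(bins.values())


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
print(bin_reads(reads))
```

The number of bins depends directly on the threshold, which is one way a binning method can end up overestimating richness.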
Assigning sequence reads I
Ambiguities may arise when mapping sequence reads to a reference
database of known sequences. Sequence reads are attributed to
species at the closest possible taxonomic rank, and any ambiguities
are usually resolved by assigning ambiguous sequence reads either to
the consensus or lowest common ancestor (LCA) of all matching
sequences in a reference taxonomy (Huson et al., 2007), or to a node
in the reference taxonomy that provides optimal sensitivity and
specificity (Clemente et al., 2010, 2011).

Ribeca and Valiente (2011) showed that current assignment
tools (Cole et al., 2009; Schloss et al., 2009; Alonso-Alemany
et al., 2011) provide an underestimation of diversity and richness in
a simulated metagenomic dataset. Alonso-Alemany et al. (2011)
extended these results to taxonomic diversity.

Problem 5 Devise an assignment algorithm that reflects the
taxonomic composition of a metagenomic dataset.
Assigning sequence reads II
Input   A genomic reference S (set of sequences)
        A taxonomic reference T (tree) with a leaf set L, where each
        leaf in L has an associated known sequence of S
        A set R of (short or long) sequence reads
        A positive integer k
Output  For each read Ri ∈ R, a single node in T that represents
        in a “good” way the subset Mi ⊆ L of hits or matches whose
        sequences contain a substring with at most k mismatches to Ri
Assigning sequence reads III
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Ri be the ith read
• Let Mi be the leaves of T matching Ri with up to k
mismatches
• Let Ti be the subtree of T rooted at the LCA of Mi
• Let Ni be the leaves of Ti not matching Ri with up to k
mismatches
For the ith read, the leaves of Ti can be partitioned into the
following four subsets:
• TPi = Mi (true positives)
• FPi = Ni (false positives)
• TNi = ∅ (true negatives)
• FNi = ∅ (false negatives)
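The definitions above can be sketched in a few lines. The taxonomy is assumed to be given as a child-to-parent dictionary with `None` for the root, and the node labels are hypothetical. Assigning the read to the LCA of its matches gives TPi = Mi and FPi = Ni.

```python
def path_to_root(parent, node):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, nodes):
    """Lowest common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    common = path_to_root(parent, nodes[0])
    for node in nodes[1:]:
        ancestors = set(path_to_root(parent, node))
        common = [a for a in common if a in ancestors]
    return common[0]  # deepest node shared by all paths


def leaves_below(parent, node):
    """Leaves of the subtree rooted at `node`."""
    internal = {p for p in parent.values() if p is not None}
    return {x for x in parent
            if x not in internal and node in path_to_root(parent, x)}


def lca_assignment(parent, matches):
    """Return (Ti root, TPi, FPi, precision) for the matching leaves Mi."""
    root = lca(parent, matches)
    tp = set(matches)                      # TPi = Mi
    fp = leaves_below(parent, root) - tp   # FPi = Ni
    return root, tp, fp, len(tp) / (len(tp) + len(fp))


# Hypothetical toy taxonomy: root -> g1 (a, b), g2 (c, d, e).
parent = {"root": None, "g1": "root", "g2": "root",
          "a": "g1", "b": "g1", "c": "g2", "d": "g2", "e": "g2"}
print(lca_assignment(parent, {"c", "d"}))
print(lca_assignment(parent, {"a", "c"}))
```

In the second call the matches sit in different genera, so the LCA is the root and most of the leaves below it are false positives.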
Assigning sequence reads IV

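[Figure: the subtree Ti, with its leaves split into Ni (false positives, FPi) and Mi (true positives, TPi)]

\[ P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \]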
Assigning sequence reads V
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Tij be the subtree of T rooted at the jth node of Ti
• Let Mij be the leaves of Tij matching Ri with up to k
mismatches
• Let Nij be the leaves of Tij not matching Ri with up to k
mismatches
For the ith read and the jth node of Ti, the leaves of Ti can be
partitioned into the following four subsets:
• TPij = Mij (true positives)
• FPij = Nij (false positives)
• TNij = Ni \ Nij (true negatives)
• FNij = Mi \ Mij (false negatives)
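The four subsets can be computed for every node of Ti as sketched below. The taxonomy is assumed to be a parent-to-children dictionary with hypothetical labels. The F-measure used to compare nodes is assumed to be the balanced harmonic mean of precision and recall, and the selection criterion of the cited assignment methods may differ.

```python
def leaves(children, node):
    """Leaves of the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves(children, kid)
    return result


def subtree_nodes(children, node):
    """All nodes of the subtree rooted at `node`, in preorder."""
    nodes = [node]
    for kid in children.get(node, []):
        nodes.extend(subtree_nodes(children, kid))
    return nodes


def score_nodes(children, lca_node, matches):
    """TP, FP, TN, FN, precision, recall and F for each node j of Ti."""
    m_i = set(matches)                        # Mi (assumed non-empty)
    n_i = leaves(children, lca_node) - m_i    # Ni
    scores = {}
    for j in subtree_nodes(children, lca_node):
        below = leaves(children, j)
        tp = len(below & m_i)      # |Mij|
        fp = len(below - m_i)      # |Nij|
        fn = len(m_i) - tp         # |Mi \ Mij|
        tn = len(n_i) - fp         # |Ni \ Nij|
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f = 2 * p * r / (p + r) if tp else 0.0
        scores[j] = {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
                     "P": p, "R": r, "F": f}
    return scores


# Hypothetical family "F" with seven genera and 14 species; 6 species match.
sizes = {"G1": 1, "G2": 1, "G3": 3, "G4": 3, "G5": 3, "G6": 2, "G7": 1}
children = {"F": list(sizes)}
count = 0
for genus, size in sizes.items():
    children[genus] = [f"s{count + i + 1}" for i in range(size)]
    count += size
matches = set(children["G4"]) | set(children["G5"])
scores = score_nodes(children, "F", matches)
print(scores["F"], scores["G5"])
```

Here the family node has perfect recall but low precision, while a fully matching genus has perfect precision but recall of one half. Ties such as G4 versus G5 need an explicit tie-breaking rule.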
Assigning sequence reads VI

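[Figure: the subtree Ti and, inside it, the subtree Tij; the leaves of Ti split into Ni \ Nij (TNij), Nij (FPij), Mij (TPij), and Mi \ Mij (FNij)]

\[ P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \]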
Assigning sequence reads VII
Bacteria
  Aquificae
    Aquificae
      Aquificales
        Aquificaceae
          Aquifex
            Aquifex pyrophilus
          Hydrogenobaculum
            Hydrogenobaculum acidophilum
          Hydrogenobacter
            Hydrogenobacter subterraneus
            Hydrogenobacter thermophilus
            Hydrogenobacter hydrogenophilus
          Persephonella
            Persephonella hydrogeniphila
            Persephonella marina
            Persephonella guaymasensis
          Sulfurihydrogenibium
            Sulfurihydrogenibium subterraneum
            Sulfurihydrogenibium azorense
            Sulfurihydrogenibium yellowstonense
          Thermocrinis
            Thermocrinis albus
            Thermocrinis ruber
          Hydrogenivirga
            Hydrogenivirga caldilitoris

P = 6/(6 + 8) = 43%    R = 6/(6 + 0) = 100%   F = 60%
P = 3/(3 + 0) = 100%   R = 3/(3 + 3) = 50%    F = 67%
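The slides do not define F. Assuming it denotes the usual balanced F-measure, the harmonic mean of precision P and recall R, the two F values quoted on this slide follow directly:

```latex
\[
F = \frac{2PR}{P + R}
\]
\[
P = \tfrac{6}{14},\ R = 1:\qquad
F = \frac{2 \cdot \tfrac{6}{14} \cdot 1}{\tfrac{6}{14} + 1} = \frac{12}{20} = 60\%
\]
\[
P = 1,\ R = \tfrac{1}{2}:\qquad
F = \frac{2 \cdot 1 \cdot \tfrac{1}{2}}{1 + \tfrac{1}{2}} = \frac{2}{3} \approx 67\%
\]
```

Under this measure the smaller, fully matching subtree scores higher than the larger subtree that covers all matches.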
Assigning sequence reads VIII
Fischer and Huson (2010) introduced the notion of
LCA-skeleton-tree: the restriction of a tree to a given subset of the
nodes. Clemente et al. (2011) noticed that the LCA-skeleton-tree
of a reference taxonomy suffices for optimal taxonomic assignment.

Problem 6 Devise a fast algorithm for the LCA-skeleton-tree of a
subset of the nodes in a reference taxonomy.
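As a baseline for Problem 6, the sketch below builds a skeleton with a single traversal of the whole taxonomy. It assumes that the LCA-skeleton-tree consists of the selected nodes together with the LCAs of pairs of selected nodes, each linked to its nearest ancestor in that set. The taxonomy is assumed to be a parent-to-children dictionary, and the labels are hypothetical. Its running time is linear in the size of the full taxonomy, which is exactly what a fast algorithm would aim to avoid when the subset is small.

```python
def lca_skeleton(children, root, selected):
    """Return (skeleton_root, skeleton_children) for the selected nodes.

    A node is kept if it is selected or if at least two of its child
    subtrees contain kept nodes (it is then the LCA of selected nodes).
    Uses recursion, so very deep trees may need an iterative rewrite.
    """
    selected = set(selected)
    skeleton = {}

    def visit(node):
        # Each child subtree contributes at most one skeleton root.
        roots = []
        for child in children.get(node, []):
            top = visit(child)
            if top is not None:
                roots.append(top)
        if node in selected or len(roots) >= 2:
            skeleton[node] = roots
            return node
        return roots[0] if roots else None

    return visit(root), skeleton


# Hypothetical taxonomy.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"], "b1": ["x", "y"]}
top, skeleton = lca_skeleton(children, "root", {"a1", "x", "y"})
print(top, skeleton)
```

In the example, the unselected nodes A and B are bypassed, while b1 and root are kept because they are LCAs of selected nodes.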
Assigning sequence reads IX
Kingdom Archaea

Phylum Crenarchaeota

Class Thermoprotei

Order

Family

Genus

Species
References I
D. Alonso-Alemany, J. C. Clemente, J. Jansson, and G. Valiente.
Taxonomic assignment in metagenomics with TANGO.
EMBnet.journal, 17(2):46–50, 2011.
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J.
Lipman. Basic Local Alignment Search Tool. J. Mol. Biol., 215
(3):403–410, 1990.
K. E. Ashelford, N. A. Chuzhanova, J. C. Fry, A. J. Jones, and
A. J. Weightman. At least 1 in 20 16S rRNA sequence records
currently held in public repositories is estimated to contain
substantial anomalies. Appl. Environ. Microbiol., 71(12):
7724–7736, 2005.
J. G. Caporaso, J. Kuczynski, J. Stombaugh, et al. QIIME allows
analysis of high-throughput community sequencing data. Nat.
Methods, 7(5):335–336, 2010.
References II
J. C. Clemente, J. Jansson, and G. Valiente. Accurate taxonomic
assignment of short pyrosequencing reads. In Proc. 15th Pacific
Symp. Biocomputing, volume 15, pages 3–9. World Scientific,
2010.
J. C. Clemente, J. Jansson, and G. Valiente. Flexible taxonomic
assignment of ambiguous sequencing reads. BMC
Bioinformatics, 12:8, 2011.
J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris,
A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M.
Garrity, and J. M. Tiedje. The Ribosomal Database Project:
Improved alignments and new tools for rRNA analysis. Nucleic
Acids Res., 37(Database issue):D141–D145, 2009.
J. Fischer and D. H. Huson. New common ancestor problems in
trees and directed acyclic graphs. Inform. Process. Lett., 110
(8–9):331–335, 2010.
References III
D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster. MEGAN
analysis of metagenomic data. Genome Res., 17(3):377–386,
2007.
National Research Council. The New Science of Metagenomics:
Revealing the Secrets of Our Microbial Planet. The National
Academies Press, Washington, DC, 2007.
P. Ribeca and G. Valiente. Computational challenges of sequence
classification in microbiomic data. Brief. Bioinform., 12(6):
614–625, 2011.
D. C. Richter, F. Ott, A. F. Auch, R. Schmid, and D. H. Huson.
MetaSim: A sequencing simulator for genomics and
metagenomics. PLoS ONE, 3(10):e3373, 2008.
P. D. Schloss. A high-throughput DNA sequence aligner for
microbial ecology studies. PLoS ONE, 4(12):e8230, 2009.
References IV
P. D. Schloss, S. L. Westcott, T. Ryabin, et al. Introducing
mothur: Open-source, platform-independent,
community-supported software for describing and comparing
microbial communities. Appl. Environ. Microbiol., 75(23):
7537–7541, 2009.
Y. Van de Peer, P. De Rijk, J. Wuyts, T. Winkelmans, and R. De
Wachter. The European small subunit ribosomal RNA database.
Nucleic Acids Res., 28(1):175–176, 2000.