

Bioinformatics for Next Generation Sequencing


Srinivas Aluru
Fellow, IEEE. The author is with the Dept. of Electrical and Computer Engineering at Iowa State University, and with the Dept. of Computer Science and Engineering at Indian Institute of Technology Bombay. E-mail: aluru@iastate.edu, aluru@cse.iitb.ac.in

High throughput next generation sequencing technologies are causing a revolution in modern biological and life sciences research by enabling rapid, cost-effective sequencing of genomes and transcriptomes. The myriad applications enabled by next generation sequencing require the development of different computational methods to address them. In addition, differences among next generation platforms in read lengths and error characteristics may necessitate the development of platform-specific computational methods. The ever increasing throughput of these sequencers motivates a continuous refinement of algorithms to scale to larger datasets and genomes, and the application of advanced computing technologies such as high performance computing and accelerators. Consequently, next generation sequencing is driving a rich set of algorithmic problems and has become a fast growing area within bioinformatics. This review article surveys some of the most important developments and outlines open problems for future research.

Index Terms: Bioinformatics; error correction; genome assembly; genome mapping; next generation sequencing; RNA-seq.

1. Introduction
DNA sequencing is the fundamental tool relied upon in genomic research. For nearly three decades since its invention, Sanger sequencing, which permits reading of DNA up to 800-1000 base pairs (bp) in length one sequence at a time, has been the primary sequencing workhorse. A limited degree of parallelism was achieved through multiple capillary units, with high-end systems such as the ABI 3730xl containing as many as 96 capillaries. Thus, large-scale sequencing projects such as whole genome sequencing of complex organisms could only be conducted in genome sequencing centers with numerous sequencers. Due to the cost and effort involved, sequencing was limited to model organisms identified by the scientific community, particularly for plant and mammalian species due to their large genome sizes.

Since the introduction of the 454 pyrosequencing system in 2005, a number of high-throughput technologies have been developed [4], [20], [45], [57], [58], [87] that are revolutionizing DNA sequencing by making consistent throughput advances and driving down sequencing costs. Common to these systems is the ability to generate millions of concurrent reads, with the throughput of some systems reaching a few billion reads per experiment. Collectively known as next-generation sequencers (NGS), these systems vary in the type of chemistry used in the sequencing reactions, which affects the maximum obtainable read lengths and the error types and rates. NGS systems are further classified as second and third generation systems. The key differentiating factor is that in second generation technology, base reads are averaged across many copies of the DNA molecule being sequenced. Typically, high throughput is achieved by attaching individual molecules to microscopic entities such as beads, amplifying the number of copies in place using a technology such as emulsion PCR, and then concurrently reading all molecules being sequenced using a sequence of experiments each designed to read the next base on all sequences. In contrast, third generation sequencing refers to the ability to directly probe individual molecules. Such systems hold out the promise of real-time sequencing, avoidance of PCR amplification artifacts, long reads, and the ability to infer other information such as the presence of methylation and kinetic information [82]. Despite this promise, third generation systems are yet to be perfected with respect to both throughput and data quality, while second generation systems have matured and account for most sequencing currently being carried out.

To put the NGS revolution in perspective, customers of Illumina next-gen sequencers in 2008 sequenced ten times the amount of all publicly available DNA sequence accumulated until 2005, the full content of the GenBank repository as of that date [69]. Since then, Illumina sequencers have achieved a 10-fold average throughput increase per year, going from millions to billions of reads per experiment in a short span of three years. Major sequencing centers are currently generating petabytes of sequencing data per year, straining the capability of both storage systems and algorithms for analyzing such data. NGS systems have significantly enhanced the scope and magnitude of sequencing projects. It is now possible for a researcher with access to average resources to sequence the genome of an organism.
This has led to the sequencing of numerous species which would otherwise have been impossible, and also to the sequencing of numerous individuals of a given species to capture its genetic diversity. Thousands of such genomes have been sequenced for organisms such as humans and the model plant Arabidopsis thaliana. The race to enable cost-effective individual human genome sequencing is nearing its end, which is expected to usher in the era of personalized medicine. These advancements have placed an urgency on the development of bioinformatics tools to meet the growing needs of NGS-driven life sciences research.

TABLE 1: Read and throughput characteristics of selected next generation sequencers.

Company/Equipment     Read Length      Throughput     Time       Error rate
454 GS FLX+           700 bp (avg)     700 Mbp        23 hrs     1%
Illumina HiSeq2000    2x100 bp         540-600 Gbp    11 days    1-3%
ABI SOLiD 5500xl      2x60 bp          70-105 Gbp     7 days     —
Ion Torrent           101 bp (avg)     100 Mbp        2 hrs      3-5%
Helicos Heliscope     35 bp (avg)      35 Gbp         1.5 days   5%
Pacbio RS             1500 bp (avg)    40 Mbp         1.5 hrs    15-20%

Though a nascent area, with the earliest publications appearing just six years ago, the rapid growth of research makes it impossible to cover this topic comprehensively in a single review article. This paper presents some of the major applications of NGS technologies and provides a window into the accompanying algorithmic advances, along with additional reading material for further study.

2. Sequencer Properties and Algorithmic Implications


Several NGS systems are currently available, new systems or models are continuously under development, and some are yet to be released. The properties of selected NGS platforms, as of this writing, are listed in Table 1. The 454 sequencer was the first NGS system to be introduced [57], and currently provides near-Sanger read lengths and very low error rates, with a comparatively lower throughput of about a million reads per run. Illumina [4] is the current sequencing workhorse, a second generation system with throughput up to 6 billion 100 bp reads per run. A similar high-throughput second generation system is the ABI SOLiD platform [58]. Heliscope [87] and Pacbio RS [45] are third generation systems and currently exhibit much higher error rates. Ion Torrent [78] is a rapidly emerging low-cost platform based on semiconductor technology that shares properties of both second and third generation systems. Complete Genomics [20] is a unique system that is not sold directly to researchers but is instead used by the manufacturer to provide sequencing services; it is optimized exclusively for the study of human genomes. New NGS systems are still being contemplated, with nanopore based sequencing among the prominent ones.

The varying characteristics of NGS systems have several implications for algorithm designers, as illustrated below.

Varying read lengths: Short read lengths (35 bp - 100 bp) require fundamentally different algorithms from longer reads (700 bp or more). Typically, reads are sampled with high coverage from much longer target DNA, such as the entire genome of an organism. With long reads, a long suffix-prefix overlap between a pair of reads is a reliable indicator of genomic co-location, except in the case of long repeats in the target DNA. Thus, overlap based methods typically work well. Shorter reads lead to significant ambiguity regarding their genomic location, and overlaps cannot be relied upon. On the other hand, ultra-long reads (such as Pacbio's) enable unique applications such as scaffolding, the problem of finding the genomic order of genomic substrings (contigs) inferred through assembly.

Different error characteristics: All sequencing platforms make errors, which must be taken into careful account in algorithm development. With respect to the correct sequence, errors can be classified as insertions, deletions (collectively known as indels), and substitutions. Illumina sequencers predominantly make substitution errors, with negligible (approximately 1% of total errors) indels. Taking advantage of this, many algorithms designed for Illumina deal with Hamming distance instead of edit distance, which leads to simpler and faster algorithms (a brief sketch contrasting the two distance computations appears at the end of this section). Some platforms make both indels and substitution errors, while others predominantly make indel errors. Other variations are possible. For instance, Ion Torrent makes errors in estimating the number of occurrences of a repeated base (e.g., AAAA may be read as AA). Such indels, called homopolymer indels, account for a majority of its indel errors.

Sequencer specific problems: Sometimes a sequencer may have unique properties that necessitate models and algorithms specific to the target platform. The ABI SOLiD system does not infer a DNA sequence directly but produces a sequence of colors from a color alphabet of size 4. Each color corresponds to two consecutive bases, a 4-to-1 mapping since the 4 colors represent 16 base combinations. However, each base is interrogated twice (as part of two consecutive colors in the sequence), and the mapping from the second base to a color is unique given the first base. This propelled the design of algorithms in color space and algorithms for analyzing hybrid data in both sequence and color space. As another example, Pacbio offers strobe sequencing, whereby one can stretch a long read (> 1200 bp) even further by introducing long and approximately estimated gaps within the read, so that the read covers a much larger span on the genome. This can be particularly useful for scaffolding.

The emerging nature of NGS systems implies that new algorithmic problems will continue to emerge. For instance, some of the nanopore based technologies under development specify short substrings (such as 6-mers) of the genome with near perfect accuracy, but with errors in the estimated position of occurrence. The following sections outline some of the major applications of NGS technologies, and provide a survey of the main algorithmic techniques developed for each.
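Since substitution-dominated platforms allow the use of Hamming distance while indel-prone platforms require edit distance, the following minimal sketch contrasts the two computations (illustrative only; the read and reference strings are made up):

```python
# Illustrative sketch: Hamming distance (substitutions only) vs. edit distance
# (substitutions + indels). The sequences below are hypothetical examples.

def hamming_distance(read, ref_window):
    """Mismatch count between a read and an equal-length reference window."""
    assert len(read) == len(ref_window)
    return sum(1 for a, b in zip(read, ref_window) if a != b)

def edit_distance(read, ref_window):
    """Classical dynamic-programming edit distance."""
    m, n = len(read), len(ref_window)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref_window[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion in read
                          curr[j - 1] + 1,    # insertion in read
                          prev[j - 1] + cost) # match / substitution
        prev = curr
    return prev[n]

if __name__ == "__main__":
    print(hamming_distance("ACGTACGT", "ACGAACGT"))  # 1 (one substitution)
    print(edit_distance("ACGTACGT", "ACGACGT"))      # 1 (one deletion)
```

The Hamming check runs in a single linear pass, whereas the edit distance computation fills a dynamic programming table; this is the efficiency difference the text alludes to.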

3. Resequencing

Resequencing refers to the problem of sequencing the genome of an individual from a species for which a representative genome is already available. Prior to NGS, the huge expense of Sanger sequencing of large genomes resulted in the sequencing of just one individual, or a mixture of a small number of individuals, per species, for a limited number of species chosen after considerable scientific debate. To put this in perspective, the cost of the first human genome sequencing exceeded USD 1 billion; in just a little over a decade, NGS systems have brought this cost down to around USD 1,000. Resequencing, particularly of humans, is a major application of NGS technology. Sequencing multiple genomes of a species enables studies of its genetic diversity, and in the case of humans, is a critical component of the coming era of personalized medicine.

The key requirement for resequencing is that the individual genome being sequenced can be accurately captured by comparing it to the reference genome. This is true in the case of humans, where estimates show that individuals agree on as much as 99.9% of the genome. In organisms such as plants which have undergone rapid changes due to human breeding and selection, such a technique would not work due to the potential for large genomic differences, including large scale genomic rearrangements, between multiple inbreds. In such a case, or when the genome of a hitherto unknown species is sequenced, the new genome is inferred through a process known as genome assembly, covered later. Resequencing is a simpler problem than genome assembly due to the availability of the reference genome.

In resequencing, short reads representing high coverage of the individual being studied are sequenced using an NGS instrument, where coverage is defined as the ratio of the total number of bases sequenced to the size of the genome. The computational problem arising from resequencing is simple to state: the reads are mapped to the reference genome, taking into account that there may be genomic differences between the reference and the individual being sequenced, and that there may be erroneous bases in NGS reads. The two causes are both manifested as differences from the reference genome, and are indistinguishable when mapping a single read to the reference genome. The mapping is achieved by taking each short read and aligning it to the reference genome using some type of alignment or approximate string matching algorithm. High coverage (30-50X or more for Illumina) is needed both to ensure that the genome is adequately covered, and to differentiate between NGS instrument errors and individual differences such as Single Nucleotide Polymorphisms (SNPs; one-base differences
that play a role in many genetic diseases). Because errors are infrequent and random, the base at each genomic position can be accurately inferred by examining the bases from all mapped reads that contain this position: an overwhelming consensus with a small number of deviations identifies the base correctly. In the case of diploid organisms such as humans, where two copies of the genome exist, the two copies may have different bases at a specific position. In that case, both bases should appear with high frequency when sequenced with sufficient coverage, and the position can be categorized accordingly.

Approximate string matching and string alignment are well understood problems studied decades earlier. While these are applicable to the mapping problem, the key motivation behind the development of new algorithms is the scale of the data involved. A 50X coverage of the 3+ billion base pair human genome by 100 bp Illumina reads requires mapping over 1.5 billion reads to the human genome. Operating at this scale requires addressing computational complexity as well as optimizing the memory required for processing such large data. This led to the development of many algorithmic techniques, briefly reviewed below. For obvious reasons, human genome resequencing is seen as the ideal benchmark for mapping algorithms. In resequencing, each read is either mapped uniquely to one location in the reference genome, mapped ambiguously to multiple locations, or discarded because it could not be mapped. Accuracy of mapping is of paramount importance, particularly if inferences on disease propensity are drawn from base occurrences at biomarker sites.

The Seed-Filter-Extend Approach

Most mapping algorithms can be characterized as conforming to the seed-filter-extend approach. In this approach, one or more seeds are used to rapidly map each read to locations in the genome where there can potentially be a good match. This phase mitigates the computation that would be involved if the read were mapped directly to the genome using an alignment or approximate string matching algorithm. An optional filter phase may be used to further reduce the number of matching positions per read. Finally, the extend phase carries out full read mapping at the locations identified. The methods vary in the algorithms used to carry out the three procedures; a schematic illustration of the seed-filter-extend approach is provided in Fig. 1.


Fig. 1: The seed-filter-extend approach for mapping reads to the reference genome. A seed is a pattern of fixed length with positions marked as match or don't care. Reads are initially mapped to the reference genome based on sharing of at least one common seed. The number of mapped locations may be further reduced using a Filter procedure. Finally, an Extend procedure is used to map the read fully to the reference genome.


Most of the innovation in mapping algorithms has centered around the seed phase. A seed can be thought of as a bit pattern S of fixed length k. The 1s in the pattern indicate positions where the base in the read must match the corresponding base in the genome, and 0s represent don't care positions. Formally, a mapping of the read r to genome G for seed S is a pair of positions (i, j) such that r[i + l] = G[j + l] for all 0 ≤ l < k with S[l] = 1. The simplest seed is a pattern of all 1s, i.e., S = {1}^k = 111...111 (1 repeated k times). In this case, a mapping is a pair of positions (i, j) for which r[i..i + |S| − 1] = G[j..j + |S| − 1], where r[i..l] = r[i]r[i + 1]r[i + 2]...r[l] denotes the substring of r starting at position i and ending at position l. In other words, a read is mapped to a genomic location if we can find a common kmer between the read and the genome where it is mapped.

The justification for using a kmer match as a seed is as follows: suppose that, based on NGS instrument error characteristics and the expected differences between the individual and reference genomes, we are willing to tolerate at most x differences when mapping a read to the genome. These differences can be substitutions, insertions, or deletions. Then, no matter how these differences are distributed over the read, there exists at least one substring of r of length at least k = ⌊|r|/(x+1)⌋ that is an exact match between the read and the genome.

Lookup table for seed based mapping: Exact kmer matches between reads and the reference genome can be efficiently computed using a lookup table T of size 4^k. Each entry in T denotes a particular string of length k. The reference genome is indexed using T: each entry in T is built to contain a linked list of all positions in G that contain the corresponding kmer. The lookup table can be computed using a linear scan of the genome, taking O(4^k + |G|) time. Once the reference genome is indexed, the reads are scanned similarly. For each kmer present in a read, the corresponding entry in the lookup table contains all the positions in G where the read can be mapped based on this kmer match. Assuming |G| > 4^k, all mapped positions between reads and the reference can be computed in linear time plus O(1) time per mapped position. Note that the roles of the reference genome and the reads can be swapped, i.e., the lookup table can be built on the reads instead. We consider the former advantageous because the number of bases in the reference genome is smaller by a factor of the coverage c when compared to the number of bases in all the reads, and the lookup table needs to be computed only once if the reference genome is fixed. The lookup table solution naturally extends to spaced seeds: indexing for a seed S of length k with 1s in l positions requires a lookup table of size 4^l.

Many mapping programs utilize spaced seeds [40], [50], [49], [53] or variants of this strategy. The idea behind spaced seeds was first proposed in the early 1990s [10] and further applied to homology search in [56], both of which predate NGS. The key advantage of spaced seeds is their ability to improve sensitivity over the use of a continuous seed with the same number of 1s, without sacrificing run-time efficiency. Spaced seeds were first used for mapping in Eland, a commercial software package that accompanied Solexa sequencers (Solexa was later acquired by Illumina). Seeds for NGS mapping can be designed to give sensitivity guarantees. Typically, methods use multiple spaced seeds and identify mapping locations as potential candidates as long as one seed matches the location. With spaced seeds of length k, any approximate match with l < k mismatches can be discovered by utilizing (k choose l) seeds, corresponding to all possible ways of choosing k-long seeds with l mismatch positions. To reduce the complexity, some methods adopt machine-specific strategies. For instance, Illumina reads are of high quality towards the beginning of the read; taking advantage of this, MAQ guarantees a 2-mismatch hit in the first 28 bases of the read [49].

The extend phase of the seed-filter-extend approach is to simply conduct a full-blown alignment test to map a read to each of the potential candidate locations identified. A simple way to do this is to carry out a semiglobal alignment in the genomic region surrounding the mapping location. Often, the run-time is reduced by using the mapped seed as an anchor for the alignment and extending the alignment on either side using the Needleman-Wunsch algorithm [66]. Some methods use an additional filtering phase between the seed and extend phases to further reduce run-time. For example, GASSST [77] uses non-spaced seeds (kmer matches) and the Needleman-Wunsch algorithm for extension; in between, it incorporates an additional filter step in which precomputed tables of all possible 4mer alignments are used to compute an approximate banded alignment between a read and a genomic location it is mapped to. Banded alignment refers to an alignment algorithm that limits the exploration of the dynamic programming table to entries within a band; it reduces run-time at the cost of not considering solution paths that are not fully confined to the band, but works well in cases where only a small number of deviations are allowed. The candidate location is discarded if the banded alignment score does not exceed a threshold.

q-gram frequency based mapping: The spaced seed based approaches are ideally suited for substitution differences, as mismatches but no indels are tolerated within the seed. Note that once a seed is used to find a mapping location, the alignment program used in the extend phase can allow gaps in the rest of the read, and even within the seed area if the seed is not used as an anchor. However, the mapping location would not be found if there is an indel within the seed. An approach to overcome this problem and treat indel differences on par with mismatches is the q-gram approach [75]. The q-gram of a sequence is the set of qmers found in it. Sequences approximately close to each other should share many common qmers. Formally, a read r with at most k differences (insertions, deletions, or substitutions) from the corresponding genomic occurrence will share at least |r| − (k+1)q + 1 common qmers with it. The q-gram approach is particularly suited for long reads when the entire read length is used as a seed. It is also more suited for NGS systems that produce more or less uniform sequence lengths (e.g., Illumina and ABI SOLiD) than for systems that produce varying length reads (e.g., Ion Torrent and Pacbio). An example of a q-gram based mapping program is SHRiMP [79].
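Returning to the lookup table indexing described above, here is a minimal sketch (illustrative only; a Python dictionary stands in for the 4^k table, and the genome, read, and k are toy values):

```python
# Sketch of a k-mer lookup table over the reference genome (toy example).
from collections import defaultdict

def build_lookup(G, k):
    """Map every k-mer to the list of positions where it occurs in G."""
    table = defaultdict(list)
    for j in range(len(G) - k + 1):
        table[G[j:j + k]].append(j)
    return table

def candidate_positions(read, table, G_len, k):
    """Candidate genomic start positions implied by every k-mer of the read."""
    candidates = set()
    for i in range(len(read) - k + 1):
        for j in table.get(read[i:i + k], []):
            c = j - i                      # align read start with genome position
            if 0 <= c <= G_len - len(read):
                candidates.add(c)
    return sorted(candidates)

G = "ACGTACGTTGACCA"
table = build_lookup(G, k=4)
print(candidate_positions("ACGTTGAC", table, len(G), k=4))  # [0, 4]
```

The spurious candidate (position 0, produced by a single shared kmer) is exactly the kind of location that the filter and extend phases are meant to eliminate, while position 4 is the true mapping.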

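The banded alignment used as a filter can be sketched as follows (a simplified, generic implementation of the technique; it is not GASSST's actual filter, and the scoring parameters are illustrative):

```python
# Simplified banded global alignment sketch. Only cells with |i - j| <= w are
# filled in; paths leaving the band are ignored, trading exactness for speed
# when few indels are expected.
NEG_INF = float("-inf")

def banded_score(a, b, w, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    if abs(m - n) > w:
        return NEG_INF                     # band cannot reach the opposite corner
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for j in range(1, min(n, w) + 1):      # first row, within the band
        D[0][j] = j * gap
    for i in range(1, m + 1):
        if i <= w:
            D[i][0] = i * gap              # first column, within the band
        for j in range(max(1, i - w), min(n, i + w) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # match / substitution
                          D[i - 1][j] + gap,      # gap in b
                          D[i][j - 1] + gap)      # gap in a
    return D[m][n]

print(banded_score("ACGTACGT", "ACGAACGT", w=2))   # 6: one substitution
print(banded_score("ACGTACGT", "ACGTTACGT", w=2))  # 6: one inserted base
```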
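Finally, the q-gram lemma translates directly into a counting filter; the sketch below (toy parameters, not SHRiMP's implementation) counts shared qgrams between a read and a candidate genomic window and applies the |r| − (k+1)q + 1 threshold:

```python
# Sketch of the q-gram counting filter (illustrative example).
from collections import Counter

def shared_qgrams(read, window, q):
    """Number of q-grams shared between a read and a genomic window."""
    rc = Counter(read[i:i + q] for i in range(len(read) - q + 1))
    wc = Counter(window[i:i + q] for i in range(len(window) - q + 1))
    return sum(min(rc[g], wc[g]) for g in rc)

def passes_qgram_filter(read, window, q, k):
    """Keep the window if it could contain the read with at most k differences."""
    threshold = len(read) - (k + 1) * q + 1
    return shared_qgrams(read, window, q) >= threshold

read   = "ACGTACGTAC"
window = "ACGTACCTAC"   # one substitution relative to the read
print(passes_qgram_filter(read, window, q=3, k=1))  # True
print(passes_qgram_filter(read, window, q=3, k=0))  # False
```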

Burrows-Wheeler Transform based Mapping: The key limitation of lookup table and q-gram based approaches is that they limit mapping of reads to the genome based on fixed length exact matches. Exact matches of arbitrary length are easily identifiable through the use of data structures such as suffix trees, or equivalently, suffix arrays with auxiliary data structures. This flexibility comes at the cost of increased storage which, though linear in the size of the genome, is high because of the constants involved (12-20 bytes per nucleotide). In 2000, Ferragina and Manzini proposed a space-efficient data structure now called the FM-index [21], which uses the Burrows-Wheeler Transform (BWT) and provides the same capability with 1-2 bytes per nucleotide. This data structure became the basis for some of the widely used mapping programs [46], [47], [48], [51]. Generally, BWT and FM-index based methods facilitate finding exact matches between a read and the genome. The FM-index search is similar to a backward search on the prefix trie of the genome, but without its large memory requirement (a minimal sketch of such a backward search is given below). As with lookup table approaches, mapping of the entire read at a candidate location can then be obtained through alignment. It is also possible to directly find inexact matches tolerating a small number of substitutions or indels. Because this is accomplished by translating the inexact match into an exponential number of exact matches that conform to it, performance degrades exponentially with the number of differences tolerated. Still, this can be a viable approach for allowing one or two differences and in turn obtaining much longer, and hence more reliable, matches.

Further research and open problems: Read mapping is perhaps the most studied problem in next generation sequencing applications. Nevertheless, some issues remain. While mapping programs succeed in mapping the majority of reads unambiguously (most reach 85-95%, though success heavily depends on the complexity and repeat structure of the genome), they do leave a small percentage of reads unmapped, and cannot determine a single location for some others. Clearly, ambiguity arising from repeat sequences cannot be resolved by a mapping program alone. The remaining errors are generally a result of the tradeoff between sensitivity and run-time, leaving scope for further improvement. A key but less recognized problem is the accuracy of the alignment even when the mapped locations are correct. Depending on the choice of alignment parameters, the particular indels and substitutions used in the alignment vary. Resolving certain positions, such as disease biomarker sites, accurately is crucial, and mapping programs are prone to many more errors here than in determining correct mapping locations. With ever increasing throughput and the anticipated future need to routinely sequence individuals, speed and scaling will continue to remain issues.
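To make the idea concrete, here is a minimal, uncompressed sketch of BWT-based backward search (illustrative only; real FM-index implementations replace the naive rank computation and the explicit suffix array with compressed structures occupying 1-2 bytes per nucleotide):

```python
# Minimal sketch of BWT backward search for exact matching (toy genome).
def bwt(text):
    text += "$"                                   # sentinel smaller than A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa

def backward_search(pattern, L):
    """Suffix-array interval [lo, hi) of suffixes prefixed by the pattern."""
    symbols = sorted(set(L))
    C = {c: sum(L.count(d) for d in symbols if d < c) for c in symbols}

    def occ(c, i):                                # rank of c in L[0:i] (O(n) here)
        return L[:i].count(c)

    lo, hi = 0, len(L)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi

genome = "ACGTACGTTGACCA"                         # toy reference
L, sa = bwt(genome)
lo, hi = backward_search("ACGT", L)
print(sorted(sa[i] for i in range(lo, hi)))       # [0, 4]: exact match positions
```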

4. Error Correction

As noted in Table 1, all NGS technologies produce reads with errors whose frequency is higher than that of Sanger sequencing, for which error rates are below 1%. The types of errors vary from platform to platform but some general trends persist. Longer reads appear to be associated with a higher frequency of errors, measured as the percentage of erroneous bases (substitutions, insertions, deletions) over the total number of bases. Another general trend is that errors tend to be infrequent at the beginning of the read (5′ end) and typically cluster towards the end of the read (3′ end). Technologies such as Illumina HiSeq/MiSeq and Ion Torrent also produce many perfect reads (> 60%), with no easy way of distinguishing them from reads containing errors. Errors in reads affect the quality of downstream analysis for virtually every application of NGS systems. Hence, algorithms for eliminating or reducing errors are of great interest.

The output of a sequencer is a set of reads, each represented as a sequence of nucleotide bases along with a corresponding sequence of quality scores. The quality score of a base indicates the probability that the base is correct, and is typically calibrated to a logarithmic scale. Phred quality scores are now nearly universally adopted, with score-to-accuracy mapping as follows: 10 → 90%; 20 → 99%; 30 → 99.9%; 40 → 99.99%. The translation of the signals generated by the NGS system during base incorporation (light detected by optical fibres or release of hydrogen ions) into bases and quality scores is often termed primary analysis, and errors can be minimized by improving the instrumentation or the corresponding base calling algorithms. However, such details tend to be highly machine-specific, and often outside the purview of what is directly available to the user. Hence, they are not the focus of discussion here. Researchers have developed a class of error correction algorithms, often termed secondary analysis, that directly analyze reads and alter them to improve data quality. This is possible because all sequencing experiments oversample the target sequences in order to be able to reconstruct them.

Consider sequencing of a genome using NGS technology: let G denote the genome, n denote the number of reads sampled, and l denote the average read length. Then, the coverage c = nl/|G| denotes the average number of reads in which the base at a particular genomic position appears. Although coverage is expected to be uniform, significant variation is found in practice [19]. Error correction algorithms take advantage of the fact that the base at a specific genomic position would have been read correctly in many of the reads spanning it, and thus base corrections can be performed if a proper layout of read overlaps can be computed. This is not directly possible unless the genome is known a priori and the individual being sequenced is expected to have minimal differences and no large scale deviations from the known genome. If this were the case, the problem reduces to resequencing, where the techniques described in the previous section can be readily applied.
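For reference, the Phred convention quoted above follows the standard formula Q = −10 log10(P_error); a quick sketch:

```python
# Phred quality: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10).
def phred_to_error_probability(q):
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    p_err = phred_to_error_probability(q)
    print(f"Q{q}: error prob {p_err:g}, base accuracy {100 * (1 - p_err):g}%")
# Q10 -> 90%, Q20 -> 99%, Q30 -> 99.9%, Q40 -> 99.99%
```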


4.1 Substitution-only Error Correction

To date, most error correction algorithms have targeted substitution-only errors, owing to their algorithmic simplicity. This was justified because of the error characteristics of Illumina sequencers, which have been the dominant platform. Error correction is not crucial for 454 reads because of their near-Sanger read lengths. However, this may be changing due to the interest in emerging Ion Torrent platforms, and methods have recently been developed for handling indel errors.

The k-spectrum approach: The key ideas for correcting substitution-only errors were first developed by Chaisson et al. [13]. The key problem in stacking up reads spanning a specific genomic position is that multiple sequence alignment is an NP-hard problem. Suppose however that a set of reads have the same start and end genomic positions and have only substitution errors. In such a case, the alignment is trivially inferred. However, the former assumption is unjustified. To make this work, reads are decomposed into all possible k-long substrings contained in them, called kmers. When two reads overlap, they share many kmer pairs that span exactly the same range of genomic positions. Identification of such overlapping kmers is indirectly achieved by taking kmers with small Hamming distance to be overlapping. Although this is not necessarily true, misplaced kmers are likely overpowered in the aggregate by the many correctly placed ones.

The algorithm works as follows: the set of all kmers present in the reads is termed the k-spectrum. A kmer is considered to be valid if its frequency of occurrence is higher than a threshold M, a function of the coverage c. A Hamming graph is directly or indirectly constructed over the k-spectrum to connect all pairs of kmers within a specified Hamming distance d. The neighbors of an invalid kmer in the Hamming graph can be consulted to identify potential correction possibilities for the kmer. Reads are corrected by replacing the invalid kmers contained in them with valid kmers in a consistent way. Chaisson et al. propose a dynamic programming solution, and a heuristic for scaling to large data sets [13].

Finding the optimum threshold: The performance of error correction depends on the accuracy of classification of the k-spectrum into valid and invalid kmers. Chin et al. [16] derive the optimum value for the threshold M assuming uniform genome sampling, a uniform error distribution, and an equal rate for all possible substitution errors at a given position. While these assumptions are necessary to permit analytical calculations, they are seldom true in practice. Kelley et al. [43] use a more practical approach that yields superior error correction by taking quality scores into consideration. They define the weight of a kmer X to be W(X) = Σ_i Π_{j=0..k−1} Pr(qualityscore(X_i[j])), where i ranges over all occurrences of the kmer X; that is, the weight of a specific occurrence X_i is the product of the probabilities, implied by the quality scores, that the base at each position in X_i is correct. The value of M is then determined using a histogram of kmer weights. The error correction itself follows a greedy approach, incrementally targeting the next lowest quality score base for correction, and abandoning the correction of a read if the greedy approach fails to deliver a consistent correction.

Context-sensitive ambiguity resolution: A common problem in k-spectrum based error correction is ambiguity: when an invalid kmer has multiple valid neighbors in the Hamming graph, it is not clear which of the substitutions should be selected. Xiao et al. [103] overcome this by examining the bases surrounding the kmer being corrected to help resolve the ambiguity. The scheme works as follows: the algorithm corrects tiles instead of kmers, where a tile is defined as a sequence of two or more kmers. The same concept of occurrence frequency is applied both to kmers and to tiles. The idea is that while each constituent kmer within a tile may have multiple correction possibilities, taken together, the kmer replacements should result in a corrected tile that is valid (its occurrence frequency exceeds the threshold set for tiles). This often results in accurate resolution of ambiguities. In addition, they present a space vs. time trade-off in defining multiple implementations of the Hamming graph, and a tiling decomposition of reads that takes advantage of the sparsity of the Hamming graph to design a faster error correction algorithm. Taken together, these concepts provide the space and run-time efficiency needed to scale to large datasets.

Suffix trie/array based error correction: These methods generalize kmer based correction to the correction of variable length substrings, using suffix trie or suffix array data structures. Schroeder et al. [85] construct a generalized suffix trie using both forward and reverse complement strands of the input reads. The trie has a root-to-leaf path corresponding to every suffix, and edges have single character labels. Thus, each internal node v corresponds to a substring s_v given by the concatenation of edge labels from the root to v, and the size of the subtree under v is the number of occurrences of s_v in the read set. The main idea is to estimate the expected frequency of s_v and consider s_v as an error if its observed frequency falls significantly below the expected frequency. Schroeder et al. perform this estimation assuming uniform random sampling: the probability that a read contains a string s_v present in the genome is p = (l − |s_v| + 1)/|G|. Taking read sampling to be a collection of n Bernoulli trials, the estimated number of reads containing s_v has mean e = np and variance d = np(1 − p). If the observed occurrence of s_v is smaller than e − ad for a user-specified parameter a, it is considered an error. Errors are corrected as follows: if s_v is deemed to contain an error, then v is identified with one of its siblings u such that s_u passes the frequency test. This is tantamount to fixing the error to be the last base of s_v; it is justified because, had the error occurred elsewhere, it would have already been considered at one of the ancestors of v. Paths below v that are identical to the corresponding paths below u are corrected by merging. Otherwise, a read may contain multiple errors, and correction is continued in a similar way below v. Since frequency estimation cannot be made accurately for very short and very long strings, the correction is limited to a range of depths in the suffix trie.

Ilie et al. [33] developed a suffix array based approach that pursues kmer based error correction in a different way. In this approach an error in a read is corrected only if it is preceded by a correct kmer. The idea is similar to the suffix trie approach above in that an erroneous (k+1)-mer whose last base is an error is replaced by the correct (k+1)-mer. A suffix array, containing all suffixes of all reads and their reverse complements in sorted order, is used. In this setting, suffixes that contain the erroneous (k+1)-mer will be close to suffixes with the correct (k+1)-mer, facilitating correction. The value of k is crucial: a large value of k decreases the number of correctable errors, because correction is applied only to errors preceded by k correct bases, while a small value of k leads to more ambiguous correction possibilities. The value of k is chosen to minimize the sum of the number of bases that cannot be corrected and the number of bases that are potentially falsely corrected.

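The following toy sketch (illustrative only; not the implementation of any of the tools cited above) shows the core k-spectrum mechanism: count all kmers, declare those below a threshold M invalid, and accept a single-base substitution that makes every kmer of the read valid:

```python
# Toy k-spectrum correction sketch. Real correctors add Hamming-graph
# neighborhoods, quality scores, and tiling to resolve ambiguity and to scale.
from collections import Counter
from itertools import product

def k_spectrum(reads, k):
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def correct_read(read, spectrum, k, M):
    """Try single-base substitutions so that every k-mer of the read is valid."""
    def invalid_positions(s):
        return [i for i in range(len(s) - k + 1) if spectrum[s[i:i + k]] < M]
    if not invalid_positions(read):
        return read                          # already consistent with the spectrum
    for pos, base in product(range(len(read)), "ACGT"):
        if base == read[pos]:
            continue
        candidate = read[:pos] + base + read[pos + 1:]
        if not invalid_positions(candidate):
            return candidate                 # first consistent single-base fix
    return read                              # give up (multiple/ambiguous errors)

reads = ["ACGTACGTTG", "CGTACGTTGA", "GTACGTTGAC", "ACGTACCTTG"]  # last has an error
spectrum = k_spectrum(reads, k=4)
print(correct_read("ACGTACCTTG", spectrum, k=4, M=2))  # -> "ACGTACGTTG"
```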
4.2 General Error Correction

Until recently, algorithmic efforts for error correction largely focused on substitutions and targeted the Illumina platform. There have been two recent works aimed at indel error correction, one an extension of the suffix trie based approach and the other based on multiple sequence alignment. With the introduction of the Pacbio and Ion Torrent platforms, for both of which effective error correction is important, there is a need to accelerate work on correcting indel errors, particularly homopolymer errors. Ion Torrent systems produce short reads (100-200 bp) with a sizable frequency of errors (3-5%), making error correction beneficial. Although Pacbio produces ultra-long reads of 1,200 bp to over 2,000 bp, the error rate is high at 15-20%.

Salmela et al. [80] modified the suffix trie approach of Schroeder et al. [85] to also account for insertion and deletion errors. Let v be a node and s_v be the corresponding substring with an error in its last base. The approach is a straightforward extension that allows the last base to be not only a substitution error, but potentially an insertion or deletion error as well. If the last base of s_v is an insertion error, the correct string would be missing this base; thus, v should be compared with its parent. On the other hand, if it were a deletion error, v should be compared with the children of its siblings. This extension is able to capture up to one insertion/deletion error in a read per iteration.

A multiple sequence alignment based error correction method for handling indel errors is proposed by Salmela [81]. As noted before, direct multiple sequence alignment is computationally prohibitive. Thus, the approach uses kmers as seeds to identify sequences that have common substrings, and uses progressive pairwise alignments to assemble the multiple sequence alignment. The method therefore shares the potential drawbacks of kmer based methods and of the heuristic, potentially suboptimal multiple sequence alignment technique used. The algorithm initially constructs a lookup table to record all the reads containing a given kmer, for each of the 4^k possible kmers. Error correction is carried out one read at a time. For a selected read r, the goal is to infer all sequences that overlap with r in terms of the genomic ranges they span. Any read that shares at least one kmer with r is taken to be a potential candidate, and aligned to r using an alignment algorithm. The alignment of all such sequences to r is used to generate the layout of the multiple sequence alignment, which in turn is used to correct errors in r. The method can also be applied to substitution-only errors if needed, by setting insertion and deletion penalties to ∞ when carrying out the alignments.

4.3 Error Correction for Non-uniform Sampling

The aforementioned error correction methods all assume uniform sampling of the target sequences. Hence, they are useful in applications such as random shotgun sampling of genomes. In fact, the motivation behind many of the error correction algorithms is their potential use in sequencing of novel genomes, where error correction improves both the quality of and the effort required for genome assembly. Some error correction methods were even demonstrated by their effectiveness for genome assembly, such as in increasing the lengths of the target sequences inferred [81]. Recent developments in single-cell sequencing require the ability to error correct non-uniformly sampled data sets. Here, the genome is extracted from a single cell, fragmented, and the number of copies required for sequencing is obtained by increasing the sample size through PCR. The PCR process can be thought of as an iterative process where each iteration doubles the number of molecules in a sample.

Due to differences in the number of times PCR cycles apply to the different molecules, their final composition can be highly non-homogeneous. Medvedev et al. [62] extended the approach in [103] to work for non-uniformly sampled NGS data sets. They also use spaced seeds to provide higher sensitivity in identifying similar substrings.

An important application where non-uniform sampling is the norm is transcriptome sequencing, in which expressed sequences (such as genes) from a genome are sequenced. The sequences are expressed in relation to the need for these molecules in the biological processes currently being carried out in the cell; thus differential expression, with significant variation, is expected. Since these sequences are sampled uniformly, the number of reads arising from the same expressed sequence will be proportional to the copy count of that sequence. Qu et al. [74] developed an error correction method for transcriptome sequencing. They group reads into clusters, where each cluster is a hierarchical tree such that 1) the difference between a parent and a child is one base, 2) children appear with smaller frequency than their parents, and 3) there is a high transition probability from parent to child. This is possible because transcriptome sequencing samples the transcript (expressed sequence) from the 3′ end; hence short reads of the same length sequenced from a molecule should be identical. The sequences within each hierarchical tree are corrected to its root, and errors are the accumulated base differences from a read to the root of the tree it lies in.

Other approaches and open problems: Due to the ever increasing sizes of NGS data sets, error correction algorithms that can scale accordingly are of great interest. Most error correction algorithms have been validated or demonstrated using tens of millions of reads, while some NGS systems are capable of generating billions of reads. Despite recent work on error correction methods for parallel computers [86] and accelerators [88], scaling very much remains an open problem. There are two promising avenues of investigation unexplored so far. One is the development of error correction techniques that take into account specific machine characteristics, such as the prevalence of homopolymer insertion and deletion errors on the Ion Torrent platform. In fact, deriving the error characteristics of an NGS system from training sets consisting of sequencing data derived from a known genome itself remains an open problem. Dohm et al. [19] outline a number of specific error patterns observed in Illumina short reads. Algorithms to infer such characteristics automatically, and the use of such models to drive error correction, could significantly improve error correction capabilities. Another unexplored avenue is the possibility of correcting hybrid data sets consisting of reads from multiple platforms. Intuitively, it should be possible to perform better error correction by taking into account the relative advantages of each type of platform. Such work should be of interest to biologists, as most genome sequencing projects need to utilize data sets from multiple platforms, either by design or because multiple groups made independent choices over time and generated such data sets.
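The parent/child clustering idea described above for transcriptome reads can be illustrated with a toy sketch (this is only an illustration of the general idea, not the published method; reads, counts, and parameters are made up):

```python
# Toy sketch of frequency-based parent/child clustering for equal-length reads:
# a read becomes the child of a one-mismatch neighbor that occurs more often,
# and cluster roots are taken as the corrected sequences.
from collections import Counter

def neighbors(read):
    for i, orig in enumerate(read):
        for b in "ACGT":
            if b != orig:
                yield read[:i] + b + read[i + 1:]

def cluster_roots(reads):
    counts = Counter(reads)
    parent = {}
    for r in sorted(counts, key=counts.get):          # least frequent first
        cands = [n for n in neighbors(r) if counts.get(n, 0) > counts[r]]
        parent[r] = max(cands, key=counts.get) if cands else r

    def resolve(r):                                   # follow links to the root
        while parent[r] != r:
            r = parent[r]
        return r

    return {r: resolve(r) for r in counts}

reads = ["ACGTAC"] * 10 + ["ACGAAC"] * 2 + ["ACGTAG"]
print(cluster_roots(reads))
# {'ACGTAC': 'ACGTAC', 'ACGAAC': 'ACGTAC', 'ACGTAG': 'ACGTAC'}
```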

5. Genome Assembly
In contrast to resequencing, de novo genome sequencing refers to the process of sequencing an unknown genome directly from sequenced DNA fragments obtained from it.


Fig. 2: A schematic illustration of shotgun sequencing.

Most genomes are sequenced through a process termed shotgun sequencing, whereby multiple copies of the genome are randomly fragmented down to a size that can be run through a sequencer. The fragments are sequenced, and in the absence of a template genome, the overlaps between fragments become the primary information available to infer the genome. The computational problem of taking sequenced fragments (also called reads) as input, and inferring the corresponding genome, is called genome assembly, or fragment assembly. A schematic illustration of shotgun sequencing, and the resulting genome assembly problem, is shown in Fig. 2.

The coverage, or depth of sequencing, needed to accurately assemble a genome is a function of the read length. For Sanger sequencing of some of the largest genomes, such as human and mouse, 5-7X coverage became an accepted standard balancing the need for accurate reconstruction against the cost of sequencing. Shorter reads of NGS systems require much higher coverage, and at the same time facilitate it through low and still rapidly declining costs. It is now common to use 50X coverage or higher for Illumina sequencing of whole genomes. In practice, gaps in the assembly occur due to low coverage or breaks in the coverage of the genome, or because the assembler is not able to accurately reconstruct the sequence. As a result, the output of an assembler is a set of sequences termed contigs, each of which represents a contiguous stretch of the genome. Even if the full genome sequence is unknown, it is highly desirable to order and orient the contigs to obtain an approximate representation of the whole genome, a process known as scaffolding the contigs. Note that although a sequence is double stranded, it is represented as one of its strands read in the 5′ to 3′ direction. Hence, the contigs can be on either strand, and thus orientation becomes important.

In genome sequencing, each read spans an unknown interval of the genome. Two reads are said to be co-located on the genome if the corresponding intervals overlap. If so, this is manifested as an overlap between the suffix of one of the reads and a prefix of the other read, termed a suffix-prefix overlap. The overlap need not be exact due to sequencing errors. In addition, sometimes multiple individuals may be sequenced together, as was the case with the first human genome sequencing project. Hence, an alignment algorithm is needed to detect overlaps. When assembling a genome, pairwise suffix-prefix overlaps can be computed from the data. Though genomic co-location implies pairwise overlap, the converse need not be true, because the sequence in the overlap region can be exactly or approximately repeated elsewhere in the genome (Fig. 3). In general, repeats longer than the read length cannot be resolved by sequence overlaps alone. Hence, this problem worsens with decreasing read length. Genomes may contain long repeats which pose an issue for the read lengths currently supported by any sequencer. The extent, frequencies, and lengths of repeats vary from organism to organism. To assist with resolving repeats, reads are typically generated in pairs from either end of longer fragments of approximately known size (typically within ±10%), termed the insert size. The resulting reads are often called mate-pair or paired-end reads. Due to the nature of sequencing, the two reads come from opposite ends and opposite strands of the fragment.
To assist with resolving repeats, reads are typically generated in pairs from either end of longer fragments of approximately known size (typically to within ±10%), termed the insert size. The resulting reads are often called mate-pair or paired-end reads. Due to the nature of sequencing, the two reads come from opposite ends and opposite strands of the fragment. A read from a repeat region can be accurately placed if the corresponding mate-pair read samples a non-repeat region. In practice, multiple insert sizes (3–5) may be used to sample the genome at different length scales. Such mate-pair distance constraints provide the second important source of information for genome assembly.

5.1 Overlap-Layout-Consensus Approach

Two major approaches have been developed for genome assembly, the choice of which depends on read lengths. The first of these is the overlap-layout-consensus method, developed and perfected in the days of Sanger sequencing. In this method, significant pairwise overlaps between input reads are first computed. Let n denote the number of reads and l the read length. Computing all pairwise overlaps takes O(n^2 l^2) time, which is prohibitive since n is large. To mitigate this, a filter is designed that directly identifies pairs of reads to be considered for overlap. The filter can be designed such that any pair left out by the filter cannot have a pairwise suffix-prefix alignment score exceeding a preset threshold. A simple filter is one that looks for the presence of at least one common kmer. Any pair of reads whose maximal common substring is of length less than k will have at least one difference every k bases in the overlapping region, which sets an upper bound on the quality of the resulting alignment. Pairs having common kmers can be readily identified in time linear in n, plus O(1) generation time per pair, using the lookup table described in Section 3. More sophisticated approaches use suffix trees to identify maximal common substrings of arbitrary length [41]. In the absence of repeats and for uniform shotgun sequencing, the complexity of the overlap phase can thus be effectively brought down from O(n^2 l^2) to O(n l^2). Once pairwise sequence overlaps are computed, they are used as a guide to generate a layout of the reads. The goal is to recreate the layout that would have been obtained if the genome sequence and the span of each read were known. Finding an optimal layout is an NP-hard problem, but a greedy heuristic works well and has seen great success in practice.
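A minimal sketch of such a kmer filter follows (illustrative Python only; the kmer length and data structures are our choices, and production implementations index both strands and stream candidate pairs rather than materializing them all):

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(reads, k=12):
        # Index every kmer to the reads containing it; only read pairs that
        # share at least one kmer are passed to the expensive overlap alignment.
        index = defaultdict(set)
        for i, read in enumerate(reads):
            for pos in range(len(read) - k + 1):
                index[read[pos:pos + k]].add(i)
        pairs = set()
        for ids in index.values():
            pairs.update(combinations(sorted(ids), 2))
        return pairs

    reads = ["ACGTACGTTGCA", "CGTTGCATTAGC", "TTTTTTTTTTTT"]
    print(candidate_pairs(reads, k=6))   # {(0, 1)}: only the first two reads share a 6-mer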


Fig. 3: An illustration of the difficulty of resolving repeats in genome assembly. The two repeats can be exact or approximate. Reads sequenced from one repeat region cannot be distinguished from reads sequenced from the other. This can lead to a misassembly whereby a contig can be put together wrongly as a concatenation of the sequence before repeat A1, followed by the repeat, followed by the sequence after repeat A2.

In the greedy approach, pairs of reads are considered in decreasing order of their overlap scores. Using each successive pair, a new read or a new pair of reads is brought into the assembly, provided there is no conflict with the current set of contigs. Once a layout is generated, the consensus sequence for each contig is easily obtained by considering each position and taking the majority of the bases in that position over all the reads spanning it. In addition, mate-pair constraints can be used to verify the accuracy of the assembly; if needed, the incorporation of a new read into a current contig based on evidence of pairwise overlap can be rejected if it violates mate-pair constraints. The overlap-layout-consensus paradigm accounted for a majority of the assemblers of the Sanger era, only some of which are referenced here [2], [30], [38], [44], [64], [93]. For the largest genomes, assembly can take months of computational effort serially. The all-pairs overlap computation is easy to parallelize, but the remaining phases are not straightforward. Full-scale parallel solutions that cut down assembly time significantly were developed by Huang et al. [31] and Kalyanaraman et al. [41]. The overlap-layout-consensus approach is likely to remain useful for NGS systems producing long read lengths, such as the PacBio RS. However, short reads are the main staple of NGS genome sequencing, and due to the short read lengths, the even shorter pairwise overlaps cannot be relied upon. This prompted significant research into de Bruijn graph based approaches, described in greater detail below.
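As a concrete illustration of the consensus step, the sketch below takes a layout in which each read has already been assigned an offset on the contig and reports the per-column majority base (a simplification of ours; real consensus modules also handle insertions, deletions, and quality values):

    from collections import Counter

    def consensus(layout):
        # layout: list of (offset, read) pairs placing each read on the contig.
        # The consensus base in each column is the majority over covering reads.
        length = max(offset + len(read) for offset, read in layout)
        columns = [Counter() for _ in range(length)]
        for offset, read in layout:
            for i, base in enumerate(read):
                columns[offset + i][base] += 1
        return "".join(col.most_common(1)[0][0] for col in columns)

    # Three laid-out reads that disagree at a single column.
    print(consensus([(0, "ACGTAC"), (2, "GTACGT"), (4, "ATGTTA")]))   # ACGTACGTTA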

5.2 Graph Based Assembly

In graph based assembly, reads are translated into short paths on a suitably defined graph, and contigs are superpaths containing these. Graph based assembly was initially proposed by Idury and Waterman in 1995 [32], and the first graph based assembler was created by Pevzner et al. [72]. Both of these efforts date back to the Sanger era. The key advantage of graph based assembly is that, unlike the greedy heuristic employed by the overlap-layout-consensus approach, it is possible to take a more global view of multiple overlapping reads while forging superpaths to infer contigs. From a theoretical point of view, the overlap-layout-consensus approach can be viewed as the NP-hard problem of finding a Hamiltonian cycle in the overlap graph (nodes are reads, and edges depict overlaps), whereas the graph based approach can be modeled as the problem of finding an Eulerian path in a de Bruijn graph. Despite these apparent advantages of graph based assembly, overlap-layout-consensus assemblers won out in practice during the Sanger era. The greedy heuristic provided high quality solutions, and the memory requirement was low, as pairwise overlaps can be computed in batches and stored on disk as needed. The graph based method required holding the entire graph in main memory, and its seeming theoretical advantages did not translate into any run-time gains in practice. Thus, graph based methods remained largely an intellectual curiosity until the NGS era. Short reads caused significant misassemblies with overlap driven assemblers and drove the community to reconsider graph based assemblers. Short reads also complicate graph based assembly: with NGS, the number of reads went up by two orders of magnitude, one order of magnitude due to the increase in coverage requirements and another due to shorter read lengths. Although this caused a two orders of magnitude increase in graph sizes, graph based methods had to be pursued for higher accuracy. This refocused the efforts of the community and led to the creation of many graph based assemblers [5], [9], [12], [13], [24], [34], [35], [36], [52], [105]. While the methods vary in some aspects, they follow the same general methodology, which we describe in detail below while highlighting some important differences between assemblers.

Construction of de Bruijn Graph: Most graph based assemblers use the de Bruijn graph formulation (see Fig. 4). In this formulation, each node is a kmer present in the read set (also considering complementary strands), and each edge is a (k + 1)mer present in the read set. A directed edge connects two kmers that have an exact suffix-prefix match of length k - 1. Thus, a read of length l is represented as a path in the graph spanning l - k + 1 nodes and l - k edges. The frequency of occurrence of each edge in the set of reads is an important attribute to maintain, as it is useful in many ways for subsequent analysis: paths with low frequency can indicate errors, a high frequency path diverging into multiple paths is indicative of a repeat, and so on. Each edge in the de Bruijn graph is labeled with a single character, for example the first character of the (k + 1)mer it represents. Then, a read can be recovered from the corresponding path by concatenating the edge labels on the path, followed by the kmer in the last node of the path. The key advantage of labeling is that the initial graph can be simplified by collapsing paths into edges when possible, enlarging the edge labels to represent the concatenation of the labels on the paths so collapsed. This continues to maintain the property that contigs are paths in the graph. Such a graph with edges labeled by strings is often termed a string graph.
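The following sketch builds such a graph from a read set (Python, for illustration only; the kmer size and dictionary-based representation are our choices, and reverse complements are ignored for simplicity, whereas real assemblers index both strands in compact data structures):

    from collections import defaultdict

    def build_debruijn(reads, k=3):
        # Nodes are kmers and edges are (k+1)mers; edge multiplicities are kept
        # because low-frequency edges often indicate sequencing errors.
        edge_freq = defaultdict(int)
        for read in reads:
            for i in range(len(read) - k):
                edge_freq[read[i:i + k + 1]] += 1
        graph = defaultdict(list)          # kmer -> list of (next kmer, frequency)
        for kp1, freq in edge_freq.items():
            graph[kp1[:k]].append((kp1[1:], freq))
        return graph

    # The two reads of Fig. 4; the erroneous G in the second read makes node
    # CGT branch into GTA (correct) and GTG (error).
    g = build_debruijn(["TACGTAACC", "CGTGACCT"], k=3)
    for node, out_edges in sorted(g.items()):
        print(node, "->", out_edges)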

Fig. 4: An example de Bruijn graph representation of two overlapping sequences: TACGTAACC and CGTGACCT. The graph is built using 3mers. The second sequence contains an error in the 4th position, which causes the two sequences to differ in one base in their overlapping region. This corresponds to the 6th base in the first sequence. Because the base difference is present in three consecutive 3mers, the paths in the de Bruijn graph corresponding to the two sequences diverge, with the alternative path containing three nodes.

Note that the initial graph is actually a subset of the full de Bruijn graph, since not all 4^k possible kmers may be present in the read set (in the full graph, every node has an out-degree of 4 and an in-degree of 4). When using reads and their reverse complements, one ends up modeling both strands of the genome separately, and there is no easy way to separate the two. Some methods directly model the double stranded nature of DNA by using the bidirected de Bruijn graph formulation [34], [61]. Some assemblers enhance the scale of graphs that can be constructed and analyzed by constructing and storing the graph collectively in the distributed memory of multiple nodes of a cluster [5], [35]. Finally, Myers demonstrated that graph based assembly can also be carried out on string graphs generated from overlap graphs [65].

Graph compaction and simplification: Genomes contain fairly long stretches of unique sequence interspersed with repeats of various sizes, lengths, and frequencies. Each such long unique stretch appears as a path in the de Bruijn graph with no further branching. By identifying and collapsing such chains into single edges, the graph is greatly reduced in size; hence, this is often the first step carried out by most assemblers. The graph resulting after compaction is often termed the conflict graph, as there will be multiple paths through each node that must be resolved correctly. Jackson et al. developed an elegant parallel algorithm based on parallel prefix to simultaneously identify and collapse all chains in the de Bruijn graph [35].

Graph editing for error removal: Figure 4 illustrates the de Bruijn graph for two overlapping reads TACGTAACC and CGTGACCT, with the character G shown in boldface representing an error. In the absence of the error, the overlapping region CGTAACC would be a path in the graph with a frequency count of two. The presence of the single error creates a bubble, an alternative path of length k that deviates from the true path. Given the low frequency of errors, with sufficiently high coverage the true path should have high frequency and the bubble should appear with very low frequency. This is illustrated in Figure 5, where the thickness of an edge represents its frequency, and the graph is shown after compaction. The graph can be analyzed for such low frequency bubbles, which can be removed as evidence of erroneous paths. This type of analysis was first proposed in [105] and followed up in other assemblers. A number of local graph topologies that are indicative of errors can be identified; three such scenarios are illustrated in Fig. 5. The first refers to a case where an error occurs within the last k bases of a read. In this case, the remaining read length is not large enough for the erroneous path to rejoin the correct path. This leads to the erroneous path culminating in a dead-end node, a configuration known as a tip. The last case refers to errors that transform a portion of a read into another valid sequence present in the genome. A read with frequent errors may simply show up as a lone or low frequency path in the graph, disconnected from the rest. Note that removal of the low frequency paths so identified can uncover more chains in the graph. For example, a path with an internal bubble will be manifested as a three edge path after graph simplification, with the middle edge having an alternative, low frequency edge connecting the same nodes (see Fig. 5). Removal of the bubble creates a 3-node chain that can be compacted into a single edge. Thus, graph compaction and error removal may need to be run iteratively until no more changes are possible.
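The sketch below illustrates the frequency-based editing just described on a compacted graph (a simplification of ours: a single global frequency cutoff is used, whereas real assemblers compare parallel paths and tip lengths against coverage-dependent thresholds):

    def prune_low_frequency(graph, min_freq=3):
        # graph: node -> list of (target, frequency) edges of a compacted graph.
        # Edges whose frequency falls far below the expected coverage are treated
        # as erroneous (tips, bubbles, spurious links) and removed.
        return {node: [e for e in edges if e[1] >= min_freq]
                for node, edges in graph.items()
                if any(e[1] >= min_freq for e in edges)}

    # A bubble after compaction: two parallel edges between the same pair of
    # nodes, one well supported (frequency 40) and one seen only once.
    graph = {"u": [("v", 40), ("v", 1)], "v": [("w", 41)]}
    print(prune_low_frequency(graph))   # {'u': [('v', 40)], 'v': [('w', 41)]}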

Generating contigs: Contigs are formed by traversing the final graph, and it is in this stage that many of the methods differ. Two key algorithmic strategies underlie this phase of the assembly. One of them is a type of network flow analysis that estimates the number of times each edge should be traversed (for example, see [59], [65], [98]). The other is the mapping of mate-pair constraints onto the graph and utilizing them in forming contigs. Note that edges in the graph represent read overlaps, or contigs formed by read overlaps with a layout that does not pose any ambiguity. Mapping the mate-pair information onto the graph makes this vital information available as a resource for forming contigs, and doing so is straightforward: each read maps to a unique path in the graph, and the two reads of a mate-pair provide an approximate distance constraint between the locations where they map. Each constraint links a position within the label of one edge in the graph to a position within the label of another edge. During graph traversal, this sets up lower and upper bounds on the distance within which the two edges should be visited relative to each other, and a large collection of such distance constraints can effectively guide the traversal process. Different assemblers follow different strategies in exploiting these constraints. The ALLPATHS assembler [9] performs a limited combinatorial forward search at a node, exploring all possible paths out of the node in search of the one that satisfies the distance constraints; the search need only be conducted for a distance up to the maximum mate-pair distance. The YAGA assembler [35] uses the mate-pair constraints in the reverse direction. In this approach, contig formation begins with an edge whose label is long enough to be reinforced internally by some distance constraints. It then proceeds to add one edge at a time. The choice among multiple edges out of a node is made by identifying the edge that is reinforced by multiple distance constraints hooking backwards into the current path. The advantage of backward reinforcement is that it avoids the need for combinatorial search.
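A toy illustration of how such a distance constraint can arbitrate between candidate extensions during traversal (the coordinate representation and the tolerance are assumptions of ours, not part of any particular assembler):

    def mate_pair_consistent(pos_left, pos_right, insert_size, tolerance=0.10):
        # Positions are offsets along the contig being built; the pair is
        # consistent if the implied distance lies within insert_size*(1 +/- tolerance).
        distance = pos_right - pos_left
        return (1 - tolerance) * insert_size <= distance <= (1 + tolerance) * insert_size

    # Expected insert size 4000 (+/- 10%): one candidate extension places the right
    # mate 4150 bases downstream of its mate, another places it 9800 bases away.
    print(mate_pair_consistent(1200, 1200 + 4150, 4000))   # True
    print(mate_pair_consistent(1200, 1200 + 9800, 4000))   # False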


Fig. 5: Illustration of three scenarios in graph error editing. The thickness of each edge in the string graph indicates the frequency associated with its edge label; as a result, thick edges represent reliable sequences and thin edges may correspond to errors. (a) Tip: shown in the leftmost graph, it represents a sequence with an error within the last k bases. (b) Bubble: shown in the second graph, it represents a sequence with an error somewhere in the middle, with at least one correct kmer at both ends. (c) Spurious link: shown in the rightmost graph, it represents the case where errors in one read make portions of it identical to portions of another set of valid reads.

Other approaches to genome assembly: Genome assembly is an actively researched problem, and other approaches have been proposed from time to time. One recent direction is the maximum likelihood approach, which has the goal of inferring the most likely genome given the input read set [59], [60], [98]. While grounded in rigorous optimization theory, these efforts have yet to produce a functioning assembler, and there will likely be hurdles in scaling to large datasets.

Further research and open problems: Development of genome assembly algorithms and software for NGS platforms is an active area of ongoing research. Competitions such as the Assemblathon [1] aim to provide a comparative assessment and mark progress in the field. Despite the advances made, NGS assemblers have not yet managed to produce contig sizes on par with what is achievable with Sanger reads. The ability to assemble mixed read datasets from multiple NGS systems is an important open avenue for research: most assemblers are targeted to one type of NGS platform, particularly Illumina, and while a few of them can handle reads from multiple NGS systems, they generally restrict the user to one type of read at a time. Another area of pressing need is the development of assemblers that can efficiently exploit large-scale parallelism. Many assemblers can use a limited number of cores in certain phases of the assembly through the use of OpenMP directives, etc. The one exception is the ABySS assembler [5], which is an MPI based program that can be used on large distributed clusters. Nevertheless, the key source of parallelism in ABySS is the building and storing of a distributed representation of the de Bruijn graph in parallel; the rest of the computation merely relies on knowledge of the mapping of graph nodes to processors, and does not constitute the design of efficient parallel algorithms for each phase of the assembly. Development of efficient assemblers that can handle hybrid datasets and that can scale to complex genomes, such as those of plants, continues to be an open avenue for investigation.

6. Gene Expression Analysis


DNA sequencing by itself presents a static view of the organism. Genome inference, and knowledge of important molecules such as genes and various types of RNAs, provide an understanding of the components of a biological system. Biological processes, however, are governed by interactions among multiple molecules under complex combinatorial control. A key window into these processes is a measurement of the number of copies of each molecule and of how the molecules influence each other. The primary means of doing this is to measure the expression levels of all expressed sequences of the genome. All biological molecules such as proteins and various types of RNAs are derived from expressed substrings of the genome, primarily genes. Measuring expression levels under various conditions, or in a time series, provides clues as to the key players in a biological process and how the process itself unfolds. The study of such collective interactions is the primary goal of the field of systems biology, and measuring expression levels provides key data for such studies.

Since the mid 1990s, gene expression has been measured through the use of microarrays. A microarray is a glass slide to which one or more representative DNA sequences, called probes, from each gene in the organism are attached. Probes of fixed length, termed oligonucleotides, are commonly used. Organisms express genes as RNA molecules, which either operate directly or are later converted to corresponding protein molecules. The RNA molecules are collected and converted into their corresponding DNA molecules, termed cDNAs, which are then separated and washed over the microarray. This causes hybridization of the molecules in the sample to the corresponding probes on the microarray. The abundance of each molecule in the sample influences the extent of hybridization at the corresponding probe, whose intensity is captured as an indirect measure of gene expression.

With the advent of NGS, particularly due to the ability to sequence a few billion short DNA molecules, sequencing is now used as an attractive alternative means to measure gene expression. In this approach, the cDNAs, produced as outlined before, are collectively fragmented and directly sequenced using NGS. The set of all expressed sequences is called the transcriptome, and hence this procedure is known as transcriptome sequencing.

The sequenced reads collectively contain enough information to infer each of the expressed elements and an estimate of their relative frequencies. A particular advantage of NGS is that one obtains digital counts of expression values, making them more accurate than the intensity measurements of the microarray based approach. In addition, events such as alternative splicing, by which the same gene can give rise to multiple RNA sequences, can be more easily captured and their relative expression counts determined. The process of transcriptome sequencing using NGS technology is called RNA-seq. The computational problem arising from RNA-seq experiments is to infer the expressed sequences and determine their counts.

6.1 Mapping Based Approaches

One way to identify expressed elements and measure their frequencies is by mapping reads from the RNA-seq experiment to the reference genome of the organism. This is similar to the mapping problem discussed in Section 3 on resequencing, with one key difference. Genes in eukaryotic organisms are composed of alternating blocks of exons and introns, and what is expressed corresponds to a concatenation of the exon sequences. Thus, if a read corresponds to a sequence that is not completely contained in an exon, then the read spans two or more exons. This results in the need to do spliced alignment: an alignment that maps consecutive non-overlapping substrings of the read to an ordered sequence of non-consecutive, non-overlapping substrings of the genome. Spliced alignment is a well studied problem in the literature (see, for example, [23], [29]). Given the importance of RNA-seq, and the need to identify putative exons and generate expression counts, standalone tools based on the mapping approach have been designed [17], [63], [96]. Generally, these tools map a subset of the reads initially using a standard mapping program and identify exons based on this. There is considerable knowledge about splice junctions at the interface of exons and introns. In particular, patterns at the boundaries are known, although exceptions can be found, and an occurrence of a pattern is not a guarantee of a splice junction. Nevertheless, knowledge of splice junctions can be effectively used in the regions surrounding the mapped portions to predict the remaining portions of the expressed sequence, and the remaining reads can then be aligned. Incorporating knowledge of splice junctions has been found to improve mapping of RNA-seq data. Alternative transcripts may be generated from the same gene by optionally incorporating an exon or optionally retaining an intron within a transcript. When a large number of reads from the same gene are generated, the alternative transcripts manifest themselves as differential coverage by the reads of the various exons (and possibly introns), and this information is used to predict alternatively spliced transcripts. Other events, such as gene fusion in the case of cancer cells, can be similarly studied using RNA-seq mapping.
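To convey the idea of spliced alignment in its simplest form, the sketch below checks whether a read can be explained by splitting it across a known exon-exon junction (an illustration of ours using exact matching and pre-supplied junction coordinates; real tools such as TopHat [96] discover junctions from the data and tolerate mismatches):

    def spans_junction(read, genome, junctions, min_anchor=4):
        # junctions: list of (donor_end, acceptor_start) genome coordinates where
        # one exon ends and the next exon begins. Returns the junction and split
        # point if the read matches contiguously up to donor_end and continues at
        # acceptor_start; returns None otherwise.
        for donor_end, acceptor_start in junctions:
            for split in range(min_anchor, len(read) - min_anchor + 1):
                left, right = read[:split], read[split:]
                if (genome[donor_end - split:donor_end] == left and
                        genome[acceptor_start:acceptor_start + len(right)] == right):
                    return (donor_end, acceptor_start), split
        return None

    exon1, intron, exon2 = "AAAACCCC", "GGTTTTTTTTAG", "GGGTTTT"   # intron flanked by GT..AG
    genome = exon1 + intron + exon2
    print(spans_junction("CCCCGGGT", genome, [(len(exon1), len(exon1) + len(intron))]))
    # ((8, 20), 4): four bases of the read map to exon1 and four to exon2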



Quantifying expression levels so that they can be sensibly compared across genes, and across experiments, is not as straightforward as counting the reads mapped to each gene. Because reads are generated by fragmenting the transcriptome, longer genes give rise to many more reads. Similarly, the number of reads mapped to a gene varies with the total number of reads sampled. These two factors are typically normalized for by using reads per kilobase of transcript per million mapped reads (RPKM) [63]. A more sophisticated approach that quantifies the expression of alternatively spliced isoforms using maximum likelihood can be found in [97].
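For concreteness, the normalization just described can be computed as in the following sketch (the function name, gene length, and read counts are made-up values of ours):

    def rpkm(reads_mapped_to_gene, gene_length_bp, total_mapped_reads):
        # Reads per kilobase of transcript per million mapped reads.
        return (reads_mapped_to_gene
                / (gene_length_bp / 1_000)
                / (total_mapped_reads / 1_000_000))

    # A 2 kb gene with 400 mapped reads in a library of 20 million mapped reads.
    print(rpkm(400, 2_000, 20_000_000))   # 10.0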
6.2 De Novo Transcriptome Assembly

An alternative approach to RNA-seq data analysis is to follow the assembly approach outlined in Section 5. As the reads are sampled from the transcriptome, it should be possible to assemble the reads together to infer all expressed sequences; this procedure is termed transcriptome assembly, and its output is a set of sequences which are the expressed elements. It can be performed similarly to genome assembly, except for one crucial difference: the number of reads from a transcript is proportional to its expression level. Therefore, unlike the case of genome sequencing where the sampling is uniform, transcriptome sequencing is inherently nonuniform. Algorithms for transcriptome assembly closely track the genome assembly approach. However, differential expression offers a powerful tool in resolving paths and generating contigs from the conflict graph. Consider the case of a node with multiple incoming and multiple outgoing edges. In the case of genome assembly, it is difficult to correctly pair incoming edges with outgoing edges in the absence of additional information, and the approximate distance constraints generated from mate-pair reads help address this problem. In the case of transcriptome assembly, the correct incoming-outgoing edge pairings may be obvious if they belong to different transcripts with significant differential expression between them. Thus, algorithms for transcriptome assembly deviate in certain phases from their genome assembly counterparts. An advantage of de novo transcriptome assembly is that it can be conducted in the absence of an adequate reference genome, and is not dependent on the quality of the reference genome. For an in-depth study of transcriptome assembly, see [5], [26], [37].

Further research and open problems: We briefly highlight some directions for future research involving RNA-seq data analysis. Generally, mapping based methods are faster than de novo assembly based methods, but both approaches could benefit from the development of faster algorithms that do not compromise accuracy, as RNA-seq datasets are growing in line with sequencer throughput improvements. Analysis by current RNA-seq software is considered far from satisfactory; for example, identification of fusion genes in cancer research using RNA-seq produces an overwhelming number of false positives. Thus, continual improvement in quality is needed for many applications of RNA-seq. The key advantage of RNA-seq over microarrays is that microarrays can only be used to measure what is already known to be expressed; the expression of previously unknown elements can only be measured with RNA-seq. However, for known elements, a proper comparative evaluation of the effectiveness of microarrays vis-à-vis RNA-seq approaches is yet to be made. In addition, much of the research in the field of systems biology is built on microarray data as its foundation, and this needs to be reworked as needed in the context of RNA-seq.

7. Conclusions

This article provides an overview of next generation sequencing technologies, and a window into the myriad bioinformatics problems that arise from the applications to which these technologies can be put. The review focused on four key problem areas: error correction, resequencing, de novo sequencing, and transcriptome analysis. While these constitute important and major applications of NGS technologies, they are by no means exhaustive; many important applications, both big and small, have been left out to keep the review to a reasonable length. Notable exceptions include epigenetic and metagenomic studies. In metagenomics, researchers directly fragment the DNA of a community of microbial organisms and sequence them collectively, which is necessary because many such organisms cannot be isolated and cultured in the laboratory. The reads therefore have to be both differentiated across multiple species and assembled into the genomes of the respective species, compounding the difficulty. A number of large-scale metagenomics projects are currently ongoing to sample the earth, the oceans, and the microorganisms inhabiting different human organs. Even for the topics covered, though the survey provides over a hundred important references which can aid further study, they are but a sampling of the research in this exciting and rapidly growing area. Readers interested in further developments should actively consult journals and conferences for recent work in the field. A particularly useful resource is the continuously updated list of all publications on NGS that appear in the Bioinformatics journal [67].

References

[1] The Assemblathon. assemblathon.org.
[2] S. Batzoglou, D.B. Jaffe, K. Stanley, and J. Butler et al. ARACHNE: a whole-genome shotgun assembler. Genome Research, 12:177–189, 2002.
[3] T. Beissbarth, L. Hyde, GK. Smyth, C. Job, WM. Boon, SS. Tan, HS. Scott, and TP. Speed. Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics, 20 Suppl 1:i31–i39, 2004.
[4] D. Bentley, S. Balasubramanian, H. Swerdlow, and G. Smith et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, 2008.
[5] I. Birol, S.D. Jackman, C.B. Nielsen, and J.Q. Qian et al. De novo transcriptome assembly with ABySS. Bioinformatics, 25:2872–2877, 2009.
[6] J. Blazewicz, M. Bryja, M. Figlerowicz, et al. Whole genome assembly from 454 sequencing output via modified DNA graph concept. Computational Biology and Chemistry, 33(3):224–230, 2009.
[7] S. Boisvert, F. Laviolette, and J. Corbeil. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11):1519–1533, 2010.
[8] D.W. Bryant, W-K. Wong, and T.C. Mockler. QSRA: a quality-value guided de novo short read assembler. BMC Bioinformatics, 10:69, 2009.
[9] J. Butler, I. MacCallum, M. Kleber, and I.A. Shlyakhter et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research, 18:810–820, 2008.
[10] A. Califano and I. Rigoutsos. Flash: a fast look-up algorithm for string homology. In Computer Vision and Pattern Recognition, pages 353–359, 1993.
[11] MJ. Chaisson, D. Brinza, and PA. Pevzner. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Research, 19(2):336–346, 2009.
[12] M.J. Chaisson and P.A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Research, 18:324–330, 2008.
[13] MJ. Chaisson, PA. Pevzner, and H. Tang. Fragment assembly with short reads. Bioinformatics, 20(13):2067–2074, 2004.
[14] J.A. Chapman, I. Ho, S. Sunkara, et al. Meraculous: de novo genome assembly with short paired-end reads. PLoS One, 6(8):e23501, 2011.
[15] B. Chevreux, T. Pfisterer, B. Drescher, et al. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Research, 14(6):1147–1159, 2004.
[16] FYL. Chin, HCM. Leung, W. Li, and S. Yiu. Finding optimal threshold for correction error reads in DNA assembling. BMC Bioinformatics, 10 Suppl 1:S15, 2009.
[17] F. DeBona, S. Ossowski, K. Schneeberger, et al. Optimal spliced alignments of short sequence reads. Bioinformatics, 24(i1):7480, 2008.
[18] S. DiGuistini, N. Liao, D. Platt, et al. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biology, 10(9):R94, 2009.
[19] JC. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16):e105, 2008.
[20] R. Drmanac, A. Sparks, M. Callow, A. Halpern, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science, 327(5961):78–81, 2010.
[21] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Foundations of Computer Science (FOCS), pages 390–398, 2000.
[22] P. Gajer, M. Schatz, and SL. Salzberg. Automated correction of genome sequence errors. Nucleic Acids Research, 32(2):562–569, 2004.
[23] M.S. Gelfand, A.A. Mironov, and P.A. Pevzner. Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences USA, 93.
[24] S. Gnerre, I. Maccallum, D. Przybylski, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences USA, 108(4):1513–1518, 2011.
[25] S. Goldberg, J. Johnson, D. Busam, and T. Feldblyum et al. A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proceedings of the National Academy of Sciences USA, 103(30):11240–11245, 2006.
[26] M.G. Grabherr, B.J. Haas, M. Yassour, J.Z. Levin, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29:644–652, 2011.
[27] N. Haiminen, D.N. Kuhn, L. Parida, et al. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS One, 6(9):e24182, 2011.
[28] D. Hernandez, P. François, L. Farinelli, and M. Osteras et al. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research, 18:802–809, 2008.
[29] X. Huang and K. Chao. A generalized global alignment algorithm. Bioinformatics, 19:228–233, 2003.
[30] X. Huang and A. Madan. CAP3: A DNA sequence assembly program. Genome Research, 9:868–877, 1999.
[31] X. Huang, J. Wang, S. Aluru, and S.P. Yang et al. PCAP: A whole-genome assembly program. Genome Research, 13:2164–2170, 2003.
[32] R.M. Idury and M.S. Waterman. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2:291–306, 1995.
[33] L. Ilie, F. Fazayeli, and S. Ilie. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics, 27(3):295–302, 2011.
[34] B.G. Jackson and S. Aluru. Parallel construction of bidirected string graphs for genome assembly. In Proceedings of the 37th International Conference on Parallel Processing, 2008.

[35] B.G. Jackson, M. Regennitter, X. Yang, and P.S. Schanble et al. Parallel de novo assembly of large genomes from high-throughput short reads. In 24th IEEE International Parallel & Distributed Processing Symposium, pages 1–10, 2010.
[36] B.G. Jackson, P.S. Schanble, and S. Aluru. Assembly of large genomes from paired short reads. In Proceedings of the 1st International Conference on Bioinformatics and Computational Biology, 2009.
[37] B.G. Jackson, P.S. Schanble, and S. Aluru. Parallel short sequence assembly of transcriptomes. BMC Bioinformatics, 10(Suppl 1):S1, 2009.
[38] D.B. Jaffe, J. Butler, S. Gnerre, and E. Mauceli et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Research, 13:91–96, 2003.
[39] W.R. Jeck, J.A. Reinhardt, D.A. Baltrus, and M.T. Hickenbotham et al. Extending assembly of short DNA sequences to handle error. Bioinformatics, 23:2942–2944, 2007.
[40] H. Jiang and W.H. Wong. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics, 24(20):2395–2396, 2008.
[41] A. Kalyanaraman, S.J. Emrich, P.S. Schanble, and S. Aluru. Assembling genomes on large-scale parallel computers. Journal of Parallel and Distributed Computing, 67:1240–1255, 2007.
[42] W. Kao, AH. Chan, and YS. Song. ECHO: A reference-free short-read error correction algorithm. Genome Research, 21:1181–1192, 2011.
[43] DR. Kelley, MC. Schatz, and SL. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116, 2010.
[44] W.J. Kent and D. Haussler. GigAssembler: An algorithm for initial assembly of the human working draft. Genome Research, 11(9):1541–1548, 2001.
[45] J. Korlach, KP. Bjornson, BP. Chaudhuri, RL. Cicero, et al. Real-time DNA sequencing from single polymerase molecules. Methods in Enzymology, 472:431–455, 2010.
[46] B. Langmead, C. Trapnell, M. Pop, and S.L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
[47] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[48] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, 2010.
[49] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18:1851–1858, 2008.
[50] R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 25(15):1966–1967, 2009.
[51] R. Li, C. Yu, Y. Li, T. Lam, and S. Yiu et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[52] R. Li, H. Zhu, J. Ruan, and W. Qian et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20:265–272, 2010.
[53] H. Lin, Z. Zhang, M. Zhang, B. Ma, and M. Li. ZOOM! Zillions of oligos mapped. Bioinformatics, 24(21):2431–2437, 2008.
[54] Y. Lin, J. Li, H. Shen, et al. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 27(15):2031–2037, 2011.
[55] Y. Liu, B. Schmidt, and D.L. Maskell. Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics, 12:354, 2011.
[56] B. Ma, J. Tromp, and M. Li. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002.
[57] M. Margulies, M. Egholm, W. Altman, S. Attiya, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005.
[58] K. McKernan, H. Peckham, G. Costa, and S. McLaughlin et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research, 19(9):1527–1541, 2009.
[59] P. Medvedev and M. Brudno. Ab initio whole genome shotgun assembly with mated short reads. Lecture Notes in Computer Science, 4955:50–64, 2008.
[60] P. Medvedev and M. Brudno. Maximum likelihood genome assembly. Journal of Computational Biology, 16(8):1101–1116, 2009.
[61] P. Medvedev, K. Georgiou, G. Myers, and M. Brudno. Computability of models for sequence assembly. Lecture Notes in Computer Science, 4645:289–301, 2007.
[62] P. Medvedev, E. Scott, B. Kakaradov, and P. Pevzner. Error correction of high-throughput sequencing datasets with nonuniform coverage. Bioinformatics, 27(13):i137–i141, 2011.
[63] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–628, 2008.
[64] J.C. Mullikin and Z. Ning. The Phusion assembler. Genome Research, 13(1):81–90, 2003.
[65] E.W. Myers. The fragment assembly string graph. Bioinformatics, 21:ii79–ii85, 2005.
[66] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[67] Nextgen. http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html/.
[68] An Illumina paired end and mate-pair short read simulator. seqanswers.com/wiki/SimSeq.
[69] J.M. Perkel. Sanger who? Sequencing the next generation. Science, 10:275–279, 2009.
[70] P.A. Pevzner and H. Tang. Fragment assembly with double-barreled data. Bioinformatics, 21:S225–S233, 2001.
[71] P.A. Pevzner, H. Tang, and G. Tesler. De novo repeat classification and fragment assembly. Genome Research, 14:1786–1796, 2004.
[72] P.A. Pevzner, H. Tang, and M.S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the USA, 98:9748–9753, 2001.
[73] M. Pop. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10(4):354–366, 2009.
[74] W. Qu, S. Hashimoto, and S. Morishita. Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. Genome Research, 19(7):1309–1315, 2009.
[75] K. Rasmussen, J. Stoye, and E.W. Myers. Efficient q-gram filters for finding all e-matches over a given length. Journal of Computational Biology, 13:296–308, 2006.
[76] T. Rausch, S. Koren, G. Denisov, et al. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics, 25(9):1118–1124, 2009.
[77] G. Rizk and D. Lavenier. GASSST: Global alignment short sequence search tool. Bioinformatics, 26(20):2534–2540, 2010.
[78] J.M. Rothberg, W. Hinz, T.M. Rearick, J. Schultz, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356):348–352, 2011.
[79] S.M. Rumble, P. Lacroute, A.V. Dalca, M. Fiume, A. Sidow, and M. Brudno. SHRiMP: accurate mapping of short color-space reads. PLoS Computational Biology, 5(5):e1000386, 2009.
[80] L. Salmela. Correction of sequencing errors in a mixed set of reads. Bioinformatics, 26(10):1284–1290, 2010.
[81] L. Salmela and J. Schroder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.
[82] E.E. Schadt, S. Turner, and A. Kasarskis. A window into third-generation sequencing. Human Molecular Genetics, 19(R2):R227–R240, 2010.
[83] M.C. Schatz, A.L. Delcher, and S.L. Salzberg. Assembly of large genomes using second-generation sequencing. Genome Research, 20:1165–1173, 2010.
[84] J. Schroder, J. Bailey, T. Conway, and J. Zobel. Reference-free validation of short read data. PLoS One, 5(9):e12681, 2010.
[85] J. Schroder, H. Schroder, S.J. Puglisi, and R. Sinha et al. SHREC: a short-read error correction method. Bioinformatics, 25(17):2157–2163, 2009.
[86] A. Shah, S. Chockalingam, and S. Aluru. A parallel algorithm for spectrum-based short read error correction. In IEEE International Parallel & Distributed Processing Symposium, to appear, 2012.
[87] J. Shendure and H. Ji. Next-generation DNA sequencing. Nature Biotechnology, 26(10):1135–1145, 2008.
[88] H. Shi, B. Schmidt, W. Liu, and W. Müller-Wittig. Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA. In IEEE International Parallel & Distributed Processing Symposium, pages 1–8, 2009.
[89] J.T. Simpson, K. Wong, S.D. Jackman, and J.E. Schein et al. ABySS: a parallel assembler for short read sequence data. Genome Research, 19:1117–1123, 2009.
[90] A. Smith, Z. Xuan, and M. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 9(1):128–135, 2008.
[91] A. Sundquist, M. Ronaghi, H. Tang, and P. Pevzner et al. Whole-genome sequencing and assembly with high-throughput, short read technologies. PLoS ONE, 2:e484, 2007.
[92] Y. Surget-Groba and J.I. Montoya-Burgos. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Research, published online, 2010.
[93] G. Sutton, O. White, M. Adams, and A. Kerlavage. TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology, 1(1):9–19, 1995.
[94] MT. Tammi, E. Arner, E. Kindlund, and B. Andersson. Correcting errors in shotgun sequences. Nucleic Acids Research, 31(15):4663–4672, 2003.
[95] MA. Taub, HC. Bravo, and RA. Irizarry. Overcoming bias and systematic errors in next generation sequencing data. Genome Medicine, 2(12):87, 2010.
[96] C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25:1105–1111, 2009.
[97] C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5):511–515, 2010.
[98] A. Varma, A. Ranade, and S. Aluru. An improved maximum likelihood formulation for accurate genome assembly. In Proceedings of the 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pages 165–170, 2011.
[99] R.L. Warren, G.G. Sutton, S.J.M. Jones, and R.A. Holt. Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23:500–501, 2007.
[100] D. Weese, A. Emde, T. Rausch, A. Döring, and K. Reinert. RazerS: fast read mapping with sensitivity control. Genome Research, 19(9):1646–1654, 2009.
[101] E. Wijaya, MC. Frith, Y. Suzuki, and P. Horton. Recount: Expectation maximization based error correction tool for next generation sequencing data. International Conference on Genome Informatics, 23(1):189–201, 2009.
[102] X. Yang, S. Aluru, and K.S. Dorman. Repeat-aware modeling and correction of short read errors. BMC Bioinformatics, 12 Suppl 1:S52, 2011.
[103] X. Yang, K.S. Dorman, and S. Aluru. Reptile: representative tiling for short read error correction. Bioinformatics, 26(20):2526–2533, 2010.
[104] O. Zagordi, R. Klein, M. Dumer, and N. Beerenwinkel. Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Research, 38(21):7400–7409, 2010.
[105] D.R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18:821–829, 2008.
[106] W. Zhang, J. Chen, Y. Yang, et al. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 6(3):e17915, 2011.
[107] X. Zhao, LE. Palmer, R. Bolanos, and C. Mircean et al. EDAR: an efficient error detection and removal algorithm for next generation sequencing data. Journal of Computational Biology, 17(11):1549–1560, 2010.
[108] D. Zhi, U. Keich, P. Pevzner, and S. Heber et al. Correcting base-assignment errors in repeat regions of shotgun assembly. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1):54–64, 2007.

About the Authors


Srinivas Aluru is the Ross Martin Mehl and Marylyne Munas Mehl Professor of Computer Engineering at Iowa State University, and the Bajaj Chair Professor of Computer Science and Engineering at Indian Institute of Technology Bombay. Earlier, he held faculty positions at Syracuse University and New Mexico State University. Aluru conducts research in high performance computing, parallel algorithms and applications, bioinformatics and systems biology, combinatorial scientific computing, and applied algorithms. He is a recipient of the NSF Career award, Iowa State University Foundation award for outstanding achievement in research, Swarnajayanti Fellowship from Government of India, two best paper awards (IPDPS 2006 and CSB 2005), and two best paper finalist recognitions (SC 2007 and SC 2002). He serves on the editorial boards of the Journal of Parallel and Distributed Computing, IEEE Transactions on Parallel and Distributed Systems, International Journal of Data Mining and Bioinformatics, and Journal of Computing by the Computer Society of India. He served on numerous program committees in parallel processing and computational biology, including serving as program chair for IC3 2011, HiPC 2007, program co-chair for BiCoB 2008, and program vice chair for ICPP 2012, BIBM 2009, SC 2008, IPDPS 2007, ICPP 2007 and HiPC 2006. He co-chairs an annual workshop in High Performance Computational Biology (www.hicomb.org) and edited a comprehensive handbook on computational molecular biology, published in 2005. He is a Fellow of the American Association for the Advancement of Science (AAAS) and the Institute of Electrical and Electronics Engineers (IEEE).

