Dengfeng Zhang, Qing Liu+, Yuewan Wu, Yaping Li, Lin Xiao
School of Information, Renmin University of China
+ Corresponding author; e-mail: qliu@ruc.edu.cn
978-1-4799-3219-1/13 $31.00
978-0-7695-5134-0/13 $26.00 © 2013 IEEE
DOI 10.1109/WISA.2013.20
introduce two variables, C[c] and Occ[i]. C[c] stores the number of characters in L ($ is included) that are alphabetically smaller than c. Occ[i] stores the number of occurrences of L[i] in the prefix L[0, i-1] of L. The LF-mapping array LF can then be generated by the following formula: LF[i] = C[L[i]] + Occ[i].
Supposing that C[c] and Occ[i] have been computed for all characters in L, the reverse algorithm is:

Algorithm BWT_reverse(L)
i = 0;  // row 0 of M is $S
for j = N-1 down to 0
    S[j] = L[i];
    i = C[L[i]] + Occ[i];  // follow the LF-mapping: i = LF[i]

Figure 3. The reversible process of BWT

The above process can be altered. A faster version can be obtained by omitting the GST and RLE0 phases, and the compression performance will not degrade sharply. Many improvements have been made to every phase, so each phase has many alternatives. For example, many suffix array construction methods have linear time and space complexity [10][11], so the BWT output can be generated quickly. For the GST phase, many methods have been proposed to replace the move-to-front transform, such as Incremental Frequency Count [12] and Weighted Frequency Count [13]. There are various RLE schemes for the third phase, such as placing the RLE before the GST [14]. The last phase also has many improvements, mainly variations of Huffman coding and arithmetic coding.

B. Index based on BWT
In [2][3], Ferragina and Manzini analyzed the structure of matrix M, proposed an opportunistic data structure, and implemented the opportunistic index, the FM-index. The FM-index combines the BWT-based compression algorithm with the suffix array data structure, and achieves efficient random access to the compressed data without decompressing all of it at query time. The FM-index works in two steps: 1) counting the occurrences of the matching pattern, and 2) locating the occurrences.
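The reversal algorithm and both FM-index steps rest on the same C, Occ, and LF-mapping. The following Python sketch illustrates this on a small string. It is not the authors' implementation: it assumes the standard backward-search procedure for counting, builds the BWT by naive suffix sorting, answers rank queries with a linear scan, and marks rows whose text position is a multiple of an arbitrary step (4 here). All function names are hypothetical, and '$' is assumed to be absent from S and to sort before every other character.

```python
def bwt_and_suffix_array(s):
    """Return (L, SA) for s + '$' via naive suffix sorting (small inputs only)."""
    t = s + "$"
    sa = sorted(range(len(t)), key=lambda k: t[k:])
    # L[r] is the character preceding suffix SA[r]; t[-1] is '$' when SA[r] == 0.
    return "".join(t[k - 1] for k in sa), sa

def build_tables(L):
    """C[c]: characters in L smaller than c. occ[i]: occurrences of L[i] in L[0:i]."""
    seen, occ = {}, []
    for ch in L:
        occ.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(seen):
        C[ch] = total
        total += seen[ch]
    return C, occ

def inverse_bwt(L):
    """Recover S from L by following LF[i] = C[L[i]] + occ[i] from row 0 ($S)."""
    C, occ = build_tables(L)
    out = [""] * (len(L) - 1)          # S without the trailing '$'
    i = 0
    for j in range(len(out) - 1, -1, -1):
        out[j] = L[i]
        i = C[L[i]] + occ[i]
    return "".join(out)

def count(L, C, pattern):
    """Backward search: half-open row range [sp, ep) of rows prefixed by pattern."""
    sp, ep = 0, len(L)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        sp = C[ch] + L.count(ch, 0, sp)  # linear-scan rank; a real index stores it
        ep = C[ch] + L.count(ch, 0, ep)
        if sp >= ep:
            return 0, 0
    return sp, ep

def locate(L, C, occ, marked, pattern):
    """Text positions of pattern, tracing unmarked rows back with the LF-mapping."""
    sp, ep = count(L, C, pattern)
    positions = []
    for r in range(sp, ep):
        v = 0
        while r not in marked:           # position 0 is always marked, so this ends
            r = C[L[r]] + occ[r]
            v += 1
        positions.append(marked[r] + v)
    return sorted(positions)

S = "banana"
L, sa = bwt_and_suffix_array(S)
C, occ = build_tables(L)
marked = {row: pos for row, pos in enumerate(sa) if pos % 4 == 0}

print(L)                                  # annb$aa
print(inverse_bwt(L))                     # banana
sp, ep = count(L, C, "ana")
print(ep - sp)                            # 2
print(locate(L, C, occ, marked, "ana"))   # [1, 3]
```

A larger marking step saves space in the stored part of the suffix array but lengthens the trace for each located occurrence.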
their corresponding positions in the original string; this can be done by saving part of the suffix array. If a row r is marked, we can read the position pos(r) directly. If not, we execute the second step, which is called tracing here: we use the LF-mapping to find the row corresponding to the previous position of r. We iterate this procedure v times until we reach a marked row r', and set pos(r) = pos(r') + v. Figure 6 depicts the procedure of locating P in string S.

The FM-index is a combination of the Burrows-Wheeler compression algorithm with the suffix array data structure and some auxiliary information, so the FM-index is closely related to compression techniques based on BWT. Figure 7 depicts the relationship between compression techniques and the FM-index based on BWT.

Figure 7. FM-index and compression techniques based on BWT

IV. ANALYSIS OF TOOLS BASED ON BWT

A. Compression tools comparison
We compare several widely used tools: gzip (v1.2.4) [19], szip (v1.12a) [20], bzip2 (v1.0.6) [21], and bicom (v1.01) [22]. Bzip2 and szip are based on BWT, gzip is based on LZ77, and bicom is based on PPM. The test data is the Large Corpus [23], and the results are shown in Table I.

TABLE I. Compression rates (bits/byte) for large corpus.

TABLE II. Varying read length using Bowtie, Maq and SOAP [6]

Read length   Program   CPU time   Peak memory (megabytes)   Speed-up   Reads aligned
36 bp         Bowtie    6m15s      1,305                     -          62.2
SOAP does not support
Table II displays the performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2M untrimmed reads from the 1000 Genomes project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For the SOAP comparison, Bowtie was invoked with -v 2 to mimic SOAP's default matching policy. For the Maq comparison, Bowtie was run with its default policy to mimic Maq's default matching policy. Maq v0.7.1 was used for the 76 bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76 bp experiment because it does not support reads longer than 60 bp.

From Table II, we can conclude that Bowtie, which is based on the FM-index, has a great advantage over traditional tools (Maq, SOAP) in running time. In the experiment, Maq seems to occupy less memory than Bowtie. However, the memory usage of BWT-based alignment tools is independent of the number of reads to be aligned, while Maq's is linear in it. As the number of reads to be aligned increases, the memory usage of Maq will grow while that of Bowtie will remain stable.

V. CONCLUSION
In this paper, we give a detailed description of the Burrows-Wheeler transform and its reverse transform. The paper also analyzes the general scheme of compression algorithms based on BWT and describes the working procedure of the FM-index. In addition, the paper uses a figure to depict the relationship between the compression algorithm and the FM-index. Finally, we compare the compression tools and alignment tools based on BWT with other applications.

ACKNOWLEDGMENT
This research is supported by grant number 60773217 from the National Natural Science Foundation of China.

REFERENCES
[1] Burrows, M., & Wheeler, D. (1994). A block sorting lossless data compression algorithm. Technical Report, Digital Equipment Corporation, Palo Alto, CA.
[2] Ferragina, P., & Manzini, G. (2000). Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on (pp. 390-398). IEEE.
[3] Ferragina, P., & Manzini, G. (2001, January). An experimental study of an opportunistic index. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (pp. 269-278). Society for Industrial and Applied Mathematics.
[4] Manzini, G. (2001). An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM), 48(3), 407-430.
[5] Abel, J. (2010). Post BWT stages of the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 40(9), 751-777.
[6] Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
[7] Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.
[8] Liu, C. M., Wong, T., Wu, E., Luo, R., Yiu, S. M., Li, Y., ... Lam, T. W. (2012). SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6), 878-879.
[9] Fenwick, P. M. (1996). The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal, 39(9), 731-740.
[10] Kärkkäinen, J., & Sanders, P. (2003). Simple linear work suffix array construction. In Automata, Languages and Programming (pp. 943-955). Springer Berlin Heidelberg.
[11] Nong, G., Zhang, S., & Chan, W. H. (2009, March). Linear suffix array construction by almost pure induced-sorting. In Data Compression Conference, 2009. DCC'09. (pp. 193-202). IEEE.
[12] Abel, J. (2007). Incremental frequency count: a post BWT-stage for the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 37(3), 247-265.
[13] Deorowicz, S. (2002). Second step algorithms in the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 32(2), 99-111.
[14] Abel, J. (2003). Improvements to the Burrows-Wheeler compression algorithm: After BWT stages. Submitted for publication in ACM Transactions on Computer Systems.
[15] Ferragina, P., Manzini, G., Mäkinen, V., & Navarro, G. (2004, January). An alphabet-friendly FM-index. In String Processing and Information Retrieval (pp. 150-160). Springer Berlin Heidelberg.
[16] Ferragina, P., Manzini, G., Mäkinen, V., & Navarro, G. (2007). Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG), 3(2), 20.
[17] Grabowski, S., Mäkinen, V., & Navarro, G. (2004, January). First Huffman, then Burrows-Wheeler: A simple alphabet-independent FM-index. In String Processing and Information Retrieval (pp. 210-211). Springer Berlin Heidelberg.
[18] Grabowski, S., Navarro, G., Przywarski, R., Salinger, A., & Mäkinen, V. (2006). A simple alphabet-independent FM-index. International Journal of Foundations of Computer Science, 17(06), 1365-1384.
[19] The gzip homepage: www.gzip.org.
[20] Schindler, M. (1997, March). A fast block-sorting algorithm for lossless data compression. In Proceedings of the Conference on Data Compression (Vol. 469). IEEE Computer Society.
[21] Seward, J. (1997). The BZIP2 home page. www.bzip.org/index.html
[22] The bijective compressor homepage: www3.sympatico.ca/mt0000/bicom.
[23] http://corpus.canterbury.ac.nz/descriptions.
[24] Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(11), 1851-1858.
[25] Li, R., Li, Y., Kristiansen, K., & Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5), 713-714.