
2013 10th Web Information System and Application Conference

Compression and Indexing Based on BWT: A Survey

Dengfeng Zhang, Qing Liu+, Yuewan Wu, Yaping Li, Lin Xiao
School of Information, Renmin University of China
+ Corresponding author: e-mail: qliu@ruc.edu.cn

Abstract: Burrows-Wheeler Transform (BWT) is a data transform method, first introduced by Burrows and Wheeler in 1994 and used in a lossless data compression algorithm. In the original version of the BWT-based compression algorithm, Burrows and Wheeler applied the Move-To-Front (MTF) transform after the BWT and used entropy coding (Huffman coding) to encode the transform result in the last stage. Ferragina and Manzini combined the BWT-based compression algorithm with the suffix array data structure and proposed a new opportunistic data structure. They called it the FM-index, in that it is a Full-text index and occupies Minute space. This paper mainly describes the methods of compression and indexing based on BWT and the relationship between them. Finally, the paper compares some tools.

Keywords: BWT; compression algorithm; full-text indexes; FM-index.

978-1-4799-3219-1/13 $31.00
978-0-7695-5134-0/13 $26.00 © 2013 IEEE
DOI 10.1109/WISA.2013.20

I. INTRODUCTION

Burrows and Wheeler proposed the Burrows-Wheeler Transform [1] and came up with a new compression algorithm with a comparable speed and a better compression result than the Lempel-Ziv series. In [2][3], Ferragina and Manzini analyzed the properties of the BWT output and combined them with the suffix array structure to propose a new opportunistic data structure, the FM-index. The FM-index is an effective index with which pattern matching can be done directly on the compressed data.

Manzini made an explicit analysis of the BWT-based compression algorithm [4]. Jürgen Abel reviewed the post-BWT stages of BWT-based compression algorithms [5]. At the same time, many alignment tools based on the FM-index are widely used in DNA sequence alignment [6][7][8].

The structure of this paper is as follows. The next section presents the Burrows-Wheeler transform procedure and the reverse process. Section III provides an overview of basic compression algorithms and indexing techniques based on BWT, including the relationship between them. In Section IV the paper compares some tools based on BWT with other tools. Finally, a conclusion is given.

II. BURROWS-WHEELER TRANSFORM

A. Procedures of BWT

The Burrows-Wheeler Transform works on blocks of data and consists of three steps: 1) for a block of length N, cyclically shift the data to obtain N strings; 2) sort the results of step one to obtain a matrix M; 3) output the last column L and the index of the original string in M.

Figure 1. Burrows-Wheeler Transform for string S=abraca. The outputs of BWT are L=caraab and index=1.

Cyclically shifting a block of length N produces N strings, so the sorting operation is inefficient in both time and space. Burrows and Wheeler recommended using suffix array construction instead of sorting the N rotations, since a bijection exists between the suffix array and the matrix M (Figure 2). If the suffix array SA has been computed from the original string S, the output L can be produced by the formula L[i] = S[(SA[i]-1) % N], i ∈ [0, N-1]. For convenience, we append a $ to S, where $ is smaller than any character in the alphabet.

Figure 2. Bijection between the matrix M and suffix array SA

B. The reversible BWT

The result of BWT, L, is just a permutation of the original string S, but a string of length N may have up to N! permutations. In order to reconstruct S from L, it is critical to understand the following two properties of matrix M.

1. For the i-th row of M, the last character L[i] precedes the first character F[i] in the original string S (the rows are cyclic shifts of S, so L[i] and F[i] are adjacent in S). F is the first column of M and can be obtained by sorting L.
2. Last-to-First mapping (LF-mapping). Let L[i]=c and let r_i be the number of occurrences of c in the prefix L[0,i-1]. Let M[j] be the r_i-th row of M starting with c (counting from zero). Then the character F[j] in the first column corresponds to L[i] in the last column; set the LF-mapping array LF[i]=j, meaning that F[j] and L[i] are the same character in the original string.

Before describing the reversible BWT, it is necessary to
introduce two variables, C[c] and Occ[i]. C[c] stores the number of characters in L that are smaller than c in alphabetical order ($ is included). Occ[i] stores the number of occurrences of L[i] in the prefix L[0, i-1] of L. The LF-mapping array LF can be generated by the following formula: LF[i]=C[L[i]]+Occ[i].

Supposing C[c] and Occ[i] have been computed for all characters in L, the reverse algorithm is as follows:

Algorithm BWT_reverse(L)
1. i=0; //row M[0] is $S, so L[0] is the last character of S
2. for j=N-1 to 0 //N is the length of S without $
3.   S[j]=L[i];
4.   i=C[L[i]]+Occ[i]; //follow the LF-mapping: i=LF[i]

Figure 3. The reversible process of BWT

III. COMPRESSION AND INDEXING TECHNIQUES

A. Compression based on BWT

BWT itself does not compress the data; it only makes the original data more compressible. Consider the word "the", one of the most frequent words in English text. Whenever the word "the" occurs, some cyclic shift of the text has the form "he...t". After sorting, the rows of M starting with "he" are contiguous, which results in a disproportionately large number of "t" characters in one region of L. This property is very useful for many transform algorithms, such as the Move-To-Front transform. When the Move-To-Front transform is applied to the BWT output L, its output is dominated by low numbers, which can be compressed effectively by entropy coding (Huffman coding or arithmetic coding).

Current compression techniques based on BWT consist of four phases. The first is the BWT, the core part of the compression algorithm, which makes the string more compressible. The second phase is the global structure transformation (GST); here Burrows and Wheeler used a list update algorithm, the Move-To-Front transform. The third phase does not occur in the original compression algorithm by Burrows and Wheeler, but they proposed the idea of representing each run of zeroes by a code indicating the length of the run. In [9], Fenwick used run-length coding to encode runs of zeroes (RLE0) and achieved a good compression result. The last phase is entropy coding, which can be Huffman coding or arithmetic coding, to compress the result of the previous phase.

The above process can be altered. A faster version can be obtained by omitting the GST and RLE0 phases, and the performance will not degrade sharply. Many improvements have been made to all phases, so each phase has many alternatives. For example, many suffix array construction methods have linear time and space complexity [10][11], so the BWT output can be generated quickly. For the GST phase, methods such as Incremental Frequency Count [12] and Weighted Frequency Count [13] have been proposed to replace the Move-To-Front transform. There are various RLE schemes in the third phase, such as placing the RLE before the GST [14]. The last phase also has many improvements, mainly variations of Huffman coding and arithmetic coding.

B. Index based on BWT

In [2][3], Ferragina and Manzini analyzed the structure of matrix M, proposed an opportunistic data structure and implemented the opportunistic index, the FM-index. The FM-index combines the BWT-based compression algorithm with the suffix array data structure, and achieves efficient random access to the compressed data without decompressing all of it at query time. The FM-index works in two steps: 1) counting the occurrences of the pattern, 2) locating the occurrences.

1) Counting the occurrences

Since M is computed by sorting the N strings, the rows starting with the pattern suffix P[i, p-1] must be contiguous in M. As i decreases, the pattern becomes more complete and the number of rows prefixed by P[i, p-1] decreases too. When i equals zero, the number of rows is the number of occurrences of pattern P. The following algorithm counts the occurrences of pattern P.

Algorithm counting(P[0,p-1])
//Occ(c,q) is the number of occurrences of c in L[0,q-1]
//c+1 is the character following c in the alphabet; for the largest c, C[c+1] is the length of L
1. c=P[p-1], i=p-1;
2. sp=C[c], ep=C[c+1]-1;
3. while((sp≤ep)&&(i≥1)) do
4.   c=P[i-1];
5.   sp=C[c]+Occ(c,sp);
6.   ep=C[c]+Occ(c,ep+1)-1;
7.   i=i-1;
8. if(ep<sp) return 0;
9. else return ep-sp+1;

Figure 5. Counting the occurrences of pattern P=aca
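The transform, its inverse and the counting step can be sketched together in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of any tool discussed in this paper: the function names (bwt, c_and_occ, inverse_bwt, count) are hypothetical, the suffix array is built by naive sorting rather than a linear-time method [10][11], Occ(c,q) is recomputed by scanning L instead of being read from a precomputed table, and the text is assumed to contain no '$' and no character that sorts below '$'. Because a '$' is appended, the output for abraca differs from the L=caraab of Figure 1, which is computed without the sentinel.

```python
from collections import Counter


def bwt(text):
    """Return the BWT output L of text + '$' via the suffix array."""
    s = text + "$"  # assumption: '$' is absent from text and sorts below all of it
    n = len(s)
    # Naive suffix sorting for clarity only.
    sa = sorted(range(n), key=lambda k: s[k:])
    return "".join(s[(k - 1) % n] for k in sa)  # L[i] = S[(SA[i]-1) % N]


def c_and_occ(last):
    """C[c]: characters in L smaller than c; occ[i]: occurrences of L[i] in L[0:i]."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    seen, occ = Counter(), []
    for ch in last:
        occ.append(seen[ch])
        seen[ch] += 1
    return c_table, occ


def inverse_bwt(last):
    """Rebuild S (without '$') from L by following the LF-mapping from row 0."""
    c_table, occ = c_and_occ(last)
    n = len(last) - 1  # length of S without '$'
    out = [""] * n
    i = 0  # row 0 of M is '$' followed by S
    for j in range(n - 1, -1, -1):
        out[j] = last[i]
        i = c_table[last[i]] + occ[i]  # i = LF[i]
    return "".join(out)


def count(last, pattern):
    """Backward search: number of occurrences of pattern in the indexed text."""
    c_table, _ = c_and_occ(last)

    def rank(ch, q):
        # Occ(ch, q): occurrences of ch in L[0:q]; a real index precomputes this.
        return last[:q].count(ch)

    sp, ep = 0, len(last) - 1  # start with all rows of M
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        sp = c_table[ch] + rank(ch, sp)
        ep = c_table[ch] + rank(ch, ep + 1) - 1
        if sp > ep:
            return 0
    return ep - sp + 1


L = bwt("abraca")
print(L, inverse_bwt(L), count(L, "aca"), count(L, "a"))
```

Starting the search from the full row range and consuming the pattern from its last character is equivalent to the counting algorithm above, whose first two lines simply perform the first iteration by hand.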

Figure 4. The four phases of BWT-based compression algorithm (Section III.A)

2) Locating the occurrences

The previous step yields the block of rows prefixed by P, but how do we get the positions at which pattern P occurs in the original string S? First, some rows are marked with their corresponding positions in the original string; this can be done by saving part of the suffix array. If a row r is marked, we can read the position pos(r) immediately. If not, we execute a second step, called tracing here: we use the LF-mapping to find the row corresponding to the previous position in S. This procedure is iterated v times until a marked row r' is reached, and then pos(r)=pos(r')+v. Figure 6 depicts the procedure of locating P in string S.

Figure 6. Locating the occurrence of pattern P

The main improvements of the FM-index concern mitigating its dependency on the alphabet size. Ferragina, Manzini et al. pointed out that there is an exponential dependency on the alphabet size in the space bound and a linear dependency on the alphabet size in the time bounds. With the wavelet tree data structure, they designed an alphabet-friendly version of the FM-index, which scales well with the size of the alphabet [15][16]. Grabowski, Navarro et al. made a further improvement by eliminating the dependency on the alphabet size. They first compressed the original string with Huffman coding and then applied the BWT to the compressed data [17][18]. The resulting structure can be regarded as an FM-index built over a binary sequence, so it is independent of the original alphabet.

C. Relationship

The FM-index is a combination of the Burrows-Wheeler compression algorithm with the suffix array data structure and some auxiliary information, so the FM-index is closely related to compression techniques based on BWT. Figure 7 depicts the relationship between compression techniques and the FM-index based on BWT.

Figure 7. FM-index and compression techniques based on BWT

IV. ANALYSIS OF TOOLS BASED ON BWT

A. Compression tools comparison

We compare several widely used tools, including gzip (v1.2.4) [19], szip (v1.12a) [20], bzip2 (v1.0.6) [21] and bicom (v1.01) [22]. Bzip2 and szip are based on BWT, gzip is based on LZ77 and bicom is based on PPM. The test data is the Large Corpus [23] and the results are shown in Table I.

TABLE I. Compression rates (bits/byte) for the Large Corpus.

File           File size (bytes)   Bicom   Szip   Bzip2   gzip
Large.txt      4,047,392           1.69    1.63   1.67    2.35
E.Coli         4,638,690           2.12    2.02   2.16    2.31
World192.txt   2,473,400           1.44    1.60   1.58    2.34

From Table I, the compression tools based on BWT perform better than the tool based on LZ77 for all files, with a large margin on the two text files. Szip is even better than bicom, which is based on PPM, for the first two files, and bzip2 is better than bicom for the first file. In general, tools based on PPM achieve a better compression rate at a much higher time cost than tools based on BWT.

B. Alignment tools comparison

Tools based on the FM-index are mainly used in bioinformatics for aligning short DNA sequence reads to large genomes. Here we show a comparison of three alignment tools: Maq [24], SOAP [25] and Bowtie [6]. Maq and SOAP are traditional alignment tools using hash tables, while Bowtie makes full use of Burrows-Wheeler indexing and extends it with a novel quality-aware backtracking algorithm. Because of the limitations of our hardware and our knowledge of biology, we only analyze the comparison result from [6].

TABLE II. Varying read length using Bowtie, Maq and SOAP [6]

Read length   Program       CPU time    Peak memory (megabytes)   Speed-up   Reads aligned (%)
36 bp         Bowtie        6m15s       1,305                     -          62.2
              Maq           3h52m26s    804                       36.7x      65.0
              Bowtie -v 2   4m55s       1,138                     -          55.0
              SOAP          16h44m3s    13,619                    216x       55.1
50 bp         Bowtie        7m11s       1,310                     -          67.5
              Maq           2h39m56s    804                       21.8x      67.9
              Bowtie -v 2   5m32s       1,138                     -          56.2
              SOAP          48h42m4s    13,619                    691x       56.2
76 bp         Bowtie        18m58s      1,323                     -          44.5
              Maq 0.7.1     4h45m7s     1,155                     14.9x      44.9
              Bowtie -v 2   7m35s       1,138                     -          31.7
              SOAP          (not supported)

Table II shows the performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2M untrimmed reads from the 1000 Genomes project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For the SOAP comparison, Bowtie was invoked with -v 2 to mimic SOAP's default matching policy. For the Maq comparison, Bowtie ran with its default policy to mimic Maq's default matching policy. Maq v0.7.1 was used for the 76 bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76 bp experiment because it does not support reads longer than 60 bp.

From Table II, it is clear that Bowtie, based on the FM-index, has a great advantage over the traditional tools (Maq, SOAP) in time consumption. In the experiment, Maq seems to occupy less memory than Bowtie. However, the memory usage of BWT-based alignment tools is independent of the number of reads to be aligned, while Maq's is linear in it. As the number of reads to be aligned increases, the memory usage of Maq grows while that of Bowtie remains stable.

V. CONCLUSION

In this paper, we give a detailed description of the Burrows-Wheeler transform and its reverse transform. The paper also analyzes the general scheme of compression algorithms based on BWT and describes the working procedure of the FM-index. In addition, the paper uses a figure to depict the relationship between the compression algorithm and the FM-index. Finally, we compare the compression tools and alignment tools based on BWT with other applications.

ACKNOWLEDGMENT

This research is supported by grant number 60773217 from the National Natural Science Foundation of China.

REFERENCES

[1] Burrows, M., & Wheeler, D. (1994). A block sorting lossless data compression algorithm. Technical Report, Digital Equipment Corporation, Palo Alto, CA.
[2] Ferragina, P., & Manzini, G. (2000). Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on (pp. 390-398). IEEE.
[3] Ferragina, P., & Manzini, G. (2001, January). An experimental study of an opportunistic index. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (pp. 269-278). Society for Industrial and Applied Mathematics.
[4] Manzini, G. (2001). An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM), 48(3), 407-430.
[5] Abel, J. (2010). Post BWT stages of the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 40(9), 751-777.
[6] Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
[7] Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.
[8] Liu, C. M., Wong, T., Wu, E., Luo, R., Yiu, S. M., Li, Y., ... Lam, T. W. (2012). SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6), 878-879.
[9] Fenwick, P. M. (1996). The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal, 39(9), 731-740.
[10] Kärkkäinen, J., & Sanders, P. (2003). Simple linear work suffix array construction. In Automata, Languages and Programming (pp. 943-955). Springer Berlin Heidelberg.
[11] Nong, G., Zhang, S., & Chan, W. H. (2009, March). Linear suffix array construction by almost pure induced-sorting. In Data Compression Conference, 2009. DCC'09. (pp. 193-202). IEEE.
[12] Abel, J. (2007). Incremental frequency count - a post BWT-stage for the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 37(3), 247-265.
[13] Deorowicz, S. (2002). Second step algorithms in the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 32(2), 99-111.
[14] Abel, J. (2003). Improvements to the Burrows-Wheeler compression algorithm: After BWT stages. Submitted for publication in ACM Transactions on Computer Systems.
[15] Ferragina, P., Manzini, G., Mäkinen, V., & Navarro, G. (2004, January). An alphabet-friendly FM-index. In String Processing and Information Retrieval (pp. 150-160). Springer Berlin Heidelberg.
[16] Ferragina, P., Manzini, G., Mäkinen, V., & Navarro, G. (2007). Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG), 3(2), 20.
[17] Grabowski, S., Mäkinen, V., & Navarro, G. (2004, January). First Huffman, then Burrows-Wheeler: A simple alphabet-independent FM-index. In String Processing and Information Retrieval (pp. 210-211). Springer Berlin Heidelberg.
[18] Grabowski, S., Navarro, G., Przywarski, R., Salinger, A., & Mäkinen, V. (2006). A simple alphabet-independent FM-index. International Journal of Foundations of Computer Science, 17(06), 1365-1384.
[19] The gzip homepage: www.gzip.org.
[20] Schindler, M. (1997, March). A fast block-sorting algorithm for lossless data compression. In Proceedings of the Conference on Data Compression (Vol. 469). IEEE Computer Society.
[21] Seward, J. (1997). The BZIP2 home page. www.bzip.org/index.html
[22] The bijective compressor homepage: www3.sympatico.ca/mt0000/bicom.
[23] http://corpus.canterbury.ac.nz/descriptions.
[24] Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(11), 1851-1858.
[25] Li, R., Li, Y., Kristiansen, K., & Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5), 713-714.

