Transform
Sen Zhang
What is BWT?
The Burrows-Wheeler transform (BWT) is a
block-sorting, lossless, and reversible data
transform.
The BWT permutes a text into a new
sequence that is usually more compressible.
It was published in 1994 by Michael
Burrows and David Wheeler.
The transformed text can be better compressed
with fast locally adaptive algorithms, such as run-length encoding (or move-to-front coding) in
combination with Huffman coding (or arithmetic
coding).
Outline
What does BWT stand for?
Why BWT?
Steps of BWT
BWT is reversible and lossless
Steps to inverse
Variants of BWT
ST
Why BWT?
Run-length encoding
Replace a long run of a repeated character with a
count of the repetitions, squeezing the run into a number and a
character.
AAAAAAA -> *A7, where * is a flag
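The run-length idea above can be sketched in a few lines of Python. This is only an illustration: the function name `rle_encode`, the `min_run` threshold, and the assumption that the flag character never occurs in the input are all choices made here, not part of the slides.

```python
def rle_encode(text, flag="*", min_run=4):
    """Replace each run of at least `min_run` equal characters
    with flag + character + run length (assumes `flag` is not in `text`)."""
    out = []
    i = 0
    while i < len(text):
        # Find the end of the run that starts at position i.
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= min_run:
            out.append(f"{flag}{text[i]}{run}")
        else:
            # Short runs are cheaper to copy verbatim.
            out.append(text[i] * run)
        i = j
    return "".join(out)


print(rle_encode("AAAAAAA"))  # *A7
```

Short runs are left untouched because the flagged form needs at least three symbols.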
Preliminaries
Alphabet: {a, b, c, $}
We assume an order on the alphabet: a < b < c < $
How to transform?
Three steps
Form an N×N matrix by cyclically rotating (left)
the given text to form the rows of the matrix.
Sort the rows of the matrix in alphabetical
order.
Extract the last column of the matrix.
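A minimal sketch of the three steps in Python, using the naive rotate-and-sort approach. The names `ORDER`, `RANK`, and `bwt` are illustrative, and the alphabet order (lowercase letters followed by `$` as the largest symbol) is an assumption chosen to match the convention a < b < c < $ used in these slides. Rows are indexed from 0.

```python
# Assumed alphabet order: lowercase letters first, '$' as the largest symbol.
ORDER = "abcdefghijklmnopqrstuvwxyz$"
RANK = {c: r for r, c in enumerate(ORDER)}


def bwt(text):
    """Naive forward BWT: returns (last column L, primary index)."""
    n = len(text)
    # Step 1: all left cyclic rotations form the rows of the matrix.
    rotations = [text[i:] + text[:i] for i in range(n)]
    # Step 2: sort the rows under the assumed alphabet order.
    rotations.sort(key=lambda row: [RANK[c] for c in row])
    # Step 3: read off the last column.
    last_column = "".join(row[-1] for row in rotations)
    # The primary index is the row that holds the original text.
    primary_index = rotations.index(text)
    return last_column, primary_index


print(bwt("mississippi$"))  # ('ssmp$pissiii', 4)
```

This version stores all N rotations explicitly, so it is only suitable for small inputs.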
One example
how the BWT transforms mississippi.
T=mississippi$
OM =
m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i
Now, we sort all the rows of the matrix OM in ascending order with the
leftmost element of each row being the most significant position.
Consequently, we obtain the transformed matrix M as given below.
M =
i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
$ m i s s i s s i p p i
From the above transform, L is obtained by reading off the last
column of M, together with the primary index (the row of M that holds the original text, counting from 0).
Primary index = 4
L = s s m p $ p i s s i i i
Notice that L contains three consecutive i's and two separate runs of
two s's. Such runs make the text easier to compress
than the original string mississippi$.
Any problem?
It sounds cool, but
Is the transformation reversible?
The remarkable thing about the BWT is not only that it generates a
more easily compressible output, but also that it is reversible, i.e. it
allows the original text to be re-generated from the last column data
and the primary index.
mississippi$ -> BWT -> Inverse BWT -> mississippi$
The intuition
Assume you are standing in a line of 1000 people.
For some reason, the people are dispersed.
Now we need to restore the line.
What should you (the people in line) do?
What is your strategy?
Centralized?
A bookkeeper or ticket numbers; this requires centralized
extra bookkeeping space.
Distributed?
Every person can point out who stood immediately in
front of them. The bookkeeping space is distributed.
For IBWT
The order is distributed and hidden in the
output itself!
The trick is
Where to start? Who is the first one to ask?
The last one.
Two matters
Recover the current person (by index):
L[currentindex]. So what is currentindex?
The solution
The solution turns out to be very simple:
Use the LF mapping!
Read on to see what the LF mapping is.
Inverse BW-Transform
Assume we know the complete sorted
matrix.
Using L and F, construct an LF-mapping
LF[1…N] which maps each character in L
to the same character occurring in F.
Using the LF-mapping and L, reconstruct
T backwards by threading through the LF-mapping and reading the characters off of
L.
L and F
F = i i i i m p p s s s s $   (first column of M)
L = s s m p $ p i s s i i i   (last column of M)
LF mapping
i    F   L   LF[i]
0    i   s   7
1    i   s   8
2    i   m   4
3    i   p   5
4    m   $   11
5    p   p   6
6    p   i   0
7    s   s   9
8    s   s   10
9    s   i   1
10   s   i   2
11   $   i   3
Inverse BW-Transform:
Reconstruction of T
Start with T[] blank.
Let u = N.
Initialize s = the primary index (4 in our
case).
T[u] = L[s].
We know that L[s] is the last character of T
because row M[s], at the primary index, is T itself and ends with $.
For each i = u-1, …, 1 do:
s = LF[s] (threading backwards)
T[i] = L[s] (read off the next letter back)
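The threading loop above can be sketched in Python as follows. The function name `thread_back` is illustrative; the values of L, LF, and the primary index are the ones from the mississippi$ example in these slides, indexed from 0.

```python
def thread_back(L, LF, primary_index):
    """Rebuild T from the last column L and a precomputed LF-mapping."""
    n = len(L)
    T = [None] * n
    s = primary_index
    # The row at the primary index is T itself, so L[s] is T's last character.
    T[n - 1] = L[s]
    for i in range(n - 2, -1, -1):
        s = LF[s]      # threading backwards
        T[i] = L[s]    # read off the next letter back
    return "".join(T)


L = "ssmp$pissiii"
LF = [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
print(thread_back(L, LF, 4))  # mississippi$
```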
Inverse BW-Transform:
Reconstruction of T
First step: s = 4, T = [.._ _ _ _ _ $]
Second step: s = LF[4] = 11, T = [.._ _ _ _ i $]
Third step: s = LF[11] = 3, T = [.._ _ _ p i $]
Fourth step: s = LF[3] = 5, T = [.._ _ p p i $]
And so on
Starting from the primary index 4: which occurrence of a character in F does a character in L map to?
Which one? Why not this one? Why this one?
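Let T1 = S1 + P and T2 = S2 + P, where S1 and S2 have the same length.
If T1 < T2, then S1 < S2.
Now swap S and P:
T1' = P + S1
T2' = P + S2
Since S1 < S2, we know T1' < T2'.
So rows that end with the same character keep their relative order when that character is rotated to the front: the k-th occurrence of a character in L corresponds to its k-th occurrence in F.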
Inverse BW-Transform:
Construction of C
Store in C[c] the number of occurrences in
T of the characters {1, …, c-1}, i.e. of all characters smaller than c.
In our example:
T = mississippi$
i 4, m 1, p 2, s 4, $ 1
C = [0 4 5 7 11] (for i, m, p, s, $)
Notice that C[c] + m is the position of the
m-th occurrence of c in F (if any).
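A short Python sketch of how C could be built from character counts. The helper name `build_c` and passing the alphabet order as a string are choices made for this illustration.

```python
from collections import Counter


def build_c(text, order):
    """Map each character c in `text` to the number of characters
    in `text` that are smaller than c under `order`."""
    counts = Counter(text)
    c_table = {}
    total = 0
    for ch in order:
        if ch in counts:
            c_table[ch] = total   # everything counted so far is smaller than ch
            total += counts[ch]
    return c_table


# Alphabet order of the example: i < m < p < s < $
print(build_c("mississippi$", "imps$"))
# {'i': 0, 'm': 4, 'p': 5, 's': 7, '$': 11}
```

With 0-based rows, `c_table[c] + m - 1` is the row of the m-th occurrence of c in F.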
Inverse BW-Transform:
Constructing the LF-mapping
Why and how the LF-mapping?
Notice that for every row of M, L[i] directly precedes
F[i] in the text (thanks to the cyclic shifts).
Inverse BW-Transform:
Constructing the LF-mapping
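So, define LF[1…N] as
LF[i] = C[L[i]] + r_i,
where r_i is the number of occurrences of the character L[i] in the prefix L[1…i].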
Inverse BW-Transform
Construct C[1…|Σ|], which stores in C[i]
the cumulative number of occurrences in T
of the characters smaller than character i.
Construct an LF-mapping LF[1…N] which
maps each character in L to the same character
occurring in F, using only L and C.
Reconstruct T backwards by threading
through the LF-mapping and reading the
characters off of L.
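Putting the three steps together, the whole inverse transform can be sketched in Python using only L and the primary index. The names `ORDER` and `inverse_bwt` are illustrative, indices are 0-based, and the alphabet order (lowercase letters, then `$` as the largest symbol) is an assumption matching the example in these slides.

```python
# Assumed alphabet order: lowercase letters first, '$' as the largest symbol.
ORDER = "abcdefghijklmnopqrstuvwxyz$"


def inverse_bwt(L, primary_index, order=ORDER):
    rank = {c: r for r, c in enumerate(order)}

    # Step 1: C[c] = number of characters in L (equivalently in T) smaller than c.
    counts = {}
    for ch in L:
        counts[ch] = counts.get(ch, 0) + 1
    C = {}
    total = 0
    for ch in sorted(counts, key=lambda c: rank[c]):
        C[ch] = total
        total += counts[ch]

    # Step 2: LF[i] = C[L[i]] + (occurrences of L[i] strictly before position i).
    seen = {}
    LF = []
    for ch in L:
        LF.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Step 3: thread backwards through LF, reading characters off of L.
    n = len(L)
    T = [""] * n
    s = primary_index
    for i in range(n - 1, -1, -1):
        T[i] = L[s]
        s = LF[s]
    return "".join(T)


print(inverse_bwt("ssmp$pissiii", 4))  # mississippi$
```

Note that F is never built explicitly: C and the running counts in `seen` are enough to locate each character of L in F.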
Another example
1. You are given an input string ababc.
(a) Using Burrows-Wheeler, create all cyclic
shifts of the string.
(b) Put the shifts in sorted order.
(c) Output L and the primary index.
(d) Given L, determine F and LF (and show
how you do it).
(e) Decode the original string using the primary index, L,
and LF (and show how you do it).
Cons:
The need to sort all the contexts up to their full length N
is the main cause of the super-linear time complexity of BWT.
Super-linear time algorithms are not hardware friendly.
Block wise
It works on blocks of certain typical size.
An improved algorithm
- Schindler Transform (ST)
To address the above drawbacks, a slightly
different transform, called ST, was proposed.
It sorts the texts using only their first k
characters (where k can be far less than
N), but remains reversible.
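A hedged Python sketch of the idea described above: sort the rotations by their first k characters only. Breaking ties by original text position (via a stable sort) is an assumption of this sketch, and the published ST formulation may differ in details. The names `ORDER` and `sort_transform` are illustrative, with the same assumed alphabet order as in the mississippi$ example.

```python
# Assumed alphabet order: lowercase letters first, '$' as the largest symbol.
ORDER = "abcdefghijklmnopqrstuvwxyz$"


def sort_transform(text, k, order=ORDER):
    """Order-k sort transform: sort rotations by their first k characters only."""
    rank = {c: r for r, c in enumerate(order)}
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]
    # Python's sort is stable, so rotations whose first k characters are equal
    # stay in their original (text position) order.
    rotations.sort(key=lambda row: [rank[c] for c in row[:k]])
    return "".join(row[-1] for row in rotations)


t = "mississippi$"
print(sort_transform(t, 2))
print(sort_transform(t, len(t)) == "ssmp$pissiii")  # True: full-length keys give the BWT
```

When k reaches N the keys are whole rotations, so the output coincides with the last column L of the BWT.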
ST transform
i $ m
i s s
i p p
i s s
s i s
m i s
$ m i
s i p
p i $
s s i
p p i
s s i
i s s
i p p
i $ m
i s s
s i p
s i s
s s i
p i $
m i s
p p i
$ m i
s s i
i s s
i $ m
i s s
i p p
p i $
s i p
s s i
m i s
s i s
$ m i
s s i
p p i
Cons:
The currently known approach to inverse ST
is based on a hashing function.
The relationship between inverse ST and
inverse BWT is not well studied.
An application scheme
in data communication system
Conclusions
The BW transform makes the text (string)
more amenable to compression.
BWT by itself does not compress the data stream.
It just reorders the symbols inside the data
blocks.
Evaluating the performance depends on
the information model assumed; that is
another topic.
BW Transform Summary
Any naive implementation of the transform
has an O(n^3) time complexity.
The best solution has O(n), which is tricky to
implement.
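One route to faster algorithms is a suffix array. The sketch below only shows how a suffix array yields L and the primary index; it builds the array with Python's `sorted`, which is not linear time, so it illustrates the idea rather than the O(n) bound. It assumes the text ends with a unique `$`, in which case sorting suffixes gives the same order as sorting rotations. The names `ORDER` and `bwt_from_suffix_array` are illustrative.

```python
# Assumed alphabet order: lowercase letters first, '$' as the largest symbol.
ORDER = "abcdefghijklmnopqrstuvwxyz$"


def bwt_from_suffix_array(text, order=ORDER):
    """BWT via sorted suffixes; requires a unique '$' at the end of `text`."""
    rank = {c: r for r, c in enumerate(order)}
    n = len(text)
    # Suffix array: starting positions of the suffixes in sorted order.
    suffix_array = sorted(range(n), key=lambda i: [rank[c] for c in text[i:]])
    # The last character of the rotation starting at i is text[i - 1]
    # (for i == 0 this wraps around to the final '$').
    last_column = "".join(text[i - 1] for i in suffix_array)
    # The rotation starting at position 0 is the original text.
    primary_index = suffix_array.index(0)
    return last_column, primary_index


print(bwt_from_suffix_array("mississippi$"))  # ('ssmp$pissiii', 4)
```

Unlike the rotation matrix, this needs only N integers of extra storage apart from the sort keys.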
homework
The BWT algorithms
Forward Transform
Backward Transform
Requirements
Stage 1: use a fixed string or accept a
string from the keyboard to test the
correctness of your algorithms. (80 points)
Stage 2: then expand your solution to read
the string from a given file. (20 points)
Note that text2 should be a binary file:
the first datum is the index, followed by
the ASCII codes.
2. radix sort
3. suffix array
Array
Dynamic memory allocation
String manipulation
Sorting
File operation
Data compression algorithms