The Burrows-Wheeler Transform

Sen Zhang
Transform
What is the definition for transform?

To change the nature, function, or condition of; convert.
To change markedly the appearance or form of
Lossless and reversible
By the way, to transform is simple; a kid can do it.
To put things back is the problem.
Think of a 3-year-old: he can transform or disassemble pretty much
anything, but cannot put it back together.
For the BWT, there exist efficient reverse algorithms that can retrieve the original
text from the transformed text.
What is BWT?
The Burrows-Wheeler transform (BWT) is a
block-sorting, lossless and reversible data
transform.
The BWT permutes a text into a new
sequence which is usually more compressible.
It was introduced in 1994 by Michael
Burrows and David Wheeler.
The transformed text can be better compressed
with fast locally-adaptive algorithms, such as run-length encoding (or move-to-front coding) in
combination with Huffman coding (or arithmetic
coding).
Outline
What does BWT stand for?
Why BWT?
Data Compression algorithms

RLE
Huffman coding
Combine them
What is left out?
Bring the reality closer to ideality
Steps of BWT
BWT is reversible and lossless
Steps to inverse
Variants of BWT
ST
When was BWT initially proposed?

Where are the inventors of the algorithms?
Your homework!
Why BWT?
Run-length encoding (RLE)
Replace a long series of a repeated character with a
count of the repetition, squeezing the run into a number and a
character.
AAAAAAA becomes *A7, where * is a flag
Ideally, the longer the run of the same character, the better.
In reality, however, the input data does not
necessarily meet the expectation of the RLE method.
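A minimal sketch of the flag-based scheme above, in Python. The function name rle_encode and the min_run threshold are assumptions made for illustration; the slide only fixes the output shape (flag, character, count). The sketch also assumes the flag character never occurs in the input.

```python
from itertools import groupby

def rle_encode(text: str, flag: str = "*", min_run: int = 4) -> str:
    """Replace each run of at least min_run equal characters by flag + char + count.

    min_run is a hypothetical tuning knob: shorter runs are copied unchanged,
    because the 3-symbol code would not save any space for them.
    """
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        if n >= min_run:
            out.append(f"{flag}{ch}{n}")   # e.g. "*A7"
        else:
            out.append(ch * n)             # short run, kept as-is
    return "".join(out)

print(rle_encode("AAAAAAA"))      # *A7
print(rle_encode("ABAAAAAAAB"))   # AB*A7B
print(rle_encode("mississippi"))  # mississippi (no run is long enough)
```

The last call shows the problem the slide points at: ordinary text rarely contains long runs, so RLE alone gains nothing on it.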
Bridge reality and ideality

BWT can transform a text into a sequence
that is easier to compress.
Closer to ideality (what is expected by RLE).
Compression on the transformed text improves the compression performance.
Preliminaries
Alphabet
{a,b,c,$}
We assume
an order on the alphabet
a<b<c<$
A character is available to be used as the sentinel, denoted as $.
How to transform?
Three steps
Form an N*N matrix by cyclically rotating (left)
the given text to form the rows of the matrix.
Sort the rows of the matrix lexicographically, using the order on the
alphabet.
Extract the last column of the matrix.
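The three steps can be sketched directly in Python. This is a naive illustration, not an efficient implementation; the function name bwt and the alphabet parameter (a string listing the characters from smallest to greatest, so that the sentinel $ can be placed last) are assumptions introduced here. The returned index counts rows from 0.

```python
def bwt(text: str, alphabet: str) -> tuple[str, int]:
    """Naive forward BWT: returns (last column L, 0-based primary index)."""
    rank = {ch: i for i, ch in enumerate(alphabet)}   # order on the alphabet
    n = len(text)
    # Step 1: all cyclic left rotations of the text form the rows.
    rotations = [text[i:] + text[:i] for i in range(n)]
    # Step 2: sort the rows lexicographically under the given order.
    rows = sorted(rotations, key=lambda row: [rank[ch] for ch in row])
    # Step 3: read off the last column and locate the row holding the text.
    last_column = "".join(row[-1] for row in rows)
    return last_column, rows.index(text)

L, primary = bwt("mississippi$", alphabet="imps$")    # i < m < p < s < $
print(L, primary)   # ssmp$pissiii 4
```

Building all N rotations costs O(N^2) space, so this only suits short strings.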
One example
how the BWT transforms mississippi.
T=mississippi$
Step 1: form the matrix

The N * N symmetric matrix, OM, is originally
constructed from the texts obtained by rotating
the text T.
The matrix OM has T as its first row, i.e. OM[1,
1:N]=T.
The remaining rows of OM are constructed by applying
successive cyclic left-shifts to T, i.e. for each of the
remaining rows, a new text T_i is obtained by
cyclically shifting the previous text T_{i-1} one column
to the left.
The resulting matrix OM is shown below.
A text T is a sequence of characters drawn from the alphabet.
Without loss of generality, a text T of length N is denoted as
x_1x_2x_3...x_{N-1}$, where every character x_i is in the alphabet
for i in [1, N-1]. The last character of the text is a sentinel, which
is the lexicographically greatest character in the alphabet and
occurs exactly once in the text.
Appending a sentinel to the original text is not a must, but it
simplifies the understanding and makes any text non-repeating (no two rotations are equal).
Example: abcababac$
First, treat the input string as a cyclic string and construct an N * N matrix from it.
For T = mississippi$, the matrix OM is:
m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i
Step 2: transform the matrix
Now, we sort all the rows of the matrix OM in ascending order with the
leftmost element of each row being the most significant position.
Consequently, we obtain the transformed matrix M as given below.
i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
$ m i s s i s s i p p i
Completely sorted from the leftmost column to the rightmost column.
Step 3: get the transformed text

The Burrows Wheeler transform is the last
column in the sorted list, together with the
row number where the original string ends
up.
From the above transform, L is easily obtained by taking the transpose of the last
column of M, together with the primary index.
Primary index = 4 (rows counted from 0)
L = s s m p $ p i s s i i i
Notice how there are three i's in a row and two separate runs of two
consecutive s's. This makes the text easier to compress
than the original string mississippi$.
What is the benefit?

The transformed text is more amenable to
subsequent compression algorithms.
Any problem?
It sounds cool, but
Is the transformation reversible?
The remarkable thing about the BWT is not only that it generates a
more easily compressible output, but also that it is reversible, i.e. it
allows the original text to be re-generated from the last column data
and the primary index.
mississippi$
-> BWT ->
index 4 and ssmp$pissiii
-> inverse BWT ->
mississippi$
How to achieve the goal?
The intuition
Assume you are in a line of 1000 people.
For some reason, the people are dispersed.
Now, we need to restore the line.
What should you (the people in line) do?
What is your strategy?
Centralized?
A bookkeeper or ticket numbers; that requires centralized
extra bookkeeping space.
Distributed?
Every person can point out who stood immediately in
front of him. The bookkeeping space is distributed.
For the inverse BWT (IBWT)
The order is distributed and hidden in the
output itself!
The trick is
Where to start? Who is the first one to ask?
The last one.
Finding the immediately preceding character
By finding the row that immediately precedes the current
row.
A loop is needed to recover all characters.
Each iteration involves two matters:
Recover the current person (by index).
In addition, point out the next person (by index) to
keep the loop running.
Two matters
Recover the current person (by index)
L[currentindex], so what is the currentindex?
In addition, point out the next person (by index)
currentindex = new index;
// How do we update currentindex? We need an updating method.
We want to know where the preceding character of a given
character is.
index  sorted row of M (first column F, last column L)
0      i p p i $ m i s s i s s
1      i s s i p p i $ m i s s
2      i s s i s s i p p i $ m
3      i $ m i s s i s s i p p
4      m i s s i s s i p p i $
5      p i $ m i s s i s s i p
6      p p i $ m i s s i s s i
7      s i p p i $ m i s s i s
8      s i s s i p p i $ m i s
9      s s i p p i $ m i s s i
10     s s i s s i p p i $ m i
11     $ m i s s i s s i p p i
Based on the already known primary index, 4, we know that L[4], i.e. $, is
the first character to retrieve (working backwards). Our question is: which
character is the next one to retrieve?
We know that the next character is going
to be i.
But L[6] = L[9] = L[10] = L[11] = i. Which
index should be chosen?
Any of 6, 9, 10, and 11 gives us the
right character i, but the correct strategy
also has to determine which index is the
next index to continue the restoration.
The solution
The solution turns out to be very simple:
Using LF mapping!
Read on to see what the LF mapping is.
Inverse BW-Transform
Assume we know the complete ordered
matrix.
Using L and F, construct an LF-mapping
LF[1..N] which maps each character in L
to the corresponding occurrence of that character in F.
Using the LF-mapping and L, then reconstruct
T backwards by threading through the LF-mapping and reading the characters off of
L.
L and F
F = i i i i m p p s s s s $   (first column of M)
L = s s m p $ p i s s i i i   (last column of M)
LF mapping
index  F  L  LF
0      i  s  7
1      i  s  8
2      i  m  4
3      i  p  5
4      m  $  11
5      p  p  6
6      p  i  0
7      s  s  9
8      s  s  10
9      s  i  1
10     s  i  2
11     $  i  3
Inverse BW-Transform:
Reconstruction of T
Start with T[] blank.
Let u = N.
Initialize s = the primary index (4 in our
case).
T[u] = L[s].
We know that L[s] is the last character of T
because the row M[the primary index] is T itself, which ends with $.
For each i = u-1, ..., 1 do:
s = LF[s] (threading backwards)
T[i] = L[s] (read off the next letter back)
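The loop above, sketched in Python under the assumption that the LF mapping is already available as a list. The function name thread_lf is hypothetical. Rows are numbered from 0, as in the mississippi$ example, and T is filled from its last position towards its first.

```python
def thread_lf(L: str, LF: list[int], primary: int) -> str:
    """Rebuild T backwards by following the LF mapping from the primary index."""
    n = len(L)
    T = [""] * n
    s = primary
    T[n - 1] = L[s]                  # last character of T (the sentinel)
    for i in range(n - 2, -1, -1):
        s = LF[s]                    # threading backwards
        T[i] = L[s]                  # read off the next letter back
    return "".join(T)

L = "ssmp$pissiii"
LF = [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
print(thread_lf(L, LF, primary=4))   # mississippi$
```

Each character is recovered with one table lookup, so the loop runs in O(N) time.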
Reconstruction of T
First step:  s = 4            T = [.._ _ _ _ _ $]
Second step: s = LF[4] = 11   T = [.._ _ _ _ i $]
Third step:  s = LF[11] = 3   T = [.._ _ _ p i $]
Fourth step: s = LF[3] = 5    T = [.._ _ p p i $]
And so on
Who can retrieve the data?

Please complete it!
Why does LF mapping work?
[Figure sequence: the sorted matrix M annotated with the LF mapping 7 8 4 5 11 6 0 9 10 1 2 3. Starting from row 4, successive slides point at the candidate rows whose last character is i and ask: "Which one?", "Why not this?", "Why this?"]
The mathematical explanation
Take two rows of M that end with the same character P:
T1 = S1 + P
T2 = S2 + P
If T1 < T2, then S1 < S2.
Now move P to the front of both rows:
T1' = P + S1
T2' = P + S2
Since S1 < S2, we know T1' < T2'.
The secret is hidden in the sorting strategy of
the forward component.
The sorting strategy preserves the relative order of equal characters
in both the last column and the first column.
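A small check of this property, as a hedged Python sketch. The helper names sorted_rotations and order_is_preserved are made up for illustration. For every character c, the rows ending in c, once c is moved to the front, should be exactly the rows starting with c, in the same order.

```python
def sorted_rotations(text: str, alphabet: str) -> list[str]:
    """All cyclic rotations of text, sorted under the given alphabet order."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations, key=lambda row: [rank[ch] for ch in row])

def order_is_preserved(text: str, alphabet: str) -> bool:
    M = sorted_rotations(text, alphabet)
    for c in set(text):
        ending = [row for row in M if row[-1] == c]      # c sits in column L
        starting = [row for row in M if row[0] == c]     # c sits in column F
        moved = [row[-1] + row[:-1] for row in ending]   # move c to the front
        if moved != starting:
            return False
    return True

print(order_is_preserved("mississippi$", "imps$"))   # True
```

This is why the k-th occurrence of c in L and the k-th occurrence of c in F refer to the same character of the text.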
We had assumed we have the matrix. But
actually we don't.
Observation: we only need two columns.
Amazingly, the information contained in
the Burrows-Wheeler transform (L) is
enough to reconstruct F, and hence the
mapping, and hence the original message!
First, we know all of the characters in the
original message, even if they're permuted
in the wrong order. This enables us to
reconstruct the first column.
Given only this information, you can easily
reconstruct the first column. The last
column tells you all the characters in the
text, so just sort these characters to get
the first column.
Construction of C
Store in C[c] the number of occurrences in
T of the characters {1, ..., c-1}, i.e. of all characters smaller than c.
In our example:
T = mississippi$
i 4, m 1, p 2, s 4, $ 1
C = [0 4 5 7 11]
Notice that C[c] + m is the position of the
m-th occurrence of c in F (if any), counting positions from 1.
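A sketch of this construction in Python, assuming only L and the ordered alphabet are known. The function name build_F_and_C and the dictionary representation of C are choices made here for readability, not part of the slides.

```python
from collections import Counter

def build_F_and_C(L: str, alphabet: str) -> tuple[str, dict[str, int]]:
    """Derive the first column F and the table C from the last column L."""
    counts = Counter(L)                # L is a permutation of T
    C = {}
    total = 0
    for ch in alphabet:                # alphabet listed from smallest to greatest
        C[ch] = total                  # occurrences of all smaller characters
        total += counts[ch]
    F = "".join(ch * counts[ch] for ch in alphabet)   # F is L in sorted order
    return F, C

F, C = build_F_and_C("ssmp$pissiii", "imps$")
print(F)   # iiiimppssss$
print(C)   # {'i': 0, 'm': 4, 'p': 5, 's': 7, '$': 11}
```

The values of C match the slide's C = [0 4 5 7 11] for the order i < m < p < s < $.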
Constructing the LF-mapping
Why and how the LF-mapping?
Notice that for every row of M, L[i] directly precedes
F[i] in the text (thanks to the cyclic shifts).
Let L[i] = c, let r_i be the number of occurrences
of c in the prefix L[1,i], and let M[j] be the r_i-th
row of M that starts with c. Then the character in
the first column F corresponding to L[i] is located
at F[j].
How to use this fact in the LF-mapping?
Constructing the LF-mapping
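So, define LF[1..N] as
LF[i] = C[L[i]] + r_i.
C[L[i]] gets us the proper offset to the
zeroth occurrence of L[i], and the addition
of r_i gets us the r_i-th row of M that starts
with c.
This formula numbers rows from 1. With rows numbered from 0, as in the
mississippi$ example, it becomes LF[i] = C[L[i]] + r_i - 1.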
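A sketch of the formula in Python, using 0-based row numbers so that the result can be compared with the mapping 7 8 4 5 11 6 0 9 10 1 2 3 shown earlier. The function name build_lf and the alphabet argument are assumptions for illustration.

```python
def build_lf(L: str, alphabet: str) -> list[int]:
    """LF mapping from L alone, with rows numbered from 0."""
    # C[c] = number of characters in L that are smaller than c.
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += L.count(ch)
    seen = {ch: 0 for ch in alphabet}    # occurrences of each character so far
    LF = []
    for ch in L:
        seen[ch] += 1                    # this is r_i for the current row
        LF.append(C[ch] + seen[ch] - 1)  # minus 1 because rows start at 0
    return LF

print(build_lf("ssmp$pissiii", "imps$"))
# [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
```

Counting with L.count keeps the sketch short; a single pass with a counter table would do the same work in O(N) for large alphabets.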
Inverse BW-Transform
Construct C[1..|Σ|], which stores in C[i]
the cumulative number of occurrences in T
of all characters smaller than character i.
Construct an LF-mapping LF[1..N] which
maps each character in L to the corresponding
occurrence in F, using only L and C.
Reconstruct T backwards by threading
through the LF-mapping and reading the
characters off of L.
Another example
1. You are given an input string ababc
(a) Using Burrows-Wheeler, create all cyclic
shifts of the string
(b) Put the shifts in sorted order
(c) Output L and the primary index.
(g) Given L, determine F and LF (and show
how you do it).
(h) Decode the original string using the index, L,
and LF (and show how you do it).
Pros and cons of BWT

Pros:
The transformed text enjoys a compression-favorable
property: identical characters tend to be grouped together, so
that the probability of finding a character close to another
instance of the same character is increased substantially.
More importantly, there exist efficient and smart algorithms to
restore the original string from the transformed result.
Cons:
The need to sort all the contexts up to their full length N
is the main cause of the super-linear time complexity of BWT.
Super-linear time algorithms are not hardware friendly.
Block wise
It works on blocks of certain typical size.
An improved algorithm
- Schindler Transform (ST)
To address the above drawbacks, a slightly
different transform, called ST, was proposed,
which sorts the texts by using only their first k
characters (where k can be a value far less than
N), but still remains reversible.
The key idea of ST is a two-level priority
sorting scheme, which can be easily achieved
using radix sort:
the lexicographical sorting criterion.
the positional sorting criterion.
ST transform
Let OM be the same matrix as defined for the BWT.

Under k-order ST, OM is transformed to M_k by sorting all its rows
according to their first k leftmost characters, i.e. k-order contexts, only.
If any two k-order contexts are equal, the tie is resolved by their
relative positions in the original OM.
i p p i $ m i s s i s s
i s s i s s i p p i $ m
i s s i p p i $ m i s s
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i s s i p p i $ m i s
s i p p i $ m i s s i s
s s i s s i p p i $ m i
s s i p p i $ m i s s i
$ m i s s i s s i p p i
Only partially sorted on the leftmost two columns
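The k-order sort can be sketched in Python with a stable sort on the first k characters only. Python's sorted is stable, so rows with equal k-order contexts keep their relative positions from OM, which is the positional criterion. The function name st_transform is an assumption, and the sketch covers the forward direction only.

```python
def st_transform(text: str, alphabet: str, k: int) -> str:
    """Forward k-order ST: last column after sorting rows on their first k characters."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Rows of OM, in their original order.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Stable sort on the k-order context; ties keep their order from OM.
    rows = sorted(rotations, key=lambda row: [rank[ch] for ch in row[:k]])
    return "".join(row[-1] for row in rows)

print(st_transform("mississippi$", "imps$", k=2))    # smsp$pissiii
print(st_transform("mississippi$", "imps$", k=12))   # ssmp$pissiii
```

With k = 2 the output differs from the BWT in a few positions. With k = N every row is compared in full, so the result coincides with the BWT of this text.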
Pros and Cons of ST

Pros:
Faster than BWT
Hardware implementation friendly
Cons:
The currently known approach to inverse ST
is based on a hashing function.
The relationship between inverse ST and
inverse BWT is not well studied.
An application scheme in a data communication system
Conclusions
The BW transform makes the text (string)
more amenable to compression.
BWT in itself does not modify the data stream.
It just reorders the symbols inside the data
blocks.
Evaluation of the performance depends on the
information model assumed; that is another topic.
The transform is lossless and reversible.
BW Transform Summary
A naive implementation of the transform (sorting the n rotations with
character-by-character comparisons) takes O(n^2 log n) time with a
comparison sort, and O(n^3) with a quadratic sorting algorithm.
The best solutions run in O(n) time, but are tricky to
implement.
We can reverse it to reconstruct the
original text in O(n) time, using O(n) space.
Once we obtain L, we can compress L in a
provably efficient manner.
Issues left out

What if all characters in the alphabet appear in
the text, i.e. no sentinel can be used?
Do you need to compare N positions?
What if the input data is not ASCII-encoded, but an
image, or a biological sequence (DNA, RNA or protein)?
Why not the first column, but the last column?
In BWT, the last column, L, of the sorted matrix contains
concentrations of identical characters, which is why L is easy to
compress. However, the first column, F, of the same matrix is
even easier to compress since it is sorted. Why select column L and not
column F?
homework
The BWT algorithms
Forward Transform
Backward Transform
Either in the Windows environment or the
Linux environment
Examples of running your
program on the command line
bwt f text1 text2
Transform text1 into text2
bwt i text2 text3
Inverse-transform text2 into text3
How to verify the correctness of
your algorithms
Because the BWT is reversible and
lossless, if your implementation is correct,
text3 should be the same as text1.
You can manually compare text1 and text3.
Alternatively, you can run the diff command in
Linux to report any differences between the
two files.
Requirements
Stage 1: use a fixed string or accept a
string from keyboard to test the
correctness of your algorithms. (80 points)
Stage 2: then expand your solution to read
the string from a given file. (20 points)
Notice that text2 should be a binary file:
the first datum is the index, followed by
the ASCII codes.
How to sort the matrix

1. The simplest way
Use whatever sorting algorithm you feel comfortable with.
Make each row a string, then do string comparison.
C strings: you need to know the functions for string comparison.
C++ strings: you need to know how to use the string class.
Use whichever way you feel the most comfortable with.
2. Radix sort
3. Suffix array
Knowledge to be practiced for
the homework
Array
Dynamic memory allocation
String manipulation
Sorting
File operation
Data compression algorithms