
..

: . . 6-77-1
..
: ..


2013.
:
..3
1.
1.1
1.1.1. Shannon-Fano coding
1.1.2. Huffman coding
1.2 Arithmetic coding
2.
2.1. Prediction by Partial Matching (PPM)
2.2. Lempel-Ziv algorithms (LZ77 and LZ78)
2.3 Burrows-Wheeler transform (BWT)
...26
- ..28
..30



.
(lossless)
(lossy). ,
,
.
,
(
).
,
. , ,
,
.
.

,
.
- .

(, . .), .
,
.
,
.

. , -
,
,
- ,
,
.

1.


( )
.
,
.

( ).
:

.
,
.
, ,

.
.
,
.
.

,
, , - , ,
.
, ,
. ,
.
.

.

,
.

. ,
, .

, -
, , ,
.

, .
,
, .
,
.

, -
(
).

,
.

,
.
,

.

1.1.
1.1.1. Shannon-Fano coding
,
(. Robert
Fano). ,
.
: ,
. ,
.
.
(. Shannon-Fano coding)
.
(, ).
,
,
() ,
,
.
(
, 1948 ) , , (
).
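The code is built in the following steps:
1. Sort the symbols of the source alphabet in order of non-increasing probability: $(a_{i_1}, \ldots, a_{i_N})$. For every $j = 1, 2, \ldots, N - 1$ the probability of $a_{i_j}$ is not less than the probability of $a_{i_{j+1}}$.
2. Set the counter $t$ to 1: $t := 1$.
3. Split the ordered alphabet into $M$ groups so that the total probability of the symbols in each group is approximately $1/M$. This gives the groups $G_1, G_2, \ldots, G_M$:
$G_1 = (a_{i_1}, a_{i_2}, \ldots, a_{i_{k_1}})$,
$G_2 = (a_{i_{k_1+1}}, a_{i_{k_1+2}}, \ldots, a_{i_{k_2}})$, ...
4. Determine the $t$-th symbol of the codewords: every symbol of group $G_s$ receives the code symbol $b_s$, $s = 1, 2, \ldots, M$.
5. Increase the counter by 1: $t := t + 1$.
6. Examine every group. If a group $G_s$ consists of a single symbol $a_j$, the codeword for $a_j$ is complete. If a group contains 2 or more symbols, treat it as a new alphabet and apply steps 3 and 4 to it; this determines the $t$-th symbol of its codewords. Then go to step 5.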
,
.
:
.
.
( ).
.
, .

.
. ,
;
. ,
. 1
0, .

, , .
n (n + 1)
. ,

.
,
.
, , ,
.
,
, .

Example (symbol, frequency of occurrence, resulting code):

Symbol  Frequency  Code
A       50         11
B       39         101
C       18         100
D       49         00
E       35         011
F       24         010
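The grouping procedure can be illustrated for the binary case (M = 2). The C sketch below is not taken from the source: the names `sf_split`, `shannon_fano_lengths` and `total_bits` are hypothetical, and it assumes the frequencies are already sorted in non-increasing order (here A 50, D 49, B 39, E 35, F 24, C 18). It computes only codeword lengths. The bit patterns depend on which half is labelled 0, so they need not match the codes listed above, although the lengths do (2 bits for A and D, 3 bits for the others).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: assigns binary Shannon-Fano code lengths to the
 * symbols freq[lo..hi), which must be sorted in non-increasing order.
 * depth is the number of code bits already chosen for this group. */
static void sf_split(const int *freq, int *len, size_t lo, size_t hi, int depth)
{
    if (hi - lo <= 1) {                 /* single symbol: codeword complete */
        if (hi > lo)
            len[lo] = depth;
        return;
    }

    long total = 0;
    for (size_t i = lo; i < hi; i++)
        total += freq[i];

    /* Choose the cut that makes the two groups as equal in weight as possible. */
    long left = 0, best_diff = total;
    size_t cut = lo + 1;
    for (size_t i = lo; i + 1 < hi; i++) {
        left += freq[i];
        long diff = total - 2 * left;
        if (diff < 0)
            diff = -diff;
        if (diff < best_diff) {
            best_diff = diff;
            cut = i + 1;
        }
    }

    sf_split(freq, len, lo, cut, depth + 1);   /* first group: next bit 0  */
    sf_split(freq, len, cut, hi, depth + 1);   /* second group: next bit 1 */
}

/* Fills len[0..n) with the codeword length of each symbol. */
void shannon_fano_lengths(const int *sorted_freq, int *len, size_t n)
{
    assert(sorted_freq != NULL && len != NULL);
    sf_split(sorted_freq, len, 0, n, 0);
}

/* Total size of the encoded message in bits. */
long total_bits(const int *freq, const int *len, size_t n)
{
    long bits = 0;
    for (size_t i = 0; i < n; i++)
        bits += (long)freq[i] * len[i];
    return bits;
}
```

For the six frequencies of the example this gives 546 bits for 215 symbols, roughly 2.54 bits per symbol.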

. 1. -

,
.
, , ,

.
,
.

1.1.2. Huffman coding

. 1952

.
.
,
m2 .
:
.
- .
:

. . 1952 . :
,
, .
.
(..
), .

.
(-).
.
, ,
.
.
, .
,
.
, , 1,
0.
, , ,
. .
, :
Frequency: 15, 7, 6, 6, 5

,
,
, n0
. .
, ,
,
, (
).
,
.

Code: 0, 100, 101, 110, 111
,
. ,

, .
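The total length of a message made up of the symbols listed above is then 87 bits (2.2308 bits per symbol on average). With a fixed-length code the message would take 117 bits (exactly 3 bits per symbol). Note that the entropy of a source that independently emits symbols with these frequencies is ~2.1858 bits per symbol, so the redundancy of the code built for such a source, understood as the difference between the average number of bits per symbol and the entropy, is less than 0.05 bits per symbol.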
. ,
, . ,
,
, .
,
:
( -),
. -,
,
2. -, ,
1 (, ),
.
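The figure of 87 bits quoted for the frequencies 15, 7, 6, 6, 5 can be checked with a small program. The following C sketch is an illustration, not the source's implementation: the name `huffman_lengths` and the limit `MAX_SYMBOLS` are assumptions, and it uses a simple quadratic search for the two lightest nodes instead of a priority queue.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 256   /* hypothetical limit for this sketch */

/* Computes Huffman codeword lengths for n symbol frequencies and returns
 * the total length of the encoded message in bits. */
long huffman_lengths(const long *freq, int *len, size_t n)
{
    long weight[2 * MAX_SYMBOLS];
    int parent[2 * MAX_SYMBOLS];
    int alive[2 * MAX_SYMBOLS];
    size_t nodes = n;

    assert(n >= 1 && n <= MAX_SYMBOLS);
    for (size_t i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    /* Repeatedly merge the two lightest live nodes into a new parent node. */
    for (size_t merges = 0; merges + 1 < n; merges++) {
        int a = -1, b = -1;          /* a: lightest, b: second lightest */
        for (size_t i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = (int)i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = (int)i;
            }
        }
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = (int)nodes;
        alive[a] = alive[b] = 0;
        nodes++;
    }

    /* The codeword length of a symbol is its depth in the tree. */
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        int depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        len[i] = depth;
        total += freq[i] * depth;
    }
    return total;
}
```

With the five frequencies above the most frequent symbol gets a 1-bit codeword and the other four get 3-bit codewords, which agrees with the codes 0, 100, 101, 110, 111. Note that a single-symbol alphabet yields length 0 in this sketch; a practical coder would handle that case separately.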

. 2.

1.2. Arithmetic coding

. , ,
, .

[0, 1) [a,
b), -

.

.
[0, 1), ,
. ,
.
, [a, b), [c, d). [0, 1) [a, b). [c, d) [a, b) [a + (b - a)·c, a + (b - a)·d). . , [a, b),
, , [a, b )
[0, 1) [c, d). [c, d)
- .
. 3.

. 3.

.
.
,
. ()
.
,
.

. , ,
,
. ,
, .
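Consider, as an example, the string aaab. The symbol a occurs with probability 3/4 and the symbol b with probability 1/4. Accordingly, a is assigned the interval [0, 3/4) and b the interval [3/4, 1).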
.

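The string aa corresponds to the interval [0 + (3/4 - 0)·0, 0 + (3/4 - 0)·3/4) = [0, 9/16), the string aaa to [0 + (9/16 - 0)·0, 0 + (9/16 - 0)·3/4) = [0, 27/64), and finally the string aaab to [0 + (27/64 - 0)·3/4, 0 + (27/64 - 0)·1) = [81/256, 27/64).
This interval contains the number 96/256 = 3/8. In binary notation this is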
0.011. , b

( , , ).
,
,
, .
, ,
81/256. , ,
81/256 0.01010001. ,
. , , ,
,
,
, .
, . . ,

,
.

[a, b ) p (
). ,
m- , m
- .
[0, 1) , m N
( N- ),
() [a, b). , , p,

.

.
[1/4, 1), b - [0, 1/4). ,

[0, 1).
,
;
,
.

. ,
.

.
. [0, 1)
m- ,
.
. (

)
.

, , ,
. ,
.
IBM

.

.

, . , ,
-
.
.
,
.
.
, ,
, 1 ,
IBM.

. ,
.
,

.
, 010101


0101 s
.
:

.
1.
.
2. ,
.
3. ,
. ,
.
4. (3) .
5. .
.
left = 0
right = 1
while !eof
    read(symb)
    // segment[symb] is the subinterval of [0; 1) assigned to the symbol symb
    newRight = left + (right - left) * segment[symb].right
    newLeft = left + (right - left) * segment[symb].left
    left = newLeft
    right = newRight
ans = (left + right) / 2
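The encoding loop above can be sketched in C as follows. This is only an illustration under stated assumptions: the type `segment_t` and the function `arith_encode` are hypothetical names, the model is a table indexed by byte value, and the bounds are kept in `double`, which is accurate only for short messages. Practical coders use integer arithmetic with renormalization instead.

```c
#include <assert.h>
#include <stddef.h>

/* Subinterval of [0, 1) assigned to one symbol (hypothetical model type). */
typedef struct {
    double left;
    double right;
} segment_t;

/* Narrows [0, 1) once per input symbol and returns the midpoint of the
 * final interval. segment must have one entry per possible byte value.
 * The final bounds are stored in *out_left and *out_right if non-NULL. */
double arith_encode(const unsigned char *msg, size_t n,
                    const segment_t *segment,
                    double *out_left, double *out_right)
{
    double left = 0.0, right = 1.0;

    assert(segment != NULL);
    for (size_t i = 0; i < n; i++) {
        const segment_t s = segment[msg[i]];
        double width = right - left;
        double new_right = left + width * s.right;
        double new_left = left + width * s.left;
        left = new_left;
        right = new_right;
    }
    if (out_left != NULL)
        *out_left = left;
    if (out_right != NULL)
        *out_right = right;
    return (left + right) / 2.0;
}
```

For the string aaab with a on [0, 3/4) and b on [3/4, 1), the final interval is [81/256, 27/64), in agreement with the worked example earlier in this section.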

:
.
1.
, ,
, ,
. , ,
.
2. .
3. . (1-2) , ( ).

do
    for i = 1 to n
        if code >= segment[i].left && code < segment[i].right
            write(segment[i].character)
            code = (code - segment[i].left) / (segment[i].right - segment[i].left)
            break
while (segment[i].character != eof)
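A matching decoder can be sketched in the same way. The names `segment_t` and `arith_decode` are again hypothetical, and one assumption differs from the pseudocode: instead of a dedicated eof symbol, the sketch stops after a known number of symbols.

```c
#include <assert.h>
#include <stddef.h>

/* One symbol of the model together with its subinterval of [0, 1). */
typedef struct {
    unsigned char character;
    double left;
    double right;
} segment_t;

/* Decodes up to count symbols from code and returns how many were written.
 * Assumption: the message length is known instead of using an eof symbol. */
size_t arith_decode(double code, const segment_t *segment, size_t nseg,
                    size_t count, unsigned char *out)
{
    size_t produced = 0;

    assert(segment != NULL && out != NULL);
    while (produced < count) {
        size_t i;
        /* Find the segment that contains the current code value. */
        for (i = 0; i < nseg; i++) {
            if (code >= segment[i].left && code < segment[i].right)
                break;
        }
        if (i == nseg)
            break;              /* code lies outside the model: stop */
        out[produced++] = segment[i].character;
        /* Rescale the code back to [0, 1) relative to that segment. */
        code = (code - segment[i].left) / (segment[i].right - segment[i].left);
    }
    return produced;
}
```

Decoding the value 3/8 with a on [0, 3/4) and b on [3/4, 1) and a length of 4 recovers the string aaab.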

, .

, .

, .

:

:


:
::



, ( ),
.

2.


.
, .

.

: ,
.
.

.

.
, ,
, - ,
, .
,
- .
.
2.1. Prediction by Partial Matching (PPM)
,
,
. , ,

. ,

,

.
.
,
,
, - .
, ,

.
, , ,
.
.

,
.

.
,
: PPM,
DMC, CTW ,
.

80- PPM,

. DMC,
PPM, .
PPM
.
. CTW

.
CTW, ,
,
.
-
.

PPM.
PPM (. Prediction by Partial Matching
) ,
. PPM
,
,
. PPM ,

, , , .

. -
,
.

- 'esc'. - ,
.
.
, ,
.
S PPM- M,
M.
S , ,
S. ,
S.
, S
. -1 ,
. ,
. ,

.
m
,
(M...m+1),
, S .
,
.
(exclusions).
, ,
. n PPM,
PPM(n).
PPM
, .
. PPM
, ,
. , , PPM-D,
, , ,
. ( , PPM-D

).
PPM
1980- . 1990- ,
PPM .
PPM
.
:
PPM ,
.
PPM[3]:
boa, PPMz (Ian Sutton)


HA, PPM order 4, (Harry Hirvola)
lgha, ha ( )
ppmpacktc, PPMd, PPMz, PPMVC HA
hsc ( )
arhangel, ha
( )
PPMd PPM order-2..16,
( )
ppmz Z (Charles Bloom)
rk PPMz (Malcolm Taylor)
rkuc PPM 16-12-8-5-3-2-1-0 (Malcolm Taylor)
RAR ( 3 ) PPMd, PPMII
7-Zip PPMd
WinZip ( 10 ) PPMd

2.2. Lempel-Ziv algorithms (LZ77 and LZ78)

.
.


. ,
,
.
?
. , ,
.


. ,
.
,
,
.
.
,
,
. ,
, - ,
. . .
,
, .
,
,
.
, ,
().
,
.


,

,
. , , :
,
?
.

,
.
. ,
abac, ,
: a, ab, c, bac. ,

: abac = ab + a + c abac = a + bac.
, (greedy parsing),

. ,
(optimal parsing),
. -
.
, , ,
.
.
,
, . ,

.
(lazy matching).
.
, : a, ab, bac.
abac
: aba = ab +
a (
, c).
,
.
, ,
,
.

, , ,
()
.
,
.
.
,
,
. (
).

,
, , ,
.
, LZ77,
.
LZ77 LZ78 ,
(.) (.) 1977 1978 .
LZ*,
LZW, LZSS, LZMA .
,
, RLE . LZ77
,
, LZ78.
LZ77

.
.
, ,
. ,
LZ77 ,
,

. LZ77
:

(match length)
(offset) (distance)


, :

, .
, -
, ,
.
: 1 7
, . 7
, 1
?
: 7 () 1
. ,
,
-.

LZ77 . ,
. ,

< + >
. .
:

;
;
.

+1.
kabababababz
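To make the example string concrete, here is a minimal C sketch of a greedy LZ77 coder. It is an illustration, not the source's code: the token layout (offset, length, next symbol) is the classic one and is assumed here, and the names `lz77_token`, `lz77_encode` and `lz77_decode` are hypothetical. The match is allowed to run past the current position, so a length greater than the offset simply repeats the most recent symbols.

```c
#include <assert.h>
#include <stddef.h>

/* One LZ77 token: copy `length` symbols from `offset` positions back,
 * then output the literal `next`. */
typedef struct {
    size_t offset;
    size_t length;
    unsigned char next;
} lz77_token;

/* Greedy encoder. out must have room for n tokens. Returns the token count. */
size_t lz77_encode(const unsigned char *in, size_t n, size_t window,
                   lz77_token *out)
{
    size_t pos = 0, count = 0;

    while (pos < n) {
        size_t best_len = 0, best_off = 0;
        size_t start = pos > window ? pos - window : 0;

        for (size_t cand = start; cand < pos; cand++) {
            size_t len = 0;
            /* The match may overlap the text being encoded; one symbol
             * is always kept back to serve as the literal. */
            while (pos + len + 1 < n && in[cand + len] == in[pos + len])
                len++;
            if (len > best_len) {
                best_len = len;
                best_off = pos - cand;
            }
        }
        out[count].offset = best_off;
        out[count].length = best_len;
        out[count].next = in[pos + best_len];
        count++;
        pos += best_len + 1;
    }
    return count;
}

/* Decoder. Copies symbol by symbol so overlapping matches work.
 * Returns the number of symbols written to out. */
size_t lz77_decode(const lz77_token *tok, size_t count, unsigned char *out)
{
    size_t pos = 0;

    for (size_t t = 0; t < count; t++) {
        assert(tok[t].length == 0 ||
               (tok[t].offset >= 1 && tok[t].offset <= pos));
        for (size_t k = 0; k < tok[t].length; k++, pos++)
            out[pos] = out[pos - tok[t].offset];
        out[pos++] = tok[t].next;
    }
    return pos;
}
```

For kabababababz this sketch emits four tokens: three literals (k, a, b) and one token with offset 2, length 8 and next symbol z.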

LZ78
LZ77, , LZ78
, (LZ78
, ).
,
.
, ,
,
, , .
. ,
.
2.3 Burrows-Wheeler transform (BWT)
,
.
,
.
,
. , ,
,
.
(Burrows-Wheeler transform, BWT,
- ,
) ,
. BWT bzip2.
.
BWT , BWT
.

,

. , BWT RLE
, ,
LZ.
, ( )

, .

(. . move to front, MTF)

.
BWT MTF/RLE ,
bzip2, LZH
.
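The move-to-front (MTF) step mentioned above can be sketched in a few lines of C. The function names `mtf_encode` and `mtf_decode` are hypothetical, and the sketch assumes the usual formulation over a 256-entry byte table that starts in identity order. Runs of equal symbols, which BWT tends to produce, turn into runs of zeros.

```c
#include <assert.h>
#include <stddef.h>

/* Replaces each byte by its current position in the table, then moves
 * that byte to the front of the table. */
void mtf_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned char table[256];

    assert(in != NULL && out != NULL);
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;

    for (size_t k = 0; k < n; k++) {
        unsigned char c = in[k];
        int idx = 0;
        while (table[idx] != c)       /* every byte value is in the table */
            idx++;
        out[k] = (unsigned char)idx;
        for (; idx > 0; idx--)        /* shift the prefix right by one */
            table[idx] = table[idx - 1];
        table[0] = c;
    }
}

/* Inverse transform: looks up the byte at the given position and applies
 * the same table update as the encoder. */
void mtf_decode(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned char table[256];

    assert(in != NULL && out != NULL);
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;

    for (size_t k = 0; k < n; k++) {
        int idx = in[k];
        unsigned char c = table[idx];
        out[k] = c;
        for (; idx > 0; idx--)
            table[idx] = table[idx - 1];
        table[0] = c;
    }
}
```

Only the first occurrence of a symbol in a run costs a non-zero value; the repeats are encoded as 0, which a following RLE or entropy coder handles well.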
,
BWT, . ,

, ,
.
, bucket
sort+qsort
ABABABAB bucket sort 2 A B,
, qsort
.

(radix sort), ,
.
BWT
,
( ) ,
.
LZH (gzip )
, .
BWT ( )
, PPM.

( )
:
.VANYA..VANYA.TANYA.MANYAVANYA
BWT :
ANYA.V VANYA
ANYA.T
ANYA.M
, ANYA,
.
V, T .


MTF ,
T M
:
BWT,
. .
, ,
,
. ,
, ,
, .
, :
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
* , ,
:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT

, . ,
.BANANA. BNN.AA.A
( ):

Input: .BANANA.

All rotations    Sorted rows
.BANANA.         ANANA..B
..BANANA         ANA..BAN
A..BANAN         A..BANAN
NA..BANA         BANANA..
ANA..BAN         NANA..BA
NANA..BA         NA..BANA
ANANA..B         .BANANA.
BANANA..         ..BANANA

Output: BNN.AA.A

,
BWT . ,
(EOL) .

function BWT (string s)
    create a list of all possible rotations of s
    let each rotation be one row in a large, square table
    sort the rows of the table alphabetically, treating each row as a string
    return the last (rightmost) column of the table

function inverseBWT (string s)
    create an empty table with no rows or columns
    repeat length(s) times
        insert s as a new column down the left side of the table
        sort the rows of the table alphabetically
    return the row that ends with the 'EOL' character
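The inverseBWT pseudocode can be transcribed almost literally into C. The sketch below is a deliberately naive illustration: the name `inverse_bwt_naive`, the limit `BWT_MAX` and the comparator's file-scope variable are assumptions of this sketch, rows are compared as raw bytes, and the end-of-line marker is passed in explicitly. It re-sorts the whole table once per column, so it is only suitable for short strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BWT_MAX 64            /* hypothetical size limit for this sketch */

static size_t g_cols;         /* number of filled columns, read by row_cmp */

/* Compares two table rows on the columns filled so far. */
static int row_cmp(const void *a, const void *b)
{
    return memcmp(a, b, g_cols);
}

/* Rebuilds the original string from its transform s. eol is the unique
 * end marker; out needs strlen(s) + 1 bytes.
 * Returns 0 on success and -1 on failure. */
int inverse_bwt_naive(const char *s, char eol, char *out)
{
    char table[BWT_MAX][BWT_MAX];
    size_t n = strlen(s);

    assert(out != NULL);
    if (n == 0 || n > BWT_MAX)
        return -1;
    memset(table, 0, sizeof table);

    for (size_t col = 0; col < n; col++) {
        /* Insert s as a new column down the left side of the table. */
        for (size_t r = 0; r < n; r++) {
            memmove(&table[r][1], &table[r][0], col);
            table[r][0] = s[r];
        }
        /* Sort the rows of the table. */
        g_cols = col + 1;
        qsort(table, n, sizeof table[0], row_cmp);
    }

    /* Return the row that ends with the end marker. */
    for (size_t r = 0; r < n; r++) {
        if (table[r][n - 1] == eol) {
            memcpy(out, table[r], n);
            out[n] = '\0';
            return 0;
        }
    }
    return -1;
}
```

As a usage example under ASCII ordering, with `^` and `|` chosen here as distinct start and end markers (an assumption of this sketch, not the notation of the text), the input BNN^AA|A is restored to ^BANANA|.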
BWT ,

, , ,
.
.
, .
.
, ,
.
, .
.
, . ,
.
:

Input: BNN.AA.A

Add 1     Sort 1    Add 2     Sort 2
B         A         BA        AN
N         A         NA        AN
N         A         NA        A.
.         B         .B        BA
A         N         AN        NA
A         N         AN        NA
.         .         ..        .B
A         .         A.        ..

Add 3     Sort 3    Add 4     Sort 4
BAN       ANA       BANA      ANAN
NAN       ANA       NANA      ANA.
NA.       A..       NA..      A..B
.BA       BAN       .BAN      BANA
ANA       NAN       ANAN      NANA
ANA       NA.       ANA.      NA..
..B       .BA       ..BA      .BAN
A..       ..B       A..B      ..BA

Add 5     Sort 5    Add 6     Sort 6
BANAN     ANANA     BANANA    ANANA.
NANA.     ANA..     NANA..    ANA..B
NA..B     A..BA     NA..BA    A..BAN
.BANA     BANAN     .BANAN    BANANA
ANANA     NANA.     ANANA.    NANA..
ANA..     NA..B     ANA..B    NA..BA
..BAN     .BANA     ..BANA    .BANAN
A..BA     ..BAN     A..BAN    ..BANA

Add 7     Sort 7    Add 8     Sort 8
BANANA.   ANANA..   BANANA..  ANANA..B
NANA..B   ANA..BA   NANA..BA  ANA..BAN
NA..BAN   A..BANA   NA..BANA  A..BANAN
.BANANA   BANANA.   .BANANA.  BANANA..
ANANA..   NANA..B   ANANA..B  NANA..BA
ANA..BA   NA..BAN   ANA..BAN  NA..BANA
..BANAN   .BANANA   ..BANANA  .BANANA.
A..BANA   ..BANAN   A..BANAN  ..BANANA

Result: .BANANA.

. BWT
,
.
BWT
. ,
, .
, ,
. , ,
.
, .
, 'EOL',
. BWT
.
, BWT .
:
.



.
,
,
.
, ,

.
.
, , ,
,
,
.
,

.
,
.

. , ,
, .
. ,
, ,
.
,
-
. , ,
. , , -
,
.
, -
.

,
.


-

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char byte;

/* qsort() takes no context argument, so the comparator reads the buffer
   being sorted from these file-scope variables. */
static const byte *rotlexcmp_buf = NULL;
static int rotlexcmp_bufsize = 0;

/* Compares the two rotations of the buffer that start at *l and *r. */
static int rotlexcmp(const void *l, const void *r)
{
    int li = *(const int *)l, ri = *(const int *)r, ac = rotlexcmp_bufsize;
    while (rotlexcmp_buf[li] == rotlexcmp_buf[ri])
    {
        if (++li == rotlexcmp_bufsize)
            li = 0;
        if (++ri == rotlexcmp_bufsize)
            ri = 0;
        if (!--ac)
            return 0; /* the rotations are identical */
    }
    if (rotlexcmp_buf[li] > rotlexcmp_buf[ri])
        return 1;
    else
        return -1;
}

void bwt_encode(const byte *buf_in, byte *buf_out, int size, int *primary_index)
{
    assert(size > 0);
    int indices[size];
    int i;

    for (i = 0; i < size; i++)
        indices[i] = i;
    rotlexcmp_buf = buf_in;
    rotlexcmp_bufsize = size;
    qsort(indices, size, sizeof(int), rotlexcmp);

    /* The output is the last column of the sorted rotation table. */
    for (i = 0; i < size; i++)
        buf_out[i] = buf_in[(indices[i] + size - 1) % size];

    /* The primary index is the sorted row whose last character is
       buf_in[0]; "1 % size" also covers the case size == 1. */
    for (i = 0; i < size; i++)
    {
        if (indices[i] == 1 % size) {
            *primary_index = i;
            return;
        }
    }
    assert(0);
}

void bwt_decode(const byte *buf_in, byte *buf_out, int size, int primary_index)
{
    assert(size > 0);
    byte F[size];
    int buckets[256];
    int i, j, k;
    int indices[size];

    /* Count the symbols and build F, the sorted first column. */
    for (i = 0; i < 256; i++)
        buckets[i] = 0;
    for (i = 0; i < size; i++)
        buckets[buf_in[i]]++;
    for (i = 0, k = 0; i < 256; i++)
        for (j = 0; j < buckets[i]; j++)
            F[k++] = (byte)i;
    assert(k == size);

    /* buckets[i] becomes the index of the first occurrence of i in F. */
    for (i = 0, j = 0; i < 256; i++)
    {
        while (j < size && i > F[j]) /* check the bound before reading F[j] */
            j++;
        buckets[i] = j; // it will get fake values if there is no i in F, but
                        // that won't bring us any problems
    }

    /* Link every position of the first column to the matching position
       of the last column, then follow the links from the primary index. */
    for (i = 0; i < size; i++)
        indices[buckets[buf_in[i]]++] = i;
    for (i = 0, j = primary_index; i < size; i++)
    {
        buf_out[i] = buf_in[j];
        j = indices[j];
    }
}

int main(void)
{
    byte buf1[] = "Wikipedia";
    int size = (int)strlen((const char *)buf1);
    byte buf2[size];
    byte buf3[size];
    int primary_index;

    bwt_encode(buf1, buf2, size, &primary_index);
    bwt_decode(buf2, buf3, size, primary_index);
    assert(!memcmp(buf1, buf3, size));
    printf("Result is the same as input, that is: <%.*s>\n", size, (const char *)buf3);
    return 0;
}

:
1).. . , 2001.
2). . , . . . . .: , 1973.
3) A. Moffat, Implementing the PPM data compression scheme, IEEE Transactions on
Communications, Vol. 38 (11), pp. 1917-1921, November 1990.
4) , LZ77 "
" // ., 11.04.2007
5) M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm.
Technical Report 124, Digital Equipment Corporation, 1994.
6) . , . , . , .
: = Introduction to Algorithms. 2- . .:
, 2006. 1296 . ISBN 0-07-013151-1
7). . , . .: , 2004.
368 . 3000 . ISBN 5-94836-027-X
8) . . 9. : //
: = Introduction to The Design and
Analysis of Algorithms. .: , 2006. . 392-398. ISBN 0-201-74395-7
9) http://algolist.manual.ru/compress/standard/huffman.php
10) http://algolist.manual.ru/compress/standard/shannon_fano.php
11) http://compression.ru/download/articles/huff/tiger_shannon-fano.html
12) http://habrahabr.ru/post/130531/
13) , , , .
. , 2011 . 1296 . ISBN
978-5-8459-0857-5, 5-8459-0857-4, 0-07-013151-1
14) http://www.compression.ru/arctest/descript/ppm-faq.htm
15) ., ., ., .. . : , 2003.
16) M. Crochemore, T.Lecroq. Text data compression algorithms. In: Atallah M.J. Ed.,
Algorithms and theory of computation handbook. Ch. 12. CRC Press, 1999.
17) ... . : ,
2001.
18) CCITT group 4. International telecommunication union, 1988.
19) M. Maniscalco, S. Puglisi, Faster lightweight suffix array construction, Proceedings of the
17th Australasian Workshop on Combinatorial Algorithms (AWOCA'06), 2006. pp.16-29.
20) K.M. Likhomanov, A.M. Shur. Two combinatorial criteria for BWT images. Computer
Science Theory and Applications. Proceedings of the 6th Symposium on Computer Science in
Russia. 2011. pp.385-396. [Lecture Notes in Computer Science Vol. 6651].