
Supervised Transform for Lossless Data Compression Improvement

Seddik Ilias
Department of Computer Science
University of Oran MB, Algeria

Meftah Boudjelal
Department of Mathematics and Computer Science
TVIM, LRSBG Laboratory
University of Mascara, Algeria

Benyettou Abdelkader
Department of Computer Science
SIMPA Laboratory
University of Oran MB, Algeria

Abstract- Data compression plays an important part in computer research, with applications such as tele-video-conferencing, remote sensing, document and medical imaging, and facsimile transmission, while our storage disks and bandwidth remain limited. The field comprises two families: lossy compression, with an acceptable loss of information, and lossless compression, without any loss. This paper proposes a method that can be used to improve both families of methods by decreasing the entropy of the treated data. In this work we consider solely the problem of lossless compression, applied to many kinds of data with the same supervised dictionary algorithm of reduced complexity.

Keywords- compression; entropy; dictionary; coding; compression ratio.

I. INTRODUCTION

Since the beginning of research on data compression, more than half a century ago, most methods apply a transform to the treated data to increase order or repetitions, and then apply a coding method, such as entropy coding [1], which performs the actual compression by exploiting these increases. For the coding, we will make our tests with the Huffman method [2], a statistical method based on the repetition of characters, and with the LZW method, based on coding concatenations of characters, to compress the resulting dictionary [3]. The transform we apply is supervised, since it performs a first pass to evaluate the correlation between each character and its follower, so that a character that often appears after another one will be coded by a small coefficient corresponding to its rank among the other successors of that character. Therefore, our method is not itself a compression method, but it increases the number of small coefficients, which improves the efficiency of statistical coding methods. We focus on lossless compression, which means that the compressed element can be reconstructed without any loss of information.

Note that the latest compression methods often focus on one type of data. Our method, although less efficient in most cases than algorithms specialized in a single data type, obtains good results on many types of data with this single algorithm. So our goal is not to compare the compression ratios obtained, but to demonstrate significant results on many kinds of data without modifying the algorithm and without stacking many layers of processing.
II. DESCRIPTION OF THE TECHNIQUE

A. Data Pre-Treatment
In this step we create the vector that will be coded by the transform. To do so, the input data is subdivided into blocks, followed by a reordering if necessary. The interest of this reordering is to optimize the regularity of each block by increasing the correlation between every element of the input data and its follower. The processing applied varies with the nature of the input data: for text data, the Burrows-Wheeler transform [4], which sorts blocks of text lexicographically and thereby increases repetitions, is applied, as sketched below. For image inputs, the matrix is divided and each block obtained is sorted according to the flow geometry inside the block [5], or simply by choosing among the horizontal, vertical and diagonal directions the one that minimizes the resulting entropy. The same operation is applied to videos on their image sequences. The goal of these pre-treatments is to minimize high frequencies (defined here as a large difference between a coefficient and its follower) for images, sounds and videos, and to reduce the disorder of characters in textual data.
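As an illustration of the text case, here is a minimal Burrows-Wheeler transform in Python (a naive rotation-sorting sketch, not the authors' implementation; practical coders use suffix arrays and work block by block):

def bwt(s, eos='\x03'):
    # Sort all rotations of s and keep the last column, which groups
    # identical characters together and so increases repetitions.
    s += eos  # unique end-of-string marker so the transform is invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

print(repr(bwt('banana')))  # 'annb\x03aa'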

Figure 1. Example of resolution of the flow geometry in a piece of an image
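The direction choice mentioned above can be made concrete with a short sketch. The following Python fragment is an illustration under our own assumptions (square blocks, and zero-order entropy of first differences as the selection criterion), not the paper's exact procedure:

import numpy as np

def entropy(v):
    # Zero-order (Shannon) entropy, in bits per symbol, of a 1-D array.
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_scan(block):
    # Scan the block horizontally, vertically and along its diagonals,
    # and keep the direction minimizing the entropy of the differences.
    h, w = block.shape
    scans = {
        'horizontal': block.ravel(),
        'vertical': block.T.ravel(),
        'diagonal': np.concatenate([block.diagonal(k)
                                    for k in range(-h + 1, w)]),
    }
    return min(scans.items(),
               key=lambda kv: entropy(np.diff(kv[1].astype(int))))

block = np.random.randint(0, 256, (8, 8))  # stand-in for an image block
name, vec = best_scan(block)
print(name)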

B. Processing method

The transform first performs the supervised part by running through the input vector up to the penultimate element, in order to build a dictionary for each element containing all the elements that can follow it; every vector representing a dictionary is then ordered by probability of occurrence of each coefficient after the mother element of this vector. After that, every element is replaced by its position in the dictionary whose mother is the previous coefficient in the original data vector.

We can describe the principle for a string vector as follows:

- Create a vector for each character existing in the original data.
- Fill each vector with every character that can appear after the corresponding character.
- Create a second line for each vector with the number of occurrences of every character after the vector's character.
- Order the matrices obtained by their second line (by occurrence).
- Number the columns of each matrix from zero to the length of the matrix minus one.
- Replace each character in the original data, beginning with the second character, by its position in the first line of the matrix of its predecessor.

The coder implementation, with M the original vector containing the values {1, 2, 3, ..., n}, C the coded vector, and Vj the dictionary vector of the character j, j = {1, 2, ..., n}, will be:

For (i = 1 : length(M) - 1)
    VM(i)(M(i+1), 2) = VM(i)(M(i+1), 2) + 1
End
Sort every Vj, j = {1, 2, ..., n}, by its second line
C(1) = M(1)
For (i = 2 : length(M))
    C(i) = position of M(i) in VM(i-1)
End

Example: a = [2 5 3 2 5 6 3 2 4 5 2 5 1 7 5 1 1 2 3 1 7 2 3 1], where V1 is the dictionary vector of the followers of the character 1, whose second line gives the number of occurrences of each of them after this character, and V2 to V7 are built in the same way. After the reordering by occurrence frequency, only the first line of each dictionary V1 to V7 is needed to build the code (Figure 2).

Figure 2. Example of a resulting dictionary

The transformed vector will be:

T(a) = [2 0 1 0 0 2 0 0 2 0 3 0 0 0 0 0 1 2 1 1 0 1 1 1];

where the first character is the only one that is not encoded, because it is the only one without a predecessor. We can observe that the new vector contains fewer distinct coefficients, with a more unbalanced distribution than the original vector, which will improve the results of most statistical coding systems. The only drawback is the weight of the dictionary, which will be significant, so it is better to use the transform on large data. Note that it is essential to save the first coefficient of the original vector to be able to start the reconstruction.
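For concreteness, the whole transform fits in a few lines of Python. The following sketch is our re-implementation of the pseudocode above; the one assumption is the tie-breaking rule (successors with equal counts keep their order of first appearance), which reproduces T(a) on this example:

def build_dictionaries(data):
    # One successor-count table per symbol; dict insertion order records
    # first appearance, which settles ties in the stable sort below.
    succ = {}
    for cur, nxt in zip(data, data[1:]):
        succ.setdefault(cur, {})
        succ[cur][nxt] = succ[cur].get(nxt, 0) + 1
    # Keep only the "first line": successors ranked by decreasing count.
    return {c: sorted(d, key=d.get, reverse=True) for c, d in succ.items()}

def transform(data):
    # Replace every symbol (except the first, kept as-is) by its rank
    # in the dictionary of its predecessor.
    dicts = build_dictionaries(data)
    coded = [data[0]]
    coded += [dicts[prev].index(cur) for prev, cur in zip(data, data[1:])]
    return coded, dicts

a = [2, 5, 3, 2, 5, 6, 3, 2, 4, 5, 2, 5, 1, 7, 5, 1, 1, 2, 3, 1, 7, 2, 3, 1]
coded, dicts = transform(a)
print(coded)  # [2, 0, 1, 0, 0, 2, 0, 0, 2, 0, 3, 0, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 1, 1]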

Figure 3. Coefficient distribution before (a) and after (b) application of the transform

Figure 3 compares the coefficient distribution of the previous example before and after applying the transform, showing a less uniform, more concentrated distribution in the transformed vector, which is what gives the reduction of the entropy.
C. Reconstruction method
For the inverse process, we read our vector in the same way, using the dictionary step by step. We first take the first coefficient in its natural form; then the second coefficient of the coded vector is interpreted as a position in the dictionary of the first, and replaced by the coefficient found at this position. Knowing this original coefficient, we apply the same process to the rest of the vector until the end, like falling dominoes, where the fall of the first one eventually brings down the last. We can describe the reconstruction algorithm as follows:

F = first coefficient of the coded vector
Add F to the reconstructed vector
C = second coded coefficient
Repeat until the end of the coded vector
    VF = dictionary of F
    P = VF(C)
    Add P to the reconstructed vector
    F = P
    C = next coefficient of the coded vector
End
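Continuing the Python sketch of Section II-B, the reconstruction is a direct transcription of this pseudocode (the dictionaries are assumed to be transmitted along with the coded vector):

def inverse_transform(coded, dicts):
    # Domino-style decoding: each recovered symbol selects the
    # dictionary used to interpret the next coded rank.
    data = [coded[0]]  # the first symbol was stored in natural form
    for rank in coded[1:]:
        data.append(dicts[data[-1]][rank])
    return data

assert inverse_transform(coded, dicts) == a  # lossless round trip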

In the previous example, the first coefficient that has been saved is 2 and the following coded coefficient is 0, so we take the first element from the dictionary of 2, which is 5; the reconstruction then goes on as in Table I:

TABLE I. EXAMPLE OF RECONSTRUCTION RESOLUTION

Last coefficient decoded   Vf(1)   Vf(2)   Vf(3)   Vf(4)
2                          5       3       4       -
5                          1       3       6       2
3                          2       1       -       -
2                          5       3       4       -
5                          1       3       6       2
6                          3       -       -       -
3                          2       1       -       -
2                          5       3       4       -
4                          5       -       -       -
5                          1       3       6       2
2                          5       3       4       -
5                          1       3       6       2
1                          7       1       2       -
7                          5       2       -       -
5                          1       3       6       2
1                          7       1       2       -
1                          7       1       2       -
2                          5       3       4       -
3                          2       1       -       -
1                          7       1       2       -
7                          5       2       -       -
2                          5       3       4       -
3                          2       1       -       -

III. CODING

After obtaining our coded vector, we can apply a sequential data compressor like Lempel-Ziv [6], but the best results are obtained with entropy coding like the Huffman algorithm [2], because the improvement provided by the transform lies in reducing the entropy [1] of the input, which maximizes the effect of entropy coding. Entropy is calculated by Shannon's formula:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (1)

for the vector X comprising n elements, where p(x_i) denotes the probability of occurrence of the element x_i.
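Applying formula (1) to the example of Section II (reusing the vectors a and coded from the earlier sketch) confirms the reduction; printed values are approximate:

from collections import Counter
from math import log2

def H(vec):
    # Shannon zero-order entropy, in bits per symbol.
    n = len(vec)
    return -sum(c / n * log2(c / n) for c in Counter(vec).values())

print(H(a))      # about 2.55 bits/symbol before the transform
print(H(coded))  # about 1.64 bits/symbol after the transform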



TABLE II. COMPARISON OF ENTROPY BETWEEN ORIGINAL DATA AND TRANSFORMED DATA

Input data   File         Original entropy   Transformed data entropy
Text         Paper 1      4.983              3.830
             Paper 2      4.601              3.592
             Pic          1.210              1.004
Image        Barbara      15.735             5.367
             Boat         11.385             5.128
             Flinstones   14.724             4.946
Video        Claire       6.418              3.209
             Salesman     6.801              4.605
             Foreman      7.173              4.589

Table II shows the reduction of the zero-order entropy for different types of data using the proposed algorithm, with an average reduction of approximately 25% for text and video files and more than 60% for images.
IV. RESULTS
We tested our algorithm, once implemented, on the previous files, without adding any pre-treatment except the direction choice for images, taking the dictionary volume into account in the bit rate and comparing execution speeds, in seconds, between data of different types and sizes. Compared to the transformed vector, the entropy reduction is smaller in the dictionaries, which must themselves be coded because of the large size they reach; but the same series are often repeated in them, offering a good field for the application of the Lempel-Ziv-Welch coding algorithm [3], as sketched below.
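To illustrate that last point, a compact LZW coder in the spirit of [3] can be applied to a serialized form of the dictionaries; the flat layout below is our assumption for the sketch, not the paper's storage format:

def lzw_encode(seq):
    # Classic LZW: grow a phrase table and emit one code per longest
    # phrase already in the table.
    table = {(s,): i for i, s in enumerate(sorted(set(seq)))}
    phrase, out = (), []
    for s in seq:
        if phrase + (s,) in table:
            phrase += (s,)
        else:
            out.append(table[phrase])
            table[phrase + (s,)] = len(table)
            phrase = (s,)
    out.append(table[phrase])
    return out

# Toy serialization of the dictionaries built in the Section II sketch:
# each mother symbol followed by its ranked successors.
serialized = sum(([c] + dicts[c] for c in sorted(dicts)), [])
print(len(serialized), len(lzw_encode(serialized)))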
The original rate for all tested data is 8 bits per character for text and 8 bits per pixel for images and videos. We can note that some methods achieve higher compression, like CALIC and some of its derived algorithms [7] for images, LOCO-R [8] for videos, or Deflate [9] and LZ77 for texts, as used in popular compression software like WinRAR or gzip, but they stack many layers of algorithms, each one aimed at a single kind of input data, unlike our transform, whose advantage is to give results on all types with one implementation, which could reduce the size of a software based on it. The best results are obtained in image compression, with ratios very close to those of the latest techniques and quicker execution (Table III).
V. REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October 1948.
[2] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the I.R.E., September 1952, pp. 1098-1102.
[3] T. A. Welch, "A technique for high-performance data compression," Computer, vol. 17, pp. 8-19, 1984.
[4] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital SRC Research Report 124, 10th May 1994.
[5] S. Mallat and G. Peyré, "Traitements géométriques des images par bandelettes," CMAP, École Polytechnique, 2006.
[6] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. IT-23, no. 3, May 1977, pp. 337-343.
[7] X. Wu and N. Memon, "Context-based, adaptive, lossless image coding," IEEE Transactions on Communications, vol. 45, no. 4, pp. 437-444, April 1997.
[8] L. Zheng-lin, Q. Ying, Y. Li-ying, B. Yu and L. Hui, "An improved lossless image compression algorithm LOCO-R," in Proceedings of the International Conference on Computer Design and Applications (ICCDA), 2010.
[9] D. Salomon, Data Compression: The Complete Reference, 4th ed., Springer, p. 241, ISBN 978-1-84628-602-5, 2007.

TABLE III. COMPRESSION PERFORMANCE IN TERMS OF RATIO (AVERAGE BITS PER ELEMENT) AND RUNNING TIMES (SEC) FOR DIFFERENT TYPES OF DATA

Input data   File         Original size   Compressed vectors size   Dictionary size   Compression ratio   Running time (sec)
Text         Paper 1      53161           24520                     1556              3.92                0.078
             Paper 2      82199           35518                     1340              3.58                0.171
             Pic          513216          56302                     3009              1.06                2.44
Image        Barbara      262144          163500                    14888             5.44                1.48
             Boat         262144          168140                    9338              5.41                1.24
             Flinstones   262144          162330                    12617             5.33                2.15
Video        Claire       18779904        753212                    41795             3.22                61.3
             Salesman     17071922        9827967                   42384             4.62                69.2
             Foreman      11406644        6418926                   30393             4.52                41.1
