
Supervised Transform for Lossless Data Compression Improvement

Seddik Ilias
Department of Computer Science
University of Oran MB, Algeria

Meftah Boudjelal
Department of Mathematics and Computer Science
TVIM, LRSBG Laboratory
University of Mascara, Algeria

Benyettou Abdelkader
Department of Computer Science
SIMPA Laboratory
University of Oran MB, Algeria

Abstract- Data compression plays an important part in computer research, with applications such as tele-video-conferencing, remote sensing, document and medical imaging, and facsimile transmission, while our storage disks and bandwidth remain limited. The field comprises two families: lossy compression, with an acceptable loss of information, and lossless compression, without any loss. This paper proposes a method that can be used to improve both families of methods by decreasing the entropy of the treated data. In this work we consider solely the problem of lossless compression, applied to many kinds of data with the same supervised dictionary algorithm of reduced complexity.

Keywords- compression; entropy; dictionary; coding; compression ratio.

I. INTRODUCTION

Since the beginning of research on data compression, more than half a century ago, most methods apply a transform to the treated data to increase order or repetitions, and then apply a coding method, such as entropy coding [1], which performs the actual compression by exploiting these increases. For the coding, we will make our tests with the Huffman method [2], a statistical method based on the repetition of characters, and with the LZW method, based on coding concatenations of characters, to compress the resulting dictionary [3]. The transform we apply is supervised, since it performs a first pass to evaluate the correlation between each character and its follower, so that a character that often appears after another one will be coded by a small coefficient corresponding to its rank among the other successors of that character. Therefore, our method is not itself a compression method, but it increases the number of small coefficients, which improves the efficiency of statistical coding methods. We focus on lossless compression, which means that the compressed element can be reconstructed without any loss of information.

Note that the latest compression methods often focus on one type of data. Our method, although less efficient in most cases than algorithms specialized in a single data type, obtains good results on many types of data with this single algorithm. So our goal is not to compare the compression ratios obtained, but to demonstrate significant results on many kinds of data without modifying the algorithm and without stacking many layers of processing.
II. DESCRIPTION OF THE TECHNIQUE

A. Data Pre-Treatment
In this step we create the vector that will be coded by the transform. To do so, the input data is subdivided into blocks, followed by a reordering if necessary. The interest of this reordering is to optimize the regularity of each block by increasing the correlation between every element of the input data and its follower. The processing applied varies with the nature of the input data: for text data, the Burrows-Wheeler transform [4], which sorts blocks of text lexicographically and thereby increases repetitions, is applied, as sketched below. For image inputs, the matrix is divided and each block obtained is sorted according to the flow geometry inside the block [5], or simply by choosing among the horizontal, vertical and diagonal directions the one that minimizes the resulting entropy. The same operation is applied to videos on their image sequences. The goal of these pre-treatments is to minimize high frequencies (defined here as a large difference between a coefficient and its follower) for images, sounds and videos, and to reduce the disorder of characters in textual data.
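As an illustration of the text case, here is a minimal Burrows-Wheeler transform in Python (a naive rotation-sorting sketch, not the authors' implementation; practical coders use suffix arrays and work block by block):

def bwt(s, eos='\x03'):
    # Sort all rotations of s and keep the last column, which groups
    # identical characters together and so increases repetitions.
    s += eos  # unique end-of-string marker so the transform is invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

print(repr(bwt('banana')))  # 'annb\x03aa'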

Figure 1. Example of resolution of the flow geometry in a piece of an image
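The direction choice mentioned above can be made concrete with a short sketch. The following Python fragment is an illustration under our own assumptions (square blocks, and zero-order entropy of first differences as the selection criterion), not the paper's exact procedure:

import numpy as np

def entropy(v):
    # Zero-order (Shannon) entropy, in bits per symbol, of a 1-D array.
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_scan(block):
    # Scan the block horizontally, vertically and along its diagonals,
    # and keep the direction minimizing the entropy of the differences.
    h, w = block.shape
    scans = {
        'horizontal': block.ravel(),
        'vertical': block.T.ravel(),
        'diagonal': np.concatenate([block.diagonal(k)
                                    for k in range(-h + 1, w)]),
    }
    return min(scans.items(),
               key=lambda kv: entropy(np.diff(kv[1].astype(int))))

block = np.random.randint(0, 256, (8, 8))  # stand-in for an image block
name, vec = best_scan(block)
print(name)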

B. Processing method

The transform first performs the supervised part by running through the input vector up to the penultimate element, in order to build a dictionary for each element containing all the elements that can follow it; every vector representing a dictionary is then ordered by probability of occurrence of each coefficient after the mother element of this vector. After that, every element is replaced by its position in the dictionary whose mother is the previous coefficient in the original data vector.

We can describe the principle for a string vector as follows:

- Create a vector for each character existing in the original data.
- Fill each vector with every character that can appear after the corresponding character.
- Create a second line for each vector with the number of occurrences of every character after the vector's character.
- Order the matrices obtained by their second line (by occurrence).
- Number the columns of each matrix from zero to the length of the matrix minus one.
- Replace each character in the original data, beginning with the second character, by its position in the first line of the matrix of its predecessor.

The coder implementation, with M the original vector containing the values {1, 2, 3, ..., n}, C the coded vector, and Vj the dictionary vector of the character j, j = {1, 2, ..., n}, will be:

For (i = 1 : length(M) - 1)
    VM(i)(M(i+1), 2) = VM(i)(M(i+1), 2) + 1
End
Sort every Vj, j = {1, 2, ..., n}, by its second line
C(1) = M(1)
For (i = 2 : length(M))
    C(i) = position of M(i) in VM(i-1)
End

Example: a = [2 5 3 2 5 6 3 2 4 5 2 5 1 7 5 1 1 2 3 1 7 2 3 1], where V1 is the dictionary vector of the followers of the character 1, whose second line gives the number of occurrences of each of them after this character, and V2 to V7 are built in the same way. After the reordering by occurrence frequency, only the first line of each dictionary V1 to V7 is needed to build the code (Figure 2).

Figure 2. Example of a resulting dictionary

The transformed vector will be:

T(a) = [2 0 1 0 0 2 0 0 2 0 3 0 0 0 0 0 1 2 1 1 0 1 1 1];

where the first character is the only one that is not encoded, because it is the only one without a predecessor. We can observe that the new vector contains fewer distinct coefficients, with a more unbalanced distribution than the original vector, which will improve the results of most statistical coding systems. The only drawback is the weight of the dictionary, which will be significant, so it is better to use the transform on large data. Note that it is essential to save the first coefficient of the original vector to be able to start the reconstruction.
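For concreteness, the whole transform fits in a few lines of Python. The following sketch is our re-implementation of the pseudocode above; the one assumption is the tie-breaking rule (successors with equal counts keep their order of first appearance), which reproduces T(a) on this example:

def build_dictionaries(data):
    # One successor-count table per symbol; dict insertion order records
    # first appearance, which settles ties in the stable sort below.
    succ = {}
    for cur, nxt in zip(data, data[1:]):
        succ.setdefault(cur, {})
        succ[cur][nxt] = succ[cur].get(nxt, 0) + 1
    # Keep only the "first line": successors ranked by decreasing count.
    return {c: sorted(d, key=d.get, reverse=True) for c, d in succ.items()}

def transform(data):
    # Replace every symbol (except the first, kept as-is) by its rank
    # in the dictionary of its predecessor.
    dicts = build_dictionaries(data)
    coded = [data[0]]
    coded += [dicts[prev].index(cur) for prev, cur in zip(data, data[1:])]
    return coded, dicts

a = [2, 5, 3, 2, 5, 6, 3, 2, 4, 5, 2, 5, 1, 7, 5, 1, 1, 2, 3, 1, 7, 2, 3, 1]
coded, dicts = transform(a)
print(coded)  # [2, 0, 1, 0, 0, 2, 0, 0, 2, 0, 3, 0, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 1, 1]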

Figure 3. Coefficient distribution before (a) and after (b) application of the transform

Figure 3 compares the coefficient distribution of the previous example before and after applying the transform, showing a less uniform, more concentrated distribution in the transformed vector, which is what gives the reduction of the entropy.
C. Reconstruction method
For the inverse process, we read our vector in the same way, using the dictionary step by step. We first take the first coefficient in its natural form; then the second coefficient of the coded vector is interpreted as a position in the dictionary of the first, and replaced by the coefficient found at this position. Knowing this original coefficient, we apply the same process to the rest of the vector until the end, like falling dominoes, where the fall of the first one eventually brings down the last. We can describe the reconstruction algorithm as follows:

F = first coefficient of the coded vector
Add F to the reconstructed vector
C = second coded coefficient
Repeat until the end of the coded vector
    VF = dictionary of F
    P = VF(C)
    Add P to the reconstructed vector
    F = P
    C = next coefficient of the coded vector
End
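Continuing the Python sketch of Section II-B, the reconstruction is a direct transcription of this pseudocode (the dictionaries are assumed to be transmitted along with the coded vector):

def inverse_transform(coded, dicts):
    # Domino-style decoding: each recovered symbol selects the
    # dictionary used to interpret the next coded rank.
    data = [coded[0]]  # the first symbol was stored in natural form
    for rank in coded[1:]:
        data.append(dicts[data[-1]][rank])
    return data

assert inverse_transform(coded, dicts) == a  # lossless round trip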

In the previous example, the first coefficient that has been saved is 2 and the following coded coefficient is 0, so we take the first element from the dictionary of 2, which is 5; the reconstruction then goes on as in Table I:

TABLE I. EXAMPLE OF RECONSTRUCTION RESOLUTION

Last coefficient decoded   Vf(1)   Vf(2)   Vf(3)   Vf(4)
2                          5       3       4       -
5                          1       3       6       2
3                          2       1       -       -
2                          5       3       4       -
5                          1       3       6       2
6                          3       -       -       -
3                          2       1       -       -
2                          5       3       4       -
4                          5       -       -       -
5                          1       3       6       2
2                          5       3       4       -
5                          1       3       6       2
1                          7       1       2       -
7                          5       2       -       -
5                          1       3       6       2
1                          7       1       2       -
1                          7       1       2       -
2                          5       3       4       -
3                          2       1       -       -
1                          7       1       2       -
7                          5       2       -       -
2                          5       3       4       -
3                          2       1       -       -

III. CODING

After obtaining our coded vector, we can apply a sequential data compressor like Lempel-Ziv [6], but the best results are obtained with entropy coding like the Huffman algorithm [2], because the improvement provided by the transform lies in reducing the entropy [1] of the input, which maximizes the effect of entropy coding. Entropy is calculated by Shannon's formula:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (1)

for the vector X comprising n elements, where p(x_i) denotes the probability of occurrence of the element x_i.
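Applying formula (1) to the example of Section II (reusing the vectors a and coded from the earlier sketch) confirms the reduction; printed values are approximate:

from collections import Counter
from math import log2

def H(vec):
    # Shannon zero-order entropy, in bits per symbol.
    n = len(vec)
    return -sum(c / n * log2(c / n) for c in Counter(vec).values())

print(H(a))      # about 2.55 bits/symbol before the transform
print(H(coded))  # about 1.64 bits/symbol after the transform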



TABLE II. COMPARISON OF ENTROPY BETWEEN ORIGINAL DATA AND TRANSFORMED DATA

Input data   File         Original entropy   Transformed data entropy
Text         Paper 1      4.983              3.830
             Paper 2      4.601              3.592
             Pic          1.210              1.004
Image        Barbara      15.735             5.367
             Boat         11.385             5.128
             Flinstones   14.724             4.946
Video        Claire       6.418              3.209
             Salesman     6.801              4.605
             Foreman      7.173              4.589

Table II shows the reduction of the zero-order entropy for different types of data using the proposed algorithm, with an average reduction of approximately 25% for text and video files and more than 60% for images.
IV. RESULTS
We tested our algorithm, once implemented, on the previous files, without adding any pre-treatment except the direction choice for images, taking the dictionary volume into account in the bit rate and comparing execution speeds, in seconds, between data of different types and sizes. Compared to the transformed vector, the entropy reduction is smaller in the dictionaries, which must themselves be coded because of the large size they reach; but the same series are often repeated in them, offering a good field for the application of the Lempel-Ziv-Welch coding algorithm [3], as sketched below.
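To illustrate that last point, a compact LZW coder in the spirit of [3] can be applied to a serialized form of the dictionaries; the flat layout below is our assumption for the sketch, not the paper's storage format:

def lzw_encode(seq):
    # Classic LZW: grow a phrase table and emit one code per longest
    # phrase already in the table.
    table = {(s,): i for i, s in enumerate(sorted(set(seq)))}
    phrase, out = (), []
    for s in seq:
        if phrase + (s,) in table:
            phrase += (s,)
        else:
            out.append(table[phrase])
            table[phrase + (s,)] = len(table)
            phrase = (s,)
    out.append(table[phrase])
    return out

# Toy serialization of the dictionaries built in the Section II sketch:
# each mother symbol followed by its ranked successors.
serialized = sum(([c] + dicts[c] for c in sorted(dicts)), [])
print(len(serialized), len(lzw_encode(serialized)))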
The original rate for all tested data is 8 bits per character for text and 8 bits per pixel for images and videos. We can note that some methods achieve higher compression, like CALIC and some of its derived algorithms [7] for images, LOCO-R [8] for videos, or Deflate [9] and LZ77 for texts, as used in popular compression software like WinRAR or gzip, but they stack many layers of algorithms, each one aimed at a single kind of input data, unlike our transform, whose advantage is to give results on all types with one implementation, which could reduce the size of a software based on it. The best results are obtained in image compression, with ratios very close to those of the latest techniques and quicker execution (Table III).
V. REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October 1948.
[2] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the I.R.E., September 1952, pp. 1098-1102.
[3] T. A. Welch, "A technique for high-performance data compression," Computer, vol. 17, pp. 8-19, 1984.
[4] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital SRC Research Report 124, 10th May 1994.
[5] S. Mallat and G. Peyré, "Traitements géométriques des images par bandelettes," CMAP, École Polytechnique, 2006.
[6] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. IT-23, no. 3, May 1977, pp. 337-343.
[7] X. Wu and N. Memon, "Context-based, adaptive, lossless image coding," IEEE Transactions on Communications, vol. 45, no. 4, pp. 437-444, April 1997.
[8] L. Zheng-lin, Q. Ying, Y. Li-ying, B. Yu and L. Hui, "An improved lossless image compression algorithm LOCO-R," in Proceedings of the International Conference on Computer Design and Applications (ICCDA), 2010.
[9] D. Salomon, Data Compression: The Complete Reference, 4th ed., Springer, p. 241, ISBN 978-1-84628-602-5, 2007.

TABLE III. COMPRESSION PERFORMANCE IN TERMS OF RATIO (AVERAGE BITS PER ELEMENT) AND RUNNING TIMES (SEC) FOR DIFFERENT TYPES OF DATA

Input data   File         Original size   Compressed vectors size   Dictionary size   Compression ratio   Running time (sec)
Text         Paper 1      53161           24520                     1556              3.92                0.078
             Paper 2      82199           35518                     1340              3.58                0.171
             Pic          513216          56302                     3009              1.06                2.44
Image        Barbara      262144          163500                    14888             5.44                1.48
             Boat         262144          168140                    9338              5.41                1.24
             Flinstones   262144          162330                    12617             5.33                2.15
Video        Claire       18779904        753212                    41795             3.22                61.3
             Salesman     17071922        9827967                   42384             4.62                69.2
             Foreman      11406644        6418926                   30393             4.52                41.1
