
DATA COMPRESSION

Introduction
Compression is the art of representing information in a compact form rather than in its original, uncompressed form. In other words, data compression reduces the size of a file, which is very useful when processing, storing or transferring a large file that would otherwise require substantial resources. If the compression algorithm works properly, there should be a significant difference in size between the original file and the compressed file. When data compression is used in a data transmission application, speed is the primary goal. The speed of transmission depends on the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, the degree of compression is the primary concern.

Compression can be classified as either lossless or lossy. Lossless compression techniques reconstruct the original data from the compressed file without any loss: the information does not change during the compression and decompression processes. These algorithms are called reversible, since the original message is reconstructed exactly by the decompression process [1]. Lossless techniques are used to compress medical images, text and images preserved for legal reasons, computer executable files, and so on. Lossy compression techniques reconstruct the original message with some loss of information; the original message cannot be recovered exactly by the decoding process, which is why such compression is called irreversible. The decompression process yields an approximate reconstruction. This can be acceptable when data in ranges that human perception cannot distinguish may be neglected, so lossy techniques are used for multimedia images, video and audio to achieve more compact representations.

Various lossless data compression algorithms have been proposed and used. Some of the main techniques in use are Huffman coding, Run Length Encoding, arithmetic coding and dictionary-based encoding. This paper examines the performance of the Run Length Encoding, Huffman encoding, Shannon-Fano, adaptive Huffman encoding, arithmetic encoding and Lempel-Ziv-Welch algorithms. In particular, the performance of these algorithms in compressing text data is evaluated and compared.

Need for Data Compression


- At any given time, the ability of the Internet to transfer data is fixed; think of this capability as the Internet's collective bandwidth.
- Thus, if data can be compressed effectively wherever possible, significant improvements in data throughput can be achieved.
- In some instances, file sizes can be reduced by up to 60-70 %.
- At the same time, many systems cannot accommodate purely binary data, so encoding schemes are also employed, which reduces the effectiveness of data compression.

- Many files can be combined into one compressed document, making them easier to send, provided the combined file size is not too large.

Data Compression Methods


- Lossless Methods
- Lossy Methods

Lossless Methods:
Run Length Encoding Algorithm
Run Length Encoding, or simply RLE, is the simplest of the data compression algorithms. Consecutive sequences of identical symbols are identified as runs, and all other sequences are identified as non-runs. The algorithm exploits this kind of redundancy [2]: it checks whether there are repeating symbols, and compression is based on those redundancies and their lengths. For example, if the text ABABBBBC is taken as a source to compress, the first three letters are considered a non-run of length 3, and the next four letters are considered a run of length 4 since the symbol B is repeated. The major task of the algorithm is to identify the runs in the source file and to record the symbol and the length of each run. Run Length Encoding uses those runs to compress the original source file, while the non-runs are kept without taking part in the compression process. A minimal sketch is given below.
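The following is a minimal Python sketch of the run length idea. It simply records every symbol together with its run length (so a non-run becomes a sequence of runs of length 1) rather than storing non-runs verbatim as described above; the function names are illustrative, not part of any standard library.

```python
def rle_encode(text):
    """Encode a string as a list of (symbol, run_length) pairs."""
    if not text:
        return []
    encoded = []
    current, count = text[0], 1
    for symbol in text[1:]:
        if symbol == current:
            count += 1                       # extend the current run
        else:
            encoded.append((current, count)) # close the run and start a new one
            current, count = symbol, 1
    encoded.append((current, count))
    return encoded

def rle_decode(pairs):
    """Rebuild the original string from (symbol, run_length) pairs."""
    return "".join(symbol * count for symbol, count in pairs)

print(rle_encode("ABABBBBC"))  # [('A', 1), ('B', 1), ('A', 1), ('B', 4), ('C', 1)]
```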

Huffman Encoding
Huffman encoding algorithms use the probability distribution of the source alphabet to develop code words for the symbols. The frequency of every character in the source is counted in order to estimate the probability distribution, and code words are assigned according to these probabilities: shorter code words for higher probabilities and longer code words for lower probabilities. For this task a binary tree is built with the symbols as leaves, placed according to their probabilities, and the paths from the root to the leaves are taken as the code words. Two families of Huffman encoding have been proposed: static Huffman algorithms and adaptive Huffman algorithms. Static Huffman algorithms calculate the frequencies first and then generate a common tree for both the compression and decompression processes [2]; details of this tree must be saved or transferred with the compressed file. Adaptive Huffman algorithms develop the tree while calculating the frequencies, and an identical tree is maintained in both processes. In this approach the tree is initialised with a flag symbol and is updated as each new symbol is read.

The Lempel-Ziv-Welch Algorithm
Dictionary-based compression algorithms rely on a dictionary instead of a statistical model [5]. A dictionary is a set of possible words of a language, stored in a table-like structure, and the indexes of its entries are used to represent longer and repeating words. The Lempel-Ziv-Welch algorithm, or simply LZW, is one such algorithm. In this method a dictionary is used to store and index previously seen string patterns, and in the compression process those index values are emitted instead of the repeating string patterns. The dictionary is created dynamically during compression and does not need to be transferred with the encoded message: in the decompression process the same dictionary is re-created dynamically. LZW is therefore an adaptive compression algorithm.
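As an illustration of the dictionary idea, here is a short Python sketch of LZW compression and decompression over single-byte characters. The dictionary is seeded with all 256 single-character strings and grown on the fly by both sides; the function names are illustrative only.

```python
def lzw_compress(text):
    """LZW compression: emit dictionary indices for the longest previously seen prefixes."""
    dictionary = {chr(i): i for i in range(256)}   # seed with single characters
    next_code = 256
    w = ""
    output = []
    for c in text:
        wc = w + c
        if wc in dictionary:
            w = wc                                 # keep extending the current match
        else:
            output.append(dictionary[w])
            dictionary[wc] = next_code             # add the new pattern to the dictionary
            next_code += 1
            w = c
    if w:
        output.append(dictionary[w])
    return output

def lzw_decompress(codes):
    """Rebuild the text; the decoder grows the same dictionary on the fly."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    result = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                      # pattern emitted before it was stored
            entry = w + w[0]
        result.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(result)

codes = lzw_compress("TOBEORNOTTOBEORTOBEORNOT")
print(codes)
print(lzw_decompress(codes))  # TOBEORNOTTOBEORTOBEORNOT
```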

Lossy Methods: JPEG (Joint Photographic Experts Group)


JPEG (Joint Photographic Experts Group, 1992) is an algorithm designed to compress images with 24-bit colour depth or greyscale images. It is a lossy compression algorithm, and one of the characteristics that makes it very flexible is that the compression rate can be adjusted. If we compress a lot, more information is lost, but the resulting image is smaller; with a lower compression rate we obtain better quality, but the resulting image is bigger. This trade-off is controlled by making the coefficients of the quantization matrix larger when we want more compression, and smaller when we want less compression.
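A small Python sketch of this quality/quantization trade-off, assuming NumPy and the common IJG-style quality scaling of the standard JPEG luminance quantization table. It illustrates the quantization step only, not a full JPEG codec, and the helper names are our own.

```python
import numpy as np

# Standard JPEG luminance quantization table (from the JPEG specification).
BASE_Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def scaled_table(quality):
    """Scale the base table: lower quality -> larger divisors -> more information discarded."""
    quality = max(1, min(100, quality))
    scale = 5000 / quality if quality < 50 else 200 - 2 * quality
    table = np.floor((BASE_Q * scale + 50) / 100)
    return np.clip(table, 1, 255)

def quantize(dct_block, quality):
    """Quantize an 8x8 block of DCT coefficients (the lossy step of JPEG)."""
    return np.round(dct_block / scaled_table(quality))

rng = np.random.default_rng(0)
block = rng.normal(0.0, 50.0, (8, 8))          # stand-in for a block of DCT coefficients
print(np.count_nonzero(quantize(block, 90)))   # many coefficients survive at high quality
print(np.count_nonzero(quantize(block, 10)))   # most are rounded to zero at low quality
```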

Advantages:

- It reduces data storage requirements.
- Data security can be greatly enhanced by encrypting the decoding parameters and transmitting them separately from the compressed database files, to restrict access to proprietary information.
- The rate of input-output operations in a computing device can be greatly increased thanks to the shorter representation of the data.
- Data compression reduces the cost of backup and recovery in computer systems by storing the backups of large database files in compressed form.

Disadvantages:

- The extra overhead incurred by the encoding and decoding processes is one of the most serious drawbacks of data compression, and discourages its use in some areas.
- Data compression generally reduces the reliability of the records.
- Transmission of very sensitive compressed data through a noisy communication channel is risky, because burst errors introduced by the noisy channel can destroy the transmitted data.
- Corruption of the properties of compressed data results in reconstructed data that differs from the original data.
- In many hardware and system implementations, the extra complexity added by data compression can increase system cost and reduce system efficiency, especially in applications that require very low-power VLSI implementations.

Conclusion
Data compression is still a developing field. It has become so popular that research continues in this field to improve compression rate and speed.

The aim of the data compression field is to develop ever better compression techniques.

References:
[1] Pu, I.M., 2006, Fundamental Data Compression, Elsevier, Britain.

[2] Blelloch, E., 2002. Introduction to Data Compression, Computer Science Department, Carnegie Mellon University.

[3] Kesheng, W., J. Otoo and S. Arie, 2006. Optimizing bitmap indices with efficient compression, ACM Trans. Database Systems, 31: 1-38.

[4] Kaufman, K. and T. Shmuel, 2005. Semi-lossless text compression, Intl. J. Foundations of Computer Sci., 16: 1167-1178.

[5] Campos, A.S.E., Basic arithmetic coding by Arturo Campos, Website, Available from: http://www.arturocampos.com/ac_arithmetic.html (Accessed 02 February 2009).

[6] Vo Ngoc and M. Alistair, 2006. Improved word-aligned binary compression for text indexing, IEEE Trans. Knowledge & Data Engineering, 18: 857-861.

[7] Cormak, V. and S. Horspool, 1987. Data compression using dynamic Markov modeling, Comput. J., 30: 541-550.

[8] Capocelli, M., R. Giancarlo and J. Taneja, 1986. Bounds on the redundancy of Huffman codes, IEEE Trans. Inf. Theory, 32: 854-857.

[9] Gawthrop, J. and W. Liuping, 2005. Data compression for estimation of the physical parameters of stable and unstable linear systems, Automatica, 41: 1313-1321.
