
Compression

Fundamentals of Data Compression


Text
1 page with 80 characters/line and 64 lines/page and 1 byte/char results in 80 * 64 * 1 * 8 = 40 kbit/page

Still image
24 bits/pixel, 512 x 512 pixel/image results in 512 x 512 x 24 ≈ 6.3 Mbit/image

Audio
CD quality, sampling rate 44.1 kHz, 16 bits per sample results in 44,100 x 16 ≈ 706 kbit/s per channel; stereo: 1.41 Mbit/s

Video
Full-size frame 1024 x 768 pixel/frame, 24 bits/pixel, 30 frames/s results in 1024 x 768 x 24 x 30 = 566 Mbit/s.

Storage and transmission of multimedia streams require compression!!

Compression
Data compression is the representation of an information source (e.g. a data file, a speech signal, an image, or a video signal) as accurately as possible using the fewest possible bits. Compressed data can only be understood if the decoding method is known by the receiver.

Compression
Why data compression?
Data storage and transmission cost money. This cost increases with the amount of data available. This cost can be reduced by processing data so that it takes less memory and less transmission time.

Disadvantage of data compression: compressed data must be decompressed to be viewed, so extra processing is required. Compression is possible because information usually contains redundancies, i.e. information that is often repeated (recurring letters, numbers, or pixels); compression programs remove this redundancy.

Principles of Data Compression


Lossless Compression
Data is compressed and can be reconstituted (uncompressed) without loss of detail or information. These are also referred to as bit-preserving or reversible compression systems.

Lossy Compression
There is a difference between the original object and the reconstructed object. Physiological and psychological properties of the ear and eye are taken into account. The aim is to obtain the best possible fidelity for a given bit-rate, or to minimize the bit-rate needed to achieve a given fidelity measure. Video and audio compression techniques are most suited to this form of compression. Lossy techniques usually achieve higher compression rates than lossless ones, but the latter are more accurate.

Classification of Coding Techniques


Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques:
Run-length encoding, Statistical encoding

Lossy compression uses source encoding techniques:


Differential encoding, Transform encoding, Vector Quantisation

Lossless techniques are classified into static, adaptive (or dynamic), and hybrid.
In a static method the mapping from the set of messages to the set of code-words is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message being encoded.
Static coding requires two passes: one pass to compute probabilities (or frequencies) and determine the mapping, and a second pass to encode.

In an adaptive method the mapping from the set of messages to the set of code-words changes over time.
The codeword assigned to a given message may vary from one transmission to another. Note: the transmitter and the receiver must maintain the same set of codewords. All adaptive methods are one-pass methods; only one scan of the message is required.

An algorithm may also be a hybrid, neither completely static nor completely dynamic.

Classification of Coding Techniques

Lossless Compression Algorithms


(Repetitive Sequence Suppression)

These methods are fairly straightforward to understand and implement. Their simplicity is their downfall in terms of attaining the best compression ratios. However, the methods have their applications, as mentioned below:
Simple Repetition Suppression
Run-length Encoding

Simple Repetition Suppression


If a series of n successive identical tokens appears in a sequence, we can replace it with a single token and a count of the number of occurrences. We usually need a special flag to denote when the repeated token appears. For example:
89400000000000000000000000000000000

we can replace with


894f32, where f is the flag for zero.
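As a minimal sketch (not from the slides), zero-run suppression can be written in Python. The function name suppress_zeros, the flag character 'f', and the min_run threshold are all assumptions made for illustration:

def suppress_zeros(text, flag='f', min_run=4):
    # Replace runs of '0' of at least min_run with flag + run length.
    # Shorter runs are cheaper to leave as-is, since flag + count
    # costs several characters by itself.
    out = []
    i = 0
    while i < len(text):
        if text[i] == '0':
            j = i
            while j < len(text) and text[j] == '0':
                j += 1
            run = j - i
            out.append(f"{flag}{run}" if run >= min_run else '0' * run)
            i = j
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

print(suppress_zeros("894" + "0" * 32))   # -> "894f32"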

Simple Repetition Suppression


Compression savings depend on the content of the data. Applications of this simple compression technique include:
Suppression of zeros in a file (Zero Length Suppression)
Silence in audio data, pauses in conversation, etc.
Blanks in text or program source files
Backgrounds in images
Other regular image or data tokens

Run-length Encoding
Principle
Replace all repetitions of the same symbol in the text (runs) by a repetition counter and the symbol.

Run-length Encoding
This encoding method is frequently applied to images (or pixels in a scan line). It is a small compression component used in JPEG compression. In this instance, sequences of image elements are mapped to pairs (ci, li), where ci represents the image intensity or colour and li the length of the i-th run of pixels (not dissimilar to zero-length suppression above).

Run-length Encoding (Example)


Example 1:
AAAABBBAABBBBBCCCCCCCCDABCBAABBBBCCD
Encoding: 4A3B2A5B8C1D1A1B1C1B2A4B2C1D
As we can see, we can only expect a good compression rate when long runs occur frequently. Typical examples are long runs of blanks or leading zeroes in text documents, and long runs of white pixels in gray-scale images.

Example 2:
Original sequence: 111122233333311112222
It can be encoded as: (1,4),(2,3),(3,6),(1,4),(2,4)
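A minimal Python sketch of this (symbol, run-length) encoding, with the invented helper names rle_encode and rle_decode, reproduces Example 1:

def rle_encode(s):
    # Map each run of identical symbols to a (symbol, length) pair.
    pairs = []
    for ch in s:
        if pairs and pairs[-1][0] == ch:
            pairs[-1][1] += 1
        else:
            pairs.append([ch, 1])
    return [(ch, n) for ch, n in pairs]

def rle_decode(pairs):
    return ''.join(ch * n for ch, n in pairs)

msg = "AAAABBBAABBBBBCCCCCCCCDABCBAABBBBCCD"
enc = rle_encode(msg)
print(''.join(f"{n}{ch}" for ch, n in enc))   # 4A3B2A5B8C1D1A1B1C1B2A4B2C1D
assert rle_decode(enc) == msg                 # lossless round trip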

Run-length Encoding (Example)


Run-Length Coding for Binary Files
When dealing with binary files we are sure that a run of 1s is always followed by a run of 0s and vice versa. It is thus sufficient to store the repetition counters only!
Example:
000000000000000000000000000011111111111111000000000   28 14 9
000000000000000000000000001111111111111111110000000   26 18 7
000000000000000000000001111111111111111111111110000   23 24 4
000000000000000000000011111111111111111111111111000   22 26 3
000000000000000000001111111111111111111111111111110   20 30 1
000000000000000000011111110000000000000000001111111   19 7 18 7
000000000000000000011111000000000000000000000011111   19 5 22 5
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000001111000000000000000000000001110   20 4 23 3 1
000000000000000000000011100000000000000000000111000   22 3 20 3 3
011111111111111111111111111111111111111111111111111   1 50
011111111111111111111111111111111111111111111111111   1 50
011000000000000000000000000000000000000000000000011   1 2 46 2
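A sketch of this counter-only encoding, assuming the convention (used in the table above) that the first counter always counts 0s, so a row that begins with 1 gets a leading zero-length run:

def binary_rle(bits):
    # Store only run lengths; runs of 0s and 1s must alternate,
    # so the symbols themselves need not be stored.
    counts, current, run = [], '0', 0
    for b in bits:
        if b == current:
            run += 1
        else:
            counts.append(run)
            current, run = b, 1
    counts.append(run)
    return counts

print(binary_rle('0' * 28 + '1' * 14 + '0' * 9))   # [28, 14, 9]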

Lossless Compression Algorithms


(Pattern Substitution)

This is a simple form of statistical encoding. Here we substitute frequently repeating patterns with a code. The code is shorter than the pattern, giving us compression. A simple pattern substitution scheme could employ predefined codes (for example, replace all occurrences of `The' with the code '&').

Lossless Compression Algorithms


(Pattern Substitution)

More typically, tokens are assigned according to the frequency of occurrence of patterns:

Count occurrences of tokens
Sort in descending order
Assign codes to the highest-count tokens

A predefined symbol table may be used, i.e. assign code i to token i. However, it is more usual to assign codes to tokens dynamically. The entropy encoding schemes basically attempt to decide the optimum assignment of codes to achieve the best compression.

Lossless Compression Algorithms


(Pattern Substitution)

Example 1: ABC -> 1, EE -> 2

Example 2: ABCEE ->1
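As a rough illustration of the counting-and-ranking idea (the helper name build_code_table, the token list, and the symbol pool are all invented for the example):

from collections import Counter

def build_code_table(tokens, symbols):
    # Assign the available short codes to the most frequent tokens.
    counts = Counter(tokens)
    ranked = [tok for tok, _ in counts.most_common(len(symbols))]
    return dict(zip(ranked, symbols))

words = "the cat and the dog and the bird".split()
table = build_code_table(words, ['&', '#', '@'])
print(table)                              # {'the': '&', 'and': '#', 'cat': '@'}
print(' '.join(table.get(w, w) for w in words))   # & @ # & dog # & bird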

Lossless Compression Algorithms


(Entropy Encoding)

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques. Shannon is the father of information theory, and we briefly summarise information theory before looking at specific entropy encoding methods.

Basics of Information Theory


The entropy is defined as the average information content per symbol of the source. According to Shannon's formula:

H(S) = Σi pi log2(1/pi) = - Σi pi log2(pi)

where pi is the probability that symbol Si in S will occur. log2(1/pi) indicates the amount of information contained in Si, i.e., the number of bits needed to code Si.
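A quick Python check of the formula; the counts are taken from the Shannon-Fano example that follows:

import math

def entropy(probs):
    # H(S) = sum_i p_i * log2(1/p_i) = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

counts = [15, 7, 6, 6, 5]                    # A, B, C, D, E from the next slide
total = sum(counts)
print(entropy([c / total for c in counts]))  # ~2.19 bits/symbol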

The Shannon-Fano Algorithm


Lossless Compression Algorithms

This is a basic information-theoretic algorithm. A simple example will be used to illustrate the algorithm:

Symbol   A    B    C    D    E
------------------------------
Count    15   7    6    6    5

The Shannon-Fano Algorithm


Encoding for the Shannon-Fano Algorithm: A top-down approach
1. Sort symbols according to their frequencies/probabilities, e.g., ABCDE.
2. Recursively divide into two parts, each with approximately the same total count.
Lossless Compression Algorithms

The Shannon-Fano Algorithm


Lossless Compression Algorithms

Symbol   Count   log2(1/p)   Code   Subtotal (count * no. of bits in the code)
------   -----   ---------   ----   ------------------------------------------
A        15      1.38        00     30
B        7       2.48        01     14
C        6       2.70        10     12
D        6       2.70        110    18
E        5       2.96        111    15
                                    TOTAL (# of bits): 89
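A compact sketch of the top-down recursion. The split search and the helper name shannon_fano are my own; ties can be broken differently, but this variant reproduces the table above:

def shannon_fano(symbols):
    # symbols: list of (symbol, count) pairs, sorted by count descending.
    if len(symbols) == 1:
        return {symbols[0][0]: ''}   # prefix bits are added on the way back up
    total = sum(c for _, c in symbols)
    # Choose the split that makes the two halves' totals as equal as possible.
    best_i, best_diff, running = 1, float('inf'), 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(2 * running - total)        # |left total - right total|
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {s: '0' + c for s, c in shannon_fano(symbols[:best_i]).items()}
    codes.update({s: '1' + c for s, c in shannon_fano(symbols[best_i:]).items()})
    return codes

print(shannon_fano([('A', 15), ('B', 7), ('C', 6), ('D', 6), ('E', 5)]))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}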

Huffman Coding
Lossless Compression Algorithms

Huffman coding is based on the frequency of occurrence of a data item (pixel in images). The principle is to use a lower number of bits to encode the data that occurs more frequently. Codes are stored in a Code Book which may be constructed for each image or a set of images. In all cases the code book plus encoded data must be transmitted to enable decoding.

Huffman Coding
Lossless Compression Algorithms

Find the frequency of each character in the file to be compressed.
For each distinct character, create a one-node binary tree containing the character, with its frequency as its priority.
While there is more than one tree in the priority queue:
De-queue two trees t1 and t2. Create a tree t that contains t1 as its left sub-tree and t2 as its right sub-tree. Priority(t) = priority(t1) + priority(t2). Insert t in its proper location in the priority queue.

Assign 0 and 1 weights to the edges of the resulting tree, such that the left and right edges of each node do not have the same weight.
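A sketch of this construction using a binary min-heap as the priority queue. The use of heapq, the tuple representation of trees, and the tie-breaking counter are implementation choices, not part of the slides; the frequencies come from the example that follows:

import heapq
from itertools import count

def huffman_codes(freqs):
    # freqs: dict char -> frequency. Returns dict char -> code string.
    tie = count()   # unique tie-breaker so the heap never compares trees
    heap = [[f, next(tie), ch] for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)            # two lowest-priority trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, [f1 + f2, next(tie), (t1, t2)])
    codes = {}
    def walk(tree, code=''):
        if isinstance(tree, tuple):                # internal node: left=0, right=1
            walk(tree[0], code + '0')
            walk(tree[1], code + '1')
        else:                                      # leaf: a character
            codes[tree] = code
    walk(heap[0][2])
    return codes

codes = huffman_codes({'a': 45, 'E': 65, 'L': 13, 'n': 45, 'o': 18, 's': 22, 't': 53})
print(sorted(len(c) for c in codes.values()))
# [2, 2, 3, 3, 3, 4, 4] -> the codeword lengths, which sum to the 21 bits used below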

EXAMPLE
Character   a    E    L    n    o    s    t
Frequency   45   65   13   45   18   22   53

If the message is sent uncompressed with an 8-bit ASCII representation for the characters, we have 261*8 = 2088 bits. Assume that the number of character-codeword pairs and the pairs themselves are included at the beginning of the binary file containing the compressed message.

Number of bits for the transmitted file = bits(7) + bits(characters) + bits(codewords) + bits(compressed message) = 3 + (7*8) + 21 + 696 = 776
Compression ratio = bits for ASCII representation / number of bits transmitted = 2088 / 776 = 2.69
Thus, the size of the transmitted file is 100 / 2.69 ≈ 37% of the original ASCII file (i.e., the compressed file is 63% smaller).

Dynamic Huffman coding


The basic Huffman algorithm has been extended for the following reason: the previous algorithms require statistical knowledge, which is often not available (e.g., live audio and video). The solution is to use adaptive algorithms.
Dynamic Huffman coding
Faller and Gallager independently proposed a one-pass scheme, later improved substantially by Knuth, for constructing dynamic Huffman codes; it is known as the FGK algorithm. As characters are processed, frequencies are updated and codes are changed, or the coding tree is modified. Both sender and receiver start with the same initial tree and use the same algorithm to modify the tree after each letter is processed.

Construction of tree
Sender and receiver begin with an initial tree consisting of a root node and a left child with a null character and weight = 0.
The first character is sent uncompressed and added to the tree as the right branch from the root. The new node is labeled with the character, its weight is 1, and the tree branch is labeled 1 also.
A list shows the tree entries in order.
Whenever a new character appears in the message, it is sent as follows: send the uncompressed representation of the new character, then place the new character into the tree and update the list representation.
Example: banana

(Figure: the resulting dynamic Huffman tree for "banana", with leaf a(3) on one branch and, on the other, an internal node of weight 3 over n(2), b(1), and the null node *(0).)

Dynamic Huffman coding (continued)
Huffman coding creates a separate codeword for each character. If the transmitted letter already exists in the current tree, then only its code is transmitted; otherwise, the uncompressed letter is transmitted and added to the tree. The tree must be updated after each step.
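The FGK tree update itself is intricate. Purely as a conceptual sketch (and deliberately NOT the FGK algorithm), the sender/receiver synchronization can be shown by naively rebuilding the Huffman tree from the running counts after every character; FGK achieves the same effect with an efficient incremental update. This reuses huffman_codes() from the Huffman sketch above, and the name adaptive_encode and the 'NYT' escape marker are assumptions for illustration:

def adaptive_encode(message):
    counts = {'NYT': 0}                    # initial tree: just the null node
    out = []
    for ch in message:
        codes = huffman_codes(counts)      # rebuild codebook from current counts
        if ch in counts:
            out.append(codes[ch])          # known letter: send its code only
        else:
            out.append(codes['NYT'] + ch)  # escape code, then the raw letter
            counts[ch] = 0
        counts[ch] += 1                    # the receiver applies the same update
    return out

print(adaptive_encode("banana"))
# ['b', '0a', '10n', '10', '10', '11'] -- note how the codes change as the tree adapts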

LZW (Lempel-Ziv-Welch)
The compressor algorithm builds a string translation table from the text being compressed.
The string translation table maps fixed-length codes to strings. The table is initialized with all single-character strings (256 entries in the case of 8-bit characters). As the compressor serially examines the characters of the text, it stores every unique two-character string into the table as a code/character concatenation, with the code mapping to the corresponding first character. Whenever a previously-encountered string is read from the input, the longest such previously-encountered string is determined, and then the code for this string concatenated with the extension character (the next character in the input) is stored in the table.

Decoding LZW data is the reverse of encoding. The decompressor reads a code from the encoded data stream and adds the code to the data dictionary if it is not already there. The code is then translated into the string it represents and is written to the uncompressed output stream.

The encoding algorithm can be summarized as follows:

w = NIL;
while (read a character k) {
    if (wk exists in the dictionary)
        w = wk;
    else {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;    /* flush the last match at end of input */

Example: Input string is "^WED^WE^WEE^WEB^WET".

w     k     output   index   symbol
-----------------------------------
NIL   ^
^     W     ^        256     ^W
W     E     W        257     WE
E     D     E        258     ED
D     ^     D        259     D^
^     W
^W    E     256      260     ^WE
E     ^     E        261     E^
^     W
^W    E
^WE   E     260      262     ^WEE
E     ^
E^    W     261      263     E^W
W     E
WE    B     257      264     WEB
B     ^     B        265     B^
^     W
^W    E
^WE   T     260      266     ^WET
T     EOF   T
A 19-symbol input has been reduced to an output of 7 symbols plus 5 codes. Usually, compression doesn't start until a large number of bytes (e.g., > 100) have been read in.
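A direct Python translation of the encoder; the dictionary layout and the name lzw_encode are implementation choices:

def lzw_encode(text):
    # The table starts with all single-character strings (codes 0..255).
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ''
    out = []
    for k in text:
        if w + k in dictionary:
            w = w + k                       # extend the current match
        else:
            out.append(dictionary[w])       # emit code for the longest match
            dictionary[w + k] = next_code   # remember the new string
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])           # flush the final match
    return out

print(lzw_encode("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84] -- the trace above in ASCII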

The LZW decompression algorithm is as follows:

read a character k;
output k;
w = k;
while (read a character k) {    /* k could be a character or a code */
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to the dictionary;
    w = entry;
}

Example (continued): Input string is "^WED<256>E<260><261><257>B<260>T"

w       k       output   index   symbol
---------------------------------------
        ^       ^
^       W       W        256     ^W
W       E       E        257     WE
E       D       D        258     ED
D       <256>   ^W       259     D^
<256>   E       E        260     ^WE
E       <260>   ^WE      261     E^
<260>   <261>   E^       262     ^WEE
<261>   <257>   WE       263     E^W
<257>   B       B        264     WEB
B       <260>   ^WE      265     B^
<260>   T       T        266     ^WET
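And the matching decoder, reusing lzw_encode from the sketch above for the round-trip check. The else branch covers the one tricky case where the decoder receives a code it has not yet finished adding, which must then decode to w + w[0]:

def lzw_decode(codes):
    # Rebuild the same dictionary the encoder built, one step behind it.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:                                  # code not in table yet
            entry = w + w[0]
        out.append(entry)
        dictionary[next_code] = w + entry[0]   # add w + first char of entry
        next_code += 1
        w = entry
    return ''.join(out)

assert lzw_decode(lzw_encode("^WED^WE^WEE^WEB^WET")) == "^WED^WE^WEE^WEB^WET"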

Advantages
LZW compression provides a better compression ratio, in most applications, than any well-known method available up to that time. It usually runs very fast, as the bit parsing is easy and the table lookup is automatic.

Disadvantages
Substantial memory requirements
