
Data Compression

Lecture 16

Compression

• Definition: the process of encoding data with fewer bits
• Reason: to save valuable resources such as communication bandwidth or hard disk space

[Figure: a large block of text passes through "Compress" to become a much smaller block; "Uncompress" restores the original.]
Compression Types

1. Lossy
• Loses some information during compression, so the exact original cannot be recovered (e.g., JPEG)
• Normally achieves a higher compression ratio than lossless methods
• Used when loss is acceptable: image, sound, and video files

Compression Types
2. Lossless
• The exact original can be recovered
• Used when loss is not acceptable, such as text or program data

Basic Term: Compression Ratio, the ratio of the number of bits in the original data to the number of bits in the compressed data

For example, 3:1 means the original file was 3000 bytes and the compressed file is now only 1000 bytes.
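The ratio in the example can be checked with a few lines of Java. This is only an illustrative sketch: the class name `CompressionRatio` and the method `ratio` are made up for this example and do not come from the lecture.

```java
final class CompressionRatio {
    // Ratio of original size to compressed size; 3.0 corresponds to "3:1".
    // The unit (bits or bytes) does not matter as long as both sizes use the same one.
    static double ratio(long originalSize, long compressedSize) {
        if (compressedSize <= 0) {
            throw new IllegalArgumentException("compressed size must be positive");
        }
        return (double) originalSize / compressedSize;
    }

    public static void main(String[] args) {
        // The example from the slide: 3000 bytes compressed to 1000 bytes.
        System.out.println(ratio(3000, 1000) + ":1"); // prints 3.0:1
    }
}
```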
Huffman Codes

• Invented by David Huffman as a class assignment in 1951 (published in 1952)

• Used in many, if not most, compression algorithms
  • gzip, bzip2, JPEG (as an option), fax compression, …

• Properties:
  • Generates optimal prefix codes
  • Cheap to generate codes
  • Cheap to encode and decode
  • Average code length l_a equals the entropy H if all probabilities are powers of 2
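The last property can be checked numerically. The sketch below is an illustration, not part of the lecture: `EntropyCheck`, `entropy`, and `averageLength` are hypothetical names. It relies on the fact that each merge in Huffman's algorithm adds one bit to every symbol below the new node, so the total encoded length is the sum of all merged weights and no tree needs to be stored.

```java
import java.util.PriorityQueue;

final class EntropyCheck {
    // Shannon entropy H in bits per symbol for the given symbol counts.
    static double entropy(int[] counts) {
        double total = 0;
        for (int c : counts) total += c;
        double h = 0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Average Huffman code length l_a in bits per symbol.
    // Only the weights are tracked: total bits = sum of all merged weights.
    static double averageLength(int[] counts) {
        PriorityQueue<Long> queue = new PriorityQueue<>();
        long total = 0;
        for (int c : counts) {
            if (c > 0) {
                queue.add((long) c);
                total += c;
            }
        }
        long totalBits = 0;
        while (queue.size() > 1) {
            long merged = queue.poll() + queue.poll();
            totalBits += merged;
            queue.add(merged);
        }
        return (double) totalBits / total;
    }

    public static void main(String[] args) {
        int[] dyadic = {4, 2, 1, 1}; // probabilities 1/2, 1/4, 1/8, 1/8
        System.out.println("H = " + entropy(dyadic) + ", l_a = " + averageLength(dyadic));
        int[] uneven = {3, 1, 1};    // probabilities are not powers of 2
        System.out.println("H = " + entropy(uneven) + ", l_a = " + averageLength(uneven));
    }
}
```

For the first distribution the two values coincide; for the second the average length is slightly larger than the entropy.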
The (Real) Basic Algorithm

1. Scan the text to be compressed and tally the occurrences of each character.
2. Sort or prioritize the characters by their number of occurrences in the text.
3. Build the Huffman code tree from the prioritized list.
4. Traverse the tree to determine all code words.
5. Scan the text again and create a new file using the Huffman codes.
Building a Tree

• Scan the original text

Eerie eyes seen near lake.

• What is the frequency of each character in the text?

Char    Freq.   Char    Freq.   Char    Freq.
E       1       y       1       k       1
e       8       s       2       .       1
r       2       n       2       space   4
i       1       a       2       l       1
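The tally in the table can be reproduced with a short Java sketch. The names `FrequencyCount` and `tally` are assumptions made for this illustration; the lecture does not show its own counting code.

```java
import java.util.Map;
import java.util.TreeMap;

final class FrequencyCount {
    // Tally how often each character occurs in the text.
    static Map<Character, Integer> tally(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = tally("Eerie eyes seen near lake.");
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            String label = e.getKey() == ' ' ? "space" : String.valueOf(e.getKey());
            System.out.println(label + " " + e.getValue());
        }
    }
}
```

Upper-case `E` and lower-case `e` are counted separately, as in the table.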
Building a Tree

• Prioritize characters
  • Create a binary tree node for each character, holding the character and its frequency
  • Place the nodes in a priority queue
  • The lower the number of occurrences, the higher the priority in the queue
Building a Tree

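• Prioritize characters
  • Uses binary tree nodes

import java.util.PriorityQueue;

public class HuffNode
{
    public char myChar;               // character held by a leaf node
    public int myFrequency;           // occurrences of myChar, or the sum of the children
    public HuffNode myLeft, myRight;  // both null for a leaf
}

// Field of the class that builds the tree; needs a comparator on myFrequency
PriorityQueue<HuffNode> myQueue;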
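One way to set up such a queue with `java.util.PriorityQueue` is sketched below. The node class repeats the slide's field names; the wrapper class `QueueDemo`, the constructor, and `makeQueue` are assumptions added for illustration. Note that `PriorityQueue` breaks ties between equal frequencies arbitrarily, so nodes of equal frequency may come out in a different order than in the slides.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class QueueDemo {
    // Same fields as the slide's HuffNode, plus a convenience constructor.
    static final class HuffNode {
        char myChar;
        int myFrequency;
        HuffNode myLeft, myRight;

        HuffNode(char c, int frequency) {
            myChar = c;
            myFrequency = frequency;
        }
    }

    // Build a min-queue: the node with the lowest frequency is dequeued first.
    static PriorityQueue<HuffNode> makeQueue(char[] chars, int[] freqs) {
        PriorityQueue<HuffNode> queue =
            new PriorityQueue<>(Comparator.comparingInt((HuffNode n) -> n.myFrequency));
        for (int i = 0; i < chars.length; i++) {
            queue.add(new HuffNode(chars[i], freqs[i]));
        }
        return queue;
    }

    public static void main(String[] args) {
        char[] chars = {'E', 'i', 'y', 'l', 'k', '.', 'r', 's', 'n', 'a', ' ', 'e'};
        int[] freqs  = { 1,   1,   1,   1,   1,   1,   2,   2,   2,   2,   4,   8 };
        PriorityQueue<HuffNode> queue = makeQueue(chars, freqs);
        while (!queue.isEmpty()) {
            HuffNode n = queue.poll();
            System.out.println("'" + n.myChar + "' " + n.myFrequency);
        }
    }
}
```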
Building a Tree

• The queue after inserting all nodes
• Null pointers are not shown

E   i   y   l   k   .   r   s   n   a   sp  e
1   1   1   1   1   1   2   2   2   2   4   8
Building a Tree

• While the priority queue contains two or more nodes
  • Create a new node
  • Dequeue a node and make it the left subtree
  • Dequeue the next node and make it the right subtree
  • The frequency of the new node equals the sum of the frequencies of its left and right children
  • Enqueue the new node back into the queue
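The loop above can be sketched in Java as follows. This is a minimal illustration under stated assumptions: `TreeBuilder` and `build` are hypothetical names, the node class mirrors the slide's `HuffNode` with an added constructor, and internal nodes store the placeholder character `'\0'`. Because `java.util.PriorityQueue` breaks ties arbitrarily, the shape of the tree may differ from the one drawn in the slides, although the node weights are the same.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class TreeBuilder {
    static final class HuffNode {
        char myChar;
        int myFrequency;
        HuffNode myLeft, myRight;

        HuffNode(char c, int frequency, HuffNode left, HuffNode right) {
            myChar = c;
            myFrequency = frequency;
            myLeft = left;
            myRight = right;
        }
    }

    // Returns the root of the Huffman tree, or null if there are no characters.
    static HuffNode build(char[] chars, int[] freqs) {
        PriorityQueue<HuffNode> queue =
            new PriorityQueue<>(Comparator.comparingInt((HuffNode n) -> n.myFrequency));
        for (int i = 0; i < chars.length; i++) {
            queue.add(new HuffNode(chars[i], freqs[i], null, null));
        }
        while (queue.size() >= 2) {
            HuffNode left = queue.poll();   // lowest frequency becomes the left subtree
            HuffNode right = queue.poll();  // next lowest becomes the right subtree
            int sum = left.myFrequency + right.myFrequency;
            queue.add(new HuffNode('\0', sum, left, right));
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        char[] chars = {'E', 'i', 'y', 'l', 'k', '.', 'r', 's', 'n', 'a', ' ', 'e'};
        int[] freqs  = { 1,   1,   1,   1,   1,   1,   2,   2,   2,   2,   4,   8 };
        HuffNode root = build(chars, freqs);
        System.out.println(root.myFrequency); // prints 26, the length of the text
    }
}
```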
Building a Tree

The slides step through the loop on the example queue. Each row below shows the two nodes dequeued, the frequency of the new node, and the queue after the new node is enqueued (front of the queue on the left). A subtree is written as its leaves in parentheses, so (Ei):2 is a node of frequency 2 with children E and i.

Step  Dequeued (left + right)      New  Queue after enqueue
 1    E:1 + i:1                    2    y:1 l:1 k:1 .:1 r:2 s:2 n:2 a:2 (Ei):2 sp:4 e:8
 2    y:1 + l:1                    2    k:1 .:1 r:2 s:2 n:2 a:2 (Ei):2 (yl):2 sp:4 e:8
 3    k:1 + .:1                    2    r:2 s:2 n:2 a:2 (Ei):2 (yl):2 (k.):2 sp:4 e:8
 4    r:2 + s:2                    4    n:2 a:2 (Ei):2 (yl):2 (k.):2 sp:4 (rs):4 e:8
 5    n:2 + a:2                    4    (Ei):2 (yl):2 (k.):2 sp:4 (rs):4 (na):4 e:8
 6    (Ei):2 + (yl):2              4    (k.):2 sp:4 (rs):4 (na):4 (Eiyl):4 e:8
 7    (k.):2 + sp:4                6    (rs):4 (na):4 (Eiyl):4 (k.sp):6 e:8
 8    (rs):4 + (na):4              8    (Eiyl):4 (k.sp):6 e:8 (rsna):8
 9    (Eiyl):4 + (k.sp):6          10   e:8 (rsna):8 (Eiylk.sp):10
10    e:8 + (rsna):8               16   (Eiylk.sp):10 (ersna):16
11    (Eiylk.sp):10 + (ersna):16   26   root:26

What is happening to the characters with a low number of occurrences?
Building a Tree

• After enqueueing this node, there is only one node left in the priority queue.
• Dequeue the single node left in the queue.
• This tree contains the new code words for each character.
• The frequency of the root node should equal the number of characters in the text.

Final tree (each node shows its frequency; the left child is listed first):

26
    10
        4
            2
                E 1
                i 1
            2
                y 1
                l 1
        6
            2
                k 1
                . 1
            sp 4
    16
        e 8
        8
            4
                r 2
                s 2
            4
                n 2
                a 2

Eerie eyes seen near lake. → 26 characters
Encoding the File
Traverse Tree for Codes

• Perform a traversal of the tree to obtain the new code words
• Going left is a 0; going right is a 1
• A code word is complete only when a leaf node is reached

[Figure: the completed Huffman tree with root frequency 26]
Encoding the File
Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
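The traversal that produces this table can be sketched as a short recursion. This is an illustration under assumptions: `CodeTable`, `codes`, `collect`, `pair`, and `lectureTree` are hypothetical names, and the tree is built by hand in the shape drawn in the slides so that the output matches the table. Frequencies are left out of the node class because the traversal does not need them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CodeTable {
    static final class HuffNode {
        final char myChar;
        final HuffNode myLeft, myRight;

        HuffNode(char c) { myChar = c; myLeft = null; myRight = null; }
        HuffNode(HuffNode left, HuffNode right) { myChar = '\0'; myLeft = left; myRight = right; }
        boolean isLeaf() { return myLeft == null && myRight == null; }
    }

    // Map each character to its code word: left edge = 0, right edge = 1.
    static Map<Character, String> codes(HuffNode root) {
        Map<Character, String> table = new LinkedHashMap<>();
        if (root != null) collect(root, "", table);
        return table;
    }

    private static void collect(HuffNode node, String prefix, Map<Character, String> table) {
        if (node.isLeaf()) {
            table.put(node.myChar, prefix); // a code word is complete only at a leaf
            return;
        }
        collect(node.myLeft, prefix + "0", table);
        collect(node.myRight, prefix + "1", table);
    }

    private static HuffNode pair(char left, char right) {
        return new HuffNode(new HuffNode(left), new HuffNode(right));
    }

    // The finished tree from the slides, built by hand for this example.
    static HuffNode lectureTree() {
        HuffNode ten = new HuffNode(
            new HuffNode(pair('E', 'i'), pair('y', 'l')),
            new HuffNode(pair('k', '.'), new HuffNode(' ')));
        HuffNode sixteen = new HuffNode(
            new HuffNode('e'),
            new HuffNode(pair('r', 's'), pair('n', 'a')));
        return new HuffNode(ten, sixteen);
    }

    public static void main(String[] args) {
        for (Map.Entry<Character, String> e : codes(lectureTree()).entrySet()) {
            String label = e.getKey() == ' ' ? "space" : String.valueOf(e.getKey());
            System.out.println(label + " " + e.getValue());
        }
    }
}
```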
Encoding the File

• Rescan the text and encode the file using the new code words

Eerie eyes seen near lake.

000010110000011001110001010110101111011010
111001111101011111100011001111110100100101

• Why is there no need for a separator character?

Encoding the File
Results

• Have we made things better?
• 84 bits to encode the text (the bit string shown above)
• ASCII would take 8 * 26 = 208 bits
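Encoding and decoding with the code table can be sketched as below. No separator is needed because no code word is a prefix of another, so the decoder can emit a character as soon as the bits read so far match a code word. The class `PrefixCoding` and its methods are hypothetical names for this illustration, and bits are kept as a `String` of '0' and '1' characters for readability; a real implementation would pack them into bytes.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixCoding {
    // Code table copied from the slides.
    static Map<Character, String> lectureCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "0000"); codes.put('i', "0001");
        codes.put('y', "0010"); codes.put('l', "0011");
        codes.put('k', "0100"); codes.put('.', "0101");
        codes.put(' ', "011");  codes.put('e', "10");
        codes.put('r', "1100"); codes.put('s', "1101");
        codes.put('n', "1110"); codes.put('a', "1111");
        return codes;
    }

    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("no code word for '" + c + "'");
            }
            bits.append(code);
        }
        return bits.toString();
    }

    static String decode(String bits, Map<Character, String> codes) {
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder text = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character c = reverse.get(current.toString());
            if (c != null) {          // a complete code word has been read
                text.append(c);
                current.setLength(0);
            }
        }
        if (current.length() != 0) {
            throw new IllegalArgumentException("input ends in the middle of a code word");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = "Eerie eyes seen near lake.";
        String bits = encode(text, lectureCodes());
        System.out.println(bits);
        System.out.println(bits.length() + " bits, versus " + 8 * text.length() + " bits in 8-bit ASCII");
        System.out.println(decode(bits, lectureCodes()));
    }
}
```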
