
Chapter 7

Special Section
Focus on Data
Compression
7A Objectives

• Understand the essential ideas underlying data compression.
• Become familiar with the different types of compression algorithms.
• Be able to describe the most popular data
compression algorithms in use today and know
the applications for which each is suitable.

7A.1 Introduction

• Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.
• Some storage devices (notably tape) compress
data automatically as it is written, resulting in less
tape consumption and significantly faster backup
operations.
• Compression also reduces file transfer time, saving
time and communications bandwidth.

7A.1 Introduction

• A good metric for compression is the compression factor (or compression ratio) given by:

  compression factor = 1 - (compressed size / uncompressed size)

• If we have a 100KB file that we compress to 40KB, we have a compression factor of:

  1 - (40KB / 100KB) = 0.60, or 60%
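The arithmetic above can be checked in a few lines of Python; the function name compression_factor is ours, not the text's:

```python
def compression_factor(original_size, compressed_size):
    """Fraction of the original size saved by compression:
    1 - (compressed size / uncompressed size)."""
    return 1 - compressed_size / original_size

# The example from the slide: a 100KB file compressed to 40KB
print(compression_factor(100, 40))  # 0.6, i.e. a 60% savings
```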
7A.1 Introduction

• Compression is achieved by removing data redundancy while preserving information content.
• The information content of a group of bytes (a
message) is its entropy.
– Data with low entropy permit a larger compression ratio than
data with high entropy.
• Entropy, H, is a function of symbol frequency. It is the
weighted average of the number of bits required to
encode the symbols of a message:
H = -∑ P(xi) × log2 P(xi)
7A.1 Introduction

• The entropy of the entire message is the sum of the individual symbol entropies:

  H = -∑ P(xi) × log2 P(xi)
• The average redundancy per character, where li is the number of bits used to encode symbol xi, is given by:

  ∑ P(xi) × li + ∑ P(xi) × log2 P(xi)
7A.2 Statistical Coding

• Consider the message: HELLO WORLD!
– The letter L has a probability of 3/12 = 1/4 of appearing in this message. The number of bits required to encode this symbol is -log2(1/4) = 2.
• Using our formula, H = -∑ P(xi) × log2 P(xi), the entropy of the entire message is 3.022 bits per character.
– This means that the theoretical minimum number of bits per character is 3.022.
• Theoretically, the message could be sent using only 37 bits. (3.022 × 12 = 36.26)

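The HELLO WORLD! figures can be verified directly with Python's standard library; the variable names below are our own:

```python
from collections import Counter
from math import ceil, log2

message = "HELLO WORLD!"
n = len(message)

# H = -sum of P(xi) * log2 P(xi) over the distinct symbols
entropy = -sum((c / n) * log2(c / n) for c in Counter(message).values())

print(round(entropy, 3))   # 3.022 bits per character
print(ceil(entropy * n))   # 37 bits for the whole message
```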
7A.2 Statistical Coding

• The entropy metric just described forms the basis for statistical data compression.
• Two widely-used statistical coding algorithms are
Huffman coding and arithmetic coding.
• Huffman coding builds a binary tree from the letter
frequencies in the message.
– The binary symbols for each character are read directly
from the tree.
• Symbols with the highest frequencies end up at the
top of the tree, and result in the shortest codes.

7A.2 Statistical Coding

• The process of building the tree begins by counting the occurrences of each symbol in the text to be encoded.

HIGGLETY PIGGLETY POP
THE DOG HAS EATEN THE MOP
THE PIGS IN A HURRY
THE CATS IN A FLURRY
HIGGLETY PIGGLETY POP
7A.2 Statistical Coding

• Next, place the letters and their frequencies into a forest of trees that each have two nodes: one for the letter, and one for its frequency.
7A.2 Statistical Coding

• We start building the tree by joining the nodes having the two lowest frequencies.
7A.2 Statistical Coding

• And then we again join the nodes with the two lowest frequencies.
7A.2 Statistical Coding

• And again ....

7A.2 Statistical Coding

• Here is our finished tree.

7A.2 Statistical Coding

This is the code derived from this tree.
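The tree-building procedure above can be sketched with Python's heapq. The exact codes depend on how ties are broken, so they may differ from the slide's tree, but the code lengths (and hence the total encoded size) come out the same; the function name huffman_codes is ours:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree from symbol frequencies; return {symbol: code}."""
    # The forest: one (frequency, tie-breaker, tree) entry per symbol
    heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Join the two trees with the lowest frequencies
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: recurse both ways
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                            # leaf: record the symbol's code
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("HELLO WORLD!")
# L, the most frequent symbol, ends up near the top of the tree:
assert len(codes["L"]) == min(len(c) for c in codes.values())
```

Encoding HELLO WORLD! with any Huffman code for these frequencies takes 37 bits in total, which sits just above the 36.26-bit entropy bound computed earlier.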
7A.2 Statistical Coding

• The second type of statistical coding, arithmetic coding, partitions the real number interval between 0 and 1 into segments according to symbol probabilities.
– An abbreviated algorithm for this process is given in the
text.
• Arithmetic coding is computationally intensive and
it runs the risk of causing divide underflow.
• Variations in floating-point representation among
various systems can also cause the terminal
condition (a zero value) to be missed.
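A toy version of the interval-narrowing step (encode side only) might look like this; the three-symbol alphabet and its probabilities are invented for illustration, and a practical coder would use scaled-integer arithmetic to sidestep the underflow problems noted above:

```python
def arithmetic_interval(message, probs):
    """Narrow [0, 1) once per symbol; any number inside the final
    interval identifies the whole message."""
    # Partition [0, 1) into one segment per symbol
    cum, segments = 0.0, {}
    for sym, p in probs.items():
        segments[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        seg_lo, seg_hi = segments[sym]
        low, high = low + span * seg_lo, low + span * seg_hi
    return low, high

# A hypothetical three-symbol alphabet, for illustration only
low, high = arithmetic_interval("AAB", {"A": 0.5, "B": 0.3, "C": 0.2})
# The message "AAB" is now any number in [0.125, 0.2)
```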
7A.2 Statistical Coding

• For most data, statistical coding methods offer excellent compression ratios.
• Their main disadvantage is that they require two
passes over the data to be encoded.
– The first pass calculates probabilities, the second encodes
the message.
• This approach is unacceptably slow for storage
systems, where data must be read, written, and
compressed within one pass over a file.

7A.3 LZ Dictionary Systems

• Ziv-Lempel (LZ) dictionary systems solve the two-pass problem by using values in the data as a dictionary to encode itself.
• The LZ77 compression algorithm employs a text window
in conjunction with a lookahead buffer.
– The text window serves as the dictionary. If text is found in
the lookahead buffer that matches text in the dictionary, the
location and length of the text in the window is output.

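A minimal (and deliberately inefficient) sketch of the window-plus-lookahead idea, assuming the common (offset, length, next-character) triple as the output format; the window and buffer sizes here are arbitrary:

```python
def lz77_encode(data, window=32, lookahead=8):
    """Emit (offset, length, next_char) triples; offset 0 means a literal."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Scan the text window (the dictionary) for the longest match
        for j in range(max(0, i - window), i):
            length = 0
            while (length < lookahead and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the text by copying matches out of the growing output."""
    s = ""
    for off, length, ch in triples:
        for _ in range(length):
            s += s[-off]        # copy one character from `off` back
        s += ch
    return s

msg = "HIGGLETY PIGGLETY POP"
assert lz77_decode(lz77_encode(msg)) == msg
```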
7A.3 LZ Dictionary Systems

• LZ77 implementations include PKZIP and IBM’s RAMAC RVA 2 Turbo disk array.
– The simplicity of LZ77 lends itself well to a hardware
implementation.
• LZ78 is another dictionary coding system.
• It removes the LZ77 constraint of a fixed-size
window. Instead, it creates a trie as the data is read.
• Where LZ77 uses pointers to locations in a
dictionary, LZ78 uses pointers to nodes in the trie.

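The trie-building idea can be sketched with the trie kept as a dictionary of nodes; the output is the usual (node, character) pair, and the function names are ours:

```python
def lz78_encode(data):
    """Grow a trie of phrases as the data is read; emit (node, char) pairs."""
    trie = {0: {}}            # node id -> {char: child node id}; 0 is the root
    out, node, next_id = [], 0, 1
    for ch in data:
        if ch in trie[node]:
            node = trie[node][ch]       # keep extending the current phrase
        else:
            out.append((node, ch))      # pointer to a trie node + the new char
            trie[node][ch] = next_id
            trie[next_id] = {}
            node, next_id = 0, next_id + 1
    if node:
        out.append((node, ""))          # flush a phrase that ended mid-trie
    return out

def lz78_decode(pairs):
    """Rebuild the phrases in the same order the encoder created them."""
    phrases = {0: ""}
    pieces = []
    for new_id, (node, ch) in enumerate(pairs, start=1):
        phrases[new_id] = phrases[node] + ch
        pieces.append(phrases[new_id])
    return "".join(pieces)
```

Note that there is no fixed-size window: the trie, and with it the set of reusable phrases, grows for as long as the data is read.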
7A.4 GIF and PNG Compression

• GIF compression is a variant of LZ78, called LZW, for Lempel-Ziv-Welch.
• It improves upon LZ78 through its efficient management of the size of the trie.
• Terry Welch, the designer of LZW, was employed by the Unisys Corporation when he created the algorithm, and Unisys subsequently patented it.
• Owing to royalty disputes, development of another algorithm, PNG, was hastened.

7A.4 GIF and PNG Compression

• PNG employs two types of compression: first LZ77 compression is applied, followed by Huffman coding.
• The advantage that GIF holds over PNG is that GIF supports multiple images in one file.
• MNG is an extension of PNG that supports multiple images in one file.
• GIF, PNG, and MNG are primarily used for graphics compression. To compress larger, photographic images, JPEG is often more suitable.

7A.5 JPEG Compression

• Photographic images incorporate a great deal of information. However, much of that information can be lost without objectionable deterioration in image quality.
• With this in mind, JPEG allows user-selectable image
quality, but even at the “best” quality levels, JPEG makes
an image file smaller owing to its multiple-step
compression algorithm.
• It’s important to remember that JPEG is lossy, even at the
highest quality setting. It should be used only when the
loss can be tolerated.

The JPEG algorithm is illustrated on the next slide.
7A.5 JPEG Compression

Section 7A Conclusion

• Two approaches to data compression are statistical data compression and dictionary systems.
• Statistical coding requires two passes over the input; dictionary systems require only one.
• LZ77 and LZ78 are two popular dictionary systems.
• GIF, PNG, MNG, and JPEG are used for image
compression.
• JPEG is lossy, so its use is not suited for all types of
images.

End of Section 7A

JPEG Encoding
• Used to compress pictures and graphics.
• In JPEG, a grayscale picture is divided into 8x8
pixel blocks to decrease the number of
calculations.
• Basic idea:
 Change the picture into a linear (vector) set of numbers that reveals the redundancies.
 The redundancies are then removed by one of the lossless compression methods.
JPEG Encoding- DCT
• DCT: Discrete Cosine Transform
• DCT transforms the 64 values in an 8x8 pixel block in a way that the relative relationships between pixels are kept but the redundancies are revealed.
• Example:
A gradient grayscale
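The transform itself is the standard DCT-II. A one-dimensional version (JPEG applies it first to the rows and then to the columns of the 8x8 block) can be sketched as follows, using an orthonormal scaling convention:

```python
from math import cos, pi, sqrt

def dct_1d(block):
    """DCT-II of a length-N block, with orthonormal scaling."""
    N = len(block)
    out = []
    for k in range(N):
        s = sum(x * cos(pi * k * (2 * n + 1) / (2 * N))
                for n, x in enumerate(block))
        out.append((sqrt(1 / N) if k == 0 else sqrt(2 / N)) * s)
    return out

# A gradient grayscale row: the pixel values change smoothly...
row = [20, 30, 40, 50, 60, 70, 80, 90]
coeffs = dct_1d(row)
# ...so nearly all the energy lands in the first few coefficients,
# and the rest are at or near zero — the redundancy is now visible
```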
Quantization & Compression
Quantization:
 After the T table is created, the values are quantized to reduce the number of bits needed for encoding.

• Compression:
 Quantized values are read from the table and redundant 0s are removed.
 To cluster the 0s together, the table is read diagonally in a zigzag fashion. The reason is that if the table doesn’t have fine changes, the bottom-right corner of the table is all 0s.
 JPEG usually uses lossless run-length encoding at the compression phase.
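The zigzag read and the zero run-length step can be sketched as follows; the (run, value) pair format and the "EOB" (end-of-block) terminator mirror common practice, though real JPEG entropy coding is more involved:

```python
def zigzag(block):
    """Read an NxN block along its anti-diagonals, alternating direction."""
    n = len(block)
    cells = [(i, j) for i in range(n) for j in range(n)]
    cells.sort(key=lambda p: (p[0] + p[1],
                              p[1] if (p[0] + p[1]) % 2 == 0 else p[0]))
    return [block[i][j] for i, j in cells]

def run_length_zeros(seq):
    """(zero_run, value) pairs, ended by an end-of-block marker."""
    out, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    out.append("EOB")          # everything after the last nonzero is zeros
    return out

# A quantized block with its energy in the top-left corner
block = [[52, 3, 0, 0],
         [2, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(run_length_zeros(zigzag(block)))   # the trailing zeros collapse into EOB
```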
JPEG Encoding
MPEG Encoding
• Used to compress video.
• Basic idea:
 Each video is a rapid sequence of frames. Each frame is a spatial combination of pixels, or a picture.
 Compressing video =
spatially compressing each frame
+
temporally compressing a set of frames.
MPEG Encoding
• Spatial Compression
 Each frame is spatially compressed by JPEG.
• Temporal Compression
 Redundant frames are removed.
 For example, in a static scene in which someone is talking,
most frames are the same except for the segment around the
speaker’s lips, which changes from one frame to the next.
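Temporal compression can be illustrated by simple frame differencing; real MPEG uses motion-compensated prediction, so this is only the idea, with frames modeled as flat lists of pixel values:

```python
def frame_delta(prev, curr):
    """Keep only the pixels that changed between two frames."""
    return {i: v for i, (p, v) in enumerate(zip(prev, curr)) if p != v}

def apply_delta(prev, delta):
    """Reconstruct the next frame from the previous one plus the changes."""
    frame = list(prev)
    for i, v in delta.items():
        frame[i] = v
    return frame

# A static scene: only the "lips" region (two pixels here) changes
prev = [10] * 16
curr = list(prev)
curr[6], curr[7] = 200, 180
delta = frame_delta(prev, curr)
assert apply_delta(prev, delta) == curr
assert len(delta) == 2          # 2 values stored instead of 16
```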
Audio Compression
• Used for speech or music.
• Speech: compress a 64 kbps digitized signal.
• Music: compress a 1.411 Mbps signal.
• Two categories of techniques:
 Predictive encoding
 Perceptual encoding

Audio Encoding
• Predictive Encoding
 Only the differences between samples are encoded, not the whole sample values.
 Several standards: GSM (13 kbps), G.729 (8 kbps), and G.723.1 (6.3 or 5.3 kbps)

• Perceptual Encoding: MP3
 CD-quality audio needs at least 1.411 Mbps and cannot be sent
over the Internet without compression.
 MP3 (MPEG audio layer 3) uses perceptual encoding technique
to compress audio.
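The core of predictive encoding, storing differences instead of full sample values, can be sketched as follows (the standards above add quantization and prediction filters on top of this idea):

```python
def predictive_encode(samples):
    """Store the first sample, then only sample-to-sample differences."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def predictive_decode(diffs):
    """Undo the differencing by running a cumulative sum."""
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out

# Smooth, speech-like samples: the differences are small numbers,
# so they need far fewer bits to encode than the raw values
samples = [100, 102, 101, 104, 108, 107]
diffs = predictive_encode(samples)
assert predictive_decode(diffs) == samples
assert max(abs(d) for d in diffs[1:]) <= 4
```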
References
• http://www.csie.kuas.edu.tw/course/cs/english/ch-15.ppt
• CS157B Lecture 19 by Professor Lee, http://cs.sjsu.edu/~lee/cs157b/cs157b.html
• “The Essentials of Computer Organization and Architecture” by Linda Null and Julia Lobur
Data Compression

QUESTIONS?