
INTERNATIONAL EDITION

Fundamentals of Multimedia

Ze-Nian Li and Mark S. Drew

School of Computing Science
Simon Fraser University

PEARSON Prentice Hall
Pearson Education International

If you purchased this book within the United States or Canada
you should be aware that it has been wrongfully imported
without the approval of the Publisher or the Author.

Vice President and Editorial Director, ECS: Marcia J. Horton
Senior Acquisitions Editor: Kate Hargett
Editorial Assistant: Michael Giacobbe
Vice President and Director of Production and Manufacturing, ESM: David W. Riccardi
Executive Managing Editor: Vince O'Brien
Managing Editor: Camille Trentacoste
Production Editor: Irwin Zucker
Director of Creative Services: Paul Belfanti
Art Director and Cover Manager: Jayne Conte
Cover Designer: Suzanne Behnke
Managing Editor, AV Management and Production: Patricia Burns
Art Editor: Gregory Dulles
Manufacturing Manager: Trudy Pisciotti
Manufacturing Buyer: Lisa McDowell
Marketing Manager: Pamela Shaffer

To my mom, and my wife Yansin.
Ze-Nian

To Noah, James (Ira), Eva, and, especially, to Jenna.
Mark

PEARSON © 2004 by Pearson Education, Inc.
Pearson Prentice Hall
Pearson Education, Inc.
Upper Saddle River, NJ 07458

All rights reserved. No part of this book may be reproduced in any format or by any means, without permission
in writing from the publisher.

Images of Lena that appear in Figures 3.1, 3.3, 3.4, 3.10, 8.20, 9.2, and 9.3 are reproduced by special permission
of Playboy magazine. Copyright 1972 by Playboy.

The author and publisher of this book have used their best efforts in preparing this book. These efforts include
the development, research, and testing of the theories and programs to determine their effectiveness. The author
and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the
documentation contained in this book. The author and publisher shall not be liable in any event for incidental or
consequential damages in connection with, or arising out of, the furnishing, performance, or use of these
programs.

Printed in the United States of America

10 9 8 7 6 5

ISBN 0-13-127256-X
Pearson Education Ltd.
Pearson Education Australia Pty. Limited
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd.
Pearson Education Canada, Ltd.
Pearson Educación de Mexico, S.A. de C.V.
Pearson Education Japan
Pearson Education Malaysia, Pte. Ltd.
Pearson Education, Upper Saddle River, New Jersey
List of Trademarks

The following is a list of products noted in this text that are trademarks or registered trademarks of their
associated companies.

3D Studio Max is a registered trademark of Autodesk, Inc.
After Effects, Illustrator, Photoshop, Premiere, and Cool Edit are registered trademarks of Adobe Systems, Inc.
Authorware, Director, Dreamweaver, Fireworks, and Freehand are registered trademarks, and Flash and Soundedit are trademarks of Macromedia, Inc., in the United States and/or other countries.
Cakewalk Pro Audio is a trademark of Twelve Tone Systems, Inc.
CorelDRAW is a registered trademark of Corel and/or its subsidiaries in Canada, the United States and/or other countries.
Cubase is a registered trademark of Pinnacle Systems.
DirectX, Internet Explorer, PowerPoint, Windows, Word, Visual Basic, and Visual C++ are registered trademarks of Microsoft Corporation in the United States and/or other countries.
Gifcon is a trademark of Alchemy Mindworks Corporation.
HyperCard and Final Cut Pro are registered trademarks of Apple Computer, Inc.
HyperStudio is a registered trademark of Sunburst Technology.
Java Media Framework and Java 3D are trademarks of Sun Microsystems, Inc., in the United States and other countries.
Jell-O is a registered trademark of Kraft Foods Incorporated.
MATLAB is a trademark of The MathWorks, Inc.
Maya and OpenGL are registered trademarks of Silicon Graphics Inc.
Mosaic is a registered trademark of National Center for Supercomputing Applications (NCSA).
Netscape is a registered trademark of Netscape Communications Corporation in the U.S. and other countries.
Playstation is a registered trademark of Sony Corporation.
Pro Tools is a registered trademark of Avid Technology, Inc.
Quest Multimedia Authoring System is a registered trademark of Allen Communication Learning Services.
RenderMan is a registered trademark of Pixar Animation Studios.
Slinky is a registered trademark of Slinky Toys.
Softimage XSI is a registered trademark of Avid Technology Inc.
Sound Forge is a registered trademark of Sonic Foundry.
WinZip is a registered trademark of WinZip Computing, Inc.

Contents

Preface

I Multimedia Authoring and Data Representations 1

1 Introduction to Multimedia 3
1.1 What is Multimedia? 3
1.1.1 Components of Multimedia 3
1.1.2 Multimedia Research Topics and Projects 4
1.2 Multimedia and Hypermedia 5
1.2.1 History of Multimedia 5
1.2.2 Hypermedia and Multimedia 7
1.3 World Wide Web 8
1.3.1 History of the WWW 8
1.3.2 HyperText Transfer Protocol (HTTP) 9
1.3.3 HyperText Markup Language (HTML) 10
1.3.4 Extensible Markup Language (XML) 11
1.3.5 Synchronized Multimedia Integration Language (SMIL) 12
1.4 Overview of Multimedia Software Tools 14
1.4.1 Music Sequencing and Notation 14
1.4.2 Digital Audio 15
1.4.3 Graphics and Image Editing 15
1.4.4 Video Editing 15
1.4.5 Animation 16
1.4.6 Multimedia Authoring 17
1.5 Further Exploration 17
1.6 Exercises 18
1.7 References 19

2 Multimedia Authoring and Tools 20
2.1 Multimedia Authoring 20
2.1.1 Multimedia Authoring Metaphors 21
2.1.2 Multimedia Production 23
2.1.3 Multimedia Presentation 25
2.1.4 Automatic Authoring 33
2.2 Some Useful Editing and Authoring Tools 37
2.2.1 Adobe Premiere 37
2.2.2 Macromedia Director 40
2.2.3 Macromedia Flash 46
2.2.4 Dreamweaver 51
2.3 VRML 51
2.3.1 Overview 51
2.3.2 Animation and Interactions 54
2.3.3 VRML Specifics 54
2.4 Further Exploration 55
2.5 Exercises 56
2.6 References 59
3 Graphics and Image Data Representations 60
3.1 Graphics/Image Data Types 60
3.1.1 1-Bit Images 61
3.1.2 8-Bit Gray-Level Images 61
3.1.3 Image Data Types 64
3.1.4 24-Bit Color Images 64
3.1.5 8-Bit Color Images 65
3.1.6 Color Lookup Tables (LUTs) 67
3.2 Popular File Formats 71
3.2.1 GIF 71
3.2.2 JPEG 75
3.2.3 PNG 76
3.2.4 TIFF 77
3.2.5 EXIF 77
3.2.6 Graphics Animation Files 77
3.2.7 PS and PDF 78
3.2.8 Windows WMF 78
3.2.9 Windows BMP 78
3.2.10 Macintosh PAINT and PICT 78
3.2.11 X Windows PPM 79
3.3 Further Exploration 79
3.4 Exercises 79
3.5 References 81

4 Color in Image and Video 82
4.1 Color Science 82
4.1.1 Light and Spectra 82
4.1.2 Human Vision 84
4.1.3 Spectral Sensitivity of the Eye 84
4.1.4 Image Formation 85
4.1.5 Camera Systems 86
4.1.6 Gamma Correction 87
4.1.7 Color-Matching Functions 89
4.1.8 CIE Chromaticity Diagram 91
4.1.9 Color Monitor Specifications 94
4.1.10 Out-of-Gamut Colors 95
4.1.11 White-Point Correction 96
4.1.12 XYZ to RGB Transform 97
4.1.13 Transform with Gamma Correction 97
4.1.14 L*a*b* (CIELAB) Color Model 98
4.1.15 More Color-Coordinate Schemes 100
4.1.16 Munsell Color Naming System 100
4.2 Color Models in Images 100
4.2.1 RGB Color Model for CRT Displays 100
4.2.2 Subtractive Color: CMY Color Model 101
4.2.3 Transformation from RGB to CMY 101
4.2.4 Undercolor Removal: CMYK System 102
4.2.5 Printer Gamuts 102
4.3 Color Models in Video 104
4.3.1 Video Color Transforms 104
4.3.2 YUV Color Model 104
4.3.3 YIQ Color Model 105
4.3.4 YCbCr Color Model 107
4.4 Further Exploration 107
4.5 Exercises 108
4.6 References 111

5 Fundamental Concepts in Video 112
5.1 Types of Video Signals 112
5.1.1 Component Video 112
5.1.2 Composite Video 113
5.1.3 S-Video 113
5.2 Analog Video 113
5.2.1 NTSC Video 116
5.2.2 PAL Video 119
5.2.3 SECAM Video 119
5.3 Digital Video 119
5.3.1 Chroma Subsampling 120
5.3.2 CCIR Standards for Digital Video 120
5.3.3 High Definition TV (HDTV) 122
5.4 Further Exploration 124
5.5 Exercises 124
5.6 References 125

6 Basics of Digital Audio 126
6.1 Digitization of Sound 126
6.1.1 What Is Sound? 126
6.1.2 Digitization 127
6.1.3 Nyquist Theorem 128
6.1.4 Signal-to-Noise Ratio (SNR) 131
6.1.5 Signal-to-Quantization-Noise Ratio (SQNR) 131
6.1.6 Linear and Nonlinear Quantization 133
6.1.7 Audio Filtering 136
6.1.8 Audio Quality versus Data Rate 136
6.1.9 Synthetic Sounds 137
6.2 MIDI: Musical Instrument Digital Interface 139
6.2.1 MIDI Overview 139
6.2.2 Hardware Aspects of MIDI 142
6.2.3 Structure of MIDI Messages 143
6.2.4 General MIDI 147
6.2.5 MIDI-to-WAV Conversion 147
6.3 Quantization and Transmission of Audio 147
6.3.1 Coding of Audio 147
6.3.2 Pulse Code Modulation 148
6.3.3 Differential Coding of Audio 150
6.3.4 Lossless Predictive Coding 151
6.3.5 DPCM 154
6.3.6 DM 157
6.3.7 ADPCM 158
6.4 Further Exploration 159
6.5 Exercises 160
6.6 References 163
II Multimedia Data Compression 165

7 Lossless Compression Algorithms 167
7.1 Introduction 167
7.2 Basics of Information Theory 168
7.3 Run-Length Coding 171
7.4 Variable-Length Coding (VLC) 171
7.4.1 Shannon–Fano Algorithm 171
7.4.2 Huffman Coding 173
7.4.3 Adaptive Huffman Coding 176
7.5 Dictionary-Based Coding 181
7.6 Arithmetic Coding 187
7.7 Lossless Image Compression 191
7.7.1 Differential Coding of Images 191
7.7.2 Lossless JPEG 193
7.8 Further Exploration 194
7.9 Exercises 195
7.10 References 197

8 Lossy Compression Algorithms 199
8.1 Introduction 199
8.2 Distortion Measures 199
8.3 The Rate-Distortion Theory 200
8.4 Quantization 200
8.4.1 Uniform Scalar Quantization 201
8.4.2 Nonuniform Scalar Quantization 204
8.4.3 Vector Quantization* 206
8.5 Transform Coding 207
8.5.1 Discrete Cosine Transform (DCT) 207
8.5.2 Karhunen–Loève Transform* 220
8.6 Wavelet-Based Coding 222
8.6.1 Introduction 222
8.6.2 Continuous Wavelet Transform* 227
8.6.3 Discrete Wavelet Transform* 230
8.7 Wavelet Packets 240
8.8 Embedded Zerotree of Wavelet Coefficients 241
8.8.1 The Zerotree Data Structure 242
8.8.2 Successive Approximation Quantization 244
8.8.3 EZW Example 244
8.9 Set Partitioning in Hierarchical Trees (SPIHT) 247
8.10 Further Exploration 248
8.11 Exercises 249
8.12 References 252

9 Image Compression Standards 253
9.1 The JPEG Standard 253
9.1.1 Main Steps in JPEG Image Compression 253
9.1.2 JPEG Modes 262
9.1.3 A Glance at the JPEG Bitstream 265
9.2 The JPEG2000 Standard 265
9.2.1 Main Steps of JPEG2000 Image Compression 267
9.2.2 Adapting EBCOT to JPEG2000 275
9.2.3 Region-of-Interest Coding 275
9.2.4 Comparison of JPEG and JPEG2000 Performance 277
9.3 The JPEG-LS Standard 277
9.3.1 Prediction 280
9.3.2 Context Determination 281
9.3.3 Residual Coding 281
9.3.4 Near-Lossless Mode 281
9.4 Bilevel Image Compression Standards 282
9.4.1 The JBIG Standard 282
9.4.2 The JBIG2 Standard 282
9.5 Further Exploration 284
9.6 Exercises 285
9.7 References 287

10 Basic Video Compression Techniques 288
10.1 Introduction to Video Compression 288
10.2 Video Compression Based on Motion Compensation 288
10.3 Search for Motion Vectors 290
10.3.1 Sequential Search 290
10.3.2 2D Logarithmic Search 291
10.3.3 Hierarchical Search 293
10.4 H.261 295
10.4.1 Intra-Frame (I-Frame) Coding 297
10.4.2 Inter-Frame (P-Frame) Predictive Coding 297
10.4.3 Quantization in H.261 297
10.4.4 H.261 Encoder and Decoder 298
10.4.5 A Glance at the H.261 Video Bitstream Syntax 301
10.5 H.263 303
10.5.1 Motion Compensation in H.263 304
10.5.2 Optional H.263 Coding Modes 305
10.5.3 H.263+ and H.263++ 307
10.6 Further Exploration 308
10.7 Exercises 309
10.8 References 310

11 MPEG Video Coding I – MPEG-1 and 2 312
11.1 Overview 312
11.2 MPEG-1 312
11.2.1 Motion Compensation in MPEG-1 313
11.2.2 Other Major Differences from H.261 315
11.2.3 MPEG-1 Video Bitstream 318
11.3 MPEG-2 319
11.3.1 Supporting Interlaced Video 320
11.3.2 MPEG-2 Scalabilities 323
11.3.3 Other Major Differences from MPEG-1 329
11.4 Further Exploration 330
11.5 Exercises 330
11.6 References 331
12 MPEG Video Coding II – MPEG-4, 7, and Beyond 332
12.1 Overview of MPEG-4 332
12.2 Object-Based Visual Coding in MPEG-4 335
12.2.1 VOP-Based Coding vs. Frame-Based Coding 335
12.2.2 Motion Compensation 337
12.2.3 Texture Coding 341
12.2.4 Shape Coding 343
12.2.5 Static Texture Coding 346
12.2.6 Sprite Coding 347
12.2.7 Global Motion Compensation 348
12.3 Synthetic Object Coding in MPEG-4 349
12.3.1 2D Mesh Object Coding 349
12.3.2 3D Model-based Coding 354
12.4 MPEG-4 Object Types, Profiles and Levels 356
12.5 MPEG-4 Part 10/H.264 357
12.5.1 Core Features 358
12.5.2 Baseline Profile Features 360
12.5.3 Main Profile Features 360
12.5.4 Extended Profile Features 361
12.6 MPEG-7 361
12.6.1 Descriptor (D) 363
12.6.2 Description Scheme (DS) 365
12.6.3 Description Definition Language (DDL) 368
12.7 MPEG-21 369
12.8 Further Exploration 370
12.9 Exercises 370
12.10 References 371

13 Basic Audio Compression Techniques 374
13.1 ADPCM in Speech Coding 374
13.1.1 ADPCM 374
13.2 G.726 ADPCM 376
13.3 Vocoders 378
13.3.1 Phase Insensitivity 378
13.3.2 Channel Vocoder 378
13.3.3 Formant Vocoder 380
13.3.4 Linear Predictive Coding 380
13.3.5 CELP 383
13.3.6 Hybrid Excitation Vocoders* 389
13.4 Further Exploration 392
13.5 Exercises 392
13.6 References 393

14 MPEG Audio Compression 395
14.1 Psychoacoustics 395
14.1.1 Equal-Loudness Relations 396
14.1.2 Frequency Masking 398
14.1.3 Temporal Masking 403
14.2 MPEG Audio 405
14.2.1 MPEG Layers 405
14.2.2 MPEG Audio Strategy 406
14.2.3 MPEG Audio Compression Algorithm 407
14.2.4 MPEG-2 AAC (Advanced Audio Coding) 412
14.2.5 MPEG-4 Audio 414
14.3 Other Commercial Audio Codecs 415
14.4 The Future: MPEG-7 and MPEG-21 415
14.5 Further Exploration 416
14.6 Exercises 416
14.7 References 417

III Multimedia Communication and Retrieval 419

15 Computer and Multimedia Networks 421
15.1 Basics of Computer and Multimedia Networks 421
15.1.1 OSI Network Layers 421
15.1.2 TCP/IP Protocols 422
15.2 Multiplexing Technologies 425
15.2.1 Basics of Multiplexing 425
15.2.2 Integrated Services Digital Network (ISDN) 427
15.2.3 Synchronous Optical Network (SONET) 428
15.2.4 Asymmetric Digital Subscriber Line (ADSL) 429
15.3 LAN and WAN 430
15.3.1 Local Area Networks (LANs) 431
15.3.2 Wide Area Networks (WANs) 434
15.3.3 Asynchronous Transfer Mode (ATM) 435
15.3.4 Gigabit and 10-Gigabit Ethernets 438
15.4 Access Networks 439
15.5 Common Peripheral Interfaces 441
15.6 Further Exploration 441
15.7 Exercises 442
15.8 References 442

16 Multimedia Network Communications and Applications 443
16.1 Quality of Multimedia Data Transmission 443
16.1.1 Quality of Service (QoS) 443
16.1.2 QoS for IP Protocols 446
16.1.3 Prioritized Delivery 447
16.2 Multimedia over IP 447
16.2.1 IP-Multicast 447
16.2.2 RTP (Real-time Transport Protocol) 449
16.2.3 Real Time Control Protocol (RTCP) 451
16.2.4 Resource ReSerVation Protocol (RSVP) 451
16.2.5 Real-Time Streaming Protocol (RTSP) 453
16.2.6 Internet Telephony 455
16.3 Multimedia over ATM Networks 459
16.3.1 Video Bitrates over ATM 459
16.3.2 ATM Adaptation Layer (AAL) 460
16.3.3 MPEG-2 Convergence to ATM 461
16.3.4 Multicast over ATM 462
16.4 Transport of MPEG-4 462
16.4.1 DMIF in MPEG-4 462
16.4.2 MPEG-4 over IP 463
16.5 Media-on-Demand (MOD) 464
16.5.1 Interactive TV (ITV) and Set-Top Box (STB) 464
16.5.2 Broadcast Schemes for Video-on-Demand 465
16.5.3 Buffer Management 472
16.6 Further Exploration 475
16.7 Exercises 476
16.8 References 477

17 Wireless Networks 479
17.1 Wireless Networks 479
17.1.1 Analog Wireless Networks 480
17.1.2 Digital Wireless Networks 481
17.1.3 TDMA and GSM 481
17.1.4 Spread Spectrum and CDMA 483
17.1.5 Analysis of CDMA 486
17.1.6 3G Digital Wireless Networks 488
17.1.7 Wireless LAN (WLAN) 492
17.2 Radio Propagation Models 493
17.2.1 Multipath Fading 494
17.2.2 Path Loss 496
17.3 Multimedia over Wireless Networks 496
17.3.1 Synchronization Loss 497
17.3.2 Error Resilient Entropy Coding 499
17.3.3 Error Concealment 501
17.3.4 Forward Error Correction (FEC) 503
17.3.5 Trends in Wireless Interactive Multimedia 506
17.4 Further Exploration 508
17.5 Exercises 508
17.6 References 510

18 Content-Based Retrieval in Digital Libraries 511
18.1 How Should We Retrieve Images? 511
18.2 C-BIRD — A Case Study 513
18.2.1 C-BIRD GUI 514
18.2.2 Color Histogram 514
18.2.3 Color Density 516
18.2.4 Color Layout 516
18.2.5 Texture Layout 517
18.2.6 Search by Illumination Invariance 519
18.2.7 Search by Object Model 520
18.3 Synopsis of Current Image Search Systems 533
18.3.1 QBIC 535
18.3.2 UC Santa Barbara Search Engines 536
18.3.3 Berkeley Digital Library Project 536
18.3.4 Chabot 536
18.3.5 Blobworld 537
18.3.6 Columbia University Image Seekers 537
18.3.7 Informedia 537
18.3.8 MetaSEEk 537
18.3.9 Photobook and FourEyes 538
18.3.10 MARS 538
18.3.11 Virage 538
18.3.12 Viper 538
18.3.13 Visual RetrievalWare 538
18.4 Relevance Feedback 539
18.4.1 MARS 539
18.4.2 iFind 541
18.5 Quantifying Results 541
18.6 Querying on Videos 542
18.7 Querying on Other Formats 544
18.8 Outlook for Content-Based Retrieval 544
18.9 Further Exploration 545
18.10 Exercises 546
18.11 References 547
Preface

A course in multimedia is rapidly becoming a necessity in computer science and engineering curricula, especially now that multimedia touches most aspects of these fields. Multimedia was originally seen as a vertical application area; that is, a niche application with methods that belong only to itself. However, like pervasive computing, multimedia is now essentially a horizontal application area and forms an important component of the study of computer graphics, image processing, databases, real-time systems, operating systems, information retrieval, computer networks, computer vision, and so on. Multimedia is no longer just a toy but forms part of the technological environment in which we work and think. This book fills the need for a university-level text that examines a good deal of the core agenda computer science sees as belonging to this subject area. Multimedia has become associated with a certain set of issues in computer science and engineering, and we address those here.

The book is not an introduction to simple design issues; it serves a more advanced audience than that. On the other hand, it is not a reference work; it is more a traditional textbook. While we perforce discuss multimedia tools, we would like to give a sense of the underlying principles in the tasks those tools carry out. Students who undertake and succeed in a course based on this text can be said to really understand fundamental matters in regard to this material; hence the title of the text.

In conjunction with this text, a full-fledged course should also allow students to make use of this knowledge to carry out interesting or even wonderful practical projects in multimedia, interactive projects that engage and sometimes amuse and, perhaps, even teach these same concepts.

Who Should Read This Book?

This text aims at introducing the basic ideas in multimedia to an audience comfortable with technical applications; that is, computer science and engineering students. It aims to cover an upper-level undergraduate multimedia course but could also be used in more advanced courses and would be a good reference for anyone, including those in industry, interested in current multimedia technologies. Graduate students needing a solid grounding in materials they may not have seen before would undoubtedly benefit from reading it.

The text mainly presents concepts, not applications. A multimedia course, on the other hand, teaches these concepts and tests them but also allows students to use coding and presentation skills they already know to address problems in multimedia. The accompanying web site shows some of the code for multimedia applications, along with some of the better projects students have developed in such a course and other useful materials best presented electronically.

The ideas in the text drive the results shown in student projects. We assume the reader knows how to program and is also completely comfortable learning yet another tool. Instead of concentrating on tools, however, we emphasize what students do not already know.

Using the methods and ideas collected here, students are also able to learn more themselves, sometimes in a job setting. It is not unusual for students who take the type of multimedia course this text aims at to go on to jobs in a multimedia-related industry immediately after their senior year, and sometimes before. The selection of material in the text addresses real issues these learners will face as soon as they show up in the workplace. Some topics are simple but new to the students; some are more complex but unavoidable in this emerging area.

Have the Authors Used This Material in a Real Class?

Since 1996, we have taught a third-year undergraduate course in multimedia systems based on the introductory materials set out in this book. A one-semester course could very likely not include all the material covered in this text, but we have usually managed to consider a good many of the topics addressed and to mention a select number of issues in Part III within that time frame.

Over the same time period, as an introduction to more advanced materials, we have also taught a one-semester graduate-level course using notes covering topics similar to the ground covered by this text. A fourth-year or graduate course would do well to consider material from Parts I and II of the book and then some material from Part III, perhaps in conjunction with some of the original research references included here and results presented at topical conferences.

We have attempted to fill both needs, concentrating on an undergraduate audience but including more advanced material as well. Sections that can safely be omitted on a first reading are marked with an asterisk.

What is Covered in This Text?

In Part I, Multimedia Authoring and Data Representations, we introduce some of the notions included in the term multimedia and look at its history as well as its present. Practically speaking, we carry out multimedia projects using software tools, so in addition to an overview of these tools, we get down to some of the nuts and bolts of multimedia authoring. Representing data is critical in multimedia, and we look at the most important data representations for multimedia applications, examining image data, video data, and audio data in detail. Since color is vitally important in multimedia programs, we see how this important area impacts multimedia issues.

In Part II, Multimedia Data Compression, we consider how we can make all this data fly onto the screen and speakers. Data compression turns out to be an important enabling technology that makes modern multimedia systems possible, so we look at lossless and lossy compression methods. For the latter category, JPEG still-image compression standards, including JPEG2000, are arguably the most important, so we consider these in detail. But since a picture is worth a thousand words and video is worth more than a million words per minute, we examine the ideas behind MPEG standards MPEG-1, MPEG-2, MPEG-4, MPEG-7, and beyond. Separately, we consider some basic audio compression techniques and take a look at MPEG Audio, including MP3.

In Part III, Multimedia Communication and Retrieval, we consider the great demands multimedia places on networks and systems. We go on to consider network technologies
and protocols that make interactive multimedia possible. Some of the applications discussed include multimedia on demand, multimedia over IP, multimedia over ATM, and multimedia over wireless networks. Content-based retrieval is a particularly important issue in digital libraries and interactive multimedia, so we examine ideas and systems for this application in some detail.

Textbook Web Site

The book's web site is www.cs.sfu.ca/mmbook. There, you will find copies of figures from the book, an errata sheet updated regularly, programs that help demonstrate concepts in the text, and a dynamic set of links for the Further Exploration section of each chapter. Since these links are regularly updated (and of course URLs change often), they are mostly online rather than in the text.

Instructors' Resources

The main text web site has no ID and password, but access to sample student projects is at the instructor's discretion and is password-protected. Prentice Hall also hosts a web site containing course instructor resources for adopters of the text. These include an extensive collection of online course notes, a one-semester course syllabus and calendar of events, solutions for the exercises in the text, sample assignments and solutions, sample exams, and extra exam questions.

Acknowledgements

We are most grateful to colleagues who generously gave of their time to review this text, and we wish to express our thanks to Shu-Ching Chen, Edward Chang, Qianping Gu, Rachelle S. Heller, Gongzhu Hu, S. N. Jayaram, Tiko Kameda, Xiaobo Li, Siwei Lu, Dennis Richards, and Jacques Vaisey.

The writing of this text has been greatly aided by a number of suggestions from present and former colleagues and students. We would like to thank James Au, Chad Ciavarro, Hao Jiang, Steven Kilthau, Michael King, Cheng Lu, Yi Sun, Dominic Szopa, Zinovi Tauber, Malte von Rüden, Jian Wang, Jie Wei, Edward Yan, Yingchen Yang, Osmar Zaïane, Wenbiao Zhang, and William Zhong for their assistance. As well, Mr. Ye Lu made great contributions to Chapters 8 and 9, and his valiant efforts are particularly appreciated. We are also most grateful for the students who generously made their course projects available for instructional use for this book.

PART ONE

MULTIMEDIA AUTHORING AND DATA REPRESENTATIONS

Chapter 1 Introduction to Multimedia 3
Chapter 2 Multimedia Authoring and Tools 20
Chapter 3 Graphics and Image Data Representations 60
Chapter 4 Color in Image and Video 82
Chapter 5 Fundamental Concepts in Video 112
Chapter 6 Basics of Digital Audio 126

Introduction to Multimedia

As an introduction to multimedia, in Chapter 1 we consider the question of just what multimedia is. We examine its history and the development of hypertext and hypermedia. We then get down to practical matters with an overview of multimedia software tools. These are the basic means we use to develop multimedia content. But a multimedia production is much more than the sum of its parts, so Chapter 2 looks at the nuts and bolts of multimedia authoring design and a taxonomy of authoring metaphors. The chapter also sets out a list of important contemporary multimedia authoring tools in current use.

Multimedia Data Representations

As in many fields, the issue of how best to represent the data is of crucial importance in the study of multimedia. Chapters 3 through 6 consider how this is addressed in this field, setting out the most important data representations in multimedia applications. Because the main areas of concern are images, moving pictures, and audio, we begin investigating these in Chapter 3, Graphics and Image Data Representations, then look at Basics of Video in Chapter 5. Before going on to Chapter 6, Basics of Digital Audio, we take a side trip in Chapter 4 into issues on the use of color, since color is vitally important in multimedia.

CHAPTER 1

Introduction to Multimedia

1.1 WHAT IS MULTIMEDIA?

People who use the term "multimedia" often seem to have quite different, even opposing, viewpoints. A PC vendor would like us to think of multimedia as a PC that has sound capability, a DVD-ROM drive, and perhaps the superiority of multimedia-enabled microprocessors that understand additional multimedia instructions. A consumer entertainment vendor may think of multimedia as interactive cable TV with hundreds of digital channels, or a cable-TV-like service delivered over a high-speed Internet connection.

A computer science student reading this book likely has a more application-oriented view of what multimedia consists of: applications that use multiple modalities to their advantage, including text, images, drawings (graphics), animation, video, sound (including speech), and, most likely, interactivity of some kind. The popular notion of "convergence" is one that inhabits the college campus as it does the culture at large. In this scenario, PCs, DVDs, games, digital TV, set-top web surfing, wireless, and so on are converging in technology, presumably to arrive in the near future at a final all-around, multimedia-enabled product. While hardware may indeed involve such devices, the present is already exciting: multimedia is part of some of the most interesting projects underway in computer science. The convergence going on in this field is in fact a convergence of areas that have in the past been separated but are now finding much to share in this new application area. Graphics, visualization, HCI, computer vision, data compression, graph theory, networking, and database systems all have important contributions to make in multimedia at the present time.

1.1.1 Components of Multimedia

The multiple modalities of text, audio, images, drawings, animation, and video in multimedia are put to use in ways as diverse as

• Video teleconferencing

• Distributed lectures for higher education

• Telemedicine

• Cooperative work environments that allow business people to edit a shared document or schoolchildren to share a single game using two mice that pass control back and forth
• Searching (very) large video and image databases for target visual objects

• "Augmented" reality: placing real-appearing computer graphics and video objects into scenes so as to take the physics of objects and lights (e.g., shadows) into account

• Audio cues for where video-conference participants are seated, as well as taking into account gaze direction and attention of participants

• Building searchable features into new video and enabling very high to very low bitrate use of new, scalable multimedia products

• Making multimedia components editable: allowing the user side to decide what components (video, graphics, and so on) are actually viewed, allowing the client to move components around or delete them, and making components distributed

• Building "inverse-Hollywood" applications that can re-create the process by which a video was made, allowing storyboard pruning and concise video summarization

• Using voice recognition to build an interactive environment, say a kitchen-wall web browser

From the computer science student's point of view, what makes multimedia interesting is that so much of the material covered in traditional computer science areas bears on the multimedia enterprise: networks, operating systems, real-time systems, vision, information retrieval. Like databases, multimedia touches on many traditional areas.

1.1.2 Multimedia Research Topics and Projects

To the computer science researcher, multimedia consists of a wide variety of topics [1]:

• Multimedia processing and coding. This includes multimedia content analysis, content-based multimedia retrieval, multimedia security, audio/image/video processing, compression, and so on.

• Multimedia system support and networking. People look at such topics as network protocols, Internet, operating systems, servers and clients, quality of service (QoS), and databases.

• Multimedia tools, end systems, and applications. These include hypermedia systems, user interfaces, authoring systems, multimodal interaction, and integration.

Current Multimedia Projects Many exciting research projects are currently underway in multimedia, and we'd like to introduce a few of them here.

For example, researchers are interested in camera-based object tracking technology. One aim is to develop control systems for industrial control, gaming, and so on that rely on moving scale models (toys) around a real environment (a board game, say). Tracking the control objects (toys) provides user control of the process.

3D motion capture can also be used for multiple actor capture, so that multiple real actors in a virtual studio can be used to automatically produce realistic animated models with natural movement.

Multiple views from several cameras or from a single camera under differing lighting can accurately acquire data that gives both the shape and surface properties of materials, thus automatically generating synthetic graphics models. This allows photo-realistic (video-quality) synthesis of virtual actors.

3D capture technology is next to fast enough now to allow acquiring dynamic characteristics of human facial expression during speech, to synthesize highly realistic facial animation from speech.

Multimedia applications aimed at handicapped persons, particularly those with poor vision and the elderly, are a rich field of endeavor in current research.

"Digital fashion" aims to develop smart clothing that can communicate with other such enhanced clothing using wireless communication, so as to artificially enhance human interaction in a social setting. The vision here is to use technology to allow individuals to allow certain thoughts and feelings to be broadcast automatically, for exchange with others equipped with similar technology.

Georgia Tech's Electronic Housecall system, an initiative for providing interactive health monitoring services to patients in their homes, relies on networks for delivery, challenging current capabilities.

Behavioral science models can be brought into play to model interaction between people, which can then be extended to enable natural interaction by virtual characters. Such "augmented interaction" applications can be used to develop interfaces between real and virtual humans for tasks such as augmented storytelling.

Each of these application areas pushes the development of computer science generally, stimulates new applications, and fascinates practitioners.

1.2 MULTIMEDIA AND HYPERMEDIA

To place multimedia in its proper context, in this section we briefly consider the history of multimedia, a recent part of which is the connection between multimedia and hypermedia.
“ubiquity” — web-everywhere devices, multimedia education, including computer We go on to a quick overview of multimedia software tools available for creation of multi
supported coilaborative leaming and design, and applications of virtual environments. media content, which prepares us to examine, in Chapter 2, the larger issue of integrating
this content into full-blown multimedia productions.
The concerus of multimedia researchers also impact researchers in almost every other branch
of computer science. For example, data mining is an important current research area, and 1.2.1 History of Multimedia
a large database of multimedia data objects is a good example of just what we may be
A brief history of the use of mullimedia tu communicate ideas might begin with newspapers,
interested in mining. Telemedicine applications, such as “telemedical patient consultative
which were perhaps thefirsi mass communication medium, using text, graphics, and images.
encounters’ are multimedia applications that place a heavy burden on existing network
architectures.
6 Chapter 1 Introduction to Multimedia Section 1.2 Multimedia and Hypermedia 7

Motion pictures were originally conceived of in the 1830s to observe motion too rapid for perception by the human eye. Thomas Alva Edison commissioned the invention of a motion picture camera in 1887. Silent feature films appeared from 1910 to 1927; the silent era effectively ended with the release of The Jazz Singer in 1927.

In 1895, Guglielmo Marconi sent his first wireless radio transmission at Pontecchio, Italy. A few years later (1901), he detected radio waves beamed across the Atlantic. Initially invented for telegraph, radio is now a major medium for audio broadcasting. In 1909, Marconi shared the Nobel Prize for physics. (Reginald A. Fessenden, of Quebec, beat Marconi to human voice transmission by several years, but not all inventors receive due credit. Nevertheless, Fessenden was paid $2.5 million in 1928 for his purloined patents.)

Television was the new medium for the twentieth century. It established video as a commonly available medium and has since changed the world of mass communication.

The connection between computers and ideas about multimedia covers what is actually only a short period:

1945 As part of MIT's postwar deliberations on what to do with all those scientists employed on the war effort, Vannevar Bush (1890–1974) wrote a landmark article [2] describing what amounts to a hypermedia system, called "Memex." Memex was meant to be a universally useful and personalized memory device that even included the concept of associative links — it really is the forerunner of the World Wide Web. After World War II, 6,000 scientists who had been hard at work on the war effort suddenly found themselves with time to consider other issues, and the Memex idea was one fruit of that new freedom.

1960s Ted Nelson started the Xanadu project and coined the term "hypertext." Xanadu was the first attempt at a hypertext system — Nelson called it a "magic place of literary memory."

1967 Nicholas Negroponte formed the Architecture Machine Group at MIT.

1968 Douglas Engelbart, greatly influenced by Vannevar Bush's "As We May Think," demonstrated the "On-Line System" (NLS), another early hypertext program. Engelbart's group at Stanford Research Institute aimed at "augmentation, not automation," to enhance human abilities through computer technology. NLS consisted of such critical ideas as an outline editor for idea development, hypertext links, teleconferencing, word processing, and e-mail, and made use of the mouse pointing device, windowing software, and help systems [3].

1969 Nelson and van Dam at Brown University created an early hypertext editor called FRESS [4]. The present-day Intermedia project by the Institute for Research in Information and Scholarship (IRIS) at Brown is the descendant of that early system.

1976 The MIT Architecture Machine Group proposed a project entitled "Multiple Media." This resulted in the Aspen Movie Map, the first hypermedia videodisc, in 1978.

1985 Negroponte and Wiesner cofounded the MIT Media Lab, a leading research institution investigating digital video and multimedia.

1989 Tim Berners-Lee proposed the World Wide Web to the European Council for Nuclear Research (CERN).

1990 Kristina Hooper Woolsey headed the Apple Multimedia Lab, with a staff of 100. Education was a chief goal.

1991 MPEG-1 was approved as an international standard for digital video. Its further development led to newer standards, MPEG-2, MPEG-4, and further MPEGs, in the 1990s.

1991 The introduction of PDAs in 1991 began a new period in the use of computers in general and multimedia in particular. This development continued in 1996 with the marketing of the first PDA with no keyboard.

1992 JPEG was accepted as the international standard for digital image compression. Its further development has now led to the new JPEG2000 standard.

1992 The first MBone audio multicast on the Net was made.

1993 The University of Illinois National Center for Supercomputing Applications produced NCSA Mosaic, the first full-fledged browser, launching a new era in Internet information access.

1994 Jim Clark and Marc Andreessen created the Netscape program.

1995 The JAVA language was created for platform-independent application development.

1996 DVD video was introduced; high-quality, full-length movies were distributed on a single disk. The DVD format promised to transform the music, gaming, and computer industries.

1998 XML 1.0 was announced as a W3C Recommendation.

1998 Handheld MP3 devices first made inroads into consumer tastes in the fall, with the introduction of devices holding 32 MB of flash memory.

2000 World Wide Web (WWW) size was estimated at over 1 billion pages.

1.2.2 Hypermedia and Multimedia

Ted Nelson invented the term "HyperText" around 1965. Whereas we may think of a book as a linear medium, basically meant to be read from beginning to end, a hypertext system is meant to be read nonlinearly, by following links that point to other parts of the document, or indeed to other documents. Figure 1.1 illustrates this idea.

Hypermedia is not constrained to be text-based. It can include other media, such as graphics, images, and especially the continuous media — sound and video. Apparently Ted Nelson was also the first to use this term. The World Wide Web (WWW) is the best example of a hypermedia application.

As we have seen, multimedia fundamentally means that computer information can be represented through audio, graphics, images, video, and animation in addition to traditional media (text and graphics). Hypermedia can be considered one particular multimedia application.
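The nonlinear reading order that distinguishes hypertext from a book can be sketched as a small directed graph of documents whose "hot spots" link to other documents. This is only an illustrative sketch; the document names and links below are made-up examples, not taken from the text.

```python
# A minimal sketch of hypertext as a directed graph: each document
# holds some text plus named links ("hot spots") to other documents.
# All document names here are hypothetical.

documents = {
    "home":    {"text": "Welcome page",          "links": ["history", "tools"]},
    "history": {"text": "History of multimedia", "links": ["home"]},
    "tools":   {"text": "Software tools",        "links": ["home", "history"]},
}

def reachable(start):
    """Follow links nonlinearly (depth-first) and return every
    document reachable from `start` -- the reader's possible paths."""
    seen, stack = set(), [start]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(documents[name]["links"])
    return seen

print(sorted(reachable("home")))  # prints ['history', 'home', 'tools']
```

Unlike a linear table of contents, any traversal order through such a graph is a legitimate "reading" of the hypermedia document.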

FIGURE 1.1: Hypertext is nonlinear.

Examples of typical multimedia applications include: digital video editing and production systems; electronic newspapers and magazines; the World Wide Web; online reference works, such as encyclopedias; games; groupware; home shopping; interactive TV; multimedia courseware; video conferencing; video-on-demand; and interactive movies.

1.3 WORLD WIDE WEB

The World Wide Web is the largest and most commonly used hypermedia application. Its popularity is due to the amount of information available from web servers, the capacity to post such information, and the ease of navigating such information with a web browser. WWW technology is maintained and developed by the World Wide Web Consortium (W3C), although the Internet Engineering Task Force (IETF) standardizes the technologies. The W3C has listed the following three goals for the WWW: universal access of web resources (by everyone everywhere), effectiveness of navigating available information, and responsible use of posted material.

1.3.1 History of the WWW

Amazingly, one of the most predominant networked multimedia applications has its roots in nuclear physics! As noted in the previous section, Tim Berners-Lee proposed the World Wide Web to CERN (European Center for Nuclear Research) as a means for organizing and sharing their work and experimental results. The following is a short list of important dates in the creation of the WWW:

1960s It is recognized that documents need to have formats that are human-readable and that identify structure and elements. Charles Goldfarb, Edward Mosher, and Raymond Lorie developed the Generalized Markup Language (GML) for IBM.

1986 The ISO released a final version of the Standard Generalized Markup Language (SGML), mostly based on the earlier GML.

1990 With approval from CERN, Tim Berners-Lee started developing a hypertext server, browser, and editor on a NeXTStep workstation. He invented hypertext markup language (HTML) and the hypertext transfer protocol (HTTP) for this purpose.

1993 NCSA released an alpha version of Mosaic based on the version by Marc Andreessen for the X Windows System. This was the first popular browser. Microsoft's Internet Explorer is based on Mosaic.

1994 Marc Andreessen and some colleagues from NCSA joined Dr. James H. Clark (also the founder of Silicon Graphics Inc.) to form Mosaic Communications Corporation. In November, the company changed its name to Netscape Communications Corporation.

1998 The W3C accepted XML version 1.0 specifications as a Recommendation. XML is the main focus of the W3C and supersedes HTML.

1.3.2 HyperText Transfer Protocol (HTTP)

HTTP is a protocol that was originally designed for transmitting hypermedia, but it also supports transmission of any file type. HTTP is a "stateless" request/response protocol, in the sense that a client typically opens a connection to the HTTP server, requests information, the server responds, and the connection is terminated — no information is carried over for the next request.

The basic request format is

Method URI Version
Additional-Headers

Message-body

The Uniform Resource Identifier (URI) identifies the resource accessed, such as the host name, always preceded by the token "http://". A URI could be a Uniform Resource Locator (URL), for example. Here, the URI can also include query strings (some interactions require submitting data). Method is a way of exchanging information or performing tasks on the URI. Two popular methods are GET and POST. GET specifies that the information requested is in the request string itself, while the POST method specifies that the resource pointed to in the URI should consider the message body. POST is generally used for submitting HTML forms. Additional-Headers specifies additional parameters about the client. For example, to request access to this textbook's web site, the following HTTP message might be generated:

GET http://www.cs.sfu.ca/mmbook/ HTTP/1.1

The basic response format is

Version Status-Code Status-Phrase
Additional-Headers

Message-body

Status-Code is a number that identifies the response type (or error that occurs), and Status-Phrase is a textual description of it. Two commonly seen status codes and phrases are 200 OK when the request was processed successfully and 404 Not Found when the URI does not exist. For example, in response to the example request above for this textbook's URL, the web server may return something like

HTTP/1.1 200 OK
Server: [No-plugs-here-please]
Date: Wed, 25 July 2002 20:04:30 GMT
Content-Length: 1045
Content-Type: text/html

<HTML>
...
</HTML>

1.3.3 HyperText Markup Language (HTML)

HTML is a language for publishing hypermedia on the World Wide Web. It is defined using SGML and derives elements that describe generic document structure and formatting. Since it uses ASCII, it is portable to all different (even binary-incompatible) computer hardware, which allows for global exchange of information. The current version of HTML is version 4.01, specified in 1999. The next generation of HTML is XHTML, a reformulation of HTML using XML.

HTML uses tags to describe document elements. The tags are in the format <token params> to define the start point of a document element and </token> to define the end of the element. Some elements have only inline parameters and don't require ending tags. HTML divides the document into a HEAD and a BODY part as follows:

<HTML>
<HEAD>
</HEAD>
<BODY>
</BODY>
</HTML>

The HEAD describes document definitions, which are parsed before any document rendering is done. These include page title, resource links, and meta-information the author decides to specify. The BODY part describes the document structure and content. Common structure elements are paragraphs, tables, forms, links, item lists, and buttons.

A very simple HTML page is as follows:

<HTML>
<HEAD>
<TITLE>
A sample web page.
</TITLE>
<META NAME = "Author" CONTENT = "Cranky Professor">
</HEAD>
<BODY>
<P>
We can put any text we like here, since this is
a paragraph element.
</P>
</BODY>
</HTML>

Naturally, HTML has more complex structures and can be mixed with other standards. The standard has evolved to allow integration with script languages, dynamic manipulation of almost all elements and properties after display on the client side (dynamic HTML), and modular customization of all rendering parameters using a markup language called Cascading Style Sheets (CSS). Nonetheless, HTML has rigid, nondescriptive structure elements, and modularity is hard to achieve.

1.3.4 Extensible Markup Language (XML)

There is a need for a markup language for the WWW that has modularity of data, structure, and view. That is, we would like a user or an application to be able to define the tags (structure) allowed in a document and their relationship to each other, in one place, then define data using these tags in another place (the XML file) and, finally, define in yet another document how to render the tags.

Suppose you wanted to have stock information retrieved from a database according to a user query. Using XML, you would use a global Document Type Definition (DTD) you have already defined for stock data. Your server-side script will abide by the DTD rules to generate an XML document according to the query, using data from your database. Finally, you will send users your XML Style Sheet (XSL), depending on the type of device they use to display the information, so that your document looks best both on a computer with a 21-inch CRT monitor and on a cellphone.

The current XML version is XML 1.0, approved by the W3C in February 1998. XML syntax looks like HTML syntax, although it is much stricter. All tags are lowercase, and a tag that has only inline data has to terminate itself, for example, <token params />. XML also uses namespaces, so that multiple DTDs declaring different elements but with similar tag names can have their elements distinguished. DTDs can be imported from URIs as well. As an example of an XML document structure, here is the definition for a small XHTML document:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  [html that follows the above-mentioned XML rules]
</html>

All XML documents start with <?xml version="ver"?>. <!DOCTYPE ...> is a special tag used for importing DTDs. Since it is a DTD definition, it does not adhere to XML rules. xmlns defines a unique XML namespace for the document elements. In this case, the namespace is the XHTML specifications web site.

In addition to XML specifications, the following XML-related specifications are standardized:

• XML Protocol. Used to exchange XML information between processes. It is meant to supersede HTTP and extend it, as well as to allow interprocess communications across networks.

• XML Schema. A more structured and powerful language for defining XML data types (tags). Unlike a DTD, XML Schema uses XML tags for type definitions.

• XSL. This is basically CSS for XML. On the other hand, XSL is much more complex, having three parts: XSL Transformations (XSLT), XML Path Language (XPath), and XSL Formatting Objects.

• SMIL: Synchronized Multimedia Integration Language, pronounced "smile". This is a particular application of XML (globally predefined DTD) that permits specifying temporally scripted interaction among any media types and user input. For example, it can be used to show a streaming video synchronized with a slide show presentation, both reacting to user navigation through the slide show or video.

1.3.5 Synchronized Multimedia Integration Language (SMIL)

Just as it was beneficial to have HTML provide text-document publishing using a readable markup language, it is also desirable to be able to publish multimedia presentations using a markup language. Multimedia presentations have additional characteristics: whereas in text documents the text is read sequentially and displayed all at once (at the same time), multimedia presentations can include many elements, such as video and audio, that have content changing through time. Thus, a multimedia markup language must enable scheduling and synchronization of different multimedia elements and define these elements' interactivity with the user.

The W3C established a Working Group in 1997 to come up with specifications for a multimedia synchronization language. That group produced specifications for SMIL 1.0 that became a Recommendation in June 1998. As HTML was being redefined in XML (the XHTML specifications), so too was SMIL 1.0, with some enhancements. SMIL 2.0, which also provides integration with HTML, was accepted as a Recommendation in August 2001.

SMIL 2.0 is specified in XML using a modularization approach similar to the one used in XHTML. All SMIL elements are divided into modules — sets of XML elements, attributes, and values that define one conceptual functionality. In the interest of modularization, not all available modules must be included for all applications. For that reason, Language Profiles are defined, specifying a particular grouping of modules. Particular modules may have integration requirements a profile must follow. SMIL 2.0 has a main language profile that includes almost all SMIL modules, a Basic profile that includes only modules necessary to support basic functionality, and an XHTML+SMIL profile designed to integrate HTML and SMIL. The latter includes most of the XHTML modules, with only the SMIL timing modules (but not structure modules — XHTML has its own structure modules) added.

The SMIL language structure is similar to XHTML. The root element is smil, which contains the two elements head and body. head contains information not used for synchronization — meta-information, layout information, and content control, such as media bitrate. body contains all the information relating to which resources to present, and when. Three types of resource synchronization (grouping) are available: seq, par, and excl. seq specifies that the elements grouped are to be presented in the specified order (sequentially). Alternatively, par specifies that all the elements grouped are to be presented at the same time (in parallel). excl specifies that only one of the grouped elements can be presented at a time (exclusively); order does not matter.

Let's look at an example of SMIL code:

<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0"
  "http://www.w3.org/2001/SMIL20/SMIL20.dtd">
<smil xmlns=
  "http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta name="Author" content="Some Professor" />
  </head>
  <body>
    <par id="MakingOfABook">
      <seq>
        <video src="authorview.mpg" />
        <img src="onagoodday.jpg" />
      </seq>
      <audio src="authorview.wav" />
      <text src="http://www.cs.sfu.ca/mmbook/" />
    </par>
  </body>
</smil>

A SMIL document can optionally use the <!DOCTYPE ...> directive to import the SMIL DTD, which will force the interpreter to verify the document against the DTD. A SMIL document starts with <smil> and specifies the default namespace, using the xmlns attribute. The <head> section specifies the author of the document. The body element contains the synchronization information and resources we wish to present.

In the example given, a video source called "authorview.mpg", an audio source, "authorview.wav", and an HTML document at "http://booksite.html" are presented simultaneously at the beginning. When the video ends, the image "onagoodday.jpg" is shown, while the audio and the HTML document are still presented. At this point, the audio will thank the listeners and conclude the interview.

Additional information on SMIL specifications and available modules is available on the W3C web site.

1.4 OVERVIEW OF MULTIMEDIA SOFTWARE TOOLS

In this section, we look briefly at some of the software tools available for carrying out tasks in multimedia. These tools are really only the beginning — a fully functional multimedia project can also call for stand-alone programming as well as just the use of predefined tools to fully exercise the capabilities of machines and the Net.¹

The categories of software tools we examine here are

• Music sequencing and notation
• Digital audio
• Graphics and image editing
• Video editing
• Animation
• Multimedia authoring

1.4.1 Music Sequencing and Notation

Cakewalk Cakewalk is a well-known older name for what is now called Pro Audio. The firm producing this sequencing and editing software, Twelve Tone Systems, also sells an introductory version of their software, "Cakewalk Express", over the Internet for a low price.

The term sequencer comes from older devices that stored sequences of notes in the MIDI music language (events, in MIDI; see Section 6.2). It is also possible to insert WAV files and Windows MCI commands (for animation and video) into music tracks. (MCI is a ubiquitous component of the Windows API.)

Cubase Cubase is another sequencing/editing program, with capabilities similar to those of Cakewalk. It includes some digital audio editing tools (see below).

¹ See the accompanying web site for several interesting uses of software tools. In a typical computer science course in multimedia, the tools described here might be used to create a small multimedia production as a first assignment. Some of the tools are powerful enough that they might also form part of a course project.

Macromedia Soundedit Soundedit is a mature program for creating audio for multimedia projects and the web that integrates well with other Macromedia products such as Flash and Director.

1.4.2 Digital Audio

Digital Audio tools deal with accessing and editing the actual sampled sounds that make up audio.

Cool Edit Cool Edit is a powerful, popular digital audio toolkit with capabilities (for PC users, at least) that emulate a professional audio studio, including multitrack productions and sound file editing, along with digital signal processing effects.

Sound Forge Sound Forge is a sophisticated PC-based program for editing WAV files. Sound can be captured from a CD-ROM drive or from tape or microphone through the sound card, then mixed and edited. It also permits adding complex special effects.

Pro Tools Pro Tools is a high-end integrated audio production and editing environment that runs on Macintosh computers as well as Windows. Pro Tools offers easy MIDI creation and manipulation as well as powerful audio mixing, recording, and editing software.

1.4.3 Graphics and Image Editing

Adobe Illustrator Illustrator is a powerful publishing tool for creating and editing vector graphics, which can easily be exported to use on the web.

Adobe Photoshop Photoshop is the standard in a tool for graphics, image processing, and image manipulation. Layers of images, graphics, and text can be separately manipulated for maximum flexibility, and its "filter factory" permits creation of sophisticated lighting effects.

Macromedia Fireworks Fireworks is software for making graphics specifically for the web. It includes a bitmap editor, a vector graphics editor, and a JavaScript generator for buttons and rollovers.

Macromedia Freehand Freehand is a text and web graphics editing tool that supports many bitmap formats, such as GIF, PNG, and JPEG. These are pixel-based formats, in that each pixel is specified. It also supports vector-based formats, in which endpoints of lines are specified instead of the pixels themselves, such as SWF (Macromedia Flash) and FHC (Shockwave Freehand). It can also read Photoshop format.

1.4.4 Video Editing

Adobe Premiere Premiere is a simple, intuitive video editing tool for nonlinear editing — putting video clips into any order. Video and audio are arranged in tracks, like a musical

score. It provides a large number of video and audio tracks, superimpositions, and virtual clips. A large library of built-in transitions, filters, and motions for clips allows easy creation of effective multimedia productions.

Adobe After Effects After Effects is a powerful video editing tool that enables users to add and change existing movies with effects such as lighting, shadows, and motion blurring. It also allows layers, as in Photoshop, to permit manipulating objects independently.

Final Cut Pro Final Cut Pro is a video editing tool offered by Apple for the Macintosh platform. It allows the capture of video and audio from numerous sources, such as film and DV. It provides a complete environment, from capturing the video to editing and color correction and finally output to a video file or broadcast from the computer.

1.4.5 Animation

Multimedia APIs Java3D is an API used by Java to construct and render 3D graphics, similar to the way Java Media Framework handles media files. It provides a basic set of object primitives (cube, splines, etc.) upon which the developer can build scenes. It is an abstraction layer built on top of OpenGL or DirectX (the user can select which), so the graphics are accelerated.

DirectX, a Windows API that supports video, images, audio, and 3D animation, is the most common API used to develop modern multimedia Windows applications, such as computer games.

OpenGL was created in 1992 and has become the most popular 3D API in use today. OpenGL is highly portable and will run on all popular modern operating systems, such as UNIX, Linux, Windows, and Macintosh.

Rendering Tools 3D Studio Max includes a number of high-end professional tools for character animation, game development, and visual effects production. Models produced using this tool can be seen in several consumer games, such as for the Sony Playstation.

Softimage XSI (previously called Softimage 3D) is a powerful modeling, animation, and rendering package for animation and special effects in films and games.

Maya, a competing product to Softimage, is a complete modeling package. It features a wide variety of modeling and animation tools, such as to create realistic clothes and fur.

RenderMan is a rendering package created by Pixar. It excels in creating complex surface appearances and images and has been used in numerous movies, such as Monsters Inc. and Final Fantasy: The Spirits Within. It is also capable of importing models from Maya.

GIF Animation Packages For a simpler approach to animation that also allows quick development of effective small animations for the web, many shareware and other programs permit creating animated GIF images. GIFs can contain several images, and looping through them creates a simple animation. Gifcon and GifBuilder are two of these. Linux also provides some simple animation tools, such as animate.

1.4.6 Multimedia Authoring

Tools that provide the capability for creating a complete multimedia presentation, including interactive user control, are called authoring programs.

Macromedia Flash Flash allows users to create interactive movies by using the score metaphor — a timeline arranged in parallel event sequences, much like a musical score consisting of musical notes. Elements in the movie are called symbols in Flash. Symbols are added to a central repository, called a library, and can be added to the movie's timeline. Once the symbols are present at a specific time, they appear on the Stage, which represents what the movie looks like at a certain time, and can be manipulated and moved by the tools built into Flash. Finished Flash movies are commonly used to show movies or games on the web.

Macromedia Director Director uses a movie metaphor to create interactive presentations. This powerful program includes a built-in scripting language, Lingo, that allows creation of complex interactive movies.² The "cast" of characters in Director includes bitmapped sprites, scripts, music, sounds, and palettes. Director can read many bitmapped file formats. The program itself allows a good deal of interactivity, and Lingo, with its own debugger, allows more control, including control over external devices, such as VCRs and videodisc players. Director also has web authoring features available, for creation of fully interactive Shockwave movies playable over the web.

² Therefore, Director is often a popular choice with students for creating a final project in multimedia courses — it provides the desired power without the inevitable pain of using a full-blown C++ program.

Authorware Authorware is a mature, well-supported authoring product that has an easy learning curve for computer science students because it is based on the idea of flowcharts (the so-called iconic/flow-control metaphor). It allows hyperlinks to link text, digital movies, graphics, and sound. It also provides compatibility between files produced in PC and Mac versions. Shockwave Authorware applications can incorporate Shockwave files, including Director movies, Flash animations, and audio.

Quest Quest, which uses a type of flowcharting metaphor, is similar to Authorware in many ways. However, the flowchart nodes can encapsulate information in a more abstract way (called "frames") than simply subroutine levels. As a result, connections between icons are more conceptual and do not always represent flow of control in the program.

1.5 FURTHER EXPLORATION

Chapters 1 and 2 of Steinmetz and Nahrstedt [5] provide a good overview of multimedia concepts.

The web site for this text is kept current on new developments. Chapter 1 of the Further Exploration directory on the web site provides links to much of the history of multimedia. As a start, the complete Vannevar Bush article on the Memex system conception is online. This article was and still is considered seminal. Although written over 50 years ago, it adumbrates many current developments, including fax machines and the associative memory model that underlies the development of the web. Nielsen's book [6] is a good overview of hypertext
18 Chapter 1 Introduction to Multimedia / Section 1.7 References 19

and hypermedia. For more advanced reading, the collection of survey papers by Jeffay and Zhang [1] provides in-depth background as well as future directions of research.
Other links in the text web site include information on

• Ted Nelson and the Xanadu project
• Nicholas Negroponte's work at the MIT Media Lab. Negroponte's small book on multimedia [7] has become a much-quoted classic.
• Douglas Engelbart and the history of the "On-Line System"
• The MIT Media Lab. Negroponte and Wiesner cofounded the MIT Media Lab, which is still going strong and is arguably the most influential idea factory in the world.
• Client-side execution. Java and client-side execution started in 1995; "Duke", the first Java applet, is also on the textbook's web site.

Chapter 12 of Buford's book [8] provides a detailed introduction to authoring. Neuscholz's introductory text [9] gives step-by-step instructions for creating simple Lingo-based interactive Director movies.
Other links include

• Digital Audio. This web page includes a link to the Sonic Foundry company for information on Sound Forge, a sample Sound Forge file, and the resulting output WAV file. The example combines left and right channel information in a complex fashion. Little effort is required to produce sophisticated special effects with this tool. Digidesign is one firm offering high-end Macintosh software, which can even involve purchasing extra boards for specialized processing.
• Music sequencing and notation
• Graphics and image editing information
• Video editing products and information
• Animation sites
• Multimedia authoring tools
• XML.

1.6 EXERCISES

1. Identify three novel applications of the Internet or multimedia applications. Discuss why you think these are novel.
2. Briefly explain, in your own words, "Memex" and its role regarding hypertext. Could we carry out the Memex task today? How do you use Memex ideas in your own work?
3. Your task is to think about the transmission of smell over the Internet. Suppose we have a smell sensor at one location and wish to transmit the Aroma Vector (say) to a receiver to reproduce the same sensation. You are asked to design such a system. List three key issues to consider and two applications of such a delivery system. Hint: Think about medical applications.
4. Tracking objects or people can be done by both sight and sound. While vision systems are precise, they are relatively expensive; on the other hand, a pair of microphones can detect a person's bearing inaccurately but cheaply. Sensor fusion of sound and vision is thus useful. Surf the web to find out who is developing tools for video conferencing using this kind of multimedia idea.
5. Non-photorealistic graphics means computer graphics that do well enough without attempting to make images that look like camera images. An example is conferencing (let's look at this cutting-edge application again). For example, if we track lip movements, we can generate the right animation to fit our face. If we don't much like our own face, we can substitute another one — facial-feature modeling can map correct lip movements onto another model. See if you can find out who is carrying out research on generating avatars to represent conference participants' bodies.
6. Watermarking is a means of embedding a hidden message in data. This could have important legal implications: Is this image copied? Is this image doctored? Who took it? Where? Think of "messages" that could be sensed while capturing an image and secretly embedded in the image, so as to answer these questions. (A similar question derives from the use of cell phones. What could we use to determine who is putting this phone to use, and where, and when? This could eliminate the need for passwords.)

1.7 REFERENCES

1 K. Jeffay and H. Zhang, Readings in Multimedia Computing and Networking, San Francisco, CA: Morgan Kaufmann, 2002.
2 Vannevar Bush, "As We May Think," The Atlantic Monthly, Jul. 1945.
3 D. Engelbart and H. Lehtman, "Working Together," BYTE Magazine, Dec. 1988, 245-252.
4 N. Yankelovich, N. Meyrowitz, and A. van Dam, "Reading and Writing the Electronic Book," in Hypermedia and Literary Studies, ed. P. Delany and G.P. Landow, Cambridge, MA: MIT Press, 1991.
5 R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications and Applications, Upper Saddle River, NJ: Prentice Hall PTR, 1995.
6 J. Nielsen, Multimedia and Hypertext: The Internet and Beyond, San Diego: AP Professional, 1995.
7 N. Negroponte, Being Digital, New York: Vintage Books, 1995.
8 J.F.K. Buford, Multimedia Systems, Reading, MA: Addison Wesley, 1994.
9 N. Neuscholz, Introduction to Director and Lingo: Multimedia and Internet Applications, Upper Saddle River, NJ: Prentice Hall, 2000.

CHAPTER 2

Multimedia Authoring and Tools

2.1 MULTIMEDIA AUTHORING

Multimedia authoring is the creation of multimedia productions, sometimes called "movies" or "presentations". Since we are interested in this subject from a computer science point of view, we are mostly interested in interactive applications. Also, we need to consider still-image editors, such as Adobe Photoshop, and simple video editors, such as Adobe Premiere, because these applications help us create interactive multimedia projects.
How much interaction is necessary or meaningful depends on the application. The spectrum runs from almost no interactivity, as in a slide show, to full-immersion virtual reality.
In a slide show, interactivity generally consists of being able to control the pace (e.g., click to advance to the next slide). The next level of interactivity is being able to control the sequence and choose where to go next. Next is media control: start/stop video, search text, scroll the view, zoom. More control is available if we can control variables, such as changing a database search query.
The level of control is substantially higher if we can control objects — say, moving objects around a screen, playing interactive games, and so on. Finally, we can control an entire simulation: move our perspective in the scene, control scene objects.
For some time, people have indeed considered what should go into a multimedia project; references are given at the end of this chapter.
In this section, we shall look at

• Multimedia authoring metaphors
• Multimedia production
• Multimedia presentation
• Automatic authoring

The final item deals with general authoring issues and what benefit automated tools, using some artificial intelligence techniques, for example, can bring to the authoring task. As a first step, we consider programs that carry out automatic linking for legacy documents.
After an introduction to multimedia paradigms, we present some of the practical tools of multimedia content production — software tools that form the arsenal of multimedia production. Here we go through the nuts and bolts of a number of standard programs currently in use.

2.1.1 Multimedia Authoring Metaphors

Authoring is the process of creating multimedia applications. Most authoring programs use one of several authoring metaphors, also known as authoring paradigms: metaphors for easier understanding of the methodology employed to create multimedia applications [1]. Some common authoring metaphors are as follows:

• Scripting language metaphor
The idea here is to use a special language to enable interactivity (buttons, mouse, etc.) and allow conditionals, jumps, loops, functions/macros, and so on. An example is the OpenScript language in Asymetrix Learning Systems' Toolbook program. OpenScript looks like a standard object-oriented, event-driven programming language. For example, a small Toolbook program is shown below. Such a language has a learning curve associated with it, as do all authoring tools — even those that use the standard C programming language as their scripting language — because of the object libraries that must be learned.

    -- load an MPEG file
    extFileName of MediaPlayer "theMpegPath" =
        "c:\windows\media\home33.mpg";
    -- play
    extPlayCount of MediaPlayer "theMpegPath" = 1;
    -- put the MediaPlayer in frames mode (not time mode)
    extDisplayMode of MediaPlayer "theMpegPath" = 1;
    -- if want to start and end at specific frames:
    extSelectionStart of MediaPlayer "theMpegPath" = 103;
    extSelectionEnd of MediaPlayer "theMpegPath" = 1997;
    -- start playback
    get extPlay() of MediaPlayer "theMpegPath";

• Slide show metaphor
Slide shows are by default a linear presentation. Although tools exist to perform jumps in slide shows, few practitioners use them. Example programs are PowerPoint or ImageQ.

• Hierarchical metaphor
Here, user-controllable elements are organized into a tree structure. Such a metaphor is often used in menu-driven applications.

• Iconic/flow-control metaphor
Graphical icons are available in a toolbox, and authoring proceeds by creating a flowchart with icons attached. The standard example of such a metaphor is Authorware, by Macromedia. Figure 2.1 shows an example flowchart. As well as simple flowchart elements, such as an IF statement, a CASE statement, and so on, we can

[Screenshot: an Authorware flowchart for a small quiz, with icons for a welcome page, questions q1-q3, and results.]
FIGURE 2.1: Authorware flowchart.

group elements using a Map (i.e., a subroutine) icon. With little effort, simple animation is also possible.

• Frames metaphor
As in the iconic/flow-control metaphor, graphical icons are again available in a toolbox, and authoring proceeds by creating a flowchart with icons attached. However, rather than representing the actual flow of the program, links between icons are more conceptual. Therefore, "frames" of icon designs represent more abstraction than in the simpler iconic/flow-control metaphor. An example of such a program is Quest, by Allen Communication. The flowchart consists of "modules" composed of "frames". Frames are constructed from objects, such as text, graphics, audio, animations, and video, all of which can respond to events. A real benefit is that the scripting language here is the widely used programming language C. Figure 2.2 shows a Quest frame.

[Screenshot: a Quest frame containing C code that starts up the Windows Calculator; if the user minimizes the calculator and tries to start it again, the existing instance is brought to the top instead of launching a second copy.]
FIGURE 2.2: Quest frame.

• Card/scripting metaphor
This metaphor uses a simple index-card structure to produce multimedia productions. Since links are available, this is an easy route to producing applications that use hypertext or hypermedia. The original of this metaphor was HyperCard by Apple. Another example is HyperStudio by Knowledge Adventure. The latter program is now used in many schools. Figure 2.3 shows two cards in a HyperStudio stack.

• Cast/score/scripting metaphor
In this metaphor, time is shown horizontally in a type of spreadsheet fashion, where rows, or tracks, represent instantiations of characters in a multimedia production. Since these tracks control synchronous behavior, this metaphor somewhat parallels a music score. Multimedia elements are drawn from a "cast" of characters, and "scripts" are basically event procedures or procedures triggered by timer events. Usually, you can write your own scripts. In a sense, this is similar to the conventional use of the term "scripting language" — one that is concise and invokes lower-level abstractions, since that is just what one's own scripts do. Director, by Macromedia, is the chief example of this metaphor. Director uses the Lingo scripting language, an object-oriented, event-driven language.

2.1.2 Multimedia Production

A multimedia project can involve a host of people with specialized skills. In this book we focus on more technical aspects, but multimedia production can easily involve an art director, graphic designer, production artist, producer, project manager, writer, user interface designer, sound designer, videographer, and 3D and 2D animators, as well as programmers.

[Screenshot: two cards in a HyperStudio stack.]
FIGURE 2.3: Two cards in a HyperStudio stack.

The production timeline would likely only involve programming when the project is about 40% complete, with a reasonable target for an alpha version (an early version that does not contain all planned features) being perhaps 65-70% complete. Typically, the design phase consists of storyboarding, flowcharting, prototyping, and user testing, as well as a parallel production of media. Programming and debugging phases would be carried out in consultation with marketing, and the distribution phase would follow.
A storyboard depicts the initial idea content of a multimedia concept in a series of sketches. These are like "keyframes" in a video — the story hangs from these "stopping places". A flowchart organizes the storyboards by inserting navigation information — the multimedia concept's structure and user interaction. The most reliable approach for planning navigation is to pick a traditional data structure. A hierarchical system is perhaps one of the simplest organizational strategies.
Multimedia is not really like other presentations, in that careful thought must be given to organization of movement between the "rooms" in the production. For example, suppose we are navigating an African safari, but we also need to bring specimens back to our museum for close examination — just how do we effect the transition from one locale to the other? A flowchart helps imagine the solution.
The flowchart phase is followed by development of a detailed functional specification. This consists of a walk-through of each scenario of the presentation, frame by frame, including all screen action and user interaction. For example, during a mouseover for a character, the character reacts, or a user clicking on a character results in an action.
The final part of the design phase is prototyping and testing. Some multimedia designers use an authoring tool at this stage already, even if the intermediate prototype will not be used in the final product or continued in another tool. User testing is, of course, extremely important before the final development phase.

2.1.3 Multimedia Presentation

In this section, we briefly outline some effects to keep in mind for presenting multimedia content, as well as some useful guidelines for content design.

Graphics Styles
Careful thought has gone into combinations of color schemes and how lettering is perceived in a presentation. Many presentations are meant for business displays, rather than appearing on a screen. Human visual dynamics are considered in regard to how such presentations must be constructed. Most of the observations here are drawn from Vetter et al. [2], as is Figure 2.4.

Color Principles and Guidelines
Some color schemes and art styles are best combined with a certain theme or style. Color schemes could be, for example, natural and floral for outdoor scenes and solid colors for indoor scenes. Examples of art styles are oil paints, watercolors, colored pencils, and pastels.
A general hint is to not use too many colors, as this can be distracting. It helps to be consistent with the use of color — then color can be used to signal changes in theme.

Fonts
For effective visual communication, large fonts (18 to 36 points) are best, with no more than six to eight lines per screen. As shown in Figure 2.4, sans serif fonts work better than serif fonts (serif fonts are those with short lines stemming from and at an angle to the upper and lower ends of a letter's strokes). Figure 2.4 shows a comparison of two screen projections (Figures 2 and 3 from Vetter, Ward, and Shapiro [2]).
The top figure shows good use of color and fonts. It has a consistent color scheme and uses large and all sans-serif (Arial) fonts. The bottom figure is poor, in that too many colors are used, and they are inconsistent. The red adjacent to the blue is hard to focus on, because the human retina cannot focus on these colors simultaneously. The serif (Times New Roman) font is said to be hard to read in a darkened, projection setting. Finally, the lower right panel does not have enough contrast — pretty pastel colors are often usable only if their background is sufficiently different.

A Color Contrast Program
Seeing the results of Vetter et al.'s research, we constructed a small Visual Basic program to investigate how readability of text colors depends on color and the color of the background. (See the Further Exploration section at the end of this chapter for a pointer to this program on the text web site. There, both the executable and the program source are given.)
The simplest approach to making readable colors on a screen is to use the principal complementary color as the background for text. For color values in the range 0 to 1 (or, effectively, 0 to 255), if the text color is some triple (R, G, B), a legible color for the background is likely given by that color subtracted from the maximum:

    (R, G, B) → (1 - R, 1 - G, 1 - B)        (2.1)

That is, not only is the color "opposite" in some sense (not the same sense as artists use), but if the text is bright, the background is dark, and vice versa.
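Equation (2.1) is easy to try out directly. The following is a minimal Python sketch of the complement rule (our illustration, not part of the text's Visual Basic program), for channel values in the range 0 to 1:

```python
def complementary(color):
    # Eq. (2.1): (R, G, B) -> (1 - R, 1 - G, 1 - B), channels in 0..1.
    r, g, b = color
    return (1.0 - r, 1.0 - g, 1.0 - b)

# Bright yellow text gets a dark blue background, so bright text
# lands on a dark field and vice versa.
text_color = (1.0, 1.0, 0.0)
background = complementary(text_color)   # (0.0, 0.0, 1.0)
```

For 8-bit channels the same rule is simply 255 minus each channel value.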

[Figure 2.4 artwork: two screen projections, one with a consistent color scheme and large sans-serif fonts, the other with too many inconsistent colors and a serif font.]
FIGURE 2.4: Colors and fonts. (This figure also appears in the color insert section.) Courtesy of Ron Vetter.

In the Visual Basic program given, sliders can be used to change the background color. As the background changes, the text changes to equal the principal complementary color. Clicking on the background brings up a color-picker as an alternative to the sliders.
If you feel you can choose a better color combination, click on the text. This brings up a color picker not tied to the background color, so you can experiment. (The text itself can also be edited.) A little experimentation shows that some color combinations are more pleasing than others — for example, a pink background and forest green foreground, or a green background and mauve foreground. Figure 2.5 shows this small program in operation.

[Screenshot: the color-contrast program displaying a paragraph about a 15-second music clip digitized at three sampling rates (11 kHz, 22 kHz, and 44 kHz) as a demonstration of the Nyquist Theorem, shown first in legible complementary colors and then in poorly chosen ones.]
FIGURE 2.5: Program to investigate colors and readability.

Figure 2.6 shows a "color wheel", with opposite colors equal to (1 - R, 1 - G, 1 - B). An artist's color wheel will not look the same, as it is based on feel rather than on an algorithm. In the traditional artist's wheel, for example, yellow is opposite magenta, instead of opposite blue as in Figure 2.6, and blue is instead opposite orange.

Sprite Animation
Sprites are often used in animation. For example, in Macromedia Director, the notion of a sprite is expanded to an instantiation of any resource. However, the basic idea of sprite animation is simple. Suppose we have produced an animation figure, as in Figure 2.7(a). Then it is a simple matter to create a 1-bit mask M, as in Figure 2.7(b), black on white, and the accompanying sprite S, as in Figure 2.7(c).
Now we can overlay the sprite on a colored background B, as in Figure 2.8(a), by first ANDing B and M, then ORing the result with S, with the final result as in Figure 2.8(e). Operations are available to carry out these simple compositing manipulations at frame rate and so produce a simple 2D animation that moves the sprite around the frame but does not change the way it looks.
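The AND/OR compositing just described can be demonstrated on toy single-channel images. This Python sketch (our illustration, using plain lists rather than any particular imaging library) follows the steps labeled in Figure 2.8:

```python
# Toy 4x4 single-channel 8-bit "images", stored as lists of rows.
W = H = 4
B = [[200] * W for _ in range(H)]      # (a) colored background B
M = [[255] * W for _ in range(H)]      # (b) 1-bit mask M: white (255)...
S = [[0] * W for _ in range(H)]        # (d) sprite S on a black field
for y in (1, 2):
    for x in (1, 2):
        M[y][x] = 0                    # ...black (0) where the sprite sits
        S[y][x] = 90                   # the sprite's own pixel values

# (c) B AND M punches a sprite-shaped black hole in the background;
# (e) ORing with S then drops the sprite pixels into that hole.
frame = [[(B[y][x] & M[y][x]) | S[y][x] for x in range(W)]
         for y in range(H)]
```

Because these are pure bitwise operations, they can run at frame rate; moving the sprite from frame to frame just means shifting where M and S place their nonzero pixels.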

[Figure 2.6 artwork: a wheel of RGB colors, each shown opposite its algorithmic complement.]
FIGURE 2.6: Color wheel. (This figure also appears in the color insert section.)

Video Transitions
Video transitions can be an effective way to indicate a change to the next section. Video transitions are syntactic means to signal "scene changes" and often carry semantic meaning. Many different types of transitions exist; the main types are cuts, wipes, dissolves, fade-ins, and fade-outs.
A cut, as the name suggests, carries out an abrupt change of image contents in two consecutive video frames from their respective clips. It is the simplest and most frequently used video transition.
A wipe is a replacement of the pixels in a region of the viewport with those from another video. If the boundary line between the two videos moves slowly across the screen, the second video gradually replaces the first. Wipes can be left-to-right, right-to-left, vertical, horizontal, like an iris opening, swept out like the hands of a clock, and so on.

[Figure 2.7 artwork: the animation figure, its 1-bit mask, and the resulting sprite.]
FIGURE 2.7: Sprite creation: (a) original; (b) mask image M; and (c) sprite S. "Duke" figure courtesy of Sun Microsystems.

[Figure 2.8 artwork: the compositing steps for overlaying the sprite on a background.]
FIGURE 2.8: Sprite animation: (a) Background B; (b) Mask M; (c) B AND M; (d) Sprite S; (e) B AND M OR S.

A dissolve replaces every pixel with a mixture over time of the two videos, gradually changing the first to the second. A fade-out is the replacement of a video by black (or white), and a fade-in is its reverse. Most dissolves can be classified into two types, corresponding, for example, to cross dissolve and dither dissolve in Adobe Premiere video editing software.
In type I (cross dissolve), every pixel is affected gradually. It can be defined by

    D = (1 - α(t)) · A + α(t) · B        (2.2)

where A and B are the color 3-vectors for video A and video B. Here, α(t) is a transition function, which is often linear with time t:

    α(t) = k t, with k t_max = 1        (2.3)

Type II (dither dissolve) is entirely different. Determined by α(t), increasingly more and more pixels in video A will abruptly (instead of gradually, as in Type I) change to video B. The positions of the pixels subjected to the change can be random or sometimes follow a particular pattern.
Obviously, fade-in and fade-out are special types of a Type I dissolve, in which video A or B is black (or white). Wipes are special forms of a Type II dissolve, in which changing pixels follow a particular geometric pattern.
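As a sketch of the two dissolve types, the following Python treats a frame as a flat list of gray values (our simplification, not the book's code); Equation (2.2) gives the Type I mix, and the Type II version switches whole pixels abruptly:

```python
import random

def cross_dissolve(A, B, t, t_max):
    # Type I, Eq. (2.2): D = (1 - alpha) * A + alpha * B per pixel,
    # with the linear transition function alpha = t / t_max (Eq. 2.3).
    alpha = t / t_max
    return [(1 - alpha) * a + alpha * b for a, b in zip(A, B)]

def dither_dissolve(A, B, t, t_max, rng=random.Random(0)):
    # Type II: by time t, roughly a fraction alpha of pixels (chosen
    # at random here) have switched abruptly from A's value to B's.
    alpha = t / t_max
    return [b if rng.random() < alpha else a for a, b in zip(A, B)]
```

At t = 0 both functions return frame A unchanged, and at t = t_max both return frame B; a fade-out is just cross_dissolve with B all zeros.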

[Figure 2.9 artwork: VideoL and VideoR in the viewport.]
FIGURE 2.9: (a) VideoL; (b) VideoR; (c) VideoL sliding into place and pushing out VideoR.

Despite the fact that many digital video editors include a preset number of video transitions, we may also be interested in building our own. For example, suppose we wish to build a special type of wipe that slides one video out while another video slides in to replace it. The usual type of wipe does not do this. Instead, each video stays in place, and the transition line moves across each "stationary" video, so that the left part of the viewport shows pixels from the left video, and the right part shows pixels from the right video (for a wipe moving horizontally from left to right).
Suppose we would like to have each video frame not held in place, but instead move progressively farther into (out of) the viewport: we wish to slide VideoL in from the left and push out VideoR. Figure 2.9 shows this process. Each of VideoL and VideoR has its own values of R, G, and B. Note that R is a function of position in the frame, (x, y), as well as of time t. Since this is video and not a collection of images of various sizes, each of the two videos has the same maximum extent, x_max. (Premiere actually makes all videos the same size — the one chosen in the preset selection — so there is no cause to worry about different sizes.)
As time goes by, the horizontal location x_T for the transition boundary moves across the viewport from x_T = 0 at t = 0 to x_T = x_max at t = t_max. Therefore, for a transition that is linear in time, x_T = (t / t_max) x_max.
So for any time t, the situation is as shown in Figure 2.10(a). The viewport, in which we shall be writing pixels, has its own coordinate system, with the x-axis from 0 to x_max. For each x (and y) we must determine (a) from which video we take RGB values, and (b) from what x position in the unmoving video we take pixel values — that is, from what position x from the left video, say, in its own coordinate system. It is a video, so of course the image in the left video frame is changing in time.
Let's assume that dependence on y is implicit. In any event, we use the same y as in the source video. Then for the red channel (and similarly for the green and blue), R = R(x, t). Suppose we have determined that pixels should come from VideoL. Then the x-position x_L in the unmoving video should be x_L = x + (x_max - x_T), where x is the position we are trying to fill in the viewport, x_T is the position in the viewport that the transition boundary has reached, and x_max is the maximum pixel position for any frame.
To see this, we note from Figure 2.10(b) that we can calculate the position x_L in VideoL's coordinate system as the sum of the distance x, in the viewport, and the difference x_max - x_T.

[Figure 2.10 artwork: the viewport geometry, with the transition boundary at x_T between 0 and x_max.]
FIGURE 2.10: (a) Geometry of VideoL pushing out VideoR; (b) Calculating position in VideoL from where pixels are copied to the viewport.

Substituting the fact that the transition moves linearly with time, x_T = x_max (t / t_max), we can set up a pseudocode solution as in Figure 2.11. In Figure 2.11, the slight change in formula if pixels are actually coming from VideoR instead of from VideoL is easy to derive.

Some Technical Design Issues
Technical parameters that affect the design and delivery of multimedia applications include computer platform, video format and resolution, memory and disk space, and delivery methods.

• Computer Platform. Usually we deal with machines that are either some type of UNIX box (such as a Sun) or else a PC or Macintosh. While a good deal of software is ostensibly "portable", much cross-platform software relies on runtime modules that may not work well across systems.

for t in °m01 No perfect mechanism curreritly exists lo distribute large muitimedia projects. Never
for x in O..Xmax Lheless, using such bois as PowerPoint or DirecLor, iL is possible lo creaLe acceptable
II irnor < presenlations Lhal 61 on a singie CD-ROM.
1? = Rj ( x +Xrnax*[i7~j, 1)
else 2.1.4 Automatic Authoring
R = RR ( 1 — 1m01 * ~—,
Thus far, we have considered notions developed for authoring new muilimedia. Nevertbe
iess, a tremendous amounl of legacy muitimedia documents exists, and researchers have
FIGURE 2.11: Pseudocode for sIide video transition. been interesLed iri methods Lo facilitate automatic auflwring. By this term is meant either
an advanced helper for creating new multimedia presentations ora mechanism to facilitate
automatic creation of more usefui muitimedia documents from existing sources.
• Video Formal and Resolution. The mosl popular video formais are NTSC, PAL,
and SECAM. They are IIOL compatibie, so conversion is required to play a video in a Hypermcdia Documents Lei us start by considering hypermedia documents. Gen
different format. erally, three steps are involved ia producing documents meant Lo be viewed nonlinearly:
infonnation generation or capture, authoring, and publication. A question that can be asked
The graphics card, which displays pixels on the screen, is someLimes referred to as is, how much of Lhis process can be automated?
a “video card”. Iii fact, some cards are abie Lo perfonu “frame grabbing”, lo change The first sLep, capture of media, be iL from LexL or using an audio digitizer or video
analog signals to digital for video. This kind of card is called a "video capture card".

The graphics card's capacity depends on its price. An old standard for the capacity of a card is S-VGA, which allows for a resolution of 1,280 x 1,024 pixels in a displayed image and as many as 65,536 colors using 16-bit pixels or 16.7 million colors using 24-bit pixels. Nowadays, graphics cards that support higher resolution, such as 1,600 x 1,200, and 32-bit pixels or more are common.

• Memory and Disk Space Requirement. Rapid progress in hardware alleviates the problem, but multimedia software is generally greedy. Nowadays, at least 128 megabytes of RAM and 20 gigabytes of hard-disk space should be available for acceptable performance and storage for multimedia programs.

• Delivery Methods. Once coding and all other work is finished, how shall we present our clever work? Since we have presumably purchased a large disk, so that performance is good and storage is not an issue, we could simply bring along our machine and show the work that way. However, we likely wish to distribute the work as a product. Presently, rewritable DVD drives are not the norm, and CD-ROMs may lack sufficient storage capacity to hold the presentation. Also, access time for CD-ROM drives is longer than for hard-disk drives.

Electronic delivery is an option, but this depends on network bandwidth at the user side (and at our server). A streaming option may be available, depending on the presentation.

... is highly developed and well automated. The final step, presentation, is the objective of the multimedia tools we have been considering. But the middle step (authoring) is most under consideration here.

Essentially, we wish to structure information to support access and manipulation of the available media. Clearly, we would be well advised to consider the standard computing science data structures in structuring this information: lists, trees, or networks (graphs). However, here we would like to consider how best to structure the data to support multiple views, rather than a single, static view.

Externalization versus Linearization   Figure 2.12 shows the essential problem involved in communicating ideas without using a hypermedia mechanism: the author's ideas are "linearized" by setting them down in linear order on paper. In contrast, hyperlinks allow us the freedom to partially mimic the author's thought process (i.e., externalization). After all, the essence of Bush's Memex idea in Section 1.2.1 involves associative links in human memory.

Now, using Microsoft Word, say, it is trivial to create a hypertext version of one's document, as Word simply follows the layout already set up in chapters, headings, and so on. But problems arise when we wish to extract semantic content and find links and anchors, even considering just text and not images. Figure 2.13 displays the problem: while it is feasible to mentally manage a few information nodes, once the problem becomes large, we need automatic assistants.

Once a dataset becomes large, we should employ database methods. The issues become focused on scalability (to a large dataset), maintainability, addition of material, and reusability. The database information must be set up in such a way that the "publishing" stage, presentation to the user, can be carried out just-in-time, presenting information in a user-defined view from an intermediate information structure.
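The Delivery Methods trade-offs above come down to quick arithmetic on data rates and capacities. The sketch below uses purely illustrative numbers (the bitrates, clip length, and link speed are assumptions, not figures from this chapter):

```python
# Back-of-envelope check: does a presentation fit on a CD-ROM, and how long
# would electronic delivery take? All numbers here are illustrative assumptions.

def fits_on_cd(video_kbps, audio_kbps, minutes, cd_mb=700):
    """Return (size_mb, fits) for a clip at the given data rates."""
    total_kbits = (video_kbps + audio_kbps) * minutes * 60
    size_mb = total_kbits / 8 / 1024          # kbits -> kbytes -> MB
    return size_mb, size_mb <= cd_mb

def download_minutes(size_mb, link_kbps):
    """Transfer time over a link of the given speed, ignoring overhead."""
    return size_mb * 1024 * 8 / link_kbps / 60

size, ok = fits_on_cd(video_kbps=1200, audio_kbps=128, minutes=60)
print(f"{size:.0f} MB on CD: {ok}")
print(f"{download_minutes(size, link_kbps=768):.0f} min over a 768 kbps link")
```

Doubling the video bitrate or the clip length pushes the same clip past the capacity of a single CD, which is exactly the situation in which streaming or DVD delivery becomes attractive.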
34 Chapter 2 Multimedia Authoring and Tools Section 2.1 Multimedia Authoring 35

FIGURE 2.12: Communication using hyperlinks. Courtesy of David Lowe; (©1995 IEEE). [5]

FIGURE 2.14: Nodes and anchors in hypertext. Courtesy of David Lowe. [6]

Semiautomatic Migration of Hypertext   The structure of hyperlinks for text information is simple: "nodes" represent semantic information and are anchors for links to other pages. Figure 2.14 illustrates these concepts.

For text, the first step for migrating paper-based information to hypertext is to automatically convert the format used to HTML. Then, sections and chapters can be placed in a database. Simple versions of data mining techniques, such as word stemming, can easily be used to parse titles and captions for keywords — for example, by frequency counting. Keywords found can be added to the database being built. Then a helper program can automatically generate additional hyperlinks between related concepts.
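The frequency-counting keyword pass just described can be sketched in a few lines. This is a minimal illustration rather than the helper program itself: the section titles are invented, the threshold is arbitrary, and a real system would also apply word stemming (so that "capture" and "captured" below would merge):

```python
# A minimal sketch of the keyword pass described above: count word frequencies
# in section titles, treat frequent non-trivial words as keywords, and propose
# hyperlinks between sections that share a keyword. Titles are hypothetical.
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def keywords(titles, min_count=2):
    words = Counter()
    for t in titles:
        words.update(w for w in t.lower().split() if w not in STOPWORDS)
    return {w for w, n in words.items() if n >= min_count}

def suggest_links(sections):
    """sections: dict name -> title. Yields (a, b, shared keyword) suggestions."""
    kws = keywords(sections.values())
    index = {name: {w for w in title.lower().split() if w in kws}
             for name, title in sections.items()}
    for a, b in combinations(sections, 2):
        for w in sorted(index[a] & index[b]):
            yield a, b, w

sections = {"s1": "Video Capture Basics", "s2": "Editing Captured Video",
            "s3": "Audio Mixing"}
for link in suggest_links(sections):
    print(link)
```

In the semiautomatic setting the text goes on to describe, each yielded triple would be presented to a human editor as a suggestion to accept or reject.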
A semiautomatic version of such a program is most likely to be successful, making suggestions that can be accepted or rejected and manually added to. A database management system can maintain the integrity of links when new nodes are inserted. For the publishing stage, since it may be impractical to re-create the underlying information structures, it is best to delay imposing a viewpoint on the data until as late as possible.

FIGURE 2.13: Complex information space: (a) complexity: manageable; (b) complexity: overwhelming. Courtesy of David Lowe; (©1995 IEEE). [5]

Hyperimages   Matters are not nearly so straightforward when considering image or other multimedia data. To treat an image in the same way as text, we would wish to consider an image to be a node that contains objects and other anchors, for which we need to determine image entities and rules. What we desire is an automated method to help us produce true hypermedia, as in Figure 2.15.

It is possible to manually delineate syntactic image elements by masking image areas. These can be tagged with text, so that previous text-based methods can be brought into play. Figure 2.16 shows a "hyperimage", with image areas identified and automatically linked to other parts of a document.

Such methods are certainly in their infancy but provide a fascinating view of what is to come in authoring automation. Naturally, we are also interested in what tools from database systems, data mining, artificial intelligence, and so on can be brought to bear to assist production of full-blown multimedia systems, not just hypermedia systems. The above discussion shows that we are indeed at the start of such work.

FIGURE 2.15: Structure of hypermedia. Courtesy of David Lowe. [6]

FIGURE 2.16: Hyperimage. Courtesy of David Lowe. [6]

2.2 SOME USEFUL EDITING AND AUTHORING TOOLS

This text is primarily concerned with principles of multimedia — the fundamentals to be grasped for a real understanding of this subject. Nonetheless, we need real vehicles for showing this understanding, and straight programming in C++ or Java is not always the best way of showing your knowledge. Most introductory multimedia courses ask you to at least start off delivering some multimedia product (e.g., see Exercise 11). So we need a jump-start to help you learn "yet another software tool." This section aims to give you that jump-start.

Therefore, we'll consider some popular authoring tools. Since the first step in creating a multimedia application is probably creation of interesting video clips, we start off looking at a video editing tool. This is not really an authoring tool, but video creation is so important that we include a small introduction to one such program.

The tools we look at are the following:

• Adobe Premiere 6

• Macromedia Director 8 and MX

• Flash 5 and MX

• Dreamweaver MX

While this is not an exhaustive list, these tools are often used in creating multimedia content.

2.2.1 Adobe Premiere

Premiere Basics   Adobe Premiere is a very simple video editing program that allows you to quickly create a simple digital video by assembling and merging multimedia components. It effectively uses the score authoring metaphor, in that components are placed in "tracks" horizontally, in a timeline window.

• The File > New Project command opens a window that displays a series of "presets" — assemblies of values for frame resolution, compression method, and frame rate. There are many preset options, most of which conform to some NTSC or PAL video standard.

• Start by importing resources, such as AVI (Audio Video Interleave) video files and WAV sound files, and dragging them from the Project window onto tracks 1 or 2. (In fact, you can use up to 99 video and 99 audio tracks!)

Video 1 is actually made up of three tracks: Video 1A, Video 1B, and Transitions. Transitions can be applied only to Video 1. Transitions are dragged into the Transitions track from the Transition window, such as a gradual replacement of Video 1A by Video 1B (a dissolve), sudden replacement of random pixels in a checkerboard (a dither dissolve), or a wipe, with one video sliding over another. There are many transitions to choose from, but you can also design an original transition, using Premiere's Transition Factory.

You can import WAV sound files by dragging them to Audio 1 or Audio 2 of the Timeline window or to any additional sound tracks. You can edit the properties of any sound track by right-clicking on it.

toca — o, — — — oc
—.4
Ic—c—fl~i
o,-.
- 4_n.~’fl FÁI ~
frn~ ~W o , ~ 430w1

ei—’ - —‘- —
FI004&t IIW~J PI1,t~~p~I 00õ 03
o— —
o ~ - ——
Lo~ Ii~s ~ rLLuldalara,o Ii~k4.c
:‘ ~ IAMm S

(a)
• —1.4
• £, —

•I i*Aj!biS
ei
l$I_.
o—.— • • —
isl
Microsolt MPEG-4 Video Codec V2
Co~ight ® Microsoít Corp. l39S~1 333
• po-,
Options
Keyfiame every ~ seconds
—4 _

.0

•—,
•—2
- Compression Contro4
* •.—. Smootktiess 75 Crispness

FIGURE 2.17: Adobe Premiere screen.


• -Data Rate (ti ~s pei Sec~id)
3~Q
Figure 2.17 shows what a typical Premiere screen might Iook like. The yellow ruler at Lhe -~
Lop of Lhe Timeline window delineates Lhe working timeline — drag iL to Lhe right amount
of time. The 1 Second dropdown box aL the bottom represents showing one video keyframe
per 1 second. 0K Cancel
To “compile” Lhe video, goto Timeline > Render Work Area and save Lhe
project as a .ppj file. Now iL geLs inLeresling, because you must make some choices (b)
here, involving how and iii whaL fonnat Lhe movie is Lo be saved. Figure 2.18 shows Lhe
project opLions. The dialogs LhaL Lweak each codec are provided by Lhe codec manufacLurer; FIGURE 2.18: (a) output opLions; (b) compression opLions.
bring these up by clicking on the Configure button. Compression codees (compression
decompression protocois) are ofLen in hardware on Lhe video capture card. lf you choose
a codec Lhat requires hardware assistance, someone eise’s system may not be able to play In Photoshop, we seL up an alpha channel as foilows:
yourbrilliant digital video, and ali is in vain!
Images can also be inserted into tracks. We can use transitions Lo make Lhe images 1. Use an image you like — a . JPG, say.
gradually appear or disappear in Lhe final video window. Todo so, set up a “mask” image, 2. Make Lhe background some solid color — white, say.
as in Figure 2.19. Here, we have imported an Adobe Photoshop 6.0 layered image, with 3. Malce sure you have chosen Image > Mode > RGB Color.
accompanying alpha channei made in PhoLoshop. 4. Select thaL background area (you wanL iL to remam opaque in Premiere) — use lhe
Then in Premiere, we cliclc on Lhe image, which has been placed in its own video track, magic wand Lool.
anduseClip > Video Options > TranspareilcytosettheKey(Whichtflggels & Gotoselect > Save Selection....
Lransparency) to Aipha Channel. IL is also simple Lo use Clip > Video Options >
6. Ensure thaL Channel = New. Press 0K.
Motion to have Lhe image fiy in and ouL of Lhe frame.
— ~II.Ofl
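A sketch of what the alpha-channel key then does in Premiere: each pixel's alpha decides whether that pixel of the image stays opaque or lets the underlying video show through. Here the solid background color is kept opaque, as in step 4, and a reverse flag flips the sense, in the spirit of the Reverse Key option; the function and values are illustrative, not Premiere's actual implementation:

```python
# A sketch of how an alpha-channel key works: each pixel's alpha decides whether
# the underlying video (transparent) or the image (opaque) shows through.
# Here the solid background color becomes opaque and everything else transparent;
# the 'reverse' flag flips that sense. Illustrative only.

def alpha_from_background(pixels, background, reverse=False):
    """pixels: list of rows of (R, G, B). Returns matching rows of alpha values."""
    opaque, clear = 255, 0
    if reverse:
        opaque, clear = clear, opaque
    return [[opaque if px == background else clear for px in row]
            for row in pixels]

white = (255, 255, 255)
image = [[white, (200, 0, 0)],
         [(200, 0, 0), white]]
print(alpha_from_background(image, white))
# With reverse=True, the red foreground is the part that stays opaque instead.
```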




FIGURE 2.19: (a) RGB channels; (b) alpha channel.

FIGURE 2.20: Director: main windows.

7. Go to Window > Show Channels, double-click the new channel, and rename it Alpha; make its color (0, 0, 0).

8. Save the file as a PSD.

If the alpha channel you created in Photoshop has a white background, you'll need to choose Reverse Key in Premiere when you choose Alpha.

Premiere has its own simple method of creating titles (to give credit where credit is due) for your digital video.

Another nice feature of Premiere is that it is simple to use in capturing video. To form a digital video from a videotape or camcorder input, go to File > Capture > Movie Capture. (The menu for video/audio capture options appears by right-clicking the capture window.) Similarly, saving to analog tape format is also simple.

Premiere Transitions   Premiere offers an interesting assortment of video transitions. However, examining the resulting video frame by frame reveals that the built-in transitions do not work quite as "advertised". For example, on close examination, what purports to be a wipe that is linear with time turns out to have a nonlinear dip as it begins — the video transition line moves at not quite constant speed.

The Premiere Transition Factory provides a good many functions for building our own transitions, if we are interested in doing so. Since we are actually in an int regime, these functions, such as sin and cos, have both domain and range in the ints rather than floats. Therefore, some care is required in using them. Exercise 9 gives some of these details in a realistic problem setting.

2.2.2 Macromedia Director

Director Windows   Director is a complete environment (see Figure 2.20) for creating interactive "movies". The movie metaphor is used throughout Director, and the windows used in the program reflect this. The main window, in which the action takes place, is the Stage. Explicitly opening the Stage automatically closes the other windows. (A useful shortcut is Shift + Keypad-Enter (the Enter key next to the numeric keypad, not the usual Enter key); this clears all windows except the Stage and plays the movie.)

The other two main windows are Cast and Score. A Cast consists of resources a movie may use, such as bitmaps, sounds, vector-graphics shapes, Flash movies, digital videos, and scripts. Cast members can be created directly or simply imported. Typically you create several casts, to better organize the parts of a movie. Cast members are placed on the Stage by dragging them there from the Cast window. Because several instances may be used for a single cast member, each instance is called a sprite. Typically, cast members are raw media, whereas sprites are objects that control where, when, and how cast members appear on the stage and in the movie.

Sprites can become interactive by attaching "behaviors" to them (for example, make the sprite follow the mouse), either prewritten or specially created. Behaviors are in the internal script language of Director, called Lingo. Director is a standard event-driven program that allows easy positioning of objects and attachment of event procedures to objects.

The set of predefined events is rich and includes mouse events as well as network events (an example of the latter would be testing whether cast members are downloaded yet). The type of control achievable might be to loop part of a presentation until a video is downloaded, then continue or jump to another frame. Bitmaps are used for buttons, and the most typical use would be to jump to a frame in the movie after a button-click event.

The Score window is organized in horizontal lines, each for one of the sprites, and vertical frames. Thus the Score looks somewhat like a musical score, in that time is from left to right, but it more resembles the list of events in a MIDI file (see Chapter 6).

Both types of behaviors, prewritten and user-defined, are in Lingo. The Library palette provides access to all prewritten behavior scripts. You can drop a behavior onto a sprite or attach behaviors to a whole frame.

FIGURE 2.21: A tweened sprite.

FIGURE 2.22: Score window.

If a behavior includes parameters, a dialog box appears. For example, navigation behaviors must have a specified frame to jump to. You can attach the same behavior to many sprites or frames and use different parameters for each instance. Most behaviors respond to simple events, such as a click on a sprite or the event triggered when the "playback head" enters a frame. Most basic functions, such as playing a sound, come prepackaged. Writing your own user-defined Lingo scripts provides more flexibility. Behaviors are modified using Inspector windows: the Behavior Inspector, or Property Inspector.

Animation   Traditional animation (cel animation) is created by showing slightly different images over time. In Director, this approach amounts to using different cast members in different frames. To control this process more easily, Director permits combining many cast members into a single sprite. (To place on the score, select all the images to be combined, then use the Cast To Time menu item to place them in the current score location.) A useful feature is that expanding the time used on the score for such an animation slows the playback time for each image, so the whole animation takes the required amount of time.

A less sophisticated-looking but simple animation is available with the tweening feature of Director. Here, you specify a particular image and move it around the stage without altering the original image. "Tweening" refers to the job of minor animators, who used to have to fill in between the keyframes produced by more experienced animators — a role Director fulfills automatically.

To prepare such an animation, specify the path on the stage for the tweened frames to take. You can also specify several keyframes and the kind of curve for the animation to follow between keyframes. You also specify how the image should accelerate and decelerate at the beginning and end of the movement ("ease-in" and "ease-out"). Figure 2.21 shows a tweened sprite.

A simple kind of animation called palette animation is also widely used. If images are 8-bit, cycling through the color lookup table or systematically replacing lookup table entries produces interesting (or strange) effects.

The Score window's important features are channels, frames, and the playback head. The latter shows where we are in the score; clicking anywhere in the score repositions the playback head. Channels are the rows in the Score and can contain sprite instances of visible media. Therefore, these numbered channels are called Sprite channels.

At the top of the Score window are Special Effects channels for controlling the palettes, tempo, transitions, and sounds. Figure 2.22 shows these channels in the Score window. Frames are numbered horizontally in the Sprite and Special Effects channels. A frame is a single step in the movie, as in a traditional film. The movie's playback speed can be modified by resetting the number of frames per second.

Control   You can place named markers at any frame. Then the simplest type of control event would be to jump to a marker. In Director parlance, each marker begins a Scene. Events triggered for frame navigation are Go To Frame, Go To Marker, or Hold on Current Frame, which stops the movie at that frame. Behaviors for frames appear in a Script Channel in the Score window.
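The keyframe tweening with "ease-in" and "ease-out" described above is interpolation with a nonlinear time curve. The sketch below uses the common smoothstep curve purely as a stand-in, since Director's actual easing function is not documented here:

```python
# Tweening sketch: interpolate a sprite's position between two keyframes.
# The smoothstep curve stands in for "ease-in/ease-out"; Director's real
# easing curve is not specified here, so this is only illustrative.

def smoothstep(u):
    """Map 0..1 -> 0..1 with zero slope at both ends (ease-in and ease-out)."""
    return u * u * (3 - 2 * u)

def tween(p0, p1, frame, frame0, frame1, ease=True):
    u = (frame - frame0) / (frame1 - frame0)
    if ease:
        u = smoothstep(u)
    return tuple(a + (b - a) * u for a, b in zip(p0, p1))

# Sprite moves from (0, 0) to (100, 50) over frames 0..10.
for f in (0, 5, 10):
    print(f, tween((0, 0), (100, 50), f, 0, 10))
```

With easing on, the sprite covers less ground near the endpoints than a straight linear tween would, which is exactly the accelerate-then-decelerate look the text describes.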

Buttons are simply bitmaps with behaviors attached. You usually make use of two bitmaps, one depicting the depressed state of the button and one for the undepressed state. Then the built-in event on mouseUp effects the jump.

Lingo Scripts   Director uses four types of scripts: behaviors, scripts attached to cast members, movie scripts, and parent scripts. Behaviors, movie scripts, and parent scripts all appear as cast members in the Cast window.

A "behavior" is a Lingo script attached to a sprite or a frame. You might use a script to determine whether a sprite moves, based on whether the user has clicked a button. A useful feature is that a script can control when a multimedia resource is played, depending on how much of the resource has already streamed from the web. To attach a behavior, drag it from a cast to a sprite or frame in the Score or on the Stage.

Also used are Movie scripts, which are available to the entire movie. Movie scripts can control event responses when a movie starts, stops, or pauses and can also respond to events, such as key presses and mouse clicks. Parent scripts can be used to create multiple instances of an object without adding cast members to the score.

User-written Lingo scripts can be used to create animation or to respond to typical events, such as user actions with the keyboard and mouse. Scripts can also be used to stream videos from the Internet, perform navigation, format text, and so on.

Lingo scripts also extend behaviors beyond what the Score alone can do. The basic data type is a list, which is of course the fundamental data structure. Using lists, you can manipulate arrays as well. Math operations and string handling are also available. Lists are of two types: linear and property.

A linear list is simply a list as in LISP, such as [32, 43, 12]. A property list is an association list, again as in LISP: each element contains two values separated by a colon. Each property is preceded by a number sign. For example, statements to create two different property lists to specify the Stage coordinates of two sprites are as follows:

sprite1Location = [#left:100, #top:150, #right:300, #bottom:350]
sprite2Location = [#left:400, #top:550, #right:500, #bottom:750]

Lingo has many functions that operate on lists, such as append to add an element to the end of a list and deleteOne to delete a value from a list.

Lingo Specifics

• The function the frame refers to the current frame.

• Special markers next or previous refer to adjacent markers (not adjacent frames).

• Function marker(-1) returns the identifier for the previous marker. If the frame is marked and has a marker name, marker(0) returns the name of the current frame; otherwise, it returns the name of the previous marker.

• movie "Jaws" refers to the start frame of the global movie named "Jaws". This would typically be the name of another Director movie. The reference frame 100 of movie "Jaws" points into that movie.

These details are well outlined in the Lingo Help portion of the online help. The Help directory Learning > Lingo_Examples has many DIR files that detail the basics of Lingo use.

Lingo Movie-in-a-Window   For an excellent example of Lingo usage, the Lingo Help article on creating a movie-in-a-window shows a good overview of how to attach a script.

Lingo is a standard, event-driven programming language. Event handlers are attached to specific events, such as a mouseDown message. Scripts contain event handlers. You attach a set of event handlers to an object by attaching the script to the object.

3D Sprites   A new feature recently added to Director is the ability to create, import, and manipulate 3D objects on the stage. A simple 3D object that can be added in Director is 3D text. To create 3D text, select any regular text, then in the Property Inspector click on the Text tab and set the display mode to 3D. Other options, such as text depth and texture, can be changed from the 3D Extruder tab in the Property Inspector window. These properties can also be dynamically set in Lingo as well, to change the text as the movie progresses.

3D objects other than text can be formed only using Lingo or imported from 3D Studio Max. Director supports many basic elements of 3D animation, including basic shapes such as spheres and user-definable meshes. The basic shapes can have textures and shaders added to them; textures are 2D images drawn onto the 3D models, while shaders define how the basic model looks. Lights can also be added to the scene; by default, one light provides ambient lighting to the whole scene. Four types of lights can be added: ambient, directional, point, and a spotlight. The strength and color of the light can also be specified.

The viewpoint of the user, called the camera, can be moved around to show the 3D objects from any angle. Movement of the camera, such as panning and tilting, can be controlled using built-in scripts in the Library window.

Properties and Parameters   Lingo behaviors can be created with more flexibility by specifying behavior parameters. Parameters can change a behavior by supplying input to the behavior when it is created. If no parameters are specified, a default value will be used. Parameters can be easily changed for a particular behavior by double-clicking on the name of the behavior while it is attached to another cast member, with dialog-driven parameter change as shown in Figure 2.23.

A behavior can have a special handler called getPropertyDescriptionList that is run when a sprite attached to the behavior is created. The handler returns a list of parameters that can be added by the addProp function. For example, if a movement behavior is made in Lingo, parameters can be added to specify the direction and speed of the movement. The behavior can then be attached to many cast members for a variety of movements.

The parameters defined in the getPropertyDescriptionList handler are properties of the behavior that can be accessed within any handler of that behavior. Defining a property in a behavior can be done by simply using the property keyword outside any handler and listing all the properties, separated by commas. Global variables can be

accessed across behaviors; they can be declared like a property, except that the global keyword is used instead. Each behavior that needs to access a global variable must declare it with the global keyword.

FIGURE 2.23: Parameters dialog box.

Director Objects   Director has two main types of objects: those created in Lingo and those on the Score. Parent scripts are used to create a new object in Lingo. A behavior can be transformed into a parent script by changing the script type in the Property Inspector. Parent scripts are different from other behaviors, in that parameters are passed into the object when it is created in Lingo script.

Parent scripts can be created and changed only in Lingo, while objects in the Score can only be manipulated. The most common objects used are the sprites in the Score. Sprites can be used only in the same time period as the Lingo script referencing them. Reference the sprite at the channel using the Sprite keyword followed by the sprite channel number.

A sprite has many properties that perform a variety of actions. The location of the sprite can be changed by the locV and locH properties to change the vertical and horizontal position, respectively. The member property specifies the sprite's cast member and can be used to change the cast member attached to that behavior. This can be useful in animation — instead of changing the sprite in the Score to reflect a small change, it can be done in Lingo.

2.2.3 Macromedia Flash

Flash is a simple authoring tool that facilitates the creation of interactive movies. Flash follows the score metaphor in the way the movie is created and the windows are organized. Here we give a brief introduction to Flash and provide some examples of its use.

Windows   A movie is composed of one or more scenes, each a distinct part of the movie. The command Insert > Scene creates a new scene for the current movie.

In Flash, components such as images and sound that make up a movie are called symbols, which can be included in the movie by placing them on the Stage. The stage is always visible as a large, white rectangle in the center window of the screen. Three other important windows in Flash are the Timeline, Library, and Tools.

Library Window   The Library window shows all the current symbols in the scene and can be toggled by the Window > Library command. A symbol can be edited by double-clicking its name in the library, which causes it to appear on the stage. Symbols can also be added to a scene by simply dragging the symbol from the Library onto the stage.

Timeline Window   The Timeline window manages the layers and timelines of the scene. The left portion of the Timeline window consists of one or more layers of the Stage, which enables you to easily organize the Stage's contents. Symbols from the Library can be dragged onto the Stage, into a particular layer. For example, a simple movie could have two layers, the background and foreground. The background graphic from the library can be dragged onto the stage when the background layer is selected.

Another useful function for layering is the ability to lock or hide a layer. Pressing the circular buttons next to the layer name can toggle their hidden/locked state. Hiding a layer can be useful while positioning or editing a symbol on a different layer. Locking a layer can prevent accidental changes to its symbols once the layer has been completed.

The right side of the Timeline window consists of a horizontal bar for each layer in the scene, similar to a musical score. This represents the passage of time in the movie. The Timeline is composed of a number of keyframes in different layers. A new keyframe can be inserted into the current layer by pressing F6. An event such as the start of an animation or the appearance of a new symbol must be in a keyframe. Clicking on the timeline changes the current time in the movie being edited.

Tools Window   The Tools window, which allows the creation and manipulation of images, is composed of four main sections: Tools, View, Colors, and Options. Tools consists of selection tools that can be used to demarcate existing images, along with several simple drawing tools, such as the pencil and paint bucket. View consists of a zoom tool and a hand tool, which allow navigation on the Stage. Colors allows foreground and background colors to be chosen, and symbol colors to be manipulated. Options allows additional options when a tool is selected.

Many other windows are useful in manipulating symbols. With the exception of the Timeline window, which can be toggled with the View > Timeline command, all other windows can be toggled under the Window menu. Figure 2.24 shows the basic Flash screen.

Symbols   Symbols can be either composed from other symbols, drawn, or imported into Flash. Flash is able to import several audio, image, and video formats into the symbol library. A symbol can be imported by using the command File > Import, which automatically adds it to the current library. To create a new symbol for the movie, press Ctrl + F8. A pop-up dialog box will appear in which you can specify the name and behavior of the symbol. Symbols can take on one of three behaviors: a button, a graphic, or a movie. Symbols, such as a button, can be drawn using the Tools window.
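The layer-and-keyframe model of the Timeline window can be sketched as a lookup: the content visible on a layer at a given frame comes from the latest keyframe at or before that frame. Layer and symbol names below are invented for illustration:

```python
# A sketch of the Timeline model described above: each layer holds keyframes,
# and the content visible at a given frame comes from the most recent keyframe
# at or before it. Layer and symbol names are made up for illustration.
import bisect

class Layer:
    def __init__(self, name):
        self.name = name
        self.frames = []    # sorted keyframe numbers
        self.content = {}   # keyframe number -> symbol name

    def set_keyframe(self, frame, symbol):
        if frame not in self.content:
            bisect.insort(self.frames, frame)
        self.content[frame] = symbol

    def at(self, frame):
        i = bisect.bisect_right(self.frames, frame) - 1
        return self.content[self.frames[i]] if i >= 0 else None

background = Layer("background")
background.set_keyframe(1, "sky")
background.set_keyframe(20, "night sky")
print(background.at(10))   # still showing the frame-1 keyframe
print(background.at(25))   # the frame-20 keyframe has taken over
```

This is why, as the text notes, any change such as the appearance of a new symbol must sit in a keyframe: between keyframes a layer simply carries its previous content forward.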

FIGURE 2.24: Macromedia Flash.

FIGURE 2.25: Create symbol dialog.

Buttons   To create a simple button, create a new symbol with the button behavior. The Timeline window should have four keyframes: up, down, over, and hit. These keyframes show different images of the button when the specified action is taken. Only the up keyframe is required and is the default; all others are optional. A button can be drawn by selecting the rectangular tool in the Tools window and then dragging a rectangle onto the Stage.

To add images, so that the button's appearance will change when an event is triggered, click on the appropriate keyframe and create the button image. After at least one keyframe is defined, the basic button is complete, although no action is yet attached to it. Actions are discussed further in the action scripts section below.

Creating a symbol from other symbols is similar to creating a scene: drag the desired

Tweening   There are two types of tweening: shape and movement tweening. Shape tweening allows you to create a shape that continuously changes to a different shape over time. Movement tweening allows you to place a symbol in different places on the Stage in different keyframes. Flash automatically fills in the keyframes along a path between the start and finish. To carry out movement tweening, select the symbol to be tweened, choose Insert > Create Motion Tween, and select the end frame. Then use the command Insert > Frame and move the symbol to the desired position. More advanced tweening allows control of the path as well as of acceleration. Movement and shape tweenings can be combined for additional effect.

Mask animation involves the manipulation of a layer mask — a layer that selectively hides portions of another layer. For example, to create an explosion effect, you could use a mask to cover all but the center of the explosion. Shape tweening could then expand the mask, so that eventually the whole explosion is seen to take place. Figure 2.26 shows a scene before and after a tweening effect is added.

Action Scripts   Action scripts allow you to trigger events such as moving to a different keyframe or requiring the movie to stop. Action scripts can be attached to a keyframe or symbols in a keyframe. Right-clicking on the symbol and pressing Actions in the list can modify the actions of a symbol. Similarly, by right-clicking on the keyframe and pressing Actions in the pop-up, you can apply actions to a keyframe. A Frame Actions window will come up, with a list of available actions on the left and the current actions being applied to the symbol on the right. Action scripts are broken into six categories: Basic
symbols from Lhe Library onLo Lhe Stage. This allows Lhe creation of complex symbols by
combining simpler symbols. Figure 2.25 shows a dialog box for symbol creaLion.

Animation iii Flash Animation can be accomplished by creating subtie differences in


each keyframe of a symbol. In the first keyframe, Lhe symbol to be animaLed can be dragged
ML( ML.!L.UM~
onto Lhe stage from Lhe Library. Then another keyframc can be inserted, and the symbol
changed. This can be repeaLed as ofLen as needed. Although Lhis process is Lime-consuming,
it offers more flexibility Lhan any other technique for animation. Flash also allows specific
animations Lo be more easily creaLed in several oLher ways. Tweening can produce simpie FIGURE 2.26: Before and after Lweening leLters.
animations, with changes auLomaLically created between keyframes.
Actions, Actions, Operators, Functions, Properties, and Objects. Figure 2.27 shows the Frame Actions window.

FIGURE 2.27: Action scripts window.

Basic Actions allow you to attach many simple actions to the movie. Some common actions are

• Goto. Moves the movie to the keyframe specified and can optionally stop. The stop action is commonly used to stop interactive movies when the user is given an option.

• Play. Resumes the movie if the movie is stopped.

• Stop. Stops the movie if it is playing.

• Tell Target. Sends messages to different symbols and keyframes in Flash. It is commonly used to start or stop an action on a different symbol or keyframe.

The Actions category contains many programming constructs, such as Loops and Goto statements. Other actions are also included, similar to those in typical high-level, event-driven programming languages, such as Visual Basic. The Operators category includes many comparison and assignment operators for variables. This allows you to perform operations on variables in the action script.

The Functions category contains built-in functions included in Flash that are not specific to a Flash object. The Properties section includes all the global variables predefined in Flash. For example, to refer to the current frame, the variable _currentframe is defined. The Objects section lists all objects, such as movie clips or strings, and their associated functions.

Buttons need action scripts, that is, event procedures, so that pressing the button will cause an effect. It is straightforward to attach a simple action, such as replaying the Flash movie, to a button. Select the button and click to launch the action script window, located at the bottom right of the screen. Then click on Basic Actions, which generates a drop-down list of actions. Double-clicking on the Play action automatically adds it to the right side of the window. This button now replays the movie when clicked.

2.2.4 Dreamweaver

Dreamweaver is quite a popular Macromedia product (Dreamweaver MX is the current version) for building multimedia-enabled web sites as well as Internet applications in HTML, XML, and other formats. It provides visual layout tools and code-editing capability for file types such as JavaScript, Active Server Pages, PHP, and XML. The product is integrated with other Macromedia products such as Flash MX and Fireworks MX.

Along with its use as basically a WYSIWYG web development tool, an interesting part of Dreamweaver that relates more directly to authoring is the fact that it comes with a prepackaged set of behaviors and is also extensible. The behaviors are essentially event procedures, responding to events such as mouseover; the set of possible events is different for each target browser and is reconfigurable for each browser and version number. Computer Science students can write their own JavaScript code, say, and attach this to events.

2.3 VRML

2.3.1 Overview

VRML, which stands for Virtual Reality Modeling Language, was conceived at the first international conference of the World Wide Web. Mark Pesce, Tony Parisi, and David Raggett outlined the structure of VRML at the conference and specified that it would be a platform-independent language that would be viewed on the Internet. The objective of VRML was to have the capability to put colored objects into a 3D environment.

VRML is an interpreted language, which can be seen as a disadvantage, because it runs slowly on many computers today. However, it has been influential, because it was the first method available for displaying a 3D world on the World Wide Web.

Strictly speaking, VRML is not a "tool," like Premiere or Director. In fact, the only piece of software needed to create VRML content is a text editor. Nonetheless, VRML is a tool used to create 3D environments on the web, much like Flash is a tool used to create interactive movies.

History VRML 1.0 was created in May 1995, with a revision for clarification called VRML 1.0C in January 1996. VRML is based on a subset of the Inventor file format created by Silicon Graphics Inc. VRML 1.0 allowed for the creation of many simple 3D objects, such as a cube, sphere, and user-defined polygons. Materials and textures can be specified for objects to make the objects more realistic.

The last major revision of VRML was VRML 2.0. This revision added the ability to create an interactive world. VRML 2.0, also called "Moving Worlds", allows for animation and sound in an interactive virtual world. New objects were added to make the creation of virtual worlds easier. Java and JavaScript have been included in VRML to allow for interactive
objects and user-defined actions. VRML 2.0 was a major change from VRML 1.0, and the two versions are not compatible. However, utilities are available to convert VRML 1.0 to VRML 2.0.

VRML 2.0 was submitted for standardization to the International Organization for Standardization (ISO), and as a result, VRML97 was specified. VRML97 is virtually identical to VRML 2.0; only minor documentation changes and clarifications were added. VRML97 is an ISO/IEC standard.

VRML Shapes VRML is made up of nodes put into a hierarchy that describe a scene of one or more objects. VRML contains basic geometric shapes that can be combined to create more complex objects. The Shape node is a generic node for all objects in VRML. The Box, Cylinder, Cone, and Sphere are geometry nodes that place basic objects in the virtual world.

VRML allows for the definition of complex shapes that include IndexedFaceSet and Extrusion. An IndexedFaceSet is a set of faces that make up an object. This allows for the creation of complex shapes, since an arbitrary number of faces is allowed. An Extrusion is a 2D cross-section extruded along a spine and is useful in creating a simple curved surface, such as a flower petal.

An object's shape, size, color, and reflective properties can be specified in VRML. The Appearance node controls the way a shape looks and can contain a Material node and texture nodes. Figure 2.28 displays some of these shapes.

FIGURE 2.28: Basic VRML shapes.

A Material node specifies an object's surface properties. It can control what color the object is by specifying the red, green, and blue values of the object. The specular and emissive colors can be specified similarly. Other attributes, such as how much the object reflects direct and indirect light, can also be controlled. Objects in VRML can be transparent or partially transparent. This is also included in the Material node.

Three kinds of texture nodes can be used to map textures onto any object. The most common one is the ImageTexture, which can take an external JPEG or PNG image file and map it onto the shape. The way the image is textured can be specified; that is, the way the image should be tiled onto the object is editable.

A MovieTexture node allows mapping an MPEG movie onto an object; the starting and stopping time can also be specified.

The final texture-mapping node is called a PixelTexture, which simply means creating an image to use with ImageTexture in VRML. Although it is more inefficient than an ImageTexture node, it is still useful for simple textures.

Text can be put into a VRML world using the Text node. You can specify the text to be included, as well as the font, alignment, and size. By default, the text faces in the positive Y direction, or "up".

All shapes and text start in the middle of the VRML world. To arrange the shapes, Transform nodes must be wrapped around the shape nodes. The Transform node can contain Translation, Scale, and Rotation nodes. Translation simply moves the object a specific distance from its current location, which is by default the center of the world. Scale increases or decreases the size of the object, while Rotation rotates the object around its center.

VRML World A virtual world needs more than just shapes to be realistic; it needs cameras to view the objects, as well as backgrounds and lighting. The default camera is aligned with the negative z-axis, a few meters from the center of the scene. Using Viewpoint nodes, the default camera position can be changed and other cameras added. Figure 2.29 displays a simple VRML scene from one viewpoint.

FIGURE 2.29: A simple VRML scene.

The viewpoint can be specified with the position node and can be rotated from the default view with the orientation node. The camera's angle for its field of view can be changed from its default 0.78 radians with the fieldOfView node. Changing the field of view can create a telephoto effect.

Three types of lighting can be used in a VRML world. A DirectionalLight node shines a light across the whole world in a certain direction, similar to the light from the sun; it is from one direction and affects all objects in the scene. A PointLight shines a light in all directions from a certain point in space. A SpotLight shines a light in a certain direction from a point. Proper lighting is important in adding realism to a world. Many parameters, such as the color and strength of the light, can be specified for every type of light.
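How strongly a DirectionalLight brightens a surface is, at heart, just Lambert's cosine law: the dot product of the (negated) light direction with the surface normal. The following is our own illustrative sketch of that calculation, not part of the VRML specification; all function names here are ours.

```python
# Minimal sketch (not from the book): how a renderer might weight a
# DirectionalLight's contribution on a surface, using Lambert's cosine law.

def normalize(v):
    # Scale a 3-vector to unit length.
    length = sum(c * c for c in v) ** 0.5
    return tuple(c / length for c in v)

def diffuse_intensity(light_dir, surface_normal):
    """Brightness factor in [0, 1] a directional light gives a surface.

    light_dir points in the direction the light travels, so we negate it
    (via the minus sign) before comparing with the outward surface normal.
    """
    l = normalize(light_dir)
    n = normalize(surface_normal)
    dot = sum(a * b for a, b in zip(l, n))
    return max(0.0, -dot)

# A light shining straight down (-y) fully illuminates an upward-facing
# surface, and leaves a downward-facing surface dark.
print(diffuse_intensity((0, -1, 0), (0, 1, 0)))   # 1.0
print(diffuse_intensity((0, -1, 0), (0, -1, 0)))  # 0.0
```

Multiplying this factor by the light's color and intensity parameters gives the diffuse shading a browser would display.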
The background of the VRML world can also be specified using the Background node. The background color, black by default, as well as the sky color can be changed. A Panorama node can map a texture to the sides of the world. A panorama is mapped onto a large cube surrounding the VRML world. If a panorama is used, the user can never approach the texture, because the panorama is centered on the user. It is also possible to add fog in VRML using the Fog node, where the color and density of the fog can be specified. Fog can increase the frame rate of a world, since objects hidden by the fog are not rendered.

2.3.2 Animation and Interactions

An advantage of VRML97 over the original VRML 1.0 is that the VRML world can be interactive. The only method of animation in VRML is tweening, which can be done by slowly changing an object specified in an interpolator node. This node will modify an object over time, based on the type of interpolator.

There are six interpolators: color, coordinate, normal, orientation, position, and scalar. All interpolators have two nodes that must be specified: the key and keyValue. The key consists of a list of two or more numbers, starting with 0 and ending with 1. Each key element must be complemented with a keyValue element. The key defines how far along the animation is, and the keyValue defines what values should change. For example, a key element of 0.5 and its matching keyValue define what the object should look like at the middle of the animation.

A TimeSensor node times an animation, so that the interpolator knows what stage the object should be in. A TimeSensor has no physical form in the VRML world and just keeps time. To notify an interpolator of a time change, a ROUTE is needed to connect two nodes. One is needed between the TimeSensor and the interpolator and another between the interpolator and the object to be animated. Most animation can be accomplished this way. Chaining ROUTE commands so that one event triggers many others can accomplish complex animations.

Two categories of sensors can be used in VRML to obtain input from a user. The first is environment sensors. There are three kinds of environment sensor nodes: VisibilitySensor, ProximitySensor, and Collision. A VisibilitySensor is activated when a user's field of view enters an invisible box. A ProximitySensor is activated when a user enters or leaves an area. A Collision is activated when the user hits the node.

The second category of sensors is called pointing device sensors. The first pointing device sensor is a touch sensor, activated when an object is clicked with the mouse. Three other sensors are called drag sensors. These sensors allow the rotation of spheres, cylinders, and planes when a mouse is dragging the object.

2.3.3 VRML Specifics

A VRML file is simply a text file with a .wrl extension. VRML97 must include the line #VRML V2.0 utf8 in the first line of the file. A # denotes a comment anywhere in the file except for the first line. The first line of a VRML file tells the VRML client what version of VRML to use. VRML nodes are case sensitive and are usually built hierarchically.

Although only a simple text editor such as Notepad is needed, VRML-specific text editors are available, such as VrmlPad. They aid in creating VRML objects by providing different colors and collapsing or expanding nodes.

All nodes begin with "{" and end with "}" and most can contain nodes inside nodes. Special nodes, called group nodes, can cluster multiple nodes. The keyword children followed by "[" begins the list of children nodes, which ends with "]". A Transform node is an example of a group node.

Nodes can be named using DEF and can be used again later by using the keyword USE. This allows for creation of complex objects using many simple objects.

To create a simple box in VRML:

    Shape {
        geometry Box {}
    }

The box defaults to a 2-meter-long cube in the center of the screen. Putting it into a Transform node can move this box to a different part of the scene. We can also give the box a different color, such as red:

    Transform { translation 0 10 0 children [
        Shape {
            geometry Box {}
            appearance Appearance {
                material Material {
                    diffuseColor 1 0 0
                }
            }
        }
    ]}

This VRML fragment puts a red box centered in the +10 Y direction. The box can be reused if DEF mybox is put in front of the Transform. Now, whenever the box needs to be used again, simply putting USE mybox will make a copy.

2.4 FURTHER EXPLORATION

Good general references for multimedia authoring are introductory books [3, 1] and Chapters 5-8 in [4]. Material on automatic authoring is fully expanded in [7].

A link to the overall, very useful FAQ file for multimedia authoring is in the textbook web site's Further Exploration section for this chapter.

Our TextColor.exe program for investigating complementary colors, as in Figure 2.5, is on the textbook web site as well.

We also include a link to a good FAQ collection on Director. A simple Director movie demonstrating the ideas set out in Section 2.2.2 may be downloaded from the web site, along with information on Dreamweaver, VRML, and a small demo VRML world.
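The key/keyValue mechanism of Section 2.3.2 is ordinary piecewise-linear interpolation. The sketch below (our illustration, written in Python rather than VRML) shows what a scalar interpolator computes for a given animation fraction; real VRML interpolators do the same per component for positions, colors, and so on.

```python
def interpolate(key, key_value, t):
    """Piecewise-linear interpolation, as a VRML interpolator performs.

    key:       fractions of the animation, ascending from 0 to 1
    key_value: one scalar per key (real interpolators use vectors/colors)
    t:         current animation fraction in [0, 1]
    """
    if t <= key[0]:
        return key_value[0]
    # Find the segment containing t and blend its two endpoint values.
    for k0, k1, v0, v1 in zip(key, key[1:], key_value, key_value[1:]):
        if t <= k1:
            frac = (t - k0) / (k1 - k0)
            return v0 + frac * (v1 - v0)
    return key_value[-1]

# Rise from 0 to 10 over the first half of the animation, then hold.
print(interpolate([0.0, 0.5, 1.0], [0.0, 10.0, 10.0], 0.25))  # 5.0
```

A TimeSensor would supply t, ROUTEd into the interpolator, whose output is in turn ROUTEd to the animated object's field.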
2.5 EXERCISES

1. What extra information is multimedia good at conveying?

   (a) What can spoken text convey that written text cannot?
   (b) When might written text be better than spoken text?

2. Find and learn 3D Studio Max in your local lab software. Read the online tutorials to see this software's approach to a 3D modeling technique. Learn texture mapping and animation using this product. Make a 3D model after carrying out these steps.

3. Design an interactive web page using Dreamweaver. HTML 4 provides layer functionality, as in Adobe Photoshop. Each layer represents an HTML object, such as text, an image, or a simple HTML page. In Dreamweaver, each layer has a marker associated with it. Therefore, highlighting the layer marker selects the entire layer, to which you can apply any desired effect. As in Flash, you can add buttons and behaviors for navigation and control. You can create animations using the Timeline behavior.

4. In regard to automatic authoring,

   (a) What would you suppose is meant by the term "active images"?
   (b) What are the problems associated with moving text-based techniques to the realm of image-based automatic authoring?
   (c) What is the single most important problem associated with automatic authoring using legacy (already written) text documents?

5. Suppose we wish to create a simple animation, as in Figure 2.30. Note that this image is exactly what the animation looks like at some time, not a figurative representation of the process of moving the fish; the fish is repeated as it moves. State what we need to carry out this objective, and give a simple pseudocode solution for the problem. Assume we already have a list of (x, y) coordinates for the fish path, that we have available a procedure for centering images on path positions, and that the movement takes place on top of a video.

FIGURE 2.30: Sprite, progressively taking up more space.

6. For the slide transition in Figure 2.11, explain how we arrive at the formula for x in the unmoving right video RR.

7. Suppose we wish to create a video transition such that the second video appears under the first video through an opening circle (like a camera iris opening), as in Figure 2.31. Write a formula to use the correct pixels from the two videos to achieve this special effect. Just write your answer for the red channel.

FIGURE 2.31: Iris wipe: (a) iris is opening; (b) at a later moment.

8. Now suppose we wish to create a video transition such that the second video appears under the first video through a moving radius (like a clock hand), as in Figure 2.32. Write a formula to use the correct pixels from the two videos to achieve this special effect for the red channel.

FIGURE 2.32: Clock wipe: (a) clock hand is sweeping out; (b) at a later moment.
9. Suppose you wish to create a wavy effect, as in Figure 2.33. This effect comes from replacing the image x value by an x value offset by a small amount. Suppose the image size is 160 rows x 120 columns of pixels.

FIGURE 2.33: Filter applied to video.

   (a) Using float arithmetic, add a sine component to the x value of the pixel such that the pixel takes on an RGB value equal to that of a different pixel in the original image. Make the maximum shift in x equal to 16 pixels.
   (b) In Premiere and other packages, only integer arithmetic is provided. Functions such as sin are redefined so as to take an int argument and return an int. The argument to the sin function must be in 0 .. 1,024, and the value of sin is in -512 .. 512: sin(0) returns 0, sin(256) returns 512, sin(512) returns 0, sin(768) returns -512, and sin(1,024) returns 0. Rewrite your expression in part (a) using integer arithmetic.
   (c) How could you change your answer to make the waving time-dependent?

10. How would you create the image in Figure 2.6? Write a small program to make such an image. Hint: Place R, G, and B at the corners of an equilateral triangle inside the circle. It's best to go over all columns and rows in the output image rather than simply going around the disk and trying to map results back to (x, y) pixel positions.

11. As a longer exercise for learning existing software for manipulating images, video, and music, make a 1-minute digital video. By the end of this exercise, you should be familiar with PC-based equipment and know how to use Adobe Premiere, Photoshop, Cakewalk Pro Audio, and other multimedia software.

   (a) Capture (or find) at least three video files. You can use a camcorder or VCR to make your own (through Premiere or the like) or find some on the Net.
   (b) Compose (or edit) a small MIDI file with Cakewalk Pro Audio.
   (c) Create (or find) at least one WAV file. You may either digitize your own or download some from the net.
   (d) Use Photoshop to create a title and an ending.
   (e) Combine all of the above to produce a movie about 60 seconds long, including a title, some credits, some soundtracks, and at least three transitions. Experiment with different compression methods; you are encouraged to use MPEG for your final product.
   (f) The above constitutes a minimum statement of the exercise. You may be tempted to get very creative, and that's fine, but don't go overboard and take too much time away from the rest of your life!

2.6 REFERENCES

1. A.C. Luther, Authoring Interactive Multimedia, The IBM Tools Series, San Diego: AP Professional, 1994.
2. R. Vetter, C. Ward, and S. Shapiro, "Using Color and Text in Multimedia Projections," IEEE Multimedia, 2(4): 46-54, 1995.
3. J.C. Shepherd and D. Colaizzi, Authoring Authorware: A Practical Guide, Upper Saddle River, NJ: Prentice Hall, 1998.
4. D.E. Wolfgram, Creating Multimedia Presentations, Indianapolis: Que Publishing, 1994.
5. A. Ginige, D. Lowe, and J. Robertson, "Hypermedia Authoring," IEEE Multimedia, 2: 24-35, 1995.
6. A. Ginige and D. Lowe, "Next Generation Hypermedia Authoring Systems," in Proceedings of Multimedia Information Systems and Hypermedia, 1995, 1-11.
7. D. Lowe and W. Hall, Hypermedia and the Web: An Engineering Approach, New York: Wiley, 1999.
CHAPTER 3

Graphics and Image Data Representations

In this chapter we look at images, starting with 1-bit images, then 8-bit gray images and how to print them, then 24-bit color images and 8-bit versions of color images. The specifics of file formats for storing such images will also be discussed.

We consider the following topics:

• Graphics/image data types
• Popular file formats

3.1 GRAPHICS/IMAGE DATA TYPES

The number of file formats used in multimedia continues to proliferate [1]. For example, Table 3.1 shows a list of file formats used in the popular product Macromedia Director. In this text, we shall study just a few popular file formats, to develop a sense of how they operate. We shall concentrate on GIF and JPG image file formats, since these two formats are distinguished by the fact that most web browsers can decompress and display them. To begin, we shall discuss the features of file formats in general.

TABLE 3.1: Macromedia Director file formats.

    File import:
      Image:     BMP, DIB, GIF, JPG, PICT, PNG, PNT, PSD, TGA, TIFF, WMF
      Palette:   PAL, ACT
      Sound:     AIFF, AU, MP3, WAV
      Video:     AVI, MOV
      Animation: DIR, FLA, FLC, FLI, GIF, PPT
    File export:
      Image:     BMP
      Video:     AVI, MOV
    Native:      DIR, DXR, EXE

3.1.1 1-Bit Images

Images consist of pixels, or pels (picture elements in digital images). A 1-bit image consists of on and off bits only and thus is the simplest type of image. Each pixel is stored as a single bit (0 or 1). Hence, such an image is also referred to as a binary image.

It is also called a 1-bit monochrome image, since it contains no color. Figure 3.1 shows a 1-bit monochrome image (called "Lena" by multimedia scientists; this is a standard image used to illustrate many algorithms). A 640 x 480 monochrome image requires 38.4 kilobytes of storage (= 640 x 480/8). Monochrome 1-bit images can be satisfactory for pictures containing only simple graphics and text.

FIGURE 3.1: Monochrome 1-bit Lena image.

3.1.2 8-Bit Gray-Level Images

Now consider an 8-bit image, that is, one for which each pixel has a gray value between 0 and 255. Each pixel is represented by a single byte; for example, a dark pixel might have a value of 10, and a bright one might be 230.

The entire image can be thought of as a two-dimensional array of pixel values. We refer to such an array as a bitmap, a representation of the graphics/image data that parallels the manner in which it is stored in video memory.

Image resolution refers to the number of pixels in a digital image (higher resolution always yields better quality). Fairly high resolution for such an image might be 1,600 x 1,200, whereas lower resolution might be 640 x 480. Notice that here we are using an aspect ratio of 4:3. We don't have to adopt this ratio, but it has been found to look natural.

Such an array must be stored in hardware; we call this hardware a frame buffer. Special (relatively expensive) hardware called a "video" card (actually a graphics card) is used for this purpose. The resolution of the video card does not have to match the desired resolution of the image, but if not enough video card memory is available, the data has to be shifted around in RAM for display.

We can think of the 8-bit image as a set of 1-bit bitplanes, where each plane consists of a 1-bit representation of the image at higher and higher levels of "elevation": a bit is turned on if the image pixel has a nonzero value at or above that bit level.

Figure 3.2 displays the concept of bitplanes graphically. Each bitplane can have a value of 0 or 1 at each pixel but, together, all the bitplanes make up a single byte that stores
values between 0 and 255 (in this 8-bit situation). For the least significant bit, the bit value translates to 0 or 1 in the final numeric sum of the binary number. Positional arithmetic implies that for the next, second, bit each 0 or 1 makes a contribution of 0 or 2 to the final sum. The next bits stand for 0 or 4, 0 or 8, and so on, up to 0 or 128 for the most significant bit. Video cards can refresh bitplane data at video rate but, unlike RAM, do not hold the data well. Raster fields are refreshed at 60 cycles per second in North America.

FIGURE 3.2: Bitplanes for 8-bit grayscale image.

Each pixel is usually stored as a byte (a value between 0 and 255), so a 640 x 480 grayscale image requires 300 kilobytes of storage (640 x 480 = 307,200). Figure 3.3 shows the Lena image again, this time in grayscale.

FIGURE 3.3: Grayscale image of Lena.

If we wish to print such an image, things become more complex. Suppose we have available a 600 dot-per-inch (dpi) laser printer. Such a device can usually only print a dot or not print it. However, a 600 x 600 image will be printed in a 1-inch space and will thus not be very pleasing. Instead, dithering is used. The basic strategy of dithering is to trade intensity resolution for spatial resolution. (See [2], p. 568, for a good discussion of dithering.)

Dithering For printing on a 1-bit printer, dithering is used to calculate larger patterns of dots, such that values from 0 to 255 correspond to pleasing patterns that correctly represent darker and brighter pixel values. The main strategy is to replace a pixel value by a larger pattern, say 2 x 2 or 4 x 4, such that the number of printed dots approximates the varying-sized disks of ink used in halftone printing. Halftone printing is an analog process that uses smaller or larger filled circles of black ink to represent shading, for newspaper printing, say.

If instead we use an n x n matrix of on-off 1-bit dots, we can represent n^2 + 1 levels of intensity resolution, since, for example, three dots filled in any way counts as one intensity level. The dot patterns are created heuristically. For example, if we use a 2 x 2 "dither matrix":

    | 0  2 |
    | 3  1 |

we can first remap image values in 0 .. 255 into the new range 0 .. 4 by (integer) dividing by 256/5. Then, for example, if the pixel value is 0, we print nothing in a 2 x 2 area of printer output. But if the pixel value is 4, we print all four dots. So the rule is:

    If the intensity is greater than the dither matrix entry, print an on dot at that
    entry location: replace each pixel by an n x n matrix of dots.

However, we notice that the number of levels is small for this type of printing. If we increase the number of effective intensity levels by increasing the dither matrix size, we also increase the size of the output image. This reduces the amount of detail in any small part of the image, effectively reducing the spatial resolution.

Note that the image size may be much larger for a dithered image, since replacing each pixel by a 4 x 4 array of dots, say, makes an image 16 times as large. However, a clever trick can get around this problem. Suppose we wish to use a larger, 4 x 4 dither matrix, such as

    |  0   8   2  10 |
    | 12   4  14   6 |
    |  3  11   1   9 |
    | 15   7  13   5 |

Then suppose we slide the dither matrix over the image four pixels in the horizontal and vertical directions at a time (where image values have been reduced to the range 0 .. 16). An "ordered dither" consists of turning on the printer output bit for a pixel if the intensity level is greater than the particular matrix element just at that pixel position. Figure 3.4(a) shows a grayscale image of Lena. The ordered-dither version is shown as Figure 3.4(b), with a detail of Lena's right eye in Figure 3.4(c).
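The 2 x 2 dither-matrix rule described above can be sketched in a few lines of Python. This is our own illustration (the book gives general pseudocode in Algorithm 3.1): gray values are first remapped from 0 .. 255 down to 0 .. 4, then compared against the matrix entry at each position.

```python
# Ordered dithering with a 2x2 dither matrix (an illustrative sketch).

DITHER_2x2 = [[0, 2],
              [3, 1]]

def ordered_dither(image, d=DITHER_2x2):
    """image: 2-D list of 8-bit gray values; returns a 2-D list of 0/1 dots."""
    n = len(d)
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            # Remap 0..255 to 0..n*n (0..4 here) by integer division.
            level = value // (256 // (n * n + 1))
            # Turn the dot on if the level exceeds the matrix entry.
            out_row.append(1 if level > d[y % n][x % n] else 0)
        out.append(out_row)
    return out

# A uniform mid-gray patch (128 -> level 2) turns on the dots whose matrix
# entry is below 2, i.e., half the positions.
print(ordered_dither([[128, 128], [128, 128]]))  # [[1, 0], [0, 1]]
```

Note this sketch applies the matrix per pixel position; sliding a larger matrix over the image, as discussed above for the 4 x 4 case, keeps the output the same size as the input.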
Section 3.1 Graphics/Image Data Types 65
64 Chapter 3 Graphics and Image Data Representations

(a) (b) (e)


FIGURE 3.4: Dithering of grayscaie images. (a) 8-bit gray image lenagray . bmp;
(b) dithered version of lhe image: (e) detail of dithered version. (This figure also appears in
Lhe color insert section.)

An algorithm for ordered dither, with an n x n dither matrix, is as follows:

ALGORITHM 3.1 ORDERED DITHER

begin
  for x = 0 to x_max        // columns
    for y = 0 to y_max      // rows
      i = x mod n
      j = y mod n
      // I(x, y) is the input, O(x, y) is the output, D is the dither matrix.
      if I(x, y) > D(i, j)
        O(x, y) = 1;
      else
        O(x, y) = 0;
end

Foley et al. [2] provides more details on ordered dithering.
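Algorithm 3.1 translates directly into a few lines of code. The following Python sketch is ours, not the book's (the function name and the toy ramp image are invented for illustration); it assumes, as in the text, that image values have already been reduced to the range 0..16:

```python
# Ordered dither (Algorithm 3.1): turn the output bit on wherever the
# intensity exceeds the dither-matrix entry at that pixel position.
D = [[0, 8, 2, 10],
     [12, 4, 14, 6],
     [3, 11, 1, 9],
     [15, 7, 13, 5]]

def ordered_dither(image, dither=D):
    """image: 2-D list of intensities in the range 0..16.
    Returns a 2-D list of 0/1 printer output bits."""
    n = len(dither)
    # i = x mod n, j = y mod n, as in the algorithm above.
    return [[1 if image[y][x] > dither[x % n][y % n] else 0
             for x in range(len(image[y]))]
            for y in range(len(image))]

# A tiny test image: a left-to-right intensity ramp, repeated for 4 rows.
ramp = [[0, 5, 11, 16]] * 4
for bits in ordered_dither(ramp):
    print(bits)
```

Note how brighter pixels (toward the right of the ramp) turn on more bits; summing the bits down a column approximates that column's intensity, which is exactly the intensity-for-spatial-resolution trade described above.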
3.1.3 Image Data Types

The next sections introduce some of the most common data types for graphics and image file formats: 24-bit color and 8-bit color. We then discuss file formats. Some formats are restricted to particular hardware/operating system platforms, while others are platform-independent, or cross-platform, formats. Even if some formats are not cross-platform, conversion applications can recognize and translate formats from one system to another.

Most image formats incorporate some variation of a compression technique due to the large storage size of image files. Compression techniques can be classified as either lossless or lossy. We will study various image, video, and audio compression techniques in Chapters 7 through 14.

3.1.4 24-Bit Color Images

In a color 24-bit image, each pixel is represented by three bytes, usually representing RGB. Since each value is in the range 0-255, this format supports 256 x 256 x 256, or a total of 16,777,216, possible combined colors. However, such flexibility does result in a storage penalty: a 640 x 480 24-bit color image would require 921.6 kilobytes of storage without any compression.

An important point to note is that many 24-bit color images are actually stored as 32-bit images, with the extra byte of data for each pixel storing an α (alpha) value representing special-effect information. (See [2], p. 835, for an introduction to use of the α-channel for compositing several overlapping objects in a graphics image. The simplest use is as a transparency flag.)

Figure 3.5 shows the image forestfire.bmp, a 24-bit image in Microsoft Windows BMP format (discussed later in the chapter). Also shown are the grayscale images for just the red, green, and blue channels, for this image. Taking the byte values 0..255 in each color channel to represent intensity, we can display a gray image for each color separately.

FIGURE 3.5: High-resolution color and separate R, G, B color channel images. (a) example of 24-bit color image forestfire.bmp; (b, c, d) R, G, and B color channels for this image. (This figure also appears in the color insert section.)

3.1.5 8-Bit Color Images

If space is a concern (and it almost always is), reasonably accurate color images can be obtained by quantizing the color information to collapse it. Many systems can make use of only 8 bits of color information (the so-called "256 colors") in producing a screen image. Even if a system has the electronics to actually use 24-bit information, backward compatibility demands that we understand 8-bit color image files.

FIGURE 3.6: Three-dimensional histogram of RGB colors in forestfire.bmp.

FIGURE 3.7: Example of 8-bit color image. (This figure also appears in the color insert section.)

Such image files use the concept of a lookup table to store color information. Basically, the image stores not color but instead just a set of bytes, each of which is an index into a table with 3-byte values that specify the color for a pixel with that lookup table index. In a way, it's a bit like a paint-by-number children's art set, with number 1 perhaps standing for orange, number 2 for green, and so on; there is no inherent pattern to the set of actual colors. It makes sense to carefully choose just which colors to represent best in the image: if an image is mostly red sunset, it's reasonable to represent red with precision and store only a few greens.

Suppose all the colors in a 24-bit image were collected in a 256 x 256 x 256 set of cells, along with the count of how many pixels belong to each of these colors stored in that cell. For example, if exactly 23 pixels have RGB values (45, 200, 91), then store the value 23 in a three-dimensional array, at the element indexed by the index values [45, 200, 91]. This data structure is called a color histogram (see, e.g., [3, 4]).

Figure 3.6 shows a 3D histogram of the RGB values of the pixels in forestfire.bmp. The histogram has 16 x 16 x 16 bins and shows the count in each bin in terms of intensity and pseudocolor. We can see a few important clusters of color information, corresponding to the reds, yellows, greens, and so on, of the forestfire image. Clustering in this way allows us to pick the most important 256 groups of color.

Basically, large populations in 3D histogram bins can be subjected to a split-and-merge algorithm to determine the "best" 256 colors. Figure 3.7 shows the resulting 8-bit image in GIF format (discussed later in this chapter). Notice that it is difficult to discern the difference between Figure 3.5(a), the 24-bit image, and Figure 3.7, the 8-bit image. This is not always the case. Consider the field of medical imaging: would you be satisfied with only a "reasonably accurate" image of your brain for potential laser surgery? Likely not; and that is why consideration of 64-bit imaging for medical applications is not out of the question.

Note the great savings in space for 8-bit images over 24-bit ones: a 640 x 480 8-bit color image requires only 300 kilobytes of storage, compared to 921.6 kilobytes for a color image (again, without any compression applied).

3.1.6 Color Lookup Tables (LUTs)

Again, the idea used in 8-bit color images is to store only the index, or code value, for each pixel. Then, if a pixel stores, say, the value 25, the meaning is to go to row 25 in a color lookup table (LUT). While images are displayed as two-dimensional arrays of values, they are usually stored in row-column order as simply a long series of values. For an 8-bit image, the image file can store in the file header information just what 8-bit values for R, G, and B correspond to each index. Figure 3.8 displays this idea. The LUT is often called a palette.

FIGURE 3.8: Color LUT for 8-bit color images.
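The LUT idea can be made concrete in a few lines of Python. This is purely an illustrative sketch; the tiny four-entry palette and the variable names are invented for the example and do not come from any particular file format:

```python
# A color LUT ("palette"): the row index is the code stored at each pixel;
# the row holds the 24-bit (R, G, B) color that the code stands for.
palette = [
    (0, 0, 0),       # index 0: black
    (255, 128, 0),   # index 1: orange
    (0, 200, 0),     # index 2: green
    (0, 255, 255),   # index 3: cyan
]

# An 8-bit "image" stores only one-byte codes, in row-column order.
indexed_image = [[3, 3, 1],
                 [0, 2, 1]]

# Displaying the image means looking each code up in the LUT.
rgb_image = [[palette[code] for code in row] for row in indexed_image]
print(rgb_image[0][0])   # code 3 means cyan: (0, 255, 255)
```

Storing one byte per pixel plus a small table (768 bytes for a full 256-entry palette), instead of three bytes per pixel, is where the space savings noted above come from.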

FIGURE 3.9: Color picker for 8-bit color: each block of the color picker corresponds to one row of the color LUT.

FIGURE 3.10: (a) 24-bit color image lena.bmp; (b) version with color dithering; (c) detail of dithered version.

A color picker consists of an array of fairly large blocks of color (or a semicontinuous range of colors) such that a mouse click will select the color indicated. In reality, a color picker displays the palette colors associated with index values from 0 to 255. Figure 3.9 displays the concept of a color picker: if the user selects the color block with index value 2, then the color meant is cyan, with RGB values (0, 255, 255).

A simple animation process is possible via simply changing the color table: this is called color cycling or palette animation. Since updates from the color table are fast, this can result in a simple, pleasing effect.

Dithering can also be carried out for color printers, using 1 bit per color channel and spacing out the colors with R, G, and B dots. Alternatively, if the printer or screen can print only a limited number of colors, say using 8 bits instead of 24, color can be made to seem printable, even if it is not available in the color LUT. The apparent color resolution of a display can be increased without reducing spatial resolution by averaging the intensities of neighboring pixels. Then it is possible to trick the eye into perceiving colors that are not available, because it carries out a spatial blending that can be put to good use. Figure 3.10(a) shows a 24-bit color image of Lena, and Figure 3.10(b) shows the same image reduced to only 5 bits via dithering. Figure 3.10(c) shows a detail of the left eye.

How to Devise a Color Lookup Table    In Section 3.1.5, we briefly discussed the idea of clustering to generate the most important 256 colors from a 24-bit color image. However, in general, clustering is an expensive and slow process. But we need to devise color LUTs somehow; how shall we accomplish this?

The most straightforward way to make 8-bit lookup color out of 24-bit color would be to divide the RGB cube into equal slices in each dimension. Then the centers of each of the resulting cubes would serve as the entries in the color LUT, while simply scaling the RGB ranges 0..255 into the appropriate ranges would generate the 8-bit codes.

Since humans are more sensitive to R and G than to B, we could shrink the R range and G range 0..255 into the 3-bit range 0..7 and shrink the B range down to the 2-bit range 0..3, making a total of 8 bits. To shrink R and G, we could simply divide the R or G byte value by (256/8 =) 32 and then truncate. Then each pixel in the image gets replaced by its 8-bit index, and the color LUT serves to generate 24-bit color.

However, what tends to happen with this simple scheme is that edge artifacts appear in the image. The reason is that if a slight change in RGB results in shifting to a new code, an edge appears, and this can be quite annoying perceptually.

A simple alternate solution for this color reduction problem, called the median-cut algorithm, does a better job (and several other competing methods do as well or better). This approach derives from computer graphics [5]; here, we show a much simplified version. The method is a type of adaptive partitioning scheme that tries to put the most bits, the most discrimination power, where colors are most clustered.

The idea is to sort the R byte values and find their median. Then values smaller than the median are labeled with a 0 bit and values larger than the median are labeled with a 1 bit. The median is the point where half the pixels are smaller and half are larger.

Suppose we are imaging some apples, and most pixels are reddish. Then the median R byte value might fall fairly high on the red 0..255 scale. Next, we consider only pixels with a 0 label from the first step and sort their G values. Again, we label image pixels with another bit: 0 for those less than the median in the greens and 1 for those greater. Now, applying the same scheme to pixels that received a 1 bit for the red step, we have arrived at 2-bit labeling for all pixels.

Carrying on to the blue channel, we have a 3-bit scheme. Repeating all steps, R, G, and B, results in a 6-bit scheme, and cycling through R and G once more results in 8 bits. These bits form our 8-bit color index value for pixels, and corresponding 24-bit colors can be the centers of the resulting small color cubes.

You can see that in fact this type of scheme will indeed concentrate bits where they most need to differentiate between high populations of close colors. We can most easily visualize finding the median by using a histogram showing counts at position 0..255. Figure 3.11 shows a histogram of the R byte values for the forestfire.bmp image, along with the median of these values, depicted as a vertical line.
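The first two bits of this labeling scheme can be sketched in Python. This is a toy version under our own simplifications (the function names and the handful of "apple" pixels are invented; a real implementation would continue through B and then repeat over R and G to reach 8 bits):

```python
def median(values):
    # Median as the middle element of the sorted list (upper middle
    # for an even count); good enough for this sketch.
    s = sorted(values)
    return s[len(s) // 2]

def two_bit_labels(pixels):
    """pixels: list of (R, G, B) byte triples.  Returns one 2-bit label
    per pixel: the first bit splits at the R median, the second at the
    G median computed separately within each of the two R groups."""
    r_med = median([r for r, g, b in pixels])
    r_bits = [1 if r > r_med else 0 for r, g, b in pixels]
    g_med = {}
    for bit in (0, 1):   # a G median for each of the two R groups
        group = [p[1] for p, rb in zip(pixels, r_bits) if rb == bit]
        g_med[bit] = median(group) if group else 0
    return [2 * rb + (1 if p[1] > g_med[rb] else 0)
            for p, rb in zip(pixels, r_bits)]

# Mostly reddish "apple" pixels, plus a couple of greenish ones.
pixels = [(200, 30, 40), (220, 90, 35), (180, 60, 50),
          (210, 20, 20), (40, 200, 60), (30, 180, 200)]
print(two_bit_labels(pixels))
```

Because the R median falls high on the red scale for such an image, the scheme spends its first bit distinguishing among the reds, which is exactly the "bits where colors are most clustered" behavior described above.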

FIGURE 3.11: Histogram of R bytes for the 24-bit color image forestfire.bmp results in a 0 or 1 bit label for every pixel. For the second bit of the color table index being built, we take R values less than the R median and label just those pixels as 0 or 1 according as their G value is less or greater than the median of the G values. Continuing over R, G, B for 8 bits gives a color LUT 8-bit index. (The figure's panels are labeled "Red bit 1", "Green bit 2, for red bit 1 == 0", and "Green bit 2, for red bit 1 == 1".)

The 24-bit color image resulting from replacing every pixel by its corresponding color LUT 24-bit color is only an approximation to the original 24-bit image, of course, but the above algorithm does a reasonable job of putting most discriminatory power where it is most needed, where small color shading differences will be most noticeable. It should also be mentioned that several methods exist for distributing the approximation errors from one pixel to the next. This has the effect of smoothing out problems in the 8-bit approximation.

The more accurate version of the median-cut algorithm proceeds via the following steps:

1. Find the smallest box that contains all the colors in the image.
2. Sort the enclosed colors along the longest dimension of the box.
3. Split the box into two regions at the median of the sorted list.
4. Repeat the above process in steps (2) and (3) until the original color space has been divided into, say, 256 regions.
5. For every box, call the mean of R, G, and B in that box the representative (the center) color for the box.
6. Based on the Euclidean distance between a pixel RGB value and the box centers, assign every pixel to one of the representative colors. Replace the pixel by the code in a lookup table that indexes representative colors (in the table, each representative color is 24 bits: 8 bits each for R, G, and B).

This way, we might have a table of 256 rows, each containing three 8-bit values. The row indices are the codes for the lookup table, and these indices are what are stored in pixel values of the new, color-quantized or palettized image.

3.2 POPULAR FILE FORMATS

Some popular file formats for information exchange are described below. One of the most important is the 8-bit GIF format, because of its historical connection to the WWW and the HTML markup language, as the first image type recognized by net browsers. However, currently the most important common file format is JPEG, which will be explored in great depth in Chapter 9.

3.2.1 GIF

Graphics Interchange Format (GIF) was devised by UNISYS Corporation and Compuserve, initially for transmitting graphical images over phone lines via modems. The GIF standard uses the Lempel-Ziv-Welch algorithm (a form of compression; see Chapter 7), modified slightly for image scanline packets, to use the line grouping of pixels effectively.

The GIF standard is limited to 8-bit (256) color images only. While this produces acceptable color, it is best suited for images with few distinctive colors (e.g., graphics or drawing).

The GIF image format has a few interesting features, notwithstanding the fact that it has been largely supplanted. The standard supports interlacing, the successive display of pixels in widely spaced rows by a four-pass display process.

In fact, GIF comes in two flavors. The original specification is GIF87a. The later version, GIF89a, supports simple animation via a Graphics Control Extension block in the data. This provides simple control over delay time, a transparency index, and so on. Software such as Corel Draw allows access to and editing of GIF images.

It is worthwhile examining the file format for GIF87 in more detail, since many such formats bear a resemblance to it but have grown a good deal more complex than this "simple" standard. For the standard specification, the general file format is as in Figure 3.12. The Signature is 6 bytes: GIF87a; the Screen Descriptor is a 7-byte set of flags. A GIF87 file can contain more than one image definition, usually to fit on several different parts of the screen. Therefore each image can contain its own color lookup table, a Local Color Map, for mapping 8 bits into 24-bit RGB values. However, it need not, and a global color map can instead be defined to take the place of a local table if the latter is not included.

The Screen Descriptor comprises a set of attributes that belong to every image in the file. According to the GIF87 standard, it is defined as in Figure 3.13. Screen Width is given in the first 2 bytes. Since some machines invert the order MSB/LSB (most significant byte / least significant byte, i.e., byte order), this order is specified. Screen Height is the next 2 bytes. The "m" in byte 5 is 0 if no global color map is given. Color resolution, "cr", is 3 bits in 0..7. Since this is an old standard meant to operate on a variety of low-end hardware, "cr" is requesting this much color resolution.

FIGURE 3.12: GIF file format. (GIF signature; screen descriptor; global color map; then, repeated 1 to n times: image descriptor, local color map, raster area; finally, a GIF terminator.)

FIGURE 3.13: GIF screen descriptor. (Bytes 1-2: Screen Width, the raster width in pixels, LSB first. Bytes 3-4: Screen Height, the raster height in pixels, LSB first. Byte 5: flags m, cr, 0, pixel, where m = 1 means a global color map follows the descriptor, "cr" + 1 gives the number of bits of color resolution, and "pixel" + 1 gives the number of bits per pixel in the image. Byte 6: background = color index of the screen background; the color is defined from the global color map or, if none is specified, from the default map. Byte 7: zeros.)

The next bit, shown as "0", is extra and is not used in this standard. "Pixel" is another 3 bits, indicating the number of bits per pixel in the image, as stored in the file. Although "cr" usually equals "pixel", it need not. Byte 6 gives the color table index byte for the background color, and byte 7 is filled with zeros. For present usage, the ability to use a small color resolution is a good feature, since we may be interested in very low-end devices such as web-enabled wristwatches, say.

A color map is set up in a simple fashion, as in Figure 3.14. However, the actual length of the table equals 2^(pixel+1) as given in the screen descriptor.

FIGURE 3.14: GIF color map. (Bytes 1, 2, 3: red, green, and blue intensity values for color index 0; bytes 4, 5, 6: red, green, and blue values for color index 1; and so on for the remaining colors.)

Each image in the file has its own Image Descriptor, defined as in Figure 3.15. Interestingly, the developers of this standard allowed for future extensions by ignoring any bytes between the end of one image and the beginning of the next, identified by a comma character. In this way, future enhancements could have been simply inserted in a backward-compatible fashion.

If the interlace bit is set in the local Image Descriptor, the rows of the image are displayed in a four-pass sequence, as in Figure 3.16. Here, the first pass displays rows 0 and 8, the second pass displays rows 4 and 12, and so on. This allows for a quick sketch to appear when a web browser displays the image, followed by more detailed fill-ins. The JPEG standard (below) has a similar display mode, denoted progressive mode.

The actual raster data itself is first compressed using the LZW compression scheme (see Chapter 7) before being stored.

The GIF87 standard also set out, for future use, how Extension Blocks could be defined. Even in GIF87, simple animations can be achieved, but no delay was defined between images, and multiple images simply overwrite each other with no screen clears.

GIF89 introduced a number of Extension Block definitions, especially those to assist animation: transparency and delay between images. A quite useful feature introduced in GIF89 is the idea of a sorted color table. The most important colors appear first, so that if

a decoder has fewer colors available, the most important ones are chosen. That is, only a segment of the color lookup table is used, and nearby colors are mapped as well as possible into the colors available.

FIGURE 3.15: GIF image descriptor. (Byte 1: image separator character, a comma, 00101100. Bytes 2-3: start of image in pixels from the left side of the screen, LSB first. Bytes 4-5: start of image in pixels from the top of the screen. Bytes 6-7: width of the image in pixels. Bytes 8-9: height of the image in pixels. Byte 10: flags m, i, 0, 0, 0, pixel, where m = 0 means use the global color map and ignore "pixel", and m = 1 means a local color map follows and "pixel" is to be used; i = 0 means the image is formatted in sequential order, and i = 1 in interlaced order; "pixel" + 1 gives the number of bits per pixel for this image.)

We can investigate how the file header works in practice by having a look at a particular GIF image. Figure 3.7 is an 8-bit color GIF image. To see how the file header looks, we can simply use everyone's favorite command in the UNIX operating system: od (octal dump). In UNIX,¹ then, we issue the command

    od -c forestfire.gif | head -2

and we see the first 32 bytes interpreted as characters:

    G    I    F    8    7    a \208   \2 \188   \1 \247   \0   \0   \6
    \174 \132  \24    |    )   \7 \198 \195    \ \128    U  \27 \196 \166    &    T

To decipher the remainder of the file header (after GIF87a), we use hexadecimal:

    od -x forestfire.gif | head -2

with the result

    4749 4638 3761 d002 bc01 f700 0006 0305
    ae84 187c 2907 c6c3 5c80 551b c4a6 2654

The d002 bc01 following the Signature gives the Screen Width and Height; these are stored in least-significant-byte-first order, so for this file, in decimal, the Screen Width is 0 + 13 x 16 + 2 x 16² = 720, and the Screen Height is 11 x 16 + 12 + 1 x 16² = 444. Then the f7 (which is 247 in decimal) is the fifth byte in the Screen Descriptor, followed by the background color index, 00, and the 00 delimiter. The set of flags, f7, in bits, reads 1, 111, 0, 111; in other words, a global color map is used, with 8-bit color resolution and 8-bit pixel data.
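The same decoding can be done programmatically. Here is a short Python sketch (ours, not the book's) that unpacks the 13 header bytes shown in the hex dump above, following the Screen Descriptor layout of Figure 3.13:

```python
import struct

# The first 13 bytes of forestfire.gif, taken from the hex dump above.
header = bytes.fromhex("474946383761" "d002" "bc01" "f7" "00" "00")

signature = header[:6].decode("ascii")       # 'GIF87a'
# '<' = little-endian: Screen Width and Height are stored LSB first.
width, height = struct.unpack("<HH", header[6:10])
flags, background, delim = header[10], header[11], header[12]

m     = flags >> 7             # 1 if a global color map follows
cr    = (flags >> 4) & 0b111   # color resolution is cr + 1 bits
pixel = flags & 0b111          # bits/pixel is pixel + 1

print(signature, width, height)          # GIF87a 720 444
print(m, cr + 1, pixel + 1, background)  # 1 8 8 0
```

Reading the same fields straight from a file would just replace the hard-coded bytes with `open("forestfire.gif", "rb").read(13)`.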
FIGURE 3.16: GIF four-pass interlace display row order. (Pass 1 displays rows 0, 8, 16, ...; pass 2 displays rows 4, 12, ...; pass 3 displays rows 2, 6, 10, ...; and pass 4 displays the remaining odd-numbered rows.)

3.2.2 JPEG

The most important current standard for image compression is JPEG [6]. This standard was created by a working group of the International Organization for Standardization (ISO) that was informally called the Joint Photographic Experts Group and is therefore so named. We shall study JPEG in a good deal more detail in Chapter 9, but a few salient features of this compression standard can be mentioned here.

The human vision system has some specific limitations, which JPEG takes advantage of to achieve high rates of compression. The eye-brain system cannot see extremely fine detail. If many changes occur within a few pixels, we refer to that image segment as having high spatial frequency, that is, a great deal of change in (x, y) space. This limitation is even more conspicuous for color vision than for grayscale (black and white). Therefore, color information in JPEG is decimated (partially dropped, or averaged) and then small blocks of an image are represented in the spatial frequency domain (u, v), rather than in (x, y). That is, the speed of changes in x and y is evaluated, from low to high, and a new "image" is formed by grouping the coefficients or weights of these speeds.

¹Solaris version; older versions use slightly different syntax.
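To see why dividing by a large integer and truncating compresses well (the quantization trick JPEG applies to these frequency-domain weights, as discussed next), consider this toy Python sketch; the block values and the step size are invented, not taken from a real JPEG quantization table:

```python
# A made-up 4x4 block of frequency-domain "weights": one large value for
# the slow (low-frequency) component, small values for the fast ones.
block = [[315, 27, 8, 3],
         [22, 12, 4, 1],
         [9, 5, 2, 1],
         [4, 2, 1, 0]]
q = 16   # quantization step: divide by a large integer and truncate

quantized = [[v // q for v in row] for row in block]
zeros = sum(row.count(0) for row in quantized)
print(quantized)
print(f"{zeros} of 16 weights zeroed out")  # long runs of zeros compress well
```

Reconstruction multiplies back by q, so the small weights are gone for good; that is why the scheme is lossy, and why a larger q (lower quality factor) gives higher compression.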

FIGURE 3.17: JPEG image with low quality specified by user. (This figure also appears in the color insert section.)

Weights that correspond to slow changes are then favored, using a simple trick: values are divided by some large integer and truncated. In this way, small values are zeroed out. Then a scheme for representing long runs of zeros efficiently is applied, and voilà! The image is greatly compressed.

Since we effectively throw away a lot of information by the division and truncation step, this compression scheme is "lossy" (although a lossless mode exists). What's more, since it is straightforward to allow the user to choose how large a denominator to use and hence how much information to discard, JPEG allows the user to set a desired level of quality, or compression ratio (input divided by output).

As an example, Figure 3.17 shows our forestfire image with a quality factor Q = 10%. (The usual default quality factor is Q = 75%.) This image is a mere 1.5% of the original size. In comparison, a JPEG image with Q = 75% yields an image size 5.6% of the original, whereas a GIF version of this image compresses down to 23.0% of the uncompressed image size.

3.2.3 PNG

One interesting development stemming from the popularity of the Internet is the effort toward more system-independent image formats. One such format is Portable Network Graphics (PNG). This standard is meant to supersede the GIF standard and extends it in important ways. The motivation for a new standard was in part the patent held by UNISYS and Compuserve on the LZW compression method. (Interestingly, the patent covers only compression, not decompression: this is why the UNIX gunzip utility can decompress LZW-compressed files.)

Special features of PNG files include support for up to 48 bits of color information, a large increase. Files may also contain gamma-correction information (see Section 4.1.6) for correct display of color images, and alpha-channel information for such uses as control of transparency. Instead of a progressive display based on widely separated rows, as in GIF images, the display progressively displays pixels in a two-dimensional fashion, a few at a time, over seven passes through each 8 x 8 block of an image.

3.2.4 TIFF

Tagged Image File Format (TIFF) is another popular image file format. Developed by the Aldus Corporation in the 1980s, it was later supported by Microsoft. Its support for attachment of additional information (referred to as "tags") provides a great deal of flexibility. The most important tag is a format signifier: what type of compression and so on is in use in the stored image. For example, TIFF can store many different types of images: 1-bit, grayscale, 8-bit, 24-bit RGB, and so on. TIFF was originally a lossless format, but a new JPEG tag allows one to opt for JPEG compression. Since TIFF is not as user-controllable as JPEG, it does not provide any major advantages over the latter.

3.2.5 EXIF

Exchangeable Image File (EXIF) is an image format for digital cameras. Initially developed in 1995, its current version (2.2) was published in 2002 by the Japan Electronics and Information Technology Industries Association (JEITA). Compressed EXIF files use the baseline JPEG format. A variety of tags (many more than in TIFF) is available to facilitate higher-quality printing, since information about the camera and picture-taking conditions (flash, exposure, light source, white balance, type of scene) can be stored and used by printers for possible color-correction algorithms. The EXIF standard also includes specification of a file format for audio that accompanies digital images. It also supports tags for information needed for conversion to FlashPix (initially developed by Kodak).

3.2.6 Graphics Animation Files

A few dominant formats are aimed at storing graphics animations (i.e., series of drawings or graphic illustrations) as opposed to video (i.e., series of images). The difference is that animations are considerably less demanding of resources than video files. However, animation file formats can be used to store video information and indeed are sometimes used for such.

FLC is an important animation or moving-picture file format; it was originally created by Animation Pro. Another format, FLI, is similar to FLC.

GL produces somewhat better quality moving pictures. GL animations can also usually handle larger file sizes.

Many older formats are used for animation, such as DL and Amiga IFF, as well as alternates such as Apple Quicktime. And, of course, there are also animated GIF89 files.

3.2.7 PS and PDF

PostScript is an important language for typesetting, and many high-end printers have a PostScript interpreter built into them. PostScript is a vector-based, rather than pixel-based, picture language: page elements are essentially defined in terms of vectors. With fonts defined this way, PostScript includes text as well as vector/structured graphics; bit-mapped images can also be included in output files. Encapsulated PostScript files add some information for including PostScript files in another document.

Several popular graphics programs, such as Illustrator and FreeHand, use PostScript. However, the PostScript page description language itself does not provide compression; in fact, PostScript files are just stored as ASCII. Therefore files are often large, and in academic settings, it is common for such files to be made available only after compression by some UNIX utility, such as compress or gzip.

Therefore, another text + figures language has begun to supersede PostScript: Adobe Systems Inc. includes LZW (see Chapter 7) compression in its Portable Document Format (PDF) file format. As a consequence, PDF files that do not include images have about the same compression ratio, 2:1 or 3:1, as do files compressed with other LZW-based compression tools, such as UNIX compress or gzip, or PC-based winzip (a variety of pkzip). For files containing images, PDF may achieve higher compression ratios by using separate JPEG compression for the image content (depending on the tools used to create original and compressed versions). The Adobe Acrobat PDF reader can also be configured to read documents structured as linked elements, with clickable content and handy summary tree-structured link diagrams provided.

3.2.8 Windows WMF

Windows MetaFile (WMF) is the native vector file format for the Microsoft Windows operating environment. WMF files actually consist of a collection of Graphics Device Interface (GDI) function calls, also native to the Windows environment. When a WMF file is "played" (typically using the Windows PlayMetaFile() function), the described graphic is rendered. WMF files are ostensibly device-independent and unlimited in size.

3.2.9 Windows BMP

BitMap (BMP) is the major system standard graphics file format for Microsoft Windows, used in Microsoft Paint and other programs. It makes use of run-length encoding compression (see Chapter 7) and can fairly efficiently store 24-bit bitmap images. Note, however, that BMP has many different modes, including uncompressed 24-bit images.

3.2.10 Macintosh PAINT and PICT

PAINT was originally used in the MacPaint program, initially only for 1-bit monochrome images.

PICT is used in MacDraw (a vector-based drawing program) for storing structured graphics.

3.2.11 X Windows PPM

This is the graphics format for the X Windows System. Portable PixMap (PPM) supports 24-bit color bitmaps and can be manipulated using many public domain graphic editors, such as xv. It is used in the X Windows System for storing icons, pixmaps, backdrops, and so on.

3.3 FURTHER EXPLORATION

Foley et al. [2] provide an excellent introduction to computer graphics. For a good discussion of issues involving image processing, see Gonzalez and Woods [7]. More information, including a complete up-to-date list of current file formats, can be viewed on the textbook web site, in Chapter 3 of the Further Exploration directory.

Other links include

• GIF87 and GIF89 details. Although these file formats are not so interesting in themselves, they have the virtue of being simple and are a useful introduction to how such bitstreams are set out.
• A popular shareware program for developing GIF animations
• JPEG considered in detail
• PNG details
• The PDF file format
• The ubiquitous BMP file format

In terms of actual input/output of such file formats, code for simple 24-bit BMP file reading and manipulation is given on the web site.

3.4 EXERCISES

1. Briefly explain why we need to be able to have less than 24-bit color and why this makes for a problem. Generally, what do we need to do to adaptively transform 24-bit color values to 8-bit ones?

2. Suppose we decide to quantize an 8-bit grayscale image down to just 2 bits of accuracy. What is the simplest way to do so? What ranges of byte values in the original image are mapped to what quantized values?

3. Suppose we have a 5-bit grayscale image. What size of ordered dither matrix do we need to display the image on a 1-bit printer?

4. Suppose we have available 24 bits per pixel for a color image. However, we notice that humans are more sensitive to R and G than to B; in fact, 1.5 times more sensitive to R or G than to B. How could we best make use of the bits available?
80 Chapter 3 Graphics and Image Data Representations / Section 3.5 References 81

5. At your job, you have decided to impress the boss by using up more disk space for the company's grayscale images. Instead of using 8 bits per pixel, you'd like to use 48 bits per pixel in RGB. How could you store the original grayscale images so that in the new format they would appear the same as they used to, visually?

6. Sometimes bitplanes of an image are characterized using an analogy from mapmaking called "elevations". Figure 3.18 shows some elevations.
Suppose we describe an 8-bit image using 8 bitplanes. Briefly discuss how you could view each bitplane in terms of geographical concepts.

FIGURE 3.18: Elevations in geography.

7. For the color LUT problem, try out the median-cut algorithm on a sample image. Explain briefly why it is that this algorithm, carried out on an image of red apples, puts more color gradation in the resulting 24-bit color image where it is needed, among the reds.

8. In regard to nonordered dithering, a standard graphics text [2] states, "Even larger patterns can be used, but the spatial versus intensity resolution trade-off is limited by our visual acuity (about one minute of arc in normal lighting)."

(a) What does this sentence mean?

(b) If we hold a piece of paper out at a distance of 1 foot, what is the approximate linear distance between dots? (Information: One minute of arc is 1/60 of one degree of angle. Arc length on a circle equals angle (in radians) times radius.) Could we see the gap between dots on a 300 dpi printer?

(c) Write down an algorithm (pseudocode) for calculating a color histogram for RGB data.

3.5 REFERENCES

1 J. Miano, Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP, Reading, MA: Addison-Wesley, 1999.

2 J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles and Practice in C, 2nd ed., Reading, MA: Addison-Wesley, 1996.

3 M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision, Boston: PWS Publishing, 1999.

4 L.G. Shapiro and G.C. Stockman, Computer Vision, Upper Saddle River, NJ: Prentice Hall, 2001.

5 P. Heckbert, "Color Image Quantization for Frame Buffer Display," in SIGGRAPH Proceedings, vol. 16, pp. 297-307, 1982.

6 W.B. Pennebaker and J.L. Mitchell, The JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.

7 R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.

CHAPTER 4

Color in Image and Video

Color images and videos are ubiquitous on the web and in multimedia productions. Increasingly, we are becoming aware of the discrepancies between color as seen by people and the sometimes very different color displayed on our screens. The latest version of the HTML standard attempts to address this issue by specifying color in terms of a standard, "sRGB", arrived at by color scientists.

To become aware of the simple yet strangely involved world of color, in this chapter we shall consider the following topics:

• Color science
• Color models in images
• Color models in video

4.1 COLOR SCIENCE

4.1.1 Light and Spectra

Recall from high school that light is an electromagnetic wave and that its color is characterized by the wavelength of the wave. Laser light consists of a single wavelength — for example, a ruby laser produces a bright, scarlet beam. So if we were to plot the light intensity versus wavelength, we would see a spike at the appropriate red wavelength and no other contribution to the light.

In contrast, most light sources produce contributions over many wavelengths. Humans cannot detect all light — just contributions that fall in the visible wavelengths. Short wavelengths produce a blue sensation, and long wavelengths produce a red one.

We measure visible light using a device called a spectrophotometer, by reflecting light from a diffraction grating (a ruled surface) that spreads out the different wavelengths, much as a prism does. Figure 4.1 shows the phenomenon that white light contains all the colors of a rainbow. If you have ever looked through a prism, you will have noticed that it generates a rainbow effect, due to a natural phenomenon called dispersion. You see a similar effect on the surface of a soap bubble.

FIGURE 4.1: Sir Isaac Newton's experiments. By permission of the Warden and Fellows, New College, Oxford.

Visible light is an electromagnetic wave in the range 400-700 nm (where nm stands for nanometer, or 10^-9 meter). Figure 4.2 shows the relative power in each wavelength interval for typical outdoor light on a sunny day. This type of curve, called a spectral power distribution (SPD), or spectrum, shows the relative amount of light energy (electromagnetic signal) at each wavelength. The symbol for wavelength is λ, so this type of curve might be called E(λ).

In practice, measurements are used that effectively sum up voltage in a small wavelength range, say 5 or 10 nanometers, so such plots usually consist of segments joining function values every 10 nanometers. This means also that such profiles are actually stored as vectors. Below, however, we show equations that treat E(λ) as a continuous function, although in reality, integrals are calculated using sums.

FIGURE 4.2: Spectral power distribution of daylight.
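Since an SPD profile is stored as a vector of samples, a curve such as E(λ) is easy to handle in code. The sketch below (plain Python; the function name and the interpolation choice are ours, purely for illustration, and not taken from the text's web-site code) stores an SPD on a 10-nanometer grid from 400 to 700 nm and joins the samples by straight segments, exactly as the plots described above do:

```python
# An SPD is stored as a vector: one relative-power sample every 10 nm,
# covering 400..700 nm (31 samples).
LAMBDA_START, LAMBDA_STEP = 400, 10

def spd_value(samples, lam):
    """Piecewise-linear E(lambda) rebuilt from the stored sample vector."""
    last = LAMBDA_START + LAMBDA_STEP * (len(samples) - 1)
    if not LAMBDA_START <= lam <= last:
        return 0.0                      # outside the range the vector covers
    pos = (lam - LAMBDA_START) / LAMBDA_STEP
    i = int(pos)
    if i == len(samples) - 1:
        return samples[-1]
    frac = pos - i                      # interpolate between adjacent samples
    return (1 - frac) * samples[i] + frac * samples[i + 1]
```

For example, with a flat (equi-energy) vector of 31 ones, `spd_value` returns 1.0 at every visible wavelength, including those between grid points.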

4.1.2 Human Vision

The eye works like a camera, with the lens focusing an image onto the retina (upside-down and left-right reversed). The retina consists of an array of rods and three kinds of cones, so named because of their shape. The rods come into play when light levels are low and produce an image in shades of gray ("At night, all cats are gray!"). For higher light levels, the cones each produce a signal. Because of their differing pigments, the three kinds of cones are most sensitive to red (R), green (G), and blue (B) light.

Higher light levels result in more neurons firing, but just what happens in the brain further down the pipeline is the subject of much debate. However, it seems likely that the brain makes use of differences R − G, G − B, and B − R, as well as combining all of R, G, and B into a high-light-level achromatic channel (and thus we can say that the brain is good at algebra).

4.1.3 Spectral Sensitivity of the Eye

The eye is most sensitive to light in the middle of the visible spectrum. Like the SPD profile of a light source, as in Figure 4.2, for receptors we show the relative sensitivity as a function of wavelength. The blue receptor sensitivity is not shown to scale, because it is much smaller than the curves for red or green. Blue is a late addition in evolution (and, statistically, is the favorite color of humans, regardless of nationality — perhaps for this reason: blue is a bit surprising!). Figure 4.3 shows the overall sensitivity as a dashed line, called the luminous-efficiency function. It is usually denoted V(λ) and is the sum of the response curves to red, green, and blue [1, 2].

The rods are sensitive to a broad range of wavelengths, but produce a signal that generates the perception of the black-white scale only. The rod sensitivity curve looks like the luminous-efficiency function V(λ) but is shifted somewhat to the red end of the spectrum [1].

FIGURE 4.3: Cone sensitivities: R, G, and B cones, and luminous-efficiency curve V(λ).

The eye has about 6 million cones, but the proportions of R, G, and B cones are different. They likely are present in the ratios 40:20:1 (see [3] for a complete explanation). So the achromatic channel produced by the cones is thus something like 2R + G + B/20.

These spectral sensitivity functions are usually denoted by some other letters than R, G, and B, so here let us denote them by the vector function q(λ), with components

q(λ) = (q_R(λ), q_G(λ), q_B(λ))^T    (4.1)

That is, there are three sensors (a vector index k = 1..3 therefore applies), and each is a function of wavelength.

The response in each color channel in the eye is proportional to the number of neurons firing. For the red channel, any light falling anywhere in the nonzero part of the red cone function in Figure 4.3 will generate some response. So the total response of the red channel is the sum over all the light falling on the retina to which the red cone is sensitive, weighted by the sensitivity at that wavelength. Again thinking of these sensitivities as continuous functions, we can succinctly write down this idea in the form of an integral:

R = ∫ E(λ) q_R(λ) dλ
G = ∫ E(λ) q_G(λ) dλ    (4.2)
B = ∫ E(λ) q_B(λ) dλ

Since the signal transmitted consists of three numbers, colors form a three-dimensional vector space.

4.1.4 Image Formation

Equation (4.2) above actually applies only when we view a self-luminous object (i.e., a light). In most situations, we image light reflected from a surface. Surfaces reflect different amounts of light at different wavelengths, and dark surfaces reflect less energy than light surfaces. Figure 4.4 shows the surface spectral reflectance from orange sneakers and faded blue jeans [4]. The reflectance function is denoted S(λ).

The image formation situation is thus as follows: light from the illuminant with SPD E(λ) impinges on a surface, with surface spectral reflectance function S(λ), is reflected, and is then filtered by the eye's cone functions q(λ). The basic arrangement is as shown in Figure 4.5. The function C(λ) is called the color signal and is the product of the illuminant E(λ) and the reflectance S(λ): C(λ) = E(λ) S(λ).

The equations similar to Eqs. (4.2) that take into account the image formation model are

R = ∫ E(λ) S(λ) q_R(λ) dλ
G = ∫ E(λ) S(λ) q_G(λ) dλ    (4.3)
B = ∫ E(λ) S(λ) q_B(λ) dλ

FIGURE 4.4: Surface spectral reflectance functions S(λ) for two objects.

4.1.5 Camera Systems

Now, we humans develop camera systems in a similar fashion. A good camera has three signals produced at each pixel location (corresponding to a retinal position). Analog signals are converted to digital, truncated to integers, and stored. If the precision used is 8-bit, the maximum value for any of R, G, B is 255, and the minimum is 0.

However, the light entering the eye of the computer user is what the screen emits — the screen is essentially a self-luminous source. Therefore, we need to know the light E(λ) entering the eye.

FIGURE 4.5: Image formation model.

4.1.6 Gamma Correction

The RGB numbers in an image file are converted back to analog and drive the electron guns in the cathode ray tube (CRT). Electrons are emitted proportional to the driving voltage, and we would like to have the CRT system produce light linearly related to the voltage. Unfortunately, it turns out that this is not the case. The light emitted is actually roughly proportional to the voltage raised to a power; this power is called "gamma", with symbol γ.

Thus, if the file value in the red channel is R, the screen emits light proportional to R^γ, with SPD equal to that of the red phosphor paint on the screen that is the target of the red-channel electron gun. The value of gamma is around 2.2.

Since the mechanics of a television receiver are the same as those for a computer CRT, TV systems precorrect for this situation by applying the inverse transformation before transmitting TV voltage signals. It is customary to append a prime to signals that are "gamma corrected" by raising to the power (1/γ) before transmission. Thus we have

R → R′ = R^(1/γ)  ⇒  (R′)^γ → R    (4.4)

and we arrive at "linear signals".

Voltage is often normalized to maximum 1, and it is interesting to see what effect these gamma transformations have on signals. Figure 4.6(a) shows the light output with no gamma correction applied. We see that darker values are displayed too dark. This is also shown in Figure 4.7(a), which displays a linear ramp from left to right.

FIGURE 4.6: Effect of gamma correction: (a) no gamma correction — effect of CRT on light emitted from screen (voltage is normalized to range 0..1); (b) gamma correction of signal.

Figure 4.6(b) shows the effect of precorrecting signals by applying the power law R^(1/γ), where it is customary to normalize voltage to the range 0 to 1. We see that applying first the correction in Figure 4.6(b), followed by the effect of the CRT system in Figure 4.6(a),

would result in linear signals. Figure 4.7(b) shows the combined effect. Here, a ramp is shown in 16 steps, from gray level 0 to gray level 255.

FIGURE 4.7: Effect of gamma correction: (a) display of ramp from 0 to 255, with no gamma correction; (b) image with gamma correction applied.

A more careful definition of gamma recognizes that a simple power law would result in an infinite derivative at zero voltage — which makes constructing a circuit to accomplish gamma correction difficult to devise in analog. In practice a more general transform, such as R → R′ = a × R^(1/γ) + b is used, along with special care at the origin:

V_out = 4.5 × V_in,                  V_in < 0.018
V_out = 1.099 × (V_in)^0.45 − 0.099,  V_in ≥ 0.018    (4.5)

This is called a camera transfer function, and the above law is recommended by the Society of Motion Picture and Television Engineers (SMPTE) as standard SMPTE-170M.

Why a gamma of 2.2? In fact, this value does not produce a final power law of 1.0. The history of this number is buried in decisions of the National Television System Committee of the USA (NTSC) when TV was invented. The power law for color receivers may in actuality be closer to 2.8. However, if we compensate for only about 2.2 of this power law, we arrive at an overall value of about 1.25 instead of 1.0. The idea was that in viewing conditions with a dim surround, such an overall gamma produces more pleasing images, albeit with color errors — darker colors are made even darker, and also the eye-brain system changes the relative contrast of light and dark colors [5].

With the advent of CRT-based computer systems, the situation has become even more interesting. The camera may or may not have inserted gamma correction; software may write the image file using some gamma; software may decode expecting some (other) gamma; the image is stored in a frame buffer, and it is common to provide a lookup table for gamma in the frame buffer. After all, if we generate images using computer graphics, no gamma is applied, but a gamma is still necessary to precompensate for the display.

It makes sense, then, to define an overall "system" gamma that takes into account all such transformations. Unfortunately, we must often simply guess at the overall gamma. Adobe Photoshop allows us to try different gamma values. For WWW publishing, it is important to know that a Macintosh does gamma correction in its graphics card, with a gamma of 1.8. SGI machines expect a gamma of 1.4, and most PCs or Suns do no extra gamma correction and likely have a display gamma of about 2.5. Therefore, for the most common machines, it might make sense to gamma-correct images at the average of Macintosh and PC values, or about 2.1.

However, most practitioners might use a value of 2.4, adopted by the sRGB group. A new "standard" RGB for WWW applications called sRGB, to be included in all future HTML standards, defines a standard modeling of typical light levels and monitor conditions and is (more or less) a "device-independent color space for the Internet".

An issue related to gamma correction is the decision of just what intensity levels will be represented by what bit patterns in the pixel values in a file. The eye is most sensitive to ratios of intensity levels rather than absolute intensities. This means that the brighter the light, the greater must be the change in light level for the change to be perceived.

If we had precise control over what bits represented what intensities, it would make sense to code intensities logarithmically for maximum usage of the bits available. Then we could include that coding in an inverse of the (1/γ) power law transform, as in Equation (4.4), or perhaps a lookup table implementation of such an inverse function (see [6], p. 564).

However, it is most likely that images or videos we encounter have no nonlinear encoding of bit levels but have indeed been produced by a camcorder or are for broadcast TV. These images will have been gamma corrected according to Equation (4.4). The CIE-sponsored CIELAB perceptually based color-difference metric discussed in Section 4.1.14 provides a careful algorithm for including the nonlinear aspect of human brightness perception.

4.1.7 Color-Matching Functions

Practically speaking, many color applications involve specifying and re-creating a particular desired color. Suppose you wish to duplicate a particular shade on the screen, or a particular shade of dyed cloth. Over many years, even before the eye-sensitivity curves of Figure 4.3 were known, a technique evolved in psychology for matching a combination of basic R, G, and B lights to a given shade. A particular set of three basic lights was available, called the set of color primaries. To match a given shade, a set of observers was asked to separately adjust the brightness of the three primaries using a set of controls, until the resulting spot of light most closely matched the desired color. Figure 4.8 shows the basic situation. A device for carrying out such an experiment is called a colorimeter.

The international standards body for color, the Commission Internationale de l'Eclairage (CIE), pooled all such data in 1931, in a set of curves called the color-matching functions. They used color primaries with peaks at 440, 545, and 580 nanometers. Suppose, instead of a swatch of cloth, you were interested in matching a given wavelength of laser (i.e., monochromatic) light. Then the color-matching experiments are summarized by a statement of the proportion of the color primaries needed for each individual narrow-band wavelength of light. General lights are then matched by a linear combination of single-wavelength results.
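That last statement — general lights are matched by a linear combination of the single-wavelength results — can be sketched directly in code. In the fragment below (Python), the three lists stand in for sampled per-wavelength matching curves; their values are made up purely for illustration and are not the measured CIE data:

```python
def match(E, rbar, gbar, bbar, dlam=10.0):
    """Matching weights for a general light E: weight each narrow-band
    (single-wavelength) matching result by the light's power there and sum."""
    return tuple(sum(e * c for e, c in zip(E, curve)) * dlam
                 for curve in (rbar, gbar, bbar))
```

Because the operation is a weighted sum, matching is linear: the match for a mixture of two lights is the componentwise sum of their individual matches, which is exactly the property exploited when tabulating matches wavelength by wavelength.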

FIGURE 4.8: Colorimeter experiment.

Figure 4.9 shows the CIE color-matching curves, denoted r̄(λ), ḡ(λ), b̄(λ). In fact, such curves are a linear matrix-multiplication away from the eye sensitivities in Figure 4.3.

FIGURE 4.9: CIE color-matching functions r̄(λ), ḡ(λ), b̄(λ).

Why are some parts of the curves negative? This indicates that some colors cannot be reproduced by a linear combination of the primaries. For such colors, one or more of the primary lights has to be shifted from one side of the black partition in Figure 4.8 to the other, so they illuminate the sample to be matched instead of the white screen. Thus, in a sense, such samples are being matched by negative lights.

4.1.8 CIE Chromaticity Diagram

In times long past, engineers found it upsetting that one CIE color-matching curve in Figure 4.9 has a negative lobe. Therefore, a set of fictitious primaries was devised that led to color-matching functions with only positive values. Figure 4.10 shows the resulting curves; these are usually referred to as the color-matching functions. They result from a linear (3 × 3 matrix) transform from the r̄, ḡ, b̄ curves, and are denoted x̄(λ), ȳ(λ), z̄(λ). The matrix is chosen such that the middle standard color-matching function ȳ(λ) exactly equals the luminous-efficiency curve V(λ) shown in Figure 4.3.

FIGURE 4.10: CIE standard color-matching functions x̄(λ), ȳ(λ), z̄(λ).

For a general SPD E(λ), the essential "colorimetric" information required to characterize a color is the set of tristimulus values X, Y, Z, defined in analogy to Equation (4.2) as

X = ∫ E(λ) x̄(λ) dλ
Y = ∫ E(λ) ȳ(λ) dλ    (4.6)
Z = ∫ E(λ) z̄(λ) dλ

The middle value, Y, is called the luminance. All color information and transforms are tied to these special values, which incorporate substantial information about the human visual system. However, 3D data is difficult to visualize, and consequently, the CIE devised a 2D diagram based on the values of (X, Y, Z) triples implied by the curves in Figure 4.10.

For each wavelength in the visible, the values of X, Y, Z given by the three curve values form the limits of what humans can see. However, from Equation (4.6) we observe that increasing the brightness of illumination (turning up the light bulb's wattage) increases the tristimulus values by a scalar multiple. Therefore, it makes sense to devise a 2D diagram by somehow factoring out the magnitude of vectors (X, Y, Z). In the CIE system, this is accomplished by dividing by the sum X + Y + Z:

x = X/(X + Y + Z)
y = Y/(X + Y + Z)    (4.7)
z = Z/(X + Y + Z)

This effectively means that one value out of the set (x, y, z) is redundant, since we have

x + y + z = 1    (4.8)

so that

z = 1 − x − y    (4.9)

Values x, y are called chromaticities.

Effectively, we are projecting each tristimulus vector (X, Y, Z) onto the plane connecting points (1, 0, 0), (0, 1, 0), and (0, 0, 1). Usually, this plane is viewed projected onto the z = 0 plane, as a set of points inside the triangle with vertices having (x, y) values (0, 0), (1, 0), and (0, 1).

Figure 4.11 shows the locus of points for monochromatic light, drawn on this CIE "chromaticity diagram". The straight line along the bottom of the "horseshoe" joins points at the extremities of the visible spectrum, 400 and 700 nanometers (from blue through green to red). That straight line is called the line of purples. The horseshoe itself is called the spectrum locus and shows the (x, y) chromaticity values of monochromatic light at each of the visible wavelengths.

The color-matching curves are devised so as to add up to the same value [the area under each curve is the same for each of x̄(λ), ȳ(λ), z̄(λ)]. Therefore for a white illuminant with all SPD values equal to 1 — an "equi-energy white light" — the chromaticity values are (1/3, 1/3). Figure 4.11 displays this white point in the middle of the diagram. Finally, since we must have x, y ≤ 1 and x + y ≤ 1, all possible chromaticity values must necessarily lie below the dashed diagonal line in Figure 4.11.

FIGURE 4.11: CIE chromaticity diagram.

Note that one may choose different "white" spectra as the standard illuminant. The CIE defines several of these, such as illuminant A, illuminant C, and standard daylights D65 and D100. Each of these will display as a somewhat different white spot on the CIE diagram: D65 has a chromaticity equal to (0.312713, 0.329016), and illuminant C has chromaticity (0.310063, 0.316158). Figure 4.12 displays the SPD curves for each of these standard lights. Illuminant A is characteristic of incandescent lighting, with an SPD typical of a tungsten bulb, and is quite red. Illuminant C is an early attempt to characterize daylight, while D65 and D100 are respectively a midrange and a bluish commonly used daylight. Figure 4.12 also shows the much more spiky SPD for a standard fluorescent illumination, called F2 [2].

Colors with chromaticities on the spectrum locus represent "pure" colors. These are the most "saturated": think of paper becoming more and more saturated with ink. In contrast, colors closer to the white point are more unsaturated.

The chromaticity diagram has the nice property that, for a mixture of two lights, the resulting chromaticity lies on the straight line joining the chromaticities of the two lights. Here we are being slightly cagey in not saying that this is the case for colors in general, just for "lights". The reason is that so far we have been adhering to an additive model of color mixing. This model holds good for lights or, as a special case, for monitor colors. However, as we shall see below, it does not hold for printer colors (see p. 102).

For any chromaticity on the CIE diagram, the "dominant wavelength" is the position on the spectrum locus intersected by a line joining the white point to the given color and extended through it. (For colors that give an intersection on the line of purples, a complementary dominant wavelength is defined by extending the line backward through the white point.)

Another useful definition is the set of complementary colors for some given color, which is given by all the colors on the line through the white spot. Finally, the excitation purity is the

ratio of distances from the white spot to the given color and to the dominant wavelength, expressed as a percentage.

FIGURE 4.12: Standard illuminant SPDs.

4.1.9 Color Monitor Specifications

Color monitors are specified in part by the white point chromaticity desired if the RGB electron guns are all activated at their highest power. Actually, we are likely using gamma-corrected values R′, G′, B′. If we normalize voltage to the range 0 to 1, then we wish to specify a monitor such that it displays the desired white point when R′ = G′ = B′ = 1 (abbreviating the transform from file value to voltage by simply stating the pixel color values, normalized to maximum 1).

However, the phosphorescent paints used on the inside of the monitor screen have their own chromaticities, so at first glance it would appear that we cannot independently control the monitor white point. However, this is remedied by setting the gain control for each electron gun such that at maximum voltages the desired white appears.

Several monitor specifications are in current use. Monitor specifications consist of the fixed, manufacturer-specified chromaticities for the monitor phosphors, along with the standard white point needed. Table 4.1 shows these values for three common specification statements. NTSC is the standard North American and Japanese specification. SMPTE is a more modern version of this, wherein the standard illuminant is changed from illuminant C to illuminant D65 and the phosphor chromaticities are more in line with modern machines. Digital video specifications use a similar specification in North America. The EBU system derives from the European Broadcasting Union and is used in PAL and SECAM video systems.

TABLE 4.1: Chromaticities and white points for monitor specifications.

         Red            Green          Blue           White Point
System   x_r    y_r    x_g    y_g    x_b    y_b    x_W     y_W
NTSC     0.67   0.33   0.21   0.71   0.14   0.08   0.3101  0.3162
SMPTE    0.630  0.340  0.310  0.595  0.155  0.070  0.3127  0.3291
EBU      0.64   0.33   0.29   0.60   0.15   0.06   0.3127  0.3291

4.1.10 Out-of-Gamut Colors

For the moment, let's not worry about gamma correction. Then the really basic problem for displaying color is how to generate device-independent color, by agreement taken to be specified by (x, y) chromaticity values, using device-dependent color values RGB.

For any (x, y) pair we wish to find that RGB triple giving the specified (x, y, z): therefore, we form the z values for the phosphors via z = 1 − x − y and solve for RGB from the manufacturer-specified chromaticities. Since, if we had no green or blue value (i.e., file values of zero) we would simply see the red-phosphor chromaticities, we combine nonzero values of R, G, and B via

[ x_r  x_g  x_b ] [ R ]   [ x ]
[ y_r  y_g  y_b ] [ G ] = [ y ]    (4.10)
[ z_r  z_g  z_b ] [ B ]   [ z ]

If (x, y) is specified instead of derived from the above, we have to invert the matrix of phosphor (x, y, z) values to obtain the correct RGB values to use to obtain the desired chromaticity.

But what if any of the RGB numbers is negative? The problem in this case is that while humans are able to perceive the color, it is not representable on the device being used. We say in that case the color is out of gamut, since the set of all possible displayable colors constitutes the gamut of the device.

One method used to deal with this situation is to simply use the closest in-gamut color available. Another common approach is to select the closest complementary color.

For a monitor, every displayable color is within a triangle. This follows from so-called Grassmann's Law, describing human vision, stating that "color matching is linear". This means that linear combinations of lights made up of three primaries are just the linear set of weights used to make the combination times those primaries. That is, if we compose colors from a linear combination of the three "lights" available from the three phosphors, we can create colors only from the convex set derived from the lights — in this case, a triangle. (We'll see below that for printers, this convexity no longer holds.)

Figure 4.13 shows the triangular gamut for the NTSC system drawn on the CIE diagram. Suppose the small triangle represents a given desired color. Then the in-gamut point on the boundary of the NTSC monitor gamut is taken to be the intersection of (a) the line connecting the desired color to the white point with (b) the nearest line forming the boundary of the gamut triangle.
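The inversion of Eq. (4.10), and the gamut test it gives us, can be sketched as follows (Python; we solve the 3 × 3 system by Cramer's rule so as to avoid any library dependence, and take the NTSC chromaticities from Table 4.1). Note that this recovers R, G, B only up to the overall scale of (x, y, z) — which is all a chromaticity can determine — and that a negative component signals an out-of-gamut chromaticity:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve_rgb(M, xyz):
    """Invert Eq. (4.10) by Cramer's rule: find (R, G, B) with M.(R,G,B) = (x,y,z)."""
    d = det3(M)
    out = []
    for col in range(3):
        Mc = [row[:] for row in M]       # replace one column by the target
        for r in range(3):
            Mc[r][col] = xyz[r]
        out.append(det3(Mc) / d)
    return tuple(out)

def phosphor_matrix(xr, yr, xg, yg, xb, yb):
    """Columns are the phosphor (x, y, z) triples, with z = 1 - x - y (Eq. 4.9)."""
    zr, zg, zb = 1 - xr - yr, 1 - xg - yg, 1 - xb - yb
    return [[xr, xg, xb], [yr, yg, yb], [zr, zg, zb]]

# NTSC phosphor chromaticities, from Table 4.1
M_NTSC = phosphor_matrix(0.67, 0.33, 0.21, 0.71, 0.14, 0.08)

def in_gamut(M, x, y):
    """A chromaticity is displayable iff all of R, G, B come out nonnegative."""
    return all(c >= 0 for c in solve_rgb(M, (x, y, 1 - x - y)))
```

For instance, the NTSC white point (0.3101, 0.3162) lies inside the triangle and yields all-positive weights, while a chromaticity outside the triangle yields at least one negative weight.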

where ( )T means Iranspose.


For Lhe SMPTE specificaLion, we have (x, y, z) = (0.3127, 0.3291,0.3582) or, dividing
Green by lhe middle value, XYZwhíle = (0.95045, 1 , 1.08892). We note LhaL multiplying D by
00 (1,1, ~
6
rx
~
]
=
r 0.340
0.630 0.310 0.155
0.595 0.070
1 r ct~
j (4.13)
‘9
o [ Z J while L 0.03 0.095 0.775 J L €13
FIGURE 4.13: Approximating an out-of-gamut color by an in-gamut one. The out-of-gamut color shown by a triangle is approximated by the intersection of (a) the line from that color to the white point with (b) the boundary of the device color gamut.

4.1.11 White-Point Correction

One deficiency in what we have done so far is that we need to be able to map tristimulus values XYZ to device RGBs, and not just deal with chromaticity xyz. The difference is that XYZ values include the magnitude of the color. We also need to be able to alter matters such that when each of R, G, B is at maximum value, we obtain the white point.

But so far, Table 4.1 would produce incorrect values. Consider the SMPTE specifications. Setting R = G = B = 1 results in a value of X that equals the sum of the x values, or 0.630 + 0.310 + 0.155, which is 1.095. Similarly, the Y and Z values come out to 1.005 and 0.9. Dividing by (X + Y + Z) results in a chromaticity of (0.365, 0.335) rather than the desired values of (0.3127, 0.3291).

The method used to correct both deficiencies is to first take the white-point magnitude of Y as unity:

    Y (white point) = 1    (4.11)

Now we need to find a set of three correction factors such that if the gains of the three electron guns are multiplied by these values, we get exactly the white point XYZ value at R = G = B = 1. Suppose the matrix of phosphor chromaticities x_r, x_g, ... in Equation (4.10) is called M. We can express the correction as a diagonal matrix D = diag(d1, d2, d3) such that

    XYZ_white ≡ M D (1, 1, 1)^T    (4.12)

Inverting, with the new values XYZ_white specified as above, we arrive at

    (d1, d2, d3) = (0.6247, 1.1783, 1.2364)    (4.14)

4.1.12 XYZ to RGB Transform

Now the 3 x 3 transform matrix from XYZ to RGB is taken to be

    T = M D    (4.15)

even for points other than the white point:

    [X, Y, Z]^T = T [R, G, B]^T    (4.16)

For the SMPTE specification, we arrive at

        | 0.3935  0.3653  0.1916 |
    T = | 0.2124  0.7011  0.0866 |    (4.17)
        | 0.0187  0.1119  0.9582 |

Written out, this reads

    X = 0.3935 · R + 0.3653 · G + 0.1916 · B
    Y = 0.2124 · R + 0.7011 · G + 0.0866 · B    (4.18)
    Z = 0.0187 · R + 0.1119 · G + 0.9582 · B

4.1.13 Transform with Gamma Correction

The above calculations assume we are dealing with linear signals. However, instead of linear R, G, B, we most likely have nonlinear, gamma-corrected R', G', B'.

The best way of carrying out an XYZ-to-RGB transform is to calculate the linear RGB required by inverting Equation (4.16), then create nonlinear signals via gamma correction. Nevertheless, this is not usually done as stated. Instead, the equation for the Y value is used as is but is applied to nonlinear signals. This does not imply much error, in fact, for colors near the white point. The only concession to accuracy is to give the new name Y' to this new Y value created from R', G', B'. The significance of Y' is that it codes a descriptor of brightness for the pixel in question.¹

¹ In the Color FAQ file on the text web site, this new value Y' is called "luma".
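The transform of Equations (4.15)–(4.18) is easy to check numerically. Below is a minimal sketch (in Python, with NumPy assumed; the function names are ours, not from the text) that builds T from Equation (4.17) and confirms that full drive R = G = B = 1 now reproduces the desired white-point chromaticity (0.3127, 0.3291):

```python
import numpy as np

# T = M D from Equation (4.17): maps linear SMPTE RGB to XYZ tristimulus.
T = np.array([[0.3935, 0.3653, 0.1916],
              [0.2124, 0.7011, 0.0866],
              [0.0187, 0.1119, 0.9582]])

def xyz_from_rgb(rgb):
    """Equation (4.16): [X, Y, Z]^T = T [R, G, B]^T (linear signals)."""
    return T @ np.asarray(rgb, dtype=float)

def rgb_from_xyz(xyz):
    """The XYZ-to-RGB direction of the section title: invert Equation (4.16)."""
    return np.linalg.inv(T) @ np.asarray(xyz, dtype=float)

# With the white-point correction D folded into T, R = G = B = 1 yields
# Y very nearly 1 and chromaticity (x, y) near (0.3127, 0.3291),
# rather than the uncorrected (0.365, 0.335).
white = xyz_from_rgb([1.0, 1.0, 1.0])
x, y = white[:2] / white.sum()
```

Any XYZ-to-RGB use simply applies the inverse matrix, as in rgb_from_xyz above.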

Section 4.1 Color Science 99
98 Chapter 4 Color in Image and Video

The most-used transform equations are those for the original NTSC system, based upon an illuminant C white point, even though these are outdated. Following the procedure outlined above but with the values in Table 4.1, we arrive at the following transform:

    X = 0.607 · R + 0.174 · G + 0.200 · B
    Y = 0.299 · R + 0.587 · G + 0.114 · B    (4.19)
    Z = 0.000 · R + 0.066 · G + 1.116 · B

Thus, coding for nonlinear signals begins with encoding the nonlinear-signal correlate of luminance:

    Y' = 0.299 · R' + 0.587 · G' + 0.114 · B'    (4.20)

(See Section 4.3 below for more discussion on encoding of nonlinear signals.)

4.1.14 L*a*b* (CIELAB) Color Model

The discussion above of how best to make use of the bits available to us touched on the issue of how well human vision sees changes in light levels. This subject is actually an example of Weber's Law, from psychology: the more there is of a quantity, the more change there must be to perceive a difference. For example, it's relatively easy to tell the difference in weight between your 4-year-old sister and your 5-year-old brother when you pick them up. However, it is more difficult to tell the difference in weight between two heavy objects. Another example is that to see a change in a bright light, the difference must be much larger than to see a change in a dim light. A rule of thumb for this phenomenon states that equally perceived changes must be relative. Changes are about equally perceived if the ratio of the change is the same, whether for dark or bright lights, and so on. After some thought, this idea leads to a logarithmic approximation to perceptually equally spaced units.

For human vision, however, CIE arrived at a somewhat more involved version of this kind of rule, called the CIELAB space. What is being quantified in this space is, again, differences perceived in color and brightness. This makes sense because, practically speaking, color differences are most useful for comparing source and target colors. You would be interested, for example, in whether a particular batch of dyed cloth has the same color as an original swatch. Figure 4.14 shows a cutaway into a 3D solid of the coordinate space associated with this color difference metric.

FIGURE 4.14: CIELAB model (L = 100: White; L = 0: Black; a < 0: Green; a > 0: Red; b > 0: Yellow; b < 0: Blue). (This figure also appears in the color insert section.)

CIELAB (also known as L*a*b*) uses a power law of 1/3 instead of a logarithm. CIELAB uses three values that correspond roughly to luminance and a pair that combine to make colorfulness and hue (variables have an asterisk to differentiate them from previous versions devised by the CIE). The color difference is defined as

    ΔE = sqrt( (L*)^2 + (a*)^2 + (b*)^2 )    (4.21)

where

    L* = 116 (Y/Yn)^(1/3) - 16
    a* = 500 [ (X/Xn)^(1/3) - (Y/Yn)^(1/3) ]    (4.22)
    b* = 200 [ (Y/Yn)^(1/3) - (Z/Zn)^(1/3) ]

with Xn, Yn, Zn the XYZ values of the white point. Auxiliary definitions are

    chroma = c* = sqrt( (a*)^2 + (b*)^2 )
    hue angle = h* = arctan( b* / a* )    (4.23)

Roughly, the maximum and minimum of value a* correspond to red and green, while b* ranges from yellow to blue. The chroma is a scale of colorfulness, with more colorful (more saturated) colors occupying the outside of the CIELAB solid at each L* brightness level, and more washed-out (desaturated) colors nearer the central achromatic axis. The hue angle expresses more or less what most people mean by "the color" — that is, you would describe it as red or orange.

The development of such color-difference models is an active field of research, and there is a plethora of other human-perception-based formulas (the other competitor of the same vintage as CIELAB is called CIELUV — both were devised in 1976). The interest is generated partly because such color metrics impact how we model differences in lighting and
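Equations (4.21)–(4.23) can be sketched directly (in Python; the function names are ours, the D65 white point is assumed as the default, and we use the plain 1/3 power law of the text, omitting the linear dark-color segment that the full CIE standard adds):

```python
import math

def cielab(X, Y, Z, Xn=0.9505, Yn=1.0, Zn=1.0891):
    """Equation (4.22), with (Xn, Yn, Zn) the white point (D65 assumed here)."""
    fx = (X / Xn) ** (1 / 3)
    fy = (Y / Yn) ** (1 / 3)
    fz = (Z / Zn) ** (1 / 3)
    L = 116 * fy - 16
    a = 500 * (fx - fy)
    b = 200 * (fy - fz)
    return L, a, b

def chroma_hue(a, b):
    """Equation (4.23): chroma c* and hue angle h* (in degrees)."""
    return math.hypot(a, b), math.degrees(math.atan2(b, a))

# The white point itself lands on the achromatic axis: L* = 100, a* = b* = 0.
L, a, b = cielab(0.9505, 1.0, 1.0891)
```

A color-difference check between two samples would then just be the Euclidean distance of Equation (4.21) between their (L*, a*, b*) triples.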

viewing across device and/or network boundaries [7]. Several high-end products, including Adobe Photoshop, use the CIELAB model.

4.1.15 More Color-Coordinate Schemes

There are several other coordinate schemes in use to describe color as humans perceive it, with some confusion in the field as to whether gamma correction should or should not be applied. Here we are describing device-independent color — based on XYZ and correlated to what humans see. However, generally users make free use of RGB or R', G', B'.

Other schemes include: CMY (described on p. 101); HSL — Hue, Saturation and Lightness; HSV — Hue, Saturation and Value; HSI — I=Intensity; HCI — C=Chroma; HVC — V=Value; HSD — D=Darkness; the beat goes on.

4.1.16 Munsell Color Naming System

Accurate naming of colors is also an important consideration. One time-tested standard system was devised by Munsell in the early 1900's and revised many times (the last one is called the Munsell renotation) [8]. The idea is to set up (yet another) approximately perceptually uniform system of three axes to discuss and specify color. The axes are value (black-white), hue, and chroma. Value is divided into 9 steps, hue is in 40 steps around a circle, and chroma (saturation) has a maximum of 16 levels. The circle's radius varies with value.

The main idea is a fairly invariant specification of color for any user, including artists. The Munsell corporation therefore sells books of all these patches of paint, made up with proprietary paint formulas (the book is quite expensive). It has been asserted that this is the most often used uniform scale.

4.2 COLOR MODELS IN IMAGES

We now have had an introduction to color science and some of the problems that crop up with respect to color for image displays. But how are color models and coordinate systems really used for stored, displayed, and printed images?

4.2.1 RGB Color Model for CRT Displays

According to Chapter 3, we usually store color information directly in RGB form. However, we note from the previous section that such a coordinate system is in fact device-dependent.

We expect to be able to use 8 bits per color channel for color that is accurate enough. In fact, we have to use about 12 bits per channel to avoid an aliasing effect in dark image areas — contour bands that result from gamma correction, since gamma correction results in many fewer available integer levels (see Exercise 7).

For images produced from computer graphics, we store integers proportional to intensity in the frame buffer. Then we should have a gamma correction LUT between the frame buffer and the CRT. If gamma correction is applied to floats before quantizing to integers, before storage in the frame buffer, then we can use only 8 bits per channel and still avoid contouring artifacts.

FIGURE 4.15: RGB and CMY color cubes. (This figure also appears in the color insert section.)

4.2.2 Subtractive Color: CMY Color Model

So far, we have effectively been dealing only with additive color. Namely, when two light beams impinge on a target, their colors add; when two phosphors on a CRT screen are turned on, their colors add. So, for example, red phosphor + green phosphor makes yellow light.

But for ink deposited on paper, in essence the opposite situation holds: yellow ink subtracts blue from white illumination but reflects red and green; which is why it appears yellow!

So, instead of red, green, and blue primaries, we need primaries that amount to -red, -green, and -blue; we need to subtract R, G, or B. These subtractive color primaries are cyan (C), magenta (M), and yellow (Y) inks. Figure 4.15 shows how the two systems, RGB and CMY, are connected. In the additive (RGB) system, black is "no light", RGB = (0, 0, 0). In the subtractive CMY system, black arises from subtracting all the light by laying down inks with C = M = Y = 1.

4.2.3 Transformation from RGB to CMY

Given our identification of the role of inks in subtractive systems, the simplest model we can invent to specify what ink density to lay down on paper, to make a certain desired RGB color, is as follows:

    | C |   | 1 |   | R |
    | M | = | 1 | - | G |    (4.24)
    | Y |   | 1 |   | B |

FIGURE 4.16: Additive and subtractive color: (a) RGB is used to specify additive color; (b) CMY is used to specify subtractive color. (This figure also appears in the color insert section.)

Then the inverse transform is

    | R |   | 1 |   | C |
    | G | = | 1 | - | M |    (4.25)
    | B |   | 1 |   | Y |

4.2.4 Undercolor Removal: CMYK System

C, M, and Y are supposed to mix to black. However, more often they mix to a muddy brown (we all know this from kindergarten). Truly "black" black ink is in fact cheaper than mixing colored inks to make black, so a simple approach to producing sharper printer colors is to calculate that part of the three-color mix that would be black, remove it from the color proportions, and add it back as real black. This is called "undercolor removal".

The new specification of inks is thus

    K ≡ min(C, M, Y)

    | C |   | C - K |
    | M | = | M - K |    (4.26)
    | Y |   | Y - K |

Figure 4.16 depicts the color combinations that result from combining primary colors available in the two situations: additive color, in which we usually specify color using RGB, and subtractive color, in which we usually specify color using CMY or CMYK.

4.2.5 Printer Gamuts

In a common model of the printing process, printers lay down transparent layers of ink onto a (generally white) substrate. If we wish to have a cyan printing ink truly equal to minus-red, our objective is to produce a cyan ink that completely blocks red light but also completely passes all green and blue light. Unfortunately, such "block dyes" are only approximated in industry. In reality, transmission curves overlap for the C, M, and Y inks. This leads to "crosstalk" between the color channels and difficulties in predicting colors achievable in printing.

FIGURE 4.17: (a) transmission curves for block dyes; (b) spectrum locus, triangular NTSC gamut, and six-vertex printer gamut.

Figure 4.17(a) shows typical transmission curves for real block dyes, and Figure 4.17(b) shows the resulting color gamut for a color printer that uses such inks. We see that the color gamut is smaller than that of an NTSC monitor and can overlap it.

Such a gamut arises from the model used for printer inks. Transmittances are related to optical density D via a logarithm: D = -ln T, where T is one of the curves in Figure 4.17(a). A color is formed by a linear combination D of inks, with D a combination of the three densities weighted by weights w_i, i = 1 .. 3, and w_i can be in the range from zero to the maximum allowable without smearing.

So the overall transmittance T is formed as a product of exponentials of the three weighted densities — light is extinguished exponentially as it travels through a "sandwich" of transparent dyes. The light reflected from paper (or through a piece of slide film) is TE = e^(-D) E, where E is the illuminating light. Forming colors XYZ with Equation (4.6) leads to the printer gamut in Figure 4.17(b).

The center of the printer gamut is the white-black axis, and the six boundary vertices correspond to C, M, Y, and the three combinations CM, CY, and MY laid down at full density. Lesser ink densities lie more in the middle of the diagram. Full density for all inks corresponds to the black/white point, which lies in the center of the diagram, at the point marked "o". For these particular inks, that point has chromaticity (x, y) = (0.276, 0.308).

4.3 COLOR MODELS IN VIDEO

4.3.1 Video Color Transforms

Methods of dealing with color in digital video derive largely from older analog methods of coding color for TV. Typically, some version of the luminance is combined with color information in a single signal. For example, a matrix transform method similar to Equation (4.19) called YIQ is used to transmit TV signals in North America and Japan. This coding also makes its way into VHS videotape coding in these countries, since video tape technologies also use YIQ.

In Europe, videotape uses the PAL or SECAM codings, which are based on TV that uses a matrix transform called YUV.

Finally, digital video mostly uses a matrix transform called YCbCr that is closely related to YUV.²

4.3.2 YUV Color Model

Initially, YUV coding was used for PAL analog video. A version of YUV is now also used in the CCIR 601 standard for digital video.

First, it codes a luminance signal (for gamma-corrected signals) equal to Y' in Equation (4.20). (Recall that Y' is often called the "luma.") The luma Y' is similar to, but not exactly the same as, the CIE luminance value Y, gamma-corrected. In multimedia, practitioners often blur the difference and simply refer to both as the luminance.

As well as magnitude or brightness we need a colorfulness scale, and to this end chrominance refers to the difference between a color and a reference white at the same luminance. It can be represented by the color differences U, V:

    U = B' - Y'
    V = R' - Y'    (4.27)

From Equation (4.20), Equation (4.27) reads

    | Y' |   |  0.299   0.587   0.114 | | R' |
    | U  | = | -0.299  -0.587   0.886 | | G' |    (4.28)
    | V  |   |  0.701  -0.587  -0.114 | | B' |

We go backward, from (Y', U, V) to (R', G', B'), by inverting the matrix in Equation (4.28). Note that for a gray pixel, with R' = G' = B', the luminance Y' is equal to that same gray value, R', say, since the sum of the coefficients in Equation (4.20) is 0.299 + 0.587 + 0.114 = 1.0. So for a gray ("black-and-white") image, the chrominance (U, V) is zero, since the sum of coefficients in each of the lower two equations in (4.28) is zero. Color TV can be displayed on a black-and-white television by just using the Y' signal.³ For backward compatibility, color TV uses old black-and-white signals with no color information by identifying the signal with Y'.

Finally, in the actual implementation, U and V are rescaled for purposes of having a more convenient maximum and minimum. For analog video, the scales are chosen such that each of U or V is limited to the range between ±0.5 times the maximum of Y' [9]. (Note that actual voltages are in another, non-normalized range — for analog, Y' is often in the range 0 to 700 mV, so rescaled U and V, called PB and PR in that context, range over ±350 mV.)

Such scaling reflects how to deal with component video — three separate signals. However, for dealing with composite video, in which we want to compose a single signal out of Y', U, and V at once, it turns out to be convenient to contain the composite signal magnitude Y' ± sqrt(U² + V²) within the range -1/3 to +4/3, so that it will remain within the amplitude limits of the recording equipment. For this purpose, U and V are rescaled as follows:

    U = 0.492111 (B' - Y')
    V = 0.877283 (R' - Y')    (4.29)

(with multipliers sometimes rounded to three significant digits). Then the chrominance signal is composed from U and V as the composite signal

    C = U cos(ωt) + V sin(ωt)    (4.30)

where ω represents the NTSC color frequency.

From equations (4.29) we note that zero is not the minimum value for U, V. In terms of real, positive colors, U is approximately from blue (U > 0) to yellow (U < 0) in the RGB cube; V is approximately from red (V > 0) to cyan (V < 0).

Figure 4.18 shows the decomposition of a typical color image into its Y', U, V components. Since both U and V go negative, the images are in fact shifted, rescaled versions of the actual signals.

Because the eye is most sensitive to black-and-white variations, in terms of spatial frequency (e.g., the eye can see a grid of fine gray lines more clearly than fine colored lines), in the analog PAL signal a bandwidth of only 1.3 MHz is allocated to each of U and V, while 5.5 MHz is reserved for the Y' signal. In fact, color information transmitted for color TV is actually very blocky.

4.3.3 YIQ Color Model

YIQ (actually, Y'IQ) is used in NTSC color TV broadcasting. Again, gray pixels generate zero (I, Q) chrominance signal. The original meanings of these names came from combinations of analog signals — I for in-phase chrominance, and Q for quadrature chrominance — and can now be safely ignored.

It is thought that, although U and V are more simply defined, they do not capture the most-to-least hierarchy of human vision sensitivity. Although they nicely define the color differences, they do not best correspond to actual human perceptual color sensitivities. NTSC uses I and Q instead.

YIQ is just a version of YUV, with the same Y' but with U and V rotated by 33°:

    I = 0.877283 (R' - Y') cos 33° - 0.492111 (B' - Y') sin 33°
    Q = 0.877283 (R' - Y') sin 33° + 0.492111 (B' - Y') cos 33°    (4.31)

² The luminance-chrominance color models (YIQ, YUV, YCbCr) are proven effective. Hence, they are also adopted in image-compression standards such as JPEG and JPEG2000.
³ It should be noted that many authors and users simply use these terms with no primes and (perhaps) mean them as if they were with primes!

FIGURE 4.18: Y'UV decomposition of color image: (a) original color image; (b) Y'; (c) U; (d) V. (This figure also appears in the color insert section.)

This leads to the following matrix transform:

    | Y' |   | 0.299      0.587      0.114     | | R' |
    | I  | = | 0.595879  -0.274133  -0.321746 | | G' |    (4.32)
    | Q  |   | 0.211205  -0.523083   0.311878 | | B' |

I is roughly the orange-blue direction, and Q roughly corresponds to the purple-green direction.

Figure 4.19 shows the decomposition of the same color image as above into YIQ components. Only the I and Q components are shown, since the original image and the Y' component are the same as in Figure 4.18.

For this particular image, most of the energy is captured in the Y' component, which is typical. However, in this case the YIQ decomposition does a better job of forming a hierarchical sequence of images: for the 8-bit Y' component, the root-mean-square (RMS) value is 137 (with 255 the maximum possible). The U, V components have RMS values 43 and 44. For the YIQ decomposition, the I and Q components have RMS values 42 and 14, so they better prioritize color values. Originally, NTSC allocated 4.2 MHz to Y, 1.5 MHz to I, and 0.6 MHz to Q. Today, both I and Q are each allocated 1.0 MHz.

FIGURE 4.19: (a) I and (b) Q components of color image.

4.3.4 YCbCr Color Model

The international standard for component (three-signal, studio quality) digital video is officially Recommendation ITU-R BT.601-4 (known as "Rec. 601"). This standard uses another color space, Y'CbCr, often simply written YCbCr. The YCbCr transform is used in JPEG image compression and MPEG video compression and is closely related to the YUV transform. YUV is changed by scaling such that Cb is U, but with a coefficient of 0.5 multiplying B'. In some software systems, Cb and Cr are also shifted such that values are between 0 and 1. This makes the equations as follows:

    Cb = ((B' - Y') / 1.772) + 0.5
    Cr = ((R' - Y') / 1.402) + 0.5    (4.33)

Written out, we then have

    | Y' |   |  0.299      0.587      0.114    | | R' |   | 0   |
    | Cb | = | -0.168736  -0.331264   0.5      | | G' | + | 0.5 |    (4.34)
    | Cr |   |  0.5       -0.418688  -0.081312 | | B' |   | 0.5 |

In practice, however, Rec. 601 specifies 8-bit coding, with a maximum Y' value of only 219 and a minimum of +16. Values below 16 and above 235, denoted headroom and footroom, are reserved for other processing. Cb and Cr have a range of ±112 and offset of +128 (in other words, a maximum of 240 and a minimum of 16). If R', G', B' are floats in [0 .. +1], we obtain Y', Cb, Cr in [0 .. 255] via the transform [9]

    | Y' |   |  65.481   128.553    24.966 | | R' |   | 16  |
    | Cb | = | -37.797   -74.203   112     | | G' | + | 128 |    (4.35)
    | Cr |   | 112       -93.786   -18.214 | | B' |   | 128 |

In fact, the output range is also clamped to [1 .. 254], since the Rec. 601 synchronization signals are given by codes 0 and 255.

4.4 FURTHER EXPLORATION

In a deep way, color is one of our favorite pleasures as humans, and arguably one of the chief attributes that makes multimedia so compelling. The most-used reference on color in

general is the classic handbook [2]. A compendium of important techniques used today is the collection [10].

Links in the Chapter 4 section of the Further Exploration directory on the text web site include

• More details on gamma correction for publication on the WWW

• The full specification of the new sRGB standard color space for WWW applications

• An excellent review of color transforms and a standard color FAQ

• A MATLAB script to exercise (and expand upon) the color transform functions that are part of the Image Toolbox in MATLAB: the standard Lena image is transformed to YIQ and to YCbCr

• A new color space. The new MPEG standard, MPEG-4 (discussed in Chapter 12), somewhat sidesteps the thorny question of whose favorite color space to use in a standard definition by including six color spaces. One of them is a new variant on HSV space, HMMD color space, that purports to allow a simple color quantization — from 24-bit down to 8-bit color, say — that is effectively equivalent to a complex vector color quantization (i.e., considering a more careful but also more expensive mapping of the colors in an image into the color LUT). This new color space may indeed become important.

4.5 EXERCISES

1. Consider the following set of color-related terms:

   (a) Wavelength
   (b) Color level
   (c) Brightness
   (d) Whiteness

   How would you match each of the following (more vaguely stated) characteristics to each of the above terms?

   (e) Luminance
   (f) Hue
   (g) Saturation
   (h) Chrominance

2. What color is outdoor light? For example, around what wavelength would you guess the peak power is for a red sunset? For blue sky light?

3. "The LAB gamut covers all colors in the visible spectrum."

   (a) What does this statement mean? Briefly, how does LAB relate to color? Just be descriptive.

   (b) What are (roughly) the relative sizes of the LAB gamut, the CMYK gamut, and a monitor gamut?

4. Where does the chromaticity "horseshoe" shape in Figure 4.11 come from? Can we calculate it? Write a small pseudocode solution for the problem of finding this so-called "spectrum locus". Hint: Figure 4.20(a) shows the color-matching functions in Figure 4.10 drawn as a set of points in three-space. Figure 4.20(b) shows these points mapped into another 3D set of points. Another hint: Try a programming solution for this problem, to help you answer it more explicitly.

FIGURE 4.20: (a) color-matching functions; (b) transformed color-matching functions.

5. Suppose we use a new set of color-matching functions x̄new(λ), ȳnew(λ), z̄new(λ) with values

       λ (nm)   x̄new(λ)   ȳnew(λ)   z̄new(λ)
       450      0.2        0.1        0.5
       500      0.1        0.4        0.3
       600      0.1        0.4        0.2
       700      0.6        0.1        0.0

   In this system, what are the chromaticity values (x, y) of equi-energy white light E(λ) where E(λ) ≡ 1 for all wavelengths λ? Explain.

6. (a) Suppose images are not gamma corrected by a camcorder. Generally, how would they appear on a screen?

   (b) What happens if we artificially increase the output gamma for stored image pixels? (We can do this in Photoshop.) What is the effect on the image?

7. Suppose image file values are in 0 .. 255 in each color channel. If we define R̄ = R/255 for the red channel, we wish to carry out gamma correction by passing a new value R̄' to the display device, with R̄' ≃ R̄^(1/2.0).

   It is common to carry out this operation using integer math. Suppose we approximate the calculation as creating new integer values in 0 .. 255 via

       R' = int ( 255 · (R/255)^(1/2.0) )

   (a) Comment (very roughly) on the effect of this operation on the number of actually available levels for display. Hint: Coding this up in any language will help you understand the mechanism at work better and will allow you to simply count the output levels.

   (b) Which end of the levels 0 .. 255 is affected most by gamma correction — the low end (near 0) or the high end (near 255)? Why? How much at each end?

8. In many computer graphics applications, γ-correction is performed only in a color LUT (lookup table). Show the first five entries of a color LUT meant for use in γ-correction. Hint: Coding this up saves you the trouble of using a calculator.

9. Devise a program to produce Figure 4.21, showing the color gamut of a monitor that adheres to SMPTE specifications.

FIGURE 4.21: SMPTE Monitor Gamut. (This figure also appears in the color insert section.)

10. Hue is the color, independent of brightness and how much pure white has been added to it. We can make a simple definition of hue as the set of ratios R:G:B. Suppose a color (i.e., an RGB) is divided by 2.0, so that the RGB triple now has values 0.5 times its former values. Explain, using numerical values:

   (a) If gamma correction is applied after the division by 2.0 and before the color is stored, does the darker RGB have the same hue as the original, in the sense of having the same ratios R:G:B of light emanating from the CRT display device? (We're not discussing any psychophysical effects that change our perception — here we're just worried about the machine itself.)

   (b) If gamma correction is not applied, does the second RGB have the same hue as the first, when displayed?

   (c) For what color triples is the hue always unchanged?

11. We wish to produce a graphic that is pleasing and easily readable. Suppose we make the background color pink. What color text font should we use to make the text most readable? Justify your answer.

12. To make matters simpler for eventual printing, we buy a camera equipped with CMY sensors, as opposed to RGB sensors (CMY cameras are in fact available).

   (a) Draw spectral curves roughly depicting what such a camera's sensitivity to frequency might look like.

   (b) Could the output of a CMY camera be used to produce ordinary RGB pictures? How?

13. Color inkjet printers use the CMY model. When the cyan ink color is sprayed onto a sheet of white paper,

   (a) Why does it look cyan under daylight?

   (b) What color would it appear under a blue light? Why?

4.6 REFERENCES

1 D.H. Pritchard, "U.S. Color Television Fundamentals — A Review," IEEE Trans. Consumer Electronics, 23(4): 467–478, 1977.

2 G. Wyszecki and W.S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulas, 2nd ed., New York: Wiley, 1982.

3 R.W.G. Hunt, "Color Reproduction and Color Vision Modeling," in 1st Color Imaging Conference: Transforms & Transportability of Color, Society for Imaging Science & Technology (IS&T)/Society for Information Display (SID) joint conference, 1993, 1–5.

4 M.J. Vrhel, R. Gershon, and L.S. Iwan, "Measurement and Analysis of Object Reflectance Spectra," Color Research and Application, 19: 4–9, 1994.

5 R.W.G. Hunt, The Reproduction of Color, 5th ed., Tolworth, Surrey, U.K.: Fountain Press, 1995.

6 J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles and Practice in C, 2nd ed., Reading, MA: Addison-Wesley, 1996.

7 Mark D. Fairchild, Color Appearance Models, Reading, MA: Addison-Wesley, 1998.

8 D. Travis, Effective Color Displays, San Diego: Academic Press, 1991.

9 C.A. Poynton, A Technical Introduction to Digital Video, New York: Wiley, 1996.

10 P. Green and L. MacDonald, eds., Colour Engineering: Achieving Device Independent Colour, New York: Wiley, 2002.

CHAPTER 5 5.1.2 ColflpOSite Video


la cornposite video, color (“chrominance”) and intensity (“luminance”) signals are mixed
into a single carrier wave. Chrominance is a composite of two color components (1 and
Fundamental Concepts in Video or U and Y). This is the type of signal used by broadcast color TVs; it is downward
compatible with black-and-white TV.
In NTSC TV, for example [II, 1 and Q are combined into a chroma signa!, and a color
subcanier tben puts the chrorna signa! at the higher frequency end of the channel shared
with the luminance signal. The chrominance and luminance components can be separated
In this chapter, we introduce the principal notions needed to understand video. Digital video at the receiver end, and the two color components can be further recovered.
compression is explored separately, in Chapters lo through 12. When connecting to TVs or VCRs, composite video uses only one wire (and hence one
Here we consider the following aspects of video and how they impact multimedia appli counector, such as a BNC connector at each end of a coaxial cable or an RCA plug at each
cations: end of an ordinary wire), and video color signais are mixed, not sent separately. The audio
signal is another addition lo this one signal. Since color information is mixed and both color
and intensity are wrapped into the same signal, some interference between the luminance
T~’pes of video signals
and chrominance signals is inevitable.

• Analog video

• Digital video

Since video is created from a variety of sources, we begin with the signals themselves. Analog video is represented as a continuous (time-varying) signal, and the first part of this chapter discusses how it is measured. Digital video is represented as a sequence of digital images, and the second part of the chapter discusses standards and definitions such as HDTV.

5.1 TYPES OF VIDEO SIGNALS

Video signals can be organized in three different ways: component video, composite video, and S-video.

5.1.1 Component Video

Higher-end video systems, such as for studios, make use of three separate video signals for the red, green, and blue image planes. This is referred to as component video. This kind of system has three wires (and connectors) connecting the camera or other devices to a TV or monitor.

Color signals are not restricted to always being RGB separations. Instead, as we saw in Chapter 4 on color models for images and video, we can form three signals via a luminance-chrominance transformation of the RGB signals — for example, YIQ or YUV. In contrast, most computer systems use component video, with separate signals for the R, G, and B components.

For any color separation scheme, component video gives the best color reproduction, since there is no "crosstalk" between the three different channels, unlike composite video or S-video. Component video, however, requires more bandwidth and good synchronization of the three components.

5.1.3 S-Video

As a compromise, S-video (separated video, or super-video, e.g., in S-VHS) uses two wires: one for luminance and another for a composite chrominance signal. As a result, there is less crosstalk between the color information and the crucial gray-scale information.

The reason for placing luminance into its own part of the signal is that black-and-white information is crucial for visual perception. As noted in the previous chapter, humans are able to differentiate spatial resolution in grayscale images much better than for the color part of color images (as opposed to the "black-and-white" part). Therefore, color information sent can be much less accurate than intensity information. We can see only fairly large blobs of color, so it makes sense to send less color detail.

5.2 ANALOG VIDEO

Most TV is still sent and received as an analog signal. Once the electrical signal is received, we may assume that brightness is at least a monotonic function of voltage, if not necessarily linear, because of gamma correction (see Section 4.1.6).

An analog signal f(t) samples a time-varying image. So-called progressive scanning traces through a complete picture (a frame) row-wise for each time interval. A high-resolution computer monitor typically uses a time interval of 1/72 second.

In TV and in some monitors and multimedia standards, another system, interlaced scanning, is used. Here, the odd-numbered lines are traced first, then the even-numbered lines. This results in "odd" and "even" fields — two fields make up one frame.

In fact, the odd lines (starting from 1) end up at the middle of a line at the end of the odd field, and the even scan starts at a half-way point. Figure 5.1 shows the scheme used. First the solid (odd) lines are traced — P to Q, then R to S, and so on, ending at T — then the even field starts at U and ends at V. The scan lines are not horizontal because a small voltage is applied, moving the electron beam down over time.
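The two-fields-per-frame structure can be sketched in a few lines of Python (illustrative only; a frame is modeled as a plain list of scan lines, numbered from 1 as in Figure 5.1):

```python
# Sketch (not from the book): splitting one interlaced frame into its two
# fields. A "frame" here is just a list of scan lines, so the odd field
# holds lines 1, 3, 5, ... and the even field holds lines 2, 4, 6, ...

def split_fields(frame):
    """Return (odd_field, even_field) for a frame given as a list of lines."""
    odd_field = frame[0::2]   # lines 1, 3, 5, ... (0-based indices 0, 2, 4)
    even_field = frame[1::2]  # lines 2, 4, 6, ...
    return odd_field, even_field

frame = ["line%d" % n for n in range(1, 9)]  # an 8-line toy frame
odd, even = split_fields(frame)
print(odd)   # ['line1', 'line3', 'line5', 'line7']
print(even)  # ['line2', 'line4', 'line6', 'line8']
```

Displaying the two fields one after the other, each in half the frame time, is what doubles the presentation rate seen by the eye.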

Chapter 5: Fundamental Concepts in Video

FIGURE 5.1: Interlaced raster scan. [Solid (odd) lines run P to Q, R to S, ..., ending at T; the even field runs from U to V.]

Interlacing was invented because, when standards were being defined, it was difficult to transmit the amount of information in a full frame quickly enough to avoid flicker. The double number of fields presented to the eye reduces perceived flicker.

Because of interlacing, the odd and even lines are displaced in time from each other. This is generally not noticeable except when fast action is taking place onscreen, when blurring may occur. For example, in the video in Figure 5.2, the moving helicopter is blurred more than the still background.

FIGURE 5.2: Interlaced scan produces two fields for each frame: (a) the video frame; (b) Field 1; (c) Field 2; (d) difference of fields.

Since it is sometimes necessary to change the frame rate, resize, or even produce stills from an interlaced source video, various schemes are used to de-interlace it. The simplest de-interlacing method consists of discarding one field and duplicating the scan lines of the other field, which results in the information in one field being lost completely. Other, more complicated methods retain information from both fields.

CRT displays are built like fluorescent lights and must flash 50 to 70 times per second to appear smooth. In Europe, this fact is conveniently tied to their 50 Hz electrical system, and they use video digitized at 25 frames per second (fps); in North America, the 60 Hz electric system dictates 30 fps.

The jump from Q to R and so on in Figure 5.1 is called the horizontal retrace, during which the electronic beam in the CRT is blanked. The jump from T to U or V to P is called the vertical retrace.

Since voltage is one-dimensional — it is simply a signal that varies with time — how do we know when a new video line begins? That is, what part of an electrical signal tells us that we have to restart at the left side of the screen?

The solution used in analog video is a small voltage offset from zero to indicate black and another value, such as zero, to indicate the start of a line. Namely, we could use a "blacker-than-black" zero signal to indicate the beginning of a line.

Figure 5.3 shows a typical electronic signal for one scan line of NTSC composite video. 'White' has a peak value of 0.714 V; 'Black' is slightly above zero at 0.055 V; whereas Blank is at zero volts. As shown, the time duration for blanking pulses in the signal is used for synchronization as well, with the tip of the Sync signal at approximately -0.286 V. In fact, the problem of reliable synchronization is so important that special signals to control sync take up about 30% of the signal!

FIGURE 5.3: Electronic signal for one NTSC scan line. [Levels marked: White (0.714 V), Black (0.055 V), Blank (0 V), Sync (-0.286 V); the horizontal retrace precedes the active line signal.]
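The simplest de-interlacing method described earlier in this section, discarding one field and duplicating the scan lines of the other, can be sketched as follows (illustrative Python; the function name is ours, not from any standard API):

```python
# Sketch of the simplest de-interlacing method: discard one field and
# duplicate the scan lines of the kept field. The discarded field's
# information is lost completely, exactly as the text notes.

def deinterlace_line_double(field):
    """Rebuild a full frame from one field by repeating each of its lines."""
    frame = []
    for line in field:
        frame.append(line)
        frame.append(line)  # duplicated line stands in for the missing field
    return frame

odd_field = ["A", "B", "C"]                # the kept field (e.g., odd lines)
print(deinterlace_line_double(odd_field))  # ['A', 'A', 'B', 'B', 'C', 'C']
```

More elaborate methods would instead blend or motion-compensate lines from both fields rather than throwing one away.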

The vertical retrace and sync ideas are similar to the horizontal one, except that they happen only once per field. Tekalp [2] presents a good discussion of the details of analog (and digital) video. The handbook [3] considers many fundamental problems in video processing in great depth.

5.2.1 NTSC Video

The NTSC TV standard is mostly used in North America and Japan. It uses a familiar 4:3 aspect ratio (i.e., the ratio of picture width to height) and 525 scan lines per frame at 30 frames per second.

More exactly, for historical reasons NTSC uses 29.97 fps — or, in other words, 33.37 msec per frame. NTSC follows the interlaced scanning system, and each frame is divided into two fields, with 262.5 lines/field. Thus the horizontal sweep frequency is 525 × 29.97 ≈ 15,734 lines/sec, so that each line is swept out in 1/15,734 sec ≈ 63.6 μsec. Since the horizontal retrace takes 10.9 μsec, this leaves 52.7 μsec for the active line signal, during which image data is displayed (see Figure 5.3).

Figure 5.4 shows the effect of "vertical retrace and sync" and "horizontal retrace and sync" on the NTSC video raster. Blanking information is placed into 20 lines reserved for control information at the beginning of each field. Hence, the number of active video lines per frame is only 485. Similarly, almost 1/6 of the raster at the left side is blanked for horizontal retrace and sync. The nonblanking pixels are called active pixels.

FIGURE 5.4: Video raster, including retrace and sync data. [The vertical retrace and sync region spans the top of the raster; the horizontal retrace and sync region runs down its left side.]

Pixels often fall between scanlines. Therefore, even with noninterlaced scan, NTSC TV is capable of showing only about 340 (visually distinct) lines — about 70% of the 485 specified active lines. With interlaced scan, it could be as low as 50%.

Image data is not encoded in the blanking regions, but other information can be placed there, such as V-chip information, stereo audio channel data, and subtitles in many languages.

NTSC video is an analog signal with no fixed horizontal resolution. Therefore, we must decide how many times to sample the signal for display. Each sample corresponds to one pixel output. A pixel clock divides each horizontal line of video into samples. The higher the frequency of the pixel clock, the more samples per line.

Different video formats provide different numbers of samples per line, as listed in Table 5.1. Laser disks have about the same resolution as Hi-8. (In comparison, miniDV 1/4-inch tapes for digital video are 480 lines by 720 samples per line.)

TABLE 5.1: Samples per line for various analog video formats.

    Format          Samples per line
    VHS             240
    S-VHS           400-425
    Beta-SP         500
    Standard 8 mm   300
    Hi-8 mm         425

NTSC uses the YIQ color model. We employ the technique of quadrature modulation to combine (the spectrally overlapped part of) I (in-phase) and Q (quadrature) signals into a single chroma signal C [1, 2]:

    C = I cos(Fsc t) + Q sin(Fsc t)                                  (5.1)

This modulated chroma signal is also known as the color subcarrier, whose magnitude is sqrt(I² + Q²) and phase is tan⁻¹(Q/I). The frequency of C is Fsc ≈ 3.58 MHz.

The I and Q signals are multiplied in the time domain by cosine and sine functions with the frequency Fsc [Equation (5.1)]. This is equivalent to convolving their Fourier transforms in the frequency domain with two impulse functions at Fsc and -Fsc. As a result, copies of the I and Q frequency spectra are made which are centered at Fsc and -Fsc, respectively.¹

¹ Negative frequency (-Fsc) is a mathematical notion needed in the Fourier transform. In the physical spectrum, only positive frequency is used.

The NTSC composite signal is a further composition of the luminance signal Y and the chroma signal, as defined below:

    composite = Y + C = Y + I cos(Fsc t) + Q sin(Fsc t)              (5.2)

NTSC assigned a bandwidth of 4.2 MHz to Y but only 1.6 MHz to I and 0.6 MHz to Q, due to humans' insensitivity to color details (high-frequency color changes). As Figure 5.5 shows, the picture carrier is at 1.25 MHz in the NTSC video channel, which has a total bandwidth of 6 MHz. The chroma signal is "carried" by Fsc ≈ 3.58 MHz towards the higher end of the channel and is thus centered at 1.25 + 3.58 = 4.83 MHz. This greatly reduces the potential interference between the Y (luminance) and C (chrominance) signals, since the magnitudes of higher-frequency components of Y are significantly smaller than their lower-frequency counterparts.

Moreover, as Blinn [1] explains, great care is taken to interleave the discrete Y and C spectra so as to further reduce the interference between them. The "interleaving" is illustrated in Figure 5.5, where the frequency components for Y (from the discrete Fourier transform) are shown as solid lines, and those for I and Q are shown as dashed lines. As

a result, the 4.2 MHz band of Y is overlapped and interleaved with the 1.6 MHz to I and 0.6 MHz to Q.

FIGURE 5.5: Interleaving Y and C signals in the NTSC spectrum. [The 6 MHz channel carries a 4.2 MHz luminance band; the picture carrier, color subcarrier, and audio subcarrier are marked along the frequency axis (MHz).]

The first step in decoding the composite signal at the receiver side is to separate Y and C. Generally, low-pass filters can be used to extract Y, which is located at the lower end of the channel. TV sets with higher quality also use comb filters [1] to exploit the fact that Y and C are interleaved.

After separation from Y, the chroma signal C can be demodulated to extract I and Q separately.

To extract I:

1. Multiply the signal C by 2 cos(Fsc t):

       C · 2 cos(Fsc t) = I · 2 cos²(Fsc t) + Q · 2 sin(Fsc t) cos(Fsc t)
                        = I · (1 + cos(2 Fsc t)) + Q · 2 sin(Fsc t) cos(Fsc t)
                        = I + I · cos(2 Fsc t) + Q · sin(2 Fsc t)

2. Apply a low-pass filter to obtain I and discard the two higher-frequency (2 Fsc) terms.

Similarly, extract Q by first multiplying C by 2 sin(Fsc t) and then applying low-pass filtering.

The NTSC bandwidth of 6 MHz is tight. Its audio subcarrier frequency is 4.5 MHz, which places the center of the audio band at 1.25 + 4.5 = 5.75 MHz in the channel (Figure 5.5). This would actually be a bit too close to the color subcarrier — a cause for potential interference between the audio and color signals. It was due largely to this reason that NTSC color TV slowed its frame rate to 30 × 1,000/1,001 ≈ 29.97 fps [4]. As a result, the adopted NTSC color subcarrier frequency is slightly lowered, to

    fsc = 30 × 1,000/1,001 × 525 × 227.5 ≈ 3.579545 MHz

where 227.5 is the number of color samples per scan line in NTSC broadcast TV.

5.2.2 PAL Video

PAL (Phase Alternating Line) is a TV standard originally invented by German scientists. It uses 625 scan lines per frame, at 25 frames per second (or 40 msec/frame), with a 4:3 aspect ratio and interlaced fields. Its broadcast TV signals are also used in composite video. This important standard is widely used in Western Europe, China, India, and many other parts of the world.

PAL uses the YUV color model with an 8 MHz channel, allocating a bandwidth of 5.5 MHz to Y and 1.8 MHz each to U and V. The color subcarrier frequency is fsc ≈ 4.43 MHz. To improve picture quality, chroma signals have alternate signs (e.g., +U and -U) in successive scan lines; hence the name "Phase Alternating Line."² This facilitates the use of a (line-rate) comb filter at the receiver — the signals in consecutive lines are averaged so as to cancel the chroma signals (which always carry opposite signs) for separating Y and C and obtain high-quality Y signals.

² According to Blinn [1], NTSC selects a half-integer (227.5) number of color samples for each scan line. Hence, its chroma signal also switches sign in successive scan lines.

5.2.3 SECAM Video

SECAM, which was invented by the French, is the third major broadcast TV standard. SECAM stands for Système Electronique Couleur Avec Mémoire. SECAM also uses 625 scan lines per frame, at 25 frames per second, with a 4:3 aspect ratio and interlaced fields. The original design called for a higher number of scan lines (over 800), but the final version settled for 625.

SECAM and PAL are similar, differing slightly in their color coding scheme. In SECAM, U and V signals are modulated using separate color subcarriers at 4.25 MHz and 4.41 MHz, respectively. They are sent in alternate lines — that is, only one of the U or V signals will be sent on each scan line.

Table 5.2 gives a comparison of the three major analog broadcast TV systems.

5.3 DIGITAL VIDEO

The advantages of digital representation for video are many. It permits

• Storing video on digital devices or in memory, ready to be processed (noise removal, cut and paste, and so on) and integrated into various multimedia applications

• Direct access, which makes nonlinear video editing simple

• Repeated recording without degradation of image quality

• Ease of encryption and better tolerance to channel noise
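Before moving on, the quadrature modulation of Equation (5.1) and the two demodulation steps of Section 5.2.1 can be checked numerically. The Python sketch below is illustrative only: I and Q are held constant, and a plain average over one subcarrier period stands in for the low-pass filter.

```python
import math

# Numeric sketch of Equation (5.1) and the demodulation recipe
# (multiply by 2 cos, then low-pass filter). Constant I, Q values and
# an averaging "filter" are simplifying assumptions, not broadcast code.

F_SC = 3.58e6              # NTSC color subcarrier frequency (Hz)
I0, Q0 = 0.3, -0.7         # constant chroma components for the demo
N = 1000                   # samples over one subcarrier period

def chroma(t):
    """C = I cos(2*pi*Fsc*t) + Q sin(2*pi*Fsc*t), as in Equation (5.1)."""
    w = 2 * math.pi * F_SC
    return I0 * math.cos(w * t) + Q0 * math.sin(w * t)

period = 1.0 / F_SC
ts = [k * period / N for k in range(N)]
w = 2 * math.pi * F_SC

# Multiply by 2 cos (resp. 2 sin), then average over one full period:
# the 2*Fsc terms cancel, leaving I (resp. Q).
i_rec = sum(chroma(t) * 2 * math.cos(w * t) for t in ts) / N
q_rec = sum(chroma(t) * 2 * math.sin(w * t) for t in ts) / N

print(round(i_rec, 6), round(q_rec, 6))  # 0.3 -0.7
```

Averaging over a whole period is the crudest possible low-pass filter; a real receiver uses an analog filter, but the cancellation of the 2 Fsc terms is the same idea.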

TABLE 5.2: Comparison of analog broadcast TV systems.

    TV system   Frame rate   Number of    Total channel   Bandwidth allocation (MHz)
                (fps)        scan lines   width (MHz)     Y     I or U   Q or V
    NTSC        29.97        525          6.0             4.2   1.6      0.6
    PAL         25           625          8.0             5.5   1.8      1.8
    SECAM       25           625          8.0             6.0   2.0      2.0

In earlier Sony or Panasonic recorders, digital video was in the form of composite video. Modern digital video generally uses component video, although RGB signals are first converted into a certain type of color opponent space, such as YUV. The usual color space is YCbCr [5].

5.3.1 Chroma Subsampling

Since humans see color with much less spatial resolution than black and white, it makes sense to decimate the chrominance signal. Interesting but not necessarily informative names have arisen to label the different schemes used. To begin with, numbers are given stating how many pixel values, per four original pixels, are actually sent. Thus the chroma subsampling scheme "4:4:4" indicates that no chroma subsampling is used. Each pixel's Y, Cb, and Cr values are transmitted, four for each of Y, Cb, and Cr.

The scheme "4:2:2" indicates horizontal subsampling of the Cb and Cr signals by a factor of 2. That is, of four pixels horizontally labeled 0 to 3, all four Ys are sent, and every two Cbs and two Crs are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on.

The scheme "4:1:1" subsamples horizontally by a factor of 4. The scheme "4:2:0" subsamples in both the horizontal and vertical dimensions by a factor of 2. Theoretically, an average chroma pixel is positioned between the rows and columns, as shown in Figure 5.6. We can see that the scheme 4:2:0 is in fact another kind of 4:1:1 sampling, in the sense that we send 4, 1, and 1 values per 4 pixels. Therefore, the labeling scheme is not a very reliable mnemonic!

FIGURE 5.6: Chroma subsampling. [Sampling patterns for 4:4:4, 4:2:2, 4:1:1, and 4:2:0. Legend: o = pixel with only Y value; • = pixel with only Cr and Cb values; * = pixel with Y, Cr, and Cb values.]

Scheme 4:2:0, along with others, is commonly used in JPEG and MPEG (see later chapters in Part II).

5.3.2 CCIR Standards for Digital Video

The CCIR is the Consultative Committee for International Radio. One of the most important standards it has produced is CCIR-601, for component digital video (introduced in Section 4.3.4). This standard has since become standard ITU-R-601, an international standard for professional video applications. It is adopted by certain digital video formats, including the popular DV video.

The NTSC version has 525 scan lines, each having 858 pixels (with 720 of them visible, not in the blanking period). Because the NTSC version uses 4:2:2, each pixel can be represented with two bytes (8 bits for Y and 8 bits alternating between Cb and Cr). The CCIR 601 (NTSC) data rate (including blanking and sync but excluding audio) is thus approximately 216 Mbps (megabits per second):

    525 × 858 × 30 × 2 bytes × 8 bits/byte ≈ 216 Mbps

During blanking, digital video systems may make use of the extra data capacity to carry audio signals, translations into foreign languages, or error-correction information.

Table 5.3 shows some of the digital video specifications, all with an aspect ratio of 4:3. The CCIR 601 standard uses an interlaced scan, so each field has only half as much vertical resolution (e.g., 240 lines in NTSC).
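As a quick numeric check (an illustrative Python sketch, not from the book), the CCIR 601 (NTSC) data-rate figure and the per-4-pixel cost of the chroma subsampling schemes of Section 5.3.1 can be computed directly:

```python
# Back-of-the-envelope check of the CCIR 601 (NTSC) data rate and of the
# sample counts behind the subsampling labels. The scheme table below
# restates Section 5.3.1; it is not an official definition.

LINES, SAMPLES, FPS = 525, 858, 30
BYTES_PER_PIXEL = 2                 # 8 bits Y + 8 bits alternating Cb/Cr

bits_per_sec = LINES * SAMPLES * FPS * BYTES_PER_PIXEL * 8
print(bits_per_sec / 1e6)           # 216.216, i.e., ~216 Mbps

# (Y, Cb, Cr) samples sent per 4-pixel group under each labeling scheme.
# Note that 4:2:0 also sends 4 + 1 + 1, which is why the label is a poor
# mnemonic.
schemes = {"4:4:4": (4, 4, 4), "4:2:2": (4, 2, 2),
           "4:1:1": (4, 1, 1), "4:2:0": (4, 1, 1)}
for name, (y, cb, cr) in schemes.items():
    print(name, y + cb + cr, "samples per 4 pixels")
```

So 4:2:2 halves the chroma data and 4:2:0 quarters it, relative to 4:4:4, while the luminance samples are untouched.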

TABLE 5.3: Digital video specifications.

                             CCIR 601     CCIR 601     CIF         QCIF
                             525/60       625/50
                             NTSC         PAL/SECAM
    Luminance resolution     720 x 480    720 x 576    352 x 288   176 x 144
    Chrominance resolution   360 x 480    360 x 576    176 x 144   88 x 72
    Color subsampling        4:2:2        4:2:2        4:2:0       4:2:0
    Aspect ratio             4:3          4:3          4:3         4:3
    Fields/sec               60           50           30          30
    Interlaced               Yes          Yes          No          No

CIF stands for Common Intermediate Format, specified by the International Telegraph and Telephone Consultative Committee (CCITT), now superseded by the International Telecommunication Union, which oversees both telecommunications (ITU-T) and radio frequency matters (ITU-R) under one United Nations body. The idea of CIF, which is about the same as VHS quality, is to specify a format for lower bitrate. CIF uses a progressive (noninterlaced) scan. QCIF stands for Quarter-CIF, and is for even lower bitrate. All the CIF/QCIF resolutions are evenly divisible by 8, and all except 88 are divisible by 16; this is convenient for block-based video coding in H.261 and H.263, discussed in Chapter 10.

CIF is a compromise between NTSC and PAL, in that it adopts the NTSC frame rate and half the number of active lines in PAL. When played on existing TV sets, NTSC TV will need to convert the number of lines, whereas PAL TV will require frame-rate conversion.

5.3.3 High Definition TV (HDTV)

The introduction of wide-screen movies brought the discovery that viewers seated near the screen enjoyed a level of participation (sensation of immersion) not experienced with conventional movies. Apparently the exposure to a greater field of view, especially the involvement of peripheral vision, contributes to the sense of "being there". The main thrust of High Definition TV (HDTV) is not to increase the "definition" in each unit area, but rather to increase the visual field, especially its width.

First-generation HDTV was based on an analog technology developed by Sony and NHK in Japan in the late 1970s. HDTV successfully broadcast the 1984 Los Angeles Olympic Games in Japan. Multiple sub-Nyquist Sampling Encoding (MUSE) was an improved NHK HDTV with hybrid analog/digital technologies that was put in use in the 1990s. It has 1,125 scan lines, interlaced (60 fields per second), and a 16:9 aspect ratio. It uses satellite to broadcast — quite appropriate for Japan, which can be covered with one or two satellites. The Direct Broadcast Satellite (DBS) channels used have a bandwidth of 24 MHz.

In general, terrestrial broadcast, satellite broadcast, cable, and broadband networks are all feasible means for transmitting HDTV as well as conventional TV. Since uncompressed HDTV will easily demand more than 20 MHz bandwidth, which will not fit in the current 6 MHz or 8 MHz channels, various compression techniques are being investigated. It is also anticipated that high-quality HDTV signals will be transmitted using more than one channel, even after compression.

In 1987, the FCC decided that HDTV standards must be compatible with the existing NTSC standard and must be confined to the existing Very High Frequency (VHF) and Ultra High Frequency (UHF) bands. This prompted a number of proposals in North America by the end of 1988, all of them analog or mixed analog/digital.

In 1990, the FCC announced a different initiative — its preference for full-resolution HDTV. They decided that HDTV would be simultaneously broadcast with existing NTSC TV and eventually replace it. The development of digital HDTV immediately took off in North America.

Witnessing a boom of proposals for digital HDTV, the FCC made a key decision to go all digital in 1993. A "grand alliance" was formed that included four main proposals, by General Instruments, MIT, Zenith, and AT&T, and by Thomson, Philips, Sarnoff and others. This eventually led to the formation of the Advanced Television Systems Committee (ATSC), which was responsible for the standard for TV broadcasting of HDTV. In 1995, the U.S. FCC Advisory Committee on Advanced Television Service recommended that the ATSC digital television standard be adopted.

The standard supports the video scanning formats shown in Table 5.4. In the table, "I" means interlaced scan and "P" means progressive (noninterlaced) scan. The frame rates supported are both integer rates and the NTSC rates — that is, 60.00 or 59.94, 30.00 or 29.97, 24.00 or 23.98 fps.

TABLE 5.4: Advanced digital TV formats supported by ATSC.

    Number of active    Number of      Aspect ratio    Picture rate
    pixels per line     active lines
    1,920               1,080          16:9            60I 30P 24P
    1,280               720            16:9            60P 30P 24P
    704                 480            16:9 and 4:3    60I 60P 30P 24P
    640                 480            4:3             60I 60P 30P 24P

For video, MPEG-2 is chosen as the compression standard. As will be seen in Chapter 11, it uses Main Level to High Level of the Main Profile of MPEG-2. For audio, AC-3 is the standard. It supports the so-called 5.1 channel Dolby surround sound — five surround channels plus a subwoofer channel.

The salient difference between conventional TV and HDTV [4, 6] is that the latter has a much wider aspect ratio of 16:9 instead of 4:3. (Actually, it works out to be exactly one-third wider than current TV.) Another feature of HDTV is its move toward progressive (noninterlaced) scan. The rationale is that interlacing introduces serrated edges to moving objects and flickers along horizontal edges.
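The parenthetical claim above, that 16:9 is exactly one-third wider than 4:3 at the same picture height, is easy to verify with exact rational arithmetic (an illustrative check):

```python
# Verify that the 16:9 HDTV aspect ratio is exactly 4/3 of the 4:3
# conventional-TV ratio, i.e., one-third wider at equal height.
from fractions import Fraction

hdtv = Fraction(16, 9)
sdtv = Fraction(4, 3)
print(hdtv / sdtv)  # 4/3
```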

The FCC has planned to replace all analog broadcast services with digital TV broadcasting by the year 2006. Consumers with analog TV sets will still be able to receive signals via an 8-VSB (8-level vestigial sideband) demodulation box. The services provided will include

• Standard Definition TV (SDTV) — the current NTSC TV or higher

• Enhanced Definition TV (EDTV) — 480 active lines or higher — the third and fourth rows in Table 5.4

• High Definition TV (HDTV) — 720 active lines or higher. So far, the popular choices are 720P (720 lines, progressive, 30 fps) and 1080I (1,080 lines, interlaced, 30 fps or 60 fields per second). The latter provides slightly better picture quality but requires much higher bandwidth.

5.4 FURTHER EXPLORATION

Tekalp [2] covers various important issues for digital video processing. Chapter 5 of Steinmetz and Nahrstedt [7] provides detailed discussions of video and television systems. Poynton [6] provides an extensive and updated review of digital video and HDTV.

Links given for this chapter on the text web site include:

• Tutorials on NTSC television

• The official ATSC home page

• The latest news on the digital TV front

• Introduction to HDTV

• The official FCC home page

5.5 EXERCISES

1. NTSC video has 525 lines per frame and 63.6 μsec per line, with 20 lines per field of vertical retrace and 10.9 μsec horizontal retrace.

   (a) Where does the 63.6 μsec come from?

   (b) Which takes more time, horizontal retrace or vertical retrace? How much more time?

2. Which do you think has less detectable flicker, PAL in Europe or NTSC in North America? Justify your conclusion.

3. Sometimes the signals for television are combined into fewer than all the parts required for TV transmission.

   (a) Altogether, how many and what are the signals used for studio broadcast TV?

   (b) How many and what signals are used in S-video? What does S-video stand for?

   (c) How many signals are actually broadcast for standard analog TV reception? What kind of video is that called?

4. Show how the Q signal can be extracted from the NTSC chroma signal C [Equation (5.1)] during demodulation.

5. One sometimes hears that the old Betamax format for videotape, which competed with VHS and lost, was actually a better format. How would such a statement be justified?

6. We don't see flicker on a workstation screen when displaying video at NTSC frame rate. Why do you think this might be?

7. Digital video uses chroma subsampling. What is the purpose of this? Why is it feasible?

8. What are the most salient differences between ordinary TV and HDTV? What was the main impetus for the development of HDTV?

9. What is the advantage of interlaced video? What are some of its problems?

10. One solution that removes the problems of interlaced video is to de-interlace it. Why can we not just overlay the two fields to obtain a de-interlaced image? Suggest some simple de-interlacing algorithms that retain information from both fields.

5.6 REFERENCES

1. J.F. Blinn, "NTSC: Nice Technology, Super Color," IEEE Computer Graphics and Applications, 13(2): 17-23, 1993.

2. A.M. Tekalp, Digital Video Processing, Upper Saddle River, NJ: Prentice Hall PTR, 1995.

3. A. Bovik, editor, Handbook of Image and Video Processing, San Diego: Academic Press, 2000.

4. C.A. Poynton, A Technical Introduction to Digital Video, New York: Wiley, 1996.

5. J.F. Blinn, "The World of Digital Video," IEEE Computer Graphics and Applications, 12(5): 106-112, 1992.

6. C.A. Poynton, Digital Video and HDTV Algorithms and Interfaces, San Francisco: Morgan Kaufmann, 2003.

7. R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications and Applications, Upper Saddle River, NJ: Prentice Hall PTR, 1995.

CHAPTER 6

Basics of Digital Audio

Audio information is crucial for multimedia presentations and, in a sense, is the simplest type of multimedia data. However, some important differences between audio and image information cannot be ignored. For example, while it is customary and useful to occasionally drop a video frame from a video stream, to facilitate viewing speed, we simply cannot do the same with sound information or all sense will be lost from that dimension. We introduce basic concepts for sound in multimedia in this chapter and examine the arcane details of compression of sound information in Chapters 13 and 14. The digitization of sound necessarily implies sampling and quantization of signals, so we introduce these topics here.

We begin with a discussion of just what makes up sound information, then we go on to examine the use of MIDI as an enabling technology to capture, store, and play back digital audio. We go on to look at some details of audio quantization, for transmission, and give some introductory information on how digital audio is dealt with for storage or transmission. This entails a first discussion of how subtraction of signals from predicted values yields numbers that are close to zero, and hence easier to deal with.

6.1 DIGITIZATION OF SOUND

6.1.1 What Is Sound?

Sound is a wave phenomenon like light, but it is macroscopic and involves molecules of air being compressed and expanded under the action of some physical device. For example, a speaker in an audio system vibrates back and forth and produces a longitudinal pressure wave that we perceive as sound. (As an example, we get a longitudinal wave by vibrating a Slinky along its length; in contrast, we get a transverse wave by waving the Slinky back and forth perpendicular to its length.)

Without air there is no sound — for example, in space. Since sound is a pressure wave, it takes on continuous values, as opposed to digitized ones with a finite range. Nevertheless, if we wish to use a digital version of sound waves, we must form digitized representations of audio information.

Even though such pressure waves are longitudinal, they still have ordinary wave properties and behaviors, such as reflection (bouncing), refraction (change of angle when entering a medium with a different density), and diffraction (bending around an obstacle). This makes the design of "surround sound" possible.

Since sound consists of measurable pressures at any 3D point, we can detect it by measuring the pressure level at a location, using a transducer to convert pressure to voltage levels.

FIGURE 6.1: An analog signal: continuous measurement of pressure wave. [Amplitude plotted against time.]

6.1.2 Digitization

Figure 6.1 shows the one-dimensional nature of sound. Values change over time in amplitude: the pressure increases or decreases with time [1]. The amplitude value is a continuous quantity. Since we are interested in working with such data in computer storage, we must digitize the analog signals (i.e., continuous-valued voltages) produced by microphones. For image data, we must likewise digitize the time-dependent analog signals produced by typical videocameras. Digitization means conversion to a stream of numbers — preferably integers for efficiency.

Since the graph in Figure 6.1 is two-dimensional, to fully digitize the signal shown we have to sample in each dimension — in time and in amplitude. Sampling means measuring the quantity we are interested in, usually at evenly spaced intervals. The first kind of sampling — using measurements only at evenly spaced time intervals — is simply called sampling (surprisingly), and the rate at which it is performed is called the sampling frequency. Figure 6.2(a) shows this type of digitization.

FIGURE 6.2: Sampling and quantization: (a) sampling the analog signal in the time dimension; (b) quantization is sampling the analog signal in the amplitude dimension.
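The two digitization steps of Figure 6.2, sampling in time followed by quantization in amplitude, can be sketched as follows (illustrative Python; the 440 Hz test tone, the 8 kHz rate, and all helper names are our assumptions, not from the book):

```python
import math

# Sketch of the two digitization steps in Figure 6.2: sample an analog
# signal at evenly spaced times, then quantize each sample to 8 bits
# (256 uniform levels). Signal and parameter choices are illustrative.

def sample(signal, rate_hz, duration_s):
    """Evaluate the analog signal at evenly spaced time intervals."""
    n = int(rate_hz * duration_s)
    return [signal(k / rate_hz) for k in range(n)]

def quantize_8bit(x):
    """Map x in [-1, 1] to one of 256 uniform levels (0..255)."""
    level = int(round((x + 1.0) / 2.0 * 255))
    return max(0, min(255, level))

analog = lambda t: math.sin(2 * math.pi * 440.0 * t)   # a 440 Hz tone
samples = sample(analog, rate_hz=8000, duration_s=0.001)
digital = [quantize_8bit(s) for s in samples]
print(len(digital), min(digital), max(digital))
```

Replacing `quantize_8bit` with a 16-bit version (65,536 levels) is the only change needed for CD-style quantization; nonuniform schemes such as μ-law change the level spacing instead.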

Fundamenlal
For audio, typicai sampiing rales are from 8 kHz (8,000 samples per second) lo 48 kl-iz. frequency
The human ear can hear from abouL 20Hz (a very deep rumbie) lo as much as 20 kl-lz; above
Lhis levei, we enLer lhe range ofulLrasound. The human voice can reach approximately 4kHz
and we need lo bound our sampling raLe fram below by aI least double lhis frequency (see
lhe discussion of lhe Nyquist sampling raLe, beiow). Thus we arrive aL lhe useful range
abouL 8 Lo 40 or 50 kHz. +0.5x
Sampling in lhe amplitude or voltage dimension is called quanhizalia’l shown in Fig 2 >< fundamental
ure 6.2(b). While we have discussed only unifarm sampling, with equaily spaced sampling
intervais, nonuniform sampiing is possible. This is not used for sampling in time but is used
for quantizaLion (see lhe jt-law mie, beiow). T~tpical uniform quanLization rales are 8-biL
and 16-bil; 8-biL quanlizalion divides Lhe vertical axis inLo 256 leveis, and l6-bil divides it
mIo 65,536 leveis. +0.33 x
3 x fundamenLai
To decide how lo digitize audio daLa, we need lo answer lhe foilowing queslions:

1. Whal is Lhe sampling rale?


2. How finely is lhe daLa Lo be quantized, and is lhe quanlizalion uniform?
+ 0.25 x
3. How is audio daLa formalLed (i.e., whaL is lhe file formal)? 4 x fundamenLai

6.1.3 Nyquist Theorem


Signals can be decomposed into a sum of sinusoids, if we are willing to use enough sinusoids. Figure 6.3 shows how weighted sinusoids can build up quite a complex signal. Whereas frequency is an absolute measure, pitch is a perceptual, subjective quality of sound — generally, pitch is relative. Pitch and frequency are linked by setting the note A above middle C to exactly 440 Hz. An octave above that note corresponds to doubling the frequency and takes us to another A note. Thus, with the middle A on a piano ("A4" or "A440") set to 440 Hz, the next A up is 880 Hz, one octave above.

Here, we define harmonics as any series of musical tones whose frequencies are integral multiples of the frequency of a fundamental tone. Figure 6.3 shows the appearance of these harmonics.

FIGURE 6.3: Building up a complex signal by superposing sinusoids.
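The superposition shown in Figure 6.3 can be sketched directly: a fundamental plus weighted sinusoids at integer multiples of its frequency (the particular weights here are illustrative, not prescribed by the book):

```python
import math

def harmonic_signal(t, fundamental_hz=440.0,
                    weights=(1.0, 0.5, 0.33, 0.25)):
    """Sum sinusoids at 1x, 2x, 3x, ... the fundamental frequency,
    each scaled by the corresponding weight."""
    return sum(w * math.sin(2 * math.pi * (k + 1) * fundamental_hz * t)
               for k, w in enumerate(weights))
```

Because every component is an integral multiple of 440 Hz, the sum is still periodic with period 1/440 s; allowing noninteger multiples, as the text notes next, breaks this and yields non-A notes.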
Now, if we allow noninteger multiples of the base frequency, we allow non-A notes and have a complex resulting sound. Nevertheless, each sound is just made from sinusoids. Figure 6.4(a) shows a single sinusoid: it is a single, pure, frequency (only electronic instruments can create such boring sounds).

Now if the sampling rate just equals the actual frequency, we can see from Figure 6.4(b) that a false signal is detected: it is simply a constant, with zero frequency. If, on the other hand, we sample at 1.5 times the frequency, Figure 6.4(c) shows that we obtain an incorrect (alias) frequency that is lower than the correct one — it is half the correct one (the wavelength, from peak to peak, is double that of the actual signal). In computer graphics, much effort is aimed at masking such alias effects by various methods of antialiasing. An alias is any artifact that does not belong to the original signal. Thus, for correct sampling we must use a sampling rate equal to at least twice the maximum frequency content in the signal. This is called the Nyquist rate.

FIGURE 6.4: Aliasing: (a) a single frequency; (b) sampling at exactly the frequency produces a constant; (c) sampling at 1.5 times per cycle produces an alias frequency that is perceived.

The Nyquist Theorem is named after Harry Nyquist, a famous mathematician who worked at Bell Labs. More generally, if a signal is band-limited — that is, if it has a lower limit f1 and an upper limit f2 of frequency components in the signal — then we need a sampling rate of at least 2(f2 − f1).

Suppose we have a fixed sampling rate. Since it would be impossible to recover frequencies higher than half the sampling rate in any event, most systems have an antialiasing filter that restricts the frequency content of the sampler's input to a range at or below half the sampling frequency. Confusingly, the frequency equal to half the Nyquist rate is called the Nyquist frequency. Then for our fixed sampling rate, the Nyquist frequency is half the sampling rate. The highest possible signal frequency component has frequency equal to that of the sampling itself.

Note that the true frequency and its alias are located symmetrically on the frequency axis with respect to the Nyquist frequency pertaining to the sampling rate used. For this reason, the Nyquist frequency associated with the sampling frequency is often called the "folding" frequency. That is to say, if the sampling frequency is less than twice the true frequency and is greater than the true frequency, then the alias frequency equals the sampling frequency minus the true frequency. For example, if the true frequency is 5.5 kHz and the sampling frequency is 8 kHz, then the alias frequency is 2.5 kHz:

    f_alias = f_sampling − f_true,   for f_true < f_sampling < 2 × f_true    (6.1)

As well, a frequency at double any frequency could also fit sample points. In fact, adding any positive or negative multiple of the sampling frequency to the true frequency always gives another possible alias frequency, in that such an alias gives the same set of samples when sampled at the sampling frequency.

So, if again the sampling frequency is less than twice the true frequency and is less than the true frequency, then the alias frequency equals n times the sampling frequency minus the true frequency, where n is the lowest integer that makes n times the sampling frequency larger than the true frequency. For example, when the true frequency is between 1.0 and 1.5 times the sampling frequency, the alias frequency equals the true frequency minus the sampling frequency.

130 Chapter 6 Basics of Digital Audio
Section 6.1 Digitization of Sound 131

FIGURE 6.5: Folding of sinusoid frequency sampled at 8,000 Hz. The folding frequency, shown dashed, is 4,000 Hz.

In general, the apparent frequency of a sinusoid is the lowest frequency of a sinusoid that has exactly the same samples as the input sinusoid. Figure 6.5 shows the relationship of the apparent frequency to the input (true) frequency.

6.1.4 Signal-to-Noise Ratio (SNR)

In any analog system, random fluctuations produce noise added to the signal, and the measured voltage is thus incorrect. The ratio of the power of the correct signal to the noise is called the signal-to-noise ratio (SNR). Therefore, the SNR is a measure of the quality of the signal.

The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages:

    SNR = 10 log10 (V²_signal / V²_noise) = 20 log10 (V_signal / V_noise)    (6.2)

The power in a signal is proportional to the square of the voltage. For example, if the signal voltage V_signal is 10 times the noise, the SNR is 20 × log10(10) = 20 dB.

In terms of power, if the squeaking we hear from ten violins playing is ten times the squeaking we hear from one violin playing, then the ratio of power is given in terms of decibels as 10 dB or, in other words, 1 bel. Notice that decibels are always defined in terms of a ratio. The term "decibels" as applied to sounds in our environment usually is in comparison to a just-audible sound with frequency 1 kHz. The levels of sound we hear around us are described in terms of decibels, as a ratio to the quietest sound we are capable of hearing. Table 6.1 shows approximate levels for these sounds.

TABLE 6.1: Magnitudes of common sounds, in decibels

    Threshold of hearing        0
    Rustle of leaves           10
    Very quiet room            20
    Average room               40
    Conversation               60
    Busy street                70
    Loud radio                 80
    Train through station      90
    Riveter                   100
    Threshold of discomfort   120
    Threshold of pain         140
    Damage to eardrum         160

6.1.5 Signal-to-Quantization-Noise Ratio (SQNR)

For digital signals, we must take into account the fact that only quantized values are stored. For a digital audio signal, the precision of each sample is determined by the number of bits per sample, typically 8 or 16.

Aside from any noise that may have been present in the original analog signal, additional error results from quantization. That is, if voltages are in the range of 0 to 1 but we have only 8 bits in which to store values, we effectively force all continuous values of voltage into only 256 different values. Inevitably, this introduces a roundoff error. Although it is not really "noise", it is called quantization noise (or quantization error). The association with the concept of noise is that such errors will essentially occur randomly from sample to sample.

The quality of the quantization is characterized by the signal-to-quantization-noise ratio (SQNR). Quantization noise is defined as the difference between the value of the analog signal, for the particular sampling time, and the nearest quantization interval value. At most, this error can be as much as half of the interval.

For a quantization accuracy of N bits per sample, the range of the digital signal is −2^(N−1) to 2^(N−1) − 1. Thus, if the actual analog signal is in the range from −V_max to +V_max, each quantization level represents a voltage of 2 V_max / 2^N, or V_max / 2^(N−1). SQNR can be simply expressed in terms of the peak signal, which is mapped to the level V_signal of about 2^(N−1), and the SQNR has as denominator the maximum V_quan_noise of 1/2. The ratio of the two is a simple definition of the SQNR:¹

    SQNR = 20 log10 (V_signal / V_quan_noise) = 20 log10 (2^(N−1) / (1/2))
         = 20 × N × log10 2 = 6.02 N (dB)    (6.3)

In other words, each bit adds about 6 dB of resolution, so 16 bits provide a maximum SQNR of 96 dB.

¹This ratio is actually the peak signal-to-quantization-noise ratio, or PSQNR.

We have examined the worst case. If, on the other hand, we assume that the input signal is sinusoidal, that quantization error is statistically independent, and that its magnitude is uniformly distributed between 0 and half the interval, we can show ([2], p. 37) that the expression for the SQNR becomes

    SQNR = 6.02 N + 1.76 (dB)    (6.4)

Since larger is better, this shows that a more realistic approximation gives a better characterization number for the quality of a system.

Typical digital audio sample precision is either 8 bits per sample, equivalent to about telephone quality, or 16 bits, for CD quality. In fact, 12 bits or so would likely do fine for adequate sound reproduction.

6.1.6 Linear and Nonlinear Quantization

We mentioned above that samples are typically stored as uniformly quantized values. This is called linear format. However, with a limited number of bits available, it may be more sensible to try to take into account the properties of human perception and set up nonuniform quantization levels that pay more attention to the frequency range over which humans hear best.

Remember that here we are quantizing magnitude, or amplitude — how loud the signal is. In Chapter 4, we discussed an interesting feature of many human perception subsystems (as it were) — Weber's Law — which states that the more there is, proportionately more must be added to discern a difference. Stated formally, Weber's Law says that equally perceived differences have values proportional to absolute levels:

    ΔResponse ∝ ΔStimulus / Stimulus    (6.5)

This means that, for example, if we can feel an increase in weight from 10 to 11 pounds, then if instead we start at 20 pounds, it would take 22 pounds for us to feel an increase in weight.

Inserting a constant of proportionality k, we have a differential equation that states

    dr = k (1/s) ds    (6.6)

with response r and stimulus s. Integrating, we arrive at a solution

    r = k ln s + C    (6.7)

with constant of integration C. Stated differently, the solution is

    r = k ln(s/s0)    (6.8)

where s0 is the lowest level of stimulus that causes a response (r = 0 when s = s0).

Thus, nonuniform quantization schemes that take advantage of this perceptual characteristic make use of logarithms. The idea is that in a log plot derived from Equation (6.8), if we simply take uniform steps along the s axis, we are not mirroring the nonlinear response along the r axis.

Instead, we would like to take uniform steps along the r axis. Thus, nonlinear quantization works by first transforming an analog signal from the raw s space into the theoretical r space, then uniformly quantizing the resulting values. The result is that for steps near the low end of the signal, quantization steps are effectively more concentrated on the s axis, whereas for large values of s, one quantization step in r encompasses a wide range of s values.

Such a law for audio is called μ-law encoding, or u-law, since it's easier to write. A very similar rule, called A-law, is used in telephony in Europe.

The equations for these similar encodings are as follows:

μ-law:

    r = [sgn(s) / ln(1 + μ)] ln(1 + μ |s/s_p|),   |s/s_p| ≤ 1    (6.9)

A-law:

    r = [A / (1 + ln A)] (s/s_p),                   |s/s_p| ≤ 1/A
    r = sgn(s) [1 + ln(A |s/s_p|)] / (1 + ln A),    1/A ≤ |s/s_p| ≤ 1    (6.10)

where

    sgn(s) = 1 if s > 0, −1 otherwise

FIGURE 6.6: Nonlinear transform for audio signals.

Figure 6.6 depicts these curves. The parameter of the μ-law encoder is usually set to μ = 100 or μ = 255, while the parameter for the A-law encoder is usually set to A = 87.6.

Here, s_p is the peak signal value and s is the current signal value. So far, this simply means that we wish to deal with s/s_p in the range −1 to 1.

The idea of using this type of law is that if s/s_p is first transformed to values r as above and then r is quantized uniformly before transmitting or storing the signal, most of the available bits will be used to store information where changes in the signal are most apparent to a human listener, because of our perceptual nonuniformity.

To see this, consider a small change in |s/s_p| near the value 1.0, where the curve in Figure 6.6 is flattest. Clearly, the change in s has to be much larger in the flat area than near the origin to be registered by a change in the quantized r value. And it is at the quiet, low end of our hearing that we can best discern small changes in s. The μ-law transform concentrates the available information at that end.

First we carry out the μ-law transformation, then we quantize the resulting value, which is a nonlinear transform away from the input. The logarithmic steps represent low-amplitude, quiet signals with more accuracy than loud, high-amplitude ones. What this means for signals that are then encoded as a fixed number of bits is that for low-amplitude, quiet signals, the amount of noise — the error in representing the signal — is a smaller number than for high-amplitude signals. Therefore, the μ-law transform effectively makes the signal-to-noise ratio more uniform across the range of input signals.

This technique is based on human perception — a simple form of "perceptual coder". Interestingly, we have in effect also made use of the statistics of sounds we are likely to hear, which are generally in the low-volume range. In effect, we are asking for most bits to be assigned where most sounds occur — where the probability density is highest. So this type of coder is also one that is driven by statistics.

In summary, a logarithmic transform, called a "compressor" in the parlance of telephony, is applied to the analog signal before it is sampled and converted to digital (by an analog-to-digital, or AD, converter). The amount of compression increases as the amplitude of the input signal increases. The AD converter carries out a uniform quantization on the "compressed" signal. After transmission, since we need analog to hear sound, the signal is converted back, using a digital-to-analog (DA) converter, then passed through an "expander" circuit that reverses the logarithm. The overall transformation is called companding. Nowadays, companding can also be carried out in the digital domain.

The μ-law in audio is used to develop a nonuniform quantization rule for sound. In general, we would like to put the available bits where the most perceptual acuity (sensitivity to small changes) is. Ideally, bit allocation occurs by examining a curve of stimulus versus response for humans. Then we try to allocate bit levels to intervals for which a small change in stimulus produces a large change in response.
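The μ-law compressor of Equation (6.9) and the matching expander can be transcribed directly; this sketch uses μ = 255, the common telephony setting mentioned above, and leaves out the uniform quantization step between the two:

```python
import math

MU = 255.0

def mu_law_compress(s, s_p=1.0):
    """Equation (6.9): map s in [-s_p, s_p] to r in [-1, 1]."""
    x = s / s_p
    return math.copysign(math.log(1 + MU * abs(x)) / math.log(1 + MU), x)

def mu_law_expand(r, s_p=1.0):
    """The 'expander': invert the compressor to recover s from r."""
    return math.copysign(s_p * ((1 + MU) ** abs(r) - 1) / MU, r)
```

A small input such as s = 0.01 maps to r ≈ 0.23, so quiet signals occupy a large share of the uniformly quantized r range, which is exactly the concentration of information at the low end described above.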

That is, the idea of companding reflects a less specific idea used in assigning bits to signals: put the bits where they are most needed to deliver finer resolution where the result can be perceived. This idea militates against simply using uniform quantization schemes, instead favoring nonuniform schemes for quantization. The μ-law (or A-law) for audio is an application of this idea.

6.1.7 Audio Filtering

Prior to sampling and AD conversion, the audio signal is also usually filtered to remove unwanted frequencies. The frequencies kept depend on the application. For speech, typically from 50 Hz to 10 kHz is retained. Other frequencies are blocked by a band-pass filter, also called a band-limiting filter, which screens out lower and higher frequencies.

An audio music signal will typically contain from about 20 Hz up to 20 kHz. (Twenty Hz is the low rumble produced by an upset elephant. Twenty kHz is about the highest squeak we can hear.) So the band-pass filter for music will screen out frequencies outside this range.

At the DA converter end, even though we have removed high frequencies that are likely just noise in any event, they reappear in the output. The reason is that because of sampling and then quantization, we have effectively replaced a perhaps smooth input signal by a series of step functions. In theory, such a discontinuous signal contains all possible frequencies. Therefore, at the decoder side, a low-pass filter is used after the DA circuit, making use of the same cutoff as at the high-frequency end of the coder's band-pass filter.

We have still somewhat sidestepped the issue of just how many bits are required for speech or audio application. Some of the exercises at the end of the chapter will address this issue.

Some important audio file formats include AU (for UNIX workstations), AIFF (for MAC and SGI machines), and WAV (for PCs and DEC workstations). The MP3 compressed file format is discussed in Chapter 14.

6.1.8 Audio Quality versus Data Rate

The uncompressed data rate increases as more bits are used for quantization. Stereo information, as opposed to mono, doubles the amount of bandwidth (in bits per second) needed to transmit a digital audio signal. Table 6.2 shows how audio quality is related to data rate and bandwidth.

The term bandwidth, in analog devices, refers to the part of the response or transfer function of a device that is approximately constant, or flat, with the x-axis being the frequency and the y-axis equal to the transfer function. Half-power bandwidth (HPBW) refers to the bandwidth between points when the power falls to half the maximum power. Since 10 log10(0.5) ≈ −3.0, the term −3 dB bandwidth is also used to refer to the HPBW.

So for analog devices, the bandwidth is expressed in frequency units, called Hertz (Hz), which is cycles per second. For digital devices, on the other hand, the amount of data that can be transmitted in a fixed bandwidth is usually expressed in bits per second (bps) or bytes per amount of time. For either analog or digital, the term expresses the amount of data that can be transmitted in a fixed amount of time.

TABLE 6.2: Data rate and bandwidth in sample audio applications

Quality      Sample rate   Bits per   Mono/      Data rate (if        Frequency
             (kHz)         sample     stereo     uncompressed)        band (Hz)
                                                 (kB/sec)
Telephone    8             8          Mono       8                    200–3,400
AM radio     11.025        8          Mono       11.0                 100–5,500
FM radio     22.05         16         Stereo     88.2                 20–11,000
CD           44.1          16         Stereo     176.4                5–20,000
DAT          48            16         Stereo     192.0                5–20,000
DVD audio    192 (max)     24 (max)   Up to 6    1,200.0 (max)        0–96,000 (max)
                                      channels

Telephony uses μ-law encoding, or A-law in Europe. The other formats use linear quantization. Using the μ-law rule shown in Equation (6.9), the dynamic range of digital telephone signals is effectively improved from 8 bits to 12 or 13.

Sometimes it is useful to remember the kinds of data rates in Table 6.2 in terms of bytes per minute. For example, the uncompressed digital audio signal for CD-quality stereo sound is 10.6 megabytes per minute — roughly 10 megabytes per minute.

6.1.9 Synthetic Sounds

Digitized sound must still be converted to analog, for us to hear it. There are two fundamentally different approaches to handling stored sampled audio. The first is termed FM, for frequency modulation. The second is called Wave Table, or just Wave, sound.

In the first approach, a carrier sinusoid is changed by adding another term involving a second, modulating frequency. A more interesting sound is created by changing the argument of the main cosine term, putting the second cosine inside the argument itself — then we have a cosine of a cosine. A time-varying amplitude "envelope" function multiplies the whole signal, and another time-varying function multiplies the inner cosine, to account for overtones. Adding a couple of extra constants, the resulting function is complex indeed.

For example, Figure 6.7(a) shows the function cos(2πt), and Figure 6.7(b) is another sinusoid at twice the frequency. A cosine of a cosine is the more interesting function Figure 6.7(c), and finally, with carrier frequency 2 and modulating frequency 4, we have the much more interesting curve Figure 6.7(d). Obviously, once we consider a more complex signal, such as the following [3],

    x(t) = A(t) cos[ω_c π t + I(t) cos(ω_m π t + φ_m) + φ_c]    (6.11)

we can create a most complicated signal.

This FM synthesis equation states that we make a signal using a basic carrier frequency ω_c and also use an additional, modulating frequency ω_m. In Figure 6.7(d), these values were ω_c = 2 and ω_m = 4. The phase constants φ_m and φ_c create time-shifts for a more interesting sound.
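Equation (6.11) can be transcribed almost symbol for symbol; the constant envelope A(t) and modulation index I(t) used here are placeholders for the time-varying functions the text describes next:

```python
import math

def fm_sample(t, wc=2.0, wm=4.0, phi_m=0.0, phi_c=0.0,
              A=lambda t: 1.0, I=lambda t: 1.0):
    """One sample of Equation (6.11):
    x(t) = A(t) cos[wc*pi*t + I(t) cos(wm*pi*t + phi_m) + phi_c]."""
    return A(t) * math.cos(wc * math.pi * t
                           + I(t) * math.cos(wm * math.pi * t + phi_m)
                           + phi_c)

# the curve of Figure 6.7(d): carrier frequency 2, modulating frequency 4
curve = [fm_sample(n / 100.0) for n in range(101)]
```

Passing a decaying A(t), such as `lambda t: math.exp(-t)`, fades the sound out, which is the role of the envelope described below.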

FIGURE 6.7: Frequency modulation: (a) a single frequency; (b) twice the frequency; (c) usually, FM is carried out using a sinusoid argument to a sinusoid; (d) a more complex form arises from a carrier frequency 2πt and a modulating frequency 4πt cosine inside the sinusoid.

The time-dependent function A(t) is called the envelope — it specifies overall loudness over time and is used to fade in and fade out the sound. A guitar string has an attack period, then a decay period, a sustain period, and finally a release period.

Finally, the time-dependent function I(t) is used to produce a feeling of harmonics ("overtones") by changing the amount of modulation frequency heard. When I(t) is small, we hear mainly low frequencies, and when I(t) is larger, we hear higher frequencies as well. FM synthesis is used in low-end versions of the ubiquitous Creative Labs Sound Blaster PC sound card.

A more accurate way of generating sounds from digital signals is called wave-table synthesis. In this technique, digital samples are stored sounds from real instruments. Since wave tables are stored in memory on the sound card, they can be manipulated by software so that sounds can be combined, edited, and enhanced. Sound reproduction is a good deal better with wave tables than with FM synthesis. To save memory space, a variety of special techniques, such as sample looping, pitch shifting, mathematical interpolation, and polyphonic digital filtering, can be applied [4,5].

For example, it is useful to be able to change the key — suppose a song is a bit too high for your voice. A wave table can be mathematically shifted so that it produces lower-pitched sounds. However, this kind of extrapolation can be used only just so far without sounding wrong. Wave tables often include sampling at various notes of the instrument, so that a key change need not be stretched too far. Wave-table synthesis is more expensive than FM synthesis, partly because the data storage needed is much larger.

6.2 MIDI: MUSICAL INSTRUMENT DIGITAL INTERFACE

Wave-table files provide an accurate rendering of real instrument sounds but are quite large. For simple music, we might be satisfied with FM synthesis versions of audio signals that could easily be generated by a sound card. A sound card is added to a PC expansion board and is capable of manipulating and outputting sounds through speakers connected to the board, recording sound input from a microphone connected to the computer, and manipulating sound stored on a disk.

If we are willing to be satisfied with the sound card's defaults for many of the sounds we wish to include in a multimedia project, we can use a simple scripting language and hardware setup called MIDI.

6.2.1 MIDI Overview

MIDI, which dates from the early 1980s, is an acronym that stands for Musical Instrument Digital Interface. It forms a protocol adopted by the electronic music industry that enables computers, synthesizers, keyboards, and other musical devices to communicate with each other. A synthesizer produces synthetic music and is included on sound cards, using one of the two methods discussed above. The MIDI standard is supported by most synthesizers, so sounds created on one can be played and manipulated on another and sound reasonably close. Computers must have a special MIDI interface, but this is incorporated into most sound cards. The sound card must also have both DA and AD converters.

MIDI is a scripting language — it codes "events" that stand for the production of certain sounds. Therefore, MIDI files are generally very small. For example, a MIDI event might include values for the pitch of a single note, its duration, and its volume.

Terminology. A synthesizer was, and still may be, a stand-alone sound generator that can vary pitch, loudness, and tone color. (The pitch is the musical note the instrument plays — a C, as opposed to a G, say.) It can also change additional music characteristics, such as attack and delay time. A good (musician's) synthesizer often has a microprocessor, keyboard, control panels, memory, and so on. However, inexpensive synthesizers are now included on PC sound cards. Units that generate sound are referred to as tone modules or sound modules.

A sequencer started off as a special hardware device for storing and editing a sequence of musical events, in the form of MIDI data. Now it is more often a software music editor on the computer.

A MIDI keyboard produces no sound, instead generating sequences of MIDI instructions, called MIDI messages. These are rather like assembler code and usually consist of just a few bytes. You might have 3 minutes of music, say, stored in only 3 kB.

In comparison, a wave-table file (WAV) stores 1 minute of music in about 10 MB. In MIDI parlance, the keyboard is referred to as a keyboard controller.

MIDI Concepts. Music is organized into tracks in a sequencer. Each track can be turned on or off on recording or playing back. Usually, a particular instrument is associated with a MIDI channel. MIDI channels are used to separate messages. There are 16 channels, numbered from 0 to 15. The channel forms the last four bits (the least significant bits) of the message. The idea is that each channel is associated with a particular instrument — for example, channel 1 is the piano, channel 10 is the drums. Nevertheless, you can switch instruments midstream, if desired, and associate another instrument with any channel.

The channel can also be used as a placeholder in a message. If the first four bits are all ones, the message is interpreted as a system common message. Along with channel messages (which include a channel number), several other types of messages are sent, such as a general message for all instruments indicating a change in tuning or timing; these are called system messages. It is also possible to send a special message to an instrument's channel that allows sending many notes without a channel specified. We will describe these messages in detail later.

The way a synthetic musical instrument responds to a MIDI message is usually by simply ignoring any "play sound" message that is not for its channel. If several messages are for its channel, say several simultaneous notes being played on a piano, then the instrument responds, provided it is multi-voice — that is, can play more than a single note at once.

It is easy to confuse the term voice with the term timbre. The latter is MIDI terminology for just what instrument we are trying to emulate — for example, a piano as opposed to a violin. It is the quality of the sound. An instrument (or sound card) that is multi-timbral is capable of playing many different sounds at the same time (e.g., piano, brass, drums).

On the other hand, the term "voice", while sometimes used by musicians to mean the same thing as timbre, is used in MIDI to mean every different timbre and pitch that the tone module can produce at the same time. Synthesizers can have many (typically 16, 32, 64, 256, etc.) voices. Each voice works independently and simultaneously to produce sounds of different timbre and pitch.

The term polyphony refers to the number of voices that can be produced at the same time. So a typical tone module may be able to produce "64 voices of polyphony" (64 different notes at once) and be "16-part multi-timbral" (can produce sounds like 16 different instruments at once).

How different timbres are produced digitally is by using a patch, which is the set of control settings that define a particular timbre. Patches are often organized into databases, called banks. For true aficionados, software patch editors are available.

A standard mapping specifying just what instruments (patches) will be associated with what channels has been agreed on and is called General MIDI. In General MIDI, there are 128 patches associated with standard instruments, and channel 10 is reserved for percussion instruments.

For most instruments, a typical message might be Note On (meaning, e.g., a keypress), consisting of what channel, what pitch, and what velocity (i.e., volume). For percussion instruments, the pitch data means which kind of drum. A Note On message thus consists of a status byte — which channel, what pitch — followed by two data bytes. It is followed by a Note Off message (key release), which also has a pitch (which note to turn off) and — for consistency, one supposes — a velocity (often set to zero and ignored).

The data in a MIDI status byte is between 128 and 255; each of the data bytes is between 0 and 127. Actual MIDI bytes are 8 bit, plus a 0 start and stop bit, making them 10-bit "bytes". Figure 6.8 shows the MIDI datastream.

FIGURE 6.8: Stream of 10-bit bytes; for typical MIDI messages, these consist of {status byte, data byte, data byte} = {Note On, Note Number, Note Velocity}.

A MIDI device often is capable of programmability, which means it has filters available for changing the bass and treble response and can also change the "envelope" describing how the amplitude of a sound changes over time. Figure 6.9 shows a model of a digital instrument's response to Note On/Note Off messages.

FIGURE 6.9: Stages of amplitude versus time for a music note.

MIDI sequencers (editors) allow you to work with standard music notation or get right into the data, if desired. MIDI files can also store wave-table data. The advantage of wave-table data (WAV files) is that it much more precisely stores the exact sound of an instrument. A sampler is used to sample the audio data — for example, a "drum machine" always stores wave-table data of real drums.
Sequencers employ several techniques for producing more music from what is actually available. For example, looping over (repeating) a few bars can be more or less convincing. Volume can be easily controlled over time — this is called time-varying amplitude modulation. More interestingly, sequencers can also accomplish time compression or expansion with no pitch change.

While it is possible to change the pitch of a sampled instrument, if the key change is large, the resulting sound begins to sound displeasing. For this reason, samplers employ multisampling. A sound is recorded using several band-pass filters, and the resulting recordings are assigned to different keyboard keys. This makes frequency shifting for a change of key more reliable, since less shift is involved for each note.
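The pitch shifting and key-change stretching described above can be illustrated with a deliberately crude resampler; real samplers interpolate rather than taking the nearest stored sample, so treat this only as a sketch of the idea:

```python
def pitch_shift(samples, semitones):
    """Resample a stored waveform by the equal-temperament ratio
    2**(semitones/12). Nearest-sample lookup, no interpolation, so
    quality degrades quickly as the shift grows (hence multisampling)."""
    ratio = 2.0 ** (semitones / 12.0)
    out, pos = [], 0.0
    while int(pos) < len(samples):
        out.append(samples[int(pos)])
        pos += ratio
    return out
```

Shifting up an octave (12 semitones) reads every second sample, halving the duration; that is one reason large shifts sound wrong and wave tables store recordings at several notes.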

6.2.2 Hardware Aspects of MIDI

    The MIDI hardware setup consists of a 31.25 kbps (kilobits per second) serial connection,
with the 10-bit bytes including a start and stop bit. Usually, MIDI-capable units are either
input devices or output devices, not both.
    Figure 6.10 shows a traditional synthesizer. The modulation wheel adds vibrato. Pitch
bend alters the frequency, much like pulling a guitar string over slightly. There are often
other controls, such as foot pedals, sliders, and so on.

FIGURE 6.10: A MIDI synthesizer.

    The physical MIDI ports consist of 5-pin connectors labeled IN and OUT and a third
connector, THRU. This last data channel simply copies data entering the IN channel. MIDI
communication is half-duplex. MIDI IN is the connector via which the device receives all
MIDI data. MIDI OUT is the connector through which the device transmits all the MIDI
data it generates itself. MIDI THRU is the connector by which the device echoes the data it
receives from MIDI IN (and only that — all the data generated by the device itself is sent via
MIDI OUT). These ports are on the sound card or interface externally, either on a separate
card on a PC expansion card slot or using a special interface to a serial or parallel port.

FIGURE 6.11: A typical MIDI setup.

    Figure 6.11 shows a typical MIDI sequencer setup. Here, the MIDI OUT of the keyboard
is connected to the MIDI IN of a synthesizer and then THRU to each of the additional sound
modules. During recording, a keyboard-equipped synthesizer sends MIDI messages to a
sequencer, which records them. During playback, messages are sent from the sequencer to
all the sound modules and the synthesizer, which play the music.

6.2.3 Structure of MIDI Messages

    MIDI messages can be classified into two types, as in Figure 6.12 — channel messages and
system messages — and further classified as shown. Each type of message will be examined
below.

FIGURE 6.12: MIDI message taxonomy: channel messages (voice messages, mode messages)
and system messages (common messages, real-time messages, exclusive messages).
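This taxonomy can be read straight off the status byte. A small, simplified sketch (the helper name is ours; distinguishing mode from voice messages also requires looking at the data bytes, which this sketch omits):

```python
def classify(status: int) -> str:
    """Classify a MIDI status byte according to the taxonomy of Figure 6.12 (a sketch)."""
    assert status & 0x80, "a status byte has its most significant bit set to 1"
    if status >> 4 != 0xF:
        # Channel message: the four low-order bits give the channel, numbered 1..16.
        return f"channel message, channel {(status & 0x0F) + 1}"
    if status == 0xF0:
        return "system exclusive message"
    # &HF1..&HF7 are system common; &HF8..&HFF are system real-time.
    return "system common message" if status < 0xF8 else "system real-time message"

print(classify(0x9C))   # channel message, channel 13
print(classify(0xFA))   # system real-time message
```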

TABLE 6.3: MIDI voice messages

Voice message           | Status byte | Data byte1        | Data byte2
Note Off                | &H8n        | Key number        | Note Off velocity
Note On                 | &H9n        | Key number        | Note On velocity
Polyphonic Key Pressure | &HAn        | Key number        | Amount
Control Change          | &HBn        | Controller number | Controller value
Program Change          | &HCn        | Program number    | None
Channel Pressure        | &HDn        | Pressure value    | None
Pitch Bend              | &HEn        | MSB               | LSB

&H indicates hexadecimal, and n in the status byte hex value stands for a channel number.
All values are in 0..127 except Controller number, which is in 0..120.

    Channel Messages. A channel message can have up to 3 bytes; the first is the status byte
(the opcode, as it were), and has its most significant bit set to 1. The four low-order bits
identify which of the 16 possible channels this message belongs to, with the three remaining
bits holding the message. For a data byte, the most significant bit is set to zero.

    Voice Messages. This type of channel message controls a voice — that is, sends
information specifying which note to play or to turn off — and encodes key pressure. Voice
messages are also used to specify controller effects, such as sustain, vibrato, tremolo, and
the pitch wheel. Table 6.3 lists these operations.
    For Note On and Note Off messages, the velocity is how quickly the key is played.
Typically, a synthesizer responds to a higher velocity by making the note louder or brighter.
Note On makes a note occur, and the synthesizer also attempts to make the note sound like
the real instrument while the note is playing. Pressure messages can be used to alter the
sound of notes while they are playing. The Channel Pressure message is a force measure for
the keys on a specific channel (instrument) and has an identical effect on all notes playing
on that channel. The other pressure message, Polyphonic Key Pressure (also called Key
Pressure), specifies how much volume keys played together are to have and can be different
for each note in a chord. Pressure is also called aftertouch.
    The Control Change instruction sets various controllers (faders, vibrato, etc.). Each
manufacturer may make use of different controller numbers for different tasks. However,
controller 1 is likely the modulation wheel (for vibrato).
    For example, a Note On message is followed by two bytes, one to identify the note and
one to specify the velocity. Therefore, to play note number 80 with maximum velocity on
channel 13, the MIDI device would send the following three hex byte values: &H9C &H50
&H7F. (A hexadecimal number has a range 0..15. Since it is used to denote channels 1 to
16, "&HC" refers to channel 13.) Notes are numbered such that middle C has number 60.
    To play two notes simultaneously (effectively), first we would send a Program Change
message for each of two channels. Recall that Program Change means to load a particular
patch for that channel. So far, we have attached two timbres to two different channels.
Then sending two Note On messages (in serial) would turn on both channels. Alternatively,
we could also send a Note On message for a particular channel and then another Note On
message, with another pitch, before sending the Note Off message for the first note. Then
we would be playing two notes effectively at the same time on the same instrument.
    Polyphonic Pressure refers to how much force simultaneous notes have on several
instruments. Channel Pressure refers to how much force a single note has on one instrument.

    Channel Mode Messages. Channel mode messages form a special case of the Control
Change message, and therefore all mode messages have opcode B (so the message is
"&HBn," or 1011nnnn). However, a Channel Mode message has its first data byte in 121
through 127 (&H79–7F).

TABLE 6.4: MIDI mode messages

1st data byte | Description                  | Meaning of 2nd data byte
&H79          | Reset all controllers        | None; set to 0
&H7A          | Local control                | 0 = off; 127 = on
&H7B          | All notes off                | None; set to 0
&H7C          | Omni mode off                | None; set to 0
&H7D          | Omni mode on                 | None; set to 0
&H7E          | Mono mode on (Poly mode off) | Controller number
&H7F          | Poly mode on (Mono mode off) | None; set to 0

    Channel mode messages determine how an instrument processes MIDI voice messages.
Some examples include respond to all messages, respond just to the correct channel, don't
respond at all, or go over to local control of the instrument.
    Recall that the status byte is "&HBn," where n is the channel. The data bytes have
meanings as shown in Table 6.4. Local Control Off means that the keyboard should be
disconnected from the synthesizer (and another, external, device will be used to control the
sound). All Notes Off is a handy command, especially if, as sometimes happens, a bug arises
such that a note is left playing inadvertently. Omni means that devices respond to messages
from all channels. The usual mode is OMNI OFF — pay attention to your own messages
only, and do not respond to every message regardless of what channel it is on. Poly means a
device will play back several notes at once if requested to do so. The usual mode is POLY ON.
    In POLY OFF — monophonic mode — the argument that represents the number of
monophonic channels can have a value of zero, in which case it defaults to the number
of voices the receiver can play; or it may be set to a specific number of channels. However,
the exact meaning of the combination of OMNI ON/OFF and Mono/Poly depends on the
specific combination, with four possibilities. Suffice it to say that the usual combination is
OMNI OFF, POLY ON.
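The three-byte Note On example earlier in this subsection is easy to reproduce programmatically. A minimal sketch that builds the raw bytes (the helper name is ours, not part of any MIDI library):

```python
def note_on(channel: int, key: int, velocity: int) -> bytes:
    """Build a MIDI Note On message. channel is 1..16; key and velocity are 0..127."""
    assert 1 <= channel <= 16 and 0 <= key <= 127 and 0 <= velocity <= 127
    status = 0x90 | (channel - 1)   # opcode 9 in the high nibble, channel in the low
    return bytes([status, key, velocity])

# Note number 80 at maximum velocity on channel 13: &H9C &H50 &H7F.
msg = note_on(13, 80, 127)
print(msg.hex().upper())   # 9C507F
```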

    System Messages. System messages have no channel number and are meant for commands
that are not channel-specific, such as timing signals for synchronization, positioning
information in prerecorded MIDI sequences, and detailed setup information for the destination
device. Opcodes for all system messages start with "&HF." System messages are
divided into three classifications, according to their use.

    System Common Messages. Table 6.5 sets out these messages, which relate to timing
or positioning. Song Position is measured in beats. The messages determine what is to be
played upon receipt of a "start" real-time message (see below).

TABLE 6.5: MIDI System Common messages

System common message | Status byte | Number of data bytes
MIDI Timing Code      | &HF1        | 1
Song Position Pointer | &HF2        | 2
Song Select           | &HF3        | 1
Tune Request          | &HF6        | None
EOX (terminator)      | &HF7        | None

    System Real-Time Messages. Table 6.6 sets out system real-time messages, which are
related to synchronization.

TABLE 6.6: MIDI System Real-Time messages

System real-time message | Status byte
Timing Clock             | &HF8
Start Sequence           | &HFA
Continue Sequence        | &HFB
Stop Sequence            | &HFC
Active Sensing           | &HFE
System Reset             | &HFF

    System Exclusive Message. The final type of system message, System Exclusive messages,
is included so that manufacturers can extend the MIDI standard. After the initial
code, they can insert a stream of any specific messages that apply to their own product. A
System Exclusive message is supposed to be terminated by a terminator byte "&HF7," as
specified in Table 6.5. However, the terminator is optional, and the datastream may simply
be ended by sending the status byte of the next message.

6.2.4 General MIDI

    For MIDI music to sound more or less the same on every machine, we would at least like to
have the same patch numbers associated with the same instruments — for example, patch
1 should always be a piano, not a flugelhorn. To this end, General MIDI [5] is a scheme
for assigning instruments to patch numbers. A standard percussion map also specifies
47 percussion sounds. Where a "note" appears on a musical score determines just what
percussion element is being struck. This book's web site includes both the General MIDI
Instrument Patch Map and the Percussion Key map.
    Other requirements for General MIDI compatibility are that a MIDI device must support
all 16 channels; must be multi-timbral (i.e., each channel can play a different instrument/program);
must be polyphonic (i.e., each channel is able to play many voices); and
must have a minimum of 24 dynamically allocated voices.

    General MIDI Level 2. An extended General MIDI has recently been defined, with a
standard SMF (Standard MIDI File) format defined. A nice extension is the inclusion of extra
character information, such as karaoke lyrics, which can be displayed on a good sequencer.

6.2.5 MIDI-to-WAV Conversion

    Some programs, such as early versions of Premiere, cannot include MIDI files — instead,
they insist on WAV format files. Various shareware programs can approximate a reasonable
conversion between these formats. The programs essentially consist of large lookup files
that try to do a reasonable job of substituting predefined or shifted WAV output for some
MIDI messages, with inconsistent success.

6.3 QUANTIZATION AND TRANSMISSION OF AUDIO

    To be transmitted, sampled audio information must be digitized, and here we look at some
of the details of this process. Once the information has been quantized, it can then be
transmitted or stored. We go through a few examples in complete detail, which helps in
understanding what is being discussed.

6.3.1 Coding of Audio

    Quantization and transformation of data are collectively known as coding of the data. For
audio, the μ-law technique for companding audio signals is usually combined with a simple
algorithm that exploits the temporal redundancy present in audio signals. Differences in
signals between the present and a previous time can effectively reduce the size of signal
values and, most important, concentrate the histogram of sample values (differences, now)
into a much smaller range. The result of reducing the variance of values is that lossless
compression methods that produce a bitstream with shorter bit lengths for more likely values,
introduced in Chapter 7, fare much better and produce a greatly compressed bitstream.
    In general, producing quantized sampled output for audio is called Pulse Code Modulation,
or PCM. The differences version is called DPCM (and a crude but efficient variant is
called DM). The adaptive version is called ADPCM, and variants that take into account
speech properties follow from these. More complex models for audio are outlined in
Chapter 13.

6.3.2 Pulse Code Modulation

    PCM in General. Audio is analog — the waves we hear travel through the air to reach
our eardrums. We know that the basic techniques for creating digital signals from analog
ones consist of sampling and quantization. Sampling is invariably done uniformly — we
select a sampling rate and produce one value for each sampling time.
    In the magnitude direction, we digitize by quantization, selecting breakpoints in magnitude
and remapping any value within an interval to one representative output level. The
set of interval boundaries is sometimes called decision boundaries, and the representative
values are called reconstruction levels.
    We say that the boundaries for quantizer input intervals that will all be mapped into the
same output level form a coder mapping, and the representative values that are the output
values from a quantizer are a decoder mapping. Since we quantize, we may choose to create
either an accurate or less accurate representation of sound magnitude values. Finally, we
may wish to compress the data, by assigning a bitstream that uses fewer bits for the most
prevalent signal values.
    Every compression scheme has three stages:

1. Transformation. The input data is transformed to a new representation that is easier
   or more efficient to compress. For example, in Predictive Coding (discussed later in
   the chapter), we predict the next signal from previous ones and transmit the prediction
   error.
2. Loss. We may introduce loss of information. Quantization is the main lossy step.
   Here we use a limited number of reconstruction levels, fewer than in the original
   signal. Therefore, quantization necessitates some loss of information.
3. Coding. Here, we assign a codeword (thus forming a binary bitstream) to each output
   level or symbol. This could be a fixed-length code or a variable-length code, such as
   Huffman coding (discussed in Chapter 7).

    For audio signals, we first consider PCM, the digitization method. That enables us
to consider Lossless Predictive Coding as well as the DPCM scheme; these methods use
differential coding. We also look at the adaptive version, ADPCM, which is meant to provide
better compression.
    Pulse Code Modulation is a formal term for the sampling and quantization we have
already been using. Pulse comes from an engineer's point of view that the resulting digital
signals can be thought of as infinitely narrow vertical "pulses". As an example of PCM,
audio samples on a CD might be sampled at a rate of 44.1 kHz, with 16 bits per sample.
For stereo sound, with two channels, this amounts to a data rate of about 1,400 kbps.

    PCM in Speech Compression. Recall that in Section 6.1.6 we considered companding:
the so-called compressor and expander stages for speech signal processing, for telephony.
For this application, signals are first transformed using the μ-law (or A-law for Europe) rule
into what is essentially a logarithmic scale. Only then is PCM, using uniform quantization,
applied. The result is that finer increments in sound volume are used at the low-volume end of
speech rather than at the high-volume end, where we can't discern small changes in any event.
    Assuming a bandwidth for speech from about 50 Hz to about 10 kHz, the Nyquist rate
would dictate a sampling rate of 20 kHz. Using uniform quantization without companding,
the minimum sample size we could get away with would likely be about 12 bits. Hence,
for mono speech transmission the bitrate would be 240 kbps. With companding, we can
safely reduce the sample size to 8 bits with the same perceived level of quality and thus
reduce the bitrate to 160 kbps. However, the standard approach to telephony assumes that
the highest-frequency audio signal we want to reproduce is about 4 kHz. Therefore, the
sampling rate is only 8 kHz, and the companded bitrate thus reduces to only 64 kbps.
    We must also address two small wrinkles to get this comparatively simple form of speech
compression right. First, because only sounds up to 4 kHz are to be considered, all other
frequency content must be noise. Therefore, we should remove this high-frequency content
from the analog input signal. This is done using a band-limiting filter that blocks out high
frequencies as well as very low ones. The "band" of not-removed ("passed") frequencies
is what we wish to keep. This type of filter is therefore also called a band-pass filter.
    Second, once we arrive at a pulse signal, such as the one in Figure 6.13(a), we must
still perform digital-to-analog conversion and then construct an output analog signal.

FIGURE 6.13: Pulse code modulation (PCM): (a) original analog signal and its corresponding
PCM signals; (b) decoded staircase signal; (c) reconstructed signal after low-pass filtering.
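The compressor and expander stages used in this telephony scheme are easy to sketch. Below is the continuous μ-law rule with μ = 255 (the value used in North American telephony) and its inverse — a simplified illustration, not the full 8-bit G.711 encoder, and the function names are ours:

```python
import math

MU = 255.0

def mu_law(x: float) -> float:
    """Compress a sample x in [-1, 1] with the mu-law rule."""
    return math.copysign(math.log(1.0 + MU * abs(x)) / math.log(1.0 + MU), x)

def mu_law_inverse(y: float) -> float:
    """Expand a companded value y in [-1, 1] back to a sample."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)

# Small amplitudes are boosted before uniform quantization...
print(round(mu_law(0.01), 3))                  # about 0.23
# ...and the expander undoes the compressor.
print(round(mu_law_inverse(mu_law(0.5)), 3))   # 0.5
```

This is exactly why 8 companded bits sound as good as roughly 12 uniform bits: the low-volume end of the range gets finer quantization steps.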

    The signal we arrive at is effectively the staircase shown in Figure 6.13(b). This type of
discontinuous signal contains not just frequency components due to the original signal but,
because of the sharp corners, also a theoretically infinite set of higher-frequency components
(from the theory of Fourier analysis, in signal processing). We know these higher
frequencies are extraneous. Therefore, the output of the digital-to-analog converter is in
turn passed to a low-pass filter, which allows only frequencies up to the original maximum
to be retained. Figure 6.14 shows the complete scheme for encoding and decoding telephony
signals as a schematic. As a result of the low-pass filtering, the output becomes smoothed, as
Figure 6.13(c) shows. For simplicity, Figure 6.13 does not show the effect of companding.

FIGURE 6.14: PCM signal encoding and decoding.

    A-law or μ-law PCM coding is used in the older International Telegraph and Telephone
Consultative Committee (CCITT) standard G.711, for digital telephony. This CCITT standard
is now subsumed into standards promulgated by a newer organization, the International
Telecommunication Union (ITU).

6.3.3 Differential Coding of Audio

    Audio is often stored not in simple PCM but in a form that exploits differences. For a start,
differences will generally be smaller numbers and hence offer the possibility of using fewer
bits to store.
    An advantage of forming differences is that the histogram of a difference signal is usually
considerably more peaked than the histogram for the original signal. For example, as an
extreme case, the histogram for a linear ramp signal that has constant slope is uniform,
whereas the histogram for the derivative of the signal (i.e., the differences, from sampling
point to sampling point) consists of a spike at the slope value.
    Generally, if a time-dependent signal has some consistency over time (temporal redundancy),
the difference signal — subtracting the current sample from the previous one —
will have a more peaked histogram, with a maximum around zero. Consequently, if we then
go on to assign bitstring codewords to differences, we can assign short codes to prevalent
values and long codewords to rarely occurring ones.
    To begin with, consider a lossless version of this scheme. Loss arises when we quantize.
If we apply no quantization, we can still have compression — via the decrease in the variance
of values that occurs in differences, compared to the original signal. Chapter 7 introduces
more sophisticated versions of lossless compression methods, but it helps to see a simple
version here as well. With quantization, Predictive Coding becomes DPCM, a lossy method;
we'll also try out that scheme.

6.3.4 Lossless Predictive Coding

    Predictive coding simply means transmitting differences — we predict the next sample as
being equal to the current sample and send not the sample itself but the error involved in
making this assumption. That is, if we predict that the next sample equals the previous one,
then the error is just the difference between previous and next. Our prediction scheme could
also be more complex.
    However, we do note one problem. Suppose our integer sample values are in the range
0..255. Then differences could be as much as −255..255. So we have unfortunately
increased our dynamic range (ratio of maximum to minimum) by a factor of two: we may
well need more bits than we needed before to transmit some differences. Fortunately, we
can use a trick to get around this problem, as we shall see.
    So, basically, predictive coding consists of finding differences and transmitting them,
using a PCM system such as the one introduced in Section 6.3.2. First, note that differences
of integers will at least be integers. Let's formalize our statement of what we are doing by
defining the integer signal as the set of values f_n. Then we predict values f̂_n as simply the
previous value, and we define the error e_n as the difference between the actual and predicted
signals:

    f̂_n = f_{n−1}
    e_n = f_n − f̂_n                                        (6.12)

We certainly would like our error value e_n to be as small as possible. Therefore, we would
wish our prediction f̂_n to be as close as possible to the actual signal f_n. But for a particular
sequence of signal values, some function of a few of the previous values, f_{n−1}, f_{n−2}, f_{n−3},
etc., may provide a better prediction of f_n. Typically, a linear predictor function is used:

    f̂_n = Σ_{k=1}^{2 to 4} a_{n−k} f_{n−k}                  (6.13)

Such a predictor can be followed by a truncating or rounding operation to result in integer
values. In fact, since now we have such coefficients a_{n−k} available, we can even change
them adaptively (see Section 6.3.7 below).
    The idea of forming differences is to make the histogram of sample values more peaked.
For example, Figure 6.15(a) plots 1 second of sampled speech at 8 kHz, with magnitude
resolution of 8 bits per sample.
    A histogram of these values is centered around zero, as in Figure 6.15(b). Figure 6.15(c)
shows the histogram for corresponding speech signal differences: difference values are much
more clustered around zero than are sample values themselves. As a result, a method that

assigns short codewords to frequently occurring symbols will assign a short code to zero and
do rather well. Such a coding scheme will much more efficiently code sample differences
than samples themselves, and a similar statement applies if we use a more sophisticated
predictor than simply the previous signal value.

FIGURE 6.15: Differencing concentrates the histogram: (a) digital speech signal; (b) histogram
of digital speech signal values; (c) histogram of digital speech signal differences.

    However, we are still left with the problem of what to do if, for some reason, a particular
set of difference values does indeed consist of some exceptional large differences. A clever
solution to this difficulty involves defining two new codes to add to our list of difference
values, denoted SU and SD, standing for Shift-Up and Shift-Down. Some special
values will be reserved for them.
    Suppose samples are in the range 0..255, and differences are in −255..255. Define
SU and SD as shifts by 32. Then we could in fact produce codewords for a limited set
of signal differences, say only the range −15..16. Differences (that inherently are in the
range −255..255) lying in the limited range can be coded as is, but if we add the extra two
values for SU, SD, a value outside the range −15..16 can be transmitted as a series of
shifts, followed by a value that is indeed inside the range −15..16. For example, 100 is
transmitted as SU, SU, SU, 4, where (the codes for) SU and for 4 are what are sent.
    Lossless Predictive Coding is ... lossless! That is, the decoder produces the same signals
as the original. It is helpful to consider an explicit scheme for such coding considerations,
so let's do that here (we won't use the most complicated scheme, but we'll try to carry out an
entire calculation). As a simple example, suppose we devise a predictor for f̂_n as follows:

    f̂_n = ⌊(f_{n−1} + f_{n−2})/2⌋
    e_n = f_n − f̂_n                                        (6.14)

Then the error e_n (or a codeword for it) is what is actually transmitted.
    Let's consider an explicit example. Suppose we wish to code the sequence f1, f2, f3, f4,
f5 = 21, 22, 27, 25, 22. For the purposes of the predictor, we'll invent an extra signal value
f0, equal to f1 = 21, and first transmit this initial value, uncoded; after all, every coding
scheme has the extra expense of some header information.
    Then the first error, e1, is zero, and subsequently

    f̂2 = 21,                                 e2 = 22 − 21 = 1
    f̂3 = ⌊(f2 + f1)/2⌋ = ⌊(22 + 21)/2⌋ = 21, e3 = 27 − 21 = 6
    f̂4 = ⌊(f3 + f2)/2⌋ = ⌊(27 + 22)/2⌋ = 24, e4 = 25 − 24 = 1
    f̂5 = ⌊(f4 + f3)/2⌋ = ⌊(25 + 27)/2⌋ = 26, e5 = 22 − 26 = −4    (6.15)

The error does center around zero, we see, and coding (assigning bitstring codewords) will
be efficient. Figure 6.16 shows a typical schematic diagram used to encapsulate this type of
system. Notice that the Predictor emits the predicted value f̂_n. What is invariably (and
annoyingly) left out of such schematics is the fact that the predictor is based on f_{n−1}, f_{n−2}.
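The walk-through of Eqs. (6.14)-(6.15) is easy to verify in code. A minimal sketch (without the SU/SD shift codes, and with our own function names):

```python
def encode(samples):
    """Lossless predictive coding with the predictor of Eq. (6.14):
    prediction = floor((f[n-1] + f[n-2]) / 2), history seeded with two copies of f1."""
    f = [samples[0], samples[0]] + list(samples)
    errors = []
    for n in range(2, len(f)):
        prediction = (f[n - 1] + f[n - 2]) // 2
        errors.append(f[n] - prediction)
    return errors

def decode(first, errors):
    """Rebuild the signal; using the same invented history makes this exactly lossless."""
    f = [first, first]
    for e in errors:
        f.append((f[-1] + f[-2]) // 2 + e)
    return f[2:]

samples = [21, 22, 27, 25, 22]
errs = encode(samples)
print(errs)                       # [0, 1, 6, 1, -4]
print(decode(samples[0], errs))   # [21, 22, 27, 25, 22]
```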

Therefore, the predictor must involve a memory. At the least, the predictor includes a circuit
for incorporating a delay in the signal, to store f_{n−1}.

FIGURE 6.16: Schematic diagram for Predictive Coding: (a) encoder; (b) decoder.

6.3.5 DPCM

    Differential Pulse Code Modulation is exactly the same as Predictive Coding, except that it
incorporates a quantizer step. Quantization is as in PCM and can be uniform or nonuniform.
One scheme for analytically determining the best set of nonuniform quantizer steps is the
Lloyd-Max quantizer, named for Stuart Lloyd and Joel Max, which is based on a least-squares
minimization of the error term.
    Here we should adopt some nomenclature for signal values. We shall call the original
signal f_n, the predicted signal f̂_n, and the quantized, reconstructed signal f̃_n. How DPCM
operates is to form the prediction, form an error e_n by subtracting the prediction from the
actual signal, then quantize the error to a quantized version, ẽ_n. The equations that describe
DPCM are as follows:

    f̂_n = function_of(f̃_{n−1}, f̃_{n−2}, f̃_{n−3}, ...)
    e_n = f_n − f̂_n
    ẽ_n = Q[e_n]                                            (6.16)
    transmit codeword(ẽ_n)
    reconstruct: f̃_n = f̂_n + ẽ_n

Codewords for quantized error values ẽ_n are produced using entropy coding, such as Huffman
coding (discussed in Chapter 7).
    Notice that the predictor is always based on the reconstructed, quantized version of the
signal: the reason for this is that then the encoder side is not using any information not
available to the decoder side. Generally, if by mistake we made use of the actual signals
f_n in the predictor instead of the reconstructed ones f̃_n, quantization error would tend to
accumulate and could get worse rather than being centered on zero.
    The main effect of the coder-decoder process is to produce reconstructed, quantized signal
values f̃_n = f̂_n + ẽ_n. The "distortion" is the average squared error [Σ_{n=1}^{N} (f̃_n − f_n)²]/N,
and one often sees diagrams of distortion versus the number of bit levels used. A Lloyd-Max
quantizer will do better (have less distortion) than a uniform quantizer.
    For any signal, we want to choose the size of quantization steps so that they correspond
to the range (the maximum and minimum) of the signal. Even using a uniform, equal-step
quantization will naturally do better if we follow such a practice. For speech, we could
modify quantization steps as we go, by estimating the mean and variance of a patch of
signal values and shifting quantization steps accordingly, for every block of signal values.
That is, starting at time i we could take a block of N values f_n and try to minimize the
quantization error:

    min Σ_{n=i}^{i+N−1} (f_n − Q[f_n])²                     (6.17)

    Since signal differences are very peaked, we could model them using a Laplacian probability
distribution function, which is also strongly peaked at zero [6]: it looks like
l(x) = (1/√(2σ²)) exp(−√2|x|/σ), for variance σ². So typically, we assign quantization steps for
a quantizer with nonuniform steps by assuming that signal differences, d_n, say, are drawn
from such a distribution and then choosing steps to minimize

    min Σ_{n=i}^{i+N−1} (d_n − Q[d_n])² l(d_n)              (6.18)

This is a least-squares problem and can be solved iteratively using the Lloyd-Max quantizer.
    Figure 6.17 shows a schematic diagram for the DPCM coder and decoder. As is common
in such diagrams, several interesting features are more or less not indicated. First, we notice
that the predictor makes use of the reconstructed, quantized signal values f̃_n, not actual
signal values f_n — that is, the encoder simulates the decoder in the predictor path. The
quantizer can be uniform or non-uniform.
    The box labeled "Symbol coder" in the block diagram simply means a Huffman coder —
the details of this step are set out in Chapter 7. The prediction value f̂_n is based on however
much history the prediction scheme requires: we need to buffer previous values of f̃ to
form the prediction. Notice that the quantization noise, f_n − f̃_n, is equal to the quantization
effect on the error term, e_n − ẽ_n.
    It helps us explicitly understand the process of coding to look at actual numbers. Suppose
we adopt a particular predictor as follows:

    f̂_n = trunc[(f̃_{n−1} + f̃_{n−2})/2]
    so that e_n = f_n − f̂_n is an integer.                  (6.19)
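This DPCM coder/decoder pair can be sketched in a few lines of Python, combining the predictor of Eq. (6.19) with the particular 32-level uniform quantizer the text defines next in Eq. (6.20). The function names are ours; note the decoder-side reconstruction is computed inside the encoder loop, exactly as the schematic requires:

```python
def dpcm_encode_decode(samples):
    """DPCM with predictor trunc((g[n-1] + g[n-2]) / 2), where g is the
    reconstructed signal, and the quantizer of Eq. (6.20).
    Returns (quantized errors, reconstructed values after the exact first sample)."""
    def quantize(e):
        # e is in -255..255, so 255 + e >= 0 and // matches trunc here.
        return 16 * ((255 + e) // 16) - 256 + 8
    recon = [samples[0], samples[0]]   # prepended values, so the first value is exact
    coded = []
    for f in samples[1:]:
        prediction = (recon[-1] + recon[-2]) // 2   # predictor sees only recon values
        e_q = quantize(f - prediction)
        coded.append(e_q)
        recon.append(prediction + e_q)
    return coded, recon[2:]

coded, recon = dpcm_encode_decode([130, 150, 140, 200, 230])
print(coded)   # [24, -8, 56, 56]
print(recon)   # [154, 134, 200, 223]
```

Feeding the actual samples f_n into the predictor instead of `recon` would reproduce the error-accumulation problem described above.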

FIGURE 6.17: Schematic diagram for DPCM: (a) encoder; (b) decoder.

    Let us use the particular quantization scheme

    ẽ_n = Q[e_n] = 16 * trunc[(255 + e_n)/16] − 256 + 8
    f̃_n = f̂_n + ẽ_n                                        (6.20)

    First, we note that the error is in the range −255..255 — that is, 511 levels are possible
for the error term. The quantizer takes the simple course of dividing the error range into
32 patches of about 16 levels each. It also makes the representative reconstructed value for
each patch equal to the midway point for each group of 16 levels.
    Table 6.7 gives output values for any of the input codes: 4-bit codes are mapped to 32
reconstruction levels in a staircase fashion. (Notice that the final range includes only 15
levels, not 16.)

TABLE 6.7: DPCM quantizer reconstruction levels

e_n in range | Quantized to value
−255..−240   | −248
−239..−224   | −232
...          | ...
−31..−16     | −24
−15..0       | −8
1..16        | 8
17..32       | 24
...          | ...
225..240     | 232
241..255     | 248

    As an example stream of signal values, consider the set of values

    f1  f2  f3  f4  f5
    130 150 140 200 230

We prepend extra values f̃ = 130 in the datastream that replicate the first value, f1,
and initialize with quantized error ẽ1 = 0, so that we ensure the first reconstructed value is
exact: f̃1 = 130. Then subsequent values calculated are as follows (with the prepended
values shown in brackets):

    f̂ = [130], 130, 142, 144, 167
    e = [0],   20, −2, 56, 63
    ẽ = [0],   24, −8, 56, 56
    f̃ = [130], 154, 134, 200, 223

On the decoder side, we again assume extra values f̃ equal to the correct value f1, so that
the first reconstructed value f̃1 is correct. What is received is ẽ_n, and the reconstructed f̃_n is
identical to the one on the encoder side, provided we use exactly the same prediction rule.

6.3.6 DM

    DM stands for Delta Modulation, a much-simplified version of DPCM often used as a quick
analog-to-digital converter. We include this scheme here for completeness.

    Uniform-Delta DM. The idea in DM is to use only a single quantized error value,
either positive or negative. Such a 1-bit coder thus produces coded output that follows the
original signal in a staircase fashion. The relevant set of equations is as follows:

    f̂_n = f̃_{n−1}
    e_n = f_n − f̂_n = f_n − f̃_{n−1}
    ẽ_n = +k if e_n > 0, where k is a constant;
          −k otherwise                                      (6.21)
    f̃_n = f̂_n + ẽ_n

Note that the prediction simply involves a delay.
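The 1-bit coder of Eq. (6.21) takes only a few lines to sketch (the function name is ours):

```python
def delta_modulate(samples, k, first):
    """Delta modulation: emit +k/-k steps and the staircase reconstruction."""
    recon = [first]                 # exact first reconstructed value
    steps = []
    for f in samples[1:]:
        e = f - recon[-1]           # e_n = f_n - reconstruction of f_{n-1}
        step = k if e > 0 else -k
        steps.append(step)
        recon.append(recon[-1] + step)
    return steps, recon

steps, recon = delta_modulate([10, 11, 13, 15], k=4, first=10)
print(recon)    # [10, 14, 10, 14]
```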

Again, let's consider actual numbers. Suppose signal values are as follows:

    f1   f2   f3   f4
    10   11   13   15

We also define an exact reconstructed value f̃1 = f1 = 10.
Suppose we use a step value k = 4. Then we arrive at the following values:

    f̂2 = 10,  e2 = 11 − 10 = 1,   ê2 = 4,   f̃2 = 10 + 4 = 14
    f̂3 = 14,  e3 = 13 − 14 = −1,  ê3 = −4,  f̃3 = 14 − 4 = 10
    f̂4 = 10,  e4 = 15 − 10 = 5,   ê4 = 4,   f̃4 = 10 + 4 = 14

We see that the reconstructed set of values 10, 14, 10, 14 never strays far from the correct
set 10, 11, 13, 15.
Nevertheless, it is not difficult to discover that DM copes well with more or less constant
signals, but not as well with rapidly changing signals. One approach to mitigating this
problem is to simply increase the sampling, perhaps to many times the Nyquist rate. This
scheme can work well and makes DM a very simple yet effective analog-to-digital converter.

Adaptive DM. However, if the slope of the actual signal curve is high, the staircase
approximation cannot keep up. A straightforward approach to dealing with a steep curve
is to simply change the step size k adaptively — that is, in response to the signal's current
properties.

6.3.7 ADPCM

Adaptive DPCM takes the idea of adapting the coder to suit the input much further. Basically,
two pieces make up a DPCM coder: the quantizer and the predictor. Above, in Adaptive DM,
we adapted the quantizer step size to suit the input. In DPCM, we can adaptively modify
the quantizer, by changing the step size as well as decision boundaries in a nonuniform
quantizer.
We can carry this out in two ways: using the properties of the input signal (called forward
adaptive quantization), or the properties of the quantized output. For if quantized errors
become too large, we should change the nonuniform Lloyd-Max quantizer (this is called
backward adaptive quantization).
We can also adapt the predictor, again using forward or backward adaptation. Generally,
making the predictor coefficients adaptive is called Adaptive Predictive Coding (APC). It is
interesting to see how this is done. Recall that the predictor is usually taken to be a linear
function of previously reconstructed quantized values, f̃_n. The number of previous values
used is called the order of the predictor. For example, if we use M previous values, we need
M coefficients a_i, i = 1 .. M in a predictor

    f̂_n = Σ_{i=1}^{M} a_i f̃_{n−i}        (6.22)

However, we can get into a difficult situation if we try to change the prediction coefficients
that multiply previous quantized values, because that makes a complicated set of equations
to solve for these coefficients. Suppose we decide to use a least-squares approach to solving
a minimization, trying to find the best values of the a_i:

    min Σ_{n=1}^{N} (f_n − f̂_n)²        (6.23)

where here we would sum over a large number of samples f_n for the current patch of speech,
say. But because f̂_n depends on the quantization, we have a difficult problem to solve. Also,
we should really be changing the fineness of the quantization at the same time, to suit the
signal's changing nature; this makes things problematical.
Instead, we usually resort to solving the simpler problem that results from using not f̃_n in
the prediction but simply the signal f_n itself. This is indeed simply solved, since, explicitly
writing in terms of the coefficients a_i, we wish to solve

    min Σ_{n=1}^{N} (f_n − Σ_{i=1}^{M} a_i f_{n−i})²        (6.24)

Differentiation with respect to each of the a_i and setting to zero produces a linear system of
M equations that is easy to solve. (The set of equations is called the Wiener-Hopf equations.)
Thus we indeed find a simple way to adaptively change the predictor as we go. For
speech signals, it is common to consider blocks of signal values, just as for image coding,
and adaptively change the predictor, quantizer, or both. If we sample at 8 kHz, a common
block size is 128 samples — 16 msec of speech. Figure 6.18 shows a schematic diagram
for the ADPCM coder and decoder [7].

6.4 FURTHER EXPLORATION

Fascinating work is ongoing in the use of audio to help sight-impaired persons. One technique
is presenting HTML structure by means of audio cues, using creative thinking as in
the papers [8, 9, 10].
An excellent resource for digitization and SNR, SQNR, and so on is the book by Pohlmann
[2]. The audio quantization μ-law is described in the Chapter 6 web page in the Further
Exploration section of the text web site. Other useful links included are

• An excellent discussion of the use of FM to create synthetic sound

• An extensive list of audio file formats

• A good description of various CD audio file formats, which are somewhat different.
  The main music format is called red book audio.

• A General MIDI Instrument Patch Map, along with a General MIDI Percussion Key
  Map

• A link to a good tutorial on MIDI and wave-table music synthesis

• A link to a Java program for decoding MIDI streams
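Returning for a moment to the predictor adaptation of Section 6.3.7: the minimization of Eq. (6.24) reduces to a small linear system in the a_i. Here is a hedged Python sketch (NumPy; `fit_predictor` is our own name, and this is a plain least-squares toy on one block, not the adaptation rule of any particular ADPCM standard).

```python
import numpy as np

def fit_predictor(signal, M=2):
    """Least-squares fit of predictor coefficients a_i in Eq. (6.24):
    minimize sum_n (f_n - sum_i a_i f_{n-i})^2 over a block of samples.
    np.linalg.lstsq solves the resulting (Wiener-Hopf) normal equations."""
    f = np.asarray(signal, dtype=float)
    # Row n of X holds the M previous samples [f_{n-1}, ..., f_{n-M}].
    X = np.column_stack([f[M - i : len(f) - i] for i in range(1, M + 1)])
    y = f[M:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# A signal that exactly obeys f_n = 1.5 f_{n-1} - 0.7 f_{n-2} is predicted
# perfectly, so the fit recovers those coefficients.
f = [1.0, 2.0]
for _ in range(30):
    f.append(1.5 * f[-1] - 0.7 * f[-2])
print(fit_predictor(f, M=2))   # approximately [1.5, -0.7]
```

In practice the fit would be redone for each block (e.g., every 128 samples at 8 kHz), which is exactly the "adaptively change the predictor as we go" idea above.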

[Figure 6.18 here: (a) an encoder block taking 64 kbps A-law or μ-law PCM input to a
32 kbps stream; (b) a decoder block taking the 32 kbps stream back to 64 kbps A-law or
μ-law PCM output.]
FIGURE 6.18: Schematic diagram for: (a) ADPCM encoder; (b) decoder.

• A good multimedia/sound page, including a source for locating Internet sound/music
  materials

• A performing-arts-oriented site that is an excellent all-around resource on sound generally,
  including a great deal of information on definitions of terms, signal processing,
  and sound perception

6.5 EXERCISES

1. My old SoundBlaster card is an 8-bit card.

   (a) What is it 8 bits of?
   (b) What is the best SQNR it can achieve?

2. If a set of ear protectors reduces the noise level by 30 dB, how much do they reduce
   the intensity (the power)?

3. A loss of audio output at both ends of the audible frequency range is inevitable, due
   to the frequency response function of an audio amplifier and the medium (e.g., tape).

   (a) If the output was 1 volt for frequencies at midrange, what is the output voltage
       after a loss of 3 dB at 15 kHz?
   (b) To compensate for the loss, a listener can adjust the gain (and hence the output)
       on an equalizer at different frequencies. If the loss remains −3 dB and a gain
       through the equalizer is 6 dB at 15 kHz, what is the output voltage now? Hint:
       Assume log10 2 = 0.3.

4. Suppose the sampling frequency is 1.5 times the true frequency. What is the alias
   frequency?

5. In a crowded room, we can still pick out and understand a nearby speaker's voice,
   notwithstanding the fact that general noise levels may be high. This is known as
   the cocktail-party effect. The way it operates is that our hearing can localize a sound
   source by taking advantage of the difference in phase between the two signals entering
   our left and right ears (binaural auditory perception). In mono, we could not hear
   our neighbor's conversation well if the noise level were at all high. State how you
   think a karaoke machine works. Hint: The mix for commercial music recordings is
   such that the "pan" parameter is different going to the left and right channels for each
   instrument. That is, for an instrument, either the left or right channel is emphasized.
   How would the singer's track timing have to be recorded to make it easy to subtract
   the sound of the singer (which is typically done)?

6. The dynamic range of a signal V is the ratio of the maximum to the minimum absolute
   value, expressed in decibels. The dynamic range expected in a signal is to some extent
   an expression of the signal quality. It also dictates the number of bits per sample needed
   to reduce the quantization noise to an acceptable level. For example, we may want to
   reduce the noise to at least an order of magnitude below Vmin. Suppose the dynamic
   range for a signal is 60 dB. Can we use 10 bits for this signal? Can we use 16 bits?

7. Suppose the dynamic range of speech in telephony implies a ratio Vmax/Vmin of about
   256. Using uniform quantization, how many bits should we use to encode speech to
   make the quantization noise at least an order of magnitude less than the smallest
   detectable telephonic sound?

8. Perceptual nonuniformity is a general term for describing the nonlinearity of human
   perception. That is, when a certain parameter of an audio signal varies, humans do
   not necessarily perceive the difference in proportion to the amount of change.

   (a) Briefly describe at least two types of perceptual nonuniformities in human
       auditory perception.
   (b) Which one of them does A-law (or μ-law) attempt to approximate? Why could
       it improve quantization?

9. Draw a diagram showing a sinusoid at 5.5 kHz and sampling at 8 kHz (show eight
   intervals between samples in your plot). Draw the alias at 2.5 kHz and show that in
   the eight sample intervals, exactly 5.5 cycles of the true signal fit into 2.5 cycles of
   the alias signal.

10. Suppose a signal contains tones at 1, 10, and 21 kHz and is sampled at the rate 12 kHz
    (and then processed with an antialiasing filter limiting output to 6 kHz). What tones
    are included in the output? Hint: Most of the output consists of aliasing.

11. (a) Can a single MIDI message produce more than one note sounding?
    (b) Is it possible for more than one note to sound at once on a particular instrument?
        If so, how is it done in MIDI?
    (c) Is the Program Change MIDI message a Channel Message? What does this
        message accomplish? Based on the Program Change message, how many different
        instruments are there in General MIDI? Why?
    (d) In general, what are the two main kinds of MIDI messages? In terms of data,
        what is the main difference between the two types of messages? Within those
        two categories, list the different subtypes.

12. (a) Give an example (in English, not hex) of a MIDI voice message.
    (b) Describe the parts of the "assembler" statement for the message.
    (c) What does a Program Change message do? Suppose Program change is hex
        "&HC1". What does the instruction "&HC103" do?

[Figure 6.19 here: two plots of signal amplitude (0–250) against time.]
FIGURE 6.19: (a) DPCM reconstructed signal (dotted line) tracks the input signal (solid
line); (b) DPCM reconstructed signal (dashed line) steers farther and farther from the input
signal (solid line).
13. In PCM, what is the delay, assuming 8 kHz sampling? Generally, delay is the time
    penalty associated with any algorithm due to sampling, processing, and analysis.

14. (a) Suppose we use a predictor as follows:

            f̂_n = trunc[(f̃_{n−1} + f̃_{n−2})/2]
            e_n = f_n − f̂_n        (6.25)

        Also, suppose we adopt the quantizer Equation (6.20). If the input signal has
        values as follows:

            20 38 56 74 92 110 128 146 164 182 200 218 236 254

        show that the output from a DPCM coder (without entropy coding) is as follows:

            20 44 56 74 89 105 121 153 161 181 195 212 243 251

        Figure 6.19(a) shows how the quantized reconstructed signal tracks the input
        signal. As a programming project, write a small piece of code to verify your
        results.

    (b) Suppose by mistake on the coder side we inadvertently use the predictor for
        lossless coding, Equation (6.14), using original values f_n instead of quantized
        ones, f̃_n. Show that on the decoder side we end up with reconstructed signal
        values as follows:

            20 44 56 74 89 105 121 137 153 169 185 201 217 233

        so that the error gets progressively worse.
        Figure 6.19(b) shows how this appears: the reconstructed signal gets progressively
        worse. Modify your code from above to verify this statement.

6.6 REFERENCES

1  B. Truax, Handbook for Acoustic Ecology, 2nd ed., Burnaby, BC, Canada: Cambridge Street
   Publishing, 1999.
2  K.C. Pohlmann, Principles of Digital Audio, 4th ed., New York: McGraw-Hill, 2000.
3  J.H. McClellan, R.W. Schafer, and M.A. Yoder, DSP First: A Multimedia Approach, Upper
   Saddle River, NJ: Prentice-Hall PTR, 1998.
4  J. Heckroth, Tutorial on MIDI and Music Synthesis, La Habra, CA: The MIDI Manufacturers
   Association, 1995, www.harmony-central.com/MIDI/Doc/tutorial.html.
5  P.K. Andleigh and K. Thakrar, Multimedia Systems Design, Upper Saddle River, NJ: Prentice
   Hall PTR, 1984.
6  K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco: Morgan Kaufmann,
   2000.
7  Roger L. Freeman, Reference Manual for Telecommunications Engineering, 2nd ed., New
   York: Wiley, 1997.
8  M.M. Blattner, D.A. Sumikawa, and R. Greenberg, "Earcons and Icons: Their Structure and
   Common Design Principles," Human-Computer Interaction, 4: 11–44, 1989.
9  M.M. Blattner, "Multimedia Interfaces: Designing for Diversity," Multimedia Tools and
   Applications, 3: 87–122, 1996.
10 W.W. Gaver and R. Smith, "Auditory Icons in Large-Scale Collaborative Environments," in
   Readings in Human-Computer Interaction: Toward the Year 2000, ed. R. Baecker, J. Grudin,
   W. Buxton, and S. Greenberg, San Francisco: Morgan-Kaufman, 1990, pp. 564–569.
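Returning to Exercise 14: here is one possible sketch, in Python, of the verification code it asks for (our own naming). It assumes, as the exercise states, the Eq. (6.25) predictor and the Eq. (6.20) quantizer, with the first value transmitted uncompressed and f̃1 replicated to start the predictor.

```python
def dpcm(signal):
    """DPCM coder/decoder loop for Exercise 14(a): the predictor of Eq. (6.25),
    fhat_n = trunc[(ftilde_{n-1} + ftilde_{n-2})/2], with the quantizer of
    Eq. (6.20), ehat_n = 16*trunc[(255 + e_n)/16] - 256 + 8.
    Returns the reconstructed (output) values."""
    recon = [signal[0]]                       # first value transmitted exactly
    for n in range(1, len(signal)):
        prev1 = recon[n - 1]
        prev2 = recon[n - 2] if n >= 2 else recon[0]   # replicate the start
        f_hat = (prev1 + prev2) // 2          # trunc of the average (values >= 0)
        e = signal[n] - f_hat
        e_hat = 16 * ((255 + e) // 16) - 256 + 8       # Eq. (6.20)
        recon.append(f_hat + e_hat)
    return recon

sig = [20, 38, 56, 74, 92, 110, 128, 146, 164, 182, 200, 218, 236, 254]
print(dpcm(sig))
# -> [20, 44, 56, 74, 89, 105, 121, 153, 161, 181, 195, 212, 243, 251]
```

For part (b), replacing `recon` by the original `signal` values inside the predictor reproduces the mismatch the exercise describes.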
PART TWO

MULTIMEDIA DATA
COMPRESSION
Chapter 7 Lossless Compression Algorithms 167
Chapter 8 Lossy Compression Algorithms 199
Chapter 9 Image Compression Standards 253
Chapter 10 Basic Video Compression Techniques 288
Chapter 11 MPEG Video Coding I — MPEG-1 and 2 312
Chapter 12 MPEG Video Coding II — MPEG-4, 7, and Beyond 332
Chapter 13 Basic Audio Compression Techniques 374
Chapter 14 MPEG Audio Compression 395

In this part, we examine the role played by data compression, perhaps the most important
enabling technology that makes modern multimedia systems possible.
We start off in Chapter 7 looking at lossless data compression — that is, involving no
distortion of the original signal once it is decompressed or reconstituted. So much data
exists, in archives and elsewhere, that it has become critical to compress this information.
Lossless compression is one way to proceed.
For example, suppose we decide to spend our savings on a whole-body MRI scan, looking
for trouble. Then we certainly want this costly medical information to remain pristine, with
no loss of information. This example of volume data forms a simply huge dataset, but we
can't afford to lose any of it, so we'd best use lossless compression. WinZip, for example,
is a ubiquitous tool that uses lossless compression.
Another good example is archival storage of precious artworks. Here, we may go to the
trouble of imaging an Old Master's painting using a high-powered camera mounted on a
dolly, to avoid parallax. Certainly we do not wish to lose any of this hard-won information,
so again we'll use lossless compression.
On the other hand, when it comes to home movies, we're more willing to lose some
information. If we have a choice between losing some information anyway, because our
PC cannot handle all the data we want to push through it, or losing some information on
purpose, using a lossy compression method, we'll choose the latter. Nowadays, almost all


video you see is compressed in some way, and the compression used is mostly lossy. Almost
every image on the web is in the standard JPEG format, which is usually lossy.
So in Chapter 8 we go on to look at lossy methods of compression, mainly focusing on the
Discrete Cosine Transform and the Discrete Wavelet Transform. The major application of
these important methods is in the set of JPEG still image compression standards, including
JPEG2000, examined in Chapter 9.
We then go on to look at how data compression methods can be applied to moving images
— videos. We start with basic video compression techniques in Chapter 10. We examine the
ideas behind the MPEG standard, starting with MPEG-1 and 2 in Chapter 11 and MPEG-4,
7, and beyond in Chapter 12. Audio compression in a sense stands by itself, and we consider
some basic audio compression techniques in Chapter 13, while in Chapter 14 we look at
MPEG Audio, including MP3.

C H A P T E R  7

Lossless Compression Algorithms
7.1 INTRODUCTION
The emergence of multimedia technologies has made digital libraries a reality. Nowadays,
libraries, museums, film studios, and governments are converting more and more data and
archives into digital form. Some of the data (e.g., precious books and paintings) indeed
need to be stored without any loss.
As a start, suppose we want to encode the call numbers of the 120 million or so items
in the Library of Congress (a mere 20 million, if we consider just books). Why don't we
just transmit each item as a 27-bit number, giving each item a unique binary code (since
2^27 > 120,000,000)?
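That 27-bit figure is quick to check (a throwaway Python calculation):

```python
import math

items = 120_000_000
bits = math.ceil(math.log2(items))   # smallest b with 2**b >= 120,000,000
print(bits, 2 ** bits)               # -> 27 134217728
```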
The main problem is that this "great idea" requires too many bits. And in fact there exist
many coding techniques that will effectively reduce the total number of bits needed to represent
the above information. The process involved is generally referred to as compression
[1, 2].
In Chapter 6, we had a beginning look at compression schemes aimed at audio. There, we
had to first consider the complexity of transforming analog signals to digital ones, whereas
here, we shall consider that we at least start with digital signals. For example, even though
we know how an image is captured using analog signals, the file produced by a digital camera
is indeed digital. The more general problem of coding (compressing) a set of any symbols,
not just byte values, say, has been studied for a long time.
Getting back to our Library of Congress problem, it is well known that certain parts of
call numbers appear more frequently than others, so it would be more economical to assign
fewer bits as their codes. This is known as variable-length coding (VLC) — the more
frequently appearing symbols are coded with fewer bits per symbol, and vice versa. As a
result, fewer bits are usually needed to represent the whole collection.
In this chapter we study the basics of information theory and several popular lossless
compression techniques. Figure 7.1 depicts a general data compression scheme, in which
compression is performed by an encoder and decompression is performed by a decoder.
We call the output of the encoder codes or codewords. The intermediate medium could
either be data storage or a communication/computer network. If the compression and
decompression processes induce no information loss, the compression scheme is lossless;
otherwise, it is lossy. The next several chapters deal with lossy compression algorithms as
they are commonly used for image, video, and audio compression. Here, we concentrate
on lossless compression.

[Figure 7.1 here: input data → encoder (compression) → storage or networks → decoder
(decompression) → output data.]
FIGURE 7.1: A general data compression scheme.

If the total number of bits required to represent the data before compression is B0 and the
total number of bits required to represent the data after compression is B1, then we define
the compression ratio as

    compression ratio = B0 / B1        (7.1)

In general, we would desire any codec (encoder/decoder scheme) to have a compression
ratio much larger than 1.0. The higher the compression ratio, the better the lossless
compression scheme, as long as it is computationally feasible.

7.2 BASICS OF INFORMATION THEORY

According to the famous scientist Claude E. Shannon, of Bell Labs [3, 4], the entropy η of
an information source with alphabet S = {s1, s2, ..., sn} is defined as:

    η = H(S) = Σ_{i=1}^{n} p_i log2 (1/p_i)        (7.2)

             = − Σ_{i=1}^{n} p_i log2 p_i        (7.3)

where p_i is the probability that symbol s_i in S will occur.
The term log2 (1/p_i) indicates the amount of information (the so-called self-information
defined by Shannon [3]) contained in s_i, which corresponds to the number of bits¹ needed
to encode s_i. For example, if the probability of having the character n in a manuscript is
1/32, the amount of information associated with receiving this character is 5 bits. In other
words, a character string nnn will require 15 bits to code. This is the basis for possible data
reduction in text compression, since it will lead to character coding schemes different from
the ASCII representation, in which each character is always represented with 8 bits.
What is entropy? In science, entropy is a measure of the disorder of a system — the
more entropy, the more disorder. Typically, we add negative entropy to a system when we
impart more order to it. For example, suppose we sort a deck of cards. (Think of a bubble
sort for the deck — perhaps this is not the usual way you actually sort cards, though.) For
every decision to swap or not, we impart 1 bit of information to the card system and transfer
1 bit of negative entropy to the card deck.

¹Since we have chosen 2 as the base for logarithms in the above definition, the unit of information is bits —
naturally also most appropriate for the binary code representation used in digital computers. If the log base is 10,
the unit is the hartley; if the base is e, the unit is the nat.

The definition of entropy includes the idea that two decisions means the transfer of twice
the negative entropy in its use of the log base 2. A two-bit vector can have 2² states, and the
logarithm takes this value into 2 bits of negative entropy. Twice as many sorting decisions
impart twice the entropy change.
Now suppose we wish to communicate those swapping decisions, via a network, say.
Then for our two decisions we'd have to send 2 bits. If we had a two-decision system, then
of course the average number of bits for all such communications would also be 2 bits. If
we like, we can think of the possible number of states in our 2-bit system as four outcomes.
Each outcome has probability 1/4. So on average, the number of bits to send per outcome
is 4 × (1/4) × log2(1/(1/4)) = 2 bits — no surprise here. To communicate (transmit) the
results of our two decisions, we would need to transmit 2 bits.
But if the probability for one of the outcomes were higher than the others, the average
number of bits we'd send would be different. (This situation might occur if the deck
were already partially ordered, so that the probability of a not-swap were higher than for
a swap.) Suppose the probability of one of our four states were 1/2, and the other three
states each had probability 1/6 of occurring. To extend our modeling of how many bits
to send on average, we need to go to noninteger powers of 2 for probabilities. Then we
can use a logarithm to ask how many (float) bits of information must be sent to transmit
the information content. Equation (7.3) says that in this case, we'd have to send just
(1/2) × log2(2) + 3 × (1/6) × log2(6) = 1.7925 bits, a value less than 2. This reflects
the idea that if we could somehow encode our four states, such that the most-occurring one
means fewer bits to send, we'd do better (fewer bits) on average.
The definition of entropy is aimed at identifying often-occurring symbols in the datastream
as good candidates for short codewords in the compressed bitstream. As described
earlier, we use a variable-length coding scheme for entropy coding — frequently occurring
symbols are given codes that are quickly transmitted, while infrequently occurring ones are
given longer codes. For example, E occurs frequently in English, so we should give it a
shorter code than Q, say.
This aspect of "surprise" in receiving an infrequent symbol in the datastream is reflected
in the definition (7.3). For if a symbol occurs rarely, its probability p_i is low (e.g., 1/100),
and thus its logarithm is a large negative number. This reflects the fact that it takes a longer
bitstring to encode it. The probabilities p_i sitting outside the logarithm in Eq. (7.3) say that
over a long stream, the symbols come by with an average frequency equal to the probability
of their occurrence. This weighting should multiply the long or short information content
given by the element of "surprise" in seeing a particular symbol.
As another concrete example, if the information source S is a gray-level digital image,
each s_i is a gray-level intensity ranging from 0 to (2^k − 1), where k is the number of bits
used to represent each pixel in an uncompressed image. The range is often [0, 255], since
8 bits are typically used: this makes a convenient one byte per pixel. The image histogram
(as discussed in Chapter 3) is a way of calculating the probability p_i of having pixels with
gray-level intensity i in the image.
One wrinkle in the algorithm implied by Eq. (7.3) is that if a symbol occurs with zero
frequency, we simply don't count it into the entropy: we cannot take a log of zero.
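These by-hand figures are easy to confirm in code. A minimal Python sketch of Eq. (7.3) (`entropy` is our own helper name):

```python
import math

def entropy(probs):
    """Shannon entropy, Eq. (7.3), in bits.  Zero-probability symbols are
    skipped, since log 0 is undefined (the wrinkle noted in the text)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([1/2, 1/6, 1/6, 1/6]), 4))   # -> 1.7925 (the card-deck example)
print(entropy([1/256] * 256))                    # -> 8.0 (flat gray-level histogram)
```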

[Figure 7.2 here: two histograms over gray levels 0–255; (a) is flat at height 1/256, (b) is
concentrated at one dark and one bright intensity.]
FIGURE 7.2: Histograms for two gray-level images.

Figure 7.2(a) shows the histogram of an image with uniform distribution of gray-level
intensities — that is, ∀i p_i = 1/256. Hence, the entropy of this image is

    η = Σ_{i=0}^{255} (1/256) · log2 256 = 8        (7.4)

As can be seen in Eq. (7.3), the entropy η is a weighted sum of terms log2 (1/p_i); hence it
represents the average amount of information contained per symbol in the source S. For
a memoryless source² S, the entropy η represents the minimum average number of bits
required to represent each symbol in S. In other words, it specifies the lower bound for the
average number of bits to code each symbol in S.
If we use l̄ to denote the average length (measured in bits) of the codewords produced
by the encoder, the Shannon Coding Theorem states that the entropy is the best we can do
(under certain conditions):

    η ≤ l̄        (7.5)

Coding schemes aim to get as close as possible to this theoretical lower bound.
It is interesting to observe that in the above uniform-distribution example we found that
η = 8 — the minimum average number of bits to represent each gray-level intensity is at
least 8. No compression is possible for this image! In the context of imaging, this will
correspond to the "worst case," where neighboring pixel values have no similarity.
Figure 7.2(b) shows the histogram of another image, in which 1/3 of the pixels are rather
dark and 2/3 of them are rather bright. The entropy of this image is

    η = (1/3) · log2 3 + (2/3) · log2 (3/2)
      = 0.33 × 1.59 + 0.67 × 0.59 = 0.52 + 0.40 = 0.92

In general, the entropy is greater when the probability distribution is flat and smaller when
it is more peaked.

²An information source that is independently distributed, meaning that the value of the current symbol does
not depend on the values of the previously appeared symbols.

7.3 RUN-LENGTH CODING

Instead of assuming a memoryless source, run-length coding (RLC) exploits memory present
in the information source. It is one of the simplest forms of data compression. The basic
idea is that if the information source we wish to compress has the property that symbols
tend to form continuous groups, instead of coding each symbol in the group individually,
we can code one such symbol and the length of the group.
As an example, consider a bilevel image (one with only 1-bit black and white pixels)
with monotone regions. This information source can be efficiently coded using run-length
coding. In fact, since there are only two symbols, we do not even need to code any symbol
at the start of each run. Instead, we can assume that the starting run is always of a particular
color (either black or white) and simply code the length of each run.
The above description is the one-dimensional run-length coding algorithm. A two-dimensional
variant of it is usually used to code bilevel images. This algorithm uses the
coded run information in the previous row of the image to code the run in the current row.
A full description of this algorithm can be found in [5].

7.4 VARIABLE-LENGTH CODING (VLC)

Since the entropy indicates the information content in an information source S, it leads to
a family of coding methods commonly known as entropy coding methods. As described
earlier, variable-length coding (VLC) is one of the best-known such methods. Here, we
will study the Shannon–Fano algorithm, Huffman coding, and adaptive Huffman coding.

7.4.1 Shannon–Fano Algorithm

The Shannon–Fano algorithm was independently developed by Shannon at Bell Labs and
Robert Fano at MIT [6]. To illustrate the algorithm, let's suppose the symbols to be coded
are the characters in the word HELLO. The frequency count of the symbols is

    Symbol    H  E  L  O
    Count     1  1  2  1

The encoding steps of the Shannon–Fano algorithm can be presented in the following
top-down manner:

1. Sort the symbols according to the frequency count of their occurrences.

2. Recursively divide the symbols into two parts, each with approximately the same
   number of counts, until all parts contain only one symbol.

A natural way of implementing the above procedure is to build a binary tree. As a
convention, let's assign bit 0 to its left branches and 1 to the right branches.
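The two steps translate almost directly into code. Here is a Python sketch (our own; because the division point is not unique, it may produce either of the two equally valid codes discussed in the text, but the total for HELLO is always 10 bits):

```python
def shannon_fano(counts):
    """Recursive Shannon-Fano coding: sort symbols by count, then repeatedly
    split into two parts of roughly equal total count; the first part gets
    bit 0 appended, the second bit 1."""
    syms = sorted(counts, key=counts.get, reverse=True)
    codes = {s: "" for s in syms}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(counts[s] for s in group)
        running = 0
        for i, s in enumerate(group, start=1):
            running += counts[s]
            if 2 * running >= total:       # first part holds ~half the counts
                break
        for s in group[:i]:
            codes[s] += "0"
        for s in group[i:]:
            codes[s] += "1"
        split(group[:i])
        split(group[i:])

    split(syms)
    return codes

freq = {"H": 1, "E": 1, "L": 2, "O": 1}
codes = shannon_fano(freq)
print(codes)   # one valid outcome, e.g. {'L': '00', 'H': '01', 'E': '10', 'O': '11'}
print(sum(freq[s] * len(c) for s, c in codes.items()))   # -> 10 bits for HELLO
```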

[Figure 7.3 here: tree built by successive divisions — (a) the root (5) splits into L:(2) and
H,E,O:(3); (b) H,E,O:(3) splits into H:(1) and E,O:(2); (c) E,O:(2) splits into E:(1) and
O:(1).]
FIGURE 7.3: Coding tree for HELLO by the Shannon–Fano algorithm.

Initially, the symbols are sorted as LHEO. As Figure 7.3 shows, the first division yields
two parts: (a) L with a count of 2, denoted as L:(2); and (b) H, E and O with a total count
of 3, denoted as H,E,O:(3). The second division yields H:(1) and E,O:(2). The last division
is E:(1) and O:(1).
Table 7.1 summarizes the result, showing each symbol, its frequency count, information
content (log2 (1/p_i)), resulting codeword, and the number of bits needed to encode each
symbol in the word HELLO. The total number of bits used is shown at the bottom.

TABLE 7.1: One result of performing the Shannon–Fano algorithm on HELLO.

    Symbol   Count   log2 (1/p_i)   Code   Number of bits used
    L        2       1.32           0      2
    H        1       2.32           10     2
    E        1       2.32           110    3
    O        1       2.32           111    3
                             TOTAL number of bits: 10

To revisit the previous discussion on entropy, in this case,

    η = p_L log2 (1/p_L) + p_H log2 (1/p_H) + p_E log2 (1/p_E) + p_O log2 (1/p_O)
      = 0.4 × 1.32 + 0.2 × 2.32 + 0.2 × 2.32 + 0.2 × 2.32 = 1.92

This suggests that the minimum average number of bits to code each character in the word
HELLO would be at least 1.92. In this example, the Shannon–Fano algorithm uses an
average of 10/5 = 2 bits to code each symbol, which is fairly close to the lower bound of
1.92. Apparently, the result is satisfactory.
It should be pointed out that the outcome of the Shannon–Fano algorithm is not necessarily
unique. For instance, at the first division in the above example, it would be equally
valid to divide into the two parts L,H:(3) and E,O:(2). This would result in the coding in
Figure 7.4. Table 7.2 shows the codewords are different now. Also, these two sets of codewords
may behave differently when errors are present. Coincidentally, the total number of
bits required to encode the word HELLO remains at 10.

[Figure 7.4 here: (a) the root (5) splits into L,H:(3) and E,O:(2); (b) these split into L:(2),
H:(1) and E:(1), O:(1).]
FIGURE 7.4: Another coding tree for HELLO by the Shannon–Fano algorithm.

TABLE 7.2: Another result of performing the Shannon–Fano algorithm on HELLO.

    Symbol   Count   log2 (1/p_i)   Code   Number of bits used
    L        2       1.32           00     4
    H        1       2.32           01     2
    E        1       2.32           10     2
    O        1       2.32           11     2
                             TOTAL number of bits: 10

The Shannon–Fano algorithm delivers satisfactory coding results for data compression,
but it was soon outperformed and overtaken by the Huffman coding method.

7.4.2 Huffman Coding

First presented by David A. Huffman in a 1952 paper [7], this method attracted an overwhelming
amount of research and has been adopted in many important and/or commercial
applications, such as fax machines, JPEG, and MPEG.
In contradistinction to Shannon–Fano, which is top-down, the encoding steps of the
Huffman algorithm are described in the following bottom-up manner. Let's use the same
example word, HELLO. A similar binary coding tree will be used as above, in which the
left branches are coded 0 and right branches 1. A simple list data structure is also used.

[Figure 7.5 shows the tree being built bottom-up: (a) P1:(2) is formed from E:(1) and O:(1); (b) P2:(3) is formed from P1:(2) and H:(1); (c) the complete tree, with root P3:(5) over L:(2) and P2:(3).]

FIGURE 7.5: Coding tree for HELLO using the Huffman algorithm.

ALGORITHM 7.1 HUFFMAN CODING

1. Initialization: put all symbols on the list sorted according to their frequency counts.

2. Repeat until the list has only one symbol left.

   (a) From the list, pick two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes and create a parent node for them.

   (b) Assign the sum of the children's frequency counts to the parent and insert it into the list, such that the order is maintained.

   (c) Delete the children from the list.

3. Assign a codeword for each leaf based on the path from the root.

In the above figure, new symbols P1, P2, P3 are created to refer to the parent nodes in the Huffman coding tree. The contents of the list are illustrated below:

    After initialization:   L H E O
    After iteration (a):    L P1 H
    After iteration (b):    L P2
    After iteration (c):    P3

For this simple example, the Huffman algorithm apparently generated the same coding result as one of the Shannon–Fano results shown in Figure 7.3, although the results are usually better. The average number of bits used to code each character is also 2 (i.e., (1 + 1 + 2 + 3 + 3)/5 = 2). As another simple example, consider a text string containing a set of characters and their frequency counts as follows: A:(15), B:(7), C:(6), D:(6), and E:(5). It is easy to show that the Shannon–Fano algorithm needs a total of 89 bits to encode this string, whereas the Huffman algorithm needs only 87.

As shown above, if correct probabilities ("prior statistics") are available and accurate, the Huffman coding method produces good compression results. Decoding for the Huffman coding is trivial as long as the statistics and/or coding tree are sent before the data to be compressed (in the file header, say). This overhead becomes negligible if the data file is sufficiently large.

The following are important properties of Huffman coding:

• Unique prefix property. No Huffman code is a prefix of any other Huffman code. For instance, the code 0 assigned to L in Figure 7.5(c) is not a prefix of the code 10 for H or 110 for E or 111 for O; nor is the code 10 for H a prefix of the code 110 for E or 111 for O. It turns out that the unique prefix property is guaranteed by the above Huffman algorithm, since it always places all input symbols at the leaf nodes of the Huffman tree. The Huffman code is one of the prefix codes for which the unique prefix property holds. The code generated by the Shannon–Fano algorithm is another such example.

  This property is essential and also makes for an efficient decoder, since it precludes any ambiguity in decoding. In the above example, if a bit 0 is received, the decoder can immediately produce a symbol L without waiting for any more bits to be transmitted.

• Optimality. The Huffman code is a minimum-redundancy code, as shown in Huffman's 1952 paper [7]. It has been proven [8, 2] that the Huffman code is optimal for a given data model (i.e., a given, accurate, probability distribution):

  – The two least frequent symbols will have the same length for their Huffman codes, differing only at the last bit. This should be obvious from the above algorithm.

  – Symbols that occur more frequently will have shorter Huffman codes than symbols that occur less frequently. Namely, for symbols si and sj, if pi > pj then li ≤ lj, where li is the number of bits in the codeword for si.

  – It has been shown (see [2]) that the average code length l̄ for an information source S is strictly less than η + 1. Combined with Eq. (7.5), we have

        η ≤ l̄ < η + 1    (7.6)

Extended Huffman Coding. The discussion of Huffman coding so far assigns each symbol a codeword that has an integer bit length. As stated earlier, log2(1/pi) indicates the amount of information contained in the information source si, which corresponds to the

number of bits needed to represent it. When a particular symbol si has a large probability (close to 1.0), log2(1/pi) will be close to 0, and assigning one bit to represent that symbol will be costly. Only when the probabilities of all symbols can be expressed as 2^-k, where k is a positive integer, would the average length of codewords be truly optimal — that is, l̄ = η. Clearly, l̄ > η in most cases.

One way to address the problem of integral codeword length is to group several symbols and assign a single codeword to the group. Huffman coding of this type is called Extended Huffman Coding [2]. Assume an information source has alphabet S = {s1, s2, ..., sn}. If k symbols are grouped together, then the extended alphabet is

    S^(k) = {s1 s1 ... s1,  s1 s1 ... s2,  ...,  s1 s1 ... sn,  s1 s1 ... s2 s1,  ...,  sn sn ... sn}

with k symbols in each entry. Note that the size of the new alphabet S^(k) is n^k. If k is relatively large (e.g., k ≥ 3), then for most practical applications where n ≫ 1, n^k would be a very large number, implying a huge symbol table. This overhead makes Extended Huffman Coding impractical.

As shown in [2], if the entropy of S is η, then the average number of bits needed for each symbol in S is now

    η ≤ l̄ < η + 1/k    (7.7)

so we have shaved quite a bit from the coding schemes' bracketing of the theoretical best limit. Nevertheless, this is not as much of an improvement over the original Huffman coding (where group size is 1) as one might have hoped for.

7.4.3 Adaptive Huffman Coding

The Huffman algorithm requires prior statistical knowledge about the information source, and such information is often not available. This is particularly true in multimedia applications, where future data is unknown before its arrival, as for example in live (or streaming) audio and video. Even when the statistics are available, the transmission of the symbol table could represent heavy overhead.

For the non-extended version of Huffman coding, the above discussion assumes a so-called order-0 model — that is, symbols/characters were treated singly, without any context or history maintained. One possible way to include contextual information is to examine k preceding (or succeeding) symbols each time; this is known as an order-k model. For example, an order-1 model can incorporate such statistics as the probability of "qu" in addition to the individual probabilities of "q" and "u". Nevertheless, this again implies that much more statistical data has to be stored and sent for the order-k model when k > 1.

The solution is to use adaptive compression algorithms, in which statistics are gathered and updated dynamically as the datastream arrives. The probabilities are no longer based on prior knowledge but on the actual data received so far. The new coding methods are "adaptive" because, as the probability distribution of the received symbols changes, symbols will be given new (longer or shorter) codes. This is especially desirable for multimedia data, when the content (the music or the color of the scene) and hence the statistics can change rapidly.

As an example, we introduce the Adaptive Huffman Coding algorithm in this section. Many ideas, however, are also applicable to other adaptive compression algorithms.

PROCEDURE 7.1 Procedures for Adaptive Huffman Coding

    ENCODER                     DECODER

    Initial_code();             Initial_code();
    while not EOF               while not EOF
    {                           {
        get(c);                     decode(c);
        encode(c);                  output(c);
        update_tree(c);             update_tree(c);
    }                           }

• Initial_code assigns symbols with some initially agreed-upon codes, without any prior knowledge of the frequency counts for them. For example, some conventional code such as ASCII may be used for coding character symbols.

• update_tree is a procedure for constructing an adaptive Huffman tree. It basically does two things: it increments the frequency counts for the symbols (including any new ones), and updates the configuration of the tree.

  – The Huffman tree must always maintain its sibling property — that is, all nodes (internal and leaf) are arranged in the order of increasing counts. Nodes are numbered in order from left to right, bottom to top. (See Figure 7.6, in which the first node is 1. A:(1), the second node is 2. B:(1), and so on, where the numbers in parentheses indicate the count.) If the sibling property is about to be violated, a swap procedure is invoked to update the tree by rearranging the nodes.

  – When a swap is necessary, the farthest node with count N is swapped with the node whose count has just been increased to N + 1. Note that if the node with count N is not a leaf node — it is the root of a subtree — the entire subtree will go with it during the swap.

• The encoder and decoder must use exactly the same Initial_code and update_tree routines.

Figure 7.6(a) depicts a Huffman tree with some symbols already received. Figure 7.6(b) shows the updated tree after an additional A (i.e., the second A) was received. This increased the count of As to N + 1 = 2 and triggered a swap. In this case, the farthest node with count N = 1 was D:(1). Hence, A:(2) and D:(1) were swapped.

Apparently, the same result could also be obtained by first swapping A:(2) with B:(1), then with C:(1), and finally with D:(1). The problem is that such a procedure would take three swaps; the rule of swapping with "the farthest node with count N" helps avoid such unnecessary swaps.

[Figure 7.6 shows the tree before and after each swap, with nodes numbered 1–9 bottom-up and left-to-right.]

FIGURE 7.6: Node swapping for updating an adaptive Huffman tree: (a) a Huffman tree; (b) receiving 2nd "A" triggered a swap; (c-1) a swap is needed after receiving 3rd "A"; (c-2) another swap is needed; (c-3) the Huffman tree after receiving 3rd "A".

The update of the Huffman tree after receiving the third A is more involved and is illustrated in the three steps shown in Figure 7.6(c-1) to (c-3). Since A:(2) will become A:(3) (temporarily denoted as A:(2+1)), it is now necessary to swap A:(2+1) with the fifth node. This is illustrated with an arrow in Figure 7.6(c-1).

Since the fifth node is a non-leaf node, the subtree with nodes 1. D:(1), 2. B:(1), and 5. (2) is swapped as a whole with A:(3). Figure 7.6(c-2) shows the tree after this first swap. Now the seventh node will become (5+1), which triggers another swap with the eighth node. Figure 7.6(c-3) shows the Huffman tree after this second swap.

The above example shows an update process that aims to maintain the sibling property of the adaptive Huffman tree — the update of the tree sometimes requires more than one swap. When this occurs, the swaps should be executed in multiple steps in a "bottom-up" manner, starting from the lowest level where a swap is needed. In other words, the update is carried out sequentially: tree nodes are examined in order, and swaps are made whenever necessary.

To clearly illustrate more implementation details, let's examine another example. Here, we show exactly what bits are sent, as opposed to simply stating how the tree is updated.

EXAMPLE 7.1 Adaptive Huffman Coding for Symbol String AADCCDD

Let's assume that the initial code assignment for both the encoder and decoder simply follows the ASCII order for the 26 symbols in an alphabet, A through Z, as Table 7.3 shows. To improve the implementation of the algorithm, we adopt an additional rule: if any character/symbol is to be sent the first time, it must be preceded by a special symbol, NEW. The initial code for NEW is 0. The count for NEW is always kept as 0 (the count is never increased); hence it is always denoted as NEW:(0) in Figure 7.7.

Figure 7.7 shows the Huffman tree after each step. Initially, there is no tree. For the first A, 0 for NEW and the initial code 00001 for A are sent. Afterward, the tree is built and shown as the first one, labeled A. Now both the encoder and decoder have constructed the same first tree, from which it can be seen that the code for the second A is 1. The code sent is thus 1.

After the second A, the tree is updated, shown labeled as AA. The updates after receiving D and C are similar. More subtrees are spawned, and the code for NEW is getting longer, from 0 to 00 to 000.

TABLE 7.3: Initial code assignment for AADCCDD using adaptive Huffman coding.

    Initial Code
    NEW:  0
    A:    00001
    B:    00010
    C:    00011
    D:    00100
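The Table 7.3 assignment is simply the 5-bit binary index of each letter (A = 1, ..., Z = 26), with the single bit 0 reserved for NEW. A tiny helper (our own illustration, not part of the book's algorithm) reproduces it:

```python
def initial_code(symbol):
    """Fixed initial codes of Table 7.3: the k-th letter (A = 1, ...,
    Z = 26) gets the 5-bit binary form of k; NEW gets the single bit 0."""
    if symbol == "NEW":
        return "0"
    k = ord(symbol) - ord("A") + 1   # A -> 1, B -> 2, ...
    return format(k, "05b")          # zero-padded 5-bit binary

print(initial_code("A"), initial_code("D"))   # -> 00001 00100
```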

[Figure 7.7 shows the adaptive Huffman tree after each prefix of the input: "A", "AA", "AAD", "AADC", "AADCC" (steps 1–3), "AADCCD", and "AADCCDD", with NEW:(0) always at a lowest leaf.]

FIGURE 7.7: Adaptive Huffman tree for AADCCDD.

From AADC to AADCC takes two swaps. To illustrate the update process clearly, this is shown in three steps, with the required swaps again indicated by arrows.

• AADCC Step 1. The frequency count for C is increased from 1 to 1 + 1 = 2; this necessitates its swap with D:(1).

• AADCC Step 2. After the swap between C and D, the count of the parent node of C:(2) will be increased from 2 to 2 + 1 = 3; this requires its swap with A:(2).

• AADCC Step 3. The swap between A and the parent of C is completed.

Table 7.4 summarizes the sequence of symbols and code (zeros and ones) being sent to the decoder.

TABLE 7.4: Sequence of symbols and codes sent to the decoder.

    Symbol   NEW   A       A   NEW   D       NEW   C       C     D     D
    Code     0     00001   1   0     00100   00    00011   001   101   101

It is important to emphasize that the code for a particular symbol often changes during the adaptive Huffman coding process. The more frequent the symbol up to the moment, the shorter the code. For example, after AADCCDD, when the character D overtakes A as the most frequent symbol, its code changes from 101 to 0. This is of course fundamental for the adaptive algorithm — codes are reassigned dynamically according to the new probability distribution of the symbols.

The "Squeeze Page" on this book's web site provides a Java applet for adaptive Huffman coding that should aid you in learning this algorithm.

7.5 DICTIONARY-BASED CODING

The Lempel-Ziv-Welch (LZW) algorithm employs an adaptive, dictionary-based compression technique. Unlike variable-length coding, in which the lengths of the codewords are different, LZW uses fixed-length codewords to represent variable-length strings of symbols/characters that commonly occur together, such as words in English text.

As in the other adaptive compression techniques, the LZW encoder and decoder build up the same dictionary dynamically while receiving the data — the encoder and the decoder both develop the same dictionary. Since a single code can now represent more than one symbol/character, data compression is realized.

LZW proceeds by placing longer and longer repeated entries into a dictionary, then emitting the code for an element rather than the string itself, if the element has already been placed in the dictionary. The predecessors of LZW are LZ77 [9] and LZ78 [10], due to Jacob Ziv and Abraham Lempel in 1977 and 1978. Terry Welch [11] improved the technique in 1984. LZW is used in many applications, such as UNIX compress, GIF for images, V.42 bis for modems, and others.

ALGORITHM 7.2 LZW COMPRESSION

BEGIN
  s = next input character;
  while not EOF
  {
    c = next input character;
    if s + c exists in the dictionary
      s = s + c;
    else
    {
      output the code for s;
      add string s + c to the dictionary with a new code;
      s = c;
    }
  }
  output the code for s;
END

EXAMPLE 7.2 LZW Compression for String ABABBABCABABBA

Let's start with a very simple dictionary (also referred to as a "string table"), initially containing only three characters, with codes as follows:

    code   string
     1     A
     2     B
     3     C

Now if the input string is ABABBABCABABBA, the LZW compression algorithm works as follows:

    s      c     output   code   string
                           1      A
                           2      B
                           3      C
    A      B     1         4      AB
    B      A     2         5      BA
    A      B
    AB     B     4         6      ABB
    B      A
    BA     B     5         7      BAB
    B      C     2         8      BC
    C      A     3         9      CA
    A      B
    AB     A     4        10      ABA
    A      B
    AB     B
    ABB    A     6        11      ABBA
    A      EOF   1

The output codes are 1 2 4 5 2 3 4 6 1. Instead of 14 characters, only 9 codes need to be sent. If we assume each character or code is transmitted as a byte, that is quite a saving (the compression ratio would be 14/9 = 1.56). (Remember, the LZW is an adaptive algorithm, in which the encoder and decoder independently build their own string tables. Hence, there is no overhead involving transmitting the string table.)

Obviously, for our illustration the above example is replete with a great deal of redundancy in the input string, which is why it achieves compression so quickly. In general, savings for LZW would not come until the text is more than a few hundred bytes long.

The above LZW algorithm is simple, and it makes no effort in selecting optimal new strings to enter into its dictionary. As a result, its string table grows rapidly, as illustrated above. A typical LZW implementation for textual data uses a 12-bit codelength. Hence, its dictionary can contain up to 4,096 entries, with the first 256 (0–255) entries being ASCII codes. If we take this into account, the above compression ratio is reduced to (14 × 8)/(9 × 12) ≈ 1.04.

ALGORITHM 7.3 LZW DECOMPRESSION (SIMPLE VERSION)

BEGIN
  s = NIL;
  while not EOF
  {
    k = next input code;
    entry = dictionary entry for k;
    output entry;
    if (s != NIL)
      add string s + entry[0] to the dictionary with a new code;
    s = entry;
  }
END

EXAMPLE 7.3 LZW decompression for string ABABBABCABABBA

Input codes to the decoder are 1 2 4 5 2 3 4 6 1. The initial string table is identical to what is used by the encoder.

The LZW decompression algorithm then works as follows:

    s      k     entry/output   code   string
                                 1      A
                                 2      B
                                 3      C
    NIL    1     A

    A      2     B              4      AB
    B      4     AB             5      BA
    AB     5     BA             6      ABB
    BA     2     B              7      BAB
    B      3     C              8      BC
    C      4     AB             9      CA
    AB     6     ABB           10      ABA
    ABB    1     A             11      ABBA
    A      EOF

Apparently, the output string is ABABBABCABABBA — a truly lossless result!

LZW Algorithm Details. A more careful examination of the above simple version of the LZW decompression algorithm will reveal a potential problem. In adaptively updating the dictionaries, the encoder is sometimes ahead of the decoder. For example, after the sequence ABABB, the encoder will output code 4 and create a dictionary entry with code 6 for the new string ABB.

On the decoder side, after receiving the code 4, the output will be AB, and the dictionary is updated with code 5 for a new string, BA. This occurs several times in the above example, such as after the encoder outputs another code 4, code 6. In a way, this is anticipated — after all, it is a sequential process, and the encoder had to be ahead. In this example, this did not cause a problem.

Welch [11] points out that the simple version of the LZW decompression algorithm will break down when the following scenario occurs. Assume that the input string is ABABBABCABBABBAX....

The LZW encoder:

    s      c     output   code   string
                           1      A
                           2      B
                           3      C
    A      B     1         4      AB
    B      A     2         5      BA
    A      B
    AB     B     4         6      ABB
    B      A
    BA     B     5         7      BAB
    B      C     2         8      BC
    C      A     3         9      CA
    A      B
    AB     B
    ABB    A     6        10      ABBA
    A      B
    AB     B
    ABB    A
    ABBA   X    10        11      ABBAX

The sequence of output codes from the encoder (and hence the input codes for the decoder) is 1 2 4 5 2 3 6 10....

The simple LZW decoder:

    s      k     entry/output   code   string
                                 1      A
                                 2      B
                                 3      C
    NIL    1     A
    A      2     B              4      AB
    B      4     AB             5      BA
    AB     5     BA             6      ABB
    BA     2     B              7      BAB
    B      3     C              8      BC
    C      6     ABB            9      CA
    ABB   10     ???

"???" indicates that the decoder has encountered a difficulty: no dictionary entry exists for the last input code, 10. A closer examination reveals that code 10 was most recently created at the encoder side, formed by a concatenation of Character, String, Character. In this case, the character is A, and the string is BB — that is, A + BB + A. Meanwhile, the sequence of the output symbols from the encoder is A, BB, A, BB, A.

This example illustrates that whenever the sequence of symbols to be coded is Character, String, Character, String, Character, and so on, the encoder will create a new code to represent Character + String + Character and use it right away, before the decoder has had a chance to create it!

Fortunately, this is the only case in which the above simple LZW decompression algorithm will fail. Also, when this occurs, the variable s = Character + String. A modified version of the algorithm can handle this exceptional case by checking whether the input code has been defined in the decoder's dictionary. If not, it will simply assume that the code represents the symbols s + s[0]; that is, Character + String + Character.
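The compression loop and this s + s[0] exception rule can be sketched together in Python (a minimal illustration under our own function names, using the three-symbol initial dictionary of the examples):

```python
def lzw_compress(text, initial):
    """LZW encoder: grow the match s until s + c is not yet in the table."""
    dictionary = dict(initial)
    next_code = max(dictionary.values()) + 1
    s, out = text[0], []
    for c in text[1:]:
        if s + c in dictionary:
            s = s + c
        else:
            out.append(dictionary[s])
            dictionary[s + c] = next_code    # new entry on the encoder side
            next_code += 1
            s = c
    out.append(dictionary[s])
    return out

def lzw_decompress(codes, initial):
    """LZW decoder with the exception fix: an undefined code must be s + s[0]."""
    table = {code: string for string, code in initial.items()}
    next_code = max(table) + 1
    s, out = "", []
    for k in codes:
        entry = table.get(k)
        if entry is None:                    # Character+String+Character case
            entry = s + s[0]
        out.append(entry)
        if s:
            table[next_code] = s + entry[0]  # mirror the encoder's new entry
            next_code += 1
        s = entry
    return "".join(out)

initial = {"A": 1, "B": 2, "C": 3}
print(lzw_compress("ABABBABCABABBA", initial))   # -> [1, 2, 4, 5, 2, 3, 4, 6, 1]
codes = lzw_compress("ABABBABCABBABBA", initial)
print(codes)                                     # -> [1, 2, 4, 5, 2, 3, 6, 10]
print(lzw_decompress(codes, initial))            # -> ABABBABCABBABBA
```

The second input ends with the troublesome Character, String, Character pattern: code 10 arrives before the decoder has defined it, and the s + s[0] rule recovers ABBA correctly.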

ALGORITHM 7.4 LZW DECOMPRESSION (MODIFIED)

BEGIN
  s = NIL;
  while not EOF
  {
    k = next input code;
    entry = dictionary entry for k;

    /* exception handler */
    if (entry == NULL)
      entry = s + s[0];

    output entry;
    if (s != NIL)
      add string s + entry[0] to the dictionary with a new code;
    s = entry;
  }
END

Implementation requires some practical limit for the dictionary size — for example, a maximum of 4,096 entries for GIF and 2,048 entries for V.42 bis. Nevertheless, this still yields a 12-bit or 11-bit code length for LZW codes, which is longer than the word length for the original data — 8-bit for ASCII.

In real applications, the code length l is kept in the range of [l0, lmax]. For the UNIX compress command, l0 = 9 and lmax is by default 16. The dictionary initially has a size of 2^l0. When it is filled up, the code length will be increased by 1; this is allowed to repeat until l = lmax.

If the data to be compressed lacks any repetitive structure, the chance of using the new codes in the dictionary entries could be low. Sometimes, this will lead to data expansion instead of data reduction, since the code length is often longer than the word length of the original data. To deal with this, V.42 bis, for example, has built in two modes: compressed and transparent. The latter turns off compression and is invoked when data expansion is detected.

Since the dictionary has a maximum size, once it reaches 2^lmax entries, LZW loses its adaptive power and becomes a static, dictionary-based technique. UNIX compress, for example, will monitor its own performance at this point. It will simply flush and re-initialize the dictionary when the compression ratio falls below a threshold. A better dictionary management is perhaps to remove the LRU (least recently used) entries. V.42 bis will look for any entry that is not a prefix to any other dictionary entry, because this indicates that the code has not been used since its creation.

7.6 ARITHMETIC CODING

Arithmetic coding is a more modern coding method that usually outperforms Huffman coding in practice. It was fully developed in the late 1970s and 1980s [12, 13, 14]. The initial idea of arithmetic coding was introduced in Shannon's 1948 work [3]. Peter Elias developed its first recursive implementation (which was not published but was mentioned in Abramson's 1963 book [15]). The method was further developed and described in Jelinek's 1968 book [16]. Modern arithmetic coding can be attributed to Pasco (1976) [17] and Rissanen and Langdon (1979) [12].

Normally (in its non-extended mode), Huffman coding assigns each symbol a codeword that has an integral bit length. As stated earlier, log2(1/pi) indicates the amount of information contained in the information source si, which corresponds to the number of bits needed to represent it. For example, when a particular symbol si has a large probability (close to 1.0), log2(1/pi) will be close to 0, and assigning one bit to represent that symbol will be very costly. Only when the probabilities of all symbols can be expressed as 2^-k, where k is a positive integer, would the average length of codewords be truly optimal — that is, l̄ = η (with η the entropy of the information source, as defined in Eq. (7.3)). Apparently, l̄ > η in most cases.

Although it is possible to group symbols into metasymbols for codeword assignment (as in extended Huffman coding) to overcome the limitation of integral number of bits per symbol, the increase in the resultant symbol table required by the Huffman encoder and decoder would be formidable.

Arithmetic coding can treat the whole message as one unit. In practice, the input data is usually broken up into chunks to avoid error propagation. However, in our presentation below, we take a simplistic approach and include a terminator symbol.

A message is represented by a half-open interval [a, b) where a and b are real numbers between 0 and 1. Initially, the interval is [0, 1). When the message becomes longer, the length of the interval shortens, and the number of bits needed to represent the interval increases. Suppose the alphabet is [A, B, C, D, E, F, $], in which $ is a special symbol used to terminate the message, and the known probability distribution is as shown in Figure 7.8(a).

ALGORITHM 7.5 ARITHMETIC CODING ENCODER

BEGIN
  low = 0.0; high = 1.0; range = 1.0;

  while (symbol != terminator)
  {
    get (symbol);
    high = low + range * Range_high(symbol);
    low  = low + range * Range_low(symbol);
    range = high - low;
  }

  output a code so that low <= code < high;
END
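A direct transcription of the encoder loop in Python (our sketch; `RANGES` hard-codes the Figure 7.8(a) distribution) reproduces the CAEE$ interval worked out in the text. Note that high is updated before low so that both are measured from the interval's previous lower end:

```python
# Probability ranges from Figure 7.8(a), hard-coded for this sketch.
RANGES = {"A": (0.0, 0.2), "B": (0.2, 0.3), "C": (0.3, 0.5),
          "D": (0.5, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.9),
          "$": (0.9, 1.0)}

def arith_encode(message):
    """Shrink [low, high) once per symbol, as in Algorithm 7.5."""
    low, high = 0.0, 1.0
    for symbol in message:        # the message must end with the terminator $
        rng = high - low
        # high first, so both updates use the old value of low
        high = low + rng * RANGES[symbol][1]
        low = low + rng * RANGES[symbol][0]
    return low, high

low, high = arith_encode("CAEE$")
print(low, high)                  # approximately 0.33184 and 0.3322
```

Any number in the returned interval identifies the message; generating the shortest binary fraction inside it is the job of Procedure 7.2.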

(a)
    Symbol   Probability   Range
    A        0.2           [0, 0.2)
    B        0.1           [0.2, 0.3)
    C        0.2           [0.3, 0.5)
    D        0.05          [0.5, 0.55)
    E        0.3           [0.55, 0.85)
    F        0.05          [0.85, 0.9)
    $        0.1           [0.9, 1.0)

(b) [The ranges [0, 1.0), [0.3, 0.5), [0.30, 0.34), [0.322, 0.334), [0.3286, 0.3322), and [0.33184, 0.3322) are shown shrinking symbol by symbol.]

(c)
    Symbol   low       high      range
             0         1.0       1.0
    C        0.3       0.5       0.2
    A        0.30      0.34      0.04
    E        0.322     0.334     0.012
    E        0.3286    0.3322    0.0036
    $        0.33184   0.33220   0.00036

FIGURE 7.8: Arithmetic coding: encode symbols CAEE$: (a) probability distribution of symbols; (b) graphical display of shrinking ranges; (c) new low, high, and range generated.

The encoding process is illustrated in Figure 7.8(b) and (c), in which a string of symbols CAEE$ is encoded. Initially, low = 0, high = 1.0, and range = 1.0. After the first symbol C, Range_low(C) = 0.3, Range_high(C) = 0.5; so low = 0 + 1.0 × 0.3 = 0.3, high = 0 + 1.0 × 0.5 = 0.5. The new range is now reduced to 0.2.

For clarity of illustration, the ever-shrinking ranges are enlarged in each step (indicated by dashed lines) in Figure 7.8(b). After the second symbol A, low, high, and range are 0.30, 0.34, and 0.04. The process repeats itself until after the terminating symbol $ is received. By then low and high are 0.33184 and 0.33220, respectively. It is apparent that finally we have

    range = PC × PA × PE × PE × P$ = 0.2 × 0.2 × 0.3 × 0.3 × 0.1 = 0.00036

The final step in encoding calls for generation of a number that falls within the range [low, high). Although it is trivial to pick such a number in decimal, such as 0.33184, 0.33185, or 0.332 in the above example, it is less obvious how to do it with a binary fractional number. The following algorithm will ensure that the shortest binary codeword is found if low and high are the two ends of the range and low < high.

PROCEDURE 7.2 Generating Codeword for Encoder

BEGIN
  code = 0;
  k = 1;
  while (value(code) < low)
  {
    assign 1 to the kth binary fraction bit;
    if (value(code) > high)
      replace the kth bit by 0;
    k = k + 1;
  }
END

For the above example, low = 0.33184, high = 0.3322. If we assign 1 to the first binary fraction bit, it would be 0.1 in binary, and its decimal value(code) = value(0.1) = 0.5 > high. Hence, we assign 0 to the first bit. Since value(0.0) = 0 < low, the while loop continues.

Assigning 1 to the second bit makes a binary code 0.01 and value(0.01) = 0.25, which is less than high, so it is accepted. Since it is still true that value(0.01) < low, the iteration continues. Eventually, the binary codeword generated is 0.01010101, which is 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.33203125.

It must be pointed out that we were lucky to have found a codeword of only 8 bits to represent this sequence of symbols CAEE$. In this case, log2(1/PC) + log2(1/PA) + log2(1/PE) + log2(1/PE) + log2(1/P$) = log2(1/range) = log2(1/0.00036) ≈ 11.44, which would suggest that it could take 12 bits to encode a string of symbols like this.
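Procedure 7.2, together with the decoding process discussed at the end of this section, can be sketched as follows (our Python illustration; `RANGES` repeats the Figure 7.8(a) distribution):

```python
RANGES = {"A": (0.0, 0.2), "B": (0.2, 0.3), "C": (0.3, 0.5),
          "D": (0.5, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.9),
          "$": (0.9, 1.0)}

def shortest_codeword(low, high):
    """Procedure 7.2: append bits until the binary fraction enters [low, high)."""
    bits, value, k = [], 0.0, 1
    while value < low:
        bit = 2.0 ** -k
        if value + bit >= high:      # a 1 here would reach or pass high
            bits.append("0")
        else:
            bits.append("1")
            value += bit
        k += 1
    return "".join(bits), value

def arith_decode(value):
    """Peel off one symbol at a time: locate the range holding value, rescale."""
    out, symbol = "", None
    while symbol != "$":
        for symbol, (lo, hi) in RANGES.items():
            if lo <= value < hi:
                break
        out += symbol
        value = (value - lo) / (hi - lo)
    return out

code, value = shortest_codeword(0.33184, 0.3322)
print(code, value)            # -> 01010101 0.33203125
print(arith_decode(value))    # -> CAEE$
```

Running it reproduces the hand computation: the first two bits become 0 and 1 exactly as argued above, and the resulting 8-bit fraction decodes back to CAEE$.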

It can be proven [2] that ⌈log2(1/∏i Pi)⌉ is the upper bound. Namely, in the worst case, the shortest codeword in arithmetic coding will require k bits to encode a sequence of symbols, and

    k = ⌈log2 (1/range)⌉ = ⌈log2 (1/∏i Pi)⌉    (7.8)

where Pi is the probability for symbol i and range is the final range generated by the encoder.

Apparently, when the length of the message is long, its range quickly becomes very small, and hence log2(1/range) becomes very large; the difference between log2(1/∏i Pi) and ⌈log2(1/∏i Pi)⌉ is negligible.

Generally, Arithmetic Coding achieves better performance than Huffman coding, because the former treats an entire sequence of symbols as one unit, whereas the latter has the restriction of assigning an integral number of bits to each symbol. For example, Huffman coding would require 12 bits for CAEE$, equaling the worst-case performance of Arithmetic Coding.

Moreover, Huffman coding cannot always attain the upper bound illustrated in Eq. (7.8). It can be shown (see Exercise 5) that if the alphabet is [A, B, C] and the known probability distribution is PA = 0.5, PB = 0.4, PC = 0.1, then for sending BBB, Huffman coding will require 6 bits, which is more than ⌈log2(1/∏i Pi)⌉ = 4, whereas arithmetic coding will need only 4 bits.

ALGORITHM 7.6 ARITHMETIC CODING DECODER

BEGIN
  get binary code and convert to decimal value = value(code);
  Do
  {
    find a symbol s so that
      Range_low(s) <= value < Range_high(s);
    output s;
    low = Range_low(s);
    high = Range_high(s);
    range = high - low;
    value = [value - low] / range;
  }
  Until symbol s is a terminator
END

Table 7.5 illustrates the decoding process for the above example. Initially, value = 0.33203125. Since Range_low(C) = 0.3 ≤ 0.33203125 < 0.5 = Range_high(C), the first output symbol is C. This yields value = [0.33203125 − 0.3]/0.2 = 0.16015625, which in turn determines that the second symbol is A. Eventually, value is 0.953125, which falls in the range [0.9, 1.0) of the terminator $.

TABLE 7.5: Arithmetic coding: decode symbols CAEE$.

    Value         Output symbol   Low    High   Range
    0.33203125    C               0.3    0.5    0.2
    0.16015625    A               0.0    0.2    0.2
    0.80078125    E               0.55   0.85   0.3
    0.8359375     E               0.55   0.85   0.3
    0.953125      $               0.9    1.0    0.1

The algorithm described previously has a subtle implementation difficulty. When the intervals shrink, we need to use very high-precision numbers to do encoding. This makes practical implementation of this algorithm infeasible. Fortunately, it is possible to rescale the intervals and use only integer arithmetic for a practical implementation [18].

In the above discussion, a special symbol, $, is used as a terminator of the string of symbols. This is analogous to sending end-of-line (EOL) in image transmission. In conventional compression applications, no terminator symbol is needed, as the encoder simply codes all symbols from the input. However, if the transmission channel/network is noisy (lossy), the protection of having a terminator (or EOL) symbol is crucial for the decoder to regain synchronization with the encoder.

The coding of the EOL symbol itself is an interesting problem. Usually, EOL ends up being relatively long. Lei et al. [19] address some of these issues and propose an algorithm that controls the length of the EOL codeword it generates.

7.7 LOSSLESS IMAGE COMPRESSION

One of the most commonly used compression techniques in multimedia data compression is differential coding. The basis of data reduction in differential coding is the redundancy in consecutive symbols in a datastream. Recall that we considered lossless differential coding in Chapter 6, when we examined how audio must be dealt with via subtraction from predicted values. Audio is a signal indexed by one dimension, time. Here we consider how to apply the lessons learned from audio to the context of digital image signals that are indexed by two, spatial, dimensions (x, y).

7.7.1 Differential Coding of Images

Let's consider differential coding in the context of digital images. In a sense, we move from signals with domain in one dimension to signals indexed by numbers in two dimensions (x, y) — the rows and columns of an image. Later, we'll look at video signals. These are even more complex, in that they are indexed by space and time (x, y, t).

Because of the continuity of the physical world, the gray-level intensities (or color) of background and foreground objects in images tend to change relatively slowly across the image frame. Since we were dealing with signals in the time domain for audio, practitioners generally refer to images as signals in the spatial domain. The generally slowly changing
nature of imagery spatially produces a high likelihood that neighboring pixels will have similar intensity values. Given an original image I(x, y), using a simple difference operator we can define a difference image d(x, y) as follows:

    d(x, y) = I(x, y) - I(x - 1, y)    (7.9)

This is a simple approximation of a partial differential operator d/dx applied to an image defined in terms of integer values of x and y.
Another approach is to use the discrete version of the 2D Laplacian operator to define a difference image d(x, y) as

    d(x, y) = 4 I(x, y) - I(x, y - 1) - I(x, y + 1) - I(x + 1, y) - I(x - 1, y)    (7.10)

In both cases, the difference image will have a histogram as in Figure 7.9(d), derived from the d(x, y) partial derivative image in Figure 7.9(b) for the original image I in Figure 7.9(a). Notice that the histogram for I is much broader, as in Figure 7.9(c). It can be shown that image I has larger entropy than image d, since it has a more even distribution in its intensity values. Consequently, Huffman coding or some other variable-length coding scheme will produce shorter bit-length codewords for the difference image. Compression will work better on a difference image.

FIGURE 7.9: Distributions for original versus derivative images. (a, b): original gray-level image and its partial derivative image; (c, d): histograms for original and derivative images. This figure uses a commonly employed image called Barb.

7.7.2 Lossless JPEG

Lossless JPEG is a special case of the JPEG image compression. It differs drastically from other JPEG modes in that the algorithm has no lossy steps. Thus we treat it here and consider the more used JPEG methods in Chapter 9. Lossless JPEG is invoked when the user selects a 100% quality factor in an image tool. Essentially, lossless JPEG is included in the JPEG compression standard simply for completeness.

    C  B
    A  X

FIGURE 7.10: Neighboring pixels for predictors in lossless JPEG. Note that any of A, B, or C has already been decoded before it is used in the predictor, on the decoder side of an encode/decode cycle.

The following predictive method is applied on the unprocessed original image (or each color band of the original color image). It essentially involves two steps: forming a differential prediction and encoding.

TABLE 7.6: Predictors for lossless JPEG

    Predictor    Prediction
    P1           A
    P2           B
    P3           C
    P4           A + B - C
    P5           A + (B - C) / 2
    P6           B + (A - C) / 2
    P7           (A + B) / 2

1. A predictor combines the values of up to three neighboring pixels as the predicted value for the current pixel, indicated by X in Figure 7.10. The predictor can use any one of the seven schemes listed in Table 7.6. If predictor P1 is used, the neighboring intensity value A will be adopted as the predicted intensity of the current pixel; if
predictor P4 is used, the current pixel value is derived from the three neighboring pixels as A + B - C; and so on.

2. The encoder compares the prediction with the actual pixel value at position X and encodes the difference using one of the lossless compression techniques we have discussed, such as the Huffman coding scheme.

Since prediction must be based on previously encoded neighbors, the very first pixel in the image I(0, 0) will have to simply use its own value. The pixels in the first row always use predictor P1, and those in the first column always use P2.
Lossless JPEG usually yields a relatively low compression ratio, which renders it impractical for most multimedia applications. An empirical comparison using some 20 images indicates that the compression ratio for lossless JPEG with any one of the seven predictors ranges from 1.0 to 3.0, with an average of around 2.0. Predictors 4 to 7, which consider neighboring nodes in both horizontal and vertical dimensions, offer slightly better compression (approximately 0.2 to 0.5 higher) than predictors 1 to 3.
Table 7.7 shows a comparison of the compression ratio for several lossless compression techniques using test images Lena, football, F-18, and flowers. These standard images used for many purposes in imaging work are shown on the textbook web site in the Further Exploration section for this chapter.

TABLE 7.7: Comparison of lossless JPEG with other lossless compression programs

    Compression program         Compression ratio
    Lossless JPEG
    Optimal lossless JPEG
    compress (LZW)
    gzip (LZ77)
    gzip -9 (optimal LZ77)
    pack (Huffman coding)

This chapter has been devoted to the discussion of lossless compression algorithms. It should be apparent that their compression ratio is generally limited (with a maximum at about 2 to 3). However, many of the multimedia applications we will address in the next several chapters require a much higher compression ratio. This is accomplished by lossy compression schemes.

7.8 FURTHER EXPLORATION

Mark Nelson's book [1] is a standard reference on data compression, as is the text by Khalid Sayood [2].
The Further Exploration section of the text web site for this chapter provides a set of web resources for lossless compression, including:

- An excellent resource for data compression compiled by Mark Nelson that includes libraries, documentation, and source code for Huffman Coding, Adaptive Huffman Coding, LZW, Arithmetic Coding, and so on
- Source code for Adaptive Arithmetic Coding
- The Theory of Data Compression web page, which introduces basic theories behind both lossless and lossy data compression. Shannon's original 1948 paper on information theory can be downloaded from this site as well
- The FAQ for the comp.compression and comp.compression.research groups. This FAQ answers most of the commonly asked questions about data compression in general
- A set of applets for lossless compression that effectively show interactive demonstrations of Adaptive Huffman, LZW, and so on. (Impressively, this web page is the fruit of a student's final project in a third-year undergraduate multimedia course based on the material in this text.)
- A good introduction to Arithmetic Coding
- Grayscale test images f-18.bmp, flowers.bmp, football.bmp, lena.bmp

7.9 EXERCISES

1. Suppose eight characters have a distribution A:(1), B:(1), C:(1), D:(2), E:(3), F:(5), G:(5), H:(10). Draw a Huffman tree for this distribution. (Because the algorithm may group subtrees with equal probability in a different order, your answer is not strictly unique.)

2. (a) What is the entropy (η) of the image below, where numbers (0, 20, 50, 99) denote the gray-level intensities?

        99 99 99 99 99 99 99 99
        20 20 20 20 20 20 20 20
         0  0  0  0  0  0  0  0
         0  0 50 50 50 50  0  0
         0  0 50 50 50 50  0  0
         0  0 50 50 50 50  0  0
         0  0 50 50 50 50  0  0
         0  0  0  0  0  0  0  0

   (b) Show step by step how to construct the Huffman tree to encode the above four intensity values in this image. Show the resulting code for each intensity value.
   (c) What is the average number of bits needed for each pixel, using your Huffman code? How does it compare to η?
3. Consider an alphabet with two symbols A, B, with probability P(A) = x and P(B) = 1 - x.
   (a) Plot the entropy as a function of x. You might want to use log2(3) = 1.6, log2(7) = 2.8.
   (b) Discuss why it must be the case that if the probability of the two symbols is 1/2 + ε and 1/2 - ε, with small ε, the entropy is less than the maximum.
   (c) Generalize the above result by showing that, for a source generating N symbols, the entropy is maximum when the symbols are all equiprobable.
   (d) As a small programming project, write code to verify the conclusions above.

4. Extended Huffman Coding assigns one codeword to each group of k symbols. Why is the average number of bits for each symbol still no less than the entropy η, as indicated in Equation (7.7)?

5. (a) What are the advantages and disadvantages of Arithmetic Coding as compared to Huffman Coding?
   (b) Suppose the alphabet is [A, B, C], and the known probability distribution is P_A = 0.5, P_B = 0.4, P_C = 0.1. For simplicity, let's also assume that both encoder and decoder know that the length of the messages is always 3, so there is no need for a terminator.
       i. How many bits are needed to encode the message BBB by Huffman coding?
       ii. How many bits are needed to encode the message BBB by arithmetic coding?

6. (a) What are the advantages of Adaptive Huffman Coding compared to the original Huffman Coding algorithm?
   (b) Assume that Adaptive Huffman Coding is used to code an information source with a vocabulary of four letters (a, b, c, d). Before any transmission, the initial coding is a = 00, b = 01, c = 10, d = 11. As in the example illustrated in Figure 7.7, a special symbol NEW will be sent before any letter if it is to be sent the first time.
       Figure 7.11 is the Adaptive Huffman tree after sending letters aabb. After that, the additional bitstream received by the decoder for the next few letters is 01010010101.
       i. What are the additional letters received?
       ii. Draw the adaptive Huffman trees after each of the additional letters is received.

FIGURE 7.11: Adaptive Huffman tree.

7. Compare the rate of adaptation of adaptive Huffman coding and adaptive arithmetic coding (see the textbook web site for the latter). What prevents each method from adapting to quick changes in source statistics?

8. Consider the dictionary-based LZW compression algorithm. Suppose the alphabet is the set of symbols {0, 1}. Show the dictionary (symbol sets plus associated codes) and output for LZW compression of the input

       0110011

9. Implement Huffman coding, adaptive Huffman, arithmetic coding, and the LZW coding algorithms using your favorite programming language. Generate at least three types of statistically different artificial data sources to test your implementation of these algorithms. Compare and comment on each algorithm's performance in terms of compression ratio for each type of data source.

7.10 REFERENCES

1 M. Nelson, The Data Compression Book, 2nd ed., New York: M&T Books, 1995.
2 K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco: Morgan Kaufmann, 2000.
3 C.E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, 27: 379-423 and 623-656, 1948.
4 C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, Champaign, IL: University of Illinois Press, 1949.
5 R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.
6 R. Fano, Transmission of Information, Cambridge, MA: MIT Press, 1961.
7 D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE [Institute of Radio Engineers, now the IEEE], 40(9): 1098-1101, 1952.
8 T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Cambridge, MA: MIT Press, 1992.
9 J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, 23(3): 337-343, 1977.
10 J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Transactions on Information Theory, 24(5): 530-536, 1978.
11 T.A. Welch, "A Technique for High Performance Data Compression," IEEE Computer, 17(6): 8-19, 1984.
12 J. Rissanen and G.G. Langdon, "Arithmetic Coding," IBM Journal of Research and Development, 23(2): 149-162, 1979.
13 I.H. Witten, R.M. Neal, and J.G. Cleary, "Arithmetic Coding for Data Compression," Communications of the ACM, 30(6): 520-540, 1987.
14 T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Englewood Cliffs, NJ: Prentice Hall, 1990.
15 N. Abramson, Information Theory and Coding, New York: McGraw-Hill, 1963.
16 F. Jelinek, Probabilistic Information Theory, New York: McGraw-Hill, 1968.
17 R. Pasco, "Source Coding Algorithms for Data Compression," Ph.D. diss., Department of Electrical Engineering, Stanford University, 1976.
18 P.G. Howard and J.S. Vitter, "Practical Implementation of Arithmetic Coding," in Image and Text Compression, ed. J.A. Storer, Boston: Kluwer Academic Publishers, 1992, 85-112.
19 S.M. Lei and M.T. Sun, "An Entropy Coding System for Digital HDTV Applications," IEEE Transactions on Circuits and Systems for Video Technology, 1(1): 147-154, 1991.

C H A P T E R  8

Lossy Compression Algorithms

In this chapter, we consider lossy compression methods. Since information loss implies some tradeoff between error and bitrate, we first consider measures of distortion — e.g., squared error. Different quantizers are introduced, each of which has a different distortion behavior. A discussion of transform coding leads into an introduction to the Discrete Cosine Transform used in JPEG compression (see Chapter 9) and the Karhunen-Loève transform. Another transform scheme, wavelet-based coding, is then set out.

8.1 INTRODUCTION

As discussed in Chapter 7, the compression ratio for image data using lossless compression techniques (e.g., Huffman Coding, Arithmetic Coding, LZW) is low when the image histogram is relatively flat. For image compression in multimedia applications, where a higher compression ratio is required, lossy methods are usually adopted. In lossy compression, the compressed image is usually not the same as the original image but is meant to form a close approximation to the original image perceptually. To quantitatively describe how close the approximation is to the original data, some form of distortion measure is required.

8.2 DISTORTION MEASURES


A distortion measure is a mathematical quantity that specifies how close an approximation is to its original, using some distortion criteria. When looking at compressed data, it is natural to think of the distortion in terms of the numerical difference between the original data and the reconstructed data. However, when the data to be compressed is an image, such a measure may not yield the intended result.
For example, if the reconstructed image is the same as the original image except that it is shifted to the right by one vertical scan line, an average human observer would have a hard time distinguishing it from the original and would therefore conclude that the distortion is small. However, when the calculation is carried out numerically, we find a large distortion, because of the large changes in individual pixels of the reconstructed image. The problem is that we need a measure of perceptual distortion, not a more naive numerical approach. However, the study of perceptual distortions is beyond the scope of this book.
Of the many numerical distortion measures that have been defined, we present the three most commonly used in image compression. If we are interested in the average pixel difference, the mean square error (MSE) σ² is often used. It is defined as

    σ² = (1/N) Σ_{n=1}^{N} (x_n - y_n)²    (8.1)

where x_n, y_n, and N are the input data sequence, reconstructed data sequence, and length of the data sequence, respectively.
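Eq. (8.1) is straightforward to compute. The following is a minimal Python sketch (not code from the text), using plain lists for the two sequences:

```python
def mse(x, y):
    """Mean square error of Eq. (8.1): the average squared difference
    between original samples x and reconstructed samples y."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum((xn - yn) ** 2 for xn, yn in zip(x, y)) / len(x)

# A reconstruction that is off by 1 at two of four samples:
print(mse([10, 20, 30, 40], [10, 21, 30, 39]))  # 0.5
```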
If we are interested in the size of the error relative to the signal, we can measure the signal-to-noise ratio (SNR) by taking the ratio of the average square of the original data sequence and the mean square error (MSE), as discussed in Chapter 6. In decibel units (dB), it is defined as

    SNR = 10 log10 (σ_x² / σ_d²)    (8.2)

where σ_x² is the average square value of the original data sequence and σ_d² is the MSE.
Another commonly used measure for distortion is the peak-signal-to-noise ratio (PSNR), which measures the size of the error relative to the peak value of the signal x_peak. It is given by

    PSNR = 10 log10 (x_peak² / σ_d²)    (8.3)

8.3 THE RATE-DISTORTION THEORY

Lossy compression always involves a tradeoff between rate and distortion. Rate is the average number of bits required to represent each source symbol. Within this framework, the tradeoff between rate and distortion is represented in the form of a rate-distortion function R(D).
Intuitively, for a given source and a given distortion measure, if D is a tolerable amount of distortion, R(D) specifies the lowest rate at which the source data can be encoded while keeping the distortion bounded above by D. It is easy to see that when D = 0, we have a lossless compression of the source. The rate-distortion function is meant to describe a fundamental limit for the performance of a coding algorithm and so can be used to evaluate the performance of different algorithms.
Figure 8.1 shows a typical rate-distortion function. Notice that the minimum possible rate at D = 0, no loss, is the entropy of the source data. The distortion corresponding to a rate R(D) = 0 is the maximum amount of distortion incurred when "nothing" is coded.

FIGURE 8.1: Typical rate-distortion function.

Finding a closed-form analytic description of the rate-distortion function for a given source is difficult, if not impossible. Gyorgy [1] presents analytic expressions of the rate-distortion function for various sources. For sources for which an analytic solution cannot be readily obtained, the rate-distortion function can be calculated numerically, using algorithms developed by Arimoto [2] and Blahut [3].

8.4 QUANTIZATION

Quantization in some form is the heart of any lossy scheme. Without quantization, we would indeed be losing little information. Here, we embark on a more detailed discussion of quantization than in Section 6.3.2.
The source we are interested in compressing may contain a large number of distinct output values (or even infinite, if analog). To efficiently represent the source output, we have to reduce the number of distinct values to a much smaller set, via quantization.
Each algorithm (each quantizer) can be uniquely determined by its partition of the input range, on the encoder side, and the set of output values, on the decoder side. The input and output of each quantizer can be either scalar values or vector values, thus leading to scalar quantizers and vector quantizers. In this section, we examine the design of both uniform and nonuniform scalar quantizers and briefly introduce the topic of vector quantization (VQ).

8.4.1 Uniform Scalar Quantization

A uniform scalar quantizer partitions the domain of input values into equally spaced intervals, except possibly at the two outer intervals. The endpoints of partition intervals are called the quantizer's decision boundaries. The output or reconstruction value corresponding to each interval is taken to be the midpoint of the interval. The length of each interval is referred to as the step size, denoted by the symbol Δ. Uniform scalar quantizers are of two types: midrise and midtread. A midtread quantizer has zero as one of its output values, whereas the midrise quantizer has a partition interval that brackets zero (see Figure 8.2). A midrise quantizer is used with an even number of output levels, and a midtread quantizer with an odd number.
A midtread quantizer is important when source data represents the zero value by fluctuating between small positive and negative numbers. Applying the midtread quantizer in this case would produce an accurate and steady representation of the value zero. For the special case Δ = 1, we can simply compute the output values for these quantizers as

    Q_midrise(x) = ceil(x) - 0.5    (8.4)
    Q_midtread(x) = floor(x + 0.5)    (8.5)
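For the special case Δ = 1, Eqs. (8.4) and (8.5) transcribe directly into code. A minimal sketch (not code from the text):

```python
import math

def q_midrise(x):
    """Eq. (8.4): midrise quantizer, step size 1; zero is never an output."""
    return math.ceil(x) - 0.5

def q_midtread(x):
    """Eq. (8.5): midtread quantizer, step size 1; zero is an output level."""
    return math.floor(x + 0.5)

# A value fluctuating around zero: midtread pins it to 0, midrise cannot.
for x in (-0.2, 0.1, 0.3):
    print(x, q_midrise(x), q_midtread(x))
```

Note how the midrise outputs for these small inputs are forced to ±0.5, while the midtread quantizer returns a steady 0, which is exactly the behavior described above.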
FIGURE 8.2: Uniform scalar quantizers: (a) midrise; (b) midtread.

The goal for the design of a successful uniform quantizer is to minimize the distortion for a given source input with a desired number of output values. This can be done by adjusting the step size Δ to match the input statistics.
Let's examine the performance of an M level quantizer. Let B = {b_0, b_1, ..., b_M} be the set of decision boundaries and Y = {y_1, y_2, ..., y_M} be the set of reconstruction or output values. Suppose the input is uniformly distributed in the interval [-X_max, X_max]. The rate of the quantizer is

    R = ceil(log2 M)    (8.6)

That is, R is the number of bits required to code M things — in this case, the M output levels.
The step size Δ is given by

    Δ = 2 X_max / M    (8.7)

since the entire range of input values is from -X_max to X_max. For bounded input, the quantization error caused by the quantizer is referred to as granular distortion. If the quantizer replaces a whole range of values, from a maximum value to ∞, and similarly for negative values, that part of the distortion is called the overload distortion.
To get an overall figure for granular distortion, notice that decision boundaries b_i for a midrise quantizer are [(i - 1)Δ, iΔ], i = 1 .. M/2, covering positive data X (and another half for negative X values). Output values y_i are the midpoints iΔ - Δ/2, i = 1 .. M/2, again just considering positive data. The total distortion is twice the sum over the positive data, or

    D_gran = 2 Σ_{i=1}^{M/2} ∫_{(i-1)Δ}^{iΔ} ( x - (2i - 1)Δ/2 )² (1 / (2 X_max)) dx    (8.8)

where we divide by the range of X to normalize to a value of at most 1.

FIGURE 8.3: Quantization error of a uniformly distributed source.

Since the reconstruction values y_i are the midpoints of each interval, the quantization error must lie within the values [-Δ/2, Δ/2]. Figure 8.3 is a graph of quantization error for a uniformly distributed source. The quantization error in this case is also uniformly distributed. Therefore, the average squared error is the same as the variance σ_d² of the quantization error calculated from just the interval [0, Δ] with error values in [-Δ/2, Δ/2]. The error value at x is e(x) = x - Δ/2, so the variance of errors is given by

    σ_d² = (1/Δ) ∫_0^Δ (e(x) - ē)² dx
         = (1/Δ) ∫_0^Δ (x - Δ/2 - 0)² dx
         = Δ² / 12    (8.9)
Similarly, the signal variance is σ_x² = (2 X_max)² / 12, so if the quantizer is n bits, M = 2^n, then from Eq. (8.2) we have

    SQNR = 10 log10 (σ_x² / σ_d²)
         = 10 log10 [ ((2 X_max)² / 12) / ((1/12) (2 X_max / M)²) ]
         = 10 log10 M² = 20 n log10 2    (8.10)
         = 6.02 n (dB)    (8.11)

Hence, we have rederived the formula (6.3) derived more simply in Section 6.1. From Eq. (8.11), we have the important result that increasing one bit in the quantizer increases the signal-to-quantization-noise ratio by 6.02 dB. More sophisticated estimates of D result from more sophisticated models of the probability distribution of errors.

8.4.2 Nonuniform Scalar Quantization

If the input source is not uniformly distributed, a uniform quantizer may be inefficient. Increasing the number of decision levels within the region where the source is densely distributed can effectively lower granular distortion. In addition, without having to increase the total number of decision levels, we can enlarge the region in which the source is sparsely distributed. Such nonuniform quantizers thus have nonuniformly defined decision boundaries.
There are two common approaches for nonuniform quantization: the Lloyd-Max quantizer and the companded quantizer, both introduced in Chapter 6.

Lloyd-Max Quantizer.* For a uniform quantizer, the total distortion is equal to the granular distortion, as in Eq. (8.8). If the source distribution is not uniform, we must explicitly consider its probability distribution (probability density function) f_X(x). Now we need the correct decision boundaries b_j and reconstruction values y_j, by solving for both simultaneously. To do so, we plug variables b_j, y_j into a total distortion measure

    D_gran = Σ_{j=1}^{M} ∫_{b_{j-1}}^{b_j} (x - y_j)² f_X(x) dx    (8.12)

Then we can minimize the total distortion by setting the derivative of Eq. (8.12) to zero. Differentiating with respect to y_j yields the set of reconstruction values

    y_j = ∫_{b_{j-1}}^{b_j} x f_X(x) dx / ∫_{b_{j-1}}^{b_j} f_X(x) dx    (8.13)

This says that the optimal reconstruction value is the weighted centroid of the x interval. Differentiating with respect to b_j and setting the result to zero yields

    b_j = (y_{j+1} + y_j) / 2    (8.14)

This gives a decision boundary b_j at the midpoint of two adjacent reconstruction values. Solving these two equations simultaneously is carried out by iteration. The result is termed the Lloyd-Max quantizer.

ALGORITHM 8.1 LLOYD-MAX QUANTIZATION

BEGIN
  Choose initial level set y_0
  i = 0
  Repeat
    Compute b_i using Equation 8.14
    i = i + 1
    Compute y_i using Equation 8.13
  Until |y_i - y_{i-1}| < ε
END

Starting with an initial guess of the optimal reconstruction levels, the algorithm above iteratively estimates the optimal boundaries, based on the current estimate of the reconstruction levels. It then updates the current estimate of the reconstruction levels, using the newly computed boundary information. The process is repeated until the reconstruction levels converge. For an example of the algorithm in operation, see Exercise 3.

Companded Quantizer. In companded quantization, the input is mapped by a compressor function G and then quantized using a uniform quantizer. After transmission, the quantized values are mapped back using an expander function G⁻¹. The block diagram for the companding process is shown in Figure 8.4, where X̂ is the quantized version of X. If the input source is bounded by x_max, then any nonuniform quantizer can be represented as a companded quantizer. The two commonly used companders are the μ-law and A-law companders (Section 6.1).

FIGURE 8.4: Companded quantization.
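As an illustrative sketch of the companding pipeline in Figure 8.4 (not code from the text), the following uses the μ-law compressor of Section 6.1 on input normalized to [-1, 1]; the helper names and the choice μ = 255 are assumptions made for this example:

```python
import math

MU = 255.0  # mu-law parameter; 255 is a common telephony choice (assumed here)

def compress(x):
    """mu-law compressor G, mapping [-1, 1] onto [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """mu-law expander G^-1, the exact inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def uniform_midtread(y, delta):
    """Uniform midtread quantizer with step size delta."""
    return delta * math.floor(y / delta + 0.5)

def companded_quantize(x, levels=16):
    """The Figure 8.4 pipeline: compress, uniform-quantize, expand."""
    delta = 2.0 / levels
    return expand(uniform_midtread(compress(x), delta))

# Companding spends its levels where the source is densely distributed (near 0):
print(abs(companded_quantize(0.01) - 0.01))  # small reconstruction error near 0
print(abs(companded_quantize(0.9) - 0.9))    # larger error near the extremes
```

The effective step size near zero is much finer than near the extremes, which is exactly the nonuniform-quantizer behavior described above.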
FIGURE 8.5: Basic vector quantization procedure.

8.4.3 Vector Quantization*

One of the fundamental ideas in Shannon's original work on information theory is that any compression system performs better if it operates on vectors or groups of samples rather than on individual symbols or samples. We can form vectors of input samples by concatenating a number of consecutive samples into a single vector. For example, an input vector might be a segment of a speech sample, a group of consecutive pixels in an image, or a chunk of data in any other format.
The idea behind vector quantization (VQ) is similar to that of scalar quantization but extended into multiple dimensions. Instead of representing values within an interval in one-dimensional space by a reconstruction value, as in scalar quantization, in VQ an n-component code vector represents vectors that lie within a region in n-dimensional space. A collection of these code vectors forms the codebook for the vector quantizer.
Since there is no implicit ordering of code vectors, as there is in the one-dimensional case, an index set is also needed to index into the codebook. Figure 8.5 shows the basic vector quantization procedure. In the diagram, the encoder finds the closest code vector to the input vector and outputs the associated index. On the decoder side, exactly the same codebook is used. When the coded index of the input vector is received, a simple table lookup is performed to determine the reconstruction vector.
Finding the appropriate codebook and searching for the closest code vector at the encoder end may require considerable computational resources. However, the decoder can execute quickly, since only a constant time operation is needed to obtain the reconstruction. Because of this property, VQ is attractive for systems with a lot of resources at the encoder end while the decoder has only limited resources, and the need is for quick execution time. Most multimedia applications fall into this category.

8.5 TRANSFORM CODING

From basic principles of information theory, we know that coding vectors is more efficient than coding scalars (see Section 7.4.2). To carry out such an intention, we need to group blocks of consecutive samples from the source input into vectors.
Let X = (x_1, x_2, ..., x_k)^T be a vector of samples. Whether our input data is an image, a piece of music, an audio or video clip, or even a piece of text, there is a good chance that a substantial amount of correlation is inherent among neighboring samples x_i. The rationale behind transform coding is that if Y is the result of a linear transform T of the input vector X in such a way that the components of Y are much less correlated, then Y can be coded more efficiently than X.
For example, if most information in an RGB image is contained in a main axis, rotating so that this direction is the first component means that luminance can be compressed differently from color information. This will approximate the luminance channel in the eye.
In higher dimensions than three, if most information is accurately described by the first few components of a transformed vector, the remaining components can be coarsely quantized, or even set to zero, with little signal distortion. The more decorrelated, that is, the less effect one dimension has on another (the more orthogonal the axes), the more chance we have of dealing differently with the axes that store relatively minor amounts of information without affecting reasonably accurate reconstruction of the signal from its quantized or truncated transform coefficients.
Generally, the transform T itself does not compress any data. The compression comes from the processing and quantization of the components of Y. In this section, we will study the Discrete Cosine Transform (DCT) as a tool to decorrelate the input signal. We will also examine the Karhunen-Loève Transform (KLT), which optimally decorrelates the components of the input X.

8.5.1 Discrete Cosine Transform (DCT)

The Discrete Cosine Transform (DCT), a widely used transform coding technique, is able to perform decorrelation of the input signal in a data-independent manner. Because of this, it has gained tremendous popularity. We will examine the definition of the DCT and discuss some of its properties, in particular the relationship between it and the more familiar Discrete Fourier Transform (DFT).

Definition of DCT. Let's start with the two-dimensional DCT. Given a function f(i, j) over two integer variables i and j (a piece of an image), the 2D DCT transforms it into a new function F(u, v), with integer u and v running over the same range as i and j. The general definition of the transform is

    F(u, v) = (2 C(u) C(v) / √(MN)) Σ_{i=0}^{M-1} Σ_{j=0}^{N-1} cos((2i + 1)uπ / 2M) cos((2j + 1)vπ / 2N) f(i, j)    (8.15)
where i, u = 0, 1, ..., M - 1, j, v = 0, 1, ..., N - 1, and the constants C(u) and C(v) are determined by

    C(ξ) = √2 / 2   if ξ = 0,
    C(ξ) = 1        otherwise.    (8.16)

In the JPEG image compression standard (see Chapter 9), an image block is defined to have dimension M = N = 8. Therefore, the definitions for the 2D DCT and its inverse (IDCT) in this case are as follows:

2D Discrete Cosine Transform (2D DCT).

    F(u, v) = (C(u) C(v) / 4) Σ_{i=0}^{7} Σ_{j=0}^{7} cos((2i + 1)uπ / 16) cos((2j + 1)vπ / 16) f(i, j)    (8.17)

where i, j, u, v = 0, 1, ..., 7, and the constants C(u) and C(v) are determined by Eq. (8.16).

2D Inverse Discrete Cosine Transform (2D IDCT). The inverse function is almost the same, with the roles of f(i, j) and F(u, v) reversed, except that now C(u)C(v) must stand inside the sums:

    f̃(i, j) = Σ_{u=0}^{7} Σ_{v=0}^{7} (C(u) C(v) / 4) cos((2i + 1)uπ / 16) cos((2j + 1)vπ / 16) F(u, v)    (8.18)

where i, j, u, v = 0, 1, ..., 7.
The 2D transforms are applicable to 2D signals, such as digital images. As shown below, the 1D version of the DCT and IDCT is similar to the 2D version.

1D Discrete Cosine Transform (1D DCT).

    F(u) = (C(u) / 2) Σ_{i=0}^{7} cos((2i + 1)uπ / 16) f(i)    (8.19)

An electrical signal with constant magnitude is known as a DC (direct current) signal. A common example is a battery that carries 1.5 or 9 volts DC. An electrical signal that changes its magnitude periodically at a certain frequency is known as an AC (alternating current) signal. A good example is the household electric power circuit, which carries electricity with a sinusoidal waveform at 110 volts AC, 60 Hz (or 220 volts, 50 Hz in many other countries).
Most real signals are more complex. Speech signals or a row of gray-level intensities in a digital image are examples of such 1D signals. However, any signal can be expressed as a sum of multiple signals that are sine or cosine waveforms at various amplitudes and frequencies. This is known as Fourier analysis. The terms DC and AC, originating in electrical engineering, are carried over to describe these components of a signal (usually) composed of one DC and several AC components.
If a cosine function is used, the process of determining the amplitudes of the AC and DC components of the signal is called a Cosine Transform, and the integer indices make it a Discrete Cosine Transform. When u = 0, Eq. (8.19) yields the DC coefficient; when u = 1, or 2 up to 7, it yields the first or second, etc., up to the seventh AC coefficient.
Eq. (8.20) shows the Inverse Discrete Cosine Transform. This uses a sum of the products of the DC or AC coefficients and the cosine functions to reconstruct (recompose) the function f(i). Since computing the DCT and IDCT involves some loss, f(i) is now denoted by f̃(i).
In short, the role of the DCT is to decompose the original signal into its DC and AC components; the role of the IDCT is to reconstruct (recompose) the signal. The DCT and IDCT use the same set of cosine functions; they are known as basis functions. Figure 8.6 shows the family of eight 1D DCT basis functions: u = 0 .. 7.
The DCT enables a new means of signal processing and analysis in the frequency domain. We mean to analyze blocks of eight pixels in an image, but we can begin by considering time-dependent signals, rather than space-dependent ones (since time-signal analysis is where the method originates).
Suppose f(i) represents a signal that changes with time i (we will not be bothered here by the convention that time is usually denoted as t). The 1D DCT transforms f(i), which is in the time domain, to F(u), which is in the frequency domain. The coefficients F(u) are known as the frequency responses and form the frequency spectrum of f(i).
Let's use some examples to illustrate frequency responses.

EXAMPLE 8.1

The left side of Figure 8.7(a) shows a DC signal with a magnitude of 100, i.e., f1(i) = 100.
wherei=0,1 7,u=0,1 7. Since we are examining lhe Discrete Cosine Transform, Lhe input signal is discrete, and its
domam is [0,7].
1D Inverso Discrete Cosine ‘&ansfonn (1D-IDCT). When u = 0, regardless of Lhe i value, ali Lhe cosine terms ia Eq. (8.19) become cosO,
which equais 1. Taldng into account that C(0) = ~J~/2, F1 (0) is given by
C(u) (2i + fluir (8.20)
= —cos 16
Fj(0) = iTU00+llOOl00~l00
wherei=0,l 7,u=O,l 7. + 1 100 + 1 100 + 1 100 + i~ 100)
283
One-Dimensionai DCT. Let’s examine lhe DCT for a one-dimensional signal; almost
ali concepts are readiiy extensibie to lhe 2D DCI’.
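The 1D DCT of Eq. (8.19) can be verified with a short program. The sketch below is a direct O(N^2) Python implementation (the text prescribes no particular language, so this is only an illustration); applied to the DC signal of Example 8.1, it gives F(0) = 2*sqrt(2)*100, approximately 283, and zero for every AC coefficient.

```python
import math

def dct_1d(f):
    """Direct (not fast) 8-point 1D DCT following Eq. (8.19)."""
    N = len(f)
    F = []
    for u in range(N):
        c = math.sqrt(2) / 2 if u == 0 else 1.0   # C(u) from Eq. (8.16)
        s = sum(math.cos((2 * i + 1) * u * math.pi / (2 * N)) * f[i]
                for i in range(N))
        F.append(c / 2 * s)
    return F

# Example 8.1: a DC signal of magnitude 100.
f1 = [100] * 8
F1 = dct_1d(f1)
# F1[0] = 2*sqrt(2)*100, about 282.8; all AC coefficients are (numerically) zero
print(F1[0])
```

The same routine can be pointed at the AC signal of Figure 8.7(b), where only the coefficient for u = 2 survives.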
FIGURE 8.6: The 1D DCT basis functions.

When u = 1, F_1(u) is as below. Because \cos\frac{\pi}{16} = -\cos\frac{15\pi}{16}, \cos\frac{3\pi}{16} = -\cos\frac{13\pi}{16}, etc.,
and C(1) = 1, we have

F_1(1) = \frac{1}{2} \left(\cos\frac{\pi}{16} \cdot 100 + \cos\frac{3\pi}{16} \cdot 100 + \cos\frac{5\pi}{16} \cdot 100 + \cos\frac{7\pi}{16} \cdot 100 + \cos\frac{9\pi}{16} \cdot 100 + \cos\frac{11\pi}{16} \cdot 100 + \cos\frac{13\pi}{16} \cdot 100 + \cos\frac{15\pi}{16} \cdot 100\right) = 0

FIGURE 8.7: Examples of 1D Discrete Cosine Transform: (a) a DC signal f_1(i); (b) an AC
signal f_2(i); (c) f_3(i) = f_1(i) + f_2(i); and (d) an arbitrary signal f(i).

Similarly, it can be shown that F_1(2) = F_1(3) = \cdots = F_1(7) = 0. The 1D-DCT result
F_1(u) for this DC signal f_1(i) is depicted on the right side of Figure 8.7(a) — only a DC
(i.e., first) component of F is nonzero.
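The eight basis functions plotted in Figure 8.6 are just sampled cosines, cos((2i + 1)u pi / 16). A short Python sketch (the language choice is ours, not the text's) generates them and spot-checks two properties that this section relies on:

```python
import math

# The eight 1D DCT basis functions of Figure 8.6, sampled at i = 0..7.
basis = [[math.cos((2 * i + 1) * u * math.pi / 16) for i in range(8)]
         for u in range(8)]

# u = 0 is the flat (DC) basis function; u = 7 alternates most rapidly,
# completing three and a half cycles over [0, 7].
print([round(v, 3) for v in basis[0]])   # eight 1.0's
print([round(v, 3) for v in basis[7]])
```

The signs of `basis[7]` alternate at every sample, and any two distinct basis functions have a zero dot product, which is the orthogonality property discussed later in the chapter.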
EXAMPLE 8.2

The left side of Figure 8.7(b) shows a discrete cosine signal f_2(i). Incidentally (or, rather,
purposely), it has the same frequency and phase as the second cosine basis function, and its
amplitude is 100.

When u = 0, again, all the cosine terms in Eq. (8.19) equal 1. Because \cos\frac{\pi}{8} = -\cos\frac{7\pi}{8},
\cos\frac{3\pi}{8} = -\cos\frac{5\pi}{8}, and so on, we have

F_2(0) = \frac{\sqrt{2}}{2} \cdot \frac{1}{2} \left(100\cos\frac{\pi}{8} + 100\cos\frac{3\pi}{8} + 100\cos\frac{5\pi}{8} + 100\cos\frac{7\pi}{8} + 100\cos\frac{9\pi}{8} + 100\cos\frac{11\pi}{8} + 100\cos\frac{13\pi}{8} + 100\cos\frac{15\pi}{8}\right) = 0

To calculate F_2(2), we first note that when u = 2, because \cos\frac{3\pi}{8} = \sin\frac{\pi}{8}, we have

\cos\frac{\pi}{8} \cdot \cos\frac{\pi}{8} + \cos\frac{3\pi}{8} \cdot \cos\frac{3\pi}{8} = \cos^2\frac{\pi}{8} + \sin^2\frac{\pi}{8} = 1

Similarly,

\cos\frac{5\pi}{8} \cdot \cos\frac{5\pi}{8} + \cos\frac{7\pi}{8} \cdot \cos\frac{7\pi}{8} = 1

\cos\frac{9\pi}{8} \cdot \cos\frac{9\pi}{8} + \cos\frac{11\pi}{8} \cdot \cos\frac{11\pi}{8} = 1

\cos\frac{13\pi}{8} \cdot \cos\frac{13\pi}{8} + \cos\frac{15\pi}{8} \cdot \cos\frac{15\pi}{8} = 1

Then we end up with

F_2(2) = \frac{1}{2} \left(\cos\frac{\pi}{8}\cos\frac{\pi}{8} + \cos\frac{3\pi}{8}\cos\frac{3\pi}{8} + \cos\frac{5\pi}{8}\cos\frac{5\pi}{8} + \cos\frac{7\pi}{8}\cos\frac{7\pi}{8} + \cos\frac{9\pi}{8}\cos\frac{9\pi}{8} + \cos\frac{11\pi}{8}\cos\frac{11\pi}{8} + \cos\frac{13\pi}{8}\cos\frac{13\pi}{8} + \cos\frac{15\pi}{8}\cos\frac{15\pi}{8}\right) \cdot 100
       = \frac{1}{2}(1 + 1 + 1 + 1) \cdot 100 = 200

We will not show the other derivations in detail. It turns out that F_2(1) = F_2(3) =
F_2(4) = \cdots = F_2(7) = 0.

EXAMPLE 8.3

In the third row of Figure 8.7 the input signal to the DCT is now the sum of the previous
two signals — that is, f_3(i) = f_1(i) + f_2(i). The output F(u) values are

F_3(0) = 283,
F_3(2) = 200,
F_3(1) = F_3(3) = F_3(4) = \cdots = F_3(7) = 0

Thus we discover that F_3(u) = F_1(u) + F_2(u).

EXAMPLE 8.4

The fourth row of the figure shows an arbitrary (or at least relatively complex) input signal
f(i) and its DCT output F(u):

f(i) (i = 0..7):   85  -65  15  30  -56  35  90  60
F(u) (u = 0..7):   69  -49  74  11   16  117  44  -5

Note that in this more general case, all the DCT coefficients F(u) are nonzero and some
are negative.

From the above examples, the characteristics of the DCT can be summarized as follows:

1. The DCT produces the frequency spectrum F(u) corresponding to the spatial signal
f(i).

In particular, the 0th DCT coefficient F(0) is the DC component of the signal f(i).
Up to a constant factor (i.e., \frac{\sqrt{2}}{2} \cdot \frac{1}{2} \cdot 8 = 2\sqrt{2} in the 1D DCT and \frac{\sqrt{2}}{2} \cdot \frac{\sqrt{2}}{2} \cdot \frac{1}{4} \cdot 64 = 8
in the 2D DCT), F(0) equals the average magnitude of the signal. In Figure 8.7(a),
the average magnitude of the DC signal is obviously 100, and F(0) = 2\sqrt{2} \times 100;
in Figure 8.7(b), the average magnitude of the AC signal is 0, and so is F(0); in
Figure 8.7(c), the average magnitude of f_3(i) is apparently 100, and again we have
F(0) = 2\sqrt{2} \times 100.

The other seven DCT coefficients reflect the various changing (i.e., AC) components
of the signal f(i) at different frequencies. If we denote F(1) as AC1, F(2) as AC2,
..., F(7) as AC7, then AC1 is the first AC component, which completes half a cycle
as a cosine function over [0, 7]; AC2 completes a full cycle; AC3 completes one and
one-half cycles; and AC7, three and a half cycles. All these are, of course, due to
the cosine basis functions, which are arranged in exactly this manner. In other words,
the second basis function corresponds to AC1, the third corresponds to AC2, and so
on. In the example in Figure 8.7(b), since the signal f_2(i) and the third basis function
have exactly the same cosine waveform, with identical frequency and phase, they will
reach the maximum (positive) and minimum (negative) values synchronously. As a
result, their products are always positive, and the sum of their products (F_2(2) or AC2)
is large. It turns out that all other AC coefficients are zero, since f_2(i) and all the
other basis functions happen to be orthogonal. (We will discuss orthogonality later in
this chapter.)

It should be pointed out that the DCT coefficients can easily take on negative values.
For DC, this occurs when the average of f(i) is less than zero. (For an image, this
never happens, so the DC is nonnegative.) For AC, a special case occurs when f(i)
and some basis function have the same frequency but one of them happens to be half
a cycle behind — this yields a negative coefficient, possibly with a large magnitude.

In general, signals will look more like the one in Figure 8.7(d). Then f(i) will produce
many nonzero AC components, with the ones toward AC7 indicating higher frequency
content. A signal will have large (positive or negative) response in its high-frequency
components only when it alternates rapidly within the small range [0, 7].

As an example, if AC7 is a large positive number, this indicates that the signal f(i)
has a component that alternates synchronously with the eighth basis function — three
and a half cycles. According to the Nyquist theorem, this is the highest frequency in
the signal that can be sampled with eight discrete values without significant loss and
aliasing.

2. The DCT is a linear transform.

In general, a transform T (or function) is linear, iff

T(\alpha p + \beta q) = \alpha T(p) + \beta T(q)    (8.21)

where \alpha and \beta are constants, and p and q are any functions, variables or constants.
From the definition in Eq. (8.19), this property can readily be proven for the DCT,
because it uses only simple arithmetic operations.

One-Dimensional Inverse DCT. Let's finish the example in Figure 8.7(d) by showing
its inverse DCT (IDCT). Recall that F(u) contains the following:

F(u) (u = 0..7):   69  -49  74  11  16  117  44  -5

The 1D IDCT, as indicated in Eq. (8.20), can readily be implemented as a loop with eight
iterations, as illustrated in Figure 8.8.

Iteration 0: \tilde{f}(i) = \frac{C(0)}{2} \cdot \cos 0 \cdot F(0) = \frac{\sqrt{2}}{4} \cdot 1 \cdot 69 \approx 24.3

Iteration 1: \tilde{f}(i) = \frac{C(0)}{2} \cdot \cos 0 \cdot F(0) + \frac{C(1)}{2} \cdot \cos\frac{(2i+1)\pi}{16} \cdot F(1)
\approx 24.3 + \frac{1}{2} \cdot (-49) \cdot \cos\frac{(2i+1)\pi}{16} \approx 24.3 - 24.5 \cos\frac{(2i+1)\pi}{16}

Iteration 2: \tilde{f}(i) \approx 24.3 - 24.5 \cos\frac{(2i+1)\pi}{16} + 37 \cos\frac{(2i+1) \cdot 2\pi}{16}

FIGURE 8.8: An example of 1D IDCT.

After iteration 0, \tilde{f}(i) has a constant value of approximately 24.3, which is the recovery
of the DC component in f(i); after iteration 1, \tilde{f}(i) \approx 24.3 - 24.5 \cos\frac{(2i+1)\pi}{16}, which is
the sum of the DC and first AC component; after iteration 2, \tilde{f}(i) reflects the sum of DC and
AC1 and AC2; and so on. As shown, the process of the sum-of-products in IDCT eventually
reconstructs (recomposes) the function \tilde{f}(i), which is approximately

\tilde{f}(i) (i = 0..7):   85  -65  15  30  -56  35  90  60

As it happens, even though we went from integer to integer via intermediate floats, we
recovered the signal exactly. This is not always true, but the answer is always close.
The Cosine Basis Functions. For a better decomposition, the basis functions should
be orthogonal, so as to have the least redundancy amongst them.

Functions B_p(i) and B_q(i) are orthogonal if

\sum_i [B_p(i) \cdot B_q(i)] = 0 \quad \text{if } p \neq q    (8.22)

Functions B_p(i) and B_q(i) are orthonormal if they are orthogonal and

\sum_i [B_p(i) \cdot B_q(i)] = 1 \quad \text{if } p = q    (8.23)

The orthonormal property is desirable. With this property, the signal is not amplified
during the transform. When the same basis function is used in both the transformation
and its inverse (sometimes called forward transform and backward transform), we will get
(approximately) the same signal back.

It can be shown that

\sum_i \left[\cos\frac{(2i+1)p\pi}{16} \cdot \cos\frac{(2i+1)q\pi}{16}\right] = 0 \quad \text{if } p \neq q

\sum_i \left[\frac{C(p)}{2}\cos\frac{(2i+1)p\pi}{16} \cdot \frac{C(q)}{2}\cos\frac{(2i+1)q\pi}{16}\right] = 1 \quad \text{if } p = q

The cosine basis functions in the DCT are indeed orthogonal. With the help of constants
C(p) and C(q) they are also orthonormal. (Now we understand why constants C(u) and
C(v) in the definitions of the DCT and IDCT seemed to have taken some arbitrary values.)
Recall that because of the orthogonality, for f_2(i) in Figure 8.7(b), only F_2(2) (for u = 2)
has a nonzero output, whereas all other DCT coefficients are zero. This is desirable for some
signal processing and analysis in the frequency domain, since we are now able to precisely
identify the frequency components in the original signal.

The cosine basis functions are analogous to the basis vectors \vec{x}, \vec{y}, \vec{z} for the 3D Cartesian
space, or the so-called 3D vector space. The vectors are orthonormal, because

\vec{x} \cdot \vec{y} = (1, 0, 0) \cdot (0, 1, 0) = 0
\vec{x} \cdot \vec{z} = (1, 0, 0) \cdot (0, 0, 1) = 0
\vec{y} \cdot \vec{z} = (0, 1, 0) \cdot (0, 0, 1) = 0
\vec{x} \cdot \vec{x} = (1, 0, 0) \cdot (1, 0, 0) = 1
\vec{y} \cdot \vec{y} = (0, 1, 0) \cdot (0, 1, 0) = 1
\vec{z} \cdot \vec{z} = (0, 0, 1) \cdot (0, 0, 1) = 1

Any point P = (x_p, y_p, z_p) can be represented by a vector \vec{OP} = (x_p, y_p, z_p), where
O is the origin, which can in turn be decomposed into x_p \cdot \vec{x} + y_p \cdot \vec{y} + z_p \cdot \vec{z}.

FIGURE 8.9: Graphical illustration of the 8 x 8 2D DCT basis.

If we view the sum-of-products operation in Eq. (8.19) as the dot product of one of the
discrete cosine basis functions (for a specified u) and the signal f(i), then the analogy
between the DCT and the Cartesian projection is remarkable. Namely, to get the x-coordinate
of point P, we simply project P onto the x axis. Mathematically, this is equivalent to a dot
product \vec{x} \cdot \vec{OP} = x_p. Obviously, the same goes for obtaining y_p and z_p.

Now, compare this to the example in Figure 8.7(b), for a point P = (0, 5, 0) in the
Cartesian space. Only its projection onto the y axis is y_p = 5, and its projections onto the
x and z axes are both 0.

2D Basis Functions. For two-dimensional DCT functions, we use the basis shown as
8 x 8 images. These are depicted in Figure 8.9, where white indicates positive values and
black indicates negative. To obtain DCT coefficients, we essentially just form the inner
product of each of these 64 basis images with an 8 x 8 block from an original image. Notice
that now we are talking about an original signal indexed by space, not time. We do this for
each 8 x 8 image block. The 64 products we calculate make up an 8 x 8 spatial frequency
image F(u, v).

2D Separable Basis. Of course, for speed, most software implementations use fixed-
point arithmetic to calculate the DCT transform. Just as there is a mathematically derived
Fast Fourier Transform, there is also a Fast DCT. Some fast implementations approximate
coefficients so that all multiplies are shifts and adds. Moreover, a much simpler mechanism
is used to produce 2D DCT coefficients — factorization into two 1D DCT transforms.

When the block size is 8, the 2D DCT can be separated into a sequence of two 1D DCT
steps. First, we calculate an intermediate function G(i, v) by performing a 1D DCT on each
column — in this way, we have gone over to frequency space for the columns, but not for
the rows:

G(i, v) = \frac{1}{2} C(v) \sum_{j=0}^{7} \cos\frac{(2j+1)v\pi}{16}\, f(i, j)    (8.24)

Then we calculate another 1D DCT, this time replacing the row dimension by its frequency
counterpart:

F(u, v) = \frac{1}{2} C(u) \sum_{i=0}^{7} \cos\frac{(2i+1)u\pi}{16}\, G(i, v)    (8.25)

This is possible because the 2D DCT basis functions are separable (multiply separate func-
tions of i and j). It is straightforward to see that this simple change saves many arithmetic
steps. The number of iterations required is reduced from 8 x 8 to 8 + 8.

Comparison of DCT and DFT. The discrete cosine transform [4] is a close counterpart
to the Discrete Fourier Transform (DFT), and in the world of signal processing, the latter
is likely the more common. We have started off with the DCT instead because it is simpler
and is also much used in multimedia. Nevertheless, we should not entirely ignore the DFT.

For a continuous signal, we define the continuous Fourier transform \mathcal{F} as follows:

\mathcal{F}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt    (8.26)

Using Euler's formula, we have

e^{ix} = \cos(x) + i \sin(x)    (8.27)

Thus, the continuous Fourier transform is composed of an infinite sum of sine and cosine
terms. Because digital computers require us to discretize the input signal, we define a DFT
that operates on eight samples of the input signal \{f_0, f_1, ..., f_7\} as

F_\omega = \sum_{x=0}^{7} f_x \cdot e^{-\frac{2\pi i \omega x}{8}}    (8.28)

Writing the sine and cosine terms explicitly, we have

F_\omega = \sum_{x=0}^{7} f_x \cos\left(\frac{2\pi \omega x}{8}\right) - i \sum_{x=0}^{7} f_x \sin\left(\frac{2\pi \omega x}{8}\right)    (8.29)

Even without giving an explicit definition of the DCT, we can guess that the DCT is likely
a transform that involves only the real part of the DFT. The intuition behind the formulation of
the DCT that allows it to use only the cosine basis functions of the DFT is that we can cancel
out the imaginary part of the DFT by making a symmetric copy of the original input signal.

This works because sine is an odd function; thus, the contributions from the sine terms
cancel each other out when the signal is symmetrically extended. Therefore, the DCT of
eight input samples corresponds to the DFT of 16 samples made up of the original eight
input samples and a symmetric copy of these, as in Figure 8.10.

FIGURE 8.10: Symmetric extension of the ramp function.

With the symmetric extension, the DCT is now working on a triangular wave, whereas
the DFT tries to code the repeated ramp. Because the DFT is trying to model the artificial
discontinuity created between each copy of the samples of the ramp function, a lot of high-
frequency components are needed. (Refer to [4] for a thorough discussion and comparison
of DCT and DFT.)

TABLE 8.1: DCT and DFT coefficients of the ramp function.

Table 8.1 shows the calculated DCT and DFT coefficients. We can see that more energy is
concentrated in the first few coefficients in the DCT than in the DFT. If we try to approximate
Section 8.5 Transform Coding 221
220 Chapter 8 Lossy CompressiOn Algorithms

where RxU, s) = E[X~X,] is lhe autocorrelation function. Our goal is Lo finda transform
7 T such ihaL lhe components of lhe oulput Y are uncontlated — thal is, E[Y~Y~) = 0, if
s. Thus, Lhe autocorrelalion matrix of Y Lakes on Lhe fonii of a positive diagonal malrix.
6 Since any autocorrelation malrix is synmielric and nonnegalive definhe, there are k or
thogonal eigenvectors u1, 112 11k and k corresponding real and nonnegative eigenvalues
5
À1 > À2> -.. > À~ > 0. We define Lhe Karhunen-Loève transform as
4
y y T=[ui,u2,’-’ ,UkJT (8.32)
3
Then, lhe autocorrelalion malrix of Y becomes
2
Ry = £[YYT] (8.33)
= E[TXXTTI (8.34)

o = TRxTT (8.35)
1 234567 À1 O O
x .-.

O À2 O
(a) (b) = . (8.36)
O:
FIGURE 8.11: ApproximatiOn of lhe ramp function: (a) three-term DCT approximation; (b) O O -.. À~
Lhree-Lerm DEI’ approximatiOn.
Clearly, we have lhe required autocorrelation malrix for Y. Therefore, Lhe KLT is optimal,
in Lhe sense that iL completely decorrelates Lhe input. In addilion, since Lhe KLT depends
lhe original ramp function using only lhree terms of both lhe DCI’ and DEI’, we notice that on lhe computalion of lhe aulocorrelation malrix of Lhe input vector, it is data dependenl: it
lhe D~ approximatiOn is much closer. Figure 8.11 shows lhe comparison. has Lo be computed for every dalasel.

8.5.2 Karhunen—LOêVe TransfOflfl* EXAMPLE 8.5


The Karhunen—LOè ve Transform (KLT) is a reversible linear transform that exploits lhe sta
Listical properties of lhe vector representation. Its primary property is tal it optimally decor To illustrale Lhe mechanics of Lhe KLT, consider lhe four 3D inpul vectors xi = (4,4, 5),
= (3,2,5), x3 = (5,7,6), and X4 = (6,7,7). To find lhe required iransform, we must
relates lhe input. To do so, il fits na n-dimensional ellipsoid around lhe (mean-subtracted)
first estimate lhe autocontialion malrix of lhe input. The mcmi of lhe four input veclors is
data. The main ellipsoid axis is Lhe major direction of change in lhe data.
Think of a cigar that has unfortnnately been stepped on. Cigar data consisls of a cloud
of poinls in 3-space giving lhe coordinates of positions of measured points in lhe cigar. The
m1=—I
1 r 2018
long axis of Lhe cigar will be identified by a statistical program as lhe first KLT axis. The
second mosl important axis is the horizontal axis across lhe squashed cigar, perpendicular 4L23
Lo lhe firsl axis. The third axis is orthogonal Lo both and is in lhe vertical, Lhin direetion. A
We can estimate Lhe aulocorrelation matnix using the fonnula
KLT component program carnes oul just this analysis.
To understand lhe optimality of lhe KLT, consider lhe autocorrelalion matnix Rx of Lhe
inpul vector X, defined as Rx = x~xT m~mT
— (8.37)
(8.30)
Rx = E[XXTI
Rx(l, 1) Rx(l, 2) ..- Rx(I, k) where n is lhe number of input vectors. From lhis equation, we obtain
Rx(2, 1) Rx(l,l) -- Rx(2,k1) (8.3 1)

Rx =
r 2.25
1.25 2.25 0.88
4.50 1.50
Rx(k, 1) Rx(k—l,l) ... Rx(l,l)
[ 0.88 1.50 0.69
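The arithmetic of this example is easy to reproduce with a few lines of NumPy (an assumed tool here, not something the text uses). Note that eigenvector signs are arbitrary, so transformed coordinates obtained this way may differ from those in the text by a sign.

```python
import numpy as np

# The four 3D input vectors of Example 8.5, one per row.
X = np.array([[4, 4, 5],
              [3, 2, 5],
              [5, 7, 6],
              [6, 7, 7]], dtype=float)

m = X.mean(axis=0)                         # m_x = (1/4) [18, 20, 23]^T
Rx = (X.T @ X) / len(X) - np.outer(m, m)   # Eq. (8.37)
print(Rx)   # approx [[1.25 2.25 0.88], [2.25 4.5 1.5], [0.88 1.5 0.69]]

# Eigenvectors of Rx, sorted by decreasing eigenvalue, form the KLT matrix T.
evals, evecs = np.linalg.eigh(Rx)          # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]
evals, T = evals[order], evecs[:, order].T
Y = T @ (X - m).T                          # mean-subtracted inputs, transformed
```

Because the rows of `T` are orthonormal, `T.T @ Y + m` recovers the original vectors, which is exactly the inverse relation given below in Eq. (8.38).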
The eigenvalues of R_X are \lambda_1 = 6.1963, \lambda_2 = 0.2147, and \lambda_3 = 0.0264. Clearly, the first
component is by far the most important. The corresponding eigenvectors are

u_1 = \begin{bmatrix} 0.4385 \\ 0.8471 \\ 0.3003 \end{bmatrix} \quad u_2 = \begin{bmatrix} 0.4460 \\ -0.4952 \\ 0.7456 \end{bmatrix} \quad u_3 = \begin{bmatrix} -0.7803 \\ 0.1929 \\ 0.5949 \end{bmatrix}

Therefore, the KLT is given by the matrix

T = \begin{bmatrix} 0.4385 & 0.8471 & 0.3003 \\ 0.4460 & -0.4952 & 0.7456 \\ -0.7803 & 0.1929 & 0.5949 \end{bmatrix}

Subtracting the mean vector from each input vector and applying the KLT, we have

y_1 = \begin{bmatrix} -1.2916 \\ -0.2870 \\ -0.2490 \end{bmatrix} \quad y_2 = \begin{bmatrix} -3.4242 \\ 0.2573 \\ 0.1453 \end{bmatrix} \quad y_3 = \begin{bmatrix} 1.9885 \\ -0.5809 \\ 0.1445 \end{bmatrix} \quad y_4 = \begin{bmatrix} 2.7273 \\ 0.6107 \\ -0.0408 \end{bmatrix}

Since the rows of T are orthonormal vectors, the inverse transform is just the transpose:
T^{-1} = T^T. We can obtain the original vectors from the transform coefficients using the
inverse relation

x = T^T y + m_x    (8.38)

In terms of the transform coefficients y_i, the magnitude of the first few components is
usually considerably larger than that of the other components. In general, after the KLT, most
of the "energy" of the transform coefficients is concentrated within the first few components.
This is the energy compaction property of the KLT.

For an input vector x with n components, if we coarsely quantize the output vector y by
setting its last k components to zero, calling the resulting vector \hat{y}, the KLT minimizes the
mean squared error between the original vector and its reconstruction.

8.6 WAVELET-BASED CODING

8.6.1 Introduction

Decomposing the input signal into its constituents allows us to apply coding techniques
suitable for each constituent, to improve compression performance. Consider again a time-
dependent signal f(t) (it is best to base discussion on continuous functions to start with).
The traditional method of signal decomposition is the Fourier transform. Above, in our
discussion of the DCT, we considered a special cosine-based transform. If we carry out
analysis based on both sine and cosine, then a concise notation assembles the results into a
function \mathcal{F}(\omega), a complex-valued function of real-valued frequency \omega given in Eq. (8.26).
Such decomposition results in very fine resolution in the frequency domain. However, since
a sinusoid is theoretically infinite in extent in time, such a decomposition gives no temporal
resolution.

Another method of decomposition that has gained a great deal of popularity in recent
years is the wavelet transform. It seeks to represent a signal with good resolution in both
time and frequency, by using a set of basis functions called wavelets.

There are two types of wavelet transforms: the Continuous Wavelet Transform (CWT) and
the Discrete Wavelet Transform (DWT). We assume that the CWT is applied to the large class
of functions f(x) that are square integrable on the real line — that is, \int [f(x)]^2\, dx < \infty.
In mathematics, this is written as f(x) \in L^2(\mathbb{R}).

The other kind of wavelet transform, the DWT, operates on discrete samples of the input
signal. The DWT resembles other discrete linear transforms, such as the DFT or the DCT,
and is very useful for image processing and compression.

Before we begin a discussion of the theory of wavelets, let's develop an intuition about this
approach by going through an example using the simplest wavelet transform, the so-called
Haar Wavelet Transform, to form averages and differences of a sequence of float values.

If we repeatedly take averages and differences and keep results for every step, we effec-
tively create a multiresolution analysis of the sequence. For images, this would be equivalent
to creating smaller and smaller summary images, one-quarter the size for each step, and
keeping track of differences from the average as well. Mentally stacking the full-size image,
the quarter-size image, the sixteenth-size image, and so on, creates a pyramid. The full set,
along with difference images, is the multiresolution decomposition.

EXAMPLE 8.6 A Simple Wavelet Transform

The objective of the wavelet transform is to decompose the input signal, for compression
purposes, into components that are easier to deal with, have special interpretations, or have
some components that can be thresholded away. Furthermore, we want to be able to at
least approximately reconstruct the original signal, given these components. Suppose we
are given the following input sequence:

\{x_{n,i}\} = \{10, 13, 25, 26, 29, 21, 7, 15\}    (8.39)

Here, i \in [0..7] indexes "pixels", and n stands for the level of a pyramid we are on. At the
top, n = 3 for this sequence, and we shall form three more sequences, for n = 2, 1, and 0.
At each level, less information will be retained in the beginning elements of the transformed
signal sequence. When we reach pyramid level n = 0, we end up with the sequence average
stored in the first element. The remaining elements store detail information.

Consider the transform that replaces the original sequence with its pairwise average
x_{n-1,i} and difference d_{n-1,i}, defined as follows:

x_{n-1,i} = \frac{x_{n,2i} + x_{n,2i+1}}{2}    (8.40)

d_{n-1,i} = \frac{x_{n,2i} - x_{n,2i+1}}{2}    (8.41)
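The pairwise averaging and differencing of Eqs. (8.40) and (8.41) takes only a few lines of code. The sketch below (in Python, a choice of ours rather than of the text) applies one level to the input of Eq. (8.39) and inverts it again; repeating the forward step on the average half yields the coarser levels derived next.

```python
def haar_level(x):
    """One level of the discrete Haar wavelet transform, Eqs. (8.40)-(8.41)."""
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    diff = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return avg, diff

def haar_level_inverse(avg, diff):
    """Invert one level using Eq. (8.43): x = a + d, a - d for each pair."""
    x = []
    for a, d in zip(avg, diff):
        x += [a + d, a - d]
    return x

x3 = [10, 13, 25, 26, 29, 21, 7, 15]   # the input sequence of Eq. (8.39)
avg2, d2 = haar_level(x3)
print(avg2 + d2)   # the level-2 sequence: averages, then differences
```

Calling `haar_level` again on `avg2`, and then on its average half, produces the level-1 and level-0 sequences, with the overall average 18.25 landing in the first element.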
Notice that the averages and differences are applied only on consecutive pairs of input
sequences whose first element has an even index. Therefore, the number of elements
in each set \{x_{n-1,i}\} and \{d_{n-1,i}\} is exactly half the number of elements in the original
sequence. We can form a new sequence having length equal to that of the original sequence
by concatenating the two sequences \{x_{n-1,i}\} and \{d_{n-1,i}\}. The resulting sequence is thus

\{x_{n-1,i}, d_{n-1,i}\} = \{11.5, 25.5, 25, 11, -1.5, -0.5, 4, -4\}    (8.42)

where we are now at level n - 1 = 2. This sequence has exactly the same number of
elements as the input sequence — the transform did not increase the amount of data. Since
the first half of the above sequence contains averages from the original sequence, we can
view it as a coarser approximation to the original signal.

The second half of this sequence can be viewed as the details or approximation errors
of the first half. Most of the values in the detail sequence are much smaller than those of
the original sequence. Thus, most of the energy is effectively concentrated in the first half.
Therefore, we can potentially store \{d_{n-1,i}\} using fewer bits.

It is easily verified that the original sequence can be reconstructed from the transformed
sequence, using the relations

x_{n,2i} = x_{n-1,i} + d_{n-1,i}
x_{n,2i+1} = x_{n-1,i} - d_{n-1,i}    (8.43)

This transform is the discrete Haar wavelet transform. Averaging and differencing can be
carried out by applying a so-called scaling function and wavelet function along the signal.
Figure 8.12 shows the Haar version of these functions.

FIGURE 8.12: Haar Wavelet Transform: (a) scaling function; (b) wavelet function.

We can further apply the same transform to \{x_{n-1,i}\}, to obtain another level of approxi-
mation x_{n-2,i} and detail d_{n-2,i}:

\{x_{n-2,i}, d_{n-2,i}, d_{n-1,i}\} = \{18.5, 18, -7, 7, -1.5, -0.5, 4, -4\}    (8.44)

This is the essential idea of multiresolution analysis. We can now study the input signal in
three different scales, along with the details needed to go from one scale to another. This
process can continue n times, until only one element is left in the approximation sequence.
In this case, n = 3, and the final sequence is given below:

\{x_{n-3,i}, d_{n-3,i}, d_{n-2,i}, d_{n-1,i}\} = \{18.25, 0.25, -7, 7, -1.5, -0.5, 4, -4\}    (8.45)

Now we realize that n was 3 because only three resolution changes were available until we
reached the final form.

The value 18.25, corresponding to the coarsest approximation to the original signal, is
the average of all the elements in the original sequence. From this example, it is easy to see
that the cost of computing this transform is proportional to the number of elements N in the
input sequence — that is, O(N).

Extending the one-dimensional Haar wavelet transform into two dimensions is relatively
easy: we simply apply the one-dimensional transform to the rows and columns of the two-
dimensional input separately. We will demonstrate the two-dimensional Haar transform
applied to the 8 x 8 input image shown in Figure 8.13.

  0   0    0    0    0    0   0   0
  0   0    0    0    0    0   0   0
  0   0   63  127  127   63   0   0
  0   0  127  255  255  127   0   0
  0   0  127  255  255  127   0   0
  0   0   63  127  127   63   0   0
  0   0    0    0    0    0   0   0
  0   0    0    0    0    0   0   0

FIGURE 8.13: Input image for the 2D Haar Wavelet Transform: (a) pixel values; (b) an 8 x 8
image.

EXAMPLE 8.7 2D Haar Transform

This example of the 2D Haar transform not only serves to illustrate how the wavelet trans-
form is applied to two-dimensional inputs but also points out useful interpretations of the
Section 8.6 Wavelet-Based Coding 227
226 Chapter 8 Lossy CompressiOn Algorithms

FIGURE 8.16: A simple graphical illustration of the Wavelet Transform.


FIGURE 8.14: Intentediate output of Lhe 21) Haar Wavelet Transfonn.

viewed as a low-pass-filLered version of the original image, in the sense that higher-frequency
transformed coefficients. However, it is intended oniy to provide Lhe reader with an intuitive edge informaLion is iost, while iow-spaLial-frequency smooLh infonnation is reLained.
feeiing of the kinds of operations involved in perfonning a general 21) waveiet Lransform. The upper right quadrant contains the vertical averages of Lhe horizontal differences
Subsequent sections provide moredetaiied description of the forward and inverse 2D waveiet and can be interpreted as information about the vertical edger within lhe original image.
transform aigorithms, as well as a more elaborate example using a more compiex wavelet. Similarly, Lhe lower left quadrant conLains Lhe vertical differences of Lhe horizonLal averages
and represents Lhe horizontal edges in Lhe original image. The lower right quadranL conLains
21) [bar Wavelet Wansform. We begin by appiying a one.dirnensional Haar wavelet Lhe differences from both Lhe horizonLal and vertical passes. The coefficients in this quadrant
transform to each row of the input. Tbe firsL and last Lwo rows ol the input are trivial. After represent diagonal edges.
performing the averaging and differenciiig operations on the remaining rows, we obtain tbe These interpretaLions are shown more clearly as images in Figure 8.16, where bright
inLermediate output shown in Figure 8.14. pixeis code positive and dark pixels code negaLive image values.
We continue by appiying Lhe sarne 1 D Haar transform to each column of the intermediate The inverse of Lhe 21) Haar transform can be calculated by firsL inverting Lhe columns
output. This step completes one levei of Lhe 2D [bar transform. Figure 8.15 gives the using Eq. (8.43), and then inverting Lhe resulting rows.
resulting coefficients.
We can naturally divide Lhe resuit into four quadrants. The upper lefL quadrant contains
Lhe averaged coefficients from both Lhe horizontal and vertical passes. Therefore, it can be 8.6.2 Continuous Wavelet Transform*
We noted LhaL Lhe motivation for Lhe use of waveleLs is Lo provide a set of basis funcLions Lhat
decompose a signal in time over parameLers in Lhe frequency domam and Lhe time domam
simultaneously. A Fourier transform aims to pin down only Lhe frequency content of a
signal, in terms of spaLially varying raLher Lhan time varying signais. WhaL wavelets aim Lo
do is pin down Lhe frequency content aL different parts of Lhe image.
For example, one part of Lhe image rnay be “busy” with texture and thus high-frequency
conlent, while anoLher part may be smooth, with litLle high-frequency contenL. Naturally,
one can Lhink of obvious ways to consider frequencies for localized arcas of an image:
divide mi image into parts and fire away wiLh Fourier analysis. The tirne-sequence version
of that idea is called the Short-Term (or Windowed) Fourier Transform. And oLher ideas
have also arisen. However, it tums out Lhat waveleLs, a much newer development, have
neater characterisLics.
To further motivate Lhe subject, we should consider the Heisenberg uncertainry principie,
from physics. In Lhe conLext of signal processing, Lhis says Lhat Lhere is a Lradeoff beLween
accuracy in pinning down a function’s frequency, and iLs extenL intime. We cannoL do boffi
FIGURE 8.15: Output of Lhe first levei of the 21) Haar Wavelet Transform.
Section 8.6 Wavelet-Based Coding 229
228 Chapter 8 Lossy Compression Algorithms

accurately, in general, and still have a useful basis function. For example, a sine wave is exact in terms of its frequency but infinite in extent.

As an example of a function that dies away quickly and also has limited frequency content, suppose we start with a Gaussian function,

    f(t) = (1 / (σ √(2π))) e^(-t² / (2σ²))    (8.46)

The parameter σ expresses the scale of the Gaussian (bell-shaped) function.

The second derivative of this function, called ψ(t), looks like a Mexican hat, as in Figure 8.17(a). Clearly, the function ψ(t) is limited in time. Its equation is as follows:

    ψ(t) = (1 / (σ³ √(2π))) [ (t²/σ² - 1) e^(-t² / (2σ²)) ]    (8.47)

FIGURE 8.17: A Mexican-hat wavelet: (a) σ = 0.5; (b) its Fourier transform.

We can explore the frequency content of function ψ(t) by taking its Fourier transform. This turns out to be given by

    Ψ(ω) = ω² e^(-σ²ω²/2)    (8.48)

Figure 8.17(b) displays this function: the candidate wavelet (8.47) is indeed limited in frequency as well.

In general, a wavelet is a function ψ ∈ L²(R) with a zero average,

    ∫_{-∞}^{+∞} ψ(t) dt = 0    (8.49)

that satisfies some conditions that ensure it can be utilized in a multiresolution decomposition. The conditions ensure that we can use the decomposition for zooming in locally in some part of an image, much as we might be interested in closer or farther views of some neighborhood in a map.

The constraint (8.49) is called the admissibility condition for wavelets. A function that sums to zero must oscillate around zero. Also, from (8.26), we see that the DC value, the Fourier transform of ψ(t) for ω = 0, is zero. Another way to state this is that the 0th moment M0 of ψ(t) is zero. The pth moment is defined as

    M_p = ∫_{-∞}^{+∞} t^p ψ(t) dt    (8.50)

The function ψ is normalized with ||ψ|| = 1 and centered in the neighborhood of t = 0. We can obtain a family of wavelet functions by scaling and translating the mother wavelet ψ as follows:

    ψ_{s,u}(t) = (1/√s) ψ((t - u)/s)    (8.51)

If ψ(t) is normalized, so is ψ_{s,u}(t).

The Continuous Wavelet Transform (CWT) of f ∈ L²(R) at time u and scale s is defined as

    W(f, s, u) = ∫_{-∞}^{+∞} f(t) ψ_{s,u}(t) dt    (8.52)

The CWT of a 1D signal is a 2D function, a function of both scale s and shift u.

A very important issue is that, in contradistinction to (8.26), where the Fourier analysis function is stipulated to be the sinusoid, here (8.52) does not state what ψ(t) actually is! Instead, we create a set of rules such functions must obey and then invent useful functions that obey these rules, with different functions for different uses.

Just as we defined the DCT in terms of products of a function with a set of basis functions, here the transform W is written in terms of inner products with basis functions that are a scaled and shifted version of the mother wavelet ψ(t).

The mother wavelet ψ(t) is a wave, since it must be an oscillatory function. Why is it a wavelet? The spatial-frequency analyzer parameter in (8.52) is s, the scale. We choose some scale s and see how much content the signal has around that scale. To make the function decay rapidly away from the chosen s, we have to choose a mother wavelet ψ(t) that decays as fast as some power of s.

It is actually easy to show, from (8.52), that if all moments of ψ(t) up to the nth are zero (or quite small, practically speaking), then the CWT coefficient W(f, s, u) has a Taylor expansion around s = 0 that is of order s^(n+2) (see Exercise 9). This is the localization in frequency we desire in a good mother wavelet.

We derive wavelet coefficients by applying wavelets at different scales over many locations of the signal. Excitingly, if we shrink the wavelets down small enough that they cover a part of the function f(t) that is a polynomial of degree n or less, the coefficient for that wavelet and all smaller ones will be zero. The condition that the wavelet should have vanishing moments up to some order is one way of characterizing mathematical regularity conditions on the mother wavelet.

The inverse of the continuous wavelet transform is:

    f(t) = (1/C_ψ) ∫_0^{+∞} ∫_{-∞}^{+∞} W(f, s, u) (1/√s) ψ((t - u)/s) (1/s²) du ds    (8.53)
where

    C_ψ = ∫_0^{+∞} (|Ψ(ω)|² / ω) dω < +∞    (8.54)

and Ψ(ω) is the Fourier transform of ψ(t). Eq. (8.54) is another phrasing of the admissibility condition.

The trouble with the CWT is that (8.52) is nasty: most wavelets are not analytic but result simply from numerical calculations. The resulting infinite set of scaled and shifted functions is not necessary for the analysis of sampled functions, such as the ones that arise in image processing. For this reason, we apply the ideas that pertain to the CWT to the discrete domain.

8.6.3 Discrete Wavelet Transform*

Discrete wavelets are again formed from a mother wavelet, but with scale and shift in discrete steps.

Multiresolution Analysis and the Discrete Wavelet Transform. The connection between wavelets in the continuous time domain and filter banks in the discrete time domain is multiresolution analysis; we discuss the DWT within this framework. Mallat [5] showed that it is possible to construct wavelets ψ such that the dilated and translated family

    { ψ_{j,n}(t) = (1/√(2^j)) ψ((t - 2^j n) / 2^j) }, (j, n) ∈ Z²    (8.55)

is an orthonormal basis of L²(R), where Z represents the set of integers. This is known as "dyadic" scaling and translation and corresponds to the notion of zooming out in a map by factors of 2. (If we draw a cosine function cos(t) from time 0 to 2π and then draw cos(t/2), we see that while cos(t) goes over a whole cycle, cos(t/2) has only a half cycle: the function cos(2^(-1) t) is a wider function and thus is at a broader scale.)

Note that we change the scale of translations along with the overall scale 2^j, so as to keep movement in the lower-resolution image in proportion. Notice also that the notation used says that a larger index j corresponds to a coarser version of the image.

Multiresolution analysis provides the tool to adapt signal resolution to only relevant details for a particular task. The octave decomposition introduced by Mallat [6] initially decomposes a signal into an approximation component and a detail component. The approximation component is then recursively decomposed into approximation and detail at successively coarser scales. Wavelets are set up such that the approximation at resolution 2^(-j) contains all the necessary information to compute an approximation at the coarser resolution 2^(-(j+1)).

Wavelets are used to characterize detail information. The averaging information is formally determined by a kind of dual to the mother wavelet, called the scaling function φ(t).

The main idea in the theory of wavelets is that at a particular level of resolution j, the set of translates indexed by n forms a basis at that level. Interestingly, the set of translates forming the basis at the next, coarser level j + 1 can all be written as a sum of weights times the level-j basis. The scaling function is chosen such that the coefficients of its translates are all necessarily bounded (less than infinite).

The scaling function, along with its translates, forms a basis at the coarser level j + 1 (say 3, or the 1/8 level) but not at level j (say 2, or the 1/4 level). Instead, at level j the set of translates of the scaling function φ along with the set of translates of the mother wavelet ψ do form a basis. We are left with the situation that the scaling function describes smooth, or approximation, information and the wavelet describes what is left over: detail information.

Since the set of translates of the scaling function φ at a coarser level can be written exactly as a weighted sum of the translates at a finer level, the scaling function must satisfy the so-called dilation equation [7]:

    φ(t) = Σ_{n∈Z} √2 h0[n] φ(2t - n)    (8.56)

The square brackets come from the theory of filters, and their use is carried over here. The dilation equation is a recipe for finding a function that can be built from a sum of copies of itself that are first scaled, translated, and dilated. Equation (8.56) expresses a condition that a function must satisfy to be a scaling function and at the same time forms a definition of the scaling vector h0.

Not only is the scaling function expressible as a sum of translates, but as well the wavelet at the coarser level is also expressible as such:

    ψ(t) = Σ_{n∈Z} √2 h1[n] φ(2t - n)    (8.57)

Below, we'll show that the set of coefficients h1 for the wavelet can in fact be derived from the scaling function ones h0 [Eq. (8.59) below], so we also have that the wavelet can be derived from the scaling function, once we have one. The equation reads

    ψ(t) = Σ_{n∈Z} (-1)^n h0[1 - n] √2 φ(2t - n)    (8.58)

So the condition on a wavelet is similar to that on the scaling function, Eq. (8.56), and in fact uses the same coefficients, only in the opposite order and with alternating signs.

Clearly, for efficiency, we would like the sums in (8.56) and (8.57) to be as few as possible, so we choose wavelets that have as few vector entries h0 and h1 as possible. The effect of the scaling function is a kind of smoothing, or filtering, operation on a signal. Therefore it acts as a low-pass filter, screening out high-frequency content. The vector values h0[n] are called the low-pass filter impulse response coefficients, since they describe the effect of the filtering operation on a signal consisting of a single spike with magnitude unity (an impulse) at time t = 0. A complete discrete signal is made of a set of such spikes, shifted in time from 0 and weighted by the magnitudes of the discrete samples.

Hence, to specify a DWT, only the discrete low-pass filter impulse response h0[n] is needed. These specify the approximation filtering, given by the scaling function.
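The dilation equation (8.56) is easy to verify numerically in the simplest case. The sketch below is our own illustration, not code from the book: it checks that the Haar scaling function, the box function φ(t) = 1 on [0, 1), satisfies (8.56) with the two-tap scaling vector h0 = [1/√2, 1/√2].

```python
import numpy as np

# Numerical check of the dilation equation (8.56) for the Haar case.
# With h0 = [1/sqrt(2), 1/sqrt(2)], Eq. (8.56) reads
#   phi(t) = sqrt(2) * (h0[0] * phi(2t) + h0[1] * phi(2t - 1)),
# i.e., the box function is the sum of two half-width copies of itself.

def phi(t):
    """Haar scaling function: the box function, 1 on [0, 1)."""
    return np.where((t >= 0) & (t < 1), 1.0, 0.0)

h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
t = np.linspace(-0.5, 1.5, 201)
rhs = np.sqrt(2.0) * (h0[0] * phi(2 * t) + h0[1] * phi(2 * t - 1))
assert np.allclose(rhs, phi(t))   # (8.56) holds on the sampled grid
```

Longer filters (for example the Daubechies family) satisfy the same equation, but their scaling functions have no closed form and must themselves be computed by iterating (8.56).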

The discrete high-pass impulse response h1[n], describing the details using the wavelet function, can be derived from h0[n] using the following equation:

    h1[n] = (-1)^n h0[1 - n]    (8.59)

The number of coefficients in the impulse response is called the number of taps in the filter. If h0[n] has only a finite number of nonzero entries, the resulting wavelet is said to have compact support. Additional constraints, such as orthonormality and regularity, can be imposed on the coefficients h0[n]. The vectors h0[n] and h1[n] are called the low-pass and high-pass analysis filters.

To reconstruct the original input, an inverse operation is needed. The inverse filters are called synthesis filters. For orthonormal wavelets, the forward transform and its inverse are transposes of each other, and the analysis filters are identical to the synthesis filters.

Without orthogonality, the wavelets for analysis and synthesis are called biorthogonal, a weaker condition. In this case, the synthesis filters are not identical to the analysis filters. We denote them as h̃0[n] and h̃1[n]. To specify a biorthogonal wavelet transform, we require both h0[n] and h̃0[n]. As before, we can compute the discrete high-pass filters in terms of sums of the low-pass ones:

    h1[n] = (-1)^n h̃0[1 - n]    (8.60)
    h̃1[n] = (-1)^n h0[1 - n]    (8.61)

Tables 8.2 and 8.3 (cf. [8]) give some commonly used orthogonal and biorthogonal wavelet filters. The "start index" columns in these tables refer to the starting value of the index n used in Eqs. (8.60) and (8.61).

TABLE 8.2: Orthogonal wavelet filters

TABLE 8.3: Biorthogonal wavelet filters

Wavelet        Filter    Number    Start    Coefficients
                         of taps   index
Antonini 9/7   h0[n]     9         -4       [0.038, -0.024, -0.111, 0.377, 0.853, 0.377, -0.111, -0.024, 0.038]
               h̃0[n]     7         -3       [-0.065, -0.041, 0.418, 0.788, 0.418, -0.041, -0.065]
Villa 10/18    h0[n]     10        -4       [0.029, 0.0000824, -0.158, 0.077, 0.759, 0.759, 0.077, -0.158, 0.0000824, 0.029]
               h̃0[n]     18        -8       [0.000954, -0.00000273, -0.009, -0.003, 0.031, -0.014, -0.086, 0.163, 0.623, 0.623, 0.163, -0.086, -0.014, 0.031, -0.003, -0.009, -0.00000273, 0.000954]
Brislawn       h0[n]     10        -4       [0.027, -0.032, -0.241, 0.054, 0.900, 0.900, 0.054, -0.241, -0.032, 0.027]
               h̃0[n]     10        -4       [0.020, 0.024, -0.023, 0.146, 0.541, 0.541, 0.146, -0.023, 0.024, 0.020]

Figure 8.18 shows a block diagram for the 1D dyadic wavelet transform. Here, x[n] is the discrete sampled signal. The ↓2 box means subsampling by taking every second element, and the ↑2 box means upsampling by replication. The reconstruction phase yields series y[n].

For analysis, at each level we transform the series x[n] into another series of the same length, in which the first half of the elements is approximation information and the second half consists of detail information. For an N-tap filter, this is simply the series

    (x[n]) → { y[n] = Σ_j x[j] h0[n - j] ; Σ_j x[j] h1[n - j] }    (8.62)

where for each half, the odd-numbered results are discarded. The summation over shifted coefficients in (8.62) is referred to as a convolution.

2D Discrete Wavelet Transform. The extension of the wavelet transform to two dimensions is quite straightforward. A two-dimensional scaling function is said to be separable if it can be factored into a product of two one-dimensional scaling functions. That is,

    φ(x, y) = φ(x) φ(y)    (8.63)

For simplicity, only separable wavelets are considered in this section. Furthermore, let's assume that the width and height of the input image are powers of 2.

FIGURE 8.18: Block diagram of the 1D dyadic wavelet transform.

For an N by N input image, the two-dimensional DWT proceeds as follows:

1. Convolve each row of the image with h0[n] and h1[n], discard the odd-numbered columns of the resulting arrays, and concatenate them to form a transformed row.
2. After all rows have been transformed, convolve each column of the result with h0[n] and h1[n]. Again discard the odd-numbered rows and concatenate the result.

After the above two steps, one stage of the DWT is complete. The transformed image now contains four subbands LL, HL, LH, and HH, standing for low-low, high-low, and so on, as Figure 8.19(a) shows. As in the one-dimensional transform, the LL subband can be further decomposed to yield yet another level of decomposition. This process can be continued until the desired number of decomposition levels is reached or the LL component only has a single element left. A two-level decomposition is shown in Figure 8.19(b).

FIGURE 8.19: The two-dimensional discrete wavelet transform: (a) one-level transform; (b) two-level transform.

The inverse transform simply reverses the steps of the forward transform:

1. For each stage of the transformed image, starting with the last, separate each column into low-pass and high-pass coefficients. Upsample each of the low-pass and high-pass arrays by inserting a zero after each coefficient.
2. Convolve the low-pass coefficients with h0[n] and the high-pass coefficients with h1[n] and add the two resulting arrays.
3. After all columns have been processed, separate each row into low-pass and high-pass coefficients and upsample each of the two arrays by inserting a zero after each coefficient.
4. Convolve the low-pass coefficients with h0[n] and the high-pass coefficients with h1[n] and add the two resulting arrays.

If biorthogonal filters are used for the forward transform, we must replace the h0[n] and h1[n] above with h̃0[n] and h̃1[n] in the inverse transform.

EXAMPLE 8.8

The input image is a subsampled version of the image Lena, as shown in Figure 8.20. The size of the input is 16 x 16. The filter used in the example is the Antonini 9/7 filter set given in Table 8.3.

Before we begin, we need to compute the analysis and synthesis high-pass filters using Eqs. (8.60) and (8.61). The resulting filter coefficients are

    h1[n] = [-0.065, 0.041, 0.418, -0.788, 0.418, 0.041, -0.065]
    h̃1[n] = [-0.038, -0.024, 0.111, 0.377, -0.853, 0.377, 0.111, -0.024, -0.038]    (8.64)
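Before continuing with the numerical example, the forward stage just described (steps 1 and 2) can be sketched compactly. This is our own illustration: for brevity it uses the two-tap orthonormal Haar filters and periodic boundary extension, rather than the Antonini 9/7 filters and boundary handling of Example 8.8.

```python
import numpy as np

# One forward 2D DWT stage: filter and downsample the rows, then the columns.
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar low-pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # Haar high-pass

def analyze_rows(a):
    """Convolve each row with h0 and h1, keep every second result, and
    concatenate the low-pass and high-pass halves (step 1)."""
    n = a.shape[1]
    out = np.empty_like(a, dtype=float)
    for i, row in enumerate(a):
        ext = np.concatenate([row, row[:len(h0)]])    # periodic extension
        lo = np.convolve(ext, h0, mode='valid')[:n:2]
        hi = np.convolve(ext, h1, mode='valid')[:n:2]
        out[i] = np.concatenate([lo, hi])
    return out

def dwt2_stage(a):
    """Step 1 on the rows, then step 2 on the columns."""
    return analyze_rows(analyze_rows(a).T).T

img = np.arange(16, dtype=float).reshape(4, 4)
t = dwt2_stage(img)
# t[0, 0] is the LL entry for the first 2x2 block: its four pixels summed
# and divided by 2. Since the Haar pair is orthonormal, total energy
# (the sum of squared values) is preserved by the stage.
```

Replacing h0 and h1 with a longer filter pair from Table 8.3 (and choosing a suitable boundary extension) turns the same structure into the transform used in Example 8.8.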
Next, we form the transformed output row by concatenating the resulting coefficients. The first row of the transformed image is then

[245, 156, 171, 183, 184, 173, 228, 160, -30, 3, 0, 7, -5, -16, -3, 16]

Similar to the simple one-dimensional Haar transform examples, most of the energy is now concentrated in the first half of the transformed image. We continue the same process for the remaining rows and obtain the following result:

I11(x, y) =
245 156 171 183 84 73 222 160 303 O 7 5 (6 3 16
239 141 III 197 242 151 202 229 —li 5 —20 3 26 —27 27 141
195 147 163 177 281 173 209 106 —34 2 2 19 —50 —35 —38 —I
80 139 226 177 274 267 247 63 —45 29 24 —29 —2 30 —101 —78
191 145 197 191 247 230 239 (43 —49 22 36 —U —16 —14 101 —54
192 (45 237 184 135 253 169 192 —47 38 36 4 —58 66 94 —4
(76 (59 56 77 204 232 SI 1% —31 9 —48 30 II 58 29 4
(79 (48 (62 129 146 283 92 287 —39 II 50 —lO 33 SI —23 8
FIGURE 8.20: Lena: (a) original 128 x 128 image; (b) 16 x 16 subsampled image.
169 859 (63 97 204 202 85 234 —29 1 —42 23 37 48 —56 —5
855 853 (49 (59 (76 204 65 236 —32 32 85 39 38 44 —54 —31
845 848 858 (48 864 857 (88 285 —55 59 —((0 28 26 48 —l —64
834 852 102 70 853 826 899 207 —47 38 83 lO —76 3 —7 —76
127 203 130 94 III 218 171 228 (2 88 —27 IS 1 76 24 85
The inpul irnage in numerical form is 70 88 63 844 191 257 285 232 5 24 —28 —9 (9 —46 36 91
129 124 87 96 877 236 862 77 —2 20 —48 1 (7 —56 30 —24
103 lIS 85 (42 888 234 884 132 —37 O 27 —4 5 —35 —22 —33

I00(x, y) =
858 870 97 804 23 130 133 825 132 827 112 158 159 44 116 98
We now go on and apply the filters to the columns of the above resulting image. As before,
164 153 91 99 824 852 131 860 189 116 106 845 840 (43 227 53
we apply both h0[n] and h1[n] to each column and discard the odd-indexed results:
116 849 90 101 III III 138 852 202 211 84 154 127 846 58 58
95 145 88 lOS (88 123 II? 182 85 204 203 854 853 229 46 147
808 856 89 880 165 113 148 870 863 886 144 94 208 39 883 859
(I11(0, :) * h0[n]) ↓2 = [353, 280, 269, 256, 240, 206, 160, 153]^T
103 153 94 103 203 836 846 92 66 192 188 803 878 47 167 159
802 846 106 99 99 121 39 60 864 (75 898 46 56 56 856 856
(I11(0, :) * h1[n]) ↓2 = [-12, 10, -7, -4, 2, -1, 43, 16]^T
99 ‘46 95 97 144 68 (03 807 808 III 192 62 65 128 153 154
99 840 103 809 803 824 54 88 872 837 (78 54 43 859 849 874
84 133 807 84 149 43 (58 95 151 120 183 46 30 147 142 201
Concatenating the above results into a single column and applying the same procedure to
58 853 110 48 94 283 li 73 840 103 838 83 852 843 (28 207
56 141 (08 58 92 SI 55 61 88 866 58 103 146 150 816 211
each of the remaining columns, we arrive at the final transformed image:
89 885 888 47 113 804 56 67 128 155 187 78 853 (34 203 95
35 99 151 67 35 88 88 828 840 842 876 213 844 828 284 880
89 98 97 SI 49 101 47 90 136 136 57 205 806 43 54 76
I12(x, y) =
44 805 69 69 68 53 1(0 827 834 (46 859 184 809 121 72 (83 353 212 258 272 288 234 308 289 —33 6 —(5 5 24 —29 38 (20
280 203 254 250 402 269 297 207 —45 II —2 9 —31 —26 —74 23
269 202 3(2 280 3(6 353 337 227 —70 43 56 —23 —41 21 82 —88
I represents the pixel values. The first subscript of I indicates the current stage of the
256
240
217
221
247
226
(55
(72
236
264
328
294
114
113
283 —52
330 —dl
27
(4
—N
38
23
23
—2
57
90 49
60 —78
82
—3
transform, while the second subscript indicates the current step within a stage. We start
206 204 203 192 230 289 232 309 —76 67 —53 40 4 46 —(8 —(07
by convolving the first row with both h0[n] and h1[n] and discarding the values with odd-
860
153
275
189
50
‘33
(35
(73
244
260
294
342
267
256
331 —2
(76 —20
90
(8
—(7
—38
80 —24
—4 24
49
—75
29
25
89
—5
numbered index. The results of these two operations are
—(2 7—9—13—6 II 12—69—10—1 84 6—38 3—45 —99
(0 3 —31 86 —I —SI —lO —30 2 —(2 O 24—32—45809 42
—7 5 —44 —35 67 —(0 —17 —IS 3 —IS —28 O 41 —30 —(8 —89
—4 9—1 —37 41 6—33 2 9—12 —67 31 —7 3 2
(I00(:, 0) * h0[n]) ↓2 = [245, 156, 171, 183, 184, 173, 228, 160]
—8 22 32 46 lO 48 —II 20 89 32 —59 9 70 50 16 73
(I00(:, 0) * h1[n]) ↓2 = [-30, 3, 0, 7, -5, -16, -3, 16]
86 2—6—32—7 5 (3 50 24 7 —61 2 II —33 43 1

where the colon in the first index position indicates that we are showing a whole row. If you like, you can verify these operations using MATLAB's conv function.
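The same kind of check works outside MATLAB. The sketch below is our own illustration, not the book's code: it applies one level of the analysis in Eq. (8.62), filtering and then discarding every second sample, to a short toy series using the two-tap Haar pair, so the split into approximation and detail halves is easy to see.

```python
import numpy as np

# One level of 1D analysis as in Eq. (8.62): convolve with the low-pass and
# high-pass filters, keep every second sample of each, and concatenate.
# We use the Haar filters here for readability; the book's example uses the
# Antonini 9/7 pair from Table 8.3 instead.
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)

def analyze(x):
    lo = np.convolve(x, h0)[1::2]   # approximation half: pairwise sums / sqrt(2)
    hi = np.convolve(x, h1)[1::2]   # detail half: pairwise differences / sqrt(2)
    return np.concatenate([lo, hi])

x = np.array([4.0, 4.0, 2.0, 2.0])
y = analyze(x)
# y = [8/sqrt(2), 4/sqrt(2), 0, 0]: the detail half is zero because each
# pair of samples in x is constant.
```

The output has the same length as the input, with all the energy of this piecewise-constant signal concentrated in the first (approximation) half, just as observed for the rows of the Lena example.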

This completes one stage of the Discrete Wavelet Transform. We can perform another stage by applying the same transform procedure to the upper left 8 x 8 DC image of I12(x, y). The resulting two-stage transformed image is I22(x, y).

I21(x, y) =
414 337 382 403 70 —16 48 12 —33 6 —IS 5 24 —29 38 120
354 322 490 368 39 59 63 55 —45 II —2 9 —31 —26 —74 23
323 395 450 442 62 25 —26 90 —70 43 56 —23 —41 21 82 —81
I22(x, y) =
558 451 608 532 15 26 94 25 —33 6 —IS 5 24 —29 38 120 338 298 346 2% 23 77 —III —131 —52 27 —14 23 —2 90 49 12
463 SI’ 621 566 66 68 —43 64 —45 II —2 9 —31 —26 —74 23 333 286 364 298 4 67 —75 —176 —41 14 3! 23 57 60 —78 —3
464 40! 478 416 14 84 —97 —229 —70 43 56 —23 —41 21 82 —81 294 279 308 350 —2 II 12 —53 —76 67 —53 40 4 46 —IS —107
422 335 477 553 —88 46 —31 —6 —52 27 —14 23 —2 90 49 12 324 240 326 412 —% 54 —25 —45 —2 90 —II lO —24 49 29 89
‘4 33—5642 22—43—36 1—41 14 31 2357 60—78 —3 260 189 382 359 —47 14 —63 69 —20 II —38 —4 24 —75 25 —5
—13 3634 52 12—21 SI 10—7667—53 40 446—18-107 —12 7—9—13—6 II 12—69—10—1 14 6—38 3—45 —99
25 —20 25 —7 —35 35 —56 —55 —2 90 —17 lO —24 49 29 89 lO 3—31 16—1—51 —lO —30 2—12 0 24—32—45 09 42
46 37 —SI SI —44 26 39 —74 —20 IS —38 —4 24 —75 25 —5 —7 5 —44 —35 67 —lO —II —IS 3 —IS —28 O 41 —30 —IS —19
—12 7—9—13—6 II 12~69—10—1 14 6—38 3—45 —99 —4 9—1—37 41 6—33 2 9—12 —67 31—7 3 2
lo 3—31 16_I—5II0 —30 2—12 O 24—32—45 ‘09 42 2 —3 9 —25 2 —25 60 —8 —II —4 —123 —12 —6 —4 14 —12
—~ 5 —44 —35 67 —lO —17 —IS 3 —IS —28 O 41 —30 —18 —19 —I 22 32 46 0 48 —II 20 19 32 —59 9 70 50 16 73
~4 9—1—3741 6—33 2 9—12 —67 31—7 3 2 43 —18 32 —40 —13 —23 —37 —61 8 22 2 13 —12 43 —8 —45
2 —3 9 —25 2 —25 60 —8 —II —4 —123 —12 —6 —4 14 —11 16 2—6—32—7 5—13—5024 7—61 2 11—3343 1
—I 22 32 46 lo 48 —II 20 19 32 —59 9 70 50 16 73
43 —IS 32 —40 —13 —23 —37 —6! 8 22 2 13 —12 43 —8 —45
ló 2—6—32—7 5 13—5024 7—61 2 11—33 43 1
Notice that I12 corresponds to the subband diagram shown in Figure 8.19(a), and I22 corresponds to Figure 8.19(b). At this point, we may apply different levels of quantization to each subband according to some preferred bit allocation algorithm, given a desired bitrate. This is the basis for a simple wavelet-based compression algorithm. However, since in this example we are illustrating the mechanics of the DWT, here we will simply bypass the quantization step and perform an inverse transform to reconstruct the input image.

We are now ready to process the rows. For each row of the upper left 8 x 8 sub-image, we again separate them into low-pass and high-pass coefficients. Then we upsample both by adding a zero after each coefficient. The results are convolved with the appropriate h0[n] and h1[n] filters. After these steps are completed for all rows, we have

I12(x, y) =
353 212 251 272 281 234 308 289 —33 6 —IS 5 24 —29 38 120
280 203 254 250 402 269 297 207 —45 II —2 9 —31 —26 —74 23
269 202 312 280 316 353 337 227 —70 43 56 —23 —4! 21 82 SI
256 217 247 155 236 328 114 283 —52 27 —14 23 2 90 49 12
240 221 226 172 264 294 113 330 —41 14 31 23 57 60 78 3
206 204 201 ‘92 230 219 232 307 —76 67 —53 40 4 46 —IS —107
60 275 150 135 244 294 267 331 —2 90 —II lO —24 49 29 89
153 189 113 173 260 342 256 176 —20 18 —38 —4 24 —75 25 —5

We refer to the top left 8 x 8 block of values as the innermost stage, in correspondence with Figure 8.19. Starting with the innermost stage, we extract the first column and separate the low-pass and high-pass coefficients. The low-pass coefficients are simply the first half of the column, and the high-pass coefficients are the second half. Then we upsample them by appending a zero after each coefficient. The two resulting arrays are
—12 7—9—13—6 II 12—69—10—1 14 6—38 3—45 —99
lO 3—31 16—1—51—10—30 2—12 024—32—45109 42
—7 5 —44 —35 67 —lO —17 —IS 3 —IS —28 O 41 —30 —IS —19
a′ = [558, 0, 463, 0, 464, 0, 422, 0]
=
2 —3 9 —25 2 —25 60 —8 —II —4 —123 —12 —6 —4 14 —12
—I 22 32 46 lO 48 —II 20 19 32 —59 9 70 50 6 73
43 —IS 32 40 —13 —23 —37 —61 8 22 2 13 —12 43 —8 —45
16 2 6 32—7 3—13 —50 24 7—61 2 11—33 43 1
Since we are using biorthogonal filters, we convolve a′ and b′ with h̃0[n] and h̃1[n] respectively. The results of the two convolutions are then added to form a single 8 x 1 array. The resulting column is

[414, 354, 323, 338, 333, 294, 324, 260]^T

All columns in the innermost stage are processed in this manner. The resulting image is I21(x, y).

We then repeat the same inverse transform procedure on I12(x, y), to obtain I′(x, y). Notice that I′(x, y) is not exactly the same as I00(x, y), but the difference is small. These small differences are caused by round-off errors during the forward and inverse transform, and truncation errors when converting from floating point numbers to integer grayscale values. Figure 8.21 shows a three-level image decomposition using the Haar wavelet.

I′(x, y) =
58 70 97 103 122 129 132 125 132 126 III IS? 159 144 116 91 —

164 152 90 98 123 151 131 159 188 115 106 145 ‘40 143 227 52
115 48 89 160 II? 118 131 151 201 210 84 154 127 146 58 58
94 144 88 104 87 123 117 IS! 184 203 202 153 152 228 45 146
1W 155 88 99 64 112 147 69 163 186 143 193 207 38 112 158
103 153 93 102 203 135 145 91 66 192 188 103 77 46 66 158
102 146 106 99 99 121 39 60 164 75 198 46 56 56 156 156
99 146 95 97 143 60 102 ‘06 07 110 191 61 65 128 153 54
98 139 102 109 103 123 53 80 171 136 li? 53 43 158 148 173
84 133 lo? 84 148 42 157 94 150 119 182 45 29 146 141 260
57 152 ‘09 41 93 213 70 72 139 102 137 82 151 143 128 207
56 141 108 58 91 50 54 60 87 165 57 102 146 149 116 211
89 114 187 46 113 104 55 66 127 154 186 71 153 34 203 94
35 99 150 66 34 88 88 127 140 141 175 212 144 128 213 160
88 97 96 50 49 101 47 90 36 136 56 204 ‘05 43 54 76
43 ‘04 69 69 68 53 110 127 134 145 158 183 109 121 72 113
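The reconstruction round-off just noted is easy to reproduce on a small example. The following sketch is our own, using the orthonormal Haar pair rather than the biorthogonal Antonini filters of Example 8.8: it runs one level of analysis and then the upsample-convolve-add synthesis steps described for the inverse transform. With the quantization step bypassed, the input is recovered up to floating-point error.

```python
import numpy as np

h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar low-pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # Haar high-pass

def analyze(x):
    # Filter, then discard every second sample, as in Eq. (8.62).
    return np.convolve(x, h0)[1::2], np.convolve(x, h1)[1::2]

def synthesize(lo, hi):
    # Inverse steps: upsample by inserting zeros, filter, and add.
    up = np.zeros(2 * len(lo)); up[::2] = lo
    vp = np.zeros(2 * len(hi)); vp[::2] = hi
    # For an orthonormal wavelet the synthesis filters are the
    # time-reversed analysis filters.
    y = np.convolve(up, h0[::-1]) + np.convolve(vp, h1[::-1])
    return y[:len(up)]

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
lo, hi = analyze(x)
xr = synthesize(lo, hi)   # xr equals x up to round-off
```

For the biorthogonal case, synthesize would instead use the separate h̃0[n], h̃1[n] pair, exactly as in the inverse transform steps listed earlier; truncating the reconstruction to integer grayscale values then adds the small extra error seen in I′(x, y).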

Wavelet-Based Reduction Program. Keeping only the lowest-frequency content amounts to an even simpler wavelet-based image zooming-out reduction algorithm. Program wavelet_reduction.c on the book's web site gives a simple illustration of this principle, limited to just the scaling function and analysis filter to scale down an image some number of times (three, say) using wavelet-based analysis. The program operates on the Unix-based PGM (portable graymap) file format and uses the Antonini 9/7 biorthogonal filter in Table 8.3.
FIGURE 8.21: Haar wavelet decomposition. Courtesy of Steve Kilthau.
8.7 WAVELET PACKETS

Wavelet packets can be viewed as a generalization of wavelets. They were first introduced by Coifman, Meyer, Quake, and Wickerhauser [9] as a family of orthonormal bases for discrete functions of R^N. A complete subband decomposition can be viewed as a decomposition of the input signal, using an analysis tree of depth log N.

In the usual dyadic wavelet decomposition, only the low-pass-filtered subband is recursively decomposed and thus can be represented by a logarithmic tree structure. However, a wavelet packet decomposition allows the decomposition to be represented by any pruned subtree of the full tree topology. Therefore, this representation of the decomposition topology is isomorphic to all permissible subband topologies [10]. The leaf nodes of each pruned subtree represent one permissible orthonormal basis.

The wavelet packet decomposition offers a number of attractive properties, including:

• Flexibility, since a best wavelet basis in the sense of some cost metric can be found within a large library of permissible bases

• Favorable localization of wavelet packets in both frequency and space

• Low computational requirement for wavelet packet decomposition, because each decomposition can be computed in the order of N log N using fast filter banks

Wavelet packets are currently being applied to solve various practical problems such as image compression, signal de-noising, fingerprint identification, and so on.

8.8 EMBEDDED ZEROTREE OF WAVELET COEFFICIENTS

So far, we have described a wavelet-based scheme for image decomposition. However, aside from referring to the idea of quantizing away small coefficients, we have not really addressed how to code the wavelet transform values, that is, how to form a bitstream. This problem is precisely what is dealt with in terms of a new data structure, the Embedded Zerotree.

The Embedded Zerotree Wavelet (EZW) algorithm introduced by Shapiro [11] is an effective and computationally efficient technique in image coding. This work has inspired a number of refinements to the initial EZW algorithm, the most notable being Said and Pearlman's Set Partitioning in Hierarchical Trees (SPIHT) algorithm [12] and Taubman's Embedded Block Coding with Optimized Truncation (EBCOT) algorithm [13], which is adopted into the JPEG2000 standard.

The EZW algorithm addresses two problems: obtaining the best image quality for a given bitrate and accomplishing this task in an embedded fashion. An embedded code is one that contains all lower-rate codes "embedded" at the beginning of the bitstream. The bits are effectively ordered by importance in the bitstream. An embedded code allows the encoder to terminate the encoding at any point and thus meet any target bitrate exactly. Similarly, a decoder can cease to decode at any point and produce reconstructions corresponding to all lower-rate encodings.

To achieve this goal, the EZW algorithm takes advantage of an important aspect of low-bitrate image coding. When conventional coding methods are used to achieve low bitrates — using scalar quantization followed by entropy coding, say — the most likely symbol, after quantization, is zero. It turns out that a large fraction of the bit budget is spent encoding the significance map, which flags whether input samples (in the case of the 2D discrete wavelet transform, the transform coefficients) have a zero or nonzero quantized value. The EZW algorithm exploits this observation to turn any significant improvement in encoding the significance map into a corresponding gain in compression efficiency. The EZW algorithm consists of two central components: the zerotree data structure and the method of successive approximation quantization.

8.8.1 The Zerotree Data Structure

The coding of the significance map is achieved using a new data structure called the zerotree. A wavelet coefficient x is said to be insignificant with respect to a given threshold T if |x| < T. The zerotree operates under the hypothesis that if a wavelet coefficient at a coarse scale is insignificant with respect to a given threshold T, all wavelet coefficients of the same orientation in the same spatial location at finer scales are likely to be insignificant with respect to T. Using the hierarchical wavelet decomposition presented in Chapter 8, we can relate every coefficient at a given scale to a set of coefficients at the next finer scale of similar orientation.

Figure 8.22 provides a pictorial representation of the zerotree on a three-stage wavelet decomposition. The coefficient at the coarse scale is called the parent, while all corresponding coefficients at the next finer scale of the same spatial location and similar orientation are called children. For a given parent, the set of all coefficients at all finer scales are called descendants. Similarly, for a given child, the set of all coefficients at all coarser scales are called ancestors.

FIGURE 8.22: Parent–child relationship in a zerotree.

The scanning of the coefficients is performed in such a way that no child node is scanned before its parent. Figure 8.23 depicts the scanning pattern for a three-level wavelet decomposition.

FIGURE 8.23: EZW scanning order.

Given a threshold T, a coefficient x is an element of the zerotree if it is insignificant and all its descendants are insignificant as well. An element of a zerotree is a zerotree root if it is not the descendant of a previously found zerotree root. The significance map is coded using the zerotree with a four-symbol alphabet. The four symbols are

• The zerotree root. The root of the zerotree is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable.

• Isolated zero. The coefficient is insignificant but has some significant descendants.

• Positive significance. The coefficient is significant with a positive value.

• Negative significance. The coefficient is significant with a negative value.

The cost of encoding the significance map is substantially reduced by employing the zerotree. The zerotree works by exploiting self-similarity on the transform coefficients. The underlying justification for the success of the zerotree is that even though the image has been transformed using a decorrelating transform, the occurrences of insignificant coefficients are not independent events.
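The parent–child index arithmetic implied by Figure 8.22 is simple for the subbands below the coarsest level: in an n × n transform, the coefficient at row i, column j has children at (2i, 2j), (2i, 2j+1), (2i+1, 2j), and (2i+1, 2j+1). The Python sketch below is illustrative only (the function names are our own, and it ignores Shapiro's special-case parent rule for the top LL subband):

```python
def children(i, j, n):
    """Children of coefficient (i, j) in an n x n wavelet pyramid.
    Coefficients in the finest scale have no children."""
    if 2 * i >= n or 2 * j >= n:
        return []
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, n):
    """All coefficients at all finer scales rooted at (i, j):
    exactly the set a zerotree root declares insignificant."""
    result, stack = [], children(i, j, n)
    while stack:
        c = stack.pop()
        result.append(c)
        stack.extend(children(c[0], c[1], n))
    return result
```

For an 8 × 8 three-level transform, the coefficient at (1, 1) has 4 children and 20 descendants, so a single zerotree-root symbol there stands in for the root plus 20 further insignificance decisions.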

In addition, the zerotree coding technique is based on the observation that it is much easier to predict insignificance than to predict significant details across scales. This technique focuses on reducing the cost of encoding the significance map so that more bits will be available to encode the expensive significant coefficients.

8.8.2 Successive Approximation Quantization

Embedded coding in the EZW coder is achieved using a method called Successive Approximation Quantization (SAQ). One motivation for developing this method is to produce an embedded code that provides a coarse-to-fine, multiprecision logarithmic representation of the scale space corresponding to the wavelet-transformed image. Another motivation is to take further advantage of the efficient encoding of the significance map using the zerotree data structure, by allowing it to encode more significance maps.

The SAQ method sequentially applies a sequence of thresholds T_0, ..., T_{N-1} to determine the significance of each coefficient. The thresholds are chosen such that T_i = T_{i-1}/2. The initial threshold T_0 is chosen so that |x_j| < 2T_0 for all transform coefficients x_j. A dominant list and a subordinate list are maintained during the encoding and decoding process. The dominant list contains the coordinates of the coefficients that have not yet been found to be significant, in the same relative order as the initial scan.

Using the scan ordering shown in Figure 8.23, all coefficients in a given subband appear on the initial dominant list prior to coefficients in the next subband. The subordinate list contains the magnitudes of the coefficients that have been found to be significant. Each list is scanned only once for each threshold.

During a dominant pass, coefficients having their coordinates on the dominant list are not yet significant. These coefficients are compared to the threshold T_i to determine their significance. If a coefficient is found to be significant, its magnitude is appended to the subordinate list, and the coefficient in the wavelet transform array is set to zero to enable the possibility of a zerotree occurring on future dominant passes at smaller thresholds. The resulting significance map is zerotree-coded.

The dominant pass is followed by a subordinate pass. All coefficients on the subordinate list are scanned, and their magnitude, as it is made available to the decoder, is refined to an additional bit of precision. Effectively, the width of the uncertainty interval for the true magnitude of the coefficients is cut in half. For each magnitude on the subordinate list, the refinement can be encoded using a binary alphabet, with a 1 indicating that the true value falls in the upper half of the uncertainty interval and a 0 indicating that it falls in the lower half. The string of symbols from this binary alphabet is then entropy-coded. After the subordinate pass, the magnitudes on the subordinate list are sorted in decreasing order, to the extent that the decoder can perform the same sort.

The process continues to alternate between the two passes, with the threshold halved before each dominant pass. The encoding stops when some target stopping criterion has been met.

8.8.3 EZW Example

The following example demonstrates the concept of zerotree coding and successive approximation quantization. Shapiro [11] presents an example of EZW coding in his paper for an 8 × 8 three-level wavelet transform. However, unlike the example given by Shapiro, we will complete the encoding and decoding process and show the output bitstream up to the point just before entropy coding.

Figure 8.24 shows the coefficients of a three-stage wavelet transform that we attempt to code using the EZW algorithm. We will use the symbols p, n, t, and z to denote positive significance, negative significance, zerotree root, and isolated zero, respectively.

     57  -37   39  -20    3    7    9   10
    -29   30   17   33    8    2    1    6
     14    6   15   13    9   -4    2    3
     10   19   -7    9   -7   14   12   -9
     12   15   33   20   -2    3    1    0
      0    7    2    4    4   -1    1    1
      4    1   10    3    2    0    1    0
      5    6    0    0    3    1    2    1

FIGURE 8.24: Coefficients of a three-stage wavelet transform used as input to the EZW algorithm.

Since the largest coefficient is 57, we will choose the initial threshold T_0 to be 32. At the beginning, the dominant list contains the coordinates of all the coefficients. We begin scanning in the order shown in Figure 8.23 and determine the significance of the coefficients. The following is the list of coefficients visited, in the order of the scan:

(57, -37, -29, 30, 39, -20, 17, 33, 14, 6, 10, 19, 3, 7, 8, 2, 2, 3, 12, -9, 33, 20, 2, 4)

With respect to the threshold T_0 = 32, it is easy to see that the coefficients 57 and -37 are significant. Thus, we output a p and an n to represent them. The coefficient -29 is insignificant but contains a significant descendant, 33, in LH1. Therefore, it is coded as z. The coefficient 30 is also insignificant, and all its descendants are insignificant with respect to the current threshold, so it is coded as t.

Since we have already determined the insignificance of 30 and all its descendants, the scan will bypass them, and no additional symbols will be generated. Continuing in this manner, the dominant pass outputs the following symbols:

D0: pnztpttptzttttttttttpttt

Five coefficients are found to be significant: 57, -37, 39, 33, and another 33. Since we know that no coefficients are greater than 2T_0 = 64, and the threshold used in the first dominant pass is 32, the uncertainty interval is thus [32, 64). Therefore, we know that the values of the significant coefficients lie somewhere inside this uncertainty interval.

The subordinate pass following the dominant pass refines the magnitude of these coefficients by indicating whether they lie in the first half or the second half of the uncertainty
interval. The output is 0 if the values lie in [32, 48) and 1 for values within [48, 64). According to the order of the scan, the subordinate pass outputs the following bits:

S0: 10000

Now the dominant list contains the coordinates of all the coefficients except those found to be significant, and the subordinate list contains the values {57, 37, 39, 33, 33}. After the subordinate pass is completed, we attempt to rearrange the values in the subordinate list such that larger coefficients appear before smaller ones, with the constraint that the decoder is able to do exactly the same.

Since the subordinate pass halves the uncertainty interval, the decoder is able to distinguish values from [32, 48) and [48, 64). Since 39 and 37 are not distinguishable by the decoder, their order will not be changed. Therefore, the subordinate list remains the same after the reordering operation.

Before we move on to the second round of dominant and subordinate passes, we need to set the values of the significant coefficients to 0 in the wavelet transform array so that they do not prevent the emergence of a new zerotree.

The new threshold for the second dominant pass is T_1 = 16. Using the same procedure as above, the dominant pass outputs the following symbols. Note that the coefficients found significant in the previous pass have been set to 0 in the wavelet transform array and are now coded as insignificant.

D1: zznptnpttztptttttttttttttptttttt    (8.65)

The subordinate list is now {57, 37, 39, 33, 33, 29, 30, 20, 17, 19, 20}. The subordinate pass that follows will halve each of the three current uncertainty intervals [48, 64), [32, 48), and [16, 32). The subordinate pass outputs the following bits:

S1: 10000110000

Now we set the value of the coefficients found to be significant to 0 in the wavelet transform array.

The output of the subsequent dominant and subordinate passes is shown below:

D2: zzzzzzzzptpzpptpttppptttpttttppnppttttpttttttttttttttttt
S2: 01100111001101100000110110
D3: zzzzzzztzpztztnttptttttptnnttttttptpptppttpttttt
S3: 00100010001110100110010001111101100010
D4: zzzzzttztztzztzzpttpppttttpttpttnpttptpt
S4: 111110100110101100000110110110110000101001010101
D5: zzzztzttttztzzzzttpttpttttpttnppttpttppp

Since the length of the uncertainty interval in the last pass is 1, the last subordinate pass is unnecessary.

On the decoder side, suppose we received information only from the first dominant and subordinate passes. We can reconstruct a lossy version of the transform coefficients by reversing the encoding process. From the symbols in D0 we can obtain the positions of the significant coefficients. Then, using the bits decoded from S0, we can reconstruct the value of these coefficients using the center of the uncertainty interval. Figure 8.25 shows the resulting reconstruction.

     56  -40   40    0    0    0    0    0
      0    0    0   40    0    0    0    0
      0    0    0    0    0    0    0    0
      0    0    0    0    0    0    0    0
      0    0   40    0    0    0    0    0
      0    0    0    0    0    0    0    0
      0    0    0    0    0    0    0    0
      0    0    0    0    0    0    0    0

FIGURE 8.25: Reconstructed transform coefficients from the first dominant and subordinate passes.

It is evident that we can stop the decoding process at any point to reconstruct a coarser representation of the original input coefficients. Figure 8.26 shows the reconstruction if the decoder received only D0, S0, D1, S1, D2, and only the first 10 bits of S2. The coefficients that were not refined during the last subordinate pass appear as if they were quantized using a coarser quantizer than those that were.

In fact, the reconstruction value used for these coefficients is the center of the uncertainty interval from the previous pass. The heavily shaded coefficients in the figure are those that were refined, while the lightly shaded coefficients are those that were not refined. As a result, it is not easy to see where the decoding process ended, and this eliminates much of the visual artifact contained in the reconstruction.

8.9 SET PARTITIONING IN HIERARCHICAL TREES (SPIHT)

SPIHT is a revolutionary extension of the EZW algorithm. Based on EZW's underlying principles of partial ordering of transformed coefficients, ordered bitplane transmission of refinement bits, and the exploitation of self-similarity in the transformed wavelet image, the SPIHT algorithm significantly improves the performance of its predecessor by changing the ways subsets of coefficients are partitioned and refinement information is conveyed.

A unique property of the SPIHT bitstream is its compactness. The resulting bitstream from the SPIHT algorithm is so compact that passing it through an entropy coder would produce only marginal gain in compression at the expense of much more computation. Therefore, a fast SPIHT coder can be implemented without any entropy coder or possibly just a simple patent-free Huffman coder.

Another signature of the SPIHT algorithm is that no ordering information is explicitly transmitted to the decoder. Instead, the decoder reproduces the execution path of the encoder
FIGURE 8.26: Reconstructed transform coefficients from D0, S0, D1, S1, D2, and the first 10 bits of S2.

and recovers the ordering information. A desirable side effect of this is that the encoder and decoder have similar execution times, which is rarely the case for other coding methods. Said and Pearlman [12] give a full description of this algorithm.

8.10 FURTHER EXPLORATION

Sayood [14] deals extensively with the subject of lossy data compression in a well-organized and easy-to-understand manner.

Gersho and Gray [15] cover quantization, especially vector quantization, comprehensively. In addition to the basic theory, this book provides a nearly exhaustive description of available VQ methods.

Gonzalez and Woods [7] discuss mathematical transforms and image compression, including straightforward explanations for a wide range of algorithms in the context of image processing.

The mathematical foundation for the development of many lossy data compression algorithms is the study of stochastic processes. Stark and Woods [16] is an excellent textbook on this subject.

Finally, Mallat [5] is a book on wavelets, emphasizing theory.

Links included in the Further Exploration directory of the text web site for this chapter are

• An online, graphics-based demonstration of the wavelet transform. Two programs are included, one to demonstrate the 1D wavelet transform and the other for 2D image compression. In the 1D program, you simply draw the curve to be transformed.

• The Theory of Data Compression web page, which introduces basic theories behind both lossless and lossy data compression. Shannon's original 1948 paper on information theory can be downloaded from this site as well.

• The FAQ for the comp.compression and comp.compression.research groups. This FAQ answers most of the commonly asked questions about wavelet theory and data compression in general.

• A set of slides for scalar quantization and vector quantization, from the information theory course offered at Delft University.

• A link to an excellent article "Image Compression — from DCT to Wavelets: A Review".

• Links to documentation and source code related to quantization.

8.11 EXERCISES

1. Assume we have an unbounded source we wish to quantize using an M-bit midtread uniform quantizer. Derive an expression for the total distortion if the step size is 1.

2. Suppose the domain of a uniform quantizer is [-b_M, b_M]. We define the loading fraction as

       b_M / σ

   where σ is the standard deviation of the source. Write a simple program to quantize a Gaussian distributed source having zero mean and unit variance using a 4-bit uniform quantizer. Plot the SNR against the loading fraction and estimate the optimal step size that incurs the least amount of distortion from the graph.

3. * Suppose the input source is Gaussian-distributed with zero mean and unit variance — that is, the probability density function is defined as

       f_X(x) = (1/√(2π)) e^(-x²/2)    (8.66)

   We wish to find a four-level Lloyd–Max quantizer. Let y^i = [y_1^i, ..., y_4^i] and b^i = [b_0^i, ..., b_4^i]. The initial reconstruction levels are set to y^0 = [-2, -1, 1, 2]. This source is unbounded, so the outer two boundaries are +∞ and -∞.

   Follow the Lloyd–Max algorithm in this chapter: the other boundary values are calculated as the midpoints of the reconstruction values. We now have b^0 = [-∞, -1.5, 0, 1.5, ∞]. Continue one more iteration for i = 1, using Eq. (8.13), and find y_1^1, y_2^1, y_3^1, y_4^1, using numerical integration. Also calculate the squared error of the difference between y^1 and y^0.

   Iteration is repeated until the squared error between successive estimates of the reconstruction levels is below some predefined threshold ε. Write a small program to implement the Lloyd–Max quantizer described above.

4. If the block size for a 2D DCT transform is 8 × 8, and we use only the DC components to create a thumbnail image, what fraction of the original pixels would we be using?

5. When the block size is 8, the definition of the DCT is given in Eq. (8.17).

FIGURE 8.27: Sphere shaded by a light.

   (a) If an 8 × 8 grayscale image is in the range 0 .. 255, what is the largest value a DCT coefficient could be, and for what input image? (Also, state all the DCT coefficient values for that image.)

   (b) If we first subtract the value 128 from the whole image and then carry out the DCT, what is the exact effect on the DCT value F[2, 3]?

   (c) Why would we carry out that subtraction? Does the subtraction affect the number of bits we need to code the image?

   (d) Would it be possible to invert that subtraction, in the IDCT? If so, how?

6. We could use a similar DCT scheme for video streams by using a 3D version of DCT. Suppose one color component of a video has pixels f_ijk at position (i, j) and time k. How could we define its 3D DCT transform?

7. Suppose a uniformly colored sphere is illuminated and has shading varying smoothly across its surface, as in Figure 8.27.

   (a) What would you expect the DCT coefficients for its image to look like?

   (b) What would be the effect on the DCT coefficients of having a checkerboard of colors on the surface of the sphere?

   (c) For the uniformly colored sphere again, describe the DCT values for a block that straddles the top edge of the sphere, where it meets the black background.

   (d) Describe the DCT values for a block that straddles the left edge of the sphere.

8. The Haar wavelet has a scaling function which is defined as follows:

       φ(t) = 1 for 0 ≤ t ≤ 1,  0 otherwise    (8.67)

   and its scaling vector is h_0[0] = h_0[1] = 1/√2.

   (a) Draw the scaling function, then verify that its dilated translates φ(2t) and φ(2t - 1) satisfy the dilation equation (8.56). Draw the combination of these functions that makes up the full function φ(t).

   (b) Derive the wavelet vector h_1[0], h_1[1] from Eq. (8.59) and then derive and draw the Haar wavelet function ψ(t) from Eq. (8.57).

9. Suppose the mother wavelet ψ(t) has vanishing moments M_p up to and including M_n. Expand f(t) in a Taylor series around t = 0, up to the nth derivative of f [i.e., up to leftover error of order O(n + 1)]. Evaluate the summation of integrals produced by substituting the Taylor series into (8.52) and show that the result is of order O(s^(n+2)).

10. The program wavelet_compression.c on this book's web site is in fact simple to implement as a MATLAB function (or similar fourth-generation language). The advantage in doing so is that the imread function can input image formats of a great many types, and imwrite can output as desired. Using the given program as a template, construct a MATLAB program for wavelet-based image reduction, with perhaps the number of wavelet levels being a function parameter.

11. It is interesting to find the Fourier transform of functions, and this is easy if you have available a symbolic manipulation system such as MAPLE. In that language, you can just invoke the fourier function and view the answer directly! As an example, try the following code fragment:

       with('inttrans');
       f := 1;
       F := fourier(f,t,w);

    The answer should be 2πδ(w). Let's try a Gaussian:

       f := exp(-t^2);
       F := fourier(f,t,w);

    Now the answer should be √π e^(-w²/4): the Fourier transform of a Gaussian is simply another Gaussian.

12. Suppose we define the wavelet function

       ψ(t) = exp(-t^(1/4)) sin(t^(1/4)),  t ≥ 0    (8.68)

    This function oscillates about the value 0. Use a plotting package to convince yourself that the function has a zero moment M_p for any value of p.

13. Implement both a DCT-based and a wavelet-based image coder. Design your user interface so that the compressed results from both coders can be seen side by side for visual comparison. The PSNR for each coded image should also be shown, for quantitative comparisons.

    Include a slider bar that controls the target bitrate for both coders. As you change the target bitrate, each coder should compress the input image in real time and show the compressed results immediately on your user interface.

    Discuss both qualitative and quantitative compression results observed from your program at target bitrates of 4 bpp, 1 bpp, and 0.25 bpp.

14. Write a simple program, or refer to the sample DCT program dct_1D.c on the book's web site, to verify the results in Example 8.2 of the 1D DCT example in this chapter.
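A check along the lines of Exercise 14 takes only a few lines of Python. The sketch below assumes the orthonormal 1D DCT-II normalization, with C(0) = √2/2 and C(u) = 1 otherwise, and the factor C(u)/2 is specific to N = 8; treat it as illustrative rather than as a substitute for dct_1D.c:

```python
import math

def dct_1d(f):
    """8-point 1D DCT-II:
    F(u) = (C(u)/2) * sum_i cos((2i+1)u*pi/(2N)) * f(i),
    with C(0) = sqrt(2)/2 and C(u) = 1 otherwise (N = 8 assumed,
    since the C(u)/2 scale is orthonormal only for N = 8)."""
    N = len(f)
    F = []
    for u in range(N):
        c = math.sqrt(2) / 2 if u == 0 else 1.0
        s = sum(math.cos((2 * i + 1) * u * math.pi / (2 * N)) * f[i]
                for i in range(N))
        F.append(c / 2 * s)
    return F

def idct_1d(F):
    """Inverse: f(i) = sum_u (C(u)/2) cos((2i+1)u*pi/(2N)) * F(u)."""
    N = len(F)
    return [sum((math.sqrt(2) / 2 if u == 0 else 1.0) / 2 *
                math.cos((2 * i + 1) * u * math.pi / (2 * N)) * F[u]
                for u in range(N))
            for i in range(N)]
```

A quick sanity check: a constant signal produces a single DC coefficient and zero AC terms, and idct_1d(dct_1d(f)) recovers f to within floating-point error.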

8.12 REFERENCES

1 A. György, "On the Theoretical Limits of Lossy Source Coding," Tudományos Diákkör (TDK) Conference [Hungarian Scientific Students' Conference] at Technical University of Budapest, 1998. http://www.szit.bme.hu

2 S. Arimoto, "An Algorithm for Calculating the Capacity of an Arbitrary Discrete Memoryless Channel," IEEE Transactions on Information Theory, 18: 14–20, 1972.

3 R. Blahut, "Computation of Channel Capacity and Rate-Distortion Functions," IEEE Transactions on Information Theory, 18: 460–473, 1972.

4 J.F. Blinn, "What's the Deal with the DCT?" IEEE Computer Graphics and Applications, 13(4): 78–83, 1993.

5 S. Mallat, A Wavelet Tour of Signal Processing, San Diego: Academic Press, 1998.

6 S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 11: 674–693, 1989.

7 R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.

8 B.E. Usevitch, "A Tutorial on Modern Lossy Wavelet Image Compression: Foundations of JPEG 2000," IEEE Signal Processing Magazine, 18(5): 22–35, 2001.

9 R. Coifman, Y. Meyer, S. Quake, and V. Wickerhauser, "Signal Processing and Compression with Wavelet Packets," Numerical Algorithms Research Group, Yale University, 1990.

10 K. Ramchandran and M. Vetterli, "Best Wavelet Packet Basis in a Rate-Distortion Sense," IEEE Transactions on Image Processing, 2: 160–173, 1993.

11 J. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Transactions on Signal Processing, 41(12): 3445–3462, 1993.

12 A. Said and W.A. Pearlman, "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees," IEEE Transactions on Circuits and Systems for Video Technology, 6(3): 243–250, 1996.

13 D. Taubman, "High Performance Scalable Image Compression with EBCOT," IEEE Transactions on Image Processing, 9(7): 1158–1170, 2000.

14 K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco: Morgan Kaufmann, 2000.

15 A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers, 1992.

16 H. Stark and J.W. Woods, Probability and Random Processes with Application to Signal Processing, 3rd ed., Upper Saddle River, NJ: Prentice Hall, 2001.

CHAPTER 9

Image Compression Standards

Recent years have seen an explosion in the availability of digital images, because of the increase in numbers of digital imaging devices, such as scanners and digital cameras. The need to efficiently process and store images in digital form has motivated the development of many image compression standards for various applications and needs. In general, standards have greater longevity than particular programs or devices and therefore warrant careful study. In this chapter, we examine some current standards and demonstrate how topics presented in Chapters 7 and 8 are applied in practice.

We first explore the standard JPEG definition, used in most images on the web, then go on to look at the wavelet-based JPEG2000 standard. Two other standards, JPEG-LS — aimed particularly at a lossless JPEG, outside the main JPEG standard — and JBIG, for bilevel image compression, are included for completeness.

9.1 THE JPEG STANDARD

JPEG is an image compression standard developed by the Joint Photographic Experts Group. It was formally accepted as an international standard in 1992 [1].

JPEG consists of a number of steps, each of which contributes to compression. We'll look at the motivation behind these steps, then take apart the algorithm piece by piece.

9.1.1 Main Steps in JPEG Image Compression

As we know, unlike one-dimensional audio signals, a digital image f(i, j) is not defined over the time domain. Instead, it is defined over a spatial domain — that is, an image is a function of the two dimensions i and j (or, conventionally, x and y). The 2D DCT is used as one step in JPEG, to yield a frequency response that is a function F(u, v) in the spatial frequency domain, indexed by two integers u and v.

JPEG is a lossy image compression method. The effectiveness of the DCT transform coding method in JPEG relies on three major observations:

Observation 1. Useful image contents change relatively slowly across the image — that is, it is unusual for intensity values to vary widely several times in a small area — for example, in an 8 × 8 image block. Spatial frequency indicates how many times pixel values change across an image block. The DCT formalizes this notion with a measure of how much the image contents change in relation to the number of cycles of a cosine wave per block.

Observation 2. Psychophysical experiments suggest that humans are much less likely to notice the loss of very high spatial frequency components than lower-frequency components.
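Observation 1 can be made concrete with a toy calculation: take an 8 × 8 block whose intensities ramp up gently from left to right and apply the 2D DCT; nearly all of the output lands in the low-frequency coefficients. The Python sketch below is illustrative only (the normalization, with C(u) = √2/2 for u = 0, follows the 8 × 8 DCT convention of Chapter 8, and the ramp block is an invented example):

```python
import math

def dct2_8x8(f):
    """8x8 2D DCT: F(u,v) = (C(u)C(v)/4) *
    sum_{i,j} cos((2i+1)u*pi/16) cos((2j+1)v*pi/16) f(i,j)."""
    def c(w):
        return math.sqrt(2) / 2 if w == 0 else 1.0
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(math.cos((2 * i + 1) * u * math.pi / 16) *
                    math.cos((2 * j + 1) * v * math.pi / 16) * f[i][j]
                    for i in range(8) for j in range(8))
            F[u][v] = c(u) * c(v) / 4 * s
    return F

# A smooth block: intensity climbs gently from 100 to 114 across each row.
ramp = [[100 + 2 * j for j in range(8)] for i in range(8)]
F = dct2_8x8(ramp)
# F[0][0] carries the (scaled) block average; because the block is constant
# down each column, every F[u][v] with u > 0 is numerically zero, and along
# v the coefficient magnitudes die off quickly with increasing frequency.
```

Running this shows |F[0][1]| far larger than |F[0][7]|: precisely the pattern that lets JPEG quantize high-frequency terms coarsely, or drop them altogether, without much visible damage.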

When Lhe JPEG image is needed for viewing, Lhe lhree compressed component images
YIQ orYUV can be deeoded independently and evenLually combined. For lhe calor channels, each pixel
musL be first enlarged to cover a 2 x 2 block. WiLhauL loss of generaliLy, we wili simply use
one of them — for example, lhe Y image, in lhe descripLion of Lhe compressian aigorithm
beiow.
Figure 9.1 shows a block diagram for a JPEG encoder. If we reverse Lhe arrows iii Lhe
figure, we basically obLain a JPEG decoder. The JPEG encoder consists of lhe follawing
main steps:

. Transform RGB Lo YIQ ~ YUV and subsample colar

Perform Dcl’ on image biocks


. Appiy QuanLizalion
Zigzag
. Perforni Zigzag ordering and mn-length encoding

Perforrn Entropy coding

DCT OH Image Biocks. Each image is divided into 8 x 8 blocks. The 2D DCI’ (Equa
FIGURE 9.1: Block diagram for JPEG encoder. tion 8.17) is appiied lo each block image f(i, j), wiLh ouLpul being lhe DCT coefhcients
F(u, ti) for each block. The choice of a small block size in JPEG is a compromise reached
by Lhe commiLtee: a number larger Lhan 8 wouid have made accuracy aL low frequencies
betLer, buL using 8 malces lhe DCT (and IDCT) compuLation very fasL.
JPEG’s approach Lo lhe use aí DO’ is basically lo reduce high-frequeflcy contenLs and
Using blacks aL ali, however, has the effecL of isoiaLing each bioek from its neighboring
lhen efficienlly code lhe resull into a bitstring. The terrn spatiai redwzdancy indicates that
conLexL. This is why JPEG images laok choppy (“blocky”) when lhe user specifies a high
much of lhe infonnation in an image is repeated: ifa pixel is red, then its neighbor is likely
compression ratio we can see Lhese blocks. (And in fact removing such “biocking
red also. Because aí Observation 2 above, lhe DCI’ coefficients for lhe lowest frequencies
artifacts” is an importaril concem of researchers.)
are rnost important. Therefore, as frequency gets higher, it becomes less importaM Lo
represent lhe DCI’ coefficient accurately. It may even be safely seI to zero without losing To calcuiale a particular F(u, ti), we selecL lhe basis image in Figure 8.9 Lhat corresponds
LO Lhe appropriaLe u and ti and use iL in EquaLion 8.17 LOderive one of lhe frequency responses
much perceivable image information.
F(u, ti).
Clearly, a string of zeros can be represented efficiently as lhe lengLh of such a run of
zeros, and compression of bits required is possible. Since we end up using fewer nunibers QuaHtization. The quantizaLion slep in JPEG is aimed aI reducing lhe toLal number
to represent lhe pixels in blocks, by removing some location~dePendent inforniation, we of bits needed for a compressed image [2]. IL consisLs of simpiy dividing each enLry iii lhe
have effectively removed spatial redundancy. frequency space block by an inLeger, then rounding:
JPEG worlcs forbothcolOr and grayscale images. In lhe case aí colar irnages, such as YIQ
ar YUV, lhe encoder works on each component separately, using Lhe sarne routines. If lhe 1 F(u.v))
source image is in a different colar forrnat, lhe encoder perforrns a color-spaCe conversion fr(u, ti) — round (9.1)
~ Q (u, ti)
Lo YIQ ar YUV. As discussed in Chapter 5, lhe chrominaflce images (1, Q or (.1, Y) are
subsampled JPEG uses lhe 4:2:0 scheme, making use of anoiber observation about vision: Here, F(u, ti) represenLs a DCT coefficient, Q(u, ti) is a quansization matrix entry, and
F(u, ti) represents Lhe quansized DCT coefficients JPEG will use in Lhe succeeding enLrapy
ObservatiOn 3. Visual acuity (accuracy in distinguishiflg closely spaced lines) is coding.
much greater for gray C’black and white”) lhan for calor. We simply cannot see much The defauit vaiues in lhe 8 x 8 quanLizaLiori maLrix Q(u, ti) are lisled in Tables 9.1
change incolor if ii occurs in dose proximiLY — think of lhe blobby inlc used in comic and 9.2 for luminance and chrominance images, respecLively. These numbers resulted from
boolcs. ‘[bis works simply because our eye sees lhe black lines best, and our brain psychophysical sludies, wiLh lhe gaai aí maximizing lhe compression ralio while minimizing
just pushes lhe calor mIo place. Ia fact, ordinary broadcast TV makes use of lhis percepLual iosses iii JPEG images. The foflawing should be apparent:
phenomenan to transmit much less colar infarmatiOn lhan gray information.

TABLE 9.1: The luminance quantization table.

16  11  10  16  24   40   51   61
12  12  14  19  26   58   60   55
14  13  16  24  40   57   69   56
14  17  22  29  51   87   80   62
18  22  37  56  68  109  103   77
24  35  55  64  81  104  113   92
49  64  78  87  103 121  120  101
72  92  95  98  112 100  103   99

TABLE 9.2: The chrominance quantization table.

17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99

FIGURE 9.2: JPEG compression for a smooth image block: an 8 × 8 block f(i, j) from the Y image of Lena, its DCT coefficients F(u, v), the quantized coefficients F̂(u, v), the de-quantized coefficients F̃(u, v), the reconstructed block f̃(i, j), and the error ε(i, j) = f(i, j) − f̃(i, j). For this block, the quantized matrix is

F̂(u, v) =
32   6  −1   0   0   0   0   0
−1   0   0   0   0   0   0   0
−1   0   1   0   0   0   0   0
−1   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0

• Since the numbers in Q(u, v) are relatively large, the magnitude and variance of F̂(u, v) are significantly smaller than those of F(u, v). We'll see later that F̂(u, v) can be coded with many fewer bits. The quantization step is the main source for loss in JPEG compression.

• The entries of Q(u, v) tend to have larger values toward the lower right corner. This aims to introduce more loss at the higher spatial frequencies — a practice supported by Observations 1 and 2.

We can handily change the compression ratio simply by multiplicatively scaling the numbers in the Q(u, v) matrix. In fact, the quality factor, a user choice offered in every JPEG implementation, is essentially linearly tied to the scaling factor. JPEG also allows custom quantization tables to be specified and put in the header; it is interesting to use low-constant or high-constant values such as Q ≡ 2 or Q ≡ 100 to observe the basic effects of Q on visual artifacts.

Figures 9.2 and 9.3 show some results of JPEG image coding and decoding on the test image Lena. Only the luminance image (Y) is shown.
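The quantization step and the quality-factor scaling described above can be sketched as follows. This is a minimal sketch: the piecewise 5000/q versus 200 − 2q scaling convention is the one popularized by the IJG libjpeg implementation, not something fixed by the JPEG standard itself.

```python
import numpy as np

# Table 9.1: the default JPEG luminance quantization table.
Q_LUM = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99]])

def scale_table(Q, quality):
    """Scale a quantization table by a quality factor in (0, 100]."""
    s = 5000 / quality if quality < 50 else 200 - 2 * quality
    return np.clip(np.floor((Q * s + 50) / 100), 1, 255).astype(int)

def quantize(F, Q):
    """F_hat(u, v) = round(F(u, v) / Q(u, v)) -- the only lossy step."""
    return np.round(F / Q).astype(int)
```

At quality 50 the table is returned unchanged, and at quality 100 every entry collapses to 1; small constant tables (Q ≡ 2, Q ≡ 100) can also be passed in directly to reproduce the experiment suggested above.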

Also, the lossless coding steps after quantization are not shown, since they do not affect the quality/loss of the JPEG images. These results show the effect of compression and decompression applied to a relatively smooth block in the image and a more textured (higher-frequency-content) block, respectively.

Suppose f(i, j) represents one of the 8 × 8 blocks extracted from the image, F(u, v) the DCT coefficients, and F̂(u, v) the quantized DCT coefficients. Let F̃(u, v) denote the de-quantized DCT coefficients, determined by simply multiplying by Q(u, v), and let f̃(i, j) be the reconstructed image block. To illustrate the quality of the JPEG compression, especially the loss, the error ε(i, j) = f(i, j) − f̃(i, j) is shown in the last row in Figures 9.2 and 9.3.
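This round trip (level shift, DCT, quantize, de-quantize, inverse DCT) can be sketched with nothing more than an orthonormal DCT matrix; the following is a sketch, not the implementation used for the figures:

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis matrix C, so that F = C @ f @ C.T."""
    return np.array([[np.sqrt((1 if u == 0 else 2) / N) *
                      np.cos((2 * i + 1) * u * np.pi / (2 * N))
                      for i in range(N)] for u in range(N)])

def jpeg_roundtrip(block, Q):
    """Return (reconstructed block f~, error e = f - f~) for one 8 x 8 block."""
    C = dct_matrix()
    F = C @ (block - 128.0) @ C.T    # DCT of the level-shifted block
    F_hat = np.round(F / Q)          # quantized coefficients F^(u, v)
    F_tilde = F_hat * Q              # de-quantized coefficients F~(u, v)
    recon = C.T @ F_tilde @ C + 128.0
    return recon, block - recon
```

Running this on a smooth block and a textured block reproduces the qualitative behavior of Figures 9.2 and 9.3: the error grows where the block has quickly changing detail.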
In Figure 9.2, an image block (indicated by a black box in the image) is chosen at the area where the luminance values change smoothly. Actually, the left side of the block is brighter, and the right side is slightly darker. As expected, except for the DC and the first few AC components, representing low spatial frequencies, most of the DCT coefficients F(u, v) have small magnitudes. This is because the pixel values in this block contain few high-spatial-frequency changes.

An explanation of a small implementation detail is in order. The range of 8-bit luminance values f(i, j) is [0, 255]. In the JPEG implementation, each Y value is first reduced by 128 by simply subtracting. The idea here is to turn the Y component into a zero-mean image, the same as the chrominance images. As a result, we do not waste any bits coding the mean value. (Think of an 8 × 8 block with intensity values ranging from 120 to 135.) Using f(i, j) − 128 in place of f(i, j) will not affect the output of the AC coefficients — it alters only the DC coefficient.

In Figure 9.3, the image block chosen has rapidly changing luminance. Hence, many more AC components have large magnitudes (including those toward the lower right corner, where u and v are large). Notice that the error ε(i, j) is also larger now than in Figure 9.2 — JPEG does introduce more loss if the image has quickly changing details.

FIGURE 9.3: JPEG compression for a textured image block: another 8 × 8 block from the Y image of Lena, with the corresponding F(u, v), F̂(u, v), F̃(u, v), f̃(i, j), and ε(i, j) = f(i, j) − f̃(i, j).

Preparation for Entropy Coding. We have so far seen two of the main steps in JPEG compression: DCT and quantization. The remaining small steps shown in the block diagram in Figure 9.1 all lead up to entropy coding of the quantized DCT coefficients. These additional data compression steps are lossless. Interestingly, the DC and AC coefficients are treated quite differently before entropy coding: run-length encoding on ACs versus DPCM on DCs.

Run-Length Coding (RLC) on AC Coefficients. Notice in Figure 9.2 the many zeros in F̂(u, v) after quantization is applied. Run-length Coding (RLC) (or Run-length Encoding, RLE) is therefore useful in turning the F̂(u, v) values into sets {#-zeros-to-skip, next nonzero value}. RLC is even more effective when we use an addressing scheme making it most likely to hit a long run of zeros: a zigzag scan turns the 8 × 8 matrix F̂(u, v) into a 64-vector, as Figure 9.4 illustrates. After all, most image blocks tend to have small high-spatial-frequency components, which are zeroed out by quantization.


TABLE 9.3: Baseline entropy coding details — size category.

SIZE   AMPLITUDE
1      −1, 1
2      −3, −2, 2, 3
3      −7..−4, 4..7
4      −15..−8, 8..15
...
10     −1023..−512, 512..1023
FIGURE 9.4: Zigzag scan in JPEG.
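The zigzag order of Figure 9.4 can be generated by sorting positions along antidiagonals; a sketch:

```python
def zigzag_indices(n=8):
    """(row, col) visiting order of the zigzag scan over an n x n block."""
    def key(p):
        u, v = p
        s = u + v
        # odd antidiagonals run top-right to bottom-left, even ones the reverse
        return (s, u if s % 2 else v)
    return sorted(((u, v) for u in range(n) for v in range(n)), key=key)
```

Flattening a quantized block into its 64-vector is then just `[F[u][v] for (u, v) in zigzag_indices()]`.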

Hence the zigzag scan order has a good chance of concatenating long runs of zeros. For example, F̂(u, v) in Figure 9.2 will be turned into

(32, 6, −1, −1, 0, −1, 0, 0, 0, −1, 0, 0, 1, 0, 0, ..., 0)

with three runs of zeros in the middle and a run of 51 zeros at the end.

The RLC step replaces values by a pair (RUNLENGTH, VALUE) for each run of zeros in the AC coefficients of F̂, where RUNLENGTH is the number of zeros in the run and VALUE is the next nonzero coefficient. To further save bits, a special pair (0, 0) indicates the end-of-block after the last nonzero AC coefficient is reached. In the above example, not considering the first (DC) component, we will thus have

(0, 6)(0, −1)(0, −1)(1, −1)(3, −1)(2, 1)(0, 0)

Differential Pulse Code Modulation (DPCM) on DC Coefficients. The DC coefficients are coded separately from the AC ones. Each 8 × 8 image block has only one DC coefficient. The values of the DC coefficients for various blocks could be large and different, because the DC value reflects the average intensity of each block, but consistent with Observation 1 above, the DC coefficient is unlikely to change drastically within a short distance. This makes DPCM an ideal scheme for coding the DC coefficients. If the DC coefficients for the first five image blocks are 150, 155, 149, 152, 144, DPCM would produce 150, 5, −6, 3, −8, assuming the predictor for the ith block is simply d_i = DC_{i+1} − DC_i, and d_0 = DC_0. We expect DPCM codes to generally have smaller magnitude and variance, which is beneficial for the next entropy coding step.

It is worth noting that unlike the run-length coding of the AC coefficients, which is performed on each individual block, DPCM for the DC coefficients in JPEG is carried out on the entire image at once.

Entropy Coding. The DC and AC coefficients finally undergo an entropy coding step. Below, we will discuss only the basic (or baseline¹) entropy coding method, which uses Huffman coding and supports only 8-bit pixels in the original images (or color image components). Let's examine the two entropy coding schemes, using a variant of Huffman coding for DCs and a slightly different scheme for ACs.

Huffman Coding of DC Coefficients. Each DPCM-coded DC coefficient is represented by a pair of symbols (SIZE, AMPLITUDE), where SIZE indicates how many bits are needed for representing the coefficient and AMPLITUDE contains the actual bits. Table 9.3 illustrates the size category for the different possible amplitudes. Notice that DPCM values could require more than 8 bits and could be negative values. The one's-complement scheme is used for negative numbers — that is, binary code 10 for 2, 01 for −2; 11 for 3, 00 for −3; and so on. In the example we are using, codes 150, 5, −6, 3, −8 will be turned into

(8, 10010110), (3, 101), (3, 001), (2, 11), (4, 0111)

In the JPEG implementation, SIZE is Huffman coded and is hence a variable-length code. In other words, SIZE 2 might be represented as a single bit (0 or 1) if it appeared most frequently. In general, smaller SIZEs occur much more often — the entropy of SIZE is low. Hence, deployment of Huffman coding brings additional compression. After encoding, a custom Huffman table can be stored in the JPEG image header; otherwise, a default Huffman table is used. On the other hand, AMPLITUDE is not Huffman coded. Since its value can change widely, Huffman coding has no appreciable benefit.

¹The JPEG standard allows both Huffman coding and Arithmetic coding; both are entropy coding methods. It also supports both 8-bit and 12-bit pixel lengths.
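The run-length pairing of the ACs, the DPCM differencing of the DCs, and the SIZE/AMPLITUDE split of Table 9.3 can be sketched together as follows; this is a minimal sketch whose test data reproduce the worked examples in the text:

```python
def rlc_ac(zigzag_vec):
    """Turn the 63 AC values into (RUNLENGTH, VALUE) pairs, ending with (0, 0)."""
    pairs, run = [], 0
    for v in zigzag_vec[1:]:          # skip the DC coefficient
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((0, 0))              # end-of-block marker
    return pairs

def dpcm_dc(dc_values):
    """Difference each DC coefficient from its predecessor (d0 = DC0)."""
    return [dc_values[0]] + [b - a for a, b in zip(dc_values, dc_values[1:])]

def size_category(x):
    """SIZE from Table 9.3: the number of bits needed for amplitude x."""
    return abs(x).bit_length()

def amplitude_bits(x):
    """One's-complement AMPLITUDE bits, e.g. 5 -> '101', -6 -> '001'."""
    n = size_category(x)
    return format(x if x > 0 else x + (1 << n) - 1, '0%db' % n) if n else ''
```

With the 64-vector from Figure 9.2 this yields (0, 6)(0, −1)(0, −1)(1, −1)(3, −1)(2, 1)(0, 0), and the DC sequence 150, 155, 149, 152, 144 yields the (SIZE, AMPLITUDE) pairs (8, 10010110), (3, 101), (3, 001), (2, 11), (4, 0111) quoted above.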
Huffman Coding of AC Coefficients. Recall we said that the AC coefficients are run-length coded and are represented by pairs of numbers (RUNLENGTH, VALUE). However, in an actual JPEG implementation, VALUE is further represented by SIZE and AMPLITUDE, as for the DCs. To save bits, RUNLENGTH and SIZE are allocated only 4 bits each and squeezed into a single byte — let's call this Symbol 1. Symbol 2 is the AMPLITUDE value; its number of bits is indicated by SIZE:

Symbol 1: (RUNLENGTH, SIZE)
Symbol 2: (AMPLITUDE)

The 4-bit RUNLENGTH can represent only zero-runs of length 0 to 15. Occasionally, the zero-run length exceeds 15; then a special extension code, (15, 0), is used for Symbol 1. In the worst case, three consecutive (15, 0) extensions are needed before a normal terminating Symbol 1, whose RUNLENGTH will then complete the actual runlength. As in DC, Symbol 1 is Huffman coded, whereas Symbol 2 is not.

9.1.2 JPEG Modes

The JPEG standard supports numerous modes (variations). Some of the commonly used ones are:

• Sequential Mode
• Progressive Mode
• Hierarchical Mode
• Lossless Mode

Sequential Mode. This is the default JPEG mode. Each gray-level image or color image component is encoded in a single left-to-right, top-to-bottom scan. We implicitly assumed this mode in the discussions so far. The "Motion JPEG" video codec uses Baseline Sequential JPEG, applied to each image frame in the video.

Progressive Mode. Progressive JPEG delivers low-quality versions of the image quickly, followed by higher-quality passes, and has become widely supported in web browsers. Such multiple scans of images are of course most useful when the speed of the communication line is low. In Progressive Mode, the first few scans carry only a few bits and deliver a rough picture of what is to follow. After each additional scan, more data is received, and image quality is gradually enhanced. The advantage is that the user end has a choice whether to continue receiving image data after the first scan(s).

Progressive JPEG can be realized in one of the following two ways. The main steps (DCT, quantization, etc.) are identical to those in Sequential Mode.

Spectral selection: This scheme takes advantage of the spectral (spatial frequency spectrum) characteristics of the DCT coefficients: the higher AC components provide only detail information.

Scan 1: Encode DC and first few AC components, e.g., AC1, AC2.
Scan 2: Encode a few more AC components, e.g., AC3, AC4, AC5.
...
Scan k: Encode the last few ACs, e.g., AC61, AC62, AC63.

Successive approximation: Instead of gradually encoding spectral bands, all DCT coefficients are encoded simultaneously, but with their most significant bits (MSBs) first.

Scan 1: Encode the first few MSBs, e.g., Bits 7, 6, 5, and 4.
Scan 2: Encode a few more less-significant bits, e.g., Bit 3.
...
Scan m: Encode the least significant bit (LSB), Bit 0.

Hierarchical Mode. As its name suggests, Hierarchical JPEG encodes the image in a hierarchy of several different resolutions. The encoded image at the lowest resolution is basically a compressed low-pass-filtered image, whereas the images at successively higher resolutions provide additional details (differences from the lower-resolution images). Similar to Progressive JPEG, Hierarchical JPEG images can be transmitted in multiple passes with progressively improving quality.

Figure 9.5 illustrates a three-level hierarchical JPEG encoder and decoder (separated by the dashed line in the figure).

FIGURE 9.5: Block diagram for Hierarchical JPEG.
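The two progressive orderings, spectral selection and successive approximation, can be sketched on a single block's zigzag vector. The band boundaries and bit groupings below are illustrative choices, not values fixed by the standard, and sign handling is omitted:

```python
def spectral_selection_scans(zigzag_vec, bands):
    """One scan per coefficient band, e.g. bands = [(0, 0), (1, 5), (6, 63)]."""
    return [zigzag_vec[lo:hi + 1] for lo, hi in bands]

def successive_approximation_scans(zigzag_vec, bit_groups):
    """Every coefficient appears in every scan, but only the chosen bit
    planes are sent, most significant bits first."""
    scans = []
    for bits in bit_groups:
        mask = sum(1 << b for b in bits)
        scans.append([abs(c) & mask for c in zigzag_vec])
    return scans
```

Either way, each scan refines what the previous scans delivered, which is exactly what lets a browser show a rough image early.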
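The three-level structure of Figure 9.5 can be sketched with any block codec plugged in for the per-level encode/decode pair; `down2` and `expand` below are simple stand-ins for the reduction and interpolation filters, and Algorithms 9.1 and 9.2 state the procedure precisely:

```python
import numpy as np

def down2(img):
    """Halve resolution by 2 x 2 averaging (any low-pass reduction works)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def expand(img):
    """E(.): expand back up by 2 in each dimension (nearest neighbor)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def hierarchical_encode(f, codec):
    """codec = (encode, decode) pair for any other JPEG mode."""
    enc, dec = codec
    f2, f4 = down2(f), down2(down2(f))
    F4 = enc(f4)
    d2 = f2 - expand(dec(F4))              # difference against the DECODED f4
    D2 = enc(d2)
    f2_tilde = expand(dec(F4)) + dec(D2)   # what the decoder will reconstruct
    D1 = enc(f - expand(f2_tilde))
    return F4, D2, D1

def hierarchical_decode(F4, D2, D1, codec):
    enc, dec = codec
    f2_tilde = expand(dec(F4)) + dec(D2)
    return expand(f2_tilde) + dec(D1)
```

With a lossless (identity) codec the decoder reproduces f exactly; with a lossy codec, differencing against the decoded images rather than the originals is precisely the point made after Algorithm 9.2.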
ALGORITHM 9.1 THREE-LEVEL HIERARCHICAL JPEG ENCODER

1. Reduction of image resolution. Reduce resolution of the input image f (e.g., 512 × 512) by a factor of 2 in each dimension to obtain f2 (e.g., 256 × 256). Repeat this to obtain f4 (e.g., 128 × 128).

2. Compress low-resolution image f4. Encode f4 using any other JPEG method (e.g., Sequential, Progressive) to obtain F4.

3. Compress difference image d2.

(a) Decode F4 to obtain f̃4. Use any interpolation method to expand f̃4 to be of the same resolution as f2 and call it E(f̃4).

(b) Encode difference d2 = f2 − E(f̃4) using any other JPEG method (e.g., Sequential, Progressive) to generate D2.

4. Compress difference image d1.

(a) Decode D2 to obtain d̃2; add it to E(f̃4) to get f̃2 = E(f̃4) + d̃2, which is a version of f2 after compression and decompression.

(b) Encode difference d1 = f − E(f̃2) using any other JPEG method (e.g., Sequential, Progressive) to generate D1.

ALGORITHM 9.2 THREE-LEVEL HIERARCHICAL JPEG DECODER

1. Decompress the encoded low-resolution image F4. Decode F4 using the same JPEG method as in the encoder, to obtain f̃4.

2. Restore image f̃2 at the intermediate resolution. Use E(f̃4) + d̃2 to obtain f̃2.

3. Restore image f̃ at the original resolution. Use E(f̃2) + d̃1 to obtain f̃.

It should be pointed out that at step 3 in the encoder, the difference d2 is not taken as f2 − E(f4) but as f2 − E(f̃4). Employing f̃4 has its overhead, since an additional decoding step must be introduced on the encoder side, as shown in the figure. So, is it necessary? It is, because the decoder never has a chance to see the original f4. The restoration step in the decoder uses f̃4 to obtain f̃2 = E(f̃4) + d̃2. Since f4 ≠ f̃4 when a lossy JPEG method is used in compressing f4, the encoder must use f̃4 in d2 = f2 − E(f̃4) to avoid unnecessary error at decoding time. This kind of decoder-encoder step is typical in many compression schemes. In fact, we have seen it in Section 6.3.5. It is present simply because the decoder has access only to encoded, not original, values. Similarly, at step 4 in the encoder, d1 uses the difference between f and E(f̃2), not E(f2).

Lossless Mode. Lossless JPEG is a very special case of JPEG which indeed has no loss in its image quality. As discussed in Chapter 7, however, it employs only a simple differential coding method, involving no transform coding. It is rarely used, since its compression ratio is very low compared to other, lossy modes. On the other hand, it meets a special need, and the newly developed JPEG-LS standard is specifically aimed at lossless image compression (see Section 9.3).

9.1.3 A Glance at the JPEG Bitstream

Figure 9.6 provides a hierarchical view of the organization of the bitstream for JPEG images. Here, a frame is a picture, a scan is a pass through the pixels (e.g., the red component), a segment is a group of blocks, and a block consists of 8 × 8 pixels. Examples of some header information are:

• Frame header

− Bits per pixel
− (Width, height) of image
− Number of components
− Unique ID (for each component)
− Horizontal/vertical sampling factors (for each component)
− Quantization table to use (for each component)

• Scan header

− Number of components in scan
− Component ID (for each component)
− Huffman/Arithmetic coding table (for each component)

FIGURE 9.6: JPEG bitstream: Start_of_image, tables, a frame, End_of_image; a frame holds a header and scans; a scan holds a header and segments separated by restart markers; a segment is a sequence of blocks.

9.2 THE JPEG2000 STANDARD

The JPEG standard is no doubt the most successful and popular image format to date. The main reason for its success is the quality of its output for relatively good compression ratio. However, in anticipating the needs and requirements of next-generation imagery applications, the JPEG committee has defined a new standard: JPEG2000.

The new JPEG2000 standard [3] aims to provide not only a better rate-distortion tradeoff and improved subjective image quality but also additional functionalities the current JPEG standard lacks. In particular, the JPEG2000 standard addresses the following problems [4]:

• Low-bitrate compression. The current JPEG standard offers excellent rate-distortion performance at medium and high bitrates. However, at bitrates below 0.25 bpp, subjective distortion becomes unacceptable. This is important if we hope to receive images on our web-enabled ubiquitous devices, such as web-aware wristwatches, and so on.

• Lossless and lossy compression. Currently, no standard can provide superior lossless compression and lossy compression in a single bitstream.

• Large images. The new standard will allow image resolutions greater than 64k × 64k without tiling. It can handle image sizes up to 2^32 − 1.

• Single decompression architecture. The current JPEG standard has 44 modes, many of which are application-specific and not used by the majority of JPEG decoders.

• Transmission in noisy environments. The new standard will provide improved error resilience for transmission in noisy environments such as wireless networks and the Internet.

• Progressive transmission. The new standard provides seamless quality and resolution scalability from low to high bitrates. The target bitrate and reconstruction resolution need not be known at the time of compression.

• Region-of-interest coding. The new standard permits specifying Regions of Interest (ROI), which can be coded with better quality than the rest of the image. We might, for example, like to code the face of someone making a presentation with more quality than the surrounding furniture.

• Computer-generated imagery. The current JPEG standard is optimized for natural imagery and does not perform well on computer-generated imagery.

• Compound documents. The new standard offers metadata mechanisms for incorporating additional non-image data as part of the file. This might be useful for including text along with imagery, as one important example.

In addition, JPEG2000 is able to handle up to 256 channels of information, whereas the current JPEG standard is able to handle only three color channels. Such huge quantities of data are routinely produced in satellite imagery. Consequently, JPEG2000 is designed to address a variety of applications, such as the Internet, color facsimile, printing, scanning, digital photography, remote sensing, mobile applications, medical imagery, digital library, e-commerce, and so on. The method looks ahead and provides the power to carry out remote browsing of large compressed images.

The JPEG2000 standard operates in two coding modes: DCT-based and wavelet-based. The DCT-based coding mode is offered for backward compatibility with the current JPEG standard and implements baseline JPEG. All the new functionalities and improved performance reside in the wavelet-based mode.

9.2.1 Main Steps of JPEG2000 Image Compression

The main compression method used in JPEG2000 is the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm, designed by Taubman [5]. In addition to providing excellent compression efficiency, EBCOT produces a bitstream with a number of desirable features, including quality and resolution scalability and random access.

The basic idea of EBCOT is the partition of each subband LL, LH, HL, HH produced by the wavelet transform into small blocks called code blocks. Each code block is coded independently, in such a way that no information for any other block is used. A separate, scalable bitstream is generated for each code block. With its block-based coding scheme, the EBCOT algorithm has improved error resilience. The EBCOT algorithm consists of three steps:

1. Block coding and bitstream generation
2. Postcompression rate-distortion (PCRD) optimization
3. Layer formation and representation

FIGURE 9.7: Code block structure of EBCOT.

Block Coding and Bitstream Generation. Each subband generated for the 2D discrete wavelet transform is first partitioned into small code blocks of size 32 × 32 or 64 × 64. Then the EBCOT algorithm generates a highly scalable bitstream for each code block B_i. The bitstream associated with B_i may be independently truncated to any member of a predetermined collection of different lengths R_i^n, with associated distortion D_i^n.

For each code block B_i (see Figure 9.7), let s_i[k] = s_i[k1, k2] be the two-dimensional sequence of subband samples in the code block, with k1 and k2 the row and column index. (With this definition, the horizontal high-pass subband HL must be transposed so that k1 and k2 will have meaning consistent with the other subbands. This transposition
means that the HL subband can be treated in the same way as the LH, HH, and LL subbands and use the same context model.)

The algorithm uses a dead zone quantizer shown in Figure 9.8 — a double-length region straddling 0. Let χ_i[k] ∈ {−1, 1} be the sign of s_i[k] and let v_i[k] be the quantized magnitude. Explicitly, we have

v_i[k] = floor( |s_i[k]| / δ_βi )    (9.2)

where δ_βi is the step size for subband β_i, which contains code block B_i. Let v_i^p[k] be the pth bit in the binary representation of v_i[k], where p = 0 corresponds to the least significant bit, and let p_i^max be the maximum value of p such that v_i^p[k] ≠ 0 for at least one sample in the code block.

FIGURE 9.8: Dead zone quantizer. The length of the dead zone is 2δ. Values inside the dead zone are quantized to 0.

The encoding process is similar to that of a bitplane coder, in which the most significant bit v_i^{p_i^max}[k] is coded first for all samples in the code block, followed by the next most significant bit v_i^{p_i^max − 1}[k], and so on, until all bitplanes have been coded. In this way, if the bitstream is truncated, then some samples in the code block may be missing one or more least-significant bits. This is equivalent to having used a coarser dead zone quantizer for these samples.

In addition, it is important to exploit the previously encoded information about a particular sample and its neighboring samples. This is done in EBCOT by defining a binary-valued state variable σ_i[k], which is initially 0 but changes to 1 when the relevant sample's first nonzero bitplane v_i^p[k] = 1 is encoded. This binary state variable is referred to as the significance of a sample.

Section 8.8 introduces the zerotree data structure as a way of efficiently coding the bitstream for wavelet coefficients. The underlying observation behind the zerotree data structure is that significant samples tend to be clustered, so that it is often possible to dispose of a large number of samples by coding a single binary symbol.

EBCOT takes advantage of this observation; however, with efficiency in mind, it exploits the clustering assumption only down to relatively large sub-blocks of size 16 × 16. As a result, each code block is further partitioned into a two-dimensional sequence of sub-blocks B_i[j]. For each bitplane, explicit information is first encoded that identifies sub-blocks containing one or more significant samples. The other sub-blocks are bypassed in the remaining coding phases for that bitplane.

Let σ^p(B_i[j]) be the significance of sub-block B_i[j] in bitplane p. The significance map is coded using a quad tree. The tree is constructed by identifying the sub-blocks with leaf nodes — that is, B_i^0[j] = B_i[j]. The higher levels are built using recursion: B_i^t[j] = ∪_{z ∈ {0,1}^2} B_i^{t−1}[2j + z], 0 ≤ t ≤ T. The root of the tree represents the entire code block: B_i^T[0] = ∪_j B_i[j].

The significance of the code block is identified one quad level at a time, starting from the root at t = T and working toward the leaves at t = 0. The significance values are then sent to an arithmetic coder for entropy coding. Significance values that are redundant are skipped. A value is taken as redundant if any of the following conditions is met:

• The parent is insignificant.
• The current quad was already significant in the previous bitplane.
• This is the last quad visited among those that share the same significant parent, and the other siblings are insignificant.

EBCOT uses four different coding primitives to code new information for a single sample in a bitplane p, as follows:

• Zero coding. This is used to code v_i^p[k], given that the quantized sample satisfies v_i[k] < 2^{p+1}. Because the sample statistics are measured to be approximately Markovian, the significance of the current sample depends on the values of its eight immediate neighbors. The significance of these neighbors can be classified into three categories:

− Horizontal. h_i[k] = Σ_{z ∈ {−1,1}} σ_i[k1 + z, k2], with 0 ≤ h_i[k] ≤ 2
− Vertical. v_i[k] = Σ_{z ∈ {−1,1}} σ_i[k1, k2 + z], with 0 ≤ v_i[k] ≤ 2
− Diagonal. d_i[k] = Σ_{z1, z2 ∈ {−1,1}} σ_i[k1 + z1, k2 + z2], with 0 ≤ d_i[k] ≤ 4

The neighbors outside the code block are considered to be insignificant, but note that sub-blocks are not at all independent. The 256 possible neighborhood configurations are reduced to the nine distinct context assignments listed in Table 9.4.

• Run-length coding. The run-length coding primitive is aimed at producing runs of the 1-bit significance values, as a prelude for the arithmetic coding engine.

When a horizontal run of insignificant samples having insignificant neighbors is found, it is invoked instead of the zero coding primitive. Each of the following four conditions must be met for the run-length coding primitive to be invoked:

− Four consecutive samples must be insignificant.
− The samples must have insignificant neighbors.
− The samples must be within the same sub-block.
− The horizontal index k1 of the first sample must be even.

The last two conditions are simply for efficiency. When four symbols satisfy these conditions, one special bit is encoded instead, to identify whether any sample in the group is significant in the current bitplane (using a separate context model). If any of the four samples becomes significant, the index of the first such sample is sent as a 2-bit quantity.

• Sign coding. The sign coding primitive is invoked at most once for each sample, immediately after the sample makes a transition from being insignificant to significant during a zero coding or run-length coding operation. Since it has four horizontal and vertical neighbors, each of which may be insignificant, positive, or negative, there are 3^4 = 81 different context configurations. However, exploiting both horizontal and vertical symmetry and assuming that the conditional distribution of χ_i[k], given any neighborhood configuration, is the same as that of −χ_i[k], the number of contexts is reduced to 5.

Let ĥ_i[k] be 0 if both horizontal neighbors are insignificant, 1 if at least one horizontal neighbor is positive, or −1 if at least one horizontal neighbor is negative (and v̂_i[k] is defined similarly). Let χ̂_i[k] be the sign prediction. The binary symbol coded using the relevant context is χ_i[k] · χ̂_i[k]. Table 9.5 lists these context assignments.

TABLE 9.5: Context assignments for the sign coding primitive.

Label   ĥ_i[k]   v̂_i[k]   χ̂_i[k]
4        1        1        1
3        1        0        1
2        1       −1        1
1        0        1        1
0        0        0        1
1        0       −1       −1
2       −1        1       −1
3       −1        0       −1
4       −1       −1       −1

• Magnitude refinement. This primitive is used to code the value of v_i^p[k], given that v_i[k] ≥ 2^{p+1}. Only three context models are used for the magnitude refinement primitive. A second state variable σ̃_i[k] is introduced that changes from 0 to 1 after the magnitude refinement primitive is first applied to s_i[k]. The context models depend on the value of this state variable: v_i^p[k] is coded with context 0 if σ̃_i[k] = h_i[k] = v_i[k] = 0, with context 1 if σ̃_i[k] = 0 and h_i[k] + v_i[k] ≠ 0, and with context 2 if σ̃_i[k] = 1.

To ensure that each code block has a finely embedded bitstream, the coding of each bitplane p proceeds in four distinct passes, P_1^p to P_4^p:

• Forward-significance-propagation pass (P_1^p). The sub-block samples are visited in scanline order. Samples that are already significant, and samples that do not satisfy the neighborhood requirement, are skipped. For the LH, HL, and LL subbands, the neighborhood requirement is that at least one of the horizontal neighbors has to be significant. For the HH subband, the neighborhood requirement is that at least one of the four diagonal neighbors must be significant.

For insignificant samples that pass the neighborhood requirement, the zero coding and run-length coding primitives are invoked as appropriate, to determine whether the sample first becomes significant in bitplane p. If so, the sign coding primitive is invoked to encode the sign. This is called the forward-significance-propagation pass, because a sample that has been found to be significant helps in the new significance determination steps that propagate in the direction of the scan.

A 15 seccnd cllp of music tom a compacl disc was digiLized


at lhree diffêrent sampling rales (11 kNz, 22 Idiz, and 44 kHz)
with 8-bil precision. lhe effects aí lhe difFeçent samplng raLes
are clearly audible. This is a demonstration ai lhe Nyquist
Theorem.
FIGURE 9.9: AppearanCe aí coding passes and quad-tree codes in each bloclVs embedded Pies: Buifor. to PIay Nyquist Theorem:
bilstream. 8-bit Audio clip
The niinimum sampling
MuslcllkHz frequency af an ND converter
• ~ pass (P~). This pass is identicai lo VÇ. except shauld be at Ieast twice the
IM~Tci
Lhat it proceeds in lhe reverse order. The neighborhood requiremeflt is reiaxed Lo frequency aí the signal being
include saniples Lhat bave at least one significanL neighbor in any direction. IMns~~ LHI;I ss~ — measured.
VOL dose

• Magnitude refinement pass (V~). This pass encades samples that are already sig
nificant but tbat have not been caded in Lhe previous two passes. Such samples are
pracessed with Lhe magnitude refinemenL primitive.
• Normalization pass (P_p^4). The value v_p[k] of all samples not considered in the previous three coding passes is coded using the sign coding and run-length coding primitives, as appropriate. If a sample is found to be significant, its sign is immediately coded using the sign coding primitive.

Figure 9.9 shows the layout of coding passes and quad-tree codes in each block's embedded bitstream. S^p denotes the quad-tree code identifying the significant sub-blocks in bitplane p. Notice that for any bitplane p, S^p appears just before the final coding pass P_p^4, not the initial coding pass P_p^1. This implies that sub-blocks that become significant for the first time in bitplane p are ignored until the final pass.
Post Compression Rate-Distortion Optimization. After all the subband samples have been compressed, a post compression rate distortion (PCRD) step is performed. The goal of PCRD is to produce an optimal truncation of each code block's independent bitstream such that distortion is minimized, subject to the bit-rate constraint. For each truncated embedded bitstream of code block B_i having rate R_i^{n_i}, the overall distortion of the reconstructed image is (assuming distortion is additive)

    D = Σ_i D_i^{n_i}    (9.3)

where D_i^{n_i} is the distortion from code block B_i having truncation point n_i. For each code block B_i, distortion is computed by

    D_i^{n_i} = w_{b_i}^2 Σ_{k ∈ B_i} (ŝ_i^{n_i}[k] − s_i[k])^2    (9.4)

where s_i[k] is the 2D sequence of subband samples in code block B_i, and ŝ_i^{n_i}[k] is the quantized representation of these samples associated with truncation point n_i. The value w_{b_i} is the L2 norm of the wavelet basis function for the subband b_i that contains code block B_i.

FIGURE 2.4 (color insert): Colors and fonts. Courtesy of Ron Vetter.

FIGURE 2.6 (color insert): Color wheel.
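As a quick numeric illustration of Equations (9.3) and (9.4), the following Python sketch computes one block's distortion and sums block distortions into an overall figure. The sample values, quantized values, and the weight w_b are toy numbers invented for the example, not the output of a real JPEG2000 codec.

```python
# Sketch of the additive distortion model of Equations (9.3)-(9.4),
# using invented toy data.

def block_distortion(samples, quantized, w_b):
    """D_i^{n_i} = w_b^2 * sum_k (s_hat[k] - s[k])^2  (Equation 9.4)."""
    return w_b ** 2 * sum((sh - s) ** 2 for s, sh in zip(samples, quantized))

def total_distortion(block_distortions):
    """D = sum_i D_i^{n_i}  (Equation 9.3), assuming distortion is additive."""
    return sum(block_distortions)

# One hypothetical code block: original subband samples and their
# dequantized values at some truncation point.
s = [4.0, -2.5, 0.75, 1.25]
s_hat = [4.0, -2.0, 0.5, 1.0]   # coarser representation after truncation
d_i = block_distortion(s, s_hat, w_b=1.5)
print(d_i)                       # 1.5^2 * (0 + 0.25 + 0.0625 + 0.0625) = 0.84375
print(total_distortion([d_i, 2.0, 0.5]))
```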
FIGURE 4.15 (color insert): RGB and CMY color cubes. In the RGB cube, black is (0, 0, 0) and white is (1, 1, 1); in the CMY cube, white is (0, 0, 0) and black is (1, 1, 1).

FIGURE 3.5 (color insert): High-resolution color and separate R, G, B color channel images. (a) Example of 24-bit color image forestfire.bmp; (b, c, d) R, G, and B color channels for this image.

FIGURE 3.7 (color insert): Example of 8-bit color image.

FIGURE 3.17 (color insert): JPEG image with low quality specified by user.

FIGURE 4.16 (color insert): Additive and subtractive color. (a) RGB is used to specify additive color; (b) CMY is used to specify subtractive color.

FIGURE 4.18 (color insert): Y'UV decomposition of color image. Top image (a) is original color image; (b) is Y; (c) is U; (d) is V.

FIGURE 4.14 (color insert): CIELAB model. L = 100 (white) at top, L = 0 (black) at bottom; green: a < 0, yellow: b > 0, blue: b < 0, red: a > 0.

FIGURE 4.21 (color insert): SMPTE monitor gamut.

FIGURE 9.13 (color insert): Comparison of JPEG and JPEG2000. (a) Original image; (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp; (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp.

FIGURE 12.10 (color insert): Sprite coding. (a, b) The foreground object (piper) in a blue-screen image; (c) the composed video scene. Piper image courtesy of Simon Fraser University Pipe Band.

FIGURE 12.19 (color insert): MPEG-7 video segments. Moving regions: helicopter, person, boat.


Section 9.2 The JPEG2000 Standard 273

FIGURE 18.10 (color insert): Model and target images. (a) Sample model image; (b) sample database image containing the model book. Active Perception textbook cover courtesy of Lawrence Erlbaum Associates, Inc.

FIGURE 18.13 (color insert): Color locales. (a) Color locales for the model image; (b) color locales for a database image.

The optimal selection of truncation points n_i can be formulated into a minimization problem subject to the following constraint:

    R = Σ_i R_i^{n_i} ≤ R^max    (9.5)

where R^max is the available bit rate. For some λ, any set of truncation points {n_i^λ} that minimizes

    (D(λ) + λR(λ)) = Σ_i (D_i^{n_i^λ} + λ R_i^{n_i^λ})    (9.6)

is optimal in the rate-distortion sense. Thus, finding the set of truncation points that minimizes Equation (9.6) with total rate R(λ) = R^max would yield the solution to the entire optimization problem.

Since the set of truncation points is discrete, it is generally not possible to find a value of λ for which R(λ) is exactly equal to R^max. However, since the EBCOT algorithm uses relatively small code blocks, each of which has many truncation points, it is sufficient to find the smallest value of λ such that R(λ) ≤ R^max.

It is easy to see that each code block B_i can be minimized independently. Let M be the set of feasible truncation points and let j_1 < j_2 < … be an enumeration of these feasible truncation points, having corresponding distortion-rate slopes given by the ratios

    S_i^{j_k} = ΔD_i^{j_k} / ΔR_i^{j_k}    (9.7)

where ΔR_i^{j_k} = R_i^{j_k} − R_i^{j_{k−1}} and ΔD_i^{j_k} = D_i^{j_{k−1}} − D_i^{j_k}. It is evident that the slopes are strictly decreasing, since the operational distortion-rate curve is convex and strictly decreasing. The minimization problem for a fixed value of λ is simply the trivial selection

    n_i^λ = max { j_k ∈ M | S_i^{j_k} > λ }    (9.8)

The optimal value λ can be found using a simple bisection method operating on the distortion-rate curve. A detailed description of this method can be found in [6].
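The slope-based selection of Equations (9.7) and (9.8), together with a bisection search for the smallest feasible λ, can be sketched as follows. The per-block lists of (rate, distortion) pairs are invented toy values standing in for the truncation points a real encoder would measure, and all function names are ours.

```python
# Sketch of PCRD truncation-point selection (Equations 9.7-9.8) plus a
# bisection search for lambda. Toy data only; rates must be strictly
# increasing and distortions strictly decreasing (convex R-D curve).

def slopes(rd):
    """Distortion-rate slopes S between consecutive truncation points."""
    out = []
    for k in range(1, len(rd)):
        dR = rd[k][0] - rd[k - 1][0]          # Delta R (Equation 9.7)
        dD = rd[k - 1][1] - rd[k][1]          # Delta D (Equation 9.7)
        out.append(dD / dR)
    return out                                 # strictly decreasing

def truncation_point(rd, lam):
    """n_i(lambda): largest k whose slope exceeds lambda (Equation 9.8)."""
    n = 0
    for k, s in enumerate(slopes(rd), start=1):
        if s > lam:
            n = k
    return n

def total_rate(blocks, lam):
    return sum(b[truncation_point(b, lam)][0] for b in blocks)

def find_lambda(blocks, r_max, lo=0.0, hi=1e6, iters=60):
    """Smallest lambda with R(lambda) <= r_max, by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total_rate(blocks, mid) <= r_max:
            hi = mid                           # feasible: try smaller lambda
        else:
            lo = mid
    return hi

# Two hypothetical code blocks: lists of (rate in bytes, distortion).
b0 = [(0, 100.0), (10, 40.0), (25, 15.0), (50, 5.0)]
b1 = [(0, 80.0), (8, 50.0), (20, 20.0), (45, 12.0)]
lam = find_lambda([b0, b1], r_max=45)
print(truncation_point(b0, lam), truncation_point(b1, lam))   # prints: 2 2
```

With a total budget of 45 bytes, the search settles near λ ≈ 0.4 and both blocks truncate at their third rate point, spending the budget exactly.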

Layer Formation and Representation. The EBCOT algorithm offers both resolution and quality scalability, as opposed to other well-known scalable image compression algorithms such as EZW and SPIHT, which offer only quality scalability. This functionality is achieved using a layered bitstream organization and a two-tiered coding strategy.

The final bitstream EBCOT produces is composed of a collection of quality layers. The quality layer Q_1 contains the initial R_i^{n_i^1} bytes of each code block B_i, and the other layers Q_q contain the incremental contribution L_i^q = R_i^{n_i^q} − R_i^{n_i^{q−1}} ≥ 0 from code block B_i. The quantity n_i^q is the truncation point corresponding to the rate-distortion threshold λ_q selected for the qth quality layer. Figure 9.10 illustrates the layered bitstream (after [5]).
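A minimal sketch of how the layer contributions L_i^q = R_i^{n_i^q} − R_i^{n_i^{q−1}} are computed for one code block, using made-up rates and truncation points:

```python
# Sketch of quality-layer formation for a single code block: each layer q
# carries the incremental bytes between consecutive truncation points.
# Rates and truncation points below are toy values.

def layer_contributions(rates, trunc_per_layer):
    """rates[n] = R_i^n (cumulative bytes); trunc_per_layer[q] = n_i^q."""
    contributions, prev = [], 0
    for n in trunc_per_layer:
        contributions.append(rates[n] - rates[prev])   # L_i^q >= 0
        prev = n
    return contributions

rates = [0, 10, 25, 50]                        # bytes at truncation points 0..3
print(layer_contributions(rates, [1, 2, 3]))   # [10, 15, 25]
```

A block whose truncation point does not advance in some layer simply contributes zero bytes to that layer.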
FIGURE 9.10: Three quality layers with eight blocks each.

Along with these incremental contributions, auxiliary information such as the length L_i^q, the number of new coding passes N_i^q = n_i^q − n_i^{q−1}, the value p_i^max when B_i makes its first nonempty contribution to quality layer Q_q, and the index q_i of the quality layer to which B_i first makes a nonempty contribution must be explicitly stored. This auxiliary information is compressed in the second-tier coding engine. Hence, in this two-tiered architecture, the first tier produces the embedded block bitstreams, while the second encodes the block contributions to each quality layer.

The focus of this subsection is the second-tier processing of the auxiliary information accompanying each quality layer. The second-tier coding engine carefully handles the two quantities that exhibit substantial interblock redundancy. These two quantities are p_i^max and the index q_i of the quality layer to which B_i first makes a nonempty contribution.

The quantity q_i is coded using a separate embedded quad-tree code within each subband. Let B_i^0 = B_i be the leaves and B^T be the root of the tree that represents the entire subband. Let q_j^t = min{q_i | B_i ⊂ B_j^t} be the index of the first layer in which any code block in quad B_j^t makes a nonempty contribution. A single bit identifies whether q_j^t > q for each quad at each level t, with redundant quads omitted. A quad is redundant if either q_j^t < q − 1 or q_j^{t+1} > q for some parent quad B_j^{t+1}.

The other redundant quantity to consider is p_i^max. It is clear that p_i^max is irrelevant until the coding of the quality layer Q_{q_i}. Thus, any unnecessary information concerning p_i^max need not be sent until we are ready to encode Q_{q_i}. EBCOT does this using a modified embedded quad-tree driven from the leaves rather than from the root.

Let B_j^t be the elements of the quad-tree structure built on top of the code blocks B_i from any subband, and let p_j^{max,t} = max{p_i^max | B_i ⊂ B_j^t}. In addition, let B_j^t be the ancestors of quads from which B_i descends, and let P be a value guaranteed to be larger than p_i^max for any code block B_i. When code block B_i first contributes to the bitstream in quality layer Q_q, the value of p_i^max = p_i^{max,0} is coded using the following algorithm:

• For p = P − 1, P − 2, …, 0:
  - Send binary digits to identify whether p_j^{max,t} < p. The redundant bits are skipped.
  - If p_i^max = p, then stop.

The redundant bits are those corresponding to the condition p_j^{max,t} < p that can be inferred either from ancestors such that p_j^{max,t+1} < p or from the partial quad-tree code used to identify p_i^max for a different code block B_i.

9.2.2 Adapting EBCOT to JPEG2000

JPEG2000 uses the EBCOT algorithm as its primary coding method. However, the algorithm is slightly modified to enhance compression efficiency and reduce computational complexity.

To further enhance compression efficiency, as opposed to initializing the entropy coder using equiprobable states for all contexts, the JPEG2000 standard makes an assumption of highly skewed distributions for some contexts, to reduce the model adaptation cost for typical images. Several small adjustments are made to the original algorithm to further reduce its execution time.

First, a low-complexity arithmetic coder that avoids multiplications and divisions, known as the MQ coder [7], replaces the usual arithmetic coder used in the original algorithm. Furthermore, JPEG2000 does not transpose the HL subband's code blocks. Instead, the corresponding entries in the zero coding context assignment map are transposed.

To ensure a consistent scan direction, JPEG2000 combines the forward- and reverse-significance-propagation passes into a single forward-significance-propagation pass with a neighborhood requirement equal to that of the original reverse pass. In addition, reducing the sub-block size to 4 × 4 from the original 16 × 16 eliminates the need to explicitly code sub-block significance. The resulting probability distribution for these small sub-blocks is highly skewed, so the coder behaves as if all sub-blocks are significant.

The cumulative effect of these modifications is an increase of about 40% in software execution speed, with an average loss of about 0.15 dB relative to the original algorithm.

9.2.3 Region-of-Interest Coding

A significant feature of the new JPEG2000 standard is the ability to perform region-of-interest (ROI) coding [8]. Here, particular regions of the image may be coded with better quality than the rest of the image or the background. The method is called MAXSHIFT, a scaling-based method that scales up the coefficients in the ROI so that they are placed into higher bitplanes. During the embedded coding process, the resulting bits are placed in front of the non-ROI part of the image. Therefore, given a reduced bitrate, the ROI will be decoded and refined before the rest of the image. As a result of these mechanisms, the ROI will have much better quality than the background.
One thing to note is that regardless of scaling, full decoding of the bitstream will result in reconstruction of the entire image with the highest fidelity available. Figure 9.11 demonstrates the effect of region-of-interest coding as the target bitrate of the sample image is increased.

FIGURE 9.11: Region-of-interest (ROI) coding of an image with increasing bit-rate, using a circularly shaped ROI: (a) 0.4 bpp; (b) 0.5 bpp; (c) 0.6 bpp; (d) 0.7 bpp.

9.2.4 Comparison of JPEG and JPEG2000 Performance

After studying the internals of the JPEG2000 compression algorithm, a natural question that comes to mind is, how well does JPEG2000 perform compared to other well-known standards, in particular JPEG? Many comparisons have been made between JPEG and other well-known standards, so here we compare JPEG2000 only to the popular JPEG.

Various criteria, such as computational complexity, error resilience, compression efficiency, and so on, have been used to evaluate the performance of systems. Since our main focus is on the compression aspect of the JPEG2000 standard, here we simply compare compression efficiency. (Interested readers can refer to [9] and [10] for comparisons using other criteria.)

Given a fixed bitrate, let's compare the quality of compressed images quantitatively by the PSNR: for color images, the PSNR is calculated based on the average of the mean squared errors of all the RGB components. Also, we visually show results for both JPEG2000 and JPEG compressed images, so that you can make your own qualitative assessment. We perform a comparison for three categories of images: natural, computer-generated, and medical, using three images from each category. The test images used are shown on the textbook web site in the Further Exploration section for this chapter.

For each image, we compress using JPEG and JPEG2000 at four bitrates: 0.25 bpp, 0.5 bpp, 0.75 bpp, and 1.0 bpp. Figure 9.12 shows plots of the average PSNR of the images in each category against bitrate. We see that JPEG2000 substantially outperforms JPEG in all categories.

For a qualitative comparison of the compression results, let's choose a single image and show decompressed output for the two algorithms using a low bitrate (0.75 bpp) and the lowest bitrate (0.25 bpp). From the results in Figure 9.13, it should be obvious that images compressed using JPEG2000 show significantly fewer visual artifacts.

9.3 THE JPEG-LS STANDARD

Generally, we would likely apply a lossless compression scheme to images that are critical in some sense, say medical images of a brain, or perhaps images that are difficult or costly to acquire. A scheme in competition with the lossless mode provided in JPEG2000 is the JPEG-LS standard, specifically aimed at lossless encoding [11]. The main advantage of JPEG-LS over JPEG2000 is that JPEG-LS is based on a low-complexity algorithm. JPEG-LS is part of a larger ISO effort aimed at better compression of medical images.

JPEG-LS is in fact the current ISO/ITU standard for lossless or "near lossless" compression of continuous-tone images. The core algorithm in JPEG-LS is called LOw COmplexity LOssless COmpression for Images (LOCO-I), proposed by Hewlett-Packard [11]. The design of this algorithm is motivated by the observation that complexity reduction is often more important overall than any small increase in compression offered by more complex algorithms.

LOCO-I exploits a concept called context modeling. The idea of context modeling is to take advantage of the structure in the input source, that is, conditional probabilities of what pixel values follow from each other in the image. This extra knowledge is called the context. If the input source contains substantial structure, as is usually the case, we could potentially compress it using fewer bits than the 0th-order entropy.
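The gain promised by context modeling can be checked numerically. The sketch below uses a hypothetical binary source whose current symbol depends on the previous one; the probabilities are illustrative only.

```python
# Numeric check of conditional entropy for a toy binary source with
# P(0) = 0.4, P(1) = 0.6, where P(0 | prev=0) = 0.8 and P(0 | prev=1) = 0.1.

from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

h0 = entropy([0.4, 0.6])            # 0th-order entropy, about 0.97 bits
h_ctx0 = entropy([0.8, 0.2])        # previous symbol was 0, about 0.72 bits
h_ctx1 = entropy([0.1, 0.9])        # previous symbol was 1, about 0.47 bits
avg = 0.4 * h_ctx0 + 0.6 * h_ctx1   # weighted by context probabilities
print(round(h0, 2), round(h_ctx0, 2), round(h_ctx1, 2), round(avg, 2))
```

The average conditional rate (about 0.57 bits/symbol) is well below the 0.97-bit 0th-order entropy, which is exactly the gap a context-based coder exploits.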


FIGURE 9.13: Comparison of JPEG and JPEG2000: (a) original image; (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp; (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp. (This figure also appears in the color insert section.)

As a simple example, suppose we have a binary source with P(0) = 0.4 and P(1) = 0.6. Then the 0th-order entropy H(S) = −0.4 log2(0.4) − 0.6 log2(0.6) = 0.97. Now suppose we also know that this source has the property that if the previous symbol is 0, the probability of the current symbol being 0 is 0.8, and if the previous symbol is 1, the probability of the current symbol being 0 is 0.1.

If we use the previous symbol as our context, we can divide the input symbols into two sets, corresponding to context 0 and context 1, respectively. Then the entropy of each of the two sets is

    H(S_1) = −0.8 log2(0.8) − 0.2 log2(0.2) = 0.72
    H(S_2) = −0.1 log2(0.1) − 0.9 log2(0.9) = 0.47

The average bit-rate for the entire source would be 0.4 × 0.72 + 0.6 × 0.47 = 0.57, which is substantially less than the 0th-order entropy of the entire source in this case.

LOCO-I uses the context model shown in Figure 9.14. In raster scan order, the context pixels a, b, c, and d all appear before the current pixel x. Thus, this is called a causal context.

    c  b  d
    a  x

FIGURE 9.14: JPEG-LS context model.

LOCO-I can be broken down into three components:

• Prediction. Predicting the value of the next sample x' using a causal template
• Context determination. Determining the context in which x' occurs
• Residual coding. Entropy coding of the prediction residual conditioned by the context of x'

9.3.1 Prediction

A better version of prediction could use an adaptive model based on a calculation of the local edge direction. However, because JPEG-LS is aimed at low complexity, the LOCO-I algorithm instead uses a fixed predictor that performs primitive tests to detect vertical and horizontal edges. The fixed predictor used by the algorithm is given as follows:

    x̂ = min(a, b)    if c ≥ max(a, b)
        max(a, b)    if c ≤ min(a, b)    (9.9)
        a + b − c    otherwise

It is easy to see that this predictor switches between three simple predictors. It outputs b when there is a vertical edge to the left of the current location; it outputs a when there is a horizontal edge above the current location; and finally, it outputs a + b − c when the neighboring samples are relatively smooth.

9.3.2 Context Determination

The context model that conditions the current prediction error (the residual) is indexed using a three-component context vector Q = (q1, q2, q3), whose components are

    q1 = d − b
    q2 = b − c    (9.10)
    q3 = c − a

These differences represent the local gradient that captures the local smoothness or edge contents surrounding the current sample. Because these differences can potentially take on a wide range of values, the underlying context model is huge, making the context-modeling approach impractical. To solve this problem, parameter reduction methods are needed.

An effective method is to quantize these differences so that they can be represented by a limited number of values. The components of Q are quantized using a quantizer with decision boundaries −T, …, −1, 0, 1, …, T. In JPEG-LS, T = 4. The context size is further reduced by replacing any context vector Q whose first element is negative by −Q. Therefore, the number of different context states is ((2T + 1)^3 + 1)/2 = 365 in total. The vector Q is then mapped into an integer in [0, 364].

9.3.3 Residual Coding

For any image, the prediction residual has a finite size, α. For a given prediction x̂, the residual e is in the range −x̂ ≤ e < α − x̂. Since the value x̂ can be generated by the decoder, the dynamic range of the residual e can be reduced modulo α and mapped into a value between −⌊α/2⌋ and ⌈α/2⌉ − 1.

It can be shown that the error residuals follow a two-sided geometric distribution (TSGD). As a result, they are coded using adaptively selected codes based on Golomb codes, which are optimal for sequences with geometric distributions [12].

9.3.4 Near-Lossless Mode

The JPEG-LS standard also offers a near-lossless mode, in which the reconstructed samples deviate from the original by no more than an amount δ. The main lossless JPEG-LS mode can be considered a special case of the near-lossless mode with δ = 0. Near-lossless compression is achieved using quantization: residuals are quantized using a uniform quantizer having intervals of length 2δ + 1. The quantized values of e are given by

    Q(e) = sign(e) ⌊(|e| + δ) / (2δ + 1)⌋    (9.11)

Since δ can take on only a small number of integer values, the division operation can be implemented efficiently using lookup tables. In near-lossless mode, the prediction and context determination steps described previously are based on the quantized values only.
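The LOCO-I building blocks described in this section can be sketched directly from Equations (9.9), (9.10), and (9.11). This is a toy illustration, not the normative JPEG-LS code path; in particular, the quantization of the gradients into the 365 context states is omitted.

```python
# Sketch of LOCO-I pieces: the fixed median-edge predictor (Equation 9.9),
# the gradient context vector (Equation 9.10), and the near-lossless
# residual quantizer (Equation 9.11). a = left, b = above, c = upper-left,
# d = upper-right neighbor of the current pixel x.

def predict(a, b, c):
    """x_hat per Equation (9.9)."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c

def gradients(a, b, c, d):
    """Context vector Q = (d - b, b - c, c - a) per Equation (9.10)."""
    return (d - b, b - c, c - a)

def quantize_residual(e, delta):
    """Equation (9.11): sign(e) * floor((|e| + delta) / (2*delta + 1))."""
    sign = -1 if e < 0 else 1
    return sign * ((abs(e) + delta) // (2 * delta + 1))

print(predict(a=10, b=40, c=10))   # vertical edge at left: predicts b = 40
print(gradients(3, 5, 2, 9))       # (4, 3, -1)
print(quantize_residual(7, 2))     # 1
```

Note that with delta = 0 the quantizer reduces to the identity, matching the statement that lossless mode is the special case δ = 0.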

9.4 BILEVEL IMAGE COMPRESSION STANDARDS

As more and more documents are handled in electronic form, efficient methods for compressing bilevel images (those with only 1-bit, black-and-white pixels) are much in demand. A familiar example is fax images. Algorithms that take advantage of the binary nature of the image data often perform better than generic image-compression algorithms. Earlier facsimile standards, such as G3 and G4, use simple models of the structure of bilevel images. Each scanline in the image is treated as a run of black-and-white pixels. However, considering the neighboring pixels and the nature of the data to be coded allows much more efficient algorithms to be constructed. This section examines the JBIG standard and its successor, JBIG2, as well as the underlying motivations and principles for these two standards.

9.4.1 The JBIG Standard

JBIG is the coding standard recommended by the Joint Bi-level Image Processing Group for binary images. This lossless compression standard is used primarily to code scanned images of printed or handwritten text, computer-generated text, and facsimile transmissions. It offers progressive encoding and decoding capability, in the sense that the resulting bitstream contains a set of progressively higher-resolution images. This standard can also be used to code grayscale and color images by coding each bitplane independently, but this is not the main objective.

The JBIG compression standard has three separate modes of operation: progressive, progressive-compatible sequential, and single-progression sequential. The progressive-compatible sequential mode uses a bitstream compatible with the progressive mode. The only difference is that the data is divided into strips in this mode.

The single-progression sequential mode has only a single lowest-resolution layer. Therefore, an entire image can be coded without any reference to other higher-resolution layers. Both these modes can be viewed as special cases of the progressive mode. Therefore, our discussion covers only the progressive mode.

The JBIG encoder can be decomposed into two components:

• Resolution-reduction and differential-layer encoder
• Lowest-resolution-layer encoder

The input image goes through a sequence of resolution-reduction and differential-layer encoders. Each is equivalent in functionality, except that their input images have different resolutions. Some implementations of the JBIG standard may choose to recursively use one such physical encoder. The lowest-resolution image is coded using the lowest-resolution-layer encoder. The design of this encoder is somewhat simpler than that of the resolution-reduction and differential-layer encoders, since the resolution-reduction and deterministic prediction operations are not needed.

9.4.2 The JBIG2 Standard

While the JBIG standard offers both lossless and progressive (lossy to lossless) coding abilities, the lossy image produced by this standard has significantly lower quality than the original, because the lossy image contains at most only one-quarter of the number of pixels in the original image. By contrast, the JBIG2 standard is explicitly designed for lossy, lossless, and lossy to lossless image compression. The design goal for JBIG2 aims not only at providing superior lossless compression performance over existing standards but also at incorporating lossy compression at a much higher compression ratio, with as little visible degradation as possible.

A unique feature of JBIG2 is that it is both quality progressive and content progressive. By quality progressive, we mean that the bitstream behaves similarly to that of the JBIG standard, in which the image quality progresses from lower to higher (or possibly lossless) quality. On the other hand, content progressive allows different types of image data to be added progressively. The JBIG2 encoder decomposes the input bilevel image into regions of different attributes and codes each separately, using different coding methods.

As in other image compression standards, only the JBIG2 bitstream, and thus the decoder, is explicitly defined. As a result, any encoder that produces the correct bitstream is "compliant", regardless of the actions it actually takes. Another feature of JBIG2 that sets it apart from other image compression standards is that it is able to represent multiple pages of a document in a single file, enabling it to exploit interpage similarities.

For example, if a character appears on one page, it is likely to appear on other pages as well. Thus, using a dictionary-based technique, this character is coded only once instead of multiple times for every page on which it appears. This compression technique is somewhat analogous to video coding, which exploits interframe redundancy to increase compression efficiency.

JBIG2 offers content-progressive coding and superior compression performance through model-based coding, in which different models are constructed for different data types in an image, realizing additional coding gain.

Model-Based Coding. The idea behind model-based coding is essentially the same as that of context-based coding. From the study of the latter, we know we can realize better compression performance by carefully designing a context template and accurately estimating the probability distribution for each context. Similarly, if we can separate the image content into different categories and derive a model specifically for each, we are much more likely to accurately model the behavior of the data and thus achieve a higher compression ratio.

In the JBIG style of coding, adaptive and model templates capture the structure within the image. This model is general, in the sense that it applies to all kinds of data. However, being general implies that it does not explicitly deal with the structural differences between text and halftone data that comprise nearly all the contents of bilevel images. JBIG2 takes advantage of this by designing custom models for these data types.

The JBIG2 specification expects the encoder to first segment the input image into regions of different data types, in particular, text and halftone regions. Each region is then coded independently, according to its characteristics.

Text-Region Coding. Each text region is further segmented into pixel blocks containing connected black pixels. These blocks correspond to characters that make up the content of this region. Then, instead of coding all pixels of each character, the bitmap of one representative instance of this character is coded and placed into a dictionary. For any character to be coded, the

algorithm first tries to find a match with the characters in the dictionary. If one is found, then both a pointer to the corresponding entry in the dictionary and the position of the character on the page are coded. Otherwise, the pixel block is coded directly and added to the dictionary. This technique is referred to as pattern matching and substitution in the JBIG2 specification.

However, for scanned documents, it is unlikely that two instances of the same character will match pixel by pixel. In this case, JBIG2 allows the option of including refinement data to reproduce the original character on the page. The refinement data codes the current character using the pixels in the matching character in the dictionary. The encoder has the freedom to choose the refinement to be exact or lossy. This method is called soft pattern matching.

The numeric data, such as the index of the matched character in the dictionary and the position of the characters on the page, are either bitwise or Huffman encoded. Each bitmap for the characters in the dictionary is coded using JBIG-based techniques.

Halftone-Region Coding. The JBIG2 standard suggests two methods for halftone image coding. The first is similar to the context-based arithmetic coding used in JBIG. The only difference is that the new standard allows the context template to include as many as 16 template pixels, four of which may be adaptive.

The second method is called descreening. This involves converting back to grayscale and coding the grayscale values. In this method, the bilevel region is divided into blocks of size m_b × n_b. For an m × n bilevel region, the resulting grayscale image has dimensions m_g = ⌊(m + (m_b − 1))/m_b⌋ by n_g = ⌊(n + (n_b − 1))/n_b⌋. The grayscale value is then computed to be the sum of the binary pixel values in the corresponding m_b × n_b block. The bitplanes of the grayscale image are coded using context-based arithmetic coding. The grayscale values are used as indices into a dictionary of halftone bitmap patterns. The decoder can use this value to index into this dictionary, to reconstruct the original halftone image.

Preprocessing and Postprocessing. JBIG2 allows the use of lossy compression but does not specify a method for doing so. From the decoder's point of view, the decoded bitstream is lossless with respect to the image encoded by the encoder, although not necessarily with respect to the original image. The encoder may modify the input image in a preprocessing step, to increase coding efficiency. The preprocessor usually tries to change the original image to lower the code length in a way that does not generally affect the image's appearance. Typically, it tries to remove noisy pixels and smooth out pixel blocks.

Postprocessing, another issue not addressed by the specification, can be especially useful for halftones, potentially producing more visually pleasing images. It is also helpful to tune the decoded image to a particular output device, such as a laser printer.

9.5 FURTHER EXPLORATION

The books by Pennebaker and Mitchell [1] and Taubman and Marcellin [3] provide good references for JPEG and JPEG2000, respectively. Bhaskaran and Konstantinides [2] provide detailed discussions of several image compression standards and the theory underlying them.

An easy-to-use JPEG demo, written in Java, is available for you to try, linked from this section of the text web site. Other useful links include

• Thumbnails for test images used for JPEG/JPEG2000 performance evaluation: natural images, computer-generated images, and medical images
• Many JPEG- and JPEG2000-related links
• Java and C implementations of JPEG2000 encoder and decoder
• A Java applet for JPEG and JPEG2000 comparison
• A simple explanation of context-based image compression
• The original research papers for LOCO-I
• The JPEG-LS public domain source code
• An introduction to JBIG and documentation and source code for JBIG and JBIG2
• A good resource for data compression compiled by Mark Nelson that includes libraries, documentation, and source code for JPEG, JPEG-2000, JPEG-LS, JBIG, etc.

9.6 EXERCISES

1. (a) JPEG uses the Discrete Cosine Transform (DCT) for image compression.

      i. What is the value of F(0, 0) if the image f(i, j) is as below?
      ii. Which AC coefficient |F(u, v)| is the largest for this f(i, j)? Why? Is this F(u, v) positive or negative? Why?

          20  20  20  20  20  20  20  20
          20  20  20  20  20  20  20  20
          80  80  80  80  80  80  80  80
          80  80  80  80  80  80  80  80
         140 140 140 140 140 140 140 140
         140 140 140 140 140 140 140 140
         200 200 200 200 200 200 200 200
         200 200 200 200 200 200 200 200

   (b) Show in detail how a three-level hierarchical JPEG will encode the image above, assuming that

      i. The encoder and decoder at all three levels use Lossless JPEG.
      ii. Reduction simply averages each 2 × 2 block into a single pixel value.
      iii. Expansion duplicates the single pixel value four times.
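The reduction and expansion operations assumed in the exercise above can be sketched in a few lines of Python (plain nested lists; a real implementation would use an image library):

```python
# Sketch of hierarchical-JPEG helper operations: reduction averages each
# 2x2 block into one pixel; expansion duplicates each pixel four times.

def reduce2x2(img):
    """Average each 2x2 block of a list-of-lists image (even dimensions)."""
    h, w = len(img), len(img[0])
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def expand2x2(img):
    """Duplicate each pixel horizontally and vertically."""
    out = []
    for row in img:
        doubled = [v for v in row for _ in (0, 1)]   # duplicate horizontally
        out.append(doubled)
        out.append(list(doubled))                    # duplicate vertically
    return out

img = [[20, 20, 80, 80],
       [20, 20, 80, 80],
       [140, 140, 200, 200],
       [140, 140, 200, 200]]
small = reduce2x2(img)
print(small)                           # [[20.0, 80.0], [140.0, 200.0]]
print(expand2x2(small) == img)         # True for this blocky toy image
```

On a smooth image the reduce-then-expand round trip is only approximate; the hierarchical encoder codes that residual at the next level.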

2. In JPEG, the Discrete Cosine Transform is applied to 8 x 8 blocks in an image. For now, let's call it DCT-8. Generally, we can define a DCT-N to be applied to N x N blocks in an image. DCT-N is defined as:

    F_N(u, v) = (2 C(u) C(v) / N) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} cos[(2i + 1)uπ / 2N] cos[(2j + 1)vπ / 2N] f(i, j)

    C(ξ) = √2/2 for ξ = 0;  C(ξ) = 1 otherwise.

Given f(i, j) as below, show your work for deriving all pixel values of F_2(u, v). (That is, show the result of applying DCT-2 to the image below.)

    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100
    100 −100 100 −100 100 −100 100 −100

3. According to the DCT-N definition above, F_N(1) and F_N(N − 1) are the AC coefficients representing the lowest and highest spatial frequencies, respectively.

(a) It is known that F_16(1) and F_8(1) do not capture the same (lowest) frequency response in image filtering. Explain why.

(b) Do F_16(15) and F_8(7) capture the same (highest) frequency response?

4. You are given a computer cartoon picture and a photograph. If you have a choice of using either JPEG compression or GIF, which compression would you apply for these two images? Justify your answer.

5. Suppose we view a decompressed 512 x 512 JPEG image but use only the color part of the stored image information, not the luminance part, to decompress. What does the 512 x 512 color image look like? Assume JPEG is compressed using a 4:2:0 scheme.

6. (a) How many principal modes does JPEG have? What are their names?

(b) In the hierarchical model, explain briefly why we must include an encode/decode cycle on the coder side before transmitting difference images to the decode side.

(c) What are the two methods used to decode only part of the information in a JPEG file, so that the image can be coarsely displayed quickly and iteratively increased in quality?

7. Could we use wavelet-based compression in ordinary JPEG? How?

8. We decide to create a new image-compression standard based on JPEG, for use with images that will be viewed by an alien species. What part of the JPEG workflow would we likely have to change?

9. Unlike EZW, EBCOT does not explicitly take advantage of the spatial relationships of wavelet coefficients. Instead, it uses the PCRD optimization approach. Discuss the rationale behind this approach.

10. Is the JPEG2000 bitstream SNR scalable? If so, explain how it is achieved using the EBCOT algorithm.

11. Implement transform coding, quantization, and hierarchical coding for the encoder and decoder of a three-level Hierarchical JPEG. Your code should include a (minimal) graphical user interface for the purpose of demonstrating your results. You do not need to implement the entropy (lossless) coding part; optionally, you may include any publicly available code for it.

9.7 REFERENCES

1. W.B. Pennebaker and J.L. Mitchell, The JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
2. V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd ed., Boston: Kluwer Academic Publishers, 1997.
3. D.S. Taubman and M.W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, Norwell, MA: Kluwer Academic Publishers, 2002.
4. C.A. Christopoulos, "Tutorial on JPEG2000," In Proc. of Int. Conf. on Image Processing, 1999.
5. D. Taubman, "High Performance Scalable Image Compression with EBCOT," IEEE Trans. Image Processing, 9(7): 1158–1170, 2000.
6. K. Ramchandran and M. Vetterli, "Best Wavelet Packet Basis in a Rate-Distortion Sense," IEEE Trans. Image Processing, 2: 160–173, 1993.
7. I. Ueno, F. Ono, T. Yanagiya, T. Kimura, and M. Yoshida, Proposal of the Arithmetic Coder for JPEG2000, ISO/IEC JTC1/SC29/WG1 N1143, 1999.
8. A.N. Skodras, C.A. Christopoulos, and T. Ebrahimi, "JPEG2000: The Upcoming Still Image Compression Standard," In 11th Portuguese Conference on Pattern Recognition, pp. 359–366, 2000.
9. D. Santa-Cruz and T. Ebrahimi, "A Study of JPEG2000 Still Image Coding Versus Other Standards," In X European Signal Processing Conference, pp. 673–676, 2000.
10. D. Santa-Cruz, T. Ebrahimi, J. Askelof, M. Larsson, and C.A. Christopoulos, JPEG2000 Still Image Coding Versus Other Standards, ISO/IEC JTC1/SC29/WG1 (ITU-T SG8), 2000.
11. M. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I Lossless Image Compression Algorithm: Principles and Standardization into JPEG-LS," Technical Report HPL-98-193R1, Hewlett-Packard, 1998.
12. N. Merhav, G. Seroussi, and M.J. Weinberger, "Optimal Prefix Codes for Sources with Two-Sided Geometric Distributions," IEEE Transactions on Information Theory, 46(1): 121–135, 2000.

CHAPTER 10

Basic Video Compression Techniques

As discussed in Chapter 7, the volume of uncompressed video data can be extremely large. Even a modest CIF video with a picture resolution of only 352 x 288, if uncompressed, would carry more than 35 Mbps. In HDTV, the bitrate could easily exceed 1 Gbps. This poses challenges and problems for storage and network communications.

This chapter introduces some basic video compression techniques and illustrates them in standards H.261 and H.263, two video compression standards aimed mostly at videoconferencing. The next two chapters further introduce several MPEG video compression standards and the latest, H.264.

10.1 INTRODUCTION TO VIDEO COMPRESSION

A video consists of a time-ordered sequence of frames (images). An obvious solution to video compression would be predictive coding based on previous frames. For example, suppose we simply created a predictor such that the prediction equals the previous frame. Then compression proceeds by subtracting images: instead of subtracting the image from itself (i.e., using a derivative), we subtract in time order and code the residual error.

And this works. Suppose most of the video is unchanging in time. Then we get a nice histogram peaked sharply at zero, a great reduction in terms of the entropy of the original video, just what we wish for.

However, it turns out that at acceptable cost, we can do even better by searching for just the right parts of the image to subtract from the previous frame. After all, our naive subtraction scheme will likely work well for a background of office furniture and sedentary university types, but wouldn't a football game have players zooming around the frame, producing large values when subtracted from the previously static green playing field?

So in the next section we examine how to do better. The idea of looking for the football player in the next frame is called motion estimation, and the concept of shifting pieces of the frame around so as to best subtract away the player is called motion compensation.

10.2 VIDEO COMPRESSION BASED ON MOTION COMPENSATION

The image compression techniques discussed in the previous chapters (e.g., JPEG and JPEG2000) exploit spatial redundancy, the phenomenon that picture contents often change relatively slowly across images, making a large suppression of higher spatial frequency components viable.

FIGURE 10.1: Macroblocks and motion vector in video compression: (a) reference frame; (b) target frame.

A video can be viewed as a sequence of images stacked in the temporal dimension. Since the frame rate of the video is often relatively high (e.g., ≥ 15 frames per second) and the camera parameters (focal length, position, viewing angle, etc.) usually do not change rapidly between frames, the contents of consecutive frames are usually similar, unless certain objects in the scene move extremely fast. In other words, the video has temporal redundancy.

Temporal redundancy is often significant and it is exploited, so that not every frame of the video needs to be coded independently as a new image. Instead, the difference between the current frame and other frame(s) in the sequence is coded. If redundancy between them is great enough, the difference images could consist mainly of small values and low entropy, which is good for compression.

As we mentioned, although a simplistic way of deriving the difference image is to subtract one image from the other (pixel by pixel), such an approach is ineffective in yielding a high compression ratio. Since the main cause of the difference between frames is camera and/or object motion, these motion generators can be "compensated" by detecting the displacement of corresponding pixels or regions in these frames and measuring their differences. Video compression algorithms that adopt this approach are said to be based on motion compensation (MC). The three main steps of these algorithms are:

1. Motion estimation (motion vector search)
2. Motion-compensation-based prediction
3. Derivation of the prediction error, that is, the difference

For efficiency, each image is divided into macroblocks of size N x N. By default, N = 16 for luminance images. For chrominance images, N = 8 if 4:2:0 chroma subsampling is adopted. Motion compensation is not performed at the pixel level, nor at the level of video object, as in later video standards (such as MPEG-4). Instead, it is at the macroblock level.

The current image frame is referred to as the Target frame. A match is sought between the macroblock under consideration in the Target frame and the most similar macroblock in
previous and/or future frame(s) [referred to as Reference frame(s)]. In that sense, the Target macroblock is predicted from the Reference macroblock(s).

The displacement of the reference macroblock to the target macroblock is called a motion vector MV. Figure 10.1 shows the case of forward prediction, in which the Reference frame is taken to be a previous frame. If the Reference frame is a future frame, it is referred to as backward prediction. The difference of the two corresponding macroblocks is the prediction error.

For video compression based on motion compensation, after the first frame, only the motion vectors and difference macroblocks need be coded, since they are sufficient for the decoder to regenerate all macroblocks in subsequent frames.

We will return to the discussion of some common video compression standards after the following section, in which we discuss search algorithms for motion vectors.

10.3 SEARCH FOR MOTION VECTORS

The search for motion vectors MV(u, v) as defined above is a matching problem, also called a correspondence problem [1]. Since MV search is computationally expensive, it is usually limited to a small immediate neighborhood. Horizontal and vertical displacements i and j are in the range [−p, p], where p is a positive integer with a relatively small value. This makes a search window of size (2p + 1) x (2p + 1), as Figure 10.1 shows. The center of the macroblock (x0, y0) can be placed at each of the grid positions in the window.

For convenience, we use the upper left corner (x, y) as the origin of the macroblock in the Target frame. Let C(x + k, y + l) be pixels in the macroblock in the Target (current) frame and R(x + i + k, y + j + l) be pixels in the macroblock in the Reference frame, where k and l are indices for pixels in the macroblock and i and j are the horizontal and vertical displacements, respectively. The difference between the two macroblocks can then be measured by their Mean Absolute Difference (MAD), defined as

    MAD(i, j) = (1/N²) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} |C(x + k, y + l) − R(x + i + k, y + j + l)|,   (10.1)

where N is the size of the macroblock.

The goal of the search is to find a vector (i, j) as the motion vector MV = (u, v), such that MAD(i, j) is minimum:

    (u, v) = [(i, j) | MAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p]]   (10.2)

We used the mean absolute difference in the above discussion. However, this measure is by no means the only possible choice. In fact, some encoders (e.g., H.263) will simply use the Sum of Absolute Difference (SAD). Some other common error measures, such as the Mean Square Error (MSE), would also be appropriate.

10.3.1 Sequential Search

The simplest method for finding motion vectors is to sequentially search the whole (2p + 1) x (2p + 1) window in the Reference frame (also referred to as full search). A macroblock centered at each of the positions within the window is compared to the macroblock in the Target frame, pixel by pixel, and their respective MAD is then derived using Equation (10.1). The vector (i, j) that offers the least MAD is designated the MV (u, v) for the macroblock in the Target frame.

PROCEDURE 10.1 Motion-vector: sequential search

BEGIN
    min_MAD = LARGE_NUMBER; /* Initialization */
    for i = −p to p
        for j = −p to p
        {
            cur_MAD = MAD(i, j);
            if cur_MAD < min_MAD
            {
                min_MAD = cur_MAD;
                u = i; /* Get the coordinates for MV. */
                v = j;
            }
        }
END

Clearly, the sequential search method is very costly. From Equation (10.1), each pixel comparison requires three operations (subtraction, absolute value, addition). Thus the cost for obtaining a motion vector for a single macroblock is (2p + 1)² · N² · 3 ⇒ O(p²N²).

As an example, let's assume the video has a resolution of 720 x 480 and a frame rate of 30 fps; also, assume p = 15 and N = 16. The number of operations needed for each motion vector search is thus

    (2p + 1)² · N² · 3 = 31² x 16² x 3.

Considering that a single image frame has (720 x 480)/(N x N) macroblocks, and 30 frames each second, the total operations needed per second is

    OPS_per_second = (2p + 1)² · N² · 3 · (720 x 480)/(N x N) · 30
                   = 31² x 16² x 3 x (720 x 480)/(16 x 16) x 30 ≈ 29.89 x 10⁹.

This would certainly make real-time encoding of this video difficult.

10.3.2 2D Logarithmic Search

A cheaper version, suboptimal but still usually effective, is called Logarithmic Search. The procedure for a 2D Logarithmic Search of motion vectors takes several iterations and is akin to a binary search. As Figure 10.2 illustrates, only nine locations in the search window, marked "1," are initially used as seeds for a MAD-based search. After the one that yields
FIGURE 10.2: 2D Logarithmic search for motion vectors. (The search window spans (x0 − p, y0 − p) to (x0 + p, y0 + p), centered at (x0, y0).)

the minimum MAD is located, the center of the new search region is moved to it, and the step size (offset) is reduced to half. In the next iteration, the nine new locations are marked "2," and so on.¹ For the macroblock centered at (x0, y0) in the Target frame, the procedure is as follows:

PROCEDURE 10.2 Motion-vector: 2D-Logarithmic-search

BEGIN
    offset = ⌈p/2⌉;
    Specify nine macroblocks within the search window in the Reference frame;
    they are centered at (x0, y0) and separated by offset horizontally and/or vertically;
    WHILE last ≠ TRUE
    {
        Find one of the nine specified macroblocks that yields minimum MAD;
        if offset = 1 then last = TRUE;
        offset = ⌈offset/2⌉;
        Form a search region with the new offset and new center found;
    }
END

Instead of sequentially comparing with (2p + 1)² macroblocks from the Reference frame, the 2D Logarithmic Search will compare with only 9 · (⌈log₂ p⌉ + 1) macroblocks. In fact, it would be 8 · (⌈log₂ p⌉ + 1) + 1, since the comparison that yielded the least MAD from the last iteration can be reused. Therefore, the complexity is dropped to O(log p · N²). Since p is usually of the same order of magnitude as N, the saving is substantial compared to O(p²N²).

Using the same example as in the previous subsection, the total operations per second drop to

    OPS_per_second = (8 · (⌈log₂ p⌉ + 1) + 1) · N² · 3 · (720 x 480)/(N x N) · 30
                   = (8 · ⌈log₂ 15⌉ + 9) x 16² x 3 x (720 x 480)/(16 x 16) x 30 ≈ 1.25 x 10⁹.

¹The procedure is heuristic. It assumes a general continuity (monotonicity) of image contents, that is, that they do not change randomly within the search window. Otherwise, the procedure might not find the best match.

10.3.3 Hierarchical Search

FIGURE 10.3: A three-level hierarchical search for motion vectors. (Levels 1 and 2 are each obtained from the level below by downsampling by a factor of 2.)

The search for motion vectors can benefit from a hierarchical (multiresolution) approach in which initial estimation of the motion vector can be obtained from images with a significantly reduced resolution. Figure 10.3 depicts a three-level hierarchical search in which the original image is at level 0, images at levels 1 and 2 are obtained by downsampling from the previous levels by a factor of 2, and the initial search is conducted at level 2. Since the size of the
macroblock is smaller and p can also be proportionally reduced at this level, the number of operations required is greatly reduced (by a factor of 16 at this level).

The initial estimation of the motion vector is coarse because of the lack of image detail and resolution. It is then refined level by level toward level 0. Given the estimated motion vector (u^k, v^k) at level k, a 3 x 3 neighborhood centered at (2u^k, 2v^k) at level k − 1 is searched for the refined motion vector. In other words, the refinement is such that at level k − 1, the motion vector (u^(k−1), v^(k−1)) satisfies

    (2u^k − 1 ≤ u^(k−1) ≤ 2u^k + 1,   2v^k − 1 ≤ v^(k−1) ≤ 2v^k + 1)

and yields minimum MAD for the macroblock under examination.

Let (x0^k, y0^k) denote the center of the macroblock at level k in the Target frame. The procedure for hierarchical motion vector search for the macroblock centered at (x0, y0) in the Target frame can be outlined as follows:

PROCEDURE 10.3 Motion-vector: hierarchical-search

BEGIN
    // Get macroblock center position at the lowest resolution level k, e.g., level 2.
    x0^k = x0 / 2^k;  y0^k = y0 / 2^k;
    Use Sequential (or 2D Logarithmic) search method to get initial estimated MV(u^k, v^k) at level k;
    WHILE last ≠ TRUE
    {
        Find one of the nine macroblocks that yields minimum MAD at level k − 1 centered at
        (2(x0^k + u^k) − 1 ≤ x ≤ 2(x0^k + u^k) + 1,   2(y0^k + v^k) − 1 ≤ y ≤ 2(y0^k + v^k) + 1);
        if k = 1 then last = TRUE;
        k = k − 1;
        Assign (x0^k, y0^k) and (u^k, v^k) with the new center location and motion vector;
    }
END

We will use the same example as in the previous sections to estimate the total operations needed each second for a three-level hierarchical search. For simplicity, the overhead for initially generating multiresolution target and reference frames will not be included, and it will be assumed that Sequential search is used at each level.

The total number of macroblocks processed each second is still (720 x 480)/(N x N) x 30. However, the operations needed for each macroblock are reduced to

    [(2⌈p/4⌉ + 1)² (N/4)² + 9 (N/2)² + 9 N²] x 3.

Hence,

    OPS_per_second = [(2⌈p/4⌉ + 1)² (N/4)² + 9 (N/2)² + 9 N²] · 3 · (720 x 480)/(N x N) · 30
                   = [(2⌈15/4⌉ + 1)² x 4² + 9 x 8² + 9 x 16²] x 3 x (720 x 480)/(16 x 16) x 30
                   ≈ 0.51 x 10⁹.

Table 10.1 summarizes the comparison of the three motion vector search methods for a 720 x 480, 30 fps video when p = 15 and 7, respectively.

TABLE 10.1: Comparison of computational cost of motion vector search methods according to the examples.

                                    OPS_per_second for 720 x 480 at 30 fps
    Search method                       p = 15         p = 7
    Sequential search                   29.89 x 10⁹    7.00 x 10⁹
    2D Logarithmic search                1.25 x 10⁹    0.78 x 10⁹
    Three-level Hierarchical search      0.51 x 10⁹    0.40 x 10⁹

10.4 H.261

H.261 is an earlier digital video compression standard. Because its principle of motion-compensation-based compression is very much retained in all later video compression standards, we will start with a detailed discussion of H.261.

The International Telegraph and Telephone Consultative Committee (CCITT) initiated development of H.261 in 1988. The final recommendation was adopted by the International Telecommunication Union-Telecommunication standardization sector (ITU-T), formerly CCITT, in 1990 [2].

The standard was designed for videophone, videoconferencing, and other audiovisual services over ISDN telephone lines. Initially, it was intended to support multiples (from 1 to 5) of 384 kbps channels. In the end, however, the video codec supports bitrates of p x 64 kbps, where p ranges from 1 to 30. Hence the standard was once known as p * 64, pronounced "p star 64". The standard requires the video encoder's delay to be less than 150 msec, so that the video can be used for real-time, bidirectional videoconferencing.

H.261 belongs to the following set of ITU recommendations for visual telephony systems:

• H.221. Frame structure for an audiovisual channel supporting 64 to 1,920 kbps

• H.230. Frame control signals for audiovisual systems
• H.242. Audiovisual communication protocols

• H.261. Video encoder/decoder for audiovisual services at p x 64 kbps

• H.320. Narrowband audiovisual terminal equipment for p x 64 kbps transmission

TABLE 10.2: Video formats supported by H.261.

    Video     Luminance image    Chrominance image    Bitrate (Mbps)              H.261
    format    resolution         resolution           (if 30 fps, uncompressed)   support
    QCIF      176 x 144          88 x 72              9.1                         Required
    CIF       352 x 288          176 x 144            36.5                        Optional

Table 10.2 lists the video formats supported by H.261. Chroma subsampling in H.261 is 4:2:0. Considering the relatively low bitrate in network communications at the time, support for CCIR 601 QCIF is specified as required, whereas support for CIF is optional.

Figure 10.4 illustrates a typical H.261 frame sequence. Two types of image frames are defined: intra-frames (I-frames) and inter-frames (P-frames).

I-frames are treated as independent images. Basically, a transform coding method similar to JPEG is applied within each I-frame, hence the name "intra".

P-frames are not independent. They are coded by a forward predictive coding method in which current macroblocks are predicted from similar macroblocks in the preceding I- or P-frame, and differences between the macroblocks are coded. Temporal redundancy removal is hence included in P-frame coding, whereas I-frame coding performs only spatial redundancy removal. It is important to remember that prediction from a previous P-frame is allowed (not just from a previous I-frame).

The interval between pairs of I-frames is a variable and is determined by the encoder. Usually, an ordinary digital video has a couple of I-frames per second. Motion vectors in H.261 are always measured in units of full pixels and have a limited range of ±15 pixels, that is, p = 15.

FIGURE 10.4: H.261 frame sequence.

10.4.1 Intra-Frame (I-Frame) Coding

FIGURE 10.5: I-frame coding. (For each macroblock, each 8 x 8 block of Y, Cb, and Cr goes through DCT, quantization, and entropy coding into the output bitstream.)

Macroblocks are of size 16 x 16 pixels for the Y frame of the original image. For Cb and Cr frames, they correspond to areas of 8 x 8, since 4:2:0 chroma subsampling is employed. Hence, a macroblock consists of four Y blocks, one Cb, and one Cr, 8 x 8 blocks.

For each 8 x 8 block, a DCT transform is applied. As in JPEG (discussed in detail in Chapter 9), the DCT coefficients go through a quantization stage. Afterwards, they are zigzag-scanned and eventually entropy-coded (as shown in Figure 10.5).

10.4.2 Inter-Frame (P-Frame) Predictive Coding

Figure 10.6 shows the H.261 P-frame coding scheme based on motion compensation. For each macroblock in the Target frame, a motion vector is allocated by one of the search methods discussed earlier. After the prediction, a difference macroblock is derived to measure the prediction error. It is also carried in the form of four Y blocks, one Cb, and one Cr block. Each of these 8 x 8 blocks goes through DCT, quantization, zigzag scan, and entropy coding. The motion vector is also coded.

Sometimes, a good match cannot be found, that is, the prediction error exceeds a certain acceptable level. The macroblock itself is then encoded (treated as an intra macroblock) and in this case is termed a non-motion-compensated macroblock.

P-frame coding encodes the difference macroblock (not the Target macroblock itself). Since the difference macroblock usually has a much smaller entropy than the Target macroblock, a large compression ratio is attainable.

In fact, even the motion vector is not directly coded. Instead, the difference, MVD, between the motion vectors of the preceding macroblock and current macroblock is sent for entropy coding:

    MVD = MV_Preceding − MV_Current   (10.3)

10.4.3 Quantization in H.261

The quantization in H.261 does not use 8 x 8 quantization matrices, as in JPEG and MPEG. Instead, it uses a constant, called step_size, for all DCT coefficients within a macroblock.
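A minimal sketch of this constant step-size quantization in Python, following the two forms given in Equations (10.4) and (10.5) for the intra DC coefficient and for all other coefficients; the helper name and the sample values are invented for this illustration:

```python
import math

def quantize_h261(dct_coeff, scale, intra_dc=False):
    """Sketch of H.261 quantization per Equations (10.4) and (10.5).

    Intra DC coefficients use rounding with a fixed step size of 8; all
    other coefficients use a floor with step size 2 * scale, which leaves
    a deadzone: inputs in [0, 2*scale) map to zero.  (Negative coefficients
    are not treated specially in this simplified sketch.)
    """
    if intra_dc:
        return round(dct_coeff / 8)                   # Equation (10.4)
    return math.floor(dct_coeff / (2 * scale))        # Equation (10.5)

# With scale = 8 (step size 16), small AC coefficients vanish in the deadzone:
print([quantize_h261(c, scale=8) for c in (0, 8, 15, 16, 100)])  # [0, 0, 0, 1, 6]
```

Note how the floor in Equation (10.5) maps the whole range [0, 16) to zero here, while the rounding in Equation (10.4) keeps the DC term centered.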


FIGURE 10.6: H.261 P-frame coding based on motion compensation. (The difference macroblock between the current macroblock and its best match is transformed with DCT, quantized, and entropy-coded; the motion vector is coded as well.)

According to the need (e.g., bitrate control of the video), step_size can take on any one of the 31 even values from 2 to 62. One exception, however, is made for the DC coefficient in intra mode, where a step_size of 8 is always used. If we use DCT and QDCT to denote the DCT coefficients before and after quantization, then for DC coefficients in intra mode,

    QDCT = round(DCT / step_size) = round(DCT / 8)   (10.4)

For all other coefficients:

    QDCT = ⌊DCT / step_size⌋ = ⌊DCT / (2 x scale)⌋   (10.5)

where scale is an integer in the range of [1, 31].

The midtread quantizer, discussed in Section 8.4.1, typically uses a round operator. Equation (10.4) uses this type of quantizer. However, Equation (10.5) uses a floor operator and, as a result, leaves a center deadzone (as Figure 9.8 shows) in its quantization space, with a larger input range mapped to zero.

10.4.4 H.261 Encoder and Decoder

Figure 10.7 shows a relatively complete picture of how the H.261 encoder and decoder work. Here, Q and Q⁻¹ stand for quantization and its inverse, respectively. Switching of the intra- and inter-frame modes can be readily implemented by a multiplexer. To avoid propagation of coding errors,

• An I-frame is usually sent a couple of times in each second of the video.

• As discussed earlier (see DPCM in Section 6.3.5), decoded frames (not the original frames) are used as reference frames in motion estimation.

FIGURE 10.7: H.261: (a) encoder; (b) decoder.
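A toy scalar sketch of the closed loop in Figure 10.7, with invented frame values and with DCT, quantization, and IDCT collapsed into a single lossy rounding step, illustrates why the encoder predicts from decoded frames: encoder and decoder then hold identical references, so quantization error does not accumulate from frame to frame:

```python
def lossy(x, step=4):
    """Stand-in for DCT + quantization + IDCT: round to a multiple of step."""
    return round(x / step) * step

frames = [103, 117, 121]            # toy "frames" I, P1, P2 (scalars)

# Encoder: predict each frame from the *decoded* previous frame, transmit
# the lossily coded residual, keep the reconstruction as the next reference.
reference, transmitted = None, []
for f in frames:
    pred = 0 if reference is None else reference   # intra: zero prediction
    residual = lossy(f - pred)
    transmitted.append(residual)
    reference = pred + residual                    # what the decoder will have

# Decoder: mirror the same arithmetic using only the transmitted residuals.
recon, prev = [], 0
for r in transmitted:
    prev = prev + r
    recon.append(prev)

print(transmitted, recon)   # [104, 12, 4] [104, 116, 120]
```

Every reconstructed value stays within half a quantizer step of the original; had the encoder predicted from the original frames instead, the decoder's references would drift away from the encoder's.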

TABLE 10.3: Data flow at the observation points in H.261 encoder.

    Current frame        Observation point
                         1      2      3      4      5      6
    I                    I                    Ĩ      0      Ĩ
    P1                   P1     P'1    D1     D̃1     P'1    P̃1
    P2                   P2     P'2    D2     D̃2     P'2    P̃2

To illustrate the operational detail of the encoder and decoder, let's use a scenario where frames I, P1, and P2 are encoded and then decoded. The data that goes through the observation points, indicated by the circled numbers in Figure 10.7, is summarized in Tables 10.3 and 10.4. We will use I, P1, P2 for the original data, Ĩ, P̃1, P̃2 for the decoded data (usually a lossy version of the original), and P'1, P'2 for the predictions in the Inter-frame mode.

For the encoder, when the Current Frame is an Intra-frame, Point 1 receives macroblocks from the I-frame, denoted I in Table 10.3. Each I undergoes DCT, Quantization, and Entropy Coding steps, and the result is sent to the Output Buffer, ready to be transmitted.

Meanwhile, the quantized DCT coefficients for I are also sent to Q⁻¹ and IDCT and hence appear at Point 4 as Ĩ. Combined with a zero input from Point 5, the data at Point 6 remains as Ĩ, and this is stored in Frame Memory, waiting to be used for Motion Estimation and Motion-Compensation-based Prediction for the subsequent frame P1.

Quantization Control serves as feedback: when the Output Buffer is too full, the quantization step_size is increased, so as to reduce the size of the coded data. This is known as an encoding rate control process.

When the subsequent Current Frame P1 arrives at Point 1, the Motion Estimation process is invoked to find the motion vector for the best matching macroblock in frame Ĩ for each of the macroblocks in P1. The estimated motion vector is sent to both Motion-Compensation-based Prediction and Variable-Length Encoding (VLE). The MC-based Prediction yields the best matching macroblock in Ĩ. This is denoted P'1, appearing at Point 2.

At Point 3, the "prediction error" is obtained, which is D1 = P1 − P'1. Now D1 undergoes DCT, Quantization, and Entropy Coding, and the result is sent to the Output Buffer. As before, the DCT coefficients for D1 are also sent to Q⁻¹ and IDCT and appear at Point 4 as D̃1.

Added to P'1 at Point 5, we have P̃1 = P'1 + D̃1 at Point 6. This is stored in Frame Memory, waiting to be used for Motion Estimation and Motion-Compensation-based Prediction for the subsequent frame P2. The steps for encoding P2 are similar to those for P1, except that P2 will be the Current Frame and P̃1 becomes the Reference Frame.

TABLE 10.4: Data flow at the observation points in H.261 decoder.

    Current frame        Observation point
                         1      2      3      4
    I                    Ĩ             0      Ĩ
    P1                   D̃1     P'1    P'1    P̃1
    P2                   D̃2     P'2    P'2    P̃2

For the decoder, the input code for frames will be decoded first by Entropy Decoding, Q⁻¹, and IDCT. For Intra-frame mode, the first decoded frame appears at Point 1 and then Point 4 as Ĩ. It is sent as the first output and at the same time stored in the Frame Memory.

Subsequently, the input code for Inter-frame P1 is decoded, and prediction error D̃1 is received at Point 1. Since the motion vector for the current macroblock is also entropy decoded and sent to Motion-Compensation-based Prediction, the corresponding predicted macroblock P'1 can be located in frame Ĩ and will appear at Points 2 and 3. Combined with D̃1, we have P̃1 = P'1 + D̃1 at Point 4, and it is sent out as the decoded frame and also stored in the Frame Memory. Again, the steps for decoding P2 are similar to those for P1.

10.4.5 A Glance at the H.261 Video Bitstream Syntax

Let's take a brief look at the H.261 video bitstream syntax (see Figure 10.8). This consists of a hierarchy of four layers: Picture, Group of Blocks (GOB), Macroblock, and Block.

1. Picture layer. Picture Start Code (PSC) delineates boundaries between pictures. Temporal Reference (TR) provides a timestamp for the picture. Since temporal subsampling can sometimes be invoked such that some pictures will not be transmitted, it is important to have TR, to maintain synchronization with audio. Picture Type (PType) specifies, for example, whether it is a CIF or QCIF picture.

2. GOB layer. H.261 pictures are divided into regions of 11 x 3 macroblocks (i.e., regions of 176 x 48 pixels in luminance images), each of which is called a Group of Blocks (GOB). Figure 10.9 depicts the arrangement of GOBs in a CIF or QCIF luminance image. For instance, the CIF image has 2 x 6 GOBs, corresponding to its image resolution of 352 x 288 pixels.

Each GOB has its Start Code (GBSC) and Group number (GN). The GBSC is unique and can be identified without decoding the entire variable-length code in the bitstream. In case a network error causes a bit error or the loss of some bits, H.261 video can be recovered and resynchronized at the next identifiable GOB, preventing the possible propagation of errors.

GQuant indicates the quantizer to be used in the GOB, unless it is overridden by any subsequent Macroblock Quantizer (MQuant). GQuant and MQuant are referred to as scale in Equation (10.5).

3. Macroblock layer. Each macroblock (MB) has its own Address, indicating its position within the GOB, quantizer (MQuant), and six 8 x 8 image blocks (4 Y, 1 Cb, 1 Cr). Type denotes whether it is an Intra- or Inter-, motion-compensated or non-motion-compensated macroblock. Motion Vector Data (MVD) is obtained by taking the difference between the motion vectors of the preceding and current macroblocks. Moreover, since some blocks in the macroblocks match well and some match poorly in Motion Estimation, a bitmask Coded Block Pattern (CBP) is used to indicate this information. Only well-matched blocks will have their coefficients transmitted.

4. Block layer. For each 8 x 8 block, the bitstream starts with the DC value, followed by pairs of the length of zero-run (Run) and the subsequent nonzero value (Level) for ACs, and finally the End of Block (EOB) code. The range of Run is [0, 63]. Level reflects quantized values; its range is [-127, 127], and Level ≠ 0.

FIGURE 10.8: Syntax of H.261 video bitstream. (PSC = Picture Start Code; TR = Temporal Reference; PType = Picture Type; GOB = Group of Blocks; GBSC = GOB Start Code; GN = Group Number; GQuant = GOB Quantizer; MB = Macroblock; MQuant = MB Quantizer; MVD = Motion Vector Data; CBP = Coded Block Pattern; EOB = End of Block.)

FIGURE 10.9: Arrangement of GOBs in H.261 luminance images.

10.5 H.263

H.263 is an improved video coding standard [3] for videoconferencing and other audiovisual services transmitted on Public Switched Telephone Networks (PSTN). It aims at low-bitrate communications at bitrates of less than 64 kbps. It was adopted by the ITU-T Study Group 15 in 1995. Similar to H.261, it uses predictive coding for inter-frames, to reduce temporal redundancy, and transform coding for the remaining signal, to reduce spatial redundancy (for both intra-frames and difference macroblocks from inter-frame prediction) [3].

In addition to CIF and QCIF, H.263 supports Sub-QCIF, 4CIF, and 16CIF. Table 10.5 summarizes video formats supported by H.263. If not compressed and assuming 30 fps, the bitrate for high-resolution videos (e.g., 16CIF) could be very high (> 500 Mbps). For compressed video, the standard defines maximum bitrate per picture (BPPmaxKb), measured in units of 1,024 bits. In practice, a lower bit rate for compressed H.263 video can be achieved.

As in H.261, the H.263 standard also supports the notion of group of blocks. The difference is that GOBs in H.263 do not have a fixed size, and they always start and end at the left and right borders of the picture. As Figure 10.10 shows, each QCIF luminance image consists of 9 GOBs and each GOB has 11 x 1 MBs (176 x 16 pixels), whereas each 4CIF luminance image consists of 18 GOBs and each GOB has 44 x 2 MBs (704 x 32 pixels).

TABLE 10.5: Video formats supported by H.263.

    Video     Luminance image   Chrominance image   Bitrate (Mbps)             Bitrate (kbps)
    format    resolution        resolution          (if 30 fps, uncompressed)  BPPmaxKb (compressed)
    Sub-QCIF  128 x 96          64 x 48             4.4                        64
    QCIF      176 x 144         88 x 72             9.1                        64
    CIF       352 x 288         176 x 144           36.5                       256
    4CIF      704 x 576         352 x 288           146.0                      512
    16CIF     1408 x 1152       704 x 576           583.9                      1024
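The uncompressed-bitrate column of Table 10.5 can be reproduced directly: with 4:2:0 subsampling, each frame carries 1.5 samples per luminance pixel (Y plus quarter-size Cb and Cr), at 8 bits per sample and 30 fps. A quick sketch:

```python
# Luminance resolutions from Table 10.5; chrominance is subsampled 2:1
# in each direction (4:2:0), so total samples = 1.5 x luminance pixels.
formats = {
    "Sub-QCIF": (128, 96),
    "QCIF":     (176, 144),
    "CIF":      (352, 288),
    "4CIF":     (704, 576),
    "16CIF":    (1408, 1152),
}

def uncompressed_mbps(width, height, fps=30, bits_per_sample=8):
    samples_per_frame = width * height * 1.5   # Y + Cb/4 + Cr/4
    return samples_per_frame * bits_per_sample * fps / 1e6

for name, (w, h) in formats.items():
    print(f"{name}: {uncompressed_mbps(w, h):.1f} Mbps")
# 16CIF works out to 583.9 Mbps, matching the "> 500 Mbps" remark.
```

Running this reproduces the table's Mbps column exactly (4.4, 9.1, 36.5, 146.0, 583.9).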
FIGURE 10.10: Arrangement of GOBs in H.263 luminance images (Sub-QCIF: GOB 0 to GOB 5; QCIF: GOB 0 to GOB 8; CIF, 4CIF, and 16CIF: GOB 0 to GOB 17).

10.5.1 Motion Compensation in H.263

The process of motion compensation in H.263 is similar to that of H.261. The motion vector (MV) is, however, not simply derived from the current macroblock. The horizontal and vertical components of the MV are predicted from the median values of the horizontal and vertical components, respectively, of MV1, MV2, MV3 from the "previous", "above", and "above and right" macroblocks [see Figure 10.11(a)]. Namely, for the macroblock with MV(u, v),

    u_p = median(u1, u2, u3),
    v_p = median(v1, v2, v3).        (10.6)

Instead of coding the MV(u, v) itself, the error vector (δu, δv) is coded, where δu = u - u_p and δv = v - v_p. As shown in Figure 10.11(b), when the current MB is at the border of the picture or GOB, either (0, 0) or MV1 is used as the motion vector for the out-of-bound MB(s).

FIGURE 10.11: Prediction of motion vector in H.263: (a) predicted MV of the current macroblock is the median of (MV1, MV2, MV3), where MV is the current motion vector, MV1 the previous motion vector, MV2 the above motion vector, and MV3 the above-and-right motion vector; (b) special treatment of MVs when the current macroblock is at the border of the picture or GOB.

To improve the quality of motion compensation, that is, to reduce the prediction error, H.263 supports half-pixel precision as opposed to full-pixel precision only in H.261. The default range for both the horizontal and vertical components u and v of MV(u, v) is now [-16, 15.5].

The pixel values needed at half-pixel positions are generated by a simple bilinear interpolation method, as shown in Figure 10.12, where A, B, C, D and a, b, c, d are pixel values at full-pixel positions and half-pixel positions respectively, and "/" indicates division by truncation (also known as integer division).

FIGURE 10.12: Half-pixel prediction by bilinear interpolation in H.263: a = A; b = (A + B + 1) / 2; c = (A + C + 1) / 2; d = (A + B + C + D + 2) / 4.

10.5.2 Optional H.263 Coding Modes

Besides its core coding algorithm, H.263 specifies many negotiable coding options in its various Annexes. Four of the common options are as follows:
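The two mechanisms just described, median prediction of motion vectors (Equation 10.6) and half-pixel bilinear interpolation with truncating division (Figure 10.12), are simple enough to sketch directly:

```python
def predict_mv(mv1, mv2, mv3):
    """Median-predict the current MV component-wise from the previous,
    above, and above-right MVs, as in Equation (10.6)."""
    def median(a, b, c):
        return sorted([a, b, c])[1]
    return (median(mv1[0], mv2[0], mv3[0]),
            median(mv1[1], mv2[1], mv3[1]))

def half_pixel_values(A, B, C, D):
    """Bilinear interpolation at half-pixel positions; the "/" in the
    text is division by truncation, i.e., // for nonnegative pixels."""
    a = A
    b = (A + B + 1) // 2
    c = (A + C + 1) // 2
    d = (A + B + C + D + 2) // 4
    return a, b, c, d

# Only the error vector (du, dv) = MV - predicted MV is coded:
mv = (3, -2)
up, vp = predict_mv((1, 0), (4, -3), (2, -2))
print((mv[0] - up, mv[1] - vp))           # (1, 0)
print(half_pixel_values(10, 20, 30, 40))  # (10, 15, 20, 25)
```

The "+1" and "+2" terms in the interpolation implement rounding before the truncating division, so the half-pixel values are rounded averages rather than floored ones.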
• Unrestricted motion vector mode. The pixels referenced are no longer restricted to within the boundary of the image. When the motion vector points outside the image boundary, the value of the boundary pixel geometrically closest to the referenced pixel is used. This is beneficial when image content is moving across the edge of the image, often caused by object and/or camera movements. This mode also allows an extension of the range of motion vectors. The maximum range of motion vectors is [-31.5, 31.5], which enables efficient coding of fast-moving objects in videos.

• Syntax-based arithmetic coding mode. Like H.261, H.263 uses variable-length coding as a default coding method for the DCT coefficients. Variable-length coding implies that each symbol must be coded into a fixed, integral number of bits. By employing arithmetic coding, this restriction is removed, and a higher compression ratio can be achieved. Experiments show bitrate savings of 4% for inter-frames and 10% for intra-frames in this mode.

As in H.261, the syntax of H.263 is structured as a hierarchy of four layers, each using a combination of fixed- and variable-length code. In the syntax-based arithmetic coding (SAC) mode, all variable-length coding operations are replaced with arithmetic coding operations. According to the syntax of each layer, the arithmetic encoder needs to code a different bitstream from various components. Since each of these bitstreams has a different distribution, H.263 specifies a model for each distribution, and the arithmetic coder switches the model on the fly, according to the syntax.

• Advanced prediction mode. In this mode, the macroblock size for motion compensation is reduced from 16 to 8. Four motion vectors (from each of the 8 x 8 blocks) are generated for each macroblock in the luminance image. Afterward, each pixel in the 8 x 8 luminance prediction block takes a weighted sum of three predicted values based on the motion vector of the current luminance block and two out of the four motion vectors from the neighboring blocks, that is, one from the block at the left or right side of the current luminance block and one from the block above or below. Although sending four motion vectors incurs some additional overhead, the use of this mode generally yields better prediction and hence considerable gain in compression.

• PB-frames mode. As shown by MPEG (detailed discussions in the next chapter), the introduction of a B-frame, which is predicted bidirectionally from both the previous frame and the future frame, can often improve the quality of prediction and hence the compression ratio without sacrificing picture quality. In H.263, a PB-frame consists of two pictures coded as one unit: one P-frame, predicted from the previous decoded I-frame or P-frame (or the P-frame part of a PB-frame), and one B-frame, predicted from both the previous decoded I- or P-frame and the P-frame currently being decoded (Figure 10.13).

FIGURE 10.13: A PB-frame in H.263.

The use of the PB-frames mode is indicated in PTYPE. Since the P- and B-frames are closely coupled in the PB-frame, the bidirectional motion vectors for the B-frame need not be independently generated. Instead, they can be temporally scaled and further enhanced from the forward motion vector of the P-frame [4] so as to reduce the bitrate overhead for the B-frame. PB-frames mode yields satisfactory results for videos with moderate motion. Under large motions, PB-frames do not compress as well as B-frames. An improved mode has been developed in H.263 version 2.

10.5.3 H.263+ and H.263++

The second version of H.263, also known as H.263+, was approved in January 1998 by ITU-T Study Group 16. It is fully backward-compatible with the design of H.263 version 1. The aim of H.263+ is to broaden the potential applications and offer additional flexibility in terms of custom source formats, different pixel aspect ratios, and clock frequencies. H.263+ includes numerous recommendations to improve coding efficiency and error resilience [5]. It also provides 12 new negotiable modes, in addition to the four optional modes in H.263.

Since its development came after the standardization of MPEG-1 and 2, it is not surprising that it also adopts many aspects of the MPEG standards. Below, we mention only briefly some of these features and leave their detailed discussion to the next chapter, where we study the MPEG standards.

• The unrestricted motion vector mode is redefined under H.263+. It uses Reversible Variable Length Coding (RVLC) to encode the difference motion vectors. The RVLC encoder is able to minimize the impact of transmission error by allowing the decoder to decode from both forward and reverse directions. The range of motion vectors is extended again to [-256, 256]. Refer to [6, 7] for more detailed discussions on the construction of RVLC.

• A slice structure is used to replace GOB for additional flexibility. A slice can contain a variable number of macroblocks. The transmission order can be either sequential or arbitrary, and the shape of a slice is not required to be rectangular.
• H.263+ implements Temporal, SNR, and Spatial scalabilities. Scalability refers to the ability to handle various constraints, such as display resolution, bandwidth, and hardware capabilities. The enhancement layer for Temporal scalability increases perceptual quality by inserting B-frames between two P-frames.

SNR scalability is achieved by using various quantizers of smaller and smaller stepsize to encode additional enhancement layers into the bitstream. Thus, the decoder can decide how many enhancement layers to decode according to computational or network constraints. The concept of Spatial scalability is similar to that of SNR scalability. In this case, the enhancement layers provide increased spatial resolution.

• H.263+ supports improved PB-frames mode, in which the two motion vectors of the B-frame do not have to be derived from the forward motion vector of the P-frame, as in version 1. Instead, they can be generated independently, as in MPEG-1 and 2.

• Deblocking filters in the coding loop reduce blocking effects. The filter is applied to the edge boundaries of the four luminance and two chrominance blocks. The coefficient weights depend on the quantizer stepsize for the block. This technique results in better prediction as well as a reduction in blocking artifacts.

The development of H.263 has continued beyond its second version, with the new extension known informally as H.263++ [8]. H.263++ includes the baseline coding methods of H.263 and additional recommendations for enhanced reference picture selection (ERPS), data partition slice (DPS), and additional supplemental enhancement information.

ERPS mode operates by managing a multiframe buffer for stored frames, enhancing coding efficiency and error resilience. DPS mode provides additional enhancement to error resilience by separating header and motion-vector data from DCT coefficient data in the bitstream and protects the motion-vector data by using a reversible code. The additional supplemental enhancement information provides the ability to add backward-compatible enhancements to an H.263 bitstream.

10.6 FURTHER EXPLORATION

Tekalp [9] and Poynton [10] set out the fundamentals of digital video processing. They provide a good overview of the mathematical foundations of the problems to be addressed in video.

The books by Bhaskaran and Konstantinides [11], Ghanbari [12], and Wang et al. [13] include good descriptions of video compression algorithms and present many interesting insights into this problem.

The Further Exploration section of the textbook web site for this chapter contains useful links to information on H.261 and H.263, including

• Tutorials and white papers
• Software implementations
• An H.263/H.263+ library
• A Java H.263 decoder

10.7 EXERCISES

1. Thinking about my large collection of JPEG images (of my family taken in various locales), I decide to unify them and make them more accessible by simply combining them into a big H.261-compressed file. My reasoning is that I can simply use a viewer to step through the file, making a cohesive whole out of my collection. Comment on the utility of this idea, in terms of the compression ratio achievable for the set of images.

2. In block-based video coding, what takes more effort: compression or decompression? Briefly explain why.

3. An H.261 video has the three color channels Y, Cr, Cb. Should MVs be computed for each channel and then transmitted? Justify your answer. If not, which channel should be used for motion compensation?

4. Work out the following problem of 2D Logarithmic Search for motion vectors in detail (see Figure 10.14).

The target (current) frame is a P-frame. The size of macroblocks is 4 x 4. The motion vector is MV(Δx, Δy), in which Δx ∈ [-p, p], Δy ∈ [-p, p]. In this question, assume p = 5.

The macroblock in question (darkened) in the frame has its upper left corner at (xt, yt). It contains 9 dark pixels, each with intensity value 10; the other 7 pixels are part of the background, which has a uniform intensity value of 100. The reference (previous) frame has 8 dark pixels.

FIGURE 10.14: 2D Logarithmic search for motion vectors (reference frame and target frame; dark pixels have intensity value 10; the other background, unmarked, pixels all have intensity value 100).
(a) What is the best Δx, Δy, and Mean Absolute Error (MAE) for this macroblock?
(b) Show step by step how the 2D Logarithmic Search is performed, including the locations and passes of the search and all intermediate Δx, Δy, and MAEs.

5. The logarithmic MV search method is suboptimal, in that it relies on continuity in the residual frame.

(a) Explain why that assumption is necessary, and offer a justification for it.
(b) Give an example where this assumption fails.
(c) Does the hierarchical search method suffer from suboptimality too?

6. A video sequence is given to be encoded using H.263 in PB-mode, having a frame size of 4CIF, frame rate of 30 fps, and video length of 90 minutes. The following is known about the compression parameters: on average, two I-frames are encoded per second. The video at the required quality has an I-frame average compression ratio of 10:1, an average P-frame compression ratio twice as good as I-frame, and an average B-frame compression ratio twice as good as P-frame. Assuming the compression parameters include all necessary headers, calculate the encoded video size.

7. Assuming a search window of size 2p + 1, what is the complexity of motion estimation for a QCIF video in the advanced prediction mode of H.263, using

(a) The brute-force (sequential search) method?
(b) The 2D logarithmic method?
(c) The hierarchical method?

8. Discuss how the advanced prediction mode in H.263 achieves better compression.

9. In H.263 motion estimation, the median of the motion vectors from three preceding macroblocks (see Figure 10.11(a)) is used as a prediction for the current macroblock. It can be argued that the median may not necessarily reflect the best prediction. Describe some possible improvements on the current method.

10. H.263+ allows independent forward MVs for B-frames in a PB-frame. Compared to H.263 in PB-mode, what are the tradeoffs? What is the point in having PB joint coding if B-frames have independent motion vectors?

10.8 REFERENCES

1 D. Marr, Vision, San Francisco: W. H. Freeman, 1982.
2 Video Codec for Audiovisual Services at p x 64 kbit/s, ITU-T Recommendation H.261, version 1 (Dec 1990), version 2 (Mar 1993).
3 Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263, version 1 (Nov 1995), version 2 (Feb 1998).
4 B.G. Haskell, A. Puri, and A. Netravali, Digital Video: An Introduction to MPEG-2, New York: Chapman & Hall, 1997.
5 G. Côté, B. Erol, and M. Gallant, "H.263+: Video Coding at Low Bit Rates," IEEE Transactions on Circuits and Systems for Video Technology, 8(7): 849-866, 1998.
6 Y. Takishima, M. Wada, and H. Murakami, "Reversible Variable Length Codes," IEEE Transactions on Communications, 43(2/3/4): 158-162, 1995.
7 C.W. Tsai and J.L. Wu, "On Constructing the Huffman-Code-Based Reversible Variable-Length Codes," IEEE Transactions on Communications, 49(9): 1506-1509, 2001.
8 Draft for H.263++ Annexes U, V, and W to Recommendation H.263, International Telecommunication Union (ITU-T), 2000.
9 A.M. Tekalp, Digital Video Processing, Upper Saddle River, NJ: Prentice Hall PTR, 1995.
10 C.A. Poynton, Digital Video and HDTV Algorithms and Interfaces, San Francisco: Morgan Kaufmann, 2003.
11 V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd ed., Boston: Kluwer Academic Publishers, 1997.
12 M. Ghanbari, Video Coding: An Introduction to Standard Codecs, London: Institution of Electrical Engineers, 1999.
13 Y. Wang, J. Ostermann, and Y.Q. Zhang, Video Processing and Communications, Upper Saddle River, NJ: Prentice Hall, 2002.
CHAPTER 11

MPEG Video Coding I — MPEG-1 and 2

11.1 OVERVIEW

The Moving Pictures Experts Group (MPEG) was established in 1988 [1, 2] to create a standard for delivery of digital video and audio. Membership grew from about 25 experts in 1988 to a community of more than 350, from about 200 companies and organizations [3, 4]. It is appropriately recognized that proprietary interests need to be maintained within the family of MPEG standards. This is accomplished by defining only a compressed bitstream that implicitly defines the decoder. The compression algorithms, and thus the encoders, are completely up to the manufacturers.

In this chapter, we will study some of the most important design issues of MPEG-1 and 2. The next chapter will cover some basics of the later standards, MPEG-4 and 7, which have somewhat different objectives.

11.2 MPEG-1

The MPEG-1 audio/video digital compression standard was approved by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) MPEG group in November 1991 for Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s [5]. Common digital storage media include compact discs (CDs) and video compact discs (VCDs). Out of the specified 1.5 Mbps, 1.2 Mbps is intended for coded video, and 256 kbps can be used for stereo audio. This yields a picture quality comparable to VHS cassettes and a sound quality equal to CD audio.

In general, MPEG-1 adopts the CCIR 601 digital TV format, also known as Source Input Format (SIF). MPEG-1 supports only noninterlaced video. Normally, its picture resolution is 352 x 240 for NTSC video at 30 fps, or 352 x 288 for PAL video at 25 fps. It uses 4:2:0 chroma subsampling.

The MPEG-1 standard, also referred to as ISO/IEC 11172 [5], has five parts: 11172-1 Systems, 11172-2 Video, 11172-3 Audio, 11172-4 Conformance, and 11172-5 Software. Briefly, Systems takes care of, among many things, dividing output into packets of bitstreams, multiplexing, and synchronization of the video and audio streams. Conformance (or compliance) specifies the design of tests for verifying whether a bitstream or decoder complies with the standard. Software includes a complete software implementation of the MPEG-1 standard decoder and a sample software implementation of an encoder. We will examine the main features of MPEG-1 video coding and leave discussions of MPEG audio coding to Chapter 14.

11.2.1 Motion Compensation in MPEG-1

As discussed in the last chapter, motion-compensation-based video encoding in H.261 works as follows: In motion estimation, each macroblock of the target P-frame is assigned a best matching macroblock from the previously coded I- or P-frame. This is called a prediction. The difference between the macroblock and its matching macroblock is the prediction error, which is sent to DCT and its subsequent encoding steps.

Since the prediction is from a previous frame, it is called forward prediction. Due to unexpected movements and occlusions in real scenes, the target macroblock may not have a good matching entity in the previous frame. Figure 11.1 illustrates that the macroblock containing part of a ball in the target frame cannot find a good matching macroblock in the previous frame, because half of the ball was occluded by another object. However, a match can readily be obtained from the next frame.

FIGURE 11.1: The need for bidirectional search (previous frame, target frame, next frame).

MPEG introduces a third frame type, B-frames, and their accompanying bidirectional motion compensation. Figure 11.2 illustrates the motion-compensation-based B-frame coding idea. In addition to the forward prediction, a backward prediction is also performed, in which the matching macroblock is obtained from a future I- or P-frame in the video sequence. Consequently, each macroblock from a B-frame will specify up to two motion vectors, one from the forward and one from the backward prediction.

If matching in both directions is successful, two motion vectors will be sent, and the two corresponding matching macroblocks are averaged (indicated by "%" in the figure) before comparing to the target macroblock for generating the prediction error. If an acceptable match can be found in only one of the reference frames, only one motion vector and its corresponding macroblock will be used from either the forward or backward prediction.

Figure 11.3 illustrates a possible sequence of video frames. The actual frame pattern is determined at encoding time and is specified in the video's header. MPEG uses M to indicate the interval between a P-frame and its preceding I- or P-frame, and N to indicate the interval between two consecutive I-frames. In Figure 11.3, M = 3, N = 9. A special case is M = 1, when no B-frame is used.

Since the MPEG encoder and decoder cannot work for any macroblock from a B-frame without its succeeding P- or I-frame, the actual coding and transmission order (shown at the bottom of Figure 11.3) is different from the display order of the video (shown above).
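The reordering between display order and coding order can be made concrete. Each B-frame must wait for its future reference, so every I- or P-frame is emitted before the B-frames that precede it in display order. A simplified sketch (ignoring GOP boundaries):

```python
def coding_order(display):
    """Reorder a display-order frame pattern (e.g., from Figure 11.3)
    into coding/transmission order: each reference frame (I or P) is
    moved ahead of the B-frames that depend on it."""
    out, pending_b = [], []
    for frame in display:
        if frame in ("I", "P"):    # a reference frame
            out.append(frame)
            out.extend(pending_b)  # held-back B-frames can now follow
            pending_b = []
        else:                      # a B-frame waits for its future reference
            pending_b.append(frame)
    return out + pending_b

print("".join(coding_order("IBBPBBPBBI")))  # IPBBPBBIBB
```

For the M = 3, N = 9 pattern of Figure 11.3, the display order I B B P B B P B B I becomes I P B B P B B I B B, matching the transmission order shown at the bottom of that figure.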
The inevitable delay and need for buffering become an important issue in real-time network transmission, especially in streaming MPEG video.

FIGURE 11.2: B-frame coding based on bidirectional motion compensation (previous reference frame, target frame, and future reference frame; for each 8 x 8 block, the difference macroblock and the motion vectors are sent to entropy coding).

FIGURE 11.3: MPEG frame sequence. Display order: I B B P B B P B B I; coding and transmission order: I P B B P B B I B B.

TABLE 11.1: The MPEG-1 constrained parameter set.

    Parameter                       Value
    Horizontal size of picture      ≤ 768
    Vertical size of picture        ≤ 576
    Number of macroblocks/picture   ≤ 396
    Number of macroblocks/second    ≤ 9,900
    Frame rate                      ≤ 30 fps
    Bitrate                         ≤ 1,856 kbps

11.2.2 Other Major Differences from H.261

Beside introducing bidirectional motion compensation (the B-frames), MPEG-1 also differs from H.261 in the following aspects:

• Source formats. H.261 supports only CIF (352 x 288) and QCIF (176 x 144) source formats. MPEG-1 supports SIF (352 x 240 for NTSC, 352 x 288 for PAL). It also allows specification of other formats, as long as the constrained parameter set (CPS), shown in Table 11.1, is satisfied.

• Slices. Instead of GOBs, as in H.261, an MPEG-1 picture can be divided into one or more slices (Figure 11.4), which are more flexible than GOBs. They may contain variable numbers of macroblocks in a single picture and may also start and end anywhere, as long as they fill the whole picture. Each slice is coded independently.

FIGURE 11.4: Slices in an MPEG-1 picture.
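The forward/backward/average decision of Figure 11.2 can be sketched at the block level. Here a macroblock's prediction is whichever of the three candidates gives the smallest mean absolute error; averaging the two matches mirrors the "%" operation in the figure. The mode-selection criterion (MAE) is a common choice, not something the standard mandates.

```python
def mae(block, pred):
    """Mean absolute error between two flattened pixel blocks."""
    return sum(abs(b - p) for b, p in zip(block, pred)) / len(block)

def best_prediction(target, fwd_match, bwd_match):
    """Pick forward, backward, or averaged bidirectional prediction
    for a (flattened) target macroblock, minimizing MAE."""
    avg = [(f + b) // 2 for f, b in zip(fwd_match, bwd_match)]
    candidates = {"forward": fwd_match, "backward": bwd_match, "average": avg}
    mode = min(candidates, key=lambda m: mae(target, candidates[m]))
    return mode, candidates[mode]

target = [10, 12, 14, 16]
fwd    = [9, 13, 15, 15]    # good match in the previous reference
bwd    = [50, 50, 50, 50]   # occluded: poor match in the future reference
print(best_prediction(target, fwd, bwd)[0])  # forward
```

When only one direction matches acceptably, as in this occlusion example, only that single motion vector would be transmitted, exactly as the text describes.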
For example, the slices can have different scale factors in the quantizer. This provides additional flexibility in bitrate control.

Moreover, the slice concept is important for error recovery, because each slice has a unique slice_start_code. A slice in MPEG is similar to the GOB in H.261 (and H.263): it is the lowest level in the MPEG layer hierarchy that can be fully recovered without decoding the entire set of variable-length codes in the bitstream.

• Quantization. MPEG-1 quantization uses different quantization tables for its intra- and inter-coding (Tables 11.2 and 11.3). The quantizer numbers for intra-coding (Table 11.2) vary within a macroblock. This is different from H.261, where all quantizer numbers for AC coefficients are constant within a macroblock.

The step_size[i, j] value is now determined by the product of Q[i, j] and scale, where Q1 or Q2 is one of the quantization tables (Table 11.2 or 11.3) and scale is an integer in the range [1, 31]. Using DCT and QDCT to denote the DCT coefficients before and after quantization, for DCT coefficients in intra-mode,

    QDCT[i, j] = round( (8 x DCT[i, j]) / step_size[i, j] ) = round( (8 x DCT[i, j]) / (Q1[i, j] x scale) ),    (11.1)

and for DCT coefficients in inter-mode,

    QDCT[i, j] = floor( (8 x DCT[i, j]) / step_size[i, j] ) = floor( (8 x DCT[i, j]) / (Q2[i, j] x scale) ),    (11.2)

where Q1 and Q2 refer to Tables 11.2 and 11.3, respectively.

Again, a round operator is typically used in Equation (11.1) and hence leaves no dead zone, whereas a floor operator is used in Equation (11.2), leaving a center dead zone in its quantization space.

TABLE 11.2: Default quantization table (Q1) for intra-coding.

     8  16  19  22  26  27  29  34
    16  16  22  24  27  29  34  37
    19  22  26  27  29  34  34  38
    22  22  26  27  29  34  37  40
    22  26  27  29  32  35  40  48
    26  27  29  32  35  40  48  58
    26  27  29  34  38  46  56  69
    27  29  35  38  46  56  69  83

TABLE 11.3: Default quantization table (Q2) for inter-coding.

    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16
    16  16  16  16  16  16  16  16

• To increase precision of the motion-compensation-based predictions and hence reduce prediction errors, MPEG-1 allows motion vectors to be of subpixel precision (1/2 pixel). The technique of bilinear interpolation discussed in Section 10.5.1 for H.263 can be used to generate the needed values at half-pixel locations.

• MPEG-1 supports larger gaps between I- and P-frames and consequently a much larger motion-vector search range. Compared to the maximum range of ±15 pixels for motion vectors in H.261, MPEG-1 supports a range of [-512, 511.5] for half-pixel precision and [-1,024, 1,023] for full-pixel precision motion vectors. However, due to the practical limitation in its picture resolution, such a large maximum range might never be used.

• The MPEG-1 bitstream allows random access. This is accomplished by the Group of Pictures (GOP) layer, in which each GOP is time-coded. In addition, the first frame in any GOP is an I-frame, which eliminates the need to reference other frames. Thus, the GOP layer allows the decoder to seek a particular position within the bitstream and start decoding from there.

TABLE 11.4: Typical compression performance of MPEG-1 frames.

    Type     Size     Compression
    I        18 kB     7:1
    P         6 kB    20:1
    B       2.5 kB    50:1
    Average 4.8 kB    27:1

Table 11.4 lists typical sizes (in kilobytes) for all types of MPEG-1 frames. It can be seen that the typical size of compressed P-frames is significantly smaller than that of I-frames,
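Equations (11.1) and (11.2) can be sketched directly. The only subtlety is the rounding: round() in intra mode leaves no dead zone, while the inter-mode division maps small coefficients of either sign to 0, creating the center dead zone the text describes. Implementing the floor brackets as truncation toward zero for negative coefficients is our interpretation here, chosen so the dead zone is symmetric.

```python
def quantize_intra(dct, i, j, Q1, scale):
    # Equation (11.1): round() leaves no dead zone around 0
    return round(8 * dct / (Q1[i][j] * scale))

def quantize_inter(dct, i, j, Q2, scale):
    # Equation (11.2): truncation toward zero gives a center dead zone
    q = abs(8 * dct) // (Q2[i][j] * scale)
    return q if dct >= 0 else -q

Q1 = [  # Table 11.2, default intra quantization table
    [ 8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]
Q2 = [[16] * 8 for _ in range(8)]  # Table 11.3: all entries are 16

# A small coefficient survives intra quantization but falls into the
# inter dead zone (step_size = 16 x 4 = 64 at scale 4):
print(quantize_intra(7, 1, 1, Q1, 4))   # 1 (no dead zone)
print(quantize_inter(7, 1, 1, Q2, 4))   # 0 (inside the dead zone)
print(quantize_inter(-7, 1, 1, Q2, 4))  # 0 (dead zone is symmetric)
```

The dead zone suppresses small inter-mode residuals, which are mostly noise, at no perceptual cost; intra coefficients carry actual image content and so are rounded instead.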
TABLE 11.5: Profiles and Leveis iii MPEG-2.

—~- SNR Spatialiy


1 Sequence GOP GO? GOP 1 Sequence 1 Sequence Levei Simple Main scaiable scaiable High 4:2:2 Muliiview
~ header 1 end code layer
proflie proflie proflie profile proflie profile profile
— 1-Iigh
1 oop 1 Group ofpiciure
~ header Picture Picture ~cture . . . Piciure iayer High 1440 *

Main * * *
— Low
1
1
Piciure
header
1
1 Slicc I Slice Slice 11 . . . 11 Slice
Piciure
layer
* *

— cussed above. There is also an uncommon Lype, D-piclure (DC coded), in which oniy
1 Siice 1 Slice DC coefficienls are retained. MPEG- 1 does nol aliow mixing D-picures with other
~ header Macroblock Macrobiock . .. Macrobloclc iayer
types, which malces D-pictures impractical.
~-~- 4. Siice iayer. As mentioned eariier, MPEG- 1 introduced Lhe slice notion for bitraie
1 Macroblock Macrobiock control and for recovery and synchronization after losL or comiped bits. Slices may
1 header BIockO Block ~ Block2 BIock3 Block4 Biock5 iayer
have variable numbers of niacroblocks in a singie picture. The iength and posiiion of
~— each slice are specified iii Lhe header.
1 Differentiai 1 Block
5. Macrobiock Iayer. Each macrobiock consisis of four Y blocks, one C,, bloclc, and
1 DC coefficient VLC rim VLC run ... end of block layer
one Cr block. Ali biocks are 8 x 8.
(if intra macroblock)
6. Block Iayer. Jf the biocks are intra-coded, Lhe differential DC coefficient (DPCM
because inter-frame compression exploits temporal redundancy. Notably, B-frames are even smaller than P-frames, due partially to the advantage of bidirectional prediction. It is also because B-frames are often given the lowest priority in terms of preservation of quality; hence, a higher compression ratio can be assigned.

11.2.3 MPEG-1 Video Bitstream

Figure 11.5 depicts the six hierarchical layers for the bitstream of an MPEG-1 video.

FIGURE 11.5: Layers of MPEG-1 video bitstream.

1. Sequence layer. A video sequence consists of one or more groups of pictures (GOPs). It always starts with a sequence header. The header contains information about the picture, such as horizontal_size and vertical_size, pixel_aspect_ratio, frame_rate, bit_rate, buffer_size, quantization_matrix, and so on. Optional sequence headers between GOPs can indicate parameter changes.

2. Group of Pictures (GOPs) layer. A GOP contains one or more pictures, one of which must be an I-picture. The GOP header contains information such as time_code to indicate hour-minute-second-frame from the start of the sequence.

3. Picture layer. The three common MPEG-1 picture types are I-picture (intra-coding), P-picture (predictive coding), and B-picture (bidirectional predictive coding), as discussed earlier.

At the block layer, for intra-coded blocks, the differential DC coefficient (DPCM of DCs, as in JPEG) is sent first, followed by variable-length codes (VLC) for AC coefficients. Otherwise, DC and AC coefficients are both coded using the variable-length codes.

Mitchell et al. [6] provide detailed information on the headers in various MPEG-1 layers.

11.3 MPEG-2

Development of the MPEG-2 standard started in 1990. Unlike MPEG-1, which is basically a standard for storing and playing video on the CD of a single computer at a low bitrate (1.5 Mbps), MPEG-2 [7] is for higher-quality video at a bitrate of more than 4 Mbps. It was initially developed as a standard for digital broadcast TV.

In the late 1980s, Advanced TV (ATV) was envisioned, to broadcast HDTV via terrestrial networks. During the development of MPEG-2, digital ATV finally took precedence over various early attempts at analog solutions to HDTV. MPEG-2 has managed to meet the compression and bitrate requirements of digital TV/HDTV and in fact supersedes a separate standard, MPEG-3, initially thought necessary for HDTV.

The MPEG-2 audio/video compression standard, also referred to as ISO/IEC 13818 [8], was approved by the ISO/IEC Moving Picture Experts Group in November 1994. Similar to MPEG-1, it has parts for Systems, Video, Audio, Conformance, and Software, plus other aspects. MPEG-2 has gained wide acceptance beyond broadcasting digital TV over terrestrial, satellite, or cable networks. Among various applications such as interactive TV, it is also adopted for digital video discs or digital versatile discs (DVDs).
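Both the MPEG-1 layer hierarchy above and MPEG-2 bitstreams are delimited by byte-aligned start codes (the prefix 0x000001 followed by a code byte), which is what makes random access into the layer hierarchy possible. A minimal sketch of locating them in Python; the three start-code values shown are from the standards, but the scanned byte string is a synthetic stand-in, not a real bitstream:

```python
# Scan a byte string for MPEG-1/2 video start codes (0x00 0x00 0x01 xx)
# and report the layer each one introduces.

START_CODE_NAMES = {
    0xB3: "sequence_header",
    0xB8: "group_of_pictures",
    0x00: "picture",
}

def find_start_codes(data: bytes):
    """Return (offset, name) pairs for every recognized start code."""
    found = []
    i = 0
    while True:
        i = data.find(b"\x00\x00\x01", i)
        if i < 0 or i + 3 >= len(data):
            break
        code = data[i + 3]
        found.append((i, START_CODE_NAMES.get(code, f"other_0x{code:02X}")))
        i += 4
    return found

# A synthetic stream: sequence header, one GOP, two pictures.
stream = (b"\x00\x00\x01\xB3" + b"\x10" * 8 +
          b"\x00\x00\x01\xB8" + b"\x20" * 4 +
          b"\x00\x00\x01\x00" + b"\x30" * 6 +
          b"\x00\x00\x01\x00")
```

A real parser would go on to decode the header fields that follow each start code; here the scanner only recovers the layer boundaries.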
320 Chapter 11 MPEG Video Coding 1— MPEG-1 and 2 Section 11.3 MPEG-2 321

TABLE 11.6: Four levels in the Main profile of MPEG-2.

Level       Maximum         Maximum   Maximum       Maximum coded      Application
            resolution      fps       pixels/sec    data rate (Mbps)
High        1,920 x 1,152   60        62.7 x 10^6   80                 Film production
High 1440   1,440 x 1,152   60        47.0 x 10^6   60                 Consumer HDTV
Main        720 x 576       30        10.4 x 10^6   15                 Studio TV
Low         352 x 288       30        3.0 x 10^6    4                  Consumer tape equivalent

MPEG-2 defined seven profiles aimed at different applications (e.g., low-delay videoconferencing, scalable video, HDTV). The profiles are Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2, and Multiview (where two views would refer to stereoscopic video). Within each profile, up to four levels are defined. As Table 11.5 shows, not all profiles have four levels. For example, the Simple profile has only the Main level, whereas the High profile does not have the Low level.

Table 11.6 lists the four levels in the Main profile, with the maximum amount of data and targeted applications. For example, the High level supports a high picture resolution of 1,920 x 1,152, a maximum frame rate of 60 fps, a maximum pixel rate of 62.7 x 10^6 per second, and a maximum data rate after coding of 80 Mbps. The Low level is targeted at SIF video; hence, it provides backward compatibility with MPEG-1. The Main level is for CCIR601 video, whereas the High 1440 and High levels are aimed at European HDTV and North American HDTV, respectively.

The DVD video specification allows only four display resolutions: 720 x 480, 704 x 480, 352 x 480, and 352 x 240. Hence, the DVD video standard uses only a restricted form of the MPEG-2 Main profile at the Main and Low levels.
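A decoder or authoring tool can check a video format against these level limits mechanically. A sketch, with the constraints transcribed from Table 11.6; the function name and its treatment of the pixel-rate cap (width x height x fps) are illustrative, not part of the standard:

```python
# Pick the lowest MPEG-2 Main-profile level that can carry a given format.
# Constraints transcribed from Table 11.6; this helper is illustrative.

LEVELS = [  # (name, max_w, max_h, max_fps, max_pixels_per_sec, max_mbps)
    ("Low",        352,  288, 30,  3.0e6,  4),
    ("Main",       720,  576, 30, 10.4e6, 15),
    ("High 1440", 1440, 1152, 60, 47.0e6, 60),
    ("High",      1920, 1152, 60, 62.7e6, 80),
]

def lowest_level(width, height, fps, mbps):
    """Return the name of the lowest level whose caps are all satisfied."""
    for name, mw, mh, mf, mpix, mrate in LEVELS:
        if (width <= mw and height <= mh and fps <= mf and
                width * height * fps <= mpix and mbps <= mrate):
            return name
    return None  # exceeds even the High level
```

For example, CCIR601-like video (720 x 576 at 25 fps, 8 Mbps) lands in the Main level, while 1,920 x 1,152 material needs the High level.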
11.3.1 Supporting Interlaced Video

MPEG-1 supports only noninterlaced (progressive) video. Since MPEG-2 is adopted by digital broadcast TV, it must also support interlaced video, because this is one of the options for digital broadcast TV and HDTV.

As mentioned earlier, in interlaced video, each frame consists of two fields, referred to as the top-field and the bottom-field. In a frame-picture, all scanlines from both fields are interleaved to form a single frame. This is then divided into 16 x 16 macroblocks and coded using motion compensation. On the other hand, if each field is treated as a separate picture, then it is called field-picture. As Figure 11.6(a) shows, each frame-picture can be split into two field-pictures. The figure shows 16 scanlines from a frame-picture on the left, as opposed to 8 scanlines in each of the two field portions of a field-picture on the right.

FIGURE 11.6: Field pictures and field-prediction for field-pictures in MPEG-2: (a) frame-picture versus field-pictures; (b) field prediction for field-pictures.

We see that, in terms of display area on the monitor/TV, each 16-column x 16-row macroblock in the field-picture corresponds to a 16 x 32 block area in the frame-picture, whereas each 16 x 16 macroblock in the frame-picture corresponds to a 16 x 8 block area in the field-picture. As shown below, this observation will become an important factor in developing different modes of predictions for motion-compensation-based video coding.

Five Modes of Predictions    MPEG-2 defines frame prediction and field prediction as well as five different prediction modes, suitable for a wide range of applications where the requirements for the accuracy and speed of motion compensation vary.

1. Frame prediction for frame-pictures. This is identical to MPEG-1 motion-compensation-based prediction methods in both P-frames and B-frames. Frame prediction works well for videos containing only slow and moderate object and camera motions.

2. Field prediction for field-pictures. [See Figure 11.6(b).] This mode uses a macroblock size of 16 x 16 from field-pictures. For P-field-pictures (the rightmost ones shown in the figure), predictions are made from the two most recently encoded fields. Macroblocks in the top-field picture are forward-predicted from the top-field or bottom-field pictures of the preceding I- or P-frame. Macroblocks in the bottom-field picture are predicted from the top-field picture of the same frame or the bottom-field picture of the preceding I- or P-frame.

For B-field-pictures, both forward and backward predictions are made from field-pictures of preceding and succeeding I- or P-frames. No regulation requires that field "parity" be maintained — that is, the top-field and bottom-field pictures can be predicted from either the top or bottom fields of the reference pictures.

3. Field prediction for frame-pictures. This mode treats the top-field and bottom-field of a frame-picture separately. Accordingly, each 16 x 16 macroblock from the target frame-picture is split into two 16 x 8 parts, each coming from one field. Field prediction is carried out for these 16 x 8 parts in a manner similar to that shown in Figure 11.6(b). Besides the smaller block size, the only difference is that the bottom-field will not be predicted from the top-field of the same frame, since we are dealing with frame-pictures now.

For example, for P-frame-pictures, the bottom 16 x 8 part will instead be predicted from either field from the preceding I- or P-frame. Two motion vectors are thus generated for each 16 x 16 macroblock in the P-frame-picture. Similarly, up to four motion vectors can be generated for each macroblock in the B-frame-picture.

4. 16 x 8 MC for field-pictures. Each 16 x 16 macroblock from the target field-picture is now split into top and bottom 16 x 8 halves — that is, the first eight rows and the next eight rows. Field prediction is performed on each half. As a result, two motion vectors will be generated for each 16 x 16 macroblock in the P-field-picture and up to four motion vectors for each macroblock in the B-field-picture. This mode is good for finer motion compensation when motion is rapid and irregular.

5. Dual-prime for P-pictures. This is the only mode that can be used for either frame-pictures or field-pictures. At first, field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector MV is then used to derive a calculated motion vector CV in the field with the opposite parity, taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. In this way, the pair MV and CV yields two preliminary predictions for each macroblock. Their prediction errors are averaged and used as the final prediction error. This mode is aimed at mimicking B-picture prediction for P-pictures without adopting backward prediction (and hence less encoding delay).

Alternate Scan and Field_DCT    Alternate Scan and Field_DCT are techniques aimed at improving the effectiveness of DCT on prediction errors. They are applicable only to frame-pictures in interlaced videos.

After frame prediction in frame-pictures, the prediction error is sent to DCT, where each block is of size 8 x 8. Due to the nature of interlaced video, the consecutive rows in these blocks are from different fields; hence, there is less correlation between them than between the alternate rows. This suggests that the DCT coefficients at low vertical spatial frequencies tend to have reduced magnitudes, compared to the ones in noninterlaced video.

Based on the above analysis, an alternate scan is introduced. It may be applied on a picture-by-picture basis in MPEG-2 as an alternative to a zigzag scan. As Figure 11.7(a) indicates, zigzag scan assumes that in noninterlaced video, the DCT coefficients at the upper left corner of the block often have larger magnitudes. Alternate scan (Figure 11.7(b)) recognizes that in interlaced video, the vertically higher spatial frequency components may have larger magnitudes and thus allows them to be scanned earlier in the sequence. Experiments have shown [7] that alternate scan can improve the PSNR by up to 0.3 dB over zigzag scan and is most effective for videos with fast motion.

FIGURE 11.7: (a) Zigzag (progressive) and (b) alternate (interlaced) scans of DCT coefficients for videos in MPEG-2.

In MPEG-2, Field_DCT can address the same issue. Before applying DCT, rows in the macroblock of frame-pictures can be reordered, so that the first eight rows are from the top-field and the last eight are from the bottom-field. This restores the higher spatial redundancy (and correlation) between consecutive rows. The reordering will be reversed after the IDCT. Field_DCT is not applicable to chrominance images, where each macroblock has only 8 x 8 pixels.

11.3.2 MPEG-2 Scalabilities

As in JPEG2000, scalability is also an important issue for MPEG-2. Since MPEG-2 is designed for a variety of applications, including digital TV and HDTV, the video will often be transmitted over networks with very different characteristics. Therefore it is necessary to have a single coded bitstream that is scalable to various bitrates.

MPEG-2 scalable coding is also known as layered coding, in which a base layer and one or more enhancement layers can be defined. The base layer can be independently encoded, transmitted, and decoded, to obtain basic video quality. The encoding and decoding of the enhancement layer, however, depends on the base layer or the previous enhancement layer. Often, only one enhancement layer is employed, which is called two-layer scalable coding.
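The Field_DCT row reordering described in the previous subsection (grouping a frame-picture macroblock's top-field rows before its bottom-field rows, then undoing the permutation after the IDCT) can be sketched as follows; row indices stand in for actual 16-row macroblock data:

```python
# Field_DCT-style reordering of the 16 rows of a frame-picture macroblock:
# the first eight output rows are the top-field rows (0, 2, ..., 14), the
# last eight the bottom-field rows (1, 3, ..., 15). The inverse permutation
# would be applied after the IDCT.

def field_reorder(rows):
    return rows[0::2] + rows[1::2]

def field_restore(rows):
    restored = [None] * len(rows)
    half = len(rows) // 2
    restored[0::2] = rows[:half]   # top-field rows back to even positions
    restored[1::2] = rows[half:]   # bottom-field rows back to odd positions
    return restored

mb = list(range(16))               # stand-in for 16 macroblock rows
```

The reorder/restore pair is an exact inverse, which is why the operation costs nothing in fidelity.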

Scalable coding is suitable for MPEG-2 video transmitted over networks with the following characteristics.

• Very different bitrates. If the link speed is slow (such as a 56 kbps modem line), only the bitstream from the base layer will be sent. Otherwise, bitstreams from one or more enhancement layers will also be sent, to achieve improved video quality.

• Variable-bitrate (VBR) channels. When the bitrate of the channel deteriorates, bitstreams from fewer or no enhancement layers will be transmitted, and vice versa.

• Noisy connections. The base layer can be better protected or sent via channels known to be less noisy.

Moreover, scalable coding is ideal for progressive transmission: bitstreams from the base layer are sent first, to give users a fast and basic view of the video, followed by gradually increased data and improved quality. This can be useful for delivering compatible digital TV (ATV) and HDTV.
MPEG-2 supports the following scalabilities:

• SNR scalability. The enhancement layer provides higher SNR.

• Spatial scalability. The enhancement layer provides higher spatial resolution.

• Temporal scalability. The enhancement layer facilitates higher frame rate.

• Hybrid scalability. This combines any two of the above three scalabilities.

• Data partitioning. Quantized DCT coefficients are split into partitions.
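The last of these, data partitioning, is the easiest to picture: the scan-ordered quantized DCT coefficients of a block are simply cut at a breakpoint, with the low-frequency run forming the base partition. A sketch; the breakpoint position and the toy coefficient values are illustrative:

```python
# Split a block's quantized DCT coefficients, already in scan order, into a
# base partition (low frequencies) and an enhancement partition (high
# frequencies). Concatenating the partitions restores the block exactly,
# which is why data partitioning is not true layered coding.

def partition(coeffs, breakpoint):
    return coeffs[:breakpoint], coeffs[breakpoint:]

def reassemble(base_part, enh_part):
    return base_part + enh_part

scan = [45, 12, -7, 3, 0, 0, 2, 0, 0, 1]   # toy scan-ordered coefficients
base_part, enh_part = partition(scan, 4)
```

If the enhancement partition is lost in transit, the decoder can still reconstruct a blurrier block from the base partition alone.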

SNR Scalability    Figure 11.8 illustrates how SNR scalability works in the MPEG-2 encoder and decoder.

The MPEG-2 SNR scalable encoder generates output bitstreams Bits_base and Bits_enhance at two layers. At the base layer, a coarse quantization of the DCT coefficients is employed, which results in fewer bits and a relatively low-quality video. After variable-length coding, the bitstream is called Bits_base.

The coarsely quantized DCT coefficients are then inversely quantized (Q^-1) and fed to the enhancement layer, to be compared with the original DCT coefficients. Their difference is finely quantized to generate a DCT coefficient refinement, which, after variable-length coding, becomes the bitstream called Bits_enhance. The inversely quantized coarse and refined DCT coefficients are added back, and after inverse DCT (IDCT), they are used for motion-compensated prediction for the next frame. Since the enhancement/refinement over the base layer improves the signal-to-noise ratio, this type of scalability is called SNR scalability.

If, for some reason (e.g., the breakdown of some network channel), Bits_enhance from the enhancement layer cannot be obtained, the above scalable scheme can still work using Bits_base only. In that case, the input from the inverse quantizer (Q^-1) of the enhancement layer simply has to be treated as zero.
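The coarse-plus-refinement quantization at the heart of this scheme can be sketched with two uniform scalar quantizers; the step sizes (16 for the base layer, 4 for the refinement) are illustrative choices, not values mandated by MPEG-2:

```python
# Two-layer SNR-scalable coding of a single DCT coefficient: coarse
# base-layer quantization plus a finely quantized refinement of the
# residual. The decoder adds the two inverse-quantized values.

def quantize(value, step):
    return round(value / step)          # quantizer index

def dequantize(index, step):
    return index * step                 # inverse quantization

def encode(coeff, base_step=16, enh_step=4):
    base = quantize(coeff, base_step)
    residual = coeff - dequantize(base, base_step)
    enh = quantize(residual, enh_step)
    return base, enh

def decode(base, enh, base_step=16, enh_step=4):
    return dequantize(base, base_step) + dequantize(enh, enh_step)

base, enh = encode(100)
```

With both layers the coefficient 100 is recovered exactly here; with Bits_base alone the decoder would see only the coarse value 96, the lower-SNR reconstruction.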

The decoder (Figure 11.8(b)) operates in reverse order to the encoder. Both Bits_base and Bits_enhance are variable-length decoded (VLD) and inversely quantized (Q^-1) before they are added together to restore the DCT coefficients. The remaining steps are the same as in any motion-compensation-based video decoder. If both bitstreams (Bits_base and Bits_enhance) are used, the output video is Output_high with enhanced quality. If only Bits_base is used, the output video Output_base is of basic quality.

Spatial Scalability    The base and enhancement layers for MPEG-2 spatial scalability are not as tightly coupled as in SNR scalability; hence, this type of scalability is somewhat less complicated. We will not show the details of both encoder and decoder, as we did above, but will explain only the encoding process, using high-level diagrams.

The base layer is designed to generate a bitstream of reduced-resolution pictures. Combining them with the enhancement layer produces pictures at the original resolution. As Figure 11.9(a) shows, the original video data is spatially decimated by a factor of 2 and sent to the base layer encoder. After the normal coding steps of motion compensation, DCT on prediction errors, quantization, and entropy coding, the output bitstream is Bits_base.

As Figure 11.9(b) indicates, the predicted macroblock from the base layer is now spatially interpolated to get to resolution 16 x 16. This is then combined with the normal, temporally predicted macroblock from the enhancement layer itself, to form the prediction macroblock for the purpose of motion compensation in this layered coding. The spatial interpolation here adopts bilinear interpolation, as discussed before.

The combination of macroblocks uses a simple weight table, where the value of the weight w is in the range of [0, 1.0]. If w = 0, no consideration is given to the predicted macroblock from the base layer. If w = 1, the prediction is entirely from the base layer. Normally, both predicted macroblocks are linearly combined, using the weights w and 1 - w, respectively. To achieve minimum prediction errors, MPEG-2 encoders have an analyzer to choose different w values from the weight table on a macroblock basis.

FIGURE 11.9: Encoder for MPEG-2 spatial scalability: (a) block diagram; (b) combining temporal and spatial predictions for encoding at enhancement layer.

Temporal Scalability    Temporally scalable coding has both the base and enhancement layers of video at a reduced temporal rate (frame rate). The reduced frame rates for the layers are often the same; however, they could also be different. Pictures from the base layer and enhancement layer(s) have the same spatial resolution as in the input video. When combined, they restore the video to its original temporal rate.

Figure 11.10 illustrates the MPEG-2 implementation of temporal scalability. The input video is temporally demultiplexed into two pieces, each carrying half the original frame rate. As before, the base layer encoder carries out the normal single-layer coding procedures for its own input video and yields the output bitstream Bits_base.

The prediction of matching macroblocks at the enhancement layer can be obtained in two ways [7]: interlayer motion-compensated prediction or combined motion-compensated prediction and interlayer motion-compensated prediction.

Interlayer motion-compensated prediction. [Figure 11.10(b).] The macroblocks of B-frames for motion compensation at the enhancement layer are predicted from the preceding and succeeding frames (either I-, P-, or B-) at the base layer, so as to exploit the possible interlayer redundancy in motion compensation.

Combined motion-compensated prediction and interlayer motion-compensated prediction. [Figure 11.10(c).] This further combines the advantages of the ordinary forward prediction and the above interlayer prediction. Macroblocks of B-frames at the enhancement layer are forward-predicted from the preceding frame at its own layer and "backward"-predicted from the preceding (or, alternatively, succeeding) frame at the base layer. At the first frame, the P-frame at the enhancement layer adopts only forward prediction from the I-frame at the base layer.
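The weighted macroblock combination used for spatial scalability above is a per-pixel linear blend of the interpolated base-layer prediction and the enhancement layer's own temporal prediction. A sketch; the blocks and weights are toy values, since in practice the weight is chosen per macroblock by the encoder's analyzer:

```python
# Combine the (already interpolated) base-layer prediction with the
# enhancement layer's temporal prediction, pixel by pixel, using weight w:
#     prediction = w * base + (1 - w) * temporal
# w = 0 ignores the base layer; w = 1 uses the base layer alone.

def combine_predictions(base_interp, temporal, w):
    return [[w * b + (1 - w) * t
             for b, t in zip(brow, trow)]
            for brow, trow in zip(base_interp, temporal)]

base_block = [[100, 100], [100, 100]]   # interpolated base-layer block
temp_block = [[80, 120], [60, 140]]     # temporal prediction, enh. layer
```

An encoder would try several table entries for w and keep the one minimizing the prediction error for that macroblock.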

FIGURE 11.10: Encoder for MPEG-2 temporal scalability: (a) block diagram; (b) interlayer motion-compensated prediction; (c) combined motion-compensated prediction and interlayer motion-compensated prediction.

Hybrid Scalability    Any two of the above three scalabilities can be combined to form hybrid scalability. These combinations are

• Spatial and temporal hybrid scalability

• SNR and spatial hybrid scalability

• SNR and temporal hybrid scalability

Usually, a three-layer hybrid coder will be adopted, consisting of base layer, enhancement layer 1, and enhancement layer 2.

For example, for spatial and temporal hybrid scalability, the base layer and enhancement layer 1 will provide spatial scalability, and enhancement layers 1 and 2 will provide temporal scalability, in which enhancement layer 1 is effectively serving as a base layer.

For the encoder, the incoming video data is first temporally demultiplexed into two streams: one to enhancement layer 2; the other to enhancement layer 1 and the base layer (after further spatial decimation for the base layer).

The encoder generates three output bitstreams: (a) Bits_base from the base layer, (b) spatially enhanced Bits_enhance1 from enhancement layer 1, and (c) spatially and temporally enhanced Bits_enhance2 from enhancement layer 2.

The implementations of the other two hybrid scalabilities are similar and are left as exercises.

Data Partitioning    The compressed video stream is divided into two partitions. The base partition contains lower-frequency DCT coefficients, and the enhancement partition contains high-frequency DCT coefficients. Although the partitions are sometimes also referred to as layers (base layer and enhancement layer), strictly speaking, data partitioning does not conduct the same type of layered coding, since a single stream of video data is simply divided up and does not depend further on the base partition in generating the enhancement partition. Nevertheless, data partitioning can be useful for transmission over noisy channels and for progressive transmission.

11.3.3 Other Major Differences from MPEG-1

• Better resilience to bit errors. Since MPEG-2 video will often be transmitted on various networks, some of them noisy and unreliable, bit errors are inevitable. To cope with this, MPEG-2 systems have two types of streams: Program and Transport. The Program stream is similar to the Systems stream in MPEG-1; hence, it also facilitates backward compatibility with MPEG-1.

The Transport stream aims at providing error resilience and the ability to include multiple programs with independent time bases in a single stream, for asynchronous multiplexing and network transmission. Instead of using long, variable-length packets, as in MPEG-1 and in the MPEG-2 Program stream, it uses fixed-length (188-byte) packets. It also has a new header syntax, for better error checking and correction.

• Support of 4:2:2 and 4:4:4 chroma subsampling. In addition to 4:2:0 chroma subsampling, as in H.261 and MPEG-1, MPEG-2 also allows 4:2:2 and 4:4:4, to increase color quality. As discussed in Chapter 5, each chrominance picture in 4:2:2 is horizontally subsampled by a factor of 2, whereas 4:4:4 is a special case, where no chroma subsampling actually takes place.

TABLE 11.7: Possible nonlinear scale in MPEG-2.

i        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15  16
scale_i  1  2  3  4  5  6  7  8  10 12 14 16 18 20 22  24

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30  31
scale_i  28 32 36 40 44 48 52 56 64 72 80 88 96 104 112

• Nonlinear quantization. Quantization in MPEG-2 is similar to that in MPEG-1. Its step_size is also determined by the product of Q[i, j] and scale, where Q is one of the default quantization tables for intra- or inter-coding. Two types of scales are allowed. For the first, scale is the same as in MPEG-1, in which it is an integer in the range of [1, 31] and scale_i = i. For the second type, however, a nonlinear relationship exists — that is, scale_i is not equal to i. The ith scale value can be looked up in Table 11.7.

• More restricted slice structure. MPEG-1 allows slices to cross macroblock row boundaries. As a result, an entire picture can be a single slice. MPEG-2 slices must start and end in the same macroblock row. In other words, the left edge of a picture always starts a new slice, and the longest slice in MPEG-2 can have only one row of macroblocks.

• More flexible video formats. According to the standard, MPEG-2 picture sizes can be as large as 16k x 16k pixels. In reality, MPEG-2 is used mainly to support various picture resolutions as defined by DVD, ATV, and HDTV.

Similar to H.261, H.263, and MPEG-1, MPEG-2 specifies only its bitstream syntax and the decoder. This leaves much room for future improvement, especially on the encoder side. The MPEG-2 video-stream syntax is more complex than that of MPEG-1, and good references for it can be found in [8, 7].

11.4 FURTHER EXPLORATION

The books by Mitchell et al. [6] and Haskell et al. [7] provide pertinent details regarding MPEG-1 and 2. The article by Haskell et al. [9] discusses digital video coding standards in general.

The textbook web site's Further Exploration section for this chapter gives URLs for various resources, including the MPEG home page, FAQ page, and overviews and working documents of the MPEG-1 and MPEG-2 standards.

11.5 EXERCISES

1. As we know, MPEG video compression uses I-, P-, and B-frames. However, the earlier H.261 standard does not use B-frames. Describe a situation in which video compression would not be as effective without B-frames. (Your answer should be different from the one in Figure 11.1.)

2. Suggest an explanation for the reason the default quantization table Q2 for inter-frames is all constant, as opposed to the default quantization table Q1 of intra-frames.

3. What are some of the enhancements of MPEG-2, compared with MPEG-1? Why hasn't the MPEG-2 standard superseded the MPEG-1 standard?

4. B-frames provide obvious coding advantages, such as increase in SNR at low bitrates and bandwidth savings. What are some of the disadvantages of B-frames?

5. The MPEG-1 standard introduced B-frames, and the motion-vector search range has accordingly been increased from [-15, 15] in H.261 to [-512, 511.5]. Why was this necessary? Calculate the number of B-frames between consecutive P-frames that would justify this increase.

6. Redraw Figure 11.8 of the MPEG-2 two-layer SNR scalability encoder and decoder to include a second enhancement layer.

7. Draw block diagrams for an MPEG-2 encoder and decoder for (a) SNR and spatial hybrid scalability, (b) SNR and temporal hybrid scalability.

8. Why aren't B-frames used as reference frames for motion compensation? Suppose there is a mode where any frame type can be specified as a reference frame. Discuss the tradeoffs of using reference B-frames instead of P-frames in a video sequence (i.e., eliminating P-frames completely).

9. Suggest a method for using motion compensation in the enhancement layer of an SNR-scalable MPEG-2 bitstream. Why isn't that recommended in the MPEG-2 standard?

10. Write a program to implement the SNR scalability in MPEG-2. Your program should be able to work on any macroblock using any quantization step_sizes and should output both Bits_base and Bits_enhance bitstreams. The variable-length coding step can be omitted.

11.6 REFERENCES

1 L. Chiariglione, "The Development of an Integrated Audiovisual Coding Standard: MPEG," Proceedings of the IEEE, 83: 151-157, 1995.

2 L. Chiariglione, "Impact of MPEG Standards on Multimedia Industry," Proceedings of the IEEE, 86(6): 1222-1227, 1998.

3 R. Schafer and T. Sikora, "Digital Video Coding Standards and Their Role in Video Communications," Proceedings of the IEEE, 83(6): 907-924, 1995.

4 D.J. LeGall, "MPEG: A Video Compression Standard for Multimedia Applications," Communications of the ACM, 34(4): 46-58, 1991.

5 Information Technology — Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, International Standard: ISO/IEC 11172, Parts 1-5, 1992.

6 J.L. Mitchell, W.R. Pennebaker, C.E. Fogg, and D.J. LeGall, MPEG Video Compression Standard, New York: Chapman & Hall, 1996.

7 B.G. Haskell, A. Puri, and A. Netravali, Digital Video: An Introduction to MPEG-2, New York: Chapman & Hall, 1997.

8 Information Technology — Generic Coding of Moving Pictures and Associated Audio Information, International Standard: ISO/IEC 13818, Parts 1-10, 1994.

9 B.G. Haskell, et al., "Image and Video Coding: Emerging Standards and Beyond," IEEE Transactions on Circuits and Systems for Video Technology, 8(7): 814-837, 1998.
Section 12.1 Overview of MPEG-4 333

CHAPTER 12

MPEG Video Coding II —

MPEG-4, 7, and Beyond

12.1 OVERVIEW OF MPEG-4


MPEG-1 and -2 employ frame-based coding techniques, in which each rectangular video frame is treated as a unit for compression. Their main concern is high compression ratio and satisfactory quality of video under such compression techniques. MPEG-4 is a newer standard [1]. Besides compression, it pays great attention to user interactivities. This allows a larger number of users to create and communicate their multimedia presentations and applications on new infrastructures, such as the Internet, the World Wide Web (WWW), and mobile/wireless networks. MPEG-4 departs from its predecessors in adopting a new object-based coding approach — media objects are now entities for MPEG-4 coding. Media objects (also known as audio and visual objects) can be either natural or synthetic; that is to say, they may be captured by a videocamera or created by computer programs.

Object-based coding not only has the potential of offering higher compression ratio but is also beneficial for digital video composition, manipulation, indexing, and retrieval. Figure 12.1 illustrates how MPEG-4 videos can be composed and manipulated by simple operations such as insertion/deletion, translation/rotation, scaling, and so on, on the visual objects.

FIGURE 12.1: Composition and manipulation of MPEG-4 videos (VOP = Video Object Plane).

MPEG-4 (version 1) was finalized in October 1998 and became an international standard in early 1999, referred to as ISO/IEC 14496 [2]. An improved version (version 2) was finalized in December 1999 and acquired International Standard status in 2000. Similar to the previous MPEG standards, its first five parts are Systems, Video, Audio, Conformance, and Software. Its sixth part, Delivery Multimedia Integration Framework (DMIF), is new, and we will discuss it in more detail in Chapter 16, where we discuss multimedia network communications and applications.

Originally targeted at low-bitrate communication (4.8 to 64 kbps for mobile applications and up to 2 Mbps for other applications), the bitrate for MPEG-4 video now covers a large range, between 5 kbps and 10 Mbps.

As the Reference Models in Figure 12.2(a) show, an MPEG-1 system simply delivers audio and video data from its storage and does not allow any user interactivity. MPEG-2 added an Interaction component [indicated by dashed lines in Figure 12.2(a)] and thus permits limited user interactions in applications such as networked video and Interactive TV. MPEG-4 [Figure 12.2(b)] is an entirely new standard for (a) composing media objects to create desirable audiovisual scenes, (b) multiplexing and synchronizing the bitstreams for these media data entities so that they can be transmitted with guaranteed Quality of Service (QoS), and (c) interacting with the audiovisual scene at the receiving end. MPEG-4 provides a toolbox of advanced coding modules and algorithms for audio and video compression.

MPEG-4 defines Binary Format for Scenes (BIFS) [3] that facilitates the composition of media objects into a scene. BIFS is often represented by a scene graph, in which the nodes describe audiovisual primitives and their attributes and the graph structure enables a description of spatial and temporal relationships of objects in the scene. BIFS is an enhancement of Virtual Reality Modeling Language (VRML). In particular, it emphasizes timing and synchronization of objects, which were lacking in the original VRML design.

In addition to BIFS, MPEG-4 (version 2) provides a programming environment, MPEG-J [4], in which Java applications (called MPEGlets) can access Java packages and APIs so as to enhance end users' interactivities.

The hierarchical structure of MPEG-4 visual bitstreams is very different from that of MPEG-1 and 2 in that it is very much video-object-oriented. Figure 12.3 illustrates five levels of the hierarchical description of a scene in MPEG-4 visual bitstreams. In general, each Video-object Sequence (VS) will have one or more Video Objects (VOs), each VO will have one or more Video Object Layers (VOLs), and so on. Syntactically, all five levels have a unique start code in the bitstream, to enable random access.

1. Video-object Sequence (VS). VS delivers the complete MPEG-4 visual scene, which may contain 2-D or 3-D natural or synthetic objects.

Video-object Sequence (VS)

Video Object (VO)

Video Object Layer (VOL)

Group of VOPs (GOV)

Video Object Plane (VOP)

FIGURE 12.3: Video-object-orienied hierarchical description of a scene in MPBG-4 visual


bilstreams.

2. Video Object (VO). VO is a particular object in the scene, which can be of arbitrary (nonrectangular) shape, corresponding to an object or background of the scene.

3. Video Object Layer (VOL). VOL facilitates a way to support (multilayered) scalable coding. A VO can have multiple VOLs under scalable coding, or a single VOL under nonscalable coding. As a special case, MPEG-4 also supports a special type of VOL with a shorter header. This provides bitstream compatibility with the baseline H.263 [5].

4. Group of Video Object Planes (GOV). GOV groups video object planes. It is an optional level.

5. Video Object Plane (VOP). A VOP is a snapshot of a VO at a particular moment, reflecting the VO's shape, texture, and motion parameters at that instant. In general, a VOP is an image of arbitrary shape. A degenerate case in MPEG-4 video coding occurs when the entire rectangular video frame is treated as a VOP; in this case, it is equivalent to MPEG-1 and 2. MPEG-4 allows overlapped VOPs; that is, a VOP can partially occlude another VOP in a scene.

FIGURE 12.2: Comparison of interactivities in MPEG standards: (a) reference models in MPEG-1 and 2 (interaction in dashed lines supported only by MPEG-2); (b) MPEG-4 reference model.

12.2 OBJECT-BASED VISUAL CODING IN MPEG-4

MPEG-4 encodes/decodes each VOP separately (instead of considering the whole frame). Hence, its object-based visual coding is also known as VOP-based coding. Our discussion will start with coding for natural objects (more details can be found in [6, 7]). Section 12.3 describes synthetic object coding.

12.2.1 VOP-Based Coding vs. Frame-Based Coding

MPEG-1 and 2 do not support the VOP concept; hence, their coding method is referred to as frame-based. Since each frame is divided into many macroblocks from which motion-compensation-based coding is conducted, it is also known as block-based coding. Figure 12.4(a) shows three frames from a video sequence with a vehicle moving toward the left and a pedestrian walking in the opposite direction. Figure 12.4(b) shows the typical block-based coding in which the motion vector (MV) is obtained for one of the macroblocks.

MPEG-1 and 2 visual coding are concerned only with compression ratio and do not consider the existence of visual objects. Therefore, the motion vectors generated may be inconsistent with the object's motion and would not be useful for object-based video analysis and indexing.
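The block-matching step just described can be sketched as a toy exhaustive search that picks the displacement minimizing the sum of absolute differences (SAD). The frame contents, block size, and search range below are made-up values; real encoders add many refinements:

```python
import numpy as np

def sad(block, ref, i, j, x, y, N):
    """SAD between the target block at (x, y) and the reference block displaced by (i, j)."""
    return np.abs(block - ref[y + j:y + j + N, x + i:x + i + N]).sum()

def best_motion_vector(target, ref, x, y, N, p):
    """Exhaustive search over displacements (i, j) in [-p, p] x [-p, p]."""
    block = target[y:y + N, x:x + N]
    best = None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if 0 <= x + i and x + i + N <= ref.shape[1] and \
               0 <= y + j and y + j + N <= ref.shape[0]:
                d = sad(block, ref, i, j, x, y, N)
                if best is None or d < best[0]:
                    best = (d, (i, j))
    return best[1]

# Toy example: a 4x4 bright patch moves two pixels to the left between frames.
ref = np.zeros((16, 16), dtype=np.int32)
ref[6:10, 8:12] = 100                      # object in the reference frame
target = np.zeros((16, 16), dtype=np.int32)
target[6:10, 6:10] = 100                   # same object, shifted left by 2
mv = best_motion_vector(target, ref, x=6, y=6, N=4, p=3)
print(mv)  # (2, 0): the matching block lies 2 pixels to the right in the reference
```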

FIGURE 12.4: Comparison between block-based coding and object-based coding: (a) a video sequence; (b) MPEG-1 and 2 block-based coding; (c) two potential matches in MPEG-1 and 2; (d) object-based coding in MPEG-4.

Figure 12.4(c) illustrates a possible example in which both potential matches yield small prediction errors. If Potential Match 2 yields a (slightly) smaller prediction error than Potential Match 1, MV2 will be chosen as the motion vector for the macroblocks in the block-based coding approach, although only MV1 is consistent with the vehicle's direction of motion.

Object-based coding in MPEG-4 is aimed at solving this problem, in addition to improving compression. Figure 12.4(d) shows each VOP is of arbitrary shape and will ideally obtain a unique motion vector consistent with the object's motion.

MPEG-4 VOP-based coding also employs the motion compensation technique. An intraframe-coded VOP is called an I-VOP. Inter-frame-coded VOPs are called P-VOPs if only forward prediction is employed, or B-VOPs if bidirectional predictions are employed. The new difficulty here is that the VOPs may have arbitrary shapes. Therefore, in addition to their texture, their shape information must now be coded.

It is worth noting that texture here actually refers to the visual content, that is, the gray-level and chroma values of the pixels in the VOP. MPEG-1 and 2 do not code shape information, since all frames are rectangular, but they do code the values of the pixels in the frame. In MPEG-1 and 2, this coding was not explicitly referred to as texture coding. The term "texture" comes from computer graphics and shows how this discipline has entered the video coding world with MPEG-4.

Below, we start with a discussion of motion-compensation-based coding for VOPs, followed by introductions to texture coding, shape coding, static texture coding, sprite coding, and global motion compensation.

12.2.2 Motion Compensation

This section addresses issues of VOP-based motion compensation in MPEG-4. Since I-VOP coding is relatively straightforward, our discussions will concentrate on coding for P-VOP and/or B-VOP unless I-VOP is explicitly mentioned.

As before, motion-compensation-based VOP coding in MPEG-4 again involves three steps: motion estimation, motion-compensation-based prediction, and coding of the prediction error. To facilitate motion compensation, each VOP is divided into many macroblocks, as in previous frame-based methods. Macroblocks are by default 16 × 16 in luminance images and 8 × 8 in chrominance images and are treated specially when they straddle the boundary of an arbitrarily shaped VOP.

MPEG-4 defines a rectangular bounding box for each VOP. Its left and top bounds are the left and top bounds of the VOP, which in turn specify the shifted origin for the VOP from the original (0, 0) for the video frame in the absolute (frame) coordinate system (see Figure 12.5). Both horizontal and vertical dimensions of the bounding box must be multiples of 16 in the luminance image. Therefore, the box is usually slightly larger than a conventional bounding box.

Macroblocks entirely within the VOP are referred to as interior macroblocks. As is apparent from Figure 12.5, many of the macroblocks straddle the boundary of the VOP and are called boundary macroblocks.

Motion compensation for interior macroblocks is carried out in the same manner as in MPEG-1 and 2. However, boundary macroblocks could be difficult to match in motion

FIGURE 12.5: Bounding box and boundary macroblocks of VOP.

estimation, since VOPs often have arbitrary (nonrectangular) shape, and their shape may change from one instant in the video to another. To help match every pixel in the target VOP and meet the mandatory requirement of rectangular blocks in transform coding (e.g., DCT), a preprocessing step of padding is applied to the reference VOPs prior to motion estimation.

Only pixels within the VOP of the current (target) VOP are considered for matching in motion compensation, and padding takes place only in the reference VOPs. For quality, some better extrapolation method than padding could have been developed. Padding was adopted in MPEG-4 largely due to its simplicity and speed. The first two steps of motion compensation are padding and motion vector coding.

Padding. For all boundary macroblocks in the reference VOP, horizontal repetitive padding is invoked first, followed by vertical repetitive padding (Figure 12.6). Afterward, for all exterior macroblocks that are outside of the VOP but adjacent to one or more boundary macroblocks, extended padding is applied.

The horizontal repetitive padding algorithm examines each row in the boundary macroblocks in the reference VOP. Each boundary pixel is replicated to the left and/or right to fill in the values for the interval of pixels outside the VOP in the macroblock. If the interval is bounded by two boundary pixels, their average is adopted.

FIGURE 12.6: A sequence of paddings for reference VOPs in MPEG-4: horizontal repetitive padding, followed by vertical repetitive padding, followed by extended padding.

ALGORITHM 12.1 HORIZONTAL REPETITIVE PADDING

begin
    for all rows in Boundary macroblocks in the Reference VOP
        if ∃ (boundary pixel) in the row
            for all intervals outside of the VOP
                if interval is bounded by only one boundary pixel b
                    assign the value of b to all pixels in interval
                else // interval is bounded by two boundary pixels b1 and b2
                    assign the value of (b1 + b2)/2 to all pixels in interval
end

The subsequent vertical repetitive padding algorithm works similarly. It examines each column, and the newly padded pixels by the preceding horizontal padding process are treated as pixels inside the VOP for the purpose of this vertical padding.

EXAMPLE 12.1

Figure 12.7 illustrates an example of repetitive padding in a boundary macroblock of a reference VOP. Figure 12.7(a) shows the luminance (or chrominance) intensity values of pixels in the VOP, with the VOP's boundary shown as darkened lines. For simplicity, the macroblock's resolution is reduced to 6 × 6 in this example, although its actual macroblock size is 16 × 16 in luminance images and 8 × 8 in chrominance images.

FIGURE 12.7: An example of repetitive padding in a boundary macroblock of a reference VOP: (a) original pixels within the VOP; (b) after horizontal repetitive padding; (c) followed by vertical repetitive padding.

1. Horizontal repetitive padding [Figure 12.7(b)]

Row 0. The rightmost pixel of the VOP is the only boundary pixel. Its intensity value, 60, is used repetitively as the value of the pixels outside the VOP.

Row 1. Similarly, the rightmost pixel of the VOP is the only boundary pixel. Its intensity value, 50, is used repetitively as the pixel value outside of the VOP.

Rows 2 and 3. No horizontal padding, since no boundary pixels exist.

Row 4. There exist two intervals outside the VOP, each bounded by a single boundary pixel. Their intensity values, 60 and 70, are used as the pixel values of the two intervals, respectively.

Row 5. A single interval outside the VOP is bounded by a pair of boundary pixels of the VOP. The average of their intensity values, (50 + 80)/2 = 65, is used repetitively as the value of the pixels between them.

2. Vertical repetitive padding [Figure 12.7(c)]

Column 0. A single interval is bounded by a pair of boundary pixels of the VOP. One is 42 in the VOP; the other is 60, which just arose from horizontal padding. The average of their intensity values, (42 + 60)/2 = 51, is repetitively used as the value of the pixels between them.

Columns 1, 2, 3, 4, and 5. These columns are padded similarly to Column 0.

Extended Padding. Macroblocks entirely outside the VOP are exterior macroblocks. Exterior macroblocks immediately next to boundary macroblocks are filled by replicating the values of the border pixels of the boundary macroblock. We note that boundary macroblocks are by now fully padded, so all their horizontal and vertical border pixels have defined values. If an exterior macroblock has more than one boundary macroblock as its immediate neighbor, the boundary macroblock to use for extended padding follows a priority list: left, top, right, and bottom.

Later versions of MPEG-4 allow some average values of these macroblocks to be used. This extended padding process can be repeated to fill in all exterior macroblocks within the rectangular bounding box of the VOP.

Motion Vector Coding. Each macroblock from the target VOP will find a best matching macroblock from the reference VOP through the following motion estimation procedure. Let C(x + k, y + l) be pixels of the macroblock in the target VOP, and R(x + i + k, y + j + l) be pixels of the macroblock in the reference VOP. Similar to MAD in Eq. (10.1), a Sum of Absolute Difference (SAD) for measuring the difference between the two macroblocks can be defined as

    SAD(i, j) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} |C(x + k, y + l) − R(x + i + k, y + j + l)| · Map(x + k, y + l)

where N is the size of the macroblock. Map(p, q) = 1 when C(p, q) is a pixel within the target VOP; otherwise, Map(p, q) = 0. The vector (i, j) that yields the minimum SAD is adopted as the motion vector MV(u, v):

    (u, v) = {(i, j) | SAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p]}        (12.1)

where p is the maximal allowable magnitude for u and v.

For motion compensation, the motion vector MV is coded. As in H.263 (see Figure 10.11), the motion vector of the target macroblock is not simply taken as the MV. Instead, MV is predicted from three neighboring macroblocks. The prediction error for the motion vector is then variable-length coded.

Following are some of the advanced motion compensation techniques adopted, similar to the ones in H.263 (see Section 10.5).

• Four motion vectors (each from an 8 × 8 block) can be generated for each macroblock in the luminance component of a VOP.

• Motion vectors can have subpixel precision. At half-pixel precision, the range of motion vectors is [−2,048, 2,047]. MPEG-4 also allows quarter-pixel precision in the luminance component of a VOP.

• Unrestricted motion vectors are allowed: MV can point beyond the boundaries of the reference VOP. When a pixel outside the VOP is referenced, its value is still defined, due to padding.

12.2.3 Texture Coding

Texture refers to gray-level (or chroma) variations and/or patterns in the VOP. Texture coding in MPEG-4 can be based either on DCT or Shape-Adaptive DCT (SA-DCT).

Texture Coding Based on DCT. In I-VOP, the gray (or chroma) values of the pixels in each macroblock of the VOP are directly coded, using the DCT followed by VLC, which is similar to what is done in JPEG for still pictures. P-VOP and B-VOP use motion-compensation-based coding; hence, it is the prediction error that is sent to DCT and VLC. The following discussion will be focused on motion-compensation-based texture coding for P-VOP and B-VOP.

Coding for interior macroblocks, each 16 × 16 in the luminance VOP and 8 × 8 in the chrominance VOP, is similar to the conventional motion-compensation-based coding in H.261, H.263, and MPEG-1 and 2. Prediction errors from the six 8 × 8 blocks of each macroblock are obtained after the conventional motion estimation step. These are sent to a DCT routine to obtain six 8 × 8 blocks of DCT coefficients.

For boundary macroblocks, areas outside the VOP in the reference VOP are padded using repetitive padding, as described above. After motion compensation, texture prediction errors within the target VOP are obtained. For portions of the boundary macroblocks in the target VOP outside the VOP, zeros are padded to the block sent to DCT, since ideally, prediction errors would be near zero inside the VOP. Whereas repetitive padding and extended padding were for better matching in motion compensation, this additional zero padding is for better DCT results in texture coding.

The quantization stepsize for the DC component is 8. For the AC coefficients, one of the following two methods can be employed:

• The H.263 method, in which all coefficients receive the same quantizer controlled by a single parameter, and different macroblocks can have different quantizers.

• The MPEG-2 method, in which DCT coefficients in the same macroblock can have different quantizers and are further controlled by the stepsize parameter.

Shape-Adaptive DCT (SA-DCT)-Based Coding for Boundary Macroblocks. SA-DCT [8] is another texture coding method for boundary macroblocks. Due to its effectiveness, SA-DCT has been adopted for coding boundary macroblocks in MPEG-4 version 2.

1D DCT-N is a variation of the 1D DCT described earlier [Eqs. (8.19) and (8.20)], in that N elements are used in the transform instead of a fixed N = 8. (For short, we will denote the 1D DCT-N transform by DCT-N in this section.) Eqs. (12.2) and (12.3) describe the DCT-N transform and its inverse, IDCT-N.

1D Discrete Cosine Transform-N (DCT-N)

    F(u) = sqrt(2/N) · C(u) · Σ_{i=0}^{N−1} cos[(2i + 1)uπ / (2N)] · f(i)        (12.2)

1D Inverse Discrete Cosine Transform-N (IDCT-N)

    f(i) = Σ_{u=0}^{N−1} sqrt(2/N) · C(u) · cos[(2i + 1)uπ / (2N)] · F(u)        (12.3)

where i = 0, 1, ..., N − 1, u = 0, 1, ..., N − 1, and

    C(u) = √2/2 if u = 0; C(u) = 1 otherwise.

SA-DCT is a 2D DCT and is computed as a separable 2D transform in two iterations of DCT-N. Figure 12.8 illustrates the process of texture coding for boundary macroblocks using SA-DCT. The transform is applied to each of the 8 × 8 blocks in the boundary macroblock.

Figure 12.8(a) shows one of the 8 × 8 blocks of a boundary macroblock, where pixels inside the macroblock, denoted f(x, y), are shown gray. The gray pixels are first shifted upward to obtain f′(x, y), as Figure 12.8(b) shows. In the first iteration, DCT-N is applied to each column of f′(x, y), with N determined by the number of gray pixels in the column. Hence, we use DCT-2, DCT-3, DCT-5, and so on. The resulting DCT-N coefficients are denoted by F′(x, v), as Figure 12.8(c) shows, where the dark dots indicate the DC coefficients of the DCT-Ns. The elements of F′(x, v) are then shifted to the left to obtain F″(x, v) in Figure 12.8(d).

In the second iteration, DCT-N is applied to each row of F″(x, v) to obtain G(u, v) (Figure 12.8(e)), in which the single dark dot indicates the DC coefficient G(0, 0) of the 2D SA-DCT.

FIGURE 12.8: Texture coding for boundary macroblocks using the Shape-Adaptive DCT (SA-DCT): (a) f(x, y); (b) f′(x, y); (c) F′(x, v); (d) F″(x, v); (e) G(u, v).

Some coding considerations:

• The total number of DCT coefficients in G(u, v) is equal to the number of gray pixels inside the 8 × 8 block of the boundary macroblock, which is less than 8 × 8. Hence, the method is shape adaptive and is more efficient to compute.

• At decoding time, since the array elements must be shifted back properly after each iteration of IDCT-Ns, a binary mask of the original shape is required to decode the texture information coded by SA-DCT. The binary mask is the same as the binary alpha map described below.

12.2.4 Shape Coding

Unlike in MPEG-1 and 2, MPEG-4 must code the shape of the VOP, since shape is one of the intrinsic features of visual objects.

MPEG-4 supports two types of shape information: binary and grayscale. Binary shape information can be in the form of a binary map (also known as a binary alpha map) that is of the same size as the VOP's rectangular bounding box. A value of 1 (opaque) or 0 (transparent) in the bitmap indicates whether the pixel is inside or outside the VOP.
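The DCT-N pair of Eqs. (12.2) and (12.3) translates directly into code. The sketch below uses an arbitrary 5-pixel column, as SA-DCT might meet in a boundary block, and checks that IDCT-N inverts DCT-N:

```python
import math

def dct_n(f):
    """1D DCT-N of Eq. (12.2): N is simply the length of the input vector."""
    N = len(f)
    F = []
    for u in range(N):
        c = math.sqrt(2) / 2 if u == 0 else 1.0   # C(u)
        s = sum(math.cos((2 * i + 1) * u * math.pi / (2 * N)) * f[i]
                for i in range(N))
        F.append(math.sqrt(2 / N) * c * s)
    return F

def idct_n(F):
    """1D IDCT-N of Eq. (12.3), the inverse of dct_n."""
    N = len(F)
    f = []
    for i in range(N):
        s = 0.0
        for u in range(N):
            c = math.sqrt(2) / 2 if u == 0 else 1.0
            s += math.sqrt(2 / N) * c * \
                 math.cos((2 * i + 1) * u * math.pi / (2 * N)) * F[u]
        f.append(s)
    return f

# Round-trip check on a column with five VOP pixels (values are arbitrary).
col = [42, 48, 55, 60, 51]
rec = idct_n(dct_n(col))
print([round(v) for v in rec])  # [42, 48, 55, 60, 51]
```

A constant input produces only a DC coefficient, which is one quick way to sanity-check the scaling factors.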

Alternatively, the grayscale shape information actually refers to the shape's transparency, with gray values ranging from 0 (transparent) to 255 (opaque).

Binary Shape Coding. To encode the binary alpha map more efficiently, the map is divided into 16 × 16 blocks, also known as Binary Alpha Blocks (BAB). If a BAB is entirely opaque or transparent, it is easy to code, and no special technique of shape coding is necessary. It is the boundary BABs that contain the contour and hence the shape information for the VOP. They are the subject of binary shape coding.

Various contour-based and bitmap-based (or area-based) algorithms have been studied and compared for coding boundary BABs [9]. Two of the finalists were both bitmap-based. One was the Modified Modified READ (MMR) algorithm, which was also an optional enhancement in the fax Group 3 (G3) standard [10] and the mandatory compression method in the Group 4 (G4) standard [11]. The other finalist was Context-based Arithmetic Encoding (CAE), which was initially developed for JBIG [12]. CAE was finally chosen as the binary shape-coding method for MPEG-4 because of its simplicity and compression efficiency.

MMR is basically a series of simplifications of the Relative Element Address Designate (READ) algorithm. The basic idea behind the READ algorithm is to code the current line relative to the pixel locations in the previously coded line. The algorithm starts by identifying five pixel locations in the previous and current lines:

• a0: the last pixel value known to both the encoder and decoder

• a1: the transition pixel to the right of a0

• a2: the second transition pixel to the right of a0

• b1: the first transition pixel whose color is opposite to a0 in the previously coded line

• b2: the first transition pixel to the right of b1 on the previously coded line

READ works by examining the relative positions of these pixels. At any time, both the encoder and decoder know the position of a0, b1, and b2, while the positions a1 and a2 are known only in the encoder.

Three coding modes are used:

• If the run lengths on the previous and the current lines are similar, the distance between a1 and b1 should be much smaller than the distance between a0 and a1. Thus, the vertical mode encodes the current run length as a1 − b1.

• If the previous line has no similar run length, the current run length is coded using one-dimensional run-length coding. This is called the horizontal mode.

• If a0 < b1 < b2 < a1, we can simply transmit a codeword indicating it is in pass mode and advance a0 to the position under b2, and continue the coding process.

Some simplifications can be made to the READ algorithm for practical implementation. For example, if |a1 − b1| < 3, then it is enough to indicate that we can apply the vertical mode. Also, to prevent error propagation, a k-factor is defined, such that every k lines must contain at least one line coded using conventional run-length coding. These modifications constitute the Modified READ algorithm used in the G3 standard. The Modified Modified READ (MMR) algorithm simply removes the restrictions imposed by the k-factor.

FIGURE 12.9: Contexts in CAE for binary shape coding in MPEG-4. O indicates the current pixel, and digits indicate the other pixels in the neighborhood: (a) intra-CAE; (b) inter-CAE.

For Context-based Arithmetic Encoding, Figure 12.9 illustrates the "context" for a pixel in the boundary BAB. In intra-CAE mode, when only the target alpha map is involved (Figure 12.9(a)), ten neighboring pixels (numbered from 0 to 9) in the same alpha map form the context. The ten binary numbers associated with these pixels can offer up to 2^10 = 1,024 possible contexts.

Now, it is apparent that certain contexts (e.g., all 1s or all 0s) appear more frequently than others. With some prior statistics, a probability table can be built to indicate the probability of occurrence for each of the 1,024 contexts.

Recall that Arithmetic coding (Chapter 7) is capable of encoding a sequence of probabilistic symbols with a single number. Now, each pixel can look up the table to find a probability value for its context. CAE simply scans the 16 × 16 pixels in each BAB sequentially and applies Arithmetic coding to eventually derive a single floating-point number for the BAB.

Inter-CAE mode is a natural extension of intra-CAE: it involves both the target and reference alpha maps. For each boundary macroblock in the target frame, a process of motion estimation (in integer precision) and compensation is invoked first to locate the matching macroblock in the reference frame. This establishes the corresponding positions for each pixel in the boundary BAB.

Figure 12.9(b) shows the context of each pixel includes four neighboring pixels from the target alpha map and five pixels from the reference alpha map. According to its context, each pixel in the boundary BAB is assigned one of the 2^9 = 512 probabilities. Afterward, the CAE algorithm is applied.

The 16 × 16 binary map originally contains 256 bits of information. Compressing it to a single floating number achieves a substantial saving.

The above CAE method is lossless! The MPEG-4 group also examined some simple lossy versions of the above shape-coding method. For example, the binary alpha map can be simply subsampled by a factor of 2 or 4 before arithmetic coding. The tradeoff is, of course, the deterioration of the shape.
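A sketch of how an intra-CAE context index could be formed from ten previously coded neighbors. The neighbor offsets below are an illustrative causal template, not the exact template and numbering fixed by the MPEG-4 standard:

```python
import numpy as np

def intra_context(alpha, x, y):
    """Pack ten causal-neighbor bits into a context index in [0, 1023].
    Pixels outside the alpha map are treated as transparent (0)."""
    # (dy, dx) offsets of ten already-coded neighbors (illustrative layout).
    offsets = [(-2, -1), (-1, -2), (-1, -1), (-1, 0), (-1, 1),
               (-1, 2), (0, -2), (0, -1), (-2, 0), (-2, 1)]
    ctx = 0
    for k, (dy, dx) in enumerate(offsets):
        yy, xx = y + dy, x + dx
        bit = 0
        if 0 <= yy < alpha.shape[0] and 0 <= xx < alpha.shape[1]:
            bit = int(alpha[yy, xx])
        ctx |= bit << k          # one bit per neighbor, 2^10 contexts total
    return ctx

bab = np.zeros((16, 16), dtype=np.uint8)
bab[4:12, 4:12] = 1  # a square opaque region inside the binary alpha block
print(intra_context(bab, 8, 8))  # 1023: all ten neighbors are opaque
```

Each pixel's context index would then select an entry from the 1,024-entry probability table that drives the arithmetic coder.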

Grayscale Shape Coding. The term grayscale shape coding in MPEG-4 could be misleading, because the true shape information is coded in the binary alpha map. Grayscale here is used to describe the transparency of the shape, not the texture!

In addition to the bitplanes for RGB frame buffers, raster graphics uses extra bitplanes for an alpha map, which can be used to describe the transparency of the graphical object. When the alpha map has more than one bitplane, multiple levels of transparency can be introduced: for example, 0 for transparent, 255 for opaque, and any number in between for various degrees of intermediate transparency. The term grayscale is used for transparency coding in MPEG-4 simply because the transparency number happens to be in the range of 0 to 255, the same as conventional 8-bit grayscale intensities.

Grayscale shape coding in MPEG-4 employs the same technique as in the texture coding described above. It uses the alpha map and block-based motion compensation and encodes prediction errors by DCT. The boundary macroblocks need padding, as before, since not all pixels are in the VOP.

Coding of the transparency information (grayscale shape coding) is lossy, as opposed to coding of the binary shape information, which is by default lossless.

12.2.5 Static Texture Coding

MPEG-4 uses wavelet coding for the texture of static objects. This is particularly applicable when the texture is used for mapping onto 3D surfaces.

As introduced in Chapter 8, wavelet coding can recursively decompose an image into subbands of multiple frequencies. The Embedded Zerotree Wavelet (EZW) algorithm [13] provides a compact representation by exploiting the potentially large number of insignificant coefficients in the subbands.

The coding of subbands in MPEG-4 static texture coding is conducted as follows:

• The subbands with the lowest frequency are coded using DPCM. Prediction of each coefficient is based on three neighbors.

• Coding of other subbands is based on a multiscale zerotree wavelet coding method.

The multiscale zerotree has a parent-child relation (PCR) tree for each coefficient in the lowest frequency subband. As a result, the location information of all coefficients is better tracked.

In addition to the original magnitude of the coefficients, the degree of quantization affects the data rate. If the magnitude of a coefficient is zero after quantization, it is considered insignificant. At first, a large quantizer is used; only the most significant coefficients are selected and subsequently coded using arithmetic coding. The difference between the quantized and the original coefficients is kept in residual subbands, which will be coded in the next iteration, in which a smaller quantizer is employed. The process can continue for additional iterations; hence, it is very scalable.

FIGURE 12.10: Sprite coding: (a) the sprite panoramic image of the background; (b) the foreground object (piper) in a bluescreen image; (c) the composed video scene. (This figure also appears in the color insert section.) Piper image courtesy of Simon Fraser University Pipe Band.

12.2.6 Sprite Coding

Video photography often involves camera movements such as pan, tilt, zoom in/out, and so on. Often, the main objective is to track and examine foreground (moving) objects. Under these circumstances, the background can be treated as a static image. This creates a new VO type, the sprite: a graphic image that can freely move around within a larger graphic image or set of images.

To separate the foreground object from the background, we introduce the notion of a sprite panorama: a still image that describes the static background over a sequence of video frames. It can be generated using image "stitching" and warping techniques [14]. The large sprite panoramic image can be encoded and sent to the decoder only once, at the beginning of the video sequence. When the decoder receives separately coded foreground objects and parameters describing the camera movements thus far, it can efficiently reconstruct the scene.

Figure 12.10(a) shows a sprite that is a panoramic image stitched from a sequence of video frames. By combining the sprite background with the piper in the bluescreen image (Figure 12.10(b)), the new video scene (Figure 12.10(c)) can readily be decoded with the aid of the sprite code and the additional pan/tilt and zoom parameters. Clearly, foreground objects can either be from the original video scene or newly created to realize flexible object-based composition of MPEG-4 videos.
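The successive-refinement loop of the static texture coder (a coarse quantizer first, with the residuals requantized more finely on each pass) can be sketched as follows; the coefficient values and step sizes here are arbitrary illustrations:

```python
def quantize(values, step):
    """Uniform quantization; coefficients that round to zero are 'insignificant'."""
    return [step * round(v / step) for v in values]

def successive_refinement(coeffs, steps):
    """Each pass codes the residual left over by the previous, larger quantizer."""
    passes = []
    residual = list(coeffs)
    for step in steps:
        q = quantize(residual, step)
        passes.append(q)                              # would be entropy-coded
        residual = [r - s for r, s in zip(residual, q)]
    return passes, residual

coeffs = [103.0, -7.5, 0.8, 31.2]
passes, residual = successive_refinement(coeffs, steps=[32, 8, 2])

# Reconstruction improves as more passes are summed in; decoding any prefix of
# the passes gives a coarser but usable approximation, which is why the scheme
# is scalable.
recon = [sum(p[i] for p in passes) for i in range(len(coeffs))]
print(recon)                               # close to the original coefficients
print(max(abs(r) for r in residual) <= 1)  # True: bounded by half the last step
```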

12.2.7 Global Motion Compensation

Common camera motions, such as pan, tilt, rotation, and zoom (so-called global motions, since they apply to every block), often cause rapid content change between successive video frames. Traditional block-based motion compensation would result in a large number of significant motion vectors. Also, these types of camera motions cannot all be described using the translational motion model employed by block-based motion compensation. Global motion compensation (GMC) is designed to solve this problem. There are four major components:

• Global motion estimation. Global motion estimation computes the motion of the current image with respect to the sprite. By "global" is meant overall change due to camera change: zooming in, panning to the side, and so on. It is computed by minimizing the sum of square differences between the sprite S and the global motion-compensated image I′:

    E = Σ_{i=1}^{N} [S(x_i, y_i) − I′(x_i, y_i)]²        (12.4)

The idea here is that if the background (possibly stitched) image is a sprite S(x_i, y_i), we expect the new frame to consist mainly of the same background, altered by these global camera motions. To further constrain the global motion estimation problem, the motion over the whole image is parameterized by a perspective motion model using eight parameters, defined as

    x_i′ = (a0 + a1·x_i + a2·y_i) / (a6·x_i + a7·y_i + 1)
    y_i′ = (a3 + a4·x_i + a5·y_i) / (a6·x_i + a7·y_i + 1)        (12.5)

This resulting constrained minimization problem can be solved using a gradient-descent-based method [15].

• Warping and blending. Once the motion parameters are computed, the background images are warped to align with respect to the sprite. The coordinates of the warped image are computed using Eq. (12.5). Afterward, the warped image is blended into the current sprite to produce the new sprite. This can be done using simple averaging or some form of weighted averaging.

• Motion trajectory coding. Instead of directly transmitting the motion parameters, we encode only the displacements of reference points. This is called trajectory coding [15]. Points at the corners of the VOP bounding box are used as reference points, and their corresponding points in the sprite are calculated. The difference between these two entities is coded and transmitted as differential motion vectors.

• Choice of local motion compensation (LMC) or GMC. Finally, a decision has to be made whether to use GMC or LMC. For this purpose, we can apply GMC to the moving background and LMC to the foreground. Heuristically (and with much detail skipped), if SAD_GMC < SAD_LMC, then use GMC to generate the predicted reference VOP. Otherwise, use LMC as before.

12.3 SYNTHETIC OBJECT CODING IN MPEG-4

The number of objects in videos that are created by computer graphics and animation software is increasing. These are denoted synthetic objects and can often be presented together with natural objects and scenes in games, TV ads and programs, and animation or feature films.

In this section, we briefly discuss 2D mesh-based and 3D model-based coding and animation methods for synthetic objects. Beek, Petajan, and Ostermann [16] provide a more detailed survey of this subject.

12.3.1 2D Mesh Object Coding

A 2D mesh is a tessellation (or partition) of a 2D planar region using polygonal patches. The vertices of the polygons are referred to as nodes of the mesh. The most popular meshes are triangular meshes, where all polygons are triangles. The MPEG-4 standard makes use of two types of 2D mesh: uniform mesh and Delaunay mesh [17]. Both are triangular meshes that can be used to model natural video objects as well as synthetic animated objects.

Since the triangulation structure (the edges between nodes) is known and can be readily regenerated by the decoder, it is not coded explicitly in the bitstream. Hence, 2D mesh object coding is compact. All coordinate values of the mesh are coded in half-pixel precision.

Each 2D mesh is treated as a mesh object plane (MOP). Figure 12.11 illustrates the encoding process for 2D MOPs. Coding can be divided into geometry coding and motion coding. As shown, the input data is the x and y coordinates of all the nodes and the triangles (t_n) in the mesh. The output data is the displacements (dx_n, dy_n) and the prediction errors of the motion (ex_n, ey_n), both of which are explained below.

FIGURE 12.11: 2D Mesh Object Plane (MOP) encoding process.

2D Mesh Geometry Coding. MPEG-4 allows four types of uniform meshes with different triangulation structures. Figure 12.12 shows such meshes with 4 × 5 mesh nodes. Each uniform mesh can be specified by five parameters: the first two specify the number of
350 Chapter 12 MPEG Video Coding II — MPEG-4, 7, and Beyond Section 12.3 Synthetic Object Coding in MPEG-4 351
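To illustrate how compact a uniform mesh is, the following Python sketch builds a type-0 mesh (every rectangle split by a diagonal in the same direction) over a grid of nodes. The function name and the 16-pixel rectangle size are our own illustrative choices, not part of the standard.

```python
# Sketch (not from the standard): build a type-0 uniform mesh, where every
# rectangle is split into two triangles by a diagonal in the same direction.
# The arguments mirror the five MPEG-4 uniform-mesh parameters: nodes per
# row/column, rectangle width/height, and (implicitly) the mesh type.

def uniform_mesh_type0(nodes_per_row, nodes_per_col, rect_w, rect_h):
    """Return (nodes, triangles) for a type-0 uniform mesh."""
    nodes = [(c * rect_w, r * rect_h)
             for r in range(nodes_per_col) for c in range(nodes_per_row)]
    tris = []
    for r in range(nodes_per_col - 1):
        for c in range(nodes_per_row - 1):
            tl = r * nodes_per_row + c            # top-left node index
            tr = tl + 1                           # top-right
            bl = tl + nodes_per_row               # bottom-left
            br = bl + 1                           # bottom-right
            tris.append((tl, bl, tr))             # upper-left triangle
            tris.append((tr, bl, br))             # lower-right triangle
    return nodes, tris

nodes, tris = uniform_mesh_type0(5, 4, 16, 16)    # the 4 x 5 grid of Figure 12.12
print(len(nodes), len(tris))                      # 20 nodes, 24 triangles
```

Because the whole triangulation is regenerated from a handful of parameters like these, nothing but the parameters needs to be transmitted.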

FIGURE 12.12: Four types of uniform meshes: (a) type 0; (b) type 1; (c) type 2; (d) type 3.

Each uniform mesh can be specified by five parameters: the first two specify the number of nodes in each row and column, respectively; the next two specify the horizontal and vertical size of each rectangle (containing two triangles), respectively; and the last specifies the type of the uniform mesh.

Uniform meshes are simple and are especially good for representing 2D rectangular objects (e.g., the entire video frame). When used for objects of arbitrary shape, they are applied to (overlaid on) the bounding boxes of the VOPs, which incurs some inefficiency. A Delaunay mesh is a better object-based mesh representation for arbitrarily shaped 2D objects.

Definition 1: If D is a Delaunay triangulation, then any of its triangles t_n = (P_i, P_j, P_k) in D satisfies the property that the circumcircle of t_n does not contain in its interior any other node point P_l.

A Delaunay mesh for a video object can be obtained in the following steps:

1. Select boundary nodes of the mesh. A polygon is used to approximate the boundary of the object. The polygon vertices are the boundary nodes of the Delaunay mesh. A possible heuristic is to select boundary points with high curvatures as boundary nodes.

2. Choose interior nodes. Feature points within the object's boundary, such as edge points or corners, can be chosen as interior nodes for the mesh.

3. Perform Delaunay triangulation. A constrained Delaunay triangulation is performed on the boundary and interior nodes, with the polygonal boundary used as a constraint. The triangulation will use line segments connecting consecutive boundary nodes as edges and form triangles only within the boundary.

Constrained Delaunay Triangulation. Interior edges are first added to form new triangles. The algorithm will examine each interior edge to make sure it is locally Delaunay. Given two triangles (P_i, P_j, P_k) and (P_j, P_k, P_l) sharing an edge jk, if (P_i, P_j, P_k) contains P_l or (P_j, P_k, P_l) contains P_i in the interior of its circumcircle, then jk is not locally Delaunay and will be replaced by a new edge il.

If P_l falls exactly on the circumcircle of (P_i, P_j, P_k) (and accordingly, P_i also falls exactly on the circumcircle of (P_j, P_k, P_l)), then jk will be viewed as locally Delaunay only if P_i or P_l has the largest x coordinate among the four nodes.

FIGURE 12.13: Delaunay mesh: (a) boundary nodes (P0 to P7) and interior nodes (P8 to P13); (b) triangular mesh obtained by constrained Delaunay triangulation.

Figure 12.13(a) and (b) show the set of Delaunay mesh nodes and the result of the constrained Delaunay triangulation. If the total number of nodes is N, and N = N_b + N_i, where N_b and N_i denote the number of boundary nodes and interior nodes respectively, then the total number of triangles in the Delaunay mesh is N_b + 2 N_i − 2. In the above figure, this sum is 8 + 2 x 6 − 2 = 18.

Unlike a uniform mesh, the node locations in a Delaunay mesh are irregular; hence, they must be coded. By convention of MPEG-4, the location (x0, y0) of the top left boundary node¹ is coded first, followed by the other boundary points counterclockwise [see Figure 12.13(a)] or clockwise. Afterward, the locations of the interior nodes are coded in any order.

Except for the first location (x0, y0), all subsequent coordinates are coded differentially — that is, for n ≥ 1,

    dx_n = x_n − x_{n−1},    dy_n = y_n − y_{n−1},        (12.6)

and afterward, dx_n, dy_n are variable-length coded.

¹ The top left boundary node is defined as the one that has the minimum x + y coordinate value. If more than one boundary node has the same x + y, the one with the minimum y is chosen.
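The differential coding of Eq. (12.6) is simple to sketch in Python. The helper names below are ours, and the final variable-length coding stage is omitted; coordinates are assumed to be already in half-pel units.

```python
# Sketch of the differential node-location coding of Eq. (12.6): after the
# top-left boundary node (x0, y0), each node is coded as an offset (dx, dy)
# from its predecessor in the coding order.

def encode_nodes(points):
    """points: node locations in coding order. Returns the first location
    plus the list of differences that would be variable-length coded."""
    first = points[0]
    diffs = [(xn - xp, yn - yp)
             for (xp, yp), (xn, yn) in zip(points, points[1:])]
    return first, diffs

def decode_nodes(first, diffs):
    """Invert encode_nodes by accumulating the differences."""
    pts = [first]
    for dx, dy in diffs:
        x, y = pts[-1]
        pts.append((x + dx, y + dy))
    return pts

pts = [(4, 2), (10, 2), (12, 8), (5, 9)]
first, diffs = encode_nodes(pts)
assert decode_nodes(first, diffs) == pts
print(diffs)    # small offsets, cheap to variable-length code
```

Because neighboring nodes tend to be close together, the differences are small and compress well under the subsequent variable-length code.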

motion vectors for ali nade points in Lhe 21) mesh. Mesh-based Lexture mapping is now
used to generate lhe Lexture for Lhe new animated surface by warping [14] Lhe LexLure of
each Lriangle in Lhe reference MOE’ onto lhe corresponding triangle in Lhe target MOR This
facilitates Lhe animation of 2D syntheLic video objects.
For triangular meshes, a common mapping function for Lhe warping is Lhe affine trans
form, since it maps a line Lo a une and can guarantee Lhat a Lriangle is mapped lo a Lriangle. lt
will be shown below that given Lhe six vertices of Lhe two matching triangles, lhe parameters
for Lhe affine transform can be obtained, so LhaL Lhe Lransform can be applied Lo ali points
wiLhin Lhe Larget Lriangle for Lexwre mapping.
Given a poinL P = (x, y) ar’ a 2D plane, a linear lransform can be specified, such Lhat
FIGURE 12.14: A breadth-first order of MOP Lriangles for 2D inesh moLion coding.
[x’ y’j = [x ~ E 0,2
a22
(12.9)

2D Mesh Motion Coding. The motion of each MOP triangle in either a uniform or A Lransform T is linear if T(aX + ØY) = aT(X) + ØT(Y), where a and fi are scalars.
Delaunay mesh is described by Lhe motion vectors of its three vertex nodes. A new mesh The above linear transform is suitable for geometric operations such as rotaLion and scaling
stnicture can be creaLed only in lhe inLra-frame, and its triangular topology will not alter but not lo Lranslation, since addition of a constant vector is not possible.
iii Lhe subsequent inter-frames. This enforces one-to-one mapping iii 2D mesh motion
Definiton 2: A transform A is an affine transforin if and only if there exists a vecLor C and
estimation.
a linear transform T such thaL A(X) = T(X) + C.
For any MOP triangle (Pé, P~, Pi), ifthe tnotion vectors for P~ and ri are known lo be
MV1 and MVi, then a prediction Predk will be made for the motion vecLor of Pk, rounded lf Lhe point (x, y) is represented as [x, y, 1] in Lhe homogeneous coordinate sysLem com
lo a half-pixel precision: monly used in graphics [18], Lhen an affine transforni LhaL transfonus [x, y, 1] lo [x’, y’, 1]
is defined as:
Predk = 0.5 (MV1 + MVi). (12.7)
[1 y’ 1j=[x y lii
r 021
~I2
~22
O
O (12.10)
The prediction error Ck is coded as
L 031 032 1
Ck = MVk — Predk. (12.8)
IL realizes Lhe following mapping:
Once Lhe three motion vectors of Lhe first MOP triangie lo are coded, aI least one neigh xl = ailx+a2ly+a31 (12.11)
boring MOP Lriangle will share an edge with 1o~ and Lhe motion vector for its third vertex
= ai2x+a22y+a32 (12.12)
nade can be coded, and 50 on.
The estimation of motion vectors will start aI the initial triangle o~ which is Lhe triangie and has aL most 6 degrees of freedom represenLed by Lhe parameters au, a~j, a3~, 012, 022,
Lhat contains lhe top left boundary node and Lhe boundary node next Lo iL, ciockwise. Motion a32.
vectors for ali oLher nodes ir’ Lhe MOP are coded differentially, according Lo Eq. (12.8). A The following 3 x 3 matrices are Lhe affine transfonus for Lranslating by (T1, T~), rotaLing
breadth-first order is esLablished for traversing lhe MOE’ triangles ir’ Lhe 2D mesh motion counterclockwise by 8, and scaling by facLors S~ and S~:
coding process.
Figure 12.14 shows how a spanning tree can be generated Lo obtain Lhe breadth-first order [i O °1 E cosO sinO 01 ~-s1 o o
of Lhe triangies. As shown, Lhe initial Lriangle to has two neighboring triangles ti and ~2, 1 O 1 0 —sinO cosO O O S~ O
which are not visited yet. They become child nades of o in Lhe spanning tree. LT1 2’~ JJ L ° O 1J LO O 1
Triangles lj and 12, in mm, have their unvisited neighboring triangies (and hence child The following are Lhe affine transforms for sheering along Lhe x-axis and y-axis, respecLively:
nades) (3, (4, and (5, te, respectively. The traverse order so far is to. tj, t2, (3, £4, t~, in a
breadth-firsL fashion. One levei down Lhe spanning Lree, t~ has oniy one child node l~, since
lhe oLher neighbor i~ is already visited; (4 has only one child nade lg; and so on.
riH~ 00
1 O
r~o1 14 O
lo
Lo o i Lo o i
ZD Object Animation The above mesh motion coding established a one-to-one map
ping between Lhe mesh triangles in the reference MOP and lhe target MOE’. lt generated where ll~ and II,, are consLants determining Lhe degree of sheering.
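As a quick numerical check of the row-vector convention above, the following sketch composes a translation with a counterclockwise rotation and applies the result to a point. The function names are illustrative; matrices here are plain nested lists.

```python
# Sketch of the row-vector affine matrices above: a point is [x, y, 1] and
# post-multiplies the 3 x 3 matrix, so "translate, then rotate" is the
# matrix product T @ R, applied left to right.
import math

def translate(tx, ty):
    return [[1, 0, 0], [0, 1, 0], [tx, ty, 1]]

def rotate_ccw(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0], [-s, c, 0], [0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(p, m):
    """Transform point p = (x, y) by the 3 x 3 matrix m."""
    v = [p[0], p[1], 1]
    out = [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]
    return (out[0], out[1])

# Translate by (1, 0), then rotate 90 degrees counterclockwise:
composite = matmul(translate(1, 0), rotate_ccw(math.pi / 2))
x, y = apply((1, 0), composite)
print(round(x, 6), round(y, 6))   # (1,0) -> (2,0) -> (0,2): prints 0.0 2.0
```

Note that the composite matrix keeps the same form, with zeros in the last column, which is exactly the 6-degree-of-freedom property used in the text that follows.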

The above simple affine transforms can be combined (by matrix multiplications) to yield composite affine transforms — for example, for a translation followed by a rotation, or a shearing followed by other transforms.

It can be proven (see Exercise 7) that any composite transform thus generated will have exactly the same matrix form and will have at most 6 degrees of freedom, specified by a11, a21, a31, a12, a22, a32.

If the triangle in the target MOP is

    (P_0, P_1, P_2) = ((x_0, y_0), (x_1, y_1), (x_2, y_2))

and the matching triangle in the reference MOP is

    (P'_0, P'_1, P'_2) = ((x'_0, y'_0), (x'_1, y'_1), (x'_2, y'_2)),

then the mapping between the two triangles can be uniquely defined by the following:

    [ x'_0  y'_0  1 ]   [ x_0  y_0  1 ] [ a11  a12  0 ]
    [ x'_1  y'_1  1 ] = [ x_1  y_1  1 ] [ a21  a22  0 ]        (12.13)
    [ x'_2  y'_2  1 ]   [ x_2  y_2  1 ] [ a31  a32  1 ]

Eq. (12.13) contains six linear equations (three for the x's and three for the y's), required to resolve the six unknown coefficients a11, a21, a31, a12, a22, a32. Let Eq. (12.13) be stated as X' = X A. Then it is known that A = X⁻¹ X', with the inverse matrix given by X⁻¹ = adj(X)/det(X), where adj(X) is the adjoint of X and det(X) is the determinant. Therefore,

    [ a11  a12  0 ]                [ y_1 − y_2          y_2 − y_0          y_0 − y_1         ] [ x'_0  y'_0  1 ]
    [ a21  a22  0 ] = 1/det(X) ·   [ x_2 − x_1          x_0 − x_2          x_1 − x_0         ] [ x'_1  y'_1  1 ]        (12.14)
    [ a31  a32  1 ]                [ x_1 y_2 − x_2 y_1  x_2 y_0 − x_0 y_2  x_0 y_1 − x_1 y_0 ] [ x'_2  y'_2  1 ]

where det(X) = x_0 (y_1 − y_2) − y_0 (x_1 − x_2) + (x_1 y_2 − x_2 y_1).

Since the three vertices of the mesh triangle are never colinear points, it is ensured that X is not singular — that is, det(X) ≠ 0. Therefore Eq. (12.14) always has a unique solution.

The above affine transform is piecewise — that is, each triangle can have its own affine transform. It works well only when the object is mildly deformed during the animation sequence. Figure 12.15(a) shows a Delaunay mesh with a simple word mapped onto it. Figure 12.15(b) shows the warped word in a subsequent MOP in the animated sequence after an affine transform.

FIGURE 12.15: Mesh-based texture mapping for 2D object animation.

12.3.2 3D Model-Based Coding

Because of the frequent appearances of human faces and bodies in videos, MPEG-4 has defined special 3D models for face objects and body objects. Some of the potential applications for these new video objects include teleconferencing, human–computer interfaces, games, and e-commerce. In the past, 3D wireframe models and their animations have been studied for 3D object animation [19]. MPEG-4 goes beyond wireframes, so that the surfaces of the face or body objects can be shaded or texture-mapped.

Face Object Coding and Animation. Face models for individual faces could either be created manually or generated automatically through computer vision and pattern recognition techniques. However, the former is cumbersome and nevertheless inadequate, and the latter has yet to be achieved reliably.

MPEG-4 has adopted a generic default face model, developed by the Virtual Reality Modeling Language (VRML) Consortium [20]. Face Animation Parameters (FAPs) can be specified to achieve desirable animations — deviations from the original "neutral" face. In addition, Face Definition Parameters (FDPs) can be specified to better describe individual faces. Figure 12.16 shows the feature points for FDPs. Feature points that can be affected by animation (FAPs) are shown as solid circles; those that are not affected are shown as empty circles.

FIGURE 12.16: Feature points for face definition parameters (FDPs). (Feature points for teeth and tongue are not shown.)

Sixty-eight FAPs are defined [16]: FAP 1 is for visemes and FAP 2 for facial expressions.

Visemes code highly realistic lip motions by modeling the speaker's current mouth position. All other FAPs are for possible movements of head, jaw, lip, eyelid, eyeball, eyebrow, pupil, chin, cheek, tongue, nose, ear, and so on.

For example, expressions include neutral, joy, sadness, anger, fear, disgust, and surprise. Each is expressed by a set of features — sadness, for example, by slightly closed eyes, relaxed mouth, and upward-bent inner eyebrows. FAPs for movement include head_pitch, head_yaw, head_roll, open_jaw, thrust_jaw, shift_jaw, push_bottom_lip, push_top_lip, and so on.

For compression, the FAPs are coded using predictive coding. Predictions for FAPs in the target frame are made based on FAPs in the previous frame, and prediction errors are then coded using arithmetic coding. DCT can also be employed to improve the compression ratio, although it is considered more computationally expensive. FAPs are also quantized, with different quantization step sizes employed to exploit the fact that certain FAPs (e.g., open_jaw) need less precision than others (e.g., push_top_lip).

Body Object Coding and Animation. MPEG-4 Version 2 introduced body objects, which are a natural extension to face objects.

Working with the Humanoid Animation (H-Anim) Group in the VRML Consortium, MPEG adopted a generic virtual human body with default posture. The default is standing, with feet pointing to the front, arms at the sides, with palms facing inward. There are 296 Body Animation Parameters (BAPs). When applied to any MPEG-4-compliant generic body, they will produce the same animation.

A large number of BAPs describe joint angles connecting different body parts, including the spine, shoulder, clavicle, elbow, wrist, finger, hip, knee, ankle, and toe. These yield 186 degrees of freedom to the body, 25 to each hand alone. Furthermore, some body movements can be specified in multiple levels of detail. For example, five different levels, supporting 9, 24, 42, 60, and 72 degrees of freedom, can be used for the spine, depending on the complexity of the animation.

For specific bodies, Body Definition Parameters (BDPs) can be specified for body dimensions, body surface geometry, and, optionally, texture. Body surface geometry uses a 3D polygon mesh representation, consisting of a set of polygonal planar surfaces in 3D space [18]. The 3D mesh representation is popular in computer graphics for surface modeling. Coupled with texture mapping, it can deliver good (photorealistic) renderings.

The coding of BAPs is similar to that of FAPs: quantization and predictive coding are used, and the prediction errors are further compressed by arithmetic coding.

12.4 MPEG-4 OBJECT TYPES, PROFILES AND LEVELS

Like MPEG-2, MPEG-4 defines many Profiles and Levels for various applications. The standardization of profiles and levels in MPEG-4 serves two main purposes: ensuring interoperability between implementations and allowing testing of conformance to the standard. MPEG-4 not only specifies Visual profiles (Part 2 [2]) and Audio profiles (Part 3 [2]) but also Graphics profiles, Scene description profiles, and one Object descriptor profile in its Systems part (Part 1 [2]). We will briefly describe the Visual profiles in this section.

Since MPEG-4 scenes often contain more than one video object, the concept of object type is introduced, to define the tools needed to create video objects and the ways they can be combined in a scene. Table 12.1 shows the various tools applicable to the MPEG-4 natural visual object types.² For example, for object type "Core", only five tools are used. Tools such as "gray-level shape coding", "sprite", "interlace", and so on, will not be used.

TABLE 12.1: Tools for MPEG-4 natural visual object types.

    Tools                             Simple  Core  Main  Simple    N-bit  Scalable
                                                          scalable         still texture
    Basic MC-based tools                *      *     *       *        *
    B-VOP                                      *     *       *        *
    Binary shape coding                        *     *                *
    Gray-level shape coding                          *
    Sprite                                           *
    Interlace                                        *
    Temporal scalability (P-VOP)               *     *       *
    Spatial and temporal scalability
      (rectangular VOP)                                      *
    N-bit                                                             *
    Scalable still texture                                                     *
    Error resilience                    *      *     *       *        *

Table 12.2 shows these object types in MPEG-4 visual profiles. Main profile, for example, supports only object types "simple", "core", "main", and "scalable still texture".

Table 12.3 lists the levels supported by the three most commonly used Visual profiles: Simple, Core, and Main. For example, although CIF (352 x 288) is supported in four different levels (levels 2 and 3 in the Simple profile, level 2 in the Core profile, and level 1 in the Main profile), very different bitrates and maximum numbers of objects are specified. Hence, different qualities for CIF videos would be expected.

12.5 MPEG-4 PART 10/H.264

The Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG (Video Coding Experts Group) developed the H.264 video compression standard, which was scheduled to be completed by March 2003. It was formerly known by its working title "H.26L". Preliminary studies using software based on this new standard suggest that H.264 offers up to 50% better compression than MPEG-2 and up to 30% better than H.263+ and MPEG-4 advanced simple profile.

² We have not listed the MPEG-4 synthetic visual object types, which include animated 2D mesh, simple face, simple body, and so on.

TABLE 12.2: MPEG-4 natural visual object types and profiles.

    Object types            Simple  Core  Main  Simple    N-bit  Scalable
                                                scalable         texture
    Simple                    *      *     *       *        *
    Core                             *     *                *
    Main                                   *
    Simple scalable                                *
    N-bit                                                   *
    Scalable still texture                 *                         *

TABLE 12.3: MPEG-4 levels in Simple, Core, and Main visual profiles.

    Profile   Level   Typical picture size   Bitrate (bits/sec)   Max number of objects
    Simple    1       176 x 144 (QCIF)       64 k                 4
              2       352 x 288 (CIF)        128 k                4
              3       352 x 288 (CIF)        384 k                4
    Core      1       176 x 144 (QCIF)       384 k                4
              2       352 x 288 (CIF)        2 M                  16
    Main      1       352 x 288 (CIF)        2 M                  16
              2       720 x 576 (CCIR 601)   15 M                 32
              3       1920 x 1080 (HDTV)     38.4 M               32

The outcome of this work is actually two identical standards: ISO MPEG-4 Part 10 and ITU-T H.264. With its superior compression performance over MPEG-2, H.264 is currently one of the leading candidates to carry HDTV video content in many potential applications. The following sections give a brief overview of this new standard, in accordance with [21].

12.5.1 Core Features

Similar to the previous ITU-T H.263+, H.264 specifies a block-based, motion-compensated transform hybrid decoder with five major blocks:

• Entropy decoding
• Motion compensation or intra-prediction
• Inverse scan, quantization, and transform of residual pixels
• Reconstruction
• In-loop deblocking filter on reconstructed pixels

Each picture can again be separated into macroblocks (16 x 16 blocks), and arbitrarily sized slices can group multiple macroblocks into self-contained units.

VLC-Based Entropy Decoding. Two entropy methods are used in the variable-length entropy decoder: Unified-VLC (UVLC) and Context-Adaptive VLC (CAVLC). UVLC uses simple exponential Golomb codes to decode header data, motion vectors, and other nonresidual data, while the more complex CAVLC decodes residual coefficients.

In CAVLC, multiple VLC tables are predefined for each data type (runs, levels, etc.), and predefined rules predict the optimal VLC table based on the context (previously decoded symbols). CAVLC allows multiple statistical models to be used for each data type and improves entropy coding efficiency over existing fixed VLC, such as in H.263+.

Motion Compensation (P-Prediction). Inter-frame motion compensation in H.264 is similar to H.263+ but more sophisticated. Instead of limiting motion-compensation block size to either 16 x 16 or 8 x 8, as in H.263+, H.264 uses a tree-structured motion segmentation down to 4 x 4 block size (16 x 16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8, 4 x 4). This allows much more accurate motion compensation of moving objects.

Furthermore, motion vectors in H.264 can be up to quarter-sample accuracy. A six-tap sinc filter is used for half-pixel interpolation, to preserve high frequencies. Simple averaging is used for quarter-pixel interpolation, which provides not only more accurate motion but also a lower-pass filter than the half-pixel interpolation. Multiple reference frames are also a standard feature in H.264, so that the ability to choose a different reference frame for each macroblock is available in all profiles.

Intra-Prediction (I-Prediction). H.264 exploits much more spatial prediction than previous video standards such as H.263+. Intra-coded macroblocks are all predicted using neighboring reconstructed pixels (using both intra- and inter-coded reconstructed pixels). Similar to motion compensation, different block sizes can be chosen for each intra-coded macroblock (16 x 16 or 4 x 4). There are nine prediction modes for 4 x 4 blocks (where each 4 x 4 block in a macroblock can have a different prediction mode) and four prediction modes for 16 x 16 blocks. This sophisticated intra-prediction is powerful, as it drastically reduces the amount of data to be transmitted when temporal prediction fails.

Transform, Scan, Quantization. Given the powerful and accurate P- and I-prediction schemes in H.264, it is recognized that the spatial correlation in residual pixels is typically very low. Hence, a simple integer-precision 4 x 4 DCT is sufficient to compact the energy. The integer arithmetic allows exact inverse transform on all processors and eliminates encoder/decoder mismatch problems in previous transform-based codecs. H.264 also provides a quantization scheme with nonlinear step sizes to obtain accurate rate control at both the high and low ends of the quantization scale.
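The 4 x 4 integer transform at the heart of H.264 can be sketched with its well-known core matrix C, computing W = C X Cᵀ. This sketch omits the per-coefficient scaling that H.264 folds into the quantization step.

```python
# Sketch of the H.264 4 x 4 integer "DCT" core: W = C X C^T, using only
# integer arithmetic (so every decoder reproduces the inverse exactly).
# The post-scaling that the standard merges into quantization is omitted.

C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_4x4(block):
    """Core forward transform of a 4 x 4 residual block."""
    return matmul(matmul(C, block), transpose(C))

block = [[10] * 4 for _ in range(4)]   # a constant (flat) residual block
W = forward_4x4(block)
print(W[0][0], W[0][1])                # all energy in the DC term: 160 0
```

For a flat residual block, all the energy ends up in the single DC coefficient, which is exactly the energy-compaction behavior the text describes.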

In-Loop Deblocking Filters. H.264 specifies a sophisticated signal-adaptive deblocking filter in which a set of filters is applied on 4 x 4 block edges. Filter length, strength, and type (deblocking/smoothing) vary, depending on macroblock coding parameters (intra- or inter-coded, motion-vector differences, reference-frame differences, coefficients coded) and spatial activity (edge detection), so that blocking artifacts are eliminated without distorting visual features. The H.264 deblocking filter is important in increasing the subjective quality of the standard.

12.5.2 Baseline Profile Features

The Baseline profile of H.264 is intended for real-time conversational applications, such as videoconferencing. It contains all the core coding tools of H.264 discussed above and the following additional error-resilience tools, to allow for error-prone carriers such as IP and wireless networks:

• Arbitrary slice order (ASO). The decoding order of slices within a picture may not follow monotonically increasing order. This allows decoding of out-of-order packets in a packet-switched network, thus reducing latency.

• Flexible macroblock order (FMO). Macroblocks can be decoded in any order, such as checkerboard patterns, not just raster-scan order. This is useful on error-prone networks, so that loss of a slice results in loss of macroblocks scattered in the picture, which can easily be masked from human eyes. This feature can also help reduce jitter and latency, as the decoder may decide not to wait for late slices and still be able to produce acceptable pictures.

• Redundant slices. Redundant copies of the slices can be decoded, to further improve error resilience.

12.5.3 Main Profile Features

The Main profile defined by H.264 represents non-low-delay applications such as broadcasting and stored media. The Main profile contains all Baseline profile features (except ASO, FMO, and redundant slices) plus the following non-low-delay and higher-complexity features, for maximum compression efficiency:

• B slices. The bi-prediction mode in H.264 has been made more flexible than in existing standards. Bi-predicted pictures can also be used as reference frames. The two reference frames for each macroblock can be in any temporal direction, as long as they are available in the reference frame buffer. Hence, in addition to the normal forward + backward bi-prediction, it is legal to have backward + backward or forward + forward prediction as well.

• Context-Adaptive Binary Arithmetic Coding (CABAC). This coding mode replaces VLC-based entropy coding with binary arithmetic coding that uses a different adaptive statistics model for different data types and contexts.

• Weighted Prediction. Global weights (a multiplier and an offset) for modifying the motion-compensated prediction samples can be specified for each slice, to predict lighting changes and other global effects, such as fading.

12.5.4 Extended Profile Features

The eXtended profile (or profile X) is designed for the new video streaming applications. This profile allows non-low-delay features, bitstream switching features, and also more error-resilience tools. It includes all Baseline profile features plus the following:

• B slices

• Weighted prediction

• Slice data partitioning. This partitions slice data with different importance into separate sequences (header information, residual information), so that more important data can be transmitted on more reliable channels.

• SP and SI slice types. These are slices that contain special temporal prediction modes, to allow bitstream switching, fast forward/backward, and random access.

The vastly improved H.264 core features, together with the new coding tools, offer significant improvement in compression ratio, error resiliency, and subjective quality over existing ITU-T and MPEG standards.

12.6 MPEG-7

As more and more multimedia content becomes an integral part of various applications, effective and efficient retrieval becomes a primary concern. In October 1996, the MPEG group therefore took on the development of another major standard, MPEG-7, following on MPEG-1, 2, and 4.

One common ground between MPEG-4 and MPEG-7 is the focus on audiovisual objects. The main objective of MPEG-7 [22] is to serve the need of audiovisual content-based retrieval (or audiovisual object retrieval) in applications such as digital libraries. Nevertheless, it is certainly not limited to retrieval — it is applicable to any multimedia applications involving the generation (content creation) and usage (content consumption) of multimedia data.

MPEG-7 became an international standard in September 2001. Its formal name is Multimedia Content Description Interface, documented in ISO/IEC 15938 [23]. The standard's seven parts are Systems, Description Definition Language, Visual, Audio, Multimedia Description Schemes, Reference Software, and Conformance.

MPEG-7 supports a variety of multimedia applications. Its data may include still pictures, graphics, 3D models, audio, speech, video, and composition information (how to combine these elements). These MPEG-7 data elements can be represented in textual or binary format, or both. Part 1 (Systems) specifies the syntax of Binary format for MPEG-7 (BiM) data. Part 2 (Description Definition Language) specifies the syntax of the textual format, which adopts XML Schema as its language of choice. A bidirectional lossless mapping is defined between the textual and binary representations.

FIGURE 12.17: Possible applications using MPEG-7.

Figure 12.17 illustrates some possible applications that will benefit from MPEG-7. As shown, features are extracted and used to instantiate MPEG-7 descriptions. They are then coded by the MPEG-7 encoder and sent to the storage and transmission media. Various search and query engines issue search and browsing requests, which constitute the pull activities of the Internet, whereas the agents filter out numerous materials pushed onto the terminal users and/or computer systems and applications that consume the data.

For multimedia content description, MPEG-7 has developed Descriptors (D), Description Schemes (DS), and a Description Definition Language (DDL). Following are some of the important terms:

• Feature. A characteristic of the data.

• Descriptor (D). A definition (syntax and semantics) of the feature.

• Description Scheme (DS). Specification of the structure and relationships between Ds and DSs (see [24]).

• Description. A set of instantiated Ds and DSs that describes the structural and conceptual information of the content, the storage and usage of the content, and so on.

• Description Definition Language (DDL). Syntactic rules to express and combine DSs and Ds (see [25]).

It is made clear [23] that the scope of MPEG-7 is to standardize the Ds, DSs, and DDL for descriptions. The mechanism and process of producing and consuming the descriptions are beyond the scope of MPEG-7. These are left open for industry innovation and competition and, more importantly, for the arrival of ever-improving new technologies.

Similar to the Simulation Model (SM) in MPEG-1 video, the Test Model (TM) in MPEG-2 video, and the Verification Models (VMs) in MPEG-4 (video, audio, SNHC, and systems), MPEG-7 names its working model the Experimentation Model (XM) — an alphabetical pun! XM provides descriptions of various tools for evaluating the Ds, DSs, and DDL, so that experiments and verifications can be conducted and compared by multiple independent parties all over the world. The first set of such experiments is called the core experiments.

12.6.1 Descriptor (D)

MPEG-7 descriptors are designed to describe both low-level features, such as color, texture, shape, and motion, and high-level features of semantic objects, such as events and abstract concepts. As mentioned above, methods and processes for automatic and even semiautomatic feature extraction are not part of the standard. Despite the efforts and progress in the fields of image and video processing, computer vision, and pattern recognition, automatic and reliable feature extraction is not expected in the near future, especially at the high level.

The descriptors are chosen based on a comparison of their performance, efficiency, and size. Low-level visual descriptors for basic visual features [26] include:

• Color

  – Color space. (a) RGB, (b) YCbCr, (c) HSV (hue, saturation, value) [18], (d) HMMD (HueMaxMinDiff) [27], (e) 3D color space derivable by a 3 x 3 matrix from RGB, (f) monochrome.

  – Color quantization. (a) Linear, (b) nonlinear, (c) lookup tables.

  – Dominant colors. A small number of representative colors in each region or image. These are useful for image retrieval based on color similarity.

  – Scalable color. A color histogram in HSV color space. It is encoded by a Haar transform and hence is scalable.

  – Color layout. Spatial distribution of colors for color-layout-based retrieval.

  – Color structure. The frequency of a color structuring element describes both the color content and its structure in the image. The color structure element is composed of several image samples in a local neighborhood that have the same color.

  – Group of Frames/Group of Pictures (GoF/GoP) color. Similar to the scalable color, except this is applied to a video segment or a group of still images. An aggregated color histogram is obtained by the application of average, median, or intersection operations to the respective bins of all color histograms in the GoF/GoP and is then sent to the Haar transform.
364 Chapter 12 MPEG Video Coding II — MPEG-4, 7, and Beyond Section 12.6 MPEG-7 365

Gabor filters is that they provide simultaneous optimal resolution in both space and spatial-frequency domains [29]. Also, they are bandpass filters that conform to the human visual profile. A filter bank consisting of 30 Gabor filters, at five different scales and six different directions for each scale, is used to extract the texture descriptor
— Texture browsing. Describes the regularity, coarseness, and directionality of edges used to represent and browse homogeneous textures [30]. Again, Gabor filters are used
— Edge histogram. Represents the spatial distribution of four directional (0°, 45°, 90°, 135°) edges and one nondirectional edge. Images are divided into small subimages, and an edge histogram with five bins is generated for each subimage

• Shape
— Region-based shape. A set of Angular Radial Transform (ART) [31] coefficients is used to describe an object's shape. An object can consist of one or more regions, with possibly some holes in the object. ART transform is a 2D complex transform defined in terms of polar coordinates on a unit disc. ART basis functions are separable along the angular and radial dimensions. Thirty-six basis functions, 12 angular and three radial, are used to extract the shape descriptor
— Contour-based shape. Uses a curvature scale space (CSS) representation [32] that is invariant to scale and rotation, and robust to nonrigid motion and partial occlusion of the shape
— 3D shape. Describes 3D mesh models and shape index [33]. The histogram of the shape indices over the entire mesh is used as the descriptor

• Motion
— Camera motion. Fixed, pan, tilt, roll, dolly, track, boom. (See Figure 12.18 and [34].)
— Object motion trajectory. A list of keypoints (x, y, z, t). Optional interpolation functions are used to specify the acceleration along the path. (See [34].)
— Parametric object motion. The basic model is the 2D affine model for translation, rotation, scaling, shearing, and the combination of these. A planar perspective model and quadratic model can be used for perspective distortion and more complex movements
— Motion activity. Provides descriptions such as the intensity, pace, mood, and so on, of the video — for example, "scoring in a hockey game" or "interviewing a person"

• Localization
— Region locator. Specifies the localization of regions in images with a box or a polygon
— Spatiotemporal locator. Describes spatiotemporal regions in video sequences. Uses one or more sets of descriptors of regions and their motions

• Others
— Face recognition. A normalized face image is represented as a 1D vector, then projected onto a set of 49 basis vectors, representing all possible face vectors

FIGURE 12.18: Camera motions: pan, tilt, roll, dolly, track, and boom. (Camera has an effective focal length of f. It is shown initially at the origin, pointing to the direction of the z-axis.)

12.6.2 Description Scheme (DS)

This section provides a brief overview of MPEG-7 Description Schemes (DSs) in the areas of Basic elements, Content management, Content description, Navigation and access, Content organization, and User interaction.

• Basic elements
— Datatypes and mathematical structures. Vectors, matrices, histograms, and so on
— Constructs. Links media files and localizing segments, regions, and so on
— Schema tools. Includes root elements (starting elements of MPEG-7 XML documents and descriptions), top-level elements (organizing DSs for specific content-oriented descriptions), and package tools (grouping related DS components of a description into packages)

• Content Management
— Media Description. Involves a single DS, the MediaInformation DS, composed of a MediaIdentification D and one or more MediaProfile Ds that contain information such as coding method, transcoding hints, storage and delivery formats, and so on
— Creation and Production Description. Includes information about creation (title, creators, creation location, date, etc.), classification (genre, language, parental guidance, etc.), and related materials
— Content Usage Description. Various DSs to provide information about usage rights, usage record, availability, and finance (cost of production, income from content use)

• Content Description
— Structural Description. A Segment DS describes structural aspects of the content. A segment is a section of an audiovisual object. The relationship among segments is often represented as a segment tree. When the relationship is not purely hierarchical, a segment graph is used
The Segment DS can be implemented as a class object. It has five subclasses: Audiovisual segment DS, Audio segment DS, Still region DS, Moving region DS, and Video segment DS. The subclass DSs can recursively have their own subclasses.
A Still region DS, for example, can be used to describe an image in terms of its creation (title, creator, date), usage (copyright), media (file format), textual annotation, color histogram, and possibly texture descriptors, and so on. The initial region (image, in this case) can be further decomposed into several regions, which can in turn have their own DSs.
Figure 12.19 shows a Video segment for a marine rescue mission, in which a person was lowered onto a boat from a helicopter. Three moving regions are inside the Video segment. A segment graph can be constructed to include such structural descriptions as composition of the video frame (helicopter, person, boat), spatial relationship and motion (above, on, close-to, move-toward, etc.) of the regions

FIGURE 12.19: MPEG-7 video segment, with moving regions: helicopter, person, boat. (This figure also appears in the color insert section.)

— Conceptual Description. This involves higher-level (nonstructural) description of the content, such as Event DS for basketball game or Lakers ballgame, Object DS for John or person, State DS for semantic properties at a given time or location, and Concept DS for abstract notations such as "freedom" or "mystery". As for Segment DSs, the concept DSs can also be organized in a tree or graph

• Navigation and access
— Summaries. These provide a video summary for quick browsing and navigation of the content, usually by presenting only the keyframes. The following DSs are supported: Summarization DS, HierarchicalSummary DS, HighlightLevel DS, SequentialSummary DS. Hierarchical summaries provide a keyframe hierarchy of multiple levels, whereas sequential summaries often provide a slide show or audiovisual skim, possibly with synchronized audio and text
Figure 12.20 illustrates a summary for a video of a "dragon-boat" parade and race in a park. The summary is organized in a three-level hierarchy. Each video segment at each level is depicted by a keyframe of thumbnail size
— Partitions and Decompositions. This refers to view partitions and decompositions. The View partitions (specified by View DSs) describe different space and frequency views of the audiovisual data, such as a spatial view (this could be a spatial segment of an image), temporal view (as in a temporal segment of a video), frequency view (as in a wavelet subband of an image), or resolution view (as in a thumbnail image), and so on. The View decompositions DSs
368 Chapter 12 MPEG Video Coding Ii — MPEG4, 7, and Beyond Section 12.7 MPEG-21 369

specify different tree or graph decompositions for organizing the views of the audiovisual data, such as a SpaceTree DS (a quad-tree image decomposition)
— Variations of the Content. A Variation DS specifies a variation from the original data in image resolution, frame rate, color reduction, compression, and so on. It can be used by servers to adapt audiovisual data delivery to network and terminal characteristics for a given Quality of Service (QoS)

• Content Organization
— Collections. The CollectionStructure DS groups audiovisual contents into clusters. It specifies common properties of the cluster elements and relationships among the clusters.
— Models. Model DSs include a Probability model DS, Analytic model DS, and Classifier DS that extract the models and statistics of the attributes and features of the collections.

• User Interaction
— UserPreference. DSs describe user preferences in the consumption of audiovisual contents, such as content types, browsing modes, privacy characteristics, and whether preferences can be altered by an agent that analyzes user behavior.

12.6.3 Description Definition Language (DDL)

MPEG-7 adopted the XML Schema Language initially developed by the WWW Consortium (W3C) as its Description Definition Language (DDL). Since XML Schema Language was not designed specifically for audiovisual contents, some extensions are made to it. Without the details, the MPEG-7 DDL has the following components:

• XML Schema structure components
— The Schema — the wrapper around definitions and declarations
— Primary structural components, such as simple and complex type definitions, and attribute and element declarations
— Secondary structural components, such as attribute group definitions, identity-constraint definitions, group definitions, and notation declarations
— "Helper" components, such as annotations, particles, and wildcards

• XML Schema datatype components
— Primitive and derived data types
— Mechanisms for the user to derive new data types
— Type checking better than XML 1.0

• MPEG-7 Extensions
— Array and matrix data types
— Multiple media types, including audio, video, and audiovisual presentations
— Enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode
— Intellectual Property Management and Protection (IPMP) for Ds and DSs

12.7 MPEG-21

As we stepped into the new century (and millennium), multimedia had seen its ubiquitous use in almost all areas. An ever-increasing number of content creators and content consumers emerge daily in society. However, there is no uniform way to define, identify, describe, manage, and protect multimedia data as yet.

The development of the newest standard, MPEG-21: Multimedia Framework [35], started in June 2000. To quote from its draft Technical Report,

"The vision for MPEG-21 is to define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities."

The seven key elements in MPEG-21 are

• Digital item declaration, to establish a uniform and flexible abstraction and interoperable schema for declaring digital items.
• Digital item identification and description, to establish a framework for standardized identification and description of digital items, regardless of their origin, type, or granularity.
• Content management and usage, to provide an interface and protocol that facilitate management and use (searching, caching, archiving, distributing, etc.) of the content.
• Intellectual property management and protection (IPMP), to enable contents to be reliably managed and protected.
• Terminals and networks, to provide interoperable and transparent access to content with Quality of Service (QoS) across a wide range of networks and terminals.
• Content representation, to represent content in an adequate way for pursuing the objective of MPEG-21, namely "content anytime anywhere".
• Event reporting, to establish metrics and interfaces for reporting events (user interactions), so as to understand performance and alternatives.

Most of the nine parts of MPEG-21 will become international standards by 2003. The development of MPEG-21 involved collaborative work with numerous other international organizations and standards bodies including W3C, Multiservice Switching Forum (MSF), Society of Motion Picture and Television Engineers (SMPTE), and Digital Audio Visual Council (DAVIC). The objective of the standard appears ambitious. It remains to be seen how effective and influential it will be compared to MPEG's earlier (extremely) successful standards.
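Since the DDL above is an extension of XML Schema, MPEG-7 descriptions themselves are ordinary XML documents that standard tools can build, parse, and query. As a loose illustration of that flavor only — the element and attribute names below are simplified stand-ins invented for this sketch, not the normative names from the ISO/IEC 15938 schemas — one might assemble a toy description like this:

```python
import xml.etree.ElementTree as ET

# Toy MPEG-7-style description: one video segment carrying a textual
# annotation and a dominant-color descriptor. All element names here are
# illustrative placeholders, not taken from the normative MPEG-7 schemas.
root = ET.Element("Mpeg7Description")
seg = ET.SubElement(root, "VideoSegment", {"id": "seg1"})
ET.SubElement(seg, "TextAnnotation").text = "marine rescue mission"
color = ET.SubElement(seg, "DominantColor")
ET.SubElement(color, "ColorValue").text = "34 68 102"  # one RGB triple

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In the real standard, such instance documents are validated against DDL-defined schemas; the point of the sketch is only that descriptions are structured, typed XML rather than free text.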

12.8 FURTHER EXPLORATION

The books by Puri and Chen [3] and Pereira and Ebrahimi [36] provide an excellent collection of chapters with pertinent details of MPEG-4. An entire book edited by Manjunath, Salembier, and Sikora [37] is devoted to MPEG-7.

The Further Exploration section of the text web site for this chapter provides links to

• The MPEG home page
• The MPEG FAQ page
• Overviews, tutorials, and working documents of MPEG-4
• Tutorials on MPEG-4 Part 10/H.264
• Overviews of MPEG-7 and working documents for MPEG-21
• Documentation for XML schemas that form the basis of MPEG-7 DDL

12.9 EXERCISES

1. MPEG-4 motion compensation is supposed to be VOP-based. At the end, the VOP is still divided into macroblocks (interior macroblock, boundary macroblock, etc.) for motion compensation.
(a) What are the potential problems of the current implementation? How can they be improved?
(b) Can there be true VOP-based motion compensation? How would it compare to the current implementation?

2. MPEG-1, 2, and 4 are all known as decoder standards. The compression algorithms, hence the details of the encoder, are left open for future improvement and development. For MPEG-4, the major issue of video object segmentation — how to obtain the VOPs — is left unspecified.
(a) Propose some of your own approaches to video object segmentation.
(b) What are the potential problems of your approach?

3. Why was padding introduced in MPEG-4 VOP-based coding? Name some potential problems of padding.

4. Motion vectors can have subpixel precision. In particular, MPEG-4 allows quarter-pixel precision in the luminance VOPs. Describe an algorithm that will realize this precision.

5. As a programming project, compute the SA-DCT for the following 8 × 8 block:

0   0   0   0  16   0   0   0
4   0   8  16  32  16   8   0
4   0  16  32  64  32  16   0
0   0  32  64 128  64  32   0
4   0   0  32  64  32   0   0
0  16   0   0  32   0   0   0
0   0   0   0  16   0   0   0
0   0   0   0   0   0   0   0

6. What is the computational cost of SA-DCT, compared to ordinary DCT? Assume the video object is a 4 × 4 square in the middle of an 8 × 8 block.

7. Affine transforms can be combined to yield a composite affine transform. Prove that the composite transform will have exactly the same form of matrix (with [0 0 1]^T as the last column) and at most 6 degrees of freedom, specified by the parameters a11, a21, a31, a12, a22, a32.

8. Mesh-based motion coding works relatively well for 2D animation and face animation. What are the main problems when it is applied to body animation?

9. How does MPEG-4 perform VOP-based motion compensation? Outline the necessary steps and draw a block diagram illustrating the data flow.

10. What is the major motivation behind the development of MPEG-7? Give three examples of real-world applications that may benefit from MPEG-7.

11. Two of the main shape descriptors in MPEG-7 are "region-based" and "contour-based". There are, of course, numerous ways of describing the shape of regions and contours.
(a) What would be your favorite shape descriptor?
(b) How would it compare to ART and CSS in MPEG-7?

12.10 REFERENCES

1 T. Sikora, "The MPEG-4 Video Standard Verification Model," IEEE Transactions on Circuits and Systems for Video Technology, Special issue on MPEG-4, 7(1): 19–31, 1997.
2 Information Technology — Generic Coding of Audio-Visual Objects, International Standard: ISO/IEC 14496, Parts 1–6, 1998.
3 A. Puri and T. Chen, eds., Multimedia Systems, Standards, and Networks, New York: Marcel Dekker, 2000.
4 G. Fernando, et al., "Java in MPEG-4 (MPEG-J)," in Multimedia, Systems, Standards, and Networks, A. Puri and T. Chen, eds., New York: Marcel Dekker, 2000, pp. 449–460.
5 Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263, Version 1, Nov. 1995, Version 2, Feb. 1998.
6 A. Puri, et al., "MPEG-4 Natural Video Coding — Part I," in Multimedia, Systems, Standards, and Networks, A. Puri and T. Chen, eds., New York: Marcel Dekker, 2000, pp. 205–244.
372 Chapter 12 MPEG Video Coding Ii — MPEG-4, 7, and Beyond Section 12.10 References 373

7 T. Ebrahimi, F. Dufaux, and Y. Nakaya, "MPEG-4 Natural Video Coding — Part II," in Multimedia, Systems, Standards, and Networks, A. Puri and T. Chen, eds., New York: Marcel Dekker, 2000, pp. 245–269.
8 P. Kauff, et al., "Functional Coding of Video Using a Shape-Adaptive DCT Algorithm and an Object-Based Motion Prediction Toolbox," IEEE Transactions on Circuits and Systems for Video Technology, Special issue on MPEG-4, 7(1): 181–196, 1997.
9 J. Ostermann, E.S. Jang, J. Shin, and T. Chen, "Coding of Arbitrarily Shaped Video Objects in MPEG-4," in Proceedings of the International Conference on Image Processing (ICIP '97), 1997.
10 Standardization of Group 3 Facsimile Apparatus for Document Transmission, ITU-T Recommendation T.4, 1980.
11 Facsimile Coding Schemes and Coding Control Functions for Group 4 Facsimile Apparatus, ITU-T Recommendation T.6, 1984.
12 Information Technology — Coded Representation of Picture and Audio Information — Progressive Bi-Level Image Compression, International Standard: ISO/IEC 11544, also ITU-T Recommendation T.82, 1992.
13 J.M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Transactions on Signal Processing, 41(12): 3445–3462, 1993.
14 G. Wolberg, Digital Image Warping, Los Alamitos, CA: Computer Society Press, 1990.
15 M.C. Lee, et al., "A Layered Video Object Coding System Using Sprite and Affine Motion Model," IEEE Transactions on Circuits and Systems for Video Technology, 7(1): 130–145, 1997.
16 P. van Beek, "MPEG-4 Synthetic Video," in Multimedia, Systems, Standards, and Networks, A. Puri and T. Chen, eds., New York: Marcel Dekker, 2000, pp. 299–330.
17 A.M. Tekalp, P. van Beek, C. Toklu, and B. Gunsel, "2D Mesh-Based Visual Object Representation for Interactive Synthetic/Natural Digital Video," Proceedings of the IEEE, 86: 1029–1051, 1998.
18 J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles and Practice, 2nd ed., Reading, MA: Addison-Wesley, 1990.
19 A. Watt and M. Watt, Advanced Animation and Rendering Techniques, Reading, MA: Addison-Wesley, 1999.
20 Information Technology — The Virtual Reality Modeling Language — Part 1: Functional Specification and UTF-8 Encoding, International Standard: ISO/IEC 14772-1, 1997.
21 T. Wiegand, "JVT-F100: Study of Final Committee Draft of Joint Video Specification (ITU-T Rec. H.264 — ISO/IEC 14496-10 AVC), Draft 2," in Sixth Meeting of JVT of ISO/IEC MPEG and ITU-T VCEG, 2002.
22 S.F. Chang, T. Sikora, and A. Puri, "Overview of the MPEG-7 Standard," IEEE Transactions on Circuits and Systems for Video Technology, Special issue on MPEG-7, 11(6): 688–695, 2001.
23 Information Technology — Multimedia Content Description Interface, International Standard: ISO/IEC 15938, Parts 1–6, 2001.
24 P. Salembier and J.R. Smith, "MPEG-7 Multimedia Description Schemes," IEEE Transactions on Circuits and Systems for Video Technology, 11(6): 748–759, 2001.
25 J. Hunter and F. Nack, "An Overview of the MPEG-7 Description Definition Language (DDL) Proposals," Signal Processing: Image Communication, 16(1-2): 271–293, 2001.
26 T. Sikora, "The MPEG-7 Visual Standard for Content Description — An Overview," IEEE Transactions on Circuits and Systems for Video Technology, Special issue on MPEG-7, 11(6): 696–702, 2001.
27 B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, and A. Yamada, "Color and Texture Descriptors," IEEE Transactions on Circuits and Systems for Video Technology, 11: 703–715, 2001.
28 B.S. Manjunath, G.M. Haley, and W.Y. Ma, "Multiband Techniques for Texture Classification and Segmentation," in Handbook of Image and Video Processing, A. Bovik, ed., San Diego: Academic Press, 2000, pp. 367–381.
29 T.P. Weldon, W.E. Higgins, and D.F. Dunn, "Efficient Gabor Filter Design for Texture Segmentation," Pattern Recognition, 29(12): 2005–2016, 1996.
30 P. Wu, B.S. Manjunath, S. Newsam, and H.D. Shin, "A Texture Descriptor for Browsing and Similarity Retrieval," Signal Processing: Image Communication, 16(1-2): 33–43, 2000.
31 P. Salembier and J. Smith, "Overview of MPEG-7 Multimedia Description Schemes and Schema Tools," in Introduction to MPEG-7: Multimedia Content Description Interface, B.S. Manjunath, P. Salembier, and T. Sikora, eds., New York: Wiley, 2002, Chapter 6.
32 F. Mokhtarian and A.K. Mackworth, "A Theory of Multiscale, Curvature-Based Shape Representation for Planar Curves," IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(8): 789–805, 1992.
33 J.J. Koenderink and A.J. van Doorn, "Surface Shape and Curvature Scales," Image and Vision Computing, 10: 557–565, 1992.
34 S. Jeannin, et al., "Motion Descriptor for Content-Based Video Representation," Signal Processing: Image Communication, 16(1-2): 59–85, 2000.
35 Information Technology — Multimedia Framework, International Standard: ISO/IEC 21000, Parts 1–9, 2003.
36 F. Pereira and T. Ebrahimi, The MPEG-4 Book, Upper Saddle River, NJ: Prentice Hall, 2002.
37 B.S. Manjunath, P. Salembier, and T. Sikora, eds., Introduction to MPEG-7: Multimedia Content Description Interface, New York: Wiley, 2002.

CHAPTER 13

Basic Audio Compression Techniques

Compression of audio information is somewhat special in multimedia systems. Some of the techniques used are familiar, while others are new. In this chapter, we take a look at basic audio compression techniques applied to speech compression, setting out a general introduction to a large topic with a long history. More extensive information can be found in the References and Further Exploration sections at the end of the chapter.

In the next chapter, we consider the set of tools developed for general audio compression under the aegis of the Motion Picture Experts Group, MPEG. Since this is generally of high interest to readers focusing on multimedia, we treat that subject in greater detail.

To begin with, let us recall some of the issues covered in Chapter 6 on digital audio in multimedia, such as the μ-law for companding audio signals. This is usually combined with a simple technique that exploits the temporal redundancy present in audio signals. We saw in Chapter 10, on video compression, that differences in signals between the present and a past time could very effectively reduce the size of signal values and, importantly, concentrate the histogram of pixel values (differences, now) into a much smaller range. The result of reducing the variance of values is that the entropy is greatly reduced, and subsequent Huffman coding can produce a greatly compressed bitstream.

The same applies here. Recall from Chapter 6 that quantized sampled output is called Pulse Code Modulation, or PCM. The differences version is called DPCM, and the adaptive version is called ADPCM. Variants that take into account speech properties follow from these.

In this chapter, we look at ADPCM, Vocoders, and more general Speech Compression: LPC, CELP, MBE, and MELP.

13.1 ADPCM IN SPEECH CODING

13.1.1 ADPCM

ADPCM forms the heart of the ITU's speech compression standards G.721, G.723, G.726, and G.727. (See the Further Exploration section for code for these standards.) The differences among these standards involve the bitrate and some details of the algorithm. The default input is μ-law-coded PCM 16-bit samples. Speech performance for ADPCM is such that the perceived quality of speech at 32 kbps is only slightly poorer than with the standard 64 kbps PCM transmission and is better than DPCM.

FIGURE 13.1: Waveform of the word "audio": (a) speech sample, linear PCM at 8 kHz and 16 bits per sample; (b) speech sample, restored from G.721-compressed audio at 4 bits per sample; (c) difference signal between (a) and (b).

Figure 13.1 shows a 1-second speech sample of a voice speaking the word "audio." In Figure 13.1(a), the audio signal is stored as linear PCM (as opposed to the default μ-law PCM) recorded at 8,000 samples per second, with 16 bits per sample. After compression

with ADPCM using standard G.721, the signal appears as in Figure 13.1(b). Figure 13.1(c) shows the difference between the actual and reconstructed, compressed signals. Although differences are apparent electronically between the two, the compressed and original signals are perceptually very similar.

13.2 G.726 ADPCM

ITU G.726 supersedes ITU standards G.721 and G.723. It provides another version of G.711, including companding, at a lower bitrate. G.726 can encode 13- or 14-bit PCM samples or 8-bit μ-law or A-law encoded data into 2-, 3-, 4-, or 5-bit codewords. It can be used in speech transmission over digital networks, videoconferencing, and ISDN communications.

The G.726 standard works by adapting a fixed quantizer in a simple way. The different sizes of codewords used amount to bitrates of 16 kbps, 24 kbps, 32 kbps, or 40 kbps, at an 8 kHz sampling rate. The standard defines a multiplier constant α that will change for every difference value e_n, depending on the current scale of signals. Define a scaled difference signal f_n as follows:

    e_n = s_n − ŝ_n
    f_n = e_n / α    (13.1)

where ŝ_n is the predicted signal value. f_n is then fed into the quantizer for quantization. The quantizer is as displayed in Figure 13.2. Here, the input value is defined as a ratio of a difference with the factor α.

FIGURE 13.2: G.726 quantizer.

By changing the value of α, the quantizer can adapt to change in the range of the difference signal. The quantizer is a nonuniform midtread quantizer, so it includes the value zero. The quantizer is backward adaptive.

A backward-adaptive quantizer works in principle by noticing if too many values are quantized to values far from zero (which would happen if the quantizer step size in f were too small) or if too many values fell close to zero too much of the time (which would happen if the quantizer step size were too large).

In fact, an algorithm due to Jayant [1] allows us to adapt a backward quantizer step size after receiving just one output! The Jayant quantizer simply expands the step size if the quantized input is in the outer levels of the quantizer and reduces the step size if the input is near zero.

Suppose we have a uniform quantizer, so that every range to which we compare input values is of size Δ. For example, for a 3-bit quantizer, there are k = 0 .. 7 levels. For 3-bit G.726, only 7 levels are used, grouped around zero.

The Jayant quantizer assigns multiplier values M_k to each level, with values smaller than 1 for levels near zero and values larger than 1 for outer levels. The multiplier multiplies the step size for the next signal value. That way, outer values enlarge the step size and are likely to bring the next quantized value back to the middle of the available levels. Quantized values near the middle reduce the step size and are likely to bring the next quantized value closer to the outer levels.

So, for signal f_n, the quantizer step size Δ is changed according to the quantized value k, for the previous signal value f_{n−1}, by the simple formula

    Δ ← M_k Δ    (13.2)

Since the quantized version of the signal is driving the change, this is indeed a backward adaptive quantizer.

In G.726, how α is allowed to change depends on whether the audio signal is actually speech or is likely data that is simply using a voice band. In the former case, sample-to-sample differences can fluctuate a great deal, whereas in the latter case of data transmission, this is less true. To adjust to either situation, the factor α is adjusted using a formula with two pieces.

G.726 works as a backward-adaptive Jayant quantizer by using fixed quantizer steps based on the logarithm of the input difference signal, e_n divided by α. The divisor α is written in terms of its logarithm:

    β ≡ log2 α    (13.3)

Since we wish to distinguish between situations when difference values are usually small, and when they are large, α is divided into a so-called locked part, α_L, and an unlocked part, α_U. The idea is that the locked part is a scale factor for small difference values and changes slowly, whereas the unlocked part adapts quickly to larger differences. These correspond to log quantities β_L and β_U.

The logarithm value is written as a sum of two pieces,

    β = A β_U + (1 − A) β_L    (13.4)

where A changes so that it is about 1 for speech, and about 0 for voice-band data. It is calculated based on the variance of the signal, keeping track of several past signal values.

The "unlocked" part adapts via the equation

    α_U ← M_k α_U
    β_U ← log2 M_k + β_U    (13.5)

where M_k is a Jayant multiplier for the kth level. The locked part is slightly modified from the unlocked part, via

    β_L ← (1 − B) β_L + B β_U    (13.6)

where B is a small number, say 2^(−6).

The G.726 predictor is complicated: it uses a linear combination of six quantized differences and two reconstructed signal values from the previous six signal values f_n.
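The adaptation loop of Equations (13.2)–(13.6) can be sketched in a few lines. Everything numeric below is invented for illustration — the multiplier table, the level decision, and the test input are stand-ins, not the actual G.726 tables — but the shape of the loop is the point: quantize the scaled difference, then let the quantized level drive the unlocked (fast) and locked (slow) log-domain scale factors.

```python
import math

# Hypothetical Jayant multipliers for a 4-level magnitude quantizer:
# values < 1 near zero shrink the step size, values > 1 at the outer
# levels grow it. (The real G.726 tables differ.)
M = [0.9, 0.9, 1.25, 1.75]

def jayant_adapt(diffs, A=1.0, B=2**-6):
    """Toy backward-adaptive scaling in the log domain.
    A near 1 treats the input as speech, so the fast (unlocked)
    scale factor dominates; B slowly drags the locked part along."""
    beta_u = beta_l = 0.0            # log2 of unlocked/locked scale factors
    levels = []
    for e in diffs:
        beta = A * beta_u + (1 - A) * beta_l      # Eq. (13.4): combine pieces
        alpha = 2.0 ** beta                       # current divisor
        k = min(len(M) - 1, int(abs(e) / alpha))  # crude magnitude-level decision
        levels.append(k)
        beta_u = math.log2(M[k]) + beta_u         # Eq. (13.5): fast update
        beta_l = (1 - B) * beta_l + B * beta_u    # Eq. (13.6): slow update
    return levels, 2.0 ** beta_u

levels, final_alpha = jayant_adapt([0.2, 3.0, 5.0, 0.1, 0.05])
```

Note how a run of large differences drives the scale factor up, so that subsequent values fall back toward the inner levels — exactly the self-correcting behavior described above, using only information the decoder also has.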

13.3 VOCODERS

The coders (encoding/decoding algorithms) we have studied so far could have been applied to any signals, not just speech. Vocoders are specifically voice coders. As such, they cannot be usefully applied when other analog signals, such as modem signals, are in use.

Vocoders are concerned with modeling speech, so that the salient features are captured in as few bits as possible. They use either a model of the speech waveform in time (Linear Predictive Coding (LPC) vocoding), or else break down the signal into frequency components and model these (channel vocoders and formant vocoders).

Incidentally, we likely all know that vocoder simulation of the voice is not wonderful yet — when the library calls you with your overdue notification, the automated voice is strangely lacking in zest.

13.3.1 Phase Insensitivity

Recall from Section 8.5 that we can break down a signal into its constituent frequencies by analyzing it using some variant of Fourier analysis. In principle, we can also reconstitute the signal from the frequency coefficients developed that way. But it turns out that a complete reconstituting of the speech waveform is unnecessary, perceptually: all that is needed is for the amount of energy at any time to be about right, and the signal will sound about right.

"Phase" is a shift in the time argument, inside a function of time. Suppose we strike a piano key and generate a roughly sinusoidal sound cos(ωt), with ω = 2πf. If we wait sufficient time to generate a phase shift π/2 and then strike another key, with sound cos(2ωt + π/2), we generate a waveform like the solid line in Figure 13.3. This waveform is the sum cos(ωt) + cos(2ωt + π/2).

If we did not wait before striking the second note (1/4 msec, in Figure 13.3), our waveform would be cos(ωt) + cos(2ωt). But perceptually, the two notes would sound the same, even though in actuality they would be shifted in phase.

FIGURE 13.3: The solid line shows the superposition of two cosines, with a phase shift. The dashed line shows the same with no phase shift. The wave is very different, yet the sound is the same, perceptually.

Hence, if we can get the energy spectrum right — where we hear loudness and quiet — then we don't really have to worry about the exact waveform.

13.3.2 Channel Vocoder

Subband filtering is the process of applying a bank of band-pass filters to the analog signal,

Then the set of two signals is transmitted at 48 kbps for the low frequencies, where we can hear discrepancies well, and at only 16 kbps for the high frequencies.

Vocoders can operate at low bitrates, 1–2 kbps. To do so, a channel vocoder first applies a filter bank to separate out the different frequency components, as in Figure 13.4. However, as we saw above, only the energy is important, so first the waveform is "rectified" to its absolute value. The filter bank derives relative power levels for each frequency range. A subband coder would not rectify the signal and would use wider frequency bands.

A channel vocoder also analyzes the signal to determine the general pitch of the speech — low (bass), or high (tenor) — and also the excitation of the speech. Speech excitation is mainly concerned with whether a sound is voiced or unvoiced. A sound is unvoiced if its signal simply looks like noise: the sounds s and f are unvoiced. Sounds such as the vowels a, e, and o are voiced, and their waveform looks periodic. The o at the end of the word "audio" in Figure 13.1 is fairly periodic. During a vowel sound, air is forced through the vocal cords in a stream of regular, short puffs, occurring at the rate of 75–150 pulses per second for men and 150–250 per second for women.

Consonants can be voiced or unvoiced. For the nasal sounds of the letters m and n, the vocal cords vibrate, and air is exhaled through the nose rather than the mouth. These consonants are therefore voiced. The sounds b, d, and g, in which the mouth starts closed but then opens to the following vowel over a transition lasting a few milliseconds, are also voiced. The energy of voiced consonants is greater than that of unvoiced consonants but less than that of vowel sounds. Examples of unvoiced consonants include the sounds sh, th, and h when used at the front of a word.
Lhus acLually carrying out Lhe frequency decomposition indicaLed ia a Fourier analysis.
Subband coding is Lhe process of making use of the informaLion derived from this fihtering A channel vocoder applies a vocai-Lract transfer model Lo geaerate a vecLor of exciLation
Lo achieve beLter compression. parameLers thaL describe a model of Lhe sound. The vocoder also guesses whether lhe sound
For example, an older ITU recommendaLion, 0.722, uses subband filtering of analog is voiced or unvoiced and, for voiced sounds, esLimales the period (i.e., lhe sound’s piLch).
signals inLojust two bands: voice frequencies ia 50Hz lo 3.5 kHz and 3.5 kl-iz Lo 7kHz. Figure 13.4 shows lhaL the decoder also applies a vocal-tracl model.
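The phase-insensitivity claim of Section 13.3.1 can be sketched numerically: two superpositions of cosines that differ only in the phase of one component have essentially identical magnitude spectra, even though their waveforms clearly differ. The sampling rate, fundamental frequency, and duration below are illustrative choices, not values from the text.

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)          # 100 msec, 800 samples
w = 2 * np.pi * 440.0                    # 440 Hz fundamental

x1 = np.cos(w * t) + np.cos(2 * w * t)              # both partials in phase
x2 = np.cos(w * t) + np.cos(2 * w * t + np.pi / 2)  # second partial phase-shifted

X1 = np.abs(np.fft.rfft(x1))             # magnitude spectra
X2 = np.abs(np.fft.rfft(x2))

# The waveforms differ visibly, but the energy at each frequency is about
# the same, so perceptually the two signals sound alike.
waveform_diff = np.max(np.abs(x1 - x2))
spectrum_diff = np.max(np.abs(X1 - X2)) / np.max(X1)
```

The durations are chosen so each cosine completes a whole number of periods, which keeps the FFT comparison free of windowing effects.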
380 Chapter 13 Basic Audio Compression Techniques Section 13.3 Vococlers 381

FIGURE 13.4: Channel vocoder.

Because voiced sounds can be approximated by sinusoids, a periodic pulse generator recreates voiced sounds. Since unvoiced sounds are noise-like, a pseudo-noise generator is applied, and all values are scaled by the energy estimates given by the band-pass filter set. A channel vocoder can achieve an intelligible but synthetic voice using 2,400 bps.

13.3.3 Formant Vocoder

It turns out that not all frequencies present in speech are equally represented. Instead, only certain frequencies show up strongly, and others are weak. This is a direct consequence of how speech sounds are formed, by resonance in only a few chambers of the mouth, throat, and nose. The important frequency peaks are called formants [2].

Figure 13.5 shows how this appears: only a few, usually just four or so, peaks of energy at certain frequencies are present. However, just where the peaks occur changes in time, as speech continues. For example, two different vowel sounds would activate different sets of formants — this reflects the different vocal tract configurations necessary to form each vowel. Usually, a small segment of speech is analyzed, say 10–40 msec, and formants are found. A Formant Vocoder works by encoding only the most important frequencies. Formant vocoders can produce reasonably intelligible speech at only 1,000 bps.

FIGURE 13.5: Formants are the salient frequency components present in a sample of speech. Here, the solid line shows frequencies present in the first 40 msec of the speech sample in Figure 6.15. The dashed line shows that while similar frequencies are still present one second later, they have shifted.

13.3.4 Linear Predictive Coding

LPC vocoders extract salient features of speech directly from the waveform rather than transforming the signal to the frequency domain. LPC coding uses a time-varying model of vocal-tract sound generated from a given excitation. What is transmitted is a set of parameters modeling the shape and excitation of the vocal tract, not actual signals or differences.

Since what is sent is an analysis of the sound rather than the sound itself, the bitrate using LPC can be small. This is like using a simple descriptor such as MIDI to generate music: we send just the description parameters and let the sound generator do its best to create appropriate music. The difference is that as well as pitch, duration, and loudness variables, here we also send vocal tract excitation parameters.

After a block of digitized samples, called a segment or frame, is analyzed, the speech signal generated by the output vocal-tract model is calculated as a function of the current speech output plus a second term linear in previous model coefficients. This is how "linear" in the coder's name arises. The model is adaptive — the encoder side sends a new set of coefficients for each new segment.

The typical number of sets of previous coefficients used is N = 10 (the "model order" is 10), and such an LPC-10 [3] system typically uses a rate of 2.4 kbps. The model coefficients a_i act as predictor coefficients, multiplying previous speech output sample values.

LPC starts by deciding whether the current segment is voiced or unvoiced. For unvoiced speech, a wide-band noise generator is used to create sample values f(n) that act as input to the vocal tract simulator. For voiced speech, a pulse-train generator creates values f(n). Model parameters a_i are calculated by using a least-squares set of equations that minimize the difference between the actual speech and the speech generated by the vocal-tract model, excited by the noise or pulse-train generators that capture speech parameters.

If the output values generated are denoted s(n), then for input values f(n), the output depends on p previous output sample values, via

    s(n) = Σ_{i=1}^{p} a_i s(n − i) + G f(n)    (13.7)

Here, G is known as the gain factor. Note that the coefficients a_i act as values in a linear predictor model. The pseudo-noise generator and pulse generator are as discussed above and depicted in Figure 13.4 in regard to the channel vocoder.
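The synthesis equation (13.7) is straightforward to sketch in code. The order-2 coefficients and the pulse-train excitation below are toy values for illustration only; a real LPC-10 coder uses p = 10 with coefficients sent by the encoder each frame.

```python
import numpy as np

def lpc_synthesize(a, G, f):
    """Run excitation f(n) through the all-pole LPC synthesis filter:
    s(n) = sum_{i=1..p} a[i] * s(n - 1 - i) + G * f(n), per Eq. (13.7)."""
    p = len(a)
    s = np.zeros(len(f))
    for n in range(len(f)):
        # prediction from up to p previous *output* samples
        pred = sum(a[i] * s[n - 1 - i] for i in range(min(p, n)))
        s[n] = pred + G * f[n]
    return s

f = np.zeros(200)
f[::40] = 1.0                              # "voiced" pulse train, pitch period 40
s = lpc_synthesize(a=[0.5, -0.25], G=1.0, f=f)
```

For unvoiced segments the same filter would be driven by pseudo-noise instead of the pulse train.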

The speech encoder works in a blockwise fashion. The input digital speech signal is analyzed in some small, fixed-length segments, called speech frames. For the LPC speech coder, the frame length is usually selected as 22.5 msec, which corresponds to 180 samples for 8 kHz sampled digital speech. The speech encoder analyzes the speech frames to obtain the parameters such as LP coefficients a_i, i = 1 ... p, gain G, pitch P, and voiced/unvoiced decision U/V.

To calculate LP coefficients, we can solve the following minimization problem:

    min E{[s(n) − Σ_{i=1}^{p} a_i s(n − i)]²}    (13.8)

By taking the derivative with respect to a_i and setting it to zero, we get a set of p equations:

    E{[s(n) − Σ_{j=1}^{p} a_j s(n − j)] s(n − i)} = 0,    i = 1 ... p    (13.9)

Letting φ(i, j) = E{s(n − i) s(n − j)}, we have

    ⎡ φ(1,1) φ(1,2) ... φ(1,p) ⎤ ⎡ a_1 ⎤   ⎡ φ(0,1) ⎤
    ⎢ φ(2,1) φ(2,2) ... φ(2,p) ⎥ ⎢ a_2 ⎥ = ⎢ φ(0,2) ⎥    (13.10)
    ⎢   ...                    ⎥ ⎢ ... ⎥   ⎢  ...   ⎥
    ⎣ φ(p,1) φ(p,2) ... φ(p,p) ⎦ ⎣ a_p ⎦   ⎣ φ(0,p) ⎦

The autocorrelation method is often used to calculate LP coefficients, where

    φ(i, j) = Σ_n s_w(n − i) s_w(n − j) / Σ_n s_w²(n),    i = 0 ... p,  j = 1 ... p    (13.11)

s_w(n) = s(n + m) w(n) is the windowed speech frame starting from time m. Since φ(i, j) is determined only by |i − j|, we define φ(i, j) = R(|i − j|). Since we also have R(0) ≥ 0, the matrix {φ(i, j)} is positive symmetric, and thus a fast scheme to calculate the LP coefficients is as follows.

PROCEDURE 13.1 LPC Coefficients

    E(0) = R(0), i = 1
    while i ≤ p
        k_i = [R(i) − Σ_{j=1}^{i−1} a_j^{(i−1)} R(i − j)] / E(i − 1)
        a_i^{(i)} = k_i
        for j = 1 to i − 1
            a_j^{(i)} = a_j^{(i−1)} − k_i a_{i−j}^{(i−1)}
        E(i) = (1 − k_i²) E(i − 1)
        i ← i + 1
    for j = 1 to p
        a_j = a_j^{(p)}

After getting the LP coefficients, gain G can be calculated as

    G = E{[s(n) − Σ_{j=1}^{p} a_j s(n − j)]²}
      = E{[s(n) − Σ_{j=1}^{p} a_j s(n − j)] s(n)}    (13.12)
      = φ(0,0) − Σ_{j=1}^{p} a_j φ(0, j)

For the autocorrelation scheme, G = R(0) − Σ_{j=1}^{p} a_j R(j). Order-10 LP analysis is found to be enough for speech coding applications.

The pitch P of the current speech frame can be extracted by the correlation method by finding the index of the peak of

    v(l) = Σ_{n=m}^{N−1+m} s(n) s(n − l) / [Σ_{n=m}^{N−1+m} s²(n) · Σ_{n=m}^{N−1+m} s²(n − l)]^{1/2},    l ∈ [P_min, P_max]    (13.13)

The searching range [P_min, P_max] is often selected as [12, 140] for 8 kHz sampling speech. Denote P as the peak lag. If v(P) is less than some given threshold, the current frame is classified as an unvoiced frame and will be reconstructed in the receiving end by stimulating with a white-noise sequence. Otherwise, the frame is determined as voiced and stimulated with a periodic waveform at the reconstruction stage. In practical LPC speech coders, the pitch estimation and U/V decision procedure are usually based on a dynamic programming scheme, so as to correct the often occurring errors of pitch doubling or halving in the single frame scheme.

In LPC-10, each segment is 180 samples, or 22.5 msec at 8 kHz. The speech parameters transmitted are the coefficients a_k; G, the gain factor; a voiced/unvoiced flag (1 bit); and the pitch period if the speech is voiced.

13.3.5 CELP

CELP, Code Excited Linear Prediction (sometimes Codebook Excited), is a more complex family of coders that attempts to mitigate the lack of quality of the simple LPC model by using a more complex description of the excitation. An entire set (a codebook) of excitation vectors is matched to the actual speech, and the index of the best match is sent to the receiver. This complexity increases the bitrate to 4,800–9,600 bps, typically.

In CELP, since all speech segments make use of the same set of templates from the template codebook, the resulting speech is perceived as much more natural than the two-mode excitation scheme in the LPC-10 coder. The quality achieved is considered sufficient for audio conferencing.

A low bitrate is required for conferencing, but the perceived quality of the speech must still be of an acceptable standard.
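The fast recursion of Procedure 13.1 (the Levinson-Durbin algorithm) can be sketched directly in code. The autocorrelation values below are toy numbers, not from real speech; the result is cross-checked against a direct solve of the normal equations (13.10).

```python
import numpy as np

def levinson_durbin(R, p):
    """Solve the order-p Toeplitz normal equations for the LP coefficients
    a_1..a_p from autocorrelation values R(0..p), per Procedure 13.1."""
    a = np.zeros(p + 1)              # a[1..p] hold the predictor coefficients
    E = R[0]                         # prediction-error energy, E(0) = R(0)
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        E *= (1.0 - k * k)           # E(i) = (1 - k_i^2) E(i - 1)
    return a[1:], E

R = np.array([1.0, 0.8, 0.5, 0.2])   # toy autocorrelation sequence
a, E = levinson_durbin(R, p=3)

# Direct solve of the Toeplitz system (13.10) for comparison.
T = np.array([[R[abs(i - j)] for j in range(3)] for i in range(3)])
a_direct = np.linalg.solve(T, R[1:4])
```

The recursion needs only O(p²) operations versus O(p³) for the direct solve, which is why it is preferred in real-time coders.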

In CELP coders two kinds of prediction, Long Time Prediction (LTP) and Short Time Prediction (STP), are used to eliminate the redundancy in speech signals. STP is an analysis of samples: it attempts to predict the next sample from several previous ones. Here, redundancy is due to the fact that usually one sample will not change drastically from the next. LTP is based on the idea that in a segment of speech, or perhaps from segment to segment, especially for voiced sounds, a basic periodicity or pitch will cause a waveform that more or less repeats. We can reduce this redundancy by finding the pitch.

For concreteness, suppose we sample at 8,000 samples/sec and use a 10 msec frame, containing 80 samples. Then we can roughly expect a pitch that corresponds to an approximately repeating pattern every 12 to 140 samples or so. (Notice that the pitch may actually be longer than the chosen frame size.)

STP is based on a short-time LPC analysis, discussed in the last section. It is "short-time" in that the prediction involves only a few samples, not a whole frame or several frames. STP is also based on minimizing the residue error over the whole speech frame, but it captures the correlation over just a short range of samples (10 for order-10 LPC).

After STP, we can subtract signal minus prediction to arrive at a differential coding situation. However, even in a set of errors e(n), the basic pitch of the sequence may still remain. This is estimated by means of LTP. That is, LTP is used to further eliminate the periodic redundancy inherent in the voiced speech signals. Essentially, STP captures the formant structure of the short-term speech spectrum, while LTP recovers the long-term correlation in the speech signal that represents the periodicity in speech.

Thus there are always two stages — and the order is in fact usually STP followed by LTP, since we always start off assuming zero error and then remove the pitch component. (If we use a closed-loop scheme, STP usually is done first.) LTP proceeds using whole frames or, more often, subframes equal to one quarter of a frame. Figure 13.6 shows these two stages.

FIGURE 13.6: CELP analysis model with adaptive and stochastic codebooks.

LTP is often implemented as adaptive codebook searching. The "codeword" in the adaptive codebook is a shifted speech residue segment indexed by the lag τ corresponding to the current speech frame or subframe. The idea is to look in a codebook of waveforms to find one that matches the current subframe. We generally look in the codebook using a normalized subframe of speech, so as well as a speech segment match, we also obtain a scaling value (the gain). The gain corresponding to the codeword is denoted as g_0.

There are two types of codeword searching: open-loop and closed-loop. Open-loop adaptive codebook searching tries to minimize the long-term prediction error but not the perceptual weighted reconstructed speech error,

    E(τ) = Σ_{n=0}^{N−1} [s(n) − g_0 s(n − τ)]²    (13.14)

By setting the partial derivative with respect to g_0 to zero, ∂E(τ)/∂g_0 = 0, we get

    g_0 = Σ_{n=0}^{N−1} s(n) s(n − τ) / Σ_{n=0}^{N−1} s²(n − τ)    (13.15)

and hence a minimum summed-error value

    E_min(τ) = Σ_{n=0}^{N−1} s²(n) − [Σ_{n=0}^{N−1} s(n) s(n − τ)]² / Σ_{n=0}^{N−1} s²(n − τ)    (13.16)

Notice that the sample s(n − τ) could be in the previous frame.

Now, to obtain the optimum adaptive codebook index τ, we can carry out a search exclusively in a small range determined by the pitch period.

More often, CELP coders use a closed-loop search. Rather than simply considering sum-of-squares, speech is reconstructed, with perceptual error minimized via an adaptive codebook search. So in a closed-loop, adaptive codebook search, the best candidate in the adaptive codebook is selected to minimize the distortion of locally reconstructed speech. Parameters are found by minimizing a measure (usually the mean square) of the difference between the original and the reconstructed speech. Since this means that we are simultaneously incorporating synthesis as well as analysis of the speech segment, this method is also called analysis-by-synthesis, or A-B-S.

The residue signal after STP based on LPC analysis and LTP based on adaptive codeword searching is like white noise and is encoded by codeword matching in the stochastic (random or probabilistic) codebook.

This kind of sequential optimization of the adaptive codeword and stochastic codeword methods is used because jointly optimizing the adaptive and stochastic codewords is often too complex to meet real-time demands.

The decoding direction is just the reverse of the above process and works by combining the contribution from the two types of excitations.

DOD 4.8 KBPS CELP (FS1016)*. DOD 4.8 kbps CELP [4] is an early CELP coder adopted as a U.S. federal standard to update the 2.4 kbps LPC-10e (FS1015) vocoder. This vocoder is now a basic benchmark to test other low-bitrate vocoders. FS1016 uses an 8 kHz sampling rate and 30 msec frame size. Each frame is further split into four 7.5 msec subframes. In FS1016, STP is based on an open-loop order-10 LPC analysis.

To improve coding efficiency, a fairly sophisticated type of transform coding is carried out. Then, quantization and compression are done in terms of the transform coefficients.

First, in this field it is common to use the z-transform. Here, z is a complex number and represents a kind of complex "frequency". If z = e^{−2πiω}, then the discrete z-transform reduces to a discrete Fourier transform. The z-transform makes Fourier transforms look like polynomials. Now we can write the error in a prediction equation

    e(n) = s(n) − Σ_{i=1}^{p} a_i s(n − i)    (13.17)

in the z domain as

    E(z) = A(z) S(z)    (13.18)

where E(z) is the z-transform of the error and S(z) is the transform of the signal. The term A(z) is the transfer function in the z domain, and equals

    A(z) = 1 − Σ_{i=1}^{p} a_i z^{−i}    (13.19)

with the same coefficients a_i as appear in Equation (13.7). How speech is reconstructed, then, is via

    S(z) = E(z)/A(z)    (13.20)

with the estimated error. For this reason, A(z) is usually stated in terms of 1/A(z).

The idea of going to the z-transform domain is to convert the LP coefficients to Line Spectrum Pair (LSP) coefficients, which are given in this domain. The reason is that the LSP space has several good properties with respect to quantization. LSP representation has become standard and has been applied to nearly all the recent LPC-based speech coders, such as G.723.1, G.729, and MELP. To get LSP coefficients, we construct two polynomials

    P(z) = A(z) + z^{−(p+1)} A(z^{−1})
    Q(z) = A(z) − z^{−(p+1)} A(z^{−1})    (13.21)

where p is the order of the LPC analysis and A(z) is the transform function of the LP filter, with z the transform domain variable. The z-transform is just like the Fourier transform but with a complex "frequency."

The roots of these two polynomials are spaced around the unit circle in the z plane and have mirror symmetry with respect to the x-axis. Assume p is even and denote the phase angles of the roots of P(z) and Q(z) above the x axis as θ_1 < θ_2 < ... < θ_{p/2} and φ_1 < φ_2 < ... < φ_{p/2}, respectively. Then the vector (cos θ_1, cos φ_1, cos θ_2, cos φ_2, ..., cos θ_{p/2}, cos φ_{p/2}) is the LSP coefficient vector, and the vector (θ_1, φ_1, θ_2, φ_2, ..., θ_{p/2}, φ_{p/2}) is usually called the Line Spectrum Frequency, or LSF. Based on the relationship A(z) = [P(z) + Q(z)]/2, we can reconstruct the LP coefficients at the decoder end from the LSP or LSF coefficients.

Adaptive codebook searching in FS1016 is via a closed-loop search based on perceptually weighted errors. As opposed to considering just the mean squared error, here errors are weighted so as to take human perception into account. In terms of the z-transform, it is found that the following multiplier does a good job:

    A(z)/A(z/γ) = [1 − Σ_{i=1}^{p} a_i z^{−i}] / [1 − Σ_{i=1}^{p} a_i γ^i z^{−i}],    0 < γ < 1    (13.22)

with a constant parameter γ.

The adaptive codebook has 256 codewords for 128 integer delays and 128 noninteger delays (with half-sample interval, for better resolution), the former ranging from 20 to 147. To reduce searching complexity, even subframes are searched in an interval relative to the previous odd subframe, and the difference is coded with 6 bits. The gain is nonuniformly scalar coded between −1 and 2 with 5 bits.

Stochastic codebook search is applied for each of the four subframes. The stochastic codebook of FS1016 is generated by clipping a unit-variance Gaussian random sequence to within a threshold of absolute value 1.2 and quantizing to three values −1, 0, and 1. The stochastic codebook has 512 codewords. The codewords are overlapped, and each is shifted by 2 with respect to the previous codeword. This kind of stochastic design is called an Algebraic Codebook. It has many variations and is widely applied in recent CELP coders.

Denoting the excitation vector as u^{(m)}, the periodic component obtained in the first stage is u^{(0)}, and v^{(m)} is the stochastic component search result in the second stage. In closed-loop searching, the reconstructed speech can be represented as

    ŝ = ŝ_0 + (u + v^{(m)}) H    (13.23)

where u is equal to zero at the first stage and u^{(0)} at the second stage, and ŝ_0 is the zero-input response of the LPC reconstructing filter. Matrix H is the truncated LPC reconstructing filter unit impulse-response matrix

        ⎡ h_0 h_1 h_2 ... h_{L−1} ⎤
        ⎢  0  h_0 h_1 ... h_{L−2} ⎥
    H = ⎢        ...              ⎥    (13.24)
        ⎣  0   0   0  ...  h_0    ⎦

where L is the length of the subframe (this simply represents a convolution).
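The LSP construction of Eq. (13.21) can be sketched numerically: build P(z) and Q(z) from A(z) and read the LSF angles off their unit-circle roots. The order-2 coefficients below are toy values (a real coder uses p = 10), and numpy root-finding is chosen for clarity rather than the efficient search methods production coders use.

```python
import numpy as np

def lsf_from_lp(a):
    """a = [a_1..a_p] with A(z) = 1 - sum_i a_i z^{-i}; returns sorted LSF angles."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))  # A(z), powers z^0..z^-p
    Arev = A[::-1]                      # coefficients of z^{-(p+1)} A(z^{-1})
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], Arev))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], Arev))
    eps = 1e-6
    angles = []
    for poly in (P, Q):                 # for stable A, roots lie on the unit circle
        for r in np.roots(poly):
            th = np.angle(r)
            if eps < th < np.pi - eps:  # keep one of each conjugate pair,
                angles.append(th)       # dropping the trivial roots near +1 and -1
    return np.sort(np.array(angles))

lsf = lsf_from_lp([0.5, -0.25])
```

For p = 2 this yields the two angles whose cosines form the LSP coefficient vector; the decoder can rebuild A(z) from them via A(z) = [P(z) + Q(z)]/2.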

Similarly, defining W as the unit response matrix of the perceptual weighting filter, the perceptually weighted error of reconstructed speech is

    e = (s − ŝ) W = e_0 − v^{(m)} H W    (13.25)

where e_0 = (s − ŝ_0) W − u^{(0)} H W. The codebook searching process is to find a codeword v^{(m)} in the codebook and corresponding α^{(m)} such that u^{(m)} = α^{(m)} v^{(m)} and e eᵀ is minimized. To make the problem tractable, adaptive and stochastic codebooks are searched sequentially. Denoting a quantized version by α̃^{(m)} = Q[α̂^{(m)}], the criterion of codeword searching in the adaptive codebook or stochastic codebook is to minimize e eᵀ over all v^{(m)} in terms of an expression in α̃^{(m)}, e_0, and v^{(m)}.

The decoder of the CELP codec is a reverse process of the encoder. Because of the unsymmetrical complexity property of vector quantization, the complexity in the decoder side is usually much lower.

G.723.1*. G.723.1 [5] is an ITU standard aimed at multimedia communication. It has been incorporated into H.324 for audio encoding in videoconference applications. G.723.1 is a dual-rate CELP-type speech coder that can work at bitrates of 5.3 kbps and 6.3 kbps.

G.723.1 uses many techniques similar to FS1016, discussed in the last section. The input speech is again 8 kHz, sampled in 16-bit linear PCM format. The speech frame size is also 30 msec and is further divided into four equal-sized subframes. Order-10 LPC coefficients are estimated in each subframe. LP coefficients are further converted to LSP vectors and quantized by predictive splitting VQ. LP coefficients are also used to form the perceptually weighted filter.

G.723.1 first uses an open-loop pitch estimator to get a coarse pitch estimation in a time interval of every two subframes. Closed-loop pitch searching is done in every speech subframe by searching the data in a range of the open-loop pitch. After LP filtering and removing the harmonic components by LTP, the stochastic residue is quantized by Multipulse Maximum Likelihood Quantization (MP-MLQ) for the 5.3 kbps coder or Algebraic Code-Excited Linear Prediction (ACELP) for the 6.3 kbps coder, which has a slightly higher speech quality. These two modes can be switched at any boundary of the 30 msec speech frames.

In MP-MLQ, the contribution of the stochastic component is represented as a sequence of pulses

    u(n) = Σ_{i=1}^{M} g_i δ(n − m_i)    (13.26)

where M is the number of pulses and g_i is the gain of the pulse at position m_i. The closed-loop search is done by minimizing

    e(n) = r(n) − Σ_{i=1}^{M} g_i h(n − m_i)    (13.27)

where r(n) is the speech component after perceptual weighting and eliminating the zero-response component and periodic component contributions. Based on methods similar to those presented in the last section, we can sequentially optimize the gain and position for each pulse. Say we first assume there is only one pulse and find the best gain and position. After removing the contribution from this pulse, we can get the next optimal pulse based on the same method. This process is done recursively until we get all M pulses.

The stochastic codebook structure for the ACELP model is different from FS1016. The following table shows the ACELP excitation codebook:

    Sign    Positions
    ±1      0,  8, 16, 24, 32, 40, 48, 56
    ±1      2, 10, 18, 26, 34, 42, 50, 58        (13.28)
    ±1      4, 12, 20, 28, 36, 44, 52, (60)
    ±1      6, 14, 22, 30, 38, 46, 54, (62)

There are only four pulses. Each can be in eight positions, coded by three bits each. Also, the sign of the pulse takes one bit, and another bit is to shift all possible positions to odd. Thus, the index of a codeword has 17 bits. Because of the special structure of the algebraic codebook, a fast algorithm exists for efficient codeword searching.

Besides the CELP coders we discussed above, there are many other CELP-type codecs, developed mainly for wireless communication systems. The basic concepts of these coders are similar, except for different implementation details on parameter analysis and codebook structuring.

Some examples include the 12.2 kbps GSM Enhanced Full Rate (EFR) algebraic CELP codec [6] and IS-641 EFR [7], designed for the North American digital cellular IS-136 TDMA system. G.728 [8] is a low-delay CELP speech coder. G.729 [9] is another CELP-based ITU standard aimed at toll-quality speech communications.

G.729 is a Conjugate-Structure Algebraic-Code-Excited-Linear-Prediction (CS-ACELP) codec. G.729 uses a 10 msec speech analysis frame and thus has lower delay than G.723.1, which uses a 30 msec speech frame. G.729 also has some inherent protection schemes to deal with packet loss in applications such as VoIP.

13.3.6 Hybrid Excitation Vocoders*

Hybrid Excitation Vocoders are another large class of speech coders. They are different from CELP, in which the excitation is represented as the contributions of the adaptive and stochastic codewords. Instead, hybrid excitation coders use model-based methods to introduce multi-model excitation.

MBE. The Multi-Band Excitation (MBE) [10] vocoder was developed by MIT's Lincoln Laboratory. The 4.15 kbps IMBE codec [11] has become the standard for INMARSAT. MBE is also a blockwise codec, in which a speech analysis is done in a speech frame unit of about 20 to 30 msec. In the analysis part of the MBE coder, a spectrum analysis such as FFT is first applied for the windowed speech in the current frame. The short-time speech spectrum is further divided into different spectrum bands.

The bandwidth is usually an integer times the basic frequency that equals the inverse of the pitch. Each band is described as "voiced" or "unvoiced".

The parameters of the MBE coder thus include the spectrum envelope, pitch, and unvoiced/voiced (U/V) decisions for different bands. Based on different bitrate demands, the phase of the spectrum can be parameterized or discarded. In the speech decoding process, voiced bands and unvoiced bands are synthesized by different schemes and combined to generate the final output.

MBE utilizes the analysis-by-synthesis scheme in parameter estimation. Parameters such as basic frequency, spectrum envelope, and subband U/V decisions are all done via closed-loop searching. The criteria of the closed-loop optimization are based on minimizing the perceptually weighted reconstructed speech error, which can be represented in the frequency domain as

    ε = (1/2π) ∫_{−π}^{π} G(ω) |S_w(ω) − S_wr(ω)| dω    (13.29)

where S_w(ω) and S_wr(ω) are the original speech short-time spectrum and reconstructed speech short-time spectrum, and G(ω) is the spectrum of the perceptual weighting filter.

Similar to the closed-loop searching scheme in CELP, a sequential optimization method is used to make the problem tractable. In the first step, all bands are assumed voiced bands, and the spectrum envelope and basic frequency are estimated. Rewriting the spectrum error with the all-voiced assumption, we have

    ε̂ = Σ_{m=−M}^{M} [(1/2π) ∫_{α_m}^{β_m} G(ω) |S_w(ω) − A_m E_wr(ω)|² dω]    (13.30)

in which M is the band number in [0, π], A_m is the spectrum envelope of band m, E_wr(ω) is the short-time window spectrum, and α_m = (m − 1/2)ω_0, β_m = (m + 1/2)ω_0. Setting ∂ε̂/∂A_m = 0, we get

    A_m = ∫_{α_m}^{β_m} G(ω) S_w(ω) E_wr(ω) dω / ∫_{α_m}^{β_m} G(ω) |E_wr(ω)|² dω    (13.31)

The basic frequency is obtained at the same time by searching over a frequency interval to minimize ε̂. Based on the estimated spectrum envelope, an adaptive thresholding scheme tests the matching degree for each band. We label a band as voiced if there is a good matching; otherwise, we declare the band as unvoiced and re-estimate the envelope for the unvoiced band as

    A_m = ⋯    (13.32)

The decoder uses separate methods to synthesize unvoiced and voiced speech, based on the unvoiced and voiced bands. The two types of reconstructed components are then combined to generate synthesized speech. The final step is overlapping the sum of the synthesized speech in each frame to get the final output.

Multiband Excitation Linear Predictive (MELP). The MELP speech codec is a new U.S. federal standard to replace the old LPC-10 (FS1015) standard, with the application focus on low-bitrate safety communications. At 2.4 kbps, MELP [12] has comparable speech quality to the 4.8 kbps DOD-CELP (FS1016) and good robustness in a noisy environment.

MELP is also based on LPC analysis. Different from the hard-decision voiced/unvoiced model adopted in LPC-10, MELP uses a multiband soft-decision model for the excitation signal. The LP residue is band-passed, and a voicing strength parameter is estimated for each band. The decoder can reconstruct the excitation signal by combining the periodic pulses and white noises, based on the voicing strength in each band. Speech can then be reconstructed by passing the excitation through the LPC synthesis filter.

Different from MBE, MELP divides the excitation into five fixed bands of 0–500, 500–1000, 1000–2000, 2000–3000, and 3000–4000 Hz. It estimates a voice degree parameter in each band based on the normalized correlation function of the speech signal and the smoothed, rectified signal in the non-DC band. Let s_k(n) denote the speech signal in band k, and u_k(n) denote the DC-removed smoothed rectified signal of s_k(n). The correlation function is defined as

    R_x(P) = Σ_{n=0}^{N−1} x(n) x(n + P) / [Σ_{n=0}^{N−1} x²(n) · Σ_{n=0}^{N−1} x²(n + P)]^{1/2}    (13.33)

where P is the pitch of the current frame, and N is the frame length. Then the voicing strength for band k is defined as max(R_{s_k}(P), R_{u_k}(P)).

To further remove the buzziness of traditional LPC-10 speech coders for the voiced speech segment, MELP adopts a jittery voiced state to simulate the marginal voiced speech segments. The jittery state is indicated by an aperiodic flag. If the aperiodic flag is set in the analysis end, the receiver adds a random shifting component to the periodic pulse excitation. The shifting can be as big as P/4. The jittery state is determined by the peakiness of the full-wave rectified LP residue e(n),

    peakiness = [(1/N) Σ_{n=0}^{N−1} e(n)²]^{1/2} / [(1/N) Σ_{n=0}^{N−1} |e(n)|]    (13.34)

If peakiness is greater than some threshold, the speech frame is determined as jittered.

To better reconstruct the short-time spectrum of the speech signal, the spectrum of the residue signal is not assumed to be flat, as it is in the LPC-10 speech coder. After normalizing the LP residue signal, MELP preserves the magnitudes corresponding to the first min(10, P/4) basic frequency harmonics. Basic frequency is the inverse of the pitch period. The higher harmonics are discarded and assumed to be unity spectrum.

The 10-d magnitude vector is quantized by 8-bit vector quantization, using a perceptually weighted distance measure. Similar to most modern LPC quantization schemes, MELP also converts LPC parameters to LSF and uses four-stage vector quantization. The bits allocated for the four stages are 7, 6, 6, and 6, respectively. Apart from integral pitch estimation similar to LPC-10, MELP applies a fractional pitch refinement procedure to improve the accuracy of pitch estimation.

In the speech reconstruction process, MELP does not use a periodic pulse to represent the periodic excitation signal but uses a dispersed waveform.

finite impulse response (FIR) filter is applied to the pulses. MELP also applies a perceptual weighting filter as a post-filter to the reconstructed speech, so as to suppress the quantization noise and improve the subjective speech quality.

13.4 FURTHER EXPLORATION

A comprehensive introduction to speech coding can be found in Spanias's excellent article [13]. Sun Microsystems has made available the code for its implementation of standards G.711, G.721, and G.723, in C. The code can be found from the link in this section of the text web site, along with sample audio files. Since the code operates on raw 2-byte data, the link also gives simple MATLAB conversion code to read and write WAV and RAW data.

For audio, the court of final appeal is the ITU standards body itself. Standards promoted by such groups allow our modems to talk to each other, permit the development of mobile communications, and so on. The ITU, which sets standards, is linked to from the text web site.

More information on speech coding can be found in the speech FAQ file links. Links to LPC and CELP code are also included.

13.5 EXERCISES

1. In Section 13.3.1 we discuss phase insensitivity. Explain the meaning of the term "phase" in regard to individual frequency components in a composite signal.

2. Input a speech segment, using C or MATLAB, and verify that formants indeed exist — that any speech segment has only a few important frequencies. Also, verify that formants change as the interval of speech being examined changes.

   A simple approach to coding a frequency analyzer is to reuse the DCT coding ideas we have previously considered in Section 8.5. In one dimension, the DCT transform reads

   F(u) = sqrt(2/N) C(u) SUM_{i=0}^{N-1} cos[(2i+1) u pi / (2N)] f(i)    (13.35)

   where i, u = 0, 1, ..., N - 1, and the constants C(u) are given by

   C(u) = sqrt(2)/2   if u = 0
        = 1           otherwise    (13.36)

   If we use the speech sample in Figure 6.15, then taking the one-dimensional DCT of the first, or last, 40 msec (i.e., 32 samples), we arrive at the absolute frequency components as in Figure 13.5.

3. Write code to read a WAV file. You will need the following set of definitions: a WAV file begins with a 44-byte header, in unsigned byte format. Some important parameter information is coded as follows:

   Byte[22..23]  Number of channels
   Byte[24..27]  Sampling rate
   Byte[34..35]  Sampling bits
   Byte[40..43]  Data length

4. Write a program to add fade-in and fade-out effects to sound clips (in WAV format). Specifications for the fades are as follows: the algorithm assumes a linear envelope; the fade-in duration is from 0% to 20% of the data samples; the fade-out duration is from 80% to 100% of the data samples.

   If you like, you can make your code able to handle both mono and stereo WAV files. If necessary, impose a limit on the size of the input file, say 16 megabytes.

5. In the text, we study an adaptive quantization scheme for ADPCM. We can also use an adaptive prediction scheme. We consider the case of one-tap prediction, s^(n) = a · s(n - 1). Show how to estimate the parameter a in an open-loop method. Estimate the SNR gain you can get, compared to the direct PCM method based on a uniform quantization scheme.

6. Linear prediction analysis can be used to estimate the shape of the envelope of the short-time spectrum. Given ten LP coefficients a_1, ..., a_10, how do we get the formant position and bandwidth?

7. Download and implement a CELP coder (see the textbook web site). Try out this speech coder on your own recorded sounds.

8. In quantizing LSP vectors in G.723.1, split vector quantization is used: if the dimensionality of LSP is 10, we can split the vector into three subvectors of length 3, 3, and 4 each and use vector quantization for the subvectors separately. Compare the codebook space complexity with and without split vector quantization. Give the codebook searching time complexity improvement by using split vector quantization.

9. Discuss the advantage of using an algebraic codebook in CELP coding.

10. The LPC-10 speech coder's quality deteriorates rapidly with strong background noise. Discuss why MELP works better in the same noisy conditions.

11. Give a simple time-domain method for pitch estimation based on the autocorrelation function. What problem will this simple scheme have when based on one speech frame? If we have three speech frames, including a previous frame and a future frame, how can we improve the estimation result?

12. On the receiver side, speech is usually generated based on two frames' parameters instead of one, to avoid abrupt transitions. Give two possible methods to obtain smooth transitions. Use the LPC codec to illustrate your idea.

13.6 REFERENCES

1  N.S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall, 1984.
2  J.C. Bellamy, Digital Telephony, New York: Wiley, 2000.

3  Thomas E. Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC-10," Speech Technology, April 1982.

4  J.P. Campbell, Jr., T.E. Tremain, and V.C. Welch, "The DOD 4.8 kbps Standard (Proposed Federal Standard 1016)," in Advances in Speech Coding, Boston: Kluwer Academic Publishers, 1991.

5  Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, ITU-T Recommendation G.723.1, March 1996.

6  GSM Enhanced Full Rate Speech Transcoding (GSM 06.60), ETSI Standards Documentation, EN 301 245, 1998.

7  TDMA Cellular/PCS Radio Interface, Enhanced Full Rate Speech Codec, TIA/EIA/IS-641 standard, 1996.

8  Coding of Speech at 16 kbit/s Using Low-Delay Code Excited Linear Prediction, ITU-T Recommendation G.728, 1992.

9  Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP), ITU-T Recommendation G.729, 1996.

10  D.W. Griffin and J.S. Lim, "Multi-Band Excitation Vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(8): 1223-1235, 1988.

11  "Inmarsat [International Mobile Satellite]-M Voice Codec, v2," Inmarsat-M Specification, Feb. 1991.

12  A.V. McCree and T.P. Barnwell, "Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," IEEE Transactions on Speech and Audio Processing, 3(4): 242-250, July 1995.

13  A. Spanias, "Speech Coding: A Tutorial Review," Proceedings of the IEEE, 82: 1541-1582, 1994.

CHAPTER 14

MPEG Audio Compression

Have you ever attended a dance and found that for quite some time afterward you couldn't hear much? You were dealing with a type of temporal masking!

Have you ever noticed that the person on the sound board at a dance basically cannot hear high frequencies anymore? Since many technicians have such hearing damage, some compensate by increasing the volume levels of the high frequencies, so they can hear them. If your hearing is not damaged, you experience this music mix as too piercing.

Moreover, if a very loud tone is produced, you also notice it is impossible to hear any sound nearby in the frequency spectrum — the band's singing may be drowned out by the lead guitar. If you've noticed this, you have experienced frequency masking!

MPEG audio uses this kind of perception phenomenon by simply giving up on the tones that can't be heard anyway. Using a curve of human hearing perceptual sensitivity, an MPEG audio codec makes decisions on when and to what degree frequency masking and temporal masking make some components of the music inaudible. It then controls the quantization process so that these components do not influence the output.

So far, in the previous chapter, we have concentrated on telephony applications — usually, LPC and CELP are tuned to speech parameters. In contrast, in this chapter, we consider compression methods applicable to general audio, such as music or perhaps broadcast digital TV. Instead of modeling speech, the method used is a waveform coding approach — one that attempts to make the decompressed amplitude-versus-time waveform as much as possible like the input signal.

A main technique used in evaluating audio content for possible compression makes use of a psychoacoustic model of hearing. The kind of coding carried out, then, is generally referred to as perceptual coding.

In this chapter, we look at how such considerations impact MPEG audio compression standards and examine in some detail the following topics:

• Psychoacoustics

• MPEG-1 Audio Compression

• Later MPEG audio developments: MPEG-2, 4, 7, and 21

14.1 PSYCHOACOUSTICS

Recall that the range of human hearing is about 20 Hz to about 20 kHz (for people who have not gone to many dances). Sounds at higher frequencies are ultrasonic. However, the
frequency range of the voice is typically only from about 500 Hz to 4 kHz. The dynamic range, the ratio of the maximum sound amplitude to the quietest sound humans can hear, is on the order of about 120 dB.

Recall that the decibel unit represents ratios of intensity on a logarithmic scale. The reference point for 0 dB is the threshold of human hearing — the quietest sound we can hear, measured at 1 kHz. Technically, this is a sound that creates a barely audible sound intensity of 10^(-12) Watt per square meter. Our range of magnitude perception is thus incredibly wide: the level at which the sensation of sound begins to give way to the sensation of pain is about 1 Watt/m^2, so we can perceive a ratio of 10^12!

The range of hearing actually depends on frequency. At a frequency of 2 kHz, the ear can readily respond to sound that is about 96 dB more powerful than the smallest perceivable sound at that frequency, or in other words a power ratio of 2^32. Table 6.1 lists some of the common sound levels in decibels.

14.1.1 Equal-Loudness Relations

Suppose we play two pure tones, sinusoidal sound waves, with the same amplitude but different frequencies. Typically, one may sound louder than the other. The reason is that the ear does not hear low or high frequencies as well as frequencies in the middle range. In particular, at normal sound volume levels, the ear is most sensitive to frequencies between 1 kHz and 5 kHz.

Fletcher-Munson Curves. The Fletcher-Munson equal-loudness curves display the relationship between perceived loudness (in phons) for a given stimulus sound volume (Sound Pressure Level, in dB), as a function of frequency. Figure 14.1 shows the ear's perception of equal loudness. The abscissa (shown in a semi-log plot) is frequency, in kHz. The ordinate axis is sound pressure level — the actual loudness of the tone generated in an experiment. The curves show the loudness with which such tones are perceived by humans. The bottom curve shows what level of pure tone stimulus is required to produce the perception of a 10 dB sound.

FIGURE 14.1: Fletcher-Munson equal loudness response curves for the human ear (remeasured by Robinson and Dadson).

All the curves are arranged so that the perceived loudness level gives the same loudness as for that loudness level of a pure tone at 1 kHz. Thus, the loudness level at the 1 kHz point is always equal to the dB level on the ordinate axis. The bottom curve, for example, is for 10 phons. All the tones on this curve will be perceived as loud as a 10 dB, 1,000 Hz tone.

The figure shows more accurate curves, developed by Robinson and Dadson [1], than the Fletcher and Munson originals [2].

The idea is that a tone is produced at a certain frequency and measured loudness level, then a human rates the loudness as it is perceived. On the lowest curve shown, each pure tone between 20 Hz and 15 kHz would have to be produced at the volume level given by the ordinate for it to be perceived at a 10 dB loudness level [1]. The next curve shows what the magnitude would have to be for pure tones to each be perceived as being at 20 dB, and so on. The top curve is for perception at 90 dB.

For example, at 5,000 Hz, we perceive a tone to have a loudness level of 10 phons when the source is actually only 5 dB. Notice that at the dip at 4 kHz, we perceive the sound as being about 10 dB, when in fact the stimulation is only about 2 dB. To perceive the same effective 10 dB at 10 kHz, we would have to produce an absolute magnitude of 20 dB. The ear is clearly more sensitive in the range 2 kHz to 5 kHz and not nearly as sensitive in the range 6 kHz and above.

At the lower frequencies, if the source is at level 10 dB, a 1 kHz tone would also sound at 10 dB; however, a lower, 100 Hz tone must be at a level 30 dB — 20 dB higher than the 1 kHz tone! So we are not very sensitive to the lower frequencies. The explanation of this phenomenon is that the ear canal amplifies frequencies from 2.5 to 4 kHz.

Note that as the overall loudness increases, the curves flatten somewhat. We are approximately equally sensitive to low frequencies of a few hundred Hz if the sound level is loud enough. And we perceive most low frequencies better than high ones at high volume levels. Hence, at the dance, loud music sounds better than quiet music, because then we can actually hear low frequencies and not just high ones. (A "loudness" switch on some sound systems simply boosts the low frequencies as well as some high ones.) However, above 90 dB, people begin to become uncomfortable. A typical city subway operates at about 100 dB.
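The decibel arithmetic above can be checked directly: a power ratio r corresponds to 10 log10(r) dB. The following short sketch is ours (not from the text) and simply verifies the 120 dB dynamic-range and 96 dB figures:

```python
import math

def power_ratio_db(p, p_ref):
    """Decibel level of power p relative to reference power p_ref."""
    return 10 * math.log10(p / p_ref)

threshold = 1e-12   # W/m^2: barely audible intensity at 1 kHz
pain = 1.0          # W/m^2: roughly where sound gives way to pain
print(power_ratio_db(pain, threshold))    # 120.0 dB dynamic range

# The 96 dB figure at 2 kHz corresponds to a power ratio of about 2^32:
print(round(power_ratio_db(2 ** 32, 1)))  # 96
```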
14.1.2 Frequency Masking

How does one tone interfere with another? At what level does one frequency drown out another? This question is answered by masking curves. Also, masking answers the question of how much noise we can tolerate before we cannot hear the actual music. Lossy audio data compression methods, such as MPEG Audio or Dolby Digital (AC-3) encoding, which is popular in movies, remove some sounds that are masked anyway, thus reducing the total amount of information.

The general situation in regard to masking is as follows:

• A lower tone can effectively mask (make us unable to hear) a higher tone.

• The reverse is not true. A higher tone does not mask a lower tone well. Tones can in fact mask lower-frequency sounds, but not as effectively as they mask higher-frequency ones.

• The greater the power in the masking tone, the wider its influence — the broader the range of frequencies it can mask.

• As a consequence, if two tones are widely separated in frequency, little masking occurs.

Threshold of Hearing. Figure 14.2 shows a plot of the threshold of human hearing, for pure tones. To determine such a plot, a particular frequency tone is generated, say 1 kHz. Its volume is reduced to zero in a quiet room or using headphones, then turned up until the sound is just barely audible. Data points are generated for all audible frequencies in the same way.

The threshold units are dB. Since the dB unit is a ratio, we do have to choose which frequency will be pinned to the origin, (0, 0). In Equation (14.1), this frequency is 2,000 Hz: Threshold(f) = 0 at f = 2 kHz.

Frequency Masking Curves. Frequency masking is studied by playing a particular pure tone, say 1 kHz again, at a loud volume and determining how this tone affects our ability to hear tones at nearby frequencies. To do so, we would generate a 1 kHz masking tone at a fixed sound level of 60 dB, then raise the level of a nearby tone, say 1.1 kHz, until it is just audible. The threshold in Figure 14.3 plots this audible level.

FIGURE 14.2: Threshold of human hearing, for pure tones.

FIGURE 14.3: Effect on threshold of human hearing for a 1 kHz masking tone.

The point of the threshold of hearing curve is that if a sound is above the dB level shown — say it is above 2 dB for a 6 kHz tone — then the sound is audible. Otherwise, we cannot hear it. Turning up the 6 kHz tone so that it equals or surpasses the curve means we can then distinguish the sound.

An approximate formula exists for this curve, as follows [3]:

Threshold(f) = 3.64 (f/1000)^(-0.8) - 6.5 e^(-0.6 (f/1000 - 3.3)^2) + 10^(-3) (f/1000)^4    (14.1)

It is important to realize that this masking diagram holds only for a single masking tone: the plot changes if other masking tones are used. Figure 14.4 shows how this looks: the higher the frequency of the masking tone, the broader a range of influence it has.

If, for example, we play a 6 kHz tone in the presence of a 4 kHz masking tone, the masking tone has raised the threshold curve much higher. Therefore, at its neighbor frequency of 6 kHz, we must now surpass 30 dB to distinguish the 6 kHz tone.

The practical point is that if a signal can be decomposed into frequencies, then for frequencies that will be partially masked, only the audible part will be used to set quantization noise thresholds.
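Equation (14.1) is easy to evaluate numerically. The sketch below is a direct transcription of the formula (f in Hz, result in dB); it checks that the curve sits near 0 dB at the 2 kHz pin point and rises steeply at low frequencies.

```python
import math

def threshold_quiet_db(f_hz):
    """Threshold of hearing in quiet, per Eq. (14.1); f in Hz, result in dB."""
    k = f_hz / 1000.0
    return 3.64 * k ** -0.8 - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2) + 1e-3 * k ** 4

print(threshold_quiet_db(2000))  # close to 0 dB, the curve's pinned origin
print(threshold_quiet_db(100))   # around 23 dB: low tones must be much louder
```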
FIGURE 14.4: Effect of masking tones at three different frequencies.

Critical Bands. The human hearing range naturally divides into critical bands, with the property that the human auditory system cannot resolve sounds better than within about one critical band when other sounds are present. Hearing has a limited, frequency-dependent resolution. According to [4], "In a complex tone, the critical bandwidth corresponds to the smallest frequency difference between two partials such that each can still be heard separately. ... the critical bandwidth represents the ear's resolving power for simultaneous tones or partials."

At the low-frequency end, a critical band is less than 100 Hz wide, while for high frequencies, the width can be greater than 4 kHz. This indeed is yet another kind of perceptual nonuniformity.

Experiments indicate that the critical bandwidth remains approximately constant in width for masking frequencies below about 500 Hz — this width is about 100 Hz. However, for frequencies above 500 Hz, the critical bandwidth increases approximately linearly with frequency.

Generally, the audio frequency range for hearing can be partitioned into about 24 critical bands (25 are typically used for coding applications), as Table 14.1 shows.

Notwithstanding the general definition of a critical band, it turns out that our hearing apparatus actually is somewhat tuned to certain critical bands. Since hearing depends on physical structures in the inner ear, the frequencies at which these structures best resonate is important. Frequency masking is a result of the ear structures becoming "saturated" at the masking frequency and nearby frequencies.

Hence, the ear operates something like a set of band-pass filters, each of which allows a limited range of frequencies through and blocks all others. Experiments that show this are based on the observation that a constant-volume sound will seem louder if it spans the boundary between two critical bands than it would were it contained entirely within one critical band [5]. In effect, the ear is not very discriminating within a critical band, because of masking.

TABLE 14.1: Critical bands and their bandwidths.

Band #   Lower bound (Hz)   Center (Hz)   Upper bound (Hz)   Bandwidth (Hz)
  1             0                50             100               100
  2           100               150             200               100
  3           200               250             300               100
  4           300               350             400               100
  5           400               450             510               110
  6           510               570             630               120
  7           630               700             770               140
  8           770               840             920               150
  9           920              1000            1080               160
 10          1080              1170            1270               190
 11          1270              1370            1480               210
 12          1480              1600            1720               240
 13          1720              1850            2000               280
 14          2000              2150            2320               320
 15          2320              2500            2700               380
 16          2700              2900            3150               450
 17          3150              3400            3700               550
 18          3700              4000            4400               700
 19          4400              4800            5300               900
 20          5300              5800            6400              1100
 21          6400              7000            7700              1300
 22          7700              8500            9500              1800
 23          9500             10500           12000              2500
 24         12000             13500           15500              3500
 25         15500             18775           22050              6550

Bark Unit. Since the range of frequencies affected by masking is broader for higher frequencies, it is useful to define a new frequency unit such that, in terms of this new unit, each of the masking curves (the parts of Figure 14.4 above the threshold in quiet) has about the same width.
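Table 14.1 and the Bark-scale relations discussed in the next passage can be explored numerically. In this sketch (ours, not code from the book), band membership uses the upper bounds from the table, while the piecewise conversion and the bandwidth approximation follow Equations (14.2) and (14.5); note the differing frequency units, as in the text.

```python
import math

# Upper bound (Hz) of critical bands 1..25, from Table 14.1.
UPPER_BOUNDS = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400,
                7700, 9500, 12000, 15500, 22050]

def critical_band(f_hz):
    """1-based number of the critical band containing f_hz (up to 22.05 kHz)."""
    for band, upper in enumerate(UPPER_BOUNDS, start=1):
        if f_hz <= upper:
            return band
    raise ValueError("frequency beyond the 25-band range")

def bark(f_hz):
    """Piecewise critical-band number, Eq. (14.2); f in Hz."""
    return f_hz / 100.0 if f_hz < 500 else 9 + 4 * math.log2(f_hz / 1000.0)

def critical_bandwidth_hz(f_khz):
    """Approximate critical bandwidth in Hz at center frequency f in kHz, Eq. (14.5)."""
    return 25 + 75 * (1 + 1.4 * f_khz ** 2) ** 0.69

print(critical_band(1000))                # 9: 1000 Hz falls in band 9 (920-1080 Hz)
print(bark(500), bark(1000))              # 5.0 and 9.0, matching the text
print(round(critical_bandwidth_hz(1.0)))  # close to the 160 Hz Table 14.1 lists for band 9
```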
FIGURE 14.5: Effect of masking tones, expressed in Bark units. (Masking tones at 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz; the abscissa is critical band number, in Barks.)

The new unit defined is called the Bark, named after Heinrich Barkhausen (1881-1956), an early sound scientist. One Bark unit corresponds to the width of one critical band, for any masking frequency [6, 7]. Figure 14.5 displays critical bands, with the frequency (the abscissa) given in Bark units.

The conversion between a frequency f and its corresponding critical-band number b, expressed in Bark units, is as follows:

Critical band number (Bark) = f/100                  for f < 500
                            = 9 + 4 log2(f/1000)     for f >= 500    (14.2)

In terms of this new frequency measure, the critical-band number b equals 5 when f = 500 Hz. At double that frequency, for a masking frequency of 1 kHz, the Bark value goes up to 9. Another formula used for the Bark scale is as follows:

b = 13.0 arctan(0.76 f) + 3.5 arctan(f^2 / 56.25)    (14.3)

where f is in kHz and b is in Barks. The inverse equation gives the frequency (in kHz) corresponding to a particular Bark value b:

f = [(exp(0.219 b) / 352) + 0.1] b - 0.032 exp[-0.15 (b - 5)^2]    (14.4)

Frequencies forming the boundaries between two critical bands are given by integer Bark values. The critical bandwidth (df) for a given center frequency f can also be approximated by [8]

df = 25 + 75 [1 + 1.4 f^2]^0.69    (14.5)

where f is in kHz and df is in Hz.

The idea of the Bark unit is to define a more perceptually uniform unit of frequency, in that every critical band's width is roughly equal in terms of Barks.

FIGURE 14.6: The louder the test tone, the shorter the amount of time required before the test tone is audible once the masking tone is removed.

14.1.3 Temporal Masking

Recall that after the dance it takes quite a while for our hearing to return to normal. Generally, any loud tone causes the hearing receptors in the inner ear (little hairlike structures called cilia) to become saturated, and they require time to recover. (Many other perceptual systems behave in this temporally slow fashion — for example, the receptors in the eye have this same kind of "capacitance" effect.)

To quantify this type of behavior, we can measure the time sensitivity of hearing by another masking experiment. Suppose we again play a masking tone at 1 kHz with a volume level of 60 dB, and a nearby tone at, say, 1.1 kHz with a volume level of 40 dB. Since the nearby test tone is masked, it cannot be heard. However, once the masking tone is turned off, we can again hear the 1.1 kHz tone, but only after a small amount of time. The experiment proceeds by stopping the test tone slightly after the masking tone is turned off, say 10 msec later.

The delay time is adjusted to the minimum amount of time such that the test tone can just be distinguished. In general, the louder the test tone, the less time it takes for our hearing to get over hearing the masking tone. Figure 14.6 shows this effect: it may take up to as much as 500 msec for us to discern a quiet test tone after a 60 dB masking tone has been played. Of course, this plot would change for different masking tone frequencies.

Test tones with frequencies near the masking tone are, of course, the most masked. Therefore, for a given masking tone, we have a two-dimensional temporal masking situation, as in Figure 14.7. The closer the frequency to the masking tone and the closer in time to when the masking tone is stopped, the greater the likelihood that a test tone cannot be heard. The figure shows the total effect of both frequency and temporal masking.

The phenomenon of saturation also depends on just how long the masking tone has been applied. Figure 14.8 shows that for a masking tone played longer (200 msec) than another (100 msec), it takes longer before a test tone can be heard.

As well as being able to mask other signals that occur just after it sounds (post-masking), a particular signal can even mask sounds played just before the stronger signal (pre-masking).
Pre-masking has a much shorter effective interval (2-5 msec) in which it is operative than does post-masking (usually 50-200 msec).

MPEG audio compression takes advantage of these considerations in basically constructing a large, multidimensional lookup table. It uses this to transmit frequency components that are masked by frequency masking or temporal masking or both, using fewer bits.

FIGURE 14.7: Effect of temporal masking depends on both time and closeness in frequency. (Tones below the surface are inaudible; the axes are frequency, time, and dB.)

FIGURE 14.8: Effect of temporal masking also depends on the length of time the masking tone is applied. Solid curve: masking tone played for 200 msec; dashed curve: masking tone played for 100 msec.

14.2 MPEG AUDIO

MPEG Audio proceeds by first applying a filter bank to the input, to break the input into its frequency components. In parallel, it applies a psychoacoustic model to the data, and this model is used in a bit-allocation block. Then the number of bits allocated is used to quantize the information from the filter bank. The overall result is that quantization provides the compression, and bits are allocated where they are most needed to lower the quantization noise below an audible level.

14.2.1 MPEG Layers

MP3 is a popular audio compression standard. The "3" stands for Layer 3, and "MP" stands for the MPEG-1 standard. Recall that we looked at MPEG video compression in Chapter 11. However, the MPEG standard actually delineates three different aspects of multimedia: audio, video, and systems. MP3 forms part of the audio component of this first phase of MPEG. It was released in 1992 and resulted in the international standard ISO/IEC 11172-3, published in 1993.

MPEG audio sets out three downward-compatible layers of audio compression, each able to understand the lower layers. Each offers more complexity in the psychoacoustic model applied and correspondingly better compression for a given level of audio quality. However, an increase in complexity, and concomitantly in compression effectiveness, is accompanied by extra delay.

Layers 1 to 3 in MPEG Audio are compatible, because all layers include the same file header information.

Layer 1 quality can be quite good, provided a comparatively high bitrate is available. Digital Audio Tape typically uses Layer 1. Layer 2 has more complexity and was proposed for use in digital audio broadcasting. Layer 3 (MP3) is most complex and was originally aimed at audio transmission over ISDN lines. Each of the layers also uses a different frequency transform.

Most of the complexity increase is at the encoder rather than at the decoder side, and this accounts for the popularity of MP3 players. Layer 1 incorporates the simplest psychoacoustic model, and Layer 3 uses the most complex. The objective is a good tradeoff between quality and bitrate. "Quality" is defined in terms of listening test scores (the psychologists hold sway here), where a quality measure is defined by:

• 5.0 = "Transparent" — undetectable difference from original signal; equivalent to CD-quality audio at 14- to 16-bit PCM

• 4.0 = Perceptible difference, but not annoying

• 3.0 = Slightly annoying
• 2.0 = Annoying What Lo drop


Audio
• 1.0 — Very annoying (PCM) Encoded
inpul
(Now that’s scientific!) Ai 64 kbps per channel, Layer 2 scores between 2.1 and 2.6, and
Layer 3 scores between 3.6 and 3.8. So Layer 3 provides a substantial improvemenL but is
stili not perfecl by any means.

14.2.2 MPEG Audio Strategy


Compression is certainly called for, since even audio can take %irly substancial bandwidLh:
CD audio is sampied aL 44.1 kHz and 16 bitslchannei, so for two channeis needs a bitraLe
of abouL 1.4 Mbps. MPEG- 1 aims aL about 1.5 Mbps overail, wiLh 1.2 Mbps for video and (a)
256 kbps for audio.
The MPEG approach Lo compression relies on quantization, of course, but also recognizes Encoded Decoded
Lhat the human auditory system is not accurale within Lhe widLh of a critical band, boch in bitstream Bitsiream Frequency Frequency PCM audio
Lerms of perceived ioudness and audibiiiLy of a Lest frequency. The encoder empioys a bank unpacking sampie lo time
of fiuiers thai act Lo firsL analyze lhe frequency (speclrat) componenLs of lhe audio signal _____________ reconsLruction1 lransformation1
by calcuiaLing a frequency iransform of a window of signai values. The bank of fihiers (b)
decomposes lhe signal inLo subbands. Layer 1 and Layer 2 codecs use a quadrature-mirror
fihier bank, while Lhe Layer 3 codec adds a DCT. For Lhe psychoacoustic model, a Fourier FIGURE 14,9: (a) Basic MPEG Audio encoder; and (b) decoder.
transfonii is used.
Then frequency masking can be broughc lo bear by using a psychoacouslic model Lo
esLimate Lhe just noliceable noise levei. In iLs quantization and coding sLage. lhe encoder However, temporal masking is iess importani for compression Lhan is frequency masking,
balances Lhe masking behavior and Lhe available number of bits by discarding inaudibie which is why it is sometimes disregarded enLireiy in lower-compiexily coders. Layer 3 is
frequencies and scaling quancizaLion according Lo Lhe sound levei lefL over, above masking direcLed ioward iower bitraLe applicaLions and uses a more sophisticated subban
leveis. wiLh nonuniform subband widths. lt also adds nonuniform quantization and enLropy coding.
A sophislicaLed modei wouid iake inLo accounL lhe acLual widLh of Lhe critical bands BitraLes are standardized aL 32 320 kbps.
ceniered aI differenL frequencies. Wichin a critical band, our auditory system cannoL finely
resolve neighboring frequencies and instead tends lo biur Lhem. As menLioned earlier, 14.2.3 MPEG Audio Compression Algorithm
audible frequencies are usuaiiy divided inLo 25 main criLical bands, inspired by Lhe auditory
criticai bands.
However, in keeping with design simplicity, lhe modei adopts a unhform width for ali Basie Algorithm. Figure 14.9 shows the basic MPEG audio compression algorithm. lL
frequency analysis filters, using 32 overlapping subbands [9, 10]. This means that at lower frequencies, each of the frequency analysis "subbands" covers the width of several critical bands of the auditory system, whereas at higher frequencies this is not so, since a critical band's width is less than 100 Hz at the low end and more than 4 kHz at the high end. For each frequency band, the sound level above the masking level dictates how many bits must be assigned to code signal values, so that quantization noise is kept below the masking level and hence cannot be heard.

In Layer 1, the psychoacoustic model uses only frequency masking. Bitrates range from 32 kbps (mono) to 448 kbps (stereo). Near-CD stereo quality is possible with a bitrate of 256–384 kbps. Layer 2 uses some temporal masking by accumulating more samples and examining temporal masking between the current block of samples and the ones just before and just after. Bitrates can be 32–192 kbps (mono) and 64–384 kbps (stereo). Stereo CD-audio quality requires a bitrate of about 192–256 kbps.

The basic algorithm proceeds by dividing the input into 32 frequency subbands, via a filter bank. This is a linear operation that takes as its input a set of 32 PCM samples, sampled in time, and produces as its output 32 frequency coefficients. If the sampling rate is f_s, say f_s = 48 ksps (kilosamples per second; i.e., 48 kHz), then by the Nyquist theorem, the maximum frequency mapped will be f_s/2. Thus the mapped bandwidth is divided into 32 equal-width segments, each of width f_s/64 (these segments overlap somewhat).

In the Layer 1 encoder, the sets of 32 PCM values are first assembled into a set of 12 groups of 32s. Hence, the coder has an inherent time lag, equal to the time to accumulate 384 (i.e., 12 × 32) samples. For example, if sampling proceeds at 32 ksps, then a time duration of 12 msec is required, since each set of 32 samples is transmitted each millisecond. These sets of 12 samples, each of size 32, are called segments. The point of assembling them is to examine 12 sets of values at once in each of the 32 subbands, after frequency analysis has been carried out, then base quantization on just a summary figure for all 12 values.
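These framing figures are easy to check. The sketch below simply restates the arithmetic from the text; the variable names are ours and nothing here is tied to any particular encoder:

```python
# Layer 1 framing arithmetic: 32 subbands, 12 groups of 32 PCM samples per frame.

fs = 48_000                                # sampling rate, samples per second
subbands = 32
samples_per_frame = 12 * subbands          # 384 samples accumulated per frame

subband_width_hz = (fs / 2) / subbands     # Nyquist bandwidth split into 32 equal parts
frame_lag_ms = 1000 * samples_per_frame / fs

print(subband_width_hz)                    # equivalently fs/64: 750.0 Hz at 48 ksps
print(frame_lag_ms)                        # 8.0 ms at 48 ksps
```

At 32 ksps the same computation gives the 12 msec lag quoted in the text.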
408 Chapter 14 MPEG Audio Compression Section 14.2 MPEG Audio 409

FIGURE 14.10: Example MPEG Audio frame. (Fields: Header, SBS format, SBS, Ancillary data.)

The delay is actually somewhat longer than that required to accumulate 384 samples, since header information is also required. As well, ancillary data, such as multilingual data and surround-sound data, is allowed. Higher layers also allow more than 384 samples to be analyzed, so the format of the subband-samples (SBS) is also added, with a resulting frame of data, as in Figure 14.10. The header contains a synchronization code (twelve 1s — 111111111111), the sampling rate used, the bitrate, and stereo information. The frame format also contains room for so-called "ancillary" (extra) information. (In fact, an MPEG-1 audio decoder can at least partially decode an MPEG-2 audio bitstream, since the file header begins with an MPEG-1 header and places the MPEG-2 datastream into the MPEG-1 Ancillary Data location.)

MPEG Audio is set up to be able to handle stereo or mono channels, of course. A special joint-stereo mode produces a single stream by taking into account the redundancy between the two channels in stereo. This is the audio version of a composite video signal. It can also deal with dual-monophonic — two channels coded independently. This is useful for parallel treatment of audio — for example, two speech streams, one in English and one in Spanish.

FIGURE 14.11: MPEG Audio frame sizes. (Each subband filter produces 1 sample out for every 32 samples in; a Layer 1 frame collects 12 samples per subband filter, while Layer 2 and Layer 3 frames collect three such groups of 12.)
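Because the header starts with twelve 1-bits, a decoder can find frame boundaries by scanning for that pattern. The sketch below is illustrative only; a real decoder would go on to validate the bitrate, sampling-rate, and other header fields before accepting a match:

```python
# Scan for a candidate MPEG audio frame header: 12 sync bits, all 1s
# (a 0xFF byte followed by a byte whose top four bits are also 1111).

def find_sync(data: bytes, start: int = 0) -> int:
    """Return the index of the first candidate sync position, or -1 if none."""
    for i in range(start, len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            return i
    return -1

stream = bytes([0x00, 0x12, 0xFF, 0xFB, 0x90, 0x64])
print(find_sync(stream))  # 2: the 12 sync bits begin at byte offset 2
```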
Consider the 32 × 12 segment as a 32 × 12 matrix. The next stage of the algorithm is concerned with scale, so that proper quantization levels can be set. For each of the 32 subbands, the maximum amplitude of the 12 samples in that row of the array is found, which is the scaling factor for that subband. This maximum is then passed to the bit-allocation block of the algorithm, along with the SBS (subband samples). The key point of the bit-allocation block is to determine how to apportion the total number of code bits available for the quantization of subband signals to minimize the audibility of the quantization noise.

As we know, the psychoacoustic model is fairly complex — more than just a set of lookup tables (and in fact this model is not standardized in the specification — it forms part of the "art" content of an audio encoder and is one major reason all encoders are not the same). In Layer 1, a decision step is included to decide whether each frequency band is basically like a tone or like noise. From that decision and the scaling factor, a masking threshold is calculated for each band and compared with the threshold of hearing.

The model's output consists of a set of what are known as signal-to-mask ratios (SMRs) that flag frequency components with amplitude below the masking level. The SMR is the ratio of the short-term signal power within each frequency band to the minimum masking threshold for the subband. The SMR gives the amplitude resolution needed and therefore also controls the bit allocations that should be given to the subband. After determination of the SMR, the scaling factors discussed above are used to set quantization levels such that quantization error itself falls below the masking level. This ensures that more bits are used in regions where hearing is most sensitive. In sum, the coder uses fewer bits in critical bands when fewer can be used without making quantization noise audible.

The scaling factor is first quantized, using 6 bits. The 12 values in each subband are then quantized. Using 4 bits, the bit allocations for each subband are transmitted, after an iterative bit-allocation scheme is used. Then the data is transmitted, with appropriate bit depths for each subband. Altogether, the data consisting of the quantized scaling factor and the 12 codewords are grouped into a collection known as the Subband-Sample format.

On the decoder side, the values are de-quantized, and magnitudes of the 32 samples are reestablished. These are passed to a bank of synthesis filters, which reconstitute a set of 32 PCM samples. Note that the psychoacoustic model is not needed in the decoder.

Figure 14.11 shows how samples are organized. A Layer 2 or Layer 3 frame actually accumulates more than 12 samples for each subband: instead of 384 samples, a frame includes 1,152 samples.

Bit Allocation. The bit-allocation algorithm is not part of the standard, and it can therefore be done in many possible ways. The aim is to ensure that all the quantization noise is below the masking thresholds. However, this is usually not the case for low bitrates. The psychoacoustic model is brought into play for such cases, to allocate more bits, from the number available, to the subbands where increased resolution will be most beneficial. One common scheme is as follows.

For each subband, the psychoacoustic model calculates the Signal-to-Mask Ratio, in dB. A lookup table in the MPEG Audio standard also provides an estimate of the SNR (signal-to-noise ratio), assuming quantization to a given number of quantizer levels.
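This SMR and SNR bookkeeping feeds an iterative allocation loop: define the mask-to-noise ratio MNR = SNR − SMR, repeatedly give one more bit to the subband whose MNR is currently lowest, and stop when the bit budget is exhausted. The sketch below is a simplified stand-in, not the standard's algorithm: the SNR-versus-bits lookup table is approximated as roughly 6 dB per quantizer bit, and per-subband allocation rules are reduced to a single cap:

```python
# Greedy bit allocation driven by mask-to-noise ratio, MNR = SNR - SMR.
# Assumed stand-in: SNR(b) ~ 6.02*b dB replaces the standard's lookup table.

def allocate_bits(smr_db, total_bits, max_bits=15):
    """smr_db: signal-to-mask ratio per subband, in dB."""
    bits = [0] * len(smr_db)
    snr = lambda b: 6.02 * b
    while total_bits > 0:
        # Subbands that can still accept another bit.
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        # Increment the subband whose MNR is currently lowest.
        worst = min(candidates, key=lambda i: snr(bits[i]) - smr_db[i])
        bits[worst] += 1
        total_bits -= 1
    return bits

print(allocate_bits([20.0, 5.0, -3.0, 12.0], total_bits=10))
```

Subbands with high SMR (a strong signal relative to its masking threshold) end up with the most bits, which is exactly the behavior described in the text.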


FIGURE 14.12: Mask-to-noise ratio and signal-to-mask ratio. A qualitative view of SNR, SMR, and MNR, with one dominant masker and m bits allocated to a particular critical band. (The figure plots sound pressure level (dB) against frequency, showing the masker within its critical band, the neighboring bands, and the noise level corresponding to the allocated bits.)

Then the Mask-to-Noise Ratio (MNR) is defined as the difference

MNR_dB = SNR_dB − SMR_dB    (14.6)

as Figure 14.12 shows. The lowest MNR is determined, over all the subbands, and the number of code-bits allocated to this subband is incremented. Then a new estimate of the SNR is made, and the process iterates until no more bits are left to allocate.

Mask calculations are performed in parallel with subband filtering, as in Figure 14.13. The masking curve calculation requires an accurate frequency decomposition of the input signal, using a Discrete Fourier Transform (DFT). The frequency spectrum is usually calculated with a 1,024-point Fast Fourier Transform (FFT).

In Layer 1, 16 uniform quantizers are pre-calculated, and for each subband the quantizer giving the lowest distortion is chosen. The index of the quantizer is sent as 4 bits of side information for each subband. The maximum resolution of each quantizer is 15 bits.

Layer 2. Layer 2 of the MPEG-1 Audio codec includes small changes to effect bitrate reduction and quality improvement, at the price of an increase in complexity. The main difference in Layer 2 is that three groups of 12 samples are encoded in each frame, and temporal masking is brought into play, as well as frequency masking. One advantage is that if the scaling factor is similar for each of the three groups, a single scaling factor can be used for all three. But using three frames in the filter (before, current, and next), for a total of 1,152 samples per channel, approximates taking temporal masking into account.

FIGURE 14.13: MPEG-1 Audio Layers 1 and 2. (The PCM audio signal feeds a 32-subband filter bank and, in parallel, a 1,024-point FFT that drives the psychoacoustic model; linear quantization, side-information coding, and bitstream formatting produce the coded signal.)

As well, the psychoacoustic model does better at modeling slowly-changing sound if the time window used is longer. Bit allocation is applied to window lengths of 36 samples instead of 12, and resolution of the quantizers is increased from 15 bits to 16. To ensure that this greater accuracy does not mean poorer compression, the number of quantizers to choose from decreases for higher subbands.

Layer 3. Layer 3, or MP3, uses a bitrate similar to Layers 1 and 2 but produces substantially better audio quality, again at the price of increased complexity.

A filter bank similar to that used in Layer 2 is employed, except that now perceptual critical bands are more closely adhered to by using a set of filters with nonequal frequencies. This layer also takes into account stereo redundancy. It also uses a refinement of the Fourier transform: the Modified Discrete Cosine Transform (MDCT) addresses problems the DCT has at boundaries of the window used. The Discrete Fourier Transform can produce block edge effects. When such data is quantized and then transformed back to the time domain, the beginning and ending samples of a block may not be coordinated with the preceding and subsequent blocks, causing audible periodic noise.

The MDCT, shown in Equation (14.7), removes such effects by overlapping frames by 50%:

F(u) = 2 Σ_{i=0}^{N−1} f(i) cos[(2π/N)(i + 1/2 + N/4)(u + 1/2)],   u = 0, …, N/2 − 1    (14.7)

The MDCT also gives better frequency resolution for the masking and bit allocation operations. Optionally, the window size can be reduced back to 12 samples from 36. Even so, since the window is 50% overlapped, a 12-sample window still includes an extra 6 samples. A size-36 window includes an extra 18 points. Since lower frequencies are more often tonelike rather than noiselike, they need not be analyzed as carefully, so a mixed mode is also available, with 36-point windows used for the lowest two frequency subbands and 12-point windows used for the rest.
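Equation (14.7) can be transcribed directly. The sketch below computes only the raw transform; the windowing and the 50%-overlap bookkeeping of an actual Layer 3 encoder are omitted:

```python
import math

def mdct(f):
    """Direct form of Equation (14.7): N time samples -> N/2 frequency coefficients."""
    N = len(f)
    return [
        2 * sum(
            f[i] * math.cos((2 * math.pi / N) * (i + 0.5 + N / 4) * (u + 0.5))
            for i in range(N)
        )
        for u in range(N // 2)
    ]

# A 36-sample window, as used for bit allocation in Layer 2/3.
coeffs = mdct([math.sin(2 * math.pi * 3.0 * n / 36) for n in range(36)])
print(len(coeffs))  # 18
```

This direct O(N^2) form is fine for experimenting with a single window (e.g., Exercise 10); production coders use fast factorizations instead.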

FIGURE 14.14: MPEG-1 Audio Layer 3. (The PCM audio signal enters; the coded audio signal is the output.)

TABLE 14.2: MP3 compression performance.

Sound quality               Compression ratio
Telephony
Better than shortwave       48:1
Better than AM radio        24:1
Similar to FM radio         26:1 to 24:1
Near-CD (15 kHz, stereo)    16:1
CD (>15 kHz, stereo)        14:1 to 12:1

As well, instead of assigning scaling factors to uniform-width subbands, MDCT coefficients are grouped in terms of the auditory system's actual critical bands, and scaling factors, called scale factor bands, are calculated from these.

More bits are saved by carrying out entropy coding and making use of nonuniform quantizers. And, finally, a different bit allocation scheme is used, with two parts. Firstly, a nested loop is used, with an inner loop that adjusts the shape of the quantizer, and an outer loop that then evaluates the distortion from that bit configuration. If the error ("distortion") is too high, the scale factor band is amplified. Second, a bit reservoir banks bits from frames that don't need them and allocates them to frames that do. Figure 14.14 shows a summary of MPEG Audio Layer 3 coding.

Table 14.2 shows various achievable MP3 compression ratios. In particular, CD-quality audio is achieved with compression ratios in the range of 12:1 to 8:1 (i.e., bitrates of 128 to 192 kbps).

14.2.4 MPEG-2 AAC (Advanced Audio Coding)

The MPEG-2 standard is widely employed, since it is the standard vehicle for DVDs, and it, too, has an audio component. The MPEG-2 Advanced Audio Coding (AAC) standard [11] was aimed at transparent sound reproduction for theaters. It can deliver this at 320 kbps for five channels, so that sound can be played from five directions: left, right, center, left-surround, and right-surround. So-called 5.1 channel systems also include a low-frequency enhancement (LFE) channel (a "woofer"). On the other hand, MPEG-2 AAC is also capable of delivering high-quality stereo sound at bitrates below 128 kbps. It is the audio coding technology for the DVD-Audio Recordable (DVD-AR) format and is also adopted by XM Radio, one of the two satellite radio services in North America.

MPEG-2 audio can support up to 48 channels, sampling rates between 8 kHz and 96 kHz, and bitrates up to 576 kbps per channel. Like MPEG-1, MPEG-2 supports three different "profiles", but with a different purpose. These are the Main, Low Complexity (LC), and the Scalable Sampling Rate (SSR) profiles. The LC profile requires less computation than the Main profile, but the SSR profile breaks up the signal so that different bitrates and sampling rates can be used by different decoders.

The three profiles follow mostly the same scheme, with a few modifications. First, an MDCT transform is carried out, either on a "long" window with 2,048 samples or a "short" window with 256 samples. The MDCT coefficients are then filtered by a Temporal Noise Shaping (TNS) tool, with the objective of reducing pre-masking effects and better encoding signals with stable pitch.

The MDCT coefficients are then grouped into 49 scale factor bands, approximately equivalent to a good-resolution version of the human acoustic system's critical bands. In parallel with the frequency transform, a psychoacoustic model similar to the one in MPEG-1 is carried out, to find masking thresholds.

The Main profile uses a predictor. Based on the previous two frames, and only for frequency coefficients up to 16 kHz, MPEG-2 subtracts a prediction from the frequency coefficients, provided this step will indeed reduce distortion. Quantization is governed by two rules: keep distortion below the masking threshold, and keep the average number of bits used per frame controlled, using a bit reservoir. Quantization uses scaling factors — which can be used to amplify some of the scale factor bands — and nonuniform quantization. MPEG-2 AAC also uses entropy coding for both scale factors and frequency coefficients.

Again, a nested loop is used for bit allocation. The inner loop adapts the nonlinear quantizer, then applies entropy coding to the quantized data. If the bit limit is reached for the current frame, the quantizer step size is increased to use fewer bits. The outer loop

decides whether for each scale factor band the distortion is below the masking threshold. If a band is too distorted, it is amplified to increase the SNR of that band, at the price of using more bits.

In the SSR profile, a Polyphase Quadrature Filter (PQF) bank is used. The meaning of this phrase is that the signal is first split into four frequency bands of equal width, then an MDCT is applied. The point of the first step is that the decoder can decide to ignore one of the four frequency parts if the bitrate must be reduced.

14.2.5 MPEG-4 Audio

MPEG-4 audio integrates several different audio components into one standard: speech compression, perceptually based coders, text-to-speech, and MIDI. The primary general audio coder, MPEG-4 AAC [12], is similar to the MPEG-2 AAC standard, with some minor changes.

Perceptual Coders. One change is to incorporate a Perceptual Noise Substitution module, which looks at scale factor bands above 4 kHz and includes a decision as to whether they are noiselike or tonelike. A noiselike scale factor band itself is not transmitted; instead, just its energy is transmitted, and the frequency coefficient is set to zero. The decoder then inserts noise with that energy.

Another modification is to include a Bit-Sliced Arithmetic Coding (BSAC) module. This is an algorithm for increasing bitrate scalability, by allowing the decoder side to be able to decode a 64 kbps stream using only a 16 kbps baseline output (and steps of 1 kbps from that minimum).

MPEG-4 audio also includes a second perceptual audio coder, a vector-quantization method entitled Transform-domain Weighted Interleave Vector Quantization (TwinVQ). This is aimed at low bitrates and allows the decoder to discard portions of the bitstream to implement both adjustable bitrate and sampling rate. The basic strategy of MPEG-4 audio is to allow decoders to apply as many or as few audio tools as bandwidth allows.

Structured Coders. To have a low bitrate delivery option, MPEG-4 takes what is termed a Synthetic/Natural Hybrid Coding (SNHC) approach. The objective is to integrate both "natural" multimedia sequences, both video and audio, with those arising synthetically. In audio, the latter are termed structured audio. The idea is that for low bitrate operation, we can simply send a pointer to the audio model we are working with and then send audio model parameters.

In video, such a model-based approach might involve sending face-animation data rather than natural video frames of faces. In audio, we could send the information that English is being modeled, then send codes for the basesounds (phonemes) of English, along with other assembler-like codes specifying duration and pitch.

MPEG-4 takes a toolbox approach and allows specification of many such models. For example, Text-To-Speech (TTS) is an ultra-low bitrate method and actually works, provided we need not care what the speaker actually sounds like. Assuming we went on to derive Face Animation Parameters from such low bitrate information, we arrive directly at a very low bitrate videoconferencing system.

Another "tool" in structured audio is called Structured Audio Orchestra Language (SAOL, pronounced "sail"), which allows simple specification of sound synthesis, including special effects such as reverberation.

Overall, structured audio takes advantage of redundancies in music to greatly compress sound descriptions.

14.3 OTHER COMMERCIAL AUDIO CODECS

Table 14.3 summarizes the target bitrate range and main features of other modern general audio codecs. They bear many similarities to MPEG-2 audio codecs.

TABLE 14.3: Comparison of audio coding systems.

Codec         Bitrate (kbps/channel)   Complexity              Main application
Dolby AC-2    128–192                  Low (encoder/decoder)   Point-to-point, cable
Dolby AC-3    32–640                   Low (decoder)           HDTV, cable, DVD
Sony ATRAC    140                      Low (encoder/decoder)   Minidisc

14.4 THE FUTURE: MPEG-7 AND MPEG-21

Recall that MPEG-4 is aimed at compression using objects. MPEG-4 audio has several interesting features, such as 3D localization of sound, integration of MIDI, text-to-speech, different codecs for different bitrates, and use of the sophisticated MPEG-2 AAC codec. However, newer MPEG standards are mainly aimed at "search": how can we find objects, assuming that multimedia is indeed coded in terms of objects?

The formulation of MPEG-21 [13] is an ongoing effort, aimed at driving a standardization effort for a Multimedia Framework from a consumer's perspective, particularly addressing interoperability. However, we can say something more specific about how MPEG-7 means to describe a structured model of audio [14], so as to promote ease of search for audio objects.

Officially called a method for Multimedia Content Description Interface, MPEG-7 provides a means of standardizing metadata for audiovisual multimedia sequences. MPEG-7 is meant to represent information about multimedia information.

The objective, in terms of audio, is to facilitate the representation and search for sound content, perhaps through the tune or other descriptors. Therefore, researchers are laboring to develop descriptors that efficiently describe, and can help find, specific audio in files. This might require human or automatic content analysis and might be aimed not just at low-level structures, such as melody, but at actually grasping information regarding structural and semantic content [15].

An example application supported by MPEG-7 is automatic speech recognition (ASR). Language understanding is also an objective for MPEG-7 "content". In theory, MPEG-7 would allow searching on spoken and visual events: "Find me the part where Hamlet says,

'To be or not to be.'" However, the objective of delineating a complete, structured audio model for MPEG-7 is by no means complete.

Nevertheless, low-level features are important. A recent summary of such work [16] sets out one set of such descriptors.

14.5 FURTHER EXPLORATION

Good reviews of MPEG Audio are contained in the articles [9, 17]. A comprehensive explication of natural audio coding in MPEG-4 appears in [18]. Structured audio is introduced in [19], and exhaustive articles on natural and structured audio in MPEG-4 appear in [20] and [21].

The Further Exploration section of the text web site for this chapter contains a number of useful links:

• Excellent collections of MPEG audio and MP3 links
• The MPEG audio FAQ
• An excellent reference by the Fraunhofer-Gesellschaft research institute, "MPEG-4 Audio Scalable Profile," on the subject of Tools for Large Step Scalability. This allows the decoder to decide how many tools to apply and at what complexity, based on available bandwidth.

14.6 EXERCISES

1. (a) What is the threshold of quiet, according to Equation (14.1), at 1,000 Hz? (Recall that this equation uses 2 kHz as the reference for the 0 dB level.)
   (b) Take the derivative of Equation (14.1) and set it equal to zero, to determine the frequency at which the curve is minimum. What frequency are we most sensitive to? Hint: One has to solve this numerically.
2. Loudness versus amplitude. Which is louder: a 1,000 Hz sound at 60 dB or a 100 Hz sound at 60 dB?
3. For the (newer versions of the) Fletcher-Munson curves, in Figure 14.1, the way this data is actually observed is by setting the y-axis value, the sound pressure level, and measuring a human's estimation of the effective perceived loudness. Given the set of observations, what must we do to turn these into the set of perceived loudness curves shown in the figure?
4. Two tones are played together. Suppose tone 1 is fixed, but tone 2 has a frequency that can vary. The critical bandwidth for tone 1 is the frequency range for tone 2 over which we hear beats, and a roughness in the sound. Beats are overtones at a lower frequency than the two close tones; they arise from the difference in frequencies of the two tones. The critical bandwidth is bounded by frequencies beyond which the two tones sound with two distinct pitches.
   (a) What would be a rough estimate of the critical bandwidth at 220 Hz?
   (b) Explain in words how you would set up an experiment to measure the critical bandwidth.
5. Search the web to discover what is meant by the following psychoacoustic phenomena:
   (a) Virtual pitch
   (b) Auditory scene analysis
   (c) Octave-related complex tones
   (d) Tri-tone paradox
   (e) Inharmonic complex tones
6. If the sampling rate f_s is 32 ksps, in MPEG Audio Layer 1, what is the width in frequency of each of the 32 subbands?
7. Given that the level of a masking tone at the 8th band is 60 dB, and 10 msec after it stops, the masking effect to the 9th band is 25 dB.
   (a) What would MP3 do if the original signal at the 9th band is at 40 dB?
   (b) What if the original signal is at 20 dB?
   (c) How many bits should be allocated to the 9th band in (a) and (b) above?
8. What does MPEG Layer 3 (MP3) audio do differently from Layer 1 to incorporate temporal masking?
9. Explain MP3 in a few paragraphs, for an audience of consumer-audio-equipment salespeople.
10. Implement MDCT, just for a single 36-sample signal, and compare the frequency results to those from DCT. For low-frequency sound, which does better at concentrating the energy in the first few coefficients?
11. Convert a CD-audio cut to MP3. Compare the audio quality of the original and the compressed version — can you hear the difference? (Many people cannot.)
12. For two stereo channels, we would like to be able to use the fact that the second channel behaves, usually, in a parallel fashion to the first, and apply information gleaned from the first channel to compression of the second. Discuss how you think this might proceed.

14.7 REFERENCES

1 D.W. Robinson and R.S. Dadson, "A Re-determination of the Equal-Loudness Relations for Pure Tones," British Journal of Applied Physics, 7: 166–181, 1956.
2 H. Fletcher and W.A. Munson, "Loudness, Its Definition, Measurement and Calculation," Journal of the Acoustical Society of America, 5: 82–107, 1933.
3 T. Painter and A. Spanias, "Perceptual Coding of Digital Audio," Proceedings of the IEEE, 88(4): 451–513, 2000.
4 B. Truax, Handbook for Acoustic Ecology, 2nd ed. Burnaby, BC, Canada: Cambridge Street Publishing, 1999.
5 D. O'Shaughnessy, Speech Communications: Human and Machine, Los Alamitos, CA: IEEE Press, 2000.

6 A.J.M. Houtsma, "Psychophysics and Modern Digital Audio Technology," Philips Journal of Research, 47: 3–14, 1992.
7 E. Zwicker and U. Tilmann, "Psychoacoustics: Matching Signals to the Final Receiver," Journal of the Audio Engineering Society, 39: 115–126, 1991.
8 D. Lubman, "Objective Metrics for Characterizing Automotive Interior Sound Quality," in Inter-Noise '92, 1067–1072.
9 D. Pan, "A Tutorial on MPEG/Audio Compression," IEEE Multimedia, 2(2): 60–74, 1995.
10 P. Noll, "MPEG Digital Audio Coding," IEEE Signal Processing Magazine, 14(5): 59–81, Sep. 1997.
11 Information Technology — Generic Coding of Moving Pictures and Associated Audio Information, Part 7: Advanced Audio Coding (AAC), International Standard: ISO/IEC 13818-7, 1997.
12 Information Technology — Coding of Audio-Visual Objects, Part 3: Audio, International Standard: ISO/IEC 14496-3, 1998.
13 Information Technology — Multimedia Framework, International Standard: ISO/IEC 21000, Parts 1–7, 2003.
14 Information Technology — Multimedia Content Description Interface, Part 4: Audio, International Standard: ISO/IEC 15938-4, 2001.
15 A.T. Lindsay, S. Srinivasan, J.P.A. Charlesworth, R.N. Garner, and W. Kriechbaum, "Representation and Linking Mechanisms for Audio in MPEG-7," Signal Processing: Image Communication, 16: 193–209, 2000.
16 P. Philippe, "Low-Level Musical Descriptors for MPEG-7," Signal Processing: Image Communication, 16: 181–191, 2000.
17 S. Shlien, "Guide to MPEG-1 Audio Standard," IEEE Transactions on Broadcasting, 40: 206–218, 1994.
18 K. Brandenburg, O. Kunz, and A. Sugiyama, "MPEG-4 Natural Audio Coding," Signal Processing: Image Communication, 15: 423–444, 2000.
19 E.D. Scheirer, "Structured Audio and Effects Processing in the MPEG-4 Multimedia Standard," Multimedia Systems, 7: 11–22, 1999.
20 J.D. Johnston, S.R. Quackenbush, J. Herre, and B. Grill, "Review of MPEG-4 General Audio Coding," in Multimedia Systems, Standards, and Networks, ed. A. Puri and T. Chen, New York: Marcel Dekker, 2000, 131–155.
21 E.D. Scheirer, Y. Lee, and J.W. Yang, "Synthetic Audio and SNHC Audio in MPEG-4," in Multimedia Systems, Standards, and Networks, ed. A. Puri and T. Chen, New York: Marcel Dekker, 2000, 157–177.

PART THREE

MULTIMEDIA COMMUNICATION AND RETRIEVAL

Chapter 15 Computer and Multimedia Networks 421
Chapter 16 Multimedia Network Communications and Applications 443
Chapter 17 Wireless Networks 479
Chapter 18 Content-Based Retrieval in Digital Libraries 511

Multimedia places great demands on networks and systems. This part examines several important multimedia networks and applications that are essential and challenging.

Multimedia Networks

With the ever-increasing bandwidth made available by breakthroughs in fiber optics, we are witnessing a convergence of telecommunication networks and computer and multimedia networks and a surge in mixed traffic types (Internet telephony, video-on-demand, etc.) through them. The technologies of multiplexing and scheduling are being constantly reexamined. Moreover, we are also witnessing an emergence of wireless networks (think about our cell phones and PDAs).

In Chapter 15, we look at basic issues and technologies for computer and multimedia networks, and in Chapter 16 we go on to consider multimedia network communications and applications. Chapter 17 provides a quick introduction to the basics of wireless networks and issues related to multimedia communication over these networks.

Content-Based Retrieval in Digital Libraries

Automated retrieval of syntactically and semantically useful contents from multimedia databases is crucial, especially when the contents have become so rich and the size of the databases has grown so rapidly. Chapter 18 looks at a particular application of multimedia database systems, examining the issues involved in content-based retrieval, storage, and browsing in digital libraries.

CHAPTER 15

Computer and Multimedia Networks

Computer networks are essential to the modern computing environment we know and have come to rely upon. Multimedia networks share all major issues and technologies of computer networks. Moreover, the ever-growing needs for various multimedia communications have made networks one of the most active areas for research and development.

This chapter will start with a review of some common techniques and terminologies in computer and multimedia networks, followed by an introduction to various high-speed networks, since they are becoming a central part of most contemporary multimedia systems.

15.1 BASICS OF COMPUTER AND MULTIMEDIA NETWORKS


15.1.1 OSI Network Layers
It has long been recognized that network communication is a complex task that involves multiple levels of protocols. A multilayer protocol architecture was thus proposed by the International Organization for Standardization (ISO) in 1984, called Open Systems Interconnection (OSI), documented by ISO Standard 7498. The OSI Reference Model has the following network layers [1, 2]:

1. Physical Layer. Defines electrical and mechanical properties of the physical interface (e.g., signal level, specifications of the connectors, etc.); also specifies the functions and procedural sequences performed by circuits of the physical interface.
2. Data Link Layer. Specifies the ways to establish, maintain, and terminate a link, such as transmission and synchronization of data frames, error detection and correction, and access protocol to the Physical layer.
3. Network Layer. Defines the routing of data from one end to the other across the network, such as circuit switching or packet switching. Provides services such as addressing, internetworking, error handling, congestion control, and sequencing of packets.
4. Transport Layer. Provides end-to-end communication between end systems that support end-user applications or services. Supports either connection-oriented or connectionless protocols. Provides error recovery and flow control.
5. Session Layer. Coordinates interaction between user applications on different hosts; manages sessions (connections), such as completion of long file transfers.


6. Presentation layer. Deals with the syntax of transmitted data, such as conversion
between different data formats and codes due to different conventions, compression, or
encryption.

7. Application layer. Supports various application programs and protocols, such as
FTP, Telnet, HTTP, SNMP, SMTP/MIME, and so on.

15.1.2 TCP/IP Protocols

The OSI protocol architecture, although instrumental in the development of computer networks,
did not gain full acceptance, due largely to the competing and more practical TCP/IP
set of protocols. TCP/IP protocols were developed before OSI and were funded mostly by
the U.S. Department of Defense. They became the de facto standard after their adoption by
the Internet.

Figure 15.1 compares the OSI and TCP/IP protocol architectures. It can be seen that
TCP/IP reduced the total number of layers and basically merged the top three OSI layers
into a single application layer. In fact, TCP/IP is even so flexible as to sometimes allow
application layer protocols to operate directly on IP.

OSI layer                 TCP/IP layer     Sample protocols
Application               Application      FTP, Telnet, SMTP/MIME, HTTP, SNMP, etc.
Presentation              (merged into Application)
Session                   (merged into Application)
Transport                 Transport        TCP (connection-oriented), UDP (connectionless)
Network                   Internet         IPv4, IPv6, RSVP
Data link (LLC and MAC)   Network access   X.25, Ethernet, Token ring, FDDI, PPP/SLIP, etc.
Physical                  Physical         10/100Base-T, 1000Base-T, Fibre Channel, etc.

FIGURE 15.1: Comparison of OSI and TCP/IP protocol architectures and sample protocols.

Transport Layer: TCP and UDP. TCP and UDP are two transport layer protocols
used in TCP/IP to facilitate host-to-host (or peer-to-peer) communications.

1. Transmission Control Protocol (TCP). TCP is connection-oriented: it provides reliable
data transfer between pairs of communicating processes across the network. It
handles the sending of application data to the destination process, regardless of datagram
or packet size. However, TCP/IP is established for packet-switched networks
only. Hence, there are no circuits, and data still have to be packetized.

TCP relies on the IP layer for delivering the message to the destination computer
specified by its IP address. It provides message packetizing, error detection, retransmission,
packet resequencing, and multiplexing. Since a process running TCP/IP is
required to be able to establish multiple network connections to a remote process,
multiplexing is achieved by identifying connections using port numbers.

For every TCP connection, both communicating computers allocate a buffer called a
window to receive and send data. Flow control is established by sending only as much data
as fits in the destination computer's window, without overflowing it. The
maximum amount of data that can be transmitted at a time is the size of the smaller window of
the two computers.

Each TCP datagram header contains the source and destination ports, sequence number,
checksum, window field, acknowledgment number, and other fields.

• The source and destination ports are needed for the source process to know
where to deliver the message and for the destination process to know where to
reply to the message (the computer's address is specified in the IP layer).

• As packets travel across the network, they can arrive out of order (by following
different paths), be lost, or be duplicated. A sequence number reorders arriving
packets and detects whether any are missing. The sequence number is actually
the byte count of the first data byte of the packet, rather than a serial number for
the packet.

• The checksum verifies with a high degree of certainty that the packet arrived
undamaged, despite channel interference. If the calculated checksum for the
received packet does not match the transmitted one, the packet is dropped.

• The window field specifies how many bytes the current computer's buffer can
receive. It is typically sent with acknowledgment packets.

• Acknowledgment (ACK) packets have the ACK number specified: the number
of bytes correctly received so far in sequence (corresponding to the sequence
number of the first missing packet).

The source process sends datagrams to the destination process up to the window number
and waits for ACKs before sending any more data. The ACK packet will arrive
with new window-number information to indicate how much more data the destination
buffer can receive. If an ACK is not received within a small time interval, specified by the
retransmission timeout (RTO), the packet is resent from the local window buffer. TCP/IP
does not specify congestion control mechanisms, yet every TCP/IP implementation
should include them.

Although TCP is reliable, the overhead of retransmission is often viewed as too high
for many real-time multimedia applications, such as streaming video. These will
typically use UDP.
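The window/ACK/RTO interplay described above can be modeled in a few lines. The following is a deliberately simplified, illustrative sender sketch (the function name and the `transmit`/`recv_ack` callbacks are invented for this example; real TCP adds sequence-number wraparound, congestion control, delayed ACKs, and much more):

```python
def send_with_window(data: bytes, window: int, transmit, recv_ack, rto: float = 0.5):
    """Sketch of a TCP-style sender: send no more than the advertised
    window past the last acknowledged byte, slide forward on cumulative
    ACKs, and retransmit from the local buffer when the RTO expires.

    transmit(seq, segment) sends one segment starting at byte offset seq;
    recv_ack(timeout) returns the cumulative ACK (bytes received in
    sequence) or None if the retransmission timeout expired.
    """
    base = 0                                   # first unacknowledged byte
    while base < len(data):
        end = min(base + window, len(data))    # respect the receiver's window
        transmit(base, data[base:end])         # (re)send the in-window bytes
        ack = recv_ack(timeout=rto)
        if ack is None:
            continue                           # RTO expired: resend from base
        base = max(base, ack)                  # slide the window forward
    return base

# Example: the first transmission "times out", so the sender retransmits.
sent = []
acks = iter([None, 4, 8])
send_with_window(b"abcdefgh", window=4,
                 transmit=lambda seq, seg: sent.append(seq),
                 recv_ack=lambda timeout: next(acks))
# sent == [0, 0, 4]: the segment at offset 0 was sent twice (one timeout),
# then the window slid forward and offset 4 was sent.
```

Note how a single cumulative ACK number is enough to drive the sender, mirroring the description above of the ACK number as "bytes correctly received so far in sequence".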
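The checksum mentioned for both TCP and UDP is the standard 16-bit Internet checksum: a ones'-complement sum over 16-bit words, complemented. The sketch below is illustrative (not code from the text); it also shows why a receiver that recomputes the sum over a segment plus its transmitted checksum can detect damage:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit Internet (ones'-complement) checksum of data.

    The data is treated as big-endian 16-bit words; a trailing odd byte is
    padded with zero. Carries out of the low 16 bits are folded back in,
    and the final sum is complemented.
    """
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

seg = b"example payload!"                          # even-length toy segment
ck = internet_checksum(seg)
# Re-running the sum over the segment plus its checksum yields 0 for an
# undamaged segment, which is how the receiver-side check works.
assert internet_checksum(seg + ck.to_bytes(2, "big")) == 0
```

In real TCP and UDP the sum also covers the header and a "pseudo-header" of IP addresses, which this sketch omits.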

2. User Datagram Protocol (UDP). UDP is connectionless: the message to be sent is a
single datagram. If the message is too long or requires guaranteed delivery, it will have
to be handled by the application layer. Essentially, the only thing UDP provides is
multiplexing and error detection through a checksum. Although the UDP header does
have fields to specify source and destination port numbers, the source port number is
optional, since the destination computer is not expected to reply to the message (there
is no acknowledgment).

Although UDP data transmission is much faster than TCP, it is unreliable, especially
in a congested network. The increasingly improving quality of fiber-optic networks
minimizes packet loss. In most real-time multimedia applications (e.g., streaming
video or audio), packets that arrive late are simply discarded. Although higher-level
protocols can be used for retransmission, flow control, and congestion avoidance, more
realistically error concealment must be explored for acceptable Quality of Service
(QoS).

Network Layer: Internet Protocol (IP). The IP layer provides two basic services:
packet addressing and packet fragmentation. Point-to-point message transmission is readily
supported within any Local Area Network (LAN), and in fact, LANs usually support
broadcast. However, when a message needs to be sent to a machine on a different LAN,
an intermediate device is needed to forward the message. The IP protocol provides for a
global addressing of computers across all interconnected networks, where every networked
computer (or device) is assigned a globally unique IP address.

For an IP packet to be transmitted across different LANs or Wide Area Networks (WANs),
gateways or routers are employed, which use routing tables to direct the messages according
to destination IP addresses. A gateway is a computer that usually resides at the edge of the
LAN and can send IP packets on both the LAN network interface and the WAN network
interface to communicate with other interconnected computers not on the LAN. A router is
a device that receives packets and routes them according to their destination address for the
same type of network.

The IP layer also has to translate the destination IP address of incoming packets to
the appropriate network address. In addition, routing tables identify for each destination
IP the next best router IP through which the packet should travel. Since the best route
can change depending on node availability, network congestion, and other factors, routers
have to communicate with each other to determine the best route for groups of IPs. The
communication is done using the Internet Control Message Protocol (ICMP).

IP is connectionless; it provides no end-to-end flow control. Every packet is treated
separately and is not related to past or future packets. Hence, packets can be received out
of order and can also be dropped or duplicated.

Packet fragmentation is performed when a packet has to travel over a network that accepts
only packets of a smaller size. In that case, IP packets are split into the required smaller
size, sent over the network to the next hop, and reassembled and resequenced there.

In its current version, IPv4 (IP version 4), IP addresses are 32-bit numbers, usually specified
using dotted decimal notation (e.g., 128.77.149.63 = 10000000 01001101 10010101
00111111). The 32-bit addressing in principle allows 2^32 (about 4 billion) addresses, which
seemed more than adequate. In reality, however, we could be running out of new IP addresses
soon (projected in year 2008).

This is not only because of the proliferation of personal computers and wireless devices
but also because IP addresses are assigned wastefully. For example, the IP address is of
the form (network number, host number). Under many network numbers, the percentage
of used host numbers is relatively small, not to mention some inactive hosts that may still
occupy their previously assigned addresses.

As a short-term solution to the shortage of IP address availability (due to limitations
of service provider or cost), some LANs use proxy servers or Network Address Translation
(NAT) devices that proxy servers implement (in addition to content caching and other
features). The NAT device separates the LAN from the interconnected network and has
only one IP address to handle the communication of all the computers on the LAN. Each
computer on a LAN is assigned a local IP address that cannot be accessed from the interconnected
network. The NAT device typically maintains a dynamic NAT table that translates
communication ports used with its public IP address to the ports and local IP addresses of
the communicating computers.

When a local computer sends an IP packet with the local address as the source, it goes
through the NAT device, which changes the source IP address to the NAT device's IP address,
which is global. When an IP packet arrives on some communication port at the NAT IP
address, the destination address is changed to the local IP address according to the NAT
table, and the packet is forwarded to the appropriate computer.

In January 1995, IPv6 (IP version 6) was recommended as the next generation IP (IPng)
by the Internet Engineering Task Force (IETF) in its Request for Comments (RFC) 1752,
"The Recommendation for the IP Next Generation Protocol". Among many improvements
over IPv4, it adopts 128-bit addresses, allowing 2^128 (about 3.4 x 10^38) addresses [2]. This will
certainly settle the problem of shortage of IP addresses for a long time (if not forever).

15.2 MULTIPLEXING TECHNOLOGIES

Modern communication links usually have high capacity. This became even more true after
the introduction of fiber-optic networks. When the link capacity far exceeds any individual
user's data rate, multiplexing must be introduced for users to share the capacity.

In this section, we examine the basic multiplexing technologies, followed by a survey of
several modern networks, such as ISDN, SONET, and ADSL.

15.2.1 Basics of Multiplexing

1. Frequency Division Multiplexing (FDM). In FDM, multiple channels are arranged
according to their frequency. Analogously, radios and televisions are good examples
of FDM: they share the limited bandwidth of broadcast bands in the air by dividing
them into many channels. Nowadays, cable TV resembles an FDM data network even
more closely, since it has similar transmission media. Ordinary voice channels and
TV channels have conventional bandwidths of 4 kHz for voice, 6 MHz for NTSC TV,
and 8 MHz for PAL or SECAM TV.
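The frequency-slot idea behind FDM can be made concrete with a small channel-plan helper. The function and the example numbers below are illustrative only; they echo the 6 MHz TV channel width mentioned above, with an assumed 54 MHz starting frequency:

```python
def fdm_channel_plan(base_hz, bandwidth_hz, n_channels):
    """Lay out n contiguous, non-overlapping FDM channels above base_hz.

    Channel i occupies [base + i*B, base + (i+1)*B]; its carrier is placed
    at the center of the slot, as in broadcast radio/TV frequency plans.
    Returns a list of (low, carrier, high) frequency triples.
    """
    plan = []
    for i in range(n_channels):
        low = base_hz + i * bandwidth_hz
        high = low + bandwidth_hz
        plan.append((low, (low + high) / 2.0, high))
    return plan

# Three 6 MHz TV-style channels starting at an assumed 54 MHz base:
for low, carrier, high in fdm_channel_plan(54e6, 6e6, 3):
    print(f"{low/1e6:.0f}-{high/1e6:.0f} MHz, carrier at {carrier/1e6:.0f} MHz")
# 54-60 MHz, carrier at 57 MHz
# 60-66 MHz, carrier at 63 MHz
# 66-72 MHz, carrier at 69 MHz
```

Each receiver then band-pass filters one such slot and demodulates it, exactly as described for FDM in the text.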

For FDM to work properly, analog signals must be modulated first, with a unique
carrier frequency for each channel. As a result, the signal occupies a bandwidth
B, centered at the carrier frequency. The receiver uses a band-pass filter tuned for the channel of interest
to capture the signal, then uses a demodulator to decode it.

Basic modulation techniques include Amplitude Modulation (AM), Frequency Modulation
(FM), and Phase Modulation (PM). A combination of Amplitude Modulation
and Phase Modulation yields the Quadrature Amplitude Modulation (QAM) method
[1, 2] used in many modern applications.

Digital data is often transmitted using analog signals. The classic example is a modem
(modulator-demodulator) transmitting digital data on telephone networks. A carrier
signal is modulated by the digital data before transmission, then demodulated upon its
reception to recover the digital data. Basic modulation techniques are Amplitude-Shift
Keying (ASK), Frequency-Shift Keying (FSK), and Phase-Shift Keying (PSK). QPSK
(Quadrature Phase-Shift Keying) is an advanced version of PSK that uses a phase
shift of 90 degrees instead of 180 degrees [2]. Like QAM, it can also combine phase
with amplitude, so as to carry multiple bits on each subcarrier.

2. Wavelength Division Multiplexing (WDM). WDM is a variation of FDM that is
especially useful for data transmission in optical fibers. In essence, light beams representing
channels of different wavelengths are combined at the source and transmitted
within the same fiber; they are split again at the receiver end. The combining and
splitting of light beams is carried out by optical devices [e.g., the Add-Drop Multiplexer
(ADM)], which are highly reliable and more efficient than electronic circuits. Since
the bandwidth of each fiber is very high (> 25 terahertz for each band), the capacity
of WDM is tremendous: a huge number of channels can be multiplexed. As a
result, the aggregate bitrate of fiber trunks can potentially reach dozens of terabits per
second.

Two variations of WDM are

• Dense WDM (DWDM), which employs densely spaced wavelengths to allow
a larger number of channels than WDM (e.g., more than 32)

• Wideband WDM (WWDM), which allows the transmission of light over
a wider range of wavelengths (e.g., 1310 to 1557 nm for long reach and 850 nm
for short reach) to achieve a larger capacity than WDM

3. Time Division Multiplexing (TDM). As described above, FDM is more suitable for
analog data and is less common in digital computer networks. TDM is a technology
for directly multiplexing digital data. If the source data is analog, it must first be
digitized and converted into Pulse Code Modulation (PCM) samples, as described in
Chapter 6.

In TDM, multiplexing is performed along the time (t) dimension. Multiple buffers are
used for m (m > 1) channels. A bit (or byte) will be taken from each buffer at one of
the m cycled time slots until a frame is formed. The TDM frame will be transmitted
and then demultiplexed after its reception.

The scheme described above is known as Synchronous TDM, in which each of the m
buffers is scanned in turn and treated equally. If, at a given time slot, some sources
(accordingly, buffers) do not have data to transmit, the slot is wasted. Asynchronous
TDM (or Statistical TDM) gathers the statistics of the buffers in this regard. It will
assign only k (k < m) time slots to scan the k buffers likely to have data to send.
Asynchronous TDM has the potential for higher throughput, given the same carrier
data rate. There is, however, an overhead, since now the source address must also be
sent, along with the data, to have the frame demultiplexed correctly.

Traditionally, voice data over a telephone channel has a bandwidth of 4 kHz. According
to the Nyquist theorem, 8,000 samples per second are required for a good
digitization. This yields a time interval of 125 usec for each sample. Each channel
can transmit 8 bits per sample, producing a gross data rate (including data and control)
for each voice channel of 8 x 8,000 = 64 kbps.

In North America and Japan, a T1 carrier¹ is basically a synchronous TDM of 24
voice channels (i.e., 24 time slots), of which 23 are used for data and the last one
for synchronization. Each T1 frame contains 8 x 24 = 192 bits, plus one bit for
framing [1, 2]. This yields a gross data rate of 193 bits per 125 usec, that is,
193 bits/sample x 8,000 samples/sec = 1.544 Mbps.

¹The format for the T1 carrier is called DS1, T2 is called DS2, and so on. Less strictly, these two notations (T
and DS) are often used interchangeably.

Four T1 carriers can be further multiplexed to yield a T2. Note that T2 has a gross
data rate of 6.312 Mbps, which is more than 4 x 1.544 = 6.176 Mbps, because more
framing and control bits are needed. In a similar fashion, T3 and T4 are created.

Similar carrier formats have been defined by the ITU-T, with level 1 (E1) starting at
2.048 Mbps, in which each frame consists of 32 time slots: 8 x 32 x 8,000 = 2.048
Mbps. Two slots are used for framing and synchronization; the other 30 are for data
channels. The multiplexed number of channels quadruples at each of the next levels:
E2, E3, and so on. Table 15.1 compares the data rates of both TDM carrier standards.

TABLE 15.1: Comparison of TDM Carrier Standards

Format   Number of channels   Data rate (Mbps)     Format   Number of channels   Data rate (Mbps)
T1              24                 1.544            E1              32                 2.048
T2              96                 6.312            E2             128                 8.448
T3             672                44.736            E3             512                34.368
T4            4032               274.176            E4            2048               139.264
                                                    E5            8192               565.148

15.2.2 Integrated Services Digital Network (ISDN)

For over a century, Plain Old Telephone Service (POTS) was supported by the public circuit-switched
telephone system for analog voice transmission. In the 1980s, the ITU-T started to

develop ISDN to meet the needs of various digital services (e.g., caller ID, instant call setup,
teleconferencing) in which digital data, voice, and sometimes video (e.g., in videoconferencing)
can be transmitted.

By default, ISDN refers to Narrowband ISDN. The ITU-T has subsequently developed
Broadband ISDN (B-ISDN). Its default switching technique is Asynchronous Transfer Mode
(ATM) [3], which will be discussed later.

ISDN defines several types of full-duplex channels:

• B (bearer)-channel. 64 kbps each. B-channels are for data transmission. Mostly
they are circuit-switched, but they can also support packet switching. If needed, one
B-channel can be readily used to replace POTS.

• D (delta)-channel. 16 kbps or 64 kbps. The D-channel takes care of call setup, call
control (call forwarding, call waiting, etc.), and network maintenance. The advantage
of having a separate D-channel is that control and maintenance can be done in real time
in the D-channel while B-channels are transmitting data.

The following are the main specifications of ISDN:

• It adopts Synchronous TDM, in which the above channels are multiplexed.

• Two types of interfaces were available to users, depending on the data and subscription
rates:

— Basic Rate Interface provides two B-channels and one D-channel (at 16 kbps).
The total of 144 kbps (64 x 2 + 16) is multiplexed and transmitted over a 192
kbps link.

— Primary Rate Interface provides 23 B-channels and one D-channel (at 64
kbps) in North America and Japan; 30 B-channels and two D-channels (at 64
kbps) in Europe. The 23B and 1D fit in T1 nicely, because T1 has 24 time slots
and a data rate of 24 slots x 64 kbps/slot = 1,536 kbps (1.544 Mbps with
the framing bit); whereas the 30B and
2D fit in E1, which has 32 time slots (30 of them available for user channels)
and a data rate of 32 x 64 = 2,048 kbps.

Because of its relatively slow data rate and high cost, narrowband ISDN has generally
failed to meet the requirements of data and multimedia networks. For home computer/Internet
users, it has largely been replaced by Cable Modem and the Asymmetric Digital Subscriber Line
(ADSL) discussed below.

15.2.3 Synchronous Optical NETwork (SONET)

SONET is a standard initially developed by Bellcore for optical fibers that support data rates
much beyond T3. Subsequent SONET standards are coordinated and approved by ANSI in
ANSI T1.105, T1.106, and T1.107. SONET uses circuit switching and synchronous TDM.

In optical networks, electrical signals must be converted to optical signals for transmission
and converted back after their reception. Accordingly, SONET uses the terms Synchronous
Transport Signal (STS) for the electrical signals and Optical Carrier (OC) for the optical
signals.

An STS-1 (OC-1) frame consists of 810 TDM bytes. It is transmitted in 125 usec,
8,000 frames per second, so the data rate is 810 x 8 x 8,000 = 51.84 Mbps. All other
STS-N (OC-N) signals are further multiplexing of STS-1 (OC-1) signals. For example,
three STS-1 (OC-1) signals are multiplexed for each STS-3 (OC-3) at 155.52 Mbps.

Instead of SONET, the ITU-T developed a similar standard, Synchronous Digital Hierarchy
(SDH), using the technology of the Synchronous Transport Module (STM). STM-1 is the lowest
in SDH; it corresponds to STS-3 (OC-3) in SONET.

Table 15.2 lists the SONET electrical and optical levels and their SDH equivalents and
data rates. Among all, OC-3 (STM-1), OC-12 (STM-4), OC-48 (STM-16), and OC-192
(STM-64) are the ones mostly used.

TABLE 15.2: Equivalency of SONET and SDH

SONET              SONET            SDH           Line rate    Payload rate
electrical level   optical level    equivalent    (Mbps)       (Mbps)
STS-1              OC-1             —             51.84        50.112
STS-3              OC-3             STM-1         155.52       150.336
STS-9              OC-9             STM-3         466.56       451.008
STS-12             OC-12            STM-4         622.08       601.344
STS-18             OC-18            STM-6         933.12       902.016
STS-24             OC-24            STM-8         1244.16      1202.688
STS-36             OC-36            STM-12        1866.24      1804.032
STS-48             OC-48            STM-16        2488.32      2405.376
STS-96             OC-96            STM-32        4976.64      4810.752
STS-192            OC-192           STM-64        9953.28      9621.504

15.2.4 Asymmetric Digital Subscriber Line (ADSL)

ADSL is the telephone industry's answer to the last mile challenge: delivering fast network
service to every home. It adopts a higher data rate downstream (from network to subscriber)
and a lower data rate upstream (from subscriber to network); hence, it is asymmetric.

ADSL makes use of existing telephone twisted-pair lines to transmit Quadrature Amplitude
Modulated (QAM) digital signals. Instead of the conventional 4 kHz for audio signals
on telephone wires, the signal bandwidth on ADSL lines is pushed to 1 MHz or higher.
ADSL uses FDM (Frequency Division Multiplexing) to multiplex three channels:

• The high-speed (1.5 to 9 Mbps) downstream channel at the high end of the spectrum
• A medium-speed (16 to 640 kbps) duplex channel

• A POTS channel at the low end (next to DC, 0–4 kHz) of the spectrum²

²Alternatively, an ISDN channel can be supported in place of the low- and medium-speed channels.

The three channels can themselves be further divided into 4 kHz subchannels (e.g., 256
subchannels for the downstream channel, for a total of 1 MHz). The multiplexing scheme
among these subchannels is also FDM.

Because signals (especially the higher-frequency signals near or at 1 MHz) attenuate
quickly on twisted-pair lines, and noise increases with line length, the signal-to-noise ratio
will drop to an unacceptable level after a certain distance. Not considering the effect of
bridged taps, ADSL has the distance limitations shown in Table 15.3 when using only
ordinary twisted-pair copper wires.

TABLE 15.3: Maximum Distances for ADSL Using Twisted-Pair Copper Wire
(columns: data rate, wire size, distance; the rows for 1.544 Mbps and 6.1 Mbps survive,
but the wire-size and distance entries are illegible in this copy)

The key technology for ADSL is Discrete Multi-Tone (DMT). For better transmission
in potentially noisy channels (either downstream or upstream), the DMT modem sends test
signals to all subchannels first. It then calculates the signal-to-noise ratios, to dynamically
determine the amount of data to be sent in each subchannel. The higher the SNR, the more
data sent. Theoretically, 256 downstream subchannels, each capable of carrying over 60
kbps, will generate a data rate of more than 15 Mbps. In reality, DMT delivers 1.5 to 9
Mbps under current technology.

Table 15.4 offers a brief history of various digital subscriber lines (xDSL). DSL corresponds
to the basic-rate ISDN service. HDSL was an effort to deliver the T1 (or E1) data
rate within a low bandwidth (196 kHz) [2]. However, it requires two twisted pairs for 1.544
Mbps or three twisted pairs for 2.048 Mbps. SDSL provides the same service as HDSL on
a single twisted-pair line. VDSL is a standard that is still actively evolving and forms the
future of xDSL.

TABLE 15.4: History of Digital Subscriber Lines

Name           Meaning                                       Data rate
V.32 or V.34   Voice band modems                             1.2 to 56 kbps
DSL            Digital subscriber line                       160 kbps
HDSL           High data rate digital subscriber line        1.544 or 2.048 Mbps
SDSL           Single line digital subscriber line           1.544 or 2.048 Mbps
ADSL           Asymmetric digital subscriber line            1.5 to 9 Mbps downstream; 16 to 640 kbps upstream
VDSL           Very high data rate digital subscriber line   13 to 52 Mbps downstream; 1.5 to 2.3 Mbps upstream

15.3 LAN AND WAN

A Local Area Network (LAN) is restricted to a small geographical area, usually to a relatively
small number of stations. A Wide Area Network (WAN) refers to networks across cities
and countries. Between LAN and WAN, the term Metropolitan Area Network (MAN) is
sometimes also used.

15.3.1 Local Area Networks (LANs)

Most LANs use a broadcast technique. Without exception, they use a shared medium.
Hence, medium access control is an important issue.

The IEEE 802 committee developed the IEEE 802 reference model for LANs. Since
layer 3 and above in the OSI reference model are applicable to either LAN, MAN, or WAN,
the main developments of the IEEE 802 standards are on the lower layers, the Physical and
the Data Link layers. In particular, the Data Link layer's functionality is enhanced, and the
layer has been divided into two sublayers:

• Medium Access Control (MAC) layer. This layer assembles or disassembles frames
upon transmission or reception, performs addressing and error correction, and regulates
access control to a shared physical medium.

• Logical Link Control (LLC) layer. This layer performs flow and error control and
MAC-layer addressing. It also acts as an interface to higher layers. LLC is above
MAC in the hierarchy.

Following are some of the active IEEE 802 subcommittees and the areas they define:

• 802.1 (Higher Layer LAN Protocols). The relationship between the 802.X standards
and the OSI reference model, the interconnection and management of the LANs

• 802.2 (LLC). The general standard for logical link control (LLC)

• 802.3 (Ethernet). Medium access control (CSMA/CD) and physical layer specifications
for Ethernet
• 802.5 (Token Ring). Medium access control and physical layer specifications for
token ring

• 802.9. LAN interfaces at the medium access control and physical layers for integrated
services

• 802.10 (Security). Interoperable LAN/MAN security for other IEEE 802 standards

• 802.11 (Wireless LAN). Medium access method and physical layer specifications for
wireless LAN (WLAN)

• 802.14 (Cable-TV based broadband communication network). Standard protocol
for two-way transmission of multimedia services over cable TV; e.g., Hybrid Fiber
Coax (HFC) cable modem and cable network

• 802.15 (Wireless PAN). Access method and physical layer specifications for Wireless
Personal Area Network (WPAN). A Personal Area Network (PAN) supports
coverage on the order of 10 meters

• 802.16 (Broadband wireless). Access method and physical layer specifications for
broadband wireless networks

Ethernet. Ethernet is a packet-switched bus network; it is the most popular LAN to
date. As of 1998, the coverage of Ethernets had reached 85% of networked computers. To
send a message, the recipient's Ethernet address is attached to the message, and the message
is sent to everyone on the bus. Only the designated station will receive the message, while
others will ignore it.

The problem of medium access control for the network is solved by Carrier Sense Multiple
Access with Collision Detection (CSMA/CD). The station that wishes to send a message
must listen to the network (carrier sense) and wait until there is no traffic. Apparently,
multiple stations could be waiting and then send their messages at the same time, causing
a collision. During frame transmission, the station compares the signals received with the
ones sent. If they are different, it detects a collision. Once a collision is detected, the station
stops sending the frame, and the frame is retransmitted after a random delay.

A good transmission medium for Ethernet is coaxial cable (or optical fiber for newer
generations). However, it is also possible to use twisted pair. Since these are simply
telephone wires, in most cases they are already in office buildings or homes and do not need
to be reinstalled.

Often a star LAN is used, in which each station is connected directly to a hub, which
also helps cope with the potential of lower transmission quality. The hub is an active device
and acts as a repeater. Every time it receives a signal from one station, it repeats it, so other
stations will hear. Logically, this is still a bus, although it is physically a star network.

The maximum data rate for ordinary Ethernet is 10 Mbps. For the 10 Mbps LAN,
unshielded twisted pair was used in 10BASE-T within 100 meters, whereas optical fiber
was used in 10BASE-F up to 2 kilometers.

Fast Ethernet (known as 100BASE-T) has a maximum data rate of 100 Mbps³ and is
entirely Ethernet-compatible. Indeed, it is common nowadays to mix 100BASE-T and
10BASE-T through a switch (instead of a hub): a 100BASE-T link between the server
and the 100BASE-T switch, and several 10BASE-T links between the switch and workstations.
Since the switch is capable of handling multiple communications at the same time, all
workstations can communicate up to a maximum data rate of 10 Mbps.

³Next generation Ethernets are Gigabit Ethernet and 10-Gigabit Ethernet, which will be described later.

Token Ring. Stations on a token ring are connected in a ring topology, as the name
suggests. Data frames are transmitted in one direction around the ring and can be read by
all stations. The ring structure can be merely logical when stations are actually (physically)
connected to a hub, which repeats and relays the signal down the "ring".

A small frame, called a token, circulates while the ring is idle. To transmit, a station S
must wait until the token arrives. The source station S then seizes the token and converts
it into the front end of its data frame, which then travels on the ring and is received by the
destination station. The data frame continues traveling on the ring until it comes back to
station S, which releases it and puts the token back onto the ring.

Access to the shared medium is regulated by allowing only one token; hence, collision
is avoided. By default, the ring operates in a round-robin fashion. Every time a token
is released, the next station gets the chance to take it, and so on. Optionally, a multiple-priority
scheme can also be used for access control: a station can transmit a frame at a
given priority if it can grab a token with an equal or lower priority; otherwise, it makes a
reservation and waits for its turn.

The data rates of the token rings were either 4 Mbps or 16 Mbps over shielded twisted
pair. The 4 Mbps ring manipulates the token as described above. In the 16 Mbps ring, the
token can be released as soon as the source station sends out the data frame. This increases
ring usage by allowing more than one frame to travel on the ring simultaneously. New
technology has enabled 100 Mbps token rings [2], and IEEE 802.5v was a feasibility study
for Gigabit token ring in 1998.

Fiber Distributed Data Interface (FDDI). FDDI is a successor of the original token
ring [4]. Medium access control (MAC) of FDDI is similar to MAC in IEEE 802.5 described
above for token rings.

FDDI has a dual-ring topology, with its primary ring for data transmission and secondary
ring for fault tolerance [5]. If damage is detected in both rings, they can be joined to function
as a single ring.

The bitrate of FDDI is 100 Mbps. Because of the relatively fast transmission speed,
the source station will simply absorb the token (instead of converting it into part of its data
frame, as in the original token ring) before sending its data frame(s).

In FDDI, once a station captures a token, it is granted a time period and may send as
many data frames as it can within the period. Also, the token will be released as soon as the
frames are transmitted (early token release).
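The round-robin token discipline used by token rings (and, with a token-holding period, by FDDI) can be sketched as a tiny simulation. This model is illustrative only: it ignores priorities, reservations, and release timing, and simply lets each station transmit at most one waiting frame per token visit:

```python
from collections import deque

def token_ring_round(queues):
    """Simulate one full token rotation over a ring of station queues.

    queues: a list of deques, one per station, holding frames waiting to
    be sent. Returns the frames transmitted in ring order. Because only
    the station holding the token may transmit, access is collision-free,
    and the rotation visits every station, so access is also fair.
    """
    transmitted = []
    for station, q in enumerate(queues):   # token travels station by station
        if q:                              # seize the token only if data waits
            transmitted.append((station, q.popleft()))
        # otherwise the token passes straight to the next station
    return transmitted

# Stations 0 and 2 each hold frames; one rotation serves one frame from each.
ring = [deque(["f0a", "f0b"]), deque(), deque(["f2a"])]
print(token_ring_round(ring))   # [(0, 'f0a'), (2, 'f2a')]
print(token_ring_round(ring))   # [(0, 'f0b')]
```

Each rotation gives every station one transmission opportunity, which is exactly the collision-free, round-robin behavior the text describes.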

The FDDI network is allowed to spread over distances up to 100 km. It supports up to 500 stations, as long as the maximum distance of neighboring stations is less than 2 kilometers. Hence, FDDI is primarily used in LAN or MAN backbones.

FDDI supports both synchronous and asynchronous modes [5]. Synchronous mode enables bandwidth reservation and guaranteed data transmission up to the synchronous capacity. Asynchronous mode is similar to the token ring protocol. FDDI-2 supports an additional mode — isochronous mode [5], in which the network is time-sliced, with each machine getting a fixed piece. FDDI can thus provide isochronous services for delay-sensitive applications (such as audio and video) and synchronous and asynchronous services for others in the same network.

15.3.2 Wide Area Networks (WANs)

WAN usually refers to networks across cities and countries. Instead of broadcast, they invariably use some type of switching technologies.

Switching Technologies. The common types of switching technologies are circuit switching and packet switching. The latter also has its modern variants of frame relay and cell relay.

• Circuit Switching. The public switched telephone network (PSTN) is a good example of circuit switching, in which an end-to-end circuit (duplex, in this case) must be established that is dedicated for the duration of the connection at a guaranteed bandwidth. Although initially designed for voice communications, it can also be used for data transmission. Indeed, it is still the basis for narrowband ISDN, discussed in Section 15.2.2. To cope with multi-users and variable data rates, it adopts FDM or synchronous TDM multiplexing.

Circuit switching is preferable if the user demands a connection and/or more or less constant data rates, as in certain constant-bitrate video communications. It is inefficient for general multimedia communication, especially for variable (sometimes bursty) data rates.

• Packet Switching. Packet switching is used for almost all data networks in which data rates tend to be variable and sometimes bursty. Before transmission, data is broken into small packets, usually 1,000 bytes or less. The header of each packet carries necessary control information, such as destination address, routing, and so on. X.25 was the most commonly used protocol for packet switching.

Generally, two approaches are available to switch and route the packets: datagram and virtual circuits. In the former, each packet is treated independently as a datagram. No transfer route is predetermined prior to the transmission; hence, packets may be unknowingly lost or arrive in the wrong order. It is up to the receiving station to detect and recover the errors, as is the case with TCP/IP.

In virtual circuits, a route is predetermined through request and accept by all nodes along the route. It is a "circuit" because the route is fixed (once negotiated) and used for the duration of the connection; nonetheless, it is "virtual" because the "circuit" is only logical and not dedicated, and packets from the same source to the same destination can be transferred through different "circuits". Sequencing (ordering the packets) is much easier in virtual circuits. Retransmission is usually requested upon detection of an error.

Packet switching becomes ineffective when the network is congested and becomes unreliable by severely delaying or losing a large number of packets.

• Frame Relay. Modern high-speed links have low error rate; in optical fiber, it can be down to the order of 10^-12. Many bits added to each packet for excessive error checking in ordinary packet switching (X.25) thus become unnecessary.

As X.25, frame relay works at the data link control layer. Frame relay made the following major changes to X.25:

— Reduction of error checking. No more acknowledgment, no more hop-to-hop flow control and error control. Optionally, end-to-end flow control and error control can be performed at a higher layer.

— Reduction of layers. The multiplexing and switching virtual circuits are changed from layer 3 in X.25 to layer 2. Layer 3 of X.25 is eliminated.

Frame relay is basically a cheaper version of packet switching, with minimal services. Frames have a length up to 1,600 bytes. When a bad frame is received, it will simply be discarded. The data rate for frame relay is thus much higher, in the range of T1 (1.5 Mbps) to T3 (44.7 Mbps).

• Cell Relay (ATM). Asynchronous transfer mode adopts small and fixed-length (53 bytes) packets referred to as cells. Hence, ATM is also known as cell relay.

As Figure 15.2 shows, the small packet size is beneficial in reducing latency in ATM networks. When the darkened packet arrives slightly behind another packet of a normal size (e.g., 1 kB) in Figure 15.2(a), it must wait for the completion of the other's transmission, causing serialization delay. When the packet (cell) size is small, as in Figure 15.2(b), much less waiting time is needed for the darkened cell to be sent. This turns out to significantly increase network throughput, which is especially beneficial for real-time multimedia applications. ATM is known to have the potential to deliver high data rates at hundreds (and thousands) of Mbps.

Figure 15.3 compares the above four switching technologies in terms of their bitrates and complexity. It can be seen that circuit switching is the least complex and offers a constant (fixed) data rate, while packet switching is the opposite.

15.3.3 Asynchronous Transfer Mode (ATM)

Ever since the 1980s, the dramatic increase in data communications and multimedia services (voice, video, etc.) has posed a major challenge to telecommunication networks. With the ever-expanding bandwidth through optical fiber, broadband ISDN (B-ISDN) became a reality. By 1990, the ITU-T (formerly CCITT) adopted synchronous optical network/synchronous digital hierarchy (SONET/SDH) as the base of B-ISDN. Since SONET

uses circuit switching technology and specifies only the transmission and multiplexing of data, a new standard for switching technology was desired.

ATM can provide high speed and low delay — its operational version has been scaled to 2.5 Gbps (OC48). ATM is also flexible in supporting various technologies, such as Frame relay (bursty), IP Ethernet, xDSL, SONET/SDH, and wireless networks. Moreover, it is capable of guaranteeing predefined levels of Quality of Service (QoS). Hence, ATM was chosen as the switching technology for B-ISDN.

FIGURE 15.2: Latency: (a) serialization delay in a normal packet switching network; (b) lower latency in a cell network.

FIGURE 15.3: Comparison of different switching techniques. (Circuit switching offers a fixed data rate at the lowest complexity; cell relay (ATM) and frame relay lie in between; packet switching offers a variable data rate at the highest complexity.)

Initially, ATM was used for WANs, especially serving as backbones. Nowadays, it is also used in LAN applications.

The ATM Cell Structure. ATM cells have a fixed format: their size is 53 bytes, of which the first 5 bytes are for the cell header, followed by 48 bytes of payload.

The ATM layer has two types of interfaces: User-Network Interface (UNI) is local, between a user and an ATM network, and Network-Network Interface (NNI) is between ATM switches.

FIGURE 15.4: ATM UNI cell header (40 bits = 5 bytes). (GFC = General Flow Control; VPI = Virtual Path Identifier; VCI = Virtual Channel Identifier; PT = Payload Type; CLP = Cell Loss Priority; HEC = Header Error Check.)

Figure 15.4 illustrates the structure of an ATM UNI cell header. The header starts with a 4-bit general flow control (GFC) field, which controls traffic entering the network at the local user-network level. It is followed by an 8-bit Virtual Path Identifier (VPI) and 16-bit Virtual Channel Identifier (VCI) for selecting a particular virtual path and virtual circuit, respectively. The combination of VPI (8 bits) and VCI (16 bits) provides a unique routing indicator for the cell. As an analogy, VPI is like an area code (604), and VCI is like the following digits (555.1212) in a phone number.

The 3-bit payload type (PT) specifies whether the cell is for user data or management and maintenance, network congestion, and so on. For example, 000 indicates user data cell type 0, no congestion; 010 indicates user data cell type 0, congestion experienced. PT may be altered by the network, say from 000 to 010, to indicate that the network has become congested.

The 1-bit cell loss priority (CLP) allows the specification of a low-priority cell when CLP is set to 1. This provides a hint to the ATM switches about which cells to drop when the network is congested.

The 8-bit header error detection (HEC) checks errors only in the header (not in the payload). Since the rest of the header is only 32 bits long, this is a relatively long 8-bit field; it is used for both error checking and correction [2].

The NNI header is similar to the UNI header, except it does not have the 4-bit GFC. Instead, its VPI is increased to 12 bits.
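The field layout just described can be made concrete with a short sketch. Python is used here purely for illustration; the HEC computation (a CRC-8 with polynomial x^8 + x^2 + x + 1 whose result is XORed with 0x55) follows ITU-T I.432 rather than this text:

```python
# Pack and unpack the 5-byte ATM UNI cell header described above.
# Field widths: GFC 4 bits, VPI 8, VCI 16, PT 3, CLP 1, HEC 8.

def hec(first4: bytes) -> int:
    """CRC-8 over the first four header bytes (poly 0x07), XORed with 0x55."""
    crc = 0
    for byte in first4:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc ^ 0x55

def pack_uni_header(gfc: int, vpi: int, vci: int, pt: int, clp: int) -> bytes:
    # First 32 bits: GFC | VPI | VCI | PT | CLP, most significant field first.
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    first4 = word.to_bytes(4, "big")
    return first4 + bytes([hec(first4)])

def unpack_uni_header(header: bytes) -> dict:
    word = int.from_bytes(header[:4], "big")
    return {"gfc": word >> 28, "vpi": (word >> 20) & 0xFF,
            "vci": (word >> 4) & 0xFFFF, "pt": (word >> 1) & 0x7,
            "clp": word & 0x1, "hec_ok": header[4] == hec(header[:4])}

cell_header = pack_uni_header(gfc=0, vpi=6, vci=42, pt=0b000, clp=1)
fields = unpack_uni_header(cell_header)   # vpi=6, vci=42, clp=1, hec_ok=True
```

A receiver recomputes the HEC on every arriving header; because the protected data is only 32 bits, the 8-bit field is strong enough to correct single-bit errors as well as detect multi-bit ones, as noted above.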
ATM Layers and Sublayers. Figure 15.5 illustrates the comparison between OSI layers and ATM layers and sublayers at ATM Adaptation Layer (AAL) and below. As

shown, AAL corresponds to the OSI Transport layer and part of the Network layer. It consists of two sublayers: convergence sublayer (CS) and segmentation and reassembly (SAR). CS provides interface (convergence) to user applications. SAR is in charge of cell segmentation and reassembly.

FIGURE 15.5: Comparison of OSI (layer 4 and below) and ATM layers. (AAL = ATM Adaptation Layer; CS = Convergence Sublayer; SAR = Segmentation and Reassembly; TC = Transmission Convergence; PMD = Physical Medium Dependent.)

The ATM layer corresponds to parts of the OSI Network and Data Link layers. Its main functions are flow control, management of virtual circuit and path, and cell multiplexing and demultiplexing. The ATM Physical layer consists of two sublayers: Transmission Convergence (TC) and Physical Medium Dependent (PMD). PMD corresponds to the OSI Physical layer, whereas TC does header error checking and packing/unpacking frames (cells). This makes the ATM Physical layer very different from the OSI Physical layer, where framing is left for the OSI Data Link layer.

15.3.4 Gigabit and 10-Gigabit Ethernets

Gigabit Ethernet became a standard (IEEE 802.3z) in 1998 [2]. It employs the same frame format and size as the previous Ethernets and is backward compatible with 10BASE-T and 100BASE-T. It is generally known as 1000BASE-T, although it can be further classified as 1000BASE-LX, 1000BASE-SX, 1000BASE-CX, and 1000BASE-T when it uses various fiber or copper media. The maximum link distance under 1000BASE-LX is 5 kilometers for single-mode optical fiber (SM fiber), 550 meters for multi-mode fiber (MM fiber), and merely 25 meters for shielded twisted pair.

TABLE 15.5: Comparison of Fast, Gigabit, and 10-Gigabit Ethernets.

                    Fast Ethernet        Gigabit Ethernet       10-Gigabit Ethernet
                    (100BASE-T)          (1000BASE-T)
Data rate           100 Mbps             1 Gbps                 10 Gbps
Transmission mode   Full or half duplex  Full or half duplex    Full duplex only
Access method       CSMA/CD              CSMA/CD                N/A (no collision)
Medium              Copper or fiber      Fiber or copper        Fiber only
Target distance     Up to 2 km (fiber)   Up to 5 km (SM fiber)  Up to 40 km (SM fiber)
                    200 m (copper)       550 m (MM fiber)       300 m (MM fiber)
                                         25 m (copper)
Network type        LAN                  LAN/MAN                LAN/MAN/WAN
IEEE standard       802.3u               802.3z                 802.3ae
Year                1995                 1998                   2002

Gigabit Ethernet adopts full-duplex modes for connections to and from switches and half-duplex modes for shared connections that use repeaters. Since collisions do occur frequently in half-duplex modes, Gigabit Ethernet uses the standard Ethernet access method, Carrier Sense Multiple Access with Collision Detection (CSMA/CD), as in its predecessors. Gigabit Ethernet has been rapidly replacing Fast Ethernet and FDDI, especially in network backbones. It has gone beyond LAN and found use in MANs.

10-Gigabit Ethernet was completed in 2002. It retains the main characteristics of Ethernet (bus, packet switching) and the same packet format as before. At a data rate of 10 Gbps, it functions only over optical fiber. Since it operates only under full duplex (switches and buffered distributors), it does not need CSMA/CD for collision detection.

10-Gigabit Ethernet is expected to finally enable the convergence of voice and data networks. It can be substantially cheaper than ATM. Its design encompasses all LAN, MAN, and WAN, and its carrying capacity is equivalent or superior to Fiber Channel, High Performance Parallel Interface (HIPPI), Ultra 320 or 640 SCSI, and ATM/SONET OC-192. The maximum link distance is increased to 40 kilometers for SM fiber (see Table 15.5). In fact, special care is taken for interoperability with SONET/SDH, so Ethernet packets can readily travel across SONET/SDH links.

Table 15.5 provides a brief comparison of Fast Ethernet, Gigabit Ethernet, and 10-Gigabit Ethernet.

15.4 ACCESS NETWORKS

An access network connects end users to the core network. It is also known as the "last mile" for delivering various multimedia services, which could include Internet access, telephony, and digital and analog TV services.

Beside ADSL, discussed earlier, some known options for access networks are:

• Hybrid Fiber-Coax (HFC) Cable Network. Optical fibers connect the core network with Optical Network Units (ONUs) in the neighborhood, each of which typically serves a few hundred homes. All end users are then served by a shared coaxial cable.

Traditionally, analog cable TV was allocated a frequency range of 50–500 MHz, divided into 6 MHz channels for NTSC TV and 8 MHz channels in Europe. For HFC cable networks, the downstream is allocated a frequency range of 450–750 MHz, and upstream is allocated a range of 5–42 MHz. For the downstream, a cable modem acts as a tuner to capture the QAM-modulated digital stream. The upstream uses Quadrature Phase-Shift Keying (QPSK) [2] modulation, because it is more robust in the noisy and congested frequency spectrum.

A potential problem of HFC is the noise or interference on the shared coaxial cable. Privacy and security on the upstream channel are also a concern.

• Fiber To The Curb (FTTC). Optical fibers connect the core network with ONUs at the curb. Each ONU is then connected to dozens of homes via twisted-pair copper or coaxial cable. For FTTC, a star topology is used at the ONUs, so the media to the end user are not shared — a much improved access network over HFC. Typical data rates are T1 to T3 in the downstream direction and up to 19.44 Mbps in the upstream direction.

• Fiber To The Home (FTTH). Optical fibers connect the core network directly with a small group of homes, providing the highest bandwidth. For example, before reaching four homes, a 622 Mbps downstream can be split into four 155 Mbps downstreams by TDM. Since most homes have only twisted pairs and/or coaxial cables, the implementation cost of FTTH will be high.

• Terrestrial Distribution. Terrestrial broadcasting uses VHF and UHF spectra (approximately 40–800 MHz). Each channel occupies 8 MHz in Europe and 6 MHz in the U.S., and each transmission covers about 100 kilometers in diameter. AM and FM modulations are employed for analog videos, and Coded Orthogonal Frequency Division Multiplexing (COFDM) for digital videos. The standard is known as Digital Video Broadcasting-Terrestrial (DVB-T). Since the return channel (upstream) is not supported in terrestrial broadcasting, a separate POTS or N-ISDN link is recommended for the upstream in interactive applications.

• Satellite Distribution. Satellite broadcasting uses the Gigahertz spectrum. Each satellite covers an area of several thousand kilometers. For digital video, each satellite channel typically has a data rate of 38 Mbps, good for several Digital Video Broadcasting (DVB) channels. Its standard is Digital Video Broadcasting-Satellite (DVB-S). Similar to DVB-T, POTS or N-ISDN is proposed as a means of supporting upstream data in DVB-S.

15.5 COMMON PERIPHERAL INTERFACES

For a comparison, Table 15.6 lists the speeds of various common peripheral interfaces for connecting I/O and other devices [hard disk, printer, CD-ROM, pointing devices (e.g., mouse), Personal Digital Assistant (PDA), digital camera, and so on].

TABLE 15.6: Speed of Common Peripheral Interfaces

Interface                                    Data rate
Serial port                                  115 kbps
Standard parallel port                       115 kB/s
USB                                          1.5 MB/s
ECP/EPP parallel port                        3 MB/s
IDE                                          3.3–16.7 MB/s
SCSI-1                                       5 MB/s
SCSI-2 (Fast SCSI, Fast narrow SCSI)         10 MB/s
Fast wide SCSI (Wide SCSI)                   20 MB/s
Ultra SCSI (SCSI-3, Ultra narrow SCSI)       20 MB/s
EIDE                                         33 MB/s
Wide Ultra SCSI (Fast 20)                    40 MB/s
Ultra2 SCSI                                  40 MB/s
IEEE 1394 (FireWire, i.Link)                 1.5–50 MB/s
USB 2                                        60 MB/s
Wide Ultra2 SCSI (Fast 40)                   80 MB/s
Ultra3 SCSI                                  80 MB/s
Ultra ATA 133                                133 MB/s
Wide Ultra3 SCSI (Ultra 160 SCSI, Fast 80)   160 MB/s
HIPPI                                        100–200 MB/s
Ultra 320 SCSI                               320 MB/s
Fiber Channel                                100–400 MB/s
Ultra 640 SCSI                               640 MB/s

USB = Universal Serial Bus; SCSI = Small Computer System Interface; ECP = Enhanced Capability Port; EPP = Enhanced Parallel Port; IDE = Integrated Disk Electronics; EIDE = Enhanced IDE; HIPPI = High Performance Parallel Interface; Narrow = 8-bit data; Wide = 16-bit data.

15.6 FURTHER EXPLORATION

Good general discussions on computer networks and data communications are given in the books by Tanenbaum [1] and Stallings [2].

The Further Exploration section of the text web site for this chapter provides an extensive set of web resources for computer and multimedia networks, including links to

• SONET FAQ, etc.

• xDSL introductions at the DSL Forum web site

• Introductions and White Papers on ATM

• FAQ and White Papers on 10 Gigabit Ethernet at the Alliance web site

• IEEE 802 standards

• IETF Request for Comments (RFC) for IPv6 (Internet Protocol, Version 6)

15.7 EXERCISES

1. What is the main difference between the OSI and TCP/IP reference models?

2. IPv6 is a newer IP protocol. What is its advantage over IPv4?

3. UDP does not provide end-to-end flow control, but TCP does. Explain how this is achieved using sequence numbers. Give an example where a packetized message sent using UDP is received incorrectly, but when using TCP it is received correctly under the same circumstances (without channel errors).

4. As a variation of FDM, WDM is used for multiplexing over fiber-optic channels. Compare WDM with FDM.

5. Both ISDN and ADSL deliver integrated network services, such as voice, video, and so on, to home users or small-office users. What are the advantages of ADSL over ISDN?

6. Several protocols, such as Ethernet, Token ring, and FDDI, are commonly used in LAN. Discuss the functionalities of these three technologies and differences among them.

7. Frame relay and Cell relay are variants of packet switching. Compare these two technologies.

8. What is the difference between switching and routing? Are routing algorithms specific to a switching technology?

9. How many sublayers are there in ATM? What are they?

10. In HFC cable networks, two modulation schemes are used for sending downstream and upstream data. Why should the upstream case be handled differently from downstream? Should we employ different multiplexing technologies as well?

15.8 REFERENCES

1 A.S. Tanenbaum, Computer Networks, 4th ed., Upper Saddle River, NJ: Prentice Hall PTR, 2003.

2 W. Stallings, Data & Computer Communications, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2000.

3 W. Stallings, ISDN and Broadband ISDN, with Frame Relay and ATM, Upper Saddle River, NJ: Prentice Hall, 1999.

4 K. Tolly, "Introduction to FDDI," Data Communications, 22(11): 81–86, 1993.

5 R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Upper Saddle River, NJ: Prentice Hall PTR, 1995.

CHAPTER 16

Multimedia Network Communications and Applications

Fundamentally, multimedia network communication and (traditional) computer network communication are similar, since they both deal with data communications. However, challenges in multimedia network communications arise because multimedia data (audio, video, etc.) are known as continuous media. They have the following characteristics:

• Voluminous. They demand high data rates, possibly dozens or hundreds of Mbps.

• Real-Time and Interactive. They demand low delay and synchronization between audio and video for "lip sync". In addition, applications such as videoconferencing and interactive multimedia require two-way traffic.

• Sometimes Bursty. Data rates fluctuate drastically — for example, in video-on-demand, no traffic most of the time but burst to high volume.

16.1 QUALITY OF MULTIMEDIA DATA TRANSMISSION

16.1.1 Quality of Service (QoS)

Quality of Service (QoS) for multimedia data transmission depends on many parameters. Some of the most important are:

• Data Rate. A measure of transmission speed, often in kilobits per second (kbps) or megabits per second (Mbps).

• Latency (maximum frame/packet delay). Maximum time needed from transmission to reception, often measured in milliseconds (msec). In voice communication, for example, when the round-trip delay exceeds 50 msec, echo becomes a noticeable problem; when the one-way delay is longer than 250 msec, talker overlap will occur, since each caller will talk without knowing the other is also talking.

• Packet loss or error. A measure (in percentage) of error rate of the packetized data transmission. Packets get lost or garbled, such as over the Internet. They may also be delivered late or in the wrong order. Since retransmission is often undesirable, a


simple error-recovery method for real-time multimedia is to replay the last packet, hoping the error is not noticeable.

In general, for uncompressed audio/video, the desirable packet loss is < 10^-2 (lose every hundredth packet, on average). When it approaches 10%, it becomes intolerable. For compressed multimedia and ordinary data, the desirable packet loss is less than 10^-7 to 10^-8. Some prioritized delivery techniques, described in Section 16.1.3, can alleviate the impact of packet loss.

• Jitter (or delay jitter). A measure of smoothness of the audio/video playback. Technically, jitter is related to the variance of frame/packet delays. A large buffer (jitter buffer) can hold enough frames to allow the frame with the longest delay to arrive, to reduce playback jitter. However, this increases the latency and may not be desirable in real-time and interactive applications. Figure 16.1 illustrates examples of high and low jitters in frame playbacks.

FIGURE 16.1: Jitters in frame playback: (a) high jitter; (b) low jitter. (Both plots show the frame played against time.)

• Sync skew. A measure of multimedia data synchronization, often measured in milliseconds (msec). For a good lip synchronization, the limit of sync skew is ±80 msec between audio and video. In general, ±200 msec is still acceptable. For a video with speaker and voice, the limit of sync skew is 120 msec if video precedes voice and 20 msec if voice precedes video. (The discrepancy is probably because we are used to having sound lag image at a distance.)

Multimedia Service Classes. Based on the above measures, multimedia applications can be classified into the following types:

• Real-Time (also Conversational). Two-way traffic, low latency and jitter, possibly with prioritized delivery, such as voice telephony and video telephony.

• Priority data. Two-way traffic, low loss and low latency, with prioritized delivery, such as e-commerce applications.

• Silver. Moderate latency and jitter, strict ordering and sync. One-way traffic, such as streaming video; or two-way traffic (also Interactive), such as web surfing and Internet games.

• Best Effort (also Background). No real-time requirement, such as downloading or transferring large files (movies).

• Bronze. No guarantees for transmission.

Table 16.1 lists the general bandwidth/bit rate requirement for multimedia networks. Table 16.2 lists some specifications for tolerance to delay and jitter in digital audio and video of different qualities.

TABLE 16.1: Requirement on network bandwidth/bitrate.

Application                        Speed requirement
Telephone                          16 kbps
Audio conferencing                 32 kbps
CD-quality audio                   128–192 kbps
Digital music (QoS)                64–640 kbps
H.261                              64 kbps–2 Mbps
H.263                              < 64 kbps
DVI video                          1.2–1.5 Mbps
MPEG-1 video                       1.2–1.5 Mbps
MPEG-2 video                       4–60 Mbps
HDTV (compressed)                  > 20 Mbps
HDTV (uncompressed)                > 1 Gbps
MPEG-4 video-on-demand (QoS)       250–750 kbps
Videoconferencing (QoS)            384 kbps–2 Mbps

Perceived QoS. Although QoS is commonly measured by the above technical parameters, QoS itself is a "collective effect of service performances that determine the degree of satisfaction of the user of that service," as defined by the International Telecommunications Union. In other words, it has everything to do with how the user perceives it.

In real-time multimedia, regularity is more important than latency (i.e., jitter and quality fluctuation are more annoying than slightly longer waiting); temporal correctness is more important than the sound and picture quality (i.e., ordering and synchronization of audio and video are of primary importance); and humans tend to focus on one subject at a time.

TABLE 16.2: Tolerance of latency and jitter in digital audio and video.

Application                          Average latency      Average jitter
                                     tolerance (msec)     tolerance (msec)
Low-end videoconference (64 kbps)    300                  130
Compressed voice (16 kbps)           30                   130
MPEG NTSC video (1.5 Mbps)           5                    7
MPEG audio (256 kbps)                7                    9
HDTV video (20 Mbps)                 0.8

User focus is usually at the center of the screen, and it takes time to refocus, especially after a scene change.

Together with the perceptual nonuniformity we have studied in previous chapters, many issues of perception can be exploited in achieving the best perceived QoS in networked multimedia.

16.1.2 QoS for IP Protocols

QoS policies and technologies enable key metrics discussed in the previous section, such as latency, packet loss, and jitter, to be controlled by offering different levels of service to different packet streams or applications.

Frame relay routing protocol and ATM provide some levels of QoS, but currently most Internet applications are built on IP. IP is a "best-effort" communications technology and does not differentiate among different IP applications. Therefore it is hard to provide QoS over IP by current routing methods.

Abundant bandwidth improves QoS, but in complex networks, abundant bandwidth is unlikely to be available everywhere (in practice, many IP networks routinely use oversubscription). In particular, it is unlikely to be available in all the access links. Even if it is available everywhere, bandwidth alone can't resolve problems due to sudden peaks in traffic.

Differentiated Service (DiffServ) uses the DiffServ code [Type of Service (TOS) octet in the IPv4 packet and Traffic Class octet in the IPv6 packet] to classify packets to enable their differentiated treatment. It is becoming more widely deployed in intradomain networks and enterprise networks, as it is simpler and scales well, although it is also applicable to end-to-end networks. DiffServ, in conjunction with other QoS techniques, is emerging as the de facto QoS technology. See IETF Request for Comments (RFC) 2998 for more information.

Multiple Protocol Label Switching (MPLS) facilitates the marriage of IP to OSI layer 2 technologies, such as ATM, by overlaying a protocol on top of IP. It introduces a 32-bit label and inserts one or more shim labels into the header of an IP packet in a backbone IP network. It thus creates tunnels, called Label Switched Paths (LSP). By doing so, the backbone IP network becomes connection-oriented.

The two main advantages of MPLS are to support Traffic Engineering (TE), which is used essentially to control traffic flow, and Virtual Private Networks (VPN). Both TE and VPN help delivery of QoS for multimedia data. MPLS supports eight service classes. For more detail refer to RFC 3031.

DiffServ and MPLS can be used together to allow better control of both QoS performance per class and provision of bandwidth, retaining advantages of both MPLS and DiffServ.

16.1.3 Prioritized Delivery

When a high packet loss or error rate is detected in the event of network congestion, prioritized delivery of multimedia data can be used to alleviate the perceived deterioration.

• Prioritization for types of media. Transmission algorithms can provide prioritized delivery to different media — for example, giving higher priority to audio than to video — since loss of content in audio is often more noticeable than in video.

• Prioritization for uncompressed audio. PCM audio bitstreams can be broken into groups of every nth sample — prioritize and send k of the total of n groups (k ≤ n) and ask the receiver to interpolate the lost groups if so desired. For example, if two out of four groups are lost, the effective sampling rate is 22.05 kHz instead of 44.1 kHz. Loss is perceived as change in sampling rate, not dropouts.

• Prioritization for JPEG image. The different scans in Progressive JPEG and different resolutions of the image in Hierarchical JPEG can be given different priorities — for example, highest priority for the scan with the DC and first few AC coefficients, and higher priority to lower-resolution components of the Hierarchical JPEG image.

• Prioritization for compressed video. Video prioritization algorithms can set priorities to minimize playback delay and jitter by giving the highest priority to reception of I-frames and the lowest priority to B-frames. In scalable video (such as MPEG-2 and 4) using layered coding, the base layer can be given higher priority than the enhancement layers.

16.2 MULTIMEDIA OVER IP

Due to the great popularity and availability of the Internet, various efforts have been made to make Multimedia over IP a reality, although it was known to be a challenge. This section will study some of the key issues, technologies, and protocols.

16.2.1 IP-Multicast

In network terminology, a broadcast message is sent to all nodes in the domain, a unicast message is sent to only one node, and a multicast message is sent to a set of specified nodes.¹ IP-multicast enables multicast on the Internet. It is vital for applications such as mailing lists, bulletin boards, group file transfer, audio/video-on-demand, audio/videoconferencing, and so on. Steve Deering introduced IP-multicast technology in his 1988 Ph.D. dissertation.

¹IPv6 also allows anycast, whereby the message is sent to any one of the specified nodes.
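A host can tell mechanically whether a destination is a multicast address. In IPv4, multicast (Class D) addresses are those whose first 4 bits are 1110 — the range 224.0.0.0 through 239.255.255.255 under standard IPv4 addressing conventions (the range itself is not stated in this text). A minimal sketch:

```python
# Test the leading 4 bits of a dotted-quad IPv4 address: 1110 marks a
# multicast (Class D) address, 224.0.0.0-239.255.255.255.

def is_multicast(addr: str) -> bool:
    first_octet = int(addr.split(".")[0])
    return first_octet >> 4 == 0b1110   # top 4 bits of the 32-bit address

print(is_multicast("224.0.0.1"))   # True  (the all-hosts multicast group)
print(is_multicast("10.1.2.3"))    # False (an ordinary unicast address)
```

Python's standard `ipaddress` module performs the same check via `IPv4Address("224.0.0.1").is_multicast`.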

User User jp-multicast has anonymous membership. The source host mullicasts to one of lhe above
lp-mullicasl addresses — il doesn’L know who wiil receive. The hostsoftware maps IP-group
Router Router addresseS mIo a lisL of recipienls. Then iL either mullicasLs when there is hardware support
(e.g., Elhemet and FDDI have hardware mullicast) or sends mulliple unicasLs Lhrough lhe
neXL nade in Lhe spanning tree.
One potential problem of multicasting is taL 100 many packets will be Lraveling and
User ajive in te nelwork. Fortunately, IP packets have a tbne-to-iive (TTL) field tal limits lhe
packet’s lifelime. Each router decrements Lhe T1’L of lhe pass-by packel by aI leasL one.
The packet is discarded when ils ‘ITL is zero.
rue IP-mulLicasL melhod described above is based on UDP (not TCP), so as Lo avoid
excessive acknowledgmenls from mulliple receivers for every message. As a resull, packeLs
are delivered by “besL efi’orl”, so reliability is limiLed.
MRouter
lnternet Group Managemcnt Protocol (IGMP). Interne: Group Management Pro
MRouLer toco! (IGMP) was designed to help Lhe mainlenance of multicasl groups. Two special types
of IGMP messages are used: Query and Report. Query messages are mLllLicast by
rouLers lo ali local hosls, lo inquire abouL group membership. Report is used lo respond
lo a query and tojoin groups.
On receiving a query, members waiL for a random lime before responding. II a member
Router Router Rou ter
hears anoLher response, iL will not respond. Roulers periodically query group membership,
and declare lhemselves group members if lhey get a response lo aI leasL one quer)’. If no
responses occur after a while, lhey declare lhemselves nonmembers.
IGMP version 2 enforces a lower laLency, £0 Lhe membership is pruned more prompLly

L User
Ali User
afLer ali members in Lhe group leave.

Reliable Multicast Transport. IETF RFC 2357 was an attempt to define criteria for
evaluating reliable IP-multicast protocols.
One of the first trials of IP-multicast was in March 1992, when the Internet Engineering
Task Force (IETF) meeting in San Diego was broadcast (audio only) on the Internet.

MBone. The Internet Multicast Backbone (MBone) is based on IP-multicast technology [1]. Starting in the early 1990s, it has been used, for example, for audio and video
conferencing on the Internet [2, 3]. Earlier applications include vat for audio conferencing,
ivs and nv for video conferencing. Other application tools include wb for whiteboards in
shared workspace and sdr for maintaining session directories on MBone.
Since many routers do not support multicast, MBone uses a subnetwork of routers
(mrouters) that support multicast to forward multicast packets. As Figure 16.2 shows,
the mrouters (or so-called islands) are connected with tunnels. Multicast packets are encapsulated inside regular IP packets for "tunneling", so that they can be sent to the destination
through the islands.

FIGURE 16.2: Tunnels for IP Multicast in MBone.

Recall that under IPv4, IP addresses are 32 bits. If the first 4 bits are 1110, the message is
an IP-multicast message. It covers IP addresses ranging from 224.0.0.0 to 239.255.255.255.
As Almeroth [4] points out, MBone maintains a flat virtual topology and does not provide
good route aggregation (at the peak time, MBone had approximately 10,000 routes). Hence,
it is not scalable. Moreover, the original design is highly distributed (and simplistic). It
assumes no central management, which results in ineffective tunnel management; that is,
tunnels connecting islands are not optimally allocated. Sometimes multiple tunnels are
created over a single physical link, causing congestion.
Paul et al. [5] presented the Reliable Multicast Transport Protocol (RMTP), which supports
route aggregation and hierarchical routing.
Whetten and Taskale [6] provided an overview of Reliable Multicast Transport Protocol
II (RMTP II), which supports forward error control (FEC) and is targeted for real-time delivery
of multimedia data.

16.2.2 RTP (Real-time Transport Protocol)

The original Internet design provided "best-effort" service and was adequate for applications
such as e-mail and FTP. However, it is not suitable for real-time multimedia applications.
RTP is designed for the transport of real-time data, such as audio and video streams, often
for audio or videoconferencing. It is intended primarily for multicast, although it can also
450 Chapter 16 Multimedia Network Communications and Applications / Section 16.2 Multimedia over IP 451

be applied to unicast. It was used, for example, in nv for MBone [3], Netscape LiveMedia,
Microsoft Netmeeting, and Intel Videophone.
RTP usually runs on top of UDP, which provides efficient (but less reliable) connectionless
datagram service. There are two main reasons for using UDP instead of TCP. First, TCP is
a connection-oriented transport protocol; hence, it is more difficult to scale up in a multicast
environment. Second, TCP achieves its reliability by retransmitting missing packets. As
mentioned earlier, in multimedia data transmissions, the reliability issue is less important.
Moreover, the late arrival of retransmitted data may not be usable in real-time applications
anyway.
Since UDP will not guarantee that the data packets arrive in the original order (not to
mention synchronization of multiple sources), RTP must create its own timestamping and
sequencing mechanisms to ensure the ordering. RTP introduces the following additional
parameters in the header of each packet [7]:

• Payload type indicates the media data type as well as its encoding scheme (e.g.,
PCM, H.261/H.263, MPEG 1, 2, and 4 audio/video, etc.), so the receiver knows how
to decode it.

• Timestamp is the most important mechanism of RTP. The timestamp records the
instant when the first octet of the packet is sampled; it is set by the sender. With
the timestamps, the receiver can play the audio/video in proper timing order and
synchronize multiple streams (e.g., audio and video) when necessary.

• Sequence number is to complement the function of timestamping. It is incremented
by one for each RTP data packet sent, to ensure that the packets can be reconstructed
in order by the receiver. This becomes necessary, for example, when all packets of a
video frame sometimes receive the same timestamp, and timestamping alone becomes
insufficient.

• Synchronization Source (SSRC) ID identifies sources of multimedia data (e.g., audio, video). If the data come from the same source (translator, mixer), they will be
given the same SSRC ID, so as to be synchronized.

• Contributing Source (CSRC) ID identifies the source of contributors, such as all
speakers in an audio conference.

Figure 16.3 shows the RTP header format. The first 12 octets are of fixed format, followed
by optional (0 or more) 32-bit Contributing Source (CSRC) IDs.
Bits 0 and 1 are for the version of RTP, bit 2 (P) for signaling a padded payload, bit 3 (X)
for signaling an extension to the header, and bits 4 through 7 for a 4-bit CSRC count that
indicates the number of CSRC IDs following the fixed part of the header.
Bit 8 (M) signals the first packet in an audio frame or last packet in a video frame, since
an audio frame can be played out as soon as the first packet is received, whereas a video
frame can be rendered only after the last packet is received. Bits 9 through 15 describe the
payload type. Bits 16 through 31 are for the sequence number, followed by a 32-bit timestamp
and a 32-bit Synchronization Source (SSRC) ID.

FIGURE 16.3: RTP packet header (fields: V, P, X, CSRC count, M, payload type, sequence
number, timestamp, SSRC ID, optional CSRC IDs).

16.2.3 Real Time Control Protocol (RTCP)

RTCP is a companion protocol of RTP. It monitors QoS in providing feedback to the server
(sender) on quality of data transmission and conveys information about the participants of a
multiparty conference. RTCP also provides the necessary information for audio and video
synchronization, even if they are sent through different packet streams.
The five types of RTCP packets are as below:

1. Receiver report (RR) provides quality feedback (number of last packet received,
number of lost packets, jitter, timestamps for calculating round-trip delays).
2. Sender report (SR) provides information about the reception of RR, number of
packets/bytes sent, and so on.
3. Source description (SDES) provides information about the source (e-mail address,
phone number, full name of the participant).
4. Bye indicates the end of participation.
5. Application-specific functions (APP) provides for future extension of new features.

RTP and RTCP packets are sent to the same IP address (multicast or unicast) but on
different ports.

16.2.4 Resource Reservation Protocol (RSVP)

RSVP is a setup protocol for Internet resource reservation. Protocols such as RTP, described
above, do not address the issue of QoS control. RSVP was thus developed [8] to guarantee
desirable QoS, mostly for multicast, although it is also applicable to unicast.
A general communication model supported by RSVP consists of m senders and n receivers, possibly in various multicast groups (e.g., in Figure 16.4(a), m = 2, n = 3, and
the trees for the two multicast groups are depicted by the arrows — solid and dashed lines,
respectively). In the special case of broadcasting, m = 1; whereas in audio- or videoconferencing, each host acts as both sender and receiver in the session, that is, m = n.
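The fixed 12-octet RTP header layout described in Section 16.2.2 above can be unpacked with a short sketch (field boundaries follow the bit positions given in the text; network byte order assumed):

```python
import struct

def parse_rtp_header(data: bytes) -> dict:
    """Unpack the 12-byte fixed RTP header: V/P/X/CC in the first octet,
    M and payload type in the second, then sequence number, timestamp,
    and SSRC ID."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": b0 >> 6,            # bits 0-1
        "padding": (b0 >> 5) & 1,      # bit 2 (P)
        "extension": (b0 >> 4) & 1,    # bit 3 (X)
        "csrc_count": b0 & 0x0F,       # bits 4-7
        "marker": b1 >> 7,             # bit 8 (M)
        "payload_type": b1 & 0x7F,     # bits 9-15
        "sequence_number": seq,        # bits 16-31
        "timestamp": ts,               # 32-bit timestamp
        "ssrc": ssrc,                  # 32-bit SSRC ID
    }
```

For instance, a first octet of 0x80 decodes as RTP version 2 with no padding, no extension, and zero CSRC IDs.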

The main challenges of RSVP are that many senders and receivers may compete for
the limited network bandwidth, the receivers can be heterogeneous in demanding different
contents with different QoS, and they can be dynamic by joining or quitting multicast groups
at any time.
The most important messages of RSVP are Path and Resv. A Path message is
initiated by the sender and travels towards the multicast (or unicast) destination addresses.
It contains information about the sender and the path (e.g., the previous RSVP hop), so the
receiver can find the reverse path to the sender for resource reservation. A Resv message
is sent by a receiver that wishes to make a reservation.

• RSVP is receiver-initiated. A receiver (at a leaf of the multicast spanning tree) initiates the reservation request Resv, and the request travels back toward the sender but
not necessarily all the way. A reservation will be merged with an existing reservation
made by other receiver(s) for the same session as soon as they meet at a router. The
merged reservation will accommodate the highest bandwidth requirement among all
merged requests. The receiver-initiated scheme is highly scalable, and it meets users'
heterogeneous needs.

• RSVP creates only soft state. The receiver host must maintain the soft state by
periodically sending the same Resv message; otherwise, the state will time out. There
is no distinction between the initial message and any subsequent refresh message. If
there is any change in reservation, the state will automatically be updated according to
the new reservation parameters in the refreshing message. Hence, the RSVP scheme
is highly dynamic.

Figure 16.4 depicts a simple network with two senders (S1, S2), three receivers (R1,
R2, and R3), and four routers (A, B, C, D). Figure 16.4(a) shows that S1 and S2 send
Path messages along their paths to R1, R2, and R3. In (b) and (c), R1 and R2 send out
Resv messages to S1 and S2, respectively, to make reservations for S1 and S2 resources.
From C to A, two separate channels must be reserved, since R1 and R2 requested different
datastreams. In (d), R2 and R3 send out their Resv messages to S1, to make additional
requests. R3's request was merged with R1's previous request at A, and R2's was merged
with R1's at C.
Any possible variation of QoS that demands higher bandwidth can be dealt with by
modifying the reservation state parameters.
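The merging rule, where a router keeps one reservation per session at the highest requested bandwidth, can be sketched in a few lines (illustrative Python, not router code; the session/bandwidth pairs are an invented representation):

```python
def merge_reservations(requests):
    """Resv requests for the same session that meet at a router are
    merged; the merged reservation keeps the highest bandwidth
    requirement among all merged requests."""
    merged = {}
    for session, bandwidth in requests:
        merged[session] = max(bandwidth, merged.get(session, 0))
    return merged
```

So two receivers asking for 2 Mbps and 5 Mbps of the same session leave a single 5 Mbps reservation upstream of the merge point.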

FIGURE 16.4: A scenario of network resource reservation with RSVP: (a) senders S1 and
S2 send out their PATH messages to receivers R1, R2, and R3; (b) receiver R1 sends out
RESV message to S1; (c) receiver R2 sends out RESV message to S2; (d) receivers R2 and
R3 send out their RESV messages to S1.

16.2.5 Real-Time Streaming Protocol (RTSP)

Streaming Audio and Video. In the early days, multimedia data was transmitted over the
network (often with slow links) as a whole large file, which would be saved to a disk, then
played back. Nowadays, more and more audio and video data is transmitted from a stored
media server to the client in a datastream that is almost instantly decoded — streaming
audio and streaming video.
Usually, the receiver will set aside buffer space to prefetch the incoming stream. As soon
as the buffer is filled to a certain extent, the (usually) compressed data will be uncompressed
and played back. Apparently, the buffer space needs to be sufficiently large to deal with
the possible jitter and to produce continuous, smooth playback. On the other hand, too
large a buffer will introduce unnecessary initial delay, which is especially undesirable for
interactive applications such as audio- or videoconferencing [9].

The RTSP Protocol. RTSP is for communication between a client and a stored media
server. Figure 16.5 illustrates a possible scenario of four RTSP operations:

1. Requesting presentation description. The client issues a DESCRIBE request to
the stored media server to obtain the presentation description, such as media types
(audio, video, graphics, etc.), frame rate, resolution, codec, and so on, from the server.
2. Session setup. The client issues a SETUP to inform the server of the destination IP
address, port number, protocols, and TTL (for multicast). The session is set up when
the server returns a session ID.
3. Requesting and receiving media. After receiving a PLAY, the server starts to transmit
streaming audio/video data, using RTP. It is followed by a RECORD or PAUSE. Other
VCR commands, such as FAST-FORWARD and REWIND, are also supported. During
the session, the client periodically sends an RTCP packet to the server, to provide
feedback information about the QoS received (as described in Section 16.2.3).
4. Session closure. TEARDOWN closes the session.

FIGURE 16.5: A possible scenario of RTSP operations.

16.2.6 Internet Telephony

The Public Switched Telephone Network (PSTN) relies on copper wires carrying analog
voice signals. It provides reliable and low-cost voice and facsimile services. In the eighties
and nineties, modems were a popular means of "data over voice networks". In fact, they
were predominant before the introduction of ADSL and cable modems.
As PCs and the Internet became readily available and more and more voice and data
communications became digital (e.g., in ISDN), "voice over data networks", especially Voice
over IP (VoIP), started to attract a great deal of interest in research and user communities.
With ever-increasing network bandwidth and the ever-improving quality of multimedia data
compression, Internet telephony [10] has become a reality. Increasingly, it is not restricted
to voice (VoIP) — it is about integrated voice, video, and data services.
The main advantages of Internet telephony over POTS² are the following:

• It provides great flexibility and extensibility in accommodating integrated services
such as voicemail, audio- and videoconferences, mobile phone, and so on.
• It uses packet switching, not circuit switching; hence, network usage is much more
efficient (voice communication is bursty and VBR-encoded).
• With the technologies of multicast or multipoint communication, multiparty calls are
not much more difficult than two-party calls.
• With advanced multimedia data-compression techniques, various degrees of QoS can
be supported and dynamically adjusted according to the network traffic, an improvement over the "all or none" service in POTS.
• Good graphics user interfaces can be developed to show available features and services, monitor call status and progress, and so on.

²POTS refers to plain old telephone services that do not include new features such as call waiting, call forwarding, and so on.

As Figure 16.6 shows, the transport of real-time audio (and video) in Internet telephony
is supported by RTP (whose control protocol is RTCP), as described in Section 16.2.2.
Streaming media is handled by RTSP, and Internet resource reservation is taken care of by
RSVP.
Internet telephony is not simply a streaming media service over the Internet, because it
requires a sophisticated signaling protocol. A streaming media server can be readily identified

FIGURE 16.6: Network protocol structure for Internet telephony (H.323 or SIP over RTP,
RTCP, RSVP, and RTSP; transport layer (UDP, TCP); network layer (IP, IP Multicast);
data link layer; physical layer).

by a URI (Universal Resource Identifier), whereas acceptance of a call via Internet telephony
depends on the callee's current location, capability, availability, and desire to communicate.
The following are brief descriptions of the H.323 standard and one of the most commonly
used signaling protocols, Session Initiation Protocol (SIP).

H.323. H.323 [11, 12] is a standard for packet-based multimedia communication services over networks (LAN, Internet, wireless network, etc.) that do not provide a guaranteed
QoS. It specifies signaling protocols and describes terminals, multipoint control units (for
conferencing), and gateways for integrating Internet telephony with General Switched Telephone Network (GSTN)³ data terminals.

³GSTN is a synonym for PSTN (public switched telephone network).

The H.323 signaling process consists of two phases:

1. Call setup. The caller sends the gatekeeper (GK) a Registration, Admission and
Status (RAS) Admission Request (ARQ) message, which contains the name and phone
number of the callee. The GK may either grant permission or reject the request, with
reasons such as "security violation" and "insufficient bandwidth".
2. Capability exchange. An H.245 control channel will be established, for which the
first step is to exchange capabilities of both the caller and callee, such as whether it is
audio, video, or data; compression and encryption; and so on.

H.323 provides mandatory support for audio and optional support for data and video. It
is associated with a family of related software standards that deal with call control and data
compression for Internet telephony. Following are some of the related standards:

Signaling and Control

• H.225. Call control protocol, including signaling, registration, admissions, packetization and synchronization of media streams
• H.245. Control protocol for multimedia communications — for example, opening and
closing channels for media streams, obtaining gateways between GSTN and Internet
telephony
• H.235. Security and encryption for H.323 and other H.245-based multimedia terminals

Audio Codecs

• G.711. Codec for 3.1 kHz audio over 48, 56, or 64 kbps channels. G.711 describes
Pulse Code Modulation for normal telephony
• G.722. Codec for 7 kHz audio over 48, 56, or 64 kbps channels
• G.723.1. Codec for 3.1 kHz audio over 5.3 or 6.3 kbps channels. (The VoIP Forum
adopted G.723.1 as the codec for VoIP.)
• G.728. Codec for 3.1 kHz audio over 16 kbps channels
• G.729, G.729a. Codec for 3.1 kHz audio over 8 kbps channels. (The Frame Relay
Forum adopted G.729 as the codec for voice over frame relay.)

Video Codecs

• H.261. Codec for video at p × 64 kbps (p ≥ 1)
• H.263. Codec for low-bitrate video (< 64 kbps) over the GSTN

Related Standards

• H.320. The original standard for videoconferencing over ISDN networks
• H.324. An extension of H.320 for videoconferencing over the GSTN
• T.120. Real-time data and conferencing control

Session Initiation Protocol (SIP) — A Signaling Protocol. SIP [10] is an application-layer control protocol in charge of establishing and terminating sessions in Internet telephony. These sessions are not limited to VoIP communications; they also include multimedia conferences and multimedia distribution.
Similar to HTTP, SIP is a text-based protocol, which makes it different from H.323. It is also a
client-server protocol. A caller (the client) initiates a request, which a server processes and
responds to. There are three types of servers. A proxy server and a redirect server forward
call requests. The difference between the two is that the proxy server forwards the requests
to the next-hop server, whereas the redirect server returns the address of the next-hop server
to the client, so as to redirect the call toward the destination. The third type is a location
server, which finds current locations of users. Location servers usually communicate with
the redirect or proxy servers. They may use finger,

rwhois, Lightweight Directory Access Protocol (LDAP), or other multicast-based protocols
to determine a user's address.
SIP can advertise its session using e-mail, news groups, web pages or directories, or
Session Announcement Protocol (SAP) — a multicast protocol.
The methods (commands) for clients to invoke are

• INVITE — invites callee(s) to participate in a call.
• ACK — acknowledges the invitation.
• OPTIONS — inquires about media capabilities without setting up a call.
• CANCEL — terminates the invitation.
• BYE — terminates a call.
• REGISTER — sends user's location information to a registrar (a SIP server).

Figure 16.7 illustrates a possible scenario when a caller initiates a SIP session:

Step 1. Caller sends an INVITE john@home.ca to the local proxy server P1.

Step 2. The proxy uses its Domain Name Service (DNS) to locate the server for
john@home.ca and sends the request to it.

Steps 3, 4. john@home.ca is not logged on the server. A request is sent to the
nearby location server. John's current address, john@work.ca, is located.

Step 5. Since the server is a redirect server, it returns the address john@work.ca
to the proxy server P1.

Step 6. Try the next proxy server P2 for john@work.ca.

Steps 7, 8. P2 consults its location server and obtains John's local address,
john_doe@my.work.ca.

Steps 9, 10. The next-hop proxy server P3 is contacted, which in turn forwards the
invitation to where the client (callee) is.

Steps 11—14. John accepts the call at his current location (at work) and the acknowledgments are returned to the caller.

FIGURE 16.7: A possible scenario of SIP session initiation.

SIP can also use Session Description Protocol (SDP) to gather information about the
callee's media capabilities.

Session Description Protocol (SDP). As its name suggests, SDP describes multimedia
sessions. As in SIP, SDP descriptions are in textual form. They include the number and
types of media streams (audio, video, whiteboard session, etc.), destination address (unicast
or multicast) for each stream, sending and receiving port numbers, and media formats
(payload types). When initiating a call, the caller includes the SDP information in the
INVITE message. The called party responds and sometimes revises the SDP information,
according to its capability.

16.3 MULTIMEDIA OVER ATM NETWORKS

16.3.1 Video Bitrates over ATM

The ATM Forum supports various types of video bitrates:

• Constant Bit Rate (CBR). For example, for uncompressed video or CBR-coded video.
As mentioned before, if the allocated bitrate of CBR is too low, cell loss and distortion
of the video content are inevitable.
• Variable Bit Rate (VBR). The most commonly used video bitrate for compressed
video. It can be further divided into real-time Variable Bit Rate (rt-VBR) suitable for
compressed video, and non-real-time Variable Bit Rate (nrt-VBR) for specified QoS.
• Available Bit Rate (ABR). As in IP-based service, data transmission can be backed
off or buffered due to congestion. Cell loss rate and minimum cell data rate can
sometimes be specified.
• Unspecified Bit Rate (UBR). Provides no guarantee on any quality parameter.
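The choice among these service categories can be summarized as a small decision helper (the function and its arguments are illustrative only; they are not part of any ATM API):

```python
def video_service_category(compressed, real_time, qos_specified):
    """Rough mapping from the bullets above: uncompressed (or CBR-coded)
    video -> CBR; compressed real-time video -> rt-VBR; compressed video
    with a specified QoS -> nrt-VBR; otherwise fall back to ABR (or UBR
    when no guarantee at all is needed)."""
    if not compressed:
        return "CBR"
    if real_time:
        return "rt-VBR"
    if qos_specified:
        return "nrt-VBR"
    return "ABR"
```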

FIGURE 16.8: Headers and trailers added at the CS and SAR sublayers.

16.3.2 ATM Adaptation Layer (AAL)

AAL converts various formats of user data into ATM datastreams and vice versa. The
following lists five types of AAL protocols:

• AAL type 1 supports real-time, constant bitrate (CBR), connection-oriented datastreams.
• AAL type 2 was intended for variable bitrate (VBR) compressed video and audio.
However, the protocol never really materialized and is now inactive.
• AAL types 3 and 4 were similar, and have since been combined into one type:
AAL type 3/4. It supports variable bitrate (VBR) of either connection-oriented or
connectionless general (non-real-time) data services.
• AAL type 5 was the new protocol introduced for multimedia data transmission. It
promises to support all classes of data and video services (from CBR to UBR, from
rt-VBR to nrt-VBR). It is assumed that the layers above the AAL are connection-oriented and that the ATM layer beneath it has a low error rate.

As Figure 16.8 shows, headers and trailers are added to the original user data at the
Convergence Sublayer (CS) and Segmentation And Reassembly (SAR) sublayer. They
eventually form the 53-byte ATM cells, with the 5-byte ATM header appended.
The existence of the five different types of AAL was due largely to history. In particular,
all AAL types except AAL 5 were developed by the telecommunications industry and were
generally unsuitable for interactive multimedia applications and services [13]. Table 16.3
provides a comparison among the three active AAL types, for example, comparing AAL
3/4 with AAL 5:

TABLE 16.3: Comparison of AAL types.

                                 AAL 1           AAL 3/4    AAL 5
CS header/trailer overhead       0 bytes         8 bytes    8 bytes
SAR header/trailer overhead      1 or 2 bytes    4 bytes    0 bytes
SAR payload                      47 or 46 bytes  44 bytes   48 bytes
CS checksum                      None            None       4 bytes
SAR checksum                     None            10 bits    None

• AAL 3/4 has an overhead of designating 4 bytes for each SAR cell, whereas AAL 5
has none at this sublayer. Considering the numerous SAR cells, this is a substantial
saving for AAL 5. It is of course possible only with modern, relatively error-free
fiber-optic technology.
• As part of the SAR trailer, AAL 3/4 has a checksum field for error checking. To
cut down the overhead, the checksum is only 10 bits long, which is unfortunately
inadequate. AAL 5 does it at the CS and allocates 4 bytes for the checksum. Again,
it is based on the assumption that bit-transmission error is rare. However, when AAL
5 does error checking, it has enough information from the long checksum.

By now, AAL 5 has superseded AAL 3/4. The ATM Forum agrees that besides CBR
services, which will use AAL 1, every other service will use AAL 5. For more details of the
AALs, see Tanenbaum [13] and Stallings [14].
Table 16.4 summarizes the support for video transmission with and without ATM.

TABLE 16.4: Support for digital video transmission.

Video requirement    Support in ATM                        Support without ATM
Bandwidth            Scalable to several Gbps              Up to 100 Mbps
Latency and jitter   QoS support                           RSVP
CBR or VBR           AAL 1, 2, 5, circuit emulation, etc.  LAN emulation, ISDN and ADSL
Multicasting         Multicast switch, or permanent        IP-multicast or protocol
                     virtual circuit                       independent multicast (PIM)

16.3.3 MPEG-2 Convergence to ATM

The ATM Forum has decided that MPEG-2 will be transported over AAL 5. As mentioned
in Section 11.3.3, by default, two MPEG-2 packets (each 188 bytes) from the transport
stream (TS) will be mapped into one AAL-5 service data unit (SDU) [15].
When establishing a virtual channel connection, the following QoS parameters must be
specified:

• Maximum cell transfer delay (latency)
• Maximum cell delay jitter
• Cell loss ratio (CLR)
• Cell error ratio (CER)
• Severely errored cell block ratio (SECBR)

An audio-visual service-specific convergence sublayer (AVSSCS) is also proposed, to
enable transmitting video over AAL 5 using ABR services.
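The default TS-to-AAL-5 mapping can be checked with a little arithmetic, using the AAL 5 overheads from Table 16.3: two 188-byte TS packets plus the 8-byte CS trailer give 2 × 188 + 8 = 384 bytes, exactly eight 48-byte cell payloads (a simplified sketch; per-SDU padding, which AAL 5 would add for other sizes, is folded into the ceiling):

```python
import math

def aal5_cell_count(ts_packets, ts_size=188, cs_trailer=8, cell_payload=48):
    """Number of 48-byte ATM cell payloads needed for an AAL-5 SDU of
    ts_packets MPEG-2 TS packets, counting the 8-byte CS trailer."""
    sdu_bytes = ts_packets * ts_size + cs_trailer
    return math.ceil(sdu_bytes / cell_payload)
```

So the default SDU of two TS packets fills exactly eight ATM cells, with no padding wasted.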
16.3.4 Multicast over ATM

Compared to IP multicast, which is a "best-effort" service provided on top of UDP, multicast
in ATM networks had several challenges [16, 17]:

• ATM is connection-oriented; hence, ATM multicasting must set up all multipoint
connections.
• QoS in ATM must be negotiated at connection setup time and be known to all switches.
• It is difficult to support multipoint-to-point or multipoint-to-multipoint connections
in ATM, because AAL 5 does not keep track of multiplexer number or sequence
number. It cannot reassemble the data correctly at the receiver side if cells from
different senders are interleaved at their reception.

Scalable and efficient ATM multicast (SEAM) and shared many-to-many ATM reservations (SMART) are two approaches to multicasting over ATM [16]. The former
uses a unique identifier and the latter a token scheme to avoid the ambiguity caused
by cell interleaving.

16.4 TRANSPORT OF MPEG-4

The design of MPEG-4 was motivated by multimedia applications on the WWW. In particular, multimedia (text, graphics, audio, video, etc.) objects and scene descriptions (temporal
and spatial relationships of the video objects) can be transmitted by the server and interpreted
and reassembled at the client side, to drastically reduce the multimedia data transmitted onto
the WWW. This section briefly describes the Delivery Multimedia Integration Framework
(DMIF) and the issue of MPEG-4 over IP.

16.4.1 DMIF in MPEG-4

DMIF is an interface between multimedia applications and their transport. It supports remote
interactive network access (IP, ATM, PSTN, ISDN, or mobile), broadcast media (cable or
satellite), and local media on disks.
The interface is transparent to the application, so a single application can run on different
transport layers, as long as the right DMIF is instantiated.
Figure 16.9 shows the integration of delivery through three types of communication
mediums. As shown, the local application interacts with a uniform DMIF Application
Interface (DAI), which translates the application's requests into specific protocol messages,
to be transported through one of the three types of mediums.

FIGURE 16.9: DMIF — the multimedia content delivery integration framework.

When the delivery is through a network, the DMIF is unaware of the application. In
fact, an additional DMIF Network Interface (DNI) is needed, to take care of the signaling
messages for specific networks.
When delivering multimedia data, DMIF is similar to FTP. First, a SETUP session is
established with the remote network site. Second, streams are selected and a STREAM
request is sent to the DMIF peer, which returns a pointer to a separate connection where the
streaming will take place. Third, the new connection is established, and data is streamed.
In the scenarios of Broadcast and Local storage, the application will know how the data
is stored and delivered. Hence, this becomes part of the DMIF implementation.
DMIF has built-in QoS monitoring capability. It supports (a) continuous monitoring,
(b) specific QoS queries, and (c) QoS violation notification.

16.4.2 MPEG-4 over IP

The specifications on MPEG-4 over IP networks are jointly developed by the MPEG and
IETF as a framework in Part 8 of MPEG-4 (ISO/IEC 14496-8) and an Informative RFC in
IETF.
MPEG-4 sessions can be carried over IP-based protocols such as RTP, RTSP, and HTTP.
Details regarding the RTP payload format are specified by IETF RFC 3016. In short, the
generic RTP payload format defines a mapping between logical MPEG-4 SL packets and
RTP packets, and the FlexMux payload format maps FlexMux packetized streams to RTP
packets.
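The three FTP-like DMIF delivery steps described in Section 16.4.1 can be sketched as follows (the class and method names are hypothetical, chosen only for illustration; they are not the normative DMIF API):

```python
class DmifPeer:
    """Toy model of the three delivery steps: SETUP a session, send a
    STREAM request that returns a pointer to a separate connection,
    then stream data over that new connection."""

    def __init__(self):
        self.sessions = {}
        self.next_id = 1

    def setup(self, remote_site):
        sid = self.next_id
        self.next_id += 1
        self.sessions[sid] = {"remote": remote_site, "streams": []}
        return sid                       # step 1: session established

    def stream_request(self, sid, stream_name):
        conn = f"conn-{sid}-{stream_name}"
        self.sessions[sid]["streams"].append(conn)
        return conn                      # step 2: pointer to a new connection

    def deliver(self, conn, payload):
        return (conn, payload)           # step 3: data streamed
```

Note the design point this mirrors: the control session and the media connection are separate, so several streams of one session can each get their own delivery channel.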

16.5 MEDIA-ON-DEMAND (MOD)


Media-on-Demand involves many fundamental muitimedia network communication issues. Multimedia Networks
In this section, we wiil briefiy introduce Interactive TV, broadcast schemes for video-on
demand, and issues of bufíer management.
r ____

16.5.1 Interactive 1V (11V) and Set-Top Box (STB) i STB


1 (Set-Top Box) Network interface and
Interactive TV (ITV) is a muitimedia system based on lhe television seIs in homes. It can communication unit
support a growing number of activities, such as

. TV (basic, subscription, pay-per-view)

• Video-on-Demand (VOO) Audio/Video Graphics Peripheral controi


Processing
• Information services (news, weather, magazines, sports events, ele.) unit unit unit LmiL

• Interactive entertainment (lnternet games, etc.)


• E-commerce (online shopping, stoek trading)
• Access Lo digital libraries and educationai materiais

A new development in Digital Video Broadcasting (DVB) is Muitimedia Home Platform


(DVB-MHP) which supports ali the activities above as well as electronic program guide
TV monitor
na
Disks TIO devices

(EPG) for television. FIGURE 16.10: General architecture of Set-top Box.


The fundamental differences between ITV and conventional cable TV are, first, that ITV invites user interactions; hence the need for two-way traffic — downstream (content provider to user) and upstream (user to content provider). Second, ITV is rich in information and multimedia content.

To perform the above functions, a Set-top Box (STB) is required, which generally has the following components, as Figure 16.10 shows:

• Network interface and communication unit, including tuner and demodulator (to extract the digital stream from the analog channel), security devices, and a communication channel for basic navigation of the WWW and digital libraries as well as services and maintenance

• Processing unit, including CPU, memory, and a special-purpose operating system for the STB

• Audio/video unit, including audio and video (MPEG-2 and 4) decoders, Digital Signal Processor (DSP), buffers, and D/A converters

• Graphics unit, supporting real-time 3D graphics for animation and games

• Peripheral control unit, controllers for disks, audio and video I/O devices (e.g., digital video cameras), CD/DVD reader and writer, and so on

Section 15.4 described various Access Networks and their comparative advantages and disadvantages in transmitting multimedia data efficiently and securely for ITV services.

16.5.2 Broadcast Schemes for Video-on-Demand

Among all possible Media-on-Demand services, the most popular is likely to be subscription to movies: over high-speed networks, customers can specify the movies they want and the time they want to view them. The statistics of such services suggest that most of the demand is usually concentrated on a few (10 to 20) popular movies (e.g., new releases and top-ten movies of the season). This makes it possible to multicast or broadcast these movies, since a number of clients can be put into the next group following their request.

An important quality measure of such MOD service is the waiting time (latency). We will define access time as the upper bound between the time of requesting the movie and the time of consuming the movie.

Given the potentially extremely high bandwidth of fiber-optic networks, it is conceivable that the entire movie can be fed to the client in a relatively short time if it has access to some high-speed network. The problem with this approach is the need for an unnecessarily large storage space at the client side.

Staggered Broadcasting. For simplicity, we will assume all movies are encoded using constant-bitrate (CBR) encoding, are of the same length L (measured in time units), and will be played sequentially from beginning to end without interruption. The available high bandwidth W is divided by the playback rate b to yield the bandwidth ratio B. The bandwidth

of the server is usually divided up into K logical channels (K ≥ 1). Assuming the server broadcasts up to M movies (M ≥ 1), all can be periodically broadcast on all these channels with the start-time of each movie staggered. This is therefore referred to as Staggered broadcasting. Figure 16.11 shows an example of Staggered broadcasting in which M = 8 and K = 6.

FIGURE 16.11: Staggered broadcasting with M = 8 movies and K = 6 channels.

If the division of the bandwidth is equal among all K logical channels, then the access time for any movie is δ = (M/B)·L. (Note: the access time is actually independent of the value of K.) In other words, access time will be reduced linearly with the increased network bandwidth.

Pyramid Broadcasting. Viswanathan and Imielinski [18] proposed Pyramid broadcasting, in which movies are divided up into segments of increasing sizes. That is, L_{i+1} = α·L_i, where L_i is the size (length) of Segment S_i and α > 1. Segment S_i will be periodically broadcast on Channel i. In other words, instead of staggering the movies on K channels, the segments are now staggered. Each channel is given the same bandwidth, and the larger segments are broadcast less frequently.

Since the available bandwidth is assumed to be significantly larger than the movie playback rate b (i.e., B >> 1), it is argued that the client can be playing a smaller Segment S_i and simultaneously be receiving a larger Segment S_{i+1}.

To guarantee continuous (noninterrupted) playback, the necessary condition is

    playback_time(S_i) ≥ access_time(S_{i+1})                          (16.1)

The playback_time(S_i) = L_i. Given that the bandwidth allocated to each channel is (B/K)·b, the access_time(S_{i+1}) = M·L_{i+1}/(B/K) = α·M·L_i/(B/K), which yields

    α·M·L_i / (B/K) ≤ L_i                                              (16.2)

Consequently,

    α ≤ B/(M·K)                                                        (16.3)

The size of S_1 determines the access time for Pyramid broadcasting. By default, we set α = B/(M·K) to yield the shortest access time. Access time drops exponentially with the increase in total bandwidth B, because α can be increased linearly.

A main drawback of the above scheme is the need for a large storage space on the client side, because the last two segments are typically 75–80% of the movie size. Instead of using a geometric series, Skyscraper broadcasting [19] uses (1, 2, 2, 5, 5, 12, 12, 25, 25, 52, 52, ...) as the series of segment sizes, to alleviate the demand on a large buffer.

FIGURE 16.12: Skyscraper broadcasting with seven segments.

Figure 16.12 shows an example of Skyscraper broadcasting with seven segments. As shown, two clients who made a request at time intervals (1, 2) and (16, 17), respectively, have their respective transmission schedules. At any given moment, no more than two segments need to be received.

Hu [20] described Greedy Equal Bandwidth Broadcasting (GEBB) in 2001. The segment sizes and their corresponding channel bandwidths are analyzed, with the objective of minimizing the total server bandwidth required to broadcast a specific video. Different from the above pyramid-based broadcasting schemes, GEBB operates in a "greedy" fashion. The client receives as much data as possible from all the channels immediately after "tuning in" to a video broadcast. The client ceases receiving a segment immediately before playing back the corresponding segment. Figure 16.13 illustrates GEBB. In this figure, all the bandwidths are equal.

The server bandwidth optimization problem can be formally stated as:

    minimize    Σ_{i=1}^{K} B_i

    subject to  B_i = L_i·b / (w + Σ_{j=1}^{i−1} L_j),   i = 1, 2, ..., K    (16.4)

FIGURE 16.13: Illustration of GEBB. The shaded area represents data received and played back by the client.

where w is the wait time and B_i is the bandwidth of Channel i. The condition represented by Equation (16.4) ensures that Segment S_i is completely received at the exact time when the playback of Segment S_{i−1} terminates. Thus, the segments are available exactly on time for their playback.

The above nonlinear optimization problem is solved using the Lagrange multiplier method. The result is that the required bandwidth is minimized when the channel bandwidths are equal. The broadcasting bandwidth of each channel is

    B_i = B_j = B,   1 ≤ i, j ≤ K                                      (16.5)

    B = b·[(L/w + 1)^(1/K) − 1]                                        (16.6)

The total server bandwidth is then

    Σ_{i=1}^{K} B_i = K·b·[(L/w + 1)^(1/K) − 1]                        (16.7)

FIGURE 16.14: Harmonic broadcasting.
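The closed forms in this subsection are easy to sanity-check numerically. The sketch below (function names and sample values are ours, chosen purely for illustration) verifies that the Staggered access time δ = (M/B)·L is independent of K, that the Pyramid segments with α = B/(MK) form a geometric series summing to L, and that with the equal GEBB bandwidth of Eq. (16.6) every channel satisfies Eq. (16.4) with the same B:

```python
def staggered_access_time(M, L, B, K):
    """Staggered broadcasting: each channel gets bandwidth B/K (in units of
    the playback rate), so a movie takes L*K/B time units to send; staggering
    M movies over K channels gives delta = M*(L*K/B)/K = (M/B)*L."""
    return M * (L * K / B) / K

def pyramid_segments(L, B, M, K):
    """Pyramid broadcasting, Eq. (16.3): with alpha = B/(M*K), the segment
    lengths L_i = L1 * alpha**(i-1) sum to the movie length L."""
    alpha = B / (M * K)
    L1 = L * (alpha - 1) / (alpha ** K - 1)
    return [L1 * alpha ** i for i in range(K)]

def gebb(L, w, K, b=1.0):
    """GEBB, Eqs. (16.4)-(16.6): equal channel bandwidths
    B = b*((L/w + 1)**(1/K) - 1), with segment lengths growing by the
    factor g = (L/w + 1)**(1/K) and summing to L."""
    g = (L / w + 1) ** (1.0 / K)
    B = b * (g - 1)
    return B, [w * (g - 1) * g ** i for i in range(K)]
```

With L = 120, w = 1, and K = 2, the GEBB growth factor is 121^(1/2) = 11, the channel bandwidth is 10b, and the two segments (10 and 110 time units) together cover the whole movie.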

The segment progression follows a geometric sequence; (L/w + 1)^(1/K) is called the golden factor of video segmentation.

Harmonic Broadcasting. Juhn and Tseng [21] invented Harmonic broadcasting in 1997, which adopts a different strategy. The size of all segments remains constant, whereas the bandwidth of channel i is B_i = b/i, where b is the movie's playback rate. In other words, the channel bandwidths follow the decreasing pattern b, b/2, b/3, ..., b/K. The total bandwidth allocated for delivering the movie is thus

    B = Σ_{i=1}^{K} b/i = H_K·b                                        (16.8)

where K is the total number of segments, and H_K = Σ_{i=1}^{K} 1/i is the Harmonic number of K.

Figure 16.14 shows an example of Harmonic broadcasting. After requesting the movie, the client is allowed to download and play the first occurrence of segment S_1 from channel 1. Meanwhile, the client will download all other segments from their respective channels.

Take S_2 as an example: it consists of two halves, S_{2,1} and S_{2,2}. Since bandwidth B_2 is only b/2, during the playback time of S_1, one-half of S_2 (say S_{2,1}) will be downloaded (prefetched). It takes the entire playback time of S_2 to download the other half (say S_{2,2}), just as S_2 is finishing playback. Similarly, by this time, two-thirds of S_3 is already prefetched, so the remaining third of S_3 can be downloaded just in time for playback from channel 3, which has a bandwidth of only b/3, and so on.

The advantage of Harmonic broadcasting is that the Harmonic number grows slowly with K. For example, when K = 30, H_K ≈ 4. If the movie is 120 minutes long, this yields small segments — only 4 minutes (120/30) each. Hence, the access time for Harmonic broadcasting is generally shorter than for Pyramid broadcasting, and the demand on total bandwidth (in this case 4b) is modest. Juhn and Tseng [21] show that the upper bound for the buffer size at the client side is 37% of the entire movie, which also compares favorably with the original Pyramid broadcasting scheme.

However, the above Harmonic broadcasting scheme does not always work. For example, if the client starts to download at the second instance of S_1 in Figure 16.14, then by the time it finishes S_1, only the second half of S_2 — that is, S_{2,2} — is prefetched. The client will

not be able to simultaneously download and play S_{2,1} from channel 2, since the available bandwidth is only half the playback rate.

FIGURE 16.15: First three channel-segment maps of Pagoda broadcasting.

An obvious fix to the above problem is to ask the client to delay the playback of S_1 by one slot. The drawback of this delayed Harmonic broadcasting scheme is that it doubles the access time. Since 1997, several variants of the original method, such as cautious harmonic broadcasting, quasi-harmonic broadcasting, and polyharmonic broadcasting, have been proposed, all of which address the problem with added complexity.
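The arithmetic behind Harmonic broadcasting is simple to verify. In this sketch (helper names are ours), exact fractions confirm that H_30 ≈ 4 and that, when downloading starts at the first instance of S_1, exactly (i − 1)/i of segment S_i has already been prefetched when its playback begins:

```python
from fractions import Fraction

def harmonic_number(K):
    """H_K = 1 + 1/2 + ... + 1/K: the total Harmonic-broadcasting bandwidth
    in units of the playback rate b, Eq. (16.8)."""
    return sum(Fraction(1, i) for i in range(1, K + 1))

def prefetched_fraction(i):
    """Channel i delivers at rate b/i; during the (i-1) slots in which
    S_1 ... S_{i-1} play, it delivers (i-1)/i of segment S_i."""
    return Fraction(i - 1, i)
```

For a 120-minute movie with K = 30, each segment is 4 minutes long and the total bandwidth is H_30 · b ≈ 4b, matching the figures quoted above.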

Pagoda Broadcasting. Harmonic broadcasting schemes broadcast videos using a large number of low-bandwidth streams, while Pyramid broadcasting schemes broadcast videos using a small number of high-bandwidth streams. The total required bandwidths of Pyramid broadcasting schemes are generally higher than those of Harmonic-based schemes. But managing a large number of independent data streams for Harmonic broadcasting is likely to be daunting.

Paris, Carter, and Long [22, 23] proposed Pagoda broadcasting and its variant. They present a frequency broadcasting scheme that tries to combine the advantages of Harmonic and Pyramid schemes.

Figure 16.15 illustrates Pagoda broadcasting. It partitions each video into n fixed-size segments of duration T = L/n, where T is defined as a time slot. Then it broadcasts these segments at the consumption bandwidth b but with different periods. So the problem is to select the proper segment-to-channel mapping and the proper broadcasting period for each segment.

Compared to Pyramid and Harmonic broadcasting, Pagoda broadcasting is not bandwidth efficient for any given waiting time. It requires fewer segments compared to Harmonic broadcasting to achieve comparable waiting-time requirements while requiring more segments than Pyramid broadcasting.

All the above protocols are based on the assumption that the videos are encoded using Constant Bit Rate (CBR). Some protocols were proposed to deal with VBR-encoded videos. For further study, readers are referred to [20, 24].

Stream Merging. The above broadcast schemes are most effective when limited user interactions are expected — that is, once requested, clients will stay with the sequential access schedule and watch the movie in its entirety.

Stream merging is more adaptive to dynamic user interactions, which is achieved by dynamically combining multicast sessions [25]. It still makes the assumption that the client's receiving bandwidth is higher than the video playback rate. In fact, it is common to assume that the receiving bandwidth is at least twice the playback rate, so that the client can receive two streams at the same time.

The server will deliver a video stream as soon as it receives the request from a client. Meanwhile, the client is also given access to a second stream of the same video, which was initiated earlier by another client. At a certain point, the first stream becomes unnecessary, because all its contents have been prefetched from the second stream. At this time, the first stream will merge with (or "join") the second.

FIGURE 16.16: Stream merging.

As Figure 16.16 shows, the "first stream" B starts at time t = 2. The solid line indicates the playback rate, and the dashed line indicates the receiving bandwidth, which is twice the playback rate. The client is allowed to prefetch from an earlier ("second") stream A, which was launched at t = 0. At t = 4, stream B joins A.

The technique of stream merging can be applied hierarchically — Hierarchical multicast stream merging (HMSM) [25]. As Figure 16.16 shows, stream C, which started at t = 4, would join B at t = 6, which in turn joined A. The original stream B would have been obsolete after t = 4, since it joined A. In this case, it will have to be retained until t = 6, when C joins A.

A variation of stream merging is piggybacking, in which the playback rate of the streams is slightly and dynamically adjusted, to enable merging (piggybacking) of the streams.

FIGURE 16.17: The data that a client can store in the buffer assists the smooth playback of the media when the media rate exceeds the available network bandwidth.

16.5.3 Buffer Management

Continuous media usually have expected playback rates, such as 30 fps for NTSC video and 25 fps for PAL. If the video is delivered through the network, then without work-ahead smoothing at playback time, the required network throughput must be higher than the video's peak bitrate for uninterrupted video playback.

As discussed earlier, most compressed media are VBR-coded. Usually, the more activities (motions in the video, changes in the speech or music), the higher the required bitrate. The mean bitrate for MPEG-1 is 1.5 Mbps and for MPEG-2 is ≥ 4 Mbps. Media that have VBR characteristics can have a low bitrate at one point and a much higher bitrate at another point. The peak bitrate can be much larger than the mean bitrate for the media and may not be supported by the network bandwidth available.

Although not popular nowadays, CBR coding is also an option — variable distortions can be introduced to maintain a constant bitrate. CBR coding is less efficient than VBR: to obtain comparable quality of coded media, the CBR bitrate is typically 15–30% higher than the mean VBR video bitrate (the average bitrate of the video).

To cope with the variable bitrate and network load fluctuation, buffers are usually employed at both sender and receiver ends [9]. A prefetch buffer is introduced at the client side (e.g., in the client's Set-top Box) to smooth the transmission rate (reducing the peak rate). If the size of frame t is d(t), the buffer size is B, and the number of data bytes received so far (at play time for frame t) is A(t), then for all t ∈ 1, 2, ..., N, it is required that

    Σ_{i=1}^{t} d(i) ≤ A(t) ≤ Σ_{i=1}^{t} d(i) + B                     (16.9)

When A(t) < Σ_{i=1}^{t} d(i), we have inadequate network throughput and hence buffer underflow (or starvation), whereas when A(t) > Σ_{i=1}^{t} d(i) + B, we have excessive network throughput and buffer overflow. Both are harmful to smooth, continuous playback. In buffer underflow, no data is available to play, and in buffer overflow, media packets must be dropped.

Figure 16.17 illustrates the limits imposed by the media playback (consumption) data rate and the buffered data rate. (The transmission rates are the slopes of the curves.) At any time, data must be in the buffer for smooth playback, and the data transmitted must be more than the data consumed. If the network bandwidth available is as in Line II in the figure, at some point during playback, the data to be consumed will be greater than can be sent. The buffer will underflow, and playback will be interrupted. Also, at any point, the total amount of data transmitted must not exceed the total consumed plus the size of the buffer.

If the network available bandwidth is as in Line I and the media was sent as fast as possible without buffer considerations (as in normal file downloads), then toward the end of the video, the data received will be greater than the buffer can store at the time. The buffer will overflow and drop the extra packets. Then the server will have to retransmit the packets dropped, or these packets will be missing. Although, during overflow, some time is allowed for retransmission, this increases bandwidth requirements (and hence may cause underflow in the future). In many cases, such as broadcast, no back channel is available.

Techniques to maintain data in the prefetch buffer without overflowing or underflowing it are known as transmission rate control schemes. Two simple approaches are to prefetch video data to fill the buffer and try to transmit at the mean video bitrate, or to keep the buffer full without exceeding the available bandwidth. For video sections that require higher bandwidth than available, the transmission rate control schemes hope that the data already in the buffer and the available network bandwidth will enable smooth playback without buffer underflow.

An Optimal Plan for Transmission Rates. Given knowledge about the data rate characteristics of the media stored on the server [26], it is possible to use the prefetch buffer more efficiently for the network. The media server can plan ahead for a transmission rate such that the media can be viewed without interruption and the reserved bandwidth minimized. Many transmission plans may minimize the peak rate, but there is a unique plan that also minimizes rate variability — the variance of the transmission rate. Such a rate transmission plan is referred to as the optimal work-ahead smoothing plan.

Minimizing rate variability is important, since it implies the optimal rate plan is a set of piecewise, constant-rate-transmission segments. Basing the bandwidth reservation strategy on the current transmission rate rather than the peak rate allows some processing and network resources to be minimized and changes in bandwidth reservation to be less frequent. For discussion purposes, the following will refer to video media, although the technique could be extended for general media.

The video data rate can be analyzed frame by frame, although that might not be the best strategy, since inter-frames cannot be decoded by themselves and introduce a decoding delay anyhow. Additionally, the computational cost is high. Indeed, it is more practical to approximate the video data rate by considering the total data consumed by the time each I-frame should be displayed. The approximation could be made coarser by considering only the total data consumed at the first frame after a scene transition, assuming the movie data rate is constant in the same scene.
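The feasibility conditions of Equation (16.9) translate directly into a frame-by-frame check on a cumulative-arrival schedule. This helper (our own, for illustration) reports the first frame at which a schedule starves or overruns a prefetch buffer of size B:

```python
def check_schedule(frame_sizes, received, B):
    """Verify Eq. (16.9) frame by frame: sum d(i) <= A(t) <= sum d(i) + B.
    `received[t-1]` is A(t), the cumulative bytes received by frame t."""
    consumed = 0
    for t, (d_t, A_t) in enumerate(zip(frame_sizes, received), start=1):
        consumed += d_t
        if A_t < consumed:
            return ("underflow", t)    # starvation: nothing left to play
        if A_t > consumed + B:
            return ("overflow", t)     # packets would have to be dropped
    return ("ok", None)
```

A transmission rate control scheme can be seen as choosing the `received` curve so that this check passes for the whole video.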

FIGURE 16.18: The optimal smoothing plan for a specific video and buffer size. In this case, it is not feasible to transmit at the constant (average) data rate.

As before, define d(t) to be the size of frame t, where t ∈ 1, 2, ..., N, and N is the total number of frames in the video. Similarly, define a(t) to be the amount of data transmitted by the video server during the playback time for frame t (for short, call it at time t). Let D(t) be the total data consumed and A(t) be the total data sent at time t. Formally:

    D(t) = Σ_{i=1}^{t} d(i)                                            (16.10)

    A(t) = Σ_{i=1}^{t} a(i)                                            (16.11)

Let the buffer size be B. Then at any time t, the maximum total amount of data that can be received without overflowing the buffer during the time 1..t is W(t) = D(t − 1) + B. Now it is easy to state the conditions for a server transmission rate that avoids buffer overflow or underflow:

    D(t) ≤ A(t) ≤ W(t)                                                 (16.12)

To avoid buffer overflow or underflow throughout the video's duration, Equation (16.12) has to hold for all t ∈ 1, 2, ..., N. Define S to be the server transmission schedule (or plan), i.e., S = a(1), a(2), ..., a(N). S is called a feasible transmission schedule if for all t, S obeys Equation (16.12). Figure 16.18 illustrates the bounding curves D(t) and W(t) and shows that a constant (average)-bitrate transmission plan is not feasible for this video, because simply adopting the average bitrate would cause underflow.

We can think of this technique as stretching a rubber band from D(1) to D(N) bounded by the curves defined by D(t) and W(t). The slope of the total-data-transmitted curve is the transmission data rate. Intuitively, we can minimize the slope (or the peak rate) if, whenever the transmission data rate has to change, it does so as early as possible in the transmission plan.

As an illustration, consider Figure 16.18. The server starts transmitting data when the prefetch buffer is at state (a). It determines that to avoid buffer underflow at point (c), the transmission rate has to be high enough to have enough data at point (c). However, at that rate, the buffer will overflow at point (b). Hence it is necessary to reduce the transmission rate somewhere between points (c) and (b).

The earliest such point (that minimizes transmission rate variability) is point (c). The rate is reduced to a lower constant bitrate until point (d), where the buffer is empty. After that, the rate must be further reduced (to lower than the average bitrate) to avoid overflow until point (e), when the rate must finally be increased.

Consider any interval [p, q] and let B(t) represent the amount of data in the buffer at time t. Then the maximum constant data rate that can be used without overflowing the buffer is given by R_max:

    R_max = min_{p+1 ≤ t ≤ q} [W(t) − (D(p) + B(p))] / (t − p)         (16.13)

The minimum data rate that must be used over the same interval to avoid underflow is given by R_min:

    R_min = max_{p+1 ≤ t ≤ q} [D(t) − (D(p) + B(p))] / (t − p)         (16.14)

Naturally it is required that R_max ≥ R_min; otherwise no constant bitrate transmission is feasible over interval [p, q]. The algorithm to construct the optimal transmission plan starts with interval [p, q = p + 1] and keeps incrementing q, each time recalculating R_max and R_min. If R_max is to be increased, a rate segment is created with rate R_max over interval [p, q_max], where q_max is the latest point at which the buffer is full (the latest point in interval [p, q] where R_max is achieved).

Equivalently, if R_min is to be decreased, a rate segment is created with rate R_min over interval [p, q_min], where q_min is the latest point at which the buffer is empty.

Planning transmission rates can readily consider maximum allowed network jitter. Suppose there is no delay in the receiving rate. Then at time t, A(t) bytes of data were received, which must not exceed W(t). Now suppose the network delay is at its worst — δ seconds maximum delay. Video decoding will be delayed by δ seconds, so the prefetch buffer will not be freed. Hence the D(t) curve needs to be modified to a D(t − δ) curve. Figure 16.18 depicts this. This provides protection against overflow or underflow in the plan for a given maximum delay jitter.
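The procedure just described can be turned into a short program. The sketch below is a simplified rendering of the work-ahead smoothing plan of Salehi et al. [26] (the function name and the toy frame sizes in the test are ours; it assumes unit-length frame slots and that no single frame exceeds the buffer): it grows the interval [p, q], committing a segment at rate R_max up to q_max when underflow forces the rate up, and at rate R_min up to q_min when overflow forces it down.

```python
def smoothing_plan(d, B):
    """Piecewise-constant transmission plan obeying D(t) <= A(t) <= W(t)
    (Eq. 16.12), built with the running bounds of Eqs. (16.13)-(16.14).
    d[t-1] is the size of frame t; returns (start, end, rate) segments."""
    N = len(d)
    D = [0] * (N + 1)                        # cumulative consumption D(t)
    for t in range(1, N + 1):
        D[t] = D[t - 1] + d[t - 1]
    # W(t) = D(t-1) + B, capped at the movie size (no point sending more)
    W = [0] + [min(D[t - 1] + B, D[N]) for t in range(1, N + 1)]
    plan, p, sent = [], 0, 0.0               # `sent` = data sent by time p
    while p < N:
        rmax, qmax = float("inf"), p         # min overflow bound, Eq. (16.13)
        rmin, qmin = float("-inf"), p        # max underflow bound, Eq. (16.14)
        for q in range(p + 1, N + 1):
            hi = (W[q] - sent) / (q - p)
            lo = (D[q] - sent) / (q - p)
            if lo > rmax:                    # rate must increase:
                plan.append((p, qmax, rmax)) #   commit max-rate segment to q_max
                break
            if hi < rmin:                    # rate must decrease:
                plan.append((p, qmin, rmin)) #   commit min-rate segment to q_min
                break
            if hi <= rmax:
                rmax, qmax = hi, q           # latest point where the buffer is full
            if lo >= rmin:
                rmin, qmin = lo, q           # latest point where the buffer is empty
        else:
            plan.append((p, N, rmin))        # one constant rate reaches the end
        _, end, r = plan[-1]
        sent += r * (end - p)
        p = end
    return plan
```

For the toy trace in the test, the plan consists of two constant-rate segments, and replaying it keeps the buffer between empty and full at every frame.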
When frame sizes d(t) for all t are known ahead of transmission time, the server can plan ahead to generate an optimal transmission schedule that is feasible and minimizes the peak transmission rate [26]. Additionally, the plan minimizes schedule variance, optimally trying to smooth the transmission as much as possible.

16.6 FURTHER EXPLORATION

For good general discussions on multimedia network communications see Steinmetz and Nahrstedt [27], Wu and Irwin [28], and Jeffay and Zhang [29]. Wang et al. [30] provide a

good discussion on video processing and communications. For a discussion of ATM and MPEG-2, integrating digital video into broadband networks, see Orzessek and Sommer [15].

The Further Exploration section of the text web site for this chapter lists a good set of web resources for multimedia network communications, such as

• ITU-T recommendations

• MBone sites

• RTP, RTSP, and SIP pages

• Introductions and White Papers on ATM

• Introductions and White Papers on DVB

An extensive list of RFCs from the IETF:

• Criteria for evaluating reliable multicast transport protocols

• Protocols for real-time transmission of multimedia data (RTP, RTSP, and RSVP)

• Protocols for VoIP (SIP, SDP, and SAP)

• Diffserv and MPLS

16.7 EXERCISES

1. Discuss at least two alternative methods for enabling QoS routing on packet-switched networks based on a QoS class specified for any multimedia packet (this would apply to any store-and-forward network).

2. Suggest a few additional priority delivery methods for specific multimedia applications that were not mentioned in Section 16.1.3.

3. When should RTP be used and when should RTSP be used? Is there an advantage in combining both protocols?

4. Consider again Figure 16.4, illustrating RSVP. In (d), receiver R3 decides to send an RSVP RESV message to S1. Assuming the figure specifies the complete state of the network, is the path reserved optimal for maximizing future network throughput? If not, what is the optimal path? Without modifying the RSVP protocol, suggest a scheme in which such a path will be discovered and chosen by the network nodes.

5. Browse the web to find current technologies designed for Internet telephony.

6. For Staggered broadcasting, if the division of the bandwidth is equal among all K logical channels (K ≥ 1), show that access time is independent of the value of K.

7. Specify on Figure 16.17 the characteristics of feasible video transmission schedules. What is the optimal transmission schedule?

8. For the optimal work-ahead smoothing technique, how would you algorithmically determine at which point to change the planned transmission rate? What is the transmission rate?

9. Considering again the optimal work-ahead smoothing technique, it was suggested that instead of using every video frame, only frames at the beginning of statistically different compression video segments can be considered. How would you modify the algorithm (or video information) to support that?

10. Unicast transmission is when a server establishes a single communication channel with a specific client.

(a) For video streaming, suggest a couple of methods for unicast video transmission, assuming the video is VBR encoded and client feedback is allowed.
(b) Which one of your methods is better? Why?

11. Multicast transmission is when a server transmits a single multimedia stream to all listening multicast routers, and they forward it until a client receives the stream.

(a) For VBR video streaming, suggest a couple of methods for multicast video transmission.
(b) Which one of your methods is better? Why?
Hint: Although a client may have a reverse channel for feedback, if all clients send feedback, it would cause congestion in the network.

16.8 REFERENCES

1 H. Eriksson, "MBONE: The Multicast Backbone," Communications of the ACM, 37(8): 54–60, 1994.

2 M.R. Macedonia and D.P. Brutzman, "MBone Provides Audio and Video across the Internet," IEEE Computer, 27(4): 30–36, 1994.

3 V. Kumar, MBone: Interactive Multimedia on the Internet, Indianapolis: New Riders, 1996.

4 K.C. Almeroth, "The Evolution of Multicast: From the MBone to Interdomain Multicast to Internet2 Deployment," IEEE Network, 14: 10–20, January/February 2000.

5 S. Paul, et al., "Reliable Multicast Transport Protocol (RMTP)," IEEE Journal on Selected Areas in Communications, 15(3): 407–421, 1997.

6 B. Whetten and G. Taskale, "An Overview of Reliable Multicast Transport Protocol II," IEEE Network, 14: 37–47, January/February 2000.

7 C. Liu, "Multimedia over IP: RSVP, RTP, RTCP, RTSP," In Handbook of Emerging Communications Technologies: The Next Decade, ed. R. Osso, Boca Raton: CRC Press, 2000, 29–46.

8 L. Zhang, et al., "RSVP: A New Resource ReSerVation Protocol," IEEE Network, 7(5): 8–18, 1993.

9 M. Krunz, "Bandwidth Allocation Strategies for Transporting Variable-Bit-Rate Video Traffic," IEEE Communications Magazine, 37(1): 40–46, 1999.

10 H. Schulzrinne and J. Rosenberg, "The IETF Internet Telephony Architecture and Protocols," IEEE Network, 13: 18–23, May/June 1999.

11 Packet-Based Multimedia Communications Systems, ITU-T Recommendation H.323, November 2000 (earlier version September 1999).

12 J. Toga and J. Ott, "ITU-T Standardization Activities for Interactive Multimedia Communications on Packet-Based Networks: H.323 and Related Recommendations," Computer Networks, 31(3): 205–223, 1999.

13 A.S. Tanenbaum, Computer Networks, 4th ed., Upper Saddle River, NJ: Prentice Hall PTR, 2003.

14 W. Stallings, Data & Computer Communications, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2000.

15 M. Orzessek and P. Sommer, ATM & MPEG-2, Upper Saddle River, NJ: Prentice Hall PTR, 1998.

16 U. Varshney, "Multicasting: Issues and Network Support," In Multimedia Communications: Directions & Innovations, ed. J.D. Gibson, San Diego: Academic Press, 2001, 297–310.

17 G. Armitage, "IP Multicast over ATM Networks," IEEE Journal on Selected Areas in Communications, 15(3): 445–457, 1997.

18 S. Viswanathan and T. Imielinski, "Pyramid Broadcasting for Video on Demand Service," In IEEE Conf. on Multimedia Computing and Networking, 1995, 66–77.

19 K.A. Hua and S. Sheu, "Skyscraper Broadcasting: A New Broadcasting Scheme for Metropolitan Video-On-Demand Systems," In Proceedings of the ACM SIGCOMM, 1997, 89–100.

20 A. Hu, "Video-on-Demand Broadcasting Protocols: A Comprehensive Study," Proceedings of IEEE INFOCOM '01, 2001, 508–517.

21 L. Juhn and L. Tseng, "Harmonic Broadcasting for Video-on-Demand Service," IEEE Transactions on Broadcasting, 43(3): 268–271, 1997.

22 J.F. Paris, S.W. Carter, and D.D.E. Long, "A Hybrid Broadcasting Protocol for Video on Demand," In Proceedings of the 1999 Multimedia Computing and Networking Conference (MMCN '99), 1999, 317–326.

23 J.F. Paris, "A Simple Low-Bandwidth Broadcasting Protocol for Video-on-Demand," Proceedings of the 7th International Conference on Computer Communications and Networks (ICCCN '98), 1998, 690–697.

24 D. Saparilla, K. Ross, and M. Reisslein, "Periodic Broadcasting with VBR-Encoded Video," In Proceedings of IEEE Infocom '99, 1999, 464–471.

25 D. Eager, M. Vernon, and J. Zahorjan, "Minimizing Bandwidth Requirements for On-Demand Data Delivery," IEEE Transactions on Knowledge and Data Engineering, 13(5): 742–757, 2001.

26 J.D. Salehi, Z.L. Zhang, J.F. Kurose, and D. Towsley, "Supporting Stored Video: Reducing Rate Variability and End-to-End Resource Requirements through Optimal Smoothing," ACM SIGMETRICS, 24(1): 222–231, 1996.

27 R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Upper Saddle River, NJ: Prentice Hall PTR, 1995.

28 C.H. Wu and J.D. Irwin, Emerging Multimedia Computer Communication Technologies, Upper Saddle River, NJ: Prentice Hall PTR, 1998.

29 K. Jeffay and H. Zhang, Readings in Multimedia Computing and Networking, San Francisco,

C H A P T E R 17

Wireless Networks

17.1 WIRELESS NETWORKS

The rapid developments in computer and communication technologies have made ubiquitous computing a reality. From cordless phones in the early days to cellular phones in the nineties and personal digital assistants (PDAs), PocketPCs, and videophones nowadays, wireless communication has been the core technology that enabled personal communication services (PCS), personal communications network (PCN), and personal digital cellular (PDC) [1, 2].

Geographically, wireless networks are often divided into cells. Each mobile phone in a cell contacts its access point, which serves as a gateway to the network. The access points themselves are connected through wired lines, or wireless networks or satellites that form the core network. When a mobile user moves out of the range of the initial access point, a handoff (or handover, as it is called in Europe) is required to maintain the communication.

In 1985, frequency bands at 902–928 MHz, 2.400–2.4835 GHz, and 5.725–5.850 GHz were assigned to Industrial, Scientific, and Medical applications by the FCC, hence the name ISM bands.

Traditionally, cell size is on the order of kilometers. The introduction of PCS, however, creates the need for a hierarchical cellular network in which several levels of cells can be defined:

• picocell. Each covers up to 100 meters; useful for wireless/cordless applications and devices (e.g., PDAs) in an office or home.

• microcell. Each covers up to 1,000 meters in cities or local areas, such as radio access payphones on the streets.

• cell. Each has up to 10,000 meters coverage; good for national or continental networks.

• macrocell. Provides worldwide coverage, such as satellite phones.

Fading is a common phenomenon in wireless (and especially mobile) communications, in which the received signal power (suddenly) drops. Multipath fading occurs when a signal reaches the receiver via multiple paths (some of them bouncing off buildings, hills, and other objects). Because they arrive at different times and phases, the multiple instances of the
CA: Morgan Kaufmann, 2002. signal can cancel each olher, causing lhe loss of signal or conneclion. The problem becomes
30 Y. Wang, 3. Osiermann, and Y.Q. Zhang, Video Processing and Comnaunications, Upper Saddle more severe when higher data rales are explored.
River, NJ: Preniice 1-fali, 2002.
479
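The phase-cancellation effect behind multipath fading can be illustrated with a toy two-path model (this sketch is not from the text; the 900 MHz carrier and the one-echo channel are illustrative assumptions). A copy of the signal delayed by half a carrier period arrives in antiphase and nearly cancels the direct path:

```python
import math

def two_path_peak(freq_hz, delay_s, n=1000):
    """Peak amplitude over one period of a carrier plus one delayed echo."""
    period = 1.0 / freq_hz
    peak = 0.0
    for i in range(n):
        t = i * period / n
        rx = (math.sin(2 * math.pi * freq_hz * t)
              + math.sin(2 * math.pi * freq_hz * (t - delay_s)))
        peak = max(peak, abs(rx))
    return peak

f = 900e6                                   # an illustrative 900 MHz carrier
constructive = two_path_peak(f, 1.0 / f)    # echo delayed one full period
destructive = two_path_peak(f, 0.5 / f)     # echo delayed half a period
assert constructive > 1.9                   # paths reinforce (~2x amplitude)
assert destructive < 0.1                    # paths cancel: a multipath fade
```

In a real channel the delays change as the user moves, so the received power swings between these two extremes, which is exactly the sudden drop described above.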
Chapter 17 Wireless Networks
17.1.1 Analog Wireless Networks

Earlier wireless communication networks were used mostly for voice communications, such as telephone and voice mail. First-generation (1G) cellular phones used analog technology and Frequency Division Multiple Access (FDMA), in which each user is assigned a separate frequency channel during the communication. Its standards were Advanced Mobile Phone System (AMPS) in North America, and Total Access Communication System (TACS) and Nordic Mobile Telephony (NMT) in Europe and Asia. Digital data transmission users needed modems to access the network; the typical data rate was 9,600 bps.

AMPS, for example, operates at the 800–900 MHz frequency band. Each direction of the two-way communication is allocated 25 MHz, with mobile station transmit (MS transmit) in the band of 824 to 849 MHz and base station transmit (BS transmit) in the band of 869 to 894 MHz. Each of the 25 MHz bands is then divided up for two operator bands, A and B, giving each 12.5 MHz. FDMA further divides each of the 12.5 MHz operator bands into 416 channels, which results in each channel having a bandwidth of 30 kHz. The frequency of any MS transmit channel is always 45 MHz below the frequency of the corresponding BS transmit channel in communication.

Similarly, TACS operates at the 900 MHz frequency band. It carries up to 1,320 full-duplex channels, with a channel spacing of 25 kHz.

Figure 17.1 illustrates a possible geometric layout for an FDMA cellular system. (For clarity, cells from the first cluster are marked with thicker borders.) A cluster of seven hexagon cells can be defined for the covered cellular area. As long as each cell in a cluster is assigned a unique set of frequency channels, interference from neighboring cells will be negligible.

FIGURE 17.1: A possible geometric layout for an FDMA cellular system with a cluster size of seven hexagon cells.

The same set of frequency channels (denoted f_1 to f_7 in Figure 17.1) will be reused once in each cluster, following the illustrated symmetric pattern. The so-called reuse factor is K = 7. In an AMPS system, for example, the maximum number of channels (including control channels) available in each cell is reduced to 416/K = 416/7 ≈ 59.

In this configuration, users in two different clusters using the same frequency f_i are guaranteed to be more than D apart geographically, where D is the diameter of the hexagonal cell. In a vacuum, electromagnetic radiation decays at a rate of D^-2 over a distance D. However, in real physical spaces on the earth, the decay is consistently measured at a much faster rate of D^-3.5 to D^-5. This makes the FDMA scheme feasible for analog wireless communications, since interference by users of the same frequency channel from other groups becomes insignificant.

17.1.2 Digital Wireless Networks

Second-generation (2G) wireless networks use digital technology. Besides voice, digital data is increasingly transmitted for applications such as text messaging, streaming audio, and electronic publishing. In North America, the digital cellular networks adopted two competing technologies in 1993: Time Division Multiple Access (TDMA) and Code Division Multiple Access (CDMA). In Europe and Asia, Global System for Mobile communications (GSM) [1], which used TDMA, was introduced in 1992.

Below, we introduce TDMA and GSM first, followed by an introduction to spread spectrum and analysis of CDMA.

17.1.3 TDMA and GSM

As the name suggests, TDMA creates multiple channels in multiple time slots while allowing them to share the same carrier frequency. In practice, TDMA is always combined with FDMA; that is, the entire allocated spectrum is first divided into multiple carrier frequency channels, each of which is further divided in the time dimension by TDMA.

GSM was established by the European Conference of Postal and Telecommunications Administrations (CEPT) in 1982, with the objective of creating a standard for a mobile communication network capable of handling millions of subscribers and providing roaming services throughout Europe. It was designed to operate in the 900 MHz frequency range and was accordingly named GSM 900. Europe also supports GSM 1800, which is the original GSM standard modified to operate at the 1.8 GHz frequency range.

In North America, the GSM network uses frequencies in the range of 1.9 GHz (GSM 1900). However, the predominant use of TDMA technology is by operators using the TIA/EIA IS-54B and the IS-136 standards. These standards are sometimes referred to as digital-AMPS or D-AMPS. IS-54B was superseded in 1996 by the newer IS-136 standard, which employs digital control channels (DCCH) and other enhanced user services. IS-136 operates in the frequencies of 800 MHz and 1.9 GHz (the PCS frequency range), providing the same digital services in both. GSM and IS-136 combine TDMA with FDMA to use the allocated spectrum and provide easy backward compatibility with pure FDMA-mode (analog) mobile stations.

As Figure 17.2 shows, the uplink (mobile station to base station) of GSM 900 uses the 890–915 MHz band, and the downlink (base station to mobile station) uses 935–960 MHz.
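The FDMA channelization and reuse arithmetic of Section 17.1.1 can be checked in a few lines (an illustrative sketch using only the AMPS numbers given in the text):

```python
# AMPS channelization, from the numbers in Section 17.1.1
operator_band_hz = 12.5e6   # one operator band, one direction
channel_bw_hz = 30e3        # one FDMA channel (30 kHz)
reuse_factor = 7            # K = 7 cell cluster

total_channels = int(operator_band_hz / channel_bw_hz)   # 416 channels
per_cell = total_channels // reuse_factor                # ~59 per cell
print(total_channels, per_cell)
```

This reproduces the 416 channels per operator band and the roughly 59 channels left to each cell once the K = 7 reuse pattern is imposed.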

FIGURE 17.2: Frequency and time divisions in GSM.

In other words, each is allocated 25 MHz. The frequency division in GSM divides each 25 MHz into 124 carrier frequencies, each with a separation of 200 kHz. The time division in GSM then divides each carrier frequency into TDMA frames; 26 TDMA frames are grouped into a traffic channel (TCH) of 120 msec that carries speech and data traffic.

Each TDMA frame is thus approximately 4.615 msec (i.e., 120/26 msec) and consists of eight time slots of length 4.615/8 ≈ 0.577 msec. Each mobile station is given unique time slots during which it can send and receive data. The send/receive does not occur at the same time slot; it is separated by three slots.

GSM provides a variety of data services. GSM users can send and receive data to users on POTS, ISDN, and packet-switched or circuit-switched public data networks. GSM also supports Short Message Service (SMS), in which text messages of up to 160 characters can be delivered to (and from) mobile phones. One unique feature of GSM is the subscriber identity module (SIM), a smart card that carries the mobile user's personal number and enables ubiquitous access to GSM services.

By default, the GSM network is circuit switched, and its data rate is limited to 9.6 kbps. General Packet Radio Service (GPRS), developed in 1999, supports packet-switched data over wireless connections, so users are "always connected". It is also referred to as one of the 2.5G (between second- and third-generation) services. The theoretical maximum speed of GPRS is 171.2 kbps, when all eight TDMA time slots are taken by a single user. In real implementations, single-user throughput reached 56 kbps in the year 2001. Apparently, when the network is shared by multiple users, the maximum data rate for each GPRS user will drop.

17.1.4 Spread Spectrum and CDMA

Spread spectrum is a technology in which the bandwidth of a signal is spread before transmission. In its appearance, the spread signal might be indistinguishable from background noise, so it has distinct advantages of being secure and robust against intentional interference (jamming).

Spread spectrum is applicable to digital as well as analog signals, because both can be modulated and "spread". The earlier generation of cordless phones and cellular phones, for example, used analog signals. However, it is the digital applications, in particular CDMA, that made the technology popular in various wireless data networks.

Following is a brief description of the two ways of implementing spread spectrum: frequency hopping and direct sequence.

Frequency Hopping. Frequency hopping is the earlier method for spread spectrum. The technology of analog frequency hopping was invented by Hedy Lamarr, the actress [3], in 1940, during World War II. Figure 17.3 illustrates the main components of the transmitter and receiver for frequency hopping.

FIGURE 17.3: Transmitter and receiver of Frequency Hopping (FH) spread spectrum.

Initially, data (analog or digital) is modulated to generate a baseband signal centered at a base frequency f_b. Because of the relatively low data rate in current wireless applications, the bandwidth of the baseband B_b is generally narrow. For example, if the data rate is 9.6 kbps, then (depending on the modulating scheme) the bandwidth B_b would not be higher than 2 x 9.6 = 19.2 kHz. The pseudorandom frequency generator produces random frequencies f_r within a wideband, whose bandwidth is usually on the order of megahertz (MHz). (The choice of the frequency f_r is controlled by a random number generator. Because the algorithm is deterministic, it is not truly random but pseudorandom.) At the Frequency-Hopping (FH) Spreader, f_r is modulated by the baseband signal to generate the spread spectrum signal, which has the same shape as the baseband signal but a new center frequency

f_c = f_r + f_b    (17.1)
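The essential trick of frequency hopping, that transmitter and receiver can reproduce the same "random" hop schedule from a shared deterministic generator, can be sketched as follows (an illustrative sketch: the seed, the 902–928 MHz ISM-style band, the 25 kHz step, and the 50 kHz f_b are all assumed values, not from the text):

```python
import random

F_B = 50e3  # baseband center frequency f_b (assumed illustrative value)

def hop_sequence(seed, n, band=(902e6, 928e6), step=25e3):
    """Pseudorandom carrier frequencies f_r drawn from a wideband."""
    rng = random.Random(seed)  # deterministic: pseudorandom, not truly random
    slots = int((band[1] - band[0]) / step)
    return [band[0] + rng.randrange(slots) * step for _ in range(n)]

# Transmitter and receiver share the seed, so both derive the same hops
tx_hops = hop_sequence(seed=1234, n=6)
rx_hops = hop_sequence(seed=1234, n=6)
assert tx_hops == rx_hops  # receiver can despread at every hop

# Center frequency of the spread signal at each hop: f_c = f_r + f_b (Eq. 17.1)
centers = [f_r + F_B for f_r in tx_hops]
```

An eavesdropper or jammer without the seed sees only a narrow, constantly moving slice of the wideband, which is why the scheme resists narrowband interception and jamming.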

Since f_r changes randomly in the wideband, f_c of the resulting signal is "hopping" in the wideband accordingly.

At the receiver side, the process is reversed. As long as the same pseudorandom frequency generator is used, the signal is guaranteed to be properly despread and demodulated.

It is important to note that although the FH method uses a wideband spread spectrum, at any given moment during transmission, the FH signal occupies only a small portion of the band, that is, B_b.

The transmission of the FH spread spectrum signal is rather secure and robust against narrowband jamming attacks, since only a tiny portion of the FH signal can be received or jammed in any narrow band.

If the hopping rate is slower than the data rate, it is called slow hopping and is easier to realize. Slow hopping has been used in GSM and shown to help reduce multipath fading, since each TDMA frame with frequency hopping will likely be sent under a different carrier frequency. In fast hopping, the hopping rate is much faster than the data rate, which makes it more secure and effective in resisting narrowband interference.

Direct Sequence. Occasionally, when the FH spread spectrum scheme is employed in a multiple-access environment, more than one signal can hop onto the same frequency and thus create undue interference. Although some form of TDMA can alleviate the problem, this still imposes a limitation on the maximum number of users.

A major breakthrough in wireless technology is the development and adoption of Code Division Multiple Access (CDMA). The foundation of CDMA is Direct Sequence (DS) spread spectrum. Unlike FDMA or frequency hopping, in which each user is supposed to occupy a unique frequency band at any moment, multiple CDMA users can make use of the same (and full) bandwidth of the shared wideband channel during the entire period of transmission! A common frequency band can also be allocated to multiple users in all cells, in other words, providing a reuse factor of K = 1. This has the potential to greatly increase the maximum number of users, as long as the interference from them is manageable.

As Figure 17.4 shows, for each CDMA transmitter a unique pseudo-noise sequence is fed to the Direct Sequence (DS) spreader. The pseudo-noise (also called chip code or spreading code) consists of a stream of narrow pulses called chips, with a bit width of T_c. Its bandwidth B_c is on the order of 1/T_c. Because T_c is small, B_c is much wider than the bandwidth B_b of the narrowband signal.

FIGURE 17.4: Spreading in Direct Sequence (DS) spread spectrum.

The spreading code is multiplied with the input data. When the data bit is 1, the output DS code is identical to the spreading code, and when the data bit is -1, the output DS code is the inverted spreading code. As a result, the spectrum of the original narrowband data is spread, and the bandwidth of the DS signal is

B_DS = B_c    (17.2)

The despreading process involves multiplying the DS code and the spreading sequence. As long as the same sequence is used as in the spreader, the resulting signal is the same as the original data. Figure 17.5 shows the implementation of the transmitter and receiver for the DS spread spectrum. An implementation detail slightly different from Figure 17.4 is that the DS spreader and despreader are actually analog devices. The data and spreading sequences are modulated into analog signals before being fed to the DS spreader.

FIGURE 17.5: Transmitter and receiver of Direct Sequence (DS) spread spectrum.

There are two ways to implement CDMA multiple access: orthogonal codes or nonorthogonal codes. A mobile station is dynamically assigned a unique spreading code in the cell that is also being used by the base station to separate and despread its signal.

For orthogonal CDMA, the spreading codes in a cell are orthogonal to each other [4]. Most commonly, the Walsh-Hadamard codes are used, since they possess an important property called orthogonal variable spreading factor (OVSF): OVSF codes of different lengths (i.e., different spreading factors) are still orthogonal. Orthogonality is desirable, since as long as the data is spread by orthogonal codes, it can be perfectly separated at the receiver end.

However, this property comes at a price: Walsh-Hadamard codes can have multiple autocorrelation peaks if the sequences are not synchronized, so external synchronization is necessary for the receiver to know where the beginning of the DS signal is. Synchronization is typically achieved by utilizing a Global Positioning System (GPS) in the base station. Another disadvantage is that orthogonal codes are concentrated around a small number of carrier frequencies and therefore have low spectral utilization.
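Orthogonal spreading and despreading can be demonstrated end to end with a small baseband sketch (illustrative only: spreading factor 8, two users, a perfectly synchronized noiseless channel, and ±1 data bits are all simplifying assumptions):

```python
def walsh(n):
    """Rows of the n x n Walsh-Hadamard matrix (n a power of 2) are
    mutually orthogonal +/-1 spreading codes (Sylvester construction)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-c for c in row] for row in H]
    return H

def spread(bits, code):
    # Each data bit (+1 or -1) multiplies the entire chip sequence
    return [b * c for b in bits for c in code]

def despread(signal, code):
    sf = len(code)  # spreading factor = chips per bit
    out = []
    for i in range(0, len(signal), sf):
        corr = sum(s * c for s, c in zip(signal[i:i + sf], code))
        out.append(1 if corr > 0 else -1)
    return out

codes = walsh(8)                      # spreading factor 8
code_a, code_b = codes[1], codes[2]   # two distinct orthogonal codes
tx = [a + b for a, b in zip(spread([1, -1, 1], code_a),
                            spread([-1, -1, 1], code_b))]  # signals add on air

assert despread(tx, code_a) == [1, -1, 1]    # user B cancels exactly
assert despread(tx, code_b) == [-1, -1, 1]   # and vice versa
```

Because the codes are orthogonal, correlating the summed signal with either code makes the other user's contribution vanish, which is the "perfect separation" claimed above; it holds only when the chip sequences are synchronized, motivating the GPS-based synchronization just described.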

Nonorthogonal codes are Pseudo-random Noise (PN) sequences. PN sequences need to have an average bit value of around 0.5 and a single autocorrelation peak at the start of the sequence. Thus, PN sequences are self-synchronizing and do not need external synchronization. A special PN sequence often used is the Gold sequence. Gold sequences have three cross-correlation peaks.

The DS spread spectrum makes use of the entire bandwidth of the wideband; hence, it is even more secure and robust against jamming. However, under multiple access, signals will still interfere with each other due to multipath fading, outer cell interference, and other factors. Below we provide a brief analysis of the viability of DS spread spectrum, that is, CDMA.

17.1.5 Analysis of CDMA

When FDMA or TDMA are used for a multiple-access system, bandwidth or time is divided up based on the worst case, that is, all users accessing the system simultaneously and all the time. This is of course hardly the case, especially for voice communications. CDMA allows users in the same channel to share the entire channel bandwidth. Since the effective noise is the sum of all other users' signals, it is based on the so-called "average case" or "average interference". At the receiver, the DS input is recovered by correlating with the particular user's designated spreading code. Hence, as long as an adequate level of signal-to-noise ratio is maintained, the quality of the CDMA reception is guaranteed, and universal frequency reuse is achieved.

Let's denote the thermal noise of the receiver as N_T and the received signal power of each user as P_i. The interference to the source signal received at the base station is

N = N_T + Σ_{i=2}^{M} P_i

where M is the maximum number of users in a cell.

If we assume that the thermal noise N_T is negligible and the received P_i from each user is the same, then

N = (M - 1) P_1    (17.3)

The received signal energy per bit E_b is the ratio of P_1 over the data rate R (bps),

E_b = P_1 / R    (17.4)

and the interference N_b is

N_b = N / W = (M - 1) P_1 / W    (17.5)

where W (Hz) is the bandwidth of the CDMA wideband signal carrier.

The signal-to-noise ratio (SNR) is thus

E_b / N_b = (P_1 / R) / ((M - 1) P_1 / W) = (W / R) / (M - 1)    (17.6)

Rewriting Eq. (17.6), we have

M - 1 = (W / R) / (E_b / N_b)

or approximately

M ≈ (W / R) / (E_b / N_b)    (17.7)

Equation (17.7) is an important result. It states that the capacity of the CDMA system, that is, the maximum number of users in a cell, is determined by two factors: W/R and E_b/N_b.

W/R is the ratio between the CDMA bandwidth W and the user's data rate R. This is the bandwidth spreading factor or the processing gain. It is equivalent to the number of chips in the spreading sequence. Typically, it can be in the range 10^2 to 10^3.

E_b/N_b is the bit-level SNR. Depending on the QoS (error rate requirement) and the implementation (error-correction scheme, resistance to multipath fading, etc.), a digital demodulator can usually work well with a bit-level SNR in the range 3 to 9 dB.

As an example, let's assume the bit-level SNR to be a nominal 6 dB (from 10 log E_b/N_b = 6 dB); then E_b/N_b ≈ 4. In the IS-95A standard, W = 1.25 MHz and R = 9.6 kbps. According to Eq. (17.7),

M ≈ (W / R) / (E_b / N_b) = (1,250 / 9.6) / 4 ≈ 32

This capacity of 32 seems to compare well with the AMPS system. When the reuse factor is K = 7, with a bandwidth of 1.25 MHz, the maximum number of AMPS channels allowed would be only 1,250/(30 x 7) ≈ 6. However, the above CDMA analysis has assumed no interference from neighboring cells. If this were the case, AMPS could have adopted a reuse factor of K = 1; its maximum number of channels would have been 1,250/30 ≈ 42. So how does CDMA perform if the interference from neighboring cells is taken into consideration?

It turns out [5] that the received interference from all users in neighbor cells is merely about 60% of the interference from users within the cell. Hence, Eq. (17.7) can simply be modified to include a factor of 1.6 (i.e., 100% + 60%) to reflect the neighbor cell interference:

M ≈ (W / R) / (1.6 · E_b / N_b)    (17.8)

The above factor of 1.6 can also be called the effective reuse factor, because its role is similar to the reuse factor K in the FDMA systems. It should be apparent that CDMA offers a larger capacity than FDMA because of its better use of the whole bandwidth and the much smaller (1.6 << 7) reuse factor.
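The capacity estimate of Eqs. (17.7) and (17.8) is simple enough to wrap in a helper (a sketch using only the IS-95A numbers quoted above; the function name and its `neighbor_factor` parameter are illustrative, not from the text):

```python
def cdma_capacity(w_hz, r_bps, eb_nb, neighbor_factor=1.0):
    """Approximate CDMA users per cell: M ~ (W/R) / (factor * Eb/Nb).
    eb_nb is the bit-level SNR as a linear ratio (e.g., 6 dB ~ 4)."""
    return (w_hz / r_bps) / (neighbor_factor * eb_nb)

# IS-95A numbers from the text: W = 1.25 MHz, R = 9.6 kbps, Eb/Nb ~ 4
single_cell = cdma_capacity(1.25e6, 9600, 4.0)  # Eq. (17.7)
print(int(single_cell))                         # ~32 users
```

Passing `neighbor_factor=1.6` applies Eq. (17.8) to account for the roughly 60% extra interference contributed by neighboring cells.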

The above example now yields

M ≈ (1,250 / 9.6) / (1.6 x 4) ≈ 22

a capacity gain of 22/6 over AMPS.

Before concluding this brief analysis of CDMA, it must be pointed out that several major simplifications have been made above. Some contributed to enhanced performance, whereas others hampered performance. Viterbi [5] provided a thorough analysis of the principles of CDMA.

• We assumed that the received energy P_i from each user is equal. Otherwise, the "near-far" problem dominates, where the received signal from the "near user" is too strong and from the "far user" is too weak, and the whole system's performance will collapse. This requires sophisticated power control on the transmitter. The modern CDMA power control updates power levels over 1,500 times per second to make sure the received P_i's are approximately the same.

• As a result of the tight power control required, CDMA networks have to implement soft handover. That is, when a mobile user crosses a cell boundary, it has to communicate on at least two channels at once, one for each base station in range. It would use the lowest amount of power necessary for its signal to be received properly in at least one base station. This is done to minimize outer cell interference of the mobile.

• To reduce mutual interference, antennas are not omnidirectional. Instead, directional antennas are used, and collectively they are divided into sectors. In a three-way sectorized cell, AMPS capacity is reduced to 1,250/(30 x 7 x 3) ≈ 2. Remarkably, CDMA capacity is not susceptible to such sectorization. Therefore, its capacity gain over AMPS in sectored cells is even greater. In the above example, it is 22/2 = 11.

• It was assumed that each user needs the full data rate all the time, which is false in many real applications. For example, voice applications use only 35–40% of the capacity. Effective voice coding can readily increase the network capacity by a factor of more than two.

17.1.6 3G Digital Wireless Networks

Third-generation (3G) wireless services feature various multimedia services, such as (low-rate) video over the Internet. Applications include wireless web surfing, video mail, continuous media on demand, mobile multimedia, mobile e-commerce, remote medical service, and so on. Unlike the current Wireless LAN (WLAN), which is by and large for indoor and private networks, 3G is mostly for public networks. While a large number of 2G wireless networks used both CDMA (such as IS-95A in North America) and TDMA (among them the most popular ones are GSM and IS-136), the 3G wireless networks will predominantly use Wideband CDMA (WCDMA).

The 3G standardization process started in 1998, when the ITU called for Radio Transmission Technology (RTT) proposals for International Mobile Telecommunication-2000 (IMT-2000). Since then, the project has been known as 3G or universal mobile telecommunications system (UMTS). Regional standards bodies then adopted the IMT-2000 requirements, added their own, and developed proposals and their evaluations to submit to ITU (ITU-R for radio technologies).

Even as specifications were being developed in the regional standards bodies, which have members from many multinational corporations, it was noted that most bodies tended to adopt similar WCDMA technology. To achieve global standardization and more efficiently hold discussions about the same topic, the Third Generation Partnership Project (3GPP) was established in late 1998 to specify a global standard for WCDMA technology, which was named Universal Terrestrial Radio Access (UTRA). The standards bodies that joined to create the 3GPP forum are ARIB (Japan), ETSI (Europe), TTA (Korea), TTC (Japan), and T1 (North America). Later in 1999, CWTS (China) joined the group.

The 3GPP forum focused on the WCDMA air interface, which is aimed at advancing GSM technology and is designed to interface with the GSM MAP core network. At the same time, the Telecommunication Industry Association (TIA), with major industry support, had been developing the cdma2000 air interface recommendation for ITU that is the evolution of the IS-95 standard and is designed to be used on the ANSI-41 (or IS-41) core network.

As similar work was going on in Asia, following the 3GPP example, the standards organizations decided to form a second forum called Third Generation Partnership Project 2 (3GPP2). The standards bodies that are members are ARIB (Japan), CWTS (China), TIA (North America), TTA (Korea), and TTC (Japan).

The 3GPP and 3GPP2 forums, despite having some similarities in WCDMA air interface proposals, still propose competing standards. However, in the interest of creating a global standard, the two forums are monitoring each other's progress and support recommendations by the operators harmonization group. The two forums have agreed to a harmonized standard referred to as global3G (G3G) that will have three modes: Direct Spread (DS), Multi-Carrier (MC), and Time Division Duplex (TDD), where the DS and TDD modes are specified as in WCDMA by the 3GPP group, and the MC mode is, as in cdma2000, specified by 3GPP2. All air interfaces (all modes) can be used with both core networks. At the end of 1999, ITU-R released the IMT-2000 specification that for the most part followed the harmonized standard recommendations for WCDMA.

The multimedia nature of the 3G wireless services calls for a rapid development of a new generation of handsets, where support for video, better software and user interface, and longer battery life will be key factors.

A migration (or evolution) path is specified for 2G wireless networks supporting digital communication over circuit-switched channels to 3G networks supporting high data rates over both circuit-switched and packet-switched channels. The evolution path has an intermediate step that is easier and cheaper to achieve (fewer changes to the network infrastructure) called 2.5G (2.5-generation), which is associated with enhanced data rates and packet data services (i.e., the addition of packet switching to 2G networks). Table 17.1 summarizes the 2G, 2.5G, and 3G standards that have been (or will be) developed using the IS-41 core networks (in North America) and GSM MAP core networks (in Europe, etc.).

The lasi slage in Lhe evolution lo 3G is Lhe recommended MC mode in IMT-2000. Tt is


TABLE 17.1: EvoluLion from 20 to 30 Wireless NeLworks
referred to as cdma2000 3X (or 3X Ri]’), since it uses a carrier spectrum of 5 MHz (3 x
ANSI-41 core network Peak data rate 8 Carrier spectrum W 1.25 MHz channels) to deliver a peak rate of ai ieast 2—4 Mbps. The chip rate is also tripled

n cdmaOne (iS-95A)
cdmaOne (IS-95B)
14.4 kbps
liS kbps
1.25 MHz
1.25 MHz
Lo 3.686 Mcps.
Typicai 30 data raLes are 2 Mbps for sLaLionary indoor applications, and 384 kbps and
128 kbps for slow- and fast-moving users, respecLiveiy.

cdma2000 IX 307 kbps 1.25 MHz The GSM Evolution. The OSM radio access neLwork (RAN) uses Lhe OSM MAP
cdma2000 1 xEV-DO 2.4 Mbps 1.25 MHz core neLwork. The IMT-2000 DS and TDD niodes are based on Lhe WCDMA technology
cdma2000 IxEV-DV 4.8 Mbps 1.25 MHz deveioped for the 6SM MAP neLwork. GSM is TDMA-based and therefore iess compalibie
wiLh Lhe WCDMA technology than IS-95. Hence Lhe 36 WCDMA standard does not achieve
cdma2000 3X > 2 Mbps 5 MHz backward compatibility with currenL-generaLion OSM networks. Moreover, each evolulion
GSM MAS core network Peak data rate R Carrier spectrum W toward 30 requires support for anoLher mode of operaLion from mobile sLaLions.
OSM is a 20 network providing only circuiL-switched communication. General Packet
26 OSM (TDMA) 14.4 kbps 1.25 MHz Radio Service (GPRS) is a 2.50 enhancement Lhat supports packel switching and higher
date rates. As with CDMA2000 IX, EDGE (Enhanced Data rales for Global Evolugion or
2.50 GPRS (TDMA) 170 kbps 1.25 MHz
Enhanced Data GSM Environment) supports up lo Lripie Lhe data rate of GSM and OPRS.
30 EDGE (TDMA) 384 kbps 1.25 MHz EDGE is stili a TDMA-based sLandard, detined mainly for GSM evolution to WCDMA.
36 WCDMA 2 Mbps 5 MHz However it is defined in IMT-2000 as UWC- 136 for Single Carrier Mode (IMT-SC) and,
as such, is a 30 soluLion. lt can achieve a data rate up Lo 384 kbps by new modulation and
radio techniques, Lo optimize the use of available spectrum.
The IS-95 Evolution. IS-95A and IS-95B, now known as cd,naOne, are based on the Eventualiy, Lhe 30 technology (also referred to as 3GSM) will be adapted according Lo
IS-4 1 core network and use narrowband CDMA air interface. As such, ali development Lhe WCDMA modes IMT-2000 recommendalions. WCDMA has two modes of operation:
is geared toward extending the existing CDMA framework to 3G with backward compatibility. This is seen as a major cost-efficiency issue and therefore has major industry support, as well as quick adaptability. IS-95A is a 2G technology and has only circuit-switched channels with data rates up to 14.4 kbps. An extension to it is IS-95B (2.5G), which supports packet switching and achieves maximum rates of 115 kbps.

IMT-2000 MC mode, originally called cdma2000, can operate in all bands of the IMT spectrum (450, 700, 800, 900, 1700, 1800, 1900, and 2100 MHz). To ease the deployment of cdma2000, the evolution framework is divided into four stages, each backward compatible with previous stages and cdmaOne.

The cdma2000 1X (or 1X RTT) specification, also known as the high rate packet data air interface specification, delivers enhanced services up to a 307 kbps peak rate and 144 kbps on average. This air interface provides two to three times the data capacity of IS-95B. The 1X means that it occupies one times the channels for cdmaOne — 1.25 MHz carrier bandwidth per channel. As with the IS-95 air interface, the chip rate is 1.2288 Mcps (megachips per second).

The next step in cdma2000 deployment is cdma2000 1xEV (EV for EVolution), split into two phases. The air interface tries to support both ANSI-41 and GSM MAP networks, although priority is given to ANSI-41. The first phase is called 1xEV-DO (Data Only), supporting data transmission only at rates up to 2.4 Mbps. Voice communication is transmitted on a separate channel. Phase 2 is called 1xEV-DV (Data and Voice) and enhances the 1xEV interface to support voice communication as well. It promises an even higher data rate, up to 4.8 Mbps.

(wideband CDMA) with Direct Sequence (DS) [also called Frequency Division Duplex (FDD)] and Time Division Duplex (TDD). FDD mode is used in the paired frequencies spectrum allocated, where the uplink and downlink channels use different frequencies. However, for the unpaired frequencies, it is necessary to transmit both uplink and downlink channels at the same frequencies. This is achieved by using time slots, having the uplink use a different time slot than the downlink. It also requires a more complicated timing control in the mobile than in TDMA.

Key differences in the WCDMA air interface from a narrowband CDMA air interface are:

• To support bitrates up to 2 Mbps, a wider channel bandwidth is allocated. The WCDMA channel bandwidth is 5 MHz, as opposed to 1.25 MHz for IS-95 and other earlier standards.

• To effectively use the 5 MHz bandwidth, longer spreading codes at higher chip rates are necessary. The chip rate specified is 3.84 Mcps, as opposed to 1.2288 Mcps for IS-95.

• WCDMA supports variable bitrates, from 8 kbps up to 2 Mbps. This is achieved using variable-length spreading codes and time frames of 10 msec, at which the user data rate remains constant but can change from one frame to the other — hence Bandwidth on Demand (BoD).

• WCDMA base stations use asynchronous CDMA with Gold codes. This eliminates the need for a GPS in the base station for global time synchronization, as in IS-95 systems. Base stations can now be made smaller and less expensive and can be located indoors.
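The chip-rate and bitrate figures above determine how much spreading each user's data receives. The sketch below is illustrative only (it uses the rates quoted in this section and the simple chips-per-bit ratio, not a standard's exact symbol-level definition of spreading factor):

```python
import math

def spreading_factor(chip_rate, bit_rate):
    """Chips transmitted per user data bit."""
    return chip_rate / bit_rate

def processing_gain_db(chip_rate, bit_rate):
    """Processing gain in dB: 10*log10(chip rate / bit rate)."""
    return 10 * math.log10(chip_rate / bit_rate)

# WCDMA: 3.84 Mcps carrying an 8 kbps voice user vs. a 2 Mbps data user.
print(spreading_factor(3.84e6, 8e3))              # 480.0 chips per bit
print(round(processing_gain_db(3.84e6, 2e6), 1))  # 2.8 dB: almost no spreading left
# IS-95: 1.2288 Mcps over a 14.4 kbps circuit-switched channel.
print(round(spreading_factor(1.2288e6, 14.4e3)))  # 85
```

The contrast shows why a 2 Mbps user needs the full 5 MHz channel: at 3.84 Mcps there are fewer than two chips per bit, so little processing gain remains, whereas a low-rate voice user enjoys a long spreading code and correspondingly high gain.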
492 Chapter 17 Wireless Networks
Section 17.2 Radio Propagation Models 493
17.1.7 Wireless LAN (WLAN)

From the beginning, Wireless WAN (Wide Area Network) was popular, due to various voice and data applications. The increasing availability of laptop computers brought about keen interest in Wireless LANs (Local Area Networks). Moreover, the emergence lately of ubiquitous and pervasive computing [6] has created a new surge of interest in Wireless LANs (WLANs).

IEEE 802.11. IEEE 802.11 was the earlier standard for WLAN developed by the IEEE 802.11 working group. It specified Medium Access Control (MAC) and Physical (PHY) layers for wireless connectivity in a local area within a radius of several hundred feet. PHY supported both Frequency Hopping (FH) spread spectrum and Direct Sequence (DS) spread spectrum. The ISM frequency band used was 2.4 GHz. Moreover, (diffused) infrared light was also supported for indoor communications in the range of 10–20 meters.

WLAN can be used either as a replacement for or an extension to the wired LAN. Similar to Ethernet, the basic access method of 802.11 is Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA). The data rates supported by 802.11 were 1 Mbps and 2 Mbps.

The 802.11 standards also address the following important issues:

• Security. Enhanced authentication and encryption, since WLAN is even more susceptible to break-ins.

• Power management. Saves power during no transmission and handles doze and awake.

• Roaming. Permits acceptance of the basic message format by different access points.

IEEE 802.11b. IEEE 802.11b is an enhancement of 802.11. It still uses DS spread spectrum and operates in the 2.4 GHz band. With the aid of new technology, especially the Complementary Code Keying (CCK) modulation technique, it supports 5.5 and 11 Mbps in addition to the original 1 and 2 Mbps, and its functionality is comparable to Ethernet. In North America, for example, the allocated spectrum for 802.11 and 802.11b is 2.400–2.4835 GHz. Regardless of the data rate (1, 2, 5.5, or 11 Mbps), the bandwidth of a DS spread spectrum channel is 20 MHz. Three nonoverlapped DS channels can be accommodated simultaneously, allowing a maximum of 3 access points in a local area.

IEEE 802.11b has gained public acceptance and is appearing in WLANs everywhere, including university campuses, airports, conference centers, and so on.

IEEE 802.11a. IEEE 802.11a operates in the 5 GHz band and supports data rates in the range of 6 to 54 Mbps. Instead of DS spread spectrum, it uses Orthogonal Frequency Division Multiplexing (OFDM). It allows 12 nonoverlapping channels, hence a maximum of 12 access points in a local area.

Because 802.11a operates in the higher frequency (5 GHz) band, it faces much less Radio Frequency (RF) interference, such as from cordless phones, than 802.11 and 802.11b. Coupled with the higher data rate, it has great potential for supporting various multimedia applications in a LAN environment.

High Performance Radio LAN (HIPERLAN/2) is the European sibling of IEEE 802.11a. It also operates in the 5 GHz band and is promised to deliver a data rate of up to 54 Mbps. Wesel [2] provides a good description of HIPERLAN.

IEEE 802.11g and others. IEEE 802.11g, an extension of 802.11b, is an attempt to achieve data rates up to 54 Mbps in the 2.4 GHz band. As in 802.11a, OFDM will be used instead of DS spread spectrum. However, 802.11g still suffers from higher RF interference than does 802.11a, and as in 802.11b, has the limitation of three access points in a local area.

IEEE 802.11g is designed to be downward compatible with 802.11b, which brings a significant overhead for all 802.11b and 802.11g users on the 802.11g network.

Another half-dozen 802.11 standards are being developed that deal with various aspects of WLAN. The Further Exploration section of this chapter has WWW URLs for these standards. Notably, 802.11e deals with MAC enhancement for QoS, especially prioritized transmission for voice and video.

Bluetooth. Bluetooth (named after the tenth-century king of Denmark Harold Bluetooth) is a new protocol intended for short-range (piconet) wireless communications. In particular, it can be used to replace cables connecting mobile and/or fixed computers and devices. It uses FH spread spectrum at the 2.4 GHz ISM band and a full-duplex signal that hops among 79 frequencies at 1 MHz intervals and at a rate of 1,600 hops per second. Bluetooth supports both circuit switching and packet switching. It supports up to three voice channels (each 64 kbps symmetric) and more than one data channel (each over 400 kbps symmetric).

The Bluetooth consortium web site (www.bluetooth.com) provides a detailed core specification and includes a description of the use of Wireless Application Protocol (WAP) in the Bluetooth environment. In the "briefcase trick", for example, the user's mobile phone will communicate with his/her laptop periodically, so e-mail can be reviewed from the handheld phone without opening the briefcase.

Some new Sony camcorders already have a built-in Bluetooth interface. This permits moving or still pictures to be sent to a PC or to the web directly (without a PC) through a mobile phone equipped with Bluetooth, at a speed of over 700 kbps, within a distance of 10 meters. Such camcorders can even be used to browse the WWW and send e-mail with JPEG or MPEG-1 attachments.

17.2 RADIO PROPAGATION MODELS

Radio transmission channels present much greater engineering difficulties than wired lines. In this section, we briefly present the most common radio channel models to gain insight into the cause of bit/frame errors and to classify the types of bit errors, the amount, and whether they are bursty.

Various effects cause radio signal degradation on the receiver side (other than noise). They can be classified as short-range and long-range effects. Accordingly, multipath fading models are available for small-scale fading channels, and path-loss models are available for long-range atmospheric attenuation channels.
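One standard way to make the question of burstiness concrete is a two-state Markov channel, the Gilbert-Elliott model. This model is not developed in the text; the sketch below uses arbitrarily chosen transition probabilities purely for illustration:

```python
import random

def gilbert_elliott(n_bits, p_gb, p_bg, ber_good, ber_bad, seed=0):
    """Simulate a bursty bit-error channel with a 'good' and a 'bad' state.

    p_gb / p_bg are the per-bit probabilities of switching good->bad and
    bad->good; each state has its own bit-error rate.  Returns a list of
    booleans, True where the bit was corrupted."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    errors, bad = [], False
    for _ in range(n_bits):
        if bad:
            if rng.random() < p_bg:
                bad = False
        else:
            if rng.random() < p_gb:
                bad = True
        errors.append(rng.random() < (ber_bad if bad else ber_good))
    return errors

errs = gilbert_elliott(100_000, p_gb=0.001, p_bg=0.05, ber_good=1e-5, ber_bad=0.2)
print(sum(errs) / len(errs))  # overall BER, dominated by the bad state
runs = "".join("1" if e else "0" for e in errs)
print(max(map(len, runs.split("0"))))  # longest error burst, in bits
```

Because the chain lingers in the bad state for about 1/p_bg = 20 bits at a time, the errors arrive in clusters rather than uniformly; the fading models that follow explain the physical mechanisms behind such bursts.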
For indoor channels, the radio signal power is generally lower, and there are more objects in a small place; some are moving. Hence, multipath fading is the main factor for signal degradation, and fading models are established. In such an environment, the transmitted signal is split into multiple paths going to the receiver, each path having its own attenuation, phase delay, and time delay.

Multipath models probabilistically state the received signal amplitude, which varies according to whether the signals superimposed at the receiver are added destructively or constructively. Signal fading occurs due to reflection, refraction, scattering, and diffraction (mainly from moving objects).

Outdoors there are also refraction, diffraction, and scattering effects, mostly caused by the ground and buildings. Long-range communication, however, is dominated by atmospheric attenuation. Depending on the frequency, radio waves can penetrate the ionosphere (> 3 GHz) and establish line-of-sight (LOS) communication, or for lower frequencies reflect off the ionosphere and the ground, or travel along the ionosphere to the receiver. Frequencies over 3 GHz (which are necessary for satellite transmissions to penetrate the ionosphere) experience gaseous attenuations, influenced primarily by oxygen and water (vapor and rain).

17.2.1 Multipath Fading

Fading models try to model the amplitude of the superimposed signal at the receiver. The Doppler spread of a signal is defined as the distribution of the signal power over the frequency spectrum (the signal is modulated at a specific frequency bandwidth). When the Doppler spread of the signal is small enough, the signal is coherent — that is, there is only one distinguishable signal at the receiver. This is typically the case for narrowband signals. However, when the signal is wideband, different frequencies of the signal have different fading paths, and a few distinguishable signal paths are observed at the receiver, separated in time. For narrowband signals, the most popular models are Rayleigh fading and Rician fading.

The Rayleigh model assumes an infinite number of signal paths with no line-of-sight (LOS) to the receiver for modeling the probability density function P_r of received signal amplitude r:

    P_r(r) = (r / σ²) e^(−r² / (2σ²))    (17.9)

where σ is the standard deviation of the probability density function. Although the number of signal paths is typically not too large, the Rayleigh model does provide a good approximation when the number of paths is over 5.

A more general model that assumes a LOS is the Rician model. It defines a K-factor as a ratio of the LOS signal power to the scattered power — that is, K is the factor by which the LOS signal is greater than the other paths. The Rician probability density function P_r is

    P_r(r) = (r / σ²) e^(−r² / (2σ²) − K) I₀(r s / σ²),  where K = s² / (2σ²)    (17.10)

As before, r and σ are the signal amplitude and standard deviation respectively, and s is the LOS signal power. I₀ is a modified Bessel function of the first kind with 0 order.

FIGURE 17.6: Rician PDF plot with K-factor = 0, 1, 3, 5, 10, and 20.

Note that when s = 0 (K = 0) there is no LOS, and the model thus reduces to a Rayleigh distribution. When K → ∞ the model reflects the additive white Gaussian noise (AWGN) conditions. Figure 17.6 shows the Rician probability density function for K-factors of 0, 1, 3, 5, 10, and 20, with standard deviation of σ = 1.0.

For a wideband signal, the fading paths are more empirically driven. One way is to model the amplitude as a summation over all the paths, each having randomized fading. The number of paths can be 7 for a closed-room environment (six walls and LOS) or a larger number for other environments. An alternative technique of modeling the channel fading is by measuring the channel impulse response.

A similar technique is utilized in CDMA systems, proposed in cdma2000 as well and added to WCDMA as part of the harmonization effort. A CDMA station (both mobile and base station) has rake receivers, which are multiple CDMA radio receivers tuned to signals with different phase and amplitude, to recompose the CDMA transmission that split into different distinguishable paths. The signal at each rake receiver is added up to achieve better SNR. To tune the rake receivers to the proper fading paths, CDMA systems have a special pilot channel that sends a well-known pilot signal, and the rake receivers are adjusted to recognize that symbol on each fading path.
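Equations (17.9) and (17.10) are easy to check numerically. The sketch below is not from the text: it evaluates the Rician density with a standard power-series implementation of I₀ and verifies that each curve of Figure 17.6 integrates to 1 for σ = 1.

```python
import math

def bessel_i0(x):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    term, total, k = 1.0, 1.0, 0
    while term > 1e-16 * total:
        k += 1
        term *= (x * x / 4.0) / (k * k)
        total += term
    return total

def rician_pdf(r, K, sigma=1.0):
    """Eq. (17.10) with K = s^2 / (2 sigma^2); K = 0 reduces to the
    Rayleigh density of Eq. (17.9)."""
    s = sigma * math.sqrt(2.0 * K)
    return (r / sigma**2) * math.exp(-(r**2) / (2 * sigma**2) - K) * bessel_i0(r * s / sigma**2)

# Riemann sum over r in (0, 20]; each K-curve of Figure 17.6 integrates to ~1.
dr = 0.01
for K in (0, 1, 3, 5, 10, 20):
    area = sum(rician_pdf(n * dr, K) for n in range(1, 2001)) * dr
    print(K, round(area, 2))
```

Setting K = 0 in the code indeed collapses the density to r·e^(−r²/2), the Rayleigh case, confirming the reduction noted above.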
496 Chapter 17 Wireless Networks Section 17.3 Multimedia over Wireless Networks 497

17.2.2 Path Loss

For long-range communication, the signal loss is dominated by attenuation. The free-space attenuation model for LOS transmission is in inverse proportion to the square of distance (d²) and is given by the Friis radiation equation as

    S_r / S_t = (G_r G_t λ²) / ((4π)² d² L)    (17.11)

S_r and S_t are the received and transmitted signal power, G_r and G_t are the antenna gain factors, λ is the signal wavelength, and L is the receiver loss. It can be shown, however, that if we assume ground reflection, attenuation increases to be proportional to d⁴.

Another popular medium-scale (urban city size) model is the Hata model, which is empirically derived based on Okumura path loss data in Tokyo. The basic form of the path loss equation in dB is given by

    L = A + B log₁₀(d) + C.    (17.12)

Here, A is a function of the frequency and antenna heights, B is an environment function, and C is a function depending on the carrier frequency. Again, d is the distance from the transmitter to the receiver.

Satellite models are attenuated primarily by rain. Hence, meteorological rainfall density maps can be used to communicate with the region. Attenuation is computed according to the amount of rainfall in the area on the given date.

17.3 MULTIMEDIA OVER WIRELESS NETWORKS

We have studied the evolution of current 2G networks to future high-capacity 3G networks, but is there a demand for 3G networks? Multimedia over wireless will certainly need a higher bandwidth. Suggested multimedia applications range from web browsing, streaming video, videoconferencing, collaborative work, and slide-show presentations to enhanced roadside assistance and downloadable GPS maps for drivers.

In this section we are concerned mainly with sending video robustly over wireless channels, such as for a videoconferencing application. This application should be prominent on 3G handhelds, since it is a natural extension to voice communication.

Because wireless data transmissions incur the most data loss and distortion, error resilience and error correction become primary concerns. We have thus included some brief description of synchronization loss, error-resilient entropy coding, error concealment, and Forward Error Correction (FEC) in this section, although most of these techniques are also applicable to other networks.

A few characteristics of wireless handheld devices are worth keeping in mind when designing multimedia transmission, in particular video transmission. First, both the handheld size and battery life limit the processing power and memory of the device. Thus, encoding and decoding must have relatively low complexity. Of course, one advantage of the smaller device size is that lower-resolution videos are acceptable, which helps reduce processing time.

Second, due to memory constraints and reasons for the use of wireless devices, as well as billing procedures, real-time communication is likely to be required. Long delays before starting to see a video are either not possible or not acceptable.

Finally, wireless channels have much more interference than wired channels, with specific loss patterns depending on the environment conditions. The bitrate for wireless channels is also much more limited, although the 3G bitrates are more suitable for video. This implies that although a lot of bit protection must be applied, coding efficiency has to be maintained as well. Error-resilient coding is important.

3G standards specify that video shall be standard compliant. Moreover, most companies will concentrate on developing products using standards, in the interest of interoperability of mobiles and networks. The video standards reasonable for use over wireless channels are MPEG-4 and H.263 and its variants, since they have low bitrate requirements.

The 3GPP2 group has defined the following QoS parameters for wireless videoconferencing services [7]. The QoS parameters specified for the wireless part are more stringent than those required for end-to-end transmissions. The 3GPP QoS requirements for multimedia transmission are nearly identical [8].

• Synchronization. Video and audio should be synchronized to within 20 msec.

• Throughput. The minimum video bitrate to be supported is 32 kbps. Video rates of 128 kbps, 384 kbps, and above should be supported as well.

• Delay. The maximum end-to-end transmission delay is defined to be 400 msec.

• Jitter. The maximum delay jitter (maximum difference between the average delay and the 95th percentile of the delay distribution) is 200 msec.

• Error rate. The videoconferencing system should be able to tolerate a frame error rate of 10⁻² or a bit error rate of 10⁻⁶ for circuit-switched transmission.

In the following, we discuss the vulnerability of a video sequence to bit errors and ways to improve resilience to errors.

17.3.1 Synchronization Loss

A video stream is either packetized and transmitted over a packet-switched channel or transmitted as a continuous bitstream over a circuit-switched channel. In either case, it is obvious that packet loss or bit error will reduce video quality. If a bit loss or packet loss is localized in the video in both space and time, the loss can still be acceptable, since a frame is displayed for a very short period, and a small error might go unnoticed.

However, digital video coding techniques involve variable-length codes, and frames are coded with different prediction and quantization levels. Unfortunately, when a packet

containing variable bit-length data (such as DCT coefficients) is damaged, that error, if unconstrained, will propagate all the way throughout the stream. This is called loss of decoder synchronization. Even if the decoder can detect the error due to an invalid coded symbol or coefficients out of range, it still cannot establish the next point from which to start decoding [9].

As we have learned in Chapter 10, this complete bitstream loss does not happen for videos coded with standardized protocol layers. The Picture layer and the Group Of Blocks (GOB) layer or Slice headers have synchronization markers that enable decoder resynchronization. For example, the H.263 bitstream has four layers — the Picture layer, GOB layer, Macroblock layer, and Block layer. The Picture layer starts with a unique 22-bit picture start code (PSC). The longest entropy-coded symbol possible is 13 bits, so the PSC serves as a synchronization marker as well. The GOB layer is provided for synchronization after a few blocks rather than the entire frame. The group of blocks start code (GBSC) is 17 bits long and also serves as a synchronization marker.² The Macroblock and Block layers do not contain unique start codes, as these are deemed high overhead.

ITU standards after H.261 (e.g., H.263, H.263+, etc.) support slice-structured mode instead of GOBs (H.263 Annex K), where slices group blocks together according to the block's coded bit length rather than the number of blocks. The objective is to space slice headers within a known distance of each other. That way, when a bitstream error looks like a synchronization marker, if the marker is not where the slice headers should be it is discarded, and no false resynchronization occurs.

Since slices need to group an integral number of macroblocks together, and macroblocks are coded using VLCs, it is not possible to have all slices the same size. However, there is a minimum distance after which the next scanned macroblock will be added to a new slice. We know that DC coefficients in macroblocks and motion vectors of macroblocks are differentially coded. Therefore, if a macroblock is damaged and the decoder locates the next synchronization marker, it might still not be able to decode the stream.

To alleviate the problem, slices also reset spatial prediction parameters; differential coding across slice boundaries is not permitted. The ISO MPEG standards (and H.264 as well) specify slices that are not required to be of similar bit length and so do not protect against false markers well.

Other than synchronization loss, we should note that errors in prediction reference frames cause much more damage to signal quality than errors in frames not used for prediction. That is, a frame error for an I-frame will deteriorate the quality of a video stream more than a frame error for a P- or B-frame. Similarly, if the video is scalable, an error at the base layer will deteriorate the quality of a video stream more than in enhancement layers.

MPEG-4 defines additional error-resilient tools that are useful for coding under noisy and wireless channel conditions. These are in addition to slice coding and Reversible Variable Length Codes (RVLCs) [10, 11]. To further help with synchronization, a data partitioning scheme will group and separate header information, motion vectors, and DCT coefficients into different packets and put synchronization markers between them. As we shall see later on, such a scheme is also beneficial to unequal protection Forward Error Correction (FEC) schemes.

²Synchronization markers are always larger than the minimum required, in case bit errors change bits to look like synchronization markers.

Additionally, an adaptive intra-frame refresh mode is allowed, where each macroblock can be coded independently of the frame as an inter- or intra- block according to its motion, to assist with error concealment. A faster-moving block will require more frequent refreshing — that is, be coded in intra- mode more often. Synchronization markers are easy to recognize and are particularly well suited to devices with limited processing power, such as cell phones and mobile devices.

For interactive applications, if a back channel is available to the encoder, a few additional error control techniques are available, classified as sender-receiver feedback. According to the bandwidth available at any moment, the receiver can ask the sender to lower or increase the video bitrate (transmission rate control), which combats packet loss due to congestion. If the stream is scalable, it can also ask for enhancement layers.

Additionally, Annex N of H.263+ specifies that the receiver can notice damage in a reference frame and request that the encoder use a different reference frame for prediction — a reference frame the decoder has reconstructed correctly.

The above techniques can be used in wireless real-time video applications such as videoconferencing, since wireless cell communication supports a back channel if necessary. However, it is obviously cheaper not to use one (it would reduce multiple-access interference in the uplink).

17.3.2 Error Resilient Entropy Coding

The main purpose of GOBs, slices, and synchronization markers is to reestablish decoder synchronization as soon as possible after an error. In Annex K of H.263+, slices achieve better resilience, since they impose further constraints on where the stream can be synchronized. However, another algorithm, called Error Resilient Entropy Coding (EREC), can achieve synchronization after every single macroblock, without any of the overhead of the slice headers or GOB headers. The algorithm is called EREC because it takes entropy-coded variable-length macroblocks and rearranges them in an error-resilient fashion. In addition, it can provide graceful degradation.

EREC takes a coded bitstream of a few blocks and rearranges them so that the beginnings of all the blocks are a fixed distance apart. Although the blocks can be of any size and any media we wish to synchronize, the following description will refer to macroblocks in videos. The algorithm proceeds as in Figure 17.7.

Initially, EREC slots (rows) of fixed bit-length are allocated with total bit-length equal to (or exceeding) the total bit-length of all the macroblocks. The number of slots is equal to the number of macroblocks, except that the macroblocks have varying bit-length and the slots have a fixed bit-length (approximately equal to the average bit-length of all the macroblocks). As shown, the last EREC slot (row) is shorter when the total number of bits does not divide evenly by the number of slots.

Let k be the number of macroblocks (equal to the number of slots), l be the total bit-length of all the macroblocks, mbs[ ] be the macroblocks, and slots[ ] be the EREC slots. The procedure for encoding the macroblocks is shown below.
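Alongside the pseudocode below, the slot-filling pass can be written out concretely. The following Python sketch is an illustration, not the book's code: macroblocks are represented as plain bit-strings, and the rule of splitting the total bit count as evenly as possible across slots (later slots one bit shorter) is an assumption consistent with the "shorter last slot" remark above.

```python
def erec_encode(mbs):
    """EREC encoding sketch: pack variable-length macroblock bit-strings
    into k fixed-length slots, where k = number of macroblocks."""
    k = len(mbs)
    total = sum(len(mb) for mb in mbs)
    # Fixed slot lengths, as even as possible, summing to the total bit count.
    sizes = [total // k + (1 if i < total % k else 0) for i in range(k)]
    slots = [""] * k
    remaining = list(mbs)   # bits of each macroblock not yet placed
    left = total            # total unplaced bits (the 'l' of the procedure)
    j = 0                   # stage number
    while left > 0:
        for i in range(k):
            m = (i + j) % k                     # slot searched by macroblock i
            space = sizes[m] - len(slots[m])
            if space > 0 and remaining[i]:
                shifted = remaining[i][:space]  # as many bits as fit
                remaining[i] = remaining[i][space:]
                slots[m] += shifted
                left -= len(shifted)
        j += 1                                  # shift the macroblocks downwards
    return slots
```

For four macroblocks of 5, 2, 9, and 4 bits, the 9-bit block's overflow ends up in the space left empty by the 2- and 4-bit blocks, and every slot boundary sits at a known, fixed offset in the transmitted stream, which is exactly what makes resynchronization possible without markers.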
FIGURE 17.7: Example of macroblock encoding using EREC.

PROCEDURE 17.1 ERECEncode

BEGIN
    j = 0;
    Repeat until l = 0
        for i = 0 to k − 1
            m = (i + j) mod k;
            // m is the slot number corresponding to macroblock i
            Shift as many bits as possible (without overflow) from mbs[i] into slots[m];
            sb = number of bits successfully shifted into slots[m] (without overflow);
            l = l − sb;
        j = j + 1;  // shift the macroblocks downwards
END

The macroblocks are shifted into the corresponding slots until all the bits of the macroblock have been assigned or the remaining bits of the macroblock don't fit into the slot. Then the macroblocks are shifted down, and this procedure repeats.

FIGURE 17.8: Example of macroblock decoding using EREC.

The decoder side works in reverse, with the additional requirement that it has to detect when a macroblock has been read in full. It accomplishes this by detecting the end of macroblock when all DCT coefficients have been decoded (or a block end code). Figure 17.8 shows an example of the decoding process for the macroblocks coded using EREC in Figure 17.7.

The transmission order of the data in the slots is row-major — that is, at first the data in slot 0 is sent, then slot 1, and so on, left to right. It is easy to see how this technique is resilient to errors. No matter where the damage is, even at the beginning of a macroblock, we still know where the next macroblock starts — it is a fixed distance from the previous one. In this case, no synchronization markers are used, so the GOB layer or slices are not necessary either (although we still might want to restrict spatial propagation of error).

When the macroblocks are coded using a data partitioning technique (such as the one for MPEG-4 described in the previous section) and also bitplane partitioning, an error in the bitstream will destroy less significant data while receiving the significant data. It is obvious that the chance for error propagation is greater for bits at the end of the slot than at the beginning. On average, this will also reduce visual deterioration over a nonpartitioned encoding. This achieves graceful degradation under worsening error conditions.

17.3.3 Error Concealment

Despite all the efforts to minimize occurrences of errors and their significance, errors can still be visually annoying. Error concealment techniques are thus introduced to approximate the lost data on the decoder side.

Many error concealment techniques apply either in the spatial, temporal, or frequency domain, or a combination of them. All the techniques use neighboring frames temporally

or neighboring macroblocks spatially. The transport stream coder interleaves the video packets, so that in case of a burst packet loss, not all the errors will be at one place, and the missing data can be estimated from the neighborhood.

Error concealment is necessary for wireless video communication, since the error rates are higher than for wired channels and might even be higher than can be transmitted with appropriate bit protection. Moreover, the error rate fluctuates more often, depending on various mobility or weather conditions. Decoding errors due to missing or wrong data received are more noticeable on devices with limited resolution and small screen sizes. This is especially true if macroblock size remains large, to achieve encoding efficiency for lower wireless bitrates.

Following is a summary of techniques for error concealment. (See [12] for further details.)

1. Dealing with lost macroblock(s). A simple and popular technique for concealment can be used when DCT blocks are damaged but the motion vectors are received correctly. The missing block coefficients are estimated from the reference frame, assuming no prediction errors. Since the goal of motion-compensated video is to minimize prediction errors, this is an appropriate assumption. The missing block is hence temporally masked using the block in the reference frame.

We can achieve even better results if the video is scalable. In that case, we assume that the base layer is received correctly and that it contains the motion vectors and base layer coefficients that are most important. Then, for a lost macroblock at the enhancement layer, we use the motion vectors from the base layer, replace the DCT coefficients at the enhancement layer, and decode as usual from there. Since coefficients of less importance are estimated (such as higher-frequency coefficients), even if the estimation is not too accurate due to prediction errors, the concealment is more effective than in a nonscalable case.

If motion vector information is damaged as well, this technique can be used only if the motion vectors are estimated using another concealment technique (to be discussed next). The estimation of the motion vector has to be good, or the visual quality of the video could be inauspicious. To apply this technique for intra-frames, some standards, such as MPEG-2, also allow the acquisition of motion vectors for intra-coded frames (i.e., treating them as intra- as well as inter-frames). These motion vectors are discarded if the block has no error.

2. Combining temporal, spatial, and frequency coherences. Instead of just relying on the temporal coherence of motion vectors, we can combine it with spatial and frequency coherences. By having rules for estimating missing block coefficients using the received coefficients and neighboring blocks in the same frame, we can conceal errors for intra-frames and for frames with damaged motion vector information. Additionally, combining with prediction using motion vectors will give us a better approximation of the prediction error block.

Missing block coefficients can be estimated spatially by minimizing the error of a smoothness function defined over the block and neighboring blocks. For simplicity, the smoothness function can be chosen as the sum of squared differences of pairwise neighboring pixels in the block. The function unknowns are the missing coefficients. In the case where motion information is available, prediction smoothness is added to the objective function for minimization, weighted as desired.

The simple smoothness measure defined above has the problem that it smoothes edges as well. We can attempt to do better by increasing the order of the smoothing criterion from linear to quadratic or cubic. This will increase the chances of having both edge reconstruction and smoothing along the edge direction. At a larger computational cost, we can use an edge-adaptive smoothing method, whereby the edge directions inside the block are first determined, and smoothing is not permitted across edges.

3. Frequency smoothing for high-frequency coefficients. Smoothing can be defined much more simply, to save on computational cost. Although the human visual system is more sensitive to low frequencies, it would be disturbing to see a checkerboard pattern where it does not belong. This will happen when a high-frequency coefficient is erroneously assigned a high value. The simplest remedy is to set high-frequency coefficients to 0 if they are damaged.

If the frequencies of neighboring blocks are correlated, it is possible to estimate lost coefficients in the frequency domain directly. For each missing frequency coefficient in a block, we estimate its value using an interpolation of the same frequency coefficient values from the four neighboring blocks. This is applicable at higher frequencies only if the image has regular patterns. Unfortunately that is not usually the case for natural images, so most of the time the high coefficients are again set to 0. Temporal prediction error blocks are even less correlated at all frequencies, so this method applies only for intra-frames.

4. Estimation of lost motion vectors. Loss of motion vectors prevents decoding of an entire predicted block, so it is important to estimate motion vectors well. The easiest way to estimate lost motion vectors is to set them to 0. This works well only in the presence of very little motion. A better estimation is obtained by examining the motion vectors of reference macroblocks and of neighboring macroblocks. Assuming motion is also coherent, it is reasonable to take the motion vectors of the corresponding macroblock in the reference frame as the motion vectors for the damaged target block. Similarly, assuming objects with consistent motion fields occupy more than one macroblock, the motion vector for the damaged block can be approximated as an interpolation of the motion vectors of the surrounding blocks that were received correctly. Typical simple interpolation schemes are weighted-average and median. Also, the spatial estimation of the motion vector can be combined with the estimation from the reference frame using weighted sums.

17.3.4 Forward Error Correction (FEC)

Some data are vitally important for correct decoding. Missing DCT coefficients may be estimated or their effect visually concealed to some degree. However, some lost and improperly estimated data, such as picture coding mode, quantization level, or most data in higher layers of a video standard protocol stack, will cause catastrophic video decoding failure. In such cases, we would like to ensure "error-free" transmission. However, most channels, in particular wireless channels, are noisy, and to ensure correct transmission, we
504 Chapter 17 Wireless Networks Section 17.3 Multimedia over Wireless Networks 505

must provide adequate redundant retransmissions (when no back channel is available). Forward Error Correction (FEC) is a technique that adds redundant data to a bitstream to recover some random bit errors in it. Ideally, the channel-packet error rate (or bit error rate) is estimated, and enough redundancy is added to make the probability of error after FEC recovery low.

The interval over which the packet error rate is estimated is chosen to be the smallest possible (to minimize latency and computation cost) that reliably estimates the frame loss probability. Naturally, when burst frame loss occurs, the estimation may no longer be adequate. Frame errors are also called erasures, since the entire packet is dropped on an error.

Videos have to be transmitted over a channel with limited bandwidth. Therefore, it is important to minimize redundancy, because it comes at the expense of bitrates available for video source coding. At the same time, enough redundancy is needed so that the video can maintain required QoS under the current channel error conditions. There is an optimal amount of redundancy that minimizes video distortion, given certain channel conditions.

FEC codes in general fall into two categories: block codes and convolutional codes. Block codes apply to a group of bits at once to generate redundancy. Convolutional codes apply to a string of bits one at a time and have memory that can store previous bits as well. The following presents both types of FEC codes in brief [13].

Block Codes. Block codes [2] take as input k bits and append r = n − k bits of FEC data, resulting in an n-bit-long string. These codes are referred to as (n, k) codes. The two types of block codes are linear and cyclic. All error correction codes operate by adding space between valid source strings. The space is measured using a Hamming distance, defined as the minimum number of bits between any coded strings that need to be changed so as to be identical to a second string.

To detect r errors, the Hamming distance has to at least equal r; otherwise, the corrupt string might seem valid again. This is not sufficient for correcting r errors, however, since there is not enough distance among valid codes to choose a preferable correction. To correct r errors, the Hamming distance must be at least 2r [14, 15]. Linear codes are simple to compute but have higher coding overhead than cyclic codes.

Cyclic codes are stated in terms of generator polynomials of maximum degree equal to the number of source bits. The source bits are the coefficients of the polynomial, and redundancy is generated by multiplying with another polynomial. The code is cyclic, since the modulo operation in effect shifts the polynomial coefficients.

One of the most used classes of cyclic codes is the Bose–Chaudhuri–Hocquenghem (BCH) codes, since they apply to any binary string. The generator polynomial for BCH is given over GF(2) (the binary Galois field) and is the lowest-degree polynomial with roots of α^i, where α is a primitive element of the field (i.e., 2) and i goes over the range of 1 to twice the number of bits we wish to correct.

BCH codes can be encoded and decoded quickly using integer arithmetic, since they use Galois fields. H.261 and H.263 use BCH to allow for 18 parity bits every 493 source bits. Unfortunately, the 18 parity bits will correct at most two errors in the source. Thus, the packets are still vulnerable to burst bit errors or single-packet errors.

FIGURE 17.9: Interleaving scheme for redundancy codes. Packets or bits are stored in rows, and redundancy is generated in the last r columns. The sending order is by columns, top to bottom, then left to right.

An important subclass of BCH codes that applies to multiple packets is the Reed–Solomon (RS) codes. RS codes have a generator polynomial over GF(2^m), with m being the packet size in bits. RS codes take a group of k source packets and output n packets with r = n − k redundancy packets. Up to r lost packets can be recovered from n coded packets if we know the erasure points. Otherwise, as with all FEC codes, recovery can be applied only to half the number of packets (similarly, the number of bits), since error-point detection is now necessary as well. In the RS codes, only ⌊r/2⌋ packets can then be recovered.

Fortunately, in the packet FEC scenario the packets have headers that can contain a sequence number and CRC codes on the physical layer. In most cases, a packet with an error is dropped, and we can tell the location of the missing packet from the missing sequence number. RS codes are used in storage media such as CD-ROMs and in network multimedia transmissions that can have burst errors.

It is also possible to use packet interleaving to increase resilience to burst packet loss. As Figure 17.9 shows, the RS code is generated for each of the h rows of k source video packets. Then it is transmitted in column-major order, so that the first packet of each of the h rows is transmitted first, then the second, and so on. If a burst packet loss occurs, we can tolerate more than r erasures, since there is enough redundancy data. This scheme introduces additional delay but does not increase computational cost.

RS codes can be useful for transmission over packet networks. When there are burst packet losses, packet interleaving, and packet sequencing, it is possible to detect which packets were received incorrectly and recover them using the available redundancy. If the video has scalability, a better use of allocated bandwidth is to apply adequate FEC protection on the base layer, containing motion vectors and all header information required to decode video to the minimum QoS. The enhancement layers can receive either less protection or none at all, relying just on resilient coding and error concealment. Either way, the minimum QoS is already achieved.
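The row-wise redundancy and column-major sending order of Figure 17.9 can be sketched in a few lines of Python. For brevity, a single XOR parity packet per row stands in for a real Reed–Solomon code (so r = 1, and each row can recover only one erasure); the packet contents and dimensions below are invented for illustration.

```python
# Toy illustration of the interleaving scheme of Figure 17.9, using one
# XOR parity packet per row as a stand-in for a real (n, k) RS code.

def xor_parity(packets):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def encode_rows(rows):
    """Append one parity packet to each row of k source packets."""
    return [row + [xor_parity(row)] for row in rows]

def interleave(table):
    """Send order of Figure 17.9: by columns, top to bottom, then left to right."""
    h, n = len(table), len(table[0])
    return [table[r][c] for c in range(n) for r in range(h)]

def deinterleave(stream, h, n):
    table = [[None] * n for _ in range(h)]
    for idx, pkt in enumerate(stream):
        c, r = divmod(idx, h)
        table[r][c] = pkt
    return table

def recover_row(row):
    """Recover a single erased packet (None) in a row from the XOR parity."""
    missing = [i for i, p in enumerate(row) if p is None]
    if len(missing) == 1:
        row[missing[0]] = xor_parity([p for p in row if p is not None])
    return row

# Example: h = 3 rows of k = 4 source packets, then a burst loss of 3 packets.
rows = [[bytes([r * 4 + c] * 4) for c in range(4)] for r in range(3)]
coded = encode_rows(rows)            # each row now has n = 5 packets
stream = interleave(coded)
stream[3:6] = [None, None, None]     # burst of 3 consecutive lost packets
table = deinterleave(stream, h=3, n=5)
table = [recover_row(row) for row in table]
assert all(table[r][c] == rows[r][c] for r in range(3) for c in range(4))
```

Because the burst of three consecutive losses lands in three different rows after deinterleaving, each row sees only a single erasure and all source packets are recovered, which is exactly the effect the text describes.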

A disadvantage of block codes is that they cannot be selectively applied to certain bits. It is difficult to protect higher-protocol-layer headers with more redundancy bits than, say, DCT coefficients, if they are sent in the same transport packet (or even group of packets). On the other hand, convolutional codes can do this, which makes them more efficient for data in which unequal protection is advantageous, such as videos. Although convolutional codes are not as effective against burst packet loss, for wireless radio channels burst packet loss is not predominant (and not present in most propagation models).

Convolutional Codes. Convolutional FEC codes are defined over generator polynomials as well [13]. They are computed by shifting k message bits into a coder that convolves them with the generator polynomial to generate n bits. The rate of such a code is defined to be k/n. The shifting is necessary, since coding is achieved using memory (shift) registers. There can be more than k registers, in which case past bits also affect the redundancy code generated.

After producing the n bits, some redundancy bits can be deleted (or "punctured") to decrease the size of n and increase the rate of the code. Such FEC schemes are known as rate compatible punctured convolutional (RCPC) codes. The higher the rate, the lower the bit protection will be, but also the less overhead on the bitrate. A Viterbi algorithm with soft decisions decodes the encoded bit stream, although turbo codes are gaining popularity. RCPC codes provide an advantage over block codes for wireless (sections of the) network, since burst packet losses are not likely. RCPC puncturing is done after generation of parity information. Knowing the significance of the source bits for video quality, we can apply a different amount of puncturing and hence a different amount of error protection. Studies and simulations of wireless radio models have shown that applying unequal protection using RCPC according to bit significance information results in better video quality (up to 2 dB better) for the same allocated bitrate than videos protected using RS codes.

Simplistically, the Picture layer in a video protocol should get the highest protection, the macroblock layer that is more localized will get lower protection, and the DCT coefficients in the block layer can get little protection, or none at all. This could be extended further to scalable videos in similar ways.

The cdma2000 standard uses convolutional codes to protect transmitted bits for any data type, with different code rates for different transmission bitrates. If future 3G networks incorporate data-type-specific provisions and recognize the video standard chosen for transmission, they can adaptively apply transport coding of the video stream with enough unequal redundancy suitable to the channel conditions at the time and the QoS requested.

17.3.5 Trends in Wireless Interactive Multimedia

The UMTS forum foresees that by 2010, the number of subscribers of wireless multimedia communication will exceed a billion worldwide, and such traffic will be worth over several hundred billion dollars to operators. Additionally, 3G will also speed the convergence of telecommunications, computers, multimedia content, and content providers to support enhanced services.

Most cellular networks around the world have already offered 2.5G services for a few years. Initial 3G services are also being offered globally, with cdma2000 1x service already commercially available in most countries. Some of the present and future 3G applications are:

• Multimedia Messaging Service (MMS), a new messaging protocol for multimedia data on mobile phones that incorporates audio, images, and other multimedia content, along with traditional text messages

• Mobile videophone, VoIP, and voice-activated network access

• Mobile Internet access, with streaming audio and video services

• Mobile intranet/extranet access, with secure access to corporate LANs, Virtual Private Networks (VPNs), and the Internet

• Customized infotainment service that provides access to personalized content anytime, anywhere, based on mobile portals

• Mobile online multiuser gaming

• Ubiquitous and pervasive computing [6], such as automobile telematics, where an automated navigation system equipped with GPS and voice recognition can interact with the driver to obviate reading maps while driving

The industry has long envisioned the convergence of IT, entertainment, and telecommunications. A major portion of the telecommunication field is dedicated to handheld wireless devices — the mobile stations (cell phones). At the same time, the computer industry has focused on creating handheld computers that can do at least some important tasks necessary for people on the go. Handheld computers are classified as Pocket PCs or PDAs.

Pocket PCs are typically larger, have a keyboard, and support most functions and programs of a desktop PC. PDAs do simpler tasks, such as storing event calendars and phone numbers. PDAs normally use a form of handwriting recognition for input, although some incorporate keyboards as well. PDA manufacturers are striving to support more PC-like functions and at the same time provide wireless packet services (including voice over IP), so that a PDA can be used as a phone as well as for wireless Internet connectivity.

As with all small portable computers, the Human Computer Interaction (HCI) problem is more significant than when using a desktop computer. Where there is no space for a keyboard, it is envisioned that command input will be accomplished through voice recognition. Most of the new PDA products support image and video capture, MP3 playback, e-mail, and wireless protocols such as 802.11b and Bluetooth. Some also act as cell phones when connected to a GPRS or PCS network (e.g., the Handspring Treo). They have color screens and support web browsing and multimedia e-mail messaging. Some Bluetooth-enabled PDAs rely on Bluetooth-compatible cell phones to access mobile networks. However, as cell phones become more powerful and PDAs incorporate 802.11b interface cards, Bluetooth might become less viable.

As PDA manufacturers look to the future, they wish to support not only voice communication over wireless networks but also multimedia, such as video communication. Some PDAs incorporate advanced digital cameras with flash and zoom (e.g., the Sony CLIE). The encoding of video can be done using MPEG-4 or H.263, and the PDA could support multiple playback formats.

Cell phone manufacturers, for their part, are trying to incorporate more computer-like functionality, including the basic tasks supported by PDAs, web browsing, games, image and video capture, attachments to e-mail, streaming video, videoconferencing, and so on. Growth in demand is steady for interactive multimedia, in particular image and video communications. Most cell phone manufacturers and mobile service providers already support some kind of image or video communication, either in the form of e-mail attachments, video streaming, or even videoconferencing. Similarly to Short-text Messaging Service (SMS), the new messaging protocol Multimedia Messaging Service (MMS) is gaining support in the industry as an interim solution to the bandwidth limitation. New cell phones feature color displays and have built-in digital cameras. Most cell phones use integrated CMOS sensors, and some handsets even have two of them. By 2004, the number of camera sensors on mobile phones is estimated to exceed the number of digital cameras sold worldwide. Cell phones have supported web browsing and e-mail functionality for a few years, but with packet services, Bluetooth, and MMS, they can support video streaming in various formats and MP3 playback. Some cell phones even include a touch screen that uses handwriting recognition and a stylus, as most PDAs do. Other cell phones are envisioned to be small enough to be wearable, instead of a wrist watch.

17.4 FURTHER EXPLORATION

Tanenbaum [14] has a good general discussion of wireless networks, and Wesel [2] offers some specifics about wireless communications networks. Viterbi [5] provides a solid analysis on spread spectrum and the foundation of CDMA. Wang et al. [16] give an in-depth discussion on error control in video communications.

The Further Exploration section of the text web site for this chapter contains current web resources for wireless networks, including

• A survey on wireless networks and cellular phone technologies

• A report on GSM

• An introduction to GPRS

and links to

• NTIA for information on spectrum management

• Home pages of the CDMA Development Group, IMT-2000, UMTS, cdma2000 RTT, 3GPP, and so on

• Wireless LAN standards

We also show images of several PDAs and several modern cell phones.

17.5 EXERCISES

1. In implementations of TDMA systems such as GSM and IS-136, and to a lesser degree in networks based on CDMA, such as IS-95, an FDMA technology is still in use to divide the allocated carrier spectrum into smaller channels. Why is this necessary?

2. Discuss the difference between the way GSM/GPRS and WCDMA achieve variable bitrate transmissions.

3. We have seen a geometric layout for a cellular network in Figure 17.1. The figure assumes hexagonal cells and a symmetric plan (i.e., that the scheme for splitting the frequency spectrum over different cells is uniform). Also, the reuse factor is K = 7. Depending on cell sizes and radio interference, the reuse factor may need to be different. Still requiring hexagonal cells, can all possible reuse factors achieve a symmetric plan? Which ones can? Can you speculate on a formula for general possible reuse factors?

4. What is the spreading gain for IS-95? What is the spreading gain for WCDMA UTRA FDD mode, assuming all users want to transmit at maximum bitrate? What is the impact of the difference between the spreading gains?

5. When a cellular phone user travels across the cell boundary, a handoff (or handover) from one cell to the other is necessary. A hard (imperfect) handoff causes dropped calls.

(a) CDMA (Direct Sequence) provides much better handoff performance than FDMA or Frequency Hopping (FH). Why?

(b) Suggest an improvement to handoff so it can be softer.

6. In a CDMA cell, when a CDMA mobile station moves across a cell boundary, a soft handoff occurs. Moreover, cells are also split into sectors, and when a mobile station moves between sectors, a softer handoff occurs.

(a) Provide arguments for why a softer handoff is necessary.

(b) State at least one other difference between the two handoffs.

Hint: During handoff in a CDMA system, the mobile stations can transmit at lower power levels than inside the cell.

7. Most of the schemes for channel allocation discussed in this chapter are fixed (or uniform) channel assignment schemes. It is possible to design a dynamic channel allocation scheme to improve the performance of a cell network. Suggest such a dynamic channel allocation scheme.

8. The 2.5G technologies are designed for packet-switching services. This provides data-on-demand connectivity without the need to establish a circuit first. This is advantageous for sporadic data bursts.

(a) Suggest a method to implement multiple access control for TDMA packet services (such as GPRS).

(b) Circuits are more efficient for longer data. Extend your suggested method so that the channel goes through a contention process only for the first packet transmitted.

Hint: Add reservations to your scheme.

9. H.263+ and MPEG-4 use RVLCs, which allow decoding of a stream in both forward and backward directions from a synchronization marker. The RVLCs increase the bitrate of the encoding over regular entropy codes.

(a) Why is this beneficial for transmissions over wireless channels?

(b) What condition is necessary for it to be more efficient than FEC?

10. Why are RVLCs usually applied only to motion vectors? If you wanted to reduce the bitrate impact, what changes would you make?

17.6 REFERENCES

1 M. Rahnema, "Overview of GSM System and Protocol Architecture," IEEE Communications Magazine, 31(4): 92–100, 1993.

2 E.K. Wesel, Wireless Multimedia Communications: Networking Video, Voice, and Data, Reading, MA: Addison-Wesley, 1998.

3 F. Meeks, "The Sound of Lamarr," Forbes, May 14, 1990.

4 H. Holma and A. Toskala, eds., WCDMA for UMTS: Radio Access for Third Generation Mobile Communications, New York: Wiley, 2001.

5 A.J. Viterbi, CDMA: Principles of Spread Spectrum Communication, Reading, MA: Addison-Wesley, 1995.

6 J. Burkhardt, et al., Pervasive Computing: Technology and Architecture of Mobile Internet Applications, Boston, MA: Addison-Wesley, 2002.

7 Third Generation Partnership Project 2 (3GPP2), Video Conferencing Services — Stage 1, 3GPP2 Specifications, S.R0022, July 2000.

8 Third Generation Partnership Project (3GPP), QoS for Speech and Multimedia Codec, 3GPP Specifications, TR-26.912, March 2000.

9 K.N. Ngan, C.W. Yap, and K.T. Tan, Video Coding for Wireless Communication Systems, New York: Marcel Dekker, 2001.

10 Y. Takishima, M. Wada, and H. Murakami, "Reversible Variable Length Codes," IEEE Transactions on Communications, 43(2-4): 158–162, 1995.

11 C.W. Tsai and J.L. Wu, "On Constructing the Huffman-Code-Based Reversible Variable-Length Codes," IEEE Transactions on Communications, 49(9): 1506–1509, 2001.

12 Y. Wang and Q.F. Zhu, "Error Control and Concealment for Video Communication: A Review," Proceedings of the IEEE, 86(5): 974–997, 1998.

13 A. Houghton, Error Coding for Engineers, Norwell, MA: Kluwer Academic Publishers, 2001.

14 A.S. Tanenbaum, Computer Networks, 4th ed., Upper Saddle River, NJ: Prentice Hall PTR, 2003.

15 W. Stallings, Data & Computer Communications, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2000.

16 Y. Wang, J. Ostermann, and Y.Q. Zhang, Video Processing and Communications, Upper Saddle River, NJ: Prentice Hall, 2002.

CHAPTER 18

Content-Based Retrieval in Digital Libraries

18.1 HOW SHOULD WE RETRIEVE IMAGES?

Consider the image in Figure 18.1 of a small portion of The Garden of Delights by Hieronymus Bosch (1453–1516), now in the Prado museum in Madrid. This is a famous painting, but we may be stumped in understanding the painter's intent. Therefore, if we are aiming at automatic retrieval of images, it should be unsurprising that encapsulating the semantics (meaning) in the image is an even more difficult challenge. A proper annotation of such an image certainly should include the descriptor "people". On the other hand, should this image be blocked by a "Net nanny" screening out "naked people" (as in [1])?

We know very well that most major web browsers have a web search button for multimedia content, as opposed to text. For Bosch's painting, a text-based search will very likely do the best job, should we wish to find this particular image. Yet we may be interested in fairly general searches, say for scenes with deep blue skies and orange sunsets. By pre-calculating some fundamental statistics about images stored in a database, we can usually find simple scenes such as these.

In its inception, retrieval from digital libraries began with ideas borrowed from traditional information retrieval disciplines (see, e.g., [2]). This line of inquiry continues. For example, in [3], images are classified into indoor or outdoor classes using basic information-retrieval techniques. For a training set of images and captions, the number of times each word appears in the document is divided by the number of times each word appears over all documents in a class. A similar measure is devised for statistical descriptors of the content of image segments, and the two information-retrieval-based measures are combined for an effective classification mechanism.

However, most multimedia retrieval schemes have moved toward an approach favoring multimedia content itself, without regard to or reliance upon accompanying textual information. Only recently has attention once more been placed on the deeper problem of addressing semantic content in images, once again making use of accompanying text. If data consists of statistical features built from objects in images and also of text associated with the images, each type of modality — text and image — provides semantic content omitted from the other. For example, an image of a red rose will not normally have the manually added keyword "red", since this is generally assumed. Hence, image features and associated words may disambiguate each other (see [4]).
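The caption-based measure described above for [3] can be sketched as follows. The captions, class labels, and the way the per-word ratios are combined into a single score are invented here for illustration only; the actual system in [3] also folds in statistical descriptors of image segments, which are omitted.

```python
# Toy sketch of the caption-word measure described above: the count of a
# word in the query caption is divided by the word's total count over all
# training captions of a class, and the ratios are summed into a score.
from collections import Counter

training = {
    "indoor": ["a lamp on a desk in an office", "people at a table indoors"],
    "outdoor": ["blue sky over a beach", "a sunset sky and orange clouds"],
}

# Per-class word counts over all training captions of that class.
class_counts = {
    label: Counter(w for caption in caps for w in caption.split())
    for label, caps in training.items()
}

def score(caption, label):
    words = Counter(caption.split())
    # Words never seen in this class contribute nothing to its score.
    return sum(n / class_counts[label][w]
               for w, n in words.items() if class_counts[label][w])

query = "orange sky at sunset"
best = max(class_counts, key=lambda label: score(query, label))
assert best == "outdoor"
```

Rare, class-specific words ("sunset", "orange") score highly because they dominate their class counts, while common words contribute to every class — the same intuition that drives classical term-weighting in information retrieval.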

512 Chapter 18 Content-Based Retrieval in Digital Libraries    Section 18.2 C-BIRD — A Case Study 513

FIGURE 18.1: How can we best characterize the information content of an image? © Museo del Prado.

In this chapter, however, we shall focus only on the more standardized systems that make use of image features to retrieve images from databases or from the web. The types of features typically used are such statistical measures as the color histogram for an image. Consider an image that is colorful — say, a Santa Claus plus sled. The combination of bright red and flesh tones and browns might be enough of an image signature to allow us to at least find similar images in our own image database (of office Christmas parties).

Recall that a color histogram is typically a three-dimensional array that counts pixels with specific red, green, and blue values. The nice feature of such a structure is that it does not care about the orientation of the image (since we are simply counting pixel values, not their orientation) and is also fairly impervious to object occlusions. A seminal paper on this subject [5] launched a tidal wave of interest in such so-called "low-level" features for images.

Other simple features used are such descriptors as color layout, meaning a simple sketch of where in a checkerboard grid covering the image to look for blue skies and orange sunsets. Another feature used is texture, meaning some type of descriptor typically based on an edge image, formed by taking partial derivatives of the image itself — classifying edges according to closeness of spacing and orientation. An interesting version of this approach uses a histogram of such edge features. Texture layout can also be used. Search engines devised on these features are said to be content-based: the search is guided by image similarity measures based on the statistical content of each image.

Typically, we might be interested in looking for images similar to our current favorite Santa. A more industry-oriented application would typically be seeking a particular image of a postage stamp. Subject fields associated with image database search include art galleries and museums, fashion, interior design, remote sensing, geographic information systems, meteorology, trademark databases, criminology, and an increasing number of other areas.

A more difficult type of search involves looking for a particular object within images, which we can term a search-by-object model. This involves a much more complete catalog of image contents and is a much more difficult objective. Generally, users will base their searches on search by association [6], meaning a first-cut search followed by refinement based on similarity to some of the query results. For general images representative of a kind of desired picture, a category search returns one element of the requested set, such as one or several trademarks in a database of such logos. Alternatively, the query may be based on a very specific image, such as a particular piece of art — a target search.

Another axis to bear in mind in understanding the many existing search systems is whether the domain being searched is narrow, such as the database of trademarks, or wide, such as a set of commercial stock photos.

For any system, we are up against the fundamental nature of machine systems that aim to replace human endeavors. The main obstacles are neatly summarized in what the authors of the summary in [6] term the sensory gap and the semantic gap:

    The sensory gap is the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.

    The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.

Image features record specifics about images, but the images themselves may elude description in such terms. And while we may certainly be able to describe images linguistically, the message in the image, the semantics, is difficult to capture for machine applications.

18.2 C-BIRD — A CASE STUDY

Let us consider the specifics of how image queries are carried out. To make the discussion concrete, we underpin our discussion by using the image database search engine devised by one of the authors of this text (see [7]). This system is called Content-Based Image Retrieval from Digital libraries (C-BIRD), an acronym devised from content-based image retrieval, or CBIR. (The URL for this search engine is given in the Further Exploration section for this chapter.)
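The edge-based texture feature mentioned in Section 18.1 — partial derivatives of the image, classified by orientation and then histogrammed — can be sketched as follows. The number of orientation bins and the edge-strength threshold are arbitrary illustrative choices, not values taken from any particular system.

```python
# Toy edge-orientation histogram: central differences as partial
# derivatives, strong-edge pixels binned by orientation (0..180 degrees).
import math

def edge_orientation_histogram(img, bins=4, threshold=10.0):
    h, w = len(img), len(img[0])
    hist = [0] * bins
    total = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # partial derivative in x
            gy = img[y + 1][x] - img[y - 1][x]   # partial derivative in y
            if math.hypot(gx, gy) < threshold:
                continue                          # too weak to be an edge
            angle = math.atan2(gy, gx) % math.pi  # fold 180-degree symmetry
            hist[int(angle / math.pi * bins) % bins] += 1
            total += 1
    return [c / total for c in hist] if total else hist

# A vertical step edge: every gradient points horizontally, so all edge
# pixels fall into the first orientation bin (of 0/45/90/135 degrees).
img = [[0] * 4 + [100] * 4 for _ in range(8)]
hist = edge_orientation_histogram(img)
assert hist[0] == 1.0
```

Normalizing by the edge-pixel count, as with the color histogram, makes the descriptor comparable across images of different sizes.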

.3Cb.dJW,bl-I .LI~.U.*,,, - - — -
int hist[256] [2561 ~256) // reset to O
I/irnage is an appropriate struct
//with byte fields red,green,blue

for i=0. . (MAX_Y—1)


e -t
for j=0. . (MAX_X—1)

R = irnage[iflj).red;
G = imagelil[j).green;
B = image[i][jJ.blue;
hist[R) (GI [EI..;

Usuaily, we do not use histograms wiLh 50 many bins. in part because fewer bins tend lo

smooLh ouL differences in similar but unequal images. We also wish Lo save sLorage space.
~ ugn,~4j~fi1PIS How image search proceeds is by maLching lhe fewure vector for lhe sample image, ia
Lhis case the colar histogram, wilh Lhe feature vector for every oraL leasl many of — Lhe
p - rcáa..j r~—dw.
rcu~p~a. l.*awa r St,u~ SemI Rn~.n~twàg,’
images in lhe daLabase.
C-BIRD calculales a colar hisLogram for each targel image as a preprocessing sLep, lhen
s*a,de.,qeat.4a ~N~6telfl
references it in the database for each user query image. The histogram is defined coarsely,
with bins quantized to 8 bits, with 3 bits for each of red and green and 2 for blue.

18.2.1 C-BIRD GUI

Figure 18.2 shows the GUI for the C-BIRD system. The entire image database can be
browsed, or it can be searched using a selection of tools: text annotations, color histograms,
illumination-invariant color histograms, color density, color layout, texture layout, and
model-based search. Many of the images are keyframes from videos, and a video player is
incorporated in the system.

Let's step through these options. Other systems, discussed in Section 18.3, have similar
feature sets.

18.2.2 Color Histogram

In C-BIRD, features are precomputed for each image in the database. The most prevalent
feature utilized in image database retrieval is the color histogram [5], a type of global
image feature; that is, the image is not segmented; instead, every image region is treated
equally.

A color histogram counts pixels with a given pixel value in red, green, and blue (RGB).
For example, in pseudocode, for images with 8-bit values in each of R, G, B, we can fill a
histogram that has 256^3 bins.

FIGURE 18.2: C-BIRD image-search GUI.

For example, Figure 18.3 shows that the user has selected a particular image — one
with red flowers. The result obtained, from a database of some 5,000 images, is a set of 60
matching images. Most CBIR systems return as the result set either the top few matches
or the match set with a similarity measure above a fixed threshold value. C-BIRD uses the
latter approach and thus may return zero search results.

How matching proceeds in practice depends on what measure of similarity we adopt.
The standard measure used for color histograms is called the histogram intersection. First,
a color histogram H_i is generated for each image i in the database. We like to think of the
histogram as a three-index array, but of course the machine thinks of it as a long vector —
hence the term "feature vector" for any of these types of measures.

The histogram is normalized, so that its sum (now a double) equals unity. This
normalization step is interesting: it effectively removes the size of the image. The reason is that
if the image has, say, resolution 640 x 480, then the histogram entries sum to 307,200. But
if the image is only one-quarter that size, or 320 x 240, the sum is only 76,800. Division
by the total pixel count removes this difference. In fact, the normalized histograms can be
viewed as probability density functions (pdfs). The histogram is then stored in the database.

Now suppose we select a "model" image — the new image to match against all possible
targets in the database. Its histogram H_m is intersected with all database image histograms
H_i, according to the equation [5]

    intersection = sum_j min(H_m^j, H_i^j)    (18.1)
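The histogram construction and intersection just described can be sketched as follows. This is a minimal illustration in Python with NumPy; the function names are ours rather than C-BIRD's actual code, and for compactness it builds the coarse 3-3-2-bit histogram (256 bins) rather than the full 256^3-bin version:

```python
import numpy as np

def coarse_histogram(img):
    # img: H x W x 3 uint8 RGB array.
    # Quantize to 8 bits total: 3 bits red, 3 bits green, 2 bits blue
    # -> 8 * 8 * 4 = 256 bins.
    r = img[:, :, 0] >> 5            # top 3 bits of R (0..7)
    g = img[:, :, 1] >> 5            # top 3 bits of G (0..7)
    b = img[:, :, 2] >> 6            # top 2 bits of B (0..3)
    bins = (r.astype(int) << 5) | (g.astype(int) << 2) | b.astype(int)
    hist = np.bincount(bins.ravel(), minlength=256).astype(float)
    return hist / hist.sum()         # normalize: sums to 1, a pdf

def histogram_intersection(h_model, h_target):
    # Equation (18.1): sum over bins j of min(H_m^j, H_i^j).
    return float(np.minimum(h_model, h_target).sum())
```

Because both histograms are normalized to unit sum, images of different resolutions compare directly, and the intersection of a histogram with itself is exactly 1.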
Section 18.2 C-BIRD — A Case Study 517
516 Chapter 18 Content-Based Retrieval in Digital Libraries

FIGURE 18.3: Search by color histogram results. (This figure also appears in the color insert
section.) Some thumbnail images are from the Corel Gallery and are copyright Corel. All
rights reserved.

FIGURE 18.4: Color density query scheme.
where superscript j denotes histogram bin j, with each histogram having n bins. The closer
the intersection value is to 1, the better the images match. This intersection value is fast to
compute, but we should note that the intersection value is sensitive to color quantization.

18.2.3 Color Density

Figure 18.4 displays the scheme for showing color density. The user selects the percentage
of the image having any particular color or set of colors, using a color picker and sliders.
We can choose from either conjunction (ANDing) or disjunction (ORing) of a simple color
percentage specification. This is a coarse search method.

18.2.4 Color Layout

The user can set up a scheme of how colors should appear in the image, in terms of coarse
blocks of color. The user has a choice of four grid sizes: 1 x 1, 2 x 2, 4 x 4, and 8 x 8. Search
is specified on one of the grid sizes, and the grid can be filled with any RGB color value —
or no color value at all, to indicate that the cell should not be considered. Every database
image is partitioned into windows four times, once for each window size. A clustered color
histogram is used inside each window, and the five most frequent colors are stored in the
database. Each query cell position and size corresponds to the position and size of a window
in the image. Figure 18.5 shows how this layout scheme is used.

18.2.5 Texture Layout

Similar to color layout search, this query allows the user to draw the desired texture
distribution. Available textures are zero-density texture, medium-density edges in four directions
(0°, 45°, 90°, 135°) and combinations of them, and high-density texture in four directions
and combinations of them. Texture matching is done by classifying textures according to
directionality and density (or separation) and evaluating their correspondence to the texture
distribution selected by the user in the texture block layout. Figure 18.6 shows how this
layout scheme is used.

Texture Analysis Details

It is worthwhile considering some of the details for a texture-based content analysis aimed
at image search. These details give a taste of typical techniques systems must employ to
work in practical situations.

First, we create a texture histogram. A typical set of indices for comprehending texture
is Tamura's [8]. Human perception studies show that "repetitiveness," "directionality," and
"granularity" are the most relevant discriminatory factors in human textural perception [9].
Here, we use a two-dimensional texture histogram based on directionality φ and edge
separation ξ, which is closely related to "repetitiveness." φ measures the edge orientations,
and ξ measures the distances between parallel edges.
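Stepping back to the color layout index of Section 18.2.4, the per-window descriptor (grid partition, most frequent quantized colors per window) can be sketched roughly as follows. The names, the 3-3-2-bit quantization, and the use of simple frequency counts in place of a clustered histogram are our illustrative choices:

```python
import numpy as np

def layout_descriptor(img, grid=4, top_k=5):
    # Partition the image into grid x grid windows and keep the top_k
    # most frequent quantized colors in each window (cf. Section 18.2.4).
    h, w, _ = img.shape
    desc = {}
    for gy in range(grid):
        for gx in range(grid):
            win = img[gy * h // grid:(gy + 1) * h // grid,
                      gx * w // grid:(gx + 1) * w // grid]
            # 3-3-2-bit quantization, as for the global histogram
            codes = ((win[:, :, 0].astype(int) >> 5) << 5 |
                     (win[:, :, 1].astype(int) >> 5) << 2 |
                      win[:, :, 2].astype(int) >> 6)
            counts = np.bincount(codes.ravel(), minlength=256)
            # indices of the top_k most frequent color codes
            desc[(gy, gx)] = list(np.argsort(counts)[::-1][:top_k])
    return desc
```

A query grid is then compared cell by cell against the stored descriptors for the window size that matches the chosen grid.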

FIGURE 18.5: Color layout grid.

FIGURE 18.6: Texture layout grid.
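The texture layout matching above classifies edges by direction and density. As a minimal sketch of the underlying measurement, here are edge magnitude and orientation computed with the standard Sobel masks (a naive Python/NumPy version with our own function name; a real system would use an optimized convolution):

```python
import numpy as np

def sobel_edges(lum):
    # lum: 2D float array of luminance values.
    # Standard Sobel convolution masks for the two derivatives.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)
    h, w = lum.shape
    dx = np.zeros((h, w))
    dy = np.zeros((h, w))
    for y in range(1, h - 1):        # slide the 3 x 3 masks over the image
        for x in range(1, w - 1):
            patch = lum[y - 1:y + 2, x - 1:x + 2]
            dx[y, x] = (kx * patch).sum()
            dy[y, x] = (ky * patch).sum()
    mag = np.hypot(dx, dy)           # edge magnitude
    ang = np.arctan2(dy, dx)         # edge gradient direction
    return mag, ang
```

Binning the resulting orientations (and, per edge pixel, the distance to the nearest parallel edge) gives the directionality/separation statistics that the texture layout search compares.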

To extract an edge map, the image is first converted to luminance Y via Y = 0.299R +
0.587G + 0.114B. A Sobel edge operator [10] is applied to the Y-image by sliding the
following 3 x 3 weighting matrices (convolution masks) over the image:

           -1  0  1               1   2   1
    d_x:   -2  0  2       d_y:    0   0   0        (18.2)
           -1  0  1              -1  -2  -1

If we average around each pixel with these weights, we produce approximations to
derivatives.

The edge magnitude D and the edge gradient φ are given by

    D = sqrt(d_x^2 + d_y^2),    φ = arctan(d_y / d_x)    (18.3)

Next, the edges are thinned by suppressing all but maximum values. If a pixel i with
edge gradient φ_i and edge magnitude D_i has a neighbor pixel j along the direction of φ_i
with gradient φ_j ≈ φ_i and edge magnitude D_j > D_i, then pixel i is suppressed to 0.

To make a binary edge image, we set all pixels with D greater than a threshold value to
1 and all others to 0.

For edge separation ξ, for each edge pixel i we measure the distance along its gradient
φ_i to the nearest pixel j having φ_j ≈ φ_i within 15°. If such a pixel j doesn't exist, the
separation is considered infinite.

Having created edge directionality and edge separation maps, C-BIRD constructs a 2D
texture histogram of φ versus ξ. The initial histogram size is 193 x 180, where separation
value ξ = 193 is reserved for a separation of infinity (as well as any ξ > 192). The
histogram size is then reduced by three for each dimension to size 65 x 60, where joined
entries are summed together. The histogram is "smoothed" by replacing each pixel with a
weighted sum of its neighbors and is then reduced again to size 7 x 8, with separation value
7 reserved for infinity. At this stage, the texture histogram is also normalized by dividing
by the number of pixels in the image segment.

18.2.6 Search by Illumination Invariance

Illumination change can dramatically alter the color measured by camera RGB sensors,
from pink under daylight to purple under fluorescent lighting, for example.

To deal with illumination change from the query image to different database images, each
color-channel band of each image is first normalized, then compressed to a 36-vector [11].
Normalizing each of the R, G, and B bands of an image serves as a simple yet effective guard
against color changes when the lighting color changes. A two-dimensional color histogram

FIGURE 18.7: Search with illumination invariance. Some thumbnail images are from the
Corel Gallery and are copyright Corel. All rights reserved.

FIGURE 18.8: C-BIRD interface, showing object selection using an ellipse primitive. (This
figure also appears in the color insert section.) Image is from the Corel Gallery and is
copyright Corel. All rights reserved.

is then created using the chromaticity, which is the set of band ratios {R, G}/(R + G + B).
Chromaticity is similar to the chrominance in video, in that it captures color information
only, not luminance (or brightness).

A 128 x 128–bin 2D color histogram can then be treated as an image and compressed
using a wavelet-based compression scheme [12]. To further reduce the number of vector
components in a feature vector, the DCT coefficients for the smaller histogram are calculated
and placed in zigzag order, then all but 36 components are dropped.

Matching is performed in the compressed domain by taking the Euclidean distance
between two DCT-compressed 36-component feature vectors. (This illumination-invariant
scheme and the object-model-based search described next are unique to C-BIRD.)
Figure 18.7 shows the results of such a search.

Several of the above types of searches can be done at once by checking multiple check
boxes. This returns a reduced list of images, since the list is the conjunction of all resulting
separate return lists for each method.

18.2.7 Search by Object Model

The most important search type C-BIRD supports is the model-based object search. The
user picks a sample image and interactively selects a region for object searching. Objects
photographed under different scene conditions are still effectively matched. This search
type proceeds by the user selecting a thumbnail and clicking the Model tab to enter Object
Selection mode. An object is then interactively selected as a portion of the image; this
constitutes an object query by example.

Figure 18.8 shows a sample object selection. An image region can be selected using
primitive shapes such as a rectangle or ellipse, a magic wand tool that is basically a seed-based
flooding algorithm, an active contour (a "snake"), or a brush tool, where the painted
region is selected. All the selections can be combined with each other using Boolean
operations such as union, intersection, or exclusion.

Once the object region is defined to a user's satisfaction, it can be dragged to the right
pane, showing all current selections. Multiple regions can be dragged to the selection pane,
but only the active object in the selection pane will be searched on. The user can also control
parameters such as flooding thresholds, brush size, and active contour curvature.

Details of the underlying mechanisms of this Search by Object Model are set out in
[12] and introduced below as an example of a working system. Figure 18.9 shows a block
diagram of how the algorithm proceeds. First, the user-selected model image is processed
and its features are localized (details in the following sections). Color histogram intersection,
based on the reduced chromaticity histogram described in Section 18.2.6, is then applied as
a first "screen." Further steps estimate the pose (scale, translation, rotation) of the object
inside a target image from the database. This is followed by verification by intersection of
texture histograms and then a final check using an efficient version of a Generalized Hough
Transform for shape verification.
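The illumination-invariant feature of Section 18.2.6 can be sketched end to end: a chromaticity histogram, a DCT, zigzag truncation to 36 components, and Euclidean matching. This is a simplified illustration only; we use 16 x 16 bins instead of 128 x 128, a plain 2D DCT in place of the wavelet compression step, and names of our own:

```python
import numpy as np

def chromaticity_histogram(img, bins=16):
    # Band ratios r = R/(R+G+B), g = G/(R+G+B), histogrammed in 2D.
    rgb = img.astype(float) + 1e-9          # avoid division by zero
    s = rgb.sum(axis=2)
    r = rgb[:, :, 0] / s
    g = rgb[:, :, 1] / s
    hist, _, _ = np.histogram2d(r.ravel(), g.ravel(),
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()

def dct2(a):
    # Orthonormal 2D DCT-II via the transform matrix: T a T^T.
    n = a.shape[0]
    k = np.arange(n)
    T = np.sqrt(2.0 / n) * np.cos(
        np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    T[0, :] = np.sqrt(1.0 / n)
    return T @ a @ T.T

def feature_36(hist):
    # Zigzag-order the DCT coefficients and keep the first 36.
    n = hist.shape[0]
    c = dct2(hist)
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda ij: (ij[0] + ij[1],
                                   ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
    return np.array([c[i, j] for i, j in order[:36]])

def match_distance(f1, f2):
    # Matching in the compressed domain: Euclidean distance.
    return float(np.linalg.norm(f1 - f2))
```

Keeping only the low-frequency zigzag coefficients preserves the broad shape of the chromaticity distribution while shrinking each database entry to a 36-component vector.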

FIGURE 18.9: Block diagram of object matching steps (user object model selection, model
feature localization, color-based image screening, pose estimation, texture support, shape
verification; database images feed the screening stages, and matches are output).

A possible model image and one of the target images in the database might be as in
Figure 18.10, where the scene in (b) was illuminated with a dim fluorescent light.

Locales in Feature Localization

The Search by Object Model introduced above — finding an object inside a target image
— is a desirable yet difficult mechanism for querying multimedia data. An added difficulty
is that objects can be photographed under different lighting conditions. Human vision has
"color constancy" [13], an invariant processing, presumably, that allows us to see colors
under different lighting as the same. For image indexing, it should be useful to determine
only a covariant processing that changes along with changing light [12]. In that case, we
could aim at also recovering the lighting change.

Since object-based search considers objects within an image, we should apply some sort
of segmentation to look at regions of objects — say, patches that have about the same color.
However, it has been found to be more useful to use a set of rough, possibly overlapping
regions (called locales [7]) to express not a complete image segmentation but instead a
coarser feature localization.

It is worthwhile looking in more detail at this locale-directed search method, which we
describe along with the process of feature localization. Since we are interested in lighting
change, we also look at a technique to compensate for illumination change, so as to carry
out a color covariant search.

FIGURE 18.10: Model and target images: (a) sample model image; (b) sample database
image containing the model book. (This figure also appears in the color insert section.)
Active Perception textbook cover courtesy Lawrence Erlbaum Associates, Inc.

Feature Localization versus Image Segmentation

For image segmentation (cf. [14]): if R_i is a segmented region,

1. R_i is usually connected; all pixels in R_i are connected (8-connected or 4-connected).
2. R_i ∩ R_j = ∅, i ≠ j; regions are disjoint.
3. ∪_i R_i = I, where I is the entire image; the segmentation is complete.

Object retrieval algorithms based on image segmentation permit imprecise regions by
allowing a tolerance on the region-matching measure. This accounts for small imprecision
in the segmentation but not for over- or under-segmentation, which can be attributed to
the pixel-level approach. This works only for simplified images, where object pixels have
statistics that are position-invariant.

A coarse localization of image features based on proximity and compactness is likely to
be a more effective and attainable process than image segmentation.

Definition: A locale L_f is a local enclosure of feature f.

A locale uses blocks of pixels called tiles as its positioning units and has the following
descriptors:

1. Envelope E_f: a set of tiles representing the locality of L_f.

2. Geometric parameters:

       mass M(L_f) = count of the pixels having feature f,

       centroid C(L_f) = Σ_{i=1}^{M(L_f)} P_i / M(L_f),  P_i = position of pixel i,

   and eccentricity E(L_f) = Σ_{i=1}^{M(L_f)} ||P_i − C(L_f)||^2 / M(L_f).

3. Color, texture, and shape parameters of the locale — for example, locale chromaticity,
   elongation, and locale texture histogram.

Initially, an image is subdivided into square tiles (e.g., 8 x 8 or 16 x 16). While the pixel
is the building unit for image segmentation, the tile is the building unit for feature localization.
Tiles group pixels with similar features within their extent and are said to have feature f if
enough pixels in them have feature f (e.g., 10%).

Tiles are necessary for good estimation of initial object-level statistics and representation
of multiple features at the same location. However, locale geometric parameters are
measured in pixels, not tiles. This preserves feature granularity. Hence, feature localization
is not merely a reduced-resolution variation on image segmentation.

After a feature localization process, the following can be true:

1. ∃f : L_f is not connected.
2. ∃f ∃g : L_f ∩ L_g ≠ ∅, f ≠ g; locales are non-disjoint.
3. ∪_f L_f ≠ I; non-completeness: not all image pixels are represented.

Figure 18.11 shows a sketch of two locales for color red and one for color blue. The links
represent an association with an envelope, which demonstrates that locales do not have to
be connected, disjoint, or complete, yet colors are still localized.

FIGURE 18.11: Locales for feature localization.

Tile Classification

Before locales can be generated, tiles are first classified as having certain features, for
example, red tiles, or red and blue tiles. Since color is most useful for CBIR and is invariant
to translations, rotations, and scaling, we will start with color localization, although other
features (texture, shape, motion, etc.) can certainly be localized similarly.

Dominant Color Enhancement

To localize on color, we first remove noise and blurring by restoring colors smoothed
out during image acquisition. The image is converted from the RGB color space to a
chromaticity-luminance color space. For a pixel with color (R, G, B), we define

    l = R + G + B,    r = R/l,    g = G/l    (18.4)

where the luminance l is separated from the chromaticity (r, g). Clearly, we can also use
an approximately illumination-invariant version of color, as in Section 18.2.6.

Prior to classifying feature tiles, image pixels are classified as having either dominant
color or transitional color. Pixels are classified dominant or transitional by examining their
neighborhood.

Definition: Dominant colors are pixel colors that do not lie on a slope of color change
in their pixel neighborhood. Transitional colors do.

If a pixel does not have a sufficient number of neighbors with similar color values within
a threshold, it is considered noise and is also classified as transitional. The uniformity of
the dominant colors is enhanced by smoothing the dominant pixels only, using a 5 x 5
filter, with the exception that only dominant pixels having similar color are averaged.
Figure 18.12 shows how dominant color enhancement can clarify the target image in
Figure 18.10 above.

Tile feature list

Tiles have a tile feature list of all the color features associated with the tile and their
geometrical statistics. On the first pass, dominant pixels are added to the tile feature list. For each
pixel added, if the color is close to a feature on the list within the luminance-chromaticity
thresholds, the color and geometrical statistics for the feature are updated. Otherwise, a
new color feature is added to the list. This feature list is referred to as the dominant feature
list.

On the second pass, all transitional colors are added to the dominant feature list without
modifying the color, but updating the geometrical statistics. To determine which dominant
feature list node the transitional pixel should merge to, we examine the neighborhood of
the transitional pixel and find the closest color that is well represented in the neighborhood.
If an associated dominant color doesn't exist, it is necessary to create a second transitional
feature list and add the transitional color to it.

The dominant color (r_i, g_i, l_i) taken on by a transitional pixel tp having color (r, g, l)
satisfies the following minimization:

    min_{i=1..nc} ||(r, g, l) − (r_i, g_i, l_i)|| / F(r_i, g_i, l_i)    (18.5)

FIGURE 18.12: Smoothing using dominant colors: (a) original image not smoothed;
(b) smoothed image with transitional colors shown in light gray; (c) smoothed image with
transitional colors shown in the replacement dominant colors (if possible). Lower row shows
detail images.

The parameter nc is the number of nonsimilar colors in the neighborhood of the tp. Similar
colors are averaged to generate the (r_i, g_i, l_i) colors. F(r_i, g_i, l_i) is the frequency of the i-th
average color, or in other words, the number of similar colors averaged to generate color i.
The color that minimizes this equation is the best compromise for dominant color selection
for tp in terms of color similarity and number of similar colors in the neighborhood. The
neighborhood size was chosen to be 5 x 5 in our implementation.

When all pixels have been added to the tiles, the dominant and transitional color feature
lists are merged. If a transitional list node is close in color to a dominant list node, the
geometrical statistics for the merged node are updated, but only the color from the dominant
list is preserved. Otherwise, the nodes from both lists are just concatenated onto the joint list.

Locale Generation

Locales are generated using a dynamic 4 x 4 overlapped pyramid linking procedure [15]. On
each level, parent nodes compete for inclusion of child nodes in a fair competition. Image
tiles are the bottom-level child nodes of the pyramid, and locales are generated for the
entire image when the competition propagates to the top level. The top-level pyramid node
has a list of color features with associated envelopes (collections of tiles) and geometrical
statistics [12].

Competition on each level is initialized by using a 2 x 2 nonoverlapped linkage structure,
where four child nodes are linked with a single parent node. The LocalesInit
initialization proceeds as follows:

PROCEDURE 18.1 LocalesInit // Pseudocode for link initialization

BEGIN
    Let c[nx][ny] be the 2D array of child nodes.
    Let p[nx/2][ny/2] be the 2D array of parent nodes.
    For each child node c[i][j] do
        Let cn = c[i][j] and pn = p[i/2][j/2].
        For each node cn_p in the feature list of cn do
            Find node pn_q in the feature list of pn that has similar color.
            If the merged eccentricity of cn_p and pn_q has E < τ then
                Merge cn_p and pn_q.
            If pn_q doesn't exist or E >= τ then
                Add cn_p to the start of the feature list of pn.
END

After the pyramid linkage initialization, the competition begins. Since a 4 x 4 overlapped
pyramid structure is used, four parents compete for linkage with each child, one of which is
already linked to it. This process is illustrated by the EnvelopeGrowing pseudocode:

PROCEDURE 18.2 EnvelopeGrowing // Pseudocode for locale generation

BEGIN
    Let c[nx][ny] be the 2D array of child nodes.
    Let p[nx/2][ny/2] be the 2D array of parent nodes.
    Repeat until parent-child linkage does not change anymore
        For each child node c[i][j] do
            Let cn = c[i][j] and let pn range over the four candidate parents of cn.
            For each node cn_p in the feature list of cn do
                Find node pn_q in the feature lists of pn that has similar color
                and minimizes the distance ||C(cn_p) − C(pn_q)||.
                If the merged eccentricity of cn_p and pn_q has E < τ then
                    Swap the linkage of cn_p from its parent to pn_q.
                    Update the associated geometrical statistics.
    In the parent feature lists remove empty nodes.
    Go up a level in the pyramid and repeat the procedure.
END

Following the pyramidal linking, locales having small mass are removed, since small
locales are not accurate enough and are probably either an insignificant part of an object or
noise. To increase the efficiency of the search, locales are also sorted according to decreasing
mass size.
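Returning to Equation (18.5), dominant-color selection for a transitional pixel is a small minimization over the averaged neighborhood colors. A sketch, in which the candidate-list format and function name are our assumptions:

```python
import numpy as np

def select_dominant(tp_color, candidates):
    # Equation (18.5): pick the neighborhood average color (r_i, g_i, l_i)
    # minimizing color distance divided by its frequency F(r_i, g_i, l_i).
    # candidates: list of ((r, g, l), frequency) pairs.
    tp = np.asarray(tp_color, dtype=float)
    best, best_val = None, np.inf
    for color, freq in candidates:
        val = np.linalg.norm(tp - np.asarray(color, dtype=float)) / freq
        if val < best_val:
            best, best_val = color, val
    return best
```

Dividing by the frequency means a slightly more distant color can still win if it is much better represented in the neighborhood, which is exactly the compromise the text describes.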

The color updaLe equation for parenL locale j and child locale i ai iteraLion k + 1 is

(rjk+0, g (k+I) = ( (k~ (14 1(k))T M5~ +


+ (r,~. gÇk) J(k))T M~k)
_________________________________
(18.6)

and the update equations for the geometrical slatistics are


M5k~ = M514+M?1 (18.7)
= C5~M5k) +
(18.8)

— (E514 + C~!’} ÷ + (E~14 + + C~?2)MÍ~

j
— c~]i)Z (18.9)
x.i

Figure 18.13 shows how color locales appear for sample model and target images.
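The update Equations (18.6) through (18.9) amount to mass-weighted merging of two locales' statistics. A sketch, in which the dictionary representation of a locale is our assumption:

```python
import numpy as np

def merge_locales(parent, child):
    # One merge step, Equations (18.6)-(18.9): mass-weighted color average,
    # summed mass, mass-weighted centroid, and combined eccentricity.
    # Each locale: dict with 'color' (r, g, l), 'M', 'C' (x, y), 'E'.
    Mj, Mi = parent['M'], child['M']
    M = Mj + Mi                                              # (18.7)
    color = (np.asarray(parent['color']) * Mj +
             np.asarray(child['color']) * Mi) / M            # (18.6)
    C = (np.asarray(parent['C']) * Mj +
         np.asarray(child['C']) * Mi) / M                    # (18.8)
    # (18.9): E is the mean squared distance to the centroid, so combine
    # via E + ||C||^2 = mean ||P||^2, then subtract the new ||C||^2.
    mean_sq = ((parent['E'] + np.dot(parent['C'], parent['C'])) * Mj +
               (child['E'] + np.dot(child['C'], child['C'])) * Mi) / M
    E = mean_sq - np.dot(C, C)
    return {'color': color, 'M': M, 'C': C, 'E': E}
```

Because only these running sums are needed, the pyramid competition never has to revisit individual pixels when it relinks a child node to a different parent.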

Texture Analysis

Every locale is associated with a locale-based texture histogram as discussed in
Section 18.2.5. Thus a locale-dependent threshold makes more sense in generating the edge
map. The threshold is obtained by examining the histogram of the locale edge magnitudes.
The texture histogram is smoothed using a Gaussian filter and subsampled to size 8 x 7,
then normalized.

The locale-based texture is a more effective measure of texture than is a global one,
since the locale-dependent thresholds can be adjusted adaptively. Figure 18.14 compares
locale-based edge detection to global-threshold-based edge detection, as discussed in
Section 18.2.5. The edge maps shown demonstrate that for the lamp and the banana objects,
some edge points are missing when using global thresholding, but most of them exist when
using locale-based thresholding. To draw the locale-based edge map, edge pixels generated
for any locale are unioned together.

Object Modeling and Matching

Object models in C-BIRD consist of a set of localized features. As shown above, they
provide a rich set of statistical measures for later matching. Moreover, their geometric
relationships, such as the spatial arrangement of locales, are also extracted. They are best
represented using vectors connecting the centroids of the respective locales.

The object-search method recovers 2D rigid object translation, scale, and rotation, as well
as illumination change (full details are given in [12]). C-BIRD also allows a combination
search, where an object search can be combined with other, simpler search types. In that
case, the searches are executed according to decreasing speed. Since object search is the
most complex search available, it is executed last, and only on the search results passed on
so far by the other search types.

FIGURE 18.13: Color locales: (a) color locales for the model image; (b) color locales for a
database image. (This figure also appears in the color insert section.)

The object image selected by the user is sent to the server for matching against the locales
database. The localization of the submitted model object is considered the appropriate
localization for the object, so that image locales need to be found that have a one-to-one
correspondence with model locales. Such a correspondence is called an assignment.

A locale assignment has to pass several screening tests to verify an object match. Screening
tests are applied in order of increasing complexity and dependence on previous tests.
Figure 18.9 shows the sequence of steps during an object matching process: (a) user object
model selection and model feature localization, (b) color-based screening test, (c) pose
estimation, (d) texture support, and (e) shape verification.

FIGURE 18.14: Global versus locale-based thresholds: (a) the edge map for the database
image using a global threshold; (b) the edge map for the database image using a locale-based
threshold.

The object match measure Q is formulated as follows:

    Q = Σ_{i=1}^{m} w_i Q_i    (18.10)

where n is the number of locales in the assignment, m is the number of screening tests
considered for the measure, Q_i is the fitness value of the assignment in screening test i, and
w_i are weights that correspond to the importance of the fitness value of each screening test.
The w_i can be arbitrary; they do not have to sum to 1. Care has to be taken to normalize the
Q_i values to lie in the range [0..1], so that they all have the same numerical meaning.

Locales with higher mass (more pixels) statistically have a smaller percentage of
localization error. The features are better defined, and small errors average out, so we have higher
confidence in locales with large mass. Similarly, assignments with many model locales are
preferable to those with few model locales, since the cumulative locale mass is larger and the
errors average out.

We try to assign as many locales as possible first, then compute the match measure and
check the error using a tight threshold. Locales are removed or changed in the assignment
as necessary until a match is obtained. At that point, it is probably the best match measure
possible, so it is unnecessary to try other assignments. In this case, all possible permutations
of locale assignments do not have to be checked.

In the worst case, when the object model is not present in the search image, we have to
test all assignments to determine there is no match. The image locales in the database and
the object model locales are sorted according to decreasing mass size.

Matching Steps

The screening tests applied to locales to generate assignments and validate them are:

• Color-based screening tests (step b):
  - Illumination color covariant screening
  - Chromaticity voting
  - Elastic correlation
• Estimation of image object pose (step c)
• Texture support (step d)
• Shape verification (step e)
• Recovery of lighting change

The idea of color covariant matching is to realize that colors may change, from model
to target, since the lighting may easily change. A diagonal model of lighting change states
that the entire red channel responds to lighting change via an overall multiplicative change,
as do the green and blue channels, each with their own multiplicative constant [11].

Locales vote on the correct lighting change, since each assignment of one model locale
color to a target one implies a diagonal lighting shift. Many votes in the same cell of a
voting space will imply a probable peak value for lighting change. Using the chromaticity
voting scheme, all image locales are paired with all model locales to vote for lighting-change
values in a voting array.

We can evaluate the feasibility of having an assignment of image locales to model locales
using the estimated chromaticity shift parameters by a type of elastic correlation. This
computes the probability that there can be a correct assignment and returns the set of possible
assignments. Having a candidate set of chromaticity shift parameters, each candidate is
successively used to compute the elastic correlation measure. If the measure is high enough
(higher than 80%, say), the possible assignments returned by the elastic correlation process
are tested for object matching using pose estimation, texture support, and shape verification.

Figure 18.15 shows the elastic correlation process applied in the model chromaticity
space Ω(r', g'): the model image has three locale colors at A', B', and C'. All the image
locale colors, A, B, C, D, E, and F, are shifted to the model illuminant. Although the locales
(A', B', C') and (A, B, C) are supposed to be matching entities, they do not appear at exactly
the same location. Instead of a rigid template matching (or correlation) method, we employ
elastic correlation, in which the nodes A, B, C are allowed to be located in the vicinity of
A', B', C', respectively.

The pose estimation method (step (c)) uses geometrical relationships between locales
for establishing pose parameters. For that reason, it has to be performed on a feasible
locale assignment. Locale spatial relationships are represented by relationships between
their centroids. The number of assigned locales is allowed to be as few as two, which is
enough geometry information to drive estimation of a rigid-body 2D displacement model
with four parameters to recover: x, y translation, rotation R, and scale s [12].
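The chromaticity-voting step just listed can be sketched as follows: each model/image locale pairing implies per-channel multiplicative factors under the diagonal model, and those factors are accumulated in a 2D voting array. The bin ranges, the two-channel simplification, and the function name are ours:

```python
import numpy as np

def vote_lighting_change(model_locales, image_locales,
                         nbins=20, lo=0.25, hi=4.0):
    # Each (model, image) locale pair implies a diagonal lighting change
    # (a multiplicative factor per channel); votes accumulate over a
    # log-scaled 2D grid of (factor for channel 1, factor for channel 2).
    votes = np.zeros((nbins, nbins), dtype=int)
    edges = np.logspace(np.log10(lo), np.log10(hi), nbins + 1)
    for mr, mg in model_locales:          # locale colors, two channels
        for ir, ig in image_locales:
            kr, kg = mr / ir, mg / ig     # implied multiplicative shift
            if lo <= kr < hi and lo <= kg < hi:
                r = np.searchsorted(edges, kr, side='right') - 1
                g = np.searchsorted(edges, kg, side='right') - 1
                votes[r, g] += 1
    return votes
```

A pronounced peak in the returned array corresponds to the most probable lighting change, which is then handed to the elastic correlation stage as a candidate shift.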

r
(a) (b)
FIGURE 18.15: EiasLic correlation in f2(r’, g’). FIGURE 18.16: lising Lhe GHT for shape verificaLion: (a) GHT accumuiator array image; (b) recon
strucLion of Lhe deLecLed object using the esLimaLed pose and Lhe GHT LemplaLe (edge map).
Results of pose estimation are both the best pose parameters for an assignment and the minimization objective value, which is an indication of how well the locales assignment fits using the rigid-body displacement model. If the error is within a small threshold, the pose estimate is accepted.

The texture-support screening test uses a variation of the histogram intersection technique, in which the texture histograms of locales in the assignment are intersected. If the intersection measure is higher than a threshold, the texture match is accepted.

The final match verification process (step (e)) is shape verification by the method of the Generalized Hough Transform (GHT) [16]. The GHT is robust with respect to noise and occlusion [17]. Performing a full GHT search over all possible rotation, scale, and translation parameters is computationally expensive and inaccurate. Such a search is not feasible for large databases.

However, after performing pose estimation, we already know the pose parameters and can apply them to the model reference point to find the estimated reference point in the database image. Hence, the GHT search reduces to a mere confirmation that the number of votes in a small neighborhood around the reference point is indicative of a match. This GHT matching approach takes only a few seconds for a typical search. The reference point used is the model center, since it minimizes voting error caused by errors in edge gradient measurements.

Once we have shape verification, the image is reported as a match, and its match measure Q is returned, if Q is large enough. After obtaining match measures Q_i for all images in the database, the Q_i measures are sorted in decreasing order. The number of matches can further be restricted to the top k if necessary. An estimate of the correct illumination change follows from the correct matches reported.

Figure 18.16(a) shows the GHT voting result for searching for the pink book in one of the database images, as in Figure 18.10(b). Darkness indicates the number of votes received, which in turn indicates the likelihood that the object is in the image and at that location. Figure 18.16(b) shows the reconstructed edge map for the book. Since the model edge map and the location, orientation, and scale of the object are now known, this reconstruction is entirely automated.

Figure 18.17 shows some search results for the pink book in C-BIRD. While C-BIRD is an experimental system, it does provide a proof in principle that the difficult task of search by object model is possible.

Video Locales

Definition: A video locale is a sequence of image feature locales that share similar features in the spatiotemporal domain of a video.

Like locales in images, video locales have color, texture, and geometric properties. Moreover, they capture motion parameters, such as motion trajectory and speed, as well as temporal information, such as the lifespan of the video locale and its temporal relationships with respect to other video locales.

Since video proceeds in small time steps, we can also expect to develop new locales from ones already known in previous video frames more easily than by simply starting from scratch in each frame [18]. Figure 18.18 shows that while substantially speeding up the generation of locales, little difference results between generating locales from each image independently (intra-frame) and predicting and then refining the locales (inter-frame).

While we shall not go into the details of generating video locales, suffice it to say that the inter-frame algorithm is always much faster than the intra-frame one. Moreover, video locales provide an effective means toward real-time video object segmentation and tracking [18].

18.3 SYNOPSIS OF CURRENT IMAGE SEARCH SYSTEMS

Some other current image search engines are mentioned here, along with URLs for each (more URLs and resources are in the Further Exploration section). The following is by no means a complete synopsis. Most of these engines are experimental, but all those included here are interesting in some way. Several include query features different from those outlined for C-BIRD.
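Histogram intersection, used above in the texture-support screening test (and widely in the systems surveyed in this section), can be sketched in a few lines. This is only an illustration of the measure, not C-BIRD's actual code; the threshold value here is an assumed parameter.

```python
def histogram_intersection(h1, h2):
    """Normalized histogram intersection: 1.0 means identical distributions."""
    s1, s2 = float(sum(h1)), float(sum(h2))
    # Normalize each histogram so the measure is independent of locale size,
    # then sum the bin-wise minima.
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))

def texture_match(locale_hist, model_hist, threshold=0.7):
    """Accept the texture match if the intersection exceeds a threshold."""
    return histogram_intersection(locale_hist, model_hist) >= threshold

# Two similar 8-bin texture (edge-orientation) histograms intersect highly:
a = [4, 10, 20, 30, 20, 10, 4, 2]
b = [5, 12, 18, 28, 22, 9, 4, 2]
print(histogram_intersection(a, b))  # 0.95
print(texture_match(a, b))           # True
```

Because the histograms are normalized before intersecting, locales of different sizes can be compared directly.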
The striking feature of this metric is that it allows us to use simple differences of average three-dimensional color as a first screen, because the simpler metric is guaranteed to be a bound on the more complex one in Equation (18.12) [20].

QBIC has been developed further since its initial version and now forms an essential (and licensable) part of IBM's suite of Digital Library products. These aim at providing a complete media-collection management system.

An interesting development in the QBIC research effort at IBM is the attempt to include grayscale imagery in its domain [21], a difficult retrieval task. QBIC can combine other attributes with color-only-based searches — these can be textual annotations, such as captions, and texture. Texture, particularly, helps in graylevel image retrieval, since to some extent it captures structural information in an image. Database issues begin to dominate once the data set becomes very large, with careful control on cluster sizes and representatives for a tree-based indexing scheme.

18.3.2 UC Santa Barbara Search Engines

• Alexandria Digital Library (ADL) is a seasoned image search engine devised at the University of California, Santa Barbara. The ADL is presently concerned with geographical data: "spatial data on the web". The user can interact with a map and zoom into it, then retrieve images pertaining to the selected map area as a query result. This approach mitigates the fact that terabytes, perhaps, of data need to be stored for LANDSAT images, say. Instead, ADL uses a multiresolution approach that allows fast browsing by making use of image thumbnails: it is possible to select a certain region within an image and zoom in on it. http://www.alexandria.ucsb.edu

• NETRA [22] is also part of the Alexandria Digital Library project. Now in its second generation as NETRA II, it emphasizes color image segmentation for object- or region-based search. http://maya.ece.ucsb.edu/Netra/

• Perception-Based Image Retrieval (PBIR) aims at a better version of learning and relevance feedback techniques, with learning algorithms that try to get at the underlying query behind the user's choices in zeroing in on the right target. http://www.mmdb.ece.ucsb.edu/demo/corelactn/

18.3.3 Berkeley Digital Library Project

The URL for this University of California, Berkeley, search engine is http://elib.cs.berkeley.edu. Text queries are supported, with search aimed at a particular commercial or other set of stock photos. The experimental version tries to include semantic information from text as a clue for image search.

18.3.4 Chabot

Chabot is an earlier system, also from UC Berkeley, that aims to include 500,000 digitized multiresolution images. Chabot uses the relational database management system POSTGRES to access these images and associated textual data. The system stores both text and color histogram data. Instead of color percentages, a "mostly red" type of simple query is acceptable. http://http.cs.berkeley.edu/ginger/chabot.html

18.3.5 Blobworld

Blobworld [23] was also developed at UC Berkeley. It attempts to capture the idea of objects by segmenting images into regions. To achieve a good segmentation, an expectation-maximization (EM) algorithm derives the maximum likelihood for a good clustering in the feature space. Blobworld allows for both textual and content-based searching. The system has some degree of feedback, in that it displays the internal representation of the submitted image and the query results, so the user can better guide the algorithm. http://elib.cs.berkeley.edu/photos/blobworld

18.3.6 Columbia University Image Seekers

A team at Columbia University has developed the following search engines:

• Content-Based Visual Query (CBVQ), developed by the ADVENT project at Columbia University, is the first of the series. (ADVENT stands for All Digital Video Encoding, Networking and Transmission.) It uses content-based image retrieval based on color, texture, and color composition. http://maya.ctr.columbia.edu:8088/cbvq

• VisualSEEk is a color-photograph retrieval system. Queries are by color layout, or by an image instance, such as the URL of a seed image, or by instances of prior matches. VisualSEEk supports queries based on the spatial relationships of visual features. http://www.ctr.columbia.edu/visualseek

• SaFe, an integrated spatial and feature image system, extracts regions from an image and compares the spatial arrangements of regions. http://disney.ctr.columbia.edu/safe

• WebSEEk collects images (and text) from the web. The emphasis is on making a searchable catalogue with such topics as animals, architecture, art, astronomy, cats, and so on. Relevance feedback is provided in the form of thumbnail images and motion icons. For video, a good form of feedback is also the inclusion of small, short video sequences as animated GIF files. http://www.ctr.columbia.edu/webseek (includes a demo version)

18.3.7 Informedia

The Informedia Digital Video Library project at Carnegie Mellon University is now in its second generation, known as Informedia II. This centers on "video mining" and is funded by a consortium of government and corporate sponsors. http://informedia.cs.cmu.edu/

18.3.8 MetaSEEk

MetaSEEk is a meta-search engine, also developed at Columbia, under their IMKA Intelligent Multimedia Knowledge Application Project. The idea is to query several other online image search engines, rank their performance for different visual queries, and use them selectively for any particular search. http://ana.ctr.columbia.edu/metaseek/
18.3.9 Photobook and FourEyes

Photobook [24] was one of the earlier CBIR systems, developed by the MIT Media Laboratory. It searches for three different types of image content (faces, 2-D shapes, and texture images) using three mechanisms. For the first two types, it creates an eigenfunction space, a set of "eigenimages". New images are then described in terms of their coordinates in this basis. For textures, an image is treated as a sum of three orthogonal components in a decomposition denoted as Wold features [25].

With relevance feedback added, Photobook became FourEyes [26]. Not only does this system assign positive and negative weight changes for images, it can also reuse what it has learned when given a query similar to one it has seen before. http://vismod.www.media.mit.edu/vismod/demos/photobook

18.3.10 MARS

MARS (Multimedia Analysis and Retrieval System) [27] was developed at the University of Illinois at Urbana-Champaign. The idea was to create a dynamic system of feature representations that could adapt to different applications and different users. Relevance feedback (see Section 18.4), with changes of weightings directed by the user, is the main tool used.

18.3.11 Virage

Visual Information Retrieval (Virage) [28] operates on objects within images. Image indexing is performed after several preprocessing operations, such as smoothing and contrast enhancement. The details of the feature vector are proprietary; however, it is known that the computation of each feature is made by not one but several methods, with a composite feature vector composed of the concatenation of these individual computations. http://www.virage.com

18.3.12 Viper

Visual Information Processing for Enhanced Retrieval (VIPER) is an experimental system that concentrates on a user-guided shaping of finer and finer search constraints, referred to as relevance feedback. The system is developed by researchers at the University of Geneva. VIPER makes use of a huge set of approximately 80,000 potential image features, based on colors and textures at different scales and in a hierarchical decomposition of the image. VIPER is distributed by the open software distribution system GNU ("Gnu's Not Unix") under a General Public License. http://viper.unige.ch

18.3.13 Visual RetrievalWare

Visual RetrievalWare is an image search technology owned by Convera, Inc. It is built on techniques created for use by various government agencies for searching databases of standards documents. Its image version powers Yahoo's Image Surfer. Honeywell has licensed this technology as well: Honeywell x-rayed over one million of its products and plans to be able to index and search a database of these x-ray images. The features this software uses are color content, shape content, texture content, brightness structure, color structure, and aspect ratio. http://vrw.convera.com:8015/cst

18.4 RELEVANCE FEEDBACK

Relevance feedback is a powerful tool that has been brought to bear in recent CBIR systems (see, e.g., [27]). Briefly, the idea is to involve the user in a loop, whereby images retrieved are used in further rounds of convergence onto correct returns. The usual situation is that the user identifies images as good, bad, or don't care, and weighting systems are updated according to this user guidance. (Another approach is to move the query toward positively marked content [29]. An even more interesting idea is to move every data point in a disciplined way, by warping the space of feature points [30]. In the latter approach, the points themselves move along with the high-dimensional space being warped, much like raisins embedded in a volume of Jello that is being squeezed!)

18.4.1 MARS

In the MARS system [27], weights assigned to feature points are updated by user input. First, the MARS authors suppose that there are many features, i = 1..I of them, such as color, texture, and so on. For each such feature, they further suppose that we can use multiple representations. For example, for color we may use color histograms, color layout, moments of color histograms, dominant colors, and so on. Suppose that, for each i, there are j = 1..J_i such representations. Finally, for each representation j of feature i, suppose there is an associated set of k = 1..K_ij components of a feature vector. So in the end, we have feature vector components r_ijk.

Each kind of feature i has an importance, or weight, W_i, and weights W_ij are associated with each of the representations for the kind of feature i. Weights W_ijk are also associated with each component of each representation. Weights are meant to be dynamic, in that they change as further rounds of user feedback are incorporated.

Let F = {f_i} be the whole set of features f_i. Let R = {r_ij} be the set of representations for a given feature f_i. Then, again just for the current feature i, suppose that M = {m_ij} is a set of similarity measures used to determine how similar or dissimilar two representations are in set R. That is, different metrics should be used for different representations: a vector-based representation might use Mahalanobis distance for comparing feature vectors, while histogram intersection may be used for comparing color histograms. With set D being the raw image data, an entire expression of a relevance feedback algorithm is expressed as a model (D, F, R, M).

Then the retrieval process suggested in [29] is as follows:

1. Initialize weights as uniform values:

      W_i = 1/I
      W_ij = 1/J_i
      W_ijk = 1/K_ij

   Recall that I is the number of features in set F; J_i is the number of representations for feature f_i; and K_ij is the length of the representation vector r_ij.
2. A database image's similarity to the query is first defined in terms of components:

      S(r_ij) = m_ij(r_ij, W_ijk)

   Then each representation's similarity values are grouped as

      S(f_i) = Σ_j W_ij S(r_ij)

3. Finally, the overall similarity S is defined as

      S = Σ_i W_i S(f_i)

4. The top N images similar to query image Q are then returned.

5. Each of the retrieved images is marked by the user as highly relevant, relevant, no opinion, nonrelevant, or highly nonrelevant, according to his or her subjective opinion.

6. Weights are updated, and the process is repeated.

Similarities have to be normalized to get a meaningful set of images returned:

1. Since representations may have different scales, features are normalized, both offline (intranormalization) and online (internormalization).

2. Intranormalization: the idea here is the normalization of the r_ijk so as to place equal emphasis on each component within a representation vector r_ij. For each component k, find the mean μ_k and standard deviation σ_k over all M images in the database. Then replace that component by its normalized score, in the usual fashion from statistics:

      r_ijk → (r_ijk - μ_k) / σ_k

3. Internormalization: here we look for equal emphasis on each similarity value S(r_ij) within the overall measure S. We find the mean μ_ij and standard deviation σ_ij over all database image similarity measures S.

4. Then, online, for any new query Q, we replace the raw similarity between Q and a database image m by

      S_mQ(r_ij) → (S_mQ(r_ij) - μ_ij) / (3 σ_ij)

Finally, the weight update process is as follows:

1. Scores of (3, 1, 0, -1, -3) are assigned to the user opinions "highly relevant" through "highly nonrelevant".

2. Weights are updated as

      W_ij → W_ij + Score

   for images viewed by the user. Then weights are normalized by

      W_ij → W_ij / Σ_j W_ij

3. The inverse of the standard deviation of feature component r_ijk is assigned to the component weight W_ijk:

      W_ijk = 1 / σ_ijk

   That is, the smaller the variance, the larger the weight.

4. Finally, these weights are also normalized:

      W_ijk → W_ijk / Σ_k W_ijk

The basic advantage of putting the user into the loop by using relevance feedback is that the user need not provide a completely accurate initial query. Relevance feedback establishes a more accurate link between low-level features and high-level concepts, somewhat closing the semantic gap. Of course, the retrieval performance of CBIR systems is bettered this way.

18.4.2 iFind

An experimental system that explicitly uses relevance feedback in image retrieval is the Microsoft Research system iFind [31]. This approach attempts to get away from just low-level image features by addressing the semantic content in images. Images are associated with keywords, and a semantic net is built for image access based on these, integrated with low-level features. Keywords have links to images in the database, with weights assigned to each link. The degree of relevance, the weight, is updated on each relevance feedback round.

Clearly, an image can be associated with multiple keywords, each with a different degree of relevance. Where do the keywords come from? They can be generated manually or retrieved from the ALT HTML tag associated with an image, using a web crawler.

18.5 QUANTIFYING RESULTS

Generally speaking, some simple expression of the performance of image search engines is desirable.

In information retrieval theory, precision is the percentage of relevant documents retrieved compared to the number of all documents retrieved, and recall is the percentage of relevant documents retrieved out of all relevant documents. Recall and precision are widely used for reporting retrieval performance for image retrieval systems as well. However, these measures are affected by the database size and the amount of similar information in the database. Also, they do not consider fuzzy matching or search-result ordering.

In equation form, these quantities are defined as

      Precision = (Desired images returned) / (All retrieved images)

      Recall = (Desired images returned) / (All desired images)        (18.13)
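The weighted-similarity and feedback scheme of Section 18.4.1, together with the precision and recall measures of Equation (18.13), can be sketched together as follows. This is a simplified single-level version (one weight per feature, rather than the full W_i, W_ij, W_ijk hierarchy), with made-up feedback scores; negative weights are clamped to zero before renormalizing, a detail the MARS description leaves open.

```python
def overall_similarity(feature_sims, weights):
    """S = sum_i W_i * S(f_i): weighted combination of per-feature similarities."""
    return sum(w * s for w, s in zip(weights, feature_sims))

def update_weights(weights, scores):
    """One relevance-feedback round: add each user-derived score to its
    feature weight, clamp at zero, then renormalize so the weights sum to 1."""
    raw = [max(w + sc, 0.0) for w, sc in zip(weights, scores)]
    total = sum(raw)
    return [r / total for r in raw]

def precision_recall(returned, relevant):
    """Equation (18.13), computed on sets of image identifiers."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

# Three features (say color, texture, shape), initialized uniformly:
weights = [1 / 3, 1 / 3, 1 / 3]
# Suppose user feedback translates to scores +3 (color), 0 (texture), -1 (shape):
weights = update_weights(weights, [3, 0, -1])
# Of 4 returned images, 2 are among the 3 desired ones:
p, r = precision_recall([1, 2, 3, 4], [2, 3, 5])
print(p, r)  # 0.5 0.6666666666666666
```

After the update, the color weight dominates, so the next retrieval round ranks images mainly by color similarity, which is exactly the convergence behavior the feedback loop aims for.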
In general, the more we relax thresholds and allow more images to be returned, the smaller the precision but the larger the recall. The curve of precision versus recall is termed a receiver operating characteristic (ROC) curve. It plots the relationship between sensitivity and specificity over a range of parameters.

18.6 QUERYING ON VIDEOS

Video indexing can make use of motion as the salient feature of temporally changing images for various types of queries. We shall not examine video indexing in any detail here but refer the reader to the excellent survey in [32].

In brief, since temporality is the main difference between a video and just a collection of images, dealing with the time component is first and foremost in comprehending the indexing, browsing, search, and retrieval of video content. A direction taken by the QBIC group [21] is a new focus on storyboard generation for automatic understanding of video — the so-called "inverse Hollywood" problem. In production of a video, the writer and director start with a visual depiction of how the story proceeds. In a video-understanding situation, we would ideally wish to regenerate this storyboard as the starting place for comprehending the video.

The first place to start, then, would be dividing the video into shots, where each shot consists roughly of the video frames between the on and off clicks of the Record button. However, transitions are often placed between shots — fade-in, fade-out, dissolve, wipe, and so on — so detection of shot boundaries may not be as simple as for abrupt changes.

Generally, since we are dealing with digital video, if at all possible we would like to avoid uncompressing MPEG files, say, to speed throughput. Therefore, researchers try to work on the compressed video. A simple approach to this idea is to uncompress just enough to recover the DC term, generating a thumbnail 64 times smaller than the original. Since we must consider P- and B-frames as well as I-frames, even generating a good approximation of the best DC image is itself a complicated problem.

Once DC frames are obtained from the whole video — or, even better, are obtained on the fly — many approaches have been used for finding shot boundaries. Features used have typically been color, texture, and motion vectors, although such concepts as trajectories traversed by objects have also been used [33].

Shots are grouped into scenes: a scene is a collection of shots that belong together and that are contiguous in time. Even higher-level semantics exist in so-called "film grammar" [34]. Semantic information such as the basic elements of the story may be obtainable. These are (at the coarsest level) the story's exposition, crisis, climax, and denouement.

Audio information is important for scene grouping. In a typical scene, the audio has no break within the scene, even though many shots may take place over its course. General timing information from movie creation may also be brought to bear.

Text may indeed be the most useful means of delineating shots and scenes, making use of closed-captioning information already available. However, relying on text is unreliable, since it may not exist, especially for legacy video.

Different schemes have been proposed for organizing and displaying storyboards reasonably succinctly. The most straightforward method is to display a two-dimensional array of keyframes. Just what constitutes a good keyframe has, of course, been subject to much debate. One approach might be to simply output one frame every few seconds. However, action has a tendency to occur between longer periods of inactive story. Therefore, some kind of clustering method is usually used, to represent a longer period of time that is more or less the same by a single keyframe.

Some researchers have suggested using a graph-based method. Suppose we have a video of two talking heads, the interviewer and the interviewee. A sensible representation might be a digraph with directed arcs taking us from one person to the other, then back again. In this way, we can encapsulate much information about the video's structure and also have available the arsenal of tools developed for graph pruning and management.

Other "proxies" have also been developed for representing shots and scenes. A grouping of sets of keyframes may be more representative than just a sequence of keyframes, as may keyframes of variable sizes. Annotation, by text or voice, of each set of keyframes in a "skimmed" video may be required for sensible understanding of the underlying video.

A mosaic of several frames may be useful, wherein frames are combined into larger ones by matching features over a set of frames. This results in a set of larger keyframes that are perhaps more representational of the video.

An even more radical approach to video representation involves selecting (or creating) a single frame that best represents the entire movie. This could be based on making sure that people are in the frame, that there is action, and so on. In [35], Dufaux proposes an algorithm that selects shots and keyframes based on measures of motion activity (via frame difference), spatial activity (via entropy of the pixel value distribution), skin-color pixels, and face detection.

By taking into account skin color and faces, the algorithm increases the likelihood of the selected keyframe including people and portraits, such as close-ups of movie actors, thereby producing interesting keyframes. Skin color is learned using labeled image samples. Face detection is performed using a neural net.

Figure 18.19(a) shows a selection of frames from a video of beach activity (see [36]). Here, the keyframes in Figure 18.19(b) are selected based mainly on color information (but being careful with respect to the changes incurred by changing illumination conditions when videos are shot).

A more difficult problem arises when changes between shots are gradual and when colors are rather similar overall, as in Figure 18.20(a). The keyframes in Figure 18.20(b) are sufficient to show the development of the whole video sequence.

Other approaches attempt to deal with more profoundly human aspects of video, as opposed to lower-level visual or audio features. Much effort has gone into applying data mining or knowledge-base techniques to classifying videos into such categories as sports, news, and so on, and then subcategories such as football and basketball. Zhou and Kuo [37] give a good summary of attempts to provide intelligent systems for video analysis.
FIGURE 18.19: Digital video and associated keyframes, beach video: (a) frames from a digital video; (b) keyframes selected.

FIGURE 18.20: Garden video: (a) frames from a digital video; (b) keyframes selected.

18.7 QUERYING ON OTHER FORMATS

Work on using audio, or combining audio with video, to better comprehend multimedia content is fascinating. Wang et al. [38] is a good introduction to using both audio and video cues. He et al. [39] offer an interesting effort to understand and navigate slides from lectures, based on the time spent on each slide and the speaker's intonation. Other interesting approaches include search-by-audio [40] and "query-by-humming" [41].

Other features researchers have looked at for indexing include actions, concepts and feelings, facial expressions, and so on. Clearly, this field is a developing and growing one, particularly because of the advent of the MPEG-7 standard (see Chapter 12).

18.8 OUTLOOK FOR CONTENT-BASED RETRIEVAL

A recent overview [42] collecting the very latest ideas in content-based retrieval identified the following present and future trends: indexing, search, query, and retrieval of multimedia data based on

1. Video retrieval using video features: image color and object shape, video segmentation, video keyframes, scene analysis, structure of objects, motion vectors, optical flow (from computer vision), multispectral data, and so-called "signatures" that summarize the data

2. Spatiotemporal queries, such as trajectories

3. Semantic features; syntactic descriptors

4. Relevance feedback, a well-known technique from information retrieval

5. Sound, especially spoken documents, such as using speaker information

6. Multimedia database techniques, such as using relational databases of images

7. Fusion of textual, visual, and speech cues

8. Automatic and instant video manipulation; user-enabled editing of multimedia databases

9. Multimedia security, hiding, and authentication techniques, such as watermarking

This field is truly rich and meshes well with the outline direction of MPEG-7.

In another direction, researchers try to create a search profile to encompass most instances available, say all "animals". Then, for relational database searches, such search profiles are communicated via database queries. For searches using visual features, intelligent search engines learn a user's query concepts through active learning [43]. This type of endeavor is called "query-based learning".

Another approach focuses on comprehending how people view images as similar, on the basis of perception [44]. The function used in this approach is a type of "perceptual similarity measure" and is learned by finding the best set of features (color, texture, etc.) to capture "similarity" as defined via the groups of similar images identified.

18.9 FURTHER EXPLORATION

Good books [45, 46, 47, 48] are beginning to appear on the issues involved in CBIR.
Links to many useful coritent-based retrieval sites are collected inibe Further Exploration 8. Suppose a color histogram is defined coarseiy, with bins quantized lo 8 bits, with
section of lhe Lext web site for this chapter: 3 bits for each red and green and 2 for blue. SeL up an appropriate structure for such
a histogram, and fill iL from some image you read. Template Visual C++ cade for
. A Java applet version of Lhe C-BIRD system described in Section 18.2 reading an image is on lhe text web site, as saznpleCcode. zip under “Sampie
Code”.
. A demo of QBIC as an artwork server
9. Try creating a Lexture histogram as described in Section 18.2.5. You could Lry a smali
• Demo versions of Lhe Alexandria Digital Library, the Berkeley Digital Library Project, image and follow lhe sleps given Ihere, using MATLAB, say, for case of visualization.
Photobook, Visual RetrievalWare, VIPER, and VisualSEEk 10. Describe how you may find an image containing some 2D “brick pattem” in an image
database, assuming Lhe color of the “brick” is yellow and lhe color of Lhe “gaps”
• A demo of MediaSite, now rebranded Sonic Foundary Media Systems. The Informe- is blue. (Make sure you discuss Lhe limitations of your method and lhe possible
dia project provided lhe search engine power for lhis commercially available system. improvements.)
• A demo of the NETRA system. The idea is LO seiect au image, Lhen a particular
(a) Use coloronly.
segment within an image, and search on Lhat model.
(b) Use edge-based Lexture measures only.
• A video describing Lhe technology for Lhe Virage system. Virage provides lhe search (c) Use color, Lexture, and shape.
engine for AitaVista’s Image Search
• The keyframe production method for Figures 18.19 and 18.20.

• And links to standard sets of digital images and videos, for testing retrieval and video segmentation programs.

18.10 EXERCISES

1. What is the essence of feature localization? What are the pros and cons of this approach, as opposed to the traditional image segmentation method?

2. Show that the update equation (Equation 18.9) is correct — that is, the eccentricity for parent locale j at iteration k + 1 can be derived using the eccentricity, centroid, and mass information for the parent locale j and child locale i at iteration k. (Note: Cx(k) and Cy(k) are the x and y components of the centroid C(k), respectively.)

3. Try the VIPER search engine, refining the search with relevance feedback for a few iterations. The demo mentions Gabor histograms and Gabor blocks. Read enough of the files associated with the site to determine the meaning of these terms, and write a short explanation of their use.

4. Try a few of the more experimental image search engines in the Further Exploration section above. Some are quite impressive, but most are fairly undependable when used on broad data domains.

5. Devise a text-annotation taxonomy (categorization) for image descriptions, starting your classification using the set of Yahoo! categories, say.

6. Examine several web site image captions. How useful would you say the textual data is as a cue for identifying image contents? (Typically, search systems use word stemming, for eliminating tense, case, and number from words — the word stemming becomes the word stem.)

7. Suggest at least three ways in which audio analysis can assist in video retrieval-system related tasks.

11. The main difference between a static image and video is the availability of motion in the latter. One important part of CBR from video is motion estimation (e.g., the direction and speed of any movement). Describe how you could estimate the movement of an object in a video clip, say a car, if MPEG (instead of uncompressed) video is used.

12. Color is three-dimensional, as Newton pointed out. In general, we have made use of several different color spaces, all of which have some kind of brightness axis, plus two intrinsic-color axes.

Let's use a chromaticity two-dimensional space, as defined in Equation (4.7). We'll use just the first two dimensions, {x, y}. Devise a 2D color histogram for a few images, and find their histogram intersections. Compare image similarity measures with those derived using a 3D color histogram, comparing over several different color resolutions. Is it worth keeping all three dimensions, generally?

13. Implement an image search engine using low-level image features such as color histogram, color moments, and texture. Construct an image database that contains at least 500 images from at least 10 different categories. Perform retrieval tasks using a single low-level feature as well as a combination of features. Which feature combination gives the best retrieval results, in terms of both precision and recall, for each category of images?

18.11 REFERENCES

1 M.M. Fleck, D.A. Forsyth, and C. Bregler, "Finding Naked People," in European Conference on Computer Vision, 1996, (2)593–602.

2 C.C. Chang and S.Y. Lee, "Retrieval of Similar Pictures on Pictorial Databases," Pattern Recognition, 24:675–680, 1991.

3 S. Paek, C.L. Sable, V. Hatzivassiloglou, A. Jaimes, B.H. Schiffman, S.F. Chang, and K.R. McKeown, "Integration of Visual and Text Based Approaches for the Content Labeling and Classification of Photographs," in ACM SIGIR '99 Workshop on Multimedia Indexing and Retrieval, 1999, 423–444.
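For Exercise 11, one simple approach is to reuse the motion vectors the MPEG encoder has already computed: gather the vectors of the macroblocks covering the object and take a robust average. The sketch below is an editorial illustration, not part of the text; it assumes a hypothetical decoder that exposes per-macroblock P-frame motion vectors as a NumPy array.

```python
import numpy as np

def object_motion_from_mvs(motion_vectors, bbox):
    """Estimate an object's (dx, dy) from MPEG macroblock motion vectors.

    motion_vectors: array of shape (rows, cols, 2), one (dx, dy) per 16x16
    macroblock, as a hypothetical decoder might expose them for a P-frame.
    bbox: (r0, r1, c0, c1) macroblock range covering the tracked object.
    """
    r0, r1, c0, c1 = bbox
    region = motion_vectors[r0:r1, c0:c1].reshape(-1, 2)
    # The median is robust to outlier vectors contributed by background
    # macroblocks that overlap the object's bounding box.
    return np.median(region, axis=0)

# A toy "car" occupying macroblocks rows 3-5, cols 4-7, moving right and up
mvs = np.zeros((9, 11, 2))
mvs[3:6, 4:8] = (5.0, -1.0)
print(object_motion_from_mvs(mvs, (3, 6, 4, 8)))
```

Dividing the estimated displacement by the frame interval then gives speed; no pixel-domain search is needed, which is the point of the exercise.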
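A minimal sketch of the 2D histogram intersection asked for in Exercise 12. For simplicity the chromaticity pair {x, y} is computed here directly from RGB (x = R/(R+G+B), y = G/(R+G+B)) rather than from CIE XYZ as in Equation (4.7); the histogram mechanics are identical either way, and the bin count plays the role of the "color resolution" the exercise asks you to vary.

```python
import numpy as np

def chromaticity_histogram(rgb, bins=16):
    """2D histogram over chromaticities {x, y} for an H x W x 3 image."""
    pix = rgb.reshape(-1, 3).astype(float)
    s = pix.sum(axis=1)
    s[s == 0] = 1.0                    # guard: black pixels would divide by zero
    x, y = pix[:, 0] / s, pix[:, 1] / s
    h, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, 1], [0, 1]])
    return h / h.sum()                 # normalize so intersections lie in [0, 1]

def intersection(h1, h2):
    """Swain-Ballard histogram intersection: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()
```

With normalized histograms, an image compared against itself scores 1.0, and dissimilar chromaticity distributions score near 0; repeating the experiment with a 3D RGB histogram answers the exercise's final question empirically.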
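Exercise 13 asks for retrieval quality in terms of precision and recall, as defined earlier in the chapter. For a single query they reduce to two set ratios, sketched here as a small helper (illustrative, not from the text):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved images that are relevant.
    Recall: fraction of relevant images that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 2 of the 4 retrieved images are relevant; 2 of the 3 relevant were found.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(p, r)
```

Averaging these per category, for each feature combination, gives the comparison table the exercise calls for.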
Section 18.11 References 549
548 Chapter 18 Content-Based Retrieval in Digital Libraries

4 K. Barnard and D.A. Forsyth, "Learning the Semantics of Words and Pictures," in Proceedings of the International Conference on Computer Vision, 2001, 2:408–415.

5 M.J. Swain and D.H. Ballard, "Color Indexing," International Journal of Computer Vision, 7(1):11–32, 1991.

6 A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1349–1380, 2000.

7 Z.N. Li, O.R. Zaïane, and Z. Tauber, "Illumination Invariance and Object Model in Content-Based Image and Video Retrieval," Journal of Visual Communication and Image Representation, 10(3):219–244, 1999.

8 H. Tamura, S. Mori, and T. Yamawaki, "Texture Features Corresponding to Visual Perception," IEEE Transactions on Systems, Man, and Cybernetics, SMC-8(6):460–473, 1978.

9 A.R. Rao and G.L. Lohse, "Towards a Texture Naming System: Identifying Relevant Dimensions of Texture," in IEEE Conference on Visualization, 1993, 220–227.

10 R. Jain, R. Kasturi, and B.G. Schunck, Machine Vision, New York: McGraw-Hill, 1995.

11 M.S. Drew, J. Wei, and Z.N. Li, "Illumination-Invariant Image Retrieval and Video Segmentation," Pattern Recognition, 32:1369–1388, 1999.

12 M.S. Drew, Z.N. Li, and Z. Tauber, "Illumination Color Covariant Locale-Based Visual Object Retrieval," Pattern Recognition, 35(8):1687–1704, 2002.

13 B.V. Funt and G.D. Finlayson, "Color Constant Color Indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:522–529, 1995.

14 D.H. Ballard and C.M. Brown, Computer Vision, Upper Saddle River, NJ: Prentice Hall, 1982.

15 T.H. Hong and A. Rosenfeld, "Compact Region Extraction Using Weighted Pixel Linking in a Pyramid," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:222–229, 1984.

16 D. Ballard, "Generalizing the Hough Transform to Detect Arbitrary Shapes," Pattern Recognition, 13(2):111–122, 1981.

17 P. Gvozdjak and Z.N. Li, "From Nomad to Explorer: Active Object Recognition on Mobile Robots," Pattern Recognition, 31(6):773–790, 1998.

18 J. Au, Z.N. Li, and M.S. Drew, "Object Segmentation and Tracking Using Video Locales," in Proceedings of the International Conference on Pattern Recognition (ICPR 2002), 2002, 2:544–547.

19 M. Flickner, et al., "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23–32, 1995.

20 J. Hafner, H.S. Sawhney, W. Equitz, M. Flickner, and W. Niblack, "Efficient Color Histogram Indexing for Quadratic Form Distance Functions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:729–736, 1995.

21 W. Niblack, Xiaoming Zhu, J.L. Hafner, T. Breuel, D. Ponceleon, D. Petkovic, M.D. Flickner, E. Upfal, S.I. Nin, S. Sull, B. Dom, Boon-Lock Yeo, A. Srinivasan, D. Zivkovic, and M. Penner, "Updates to the QBIC System," in Storage and Retrieval for Image and Video Databases, 1998, 150–161.

22 Y. Deng, D. Mukherjee, and B.S. Manjunath, "NETRA-V: Toward an Object-Based Video Representation," in Storage and Retrieval for Image and Video Databases (SPIE), 1998, 202–215.

23 C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.

24 A. Pentland, R. Picard, and S. Sclaroff, "Photobook: Content-Based Manipulation of Image Databases," in Storage and Retrieval for Image and Video Databases (SPIE), 1994, 34–47.

25 F. Liu and R.W. Picard, "Periodicity, Directionality, and Randomness: Wold Features for Image Modeling and Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:722–733, 1996.

26 R.W. Picard, T.P. Minka, and M. Szummer, "Modeling User Subjectivity in Image Libraries," in IEEE International Conference on Image Processing, 1996, 2:777–780.

27 Y. Rui, T.S. Huang, M. Ortega, and S. Mehrotra, "Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval," IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644–655, 1998.

28 A. Hampapur, A. Gupta, B. Horowitz, and C.F. Shu, "The Virage Image Search Engine: An Open Framework for Image Management," in Storage and Retrieval for Image and Video Databases (SPIE), 1997, 188–198.

29 Y. Ishikawa, R. Subramanya, and C. Faloutsos, "MindReader: Querying Databases Through Multiple Examples," in 24th International Conference on Very Large Data Bases, VLDB, 1998, 433–438.

30 H.Y. Bang and T. Chen, "Feature Space Warping: An Approach to Relevance Feedback," in International Conference on Image Processing, 2002, 1:968–971.

31 Y. Lu, C. Hu, X. Zhu, H. Zhang, and Q. Yang, "A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems," in Eighth ACM International Conference on Multimedia, 2000, 31–37.

32 R. Brunelli, O. Mich, and C.M. Modena, "A Survey on the Automatic Indexing of Video Data," Journal of Visual Communication and Image Representation, 10:78–112, 1999.

33 S.F. Chang, et al., "VideoQ: An Automated Content Based Video Search System Using Visual Cues," in Proceedings of ACM Multimedia 97, 1997, 313–324.

34 D. Bordwell and K. Thompson, Film Art: An Introduction, New York: McGraw-Hill, 1993.

35 F. Dufaux, "Key Frame Selection to Represent a Video," in International Conference on Image Processing, 2000, 2:275–278.

36 M.S. Drew and J. Au, "Video Keyframe Production by Efficient Clustering of Compressed Chromaticity Signatures," in ACM Multimedia 2000, 2000, 365–368.

37 W. Zhou and C.C.J. Kuo, Intelligent Systems for Video Understanding, Upper Saddle River, NJ: Prentice-Hall PTR, 2002.

38 Y. Wang, Z. Liu, and J.C. Huang, "Multimedia Content Analysis Using Both Audio and Visual Clues," IEEE Signal Processing Magazine, 17:12–36, 2000.

39 L. He, E. Sanocki, A. Gupta, and J. Grudin, "Auto-Summarization of Audio-Video Presentations," in ACM Multimedia, 1999, 1:489–498.

40 E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-Based Classification, Search, and Retrieval of Audio," IEEE Multimedia, 3:27–36, 1996.

41 N. Kosugi, Y. Nishihara, T. Sakata, M. Yamamuro, and K. Kushima, "A Practical Query-by-Humming System for a Large Music Database," in ACM Multimedia, 2000, 333–342.

42 S. Basu, A. Del Bimbo, A.H. Tewfik, and H. Zhang, "Special Issue on Multimedia Database," IEEE Transactions on Multimedia, 4(2):141–143, 2002.

43 I.J. Cox, M.L. Miller, T.P. Minka, T.V. Papathomas, and P.N. Yianilos, "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation and Psychological Experiments," IEEE Transactions on Image Processing, 9(1):20–37, 2000.

44 B. Li, E. Chang, and C.T. Wu, "DPF — A Perceptual Distance Function for Image Retrieval," in IEEE International Conference on Image Processing, 2002, 2:597–600.

45 G. Lu, Multimedia Database Management Systems, Norwood, MA: Artech House Publishing, 1999.

46 A. Del Bimbo, Visual Information Retrieval, San Francisco: Morgan Kaufmann, 1999.

47 M. Lew, ed., Principles of Visual Information Retrieval, Berlin: Springer-Verlag, 2001.

48 V. Castelli and L.D. Bergman, eds., Image Databases: Search and Retrieval of Digital Imagery, New York: Wiley, 2002.

Index

A-law, 134, 137, 148
  compander, 205
µ-law, 134, 137, 148, 374
  compander, 205
2D mesh, 349
  geometry coding, 349
  motion coding, 352
  object coding, 349
2D object animation, 352
3D model-based coding, 354
3D polygon mesh, 356
3G, 488, 489, 491, 496, 506
G3G (Global 3G), 489
Alias, 128–130
AM (Amplitude Modulation), 426
AMPS (Advanced Mobile Phone System), 480
Analog video, 113
Animation, 36
  3D Studio Max, 16
  Java3D, 16
  Maya, 16
  RenderMan, 16
  Softimage, 16
Anti-aliasing, 328
  filter, 329
APC (Adaptive Predictive Coding), 158
Arithmetic coding, 187, 345
AC (Alternate Current), 209
Aspect ratio, 116, 121, 123
Access network, 439, 464 ATM (Asynchronous Transfer Mode), 428,
FTTC (Fiber Te The Curb), 440 435, 459
FTI’H (Fiber Te The Home). 440 AAL (ATMAdaptation Layer), 437
HFC (Hybrid Fiber-Coax) cable network, cdl, 437
432,440 layer, 437
satellue distribution, 440
NNI (Network’Network Interface), 437
lerrestrial distribulion, 440
Access point. 479 UNI (User-Network Inlerface), 437
ATM Adaptation Layer (AAL), 460
Active pixel, 116
ATV (Advanccd TV), 319
Active video une, 116
Audio
AD (Analog-to-Digital) converter, 135, 158
Adaptive compression algorithms. 176 filtering, 136
Adaptive Huffman coding, 176 foiitials, 136
Adobe Photoshop, 38 Audio and visual objects. 332
alpha channel, 39 Audio compression standard
magic wand boi, 39 0.711, 376
Adobe Premiere, 37 0.721,374,376
limeline window, 37 0.723,374,376
li-ansition, 40 G.723.l, 386, 388
ADPCM (Adaptive Differential Pulse Code 0.726,374, 376
Modulation), 147, 158, 374, 376 0.727,374
ADSL (Asymmetric Digital Subscriber Line), 0.728.389
429 0.729,386. 389
Affine transform, 353 AVI (Audio Video Inlerleave), 37


BAB (Binary Alpha Block), 344 STP (Short ‘Tïme Prediction), 384 DAI (DMIF Application Interface), 463 EBCOT (Embedded Block Coding with
Band-Iimited signal, 128 Chroma subsampling, 120 Datagram. 423 Optimized Truncation), 241, 267,
Band-lirniting filter, 136 Chromaticity, 92 dB (Decibel), 131 269, 273, 275
Band-pass filter, 136, 142, 149 diagram, 91 DC (Direct Currenú, 209 EDTV (Enhanced Definition TV), also see
Bandwidth, 136 Chrominance, 104 De-interlacing, 114 HDTV, 124
BAPs (Body Animation Paramelers), 356 CIF (Cominon lntennediate Format), 122 Decoder mapping, 148 Enu-opy. 168
Baseline JPEG, 261 Clustering, 68 Dictionary-based coding, 181 coding, 169, 171
BDPs (Body Definition Paraineters), 356 CMY. 101 Differential coding, 150, 191 Error concealment, 424,501
Bi-level image compression standards, also see CMYK, 102 Diffserv (Differentiated Service), 446 Error Resilient Entropy Coding (EREC), 499
JBIG, 282 Codec, 168 Digital audio, IS Elbernei, 432
BIFS (Blnary Fonnat for Scenes), 333 Coder mapping, 148 coding of, 147 lO-Gigabil Elhernet, 439
Bilinear intctpolation, 317, 326 Codeword, 167 Cool Edit, is Fast EtliemeI, 433
Binary Shape Coding, 344 Color Pro Tools, is Gigabit Eihemet, 438
Binary tiee, 171 cycling, 68 quancization and Iransmission, 147 Euler’s formula, 218
Bilmap, 61 density, 516 Sound Forge, is Excited purity, 93
Bitplane, 61 histogxam, 66, 514 Digital library, 167 EXIF (Exchange Image File), 77
Block-based coding, 335 Iayout, 516
Bluetooth, 493
lookup table, devising, 68
Digitization of sound, 126
Extended Huffman coding, 176
Expander function, 205
Discrele Cosine Transform (DCT), 207, 209.
BMP (BitMap). 78 monitor specification, 94 214 Extended padding, 340
Body objecl, 356 picker, 68 ID, 208 EZW (Embedded Zerotree Wavelet), 241,242,
Buifer management primaries, 89 244
2D, 207, 208, 253
optimal plan, 473 subcarrier, 117 basis funcúon, 209,216
pealc bit-rale, 472,473 lexlure, 517 Discrete FourierTransfomi (DFT), 218 FAP (Face Animation Parameter), 355
prefetch buifer, 472 Color science, 82 Fax Standards
light and speclia, 82 Discrete Wavelet Transform (DWT). 223, 230 03,282
C-BlRD, 513 speclra sensitivity of lhe eye, 84 Dispersion, 82 04,282
GUI, 514 visible light, 82 Distortion measure, 199 FDDI (Fiber Distributed Data Interface), 433
search by iliumination invariance, 519 Color-matching function, 89 MSE (Mean Square Error), 199 FDMA (Frequency Division Multiple Access),
search by object model, 520 Commission Internationale de LEclairage PSNR (Peak Signal Lo Noise Ratio), 200 480
Cabie modem, 432,440 (dE), 89 SNR (signal-to-noise ratio), 200 FDP (Face Definition Parameter), 355
CAE (Context-based Arithmetic Encoding), Component video, 112 Dithering, 62,68 Feature Iccalization, 523
344 Composite video, 112, 113 dither matrix, 63 FM (Frequency Modulation), 137, 139, 426
Camera syslem. 86 Compression ordered dither, 63 Foi~»ard Error Correction (FEC), 503, 504
CBIR (Content-Based Image Retrieval), 513 lossless. 64,76, 167 DM (Deila Modulation), 157 Fourier transfonn, 222
curreni CBIR system, 533 Iossy, 64,76, 167,253 adaptive, 158 Frame buffer, 61, 100
CCIR (Consullative Coinmittee for ralio, 76, 168 unifcrm, 157 Frame-based coding, 335
Inlernational Radio), 120 speech, 148 DMIF (Delivery Multimedia Integration F1’P (File Transfer Protocol), 449,463
Ccli]’ (Inlernational Telcgraph and ‘Telephonc Compressor funclion, 205 Framework), 462
Consultative Committee), 150 Cones, 84,85 DMT (Discrete MuiLi-Tone). 430 Gamma correction, 87, 88, 97, IDO
CDMA (Cede Division MuIliple Access). 48!, Contexl modeling, 277 DominaM color cnhancement, 525 Gamut, 95
486 Conlinuous Fourier Transform (CFT), 218 DPCM (Diferencial Pulse Code Modulation), prinler, 102
cdma2000, 489,490 Continuous Wavelet Transforra (CWT), 223, 147, IS!, 154, 260 Gaussian disiribution, 249
cdmaOne, 490 227 Dreamweaver, SI Generalized Mai-kup Language (GML), 9
CELP (Code Excited Linear Prediction), 383 CRT (Cathode Ray Tube), 87 DV video (Digital Video), 119. 120 GIF (Graphics Interchange Forxnat), 7!
adaptive codebook, 384 CSMAJCD (Carrier Sense Multiple Access DVB (Digital Video Broadcasting), 440,464 animation, 16
LSF (Line Spectrum Frequency), 387 with Coilision Detection), 432,439 DVB-MHP (Multimedia 1-leme color map, 73
LSP (Une Spectnim Pair), 386 CSS (Cascading Style Sheets), II Platfonu), 464 GlF87, 73
LTP (Long Time Prediction), 384 DVD, 319,320,412,415 GIF89, 73
stochastic codebook, 387 DA (Digital-to-Analog) converter, 135 Dynamic range, 151 screen descriptor, 7!

Global motion compensation, 348 H.26L, 357 lnterlaccd scanning, 113 LPC-I0, 38)
GPRS (General Packet Radio Service), 482 H.323, 456 lnierlacing, 71 Luminance, 104
GPS (Global PosiLioning SysLem), 485, 496 I-Ialf-pixel precision, 304 Internei ielephony, 455 LZW (Lempel-Ziv-wcl~h) 71,73 78 181
Granular distonion, 202 Halftone printing, 63 IP (InLernei Protocol), 424,447
Graphics, 15
Harmonics, 128
IP-multicast, 447
Macroblock, 289
Fireworks editing, 15 Hartley, 168 IPv4 (IP version 4)424,448 Macrornedia Director, 40
Freehand editing. IS HDTV (High Definition TV), 122, 124 lPv6 (II’ version 6), 425,447 3D Sprite, 45
Illustracor editing, 15 1-lierarchical JPEG, 263 ISDN (Integraied Services Digital Neiwork), animation, 42
Graplaics animation file, 77 Honiogencous coordinate sysiem, 353 427 control, 43
FLC, 77 Horizontal reirace, 114 Lingo script, 44
GL, 77 HTML (HyperText Markup Language), lO, 35, JEIG (ioint Bi-level Image Experts Group), objeci, 46
1FF, 77 56,71 282 Macromedia Flash, 46
Quicktime, 77 H’ITP (HyperTexL Transfer Protocol), 9,422, JBIG2, 282 animation, 48
Gray-level 457 Jitier, 444,462 symbol, 47
image, 192 Huffman coding. 173 ,loint Video Team (JVT), also see 11.264, 357 window, 46
intensiiy, 169 optimality, 175 JPEG (mmi Photographic Experis Group), 75, MAN (Metropolitan Area Network), 430
Grayscale, 75 prefix property, 175 253,255 MBone, 448
Grayscale shape coding, 346 procedure. 177 DCT, 255 Mean absolute difference, 290
Group oi Video Object Plane (GOV), 335 troe. 177 entropy coding, 259, 261 Media objects, 332
GSM (Global System for Mobile Human vision, 84 main steps, 253 Media-on-Demand (MOD), 464,465
communications), 481,482,491 Hybrid excitation vocoder, 389 mode, 262 Median-cui algorithm, 69
GSTN (General Switched Telephonc MBE (Multi-Band Excitation), 389 quantization, 255 MIDI (Musical Insirurnent Digital miei-face),
Network). 456 MELP (Multiband Excitation Linear zigzag scan, 259 139
Predictive), 386. 391 JPEG-LS, 277 banks, 140
H.261, 288 Hypermedia, 7 JPEG2000, 265 channel, 140
bitstream, 301 Hypertext, 7 channel niessages, 143
block layer. 303 KLT (Karhunen-Loève transform), 207 channel modo, 145
encoder and decoder, 298 IDC1’ (Inverse Discrete Cosine Transfonu), channcl pressure, 144
formats supported, 296 208 LAB color model, 98 conversion to WAV, 147
GOB (Group of Blocks) Iayer, 301 IETF (Internei Engineering Task Force), 425, LAN (Local Arca Network), 424,430 key pressure, 144
inter-frame (p-frame) coding. 297 448,463 Laiency, 443,461 keyboard, 139
intra-frame (I-frame) coding, 297 IGMP (Intemet Group Managemeni Protocol), Locales patch, 140
macroblock layer, 302 449 definition, 523 system messages, 143, 146
piclure layer, 301 Image locale generation, 526 velocity, 140, 144
quancization, 297 24-bit, 64 object maiching, 529 voice messages, 144
siep size, 297 8-bit, 65 object modeling, 528 MtniDV, 117
H.263, 288, 303 data iype, 64 texture analysis, 528 MMR (Modified Modified Read) algorithm,
motion compensation. 304 descriptor, 513 tile classification, 524 344
optional coding mode, 305 Fireworks editing, IS video locales, 533 MMS (Multimedia Messaging Sei-vice), 507
PB-frame, 306 fonnation, 85 LOCO-l (LOw COmplexity LOssless Model-based coding, 283
H.263+, 307 histogram, 169 COmpression for Images), 277 Modem, 426
1-1 263-t4, 307 monochrome, 61 Lookup Table (LUT), 66—68 Motion Compensalion (MC), 288, 289, 337
H.264, 357 Photoshop editing, is Lossless image compression, 191 backward prediction, 290
Baseline Profile, 360 resolution, 61 Lossless JPEG, 193,264 forward prediction, 290
deblocking filiei-, 360 retrieval, 511 encoder, 194 Motion estimation, 288
Extended Profile, 361 Infonnation theory, 168 predictor, 193 Motion JPEG, 262
1-prediction, 359 Intellectual Property Management and Lossy image compression, 199 Motion vector, 290
Main Profile, 360 Protection (IPMP), 369 Lowpass filter, 136 2D logarithmic search, 291
P-prediction, 359 Interactive TV (1TV), 464 LPC (Linear Predictive Coding), 380, 388 hierarchical search, 293

scquential search, 290 hybrid scalability, 328 tools, 14 Network


MPEG (Moving Pictures Experts Group), 312 inlerlaced video, 320 Muilimedia awhoring, 17,20 Application Layer, 422
MPEG audio compression, 395 modes of prediction, 321 Authorware, 17 corrnection-oriented, 423
bark, 402 profilc, 320 aulomatic authoring, 33 connectionlcss, 424
bit aliocation, 409 Prograin Siream, 329 Director, 17 Data Link Layer, 421
bil reservoir, 412 SNR scalabilily, 324 Dreamweaver, SI Network Layer, 421
critical band, 400 spatial scalability. 326 Flash, 17 Physical Layer, 421
equal-loudness curves. 396 temporal scalability. 326 Quest, 17 Preseniacion Layer, 422
frequency masking, 396. 398 Transpori Stream, 329 technical issues, 31 Session Laycr, 421
MDCT (Modified Discrete Cosine MPEG-2 AAC (Advanced Audio Coding), 412 tools, 37 Transport Layer, 421
Transform), 411 Iow complexity profile. 413 Multimedia metaphors, 21 Nelwork Address Transiation (NAT), 425
MNR (Mask-lo-Noise Ratio), 410 maio profile, 413 cardiscripting metaphor, 22 NMT (Nordic Mobile Telephony), 480
MP3. 405,411,412 PQF (Polyphase Quadrature Filter) bank, castiscort/scripting metaphor. 22 N1’SC (National Television System
MPEG layers, 405 414 frames mecaphor, 22 Committee), 88, 94, 312
Layer 1, 405 scalable sampling rate profile, 413 hierarchical melaphor, 21 Nyquist
Layer 2,405,410 TNS (Temporal Noise Shaping) boi, 413 jconic/flowcontrol mebaphor, 21 frcquency, 129
Layer 3,405,411 MPEG-21, 369, 415 scripting melaphor, 21 ratc, 128, 149, 158
psychoacoustics, 395 MPEG4, 332, 335 slide show metaphor, 21 Ihcorem, 128,427
scale factor band, 412 biriary shape, 343 Multimedia presentalion, 25
SMR (Signal-bo-Mask Ratio), 409 body object, ~ graphics style, 25 Objecb-based visual coding, 335
temporal masking, 403 Delaunay mesh. 350 sprite animation, 27 OC (Optical Carrier), 429
ihreshold of hearing, 398 face objecb, 354 video transition, 28 Octave, 128
MPEG worldng model gray-scale shape, ~ Multimedia production, 23 OFDM (Ortogonal Frequency Division
SM (Simulation Model), 363 Pan 10. also see 11.264,358 fiowchart phase, 24 Mulliplexing), 492
TM (Test Model), 363 uniform mesli. 349
VM (Verification Model), 363 wavelet coding, 346 probotyping and tesbing, 24 Optimal Work-Ahead
ONU (Optical Smoothing
Nebworkllnit), 440 Plan, 473
XM (experimentation Model), 363 MPEG-4 AAC (Advanced Audio Coding), 414 storyboani, 24 Oribogonal, 216
MPEG-I, 312 BSAC (Bii-Siiced Arithmebic Coding), Multipath Fading, ~ Orthonormal, 216
B-frame, 313 414 Rayleigh fading,
Rician fading, ~~ basis, Syslems
0Sf (Open 230 Inlerconoection), 421
bibstream, 318 perceptual coders, 414
bloclc layer, 319 perceplual noise substitution, 414 Multiple Protocol Label Switching (MPLS), Out•of’gamut color, 95
D-frame, 319 SAOL (Stnictured Audio Orchestra 446
differences from 11.261,315 Language), 415 Multiplexing. 425 Packet fragmcntation, 424
Group of Pictures (GOPs) Iayer, 318 SNHC (Synbhetic/Nabural Hybrid DWDM (DWDM) (Dense WDM), 426 Padding. 338
1-frame, 313 Coding), 414 FDM (Frequency Division Multiplexing), Horizontal repebitive padding, 339
macroblock Iayer, 319 slnjctured coder, 414 425 Vertical repetibive padding, 339
mobion compensation, 313 ‘l]’S (Text-To-Speech), 414 TDM (Time Division Multiplexing), 426 PAINT, 78
P-frame, 313 MPEG-7, 361 WDM (Wavelength Division PAL (Phase Alternabing Line). 94,119,312
performance of, 317 Description Definition Language (DDL), Mulbiplexing), 426 Paletie animation, 68
picture Iayer, 318 368 WWDM (Wideband WDM), 426 PAN (Personal Area Neiwork), 432
prediction, 313 Description Scheme (DS), 365 Multiresolution analysis, 223, 225 PCM (Pulse Code Modulation), 147—150, 374
quanhization, 316 descripbor, 363 Multisampling, 142 PCN (Personal Communications Nehwork),
sequence Iayer, 318 MPEG-7 audio, 415 Munseli color naming system, 100 479
slice, 315 MPEO-J. 333 MUSE (MUltiple sub-Nyquist Sampling PCS (Personal Communication Services), 479
slice layer, 319 MPEGIet. 333 Encoding), 122 PDA (Personal Digital Assishant), 479, 507
MPEG-2, 319 Multimedia, 3 Music sequencing, 14 PDC (Personal Digital Ceilular), 479
alternate scan, 322 history of, 5 Calcewalk, 14 PDF (Portable Document Format), 78
data partitioning, 329 projecis, 5 Cubase, 14 Pel, 61
diflerences from MPEG-l, 329 research, 4 Soundedit, IS Perceptual nonuniformity, 134, 161,400,446

Pervasive computing, 492,507 midtread, 201 Sequencer. 139. 143 Surface spcctral reflectance, 85
Phase Modulalion, 426 vector quantization, 206 Sequenlial JPEG, 262 Switching, 434
PICT, 78 codebook, 206 Set-lop Box (STB), 464,472 ceil relay (ATM), 435
Pitch, 128, 139 Quanlizer Shannon—Fano algorithm, 171 circuit swilching, 434
Pixel, 61 backward adaptive, 376 Shape coding, 343 frame relay, 435
Pixel clock, 116 Jayant, ~ SIF (Source lnput Format), 312 packet switching, 434
PNG (Portable Nelwork Graphics), 76 Lloyd-Max, 154 SIP (Session Initiation Prolocol). 456 Sync skew, 444
alpha-channei, 77 Querying on video, 542 SMART (Shared Many-io-many ATM Synthesizer, 139, 142
Polyphony, 140 Reservalions), 462 Synthelic object coding, 349
Post Compression RaLe Distorlion (PCRD). Rate-dislortion theory, 200 SMIL (Synchronized Multimedia Integration Synthelic sound, 137
272 ‘ rate-distortion function, 200 Language), 12
PostScript, 78 RecalI, 541 SMPTE (Sociely of Motion Picture and TACS (Total Access Communication Syslem),
POTS (Piam OId Telephone Service), 455, 482 Reference frame, 290 Television Engineers), 88,94 480
PPM, 79 Region of Interest (ROl), 266,275 SMPTE-170M, 88 Targel fraine, 289
Precision, 541 Relevance feedback, 539 SMS (Short Message Service), 482 TCP (Transmission Control Protocol). 423
Predictive coding, 154 Retina, 84 SNR (Signal-to-Noise Ratio), 131,430 Acknowledgement (ACK), 423
Iossless, 151 RGB, 100. 363 SONET (Synchronous Optical NETwork), 428 reiransmission timeout, 423
Iossy. 151 RGB to CMY, 101 Sound window, 423
Prioritized delivery, 447 RLC (Run-lenglh Coding), 171 card, 139 TCPIIP prolocol, 422
Profile, 356 Rods, 84 digitization, 127 TDMA (Time Division Multiple Access), 481
Progressive JPEG, 262 RSVP (Resource ReSerVation Protocol), 451 wave, 126 Television systems
Progressive scanning, 113 RTCP (Real Time Control Protocol), 451 Spatial NTSC (National Television System
PSTN (Public Switched Telephone Network), RTP (Real-time Transport Protocol), ~9 domam, 191 Committec), 113, 116
434. 455,456 RTSP (Real Time Streaming Protocol), 453, frequency, 75, 105 PAL (Phase Allemating Line), 119
454 redundancy. 254, 288 SECAM (Systeme Electronique Couleur
QAM (Quadralure Amplitude Modulation), Run-length Coding (RLC), 259 Spectral Power Distribution (SPD), also see Avec Memoire), 119
426,429,440 Run-length Encoding (RLE), 78,259 spectnim, 82 Temporal redundancy. 289
QCIF (Quarter-CIF), 122 RVLC (Reversible Vatiable Lenglh Code), 498 Spectmphotometer, 82 Texture
QPSK (Quadralure Phase-Shifl Keying), 426, Spectmm, 82 analysis, 517
440 S-Video, 112,113 locus,92 coding, 341
Quadrature modulation, 117 SA-DCT (Shape Adaptive DCT), 341,342 SPIHT (Set Partitioning in Hierarchical Trees), Iayout, 517
Quality of Service (QnS), 424, 443,445,451, Sampling, 127 241,247 TIFF (Tagged Image File Fomiat), 77
453,461—463 alias frequency, 129 Spread spectrum, 483 Timbre, 140
QoS for lP, 446 fixed rale, 129 direct sequence. 484 Token ring. 433
Quanufying search results, 541 folding frequency, 129 frcquency hopping, 483 Transducer, 126
Quantization, 128, 255 frequency, 127 Sprite, 347 Transform coding, 207
decision boundary, 148 nonuniforn, 128 coding, 347 Transmission rale control, 473
distortion, 155 rate, 128 SQNR (Signal to Quantizalion Noise Ralio), Tristimulus values, 91
error, 132, 155 troe frequency, 129 131 Two Sided Geometric Distribulion (TSGD),
linear formal, 133 unifoim, 128 sRGB (Standard ROR), 89, 108 281
noise, 132 SAP (Session Announcement Prolocol), 458 Standard Generaiized Markup Language
nonuniform, 133 5DB (Synchronous Digital Hierarchy), 429 (SGML), 9 u-Iaw (also see ír-law), 134, 135. 137
nonunifonn quanhizer, 204 SDP (Session Descriphion Protocol), 459 Static texture coding, 346 Ubiquilous computing, 492, 507
companded quantizer, 205 SDTV (Standard Definition TV), also see STM (Synchronous Transport Module), 429 UDP (User Dalagram Prolocol), 424
Lloyd—Max quantizer, 204 HDTV, 124 Streaming media, 453
reconslruction levei, 148 SEAM (Scalable and Eflicient ATM streaming audio, 453 Variabie-length Coding (VLC), 167, 171,306
uniform, 133 Multicast), 462 streaming video, 453 Vertical retrace, 114
uniform scalar quantizer, 201 SECAM (Systeme Electronique Couleur Avec STS (Synchronous Transport Signal), 429 Video bitrales, 459
midrisc, 201 Memoire), 94, 119 Subband, 234 ABR (Available Bit Rahe), 459

CBR (Constant Bit Rate), 459,472 Wavelcngtli dominant, 93


nrt-VBR (non real-Lime Vaijable Bit Wavelet, 222
Rate), 459 admissibilily condilion, 229
rt-VBR (real-time Variable Bit Rale), 459 analysis filter. 232
UBR (Unspecified Bit Rale), 459 basis, 230
VBR (Variable BiL RaLe). 459,472 biorthogonal 232
Video broadcasting, 465 compact suppott, 232
Greedy Equal Bandwidth Broadcasting mothcr wavelel, 229
(GEBB). 467 synthesis 611cr. 232
harmonic broadcasling, 468 WCDMA (Widebafld CDMA). 488, 489,491
pagoda broadcasting, 470 Weber’s Law, 133
pyramid broadcasting. 466 While-poinL correction, 96
staggered broadcasling. 465 Wireless LAN (WLAN), 432,488,492
stream mcrging, 470 IEBE 802.11,492
Video card, 61 IEEE 802.! la, 492
Video conferencing, 295, 303, 360
Video editing, 15
After Effects, 16
Final Cut Pro, 16
Premiere, 15
Video ObjecL (VO), 334
Video Object Layer (VOL), 334 XML (Extensible Markup Language), 11. 361.
Video Object Plane (VOP), 335 368
Video signals, 112
XYZ to RGB, 97
Video-object Sequence (VS), 333
Video-on-Demand (VOD), 443, 464, 465
YCbCr, 107, 120, 363
Vocoder. 378 YIQ, 104,105,254
CELP (Code Excited Linear Prediction). YUV, 104,254
383
channel vocoder, 378
formant vocoder, 380
LPC (Linear Predictive Coding), 380
phase insensitivity, 378
VoIP (Voice over IP), 455
VRML (Virtual Reality Modelling Language), 51
animation and interaction, 54
history, 51
overview, 51
shapes, 52
specifics, 54

W3C (World Wide Web Consortium), 8, 368


WAN (Wide Arca Network), 424,430.434
WAP (Wireless Application Protocol), 493
Wave table, 137
data, 142
file, 139
synthesis, 138
