
Challenges & Implications for VLSI Architectures for Multimedia Processing

Vineet Sahula
sahula@ieee.org
Deptt of ECE
Malaviya National Institute of Technology
Jaipur
Outline
• Motivation & challenges
• Choice of architectures
• Tasks in Multimedia processing
• Design optimization approach
– Throughput enhancement
– Power optimization

IETE'05 VLSI Arch. for Multimedia 2


Motivation
• Performance improvement of wireless
systems attributed to recent advances in
– Comm. Technology Standards & networking
– VLSI Technology
• VLSI design technology has evolved due to
– Demands raised by applications viz. mobile
computing & communication (almost converged)



Multimedia Processing
• Multiple media
– Amalgamation of data, voice, audio, images, graphics, speech
• Characteristics
• Desire for high bandwidth, high throughput, and low power
• Mobile applications
• Low cost, very low power, real time



Classification of Multimedia Tasks
• Low level
• Characterized by highly regular sequences of operations & data
accesses
– Implies identical operations on large number of samples with high
potential for data parallelism

• Medium level
• Tasks link simple data structures (pixels) and symbolic information
– Data dependent decisions & lower regularity

• High level
• Operations on symbols & complex objects of variable sizes
– Highly data dependent computation-flow
– Advance prediction not possible



Video Compression

• Throughput (MOPS)
– Motion estimation (~80%)
• Not matched by a GPP
• Needs specific optimizations
– ASICs

• Power consumption profile
– Motion estimation (~70%)
• Requires an integral approach from system level to gate level design



Implementation Architecture

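[Figure: energy efficiency (MOPS/mW) plotted against flexibility. Energy efficiency is highest for dedicated HW and falls through reconfigurable HW and DSP/ASIPs to the programmable processor, while flexibility rises in the same order.]
• Software programmable
– General purpose (GPP)
– Application specific (DSP/ASIP)
• Hardware programmable (CPLD/FPGA)
• Dedicated hardware (ASIC)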
Data-Path & Control
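• z=(a+b)+(c+d)
• Dedicated HW
– 2 time steps with 2 ALUs
– 1 time step with 3 ALUs
• Control FSM
– 1-hot encoded [HW control]
– Micro-program control [Control memory]
[Figure: data path with inputs a, b, c, d, multiplexers Mx and My, register R with load signal LR, and output z. The control FSM runs Start, S1 (Mx=1, My=0, LR=1), S2 (Mx=0, My=1, LR=1), Stop.]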
Programmable Processors
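• z=(a+b)+(c+d)
[Figure: processor with memory, register bank (load signals LR1 to LR4), registers Rx, Ry and Rz, and instruction register (I-Reg); control is either HW control or microprogram control.]
• Instruction sequence
– Load R1
– Load R2
– R3 ← R1+R2
– Load R1
– Load R2
– R4 ← R1+R2
– R3 ← R3+R4
• CISC
• RISC
• DSP- Multiply-Accumulate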
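The load/add sequence for z=(a+b)+(c+d) can be traced with a small register-transfer simulation. This is only a sketch: the function name `run_program`, the tuple encoding of instructions, and the assumption that operands arrive from memory in the order a, b, c, d are illustrative and not taken from the slide.

```python
def run_program(a, b, c, d):
    """Simulate the load/add sequence for z = (a + b) + (c + d).

    Assumption: each Load fetches the next operand from memory in the
    order a, b, c, d. Returns (z, number_of_instructions).
    """
    memory = iter([a, b, c, d])
    regs = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}
    program = [
        ("load", "R1"), ("load", "R2"), ("add", "R3", "R1", "R2"),
        ("load", "R1"), ("load", "R2"), ("add", "R4", "R1", "R2"),
        ("add", "R3", "R3", "R4"),
    ]
    for instr in program:
        if instr[0] == "load":
            regs[instr[1]] = next(memory)          # Load Rn
        else:
            _, dst, src1, src2 = instr
            regs[dst] = regs[src1] + regs[src2]    # Rdst <- Rsrc1 + Rsrc2
    return regs["R3"], len(program)


z, steps = run_program(1, 2, 3, 4)
print(z, steps)  # 10 7
```

The seven sequential instructions contrast with the one or two time steps of the dedicated data path, which is the control overhead the slides attribute to programmable processors.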
Architecture Characteristics
• Processors
• Instruction set is fixed/customized
• Algorithm changes adapted through SW rewriting
• Power & computation-time overheads are large
• Reconfigurable HW
• Architecture at logic level is fixed
• Architecture reconfiguration requires interconnection
programming
• Dedicated HW
• HW can’t be reconfigured
• Can be extremely power-efficient and high performance
Dedicated HW
• Suitable for the most compute-intensive low-level tasks

• Dedicated HW: the VLSI implementation with the highest efficiency
• Overhead for control is minimum
• Power consumption can be made low

• Functionality is fixed
• Redesign means new design



Programmable Architecture
• Suitable for High Level Tasks

• Highly flexible for irregular tasks


• Can address larger application domain

• Control overhead is large


• Silicon area is large
• Software development time is an additional overhead
Hybrid Architecture
• Suitable for Medium Level Tasks as well as
HLTs
• Choice of mixing ASIPs/DSP/FPGAs [2,3]

[2] D. Chauhan et al., Hardware Design Evaluation for Fast Motion Estimation, B.Tech. Thesis, MNIT Jaipur, 2004
[3] Govind S. and V. Sahula, ASIP Design Space Exploration for Motion Estimation, IEEE VDAT, 2003
Dedicated HW Implementation
• 2D DCT/IDCT for video codec
– Matrix multiplication: regular and parallelizable

• Motion estimation
– Estimate motion vector (MV) through block matching
• Very regular and parallelizable
– Minimizing a distortion metric
– Mean absolute difference (MAD)
– Object based?
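Block matching with the MAD metric can be sketched as an exhaustive (full) search over candidate displacements. The function names, the frame layout (lists of rows), and the search range below are assumptions made for illustration; the slides do not specify a particular search strategy.

```python
def mad(cur, ref, bx, by, dx, dy, n):
    """Mean absolute difference between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustive block matching. Returns (best_motion_vector, best_mad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose block falls outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width
                    and 0 <= by + dy and by + dy + n <= height):
                continue
            cost = mad(cur, ref, bx, by, dx, dy, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic 8x8 reference frame; the current frame is the reference
# shifted by 2 pixels in x and 1 pixel in y (clamped at the border).
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]

print(full_search(cur, ref, bx=2, by=2, n=4, search_range=2))  # ((2, 1), 0.0)
```

Every candidate evaluates the same regular loop nest on independent data, which is why this task maps well onto parallel dedicated hardware.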
Media Processor Chips
• Philips TriMedia
• Audio/visual, graphics, communication tasks
• VLIW
– 25 FUs: ALUs, multipliers, FP units

• Texas Instruments TMS320C6X


• Not exactly MM chip, but a general purpose DSP
• VLIW
– 2 symmetrical data paths
– 4 FU
– Multi-port registers



Media Processor Chips
• AxPe1280V
• Video signal processor
• RISC core
• SIMD/MIMD
– 8 Data paths units

• AT&T AVP4000
• DSP
• 3 ASICs



Throughput Enhancement
• Exploit parallelism in data operations/ computation
• Explicit parallel implementation
– Multiprocessors
• SIMD, MIMD
• Implicit parallel solution
– Pipelined FU
– Pipelined Data-path
• Dedicated HW, processors
– Pipelined Control (Instruction level parallelism)
• Processors only



Terminology
[Figure: data path with primary inputs I1, I2, …, Ii, … and primary outputs O1, O2, …, Oj, …]
• Critical path delay, TD
– Longest delay from a primary input Ii to a primary output Oj
– TD in ns
• Throughput: rate of producing outputs
– Throughput = number of operations/sec
– With pipelining, can be much higher than 1/TD



Data-path/FU Pipelining
• Un-pipelined
– Delay TD
– Throughput
• 1/TD
• Linear pipeline of k stages
– Stage delay (clock period) is TD/k, ignoring register delays
– Throughput is k/TD
• Non-linear pipeline
– Latency is L, including register delays
– Delay is L
– Throughput is a complex function of k and L
• Wave-pipelining
– Asynchronous circuit
– No registers
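The linear-pipeline figures can be checked numerically with a short sketch. The function name and the optional per-stage register overhead `t_reg_ns` are assumptions added for illustration; with zero overhead the result reduces to the ideal TD/k and k/TD values.

```python
def linear_pipeline(t_d_ns, k, t_reg_ns=0.0):
    """Return (clock_period_ns, latency_ns, throughput_per_ns) for a k-stage
    linear pipeline cut from a path with critical path delay t_d_ns.

    Assumptions: stages are perfectly balanced, and each stage adds a
    hypothetical register overhead of t_reg_ns.
    """
    clock = t_d_ns / k + t_reg_ns   # stage delay sets the clock period
    latency = k * clock             # time for one datum to traverse all stages
    throughput = 1.0 / clock        # one result per clock once the pipe is full
    return clock, latency, throughput


print(linear_pipeline(20.0, 4))            # (5.0, 20.0, 0.2)
clock, latency, rate = linear_pipeline(20.0, 4, t_reg_ns=1.0)
print(clock, latency)                      # 6.0 24.0
```

With a non-zero register overhead the throughput stays below k/TD and the latency rises above TD, which is one reason deeper pipelining gives diminishing returns.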
FU Pipelining- FP Adder
[Figure: FP adder split into four pipeline stages S1 to S4]
• Un-pipelined
– Delay kT
– For n data, total delay knT
– Throughput 1/kT
• Pipelined
– Delay of a stage T
– Stages k (=4)
– Throughput < 1/T
• Speed-up
$$SU = \frac{\text{delay-unpipelined}}{\text{delay-pipelined}} = \frac{nk}{n+k-1}, \qquad \lim_{n \to \infty} SU = k$$
– SU = (throughput-pipelined)/(throughput-unpipelined)
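The speed-up expression follows from comparing total delays for n data items: the un-pipelined unit needs n·k·T, while a k-stage pipeline needs k·T for the first result and T for each of the remaining n-1. A minimal sketch, with a function name chosen here for illustration:

```python
def pipeline_speedup(n, k):
    """Speed-up of a k-stage pipeline over the un-pipelined unit for n data.

    Un-pipelined total delay: n * k * T
    Pipelined total delay:    (k + n - 1) * T
    The stage delay T cancels in the ratio.
    """
    return (n * k) / (n + k - 1)


for n in (1, 4, 100, 10000):
    print(n, pipeline_speedup(n, k=4))
```

For a single datum the speed-up is 1, and it approaches k = 4 only as n grows large, matching the limit stated for the FP adder.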
Pipelined Circuit



Data Path with FB/FF Connections

Area-pipeline





Pipelining Explorations- Infeasible Alternatives



Pipelining- Feasible Alternatives



Limitations During Parallelizing
• Amdahl's law
$$\frac{1}{SU} = (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{SU_{enhanced}}$$

1. For x = Fraction_enhanced = 0.9, SU_enhanced = 100
1/SU = 0.1 + 0.9/100 = 0.109, so SU ≈ 9.2

2. To achieve SU = 80 with 100 parallel resources, what x is feasible?
1/80 = 1 - x + x/100, so 99x/100 = 79/80 and x ≈ 99.75%
Only about 0.25% of the code may be sequential!
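Both worked cases can be evaluated directly from Amdahl's law. The sketch below uses function names chosen for illustration; `required_fraction` simply solves 1/SU = (1 - x) + x/SU_enhanced for x.

```python
def amdahl_speedup(x, su_enhanced):
    """Overall speed-up when a fraction x of the work is sped up by su_enhanced."""
    return 1.0 / ((1.0 - x) + x / su_enhanced)


def required_fraction(su_target, su_enhanced):
    """Fraction x that must be enhanced to reach su_target overall.

    Derived from 1/SU = 1 - x + x/SUe, so x = (1 - 1/SU) / (1 - 1/SUe).
    """
    return (1.0 - 1.0 / su_target) / (1.0 - 1.0 / su_enhanced)


print(round(amdahl_speedup(0.9, 100), 2))           # 9.17
print(round(100 * required_fraction(80, 100), 2))   # 99.75
```

Even with 100 parallel resources, a 10% sequential part caps the speed-up below 10, and reaching 80 leaves room for only about a quarter of a percent of sequential work.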



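Power Dissipation in a CMOS Gate- Inverter
• Switching power
– Energy drawn from the supply: $\int V_{CC}\, i(t)\, dt$
– Influenced by supply voltage
[Figure: CMOS inverter driving load capacitance CL from supply VCC, drawn for Vin=0 and for Vin=‘1’]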



Dynamic Power Dissipation- Capacitor Path
• During capacitor charging
• P1 gets dissipated in the p-channel transistor (resistive dissipation in Rp)
• P2 gets transferred to CL
• Vout transits 0 → 1
• During capacitor discharging
• P2 gets dissipated
• Vout transits 1 → 0
• In a cycle, the total power is thus Pdyn = P1 + P2



Dynamic capacitive power
• Formula for dynamic power:
$$P_{dyn} = C_L V_{CC}^2 f$$
• Observations
– Depends on CL
• Fanout number
– Depends on frequency of operation f
– Depends on supply voltage VCC (= VDD)
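The three dependencies can be illustrated by evaluating the dynamic power formula for a few operating points. The numeric values below (1 pF load, 5 V supply, 100 MHz) are assumed purely for illustration and do not come from the slides.

```python
def dynamic_power(c_load, v_supply, freq):
    """Dynamic capacitive power in watts: C_L * V^2 * f.

    c_load in farads, v_supply in volts, freq in hertz.
    """
    return c_load * v_supply ** 2 * freq


# Assumed example values: 1 pF load, 5 V supply, 100 MHz switching.
p_ref = dynamic_power(1e-12, 5.0, 100e6)         # about 2.5 mW
p_half_v = dynamic_power(1e-12, 2.5, 100e6)      # supply halved
p_half_f = dynamic_power(1e-12, 5.0, 50e6)       # frequency halved

print(p_ref / p_half_v)   # about 4: quadratic in supply voltage
print(p_ref / p_half_f)   # about 2: linear in frequency
```

Halving the supply voltage saves twice as much as halving the frequency, which is why voltage reduction is usually the first lever considered.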



Reducing Power in a Gate
• Lower the supply voltage!
– Quadratic effect on dynamic power
• Reduce capacitance
– Short interconnect lengths
– Drive small gate load (small gates, small fan-out)
• Reduce frequency
– Lower clock frequency -> use more parallelism
– Lower signal activity



Switching Activity
[Figure: two-input logic gate with its truth table]
• Out of 4 possible output transitions
– Output transition occurs for two input pattern-pairs (IPP) only
– CL remains connected
– Out of 4 clocks, power-dissipating switching occurs in one clock only
– Average power for 4 clocks
• $\frac{1}{4} C_L f V_{DD}^2$
• In general
– $P = a\, C_L f V_{DD}^2$
– a is the switching activity of the gate or composite logic circuit
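The switching activity of a gate can be estimated by counting power-dissipating output transitions over a sequence of input patterns. This sketch assumes that only 0 → 1 output transitions draw energy from the supply, and the OR gate and the input sequence are illustrative choices rather than the gate shown on the slide.

```python
def switching_activity(gate, patterns):
    """Fraction of clocks in which the gate output makes a 0 -> 1 transition.

    gate:     function mapping input bits to an output bit
    patterns: input tuples applied on consecutive clocks
    """
    outputs = [gate(*p) for p in patterns]
    rising = sum(1 for prev, cur in zip(outputs, outputs[1:])
                 if prev == 0 and cur == 1)
    clocks = len(outputs) - 1        # number of pattern-to-pattern steps
    return rising / clocks


def or_gate(a, b):
    return a | b


# Assumed input sequence; the outputs are 0, 1, 1, 1, 0.
patterns = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
alpha = switching_activity(or_gate, patterns)
print(alpha)  # 0.25
```

The resulting value plays the role of a in P = a·CL·f·VDD², so a sequence with one charging transition in four clocks gives a quarter of the full-rate switching power.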



Low Power VLSI Implementation
• During HW implementation
– Explore possibility of low power solution
• System level power management
– Clock gating
– Dynamic power management
• Behavioral level transformations
– Suited for DSP circuits
– Algorithm level, Filter-structure level, dataflow transformations,
Voltage scaling, clock gating
• Architecture level
– Bit-width reduction for arithmetic operations [4], pre-computation
based architectures, clock gating

[4] G. Singh, Low Power Floating Point Arithmetic Circuits, M.Tech. Thesis, MNIT, 2003



Low Power HW…
• Logic Level
– FSM synthesis
• State assignment for low power, minimum Hamming distance to minimize
switching
– Low power technology mapping
• Circuit Level
– Apply low power data-sequence to a logic gate [5]
– Transistor sizing for low power and high performance
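The effect of state assignment on switching can be sketched by summing the Hamming distances between the codes of connected states, weighted by how often each transition is taken. The four-state ring FSM, the unit weights, and the function names are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which codes a and b differ."""
    return bin(a ^ b).count("1")


def switching_cost(transitions, encoding):
    """Weighted number of state-register bit flips.

    transitions: list of (source, destination, weight) tuples
    encoding:    dict mapping state name to its integer code
    """
    return sum(w * hamming(encoding[s], encoding[d])
               for s, d, w in transitions)


# Hypothetical 4-state ring FSM with equally likely transitions.
ring = [("S0", "S1", 1), ("S1", "S2", 1), ("S2", "S3", 1), ("S3", "S0", 1)]
binary = {"S0": 0b00, "S1": 0b01, "S2": 0b10, "S3": 0b11}
gray = {"S0": 0b00, "S1": 0b01, "S2": 0b11, "S3": 0b10}

print(switching_cost(ring, binary))  # 6
print(switching_cost(ring, gray))    # 4
```

For this ring the Gray-style assignment flips one bit per transition, so the state register toggles less than with plain binary counting.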

[5] P. Jain, V. Sahula, Low power IPP characterization for small digital circuits, IEEE VDAT, 2002



Low Power Software
• Sources of dissipation
– Buses, memory, control & clock distribution
scheme
• Power optimization
– Match algorithm to architectural resources
– Minimize memory accesses
– Proper sequencing of data transfer on bus, to
reduce bus switching
– Instruction reordering
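Bus switching for a given transfer order can be estimated by counting the bit toggles between consecutive words. The sketch below assumes the transfers are independent, so reordering them is legal; the data words and the function name are illustrative.

```python
def bus_toggles(words):
    """Total number of bus lines that change value across consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Assumed 8-bit data words, in the original and in a reordered sequence.
original = [0x00, 0xFF, 0x01, 0xFE]
reordered = [0x00, 0x01, 0xFF, 0xFE]

print(bus_toggles(original))   # 23
print(bus_toggles(reordered))  # 9
```

The same four words cause far fewer line transitions when neighbours in the sequence have similar bit patterns, which is the idea behind sequencing transfers and reordering instructions for low power.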



Thanks.
