
Hardware Accelerators

Ingo Sander ingo@imit.kth.se


See also Wolf: Computers as Components, Ch 7

How to improve the performance of a microprocessor system?


- Choose a faster version of your microprocessor.
- Add additional computational units that perform special functions:
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator
2B1448 SoC Architectures 2

October 28, 2005

Hardware Accelerators

- If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator.
- The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.

Accelerated System Architecture

[Figure: CPU, accelerator, memory and I/O. Numbered labels: 1 Request, 2 Data, 3 Result.]

Amdahl's Law

- Amdahl's law states that the performance improvement gained from an improved unit is limited by the fraction of time the unit is in use.

$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Old}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{1}{(1 - \text{Fraction}_{\text{Enhanced}}) + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$

- Fraction_Enhanced denotes the fraction of time in which the enhancement can be used.

Example (Hennessy & Patterson)

- An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to
  - implement a square root unit that speeds up this operation by a factor of 10, or to
  - improve the floating-point instructions in general so that they run 2 times faster?

Example (Hennessy & Patterson) contd.

- Square root: Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 = 1.22
- Floating point: Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 = 1.33

Amdahl's Law: Lessons to be learned

- The maximum speedup that is possible is limited by the fraction F.
- Assume infinite speedup: Speedup = 1 / ((1 - F) + F/Infinity) = 1 / (1 - F)

  F      Max. Speedup
  0.1    1.11
  0.3    1.43
  0.5    2
  0.9    10

- Improve the common cases!
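The two Amdahl's law slides can be checked numerically. The following is a minimal sketch in Python; the function names `amdahl_speedup` and `max_speedup` are illustrative and not part of the lecture material.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the run time is sped up by a given factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced):
    """Upper bound on the speedup, reached when the enhanced part takes no time at all."""
    return 1.0 / (1.0 - fraction_enhanced)


# Example from the slides: square root unit (20% of the time, 10x faster)
# versus faster floating point in general (50% of the time, 2x faster).
sqrt_unit = amdahl_speedup(0.2, 10)
faster_fp = amdahl_speedup(0.5, 2)
print(f"square root unit:      {sqrt_unit:.2f}")
print(f"faster floating point: {faster_fp:.2f}")

# Maximum speedup for the fractions listed in the table.
for f in (0.1, 0.3, 0.5, 0.9):
    print(f"F = {f}: max. speedup = {max_speedup(f):.2f}")
```

The general floating-point improvement wins although its local speedup is smaller, because it applies to a larger fraction of the execution time.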

An Accelerator is not a Coprocessor

- A coprocessor is connected to the CPU and executes special instructions.
  - Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.

Design of a hardware accelerator

- Which functions shall be implemented in hardware and which functions in software?
- Hardware/software co-design: joint design of hardware and software architectures.
- The hardware accelerator can be implemented as
  - an application-specific integrated circuit (ASIC), or
  - a field-programmable gate array (FPGA).

Hardware/Software Co-Design

[Figure: co-design flow. Labels: Original C-Program; System Model; Partitioning & Mapping; Estimation; Library; Verification; HW-Model (VHDL), HW Synthesis, Netlist; SW-Model (C/C++), SW Compilation, Executable Program. Annotations: "Good estimates are needed for good partitioning" and "Which functions shall go to HW and SW?"]

Hardware/Software Co-Design

Hardware/software co-design covers the following problems:

- Co-specification: the creation of specifications that describe both the hardware and the software of a system.
- Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification.
- Co-simulation: the simultaneous simulation of hardware and software elements on different levels of abstraction.

Co-Synthesis

Four tasks are included in co-synthesis:

- Partitioning: The functionality of the system is divided into smaller, interacting computation units.
- Allocation: The decision as to which computational resources are used to implement the functionality of the system.
- Scheduling: If several system functions have to share the same resource, the usage of the resource must be scheduled in time.
- Mapping: The selection of a particular allocated computational unit for each computation unit.

All these tasks depend on each other!

Partitioning

- During partitioning, the functionality of the system is partitioned into several parts (corresponding to the allocated/available components).
- Many possible partitions exist.
- Analysis is done by evaluating the costs of different partitions.

Estimation

- In order to get a good partitioning, good figures are needed for
  - the performance (execution time) of a function on different components, and
  - the communication time.

Estimation Accuracy and Fidelity

- The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation.
- The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
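The fidelity definition can be made concrete with a short calculation. The sketch below counts, for every pair of design implementations, whether the estimates order the pair the same way as the actual values do. The function name `fidelity` and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def sign(a, b):
    """Return 1 if a > b, -1 if a < b, and 0 if they are equal."""
    return (a > b) - (a < b)


def fidelity(estimated, actual):
    """Percentage of design pairs whose ordering is predicted correctly."""
    pairs = list(combinations(actual, 2))
    correct = sum(
        1
        for x, y in pairs
        if sign(estimated[x], estimated[y]) == sign(actual[x], actual[y])
    )
    return 100.0 * correct / len(pairs)


# Hypothetical quality metric for three design implementations A, B, C.
actual = {"A": 20, "B": 30, "C": 10}
good_ranking = {"A": 12, "B": 18, "C": 5}   # inaccurate values, correct ordering
poor_ranking = {"A": 20, "B": 5, "C": 12}   # only A > C is predicted correctly

print(fidelity(good_ranking, actual))
print(fidelity(poor_ranking, actual))
```

An estimator can therefore be useful for partitioning even when its absolute values are off, as long as it ranks the alternatives correctly.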

Fidelity

[Figure: two plots, (1) and (2), of a quality metric for several design implementations, comparing estimated and actual values. One plot is annotated "Fidelity = 100%", the other "Fidelity = 33% (only A > C correct)".]

Although accuracy is much higher in (1) than in (2), the estimates in (1) are not very useful for the partitioning process because of their low fidelity. This can cause bad design decisions!

Hardware/Software Co-Design

Strategies:

1. Start with an all-software configuration.
   While (constraints are not satisfied)
       Move the SW function that gives the best improvement to HW
   (implemented in COSYMA [Ernst, Henkel, Benner 1993])
2. Start with an all-hardware configuration.
   While (constraints are satisfied)
       Move the most costly HW component to SW
   (implemented in Vulcan [Gupta, De Micheli 1995])

System design tasks

- Design a heterogeneous multiprocessor architecture.
  - Processing element (PE): CPU, accelerator, etc.
- Divide tasks among the processing elements.
- Verify that
  - the functionality of the system is correct, and
  - the system meets the performance constraints.

Papers on HW/SW Co-Design

- R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers. December 1993.
- R. K. Gupta and G. De Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers. December 1993.
- G. De Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE. March 1997.
- (and many more)

Electronic versions of these and other papers can be accessed through the KTH Library (www.lib.kth.se).

Why accelerators?

- Better cost/performance.
  - Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
  - CPU cost is a non-linear function of performance.
  - Improving performance by choosing a faster CPU may be very expensive!

[Figure: cost as a function of performance.]

Why accelerators? contd.

- Better real-time performance.
  - Put time-critical functions on less-loaded processing elements.
  - Remember scheduling overhead (Chapter 6).
  - Extra CPU cycles must be reserved to meet deadlines.

[Figure: cost as a function of performance, with the points "deadline" and "deadline plus scheduling overhead" marked.]

Why accelerators? contd.

- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- Even the largest single CPU may not be able to do all the work.

Accelerated system design

- First, determine that the system really needs to be accelerated.
  - Which core function(s) shall be accelerated? (Partitioning)
  - How much faster is the accelerator on the core function?
  - How large is the data transfer overhead?
- Design tasks:
  - Performance analysis; scheduling and allocation.
  - Design the accelerator itself.
  - Design the CPU interface to the accelerator.

Performance analysis

- The critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time.
  - Data transfer time.
  - Synchronization with the master CPU.
    - The accelerator needs to know when it can start its computation.
    - The CPU needs to know when the results are ready.

Single- vs. multi-threaded

- One critical factor is available parallelism:
  - single-threaded/blocking: CPU waits for accelerator;
  - multi-threaded/non-blocking: CPU continues to execute along with accelerator.
- To multithread, the CPU must have useful work to do.
  - But the software must also support multithreading.
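The difference between blocking and non-blocking use of an accelerator can be illustrated in software. The sketch below models the accelerator as a worker thread from Python's standard `concurrent.futures` module; `accelerator_kernel` and `cpu_work` are hypothetical stand-ins for the accelerated function and for other useful work on the CPU.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_kernel(data):
    """Stand-in for the function executed on the accelerator (hypothetical)."""
    return sum(x * x for x in data)


def cpu_work(data):
    """Stand-in for other useful work the CPU could do in the meantime."""
    return max(data)


def run_blocking(data):
    """Single-threaded/blocking: the CPU waits until the accelerator is done."""
    with ThreadPoolExecutor(max_workers=1) as accel:
        result = accel.submit(accelerator_kernel, data).result()  # CPU idles here
    other = cpu_work(data)
    return result, other


def run_non_blocking(data):
    """Multi-threaded/non-blocking: the CPU continues while the accelerator runs."""
    with ThreadPoolExecutor(max_workers=1) as accel:
        future = accel.submit(accelerator_kernel, data)  # start the accelerator
        other = cpu_work(data)                           # CPU does useful work
        result = future.result()                         # synchronize on the result
    return result, other


print(run_blocking([1, 2, 3]))
print(run_non_blocking([1, 2, 3]))
```

Both variants compute the same values; the non-blocking variant only pays off when `cpu_work` does not depend on the accelerator result.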

Sources of parallelism

- Overlap I/O and accelerator computation.
  - Perform operations in batches; read in the second batch of data while computing on the first batch.
- Find other work to do on the CPU.
  - May reschedule operations to move work after accelerator initiation.

Total execution time

[Figure: process graphs for CPU processes P1 to P4 and accelerator process A1. Single-threaded: the processes run one after another. Multi-threaded: after P1 the flow splits, P2 and P3 run on the CPU while A1 runs on the accelerator, and the flow joins before P4.]

Execution time analysis

- Single-threaded: Count the execution time of all component processes.
- Multi-threaded: Find the longest path through the execution.

[Figure: execution timelines on the CPU and the accelerator (Acc.) for P1 to P4 and A1, single-threaded and multi-threaded.]

Accelerator execution time

- Total accelerator execution time:

$$t_{accel} = t_{in} + t_x + t_{out}$$

  where t_in is the data input time, t_x is the accelerated computation time, and t_out is the data output time. Together, t_in and t_out form the communication overhead.
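The two analysis rules and the formula for t_accel can be combined in a few lines. The following sketch uses hypothetical process times and assumes a process graph in which, after P1, the CPU runs P2 and P3 while the accelerator runs A1, and P4 starts after both branches have finished. The function names are illustrative.

```python
def accelerator_time(t_in, t_x, t_out):
    """t_accel = t_in + t_x + t_out (data input, computation, data output)."""
    return t_in + t_x + t_out


def single_threaded_time(times):
    """Blocking case: every process contributes its full execution time."""
    return sum(times.values())


def longest_path(node, times, successors):
    """Non-blocking case: length of the longest path starting at node."""
    tails = [longest_path(s, times, successors) for s in successors.get(node, [])]
    return times[node] + max(tails, default=0)


# Hypothetical execution times in arbitrary time units.
times = {
    "P1": 2,
    "P2": 3,
    "P3": 3,
    "P4": 2,
    "A1": accelerator_time(t_in=1, t_x=2, t_out=1),
}
# Assumed process graph: P1 splits into the CPU branch P2 -> P3 and the
# accelerator branch A1; both branches join before P4.
successors = {"P1": ["P2", "A1"], "P2": ["P3"], "P3": ["P4"], "A1": ["P4"]}

print(single_threaded_time(times))
print(longest_path("P1", times, successors))
```

With these numbers the accelerator branch is completely hidden behind the CPU branch, so the multi-threaded execution time is determined by P1, P2, P3 and P4 alone.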

Data input/output times

- Bus transactions include:
  - flushing register/cache values to main memory;
  - time required for the CPU to set up the transaction;
  - overhead of data transfers by bus packets, handshaking, etc.

Example for Accelerator Architecture

[Figure: block diagram with Mem, CPU and DMA, and an accelerator consisting of Bus Interface, Registers, Read Unit, Write Unit and Core.]

Accelerator/CPU interface

- Accelerator registers provide control registers for the CPU.
- Data registers can be used for small data objects.
- The accelerator may include special-purpose read/write logic.
  - Especially valuable for large data transfers.

Caching problems

- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data (assume a cache in the CPU).
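The caching hazard can be reproduced with a very small model. The sketch below assumes a CPU-side cache that is not updated when another bus master writes to main memory; the class `CachedCPU`, the location name `"S"` and the values are hypothetical. The CPU reads a location, the accelerator then writes the same location directly in memory, and the next CPU read returns the stale cached copy unless the cache line is invalidated first.

```python
class CachedCPU:
    """CPU model with a trivial read cache in front of main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def read(self, addr):
        # Fill the cache on a miss, then always serve the cached copy.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def invalidate(self, addr):
        # Drop the cached copy so that the next read goes to main memory.
        self.cache.pop(addr, None)


memory = {"S": 1}
cpu = CachedCPU(memory)

first = cpu.read("S")   # CPU reads location S and caches it
memory["S"] = 42        # accelerator writes location S in main memory
stale = cpu.read("S")   # CPU reads S again and gets the old cached value

cpu.invalidate("S")     # software invalidates the cache line
fresh = cpu.read("S")   # now the value written by the accelerator is visible

print(first, stale, fresh)
```

In a real system the invalidation corresponds to a cache-control operation or to mapping the shared buffer as uncached; which mechanism is available depends on the processor.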

Possible Problems with Caches

1. CPU reads location S.
2. Accelerator writes location S.
3. CPU reads location S.

Wrong value! In step 3 the CPU reads the old copy of S from its cache.

[Figure: CPU with cache, memory, and accelerator.]

Design for a multiprocessor environment

- If several processing entities exist, each entity can have several tasks.
- Divide the functional specification into units.
  - Map units onto PEs.
  - Units may become processes.
- Determine the proper level of parallelism:
  f3(f1(), f2()) as a single unit vs. separate units f1(), f2() and f3().

Partitioning methodology

- Divide the CDFG into pieces, shuffle functions between pieces.
- Hierarchically decompose the CDFG to identify possible partitions.

Partitioning example

[Figure: CDFG with conditions cond 1 and cond 2 and processes P1 to P5, grouped into Block 1, Block 2 and Block 3.]

Scheduling and allocation

- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps.
  - Alternatively allocate, then schedule.

Example Accelerator

- Data-flow graph: inputs x and y, output h(f(x), g(y)).
- Architecture: P, A, M.

Execution Times

       f    g    h
  P    5    5    5
  A    2    2    -

- Both P and A have sufficient registers.
- P and A cannot access the bus simultaneously.
- A memory access (load or store) takes 1 time unit.

Single-Processor Solution

Data-flow graph: h(f(x), g(y))

  PE   Operation      Time
  P    Load x         1
  P    Load y         1
  P    f              5
  P    g              5
  P    h              5
  P    Store h(...)   1
       Total          18

Processor-Accelerator Solution I

Data-flow graph: h(f(x), g(y))

  PE   Operation   Time
  A    Load x      1
  A    Load y      1
  A    f           2
  A    g           2
  A    Store f     1
  A    Store g     1
  P    Load f      1
  P    Load g      1
  P    h           5
  P    Store h     1
       Total       16

Still single-threaded!

Processor-Accelerator Solution II

Data-flow graph: h(f(x), g(y))

  PE   Operation   Time
  P    Load y      1
  P    g           5
  A    Load x      1    (A runs in parallel with g on P)
  A    f           2
  A    Store f     1
  P    Load f      1
  P    h           5
  P    Store h     1
       Total       13

The operations on A overlap with g on P, so the total is 1 + 5 + 1 + 5 + 1 = 13.

Exploitation of parallelism leads to a fast solution!
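The totals of the three solutions can be reproduced with a small schedule calculator. The sketch below is an illustration, not a general scheduler: operations are placed in the listed order, each one starts as soon as its predecessors are finished, its processing element is free and, for loads and stores, the shared bus is free. The function name `makespan` and the operation names are hypothetical; the execution times are the ones from the example.

```python
def makespan(ops):
    """Total execution time of a schedule.

    ops is a list of (name, pe, duration, uses_bus, deps) in scheduling order.
    """
    pe_free = {}   # time at which each processing element becomes free
    bus_free = 0   # time at which the shared bus becomes free
    end = {}       # finish time of each operation
    for name, pe, duration, uses_bus, deps in ops:
        start = max([pe_free.get(pe, 0)] + [end[d] for d in deps])
        if uses_bus:
            start = max(start, bus_free)  # P and A cannot use the bus together
        end[name] = start + duration
        pe_free[pe] = end[name]
        if uses_bus:
            bus_free = end[name]
    return max(end.values())


single_processor = [
    ("load_x", "P", 1, True, []),
    ("load_y", "P", 1, True, []),
    ("f", "P", 5, False, ["load_x"]),
    ("g", "P", 5, False, ["load_y"]),
    ("h", "P", 5, False, ["f", "g"]),
    ("store_h", "P", 1, True, ["h"]),
]

solution_1 = [
    ("load_x", "A", 1, True, []),
    ("load_y", "A", 1, True, []),
    ("f", "A", 2, False, ["load_x"]),
    ("g", "A", 2, False, ["load_y"]),
    ("store_f", "A", 1, True, ["f"]),
    ("store_g", "A", 1, True, ["g"]),
    ("load_f", "P", 1, True, ["store_f"]),
    ("load_g", "P", 1, True, ["store_g"]),
    ("h", "P", 5, False, ["load_f", "load_g"]),
    ("store_h", "P", 1, True, ["h"]),
]

solution_2 = [
    ("load_y", "P", 1, True, []),
    ("load_x", "A", 1, True, []),
    ("g", "P", 5, False, ["load_y"]),
    ("f", "A", 2, False, ["load_x"]),
    ("store_f", "A", 1, True, ["f"]),
    ("load_f", "P", 1, True, ["store_f"]),
    ("h", "P", 5, False, ["g", "load_f"]),
    ("store_h", "P", 1, True, ["h"]),
]

for label, ops in [("single processor", single_processor),
                   ("solution I", solution_1),
                   ("solution II", solution_2)]:
    print(label, makespan(ops))
```

Solution I uses the faster accelerator but keeps everything sequential, while solution II gains more by letting P compute g while A loads x, computes f and stores the result.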

System integration and debugging

- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build equipment to test the accelerator.
- Hardware/software co-simulation can be useful.

Summary

- The use of a hardware accelerator can lead to a more efficient solution,
  - in particular when the parallelism in the functionality can be exploited.
- Hardware/software co-design techniques can be used for the design of an accelerator.
- You have to be aware of cache coherence problems if the processor or the accelerator uses a cache.