
2007-02-06

Embedded Tutorial

Hardware/Software Codesign
of Embedded Systems

Petru Eles and Zebo Peng


Embedded Systems Laboratory (ESLAB)
Linköping University

Lecture Contents

- Introduction and basic issues.
- Architectures and platforms.
- Analysis, co-simulation, and design space exploration.
- System-level power/energy optimization.

Prof. Z. Peng, ESLAB/ LiTH

Introduction

Codesign of embedded systems:
- The design flows
- Definition and motivation
- System level design issues

Traditional Design Flow

[Figure: Informal System Specification leads to Early, Manual Partitioning, which splits into two separate tracks:
  SW Specification -> Programming -> SW Simulation -> SW Implementation
  HW Specification -> HW Design -> HW Simulation -> HW Implementation
Both tracks join at Integration and System Test.]

Design Time

[Figure: two timelines compared.
  Traditional design: Specification & Partitioning, then HW Design & Simulation, then SW Design & Simulation, then Integration & Test.
  HW/SW codesign: Specification & Partitioning, then HW Design & Simulation and SW Design & Simulation carried out together with Co-simulation & Co-verification, then Integration & Test.
  Result: reduced time to market (TTM).]

HW/SW Codesign

The concurrent design of hardware and software elements, supporting explicit hardware/software trade-off.
- Co-specification to create a common specification that describes both hardware and software elements.
- Co-synthesis to concurrently synthesize the hardware and software implementations as well as their interfaces.
- Co-simulation and co-verification to simultaneously simulate and verify the hardware and software elements.

Why Codesign?

Reduce time-to-market.
Achieve better designs:
- More design alternatives can be explored.
- Better solutions can be found by advanced optimization techniques.

To meet strict design constraints, such as:
- Timing or performance constraints.
- Power dissipation.
- Physical constraints, e.g., size and weight.
- Safety and reliability constraints.
- Cost constraints.

Codesign is also made possible by the advances in design methodologies and tools.

Vertical Codesign

Instruction set processor design, for both general-purpose systems and ASIPs (Application Specific Instruction Processors).

To determine how big a hardware engine you need to run your application and meet its constraints.

[Figure: layered stack running from Specification through Software and Instruction set down to Hardware.]

Codesign of Processors

General-Purpose Processors
- Architectural support for operating systems.
- Cache design and tuning (e.g., selection of cache size and control schemes).
- Pipeline control design (control mechanisms, compiler design).

ASIPs
- Customization of instruction sets and specific resources (e.g., accelerator and coprocessor).
- Design of register files, buses and interconnections.
- Development of a specific compiler.

Horizontal Codesign

Some of the system functionality is implemented in software running on programmable CPUs, while other functions are implemented in hardware.
Typical for design of embedded systems.

[Figure: from the Specification, codesign of a specialized processor combining a programmable processor with ASICs and other hardware.]

What is an Embedded System?

There are many different definitions!
- A special-purpose computer system that is used for a particular task.
- A computer-based system embedded in real-life machines. Though computer-based, it does not have the usual keyboard and monitor. The processor and related circuitry are configured to do a specific task.

Some highlight what it is (not) used for:
- Any device which includes a programmable component but itself is not intended to be a general purpose computer.

Some focus on what it is built from:
- A collection of programmable parts surrounded by ASICs and other standard components, that interact continuously with an environment through sensors and actuators.

Characteristics of an Embedded System

Dedicated (not general purpose).
- One or several applications known at design-time.

Contains a programmable component.
- But usually not programmable by the end-user.

Interacts (continuously) with the environment:
- Real-time behavior.
- Predictable.
- Safe and reliable.
- Run-time environment is fixed (faster ? better).

Usually very cost sensitive:
- Mass-market products in highly competitive markets that have to be shipped at a low cost.

Low power is often preferred.

Embedded Systems

[Figure: microprocessor market shares in 1999: embedded systems 99%, general purpose systems 1%.]

Embedded Controllers

[Figure: an embedded controller (CPU, memory, and a HW unit with application-specific logic, timers, and A/D and D/A conversion) connected to the environment through sensors and actuators.]

Reactive systems:
- The system never stops.
- The system responds to signals produced by the environment.

Distributed Embedded Systems

[Figure: several ECUs connected by networks that are linked through gateways. Each ECU contains a CPU, RAM, ROM, an ASIC, an I/O interface to sensors and actuators, and a network interface.]

Time and Power Constraints

Time constraints:
- They have to perform in real-time: if data are not ready by a certain deadline, the system fails to perform correctly.
- Hard deadline: failure to meet it leads to major hazards.
- Soft deadline: failure to meet it can be tolerated, but quality of service is reduced.

Power constraints:
- There are several reasons why low power/energy consumption is required.
- Battery life: high energy consumption leads to short battery lifetime.
- Cost aspects: high power consumption requires a strong power supply and an expensive cooling system.

Safety Critical Requirements

Embedded systems are often used in life-critical applications.
- Avionics, automotive electronics, nuclear plants, medical applications, military applications, etc.

Reliability and safety are major requirements.
To guarantee correctness during design:
- Formal verification: mathematics-based methods to verify certain properties of the designed system.
- Automatic synthesis: certain design steps are automatically performed by design tools (correctness by construction).

Short Time to Market

In highly competitive markets it is critical to catch the market window:
- A short delay with the product on the market can have catastrophic financial consequences (even if the quality of the product is excellent).

Design time has to be reduced!
- Advanced design methodologies.
- Efficient design tools.
- Reuse of previously designed and verified (hardware and software) blocks.
- Platforms for several products in a family.
- Good designers who understand both software and hardware!

The ES Design Challenges

- Increasing application complexity (e.g., automotive).
- Heterogeneous architecture (HW, SW, network, mechatronics, etc.).
- Stringent time and power constraints.
- Low cost requirement.
- Short time to market.
- Safety and reliability (e.g., very long life-time).

In order to achieve all these requirements, systems have to be highly optimized.
Both hardware and software aspects have to be considered simultaneously!

Current Design Practice

1. Start from some informal specification and a set of constraints (time, power, and cost constraints).
2. Generate a more formal specification, based on some modeling concept (FSM, data-flow, etc.), using Matlab, Statecharts, SystemC, C, UML, or VHDL.
3. Simulate the model in order to check its functionality. The model is modified, if needed.
4. Choose an architecture such that the cost limit is satisfied and, hopefully, the time and power constraints will be fulfilled.
5. Implement both the hardware and software components and build a prototype.
6. Validate the system.

A usual outcome: neither time nor power constraints are satisfied!

The Consequences

Delays in the design process:
- Increased design cost.
- Delays in time to market: missed market window.

High cost due to many iterations with implementation and prototyping.

Bad design decisions taken under time pressure:
- Low quality.
- High cost.

The lesson: We need to explore more design alternatives in an efficient manner.
- At the system level!

System-Level Design

[Figure: system-level design flow. Informal Specification and Constraints feed Modeling and Architecture Selection, which produce a System Model (checked by Functional Simulation and Formal Verification) and a System Architecture. Mapping, Estimation, and Scheduling are repeated while the result is Not OK. Once OK, the Mapped and Scheduled Model (checked by Simulation and Formal Verification) is refined into a Software Model and a Hardware Model, which are passed on to Lower-Level Design.]

The Improved Design Flow

Several design alternatives are evaluated before going down to the lower-level design.
- This is performed as part of the design space exploration process.
- Different architectures, mappings and schedules are explored, before the actual implementation and prototyping.

We get highly optimized solutions in a short time.
- There is a good chance that design iterations at the lower level, including prototyping, can be avoided.

Additional Improvements

Formal verification
- It is impossible to do an exhaustive simulation.
- Especially for safety critical systems, formal verification is needed.

Simulation
- Used not only for functional validation.
- Should also be used after mapping and scheduling in order to check, for example, timing properties.
- May be used also during the implementation steps: hardware/software co-simulation.

Hardware/software trade-offs
- Hardware/software partitioning to decide what is to be mapped on a programmable processor (SW) and what is going into HW.
- Hardware/software co-synthesis to coordinate the HW and SW synthesis processes and allow moving of functionality from one to the other.

The Lower-Level Issues

Software generation:
- Encoding in an implementation language (C, C++, assembler).
- Compiling (this can include particular optimizations for application specific processors, DSPs, etc.).
- Generation of a real-time kernel or adapting to an existing operating system.

Hardware synthesis:
- Encoding in an HDL (VHDL or Verilog).
- Successive synthesis steps: high-level, register-transfer level, logic-level synthesis.

Hardware/software integration:
- The software is run together with the hardware model (co-simulation).

Prototyping:
- A prototype of the hardware is constructed and the software is executed on the target architecture.

Lower-Level Design

There are established CAD tools on the market which automatically perform many of the low level tasks:
- Code generators (software model to C, hardware model to VHDL).
- Compilers.
- Hardware synthesis tools:
  - RT-level synthesis
  - Logic synthesis
  - Layout and physical implementation
- Test generators and debuggers.
- Simulation and co-simulation tools.

Focus on System-Level Design

System-level decisions have a huge influence on the quality of the final implementation.
Very few commercial tools are available.
Mostly experimental and academic tools are available.
Huge efforts and investments are currently made in order to develop tools and methodologies for system level design.
Ad-hoc solutions are less and less acceptable.
It is the system level that we are mainly interested in, in this course!

Concluding Remarks

Codesign provides the capability to make explicit and efficient hardware/software trade-offs.
Codesign of embedded systems has many advantages and challenges.
Cost and performance optimization requires system-level approaches.

Analysis, Co-Simulation
and Design Space Exploration
Zebo Peng
Embedded Systems Laboratory (ESLAB)
Linköping University

Outline

Design space exploration

Static analysis techniques

Co-simulation approaches


The Design Space

Very large due to many solution parameters:
- architectures and components
- hardware/software partitioning
- mapping and scheduling
- operating systems and global control
- communication synthesis

[Figure: a heterogeneous SoC combining hardware (microprocessor, ASIC, analog circuit, sensor, embedded memory, DSP, network, high-speed electronics) and software. Image sources: S3, Stratus Computers.]

Design Space Exploration

What is needed in order to explore the complex design space and find a good solution:
- Exploration at the higher levels of abstraction.
- Development of high-level analysis and estimation techniques.
- Employment of very fast exploration algorithms.
- Memory-less algorithms.
- Each solution needs a huge data structure to store, so we cannot afford to keep track of all visited solutions.

The Optimization Problem

The majority of design space exploration tasks can be viewed as optimization problems:
To find
- the architecture (type and number of processors, memory modules, and communication blocks, as well as their interconnections),
- the mapping of functionality onto the architecture components, and
- the schedules of basic functions and communications,
such that a cost function (in terms of implementation cost, performance, power, etc.) is minimized and a set of constraints is satisfied.

The System Partitioning Problem

[Figure: a graph with weighted nodes and edges divided into two groups (two-way partitioning).]

A feasible solution for the k-way partitioning can be represented as:
$x_i = j$; $j \in \{1, 2, \ldots, k\}$, $i = 1, 2, \ldots, n$.

Hardware/Software Partitioning

Input: Implementation independent system specification consisting of interacting processes (e.g., VHDL).

Output: Two sets of processes, assigned for hardware and software implementation respectively.

Target architecture:
- Microprocessors
- ASICs
- Shared memories

Hardware/Software Partitioning

Assumptions:
- Microprocessor and ASIC working in parallel;
- Reducing the amount of communication between the microprocessor and hardware improves the overall performance.

Objectives:
- Maximal performance at a given cost limit.
- Minimal implementation cost such that the timing and other constraints are satisfied.

Hardware/Software Partitioning

Quantitative values can be derived via simulation, profiling, or static analysis of the specification.
Ex.
- computation load (CL): number of operations executed by a basic region or process of the specification.
- communication intensity (CI): total number of communication operations on a channel between two processes.

Performance improvement based on:
- Placing computation intensive processes into hardware.
- Increasing parallelism.
- Reducing inter-domain communication.

Process Graph Formulation

Nodes correspond to processes, which could be processes or basic blocks in the original specification (e.g., VHDL).
Node weights reflect the degree of suitability for hardware implementation of the corresponding process:
- the computation load of the process;
- the uniformity of operations in the process;
- the potential parallelism inside the process;
- suitability for software implementation.
Edges connect two nodes iff there exists a communication channel between them.
Edge weights are a measure of communication and mutual synchronization between the processes.

Process Graph Formulation

The Graph Partitioning Problem:
To partition the process graph into two groups such that the sum of the weights of the cut edges is minimal, subject to a set of constraints.
Ex.
- Physical limitation of silicon area: $\sum_{i \in H} H\_cost_i \le Max^H$
- Implement a node in HW when it is appropriate: $W_i^N \ge Lim_1 \Rightarrow i \in Hw$
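
As a minimal sketch of this formulation, the process graph can be held as a dictionary of weighted edges, and a candidate partition evaluated by summing the weights of the cut edges and checking an area bound. The process names, edge weights, per-process area figures, and the `MAX_HW_AREA` limit below are all hypothetical values chosen for illustration; they do not come from the slides.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

def cut_weight(edges: Dict[Edge, float], hw: Set[str]) -> float:
    """Sum the weights of edges whose endpoints lie in different groups."""
    return sum(w for (u, v), w in edges.items() if (u in hw) != (v in hw))

def hw_cost(area: Dict[str, float], hw: Set[str]) -> float:
    """Total hardware cost (e.g., silicon area) of the processes placed in HW."""
    return sum(area[p] for p in hw)

# Hypothetical process graph: edge weights model communication intensity.
edges = {("p1", "p2"): 5, ("p2", "p3"): 15, ("p3", "p4"): 8, ("p1", "p4"): 3}
# Hypothetical hardware cost of each process if it were implemented in HW.
area = {"p1": 40, "p2": 35, "p3": 20, "p4": 25}
MAX_HW_AREA = 60  # assumed silicon area limit

hw = {"p2", "p3"}  # candidate hardware group; the rest stays in software
feasible = hw_cost(area, hw) <= MAX_HW_AREA
print(cut_weight(edges, hw), hw_cost(area, hw), feasible)
```

An exploration algorithm would call `cut_weight` as its objective and discard (or penalize) candidates for which the area check fails.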


Features of CO Problems

Most combinatorial optimization (CO) problems for digital system design, e.g., system partitioning with constraints, are NP-complete.
The time needed to solve an NP-complete problem grows exponentially with respect to the problem size n.
For example, to enumerate all feasible solutions for a scheduling problem (all possible permutations), we have:
- 20 tasks in 1 hour (assumption);
- 21 tasks in 20 hours;
- 22 tasks in 17.5 days;
- ...
- 25 tasks in 6 centuries.
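
The growth in the list above can be checked with a few lines of arithmetic. The sketch below rests on an assumption that is not stated on the slide: that the enumeration time for n tasks is proportional to (n-1)!, normalized so that 20 tasks take 1 hour. This is one reading that reproduces the quoted figures; scaling by n! instead would give somewhat larger numbers.

```python
from math import factorial

def enumeration_hours(n_tasks: int, base_tasks: int = 20) -> int:
    """Hours needed to enumerate all orderings of n_tasks tasks.

    Assumption (not from the slide): time is proportional to (n-1)!,
    and base_tasks tasks take exactly 1 hour.
    """
    return factorial(n_tasks - 1) // factorial(base_tasks - 1)

for n in (20, 21, 22, 25):
    hours = enumeration_hours(n)
    print(f"{n} tasks: {hours} h = {hours / 24:.1f} days "
          f"= {hours / (24 * 365):.1f} years")
```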


Features of CO Problems

Many CO problems can be formulated as an Integer Linear Programming (ILP) problem, and solved by an ILP solver.
It is inherently more difficult to solve an ILP problem than the corresponding Linear Programming problem.
The size of problem that can be solved successfully by ILP algorithms is an order of magnitude smaller than the size of LP problems that can be easily solved.

Heuristics

A heuristic seeks near-optimal solutions at a reasonable computational cost without being able to guarantee either optimality or feasibility.
Motivations:
- Many exact algorithms involve a huge amount of computation effort.
- The decision variables frequently have complicated interdependencies.
- We often have nonlinear cost functions and constraints, or even no mathematical functions at all.
  Ex. The cost function f can, for example, be defined by a computer program (e.g., for power estimation).
- Approximation of the model for optimization.
  A near-optimal solution is usually good enough and could even be better than the theoretical optimum.

Heuristic Approaches to CO

                          Problem specific          Generic methods
Constructive              Clustering                Branch and bound
                          List scheduling           Divide and conquer
                          Left-edge algorithm
Transformational          Kernighan-Lin algorithm   Neighborhood search
(iterative improvement)                             Simulated annealing
                                                    Tabu search
                                                    Genetic algorithms
                                                    (meta-heuristics)

Clustering for System Partitioning

Each node initially belongs to its own cluster, and clusters are then gradually merged until the desired partitioning is found.
The merge operation is selected based on local information (closeness metrics), rather than a global view of the whole system.
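
The merging procedure can be sketched as follows. The closeness metric used here (the total weight of edges between two clusters) and the example graph are assumptions made for illustration; the slide does not fix a particular metric.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]
Cluster = FrozenSet[str]

def closeness(a: Cluster, b: Cluster, edges: Dict[Edge, float]) -> float:
    """Assumed closeness metric: total weight of the edges joining a and b."""
    return sum(w for (u, v), w in edges.items()
               if (u in a and v in b) or (u in b and v in a))

def cluster(nodes: List[str], edges: Dict[Edge, float], k: int) -> List[Cluster]:
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters: List[Cluster] = [frozenset([n]) for n in nodes]
    while len(clusters) > k:
        # Local decision: only pairwise closeness is inspected.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: closeness(pair[0], pair[1], edges))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical five-node graph.
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = {("v1", "v2"): 10, ("v2", "v3"): 8, ("v1", "v3"): 7,
         ("v3", "v4"): 1, ("v4", "v5"): 9}
result = cluster(nodes, edges, k=2)
print(result)
```

Because each merge looks only at pairwise closeness, the procedure is fast but greedy: an early merge is never undone.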

[Figure: successive clustering steps on a five-node graph (v1 to v5), in which the individual nodes are gradually merged into larger clusters.]

The Kernighan-Lin Algorithm (KL)

A graph is partitioned into two clusters of arbitrary size, by minimizing a given objective function.
KL is based on an iterative partitioning strategy:
- The algorithm starts with two arbitrary clusters C1 and C2.
- The partitioning is then iteratively improved by moving nodes between the clusters.
- At each iteration, the node which produces the minimal value of the cost function is moved; this value can, however, be greater than the value before moving the node.
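
A single improvement pass in the spirit of this description might look like the sketch below. It follows the move-based wording of the slide rather than the classic pair-swap formulation. Several details are assumptions added for illustration: each node is moved at most once per pass (locking), the best partition seen during the pass is remembered, moves that would empty a cluster are disallowed, and the cost function is the cut weight of a hypothetical four-node graph.

```python
import math
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]

def cut_cost(edges: Dict[Edge, float], c1: Set[str]) -> float:
    """Cost function (assumed): weight of the edges crossing the two clusters."""
    return sum(w for (u, v), w in edges.items() if (u in c1) != (v in c1))

def kl_pass(nodes: List[str], edges: Dict[Edge, float],
            c1: Set[str]) -> Tuple[Set[str], float]:
    """One pass: repeatedly move the unlocked node giving the lowest cost."""
    current = set(c1)
    best, best_cost = set(current), cut_cost(edges, current)
    unlocked = set(nodes)

    def cost_after_move(n: str) -> float:
        candidate = current ^ {n}          # toggle n between C1 and C2
        if not candidate or len(candidate) == len(nodes):
            return math.inf                # assumed rule: keep both clusters non-empty
        return cut_cost(edges, candidate)

    while unlocked:
        node = min(sorted(unlocked), key=cost_after_move)
        new_cost = cost_after_move(node)
        if new_cost == math.inf:
            break
        current ^= {node}                  # the move may be uphill
        unlocked.remove(node)              # lock the node for this pass
        if new_cost < best_cost:
            best, best_cost = set(current), new_cost
    return best, best_cost

# Hypothetical graph: two tightly coupled pairs joined by a weak edge.
edges = {("a", "b"): 5, ("c", "d"): 5, ("b", "c"): 1}
best, cost = kl_pass(["a", "b", "c", "d"], edges, {"a", "c"})
print(best, cost)
```

Accepting the least-bad move even when it raises the cost is what lets a pass climb out of a poor starting partition; remembering the best partition seen keeps those uphill moves from degrading the final answer.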


Branch-and-Bound

Traverse an implicit tree to find the best leaf (solution).

[Figure: 4-city TSP instance (cities 0 to 3) with weighted edges and one tour highlighted.]

Total cost of this solution = 88

Branch-and-Bound Example

Lower bound on the cost function.
Search strategy.

Search tree (partial tour, bound L on the tour cost):
{0}  L >= 0
  {0,1}  L >= 3
    {0,1,2}  L >= 43
      {0,1,2,3}  L = 88
    {0,1,3}  L >= 8
      {0,1,3,2}  L = 18
  {0,2}  L >= 6
    {0,2,1}  L >= 46
      {0,2,1,3}  L = 92
    {0,2,3}  L >= 10
      {0,2,3,1}  L = 18
  {0,3}  L >= 41
    {0,3,1}  L >= 46
      {0,3,1,2}  L = 92
    {0,3,2}  L >= 45
      {0,3,2,1}  L = 88
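
The search over this tree can be sketched as a depth-first traversal that abandons a partial tour as soon as its cost reaches the best complete tour found so far. The distance matrix below is not given explicitly on the slides; it is inferred from the differences between the bound values in the tree (for instance, the step from {0,1} to {0,1,2} raises the bound from 3 to 43, suggesting a distance of 40 between cities 1 and 2) and should be read as an assumption. Using the partial tour cost as the lower bound is likewise the simplest possible choice, valid when all distances are non-negative.

```python
import math
from typing import List, Optional, Tuple

# Distance matrix inferred from the bounds in the search tree (assumption).
D = [[0, 3, 6, 41],
     [3, 0, 40, 5],
     [6, 40, 0, 4],
     [41, 5, 4, 0]]

def branch_and_bound_tsp(dist: List[List[int]]) -> Tuple[float, Optional[List[int]]]:
    """Depth-first branch and bound starting and ending at city 0."""
    n = len(dist)
    best_cost: float = math.inf
    best_tour: Optional[List[int]] = None

    def expand(path: List[int], cost: int) -> None:
        nonlocal best_cost, best_tour
        if cost >= best_cost:
            return  # bound: this subtree cannot beat the best known tour
        if len(path) == n:
            total = cost + dist[path[-1]][path[0]]  # close the tour
            if total < best_cost:
                best_cost, best_tour = total, path
            return
        for city in range(n):
            if city not in path:
                expand(path + [city], cost + dist[path[-1]][city])

    expand([0], 0)
    return best_cost, best_tour

cost, tour = branch_and_bound_tsp(D)
print(cost, tour)
```

With this matrix, the subtrees below {0,2,1} and {0,3} are never expanded, because their bounds (46 and 41) already exceed the tour of cost 18 found earlier.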


Neighborhood Search Method

Step 1 (Initialization)
  (A) Select a starting solution xnow ∈ X.
  (B) xbest = xnow, best_cost = c(xbest).

Step 2 (Choice and termination)
  Choose a solution xnext ∈ N(xnow).
  If no solution can be selected or the terminating criteria apply, then the method stops.

Step 3 (Update)
  Reset xnow = xnext.
  If c(xnow) < best_cost, perform Step 1(B).
  Go to Step 2.

N(x) denotes the neighborhood of x, which is a set of solutions reachable from x by a simple transformation.
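
The three steps map directly onto a small generic routine. In the sketch below, the function names, the `max_iters` termination criterion, and the toy problem (minimizing (x - 7)^2 over the integers 0 to 20, with neighbors x - 1 and x + 1) are hypothetical choices for illustration. The choice rule is passed in as a function, so that different strategies can reuse the same skeleton; the one shown accepts only improving neighbors.

```python
from typing import Callable, List, Optional, Tuple

def neighborhood_search(
    x0: int,
    neighbors: Callable[[int], List[int]],
    cost: Callable[[int], float],
    choose: Callable[[int, List[int]], Optional[int]],
    max_iters: int = 1000,
) -> Tuple[int, float]:
    # Step 1: initialization.
    x_now = x0
    x_best, best_cost = x_now, cost(x_now)
    for _ in range(max_iters):            # assumed terminating criterion
        # Step 2: choice and termination.
        x_next = choose(x_now, neighbors(x_now))
        if x_next is None:
            break
        # Step 3: update.
        x_now = x_next
        if cost(x_now) < best_cost:
            x_best, best_cost = x_now, cost(x_now)
    return x_best, best_cost

# Toy problem (hypothetical).
def cost(x: int) -> float:
    return (x - 7) ** 2

def neighbors(x: int) -> List[int]:
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]

def improving_choice(x_now: int, candidates: List[int]) -> Optional[int]:
    """Pick the best strictly improving neighbor, or None if there is none."""
    better = [c for c in candidates if cost(c) < cost(x_now)]
    return min(better, key=cost) if better else None

result = neighborhood_search(0, neighbors, cost, improving_choice)
print(result)
```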

Neighborhood Search Method

The neighborhood search method is very attractive for many CO problems as they have a natural neighborhood structure, which can be easily defined and evaluated.
- Ex. Graph partitioning: swapping two nodes.

[Figure: a two-way partitioned weighted graph before and after swapping two nodes between the partitions.]

The Descent Method

Step 1 (Initialization)

Step 2 (Choice and termination)
  Choose xnext ∈ N(xnow) such that c(xnext) < c(xnow), and terminate if no such xnext can be found.

Step 3 (Update)

The descent process can easily get stuck at a local optimum.

[Figure: cost plotted over the solution space, showing a local optimum.]

Dealing with Local Optimality

Enlarge the neighborhood.
Start with different initial solutions.
Allow uphill moves:
- Simulated annealing
- Tabu search

[Figure: cost plotted over the solution space.]

The SA Algorithm

Select an initial solution xnow ∈ X;
Select an initial temperature t > 0;
Select a temperature reduction function α;
Repeat
    Repeat
        Randomly select xnext ∈ N(xnow);
        δ = cost(xnext) - cost(xnow);
        If δ < 0 then xnow = xnext
        else generate random p uniformly in the range (0, 1);
            If p < exp(-δ/t) then xnow = xnext;
    Until iteration_count = nrep;
    Set t = α(t);
Until stopping condition = true.
Return xnow as the approximation to the optimal solution.
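
A runnable sketch of this procedure follows. The parameter defaults (initial temperature, geometric cooling with factor 0.9, `nrep`, and a minimum temperature as the stopping condition), the fixed random seed, and the toy problem are assumptions chosen for illustration. The sketch also keeps track of the best solution seen, which the pseudocode above does not do; it returns both that and the final xnow.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

def simulated_annealing(
    x0: int,
    neighbors: Callable[[int], List[int]],
    cost: Callable[[int], float],
    t0: float = 10.0,
    alpha: Callable[[float], float] = lambda t: 0.9 * t,  # assumed cooling schedule
    nrep: int = 50,
    t_min: float = 1e-3,                                   # assumed stopping condition
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    rng = rng or random.Random(0)
    x_now = x_best = x0
    t = t0
    while t > t_min:
        for _ in range(nrep):
            x_next = rng.choice(neighbors(x_now))
            delta = cost(x_next) - cost(x_now)
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x_now = x_next
            if cost(x_now) < cost(x_best):
                x_best = x_now      # extra bookkeeping, not in the pseudocode
        t = alpha(t)
    return x_now, x_best

# Toy problem (hypothetical): minimize (x - 7)^2 over the integers 0..20.
def cost(x: int) -> float:
    return (x - 7) ** 2

def neighbors(x: int) -> List[int]:
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]

x_final, x_best = simulated_annealing(0, neighbors, cost)
print(x_final, x_best)
```

At high temperature almost every move is accepted, so the search wanders; as t shrinks, exp(-δ/t) falls toward zero for uphill moves and the behavior approaches plain descent.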

A HW/SW Partitioning Example

[Figure: cost function value (axis from 35000 to 75000) versus number of iterations (axis from 0 to 1400); the optimum is reached at iteration 1006.]

Analysis Techniques

Analysis and simulation techniques are essential for hardware/software codesign:
- To guide the design space exploration.
- To provide feedback to the human designers.
- To support design validation.

Selection of an analysis/simulation technique is usually based on a trade-off between efficiency and accuracy.
For certain analyses, such as worst-case execution time analysis, it is also very important that the result is safe (i.e., correct or pessimistic).

Performance Metrics

Extreme case performance
- Worst-case execution time
- Best-case execution time

Average case performance

Probabilistic performance
- Used in soft real-time applications.
- To accurately handle the variable execution time of tasks, which may be due to:
  - Application characteristics (e.g., data dependent loops);
  - Architectural factors (e.g., cache misses);
  - External factors (e.g., network load); or
  - Insufficient knowledge.
- To guarantee a high probability of meeting timing constraints.

Simulation-based Techniques

- Software: running the compiled program on the simulated target architecture.
- Hardware: building a simulation model of the hardware and executing it to collect information.
- A very large number of inputs should be used in order to get good results.
- Only practical for average and probabilistic execution time estimation.
- It is difficult to use when individual programs are not running in isolation.

Static Analysis

Techniques that use information collected by analyzing the programs without executing them.
- No assumption about input data is made.

Restrictions on software:
- bounded loops
- absence of recursive functions
- absence of dynamic function calls

Can be used for:
- program analysis: behavior of a single program on a processor.
- system performance analysis: behavior of multiple processes on a single processor or several processors.

Program Analysis

The estimated worst-case execution time (WCET) must be safe and tight.

[Figure: time axis t showing the range of possible execution times, the actual WCET, and the estimated WCET.]

The ideal tool for source code analysis would produce a good WCET estimate based on the following inputs:
- Source code.
- Compiler.
- Machine architecture description.
- Operating system.

Program Path Analysis

To determine what sequence of instructions will be executed in the worst case scenario.
A basic block is composed of instructions in a straight line.
Let us first assume that each instruction takes a fixed time to execute.

Program Path Analysis

- Infeasible paths can be eliminated by data flow analysis and path information provided by the programmer.
- The number of feasible paths is typically exponential in the program size.
- Efficient methods are needed to avoid enumeration of all paths.

ILP Formulation

Let $x_i$ be the number of times a basic block $B_i$ is executed, and $c_i$ be the execution time of the basic block $B_i$, which is assumed to be a constant.
The total execution time of the program for a particular execution is:

$$\sum_{i=1}^{N} c_i x_i$$

where $N$ is the number of basic blocks.

[Figure: control flow graph with block costs C1 to C7, in which C5 is executed 11 times and C6 is executed 10 times, giving the total C1 + C2 + C4 + 11 C5 + 10 C6 + C7.]

ILP Formulation (Cont'd)

The estimated WCET of the program is:

$$\max \sum_{i=1}^{N} c_i x_i$$

subject to a set of constraints $A\mathbf{x} \le \mathbf{b}$.

The quality of the constraints defines the tightness of the estimate.
Constraint classification:
- Program structural constraints: deduced from the program's control flow graph.
- Program functionality constraints: provided by the user to specify loop bounds and other path information.

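An Example

/* k >= 0 */
s = k;
while (k < 10) {
    if (ok)
        j++;
    else {
        j = 0;
        ok = true;
    }
    k++;
}
r = j;

CFG (x_i is the execution count of block B_i, d_j is the traversal count of edge j):
  B1 (x1): s = k;               entered via d1, left via d2
  B2 (x2): while (k < 10)       entered via d2 and d8, left via d3 and d9
  B3 (x3): if (ok)              entered via d3, left via d4 and d5
  B4 (x4): j++;                 entered via d4, left via d6
  B5 (x5): j = 0; ok = true;    entered via d5, left via d7
  B6 (x6): k++;                 entered via d6 and d7, left via d8
  B7 (x7): r = j;               entered via d9, left via d10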

Constraints I

Structural constraints:
  d1 = 1
  x1 = d1 = d2
  x2 = d2 + d8 = d3 + d9
  x3 = d3 = d4 + d5
  ...

[Figure: the CFG of the example, annotated with block counts x1 to x7 and edge counts d1 to d10.]

Constraints II

Functionality constraints:
- Loop bound information: $0 \cdot x_1 \le x_3 \le 10 \cdot x_1$
- Path information: $x_5 \le 1 \cdot x_1$
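
For a program this small, the maximization can be carried out by direct enumeration instead of an ILP solver, which makes the role of each constraint visible. The sketch below makes several assumptions. The block execution times in `c` are hypothetical. The relations x2 = x3 + 1, x4 = x3 - x5, x6 = x3 and x7 = 1 are derived here from flow conservation on the example CFG, since the slides list only the first few structural constraints. A real tool would hand the full constraint set to an ILP solver.

```python
from typing import Dict, Optional, Tuple

def wcet_example(c: Dict[int, int],
                 loop_bound: int = 10) -> Tuple[int, Dict[int, int]]:
    """Maximize sum(c[i] * x[i]) for the example CFG by enumeration."""
    best: Optional[Tuple[int, Dict[int, int]]] = None
    for x3 in range(0, loop_bound + 1):          # loop bound: 0 <= x3 <= 10 * x1
        for x5 in range(0, min(1, x3) + 1):      # path info: x5 <= 1 * x1
            x = {
                1: 1,            # x1 = d1 = 1
                2: x3 + 1,       # loop test runs once more than the body
                3: x3,
                4: x3 - x5,      # x3 = d4 + d5 = x4 + x5
                5: x5,
                6: x3,           # every loop iteration reaches k++
                7: 1,            # the exit block runs once
            }
            total = sum(c[i] * x[i] for i in x)
            if best is None or total > best[0]:
                best = (total, x)
    assert best is not None
    return best

# Hypothetical execution times (in cycles) of basic blocks B1..B7.
c = {1: 2, 2: 3, 3: 2, 4: 1, 5: 4, 6: 1, 7: 1}
wcet, counts = wcet_example(c)
print(wcet, counts)
```

Without the path constraint on x5, the maximization would charge the more expensive else-branch on every iteration and give a looser (though still safe) estimate, which illustrates how the quality of the constraints determines tightness.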

Remarks on Performance Analysis

- One of the main issues of hardware/software codesign is estimation and analysis.
- Analysis of average and probabilistic performance can be done by simulation.
- Worst case execution time analysis can only be efficiently done by static analysis techniques.
- Efficient techniques for analyzing impacts of many advanced micro-architecture components are still research issues.

Simulation

Applied usually directly to the design descriptions, e.g., VHDL.
Can be used at different levels of abstraction:
- System
- Algorithmic
- Register-transfer
- Logic
- Gate
- Switch and circuit

Co-Simulation

How can the hardware and software components be simulated at the same time?

Problems:
- Different simulation platforms are used;
- Software runs fast while hardware simulation is relatively slow.
  - How to run the system simulation as fast as possible and keep the two domains synchronized?

Slow models provide full details and produce accurate results; fast models do not produce enough timing information and simulation is less accurate.

Approaches to Co-Simulation 1

Gate-level model of the processor

[Figure: co-simulation framework in which the SW runs on a gate-level model of the processor (VHDL) under VHDL simulation, alongside the ASIC model (VHDL) under VHDL simulation.]

- Gate level simulation of the processor is very slow (tens of clock cycles/sec).
  Ex. At 10 cycles/sec, simulating one second of real time on a 1 GHz processor takes 100 million seconds (about 3.2 years).
- This provides a very accurate solution and is very simple from the co-simulation point of view.
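
The slowdown quoted in the example follows from a one-line calculation, sketched here with the function and parameter names chosen for illustration:

```python
def simulation_seconds(target_hz: float, sim_cycles_per_sec: float,
                       real_seconds: float = 1.0) -> float:
    """Wall-clock seconds needed to simulate real_seconds of target execution."""
    return target_hz * real_seconds / sim_cycles_per_sec

# 1 GHz target processor, gate-level simulator running 10 cycles per second.
secs = simulation_seconds(1e9, 10)
years = secs / (365 * 24 * 3600)
print(f"{secs:.0f} s = {years:.2f} years")
```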

Approaches to Co-Simulation 2

Instruction-set architecture models

[Figure: co-simulation framework in which the SW runs on an ISA model (a C program running on the host), alongside the ASIC model (VHDL) under VHDL simulation.]

- There is no hardware model of the target processor; the software is executed on an ISA model (usually in C); execution on the ISA model provides interface information (including timing) needed for co-simulation.
- This is fast but timing accuracy depends on the interface information.

Approaches to Co-Simulation 3

Translation-based models

[Figure: co-simulation framework in which the software, compiled into native code for the host, runs directly on the host, alongside the ASIC model (VHDL) under VHDL simulation.]

- There is no hardware model of the target processor; the software is compiled into native code for the host processor; software execution provides interface information (including timing) needed for co-simulation.

Approaches to Co-Simulation 4

Hardware in the loop:
- Combine hardware and software in one solution, by using the physical device to model its own behavior.
- An adaptor formats inputs to the physical device, applies the input, and returns the resulting outputs with timing information to the simulator.
- This is a good choice for modeling complex standard components such as microprocessors.

Approaches to Co-Simulation 5

Mixed level simulation: to combine the strengths of simulation at different levels of abstraction and provide a possibility to compare results at different levels.
- Broadband simulator: one broadband language is used, which covers several abstraction levels.
- Multi-simulator: several simulators are used in an integrated environment. Main issues to deal with:
  - The data exchange between the various simulators.
  - The synchronization of the simulators involved.

Concluding Remarks

- Efficient techniques for design verification for embedded systems are hot research topics.
- The basic problem of co-simulation is how to simulate hardware and software together so that simulation is fast and accurate.
- Formal verification mathematically proves design correctness. The issues there are computational complexity and integration into the design flow.

Hardware/Software Codesign

Arch & Platf - 1

Architectures and Platforms

1. Architecture Selection: The Basic Trade-Offs
2. General Purpose vs. Application-Specific Processors
3. Processor Specialisation
4. ASIP Design Flow
5. Specialisation of a VLIW ASIP
6. Tool Support for Processor Specialisation
7. Application Specific Platforms
8. IP-Based Design (Design Reuse)
9. Reconfigurable Systems

Petru Eles, IDA, LiTH

Remember the Design Flow

[Figure: the complete design flow. Informal Specification and Constraints feed Modeling and Architecture Selection, which produce a System Model (checked by Functional Simulation and Formal Verification) and a System Architecture. Mapping, Estimation, and Scheduling are repeated while the result is not OK. The Mapped and Scheduled Model (checked by Simulation and Formal Verification) is refined into a Software Model and a Hardware Model. Software Generation and Hardware Synthesis, each with Simulation, produce Software Blocks and Hardware Blocks, which go through Prototype, Testing (looping back when not OK), and Fabrication.]

Architecture Selection and Mapping

Select the underlying hardware structure on which to run the modelled system.
Map the functionality captured by the system over the components of the selected architecture.
Functionality includes processing and communication.

Hardware/Software Codesign

Arch & Platf - 4

Architecture Selection

General Purpose vs. Application Specific
- Use a general purpose, existing platform and map the application on it.
- Build a customised architecture strictly optimised for the particular application.
- Or something in-between.

Software vs. Hardware
- Use programmable processors running software.
- Use dedicated electronics (fixed or reconfigurable).
- Or both.

Mono vs. Multipr., Single vs. Multichip
- Monoprocessor
- Multiprocessor (single chip or multi chip)

Arch & Platf - 5

Architecture Selection (contd)


The trade-offs:
Performance (high speed, low power consumption)
- high: Application specific; low: General purpose.
- high: Hardware; in between: Reconfigurable hardware; low: Software.
Flexibility (how easy it is to upgrade or modify)
- high: General purpose; low: Application specific.
- high: Software; in between: Reconfigurable hardware; low: Hardware.

Arch & Platf - 6

Architecture Selection (contd)

[Figure: energy consumed versus flexibility, each on a low / med. / high scale. ASIC: low energy, low flexibility. ASIP and FPGA: medium. GP proc.: high energy, high flexibility. Adjacent energy levels are about an order of magnitude apart.]

Arch & Platf - 7

General Purpose vs. Application Specific Processors


Both GP processors and ASIPs (application specific instruction set
processors) can be RISCs, CISCs, DSPs, microcontrollers, etc.
- One could look at DSPs and microcontrollers as being specific
for DSP and simple control applications respectively.
- An application specific DSP or microcontroller is, however,
more specialised than just for DSP or control applications.
GP processors
- Neither the instruction set nor the microarchitecture nor the
memory system is customised for a particular application or
family of applications.
ASIPs
- Instruction set, microarchitecture and/or memory system are
customised for an application or family of applications.
- What results is better performance and reduced power
consumption.

Arch & Platf - 8

What Makes an ASIP Specific?


What can we specialize in a processor?
Instruction set (IS) specialisation
Exclude instructions which are not used
- reduces instruction word length (fewer bits needed for encoding);
- keeps controller and data path simple.
Introduce instructions, even exotic ones, which are specific to the
application: combinations of arithmetic instructions (multiply-accumulate),
small algorithms (encoding/decoding, filter), vector
operations, string manipulation or string matching, pixel operations, etc.
- reduces code size → reduced memory size, memory bandwidth,
power consumption, execution time.


Arch & Platf - 9

What Makes an ASIP Specific?


Function unit and data path specialisation
Once an application specific IS is defined, this IS can be
implemented using a more or less specific data path and more or
less specific function units.
Adaptation of word length.
Adaptation of register number.
Adaptation of functional units
- Highly specialised functional units can be introduced for string
matching and manipulation, pixel operation, arithmetics, and
even complex units to perform certain sequences of
computations (co-processors).


Arch & Platf - 10

What Makes an ASIP Specific?


Memory specialisation
Number and size of memory banks.
Number and size of access ports.
- They both influence the degree of parallelism in memory access.
- Having several smaller memory blocks (instead of one big)
increases parallelism and speed, and reduces power consumption.
- Sophisticated memory structures can increase cost and bandwidth
requirement.
Cache configuration:
- separate instruction/data?
- associativity
- cache size
- line size
Depends very much on the characteristics of the application and, in
particular, on the properties related to locality.
Very large impact on performance and power consumption.

Arch & Platf - 11

What Makes an ASIP Specific?


Interconnect specialization
Interconnect of functional modules and registers.
Interconnect to memory and cache.
- How many internal buses?
- What kind of protocol?
- Additional connections increase the potential of parallelism.
Control specialisation
- Centralised control or distributed (globally asynchronous)?
- Pipelining?
- Out of order execution?
- Hardwired or microprogrammed?


Arch & Platf - 12

ASIP Design Flow


(It can be seen as a part of the big design flow - slide 2)

[Figure: ASIP design flow with the blocks Algorithm(s), Processor Architecture, Compiler, Simulator, and Performance numbers.]

Arch & Platf - 13

A SOC for Multimedia Applications

[Figure: SOC block diagram with Glue logic, A/D and D/A, Controller (ASIP), VLIW processor (ASIP), On-chip memory, and DSP (GP).]

This is a typical application specific platform. Its structure has been
adapted for a family of applications.
Besides GP processor cores, the platform also consists of ASIP cores
which themselves are specialised.
- The application specific Controller performs master control of the
system and memory access control.
- The off-the-shelf (GP) DSP performs less computation intensive
modem and sound codec functions.
- The VLIW ASIP performs computation intensive functions: discrete
cosine and inverse discrete cosine transforms, motion estimation, etc.

Arch & Platf - 14

Specialization of a VLIW ASIP


[Figure: VLIW ASIP datapath. Instruction fetch & decode (from memory system) controls a datapath of three clusters. Cluster 1: Register File 1 with ALU A1, MULT M1, MULT M2. Cluster 2: Register File 2 with MULT M3, MULT M4, ALU A2, ALU A3. Cluster 3: Register File 3 with MAC MA1, ALU A4, ALU A5, MULT M5. Internal storage & interconnect (Crossbar / Bus) leads to the memory system.]

Arch & Platf - 15

Specialization of a VLIW ASIP (contd)


This is what an instruction word looks like:

| op1 | op2 | op3 | op4 | op5 | op6 | op7 | op8 | op9 | op10 | op11 |
Cluster 1: op1 to op3; Cluster 2: op4 to op7; Cluster 3: op8 to op11.

Arch & Platf - 16

Specialization of a VLIW ASIP (contd)


Traditionally the datapath is organised as a single register file shared by
all functional units.
Problem: Such a centralised structure does not scale!
We increase the nr. of functional units in order to increase parallelism
→ We have to increase the number of registers in the register file
→ Internal storage and communication between functional units and
registers becomes dominant in terms of area, delay, and power.
High performance VLIW processors are limited not by arithmetic
capacity but by internal bandwidth.


Arch & Platf - 17

Specialization of a VLIW ASIP (contd)


A solution: clustering.
Restrict the connectivity between functional units and registers, so
that each functional unit can read/write from/to a subset of
registers.
Organise the datapath as clusters of functional units and local
register files.
Nothing is for free!!!
Moving data between registers belonging to different clusters takes
much time and power!
You have to drastically minimise the number of such moves by:
- Carefully adapting the structure of clusters to the application.
- Using very clever compilers.
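
The cost of clustering can be illustrated with a small sketch. It counts how many values must cross cluster boundaries for a given assignment of operations to clusters. The dataflow graph, the operation names, and the function `count_intercluster_moves` are illustrative assumptions, not part of the lecture material.

```python
def count_intercluster_moves(edges, cluster_of):
    """Count producer->consumer edges whose endpoints sit in different clusters.

    edges: list of (producer, consumer) operation names.
    cluster_of: dict mapping each operation name to a cluster id.
    Each crossing edge is assumed to need one explicit inter-cluster move.
    """
    return sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])


# Hypothetical dataflow graph: two multiplies feed an add, which feeds a shift.
edges = [("mul1", "add1"), ("mul2", "add1"), ("add1", "shift1")]

# Assignment A scatters the operations over three clusters.
scattered = {"mul1": 1, "mul2": 2, "add1": 3, "shift1": 1}
# Assignment B keeps the whole chain in one cluster.
local = {"mul1": 1, "mul2": 1, "add1": 1, "shift1": 1}

moves_scattered = count_intercluster_moves(edges, scattered)
moves_local = count_intercluster_moves(edges, local)
print(moves_scattered, moves_local)
```

A compiler for a clustered VLIW would try to minimise this count while still keeping the functional units of all clusters busy.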

Arch & Platf - 18

Specialization of a VLIW ASIP (contd)


Instruction set specialisation: nothing special.
Function unit and data path specialisation
- Determine the number of clusters.
- For each cluster determine
- the number and type of functional units;
- the dimension of the register file.
Memory specialisation is extremely important because we need to
stream large amounts of data to the clusters at high rate; one has
to adapt the memory structure to the access characteristics of the
application.
- determine the number and size of memory banks


Arch & Platf - 19

Specialization of a VLIW ASIP (contd)


Interconnect specialization
- Determine the interconnect structure between clusters and
from clusters to memory:
- one or several buses,
- crossbar interconnection
- etc.
Control specialisation:
That is more or less done, as we have decided on a VLIW
processor.


Arch & Platf - 20

Tool Support for Processor Specialisation

Look at the design flow on slide 12!


In order to be able to generate a specialised architecture you need:

Retargetable compiler

Configurable simulator


Arch & Platf - 21

Retargetable Compiler
[Figure: the Retargetable Compiler takes the Algorithm and the Processor Architecture as inputs and produces Object code.]

Arch & Platf - 22

Retargetable Compiler (contd)


An automatically retargetable compiler can be used for a range of
different target architectures.
The actual code optimization and code generation is done by the
compiler, based on a description of the target processor architecture.
This description is formulated in a so-called architecture description
language.
Having a good compiler is not only important for the processor
specialisation process!
Once you have got your specialised ASIP you need a good compiler
in order to efficiently make use of it!


Arch & Platf - 23

Configurable Simulator
Such a simulator can be configured for a particular architecture
(based on an architecture description).

[Figure: the Simulator takes the Processor Architecture and the Object code as inputs and produces Performance numbers.]

In this context, the most important output produced by the simulator
is performance numbers:
- throughput
- delay
- power/energy consumption

Arch & Platf - 24

Application Specific Platforms

Not only processors but also hardware platforms can be specialised
for classes of applications.

The platform will define a certain communication infrastructure
(buses and protocols), certain processor cores, peripherals,
accelerators commonly used in the particular application area, and
basic memory structure.


Arch & Platf - 25

Application Specific Platforms (contd)

[Figure: example platform. Blocks shown: Proc. Core1, Proc. Core2, Proc. Core3, Cache, DMA, Memory, Reconfigurable logic, System bus, Bridge, Peripheral bus, and two Peripherals.]

Arch & Platf - 26

Application Specific Platforms (contd)


Design space exploration for platform definition:

[Figure: exploration loop with the blocks Applications, Platform Architecture, Mapping/Compiling, Simulator, and Performance numbers.]

Arch & Platf - 27

Instantiating a Platform

Once we have an application, the chip on which to implement it will not be
designed as a collection of independently developed blocks, but will
be an instance of an application specific platform.

The hardware platform will be refined by
- determining memory and cache size
- identifying the particular cores, peripherals to be used
- adding specific ASICs, accelerators
- determining the amount of reconfigurable logic (if needed)


Arch & Platf - 28

Instantiating a Platform (contd)

[Figure: platform instantiation loop with the blocks Platform Architecture, Platform Instance, Application, Mapping/Compiling, Simulator, and Performance numbers.]

Arch & Platf - 29

System Platforms
What we have discussed (see previous slides) are so-called
hardware platforms.
The hardware platform is delivered together with a software layer:
hardware platform + software layer = system platform.
Software layer:
- real-time operating system
- device drivers
- network protocol stack
- compilers
The software layer creates an abstraction of the hardware
platform (an application program interface) to be seen by the
application programs.


Arch & Platf - 30

IP-Based Design (Design Reuse)


The key concept in order to increase designers' productivity is reuse.
In order to manage the complexity of current large designs we do not
start from scratch but reuse as much as possible from previous
designs, or use commercially available pre-designed IP blocks.
IP: intellectual property.
Some people call this IP-based design, core-based design, reuse
techniques, etc.:
Core-based design is the process of composing a new system
design by reusing existing components.


Arch & Platf - 31

IP-Based Design (contd)


What are the blocks (cores) we reuse?
interfaces, encoders/decoders, filters, memories, timers,
microcontroller-cores, DSP-cores, RISC-cores, GP processor-cores.

Possible(!) definition
A core is a design block which is larger than a typical RTL
component.

Of course:
We also reuse software components!


Arch & Platf - 32

IP-Based Design (contd)

[Figure: a system assembled from Core 1, Core 2, Core 3, and Core 4 (processor), with glue logic around an interconnection bus/switch, an Interface, and I/O. The cores are taken from the libraries of Vendor A, Vendor B, and Vendor C.]

What we have designed here can be:
- An application specific SOC
- A platform to be further instantiated for a particular application.

Arch & Platf - 33

Types of Cores
Hard cores: are fully designed, placed, and routed by the supplier.
A completely validated layout with definite timing.
- rapid integration
- low flexibility

Firm cores: technology-mapped gate-level netlists.
- less predictability
- flexibility during place and route

Arch & Platf - 34

Types of Cores (contd)


Soft cores: synthesizable RTL or behavioral descriptions.
- much work with integration and verification
- maximal flexibility

Flexibility can provide opportunities such as adding application
specific instructions to a processor core by modifying the
behavioral description.


Arch & Platf - 35

Reconfigurable Systems
Programmable Hardware Circuits:
They implement arbitrary combinational or sequential circuits
and can be configured by loading a local memory that determines
the interconnection among logic blocks.
Reconfiguration can be applied an unlimited number of times.

Main applications:
- Software acceleration
- Prototyping


Arch & Platf - 36

Reconfigurable Systems (contd)


Dynamic reconfiguration: spatial and temporal partitioning

[Figure: a Processor and a Memory connected to an FPGA Accelerator that is temporally partitioned: different configurations are loaded at t1, t2, t3, and t4.]

Arch & Platf - 37

Reconfigurable Systems (contd)


System on Chip with dynamically reconfigurable datapath

[Figure: a SoC with CPU, on-chip memory, and a reconfigurable datapath. Design steps shown: C code; profiling & kernel extraction; kernels; Hw/Sw partitioning; datapath synthesis; C code.]

Arch & Platf - 38

Summary
Architecture selection is about making trade-offs along the
dimensions of speed, cost, flexibility, and power consumption.
ASIPs are programmable processors, specialised for a particular
application or for a family of applications.
Specialisation of an ASIP concerns instruction set, function units
and data path, memory system, interconnect, and control.
Two design tools are of great importance in order to perform
processor specialisation: retargetable compiler and configurable
simulator.
Not only processors can be specialised but also platforms. A
Platform is specialised to execute a certain family of applications.
The particular hardware to be used for a given application is a
specialised instantiation of the platform.


Arch & Platf - 39

Summary (contd)

Reuse is a key technique in order to achieve high design
productivity. Cores to be reused can range from interfaces and
decoders to filters and processors.
The three types of cores differ in their flexibility, predictability, and
the effort needed for integration: hard, firm, and soft cores.
Reconfigurable systems can provide good flexibility and, at the
same time, many of the advantages of classical hardware
implementation. They are mainly used for software acceleration
and prototyping.


Low Power/Energy - 1

System-Level Power/Energy Optimization


1. Sources of Power Dissipation
2. Reducing Power Consumption
3. System Level Power Optimization
4. Dynamic Power Management
5. Mapping and Scheduling for Low Energy
6. Real-Time Scheduling with Dynamic Voltage Scaling


Low Power/Energy - 2

Remember the Design Flow


[Figure: design flow. Stages shown: Informal Specification and Constraints; Modeling; System model (Functional Simulation, Formal Verification); Arch. Selection; System architecture; Mapping, Estimation, Scheduling; Mapped and scheduled model (Simulation, Formal Verification); Softw. model and Hardw. model; Softw. Generation and Hardw. Synthesis; Softw. blocks and Hardw. blocks; Simulation and Testing; Prototype; Fabrication. The checks are labelled OK or not OK.]

Low Power/Energy - 3

Why is Power Consumption an Issue?


Portable systems - battery life time!
Systems with a very limited power budget: Mars Pathfinder,
autonomous helicopter, ...
Desktops and servers: high power consumption
- raises temperature and deteriorates performance & reliability
- increases the need for expensive cooling mechanisms
One of the main difficulties with developing high performance
chips is heat extraction.
High power consumption has economical and ecological
consequences.


Low Power/Energy - 4

Sources of Power Dissipation in CMOS Devices


$$P = \underbrace{\tfrac{1}{2}\, C\, V_{DD}^{2}\, f\, N_{SW} + Q_{SC}\, V_{DD}\, f\, N_{SW}}_{\text{dynamic}} + \underbrace{I_{leak}\, V_{DD}}_{\text{static}}$$

Switching power (first term): power required to charge/discharge circuit nodes.
Short-circ. power (second term): dissipation due to short-circuit current.
Leakage power (third term): dissipation due to leakage current.

C = node capacitances
NSW = switching activities (number of gate transitions per clock cycle)
f = frequency of operation
VDD = supply voltage
QSC = charge carried by short circuit current per transition
Ileak = leakage current

Low Power/Energy - 5

Sources of Power Dissipation in CMOS Devices (contd)


CMOS transistor (N-type)

[Figure: N-type CMOS transistor with gate, drain, source, and body terminals; Vbs is applied at the body.]

Vbs = body bias voltage
Vth = threshold voltage

Threshold voltage:
- The minimal voltage required at the gate to turn on the transistor.

Low Power/Energy - 6

Sources of Power Dissipation in CMOS Devices (contd)


CMOS transistor (N-type) and CMOS inverter

[Figure: N-type CMOS transistor (gate, drain, source, body, Vbs) and a CMOS inverter supplied by Vdd with output load capacitance CL.]

Vbs = body bias voltage
Vth = threshold voltage
Vdd = supply voltage
CL = output load capacitance

Dynamic power
- Charging and discharging the output load capacitance
- Momentary short circuits at a gate's output

Low Power/Energy - 7

Sources of Power Dissipation in CMOS Devices (contd)


CMOS transistor (N-type) and CMOS inverter

[Figure: N-type CMOS transistor (gate, drain, source, body, Vbs) and a CMOS inverter supplied by Vdd with output load capacitance CL.]

Vbs = body bias voltage
Vth = threshold voltage
Vdd = supply voltage
CL = output load capacitance

Static power
- Subthreshold leakage conduction: it flows even when the voltage at the
gate is below Vth.
- Junction leakage (drain and source to body)

Low Power/Energy - 8

Sources of Power Dissipation in CMOS Devices (contd)

For a long time:
Leakage power has been considered negligible compared to
dynamic.

Today:
Total dissipation from leakage is approaching the total from
dynamic.

As technology drops below 65nm:
Leakage power is exceeding dynamic.


Low Power/Energy - 9

Sources of Power Dissipation in CMOS Devices (contd)


Leakage power is consumed even if the circuit is idle (standby). The
only way to avoid it is to disconnect the circuit from the power supply.

Short circuit power can be around 10% of total.

Switching power is still the main source of power consumption.

For the rest of the discussion, we consider mainly switching
power. At the end we come back to leakage.


Low Power/Energy - 10

Power and Energy Consumption


$$P = \tfrac{1}{2}\, C\, V_{DD}^{2}\, f\, N_{SW}$$

$$E = P \cdot t = \tfrac{1}{2}\, C\, V_{DD}^{2}\, N_{CY}\, N_{SW}, \qquad t = N_{CY} / f$$

NCY = number of cycles needed for the particular task.
In certain situations we are concerned about power consumption:
- heat dissipation, cooling;
- physical deterioration due to temperature.
Sometimes we want to reduce total energy consumed:
- battery life.
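
The difference between the two quantities can be checked numerically. The sketch below evaluates the switching power and energy formulas for one task at two clock frequencies; all numeric values are hypothetical and chosen only for illustration.

```python
def switching_power(c, v_dd, f, n_sw):
    """P = 1/2 * C * V_DD^2 * f * N_SW, in watts."""
    return 0.5 * c * v_dd ** 2 * f * n_sw


def switching_energy(c, v_dd, n_cy, n_sw):
    """E = 1/2 * C * V_DD^2 * N_CY * N_SW, in joules (no dependence on f)."""
    return 0.5 * c * v_dd ** 2 * n_cy * n_sw


# Hypothetical values, not taken from the lecture.
C = 1e-12      # node capacitance in farads
N_SW = 1000    # gate transitions per clock cycle
N_CY = 10**6   # cycles needed by the task
V = 2.0        # supply voltage in volts

for f in (50e6, 100e6):
    p = switching_power(C, V, f, N_SW)
    t = N_CY / f  # execution time at this frequency
    e = switching_energy(C, V, N_CY, N_SW)
    print(f"f={f:.0e} Hz  P={p:.3e} W  t={t:.3e} s  E={e:.3e} J")
```

Doubling the frequency at constant voltage doubles the power but leaves the energy of the task unchanged, because the task finishes in half the time.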


Low Power/Energy - 11

Reducing Power/Energy Consumption

The main sources:


Reduce supply voltage
Reduce switching activity
Reduce capacitance
Reduce number of cycles


Low Power/Energy - 12

Reducing Power/Energy Consumption (contd)


Circuit level
Ordering of transistors in gate (influences capacitance).
Transistor sizing.
Logic level
Don't-care optimization to reduce switching activity.
Reduce spurious switching activity by balancing the delays of
paths that converge at each gate.
Technology mapping.
State encoding such that switching activity is minimised: if
state s has a large number of transitions to state q, they
should be given uni-distant codes.
Encoding to minimise switching activity in arithmetic units or
on the bus.
Gated clocks: gate the clocks of circuits (registers, gates,
arithmetic units) when they are idle.
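
The state encoding rule can be made concrete with a small sketch. It weights every state transition by the number of state-register bits that flip and compares two encodings. The FSM, its transition counts, and both encodings are hypothetical examples.

```python
def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def switching_cost(transitions, code):
    """Sum over all transitions of (occurrence count) x (bits flipped).

    transitions: dict mapping (from_state, to_state) to how often it occurs.
    code: dict mapping each state name to its binary code (an int).
    """
    return sum(n * hamming(code[s], code[q]) for (s, q), n in transitions.items())


# Hypothetical FSM: s <-> q is the frequent path, r is rarely visited.
transitions = {("s", "q"): 90, ("q", "s"): 90, ("q", "r"): 5, ("r", "s"): 5}

naive = {"s": 0b00, "q": 0b11, "r": 0b01}        # s and q differ in 2 bits
uni_distant = {"s": 0b00, "q": 0b01, "r": 0b11}  # s and q differ in 1 bit

cost_naive = switching_cost(transitions, naive)
cost_uni = switching_cost(transitions, uni_distant)
print(cost_naive, cost_uni)
```

Giving the frequently exchanged states s and q codes at Hamming distance 1 roughly halves the number of bit flips in this example, at the price of a longer distance on the rare transition from r to s.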

Low Power/Energy - 13

Reducing Power/Energy Consumption (contd)

Behavioral level
Schedule and map operations so that number of cycles is
minimised (with increased number of switchings per clock
cycle) → you can run at slower clock rate → you can reduce
supply voltage.
Allocate and share modules so that power consumption is
reduced (for example, by reducing switching activity).


Low Power/Energy - 14

Reducing Power/Energy Consumption (contd)


Architecture level
Specialise instruction set, datapath, register structure to the
particular architecture, with power consumption as an optimization
goal.
- You have on the chip and you switch only those resources
(gates) you really need.
Reduce power consumption on the bus.
- lower switching activity: clever encoding, reduce switching activity on the address bus by exploiting correlations;
- minimise the bus length (capacitance) by optimal module
placement.
- bus segmentation: transform a long heavily loaded global bus
into a partitioned set of local bus segments.
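
One way to exploit the correlation between consecutive addresses is Gray coding, in which successive values differ in exactly one bit. Gray code is chosen here only as a well-known example of such an encoding; the lecture does not name a specific scheme. The sketch counts the bus lines that toggle for a sequential address stream.

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def bus_transitions(words):
    """Total number of bus lines that toggle over a sequence of words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Sequential addresses 0..15, as produced by straight-line instruction fetch.
addresses = list(range(16))

binary_toggles = bus_transitions(addresses)
gray_toggles = bus_transitions([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)
```

For purely sequential accesses the Gray-coded bus toggles one line per access, while plain binary toggles more on every carry; for random address streams the advantage disappears.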

Low Power/Energy - 15

Reducing Power/Energy Consumption (contd)


Optimise the memory structure.
- Memory transfers are extremely power hungry: a memory
transfer takes 33 times more energy than an addition!
Reducing the number of memory accesses is a very efficient
way to save power!
- Adapt the number of caches, their size and associativity, and
the length of the cache line to the application → reduced
number of memory transfers.
- Interesting trade-off: larger caches consume more power but
reduce number of memory transfers → find the right balance!


Low Power/Energy - 16

Reducing Power/Energy Consumption (contd)

Provide instruction support for power management:
- Instructions that allow certain parts of the system to be put
in stand-by or shut down.
- Instructions that allow the supply voltage to be set dynamically
(dynamic voltage scaling).


Low Power/Energy - 17

Reducing Power/Energy Consumption (contd)


System Level
Static techniques are applied at design time.
- Compilation for low power: instruction selection considering
their power profile, data placement in memory, register
allocation.
- Algorithm design: find the algorithm which is the most power-efficient.
- Task mapping and scheduling.
Dynamic techniques are applied at run time.
- These techniques are applied at run-time in order to reduce
power consumption by exploiting idle or low-workload periods.


Low Power/Energy - 18

System Level Power Optimization

Three techniques will be discussed:

1. Dynamic power management: a dynamic technique.

2. Task mapping: a static technique.

3. Task scheduling with dynamic power scaling: static & dynamic.


Low Power/Energy - 19

Dynamic Power Management (DPM)


Decisions:
- application
- power aware OS
- hardware

Switching among multiple power states:
- idle
- sleep
- run
Switching among multiple frequencies and voltage levels.

Goal:
- Energy optimization
- QoS constraints satisfied


Low Power/Energy - 20

Dynamic Power Management (contd)


Hardware Support (e.g. Intel XScale Processor)
RUN: operational.
IDLE: clocks to the CPU are disabled; recovery is through interrupt.
SLEEP: mainly powered off; recovery through wake-up event.
Other intermediate states: DEEP IDLE, STANDBY, DEEP SLEEP.

[Figure: power state diagram. RUN operating points: 0.75 V, 60 mW, 150 MHz; 1.3 V, 450 mW, 600 MHz; 1.6 V, 900 mW, 800 MHz. IDLE: 40 mW. SLEEP: 160 µW. Transition times shown: 160 µs, 10 µs, 10 µs, 1.5 ms, 140 ms, 90 µs.]

Low Power/Energy - 21

Dynamic Power Management (contd)


DPM techniques are used in laptops, personal digital assistants
(PDAs), and other portable appliances in order to shut down or
place in stand-by unused devices.
The goal is power saving.
DPM techniques are implemented in the operating system
(including Windows 2000 running on laptops).
The power breakdown for a laptop computer:
- 36% of total power consumed by the display
- 18% by hard-disk
- 18% by wireless LAN interface
- 7% by keyboard, mouse, etc.
- 21% by digital VLSI circuits.
(Don't forget these!)

Low Power/Energy - 22

The Basic Concept of DPM


When there are requests for a device, the device is busy;
otherwise it is idle.
When the device is idle, it can be shut down to enter a low-power
sleeping state.

[Figure: timeline with time points T1 to T4 showing the workload (requests), the device state (Busy, Idle, Busy), and the power state (Working, shutdown delay Tsd, Sleeping, wake-up delay Twu, Working).]

Low Power/Energy - 23

The Basic Concept of DPM (contd)


Changing the power state takes time (several seconds) and extra energy.
Tsd : shutdown delay
Twu : wake-up delay

Send the device to sleep only if the saved energy justifies the overhead!
The main problems:
Don't shut down such that delays occur too frequently.
Don't shut down such that the savings due to the sleeping are
smaller than the energy overhead of the state changes.
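
The condition "saved energy justifies the overhead" can be expressed as a break-even time. The sketch below uses a simple energy model that is an assumption of this example, not taken from the slides: staying awake costs `p_idle` for the whole idle period, while sleeping costs a fixed transition energy plus `p_sleep` for the remaining time. All parameter names and values are hypothetical.

```python
def break_even_time(p_idle, p_sleep, t_sd, t_wu, e_transition):
    """Shortest idle period for which sleeping saves energy.

    Assumed model: staying awake costs p_idle * T; sleeping costs
    e_transition for shut-down and wake-up together, plus p_sleep
    for the rest of the idle period.
    """
    t_transition = t_sd + t_wu
    t_energy = (e_transition - p_sleep * t_transition) / (p_idle - p_sleep)
    # The idle period must also be long enough to fit both transitions.
    return max(t_transition, t_energy)


def should_sleep(idle_length, **device):
    """True if an idle period of the given length justifies a shut-down."""
    return idle_length > break_even_time(**device)


# Hypothetical device parameters (watts, seconds, joules).
disk = dict(p_idle=1.0, p_sleep=0.1, t_sd=0.5, t_wu=1.5, e_transition=5.0)

t_be = break_even_time(**disk)
print(f"break-even time: {t_be:.2f} s")
print(should_sleep(10.0, **disk), should_sleep(3.0, **disk))
```

In practice the length of the idle period is not known in advance, which is why a prediction is needed before the shut-down decision is taken.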


Low Power/Energy - 24

Power Management Policies


Power management policies are concerned with predictions
related to idle periods:
- For shut-down: try to predict how long the idle period will be in
order to decide if a shut-down should be performed.
- For wake-up: try to predict when the idle period ends, in order
to avoid user delays due to Twu.
It is quite difficult, and often the wake-up is started simply
when a request has arrived.
Typical Policies:
1. Time-out
2. Predictive
3. Stochastic


Low Power/Energy - 25

Time-out Policy
It is assumed that, after a device is idle for a time-out period (the interval T1 - T2
on slide 22), it will stay idle for at least a period which makes it efficient
to shut down.
Drawback: you waste energy during the time-out period (compared to
instantaneous shut-down).
Policies:
- Fixed time-out period: you set the value of the time-out, which then stays constant.
- Adjusted at run-time: increase or decrease the time-out, depending on the
length of previous idle periods.
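
Both variants can be sketched on a recorded trace of idle periods. The adjustment rule used below (lengthen the time-out after a shut-down that did not pay off, shorten it after one that did) is one plausible choice and an assumption of this example; the trace, the step size, and the bounds are hypothetical.

```python
def fixed_timeout(idle_periods, timeout):
    """Return (number of shut-downs, idle time spent powered on)."""
    shutdowns = 0
    powered_idle = 0.0
    for idle in idle_periods:
        if idle > timeout:
            shutdowns += 1
            powered_idle += timeout  # energy wasted while waiting
        else:
            powered_idle += idle     # device never shut down
    return shutdowns, powered_idle


def adaptive_timeout(idle_periods, timeout, break_even, step=0.5, lo=0.5, hi=10.0):
    """Return the time-out value that was in force for each idle period."""
    history = []
    for idle in idle_periods:
        history.append(timeout)
        if idle > timeout:
            slept = idle - timeout
            if slept < break_even:   # shut-down did not pay off: be cautious
                timeout = min(hi, timeout + step)
            else:                    # shut-down paid off: be more aggressive
                timeout = max(lo, timeout - step)
    return history


trace = [1.0, 8.0, 0.5, 12.0, 3.0]  # hypothetical idle periods in seconds
print(fixed_timeout(trace, timeout=2.0))
print(adaptive_timeout(trace, timeout=2.0, break_even=4.0))
```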


Low Power/Energy - 26

Predictive Policy
The length of an idle period is predicted. If the prediction is for an idle
period long enough, the shut-down is performed immediately (no time
interval T1 - T2 on slide 22).
Policy
- L-shaped distribution for $\frac{\text{Idle Period}}{\text{Previous Busy Period}}$: shut down after a short busy period!
- Short busy periods are followed by long idle periods.
- Busy periods longer than a threshold are followed by short idle periods.

[Figure: plot of Idle Period versus Busy Period showing the L-shaped distribution.]
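
A minimal sketch of this rule follows, assuming a recorded trace of busy periods and of the idle periods that followed them. The threshold, the break-even time, and the trace are hypothetical; the function only checks how often the prediction "short busy period, hence long idle period" would have been right.

```python
def predict_shutdown(busy, busy_threshold):
    """Shut down immediately if the preceding busy period was short."""
    return busy < busy_threshold


def count_correct(busy_periods, idle_periods, busy_threshold, break_even):
    """Count decisions that match what hindsight would have chosen.

    A shut-down is considered worthwhile if the idle period that
    followed was longer than the break-even time.
    """
    correct = 0
    for busy, idle in zip(busy_periods, idle_periods):
        decision = predict_shutdown(busy, busy_threshold)
        worthwhile = idle > break_even
        correct += int(decision == worthwhile)
    return correct


# Hypothetical L-shaped trace: short busy periods precede long idle periods.
busy_trace = [0.2, 0.3, 5.0, 0.1, 6.0]
idle_trace = [9.0, 7.5, 0.4, 8.0, 0.2]

hits = count_correct(busy_trace, idle_trace, busy_threshold=1.0, break_even=2.0)
print(f"{hits} of {len(busy_trace)} decisions correct")
```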


Low Power/Energy - 27

Stochastic Policy
Predictions are based on Markov models: requests and power state
transitions of the device are modelled as probabilistic state machines.
The power manager observes the arriving requests, the request
queue, and the device, and generates shutdown commands.

[Figure: the power manager observes (obs.) the Markov model of the request generator (environment or user: generates requests), the request queue, and the Markov model of the device (provides service), and sends commands to the device.]

Low Power/Energy - 28

Mapping and Scheduling for Low Energy


For many embedded systems DPM techniques, like those presented
before, cannot be applied:
They have no devices like hard-disk, no (or small) display
→ VLSI is a main source of power dissipation.
They have time constraints → we have to keep deadlines
(usually we cannot afford shut-down and wake-up times).
The operating system is small → no sophisticated techniques at
run-time.
The application is known at design time → we know a lot about
the application already at design time.

Static techniques can be used (applied at design time).
Mapping and scheduling for low energy are important!


Low Power/Energy - 29

Mapping for Low Energy


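[Figure: task graph with tasks 1 to 8, to be mapped on two processors, p3 and p4, connected by a bus.]
[Table: WCET and Energy of each task on p3 and on p4. Only part of the entries survive, without row or column positions: 10, 10, 11, 17, 21, 15, 10, 10, 14, 15, 19, 14.]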

Low Power/Energy - 30

Mapping for Low Energy (contd)


Consider a mapping:
p3: 1, 3, 6, 7, 8.
p4: 2, 4, 5.
Execution time: 52; Energy consumed: 75.

Communication times and energy:
C1-2: t = 1; E = 3.
C3-5: t = 2; E = 5.
C4-8: t = 1; E = 3.
C5-7: t = 1; E = 3.

[Figure: schedule of the tasks on p3 and p4 and of the messages C1-2, C3-5, C5-7, C4-8 on the bus; time axis from 0 to 64.]

Low Power/Energy - 31

Mapping for Low Energy (contd)


Consider a mapping:
p3: 1, 3, 6, 7.
p4: 2, 4, 5, 8.
Execution time: 57; Energy consumed: 70.

Communication times and energy:
C1-2: t = 1; E = 3.
C3-5: t = 2; E = 5.
C7-8: t = 1; E = 3.
C5-7: t = 1; E = 3.

[Figure: schedule of the tasks on p3 and p4 and of the messages C1-2, C3-5, C5-7, C7-8 on the bus; time axis from 0 to 64.]

Low Power/Energy - 32

Mapping for Low Energy (contd)

The second mapping with 8 on p4 consumes less energy;

Assume that we have a maximum allowed delay = 60.

This second mapping is preferable, even if it is slower!
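
The selection rule applied here (among the mappings that meet the deadline, take the one with the lowest energy) can be sketched as follows. The execution times and energies are the ones given for the two mappings; the function name `pick_mapping` and the data layout are assumptions of this example.

```python
def pick_mapping(mappings, max_delay):
    """Return the lowest-energy mapping that meets the deadline, or None."""
    feasible = [m for m in mappings if m["time"] <= max_delay]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["energy"])


mappings = [
    {"name": "task 8 on p3", "time": 52, "energy": 75},
    {"name": "task 8 on p4", "time": 57, "energy": 70},
]

best = pick_mapping(mappings, max_delay=60)
print(best["name"])
```

With a tighter deadline of 55 only the first mapping remains feasible, so the faster but more energy-hungry solution has to be accepted.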


Low Power/Energy - 33

Real-Time Scheduling with Dynamic Voltage Scaling


The energy consumed by a task, due to switching power (slide 6):

$$E = \frac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$

- $N_{SW}$ = number of gate transitions per clock cycle.
- $N_{CY}$ = number of cycles needed for the task.

Reducing supply voltage $V_{DD}$ is the most efficient way to reduce energy consumption.

The frequency at which the processor can be operated depends on $V_{DD}$:

$$f = k\, \frac{(V_{DD} - V_t)^2}{V_{DD}}$$

$k$: circuit dependent constant; $V_t$: threshold voltage.

The execution time of the task:

$$t_{exe} = N_{CY}\, \frac{V_{DD}}{k\, (V_{DD} - V_t)^2}$$
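
A minimal Python sketch of these three relations may help to see the trade-off. All numeric values (`K`, `V_T`, `C`, `N_CY`, `N_SW`) are hypothetical placeholders chosen only for illustration; they do not come from the slides.

```python
def frequency(v_dd, k, v_t):
    """Operating frequency: f = k * (V_DD - V_t)^2 / V_DD."""
    if v_dd <= v_t:
        raise ValueError("V_DD must exceed the threshold voltage")
    return k * (v_dd - v_t) ** 2 / v_dd


def exec_time(n_cy, v_dd, k, v_t):
    """Execution time: t_exe = N_CY / f."""
    return n_cy / frequency(v_dd, k, v_t)


def switching_energy(c, v_dd, n_cy, n_sw):
    """Switching energy: E = 1/2 * C * V_DD^2 * N_CY * N_SW."""
    return 0.5 * c * v_dd ** 2 * n_cy * n_sw


# Hypothetical parameters, for illustration only.
K = 1.0e8      # circuit dependent constant
V_T = 0.8      # threshold voltage (V)
C = 1.0e-12    # switched capacitance (F)
N_CY = 1.0e6   # cycles needed by the task
N_SW = 1000    # gate transitions per clock cycle

for v in (5.0, 4.0, 2.5):
    e = switching_energy(C, v, N_CY, N_SW)
    t = exec_time(N_CY, v, K, V_T)
    print(f"V_DD = {v} V: E = {e:.4e} J, t_exe = {t:.4e} s")
```

Halving the supply voltage cuts the switching energy to a quarter, while the execution time grows; this is the tension that the scheduling problem has to resolve.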


Real-Time Scheduling with Dynamic Voltage Scaling (contd)


The scheduling problem:
Which task to execute at a certain moment on a certain processor so
that time constraints are fulfilled?

The scheduling problem with voltage scaling:


Which task to execute at a certain moment on a certain processor, and
at which voltage level, so that time constraints are fulfilled and energy
consumption is minimised?

The problem: reducing supply voltage extends execution time!


Variable Voltage Processors

Several supply voltage levels are available.

Supply voltage can be set by the application (operating system) through execution of particular instructions.

Frequency is automatically adjusted to the current supply voltage.

Several processors with variable voltage levels are already available. There will be more and more in the near future.


The Basic Principle


We consider a single task τ:
- total computation: 10^9 execution cycles.
- deadline: 25 seconds.
- processor nominal (maximum) voltage: 5 V.
- energy: 40 nJ/cycle at nominal voltage.
- processor speed: 50 MHz (50×10^6 cycles/sec) at nominal voltage.

[Figure: V² versus time (sec). The task executes its 10^9 cycles at V² = 5² and finishes at texe = 20 sec, leaving slack until the deadline at 25 sec; Etotal = 40 J.]


The Basic Principle (contd)


Let's make it slower!
VDD = 2.5 V:
- energy: 40×2.5²/5² = 10 nJ/cycle.
- speed: 50×2.5/5 = 25 MHz.

[Figure: V² versus time (sec). 750×10^6 cycles execute at V² = 5² and 250×10^6 cycles at V² = 2.5²; texe = 25 sec, Etotal = 32.5 J.]


The Basic Principle (contd)


VDD = 4 V:
- energy: 40×4²/5² ≈ 25 nJ/cycle (25.6 before rounding).
- speed: 50×4/5 = 40 MHz.

[Figure: V² versus time (sec). All 10^9 cycles execute at V² = 4²; texe = 25 sec, Etotal = 25 J.]
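
The three variants of this example can be recomputed with a short Python sketch. It assumes, as the example does, that energy per cycle scales with V² and that frequency scales linearly with V; the helper names are illustrative.

```python
V_NOM = 5.0    # nominal supply voltage (V)
E_NOM = 40e-9  # energy per cycle at nominal voltage (J)
F_NOM = 50e6   # clock frequency at nominal voltage (Hz)


def energy_per_cycle(v):
    """Energy per cycle, scaled quadratically with the supply voltage."""
    return E_NOM * (v / V_NOM) ** 2


def frequency(v):
    """Clock frequency, scaled linearly with the supply voltage (simplification)."""
    return F_NOM * v / V_NOM


def run(segments):
    """segments: list of (cycles, voltage) pairs. Returns (time in s, energy in J)."""
    time_s = sum(n / frequency(v) for n, v in segments)
    energy_j = sum(n * energy_per_cycle(v) for n, v in segments)
    return time_s, energy_j


full_speed = run([(1_000_000_000, 5.0)])
two_levels = run([(750_000_000, 5.0), (250_000_000, 2.5)])
single_4v = run([(1_000_000_000, 4.0)])

for label, (t, e) in [("5 V", full_speed), ("5 V then 2.5 V", two_levels), ("4 V", single_4v)]:
    print(f"{label}: t = {t:.2f} s, E = {e:.2f} J")
```

Run at 5 V the task takes 20 s and 40 J; the two-level variant takes 25 s and 32.5 J; the single 4 V setting also takes 25 s but only about 25.6 J, which the example rounds to 25 J.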


The Basic Principle (contd)


If a processor uses a single supply voltage and completes a program just on deadline, the energy consumption is minimised.

Consider two tasks τ1, τ2:
Computation:
- τ1: 250×10^6 execution cycles; τ2: 750×10^6 execution cycles.
Deadline: 25 seconds.
Processor nominal (maximum) voltage: 5 V.
Energy:
- 40 nJ/cycle at nominal voltage.
- 25 nJ/cycle at VDD = 4 V.
Processor speed:
- 50 MHz (50×10^6 cycles/sec) at nominal voltage.
- 40 MHz at VDD = 4 V.

[Figure: task graph with the two tasks τ1 and τ2.]

The Basic Principle (contd)


Find the voltage so that the tasks just meet their deadline: you have then minimised energy consumption!

[Figure: V² versus time (sec), from 0 to 25. τ1 (250×10^6 cycles) and τ2 (750×10^6 cycles) both execute at V² = 4² and finish just on the deadline; Etotal = 25 J.]
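
Under the linear frequency model used in this example (frequency proportional to supply voltage, which is an assumption of the sketch), the voltage that makes the tasks finish exactly on the deadline can be computed directly. The function name is illustrative.

```python
def just_in_time_voltage(total_cycles, deadline_s, v_nom, f_nom):
    """Lowest single voltage that still meets the deadline, assuming f ~ V."""
    required_f = total_cycles / deadline_s
    if required_f > f_nom:
        raise ValueError("deadline cannot be met even at nominal voltage")
    return v_nom * required_f / f_nom


cycles_t1 = 250_000_000
cycles_t2 = 750_000_000

v = just_in_time_voltage(cycles_t1 + cycles_t2, deadline_s=25.0, v_nom=5.0, f_nom=50e6)
f = 50e6 * v / 5.0            # frequency at the chosen voltage
finish_t1 = cycles_t1 / f     # completion time of the first task
finish_t2 = finish_t1 + cycles_t2 / f
print(f"V = {v} V, task 1 ends at {finish_t1} s, task 2 ends at {finish_t2} s")
```

For the two tasks of the example this gives 4 V, with the second task ending exactly at 25 s.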


Considering Task Particularities


Energy consumed by a task:

$$E = \frac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$

- $N_{SW}$ = number of gate transitions per clock cycle.
- $C$ = switched capacitance per clock cycle.

Average energy consumed by a task per cycle:

$$E_{CY} = \frac{1}{2}\, C\, V_{DD}^2\, N_{SW}$$

Often tasks differ from each other in terms of executed operations, so $N_{SW}$ and $C$ differ from one task to the other.
The average energy consumed per cycle therefore differs from task to task.


Considering Task Particularities (contd)


Consider two tasks τ1, τ2:
Computation:
- τ1: 250×10^6 execution cycles; τ2: 750×10^6 execution cycles.
Deadline: 25 seconds.
Processor nominal (maximum) voltage: 5 V.
Processor speed:
- 50 MHz (50×10^6 cycles/sec) at nominal voltage.
- 40 MHz at VDD = 4 V.
- 25 MHz at VDD = 2.5 V.

Energy τ1:
- 50 nJ/cycle at VDD = 5 V.
- 32 nJ/cycle at VDD = 4 V.
- 12.5 nJ/cycle at VDD = 2.5 V.

Energy τ2:
- 12.5 nJ/cycle at VDD = 5 V.
- 8 nJ/cycle at VDD = 4 V.
- 3 nJ/cycle at VDD = 2.5 V.

[Figure: task graph with the two tasks τ1 and τ2.]


Considering Task Particularities (contd)


Here we have a solution with VDD = 4 V, and the deadline just fulfilled:
Etotal = 32 nJ/cycle × 250×10^6 cycles + 8 nJ/cycle × 750×10^6 cycles = 14 J.

[Figure: V² versus time (sec), from 0 to 25. τ1 (250×10^6 cycles) and τ2 (750×10^6 cycles) both execute at V² = 4²; Etotal = 14 J.]


Considering Task Particularities (contd)


Here we run τ1 at VDD = 2.5 V, and τ2 at VDD = 5 V; the tasks finish just on deadline.
Etotal = 12.5 nJ/cycle × 250×10^6 cycles + 12.5 nJ/cycle × 750×10^6 cycles = 12.5 J.

[Figure: V² versus time (sec), from 0 to 25. τ1 (250×10^6 cycles) executes at V² = 2.5² and τ2 (750×10^6 cycles) at V² = 5²; Etotal = 12.5 J.]
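
The comparison between the two solutions can be reproduced by enumerating every per-task voltage choice, as in the Python sketch below. It uses the speeds and per-cycle energies listed for τ1 and τ2 (named `tau1` and `tau2` in the code) and a brute-force search, which is only practical for such a small example; the data layout and function names are illustrative.

```python
from itertools import product

# Clock frequency (Hz) at each available supply voltage (V).
FREQ_HZ = {5.0: 50e6, 4.0: 40e6, 2.5: 25e6}

# Cycle counts and energy per cycle (nJ) of each task at each voltage.
TASKS = {
    "tau1": {"cycles": 250_000_000, "nj": {5.0: 50.0, 4.0: 32.0, 2.5: 12.5}},
    "tau2": {"cycles": 750_000_000, "nj": {5.0: 12.5, 4.0: 8.0, 2.5: 3.0}},
}
DEADLINE_S = 25.0


def evaluate(assignment):
    """assignment maps task name to voltage. Returns (time in s, energy in J)."""
    time_s = sum(TASKS[t]["cycles"] / FREQ_HZ[v] for t, v in assignment.items())
    energy_j = sum(
        TASKS[t]["cycles"] * TASKS[t]["nj"][v] * 1e-9 for t, v in assignment.items()
    )
    return time_s, energy_j


def best_assignment():
    """Exhaustively search for the lowest-energy assignment that meets the deadline."""
    best = None
    names = list(TASKS)
    for volts in product(FREQ_HZ, repeat=len(names)):
        assignment = dict(zip(names, volts))
        time_s, energy_j = evaluate(assignment)
        if time_s <= DEADLINE_S + 1e-9 and (best is None or energy_j < best[1]):
            best = (assignment, energy_j, time_s)
    return best


same_voltage = evaluate({"tau1": 4.0, "tau2": 4.0})
choice, energy, finish = best_assignment()
print("both at 4 V:", same_voltage)
print("best:", choice, energy, finish)
```

Both tasks at 4 V give 14 J; the search selects 2.5 V for the task with the larger energy per cycle and 5 V for the other, for 12.5 J with the same 25 s finish time.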


Considering Task Particularities (contd)

If energy consumption per cycle is not constant (but differs from task to task), the rule that a single supply voltage, with the program finishing just on the deadline, minimises energy does not hold any more.
Voltage levels have to be reduced with priority for those tasks which have a larger energy consumption per cycle.

One particular voltage level has to be established for each task, so that deadlines are just satisfied.


Discrete Voltage Levels


Practical microprocessors can work only at a finite number of discrete voltage levels.
The ideal voltage Videal, determined for a certain task, may therefore not be available.

A task is supposed to run for time texe at the voltage Videal.

On the particular processor, the two closest available neighbours of Videal are V1 and V2, with V1 < Videal < V2.
You have minimised the energy if you run the task for time t1 at voltage V1 and for time t2 at voltage V2, so that t1 + t2 = texe.
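
The split into t1 and t2 is not spelled out here. A minimal sketch, assuming that the frequencies f1 and f2 at V1 and V2 are known, that switching between levels costs nothing, and that all cycles of the task must complete within texe, solves t1 + t2 = texe together with f1·t1 + f2·t2 = N_CY. The function name is illustrative.

```python
def split_between_levels(n_cycles, t_exe, f1, f2):
    """Return (t1, t2) with t1 + t2 = t_exe and f1*t1 + f2*t2 = n_cycles.

    f1 is the frequency at the lower voltage V1, f2 at the higher voltage V2.
    """
    if not f1 < f2:
        raise ValueError("f1 must be lower than f2")
    if not f1 * t_exe <= n_cycles <= f2 * t_exe:
        raise ValueError("required speed is not between the two levels")
    t2 = (n_cycles - f1 * t_exe) / (f2 - f1)
    return t_exe - t2, t2


# Single-task example from earlier: 10^9 cycles in 25 s would need 40 MHz (4 V).
# Suppose only 2.5 V (25 MHz) and 5 V (50 MHz) are available.
t1, t2 = split_between_levels(1_000_000_000, 25.0, f1=25e6, f2=50e6)
cycles_low = 25e6 * t1
cycles_high = 50e6 * t2
print(t1, t2, cycles_low, cycles_high)
```

For these numbers the task runs 10 s at 2.5 V and 15 s at 5 V, that is 250×10^6 and 750×10^6 cycles, which matches the two-level schedule of the single-task example.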


Scheduling Policies

The techniques described here for finding optimal voltage levels for real-time tasks can be applied with both:
- Static cyclic scheduling
- Priority-based scheduling


The Pitfalls with Ignoring Leakage


$$E = N_C\, C_{eff}\, V_{dd}^2 + L_g \left( V_{dd}\, K_3\, e^{K_4 V_{dd}}\, e^{K_5 V_{bs}} + |V_{bs}|\, I_{ju} \right) t$$

Minimise this (the first, dynamic term) and ignore the rest!


The Pitfalls with Ignoring Leakage

$$E = N_C\, C_{eff}\, V_{dd}^2 + L_g \left( V_{dd}\, K_3\, e^{K_4 V_{dd}}\, e^{K_5 V_{bs}} + |V_{bs}|\, I_{ju} \right) t$$

- First term: dynamic energy decreases with Vdd regardless of the increased time.
- Second term: leakage decreases with Vdd, but grows with time!

1. We don't optimize global energy but only a part of it!
2. We can even get it very wrong and increase energy consumption!


$$E = N_C\, C_{eff}\, V_{dd}^2 + L_g \left( V_{dd}\, K_3\, e^{K_4 V_{dd}}\, e^{K_5 V_{bs}} + |V_{bs}|\, I_{ju} \right) t$$

[Figure: energy per cycle (0 to 8e-10) versus Vdd (0.5 to 1), 70nm technology, Crusoe processor; source: Jejurikar et al., DAC'04. Curves: dynamic energy, leakage energy, and dynamic + leakage.]

Critical point! If you reduce Vdd below this point, the total energy per cycle grows.
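
The existence of a critical point can be illustrated numerically. The Python sketch below evaluates the dynamic and leakage terms of the energy formula per cycle and searches a voltage grid for the minimum of their sum. Every constant is an illustrative placeholder chosen only to give curves of the right qualitative shape, and the frequency model f = k (Vdd - Vt)² / Vdd is borrowed from the dynamic voltage scaling discussion; neither should be read as data for a real processor.

```python
import math

# Illustrative placeholder constants (assumptions, not taken from the slides).
C_EFF = 0.43e-9   # effective switched capacitance (F)
L_G = 4.0e6       # L_g, assumed here to be a gate count
K3, K4, K5 = 5.38e-7, 1.83, 4.19
V_BS = -0.7       # body bias voltage (V)
I_JU = 4.8e-10    # junction leakage current (A)
V_T = 0.3         # threshold voltage in the frequency model (V)
K_F = 6.12e9      # frequency constant, about 3 GHz at Vdd = 1 V


def frequency(v_dd):
    """Assumed frequency model: f = k * (Vdd - Vt)^2 / Vdd."""
    return K_F * (v_dd - V_T) ** 2 / v_dd


def dynamic_energy_per_cycle(v_dd):
    return C_EFF * v_dd ** 2


def leakage_energy_per_cycle(v_dd):
    """Leakage power multiplied by the duration of one clock cycle."""
    power = L_G * (
        v_dd * K3 * math.exp(K4 * v_dd) * math.exp(K5 * V_BS) + abs(V_BS) * I_JU
    )
    return power / frequency(v_dd)


def total_energy_per_cycle(v_dd):
    return dynamic_energy_per_cycle(v_dd) + leakage_energy_per_cycle(v_dd)


def critical_voltage(v_min=0.5, v_max=1.0, steps=500):
    """Grid search for the Vdd with the lowest total energy per cycle."""
    grid = [v_min + i * (v_max - v_min) / steps for i in range(steps + 1)]
    return min(grid, key=total_energy_per_cycle)


v_crit = critical_voltage()
print(f"critical Vdd = {v_crit:.3f} V")
print(f"total energy per cycle there = {total_energy_per_cycle(v_crit):.3e} J")
print(f"total energy per cycle at 0.5 V = {total_energy_per_cycle(0.5):.3e} J")
```

With these placeholder values the minimum lies inside the voltage range: below it the dynamic energy keeps falling, but the longer cycle time makes leakage energy, and with it the total, rise again.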


Summary

Power consumption becomes a central issue for embedded systems design.
Power/energy consumption can be reduced by reducing supply voltage, switching activity, switched capacitance, and the number of executed cycles.
There are means at all levels of the design to reduce power consumption: circuit, logic, behavioral, architecture, system level.
At system level we distinguish dynamic techniques (applied during run-time) and static techniques (applied at design time).


Summary (contd)
Dynamic power management is implemented by the operating system, and is mainly used in portable appliances to shut down unused devices or place them in stand-by.
Typical policies for power management are: time-out, predictive, and stochastic.
Both at task mapping and at scheduling, design decisions can be made that have a huge impact on power/energy consumption.
Real-time scheduling in the context of processors with voltage scaling is extremely interesting. The main trade-off is voltage level vs. execution time. One has to find the optimal voltage levels such that energy consumption is reduced and deadlines are still fulfilled.
