Embedded System Architecture by Ralf Niemann

2007-02-06
Embedded Tutorial
Hardware/Software Codesign
of Embedded Systems
Petru Eles and Zebo Peng

Embedded Systems Laboratory (ESLAB)
Linköping University
Lecture Contents
- Introduction and basic issues.
- Architectures and platforms.
- Analysis, co-simulation, and design space exploration.
- System-level power/energy optimization.
Prof. Z. Peng, ESLAB/LiTH
Introduction
- Codesign of embedded systems
- The design flows
- Definition and motivation
- System level design issues
Traditional Design Flow

Informal System Specification
Early, Manual Partitioning
SW Specification
HW Specification
Programming
HW Design
SW Simulation
HW Simulation
SW Implementation
HW Implementation
Integration and System Test

Design Time
[Figure: design-time comparison. Traditional design: specification and partitioning, then HW design and simulation, then SW design and simulation, then integration and test. HW/SW codesign: specification and partitioning, then HW design and SW design in parallel with co-simulation and co-verification, then integration and test, giving a reduced time to market (TTM).]
HW/SW Codesign
The concurrent design of hardware and software elements, supporting explicit
hardware/software trade-offs.
- Co-specification: creating a common specification that describes both
  hardware and software elements.
- Co-synthesis: concurrently synthesizing the hardware and software
  implementations as well as their interfaces.
- Co-simulation and co-verification: simultaneously simulating and verifying
  the hardware and software elements.
Why Codesign?
- Reduce time-to-market.
- Achieve better designs:
  - More design alternatives can be explored.
  - Better solutions can be found by advanced optimization techniques.
- Meet strict design constraints, such as:
  - Timing or performance constraints.
  - Power dissipation.
  - Physical constraints, e.g., size, weight, etc.
  - Safety and reliability constraints.
  - Cost constraints.
- Codesign is also made possible by the advances in design methodologies and tools.
Vertical Codesign
- Instruction set processor design, for both general-purpose systems and ASIPs (Application Specific Instruction set Processors).
- To determine how big a hardware engine you need to run your application and meet its constraints.
[Figure: layered stack with Specification on top, followed by Software, Instruction set, and Hardware.]
Codesign of Processors
- General-Purpose Processors
  - Architectural support for operating systems.
  - Cache design and tuning (e.g., selection of cache size and control schemes).
  - Pipeline control design (control mechanisms, compiler design).
- ASIPs
  - Customization of instruction sets and specific resources (e.g., accelerator and coprocessor).
  - Design of register files, busses and interconnections.
  - Development of a specific compiler.
Horizontal Codesign
- Some of the system functionality is implemented in software running on programmable CPUs, while other functions are implemented in hardware.
- Typical for the design of embedded systems.
[Figure: a specification is codesigned onto a programmable processor, a specialized processor, and hardware (ASICs).]
What is an Embedded System?

There are many different definitions!
- A special-purpose computer system that is used for a particular task.
- A computer-based system embedded in a real-life machine. Though computer based, it does not have the usual keyboard and monitor. The processor and related circuitry are configured to do a specific task.
Some highlight what it is (not) used for:
- Any device which includes a programmable component but itself is not intended to be a general purpose computer.
Some focus on what it is built from:
- A collection of programmable parts surrounded by ASICs and other standard components, that interact continuously with an environment through sensors and actuators.
Characteristics of an Embedded System

- Dedicated (not general purpose).
  - One or several applications known at design-time.
- Contains a programmable component.
  - But usually not programmable by the end-user.
- Interacts (continuously) with the environment:
  - Real-time behavior.
  - Predictable.
  - Safe and reliable.
  - Run-time environment is fixed (faster ? better).
- Usually very cost sensitive:
  - Mass-market products in highly competitive markets have to be shipped at a low cost.
- Low power is often preferred.
Embedded Systems
[Figure: microprocessor market shares in 1999, split between general purpose systems and embedded systems; the two shares shown are 99% and 1%.]
Embedded Controllers
[Figure: embedded controller. A CPU, memory, timers, a HW unit with application-specific logic, and A/D and D/A conversion, connected to the environment through sensors and actuators.]
Reactive systems:
- The system never stops.
- The system responds to signals produced by the environment.
Distributed Embedded Systems

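[Figure: distributed embedded system. Several ECUs are connected by networks joined through gateways; an ECU contains a CPU, RAM, ROM, an ASIC, an I/O interface to sensors and actuators, and a network interface.]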
Time and Power Constraints

- Time constraints:
  - They have to perform in real-time: if data are not ready by a certain deadline, the system fails to perform correctly.
  - Hard deadline: failure to meet it leads to major hazards.
  - Soft deadline: failure to meet it can be tolerated, but quality of service is reduced.
- Power constraints:
  - There are several reasons why low power/energy consumption is required.
  - Battery life: high energy consumption leads to a short battery lifetime.
  - Cost aspects: high power consumption requires a strong power supply and an expensive cooling system.
Safety Critical Requirements

- Embedded systems are often used in life-critical applications.
  - Avionics, automotive electronics, nuclear plants, medical applications, military applications, etc.
- Reliability and safety are major requirements.
- To guarantee correctness during design:
  - Formal verification: mathematics-based methods to verify certain properties of the designed system.
  - Automatic synthesis: certain design steps are automatically performed by design tools, giving correctness by construction.
Short Time to Market

- In highly competitive markets it is critical to catch the market window:
  - A short delay in bringing the product to the market can have catastrophic financial consequences (even if the quality of the product is excellent).
- Design time has to be reduced!
  - Advanced design methodologies.
  - Efficient design tools.
  - Reuse of previously designed and verified (hardware and software) blocks.
  - Platforms for several products in a family.
  - Good designers who understand both software and hardware!
The ES Design Challenges

- Increasing application complexity (e.g., automotive).
- Heterogeneous architecture (HW, SW, network, mechatronics, etc.).
- Stringent time and power constraints.
- Low cost requirement.
- Short time to market.
- Safety and reliability (e.g., very long life-time).
- In order to meet all these requirements, systems have to be highly optimized.
- Both hardware and software aspects have to be considered simultaneously!
Current Design Practice

1. Start from some informal specification and a set of constraints (time, power, and cost constraints).
2. Generate a more formal specification, based on some modeling concept (FSM, data-flow, etc.), using Matlab, Statecharts, SystemC, C, UML, or VHDL.
3. Simulate the model in order to check its functionality. The model is modified, if needed.
4. Choose an architecture such that the cost limit is satisfied, and hopefully the time and power constraints will be fulfilled.
5. Implement both the hardware and software components and build a prototype.
6. Validate the system.
A usual outcome: neither time nor power constraints are satisfied!
The Consequences
- Delays in the design process:
  - Increased design cost.
  - Delays in time to market, leading to a missed market window.
- High cost due to many iterations with implementation and prototyping.
- Bad design decisions taken under time pressure:
  - Low quality.
  - High cost.
- The lesson: we need to explore more design alternatives in an efficient manner.
  - At the system level!
System-Level Design
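[Figure: system-level design flow. The informal specification and constraints are turned by modeling into a system model, which is checked by functional simulation and formal verification. Architecture selection yields the system architecture. Mapping, estimation, and scheduling produce the mapped and scheduled model, which is checked by simulation and formal verification; "not OK" results loop back to earlier steps. When OK, the flow continues with the software model and the hardware model (structural simulation) into lower-level design.]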
The Improved Design Flow

- Several design alternatives are evaluated before going down to the lower-level design.
  - This is performed as part of the design space exploration process.
  - Different architectures, mappings and schedules are explored before the actual implementation and prototyping.
- We get highly optimized solutions in a short time.
  - There is a good chance that design iterations at the lower level, including prototyping, can be avoided.
Additional Improvements
- Formal verification
  - It is impossible to do an exhaustive simulation.
  - Especially for safety critical systems, formal verification is needed.
- Simulation
  - Used not only for functional validation.
  - Should also be used after mapping and scheduling in order to check, for example, timing properties.
  - May also be used during the implementation steps: hardware/software co-simulation.
- Hardware/software trade-offs
  - Hardware/software partitioning to decide what is to be mapped on a programmable processor (SW) and what is going into HW.
  - Hardware/software co-synthesis to coordinate the HW and SW synthesis processes and allow moving of functionality from one to the other.
The Lower-Level Issues

- Software generation:
  - Encoding in an implementation language (C, C++, assembler).
  - Compiling (this can include particular optimizations for application specific processors, DSPs, etc.).
  - Generation of a real-time kernel or adapting to an existing operating system.
- Hardware synthesis:
  - Encoding in an HDL (e.g., VHDL or Verilog).
  - Successive synthesis steps: high-level, register-transfer level, and logic-level synthesis.
- Hardware/software integration:
  - The software is run together with the hardware model (co-simulation).
- Prototyping:
  - A prototype of the hardware is constructed and the software is executed on the target architecture.
Lower-Level Design
There are established CAD tools on the market which automatically perform many of the low-level tasks:
- Code generators (software model to C, hardware model to VHDL).
- Compilers.
- Hardware synthesis tools:
  - RT-level synthesis
  - Logic synthesis
  - Layout and physical implementation
- Test generators and debuggers.
- Simulation and co-simulation tools.
Focus on System-Level Design
- System-level design has a huge influence on the quality of the final implementation.
- Very few commercial tools are available.
- Mostly experimental and academic tools are available.
- Huge efforts and investments are currently being made in order to develop tools and methodologies for system-level design.
- Ad-hoc solutions are less and less acceptable.
- In this course, we are mainly interested in the system level!
Concluding Remarks
- Codesign provides the capability to make explicit and efficient hardware/software trade-offs.
- Codesign of embedded systems has many advantages and challenges.
- Cost and performance optimization requires system-level approaches.
Analysis, Co-Simulation
and Design Space Exploration
Zebo Peng
Embedded Systems Laboratory (ESLAB)
Linköping University
Outline
Design space exploration
Static analysis techniques
Co-simulation approaches
The Design Space

Very large due to many solution parameters:
- architectures and components
- hardware/software partitioning
- mapping and scheduling
- operating systems and global control
- communication synthesis
[Figure: examples of design-space components: hardware, software, microprocessor, ASIC, analog circuit, sensor, embedded memory, DSP, network, high-speed electronics, SoC. Sources: S3; Stratus Computers.]
Design Space Exploration

What is needed in order to explore the complex design space and find a good solution:
- Exploration at the higher levels of abstraction.
- Development of high-level analysis and estimation techniques.
- Employment of very fast exploration algorithms.
- Memory-less algorithms: each solution needs a huge data structure to store, so we cannot afford to keep track of all visited solutions.
The Optimization Problem

The majority of design space exploration tasks can be
viewed as optimization problems:
To find
- the architecture (type and number of processors, memory modules, and communication blocks, as well as their interconnections),
- the mapping of functionality onto the architecture components, and
- the schedules of basic functions and communications,
such that a cost function (in terms of implementation cost, performance, power, etc.) is minimized and a set of constraints is satisfied.
The System Partitioning Problem

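[Figure: a graph with weighted nodes and edges, split by a two-way partitioning.]
A feasible solution for the k-way partitioning can be represented as:
$x_i = j, \quad j \in \{1, 2, \ldots, k\}, \quad i = 1, 2, \ldots, n,$
i.e., each of the $n$ nodes $i$ is assigned to one of the $k$ partitions $j$.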
Hardware/Software Partitioning
Input: Implementation-independent system specification consisting of interacting processes (e.g., VHDL).
Output: Two sets of processes, assigned for hardware and software implementation respectively.
Target architecture:
- Microprocessors
- ASICs
- Shared memories
Assumptions:
- The microprocessor and the ASIC work in parallel.
- Reducing the amount of communication between the microprocessor and hardware improves the overall performance.
Objectives:
- Maximal performance at a given cost limit.
- Minimal implementation cost such that the timing and other constraints are satisfied.
- Quantitative values can be derived via simulation, profiling, or static analysis of the specification. Ex.:
  - Computation load (CL): number of operations executed by a basic region or process of the specification.
  - Communication intensity (CI): total number of communication operations on a channel between two processes.
- Performance improvement is based on:
  - Placing computation intensive processes into hardware.
  - Increasing parallelism.
  - Reducing inter-domain communication.
Process Graph Formulation

- Nodes correspond to processes, which could be processes or basic blocks in the original specification (e.g., VHDL).
- Node weights reflect the degree of suitability for hardware implementation of the corresponding process:
  - the computation load of the process;
  - the uniformity of operations in the process;
  - the potential parallelism inside the process;
  - suitability for software implementation.
- Edges connect two nodes iff there exists a communication channel between them.
- Edge weights are a measure of communication and mutual synchronization between the processes.
Process Graph Formulation

The Graph Partitioning Problem:
To partition the process graph into two groups such that the sum of the weights of the cut edges is minimal, subject to a set of constraints. Ex.:
- Physical limitation of silicon area:
  $\sum_{i \in Hw} H\_cost_i \le Max^H$
- Implement a node in HW when it is appropriate:
  $W^N_i \ge Lim_1 \Rightarrow i \in Hw$
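The graph partitioning formulation can be illustrated with a minimal sketch that exhaustively tries every HW/SW assignment of a tiny process graph. All node names, weights, costs, and limits below are hypothetical, and the second constraint is assumed to mean that a node whose weight reaches a threshold `lim1` must go to hardware. Exhaustive search is only viable for a handful of nodes; it is used here purely to make the objective and constraints concrete.

```python
from itertools import product

def cut_weight(edges, assign):
    """Sum the weights of edges whose endpoints are in different partitions."""
    return sum(w for (u, v), w in edges.items() if assign[u] != assign[v])

def is_feasible(assign, hw_cost, max_hw_cost, node_weight, lim1):
    """Check the area limit and the 'suitable nodes go to HW' rule (assumed form)."""
    area = sum(hw_cost[n] for n, side in assign.items() if side == "HW")
    forced_ok = all(assign[n] == "HW" for n, w in node_weight.items() if w >= lim1)
    return area <= max_hw_cost and forced_ok

def best_partition(edges, hw_cost, max_hw_cost, node_weight, lim1):
    """Enumerate all 2^n assignments and keep the feasible one with minimal cut."""
    nodes = sorted(hw_cost)
    best = None
    for sides in product(("SW", "HW"), repeat=len(nodes)):
        assign = dict(zip(nodes, sides))
        if not is_feasible(assign, hw_cost, max_hw_cost, node_weight, lim1):
            continue
        cost = cut_weight(edges, assign)
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best

# Hypothetical process graph: edge weights model communication intensity.
edges = {("p1", "p2"): 5, ("p2", "p3"): 2, ("p3", "p4"): 7, ("p1", "p4"): 1}
hw_cost = {"p1": 30, "p2": 20, "p3": 40, "p4": 25}   # silicon area per node
node_weight = {"p1": 9, "p2": 3, "p3": 4, "p4": 2}   # suitability for HW

best_cost, best_assign = best_partition(edges, hw_cost, 60, node_weight, 8)
print(best_cost, best_assign)
```

In this toy instance only p1 is forced into hardware, and the area limit leaves room for one more small node; the search picks the neighbor that removes the heaviest cut edge.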
Features of CO Problems
- Most combinatorial optimization (CO) problems for digital system design, e.g., system partitioning with constraints, are NP-complete.
- The time needed by known exact algorithms to solve an NP-complete problem grows exponentially with the problem size n.
- For example, to enumerate all feasible solutions for a scheduling problem (all possible permutations), we have:
  - 20 tasks in 1 hour (assumption);
  - 21 tasks in 21 hours;
  - 22 tasks in about 19 days;
  - ...
  - 25 tasks in about 7 centuries.
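The growth of the enumeration time can be checked with a few lines of arithmetic. This is a small sketch assuming that the time is proportional to the number of permutations n! and that 20 tasks take exactly one hour; the function name and the 365-day year are choices made here for illustration.

```python
from math import factorial

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

def enumeration_hours(n_tasks, base_tasks=20, base_hours=1.0):
    """Hours needed to enumerate n_tasks! permutations, scaled from the base case."""
    return base_hours * factorial(n_tasks) / factorial(base_tasks)

for n in (20, 21, 22, 25):
    hours = enumeration_hours(n)
    print(f"{n} tasks: {hours:,.0f} hours "
          f"= {hours / HOURS_PER_DAY:,.1f} days "
          f"= {hours / HOURS_PER_YEAR:,.1f} years")
```

Each additional task multiplies the time by the new task count, which is why five extra tasks turn one hour into centuries.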
Features of CO Problems
- Many CO problems can be formulated as an Integer Linear Programming (ILP) problem, and solved by an ILP solver.
- It is inherently more difficult to solve an ILP problem than the corresponding Linear Programming (LP) problem.
- The size of problem that can be solved successfully by ILP algorithms is an order of magnitude smaller than the size of LP problems that can be easily solved.
Heuristics
- A heuristic seeks near-optimal solutions at a reasonable computational cost without being able to guarantee either optimality or feasibility.
- Motivations:
  - Many exact algorithms involve a huge amount of computation effort.
  - The decision variables frequently have complicated interdependencies.
  - We often have nonlinear cost functions and constraints, or even no mathematical functions at all.
    Ex. The cost function f can, for example, be defined by a computer program (e.g., for power estimation).
  - Approximation of the model for optimization: since the model is only approximate, a near-optimal solution is usually good enough and could be even better than the theoretical optimum.
Heuristic Approaches to CO
- Constructive, problem specific: clustering, list scheduling, left-edge algorithm.
- Constructive, generic methods: branch and bound, divide and conquer.
- Transformational (iterative improvement), problem specific: Kernighan-Lin algorithm.
- Transformational (iterative improvement), generic methods (meta-heuristics): neighborhood search, simulated annealing, tabu search, genetic algorithms.
Clustering for System Partitioning
- Each node initially belongs to its own cluster, and clusters are then gradually merged until the desired partitioning is found.
- The merge operation is selected based on local information (closeness metrics), rather than a global view of the whole system.
[Figure: nodes v1 to v5 are merged step by step into larger clusters, shown together with the resulting merge tree.]
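The merging process can be sketched as follows. This is a minimal illustration, assuming the closeness of two clusters is the total weight of the edges between them; the graph, its weights, and the function names are hypothetical.

```python
def closeness(edges, a, b):
    """Local closeness metric: total edge weight between clusters a and b."""
    return sum(w for (u, v), w in edges.items()
               if (u in a and v in b) or (u in b and v in a))

def cluster(nodes, edges, target_clusters):
    """Start with one cluster per node and merge the closest pair repeatedly."""
    clusters = [frozenset([n]) for n in nodes]
    while len(clusters) > target_clusters:
        # Evaluate every pair of clusters and pick the closest one.
        pairs = [(closeness(edges, a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = max(pairs)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical five-node graph with communication weights on the edges.
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = {("v1", "v2"): 8, ("v2", "v3"): 3, ("v3", "v4"): 9,
         ("v4", "v5"): 6, ("v1", "v5"): 1}

result = cluster(nodes, edges, target_clusters=2)
print([sorted(c) for c in result])
```

Each merge decision looks only at pairwise closeness, so the result depends on the merge order and is not guaranteed to be globally optimal.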
The Kernighan-Lin Algorithm (KL)
- A graph is partitioned into two clusters of arbitrary size, by minimizing a given objective function.
- KL is based on an iterative partitioning strategy:
  - The algorithm starts with two arbitrary clusters C1 and C2.
  - The partitioning is then iteratively improved by moving nodes between the clusters.
  - At each iteration, the node which produces the minimal value of the cost function is moved; this value can, however, be greater than the value before moving the node.
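The iterative strategy described above can be sketched as a simplified, KL-style improvement loop. This follows the single-node-move description on the slide rather than the pair swaps of the classical Kernighan-Lin formulation; the rule that each node moves at most once per pass, the imbalance penalty in the cost function, and the example graph are all assumptions made for illustration.

```python
def kl_like_pass(nodes, cost, side):
    """One pass: move every node once, always taking the move with minimal
    resulting cost (even if it is worse), and remember the best state seen."""
    side = dict(side)
    best_side, best_cost = dict(side), cost(side)
    unlocked = set(nodes)
    while unlocked:
        def cost_after_move(n):
            trial = dict(side)
            trial[n] = 1 - trial[n]
            return cost(trial)
        n = min(sorted(unlocked), key=cost_after_move)
        side[n] = 1 - side[n]        # move the node to the other cluster
        unlocked.remove(n)           # a moved node is locked for this pass
        if cost(side) < best_cost:
            best_side, best_cost = dict(side), cost(side)
    return best_side, best_cost

def kl_like(nodes, cost, side):
    """Repeat passes until a pass no longer improves the cost."""
    current_cost = cost(side)
    while True:
        new_side, new_cost = kl_like_pass(nodes, cost, side)
        if new_cost >= current_cost:
            return side, current_cost
        side, current_cost = new_side, new_cost

# Hypothetical graph: two tightly coupled triangles joined by one light edge.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
         ("d", "e"): 5, ("e", "f"): 5, ("d", "f"): 5, ("c", "d"): 1}

def cost(side):
    """Cut weight plus a hypothetical penalty for unbalanced clusters."""
    cut = sum(w for (u, v), w in edges.items() if side[u] != side[v])
    ones = sum(side.values())
    return cut + 4 * abs(len(side) - 2 * ones)

start = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 0, "f": 1}
final_side, final_cost = kl_like(nodes, cost, start)
print(final_cost, final_side)
```

Accepting temporarily worse moves inside a pass is what lets the procedure escape some poor starting partitions; only the best state of each pass is kept.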
Branch-and-Bound
- Traverse an implicit tree to find the best leaf (solution).
[Figure: 4-city TSP on cities 0 to 3 with weighted edges; the tour shown has a total cost of 88.]
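Branch-and-Bound Ex
- Lower bound on the cost function.
- Search strategy.
[Search tree for the 4-city TSP. L is the lower bound on the cost for a partial tour and the exact cost for a complete tour.]
{0}: L ≥ 0
  {0,1}: L ≥ 3
    {0,1,2}: L ≥ 43
      {0,1,2,3}: L = 88
    {0,1,3}: L ≥ 8
      {0,1,3,2}: L = 18
  {0,2}: L ≥ 6
    {0,2,1}: L ≥ 46
      {0,2,1,3}: L = 92
    {0,2,3}: L ≥ 10
      {0,2,3,1}: L = 18
  {0,3}: L ≥ 41
    {0,3,1}: L ≥ 46
      {0,3,1,2}: L = 92
    {0,3,2}: L ≥ 45
      {0,3,2,1}: L = 88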
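The search tree of the 4-city example can be reproduced with a short depth-first branch-and-bound sketch. The distance matrix below is not given explicitly in the slides; it is derived here from the partial-tour bounds in the tree (for example, {0,1} with L = 3 and {0,1,3} with L = 8 imply a distance of 5 between cities 1 and 3). The bound used is simply the length of the partial tour, and the function name is hypothetical.

```python
import math

# Symmetric distances derived from the bounds in the example search tree.
DIST = [
    [0, 3, 6, 41],
    [3, 0, 40, 5],
    [6, 40, 0, 4],
    [41, 5, 4, 0],
]

def branch_and_bound(dist, start=0):
    """Depth-first search over partial tours, pruning any branch whose
    lower bound (the partial tour length) cannot beat the best tour so far."""
    n = len(dist)
    best = {"cost": math.inf, "tour": None}

    def expand(path, length):
        if length >= best["cost"]:
            return                                   # prune this subtree
        if len(path) == n:
            total = length + dist[path[-1]][start]   # close the tour
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path + [start]
            return
        for city in range(n):
            if city not in path:
                expand(path + [city], length + dist[path[-1]][city])

    expand([start], 0)
    return best["cost"], best["tour"]

best_cost, best_tour = branch_and_bound(DIST)
print(best_cost, best_tour)
```

Once the tour of cost 18 is found, the branches {0,2,1} (bound 46) and {0,3} (bound 41) are cut off without being explored further.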
Neighborhood Search Method

Step 1 (Initialization)
  (A) Select a starting solution $x_{now} \in X$.
  (B) $x_{best} = x_{now}$, $best\_cost = c(x_{best})$.
Step 2 (Choice and termination)
  Choose a solution $x_{next} \in N(x_{now})$.
  If no solution can be selected or the terminating criteria apply, then the method stops.
Step 3 (Update)
  Reset $x_{now} = x_{next}$.
  If $c(x_{now}) < best\_cost$, perform Step 1(B).
  Go to Step 2.
$N(x)$ denotes the neighborhood of $x$, which is a set of solutions reachable from $x$ by a simple transformation.
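The three steps can be sketched as a generic loop in which the choice rule is passed in as a function. This is a minimal illustration: the parameter names, the step limit used as terminating criterion, and the toy problem (integers with neighbors x - 1 and x + 1) are assumptions. The choice rule shown picks the first improving neighbor, which makes the loop behave like a descent method.

```python
def neighborhood_search(x_start, cost, neighbors, choose, max_steps=1000):
    """Generic neighborhood search following Steps 1 to 3."""
    # Step 1: initialization.
    x_now = x_start
    x_best, best_cost = x_now, cost(x_now)
    for _ in range(max_steps):
        # Step 2: choice and termination.
        x_next = choose(x_now, neighbors(x_now))
        if x_next is None:
            break
        # Step 3: update.
        x_now = x_next
        if cost(x_now) < best_cost:
            x_best, best_cost = x_now, cost(x_now)
    return x_best, best_cost

def cost(x):
    """Toy cost function with its minimum at x = 7."""
    return (x - 7) ** 2

def neighbors(x):
    """Neighborhood N(x): solutions reachable by a simple transformation."""
    return [x - 1, x + 1]

def first_improving(x_now, candidates):
    """Choice rule of a descent method: accept only improving neighbors."""
    for x in candidates:
        if cost(x) < cost(x_now):
            return x
    return None

result = neighborhood_search(0, cost, neighbors, first_improving)
print(result)
```

Swapping in a different `choose` function (for example, one that sometimes accepts worse neighbors) changes the search strategy without touching the loop.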
Neighborhood Search Method

- The neighborhood search method is very attractive for many CO problems as they have a natural neighborhood structure, which can be easily defined and evaluated.
  - Ex. Graph partitioning: swapping two nodes.
[Figure: a weighted graph partitioned in two, shown before and after swapping two nodes between the partitions.]
The Descent Method

Step 1 (Initialization)
Step 2 (Choice and termination)
  Choose $x_{next} \in N(x_{now})$ such that $c(x_{next}) < c(x_{now})$, and terminate if no such $x_{next}$ can be found.
Step 3 (Update)
The descent process can easily get stuck at a local optimum.
[Figure: cost plotted over the solutions, with the descent ending in a local optimum.]
Dealing with Local Optimality

- Enlarge the neighborhood.
- Start with different initial solutions.
- Allow uphill moves:
  - Simulated annealing
  - Tabu search
[Figure: cost plotted over the solutions, with an uphill move leading out of a local optimum.]
The SA Algorithm
Select an initial solution $x_{now} \in X$;
Select an initial temperature $t > 0$;
Select a temperature reduction function $\alpha$;
Repeat
  Repeat
    Randomly select $x_{next} \in N(x_{now})$;
    $\delta = cost(x_{next}) - cost(x_{now})$;
    If $\delta < 0$ then $x_{now} = x_{next}$
    else
      generate random $p$ uniformly in the range (0, 1);
      If $p < \exp(-\delta / t)$ then $x_{now} = x_{next}$;
  Until iteration_count = nrep;
  Set $t = \alpha(t)$;
Until stopping condition = true.
Return $x_{now}$ as the approximation to the optimal solution.
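The algorithm can be written as a short runnable sketch. The stopping condition (temperature below `t_min`), the geometric cooling function, the toy cost function, and all parameter values are assumptions chosen for illustration. The sketch additionally tracks the best solution seen, which the algorithm above does not do; this is a common convenience, not part of the listing.

```python
import math
import random

def simulated_annealing(x_start, cost, random_neighbor, t_start, alpha,
                        nrep, t_min, rng):
    """Simulated annealing; returns the final and the best solution seen."""
    x_now = x_best = x_start
    t = t_start
    while t > t_min:                       # assumed stopping condition
        for _ in range(nrep):
            x_next = random_neighbor(x_now, rng)
            delta = cost(x_next) - cost(x_now)
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x_now = x_next
                if cost(x_now) < cost(x_best):
                    x_best = x_now
        t = alpha(t)                       # temperature reduction
    return x_now, x_best

def cost(x):
    """Toy cost function over the integers 0..20 with its minimum at 13."""
    return (x - 13) ** 2

def random_neighbor(x, rng):
    """Move one step left or right, staying inside the range 0..20."""
    return min(20, max(0, x + rng.choice((-1, 1))))

rng = random.Random(0)                     # fixed seed for repeatability
x_start = 0
x_final, x_best = simulated_annealing(
    x_start, cost, random_neighbor,
    t_start=10.0, alpha=lambda t: 0.9 * t, nrep=20, t_min=0.01, rng=rng)
print(x_final, x_best, cost(x_best))
```

At high temperature almost any move is accepted; as `t` shrinks, `exp(-delta / t)` approaches zero for uphill moves and the search behaves more and more like a descent method.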
A HW/SW Partitioning Example

[Figure: cost function value (35000 to 75000) plotted against the number of iterations (0 to 1400); the optimum is reached at iteration 1006.]
Analysis Techniques
- Analysis and simulation techniques are essential for hardware/software codesign:
  - To guide the design space exploration.
  - To provide feedback to the human designers.
  - To support design validation.
- Selection of an analysis/simulation technique is usually based on a trade-off between efficiency and accuracy.
- For certain analyses, such as worst-case execution time analysis, it is also very important that the result is safe (i.e., correct or pessimistic).
Performance Metrics
- Extreme case performance
  - Worst-case execution time
  - Best-case execution time
- Average case performance
- Probabilistic performance
  - Used in soft real-time applications.
  - To accurately handle the variable execution time of tasks, which may be due to:
    - application characteristics (e.g., data dependent loops);
    - architectural factors (e.g., cache misses);
    - external factors (e.g., network load); or
    - insufficient knowledge.
  - To guarantee a high probability of meeting timing constraints.
Simulation-based Techniques
- Software: running the compiled program on the simulated target architecture.
- Hardware: building a simulation model of the hardware and executing it to collect information.
- A very large number of inputs should be used in order to get good results.
- Only practical for average and probabilistic execution time estimation.
- It is difficult to use when individual programs are not running in isolation.
Static Analysis
Techniques that use information collected by analyzing the programs without executing them.
- No assumption about input data is made.
- Restrictions on software:
  - bounded loops
  - absence of recursive functions
  - absence of dynamic function calls
- Can be used for:
  - program analysis: behavior of a single program on a processor.
  - system performance analysis: behavior of multiple processes on a single processor or several processors.
Program Analysis
- The estimated worst-case execution time (WCET) must be safe and tight.
[Figure: time axis t showing the range of possible execution times, bounded by the actual WCET, with the estimated WCET at or above it.]
- The ideal tool for source code analysis would produce a good WCET estimate based on the following inputs:
  - Source code.
  - Compiler.
  - Machine architecture description.
  - Operating system.
Program Path Analysis

- To determine what sequence of instructions will be executed in the worst case scenario.
- A basic block is composed of instructions in a straight line.
- Let us first assume that each instruction takes a fixed time to execute.
Program Path Analysis

- Infeasible paths can be eliminated by data flow analysis and path information provided by the programmer.
- The number of feasible paths is typically exponential in the program size.
- Efficient methods are needed to avoid enumeration of all paths.
ILP Formulation
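Let $x_i$ be the number of times a basic block $B_i$ is executed, and $c_i$ be the execution time of the basic block $B_i$, which is assumed to be a constant.
The total execution time of the program for a particular execution is:
$\sum_{i=1}^{N} c_i x_i$
where $N$ is the number of basic blocks.
Ex.: $c_1 + c_2 + c_4 + 11 c_5 + 10 c_6 + c_7$
[Figure: control flow graph with blocks labeled C1 to C7, in which C5 is executed 11 times and C6 10 times.]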
ILP Formulation (Cont'd)
The estimated WCET of the program is:
$\max \sum_{i=1}^{N} c_i x_i$
(the sum runs over all $N$ basic blocks) subject to a set of constraints $A x \le b$.
- The quality of the constraints defines the tightness of the estimate.
- Constraint classification:
  - Program structural constraints: deduced from the program's control flow graph.
  - Program functionality constraints: provided by the user to specify loop bounds and other path information.
An Example
  /* k >= 0 */
  s = k;
  while (k < 10) {
      if (ok)
          j++;
      else {
          j = 0;
          ok = true;
      }
      k++;
  }
  r = j;
CFG ($x_i$ is the execution count of basic block $B_i$; $d_1$ to $d_{10}$ are the execution counts of the edges):
  B1: s = k;
  B2: while (k < 10)
  B3: if (ok)
  B4: j++;
  B5: j = 0; ok = true;
  B6: k++;
  B7: r = j;
Constraints I
Structural constraints:
  $d_1 = 1$
  $x_1 = d_1 = d_2$
  $x_2 = d_2 + d_8 = d_3 + d_9$
  $x_3 = d_3 = d_4 + d_5$
  ...
[Figure: the CFG of the example, with basic blocks B1 to B7 and edges d1 to d10.]
Constraints II
Functionality constraints (for the example program):
- Loop bound information:
  $0 \cdot x_1 \le x_3 \le 10 \cdot x_1$
- Path information:
  $x_5 \le 1 \cdot x_1$
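For the small example program, the maximization can be illustrated without an ILP solver by enumerating all execution counts that satisfy the constraints. This is a sketch under stated assumptions: the block costs are hypothetical cycle counts, brute-force enumeration stands in for the ILP solver, and the relations for $x_2$, $x_4$, $x_6$, and $x_7$ are derived here from the shape of the loop (they are among the structural constraints elided in the slides).

```python
# Hypothetical execution times (in cycles) of basic blocks B1..B7.
BLOCK_COST = {1: 2, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1}

def estimate_wcet(block_cost, loop_bound=10):
    """Maximize sum(c_i * x_i) over all counts allowed by the constraints."""
    best_time, best_counts = -1, None
    x1 = 1                                         # the program runs once
    for x3 in range(0, loop_bound * x1 + 1):       # 0*x1 <= x3 <= 10*x1
        for x5 in range(0, min(1 * x1, x3) + 1):   # x5 <= 1*x1
            x4 = x3 - x5          # each loop body takes one of two branches
            x6 = x4 + x5          # both branches continue to k++
            x2 = x1 + x6          # loop test: entered once plus back edges
            x7 = x2 - x3          # loop exit
            counts = {1: x1, 2: x2, 3: x3, 4: x4, 5: x5, 6: x6, 7: x7}
            time = sum(block_cost[i] * counts[i] for i in counts)
            if time > best_time:
                best_time, best_counts = time, counts
    return best_time, best_counts

wcet, counts = estimate_wcet(BLOCK_COST)
print(wcet, counts)
```

With these costs the worst case runs the loop ten times and takes the more expensive else branch once, which is exactly what the path constraint allows; dropping that constraint would give a looser (more pessimistic) estimate.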
Remarks on Performance Analysis

- One of the main issues of hardware/software codesign is estimation and analysis.
- Analysis of average and probabilistic performance can be done by simulation.
- Worst-case execution time analysis can only be done efficiently by static analysis techniques.
- Efficient techniques for analyzing the impact of many advanced micro-architecture components are still research issues.
Simulation
- Usually applied directly to the design descriptions, e.g., VHDL.
- Can be used at different levels of abstraction:
  - System
  - Algorithmic
  - Register-transfer
  - Logic
  - Gate
  - Switch and circuit
Co-Simulation
- How are the hardware and software components simulated at the same time?
- Problems:
  - Different simulation platforms are used.
  - Software runs fast while hardware simulation is relatively slow.
    - How to run the system simulation as fast as possible and keep the two domains synchronized?
  - Slow models provide full details and produce accurate results; fast models do not produce enough timing information and the simulation is less accurate.
Approaches to Co-Simulation 1
Gate-level model of the processor
[Figure: co-simulation framework in which the software (SW) runs on a gate-level VHDL model of the processor, simulated together with the VHDL model of the ASIC.]
- Gate-level simulation of the processor is very slow (tens of clock cycles/sec).
  Ex. At 10 cycles/sec, simulating one second of real time of a 1 GHz processor takes 100 million seconds (about 3.2 years).
- This provides a very accurate solution and is very simple from the co-simulation point of view.
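The figure of roughly three years follows from a simple ratio, shown here as a short calculation. The function name is hypothetical and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def simulation_seconds(target_hz, simulated_cycles_per_sec, real_seconds=1.0):
    """Wall-clock seconds needed to simulate `real_seconds` of target time."""
    return real_seconds * target_hz / simulated_cycles_per_sec

secs = simulation_seconds(target_hz=1e9, simulated_cycles_per_sec=10)
years = secs / SECONDS_PER_YEAR
print(f"{secs:.0f} seconds = {years:.2f} years")
```

The same function shows why faster models matter: every factor of ten gained in simulated cycles per second cuts the wall-clock time by the same factor.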
Instruction-set architecture models
[Figure: the software (SW) runs on an ISA model (a C program running on the host), coupled with the VHDL simulation of the ASIC model.]
- There is no hardware model of the target processor; the software is executed on an ISA model (usually in C); execution on the ISA model provides the interface information (including timing) needed for co-simulation.
- This is fast, but timing accuracy depends on the interface information.
Translation-based models
[Figure: the software is compiled into native code for the host and runs directly on the host, coupled with the VHDL simulation of the ASIC model.]
- There is no hardware model of the target processor; the software is compiled into native code for the host processor; software execution provides the interface information (including timing) needed for co-simulation.
Hardware in the loop:
- Combine hardware and software in one solution, by using the physical device to model its own behavior.
- An adaptor formats inputs to the physical device, applies the input, and returns the resulting outputs with timing information to the simulator.
- This is a good choice for modeling complex standard components such as microprocessors.
- Mixed-level simulation: combines the strengths of simulation at different levels of abstraction and provides a possibility to compare results at different levels.
- Broadband simulator: one broadband language is used, which covers several abstraction levels.
- Multi-simulator: several simulators are used in an integrated environment. Main issues to deal with:
  - The data exchange between the various simulators.
  - The synchronization of the simulators involved.
Concluding Remarks
- Efficient techniques for design verification of embedded systems are hot research topics.
- The basic problem of co-simulation is how to simulate hardware and software together so that simulation is fast and accurate.
- Formal verification mathematically proves design correctness. The issues there are computational complexity and integration into the design flow.
Arch & Platf - 1
Architectures and Platforms

1. Architecture Selection: The Basic Trade-Offs
2. General Purpose vs. Application-Specific Processors
3. Processor Specialisation
4. ASIP Design Flow
5. Specialisation of a VLIW ASIP
6. Tool Support for Processor Specialisation
7. Application Specific Platforms
8. IP-Based Design (Design Reuse)
9. Reconfigurable Systems
Petru Eles, IDA, LiTH
Remember the Design Flow
[Figure: the overall design flow. Constraints and modeling lead to the system model (functional simulation, formal verification); architecture selection gives the system architecture; mapping, estimation, and scheduling give the mapped and scheduled model (simulation, formal verification), with "not OK" loops back to earlier steps. When OK, the software model goes through software generation and the hardware model through hardware synthesis; the resulting software and hardware blocks are simulated, then a prototype is fabricated and tested, again with "not OK" loops.]
Architecture Selection and Mapping
Select the underlying hardware structure on which to run the modelled system.
Map the functionality captured by the system over the components of the selected architecture.
Functionality includes processing and communication.
Architecture Selection
General purpose vs. application specific:
- Use a general purpose, existing platform and map the application on it,
- or build a customised architecture strictly optimised for the particular application,
- or something in-between.
Software vs. hardware:
- Use programmable processors running software,
- or use dedicated electronics (fixed or reconfigurable),
- or both.
Mono vs. multiprocessor, single vs. multichip:
- Monoprocessor.
- Multiprocessor (single chip or multi chip).
Architecture Selection (cont'd)
The trade-offs:
Performance (high speed, low power consumption)
- Application specific: high; general purpose: low.
- Hardware: high; software: low; reconfigurable hardware in between.
Flexibility (how easy it is to upgrade or modify)
- General purpose: high; application specific: low.
- Software: high; hardware: low; reconfigurable hardware in between.
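Architecture Selection (cont'd)
[Figure: energy consumed versus flexibility (low, med., high) for GP processors, ASIPs, FPGAs, and ASICs. GP processors combine high flexibility with high energy consumption, ASICs combine low flexibility with low energy consumption, and the steps on the energy axis are marked as an order of magnitude.]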
General Purpose vs. Application Specific Processors
Both GP processors and ASIPs (application specific instruction set processors) can be RISCs, CISCs, DSPs, microcontrollers, etc.
- One could look at DSPs and microcontrollers as being specific for DSP and simple control applications respectively.
- An application specific DSP or microcontroller is, however, more specialised than just for DSP or control applications.
GP processors
- Neither the instruction set nor the microarchitecture or memory system is customised for a particular application or family of applications.
ASIPs
- Instruction set, microarchitecture and/or memory system are customised for an application or family of applications.
- What results is better performance and reduced power consumption.
What Makes an ASIP Specific?
What can we specialize in a processor?
Instruction set (IS) specialisation
Exclude instructions which are not used:
- reduces instruction word length (fewer bits needed for encoding);
- keeps controller and data path simple.
Introduce instructions, even exotic ones, which are specific to the application: combinations of arithmetic instructions (multiply-accumulate), small algorithms (encoding/decoding, filter), vector operations, string manipulation or string matching, pixel operations, etc.
- reduces code size, which in turn reduces memory size, memory bandwidth, power consumption, and execution time.
Function unit and data path specialisation
Once an application specific IS is defined, this IS can be implemented using a more or less specific data path and more or less specific function units.
Adaptation of word length.
Adaptation of register number.
Adaptation of functional units
- Highly specialised functional units can be introduced for string matching and manipulation, pixel operations, arithmetic, and even complex units to perform certain sequences of computations (co-processors).
Arch & Platf - 10

Memory specialisation
Number and size of memory banks.
Number and size of access ports.
- They both influence the degree of parallelism in memory access.
- Having several smaller memory blocks (instead of one big)
increases parallelism and speed, and reduces power consumption.
- Sophisticated memory structures can increase cost and bandwidth
requirement.
Cache configuration:
- separate instruction/data?
- associativity
- cache size
- line size
The best cache configuration depends very much on the characteristics
of the application and, in particular, on the properties related to locality.
It has a very large impact on performance and power consumption.
Arch & Platf - 11

Interconnect specialization
Interconnect of functional modules and registers.
Interconnect to memory and cache.
- How many internal buses?
- What kind of protocol?
- Additional connections increase the potential of parallelism.
Control specialisation
Centralised control or distributed (globally asynchronous)?

Pipelining?
Out of order execution?
Hardwired or microprogrammed?
Arch & Platf - 12
ASIP Design Flow

(It can be seen as a part of the big design flow - slide 2)
Processor
Architecture
Algorithm(s)
Compiler
Simulator
Performance
numbers
Arch & Platf - 13
A SOC for Multimedia Applications
Glue logic
A/D and D/A
Controller
(ASIP)
VLIW
processor
(ASIP)
On-chip
memory
DSP
(GP)
This is a typical application specific platform. Its structure has been
adapted for a family of applications. Besides GP processor cores, the
platform also contains ASIP cores, which themselves are specialised.
The application specific controller performs master control of the
system and memory access control.
The off-the-shelf (GP) DSP performs the less computation-intensive
modem and sound codec functions.
The VLIW ASIP performs computation-intensive functions: discrete cosine
and inverse discrete cosine transforms, motion estimation, etc.
Arch & Platf - 14
Specialization of a VLIW ASIP

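[Figure: clustered VLIW ASIP. Instruction fetch & decode (from memory system)
drives a datapath of three clusters, each with its own register file:
Cluster 1 (Register File 1): ALU A1, MULT M1, MULT M2.
Cluster 2 (Register File 2): MULT M3, MULT M4, ALU A2, ALU A3.
Cluster 3 (Register File 3): MAC MA1, ALU A4, MULT M5, ALU A5.
The clusters are connected through internal storage & interconnect
(crossbar / bus) to the memory system.]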
Arch & Platf - 15
Specialization of a VLIW ASIP (contd)

This is what an instruction word looks like:
| op1 op2 op3 | op4 op5 op6 op7 | op8 op9 op10 op11 |
|  Cluster 1  |    Cluster 2    |     Cluster 3     |
Arch & Platf - 16

Traditionally the datapath is organised as single register file shared by
all functional units.
Problem: Such a centralised structure does not scale!
If we increase the number of functional units in order to increase parallelism,
we also have to increase the number of registers in the register file.
Internal storage and communication between functional units and
registers becomes dominant in terms of area, delay, and power.
High performance VLIW processors are limited not by arithmetic
capacity but by internal bandwidth.
Arch & Platf - 17

A solution: clustering.
Restrict the connectivity between functional units and registers, so
that each functional unit can read/write from/to a subset of
registers.
Organise the datapath as clusters of functional units and local
register files.
Nothing is for free!!!
Moving data between registers belonging to different clusters takes
much time and power!
You have to drastically minimise the number of such moves by:
- Carefully adapting the structure of clusters to the application.
- Using very clever compilers.
Arch & Platf - 18

Instruction set specialisation: nothing special.
Function unit and data path specialisation
- Determine the number of clusters.
- For each cluster determine
- the number and type of functional units;
- the dimension of the register file.
Memory specialisation is extremely important because we need to
stream large amounts of data to the clusters at high rate; one has
to adapt the memory structure to the access characteristics of the
application.
- determine the number and size of memory banks
Arch & Platf - 19

Interconnect specialization
- Determine the interconnect structure between clusters and
from clusters to memory:
- one or several buses,
- crossbar interconnection
- etc.
Control specialisation:
That is more or less settled, as we have decided on a VLIW
processor.
Arch & Platf - 20
Tool Support for Processor Specialisation
Look at the design flow on slide 12!

In order to be able to generate a specialised architecture you need:
Retargetable compiler
Configurable simulator
Arch & Platf - 21
Retargetable Compiler
Retargetable compiler
Processor
Architecture
Algorithm
Retargetable
Compiler
Object code
Arch & Platf - 22
Retargetable Compiler (contd)

An automatically retargetable compiler can be used for a range of
different target architectures.
The actual code optimization and code generation is done by the
compiler, based on a description of the target processor architecture.
This description is formulated in a so-called architecture description
language.
Having a good compiler is not only important for the processor
specialisation process!
Once you have got your specialised ASIP you need a good compiler
in order to efficiently make use of it!
Arch & Platf - 23
Configurable Simulator
Such a simulator can be
configured for a particular
architecture (based on an
architecture description)
Processor
Architecture
Object code
Simulator
Performance
numbers
In this context, the most important output produced by
the simulator is performance numbers:
- throughput
- delay
- power/energy consumption
Arch & Platf - 24
Application Specific Platforms
Not only processors but also hardware platforms can be specialised
for classes of applications.
The platform will define a certain communication infrastructure
(buses and protocols), certain processor cores, peripherals, and
accelerators commonly used in the particular application area, and
a basic memory structure.
Arch & Platf - 25
Application Specific Platforms (contd)
Proc.
Core3
Proc.
Core2
Proc.
Core1
Cache
DMA
Memory
Bridge
System bus
Peripheral bus
Peripheral
Reconfigurable
logic
Peripheral
Arch & Platf - 26
Application Specific Platforms (contd)

Design space exploration for platform definition:
Platform
Architecture
Applications
Mapping/
Compiling
Simulator
Performance
numbers
Arch & Platf - 27
Instantiating a Platform
Once we have an application, the chip that implements it will not be
designed as a collection of independently developed blocks, but will
be an instance of an application specific platform.
The hardware platform will be refined by
- determining memory and cache size
- identifying the particular cores, peripherals to be used
- adding specific ASICs, accelerators
- determining the amount of reconfigurable logic (if needed)
Arch & Platf - 28
Instantiating a Platform (contd)
Platform
Architecture
Platform
Instance
Application
Mapping/
Compiling
Simulator
Performance
numbers
Arch & Platf - 29
System Platforms
What we have discussed so far (see previous slides) are so-called
hardware platforms.
The hardware platform is delivered together with a software layer:
hardware platform + software layer = system platform.
Software layer:
- real-time operating system
- device drivers
- network protocol stack
- compilers
The software layer creates an abstraction of the hardware
platform (an application program interface) to be seen by the
application programs.
Arch & Platf - 30
IP-Based Design (Design Reuse)

The key concept for increasing designers' productivity is reuse.
In order to manage the complexity of current large designs we do not
start from scratch but reuse as much as possible from previous
designs, or use commercially available pre-designed IP blocks.
IP: intellectual property.
Some people call this IP-based design, core-based design, reuse
techniques, etc.:
Core-based design is the process of composing a new system
design by reusing existing components.
Arch & Platf - 31
IP-Based Design (contd)

What are the blocks (cores) we reuse?
interfaces, encoders/decoders, filters, memories, timers,
microcontroller-cores, DSP-cores, RISC-cores, GP processor-cores.
Possible(!) definition
A core is a design block which is larger than a typical RTL
component.
Of course:
We also reuse software components!
Arch & Platf - 32
IP-Based Design (contd)
Core 1
Core 2
Core 3
glue
glue
glue
Interconnection bus/switch
glue
Interface
Library
Vendor B
Library
Vendor A
Core 4
processor
Library
Vendor C
What we have designed here can be:

An application specific SOC
A platform to be further instantiated for a particular application.
I/O
Arch & Platf - 33
Types of Cores
Hard cores: are fully designed, placed, and routed by the supplier.
A completely validated layout with definite timing
rapid integration
low flexibility
Firm cores: technology-mapped gate-level netlists.
less predictability
flexibility during
place and route
Arch & Platf - 34
Types of Cores (contd)

Soft cores: synthesizable RTL or behavioral descriptions.
much work with integration and verification.
maximal flexibility
Flexibility can provide opportunities such as adding application
specific instructions to a processor core by modifying the
behavioral description.
Arch & Platf - 35
Reconfigurable Systems
Programmable Hardware Circuits:
They implement arbitrary combinational or sequential circuits
and can be configured by loading a local memory that determines
the interconnection among logic blocks.
Reconfiguration can be applied an unlimited number of times.
Main applications:
- Software acceleration
- Prototyping
Arch & Platf - 36
Reconfigurable Systems (contd)

Dynamic reconfiguration: spatial and temporal partitioning
[Figure: a processor, a memory, and an FPGA accelerator. The FPGA is
temporally partitioned: it holds a different configuration at each of
four successive points in time.]
Arch & Platf - 37
Reconfigurable Systems (contd)

System on Chip with dynamically reconfigurable datapath
C code
Profiling &
Kernel
extraction
CPU
On
chip
mem.
Kernels
Reconfigurable
datapath
Hw/Sw
partitioning
Datapath
synthesis
C code
Arch & Platf - 38
Summary
Architecture selection is about making trade-offs along the
dimensions of speed, cost, flexibility, and power consumption.
ASIPs are programmable processors, specialised for a particular
application or for a family of applications.
Specialisation of an ASIP concerns instruction set, function units
and data path, memory system, interconnect, and control.
Two design tools are of great importance in order to perform
processor specialisation: retargetable compiler and configurable
simulator.
Not only processors can be specialised but also platforms. A
Platform is specialised to execute a certain family of applications.
The particular hardware to be used for a given application is a
specialised instantiation of the platform.
Arch & Platf - 39
Summary (contd)
Reuse is a key technique in order to achieve high design
productivity. Cores to be reused range from interfaces and
decoders to filters and processors.
The three types of cores differ in their flexibility, predictability, and
the effort needed for integration: hard, firm, and soft cores.
Reconfigurable systems can provide good flexibility and, at the
same time, many of the advantages of classical hardware
implementation. They are mainly used for software acceleration
and prototyping.
Low Power/Energy - 1
System-Level Power/Energy Optimization

1. Sources of Power Dissipation
2. Reducing Power Consumption
3. System Level Power Optimization
4. Dynamic Power Management
5. Mapping and Scheduling for Low Energy
6. Real-Time Scheduling with Dynamic Voltage Scaling
Remember the Design Flow

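[Figure: the system-level design flow. Modeling (with constraints) produces
the system model, checked by functional simulation and formal verification.
Architecture selection gives the system architecture. Mapping, estimation,
and scheduling are iterated (not OK / OK) until a mapped and scheduled model
is obtained, checked by simulation and formal verification. The software
model goes to software generation and the hardware model to hardware
synthesis, each with simulation, giving software and hardware blocks.
These are followed by prototype, testing, and fabrication.]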
Why is Power Consumption an Issue?

Portable systems - battery life time!
Systems with a very limited power budget: Mars Pathfinder,
autonomous helicopter, ...
Desktops and servers: high power consumption
- raises temperature and deteriorates performance & reliability
- increases the need for expensive cooling mechanisms
One of the main difficulties with developing high performance
chips is heat extraction.
High power consumption has economical and ecological
consequences.
Sources of Power Dissipation in CMOS Devices

$$P = \underbrace{\frac{1}{2} C V_{DD}^2 f N_{SW}}_{\text{switching power}} + \underbrace{Q_{SC} V_{DD} f N_{SW}}_{\text{short-circuit power}} + \underbrace{I_{leak} V_{DD}}_{\text{leakage power}}$$
The first two terms are dynamic power; the last term is static power.
- Switching power: power required to charge/discharge circuit nodes.
- Short-circuit power: dissipation due to short-circuit current.
- Leakage power: dissipation due to leakage current.
C = node capacitances
NSW = switching activities (number of gate transitions per clock cycle)
f = frequency of operation
VDD = supply voltage
QSC = charge carried by short-circuit current per transition
Ileak = leakage current
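The three terms of the power formula can be evaluated separately to see which one dominates at a given operating point. The sketch below is only an illustration: the function name `cmos_power` and all numeric values are assumptions made for easy arithmetic, not data from the slides.

```python
def cmos_power(c, v_dd, f, n_sw, q_sc, i_leak):
    """Return (switching, short_circuit, leakage) power in watts."""
    switching = 0.5 * c * v_dd ** 2 * f * n_sw   # charging/discharging nodes
    short_circuit = q_sc * v_dd * f * n_sw       # momentary short circuits
    leakage = i_leak * v_dd                      # static, independent of f
    return switching, short_circuit, leakage


# Hypothetical operating point, chosen only to keep the numbers simple.
params = dict(c=1e-9, f=100e6, n_sw=0.5, q_sc=1e-11, i_leak=1e-3)
for v_dd in (2.0, 1.0):
    sw, sc, lk = cmos_power(v_dd=v_dd, **params)
    print(f"VDD={v_dd} V: switching={sw:.4f} W, short-circuit={sc:.4f} W, "
          f"leakage={lk:.4f} W, total={sw + sc + lk:.4f} W")
```

Halving VDD in this model cuts the switching term to a quarter, while the short-circuit and leakage terms only halve (with QSC and Ileak held fixed, which is itself a simplification).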
Sources of Power Dissipation in CMOS Devices (contd)

CMOS transistor (N-type)
[Figure: N-type transistor with gate, drain, source, and body terminals.]
Vbs = body bias voltage
Vth = threshold voltage: the minimal voltage required at the gate to
turn on the transistor.

CMOS inverter
[Figure: CMOS inverter between Vdd and ground, driving the load CL.]
Vdd = supply voltage
CL = output load capacitance
Dynamic power
- Charging and discharging the output load capacitance.
- Momentary short circuits at a gate's output.
Static power
- Subthreshold leakage conduction: it flows even when the voltage at the
gate is below Vth.
- Junction leakage (drain and source to body).
For a long time:
Leakage power has been considered negligible compared to
dynamic.
Today:
Total dissipation from leakage is approaching the total from
dynamic.
As technology drops below 65nm:
Leakage power is exceeding dynamic.
Leakage power is consumed even if the circuit is idle (standby). The
only way to avoid it is to disconnect the circuit from the power supply.
Short circuit power can be around 10% of total.
Switching power is still the main source of power consumption.

For the rest of the discussion, we consider mainly switching
power. At the end we come back to leakage.
Power and Energy Consumption

$$P = \frac{1}{2} C V_{DD}^2 f N_{SW}$$
$$E = P \cdot t = \frac{1}{2} C V_{DD}^2 N_{CY} N_{SW}$$
NCY = number of cycles needed for the particular task.
In certain situations we are concerned about power consumption:
- heat dissipation, cooling;
- physical deterioration due to temperature.
Sometimes we want to reduce total energy consumed:
- battery life.
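The step from power to energy uses the execution time of the task. As a short derivation, assuming the task runs for N_CY cycles at a constant clock frequency f, so that t = N_CY / f:

```latex
E = P \cdot t
  = \frac{1}{2} C V_{DD}^2 f N_{SW} \cdot \frac{N_{CY}}{f}
  = \frac{1}{2} C V_{DD}^2 N_{CY} N_{SW}
```

The frequency cancels: under this model, lowering the clock frequency alone reduces power but not the switching energy of the task. Energy goes down only when the supply voltage, the capacitance, the switching activity, or the number of cycles is reduced.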
Reducing Power/Energy Consumption
The main sources:

Reduce supply voltage
Reduce switching activity
Reduce capacitance
Reduce number of cycles
Reducing Power/Energy Consumption (contd)

Circuit level
Ordering of transistors in gate (influences capacitance).
Transistor sizing.
Logic level
Don't-care optimization to reduce switching activity.
Reduce spurious switching activity by balancing the delays of
paths that converge at each gate.
Technology mapping.
State encoding such that switching activity is minimised: if
state s has a large number of transitions to state q, they
should be given uni-distant codes.
Encoding to minimise switching activity in arithmetic units or
on the bus.
Gated clocks: gate the clocks of circuits (registers, gates,
arithmetic units) when they are in idle time periods.
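The state encoding rule in the list above can be made concrete by counting state-register bit flips, weighted by how often each transition is taken. This is a minimal sketch: the state names, transition counts, and both encodings are hypothetical, and the cost function only counts flips in the state register itself.

```python
def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def switching_cost(encoding, transitions):
    """Total state-register bit flips, weighted by transition counts."""
    return sum(count * hamming(encoding[s], encoding[q])
               for (s, q), count in transitions.items())


# Hypothetical profile: s -> q is by far the most frequent transition.
transitions = {("s", "q"): 90, ("q", "r"): 5, ("r", "s"): 5}

naive = {"s": 0b00, "q": 0b11, "r": 0b01}       # s and q differ in 2 bits
low_power = {"s": 0b00, "q": 0b01, "r": 0b11}   # s and q are uni-distant

print(switching_cost(naive, transitions))      # 190
print(switching_cost(low_power, transitions))  # 105
```

Giving the frequent pair (s, q) codes at Hamming distance 1 moves the two-bit flip to the rare transition r -> s.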
Behavioral level
Schedule and map operations so that the number of cycles is
minimised (with an increased number of switchings per clock
cycle) → you can run at a slower clock rate → you can reduce
the supply voltage.
Allocate and share modules so that power consumption is
reduced (for example, by reducing switching activity).

Architecture level
Specialise instruction set, datapath, register structure to the
particular architecture, with power consumption as an optimization
goal.
- You have on the chip and you switch only those resources
(gates) you really need.
Reduce power consumption on the bus.
- lower switching activity: clever encoding, reduce switching activity on the address bus by exploiting correlations;
- minimise the bus length (capacitance) by optimal module
placement.
- bus segmentation: transform a long heavily loaded global bus
into a partitioned set of local bus segments.

Optimise the memory structure.
- Memory transfers are extremely power hungry: a memory
transfer takes 33 times more energy than an addition!
Reducing the number of memory accesses is a very efficient
way to save power!
- Adapt the number of caches, their size and associativity, and
the length of the cache line to the application → reduced
number of memory transfers.
- Interesting trade-off: larger caches consume more power but
reduce the number of memory transfers → find the right balance!
Provide instruction support for power management:

- Instructions that put certain parts of the system in stand-by or
shut them down.
- Instructions that set the supply voltage dynamically
(dynamic voltage scaling).

System Level
Static techniques are applied at design time.
- Compilation for low power: instruction selection considering
their power profile, data placement in memory, register
allocation.
- Algorithm design: find the algorithm which is the most power-efficient.
- Task mapping and scheduling.
Dynamic techniques are applied at run time.
- They reduce power consumption by exploiting idle or
low-workload periods.
System Level Power Optimization
Three techniques will be discussed:
1. Dynamic power management: a dynamic technique.
2. Task mapping: a static technique.
3. Task scheduling with dynamic voltage scaling: static & dynamic.
Dynamic Power Management (DPM)

Decisions:
application
power aware OS
hardware
Switching among multiple power

states:
idle
sleep
run
Switching among multiple
frequencies and voltage levels.
Goal:
Energy optimization
QoS constraints satisfied
Dynamic Power Management (contd)

Hardware Support (e.g. Intel Xscale Processor)
- RUN: operational.
- IDLE: clocks to the CPU are disabled; recovery is through interrupt.
- SLEEP: mainly powered off; recovery through wake-up event.
- Other intermediate states: DEEP IDLE, STANDBY, DEEP SLEEP.
0.75V, 60mW
150MHz
1.3V, 450mW
RUN
RUN
600MHz
RUN
1.6V, 900mW
RUN
800MHz
160s
RUN
10s
1.5ms
10s
IDLE
40mW
140ms
90s
SLEEP
160W
Dynamic Power Management (contd)

DPM techniques are used in laptops, personal digital assistants
(PDAs), and other portable appliances in order to shut down or
place in stand-by unused devices.
The goal is power saving.
DPM techniques are implemented in the operating system
(including Windows 2000 running on laptops).
The power breakdown for a laptop computer:
- 36% of total power consumed by the display
- 18% by hard-disk
- 18% by wireless LAN interface
- 7% by keyboard, mouse, etc.
- 21% by digital VLSI circuits (don't forget these!).
The Basic Concept of DPM

When there are requests for a device the device is busy;
otherwise it is idle.
When the device is idle, it can be shut down to enter a low-power
sleeping state.
[Figure: timeline with marks T1 to T4 showing the workload (requests), the
device state (busy, idle, busy), and the power state (working, shutdown
delay Tsd, sleeping, wake-up delay Twu, working).]
The Basic Concept of DPM (contd)

Changing the power state takes time (several seconds) and extra energy.
Tsd : shutdown delay
Twu : wake-up delay
Send the device to sleep only if the saved energy justifies the overhead!
The main problems:
Don't shut down so often that wake-up delays occur too frequently.
Don't shut down when the savings due to sleeping are
smaller than the energy overhead of the state changes.
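The rule "sleep only if the saved energy justifies the overhead" can be turned into a break-even time. The sketch below assumes a simple model that is not in the slides: staying idle costs `p_idle` watts, sleeping costs `p_sleep` watts, and the two state changes together take `t_sd + t_wu` seconds and `e_transition` joules. All names and numbers are hypothetical.

```python
def break_even_time(p_idle, p_sleep, e_transition, t_sd, t_wu):
    """Shortest idle period (s) for which going to sleep saves energy."""
    t_tr = t_sd + t_wu
    # Idle energy p_idle*T equals sleep energy e_transition + p_sleep*(T - t_tr).
    t_be = (e_transition - p_sleep * t_tr) / (p_idle - p_sleep)
    # The idle period must at least cover the two transitions themselves.
    return max(t_be, t_tr)


def worth_sleeping(t_idle, p_idle, p_sleep, e_transition, t_sd, t_wu):
    """True if shutting down during an idle period of t_idle seconds pays off."""
    return t_idle > break_even_time(p_idle, p_sleep, e_transition, t_sd, t_wu)


# Hypothetical device: 1 W idle, 0.1 W asleep, 5 J and 3 s per sleep/wake pair.
device = dict(p_idle=1.0, p_sleep=0.1, e_transition=5.0, t_sd=1.0, t_wu=2.0)
print(round(break_even_time(**device), 2))   # about 5.22 s
print(worth_sleeping(10.0, **device))        # True
print(worth_sleeping(3.0, **device))         # False
```

The difficulty named in the slides remains: the length of the idle period is not known when the decision has to be taken, which is what the policies below try to predict.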
Power Management Policies

Power management policies are concerned with predictions
related to idle periods:
- For shut-down: try to predict how long the idle period will be in
order to decide if a shut-down should be performed.
- For wake-up: try to predict when the idle period ends, in order
to avoid user delays due to Twu.
It is quite difficult, and often the wake-up is started simply
when a request has arrived.
Typical Policies:
1. Time-out
2. Predictive
3. Stochastic
Time-out Policy
It is assumed that, after a device has been idle for the time-out period
(the interval T1 - T2 on slide 16), it will stay idle for at least a period
which makes it efficient to shut down.
Drawback: you waste energy during the time-out period (compared to
instantaneous shut-down).
Policies:
- Fixed time-out period: you set the value of the time-out, which then stays constant.
- Adjusted at run-time: increase or decrease the time-out, depending on the
length of previous idle periods.
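A fixed time-out policy can be evaluated over a trace of idle periods. The sketch below is a simplified model under stated assumptions: transition delays are ignored, the cost of one shut-down/wake-up pair is lumped into `e_transition`, and the trace and power values are hypothetical.

```python
def timeout_policy_energy(idle_periods, timeout, p_idle, p_sleep, e_transition):
    """Return (energy in J, number of shut-downs) for a fixed time-out."""
    energy = 0.0
    shutdowns = 0
    for t_idle in idle_periods:
        if t_idle <= timeout:
            # Requests return before the time-out expires: stay idle.
            energy += p_idle * t_idle
        else:
            # Idle energy is wasted during the time-out, then the device sleeps.
            energy += p_idle * timeout + e_transition
            energy += p_sleep * (t_idle - timeout)
            shutdowns += 1
    return energy, shutdowns


idle_trace = [2.0, 30.0, 4.0, 60.0]          # seconds, hypothetical
energy, shutdowns = timeout_policy_energy(
    idle_trace, timeout=5.0, p_idle=1.0, p_sleep=0.1, e_transition=3.0)
always_on = sum(idle_trace) * 1.0            # never shutting down
print(energy, shutdowns, always_on)
```

On this trace the policy spends about 30 J against 96 J for a device that never sleeps; the 10 J spent waiting for the two time-outs to expire is the drawback mentioned above.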
Predictive Policy
The length of an idle period is predicted. If the prediction is for an idle
period long enough, the shut-down is performed immediately (no time
interval T1 - T2 on slide 16).
Observation: the plot of idle period versus previous busy period has an
L-shaped distribution.
- Short busy periods are followed by long idle periods.
- Busy periods longer than a threshold are followed by short idle periods.
Policy: shut down after a short busy period!
[Figure: scatter plot of idle period (vertical axis) against previous busy
period (horizontal axis), showing the L shape.]
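The predictive rule can be checked against a recorded trace by counting how often a shut-down after a short busy period was followed by an idle period long enough to pay off. This is a sketch only: the trace, the busy threshold, and the break-even time are hypothetical values.

```python
def evaluate_predictive(trace, busy_threshold, t_break_even):
    """trace: list of (busy_period, following_idle_period) in seconds.

    Returns (hits, misses) for the rule "shut down immediately after a
    busy period shorter than busy_threshold".
    """
    hits = misses = 0
    for busy, idle in trace:
        if busy < busy_threshold:          # predicted: long idle period
            if idle >= t_break_even:
                hits += 1                  # shut-down paid off
            else:
                misses += 1                # shut-down cost more than it saved
    return hits, misses


# Hypothetical trace with the L shape described above, plus one outlier.
trace = [(1, 40), (12, 2), (2, 35), (15, 1), (1, 3)]
print(evaluate_predictive(trace, busy_threshold=5, t_break_even=10))  # (2, 1)
```

The outlier (a short busy period followed by a short idle period) is the kind of misprediction that costs both energy and a wake-up delay.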
Stochastic Policy
Predictions are based on Markov models: requests and power state
transitions of the device are modelled as probabilistic state machines.
The power manager observes the arriving requests, the request
queue, and the device, and generates shutdown commands.
[Figure: the power manager observes the request generator (the environment
or user, which generates requests), the request queue, and the device
(which provides service), and sends commands to the device. Both the
device and the request generator are described by Markov models.]
Mapping and Scheduling for Low Energy

For many embedded systems, DPM techniques like those presented
before cannot be applied:
- They have no devices like a hard disk, and no (or only a small) display
→ VLSI is the main source of power dissipation.
- They have time constraints → we have to keep deadlines
(usually we cannot afford shut-down and wake-up times).
- The operating system is small → no sophisticated techniques at
run-time.
- The application is known at design time → we know a lot about
the application already at design time.
Static techniques can be used (applied at design time).
Mapping and scheduling for low energy are important!
Mapping for Low Energy

[Figure: task graph with eight tasks (1 to 8), to be mapped on two
processors, p3 and p4, connected by a bus. A table gives, for each task,
the WCET and the energy on p3 and on p4.]
Mapping for Low Energy (contd)

Consider a mapping:
p3: 1, 3, 6, 7, 8.
p4: 2, 4, 5.
Communication times and energy:
C1-2: t = 1; E = 3.
C3-5: t = 2; E = 5.
C4-8: t = 1; E = 3.
C5-7: t = 1; E = 3.
Execution time: 52; energy consumed: 75.
[Figure: Gantt chart of the schedule on p3, p4, and the bus (C1-2, C3-5,
C5-7, C4-8); time axis from 0 to 64.]

Consider a mapping:
p3: 1, 3, 6, 7.
p4: 2, 4, 5, 8.
Communication times and energy:
C1-2: t = 1; E = 3.
C3-5: t = 2; E = 5.
C7-8: t = 1; E = 3.
C5-7: t = 1; E = 3.
Execution time: 57; energy consumed: 70.
[Figure: Gantt chart of the schedule on p3, p4, and the bus (C1-2, C3-5,
C5-7, C7-8); time axis from 0 to 64.]
The second mapping, with task 8 on p4, consumes less energy.
Assume that we have a maximum allowed delay of 60.
Then this second mapping is preferable, even though it is slower!
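The choice between the two mappings follows a simple rule: among the mappings that meet the maximum allowed delay, take the one with the lowest energy. A minimal sketch, using the execution times and energies of the two mappings above (the function name and data layout are assumptions):

```python
def best_mapping(mappings, max_delay):
    """Return the lowest-energy mapping that meets max_delay, or None."""
    feasible = [m for m in mappings if m["time"] <= max_delay]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["energy"])


mappings = [
    {"name": "task 8 on p3", "time": 52, "energy": 75},
    {"name": "task 8 on p4", "time": 57, "energy": 70},
]

print(best_mapping(mappings, max_delay=60)["name"])  # task 8 on p4
print(best_mapping(mappings, max_delay=55)["name"])  # task 8 on p3
```

With a tighter delay bound of 55, only the faster mapping remains feasible, so the energy saving is lost.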
Real-Time Scheduling with Dynamic Voltage Scaling

The energy consumed by a task, due to switching power (slide 6):
$$E = \frac{1}{2} C V_{DD}^2 N_{CY} N_{SW}$$
NSW = number of gate transitions per clock cycle.
NCY = number of cycles needed for the task.
Reducing supply voltage VDD is the most efficient way to reduce
energy consumption.
The frequency at which the processor can be operated depends on VDD:
$$f = k \frac{(V_{DD} - V_t)^2}{V_{DD}}$$
k: circuit dependent constant; Vt: threshold voltage.
The execution time of the task:
$$t_{exe} = N_{CY} \frac{V_{DD}}{k (V_{DD} - V_t)^2}$$
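The three formulas above can be combined to see the trade-off numerically: a lower VDD reduces the energy of the task but lengthens its execution time. The sketch below is illustrative only; the values of k, Vt, C, and NSW are hypothetical.

```python
def frequency(v_dd, v_t, k):
    """Operating frequency f = k * (VDD - Vt)^2 / VDD."""
    return k * (v_dd - v_t) ** 2 / v_dd


def execution_time(n_cy, v_dd, v_t, k):
    """t_exe = N_CY / f."""
    return n_cy / frequency(v_dd, v_t, k)


def task_energy(c, v_dd, n_cy, n_sw):
    """Switching energy E = 1/2 * C * VDD^2 * N_CY * N_SW."""
    return 0.5 * c * v_dd ** 2 * n_cy * n_sw


# Hypothetical circuit constants and task size.
K, V_T, C, N_SW, N_CY = 1e8, 0.5, 1e-9, 0.5, 1.125e8
for v_dd in (2.0, 1.0):
    print(v_dd, execution_time(N_CY, v_dd, V_T, K),
          task_energy(C, v_dd, N_CY, N_SW))
```

With these numbers, going from 2 V to 1 V cuts the energy to a quarter, while the execution time grows from 1 s to 4.5 s.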
Real-Time Scheduling with Dynamic Voltage Scaling (contd)

The scheduling problem:
Which task to execute at a certain moment on a certain processor so
that time constraints are fulfilled?
The scheduling problem with voltage scaling:

Which task to execute at a certain moment on a certain processor, and
at which voltage level, so that time constraints are fulfilled and energy
consumption is minimised?
The problem: reducing supply voltage extends execution time!
Variable Voltage Processors
Several supply voltage levels are available.
Supply voltage can be set by the application (operating system)
through execution of particular instructions.
Frequency is automatically adjusted to the current supply voltage.
Several processors with variable voltage levels are already
available. There will be more and more in the near future.
The Basic Principle

We consider a single task:
- total computation: 10^9 execution cycles.
- deadline: 25 seconds.
- processor nominal (maximum) voltage: 5 V.
- energy: 40 nJ/cycle at nominal voltage.
- processor speed: 50 MHz (50×10^6 cycles/sec) at nominal voltage.
[Figure: V² versus time (sec). The 10^9 cycles run at 5 V: texe = 20 sec,
Etotal = 40 J, leaving slack between 20 sec and the deadline at 25 sec.]
The Basic Principle (contd)

Let's make it slower!
VDD = 2.5 V
- energy: 40 × 2.5²/5² = 10 nJ/cycle.
- speed: 50 × 2.5/5 = 25 MHz.
[Figure: V² versus time (sec). 750×10^6 cycles run at 5 V, followed by
250×10^6 cycles at 2.5 V: texe = 25 sec, Etotal = 32.5 J.]

VDD = 4 V
- energy: 40 × 4²/5² ≈ 25 nJ/cycle (25.6 before rounding).
- speed: 50 × 4/5 = 40 MHz.
[Figure: V² versus time (sec). All 10^9 cycles run at 4 V:
texe = 25 sec, Etotal = 25 J.]
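The three schedules of this example can be recomputed from the nominal figures, using the scaling rules applied on the slides: energy per cycle proportional to V², speed proportional to V. The helper names below are assumptions. Note that the exact quadratic scaling gives 25.6 nJ/cycle at 4 V, which the slides round to 25, so the computed total for the last schedule is 25.6 J rather than 25 J.

```python
NOMINAL_V = 5.0            # volts
NOMINAL_ENERGY = 40e-9     # joules per cycle at 5 V
NOMINAL_SPEED = 50e6       # cycles per second at 5 V


def energy_per_cycle(v):
    """Energy per cycle, scaled quadratically with the supply voltage."""
    return NOMINAL_ENERGY * (v / NOMINAL_V) ** 2


def speed(v):
    """Clock speed, scaled linearly with the supply voltage (as on the slides)."""
    return NOMINAL_SPEED * v / NOMINAL_V


def run(segments):
    """segments: list of (cycles, voltage). Returns (time in s, energy in J)."""
    time = sum(cycles / speed(v) for cycles, v in segments)
    energy = sum(cycles * energy_per_cycle(v) for cycles, v in segments)
    return time, energy


print(run([(1e9, 5.0)]))                    # full speed: about (20 s, 40 J)
print(run([(750e6, 5.0), (250e6, 2.5)]))    # two levels: about (25 s, 32.5 J)
print(run([(1e9, 4.0)]))                    # single 4 V: about (25 s, 25.6 J)
```

Both slowed-down schedules use the whole 25 seconds, but the single-voltage one consumes less energy, which is the observation stated next.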

If a processor uses a single supply voltage and completes a
program just on deadline, the energy consumption is minimised.
Consider two tasks, 1 and 2:
Computation
- task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.
Deadline: 25 seconds.
Processor nominal (maximum) voltage: 5 V.
Energy:
- 40 nJ/cycle at nominal voltage.
- 25 nJ/cycle at VDD = 4 V.
Processor speed:
- 50 MHz (50×10^6 cycles/sec) at nominal voltage.
- 40 MHz at VDD = 4 V.

Find the voltage at which the tasks just meet their deadline → you
have minimised energy consumption!
[Figure: V² versus time (sec). Task 1 (250×10^6 cycles) and task 2
(750×10^6 cycles) both run at 4 V and finish at the 25 sec deadline:
Etotal = 25 J.]
Considering Task Particularities

Energy consumed by a task:
$$E = \frac{1}{2} C V_{DD}^2 N_{CY} N_{SW}$$
NSW = number of gate transitions per clock cycle.
C = switched capacitance per clock cycle.
Average energy consumed by a task per cycle:
$$E_{CY} = \frac{1}{2} C V_{DD}^2 N_{SW}$$
Often tasks differ from each other in terms of executed operations
→ NSW and C differ from one task to the other.
The average energy consumed per cycle differs from task to task.
Considering Task Particularities (contd)

Consider two tasks, 1 and 2:
Computation
- task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.
Deadline: 25 seconds.
Processor nominal (maximum) voltage: 5 V.
Processor speed:
- 50 MHz (50×10^6 cycles/sec) at nominal voltage.
- 40 MHz at VDD = 4 V.
- 25 MHz at VDD = 2.5 V.
Energy of task 1:
- 12.5 nJ/cycle at VDD = 2.5 V.
Energy of task 2:
- 12.5 nJ/cycle at VDD = 5 V.
- 3 nJ/cycle at VDD = 2.5 V.

Here we have a solution with VDD = 4 V, and the deadline just fulfilled:
Etotal = 32 nJ/cycle × 250×10^6 cycles + 8 nJ/cycle × 750×10^6 cycles = 14 J
[Figure: V² versus time (sec). Task 1 (250×10^6 cycles) and task 2
(750×10^6 cycles) both run at 4 V and finish at the 25 sec deadline.]

Here we run task 1 at VDD = 2.5 V, and task 2 at VDD = 5 V; the tasks finish
just on deadline.
Etotal = 12.5 nJ/cycle × 250×10^6 cycles + 12.5 nJ/cycle × 750×10^6 cycles = 12.5 J
[Figure: V² versus time (sec). Task 1 (250×10^6 cycles) runs at 2.5 V and
task 2 (750×10^6 cycles) at 5 V; together they finish at the 25 sec deadline.]
If energy consumption per cycle is not constant (but differs from task
to task), the rule on slide 33 is not true any more.
Voltage levels have to be reduced with priority for those tasks which
have a larger energy consumption per cycle.
One particular voltage level has to be established for each task, so
that deadlines are just satisfied.
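The two schedules of this example can be compared directly from the per-cycle energies and clock speeds given on the slides. A minimal sketch (the function name and the tuple layout are assumptions):

```python
def schedule_cost(plan):
    """plan: list of (cycles, energy per cycle in J, frequency in Hz).

    Returns (total time in s, total energy in J).
    """
    time = sum(cycles / freq for cycles, _, freq in plan)
    energy = sum(cycles * e_cy for cycles, e_cy, _ in plan)
    return time, energy


# Both tasks at 4 V (40 MHz): 32 nJ/cycle for task 1, 8 nJ/cycle for task 2.
single_voltage = [(250e6, 32e-9, 40e6), (750e6, 8e-9, 40e6)]
# Task 1 at 2.5 V (25 MHz), task 2 at 5 V (50 MHz): 12.5 nJ/cycle each.
per_task_voltage = [(250e6, 12.5e-9, 25e6), (750e6, 12.5e-9, 50e6)]

print(schedule_cost(single_voltage))     # about (25 s, 14 J)
print(schedule_cost(per_task_voltage))   # about (25 s, 12.5 J)
```

Both plans finish exactly at the 25 second deadline, but lowering the voltage of the task with the higher energy per cycle (task 1) saves 1.5 J compared with the single-voltage plan.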
Discrete Voltage Levels

Practical microprocessors can work only at a finite number of discrete
voltage levels.
The ideal voltage Videal determined for a certain task may not be
among the available levels.
A task is supposed to run for time texe at the voltage Videal.
On the particular processor, the two closest available neighbours of
Videal are V1 and V2, with V1 < Videal < V2.
You have minimised the energy if you run the task for time t1 at
voltage V1 and for t2 at voltage V2, so that t1 + t2 = texe.
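The split between the two neighbouring levels follows from two conditions: t1 + t2 = texe, and the cycles executed at the two speeds must add up to the task's cycle count. The second condition is an assumption made explicit here; the slides state only the first. The frequencies f1 and f2 are those the processor reaches at V1 and V2.

```python
def split_between_levels(n_cycles, t_exe, f1, f2):
    """Solve t1 + t2 = t_exe and f1*t1 + f2*t2 = n_cycles, with f1 < f2.

    Returns (t1, t2): the time to spend at the lower and the higher level.
    """
    if not f1 < f2:
        raise ValueError("f1 must be the lower frequency")
    t2 = (n_cycles - f1 * t_exe) / (f2 - f1)
    t1 = t_exe - t2
    if t1 < 0 or t2 < 0:
        raise ValueError("the ideal speed is not between the two levels")
    return t1, t2


# Numbers from the single-task example: 10^9 cycles in 25 s, with only
# 2.5 V (25 MHz) and 5 V (50 MHz) available.
print(split_between_levels(1e9, 25.0, 25e6, 50e6))   # (10.0, 15.0)
```

This reproduces the two-level schedule shown earlier: 10 seconds at 2.5 V (250×10^6 cycles) and 15 seconds at 5 V (750×10^6 cycles).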
Scheduling Policies
The techniques described here, in order to find optimal voltage
levels for real-time tasks, can be applied both with:
- Static cyclic scheduling
- Priority-based scheduling
The Pitfalls with Ignoring Leakage

$$E = N_C C_{eff} V_{dd}^2 + L_g \left( V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_{ju} \right) t$$
So far: minimise the first (dynamic) term and ignore the rest!
- The dynamic term decreases with Vdd regardless of the increased time.
- The leakage term decreases with Vdd, but grows with time!
1. We don't optimize global energy but only a part of it!
2. We can even get it very wrong and increase energy
consumption!
[Figure: energy per cycle (0 to 8e-10) versus Vdd (0.5 to 1) for a 70nm
technology, Crusoe processor (Jejurikar et al., DAC'04). Curves: dynamic
energy, leakage energy, and dynamic + leakage. The total curve has a
critical point: if you go beyond this point with Vdd, energy grows.]
Summary
Power consumption has become a central issue in embedded
systems design.
Power/energy consumption can be reduced by reducing supply
voltage, switching activity, switched capacitance, number of
executed cycles.
There are means at all levels of the design to reduce power
consumption: circuit, logic, behavioral, architecture, system level.
At system level we distinguish dynamic techniques (applied during
run-time) and static techniques (applied at design time).
Summary (contd)
Dynamic power management is implemented by the operating
system, and is mainly used in portable appliances to shut down or
place in stand-by unused devices.
Typical policies for power management are: time-out, predictive,
and stochastic.
Both at task mapping and at scheduling, design decisions can be
made that have a huge impact on power/energy consumption.
Real-time scheduling in the context of processors with voltage
scaling is extremely interesting. The main trade-off is voltage level
vs. execution time. One has to find the optimal voltage levels such
that energy consumption is reduced and deadlines are still
fulfilled.
